Overview
As the internet keeps evolving, protecting ourselves from malicious attacks has become increasingly important. Every day these attacks grow more sophisticated, aiming to trick not only the user but also existing spam protection systems. In particular, most current implementations struggle to detect the hijacking of already “trusted” email addresses. A system capable of analysing and detecting such attacks would be of great use. This project aims to create such a system and to explore how Artificial Intelligence and strict filtering rules can build on advancements in cybersecurity and data science.
What is SEA?
SEA stands for Spam Email Analysis. It can train and test on large amounts of data and can also classify single subject headings and email addresses. Compared to more traditional methods, SEA bases its classification on both the subject heading and the email address, and then passes the result through custom filters that catch smaller details not addressed by the classification models.
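The idea of combining a model score with rule-based filters can be sketched as follows. This is a minimal illustration, not SEA's actual code: the function name `classify_with_filters`, the keyword and extension sets, and the 0.5 threshold are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical rule sets; SEA's real lists are not shown in this README.
SUSPICIOUS_KEYWORDS = {"lottery", "winner", "prize"}
SUSPICIOUS_EXTENSIONS = {".xyz", ".top"}


@dataclass
class Verdict:
    label: str                                   # "spam" or "ham"
    reasons: list = field(default_factory=list)  # why the message was flagged


def classify_with_filters(subject, sender, model_spam_probability, threshold=0.5):
    """Flag a message if the model score or any filter marks it as spam."""
    reasons = []

    # 1. Score from a classifier (Naive Bayes or LSTM), passed in by the caller.
    if model_spam_probability >= threshold:
        reasons.append(f"model score {model_spam_probability:.2f} >= {threshold}")

    # 2. Keyword filter on the subject heading.
    words = {w.strip(".,!?:;").lower() for w in subject.split()}
    hits = sorted(words & SUSPICIOUS_KEYWORDS)
    if hits:
        reasons.append("keywords: " + ", ".join(hits))

    # 3. Domain extension filter on the sender address.
    domain = sender.rsplit("@", 1)[-1].lower()
    if any(domain.endswith(ext) for ext in SUSPICIOUS_EXTENSIONS):
        reasons.append(f"domain extension of {domain}")

    return Verdict("spam" if reasons else "ham", reasons)
```

Returning the list of reasons alongside the label is one way a tool could justify its classification rather than only report it.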
Features
- Naive Bayes and LSTM classification
- Spell checking
- Keyword and Profanity Filtering
- Domain Name and Domain Extension filtering
- Multithreaded Importing
- Low Memory Usage
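As an illustration of the multithreaded importing feature, the sketch below reads several dataset files concurrently with the standard library's thread pool. The function names `read_lines` and `import_datasets` are hypothetical and the approach is an assumption; SEA's own importer may work differently.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def read_lines(path):
    """Read one dataset file and return its non-empty lines."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return [line for line in text.splitlines() if line.strip()]


def import_datasets(paths, max_workers=4):
    """Read several files concurrently; results keep the order of `paths`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(read_lines, paths))
```

Threads suit this step because file reads are I/O-bound, so the workers spend most of their time waiting on the disk rather than competing for the interpreter.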
Requirements
- Python 3.7+
- Linux/macOS/Windows (CLI might not work)
- Multi-core, multi-GPU system recommended
- Kitty terminal emulator (optional GPU-accelerated terminal emulator)
Performance
- Naive Bayes has an 86% accuracy
- LSTM has a 98.92% accuracy
Applying our custom filtering techniques on top of these models drastically increases the overall precision of the software.
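Accuracy and precision measure different things, which is why filtering can raise precision noticeably even when accuracy is already high. The sketch below shows the two formulas; the counts used in the usage note are made-up toy numbers, not SEA's measured results.

```python
def accuracy(tp, fp, tn, fn):
    """Share of all messages that were labelled correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def precision(tp, fp):
    """Share of messages flagged as spam that really were spam."""
    flagged = tp + fp
    return tp / flagged if flagged else 0.0
```

For example, with toy counts of 40 true positives, 10 false positives, 45 true negatives and 5 false negatives, `precision(40, 10)` is 0.8. If a filter turned 8 of those false positives into true negatives, `precision(40, 2)` would rise to roughly 0.95.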
How to Run
Automated installation can be done using the start.sh file.
Step 0: Download Glove Vectors
Due to GitHub's file size limit, the GloVe vectors cannot be included in the repository. Download and unzip them, then move the files into the Prototype/loader folder.
wget http://nlp.stanford.edu/data/glove.6B.zip
unzip glove.6B.zip
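Each unzipped GloVe file is plain text with one word per line followed by its space-separated vector components. A minimal sketch of reading such a file is shown below; the function names are hypothetical, and which of the files in the archive SEA actually loads is not stated here.

```python
def parse_glove_lines(lines):
    """Map each word to its vector, given lines like 'word 0.1 -0.2 ...'."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def load_glove(path):
    """Read a GloVe text file from disk into a word -> vector dictionary."""
    with open(path, encoding="utf-8") as handle:
        return parse_glove_lines(handle)
```

A call such as `load_glove("Prototype/loader/glove.6B.100d.txt")` would load the 100-dimensional vectors, assuming that file was moved into the loader folder.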
Step 1: Install the dependencies with a one-liner
sudo pip3 install inquirer tqdm colorama nltk pandas autocorrect pympler keras tensorflow keras_metrics scikit-learn ann_visualizer pyfiglet textblob
Step 2: Install the NLTK corpora (including punkt)
python3 -m textblob.download_corpora
Step 3: Run SEA
cd Prototype/
python3 -W ignore start.py
Options for Naive Bayes
- Show the top 10 most informative features
- Test a subject heading and email address without filtering
- Test a subject heading and email address with filtering
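To give a sense of what "most informative features" means for a Naive Bayes model, the sketch below ranks tokens by how strongly they point towards spam or ham. It is a standalone, simplified illustration with hypothetical function names and hard-coded "spam"/"ham" labels, not the code SEA runs for this option.

```python
import math
from collections import Counter


def train_counts(samples):
    """Count tokens per label from (subject, label) pairs."""
    counts = {"spam": Counter(), "ham": Counter()}
    for subject, label in samples:
        counts[label].update(subject.lower().split())
    return counts


def most_informative(counts, n=10):
    """Rank tokens by |log(P(token|spam) / P(token|ham))| with add-one smoothing."""
    vocab = set(counts["spam"]) | set(counts["ham"])
    totals = {label: sum(c.values()) + len(vocab) for label, c in counts.items()}
    scored = []
    for token in vocab:
        p_spam = (counts["spam"][token] + 1) / totals["spam"]
        p_ham = (counts["ham"][token] + 1) / totals["ham"]
        scored.append((token, math.log(p_spam / p_ham)))
    # Largest absolute log-ratio first; token name breaks exact ties.
    return sorted(scored, key=lambda item: (-abs(item[1]), item[0]))[:n]
```

A positive score means the token is more typical of spam subjects, a negative score means it is more typical of ham, and a score near zero means it carries little information either way.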
Options for LSTM
- Test a subject heading and email address without filtering
- Test a subject heading and email address with filtering
Notes on Training with Your Own Dataset
Place the files in the dataset folder and change the path in the respective files you wish to train with. There is no guarantee that the program will be able to read your dataset, as some datasets use different delimiters to separate their fields.
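Before training, it can help to check which delimiter a dataset uses. The sketch below relies on the standard library's `csv.Sniffer`; the function name `detect_delimiter` and the candidate list are assumptions for the example and are not part of SEA.

```python
import csv


def detect_delimiter(sample, candidates=",;\t|"):
    """Guess the column delimiter from a text sample; None if undetermined."""
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        return None
```

Passing the first few lines of a file to `detect_delimiter` gives a quick hint about whether the dataset needs converting before the path is changed in the training files.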