ML in the Wild: Spam Detection with Python

Installing the Required Packages

You can install these directly in the notebook, but it is advised to run

pip3 install "packageName"

in a terminal for EACH of the packages shown below (excluding the %%bash line).

In [ ]:
%%bash
# You must install the following Python libraries to run this notebook.
pip3 install pandas
pip3 install matplotlib
pip3 install seaborn
# the package is published as scikit-learn; "pip3 install sklearn" is a deprecated alias
pip3 install scikit-learn
pip3 install nltk
pip3 install spacy
# newer spaCy versions require the full model name ("en" was an older shortcut)
python3 -m spacy download en_core_web_sm
pip3 install wordcloud

Downloading the Data

In [12]:
%%bash
wget http://www.gabechurch.com/assets/data_files/sms-spam-collection-dataset.zip
unzip sms-spam-collection-dataset.zip
--2019-09-17 15:28:45--  http://www.gabechurch.com/assets/data_files/sms-spam-collection-dataset.zip
Resolving www.gabechurch.com (www.gabechurch.com)... 136.35.11.153
Connecting to www.gabechurch.com (www.gabechurch.com)|136.35.11.153|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 212825 (208K) [application/zip]
Saving to: ‘sms-spam-collection-dataset.zip’

2019-09-17 15:28:57 (8.50 MB/s) - ‘sms-spam-collection-dataset.zip’ saved [212825/212825]

Archive:  sms-spam-collection-dataset.zip
  inflating: spam.csv
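
If wget and unzip are unavailable (for example on Windows), the same download can be done in pure Python with the standard library. A minimal sketch, using the URL from the cell above:

In [ ]:
import urllib.request
import zipfile

url = "http://www.gabechurch.com/assets/data_files/sms-spam-collection-dataset.zip"
urllib.request.urlretrieve(url, "sms-spam-collection-dataset.zip")

# extract spam.csv into the current directory
with zipfile.ZipFile("sms-spam-collection-dataset.zip") as zf:
    zf.extractall()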

Imports

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Warnings
import warnings
warnings.filterwarnings('ignore')

# Styles
plt.style.use('ggplot')
sns.set_style('whitegrid')

plt.rcParams['font.family'] = 'serif'
plt.rcParams['font.serif'] = 'Ubuntu'
plt.rcParams['font.monospace'] = 'Ubuntu Mono'
plt.rcParams['font.size'] = 10
plt.rcParams['axes.labelsize'] = 10
plt.rcParams['xtick.labelsize'] = 8
plt.rcParams['ytick.labelsize'] = 8
plt.rcParams['legend.fontsize'] = 10
plt.rcParams['figure.titlesize'] = 12
plt.rcParams['patch.force_edgecolor'] = True

# Text Preprocessing
import nltk
# nltk.download("all")
nltk.download('punkt')
nltk.download('stopwords')

from nltk.corpus import stopwords
import string
from nltk.tokenize import word_tokenize
from nltk import sent_tokenize

import spacy
nlp = spacy.load("en_core_web_sm")  # the "en" shortcut was removed in newer spaCy versions
[nltk_data] Downloading package punkt to /home/gchurch/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/gchurch/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

Loading the data

In [111]:
# Replace spam.csv with the path to your data if you downloaded it another way
messages = pd.read_csv("spam.csv", encoding = 'latin-1')

# Drop the extra columns and rename columns
messages = messages.drop(labels = ["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis = 1)
messages.columns = ["category", "text"]
In [21]:
display(messages.head(n = 10))
category text
0 ham Go until jurong point, crazy.. Available only ...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup fina...
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives aro...
5 spam FreeMsg Hey there darling it's been 3 week's n...
6 ham Even my brother is not like to speak with me. ...
7 ham As per your request 'Melle Melle (Oru Minnamin...
8 spam WINNER!! As a valued network customer you have...
9 spam Had your mobile 11 months or more? U R entitle...
In [22]:
# Let's look at the dataset info to see if everything is alright

messages.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
category    5572 non-null object
text        5572 non-null object
dtypes: object(2)
memory usage: 87.1+ KB
In [23]:
# Let's see what percentage of our data is spam/ham
messages["category"].value_counts().plot(kind = 'pie', explode = [0, 0.1], figsize = (6, 6), autopct = '%1.1f%%', shadow = True)
plt.ylabel("Spam vs Ham")
plt.legend(["Ham", "Spam"])
plt.show()
[pie chart: ham vs. spam share of the dataset]

A lot of messages are actually not spam. About 86% of our dataset consists of normal messages.

When we split our data into train and test sets, or when we use cross-validation, we will have to use stratified sampling; otherwise our training model risks being skewed towards normal messages. If the sample we train on consists mostly of normal messages, the model may end up predicting everything as ham, and we might never notice, because most incoming messages really are ham, so the model would still show a pretty good accuracy.
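
As a minimal illustration with made-up labels, stratify preserves the ham/spam ratio in both splits, so the model always sees some spam during training:

In [ ]:
from sklearn.model_selection import train_test_split

# 86% ham / 14% spam, mirroring the dataset's imbalance (made-up labels)
labels = ["ham"] * 86 + ["spam"] * 14
train, test = train_test_split(labels, test_size=0.2, stratify=labels)
print(test.count("spam") / len(test))  # ~0.15, the same ratio as the full set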

A very basic model would be one that predicts everything as ham. It would have a decent accuracy, but is that right? No. We therefore need an accuracy metric that takes the class imbalance into account. Goal: we don't mind missing the odd spam message, but we surely don't want to mark a ham message as spam, i.e., precision is very important. Hence we will use the F-beta score as our accuracy metric, with an inclination towards precision.
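
As a quick sketch with made-up predictions: with beta = 0.5 the F-beta score weights precision more heavily than recall, so a false positive (ham flagged as spam) hurts the score more than a missed spam message does.

In [ ]:
from sklearn.metrics import fbeta_score

y_true = [0, 0, 0, 0, 1, 1]  # 0 = ham, 1 = spam (made-up labels)
y_miss = [0, 0, 0, 0, 1, 0]  # misses one spam message
y_fp   = [0, 0, 0, 1, 1, 1]  # flags one ham message as spam
print(fbeta_score(y_true, y_miss, beta=0.5))  # ~0.83: milder penalty
print(fbeta_score(y_true, y_fp, beta=0.5))    # ~0.71: harsher penalty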

In [25]:
# Let's see the top spam/ham messages

topMessages = messages.groupby("text")["category"].agg([len, np.max]).sort_values(by = "len", ascending = False).head(n = 10)
display(topMessages)
len amax
text
Sorry, I'll call later 30 ham
I cant pick the phone right now. Pls send a message 12 ham
Ok... 10 ham
Your opinion about me? 1. Over 2. Jada 3. Kusruthi 4. Lovable 5. Silent 6. Spl character 7. Not matured 8. Stylish 9. Simple Pls reply.. 4 ham
Wen ur lovable bcums angry wid u, dnt take it seriously.. Coz being angry is d most childish n true way of showing deep affection, care n luv!.. kettoda manda... Have nice day da. 4 ham
Please call our customer service representative on FREEPHONE 0808 145 4742 between 9am-11pm as you have WON a guaranteed å£1000 cash or å£5000 prize! 4 spam
Okie 4 ham
Say this slowly.? GOD,I LOVE YOU &amp; I NEED YOU,CLEAN MY HEART WITH YOUR BLOOD.Send this to Ten special people &amp; u c miracle tomorrow, do it,pls,pls do it... 4 ham
7 wonders in My WORLD 7th You 6th Ur style 5th Ur smile 4th Ur Personality 3rd Ur Nature 2nd Ur SMS and 1st \Ur Lovely Friendship\"... good morning dear" 4 ham
Ok. 4 ham
So. People are really busy it seems. "Sorry, I'll call later" tops the ham message list with 30 counts, and "I cant pick the phone right now. Pls send a message" comes second with 12 counts.

There's quite a lot of Ok..., Okie and Ok. in there too :>

Let's study individual spam/ham words
In [36]:
spam_messages = messages[messages["category"] == "spam"]["text"]
ham_messages = messages[messages["category"] == "ham"]["text"]

spam_words = []
ham_words = []

# Since this is just classifying the message as spam or ham, we can use isalpha().
# This will also remove the "not" in contractions like can't.
# In a sentiment-analysis setting, it is better to use
# sentence.translate(str.maketrans("", "", chars_to_remove))

stop_words = set(stopwords.words("english"))  # build the set once for speed

def extractSpamWords(spamMessage):
    global spam_words
    words = [word.lower() for word in word_tokenize(spamMessage) if word.lower() not in stop_words and word.lower().isalpha()]
    spam_words = spam_words + words

def extractHamWords(hamMessage):
    global ham_words
    words = [word.lower() for word in word_tokenize(hamMessage) if word.lower() not in stop_words and word.lower().isalpha()]
    ham_words = ham_words + words

spam_messages.apply(extractSpamWords)
ham_messages.apply(extractHamWords)
Out[36]:
0       None
1       None
3       None
4       None
        ... 
5570    None
5571    None
Name: text, Length: 4825, dtype: object
In [39]:
from wordcloud import WordCloud
#Spam Word cloud

spam_wordcloud = WordCloud(width=600, height=400).generate(" ".join(spam_words))
plt.figure( figsize=(10,8), facecolor='k')
plt.imshow(spam_wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()
In [40]:
#Ham word cloud

ham_wordcloud = WordCloud(width=600, height=400).generate(" ".join(ham_words))
plt.figure( figsize=(10,8), facecolor='k')
plt.imshow(ham_wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()
In [41]:
# Top 10 spam words

spam_words = np.array(spam_words)
print("Top 10 Spam words are :\n")
pd.Series(spam_words).value_counts().head(n = 10)
Top 10 Spam words are :

Out[41]:
call      346
free      217
txt       156
ur        144
u         144
mobile    123
text      121
stop      114
claim     113
reply     104
dtype: int64
In [42]:
# Top 10 Ham words

ham_words = np.array(ham_words)
print("Top 10 Ham words are :\n")
pd.Series(ham_words).value_counts().head(n = 10)
Top 10 Ham words are :

Out[42]:
u       974
gt      318
lt      316
get     301
ok      246
go      246
got     242
ur      237
know    234
like    231
dtype: int64
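
The gt and lt tokens above look like leftovers of &gt;/&lt; HTML escapes in the raw data (they also surface later as ltgt in the cleaned text). If you want to handle them, the standard library can unescape such sequences before tokenizing; a small sketch on a hypothetical message:

In [ ]:
import html

sample = "call me at &lt;#&gt; tonight"  # hypothetical raw message
print(html.unescape(sample))             # call me at <#> tonight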

Does the length of the message tell us anything?

In [44]:
messages["messageLength"] = messages["text"].apply(len)
messages["messageLength"].describe()
Out[44]:
count    5572.000000
mean       80.118808
std        59.690841
min         2.000000
25%        36.000000
50%        61.000000
75%       121.000000
max       910.000000
Name: messageLength, dtype: float64
In [45]:
f, ax = plt.subplots(1, 2, figsize = (20, 6))

sns.distplot(messages[messages["category"] == "spam"]["messageLength"], bins = 20, ax = ax[0])
ax[0].set_xlabel("Spam Message Length (characters)")

sns.distplot(messages[messages["category"] == "ham"]["messageLength"], bins = 20, ax = ax[1])
ax[1].set_xlabel("Ham Message Length (characters)")  # was ax[0], which overwrote the label above

plt.show()

Looks like spam messages are usually longer. Maybe messageLength can become a feature for predicting whether a message is spam or ham? (We try exactly that in Method 2 below.)

Method 1)

TEXT TRANSFORMATION

Let's clean our data by removing punctuation and stopwords, and by stemming the remaining words.

In [46]:
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer("english")

def cleanText(message):
    
    message = message.translate(str.maketrans('', '', string.punctuation))
    words = [stemmer.stem(word) for word in message.split() if word.lower() not in stopwords.words("english")]
    
    return " ".join(words)

messages["text"] = messages["text"].apply(cleanText)
messages.head(n = 10)  
Out[46]:
category text messageLength
0 ham go jurong point crazi avail bugi n great world... 111
1 ham ok lar joke wif u oni 29
2 spam free entri 2 wkli comp win fa cup final tkts 2... 155
3 ham u dun say earli hor u c alreadi say 49
4 ham nah dont think goe usf live around though 61
5 spam freemsg hey darl 3 week word back id like fun ... 148
6 ham even brother like speak treat like aid patent 77
7 ham per request mell mell oru minnaminungint nurun... 160
8 spam winner valu network custom select receivea å£9... 158
9 spam mobil 11 month u r entitl updat latest colour ... 154

Let's convert our clean text into a representation that a machine learning model can understand. I'll use TF-IDF for this.

In [47]:
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer(encoding = "latin-1", strip_accents = "unicode", stop_words = "english")
features = vec.fit_transform(messages["text"])
print(features.shape)
(5572, 7903)
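
To sanity-check the vectorizer, you can peek at a few of the learned vocabulary terms (a quick sketch; vocabulary_ maps each term to its column index):

In [ ]:
# a few of the 7903 learned terms, in alphabetical order
print(sorted(vec.vocabulary_)[:10])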

MODEL APPLICATION

In [48]:
def encodeCategory(cat):
    if cat == "spam":
        return 1
    else:
        return 0
        
messages["category"] = messages["category"].apply(encodeCategory)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, messages["category"], stratify = messages["category"], test_size = 0.2)
In [49]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import fbeta_score

from sklearn.naive_bayes import MultinomialNB
naiveBayes = MultinomialNB()  # multinomial NB suits word-count/tf-idf features
naiveBayes.fit(X_train, y_train)

y_pred = naiveBayes.predict(X_test)

print(fbeta_score(y_test, y_pred, beta = 0.5))
0.9380234505862646
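
To see where the remaining errors fall, a confusion matrix separates false positives (ham flagged as spam) from missed spam; a quick sketch:

In [ ]:
from sklearn.metrics import confusion_matrix

# rows = actual (ham, spam); columns = predicted (ham, spam)
print(confusion_matrix(y_test, y_pred))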

TODO: Use more models / include a grid search to find the most optimal model / use lemmas? / use NER? etc. (A minimal grid-search sketch follows below.)
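
A minimal sketch of that grid search, tuning MultinomialNB's smoothing parameter against the same precision-weighted F-beta score (the parameter grid here is an assumption, not a tuned choice):

In [ ]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, fbeta_score
from sklearn.naive_bayes import MultinomialNB

f05_scorer = make_scorer(fbeta_score, beta=0.5)
grid = GridSearchCV(MultinomialNB(),
                    param_grid={"alpha": [0.01, 0.1, 0.5, 1.0]},
                    scoring=f05_scorer, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)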

In [50]:
y_pred
Out[50]:
array([0, 0, 0, ..., 0, 0, 0])

Method 2)

In [75]:
message_data = messages.rename(columns = {'category':'Spam/Not_Spam','text':'message'})
In [89]:
# Note: without reassignment or inplace=True, drop only displays the result;
# message_data itself still has the messageLength column (see Out[90] below).
message_data.drop(columns=["messageLength"])
Out[89]:
Spam/Not_Spam message
0 0 go jurong point crazi avail bugi n great world...
1 0 ok lar joke wif u oni
2 1 free entri 2 wkli comp win fa cup final tkts 2...
3 0 u dun say earli hor u c alreadi say
4 0 nah dont think goe usf live around though
5 1 freemsg hey darl 3 week word back id like fun ...
6 0 even brother like speak treat like aid patent
7 0 per request mell mell oru minnaminungint nurun...
8 1 winner valu network custom select receivea å£9...
9 1 mobil 11 month u r entitl updat latest colour ...
10 0 im gonna home soon dont want talk stuff anymor...
11 1 six chanc win cash 100 20000 pound txt csh11 s...
12 1 urgent 1 week free membership å£100000 prize j...
13 0 ive search right word thank breather promis wo...
14 0 date sunday
15 1 xxxmobilemovieclub use credit click wap link n...
16 0 oh kim watch
17 0 eh u rememb 2 spell name yes v naughti make v wet
18 0 fine thatåõ way u feel thatåõ way gota b
19 1 england v macedonia dont miss goalsteam news t...
20 0 serious spell name
21 0 i‰û÷m go tri 2 month ha ha joke
22 0 ì pay first lar da stock comin
23 0 aft finish lunch go str lor ard 3 smth lor u f...
24 0 ffffffffff alright way meet sooner
25 0 forc eat slice im realli hungri tho suck mark ...
26 0 lol alway convinc
27 0 catch bus fri egg make tea eat mom left dinner...
28 0 im back amp pack car ill let know there room
29 0 ahhh work vagu rememb feel like lol
... ... ...
5542 0 armand say get ass epsilon
5543 0 u still havent got urself jacket ah
5544 0 im take derek amp taylor walmart im back time ...
5545 0 hi durban still number
5546 0 ic lotta childporn car
5547 1 contract mobil 11 mnths latest motorola nokia ...
5548 0 tri weekend v
5549 0 know wot peopl wear shirt jumper hat belt know...
5550 0 cool time think get
5551 0 wen get spiritu deep that great
5552 0 safe trip nigeria wish happi soon compani shar...
5553 0 hahahaus brain dear
5554 0 well keep mind ive got enough gas one round tr...
5555 0 yeh indian nice tho kane bit shud go 4 drink s...
5556 0 yes that u text pshewmiss much
5557 0 meant calcul ltgt unit ltgt school realli expe...
5558 0 sorri ill call later
5559 0 arent next ltgt hour imma flip shit
5560 0 anyth lor juz us lor
5561 0 get dump heap mom decid come low bore
5562 0 ok lor soni ericsson salesman ask shuhui say q...
5563 0 ard 6 like dat lor
5564 0 dont wait til least wednesday see get
5565 0 huh lei
5566 1 remind o2 get 250 pound free call credit detai...
5567 1 2nd time tri 2 contact u u å£750 pound prize 2...
5568 0 ì b go esplanad fr home
5569 0 piti mood soani suggest
5570 0 guy bitch act like id interest buy someth els ...
5571 0 rofl true name

5572 rows × 2 columns

In [90]:
message_data.groupby('Spam/Not_Spam').describe()
Out[90]:
              messageLength
              count   mean        std        min   25%    50%    75%    max
Spam/Not_Spam
0             4825.0  71.023627   58.016023  2.0   33.0   52.0   92.0   910.0
1             747.0   138.866131  29.183082  13.0  132.5  149.0  157.0  224.0
In [91]:
message_data_copy = message_data['message'].copy()
In [92]:
def text_preprocess(text):
    # the text was already cleaned (and stemmed) in Method 1, so this is
    # largely a no-op here, but it makes Method 2 self-contained for raw input
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = [word for word in text.split() if word.lower() not in  stopwords.words('english')]
    return " ".join(text)
In [93]:
message_data_copy = message_data_copy.apply(text_preprocess)
In [94]:
message_data_copy
Out[94]:
0       go jurong point crazi avail bugi n great world...
1                                   ok lar joke wif u oni
2       free entri 2 wkli comp win fa cup final tkts 2...
3                     u dun say earli hor u c alreadi say
4               nah dont think goe usf live around though
5       freemsg hey darl 3 week word back id like fun ...
6           even brother like speak treat like aid patent
7       per request mell mell oru minnaminungint nurun...
8       winner valu network custom select receivea å£9...
9       mobil 11 month u r entitl updat latest colour ...
10      im gonna home soon dont want talk stuff anymor...
11      six chanc win cash 100 20000 pound txt csh11 s...
12      urgent 1 week free membership å£100000 prize j...
13      ive search right word thank breather promis wo...
14                                            date sunday
15      xxxmobilemovieclub use credit click wap link n...
16                                           oh kim watch
17      eh u rememb 2 spell name yes v naughti make v wet
18               fine thatåõ way u feel thatåõ way gota b
19      england v macedonia dont miss goalsteam news t...
20                                     serious spell name
21                        i‰û÷m go tri 2 month ha ha joke
22                         ì pay first lar da stock comin
23      aft finish lunch go str lor ard 3 smth lor u f...
24                     ffffffffff alright way meet sooner
25      forc eat slice im realli hungri tho suck mark ...
26                                      lol alway convinc
27      catch bus fri egg make tea eat mom left dinner...
28                 im back amp pack car ill let know room
29                    ahhh work vagu rememb feel like lol
                              ...                        
5542                           armand say get ass epsilon
5543                  u still havent got urself jacket ah
5544    im take derek amp taylor walmart im back time ...
5545                               hi durban still number
5546                               ic lotta childporn car
5547    contract mobil 11 mnths latest motorola nokia ...
5548                                        tri weekend v
5549    know wot peopl wear shirt jumper hat belt know...
5550                                  cool time think get
5551                           wen get spiritu deep great
5552    safe trip nigeria wish happi soon compani shar...
5553                                  hahahaus brain dear
5554    well keep mind ive got enough gas one round tr...
5555    yeh indian nice tho kane bit shud go 4 drink s...
5556                            yes u text pshewmiss much
5557    meant calcul ltgt unit ltgt school realli expe...
5558                                 sorri ill call later
5559                  arent next ltgt hour imma flip shit
5560                                 anyth lor juz us lor
5561                get dump heap mom decid come low bore
5562    ok lor soni ericsson salesman ask shuhui say q...
5563                                   ard 6 like dat lor
5564                dont wait til least wednesday see get
5565                                              huh lei
5566    remind o2 get 250 pound free call credit detai...
5567    2nd time tri 2 contact u u å£750 pound prize 2...
5568                              ì b go esplanad fr home
5569                              piti mood soani suggest
5570    guy bitch act like id interest buy someth els ...
5571                                       rofl true name
Name: message, Length: 5572, dtype: object
In [98]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Note: TfidfVectorizer("english") would set the `input` argument, not
# stop_words; stop words were already removed in text_preprocess, so the
# default vectorizer is what we want here.
vectorizer = TfidfVectorizer()
message_mat = vectorizer.fit_transform(message_data_copy)
message_mat
Out[98]:
<5572x8020 sparse matrix of type '<class 'numpy.float64'>'
	with 46681 stored elements in Compressed Sparse Row format>
In [99]:
from sklearn.model_selection import train_test_split

message_train, message_test, spam_nospam_train, spam_nospam_test = train_test_split(message_mat, 
                                                                                    message_data['Spam/Not_Spam'], 
                                                                                    test_size=0.3, 
                                                                                    random_state=20)
In [100]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Spam_model = LogisticRegression(solver='liblinear', penalty='l1')
Spam_model.fit(message_train, spam_nospam_train)
pred = Spam_model.predict(message_test)
accuracy_score(spam_nospam_test,pred)
Out[100]:
0.9467703349282297

Let's try stemming the messages, and then adding message length as a feature.

In [101]:
def stemmer(text):
    snow = SnowballStemmer("english")  # create the stemmer once, not once per word
    words = ""
    for i in text.split():
        words += snow.stem(i) + " "
    return words
In [102]:
message_data_copy = message_data_copy.apply(stemmer)
vectorizer = TfidfVectorizer()  # see the note on the "english" argument above
message_mat = vectorizer.fit_transform(message_data_copy)
In [104]:
message_train, message_test, spam_nospam_train, spam_nospam_test = train_test_split(message_mat, 
                                                        message_data['Spam/Not_Spam'], test_size=0.3, random_state=20)
In [105]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Spam_model = LogisticRegression(solver='liblinear', penalty='l1')
Spam_model.fit(message_train, spam_nospam_train)
pred = Spam_model.predict(message_test)
accuracy_score(spam_nospam_test,pred)
Out[105]:
0.9467703349282297

The accuracy score is unchanged; the messages were already stemmed back in Method 1, so stemming again has no effect. Let's try adding message length as a feature.

In [107]:
message_data['length'] = message_data['message'].apply(len)
message_data.head()
Out[107]:
Spam/Not_Spam message messageLength length
0 0 go jurong point crazi avail bugi n great world... 111 76
1 0 ok lar joke wif u oni 29 21
2 1 free entri 2 wkli comp win fa cup final tkts 2... 155 132
3 0 u dun say earli hor u c alreadi say 49 35
4 0 nah dont think goe usf live around though 61 41
In [108]:
length = message_data['length'].to_numpy()  # .as_matrix() was removed in newer pandas
new_mat = np.hstack((message_mat.todense(),length[:, None]))
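
Calling .todense() materializes every entry of the 5572 x 8000+ matrix in memory; a sparse alternative (a sketch producing the same stacked features as a sparse matrix):

In [ ]:
from scipy.sparse import hstack, csr_matrix

# stack the tf-idf matrix and the length column without densifying
new_mat_sparse = hstack([message_mat, csr_matrix(length[:, None])])
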
In [109]:
message_train, message_test, spam_nospam_train, spam_nospam_test = train_test_split(new_mat, 
                                                        message_data['Spam/Not_Spam'], test_size=0.3, random_state=20)
In [110]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Spam_model = LogisticRegression(solver='liblinear', penalty='l1')
Spam_model.fit(message_train, spam_nospam_train)
pred = Spam_model.predict(message_test)
accuracy_score(spam_nospam_test,pred)
Out[110]:
0.9485645933014354

Your Challenge

  1. Download 150 of your own emails (non-personal emails).

Follow this guide to export emails from Outlook; I recommend creating a separate folder and quickly adding those emails to it. https://helpdeskgeek.com/office-tips/how-to-export-your-e-mails-from-outlook-to-csv-pst/

If you can't do this, find another dataset containing emails or text with a classifier.

  2. Open the export in Excel and keep only the text column (delete the date etc.). Add another column ("category") in the first position, with values you can use in the above code (ham or spam), or 0/1 if you modify the code. Save this output as a CSV.

  3. Load the data into your newly created notebook and run it (see the sketch after this list).

  4. Make sure you have ultimately tried splitting the data into test and train sets and made predictions on the results. Is the model accurate? If not, what do you think the problem could be? (Hint: this does not require complete understanding of the underlying methods/code.)
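
A sketch for step 3, assuming your exported file is named my_emails.csv and laid out as described in step 2 (the filename and column order are assumptions; adjust to match your export):

In [ ]:
# load your own export; change the filename/encoding to match your file
my_messages = pd.read_csv("my_emails.csv", encoding="latin-1")
my_messages.columns = ["category", "text"]
display(my_messages.head())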

Additional Info
