building a simple spam filter with naive bayes classifier and c#

Have you also often wondered how a filter works? I have and last month I got a book in my fingers that deals with machine learning. In that book there is a chapter about naive bayer classifier, and that naive bayer classifiers are used to implement spam filtering. All programming examples in that book are in python and I do not speak python, so i calculated the naive bayer classifier myself, wrote a c# program and put it on codeplex (naivebayes.codeplex.com). I made the application only do demonstrate the principle of spam filtering, it is not for practical use. To understand how naive bayes classifiers work the best example is the cookie problem. Lets assume we have two bowls with cookies in it.

Bowl 1: 5 chocolate cookies and 35 vanilla cookies

Bowl 2: 20 chocolate cookies and 20 vanilla cookies

Now we take one cookie (it is a vanilla cookie) and we have 50 percent chance that we take it from bowl 1 or from bowl2. Now we have two hypothesis:

1. It is from bowl 1

H1 = 1/2

2. It is from bowl 2

H2 = 1/2

That means that we have a 50 to 50 chance to pick from bowl 1 or from bowl 2

P(h1) = P(h2) = 1/2

After defining our hypothesis and taking one vanilla cookie, now we have one evidence, and we can calculate the probability that we took it from bowl 1 or bowl 2. So now we can define the hypothesis under the evidence that we took a vanilla cookie.

P(E/h1) = 35/40 (because we took one cookie that is vanilla the chance is 35/40)
P(E/h2) = 20/40 = 1/2 (because we took one cookie that is vanilla but the chance is 1/2)

P(E/h1) is 35/40 because we have 35 vanilla cookies out of 40
P(E/h2) is 1/2 because we have 20 vanilla cookies out of 40
At this point we need bayes theorem:

P(h1/E) = P(h1) * P(E/h1) / P(E)

Bayes theorem says that you can calculate the evidence under the hypothesis one P(h1/E) with multiplying the P(h1) with the calculated evidence under the hypothesis P(E/h1) divided by a base P(E). To calculate P(h1/E) that is the value that tells us how big the chance is that our hypothesis one is correct. What we need to calculate that value is P(E). In this case it is very easy to find. Is is the sum of all vanilla cookies divided by the sum of all cookies.

P(E) = sum(vanilla cookies) / sum(all cookies) -> 55 / 80 

So now we have all values we need for bayes theorem.

P(h1/E) = P(h1) * P(E/h1) / P(E) -> P(h1/E) = 
(1/2 * 35/40) / 55/80 = 0.5 * 0,875 / 0,6875 = 0.6363636363 = 63.6 %

That means after taking one vanilla cookie we can say that the chance that we take cookies from bowl 1 is 63.6 percent.

So that’s it with the cookie problem but how can us help that by implementing a spam filter? We can calculate the possibility if a mail we get is spam or not spam with the same formula, with bayes theorem.
But instead of cookies we use mails we have classified as spam mails or as not spam mails. So if we get a new mail, we take every word and calculate a value Q. If Q is greater than 1 the word comes more likely from a spam mail, if Q is smaller than 1 the word comes more likely not from a spam mail. So the rule will be

Q > 1 = spam
Q < 0 = not spam

Now we will calculate Q for every word w.
Our hypothesis h1 and h2 are:

H1 = HSpam = 1/2
H2 = HNotSpam = 1/2

So we have again have a 50 to 50 chance

P(h1) = P(h2) = 1/2

Now we need some evidence data, in our case we have 10 spam mails and 10 not spam mails. That mails will give us the evidence we need to decide if the word comes from a spam mail or not. So now we take one word we have in the mail we want to check. Lets say the word is mother. We look in the spam mails and we find it one time. Then we look in the not spam mails and we find it five times. With that information we can calculate P(E/h1) and P(E/h2):

P(E/h1) = 1 / 10 (one mail out of ten)
P(E/h2) = 5 / 10 (five mails out of ten)

Now we use bayes theorem to calculate P(h1/E) and P(h2/E).

P(h1/E) = P(h1) * P(E/h1) / P(E)
P(h2/E) = P(h2) * P(E/h2) / P(E)

Normally we have to figure out what P(E) is but in this case P(E) refers only to the actual mail that means it is one and we can remove it. Here is the calculation:

P(h1/E) = 1/2 * 1/10 / 1 = 1/20
P(h2/E) = 1/2 * 5/10 / 1 = 5/20

So we have two values that tell us how big the chance is that a mail is a spam mail or not. (based on the spam and not spam mails we already got)
Now we calculate Q and check if it is greater or smaller that 1.

Q = P(h1/E) / P(h2/E) = 1/20 / 5/20 = 1/5 = 0.2

Q is smaller that one that means that the word mother occurs more often in one of our not spam mails. To check all the text in a mail we only have to calculate Q for every word sum all Q values up and divide it through the number of all words in the email.

Q overall = sum(calculate Q for every word w) / count(every word w)
Q overall > 1 = spam
Q overall < 0 = not spam

Now all we have to do is to implement a program that takes the emails and checks a given email for spam. To check if the program does what it should do, i implemented two tests that check phrases. The NaiveBayersSpamTrueTest method uses phrases from spam mails and the NaiveBayersNotSpamFalseTest method uses phrases from not spam mails.
That is the spam true test.

[TestMethod]
public void NaiveBayersSpamTrueTest()
{
    NaiveBayes.NaiveBayes naiveBayes = new NaiveBayes.NaiveBayes();
    var result = naiveBayes.CheckEmail("Buy Cheap cialis Online");
    Assert.AreEqual(true, result);
    result = naiveBayes.CheckEmail("Enlarge Your Penis");
    Assert.AreEqual(true, result);
    result = naiveBayes.CheckEmail("accept VISA, MasterCard");
    Assert.AreEqual(true, result);
}

That is the spam not true test.

[TestMethod]
public void NaiveBayersNotSpamFalseTest()
{
    NaiveBayes.NaiveBayes naiveBayes = new NaiveBayes.NaiveBayes();
    var result = naiveBayes.CheckEmail("Sincerely Mathias");
    Assert.AreEqual(false, result);
    result = naiveBayes.CheckEmail("Dear Graduates");
    Assert.AreEqual(false, result);
    result = naiveBayes.CheckEmail("Thanks in advance for your support");
    Assert.AreEqual(false, result);
    result = naiveBayes.CheckEmail("for it with my Mastercard");
    Assert.AreEqual(false, result);
}

Now that we have the tests we start with the part that reads the evidence data. What i did is, i took 10 normal mails and 10 spam mails from my inbox and my spam folder and made 10 text files out of them. Then i implemented a method that read all the data from the text files. (the text files with the text from the spam mails and not spam mails are at the naivebayes.codeplex.com project)

public class EmailReader
{
    public List<string> Read(string pathToFiles)
    {
        List<string> list = new List<string>();
        DirectoryInfo currentdirectoryInfo = 
            new DirectoryInfo(Environment.CurrentDirectory);
        DirectoryInfo parent = 
            currentdirectoryInfo.Parent.Parent.Parent;
        DirectoryInfo directoryInfo = 
            new DirectoryInfo(parent.FullName + @"\NaiveBayes" + pathToFiles);
        if (directoryInfo.Exists)
        {
            foreach (var file in directoryInfo.EnumerateFiles())
            {
                var text = String.Empty;
                using(FileStream filestream = file.OpenRead())
                {
                    System.IO.StreamReader streamReader = 
                        new System.IO.StreamReader(filestream);
                    text = streamReader.ReadToEnd();
                }
                list.Add(text);
            }
        }
        return list;
    }
}

The methode Read is called two times, one time for the spam mails and one time for the not spam mails. After reading the text from the text files i need a class that parses all words from the text into a dictionary and count how often a word occurs. So that parser tells me how often a word occurs in all my spam mails and in all my not spam mails. So for simplicity the parser takes the text of every mail and splits it into word by using the string.Split(‘ ‘) method, that means every times a blank occurs a new word starts.

public class Parser
{
    private Dictionary<string, int> wordCounter = new Dictionary<string,int>();

    public Dictionary<string, int> WordCounter
    {
        get { return wordCounter; }
        set { wordCounter = value; }
    }

    public void Parse(string textToParse)
    {
        var stringArray = textToParse.Split(' ');
        foreach (var str in stringArray)
        {
            if (wordCounter.ContainsKey(str.ToLower()))
            {
                wordCounter[str.ToLower()] += 1;
            }
            else
            {
                wordCounter.Add(str.ToLower(), 1);
            }
        }
    }
}

Now all the groundwork is done. What we need now is the method that calculates Q for every word and calculates out of all Q values of all words of our email, if it is spam or not spam. In the constructor of the NaiveBayes class the mails are read and saved in the spamMails and notSpamMails variable. The CheckMail method is the starting point of the calculation. First it parses the spamMails and the notSpamMails string and fills two dictionaries with the spam and the not spam mails words. Then the CheckIfSpam method is called that splits the mail text in words and calls for every word the CalculateQ method. Then it sums up all q values and divided it by the count of all tested words. If the calculated value is bigger one it returns true that means spam if not it returns false that means no spam. In the CalculateQ method we calculate Q with the bayes theorem.

P(h1/E) = P(h1) * P(E/h1) / P(E)

We calculate Ph1e by dividing wordCountSpam by countSpamMails and Ph2e by dividing wordCountNotSpam with countNotSpamMails. Then we divide Ph1e by Ph2e and the result is q.

public class NaiveBayes
{
    private List<string> spamMails;
    private List<string> notSpamMails;
    private int countSpamMails;
    private int countNotSpamMails;

    public NaiveBayes()
    {
        EmailReader emailReader = new EmailReader();
        spamMails = emailReader.Read(@"\Mails\Spam\");
        notSpamMails = emailReader.Read(@"\Mails\NotSpam\");
        countSpamMails = spamMails.Count();
        countNotSpamMails = notSpamMails.Count();
    }

    public bool CheckEmail(string text)
    {            
        Parser parser = new Parser();
        foreach(var spamMail in spamMails)
        {
            parser.Parse(spamMail);
        }
        var spamWords = parser.WordCounter;
        parser = new Parser();
        foreach (var notSpamMail in notSpamMails)
        {
            parser.Parse(notSpamMail);
        }
        var notSpamWords = parser.WordCounter;
        return CheckIfSpam(text, countSpamMails,
            spamWords, countNotSpamMails, notSpamWords);
    }

    private bool CheckIfSpam(string text, 
        int countSpamMails, Dictionary<string,int> spamWordList,
        int countNotSpamMails, Dictionary<string,int> notSpamWordList)
    {
        var stringArray = text.Split(' ');
        var sumQ = 0.0;
        var wordCounter = 0;
        foreach (var item in stringArray)
        {
            var q = CalculateQ(item.ToLower(), 
                countSpamMails, spamWordList, 
                countNotSpamMails, notSpamWordList);
            sumQ += q;
            wordCounter++;
        }
        if (sumQ / wordCounter > 1)
        {
            return true;
        }
        else
        {
            return false;
        }
    }

    private double CalculateQ(string word, 
        int countSpamMails, Dictionary<string,int> spamWordList,
        int countNotSpamMails, Dictionary<string,int> notSpamWordList)
    {
        double wordCountSpam = 1;
        if (spamWordList.ContainsKey(word))
        {
            wordCountSpam = spamWordList[word];
        }
        double Ph1e = 0.5 * wordCountSpam / countSpamMails;
        double wordCountNotSpam = 1;
        if (notSpamWordList.ContainsKey(word))
        {
            wordCountNotSpam = notSpamWordList[word];
        }
        double Ph2e = 0.5 * wordCountNotSpam / countNotSpamMails;
        double q = Ph1e / Ph2e;
        return q;
    }
}

So that’s it, if we now give the CheckEmail method a phrase (e.g. “Thanks in advance for your support”) it recognises that this is not from a spam mail. The complete source code including the text from spam and not spam mails is in the naivebayes.codeplex.com project. If you need more information about naive bayes classification i would recommend the open book Think Stats: Probability and Statistics for Programmers. I have to say that I really enjoyed implementing the naive bayes classifier, because i always wanted to know how a spam filter works and it is surprising how good it works.

Advertisements


If you have a note or a question please write a comment.

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s