building a simple spam filter with naive bayes classifier and c#

Have you also often wondered how a filter works? I have and last month I got a book in my fingers that deals with machine learning. In that book there is a chapter about naive bayer classifier, and that naive bayer classifiers are used to implement spam filtering. All programming examples in that book are in python and I do not speak python, so i calculated the naive bayer classifier myself, wrote a c# program and put it on codeplex (naivebayes.codeplex.com). I made the application only do demonstrate the principle of spam filtering, it is not for practical use. To understand how naive bayes classifiers work the best example is the cookie problem. Lets assume we have two bowls with cookies in it.

Now we take one cookie (it is a vanilla cookie) and we have 50 percent chance that we take it from bowl 1 or from bowl2. Now we have two hypothesis:

1. It is from bowl 1

`H1 = 1/2`

2. It is from bowl 2

`H2 = 1/2`

That means that we have a 50 to 50 chance to pick from bowl 1 or from bowl 2

`P(h1) = P(h2) = 1/2`

After defining our hypothesis and taking one vanilla cookie, now we have one evidence, and we can calculate the probability that we took it from bowl 1 or bowl 2. So now we can define the hypothesis under the evidence that we took a vanilla cookie.

```P(E/h1) = 35/40 (because we took one cookie that is vanilla the chance is 35/40)
P(E/h2) = 20/40 = 1/2 (because we took one cookie that is vanilla but the chance is 1/2)```

P(E/h1) is 35/40 because we have 35 vanilla cookies out of 40
P(E/h2) is 1/2 because we have 20 vanilla cookies out of 40
At this point we need bayes theorem:

`P(h1/E) = P(h1) * P(E/h1) / P(E)`

Bayes theorem says that you can calculate the evidence under the hypothesis one P(h1/E) with multiplying the P(h1) with the calculated evidence under the hypothesis P(E/h1) divided by a base P(E). To calculate P(h1/E) that is the value that tells us how big the chance is that our hypothesis one is correct. What we need to calculate that value is P(E). In this case it is very easy to find. Is is the sum of all vanilla cookies divided by the sum of all cookies.

`P(E) = sum(vanilla cookies) / sum(all cookies) -> 55 / 80 `

So now we have all values we need for bayes theorem.

```P(h1/E) = P(h1) * P(E/h1) / P(E) -> P(h1/E) =
(1/2 * 35/40) / 55/80 = 0.5 * 0,875 / 0,6875 = 0.6363636363 = 63.6 %```

That means after taking one vanilla cookie we can say that the chance that we take cookies from bowl 1 is 63.6 percent.

So that’s it with the cookie problem but how can us help that by implementing a spam filter? We can calculate the possibility if a mail we get is spam or not spam with the same formula, with bayes theorem.
But instead of cookies we use mails we have classified as spam mails or as not spam mails. So if we get a new mail, we take every word and calculate a value Q. If Q is greater than 1 the word comes more likely from a spam mail, if Q is smaller than 1 the word comes more likely not from a spam mail. So the rule will be

`Q > 1 = spam`
`Q < 0 = not spam`

Now we will calculate Q for every word w.
Our hypothesis h1 and h2 are:

`H1 = HSpam = 1/2`
`H2 = HNotSpam = 1/2`

So we have again have a 50 to 50 chance

`P(h1) = P(h2) = 1/2`

Now we need some evidence data, in our case we have 10 spam mails and 10 not spam mails. That mails will give us the evidence we need to decide if the word comes from a spam mail or not. So now we take one word we have in the mail we want to check. Lets say the word is mother. We look in the spam mails and we find it one time. Then we look in the not spam mails and we find it five times. With that information we can calculate P(E/h1) and P(E/h2):

`P(E/h1) = 1 / 10 (one mail out of ten)`
`P(E/h2) = 5 / 10 (five mails out of ten)`

Now we use bayes theorem to calculate P(h1/E) and P(h2/E).

`P(h1/E) = P(h1) * P(E/h1) / P(E)`
`P(h2/E) = P(h2) * P(E/h2) / P(E)`

Normally we have to figure out what P(E) is but in this case P(E) refers only to the actual mail that means it is one and we can remove it. Here is the calculation:

`P(h1/E) = 1/2 * 1/10 / 1 = 1/20`
`P(h2/E) = 1/2 * 5/10 / 1 = 5/20`

So we have two values that tell us how big the chance is that a mail is a spam mail or not. (based on the spam and not spam mails we already got)
Now we calculate Q and check if it is greater or smaller that 1.

`Q = P(h1/E) / P(h2/E) = 1/20 / 5/20 = 1/5 = 0.2`

Q is smaller that one that means that the word mother occurs more often in one of our not spam mails. To check all the text in a mail we only have to calculate Q for every word sum all Q values up and divide it through the number of all words in the email.

`Q overall = sum(calculate Q for every word w) / count(every word w)`
`Q overall > 1 = spam`
`Q overall < 0 = not spam`

Now all we have to do is to implement a program that takes the emails and checks a given email for spam. To check if the program does what it should do, i implemented two tests that check phrases. The NaiveBayersSpamTrueTest method uses phrases from spam mails and the NaiveBayersNotSpamFalseTest method uses phrases from not spam mails.
That is the spam true test.

```[TestMethod]
{
NaiveBayes.NaiveBayes naiveBayes = new NaiveBayes.NaiveBayes();
var result = naiveBayes.CheckEmail("Buy Cheap cialis Online");
Assert.AreEqual(true, result);
Assert.AreEqual(true, result);
result = naiveBayes.CheckEmail("accept VISA, MasterCard");
Assert.AreEqual(true, result);
}```

That is the spam not true test.

```[TestMethod]
public void NaiveBayersNotSpamFalseTest()
{
NaiveBayes.NaiveBayes naiveBayes = new NaiveBayes.NaiveBayes();
var result = naiveBayes.CheckEmail("Sincerely Mathias");
Assert.AreEqual(false, result);
Assert.AreEqual(false, result);
Assert.AreEqual(false, result);
result = naiveBayes.CheckEmail("for it with my Mastercard");
Assert.AreEqual(false, result);
}```

Now that we have the tests we start with the part that reads the evidence data. What i did is, i took 10 normal mails and 10 spam mails from my inbox and my spam folder and made 10 text files out of them. Then i implemented a method that read all the data from the text files. (the text files with the text from the spam mails and not spam mails are at the naivebayes.codeplex.com project)

```public class EmailReader
{
{
List<string> list = new List<string>();
DirectoryInfo currentdirectoryInfo =
new DirectoryInfo(Environment.CurrentDirectory);
DirectoryInfo parent =
currentdirectoryInfo.Parent.Parent.Parent;
DirectoryInfo directoryInfo =
new DirectoryInfo(parent.FullName + @"\NaiveBayes" + pathToFiles);
if (directoryInfo.Exists)
{
foreach (var file in directoryInfo.EnumerateFiles())
{
var text = String.Empty;
{
}
}
}
return list;
}
}```

The methode Read is called two times, one time for the spam mails and one time for the not spam mails. After reading the text from the text files i need a class that parses all words from the text into a dictionary and count how often a word occurs. So that parser tells me how often a word occurs in all my spam mails and in all my not spam mails. So for simplicity the parser takes the text of every mail and splits it into word by using the string.Split(‘ ‘) method, that means every times a blank occurs a new word starts.

```public class Parser
{
private Dictionary<string, int> wordCounter = new Dictionary<string,int>();

public Dictionary<string, int> WordCounter
{
get { return wordCounter; }
set { wordCounter = value; }
}

public void Parse(string textToParse)
{
var stringArray = textToParse.Split(' ');
foreach (var str in stringArray)
{
if (wordCounter.ContainsKey(str.ToLower()))
{
wordCounter[str.ToLower()] += 1;
}
else
{
}
}
}
}```

Now all the groundwork is done. What we need now is the method that calculates Q for every word and calculates out of all Q values of all words of our email, if it is spam or not spam. In the constructor of the NaiveBayes class the mails are read and saved in the spamMails and notSpamMails variable. The CheckMail method is the starting point of the calculation. First it parses the spamMails and the notSpamMails string and fills two dictionaries with the spam and the not spam mails words. Then the CheckIfSpam method is called that splits the mail text in words and calls for every word the CalculateQ method. Then it sums up all q values and divided it by the count of all tested words. If the calculated value is bigger one it returns true that means spam if not it returns false that means no spam. In the CalculateQ method we calculate Q with the bayes theorem.

`P(h1/E) = P(h1) * P(E/h1) / P(E)`

We calculate Ph1e by dividing wordCountSpam by countSpamMails and Ph2e by dividing wordCountNotSpam with countNotSpamMails. Then we divide Ph1e by Ph2e and the result is q.

```public class NaiveBayes
{
private List<string> spamMails;
private List<string> notSpamMails;
private int countSpamMails;
private int countNotSpamMails;

public NaiveBayes()
{
countSpamMails = spamMails.Count();
countNotSpamMails = notSpamMails.Count();
}

public bool CheckEmail(string text)
{
Parser parser = new Parser();
foreach(var spamMail in spamMails)
{
parser.Parse(spamMail);
}
var spamWords = parser.WordCounter;
parser = new Parser();
foreach (var notSpamMail in notSpamMails)
{
parser.Parse(notSpamMail);
}
var notSpamWords = parser.WordCounter;
return CheckIfSpam(text, countSpamMails,
spamWords, countNotSpamMails, notSpamWords);
}

private bool CheckIfSpam(string text,
int countSpamMails, Dictionary<string,int> spamWordList,
int countNotSpamMails, Dictionary<string,int> notSpamWordList)
{
var stringArray = text.Split(' ');
var sumQ = 0.0;
var wordCounter = 0;
foreach (var item in stringArray)
{
var q = CalculateQ(item.ToLower(),
countSpamMails, spamWordList,
countNotSpamMails, notSpamWordList);
sumQ += q;
wordCounter++;
}
if (sumQ / wordCounter > 1)
{
return true;
}
else
{
return false;
}
}

private double CalculateQ(string word,
int countSpamMails, Dictionary<string,int> spamWordList,
int countNotSpamMails, Dictionary<string,int> notSpamWordList)
{
double wordCountSpam = 1;
if (spamWordList.ContainsKey(word))
{
wordCountSpam = spamWordList[word];
}
double Ph1e = 0.5 * wordCountSpam / countSpamMails;
double wordCountNotSpam = 1;
if (notSpamWordList.ContainsKey(word))
{
wordCountNotSpam = notSpamWordList[word];
}
double Ph2e = 0.5 * wordCountNotSpam / countNotSpamMails;
double q = Ph1e / Ph2e;
return q;
}
}
```

So that’s it, if we now give the CheckEmail method a phrase (e.g. “Thanks in advance for your support”) it recognises that this is not from a spam mail. The complete source code including the text from spam and not spam mails is in the naivebayes.codeplex.com project. If you need more information about naive bayes classification i would recommend the open book Think Stats: Probability and Statistics for Programmers. I have to say that I really enjoyed implementing the naive bayes classifier, because i always wanted to know how a spam filter works and it is surprising how good it works.

little or big endian in c#

If you want to check the endianess of your operating system in c# you have two possibilities, a safe and a unsafe way. But first of all i want to explain what endianess is and what big and little endian means. If you have an Int32 value in c#, that Int32 is saved in 4 bytes. But the question is which part of the Int32 is saved in which byte with which address. If the system is using little endian (like Windows does) the last byte (=least significant byte) is saved in the lowest address. That means if you have the hex value 0A0B0C0D and you have the Int32 has the address 00000001, the bytes are saved in the following order:

The following images shows the insertion order:

If the system has big endian order the last byte (=least significant byte) is saved in the highest address. That means the Int32 value 0A0B0C0D with the address 00000001 the bytes are inserted the following way:

The following image shows the insertion order:

Now if we want to check if the system uses big or little endian in c# there is a very easy way, you can use the BitConverter class that has an IsLittleEndian property.

```if(BitConverter.IsLittleEndian)
return "Little Endian";
```

If you want to implement the check yourself the simplest method is the following one:

```public enum BigLittleEndianEnum
{
Big,
Little
}
public BigLittleEndianEnum TestSystemIsBigOrLittleEndianSafe()
{
Int32 intValue = 0x0001;
byte[] bytes = BitConverter.GetBytes(intValue);
var safeBigLittleEndian = (bytes[0] == 1) ?
BigLittleEndianEnum.Little : BigLittleEndianEnum.Big;
return safeBigLittleEndian;
}
```

What we do is we take the Int32 type intValue and assign 1 to it. Then we use the BitConverter class to generate a byte array out of that Int32 value. If the zero byte of the array (byte[0]) is one the we know that the system uses little endian because the 1 is in the lowest address (byte[0]).
If you search for check little or big endian in google you often find the c/c++ solution. In c# you have to use the unsafe keyword, because you need pointer syntax to implement the c/c++ solution.

```public BigLittleEndianEnum TestSystemIsBigOrLittleEndianUnsafe()
{
Int32 intValue = 0x0001;
unsafe
{
var unsafeBigLittleEndian = (byteValue == 1) ?
BigLittleEndianEnum.Little : BigLittleEndianEnum.Big;
return unsafeBigLittleEndian;
}
}```

Here we use a pointer of the Int32 intValue and cast it to a byte pointer. That byte pointer points on the lowest address (byte[0]) of the Int32 value and you can check if the value the pointer points on is zero or one, if it is one its little endian. A real c/c++ programmer would write that whole method in two lines:

```public BigLittleEndianEnum TestSystemIsBigOrLittleEndianUnsafe()
{
Int32 intValue = 0x0001;
return *((byte*) &intValue) == 1 ?
BigLittleEndianEnum.Little : BigLittleEndianEnum.Big;
}```

For a more detailed explanation please look at the wickipedia article about Endianess.