building a simple spam filter with naive bayes classifier and c#

Have you also often wondered how a filter works? I have and last month I got a book in my fingers that deals with machine learning. In that book there is a chapter about naive bayer classifier, and that naive bayer classifiers are used to implement spam filtering. All programming examples in that book are in python and I do not speak python, so i calculated the naive bayer classifier myself, wrote a c# program and put it on codeplex (naivebayes.codeplex.com). I made the application only do demonstrate the principle of spam filtering, it is not for practical use. To understand how naive bayes classifiers work the best example is the cookie problem. Lets assume we have two bowls with cookies in it.

Bowl 1: 5 chocolate cookies and 35 vanilla cookies

Bowl 2: 20 chocolate cookies and 20 vanilla cookies

Now we take one cookie (it is a vanilla cookie) and we have 50 percent chance that we take it from bowl 1 or from bowl2. Now we have two hypothesis:

1. It is from bowl 1

H1 = 1/2

2. It is from bowl 2

H2 = 1/2

That means that we have a 50 to 50 chance to pick from bowl 1 or from bowl 2

P(h1) = P(h2) = 1/2

After defining our hypothesis and taking one vanilla cookie, now we have one evidence, and we can calculate the probability that we took it from bowl 1 or bowl 2. So now we can define the hypothesis under the evidence that we took a vanilla cookie.

P(E/h1) = 35/40 (because we took one cookie that is vanilla the chance is 35/40)
P(E/h2) = 20/40 = 1/2 (because we took one cookie that is vanilla but the chance is 1/2)

P(E/h1) is 35/40 because we have 35 vanilla cookies out of 40
P(E/h2) is 1/2 because we have 20 vanilla cookies out of 40
At this point we need bayes theorem:

P(h1/E) = P(h1) * P(E/h1) / P(E)

Bayes theorem says that you can calculate the evidence under the hypothesis one P(h1/E) with multiplying the P(h1) with the calculated evidence under the hypothesis P(E/h1) divided by a base P(E). To calculate P(h1/E) that is the value that tells us how big the chance is that our hypothesis one is correct. What we need to calculate that value is P(E). In this case it is very easy to find. Is is the sum of all vanilla cookies divided by the sum of all cookies.

P(E) = sum(vanilla cookies) / sum(all cookies) -> 55 / 80 

So now we have all values we need for bayes theorem.

P(h1/E) = P(h1) * P(E/h1) / P(E) -> P(h1/E) = 
(1/2 * 35/40) / 55/80 = 0.5 * 0,875 / 0,6875 = 0.6363636363 = 63.6 %

That means after taking one vanilla cookie we can say that the chance that we take cookies from bowl 1 is 63.6 percent.

So that’s it with the cookie problem but how can us help that by implementing a spam filter? We can calculate the possibility if a mail we get is spam or not spam with the same formula, with bayes theorem.
But instead of cookies we use mails we have classified as spam mails or as not spam mails. So if we get a new mail, we take every word and calculate a value Q. If Q is greater than 1 the word comes more likely from a spam mail, if Q is smaller than 1 the word comes more likely not from a spam mail. So the rule will be

Q > 1 = spam
Q < 0 = not spam

Now we will calculate Q for every word w.
Our hypothesis h1 and h2 are:

H1 = HSpam = 1/2
H2 = HNotSpam = 1/2

So we have again have a 50 to 50 chance

P(h1) = P(h2) = 1/2

Now we need some evidence data, in our case we have 10 spam mails and 10 not spam mails. That mails will give us the evidence we need to decide if the word comes from a spam mail or not. So now we take one word we have in the mail we want to check. Lets say the word is mother. We look in the spam mails and we find it one time. Then we look in the not spam mails and we find it five times. With that information we can calculate P(E/h1) and P(E/h2):

P(E/h1) = 1 / 10 (one mail out of ten)
P(E/h2) = 5 / 10 (five mails out of ten)

Now we use bayes theorem to calculate P(h1/E) and P(h2/E).

P(h1/E) = P(h1) * P(E/h1) / P(E)
P(h2/E) = P(h2) * P(E/h2) / P(E)

Normally we have to figure out what P(E) is but in this case P(E) refers only to the actual mail that means it is one and we can remove it. Here is the calculation:

P(h1/E) = 1/2 * 1/10 / 1 = 1/20
P(h2/E) = 1/2 * 5/10 / 1 = 5/20

So we have two values that tell us how big the chance is that a mail is a spam mail or not. (based on the spam and not spam mails we already got)
Now we calculate Q and check if it is greater or smaller that 1.

Q = P(h1/E) / P(h2/E) = 1/20 / 5/20 = 1/5 = 0.2

Q is smaller that one that means that the word mother occurs more often in one of our not spam mails. To check all the text in a mail we only have to calculate Q for every word sum all Q values up and divide it through the number of all words in the email.

Q overall = sum(calculate Q for every word w) / count(every word w)
Q overall > 1 = spam
Q overall < 0 = not spam

Now all we have to do is to implement a program that takes the emails and checks a given email for spam. To check if the program does what it should do, i implemented two tests that check phrases. The NaiveBayersSpamTrueTest method uses phrases from spam mails and the NaiveBayersNotSpamFalseTest method uses phrases from not spam mails.
That is the spam true test.

[TestMethod]
public void NaiveBayersSpamTrueTest()
{
    NaiveBayes.NaiveBayes naiveBayes = new NaiveBayes.NaiveBayes();
    var result = naiveBayes.CheckEmail("Buy Cheap cialis Online");
    Assert.AreEqual(true, result);
    result = naiveBayes.CheckEmail("Enlarge Your Penis");
    Assert.AreEqual(true, result);
    result = naiveBayes.CheckEmail("accept VISA, MasterCard");
    Assert.AreEqual(true, result);
}

That is the spam not true test.

[TestMethod]
public void NaiveBayersNotSpamFalseTest()
{
    NaiveBayes.NaiveBayes naiveBayes = new NaiveBayes.NaiveBayes();
    var result = naiveBayes.CheckEmail("Sincerely Mathias");
    Assert.AreEqual(false, result);
    result = naiveBayes.CheckEmail("Dear Graduates");
    Assert.AreEqual(false, result);
    result = naiveBayes.CheckEmail("Thanks in advance for your support");
    Assert.AreEqual(false, result);
    result = naiveBayes.CheckEmail("for it with my Mastercard");
    Assert.AreEqual(false, result);
}

Now that we have the tests we start with the part that reads the evidence data. What i did is, i took 10 normal mails and 10 spam mails from my inbox and my spam folder and made 10 text files out of them. Then i implemented a method that read all the data from the text files. (the text files with the text from the spam mails and not spam mails are at the naivebayes.codeplex.com project)

public class EmailReader
{
    public List<string> Read(string pathToFiles)
    {
        List<string> list = new List<string>();
        DirectoryInfo currentdirectoryInfo = 
            new DirectoryInfo(Environment.CurrentDirectory);
        DirectoryInfo parent = 
            currentdirectoryInfo.Parent.Parent.Parent;
        DirectoryInfo directoryInfo = 
            new DirectoryInfo(parent.FullName + @"\NaiveBayes" + pathToFiles);
        if (directoryInfo.Exists)
        {
            foreach (var file in directoryInfo.EnumerateFiles())
            {
                var text = String.Empty;
                using(FileStream filestream = file.OpenRead())
                {
                    System.IO.StreamReader streamReader = 
                        new System.IO.StreamReader(filestream);
                    text = streamReader.ReadToEnd();
                }
                list.Add(text);
            }
        }
        return list;
    }
}

The methode Read is called two times, one time for the spam mails and one time for the not spam mails. After reading the text from the text files i need a class that parses all words from the text into a dictionary and count how often a word occurs. So that parser tells me how often a word occurs in all my spam mails and in all my not spam mails. So for simplicity the parser takes the text of every mail and splits it into word by using the string.Split(‘ ‘) method, that means every times a blank occurs a new word starts.

public class Parser
{
    private Dictionary<string, int> wordCounter = new Dictionary<string,int>();

    public Dictionary<string, int> WordCounter
    {
        get { return wordCounter; }
        set { wordCounter = value; }
    }

    public void Parse(string textToParse)
    {
        var stringArray = textToParse.Split(' ');
        foreach (var str in stringArray)
        {
            if (wordCounter.ContainsKey(str.ToLower()))
            {
                wordCounter[str.ToLower()] += 1;
            }
            else
            {
                wordCounter.Add(str.ToLower(), 1);
            }
        }
    }
}

Now all the groundwork is done. What we need now is the method that calculates Q for every word and calculates out of all Q values of all words of our email, if it is spam or not spam. In the constructor of the NaiveBayes class the mails are read and saved in the spamMails and notSpamMails variable. The CheckMail method is the starting point of the calculation. First it parses the spamMails and the notSpamMails string and fills two dictionaries with the spam and the not spam mails words. Then the CheckIfSpam method is called that splits the mail text in words and calls for every word the CalculateQ method. Then it sums up all q values and divided it by the count of all tested words. If the calculated value is bigger one it returns true that means spam if not it returns false that means no spam. In the CalculateQ method we calculate Q with the bayes theorem.

P(h1/E) = P(h1) * P(E/h1) / P(E)

We calculate Ph1e by dividing wordCountSpam by countSpamMails and Ph2e by dividing wordCountNotSpam with countNotSpamMails. Then we divide Ph1e by Ph2e and the result is q.

public class NaiveBayes
{
    private List<string> spamMails;
    private List<string> notSpamMails;
    private int countSpamMails;
    private int countNotSpamMails;

    public NaiveBayes()
    {
        EmailReader emailReader = new EmailReader();
        spamMails = emailReader.Read(@"\Mails\Spam\");
        notSpamMails = emailReader.Read(@"\Mails\NotSpam\");
        countSpamMails = spamMails.Count();
        countNotSpamMails = notSpamMails.Count();
    }

    public bool CheckEmail(string text)
    {            
        Parser parser = new Parser();
        foreach(var spamMail in spamMails)
        {
            parser.Parse(spamMail);
        }
        var spamWords = parser.WordCounter;
        parser = new Parser();
        foreach (var notSpamMail in notSpamMails)
        {
            parser.Parse(notSpamMail);
        }
        var notSpamWords = parser.WordCounter;
        return CheckIfSpam(text, countSpamMails,
            spamWords, countNotSpamMails, notSpamWords);
    }

    private bool CheckIfSpam(string text, 
        int countSpamMails, Dictionary<string,int> spamWordList,
        int countNotSpamMails, Dictionary<string,int> notSpamWordList)
    {
        var stringArray = text.Split(' ');
        var sumQ = 0.0;
        var wordCounter = 0;
        foreach (var item in stringArray)
        {
            var q = CalculateQ(item.ToLower(), 
                countSpamMails, spamWordList, 
                countNotSpamMails, notSpamWordList);
            sumQ += q;
            wordCounter++;
        }
        if (sumQ / wordCounter > 1)
        {
            return true;
        }
        else
        {
            return false;
        }
    }

    private double CalculateQ(string word, 
        int countSpamMails, Dictionary<string,int> spamWordList,
        int countNotSpamMails, Dictionary<string,int> notSpamWordList)
    {
        double wordCountSpam = 1;
        if (spamWordList.ContainsKey(word))
        {
            wordCountSpam = spamWordList[word];
        }
        double Ph1e = 0.5 * wordCountSpam / countSpamMails;
        double wordCountNotSpam = 1;
        if (notSpamWordList.ContainsKey(word))
        {
            wordCountNotSpam = notSpamWordList[word];
        }
        double Ph2e = 0.5 * wordCountNotSpam / countNotSpamMails;
        double q = Ph1e / Ph2e;
        return q;
    }
}

So that’s it, if we now give the CheckEmail method a phrase (e.g. “Thanks in advance for your support”) it recognises that this is not from a spam mail. The complete source code including the text from spam and not spam mails is in the naivebayes.codeplex.com project. If you need more information about naive bayes classification i would recommend the open book Think Stats: Probability and Statistics for Programmers. I have to say that I really enjoyed implementing the naive bayes classifier, because i always wanted to know how a spam filter works and it is surprising how good it works.


implementing expression evaluator in c#

In this post i want to write about the implementation an expression evaluator. This expression evaluator evaluates simple c style expressions. I wrote it to implement defined rules like in my building a rule engine in c# post. A description about the use of the expression evaluator can be found at building a rule engine in c# (part 4: extending the rule engine to evaluate defined expressions). You can download the expression evaluator and the source code from (http://simpleexpeval.codeplex.com/). Now i want to show the use of the expression evaluator and then i will describe the implementation of the parts of the expression evaluator. That is the list of operations the expression evaluator supports:

  • open bracket ‘(‘ and close bracket ‘)’
  • addition ‘+’
  • subtraction ‘-‘
  • multiblication ‘*’
  • division ‘/’
  • modulo ‘%’
  • equality ‘=’
  • unequality ‘!=’
  • bigger than ‘>’
  • bigger than or equal ‘>=’
  • smaller than ‘<‘
  • smaller than or equal ‘<=’
  • logical and ‘&&’
  • logical or ‘||’

and the following data types:

  • integer ’10’
  • double ‘10.2’
  • string ‘”test”‘
  • boolean ‘true’ or ‘false’

Here is a example of the use of the expression evaluator.

Person person = new Person() { Name = "Mathias", Age = 36, Children = 2, Married = true };
var ruleText = " (Children = 2 && Married = true) || Age > 36 ";
Evaluator evaluator = new Evaluator();
var evaluatorResult = evaluator.Evaluate<Person>(ruleText, person);

As we see the expression evaluator takes a rule text and a object. In this case it is a person object with four properties. I use this properties in the rule text and the expression evaluator replaces the property name with the value of the overtaken object. That means that for the expression evaluator the rule text looks like ” (2 = 2 && true = true) || 36 > 36 “. Now the expression evaluator can evaluate that text and check if the expression is true or false. In this case it is true because 2 = 2 and true = true.
The expression evaluator consist of four parts. The Lexer that reads every character, the Parser that builds the expression parts the infix to postfix converter and the evaluator that takes the expression parts and evaluates it to a specific result. I want to start with a description of the lexer. It is a so called LL(k) lexer that means it left associative and starts with the first left character. The k means how many character the lexer can look ahead. In this case it is only one character. Here is the class diagram of the lexer.

SimpleExpressionEvaluatorLexer

In the Lexer class i have implementet the basis functionality for a LL(k) lexer. The ExpressionEvaluatorLexer class contains the specific implementation for the expression evaluator. The most importand part is the NextToken method that reads character after character and builds Tokens out of the characters. Here is the code for the NextToken method.

public override Token NextToken()
{
    while (lookahead[lookaheadIndex] != EOF)
    {
        switch (lookahead[lookaheadIndex])
        {
            case '.': ConsumeItem(); return new Token(TokenType.POINT, ".");
            case '+': ConsumeItem(); return Plus();
            case '-': ConsumeItem(); return Minus();
            case '*': ConsumeItem(); return Multibly();
            case '/': ConsumeItem(); return Divide();
            case '%': ConsumeItem(); return Modulo();
            case '<': ConsumeItem(); return Smaller();
            case '>': ConsumeItem(); return Greater();
            case '=': ConsumeItem(); return Equal();
            case '!': ConsumeItem(); return Unequal();
            case '"': ConsumeItem(); return String();
            case '\'': ConsumeItem(); return SingleQuoteString();
            case '|': ConsumeItem(); return Or();
            case '&': ConsumeItem(); return And();                    
            case '(': ConsumeItem(); return new Token(TokenType.OPENBRACKET, "(");
            case ')': ConsumeItem(); return new Token(TokenType.CLOSEBRACKET, ")");
            default:                        
                if (IsLetter())
                {
                    return Terminal();
                }
                if (IsNumber())
                {
                    return Number();
                }
                if (IsWhitespace())
                {
                    ConsumeItem();
                    continue;
                }
                throw new Exception("invalid character." + lookahead[lookaheadIndex]);
        }
    }
    return new Token(TokenType.EOF, "<EOF>");
}

As we see it is a loop that loops every character in the lookahead char array. If it finds a specific character for example a digit, the lexer assumes that it has an integer or an double. So i calls the Number method.

private Token Number()
{
    bool hasPoint = false;
    StringBuilder stringBuilder = new StringBuilder();
    while (IsNumber() || IsPoint() || IsExponent())
    {
        if (IsPoint())
        {
            hasPoint = true;
        }
        stringBuilder.Append(lookahead[lookaheadIndex]);
        ConsumeItem();
    }
    if (hasPoint)
        return new Token(TokenType.DOUBLE, stringBuilder.ToString());
    return new Token(TokenType.INTEGER, stringBuilder.ToString());
}

The Number method tries to find out if the text is a integer or a double and it returns an integer or an double Token with the value it read. The lexer does this for every character and the parser controls the lexer and tells it to read the next token. That means the lexer does not read every character in the character array, it reads it token by token. If it has identified one token it returns the control back to the parser. Here we see the class diagram of the parser.

SimpleExpressionEvaluatorParser

The ExpressionEvaluationParser reads the token from the lexer and builds a list of AbstractSyntaxTreeNodes. That nodes get rearranged by the InfixToPostfix class and then the result list is passed to the expression evaluator. That is necessary because the expression evaluator can handle operator precedence, that means that a multiblication is performed before and addition or substraction and that calculations are performed before comparison operations.

public AbstractSyntaxTreeNode BuildAST()
{                                    
    if (lookahead[lookaheadIndex].Type_ == TokenType.EQUAL)
    {
        return Equal();
    }
    else if (lookahead[lookaheadIndex].Type_ == TokenType.INTEGER)
    {
        return Integer(lookahead[lookaheadIndex].Text);
    }
    ...
public EqualNode Equal()
{
     EqualNode equalNode = new EqualNode();
     Match(TokenType.EQUAL);
     return equalNode;
}
public IntegerNode Integer(string value)
{
    IntegerNode integerNode = new IntegerNode();
    Match(TokenType.INTEGER);            
    integerNode.Value = int.Parse(value);
    return integerNode;
}

That is the parser code that perfoms the transformation from the tokens to the AbstractSyntaxTreeNodes. That means if it finds an Eqaul token it builds a EqualNode and consumes the next token, if it finds a Integer token it overtakes the value of the Integer, builds a IntegerNode with the overtaken value and consumes the next token with the match methode. The parser has a tranformation method for every token. We now have a list of AbstrancSyntaxTreeNodes that represent the overtaken text. After that we have to get the AbstractSyntaxTreeNodes in the correct order to handle operator precedence and open and closed brackets. The best way to do that is to convert the expression that is defined in infix into a postfix expression. The benefit of a postfix operation is that it is much easier to evaluate than an infix expression. All we need for the infix to postfix transformation is an operator stack. I will show an example for the transformation of a list of infix operations to a list of postfix operations. We start with the expression 5 + 3 * 2 – 1. To convert that infix expression to a postfix expression we take every part of the expression and tranform it. We need a stack for the operator precedence and a list that contains the postfix expressions after the transformation.

  • 5 => to List ( 5 )
  • + => to Operator Stack ( + )
  • 3 => to List ( 5, 3 )
  • * => to Operator Stack ( +, * )
  • 2 => to List ( 5, 3, 2 )
  • – => lower precedence than * on Stack => to List ( 5, 3, 2, * ) => Operator Stack ( +, – )
  • 1 => to List ( 5, 3, 2, *, 1 )
  • pop Operator Stack ( +, – ) => to List ( 5, 3, 2, *, 1, -, + )

Thats how the postfix list looks after the transformation: ( 5, 3, 2, *, 1, -, + )
Now we have one big benefit, we can take that list and evaluate it from the beginning to the end without mind of the operator precedence. But what is if we have brackets in out expression for example the expression ( 5 + 3 ) * 2. We can handle that also very easily.

  • ( => handle bracket, instanciate new Operator Stack
  • 5 => to List ( 5 )
  • + => to new Operator Stack ( + )
  • 3 => to List ( 5, 3 )
  • ) => close bracket pop Operator Stack ( + ) => to List ( 5, 3, + )
  • * => to Operator Stack ( * )
  • 2 => to List ( 5, 3, +, 2 )
  • pop Operator Stack ( * ) => to List ( 5, 3, +, 2, * )

As we see all brackets are gone and if we use postfix we dont need them anymore, if we evaluate that expression we can calculate the right result without any brackets.

What we now need to do is do define the operator precedence for all operators. That is necessary because at first the expression evaluator has to perform calculations (multiblication, division and modulo before addition and subtraction) then the expression evaluator has to check for comparison ( equal, unequal, greater than, smaller than). After he evaluated this it has to check for logical combinators (logical and, logical or). So that is the operator precedence of the expression evaluator.

  • *, /, % have the highest order
  • +, – have the second highest order
  • =, !=, >, >=, <, <= have the third highest order
  • &&, || have the lowest order

Here is the code for the ConvertInfixToPostfix expression that performs exact that transformation from an infix to a postfix expression.

public List<AbstractSyntaxTreeNode> ConvertInfixToPostfix(List<AbstractSyntaxTreeNode> postfixList)
{
    List<AbstractSyntaxTreeNode> returnList = new List<AbstractSyntaxTreeNode>();
    Stack<AbstractSyntaxTreeNode> operatorStack = new Stack<AbstractSyntaxTreeNode>();
    var position = 0;
    for (int i = position; i < postfixList.Count; i++)            
    {
        var item = postfixList[i];
        if (item is IntegerNode || item is DoubleNode || item is VariableNode ||
            item is BooleanNode || item is StringNode)
        {
            returnList.Add(item);
        }
        else if (item is OpenBracketNode)
        {
            i = ConvertInfixToPostfixBracket(i, postfixList, returnList);
        }
        else if (item is OrNode || item is AndNode)
        {
            while (operatorStack.Count > 0)
            {
                var abstractSyntaxTreeNode = operatorStack.Pop();
                returnList.Add(abstractSyntaxTreeNode);
            }
            operatorStack.Push(item);
        }
        else if (item is GreaterThenNode || item is GreaterThenOrEqualNode ||
            item is SmallerThenNode || item is SmallerThenOrEqualNode ||
            item is EqualNode || item is UnEqualNode)
        {
            if (operatorStack.Count() > 0 && (operatorStack.Peek() is MulNode || 
                operatorStack.Peek() is DivNode ||
                operatorStack.Peek() is ModuloNode))
            {
                AbstractSyntaxTreeNode node = operatorStack.Pop();
                returnList.Add(node);
            }
            else if (operatorStack.Count() > 0 && (operatorStack.Peek() is AddNode || 
                operatorStack.Peek() is SubNode))
            {
                AbstractSyntaxTreeNode node = operatorStack.Pop();
                returnList.Add(node);
            }
            operatorStack.Push(item);
        }
        else if (item is AddNode || item is SubNode)
        {
            if (operatorStack.Count() > 0 && (operatorStack.Peek() is MulNode || 
                operatorStack.Peek() is DivNode ||
                operatorStack.Peek() is ModuloNode))
            {
                AbstractSyntaxTreeNode node = operatorStack.Pop();
                returnList.Add(node);
            }
            operatorStack.Push(item);
        }
        else if (item is MulNode || item is DivNode || item is ModuloNode)
        {
            operatorStack.Push(item);
        }                
        position++;
    }
    while (operatorStack.Count > 0)
    {
        var abstractSyntaxTreeNode = operatorStack.Pop();
        returnList.Add(abstractSyntaxTreeNode);
    }
    return returnList;
}

As we see the code is very simple it loops all AbstractSyntaxTreeNodes in the overtaken list and if it finds a Integer, Double, String or Boolean AbstractSyntaxTreeNode it adds it to the return list. If it finds a add or sub Node it checks the stack. If it finds a mul, div or mod Node it pops the Node (that has a higher precedence than the add or sub Node) and adds it to the list, then it pushes the actual operator to the stack. It works simular for comparison operators ( equal, unequal … ) and for logical operators ( and, or). If it comes to a open bracket it calls the ConvertInfixToPostfixOpenBracket method. That method instanciates a new operator stack and works the same as the ConvertInfixToPostfix method. It calls it self recursive for nested brackets. Now we have brought our AbstractSyntaxTreeNodes in the correct order and we can overhand our postfix list to the expression evaluator. That evaluator takes one node after the other and evaluates them. For that evaluation we need a stack that takes the result of the calculations. For example if we want to calculate the result of the expression ( 5, 3, 2, *, 1, -, + )

  • 5 => to the Stack ( 5 )
  • 3 => to the Stack ( 5, 3 )
  • 2 => to the Stack ( 5, 3, 2 )
  • * => take first two parameter from the Stack ( 2, 3 ) and interchange them because they are in wrong order. ( 3, 2 ) => calculate the multiblication 3 * 2 = 6 => push result to Stack ( 5, 6 )
  • 1 => to the Stack ( 5, 6, 1 )
  • – => take first two parameter from the Stack ( 1, 6 ) and interchange them because they are in wrong order. ( 6, 1 ) => calcualte the subtraction 6 – 1 = 5 => push result to Stack ( 5, 5 )
  • + => take first two parameter from the Stack ( 5, 5 ) and interchange them because they are in wrong order. ( 5, 5 ) => calculate the addition 5 + 5 = 10 => push result to Stack

Now we can take the result from the Stack ( 10 ) => that is the result. Here is the code that shows the Evaluate method that takes the list of the AbstractSyntaxTreeNode and evaluates them.

public bool Evaluate<T>(List<AbstractSyntaxTreeNode> postfixList, 
    Dictionary<string, AbstractSyntaxTreeNode> symbolTable, T objectValue)
{
    foreach (var item in postfixList)
    {
        if (item is IntegerNode || item is DoubleNode || 
            item is BooleanNode || item is StringNode)
        {
            valueStack.Push(item);
        }
        else if (item is AddNode)
        {
            var secondoperand = valueStack.Pop();
            var firstoperand = valueStack.Pop();
            if (secondoperand is IntegerNode && firstoperand is IntegerNode)
            {
                 IntegerNode integerNode = new IntegerNode();
                 integerNode.Value = ((IntegerNode)firstoperand).Value + 
                      ((IntegerNode)secondoperand).Value;
                 valueStack.Push(integerNode);
            }
         }
         else if (item is MulNode)
         {
              var secondoperand = valueStack.Pop();
              var firstoperand = valueStack.Pop();
              if (secondoperand is IntegerNode && firstoperand is IntegerNode)
              {
                   IntegerNode integerNode = new IntegerNode();
                        integerNode.Value = ((IntegerNode)firstoperand).Value *
                   ((IntegerNode)secondoperand).Value;
                   valueStack.Push(integerNode);
              }
         }
         else if (item is EqualNode)
         {
              var secondoperand = valueStack.Pop();
              var firstoperand = valueStack.Pop();
              if (secondoperand is IntegerNode && firstoperand is IntegerNode)
              {
                   BooleanNode booleanNode = new BooleanNode();
                   booleanNode.Value = ((IntegerNode)firstoperand).Value == 
                        ((IntegerNode)secondoperand).Value;
                   valueStack.Push(booleanNode);
         }
}
...

The evaluator part is really easy, it uses on stack for the values and the calculation results. So it takes on AbstractSyntaxTreeNode after the other and if it is data it puts it to the stack, if it is an operator it performs the operation and puts the result to the stack. If it gets the following postfixlist ( IntegerNode(4), IntegerNode(5), IntegerNode(3), MulNode(*),AddNode(+)) it starts with pushing 4 to the stack, pushing 5 to the stack, pushing 3 to the stack. Poping 3 and 5 from the stack perform the multiblication and push 15 to the stack. Then it pops 15 and 4 from the stack, performs the addition and pushes 19 to the stack. So the result would be 15. The problem is that the Evaluate methode returns bool. That is because the expression evaluator can only perform expression the have true or false as result. So if you evaluate ( 4 + 5 * 3 ) you get an exception. You have to write ( 4 + 5 * 3 = 19 ) that is a correct expression that returns true or false. The expression evaluator was written to perform rules against the properties of objects (building a rule engine in c# (part 4: extending the rule engine to evaluate defined expressions)) and a rule has to return always true or false. You can download the source code from http://simpleexpeval.codeplex.com/.


Building a rule engine in c# (part 4: extending the rule engine to evaluate defined expressions)

In the first three posts of building a rule engine in c# building a rule engine in c#, building a rule engine in c# (part 2: extending the rule engine to handle collections), building a rule engine in c# (part 3: extending the rule engine to handle aggregations of collections) the definition of the rules are very basic and did not look very clear. So i decided to implement an expression evaluator. With this expression evaluator the definition of the rules becomes much clearer. I will show the difference in the next three pictures. The first one shows the definition of the rules how i described it in the past three posts.

ProcessingEngineDatabase

This second picture shows the definition with a expression evaluator.

ComplexProcessingRule

If we use a expression evaluator we can also use logical combinators (and, or) to combine the defined rules. The third picture shows how the processing rules would look like if they are combined.

ComplexProcessingRuleCombinator2

The problem with that processing rules is that in the first three post i used expression trees to generate the rules but i had the seperate parts of the rules and so i needed no parser. Now we have to develop a parser that parses the expression and an evaluator that checks the overtaken object against the parsed expression. In this post i will show how to use the expression evaluator i posted at codeplex (https://simpleexpeval.codeplex.com/) to validate a given object against the defined rule. In the additional blog post implementing an expression evaluator in c# i describes how the expression evaluator works. But in this post i will describe how to use it to define the rules you need to check your object again. The expression evaluator support the following operators:

  • open bracket ‘(‘ and close bracket ‘)’
  • addition ‘+’
  • subtraction ‘-‘
  • multiblication ‘*’
  • division ‘/’
  • modulo ‘%’
  • equality ‘=’
  • unequality ‘!=’
  • bigger than ‘>’
  • bigger than or equal ‘>=’
  • smaller than ‘<‘
  • smaller than or equal ‘<=’
  • logical and ‘&&’
  • logical or ‘||’

and the following data types:

  • integer ’10’
  • double ‘10.2’
  • string ‘”test”‘
  • boolean ‘true’ or ‘false’

This is a example how to use the expression evaluator with a defined person that is evaluated against a rule. That is the definition of the person class:

public class Person
{
    public string Name { get; set; }
    public int Age { get; set; }
    public int Children { get; set; }
    public bool Married { get; set; }
}

Now the defined rule text (” (Children = 2 && Married = true) || Age > 36 “) is evaluated against the defined person object.

Person person = new Person() { Name = "Mathias", Age = 36, Children = 2, Married = true };
RuleLoader ruleLoader = new RuleLoader();
Rule processingRule = ruleLoader.LoadProcessingRule(5);
//processingRule.ProcessingRule = " (Children = 2 && Married = true) || Age > 36 ";
Evaluator evaluator = new Evaluator();
var evaluatorResult = evaluator.Evaluate<Person>(ruleText, person);

The evaluatorResult is true because the person object has two children and is married. Here is a second example that shows that it is possible to use properties at any possition in the rule text.

Person person = new Person() { Name = "Mathias", Age = 36, Children = 2, Married = true };
RuleLoader ruleLoader = new RuleLoader();
Rule processingRule = ruleLoader.LoadProcessingRule(5);
//processingRule.ProcessingRule = " Name != 'test' && Children <= Age / 20 "; 
Evaluator evaluator = new Evaluator(); 
var evaluatorResult = evaluator.Evaluate<Person>(ruleText, person);

The evaluatorResult is also true because Age / 20 is smaller than Children and the name is unequal ‘test’. If you want to use the expression evaluation engine you can download it from codeplex (https://simpleexpeval.codeplex.com/) and add a reference in your visual studio project. Then you can use the Evaluator class to evaluate the defined rule text against an overtaken object. In the blog post implementing an expression evaluator in c# i describe the implementation of the expression evaluator and in the post building a rule engine in c# (part 5: bringing it all together) i try to give an overview of the usage of the ruleengine.codeplex.com project.


Implementing a regular expression to state machine parser. (part 1: introduction)

The idea behind this post is the generation of regular expressions. After writing a regular regular expression i read a book about data structures. In this book the author writes about implementing a regular expression parser with the help of a state maschine. The basic idea is to build a state maschine out of the regex expression, and then let the regex text flow through that state machine. If the text passes the state machine it corresponds with the regex expression, if not it throws an error. That principle can be transformed to a simple custom language. So in this post i first will show how to implement a subset of the regular expressions. In a second post i will describe the implementation of the RegexToStateMachine project. Then i will show how to transform a custom language that has control commands to send and receive data to a state machine. This control commands will be parsed and checked with the generated state maschine.

Now i will start with theoretical description of the process of generating a state machine out of a subset of the  regular expressions. This are the definied commands:

  1. c – Matches any literal character c
  2. . – Matches any single character
  3. * – Matches one or more occurrences of the previous character or characters
  4. () – Brackets to group the characters
  5. | – Logical or

First we have to think about how we could build a dynamic state machine out of a given set of characters. That is very simple. We parse each character and build a State object for it and we build a state transition from the current state to the next state. If we do so we get a state maschine where we have only one state transition from one state to another state.

For example if we have the regular expression:

aabba

we have the state maschine:

aabba

Now we have the text “aabba”, that text would flow through that state maschine and would throw now error. Every other text would fail and throw an error. But what is with regular expressions that contains repeating. For example the following expression.

a* (a can be repeated n times)

If we want a state machine that works for such a regular expression we need a different state transition. The following graphic shows how that state transition has to look.

singe-charcter closure

The state * has a jump back to the previous character. If we have the text “aaaaa” it would work with that state machine. But what is with the following regular expression.

(abc)*

For that expression the state machine has to look like the one on that following graphic. So we need to include a jump back to the last open bracket. .

closue expression

If we have the text “abcabcabc” it would work with that state machine. Now we have to check how the | operator that stands for logical or works. If we have the following regular expression.

(ab|cd)

We need the following state machine with the following transitions. So the state machine has to jump to the next character behind the pipe operator. .

or expression

If we have a program that can dynamical build a state machine for all that operations we have to ensure that we can combine all that operations so that the following regular expression would also work.

(ab|ef)*

The following graphic shows the state machine that combines the or (|) operation with the repeat (*) operation.

repeat and or
As we see there are many texts that would fit that regular espression. For example “abab” or “efefef”. Now we have definied the operations for the regular expression to state machine parser. So with this transformations we now know how the state machine has to look for the defined commands. What we now need is a program that converts the expression in the corresponding state machine and checks if a given text flows through the generated state machine.

The description of a possible implementation is shown in the second part of that post series. The source code is available at RegexToStateMachine.


Using JsonToStaticTypeConverter (part 1: introduction)

In this post i will write about the JsonToStaticTypeConverter project and how to use it. I created that project after implementing a Json parser. After finishing the parser i thought about where i could use this. So the result is an application that can generate c# classes out of a Json string. So with the  JsonToStaticTypeConverter you can generate c# classes and fill them with your Json data.

If we press generate a class Addresse im namespace MainTest in the file Addresse.cs in the folder C:\Projecte is generated out of the Json code (in this case { ‘street’ : ‘Teststreet’, ‘number’ : 1, ‘city’ : ‘Graz’ }. Here we see the generated c# file:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using JsonConverter;
namespace MainTest
{
public class Addresse : JsonStaticObject
{
public string street { get { return JsonSource.Find<string>("Addresse", "street"); } }
public double number { get { return JsonSource.Find<double>("Addresse", "number"); } }
public string city { get { return JsonSource.Find<string>("Addresse", "city"); } }
}
}

This c# file is inserted into the solution that wants to process the Json data. Here we see how the JsonConverter is used to generate the static structure of Addresse out of the Json data. We also need to reference the JsonConverter.dll from the JsonToStaticTypeConverter Project. Therefore we use the Deserialize method and overtake the Json string.

var jsonConverter = new JsonConverter<MainTest.Addresse>();
MainTest.Addresse address = jsonConverter.Deserialize(
    "{ 'street' : 'Gürtel', 'number' : 23, 'city' : 'Vienna' }");
var street = address.street;
var number = address.number;
var city = address.city;
Trace.WriteLine(street);
Trace.WriteLine(number);
Trace.WriteLine(city);

Here we see the trace result:

So with the JsonConverter it is possible to use the Json data due to a static structure. In the second post of this series i will write about the implementation of JsonToStaticTypeConverter and about the implementation of the Json ParserCombinator.


Building a rule engine in c#

In my current project in work i needed a simple rule engine that processes rules that are saved in the database. In the Database i have the following table:

The column Property stores the name of the property of the overtaken class. The column Operator stores the logical operator which is used to evaluate the rule and the column Value stores the constant to evaluate against. To try this rules we have the Person class.

    public class Person
    {
        public string Name { get; set; }
        public int Age { get; set; }
        public int Children { get; set; }
    }

This class has the properties Name, Age and Children. The rule class stores the rules that are defiened in the database table.

    public class Rule
    {
        private bool propertySet = false;
        public string PropertyName { get; set; }
        public Operator Operator_ { get; set; }
        public object Value { get; set; }

        public Rule(Operator operator_, object value)
        {
            this.Operator_ = operator_;
            this.Value = value;
        }

        public Rule(string propertyName, Operator operator_, object value)
        {
            this.Operator_ = operator_;
            this.Value = value;
            this.PropertyName = propertyName;
            if(!string.IsNullOrEmpty(propertyName))
                this.propertySet = true;
        }
    }

The RuleLoader class loads the rules from the database and stores it in Rule objects. This is the code to load the rules from the database.

            RuleLoader ruleLoader = new RuleLoader();
            Rule firstRule = ruleLoader.Load(1);
            Rule secondRule = ruleLoader.Load(2);
            Rule thirdRule = ruleLoader.Load(3);

Now we have the rules and what we need is a Person to apply the rules to.

Person person = new Person() { Name = "Mathias", Age = 35, Children = 2 };

Now we will define the rule engine that evaluates the values of the property of the Person objects against the definied rules from the database. To do so we need some reflection and some objects from the System.Linq.Expressions namespace. This is the RuleEngine class that generates a Func that we use to evaluate the overtaken rule.

    public class RuleEngine
    {
        public Func<T, bool> CompileRule<T>(Rule rule)
        {
            if (string.IsNullOrEmpty(rule.PropertyName))
            {
                ExpressionBuilder expressionBuilder = new ExpressionBuilder();
                var param = Expression.Parameter(typeof(T));
                Expression expression = 
                    expressionBuilder.BuildExpression<T>(rule.Operator_, rule.Value, param);
                Func<T, bool> func = 
                    Expression.Lambda<Func<T, bool>>(expression, param).Compile();
                return func;
            }
            else
            {
                ExpressionBuilder expressionBuilder = new ExpressionBuilder();
                var param = Expression.Parameter(typeof(T));
                Expression expression = 
                    expressionBuilder.BuildExpression<T>(
                    rule.PropertyName, rule.Operator_, rule.Value, param);
                Func<T, bool> func = 
                    Expression.Lambda<Func<T, bool>>(expression, param).Compile();
                return func;
            }
        }
    }

The RuleEngine class uses the ExpressionBuilder class to build the Expression and compiles it to a lambda expression. This lambda expression does the actual work when a rule is evaluated.

    public class ExpressionBuilder
    {
        public Expression BuildExpression<T>(
            Operator ruleOperator, object value, ParameterExpression parameterExpression)
        {
            ExpressionType expressionType = new ExpressionType();
            var leftOperand = parameterExpression;
            var rightOperand = 
                Expression.Constant(Convert.ChangeType(value, typeof(T)));
            var expressionTypeValue = 
                (ExpressionType)expressionType.GetType().GetField(
                Enum.GetName(typeof(Operator), ruleOperator)).GetValue(ruleOperator);
            var binaryExpression = 
                Expression.MakeBinary(expressionTypeValue, leftOperand, rightOperand);
            return binaryExpression;
        }

        public Expression BuildExpression<T>(
            string propertyName, Operator ruleOperator, object value, 
            ParameterExpression parameterExpression)
        {
            ExpressionType expressionType = new ExpressionType();
            var leftOperand = MemberExpression.Property(parameterExpression, propertyName);
            var rightOperand = Expression.Constant(Convert.ChangeType(value, value.GetType()));
            FieldInfo fieldInfo = 
                expressionType.GetType().GetField(Enum.GetName(typeof(Operator), ruleOperator));
            var expressionTypeValue = (ExpressionType)fieldInfo.GetValue(ruleOperator);
            var binaryExpression = 
                Expression.MakeBinary(expressionTypeValue, leftOperand, rightOperand);
            return binaryExpression;
        }
    }

Here we see the usage of the RuleEngine class. We call the CompileResult method and get back a Func that evaluates us the overtaken parameter (in this case of type Person).

     RuleEngine ruleEngine = new RuleEngine();
     var firstRuleFunc = ruleEngine.CompileRule<Person>(firstRule);
     var secondRuleFunc = ruleEngine.CompileRule<Person>(secondRule);
     var thirdRuleFunc = ruleEngine.CompileRule<Person>(thirdRule);
     var result = firstRuleFunc(person) &&
         secondRuleFunc(person) && thirdRuleFunc(person);

So we know that the Person object

( Person person = new Person() { Name = "Mathias", Age = 35, Children = 2 }; )

passes all the defiend rules from the database.

In the second part of this post i will extend the rule engine to handle collections:
Building a rule engine in c# (part 2: extending the rule engine to handle collections) In this third part of the post series called Building a rule engine in c# (part 3: extending the rule engine to handle aggregations of collections) i extend the rule engine to handle aggregation for collections. In the fourth part of this blog series i write about implementing an expression evaluator to define more complex expressions Building a rule engine in c# (part 4: extending the rule engine to evaluate defined expressions) and in the fifth part building a rule engine in c# (part 5: bringing it all together) i give an overview about the usage of the ruleengine.codeplex.com project i created to bring the code parts in that post series together.