Implementing a synchronized priority queue in C#

In my last project I used a queue to schedule download and upload tasks for documents. After most of the implementation was done, I discovered that some of the upload and download tasks should have a higher priority than others. For my implementation I had used a synchronized wrapper around the .NET queue, but with a plain queue it is not possible to work with items that have different priorities. The following picture illustrates the use of a queue.
Queue
So at that point I was looking for a priority queue. The problem is that the .NET Framework does not provide such a data structure, so I had to implement it myself. The following picture shows how a priority queue works.
PriorityQueue
To implement a priority queue you need an underlying data structure that reorganizes the enqueued items so that the item with the highest priority is dequeued first. One possible data structure is a list that you reorder: if an item with a higher priority is enqueued, you move it towards the front of the list. That works fine with hundreds of items, but with several thousand items the list-based solution gets slower and slower. Then you need a priority queue that is implemented with a tree structure. You could use a balanced search tree like an AVL tree, but most data structure books recommend a binary heap. So I decided to implement my priority queue with two modes: in one mode it internally uses a list, in the second mode it uses a binary heap. If you want to look at the code or use the priority queue, I added it to CodePlex (http://priorityQueue.codeplex.com).
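To illustrate the list mode, here is a minimal sketch (my own simplified version for this post, not the code from the CodePlex project) that keeps a linked list sorted descending by priority, so that Dequeue can always take the head:

```csharp
using System;
using System.Collections.Generic;

// Simplified sketch of a list-based priority queue: Enqueue keeps the
// linked list sorted descending by priority, Dequeue removes the head.
public class ListPriorityQueue<TPriority, TValue>
    where TPriority : IComparable<TPriority>
{
    private readonly LinkedList<KeyValuePair<TPriority, TValue>> items =
        new LinkedList<KeyValuePair<TPriority, TValue>>();

    public int Count { get { return items.Count; } }

    public void Enqueue(TPriority priority, TValue value)
    {
        var pair = new KeyValuePair<TPriority, TValue>(priority, value);
        // Walk from the front past all items with priority >= the new one,
        // so items with equal priority keep their insertion order (FIFO).
        var node = items.First;
        while (node != null && node.Value.Key.CompareTo(priority) >= 0)
            node = node.Next;
        if (node == null)
            items.AddLast(pair);
        else
            items.AddBefore(node, pair);
    }

    public TValue Dequeue()
    {
        if (items.First == null)
            throw new InvalidOperationException("The queue is empty.");
        var head = items.First.Value; // highest priority sits at the head
        items.RemoveFirst();
        return head.Value;
    }
}
```

Enqueue walks the list and is therefore O(n), which is exactly why the list mode degrades with thousands of items while a binary heap stays at O(log n) per operation.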
Here is the UML class diagram and two tests that show the use of the priority queue in linked list and in binary heap mode.
PriorityQueue

[TestClass]
public class PriorityQueueTests
{
    [TestMethod]
    public void PriorityQueueListModeTest()
    {
        PriorityQueue<int, string> priorityQueue =
            new PriorityQueue<int, string>(PriorityQueueMode.LinkedList);
        priorityQueue.Enqueue(1, "A");
        priorityQueue.Enqueue(2, "B");
        priorityQueue.Enqueue(3, "C");
        priorityQueue.Enqueue(4, "D");
        priorityQueue.Enqueue(5, "E");
        priorityQueue.Enqueue(6, "F");
        var count = priorityQueue.Count;
        var result = priorityQueue.Dequeue();
        Assert.AreEqual(result, "F");
        result = priorityQueue.Dequeue();
        Assert.AreEqual(result, "E");
        result = priorityQueue.Dequeue();
        Assert.AreEqual(result, "D");
        result = priorityQueue.Dequeue();
        Assert.AreEqual(result, "C");
        result = priorityQueue.Dequeue();
        Assert.AreEqual(result, "B");
        result = priorityQueue.Dequeue();
        Assert.AreEqual(result, "A");
    }

    [TestMethod]
    public void PriorityQueueBinaryHeapModeTest()
    {
        PriorityQueue<int, string> priorityQueue =
            new PriorityQueue<int, string>(PriorityQueueMode.BinaryHeap);
        priorityQueue.Enqueue(1, "A");
        priorityQueue.Enqueue(2, "B");
        priorityQueue.Enqueue(3, "C");
        priorityQueue.Enqueue(4, "D");
        priorityQueue.Enqueue(5, "E");
        priorityQueue.Enqueue(6, "F");
        var count = priorityQueue.Count;
        var result = priorityQueue.Dequeue();
        Assert.AreEqual(result, "F");
        result = priorityQueue.Dequeue();
        Assert.AreEqual(result, "E");
        result = priorityQueue.Dequeue();
        Assert.AreEqual(result, "D");
        result = priorityQueue.Dequeue();
        Assert.AreEqual(result, "C");
        result = priorityQueue.Dequeue();
        Assert.AreEqual(result, "B");
        result = priorityQueue.Dequeue();
        Assert.AreEqual(result, "A");
    }
}

The two tests show that items with a higher priority are ranked at the top of the priority queue and items with a lower priority are ranked at the end.
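Since the original motivation was a synchronized queue, here is a hedged sketch of how thread safety could be added: a single lock around a priority queue built from sorted FIFO buckets. The class and its names are illustrative only, not the CodePlex API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative sketch: a thread-safe priority queue that guards a
// SortedDictionary of FIFO buckets with one lock. The comparer sorts
// keys descending so the first bucket holds the highest priority.
public class SynchronizedPriorityQueue<TValue>
{
    private readonly object sync = new object();
    private readonly SortedDictionary<int, Queue<TValue>> buckets =
        new SortedDictionary<int, Queue<TValue>>(
            Comparer<int>.Create((a, b) => b.CompareTo(a)));

    public void Enqueue(int priority, TValue value)
    {
        lock (sync)
        {
            Queue<TValue> bucket;
            if (!buckets.TryGetValue(priority, out bucket))
            {
                bucket = new Queue<TValue>();
                buckets.Add(priority, bucket);
            }
            bucket.Enqueue(value);
        }
    }

    public bool TryDequeue(out TValue value)
    {
        lock (sync)
        {
            if (buckets.Count == 0)
            {
                value = default(TValue);
                return false;
            }
            var top = buckets.First(); // bucket with the highest priority
            value = top.Value.Dequeue();
            if (top.Value.Count == 0)
                buckets.Remove(top.Key);
            return true;
        }
    }
}
```

A TryDequeue that returns false instead of throwing is the usual pattern for concurrent collections, because between a Count check and a Dequeue call another thread could have emptied the queue.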


Building a simple spam filter with a naive Bayes classifier and C#

Have you ever wondered how a spam filter works? I have, and last month I got my hands on a book about machine learning. It has a chapter about the naive Bayes classifier and explains that naive Bayes classifiers are used to implement spam filtering. All programming examples in that book are in Python, and I do not speak Python, so I worked through the naive Bayes classifier myself, wrote a C# program and put it on CodePlex (naivebayes.codeplex.com). The application is only meant to demonstrate the principle of spam filtering; it is not for practical use. The best example to understand how naive Bayes classifiers work is the cookie problem. Let's assume we have two bowls with cookies in them.

Bowl 1: 5 chocolate cookies and 35 vanilla cookies

Bowl 2: 20 chocolate cookies and 20 vanilla cookies

Now we take one cookie (it is a vanilla cookie), with a 50 percent chance of taking it from bowl 1 or from bowl 2. Now we have two hypotheses:

1. It is from bowl 1

H1 = 1/2

2. It is from bowl 2

H2 = 1/2

That means we have a 50/50 chance of picking from bowl 1 or from bowl 2:

P(h1) = P(h2) = 1/2

After defining our hypotheses and taking one vanilla cookie, we have one piece of evidence, and we can calculate the probability that we took the cookie from bowl 1 or from bowl 2. So now we can evaluate each hypothesis under the evidence that we took a vanilla cookie:

P(E/h1) = 35/40 (bowl 1 contains 35 vanilla cookies out of 40)
P(E/h2) = 20/40 = 1/2 (bowl 2 contains 20 vanilla cookies out of 40)
At this point we need Bayes' theorem:

P(h1/E) = P(h1) * P(E/h1) / P(E)

Bayes' theorem says that you can calculate the probability of hypothesis one under the evidence, P(h1/E), by multiplying the prior P(h1) with the probability of the evidence under the hypothesis, P(E/h1), and dividing by the overall probability of the evidence P(E). P(h1/E) is the value that tells us how big the chance is that hypothesis one is correct. The only value still missing is P(E), and in this case it is very easy to find: it is the total number of vanilla cookies divided by the total number of cookies.

P(E) = sum(vanilla cookies) / sum(all cookies) = 55 / 80

So now we have all the values we need for Bayes' theorem:

P(h1/E) = P(h1) * P(E/h1) / P(E)
P(h1/E) = (1/2 * 35/40) / (55/80) = 0.5 * 0.875 / 0.6875 = 0.6363... = 63.6 %

That means after taking one vanilla cookie we can say that the chance that we took it from bowl 1 is 63.6 percent.
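The cookie numbers above can be checked in a few lines of C# (this is just the arithmetic from the text, nothing more):

```csharp
using System;

// Cookie problem: posterior probability of bowl 1 after a vanilla cookie.
double pH1 = 0.5;               // prior for bowl 1
double pH2 = 0.5;               // prior for bowl 2
double pEgivenH1 = 35.0 / 40.0; // vanilla cookies in bowl 1
double pEgivenH2 = 20.0 / 40.0; // vanilla cookies in bowl 2
double pE = 55.0 / 80.0;        // all vanilla cookies / all cookies

double pH1givenE = pH1 * pEgivenH1 / pE;
Console.WriteLine(pH1givenE);   // 0.63636..., i.e. 63.6 %
```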

So that's it with the cookie problem, but how does that help us implement a spam filter? With the same formula, Bayes' theorem, we can calculate the probability that a mail we receive is spam or not spam. But instead of cookies we use mails that we have already classified as spam or not spam. For every word in a new mail we calculate a value Q. If Q is greater than 1 the word more likely comes from a spam mail, if Q is smaller than 1 the word more likely comes from a not spam mail. So the rule is:

Q > 1 = spam
Q < 1 = not spam

Now we calculate Q for every word w. Our hypotheses h1 and h2 are:

H1 = HSpam = 1/2
H2 = HNotSpam = 1/2

So again we have a 50/50 chance:

P(h1) = P(h2) = 1/2

Now we need some evidence data; in our case we have 10 spam mails and 10 not spam mails. Those mails give us the evidence we need to decide whether a word comes from a spam mail or not. Now we take one word from the mail we want to check, let's say the word 'mother'. We look in the spam mails and find it one time. Then we look in the not spam mails and find it five times. With that information we can calculate P(E/h1) and P(E/h2):

P(E/h1) = 1 / 10 (one mail out of ten)
P(E/h2) = 5 / 10 (five mails out of ten)

Now we use Bayes' theorem to calculate P(h1/E) and P(h2/E):

P(h1/E) = P(h1) * P(E/h1) / P(E)
P(h2/E) = P(h2) * P(E/h2) / P(E)

Normally we would have to figure out what P(E) is, but here P(E) refers only to the one actual mail, so it is the same in both calculations and cancels out when we later build the ratio. Here is the calculation (with P(E) set to 1):

P(h1/E) = 1/2 * 1/10 / 1 = 1/20
P(h2/E) = 1/2 * 5/10 / 1 = 5/20

So we have two values that tell us how big the chance is that the mail is a spam mail or not (based on the spam and not spam mails we already have). Now we calculate Q and check whether it is greater or smaller than 1:

Q = P(h1/E) / P(h2/E) = 1/20 / 5/20 = 1/5 = 0.2

Q is smaller than one; that means the word 'mother' occurs more often in our not spam mails. To check the whole text of a mail we only have to calculate Q for every word, sum all Q values up and divide the sum by the number of words in the email:

Q overall = sum(calculate Q for every word w) / count(every word w)
Q overall > 1 = spam
Q overall < 1 = not spam
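The 'mother' numbers from above can be reproduced in a couple of lines; note how P(E) cancels when building the ratio:

```csharp
using System;

// Q for the word 'mother': found in 1 of 10 spam mails
// and in 5 of 10 not spam mails.
double pH1 = 0.5, pH2 = 0.5;
double pEgivenH1 = 1.0 / 10.0;
double pEgivenH2 = 5.0 / 10.0;

// P(E) appears in both P(h1/E) and P(h2/E), so it cancels in the ratio.
double q = (pH1 * pEgivenH1) / (pH2 * pEgivenH2);
Console.WriteLine(q); // 0.2 -> smaller than 1, so 'mother' points to not spam
```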

Now all we have to do is implement a program that takes the emails and checks a given email for spam. To verify that the program does what it should, I implemented two tests that check phrases. The NaiveBayersSpamTrueTest method uses phrases from spam mails and the NaiveBayersNotSpamFalseTest method uses phrases from not spam mails.
This is the spam test.

[TestMethod]
public void NaiveBayersSpamTrueTest()
{
    NaiveBayes.NaiveBayes naiveBayes = new NaiveBayes.NaiveBayes();
    var result = naiveBayes.CheckEmail("Buy Cheap cialis Online");
    Assert.AreEqual(true, result);
    result = naiveBayes.CheckEmail("Enlarge Your Penis");
    Assert.AreEqual(true, result);
    result = naiveBayes.CheckEmail("accept VISA, MasterCard");
    Assert.AreEqual(true, result);
}

This is the not spam test.

[TestMethod]
public void NaiveBayersNotSpamFalseTest()
{
    NaiveBayes.NaiveBayes naiveBayes = new NaiveBayes.NaiveBayes();
    var result = naiveBayes.CheckEmail("Sincerely Mathias");
    Assert.AreEqual(false, result);
    result = naiveBayes.CheckEmail("Dear Graduates");
    Assert.AreEqual(false, result);
    result = naiveBayes.CheckEmail("Thanks in advance for your support");
    Assert.AreEqual(false, result);
    result = naiveBayes.CheckEmail("for it with my Mastercard");
    Assert.AreEqual(false, result);
}

Now that we have the tests, we start with the part that reads the evidence data. I took 10 normal mails and 10 spam mails from my inbox and my spam folder and made 10 text files out of each. Then I implemented a method that reads all the data from the text files. (The text files with the text from the spam and not spam mails are part of the naivebayes.codeplex.com project.)

public class EmailReader
{
    public List<string> Read(string pathToFiles)
    {
        List<string> list = new List<string>();
        DirectoryInfo currentdirectoryInfo = 
            new DirectoryInfo(Environment.CurrentDirectory);
        DirectoryInfo parent = 
            currentdirectoryInfo.Parent.Parent.Parent;
        DirectoryInfo directoryInfo = 
            new DirectoryInfo(parent.FullName + @"\NaiveBayes" + pathToFiles);
        if (directoryInfo.Exists)
        {
            foreach (var file in directoryInfo.EnumerateFiles())
            {
                var text = String.Empty;
                using(FileStream filestream = file.OpenRead())
                {
                    System.IO.StreamReader streamReader = 
                        new System.IO.StreamReader(filestream);
                    text = streamReader.ReadToEnd();
                }
                list.Add(text);
            }
        }
        return list;
    }
}

The Read method is called two times, once for the spam mails and once for the not spam mails. After reading the text from the files I need a class that parses all words from the text into a dictionary and counts how often each word occurs. That parser tells me how often a word occurs in all my spam mails and in all my not spam mails. For simplicity the parser takes the text of every mail and splits it into words using string.Split(' '), which means a new word starts every time a blank occurs.

public class Parser
{
    private Dictionary<string, int> wordCounter = new Dictionary<string,int>();

    public Dictionary<string, int> WordCounter
    {
        get { return wordCounter; }
        set { wordCounter = value; }
    }

    public void Parse(string textToParse)
    {
        var stringArray = textToParse.Split(' ');
        foreach (var str in stringArray)
        {
            if (wordCounter.ContainsKey(str.ToLower()))
            {
                wordCounter[str.ToLower()] += 1;
            }
            else
            {
                wordCounter.Add(str.ToLower(), 1);
            }
        }
    }
}

Now all the groundwork is done. What we need is the method that calculates Q for every word and decides, from the Q values of all words of our email, whether it is spam or not spam. In the constructor of the NaiveBayes class the mails are read and saved in the spamMails and notSpamMails variables. The CheckEmail method is the starting point of the calculation. First it parses the spamMails and notSpamMails strings and fills two dictionaries with the words of the spam and the not spam mails. Then the CheckIfSpam method is called; it splits the mail text into words and calls the CalculateQ method for every word. Then it sums up all Q values and divides the sum by the count of all tested words. If the calculated value is bigger than one it returns true (spam), otherwise it returns false (not spam). In the CalculateQ method we calculate Q with Bayes' theorem.

P(h1/E) = P(h1) * P(E/h1) / P(E)

We calculate Ph1e by dividing wordCountSpam by countSpamMails, and Ph2e by dividing wordCountNotSpam by countNotSpamMails. Then we divide Ph1e by Ph2e and the result is q.

public class NaiveBayes
{
    private List<string> spamMails;
    private List<string> notSpamMails;
    private int countSpamMails;
    private int countNotSpamMails;

    public NaiveBayes()
    {
        EmailReader emailReader = new EmailReader();
        spamMails = emailReader.Read(@"\Mails\Spam\");
        notSpamMails = emailReader.Read(@"\Mails\NotSpam\");
        countSpamMails = spamMails.Count();
        countNotSpamMails = notSpamMails.Count();
    }

    public bool CheckEmail(string text)
    {            
        Parser parser = new Parser();
        foreach(var spamMail in spamMails)
        {
            parser.Parse(spamMail);
        }
        var spamWords = parser.WordCounter;
        parser = new Parser();
        foreach (var notSpamMail in notSpamMails)
        {
            parser.Parse(notSpamMail);
        }
        var notSpamWords = parser.WordCounter;
        return CheckIfSpam(text, countSpamMails,
            spamWords, countNotSpamMails, notSpamWords);
    }

    private bool CheckIfSpam(string text, 
        int countSpamMails, Dictionary<string,int> spamWordList,
        int countNotSpamMails, Dictionary<string,int> notSpamWordList)
    {
        var stringArray = text.Split(' ');
        var sumQ = 0.0;
        var wordCounter = 0;
        foreach (var item in stringArray)
        {
            var q = CalculateQ(item.ToLower(), 
                countSpamMails, spamWordList, 
                countNotSpamMails, notSpamWordList);
            sumQ += q;
            wordCounter++;
        }
        if (sumQ / wordCounter > 1)
        {
            return true;
        }
        else
        {
            return false;
        }
    }

    private double CalculateQ(string word, 
        int countSpamMails, Dictionary<string,int> spamWordList,
        int countNotSpamMails, Dictionary<string,int> notSpamWordList)
    {
        double wordCountSpam = 1;
        if (spamWordList.ContainsKey(word))
        {
            wordCountSpam = spamWordList[word];
        }
        double Ph1e = 0.5 * wordCountSpam / countSpamMails;
        double wordCountNotSpam = 1;
        if (notSpamWordList.ContainsKey(word))
        {
            wordCountNotSpam = notSpamWordList[word];
        }
        double Ph2e = 0.5 * wordCountNotSpam / countNotSpamMails;
        double q = Ph1e / Ph2e;
        return q;
    }
}

So that's it. If we now give the CheckEmail method a phrase (e.g. "Thanks in advance for your support"), it recognizes that this is not from a spam mail. The complete source code including the text from the spam and not spam mails is in the naivebayes.codeplex.com project. If you need more information about naive Bayes classification, I recommend the open book Think Stats: Probability and Statistics for Programmers. I have to say that I really enjoyed implementing the naive Bayes classifier, because I always wanted to know how a spam filter works, and it is surprising how well it works.


Little or big endian in C#

If you want to check the endianness of your operating system in C# you have two possibilities, a safe and an unsafe way. But first of all I want to explain what endianness is and what big and little endian mean. An Int32 value in C# is saved in 4 bytes. The question is which part of the Int32 is saved in which byte at which address. If the system uses little endian (like Windows does), the least significant byte is saved at the lowest address. That means if you have the hex value 0A0B0C0D and the Int32 starts at address 00000001, the bytes are saved in the following order:

0D at address 00000001

0C at address 00000002

0B at address 00000003

0A at address 00000004

The following image shows the insertion order:

LittleEndian

If the system uses big endian order, the least significant byte is saved at the highest address. That means for the Int32 value 0A0B0C0D at address 00000001 the bytes are inserted the following way:

0A at address 00000001

0B at address 00000002

0C at address 00000003

0D at address 00000004

The following image shows the insertion order:

BigEndian
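You can see the memory layout directly with BitConverter.GetBytes, which returns the bytes of a value in memory order:

```csharp
using System;

int value = 0x0A0B0C0D;
byte[] bytes = BitConverter.GetBytes(value); // bytes in memory order

Console.WriteLine(BitConverter.ToString(bytes));
// On a little endian system (e.g. Windows on x86/x64) this prints:
// 0D-0C-0B-0A
```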

Now if we want to check whether the system uses big or little endian in C#, there is a very easy way: the BitConverter class has an IsLittleEndian property.

if(BitConverter.IsLittleEndian)
   return "Little Endian";

If you want to implement the check yourself the simplest method is the following one:

public enum BigLittleEndianEnum
{
    Big,
    Little
}
public BigLittleEndianEnum TestSystemIsBigOrLittleEndianSafe()
{
    Int32 intValue = 0x0001;
    byte[] bytes = BitConverter.GetBytes(intValue);
    var safeBigLittleEndian = (bytes[0] == 1) ? 
        BigLittleEndianEnum.Little : BigLittleEndianEnum.Big;
    return safeBigLittleEndian;
}

What we do is take the Int32 intValue and assign 1 to it. Then we use the BitConverter class to generate a byte array out of that Int32 value. If the zero byte of the array (bytes[0]) is one, then we know that the system uses little endian, because the 1 is at the lowest address.
If you search for how to check little or big endian on Google you often find the C/C++ solution. In C# you have to use the unsafe keyword, because you need pointer syntax to implement the C/C++ solution.

public BigLittleEndianEnum TestSystemIsBigOrLittleEndianUnsafe()
{
    Int32 intValue = 0x0001;
    unsafe
    {        
        Int32* intValueAddress = &intValue;                
        byte* byteValueAddress = (byte*) intValueAddress;
        byte byteValue = *byteValueAddress;
        var unsafeBigLittleEndian = (byteValue == 1) ? 
            BigLittleEndianEnum.Little : BigLittleEndianEnum.Big;        
        return unsafeBigLittleEndian;                        
    }            
}

Here we take a pointer to the Int32 intValue and cast it to a byte pointer. That byte pointer points to the lowest address of the Int32 value, and we can check whether the value it points to is zero or one; if it is one, it is little endian. A real C/C++ programmer would write the whole method in two lines:

public unsafe BigLittleEndianEnum TestSystemIsBigOrLittleEndianUnsafe()
{
    Int32 intValue = 0x0001;
    return *((byte*) &intValue) == 1 ? 
        BigLittleEndianEnum.Little : BigLittleEndianEnum.Big;
}

For a more detailed explanation please look at the Wikipedia article about endianness.
 


Calculating string difference values with the Levenshtein fuzzy string algorithm (part 2: implementation in T-SQL).

In this second post I will write about the Levenshtein algorithm and how to implement it in T-SQL (in part 1 I explained the implementation in C#). With this algorithm it is possible to compare two strings and calculate a difference value for them. Let's assume that we have a table in a database with words in it.

Now if the user enters a word, we want to suggest the words from the Words table that are closest to it. So we use the Levenshtein algorithm to calculate the Levenshtein distance between the entered word and all words in the table. Then we take the words with the minimal Levenshtein distance and offer them to the user to choose from. For example, if the user enters 'lay', the words with the minimal Levenshtein distance are 'lady' and 'lacy', so these are the words we would suggest. To accomplish that we need a function that takes the source and the target string and calculates the Levenshtein distance.

CREATE FUNCTION dbo.CalculateLevenshteinValue(@source nvarchar(max), @target nvarchar(max))
RETURNS int
BEGIN  
     declare @sourceLength int, @targetLength int

     set @sourceLength = len(@source);
     set @targetLength = len(@target);

     declare @array table ( i int, j int, value int);
     declare @i int, @j int, @result int

     set @i = 0;
     set @j = 0;

     WHILE @i <= @sourceLength
     BEGIN
	     insert @array( i, j, [value])
	     values( @i, 0, @i) 
  	     SET @i = @i + 1;
     END
     WHILE @j <= @targetLength
     BEGIN
	     insert @array ( i, j, [value])
	     values ( 0, @j, @j) 
  	     SET @j = @j + 1;
     END

     set @i = 1;
     set @j = 1;

     declare @equal int, @minInsert int, @minDelete int
     declare @minInterchange int, @equalValue int, @insertValue int
     declare @sourceChar nvarchar(1), @targetChar nvarchar(1)

     WHILE @i <= @sourceLength
     BEGIN
	     WHILE @j <= @targetLength
	     BEGIN
		     SET @sourceChar = substring(@source, @i, 1);		
		     SET @targetChar = substring(@target, @j, 1);
		     SET @equal = 2;
		     if(@sourceChar = @targetChar)
		     begin
			     SET @equal = 0;	
		     end
		     select @minInsert = value + 1 
                     from @array a where a.i = @i - 1 and a.j = @j
		     select @minDelete = value + 1 
                     from @array a 
                     where a.i = @i and a.j = @j - 1
		     select @minInterchange = [value] + @equal 
                     from @array a 
                     where a.i = @i - 1 and a.j = @j - 1
		     select @insertValue = 
                     dbo.ScalarMinValue(dbo.ScalarMinValue(@minInsert, @minDelete), @minInterchange)
		     insert @array ( i, j, [value])
		     values ( @i, @j, @insertValue) 
	  	     SET @j = @j + 1;
	     END
	     SET @i = @i + 1;
	     SET @j = 1
     END
     select @result = [value] 
     from @array a 
     where a.i = @sourceLength and a.j = @targetLength
RETURN @result;
END;
CREATE FUNCTION dbo.ScalarMinValue
(@Value1 int, @Value2 int)
RETURNS int
BEGIN
	declare @return int	
	Select @return = Case When @Value1 < @Value2 Then @Value1 Else @Value2 End	
	return @return
END

The calculation is very similar to the calculation in C#, but because SQL Server does not support arrays I am using a table variable to store the calculated values. With this function that calculates the difference between two words we can write the stored procedure that takes the source string and selects the target strings with the smallest Levenshtein distance out of the Words table.

CREATE procedure LevenshteinCalculator
@source nvarchar(256)
as
	select dbo.CalculateLevenshteinValue(@source,Word) Levenshtein, Word 
	from Words wout
	where dbo.CalculateLevenshteinValue(@source,Word) = (
	select MIN(m.Levenshtein)
	from
	(select dbo.CalculateLevenshteinValue(@source,Word) Levenshtein, Word 
	from Words w) m)
go

So if we execute the LevenshteinCalculator procedure with 'lay' we get the following result.

exec LevenshteinCalculator @source='lay'

So lacy and lady are very close to lay, and we could assume that the user made a typing error and suggest the two words to him.
(part1) (part2)


Calculating string difference values with the Levenshtein fuzzy string algorithm (part 1: introduction and implementation in C#).

In this post I will write about the Levenshtein algorithm. With this algorithm it is possible to compare two strings and calculate a difference value for them. I read about it in a SQL blog, but the implementation there was much too complicated, so I decided to implement the algorithm myself in C# and in T-SQL. After that I thought about the best use of the algorithm and decided to implement a special Levenshtein subject in my SimpleRX project to check a string against a list of given strings. First of all I want to describe the Levenshtein algorithm, which belongs to the fuzzy string search algorithms. The basic idea is very simple: you take two strings and compare them character by character. If both characters are the same you add zero; if you have to exchange a character for another one to transform the source string into the target string you add two; and if you have to insert or delete a character you add one. You do that for every character in the source string, and the closer the source string is to the target string, the smaller the resulting number is. As an example we take the string "must" and compare it to "dust".

So the result is 2, because must and dust only differ in one character (exchange m for d). If we calculate the difference between "mus" and "dust",

the result is 3, because we have to exchange one character (d for m) and add one character (t). If we have the strings "mustard" and "dust",

the result is 5, because we have to exchange one character (d for m) and delete three characters (a, r and d). Those are the basic operations to calculate the Levenshtein distance between two strings. The calculation in C# works with a two-dimensional array that compares each character of the source string with each character of the target string, plus one extra row and one extra column that hold the positions of the characters in the source and target strings. So for "must" and "dust" the start array is the following.

To accomplish this I used the following two loops.

var sourceLength = source.Length + 1;
var targetLength = target.Length + 1;
var array = new int[sourceLength, targetLength];
for (int i = 0; i <= source.Length; i++)
{
     array[i, 0] = i;
}
for (int i = 0; i <= target.Length; i++)
{
     array[0, i] = i;
}

With that array we start the calculation. We use two loops to compare every character of the source string with every character of the target string. For every pair of characters we calculate the values for exchange, insert and delete and take the minimum to fill the corresponding item in the array. In this example the two loops run from 1 to the length of the source string and from 1 to the length of the target string.
Now we check whether the current characters in the source and target strings are the same. If yes we use zero, if no we use two. Then we calculate the insert value, the delete value and the interchange value out of the position values already stored in the array. For i = 1 and j = 1 we compare the 'm' of must with the 'd' of dust. Equal is 2 because d and m are not the same. MinInsert is the value at [0,1], which is 1, plus 1 for the insert, so minInsert is 2. MinDelete is the value at [1,0], which is 1, plus 1 for the delete, so minDelete is 2. MinInterchange is the value at [0,0], which is 0, plus 2 because equal is 2, so minInterchange is 2. Now we take the minimum of minInsert, minDelete and minInterchange, which means we insert 2 at position [1,1] of the array. This is the array after the comparison of the first two characters. Here we see the result of the first calculation.

Now we look at the next loop iteration. With i = 1 and j = 2 we compare the 'm' of must with the 'u' of dust. Equal is 2 because m and u are not the same. MinInsert is the value at [0,2], which is 2, plus 1 for the insert, so minInsert is 3. MinDelete is the value at [1,1], which is 2, plus 1 for the delete, so minDelete is 3. MinInterchange is the value at [0,1], which is 1, plus 2 because equal is 2, so minInterchange is 3. Now we take the minimum of minInsert, minDelete and minInterchange, which means we insert 3 at position [1,2] of the array. This is the array after that comparison.

So we continue until the whole array is filled with values.

That is the code that generates the array values.

for (int i = 1; i <= source.Length; i++)
{
    for (int j = 1; j <= target.Length; j++)
    {
        var equal = (source[i - 1] == target[j - 1])? 0 : 2;
        var minInsert = array[i - 1, j] + 1;
        var minDelete = array[i, j - 1] + 1;
        var minInterchange = array[i - 1, j - 1] + equal;
        array[i, j] = Math.Min(Math.Min(minInsert, minDelete), minInterchange);                                        
    }
}

Now we take the value at position [n,m] of the array; in our case that is array[4,4], and its value is 2. That means the difference between 'must' and 'dust' is 2. Here is the complete method that calculates the Levenshtein difference.

public int CalculateLevenshteinValue(string source, string target)
{
    var sourceLength = source.Length;
    var targetLength = target.Length;
    var array = new int[sourceLength + 1, targetLength + 1];
    for (int i = 0; i <= source.Length; i++)
    {
        array[i, 0] = i;
    }
    for (int i = 0; i <= target.Length; i++)
    {
        array[0, i] = i;
    }                       
    for (int i = 1; i <= source.Length; i++)
    {
        for (int j = 1; j <= target.Length; j++)
        {
            var equal = (source[i - 1] == target[j - 1])? 0 : 2;
            var minInsert = array[i - 1, j] + 1;
            var minDelete = array[i, j - 1] + 1;
            var minInterchange = array[i - 1, j - 1] + equal;
            array[i, j] = Math.Min(Math.Min(minInsert, minDelete), minInterchange);                                        
        }
    }
    return array[sourceLength, targetLength];
}
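To check the examples from above, the method can be exercised like this (written as a small script with the method from this post as a local function, so the snippet runs on its own):

```csharp
using System;

// The CalculateLevenshteinValue method from this post as a local function.
int CalculateLevenshteinValue(string source, string target)
{
    var array = new int[source.Length + 1, target.Length + 1];
    for (int i = 0; i <= source.Length; i++)
        array[i, 0] = i;
    for (int j = 0; j <= target.Length; j++)
        array[0, j] = j;
    for (int i = 1; i <= source.Length; i++)
    {
        for (int j = 1; j <= target.Length; j++)
        {
            var equal = (source[i - 1] == target[j - 1]) ? 0 : 2;
            var minInsert = array[i - 1, j] + 1;
            var minDelete = array[i, j - 1] + 1;
            var minInterchange = array[i - 1, j - 1] + equal;
            array[i, j] = Math.Min(Math.Min(minInsert, minDelete), minInterchange);
        }
    }
    return array[source.Length, target.Length];
}

// The three examples from the text:
Console.WriteLine(CalculateLevenshteinValue("must", "dust"));    // 2
Console.WriteLine(CalculateLevenshteinValue("mus", "dust"));     // 3
Console.WriteLine(CalculateLevenshteinValue("mustard", "dust")); // 5
```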

After finishing the implementation I wanted to use the Levenshtein calculation, so I built a LevenshteinSubject in my SimpleRX project. Here is the code that shows the usage.

[TestMethod]
public void LevenshteinSubjectTest()
{
    List<string> list = new List<string>() { "label", "length", "lamp", "lab",
        "lacy", "lady", "lager", "lair", "lake", "lam", "lamb" };
    LevenshteinSubject subject = new LevenshteinSubject(list);
    subject.Subscribe(x => 
        { 
            var y = x;
            Debug.WriteLine(x);
        },
        ex => Console.WriteLine("OnError {0}", ex.Message),
        () => Console.WriteLine("OnCompleted")
        );
    subject.OnNext("lay");
    subject.OnCompleted();
}

That is the result of the execution:

In the second part of this blog post I will show how to implement the Levenshtein algorithm in T-SQL.

(part1) (part2)


Linq Index (part 3: speeding up a Linq query)

In this post I will first show some more complex LinqIndex queries. I want to start with some unequal where queries. In this example I want to query all Customers with a number that is bigger than a specific value.

IEnumerable<Customer> customers = Customer.SampleData();
IndexDefinition<Customer> indexPart = customers.CreateIndexKey(c => c.Number).
    AddIndexKey(c => c.CustomerID);
var list = indexPart.Where(c => c.Number > 10).ToList();

When we look at this query, we see that generating a hashtable and using it as an index is useless here. For an equality where query (like c => c.Number == 10) we generate a hashtable and look up the value 10 in it. But for a non-equality query (like c => c.Number > 10) we have to look up many values, and all we have is the starting value (10). To speed up such a query we need a tree as index, and more specifically a balanced tree. There are many algorithms for building balanced trees (AVL trees, splay trees, red-black trees, …). I chose an AVL tree because I had implemented one some time ago and could reuse it for LinqIndex. The problem was that my implementation was too slow, so I had to reimplement it. It is very important that insertion into the balanced tree is as fast as possible, because I always generate a new AVL tree when an index is needed. So here is an example that shows the performance with and without LinqIndex when querying one million Customers.

[TestMethod]
public void UnequalIndexSpecification()
{
    IEnumerable<Customer> customers = Customer.SampleData();
    IndexDefinition<Customer> indexPart = customers.CreateIndexKey(c => c.Number);                            
    Stopwatch stopWatch = new Stopwatch();
    stopWatch.Start();
    var list = customers.Where(c => c.Number > 1000000).ToList();
    stopWatch.Stop();
    Debug.WriteLine("Where for unequal condition without index first time " + stopWatch.ElapsedMilliseconds);
    stopWatch.Reset();
    stopWatch.Start();
    list = customers.Where(c => c.Number > 1000000).ToList();
    stopWatch.Stop();
    Debug.WriteLine("Where for unequal condition without index second time " + stopWatch.ElapsedMilliseconds);
    stopWatch.Reset();
    stopWatch.Start();
    list = indexPart.Where(c => c.Number > 1000000).ToList();
    stopWatch.Stop();
    Debug.WriteLine("Where for unequal condition with unequal index first time " + stopWatch.ElapsedMilliseconds);
    stopWatch.Reset();
    stopWatch.Start();
    list = indexPart.Where(c => c.Number > 1000000).ToList();
    stopWatch.Stop();
    Debug.WriteLine("Where for unequal condition with unequal index second time " + stopWatch.ElapsedMilliseconds);
}

As we see, this test executes the same query (on one million Customers) two times without an index and then two times with a non-equality index. These are the resulting times:

Where for unequal condition without index first time 231
Where for unequal condition without index second time 224
Where for unequal condition with unequal index first time 14738
Where for unequal condition with unequal index second time 0

So while the execution time of the standard Linq query stays constant, the execution time with the index drops from 14738 to 0 milliseconds. The 14738 milliseconds show how long it takes to generate the AVL tree. So LinqIndex is only useful for such queries if you can create the index in advance or if the index is used very often.
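
To illustrate why a tree-based index pays off for range queries once it exists, here is a minimal sketch of the same idea using the framework's SortedSet (a red-black tree), not the LinqIndex AVL implementation itself:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RangeIndexDemo
{
    public static void Main()
    {
        // The expensive one-time step: building the balanced tree (O(n log n)).
        var index = new SortedSet<int>(Enumerable.Range(0, 100000));

        // A range query afterwards only walks the subtree holding the matching
        // keys instead of scanning all 100000 items.
        SortedSet<int> bigOnes = index.GetViewBetween(99990, int.MaxValue);
        Console.WriteLine(bigOnes.Count); // 10 (the values 99990..99999)
    }
}
```

The trade-off is the same as in the measurements above: the build cost dominates the first query, and only repeated queries amortize it.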

So far I have described the use of LinqIndex with simple where queries (equality and non-equality comparisons), Join and GroupJoin. But what about more complicated queries that use and (&&) and/or or (||) expressions? Implementing such queries and still gaining speed from the generated index is very difficult, because when combining expressions (like c.Number == 10 && c.LastName == “Gottshall”) we have to decide whether to use an existing index, generate a new index or use no index at all. Even if using an index for the first part of the expression (c.Number == 10) speeds up the query, that does not mean it is a good decision to use an index for the second part (c.LastName == “Gottshall”). So complicated queries that use LinqIndex are slower than the standard Linq queries most of the time. Now I will show one example with a logical and (&&) and one with a logical or (||). In this first example I use a where on two properties; for both properties (c.Number and c.LastName) an index is defined.

[TestMethod]
public void EqualIndexSpecification()
{
    IEnumerable<Customer> customers = Customer.SampleData();
    IndexDefinition<Customer> indexPart = customers.CreateIndexKey(c => c.Number)
        .AddIndexKey(c => c.LastName);                
    Stopwatch stopWatch = new Stopwatch();
    stopWatch.Start();
    var list = customers.Where(i => i.Number > 1000000 && i.LastName == "Gottshall").ToList();
    stopWatch.Stop();
    Debug.WriteLine("Where without index first time " + stopWatch.ElapsedMilliseconds);
    stopWatch.Reset();
    stopWatch.Start();
    list = customers.Where(i => i.Number > 1000000 && i.LastName == "Gottshall").ToList();
    stopWatch.Stop();
    Debug.WriteLine("Where without index second time " + stopWatch.ElapsedMilliseconds);
    stopWatch.Reset();
    stopWatch.Start();
    list = indexPart.Where(i => i.Number > 1000000 && i.LastName == "Gottshall").ToList();
    stopWatch.Stop();
    Debug.WriteLine("Where with equal index first time " + stopWatch.ElapsedMilliseconds);
    stopWatch.Reset();
    stopWatch.Start();
    list = indexPart.Where(i => i.Number > 1000000 && i.LastName == "Gottshall").ToList();
    stopWatch.Stop();
    Debug.WriteLine("Where with equal index second time " + stopWatch.ElapsedMilliseconds);
}

Again I execute the same query two times without an index and then two times with an index (one non-equality and one equality). These are the resulting times:

Where without index first time 206
Where without index second time 199
Where with equal index first time 19859
Where with equal index second time 0

Again it was possible to bring the execution time with the index down from 19859 to 0 milliseconds. If you execute such a query with LinqIndex, it will first build an AVL tree and search for the Number in it, then take the result, build a hashtable out of it and search for the LastName in that. If you define no index for the LastName, LinqIndex would build the AVL tree, search for the Number in it and then use a standard Linq where to find the LastName. If I use a logical or instead of the logical and, the query looks as follows.

[TestMethod]
public void OrIndexSpecification()
{
    IEnumerable<Customer> customers = Customer.SampleData();
    IndexDefinition<Customer> indexPart = customers.CreateIndexKey(c => c.Number)
        .AddIndexKey(c => c.LastName);
    Stopwatch stopWatch = new Stopwatch();
    stopWatch.Start();
    var list = customers.Where(i => i.Number > 1000000 || i.LastName == "Gottshall").ToList();
    stopWatch.Stop();
    Debug.WriteLine("Where without index first time " + stopWatch.ElapsedMilliseconds);
    stopWatch.Reset();
    stopWatch.Start();
    list = customers.Where(i => i.Number > 1000000 || i.LastName == "Gottshall").ToList();
    stopWatch.Stop();
    Debug.WriteLine("Where without index second time " + stopWatch.ElapsedMilliseconds);
    stopWatch.Reset();
    stopWatch.Start();
    list = indexPart.Where(i => i.Number > 1000000 || i.LastName == "Gottshall").ToList();
    stopWatch.Stop();
    Debug.WriteLine("Where with equal index first time " + stopWatch.ElapsedMilliseconds);
    stopWatch.Reset();
    stopWatch.Start();
    list = indexPart.Where(i => i.Number > 1000000 || i.LastName == "Gottshall").ToList();
    stopWatch.Stop();
    Debug.WriteLine("Where with equal index second time " + stopWatch.ElapsedMilliseconds);
}

If we execute this query two times without an index and then two times with an index (one non-equality and one equality), the execution times look different.

Where without index first time 337
Where without index second time 337
Where with equal index first time 24102
Where with equal index second time 742

Here LinqIndex is slower than the standard where query even on the second execution. That happens because with a logical or I have to look up the values in both indexes and then create a hashset to combine the results. That costs more time than traversing all items and checking both conditions directly. So for where queries with a logical or in them LinqIndex gains no speed and should not be used.
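
The union step described above can be sketched like this (my own illustration, not the LinqIndex internals): both index lookups produce candidate sets, and a HashSet merges them while removing duplicates, which is exactly the extra work that makes the || case slow:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OrQueryDemo
{
    public static void Main()
    {
        var numbers = Enumerable.Range(1, 20).ToList();

        // Pretend these two sequences came from two separate index lookups.
        IEnumerable<int> fromNumberIndex = numbers.Where(n => n > 15);    // 16..20
        IEnumerable<int> fromNameIndex = numbers.Where(n => n % 5 == 0);  // 5, 10, 15, 20

        // Combining them requires deduplication (20 appears in both results).
        var union = new HashSet<int>(fromNumberIndex);
        union.UnionWith(fromNameIndex);
        Console.WriteLine(union.Count); // 8 distinct matches: 5,10,15,16,17,18,19,20
    }
}
```

A single pass over all items with `n > 15 || n % 5 == 0` visits each element once and needs no extra set, which is why the plain Linq query wins here.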

LinqIndex: (part1) (part2) (part3) (part4) (part5)


building a webservice proxy at runtime

In this post I will write about webservice proxies in C#. If you want to access a webservice in C# you can add a service reference: when you point Visual Studio at the URL of a webservice, it generates a C# proxy class for you. If you instantiate this class you can use the object to access the webservice like any local object. That mechanism works very well, but what if you want to generate this proxy at runtime and access it with reflection? To do so we need runtime compilation and the ServiceDescription and ServiceDescriptionImporter classes. The first step is to get the WSDL definition.

Uri uri = new Uri("http://www.webservice.com/testservice.asmx");
WebRequest webRequest = WebRequest.Create(uri + "?WSDL");
webRequest.Credentials = CredentialCache.DefaultCredentials;
webRequest.ContentType = "application/x-www-form-urlencoded";
webRequest.Proxy = null;
webRequest.Timeout = -1;
webRequest.Method = "GET";
WebResponse webResponse;
webResponse = webRequest.GetResponse();
using (Stream stream = webResponse.GetResponseStream())
{
// here starts the generation
}

All we do here is issue an HTTP GET request for the WSDL description and obtain a stream that contains the XML.

ServiceDescription description = ServiceDescription.Read(stream);

Now we read the stream into a ServiceDescription. The ServiceDescription object then knows the structure of the WSDL. After that we create a ServiceDescriptionImporter:

ServiceDescriptionImporter descriptionImporter = new ServiceDescriptionImporter();
descriptionImporter.ProtocolName = "SOAP";
descriptionImporter.AddServiceDescription(description, null, null);
descriptionImporter.Style = ServiceDescriptionImportStyle.Client;
descriptionImporter.CodeGenerationOptions = CodeGenerationOptions.GenerateProperties;

We tell the ServiceDescriptionImporter that it should use SOAP to access the webservice, generate the code for the client side and generate simple types as properties. With the AddServiceDescription method we add the information about the WSDL to the ServiceDescriptionImporter. Now we create a CodeNamespace and a CodeCompileUnit and add the CodeNamespace to the CodeCompileUnit.

CodeNamespace codeNamespace = new CodeNamespace();
CodeCompileUnit codeCompileUnit = new CodeCompileUnit();
codeCompileUnit.Namespaces.Add(codeNamespace);

The next point is crucial. We now import the structure of our proxy from the ServiceDescriptionImporter and save it in the codeCompileUnit.

ServiceDescriptionImportWarnings serviceDescriptionImportWarnings =
    descriptionImporter.Import(codeNamespace, codeCompileUnit);

Here we can check whether the import succeeded.

if (serviceDescriptionImportWarnings != ServiceDescriptionImportWarnings.NoCodeGenerated)
{
    // if code generation is ok we can generate the dynamic assembly with the proxy in it
}

Now we can generate the assembly with runtime compilation.

CodeDomProvider compiler = new CSharpCodeProvider();
string[] references = new[] { "System.Web.Services.dll", "System.Xml.dll" };
CompilerParameters parameters = new CompilerParameters(references);
CompilerResults results = compiler.CompileAssemblyFromDom(parameters, codeCompileUnit);
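
One thing the snippet above silently assumes is that compilation succeeded. In practice it is worth inspecting CompilerResults.Errors before touching CompiledAssembly, because accessing the assembly after a failed compile throws. Here is a small self-contained sketch of that check, deliberately compiling broken source to show the error path (the source string is just an example):

```csharp
using System;
using System.CodeDom.Compiler;
using Microsoft.CSharp;

public static class CompileCheckDemo
{
    public static void Main()
    {
        var compiler = new CSharpCodeProvider();
        var parameters = new CompilerParameters { GenerateInMemory = true };

        // Deliberately broken C# source to trigger the error reporting path.
        CompilerResults results = compiler.CompileAssemblyFromSource(
            parameters, "class Broken { int x = }");

        if (results.Errors.HasErrors)
        {
            foreach (CompilerError error in results.Errors)
            {
                Console.WriteLine("{0}: {1}", error.ErrorNumber, error.ErrorText);
            }
        }
        else
        {
            // Only safe to touch the assembly when there are no errors.
            var assembly = results.CompiledAssembly;
        }
    }
}
```

The same `results.Errors.HasErrors` check can be applied to the CompileAssemblyFromDom call above before calling GetExportedTypes.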

At this point the dynamic proxy definition is ready. The next step is to get the exported Types and generate an instance of the proxy.

Type[] assemblyTypes = results.CompiledAssembly.GetExportedTypes();
object webServiceProxy = Activator.CreateInstance(assemblyTypes[0]);

So the object webServiceProxy now contains the generated proxy; next we need to find the methods to call and invoke them. In this case I am searching for the “GetAge” method.

MethodInfo methodInfo = webServiceProxy.GetType().GetMethod("GetAge");
var result = methodInfo.Invoke(webServiceProxy, new object[] { "mathias" });

That's it: as long as you can access the WSDL file you can generate a proxy for any webservice you want. Here is the whole code example.

Uri uri = new Uri("http://www.webservice.com/testservice.asmx");
WebRequest webRequest = WebRequest.Create(uri.ToString() + "?WSDL");
webRequest.Credentials = CredentialCache.DefaultCredentials;
webRequest.ContentType = "application/x-www-form-urlencoded";
webRequest.Proxy = null;
webRequest.Timeout = -1;
webRequest.Method = "GET";
WebResponse webResponse;
webResponse = webRequest.GetResponse();
using (Stream stream = webResponse.GetResponseStream())
{
    ServiceDescription description = ServiceDescription.Read(stream);
    ServiceDescriptionImporter descriptionImporter = new ServiceDescriptionImporter();
    descriptionImporter.ProtocolName = "SOAP";
    descriptionImporter.AddServiceDescription(description, null, null);
    descriptionImporter.Style = ServiceDescriptionImportStyle.Client;
    descriptionImporter.CodeGenerationOptions = CodeGenerationOptions.GenerateProperties;
    CodeNamespace codeNamespace = new CodeNamespace();
    CodeCompileUnit codeCompileUnit = new CodeCompileUnit();
    codeCompileUnit.Namespaces.Add(codeNamespace);
    ServiceDescriptionImportWarnings serviceDescriptionImportWarnings =
        descriptionImporter.Import(codeNamespace, codeCompileUnit);
    if (serviceDescriptionImportWarnings != ServiceDescriptionImportWarnings.NoCodeGenerated)
    {
        CodeDomProvider compiler = new CSharpCodeProvider();
        string[] references = new[] {"System.Web.Services.dll", "System.Xml.dll"};
        CompilerParameters parameters = new CompilerParameters(references);
        CompilerResults results = compiler.CompileAssemblyFromDom(parameters, codeCompileUnit);

        Type[] assemblyTypes = results.CompiledAssembly.GetExportedTypes();
        object webServiceProxy = Activator.CreateInstance(assemblyTypes[0]);

        MethodInfo methodInfo = webServiceProxy.GetType().GetMethod("GetAge");
        var result = methodInfo.Invoke(webServiceProxy, new object[] {"mathias"});
    }
}