Calculating string difference values with the Levenshtein fuzzy string algorithm (part 1: introduction and implementation in c#).

In this post i will write about the Levenshtein algorithm. With this algorithm it is possible to compare two strings and calculate the a difference value of them. I read in a sql blog about this but the implementation was much to complicated so i desided to implement the algorithm myself in c# and in t-sql. After that i thought about the best use of the algorithm and decided to implement a special Levenshtein subject in my SimpleRX project to check a string against a list of overtaken strings. First of all i want to describe the Levenshtein algorithm that belongs to the fuzzy string search algorithms. The basic idea is very simple. You take two strings and compare that strings character by character. If both characters are the same you calculate zero if you have to exchange the character to another character to transform the source string into the target string you calculate two. If you have to insert or delete a character to transform the source character into the target character you calculate one. You do that for every character in the source string and as closer the source string is to the target string as smaller the number is. So as a example we take the string “must” and compare it to “dust”.

So the result is 2 because must and dust only differ in one character (exchange m for d). If we would calculate the difference between “mus” and “dust”

the result would be 3 because we have to exchange one character (d for m) and we have to add one character (t). If we have the string “mustard” and “dust”

the result is 5 because we have to exchange one character (d for m) and we have to delete three characters (a, r and d). So that are the basic operations to calculate the Levenshtein distance between two strings. The calculation in c# works with a two dimensional array to compare each character of the source string with each character of the target string plus one row and one column that shows the position of the characters in the source and target string. So for “must” and “dust” the start array is the following.

To accomplish this i used the following two loops.

var sourceLength = source.Length + 1;
var targetLength = target.Length + 1;
var array = new int
[sourceLength, targetLength];
for (int i = 0; i <= source.Length; i++)
{
     array[i, 0] = i;
}
for (int i = 0; i <= target.Length; i++)
{
     array[0, i] = i;
}

With that array we start the calculation. We use two loops to compare every character of the source string with every character of the target string. For every pair of characters we calculate the value for exchange, insert or delete and take the minimum to fill the corresponding item in the array. In this example we have two loops that run from 1 to the length of the array (for the source string) and from 1 to the length of the array (for the target string).
Now we check if the actual characters in the source and target string are the same. If yes we calculate zero if no we calcuate two. Then we canculate the insert value, the delete value and the interchange value out of the positiohn values already stored in the array. For i = 1 and j = 1 we compare the ‘m’ of must with the ‘d’ of dust. Equal is 2 because d and m are not the same. MinInsert is the value of [0,1] and that is a 1 plus 1 because of the insert that means that minInsert is 2. MinDelete is the value of [1,0] and that is a 1 plus 1 because of the delete that means that minDelete is 2. MinInterchange is the value of [0,0] and that is a 0 plus 2 because equal is 2 that means that minInterchange is 2. Now we take the minimum out of minInsert, minDelete and minInterchange. That means at the position [1,1] of the array we insert 2. That is the array after the comparision of the first two characters. Here we see the result of the first calcuation.

Now we want to look at the next continuous loop. Now i = 1 and j = 2 we compare ‘u’ with the ‘d’ of dust. Equal is 2 because u and d are not the same. MinInsert is the value of [0,2] and that is a 2 plus 1 because of the insert that means that minInsert is 3. MinDelete is the value of [1,1] and that is a 2 plus 1 because of the delete that means that minDelete is 3. MinInterchange is the value of [0,1] and that is a 1 plus 2 because of the equal is 2 that means that minInterchange is 3. Now we take the minimum out of minInsert, minDelete and minInterchange. That means at the position [1,2] of the array we insert 3. That is the array after the comparison of ‘u’ and ‘d’.

So we continue until the whole array is filled with values.

That is the code that generates the array values.

for (int i = 1; i <= source.Length; i++)
{
    for (int j = 1; j <= target.Length; j++)
    {
        var equal = (source[i - 1] == target[j - 1])? 0 : 2;
        var minInsert = array[i - 1, j] + 1;
        var minDelete = array[i, j - 1] + 1;
        var minInterchange = array[i - 1, j - 1] + equal;
        array[i, j] = Math.Min(Math.Min(minInsert, minDelete), minInterchange);                                        
    }
}

Now we take the value at position [n,m]  of the array. In our case it is array[4,4] that value is 2. That means the difference between ‘must’ and ‘dust’ is 2. Here we have the complete method that calculates the Levenshtein difference.

public int CalculateLevenshteinValue(string source, string target)
{
    var sourceLength = source.Length;
    var targetLength = target.Length;
    var array = new int[sourceLength + 1, targetLength + 1];
    for (int i = 0; i <= source.Length; i++)
    {
        array[i, 0] = i;
    }
    for (int i = 0; i <= target.Length; i++)
    {
        array[0, i] = i;
    }                       
    for (int i = 1; i <= source.Length; i++)
    {
        for (int j = 1; j <= target.Length; j++)
        {
            var equal = (source[i - 1] == target[j - 1])? 0 : 2;
            var minInsert = array[i - 1, j] + 1;
            var minDelete = array[i, j - 1] + 1;
            var minInterchange = array[i - 1, j - 1] + equal;
            array[i, j] = Math.Min(Math.Min(minInsert, minDelete), minInterchange);                                        
        }
    }
    return array[sourceLength, targetLength];
}

After finishing the implementation i wanted to use the calculation of the Levenshtein algorithm. So i build a LevenshteinSubject in my SimpleRX project to use it. Here is the code that shows the usage.

[TestMethod]
public void LevenshteinSubjectTest()
{
    List list = new List() { "label", "length", "lamp", "lab",
        "lacy", "lady", "lager", "lair", "lake", "lam", "lamb" };
    LevenshteinSubject subject = new LevenshteinSubject(list);
    subject.Subscribe(x => 
        { 
            var y = x;
            Debug.WriteLine(x);
        },
        ex => Console.WriteLine("OnError {0}", ex.Message),
        () => Console.WriteLine("OnCompleted")
        );
    subject.OnNext("lay");
    subject.OnCompleted();
}

That is the result of the execution:

In the second part of that blog post i will show how to implement the Levenshtein algorithm in t-sql.

(part1) (part2)

Advertisements


If you have a note or a question please write a comment.

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s