Calculating string difference values with the Levenshtein fuzzy string algorithm (part 2: implementation in t-sql).

In this second post i will write about the Levenshtein algorithm and how to implement it in t-sql (in part 1 i have explained the implementation in c#). With this algorithm it is possible to compare two strings and calculate the a difference value of them. Lets asume that we have a table in a database with words in it.

Now if the user overtakes a word we want to suggest the words out of the table words that is the closest. So we need the Levenshtein algorithm to calculate the levenshtein difference between the overtaken word and all words in the table words. Then we take the words with the minimal levenshtein distance and give it the user to choose out of that words if he wants. For example if the user overtakes ‘lay’ the words with the minimal levenshtein distance are ‘lady’ and ‘lacy’. So this are the words we would suggest to the user. To acomplish that we need a function that overtakes the source and the target string and calculates the levenshtein distance.

CREATE FUNCTION dbo.CalculateLevenshteinValue(@source nvarchar(max), @target nvarchar(max))
RETURNS int
BEGIN  
     declare @sourceLength int, @targetLength int

     set @sourceLength = len(@source);
     set @targetLength = len(@target);

     declare @array table ( i int, j int, value int);
     declare @i int, @j int, @result int

     set @i = 0;
     set @j = 0;

     WHILE @i <= @sourceLength
     BEGIN
	     insert @array( i, j, [value])
	     values( @i, 0, @i) 
  	     SET @i = @i + 1;
     END
     WHILE @j <= @targetLength
     BEGIN
	     insert @array ( i, j, [value])
	     values ( 0, @j, @j) 
  	     SET @j = @j + 1;
     END

     set @i = 1;
     set @j = 1;

     declare @equal int, @minInsert int, @minDelete int, 
	 declare @minInterchange int, @equalValue int, @insertValue int
     declare @sourceChar nvarchar(1), @targetChar nvarchar(1)

     WHILE @i <= @sourceLength
     BEGIN
	     WHILE @j <= @targetLength
	     BEGIN
		     SET @sourceChar = substring(@source, @i, 1);		
		     SET @targetChar = substring(@target, @j, 1);
		     SET @equal = 2;
		     if(@sourceChar = @targetChar)
		     begin
			     SET @equal = 0;	
		     end
		     select @minInsert = value + 1 
                     from @array a where a.i = @i - 1 and a.j = @j
		     select @minDelete = value + 1 
                     from @array a 
                     where a.i = @i and a.j = @j - 1
		     select @minInterchange = [value] + @equal 
                     from @array a 
                     where a.i = @i - 1 and a.j = @j - 1
		     select @insertValue = 
                     dbo.ScalarMinValue(dbo.ScalarMinValue(@minInsert, @minDelete), @minInterchange)
		     insert @array ( i, j, [value])
		     values ( @i, @j, @insertValue) 
	  	     SET @j = @j + 1;
	     END
	     SET @i = @i + 1;
	     SET @j = 1
     END
     select @result = [value] 
     from @array a 
     where a.i = @sourceLength and a.j = @targetLength
RETURN @result;
END;
CREATE FUNCTION dbo.ScalarMinValue
(@Value1 int, @Value2 int)
RETURNS int
BEGIN
	declare @return int	
	Select @return = Case When @Value1 < @Value2 Then @Value1 Else @Value2 End	
	return @return
END

The calculation is very simmular to the calculation in c# but because the sql server does not support arrays i am using a temptable to store the calculated values. With this function that calculates the difference between two words we can write the stored procedure that takes the source string and select the target strings out of the Words table with the smallest levenshtein distance.

CREATE procedure LevenshteinCalculator
@source nvarchar(256)
as
	select dbo.CalculateLevenshteinValue(@source,Word) Levenshtein, Word 
	from Words wout
	where dbo.CalculateLevenshteinValue(@source,Word) = (
	select MIN(m.Levenshtein)
	from
	(select dbo.CalculateLevenshteinValue(@source,Word) Levenshtein, Word 
	from Words w) m)
go

So if we execute the CalculateLevenshteinValue procedure with ‘lay’ we get the following result.

exec LevenshteinCalculator @source='lay'

So lacy and lady are very close to lay and so we could assume that the user has made typing error and suggest him the two words .
(part1) (part2)


Calculating string difference values with the Levenshtein fuzzy string algorithm (part 1: introduction and implementation in c#).

In this post i will write about the Levenshtein algorithm. With this algorithm it is possible to compare two strings and calculate the a difference value of them. I read in a sql blog about this but the implementation was much to complicated so i desided to implement the algorithm myself in c# and in t-sql. After that i thought about the best use of the algorithm and decided to implement a special Levenshtein subject in my SimpleRX project to check a string against a list of overtaken strings. First of all i want to describe the Levenshtein algorithm that belongs to the fuzzy string search algorithms. The basic idea is very simple. You take two strings and compare that strings character by character. If both characters are the same you calculate zero if you have to exchange the character to another character to transform the source string into the target string you calculate two. If you have to insert or delete a character to transform the source character into the target character you calculate one. You do that for every character in the source string and as closer the source string is to the target string as smaller the number is. So as a example we take the string “must” and compare it to “dust”.

So the result is 2 because must and dust only differ in one character (exchange m for d). If we would calculate the difference between “mus” and “dust”

the result would be 3 because we have to exchange one character (d for m) and we have to add one character (t). If we have the string “mustard” and “dust”

the result is 5 because we have to exchange one character (d for m) and we have to delete three characters (a, r and d). So that are the basic operations to calculate the Levenshtein distance between two strings. The calculation in c# works with a two dimensional array to compare each character of the source string with each character of the target string plus one row and one column that shows the position of the characters in the source and target string. So for “must” and “dust” the start array is the following.

To accomplish this i used the following two loops.

var sourceLength = source.Length + 1;
var targetLength = target.Length + 1;
var array = new int
[sourceLength, targetLength];
for (int i = 0; i <= source.Length; i++)
{
     array[i, 0] = i;
}
for (int i = 0; i <= target.Length; i++)
{
     array[0, i] = i;
}

With that array we start the calculation. We use two loops to compare every character of the source string with every character of the target string. For every pair of characters we calculate the value for exchange, insert or delete and take the minimum to fill the corresponding item in the array. In this example we have two loops that run from 1 to the length of the array (for the source string) and from 1 to the length of the array (for the target string).
Now we check if the actual characters in the source and target string are the same. If yes we calculate zero if no we calcuate two. Then we canculate the insert value, the delete value and the interchange value out of the positiohn values already stored in the array. For i = 1 and j = 1 we compare the ‘m’ of must with the ‘d’ of dust. Equal is 2 because d and m are not the same. MinInsert is the value of [0,1] and that is a 1 plus 1 because of the insert that means that minInsert is 2. MinDelete is the value of [1,0] and that is a 1 plus 1 because of the delete that means that minDelete is 2. MinInterchange is the value of [0,0] and that is a 0 plus 2 because equal is 2 that means that minInterchange is 2. Now we take the minimum out of minInsert, minDelete and minInterchange. That means at the position [1,1] of the array we insert 2. That is the array after the comparision of the first two characters. Here we see the result of the first calcuation.

Now we want to look at the next continuous loop. Now i = 1 and j = 2 we compare ‘u’ with the ‘d’ of dust. Equal is 2 because u and d are not the same. MinInsert is the value of [0,2] and that is a 2 plus 1 because of the insert that means that minInsert is 3. MinDelete is the value of [1,1] and that is a 2 plus 1 because of the delete that means that minDelete is 3. MinInterchange is the value of [0,1] and that is a 1 plus 2 because of the equal is 2 that means that minInterchange is 3. Now we take the minimum out of minInsert, minDelete and minInterchange. That means at the position [1,2] of the array we insert 3. That is the array after the comparison of ‘u’ and ‘d’.

So we continue until the whole array is filled with values.

That is the code that generates the array values.

for (int i = 1; i <= source.Length; i++)
{
    for (int j = 1; j <= target.Length; j++)
    {
        var equal = (source[i - 1] == target[j - 1])? 0 : 2;
        var minInsert = array[i - 1, j] + 1;
        var minDelete = array[i, j - 1] + 1;
        var minInterchange = array[i - 1, j - 1] + equal;
        array[i, j] = Math.Min(Math.Min(minInsert, minDelete), minInterchange);                                        
    }
}

Now we take the value at position [n,m]  of the array. In our case it is array[4,4] that value is 2. That means the difference between ‘must’ and ‘dust’ is 2. Here we have the complete method that calculates the Levenshtein difference.

public int CalculateLevenshteinValue(string source, string target)
{
    var sourceLength = source.Length;
    var targetLength = target.Length;
    var array = new int[sourceLength + 1, targetLength + 1];
    for (int i = 0; i <= source.Length; i++)
    {
        array[i, 0] = i;
    }
    for (int i = 0; i <= target.Length; i++)
    {
        array[0, i] = i;
    }                       
    for (int i = 1; i <= source.Length; i++)
    {
        for (int j = 1; j <= target.Length; j++)
        {
            var equal = (source[i - 1] == target[j - 1])? 0 : 2;
            var minInsert = array[i - 1, j] + 1;
            var minDelete = array[i, j - 1] + 1;
            var minInterchange = array[i - 1, j - 1] + equal;
            array[i, j] = Math.Min(Math.Min(minInsert, minDelete), minInterchange);                                        
        }
    }
    return array[sourceLength, targetLength];
}

After finishing the implementation i wanted to use the calculation of the Levenshtein algorithm. So i build a LevenshteinSubject in my SimpleRX project to use it. Here is the code that shows the usage.

[TestMethod]
public void LevenshteinSubjectTest()
{
    List list = new List() { "label", "length", "lamp", "lab",
        "lacy", "lady", "lager", "lair", "lake", "lam", "lamb" };
    LevenshteinSubject subject = new LevenshteinSubject(list);
    subject.Subscribe(x => 
        { 
            var y = x;
            Debug.WriteLine(x);
        },
        ex => Console.WriteLine("OnError {0}", ex.Message),
        () => Console.WriteLine("OnCompleted")
        );
    subject.OnNext("lay");
    subject.OnCompleted();
}

That is the result of the execution:

In the second part of that blog post i will show how to implement the Levenshtein algorithm in t-sql.

(part1) (part2)


Using JsonToStaticTypeConverter (part 1: introduction)

In this post i will write about the JsonToStaticTypeConverter project and how to use it. I created that project after implementing a Json parser. After finishing the parser i thought about where i could use this. So the result is an application that can generate c# classes out of a Json string. So with the  JsonToStaticTypeConverter you can generate c# classes and fill them with your Json data.

If we press generate a class Addresse im namespace MainTest in the file Addresse.cs in the folder C:\Projecte is generated out of the Json code (in this case { ‘street’ : ‘Teststreet’, ‘number’ : 1, ‘city’ : ‘Graz’ }. Here we see the generated c# file:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using JsonConverter;
namespace MainTest
{
public class Addresse : JsonStaticObject
{
public string street { get { return JsonSource.Find<string>("Addresse", "street"); } }
public double number { get { return JsonSource.Find<double>("Addresse", "number"); } }
public string city { get { return JsonSource.Find<string>("Addresse", "city"); } }
}
}

This c# file is inserted into the solution that wants to process the Json data. Here we see how the JsonConverter is used to generate the static structure of Addresse out of the Json data. We also need to reference the JsonConverter.dll from the JsonToStaticTypeConverter Project. Therefore we use the Deserialize method and overtake the Json string.

var jsonConverter = new JsonConverter<MainTest.Addresse>();
MainTest.Addresse address = jsonConverter.Deserialize(
    "{ 'street' : 'Gürtel', 'number' : 23, 'city' : 'Vienna' }");
var street = address.street;
var number = address.number;
var city = address.city;
Trace.WriteLine(street);
Trace.WriteLine(number);
Trace.WriteLine(city);

Here we see the trace result:

So with the JsonConverter it is possible to use the Json data due to a static structure. In the second post of this series i will write about the implementation of JsonToStaticTypeConverter and about the implementation of the Json ParserCombinator.