Calculating string difference values with the Levenshtein fuzzy string algorithm (part 2: implementation in t-sql).

In this second post i will write about the Levenshtein algorithm and how to implement it in t-sql (in part 1 i have explained the implementation in c#). With this algorithm it is possible to compare two strings and calculate the a difference value of them. Lets asume that we have a table in a database with words in it.

Now if the user overtakes a word we want to suggest the words out of the table words that is the closest. So we need the Levenshtein algorithm to calculate the levenshtein difference between the overtaken word and all words in the table words. Then we take the words with the minimal levenshtein distance and give it the user to choose out of that words if he wants. For example if the user overtakes ‘lay’ the words with the minimal levenshtein distance are ‘lady’ and ‘lacy’. So this are the words we would suggest to the user. To acomplish that we need a function that overtakes the source and the target string and calculates the levenshtein distance.

CREATE FUNCTION dbo.CalculateLevenshteinValue(@source nvarchar(max), @target nvarchar(max))
RETURNS int
BEGIN  
     declare @sourceLength int, @targetLength int

     set @sourceLength = len(@source);
     set @targetLength = len(@target);

     declare @array table ( i int, j int, value int);
     declare @i int, @j int, @result int

     set @i = 0;
     set @j = 0;

     WHILE @i <= @sourceLength
     BEGIN
	     insert @array( i, j, [value])
	     values( @i, 0, @i) 
  	     SET @i = @i + 1;
     END
     WHILE @j <= @targetLength
     BEGIN
	     insert @array ( i, j, [value])
	     values ( 0, @j, @j) 
  	     SET @j = @j + 1;
     END

     set @i = 1;
     set @j = 1;

     declare @equal int, @minInsert int, @minDelete int, 
	 declare @minInterchange int, @equalValue int, @insertValue int
     declare @sourceChar nvarchar(1), @targetChar nvarchar(1)

     WHILE @i <= @sourceLength
     BEGIN
	     WHILE @j <= @targetLength
	     BEGIN
		     SET @sourceChar = substring(@source, @i, 1);		
		     SET @targetChar = substring(@target, @j, 1);
		     SET @equal = 2;
		     if(@sourceChar = @targetChar)
		     begin
			     SET @equal = 0;	
		     end
		     select @minInsert = value + 1 
                     from @array a where a.i = @i - 1 and a.j = @j
		     select @minDelete = value + 1 
                     from @array a 
                     where a.i = @i and a.j = @j - 1
		     select @minInterchange = [value] + @equal 
                     from @array a 
                     where a.i = @i - 1 and a.j = @j - 1
		     select @insertValue = 
                     dbo.ScalarMinValue(dbo.ScalarMinValue(@minInsert, @minDelete), @minInterchange)
		     insert @array ( i, j, [value])
		     values ( @i, @j, @insertValue) 
	  	     SET @j = @j + 1;
	     END
	     SET @i = @i + 1;
	     SET @j = 1
     END
     select @result = [value] 
     from @array a 
     where a.i = @sourceLength and a.j = @targetLength
RETURN @result;
END;
CREATE FUNCTION dbo.ScalarMinValue
(@Value1 int, @Value2 int)
RETURNS int
BEGIN
	declare @return int	
	Select @return = Case When @Value1 < @Value2 Then @Value1 Else @Value2 End	
	return @return
END

The calculation is very simmular to the calculation in c# but because the sql server does not support arrays i am using a temptable to store the calculated values. With this function that calculates the difference between two words we can write the stored procedure that takes the source string and select the target strings out of the Words table with the smallest levenshtein distance.

CREATE procedure LevenshteinCalculator
@source nvarchar(256)
as
	select dbo.CalculateLevenshteinValue(@source,Word) Levenshtein, Word 
	from Words wout
	where dbo.CalculateLevenshteinValue(@source,Word) = (
	select MIN(m.Levenshtein)
	from
	(select dbo.CalculateLevenshteinValue(@source,Word) Levenshtein, Word 
	from Words w) m)
go

So if we execute the CalculateLevenshteinValue procedure with ‘lay’ we get the following result.

exec LevenshteinCalculator @source='lay'

So lacy and lady are very close to lay and so we could assume that the user has made typing error and suggest him the two words .
(part1) (part2)


Calculating string difference values with the Levenshtein fuzzy string algorithm (part 1: introduction and implementation in c#).

In this post i will write about the Levenshtein algorithm. With this algorithm it is possible to compare two strings and calculate the a difference value of them. I read in a sql blog about this but the implementation was much to complicated so i desided to implement the algorithm myself in c# and in t-sql. After that i thought about the best use of the algorithm and decided to implement a special Levenshtein subject in my SimpleRX project to check a string against a list of overtaken strings. First of all i want to describe the Levenshtein algorithm that belongs to the fuzzy string search algorithms. The basic idea is very simple. You take two strings and compare that strings character by character. If both characters are the same you calculate zero if you have to exchange the character to another character to transform the source string into the target string you calculate two. If you have to insert or delete a character to transform the source character into the target character you calculate one. You do that for every character in the source string and as closer the source string is to the target string as smaller the number is. So as a example we take the string “must” and compare it to “dust”.

So the result is 2 because must and dust only differ in one character (exchange m for d). If we would calculate the difference between “mus” and “dust”

the result would be 3 because we have to exchange one character (d for m) and we have to add one character (t). If we have the string “mustard” and “dust”

the result is 5 because we have to exchange one character (d for m) and we have to delete three characters (a, r and d). So that are the basic operations to calculate the Levenshtein distance between two strings. The calculation in c# works with a two dimensional array to compare each character of the source string with each character of the target string plus one row and one column that shows the position of the characters in the source and target string. So for “must” and “dust” the start array is the following.

To accomplish this i used the following two loops.

var sourceLength = source.Length + 1;
var targetLength = target.Length + 1;
var array = new int
[sourceLength, targetLength];
for (int i = 0; i <= source.Length; i++)
{
     array[i, 0] = i;
}
for (int i = 0; i <= target.Length; i++)
{
     array[0, i] = i;
}

With that array we start the calculation. We use two loops to compare every character of the source string with every character of the target string. For every pair of characters we calculate the value for exchange, insert or delete and take the minimum to fill the corresponding item in the array. In this example we have two loops that run from 1 to the length of the array (for the source string) and from 1 to the length of the array (for the target string).
Now we check if the actual characters in the source and target string are the same. If yes we calculate zero if no we calcuate two. Then we canculate the insert value, the delete value and the interchange value out of the positiohn values already stored in the array. For i = 1 and j = 1 we compare the ‘m’ of must with the ‘d’ of dust. Equal is 2 because d and m are not the same. MinInsert is the value of [0,1] and that is a 1 plus 1 because of the insert that means that minInsert is 2. MinDelete is the value of [1,0] and that is a 1 plus 1 because of the delete that means that minDelete is 2. MinInterchange is the value of [0,0] and that is a 0 plus 2 because equal is 2 that means that minInterchange is 2. Now we take the minimum out of minInsert, minDelete and minInterchange. That means at the position [1,1] of the array we insert 2. That is the array after the comparision of the first two characters. Here we see the result of the first calcuation.

Now we want to look at the next continuous loop. Now i = 1 and j = 2 we compare ‘u’ with the ‘d’ of dust. Equal is 2 because u and d are not the same. MinInsert is the value of [0,2] and that is a 2 plus 1 because of the insert that means that minInsert is 3. MinDelete is the value of [1,1] and that is a 2 plus 1 because of the delete that means that minDelete is 3. MinInterchange is the value of [0,1] and that is a 1 plus 2 because of the equal is 2 that means that minInterchange is 3. Now we take the minimum out of minInsert, minDelete and minInterchange. That means at the position [1,2] of the array we insert 3. That is the array after the comparison of ‘u’ and ‘d’.

So we continue until the whole array is filled with values.

That is the code that generates the array values.

for (int i = 1; i <= source.Length; i++)
{
    for (int j = 1; j <= target.Length; j++)
    {
        var equal = (source[i - 1] == target[j - 1])? 0 : 2;
        var minInsert = array[i - 1, j] + 1;
        var minDelete = array[i, j - 1] + 1;
        var minInterchange = array[i - 1, j - 1] + equal;
        array[i, j] = Math.Min(Math.Min(minInsert, minDelete), minInterchange);                                        
    }
}

Now we take the value at position [n,m]  of the array. In our case it is array[4,4] that value is 2. That means the difference between ‘must’ and ‘dust’ is 2. Here we have the complete method that calculates the Levenshtein difference.

public int CalculateLevenshteinValue(string source, string target)
{
    var sourceLength = source.Length;
    var targetLength = target.Length;
    var array = new int[sourceLength + 1, targetLength + 1];
    for (int i = 0; i <= source.Length; i++)
    {
        array[i, 0] = i;
    }
    for (int i = 0; i <= target.Length; i++)
    {
        array[0, i] = i;
    }                       
    for (int i = 1; i <= source.Length; i++)
    {
        for (int j = 1; j <= target.Length; j++)
        {
            var equal = (source[i - 1] == target[j - 1])? 0 : 2;
            var minInsert = array[i - 1, j] + 1;
            var minDelete = array[i, j - 1] + 1;
            var minInterchange = array[i - 1, j - 1] + equal;
            array[i, j] = Math.Min(Math.Min(minInsert, minDelete), minInterchange);                                        
        }
    }
    return array[sourceLength, targetLength];
}

After finishing the implementation i wanted to use the calculation of the Levenshtein algorithm. So i build a LevenshteinSubject in my SimpleRX project to use it. Here is the code that shows the usage.

[TestMethod]
public void LevenshteinSubjectTest()
{
    List list = new List() { "label", "length", "lamp", "lab",
        "lacy", "lady", "lager", "lair", "lake", "lam", "lamb" };
    LevenshteinSubject subject = new LevenshteinSubject(list);
    subject.Subscribe(x => 
        { 
            var y = x;
            Debug.WriteLine(x);
        },
        ex => Console.WriteLine("OnError {0}", ex.Message),
        () => Console.WriteLine("OnCompleted")
        );
    subject.OnNext("lay");
    subject.OnCompleted();
}

That is the result of the execution:

In the second part of that blog post i will show how to implement the Levenshtein algorithm in t-sql.

(part1) (part2)


implementing recursive common table expressions (cte)

Last time i needed a table structure for groups and subgroups so i decided to create two tables and when i needed to query the groups from the database a i reminded myself about commont table expressions (cte) and the possibility to use them to query recursive data. So i looked at the msdn and found a good example. Nonetheless this is a good example i decided to post my own example too.

First of all i will describe the two included tables. In the group table the name of the groups gets saved. The data in the table are parent and sub groups. Human is the parentgroup of child, juvenile and adult. Child is the parentgroup of baby and tottler.

create table Groups (PK_Groups int IDENTITY primary key, Name nvarchar(256))

To reflect the hirachie of the groups i created the GroupsToGroups table. In this table the parent child relationship between groups is definied. Child is the child of human so FK_Groups reflects the parent key and FK_SubGroups reflects the child key.

create table GroupsToGroups (PK_GroupsToGroups int IDENTITY primary key, FK_Groups int, FK_SubGroups int)

So i need a function that returns a group with all the subgroups to check if a person is member of a specific group or subgroup. To reach this goal i am using a recursive common table expression. The GroupsFunc takes on parameter PK_Group and returns a list of the overtaken group plus all subgroups.

CREATE function dbo.GroupsFunc(@PK_Group int)
returns table
as
return
(
    with GroupsCte (PK_Groups, Name, FK_SubGroups, [Level])
    as
    (
        select
            PK_Groups, Name, null FK_SubGroups, 1 [Level]
        from
            dbo.Groups g
        where
            g.PK_Groups = @PK_Group

        union all

        select
            gzg.FK_SubGroups, g.Name, gzg.FK_Groups, [Level] + 1
        from
            GroupsCte gcte inner join
            dbo.GroupsToGroups gzg on gzg.FK_Groups = gcte.PK_Groups inner join
            dbo.Groups g on gzg.FK_SubGroups = g.PK_Groups
    )
    select PK_Groups, Name, FK_SubGroups, [Level]
    from GroupsCte
)

The cte consists of two parts. The first part is the entry part, the second part is the recursive part. As you see in the second part the GroupsCte is used to join on the GroupsToGroups table. The recusion gets executed until the second query does not bring an result. If the query returns the result is feeded to the cte and so the next recursion step is executed. Here is the result of the executed GroupsFunc.

select * from GroupsFunc(1)
select * from GroupsFunc(2)
select * from GroupsFunc(5)

After i finished the function with the common table expression i thought about a way to do this without a cte. Here is the result. It gives the same result as the GroupsFunc but i needet two functions and the outer apply keyword to realise this, and the resulting functions are much more complicated than the cte solution. The most imported point in the implementation is the recursive call of the function in the outer apply part. For every group in one hirachie the GroupFuncWithoutCte is called again and returns the child groups. If a group has no childs then the recusion ends.

CREATE function dbo.GroupsFuncWithoutCte(@PK_Group int, @recLevel int)
returns @tempGroups table
(
    PK_Groups int, Name nvarchar(256), FK_SubGroups int, [Level] int
)
as
begin
    declare @count int
    select
        @count = count(gzg.FK_Groups)
    from
        dbo.GroupsToGroups gzg inner join
        dbo.Groups g on gzg.FK_SubGroups = g.PK_Groups
    where
        gzg.FK_Groups = @PK_Group
    if(@count > 0)
    begin
        insert @tempGroups
        select
            g.PK_Groups, g.Name, gzg.FK_Groups, @recLevel + 1
        from
            dbo.GroupsToGroups gzg inner join
            dbo.Groups g on gzg.FK_SubGroups = g.PK_Groups
        where
            gzg.FK_Groups = @PK_Group

        insert @tempGroups
        select gf.PK_Groups, gf.Name, gf.FK_SubGroups, gf.Level
        from @tempGroups tg cross apply dbo.GroupsFuncWithoutCte(tg.PK_Groups, @recLevel + 1) gf
    end    

    return
end
--Entry point
CREATE function dbo.GroupsFuncWithoutCteMain(@PK_Group int)
returns @tempGroups table
(
    PK_Groups int, Name nvarchar(256), FK_SubGroups int, [Level] int
)
as
begin
        insert @tempGroups
        select  g.PK_Groups, g.Name, null FK_SubGroups, 0 [Level]
        from
            dbo.Groups g
        where
            g.PK_Groups = @PK_Group        

        insert @tempGroups
        select gf.PK_Groups, gf.Name, gf.FK_SubGroups, gf.[Level]
        from @tempGroups tg cross apply dbo.GroupsFuncWithoutCte(tg.PK_Groups, 1) gf

    return
end

The starting point is the GroupsFuncWithoutCteMain function that queries the Group table and the calls the GroupsFuncWithoutCte function.

select * from dbo.GroupsFuncWithoutCteMain(1)