# Calculating string difference values with the Levenshtein fuzzy string algorithm (part 2: implementation in t-sql).

**Posted:**24/10/2012

**Filed under:**.net, Algorithm, Sql Server, t-sql |

**Tags:**fuzzy, Levenshtein, sql server, string, t-sql 3 Comments

In this second post i will write about the Levenshtein algorithm and how to implement it in t-sql (in part 1 i have explained the implementation in c#). With this algorithm it is possible to compare two strings and calculate the a difference value of them. Lets asume that we have a table in a database with words in it.

Now if the user overtakes a word we want to suggest the words out of the table words that is the closest. So we need the Levenshtein algorithm to calculate the levenshtein difference between the overtaken word and all words in the table words. Then we take the words with the minimal levenshtein distance and give it the user to choose out of that words if he wants. For example if the user overtakes ‘lay’ the words with the minimal levenshtein distance are ‘lady’ and ‘lacy’. So this are the words we would suggest to the user. To acomplish that we need a function that overtakes the source and the target string and calculates the levenshtein distance.

CREATE FUNCTION dbo.CalculateLevenshteinValue(@source nvarchar(max), @target nvarchar(max)) RETURNS int BEGIN declare @sourceLength int, @targetLength int set @sourceLength = len(@source); set @targetLength = len(@target); declare @array table ( i int, j int, value int); declare @i int, @j int, @result int set @i = 0; set @j = 0; WHILE @i <= @sourceLength BEGIN insert @array( i, j, [value]) values( @i, 0, @i) SET @i = @i + 1; END WHILE @j <= @targetLength BEGIN insert @array ( i, j, [value]) values ( 0, @j, @j) SET @j = @j + 1; END set @i = 1; set @j = 1; declare @equal int, @minInsert int, @minDelete int, declare @minInterchange int, @equalValue int, @insertValue int declare @sourceChar nvarchar(1), @targetChar nvarchar(1) WHILE @i <= @sourceLength BEGIN WHILE @j <= @targetLength BEGIN SET @sourceChar = substring(@source, @i, 1); SET @targetChar = substring(@target, @j, 1); SET @equal = 2; if(@sourceChar = @targetChar) begin SET @equal = 0; end select @minInsert = value + 1 from @array a where a.i = @i - 1 and a.j = @j select @minDelete = value + 1 from @array a where a.i = @i and a.j = @j - 1 select @minInterchange = [value] + @equal from @array a where a.i = @i - 1 and a.j = @j - 1 select @insertValue = dbo.ScalarMinValue(dbo.ScalarMinValue(@minInsert, @minDelete), @minInterchange) insert @array ( i, j, [value]) values ( @i, @j, @insertValue) SET @j = @j + 1; END SET @i = @i + 1; SET @j = 1 END select @result = [value] from @array a where a.i = @sourceLength and a.j = @targetLength RETURN @result; END; CREATE FUNCTION dbo.ScalarMinValue (@Value1 int, @Value2 int) RETURNS int BEGIN declare @return int Select @return = Case When @Value1 < @Value2 Then @Value1 Else @Value2 End return @return END

The calculation is very simmular to the calculation in c# but because the sql server does not support arrays i am using a temptable to store the calculated values. With this function that calculates the difference between two words we can write the stored procedure that takes the source string and select the target strings out of the Words table with the smallest levenshtein distance.

CREATE procedure LevenshteinCalculator @source nvarchar(256) as select dbo.CalculateLevenshteinValue(@source,Word) Levenshtein, Word from Words wout where dbo.CalculateLevenshteinValue(@source,Word) = ( select MIN(m.Levenshtein) from (select dbo.CalculateLevenshteinValue(@source,Word) Levenshtein, Word from Words w) m) go

So if we execute the CalculateLevenshteinValue procedure with ‘lay’ we get the following result.

exec LevenshteinCalculator @source='lay'

So lacy and lady are very close to lay and so we could assume that the user has made typing error and suggest him the two words .

(part1) (part2)

# Calculating string difference values with the Levenshtein fuzzy string algorithm (part 1: introduction and implementation in c#).

**Posted:**23/10/2012

**Filed under:**.net, Algorithm, c#, SimpleRX, Sql Server |

**Tags:**algorithm, c#, fuzzy, Levenshtein, string Leave a comment

In this post i will write about the Levenshtein algorithm. With this algorithm it is possible to compare two strings and calculate the a difference value of them. I read in a sql blog about this but the implementation was much to complicated so i desided to implement the algorithm myself in c# and in t-sql. After that i thought about the best use of the algorithm and decided to implement a special Levenshtein subject in my SimpleRX project to check a string against a list of overtaken strings. First of all i want to describe the Levenshtein algorithm that belongs to the fuzzy string search algorithms. The basic idea is very simple. You take two strings and compare that strings character by character. If both characters are the same you calculate zero if you have to exchange the character to another character to transform the source string into the target string you calculate two. If you have to insert or delete a character to transform the source character into the target character you calculate one. You do that for every character in the source string and as closer the source string is to the target string as smaller the number is. So as a example we take the string “must” and compare it to “dust”.

So the result is 2 because must and dust only differ in one character (exchange m for d). If we would calculate the difference between “mus” and “dust”

the result would be 3 because we have to exchange one character (d for m) and we have to add one character (t). If we have the string “mustard” and “dust”

the result is 5 because we have to exchange one character (d for m) and we have to delete three characters (a, r and d). So that are the basic operations to calculate the Levenshtein distance between two strings. The calculation in c# works with a two dimensional array to compare each character of the source string with each character of the target string plus one row and one column that shows the position of the characters in the source and target string. So for “must” and “dust” the start array is the following.

To accomplish this i used the following two loops.

var sourceLength = source.Length + 1; var targetLength = target.Length + 1; var array = new int [sourceLength, targetLength]; for (int i = 0; i <= source.Length; i++) { array[i, 0] = i; } for (int i = 0; i <= target.Length; i++) { array[0, i] = i; }

With that array we start the calculation. We use two loops to compare every character of the source string with every character of the target string. For every pair of characters we calculate the value for exchange, insert or delete and take the minimum to fill the corresponding item in the array. In this example we have two loops that run from 1 to the length of the array (for the source string) and from 1 to the length of the array (for the target string).

Now we check if the actual characters in the source and target string are the same. If yes we calculate zero if no we calcuate two. Then we canculate the insert value, the delete value and the interchange value out of the positiohn values already stored in the array. For i = 1 and j = 1 we compare the ‘m’ of must with the ‘d’ of dust. Equal is 2 because d and m are not the same. MinInsert is the value of [0,1] and that is a 1 plus 1 because of the insert that means that minInsert is 2. MinDelete is the value of [1,0] and that is a 1 plus 1 because of the delete that means that minDelete is 2. MinInterchange is the value of [0,0] and that is a 0 plus 2 because equal is 2 that means that minInterchange is 2. Now we take the minimum out of minInsert, minDelete and minInterchange. That means at the position [1,1] of the array we insert 2. That is the array after the comparision of the first two characters. Here we see the result of the first calcuation.

Now we want to look at the next continuous loop. Now i = 1 and j = 2 we compare ‘u’ with the ‘d’ of dust. Equal is 2 because u and d are not the same. MinInsert is the value of [0,2] and that is a 2 plus 1 because of the insert that means that minInsert is 3. MinDelete is the value of [1,1] and that is a 2 plus 1 because of the delete that means that minDelete is 3. MinInterchange is the value of [0,1] and that is a 1 plus 2 because of the equal is 2 that means that minInterchange is 3. Now we take the minimum out of minInsert, minDelete and minInterchange. That means at the position [1,2] of the array we insert 3. That is the array after the comparison of ‘u’ and ‘d’.

So we continue until the whole array is filled with values.

That is the code that generates the array values.

for (int i = 1; i <= source.Length; i++) { for (int j = 1; j <= target.Length; j++) { var equal = (source[i - 1] == target[j - 1])? 0 : 2; var minInsert = array[i - 1, j] + 1; var minDelete = array[i, j - 1] + 1; var minInterchange = array[i - 1, j - 1] + equal; array[i, j] = Math.Min(Math.Min(minInsert, minDelete), minInterchange); } }

Now we take the value at position [n,m] of the array. In our case it is array[4,4] that value is 2. That means the difference between ‘must’ and ‘dust’ is 2. Here we have the complete method that calculates the Levenshtein difference.

public int CalculateLevenshteinValue(string source, string target) { var sourceLength = source.Length; var targetLength = target.Length; var array = new int[sourceLength + 1, targetLength + 1]; for (int i = 0; i <= source.Length; i++) { array[i, 0] = i; } for (int i = 0; i <= target.Length; i++) { array[0, i] = i; } for (int i = 1; i <= source.Length; i++) { for (int j = 1; j <= target.Length; j++) { var equal = (source[i - 1] == target[j - 1])? 0 : 2; var minInsert = array[i - 1, j] + 1; var minDelete = array[i, j - 1] + 1; var minInterchange = array[i - 1, j - 1] + equal; array[i, j] = Math.Min(Math.Min(minInsert, minDelete), minInterchange); } } return array[sourceLength, targetLength]; }

After finishing the implementation i wanted to use the calculation of the Levenshtein algorithm. So i build a LevenshteinSubject in my SimpleRX project to use it. Here is the code that shows the usage.

[TestMethod] public void LevenshteinSubjectTest() { List list = new List() { "label", "length", "lamp", "lab", "lacy", "lady", "lager", "lair", "lake", "lam", "lamb" }; LevenshteinSubject subject = new LevenshteinSubject(list); subject.Subscribe(x => { var y = x; Debug.WriteLine(x); }, ex => Console.WriteLine("OnError {0}", ex.Message), () => Console.WriteLine("OnCompleted") ); subject.OnNext("lay"); subject.OnCompleted(); }

That is the result of the execution:

In the second part of that blog post i will show how to implement the Levenshtein algorithm in t-sql.

(part1) (part2)

# implementing recursive common table expressions (cte)

**Posted:**22/03/2012

**Filed under:**Sql Server, Sql Server 2012 |

**Tags:**cte, sql server Leave a comment

Last time i needed a table structure for groups and subgroups so i decided to create two tables and when i needed to query the groups from the database a i reminded myself about commont table expressions (cte) and the possibility to use them to query recursive data. So i looked at the msdn and found a good example. Nonetheless this is a good example i decided to post my own example too.

First of all i will describe the two included tables. In the group table the name of the groups gets saved. The data in the table are parent and sub groups. Human is the parentgroup of child, juvenile and adult. Child is the parentgroup of baby and tottler.

`create table Groups (PK_Groups int IDENTITY primary key, Name nvarchar(256))`

To reflect the hirachie of the groups i created the GroupsToGroups table. In this table the parent child relationship between groups is definied. Child is the child of human so FK_Groups reflects the parent key and FK_SubGroups reflects the child key.

create table GroupsToGroups (PK_GroupsToGroups int IDENTITY primary key, FK_Groups int, FK_SubGroups int)

So i need a function that returns a group with all the subgroups to check if a person is member of a specific group or subgroup. To reach this goal i am using a recursive common table expression. The GroupsFunc takes on parameter PK_Group and returns a list of the overtaken group plus all subgroups.

CREATE function dbo.GroupsFunc(@PK_Group int) returns table as return ( with GroupsCte (PK_Groups, Name, FK_SubGroups, [Level]) as ( select PK_Groups, Name, null FK_SubGroups, 1 [Level] from dbo.Groups g where g.PK_Groups = @PK_Group union all select gzg.FK_SubGroups, g.Name, gzg.FK_Groups, [Level] + 1 from GroupsCte gcte inner join dbo.GroupsToGroups gzg on gzg.FK_Groups = gcte.PK_Groups inner join dbo.Groups g on gzg.FK_SubGroups = g.PK_Groups ) select PK_Groups, Name, FK_SubGroups, [Level] from GroupsCte )

The cte consists of two parts. The first part is the entry part, the second part is the recursive part. As you see in the second part the GroupsCte is used to join on the GroupsToGroups table. The recusion gets executed until the second query does not bring an result. If the query returns the result is feeded to the cte and so the next recursion step is executed. Here is the result of the executed GroupsFunc.

select * from GroupsFunc(1)

select * from GroupsFunc(2)

select * from GroupsFunc(5)

After i finished the function with the common table expression i thought about a way to do this without a cte. Here is the result. It gives the same result as the GroupsFunc but i needet two functions and the outer apply keyword to realise this, and the resulting functions are much more complicated than the cte solution. The most imported point in the implementation is the recursive call of the function in the outer apply part. For every group in one hirachie the GroupFuncWithoutCte is called again and returns the child groups. If a group has no childs then the recusion ends.

CREATE function dbo.GroupsFuncWithoutCte(@PK_Group int, @recLevel int) returns @tempGroups table ( PK_Groups int, Name nvarchar(256), FK_SubGroups int, [Level] int ) as begin declare @count int select @count = count(gzg.FK_Groups) from dbo.GroupsToGroups gzg inner join dbo.Groups g on gzg.FK_SubGroups = g.PK_Groups where gzg.FK_Groups = @PK_Group if(@count > 0) begin insert @tempGroups select g.PK_Groups, g.Name, gzg.FK_Groups, @recLevel + 1 from dbo.GroupsToGroups gzg inner join dbo.Groups g on gzg.FK_SubGroups = g.PK_Groups where gzg.FK_Groups = @PK_Group insert @tempGroups select gf.PK_Groups, gf.Name, gf.FK_SubGroups, gf.Level from @tempGroups tg cross apply dbo.GroupsFuncWithoutCte(tg.PK_Groups, @recLevel + 1) gf end return end

--Entry point CREATE function dbo.GroupsFuncWithoutCteMain(@PK_Group int) returns @tempGroups table ( PK_Groups int, Name nvarchar(256), FK_SubGroups int, [Level] int ) as begin insert @tempGroups select g.PK_Groups, g.Name, null FK_SubGroups, 0 [Level] from dbo.Groups g where g.PK_Groups = @PK_Group insert @tempGroups select gf.PK_Groups, gf.Name, gf.FK_SubGroups, gf.[Level] from @tempGroups tg cross apply dbo.GroupsFuncWithoutCte(tg.PK_Groups, 1) gf return end

The starting point is the GroupsFuncWithoutCteMain function that queries the Group table and the calls the GroupsFuncWithoutCte function.

select * from dbo.GroupsFuncWithoutCteMain(1)