Quite often, when matching data sets, I come across the problem of what happens when data is “dirty” in someway. In other words, it’s become corrupted, either by a spelling mistake (such as HSBC becoming HSBBC) or by abbreviation (such as Gangle and Wotsit Limited becoming Gangle and Wotsit Ltd) or by contraction (such as Gangle and Wotsit Financial Adviser Ltd becoming Gangle and Wotsit Ltd). In these cases a direct comparison will return no match, so it’s good to have a way of returning data that is, roughly, approximately, a match. I used to have to do this in a cursor - which was a bit of a pain. Then I suddenly had the thought of putting the cursor into a custom function and wrapping it all in a set based query. This is what it looks like… First the function in question.
create Function [dbo].[udf_compactfield_fuzzy] (@notes Varchar(8000), @threshold float) Returns varchar(8000) as Begin --do a cursor Declare @count smallint, @charval int, @length int Declare @workingname Varchar(8000) set @workingname = '' set @count = 1 while @count <> len(@notes) +1 begin --examine each character and discard all non alphanumerics set @charval = ASCII(SUBSTRING(@notes, @count, 1)) if ((@charval >= 97 and @charval <= 122) OR --lcase letters (@charval >= 65 and @charval <= 90) OR --ucase letters (@charval >= 48 and @charval <= 57)) --numbers begin set @workingname = @workingname + SUBSTRING(@notes, @count, 1) end set @count = @count + 1 end --return the first n part of the word, n being threshold set @length = convert(int, len(@workingname) * @threshold) set @workingname = left(@workingname, @length) return (@workingname) End
A handy little function which takes two parameters, notes and threshold. The threshold being a float from 1 - 100% of the data; to 0 - 0% of the data. Notes is concatenated by the function (ie. the contents of the parameter with only the alphanumerics. so “i.j 2 ltd” will become “ij2ltd”) and returns the amount of data from the start of the notes as defined by threshold. Specifying 0.7 will return 70% of the string, which in the previous example would be “ij2l” (roughly). When using this function it becomes essential to use the sql wildcard at the end of the created string to enable the fuzzy matching to work. Obviously you can adjust the size of the match, though I have found 70% to be optimal in most cases, though it will vary with data sets and level of matching required. Note: in datasets with small (5 or less characters) the fuzzy matches returned will be very fuzzy. So some example SQL using this function to match two tables together
Select t1.company_name, t2.company_name from table1 t1 inner join table2 t2 on dbo.udf_compactfield_fuzzy(t1.company_name, 0.7) + '%' like dbo.udf_compactfield_fuzzy(c.company_name, 0.7) + '%' go
Future additions to this cursor would perhaps to have a common word list to remove words like “The” and “And”, (so Gangle and Wotsit would match Gangle & Wotsit). Code can always be improved…