Paul_Pedant

Fuzzy Match of strings: Overview

Jun 3rd, 2020
311
0
Never
Not a member of Pastebin yet? Sign Up, it unlocks many cool features!
text 3.91 KB | None | 0 0
  1. These are some tools for Fuzzy Matching of strings, prompted by a Reddit post.
  2.  
  3. They are from about 25 years back, and look horrible, so I should rewrite them if I get time. Originally written for Solaris SunOS 1.10, for ksh and nawk, and in a hurry, and for a system with 24MB RAM. The inner algorithm was originally in BASIC on a Sinclair Spectrum.
  4.  
  5. The Algorithm
  6.  
  7. The algorithm is recursive, and works on a pair of strings. At every level, it finds the longest common substring, and then repeats on the substring before that match, and the substring after it.
  8.  
  9. For example, how similar are 'JOE KERR' and 'JOKER'? (Indentation means recursion level.)
  10. The longest common string is 'KER'.
  11. The left sides are now 'JOE ' and 'JO', so we find 'JO'.
  12. The left sides are both empty, no common string.
  13. The right sides are 'E ' and '', no common string.
  14. The right sides are 'R' and '', no common string.
  15.  
  16. We score that as 0.769 similar: count 2 for each character matched, and divide by the total length of both strings. (5 + 5) / 13.
  17.  
  18. Reason for Ranking Value.
  19.  
  20. I want to rank on a range between 0 and 1, the same as we normally express probability. 0.0 means no characters in common, and 1.0 means identical strings. Why?
  21.  
  22. Consider comparing TO, TOE and TWO. They each match T and O. But TO/TOE and TO/TWO each have one mismatch letter, and TOE/TWO have two mismatched letters, which is "more wrong". So we want to take into account all letters in both words.
  23.  
  24. For identical strings like YES/YES, there are 3 matches and a total length of 6, so we need to count each match twice to get 1.000. What we are really counting down is the proportion of characters in both strings that don't match.
  25.  
  26. The code comes in three pieces:
  27.  
  28. (1) Similar.
  29.  
  30. This is a ksh + awk implementation to test the algorithm. It is very nasty code. The issue is that every substring has to be extracted explicitly for the recursion, and it returns multiple values as three integers within one string. (I also found a logic bug which has been in there for 25 years.)
  31.  
  32. It includes some test cases, and the output looks like this:
  33.  
  34. Paul--) ./Similar | more
  35. 1.000/Similar/Similar
  36. 0.333/r/barry
  37. 1.000//
  38. 0.000/One/
  39. 0.000//Two
  40. 0.640/supersciliously/pernicious
  41. 0.769/JOE KERR/JOKER
  42. 0.800/TOE/TO
  43. 0.800/TO/TWO
  44. 0.667/TWO/TOE
  45. 0.892/Many of my friends have social tenderness/Man of my fiends have socialist tendencies
  46. Paul--)
  47.  
  48. (2) Sim.c
  49.  
  50. This is an implementation in C, and compiles with: cc -o Sim Sim.c
  51. It has a man page built in: ./Sim -H
  52. It has some options (similar to diff) that affect white-space and upper/lower case comparisons.
  53.  
  54. It has a nasty interface. It will do a comparison between all members of one group of strings against another group, but one group goes in as args and the other on stdin, and it is very clunky to correlate the results.
  55.  
  56. It is fast, though. It does not need to extract substrings for the recursion. As soon as it gets a string it makes a pointer to the terminating null, and then it just recurses by passing in adjusted start/end pointers into the original string.
  57.  
  58. With more memory available, this should store all the strings in memory, and pair them up for the m x n combinations. As it stands now, this is done externally in the SimFile script.
  59.  
  60. (3) Simfile
  61.  
  62. This is a ksh script. It has a man page built in: ./SimFile -H
  63. It just fixes things up so Sim.c is not so hard to use.
  64. It takes most of the options that Sim.c has, and passes them in.
  65.  
  66. It reads a set of strings from stdin (which has to be a file, not a pipe).
  67. It sends each string in turn to Sim.c as the primary string to match, and then all the other strings to compare it with. So the final results are a comparison of every string with every other.
  68.  
  69. It is slow for large sets of data, because of the n * m requirement, and it runs an external process for each input string. On a 430-line C source, it takes 8 seconds and makes 46,000 comparisons.
Add Comment
Please, Sign In to add comment