Paul_Pedant

Release Note for Version 2 of Fuzzy Matching.

Jun 8th, 2020
  1. I had a major rethink about Sim.c, the C program that does fuzzy comparisons. It was clumsy to use, and had to be called many times. I rewrote it to hold all the names in memory at once and to sort the results internally, and added a couple more options.
  2.  
  3. (1) It now works on either one or two files:
  4.  
  5. (a) With one input file, it compares each line with every other line for similarity. So for n lines, it does (n * (n - 1) / 2) comparisons. This is nC2 -- the number of combinations of n objects taken 2 at a time. So for 1600 input lines, you get almost 1.3 million results (1,279,200). This runs in about 15 seconds.
  6.  
  7. (b) With two input files, it compares every line in one file with every line in the other. So for n and m lines, you get (n * m) results. With 1400 against 1500 lines, you get 2.1 million results. This runs in about 45 seconds.
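The two counts above can be checked with a few lines of C. This is only a sketch of the arithmetic, not code from Sim.c; the function names are made up for illustration, and the nested loop is a guess at the order in which a one-file run would visit its pairs.

```c
/* Comparisons for one file of n lines: each unordered pair once. */
unsigned long pairs_one_file(unsigned long n)
{
    return n * (n - 1) / 2;
}

/* Comparisons for two files of n and m lines: every line against every line. */
unsigned long pairs_two_files(unsigned long n, unsigned long m)
{
    return n * m;
}

/* The same one-file count, obtained by actually walking the pairs. */
unsigned long walk_one_file(unsigned long n)
{
    unsigned long count = 0;
    for (unsigned long i = 0; i < n; i++)
        for (unsigned long j = i + 1; j < n; j++)
            count++;    /* a real run would compare line i with line j here */
    return count;
}
```

pairs_one_file(1600) gives 1279200 and pairs_two_files(1400, 1500) gives 2100000, matching the figures quoted.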
  8.  
  9. (2) There is now a -p option to remove punctuation, along with the case-insensitive and white-space options. Run Sim -H for the full help.
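To show what this kind of clean-up does to a name before comparison, here is a minimal sketch. It is not taken from Sim.c: the flag names and the normalize() function are hypothetical, and dropping white space entirely (rather than, say, collapsing runs of it) is an assumption about what the white-space option does.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical flags for illustration; Sim -H lists the real options. */
enum { FOLD_CASE = 1, DROP_SPACE = 2, DROP_PUNCT = 4 };

/* Rewrite s in place according to flags and return the new length. */
size_t normalize(char *s, int flags)
{
    size_t out = 0;
    for (size_t in = 0; s[in] != '\0'; in++) {
        unsigned char c = (unsigned char)s[in];
        if ((flags & DROP_PUNCT) && ispunct(c))
            continue;                       /* skip punctuation */
        if ((flags & DROP_SPACE) && isspace(c))
            continue;                       /* skip white space */
        if (flags & FOLD_CASE)
            c = (unsigned char)tolower(c);  /* compare case-insensitively */
        s[out++] = (char)c;
    }
    s[out] = '\0';
    return out;
}
```

With all three flags set, "Shake Rattle 'n' Roll" becomes "shakerattlenroll".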
  10.  
  11. There are some reusable functions in Sim.c.
  12.  
  13. .. fileLoader() gets a whole file into memory effectively, either from a named file or from stdin.
  14.  
  15. .. fileIndex() makes that file text into an array of strings in situ.
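For readers who want the general idea without opening Sim.c, the following sketch shows one common way to do both steps. It is not the code of fileLoader() or fileIndex(); the names load_stream() and index_lines(), and their signatures, are invented here. The first reads a whole stream into one growing buffer, and the second turns that buffer into an array of strings by overwriting each newline with a NUL, so no line is copied.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read all of fp into one malloc'd, NUL-terminated buffer; NULL on failure. */
char *load_stream(FILE *fp, size_t *len)
{
    size_t cap = 4096, used = 0, got;
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;
    while ((got = fread(buf + used, 1, cap - used - 1, fp)) > 0) {
        used += got;
        if (cap - used <= 1) {              /* buffer full: double it */
            char *bigger = realloc(buf, cap * 2);
            if (bigger == NULL) {
                free(buf);
                return NULL;
            }
            buf = bigger;
            cap *= 2;
        }
    }
    buf[used] = '\0';
    *len = used;
    return buf;
}

/* Split buf into lines in place: each '\n' becomes '\0', and the returned
   NULL-terminated array points at the start of each line. */
char **index_lines(char *buf, size_t len, size_t *count)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        if (buf[i] == '\n')
            n++;
    if (len > 0 && buf[len - 1] != '\n')
        n++;                                /* last line has no newline */

    char **lines = malloc((n + 1) * sizeof *lines);
    if (lines == NULL)
        return NULL;

    size_t k = 0;
    char *start = buf;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '\n') {
            buf[i] = '\0';
            lines[k++] = start;
            start = buf + i + 1;
        }
    }
    if (start < buf + len)
        lines[k++] = start;
    lines[k] = NULL;
    *count = k;
    return lines;
}
```

Passing stdin or a FILE opened with fopen() to load_stream() covers both input cases. The caller frees the line array and the buffer separately.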
  16.  
  17. I wrote a test package for this (also pasted here), using my ripped CD directory as test data.
  18.  
  19. The similarity rating needs careful inspection. The 1.000 "identical" rating is just the same as a direct comparison. Anything below about 0.800 is probably not a real match: any random string of words is going to match quite a few letters, simply because there are only 26 of them to choose from (and only 5 vowels). I used the -t 0.75 option to reduce the number of listed matches to under 2000. The interesting part is to decide where the best cutoff point comes, for any specific data set.
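The effect of a cutoff like -t 0.75 followed by an internal sort can be sketched as below. This is an illustration only, not the Sim.c code: struct match and keep_above() are hypothetical names, and treating the threshold as inclusive is an assumption.

```c
#include <stdlib.h>

struct match {
    double score;           /* similarity in the range 0.0 to 1.0 */
    const char *left;
    const char *right;
};

/* qsort comparator: highest score first. */
static int by_score_desc(const void *a, const void *b)
{
    const struct match *x = a, *y = b;
    return (x->score < y->score) - (x->score > y->score);
}

/* Keep only matches scoring at least threshold, then sort them best first.
   Returns the number kept; the kept entries occupy m[0..kept-1]. */
size_t keep_above(struct match *m, size_t n, double threshold)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++)
        if (m[i].score >= threshold)
            m[kept++] = m[i];
    qsort(m, kept, sizeof *m, by_score_desc);
    return kept;
}
```

Trying a few thresholds on the same result set and watching how many entries survive is one way to look for the cutoff point for a given data set.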
  20.  
  21. There are still some difficult areas. My raw track path names were like:
  22.  
  23. Harry Chapin/Story of a Life- The Harry Chapin Box Disc 1/01 Taxi [Live].mp3
  24.  
  25. That's no good for matching -- the artist name, album name, track number, [Live] and mp3 are all too common. The track is actually called "Taxi". All the other stuff is just going to dilute the similarity.
  26.  
  27. I got an exact match on "Taxi", probably because of that "x". But I got a one-letter false match on what are two entirely different tracks:
  28.  
  29. 2 0.900|Losing You|Loving You|
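This note does not say which metric Sim.c uses, so the following is a stand-in chosen purely to show why a one-letter difference scores so high. It computes 2 * LCS / (length of a + length of b), where LCS is the length of the longest common subsequence. The names lcs_length() and lcs_ratio() are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Length of the longest common subsequence of a and b (two-row DP). */
static size_t lcs_length(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    size_t *prev = calloc(lb + 1, sizeof *prev);
    size_t *cur = calloc(lb + 1, sizeof *cur);
    size_t result = 0;
    if (prev != NULL && cur != NULL) {
        for (size_t i = 1; i <= la; i++) {
            for (size_t j = 1; j <= lb; j++) {
                if (a[i - 1] == b[j - 1])
                    cur[j] = prev[j - 1] + 1;
                else
                    cur[j] = prev[j] > cur[j - 1] ? prev[j] : cur[j - 1];
            }
            size_t *tmp = prev;     /* the finished row becomes "prev" */
            prev = cur;
            cur = tmp;
        }
        result = prev[lb];
    }
    free(prev);
    free(cur);
    return result;
}

/* Stand-in similarity: 2 * LCS / (len(a) + len(b)); 1.0 for identical input. */
double lcs_ratio(const char *a, const char *b)
{
    size_t total = strlen(a) + strlen(b);
    if (total == 0)
        return 1.0;
    return 2.0 * (double)lcs_length(a, b) / (double)total;
}
```

Under this stand-in, "Losing You" and "Loving You" share 9 of their 10 characters in order, giving 18 / 20 = 0.900. That agrees with the figure listed, but the agreement does not establish that Sim.c works this way.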
  30.  
  31. So I convert those full path names into track names, with a serious awk script. It also removes .ini files and .jpg artwork. It reports what it dealt with, and how many times, like:
  32.  
  33. 167 Delete Art-Small.jpg
  34. 127 Delete desktop.ini
  35. 64 Delete unlabelled track
  36. 1419 Omit .mp3 suffix
  37. 12 Omit [Alternate Take]
  38. 6 Omit [Demo Version]
  39. 50 Omit [Live]
  40. 1403 Omit track prefix
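The awk script itself is not reproduced here. As a rough illustration of the "Omit" steps listed above, this C sketch reduces one path to a track name. The function track_name() is hypothetical, it handles only a trailing .mp3, one trailing bracketed tag and a leading numeric prefix, and it does not keep the per-rule counts.

```c
#include <ctype.h>
#include <string.h>

/* Reduce a path such as "Artist/Album/01 Taxi [Live].mp3" to "Taxi", in place.
   Returns a pointer into path at the start of the track name. */
char *track_name(char *path)
{
    /* Keep only the last path component. */
    char *name = strrchr(path, '/');
    name = (name != NULL) ? name + 1 : path;

    /* Omit a trailing ".mp3" suffix. */
    size_t len = strlen(name);
    if (len >= 4 && strcmp(name + len - 4, ".mp3") == 0) {
        len -= 4;
        name[len] = '\0';
    }

    /* Omit a trailing bracketed tag such as " [Live]". */
    if (len > 0 && name[len - 1] == ']') {
        char *open = strrchr(name, '[');
        if (open != NULL) {
            while (open > name && open[-1] == ' ')
                open--;
            *open = '\0';
        }
    }

    /* Omit a leading track-number prefix such as "01 ". */
    char *p = name;
    while (isdigit((unsigned char)*p))
        p++;
    if (p > name && *p == ' ') {
        while (*p == ' ')
            p++;
        name = p;
    }
    return name;
}
```

Applied to the Harry Chapin path quoted earlier, this returns "Taxi". A title that is nothing but digits is left alone, because the prefix rule requires a space after the number.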
  41.  
  42. However, that makes it hard to trace back to the original references to see where they came from. When I found three variations on this spelling:
  43.  
  44. 1 0.968|Shake Rattle 'n' Roll|Shake Rattle & Roll|
  45. 1 0.941|Shake Rattle and Roll|Shake Rattle 'n' Roll|
  46. 1 0.909|Shake Rattle and Roll|Shake Rattle & Roll|
  47.  
  48. I used grep to extract:
  49.  
  50. Buddy Holly/The Very Best of Buddy Holly/21 Shake Rattle & Roll.mp3
  51. The Red Stripe Band/Start Spreading The News/02 Shake Rattle 'n' Roll.mp3
  52. Various Artists/Dreamboats and Petticoats/19 Shake Rattle and Roll.mp3
  53.  
  54. That lookup would need to be an automated tool for a serious application. Also, any filtering is going to depend on what your specific names represent.
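One possible shape for such a tool, sketched under the assumption that each cleaned name is stored next to the path it came from. Neither struct track nor trace_back() exists in Sim.c or the awk script; they are illustrative names only.

```c
#include <stdio.h>
#include <string.h>

struct track {
    const char *name;   /* cleaned track name used for matching */
    const char *path;   /* original path it was derived from */
};

/* Print every original path whose cleaned name equals name; return the count. */
size_t trace_back(const struct track *t, size_t n, const char *name, FILE *out)
{
    size_t found = 0;
    for (size_t i = 0; i < n; i++) {
        if (strcmp(t[i].name, name) == 0) {
            fprintf(out, "%s\n", t[i].path);
            found++;
        }
    }
    return found;
}
```

Calling trace_back() once for each side of a reported match would replace the manual grep step, at the cost of keeping every original path in memory.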