- I had a major rethink about Sim.c, the C program which does fuzzy comparisons. It was clumsy to use, and needed to be called a lot of times. I rewrote it to manage all the names in memory at once, and to sort the results internally, and also added a couple more options.
- (1) It now works on either one or two files:
- (a) With one input file, it compares each line with every other line for similarity. So for n lines, it does n * (n - 1) / 2 comparisons. This is nCr with r = 2: combinations of n objects taken 2 at a time. So for 1600 input lines, you get almost 1.3 million results. This runs in about 15 seconds.
- (b) With two input files, it compares every line in one file with every line in the other. So for n and m lines, you get (n * m) results. With 1400 against 1500 lines, you get 2.1 million results. This runs in about 45 seconds.
- (2) There is now a -p option to remove punctuation, along with the case-insensitive and white-space options. Run Sim -H for the full help.
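- The result counts in (1) are easy to check. This is a minimal sketch in C, not code from Sim.c: the function names are made up for illustration, and count_by_loop() only assumes that the one-file mode walks each unordered pair of lines once.

```c
/* Pairs produced by comparing each of n lines with every other line once. */
unsigned long pairs_one_file(unsigned long n)
{
    return n * (n - 1) / 2;
}

/* Pairs produced by comparing every line of one file with every line of another. */
unsigned long pairs_two_files(unsigned long n, unsigned long m)
{
    return n * m;
}

/* The one-file count again, obtained by walking the nested loops. */
unsigned long count_by_loop(unsigned long n)
{
    unsigned long count = 0;
    for (unsigned long i = 0; i < n; i++)
        for (unsigned long j = i + 1; j < n; j++)
            count++;    /* a real run would compare line i with line j here */
    return count;
}
```

- pairs_one_file(1600) gives 1279200 and pairs_two_files(1400, 1500) gives 2100000, which agree with the "almost 1.3 million" and "2.1 million" figures above.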
- There are some reusable functions in Sim.c.
- .. fileLoader() gets a whole file into memory effectively, either from a named file or from stdin.
- .. fileIndex() makes that file text into an array of strings in situ.
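- For anyone who has not seen the technique, here is a rough sketch of the load-then-index-in-place idea. The names load_stream() and index_lines() and their signatures are hypothetical stand-ins, not the actual fileLoader() and fileIndex() from Sim.c, which may work differently.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for fileLoader(): read everything from fp into one
 * NUL-terminated heap buffer. It never seeks, so stdin works as well. */
char *load_stream(FILE *fp, size_t *len_out)
{
    size_t cap = 4096, len = 0;
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;
    for (;;) {
        if (len + 1 >= cap) {               /* keep room for the final NUL */
            char *bigger = realloc(buf, cap * 2);
            if (bigger == NULL) {
                free(buf);
                return NULL;
            }
            buf = bigger;
            cap *= 2;
        }
        size_t got = fread(buf + len, 1, cap - len - 1, fp);
        if (got == 0)
            break;                          /* end of input (or read error) */
        len += got;
    }
    buf[len] = '\0';
    *len_out = len;
    return buf;
}

/* Hypothetical stand-in for fileIndex(): overwrite each '\n' with '\0' and
 * return a NULL-terminated array of pointers to the line starts.
 * Assumes buf[len] is already '\0', as load_stream() guarantees. */
char **index_lines(char *buf, size_t len, size_t *count_out)
{
    size_t lines = 0;
    for (size_t i = 0; i < len; i++)
        if (buf[i] == '\n')
            lines++;
    if (len > 0 && buf[len - 1] != '\n')
        lines++;                            /* last line has no newline */

    char **index = malloc((lines + 1) * sizeof *index);
    if (index == NULL)
        return NULL;

    size_t n = 0;
    char *start = buf;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '\n') {
            buf[i] = '\0';                  /* terminate the line in situ */
            index[n++] = start;
            start = buf + i + 1;
        }
    }
    if (start < buf + len)
        index[n++] = start;
    index[n] = NULL;
    *count_out = n;
    return index;
}
```

- The point of doing it in situ is that no line is ever copied: the index is just pointers into the one buffer, so the whole file costs one buffer plus one pointer per line.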
- I wrote a test package for this (also pasted here), using my ripped CD directory as test data.
- The similarity rating needs careful inspection. The 1.000 "identical" rating is just the same as a direct comparison. Anything below about 0.800 is probably not a real match: any random string of words is going to match quite a few letters, simply because there are only 26 of them to choose from (and only 5 vowels). I used the -t 0.75 option to reduce the number of listed matches to under 2000. The interesting part is to decide where the best cutoff point comes, for any specific data set.
- There are still some difficult areas. My raw track path names were like:
- Harry Chapin/Story of a Life- The Harry Chapin Box Disc 1/01 Taxi [Live].mp3
- That's no good for matching -- the artist name, album name, track number, [Live] and mp3 are all too common. The track is actually called "Taxi". All the other stuff is just going to dilute the similarity.
- I got an exact match on "Taxi", probably because of that "x". But I got a one-letter false match on what are two entirely different tracks:
- 2 0.900|Losing You|Loving You|
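- To see how one changed letter can score 0.900, here is a sketch of one common similarity measure: twice the length of the longest common subsequence, divided by the combined length of the two strings. This is an assumption for illustration only. The actual formula in Sim.c is not shown in this post and may be different. It does reproduce the figure for this pair: the two names are 10 characters each and share 9 in order, so 2 * 9 / 20 = 0.900.

```c
#include <stdlib.h>
#include <string.h>

/* Length of the longest common subsequence of a and b, using a two-row
 * dynamic programming table. */
size_t lcs_length(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    size_t *prev = calloc(lb + 1, sizeof *prev);
    size_t *curr = calloc(lb + 1, sizeof *curr);
    size_t result = 0;

    if (prev != NULL && curr != NULL) {
        for (size_t i = 1; i <= la; i++) {
            for (size_t j = 1; j <= lb; j++) {
                if (a[i - 1] == b[j - 1])
                    curr[j] = prev[j - 1] + 1;
                else
                    curr[j] = prev[j] > curr[j - 1] ? prev[j] : curr[j - 1];
            }
            size_t *tmp = prev;             /* current row becomes previous */
            prev = curr;
            curr = tmp;
        }
        result = prev[lb];
    }
    free(prev);
    free(curr);
    return result;
}

/* Assumed rating: 2 * LCS / (len(a) + len(b)), from 0.0 up to 1.0 for
 * identical strings. Two empty strings are treated as identical. */
double similarity(const char *a, const char *b)
{
    size_t total = strlen(a) + strlen(b);
    if (total == 0)
        return 1.0;
    return 2.0 * (double)lcs_length(a, b) / (double)total;
}
```

- With a measure like this, short names are the risky ones: a single different letter in a 10-character name still leaves 90% of it matching.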
- So I convert those full path names into track names, with a serious awk script. It also removes .ini files and .jpg artwork. It comments on what it deals with, and how many times, like:
- 167 Delete Art-Small.jpg
- 127 Delete desktop.ini
- 64 Delete unlabelled track
- 1419 Omit .mp3 suffix
- 12 Omit [Alternate Take]
- 6 Omit [Demo Version]
- 50 Omit [Live]
- 1403 Omit track prefix
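- The awk script itself is pasted separately. As a rough illustration of the "Omit" rules only, here is a sketch in C. The rule set, the function names, and the assumption that each tag is preceded by a single space are guesses based on the listing above, not a translation of the awk script.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Remove suffix from the end of s in place. Returns 1 if it was removed. */
int strip_suffix(char *s, const char *suffix)
{
    size_t ls = strlen(s), lx = strlen(suffix);
    if (lx > ls || strcmp(s + ls - lx, suffix) != 0)
        return 0;
    s[ls - lx] = '\0';
    return 1;
}

/* Reduce "Artist/Album/NN Title [Tag].mp3" to "Title" in out. */
void track_name(const char *path, char *out, size_t out_size)
{
    if (out_size == 0)
        return;

    /* Keep only the part after the last '/'. */
    const char *base = strrchr(path, '/');
    base = base ? base + 1 : path;

    /* Omit track prefix: leading digits followed by one space, e.g. "01 ". */
    const char *p = base;
    while (isdigit((unsigned char)*p))
        p++;
    if (p > base && *p == ' ')
        base = p + 1;

    snprintf(out, out_size, "%s", base);

    /* Omit the suffix first, then any one of the bracketed tags. */
    strip_suffix(out, ".mp3");
    if (!strip_suffix(out, " [Live]") &&
        !strip_suffix(out, " [Demo Version]"))
        strip_suffix(out, " [Alternate Take]");
}
```

- Given the Harry Chapin path above, track_name() leaves just "Taxi" in the output buffer.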
- However, that makes it hard to trace a track name back to the original path to see where it came from. When I found three variations on this spelling:
- 1 0.968|Shake Rattle 'n' Roll|Shake Rattle & Roll|
- 1 0.941|Shake Rattle and Roll|Shake Rattle 'n' Roll|
- 1 0.909|Shake Rattle and Roll|Shake Rattle & Roll|
- I used grep to extract:
- Buddy Holly/The Very Best of Buddy Holly/21 Shake Rattle & Roll.mp3
- The Red Stripe Band/Start Spreading The News/02 Shake Rattle 'n' Roll.mp3
- Various Artists/Dreamboats and Petticoats/19 Shake Rattle and Roll.mp3
- That trace-back is going to need an automated tool for a serious application. Also, any filtering is going to depend on what your specific names represent.
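- One possible shape for such a tool, sketched in C: keep each cleaned track name paired with the path it came from, then look the name up after matching. The struct and function here are hypothetical and not part of Sim.c or the test package.

```c
#include <string.h>

/* One cleaned track name together with the path it was derived from. */
struct track_ref {
    const char *name;
    const char *path;
};

/* Store in hits[] the paths whose cleaned name equals name exactly.
 * Returns the number stored, which is never more than max_hits. */
size_t find_sources(const struct track_ref *refs, size_t n,
                    const char *name, const char **hits, size_t max_hits)
{
    size_t found = 0;
    for (size_t i = 0; i < n && found < max_hits; i++)
        if (strcmp(refs[i].name, name) == 0)
            hits[found++] = refs[i].path;
    return found;
}
```

- Looking up "Shake Rattle & Roll" in a table built from the three paths above would return the Buddy Holly path, without a separate grep pass over the raw listing.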