Did they publish their implementation (on GitHub, etc)? Very interested in seeing the actual implementation.
Thanks for this. The exact same problem I’m trying to resolve.
Any Python implementation?
Nice work!
I work with huge consumer datasets (200-300 million records) and am using Jaro-Winkler (JW) for name matching, and it takes us a long time to process the data even after string normalization.
How do you convert a String into bits WITHOUT iterating over each character in the string and mapping it against the letters A-Z in Java?
If you iterate over each character of both Strings (both names) to map them, wouldn't that be time-consuming?
Please tell me how you managed it.
+Shashwat Kaundinya We did iterate over each character to create the bitmaps. There are a couple of reasons why this worked for us:
1. The bitmaps get used many times, so the cost of calculating them is spread out over multiple searches.
2. They're pretty quick to create. Our 64-bit implementation creates about 2.3 million bitmaps per second, so it could encode your full set in a couple of minutes. You could speed this up with a lookup table, and if you restrict the character range to A-Z then you could use the character values directly: something like
bitmap = bitmap | (1L
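The expression above is cut off in the comment as posted, so what follows is only a sketch of one way to set one bit per letter, not the authors' code. It assumes that only the letters A-Z matter; the names `NameBitmap`, `bitmap`, and `sharedLetters` are made up for illustration.

```java
final class NameBitmap {
    // Builds a 64-bit bitmap with one bit per distinct letter A-Z.
    // Assumption: characters outside A-Z (after upper-casing) are ignored.
    static long bitmap(String name) {
        long bits = 0L;
        for (int i = 0; i < name.length(); i++) {
            char c = Character.toUpperCase(name.charAt(i));
            if (c >= 'A' && c <= 'Z') {
                bits |= 1L << (c - 'A'); // use the character value directly as the bit index
            }
        }
        return bits;
    }

    // Counts the distinct letters two names have in common.
    static int sharedLetters(long a, long b) {
        return Long.bitCount(a & b);
    }

    public static void main(String[] args) {
        long a = bitmap("MARTHA");
        long b = bitmap("MARHTA");
        System.out.println(sharedLetters(a, b)); // prints 5
    }
}
```

Because each bitmap is computed once per record and then reused, the per-character loop is paid only at encoding time; each later comparison is a single AND plus a bit count.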
+Kyle Putnam Thanks for the reply.
I work with data from banks and insurance companies, and they provide it in text files, so pre-processing and remodeling it should be the way to go.
I'll try computing those bitmaps for each record and storing them in a new file alongside the existing data.
In my case, I have to work with not just names but also addresses, IDs, phone numbers, emails, and Twitter accounts.
I will sit down with my team soon to decide how best to use your technique for our case.
Thanks, man.
+Seth Verrinder Thanks for the reply.
Rosette name matching (www.rosette.com/capability/name-matching/#tech-specs) solves phonetic similarity, transliteration, nicknames, missing spaces or hyphens, titles and honorifics, truncated name components, missing name components, out-of-order name components, initials, names split inconsistently across database fields, the same name in multiple languages, semantically similar names, and semantically similar names across languages.
Hi,
Congratulations. This is terrific.
I have a question: how do you deal with spacing and symbol characters?
Simple preprocessing, usually as step 1.
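The reply does not say what the preprocessing consists of, so this is a minimal sketch under assumptions: it strips accents, upper-cases the name, and drops everything that is not A-Z, spaces included. Whether spaces should be removed or kept as token separators depends on the matcher. `NameNormalizer` is a hypothetical name.

```java
import java.text.Normalizer;
import java.util.Locale;

final class NameNormalizer {
    // Assumption: the matcher only needs the letters A-Z.
    static String normalize(String raw) {
        // NFD splits accented letters into a base letter plus combining marks.
        String decomposed = Normalizer.normalize(raw, Normalizer.Form.NFD);
        String upper = decomposed.toUpperCase(Locale.ROOT);
        // Keep only A-Z: removes spaces, punctuation, digits, and combining marks.
        return upper.replaceAll("[^A-Z]", "");
    }

    public static void main(String[] args) {
        System.out.println(normalize(" O'Brien-Smith ")); // prints OBRIENSMITH
    }
}
```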
As I watched this, I couldn't help thinking the problem was being solved in the wrong domain! Who cares about the lengths and vagaries of English spelling, especially transliterations of foreign alphabets? To improve quality (reduce false positives and negatives), I was thinking, don't we want to match on the sounds of these names? That got me looking up phonetic algorithms.
I wonder whether KP & SV considered preprocessing the dataset by indexing it with something like one of the Metaphone algorithms? They could then have performed their bitwise matching on the reduced dataset produced by exact matches on the short, standard-length keys that the phonetic algorithm generates. Maybe they weren't allowed to.
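The indexing idea in this comment can be sketched roughly as follows. This is an illustration, not something the speakers are known to have done: names are grouped under a phonetic key, and a query is compared only against names in the same group. The key function is passed in, so any function that maps a name to a phonetic key (a Metaphone implementation, for instance) can be plugged in; `PhoneticIndex` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class PhoneticIndex {
    private final Map<String, List<String>> buckets = new HashMap<>();
    private final Function<String, String> keyFn; // assumed: name -> phonetic key

    PhoneticIndex(Function<String, String> keyFn) {
        this.keyFn = keyFn;
    }

    // Files the name under its phonetic key.
    void add(String name) {
        buckets.computeIfAbsent(keyFn.apply(name), k -> new ArrayList<>()).add(name);
    }

    // Returns only the names whose key exactly matches the query's key;
    // the more expensive fuzzy comparison would then run on this smaller list.
    List<String> candidates(String query) {
        return buckets.getOrDefault(keyFn.apply(query), Collections.emptyList());
    }
}
```

The trade-off is that any pair of names the key function maps to different keys is never compared, so the quality of the phonetic key caps the recall of the whole pipeline.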
Robin Betts Exactly my thoughts too, and Lucene/Solr provides support for foreign languages as well.
@First Last At a high level, I think what Karthy is referencing is called Soundex (en.wikipedia.org/wiki/Soundex). It doesn't convert words to a sound wave; instead it tries to match on the consonants (skipping vowels unless the word begins with one, and replacing the consonants with digits, as the wiki explains). Comparing waveforms wouldn't be efficient, given how much more data the waveform of a word takes up compared with its string representation.
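For reference, here is a compact sketch of American Soundex as commonly described: keep the first letter, map the remaining consonants to digits, collapse adjacent letters with the same code (including across H or W), and pad with zeros to four characters. The class name `Soundex` is illustrative, and real data may call for a different variant.

```java
final class Soundex {
    // Digit codes for A..Z; '0' marks letters that produce no digit (vowels, H, W, Y).
    private static final String CODES = "01230120022455012623010202";

    static String encode(String name) {
        StringBuilder out = new StringBuilder(4);
        char prev = '0';
        for (int i = 0; i < name.length() && out.length() < 4; i++) {
            char c = Character.toUpperCase(name.charAt(i));
            if (c < 'A' || c > 'Z') continue;          // ignore non-letters
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c);                         // the first letter is kept as-is
                prev = code;
            } else {
                if (code != '0' && code != prev) out.append(code);
                if (c != 'H' && c != 'W') prev = code; // H and W do not separate equal codes
            }
        }
        if (out.length() == 0) return "";              // no letters at all
        while (out.length() < 4) out.append('0');      // pad to letter + 3 digits
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("Robert")); // prints R163
        System.out.println(encode("Rupert")); // prints R163
    }
}
```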
How did you build the bit masks for each string?
What if the names had Mr., Mrs., Sgt., Dr., or any other salutations? How would this be tackled?