Phonetic Matching
This short epilogue struck me.
This past Yom Kippur, my wife and I drove two hours to spend the afternoon at my aunt’s house, with my cousins. As the night drew on, conversation roamed from television shows and books to politics and philosophy. The circle grew as we touched on increasingly sensitive and challenging topics, drawing us in.
We didn’t agree, per se. We were engaging in debate as often as we were engaging in conversation. But we all love each other deeply, and the amount of care and restraint that went into how each person expressed their disagreement was palpable.
It's about someone who was using Levenshtein distance to match text phonetically and, along the way, learned about Soundex.
One way to start playing around with it is to put some stuff in a database: https://dev.mysql.com/doc/refman/8.4/en/string-functions.htm...
(or this module, https://www.postgresql.org/docs/current/fuzzystrmatch.html if you're stuck with PG)
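If you'd rather poke at the two ideas without a database, here is a minimal pure-Python sketch of American Soundex and Levenshtein distance. It is written from the textbook definitions, not taken from the article or from either database's implementation, and the function names are my own.

```python
def soundex(word: str) -> str:
    """American Soundex: first letter followed by three digits."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit

    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""

    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute), row-by-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]


for w1, w2 in [("Robert", "Rupert"), ("shore", "sure")]:
    print(w1, w2, soundex(w1), soundex(w2), levenshtein(w1.lower(), w2.lower()))
```

Soundex collapses each word to a coarse key, so it only answers "same bucket or not", while Levenshtein gives a graded distance but knows nothing about sound. Running Levenshtein over a phonetic encoding instead of the raw spelling is one way to combine them.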
This is one of those cases where inheriting a hacked-together piece of crap (English spelling) creates a lot of additional work higher up.
Another example is poetry. A regex can find rhymes in Polish. Same postfix == it rhymes.
In English it's a feat of engineering.
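A rough sketch of that suffix check, to make the contrast concrete. The helper name, the suffix length of three letters, and the example words are my own choices for illustration; real Polish rhyme detection would need more care than this.

```python
import re


def shares_suffix(a: str, b: str, n: int = 3) -> bool:
    """Naive rhyme test: do both words end in the same n letters?"""
    if len(a) < n:
        return False
    suffix = a.lower()[-n:]
    return re.search(re.escape(suffix) + r"$", b.lower()) is not None


# Polish: spelling tracks pronunciation closely, so the check holds up.
print(shares_suffix("miłość", "radość"))  # True, and they do rhyme

# English: the same check gives both false positives and false negatives.
print(shares_suffix("cough", "though"))   # True, but they do not rhyme
print(shares_suffix("blue", "shoe"))      # False, but they do rhyme
```

For English you would have to compare pronunciations (for example phoneme sequences from a pronouncing dictionary) rather than spellings, which is where the extra engineering comes in.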
I created this sheet[0] to teach my kid Tamil using Roman letters, and in the process figured it could be useful for kids learning other Indian languages as well.
Coming from a background of reading and speaking phonetic (Indian) languages, I think English would've been much nicer and more uniform if the vowels sounded the way they are written, especially the long forms.
Marking the long forms with a different vowel letter probably made it complex, especially given the lack of ii and uu.
For instance, "a" was used to lengthen "o", e.g. boat, goat. The correct spelling could've been boot, with the original boot spelled as buut.
By that logic, door is probably the only word that's written and pronounced phonetically correctly, with two o's.
Curious to know how such a phonetically correct spelling would aid encoding, matching and compression.
[0] https://docs.google.com/spreadsheets/d/15hdVh-oBUngTyigqDdjg...
Oh, hello, I didn't realize this was shared here! I guess let me know if anyone has any questions. I mostly wrote this piece as part of processing some hard feelings I've been having and seeing shared among Jewish folks around me, but I also ended up learning quite a bit about phonetic encoding algorithms, and I've spent several years at this point steeped in forced alignment via Storyteller.
I'm using a library, stable-ts, for a similar issue with short audio clips and it works well: https://github.com/jianfch/stable-ts/tree/main
Not sure how it will perform on something long like an audiobook.
Highly related to my paper on why tokenization in LLMs is the devil: https://paperswithcode.com/paper/most-language-models-can-be...
I also had to do this in my previous work: I took the phonetic embeddings of the reference and transcribed text and ran dynamic time warping on them.
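For anyone who hasn't seen dynamic time warping, a minimal sketch of the alignment step follows. It is not the setup described above: the phoneme lists are toy data, and a 0/1 mismatch cost stands in for a distance between phonetic embeddings. With real embeddings you would pass something like a cosine or Euclidean distance as `cost`.

```python
def dtw(ref, hyp, cost):
    """Return (total cost, alignment path) between two non-empty sequences.

    The path is a list of (ref_index, hyp_index) pairs; one item may be
    matched to several items on the other side.
    """
    n, m = len(ref), len(hyp)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = min(d[i - 1][j - 1], d[i - 1][j], d[i][j - 1])
            d[i][j] = cost(ref[i - 1], hyp[j - 1]) + step

    # Walk back from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((d[i - 1][j - 1], i - 1, j - 1),
                      (d[i - 1][j], i - 1, j),
                      (d[i][j - 1], i, j - 1))
    path.reverse()
    return d[n][m], path


def mismatch(a, b):
    # Stand-in for an embedding distance: 0 if equal, 1 otherwise.
    return 0.0 if a == b else 1.0


reference = ["h", "ə", "l", "oʊ"]         # toy phonemes for the book text
transcript = ["h", "ə", "l", "l", "oʊ"]   # toy phonemes for the ASR output
total, path = dtw(reference, transcript, mismatch)
print(total, path)
```

The appeal over plain edit distance is that DTW lets one reference unit stretch across several transcribed units (or the reverse), which suits speech where sounds get drawn out or repeated.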
I'm intrigued. Isn't this just done with a phonemizer?
from phonemizer.phonemize import phonemize

text = "hello world"

# Phonemize the same text with several English accents via espeak.
variations = [
    phonemize(text, backend="espeak", language="en-us", strip=True),
    phonemize(text, backend="espeak", language="en-gb", strip=True),
    phonemize(text, backend="espeak", language="en-au", strip=True),
]

for i, variation in enumerate(variations, 1):
    print(f"Variation {i}: {variation}")

I mean, espeak isn't the best, but a lot of folks in the ASR/speech world are still using it, right?
(NB: If you are on iOS, check out the built-in one: Settings -> Accessibility -> Spoken Content -> Pronunciations. When you add an entry, it can phonemize your spoken message to IPA. If someone can tell me which SDK/API they use for that, I'd love to know.)
The idea that "shore" and "sure" are pronounced "almost identically" would depend pretty heavily on your accent. The vowel is pretty different to me.
Also, the matches for "sorI" and "sorY" seem to me to misinterpret the words as ending in a pronounced vowel rather than a silent one. If you're using data meant for foreign surnames, whose rules may differ from English and in which silent vowels may be very rare depending on the original language, then of course you may mispronounce English words like this, saying both shore and sure as "sore-ee".
I'm sure there are much better ways to transcribe orthography to phonetics; people have probably published libraries that do it. From some googling, it seems some people call this type of library a phonemic transcriber or IPA transcriber.