Hacker News new | past | comments | ask | show | jobs | submit login

Say you want to compute all strings of length 5 that the automaton can generate. Conceptually the nicest way is to create an automaton that matches any five characters and then compute the intersection between that automaton and the regex automaton. Then you can generate all the strings in the intersection automaton. Of course, IRL, you wouldn't actually generate the intersection automaton (you can easily do this on the fly), but you get the idea.

Automata are really a lost art in modern natural language processing. We used to do things like store a large vocabulary in an deterministic acyclic minimized automaton (nice and compact, so-called dictionary automaton). And then to find, say all words within Levenshtein distance 2 of hacker, create a Levenshtein automaton for hacker and then compute (on the fly) the intersection between the Levenshtein automaton and the dictionary automaton. The language of the automaton is then all words within the intersection automaton.

I wrote a Java package a decade ago that implements some of this stuff:

https://github.com/danieldk/dictomaton




> deterministic acyclic minimized automaton

That's basically a Trie right? To be fair I have only heard of them and know they can be used to do neat tricks, I've rarely used one myself.


If you do not plan to update it often and don’t need to store extra data with each word, a dawg (https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_s...) is more compact. You often can merge leaf nodes.

For example, if you have words

   talk
   talked
   talking
   talks
   walk
   walked
   walking
   walks
there’s no need to repeat the “”, “ed”, “ing”, “s” parts.


No. In a minimized automaton shared string suffixes also share states/transitions.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: