
> UTF-8 regular expression matching shouldn't be different from ASCII at all, as far as I can tell.

Not really. First, the '.' operator needs to work differently: it has to match a whole code point rather than a single byte. This can still be done fast, but if you actually want to claim Unicode support you should also consider:

- Unicode collation orders

- Unicode string equivalence algorithms

- Unicode normalization algorithms

These will also take care of all the Unicode weirdness that no one actually uses or cares about, but that you still must implement to claim compatibility: presentation forms, combining diacritics, ligatures, double-width characters and other odd stuff.
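
To make that concrete, here is a rough Python sketch of two of the points above: '.' behaving differently on bytes versus code points, and normalization deciding whether two strings are "the same". It only uses the standard `re` and `unicodedata` modules; the helper name `canonically_equal` is made up for illustration, and a real regex engine would do far more than this (collation, for one, isn't covered by the standard library at all).

```python
import re
import unicodedata

text = "\u00e9"                 # "é" as one precomposed code point
raw = text.encode("utf-8")      # the same character as two bytes: C3 A9

# A byte-oriented '.' consumes one byte, i.e. only half of the character.
byte_match = re.match(rb".", raw).group()
# A code-point-oriented '.' consumes the whole character.
char_match = re.match(r".", text).group()

composed = "\u00e9"             # é, precomposed
decomposed = "e\u0301"          # e followed by COMBINING ACUTE ACCENT

def canonically_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# Compatibility normalization (NFKC) folds the "fi" ligature into plain letters.
ligature = "\ufb01"
folded = unicodedata.normalize("NFKC", ligature)

print(byte_match, char_match)
print(composed == decomposed, canonically_equal(composed, decomposed))
print(re.fullmatch(r".", decomposed))   # '.' sees two code points, so no match
print(folded)
```

Note that even the code-point '.' does not match the decomposed form as a single "character", which is exactly why normalization (or grapheme-aware matching) has to come into the picture if you want to claim real Unicode support.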



