
Under the new string model in Java > 8, a fairly frequent workflow is:

1) get external string

2) figure out if it is UTF-8, UTF-16, or some other recognizable encoding

3) validate the byte stream

4) figure out if the code points in the incoming string can be represented in Latin-1

5) instantiate a Java string using either the Latin-1 encoder or the UTF-16 encoder

I know some or all of these steps are done using HotSpot intrinsics, and then the JIT/VM does inlining, folding and so on, but I wonder how fast a custom assembly function that does all these steps at once could be.
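
For reference, steps 3 to 5 above can be sketched in plain Java with the standard library. This is only an illustration, assuming the input is already known to be UTF-8 (step 2 is skipped); the class and method names are made up for the example. Note that user code does not pick the internal coder: with compact strings the JDK decides between Latin-1 and UTF-16 storage itself, so `fitsLatin1` only shows what step 4 checks.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the JDK.
final class ExternalStringDemo {

    // Steps 3 and 5: strictly validate and decode UTF-8.
    // Returns null on malformed input. By contrast, new String(bytes, UTF_8)
    // silently replaces malformed sequences with U+FFFD.
    static String decodeStrictOrNull(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    // Step 4: true if every char is at most U+00FF, i.e. representable in Latin-1.
    static boolean fitsLatin1(String s) {
        return s.chars().allMatch(c -> c <= 0xFF);
    }
}
```

Each of these calls walks the data separately, which is the overhead a fused routine would try to avoid.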



You might be interested in his blog on the same subject a few days ago: https://lemire.me/blog/2018/10/16/validating-utf-8-bytes-jav...


If you are given the external string as bytes (which is all you can have if you don't know the encoding), then I would have thought steps 2, 3, and 4 can all be done as one step. Something like https://github.com/adamretter/utf8-validator/blob/optimize-u...
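
A rough sketch of that idea, covering only the UTF-8 case: a single scalar pass over the bytes that validates the encoding and, at the same time, records whether every code point is at most U+00FF. This is not the code from the linked repository, and the class name and return constants are invented for the example. A real fused routine would presumably also add an ASCII fast path or SIMD.

```java
// Hypothetical single-pass scanner, not taken from any library.
final class Utf8Scan {
    static final int INVALID = -1; // not well-formed UTF-8
    static final int LATIN1 = 0;   // valid, all code points <= U+00FF
    static final int WIDE = 1;     // valid, but needs UTF-16 storage

    static int scan(byte[] b) {
        boolean latin1 = true;
        int i = 0;
        final int n = b.length;
        while (i < n) {
            int b0 = b[i] & 0xFF;
            if (b0 < 0x80) { i++; continue; }          // ASCII
            int need, cp, min;                          // continuation count, code point, smallest legal value
            if (b0 >= 0xC2 && b0 <= 0xDF)      { need = 1; cp = b0 & 0x1F; min = 0x80; }
            else if (b0 >= 0xE0 && b0 <= 0xEF) { need = 2; cp = b0 & 0x0F; min = 0x800; }
            else if (b0 >= 0xF0 && b0 <= 0xF4) { need = 3; cp = b0 & 0x07; min = 0x10000; }
            else return INVALID;                        // stray continuation, 0xC0/0xC1, or above 0xF4
            if (i + need >= n) return INVALID;          // truncated sequence
            for (int k = 1; k <= need; k++) {
                int c = b[i + k] & 0xFF;
                if ((c & 0xC0) != 0x80) return INVALID; // not a continuation byte
                cp = (cp << 6) | (c & 0x3F);
            }
            // Reject overlong forms, values past U+10FFFF, and surrogates.
            if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return INVALID;
            if (cp > 0xFF) latin1 = false;
            i += need + 1;
        }
        return latin1 ? LATIN1 : WIDE;
    }
}
```

With the result in hand, the caller knows up front whether the bytes are valid and which internal representation would be enough, without decoding twice.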



