Effective substring in Rust

nhellman · 2024-06-11T06:10:03 1718086203

Why return an owned String instead of a borrowed str slice? A substring can always be borrowed and should never require a heap allocation.
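For illustration, a minimal sketch of a char-indexed substring that borrows from its input rather than allocating. The name `substring` and its clamping behaviour are assumptions made for this example, not the article's actual code:

```rust
/// Hypothetical helper: returns chars [start, end) of `s` as a borrowed slice.
/// Out-of-range indices are clamped instead of panicking (an assumption).
fn substring(s: &str, start: usize, end: usize) -> &str {
    if end <= start {
        return "";
    }
    // Byte offsets at which each char begins.
    let mut offsets = s.char_indices().map(|(i, _)| i);
    // Byte offset of char number `start`; empty result if the string is too short.
    let from = match offsets.nth(start) {
        Some(i) => i,
        None => return "",
    };
    // The iterator now sits at char `start + 1`, so skip ahead to char `end`.
    let to = offsets.nth(end - start - 1).unwrap_or(s.len());
    // Both offsets are char boundaries, so this slice cannot panic.
    &s[from..to]
}
```

The returned slice shares the input's lifetime, so a caller who needs an owned value can still call `.to_owned()` on the result.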

letmutex · 2024-06-11T07:18:09 1718090289

Thanks for pointing that out, a str slice is surely better. I returned String in all versions just because collect() returns a String and I wanted to reuse the same test function. Will update this later.

Joker_vD · 2024-06-11T11:56:09 1718106969

It's a trade-off. I once saw a memory leak in some JSON-handling code that was caused precisely by the JSON parser returning a view into the underlying raw input: we stored one or two string fields from a large-ish (5 MiB IIRC) JSON and it prevented that whole JSON blob from being garbage collected.

kevincox · 2024-06-11T12:14:53 1718108093

This probably won't be an issue in Rust because there are very few situations where taking a reference can extend the lifetime of an object. It can really only happen for temporary objects, whose lifetime can be extended to that of the function. It isn't nearly the same as a GCed language, where it is very easy to accidentally keep an object alive for a long time.
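A rough sketch of how this trade-off tends to look in Rust. `Record`, `field`, and `load` are hypothetical names, and splitting on a comma is only a stand-in for a real parser. A borrowed field cannot outlive the buffer it points into, so keeping it around means copying just those bytes explicitly:

```rust
/// Hypothetical record that keeps one small field from a large input.
struct Record {
    name: String, // owned copy, independent of the input buffer
}

/// Stand-in for a parser that returns a view into the raw input.
/// The returned slice borrows from `input` and cannot outlive it.
fn field(input: &str) -> &str {
    input.split(',').next().unwrap_or("")
}

fn load() -> Record {
    // Large-ish input, roughly 5 MiB.
    let blob = String::from("alice,") + &"x".repeat(5 * 1024 * 1024);
    // Borrowed view: no allocation, but tied to `blob`.
    let name = field(&blob);
    // Copy only the few bytes that need to outlive `blob`.
    Record { name: name.to_owned() }
    // `blob` is freed here. Returning a struct that held `name` as a &str
    // would be rejected by the borrow checker instead of leaking.
}
```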

virexene · 2024-06-11T15:20:04 1718119204

indeed, Rust can be fast, if you know what you're doing. this indeed means not making unnecessary allocations, but in my opinion, it also means... using byte-based indices instead of insisting on char-based indices, ie. Version 0, which OP so quickly dismissed.

using char-based indices means you have to convert back to byte-based indices, a linear time operation, every time you want to do much of anything with them. this is a silly performance loss, since you probably got those char-based indices by iterating over the string in the first place: you're doing redundant work, which could be avoided by using char_indices() in your initial iteration and keeping those byte-based indices for later manipulation. this is why that iterator exists, really.
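A small sketch of that idea, with `split_at_first` as a hypothetical helper name: the byte offset found during the single char_indices() pass is used directly for slicing, so nothing is recounted.

```rust
/// Hypothetical helper: splits `s` just before the first occurrence of `sep`.
/// Returns the whole string and an empty tail if `sep` is absent.
fn split_at_first(s: &str, sep: char) -> (&str, &str) {
    for (i, c) in s.char_indices() {
        if c == sep {
            // `i` is a byte offset on a char boundary, so slicing is O(1)
            // and cannot panic.
            return (&s[..i], &s[i..]);
        }
    }
    (s, "")
}
```

A char-counting version would have to walk the string a second time to turn the char position back into a byte offset before it could slice.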

you might ask: "but then if I do +1 to get the index of the next character, it might fall in the middle of a multibyte character and the substringing will panic!" yes, you will need to use char::len_utf8 or char_indices to offset your indices (forwards or backwards: CharIndices is a DoubleEndedIterator!). but this is less work than adding one... and then recounting characters from the beginning.
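As a sketch of the offsetting described above (the helper names `next_boundary` and `prev_boundary` are made up for this example), moving a byte index one char forwards or backwards only needs to look at the neighbouring char:

```rust
/// Byte index just past the char that starts at byte `i`, or None at the end.
/// Assumes `i` is already on a char boundary; slicing panics otherwise.
fn next_boundary(s: &str, i: usize) -> Option<usize> {
    s[i..].chars().next().map(|c| i + c.len_utf8())
}

/// Byte index where the char ending at byte `i` starts, or None at the start.
/// Relies on CharIndices being a DoubleEndedIterator. Same assumption on `i`.
fn prev_boundary(s: &str, i: usize) -> Option<usize> {
    s[..i].char_indices().next_back().map(|(j, _)| j)
}
```

Neither helper scans from the beginning of the string: each decodes a single char adjacent to `i`.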

and importantly, +1 isn't even really appropriate to do with char-based indices either. there are many "characters" in the user-perceived sense of the word that are made up of multiple chars, and while cutting in the middle of one won't panic, it also won't give the user-friendly cutting you're expecting. just try your code with a flag emoji and see what happens: you'll split it into two weird "residual" characters.
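To make the flag example concrete, a short sketch using only the standard library. `first_chars` is a hypothetical char-based "substring" used purely for illustration:

```rust
/// Hypothetical char-based prefix: keeps the first `n` chars of `s`.
fn first_chars(s: &str, n: usize) -> String {
    s.chars().take(n).collect()
}

/// The French flag is two regional-indicator chars (F and R) that render
/// as a single user-perceived character.
const FLAG_FR: &str = "\u{1F1EB}\u{1F1F7}";

/// "e" followed by a combining acute accent: also one user-perceived character.
const E_ACUTE: &str = "e\u{301}";
```

Under these definitions, `first_chars(FLAG_FR, 1)` does not panic but yields a lone regional indicator rather than a flag, and `first_chars(E_ACUTE, 1)` silently drops the accent.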

if you care about that in your specific application, the solution is to iterate on an even bigger unit than chars: (extended) grapheme clusters, or EGCs. and because EGC segmentation is quite a bit more demanding than simple char-based iteration, using EGC-based numerical indices (which I believe Swift does by default?) is an even bigger waste of CPU time. in my opinion, you need to fully let go of the assumption that characters can be given consecutive numerical indices in a performant way, and once again, use byte-based indices along with the appropriate EGC-aware methods for acquiring and offsetting them.

duped · 2024-06-11T14:46:25 1718117185

You almost never want to compute a substring by codepoint boundaries, which is why this operation isn't in the standard library.

You should be looking at this crate: https://docs.rs/unicode-segmentation/

faqinghere · 2024-06-11T13:45:07 1718113507

Caution, char != grapheme. Did you try it with combining characters?

letmutex · 2024-06-11T14:50:33 1718117433

Yes, combining characters and ZWJ emojis require additional processing, just like other languages (JavaScript, Java, etc.).

https://doc.rust-lang.org/std/primitive.str.html#method.char... https://github.com/letmutex/rust-substring/blob/992b826797a5...

usr1106 · 2024-06-11T06:21:18 1718086878

Effective or efficient?

letmutex · 2024-06-11T07:22:34 1718090554

It's 'effective' in the title, I stole it from some book names. But in the content, it should be 'efficient', thanks for pointing it out.