expressions like (...){1,256} are very heavyweight and the scala JS code ends up...

noduerme · on Sept 11, 2023

Just wondering, what is it about testing repetition [a-z]{1,256} with an upper bound that's so heavy? Intuitively it feels like greedy testing [a-z]+ should actually be worse since it has to work back from the end of the input.

rntz · on Sept 11, 2023

This depends heavily on how repetition is implemented.

With a backtracking-search implementation of regexes, bounded iteration is pretty easy.

But the linked webpage appears to compile regexes to finite state machines (it shows you their finite-state-machine, for instance), and eg [a-z]{1,256} will have 256 states: 256 times the 1 state needed for [a-z]. If [a-z] were a complex regex, you could get a combinatorial explosion.

This alone probably isn't the issue? 256 is not a very large number. But I suspect there are follow-on algorithmic issues. This is just speculation, but I wouldn't be surprised if that 256-state machine were computed by applying DFA minimization, an algorithm with worst-case exponential running time, to a more naively generated machine.

d66 · on Sept 11, 2023

you're right. inclusion/intersection/etc. aren't actually computed via DFA but instead are computed directly on the regular expression representation itself. and large disjunctions (with 256 branches) are what is very heavy.

(it's possible to instead do these operations on DFAs but at the time i found it hard to get from an automata back to a reasonable-looking regular expression.)

d66 · on Sept 11, 2023

the library uses a fairly simple data representation where x{m,n} is compiled using conjunction and disjunction. so x{1,4} ends up being represented as x|xx|xxx|xxxx.

this simplifies the code for testing equality and inclusion, since logically x{n} is just xx... (n times) and x{m,n} is just x{m}|x{m+1}|...|x{n}.

but when you have x{m,n} and n-m is large you can imagine what kind of problems that causes.

yorwba · on Sept 11, 2023

Seems like it should at least be x(|x(|x(|x))) instead of x|xx|xxx|xxxx to avoid quadratic blow-up.

d66 · on Sept 11, 2023

yes, that is the actual construction: the disjunction data type only supports a lhs and rhs, so that is the only possible way to represent it.

i wrote it the way i did for clarity in the comments.