Calculate the difference and intersection of any two regexes (phylactery.org)
353 points by posco on Sept 11, 2023 | 117 comments



I created a similar regex web demo that shows how a regex is parsed -> NFA -> DFA -> minimal DFA, and finally outputs LLVM IR/JavaScript/WebAssembly from the minimal DFA:

http://compiler.org/reason-re-nfa/src/index.html


Though going from NFA to explicit DFA isn't always a good idea.

Btw, you might also like looking into the Brzozowski derivative https://en.wikipedia.org/wiki/Brzozowski_derivative which can be used as an alternative way to match regular expressions.


I think it is also worth mentioning that the site linked at the top uses the Antimirov extension of Brzozowski's work on regex derivatives.


To expand, Brzozowski introduced derivatives and Antimirov partial derivatives. Essentially the former correspond to DFAs and the latter to NFAs.
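
To make the former concrete, here's a minimal sketch of matching by Brzozowski derivatives over a toy regex AST (the constructor names are illustrative, not from any particular library):

    data Re = Empty            -- matches nothing
            | Eps              -- matches only the empty string
            | Chr Char
            | Alt Re Re
            | Cat Re Re
            | Star Re

    -- does the regex accept the empty string?
    nullable :: Re -> Bool
    nullable Empty     = False
    nullable Eps       = True
    nullable (Chr _)   = False
    nullable (Alt a b) = nullable a || nullable b
    nullable (Cat a b) = nullable a && nullable b
    nullable (Star _)  = True

    -- derivative with respect to c: the regex matching the rest of
    -- any string the original matches that starts with c
    deriv :: Char -> Re -> Re
    deriv _ Empty     = Empty
    deriv _ Eps       = Empty
    deriv c (Chr d)   = if c == d then Eps else Empty
    deriv c (Alt a b) = Alt (deriv c a) (deriv c b)
    deriv c (Cat a b)
      | nullable a    = Alt (Cat (deriv c a) b) (deriv c b)
      | otherwise     = Cat (deriv c a) b
    deriv c (Star r)  = Cat (deriv c r) (Star r)

    -- matching is just repeated differentiation, e.g.
    -- matches (Star (Alt (Chr 'a') (Chr 'b'))) "abba" == True
    matches :: Re -> String -> Bool
    matches r = nullable . foldl (flip deriv) r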


You could implement the NFA directly with concurrent exploration of all paths:

https://github.com/mike-french/myrex


This library can be used to create string class hierarchies. That, in turn, can encourage wider use of typed strings.

For example, e-mails and URLs each have a special syntax. Their value space is a subset of all non-empty strings, which is in turn a subset of all strings.

An e-mail address could be passed into a function that requires a non-empty string as input. When the type-system knows that an e-mail string is a subclass of non-empty string, it knows that an email address is valid.

This library can be used to check the definitions and hierarchy of such string types. The implementation of the hierarchy differs per programming language (subclassing, trait boundaries, etc).


In languages with tagged union types you do this a lot! Some Haskell pseudocode for ya

    module Email (Address, fromText, toText) where -- note we do not export the constructor of Address, just the type

    import Data.Text (Text)
    import qualified Data.Text as T

    data Address = Address Text

    fromText :: Text -> Maybe Address
    fromText t
        -- you'd do your validation in here and return Nothing if it's a bad address.
        -- Signal validity out of band, not in band with the data.
        | T.any (== '@') t = Just (Address t) -- stand-in check, not real validation
        | otherwise        = Nothing

    toText :: Address -> Text
    toText (Address addr) = addr -- for when you need to output it somewhere


Pedantic note: ‘Address’ should really be a ‘newtype’…


Haha sorry, I get those backwards a lot. I was gonna do elm but then it’d be a conversation about why we’re writing our own email address validation on the front end instead of using the platform.


Don't worry, that's normal -- in this forum we only talk about how good obscure languages are, nobody actually uses Haskell.


I figure you're joshing but I literally write it for work (although I haven't been in our haskell codebase in months, tragically). I just have a particularly smooth brain so I forget all the little differences as soon as I'm done. Always in exams mode.


> Signal validity out of band, not in band with the data.

Could you expand on this?


Sure! Sorry that was a little too obtuse. So in this case we can imagine an app where we don't use any tagged unions and just use primitive types (your strings, booleans, integers, things of that nature). And we want to signal the validity of some data. Say a user ID and an email address. We store the user ID as an integer to keep space down and store the email address as a string. We use sentinel values: if the user ID is invalid we store -1 (it's JS and there are no unsigned numbers) and if the email address is invalid we store the empty string.

Whenever we consume these values, we need to make sure that userId > 0 and email != "" I mean email !== "". We are testing for special values of the data. Data and "this is for sure not meaningful data" are the same shape! So your functions need to handle those cases.

But with tagged unions you can check these things at the edge of the program and thereafter accept that the contents of the tagged data are valid (because you wrote good tests for your decoders).

So your data is a different shape when it's valid vs when it's invalid, and you can write functions that only accept data that's the valid shape. If you got Json that was hit by cosmic rays when trying to build your User model, you can fail right then and not build a model and find a way to handle that.

It's out of band because you don't guard for special values of your morphologically identical data.
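
For instance, a tiny sketch of the two styles (hypothetical parsers, purely for illustration):

    -- in band: the sentinel shares a type with real data, so every
    -- consumer must remember to check for -1
    parseUserIdInBand :: String -> Int
    parseUserIdInBand s
        | not (null s) && all (`elem` ['0'..'9']) s = read s
        | otherwise = -1

    -- out of band: invalid input has a different shape, checked once
    -- at the edge of the program
    parseUserId :: String -> Maybe Int
    parseUserId s
        | not (null s) && all (`elem` ['0'..'9']) s = Just (read s)
        | otherwise = Nothing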

If you want examples of any specific part of this let me know. IDK your level of familiarity and don't want to overburden you with things you already get.


>An e-mail address could be passed into a function that requires a non-empty string as input. When the type-system knows that an e-mail string is a subclass of non-empty string, it knows that an email address is valid.

Don't use regex for email address validation

https://news.ycombinator.com/item?id=31092912


Nothing like a dive into the wondrous world of what is and isn't allowed in an email address left of the @ on a warm late-summer morning. It's one of the mysteries of the modern world. The simple heuristic that proposes that every regex trying to express "valid email address" is wrong is a sufficiently safe bet, but it ruins all the fun.


> Their value space...

wossis mean? TIA

Edit: instead of downvoting try answering. I'd like to know. TIA{2}


People are downvoting you because quirky/jokey super-colloquial language like “wossis mean? TIA” is hard to understand, and also just doesn’t really mesh with the vibe of the site.


What does TIA even mean?


Thanks In Advance.


That Is Amazing.


Value space is the set of values a type can have. A boolean has only two values in its value space. An unsigned byte has 256 possible values, and so does a signed byte.

A string enumeration has a limited number of values. E.g. type A ("Yes" | "No" | "Maybe") has three values and is a superset of type B ("Yes" | "No"). A function that accepts type A can also accept type B as valid input.

If the value space is defined by a regular expression, as is often the case, the mentioned library could be used to check, at compile-time, which types are subsets of others.


Thank you. I guess I misread.

"For example, e-mails and urls are a special syntax. Their value space..." seemed to talk about the 'value space' of strings (these being e-mails and urls), not types (of e-mails and urls), which confused me.


It is about the 'value space' of strings. Think of all possible strings. That is the entire value space of strings. Not every possible string is an email. Only a subset of this value space is a valid email. This subset is the 'value space' of strings which are valid emails.


If I hadn't seen your edit, I might have downvoted the comment for not being intelligible.


Regular expressions are a great example of bundling up some really neat and complex mathematical theory into a valuable interface. Linear algebra feels similar to me.


It always amazes me how given the appropriate field, so much math can be transformed into linear algebra. Even Möbius transformations on the complex plane w=(az+b)/(cz+d) can be turned into linear algebra.


Linear transformations preserve the structure of the space so you can keep applying them. It's not surprising that you can always find some "space-preserving" part of a problem and fold the rest (the "non-linear" structure) into transformations or the definition of the space itself.


Linear transformations preserve some structure, not 'the' structure.


That usually means the representation is getting close to the truth. Good interfaces have intrinsic value, which many result-focused people do not appreciate.


iirc connections with linear algebra come up in Conway's "Regular Algebra and Finite Machines" https://store.doverpublications.com/0486485838.html (which I only skimmed).


There is a whole field of “weighted automata” which combine linear algebra and automata theory.


The amazing page computes binary relations between pairs of regular expressions and shows a graphical representation of the DFA.

It’s a really incredible demonstration of some highly non-trivial operations on regular expressions.


It's very cool, but also no wonder that it doesn't support all those features of regexes which technically make them not regular expressions anymore. Though, I would have thought ^ and $ anchors shouldn't be a problem?


^ and $ are a problem, although one with a workaround.

The standard theory of regular expressions focuses entirely on regex matching, rather than searching. For matching, ^ and $ don't really mean anything. In particular, regexp theory is defined in terms of the "language of" a regexp: the set of strings which match it. What's the set of strings that "^" matches? Well, it's the empty string, but only if it comes at the beginning of a line (or sometimes the beginning of the document). This beginning-of-line constraint doesn't fit nicely into the "a regexp is defined by its language/set of strings" theory, much the same way lookahead/lookbehind assertions don't quite fit the theory of regular expressions.

The standard workaround is to augment your alphabet with special beginning/end-of-line characters (or beginning/end-of-document), and say that "^" matches the beginning-of-line character.


This page implements regex matching, not searching. So in effect, every pattern has an implicit ^ at the beginning and $ at the end.


A lack of `^` is equivalent to prepending `(.*)`, then trimming the match span to the end of that capture. And similarly for a lack of `$` (but suddenly I remember how nasty Python was before `.fullmatch` was added ...).

More interesting are word boundaries:

`\b` is just `\<|\>` though that should be bubbled up and usually only one side will actually produce a matchable regex.

`A\<B` is just `(A&\W)(\w&B)`, and similar for `\>`.


Correction, `A\<B` is `(A&(\W|^))(\w&B)`, which matters if the A regex can match the empty string.


The double quote (") is also broken. If you use it in the regex, then no DFA is displayed.


As ^ and $ are implicit, you can opt out of them simply by affixing `.*`.


Only when the ^ or $ are at the start/end of your string is it simple. E.g.:

    (a|b|^)(c|d|^)foo
Rewriting without ^ can require much longer regex.


Isn't that just

    ((a|b)?(c|d)|c|d)?foo
Unless you mean it as a search expression, in which case it's more like

    ((.*a|.*b)(c|d)|c|d)?foo
Which I have to admit was a lot harder to figure out than I thought it would be (and may not even be right!)


Yeah the latter.

In an engine supporting ^ and $, searching for this

    (a|b|^)(c|d|^)foo
is equivalent to searching for this

    ^((.*a|.*b)(c|d)|c|d)?foo.*$
And in this context you can drop the leading/trailing ^/$ since they are implicit.


Ha, trying to paste "regex filter numbers divisible by 3" and the page froze to death https://stackoverflow.com/q/10992279/41948

    ^(?:[0369]+|[147](?:[0369]*[147][0369]*[258])*(?:[0369]*[258]|[0369]*[147][0369]*[147])|[258](?:[0369]*[258][0369]*[147])*(?:[0369]*[147]|[0369]*[258][0369]*[258]))+$

    ^([0369]|[147][0369]*[258]|(([258]|[147][0369]*[147])([0369]|[258][0369]*[147])*([147]|[258][0369]*[258])))+$

I wonder if there's a shortest one.


The web page hangs on the regular expressions that produce a DFA with a lot of states. For example, these ones:

(ab+c+)+

(abc){100}

a.*quick brown fox jumps over the lazy dog


The page says it doesn't support anchors anyway.


I wanted to see the intersection between syntactically valid URLs and email addresses, but just entering the URL regex (cf. below) already takes too long for the page to process.

[\-a-zA-Z0-9@:%._+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([\-a-zA-Z0-9()@:%_+.~#?&//=]*)

(source: https://stackoverflow.com/a/3809435/623763)


expressions like (...){1,256} are very heavyweight and the Scala.js code ends up timing out or crashing the browser.

if you replace that with (...)+ then it seems to work (at least for me). smaller expressions like (...){1,6} should be fine.


Just wondering, what is it about testing repetition [a-z]{1,256} with an upper bound that's so heavy? Intuitively it feels like greedy testing [a-z]+ should actually be worse since it has to work back from the end of the input.


This depends heavily on how repetition is implemented.

With a backtracking-search implementation of regexes, bounded iteration is pretty easy.

But the linked webpage appears to compile regexes to finite state machines (it shows you their finite-state-machine, for instance), and eg [a-z]{1,256} will have 256 states: 256 times the 1 state needed for [a-z]. If [a-z] were a complex regex, you could get a combinatorial explosion.

This alone probably isn't the issue? 256 is not a very large number. But I suspect there are follow-on algorithmic issues. This is just speculation, but I wouldn't be surprised if that 256-state machine were computed by determinizing a naively generated NFA (worst-case exponential) and then minimizing the result.


you're right. inclusion/intersection/etc. aren't actually computed via DFA but instead are computed directly on the regular expression representation itself. and large disjunctions (with 256 branches) are what is very heavy.

(it's possible to instead do these operations on DFAs but at the time i found it hard to get from an automaton back to a reasonable-looking regular expression.)


the library uses a fairly simple data representation where x{m,n} is compiled using concatenation and disjunction. so x{1,4} ends up being represented as x|xx|xxx|xxxx.

this simplifies the code for testing equality and inclusion, since logically x{n} is just xx... (n times) and x{m,n} is just x{m}|x{m+1}|...|x{n}.

but when you have x{m,n} and n-m is large you can imagine what kind of problems that causes.
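
a minimal sketch of that expansion over a toy AST (illustrative constructors, not the library's actual types):

    data Re = Chr Char | Eps | Cat Re Re | Alt Re Re

    -- x{m,n} becomes x{m} | x{m+1} | ... | x{n}, where x{k} is
    -- just k copies of x concatenated (assumes 0 <= m <= n)
    rep :: Re -> Int -> Int -> Re
    rep x m n = foldr1 Alt [catN k | k <- [m .. n]]
      where
        catN 0 = Eps
        catN k = foldr1 Cat (replicate k x)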


Seems like it should at least be x(|x(|x(|x))) instead of x|xx|xxx|xxxx to avoid quadratic blow-up.


yes, that is the actual construction: the disjunction data type only supports a lhs and rhs, so that is the only possible way to represent it.

i wrote it the way i did for clarity in the comments.


This is neat!

I was surprised then not surprised that the union & intersection REs it comes up with are not particularly concise. For example the two expressions "y.+" and ".+z" have a very simple intersection: "y.*z" (equality verified by the page, assuming I haven't typo'd anything). But the tool gives

    yz([^z][^z]*z|z)*|y[^z](zz*[^z]|[^z])*zz*
instead. I think there are reasons it gives the answer it does, and giving a minimal (by RE length in characters or whatever) regular expression is probably a lot harder.


I think one of the reasons is that ".+z" gets bigger and uglier after you convert it to a deterministic automaton.


They show the DFA for it on the site, it's 3 states. There's a starting state for the first . and then two states that transition back and forth between whether z was the last character or not.

I think what's actually happening here is that they're doing the intersection on the DFAs and then producing a regex from the resulting DFA. The construction of a regex from a DFA is where things get ugly and weird.


I used this concept once to write the validation logic for an "IP RegEx filter" setting. The goal was to let users configure an IP filter using RegEx (no, marketing people don't get CIDRs, and they knew regexes from Google Analytics). How could I validate a RegEx for this? Its intersection with the RegEx for "all IPv4 addresses" must be non-empty, and not equal to the RegEx for "all IPv4 addresses" itself. Prevented many complaints about the filter not doing anything, but of course didn't prevent wrong filters from being entered.


Wouldn't a simpler solution work here? Instead of trying to validate the filter regex, show some sample IP addresses or let the user insert a set of addresses, and then show which ones the filter matches and which ones it doesn't. Also helps address the problem of incorrect filters.


The odds of the sample addresses matching is essentially zero, and adding work to the user is counterproductive.


I'm not sure I agree — most common regex editing tools available online include a section for adding test strings to verify what you actually wrote is correct. Clearly there is a benefit to it. In a similar vein, letting the user test before they commit actually reduces their workload: they don't have to drop and then reload the whole regex in their mind.


Sure, I use that when authoring and editing a RegEx. That's not the same as entry validation.


Suggestion: turn off auto suggest in the regex input fields to make it more usable on mobile.

https://stackoverflow.com/questions/35513968/disable-autocor...


I used 2 similar divide-by-3 regexes to test the page (after removing the ^ and $ from their ends), and it froze up:

Regex 1: ([0369]|([258]|[147][0369]*[147])([0369]|([147][0369]*[258]|[258][0369]*[147]))*([147]|[258][0369]*[258])|([147]|[258][0369]*[258])([0369]|([147][0369]*[258]|[258][0369]*[147]))*([258]|[147][0369]*[147]))*

Regex 2: ([0369]|[258][0369]*[147]|(([147]|[258][0369]*[258])([0369]|[147][0369]*[258])*([258]|[147][0369]*[147])))*

Everything up until the last '*' is parsable. The moment I put in the *, the entire page freezes up.

Without the *, it produced a valid verifier for parsing chunks of digits whose sum mod 3 = 0.


One possible application: if the input of one function must match a certain regex, and the output of another function matches a different regex, we can tell whether the functions are compatible: if the intersection of the two regular expressions is empty, then you cannot connect one function to the other.

Combined with the fact that regular expressions can be used not only on strings but more generally (e.g. for JSON schema validation [1]), this could be a possible implementation of static checks, similar to "design by contract".

--

1: https://www.balisage.net/Proceedings/vol23/html/Holstege01/B...


I love how it looks like a CS textbook.


The graphics look identical to those in Hopcroft & Ullman's "Introduction to Automata Theory, Languages, and Computation" (like the convention that they use a double-circle to denote accepting states). I imagine they're GraphViz-based: it's very easy [0] to draw these in GraphViz. I don't know what Hopcroft & Ullman used though, because that one was published in 1979, and GraphViz didn't exist before 1991. Suddenly I'm curious what the state of the art for vector diagrams was in 1979...?

[0] e.g. https://graphviz.org/Gallery/directed/fsm.html


Maybe something related to 'pic'? This doc on it is a revised version of a 1984 edition, so maybe it's a little too late, but there are references to other systems back to 1977 or so.

https://pikchr.org/home/uv/pic.pdf


It has the look of graphviz about it, which is an excellent tool. Often helpful in debugging anything related to graphs.

https://graphviz.org/


Kinda related but I'm looking for something that could give me the number of possible matching strings for a simple regex. Does such a tool exist?


I feel like it shouldn't be too hard to calculate from the finite automaton that encodes the regular expression, but surely in most cases it will simply be infinite?


This is reaching back a long time. But the algorithm - if I recall right - is a simple DFS on the deterministic automaton for the regular expression, and it can output the full set of matching strings if you're allowed to use *s in the output.

Basically, you need an accumulator of "stuff up to here". If you move from a node to a second node, you add the character annotating that edge to the accumulator. And whenever you end up with an edge to a visited node, you add a '*' and output that, and for leaf nodes, you output the accumulator.

And then you add a silly jumble of parenthesis on entry and output to make it right. This was kinda simple to figure out with stuff like (a(ab)*b)* and such.

This is in O(states) for R and O(2^states) for NR if I recall right.


Maybe the number of possible matchings for a given length (or range of lengths) might be interesting?


Say you want to compute all strings of length 5 that the automaton can generate. Conceptually the nicest way is to create an automaton that matches any five characters and then compute the intersection between that automaton and the regex automaton. Then you can generate all the strings in the intersection automaton. Of course, IRL, you wouldn't actually generate the intersection automaton (you can easily do this on the fly), but you get the idea.

Automata are really a lost art in modern natural language processing. We used to do things like store a large vocabulary in a deterministic acyclic minimized automaton (nice and compact, a so-called dictionary automaton). And then to find, say, all words within Levenshtein distance 2 of hacker, create a Levenshtein automaton for hacker and then compute (on the fly) the intersection between the Levenshtein automaton and the dictionary automaton. The words you want are then exactly the language of the intersection automaton.

I wrote a Java package a decade ago that implements some of this stuff:

https://github.com/danieldk/dictomaton
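
The on-the-fly intersection is essentially the product construction. A minimal sketch, assuming a DFA represented by its transition function (illustrative types, not dictomaton's API):

    data DFA s = DFA
      { start  :: s
      , accept :: s -> Bool
      , step   :: s -> Char -> s
      }

    -- the intersection automaton runs both machines in lockstep;
    -- pair states come into existence only as they are visited
    intersectDFA :: DFA s -> DFA t -> DFA (s, t)
    intersectDFA a b = DFA
      { start  = (start a, start b)
      , accept = \(s, t) -> accept a s && accept b t
      , step   = \(s, t) c -> (step a s c, step b t c)
      }

    runDFA :: DFA s -> String -> Bool
    runDFA d = accept d . foldl (step d) (start d)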


> deterministic acyclic minimized automaton

That's basically a Trie right? To be fair I have only heard of them and know they can be used to do neat tricks, I've rarely used one myself.


If you do not plan to update it often and don’t need to store extra data with each word, a dawg (https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_s...) is more compact. You often can merge leaf nodes.

For example, if you have words

   talk
   talked
   talking
   talks
   walk
   walked
   walking
   walks
there’s no need to repeat the “”, “ed”, “ing”, “s” parts.


No. In a minimized automaton shared string suffixes also share states/transitions.


That was one of the short examples in Norvig's Python program-design course for Udacity. https://github.com/darius/cant/blob/master/library/regex-gen... (I don't have the Python handy.)



the page actually does give these. for α := [a-z]{2,4} the page gives |α| = 475228.

however, as others have pointed out any non-trivial use of the kleene star means the result will be ∞. in this case the page will list numbers that roughly correspond to "number of strings with N applications of kleene star" in addition to infinity.


Here's a simple Haskell program to do it:

(EDIT: this code is completely wrongheaded and does not work; it assumes that when sequencing regexes, you can take the product of their sizes to find the overall size. This is just not true. See reply, below, for an example.)

    -- https://gist.github.com/rntz/03604e36888a8c6f08bb5e8c665ba9d0

    import qualified Data.List as List

    data Regex = Class [Char]   -- character class
               | Seq [Regex]    -- sequence, ABC
               | Choice [Regex] -- choice, A|B|C
               | Star Regex     -- zero or more, A*
                 deriving (Show)

    data Size = Finite Int | Infinite deriving (Show, Eq)

    instance Num Size where
      abs = undefined; signum = undefined; negate = undefined -- unnecessary
      fromInteger = Finite . fromInteger
      Finite x + Finite y = Finite (x + y)
      _ + _ = Infinite
      Finite x * Finite y = Finite (x * y)
      x * y = if x == 0 || y == 0 then 0 else Infinite

    -- computes size & language (list of matching strings, if regex is finite)
    eval :: Regex -> (Size, [String])
    eval (Class chars) = (Finite (length cset), [[c] | c <- cset])
      where cset = List.nub chars
    eval (Seq regexes) = (product sizes, concat <$> sequence langs)
      where (sizes, langs) = unzip $ map eval regexes
    eval (Choice regexes) = (size, lang)
      where (sizes, langs) = unzip $ map eval regexes
            lang = concat langs
            size = if elem Infinite sizes then Infinite
                   -- finite, so just count 'em. inefficient but works.
                   else Finite (length (List.nub lang))
    eval (Star r) = (size, lang)
      where (rsize, rlang) = eval r
            size | rsize == 0 = 1
                 | rsize == 1 && List.nub rlang == [""] = 1
                 | otherwise = Infinite
            lang = [""] ++ ((++) <$> [x | x <- rlang, x /= ""] <*> lang)

    size :: Regex -> Size
    size = fst . eval
NB. Besides the utter wrong-headedness of the `product` call, the generated string-sets may not be exhaustive for infinite languages, and the original version (I have since edited it) was wrong in several cases for Star (if the argument was nullable or empty).


Surely that fails for e.g. a?a?a?. I'd imagine you could do some sort of simplification first though to avoid this redundancy.


You're correct, and I don't see any good way to avoid this that doesn't involve enumerating the actual language (at least when the language is finite).

Oof, my hubris.


It turns out to be not that hard to just compute the language of the regex, if it is finite, and otherwise note that it is infinite:

    import Prelude hiding (null)
    import Data.Set (Set, toList, fromList, empty, singleton, isSubsetOf, unions, null)

    data Regex = Class [Char]   -- character class
               | Seq [Regex]    -- sequence, ABC
               | Choice [Regex] -- choice, A|B|C
               | Star Regex     -- zero or more, A*
                 deriving (Show)

    -- The language of a regex is either finite or infinite.
    -- We only care about the finite case.
    data Lang = Finite (Set String) | Infinite deriving (Show, Eq)

    zero = Finite empty
    one = Finite (singleton "")

    isEmpty (Finite s) = null s
    isEmpty Infinite = False

    cat :: Lang -> Lang -> Lang
    cat x y | isEmpty x || isEmpty y = zero
    cat (Finite s) (Finite t) = Finite $ fromList [x ++ y | x <- toList s, y <- toList t]
    cat _ _ = Infinite

    subsingleton :: Lang -> Bool
    subsingleton Infinite = False
    subsingleton (Finite s) = isSubsetOf s (fromList [""])

    eval :: Regex -> Lang
    eval (Class chars) = Finite $ fromList [[c] | c <- chars]
    eval (Seq rs) = foldr cat one $ map eval rs
    eval (Choice rs) | any (== Infinite) langs = Infinite
                     | otherwise = Finite $ unions [s | Finite s <- langs]
      where langs = map eval rs
    eval (Star r) | subsingleton (eval r) = one
                  | otherwise = Infinite
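
A couple of quick sanity checks with these constructors:

    -- eval (Seq [Class "ab", Class "cd"])
    --   == Finite (fromList ["ac","ad","bc","bd"])
    -- eval (Seq [Class "ab", Star (Class "d")])
    --   == Infinite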


Another interesting question is: how many possible successful matches are there for a given input string. For example:

How many ways can (a?){m}(a*){m} match the string a{m}

i.e. input m repetitions of the letter 'a'.

https://github.com/mike-french/myrex#ambiguous-example

The answer is a dot product of two vectors sliced from Pascal's Triangle.

For m=9, there are 864,146 successful matches.



https://regex-generate.github.io/regenerate/ (I'm one of the authors) enumerates all the matching (and non-matching) strings, which incidentally answers the question, but doesn't terminate in the infinite case.


I feel like it might be possible with dataflow analysis. Stepping through the regex maintaining a liveness set or something like that. Sort of like computing exemplar inputs, but with repetition as permitted exemplars. Honestly probably end up re-encoding the regex in some other format, perhaps with 'optimizations applied.'


The answer is usually an infinite number, except for very, very simple cases. Anything involving * for example means infinity is your answer.


I wonder if it makes sense to compute an "order type" for a regexp. For example, a* is omega, a*b* is 2 omega.

https://en.m.wikipedia.org/wiki/Order_type

https://en.wikipedia.org/wiki/Ordinal_number


I think that's just the ordinal corresponding to lexicographic order on the words in the language, so yeah that should work. I wonder how easy it is to calculate...


normally you would use an ordinal number [1] to label individual elements of an infinite set while using a cardinal number [2] to measure the size of the set.

i believe the set of words over a finite alphabet is countably infinite: every word is finite, so you can enumerate them by length. this means the cardinality of .* is aleph-null, not c.

unfortunately, i don't think that cardinality gets us very far when trying to differentiate the "complexity" of expressions like [ab]* from ([ab]*c[de]*)*[x-z]*. probably some other metric should be used (maybe something like kolmogorov complexity).

[1] https://en.wikipedia.org/wiki/Ordinal_number

[2] https://en.wikipedia.org/wiki/Cardinal_number


I wouldn't say that's their 'normal' usage. Sure, you can use them like that, but fundamentally ordinal numbers are equivalence classes of ordered sets in the same way that cardinal numbers are equivalence classes of sets.

As you've rightly noted the latter equivalence class gets us nothing so throwing away the ordering is a bit of a waste. Of all mathematical concepts 'size' is easily the most subjective so picking one that is interesting is better than trying to be 'correct'.

In particular a*b* is exactly equivalent to ω^2, since a^n b^m < a^x b^y iff n < x or n=x and m<y. This gives an order preserving isomorphism between words of the form a^n b^m and tuples (n,m) with lexicographic ordering.


interesting!

what would [ab]* be? for computing an ordinal number the only real difficulty is how to handle kleene star: given ord(X) how do we calculate ord(X*)?

but as you probably noticed i'm a bit out of my depth when dealing with ordinals.


Unless I'm very mistaken adding onto the end of a string doesn't affect lexicographic order, so that's effectively [ab]*. The ordinality of [ab] is simple, it's 2, but the kleene star is a bit of an odd one, it's not quite exponentiation.

To reason about the kleene star it's a bit simpler to consider something like R^*n, where you repeat up to n times. Obviously R^*0 = 1 and R^*S(n) can be built from R^*n by picking an element of R^*n and appending either nothing or an element of R, here the element of R^*n determines most of the order and 'nothing' orders in front. For technical reasons the corresponding ordinal is (1+R) R^*n, which is backwards from how you'd expect it and how you'd normally define exponentiation.

The kleene star can be recovered by taking the limit, identifying R^*n with its image in R^*S(n). Which also doesn't quite work as nicely as you'd hope (normally the image is just a downward closed subset; it's not in this case).

I think [ab]* is equivalent to something like the rational part of the Cantor set. Not sure if there's a simpler way to describe it, it's nowhere near as simple as 2^ω, which is just ω.


Hmm, turns out this fails to be an ordinal because regular languages aren't well-ordered (except if you reverse the lexicographic order, maybe?). They are what is called an order type, and it looks like it should be possible to identify them with subsets of the rationals, if you wanted to.

Perhaps reversing the lexicographic order makes more sense, in that case longer tuples simply order last so R^* = 1 + R + R^2 + ..., the limit here is much easier since R^*n = 1 + R + ... + R^n is downwards closed as a subset of R^*S(n).

Then again in that scenario [ab]* is simply ω because it is effectively the same as just writing an integer in binary, so it is less interesting in a way.


I would say [ab]* has order type omega. That's because the set of strings it represents is: {epsilon, ab, abab, ababab, abababab, ...} which looks a lot like omega. (epsilon = empty string)

What this exercise really is is finding a canonical way to order a regular language (the set of strings a regexp matches). For example, a*b* could be {epsilon, a, aa, aaa, aaaa, ..., b, bb, bbb, bbbb, bbbbb, bbbbbb, ..., ab, aab, aaab, aaaab, ..., abb, aabb, ..., ...} which looks a lot like omega ^ 2 (not 2 * omega like I said before). However, you could also re-arrange the set to look like omega: {epsilon, a, b, aa, ab, bb, aaa, aab, abb, bbb, ...} (strings of length 1, length 2, length 3, etc)

I propose the following: for any two strings in the regular language, the one that comes first is the one whose kleene-star repetition counts come first lexicographically. More concretely, for the language a*b*, aaaab represents a repetition count of (4, 1) and bbb represents (0, 3). (0, 3) comes before (4, 1) lexicographically, so bbb comes before aaaab. This admits the following ordering: {epsilon, b, bb, bbb, bbbb, ..., a, ab, abb, abbb, ..., aa, aab, aabb, aabbb, ..., ...} which is omega ^ 2 which "feels" right to me. Another rule is for the regular language (X|Y) and two strings x from X and y from Y, x should always come before y in our ordered set representation.

Hold on, what about nested kleene-stars? (a*)* is just a*, but (a*b*)* is distinct from a*b*. However, the "kleene-star counts" analysis from above breaks down because there are now multiple ways to parse strings like aab. I don't really know how to classify these regular languages as ordinals yet.

I don't really see any useful applications of this, but it's still fun to think about. The game I'm playing is thinking about a random ordinal and trying to come up with a regular language that, under my ordering rule above, looks like that ordinal. Let's try 2 * omega (which looks like this: {0, 1, 2, 3, 4, 5, ..., omega, omega + 1, omega + 2, omega + 3, ...} e.g. 2 copies of omega "concatenated"):

a*|b* = {epsilon, a, aa, aaa, aaaa, aaaaa, ..., b, bb, bbb, bbbb, ...} => 2 * omega.

Some more examples:

omega ^ 3: a*b*c*

omega ^ 2 + omega: a*b*|c*

Maybe we can write down some composition rules:

let X and Y be regular languages and ord(X) and ord(Y) be their ordinal representations. Then,

X|Y => ord(X) + ord(Y)

XY => ord(X) * ord(Y)

X* => ord(X) * omega

I haven't checked if these actually work, this is just a long rambly comment of dubious mathematical value.


[ab] is a character class, not a parenthesized catenation. It's (a|b), not (ab). So [ab]* is {epsilon, a, b, aa, ab, ba, bb, ...}


What’s your use case?


Calculate how long it takes to bruteforce something matching a regexp.


Any def for what 'difference and intersection of regexes' might actually mean?

I guess for regexes r1 and r2 this means the diff and intersect of their extensional sets, expressed intensionally as a regex. I guess. But nothing seems defined, including what ^ is, or > or whatever. It's not helpful


  negation (~α): strings not matched by α
  difference (α - β): strings matched by α but not β
  intersection (α & β): strings matched by α and β
  exclusive-or (α ^ β): strings matched by α or β but not both
  inclusion (α > β): does α match all strings β matches?
  equality (α = β): do α and β match exactly the same strings?


Interesting. I think this problem is actually EXPSPACE-complete in general? But still has a straightforward algorithm.

https://en.wikipedia.org/wiki/EXPSPACE


It depends on your operators. For these, no.

Equivalence of DFA or NFA is PSPACE-complete by Savitch's theorem, regardless of time bound. As such, most types of regex equivalence are PSPACE-complete.

https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.89....

Has a detailed breakdown of operators vs complexity.

In particular, the paper cited in the expspace page is talking about allowing a squaring operator.

It is EXPSPACE complete if you allow squaring, but not if you use repetition.

I.e. it is EXPSPACE-complete if you allow e^2, but not if you only allow ee.


Since this may be confusing at first (why does squaring buy you anything here) - the reason squaring makes it expspace complete is, basically, squaring allows you to express an exponentially large regex in less than exponential input size.

This in turn means polynomial space in the size of the input is no longer enough to deal with the regex.

If you only allow repetition, then an exponentially large regex requires exponential input size, and thus polynomial space in the size of the input still suffices to do equivalence.

This is generally true - operators that allow you to reduce the size of the input necessary to express a regex by a complexity class will usually increase the size complexity class necessary to determine equivalence by a corresponding amount.
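
A tiny sketch of that succinctness gap, treating regexes as plain strings just for illustration:

    -- wrap a regex in a squaring operator
    square :: String -> String
    square r = "(" ++ r ++ "){2}"

    -- n nested squarings: the written regex grows linearly in n,
    -- but it denotes 2^n concatenated copies of "a"
    blowup :: Int -> String
    blowup n = iterate square "a" !! n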


But the site does allow squaring, and in fact also general exponentiation? Like you can write "fo{2}" to match "foo", where the {2} is squaring.


I don't think that's squaring, since "o" is a literal. Squaring is more like /f../ where the first and second /./ must match equal strings. Squaring is actually supported by the Perl regex /f(.)\g1/, where /\g1/ is a backreference[1] to the first match group (here just /(.)/, but could be arbitrarily large). It's easy to see how this feature requires backtracking, and therefore how it can lead to an exponential running time.

/f(.)\g1/ is equivalent to the non-backtracking regex /f(?:\x00\x00|…|aa|bb|cc|…|\U10FFFF\U10FFFF)/ - you've already made a regex over 2 million Unicode codepoints long from an input 6 bytes long. /f(..)\g1/ would create a regex 2 billion codepoints long. If you restrict the regex to Latin-1 or another 1-byte encoding, the exponent base is smaller, but the growth is still exponential.

Supporting features like backreferences is convenient, so that's why Perl has them. You can at least use backtracking to reduce the space blowup (only need linear memory to do a DFS), but it's easy to make the regex match take exponential time with a suitable match text. That's why backtracking regexes are not safe against hostile clients unless you sandbox their time and memory usage carefully.

[1] https://perldoc.perl.org/perlretut#Backreferences


Squaring isn't backreferences, though you can square more than a literal. An example squaring is /(foo|bar*){2}/ which is equivalent to /(foo|bar*)(foo|bar*)/, not /(foo|bar*)\g1/.

Backreferences aren't technically regular, and as you say they can take exponential time to match. But the theorem that regular expression equivalence is EXPSPACE-complete applies to real regular expressions, not just perl "regexes".

IIRC the proof is something like, given a Turing machine that runs in space S, to consider traces of the computation, which are concatenations of S-long strings (plus a head indicator holding the machine's internal state etc). You can make a regex that matches invalid executions of the machine, which are basically of the form

/.* (foo1.{S}bar1 | foo2.{S}bar2 | ...) .*/

where foo1->bar1, foo2->bar2 etc are all the types of invalid transitions (basically, either something on the tape not next to the head changed, or the head changed in a way not indicated in the state table).

You also set rules to enforce the initial state. Then you ask whether:

/wrong initial state | invalid execution | execution that doesn't end in "accept"/

is the same regex as

/.*/

If it's the same, then there are no strings which don't match the first regex; such a string must have the right initial state, a valid execution, and end in "accept". So if the two are the same, then all valid executions end in "reject". Since you aren't constrained to allow only a single state transition rule, this actually shows NEXPSPACE hardness, but that's equal to EXPSPACE by Savitch's theorem. Anyway I could be misremembering this, it's been a while, but it doesn't require backreferences.

The squaring is required to make the {S} part without exponential blowup. Otherwise, I think the problem is only PSPACE-hard (PSPACE-complete? I don't remember).


it always bugged me as a student who had to sit through all those discrete maths lectures that standard regex libraries don't allow you to union/intersect two "compiled" regular expression objects together

(having to try them one at a time is pretty sad)


Oh neat, this is Scala via Scala.js.


On mobile: are the rectangle glyphs as suffixes on the states on purpose or am I missing a font?


The states are numbered, $\alpha_0, ..., \alpha_N$ and $\beta_0, ...$. You might be missing the font for the digits.


ugh STOP USING GITHUB


Can LLMs do this?


I wouldn't use an LLM for anything that can be done 100% precisely, like this.


OK, just curious how LLMs are stacking up in logical tasks like this. I kept hearing we were close to AGI so just wondering how far there is to go.


Humans can do these intersections, but we don’t do it by riffing off the top of our heads. We carefully develop and apply a formal system. LLMs are just (a very important) component.


We've been "close" to AGI for like 40+ years.



