I created a similar regex web demo that shows how a regex is parsed -> NFA -> DFA -> minimal DFA, and finally outputs LLVM IR/JavaScript/WebAssembly from the minimal DFA:
This library can be used to create string class hierarchies. That, in turn, makes it easier to use typed strings more widely.
For example, e-mail addresses and URLs have a special syntax. Their value space is a subset of all non-empty strings, which in turn is a subset of all strings.
An e-mail address could be passed into a function that requires a non-empty string as input. When the type system knows that the e-mail string type is a subtype of the non-empty-string type, it knows that an e-mail address is a valid argument.
This library can be used to check the definitions and hierarchy of such string types. The implementation of the hierarchy differs per programming language (subclassing, trait bounds, etc.).
In languages with tagged union types you do this a lot! Some Haskell pseudocode for ya
module Email (Address, fromText, toText) where -- note we do not export the constructor of Address, just the type
import Data.Text (Text)
import qualified Data.Text as T
data Address = Address Text
fromText :: Text -> Maybe Address
fromText t
  -- you'd do your real validation in here and return Nothing if it's a bad address.
  -- Signal validity out of band, not in band with the data.
  | T.any (== '@') t = Just (Address t) -- placeholder check, not real e-mail validation
  | otherwise        = Nothing
toText :: Address -> Text
toText (Address addr) = addr -- for when you need to output it somewhere
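For completeness, a sketch of a hypothetical caller (the `describe` name is made up); because the constructor isn't exported, the only way to get an Address is through fromText:

import Data.Text (Text)
import qualified Data.Text as T
import Email (fromText, toText)

describe :: Text -> Text
describe input =
  case fromText input of
    Just addr -> T.append (T.pack "sending to ") (toText addr)
    Nothing   -> T.pack "not a valid address"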
Haha sorry, I get those backwards a lot. I was gonna do elm but then it’d be a conversation about why we’re writing our own email address validation on the front end instead of using the platform.
I figure you're joshing but I literally write it for work (although I haven't been in our haskell codebase in months, tragically). I just have a particularly smooth brain so I forget all the little differences as soon as I'm done. Always in exams mode.
Sure! Sorry that was a little too obtuse. So in this case we can imagine an app where we don't use any tagged unions and just use primitive types (your strings, booleans, integers, things of that nature). And we want to signal the validity of some data. Say a user ID and an email address. We store the user ID as an integer to keep space down and store the email address as a string. We use sentinel values: if the user ID is invalid we store -1 (it's JS and there are no unsigned numbers) and if the email address is invalid we store the empty string.
Whenever we consume these values, we need to make sure that userId > 0 and email != "" I mean email !== "". We are testing for special values of the data. Data and "this is for sure not meaningful data" are the same shape! So your functions need to handle those cases.
But with tagged unions you can check these things at the edge of the program and thereafter accept that the contents of the tagged data are valid (because you wrote good tests for your decoders).
So your data is a different shape when it's valid vs when it's invalid, and you can write functions that only accept data that's the valid shape. If you got Json that was hit by cosmic rays when trying to build your User model, you can fail right then and not build a model and find a way to handle that.
It's out of band because you don't guard for special values of your morphologically identical data.
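To make that concrete, a tiny Haskell sketch in the same spirit as the Address example above (RawUser/User are hypothetical names):

-- in-band: every consumer has to remember the sentinel conventions (-1, "")
data RawUser = RawUser { rawId :: Int, rawEmail :: String }

-- out-of-band: decode once at the edge; downstream code only ever sees
-- values that already passed validation
data User = User { userId :: Int, email :: String }

decodeUser :: RawUser -> Maybe User
decodeUser (RawUser i e)
  | i > 0 && e /= "" = Just (User i e)
  | otherwise        = Nothing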
If you want examples of any specific part of this let me know. IDK your level of familiarity and don't want to overburden you with things you already get.
>An e-mail address could be passed into a function that requires a non-empty string as input. When the type-system knows that an e-mail string is a subclass of non-empty string, it knows that an email address is valid.
Nothing like a dive into the wondrous world of what is and isn't allowed in an email address left of the @ on a warm late-summer morning. It's one of the mysteries of the modern world. The simple heuristic that every regex trying to express "valid email address" is wrong is a sufficiently safe bet, but it ruins all the fun.
People are downvoting you because quirky/jokey super-colloquial language like “wossis mean? TIA” is hard to understand, and also just doesn’t really mesh with the vibe of the site.
Value space is the set of values a type can have. A boolean has only two values in its value space. An unsigned byte has 256 possible values, so does a signed byte.
A string enumeration has a limited number of values. E.g. type A ("Yes" | "No" | "Maybe") has three values and is a superset of type B ("Yes" | "No"). A function that accepts type A can also accept type B as valid input.
If the value space is defined by a regular expression, as is often the case, the mentioned library could be used to check, at compile time, which types are subsets of others.
"For example, e-mails and urls are a special syntax. Their value space..." seemed to talk about the 'value space' of strings (these being e-mails and urls), not types (of e-mails and urls), which confused me.
It is about the 'value space' of strings. Think of all possible strings. That is the entire value space of strings. Not every possible string is an email. Only a subset of this value space consists of valid emails. This subset is the 'value space' of strings which are valid emails.
Regular expressions are a great example of bundling up some really neat and complex mathematical theory into a valuable interface. Linear algebra feels similar to me.
It always amazes me how given the appropriate field, so much math can be transformed into linear algebra. Even Möbius transformations on the complex plane w=(az+b)/(cz+d) can be turned into linear algebra.
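A concrete way to see it: w = (az + b)/(cz + d) corresponds to the 2x2 matrix with rows (a, b) and (c, d) acting on homogeneous coordinates [z : 1], and composing two Möbius transformations is exactly multiplying their matrices (up to a nonzero scalar factor).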
Linear transformations preserve the structure of the space so you can keep applying them. It's not surprising that you can always find some "space-preserving" part of a problem and fold the rest (the "non-linear" structure) into transformations or the definition of the space itself.
That usually means the representation is getting close to the truth. Good interfaces have intrinsic value, which many result-focused people do not appreciate.
It's very cool, but also no wonder that it doesn't support all those features of regexes which technically make them not regular expressions anymore. Though, I would have thought ^ and $ anchors shouldn't be a problem?
^ and $ are a problem, although one with a workaround.
The standard theory of regular expressions focuses entirely on regex matching, rather than searching. For matching, ^ and $ don't really mean anything. In particular, regexp theory is defined in terms of the "language of" a regexp: the set of strings which match it. What's the set of strings that "^" matches? Well, it's the empty string, but only if it comes at the beginning of a line (or sometimes the beginning of the document). This beginning-of-line constraint doesn't fit nicely into the "a regexp is defined by its language/set of strings" theory, much the same way lookahead/lookbehind assertions don't quite fit the theory of regular expressions.
The standard workaround is to augment your alphabet with special beginning/end-of-line characters (or beginning/end-of-document), and say that "^" matches the beginning-of-line character.
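A minimal sketch of that workaround (toy Regex type and arbitrarily chosen sentinel characters, purely for illustration):

data Regex = Class [Char] | Seq [Regex] | Choice [Regex] | Star Regex

bol, eol :: Char
bol = '\x02' -- sentinel standing in for "beginning of line", assumed absent from real input
eol = '\x03' -- sentinel standing in for "end of line"

-- "^r" becomes "<BOL> r" and "r$" becomes "r <EOL>"; the matcher then feeds
-- the automaton the input as bol : line ++ [eol]
anchorStart, anchorEnd :: Regex -> Regex
anchorStart r = Seq [Class [bol], r]
anchorEnd   r = Seq [r, Class [eol]]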
A lack of `^` is equivalent to prepending `(.*)`, then trimming the match span to the end of that capture. And similarly for a lack of `$` (but suddenly I remember how nasty Python was before `.fullmatch` was added ...).
More interesting are word boundaries:
`\b` is just `\<|\>` though that should be bubbled up and usually only one side will actually produce a matchable regex.
`A\<B` is just `(A&\W)(\w&B)`, and similar for `\>`.
I wanted to see the intersection between syntactically valid URLs and email addresses, but just entering the URL regex (cf. below) already takes too long for the page to process.
Just wondering, what is it about testing repetition [a-z]{1,256} with an upper bound that's so heavy? Intuitively it feels like greedy testing [a-z]+ should actually be worse since it has to work back from the end of the input.
This depends heavily on how repetition is implemented.
With a backtracking-search implementation of regexes, bounded iteration is pretty easy.
But the linked webpage appears to compile regexes to finite state machines (it shows you their finite-state machine, for instance), and e.g. [a-z]{1,256} will have 256 states: 256 times the 1 state needed for [a-z]. If [a-z] were a complex regex, you could get a combinatorial explosion.
This alone probably isn't the issue? 256 is not a very large number. But I suspect there are follow-on algorithmic issues. This is just speculation, but I wouldn't be surprised if that 256-state machine were computed by determinizing a more naively generated automaton (subset construction has worst-case exponential blow-up) and then minimizing the resulting DFA.
you're right. inclusion/intersection/etc. aren't actually computed via DFA but instead are computed directly on the regular expression representation itself. and large disjunctions (with 256 branches) are what is very heavy.
(it's possible to instead do these operations on DFAs but at the time i found it hard to get from an automata back to a reasonable-looking regular expression.)
the library uses a fairly simple data representation where x{m,n} is compiled using concatenation and disjunction. so x{1,4} ends up being represented as x|xx|xxx|xxxx.
this simplifies the code for testing equality and inclusion, since logically x{n} is just xx... (n times) and x{m,n} is just x{m}|x{m+1}|...|x{n}.
but when you have x{m,n} and n-m is large you can imagine what kind of problems that causes.
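a sketch of that expansion with a toy Regex type (illustration only, not the library's actual code):

data Regex = Class [Char] | Seq [Regex] | Choice [Regex] | Star Regex
  deriving (Show)

-- x{n} is n copies in sequence
repExact :: Int -> Regex -> Regex
repExact n x = Seq (replicate n x)

-- x{m,n} is a disjunction of n-m+1 branches; this is what blows up when
-- n-m is large, e.g. [a-z]{1,256}
repRange :: Int -> Int -> Regex -> Regex
repRange m n x = Choice [repExact k x | k <- [m .. n]]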
I was surprised then not surprised that the union & intersection REs it comes up with are not particularly concise. For example the two expressions "y.+" and ".+z" have a very simple intersection: "y.*z" (equality verified by the page, assuming I haven't typo'd anything). But the tool gives
yz([^z][^z]*z|z)*|y[^z](zz*[^z]|[^z])*zz*
instead. I think there are reasons it gives the answer it does, and giving a minimal (by RE length in characters or whatever) regular expression is probably a lot harder.
They show the DFA for it on the site, it's 3 states. There's a starting state for the first . and then two states that transition back and forth between whether z was the last character or not.
I think what's actually happening here is that they're doing the intersection on the DFAs and then producing a regex from the resulting DFA. The construction of a regex from a DFA is where things get ugly and weird.
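For reference, the product construction itself is tiny (hypothetical DFA type, just to illustrate; the DFA-to-regex step, typically state elimination, is where the ugliness comes from):

data DFA q = DFA
  { start  :: q
  , accept :: q -> Bool
  , step   :: q -> Char -> q
  }

-- run both machines in lockstep; accept iff both accept
intersectDFA :: DFA q1 -> DFA q2 -> DFA (q1, q2)
intersectDFA a b = DFA
  { start  = (start a, start b)
  , accept = \(x, y) -> accept a x && accept b y
  , step   = \(x, y) c -> (step a x c, step b y c)
  }

accepts :: DFA q -> String -> Bool
accepts m = accept m . foldl (step m) (start m)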
I used this concept once to write the validation logic for an "IP RegEx filter" setting. The goal was to let users configure an IP filter using RegEx (no, marketing people don't get CIDRs, and they already knew regexes from Google Analytics). How could I decide whether a given RegEx was valid for this? By checking that its intersection with the RegEx for "all IPv4 addresses" is not empty, and not equal to the RegEx for "all IPv4 addresses". This prevented many complaints about the filter not doing anything, but of course didn't prevent wrong filters from being entered.
Wouldn't a simpler solution work here? Instead of trying to validate the filter regex, show some sample IP addresses or let the user insert a set of addresses, and then show which ones the filter matches and which ones it doesn't. Also helps address the problem of incorrect filters.
I'm not sure I agree: most common regex editing tools available online include a section for adding test strings to verify that what you actually wrote is correct. Clearly there is a benefit to it. In a similar vein, allowing the user to test before they commit actually reduces their workload: they don't have to drop and then reload the whole regex in their mind.
One possible application: if the input to a function parameter must match a certain regex, and the output of another function is guaranteed to match a different regex, we can tell whether the functions are compatible: if the intersection of the two regular expressions is empty, then you cannot connect one function to the other.
Combined with the fact that regular expressions can be used not only on strings but more generally (e.g. for JSON schema validation [1]), this could be a possible implementation of static checks, similar to "design by contract".
The graphics look identical to those in Hopcroft & Ullman's "Introduction to Automata Theory, Languages, and Computation" (like the convention that they use a double-circle to denote accepting states). I imagine they're GraphViz-based: it's very easy [0] to draw these in GraphViz. I don't know what Hopcroft & Ullman used though, because that one was published in 1979, and GraphViz didn't exist before 1991. Suddenly I'm curious what the state of the art for vector diagrams was in 1979...?
Maybe something related to 'pic'? This doc on it is a revised version of a 1984 edition, so maybe it's a little too late, but there are references to other systems back to 1977 or so.
I feel like it shouldn't be too hard to calculate from the finite automaton that encodes the regular expression, but surely in most cases it will simply be infinite?
This is reaching back a long time for me. But the algorithm - if I recall right - is a simple DFS on the deterministic automaton for the regular expression, and it can output the full set of matching strings if you're allowed to use *s in the output.
Basically, you need an accumulator of "stuff up to here". If you move from a node to a second node, you add the character annotating that edge to the accumulator. And whenever you end up with an edge to a visited node, you add a '*' and output that, and for leaf nodes, you output the accumulator.
And then you add a silly jumble of parenthesis on entry and output to make it right. This was kinda simple to figure out with stuff like (a(ab)*b)* and such.
This is in O(states) for a deterministic automaton and O(2^states) for a nondeterministic one, if I recall right.
Say you want to compute all strings of length 5 that the automaton can generate. Conceptually the nicest way is to create an automaton that matches any five characters and then compute the intersection between that automaton and the regex automaton. Then you can generate all the strings in the intersection automaton. Of course, IRL, you wouldn't actually generate the intersection automaton (you can easily do this on the fly), but you get the idea.
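A small sketch of that idea with a hypothetical DFA record (not any particular library's API); walking the machine for exactly n steps is the on-the-fly version of intersecting with the "any n characters" automaton:

data DFA q = DFA
  { alphabet :: [Char]
  , start    :: q
  , accept   :: q -> Bool
  , step     :: q -> Char -> q
  }

stringsOfLength :: Int -> DFA q -> [String]
stringsOfLength n dfa = go n (start dfa)
  where
    go 0 q = [ "" | accept dfa q ]
    go k q = [ c : rest | c <- alphabet dfa
                        , rest <- go (k - 1) (step dfa q c) ]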
Automata are really a lost art in modern natural language processing. We used to do things like store a large vocabulary in a deterministic acyclic minimized automaton (nice and compact, a so-called dictionary automaton). And then to find, say, all words within Levenshtein distance 2 of hacker, create a Levenshtein automaton for hacker and then compute (on the fly) the intersection between the Levenshtein automaton and the dictionary automaton. The language of the intersection automaton is then exactly the dictionary words within distance 2.
I wrote a Java package a decade ago that implements some of this stuff:
the page actually does give these. for α := [a-z]{2,4} the page gives |α| = 475228.
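(which checks out: 26^2 + 26^3 + 26^4 = 676 + 17576 + 456976 = 475228.)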
however, as others have pointed out any non-trivial use of the kleene star means the result will be ∞. in this case the page will list numbers that roughly correspond to "number of strings with N applications of kleene star" in addition to infinity.
(EDIT: this code is completely wrongheaded and does not work; it assumes that when sequencing regexes, you can take the product of their sizes to find the overall size. This is just not true. See reply, below, for an example.)
-- https://gist.github.com/rntz/03604e36888a8c6f08bb5e8c665ba9d0
import qualified Data.List as List
data Regex = Class [Char] -- character class
| Seq [Regex] -- sequence, ABC
| Choice [Regex] -- choice, A|B|C
| Star Regex -- zero or more, A*
deriving (Show)
data Size = Finite Int | Infinite deriving (Show, Eq)
instance Num Size where
abs = undefined; signum = undefined; negate = undefined -- unnecessary
fromInteger = Finite . fromInteger
Finite x + Finite y = Finite (x + y)
_ + _ = Infinite
Finite x * Finite y = Finite (x * y)
x * y = if x == 0 || y == 0 then 0 else Infinite
-- computes size & language (list of matching strings, if regex is finite)
eval :: Regex -> (Size, [String])
eval (Class chars) = (Finite (length cset), [[c] | c <- cset])
where cset = List.nub chars
eval (Seq regexes) = (product sizes, concat <$> sequence langs)
where (sizes, langs) = unzip $ map eval regexes
eval (Choice regexes) = (size, lang)
where (sizes, langs) = unzip $ map eval regexes
lang = concat langs
size = if elem Infinite sizes then Infinite
-- finite, so just count 'em. inefficient but works.
else Finite (length (List.nub lang))
eval (Star r) = (size, lang)
where (rsize, rlang) = eval r
size | rsize == 0 = 1
| rsize == 1 && List.nub rlang == [""] = 1
| otherwise = Infinite
lang = [""] ++ ((++) <$> [x | x <- rlang, x /= ""] <*> lang)
size :: Regex -> Size
size = fst . eval
NB. Besides the utter wrong-headedness of the `product` call, the generated string-sets may not be exhaustive for infinite languages, and the original version (I have since edited it) was wrong in several cases for Star (if the argument was nullable or empty).
You're correct, and I don't see any good way to avoid this that doesn't involve enumerating the actual language (at least when the language is finite).
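(A concrete instance of the problem: if both halves of a Seq have language {"", "a"}, the product says 2 x 2 = 4, but the concatenated language is {"", "a", "aa"}, which has only 3 strings.)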
It turns out to be not that hard to just compute the language of the regex, if it is finite, and otherwise note that it is infinite:
import Prelude hiding (null)
import Data.Set (Set, toList, fromList, empty, singleton, isSubsetOf, unions, null)
data Regex = Class [Char] -- character class
| Seq [Regex] -- sequence, ABC
| Choice [Regex] -- choice, A|B|C
| Star Regex -- zero or more, A*
deriving (Show)
-- The language of a regex is either finite or infinite.
-- We only care about the finite case.
data Lang = Finite (Set String) | Infinite deriving (Show, Eq)
zero = Finite empty
one = Finite (singleton "")
isEmpty (Finite s) = null s
isEmpty Infinite = False
cat :: Lang -> Lang -> Lang
cat x y | isEmpty x || isEmpty y = zero
cat (Finite s) (Finite t) = Finite $ fromList [x ++ y | x <- toList s, y <- toList t]
cat _ _ = Infinite
subsingleton :: Lang -> Bool
subsingleton Infinite = False
subsingleton (Finite s) = isSubsetOf s (fromList [""])
eval :: Regex -> Lang
eval (Class chars) = Finite $ fromList [[c] | c <- chars]
eval (Seq rs) = foldr cat one $ map eval rs
eval (Choice rs) | any (== Infinite) langs = Infinite
| otherwise = Finite $ unions [s | Finite s <- langs]
where langs = map eval rs
eval (Star r) | subsingleton (eval r) = one
| otherwise = Infinite
https://regex-generate.github.io/regenerate/ (I'm one of the authors) enumerates all the matching (and non-matching) strings, which incidentally answers the question, but doesn't terminate in the infinite case.
I feel like it might be possible with dataflow analysis. Stepping through the regex maintaining a liveness set or something like that. Sort of like computing exemplar inputs, but with repetition as permitted exemplars. Honestly, you'd probably end up re-encoding the regex in some other format, perhaps with optimizations applied.
I think that's just the ordinal corresponding to lexicographic order on the words in the language, so yeah that should work. I wonder how easy it is to calculate...
normally you would use an ordinal number [1] to label individual elements of an infinite set while using a cardinal number [2] to measure the size of the set.
i believe the set of finite words over a finite alphabet (with more than one member) is countably infinite, so the cardinality of .* is just aleph-zero; you'd only get the cardinality of the reals if you allowed infinitely long strings.
unfortunately, i don't think that cardinality gets us very far when trying to differentiate the "complexity" of expressions like [ab]* from ([ab]*c[de]*)*[x-z]*. probably some other metric should be used (maybe something like kolmogorov complexity).
I wouldn't say that's their 'normal' usage. Sure, you can use them like that, but fundamentally ordinal numbers are equivalence classes of well-ordered sets, in the same way that cardinal numbers are equivalence classes of sets.
As you've rightly noted the latter equivalence class gets us nothing so throwing away the ordering is a bit of a waste. Of all mathematical concepts 'size' is easily the most subjective so picking one that is interesting is better than trying to be 'correct'.
In particular a*b* is exactly equivalent to ω^2, since a^n b^m < a^x b^y iff n < x or n=x and m<y. This gives an order preserving isomorphism between words of the form a^n b^m and tuples (n,m) with lexicographic ordering.
Unless I'm very mistaken adding onto the end of a string doesn't affect lexicographic order, so that's effectively [ab]*. The ordinality of [ab] is simple, it's 2, but the kleene star is a bit of an odd one, it's not quite exponentiation.
To reason about the kleene star it's a bit simpler to consider something like R^*n, where you repeat up to n times. Obviously R^*0 = 1 and R^*S(n) can be built from R^*n by picking an element of R^*n and appending either nothing or an element of R, here the element of R^*n determines most of the order and 'nothing' orders in front. For technical reasons the corresponding ordinal is (1+R) R^*n, which is backwards from how you'd expect it and how you'd normally define exponentiation.
The kleene star can be recovered by taking the limit, identifying R^*n with its image in R^*S(n). Which also doesn't quite work as nicely as you'd hope (normally the image is just a downward closed subset, it's not in this case).
I think [ab]* is equivalent to something like the rational part of the Cantor set. Not sure if there's a simpler way to describe it, it's nowhere near as simple as 2^ω, which is just ω.
Hmm, turns out this fails to be an ordinal because regular languages aren't well-ordered (except if you reverse the lexicographic order, maybe?). They are what is called an order type, and it looks like it should be possible to identify them with subsets of the rationals, if you wanted to.
Perhaps reversing the lexicographic order makes more sense, in that case longer tuples simply order last so R^* = 1 + R + R^2 + ..., the limit here is much easier since R^*n = 1 + R + ... + R^n is downwards closed as a subset of R^*S(n).
Then again in that scenario [ab]* is simply ω because it is effectively the same as just writing an integer in binary, so it is less interesting in a way.
I would say [ab]* has order type omega. That's because the set of strings it represents is: {epsilon, ab, abab, ababab, abababab, ...} which looks a lot like omega. (epsilon = empty string)
What this exercise really amounts to is finding a canonical way to order a regular language (the set of strings a regexp matches). For example, a*b* could be {epsilon, a, aa, aaa, aaaa, ..., b, bb, bbb, bbbb, bbbbb, bbbbbb, ..., ab, aab, aaab, aaaab, ..., abb, aabb, ..., ...}
which looks a lot like omega ^ 2 (not 2 * omega like I said before). However, you could also re-arrange the set to look like omega: {epsilon, a, b, aa, bb, ab, aaa, bbb, aab, abb, ...} (strings of length 0, then 1, then 2, then 3, etc)
I propose the following: for any two strings in the regular language, the one that comes first is the one whose kleene-star repetition counts come first lexicographically. More concretely, for the language a*b*, aaaab represents a repetition count of (4, 1) and bbb represents (0, 3). (0, 3) comes before (4, 1) lexicographically, so bbb comes before aaaab. This admits the following ordering: {epsilon, b, bb, bbb, bbbb, ..., a, ab, abb, abbb, abbbb, ..., aa, aab, aabb, aabbb, ..., ...} which is omega ^ 2 which "feels" right to me. Another rule is for the regular language (X|Y) and two strings x from X and y from Y, x should always come before y in our ordered set representation.
Hold on, what about nested kleene-stars? (a*)* is just a*, but (a*b*)* is distinct from a*b*. However, the "kleene-star counts" analysis from above breaks down because there are now multiple ways to parse strings like aab. I don't really know how to classify these regular languages as ordinals yet.
I don't really see any useful applications of this, but it's still fun to think about. The game I'm playing is thinking about a random ordinal and trying to come up with a regular language that, under my ordering rule above, looks like that ordinal. Let's try 2 * omega (which looks like this: {0, 1, 2, 3, 4, 5, ..., omega, omega + 1, omega + 2, omega + 3, ...} e.g. 2 copies of omega "concatenated"):
Any definition of what 'difference and intersection of regexes' might actually mean?
I guess for regexes r1 and r2 this means the diff and intersect of their extensional sets, expressed intensionally as a regex. I guess. But nothing seems defined, including what ^ is, or > or whatever. It's not helpful
negation (~α): strings not matched by α
difference (α - β): strings matched by α but not β
intersection (α & β): strings matched by α and β
exclusive-or (α ^ β): strings matched by α or β but not both
inclusion (α > β): does α match all strings that β matches?
equality (α = β): do α and β match exactly the same strings?
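(Note that the derived operators reduce to the core ones: α - β is just α & ~β, and α ^ β is (α - β) | (β - α).)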
Equivalence of NFAs (and of plain regular expressions) is PSPACE-complete; Savitch's theorem is what keeps the nondeterministic approach within PSPACE. (DFA equivalence, by contrast, is polynomial.) As such, most types of regex equivalence are PSPACE-complete.
Since this may be confusing at first (why does squaring buy you anything here) - the reason squaring makes it expspace complete is, basically, squaring allows you to express an exponentially large regex in less than exponential input size.
This in turn means polynomial space in the size of the input is no longer enough to deal with the regex.
If you only allow repetition, then an exponentially large regex requires exponential input size, and thus polynomial space in the size of the input still suffices to do equivalence.
This is generally true - operators that allow you to reduce the size of the input necessary to express a regex by a complexity class will usually increase the size complexity class necessary to determine equivalence by a corresponding amount.
I don't think that's squaring, since "o" is a literal. Squaring is more like /f../ where the first and second /./ must match equal strings. Squaring is actually supported by the Perl regex /f(.)\g1/, where /\g1/ is a backreference[1] to the first match group (here just /(.)/, but could be arbitrarily large). It's easy to see how this feature requires backtracking, and therefore how it can lead to an exponential running time.
/f(.)\g1/ is equivalent to the non-backtracking regex /f(?:\x00\x00|…|aa|bb|cc|…|\U10FFFF\U10FFFF)/ - you've already made a regex over 2 million Unicode codepoints long from an input 6 bytes long. /f(..)\g1/ would create a regex trillions of codepoints long. If you restrict the regex to Latin-1 or another 1-byte encoding, the exponent base is smaller, but the growth is still exponential.
Supporting features like backreferences is convenient, so that's why Perl has them. You can at least use backtracking to reduce the space blowup (only need linear memory to do a DFS), but it's easy to make the regex match take exponential time with a suitable match text. That's why backtracking regexes are not safe against hostile clients unless you sandbox their time and memory usage carefully.
Squaring isn't backreferences, though you can square more than a literal. An example squaring is /(foo|bar*){2}/ which is equivalent to /(foo|bar*)(foo|bar*)/, not /(foo|bar*)\g1/.
Backreferences aren't technically regular, and as you say they can take exponential time to match. But the theorem that regular expression equivalence is EXPSPACE-complete applies to real regular expressions, not just perl "regexes".
IIRC the proof is something like, given a Turing machine that runs in space S, to consider traces of the computation, which are concatenations of S-long strings (plus a head indicator holding the machine's internal state etc). You can make a regex that matches invalid executions of the machine, which are basically of the form
/.* (foo1.{S}bar1 | foo2.{S}bar2 | ...) .*/
where foo1->bar1, foo2->bar2 etc are all the types of invalid transitions (basically, either something on the tape not next to the head changed, or the head changed in a way not indicated in the state table).
You also set rules to enforce the initial state. Then you ask whether:
/wrong initial state | invalid execution | execution that doesn't end in "accept"/
is the same regex as
/.*/
If it's the same, then there are no strings which don't match the first regex; such a string must have the right initial state, a valid execution, and end in "accept". So if the two are the same, then all valid executions end in "reject". Since you aren't constrained to allow only a single state transition rule, this actually shows NEXPSPACE hardness, but that's equal to EXPSPACE by Savitch's theorem. Anyway I could be misremembering this, it's been a while, but it doesn't require backreferences.
The squaring is required to make the {S} part without exponential blowup. Otherwise, I think the problem is only PSPACE-hard (PSPACE-complete? I don't remember).
it always bugged me as a student who had to sit through all those discrete maths lectures that standard regex libraries don't allow you to union/intersect two "compiled" regular expression objects together
Humans can do these intersections, but we don’t do it by riffing off the top of our heads. We carefully develop and apply a formal system. LLMs are just (a very important) component.
http://compiler.org/reason-re-nfa/src/index.html