
You have to have some sort of heuristic that determines what a "good" regex is, since there are undoubtedly multiple regexes that describe a corpus.

A simple heuristic is to prefer the smallest regex that is consistent with the examples.

So in your example, given the training examples:

  aba
  abaa
  aaaaba

and the counterexamples:

  abba
  ba
  ab

To a human, it's clear that I probably want to match "a+ba+". That is much smaller than ("aba" | "abaa" | "aaaaba") & !("abba" | "ba" | "ab"), which merely enumerates the examples, so it would be the "better" regex.
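
To make the "smallest regex" heuristic concrete, here is a minimal sketch of one way it could be applied: enumerate candidate patterns in order of increasing length and return the first one that matches every training example and no counterexample. The function names, the tiny symbol set "ab+*", and the length cap are all assumptions made for illustration. Brute force like this only works for toy inputs and is not how a real regex learner would do it.

```python
import itertools
import re

POSITIVE = ["aba", "abaa", "aaaaba"]   # training examples from the comment
NEGATIVE = ["abba", "ba", "ab"]        # counterexamples from the comment


def is_consistent(pattern, positive, negative):
    """True if pattern fully matches every positive and no negative string."""
    try:
        compiled = re.compile(pattern)
    except re.error:
        # Many enumerated strings (e.g. "+a") are not valid regexes; skip them.
        return False
    return (all(compiled.fullmatch(s) for s in positive)
            and not any(compiled.fullmatch(s) for s in negative))


def smallest_regex(positive, negative, symbols="ab+*", max_len=6):
    """Hypothetical helper: brute-force the shortest consistent pattern.

    Candidates are tried shortest first, so the first hit is minimal in
    length with respect to the (assumed) symbol set.
    """
    for length in range(1, max_len + 1):
        for chars in itertools.product(symbols, repeat=length):
            pattern = "".join(chars)
            if is_consistent(pattern, positive, negative):
                return pattern
    return None


result = smallest_regex(POSITIVE, NEGATIVE)
print(result)
```

With these six strings the search should land on "a+ba+": nothing shorter than five characters can allow a variable run of a's on both sides of the b, and the near misses "a*ba+" and "a+ba*" are ruled out by the counterexamples "ba" and "ab".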

