What I envision as a solution is something like a configurable/codeable OpenRefine (formerly Google Refine) with streaming ingestion/extraction, a validation/parsing engine (something more elegant than regex, though you could drop down to regex if necessary), and maybe a pluggable event processor (e.g. Spark or Flink). I would love to work on such a problem and solve it.
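
As a rough illustration of that shape (not an existing tool, and every name below is made up for the sketch), here is a minimal Python version: rows are streamed one at a time, each field is checked by a named validator with a regex fallback, and the results are handed to a pluggable sink. The sink is only a stand-in for where an event processor such as Spark or Flink might be attached; nothing here uses their APIs.

```python
import csv
import re
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# Hypothetical types for this sketch only.
Validator = Callable[[str], bool]
Sink = Callable[[Dict[str, str], List[str]], None]


def is_int(value: str) -> bool:
    """Named check: the value parses as an integer."""
    try:
        int(value)
        return True
    except ValueError:
        return False


def matches(pattern: str) -> Validator:
    """Regex escape hatch for cases the named checks do not cover."""
    compiled = re.compile(pattern)
    return lambda value: compiled.fullmatch(value) is not None


def stream_rows(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield rows lazily so the whole input never has to sit in memory."""
    yield from csv.DictReader(lines)


def validate(
    rows: Iterable[Dict[str, str]], rules: Dict[str, Validator]
) -> Iterator[Tuple[Dict[str, str], List[str]]]:
    """Pair each row with the names of the fields that failed their rule."""
    for row in rows:
        errors = [
            field for field, check in rules.items()
            if not check(row.get(field) or "")
        ]
        yield row, errors


def run(lines: Iterable[str], rules: Dict[str, Validator], sink: Sink) -> None:
    """Push every validated row into the pluggable sink."""
    for row, errors in validate(stream_rows(lines), rules):
        sink(row, errors)


# Example wiring: an in-memory list plays the role of the downstream processor.
lines = ["id,email", "1,a@example.com", "x,not-an-email"]
rules = {"id": is_int, "email": matches(r"[^@\s]+@[^@\s]+\.[^@\s]+")}
collected: List[Tuple[str, List[str]]] = []
run(lines, rules, lambda row, errors: collected.append((row["id"], errors)))
```

Swapping the lambda for a function that writes to a queue or a file would change where the results go without touching the validation rules, which is the kind of pluggability I mean.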