I’m more interested in the work some folks are doing on using succinct data structure techniques to accelerate parsing. https://github.com/haskell-works/hw-json
It’s still relatively immature, but it’s a more algorithmic approach that I think plays nicely with pretty much any source of semistructured data. Though SIMD acceleration certainly is pretty sweet too.
How does that work? If I have JSON/CSV data in tabular form, how do I apply this idea in a simple way? Naively, I'd think it's not possible without parsing the data first.
It's quite simple, really - at least in some instances.
Say you're looking for log lines (or JSON objects) that contain the username 'mamcx': you first filter away the lines that do NOT contain 'mamcx', then parse only the remaining ones (and apply the condition again, on the properly-parsed ones).
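A minimal sketch of that two-pass approach in Python (the log lines here are made up for illustration; assume one minified JSON object per line):

```python
import json

# Hypothetical log lines, one minified JSON object per line.
lines = [
    '{"user": "mamcx", "action": "login"}',
    '{"user": "alice", "action": "logout"}',
    '{"user": "bob", "note": "mentioned mamcx in a comment"}',
]

# Pass 1: cheap substring filter -- no parsing at all.
candidates = [ln for ln in lines if "mamcx" in ln]

# Pass 2: parse only the survivors and re-apply the condition precisely.
# (The third line above survives pass 1 but is rejected here.)
matches = [obj for obj in map(json.loads, candidates)
           if obj.get("user") == "mamcx"]

print(matches)
```

The point is that the expensive step (JSON parsing) only ever sees the lines that passed the cheap substring check, and the exact predicate run afterwards weeds out false positives like the third line.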
You may want to rethink your log format to have a single (minified JSON) entry per line - it can make your life easier in a variety of ways. Out of interest, what library are you using that logs single entries on multiple lines?
You can do it without backtracking with a very simple deterministic finite automaton: keep track of <block_start, block_end, condition_met> as you go through the input. But yes, it's a bit more complicated; it's preferable to just have one line per "log item".
You've got one entity to search per line, let's presume (if not, pre-process once to a new big file so that you do).
You want to find all entities that have propertyName = "Foo". So run `grep 'Foo' file` and pipe the result into your actual parsing/processing. You just removed some percentage of your data before the difficult parsing even began. Need to search on multiple fields?
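One way to handle multiple fields is simply to chain the cheap filters, the way you'd chain greps in a pipeline. A Python sketch (records and field names invented for illustration):

```python
import json

records = [
    '{"propertyName": "Foo", "size": 1}',
    '{"propertyName": "Bar", "size": 2}',
    '{"comment": "Foo appears here but not as propertyName"}',
]

# Emulating `grep 'Foo' file | grep '1'`: one cheap substring
# filter per field value, lazily chained, before any parsing.
stage1 = (ln for ln in records if "Foo" in ln)
stage2 = (ln for ln in stage1 if "1" in ln)

# Parse only the survivors and apply the exact predicate.
hits = [r for r in map(json.loads, stage2)
        if r.get("propertyName") == "Foo" and r.get("size") == 1]

print(hits)
```

Each added filter can only shrink the candidate set, so the parser still sees at most what a single grep would have handed it.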