Filter Before You Parse: Faster Analytics on Raw Data with Sparser (stanford.edu)
86 points by bandwitch on Aug 8, 2018 | 14 comments



Sounds a lot like the "on the fly parsing" (§3.1) in Alagiannis' NoDB (See https://stratos.seas.harvard.edu/files/stratos/files/nodb-ca... for details).


I'm more interested in the work some folks are doing on using succinct-data-structure techniques to accelerate parsing. https://github.com/haskell-works/hw-json

It's still relatively immature, but it's a more algorithmic approach that I think plays nicely with pretty much any source of semistructured data. Though SIMD acceleration certainly is pretty sweet too.


Anyone figure out how to get an instance of Spark up with Sparser working?


#define PREPROCESSING 1


I've been doing that for a long time.


How do you do that? If I have JSON/CSV data in tabular form, how do I apply this idea in a simple way? Naively, I'd think it's not possible without parsing it.


It's quite simple, really - at least in some instances.

Say you're looking for log lines (or JSON records) that contain the username 'mamcx': you first filter away the lines that do NOT contain 'mamcx', then parse only the remaining ones (and apply the condition again on the properly parsed ones).
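
A minimal sketch of that two-step idea in Python, assuming one JSON record per line (the file path, field name, and function name are just placeholders):

    import json

    def filter_then_parse(path, needle="mamcx"):
        matches = []
        with open(path) as f:
            for line in f:
                # Step 1: cheap raw substring check, no parsing at all.
                if needle not in line:
                    continue
                # Step 2: parse only the survivors and re-apply the
                # condition, since the raw match could have hit another
                # field or a longer substring.
                record = json.loads(line)
                if record.get("username") == needle:
                    matches.append(record)
        return matches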


OK, but if I find a match I get stuck with an incomplete result:

        //Dude, and the data above what?
        "username": "mamcx",
        "version": 3
    }
So does this need to backtrack a lot?

Or does this scan for the separators ({ }) and then scan each extracted block again for the match? Something like:

    for block in scan(data, "{", "}"):
        if "mamcx" in block:
            parse(block)


You may want to rethink your log format to have a single (minified JSON) entry per line - it can make your life easier in a variety of ways. Out of interest, what library are you using that logs single entries on multiple lines?


I don't always control my inputs. I use a variety of libraries (as explained here: https://www.reddit.com/r/rust/comments/8ygbvy/state_of_rust_...), so I'm more interested in the overall idea.

But for this specific case I use https://www.newtonsoft.com/json.

Eventually I could incorporate the idea into my arsenal, and maybe build a streaming format-pipeline utility that I can use across platforms.


You can do it without backtracking with a very simple deterministic finite automaton: keep track of <block_start, block_end, condition_met> as you go through the input. But yes, it's a bit more complicated; it's preferable to just have one line per "log item".
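
A rough single-pass sketch of that in Python. It uses a nesting counter rather than a strict DFA, and it assumes well-formed input with no braces inside string values (handling those would need a couple of extra states); the function and field names are just for illustration:

    import json

    def scan_blocks(text, needle="mamcx"):
        depth = 0
        block_start = 0
        condition_met = False
        for i, ch in enumerate(text):
            if ch == "{":
                if depth == 0:
                    block_start = i
                    condition_met = False
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0 and condition_met:
                    # Only blocks that matched get the expensive parse.
                    yield json.loads(text[block_start:i + 1])
            elif not condition_met and text.startswith(needle, i):
                # Check the condition as we stream past; no backtracking.
                condition_met = True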


That's a good way to do it.

Generally you do some kind of faster/simpler parsing up front and then feed the results to a more expensive parser.

I've parsed more than my share of 1GB-20GB Turtle files and some up-front filtering gets a big speed boost over a conventional RDF parser.
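
As a hedged sketch of what that up-front filtering can look like in Python (the choice of rdflib and the assumption that the relevant triples each sit on a single line are mine; real Turtle can spread statements across lines): keep the prefix/base declarations plus the lines that mention the term, then hand the much smaller document to the real parser.

    from rdflib import Graph

    def prefilter_turtle(path, needle):
        kept = []
        with open(path) as f:
            for line in f:
                # Keep prefix/base declarations and candidate lines only.
                if line.startswith(("@prefix", "@base")) or needle in line:
                    kept.append(line)
        g = Graph()
        g.parse(data="".join(kept), format="turtle")
        return g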


Simplest way? grep.

You've got one entity to search per line, let's presume (if not, pre-process once to a new big file so that you do).

You want to find all entities that have propertyName = "Foo". So run "grep 'Foo' file" and put the result into your actual parsing/processing. You just removed some percentage of your data before the difficult parsing even began. Need to search on multiple fields?

  cat input | grep 'Foo' | grep 'Bar' | actualProcessing


JSONL has the nice property that you can filter it with a line-based grep; no parsing is needed other than forming lines.



