Filter Before You Parse: Faster Analytics on Raw Data with Sparser (stanford.edu)
86 points by bandwitch on Aug 8, 2018 | 14 comments



Sounds a lot like the "on the fly parsing" (§3.1) in Alagiannis' NoDB (See https://stratos.seas.harvard.edu/files/stratos/files/nodb-ca... for details).


I'm more interested in the work some folks are doing on using succinct-data-structure techniques to accelerate parsing. https://github.com/haskell-works/hw-json

It's still relatively immature, but it's a more algorithmic approach that I think plays nicely with pretty much any source of semistructured data. Though SIMD acceleration certainly is pretty sweet too.


Anyone figure out how to get an instance of Spark up with Sparser working?


#define PREPROCESSING 1


I've been doing that for a long time.


How do you do that? If I have JSON/CSV data in tabular form, how do I apply this idea in a simple way? Naively, I'd think it's not possible without parsing it.


It's quite simple, really - at least in some instances.

Say you're looking for log lines (or JSON records) that contain the username 'mamcx': you first filter away the lines that do NOT contain 'mamcx', then parse only the remaining ones (and apply the condition again on the properly parsed ones).
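
A minimal sketch of that two-step idea in Python, assuming one JSON record per line (the file path, field name, and function name are just placeholders):

    import json

    def filter_then_parse(path, needle="mamcx"):
        matches = []
        with open(path) as f:
            for line in f:
                # Step 1: cheap raw substring check, no parsing at all.
                if needle not in line:
                    continue
                # Step 2: parse only the survivors and re-apply the
                # condition, since the raw match could have hit another
                # field or a longer substring.
                record = json.loads(line)
                if record.get("username") == needle:
                    matches.append(record)
        return matches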


OK, but if I find a match I get stuck with an incomplete result:

        //Dude, and the data above what?
        "username": "mamcx",
        "version": 3
    }
So does this need to backtrack a lot?

Or does this scan for the separators ({ }) and then scan each extracted block again for the match? Something like:

    for block in scan(data, "{", "}"):
        if "mamcx" in block:
            parse(block)


You may want to rethink your log format to have a single (minified JSON) entry per line - it can make your life easier in a variety of ways. Out of interest, what library are you using that logs single entries on multiple lines?


I don't always control my inputs. I use a variety of libraries (as explained here: https://www.reddit.com/r/rust/comments/8ygbvy/state_of_rust_...), so I'm more interested in the overall idea.

But for this specific case I use https://www.newtonsoft.com/json.

Eventually I could incorporate the idea into my arsenal, and maybe build a streaming format-pipeline utility that I can use across platforms.


You can do it without backtracking with a very simple deterministic finite automaton: keep track of <block_start, block_end, condition_met> as you go through the input. But yes, it's a bit more complicated; it's preferable to just have one line per "log item".
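
A rough single-pass sketch of that in Python. It uses a nesting counter rather than a strict DFA, and it assumes well-formed input with no braces inside string values (handling those would need a couple of extra states); the function and field names are just for illustration:

    import json

    def scan_blocks(text, needle="mamcx"):
        depth = 0
        block_start = 0
        condition_met = False
        for i, ch in enumerate(text):
            if ch == "{":
                if depth == 0:
                    block_start = i
                    condition_met = False
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0 and condition_met:
                    # Only blocks that matched get the expensive parse.
                    yield json.loads(text[block_start:i + 1])
            elif not condition_met and text.startswith(needle, i):
                # Check the condition as we stream past; no backtracking.
                condition_met = True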


That's a good way to do it.

Generally you do some kind of faster/simpler parsing up front and then feed the results to a more expensive parser.

I've parsed more than my share of 1GB-20GB Turtle files and some up-front filtering gets a big speed boost over a conventional RDF parser.
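
As a hedged sketch of what that up-front filtering can look like in Python (the choice of rdflib and the assumption that the relevant triples each sit on a single line are mine; real Turtle can spread statements across lines): keep the prefix/base declarations plus the lines that mention the term, then hand the much smaller document to the real parser.

    from rdflib import Graph

    def prefilter_turtle(path, needle):
        kept = []
        with open(path) as f:
            for line in f:
                # Keep prefix/base declarations and candidate lines only.
                if line.startswith(("@prefix", "@base")) or needle in line:
                    kept.append(line)
        g = Graph()
        g.parse(data="".join(kept), format="turtle")
        return g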


Simplest way? grep.

You've got one entity to search per line, let's presume (if not, pre-process once to a new big file so that you do).

You want to find all entities that have propertyName = "Foo". So run "grep 'Foo' file" and put the result into your actual parsing/processing. You just removed some percentage of your data before the difficult parsing even began. Need to search on multiple fields?

  cat input | grep 'Foo' | grep 'Bar' | actualProcessing


JSONL has the nice property that you can filter it with a line-based grep; no parsing is needed other than forming lines.



