Data Processing -- Be Afraid, Be Very Afraid

smhinsey · on July 2, 2009

Interesting. I think where the author writes "reentrancy" they may mean idempotency instead.

I haven't thought about the "kill the whole pipeline" approach before, but it doesn't really seem that different from one of my standard techniques, which is to flag any data that was gathered under error conditions so it can be kept but reviewed for cause. Shutting down the whole process seems like it might be a hard sell to non-technical stakeholders, but I would absolutely operate that way in all non-production environments.

danmil · on July 2, 2009

Yo, the author here. Thanks for the feedback. I totally meant idempotency, drat. (In fact, on Hadoop, thanks to speculative execution of reduce tasks, you also have to worry a bit about reentrancy, but what I was talking about was, in fact, idempotency.)

Shutting down the pipeline: I hear you on prod/non-prod. For our setup, the pipeline ends up writing to a datastore, so if we kill the pipeline, the datastore is still up, it just stops updating. Which is working so far. May end up flagging suspect data as you suggest, instead of the full stop (or only full stop if more than a very small percentage of the data is suspect).
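A minimal sketch of that "flag, and only full stop past a threshold" idea, in Python. The names `process_batch`, `BatchResult`, `PipelineHalted`, and the 1% default are all made up for illustration; this is not the actual pipeline code.

```python
from dataclasses import dataclass, field


@dataclass
class BatchResult:
    clean: list = field(default_factory=list)
    suspect: list = field(default_factory=list)  # kept, but flagged for review


class PipelineHalted(Exception):
    """Raised when too much of a batch was gathered under error conditions."""


def process_batch(records, transform, max_suspect_fraction=0.01):
    # max_suspect_fraction is a hypothetical knob, not a recommended value.
    result = BatchResult()
    for record in records:
        try:
            result.clean.append(transform(record))
        except Exception as exc:
            # Keep the raw record plus the cause so it can be reviewed later.
            result.suspect.append((record, repr(exc)))

    total = len(result.clean) + len(result.suspect)
    if total and len(result.suspect) / total > max_suspect_fraction:
        raise PipelineHalted(f"{len(result.suspect)} of {total} records suspect")
    return result
```

Below the threshold the caller gets both lists back and can write the clean rows while queueing the suspect ones for review; above it, nothing is returned and the downstream datastore simply stops updating.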

smhinsey · on July 2, 2009

No problem. I am not too familiar with Hadoop, but those speculative reduce tasks sound like a real blast to debug.

I can see why the approach in your blog would have a lot of appeal in that environment. It sounds like some sort of error flagging in combination with a set of heuristics around what failed, how often, what time of day, etc., would be the way to go.

I find that intelligent monitoring systems like that are ultimately necessary in systems like this anyway; you just usually end up discovering that the hard way. (I know I have, several times. It's one of those lessons you are tempted to unlearn in the interests of expediency.) Does Hadoop help you out with that sort of thing?

tonystubblebine · on July 2, 2009

What's the difference between reentrancy and idempotency?

mgreenbe · on July 2, 2009

Idempotency is a property of mathematical functions: f : X -> X is idempotent iff f(f(x)) = f(x) for all x in X. Absolute value is an example of an idempotent function.
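As a quick illustration of that definition, here is a small Python check. The helper `is_idempotent_on` is a made-up name, and testing a handful of samples is only a spot check, not a proof.

```python
def is_idempotent_on(f, samples):
    """Spot-check f(f(x)) == f(x) for each sample value."""
    return all(f(f(x)) == f(x) for x in samples)


samples = [-3, -1, 0, 2, 7]

# abs(abs(x)) == abs(x), so absolute value passes.
print(is_idempotent_on(abs, samples))

# x + 1 applied twice gives x + 2, so incrementing does not.
print(is_idempotent_on(lambda x: x + 1, samples))
```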

Reentrancy is a property of computational functions: a function f is reentrant if two copies of f can run at the same time (in the same memory space, etc. -- the details vary) and both copies behave as if the other weren't running. This can rule out or restrict the use of shared state such as static fields or global variables.

smhinsey · on July 2, 2009

In my mind, reentrancy is associated with threading, and implies that the code could be executing in multiple locations simultaneously, whereas an idempotent operation is one that can be repeated multiple times without negative side effects.
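For the "repeatable without negative side effects" sense, a rough Python sketch contrasting two ways a pipeline step might write a daily total. Both function names and the dict standing in for a datastore are assumptions for illustration.

```python
def append_total(store, day, total):
    # Not idempotent: re-running the step adds a duplicate row.
    store.setdefault("rows", []).append((day, total))


def upsert_total(store, day, total):
    # Idempotent: re-running the step overwrites the same key with the same value.
    store.setdefault("by_day", {})[day] = total


store = {}
for _ in range(2):  # simulate the same job running twice
    append_total(store, "2009-07-02", 1200)
    upsert_total(store, "2009-07-02", 1200)

print(len(store["rows"]))    # two rows after two runs
print(len(store["by_day"]))  # still one entry
```

The keyed write is what makes a rerun after a crash harmless; the append version needs some separate de-duplication step.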

gchpaco · on July 2, 2009

Reentrancy comes up with recursion and signal handling as well, both of which can come up in nominally single-threaded programs (nominal because signals usually, but not always, come from outside the process).
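A single-threaded sketch of that in Python. The callback here stands in for a signal handler or a recursive call that re-enters the function while an outer call is still in progress; `shout_shared` and `shout_local` are invented names.

```python
_scratch = []  # module-level buffer shared by every call (the hazard)


def shout_shared(words, on_word=None):
    """Not reentrant: a nested call wipes the buffer the outer call is using."""
    _scratch.clear()
    for word in words:
        if on_word is not None:
            on_word(word)  # stands in for a signal handler or recursive call
        _scratch.append(word.upper())
    return " ".join(_scratch)


def shout_local(words, on_word=None):
    """Reentrant: each activation owns its own buffer."""
    buffer = []
    for word in words:
        if on_word is not None:
            on_word(word)
        buffer.append(word.upper())
    return " ".join(buffer)


# Re-enter each function from inside itself; no threads involved.
outer_shared = shout_shared(["a", "b"], on_word=lambda _: shout_shared(["x"]))
outer_local = shout_local(["a", "b"], on_word=lambda _: shout_local(["x"]))
print(outer_shared)
print(outer_local)
```

The local-buffer version returns "A B" for the outer call regardless of the nested calls. The shared-buffer version does not, because each nested call clears and refills `_scratch` underneath it.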

smhinsey · on July 2, 2009

You are of course correct, I have been working in a multi-threaded environment for long enough to have let that slip my mind.

rjurney · on July 2, 2009

This is the same issue you have when you're building your own analytic system using SQL and you're building big queries to do more complex descriptive and inferential statistical shenanigans.

No matter the technology, automated tests on known data with expected results are the only way to be sure what you're getting is right. And you better be sure, because people have zero tolerance for inaccurate analytics. They will immediately ignore your tool, permanently.
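Roughly what such a test might look like in Python: a tiny fixture small enough to total by hand, and an assertion against the hand-computed answer. `summarize`, `KNOWN_ROWS`, and `EXPECTED` are hypothetical stand-ins for a real analysis step and its fixture.

```python
def summarize(rows):
    """Hypothetical analysis step: total and mean revenue per machine."""
    totals, counts = {}, {}
    for machine, revenue in rows:
        totals[machine] = totals.get(machine, 0) + revenue
        counts[machine] = counts.get(machine, 0) + 1
    return {
        m: {"total": totals[m], "mean": totals[m] / counts[m]} for m in totals
    }


# Known input, small enough to work out the answer by hand.
KNOWN_ROWS = [("Lucky 7", 10), ("Lucky 7", 20), ("Big Wheel", 5)]

# Expected output, computed by hand rather than by running the code.
EXPECTED = {
    "Lucky 7": {"total": 30, "mean": 15.0},
    "Big Wheel": {"total": 5, "mean": 5.0},
}


def test_summarize_known_data():
    assert summarize(KNOWN_ROWS) == EXPECTED


test_summarize_known_data()
```

The important part is that `EXPECTED` comes from somewhere other than the code under test; otherwise the test only proves the code agrees with itself.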

For instance... if your slot machine analytic system does not include slot machines with apostrophes in the name in its 'revenue total' column... you are so screwed. Even once you fix the bug, your stuff is tainted. Don't ask me how I know ;)
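The comment doesn't say what the actual bug was. One common way names with apostrophes go missing is SQL built by string concatenation, so here is a sketch using Python's standard `sqlite3` module with bound parameters, plus a reconciliation check of per-machine sums against the grand total. The table, machine names, and helper functions are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plays (machine TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO plays VALUES (?, ?)",
    [("Lucky 7", 100.0), ("Pharaoh's Gold", 250.0), ("Pharaoh's Gold", 50.0)],
)


def revenue_for(conn, machine):
    # The ? placeholder lets the driver handle quoting, apostrophes included.
    row = conn.execute(
        "SELECT COALESCE(SUM(revenue), 0) FROM plays WHERE machine = ?",
        (machine,),
    ).fetchone()
    return row[0]


def revenue_total(conn):
    return conn.execute(
        "SELECT COALESCE(SUM(revenue), 0) FROM plays"
    ).fetchone()[0]


machines = [r[0] for r in conn.execute("SELECT DISTINCT machine FROM plays")]
per_machine = sum(revenue_for(conn, m) for m in machines)

# Reconciliation: the per-machine figures should add up to the grand total.
print(per_machine == revenue_total(conn))
```

A check like the last line is cheap to run on every load and catches the whole class of "some rows silently dropped" bugs, whatever their cause.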

ntoshev · on July 2, 2009

Great post, and the comment (only one at the time I post this) doubles its value. Be sure not to miss it!

harry · on July 2, 2009

Good article - I'm happy to find out that other DBRs have the same healthy fear & utter suspicion for each analysis they write as I do.

staunch · on July 3, 2009

Great article and advice. I'd love to see more like it!