Data Processing -- Be Afraid, Be Very Afraid (hubspot.com)
49 points by tonystubblebine on July 2, 2009 | 12 comments



Interesting. I think where the author writes "reentrancy" they may mean idempotency instead.

I haven't thought about the "kill the whole pipeline" approach before, but it doesn't really seem that different from one of my standard techniques, which is to flag any data that was gathered under error conditions so it can be kept but reviewed for cause. Shutting down the whole process seems like it might be a hard sell to non-technical stakeholders, but I would absolutely operate that way in all non-production environments.


Yo, the author here. Thanks for the feedback. I totally meant idempotency, drat. (In fact, on Hadoop, thanks to speculative execution of reduce tasks, you also have to worry a bit about reentrancy, but what I was talking about was, in fact, idempotency).

Shutting down the pipeline: I hear you on prod/non-prod. For our setup, the pipeline ends up writing to a datastore, so if we kill the pipeline, the datastore is still up, it just stops updating. Which is working so far. May end up flagging suspect data as you suggest, instead of the full stop (or only full stop if more than a very small percentage of the data is suspect).
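
A rough sketch of the "flag suspect data, full stop only past a threshold" idea, in Python. Everything here is hypothetical (`process_batch`, `PipelineHalted`, the `is_suspect` predicate, the 1% default); it is not the pipeline described in the post, just one way the policy could look.

```python
from dataclasses import dataclass, field


@dataclass
class BatchResult:
    clean: list = field(default_factory=list)
    suspect: list = field(default_factory=list)  # kept for later review, not dropped


class PipelineHalted(Exception):
    """Raised when too much of a batch is suspect to keep going."""


def process_batch(records, is_suspect, max_suspect_fraction=0.01):
    """Split records into clean and suspect; halt if the suspect share is too high.

    `is_suspect` is an assumed caller-supplied predicate and the 1% default
    is an arbitrary placeholder, not a recommendation.
    """
    result = BatchResult()
    for record in records:
        if is_suspect(record):
            result.suspect.append(record)
        else:
            result.clean.append(record)

    total = len(records)
    if total and len(result.suspect) / total > max_suspect_fraction:
        raise PipelineHalted(
            f"{len(result.suspect)} of {total} records are suspect"
        )
    return result
```

With this shape, a handful of bad rows gets flagged and carried along for review, while a batch that is mostly bad stops the run before anything is written downstream.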


No problem. I am not too familiar with Hadoop, but those speculative reduce tasks sound like a real blast to debug.

I can see why the approach in your blog would have a lot of appeal in that environment. It sounds like some sort of error flagging in combination with a set of heuristics around what failed, how often, what time of day, etc. would be the way to go.

I find that intelligent monitoring systems like that are ultimately necessary in systems like this anyway, you just usually end up discovering that the hard way (I know I have, several times. It's one of those lessons you are tempted to unlearn in the interests of expediency). Does Hadoop help you out with that sort of thing?


What's the difference between reentrancy and idempotency?


Idempotency is a property of mathematical functions: f : X -> X is idempotent iff f(f(x)) = f(x) for all x in X. Absolute value is an example of an idempotent function.
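
The definition can be spot-checked in a few lines of Python. This is only a sketch: `is_idempotent_on` is a made-up helper, and testing a finite sample can refute idempotency but never prove it.

```python
def is_idempotent_on(f, samples):
    """Return True if f(f(x)) == f(x) for every x in the given samples."""
    return all(f(f(x)) == f(x) for x in samples)


def negate(x):
    return -x


samples = [-3, -1, 0, 2, 7]

print(is_idempotent_on(abs, samples))     # abs(abs(x)) == abs(x) holds on every sample
print(is_idempotent_on(negate, samples))  # fails: negate(negate(2)) is 2, negate(2) is -2
```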

Reentrancy is a property of computational functions: a function f is reentrant if two copies of f can run at the same time (in the same memory space, etc. -- the details vary) and both copies behave as if the other weren't running. This can exclude or restrict, e.g. static fields or global variables.


In my mind, reentrancy is associated with threading, and implies that the code could be executing in multiple locations simultaneously, whereas an idempotent operation is one that can be repeated multiple times without negative side effects.
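
For the "can be repeated without negative side effects" sense, a small Python sketch: appending an event is not idempotent, writing it under a key is. The names (`append_event`, `upsert_event`, the `id` field) are invented for illustration.

```python
def append_event(log, event):
    """Not idempotent: replaying the same event adds a duplicate."""
    log.append(event)


def upsert_event(store, event):
    """Idempotent: replaying the same event overwrites the same key."""
    store[event["id"]] = event


event = {"id": "evt-1", "amount": 10}
log, store = [], {}

# Simulate a retried task delivering the same event twice.
for _ in range(2):
    append_event(log, event)
    upsert_event(store, event)

total_from_log = sum(e["amount"] for e in log)
total_from_store = sum(e["amount"] for e in store.values())
print(total_from_log, total_from_store)  # 20 10
```

The retry double-counts in the append version and is harmless in the keyed version, which is the property you want when a job may be rerun.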


Reentrancy comes up with recursion and signal handling as well, both of which can come up in nominally single threaded programs (nominal because signals usually but don't always come from out of process).
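
A single-threaded Python sketch of that failure mode, using a nested call as a stand-in for a signal handler or recursive re-entry. The function names and the `on_word` callback are hypothetical.

```python
_scratch = []  # module-level state shared by every call


def join_upper_shared(words, on_word=None):
    """Not reentrant: a nested call clobbers the shared scratch list."""
    _scratch.clear()
    for word in words:
        _scratch.append(word.upper())
        if on_word:
            on_word(word)  # re-entry can happen here
    return " ".join(_scratch)


def join_upper_local(words, on_word=None):
    """Reentrant: all working state lives in locals."""
    out = []
    for word in words:
        out.append(word.upper())
        if on_word:
            on_word(word)
    return " ".join(out)


# The callback re-enters the same function while the outer call is mid-loop.
broken = join_upper_shared(["a", "b"], on_word=lambda w: join_upper_shared(["x"]))
fine = join_upper_local(["a", "b"], on_word=lambda w: join_upper_local(["x"]))
print(repr(broken), repr(fine))  # 'X' 'A B'
```

No threads are involved, yet the shared-state version returns the inner call's data instead of its own.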


You are of course correct, I have been working in a multi-threaded environment for long enough to have let that slip my mind.


This is the same issue you have when you're building your own analytic system using SQL and you're building big queries to do more complex descriptive and inferential statistical shenanigans.

No matter the technology, automated tests on known data with expected results are the only way to be sure what you're getting is right. And you better be sure, because people have zero tolerance for inaccurate analytics. They will immediately ignore your tool, permanently.

For instance... if your slot machine analytic system does not include slot machines with apostrophes in the name in its 'revenue total' column... you are so screwed. Even once you fix the bug, your stuff is tainted. Don't ask me how I know ;)
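
Something like the following is what a known-data test for that case might look like, sketched in Python with an in-memory SQLite database. The table, column names, machine names, and figures are all made up; the point is only that the fixture deliberately includes a name with an apostrophe and the expected total is computed by hand.

```python
import sqlite3

# Hand-built fixture with a known answer: 100 + 250 + 50 = 400.
KNOWN_ROWS = [
    ("Lucky 7", 100),
    ("Pharaoh's Gold", 250),  # the apostrophe case
    ("Wild West", 50),
]


def load_fixture(conn, rows):
    conn.execute("CREATE TABLE machine_revenue (name TEXT, revenue INTEGER)")
    # Bound parameters, so quoting in names is handled by the driver.
    conn.executemany(
        "INSERT INTO machine_revenue (name, revenue) VALUES (?, ?)", rows
    )


def revenue_total(conn):
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(revenue), 0) FROM machine_revenue"
    ).fetchone()
    return total


def test_revenue_total_includes_apostrophe_names():
    conn = sqlite3.connect(":memory:")
    load_fixture(conn, KNOWN_ROWS)
    assert revenue_total(conn) == 400
    conn.close()
```

If a row with an awkward name gets dropped anywhere between load and report, the total no longer matches the hand-computed figure and the test fails before a user ever sees the number.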


Great post, and the comment (only one at the time I post this) doubles its value. Be sure not to miss it!


Good article - I'm happy to find out that other DBRs have the same healthy fear & utter suspicion for each analysis they write as I do.


Great article and advice. I'd love to see more like it!



