For ‘Big Data’ Scientists, Hurdle to Insights Is ‘Janitor Work’ (nytimes.com)
141 points by rgejman on Aug 18, 2014 | 60 comments



While I am all for companies and software that expedite data cleaning, I don't think any software can cleanly solve this problem in the near future.

1. Data hygiene issues show up inconsistently, even within the same data source. I experienced this first-hand as a quant trader: I only had half a dozen well-structured time series coming from our trading apps and switches. You would think that I could get reasonably clean data every day, or failing that, be able to automate the data cleaning script. Nope! I experienced every data hygiene issue imaginable: from ntpd being broken, local time/UTC inconsistencies, and switch firmware acting up, all the way to human errors in the ETL process. Building a reliable data pipeline has so many moving parts that I am fairly pessimistic about any one-stop solution. (A sketch of the kind of timestamp sanity checks involved follows after this list.)

2. Then there is a volume issue. If the data is small enough, humans can correct it fairly reliably, especially with some help from automation scripts/software. That said, doing this at scale is a very hard problem. One thing I learned working at a big data company is that many folks use MapReduce as a data cleanser at scale, and for that, MapReduce is a pretty awkward tool.

3. Anecdotal evidence: I have talked to employees at various big data platform/software companies, and though they have wide-ranging opinions on stream processing, Hadoop, Spark, etc., they all agree data cleaning is a huge pain: a deal-slowing, sometimes deal-killing, unsolved problem that their employers semi-solve with ever-increasing Sales Engineer headcount. If this issue of solving data cleaning at scale were easy, I feel that someone would have come up with a very effective answer by now (and as an industry insider, I should have heard about it).
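To make point 1 concrete, here is a minimal sketch of the kind of clock sanity checks that pile up. It is purely illustrative; the helper name and the pandas-timestamp input are my own assumptions, not anything from a real pipeline:

```python
import pandas as pd

def timestamp_sanity_checks(ts: pd.Series) -> dict:
    """Toy checks for the clock problems described above.
    `ts` is assumed to be a Series of event timestamps in arrival order."""
    diffs = ts.diff().dropna()
    return {
        # Events going backwards in time usually mean a drifting or reset clock
        # (e.g. broken ntpd on one of the boxes).
        "out_of_order": int((diffs < pd.Timedelta(0)).sum()),
        # Exact duplicate timestamps can indicate replayed or double-loaded records.
        "duplicate_timestamps": int(ts.duplicated().sum()),
        # Gaps of an hour or more in an otherwise dense feed often betray a
        # local-time/UTC mixup or an outage somewhere upstream.
        "hour_plus_gaps": int((diffs >= pd.Timedelta(hours=1)).sum()),
    }
```

Each check is trivial on its own; the pain is that the list never stops growing and differs for every source.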


I completely agree. However I think there is also plenty of room for improvement in the ETL process that doesn't necessarily mean a one-stop solution.

For example, the common denominator data format for many use cases (not all) is a relational database. Why isn't there something that can grab data from anywhere in any common format and import it into a relational database with only a couple of commands or mouse clicks? Right now the situation is that you either have to pay a lot of money for such a tool, which usually doesn't work very well, or you piece together the various data conversion tools with scripts. There are of course a lot of other steps in the ETL process, but that alone would save a lot of time.
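To be fair, the "piece it together with scripts" route is already short for the easy cases; a rough sketch of what I mean (file name, table name, and connection string are all made up):

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical example: load a CSV into a relational table in a few lines.
# pandas also has read_json, read_excel, read_parquet, etc. for other formats.
engine = create_engine("sqlite:///warehouse.db")
df = pd.read_csv("customers.csv")
df.to_sql("customers", engine, if_exists="replace", index=False)

print(pd.read_sql("SELECT COUNT(*) AS n FROM customers", engine))
```

The frustration is that nothing this simple exists for the messier sources, which is where most of the time goes.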


>However I think there is also plenty of room for improvement in the ETL process that doesn't necessarily mean a one-stop solution.

I agree with you here as well. I recently wrote a blog article about this, at least for log data: http://www.fluentd.org/blog/unified-logging-layer


Very cool! I guess my particular ETL complaint above is solved for log data. :)


> Why isn't there something that can grab data from anywhere in any common format and import it into a relational database

IME, the issue is not tooling so much as dirty data. Even if a tool is a perfect solution, the data is (in the majority of companies) so dirty and inconsistently sparse that automated tooling breaks down.


Absolutely. I was trying to make a larger point that often the tooling for the stuff that can be easily automated or semi-automated is currently sub-par. If we fixed that tooling as a community then that would reduce the majority of the "janitorial work" to fixing dirty data which is definitely a harder problem. In other words, I think there is still a lot of low-hanging fruit in the ETL process.


Regarding 1: can this be solved by some sort of structured-data training set, used to train some variant of a classifier? Building the training set would be a giant PITA, but would this be a reasonable first step in the data cleansing problem?
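Something like this toy sketch of the idea (the numeric features and labels are invented, just to show the shape of it): hand-label some rows as clean or dirty, train a classifier, and use it to triage new records for human review.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
clean = rng.normal(loc=[100, 10, 1], scale=[5, 1, 0.1], size=(300, 3))   # well-behaved rows
dirty = np.column_stack([rng.uniform(0, 1e4, 60),                        # obviously broken rows
                         np.zeros(60),
                         rng.uniform(-5, 5, 60)])
X = np.vstack([clean, dirty])
y = np.array([0] * 300 + [1] * 60)            # the painstakingly hand-built labels

clf = RandomForestClassifier(random_state=0).fit(X, y)
new_rows = np.array([[101.0, 10.2, 1.0], [9999.0, 0.0, -3.0]])
print(clf.predict(new_rows))                  # second row gets flagged for a human
```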


> If this issue of solving data cleaning at scale were easy, I feel that someone would have come up with a very effective answer by now (and as an industry insider, I should have heard about it).

Agreed. Even though all these companies in the 'big data' space claim they do everything under the sun because they run on Hadoop or Spark, data cleansing is still an area people are hesitant to dive into. Rightly so. What surprises me more is that ETL is also widely over-marketed by a lot of new big data companies. I don't think people have paid enough respect to how subtle the 'transform' stage can really be.


Yes. I work for Informatica on a data integration product -- taking streams of similar but somewhat different data and trying to determine the actual state of things. It's often used for merging customer lists, for example.

We have found that determining the best value of truth from incomplete and sometimes conflicting data sources is one of these messy problems with no easy answers. It takes a lot of fiddling with match rules and considerable domain knowledge to get good results. Accordingly, our product is designed to be very flexible and configurable.

I suspect improvements will come from making it easier for analysts to see a) the effects of proposed match rule changes and b) what decisions contributed to producing each output record under the current rules.
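To give a flavor of how fiddly even a single rule gets, here is a deliberately naive sketch (toy fields and thresholds; this has nothing to do with our actual product):

```python
from difflib import SequenceMatcher

def similar(a: str, b: str) -> float:
    """Normalized string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def match_rule(rec_a: dict, rec_b: dict) -> bool:
    """Toy customer-merge rule: names must be close, and either the email
    matches exactly or the zip code does."""
    name_ok = similar(rec_a["name"], rec_b["name"]) > 0.85
    email_ok = rec_a["email"].lower() == rec_b["email"].lower()
    zip_ok = rec_a["zip"] == rec_b["zip"]
    return name_ok and (email_ok or zip_ok)

print(match_rule(
    {"name": "Jon Smith",  "email": "jsmith@example.com", "zip": "10001"},
    {"name": "John Smith", "email": "JSMITH@example.com", "zip": "10003"},
))  # True -- but every threshold here is a judgment call that needs domain knowledge
```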


Speaking for the data janitors of the world, I can't believe there wasn't a mention of hiring some of us to do this so-called dirty work, if for nothing else than to save these data scientists from using the word sexy.

After twenty years of financial operations, including a decade in the back-office of a top hedge fund, I eventually accepted that I'm a bit backwards for my desire not to jump straight to a pivot table when encountering a new data set. Like a farmer reaching down to touch the soil, my first step is exploring the rows and columns with little tests here and there to find weaknesses within the information at hand rather than paving over them with instantly flawed reporting.

Anyway, for all the big data scientists out there too sexy to clean up their own data, I'm looking for work and don't mind pushing a broom. Check my profile for more info.


> Like a farmer reaching down to touch the soil, my first step is exploring the rows and columns with little tests here and there to find weaknesses within the information at hand rather than paving over them with instantly flawed reporting.

This needs to be done by more people, more often. I've worked with more than one database (often SQL, but it doesn't matter¹) where someone will claim that X is always true. If X is something that can be put into a DB query, then you should do that, and run the query. Every time. In my experience, if it isn't enforced by constraints in the database software, then it isn't true.

¹The nice thing about SQL is that you can put constraints on the data, and use the database to keep various invariants invariant.

Often, you'll find people making assumptions about the data based on the business logic, not the constraints in the database. Such thinking is flawed: you cannot reason about what you think the data is; you must reason from what the data actually is.

Worse, you'll find people making assumptions about what they think the business logic is.
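Concretely, the habit looks something like this sketch (in-memory SQLite and a made-up table, just to show the "don't trust the claim, query it" step):

```python
import sqlite3

# Claimed invariant: "customer_id is never NULL in orders". Don't trust it; query it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO orders (customer_id, total) VALUES (1, 9.99), (NULL, 4.50), (2, 12.00);
""")
violations = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE customer_id IS NULL"
).fetchone()[0]
print(f"{violations} rows violate the claim")  # 1 -- the durable fix is a NOT NULL constraint
```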


We created a "Data Engineer" job (inspired by Amazon's name for the position) for precisely that purpose. The core schema behind Zalora (we sell clothes online) had over 5,000 tables and, well, some of the stuff could definitely be improved. There were so many upstream issues that we decided to make fixing them full-time work (working with both the developers and the end users, like the people buying the products to put on the site).

I can't comment too much on it publicly, but it looks like it was worth it. I definitely agree that there is a split between statisticians and "data engineers" in both personality and interests, and you should have both on your "data science"/"big data"/whatever-they-call-analysts-these-days team. It is, however, bloody hard to convince upstairs to pay a lot of money for the work! (As with anything "without deliverables", really.)



It turns out this problem shows up even for relatively small datasets, things that would easily fit into memory on any consumer-level laptop from a couple of years ago. I did some contract work for the NYPD a few years ago, and the primary data source I had to deal with was a block of free text describing a 911 call. All they really wanted was some frequency counts on specific terms of interest; sounds easy enough.

Oh, but their data warehouse thing, whose name nobody even knew, inserted a random space every so often and removed random spaces every so often, making the text completely unreliable. Combined with all the normal misspellings, fat-fingered acronyms, slang, and other nonstandard ways of writing things, it turned out to be a huge job.

"Bicycle stolen at 0930 from residence on the 300 block of E 10th St, East Village"

would get turned into "Bicycle stol en a t093 0 from r es iden ceon the 300 block of E 10 thst, Ea st Vil lage."

Good luck with that.


Probably way beyond the scope of that project, but actually this is a pretty interesting and well-studied problem called segmentation, where you train a system to insert the correct breaks. This is used in natural language problems like OCR or machine translation.

Maybe google "machine translation segmentation" or OCR for references to papers on the topic, but methods for doing this are really successful (especially on English).


I bet you'd have better luck just removing all spaces and then resorting to a text segmentation algorithm. Peter Norvig has some great papers on how to do this effectively. I've wrapped his code with a Tornado web service: https://github.com/adkatrit/text-segmentation-server
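The core of the approach fits in a few lines. A minimal sketch in the Norvig style (the toy word counts stand in for real unigram frequencies, e.g. a file like count_1w.txt):

```python
import math
from functools import lru_cache

# Toy unigram counts; a real run would load genuine word frequencies.
COUNTS = {"bicycle": 50, "stolen": 40, "at": 1000, "from": 900, "residence": 30,
          "on": 950, "the": 2000, "block": 60, "of": 1500, "east": 80, "village": 25}
TOTAL = sum(COUNTS.values())

def log_prob(word: str) -> float:
    """Log-probability of one word; unknown words are penalized by length."""
    if word in COUNTS:
        return math.log(COUNTS[word] / TOTAL)
    return math.log(10.0 / (TOTAL * 10 ** len(word)))

@lru_cache(maxsize=None)
def segment(text: str):
    """Best-scoring segmentation of `text` into words, found recursively."""
    if not text:
        return 0.0, []
    candidates = []
    for i in range(1, min(len(text), 20) + 1):
        first, rest = text[:i], text[i:]
        rest_score, rest_words = segment(rest)
        candidates.append((log_prob(first) + rest_score, [first] + rest_words))
    return max(candidates)

print(segment("bicyclestolenfromresidence")[1])  # ['bicycle', 'stolen', 'from', 'residence']
```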


Nothing related to analyzing free form text should ever be considered 'easy'. The wrinkle you describe was just the cherry on top.


How did you end up dealing with the issue? My first idea would be to remove all the spaces and parse that


After contemplating the issue for a few days I told them they need to talk to their data warehouse provider and fix it before we could proceed.


Parseamen.


It is worse than the article says -- the data cleaning is often neither tracked nor reproducible. First, think about how much data cleaning is not done by a script or versioned at all. So many people just overwrite the bad data in place. Second, of the data cleaning / ETL that is scripted or partially automated, think about how much is not saved, much less version controlled. As a result, if you ever want to go back and reproduce or share your analysis, good luck!

Doing it better is simple: If you do ETL, version control your code. If you do hand edits, track the before and after.
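Even the hand-edit case is cheap to track. A tiny sketch (hypothetical file name) of "keep the raw copy and record its hash before touching anything":

```python
import hashlib
import pathlib
import shutil

def snapshot(path: str):
    """Preserve the untouched input and record its checksum, so every later
    cleaning step can be replayed or audited against the original."""
    p = pathlib.Path(path)
    digest = hashlib.sha256(p.read_bytes()).hexdigest()
    raw_copy = p.with_suffix(p.suffix + ".raw")
    shutil.copy2(p, raw_copy)        # never edit the original in place
    return digest, raw_copy

print(snapshot("survey_responses.csv"))   # hypothetical input file
```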


I do this with awk or sed scripts. This way I preserve the original data and have exact, documented steps showing how the data set was changed. It has been ages since I actually went into a file manually to overwrite information.


I can tell you're a practitioner (as opposed to a fly-by commenter), as I do the exact same thing!


Is anyone actually doing that? I wonder if there's a profession, or if there ought to be, specifically for "data wrangling". On the other hand, maybe you can't separate it from the work of the data scientist... perhaps the data scientist needs to be involved in the data wrangling in order to understand the source better?


> perhaps the data scientist needs to be involved in the data wrangling in order to understand the source better?

Founder of an ETL startup here. This is exactly what we believe: the end-user of the data should be involved as early in the data pipeline as possible, including the wrangling. If you eliminate the engineer from the ETL process you remove a lot of painful back-and-forth and get more flexible pipelines.


I'd love to hear about your startup and/or beta test.


Hit us up at info at etleap dot com.


There are places that do that; at my first job we knew that the ETL code was core business functionality, and version-controlled it appropriately. And we knew that our ability to store raw data reliably was a competitive advantage.

But in many cases it's a technical debt issue. In the early startup stages, when you're struggling for product-market fit, cleaning or auditing your data is not a good use of time. When a business like that starts to get traction, it finds itself with clients who now want more reliability, with scaling issues, and with a bunch of data that no one really remembers where it came from or how it was generated. A lot of the big-data startups are hitting that stage now.


Yes, and send issues and documentation back to the operational teams so they can address root causes in the transactional systems. Software selection and system design should take this input by default, to avoid building data issues in from the start.


One issue that was very time-consuming for me until recently: the machines mining data upstream, across many clusters of VMs, would sometimes spit out malformed files that needed to be fed downstream for indexing.

Each file had about 100k objects, and when a file was malformed there would be fewer than 10 objects with errors on average. Fixing them was easy (but annoying) at first, but scaling up the clusters made it exponentially harder to do by hand in the terminal.

It was conceptually easy to solve once I abstracted the problem a bit. I had a rough idea of where the data on the mining servers came from (URLs, OCR'd PDFs, etc.), and there are plenty of JSON parsing libraries that tell you, down to the byte, where the errors occur in a file. After that it was a matter of traversing the file forwards and backwards in memory (luckily the files are only ~20MB each, though they crashed every text editor I tried before I resorted to search-and-replace), looking for anything that could help me reconstruct the malformed object, or removing the object if not enough information was available. And since the error rate was only ~10 in 100,000, I could usually use the nearby objects to reconstruct the broken one by inference instead of dropping it.
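For what it's worth, the error-locating part is cheap with the standard library. A stripped-down sketch (it assumes one JSON object per line, which isn't exactly my format, but shows the idea):

```python
import json

def salvage_objects(path: str):
    """Best-effort split of a file into parseable and broken JSON records."""
    good, bad = [], []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            try:
                good.append(json.loads(line))
            except json.JSONDecodeError as err:
                # err.pos is the character offset of the failure, which lets you
                # jump straight to the broken spot instead of scrolling for it.
                bad.append((lineno, err.pos, line))
    return good, bad
```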


Kathleen Fisher et al. did some exciting PL-oriented research on automatic tool generation for ad hoc, semi-structured data at AT&T Research a while ago. Take a look at their TOPLAS and POPL papers: The Next 700 Data Description Languages: http://www.padsproj.org/papers/ddcjournal_preprint.pdf

From Dirt to Shovels: Fully Automatic Tool Generation from Ad Hoc Data: http://www.padsproj.org/papers/popl08.pdf

The website for the project is: http://www.padsproj.org/


I suspect this kind of janitor work is why I gravitate toward Python. Even in a nice, controlled environment like Kaggle, it's remarkable how much string/text manipulation you have to do just to get a basic benchmark or cram stuff into a random forest -- and that's when the data has already been put into a text file (in my case, a CSV) for you. My guess is that it would take 10 times more data munging to get that file produced in the first place. Python has good libraries and is very good for all that data munging.
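The kind of munging I mean, on a hypothetical Kaggle-style CSV (all file and column names are made up):

```python
import pandas as pd

df = pd.read_csv("train.csv")
# Normalize headers: "Sale Price " -> "sale_price"
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
# Tidy a free-text category column.
df["category"] = df["category"].str.strip().str.lower()
# Turn "$1,299" into 1299.0, coercing junk to NaN, then drop the unusable rows.
df["price"] = pd.to_numeric(df["price"].str.replace("[$,]", "", regex=True),
                            errors="coerce")
df = df.dropna(subset=["price"])
```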

Trying to outsource this stuff is appealing, I suppose, but in my experience that has turned out to be surprisingly difficult as well. I once worked with the supply planning division of a manufacturing organization to try to reduce inventory by anticipating when orders would come in, so I got access to the sales database. It appeared that there was a large spike in orders later in the quarter, and that we were carrying inventory unnecessarily. Actually, as we discovered during a pilot, salespeople who got new orders from the same customer later in the quarter would just delete and re-enter the entire order rather than making small updates, which they found to be a hassle (this was quite a while ago, when these systems weren't as easy to use).

These attempts to section off and outsource the "low value" work often just make things worse. It's just too unpredictable.


"Janitor Work" is an "it depends" situation. I agree with the article that data wrangling is counter-productive, where the 80/20 ratio for wrangling/analysis can still be improved. But I disagree with the notion that doing such work never leads to a data scientist's technical growth. Such dirty work allows data scientists to be more effective because they can take data of any form, know its shortcomings, and generate insights from it with proper scope. Those who would want to rely on 'clean' data always handed down to them, I must say, are spoiled.

On a side note, it's narcissistic and cocky to use the word "janitor" for such an issue. It's not like data scientists should only be worthy of doing the illustrious part of the job and never the dirty tasks, right? I'd still go and take the broom and clean up the mess when no one else can.


Data cleaning/munging is a surprisingly pervasive problem in my field ('big data' cancer research). I spend a good fraction of my time hunting down bugs caused by dirty sources upstream of my scripts. One way to fix this is to become more rigorous about using tools and libraries that check and enforce data correctness and consistency... but it's hard to do this on an individual level. Even if I'm careful about ensuring the correctness of my data, other researchers who send me data may not be as scrupulous. As with any challenge that requires teamwork and constant vigilance (e.g. unit testing), perhaps the data science community needs to invest in tools and computational frameworks that constantly monitor data for correctness and consistency using DDLs and the like.


My team at Airbnb is working on a tool called "Salus" - named after the Roman goddess of hygiene - that allows you to check the correctness of your data in a number of common, extensible ways. Perhaps this is something that the open-source community at large might be interested in?


I have rolled my own extensive (purpose-built) general data-quality system at work, and would love for there to be a general-purpose tool that does the same sort of thing. I would love to use, contribute to, or otherwise help make real an open-source version of such a tool.

There is an enormous need in several industries for this sort of thing; but most people don't really know they need it yet.


Interesting. I'd love to hear the details.


I'd be interested. Any tool that checks correctness is one less thing I have to write before I begin my analyses.


Yes. This is something that could be useful to data journalists, who often have to quickly grip and rip new datasets.


Sounds intriguing... Is it something rule-based where I have to define things ahead of time, or can it infer stuff - for example highlighting outliers in an automatically detected numeric field?


Definitely something I'd be interested in taking a look at, I imagine others would as well!


Absolutely. Would definitely be interested in hearing more about it.


Absolutely - please keep me updated on progress!


Yes.


yes.


If cancer research is anything like high-frequency trading and time series analysis (which I used to work on), then one heinous thing about corrupted data is the inter-dependencies among different data processing tasks: if data coming in "early" in the workflow topology is garbage, its errors propagate throughout the dataflow, resulting in multiple days of data detective work, laborious undoing of everything that went wrong, etc.

Also, I observe that the data science community does not talk about these issues nearly as much as it should, either because 1) it doesn't make a very inspirational topic, or 2) they are shielded from this kind of issue by a data engineering team.


Agreed. We need to talk about these issues as a community and come up with effective strategies to prevent data contamination. We've done this for web development and the result was a panoply of tools to validate form input. Now we need to do this for our back-end tools.

For instance, statistical programming languages like R, which generally operate on nice tabular data, should come with built-in methods to enforce data validity. I should be able to tell R that I expect certain columns to contain only values from 0-1 and no NAs. This is an easy example, because languages like R are somewhat dictatorial about how they want you to input and process data, but one can imagine the same sort of methods built into the base libraries of general-purpose languages.

On the other hand, maybe we need to be more rigorous on the whole about data validation. We accept that automated/continuous testing is an effective mechanism for preventing bugs. We need the same automated systems to check data files and flag them when there are issues.
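As a sketch of what those automated checks could look like in practice (Python rather than R here, and the file name, column names, and rules are invented):

```python
import pandas as pd

def validate(df: pd.DataFrame) -> list:
    """Flag violations of the rules we claim to expect from this dataset."""
    problems = []
    if df["score"].isna().any():
        problems.append("score contains NAs")
    if not df["score"].between(0, 1).all():
        problems.append("score has values outside [0, 1]")
    if df["sample_id"].duplicated().any():
        problems.append("sample_id contains duplicates")
    return problems

df = pd.read_csv("measurements.csv")       # hypothetical input file
issues = validate(df)
if issues:
    raise ValueError("data validation failed: " + "; ".join(issues))
```

Run as part of the pipeline, this is the data equivalent of a failing unit test: the file gets flagged before anything downstream consumes it.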


Take a look at the testdat package, which provides some tools in this direction.


I quite like the term "data civilian."


Yes! They'll never know what life is really like in the data trenches, watching your best buddy get renormalized just one standard deviation away from you. And maybe they can sleep at night.

Semper chi^2, bro.


Really? It made me think that Monica Rogati, the speaker of said term, is very full of herself and her profession.


Having met Monica briefly at a conference, I don't think that's the case. But I can understand why that would give the impression.


I was wondering what that meant.


The article doesn't give a definition, but I assume it is referring to people who work with data at a small scale, i.e. something that couldn't be called "big" data. Totally arbitrary, but it's nice to have a term which is somewhat the opposite of "big data scientist."


"Big" is of course also relative; there is a massive difference between something you can load in Excel, Google's 2 terabytes of N-grams and datasets that you can't even fit on a regular hard drive.


IMHO it's not relative at all and is pretty clearly defined - "big data" refers to technologies and processes for analyzing data that doesn't fit on a single machine, and only to that.

It is often misused to mean anything that doesn't fit on an A4 sheet of paper, but for many companies the largest dataset that they have fits in the RAM of an ancient laptop.

If you need to analyze a couple million sales transactions or a dozen gigabytes of web visitor logs then the most appropriate methods won't include 'big data' in their descriptions.


I agree of course (I'm in science at the moment), but my day would be a lot more annoying if I corrected every misuse of the terms "data mining", "big data" or any other technical/scientific term that has been adopted by the press.


It's a fine line between cleaning data and quantizing data. Spelling mistakes, format error correction, and consistent formatting are all "janitorial".

However, categorizing data into the correct groups, removing what you consider "non-essential", or simply rounding off decimal numbers can all have an impact on the analysis downstream.

In some ways data science is pretty much all about learning best practices for quantizing data.


And of course, once your data is clean, if you actually want to be able to predict anything you get to spend the time tinkering with model structure, feature extraction techniques, hyperparameter tuning, data size / model size tradeoffs, and a whole bunch more yeoman's work. Depending on your temperament you might find this boring as sin.
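A toy taste of that yeoman's work (synthetic data and an arbitrary parameter grid, just to show the loop you end up living in):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Sweep a small hyperparameter grid with cross-validation and keep the best model.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 200], "max_depth": [None, 5, 10]},
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```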


The fact that nobody mentioned Quandl in this article is criminal negligence.



