A programmer’s guide to big data: tools to know (gigaom.com)
79 points by mwetzler on Dec 18, 2012 | 29 comments



Here's another list, sans product hype:

Unix command-line tools like awk and grep. Set operation commands are essential too (http://www.catonmat.net/blog/set-operations-in-unix-shell/); see the sketch at the end of this comment

Ruby/Python/Perl for more complex massaging and wrangling

Excel (yes, really) for quick stats and graphs, great 1st step in understanding what you have

D3.js for visualization

I've used R in the past, but I found I was trying to squeeze data into R models. D3 is pretty hard to grasp initially (I'm very much still learning) but I'm finding that it's helping me think through how to visualize the data I have, rather than just forcing it into one of a few standard charts.
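
To make the set-operations point concrete, here's a rough shell sketch (assuming bash for the process substitution, one value per line, and made-up file names):

  # union of two value lists
  sort -u a.txt b.txt > union.txt

  # intersection: lines present in both (comm needs sorted input)
  comm -12 <(sort a.txt) <(sort b.txt) > intersection.txt

  # difference: lines in a.txt but not in b.txt
  comm -23 <(sort a.txt) <(sort b.txt) > difference.txt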


This isn't big data at all. This is the stuff you'd use AFTER you've processed your big data and obtained a reduced result that is small data, on the order of a few hundred MB. To process that big data in the first place, on the order of a few PB, you're typically going to use HDFS and, at the programmatic level, an API like Cascading or, better yet, Scalding. So you end up needing only that last item, D3.js, with everything else existing as Scalding source.


This is where "Big Data" becomes about as useful as a penis measuring contest. The vast majority of business data analysis happens at MS Excel levels of data. I have done analysis on several GBs of data using R, and those data-sets are easily the 99th percentile of data size in my field. For datasets that are bigger than that, SQL is still pretty damn good. And when you get to the point where SQL starts to break down in usefulness, you are at a level where you can start sampling and still get perfectly usable results in almost all use cases.
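
And sampling doesn't have to be fancy; a minimal shell sketch, assuming a line-oriented file with a header row (file names are made up):

  # keep roughly 1% of rows, preserving the header
  (head -n 1 big.csv; tail -n +2 big.csv | awk 'BEGIN{srand()} rand() < 0.01') > sample.csv

  # or take a fixed-size random sample with GNU coreutils
  tail -n +2 big.csv | shuf -n 100000 > sample.csv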

In my experience, "Big Data" is a misnomer. What it really means is fast data. You have a repeatable calculation to perform on datasets that are perfectly fine chugging away on an Oracle cluster for 3 hours, but you want it done in 20 minutes. That is what Big Data is really used for. Anything else is just marketing hype.


Some of the tools in the article are enabling businesses to capture new types of data by making data collection & analysis cheaper and easier. For example, user interactions in an app. User interactions can happen hundreds of times per session, across thousands of devices.

Most businesses don't have to deal with those data sets, because they haven't had the opportunity to record that level of detail yet.

I think one of the most exciting things about "Big Data" (ugh, so buzzy) is that there are so many new opportunities for data gathering and analysis. Now that data storage is so much cheaper, we can record more stuff.

Full disclosure, I work at one of the companies in the list (Keen IO). We didn't know the article was coming out; not sure how they came up with the list or where they got the content (the description of what we do is a bit outdated).


No. Big data has a formal definition, but one that most people ignore:

When the size of the data becomes part of the problem.

(An O'Reilly author said this once; I forget his name.)

For example, physicists in the '80s who had tens of MBs of data had a big data problem.

Nowadays, this typically means out-of-core data sets.


That definition is but one of many supposed formal definitions. And it isn't a particularly good one, because it makes "Big Data" something that can only be quantified person by person. According to the guy with the Marketing degree from Ho-Hum College, anything too big to fit in an Excel spreadsheet is "Big Data". Switch to Doug Cutting, and "Big Data" is in the hundreds of petabytes.

I don't buy the out-of-core definition either. I work with a data warehouse holding petabytes of information... it certainly doesn't fit in memory. I don't use Hive, Pig, Cascading, etc. (okay, sometimes I use Cascalog, but not as a strategy for dealing with large amounts of data). I use SQL. And it works perfectly fine. But if you ask any of the people out there talking about "Big Data", an SQL database doesn't fit the definition. Hell, I have done processing on a 200GB CSV file using nothing more than GAWK. Nobody is calling GAWK a big data tool.

Face it. "Big Data" is a buzzword for CIOs that read magazines for CIOs but still need to find an engineer to set up their email on their iPad.


Precisely: the term "big data" is about selling stuff to CIOs.

Nobody at Yahoo went "we need Hadoop to deal with our Big Data problem"; it was simply a very-large-amount-of-data-on-a-relatively-limited-budget problem. And plenty of very large companies are happy using Teradata or Netezza to manage PBs of information.

The new set of tools are often brilliant, but the problems that they solve are almost all not new.


I feel like if you're using those tools to process raw data (and not the data produced by another job), it's not really "big" data. I would tend to define big data as data where there's so much volume that processing it in this way is not possible.

Which isn't to say this article isn't full of hype—it clearly is.


Unix command-line tools are seriously undervalued for processing data, even big data.

Need parallelization? Try xargs -P on a multi-core machine.
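
For example (the per-file script here is made up):

  # compress every log file, up to 8 at a time
  find . -name '*.log' -print0 | xargs -0 -P 8 -n 1 gzip

  # or fan an arbitrary per-file script out across 4 processes
  ls data/*.csv | xargs -P 4 -n 1 ./crunch.sh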


Don't forget cut, sort and uniq.

If you have a big CSV file, you could extract the distinct values of the fourth field:

  cut -d , -f 4 [file] | sort | uniq > values.txt

You could then grep out the records with the particular values of interest.

If a file is too big for an editor, you can still see the record structure with head.
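
For instance, the grep step might look like this (reusing values.txt from above, after trimming it down to just the values of interest; note that -F/-f matches the strings anywhere in the line, not only in the fourth field):

  # pull out records containing any of the listed values
  grep -F -f values.txt file.csv > matching_records.csv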


If you need to do lots of one-off visualizations, consider Tableau (http://www.tableausoftware.com/). Although it's not as versatile as D3.js, and although it's proprietary, it is really easy, essentially all point and click. My firm started using it recently, and I am really jealous because MBA types can create in seconds visualizations that take me hours to do with D3.js. It supports a wide variety of data inputs, including Excel and various databases. Also, Tableau has excellent support for working with maps and shapefiles.


I stumbled across Gephi ('an open source graph visualization and manipulation software') and got a few comments on it about 9 months ago: http://news.ycombinator.com/item?id=3700471

https://gephi.org/


Excel breaks badly for trivially small data sets


I tend to aggregate data into something manageable before I attempt to graph it, regardless of what tool I'm using. If you can deal with daily numbers, you only need 365 rows per year, and fewer than 9k per year (365 × 24 = 8,760) if you're doing hourly numbers.
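
For example, rolling raw events up to daily totals in awk before ever opening a charting tool; a minimal sketch, assuming a made-up column layout of id,day,value:

  # events.csv columns: id,day,value -> one total per day
  awk -F, '{sum[$2] += $3} END {for (d in sum) print d "," sum[d]}' events.csv | sort > daily.csv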


Thanks for the feedback on R. Have you felt it was more restrictive than D3 (and other technologies) or generally less comfortable to work with?


I still don't understand big data.

Is it machine learning + analytics?


Although I believe "big" usually refers to the size of the dataset, I originally heard the term used in the same sense as "Big Oil" - and the implication was that as companies collected massive amounts of data and figured out how to profit from it, "data" was going to become the next high-value commodity.

edit: Although to answer your actual question, it's a bit like "cloud computing" - it probably refers to very scalable systems with an emphasis on reading, writing, and processing large amounts of data - but really it's a bit of a marketing term :)


To branch off of your comment, which is the best description of it: for the past two years this subject has been the most-nominated topic across all of our events for both information and security officers. (The company I work for empowers F1000 C-suite executives to come up with relevant material for their peers to discuss annually.) Companies are inundated with everything from web analytics to the feedback and trends they're seeing on social media to sales numbers and any other quantifiable data, and they're trying to tie it all together so they can make changes or create new opportunities in their markets.


Oh OK, that makes more sense.

Big data is actually the problem then, not the solution?

Meaning there's too much disparate data, and now we can try to bring it together.


Yup!


Michael Stonebraker's answer is that big data can mean big volume (and small analytics), big volume and big analytics, big velocity, and/or big variety. The first three aspects fit the general usage on HN, while the fourth is his own addition, I think. He's writing a blog post on each one. The first post explains what he means by each aspect:

Intro plus the "big volume, small analytics" case: http://cacm.acm.org/blogs/blog-cacm/155468-what-does-big-dat...

"Big volume, big analytics": http://cacm.acm.org/blogs/blog-cacm/156102-what-does-big-dat...

"Big velocity": http://cacm.acm.org/blogs/blog-cacm/157589-what-does-big-dat...

"Big variety": yet to come


I think of it as "trying to capture massive amounts of data and somehow get useful information out of it".

This is pretty good: http://en.wikipedia.org/wiki/Big_data


It's having a dataset that's large enough and/or complex enough that it can't really be processed on a single machine in a reasonable amount of time.


Unfortunately, it's a buzzword that marketers use however is convenient.

The most common definition refers to the size of the dataset, but I've heard it used by so-called-experts in reference to some very small datasets.


It's when you use more than 2 computers to make a chart.


Some big data is machine learning; a lot, however, is running analytics via software like Hadoop (MapReduce) to glean results from the data. An example covering both: calculating shipping times based on previous shipping times. If you had enough data, you could predict with analytics how long a shipping process would take. You could then use machine learning to get more and more accurate results.


No, it's not, although it seems to usually be linked with analytics.

It's also things like sampling the power grid, photos from astronomy, and genomic data.

But unfortunately those sorts of things don't seem to get the same kind of buzz.


When the press uses it, it means "data".


I don't get it. Why not just learn Hive (it's SQL), or use Python with Hadoop?
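
Streaming makes the Python route pretty painless; a rough sketch for the 1.x-era layout (the jar path, HDFS paths, and script names will vary):

  hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -input /data/logs \
    -output /data/logs-out \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py -file reducer.py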



