
Yep. I've personally been in the situation where I had to show someone that I could do their analysis in a few seconds using the proverbial awk-on-a-laptop when they were planning to build a Hadoop cluster in the cloud because "BIG DATA". (Their Big Data was 50 gigabytes.)
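
For the curious, that kind of awk-on-a-laptop job is usually just a one-pass aggregation. Here's a minimal Python sketch of the same idea; the file name and column layout are made up for illustration:

    # Rough equivalent of:
    #   awk -F'\t' '{sum[$1] += $3} END {for (k in sum) print k, sum[k]}' events.tsv
    # One sequential pass; memory is bounded by the number of distinct keys,
    # and a 50 GB file streams through at roughly disk speed on a laptop.
    from collections import defaultdict

    totals = defaultdict(float)

    with open("events.tsv") as f:   # hypothetical TSV: key, timestamp, value
        for line in f:
            fields = line.rstrip("\n").split("\t")
            totals[fields[0]] += float(fields[2])

    for key, total in totals.items():
        print(key, total)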


I remember going to a PyData conference in... 2011 (maybe off by a year or two)... where one of the presenters made the point that if your data was under the 10-100TB range, you were almost certainly better off running your code in a tight loop on one beefy server than trying to use Hadoop or a similar MapReduce cluster approach. He said that when he took on a new analysis job, he'd often start by writing the generic MapReduce code (one of its advantages is that it tends to be very simple to implement), kick that job off, and then write a dedicated tight-loop version while it ran. He almost always finished the optimized version, got it loaded onto a server, and completed the analysis long before the MapReduce job had finished. The MapReduce implementation was just there as insurance: if, e.g., he hit 5pm on Friday without his optimized version quite done, he could go home and the MR job might just finish over the weekend.
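
To make that contrast concrete (the job, file name, and field layout below are my own invention, not the presenter's), here's the same count-per-key task sketched both ways in Python: the "generic MapReduce" pair you'd hand to a cluster, next to the dedicated tight loop you'd run on one beefy server.

    # Hypothetical job: count events per user in a large tab-separated log.

    # The "generic MapReduce" shape: trivially simple to write and trivially
    # parallel, but on a real cluster every record pays serialization,
    # shuffle, and scheduling overhead.
    def mapper(line):
        user = line.split("\t", 1)[0]
        yield user, 1

    def reducer(user, counts):
        yield user, sum(counts)

    # The dedicated tight-loop version: the same logic as one in-memory pass.
    from collections import Counter

    def tight_loop(path):
        counts = Counter()
        with open(path) as f:
            for line in f:
                counts[line.split("\t", 1)[0]] += 1
        return counts

On a single machine the shuffle step buys you nothing; the tight loop is already limited by how fast the disk can feed it.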


I suggest a new rule of thumb: if the data can fit on a micro SD card [1], then it's smaller than a thumb, so it can't be big data. ;-)

[1] https://www.amazon.com/SanDisk-Extreme-microSDXC-Memory-Adap...


My rule of thumb has been: "if I could afford to put the data in RAM, it's not that big a deal".


Also "If the data could fit on a laptop or desktop I can buy at a retail store"


Uh oh, a lot of ML startups just freaked out! Name-brand 1TB microSDXC cards are less than $200 on Amazon.



