
Yep. I've personally been in the situation where I had to show someone that I could do their analysis in a few seconds using the proverbial awk-on-a-laptop when they were planning to build a Hadoop cluster in the cloud because "BIG DATA". (Their Big Data was 50 gigabytes.)
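
For the curious, that kind of awk-on-a-laptop job is usually just a one-pass aggregation. Here's a minimal Python sketch of the same idea; the file name and column layout are made up for illustration:

    # Rough equivalent of:
    #   awk -F'\t' '{sum[$1] += $3} END {for (k in sum) print k, sum[k]}' events.tsv
    # One sequential pass; memory is bounded by the number of distinct keys,
    # and a 50 GB file streams through at roughly disk speed on a laptop.
    from collections import defaultdict

    totals = defaultdict(float)

    with open("events.tsv") as f:   # hypothetical TSV: key, timestamp, value
        for line in f:
            fields = line.rstrip("\n").split("\t")
            totals[fields[0]] += float(fields[2])

    for key, total in totals.items():
        print(key, total)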


I remember going to a PyData conference in... 2011 (maybe off by a year or two)... where one of the presenters made the point that if your data was under the 10-100TB range, you were almost certainly better off running your code in a tight loop on one beefy server than trying to use Hadoop or a similar MapReduce cluster approach. He said that when he took on a new analysis job, he'd often start by writing the generic MapReduce code (one of its advantages is that it tends to be very simple to implement), kick that job off, and then write a dedicated tight-loop version while it ran. He almost always finished the optimized version, got it loaded onto a server, and completed the analysis long before the MapReduce job had finished. The MapReduce implementation was just there as insurance: if, e.g., he hit 5pm on Friday without his optimized version quite done, he could go home and the MR job might just finish over the weekend.
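
To make that contrast concrete (the job, file name, and field layout below are my own invention, not the presenter's), here's the same count-per-key task sketched both ways in Python: the "generic MapReduce" pair you'd hand to a cluster, next to the dedicated tight loop you'd run on one beefy server.

    # Hypothetical job: count events per user in a large tab-separated log.

    # The "generic MapReduce" shape: trivially simple to write and trivially
    # parallel, but on a real cluster every record pays serialization,
    # shuffle, and scheduling overhead.
    def mapper(line):
        user = line.split("\t", 1)[0]
        yield user, 1

    def reducer(user, counts):
        yield user, sum(counts)

    # The dedicated tight-loop version: the same logic as one in-memory pass.
    from collections import Counter

    def tight_loop(path):
        counts = Counter()
        with open(path) as f:
            for line in f:
                counts[line.split("\t", 1)[0]] += 1
        return counts

On a single machine the shuffle step buys you nothing; the tight loop is already limited by how fast the disk can feed it.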


I suggest a new rule of thumb: if the data can fit on a micro SD card [1], then it's smaller than a thumb, so it can't be big data. ;-)

[1] https://www.amazon.com/SanDisk-Extreme-microSDXC-Memory-Adap...


My rule of thumb has been: "if I could afford to put the data in RAM, it's not that big a deal".


Also "If the data could fit on a laptop or desktop I can buy at a retail store"


Uh oh, a lot of ML startups just freaked out! Name-brand 1TB microSDXC cards are less than $200 on Amazon.



