
Most people who think they have big data don't.


At an absolute minimum, I'd say "big data" begins when you can't buy hardware with that much memory.

Apparently, in 2017, that was somewhere around 16 terabytes. https://www.theregister.co.uk/2017/05/16/aws_ram_cram/ Heck, you can trivially get a 4TB instance from Amazon nowadays: https://aws.amazon.com/ec2/instance-types/

The biggest DBs I've worked on have been a few tens of billions of rows, and several hundreds of gigabytes. That's like... nothing. A laughable start. You can make trivially inefficient mistakes and for the most part e.g. MySQL will still work fine. Absolutely nowhere near "big data". And odds are pretty good you could just toss existing more-than-4TB-data through grep and then into your DB and still ignore "big data" problems.


15 years ago I was working at a utility and was re-architecting our meter-data solution as the state government was about to mandate a change from manually read meters to remotely read interval meters. Had to rescale the database from processing fewer than 2 million reads a year to over 50 million reads _a day_ (for our roughly 1M meters). Needed to keep 16 months online, and several years' worth in near-line storage. We went from a bog-standard Oracle database to Oracle Enterprise with partitions by month. This was being fed by an event-driven microservices[0] architecture of individually deployable Pro*C/C++ Unix executables. The messaging was Oracle AQ.

At the time, I thought "wow - this is big!", and it was for me.

[0] we didn't call it microservices back then though that's clearly what it was. The term I used was "event-driven, loosely coupled, transactional programs processing units of work comprised of a day's worth of meter readings per meter". Doesn't roll off the tongue quite so easily.


10 years ago I was working on a 50Tb data warehouse. Now I see people who think 50Gb is “big data” because Pandas on a laptop chokes on it.


This, a million times this!!! I see people doing consulting promoting the most over-engineered solutions I have ever seen. Doing it for data that is a few hundred GB or at most a TB.

It makes me want to cry, knowing we handled that with a single server and a relational database 10 years ago.

Let's also not forget that the majority of data actually has some sort of structure. There is no point in pretending that every piece of data is a BLOB or a JSON document.

I have given up on our industry ever becoming sane, I now fully expect each hype cycle to be pushed to the absolute maximum. Only to be replaced by the next buzzword cycle when the current starts failing to deliver on promises.


Yep. I left consulting after working on a project that was a Ferrari when the customer only needed a Honda. Our architect kept being like "it needs to be faster", and I'm like "our current rate would process the entire possible dataset (everyone on earth) in a night, do we really need a faster Apache Storm cluster?" :S

"Resume driven development" is a great term.


It's important to remember what drives this - employers often like to think their problems are 'big data' and by god, they need the over-engineered solution. Your peers who interview you will toss your resume in the trash if you are not buzzword compliant. Hate the game, not the player.


> I have given up on our industry ever becoming sane, I now fully expect each hype cycle to be pushed to the absolute maximum. Only to be replaced by the next buzzword cycle when the current starts failing to deliver on promises

I sometimes wish I was unscrupulous enough to cash in on these trends, I’d be a millionaire now. Instead I’m just a sucker who tries to make solid engineering decisions on behalf of my employers and clients. It’s depressing to think what professionalism has cost me in cash terms. But you’ve got to be able to look at yourself in the mirror.


Yeah, or R. When I need to do stuff bigger than RAM on a laptop I just reach for sqlite (assuming it can even fit on disk).
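Something like this works surprisingly well (a rough sketch only; the file, table, and column names below are made up):

    # Stream a bigger-than-RAM CSV into SQLite in chunks, then let SQL do the
    # aggregation. Nothing here ever holds the full dataset in memory.
    import sqlite3
    import pandas as pd

    conn = sqlite3.connect("events.db")
    for chunk in pd.read_csv("events.csv", chunksize=1_000_000):
        chunk.to_sql("events", conn, if_exists="append", index=False)

    # Aggregate inside SQLite; only the small result comes back as a DataFrame.
    top = pd.read_sql_query(
        "SELECT user_id, COUNT(*) AS n FROM events "
        "GROUP BY user_id ORDER BY n DESC LIMIT 10",
        conn,
    )
    print(top)
    conn.close()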


The funny thing is we didn’t even consider our 50Tb to be particularly big; we knew guys running 200Tb...

BTW in R data.table is better for larger datasets than the default dataframe... or use the Revolution Analytics stuff on a server.


Oh yeah, the default dataframe is practically only for toy problems / very-already-filtered data (which accounts for a LOT of useful scenarios!). It's more that I've run into "big data" people who (╯°□°)╯︵ ┻━┻ when the default settings + dataframe choke, which is around 200MB of heap.


There's a big difference between a 50Tb data warehouse and a data warehouse which has a number of 10Tb tables. Our data warehouse used to have 20k tables and 18Tb of data. Our big data instance has 4k tables and 800Tb of data.


> Now I see people who think 50Gb is “big data” because Pandas on a laptop chokes on it.

IIRC the term "big data" was coined to refer to data volumes that are too large for existing applications to process without having to rethink how the data and/or the application was deployed and organized. Thus, although the laptop reference is on point, data volumes large enough to make Pandas choke in a desktop environment do fit the definition of big data.


That definition might be a bit loose. If you have an existing application designed to run on a Raspberry Pi, it'll hit that limit pretty quickly.

This is obviously an exaggeration, but I don't think a dataset that breaks Pandas in a desktop environment even comes close to big data.


> data volumes large enough to make Pandas choke in a desktop environment do fit the definition of big data

What about too large for Excel, is that “big data” too?

(Neither are)


Sorta; it depends on your workload and whether you are a solo hero dev or an enterprise requiring shared data resources.

For 16TB you'll definitely get benefits if you are doing something embarrassingly parallel and processor bound over a cluster of a handful of machines. It's just parallelism, and it's a good thing.
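And you don't even need the cluster to start reaping that; here's a sketch of the same idea on a single box (the work function is a stand-in for whatever you actually compute):

    # Embarrassingly parallel, CPU-bound work fanned out over local cores with
    # the standard library. Moving from Pool to a cluster scheduler later is
    # mostly a plumbing change, not a redesign.
    from multiprocessing import Pool

    def crunch(chunk):
        # placeholder for the per-record, processor-bound work
        return sum(x * x for x in chunk)

    if __name__ == "__main__":
        chunks = [range(i * 1_000_000, (i + 1) * 1_000_000) for i in range(8)]
        with Pool() as pool:            # defaults to one worker per CPU core
            results = pool.map(crunch, chunks)
        print(sum(results))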

I totally agree about the hundreds of Gigs - that is unless you are in a setting where many teams need to access that database and join it with others, in which case a proper data lake implemented on something beefy is a good idea. Hadoop has the benefit of distribution and replication, but another data warehouse might work better - say Oracle or Teradata if you are a small shop.


I'd say the boundary isn't on a one-dimensional axis, but somewhere in a size x time x access-pattern space:

100Gb of data may not be that much to you, but it's another story if you have to process all of it every minute. Also, everything is fine if you have nicely structured (and thus indexed...) data, but I do have a few hundred GB of images as well...


I'm pretty sure you can buy IBM's newest mainframe, the z14, with 32TB of memory. I don't think mainframes are that popular in trendy "big data" circles, though.


But we have millions of rows!

-- my actual employer

(the actual problem is our SAN is getting us a grand total of 10 iops. No, that's not wrong, 10. The "what the christ, just buy us a system that actually works" momentum has been building for 2 years now but hasn't managed to take over. Hardware comes out of a different budget than wasted time, yo.)
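For anyone who wants a quick gut-check before the budget fight, here's a very rough sketch (path and sizes are placeholders; it will mostly hit the page cache unless the file dwarfs RAM, so treat it as an upper bound and use a real tool like fio with direct I/O for actual numbers):

    # Crude random-read rate check from userland (Unix-only: os.pread).
    import os, random, time

    PATH = "san_testfile.bin"          # placeholder: a file on the storage under test
    FILE_SIZE = 256 * 1024 * 1024      # 256 MiB throwaway test file
    BLOCK = 4096                       # 4 KiB reads, the usual unit IOPS is quoted in
    READS = 2000

    # Create the test file once if it doesn't already exist.
    if not os.path.exists(PATH):
        with open(PATH, "wb") as f:
            f.write(os.urandom(FILE_SIZE))

    fd = os.open(PATH, os.O_RDONLY)
    start = time.perf_counter()
    for _ in range(READS):
        os.pread(fd, BLOCK, random.randrange(0, FILE_SIZE - BLOCK))
    elapsed = time.perf_counter() - start
    os.close(fd)

    print(f"~{READS / elapsed:.0f} random 4K reads/sec (cache-skewed upper bound)")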


So you have big data. Now you just need the big hardware /s


A cynical person might point out that for empire-building managers the goals are :

1) minimize hardware spend at all costs

2) maximize lost time (if spent under the manager's control)


I don’t know why these ‘you don’t have big data’ type articles bother me so much, but they really do. I know it isn’t saying NO ONE has big data, but I feel defensive anyway. Some of us DO work for companies that need to process millions of log lines a second. I know the article is not for me, but it still feels like a dismissal of our real, actual, big data problem.


These articles are indeed not targeted at you, but at the thousands of companies trying to set up a massive architecture to process 3GB of data per year. All these big data solutions are still necessary, of course; they're just not for everyone.


Agreed, they are tools designed to fill a niche. This doesn't make them bad or unnecessary, just specialized. A spanner drive screw isn't worse than a hex drive, it's just designed for a different, more specific use case.

Really this just goes to show how impressive RDBMSs like Postgres are. There's nothing out there that's drastically better in the general case, so alternative database systems tend to just nibble around the edges.

My rule of thumb is always try to implement it using a relational model first, and only when that proves itself untenable, look into other more specialized tools.


Yeah but the whole cloud paradigm is predicated on big data, so large actors are pushing it where it makes no sense, and it makes everyone less effective. Not more.

Ever notice how with cloud it's actually in the cloud providers' interest for their clients to have the worst programmers and the worst possible solutions? Those maximize spend on cloud, which is what these companies are going for.

Of course, they hold up the massive carrot of "you can get a job here if you ...".


For every programmer like you there are ten more at places where I've been employed trying to claim they need big data tools for a few million (or in some cases a few hundred thousand) rows in MySQL. I get why you could feel attacked when this message is repeated so often, but apparently it still isn't repeated anywhere near enough.


"Relational databases don't scale".

I hope the NoSQL hype is over by now and people are back to choosing relational as the default choice. (The same people are probably chasing blockchain solutions to everything by now...)


Why should relational be the default choice? There are many cases where people are storing non-relational data and a nosql database can be the right solution regardless of whether scaling is the constraint.

Most nosql SaaS platforms are significantly easier to consume than the average service oriented RDBMS platform. If all the DBMS is doing is handling simple CRUD transactions, there's a good chance that relational databases are overkill for the workload and could even be harmful to the delivery process.

The key is to take the time to truly understand each workload before just assuming that one's preferred data storage solution is the right way to go.


What's non relational data?

You can have minimally relational data such as URL : website, but that's still improved by going URL : ID, ID : website, because you can insert those IDs into the website data.

Now plenty of DBs have terrible designs, but I have yet to hear of actually non-relational data.


That's fair and I'll concede that my terminology is incorrect. I suppose I'm really considering data for which the benefits of normalization are outweighed by the benefits that are offered by database management systems that do not fall under the standard relational model (some of those benefits being lack of schema definition/enforcement* and the availability of fully managed SaaS database management systems).

I'm also approaching this from the perspective of someone who spends more time in the ops world than in development. I won't argue that NoSQL would ever "outperform" in the realms of data science and theory, but I question whether a business is going to see more positive impact from finding the perfect normal form for their data or from having more flexibility in the ability to deliver new features in their application.

* I'm fully aware that this can be as much of a curse as a blessing depending on the data and the architecture of the application, which reinforces understanding the data and the workload as a significant requirement.


Wait, why is that an improvement? If there is a 1 to 1 mapping of url to website, splitting it into two tables is just bad database design.


Because there are URLs in the website data, and/or you want to do something with them. Looking up integers is also much faster than looking up strings. And as I said, you can replace URLs in the data with IDs, saving space.

But there are plenty of other ways to slice and dice that data; for example, a URL is really a protocol, domain name, port, path, parameters, etc. So it's a question of how you want to use it.

PS: Using a flat table structure (ID, URL, Data) with indexes on URL and ID is really going to be 2 or 3 tables behind the scenes anyway, depending on the type of indexes used.
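A toy sketch of both designs side by side, just to make the point concrete (SQLite, with invented table and column names):

    # Flat vs. normalized: same data, two layouts.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Flat design: one table, secondary index on the URL string.
    cur.execute("CREATE TABLE pages_flat (id INTEGER PRIMARY KEY, url TEXT, body TEXT)")
    cur.execute("CREATE INDEX idx_pages_flat_url ON pages_flat(url)")

    # Normalized design: URLs get integer ids, and other tables (or the page
    # body itself) can reference the cheap integer instead of repeating the string.
    cur.execute("CREATE TABLE urls (id INTEGER PRIMARY KEY, url TEXT UNIQUE)")
    cur.execute("CREATE TABLE pages (url_id INTEGER REFERENCES urls(id), body TEXT)")
    cur.execute("CREATE TABLE links (from_id INTEGER, to_id INTEGER)")  # URL-to-URL edges as ints

    cur.execute("INSERT INTO urls (url) VALUES (?)", ("https://example.com/a",))
    cur.execute("INSERT INTO pages VALUES (?, ?)", (cur.lastrowid, "<html>...</html>"))
    conn.commit()

    # Joins compare small integers rather than long strings.
    print(cur.execute(
        "SELECT u.url, p.body FROM pages p JOIN urls u ON u.id = p.url_id"
    ).fetchall())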


> The key is to take the time to truly understand each workload before just assuming that one's preferred data storage solution is the right way to go.

Although that is true in principle, in reality it results in the messes I see around me, where a small startup (but this often goes for larger corps too) has a plethora of tech running that it fundamentally does not need. If your team’s expertise is Laravel with MySQL, then even if some project might be a slightly better fit for node/mongo (does that happen?), I would still go for what you know over the better fit, as the latter will likely bite you later on. Unfortunately, people go for the more modern and (maybe) slightly better fit, and it does bite them later on.

For most CRUD stuff you can just take an ORM and it will handle everything as easily as NoSQL anyway. If your delivery and deployment process already include an RDBMS, it will be natural anyway and likely easier than anything NoSQL, unless it is something that is only a library and not a server.

Also, when in doubt, you should pick an RDBMS imho, not, as a lot of people do, a NoSQL store. A modern RDBMS is far more likely to fit whatever you will be doing, even if it appears to fit NoSQL better at first. All modern DBs have document/JSON storage built in or added on (plugin or ORM): you probably do not have the workload that requires the kind of scale-out NoSQL promises. If you do, then maybe it is a good fit; however, if you are conflicted, it probably is not.
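For instance, even plain SQLite handles document-ish data just fine (a sketch, assuming a build with the JSON1 functions available, which recent versions ship by default; the table and key names are made up):

    # Document-style payloads inside an ordinary relational table.
    import json
    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
    cur.execute(
        "INSERT INTO events (payload) VALUES (?)",
        (json.dumps({"type": "signup", "plan": "pro", "meta": {"utm": "hn"}}),),
    )
    conn.commit()

    # Query into the document without a separate document store.
    rows = cur.execute(
        "SELECT json_extract(payload, '$.plan') FROM events "
        "WHERE json_extract(payload, '$.type') = 'signup'"
    ).fetchall()
    print(rows)  # [('pro',)]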


> There are many cases where people are storing non-relational data

No, there are not. In 99% of applications, the data is able to be modeled relationally and a standard RDBMS is the best option. Non-relational data is the rare exception, not the rule.


Because it's tried and tested and has worked for decades.


Just curious what do you need to process so many log lines for? Is this an ELK type setup, and/or do you use the log processing for business logic?


I was working in a company where they brought in some consultants to do the analytics and they were going to use Hadoop, and they said straight up "We don't need it for the amount of data, but we prefer working in Hadoop"


It's far more common for your assumptions/thinking/design to be the root cause of any real problem in things you build than your choice of tech. It doesn't matter if you use the 'wrong' tech to solve a problem if the solution works effectively, and very often using something you know well means you'll solve the problem faster and better than you would with the 'right' tool. Those consultants were probably doing the right thing.


I actually don't think they had great expertise in Hadoop though, since it was quite new at the time. But maybe.


I remember a Microsoft talk about running big data in Excel...



