What device do you have in mind? I've seen places use 2TB RAM servers, and that was years ago, and it isn't even that expensive (can get those for about $5K or so).
Currently HP allows "up to 48 DIMM slots which support up to 6 TB for 2933 MT/s DDR4 HPE SmartMemory".
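(Presumably that's 48 slots × 128 GB DIMMs = 6,144 GB, i.e. the advertised 6 TB.)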
Close enough to fit the OS, the userland, and 6 TiB of data with some light compression.
>It seems like it would get a lot of swap thrashing if you had multiple processes operating on disorganized data.
Why would you have "disorganized data"? Or "multiple processes" for that matter? The OP mentions processing the data with something as simple as awk scripts.
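To make that concrete, here's a hypothetical sketch (the file name, tab-separated layout, key in column 1 and numeric value in column 3 are all made up for the example) of the kind of one-pass awk aggregation that covers a lot of "analysis":

    awk -F'\t' '{ sum[$1] += $3 } END { for (k in sum) print k "\t" sum[k] }' data.tsv

It streams the file once, and memory use scales with the number of distinct keys, not with the size of the file.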
Why would anyone stream 6 terabytes of data over the internet?
In 2010 the answer was: because we can’t fit that much data in a single computer, and we can’t get accounting or security to approve a $10k purchase order to build a local cluster, so we need to pay
Amazon the same amount every month to give our ever expanding DevOps team something to do with all their billable hours.
That may not be the case anymore, but our devops team is bigger than ever, and they still need something to do with their time.
I'm having flashbacks to some new outside-hire CEO making flim-flam about capex-vs-opex in order to justify sending business towards a contracting firm they happened to know.
>I mean if you're doing data science the data is not always organized and of course you would want multi-processing
Not necessarily - I might not want it or need it. It's a few TB, it can be on a fast HD, on an even faster SSD, or even in memory. I can crunch them quite fast even with basic linear scripts/tools.
And organized could just mean some massaging or just having them in csv format.
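For a hypothetical example of that kind of massaging (field positions and file names made up), turning a messy space-separated log into a clean csv is itself a one-liner:

    # assumed layout: timestamp in $1, user id in $7, bytes in $10; drop short/broken lines
    awk 'NF >= 10 { print $1 "," $7 "," $10 }' raw.log > clean.csv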
This is already the same rushed notion of "needing this" and "must have that" that the OP describes people jumping to, which leads them to suggest huge setups, distributed processing, and multi-machine infrastructure for use cases and data sizes that could fit on a single server with redundancy and be done with it.
DHH has often written about this for their Basecamp needs (scaling vertically where others scale horizontally, which has worked for them for most of their operation), and there's also this classic post: https://adamdrake.com/command-line-tools-can-be-235x-faster-...
>1 TB of memory is like 5 grand from a quick Google search then you probably need specialized motherboards.
Not that specialized, I've worked with server deployments (HP) with 1, 1.5 and 2TB RAM (and >100 cores), it's trivial to get.
And 5 or even 30 grand would still be cheaper (and more effective and simpler) than the "big data" setups some of those candidates have in mind.
I'm just trying to understand the parent to my original comment.
How would running awk for analysis on 6TB of data work quickly and efficiently?
They say it would go into memory, but it's not clear to me how that would work, as you'd still have paging and thrashing issues if the data didn't have often-used sections.
Am I overthinking it, and were they just referring to buying a big-ass RAM machine?
"I was now able to process a whole 5 terabyte batch in just a few hours."
>They say it would go into memory, but it's not clear to me how that would work, as you'd still have paging and thrashing issues if the data didn't have often-used sections
There's no need to have paging and thrashing issues if you can fit all (or even most) of your data in memory. And you can always also split, process, and aggregate partial results.
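A minimal sketch of that split/process/aggregate route with nothing but coreutils and awk (assuming the same kind of tab-separated data with a key in column 1 and a numeric value in column 3; note that split temporarily doubles the disk usage, GNU parallel's --pipepart avoids that if you have it installed):

    # shard the input into 16 line-aligned chunks
    split -n l/16 data.tsv chunk_
    # aggregate each chunk in parallel
    for f in chunk_*; do
        awk -F'\t' '{ s[$1] += $3 } END { for (k in s) print k "\t" s[k] }' "$f" > "$f.part" &
    done
    wait
    # merge the partial sums
    awk -F'\t' '{ s[$1] += $2 } END { for (k in s) print k "\t" s[k] }' chunk_*.part > totals.tsv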
>Am I overthinking it, and were they just referring to buying a big-ass RAM machine?
Yeah, they said one can buy a machine with several TB of memory.
6 TB does not fit in memory. However, with a good storage engine and fast storage this easily fits within the parameters of workloads that have memory-like performance. The main caveat is that if you are letting the kernel swap that for you, you are going to have a bad day; the paging needs to be done in user space to get that performance, which constrains your choices.
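Back-of-the-envelope, assuming a single NVMe drive at ~3 GB/s sequential reads: a full 6 TB scan takes about 2,000 seconds, roughly half an hour, and striping a few drives brings that down further. Sequential scans live in that memory-like regime; it's random access over a working set bigger than RAM that puts you back in swap-thrashing territory.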
It seems like it would get a lot of swap thrashing if you had multiple processes operating on disorganized data.
I'm not really a data scientist and I've never worked on data that size so I'm probably wrong.