What device do you have in mind? I've seen places use 2TB RAM servers, and that was years ago, and it isn't even that expensive (can get those for about $5K or so).
Currently HP allows "up to 48 DIMM slots which support up to 6 TB for 2933 MT/s DDR4 HPE SmartMemory".
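(Presumably that's 48 slots × 128 GB DIMMs = 6,144 GB, i.e. the advertised 6 TB.)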
Close enough to fit the OS, the userland, and 6 TiB of data with some light compression.
>It seems like it would get a lot of swap thrashing if you had multiple processes operating on disorganized data.
Why would you have "disorganized data"? Or "multiple processes" for that matter? The OP mentions processing the data with something as simple as awk scripts.
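To make that concrete, here's a hypothetical sketch (the file name, tab-separated layout, key in column 1 and numeric value in column 3 are all made up for the example) of the kind of one-pass awk aggregation that covers a lot of "analysis":

    awk -F'\t' '{ sum[$1] += $3 } END { for (k in sum) print k "\t" sum[k] }' data.tsv

It streams the file once, and memory use scales with the number of distinct keys, not with the size of the file.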
Why would anyone stream 6 terabytes of data over the internet?
In 2010 the answer was: because we can’t fit that much data in a single computer, and we can’t get accounting or security to approve a $10k purchase order to build a local cluster, so we need to pay
Amazon the same amount every month to give our ever expanding DevOps team something to do with all their billable hours.
That may not be the case anymore, but our devops team is bigger than ever, and they still need something to do with their time.
I'm having flashbacks to some new outside-hire CEO making flim-flam about capex-vs-opex in order to justify sending business towards a contracting firm they happened to know.
>I mean if you're doing data science the data is not always organized and of course you would want multi-processing
Not necessarily - I might not want it or need it. It's a few TB, it can be on a fast HD, on an even faster SSD, or even in memory. I can crunch them quite fast even with basic linear scripts/tools.
And organized could just mean some massaging or just having them in csv format.
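For a hypothetical example of that kind of massaging (field positions and file names made up), turning a messy space-separated log into a clean csv is itself a one-liner:

    # assumed layout: timestamp in $1, user id in $7, bytes in $10; drop short/broken lines
    awk 'NF >= 10 { print $1 "," $7 "," $10 }' raw.log > clean.csv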
This is already the same rushed notion of "needing this" and "must have that" that the OP describes people jumping to, which leads them to suggest huge setups, distributed processing, and multi-machine infrastructure for use cases and data sizes that could fit on a single server with redundancy and be done with it.
DHH has often written about this for their Basecamp needs (scaling vertically where others scale horizontally, which has worked for them for most of their operation), and there's also this classic post: https://adamdrake.com/command-line-tools-can-be-235x-faster-...
>1 TB of memory is like 5 grand from a quick Google search then you probably need specialized motherboards.
Not that specialized, I've worked with server deployments (HP) with 1, 1.5 and 2TB RAM (and >100 cores), it's trivial to get.
And 5 or even 30 grand would still be cheaper (and more effective and simpler) than the "big data" setups some of those candidates have in mind.
I'm just trying to understand the parent to my original comment.
How would running awk for analysis on 6TB of data work quickly and efficiently?
They say it would go into memory, but it's not clear to me how that would work, as you'd still have paging and thrashing issues if the data didn't have often-used sections.
Am I overthinking it, and were they just referring to buying a big-ass RAM machine?
"I was now able to process a whole 5 terabyte batch in just a few hours."
>They say it would go into memory, but it's not clear to me how that would work, as you'd still have paging and thrashing issues if the data didn't have often-used sections
There's no need to have paging and thrashing issues if you can fit all (or even most) of your data in memory. And you can always also split, process, and aggregate partial results.
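A minimal sketch of that split/process/aggregate route with nothing but coreutils and awk (assuming the same kind of tab-separated data with a key in column 1 and a numeric value in column 3; note that split temporarily doubles the disk usage, GNU parallel's --pipepart avoids that if you have it installed):

    # shard the input into 16 line-aligned chunks
    split -n l/16 data.tsv chunk_
    # aggregate each chunk in parallel
    for f in chunk_*; do
        awk -F'\t' '{ s[$1] += $3 } END { for (k in s) print k "\t" s[k] }' "$f" > "$f.part" &
    done
    wait
    # merge the partial sums
    awk -F'\t' '{ s[$1] += $2 } END { for (k in s) print k "\t" s[k] }' chunk_*.part > totals.tsv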
>Am I overthinking it, and were they just referring to buying a big-ass RAM machine?
Yeah, they said one can buy a machine with several TB of memory.
6 TB does not fit in memory. However, with a good storage engine and fast storage this easily fits within the parameters of workloads that have memory-like performance. The main caveat is that if you are letting the kernel swap that for you, you are going to have a bad day; the paging needs to be done in user space to get that performance, which constrains your choices.
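Back-of-the-envelope, assuming a single NVMe drive at ~3 GB/s sequential reads: a full 6 TB scan takes about 2,000 seconds, roughly half an hour, and striping a few drives brings that down further. Sequential scans live in that memory-like regime; it's random access over a working set bigger than RAM that puts you back in swap-thrashing territory.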
It seems like it would get a lot of swap thrashing if you had multiple processes operating on disorganized data.
I'm not really a data scientist and I've never worked on data that size so I'm probably wrong.