Raw RAM space is not the issue... it's indexing and structure of the data that makes it process-able.
If you just need to spin through the data once, there's no need to even put all of it in RAM - just stream it off disk and process it sequentially.
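For example, a single sequential pass in Python might look like this (file name and column layout are made up, just to show that memory use stays constant no matter how big the file is):

    total = 0.0
    count = 0
    with open("data.csv") as f:        # hypothetical newline-delimited file
        for line in f:                 # the file object streams lazily, line by line
            value = float(line.split(",")[1])   # assume column 1 holds a numeric field
            total += value
            count += 1
    print("mean:", total / count)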
If you need to join the data, filter, index, query it, you'll need a lot more RAM than your actual data. Database engines have their own overhead (system tables, query processors, query parameter caches, etc.)
And, this all assumes read-only. If you want to update that data, you'll need even more for extents, temp tables, index update operations, etc.
I often use RAM disks for production services. I'm sure that I shouldn't. It feels very lazy, and I know that I'm probably missing out on the various "right" ways to do things.
But it works so well and it's so easy. It's a really difficult habit to kick.
Obligatory joke: "Oh boy, virtual memory! Now I can have a really big RAM disk!"
Funny story: about 6 years ago we got an HP DL980 server with 1TB of memory to move off an Itanium HP-UX server. The test database was Oracle and about 600GB in size. We loaded the data, and they had some query test they would run; the first run took about 45 minutes (which was several hours faster than the HP-UX). They made changes, and all the rest of the runs took about 5 minutes for their test to complete. Finally someone asked me, and I manually dropped the buffers and cache, and it went back to about 45 minutes.
Their changes did not do anything; everything was getting cached. It was cool, but one needs to know what is happening with their data. I am just glad they asked before going to management saying their tests only took 5 minutes.
Those were some poorly built systems. I worked on probably 10 of them and they not-infrequently had... major issues. HP's support model was to send a tech out with 2 sticks of RAM and try them in different places to trace memory failures... across 4 (or 8?) cassettes, 64 sticks of RAM, and 20+ minute POST times.
We eventually had one server entirely replaced at HP's cost after yelling at them long enough, and that one never worked well enough to ever use in production, either. I'd say we had maybe a 70-80% success rate with those servers. They were beasts, though, with 4TB of RAM as I recall, and 288 cores.
Even more importantly, does your data have to fit in RAM?
There are tons of problems that need to process large data, but touch each item just once (or a few times). You can go a really long way by storing them in disk (or some cloud storage like S3) and writing a script to scan through them.
I know, pretty obvious, but somehow escapes many devs.
There's also the "not all memory is RAM" trick: plan ahead with enough swap to fit all the data you intend to process, and just pretend that you have enough RAM. Let the virtual memory subsystem worry about whether or not it fits in RAM. Whether this works well or horribly depends on your data layout and access patterns.
This is how mongodb originally managed all its data. It used memory-mapped files to store the data and let the underlying OS memory-management facilities do what they were designed to do. This saved the mongodb devs a ton of complexity in building their own custom cache and let them get to market much faster. The downside is that since virtual memory is shared between processes, other competing processes could potentially mess with your working set (pushing warm data out, etc.). The other downside is that since you're turning over the management of that “memory” to the OS, you lose the fine-grained control that can be used to optimize for your specific use case.
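A toy version of the same idea, using Python's mmap on a made-up file, so the OS pages data in and out on demand:

    import mmap

    # Map a (hypothetical) large file and let the OS decide what stays resident;
    # only the pages actually touched need to be in RAM.
    with open("big.bin", "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            header = mm[:16]                      # random access without a full read
            chunk = mm[10_000_000:10_000_256]     # pulls in just the pages it needs
            print(len(mm), header[:4], len(chunk))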
Except nowadays with Docker / Kubernetes you can safely assume the DB engine will be the only tenant of a given VM / pod / whatever, so I think it's better to let the OS do memory management than fight it.
Might not be exactly the same use case, but a simple example is compiling large libraries on constrained/embedded platforms. Building OpenCV on a Pi certainly used to require adding a gig of swap.
Escapes many devs? Really? I used to work with biologists who thought they needed to run their scripts on a supercomputer because the first line read their entire file into an array. But if I saw someone who calls themselves a "dev" doing this I'd consider them incompetent.
I once got into an argument with a senior technical interviewer because he wanted a quick solution of an in-memory sort of an unbounded set of log files.
Needless to say I wasn't recommended for the job, and it taught me a valuable lesson: if you don't first give them what they want, you can't give them what they actually need.
I've spent a lot of time writing Spark code, and its ability to store data in a column oriented format in RAM is the only reason why - disk is goddamned slow.
As soon as you're touching it more than once, sticking it in RAM upon reading makes everything much faster.
The lowest price floor per GB has been similar for the past decade: roughly $2.8/GB in 2012, 2016, and 2019. And all DRAM manufacturers have been enjoying a very profitable period.
And yet our data sizes continue to grow. We can fit more data in memory not because DRAM capacity has increased, but because we are simply adding memory channels.
Everyone knows that DRAM prices have been in a collapse since early this year, but last week DRAM prices hit a historic low point on the spot market. Based on data the Memory Guy collected from spot-price source InSpectrum, the lowest spot price per gigabyte for branded DRAM reached $2.59 last week.
You've selected out the low points on the graph: 2012, 2016, and 2019; most of the time DRAM has not been available at these prices. Now is definitely the time to load up on RAM.
And it is predicted to climb back up this year, due to manufacturers dropping wafer starts at a point in time when a large launch of next-gen consoles is drastically increasing consumption.
If it's the lowest massively-available price, then this:
> most of the time DRAM has not been available at these prices.
should make it not the floor. If the floor doesn't have to be available, then at what exact point does it become relevant? Otherwise the price is simply misleading.
Any price that is massively-available becomes relevant and stays relevant forever.
A price has to be massively available at a point in time to matter. It doesn't have to be available forever to matter. It feels like you're conflating the two.
The price is on a downward trend, but there are hitches and setbacks. One fair way to measure it is to use some kind of average. Another also-fair way to measure it is to go by the lowest "real" price, where "real" means you can buy something like a million sticks on the open market.
When we're talking about whether we should be impressed by a price, using the lowest historical price for comparison makes sense.
(And just to be absolutely clear, you would need to adjust the metric for a product that goes up in price over time. But for something on a downward trend, this metric works fine.)
Buying the RAM is probably cheaper than buying database licenses, and/or the machines to run them, and/or the time you spend making things fast enough when using them.
If it was just a matter of adding more memory people would. A few thousand, or tens of thousands of dollars aren’t much to organisations that have that much data.
The trouble is that there are limits to how much memory you can fit on a motherboard.
And how much you can afford. My partner's scientific research group recently upgraded their shared server used for analysis ... to one with 32 GB of RAM and 8 TB of storage. I think they could have done better, personally, but it's telling how thin budgets are stretched in a lot of real world cases.
Legit question: I have a dataset that's a terabyte in size spread over multiple tables, but my queries often involve complex self-joins and filters; for various reasons, I'd prefer to be able to write my queries in SQL (or Spark code) because it's the most expressive system I've seen. What tool should I use to load this dataset into RAM and run these queries?
There are a few steps to consider before you start loading data into RAM.
Can you partition the data in any useful way? For example if queries use separate ranges of dates, then you can partition data so that queries only need to touch the relevant date range. Can you pre-process any computations? Sometimes tricky things done within the context of multiple joins can be done once and written to a table for later use. Can you materialize any views? Do you have the proper indexes set up for your joins and filters? Are you looking at execution plans for your queries? Sometimes small changes can speed up queries by many orders of magnitude.
Smart queries + properly structured data + a well tuned postgres DB is an incredibly powerful tool.
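For illustration, a rough psycopg2 sketch of those steps (partitioning, materialized views, indexes, execution plans) -- connection string, table and column names are made up, it's the shape that matters:

    import psycopg2

    conn = psycopg2.connect("dbname=analytics")
    cur = conn.cursor()

    # Partition by date range so queries over one range scan only that partition.
    cur.execute("""
        CREATE TABLE events (
            account_id bigint,
            created_at timestamptz,
            amount numeric
        ) PARTITION BY RANGE (created_at);
    """)
    cur.execute("""
        CREATE TABLE events_2019 PARTITION OF events
            FOR VALUES FROM ('2019-01-01') TO ('2020-01-01');
    """)

    # Materialize an expensive aggregation once instead of recomputing it per query.
    cur.execute("""
        CREATE MATERIALIZED VIEW daily_totals AS
        SELECT account_id, date_trunc('day', created_at) AS day, sum(amount) AS total
        FROM events
        GROUP BY 1, 2;
    """)

    # Index the columns your joins and filters actually use.
    cur.execute("CREATE INDEX ON daily_totals (account_id, day);")

    # Always look at the plan; small changes can be worth orders of magnitude.
    cur.execute("EXPLAIN ANALYZE SELECT total FROM daily_totals WHERE account_id = 42;")
    for (line,) in cur.fetchall():
        print(line)

    conn.commit()
    cur.close()
    conn.close()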
Exactly this. DBs are really good at utilizing all the memory you give them. The query planners might give you some fits when they try and use disk tables for complicated joins, but you can work around them.
Both mysql and pgsql bypass the page cache if they can and maintain their own page caches. You have to do this, otherwise you’re double caching! That is, you’d have your own page cache, which you need to manage calls to read() and to know when to flush pages, while the OS would also have the same pages in its own cache.
(mongodb I believe uses direct mmap access instead of a pagecache, and lmdb does this as well)
You can get a box with 4TB of ram on EC2 for $4/hr spot, so copy your data into /dev/shm and go hog wild.
For lots of databases, most of their time is spent locking and copying data around, so depending on your workload you might get significant speedups in Pandas/Numpy if it's just you doing manipulations, and there are multicore just-in-time compilers for lots of Pandas/Numpy operations (like Numba/Dask/etc).
If you have lots of weird merging criteria and want the flexibility of SQL I'd say use a modern Postgresql with multicore selects on that 4TB box.
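Something like this, as a sketch -- the /dev/shm path, column name, and aggregation are all invented, it's just to show the pandas + Numba shape of it:

    import pandas as pd
    from numba import njit, prange

    # Hypothetical: the dataset was copied onto the RAM disk (/dev/shm) beforehand.
    df = pd.read_parquet("/dev/shm/trades.parquet")

    @njit(parallel=True)
    def clipped_sum(values, cap):
        # JIT-compiled, multicore loop: no per-row Python overhead or copying.
        total = 0.0
        for i in prange(values.shape[0]):
            total += min(values[i], cap)
        return total

    print(clipped_sum(df["amount"].to_numpy(), 10_000.0))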
Currently I have it loaded on Redshift with as much optimization as possible, and the queries are far more analytical than end-user-like (often having to self-join on the same dataset). This works okay, but doesn't scale with more than a handful of users at a time. I'll probably run some tests with the Postgres suggestion, but I'm curious whether this is still a better alternative or not.
[Disclaimer: I worked on BigQuery a couple lives ago]
I'd give Google BigQuery a shot. Should work fast [seconds] and scale seamlessly to [much] larger datasets and [many] more users. For a 1 TB dataset, I have a hard time imagining crafting a slow query. Maybe something outlandish like 1000[00?] joins. They also have an in-memory "BI Engine" offering, alas limited to 50GB max.
On premise, there is Tableau Data Engine. I don't think they offer a SQL interface, you have to buy into the entire ecosystem.
Long shot: I've been working on "most expressive query system over multiple tables" as an offshoot of some recent NLP work. Your use case piqued my interest. I'd love to help / understand it better. My contact is in my profile.
Every major database will load the hottest data into RAM, where the scope of "hottest" broadens to whatever amount will fit in RAM. A small percentage of them require you to configure how much RAM they can use for this cache.
Putting the data on a ramdisk just becomes entirely redundant because it's still going to create a second memory cache that it uses.
Many operations teams do local cache warming by running the common queries over the database before it is brought online for processing. As a secondary note, people often under-estimate the size of their data because they don't account for all of the keys, indexes and relationships that would also be memory-cached in an ideal situation.
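A minimal sketch of that warm-up step, with a made-up connection string and queries:

    import psycopg2

    # Replay the common queries once so their data and index pages are hot before cutover.
    WARMUP_QUERIES = [
        "SELECT count(*) FROM orders WHERE created_at > now() - interval '7 days'",
        "SELECT * FROM customers WHERE region = 'EU' LIMIT 10000",
    ]

    conn = psycopg2.connect("dbname=prod_replica")
    with conn, conn.cursor() as cur:
        for q in WARMUP_QUERIES:
            cur.execute(q)
            cur.fetchall()   # force the rows (and the index pages behind them) to be read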
You don't need to store everything into RAM to get fast results. Data warehouse relational databases are designed exactly for this kind of fast SQL analysis over extremely large datasets. They use a variety of techniques like vectorized processing on compressed columnar storage to get you quick results.
Google's BigQuery, AWS Redshift, Snowflake (are all hosted), or MemSQL, Clickhouse (to run yourself). Other options include Greenplum, Vertica, Actian, YellowBrick, or even GPU-powered systems like MapD, Kinetica, and Sqream.
I recommend BigQuery for no-ops hosted version or MemSQL if you want a local install.
All of those support datasets that don't fit in RAM, or else they would be useless at data warehousing.
MemSQL uses rowstores in memory combined with columnstores on disk. Both can be joined together seamlessly and the latest release will automatically choose and transition the table type for you as data size and access patterns change.
This is what most people in my org do, but it's orders of magnitude slower than running queries on the same dataset on Redshift with optimized presorting and distribution. Redshift doesn't scale for tens or hundreds of parallel users though, so I'm looking for options.
I'm about 75% joking: restore a new cluster from a snapshot for horizontal scaling. They did launch a feature along those lines, because my joke suggestion probably doesn't scale organizationally: take a look at Concurrency Scaling. This is also a fundamental feature of Snowflake with the separation of compute and storage.
Exactly the same game plan for us now: evaluating Snowflake, but I wanted to check if there's a fundamentally different paradigm that can be much faster and more scalable.
I'll be That Guy and ask: what are you doing with the data, and can you change your processing or analysis to reduce the amount of data you need to touch?
In my experience, it's nearly always the case that pulling in all data is not necessary, and that thinking through your goals, data, and processing can often reduce both the amount of data touched and the processing run on it massively. Look up the article on why GNU grep is so fast for a bunch of general tricks that can be employed, many of which may apply to data processing generally.
Otherwise:
1. Random sampling. The Law of Large Numbers applies and affords copious advantages. There are few problems a sample of 100 - 1,000 cannot offer immense insights on, and even if you need to rely on larger samples for more detailed results, these can guide further analysis at greatly reduced computational cost.
2. Stratified sampling. When you need to include exemplars of various groups, some not highly prevalent within the data.
3. Subset your data. Divide by regions, groups, accounts, corporate divisions, demographic classifications, time blocks (day, week, month, quarter, year, ...), etc. Process chunks at a time.
4. Precompute summary / period data. Computing max, min, mean, standard deviation, and a set of percentiles for data attributes (individuals, groups, age quintiles or deciles, geocoded regions, time series), and then operating on the summarised data, can be tremendously useful. Consider data as an RRD rather than a comprehensive set (may apply to time series or other entities).
Creating a set of temporary or analytic datasets / tables can be tremendously useful, as much fun as it is to write a single soup-to-nuts SQL query.
5. Linear scans typically beat random scans. If you can seek sequentially through data rather than mix-and-match, so much the better. With SSD this advantage falls markedly, but isn't completely erased. For fusion type drives (hybrid SSD/HDD) there can still be marked advantages.
6. Indexes and sorts. The rule of thumb I'd grown up with in OLAP was that indexes work when you're accessing up to 10% of a dataset, otherwise a sort might be preferred. Remember that sorts are exceedingly expensive.
If at all possible, subset or narrow (see below) data BEFORE sorting.
7. Hash lookups. If one table fits into RAM, then construct a hash table using that (all the better if your tools support this natively -- hand-rolling hashing algorithms is possible, but tedious), and use that to process larger table(s). A minimal sketch of this, combined with sampling and narrowing, follows this comment.
8. "Narrow" the data. Select only the fields you need. Most especially, write only the fields you need. In SQL this is as simple as a "SELECT <fieldlist> FROM <table>" rather than "SELECT * FROM <table>". There are times you can also reduce total data throughput by recoding long fields: say, geocoded place names -- there are only a few thousand place names in the US, per Census TIGER data, versus the names themselves, which may run to 22 characters ("Truth or Consequences", in NM), or even longer for international placenames. You'll need a tool to remap those later. For statistical analysis, converting to analysis variables may be necessary regardless.
The number of times I've seen people dragging all fields through extensive data is ... many.
Some of this can be performed in SQL, some of it wants a more data-related language (SAS DATA Step and awk are both largely equivalent here).
Otherwise: understanding your platform's storage, memory, and virtual memory subsystems can be useful. Even as simple a practice as running "cat mydatafile > /dev/null" can often speed up subsequent processing.
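A tiny Python sketch combining points 1, 7 and 8 above (random sampling, a hash lookup built from a small table, and narrowing to only the fields you need) -- file names and columns are invented:

    import csv
    import random

    # Build a hash lookup from the small table that fits in RAM (point 7)...
    with open("placenames.csv", newline="") as f:
        place_by_code = {row["code"]: row["name"] for row in csv.DictReader(f)}

    # ...then stream the big file once, keeping only the needed fields (point 8)
    # and taking a ~1% random sample along the way (point 1).
    sample = []
    with open("big_fact_table.csv", newline="") as f:
        for row in csv.DictReader(f):
            if random.random() < 0.01:
                sample.append((row["account_id"],
                               float(row["amount"]),
                               place_by_code.get(row["place_code"], "?")))

    print(len(sample), "sampled, narrowed records")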
Yes, but SQLite is probably the least expressive SQL dialect there is. If you're choosing SQL because of its expressiveness, you probably aren't thinking of a dialect with only 5 types (including NULL).
I haven't tried this and don't know if it would work -- but depending on the shape of your data and queries, you might not need certain indices. That is, for some workloads (especially if you're thinking of spot instances), it might be overall faster to skip the indexing and allow the query to do a full table scan. It sounds like maybe you never tried the query without the index, so I'm curious to know if there's any weight behind this theory.
I am not a big-data guy but wouldn't it be along the lines of A) get a big honking server B) fire up "X" SQL server C) Allocate 95-98% of the RAM to DB cache?
A single terabyte is a few magnitudes from what you need big-data-anything for. You could probably work with that just fine on your average 64GB ram desktop with an SSD.
Another poster already replied with a decent refutation of this claim, but a single pass over a TB of data is often not enough for 'big data' use cases, and at tens of minutes per pass, it may very well be infeasible to operate on such a dataset with only 64GB of memory.
In the machine learning world, some of the algorithms that are industrial workhorses will require you to have your dataset in memory (ie: all the common GBM libraries), and will walk over it lots of times.
You may be able to perform some gymnastics and allow the OS to swap your terabyte+ dataset around inside your 64GB of RAM, but the algorithms are now going to take forever to complete as you thrash your swap constantly while the training algorithm is running.
tl;dr - a terabyte dataset in the machine learning context may very well need that much RAM plus some overhead in terms of memory available to be able to train a model on the dataset.
You'd probably be surprised. For reads, there are tons of drives that will saturate PCIe 3.0 x4 with 4kB random reads. Throughput is a bit lower because of more overhead from smaller commands, but still several GB/s. Fragmentation won't appreciably slow you down any further, as long as you keep feeding the drive a reasonably large queue of requests (so you do need your software to be working with a decent degree of parallelism).
What will cause you serious and unavoidable trouble is if you cannot structure things to have any spatial locality. If you only want one 64-bit value out of the 4kB block you've fetched, and you'll come back later another 511 times to fetch the other 64-bit values in that block, then your performance deficit relative to DRAM will be greatly amplified (because your DRAM fetches would be 64B cachelines fetched 8x each, instead of 4kB blocks fetched 512x each).
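Back-of-the-envelope, the amplification works out like this:

    block = 4096          # bytes per NVMe fetch
    cacheline = 64        # bytes per DRAM fetch
    value = 8             # the one 64-bit value actually wanted

    ssd_amplification = block // value        # 512x: 4kB read per 8B used
    dram_amplification = cacheline // value   # 8x: 64B read per 8B used
    print(ssd_amplification, dram_amplification, ssd_amplification // dram_amplification)
    # -> 512 8 64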
Option A is the best IMO; I worked on many SQL DBs where the rule was to fit it into RAM. Option C will bite you in the ass eventually. The kernel and your other processes need some space to malloc, and you don't want to page in/out.
Having 4/16TiB servers, or "memory DB servers" as I thought of them, solved a lot of problems outright. You still need huge I/O, but less of it depending on your workload.
I'm pretty sure that was supposed to be a list of steps, not a list of options.
> The kernel and your other processes need some space to malloc, and you don't want to page in/out.
Some space, like "most of 20-50 gigabytes"?
You want to take into account how exactly the space used by joins will fit into memory, but 2-5% of a terabyte is an extremely generous allocation for everything else on the box.
I remember back in the early 2010s that a large selling point of SAS (besides the ridiculous point that R/Python were freeware and therefore could not be trusted on important projects) was that it could chew through large data sets that perhaps couldn't be moved into RAM (but maybe it takes a week or whatever...).
This was a fairly salient point, and I remember circa 2012/2013 struggling to fit large bioinformatics data into an older iMac with base R.
SAS Institute have long claimed this. It's been provably bullshit for decades.
In practice, an awk script frequently ran circles around SAS processing. On a direct basis, awk corresponds quite closely to the SAS DATA Step (and was intended to be paired with tools such as S, the precursor to R, for similar types of processing).
The fact that awk had associative arrays (which SAS long lacked, it's since ... come up with something along those lines) and could perform extremely rapid sort-merge or pattern matches (equivalent to SAS data formats, which internally utilise a b-tree structure) helped.
With awk, sort, uniq, and a few hand-rolled statistics awk libraries / scripts, you can replace much of the functionality of SAS. And that's without even touching R or gnuplot, each of which offers further vast capabilities.
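Not awk, but the same shape of streaming group-by sketched in Python, with an invented tab-separated record layout (key in column 0, numeric value in column 2):

    import sys
    from collections import defaultdict

    # One streaming pass, associative arrays for the aggregation, summary at the end.
    count = defaultdict(int)
    total = defaultdict(float)
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        count[fields[0]] += 1
        total[fields[0]] += float(fields[2])
    for key in sorted(count):
        print(key, count[key], total[key] / count[key])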
The way things are today, the Hadoop cluster will be just a bit faster on that thing.
25 years ago I was suggesting clients upgrade from 4MB to 6-8MB, as it improved their experience with our business software; these days I've already suggested a couple of customers upgrade from 6TB and 8TB respectively ... as it would improve their experience with our business software. What's funny is that customer experience with business software back then was better than today.
Yeah, except Hadoop provides redundancy, easier backups, turnkey solutions for governance and compliance and easier scaling
I’m no fan of distributed systems that sit idle at 2% utilisation when a single node would do: BUT, reducing it down to “cost” and “does it fit in RAM” is way too reductive.
And reducing it to a single machine means I have an order of magnitude less time spent on setting it up, maintaining it, adapting all my code, debugging, etc.
These days I’m firmly of the opinion that if you can make it run on a single machine, you absolutely should, and you don’t get more machines until you can prove you can fully utilise one machine.
If you store data in something like AWS S3, "redundancy" and "backup" is handled by AWS. And scalability is a moot point if a single box can handle your load.
When a single box cannot handle your load, you start one more, and then you start worrying about scalability.
(Of course, YMMV, I guess there are cases where "one box" -> "two boxes" requires an enormous quantum jump, like transactional DB. For everything else, there's a box.)
What about services where background load causes unacceptable latencies due to other bottlenecks on that single machine? What if you are IO bound and you couldn't possibly use all of the CPU or memory on any system which was given to you?
Waiting for a machine to be “fully utilised” before scaling just shows your lack of experience at systems engineering.
Do you know how quickly disks fail if you force them at 100% utilisation, 24/7?
Then what happens when this system dies? How much downtime do you have because you have to replace the hardware and then get your hundreds-of-gigabytes dataset back in RAM and hot again?
I’ve worked as a lead developer at companies where I’ve been personally responsible for hundreds of thousands of machines, and running a node to 100% and THEN thinking about scaling is short sighted and stupid
I don't mean "let your production systems spool up to point where you're maxing out a single machine" - that would be exceedingly silly.
I mean "when you've proven that the application you've written can fully, or near fully, utilise the available power on a single machine, and that when running production-grade workloads, it actually does so, then you may scale to additional machines".
What this means is not getting a 9-node Spark cluster to push a few hundred GB of data from S3 to a database because "it took too long to run in Python", when the script was single-threaded, non-async, and not performance-tuned.
> I mean "when you've proven that the application you've written can fully, or near fully utilise the available power on a single machine, and that when running production-grade workloads, actually does so, then you may scale to additional machines
How is that any different? You just backed off a tiny amount by saying “fully or near fully” - you still shouldn’t burden a single host to “fully or near Fully” because:
It puts more strain on the hardware and will cause it to fail a LOT faster
There’s no redundancy so when the system fails you’ll probably need hours or maybe days to replace physical hardware, restore from backup, verify restore integrity, and resume operations - which after all this work, will only put you in the same position again, waiting for the next failure
Single node systems make it difficult to canary deploy because a runaway bug can blow a node out - and you only have one.
Workload patterns are rarely a linear stream of homogeneous tiny events - a large memory allocation from a big query, or an unanticipated table scan, or any 5th-percentile-type difficult task can cause so much system contention on a single node that your operations effectively stop.
What about edge cases in kernels and network drivers - many times we have had frozen kernel modules, zombie processes, deadlocks and so on; again, with only one node, something as trivial as a reboot means halting operations.
There’s just so many reasons a single node is a bad idea, I’m having trouble listing them
> How is that any different? You just backed off a tiny amount by saying “fully or near fully” - you still shouldn’t burden a single host to “fully or near Fully”
You're missing the word "can". It's a very important part of that sentence.
If your software can't even use 80% of one node, it has scaling problems that you need to address ASAP, and probably before throwing more nodes at it.
> It puts more strain on the hardware and will cause it to fail a LOT faster
Unless you're hammering an SSD, how does that happen? CPU and RAM should be at a pretty stable amount of watts anywhere from 'moderate' load and up, which doesn't strain anything or overheat.
I’m not against redundancy/HA in production systems, I’m opposing clusters of machines to perform data workloads that could more efficiently handled by single machines. Also note here that I’m talking about data science and machine learning workloads, where node failure simply means the job isn’t marked as done, a replacement machine gets provisioned and we resume/restart.
I’m not suggesting running your web servers and main databases on a single machine.
But that Hadoop cluster will also work just fine if the data set does not fit in RAM. And I've never met a data set that didn't expand over time. And with Spark you get a robust, scalable and industry standard way of distributing work amongst the nodes.
Also, did you know that Hadoop is open source? So that $250K is purely for hardware.
If that allows you to simplify your architecture so that stuff just runs on a single machine instead of needing to develop, debug and maintain a distributed solution, then you save much more money in engineer salaries than this.
You know what else costs? Humongous amounts of servers to run silly stuff to orchestrate other silly stuff to autoscale yet more silly stuff to do stuff on your stuff that could fit into memory and be processed on a single server (+ backup, of course).
Add to that small army of people, because, you know, you need specialists of variety of professions just to debug all integration issues between all those components that WOULD NOT BE NEEDED if you just decided to put your stuff in memory.
Frankly, the proportion of projects that really need to work on data that could not fit in memory of a single machine is very low. I work for one of the largest banks in the world processing most of its trades from all over the world and guess what, all of it fits in RAM.
There's a world outside of web apps and SV tech companies. There's a lot of big datasets out there, most of which never hit the cloud at all.
Story time: I worked on one project where a single (large) building's internal sensor data (HVAC, motion, etc. 100k sensors) would fill a 40TB array every year. They had a 20 year retention policy. So Dell would just add a new server + array every year.
I worked with another company that had 2000 oracle servers in some sort of franken-cluster config. Reports took 1 week to run and they had pricing data for their industry (they were a transaction middleman) for almost 40 years. I can't even guess the data size because nobody could figure it out.
This is not a FAANG problem. This is an average SME to large enterprise problem. Yeah, startups don't have much data. Most companies out there aren't startups.
By the way, memory isn't the only solution. In the past 15 years, I've rarely worked on projects where everything was in memory. Disks work just fine with good database technology.
> Story time: I worked on one project where a single (large) building's internal sensor data (HVAC, motion, etc. 100k sensors) would fill a 40TB array every year. They had a 20 year retention policy. So Dell would just add a new server + array every year.
That's a lot of data, but what do you even do with it other than take minuscule slices or calculate statistics?
And for those uses, I'd put whether it fits in RAM as not applicable. It doesn't, but can you even tell the difference?
They paid us $600K every six months to analyze the data and suggest adjustments to their control systems (it's called continuous commissioning, but it's not really continuous due to laws in many places about requiring a person in the loop on controls). They saved millions of dollars every year doing this, because large, complex buildings drift out of optimized airflow and electricity use very quickly.
Agreed that 20 year retention is silly. We thought it was silly, but the policies reflected the need for historical analysis for audit purposes.
It does in fact matter what you can fit in RAM though. We had to adapt all our systems to a janky SQL Server setup that was horrible for time series data and make our software run on those servers. RAM availability for working sets was a huge bottleneck (hence the cost of analysis).
This is another problem. Is really 20 year retention policy necessary for ALL sensor data? Can it be somehow aggregated and only then the aggregated data to be subject to retention policy? Can the retention policy be made to make it possible to lose some fidelity gradually (the way RRDtool is used by Nagios, for example)?
Yes your company's data may fit in RAM. But does every intermediate data set also fit in RAM ? Because I've also worked at a bank and we had thousands of complex ETLs often needing tens to hundreds of intermediate sets along the way. There is no AWS server that can keep all of that inflight at one time.
And what about your Data Analysts/Scientists. Can all of their random data sets reside in RAM on the same server too ?
$100K has always been "cheap" for a "business computer" and today you can get more computer for that money than ever.
$100K of hardware (per year or so) is small-fry compared to almost every other R&D industry out there. Just compare with the cost of debuggers, oscilloscopes and EMC labs for electronic engineers.
Never said EVERY data set fits in RAM, but that doesn't mean MOST of them don't.
There is a trend, when the application is inefficient, to spend huge amount of resources on scaling it instead of making the application more efficient.
There are a number of cloud database solutions that are very easy to manage and not all that expensive. For example I work for Snowflake and our product doesn't need a small army of people to babysit it.
I mean, I also prefer doing things on a single machine, but if that machine gets expensive enough, or writing a program that can actually use all that power gets too difficult, why not switch to a cloud database?
This is about saving you money, so that's not the right fine print.
Look at it this way: This is for the person that's already going to get enough ram sticks to fit the entire data set or multiple of it, across many machines, and deal with the enormous overhead from doing queries across many machines. The revelation is that you can fit that much ram inside a single machine for a much cheaper and faster experience.
For the cloud options, I have chosen to only select virtual instances that can be spun up on demand. The high-memory instances you link to are purpose-built.
You can fit 48 TB in an HPE MC990 X, though I'm pretty sure that's got one of those NUMA architectures that SGI had with the UV 3000 or whatever.
I remember jokingly telling my team to spend the millions of dollars we did expanding our clusters with one of these and just processing in RAM. I honestly don't think I did the analysis to make sure it would be actually better.
It was 'jokingly' because we couldn't afford three of these machines anyway. The clusters had the property that we could lose some large fraction of nodes and still operate, we could expand slightly sub-linearly, etc. which are all lovely properties.
It would have been neat, though. Ahhhh imagine the luxury of just loading things into memory and crunching the whole thing in minutes instead of hours. Gives me shivers.
No, but I frequently see people implying that you can do your data science in Python and R as long as you can fit the data in RAM. As you mention, it's not RAM that's the limiting factor for larger data volumes, it's finding tools that exploit parallelism.
Yes, that's my point. It's too simplistic to say "well, the data fits in RAM", you have to add parallelism to make the workload tolerable. In the past, some people have done that using MapReduce or Spark, GNU parallel or just writing parallel code in their favorite language. But RAM by itself isn't the only limiting factor to whether a problem is solvable in a reasonable amount of time.
Telling me that my (example) 24TB data set fits into RAM, because it fits into 24TB on an AWS instance designed for SAP that's so expensive its price is on inquiry, isn't overly helpful.
May I suggest that a) they consider applications need RAM too and b) that if the price is POI, it's probably better to mention "But you know, 188 m5.8xlarges might be cheaper"
The question of whether my storage mechanism fits my budget is more relevant, no? It costs less than 4 cents per hour to store 1TB on S3, but an X1 with 1TB of RAM costs $13 or so per hour on AWS. So the issue is whether what you pay for is worth the result you're computing.
I've no idea if the intended audience here would ever run their workloads on Solaris, but like the IBM POWER systems, Oracle & Fujitsu SPARC servers also max out at 64TB of RAM. I didn't see those included here.
Pretty cool that single boxes have > 10 terabytes of RAM.
I would definitely imagine that most workloads rarely need more than a few to a few hundred TB in memory, since you may have petabytes of data but you probably touch very little of it.
One of the best things you can often do for latency/throughput improvement is to just mlock() the data. Yes, it fits. Get a more expensive machine and 5x the throughput in a single day with a config change.
Sure, my data fits in RAM if you take one of those example machines, which we have, and which are currently hosting dozens of production VMs, and instead dedicate the whole thing to my database. I'd love that, but it's never going to happen.
The original Twitter thread is funny. This site doesn't add anything. It looks like SEO more than anything, and now that HN has linked to it, it's been successful.
1. If you have to ask, then either it doesn't now, or it doesn't sometimes. So assume it doesn't.
2. If you can use a cluster, then maybe.
3. In some senses, it doesn't matter. How so? Reading and writing from RAM is very slow, latency-wise, for today's processors. If I can bend the truth a little, it's a bit like a fast SSD. So, if you can up the bandwidth to disk enough, it becomes kind of comparable. Well, if you can use 16 PCIe 4.0 lanes, it's roughly 24 GB/sec effective bandwidth, which is roughly half of your memory bandwidth (a quick back-of-the-envelope check follows this list). Now it's true that in real-life systems it's usually just 4 lanes, but it's very doable to change that with a nice card.
4. DIMM-form-factor non-volatile memory may increase memory sizes much more.
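FWIW, a quick sanity check of the numbers in point 3, assuming dual-channel DDR4-3200 on the memory side:

    pcie4_lane = 16e9 * 128 / 130 / 8        # 16 GT/s, 128b/130b encoding: ~1.97 GB/s per lane
    pcie4_x16 = 16 * pcie4_lane              # ~31.5 GB/s theoretical for 16 lanes
    pcie4_x16_effective = 24e9               # ~24 GB/s after protocol overhead, per the comment
    ddr4_3200_dual = 2 * 3200e6 * 8          # ~51.2 GB/s peak, dual-channel DDR4-3200 (assumed)
    print(pcie4_x16 / 1e9, pcie4_x16_effective / ddr4_3200_dual)   # ~31.5 GB/s, ~0.47 of RAM bandwidth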