You do realize you have to query an index of all of that data for every single q...

dbthrowfu · on May 5, 2023

Consumer hardware can still handle that with 1TB RAM + ThreadRipper Pro.

> You do realize you have to query an index of all of that data for every single query your user makes, right? Computing that index is not entirely trivial, nor is the operation of partitioning the data so it fits in RAM across a pool of nodes.

I don't know what any of this means -- and it sounds like you're slapping a bunch of terminology together, rather than communicating a well-thought-out idea.

Yes, in the general case you're going to have to use an index. Computing an index or a key to that index? Computing the index is a solved problem that does not have a hard real-time component -- you can do it outside of normal query execution. Computing the key to the index on each query is also a solved problem.

Store the dimensions in columnar format, generate a sparse primary index on those columns, and then use binary search to quickly find the blocks of interest, which you scan sequentially with a distance function. Or you could even just use regular old SS-trees, SR-trees, or M-trees for high-dimensional indexing -- they're not expensive to use at all.
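A minimal sketch of the sparse-index idea for one sorted column, assuming fixed-size blocks and an in-memory list standing in for the on-disk column. The names `build_sparse_index`, `range_query`, and `block_size` are made up for illustration and are not from any particular database. In one dimension, "within distance r of q" is just the range [q - r, q + r], so a range query covers the distance-function case.

```python
import bisect


def build_sparse_index(sorted_values, block_size):
    """Keep only the first value of each block (hypothetical layout)."""
    return [sorted_values[i] for i in range(0, len(sorted_values), block_size)]


def range_query(sorted_values, sparse_index, block_size, lo, hi):
    """Return row positions whose value lies in [lo, hi]."""
    # Binary search the sparse index for the blocks that can hold matches.
    # The block just before the first index entry >= lo may still end with
    # values >= lo, so step back by one block.
    first_block = max(bisect.bisect_left(sparse_index, lo) - 1, 0)
    # Blocks whose first value is > hi cannot contain matches.
    last_block = bisect.bisect_right(sparse_index, hi)  # exclusive
    start = first_block * block_size
    end = min(last_block * block_size, len(sorted_values))
    # Sequential scan over the candidate blocks only.
    return [i for i in range(start, end) if lo <= sorted_values[i] <= hi]


if __name__ == "__main__":
    column = list(range(0, 100, 2))          # sorted toy column, 50 rows
    index = build_sparse_index(column, 8)    # 7 index entries instead of 50
    q, r = 24, 9                             # 1-D neighbours of q within r
    print(range_query(column, index, 8, q - r, q + r))
```

The index holds one entry per block, so it stays small enough to keep in memory even when the column itself does not.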

There, you can easily run a query on a single dimension (1 billion entries) in under a second. You want 300 dimensions? OK, parallelize it. 128 threads, easy. At most this will take 3 seconds if everything is configured properly (a big IF that few seem to get right).
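The 3-second figure follows from simple wave arithmetic, sketched below. This assumes one thread per dimension, one second per dimension as the upper bound, and perfect scaling with no memory-bandwidth contention; none of that is measured here, and `worst_case_seconds` is just an illustrative helper.

```python
import math


def worst_case_seconds(dimensions, threads, seconds_per_dimension):
    """Upper bound when each thread scans one dimension at a time."""
    # With more dimensions than threads, the work runs in waves.
    waves = math.ceil(dimensions / threads)
    return waves * seconds_per_dimension


if __name__ == "__main__":
    # 300 dimensions on 128 threads -> ceil(300 / 128) = 3 waves
    print(worst_case_seconds(300, 128, 1.0))  # 3.0
```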

This is literally a weekend project. Anyone can build something like this, but not everyone has the integrity to be upfront about how they're reinventing the wheel, and spinning it like they've just broken ground in database R&D.

esafak · on May 5, 2023

A second is orders of magnitude off the typical SLA for these things. It's user facing. That's why these databases are a thing.

hiyou102 · on May 5, 2023

What kind of QPS are you looking at? How are you handling 1536 dimensions? How long does an incremental index update take? These are the problems you run into in building such a system.

ndriscoll · on May 5, 2023

I'm not familiar with the index part, but you can get at least 2TB on a single CPU socket these days. You shouldn't need multiple machines to fit in RAM. Depending on what QPS you need to handle, you might also be fine to not have the whole thing fit in RAM.

VHRanger · on May 5, 2023

My point was, specifically, that this data doesn't have to fit in RAM.

All of it fits on a single machine on one or a few big, fast SSDs.

ndriscoll · on May 5, 2023

A big SSD is 30 TB now: https://www.newegg.com/micron-30-72-tb-9400/p/N82E1682036315...

So that kind of dataset fits on a small SSD. :-)