High-Performance Graph Databases (arxiv.org)
194 points by belter on May 21, 2023 | 60 comments



I am not sure of the exact statistic, but something like 95% of all production databases are less than 10GB. There seems to be a 'FAANG hacker' fascination with 'extreme scale', which probably comes from seeing the challenges faced by the handful of organizations working at that level. Most of the time, graph database users want (it's why they are there) a DB that lets them flexibly model their data and run complex queries. They probably also want some sort of interoperability. If you can do that well for 10GB, that is holy grail enough. We certainly found this while developing the graph database TerminusDB [1]: most users have smaller production DBs, make light use of the bells-and-whistles features, and really want things like easy schema evolution.

[1] https://github.com/terminusdb/terminusdb


This research paper is talking about performance whilst you're talking about scalability.

Those are related but are distinct from each other.

And sure, about 95% of companies would have their needs met with a simpler system, but that still leaves a lot of companies that will not. And for those of us in, say, finance doing customer/fraud analytics, I would welcome all the performance I can get.


> This research paper is talking about performance whilst you're talking about scalability. Those are related but are distinct from each other.

The paper has "Scale to Hundreds of Thousands of Cores" in the title. I have not yet read the paper but it seems unlikely it doesn't talk about scalability.


I was referring to scalability in the sense of the size of the data being stored.

You can have slow queries with 10GB of data just like you can have fast queries with 10PB of data.


If your data is small enough to easily fit in RAM, you kind of can't have that slow a query on it (or at least you are no longer talking about a database problem).


If you end up having to scan the 10 GB graph many times per query without acceleration structures helping you (like indices), it will be slow. I'd say it's still a DB problem.


I'm guessing that, when the paper's authors mentioned "hundreds of thousands of cores", they didn't have 10GB of data in mind. That works out to less than a typical L1 cache's worth of data per core.
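For example, taking "hundreds of thousands" to mean roughly 200,000 cores (my own round number):

    10 GB / 200,000 cores = 50 KB per core

which is in the ballpark of a typical 32-64 KB L1 data cache.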


> I have not yet read the paper

This is really common across article-comment platforms; is anyone interested in discussing how to incentivise comment sections that have read the paper?


This isn't a graph database like neo4j. This is a graph database like I hoped neo4j would be. It's not about having an easier time working with schemas. It's about analyzing graphs that are too big to fit in RAM. Transaction analysis for banks, traffic analysis of roads, failure resilience of utility networks, etc.

In these kinds of workloads you quickly run into performance bottlenecks. Even in-memory analyses need care to avoid complete pointer-chasing slowdowns.

I do still hope this is fast on, say, a single-CPU, 32-core, 64GB system with an SSD. But if this takes a cluster to be useful, then I will still love it.


But the 5% of places where that kind of scale is needed are the ones paying the top 1% salary band, so this is the content distributed systems engineers like to read about and work on.


>There seems to be a 'FAANG hacker' fascination

Yeah, but the hacker fascination is what drives progress. You could have made the same type of argument about ML, and we would have been content with MNIST.


I think I kind of agree with this.

One of the simpler supported backends for our Modality product (https://auxon.io/products/modality) is built using SQLite. The resulting data model is a special case of a DAG for modeling big piles of causally correlated events from piles and piles of distributed components in "system of systems" use cases. The scaling limiter is almost always how efficiently the traces & telemetry can be exfiltrated from the systems under test/observation, before how fast the ingest path can actually record things becomes a problem.

That said, I do love me some RDMA action. 10 years ago I was fiddling with getting Erlang clustering working via RDMA on a little 5-node Infiniband cluster. With mixed results.


Interesting that you mention the value 10 GB, as it is the size of a DynamoDB partition or an AWS Aurora cell...


I agree with your sentiment but I suppose you're considering the wrong statistics. Instead you should consider:

- how many jobs have interviews that necessitate knowing how to handle extreme scale

- the proportion of jobs (not companies) requiring extreme scale; the fact that non-extreme scales are the long tail doesn't mean it's a fat tail

- the proportion of buyers/potential users that walk away over the inability to handle extreme scale

... and more sarcastically:

- the proportion of articles about extreme scale

- the proportion of repos about extreme scale


Only one anecdote, but I found out a while after starting at my current job that directly questioning, during a technical interview question, the extent to which scale-out was actually needed to solve the problem is the thing that made me stand out from the rest of the crowd and landed me the job. Being able to constructively challenge assumptions is an incredibly valuable job skill, and good managers know that.


Counter-anecdote: directly questioning "scale-out fantasies" has contributed to my early departure from a handful of jobs and contracts. One place was obsessed with getting everything into AWS auto-scaling groups when the problem was actually that they were running on MySQL with a godawful schema, dumbass session management, and horrific queries that we weren't allowed to fix because they were "migrating to node microservices anyway" (pretty sure that still hasn't happened years later.)

> Being able to constructively challenge assumptions is an incredibly valuable job skill

I would agree but ...

> good managers

... are few and far between.


The best people challenge bad assumptions, and the worst bosses get mad.

Had one boss get mad that I reduced the database footprint by 94% - why? Because he wrote the initial implementation and refused to believe that his baby, which cost so much space because of how awesome it was, could fit into 5GB.

But challenging the status quo has gotten me to where I am, so I won't stop anytime soon :)


I get that angle, but I also see orgs capturing too much data. What's the use case for it? "Not sure, but if we ever do need it we'll have it" is the typical answer.


Really? I don't quite believe that. We're a tiny company with maybe 70 customers, and our DB is roughly 11 TB.


Assuming 1 KB per "record", that's roughly 150 million records per customer.
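Back-of-the-envelope, taking the parent's 11 TB over 70 customers at face value:

    11 TB / 70 customers ≈ 157 GB per customer
    157 GB / 1 KB per record ≈ 157 million records per customer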

Definitely a data heavy product, wherever it is that you're offering.

(Unless you keep large blobs in the DB. But database scale has more to do with records than raw storage.)


That seems a lot. What type of data?


Congratulations, you're in the 5%, along with us :)


Just have a look at the size of all of English Wikipedia. Or all of StackOverflow.

And these are seemingly huge services.

And yet…


What are the databases with easy schema evolution?


Maciej Besta, the first author of this paper, is a machine.

Aside from coordinating big groups to write tons of papers, he does a bunch of impressive wilderness exploration. I recommend checking out his website, there's some stunning photos:

https://people.inf.ethz.ch/bestam/expeditions.html


Stating that someone “is a machine” is somewhat ambiguous these days as it is increasingly possible that they literally are one.


Wasn’t TJ Holowaychuk confirmed to be an AI, or a group of people like Bourbaki? He never showed up at any events :)

https://qr.ae/pyvfKK


I also misinterpreted that sentence!


It's nice ChatGPT hasn't been named something like Alexa. Else we would be anthropomorphizing it even more!


Unfortunately we named our kid ChatGPT 20 years ago…


Torsten is the PI of that group, not Maciej, so you're half right. Since Maciej is first author on this one, he probably did most of the technical work (and who cares about "coordination").


"...we harness established practices from the HPC landscape to build a system that outperforms all past GDBs presented in the literature by orders of magnitude, for both OLTP and OLAP workloads. For this, we first identify and crystallize performance-critical building blocks in the GDB design, and abstract them into a portable and programmable API specification, called the Graph Database Interface (GDI), inspired by the best practices of MPI. We then use GDI to design a GDB for distributed-memory RDMA architectures. Our implementation harnesses one-sided RDMA communication and collective operations, and it offers architecture-independent theoretical performance guarantees. The resulting design achieves extreme scales of more than a hundred thousand cores..."


I am a little confused by the purpose of this paper. The architecture described is roughly how graph databases have always been implemented on HPC systems. The main contribution seems to be that they put a lot of polish on what were admittedly prototype-ish implementations historically? I was hoping for some interesting approaches to some of the fundamental computer science problems that cause scaling issues when graphs become large but this is more of a standard “throw hardware at it” solution (which has significant limitations).


I love this - "it's just engineering", say the people who think the hardest part about building a spaceship is having the big idea to build the spaceship. Note the "I love this" was sarcasm.


My point was that the architecture they are using has already been done multiple times for graph databases using things like RDMA (which has existed in HPC for ages), that is a known quantity. It was less "it's just engineering" and more I've seen similar implementations for a long time so what makes this different? I am interested in this space in part because I spent much of my time in HPC working on graph databases.


You should quickly skim sections marked with a (comically fat) "!" symbol. These indicate their "key design choices and insights" in the design space.


Agreed. They brush aside the years of high-performance computing graph implementations, e.g. cuStinger. If you look at systems like neo4j's GDS and at where other multipurpose systems are going wrt views/projections for accelerated compute, that has enabled targeting way more performance without dying under complexity. Benchmarking perf without that kind of comparison is weird; I'm surprised reviewers allowed that. (The work may still be good... just you can't know without that.)


I was once obsessed with finding the holy grail of optimization and scaling, but then came to the conclusion that every system is optimized and scaled differently. Different systems have different bottlenecks.


So true. Rarely is anything the "best" or "better"; instead, each thing is a bucket of trade-offs we choose from. Maybe frustrating, but also what makes engineering genuinely interesting.


This sounds similar to the no free lunch theorem in AI


Huh? NFL is a theorem on optimisation, not AI


It's originally a theorem about supervised learning, which turned out to be the prototypical application (via CNNs) of deep learning. Deep learning is now mostly synonymous with AI. Though I'm not aware of any NFL theorem for reinforcement learning or unsupervised learning.


Given that the most well-known use case for optimization algorithms is training machine learning models, it seems to me like a perfectly reasonable, if buzzword-oriented, thing to say.


How do you search a graph that is separated onto different machines?

I know a matrix multiplication can act as a breadth-first search if you use a 1 for each column you want to search from.
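As a rough sketch of that idea (my own toy example assuming numpy/scipy, nothing from the paper), each sparse matrix-vector product advances the frontier by one BFS level:

    import numpy as np
    from scipy.sparse import csr_matrix

    # Directed graph with edges 0->1, 0->2, 1->3, 2->3 as an adjacency matrix.
    rows, cols = [0, 0, 1, 2], [1, 2, 3, 3]
    A = csr_matrix((np.ones(4), (rows, cols)), shape=(4, 4))

    frontier = np.zeros(4)
    frontier[0] = 1            # start the BFS at vertex 0
    visited = frontier.copy()

    while frontier.any():
        y = A.T @ frontier     # y[j] > 0 iff some frontier vertex points to j
        frontier = np.where((y > 0) & (visited == 0), 1.0, 0.0)
        visited += frontier

    print(np.nonzero(visited)[0])   # vertices reachable from 0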

How do you shard a graph data structure?


Following the RDF model, every node and edge of a graph can be represented as:

subject -> predicate -> object;

Example: `<bob> <isa> <person>` (from this RDF primer [1]).

Sharding data that follows this model sounds easy? Any sort of rule and/or hashing over the triples could be used to send triples to arbitrary servers.

1: https://www.w3.org/TR/rdf11-primer/#section-triple
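A minimal sketch of that (my own illustration, not anything from the paper): hash each triple's subject to pick a shard, so all edges of a given subject land together.

    from hashlib import blake2b

    NUM_SHARDS = 8
    shards = {i: [] for i in range(NUM_SHARDS)}

    def shard_for(subject: str) -> int:
        # Stable hash; Python's built-in hash() is salted per process.
        digest = blake2b(subject.encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % NUM_SHARDS

    def insert_triple(subject, predicate, obj):
        shards[shard_for(subject)].append((subject, predicate, obj))

    insert_triple("<bob>", "<isa>", "<person>")
    insert_triple("<bob>", "<knows>", "<alice>")
    # Both triples about <bob> end up on the same shard, so queries
    # anchored on a subject only touch one server.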


Thanks for this. I have worked with RDF and N3 notation.

I created a D3 visualiser from a Jena Fuseki database

https://camo.githubusercontent.com/3064a94d00812c1373c4eb3b2...

It renders RDF/N3 relationships. Unfortunately the code is trapped in a SQLite jsbin file :-)

It never occurred to me that it would be that trivial to shard; you could MapReduce the queries. I just worry about how you would do joins on data that lives on independent nodes.

If a value is on server A and another on server B, and there is a relationship between them, how would you "join" this data unless you colocated the join keys?
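For reference, a hedged sketch of the usual answer (a repartition, or "shuffle", join; my own illustration, nothing specific to this paper): re-hash both sides on the join key so matching rows meet on the same worker, then join locally.

    from collections import defaultdict

    NUM_WORKERS = 4

    def shuffle(rows, key_index):
        # Route each row to a worker by hashing its join key.
        buckets = defaultdict(list)
        for row in rows:
            buckets[hash(row[key_index]) % NUM_WORKERS].append(row)
        return buckets

    triples_on_a = [("<bob>", "<knows>", "<alice>")]   # lives on server A
    triples_on_b = [("<alice>", "<isa>", "<person>")]  # lives on server B

    left = shuffle(triples_on_a, 2)   # join on the object of A's triples...
    right = shuffle(triples_on_b, 0)  # ...and the subject of B's triples

    for w in range(NUM_WORKERS):
        for a in left.get(w, []):
            for b in right.get(w, []):
                if a[2] == b[0]:
                    print(a, "joins with", b)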


Thank you for such a succinct explanation of the RDF model of graphs. I have developed applications with Cypher and neo4j, and skipped Gremlin and even GraphQL because I found the RDF schema descriptions too complicated to bother with. Nodes and relationships seemed sufficient, but seeing everything as an implied digraph here makes it clearer. My note thanking you is a multiple of the length of your explanation, which is a pretty good measure!


thank you!


At what performance threshold should one migrate from a relational database to a graph database? In other words, is there something like OtterTune that can help me figure out the best storage engine (SQL vs NoSQL vs Graph vs Timeseries) for my service based on traffic patterns?


Andy from OtterTune here.

I suppose that it is possible to do something like this. We've explored the idea of synthesizing a workload that behaves like an application's real workload. The idea was to do offline testing with the synthesized workload to find optimal configurations that can then be applied to the production database. Our results were inconclusive (not enough real workloads to verify). But the idea could be extended for what you are proposing.

But in terms of relational vs graph, my opinions about the matter are public:

https://www.theregister.com/Debates/2023/03/06/great_graph_d...


What would be a use of 100k cores for a graph database?


Aggregate effective memory bandwidth if you use the CPU cache well. Graph databases are not compute intensive, nor are they particularly large data models, but they are extremely memory I/O intensive. The classic model for graph-oriented HPC was vast numbers of weak cores and barrel processors for this reason, though the utility of specialized hardware architectures has been greatly reduced by better software architecture for this purpose over time.


Fraud analytics.

You're looking for patterns across large numbers of entities and relationships.

And ideally you want this all done in real-time so you can stop transactions before they are approved.


Any chance you could expand on this? I hear about pattern matching for fraud analytics but I have never seen concrete examples, or even anything in the literature.


A social media service for hundreds of millions of users?


100 use cases for 1k cores


What are these used for in the “real” world?


If you mean graph databases generally: most DB-using applications can be built on a graph database. It's just a different data modeling framework than, e.g., relational.

Generally you could say that application logic/queries being more concerned with links, or with linked-entity attributes, than with direct entity attribute values is a sign that a graph DB may be a good fit. I'd say about 50% of DB-using apps are like this. You can get an idea by looking at their benchmark workload descriptions.

If you mean fully RDMA offloaded huge cluster DB systems, then I don't know the answer but suspect this kind of thing hasn't made it to industry yet:

"GDA internally uses a distributed hashtable (DHT) to resolve dif- ferent performance-critical tasks conducted under the hood, such as mapping application vertex IDs to internal GDA IDs. For highest performance and scalability, GDA’s DHT is fully-offloaded, i.e., it only uses one-sided communication, implemented with RDMA puts, gets, atomics, and flushes. Its design is lock-free, it incorporates sharding, and it uses distributed chaining for collision resolution. To the best of our knowledge, this is the first DHT with all its oper- ations being fully offloaded, including deletes. The DHT consists of a table (to store the buckets) and a heap (to store linked lists for chained elements)."


A few that I’ve encountered:

- Feature generation for machine learning model training (particularly popular with fraud detection at financial institutions)

- social networks (think LinkedIn or Facebook)

- supply chain analysis and optimization

- healthcare patient data analysis (looking at similar patients to recommend treatments or do large scale analysis)

- user identification (eg taking lots of data points and tying them to a specific user). There's a more specific name for this I can't remember off the top of my head.


Cybersecurity is also a large industry that makes use of large scale graph databases



