Performance Comparison Between ArangoDB, MongoDB, Neo4j and OrientDB (arangodb.com)
78 points by Hoolyly on June 11, 2015 | 35 comments



I did a lot of research on graph database technologies recently and read a lot of these "let's compare X to Y" articles. What I found is that most benchmarks - especially those done by people affiliated with a given product - tend to paint a distorted and sometimes plain wrong picture.

For example, concerning the performance and scalability of graph databases, the main argument from proponents of this technology is the "join bomb": you supposedly can't store a graph efficiently in a relational database, since looking up neighboring nodes via an index takes O(log n) time while crawling the graph. However, this is of course only true for B-tree indexes; hash-based indexing would give you basically the same O(1) performance on a graph implemented in a relational database.
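
To make the complexity argument concrete, here is a toy sketch in Python (mine, not the commenter's; all names and data are invented): neighbor lookup through a sorted, B-tree-like index costs one binary search per hop, while a hash-based adjacency map is amortized O(1).

    # Toy illustration of the neighbor-lookup cost argument (invented
    # data). A sorted, B-tree-like index needs a binary search per hop,
    # i.e. O(log n); a hash-based adjacency map is amortized O(1).
    import bisect
    from collections import defaultdict

    edges = [(1, 2), (1, 3), (2, 3), (3, 1)]  # (source, target) pairs

    # "B-tree-like": edges kept sorted by source, binary search per lookup.
    sorted_edges = sorted(edges)

    def neighbors_sorted(v):
        i = bisect.bisect_left(sorted_edges, (v,))
        out = []
        while i < len(sorted_edges) and sorted_edges[i][0] == v:
            out.append(sorted_edges[i][1])
            i += 1
        return out

    # "Hash index": adjacency kept in a hash map, O(1) per lookup.
    adjacency = defaultdict(list)
    for s, t in edges:
        adjacency[s].append(t)

    def neighbors_hashed(v):
        return adjacency[v]

    assert neighbors_sorted(1) == neighbors_hashed(1) == [2, 3]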

Additional features like documents and deep indexes are nice, of course, but can be (and often are) implemented using relational databases as well, so in the end there really isn't such a large advantage to be gained from using a graph database, especially when taking into account the immaturity of many solutions in that space.


>Additional features like documents and deep indexes are nice, of course, but can be (and often are) implemented using relational databases as well, so in the end there really isn't such a large advantage to be gained from using a graph database, especially when taking into account the immaturity of many solutions in that space.

I've worked with graph data stored in an RDBMS in the medical informatics space. As you say, there are ways to correctly handle complex graph data in an RDBMS.

I've also used neo4j as the backend for a Wall Street analytics app that's in production. Could it have been done in an RDBMS? Sure, but the ad hoc queries that needed to be run against the data were much easier to express as graph traversals than as SQL.
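
For a flavor of that trade-off (my own sketch, not the commenter's code; schema and data invented): the same reachability question can be asked of a plain edge table with a recursive CTE, but the equivalent graph-query form, e.g. a variable-length Cypher match, is far terser.

    # Reachability over a plain edge table via a recursive CTE in
    # SQLite (schema and data invented for the example).
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE edges (src INTEGER, dst INTEGER);
        INSERT INTO edges VALUES (1, 2), (2, 3), (3, 4);
    """)

    # All nodes reachable from node 1. In Cypher this is roughly
    # MATCH (a {id: 1})-[*]->(b) RETURN DISTINCT b, which is much shorter.
    rows = conn.execute("""
        WITH RECURSIVE reach(node) AS (
            SELECT 1
            UNION
            SELECT e.dst FROM edges e JOIN reach r ON e.src = r.node
        )
        SELECT node FROM reach WHERE node != 1 ORDER BY node;
    """).fetchall()
    print([r[0] for r in rows])  # [2, 3, 4]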

There are some obvious downsides to using a graph database, mainly that it's practically impossible to find programmers with non-trivial production experience, but it's been a great fit at the two startups where I used it, since I got to implement it from the ground up and didn't need a large team.

That being said, database pragmatism is the main lesson to be learned here. Use the right tool(s) for the right jobs.


(Disclaimer: Max from ArangoDB here) I am all for database pragmatism. Fortunately, the choice between "graph database" and "non-graph database" is no longer binary. We at ArangoDB are convinced that graphs have their merits in data modelling (namely when you need "graphy" queries), but you do not want to be locked into the graph data model. Therefore we argue for multi-model databases, which can give you graphs but do not force you to use graphs for everything. By "graphy" I mean queries that involve paths in a graph whose length is not known a priori (e.g. ShortestPath).
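
To illustrate what "length not known a priori" means in practice, here is a minimal breadth-first shortest-path sketch in Python (mine, for illustration only): the traversal depth cannot be fixed in advance, which is exactly what fixed-depth relational joins cannot express directly.

    # Minimal breadth-first shortest path: the number of hops is not
    # known a priori, so the traversal depth cannot be fixed in advance.
    from collections import deque

    def shortest_path(adjacency, start, goal):
        queue = deque([[start]])
        seen = {start}
        while queue:
            path = queue.popleft()
            if path[-1] == goal:
                return path
            for nxt in adjacency.get(path[-1], []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(path + [nxt])
        return None  # goal not reachable

    g = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
    print(shortest_path(g, "a", "d"))  # ['a', 'b', 'd']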


(Disclaimer: Max from ArangoDB here)

The purpose of this benchmark series was not to provide a comprehensive test of all these databases. We only wanted to demonstrate that a multi-model database can successfully compete with specialised solutions like document stores and specialised graph databases.

I agree with your comment about graph databases. The crucial thing is that the neighbors of a vertex can be found in time proportional to their number, and that queries involving an a priori unknown number of steps (graph traversals, path matching, shortest path, etc.) run efficiently in the database server and can be accessed conveniently from the query language.


Understood. I did not mean to be overly critical; I just think that there is a lot of misleading information out there concerning graph databases, and for many people it is hard to get good information about their real benefits and drawbacks.

ArangoDB seems to be a very interesting project, btw; I might evaluate it again for my project in the future (we are currently creating a very large graph of code data, so we need something that can scale beyond 1B nodes and 100B edges).


No worries. Yes, there is a lot of bad information about graph databases; to begin with, some seem to believe that everything out there can best be described by a graph, which is clearly wrong. I have written about this myself in this article: https://medium.com/@neunhoef/graphs-in-data-modeling-is-the-...

Furthermore, I am currently working on another article for the O'Reilly radar blog presenting a nice case study in which a multi-model database was very useful, because document queries and graph queries were both used extensively.

1B vertices and 100B edges will definitely be a challenge for any graph database, and I find it highly likely that ArangoDB in its current version will not perform very well on a data set of this size. Obviously, it will always depend on the particular queries you need, and on whether the graph has a natural cluster structure that can be used for sharding.


Rant warning:

After working with Neo4j for about six months I would definitely NOT recommend it to anyone. In my experience it has been the least reliable and most buggy database solution I've ever worked with.

From the engineering point of view it has a host of core issues that make it really hard to write code against, such as constant deadlock exceptions. These are caused by the fact that the database is largely incapable of handling two simultaneous upserts if those upserts touch the same node. Where most mature, decent DBs handle this completely transparently, Neo4j just panics and returns an error. This means writing Neo4j queries ends up requiring tons of boilerplate to wait for exclusive locks on nodes and/or retry upserts until they succeed.
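
That retry boilerplate looks roughly like the sketch below (mine, not the commenter's code; the driver calls and the TransientError class are assumptions based on the official neo4j Python driver).

    # Sketch of the upsert-retry boilerplate described above. The
    # driver API and TransientError class are assumptions based on the
    # official neo4j Python driver; adapt to your client library.
    import time
    from neo4j import GraphDatabase
    from neo4j.exceptions import TransientError

    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "password"))

    def upsert_with_retry(query, params, retries=5, backoff=0.1):
        for attempt in range(retries):
            try:
                with driver.session() as session:
                    return session.run(query, params).consume()
            except TransientError:  # e.g. a deadlock was detected
                time.sleep(backoff * 2 ** attempt)  # back off, retry
        raise RuntimeError("upsert still failing after retries")

    upsert_with_retry("MERGE (n:User {id: $id}) SET n.name = $name",
                      {"id": 42, "name": "Alice"})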

From the devops perspective, managing a cluster is also extremely painful, as there are frequent issues with replicas falling behind on syncing due to the server's pathetically slow write performance, even with SSD storage volumes. We tried everything, but the bottleneck was the server processes themselves, not the storage volumes, network connection, CPU, etc. We threw some really nice hardware at our Neo4j cluster but it still struggled to keep up with write loads in the range of 500-2000 writes per minute.

The final straw was their latest version, 2.2, which they were advertising as a massive improvement in speed and reliability. When we upgraded, it turned out to be the exact opposite: a few of our queries got faster, but overall most of them got an order of magnitude slower. Their support basically told us that we'd need to rewrite many of our queries or manually set a flag to use their older query engine (and therefore miss out on the speed of the new one). Needless to say, we decided that if we needed to rewrite queries, we were going to rewrite them to use a different storage engine entirely.

In my experience Neo4j was little more than a six month waste of time and dev resources.


Hi,

I'm Claudius, the author of the blog post. The intent of the post was not to show that a particular product performs badly. There are thousands of different use cases, and each database has its strengths and weaknesses; for a different scenario the results might be different. Neo4J is a solid product and is doing a good job. The aim of the post was to show that multi-model can compete with specialized solutions. What I wanted to show is that a multi-model approach per se does not carry a performance penalty.


> Neo4j is a solid product and is doing a good job.

My personal opinion, based on my experiences using Neo4j, is that Neo4j is not a solid product. (At least not by my definition of "solid".) I think it has good marketing but falls apart in a lot of real world use cases.

I understand that the point of the blog post wasn't to bash Neo4j, but it does demonstrate quite well that a multi-model approach can be quite competitive, or even better, in terms of performance for many use cases.

Given Neo4j's historical and ongoing issues, if the multi-model DB solution also happens to be a more mature and stable one, then there is no front on which Neo4j wins.


> Neo4J is a solid product

Do you really believe this, in your expert opinion, or are you just trying not to step on toes?


We build a multi-model database, which is not a competitor in the strict sense but competes with Neo4J in some areas, so I'm definitely not a Neo4J expert. However, I have now been working in the field for over 15 years, developing in-memory solutions, databases and application servers, and I have been developing ArangoDB for almost three years. I have talked to a lot of people in that area and to people who are using Neo4J. There are always obstacles when moving to new products, but most of the people I have met are quite happy with Neo4J.


I had a pretty similar experience, in fact. I lost months of work to Neo4J back in 2011 or so.

The purpose was to be the storage backend for ConceptNet, a semantic network that is largish but is far from "big data". The write speed was awful, the stability was awful, the mechanisms for loading in non-toy amounts of data were nearly nonexistent, and I learned what it means to "run out of PermGen".

I hastily aborted the Neo4J plan. I was also burned by RDF triple-stores (they can't cope with small data either) and MongoDB (which seemed to work at first and eventually fell over, for obvious reasons in retrospect).

The graph is now in SQLite plus flat files, which works great. I've concluded by now that the step on the scale beyond "lies" and "damn lies" is "claims about next-generation databases".


Good to hear, because I started to experiment with it a couple of months ago to solve a specific problem. I abandoned it because the embedded version required an older version of Lucene than I was using. Yes, I could have worked around that, but it's a lot of extra work to try something that may or may not solve my problem. Then I tried the server version, which also required that I be able to call web services (something my code had no other use for). I worked through that and found the performance pretty poor compared to embedded. I ended up just solving the problem in Lucene. Glad I didn't spend more time on it.

I would love to have a good graph database later on, though.


Haha, damn... That sounds pretty harsh. Neo4J seems really interesting to me. Who doesn't want to get better with graphs? I still want to learn the query language as it is applicable to some of the problems I'm trying to solve now.

This isn't the first horror story I've heard about Neo4J so I've actually stayed away from using it professionally.


So what did you end up doing with it? How did you replace it?


IMHO, the community needs a set of specific tasks that can be achieved with all databases (just like http://benchmarksgame.alioth.debian.org/ has a series of algorithms for testing different memory/CPU strengths of languages). Then, proponents of each database (e.g., their sponsors, evangelists) can create code and config for running the tests on their database. This could all be open source, and the tests could all be run on the same host (or hosts) for comparison.
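
A minimal common harness might look like the sketch below (hypothetical; all names are invented): each vendor contributes one adapter implementing a shared interface, and the same task list is timed against every adapter on identical hardware.

    # Hypothetical sketch of such a shared harness (all names invented):
    # vendors implement one adapter each; tasks and timing are common.
    import time
    from abc import ABC, abstractmethod

    class DatabaseAdapter(ABC):
        @abstractmethod
        def setup(self, dataset): ...

        @abstractmethod
        def single_read(self, key): ...

        @abstractmethod
        def neighbors(self, vertex, depth): ...

    def run_benchmark(adapter, dataset, tasks):
        adapter.setup(dataset)
        results = {}
        for name, task in tasks.items():
            start = time.perf_counter()
            task(adapter)
            results[name] = time.perf_counter() - start
        return results

    # e.g. tasks = {"neighbors2": lambda db: db.neighbors(1, depth=2)}
    # Each vendor tunes its own adapter; host and task list stay fixed.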

This seems to make sense, and is more akin to what https://www.techempower.com/benchmarks/ has done, IIRC.


You have recalled correctly! That approach is precisely what we have taken with the TechEmpower benchmarks.


Sane suggestion: a Königsberg Benchmark!


When I compare databases, I also seek out the performance comparisons created or sponsored by my preferred database provider; then I know that I can trust the results to be complete and unbiased. /sarcasm

Seriously, why is this one of the top stories on HN? These types of tests are so easy to tweak in favor of a preferred database that they are completely unreliable. Even neutral comparisons by third parties are rife with errors, like not adding proper indices to all the DBs, using query formulations that avoid the indices on some DBs, or other configuration issues (DBs are unfortunately tricky).

I think the only way to do this objectively is to have a test and then give each DB vendor an opportunity to tweak the DB and queries to optimize performance. Seeing how DB vendors optimized performance would actually be very informative to potential users. Everything else is just a comedy of errors (or worse), as people usually have good expertise in only one of the DBs in question, if that.


It looks like the source of the test is out there for others to compare/review... which is far better than many of these types of tests. I'm not an ArangoDB person, I've done a bit with other NoSQL variants, but I will say their approach is fine. They don't even come out on top in all the tests...

It's mainly about showing how they compare performance-wise, so that they can concentrate on selling based on features. Which is a pretty fair approach, and I wish them luck.

Also, in looking, it seems that they have pretty broad platform support as well.


> I think the only way to do this objectively is to have a test and then give each DB vendor an opportunity to tweak the DB and queries to optimize performance.

AFAIK they are: the test is open source, the raw results are there, and contributions are welcomed. Hopefully the OrientDB team will step up and show how theirs can perform.


We (OrientDB Team) had the chance to look at the code and we're preparing a Pull Request. Stay tuned for the new results ;-)


Hey all, we sent a Pull Request 2 days ago to the author of the benchmark, as they used OrientDB incorrectly. Now OrientDB is the fastest in all the benchmarks, except for "singleRead" and "neighbors2", but we know why we're slower there.

We are still waiting for the Arango team to update the results...


> Hey all, we sent a Pull Request 2 days ago [...] We are still waiting for the Arango team to update the results...

Let's see:

1. You sent it at the weekend.

2. The Arango team have a life.

3. It took you 9 days to send the PR.

4. The start of the week is always busy, regardless of which company you work at...

Give over and stop trying to make out there is something suspicious in whatever the Arango team do. It makes you (look like) a jerk.


Lvca, good to see the OrientDB team cooperating in the benchmark. Performance is just one aspect when evaluating a database; other things like robustness, stability, and community can be as important, if not more so. As such, I've shared my 10-month experience with OrientDB here: http://orientdbleaks.blogspot.com/2015/06/the-orientdb-issue...


I would love to see postgres in the mix.


Likewise. I started some stuff at work recently comparing Mongo, Orient and PG (using a JSON column). Alas, all the test suite does so far is insert small "documents" (5000 docs with 3 name/value pairs, excluding PK/ID) and time that. No read-back tests of any kind yet, so no indices in place either.

For this little test (on my MacBook), Mongo was the fastest. PG took 1.5 times as long, and Orient took 4 times as long as Mongo. All three were driven by a Java client connected via a socket to the DB on localhost. (Orient could have been "in process", but I wanted it external, as if on a server.)
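
For reference, the PG leg of such a test looks roughly like the sketch below (mine, in Python with psycopg2 rather than the commenter's Java client; the table name and document shape are invented to match the description).

    # Rough sketch of the PG leg of such a test, using psycopg2 and a
    # JSON column. Table name and document shape are invented to match
    # the description (5000 docs with 3 name/value pairs each).
    import time
    import psycopg2
    from psycopg2.extras import Json

    conn = psycopg2.connect("dbname=bench")  # assumed local database
    with conn.cursor() as cur:
        cur.execute("CREATE TABLE IF NOT EXISTS docs "
                    "(id SERIAL PRIMARY KEY, body JSON)")
    conn.commit()

    start = time.perf_counter()
    with conn.cursor() as cur:
        for i in range(5000):
            doc = {"a": i, "b": str(i), "c": i % 7}
            cur.execute("INSERT INTO docs (body) VALUES (%s)",
                        (Json(doc),))
    conn.commit()
    print(f"5000 inserts in {time.perf_counter() - start:.2f}s")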

Of course, the main use-case for Orient is reading back graph chains, so it's a horrible test. However, what we need is a supplemental store to dump some flat junk as the app runs.


I worked with Orient back in 2012, had some performance issues, and switched. Following the news about them, I thought they had made great progress; this benchmark kind of shows the exact opposite.



Are you sure that you optimized/tuned Neo4j or MongoDB as much as you did with ArangoDB?

Also, I don't like it when a company posts a comparison between its product and others. Although some of these are arguably informative and objective, I consider such posts marketing/ads.


(Disclaimer: Max from ArangoDB) We have invested considerable effort to optimize each database. Obviously, we know our own product better than the others. However, we have asked people who know the other products better, and we keep this investigation open for everybody to contribute and to suggest improvements. As you can see from last week's post, there have been very good contributions, we have tried them out and have published the improved results.


Hey all, we sent a Pull Request 2 days ago to the author of the Benchmark, as they used OrientDB incorrectly. Now OrientDB is the fastest in all the benchmarks, except for "singleRead" and "neighbors2", but we know why we're slower there.

We are still waiting for the Arango team to update the results...

However, anyone interested in running the tests themselves can just clone this repository:

https://github.com/maggiolo00/nosql-tests


It will be interesting to see the 'alpha' test results; I'm curious how the MongoDB 3.2 series would perform there.


Anyone using ArangoDB in production who can speak about it? It looks interesting, but like many of the newer databases coming out (Aerospike, Blazegraph, Hyperdex etc.) there is precious little public information from third parties.


[flagged]


Great, now tell the graph-based-DB guys how you do on multi-doc "join" queries :-)

Disclaimer: I'm more of a Postgres fan-boy.

Mongo's single-record insert/fetch times are impressive, though. And it's pretty easy to set up a single node. So it could sometimes be the right choice.



