I recommend going through disaster recovery steps for your personal data as well, such as Gmail. Recently I was creating filters to delete bulk messages, and when the filter got created it somehow dropped the from:@xyz.com domain part, so I ended up "delete forever"-ing all my emails. I noticed the issue right away, but it was enough to wipe 2-3 months' worth of emails (all of them, even Sent ones).
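If you want something a bit more proactive than hoping you notice in time, even a dumb periodic IMAP dump helps. Here's a minimal sketch using Python's stdlib imaplib, assuming IMAP is enabled on the account, an app password, and the English-locale "[Gmail]/All Mail" folder name (all of those may differ for you); Google Takeout is the zero-code alternative.

    import imaplib
    import email

    USER = "you@gmail.com"          # placeholder credentials
    APP_PASSWORD = "app-password"   # generated under Google account security settings

    # Connect over IMAPS and open the archive folder (name is locale-dependent)
    conn = imaplib.IMAP4_SSL("imap.gmail.com")
    conn.login(USER, APP_PASSWORD)
    conn.select('"[Gmail]/All Mail"', readonly=True)

    # Fetch every message and dump it to disk as raw RFC 822 text
    typ, data = conn.search(None, "ALL")
    for num in data[0].split():
        typ, msg_data = conn.fetch(num, "(RFC822)")
        raw = msg_data[0][1]
        subject = email.message_from_bytes(raw).get("Subject", "(no subject)")
        with open(f"backup_{num.decode()}.eml", "wb") as f:
            f.write(raw)
        print(num.decode(), subject)

    conn.logout()

Run it from cron once a week and a bad filter only costs you a week of mail instead of months.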
I simply had to download the Mac / Windows client (https://teams.microsoft.com/downloads) and things worked. I never found a "web version" of Teams, but it probably exists somewhere. :)
I tried Neo4j a while back for recommendations and calculating similarities between users, but when running against our full dataset we got too many OutOfMemory exceptions. Ended up with a Mahout / Spark solution. It's an awesome graph db though - I can find many other uses for it.
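For anyone curious what the Spark side of that kind of job can look like, here's a rough sketch of pairwise cosine similarity between users with PySpark; the (user_id, item_id, rating) schema and the parquet path are just placeholders, not a claim about the original setup.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("user-similarity").getOrCreate()

    # assumption: ratings stored as (user_id, item_id, rating); path is a placeholder
    ratings = spark.read.parquet("/data/ratings.parquet")

    # per-user vector norms: sqrt(sum(rating^2))
    norms = ratings.groupBy("user_id").agg(
        F.sqrt(F.sum(F.col("rating") ** 2)).alias("norm"))

    # dot products over co-rated items, keeping each (u, v) pair only once
    a, b = ratings.alias("a"), ratings.alias("b")
    dots = (a.join(b, (F.col("a.item_id") == F.col("b.item_id"))
                     & (F.col("a.user_id") < F.col("b.user_id")))
             .select(F.col("a.user_id").alias("u"),
                     F.col("b.user_id").alias("v"),
                     (F.col("a.rating") * F.col("b.rating")).alias("prod"))
             .groupBy("u", "v")
             .agg(F.sum("prod").alias("dot")))

    # cosine similarity = dot(u, v) / (|u| * |v|)
    similarities = (dots
        .join(norms.select(F.col("user_id").alias("u"), F.col("norm").alias("nu")), "u")
        .join(norms.select(F.col("user_id").alias("v"), F.col("norm").alias("nv")), "v")
        .withColumn("cosine", F.col("dot") / (F.col("nu") * F.col("nv"))))

    similarities.orderBy(F.desc("cosine")).show(20)

Because everything is shuffled and aggregated across the cluster, it degrades into "slow" rather than "OOM" as the dataset grows.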
Yeah, I'm surprised the Neo4j team hasn't made more of an effort on this. I've run into lots of memory issues with it as well, and although there are reliable, fairly straightforward solutions to most of these problems, the team doesn't seem to be particularly interested in making sure that the defaults are robust enough to handle a reasonable workload. When your database fails on you for making a reasonable query request on a light workload, you can't help but feel troubled. There's a lot to love about Neo4j, but they've got a lot of work to do if they want to win over the developer community as a whole. There may be enterprises that get reassured by a huge price tag and a whole bunch of salespeople at their beck and call, but I don't know any of them. Every engineer I know who is willing to pay for software is either expecting a completely new kind of product or expecting to have an awesome experience with a free version of the tool before being willing to commit even a few bucks a month.
Yeah, I've tried a couple of times to get Neo4j into stacks, but the outcome has always been that it's pretty much limited to baking relationship data (pre-computed or on demand) that gets saved elsewhere and then cleared out; otherwise you get into prohibitively expensive licensing / infrastructure territory very quickly.
At that point a more pragmatic solution has always won.
Would you mind sharing the query? If you're hitting OOM exceptions with a dataset of that size there may be a typo in the query that's doing some sort of traveling salesman operation.
e.g.,
// Person, KNOWS and Friend are just variable names here (no labels, no relationship type), so this grabs literally EVERY node in your database:
MATCH (Person)-[KNOWS]-(Friend)
// what you probably meant: only the people who actually have a :KNOWS relationship between them
MATCH (p:Person)-[:KNOWS]-(f:Person)
Can you, though? My impression is that it doesn't scale to large data sets. The use cases for true graph databases (over shaky implementations on top of HBase/Cassandra) are sparse, in my opinion.
It's a single-image database (no partitioning except in memory), so every node in the cluster holds the complete dataset (and thus each node must be large enough to store it). However, because Neo4j doesn't rely on joins / table scans to operate, traversals are O(1) per hop rather than O(n). So there's an advantage to doing OLTP work on really, really large datasets when queries have a specific starting point: Neo4j does pointer arithmetic instead of scans / joins, so regardless of dataset size a query only touches the data it actually traverses. The reasoning behind this strategy is that scale-up hardware pricing has come down incredibly quickly in the last decade, and having a trio of 64+ GB memory boxes isn't out of the question for most mid-size and enterprise companies. Secondly, distributed systems are non-trivial to manage, from both a development and a devops perspective.
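To make the "specific starting point" part concrete, here's a rough sketch with the official Python driver; the Person/KNOWS schema, credentials, and index are placeholders, not anything from a real deployment. The index lookup finds the anchor node, and from there the traversal only touches the relationships it actually walks, no matter how big the rest of the graph is.

    from neo4j import GraphDatabase

    # assumption: local instance and an index on Person(name); adjust to taste
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

    # Anchored traversal: one index lookup for the start node, then pointer-chasing
    # out to friends-of-friends. Cost scales with relationships traversed, not with
    # total graph size.
    QUERY = """
    MATCH (me:Person {name: $name})-[:KNOWS*1..2]-(other:Person)
    RETURN DISTINCT other.name AS name
    LIMIT 25
    """

    with driver.session() as session:
        for record in session.run(QUERY, name="Alice"):
            print(record["name"])

    driver.close()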
The philosophy of the Neo4j team is to conquer the world slowly. In order of priority Neo4j is designed around:
1.) data integrity and availability (ACID transactions, master-slave replication)
2.) rapid reads for graph traversals
3.) ability to store web-scale datasets (trillions++ of nodes)
4.) parallel operations (multi-master, map-reduce, global analytics, etc.)
The product has firmly completed 1 and 2, and is starting to work on 3 and 4 (4 mostly through a Databricks / Spark partnership).
It fights the same CAP problem that all databases do. We've chosen Consistency and Availability; partition tolerance just isn't something inherent to graph databases. We can do some really smart math and duplicate nodes with high betweenness centrality (data nodes, not servers), or shuffle data around based on access patterns, to avoid introducing network latency into query plans that touch nodes on multiple partitions. But doing that while maintaining 1 and 2 above is very much not easy.
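(For anyone unfamiliar with the term, here's a toy illustration of the betweenness-centrality idea using networkx; this is just my own sketch of which nodes you'd consider duplicating, not anything from Neo4j's internals.)

    import networkx as nx

    # Toy social graph; in practice you'd sample this from the real dataset
    G = nx.Graph()
    G.add_edges_from([
        ("alice", "bob"), ("bob", "carol"), ("carol", "dave"),
        ("bob", "dave"), ("dave", "erin"), ("erin", "frank"),
    ])

    # Betweenness centrality: the fraction of shortest paths passing through each node
    scores = nx.betweenness_centrality(G)

    # Nodes that bridge many paths are the ones you'd consider duplicating onto
    # multiple partitions so traversals stay local.
    for node, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{node}: {score:.3f}")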
I wanted to build something like this - basically a P2P based End-to-End encryption messaging technology (so it doesn't get inspected by peers) for use cases such as messaging (a-la SMS) on a non-networked environment such as cruise ships / National Parks, etc [maybe through bluetooth or something]. Cruise lines should / could easily build a messaging app that works on a closed network (lots of branding potential.. not sure why it hasn't been done yet!). If anyone is interested ping me. :)
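The end-to-end crypto layer is actually the easy bit; here's a minimal sketch with PyNaCl public-key boxes (key distribution and the Bluetooth/mesh transport are the hard parts and are entirely hand-waved here).

    from nacl.public import PrivateKey, Box

    # Each device generates a keypair; public keys get exchanged when peers
    # discover each other (over Bluetooth, local Wi-Fi, etc. -- not shown).
    alice_sk = PrivateKey.generate()
    bob_sk = PrivateKey.generate()

    # Alice encrypts to Bob's public key; only Bob's private key can open it,
    # so relaying peers on the mesh never see plaintext.
    sending_box = Box(alice_sk, bob_sk.public_key)
    ciphertext = sending_box.encrypt(b"meet at deck 7 at noon")

    # Bob decrypts with his private key and Alice's public key
    receiving_box = Box(bob_sk, alice_sk.public_key)
    print(receiving_box.decrypt(ciphertext).decode())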
At work we use a combination of git + Jenkins and a simple xcopy on build success. It does a great job, plus it can do Slack events, email notifications, etc.