It all feels like they expect developers to sift through the conflict log to resolve things manually or something. If a transaction did not go through on some of the nodes, what are the others doing then? What if they cannot roll it back safely?
Typically applications will have some kind of logical separation of the data.
Given this is targeted at replication of postgres nodes, perhaps the nodes are deployed across different regions of the globe.
By using active-active replication, all the participating nodes are capable of accepting writes, which simplifies the deployment and querying of postgres (you can read and write to your region-local postgres node).
Now that doesn't mean all the reads and writes will be on conflicting data. Take the regional example: perhaps the majority of the writes affecting one region's data are made _in that region_. In this case, the region-local postgres would be performing all the conflict resolution locally and sharing the updates with the other nodes.
The reason this simplifies things is that you can treat all your postgres connections as if they are just a single postgres. Writes are fast because they are accepted in the local region, and reads are replicated without you having to maintain a dedicated read-replica.
Of course, you're still going to have to design around the conflict resolution (i.e. writes for the same data issued against different instances) and the possibility of stale reads as the data is replicated cross-node. But for some applications, this design might be a significant benefit, even with the extra work it requires.
I think I understand the use case. Like, we have in fact several regional Postgreses, but we want them to behave as a single logical database for the sake of simplicity.
Probably this should be in the motivational part of the README.
There’s no free lunch. The rabbit hole is only worth going down if the benefits are worth the operational pain. I view this as a building block, not a checkbox feature that magically just works all the time.
For someone who has these requirements out of the gate, another datastore might be better. But if someone is already deeply tied to Postgres and perhaps doing their own half-assed version of this, this option could be great.
What are good off-the-shelf distributed databases? We looked at MongoDB but it wasn't worth giving up SQL. To reiterate the no free lunch point, no one has figured out how to outsmart the CAP theorem yet, so all you can do is design around it.
I work for them so take this with a pinch of salt, but Oracle DB. It gives you a fully multi-master, horizontally scalable database with ACID transactions (not sharding), full SQL, elastic scalability, built-in queues, JavaScript stored procs, automatic REST API construction, and many other features. Its pricing is competitive with a cloud-hosted Postgres, believe it or not (the clouds are making a lot of money off customers who are wedded to Postgres). I work through some of the numbers for an extreme case here [1].
Behind the scenes, the way it works is by combining software tricks with special hardware. You rent a (part of a) database cluster. The cluster runs on high-end hardware with customized kernels, with a private Infiniband RDMA-capable interconnect between the nodes, separate from the front-side network that clients connect with. A lock manager coordinates ownership of data blocks, which can be read either from disk nodes or directly out of the RAM of other database nodes. So if one node reads a block and then writes to it, the only thing written to disk immediately is the transaction log. If another node then needs to write to that block, it's transferred directly over the interconnect using RDMA to avoid waiting on the remote CPU; the disk is never touched. Dirty blocks are written back to disk asynchronously. The current transaction counter is also poked directly into remote nodes via RDMA.
In the latest versions the storage nodes can also do some parts of query processing using predicate push-down, so the amount of data to be transferred over the interconnect is also lowered. The client drivers understand all the horizontal scalability stuff and can failover between nodes transparently, so the whole setup is HA. A node can die and the cluster will continue, including open transactions.
If you need to accelerate performance further you can add read-through coherent cache nodes. These act as proxies and integrate with the block ownership system to do processing locally.
Other than financial reasons (I own some stock), I've started making this argument here on HN because it's unintuitive but correct, which is just enjoyable. A lot of people in the startup world don't realize any of the above, thinking that horizontally scalable fully coherent SQL databases either don't exist or have severe caveats. E.g. one reply to you suggests FoundationDB which is great, but it's a KV store and not a SQL database.
At the end of the day it's not black and white; there are trade-offs. Special hardware is simply a no-go zone for me. What happens if you want to leave the cloud and host on premises? That's when the lock-in mechanism kicks in. Thanks, but I can manage one solution or another using open source technologies. It's all trade-offs.
You don't have to use the special hardware, it's just faster and easier if you do. And the customized hardware deal pre-dates the cloud: you can buy the pre-built racks and have them wheeled into a self-hosted datacenter if you want. That's called Exadata.
But if you want to run the DB on your own VMs or bare metal, you can do that. It doesn't have any DRM, so from time to time you'll be expected to run some scripts that check your usage and report back, to ensure you've been paying for what you use. But otherwise it's operationally no different to an open source DB.
The open source aspect makes a difference in terms of who you pay for support (if anyone), what quality of support you get, things like that.
Well, no more so than for any other RDBMS. It implements standard SQL, and exporting your data is easy. The only lock-in comes from using features that other databases don't have, but that tradeoff exists with any kind of software, including open source.
My guess is that you want to change your design philosophy a little with regard to table design, moving some entities to use a composite GUID+timestamp as their PKs and replacing most updates with inserts to avoid conflicts, instead resolving things at query time (basically a CRDT modelling philosophy contained within a relational schema).
Ideal? Not entirely, but it should still give most of the query benefits of regular SQL and allow one to benefit from good indexes (the proper indexes of an SQL database will also help contain the costs of an updated data model).
I think this is more interesting for someone building something social-media-like than for anything involving accounting.
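A minimal sketch of that modelling style in Postgres, with all names invented for illustration: every change is a new row, and the "current" state is derived at query time rather than stored in place.

```sql
-- Hypothetical insert-only model: each change to a document is a new revision row.
CREATE TABLE document_revisions (
    doc_id      uuid        NOT NULL,  -- client-generated GUID identifying the entity
    revised_at  timestamptz NOT NULL,  -- when this revision was written
    origin_node text        NOT NULL,  -- which replica accepted the write (tie-breaker)
    title       text,
    body        text,
    PRIMARY KEY (doc_id, revised_at, origin_node)
);

-- Resolve "current state" at query time: latest revision per document,
-- using origin_node as a deterministic tie-breaker for identical timestamps.
SELECT DISTINCT ON (doc_id) *
FROM document_revisions
ORDER BY doc_id, revised_at DESC, origin_node DESC;
```

Concurrent writes from different nodes then become two rows instead of one overwritten row, and the disagreement stays visible and queryable.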
Are there any Datomic-like query layers on top of Postgres for approaches like this where you're recording immutable occurrences rather than updating mutable records?
No, this would be more about using different ways of mapping data to the database from an application's point of view. So for a developer it would increase the load a tad in terms of exploring data when debugging simple things.
On the other hand, the increase in exploration costs should be more than offset by having most data changes logged, which makes it possible to track changes.
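For instance, with the insert-only sketch above, "tracking changes" is just an ordinary query over the hypothetical revisions table:

```sql
-- Full change history for one entity, oldest first (hypothetical table from the earlier sketch).
SELECT revised_at, origin_node, title, body
FROM document_revisions
WHERE doc_id = $1
ORDER BY revised_at;
```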
In our case, we're designing around INSERT-only tables with a composite primary key that includes the site id, so (in theory) there will never be any conflicts that need resolution.
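Roughly this shape, to illustrate (invented names, not our actual schema): because the originating site's id is part of the primary key, concurrent inserts at different sites can never collide.

```sql
-- Illustrative only: each row is keyed by the site that created it plus a per-site id.
CREATE TABLE site_records (
    site_id     text        NOT NULL,             -- originating site, part of the key
    record_id   bigint      NOT NULL,             -- locally assigned id within that site
    recorded_at timestamptz NOT NULL DEFAULT now(),
    payload     jsonb,
    PRIMARY KEY (site_id, record_id)
);
```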
I think that's absolutely true in the happy scenario when the internet is up.
There's a requirement that during outages each site continue operating independently and _might_ need to make writes to data "outside" its normal partition. By having active-active replication the hope is that the whole thing recovers "automatically" (famous last words) to a consistent state once the network comes back.
But if you drop the assumption that each site only writes rows prefixed with its site ID, then you're right back to the original situation where writes can be silently overwritten.
Not silently overwritten: the collision is visible to the application layer once connectivity is restored and you can prompt humans to reconcile it if need be.
That's correct: when the network comes back up we'll present users with a diff view and they can reconcile manually or decide to drop the revision they don't care about.
We're expecting this to be a rare occurrence (during partition, user at site A needs to modify data sourced from B). It doesn't have to be trivially easy for us to recover from, only possible.
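One hedged sketch of how the detection behind that diff view could work, reusing the hypothetical revision-style table from earlier (treating origin_node as the site id): anything touched by more than one site since the outage began gets flagged for manual reconciliation.

```sql
-- Hypothetical conflict scan once replication catches up after the partition heals:
-- entities with post-outage revisions from more than one site need a human decision.
SELECT doc_id
FROM document_revisions
WHERE revised_at >= $1                       -- $1 = when the partition started
GROUP BY doc_id
HAVING count(DISTINCT origin_node) > 1;
```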
One way we dealt with this in the past was assigning an "affinity" to each tenant and basically routing their writes/reads to that host, except when that host was down.
You would still get weird replication issues/conflicts when requests failed over in some conditions, but it worked fairly well the majority of the time.
These days I'd stick to single primary/writer as much as possible though tbh.
In principle you could use CRDTs to end up with a "not quite random" outcome that simply takes the conflict into account - it doesn't really attempt to "resolve" it. That's quite good for some cases.
This is a kind of CRDT. CRDT is just some papers defining reasonably clear terminology to cover the kind of eventually consistent replication that has been done for decades, including this kind (timestamp-based last-writer wins).
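As a concrete flavour of that, a timestamp-based last-writer-wins merge can be written directly as a conditional upsert in Postgres (table and values invented for illustration); applying local and replicated writes this way is commutative and idempotent, apart from exact timestamp ties.

```sql
-- Hypothetical LWW register: keep whichever write carries the newest timestamp.
CREATE TABLE settings (
    key        text PRIMARY KEY,
    value      text,
    updated_at timestamptz NOT NULL
);

-- Apply a write (local or replicated); anything older than what we already have is dropped.
INSERT INTO settings (key, value, updated_at)
VALUES ('theme', 'dark', now())
ON CONFLICT (key) DO UPDATE
SET value      = EXCLUDED.value,
    updated_at = EXCLUDED.updated_at
WHERE settings.updated_at < EXCLUDED.updated_at;
```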
Such a rabbit hole.