Author here > Again, academically, maybe not a CRDT because they will not "alway...

halfcat · 2024-08-28T13:56:56 1724853416

Help me understand this more clearly. At what level of abstraction does the CRDT definition apply? At the data structure level, the application level, the end user level?

We can, for instance, use a simple directed acyclic graph of immutable data types as a data structure that can handle asynchronous updates and result in eventual consistency across all computers.

We have a node in this DAG that says there’s a meeting at 1pm. Two people update this meeting time. One updates it to start at 2pm, and the other updates it to start at 3pm, and these changes happen simultaneously.

The data structure now has a tree with the original parent node (1pm), and two new child nodes with new times (2pm and 3pm). All computers are updated cleanly and contain the same updated data structure. All conflicts handled. Zero conflicts exist at the data structure level.

At the application level, the app now shows there’s a meeting “at 2pm or 3pm”. And all apps on all computers will reflect this same, consistent information.

But to the people, there is a conflict in the meaning of this data. Is the meeting at 2pm or 3pm? This is somewhat analogous to a git merge conflict, where the “what to do about it” depends on the meaning to a human.

Like I get the impression that a lot of people don’t consider the “meeting at 2pm or 3pm” to be a CRDT because it “doesn’t converge to a single value”.

But from a physics perspective, there’s some binary data, some changes happen, and the new binary data is now replicated to all end user devices in the same state.

jakelazaroff · 2024-08-28T14:52:18 1724856738

The CRDT definition applies at the data structure level.

Let’s model your meeting example as a set. Any local changes would atomically remove the current meeting time and add a new one.

Multiple concurrent edits could result in two meeting times in the set. In terms of user experience, that’s a conflict — but from the CRDT’s perspective, it’s still a single consistent state upon which all peers eventually agree.

Or, put slightly differently: the CRDT’s job is purely to make sure the peers can always agree on the same state. Any semantic conflicts within that state are an application-level concern.

Slightly shameless plug, but I wrote a post showing how a CRDT fits into a toy application if you want to see how the layers fit together: https://jakelazaroff.com/words/building-a-collaborative-pixe...

yazaddaruvala · 2024-08-28T15:32:30 1724859150

> In terms of user experience, that’s a conflict — but from the CRDT’s perspective, it’s still a single consistent state upon which all peers eventually agree.

So then Git uses a CRDT internally? Merge conflicts are user experience "semantic conflicts" within the Git application.

The internal state in the Git commit tree contain all commits consistent across the synced replicas. Without further inputs, all the repos would render the exact same user experience conflict.

"The data type i.e. Git's internal state is conflict free - but there are times the user needs to correct the semantic state."

"The Git application wants to guarantee semantic state, so unless the semantic conflicts are resolved, the Git internal state i.e. 'data type' will continue to be conflict-free using the 'rollback technique' to reconcile conflicts with the local copy's 'data type'."

Roughly: git fetch would be the conflict-free update of the internal data type. While git merge would be the application level, semantic conflict resolution.

kiitos · 2024-08-28T21:36:01 1724880961

> So then Git uses a CRDT internally?

No. Factually, inarguably, no.

If you want to argue against this claim, OK, have fun, but it's not interesting to engage, because there's no interesting conversation to be had. However, if you want to understand why this claim is true, great, ask whatever questions you want to ask, I'm happy to help you understand.

jakelazaroff · 2024-08-28T16:03:08 1724860988

I don’t know enough about Git’s internals to be able to say. My gut says no; the (state-based) CRDT “rules” are that merges must be commutative, associative and idempotent, and I think if you tried to merge conflicting changes on multiple remotes in different orders, you would get different internal states. Not sure though!

josephg · 2024-08-28T21:34:29 1724880869

Yeah I think you’re correct: Git’s internal data structure is a grow-only set CRDT (well, graph) of commit objects. Grow only sets of hashed contents are one of the simplest CRDTs out there - the network just implements set union.

If git only had commit and fetch, it would be a crdt (though not a very useful one). But git also needs to merge commits / branches together. And it does that via diff-match-patch, which doesn’t pass the crdt rules. (Since it’s not idempotent, associative, automatic and in many cases deterministic).

Groxx · 2024-08-29T03:00:12 1724900412

The content-addressed database for commit objects (but not their names): possibly, yes. It has trivial merge semantics, just keep everything and deduplicate.

But that's equally true of every content-addressed system. And entirely untrue for every named object in git, which are a critical component of git being git, instead of an opaque blobstore with zero semantics.

josephg · 2024-08-29T05:28:45 1724909325

Yes I agree. I still think its wrong to call Git a CRDT because git is a lot more than its content-addressable system. All that other stuff on top - you know, the parts you use to track and merge branches? That stuff isn't a CRDT.

Maybe its like asking if Wikipedia is a relational database. I assume wikipedia is implemented on top of a relational database. But the resulting wikipedia website? No, not a relational database.

yazaddaruvala · 2024-08-29T02:19:27 1724897967

Thank you, I appreciate you taking the time to agree and offer additional nuance. I agree with your nuance. That is the nuance that makes Git not a CRDT.

I can also appreciate academics need a time and place to be academic about their definitions (e.g. papers, and in-depth answers).

> What are some real world apps using CRDTs that have really good experiences?

I also feel, academics could do more to appreciate questions like this one are non-academic. The question above could just as easily be interpreted by many as "I'm building a multi-player application. I want it to be a really awesome experience but I don't know if I'll get the concurrency right. I've heard concurrency is hard. I've heard good things about CRDTs tho, should I be using them?".

To which we should absolutely be mentioning truly pure CRDT based applications/frameworks, but not mentioning similar but nuancedly different applications/frameworks, hurts 1. any less educated readers' ability to instinctively understand what a CRDT is; and 2. how it might help them create the next great multi-player application. It also hurts someone that has spent the last N nights trying desperately to make their pure CRDT application/framework fun and valuable but its just not happening. Allowing them to loosen up and be a bit more creative, helping to empower their application/framework because the universe of "asymptotically approaching a CRDT; but will definitely not have concurrency bugs/issues" (which doesn't have a nice academic name + explicit ruleset yet) is so much larger "than strictly a CRDT".

Meanwhile, I could if this was a paper or lecture have said "... are examples asymptotically approaching a CRDT; So don't really worry just because something is not strictly a CRDT doesn't mean you're gonna have bugs in your concurrency implementation". But "in passing" on a comment thread that should have been 30 seconds of all our time, who wants to specifically write all of that nuance every comment? In fairness I even tried to differentiate academic vs non-academic.

If you're still reading this thread (especially anyone that has been downvoting me) I hope you can appreciate this context.

josephg · 2024-08-29T05:57:43 1724911063

Yeah the question asking about real world use cases is a non-academic question. But that still doesn't mean git is a crdt.

Maybe its like someone asking "What are some other uses for javascript?" and someone says "Unity games!". There might be a way to use javascript when making unity games, but most unity games aren't made with any javascript. "Oh well C# is sort of close to javascript though?" - Yes, but its still not javascript. I don't think calling C# "mostly javascript" is a useful or helpful idea. To the expert its wrong, and for the novice its confusing.

We don't draw a line from "actually really javascript" to haskell or something and say C# is 80% javascript. "Javascript" as a term just means javascript. Not the space of things similar to javascript.

Its the same with CRDTs. "CRDT" doesn't mean "eventually consistent distributed data store". It just refers to the 'pure' thing. You know what else meets the definition of a CRDT, but is very dissimilar from a database? An integer, sent between peers, and merged using the MAX() function. You could also argue that the unit type is a (degenerate) CRDT. Just like how a straight line is sort of a triangle with a side length of 0.

Thats how I see the claim that "Git is a CRDT - well sort of". To experts, its wrong. And to everyone else its kinda misleading. Git isn't a CRDT in the same way C# isn't javascript. (And a web browser, V8, a javascript program, nodejs - all of these things are also not javascript.)

Now, Git's content-addressable blob store is a CRDT, so it might be fair to say git is "using a CRDT internally". But I hope you can see how that claim is kinda confused too. Git's branches and commit merging don't obey the CRDT merge rules. Because git has that stuff added, it makes it no longer a CRDT. Its the same as if you add a lump to the side of a triangle, its not a triangle any more.

jakelazaroff · 2024-08-29T15:24:24 1724945064

I like the max integer example because it’s clearly a CRDT and also not particularly useful on its own. I get the impression that people think CRDTs are some sort of magical thing that gives you bug-free sync, but it’s pretty easy to explain why naively using a single max integer to implement a distributed counter would be buggy. CRDTs are just tools, and like any tool they can be misused!

yazaddaruvala · 2024-08-28T15:04:37 1724857477

> Sure - you can add local transactions. But ...

> A transaction in a blockchain isn't generally considered to have happened until it’s been included in a block & witnessed throughout the network.

For the sake of argument:

- It’s not meaningful for Apple Notes’ local edits to only be local. Sure they will render, but if a tree falls and no one hears is, did it make a sound? So let’s ignore local edits.

- A sentence in Apple Notes isn’t generally considered to have happened until it’s been rendered in every client and users can see it if they look.

Is that any less fair of a statement?? Is this game of nuanced wording selection at all meaningful tho to answer the question “Is Apple Notes a popular application of CRDTs?”

Meanwhile, keep in mind in a blockchain entire sets of nodes could be firewalled from each other for days, each subsection performing consensus and creating blocks. Days later the firewall could collapse and the nodes will need to automatically reconcile. If I abstract away the internal replicas and say the two “offline parts” of the CRDT need to keep “local” state and reconcile once both/all parts come back online. Would this abstraction help meet the strict definition?

I know you probably still strongly disagree - but coming from a non-academic background these word games around CRDTs hurt rather than help people’s understanding and use of them.

Instead we should be asking, how can we improve Git as a CRDT (where perfect "conflict-free" is the enemy of good enough) to better automerge more conflicts? How can we improve the general theory of CRDTs based on what we learned from implementations in “video game A”, “blockchain Z”, and “web application framework J”?

josephg · 2024-08-29T06:51:10 1724914270

> - It’s not meaningful for Apple Notes’ local edits to only be local. Sure they will render, but if a tree falls and no one hears is, did it make a sound? So let’s ignore local edits.

Why would we ignore local edits? Notes is a note taking application. Not a communications platform. Local edits are the main point of the application.

> - A sentence in Apple Notes isn’t generally considered to have happened until it’s been rendered in every client and users can see it if they look

Of course it has! If I copy my favorite crepe recipe into a note on my phone, then open it up later, I don't care if that note has synced to my laptop. (Or anyone else I might be collaborating with). Local edits are meaningful and important.

The same is true of git. I use git locally on lots of projects that I don't even have any collaborators on, so I can track my work.

From the database standpoint, these are AP systems. They stay available (to reads and writes) in the face of network partitions. The blockchain is .. well, the blockchain is weird. But I think its closest to a CP system. You can't make transactions when you're disconnected from the chain. (If you could, you would risk double-spending).

I consider blockchains closer to traditional databases than eventually-consistent datastores built on CRDTs, because they have coordination and traditional atomic transaction semantics.

espadrine · 2024-08-28T12:06:18 1724846778

I would say blockchains fit this definition. What I get away from that is that this definition is not restrictive enough.

Similarly, a list of operations with Lamport timestamps fits this definition. However, just like the blockchain, it makes no effort to preserve user intent.

To go to an extreme, a merge operation which replaces the current content with the empty string, is a CRDT per this definition.

kiitos · 2024-08-28T21:33:22 1724880802

Blockchains absolutely do not satisfy the definition of a CRDT.

espadrine · 2024-08-28T22:52:57 1724885577

How would you change the definition so that they do not?

kiitos · 2024-08-29T21:48:13 1724968093

I don't need to change the definition. The existing (canonical, well-understood, etc. etc.) definition of a CRDT clearly excludes blockchain data structures.

jakelazaroff · 2024-08-29T02:13:05 1724897585

My understanding is that blockchains agree upon a shared state via consensus (i.e. coordinating with other nodes) rather than by a deterministic algorithm.

espadrine · 2024-08-29T09:45:19 1724924719

It is deterministic: the local updates must be grouped into a block linking to the longest-known chain, and the block is broadcast; upon receiving a block, all nodes must locally revert operations that are no longer part of the deepest chain, and apply operations that are.

It is a bit like last-writer-wins, except it uses the block history count instead of timestamps. Clearly they lose a lot of user intents.

I feel like CRDTs are associated with intent preservation. Perhaps rule 2 ought to include that inconsistency resolution ought to not discard operations. That would also exclude LWW and the “empty string” merge as CRDTs.

jakelazaroff · 2024-08-29T11:46:51 1724932011

IMO semantic “intent preservation” is an application-level concern that can affect which CRDTs are chosen, but it’s irrespective of whether the underlying data structure is a CRDT. The rules are that state merges must be commutative, associative and idempotent; if blockchains fulfill all three criteria, I don’t see why they wouldn’t count.

kiitos · 2024-08-29T21:14:52 1724966092

> The rules are that state merges must be commutative, associative and idempotent;

Yes!

> if blockchains fulfill all three criteria, I don’t see why they wouldn’t count.

Blockchains definitely do not fulfill all three criteria!

> IMO semantic “intent preservation” is an application-level concern that can affect which CRDTs are chosen, but it’s irrespective of whether the underlying data structure is a CRDT.

You're correct!

Whether or not a data structure is a CRDT is entirely a function of whether or not its Merge operation is associative, commutative, and idempotent. Application-level concerns, including (but not limited to) semantic intent preservation, are totally irrelevant.

jakelazaroff · 2024-08-29T21:25:39 1724966739

Okay, back to my original skeptical position lol. I don’t understand why CRDTs seem to attract so many people claiming that their favorite computing concepts count!

kiitos · 2024-08-29T21:34:31 1724967271

I'm not sure what your original skeptical position is, so shrug?

kiitos · 2024-08-29T21:11:34 1724965894

The "blockchain" notion of determinism you're referring to here, is far weaker than the notion of determinism used in the distsys context.

This discrepancy was pretty well documented in the Jepsen analysis of the Radix DLT -- https://jepsen.io/analyses/radix-dlt-1.0-beta.35.1

tl;dr: blockchains are not deterministic in the distsys sense

espadrine · 2024-08-30T08:28:11 1725006491

I don’t know Radix, so perhaps they use a different merge algorithm than the standard one. The article indicates, though, that nondeterminism was a bug that the company fixed. It does sound like it relies on validators, so perhaps they don’t have open membership, thus don’t allow updates from arbitrary nodes without going through those validators?

Either way, the standard blockchain algorithm doesn’t behave like that.

To be fair, I don’t have a stake in this beyond wanting unambiguous definitions.