What people usually miss about these things is normal version control benefits h...

TOGoS · on March 6, 2021

A lot of your comment went over my head, but I have modeled relational data in a way conducive to being stored in a Merkle tree. The trick being that every entity in the system ended up having two IDs. A hash ID, identifying this specific version of the object, and an entity ID (probably a UUID or OID), which remained constant as new versions were added. In a situation where people can have friends that are also people, they are friends with the person long-term, not just a specific version of the friend, so in that case you'd use the entity IDs. Though you might also include a reference to the specific version of the person at the point in time at which they became friends, in which case they would necessarily reference an older version of that person. If you friend yourself, you're actually friending that person a moment ago.

A list of all current entities, by hash, is stored higher up the tree. Whether it's better that objects themselves store their entity ID or if that's a separate data structure mapping entity to hash IDs depends on the situation.

On second reading I guess your comment was actually about how to come up with content-based IDs for objects. I guess my point was that in the real world you don't usually need to do that, because if object identity besides its content is important you can just give it an arbitrary ID. How often does the problem of differentiating between a graph containing identical Alices vs one with a single self-friending Alice actually come up? Is there any way around it other than numbering the Alices?

Ericson2314 · on March 6, 2021

> ...The trick being that every entity in the system ended up having two IDs...

I think we agree that this is a partial solution. Adding a temporal dimension and referencing immutable single-versions only can break cycles by making them unconstructable in the first place. But once an object refers to foreign entity IDs, hash IDs become "polluted" with surrogate values.

> How often does the problem of differentiating between a graph containing identical Alices vs one with a single self-friending Alice actually come up? Is there any way around it other than numbering the Alices?

I think it would come up with forks that have some identical edits, especially if those edits are in different orders. In that case, surrogate keygen state would get out of sync (whether it's counters or UUID state). Either we pessimize merges, or we need some way to recover.

I think allowing for "identical Alices" is probably necessary in practice, but an interface should have a warning of some sort about this. (Maybe ask the Alices more questions until you can differentiate them? Get prepared in case one of the alices comes back and wants a new I.D. card and you don't want to easily enable fraud.) Likewise when merging those warnings should be brought to the fore, along with a menu of resolutions at extremes for the user to decide between.

ghusbands · on March 6, 2021

> The salient aspect of relational data is that it's cyclic

This is an odd claim. Most relational data is not cyclic, and it's easy enough to come up with a scheme to handle cyclic data in a consistent fashion.

Conflicting changes (two changes to the same 'cell' of a database table) are a much more likely issue to hit and will need handling in much the same way merge conflicts are currently handled, so there are already situations in which manual effort will be needed.

chatmasta · on March 7, 2021

We have content addressable objects at Splitgraph. [0] And here’s an example of a “point in time” query across two versions of an image on Splitgraph. [1]

I’m on my phone and don’t have a ton of time to respond right now, but I’d recommend reading our docs. We’re working on a lot of what you mention.

(Also, we’re hiring for backend and frontend roles. See my comment history for more.)

[0] https://www.splitgraph.com/docs/concepts/objects

[1] https://www.splitgraph.com/workspace/ddn?layout=hsplit&query...

barrkel · on March 6, 2021

Consider the whole database - the whole set of facts across all relations - as the state in the tree. Each transaction a serialized delta that produces a new node in the tree, a new HEAD. That's closer to what's being gotten at, as I see it.

Transaction logs are already not that different to patch sets, and merge conflicts have an isomporphism with replication inconsistencies.

Ericson2314 · on March 6, 2021

> Consider the whole database - the whole set of facts across all relations - as the state in the tree.

I tried to demonstrate that this is easier said than done. Deciding the equality/redundancy of facts is very subtle. At some it might even be guess whether your two clerks each met the same Alice when entering in their DB forks or not.

Transactions are just patches, I completely agree. And patches are just partial functions. But deciding what the action is of a partial function, or whether the input is in the domain requires a notion of equality. (Partial functions are encoded very nicely with incomplete pattern matching; in this case the question is whether Alice matches the pattern.)

Basically I know where you are coming from, and I want it to work too, but you cannot just wave away the math and issues it points out.)