Hacker News new | past | comments | ask | show | jobs | submit login

What people usually miss about these things is normal version control benefits hugely from content addressing and normal forms.

The salient aspect of relational data is that it's cyclic, this makes content addressing unable to provide normal forms on it's own (unless someone figures out how to Merkle cylic graphs!), but the normal form can still made other ways.

The first part is easier enough, store rows in some order.

The second part is more interesting: making the choice of surrogate keys not matter (quotienting it away). Sorting table rows containing surrogate keys depending on the sorting of table rows makes for some interesting bags of constraints, for which there may be more than one fixed point.

Example:

  CREATE TABLE Foo (
    a uuid PRIMARY KEY,
    b text,
    best_friend uuid REFERENCES Foo(b)
  );
DB 0:

  0 Alice 0
1 reclusive Alice, best friends with herself. Just fine.

  0 Alice 1
  1 Alice 1
2 reclusive Alices, both best friends with the second one. The alices are the same up to primary keys, but while primary keys are to be quotiented out, primary key equality isn't, so this is valid. And we have an asymmetry by which to sort.

  0 Alice 1
  1 Alice 0
2 reclusive Alices, each best friends with the other. The Alices are completely isomorphic, and one notion of normal forms would say this is exactly the same as DB 0: as if this is reclusive Alice in a fun house of mirrors.

All this is resolvable, but it's subtle. And there's no avoiding complexity. E.g. if one wants to cross reference two human data entries who each assigned their own surrogate IDs, this type of analysis must be done. Likewise when merging forks of a database.

I'd love be wrong, but I don't think any of the folks doing "git for data" are planning their technology with this level of mathematical rigor.




A lot of your comment went over my head, but I have modeled relational data in a way conducive to being stored in a Merkle tree. The trick being that every entity in the system ended up having two IDs. A hash ID, identifying this specific version of the object, and an entity ID (probably a UUID or OID), which remained constant as new versions were added. In a situation where people can have friends that are also people, they are friends with the person long-term, not just a specific version of the friend, so in that case you'd use the entity IDs. Though you might also include a reference to the specific version of the person at the point in time at which they became friends, in which case they would necessarily reference an older version of that person. If you friend yourself, you're actually friending that person a moment ago.

A list of all current entities, by hash, is stored higher up the tree. Whether it's better that objects themselves store their entity ID or if that's a separate data structure mapping entity to hash IDs depends on the situation.

On second reading I guess your comment was actually about how to come up with content-based IDs for objects. I guess my point was that in the real world you don't usually need to do that, because if object identity besides its content is important you can just give it an arbitrary ID. How often does the problem of differentiating between a graph containing identical Alices vs one with a single self-friending Alice actually come up? Is there any way around it other than numbering the Alices?


> ...The trick being that every entity in the system ended up having two IDs...

I think we agree that this is a partial solution. Adding a temporal dimension and referencing immutable single-versions only can break cycles by making them unconstructable in the first place. But once an object refers to foreign entity IDs, hash IDs become "polluted" with surrogate values.

> How often does the problem of differentiating between a graph containing identical Alices vs one with a single self-friending Alice actually come up? Is there any way around it other than numbering the Alices?

I think it would come up with forks that have some identical edits, especially if those edits are in different orders. In that case, surrogate keygen state would get out of sync (whether it's counters or UUID state). Either we pessimize merges, or we need some way to recover.

I think allowing for "identical Alices" is probably necessary in practice, but an interface should have a warning of some sort about this. (Maybe ask the Alices more questions until you can differentiate them? Get prepared in case one of the alices comes back and wants a new I.D. card and you don't want to easily enable fraud.) Likewise when merging those warnings should be brought to the fore, along with a menu of resolutions at extremes for the user to decide between.


> The salient aspect of relational data is that it's cyclic

This is an odd claim. Most relational data is not cyclic, and it's easy enough to come up with a scheme to handle cyclic data in a consistent fashion.

Conflicting changes (two changes to the same 'cell' of a database table) are a much more likely issue to hit and will need handling in much the same way merge conflicts are currently handled, so there are already situations in which manual effort will be needed.


We have content addressable objects at Splitgraph. [0] And here’s an example of a “point in time” query across two versions of an image on Splitgraph. [1]

I’m on my phone and don’t have a ton of time to respond right now, but I’d recommend reading our docs. We’re working on a lot of what you mention.

(Also, we’re hiring for backend and frontend roles. See my comment history for more.)

[0] https://www.splitgraph.com/docs/concepts/objects

[1] https://www.splitgraph.com/workspace/ddn?layout=hsplit&query...


Consider the whole database - the whole set of facts across all relations - as the state in the tree. Each transaction a serialized delta that produces a new node in the tree, a new HEAD. That's closer to what's being gotten at, as I see it.

Transaction logs are already not that different to patch sets, and merge conflicts have an isomporphism with replication inconsistencies.


> Consider the whole database - the whole set of facts across all relations - as the state in the tree.

I tried to demonstrate that this is easier said than done. Deciding the equality/redundancy of facts is very subtle. At some it might even be guess whether your two clerks each met the same Alice when entering in their DB forks or not.

Transactions are just patches, I completely agree. And patches are just partial functions. But deciding what the action is of a partial function, or whether the input is in the domain requires a notion of equality. (Partial functions are encoded very nicely with incomplete pattern matching; in this case the question is whether Alice matches the pattern.)

Basically I know where you are coming from, and I want it to work too, but you cannot just wave away the math and issues it points out.)




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: