Thanks for the explanation, and I definitely agree with all your points. The rea...

dustingetz · on Nov 12, 2017

A lot of stuff in those links that I'm not familiar with. I'd expect challenges around what will happen when the log doesn't fit in memory? What if even the index you need doesn't fit in memory? If I understand, zlog is key-value-time, so I'm not sure what type of queries are interesting on that. Datomic is essentially a triple store with a time dimension so it can do what triple stores do. What do you think the use cases are for a key-value-time store?

noahdesu · on Nov 12, 2017

The log is mapped onto a distributed storage system, so fitting it in memory isn't a concern, though effective caching is important. The index / database also doesn't need to fit into memory. We cap memory utilization and resolve pointers within the index down into the storage system. Again, caching is important.

If I understand Datomic correctly, I can think of the time dimension in the triple store as a sequence of transactions each of which produces a new state. How that maps onto real storage is flexible (Datomic supports a number of backends which don't explicitly support the notion of immutable database).

So what would a key-value-time store be useful for? Conceptually it seems that if Datomic triples are mapped onto the key-value database then time in Datomic becomes transactions over the log. So one area of interest is as a database backend for Datomic that is physically designed to be immutable. There is a lot of hand waving, and Datomic has a large number of optimizations. Thanks for answering some of those questions about Datomic. It's a really fascinating system.

dustingetz · on Nov 12, 2017

My questions were in the context of effective caching in the query process, sorry that wasn't clear. A process on a single machine somewhere has to get enough of the indexes and data into memory to answer questions about it. Though I guess there are other types of queries you might want to do, like streamy mapreduce type stuff that doesn't need the whole log.

I will have to think about if a key-value-time storage backend would have any implications in Datomic. One thing Datomic Pro doesn't have is low-cost database clones through structure sharing, and it doesn't make sense to me why.

Here is a description of some storage problems that arise with large Datomic databases, though it's not clear to me if ZLog can help with this < http://www.dustingetz.com/ezpyZXF1ZXN0LXBhcmFtcyB7OmVudGl0eS..., >

noahdesu · on Nov 12, 2017

> A process on a single machine somewhere has to get enough of the indexes and data into memory to answer questions about it.

This common challenge is perhaps magnified in systems that have deep storage indexes. For example, the link you posted seems to suggest that queries in Datomic may have cache misses that require pointer chasing down into storage, adding latency. Depending on where that latency cost is eaten, it could have a lot of side effects (e.g. long running query vs reducing transaction throughput).

This problem is likely exacerbated in the key-value store running on zlog because the red-black tree can become quite tall. Aggressive, optimistic caching on the db nodes, and server-side pointer chasing helps. It's definitely an issue.

I don't know anything about Datomic internals, so disclaimer, I'm only speculating about why Datomic doesn't have low cost clones: which is that the underlying storage solution isn't inherently copy-on-write. That is, Datomic emulates this so when a clone is made there isn't an easy metadata change that creates the logical clone.

Despite the lack of features etc in the kv-store running on zlog, database snapshots and clones are both the same cost of updating the root pointer of the tree.