Document Database Transaction Models

Pamar · on July 30, 2021

Document databases are very convenient for modern application development, because modern application frameworks and languages are also based on semi-structured objects instead of tabular data.

Citation needed.

mjburgess · on July 30, 2021

All object-oriented programs have a data model which is "semi-structured" (ie., non-tabular).

Almost no software application data models are tabular.

jhgb · on July 30, 2021

Wait, you need a citation to show that commonly used languages have structs/records/classes with pointers/OOPs?

OK, in that case, I'm citing Java spec, C++ spec, C# spec, Javascript spec, Smalltalk spec, Common Lisp spec...

Pamar · on July 30, 2021

All the languages you listed worked fine to develop complex applications using "tabular data".

I find ... debatable that building yet again one more FB or IG challenger is significantly more important than maintaining or creating new Banking, ERP, Inventory Management solutions and for that stuff "tabular data" doesn't seem to be a bad match, or at the very least, using json or xml documents will not provide much of an improvement (all IMHO, of course...).

jhgb · on July 30, 2021

> All the languages you listed worked fine to develop complex applications using "tabular data".

Point is, it's not a native representation for them. If you just wanted to serialize a portion of a program's heap, independently of the program (which is presumably the "very convenient" part of it), it would be a graph of records or objects, not a bunch of tables. For example, no word processor to my knowledge represents the document being edited as a bunch of tables outside of the language's native data model. That's why document databases don't use tables for documents.

gryn · on July 30, 2021

the problem with non relational non tabular data, is that if there's no way to enforce strict schemas you end up with a lot of headaches.

the advantage of normalized data it that you can easily join it and transform into whatever format you want.

using objects/documents to store you data you can end up with the really annoying problems I've had in the past year at work, sub-sub-object has field/properties X/Y/Z it is assumed all objects would have it except some don't have it and some have it but it was used for something else that it remotely related but not the same.

now you who was not part of making this decision have to go hunt for patterns of correlation between subobjects, find out the illegal states of the global object, and decide wether the patterns you found were the results of an unnoticed bug, a valid edge case or just the result of poor design. keeping in mind that the some patterns are just way more popular than others and you cant just query the db "give me all the ways your objects and sub-objects varies depending on other fields".

another problematic things I saw: period 20xx/xx/xx to 20yy/yy/yy the informal schema was A, then it was changed to B, then after that to C. of course old data was not migrated to the new design each time, giving that the system work mostly in an online fashion that was not much of problem. expect the requirements changed or you found out something it not right and you need to recacultate a result that depdent of the date from the beggining up to now. now querying this data is a nightmare.

these are not problem of the document model itself but the people using it, but from my experience once you give it to people they slowly throw away all common sense on database design.

now if you want to enforce to quality into your stored data you'll find yourself hand writing part of the tools that you threw away when moving from more traditional databases.

At the end you have to ask you self was this worth it ?

jhgb · on July 30, 2021

> if there's no way to enforce strict schemas you end up with a lot of headaches.

I think the idea here is that your database schema is isomorphic to the restrictions on your in-memory data enforced by your application code: you can't store objects that you can't create in the first place, so that's your schema.

Of course systems like AllegroCache blur the difference between in-memory and stored data completely.

> another problematic things I saw: period 20xx/xx/xx to 20yy/yy/yy the informal schema was A, then it was changed to B, then after that to C. of course old data was not migrated to the new design each time, giving that the system work mostly in an online fashion that was not much of problem. expect the requirements changed or you found out something it not right and you need to recacultate a result that depdent of the date from the beggining up to now. now querying this data is a nightmare.

For example in AllegroCache this would be solved by usual object migrations (like in http://clhs.lisp.se/Body/f_upda_1.htm), perhaps even lazily as to not disrupt operation.

evanweaver · on July 30, 2021

These are fair critiques, and we expect document databases to evolve towards schema support in the the future. GraphQL is part of this trend.

Pamar · on July 30, 2021

I suspect that we are discussing one of those "if all you have is a hammer" cases here. In my experience most enterprise applications will prefer to have data in tabular form because making financial or management reports (including dashboards and analytics) based on a serialized portion of a program's heap is not really suitable.

Maybe I am just not "modern" enough.

gryn · on July 30, 2021

I've got to agree with you.

after the initial team that designed such document based system have gone to other jobs, once it got harder.

I found my self maintaining and evolving such system need to the changing needs of the business.

I've had to deal with the cases where customer were charged multiple times, some that were never charged at all, and all kind of similar problems.

I can trace almost all of them back to the original decision of lets just ditch a traditional DB for a NoSQL one, because it scales TM and geo redundancy will be easier for our infrastructure team.

jhgb · on July 30, 2021

> just ditch a traditional DB for a NoSQL one, because it scales TM and geo redundancy will be easier for our infrastructure team.

This doesn't sound like it has anything to do with the data model, though. The so-called "NoSQL" databases can be deficient for many other reasons.

jhgb · on July 30, 2021

Large-scale enterprise applications will generally favour the relational model simply because it's application-independent (which is something you're pointing out at the end). But documents are documents, they already have some defined structure, mostly a tree-like one.

gregwebs · on July 30, 2021

I think this is the much of the reason document dbs are popular. If you have an array in your struct, just save it.

Having used mongoDB in the past, for me it was a better day 1 experience. I didn’t need SQL, I just stored my data. The problem is that on day 2 there’s a different access pattern for your data but now it’s stored in a way that’s highly optimized only for that first application.

evanweaver · on July 30, 2021

If only there was some database that let you store flexibly structured documents but keep the data normalized. Perhaps you could even construct views and indexes to accelerate different access patterns.

bob1029 · on July 30, 2021

If only. Can you imagine if we also had some form of normalization so complete that it could actually manage an arbitrary number of dimensions?

Can you tell me how many times that users email address changed in that document over the last 3 years by executing a simple query? What was the email after the 2nd time it changed?

In 6th normal form, such a thing is trivial to manage. Discipline is the only price to pay for admission.

evanweaver · on July 30, 2021

Fauna is temporal, so yes. Normal forms not required.

bob1029 · on July 30, 2021

Does Fauna support multiple temporal dimensions in the same query?

evanweaver · on July 30, 2021

Can you give me an example of the kind of query you have in mind?

bob1029 · on July 30, 2021

I think your original claim is where I would like to focus my argument:

> Normal forms not required.

3NF (approximately where document databases live) struggles to decouple the time domain from individual facts. Let me give you an example.

Assume a Customer document has a LastModifiedUtc property. Does this tell you when their email specifically changed? No. It just says "something" in the customer document was modified.

Now, you could say "how about we just add a property per thing I want to track?" Ok - so we now have Customer.EmailLastModifiedUtc, Customer.NameLastModifiedUtc, etc. This is pretty good, but now assume we also need to know what the previous email addresses and names were. How do we go about this? Ok - no big deal, lets just add another column that is some JSON array or whatever. Customer.PreviousEmailAddresses. Cool, so now we know when the email was last modified AND we know what every last variant of it was.

What is missing? Oh right. What about when each of the previous email addresses actually changed? Who changed it? From what IP address? Certainly, we could nest a collection of documents within our document to keep track of all of this, but I hope the point is starting to come across that there may be some value in exploring higher forms of normalization. Imagine if I wanted to determine all of the email addresses that were modified by a specific IP address (across ALL customers), but only over the last 25 days. I feel like this is entirely out of the scope of a document database.

Don't get me wrong. 3NF is extremely powerful and handles many problem domains with total ease. But, once you start talking about historization of specific fields and rates of change, you may need to consider something higher order.

evanweaver · on July 30, 2021

This is possible in Fauna. All documents are actually collections of document versions within the configurable retention period. If you ensure that every writer decorates the document with the facets you want to search by (ip address, etc.) then you can construct indexes on those facets and query them temporally. They will return event records that show when the document entered the index (when that ip updated it) and left the index (when a different ip updated it).

Map the index additions and their timestamps onto the documents themselves and you can retrieve the entire state of each record that the ip wrote at the time that it wrote it. If you want to know specifically what that ip changed, then diff it with the previous record, for example, to filter down to updates that only changed the email address.

jimsimmons · on July 30, 2021

You can have never a system that’s capable of all types post hoc querying. If your model is general enough to handle all the things you want you won’t have a performant system as it can’t exploit any information about the problem. The only thing I can think of that’s capable of all you describe is a Write Ahead Log without any compaction.

bob1029 · on July 30, 2021

> If your model is general enough to handle all the things you want you won’t have a performant system as it can’t exploit any information about the problem

Define "performant". Denormalizing your domain model because you feel like the database might get slow is a strong case of premature optimization, unless you have actually tried to model it this way and have measured.

You will find that most modern SQL database systems have no problem querying databases with thousands or tens of thousands of tables. In fact, having narrow tables can dramatically improve your utilization of memory bandwidth since you aren't scanning over a bunch of bytes you will never use in the result set.

teitoklien · on July 30, 2021

Sounds a lot like sql databases with json extension to me

yellowapple · on July 30, 2021

I suspect that might indeed have been the joke.

jhgb · on July 30, 2021

> In the distant past of the 2010’s, document databases didn’t offer transaction support, instead implementing various forms of eventual consistency. Vendors and open-source maintainers promoted the idea that transactionality was an unnecessary, complexifying feature that damaged scalability and availability—and many claimed that adding it to their systems was impossible.

I find it interesting that the proposed explanation for lack of transactions in document databases is the CAP theorem which covers distributed systems, NOT document databases. Clearly such thing as a document database with transactions is not impossible if, for example, GemStone/S can support transactions just fine.

evanweaver · on July 30, 2021

GemStone/GemFire use a transactional protocol akin to Tuxedo. Open a bunch of locks, write a bunch of updates, release the locks. As per the docs (https://gemfire82.docs.pivotal.io/docs-gemfire/latest/develo...) this does not offer isolation or even atomicity, so it doesn't give you the C in CAP at all.

These are exactly the kind of "transactions" you get when you try to implement everything at the application level rather than the database level. Couchbase transactions (in the article) are the same. And it's not that different from Vitess cross-shard transactions either, which are not isolated (https://vitess.io/docs/reference/features/two-phase-commit/). Tandem SQL used the same scheme as well I believe.

Prior to Spanner, there were no production databases that offered ACID transactions across distributed, disjoint shards.

jhgb · on July 30, 2021

I'm sorry, what does Gemfire have to do with Gemstone/S? That seems like a completely different software from a different vendor.

> Open a bunch of locks, write a bunch of updates, release the locks.

That's how transactional databases using two-phase locking generally work, isn't it?

evanweaver · on July 30, 2021

I thought GemFire was directly derived from GemStone, via numerous acquisitions. If GemStone has a different transaction model I don't know it.

The point is, no distributed database with a naive two-phase lock is truly transactional.

jhgb · on July 30, 2021

Gemstone/S is a "classic" Smalltalk-based object database developed by GemTalk Systems. On the other hand, Gemfire, courtesy of VMWare, seems to be something completely different: https://gemfire.docs.pivotal.io/96/gemfire/getting_started/g...

I have no idea how you can draw similarities between the two. One of them is basically a Smalltalk VM with automatic object persistence. The other is some kind of real-time Java-based distributed key/value data platform.

evanweaver · on July 30, 2021

As far as I can tell, GemStone/S doesn't offer any server-side partitioning, clustering, or replication. GemFire was developed to scale the GemStone/S patterns horizontally.

The GemStone transaction docs describe a scheme that would work properly on a single machine, but don't discuss anything about distributed coordination across servers or failure modes. The installation instructions don't discuss setting up a cluster. The marketing docs discuss using thousands of VMs (clients) and scaling the dataset to "hundreds of gigabytes" based on disk storage instead of memory which is not what I would expect from a distributed system. Various benchmarks and user comments refer to using a single server for GemStone.

I will update the post to clarify that we are discussing distributed document databases only. It's easy to do anything you want on a single machine.

jhgb · on July 30, 2021

But of course I didn't say anything about partitioned and replicated systems, only about the options for document databases. The article of course is about "transaction models in document databases" and the observation that "document databases are very convenient for modern application development, because modern application frameworks and languages are also based on semi-structured objects", which Gemstone/S's model fits nicely since it's based on the same assumption.

vvern · on July 30, 2021

> Additionally, out of band coordination, most likely via human beings, is required to make sure that all potential readers of the transactional writes are also transaction-aware.

Subtle burn on the mongo and its client-coordinated session causality model.

sargun · on July 30, 2021

Given that most application developers started with something like a postgresql, or a MySQL that has pessimistic locking, and thus transactions rarely end in read / write conflicts (aborts), how have people (and potentially programming languages) adapted to optimistic concurrency control?

Also, when you say:

> On the other hand, server-side transactions use a more typical pessimistic relational lock. They open a transaction, do some work, and then commit it.

What do you mean? What are server-side transactions?

evanweaver · on July 30, 2021

We haven't seen much difference in practice between frequent aborts and frequent timeouts. Both are better than deadlocks.

I meant transactions issued from a server ("cloud") client to the database, as opposed to a mobile client.

jhgb · on July 30, 2021

> how have people (and potentially programming languages) adapted to optimistic concurrency control

If they started with PostgreSQL, then they're already adapted to optimistic concurrency control, right?

sargun · on July 30, 2021

Having played with the Firebase Python client, outstanding writes within a transaction are "kind of" exposed because you fetch the document, and make alterations to it before saving it (within a "transaction"). That object, when you make alterations to it, bubbles up the changes -- AFAICT, this is a bit of client side hack, but it's ergonomically wonderful.

evanweaver · on July 30, 2021

I guess most ORMs are like that. Is the object shared by reference across the entire runtime or do you end up with divergent objects?

sargun · on July 30, 2021

It's only valid in the context of that function invocation. The docs say not to write impure functions or do anything with concurrency -- if you start two simultaneous transactions, the client isn't "smart" about it unfortunately.

evanweaver · on July 30, 2021

I’m around to answer questions and discuss.

Anybody who has used Couchbase transactions, or sharded Mongo transactions, and can corroborate our analysis?

jimsimmons · on July 30, 2021

I like the presentation of ideas. Can you expand on use cases of Firebase’s nested document model? Seems powerful as a file system but not sure how that will play with the complexities of distributed applications.

evanweaver · on July 30, 2021

Firebase was originally designed more as a realtime communication mechanism than an operational database. The idea was that clients would subscribe to different nodes in a data hierarchy to receive realtime notifications from other clients that were publishing to those nodes. Depending on what was in the client view, sometimes you wanted to subscribe to a leaf, sometimes to a subtree, sometimes to everything.

As these things tend to go, when there is a place to store arbitrary data, all kinds of things get shoved into it, so the mixed model in Firestore is a compromise between the original tree-of-nodes data model and a more conventional document data model.

My assumption is the Firestore-to-Spanner mapping creates subcollections as shared tables with foreign keys to the parent documents, but I don't actually know. However, that would match the mandatory 1-to-many-to-1-to-many data layout, and makes more sense than shoving all the dependent data into the document itself or creating multiple millions of SQL tables for millions of documents.

jimsimmons · on July 30, 2021

This is classic HN speak but it seems like it should be easy to build a Raft or Paxos consensus based sharded database. We have had plenty of Spanner esque implementations since it came out. Would it be too hard to take something like TiDB and modify it to support nested document model? Of course testing and bringing it up to production levels is its own feat but nonetheless I’m surprised such an interesting DB avenue gets so little attention.