If anyone from the dolt team is reading this, I'd like to make an enquiry:
At bugout.dev, we have an ongoing crawl of public GitHub. We just created a dataset of code snippets crawled from popular GitHub repositories, listed by language, license, GitHub repo, and commit hash, and are looking to release it publicly and keep it up to date with our GitHub crawl.
The dataset for a single crawl comes in at about 60GB. We uploaded the data to Kaggle because we thought it would be a good place for people to work with the data. Unfortunately, the Kaggle notebook experience is not tailored to such large datasets. Our dataset is in a SQLite database. It takes a long time for the dataset to load into Kaggle notebooks, and I don't think they are provisioned with SSDs, as queries take a long time. Our best workaround for this is to partition it into 3 datasets on Kaggle - train, eval, and development - but it will be a pain to manage this for every update, especially as we enrich the dataset with results of static analysis, etc.
I'd like to explore hosting the public dataset on Dolthub. If this sounds interesting to you please, reach out to me - email is in my HN profile.
Yeah, that sized database is likely to be a challenge unless the computer system it's running on has scads of memory.
One of my projects (DBHub.io) is putting effort towards working through the problem with larger sized SQLite databases (~10GB), and that's mainly through using bare metal hosts with lots of memory. eg 64GB, 128GB, etc.
Putting the same data into PostgreSQL, or even MySQL, would likely be much more efficient memory wise. :)
We have 200 GB databases in dolt format that are totally queryable. They don't work well querying on the web though - you need a local copy to query it effectively. Making web query as fast as local is an ongoing project.
Uh, personal question here. Where does your ~10G number come from? I pretty much run my life on the Apple Notes app. My Notes database is about 12G and now I’m scared.
Oh, it's just that the vast majority of SQLite databases we see are pretty tiny. eg a few MBs at most
So, resource usage is pretty much not a consideration in any way when doing things with them.
Once people start doing things with multi-GB databases though, if the system they're running on has a small amount of memory (say 4GB, 8GB) then things can start going poorly.
Not like "crash and burn" poorly (so far). More like "tries to read 10GB of data into 4GB of ram" poorly. eg Dog slow, not lightning quick.
If your machine has a bunch of RAM in it, you're likely safe. Though, I'd be making damn sure there are tested and working backups of it, just to be safe. :)
I missed this answer the first time around. Thanks so much for the reply. I always get machines with the maximum amount of RAM precisely for this reason.
You have other options too. If I have time I can try to reduce the size with a columnar format that is designed for this use case (repeated values, static dataset).
That would be really great. Let me know if there's any way we can help. Maybe if we released a small version of the dataset for testing/benchmarking and then I could take care of running the final processing on the full dataset?
The preliminary tests show significant reduction in space usage when using Parquet over Sqlite3. This is not unexpected at all. Parquet is much better for analytical use.
What a lot of webmasters want is, test the site locally, then merge it back. A lot of people turned to Jekyll or Hugo for the very reason that it can be checked into git, and git is reliable. A static website can’t get hacked, whereas anyone who has been burnt with Wordpress security fail knows they’d prefer a static site.
And even more: People would like to pass the new website from the designer to the customer to managers — Wordpress might have not needed to develop their approval workflows (permission schemes, draft/preview/publish) if they had had a forkable database.
At the root of this lies the problem that content, configuration, and code are stored in one blob (a single DB), never clearly bounded nor contained. Config is spread over tables. User-generated content (from comments to orders) is mixed up with editorial content: often in the same tables, sometimes even in the same column. Drupal suffers the same.
What is needed is a clear line between configuration (and logic), editorial content, and user-generated content. Those three have very distinct lifecycles. As long as they are treated as one, those lifecycles will conflict. No matter how good your database merging is: the root cause is a missing piece of architecture.
Very well summarised. I can see clearly you too have tried to apply normal software lifecycle principles and come out the other end with a rather dismal impression of WordPress.
Sure, but Wordpress is still running PHP files for every page load and back-end query. Dolt would help offload some of the code’s complexity, but that would still leave a significant attack surface area.
In other words, by itself Dolt couldn’t solve the problem of Wordpress not being run mostly from static assets on a server (plus an API).
The parent post here is speaking to the content stored in the database, not the templates on the filesystem. Dolt enables one to merge, push and pull data, just as one would with files and git.
thanks for the mention of TerminusDB (https://github.com/terminusdb/terminusdb) -> we are the graph cousins of Dolt. Great to see so much energy in the version control database world!
We are currently focused on data mesh use cases. Rather than trying to be GitHub for Data, we're trying to be YOUR GitHub for Data. Get all that good git lineage, pull, push, clone etc. and have data producers in your org own their data.
We see lots of organizations with big 'shadow data' problems and data being centrally managed rather than curated by domain experts.
Dolt might be good, but never underestimate the power of Type 2 Slowly Changing Dimension tables [1]. For example, if you had an SSIS package that took CSVs and imported them into a database, and one day you noticed it accidentally rounded a value incorrectly, you could fix the data and retain traceability of the data that was there originally.
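For illustration, here's a minimal Type 2 SCD fix in SQLite (the table and values are hypothetical): close out the bad row with an end date and insert the corrected one, so the original value stays traceable.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE prices (
        item       TEXT,
        price      REAL,
        valid_from TEXT,
        valid_to   TEXT  -- NULL marks the current row
    )
""")
# The import accidentally rounded 19.99 down to 19.0.
con.execute("INSERT INTO prices VALUES ('widget', 19.0, '2021-01-01', NULL)")

# Type 2 fix: close out the bad row, then insert a corrected one.
con.execute("UPDATE prices SET valid_to = '2021-02-01' "
            "WHERE item = 'widget' AND valid_to IS NULL")
con.execute("INSERT INTO prices VALUES ('widget', 19.99, '2021-02-01', NULL)")

current = con.execute("SELECT price FROM prices "
                      "WHERE item = 'widget' AND valid_to IS NULL").fetchone()[0]
history = con.execute("SELECT COUNT(*) FROM prices "
                      "WHERE item = 'widget'").fetchone()[0]
print(current, history)  # 19.99 2
```

Queries for the current value filter on `valid_to IS NULL`, while auditors can still see the bad 19.0 row and exactly when it was superseded.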
I definitely agree, just tossing in the superset concept that Dolt and Type 2 SCD involve - temporal databases [1].
I think the idea of a "diff" applied to datasets is quite awesome, but even then, we kind of do that with databases today with data comparison tools - it's just most of them are not time aware, rather they are used to compare data between two instances of the data in different databases, not at two points in time in the same database.
The commit log in Dolt is edit history. (When did someone import or edit the data? Who made the change?) It's not about when things happened.
To keep track of when things happened, you would still need date columns to handle it. But at least you don't need to handle two-dimensional history for auditing purposes. So, in your example, I think the "effective" date columns wouldn't be needed.
They have ways to query how the dataset appeared at some time in the past. However, with both data changes and schema changes being mixed together in a single commit log, I could see this being troublesome.
I suppose writing code to convert old edit history to the new schema would still be possible, similar to how git allows you to create a new branch by rewriting an existing one.
However, I'm struggling to figure out a real-world use case for this. I'd love if anyone here can enlighten me.
I don't see how it can be for production databases involving lots of users, because while it seems appealing as a way to upgrade and then roll back, you'd lose all the new data inserted in the meantime. When you roll back, you generally want to roll back changes to the schema (e.g. delete the added column) but not remove all the rows that were inserted/deleted/updated in the meantime.
So does it handle use cases that are more like SQLite? E.g. where application preferences, or even a saved file, winds up containing its entire history, so you can rewind? Although that's really more of a temporal database -- you don't need git operations like branching. And you really just need to track row-level changes, not table schema modifications etc. The git model seems like way overkill.
Git is built for the use case of lots of different people working on different parts of a codebase and then integrating their changes, and saving the history of it. But I'm not sure I've ever come across a use case for lots of different people working on the data and schema in different parts of a database and then integrating their data and schema changes. In any kind of shared-dataset scenario I've seen, the schema is tightly locked down, and there's strict business logic around who can update what and how -- otherwise it would be chaos.
So I feel like I'm missing something. What is this actually intended for?
I wish the site explained why they built it -- if it was just "because we can" or if projects or teams actually had the need for git for data?
The application backing use case is best suited for when you have parts of your database that get updated periodically and need human review. So you have a production database that you serve to your customers. Then you have a branch / fork of that (dev) that your development team adds batches of products to. Once a week you do a data release: submit a PR from dev -> prod, have somebody review all the new copy, and merge it once you're happy. If there's a big mistake, just back it out again. We have several paying customers building products around this workflow.
As for lots of people collaborating on data together, we have started a data bounties program where we pay volunteers to assemble large datasets. Two have completed so far, and a third is in progress. For the first one, we paid $25k to assemble precinct-level voting data for the 2016 and 2020 presidential elections. For the second, we paid $10k to get procedure prices for US hospitals. You can read about them here:
What's cool is that novices can make a really good income from data entry as a side gig, and it's two orders of magnitude cheaper than hiring a firm to build data sets for you.
You're right that the site is kind of vague about what dolt is "for." It's a really general, very multi-purpose tool that we think will get used a lot of places. Here's a blog we wrote a while back about some of the use cases we envision.
This won't work for typical database use cases. It's meant for interactive work with data, the same way you work with code.
Who needs that?
Data scientists working with large datasets. You want to be able to update data without redownloading everything.
Also, make your local changes (some data cleaning) and propose your updates upstream the same way you would with git.
Having many people working interactively with data is common here.
One of the companies I work with provides a set of data distributed to their partners on a daily basis.
Once it grew larger, downloading everything daily became an issue, so something like this would be desirable.
I have large data model that I need to deploy to production and update once in a while.
For code, network usage is kept to minimum because we have git. For data, options are limited.
As with git, it is something that once you have, you will find a lot of usecases that make life easier and open many new doors.
Here's one thing I'd have used it for: Video game assets.
Say you have a tabletop game engine for designing starships. Different settings have different lists of parts. Some settings are run by a game's DM, some are collaborative efforts. I ended up saving the lists of parts in huge JSON files and dumping those into git. However, for much the same reason that data science is often done in a REPL or notebook type interface, it turned out that by far the most efficient way for people to iterate on assets was to boot up the game, fiddle with the parts in-engine until things looked right, then replicate their changes back into the JSON. With this, we could just save the asset database directly.
The same reasoning should hold for effectively any dataset which a) can be factored into encapsulated parts b) isn't natively linear c) needs multiple developers. Game assets are one example, as I described above. Other datasets that hold: ML training/testing sets, dictionaries, spreadsheets, catalogs, datasets for bio papers.
I might look at it for work. Compliance requires us to keep track of who made what change when, and who it was approved by in case regulators need it.
Right now, this often means making an MR on Git with your snippet of SQL, getting it approved, and then manually executing it. This would let us bring the apply stage in, as well as avoid "that SQL query didn't do exactly what I expected" issues.
It's possible to do it within the SQL engine, but then I have to maintain that, which I would prefer not to do. As well as dealing with performance implications from that.
I see a use-case for public resource data sets. For example, right of way, land use variance, permits, and land records. These are fundamentally public data sets that are often maintained by entities with limit budgets and the potential (even incentive) for fraud is substantial. Also, there is significant value in being able to access the data set for analytical purposes such as real estate analysis.
I am not associated to Dolt, but I really like the idea of Dolt personally. I do see use cases, but not without challenges.
One of the main use cases you can see them targeting, and that I think makes a ton of sense, is providing tools for collecting, maintaining and publishing reference data sets using crowd sourcing.
For example, they are doing this with hospital charge codes (a.k.a. chargemaster data). Hospitals in the US are required to publish this data for transparency; however, I have never seen a single aggregated national (or international) data set of all these charges. In fact, such a data set could be worth a lot of money to a lot of organizations for many reasons. I used to work in health insurance, gathering data from all kinds of sources (government rules/regs, etc.), and it was a lot of hard work: scraping, structuring, maintaining, etc.
This reference data can be used for analytics, to power table-driven business logic, machine learning - to help identify cost inequalities, efficiencies, maybe even illicit price gouging, etc. There are so many reference data sets that have similar characteristics... and "data marketplaces" in a way are targeted at making "private" reference data sets available for sale - so then where is the "open" data marketplace? Well, here you go.. Dolt.
I have often realized that the more ways we can make things collaborative, the better off we will be.
Data is one of those things where, coming up with common, public reference datasets is difficult and there are lots of different perspectives ("branches"), sometimes your data set is missing something and it would be cool if someone could propose it ("pull request"), sometimes you want to compare the old and new version of a dataset ("diff") to see what is different.
One difficult thing about Dolt is, it will only be successful if people are actually willing to work together to cook up and maintain common data sets collaboratively, or if those doing so have an incentive to manage an "open data" project on Dolt as benevolent maintainers, for example. But, I could say then it has the same challenges as "open source" in general, so therefore it is not really that different.
Dolt could even be used as a foundation for a master data management registry - in the sense of you could pop it in as a "communal data fountain" if you will where anybody in your org, or on the Web, etc. could contribute - and you can have benevolent maintainers look over the general quality of the data. Dolt would be missing the data quality/fuzzy matching aspect that master data tools offer, but this is a start for sure.
For example, I work in a giant corporation right now. Each department prefers to maintain its own domain data and in some cases duplicates common data in that domain. Imagine using Dolt to make it possible for all these different domains to collaborate on a single copy of common data in a "Dolt" data set - now people can share data on a single copy and use pull requests, etc. to have an orderly debate on what that common data schema and data set should look like.
I think it's an idea that is very timely.
P.S. Dolt maintainers, if you read this and want to talk, I'm game! Awesome work :)
Your final example is very like the data mesh ideas coming out of ThoughtWorks and elsewhere (https://martinfowler.com/articles/data-mesh-principles.html). Data products being owned by data producers, and common data models ONLY when required. It is as much an organizational model as a technology, but I don't really think this maps to SQL tables. You probably want to look at a versioned knowledge graph of some kind. Downside is no SQL, upside is flexibility and speed. (disclaimer - I work over at TerminusDB, the graph version of Dolt)
You're absolutely right that what I'm saying is like data mesh.
The data mesh ideas aren't necessarily new - as this idea that data has been managed/owned by what some would call "federated teams" (optimistic view) and others might call "silos" (pessimistic view) isn't really new - especially in certain industries.
Certainly, the current energy and focus on building collaborative organizational service models, process, standards, etc. and standardizing this around the concept of data mesh is "fresh" and very much on the tip of many people's minds.
I worked at a company that had built a tool which was a data federation engine, and we dabbled with the concepts of "data fabric", "mesh", "network" etc. but we were more talking about tech/implementation vs. architecture and organizational structure. Still, we thought the right tech could enable this architecture/organizational structure and that it wasn't something totally figured out.
Thanks for sharing and mentioning it, and also Terminus. Certainly, I agree that knowledge graph is another approach and one that can also fit very well with the idea of linking data that is both common and specific to domains.
However, I've rarely seen graph databases that had the kind of versioning/time travel that a temporal database (such as relational ones) had built in natively (or anything like "diff").
I don't agree that this can't map to SQL tables. Tables and relations (tuples) are just a form of data. People do collaborative data management using relational master data management systems already today, and have been for many, many years.
That is not to say that graphs aren't a good match for storing common/reference datasets or linking data together. Given that graphs can be considered a superset of relational model, I don't see any reason that you can't do the same thing in graph databases, just that doing diffs and enabling the same kind of workflow might be more difficult in a graph data model - but definitely not impossible.
Lest we forget things like "Linked Data" [1], triple stores, etc. also that have been trying to get at linked data across the entire Internet. However, I never saw collaborative data management as part of that vision, it was mostly for querying.
agree 100% this looks awesome for reference datasets.
not so sure how well it would work for live data sources that update with time as it could encourage people to apply more ad-hoc edits as opposed to getting their version controlled jobs to work 100%, but who knows, maybe that would be a net win in some cases?
Or more simply put, how about table-driven logic in general? It doesn't have to be as complex as machine learning. There are more use cases than just machine learning, IMHO.
Yes you need reference data for machine learning, but the world isn't only about machine learning. You might want reference data for human-interpreted analytics, table-driven logic (business rule engines, for example), etc.
It might not be exactly what you are looking for, but git-bug[1] is encoding data into regular git objects, with merges and conflict resolution. I'm mentioning this because the hard part is providing an ordering of events. Once you have that you can store and recreate whatever state you want.
This branch[2], which I'm almost done with, removes the purely-linear-branch constraint and allows full DAGs (that is, concurrent editing) while still providing a good ordering.
It’s a bit slower, but smart use of partial/shallow clones can address performance degradation on large repositories over time. You just need to take care of the transformation between “physical” trees/blobs and “logical” objects in your dataset (which may not have 1:1 mapping, as having physical layer more granular reduces likelihood of merge conflicts).
I’m also following Pijul, which seems very promising in regards to versioning data—I believe they might introduce primitives allowing to operate on changes in actual data structures rather than between lines in files, like with Git.
Add to that a sound theory of patches, and that's a definite win over Git (or Dolt for that matter, which seems to be the same old Git, but for SQL).
Lack of locking for binary files, overhead once you cross ~1GB, and all the shenanigans you need to do for proxy servers. There are better solutions out there, but they aren't free.
Sure keep in mind that my data is a little old but last time I peeked into the git LFS space it seemed like there were still a few gaps.
First, most of my background in this area comes from gamedev so YMMV if the same applies in your use cases.
For our usage we'd usually have a repo history size that crossed the 1TB mark and even upwards of 2-3TB in some cases. The developer sync was 150-200GB, the art sync was closer to 500-600GB and the teams were regularly churning through 50-100GB/week depending on where we were in production.
You need discipline specific views into the repo. It just speeds everything up and means that only the teams that need to take the pain have to. From a performance perspective Perforce blows the pants off anything else I've seen, SVN tries, but P4 was easily an order of magnitude faster to sync or do a clean fetch.
I've seen proxy servers done with git, but it's usually some really hacky thing scripted together with a ton of duct tape and client-specific host overrides. When you have a team split across East Coast/West Coast (or another country) you need that proxy so that history is cached in a way that it only gets pulled in locally once. Having a split push/pull model is asking for trouble, and last I checked it wasn't clear to me whether stuff like git LFS actually handles locking cleanly across it.
From an overhead perspective git just falls over at ~1GB (hence git LFS, which I've seen teams use to varying degrees of success based on project size). The need to do shallow history and sidestep resolving deltas is a ton of complexity that isn't adding anything.
With a lot of assets, merging just doesn't exist and a DVCS totally falls over here. I've seen fights nearly break out in the hallway multiple times when two artist/animators both forgot to checkout a file(usually because someone missed the metadata to say it's an exclusive access file). With unmergeable binary files that don't get locked your choice is who gets to drop 1-3 days of work on the floor when the other person blows away their changes to commit. If those changes span multiple interconnected packages/formats/etc you have a hard fork that you can never bring back together.
There's a couple other details but that's the large ones, Perforce worked incredibly well in this space but it is not cheap and so I've seen teams try to go their own way to mixed success. I'll admit that you can't do a monorepo in P4(and even tools like repo in the Android world have their problems too) but if you segregate your large business/product lines across P4 repos it scales surprisingly well.
Anyway, you may or may not hit any or all of this but I've yet to see git tackle a 1TB+ repo history well(and things like repo that uses many mini-repos doesn't count in my book due to the lack of atomicity across submissions that span multiple repos).
In my case it’s different since Git isn’t accessed by users directly, rather I’m working on some tools that work on top of Git (on user’s machine). Data is primarily text-based, though sometimes binary assets come up (options for offloading them out of Git are being investigated).
So far there were no major issues. I predict degradation over time as repos grow in size and history (Git is not unique in this regard, but it’ll probably be more rapid and easier to observe with Git), so we might start using partial cloning.
(I stand by the idea that using straight up Git for data is something to consider, but with an amendment that it’s predominantly text data, not binary assets.)
Yeah, my experience has been that you start seeing issues with long delta decompression times around the 1-2GB mark. That climbs quicker if you have binary formats that push the delta compression algorithm into cases where it does poorly (which makes sense, since it was optimized for source code).
If you have binary assets and they don't support merging or regeneration from source artifacts that mandates locking(ideally built into SCM but I've seen wiki pages in a pinch at small scale).
I strike a balance here by using git on JSON files, and building the JSON files into a database (1 file per record, 1 directory per table, subdirectories for indexes). The whole thing is pretty beautiful, and it's functioning well for a user-account/access-management database I'm running in production. I like that I can go back and do:
`git diff -p` to see the users who have signed up recently, for example.
The advantages of this approach are using existing Unix tooling for text files, solid versioning, easy inspectability, and leveraging the filesystem's B-tree indexing as a fast index structure (rather than having to write my own B-trees). Another advantage is hardware-linked scaling: if I use regular hard disks, it's slower, but if I use SSDs it's faster. And it should also be possible to mount the DB as a RAM disk and make it super fast.
The disadvantages are that the database side still only supports a couple of operations (like exact, multikey searches, lookup by ID, and so on) rather than a rich query language. I'm OK with that for now, and I'm also thinking of using skiplists in future to get nice ordering property for the keys in an index so I can easily iterate and page over those.
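A minimal sketch of the one-file-per-record layout described above (the paths and field names are made up; `sort_keys` keeps the JSON output stable so git diffs stay clean):

```python
import json
import os
import tempfile

root = tempfile.mkdtemp()  # stand-in for the git working tree

def put(table, record_id, record):
    # One directory per table, one JSON file per record.
    table_dir = os.path.join(root, table)
    os.makedirs(table_dir, exist_ok=True)
    with open(os.path.join(table_dir, record_id + ".json"), "w") as f:
        json.dump(record, f, indent=2, sort_keys=True)  # stable output -> clean diffs

def get(table, record_id):
    with open(os.path.join(root, table, record_id + ".json")) as f:
        return json.load(f)

put("users", "u1", {"email": "alice@example.com", "role": "admin"})
print(get("users", "u1")["role"])  # admin
```

With the working tree under git, `git log -p users/` then gives you the full edit history of the table for free.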
Git is too complicated. It's barely usable for daily tasks. Look at how many people have to Google for basic things like uncommitting a commit, or cleaning your local repo to mirror a remote one. Complexity is a liability. Mercurial has a nicer interface. And now I see the real simplicity of non-distributed source control systems. I have never actually needed to work in a distributed manner, just client-server. I have never sent a patch to another dev to patch into their local repo or whatnot. All this complexity seems like a solution chasing after a problem - at least for most developers. What works for Linux isn't necessary for most teams.
> Git is too complicated. It's barely usable for daily tasks. Look at how many people have to Google for basic things like uncommitting a commit, or cleaning your local repo to mirror a remote one.
Cars are too complicated. They are barely usable for daily tasks. Look at how many people have to Google for basic things like changing a fan belt, or fixing cylinder head gasket.
You can fill in almost anything here. Most tools are complicated. Yet you don't need to know their ins and outs for them to be useful to you.
To me it sounds like you're proving the exact opposite. I'd assume most car owners never need to change a fan belt themselves, while everyone who uses git daily needed at some point to revert a commit. "How to turn right" isn't huge on stackoverflow last time I checked...
To me there’s some irony in that all insta-criticism of Git in responses to my comment presumably applies to a project that describes itself as “Git for data” and promises exact reproduction of all Git command behaviour—therefore suffering from the same shortcomings.
Dolt boasts its likeness to Git as a feature. Does this mean it'll also be barely usable for daily tasks? Is it possible for a project to faithfully reproduce the entirety of Git's command interface and be less complicated than Git / not suffer from the same shortcomings?
I personally think Git isn’t that bad, once it’s understood. It could be counter-intuitive sometimes though (as an example, for the longest time I used Git without realizing it stores a snapshot of each file and diffs/deltas are only computed when required). Just trying to be pragmatic and not expecting a tool like Git to be entirely free of leaky abstractions.
We used git to store and manage data sets for a machine learning project involving chewing detection with audio data used in training. It was cumbersome and the huge datasets caused some problems with git (e.g., searches of our code base got really slow because the data was being searched also until we moved the data to a different repo). Something easier to use that could manage large datasets would be useful.
I wonder if Dolt could be used to create a clone of Apple's Time Machine. Seems like the basics are there.
What people usually miss about these things is normal version control benefits hugely from content addressing and normal forms.
The salient aspect of relational data is that it's cyclic. This makes content addressing unable to provide normal forms on its own (unless someone figures out how to Merkleize cyclic graphs!), but the normal form can still be made other ways.
The first part is easy enough: store rows in some order.
The second part is more interesting: making the choice of surrogate keys not matter (quotienting it away). Sorting table rows that contain surrogate keys, when the keys themselves depend on the sort order, makes for an interesting bag of constraints, for which there may be more than one fixed point.
Example:

    CREATE TABLE Foo (
      a uuid PRIMARY KEY,
      b text,
      best_friend uuid REFERENCES Foo(a)
    );
DB 0:

    0 Alice 0

1 reclusive Alice, best friends with herself. Just fine.

DB 1:

    0 Alice 1
    1 Alice 1

2 reclusive Alices, both best friends with the second one. The Alices are the same up to primary keys, but while primary keys are to be quotiented out, primary key equality isn't, so this is valid. And we have an asymmetry by which to sort.

DB 2:

    0 Alice 1
    1 Alice 0

2 reclusive Alices, each best friends with the other. The Alices are completely isomorphic, and one notion of normal forms would say this is exactly the same as DB 0: as if this is reclusive Alice in a fun house of mirrors.
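A quick sketch of why DB 0 and DB 2 are so hard to tell apart: iteratively hashing each row together with the row it points to (a crude color-refinement pass; the function names are mine) gives both databases the exact same set of per-row fingerprints.

```python
import hashlib

def fingerprints(rows, iters=3):
    # rows: list of (name, best_friend_row_index) pairs
    colors = [hashlib.sha256(name.encode()).hexdigest() for name, _ in rows]
    for _ in range(iters):
        # Re-hash each row's color together with its best friend's color.
        colors = [hashlib.sha256((colors[i] + colors[bf]).encode()).hexdigest()
                  for i, (_, bf) in enumerate(rows)]
    return colors

db0 = [("Alice", 0)]                # one Alice, best friends with herself
db2 = [("Alice", 1), ("Alice", 0)]  # two Alices, best friends with each other

# Row-local fingerprints cannot distinguish the two databases:
print(set(fingerprints(db0)) == set(fingerprints(db2)))  # True
```

Content alone never breaks the symmetry, so any normal form has to either identify DB 2 with DB 0 or count multiplicities explicitly.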
All this is resolvable, but it's subtle. And there's no avoiding complexity. E.g. if one wants to cross reference two human data entries who each assigned their own surrogate IDs, this type of analysis must be done. Likewise when merging forks of a database.
I'd love to be wrong, but I don't think any of the folks doing "git for data" are planning their technology with this level of mathematical rigor.
A lot of your comment went over my head, but I have modeled relational data in a way conducive to being stored in a Merkle tree. The trick being that every entity in the system ended up having two IDs. A hash ID, identifying this specific version of the object, and an entity ID (probably a UUID or OID), which remained constant as new versions were added. In a situation where people can have friends that are also people, they are friends with the person long-term, not just a specific version of the friend, so in that case you'd use the entity IDs. Though you might also include a reference to the specific version of the person at the point in time at which they became friends, in which case they would necessarily reference an older version of that person. If you friend yourself, you're actually friending that person a moment ago.
A list of all current entities, by hash, is stored higher up the tree. Whether it's better that objects themselves store their entity ID or if that's a separate data structure mapping entity to hash IDs depends on the situation.
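A minimal sketch of the two-ID scheme described above, assuming JSON-serializable entity versions (all names here are hypothetical, not any particular system's API):

```python
import hashlib
import json
import uuid

def hash_id(version: dict) -> str:
    """Content address of one immutable version of an entity."""
    return hashlib.sha256(json.dumps(version, sort_keys=True).encode()).hexdigest()

# The entity ID stays constant across versions; the hash ID changes per version.
entity_id = str(uuid.uuid4())

v1 = {"entity": entity_id, "name": "Alice", "friends": []}
# Befriending yourself really means befriending your previous version's entity.
v2 = {"entity": entity_id, "name": "Alice", "friends": [entity_id]}

assert hash_id(v1) != hash_id(v2)    # different versions, different content addresses
assert v1["entity"] == v2["entity"]  # same long-lived entity
```

References by entity ID keep the graph acyclic at the version level, since a version can only name entities, never a later version's hash.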
On second reading I guess your comment was actually about how to come up with content-based IDs for objects. I guess my point was that in the real world you don't usually need to do that, because if object identity besides its content is important you can just give it an arbitrary ID. How often does the problem of differentiating between a graph containing identical Alices vs one with a single self-friending Alice actually come up? Is there any way around it other than numbering the Alices?
> ...The trick being that every entity in the system ended up having two IDs...
I think we agree that this is a partial solution. Adding a temporal dimension and referencing immutable single-versions only can break cycles by making them unconstructable in the first place. But once an object refers to foreign entity IDs, hash IDs become "polluted" with surrogate values.
> How often does the problem of differentiating between a graph containing identical Alices vs one with a single self-friending Alice actually come up? Is there any way around it other than numbering the Alices?
I think it would come up with forks that have some identical edits, especially if those edits are in different orders. In that case, surrogate keygen state would get out of sync (whether it's counters or UUID state). Either we pessimize merges, or we need some way to recover.
I think allowing for "identical Alices" is probably necessary in practice, but an interface should warn about it somehow. (Maybe ask the Alices more questions until you can differentiate them? Get prepared in case one of the Alices comes back wanting a new I.D. card and you don't want to make fraud easy.) Likewise, when merging, those warnings should be brought to the fore, along with a menu of resolutions at the extremes for the user to decide between.
> The salient aspect of relational data is that it's cyclic
This is an odd claim. Most relational data is not cyclic, and it's easy enough to come up with a scheme to handle cyclic data in a consistent fashion.
Conflicting changes (two changes to the same 'cell' of a database table) are a much more likely issue to hit and will need handling in much the same way merge conflicts are currently handled, so there are already situations in which manual effort will be needed.
We have content addressable objects at Splitgraph. [0] And here’s an example of a “point in time” query across two versions of an image on Splitgraph. [1]
I’m on my phone and don’t have a ton of time to respond right now, but I’d recommend reading our docs. We’re working on a lot of what you mention.
(Also, we’re hiring for backend and frontend roles. See my comment history for more.)
Consider the whole database - the whole set of facts across all relations - as the state in the tree. Each transaction is a serialized delta that produces a new node in the tree, a new HEAD. That's closer to what's being gotten at, as I see it.
Transaction logs are already not that different from patch sets, and merge conflicts have an isomorphism with replication inconsistencies.
> Consider the whole database - the whole set of facts across all relations - as the state in the tree.
I tried to demonstrate that this is easier said than done. Deciding the equality/redundancy of facts is very subtle. At some point it might even be a guess whether your two clerks each met the same Alice when entering records into their DB forks or not.
Transactions are just patches, I completely agree. And patches are just partial functions. But deciding what the action is of a partial function, or whether the input is in the domain requires a notion of equality. (Partial functions are encoded very nicely with incomplete pattern matching; in this case the question is whether Alice matches the pattern.)
Basically, I know where you are coming from, and I want it to work too, but you cannot just wave away the math and the issues it points out.
It's amazing this isn't a standard feature. The database world seems to have focused on large, high-volume, globally distributed databases. Presumably you wouldn't version clickstream or IoT sensor data.
Features like this that are only feasible below a certain scale are underdeveloped and I think there's opportunity there.
Every DB engine used at scale has a concept of snapshots and backups. This just looks like someone making a git-like porcelain for the same kind of DB management constructs.
Dolt actually stores the tables, rows, and commits in a Merkle DAG, like Git. So you get branch and merge. You can't do branch and merge with snapshots.
(You also get the full git / github toolbox: push and pull, fork and clone, rebase, and most other git commands)
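As an illustration of why a Merkle DAG makes branching cheap (a sketch of the general idea, not Dolt's actual storage format; all names hypothetical): unchanged tables keep the same content address across branches, so they are shared rather than copied:

```python
import hashlib

def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def table_hash(rows) -> str:
    """Content address of a table: hash over its sorted row hashes."""
    return h("".join(sorted(h(repr(r).encode()) for r in rows)).encode())

def commit_hash(tables: dict, parent) -> str:
    """Content address of a commit: its table addresses plus its parent."""
    return h((repr(sorted(tables.items())) + str(parent)).encode())

main = {"users": table_hash([(1, "Alice")]), "logs": table_hash([("boot",)])}
c1 = commit_hash(main, parent=None)

# A branch that edits only `users` re-hashes that table; `logs` is untouched,
# so its content address (and thus its stored chunks) are shared with main.
branch = dict(main, users=table_hash([(1, "Alice"), (2, "Bob")]))
c2 = commit_hash(branch, parent=c1)

assert branch["logs"] == main["logs"]
assert branch["users"] != main["users"]
```

A plain snapshot copies everything; content addressing means only the changed subtree costs anything, which is what makes many long-lived branches practical.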
Yeah, it's a neat idea, but I struggle to think of good use-cases for merge other than toy datasets. If I'm working on a service that's sharding millions of users across dozens of DB instances, understanding and reasoning about the conflicts in a merge is going to be incomprehensible.
> Yeah it's a neat idea but I struggle to think of good use-cases [...] If I'm working on a service [...]
I suspect that's simply not the use-case they're targeting. You're thinking of a database as simply the persistence component for your service, a means to an end. For you, the service/app/software is the thing you're trying to deliver. Where this looks useful is the cases where the data itself is the thing you're trying to deliver.
No. Relational DB logs are kept for a short time, and do not allow for branching/merging. And even if you did store the full log since day 1, the only way to check the state of the database at some point would be to start from scratch and replay everything up to the desired point. For any sizeable DB, that's not practical.
If you are interested in going one step beyond, you can check out https://terminusdb.com/docs/terminusdb/#/ which uses RDF to represent (knowledge) graphs (and has revision control).
Is there a nice way to distribute a database across many systems?
Would this be useful for a data-set like Wikipedia?
It would be really nice if anyone could easily follow Wikipedia in such a way that anyone could fork it, and merge some changes, but not others, etc. This is something that really needs to happen for large, distributed data-sets.
I really want to see large corpuses of data that can have a ton of contributors sending pull requests to build out some great collaborative bastion of knowledge.
This is basically Wikidata [1], where the distribution is linked data/RDF with a triple store and federation to query different triple stores at once using SPARQL.
Already I would not use this project because of its name. I’m not offended by it, but I know others will be, and it will only be a matter of time before we have to replace it with something else. So why bother in the first place?
I know the name is not DOLT but it is close enough to cause offense. Imagine the N-word with one typo. Would it still be offensive? Probably to some.
> but I know others will be, and it will only be a matter of time before we have to replace it with something else
Or we can just not give in to such insanity. That's always an option, and would help prevent things from getting increasingly worse as we cede ground to claims that increasingly get further and further away from the realm of what's reasonable.
This is the issue with names. Even though the project is called doit, the DoIt stylizing makes it look problematic. It's a non-starter; hopefully the author makes a big change. Just choosing lower case for the project would be enough.
If this could somehow work with existing production MariaDB/PostgreSQL databases, this would be the next Docker. Sad that it requires its own proprietary DB (SQL compatible though it may be, it's not my existing database).
I wish something completely free like this existed just to manage database migrations. I don't think anything is quite this powerful/user-friendly (if you can call Git that) and free.
For best results it requires modeling your project in Python using SQLAlchemy, and even then, some migrations may require manual tweaks. But it is free and supports an extensive range of database systems.
I've been working on https://github.com/projectnessie/nessie for about a year now. It's similar to Dolt in spirit but aimed at big data/data lakes. Would welcome feedback from the community.
It's very exciting to see this field picking up speed. Tons of interesting problems to be solved :-)
I love the idea of this project so much. Being able to more easily share comprehensible data is an interest of mine.
It is not the first time I have seen immutable B-trees used as a method for querying a dataset at a different point in time.
Spanner (and its derivatives) uses a similar technique to ensure backup consistency. Solutions such as CockroachDB also allow you to query data in the past [1], and then use a garbage collector to delete older unused data. The time-to-live of historical data is configurable.
It makes me wonder, though, how much of the functionality of Dolt could be replicated with CockroachDB. The internal data structures of both databases are mostly similar.
You can have infinite point-in-time querying by setting the TTL of data to forever.
You have the ability to do distributed BACKUP/IMPORT of data (which is mostly a file copy, and also includes historical data).
A transaction would be the equivalent of a commit, but I do not think there is a way to list out all transactions on CRDB, so that would have to be done separately.
And gain other benefits, such as distributed querying, and high availability.
I just find it interesting that both databases (CockroachDB and Dolt) share the same principle of immutable B-trees.
Something like this but for SQLite would be great for building small git-enabled applications/servers that can benefit from the features git provides but only need a database and a library to do it.
I never understood why we don’t have SQL databases that track all changes in a “third dimension” (columns being the first dimension, rows being the second).
It might be a bit slower to write, but hook the logic in to write/delete, and suddenly you can see exactly when a field was changed to break everything. The right middleware and you could see the user, IP, and query that changed it (along with any other queries before or after).
This has existed for a very long time as a data modeling strategy (most commonly, a "type 2 dimension") and is the way that all MVCC databases work under the covers. You don't need a special database to do this, just add another column to your database and populate it with a trigger or on update.
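A sketch of the trigger approach using SQLite (table and column names are hypothetical): an AFTER UPDATE trigger copies the old value into a history table along with a timestamp, giving the "third dimension" on top of an ordinary table:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER);
CREATE TABLE account_history (
    id INTEGER,
    old_balance INTEGER,
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);
-- Capture the previous value of every updated row.
CREATE TRIGGER account_audit AFTER UPDATE ON account
BEGIN
    INSERT INTO account_history (id, old_balance) VALUES (OLD.id, OLD.balance);
END;
""")

db.execute("INSERT INTO account VALUES (1, 100)")
db.execute("UPDATE account SET balance = 50 WHERE id = 1")

print(db.execute("SELECT id, old_balance FROM account_history").fetchall())
# → [(1, 100)]
```

Capturing the user, IP, or originating query would need middleware support, since the database trigger only sees the row values.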
Huh. I just read the spec. Not quite three 'dimensions', but it looks like exactly what I was asking for: a (reasonably) automatic and transparent record of previous values, as well as timestamps for when they changed.
I'll call this a "you learn something every day" and a "hey thanks @tthun (and @predakanga)"
Both are "standard" from the UNIX/GNU diff(1) tool. The default behavior gives you the '>' and '<' format and using -u gives you the "unified" '+' and '-' format.
Regarding merging, I understand that part, but say I have a conflict: how does it get resolved? Does it open some sort of text editor where I delete the unwanted lines?
I mean, if I have a repository on one PC, I can clone from there on a different PC.
As far as I can see this only allows S3/GC or Dolthub remotes.
Merge conflicts get put into a special conflicts table. You have to resolve them one by one just like text conflicts from a git merge. You can do this with the command line or with SQL.
It's true we don't support SSH remotes yet. Wouldn't be too hard to add though. File an issue if this is preventing you from using the product. Or use sshfs, which is a pretty simple workaround.
I've looked into Dolt previously for powering our product, which is Git for financials.
The trouble is that most of our complexity lies in our data's relationships. Merging is very tricky when data added in one branch hasn't had a specific property change applied, while other data in the master branch has been modified.
Good point - and why a graph based approach has advantages in some use cases (disclaimer TerminusDB co-founder here). Allows more flexibility in versioning relationships and more scope to merge in the scenario you outline.
I like this. A while ago, I was asking my co-founder the question "Why isn't there Git for data like there is for code?" while working on a database migrations engine that aims to provide automatic database migrations (data & schema migrations). After all, code is data.
Well, maybe this is the ONE. But I've heard "git for data" way too many times to jump on board with the latest tool. I'll wait till the dust settles and there's a clear winner. Till then, it's parquet, or even pickle.
Keep history just as a changelog to see when and what was changed? Or something with versions you can actively revert to? I suppose one of the problems is that changing the schema is usually interleaved with data migrations and conversions, and I have no idea how to track those without using a general scripting language, like the migrations in many frameworks already out there.
Just storing a changelog would be very useful. I could, for example, compare the DB version some app was developed for with the current state.
Most companies will store schema migrations in git, but doing anything with this information is difficult. Having this in db would make automated checks easier.
DoIt isn't really a swear word. It's a euphemism for coitus. It's also a very common phrase in general so it really isn't strongly associated with "doing it" to cause any issues with English speakers. Aka it's fine as a project name. See Nike's "Just do it." advertisement campaign, they would have never gone with the phrase if it had strong negative connotations.
That doesn't let you merge in someone else's changes easily. Aka two team members make changes to the (local copy of) database and now you want to merge it.
I mean, sure, you can have another database tracking their changes and a merging algorithm, but that's what Dolt is doing for you.
I would rather have a "GitHub for data" - an SQL database I could have hosted for free, give everybody read-only access to and give some people I choose R/W access to. That's a thing I miss really.