Show HN: Versioning Filesystem for SQLite (github.com/sudeep9)
110 points by devnull3 on Aug 28, 2022 | 40 comments



(I am the author.) Some of the use cases of mojo I could think of:

1. Rolling back DDL changes (like creating tables, adding columns, etc.)

2. (My use case) Multiple readers, single writer. Readers read old versions; writers do not block readers and vice versa. Note: this is different from reading+writing in the same version, which can still happen, but with database-level locking. (A rough sketch follows this list.)

3. Extension of #2: The older versions are read from S3 so that a Lambda can read them.

4. Extension of #3: Readers reading from old immutable snapshots turns the database into a queue of complex changes, with indexes and all the benefits of SQL.

5. Backups and replication.
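
To make no. 2 concrete, here is a rough Python sketch of what the reader/writer split could look like. To be clear, this is illustrative only: it assumes the VFS is registered under the name "mojo" and invents a "version" URI parameter for pinning a reader to a snapshot - neither is a documented API (a VFS can read custom URI parameters via sqlite3_uri_parameter, so it is one plausible shape; loading the extension is omitted here, see further down the thread):

    import sqlite3

    # Writer: always on the single writable (latest) version.
    writer = sqlite3.connect("file:app.db?vfs=mojo", uri=True)

    # Reader: pinned to an older, immutable version. The "version"
    # URI parameter is hypothetical, shown only to illustrate the idea.
    reader = sqlite3.connect("file:app.db?vfs=mojo&version=41", uri=True)

    # The reader sees a frozen snapshot: the writer's commits neither
    # block its queries nor show up in its results. (The events table
    # is made up for the example.)
    writer.execute("INSERT INTO events(payload) VALUES ('new data')")
    writer.commit()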

Oh: I have changed the license to MIT.


For no. 2, WAL mode already has this property; why not just use that? Is it for finer control over which version the readers see?


There is a requirement where I need the data as it existed at timestamp T or version V. So even the WAL would need to be versioned.


This seems great, performance-wise too, but why would you use this over plain file-copy-based backups, since SQLite is single-file based? I'd assume this solution might be preferable because it saves deltas instead of full copies, but the Readme does not say so. If the main advantage here is on-the-fly selecting from different versions, granted, that is unique, even though I cannot really imagine a production use case for it.

So far, I have been using file copies and GFS (grandfather-father-son) backup-scheme-based solutions like Borg, which also does deltas in chunks, compression, and encryption. Perhaps with Borg one should even back up SQL dumps instead of db files; I don't know.


Only modified pages go into a version data file [1]

[1] https://github.com/sudeep9/mojo/blob/main/design.md


don't you get the same benefit if you version-controlled the db file with git? with git, each commit saves a diff from the previous one as a blob. the difference is that in git, in addition to the diffs, you also have to keep a working copy of the db, which means you use at least 2x the storage. in your implementation, the diff blobs are the live db, which saves you that ~2x storage. is that the main benefit?


In git, each version of the database will be a full copy. Git has to perform the diff, i.e. scan the whole database file. Imagine doing commits & creating snapshots very frequently.

Have a look at https://github.com/sudeep9/mojo/blob/main/design.md#index


sorry, yes, you mention in another comment the use case of multiple readers operating on different versions of the db simultaneously. that'd be difficult to do with git for the reason you mention.


Could you achieve the same by using BTRFS to store the db file?


Yes. But you need to have such a filesystem installed in the first place.

Mojo can run on any filesystem, even one which does not support snapshots/versions. This makes it portable not only across different filesystems but also across OSes.


re: backing up SQL dumps instead of db files - indeed! In my small experiment (SQL databases in a browser profile [1]), a simple `sqlite3 "$file" .dump | gzip >"$file.sql.gz"` decreased the size of the files to be backed up by about 10x!

I'm not sure how it compares to Borg's delta- and zstd-compression, though.

[1]: http://alexey.shpakovsky.ru/en/minimizing-size-of-browser-pr...


hm but can't you do the same thing (gzipping) with the db itself?

Edit: Regardless of whether you maybe even meant that, I made a small test too: a 42 MB db zipped is 8.8 MB, but its SQL dump zipped is 4.9 MB, so almost half the size. Pretty good already. If you don't just `.dump` but actually output the tables in a consistently sorted manner (see the sketch below), the size might go down even more and it would enable very efficient delta-ing. Questionable value-to-effort ratio, though...
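
For the curious, a minimal Python sketch of that sorted-dump idea (assumptions on my part: a file named app.db, plain rowid tables; schema statements and proper SQL quoting are skipped, so this is delta-friendly output, not a restorable dump):

    import gzip
    import sqlite3

    conn = sqlite3.connect("app.db")
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master"
        " WHERE type = 'table' AND name NOT LIKE 'sqlite_%'")]

    with gzip.open("app-sorted.sql.gz", "wt", encoding="utf-8") as out:
        for table in tables:
            # Sort by every column so two dumps of the same logical
            # content come out byte-identical -> small, stable deltas.
            cols = [c[1] for c in conn.execute(f'PRAGMA table_info("{table}")')]
            order = ", ".join(f'"{c}"' for c in cols)
            for row in conn.execute(f'SELECT * FROM "{table}" ORDER BY {order}'):
                # repr() is not proper SQL quoting - good enough for
                # delta-friendly archiving, not for restoring.
                out.write(f'INSERT INTO "{table}" VALUES {row!r};\n')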


re: gzipping the db itself - yep, I tried and saw the same effect as you did: about half the size. I think it's because of indexes (I once played with sqlite3_analyzer [1] and on some databases it showed that about half of the disk space was used by indexes) and unused pages (all the space which can be vacuum'ed - again, the size of some databases can be decreased almost twice by vacuum'ing).

[1]: https://www.sqlite.org/sqlanalyze.html

On the other hand, indeed, a text dump is likely not the most space-efficient way of storing raw DB data (compared to some binary format), so I wouldn't be surprised to find databases for which the gzipped SQL dump is bigger than the (gzipped) database itself. I would even say that I'm surprised that it's not true for most databases :)


Is there a way to provide the same functionality via an extension instead of a VFS? Recently there have been a few other interesting projects adding backup and replication to SQLite, but they all use a VFS. AFAIK only one VFS can be active at a time.


I am the author. Mojo is both an extension and a VFS implementation.

> AFAIK only one VFS can be active at a time.

You can specify the VFS as part of the connect URI.
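
In Python, for example, that could look like this (a sketch: the extension path ./mojo and the VFS name "mojo" are my assumptions, and enable_load_extension needs a Python/SQLite build with extension loading enabled):

    import sqlite3

    # Load the extension once; a VFS registration is process-wide.
    boot = sqlite3.connect(":memory:")
    boot.enable_load_extension(True)
    boot.load_extension("./mojo")  # extension path is an assumption
    boot.close()

    # Then pick the VFS per connection through the URI; connections
    # that omit vfs= keep using the default VFS.
    conn = sqlite3.connect("file:app.db?vfs=mojo", uri=True)

Since vfs= is a standard SQLite URI parameter, the default VFS stays available to every other connection in the same process.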


How would something like this work with litestream? I imagine that the WAL file gets written the same way, but would recovering the database allow mojo to continue working?


From what I know of litestream, it should be decoupled from the VFS implementation. The versions/snapshots are not known to the database itself.


Would love to learn more about the motivation behind the project


This is part of a larger system which requires versioning of data, so there is a requirement of reading the data as of timestamp T. The sqlite database is supposed to hold the metadata of that larger data. Since the larger data itself is versioned, I need the metadata to be versioned as well.

Some of the use-cases are mentioned at: https://news.ycombinator.com/item?id=32629548


Ah cool. We are defaulting to using sqlite for storing metadata too.

Happy to chat and exchange notes - my email is in the profile.


Could this be used to have multiple writers, where every writer has its own sqlite3 file and a read is a combination of all the files?


regarding multiple writers - I think the hardest part would be to find an sqlite version where "a read is a combination of all the files".

Speaking of multiple writers, I've recently heard about the "BEGIN CONCURRENT" feature of SQLite [1] - currently it lives on its own separate branch, but I hope they will eventually merge it into the main branch. Not sure if it can be used in your case, but worth mentioning anyway, I think.

[1]: https://www.sqlite.org/cgi/src/doc/begin-concurrent/doc/begi...
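
For reference, the usage pattern from that doc would look roughly like this in Python (a sketch: it needs an SQLite build from the begin-concurrent branch and a database in WAL mode; the jobs table is made up):

    import sqlite3

    # Requires an SQLite build from the begin-concurrent branch;
    # a stock build will reject BEGIN CONCURRENT as a syntax error.
    conn = sqlite3.connect("app.db", isolation_level=None)  # manual txns
    conn.execute("PRAGMA journal_mode=WAL")  # concurrent writers need WAL

    while True:
        try:
            conn.execute("BEGIN CONCURRENT")
            conn.execute("INSERT INTO jobs(payload) VALUES ('work')")
            conn.execute("COMMIT")
            break
        except sqlite3.OperationalError:
            # COMMIT reports page-level conflicts with other concurrent
            # writers (SQLITE_BUSY_SNAPSHOT): roll back and retry.
            conn.execute("ROLLBACK")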


Not sure I understood your question fully, but if there are multiple versions, then a read will be over multiple files, but writes go only to the single active version.


Each writer can get its own branch, but you would never be able to re-combine them.


Given that sqlite databases are just files, what is the benefit of this vs. version controlling the database file itself?


> The main feature of the fs is versioning/snapshotting. Only one version is writable and all the old versions are immutable.


Yes, that's what version control does. `git init` followed by git commits every time you update the db would do exactly the same, so: what is the specific benefit of this approach compared to just version controlling the db file itself?


> At present there is no license and to my knowledge it is very restrictive. This will change. I have not made up my mind on the exact license.

There might be something lost in translation but since there is no licence this is a copyrighted work and cannot be used without permission.


I have changed the license to MIT. Sorry for the wrong impression.


That's exactly how I read that sentence. There was no license, and ironically no license is more restrictive than most licenses.

Once you release code under a certain license you can pretty much never undo that. You can change the license, but the previous version under the previous license will still be valid. So I understand wanting to take time to think about what license you will use.


Why ironically? A license (aka a permit) is a document which gives someone a set of rights. No license, no permission - that's the default.


It's ironic because usually when there's no license, it's because the author doesn't care what you do with the code. Anyone who actually cares will say "proprietary" or whatever.


You are correct


[flagged]


> is it some kind of a tease

(I am the author). It's definitely not a tease. The license will change within a week (max 2). This is my first open source project. The purpose is to validate the approach.

This has been a private project for a very long time. It can be argued that I should have figured out the license first, but the delay from my side had reached a tipping point. I just thought: make it public, figure out the rest later.


Got it. If you're open to feedback: in the future, a few more words of explanation about the reasoning in the readme would be appreciated by folks and go a long way in making it clear you are a reasonable and thoughtful person. Otherwise it's up to the reader to guess what kind of individual sits on the other side of the screen.

Unfortunately there is a lot of code license drama these days compared to a decade ago. For reference, see the Elasticsearch OSS License renege fiasco from January 2021.

https://www.infoq.com/news/2021/01/elastic-aws-open-source/

https://en.wikipedia.org/wiki/Elasticsearch#Licensing_change...

Keep up the good work!


Isn't "...and to my knowledge it is very restrictive. This will change." enough of an explanation?


Noted.

I have changed the license to MIT.


Should we avoid even reading this code if there is no license? Is it currently proprietary code?


I think you can read the code. I knowingly made the repo public - you do not even have to log in to GitHub. I think it should be ok.


It's just a matter of legalities. Even if you share it openly, it doesn't change its copyright status. Unless you offer a licence or make a statement putting it in the public domain (this doesn't work in all jurisdictions), it remains proprietary.

Of course that doesn't matter too much because you're not going to sue people who use it, but it can still have potentially significant consequences for people who use the code.



