This seems great, performance-wise too, but why would you use this over plain file-copy-based backups, since SQLite is single-file based? I'd assume this solution might be preferable because it saves deltas instead of absolute copies, but the README does not say so. If the main advantage here is on-the-fly selecting from different versions, granted, that is unique, even though I cannot really imagine a production use case for it.
So far, I have been using file copies and GFS-backup-scheme-based solutions like Borg, which also does deltas in chunks, compression, and encryption. Perhaps with Borg one would even be better off backing up SQL dumps instead of db files, I don't know.
don't you get the same benefit if you version controlled the db file with git? with git, each commit saves a diff from the previous one as a blob. the difference is that in git, in addition to the diffs, you also have to create a working copy of the db, which means you use up at least 2x the storage your system uses. in your implementation, the diff blobs are the live db, which saves you ~2x storage. is that the main benefit?
In git, each version of the database will be a full copy. Git has to perform the diff, i.e. scan the whole database file. Imagine doing commits and creating snapshots very frequently.
sorry, yes, you mention in another comment the use case of multiple readers operating on different versions of the db simultaneously. that'd be difficult to do with git for the reason you mention.
re: backup SQL dumps instead of db files - indeed! In my small experiment (SQL databases in a browser profile, [1]), a simple `sqlite3 "$file" .dump | gzip >"$file.sql.gz"` decreased the size of the files to be backed up by about 10 times!
I'm not sure how it compares to Borg's delta and zstd compression, though.
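If it helps, a minimal sketch of that approach applied to a whole directory of databases; the profile path and the `*.sqlite` glob are assumptions, not from the original setup:

```sh
# Hypothetical sketch: dump+gzip every SQLite file found in a browser profile.
# .dump emits schema + INSERT statements, which compress much better than the
# raw page-oriented db file (indexes and free pages are not part of the dump).
# Dump from a copy or while the browser is closed to avoid a busy database.
profile="$HOME/.mozilla/firefox/myprofile"   # placeholder path
backup_dir="$HOME/backups/sqlite-dumps"
mkdir -p "$backup_dir"
for file in "$profile"/*.sqlite; do
  sqlite3 "$file" .dump | gzip > "$backup_dir/$(basename "$file").sql.gz"
done
```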
hm but can't you do the same thing (gzipping) with the db itself?
Edit: Regardless of whether that's what you meant, I made a small test too: a 42 MB db zipped is 8.8 MB, but its SQL dump zipped is 4.9 MB, so almost half the size. Pretty good already. If you don't just `.dump` but actually output the tables in a consistently sorted manner, the size might go down even more, and it will enable very efficient delta-ing. Questionable value-to-effort ratio though...
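In case it's useful, a rough sketch of what such a "consistently sorted" dump could look like; iterating sqlite_master and ordering by rowid are my assumptions, and `app.db` is a placeholder:

```sh
# Hypothetical sketch: schema first, then each table's rows in a stable order,
# so successive dumps of a slowly changing db delta very well.
# Assumes ordinary rowid tables (WITHOUT ROWID tables would need ORDER BY their
# primary key) and table names without spaces or quotes.
db="app.db"
{
  sqlite3 "$db" .schema
  sqlite3 "$db" "SELECT name FROM sqlite_master WHERE type='table' AND name NOT LIKE 'sqlite_%';" |
  while IFS= read -r tbl; do
    sqlite3 "$db" <<EOF
.mode insert $tbl
SELECT * FROM "$tbl" ORDER BY rowid;
EOF
  done
} | gzip > "$db.sorted.sql.gz"
```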
re: gzipping the db itself - yep, I tried and saw the same effect as you did: about half the size. I think it's because of indexes (I once played with sqlite3_analyzer [1], and on some databases it showed about half of the disk space was used by indexes) and unused pages (all the space which can be vacuum'ed - again, the size of some databases can be decreased almost twice by vacuum'ing).
On the other hand, indeed, a text dump is likely not the most space-efficient way of storing raw DB data (compared to some binary format), so I wouldn't be surprised to find databases for which the gzipped SQL dump is bigger than the (gzipped) database itself. I would even say I'm surprised that it's not true for most databases :)
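A small hedged sketch of how one could check both effects; `sqlite3_analyzer` and `VACUUM INTO` are standard SQLite tooling, the file names are placeholders:

```sh
# Hypothetical sketch: see where the space goes, then produce a compacted copy
# without touching the live file.
db="app.db"   # placeholder name

# Per-table / per-index space breakdown (sqlite3_analyzer ships with SQLite's tools).
sqlite3_analyzer "$db" | less

# page_size * freelist_count gives a rough idea of how many bytes VACUUM could reclaim.
sqlite3 "$db" "PRAGMA page_size; PRAGMA freelist_count;"

# VACUUM INTO (SQLite >= 3.27) writes a defragmented copy, leaving the original untouched.
sqlite3 "$db" "VACUUM INTO 'app-compact.db';"
```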
Is there a way to provide the same functionality via an extension instead of a VFS? Recently there have been a few other interesting projects adding backup and replication to SQLite, but they all use a VFS. AFAIK only one VFS can be active at a time.
How would something like this work with litestream? I imagine that the WAL file gets written the same way, but would recovering the database allow for mojo to continue working?
This is part of a larger system which requires versioning of data, so there is a requirement of reading the data as of timestamp T. The SQLite database is supposed to hold the metadata of that larger data. Since the larger data itself is versioned, I need the metadata to be versioned too.
regarding multiple writers - I think the hardest part would be to find an SQLite version where "a read is a combination of all the files".
Speaking of multiple writers, I've recently heard about the "BEGIN CONCURRENT" feature of SQLite [1] - currently it lives on a separate branch, but I hope they will eventually merge it into the main branch. Not sure if it can be used in your case, but worth mentioning anyway, I think.
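For reference, a minimal sketch of what using it looks like, assuming a sqlite3 shell built from the begin-concurrent branch; the `events` table is made up:

```sh
# Hypothetical sketch: BEGIN CONCURRENT defers page locking until COMMIT, so writers
# touching disjoint pages can run in parallel. Requires WAL mode and a build from
# SQLite's begin-concurrent branch; a stock sqlite3 will reject the statement.
sqlite3 app.db <<'EOF'
PRAGMA journal_mode=WAL;
BEGIN CONCURRENT;
INSERT INTO events(ts, note) VALUES (strftime('%s','now'), 'writer A');
COMMIT; -- may fail with SQLITE_BUSY_SNAPSHOT on page conflicts; the writer then retries
EOF
```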
Not sure I understood your question fully, but if there are multiple versions, then a read will span multiple files, while writes go to a single active version.
Yes, that's what version control does. `git init` followed by a git commit every time you update the db would do exactly the same, so: what is the specific benefit of this approach compared to just version-controlling the db file itself?
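To make the comparison concrete, here is roughly what that plain-git workflow looks like; `app.db` is a placeholder, and this sketches the approach the comment describes, not the project's mechanism:

```sh
# Hypothetical sketch: version the db file directly with git.
git init
git add app.db && git commit -m "snapshot 1"
# ...application writes to app.db...
git add app.db && git commit -m "snapshot 2"   # a new full blob; git only delta-compresses later, in packfiles

# Reading an old version means materializing a full copy first:
git show HEAD~1:app.db > /tmp/app-v1.db
sqlite3 /tmp/app-v1.db "PRAGMA integrity_check;"
```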
That's exactly how I read that sentence. There was no license, and ironically no license is more restrictive than most licenses.
Once you release code under a certain license you can pretty much never undo that. You can change the license, but the previous version under the previous license will still be valid. So I understand wanting to take time to think about what license you will use.
It's ironic because usually when there's no license, it's because the author doesn't care what you do with the code. Anyone who actually cares will say "proprietary" or whatever.
(I am the author). It's definitely not a tease. The license will change within a week (max 2). This is my first open source project. The purpose is to validate the approach.
This has been a private project for a very long time. It can be argued that I should have figured out the license first, but the delay on my side reached a tipping point. I just thought, "make it public, figure out the rest later".
Got it. If you're open to feedback: in the future, a few more words of explanation about the reasoning in the README would be appreciated by folks and go a long way toward making it clear you are a reasonable and thoughtful person. Otherwise it's up to the reader to guess what kind of individual sits on the other side of the screen.
Unfortunately there is a lot of code license drama these days compared to a decade ago. For reference, see the Elasticsearch OSS License renege fiasco from January 2021.
It's just a matter of legalities. Even if you share it openly, it doesn't change its copyright status. Unless you offer a licence or make a statement putting it in the public domain (this doesn't work in all jurisdictions), it remains proprietary.
Of course that doesn't matter too much because you're not going to sue people who use it, but it can still have potentially significant consequences for people who use the code.
1. Rolling back DDL changes (like CREATE TABLE, adding columns, etc.)
2. (My use-case) Multiple readers, single writer. Readers read old versions; writers do not block readers and vice versa. Note: this is different from reading and writing within the same version, which can still happen but with database-level locking.
3. Extension of #2: the older versions are read from S3 so that a Lambda can read them.
4. Extension of #3: readers reading from old immutable snapshots turns the database into a queue of complex changes, with indexes and all the benefits of SQL.
5. Backups and replication.
Oh: I have changed the license to MIT.