Wddbfs – Mount a SQLite database as a filesystem

blagie · on Feb 18, 2024

This seems really nice!

If this is posted by the author looking for feedback:

1) WebDAV is a much better choice than FUSE. FUSE is a good concept, but buggy and poorly-implemented. Things like sshfs can break in very bad ways if e.g. there is a network connectivity issue. Not a hack.

2) Writes seem like a very bad idea. Keep those out unless you come up with a clean way to handle them (which seems difficult if not impossible given the differences in FS versus relational abstractions, especially with regards to data validation). Not a limitation.

In other words, the "hacks" seem like design choices a good architect would likely have made. Continuing:

3) The major use-case I have is if I have a small (<1MB) database, and don't know the structure. Lots of tools use small sqlite databases. There is no way to query all tables for something, whereas tools like `find` and `grep` can look through all files. I was recently trying to recover some lost data, and it was a pain to find it.

4) I think a major theoretical question is how to fuse the two models. I would like to be able to do 'generic' things like the above on databases, while still being able to be relational.

5) I don't have an answer to the above, but perhaps natural first step might be to allow something like queries or virtual tables to sit on the file system:

wddbfs --anonymous --db-path=/path/to/an/example/database/like/Chinook_Sqlite.sqlite

wddbfs_query myjoin "SELECT * FROM table_1, table_2 WHERE table_1.id=table_2.id"

And voila! A /virtual/myjoin.csv file pops up.

(Even more) half-baked thoughts:

There might be more clever ways to do it too. I'm thinking through half-baked thoughts on how to make files and tab completion work. My half-baked thoughts are moving towards something like:

wddbfs_SELECT * from Customer.tsv\, Employee.tsv WHERE

But I don't like all the potential bugs with escaping. I'm also thinking about when output wants to go to the console versus into a virtual table.

loeg · on Feb 18, 2024

> WebDAV is a much better choice than FUSE.

That's a pretty spicy opinion. WebDAV is slow; FUSE works great. Use interrupt mounts for things like sshfs (or NFS, for that matter).

> Writes seem like a very bad idea. Keep those out unless you come up with a clean way to handle them (which seems difficult if not impossible given the differences in FS versus relational abstractions, especially with regards to data validation).

You could imagine supporting appending new records in json format to the json/jsonl files in a pretty clean way, but I agree it seems well out of scope. And removing records seems basically impossible.

nh2 · on Feb 19, 2024

What's an "interrupt mount"?

loeg · on Feb 19, 2024

I was thinking specifically of these NFS mount options[0]:

> intr / nointr

> Selects whether to allow signals to interrupt file operations on this mount point. If neither option is specified (or if nointr is specified), signals do not interrupt NFS file operations. If intr is specified, system calls return EINTR if an in-progress NFS operation is interrupted by a signal.

> Using the intr option is preferred to using the soft option because it is significantly less likely to result in data corruption.

But:

> The intr / nointr mount option is deprecated after kernel 2.6.25. Only SIGKILL can interrupt a pending NFS operation on these kernels, and if specified, this mount option is ignored to provide backwards compatibility with older kernels.

So I guess on Linux you just have to use "soft" now instead[1]. (I believe intr still works on FreeBSD, which is the last place I made significant use of NFS[2].)

> soft / hard

> Determines the recovery behavior of the NFS client after an NFS request times out. If neither option is specified (or if the hard option is specified), NFS requests are retried indefinitely. If the soft option is specified, then the NFS client fails an NFS request after retrans retransmissions have been sent, causing the NFS client to return an error to the calling application.

> NB: A so-called "soft" timeout can cause silent data corruption in certain cases. As such, use the soft option only when client responsiveness is more important than data integrity. Using NFS over TCP or increasing the value of the retrans option may mitigate some of the risks of using the soft option.

[0]: https://linux.die.net/man/5/nfs#:~:text=the%20NLM%20protocol...

[1]: https://linux.die.net/man/5/nfs#:~:text=soft%20/%20hard,the%...

[2]: https://man.freebsd.org/cgi/man.cgi?query=mount_nfs&apropos=...

nh2 · on Feb 27, 2024

If it is an NFS concept, not FUSE, then "use interrupt mounts for things like sshfs" does not really make sense, because sshfs is a FUSE program and not related to NFS.

geraldhh · on Feb 19, 2024

lazy/async/autoreconnect

loeg · on Feb 19, 2024

No, it's a pretty specific concept that doesn't mean any of those things: https://news.ycombinator.com/item?id=39431698

geraldhh · on Feb 19, 2024

actually it reads very much like an asynchronous mount

loeg · on Feb 19, 2024

With “intr,” operations are synchronous but cancelable. That doesn’t match any sense of “async” I am familiar with, nor have I heard the term “async mount” before.

hansvm · on Feb 18, 2024

> WebDAV is a much better choice than FUSE. FUSE is a good concept, but buggy and poorly-implemented. Things like sshfs can break in very bad ways if e.g. there is a network connectivity issue. Not a hack.

I have had problems with particular fuse implementations (sshfs included, s3fs especially), but I've never had trouble writing FUSE implementations that behave correctly. Is there something deficient in the spec that you want to call out, or is the idea just that WebDAV carefully built networking into its core concepts, where you have to do a lot of extra work to squeeze those ideas correctly into a FUSE implementation?

blagie · on Feb 19, 2024

I suspect if you've run into problems with a lot of things built on FUSE, the problem is FUSE.

Yes, s3fs and sshfs can both leave the system in an unstable state. For example, there can be a dead mount which is impossible to unmount, and in severe cases, blocks a clean reboot.

A file system in user space (or in network space) should NEVER break the system, no matter what happens in user space (or in network space). Most network file systems try to respect this (albeit with mixed success). FUSE does not.

I'm not claiming FUSE cannot be made to work. Just that it's very bad since (1) plenty of smart people clearly failed to do so (2) the badness it leaves behind should be more than it's permitted to.

I can point to specific issues, but at the end of the day, that's neither here nor there. At the end of the day, something like:

https://github.com/libfuse/libfuse/blob/master/example/poll....

Is about a hundred times more complicated than it should be. It should not require memsets, pthread mutexes, or flags, and should probably have an implementation in a modern, high-level language. To a large extent, that's the point of moving things out of the kernel.

I'm even perhaps okay with being permitted to do low-level operations for a particularly performance-constrained subsystem, but that's not 95% of the uses of something like FUSE.

Footnote: I actually enjoyed writing low-level code like this a lot, when computers were in the single-digit to triple-digit MHz range, and we didn't need to worry about people breaking in over a ubiquitous worldwide internet, but I left that mindset behind decades ago. Right now, I want code to be stable, simple, auditable, and secure.

_ugfj · on Feb 18, 2024

> There is no way to query all tables for something,

Ha. I would bet anything you could adopt my MySQL findall() to SQLite for the purpose. https://stackoverflow.com/a/68915413/308851

bemusedthrow75 · on Feb 19, 2024

This SQL is definitely supervillain territory.

I just can't decide if you're a sci-fi supervillain, a political thriller supervillain, or a Despicable Me supervillain.

;-)

blagie · on Feb 18, 2024

Nice! I hope I remember this when I next need it.

justin66 · on Feb 18, 2024

> Things like sshfs can break in very bad ways if e.g. there is a network connectivity issue.

That’s a peculiar critique of FUSE. You could write the same thing about NFS or SMB (but especially NFS).

Paul-Craft · on Feb 18, 2024

Huh. I guess I'm one of the few who's had generally good experiences with FUSE, then. I've used several different FUSE filesystems, ranging from disk-based, to network-oriented, to... well, let's just say a bit odd, kind of fun, but probably not particularly useful. I even wrote a couple of the latter myself, just for funsies.

Oh well, there's my FUSE rant for the day, I guess. Now, on to what I came to comment on....

I'm indifferent on the WebDAV vs FUSE thing, actually. But if the author wanted a nicer way to query a bunch of text files, where each line would roughly represent what a db row would, and there's whatever random delimiter character in there (comma, tab, pipe, or even those weird ASCII control characters nobody ever knows or remember exist for the purpose [0]), then I think he gave in and went to a database too soon. Besides the ubiquitous GNU textutils, there are tons of tools out there for working with record-oriented CSV and TSV files. [1] There's even `jq` if you're a real masochist and want to use json as your text-based data storage, query, and interchange format.

Now you've got me wondering if I was lucky in not having any significant issues with FUSE. But, as far as OP and the technical problems they were facing, I think it was a little premature to jump to `sqlite`. All those things they put down to "clunkiness" really just makes me thing that `sqlite` + WebDAV ended up being one of the members of the set of simple, elegant, and wrong solutions most problems seem to have. ¯\_(ツ)_/¯

---

[0]: https://stackoverflow.com/a/18782271

[1]: https://stackabuse.com/running-sql-on-csv-data-data-conversi...

blagie · on Feb 18, 2024

I don't quite agree. There's a reason relational won, and continues to win more than a half-century after it was invented, despite no lack of wannabe usurpers.

For your model to work, what would be needed is the reverse: a system which allows me to interact with TSV files with SQL.

For 98% of uses of SQLite, performance and storage efficiency don't matter; we're talking about dozens or hundreds of records. It doesn't need to be well-implemented from that perspective. It does need to be robust. For example, databases solve a lot of problems especially around consistency in threaded systems which this would need to handle as well (although a KISS solution, such as handling all access from one thread / process, and simply blocking other requests during transactions, would be fine).

Something like that would likely eat SQLite's lunch.

q seems close. It'd need to support writes, table creation, and all the other stuff, perhaps be more actively maintained, provide documented API (not just command-line) access, and work with things like ORMs.

samatman · on Feb 18, 2024

> For your model to work, what would be needed is the reverse: a system which allows me to interact with TSV files with SQL.

Several such systems exist, one of which is SQLite. https://www.sqlite.org/csv.html

I don't know if there's a virtual table extension for tab separated values, but SQLite imports them just fine.

blagie · on Feb 21, 2024

Interesting. Thank you.

It's intentionally available as a single source file:

https://www.sqlite.org/src/artifact?ci=trunk&filename=ext/mi...

For easy modification to handle TSV, TSVx, JSON, Pandas, YAML, or whatever else my heart desires.

gkbrk · on Feb 18, 2024

> a system which allows me to interact with TSV files with SQL

ClickHouse and DuckDB can both do this. Pretty sure AWS has some built-in stuff for this too.

blagie · on Feb 18, 2024

I think those are a very different use case. sqlite is used for things like app settings. Both of those are heavyweight monsters designed for rather large data.

That said, I'm reading their documentation right now, and I think one or the other might slot into a different use-case I have where I do have rather large data. I'd last looked at this space quite a number of years ago, and both are way ahead of what I'd recalled.

alexnewman · on Feb 18, 2024

Don't people use NFS instead of Fuse now? I'd assume it would have much better performance than webdav and it should handle a bunch of the issues around writes`

blagie · on Feb 18, 2024

I am moving outside of my zone of expertise, but when I last looked (decades ago), systems like NSF assumed a filesystem backing store and thinks about things like byte ranges. There are a lot of operations which file systems support which work very poorly when this is not true, such as seeks, memory-mapped files, etc.

If I want the 50,004,123,121th byte of a file from a disk, that's very fast. If I want the same for a virtual object from an HTTP server, object store, virtual table, etc. it literally involves creating the whole object, and stepping through it byte-by-byte until I get there.

If the next request is doing the same 1k ahead, on a disk, that's probably in cache, and if not, I can get there quickly. If this was a SQL query, I probably need to redo the whole thing.

You get a natural explosion from O(1) to O(n) in many common cases, and for something like a complex SQL query, it can be much, much worse.

m-hilgendorf · on Feb 18, 2024

NFSv4 is stateful, and while it's true that the server has to support stateless reads for bad clients, most clients are ok with keeping the state ids that are returned after an open. Whether you have to step through the object byte-by-byte isn't inherent to the protocol, it depends on how you've stored the data.

m-hilgendorf · on Feb 18, 2024

I've implemented both in the last 6 months and the only reason I would expect anyone to use NFS over FUSE is for MacOS without macfuse/fuse-t. fuse-t is a cool project that provides (iirc) the high level fuse interface (libfuse) via NFS on MacOS. Compare to macfuse, which implements the low level interface (/dev/fuse) via a kext.

FUSE is much, much easier to work with. The protocol (even v4) should also be less chatty, and if performance is your goal there are projects like ExtFuse that can move some caching into the kernel via eBPF which I don't think would be possible with NFS.

If you poke around other user space file systems (sshfs, cephfs, objectivefs, etc) they're all FUSE.

vidarh · on Feb 19, 2024

I can implement a basic Fuse filesystem in a handful of lines of code in several languages. I'm not aware of any libraries that makes it equally simple to write an NFS server.

atsmyles · on Feb 19, 2024

Why not just add views?

adius · on Feb 18, 2024

I'm also working on something like this:

https://github.com/Airsequel/SQLiteDAV

My mapping is: table -> dir, row -> dir, cell -> file

zokier · on Feb 18, 2024

I wonder if mapping index->directory would be the best match, that way you could at least hypothetically reclaim some of the SQL benefits, and the directory entries could have more natural names. You'd need to have slightly different structure for unique vs non-unique indices, but that seems like minor issue.

jFriedensreich · on Feb 18, 2024

I build a couchDB webdav server back in the day, you could also edit json documents or file blobs directly. the problem i discovered was that all OSes totally staled their webdav support and there are also enough differences between oses to be annoying. In the end to build somthing with great performance you would also need to control the webdav client side and probably build a fuse webdav client. I would have loved to see webdav maturing and becoming what the 9P vision was just for the web, but this obviously never happened as all the applications just went into to the web and used rest intead of webdav and everything else moved to sync protocols that sync to local folders.

paul_h · on Feb 18, 2024

I'm with you on wishing WebDAV continued its rollout. These days there are great low-drama server-side deployments like https://github.com/sigoden/dufs. It's run relative too - you could habe multiple dufs processes serving up different directories in different ways. But for WebDAV, you can't simply mount that on the client side for every OS that's equally low configutaion. For that reason, I really like sshfs as it can be initiated from the client-side without a lot of config (just a mkdir of the mapped dir), and it's OK most time despite it's lack of speed and multi-day uptime. I'm on a chromebook now and it turns out that Samba is the easiest client-side tech to use for remote file systems. DAv should've been uniquitous.

pletnes · on Feb 18, 2024

Wrap in an SSH tunnel and you can do some fun stuff over the network, too.

rickette · on Feb 18, 2024

Over the network (when using azure or gcp storage) there's also https://sqlite.org/cloudsqlite/doc/trunk/www/index.wiki which doesn't require loading everything in memory.

renonce · on Feb 18, 2024

> Although for now, the whole table gets read into memory for every read so this won’t work well for very large database files. There’s also no write support… yet.

At such a set of features I would prefer a tool that converts databases to a directory of real csv and jsonl files, at least there are no performance issues to worry about

AnyTimeTraveler · on Feb 18, 2024

I really like the idea of running a watch on an sqlite table with my common cli tools. If I understand correctly, it is being re-read on every request. Does that mean that changes to the sqlite database will be visible on the next read of a csv file?

yellowapple · on Feb 18, 2024

I was expecting this to be a way to mount so-called SQL Archives (https://sqlite.org/sqlar.html) but this is just as cool.

Nican · on Feb 18, 2024

> the SQL syntax for selecting a few records is much more verbose than head -n or tail -n

I use DBeaver to inspect SQLite files, and to also work with Postgres databases.

I kind of miss MySQL Workbench, but MySQL is pretty dead to me. And SQL Server Management Studio is a relic that keeps being updated.

I also sometimes make dashboards from SQLite files using Grafana, but the time functions for SQLite are pretty bad.

callamdelaney · on Feb 19, 2024

Could you expand on why the time functions in sqlite are pretty bad?

T-A · on Feb 19, 2024

> Part of this is avoiding the overhead of figuring out a relational schema, but an equal amount of friction comes from the fact that .sqlite files are just slightly more difficult to inspect

I like this solution to that problem:

https://sqlitebrowser.org/

account-5 · on Feb 18, 2024

This looks pretty cool.

genocidicbunny · on Feb 18, 2024

Wasn't there an extension that let you mount a filesystem as a table or db in sqlite? I wonder how far you can inception that. Mount a db as a filesystem, then mount that filesystem as a db..etc.

dvaun · on Feb 18, 2024

Borrowing concepts from DB2 here

loeg · on Feb 18, 2024

Cute, seems legitimately useful, succinct. Not everything has to be super technically challenging to be valuable. I can see how this would be really handy.

zokier · on Feb 18, 2024

I think its neat proof of concept, but I struggle to see any case where this would be particularly useful. Or rather when this would be more handy than what sqlite cli already offers.

like is this really meaningful improvement

    $ tail -n 3 Chinook_Sqlite.sqlite/Album.tsv

over this?

    $ sqlite3 -tabs Chinook_Sqlite.sqlite 'select * from Album' | tail -n 3

eternauta3k · on Feb 18, 2024

Why webdav instead of making short sqltocsv, sqltojson, etc scripts? You could make completion work (with some enormous time investment).

dmd · on Feb 18, 2024

Because those all exist already, and this is cool?

eternauta3k · on Feb 24, 2024

A fine reason, I just thought you had a niche need prompting you to do it like this.

mharig · on Feb 18, 2024

Nice. I think exposing the tables as csv, tsv, json & jsonl is to much of a cluttering. Format should be a mount option.

pjerem · on Feb 18, 2024

It is

nikeee · on Feb 18, 2024

Does it support mounting a sqlar file?