Hacker News new | past | comments | ask | show | jobs | submit login
Wddbfs – Mount a SQLite database as a filesystem (adamobeng.com)
291 points by vitplister on Feb 18, 2024 | hide | past | favorite | 50 comments



This seems really nice!

If this is posted by the author looking for feedback:

1) WebDAV is a much better choice than FUSE. FUSE is a good concept, but buggy and poorly-implemented. Things like sshfs can break in very bad ways if e.g. there is a network connectivity issue. Not a hack.

2) Writes seem like a very bad idea. Keep those out unless you come up with a clean way to handle them (which seems difficult if not impossible given the differences in FS versus relational abstractions, especially with regards to data validation). Not a limitation.

In other words, the "hacks" seem like design choices a good architect would likely have made. Continuing:

3) The major use-case I have is if I have a small (<1MB) database, and don't know the structure. Lots of tools use small sqlite databases. There is no way to query all tables for something, whereas tools like `find` and `grep` can look through all files. I was recently trying to recover some lost data, and it was a pain to find it.

4) I think a major theoretical question is how to fuse the two models. I would like to be able to do 'generic' things like the above on databases, while still being able to be relational.

5) I don't have an answer to the above, but perhaps natural first step might be to allow something like queries or virtual tables to sit on the file system:

wddbfs --anonymous --db-path=/path/to/an/example/database/like/Chinook_Sqlite.sqlite

wddbfs_query myjoin "SELECT * FROM table_1, table_2 WHERE table_1.id=table_2.id"

And voila! A /virtual/myjoin.csv file pops up.

(Even more) half-baked thoughts:

There might be more clever ways to do it too. I'm thinking through half-baked thoughts on how to make files and tab completion work. My half-baked thoughts are moving towards something like:

wddbfs_SELECT * from Customer.tsv\, Employee.tsv WHERE

But I don't like all the potential bugs with escaping. I'm also thinking about when output wants to go to the console versus into a virtual table.


> WebDAV is a much better choice than FUSE.

That's a pretty spicy opinion. WebDAV is slow; FUSE works great. Use interrupt mounts for things like sshfs (or NFS, for that matter).

> Writes seem like a very bad idea. Keep those out unless you come up with a clean way to handle them (which seems difficult if not impossible given the differences in FS versus relational abstractions, especially with regards to data validation).

You could imagine supporting appending new records in json format to the json/jsonl files in a pretty clean way, but I agree it seems well out of scope. And removing records seems basically impossible.


What's an "interrupt mount"?


I was thinking specifically of these NFS mount options[0]:

> intr / nointr

> Selects whether to allow signals to interrupt file operations on this mount point. If neither option is specified (or if nointr is specified), signals do not interrupt NFS file operations. If intr is specified, system calls return EINTR if an in-progress NFS operation is interrupted by a signal.

> Using the intr option is preferred to using the soft option because it is significantly less likely to result in data corruption.

But:

> The intr / nointr mount option is deprecated after kernel 2.6.25. Only SIGKILL can interrupt a pending NFS operation on these kernels, and if specified, this mount option is ignored to provide backwards compatibility with older kernels.

So I guess on Linux you just have to use "soft" now instead[1]. (I believe intr still works on FreeBSD, which is the last place I made significant use of NFS[2].)

> soft / hard

> Determines the recovery behavior of the NFS client after an NFS request times out. If neither option is specified (or if the hard option is specified), NFS requests are retried indefinitely. If the soft option is specified, then the NFS client fails an NFS request after retrans retransmissions have been sent, causing the NFS client to return an error to the calling application.

> NB: A so-called "soft" timeout can cause silent data corruption in certain cases. As such, use the soft option only when client responsiveness is more important than data integrity. Using NFS over TCP or increasing the value of the retrans option may mitigate some of the risks of using the soft option.

[0]: https://linux.die.net/man/5/nfs#:~:text=the%20NLM%20protocol...

[1]: https://linux.die.net/man/5/nfs#:~:text=soft%20/%20hard,the%...

[2]: https://man.freebsd.org/cgi/man.cgi?query=mount_nfs&apropos=...


If it is an NFS concept, not FUSE, then "use interrupt mounts for things like sshfs" does not really make sense, because sshfs is a FUSE program and not related to NFS.


lazy/async/autoreconnect


No, it's a pretty specific concept that doesn't mean any of those things: https://news.ycombinator.com/item?id=39431698


actually it reads very much like an asynchronous mount


With “intr,” operations are synchronous but cancelable. That doesn’t match any sense of “async” I am familiar with, nor have I heard the term “async mount” before.


> WebDAV is a much better choice than FUSE. FUSE is a good concept, but buggy and poorly-implemented. Things like sshfs can break in very bad ways if e.g. there is a network connectivity issue. Not a hack.

I have had problems with particular fuse implementations (sshfs included, s3fs especially), but I've never had trouble writing FUSE implementations that behave correctly. Is there something deficient in the spec that you want to call out, or is the idea just that WebDAV carefully built networking into its core concepts, where you have to do a lot of extra work to squeeze those ideas correctly into a FUSE implementation?


I suspect if you've run into problems with a lot of things built on FUSE, the problem is FUSE.

Yes, s3fs and sshfs can both leave the system in an unstable state. For example, there can be a dead mount which is impossible to unmount, and in severe cases, blocks a clean reboot.

A file system in user space (or in network space) should NEVER break the system, no matter what happens in user space (or in network space). Most network file systems try to respect this (albeit with mixed success). FUSE does not.

I'm not claiming FUSE cannot be made to work. Just that it's very bad since (1) plenty of smart people clearly failed to do so (2) the badness it leaves behind should be more than it's permitted to.

I can point to specific issues, but at the end of the day, that's neither here nor there. At the end of the day, something like:

https://github.com/libfuse/libfuse/blob/master/example/poll....

Is about a hundred times more complicated than it should be. It should not require memsets, pthread mutexes, or flags, and should probably have an implementation in a modern, high-level language. To a large extent, that's the point of moving things out of the kernel.

I'm even perhaps okay with being permitted to do low-level operations for a particularly performance-constrained subsystem, but that's not 95% of the uses of something like FUSE.

Footnote: I actually enjoyed writing low-level code like this a lot, when computers were in the single-digit to triple-digit MHz range, and we didn't need to worry about people breaking in over a ubiquitous worldwide internet, but I left that mindset behind decades ago. Right now, I want code to be stable, simple, auditable, and secure.


> There is no way to query all tables for something,

Ha. I would bet anything you could adopt my MySQL findall() to SQLite for the purpose. https://stackoverflow.com/a/68915413/308851


This SQL is definitely supervillain territory.

I just can't decide if you're a sci-fi supervillain, a political thriller supervillain, or a Despicable Me supervillain.

;-)


Nice! I hope I remember this when I next need it.


> Things like sshfs can break in very bad ways if e.g. there is a network connectivity issue.

That’s a peculiar critique of FUSE. You could write the same thing about NFS or SMB (but especially NFS).


Huh. I guess I'm one of the few who's had generally good experiences with FUSE, then. I've used several different FUSE filesystems, ranging from disk-based, to network-oriented, to... well, let's just say a bit odd, kind of fun, but probably not particularly useful. I even wrote a couple of the latter myself, just for funsies.

Oh well, there's my FUSE rant for the day, I guess. Now, on to what I came to comment on....

I'm indifferent on the WebDAV vs FUSE thing, actually. But if the author wanted a nicer way to query a bunch of text files, where each line would roughly represent what a db row would, and there's whatever random delimiter character in there (comma, tab, pipe, or even those weird ASCII control characters nobody ever knows or remember exist for the purpose [0]), then I think he gave in and went to a database too soon. Besides the ubiquitous GNU textutils, there are tons of tools out there for working with record-oriented CSV and TSV files. [1] There's even `jq` if you're a real masochist and want to use json as your text-based data storage, query, and interchange format.

Now you've got me wondering if I was lucky in not having any significant issues with FUSE. But, as far as OP and the technical problems they were facing, I think it was a little premature to jump to `sqlite`. All those things they put down to "clunkiness" really just makes me thing that `sqlite` + WebDAV ended up being one of the members of the set of simple, elegant, and wrong solutions most problems seem to have. ¯\_(ツ)_/¯

---

[0]: https://stackoverflow.com/a/18782271

[1]: https://stackabuse.com/running-sql-on-csv-data-data-conversi...


I don't quite agree. There's a reason relational won, and continues to win more than a half-century after it was invented, despite no lack of wannabe usurpers.

For your model to work, what would be needed is the reverse: a system which allows me to interact with TSV files with SQL.

For 98% of uses of SQLite, performance and storage efficiency don't matter; we're talking about dozens or hundreds of records. It doesn't need to be well-implemented from that perspective. It does need to be robust. For example, databases solve a lot of problems especially around consistency in threaded systems which this would need to handle as well (although a KISS solution, such as handling all access from one thread / process, and simply blocking other requests during transactions, would be fine).

Something like that would likely eat SQLite's lunch.

q seems close. It'd need to support writes, table creation, and all the other stuff, perhaps be more actively maintained, provide documented API (not just command-line) access, and work with things like ORMs.


> For your model to work, what would be needed is the reverse: a system which allows me to interact with TSV files with SQL.

Several such systems exist, one of which is SQLite. https://www.sqlite.org/csv.html

I don't know if there's a virtual table extension for tab separated values, but SQLite imports them just fine.


Interesting. Thank you.

It's intentionally available as a single source file:

https://www.sqlite.org/src/artifact?ci=trunk&filename=ext/mi...

For easy modification to handle TSV, TSVx, JSON, Pandas, YAML, or whatever else my heart desires.


> a system which allows me to interact with TSV files with SQL

ClickHouse and DuckDB can both do this. Pretty sure AWS has some built-in stuff for this too.


I think those are a very different use case. sqlite is used for things like app settings. Both of those are heavyweight monsters designed for rather large data.

That said, I'm reading their documentation right now, and I think one or the other might slot into a different use-case I have where I do have rather large data. I'd last looked at this space quite a number of years ago, and both are way ahead of what I'd recalled.


Don't people use NFS instead of Fuse now? I'd assume it would have much better performance than webdav and it should handle a bunch of the issues around writes`


I am moving outside of my zone of expertise, but when I last looked (decades ago), systems like NSF assumed a filesystem backing store and thinks about things like byte ranges. There are a lot of operations which file systems support which work very poorly when this is not true, such as seeks, memory-mapped files, etc.

If I want the 50,004,123,121th byte of a file from a disk, that's very fast. If I want the same for a virtual object from an HTTP server, object store, virtual table, etc. it literally involves creating the whole object, and stepping through it byte-by-byte until I get there.

If the next request is doing the same 1k ahead, on a disk, that's probably in cache, and if not, I can get there quickly. If this was a SQL query, I probably need to redo the whole thing.

You get a natural explosion from O(1) to O(n) in many common cases, and for something like a complex SQL query, it can be much, much worse.


NFSv4 is stateful, and while it's true that the server has to support stateless reads for bad clients, most clients are ok with keeping the state ids that are returned after an open. Whether you have to step through the object byte-by-byte isn't inherent to the protocol, it depends on how you've stored the data.


I've implemented both in the last 6 months and the only reason I would expect anyone to use NFS over FUSE is for MacOS without macfuse/fuse-t. fuse-t is a cool project that provides (iirc) the high level fuse interface (libfuse) via NFS on MacOS. Compare to macfuse, which implements the low level interface (/dev/fuse) via a kext.

FUSE is much, much easier to work with. The protocol (even v4) should also be less chatty, and if performance is your goal there are projects like ExtFuse that can move some caching into the kernel via eBPF which I don't think would be possible with NFS.

If you poke around other user space file systems (sshfs, cephfs, objectivefs, etc) they're all FUSE.


I can implement a basic Fuse filesystem in a handful of lines of code in several languages. I'm not aware of any libraries that makes it equally simple to write an NFS server.


Why not just add views?


I'm also working on something like this:

https://github.com/Airsequel/SQLiteDAV

My mapping is: table -> dir, row -> dir, cell -> file


I wonder if mapping index->directory would be the best match, that way you could at least hypothetically reclaim some of the SQL benefits, and the directory entries could have more natural names. You'd need to have slightly different structure for unique vs non-unique indices, but that seems like minor issue.


I build a couchDB webdav server back in the day, you could also edit json documents or file blobs directly. the problem i discovered was that all OSes totally staled their webdav support and there are also enough differences between oses to be annoying. In the end to build somthing with great performance you would also need to control the webdav client side and probably build a fuse webdav client. I would have loved to see webdav maturing and becoming what the 9P vision was just for the web, but this obviously never happened as all the applications just went into to the web and used rest intead of webdav and everything else moved to sync protocols that sync to local folders.


I'm with you on wishing WebDAV continued its rollout. These days there are great low-drama server-side deployments like https://github.com/sigoden/dufs. It's run relative too - you could habe multiple dufs processes serving up different directories in different ways. But for WebDAV, you can't simply mount that on the client side for every OS that's equally low configutaion. For that reason, I really like sshfs as it can be initiated from the client-side without a lot of config (just a mkdir of the mapped dir), and it's OK most time despite it's lack of speed and multi-day uptime. I'm on a chromebook now and it turns out that Samba is the easiest client-side tech to use for remote file systems. DAv should've been uniquitous.


Wrap in an SSH tunnel and you can do some fun stuff over the network, too.


Over the network (when using azure or gcp storage) there's also https://sqlite.org/cloudsqlite/doc/trunk/www/index.wiki which doesn't require loading everything in memory.


> Although for now, the whole table gets read into memory for every read so this won’t work well for very large database files. There’s also no write support… yet.

At such a set of features I would prefer a tool that converts databases to a directory of real csv and jsonl files, at least there are no performance issues to worry about


I really like the idea of running a watch on an sqlite table with my common cli tools. If I understand correctly, it is being re-read on every request. Does that mean that changes to the sqlite database will be visible on the next read of a csv file?


I was expecting this to be a way to mount so-called SQL Archives (https://sqlite.org/sqlar.html) but this is just as cool.


> the SQL syntax for selecting a few records is much more verbose than head -n or tail -n

I use DBeaver to inspect SQLite files, and to also work with Postgres databases.

I kind of miss MySQL Workbench, but MySQL is pretty dead to me. And SQL Server Management Studio is a relic that keeps being updated.

I also sometimes make dashboards from SQLite files using Grafana, but the time functions for SQLite are pretty bad.


Could you expand on why the time functions in sqlite are pretty bad?


> Part of this is avoiding the overhead of figuring out a relational schema, but an equal amount of friction comes from the fact that .sqlite files are just slightly more difficult to inspect

I like this solution to that problem:

https://sqlitebrowser.org/


This looks pretty cool.


Wasn't there an extension that let you mount a filesystem as a table or db in sqlite? I wonder how far you can inception that. Mount a db as a filesystem, then mount that filesystem as a db..etc.


Borrowing concepts from DB2 here


Cute, seems legitimately useful, succinct. Not everything has to be super technically challenging to be valuable. I can see how this would be really handy.


I think its neat proof of concept, but I struggle to see any case where this would be particularly useful. Or rather when this would be more handy than what sqlite cli already offers.

like is this really meaningful improvement

    $ tail -n 3 Chinook_Sqlite.sqlite/Album.tsv
over this?

    $ sqlite3 -tabs Chinook_Sqlite.sqlite 'select * from Album' | tail -n 3


Why webdav instead of making short sqltocsv, sqltojson, etc scripts? You could make completion work (with some enormous time investment).


Because those all exist already, and this is cool?


A fine reason, I just thought you had a niche need prompting you to do it like this.


Nice. I think exposing the tables as csv, tsv, json & jsonl is to much of a cluttering. Format should be a mount option.


It is


Does it support mounting a sqlar file?




Join us for AI Startup School this June 16-17 in San Francisco!

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: