A couple of thoughts:

First, great job on the readme! One way you could improve it is by expanding on the "Caution" section. What's written is the beginnings of a threat model, but it could be improved by being more explicit about which attacks this system does/doesn't defend against.

> The only security property that disk encryption (and this package) provides is that all information such an adversary can obtain is whether the data in a sector has (or has not) changed over time.

I think the adversary learns a bit more than this. Randomized encryption would provide the above property, but the _deterministic_ scheme that's used here will let the adversary learn not only whether a sector changed, but whether its value matches what it was at a previous point in time.
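
To make that concrete, here's a minimal Go sketch (not this package's actual scheme; real disk encryption uses a tweakable or wide-block cipher rather than GCM) of why a deterministic per-sector scheme leaks equality across snapshots while a randomized one doesn't:

    // Deterministic vs. randomized sector encryption: with a nonce derived only
    // from the sector number, identical plaintexts produce identical ciphertexts,
    // so two backups reveal whether a sector was reverted to an earlier value.
    package main

    import (
        "bytes"
        "crypto/aes"
        "crypto/cipher"
        "crypto/rand"
        "encoding/binary"
        "fmt"
    )

    func main() {
        key := make([]byte, 32)
        rand.Read(key)
        block, _ := aes.NewCipher(key)
        aead, _ := cipher.NewGCM(block)

        sector := uint64(42)
        plaintext := []byte("same sector contents")

        // Deterministic: nonce is a function of the sector number alone.
        nonce := make([]byte, aead.NonceSize())
        binary.BigEndian.PutUint64(nonce, sector)
        c1 := aead.Seal(nil, nonce, plaintext, nil)
        c2 := aead.Seal(nil, nonce, plaintext, nil)
        fmt.Println("deterministic: ciphertexts equal =", bytes.Equal(c1, c2)) // true

        // Randomized: a fresh nonce per write; equality across snapshots no longer leaks.
        n1 := make([]byte, aead.NonceSize())
        n2 := make([]byte, aead.NonceSize())
        rand.Read(n1)
        rand.Read(n2)
        c3 := aead.Seal(nil, n1, plaintext, nil)
        c4 := aead.Seal(nil, n2, plaintext, nil)
        fmt.Println("randomized: ciphertexts equal =", bytes.Equal(c3, c4)) // false
    }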

How does this translate into the security of the database itself? Seeing which blocks have changed might reveal information about what data has changed. Let's consider a security game where I (the adversary) get to submit SQL queries, and then learn which blocks on disk have changed. After this initial phase (where I can learn where data is stored), I submit two different SQL queries, you pick one of them at random and run it, and then tell me which blocks have changed. I win if I can guess which SQL query you picked.

Suppose I submit queries which each insert into a different table. Because the tables are stored separately on-disk, it'll probably be pretty easy for me to distinguish them. But okay, that's still count-ish/size-ish, and maybe out of scope.

What if I submit two queries which each insert different values, but into the same table? Further, let's say that this table has an index. Based on which pages were written to, I can now learn something about the _values_ that were inserted, because different values will write into the index in different places.

Now, it's completely valid if the threat model says, "if you can see more than two copies of the database file, then all is lost." However, I think it'd be worth translating the current write-up of the threat model into the implications for leaking the database. For more examples of attacks based on seeing which indices/sizes changed, see [1] and [2].

Is it valid to pad the sqlite file to a multiple of the block size? Does sqlite ever call truncate on a non-block-aligned size and expect any truncated bytes to be fully removed?

What are the atomicity requirements for a SQLite VFS? SQLite, in general, is supposed to not get corrupted if power were to be yanked mid-write. However, because this VFS writes one block at a time, the computer dying mid-write could corrupt more bytes around the write position than would normally be corrupted if the standard VFS were used. It's possible this is a non-issue, but it's worth considering what contract SQLite has for VFSes.

[1]: https://en.wikipedia.org/wiki/CRIME [2]: https://www.usenix.org/legacy/events/sec07/tech/full_papers/...



The threat model has to exclude:

- attacks on a running app that has the keys loaded, naturally

The threat model has to include at least:

- passive attacks against the DB itself, lacking access to the keys

The threat model really should also include:

- active attacks against the DB lacking access to the keys (e.g., replace blocks)

IMO ZFS does a pretty good job against these threats, for example, so ZFS is a good yardstick for measuring things like TFA.

However, the fact that a running system must have access to the keys means that at-rest data encryption does not buy one much protection against server compromise, especially when the system must be running much/most/all of the time. So you really also want to do the utmost to secure the server/application.


ZFS, AFAIK, can offer something in addition which is harder for a VFS to offer, and which AFAICT no other SQLite encryption offers: a kind of HMAC Merkle tree that authenticates an entire database (at a point in time).

Alternatives, even those that use MACs, only authenticate pages/blocks. They still allow mix-and-match of pages/blocks from previous backups.
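
As a rough illustration (a hypothetical Go sketch; HMAC-SHA-256 and the tree layout are assumptions, not any particular product's design): a per-page MAC verifies each page in isolation, so a validly-MACed page from an older backup still verifies, whereas a Merkle root over all pages commits to the whole file at a point in time, and any swapped-in old page changes the root.

    // Per-page MACs vs. a whole-file HMAC Merkle root.
    package merkledemo

    import (
        "crypto/hmac"
        "crypto/sha256"
    )

    // pageMAC authenticates one page bound to its page number. This is what
    // per-page schemes give you: an old page with its old, valid MAC still
    // verifies if copied back into place.
    func pageMAC(key []byte, pageNo uint32, page []byte) []byte {
        m := hmac.New(sha256.New, key)
        m.Write([]byte{byte(pageNo >> 24), byte(pageNo >> 16), byte(pageNo >> 8), byte(pageNo)})
        m.Write(page)
        return m.Sum(nil)
    }

    // merkleRoot commits to every page's position and contents at once: reverting
    // any single page changes the root, so mix-and-match is detectable if the root
    // is stored or signed out of band.
    func merkleRoot(key []byte, pages [][]byte) []byte {
        if len(pages) == 0 {
            return nil
        }
        level := make([][]byte, len(pages))
        for i, p := range pages {
            level[i] = pageMAC(key, uint32(i), p)
        }
        for len(level) > 1 {
            var next [][]byte
            for i := 0; i < len(level); i += 2 {
                m := hmac.New(sha256.New, key)
                m.Write(level[i])
                if i+1 < len(level) {
                    m.Write(level[i+1])
                }
                next = append(next, m.Sum(nil))
            }
            level = next
        }
        return level[0]
    }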

I could, potentially, add optional/configurable nonces and MACs at the VFS layer.

I've refrained from doing so because (1) it complicates the implementation; (2) it can be added later, compatibly; (3) it doesn't fix mix-and-match; (4) it will impact performance further; and (5) it would be MAC-then-encrypt (against best practice).


Yes, that's right. In fact, ZFS w/ encryption gives you two Merkle hash trees, one using a hash function and one using a MAC. SQLite3 could do this, but it would have to change its database format fairly radically.

A SQLite3 VFS could, maybe, store additional metadata on the side knowing the SQLite3 database file format, I suppose. But if you really want this it's best to do it in the database itself.


> ... the fact that a running system must have access to the keys means that at-rest data encryption does not buy one much protection against server compromise, especially when the system must be running much/most/all of the time.

A common approach to help mitigate this is to have the keys be fetchable (eg via ssh) from a remote server.

Preferably hosted in another jurisdiction (country) in a data centre owned by a different organisation (ie. not both in AWS).

When the encrypted server gets grabbed, the staff should (!) notice the problem and remove its ssh keys from the ssh server holding the ZFS encryption keys.
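
Something along these lines, for example (a sketch only; the host name, key path, and use of the ssh CLI are all made up for illustration):

    // Fetch the encryption key from a separate machine at startup, so the key
    // never rests on the encrypted server itself. If that machine's operators
    // revoke this host's ssh access, the key can no longer be obtained.
    package keyfetch

    import (
        "fmt"
        "os/exec"
    )

    func fetchKey() ([]byte, error) {
        out, err := exec.Command("ssh", "keyserver.example.net", "cat /keys/db.key").Output()
        if err != nil {
            return nil, fmt.Errorf("fetching key: %w", err)
        }
        return out, nil
    }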

---

That being said, I'm not an encryption guy whereas some of the people in this thread clearly are. So that's just my best understanding. ;)


> When the encrypted server gets grabbed, the staff should (!)

If the people doing the grabbing are LEO then they have ways of taking running servers such that they keep running or otherwise don't lose what's in RAM. And if it's LEO then "the staff" should absolutely not do things that can be construed as destroying evidence.


> ways of taking running servers such that they keep running

That's an interesting point. Wonder how complete that approach is, and if it maintains network connectivity between the servers they're grabbing?

Some clustering solutions automatically reboot a server if it loses network connectivity for a short period of time (ie 1 min). That would really mess up the "preserve stuff in ram" thing, if it's purely just designed to keep a server running.


There are at least two ways. One is to keep the servers powered even after they are unplugged from wall power (they have special adaptors for portable PSUs). The other is to cryogenically cool the RAM, then cut the power, keep the RAM cooled, and read it later in a lab.


Interesting. The ram cooling sounds like it could work if done quickly and precisely enough. :)


Sure, if it's LEO. That's not the threat model for most organisations encrypting their data at rest though. :)

---

> should absolutely not do things that can be construed as destroying evidence.

It'd be a very long stretch to successfully argue "removing access to the key" is destroying evidence. The data would still be intact, and available, to anyone with the key.

Just not to whoever physically grabbed the server. ;)


I would get legal advice on that, from a lawyer in the relevant jurisdiction, before going with that.


Of course. And I'm just pointing out a commonly implemented approach.

LEO isn't generally the consideration for places encrypting their stuff. Businesses dealing with sensitive data (PII, etc) are required to encrypt as a matter of course.


First of all, thanks for the review. I'll try to respond to all points.

Disk encryption, on which this is based, is usually deterministic in nature.

So yes, an adversary 1 that can inspect multiple versions of a database (e.g. backups) can learn exactly which blocks changed, which didn't change, and which have been reverted; but that is all they should learn.

An adversary 2 that can modify files can also mix-and-match blocks between versions to produce a valid file with high probability.

And an adversary 3 that can submit changes and see their effect on the encrypted data can probably infer a lot about the database.

I'll try to make these more explicit in the README. In practical terms: adversary 1 is the one I thought I'd covered reasonably well; adversary 2 means that backups should be independently signed, and signatures verified before restoring them; adversaries 2 and 3 mean that this is ineffective against live attacks.
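
For adversary 2, the independent signing could be as simple as a detached signature checked before restore; a rough sketch (Ed25519 and the file layout are assumptions, not something this package provides):

    // Sign a backup with a key that never touches the database server, and refuse
    // to restore unless the signature verifies. This catches whole-file reverts
    // and mix-and-match between backups, which per-block encryption alone does not.
    package backupsig

    import (
        "crypto/ed25519"
        "errors"
        "os"
    )

    func signBackup(priv ed25519.PrivateKey, backupPath, sigPath string) error {
        data, err := os.ReadFile(backupPath)
        if err != nil {
            return err
        }
        return os.WriteFile(sigPath, ed25519.Sign(priv, data), 0o600)
    }

    func verifyBeforeRestore(pub ed25519.PublicKey, backupPath, sigPath string) error {
        data, err := os.ReadFile(backupPath)
        if err != nil {
            return err
        }
        sig, err := os.ReadFile(sigPath)
        if err != nil {
            return err
        }
        if !ed25519.Verify(pub, data, sig) {
            return errors.New("backup signature mismatch: refusing to restore")
        }
        return nil
    }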

Security, though, is also about comparing options. Reading the documentation for alternatives (even the expensive ones) I don't see this kind of analysis. I see 2 advantages to the alternatives that encrypt page data with a nonce and a MAC. The nonce allows reverts to go unnoticed: no ciphertext change means a block definitely didn't change, but a changed ciphertext doesn't necessarily mean the plaintext changed. The MAC ensures blocks are valid, but they can still be reverted to previous versions of themselves; mix-and-match is still possible. Do these two properties make a huge difference? Is there anything else I'm missing?

On your other points.

Yes, it's always safe to round up file sizes to the block size, for databases, journals and WALs (I could detail why, but the formats are documented). It may not be safe for all temporary files (I'm assuming it is), but that can be fixed for those files by remembering the file size in memory.
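
The rounding itself is just (a sketch; blockSize is a placeholder, not this package's actual sector size):

    package padding

    // blockSize is a placeholder for the cipher's sector size.
    const blockSize = 4096

    // roundUp returns the size a file should be extended to so that it is always
    // a whole number of encryption blocks.
    func roundUp(size int64) int64 {
        return (size + blockSize - 1) / blockSize * blockSize
    }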

About atomicity, corruption, etc., the VFS is supposed to declare its characteristics [1] to SQLite. Your concerns are covered by SAFE_APPEND and POWERSAFE_OVERWRITE. See also [2]. As a wrapper VFS, I filter most of those characteristics from the underlying VFS, forcing SQLite to assume the worst.

[1] https://www.sqlite.org/c3ref/c_iocap_atomic.html

[2] https://www.sqlite.org/psow.html
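
For illustration, the filtering of device characteristics amounts to clearing capability bits the encryption layer can no longer honor (a hypothetical sketch, not this package's code; the constant values mirror sqlite3.h):

    package vfsfilter

    const (
        iocapAtomic             = 0x00000001 // SQLITE_IOCAP_ATOMIC
        iocapSafeAppend         = 0x00000200 // SQLITE_IOCAP_SAFE_APPEND
        iocapPowersafeOverwrite = 0x00001000 // SQLITE_IOCAP_POWERSAFE_OVERWRITE
    )

    // deviceCharacteristics takes the underlying VFS's characteristic bits and
    // masks out guarantees that block-at-a-time encryption breaks, forcing SQLite
    // to assume the worst (e.g. that a torn write can damage nearby bytes).
    func deviceCharacteristics(underlying int) int {
        return underlying &^ (iocapAtomic | iocapSafeAppend | iocapPowersafeOverwrite)
    }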



