The Library of Congress Recommended Formats Statement (RFS) includes SQLite as a preferred format for datasets. The RFS does not specify a particular version of SQLite.
CSV isn't a single format but many incompatible formats that require additional metadata to decode, such as:
- whether quotes are "doubled" like SQL92 or backslash-escaped (\") like C
- whether embedded newlines are quoted or escaped as \n like C
- whether there are column headings
- what the character set of the file is (Excel insists it's the local machine's codepage)
- whether that's a BOM or part of a column heading
- whether lines terminate with \n or \r\n or \r
- what to do about ragged rows
and so on. Except for the simplest of datasets, CSV is almost certainly going to lose information. The only advantages over SQLite are streaming support and maybe compression, but even then there are better formats.
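To make the ambiguity concrete, a minimal sketch (Python's csv module, made-up sample data) of the same bytes decoding into different things depending on which dialect you assume:

    import csv, io

    raw = 'id;amount\n1;"3,14"\n'   # made-up sample: semicolon-delimited, decimal comma

    # Guess comma as the delimiter: one mangled column.
    print(list(csv.reader(io.StringIO(raw), delimiter=',')))   # [['id;amount'], ['1;"3', '14"']]

    # Guess semicolon: two clean columns.
    print(list(csv.reader(io.StringIO(raw), delimiter=';')))   # [['id', 'amount'], ['1', '3,14']]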
There's a current debate on my team about whether we should accept semicolon-delimited files as CSVs. You can't even rely on commas in CSVs.
I would say any format that is compatible with Excel is not suitable for archival. Ever open a file with numeric values with more than, say, 20 digits (I forget the actual cut-off)? Excel converts them to scientific notation and loses precision. When you change the format there is data loss: the length of the data is preserved but not the full precision.
I saw a problem recently where a file had a .csv extension but was actually colon-delimited. The data contained fields with large quantities of text, so a CSV parser happily parsed the file, splitting lines on the commas in the text, and didn't complain; the problem wasn't noticed until much later in the processing of the data.
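One partial mitigation, sketched below with Python's csv.Sniffer (sample data made up), is to sniff the delimiter and refuse the file if it isn't what you expect, rather than trusting the .csv extension:

    import csv

    sample = "id:name:notes\n1:alice:likes cheese, wine and bread\n"   # made-up colon-delimited data

    dialect = csv.Sniffer().sniff(sample, delimiters=",;:\t|")
    print(dialect.delimiter)   # reports ':' for this sample; bail out if it isn't ','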
This is a common problem when you receive a CSV from someone working in a different country: comma is the decimal separator in some locales (e.g. Germany), so Excel will automatically switch to semicolons as field separators.
So now your colleague in the UK office (or even someone who simply prefers using Windows with a US locale, or, as in my case, got a UK laptop because the German keyboard has some layout differences from what they consider "normal"...) double-clicks on the "German" .csv... Excel will show (most of the time) every line as a single FIELD.
So you have to close it and use Data Import.
We just moved to the latest version of Office 365 at the job and it still works like this (even if the import action can usually figure out that semicolon is the separator without me having to manually specify it...).
One of our clients internally uses an odd text file format with ~ as the field delimiter and a fixed-width header & footer containing metadata. And they still sometimes send such files with names ending in .csv.
~ is a better choice of field separator, though: it is far less commonly used in most datasets and natural languages. | can also make sense for the same reason, and it visually looks like a column divider as an extra bonus.
In hindsight, semicolons are the better choice, given that commas are common in many languages as decimal separators.
I don't know why there's never been some effort to create a "CSV format specification"; for example, the first line could indicate the quoting, delimiter, etc. style used in the file.
Guess it's never been big enough of a problem for people to take action. "Good enough" (or rather, "not bad enough").
> In hindsight, semicolons are the better choice, given that commas are common in many languages as decimal separators.
Or a pipe. Or the unprintable “field separator” character.
> I don't know why there's never been some effort to create a "CSV format specification"; for example, the first line could indicate the quoting, delimiter, etc. style used in the file.
Fundamentally CSV or any delimited format requires agreement between parties for interchange. At that point standards don’t really help because the parties can just agree to anything that works.
> Guess it's never been big enough of a problem for people to take action. "Good enough" (or rather, "not bad enough").
The problem there is those code-points are usually non-printable, making the resulting file less human-readable which is often a stated desire for a data transfer file format. Some text editors will show those code points, but others will completely omit them (or worse, reject them as invalid).
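For reference, the code points in question are the ASCII unit separator (0x1F) and record separator (0x1E); a toy sketch of using them, readability caveats and all:

    rows = [["id", "note"], ["1", "contains, commas; and | pipes"]]   # made-up data

    # Unit Separator between fields, Record Separator between rows.
    blob = "\x1e".join("\x1f".join(fields) for fields in rows)

    parsed = [record.split("\x1f") for record in blob.split("\x1e")]
    assert parsed == rows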
You used to be able to press Ctrl + letter to input control characters on traditional terminals — was it ord(char) - 0x40? — but outside Emacs, vi and similar editors, I am sure you can't do that anymore.
If you are questioning the existence of human-readable/writeable (typeable?) data formats, you might want to check out SGML/XML, JSON, YAML... Are you surprised that being typeable holds true for all of these, especially on standard keyboards?
It's more interesting how SGML-derived languages have kept <> symbols so accessible on keyboards regular people use.
That's not exactly what I mean; that's just one of many different CSV variants (and just because it's an RFC doesn't make it "official" either). What I mean is something like:
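(Something along these lines - the exact syntax below is made up purely for illustration:)

    #csv delimiter=, quote=" escaping=double header=yes encoding=utf-8
    id,name,comment
    1,alice,"says ""hi"""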
Then you always know what you're dealing with, unambiguously.
There are other cases where embedding some "metadata" in CSV files is useful; for example the exports I create in my app are versioned by prefixing the first header with the version number. It works, but it's not exactly great.
Of course, the downside is legacy software not being able to process it.
There are CSV specs - see https://specs.frictionlessdata.io/csv-dialect/ and https://github.com/csvspecs , for example. Note: both initiatives are ongoing and alive and kicking. The trouble is almost nobody cares - as your ignorance shows - just complaining and playing the know-it-all is so much easier and more fun, isn't it?
Jokes aside, we had that debate in our team way back and decided to follow the rfc and only use commas. The format is often abused, better not create more tools that do so.
Totally understand and respect that decision. In our case these files come from third parties. We ingest that data, grind it up, sprinkle on our own blend of 11 herbs and spices and serve it to customers. Rejecting that data because it doesn't strictly conform to the RFC makes our product less appetizing. Or we can hold our noses and accept any consistent delimiter while working with the data providers to follow the standards.
Last time I checked, Outlook (The desktop application) exported contacts in localized CSV (semi-colon separated in my case) but Outlook (online) only accepted contacts separated by commas. As I was trying to migrate contacts on someone else's computer, I had to use notepad to change it.
I used to think this too, before I started researching the file format and its associated RFCs. Then I discovered a lot of that is actually explicitly defined, such as C-style escaping not being accepted and newlines following the MS-DOS format (\r\n).
Of course, there might be programs that write invalid CSV or parsers that accept invalid CSV, but that doesn’t mean CSV isn’t a standardised format.
> Of course, there might be programs that write invalid CSV
Like Excel?
RFC 4180 exists but that doesn't mean any file with .csv at the end of the name or a text/csv MIME type can be parsed according to it. Also, you still have to track the metadata, such as the charset and if the first line is a header. Both are optional in the RFC.
The header should be optional. That's a schema-specific detail, like the number of columns.
The character set is also forgivable given the age of the format; it comes from an era when there wasn't an agreed single character encoding to rule them all. Sure, RFC 4180 probably should have specified UTF-8, but by that point there were already three decades of CSV usage in different encodings on mainframes, which are likely still in use even now, making it rather pointless to specify any character encoding.
Don’t get me wrong, I’m not a massive fan of CSV either. The way quoted new lines are encoded, for example, is horrible and I’ve never liked the “” format for escaping quotation marks. But a lot of CSVs biggest problems are ironically a result of its success: as a format it’s so easy to use that a great many developers can hand crank their own parsers without looking at the spec. So I find it hard being critical about the lack of standardisation in CSV when the real problem is the number of developers who don’t follow the standardisation.
> the real problem is the number of developers who don’t follow the standardisation.
RFC4180 isn't an Internet standard. Says so right at the top in the first paragraph.
I think I have a few thousand files on my computer right now whose names end in ".csv", and I'll bet money not one of them agrees with RFC4180 to the letter except by accident of the data itself, and that's a Real Problem to me.
I can concede that two parties could agree to interchange according to RFC4180, but as a general format I maintain that for archival and interchange purposes CSV cannot be divorced from the rather complex schema I alluded to without data loss.
> RFC4180 isn't an Internet standard. Says so right at the top in the first paragraph.
I didn't say it was an Internet standard. I said standardised. Ok, I'll concede it is more of an informal or de facto standard but when IBM, W3C, IETF, OKF and others all publish the same parsing rules for CSV, it's hard to agree when people make statements like "there's no standard in CSV". The problem isn't that there isn't a standard to CSV, the problem is that people often don't follow those conventions. But you have that same problem with other file formats too.
> I think I have a few thousand files on my computer right now whose names end in ".csv", and I'll bet money not one of them agrees with RFC4180 to the letter except by accident of the data itself, and that's a Real Problem to me.
That's just conjecture. And even if that were proven true, it's still only anecdotal. That said, I do sympathise with your point. But you could make the same argument for
- JSON files that don't follow spec (support for comments, aren't UTF-8 encoded, have been manually written so don't follow the escaping rules correctly and thus only parse correctly by chance).
- XML files that have been manually cranked and so don't follow schema
- HTML documents that don't follow specification and thus browsers do a lot of non-specification interpretation work to render correctly
The IT industry is littered with examples of people not following the docs. CSV isn't unique in that regard.
> I can concede that two parties could agree to interchange according to RFC4180, but as a general format I maintain that for archival and interchange purposes CSV cannot be divorced from the rather complex schema I alluded to without data loss.
I wasn't commenting on whether it's a better format than another file format. I was commenting on your points about standardisation saying there are an abundance of published documents on how to read and write a standard CSV file and C-style escaping isn't part of that specification.
I’m getting a slew of downvotes so maybe we’ll tackle it from the other side: I’ve evidenced my point, now let’s see a list of popular tools that do use C-style escaping in CSV files or other wildly incompatible rules. I’d wager they either don’t exist or are some minor non-enterprise utility.
I’m not a CSV fanboy by any means but I have used CSV with a number of enterprise solutions because that’s all they’d offer. And my experience is that the comments levelled against it in this discussion are overstated.
It is an informal standard. The RFC just says it isn't defining an Internet standard. The reason, as I'd already pointed out, is that CSV predates RFC 4180 by three decades, so what the RFC really sets out to do is describe what those established conventions are and what the MIME type should be. That doesn't mean that what the RFC describes isn't already a convention (it's the same specification as published by IBM, W3C and many other big hitters).
> it would still be a poor choice for archival purposes.
I hadn't suggested it was a good choice for archival purposes. You're building a straw man argument there.
My point was that there is a conventional standard to CSV and that is very well documented. The issue with CSV is that it's so easy to write a parser that some developers do so without bothering to read any documentation on how CSV should be parsed. That's not the fault of CSV, that's the fault of lazy developers. But I do completely agree that CSV has other faults (which I had also discussed too).
> it would still be a poor choice for archival purposes.
It isn't being compared to all formats, it is being compared to sqlite. Some random version of one implementation written in C, stored in "fossil", may be good enough for the Library of Congress, but that kind of nonsense went too far for even browser makers to take seriously.
Meanwhile it is trivial to read and infer data from formats like CSV to either use directly or automate a reading process with any tools that exist now or in the future.
Isn't XML the obvious choice for robustly future-proof long-term data storage? Why mess about with an anemic scheme like CSV or a DBMS-specific scheme like SQLite?
- Quotes are doubled.
- Newlines are quoted.
- Both having headings and not having headings are valid; that's document/data specific.
- Use UTF-8.
- Do not use a BOM at all. But if for whatever reason you must, quote the column heading that contains it.
- Idk which line ending, but you should probably handle both.
- Do not emit ragged rows.
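A minimal sketch of emitting CSV under those conventions (Python's csv module, made-up data): doubled quotes, quoted embedded newlines, UTF-8 with no BOM, CRLF line endings, no ragged rows.

    import csv

    rows = [["id", "comment"], ["1", 'says "hi"\nand bye']]   # made-up data

    with open("out.csv", "w", newline="", encoding="utf-8") as f:   # UTF-8, no BOM
        w = csv.writer(f, quoting=csv.QUOTE_MINIMAL, lineterminator="\r\n")
        for row in rows:
            assert len(row) == len(rows[0])   # never emit ragged rows
            w.writerow(row)                   # quotes are doubled, embedded newlines stay quoted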
Having worked with a lot of government CSV/XLSX datasets (e.g. IPEDS college reporting), I can say I'd prefer something with proper foreign keys and less ambiguity in structure (people like to put descriptive cells at the top of spreadsheets)
Grist is an sqlite-based spreadsheet format that captures relations well and discourages funky structure https://www.getgrist.com/blog - it works well for passing relational datasets around (disclosure, I work on it)
Proper foreign keys don't really suit the dumped data case. Tables are usually imported one by one, in virtually random order so dependencies are often missing at the import time. Even if they were not, checking keys on every record import would mean a huge load for big datasets - that's really slow. That's why foreign key control is usually switched off before importing.
Postgres at least can defer these checks, and it's generally a best practice to disable rebuilding indexes while loading as well. It's not hard to simply load the tables in order so the keys are available. I would expect any decent unload utility to also create a script to load the data, which should be in the right order.
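In SQLite terms, a rough sketch of the same approach (file name and table contents assumed): skip per-row enforcement during the bulk load, then verify everything at the end.

    import sqlite3

    con = sqlite3.connect("dump.db")            # hypothetical import target
    con.execute("PRAGMA foreign_keys = OFF")    # don't check each row during import
    # ... bulk INSERTs here, tables in whatever order the dump provides ...
    con.commit()

    con.execute("PRAGMA foreign_keys = ON")
    violations = con.execute("PRAGMA foreign_key_check").fetchall()
    print(violations or "all foreign keys satisfied")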
I think the implication here is that the archival should be done into a SQLite database and then all you need is a SQLite binary to read it. So there wouldn't even be a need to use a load script.
CSV is a bad idea if you work with text or strings. I work a lot with data transformation, and CSV with text usually means headaches. Not to mention if you handle data from different countries.
Excel is too limited, the ODF formats are typically not widely used outside OpenOffice or LibreOffice..., same with Parquet and other file formats. Sadly this is how it is.
You can use SQLite as a file for moving data around, as I do, but I guess it's not practical for everyone.
For database dumps I prefer escaped TSVs personally but the delimiter doesn't really matter as long as it is consistent. Strict adherence to quoting, escaping and character sets does, along with tracking that metadata. If you aren't comfortable using a capital A as a delimiter your format is not robust enough for the real world.
Or you could just take the entire engine with you, which is what SQLite provides.
This reminds of the WebSQL spec being dropped because the only implementations used SQLite:
> The W3C Web Applications Working Group ceased working on the specification in November 2010, citing a lack of independent implementations (i.e. using database system other than SQLite as the backend) as the reason the specification could not move forward to become a W3C Recommendation.
-- Wikipedia
> This document was on the W3C Recommendation track but specification work has stopped. The specification reached an impasse: all interested implementors have used the same SQL backend (Sqlite), but we need multiple independent implementations to proceed along a standardisation path.
-- W3C
I wonder how much of their reasoning applies here.
I've grown to not mind WebSQL being dropped, despite that 10 years ago, I wish it were standardized.
WebAssembly's introduction a few years later means that SQLite can be built and used as a wasm module, and you aren't limited by whatever old version of SQLite was used for WebSQL -- you can always update it independently according to your needs.
It's strange that Microsoft didn't try to use SQL Server Compact Edition, which was available on Windows Phone, to provide the distinct implementation required.
The main problem was that the WebSQL "spec" draft did basically say: "Database that supports SQL the same way as this specific version of SQLite does", and nobody felt it was important enough to put in the work to actually specify something that could be implemented long-term.
Fascinating, never heard of SQL Server Compact Edition before. Unfortunately Wikipedia reports it was deprecated in 2013 and will reach end of life in 2021. Also it doesn't have an ODBC driver.
Certainly not a contemporary alternative to SQLite.
The biggest problem with WebSQL == SQLite was that the semantics couldn't be nailed down beyond a pile of C code. That is doable if there is a single application vendor for a given file format, but a total nightmare if you are trying to write code for the last 3 versions of 5 different browsers.
But SQLite was also a problem for anyone that wanted different semantics than what SQLite offered, like statically typed columns. Better to just provide a fast indexing mechanism and let people define semantics on top of that.
Standards like OpenDocument are expected to have multiple independent implementations. I'm not sure such a thing is feasible with SQLite (this is why WebSQL was rejected).
I'm personally sympathetic to the idea that certain implementation-defined monocultures are OK but I don't think that idea has critical mass.
The SQLite website is a little more positive than that, but as a "quite a bit in the future" option, rather than something that can happen today:
> All that said, it is possible that SQLite might one day be recoded in Rust. Recoding SQLite in Go is unlikely since Go hates assert(). But Rust is a possibility.
Thanks for posting that link, it was a very interesting read. I'm a big fan of Rust, but that article really convinced me that C is the best language for SQLite. A couple of highlights IMO:
> 2. Safe programming languages solve the easy problems: memory leaks, use-after-free errors, array overruns, etc. Safe languages provide no help beyond ordinary C code in solving the rather more difficult problem of computing a correct answer to an SQL statement.
(My commentary) So SQLite development is at such a different level to normal application that the memory safety errors that are a concern for us mortals are the "easy" problems for SQLite developers!
> 5. Safe languages insert additional machine branches to do things like verify that array accesses are in-bounds. In correct code, those branches are never taken. That means that the machine code cannot be 100% branch tested, which is an important component of SQLite's quality strategy.
(My commentary) I don't really agree with this one: If a branch cannot possibly be taken then I don't think it counts towards your coverage percentage... just like, if you're careful never to dereference null pointers, you don't include dereferencing null pointers as a missing part of your test coverage. Still, I thought it was very interesting, especially since it's not quite the argument I was expecting (that those checks are wasted CPU time and binary bloat because they're not needed). (Edit: What's more, exactly this situation applies with SQLite's own ALWAYS() and NEVER() macros [1])
Not familiar with the specifics of the SQLite situation, but moving away from 100% means maintaining an exceptions list, introducing complexity and room for error (eg mistakenly adding exceptions when you can't figure out how to reach a branch).
If they are at 100% today, it'd be a very convincing argument to keep it that way.
The problem is those branches CAN be taken, otherwise the compiler would optimize them out. So now SQLite needs to figure out how to cause them to be taken to verify that the right thing happens.
The problem is not the implementation language of sqlite, the problem is the author. Hacks upon hacks, and nothing of it in the docs. Instead it represents itself as the most secure of all databases. World class, whilst being worst of its class.
Hardening sqlite to get some security back from the insanity internally is a herculean task. I only scratched the surface with mine. FTS accepting custom tokenizers by pointer, dynamically! If someone wants a different tokenizer it needs to be compiled in.
It fails by calling C++ an object-oriented language, which is false. C++ is a deterministic-destruction language.
But in the end it doesn't matter, sqlite has been in C long enough that all the hard things C++ gives you for free have been coded manually anyway so there is no real point. For new code C++ might be better, but that is debatable. Just like rust might be better, but that is debatable.
Nobody is feasibly going to rewrite Sqlite in Rust.
Genuine curiosity: why? I know (good) database systems are hard to make, but does SQLite have particularly 'hard' parts which are near impossible to replicate? Especially since we now have a (I assume) well-documented base implementation and an extensive test suite/history of errors to avoid?
Is making a file-based database that hard, compared to creating a new programming language, for example?
> Is making a file-based database that hard, compared to creating a new programming language, for example?
Yes. I can hack together a trivial compiler in a weekend. A serious (non-optimising) compiler for a complex language is a task you could do in a few months.
If you gave me two years I'm not sure I could implement a filesystem API that correctly did something as simple as "atomically append to a file". See also Dan Luu's writing on the topic of filesystems: https://danluu.com/deconstruct-files/
The problem is writing a reliable file database. Firefox had a lot of bugs related to history-file corruption until the history store was replaced with SQLite. The same story with Subversion, where they replaced the Berkeley DB backend due to reliability issues.
This reliability comes from very deep knowledge of OS APIs and all their corner cases. Creating a programming language typically involves fewer corner cases, and the bugs in the compiler/runtime are much simpler to reproduce (again, typically).
Maybe it's not so much that it's impossible to do so. Perhaps there is just no reason to do so.
Sqlite is one of the most battle-tested pieces of software in the world. It's used "everywhere" and it's performant.
It is not the case, as some seem to believe, that rewriting software in Rust will magically eliminate all bugs and give a free speed boost compared to well-tested and optimised C.
That's very interesting. The use case described is for quite a limited subset of sqlite. I imagine a limited query language and kv-store could suffice for the usage described in the article. (Probably could be done in a weekend! wink.)
I can think of reasons why re-implementing SQLite might lack appeal, but why would it be infeasible? Even if the format is poorly- or under-specified (note: I have no idea whether that is true), you could definitely check the source.
You'd need something like a "Sqlite standard" then, with a clear path to how to evolve it taking all stakeholder interests into account. Seems like a job at least as difficult as writing the sofware itself.
(E.g., lack of a standard is why alternative Python implementations never went anywhere, even though the original cPython is, frankly, dreck from a software quality point of view.)
There is a standard though-- the database file format. You can read and write database files without using any SQLite source code by conforming to the expected format: https://www.sqlite.org/fileformat.html
It's not even that exotic: a bunch of rows in a simple binary format stored in a btree. The rollback journal / WAL are trickier, but aren't strictly necessary.
Implementing arbitrary SQL queries on top of the file format is obviously _much_ harder.
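As a small illustration of how approachable the outer layer is, a sketch (Python, file name assumed) that reads just the documented 100-byte header:

    import struct

    with open("example.db", "rb") as f:   # any SQLite database file
        header = f.read(100)              # the format reserves a 100-byte header

    magic = header[:16]                   # should be b"SQLite format 3\x00"
    (page_size,) = struct.unpack(">H", header[16:18])
    if page_size == 1:                    # the stored value 1 means 65536
        page_size = 65536
    print(magic, page_size)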
I wonder if there’s a Presto connector for this file format, such that you could dump a bunch of SQLite files in object storage and run analytics on them like they were parquet. Of course, any such driver should probably just link SQLite itself...
SQLite has millions of lines of test code. It would actually be difficult to re-implement it in such a way that it passes the tests but is incompatible.
Would it be as robust? Absolutely not, unless all the OS interaction code were transcribed directly.
But Rust has no compiler that targets anywhere near the number of platforms SQLite is deployed on, and never will. The best that could be hoped for is that most such platforms would no longer be used, by the time the transcription is done.
Kind of reminds me of http://utf8everywhere.org/, but in that case it'd be sqlite-everywhere. Which is a pretty good point. A large number of Apple-made iOS apps went this route, and they seem to be doing ok!
I think a lot of Android apps use Sqlite as well. But the achilles heel of Sqlite in my experience is network filesystems - it's well known & documented that Sqlite can suffer corruption on NFS, because of issues with locking. That doesn't really matter for iOS & Android apps, where you know storage is local, but it's a serious sticking point for desktop applications, where e.g. NFS home directories are not unusual.
This sounds similar to the challenge of storing a git repo in a cloud-synced folder (Dropbox, OneDrive, etc). You have so little control over the syncing behavior in these systems that they just cause piles of locking errors. The solution I'm trying for this involves using a bare repo in the synced drive that's used as a remote for a working repo in a separate, unsynced folder.
I vaguely remember a file archive format similar to JAR/WAR (for Java) or ASAR (for Electron) which was designed for bundling software with its dependencies and static resources. It is not a container because it's just files, with no specification for its runtime. Unlike JAR, it is language agnostic. These bundles could be compiled and compressed on your local machine then scp'd to a server that knows how to execute this bundle. If I recall correctly, it was some open source project from a big company.
I've spent hours trying to find it again and am wondering if I might be remembering something from a dream.
While I generally agree with the points they make, I'm not sure I would want a (stock) sqlite used for "less" trusted files.
While sqlite has a very comprehensive test suite, there has been more than one attack using "manipulated" database files, and as far as I know (I might be mistaken) the test suites are not focused on testing that case much.
Though I'm positive that it would be fine with a "hardened" sqlite where certain features are not compiled in at all, so that the attack surface is reduced.
The article describes the current format as a zip of files where the main content of presentation slides is stored in a `content.xml` file within. Yes, sqlite may have had file format vulnerabilities in the past, but I cannot imagine trusting it less than any xml library in existence.
At work we used a library that relied on Expat XML parser. After 20 years it still gets exploitable bugs. As the software was Windows-only we patched the library to use OS API for XML parsing to delegate to Microsoft the job of fixing security bugs.
I don't agree. That document states that the CVEs have historically required one of two preconditions:
> The attacker can submit and run arbitrary SQL statements.
> The attacker can submit a maliciously crafted database file to the application that the application will then open and query.
If you look at the actual list of CVEs, all but one start with ‘Malicious SQL statement’. The single one that doesn't is suffixed with ‘The bug never appeared in any official SQLite release’.
In other words, there has never been an official release of SQLite which was vulnerable when presented with a crafted database file.
The proposal to use sqlite database files instead of zip files falls under the precondition that an attacker can submit a maliciously crafted database file. So the security concerns apply.
SQLite is mostly used in situations where neither the database nor the SQL statements can be provided by the attacker. If SQLite is used to exchange data, both of these attack vectors are available.
> If you look at the actual list of CVEs, all but one start with ‘Malicious SQL statement’. The single one that doesn't is suffixed with ‘The bug never appeared in any official SQLite release’.
The list on the page is partial, and the language on that page is disturbingly dismissive. Reading leaves me more concerned that they're too arrogant to take security researchers seriously.
Think about it: if even JSON, which is kinda simple, has that many problems with parsers, how likely is it that something multiple degrees of complexity higher, which wasn't developed with untrusted database files in mind, is secure?
Besides that, the large majority of things tested in your link are irrelevant for security in this use-case. They do matter if you need fully stable de- and re-serialization or e.g. have a fully trusted validator; for certain security-token use-cases it matters. Many of the things marked as failures could even be considered "more robust" or "more secure" (though not fully standard-conformant) parsing, e.g. accepting trailing commas (disallowed by JSON), rejecting certain unescaped control/special characters (which JSON allows), or rejecting overly long field values (which JSON allows). What matters is that a) it has protection against too-deep recursion so no stack overflow can happen, b) it makes sure the text it outputs is correctly encoded (independent of whether or not it was well-formed in the JSON blob), and c) it doesn't crash in a way which leads to security problems (which the test suite you linked doesn't differentiate from "acceptable" crashes).
As long as one uses a single parser to read a particular input, it matters little if it produces a result that is different from what some other parser generates, as long as the parser has no security bugs. And JSON is simple enough to cover a particular interpretation of its spec with test suites, making a security vulnerability extremely unlikely.
Surely if one uses one parser to verify the payload and another to use it, disaster can follow, as it did with the iPhone verification bug.
I feel like there should be a way to make a maliciously crafted database file execute arbitrary SQL... SQLite doesn't do stored procedures, but there's gotta be something in that surface area.
No, the article just replaces the container with SQLite but not the components (e.g. a slide), which without extra changes are still XML, stored as text in the database.
The change from storing the inner document format in something which is not XML is completely independent of changing the outer container to sqlite.
SQL isn't really suited as a markup language, so storing styled "flowing" text in it isn't that good of an idea. The article only recommends storing the separate blobs of styled text in tables (it also sidesteps the whole problem of text flowing across multiple pages by simply using slides as an example ;=) ).
Lastly, I'm against using a "stock"/unhardened sqlite version, not sqlite in general. Ironically one of the features I would disable for now is related to text search, but I don't remember its name so I can't really use it as an argument :=/.
So if you ask me:
- Change the container to a different format than zipped XML files; consider sqlite, but use a hardened, feature-limited sqlite version.
- Change the inner representation away from XML(1)
(1): Ok, there are ways to safely use XML, but this isn't useful for this discussion: they require a bunch of things which all reduce usability (of the XML library) and performance, only work for certain use-cases where you e.g. do not need fully stable serialization and de-serialization, might limit you to a subset of XML, mustn't rely on the verifier for anything safety-related, and preferably shouldn't involve writing the parser in C/C++... so it's easier to just not do it tbh.
That's fair. Though the disparity is still an order of magnitude...
More seriously, my point is that SQLite is very unlikely to be the most vulnerable piece of software in your stack. It's probably the most stable and bug free piece of software that you use on a day to day basis that hasn't been formally proven correct. Even openoffice itself has more CVEs: https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=openoffice
And some of them are due to the very issue I brought up above...
> CVE-2012-0037 Redland Raptor (aka libraptor) before 2.0.7, as used by OpenOffice 3.3 and 3.4 Beta, LibreOffice before 3.4.6 and 3.5.x before 3.5.1, and other products, allows user-assisted remote attackers to read arbitrary files via a crafted XML external entity (XXE) declaration and reference in an RDF document.
> My point is that SQLite is very unlikely to be the most vulnerable piece of software in your stack.
Sqlite is very high quality, but I am rather doubtful this is true if you actually care about having a secure system to start with. I can think of three main reasons an office suite would have serious security vulnerabilities:
1. It's written in an unsuitable, memory-unsafe, programming language (as most office suites sadly are).
2. It includes an unsuitable programming language for scripting purposes (Word Macros are of course infamous for this).
3. Vulnerabilities in one of its dependencies (like XML parsers)
If you care to, 1 and 2 are very easy to avoid and 3 will be pretty hard to avoid completely, but you can limit your exposure. You need a GUI toolkit at the minimum, but you can completely avoid parsing as an attack vector by either not using a braindamaged format like XML in the unlikely case you don't need compatibility, or if you must, use an xml (zip etc.) parser written in a memory safe language. If you followed all other steps, but chose sqlite instead, I'd say chances a very good it's now your main vulnerability.
You have crafted a nigh-impossible tower of requirements that you admit are way beyond what is commonly practiced in the software under question in order to disclaim the statement "SQLite is very unlikely to be the most vulnerable piece of software in your stack". Right now on the front page of HN is an article about how Microsoft was also a victim of what may come to be labeled the most impactful hack ever, a list which includes most (all?) US government agencies. The root vuln? A crap admin password used by an IT monitoring software company which has software deployed to an alarming percentage of the industry.
I dunno man, I feel like my original statement is pretty reasonable. Let's please not move the goalposts to the moon. :)
I don't think your statement is unreasonable at all coming from the common perspective that it is part of the natural order of things that opening a text document would expose your computer to a non-negligible risk of hostile takeover.
Personally I find this a very unreasonable thing to accept. Secure (single user) text processing (etc.) is not a hard problem at all and there is zero reason it should create any security risks.
Crap passwords (and security practices) are not pertinent at all here, unless you also argue that Boeing shouldn't really bother to make airplanes with non-lethal autopilot interactions because being killed by a mis-designed airplane is the least of risks to the typical US citizen's life given how obese he or she is.
The pertinent question is: if you are to design a wordprocessor, should you avoid baking sqlite into the design because of security concerns? I think you should. Sqlite is written and tested very well, but it is written in not formally verified C, and using a memory unsafe language for reading potentially hostile databases poses a completely unnecessary and avoidable security risk.
Or run the IO in a sandbox where it is streamed out in a simpler/safer format. Or run the entire suite in a sandbox like we do with parts of web browsers.
Because SQLite, as a general purpose database system, can be used for loads of things.
A two-piece component (DB + sandbox) that you use for 100 different things is going to end up much more secure (overall) and require less human effort than maintaining 100 different systems for each use case.
It's like "don't roll your own crypto". That is the advice not because it's impossible to roll your own secure crypto, but because it is just much more likely that you will repeat bugs that other people have already figured out, so it is better if everyone is using one library so that when an issue is found and resolved, it applies to everyone. It is still good advice even though that 3rd party crypto lib is probably going to have a lot of unneeded functionality that actually increases the overall attack surface area than if you made your own for whatever your specific purpose is. I think the same logic applies here.
You could always sandbox it. This has actually already been done at least once: years ago, there used to be a "pure Java" JDBC driver for SQLite that was generated by compiling the original C code to a MIPS binary, and then translating the result into Java bytecode.
It's a multi-hundred-thousand line project (current count, excluding tests, is about 210,000). Like any project of a similar size, it has a history of security vulnerabilities. OpenBSD removed it because it was too much code to audit. OpenBSD dodged several issues like this RCE (https://research.checkpoint.com/2019/select-code_execution-f...) as a result.
Hundreds of thousands of lines of code are simply going to have vulnerabilities and bugs. It's unavoidable with that level of complexity. I'm skeptical anything that complicated is a wise choice for a general file format -- especially when smaller, simpler, and more easily debuggable and independently implementable formats exist.
That bug was fixed long ago, even before the referenced video was produced.
The described attack is clever. It exploits the fact that an attacker might alter the schema so that it invokes an SQL function with side-effects when an app simply tries to read from a table. And depending on those side-effects, an exploit might be possible. In the video, there was a bug in a built-in SQL function that caused exploitable side-effects. But that bug was fixed long before the video was even produced. The examples in the video were from an older version of SQLite. They did not work for the latest SQLite release on the day that lecture was given.
Since the checkpoint.com attack was described (in the video and elsewhere), new defense-in-depth features have been added to SQLite to make similar exploits increasingly unlikely.
(1) Built-in SQL functions that have side effects cannot be used in the schema. Side-effect functions can only be invoked directly by the application.
(2) When applications register their own custom SQL functions, they can now mark those functions as "direct-only", meaning that they are prohibited in the schema.
(3) Run-time and compile-time options are available to prohibit the use of SQL functions in the schema that are not explicitly declared to be safe for use in the schema - that is, functions without side effects. This is for use in legacy applications that might have been created before the per-function flag that prohibited use within the schema was available. It is also an extra layer of defense for complex applications that might add hundreds or thousands of side-effect SQL functions - to ensure that the "direct-only" flag is not accidentally omitted from one of them.
(4) Run-time options are available to disable triggers and views in applications that do not need them. This is not necessary to avoid an exploit, but it does provide an additional layer of defense.
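On the application side, a sketch (Python stdlib; file name assumed, and PRAGMA trusted_schema needs a reasonably recent SQLite) of a defensive way to open a database received from elsewhere: read-only, schema not trusted, structure checked before querying.

    import sqlite3

    con = sqlite3.connect("file:untrusted.db?mode=ro", uri=True)   # open read-only
    con.execute("PRAGMA trusted_schema = OFF")   # don't run functions referenced by the schema
    result = con.execute("PRAGMA quick_check").fetchone()[0]
    if result != "ok":
        raise ValueError("refusing malformed database: " + result)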
The article convinces me on the failure of the ZIP method in at least 3 respects:
1) For efficient incremental updates. Especially when you're saving a large file often, making constant single changes to ZIP files is just inefficient.
2) The insane memory gulps that ZIP takes.
3) The pile of files method. The data in OpenDocument files could be much better sorted/accessed through SQLite tables, rather than through XML files in ZIPs.
I think SQLite makes far more sense for OpenDocument. Incidentally, SQLite has also made good arguments previously for its use to replace application file formats: https://www.sqlite.org/appfileformat.html
In a past job I tried hard to replace an Access database file with sqlite for a C++ desktop application. One big win from sqlite was on the testing front: you can just open an in-memory sqlite database and test against that without worrying about disk state, or state leaking between test runs.
We use in-memory sqlite databases for unit testing a MySQL backed web service. It’s great! You can build and test the code without setting up a MySQL server and no risk of state leaking like you said. The syntax isn’t exactly the same but we have a small wrapper layer that rewrites MySQL-specific queries into their SQLite equivalents.
The only real downside is that it’s a C dependency which makes cross compiling a bit of a pain until you figure out the magic incantation the compiler wants.
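The setup is about as small as it gets; a sketch of the same idea in Python (the schema is made up):

    import sqlite3, unittest

    class UserStoreTest(unittest.TestCase):
        def setUp(self):
            # Fresh, isolated database per test; nothing ever touches disk.
            self.con = sqlite3.connect(":memory:")
            self.con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

        def test_insert(self):
            self.con.execute("INSERT INTO users (name) VALUES (?)", ("alice",))
            count = self.con.execute("SELECT count(*) FROM users").fetchone()[0]
            self.assertEqual(count, 1)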
The whole "Writes are atomic" still breaks down if they assume that disks are local. Having worked with one of these zip document format for 20 years now, I have found that local disk is the edge case, and SMB is the norm in corporate environments.
Sure you can download the sqlite file at the start of a session and work on it locally in a temp directory with atomic writes - but in the end the user doesn't think they have "saved" until version N+1 of the file is actually on their network share.
The sqlite documentation advises against using it in that scenario:
If there are many client programs sending SQL to the same database over a network, then use a client/server database engine instead of SQLite. SQLite will work over a network filesystem, but because of the latency associated with most network filesystems, performance will not be great. Also, file locking logic is buggy in many network filesystem implementations (on both Unix and Windows). If file locking does not work correctly, two or more clients might try to modify the same part of the same database at the same time, resulting in corruption. Because this problem results from bugs in the underlying filesystem implementation, there is nothing SQLite can do to prevent it.
A good rule of thumb is to avoid using SQLite in situations where the same database will be accessed directly (without an intervening application server) and simultaneously from many computers over a network.
This doesn't really work as well as one would like. It's not really related to sqlite; it's just that expectations of filesystem behavior (locking, operations having completed when the system says they have, etc.) are flimsy for many network filesystem implementations. All SQLite can do when it can't map the file as it normally would is basically upload the new version of the file rather than write it in place anyway. That behavior is just as easy for the application developer to create.
Isn't that the same problem with zip files? So if we're assuming that the compressed SQLite format will take up the same or less space as the zip file, wouldn't that mean that exactly the same mitigations work?
Yes, the problem is "fundamental" here, as there are no filesystem guarantees. You basically upload the whole file after modifying it, just like you would a zip. My point was that this is just as well left to the application, which gains some better control over failure behavior/recovery.
In my experience, sqlite is rarely the wrong choice for a first pass at almost anything. It’s
- Straightforward
- Portable
- Performant
So, unless you have a good reason to not use it:
- Human-readable/editable file format
- Non-trivial queries across multiple servers
- Specific performance optimizations needed
- Etc.
It’s a good first bet / gets you through mvp (and then some)
The problem I have with sqlite for being a general purpose file format is that it’s difficult to take a sqlite file in any language and work with it without linking against code.
I kind of wish that there were native implementations of sqlite in every language.
So it should be possible to take the WASM compiled SQLite and run it in other languages.
This is a thing that really excites me about WASM: as a mainly-Python programmer it has the potential to give me access to a much wider range of tools.
I'll give a couple examples of where I've needed a K/V database, but linking caused problems:
* We have a sandboxed Javascript runtime inside of the JVM that can only do restricted work. It would be nice to be able to use something like a SQLite file to deliver configuration to this Javascript runtime
* There are environments where you cannot install any native code -- like serverless providers that require your software to be pure Java / Javascript / Python
* Cross-language debugging can be very difficult when something goes wrong
* Erlang's FFI model is either "blocking and fast" (it runs the FFI call inline, and halts preemptive multitasking), or "slow and safe" (dirty NIFs, where it runs FFI calls on a dedicated thread.) This can get really ugly, really quickly.
I don't know! But if this sort of thing can happen in Chrome, I think it's a pretty strong counterargument to "let's standardize formats in terms of a C implementation of a thing that parses a complicated language, among many other things".
Non-C implementations of XML don't have countless vulnerabilities. There's XXE and schema requests which are language independent, but otherwise... it's ok.
FWIW I've used golang linking against sqlite in a couple of my projects. One issue is that linking through cgo is really slow, due to the fact that the runtime knows nothing about the foreign types. Probably OK if you're just doing basic queries at a reasonable rate; but if you implement a custom callback, suddenly all the marshaling back and forth takes a huge amount of time.
Not a deal-breaker; but a native implementation of sqlite in golang would be a lot nicer.
I use SQLite extensively for unit tests and demo code bases for my OSS libraries. For the latter because its super easy for someone to clone the code base and run the entire thing in seconds. Huge fan of SQLite in those two scenarios.
Does that still apply if the DB is in WAL mode? According to https://www.sqlite.org/fileformat2.html I'd say "no", because it sounds like the journal file might be necessary if the DB dies unexpectedly.
Why not go to the extreme and just use TileDB for all the documents? By using it you can utilize universal storage engine supporting Arrow format that should work with SQL, NoSQL, dataframe and even streaming systems.
All I would like from OpenDocument, and LibreOffice, is an ASCII storage format that is not specifically designed to be incompatible with revision control systems.
As it is, they have an almost-serviceable format that fails only by having each line start with a random number that differs from the number on the corresponding line that was originally read in.
Fixing it would not even be an incompatible change! They could just remember the number that was there, and write it back out, next time.
There is a report for this in their bug tracker that mentions SVN because there was no Git, yet, when it was posted.
I'm a huge fan of sqlite but I'm not sure it's an improvement over xml for holding a DOM. Especially since there's more or less just one implementation.
> The atomic update capabilities of SQLite allow small incremental changes to be safely written into the document. This reduces total disk I/O and improves File/Save performance, enhancing the user experience.
How would this work without breaking the conventional "Save" paradigm? Unless the app tracks all incremental changes to apply when the user clicks "Save" instead of dumping it all back to disk, which seems difficult and error-prone to implement.
> Unless the app tracks all incremental changes to apply when the user clicks "Save" instead of dumping it all back to disk, which seems difficult and error-prone to implement.
That's the point of using SQLite instead of implementing it yourself. Because of how SQLite supports transactions, you can just update the database as you go; your changes will be physically written to disk, but they won't actually replace the old version from a reader's perspective until you explicitly commit, which is an atomic (and typically fast) operation.
The downside of this approach is that if the application crashes while writing to disk, the database file itself might be in an inconsistent state, accompanied by a rollback journal (or write-ahead log) that contains the necessary information to recover it. But if you manually delete the journal, or otherwise separate it from its database, you get data corruption.
That's probably fine as long as the user is a sysadmin who can be trusted to Just Not Do That(tm), but it's very user-hostile behavior for an office suite.
> If a database file is separated from its WAL file, then transactions that were previously committed to the database might be lost, or the database file might become corrupted. The only safe way to remove a WAL file is to open the database file using one of the sqlite3_open() interfaces then immediately close the database using sqlite3_close().
> If a crash or power failure occurs and results in a hot journal but that journal is deleted, the next process to open the database will not know that it contains changes that need to be rolled back. The rollback will not occur and the database will be left in an inconsistent state.
It's a bummer that SQLite is a "database in one file" but many operations actually require a 2nd or 3rd file.
What's the reason that it can't act like a filesystem image, and keep everything in one file? If I mounted an ext4 image in userspace and modified it a bunch, it wouldn't be corrupt, right? Is that just because SQL has stronger atomicity requirements than a filesystem?
> What's the reason that it can't act like a filesystem image, and keep everything in one file?
I suspect it's mostly because putting everything into one file instead of two isn't really necessary for most use cases, and it would significantly complicate the implementation.
The main SQLite database file is organized into numbered fixed-size pages, which are used to store B-tree nodes. The B-tree structure itself is fairly complex, but the journal file format is extremely simple: it's essentially just a list of page indices along with their old contents. (WAL mode is very similar, except that the new contents are stored instead. See https://sqlite.org/fileformat2.html for the details.)
The catch is that both the database and the journal need to be independently resizable. So if you wanted to store them both in a single file, you would need to either reserve a fixed amount of space for the journal up front, or allow the journal and/or the database itself to become fragmented across different byte ranges of the file. The latter option would require you to keep track of an extra layer of indirection to know where the journal data is stored, and that metadata would itself need to be efficiently, atomically updateable.
Essentially, you would be building an entire rudimentary filesystem just to store two logical files in one physical one, when the OS already has a perfectly good filesystem available for you to use.
It is intended as an example of how SQLite can be used on an embedded system which has no filesystem, just a raw disk.
I think someone could create a VFS which stored everything in one file without the fixed size limitation. I think you'd need to have three types of pages in the file – data, journal, and metadata (which marks which pages are of which type, using a bitmap or extent list or whatever). The VFS would present two separate files, data and journal, but actually store them as one file containing these three types of pages. You'd need to think carefully about the ordering of updates to the three page types to ensure atomicity in the case of a crash. A fair amount of work, but doable if someone wanted to. (Maybe even the SQLite developers might do it if they thought it was a good idea, or if someone was willing to pay them to do it.)
> That's the point of using SQLite instead of implementing it yourself. Because of how SQLite supports transactions, you can just update the database as you go; your changes will be physically written to disk, but they won't actually replace the old version from a reader's perspective until you explicitly commit, which is an atomic (and typically fast) operation.
You can't "just" update the database as you go. It'd be a major refactor, since typically an app will use an in-memory working copy instead of referring to the storage engine all the time.
Well, you could write a fresh SQLite DB on every Save command, and blow away the old one, thus losing all of SQLite's nice fsync handling.
Or if your data structures can be modeled as vectors or ordered maps, use SQLite directly as much as possible. It _is_ a database engine, so it's already got btrees for fast lookups.
Hell, one time I played back video from a SQLite file. Not HD, but it was a peculiar format where I couldn't use any normal codec, and 30 FPS VGA recorded and played just fine.
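For the ordered-map case, a sketch (Python, table and file names made up) of leaning on SQLite's btree directly instead of inventing an on-disk structure:

    import sqlite3

    con = sqlite3.connect("doc.db")   # hypothetical document file
    con.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v BLOB)")
    con.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", ("page/0042", b"...payload..."))
    con.commit()

    # Ordered range scan straight off the primary-key btree.
    for k, v in con.execute("SELECT k, v FROM kv WHERE k BETWEEN ? AND ? ORDER BY k",
                            ("page/0000", "page/9999")):
        print(k, len(v))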
> your changes will be physically written to disk, but they won't actually replace the old version from a reader's perspective until you explicitly commit
This breaks one important invariant: unless the user clicks "Save", the original file must not be modified in any way. It doesn't matter if the changes are ignored by readers; if "Save" is a separate operation, it's the only place which should modify the original file.
I don’t understand what you are asking. When the user clicks “Save” that’s when your application calls the sqlite api to update the database. When the user clicks “Save as” the application calls the sqlite api to create a new database file, it then creates the database schema and dumps the internal representation of the document into it.
You might be confused by the GP comment. It says that if something bad happens with the computer after your application started saving, but before the transaction is committed, then the database will still be in a consistent state. (Although the document will be in the pre-save state.) If the application is the "click save" kind, then here we are talking about what happens if the power goes out in that fraction of a second after you hit save, but before the cursor stops being an hourglass.
> This breaks one important invariant: unless the user clicks "Save", the original file must not be modified in any way.
SQLite's default rollback journaling mode doesn't satisfy this property, but WAL mode does: updates will be first written to the WAL file, and automatically copied to the main database file only after at least one transaction has committed.
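A sketch of that flow with Python's sqlite3 (the document file and schema are made up): edits accumulate in an open transaction and only become part of the document when the user saves.

    import sqlite3

    con = sqlite3.connect("document.db")      # hypothetical document file
    con.execute("PRAGMA journal_mode = WAL")  # edits land in the WAL, not the main file

    con.execute("UPDATE slides SET body = ? WHERE id = ?", ("new text", 3))  # user edits...

    def on_save():
        con.commit()      # atomic: readers see either the old document or the new one

    def on_discard():
        con.rollback()    # the main database file keeps the pre-edit content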
Microsoft more or less does this already with their Office Open XML file format in SharePoint Online/OneDrive. Small changes are read/written as users edit the doc. Real-time co-authoring locks on a per-paragraph basis.
More broadly though, tech seems to be migrating away from explicit saves in general (e.g., when's the last time you needed to save something in a web-app). Maybe that's not such a bad thing? I learned to develop a reflexive cmd-/ctrl-s twitch as a youth, but I would have been just as happy to have not lost hours of work to power outages and/or cooperative multitasking.
OTOH, when I open some document as a reference document, I most certainly don't want any changes I accidentally (or even purposely) might have made to it to be saved without any further questions.
Modern word processing formats write down incremental changes, to avoid writing all of a massive document back to disk.
Sometimes this causes people embarrassment, because if you send a normally saved file to someone else, they can often see the change history and intermediate versions.
I am the developer of a platform that hosts documents in a NoSQL (MongoDB) so that every paragraph of text is a 'record' in MongoDB and there's a 'path' property which is what builds out a 'tree' structure to give the documents a hierarchy. You guys might be interested if you're interested in new kinds of document/wiki innovations.
Saving the versioning in the same file would lead to security issues. Imagine redacting a file before sending it to someone. All they have to do is roll back a few versions to recover the original.
(Note this is not a criticism of the "use SQLite" argument; just a criticism of the specific schema.)