The Library of Congress Recommended Formats Statement (RFS) includes SQLite as a preferred format for datasets. The RFS does not specify a particular version of SQLite.
CSV isn't a single format but many incompatible formats that require additional metadata to decode, such as:
- whether quotes are "doubled" like SQL92 or backslash-escaped (\") like C
- whether embedded newlines are quoted or escaped as \n like C
- whether there are column headings
- what the character set of the file is (Excel insists it's the local machine's codepage)
- whether that's a BOM or part of a column heading
- whether lines terminate with \n or \r\n or \r
- what to do about ragged rows
and so on. Except for the simplest of datasets, CSV is almost certainly going to lose information. The only advantages over SQLite are streaming support and maybe compression, but even then there are better formats.
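To make the ambiguity concrete, a minimal sketch (Python's csv module, made-up sample data) of the same bytes decoding into different things depending on which dialect you assume:

    import csv, io

    raw = 'id;amount\n1;"3,14"\n'   # made-up sample: semicolon-delimited, decimal comma

    # Guess comma as the delimiter: one mangled column.
    print(list(csv.reader(io.StringIO(raw), delimiter=',')))   # [['id;amount'], ['1;"3', '14"']]

    # Guess semicolon: two clean columns.
    print(list(csv.reader(io.StringIO(raw), delimiter=';')))   # [['id', 'amount'], ['1', '3,14']]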
There's a current debate on my team about whether we should accept semicolon-delimited files as CSVs. You can't even rely on commas in CSVs.
I would say any format that is compatible with Excel is not suitable for archival. Ever open a file with numeric values with more than, say, 20 digits (I forget the actual cut-off)? Excel converts them to scientific notation and loses precision. When you change the format there is data loss: the length of the data is preserved but not the full precision.
I saw a problem recently where a file had a .csv extension but was actually colon-delimited. The data contained fields with large quantities of text, so a CSV parser happily parsed the file, splitting lines on the commas in the text, and didn't complain; the problem wasn't noticed until much later in the processing of the data.
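One partial mitigation, sketched below with Python's csv.Sniffer (sample data made up), is to sniff the delimiter and refuse the file if it isn't what you expect, rather than trusting the .csv extension:

    import csv

    sample = "id:name:notes\n1:alice:likes cheese, wine and bread\n"   # made-up colon-delimited data

    dialect = csv.Sniffer().sniff(sample, delimiters=",;:\t|")
    print(dialect.delimiter)   # reports ':' for this sample; bail out if it isn't ','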
This is a common problem when you receive a CSV from someone working in a different country: comma is the decimal separator in some locales (e.g. Germany), so Excel will automatically switch to semicolons as field separators.
So now your colleague in the UK office (or even someone who simply prefers using Windows with a US locale, or, as in my case, got a UK laptop because the German keyboard has some layout differences from what they consider "normal"...) double-clicks on the "German" .csv... Excel will show (most of the time) every line as a single FIELD.
So you have to close it and use Data Import.
We just moved to the latest version of Office 365 at the job and it still works like this (even if the import action can usually figure out that semicolon is the separator without me having to manually specify it...).
One of our clients internally uses an odd text file format with ~ as the field delimiter and a fixed-width header & footer containing metadata. And they still sometimes send such files with names ending in .csv.
~ is a better choice of field separator, though: it is far less commonly used in most datasets and natural languages. | can also make sense for the same reason, and it visually looks like a column divider as an extra bonus.
In hindsight, semicolons are the better choice, given that commas are common in many languages as decimal separators.
I don't know why there's never been some effort to create a "CSV format specification"; for example, the first line could indicate the quoting, delimiter, etc. style used in the file.
Guess it's never been big enough of a problem for people to take action. "Good enough" (or rather, "not bad enough").
> In hindsight, semicolons are the better choice, given that commas are common in many languages as decimal separators.
Or a pipe. Or the unprintable “field separator” character.
> I don't know why there's never been some effort to create a "CSV format specification"; for example, the first line could indicate the quoting, delimiter, etc. style used in the file.
Fundamentally CSV or any delimited format requires agreement between parties for interchange. At that point standards don’t really help because the parties can just agree to anything that works.
> Guess it's never been big enough of a problem for people to take action. "Good enough" (or rather, "not bad enough").
The problem there is those code-points are usually non-printable, making the resulting file less human-readable which is often a stated desire for a data transfer file format. Some text editors will show those code points, but others will completely omit them (or worse, reject them as invalid).
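For reference, the code points in question are the ASCII unit separator (0x1F) and record separator (0x1E); a toy sketch of using them, readability caveats and all:

    rows = [["id", "note"], ["1", "contains, commas; and | pipes"]]   # made-up data

    # Unit Separator between fields, Record Separator between rows.
    blob = "\x1e".join("\x1f".join(fields) for fields in rows)

    parsed = [record.split("\x1f") for record in blob.split("\x1e")]
    assert parsed == rows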
You used to be able to press Ctrl + letter to input control characters on traditional terminals — was it ord(char) - 0x40? — but outside Emacs, vi and similar editors, I am sure you can't do that anymore.
If you are questioning the existence of human-readable/writeable (typeable?) data formats, you might want to check out SGML/XML, JSON, YAML... Are you surprised that being typeable holds true for all of these, especially on standard keyboards?
It's more interesting how SGML-derived languages have kept <> symbols so accessible on keyboards regular people use.
That's not exactly what I mean; that's just one of many different CSV variants (and just because it's an RFC doesn't make it "official" either). What I mean is something like:
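(Something along these lines - the exact syntax below is made up purely for illustration:)

    #csv delimiter=, quote=" escaping=double header=yes encoding=utf-8
    id,name,comment
    1,alice,"says ""hi"""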
Then you always know what you're dealing with, unambiguously.
There are other cases where embedding some "metadata" in CSV files is useful; for example the exports I create in my app are versioned by prefixing the first header with the version number. It works, but it's not exactly great.
Of course, the downside is legacy software not being able to process it.
There are CSV specs - see https://specs.frictionlessdata.io/csv-dialect/ and https://github.com/csvspecs , for example. Note: both initiatives are ongoing and alive and kicking. The trouble is almost nobody cares - as your ignorance shows - just complaining and playing the know-it-all is so much easier and more fun, isn't it?
Jokes aside, we had that debate in our team way back and decided to follow the rfc and only use commas. The format is often abused, better not create more tools that do so.
Totally understand and respect that decision. In our case these files come from third parties. We ingest that data, grind it up, sprinkle on our own blend of 11 herbs and spices and serve it to customers. Rejecting that data because it doesn't strictly conform to the RFC makes our product less appetizing. Or we can hold our noses and accept any consistent delimiter while working with the data providers to follow the standards.
Last time I checked, Outlook (The desktop application) exported contacts in localized CSV (semi-colon separated in my case) but Outlook (online) only accepted contacts separated by commas. As I was trying to migrate contacts on someone else's computer, I had to use notepad to change it.
I used to think this too, before I started researching the file format and its associated RFCs. Then I discovered a lot of that is actually explicitly defined, such as C-style escaping not being accepted and newlines following the MS-DOS format (\r\n).
Of course, there might be programs that write invalid CSV or parsers that accept invalid CSV, but that doesn’t mean CSV isn’t a standardised format.
> Of course, there might be programs that write invalid CSV
Like Excel?
RFC 4180 exists but that doesn't mean any file with .csv at the end of the name or a text/csv MIME type can be parsed according to it. Also, you still have to track the metadata, such as the charset and if the first line is a header. Both are optional in the RFC.
The header should be optional. That's a schema-specific detail, like the number of columns.
The character set is also forgivable given the age of the format; it comes from an era when there wasn't an agreed single character encoding to rule them all. Sure, RFC 4180 probably should have specified UTF-8, but by that point there were already three decades of CSV usage in different encodings on mainframes, which are likely still in use even now, making it rather pointless to specify any character encoding.
Don’t get me wrong, I’m not a massive fan of CSV either. The way quoted new lines are encoded, for example, is horrible and I’ve never liked the “” format for escaping quotation marks. But a lot of CSVs biggest problems are ironically a result of its success: as a format it’s so easy to use that a great many developers can hand crank their own parsers without looking at the spec. So I find it hard being critical about the lack of standardisation in CSV when the real problem is the number of developers who don’t follow the standardisation.
> the real problem is the number of developers who don’t follow the standardisation.
RFC4180 isn't an Internet standard. Says so right at the top in the first paragraph.
I think I have a few thousand files on my computer right now whose names end in ".csv", and I'll bet money not one of them agrees with RFC4180 to the letter except by accident of the data itself, and that's a Real Problem to me.
I can concede that two parties could agree to interchange according to RFC4180, but as a general format I maintain that for archival and interchange purposes CSV cannot be divorced from the rather complex schema I alluded to without data loss.
> RFC4180 isn't an Internet standard. Says so right at the top in the first paragraph.
I didn't say it was an Internet standard. I said standardised. Ok, I'll concede it is more of an informal or de facto standard but when IBM, W3C, IETF, OKF and others all publish the same parsing rules for CSV, it's hard to agree when people make statements like "there's no standard in CSV". The problem isn't that there isn't a standard to CSV, the problem is that people often don't follow those conventions. But you have that same problem with other file formats too.
> I think I have a few thousand files on my computer right now whose names end in ".csv", and I'll bet money not one of them agrees with RFC4180 to the letter except by accident of the data itself, and that's a Real Problem to me.
That's just conjecture. And even if that were proven true, it's still only anecdotal. That said, I do sympathise with your point. But you could make the same argument for
- JSON files that don't follow spec (support for comments, aren't UTF-8 encoded, have been manually written so don't follow the escaping rules correctly and thus only parse correctly by chance).
- XML files that have been manually cranked and so don't follow schema
- HTML documents that don't follow specification and thus browsers do a lot of non-specification interpretation work to render correctly
The IT industry is littered with examples of people not following the docs. CSV isn't unique in that regard.
> I can concede that two parties could agree to interchange according to RFC4180, but as a general format I maintain that for archival and interchange purposes CSV cannot be divorced from the rather complex schema I alluded to without data loss.
I wasn't commenting on whether it's a better format than another file format. I was commenting on your points about standardisation saying there are an abundance of published documents on how to read and write a standard CSV file and C-style escaping isn't part of that specification.
I’m getting a slew of downvotes so maybe we’ll tackle it from the other side: I’ve evidenced my point, now let’s see a list of popular tools that do use C-style escaping in CSV files or other wildly incompatible rules. I’d wager they either don’t exist or are some minor non-enterprise utility.
I’m not a CSV fanboy by any means but I have used CSV with a number of enterprise solutions because that’s all they’d offer. And my experience is that the comments levelled against it in this discussion are overstated.
It is an informal standard. The RFC just says it isn't defining an Internet standard. The reason, as I'd already pointed out, is that CSV predates RFC 4180 by three decades, so what the RFC really sets out to do is describe what those established conventions are and what the MIME type should be. That doesn't mean that what the RFC describes isn't already a convention (it's the same specification as published by IBM, W3C and many other big hitters).
> it would still be a poor choice for archival purposes.
I hadn't suggested it was a good choice for archival purposes. You're building a straw man argument there.
My point was that there is a conventional standard to CSV and that is very well documented. The issue with CSV is that it's so easy to write a parser that some developers do so without bothering to read any documentation on how CSV should be parsed. That's not the fault of CSV, that's the fault of lazy developers. But I do completely agree that CSV has other faults (which I had also discussed too).
> it would still be a poor choice for archival purposes.
It isn't being compared to all formats, it is being compared to sqlite. Some random version of one implementation written in C, stored in "fossil", may be good enough for the Library of Congress, but that kind of nonsense went too far for even browser makers to take seriously.
Meanwhile it is trivial to read and infer data from formats like CSV to either use directly or automate a reading process with any tools that exist now or in the future.
Isn't XML the obvious choice for robustly future-proof long-term data storage? Why mess about with an anemic scheme like CSV or a DBMS-specific scheme like SQLite?
- Quotes are doubled.
- Newlines are quoted.
- Both having headings and not having headings are valid; that's document/data specific.
- Use UTF-8.
- Do not use a BOM at all. But if for whatever reason you must, quote the column heading that contains it.
- Idk which line ending, but you should probably handle both.
- Do not emit ragged rows.
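A minimal sketch of emitting CSV under those conventions (Python's csv module, made-up data): doubled quotes, quoted embedded newlines, UTF-8 with no BOM, CRLF line endings, no ragged rows.

    import csv

    rows = [["id", "comment"], ["1", 'says "hi"\nand bye']]   # made-up data

    with open("out.csv", "w", newline="", encoding="utf-8") as f:   # UTF-8, no BOM
        w = csv.writer(f, quoting=csv.QUOTE_MINIMAL, lineterminator="\r\n")
        for row in rows:
            assert len(row) == len(rows[0])   # never emit ragged rows
            w.writerow(row)                   # quotes are doubled, embedded newlines stay quoted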
Having worked with a lot of government CSV/XLSX datasets (e.g. IPEDS college reporting), I can say I'd prefer something with proper foreign keys and less ambiguity in structure (people like to put descriptive cells at the top of spreadsheets)
Grist is an sqlite-based spreadsheet format that captures relations well and discourages funky structure https://www.getgrist.com/blog - it works well for passing relational datasets around (disclosure, I work on it)
Proper foreign keys don't really suit the dumped data case. Tables are usually imported one by one, in virtually random order so dependencies are often missing at the import time. Even if they were not, checking keys on every record import would mean a huge load for big datasets - that's really slow. That's why foreign key control is usually switched off before importing.
Postgres at least can defer these checks, and it's generally a best practice to disable rebuilding indexes while loading as well. It's not hard to simply load the tables in order so the keys are available. I would expect any decent unload utility to also create a script to load the data, which should be in the right order.
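In SQLite terms, a rough sketch of the same approach (file name and table contents assumed): skip per-row enforcement during the bulk load, then verify everything at the end.

    import sqlite3

    con = sqlite3.connect("dump.db")            # hypothetical import target
    con.execute("PRAGMA foreign_keys = OFF")    # don't check each row during import
    # ... bulk INSERTs here, tables in whatever order the dump provides ...
    con.commit()

    con.execute("PRAGMA foreign_keys = ON")
    violations = con.execute("PRAGMA foreign_key_check").fetchall()
    print(violations or "all foreign keys satisfied")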
I think the implication here is that the archival should be done into a SQLite database and then all you need is a SQLite binary to read it. So there wouldn't even be a need to use a load script.
CSV is a bad idea if you work with text or strings. I work a lot with data transformation, and CSV with text usually means headaches. Not to mention if you handle data from different countries.
Excel is too limited, the ODF formats are typically not widely used outside OpenOffice or LibreOffice..., same with Parquet and other file formats. Sadly this is how it is.
You can use SQLite as a file for moving data around, as I do, but I guess it's not practical for everyone.
For database dumps I prefer escaped TSVs personally but the delimiter doesn't really matter as long as it is consistent. Strict adherence to quoting, escaping and character sets does, along with tracking that metadata. If you aren't comfortable using a capital A as a delimiter your format is not robust enough for the real world.
Or you could just take the entire engine with you, which is what SQLite provides.
This reminds of the WebSQL spec being dropped because the only implementations used SQLite:
> The W3C Web Applications Working Group ceased working on the specification in November 2010, citing a lack of independent implementations (i.e. using database system other than SQLite as the backend) as the reason the specification could not move forward to become a W3C Recommendation.
-- Wikipedia
> This document was on the W3C Recommendation track but specification work has stopped. The specification reached an impasse: all interested implementors have used the same SQL backend (Sqlite), but we need multiple independent implementations to proceed along a standardisation path.
-- W3C
I wonder how much of their reasoning applies here.
I've grown to not mind WebSQL being dropped, despite that 10 years ago, I wish it were standardized.
WebAssembly's introduction a few years later means that SQLite can be built and used as a wasm module, and you aren't limited by whatever old version of SQLite was used for WebSQL -- you can always update it independently according to your needs.
It's strange that Microsoft didn't try to use SQL Server Compact Edition, which was available on Windows Phone, to provide the distinct implementation required.
The main problem was that the WebSQL "spec" draft did basically say: "Database that supports SQL the same way as this specific version of SQLite does", and nobody felt it was important enough to put in the work to actually specify something that could be implemented long-term.
Fascinating, never heard of SQL Server Compact Edition before. Unfortunately Wikipedia reports it was deprecated in 2013 and will reach end of life in 2021. Also it doesn't have an ODBC driver.
Certainly not a contemporary alternative to SQLite.
The biggest problem with WebSQL == SQLite was that the semantics couldn't be nailed down beyond a pile of C code. That is doable if there is a single application vendor for a given file format, but a total nightmare if you are trying to write code for the last 3 versions of 5 different browsers.
But SQLite was also a problem for anyone that wanted different semantics than what SQLite offered, like statically typed columns. Better to just provide a fast indexing mechanism and let people define semantics on top of that.
Standards like OpenDocument are expected to have multiple independent implementations. I'm not sure such a thing is feasible with SQLite (this is why WebSQL was rejected).
I'm personally sympathetic to the idea that certain implementation-defined monocultures are OK but I don't think that idea has critical mass.
The SQLite website is a little more positive than that, but as a "quite a bit in the future" option, rather than something that can happen today:
> All that said, it is possible that SQLite might one day be recoded in Rust. Recoding SQLite in Go is unlikely since Go hates assert(). But Rust is a possibility.
Thanks for posting that link, it was a very interesting read. I'm a big fan of Rust, but that article really convinced me that C is the best language for SQLite. A couple of highlights IMO:
> 2. Safe programming languages solve the easy problems: memory leaks, use-after-free errors, array overruns, etc. Safe languages provide no help beyond ordinary C code in solving the rather more difficult problem of computing a correct answer to an SQL statement.
(My commentary) So SQLite development is at such a different level to normal application that the memory safety errors that are a concern for us mortals are the "easy" problems for SQLite developers!
> 5. Safe languages insert additional machine branches to do things like verify that array accesses are in-bounds. In correct code, those branches are never taken. That means that the machine code cannot be 100% branch tested, which is an important component of SQLite's quality strategy.
(My commentary) I don't really agree with this one: If a branch cannot possibly be taken then I don't think it counts towards your coverage percentage... just like, if you're careful never to dereference null pointers, you don't include dereferencing null pointers as a missing part of your test coverage. Still, I thought it was very interesting, especially since it's not quite the argument I was expecting (that those checks are wasted CPU time and binary bloat because they're not needed). (Edit: What's more, exactly this situation applies with SQLite's own ALWAYS() and NEVER() macros [1])
Not familiar with the specifics of the SQLite situation, but moving away from 100% means maintaining an exceptions list, introducing complexity and room for error (eg mistakenly adding exceptions when you can't figure out how to reach a branch).
If they are at 100% today, it'd be a very convincing argument to keep it that way.
The problem is those branches CAN be taken, otherwise the compiler would optimize them out. So now SQLite needs to figure out how to cause them to be taken to verify that the right thing happens.
The problem is not the implementation language of sqlite, the problem is the author. Hacks upon hacks, and nothing of it in the docs. Instead it represents itself as the most secure of all databases. World class, whilst being worst of its class.
Hardening sqlite to get some security back from the insanity internally is a herculean task. I only scratched the surface with mine. FTS accepting custom tokenizers by pointer, dynamically! If someone wants a different tokenizer it needs to be compiled in.
It fails by calling C++ an object-oriented language, which is false. C++ is a deterministic-destruction language.
But in the end it doesn't matter, sqlite has been in C long enough that all the hard things C++ gives you for free have been coded manually anyway so there is no real point. For new code C++ might be better, but that is debatable. Just like rust might be better, but that is debatable.
Nobody is feasibly going to rewrite Sqlite in Rust.
Genuine curiosity: why? I know (good) database systems are hard to make, but does SQLite have particularly 'hard' parts which are near impossible to replicate? Especially since we now have a (I assume) well-documented base implementation and an extensive test suite/history of errors to avoid?
Is making a file-based database that hard, compared to creating a new programming language, for example?
> Is making a file-based database that hard, compared to creating a new programming language, for example?
Yes. I can hack together a trivial compiler in a weekend. A serious (non-optimising) compiler for a complex language is a task you could do in a few months.
If you gave me two years I'm not sure I could implement a filesystem API that correctly did something as simple as "atomically append to a file". See also Dan Luu's writing on the topic of filesystems: https://danluu.com/deconstruct-files/
The problem is writing a reliable file database. Firefox had a lot of bugs related to history-file corruption until the history store was replaced with SQLite. The same story with Subversion, where they replaced the Berkeley DB backend due to reliability issues.
This reliability comes from very deep knowledge of OS APIs and all their corner cases. Creating a programming language typically involves fewer corner cases, and the bugs in the compiler/runtime are much simpler to reproduce (again, typically).
Maybe it's not so much that it's impossible to do so. Perhaps there is just no reason to do so.
Sqlite is one of the most battle-tested pieces of software in the world. It's used "everywhere" and it's performant.
It is not the case, as some seem to believe, that rewriting software in Rust will magically eliminate all bugs and give a free speed boost compared to well-tested and optimised C.
That's very interesting. The use case described is for quite a limited subset of sqlite. I imagine a limited query language and kv-store could suffice for the usage described in the article. (Probably could be done in a weekend! wink.)
I can think of reasons why re-implementing SQLite might lack appeal, but why would it be infeasible? Even if the format is poorly- or under-specified (note: I have no idea whether that is true), you could definitely check the source.
You'd need something like a "Sqlite standard" then, with a clear path to how to evolve it taking all stakeholder interests into account. Seems like a job at least as difficult as writing the sofware itself.
(E.g., lack of a standard is why alternative Python implementations never went anywhere, even though the original cPython is, frankly, dreck from a software quality point of view.)
There is a standard though-- the database file format. You can read and write database files without using any SQLite source code by conforming to the expected format: https://www.sqlite.org/fileformat.html
It's not even that exotic: a bunch of rows in a simple binary format stored in a btree. The rollback journal / WAL are trickier, but aren't strictly necessary.
Implementing arbitrary SQL queries on top of the file format is obviously _much_ harder.
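As a small illustration of how approachable the outer layer is, a sketch (Python, file name assumed) that reads just the documented 100-byte header:

    import struct

    with open("example.db", "rb") as f:   # any SQLite database file
        header = f.read(100)              # the format reserves a 100-byte header

    magic = header[:16]                   # should be b"SQLite format 3\x00"
    (page_size,) = struct.unpack(">H", header[16:18])
    if page_size == 1:                    # the stored value 1 means 65536
        page_size = 65536
    print(magic, page_size)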
I wonder if there’s a Presto connector for this file format, such that you could dump a bunch of SQLite files in object storage and run analytics on them like they were parquet. Of course, any such driver should probably just link SQLite itself...
SQLite has millions of lines of test code. It would actually be difficult to re-implement it in such a way that it passes the tests but is incompatible.
Would it be as robust? Absolutely not, unless all the OS interaction code were transcribed directly.
But Rust has no compiler that targets anywhere near the number of platforms SQLite is deployed on, and never will. The best that could be hoped for is that most such platforms would no longer be used, by the time the transcription is done.
Kind of reminds me of http://utf8everywhere.org/, but in that case it'd be sqlite-everywhere. Which is a pretty good point. A large number of Apple-made iOS apps went this route, and they seem to be doing ok!
I think a lot of Android apps use Sqlite as well. But the achilles heel of Sqlite in my experience is network filesystems - it's well known & documented that Sqlite can suffer corruption on NFS, because of issues with locking. That doesn't really matter for iOS & Android apps, where you know storage is local, but it's a serious sticking point for desktop applications, where e.g. NFS home directories are not unusual.
This sounds similar to the challenge of storing a git repo in a cloud-synced folder (Dropbox, OneDrive, etc). You have so little control over the syncing behavior in these systems that they just cause piles of locking errors. The solution I'm trying for this involves using a bare repo in the synced drive that's used as a remote for a working repo in a separate, unsynced folder.
I vaguely remember a file archive format similar to JAR/WAR (for Java) or ASAR (for Electron) which was designed for bundling software with its dependencies and static resources. It is not a container because it's just files, with no specification for its runtime. Unlike JAR, it is language agnostic. These bundles could be compiled and compressed on your local machine then scp'd to a server that knows how to execute this bundle. If I recall correctly, it was some open source project from a big company.
I've spent hours trying to find it again and am wondering if I might be remembering something from a dream.
While I generally agree with the points they make, I'm not sure I would want a (stock) sqlite used for "less" trusted files.
While sqlite has a very comprehensive test suite, there has been more than one attack using "manipulated" database files, and as far as I know (I might be mistaken) the test suites are not focused on testing that case much.
Though I'm positive that it would be fine with a "hardened" sqlite where certain features are not compiled in at all, so that the attack surface is reduced.
The article describes the current format as a zip of files where the main content of presentation slides is stored in a `content.xml` file within. Yes, sqlite may have had file format vulnerabilities in the past, but I cannot imagine trusting it less than any xml library in existence.
At work we used a library that relied on Expat XML parser. After 20 years it still gets exploitable bugs. As the software was Windows-only we patched the library to use OS API for XML parsing to delegate to Microsoft the job of fixing security bugs.
I don't agree. That document states that the CVEs have historically required one of two preconditions:
> The attacker can submit and run arbitrary SQL statements.
> The attacker can submit a maliciously crafted database file to the application that the application will then open and query.
If you look at the actual list of CVEs, all but one start with ‘Malicious SQL statement’. The single one that doesn't is suffixed with ‘The bug never appeared in any official SQLite release’.
In other words, there has never been an official release of SQLite which was vulnerable when presented with a crafted database file.
The proposal to use sqlite database files instead of zip files falls under the precondition that an attacker can submit a maliciously crafted database file. So the security concerns apply.
SQLite is mostly used in situations where neither the database nor the SQL statements can be provided by the attacker. If SQLite is used to exchange data, both of these attack vectors are available.
> If you look at the actual list of CVEs, all but one start with ‘Malicious SQL statement’. The single one that doesn't is suffixed with ‘The bug never appeared in any official SQLite release’.
The list on the page is partial, and the language on that page is disturbingly dismissive. Reading leaves me more concerned that they're too arrogant to take security researchers seriously.
Think about it: if even JSON, which is kinda simple, has that many problems with parsers, how likely is it that something multiple degrees of complexity higher, which wasn't developed with untrusted database files in mind, is secure?
Besides that, the large majority of things tested in your link are irrelevant for security in this use-case. They do matter if you need fully stable de- and re-serialization or e.g. have a fully trusted validator; for certain security-token use-cases it matters. Many of the things marked as failures could even be considered "more robust" or "more secure" (though not fully standard-conformant) parsing, e.g. accepting trailing commas (disallowed by JSON), rejecting certain unescaped control/special characters (which JSON allows), or rejecting overly long field values (which JSON allows). What matters is that a) it has protection against too-deep recursion so no stack overflow can happen, b) it makes sure the text it outputs is correctly encoded (independent of whether or not it was well-formed in the JSON blob), and c) it doesn't crash in a way which leads to security problems (which the test suite you linked doesn't differentiate from "acceptable" crashes).
As long as one uses a single parser to read a particular input, it matters little if it produces a result that is different from what some other parser generates, as long as the parser has no security bugs. And JSON is simple enough to cover a particular interpretation of its spec with test suites, making a security vulnerability extremely unlikely.
Surely if one uses one parser to verify the payload and another to use it, disaster can follow, as it did with the iPhone verification bug.
I feel like there should be a way to make a maliciously crafted database file execute arbitrary SQL... SQLite doesn't do stored procedures, but there's gotta be something in that surface area.
No, the article just replaces the container with SQLite but not the components (e.g. a slide), which without extra changes are still XML, stored as text in the database.
The change from storing the inner document format in something which is not XML is completely independent of changing the outer container to sqlite.
SQL isn't really suited as a markup language, so storing styled "flowing" text in it isn't that good of an idea. The article only recommends storing the separate blobs of styled text in tables (it also sidesteps the whole problem of text flowing across multiple pages by simply using slides as an example ;=) ).
Lastly, I'm against using a "stock"/unhardened sqlite version, not sqlite in general. Ironically one of the features I would disable for now is related to text search, but I don't remember its name so I can't really use it as an argument :=/.
So if you ask me:
- Change the container to a different format than zipped XML files; consider sqlite, but use a hardened, feature-limited sqlite version.
- Change the inner representation away from XML(1)
(1): Ok, there are ways to safely use XML, but this isn't useful for this discussion: they require a bunch of things which all reduce usability (of the XML library) and performance, only work for certain use-cases where you e.g. do not need fully stable serialization and de-serialization, might limit you to a subset of XML, mustn't rely on the verifier for anything safety-related, and preferably shouldn't involve writing the parser in C/C++... so it's easier to just not do it tbh.
That's fair. Though the disparity is still an order of magnitude...
More seriously, my point is that SQLite is very unlikely to be the most vulnerable piece of software in your stack. It's probably the most stable and bug free piece of software that you use on a day to day basis that hasn't been formally proven correct. Even openoffice itself has more CVEs: https://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=openoffice
And some of them are due to the very issue I brought up above...
> CVE-2012-0037 Redland Raptor (aka libraptor) before 2.0.7, as used by OpenOffice 3.3 and 3.4 Beta, LibreOffice before 3.4.6 and 3.5.x before 3.5.1, and other products, allows user-assisted remote attackers to read arbitrary files via a crafted XML external entity (XXE) declaration and reference in an RDF document.
> My point is that SQLite is very unlikely to be the most vulnerable piece of software in your stack.
Sqlite is very high quality, but I am rather doubtful this is true if you actually care about having a secure system to start with. I can think of three main reasons an office suite would have serious security vulnerabilities:
1. It's written in an unsuitable, memory-unsafe, programming language (as most office suites sadly are).
2. It includes an unsuitable programming language for scripting purposes (Word Macros are of course infamous for this).
3. Vulnerabilities in one of its dependencies (like XML parsers)
If you care to, 1 and 2 are very easy to avoid and 3 will be pretty hard to avoid completely, but you can limit your exposure. You need a GUI toolkit at the minimum, but you can completely avoid parsing as an attack vector by either not using a braindamaged format like XML in the unlikely case you don't need compatibility, or if you must, use an xml (zip etc.) parser written in a memory safe language. If you followed all other steps, but chose sqlite instead, I'd say chances a very good it's now your main vulnerability.
You have crafted a nigh-impossible tower of requirements that you admit are way beyond what is commonly practiced in the software under question in order to disclaim the statement "SQLite is very unlikely to be the most vulnerable piece of software in your stack". Right now on the front page of HN is an article about how Microsoft was also a victim of what may come to be labeled the most impactful hack ever, a list which includes most (all?) US government agencies. The root vuln? A crap admin password used by an IT monitoring software company which has software deployed to an alarming percentage of the industry.
I dunno man, I feel like my original statement is pretty reasonable. Let's please not move the goalposts to the moon. :)
I don't think your statement is unreasonable at all coming from the common perspective that it is part of the natural order of things that opening a text document would expose your computer to a non-negligible risk of hostile takeover.
Personally I find this a very unreasonable thing to accept. Secure (single user) text processing (etc.) is not a hard problem at all and there is zero reason it should create any security risks.
Crap passwords (and security practices) are not pertinent at all here, unless you also argue that Boeing shouldn't really bother to make airplanes with non-lethal autopilot interactions because being killed by a mis-designed airplane is the least of risks to the typical US citizen's life given how obese he or she is.
The pertinent question is: if you are to design a wordprocessor, should you avoid baking sqlite into the design because of security concerns? I think you should. Sqlite is written and tested very well, but it is written in not formally verified C, and using a memory unsafe language for reading potentially hostile databases poses a completely unnecessary and avoidable security risk.
Or run the IO in a sandbox where it is streamed out in a simpler/safer format. Or run the entire suite in a sandbox like we do with parts of web browsers.
Because SQLite, as a general purpose database system, can be used for loads of things.
A two-piece component (DB + sandbox) that you use for 100 different things is going to end up much more secure (overall) and require less human effort than maintaining 100 different systems for each use case.
It's like "don't roll your own crypto". That is the advice not because it's impossible to roll your own secure crypto, but because it is just much more likely that you will repeat bugs that other people have already figured out, so it is better if everyone is using one library so that when an issue is found and resolved, it applies to everyone. It is still good advice even though that 3rd party crypto lib is probably going to have a lot of unneeded functionality that actually increases the overall attack surface area than if you made your own for whatever your specific purpose is. I think the same logic applies here.
You could always sandbox it. This has actually already been done at least once: years ago, there used to be a "pure Java" JDBC driver for SQLite that was generated by compiling the original C code to a MIPS binary, and then translating the result into Java bytecode.
It's a multi-hundred-thousand line project (current count, excluding tests, is about 210,000). Like any project of a similar size, it has a history of security vulnerabilities. OpenBSD removed it because it was too much code to audit. OpenBSD dodged several issues like this RCE (https://research.checkpoint.com/2019/select-code_execution-f...) as a result.
Hundreds of thousands of lines of code are simply going to have vulnerabilities and bugs. It's unavoidable with that level of complexity. I'm skeptical anything that complicated is a wise choice for a general file format -- especially when smaller, simpler, and more easily debuggable and independently implementable formats exist.
That bug was fixed long ago, even before the referenced video was produced.
The described attack is clever. It exploits the fact that an attacker might alter the schema so that it invokes an SQL function with side-effects when an app simply tries to read from a table. And depending on those side-effects, an exploit might be possible. In the video, there was a bug in a built-in SQL function that caused exploitable side-effects. But that bug was fixed long before the video was even produced. The examples in the video were from an older version of SQLite. They did not work for the latest SQLite release on the day that lecture was given.
Since the checkpoint.com attack was described (in the video and elsewhere), new defense-in-depth features have been added to SQLite to make similar exploits increasingly unlikely.
(1) Built-in SQL functions that have side effects cannot be used in the schema. Side-effect functions can only be invoked directly by the application.
(2) When applications register their own custom SQL functions, they can now mark those functions as "direct-only", meaning that they are prohibited in the schema.
(3) Run-time and compile-time options are available to prohibit the use of SQL functions in the schema that are not explicitly declared to be safe for use in the schema - that is, functions without side effects. This is for use in legacy applications that might have been created before the per-function flag that prohibited use within the schema was available. It is also an extra layer of defense for complex applications that might add hundreds or thousands of side-effect SQL functions - to ensure that the "direct-only" flag is not accidentally omitted from one of them.
(4) Run-time options are available to disable triggers and views in applications that do not need them. This is not necessary to avoid an exploit, but it does provide an additional layer of defense.
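On the application side, a sketch (Python stdlib; file name assumed, and PRAGMA trusted_schema needs a reasonably recent SQLite) of a defensive way to open a database received from elsewhere: read-only, schema not trusted, structure checked before querying.

    import sqlite3

    con = sqlite3.connect("file:untrusted.db?mode=ro", uri=True)   # open read-only
    con.execute("PRAGMA trusted_schema = OFF")   # don't run functions referenced by the schema
    result = con.execute("PRAGMA quick_check").fetchone()[0]
    if result != "ok":
        raise ValueError("refusing malformed database: " + result)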
The article convinces me on the failure of the ZIP method in at least 3 respects:
1) For efficient incremental updates. Especially when you're saving a large file often, making constant single changes to ZIP files is just inefficient.
2) The insane memory gulps that ZIP takes.
3) The pile of files method. The data in OpenDocument files could be much better sorted/accessed through SQLite tables, rather than through XML files in ZIPs.
I think SQLite makes far more sense for OpenDocument. Incidentally, SQLite has also made good arguments previously for its use to replace application file formats: https://www.sqlite.org/appfileformat.html
In a past job I tried hard to replace an Access database file with sqlite for a C++ desktop application. One big win from sqlite was on the testing front: you can just open an in-memory sqlite database and test against that without worrying about disk state, or state leaking between test runs.
We use in-memory sqlite databases for unit testing a MySQL backed web service. It’s great! You can build and test the code without setting up a MySQL server and no risk of state leaking like you said. The syntax isn’t exactly the same but we have a small wrapper layer that rewrites MySQL-specific queries into their SQLite equivalents.
The only real downside is that it’s a C dependency which makes cross compiling a bit of a pain until you figure out the magic incantation the compiler wants.
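The setup is about as small as it gets; a sketch of the same idea in Python (the schema is made up):

    import sqlite3, unittest

    class UserStoreTest(unittest.TestCase):
        def setUp(self):
            # Fresh, isolated database per test; nothing ever touches disk.
            self.con = sqlite3.connect(":memory:")
            self.con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

        def test_insert(self):
            self.con.execute("INSERT INTO users (name) VALUES (?)", ("alice",))
            count = self.con.execute("SELECT count(*) FROM users").fetchone()[0]
            self.assertEqual(count, 1)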
The whole "Writes are atomic" still breaks down if they assume that disks are local. Having worked with one of these zip document format for 20 years now, I have found that local disk is the edge case, and SMB is the norm in corporate environments.
Sure you can download the sqlite file at the start of a session and work on it locally in a temp directory with atomic writes - but in the end the user doesn't think they have "saved" until version N+1 of the file is actually on their network share.
The sqlite documentation advises against using it in that scenario:
If there are many client programs sending SQL to the same database over a network, then use a client/server database engine instead of SQLite. SQLite will work over a network filesystem, but because of the latency associated with most network filesystems, performance will not be great. Also, file locking logic is buggy in many network filesystem implementations (on both Unix and Windows). If file locking does not work correctly, two or more clients might try to modify the same part of the same database at the same time, resulting in corruption. Because this problem results from bugs in the underlying filesystem implementation, there is nothing SQLite can do to prevent it.
A good rule of thumb is to avoid using SQLite in situations where the same database will be accessed directly (without an intervening application server) and simultaneously from many computers over a network.
This doesn't really work as well as one would like. It's not really related to sqlite; it's just that expectations of filesystem behavior (locking, operations having completed when the system says they have, etc.) are flimsy for many network filesystem implementations. All SQLite can do when it can't map the file as it normally would is basically upload the new version of the file rather than write it in place anyway. That behavior is just as easy for the application developer to create.
Isn't that the same problem with zip files? So if we're assuming that the compressed SQLite format will take up the same or less space as the zip file, wouldn't that mean that exactly the same mitigations work?
Yes, the problem is "fundamental" here, as there are no filesystem guarantees. You basically upload the whole file after modifying it, just like you would a zip. My point was that this is just as well left to the application, which gains some better control over failure behavior/recovery.
In my experience, sqlite is rarely the wrong choice for a first pass at almost anything. It’s
- Straightforward
- Portable
- Performant
So, unless you have a good reason to not use it:
- Human-readable/editable file format
- Non-trivial queries across multiple servers
- Specific performance optimizations needed
- Etc.
It’s a good first bet / gets you through mvp (and then some)
The problem I have with sqlite for being a general purpose file format is that it’s difficult to take a sqlite file in any language and work with it without linking against code.
I kind of wish that there were native implementations of sqlite in every language.
So it should be possible to take the WASM compiled SQLite and run it in other languages.
This is a thing that really excites me about WASM: as a mainly-Python programmer it has the potential to give me access to a much wider range of tools.
I'll give a couple examples of where I've needed a K/V database, but linking caused problems:
* We have a sandboxed Javascript runtime inside of the JVM that can only do restricted work. It would be nice to be able to use something like a SQLite file to deliver configuration to this Javascript runtime
* There are environments where you cannot install any native code -- like serverless providers that require your software to be pure Java / Javascript / Python
* Cross-language debugging can be very difficult when something goes wrong
* Erlang's FFI model is either "blocking and fast" (it runs the FFI call inline, and halts preemptive multitasking), or "slow and safe" (dirty NIFs, where it runs FFI calls on a dedicated thread.) This can get really ugly, really quickly.
I don't know! But if this sort of thing can happen in Chrome, I think it's a pretty strong counterargument to "let's standardize formats in terms of a C implementation of a thing that parses a complicated language, among many other things".
Non-C implementations of XML don't have countless vulnerabilities. There's XXE and schema requests which are language independent, but otherwise... it's ok.
FWIW I've used golang linking against sqlite in a couple of my projects. One issue is that linking through cgo is really slow, due to the fact that the runtime knows nothing about the foreign types. Probably OK if you're just doing basic queries at a reasonable rate; but if you implement a custom callback, suddenly all the marshaling back and forth takes a huge amount of time.
Not a deal-breaker; but a native implementation of sqlite in golang would be a lot nicer.
I use SQLite extensively for unit tests and demo code bases for my OSS libraries. For the latter because its super easy for someone to clone the code base and run the entire thing in seconds. Huge fan of SQLite in those two scenarios.
Does that still apply if the DB is in WAL mode? According to https://www.sqlite.org/fileformat2.html I'd say "no", because it sounds like the journal file might be necessary if the DB dies unexpectedly.
Why not go to the extreme and just use TileDB for all the documents? By using it you can utilize universal storage engine supporting Arrow format that should work with SQL, NoSQL, dataframe and even streaming systems.
All I would like from OpenDocument, and LibreOffice, is an ASCII storage format that is not specifically designed to be incompatible with revision control systems.
As it is, they have an almost-serviceable format that fails only by having each line start with a random number that differs from the number on the corresponding line that was originally read in.
Fixing it would not even be an incompatible change! They could just remember the number that was there, and write it back out, next time.
There is a report for this in their bug tracker that mentions SVN because there was no Git, yet, when it was posted.
I'm a huge fan of sqlite but I'm not sure it's an improvement over xml for holding a DOM. Especially since there's more or less just one implementation.
> The atomic update capabilities of SQLite allow small incremental changes to be safely written into the document. This reduces total disk I/O and improves File/Save performance, enhancing the user experience.
How would this work without breaking the conventional "Save" paradigm? Unless the app tracks all incremental changes to apply when the user clicks "Save" instead of dumping it all back to disk, which seems difficult and error-prone to implement.
> Unless the app tracks all incremental changes to apply when the user clicks "Save" instead of dumping it all back to disk, which seems difficult and error-prone to implement.
That's the point of using SQLite instead of implementing it yourself. Because of how SQLite supports transactions, you can just update the database as you go; your changes will be physically written to disk, but they won't actually replace the old version from a reader's perspective until you explicitly commit, which is an atomic (and typically fast) operation.
The downside of this approach is that if the application crashes while writing to disk, the database file itself might be in an inconsistent state, accompanied by a rollback journal (or write-ahead log) that contains the necessary information to recover it. But if you manually delete the journal, or otherwise separate it from its database, you get data corruption.
That's probably fine as long as the user is a sysadmin who can be trusted to Just Not Do That(tm), but it's very user-hostile behavior for an office suite.
> If a database file is separated from its WAL file, then transactions that were previously committed to the database might be lost, or the database file might become corrupted. The only safe way to remove a WAL file is to open the database file using one of the sqlite3_open() interfaces then immediately close the database using sqlite3_close().
> If a crash or power failure occurs and results in a hot journal but that journal is deleted, the next process to open the database will not know that it contains changes that need to be rolled back. The rollback will not occur and the database will be left in an inconsistent state.
It's a bummer that SQLite is a "database in one file" but many operations actually require a 2nd or 3rd file.
What's the reason that it can't act like a filesystem image, and keep everything in one file? If I mounted an ext4 image in userspace and modified it a bunch, it wouldn't be corrupt, right? Is that just because SQL has stronger atomicity requirements than a filesystem?
> What's the reason that it can't act like a filesystem image, and keep everything in one file?
I suspect it's mostly because putting everything into one file instead of two isn't really necessary for most use cases, and it would significantly complicate the implementation.
The main SQLite database file is organized into numbered fixed-size pages, which are used to store B-tree nodes. The B-tree structure itself is fairly complex, but the journal file format is extremely simple: it's essentially just a list of page indices along with their old contents. (WAL mode is very similar, except that the new contents are stored instead. See https://sqlite.org/fileformat2.html for the details.)
The catch is that both the database and the journal need to be independently resizable. So if you wanted to store them both in a single file, you would need to either reserve a fixed amount of space for the journal up front, or allow the journal and/or the database itself to become fragmented across different byte ranges of the file. The latter option would require you to keep track of an extra layer of indirection to know where the journal data is stored, and that metadata would itself need to be efficiently, atomically updateable.
Essentially, you would be building an entire rudimentary filesystem just to store two logical files in one physical one, when the OS already has a perfectly good filesystem available for you to use.
It is intended as an example of how SQLite can be used on an embedded system which has no filesystem, just a raw disk.
I think someone could create a VFS which stored everything in one file without the fixed size limitation. I think you'd need to have three types of pages in the file – data, journal, and metadata (which marks which pages are of which type, using a bitmap or extent list or whatever). The VFS would present two separate files, data and journal, but actually store them as one file containing these three types of pages. You'd need to think carefully about the ordering of updates to the three page types to ensure atomicity in the case of a crash. A fair amount of work, but doable if someone wanted to. (Maybe even the SQLite developers might do it if they thought it was a good idea, or if someone was willing to pay them to do it.)
> That's the point of using SQLite instead of implementing it yourself. Because of how SQLite supports transactions, you can just update the database as you go; your changes will be physically written to disk, but they won't actually replace the old version from a reader's perspective until you explicitly commit, which is an atomic (and typically fast) operation.
You can't "just" update the database as you go. It'd be a major refactor, since typically an app will use an in-memory working copy instead of referring to the storage engine all the time.
Well, you could write a fresh SQLite DB on every Save command, and blow away the old one, thus losing all of SQLite's nice fsync handling.
Or if your data structures can be modeled as vectors or ordered maps, use SQLite directly as much as possible. It _is_ a database engine, so it's already got btrees for fast lookups.
Hell, one time I played back video from a SQLite file. Not HD, but it was a peculiar format where I couldn't use any normal codec, and 30 FPS VGA recorded and played just fine.
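For the ordered-map case, a sketch (Python, table and file names made up) of leaning on SQLite's btree directly instead of inventing an on-disk structure:

    import sqlite3

    con = sqlite3.connect("doc.db")   # hypothetical document file
    con.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v BLOB)")
    con.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", ("page/0042", b"...payload..."))
    con.commit()

    # Ordered range scan straight off the primary-key btree.
    for k, v in con.execute("SELECT k, v FROM kv WHERE k BETWEEN ? AND ? ORDER BY k",
                            ("page/0000", "page/9999")):
        print(k, len(v))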
> your changes will be physically written to disk, but they won't actually replace the old version from a reader's perspective until you explicitly commit
This breaks one important invariant: unless the user clicks "Save", the original file must not be modified in any way. It doesn't matter if the changes are ignored by readers; if "Save" is a separate operation, it's the only place which should modify the original file.
I don’t understand what you are asking. When the user clicks “Save” that’s when your application calls the sqlite api to update the database. When the user clicks “Save as” the application calls the sqlite api to create a new database file, it then creates the database schema and dumps the internal representation of the document into it.
You might be confused by the GP comment. It says that if something bad happens with the computer after your application started saving, but before the transaction is committed, then the database will still be in a consistent state. (Although the document will be in the pre-save state.) If the application is the "click save" kind, then here we are talking about what happens if the power goes out in that fraction of a second after you hit save, but before the cursor stops being an hourglass.
> This breaks one important invariant: unless the user clicks "Save", the original file must not be modified in any way.
SQLite's default rollback journaling mode doesn't satisfy this property, but WAL mode does: updates will be first written to the WAL file, and automatically copied to the main database file only after at least one transaction has committed.
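A sketch of that flow with Python's sqlite3 (the document file and schema are made up): edits accumulate in an open transaction and only become part of the document when the user saves.

    import sqlite3

    con = sqlite3.connect("document.db")      # hypothetical document file
    con.execute("PRAGMA journal_mode = WAL")  # edits land in the WAL, not the main file

    con.execute("UPDATE slides SET body = ? WHERE id = ?", ("new text", 3))  # user edits...

    def on_save():
        con.commit()      # atomic: readers see either the old document or the new one

    def on_discard():
        con.rollback()    # the main database file keeps the pre-edit content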
Microsoft more or less does this already with their Office Open XML file format in SharePoint Online/OneDrive. Small changes are read/written as users edit the doc. Real-time co-authoring locks on a per-paragraph basis.
More broadly though, tech seems to be migrating away from explicit saves in general (e.g., when's the last time you needed to save something in a web-app). Maybe that's not such a bad thing? I learned to develop a reflexive cmd-/ctrl-s twitch as a youth, but I would have been just as happy to have not lost hours of work to power outages and/or cooperative multitasking.
OTOH, when I open some document as a reference document, I most certainly don't want any changes I accidentally (or even purposely) might have made to it to be saved without any further questions.
Modern word processing formats write down incremental changes, to avoid writing all of a massive document back to disk.
Sometimes this causes people embarrassment, because if you send a normally saved file to someone else, they can often see the change history and intermediate versions.
I am the developer of a platform that hosts documents in a NoSQL (MongoDB) so that every paragraph of text is a 'record' in MongoDB and there's a 'path' property which is what builds out a 'tree' structure to give the documents a hierarchy. You guys might be interested if you're interested in new kinds of document/wiki innovations.
Saving the versioning in the same file would lead to security issues. Imagine redacting a file before sending it to someone. All they have to do is roll back a few versions to recover the original.
(Note this is not a criticism of the "use SQLite" argument; just a criticism of the specific schema.)