SQLite as an Application File Format (2014) (sqlite.org)
211 points by kjeetgill on Dec 24, 2018 | 161 comments



I wish more people would publish data as SQLite databases (if the size permits, of course; usually it does). It's so much more reliable than CSVs, which have at least a few dimensions of significant differences (quoted/unquoted, comma vs semicolon vs tab vs space, headers/no headers, comments). Not to mention that initial data exploration can be done right in an SQLite explorer/browser tool.


I keep harping on about this at work - a sqlite file carries its schema with it, so you can inspect it. With foreign keys, NOT NULL, CHECK and other constraints you can easily make out what the data is about. There is a driver in every language it seems, and if not, the docs are very good, so if you are handy with FFI it is easy to build one. For the equivalent data it can be far more compact than XML (gulp! SOAP), and the gap only grows as the data gets larger. Being a single file, it can be sent over the wire like any other. And you use SQL to interface with it.
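To make that concrete, the inspection step is only a few lines with Python's built-in sqlite3 module (a rough sketch; the file name is just a placeholder):

    import sqlite3

    con = sqlite3.connect("dataset.db")   # placeholder file name
    # The schema travels with the file: sqlite_master holds the original DDL.
    for (name, sql) in con.execute(
            "SELECT name, sql FROM sqlite_master WHERE type = 'table'"):
        print(sql)
        # Constraints are queryable too, e.g. the foreign keys of each table.
        for fk in con.execute(f"PRAGMA foreign_key_list({name})"):
            print("  references:", fk)
    con.close()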

Currently building software for the ATO (Single Touch Payroll) which uses SBR (https://en.wikipedia.org/wiki/Standard_Business_Reporting). The SBR project is listed as having on-going problems, and has cost the ATO ~$AUD1b to date (https://en.wikipedia.org/wiki/List_of_failed_and_overbudget_...). One of the reasons cited is that it uses XBRL (https://en.wikipedia.org/wiki/XBRL). Now imagine if it used sqlite...


There's a fairly large ecosystem of tools for creating and processing financial data in XBRL which doesn't exist for SQLite. All the Big 4 handle XBRL already - I know because I write the software they use. It's quite straightforward to take a Word document, for example, and turn it into XBRL; we even use machine learning to automatically tag the tables.

I can easily imagine how painful it is for you to process XBRL from scratch, but it's not crazy to exploit the existing infrastructure.

Of course if you give a project to IBM I wouldn't be surprised if it costs a billion dollars, especially given they know roughly nothing about XBRL...


I'd never heard of XBRL, looks very interesting. It seems it's primarily used in financial reporting environments though. Is it suitable for general purpose reporting as well?


In principle yes, it was intended for general business reporting.

Essentially each filing (called an instance) consists of a series of 'facts', each of which reports a single value and some metadata, and footnotes, which are XHTML content attached to facts. Fact metadata includes dimensions, which can specify arbitrary properties of a fact. So a fact might be e.g. 'profit' with metadata declaring it's in 2018, in the UK, and on beer, but all of those aspects would be defined by a specific set of rules called a taxonomy. You can create a taxonomy for any form of reporting you want.

There's also a language, XBRL Formula, which allows taxonomies to define validation. It allows something semantically similar to SQL queries over the facts in an instance, with the resulting rows being fed into arbitrary XPath expressions.

Unfortunately the tools for working with XBRL are mostly quite expensive, which probably limits its application outside finance. Arelle is a free and fairly standards compliant tool that will parse, validate and render XBRL and even push the data into an SQL database, but it's written in Python and isn't very performant. (Although it's probably good enough for most uses since it's used as the backend for US XBRL filing.) I'm not sure if there are any open source tools to help with creating instances.

Also creating a taxonomy itself is quite challenging. There are (expensive) tools to help, and using them it's still quite challenging. For real-world taxonomies it usually involves a six or seven figure payment to one of the few companies with the right expertise.


XBRL is the format the SEC (Securities and Exchange Commission, the government "accountants of last resort" who check publicly traded companies) requires. This means there are regular XBRL files (every quarter) for every large publicly traded US company.

Here's Google's latest: https://abc.xyz/investor/static/documents/xbrl-alphabet-2018...

Please do keep in mind that this is a sort of XML key-value database with a number of keys standardized, and some level of rules defined that say "if a bank lends Volkswagen money and it uses 12% of that money to run ads with Google, with a repo clause, you add A to value X and B to value Y". In other words, there are rules that define how complex financial data is entered into those standardized values. To find those rules, there's an SEC textbook that you wouldn't wish on your worst enemy; nothing about that is in the files themselves.

There's a large directory at the SEC with the quarterly XBRLs for all US publicly traded companies. Used to be accessible over FTP until a little over a year ago.

Here it is: https://www.sec.gov/Archives/edgar/

XBRL files exist for all forms to be filed with the SEC. The ones you probably want are the 10-Q and 10-K ones (q = quarterly, k = no idea, but somehow means yearly)

(of course there's an entire industry of accountants essentially about hacking those rules, and therefore the meaning of those files. So let me give you a free quick 5 year experience in financial analysis: search for "GAAP vs non-GAAP", read 2 articles, decide the conspiracy theorists are less informative than the government, and just believe the SEC is at least trying. That doesn't mean nobody's lying, but GAAP vs non-GAAP is generally not what they're lying about)


As much as I love SQLite, and while it is open source, it is a single implementation that AFAIK has no published open specification. The only way to read an SQLite file is using SQLite, and in that respect, it is for many users just as closed as wrapping something in a word document.

CSV isn't perfect, but it provides a ton of flexibility; for example, CSVs can be streamed, or support parallel segmented download across a network with useful work possible during the transfer. The format is so simple that parsing it can be almost free (see e.g. my own https://github.com/dw/csvmonkey ).

CSV is also distinguished in that regular home users with spreadsheet programs can usually do most things a developer can do with the same file. For me user empowerment trumps all other goals in software, including warts. Things like JSON, XML or SQLite definitely don't fit in that category, although I guess SQLite is at least better due to the wide availability of decent GUIs for it.

Finally as a data transfer format, SQLite has the potential to be massively inefficient. Done incorrectly it can ship useless indexes that can inflate size >100%, and even in the absence of those, depending on how amenable the data is to being stored in a btree and the access patterns used to insert it, can leave tons of wasted space inside the file, or AFAIK even chunks of previously deleted data.


>"As much as I love SQLite, and while it is open source, it is a single implementation that AFAIK has no published open specification. The only way to read an SQLite file is using SQLite, and in that respect, it is for many users just as closed as wrapping something in a word document."

That's an extreme position to take, particularly since the SQLite code is public domain. Furthermore it's one of the formats recommended by the Library of Congress for archival/data preservation:

https://www.loc.gov/preservation/resources/rfs/data.html

https://www.sqlite.org/locrsf.html


> The only way to read an SQLite file is using SQLite

This part unfortunately isn't a position, it's absolute. It's hard to imagine a situation where, as a developer, we would not have access to a C runtime or for any reason whatsoever would not be able to use SQLite, but the hard dependency on its code is real, and represents a real hazard in the wrong environment. A super easy example would be parsing data on, say, a tiny microcontroller on an IoT device. This can start to hurt quickly:

> Compiling with GCC and -Os results in a binary that is slightly less than 500KB in size

Open formats at least give you the option of implementing whatever minimal hack is necessary to finish your job without, say, introducing some intermediary to do an upfront conversion, and at least for this reason SQLite cannot really be considered a perfectly universal format.


>> The only way to read an SQLite file is using SQLite

> This part unfortunately isn't a position, it's absolute.

It's also false. I know of SQLJet [0], which is a pure Java implementation; there may be others. But in the end, the SQLite format being well defined and documented [1], a sure-fire way to read an SQLite file is writing the code to read an SQLite file. Since SQLite is a rock-solid, public domain, portable C library, it might not be the best idea to do that, but it is completely feasible. No one stops you from "implementing whatever minimal hack is necessary to finish your job" while using the SQLite format.

[0] https://sqljet.com

[1] https://www.sqlite.org/fileformat.html
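To illustrate how approachable [1] is, here's a rough sketch (Python, no SQLite library involved) that sanity-checks the documented 100-byte header:

    import struct

    def read_header(path):
        with open(path, "rb") as f:
            hdr = f.read(100)                  # the documented 100-byte header
        assert hdr[:16] == b"SQLite format 3\x00"
        page_size = struct.unpack(">H", hdr[16:18])[0]
        if page_size == 1:                     # per the spec, 1 means 65536
            page_size = 65536
        page_count = struct.unpack(">I", hdr[28:32])[0]
        return page_size, page_count

    print(read_header("example.db"))           # hypothetical file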


> A super easy example would be parsing data on say, a tiny microcontroller on an IOT device.

I mean, it's a point, but I don't think anybody is saying that SQLite should replace all data storage formats everywhere. If you're just storing a few dozen short text strings with keys, plain text is fine. I don't think you'd want to have a JSON parser on a tiny microcontroller either.


By 'position' I was very obviously referring to

>"as closed as wrapping something in a word document."

Sure, CSV makes it trivial to waste your time reinventing the wheel, making your own parser. The situations where there are technical limitations that prevent the use of sqlite are becoming vanishingly rare. (Not to mention the resources necessary to use sqlite are unrelated to how many implementations there are or whether it's 'open' or 'closed'.)


1) For practical purposes the vast majority of platforms that are inappropriate for SQLite are not appropriate for CSV.

2) SQLite has no standard. The same is true for CSV in practice. At least SQLite has a high quality reference implementation.

3) It’s a shame that SQLite doesn’t have a standard of some sort.


It's not extreme. It's the whole reason why WebSQL failed.


Do you have a source for that claim? Mozilla's 2010 justification for dropping WebSQL was that SQLite doesn't implement a particular standard of SQL (incidentally no SQL implementation does, SQL is notorious for that, every vendor implements something slightly different with parts of the various standards excluded and other features included.)

But more to the point, Mozilla's gripe was with SQLite's API, not with the file format used by SQLite. I can't find any source for the file format used by SQLite being the problem that tripped up WebSQL.


"AFAIK has no published open specification. The only way to read an SQLite file is using SQLite"

It's documented pretty well: https://www.sqlite.org/fileformat.html

I suppose that doesn't make it an open standard, but it hasn't changed much.


Not sure what an open standard gets you above what SQLite already is. You can’t get much more open than that.


That's basically the non-subtle version of what I was trying to convey :)


I had never seen this before! Thanks


You argue based on a good principle. But while the principle of an open, well-documented standard is a good one, it's just a principle. The goal is portability, consistency and long-term stability.

SQLite is a Recommended Storage Format for datasets according to the US Library of Congress. https://www.loc.gov/preservation/digital/formats/fdd/fdd0004...


You can open a SQLite file as text and read it and be able to discern data and tables.

Not saying it's useful but you saying "only way to read SQLite is SQLite" is hyperbole.


>...it is for many users just as closed as wrapping something in a word document.

Which users? I’m curious.

Cross platform, fully documented, free software, multiple implementations vs closed proprietary single platform commercial software. Really?


When I open my data I usually publish both CSV dumps (for those who prefer using awk/sed and command line tools; this is usually easier) and MariaDB reconstructive files. I'll consider SQLite databases from now on, I like that idea.


That is a truly fantastic idea. Especially if Excel added file format support for it directly to open it within excel with a double click, one table per sheet within the workbook. I know, not the ideal way to use SQLite, but that would ensure easy adoption by the average person that doesn't have SQL or relational database knowledge. (And would allow them to get a bit of that relational knowledge via power query)

Some might think Access would be the more obvious choice, but for the average user Access is not an accessible tool (we don't even install it as part of the Office suite where I work) and they're much more comfortable in Excel.


I've probably said this around here before, but I wish people would trade SQLite files instead of Excel workbooks. Excel (and the like) can operate on top of SQLite. This way we'd have data portability, data integrity, and ease of data access.


Read only, or Excel would have to enforce constraints. But, yes, would be nice


The problem is data corruption. If a CSV is corrupted, then I can at least parse part of the data. For a corrupted SQLite file, I'm done. Also, diff doesn't work on a binary format, so it is more difficult to trace changes in an SQLite file.

In this sense, I prefer a SQL dump file.


What is the source of the corruption? Unlikely to be disks nowadays, but people should take backups of things. Very unlikely to be a buggy SQLite write, but again, a backup could save you. Worst case, I'm sure there are tools to recover a corrupted file.


There are plenty of things that can cause it: https://www.sqlite.org/howtocorrupt.html


Nice of them to publish a guide for those who enjoy corrupted data :)

I particularly liked "fake capacity USB sticks". I didn't even realize that was a thing. I can't even...


Oh this goes way back. The first SSD drives from Hong Kong advertised double or quadruple their capacity - you didn't find out until you tried to write the N+1th block and it overwrote the 0th block. Back in the 90's?


Sqlite's sqldiff might be an okay replacement for diff in many cases - https://sqlite.org/sqldiff.html


Reminds me of this HN submission: https://news.ycombinator.com/item?id=16809963

Apparently CSV is actually quite hard to parse.


It's not that CSV is hard to parse. It's that there's no guarantee that you'll get proper CSV. For example, it may literally be "CSV", without quotes. And that's fatal if values contain commas. I've even seen CSV with values that contain ``","``!


RFC4180 [1] standardizes CSV, but there are many implementations that don't read this 100% and unfortunately even more (including an extremely popular spreadsheet application) that don't write it.

If you are including CSV functionality in something you work on, please read and follow this (tiny) spec!

[1] https://tools.ietf.org/html/rfc4180
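If you're writing it from Python, the standard csv module plus explicit options gets you essentially there; a minimal sketch (CRLF line endings and double-quote escaping are what RFC 4180 asks for):

    import csv

    rows = [["id", "note"], ["1", 'a "quoted" value, with commas']]

    with open("out.csv", "w", newline="") as f:
        # Be explicit about the RFC 4180 details: CRLF terminators, quoting as needed.
        writer = csv.writer(f, quoting=csv.QUOTE_MINIMAL, lineterminator="\r\n")
        writer.writerows(rows)

    with open("out.csv", newline="") as f:
        assert list(csv.reader(f)) == rows   # round-trips cleanly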


Whether Excel writes standard CSV or not depends on the user's locale settings. E.g. with a German locale you get a semicolon (;) as the separator, which you can only change system-wide. However, apart from the changed separator, it's still standard CSV, which still works with Python or SQLite (.separator ; .import foo.csv foo).

A bigger issue is that Excel tends to write large numbers in scientific notation, which is a common issue when handling price lists. E.g. it'll turn EAN numbers into 6.2134e+11, losing most of the number. Then you have to go back to the XLS file, change the column type to text, and export it again as CSV. As this is lossy, you can't fix it when receiving such a file.

Something like the SQL Server Import/Export Wizard but being able to write SQLite files would be very handy.
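Until then, a few lines of Python cover the common case; a rough sketch (the semicolon delimiter, the cp1252 encoding, and the three-column layout are all assumptions about the incoming file), keeping EANs as TEXT so nothing gets mangled:

    import csv
    import sqlite3

    con = sqlite3.connect("pricelist.db")             # hypothetical output file
    con.execute("CREATE TABLE prices (ean TEXT, description TEXT, price TEXT)")

    with open("pricelist.csv", newline="", encoding="cp1252") as f:
        reader = csv.reader(f, delimiter=";")         # German-locale Excel export
        next(reader)                                  # skip the header row (assumed present)
        con.executemany("INSERT INTO prices VALUES (?, ?, ?)", reader)

    con.commit()
    con.close()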


Fun fact - we noticed SQLite wasn't RFC-compliant for its CSV output (it used native line endings, not CRLF, which is mandatory).

It is fixed now... but I'm now wondering whether SQLite wasn't more correct in the first place...


The worst part about CSV is the locale-specific format used by some programs. For example when exporting to CSV a German Excel will use the comma as the decimal separator in numbers and a semicolon for the value separator. So you need to specify the exact format for each individual file you want to import.


Back when I worked at an NLP company, my boss used to say that he was "morally opposed to CSV".

When sharing tabular data in text format, he always preferred TSV because commas were everywhere in the material we were working with, but tab characters were really rare.


I've always been curious about the characters in ASCII for this, but I've never seen them used in the wild. Stuff like "Group Separator" (0x1D), "Record Separator" (0x1E) or "Unit Separator" (0x1F)

Is there a reason why nobody uses these? Did someone work out back in the 90s they were pure evil and we've just never used them since?
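They're certainly trivial to use programmatically; a toy Python sketch, assuming the payload never contains those control bytes, so no escaping is needed:

    US, RS = "\x1f", "\x1e"   # unit separator, record separator

    def dump(records):
        return RS.join(US.join(fields) for fields in records)

    def load(text):
        return [rec.split(US) for rec in text.split(RS)]

    data = [["1", "hello, world"], ["2", 'quotes " and commas , are fine']]
    assert load(dump(data)) == data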


https://www.lammertbies.nl/comm/info/ascii-characters.html

Those codes were originally for just such a purpose, but as others point out there was a bootstrapping problem. At this point, if Excel doesn’t support it, it’s not going to gain traction.


About 10 years ago I had to use those for a financial system integration. I was getting files that had been created on a mainframe, and whoever wrote it originally had the foresight to use those characters. Probably because they were based out of EMEA and understood that commas weren’t useful across national borders.


The codes between x01 and x1F were designed for telecommunications, for instance binary custom formats over RS-232 or synchronous protocols.

There is an excellent description in "C Programmer's Guide to Serial Communications" by Joe Campbell.


There’s no physical key for them. You can trivially write/edit CSV or TSV on any computer using any editor.


Indeed. I used to clean up data in UltraEdit. With the option for direct disk read/write set, there was virtually no file-size limit.


When I was working as a forensic data analyst, I felt pretty much the same. If you request CSV in discovery, there's no telling what you'll get. I mean, the data may come from custom COBOL, and then get reviewed by someone using Excel.

So yes, TSV. However, I've seen TSV with spurious tabs :(

Sometimes I ended up pushing for |-delimited data. Or even fixed-width format :)


> I've even seen CSV with values that contain ``","``!

Are you sure it wasn't an injection attempt of some kind?


It could have been, I suppose.

But more likely is twisted creativity. It seems that some businesses are still using ancient systems, based on COBOL, AS/400, etc. There's resistance to changing legacy code. So when business changes require additional data fields, fields sometimes get subdivided. So a field that originally contained stuff like |foo| now contains stuff like |"foo","bar,baz"| or whatever. That works, because there's nothing like CSV in the data system. But when someone tries a CSV export, you get garbage.


Newlines in the middle of a quoted field will cause problems for a lot of tools. And Python's csv.DictReader gives an error when a delimited file has too few fields.

But one can usually work around this.


It's hard enough that you should always use the library that handles the 5 weird cases, of which you'll only think of 3.

I made the mistake once and learned.


In several companies, the interview assignment is indeed to implement a CSV parser.


I'd rather have a proper text format which I can inspect without a highly specialized tool. In 50 or 500 years it will be hard to hunt down the right tools to open an SQLite file, build SQLite from sources (if they still exist), or reverse-engineer the file format. For a text file you only need to reverse-engineer the ASCII encoding (or maybe UTF-8).


I wish webpages were 'archived' as sqlite databases. :x

I wish a lot of metadata were defined as a database schema, and sqlite lends itself so willingly to becoming the archive/header.

Does sqlite do internal gzip compression?

I do understand we have MHTML:

https://en.wikipedia.org/wiki/MHTML


WARC is the "standard" now for web archiving.


What about HDF5? I'm starting to use it as an in-memory, parsed cache representation for data stored in something like a plaintext-file document database, with Python's PyTables, and visidata acting as the independent data explorer app.


I like HDF5 because it's compatible with everything, but it's a bit more difficult to get started with than other modern formats. Partly this is due to design by committee including everything, and the APIs follow suit. With some more modern APIs I think it could come back as an archival format.


Hdf5 files are good if the data is write-once and numeric, unless things have changed over the last five years. From my recollection, you can't append data, so you have to rewrite the whole table, and text fields are fixed width.


What size restrictions are you worried about?

*Note I work with software that regularly stores terabytes of data in sqlite.


Heh I was thinking that > 1TiB of data in an sqlite db is a bad idea :).

How do you process that? The restrictions I had in mind come from the fact that it's infeasible to concurrently process a large dataset in a mapreduce/bigquery style.


There are many different concurrency modes for sqlite. I can't see why you would have any problems with what you describe.

The systems I work with use it for backup data. We have many readers and writers. Some of those export an iSCSI daemon that presents a block device from the backup, which is then booted from.


spelling :(


Reminds me I have to extract data from a Windows 9x-era application. It's a Jet DB that refuses to load in the first viewers I could find. Made me wish it was sqlite (although it didn't exist at the time..)


An issue there is the entry level. For CSV a common tool to read and analyse the data exists: Excel.

For sqlite there's hardly an approachable tool for analysis of the data.

Excel is driving the world.


Well, my personal rule is to never touch CSV with Excel, because depending on the locale, most of the time Excel breaks something in the file. (Of course, in theory I might make a flawless data type mapping when importing the file to Excel, but unfortunately that happening is quite rare...)


This is a concern and there are lots of issues (also think about proper escaping etc.), but Excel is ubiquitous outside Hacker News's demographic.


Cool fact you might not know: SQLite has direct support for “SQLite archives” (i.e. an SQLite database containing a conventional table of {file metadata, BLOB data} records representing archived files) built into its CLI client. This functionality lives under the shell's `.archive` dot-command (and the `-A` command-line option), which exposes similar command-line functionality to tar(1).

Read more at: https://www.sqlite.org/sqlar.html

The nice thing about this is that, since this “sqlar” table is just one table in the file, and the commands work whether or not other such tables exist, a file can be an “SQLite archive” while also being a regular SQLite database containing other tables at the same time. It’s sort of like how Fireworks used to add special chunks to PNG—the file was still a PNG, but now it was also a Fireworks project. (But, in this case, the “chunks” aren’t opaque binaries, but rather are SQL tables you can manipulate using regular SQL queries.)

And, of course, the SQLar “standard” isn’t all that complex—it’s designed so that you can easily construct your own SQLite archive files just by issuing regular DDL+DML queries through the SQLite binding of your language of choice. (Though, if you want support for inserting files compressed—which the `.archive` command supports extracting—you’ll need a zlib binding as well.)
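For example, a rough Python sketch of adding one member (the table schema is copied from the sqlar page; the compress-only-if-smaller rule is my reading of the spec, so double-check it before relying on interoperability):

    import os, sqlite3, zlib

    con = sqlite3.connect("project.db")   # hypothetical: can also hold your own tables
    con.execute("""CREATE TABLE IF NOT EXISTS sqlar(
        name TEXT PRIMARY KEY, mode INT, mtime INT, sz INT, data BLOB)""")

    def add_file(path):
        raw = open(path, "rb").read()
        st = os.stat(path)
        packed = zlib.compress(raw, 9)
        data = packed if len(packed) < len(raw) else raw   # store whichever is smaller
        con.execute("REPLACE INTO sqlar VALUES (?,?,?,?,?)",
                    (path, st.st_mode, int(st.st_mtime), len(raw), data))

    add_file("README.md")                 # hypothetical file
    con.commit()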


I have migrated a binary file format to an SQLite-based ORM schema at one of my jobs. It worked out amazingly well, with (limited) forward and backward compatibility for nearly free.

A lot of hairy if-else code was made unnecessary.

SQLite is IMHO the best option for data serialization/persistence/interchange in many cases, especially when some data is a blob, or, say, a huge array of floating-point numbers that you need to store with full precision (which makes text-based formats clunky to deal with).
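For the floating-point-array case specifically, here is a minimal sketch of what that looks like with Python's sqlite3 and array modules (table and file names are made up):

    import array
    import sqlite3

    samples = array.array("d", [0.1, 2.718281828459045, 3.141592653589793])

    con = sqlite3.connect("measurements.db")   # hypothetical file
    con.execute("CREATE TABLE IF NOT EXISTS series (id INTEGER PRIMARY KEY, data BLOB)")
    con.execute("INSERT INTO series (data) VALUES (?)", (samples.tobytes(),))
    con.commit()

    # Read back with full double precision, no text round-tripping involved.
    blob = con.execute("SELECT data FROM series ORDER BY id DESC LIMIT 1").fetchone()[0]
    restored = array.array("d")
    restored.frombytes(blob)
    assert restored == samples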


> Single-File Documents. An SQLite database is contained in a single file, which is easily copied or moved or attached. The "document" metaphor is preserved.

Is this still true by default? With write-ahead logging (WAL) and shared memory (shm) enabled, it is no longer safe to consider only the database file. Pragmas exist to disable those, but people should be familiar with these features and use them with care. The document should be updated to introduce these concepts.


It's mentioned on the WAL docs on https://sqlite.org/wal.html

> There is an additional quasi-persistent "-wal" file and "-shm" shared memory file associated with each database, which can make SQLite less appealing for use as an application file-format.


Why would you realistically need wal or shm for an application file format anyway? That's for high concurrency in general.


The article explicitly mentions concurrent access.


That's true, however it's possible to remove the wal file after closing the database:

> The only safe way to remove a WAL file is to open the database file using one of the sqlite3_open() interfaces then immediately close the database using sqlite3_close().
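From application code, one way to end up with just the single file again is to checkpoint and drop back to a rollback journal before handing the document off; a hedged Python sketch:

    import sqlite3

    con = sqlite3.connect("document.db")           # hypothetical document file
    # Fold any outstanding WAL content back into the main database file...
    con.execute("PRAGMA wal_checkpoint(TRUNCATE)")
    # ...and leave the file in rollback-journal mode, so no -wal/-shm siblings
    # remain once the connection is closed.
    con.execute("PRAGMA journal_mode=DELETE")
    con.close()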


Can you expand on that? How do those two features break things?


It's not so much that they break things, but if those features are enabled sqlite will be writing multiple files instead of just one, which somewhat nullifies the "single file" benefit of sqlite.


Sketch (https://sketchapp.com) actually used a zipped SQLite database as its file format until the past year.


Any idea why they changed?


They changed to a zip of JSON files. I believe the main reason was likely hackability – one of the main advantages Sketch has in the design ecosystem is its plugin architecture and the array of tools that exist around it.

If you make your file format easy to work with then it only widens this moat.


CoreData has an option to write to a SQLite database, so a lot of data on iOS and macOS apps ends up being written to disk this way.


If you code in C/C++, and have even a modest OS/RTOS available, you should use SQLite as your persistence layer, full stop. I switched to this methodology 5 years ago & never looked back. You know what happens when another engineer gets your on-disk format & asks how to look at it? You say “open it in SQLite browser”; all further questions disappear.


Seconding this.

One thing though: last time I did that, I ended up writing my own ORM/serialization library (with templates and all), but I wish I didn't have to.

Is there anything freely available out there that would allow one to store object hierarchies with the ease of boost::serialization, but into an SQLite EAV scheme?


I always trace rather than log. That means I have one table per ‘tracing function’, and just record the arguments to each table. Recording an object means recording the function calls that generated the object; regenerating the object is just playing those functions back. I’ll admit that this works far better in C than C++, as C++ tends to treat new types as opaque primitives.



I believe ZIP supports updates about as well as SQLite. You append the new file to the end (and write the "header" after that). This leaves a hole in the middle, but that can be reused by a later addition. There's some details about internal page size and fragmentation, but I think it could amortize to about the same. There's something to be said about quality of implementation, but that's not an argument about the file format per se.


How? There's a massive difference in usability and performance with a SQL database and all the data querying and manipulation it provides compared to a ZIP file that just straps together a bunch of compressed payloads with a key/value file structure.

The linked page even has a case study of the OpenDocument file format that shows exactly how SQLite is better than a Zip file: https://www.sqlite.org/affcase1.html


If you're using SQLite as a pile of files, you're not doing much querying of the compressed data inside.

> Newer machines are faster, but it is still bothersome that changing a single character in a 50 megabyte presentation causes one to burn through 50 megabytes of the finite write life on the SSD.

That's funny, since Firefox would burn through gigabytes of writing to SQLite. All comes down to quality of implementation.


The point is that you don't just use it as a pile-of-files when you have SQL access to the data and instead use a high-level data model. This is not possible at all with ZIP.


ZIP is a horrible, horrible format - it's best left in the past.


I honestly don't know much about the relative differences in formats like this, what is horrible about ZIP? Also, my impression is that more compressed files end up in RAR these days. I know that's not a new format either-- is it similarly horrible?


To read a zip file, you need to read it backwards: https://github.com/corkami/pics/raw/master/binary/zip101/zip....

There's also another standard called ZIP64 to allow for larger files (maximum 4GB in the original spec).

I was going to say that RAR is proprietary, but at least there's (non-FOSS) source code and good specs: https://www.rarlab.com/technote.htm


The central directory is a great feature of zip. Makes it easy to append to and to list the files or decompress individual files without decompressing the entire archive.

The actual horrible thing about zip is that there is no standard whatsoever for the encoding of filenames. So mangled filenames are common and most utilities don't even let you manually select the encoding if you know what it is.


Yeah, I occasionally would encounter errors decoding zip files along the lines of "the file name was too long"


Are there zip libraries that facilitate this usage pattern?

Either way, I tend to think you'd probably be better off with sqlite since it would give you more flexibility (for instance if you later realize that your application possesses relational data.)


When using kubernetes, I always wonder if we can transform all the tedious "kubeconfig"s into a collection that can be easily queried, modified, saved and retrieved, rather than maintaining small text pieces in some directories. More ambitiously, maybe include container/image metadata as well?

After reading this document, well, I think I will give it a try. It might not be a perfect fit, but I don't quite like yamls.


Lxd uses distributed sqlite for its config: https://github.com/lxc/lxd/blob/master/doc/database.md

And there's this simplified fork of K8S that replaces etcd with sqlite: https://github.com/ibuildthecloud/k3s/blob/master/README.md


It should be noted that the distributed SQLite is a patchset written by the LXD folks. Last time I talked to them they said they're working on upstreaming it but want to be sure that everything is sane (SQLite has crazy backwards-compatibility requirements, not to mention the testing).


Yeah, I have always felt this is a great idea for almost anything that is not really so small and simple that plain text would make more sense. BLOB-by-nature data like multimedia, pictures etc come as the only reasonable exception.


Funnily, SQLite also describes itself as better than the FileSystem in this case as well: https://www.sqlite.org/fasterthanfs.html


The third great file format (after text and SQLite) is the zip file with custom suffix. That is ideal for BLOB-by-nature data.


I didn't really study the formats myself, but as far as I know the 7z container format is way better than zip.


Very likely true however zip has somehow become a de-facto go-to standard for this stuff. From Java (JAR files) to Microsoft (.docx et al) and numerous open formats including ODF and EPUB.


Using sqlite as our format has worked well for us at dannea. It also means we can have web-only documents, since sqlite can be embedded in a web page.


> since sqlite can be embedded in a web page

howso ? ... beyond querying the database server-side ...


Perhaps they're using SQLite compiled to JS via Emscripten: https://github.com/kripken/sql.js/


Yes, that is correct: at Dannea we use sqlite compiled to JS. We can even load the data from the same HTML page, so it allows us to have sqlite-powered web apps which can be emailed to others and are truly serverless.


I've used SQLite for a distributed project lately where different offline teachers in rural schools would back up their app data to a thumb drive as file databases, take it back to a town with connectivity and upload that data to the app server, so they'd get extra functionality and updates. It was incredibly straightforward to do and worked wonders!


It's a good idea unless you're exchanging these files. I'd love for a couple of my clients to switch from HDF5 to SQLite; HDF5 is horrible..

But if you exchange them, it's too easy for app developers to misuse the library, causing users' data to be leaked to third parties: in free DB pages, and even in the way the DB is fragmented.


I use HDF5 extensively, why is it horrible? It seems for a different use case, column-oriented numerical data vs. "database" record-type data.


Too slow. It's 2018, modern PCs have 8 cores and 16 hardware threads; HDF5 only uses 1 at a time: https://support.hdfgroup.org/HDF5/faq/threadsafe.html And that's the C API; the C++ one will just crash instead. The case is different threads processing different files, not concurrent access to the same file (the latter would be very nice, but I understand why it's quite hard to support).

Also it looks like the C++ API was designed in the 90s. Maybe that's my personal preference, but I find it hard to work with: too much OOP but too few types.


What you suggest sounds like security by obscurity.


Based on the mention of free DB pages, I suspect the issue is information leaking because it isn't scrubbed after being deleted, so that it's still in the db file when it's sent off to a third party, just not "accessible" via SQLite APIs. For example, a redaction procedure which consists of deleting one table that contains all the sensitive data doesn't necessarily _erase the data_; you'd probably want to VACUUM or reconstruct the database from a dump.

Moreover, it's a general issue with file formats that can encode the same data in multiple ways that, in addition to the data itself, the data file also incidentally encodes information about the process by which the data was encoded. This metadata is usually ignored and abstracted out by APIs for working with the data, so application developers tend to overlook it as a place where sensitive data can leak.

For comparison, you might consider fingerprinting API clients based on the order in which they send HTTP headers or keys in JSON objects, which can often be correlated with language or library versions in environments where the most convenient map structure is an arbitrarily ordered hash.

It's not necessarily easy to think of nefarious uses for this sort of information, but SQLite database fragmentation can reveal, albeit in rather rough detail, the order in and frequency with which changes were made.

Quite the opposite of "security by obscurity", in fact.


The SECURE_DELETE pragma sounds like it'd be useful for this:

https://sqlite.org/pragma.html#pragma_secure_delete

Needs to be manually enabled though, as that doc says it's generally not enabled by default. :/
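Something like this before handing a copy to anyone else (a sketch; the file and table names are hypothetical):

    import sqlite3

    con = sqlite3.connect("tosend.db")        # hypothetical file about to be shared
    con.execute("PRAGMA secure_delete=ON")    # zero out freed pages from now on
    con.execute("DELETE FROM secrets")        # hypothetical sensitive table
    con.commit()
    con.execute("VACUUM")                     # rebuild the file, discarding free pages
    con.close()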


This indeed helps but only partially. Will still leak data in the way the DB is fragmented.

E.g. if you have a text document with 1 row per paragraph, fragmentation will reveal where it was edited, and what kinds of edits were made. You insert a bunch of images into the document, and the physical order of pages will reveal in which order they were inserted, even if you re-arrange them afterwards.

It's possible to work around this with careful programming, but still, it's very easy to screw up and leak data.

OTOH, with zipped XMLs people normally use instead, you overwrite the complete file each time, it's not fragmented and hopefully doesn't contain any extra info.


>or reconstruct the database from a dump.

Indeed - just write a new file you export data to be shared with someone else.

Heck, for many use cases (e.g. config files), the save_to_file() function can always start by deleting the old file completely, creating the DB from scratch, and filling it with data from memory (essentially, treating the DB files as immutable).

Hardly a difficulty, no?

And proprietary file formats aren't guaranteed to NOT have the problems you described; the only a priori advantage is that their structure is opaque. Hence my "security by obscurity" remark, if my understanding is correct.


> just write a new file you export data to be shared with someone else.

There's no good way to distinguish. Add separate File/Export command? Users will forget to use it. Add a checkbox on file dialog? Similarly, users will forget to check it. It's very unreliable.

The only reliable way is treat every save as "export data to be shared with someone else".

> proprietary file formats

Very few left, the industry has moved towards open formats. For example, MS Word documents are zipped XMLs, ISO/IEC 29500.


For those wishing to see Sqlite used to publish data sets:

Datasette (https://github.com/simonw/datasette) is a new tool to publish data on the web. It uses SQLite under the hood.


Why not a text serialization format? That SQLite can read/write.

e.g. tables and schema in JSON.

A table as a list of row-objects of attributes (or, more efficiently, an object of lists; or, following the schema, just a list of rows, each a list of attribute values).


I was thinking of using SQLite for creating game saves. Does anyone have experience with that? The more I think about it the more I wonder why we don't see SQLite in games more often.


It has been used for game saves before. Not sure how commonly, but we (sqlitebrowser.org) do have links to our website from various places giving tips on how to change things (in the saves). ;)


Cubeworld used SQLite to store the procedurally generated world.



I'm surprised this is the only comment in the thread that mentions this security issue. It's only a week old!


I have done this on a large .NET Win32/XAML client / server application in two ways:

1) Saving application state and config/settings (window layout, ...); it is very easy to marshal XAML state to an SQLite file

2) Caching of common data sets to avoid unnecessary polling of the server for data that is accessed all the time (i.e. customer master data)

It works beautifully.


What if I need to store binary data like images or PDFs? Is storing them as blobs an acceptable method?


Yup, see no reason why not. This is answered in the post:

> Any application state that can be recorded in a pile-of-files can also be recorded in an SQLite database with a simple key/value schema like this:

   CREATE TABLE files(filename TEXT PRIMARY KEY, content BLOB);
> If the content is compressed, then such an SQLite Archive database is the same size (±1%) as an equivalent ZIP archive, and it has the advantage of being able to update individual "files" without rewriting the entire document.

Also look at the related post which suggests doing precisely that - storing images inside SQLite blobs: https://www.sqlite.org/affcase1.html
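In Python that amounts to something like this (a sketch; file names are placeholders):

    import sqlite3

    con = sqlite3.connect("documents.db")     # hypothetical container file
    con.execute("CREATE TABLE IF NOT EXISTS files(filename TEXT PRIMARY KEY, content BLOB)")

    with open("report.pdf", "rb") as f:       # hypothetical PDF to store
        con.execute("REPLACE INTO files VALUES (?, ?)", ("report.pdf", f.read()))
    con.commit()

    # Reading it back out again:
    (pdf_bytes,) = con.execute(
        "SELECT content FROM files WHERE filename = ?", ("report.pdf",)).fetchone()
    open("copy.pdf", "wb").write(pdf_bytes)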


Is this really a good idea though? I thought it was generally preferred practice to store large binary files outside of the database. Otherwise the database file will end up growing pretty quickly.


That's your call. If you're going to be storing 100gb of images then put them somewhere else, but this post is about using SQLite as a file format so storing binary files inside is not a problem.


Counterpoint: tiled maps, which are basically loads and loads of images, can be stored on disk as

- Directory structure

- Zip file

- Tar file

- Sqlite db

Out of these, sqlite is the most compact, by far (also the fastest)


We just transitioned our tiled images from using a directory structure to SQLite.

Having a single file per very large image (for mobile) makes transferring them to the phone over USB around ten times as fast. I wish I’d done it years ago.


Creating huge files is usually a bad idea. It does not matter if they are a sequence of lines of characters, a gzipped collection of XML data, a binary format that requires 6000 pages of documentation, or a SQLite database.

If you can spread your data among many files, you probably should. Not everybody can.


Storing large objects in database rows impacts performance and scalability but this shouldn't be a problem unless your files are extraordinarily large.


Yes. The MBTiles Format [1][2] is essentially just an SQLite database storing map tiles (thousands of square images of parts of a map). In that case it's mostly done because SQLite stores small files much more efficiently than a general purpose file system in default settings.

1: https://github.com/mapbox/mbtiles-spec 2: https://wiki.openstreetmap.org/wiki/MBTiles


In SQLite, a database file can be 140 terabytes, https://www.sqlite.org/limits.html

And reading and writing can be even faster than if they were stored directly in the filesystem, https://www.sqlite.org/fasterthanfs.html


So does that mean serving static files from a SQLite file would be faster than something like NGINX?


No, because of the sendfile syscall, where the kernel passes all the file data without switching back to userspace.

Even if you can in theory save some time with small files with sqlite, you would be back to read/write IO and userspace/kernel byte buffer switching.

And it probably wouldn't be faster than something like a mmaped file anyway (unless the SQLite Db is also mmaped).


The thing is SQLite handles a lot of hard problems that you might not realize you have: like “was my data actually written to the disk”


Why don't we replace entire filesystems with SQLite then?


>"This file contains some example code demonstrating how the SQLite vfs feature can be used to have SQLite operate directly on an embedded media, without using an intermediate file system."

https://www.sqlite.org/src/doc/trunk/src/test_onefile.c

https://www.sqlite.org/vfs.html

IIRC, during one of his talks D. Richard Hipp mentions this actually being done for real.


SQLite does not support BLOBs larger than 2GB, so in that sense it's only as good as FAT32


You can't stream SQLite files.


I'm not sure what you mean by "stream" here. However, there are two APIs designed to facilitate sending SQLite database files over a socket. The sqlite3_serialize() interface (https://www.sqlite.org/c3ref/serialize.html) takes a database and converts it into a blob of contiguous memory that can be sent over a wire. The sqlite3_deserialize() interface (https://www.sqlite.org/c3ref/deserialize.html) does the opposite, accepting a blob and using it as a database file. These interfaces were created for the purpose of "streaming" database files across network connections. Depending on your specific needs, they might work for you.
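For what it's worth, Python 3.11+ exposes these as Connection.serialize()/deserialize(); a small sketch (the in-memory source database here stands in for whatever bytes arrived over the wire):

    import sqlite3

    src = sqlite3.connect(":memory:")
    src.execute("CREATE TABLE t(x)")
    src.execute("INSERT INTO t VALUES (42)")
    src.commit()
    blob = src.serialize()                      # entire database as one bytes object

    dst = sqlite3.connect(":memory:")
    dst.deserialize(blob)                       # use the received bytes as the database
    print(dst.execute("SELECT x FROM t").fetchone())   # (42,)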


I'm assuming they mean some parts in the file require parts that appear farther on, so you'd have to receive the whole file?


> I'm not sure what you mean by "stream" here. However, there are two APIs designed to facilitate sending SQLite database files over a socket.

I already have APIs that send SQLite database files over a socket: send(), sendfile(), writev(), and so on.

What I don't have is the ability to receive SQLite database files incrementally (stream): I always need enough disk and/or memory to store an SQLite database.

I can however, receive an XLS file (or even XLSX) incrementally and implement a streaming parser for it.


YAML is stupid. JSON is fine, because it can be `import`ed from JavaScript as a real JavaScript object. In general, make any configuration a JavaScript object and you can program it however you want.


1. "YAML is stupid" is not constructive. Why is it "stupid"?

2. One place where YAML is nice is if you want to have a user-edited file. JSON can be a pain to edit by hand due to a lack of comments and being so picky about where commas can and cannot go - YAML isn't perfect, but, it does help with issues like that.

3. There are languages outside of Javascript, and it's not possible to import a javascript file directly from most of those languages.

4. There is a difference between the file that you use to configure an app - for which JSON or YAML may be a good fit - and the format for which you export data - where JSON or YAML may or may not be a good fit.

5. JSON has some very real issues with different parsers not implementing the JSON corner cases the same way - which doesn't really matter, until you hit one of those cases and then it really matters.

6. JSON doesn't support streaming access to file data - it's not impossible to process a JSON file incrementally, but, it's hard to do as JSON is easiest used when you can load the whole thing into memory.


To me, the biggest problem with YAML is how much unintuitive magic behavior it has with translating values into different types (see unquoted yes/on actually evaluating as boolean true), which makes its increased direct-editability actually a bad thing.


I've been trying to decipher this irrational hatred of YAML coming from a certain subset of developer for years. I've only three or four data points so far, but all of them were very gifted, opinionated developers who were unable to explain their hatred.

Recently, it began to dawn on me that maybe the problem is one of visual acuity required for discerning whitespaces. I always assumed that just because I could visualize a YAML hierarchy just as well as I could JSON, and could edit YAML without messing up the indentation, that others could too. But I've witnessed highly capable developers bang their heads against the keyboard trying to fix repeated parsing errors in YAML they've written. Perhaps not everyone can pick up whitespaces quite so easily...


My experience failing with YAML: I've converted a visual workflow tool into a YAML file mode so I could give my users a file format they could edit, share and version with any text editor and commit to a git repo.

The result was that they were at a loss trying to figure out the free-form map and array structures. They could not grasp when to use dashes (arrays) mixed with maps (array of maps) and would fail to indent structures all the time. Trying to create meaningful error reporting was very hard since their mistakes were also valid YAML.

YAML is definitely powerful but can become too complex too fast. And being a format full of sigils and its TIMTOWTDI "there's more than a way to do it" approach can easily enrage lean purists and complexity detractors, hence the hatred.


OK. Let me clarify. YAML can't be imported directly by, for example, JavaScript. You have to readFileSync it, parse it, and process it in some way. YAML whitespace is hard to get right. You have to have a consistent indent format for both the editor and the yaml file itself. YAML can't be composed. YAML's types are confusing, with quotes and no quotes.


But what do you do when dealing with large numbers across platforms? Represent them as a string?

I have recently been moving to toml.


Consider using all config files as `app.config.js` , you can just return a JS object and done.


> Consider using all config files as `app.config.js` , you can just return a JS object and done.

It's kind of sad that you seem to think that JavaScript is the only relevant language in existence.


It's because my projects are universal applications; that's why JavaScript. But JS is not the only choice. What I mean is, we want a programmable config file, not a "static" one.


This topic interests me. I am very interested to hear what the specific issues are with "non programmable" config files in your case.

Generally I find that people often choose config languages because of syntax preference, when in reality they should be thinking about whether what they are trying to express is data or behavior.

Ansible should have been just python. Yaml makes for an ugly programming language. Whereas python setup.py files should have been anything but python. It's a terrible source of metadata.

There are cases of course where the distinction is not that obvious. Is your nginx config data or a program/behavior? How about your CI config?


You've probably already seen this, but https://www.youtube.com/watch?v=0pX7-AG52BU (a talk on configuration at Jane Street) goes over some of these issues at depth.


Some common use cases:

- Compile config.js into config.json or config.yaml, using config.js as an intermediate format to compile to the real config.

- Environment variables, or require 'other-module.js'


There is an argument to be made for doing config in the language of your application. It's a pattern I've seen in codebases using JS, Elixir, Scala, Haskell, and even C if DWM counts.


And what is the argument?


First, your config isn't a static file that you read wherever you access it, but a library that defines how to load config, and if you treat it as a static file you'll potentially have trouble adapting your code if you want to load config from environmental variables or centralised config management tools.

Second, if you load your config from a different language like JSON, you miss out on the benefits of your first language, like static typing. Static typing gives you nice things like build-time failures for invalid config and integrated configuration autocompletion in IDEs and editors.

Third, you can do things like merging environment-specific configs into a base config or programmatically filling in an array of values explicitly and in a way that's easy to follow. If you're using Fibonacci backoff, you can write "cfg.backoffRetryTimings = fibonacci(100)" instead of listing 100 numbers.


> want to load config from environmental variables or centralized config management tools.

I am not sure I understand this, I have a feeling you are mixing "configuration" with "loading the configuration" here which is a slippery slope.

> you miss out on the benefits of your first language, like static typing

Again, you are confusing configuration with writing your application. You have a point here if you are your applications only user (e.g. server apps) and you have to recompile anyway.

> Third, you can do things like merging environment-specific configs into a base config

Again, I am not sure if your config file is the place to do that. I would argue that your configuration should be different in different environments instead of having a smart configuration that recognizes the environment and does a bunch of different things.

> or programmatically filling in an array of values explicitly and in a way that's easy to follow.

This is also a good point and is definitely a place where JSON falls short.

Good stuff.


Maybe that future maintainers/inheritors of the application would thereby be defacto knowledgeable in the config language as well?


Mh,

> Simplified Application Development

It's depend "application development" is a vast area...

> Single-File Documents

Not much related to SQLite; this can or cannot be done with nearly any format you like. Also remember, as a cautionary tale, how badly Catia v6 R2009 went (Catia is the best CAD/CAE/CAM suite in the world, used everywhere from automotive to aerospace & defence; its sole real competitors are Creo (formerly Pro/E) and NX) when it started to offer single-file PLMs for "light usage"...

> High-Level Query Language

Depending on the format you choose you may have the best/highest-level query language: your own app's programming language itself, so IMO this may not be an SQLite plus at all

> Accessible Content

Any binary content is FAR LESS accessible than text

> Better Applications

Sorry, it has NO meaning at all.

I like SQLite for many aspects, but I certainly do not look at it as a file format for most applications I use or can imagine. For some usages it is really nice; for others {Tokyo,Kyoto}Cabinet is another nice format; for MANY other usages text files in various formats, or binary files in various formats (including classic pickle), may be better...



