SQLite 3.42.0 (sqlite.org)
250 points by nikbackm on May 16, 2023 | 105 comments



> Enhance the JSON SQL functions to support JSON5 extensions

Wait, what's JSON5??

> Object keys may be unquoted identifiers.

> Objects may have a single trailing comma.

> Arrays may have a single trailing comma.

> Strings may be single quoted.

> Strings may span multiple lines by escaping new line characters.

> Strings may include new character escapes.

> Numbers may be hexadecimal.

> Numbers may have a leading or trailing decimal point.

> Numbers may be "Infinity", "-Infinity", and "NaN".

> Numbers may begin with an explicit plus sign.

> Single (//...) and multi-line (/*...*/) comments are allowed.

> Additional white space characters are allowed.

Oh crap, the levees have broken!


The important point to keep in mind is that SQLite will read JSON5, but it never writes it. The JSON that SQLite generates is canonical JSON that is fully compliant with the original JSON spec.
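For example, a quick sketch of that asymmetry (JSON5 in, canonical JSON out):

    SELECT json('{a: 1, b: [2, 3,], /* comment */}');
    -- returns: {"a":1,"b":[2,3]}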

It turns out that there is a lot of "JSON" data in the wild that is not pure and proper JSON, but instead includes some of the extensions of JSON5. The point of this enhancement is to enable SQLite to read and process most of that wild JSON.

This feature was requested by multiple important users of SQLite.


Dr. Hipp

Off topic: would you mind sharing any info on potential timing of begin-concurrent-pnu-wal2 branch being merged into main (or consideration of forking sqlite to have a "client/server" version)?

https://www.sqlite.org/src/timeline?r=begin-concurrent-pnu-w...

(Love what you have created. Thank you so much for all the years of amazing work)


That branch has been renamed "bedrock" (after its principal user) and is up-to-date.


Hi Dr Hipp

Curious, would you recommend using this bedrock branch?

Why / why not?


Expensify is the main user from what I could gather. They've been using it for a while now.

https://bedrockdb.com/


Isn't this just the YAML debacle all over again? Where virtually any sequence of text is "valid" (albeit meaningless) YAML?


While I don't care for JSON5, it's hardly "virtually any sequence of text"; it hews much closer to JavaScript syntax than to the very limited subset (ish) that is JSON.


Virtually any sequence is not the same as adopting a specification five years after it’s been published and adopted by industry: https://spec.json5.org/

I appreciate that SQLite can't write the format, because those changes are human affordances.


> > Single (//...) and multi-line (/*...*/) comments are allowed.

We can now have different parsers honoring pragmas that specify different behaviour, depending on whether those pragmas are recognized or not.

In case anyone was wondering, the history is that comments were considered an anti-feature by Douglas Crockford, the creator of JSON:

> I removed comments from JSON because I saw people were using them to hold parsing directives, a practice which would have destroyed interoperability. I know that the lack of comments makes some people sad, but it shouldn't.

> Suppose you are using JSON to keep configuration files, which you would like to annotate. Go ahead and insert all the comments you like. Then pipe it through JSMin before handing it to your JSON parser.

* https://web.archive.org/web/20150105080225/https://plus.goog...

* https://en.wikipedia.org/wiki/Douglas_Crockford


And for a data interchange format this is also 100% reasonable.

The problem is that people have started using JSON for configuration files and the like, which IMHO has always been – and continues to be – the wrong tool for the job.


How does that relate? Your config parser can disregard comments.


I think the point is that your config parser should be using YAML or TOML, though in fairness neither of those existed in the early 2000s when JSON was being developed/discovered/adopted; the 800 lb gorilla in that space at the time was XML.

And XML remained dominant for a long time; for example, the original Google Maps from 2005 received its server responses as XML blobs, and API v2 even exposed the relevant parsing functionality as the GXml JavaScript class. By around 2007 it was all JSONP I think, and GXml was deprecated and removed in API v3 and v4 respectively.


What comments? JSON doesn't have any.

And editors will rightfully mark them as errors.


that's the JSON5 support


That's my point. Parent says that the config parser can disregard comments. Why would it need to disregard comments if it was JSON5?


Yes, but JSON was never intended for that use case – the feature set isn't geared towards it.


> Go ahead and insert all the comments you like. Then pipe it through JSMin before handing it to your JSON parser.

Every problem can be solved by introducing a build step, except for the problem of having too many build steps.


Let's add parsing directives then.


Is JSON5 a thing people use? I see it's from 2012 but it's the first I've heard of it. It looks fairly sensible; I'd take it for trailing commas alone. And comments!


VSCode config files use it


No, VSCode uses jsonc, not JSON5. They are very different, e.g. allowing trailing commas is an explicit design goal of JSON5, whereas in jsonc they are frowned upon (generates a warning, and was illegal at one point). jsonc is basically JSON + comments and nothing more, IIRC.
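Roughly, the dividing line (illustrative snippet; jsonc's trailing-comma handling is as described above and has varied over time):

    {
      // a comment: fine in both jsonc and JSON5
      "trailing": "comma",  // fine in JSON5, warned about in jsonc
    }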


Well I use it, not that that means much :)

Tried several "json replacement" formats (bson, cbor, msgpack, ion...) and json5 won by being the smallest after compression (on my data) while also having the nice bonus of retaining human-readability.


How does it affect speed of parsing in JavaScript? I kinda thought the whole reason this deeply-mediocre format caught on in the first place was it was especially natural & fast to serialize/deserialize in JavaScript, on account of being a subset of that language. (XML also had the "fast" going for it thanks to browser APIs, but not so much the "natural")


> this deeply-mediocre format

Most of the "popular" publicly available formats at the time were singularly worse, even ignoring commonly limited or inconvenient language support.

SOAP? ASN.1? plists? CSV? uuencode? I'll still take JSON over all of them, especially when it comes to sending shit to the browser (plists might be workable with a library isolating you from it, but it is way too capable for server-to-browser communications, or even S2S for that matter; not all languages expose a URL or an OSet type).

> it was especially natural & fast to serialize/deserialize in JavaScript, on account of being a subset of that language.

That is certainly a factor, specifically that you could parse it "natively", initially via eval, and relatively quickly[0] via built-in JSON support (for more safety as the eval methods needed a few tricks to avoid full RCE).

But another factor was almost certainly that it's simple to parse and the data model is the lowest common denominator for pretty much every dynamically typed language. And you didn't need to waste time on schemas and codegen, which at the time was a breath of fresh air.

> XML also had the "fast" going for it thanks to browser APIs, but not so much the "natural"

XML never had "fast" going for it in any situation: the browser XML APIs are horrible, and you had to implement whatever serialization format you wanted in javascript on top of that, so it was even slower (especially at a time when JS was mostly interpreted)

[0] compared to the time it started being used: Crockford invented / extracted JSON in 2001, but services started using JSON with the rise of webapps / ajax in the mid aughts, and all of Firefox, Chrome, and Safari added native json support mid-2009


> And you didn't need to waste time on schemas and codegen, which at the time was a breath of fresh air.

Everyone just wrote their own custom clients in code instead :) But it was a gradual process, so it's harder to notice the pain compared to some generator slamming a load of code into your project. *cough*gRPC*cough*


> I kinda thought the whole reason this deeply-mediocre format caught on in the first place was it was especially natural & fast to serialize/deserialize in JavaScript, on account of being a subset of that language

No, people couldn't just eval random strings, especially the ones containing potentially malicious user input. They started writing parsers like any other language, then came the global JSON.parse and .stringify per spec.

I can't say for sure but I doubt JSON being a JS subset helped at all.


> No, people couldn't just eval random strings, especially the ones containing potentially malicious user input.

hehehehehe

> I can't say for sure but I doubt JSON being a JS subset helped at all.

Oh it very much did, both because it Just Worked in a browser context, and because the semantics fit dynamic languages very nicely, and those were quite popular when it broke through (it was pretty much the peak of Rails' popularity).


"> Objects may have a single trailing comma."

Any arguments against this one? My knee-jerk reaction is 'yay'.


Why would you put a trailing comma when there's nothing after it? A comma isn't a period :-/


Because it follows the general principle of making it easier to change [0]. Fewer edits, cleaner diffs.

Useless from the perspective of a wire format, but nice for things like config files, which seem to be the use cases json5 is targeting.

[0] Bullet pt 3: https://betterprogramming.pub/5-essential-takeaways-from-the...


When manually modifying JSON, it really helps avoid accidentally missing a comma when you add or reorder elements.

JSON isn't English. The comma doesn't mean what it does in English, so I don't see why a period would be appropriate either.


Makes copy-pasting and reordering lines easier, which is quite handy.


Simplifies JSON-generation code, typically saving a conditional.


Recurring issue since forever: "separator" vs. "terminator", both can work. For data rows, I prefer termination as it makes missing columns explicit. But being optional here, it means nothing.

I guess it facilitates low-effort scripting that doesn't have to treat the last element differently.


Just noticed today that the Python formatter "black" does this, and I really disagree. Of course, I realise that by noticing its changes and arguing about them I am missing the whole point of using it.


Despite the name, “JSON5” is not an official successor to JSON.


Good to know. I would still like to see it become the de facto successor, though.


All that, and still no datetime support, which is the most annoying thing missing in JSON imho.


Huh, I'd never thought twice about that. What would native datetime support in JSON get you that an ISO 8601 string doesn't?


Type validation, for one. The reduced chance that a client has put some backwards format in there like MM-DD-YYYY (or DD-MM-YYYY for that matter), or just a completely invalid date.

You might as well ask what native numeric or boolean support offers over just jamming stuff in a string in an agreed format. Some might argue that dates are a compound value so differ from atomic types like a number, but they are wrong IMO, as a datetime can be treated as a simple numeric[1] with the compound display being just that – a display issue.

Others will point out that JS doesn't have a native date/datetime/time type, but JSON is used for a lot more than persisting JS structures at this point.

--

[1] caveat: this stops being true if you have a time portion with a timezone property


> You might as well ask what native numeric or boolean support offer over just jamming stuff in a string in an agreed format.

A big difference is that a numeric or a boolean are quite limited datatypes with agreed upon semantics (mostly).

> Others will point out that JS doesn't have a native date/datetime/time type

It does, in fact. And it's absolute shit.

> but JSON is used for a lot more than persisting JS structures at this point

So what I'm reading here is that you don't need types to be supported natively in order to serialize to JSON.


> It does, in fact. And it's absolute shit.

Maybe it is "just us", maybe it is part of the general shitness you point out, maybe it is the lack of a literal representation (even VB and relatives had one) other than an ISO 8601 string, but dates don't feel native like simpler types, arrays, objects, functions, …

> So what I'm reading here is that you don't need types to be supported natively in order to serialize to JSON.

Yes. But not absolutely needing something does not mean it isn't (or wouldn't be) exceptionally useful to have.


Well one thing it could get you (probably wouldn't, but could) is symbolic timezones, as ISO 8601 only supports offsets.


What it gets you is not having to deal with breakage when someone doesn't know the difference between ISO8601 and RFC3339 or, in fact, `date` strings. Shuffle all that mess off down the stack to where someone else has made one decision, once, rather than having to relitigate it every time.


> It is a conformant subset of the ISO 8601 extended format.

Huh, TIL.

https://www.rfc-editor.org/rfc/rfc3339


JSON will never have datetime support, since Javascript does not have datetime literals (and that's a good thing given how horrible the Date object is).

Probably more importantly, all of that and still no proper datetimes in sqlite.

Also even more so no domains (for custom datatypes).


> Probably more importantly, all of that and still no proper datetimes in sqlite.

Home Assistant recently did a ton of changes to work around the issues caused by this.

The short story is that they stopped storing timestamps as 'timestamp' datatypes and started storing them as unix times stored in numeric columns. Since timestamps turn into strings in SQLite, this was a huge improvement for storage space, performance, etc.

The problem is that this change also affects databases which have a real datetime datatype. So PostgreSQL, which internally stores timestamps as unix times, is now being told to store a numeric value. To treat it as a timestamp you have to convert it while querying. Since I used PostgreSQL for my Home Assistant installation, this feels like a giant step backwards for me.
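e.g. something like this, sketched with hypothetical table/column names:

    -- PostgreSQL: convert the numeric unix time back by hand on every query
    SELECT entity_id, to_timestamp(last_updated_ts) AS last_updated
    FROM states
    ORDER BY last_updated_ts DESC
    LIMIT 10;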

I wish that they had used this change as an opportunity to refactor the database code a bit so that they could store timestamps as numeric for SQLite, but use a real timestamp datatype for MySQL and PostgreSQL. I'm sure that this isn't a simple thing to do though.


Generally speaking, DateTime/TimeStamp fields are treated differently between databases, either in practice or in purpose, much of the time. When migrating from one database to another, this is almost always an issue.


Inconsistent datetime storage has been a consistent issue for me providing cross-provider and cross-application/framework support for SQLite.


In people's opinions: Would this feature be appropriate to implement, or beyond the scope of what the JSON project should aim for?


I don't know what "appropriate" or even "beyond the scope" mean in this context but, bearing in mind that datetime data needing to be stored is a fact of life that's not going away, I'd say yes, it does belong in JSON.

It also belongs in SQLite.


You can remove JSON; when is SQLite adding a proper date type?



Wrong, there's no datetime type there. Only an integer can be used for it, in a more space-saving manner. But that means you have to convert and invoke calculation functions. It's error-prone.
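A sketch of what that looks like (hypothetical table storing unix-epoch integers):

    -- convert on the way out, every single time:
    SELECT datetime(created_at, 'unixepoch') FROM events;

    -- date arithmetic means composing modifier strings by hand:
    SELECT count(*) FROM events
    WHERE created_at >= unixepoch('now', '-7 days');  -- unixepoch() needs SQLite 3.38+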


Isn't that what most databases do under the covers anyway?


Yes, and? I prefer the DB to take care of it for me.


I wished upon a shooting star for comments in JSON sometime in the 2010s. That took a while, but I'm glad it came through.

-- EDIT: mainly back when we started using JSON to configure, well, lot of things.


Comments in JSON? Why isn’t this everywhere?


> Comments in JSON? Why isn’t this everywhere?

Comments were considered an anti-feature by Douglas Crockford, the creator of JSON:

> I removed comments from JSON because I saw people were using them to hold parsing directives, a practice which would have destroyed interoperability. I know that the lack of comments makes some people sad, but it shouldn't.

> Suppose you are using JSON to keep configuration files, which you would like to annotate. Go ahead and insert all the comments you like. Then pipe it through JSMin before handing it to your JSON parser.

* https://web.archive.org/web/20120507093915/https://plus.goog...

* https://en.wikipedia.org/wiki/Douglas_Crockford


I always hated this logic, the "I know better than you". You can still add pragmas trivially in JSON, this doesn't prevent that problem, it just makes it more convoluted at the expense of a powerful documentation feature.


> I always hated this logic, the "I know better than you".

Feel free to create your own data interchange file format. (Perhaps with Blackjack and hookers. :)

> You can still add pragmas trivially in JSON, this doesn't prevent that problem, it just makes it more convoluted at the expense of a powerful documentation feature.

Computers/systems do not need comments to exchange data between themselves, and that is the problem JSON is trying to solve: data exchange.

If you're using it for configuration and other human-centric tasks, and wish to communicate human-centric things in the file, you're trying to fit a square peg into a round hole. Use some other format that was designed with those things in mind.

If you're saying the designers of a hammer didn't think of being able to handle Robertson screws, you're unfairly critiquing the tool.


> Feel free to create your own data interchange file format.

This is such a lazy argument to make. We are both allowed to criticize flaws in data exchange formats. As far as what JSON is trying to solve, it's far beyond that, and the creator knew this (which is why he went out of his way to fight it). If you as a human have ever debugged anything dealing with JSON handling, you've already proven that comments can have value.


And still JSON is less of a pain than XML... ;-)


It's a square peg that fits very nicely in the round hole I use it for, so Imma continue to use it as I see fit.


This really never made sense; I find adding a property like __pragma: nicer than comments. It has the same problem as he stated, but leaving out comments is not going to prevent that; it was a bad idea IMHO. I use hjson just for the comments; it's excellent.


> This really never made sense

It always made perfect sense, given JSON was invented as a data exchange format. It was never intended to be written out by hand.

> I find adding property like __pragma: nicer than comments. It has the same problem as he stated

It does not, because that is data which every JSON parser will at least parse if not interpret.

A comment is something a parser has no reason to yield, which means smuggling metadata in comments makes the document itself inconsistent between parsers which do and parsers which don't[0] expose comments.

And this was not a fancy Crockford made up, smuggling processing instructions in comments was ubiquitous in Java at the time, as well as HTML/XML, where it literally still is a thing: https://en.wikipedia.org/wiki/Conditional_comment.

[0] and possibly can't: how do you deserialize a comment in the middle of an object to JS or Ruby?


Because preserving them across processors (think jq) is hard or impossible.


It's a shame because it's a rather trivial thing to filter out. Personally I wish they would have left out the "//" single-line commenting to avoid the whitespace dependency.


Filtering them out is indeed trivial, but then they're not stable / preserved, and so why bother writing them?

Ensuring that comments and their locations in the text are preserved by processors is quite difficult if not impossible. Reformatting a JSON text alone can "break" comments by not necessarily placing them where they belong. Any schema transformation means comments must be dropped.


You can write a parser that generates an AST node for comments and then pretty prints that AST. So it’s not impossible, but it does require that your parser gives you the option to not just drop comments on the floor.


Alright, here's a JSON5 text:

  {
    // start
    x: "y",
    // the following is foo
    foo: "hey",
      // the following is bar
    bar: "there"
    // end
  }
When parsing it, should I attach each comment to the following name or the previous name?

If I reformat to change indentation, should I change the indentation of the comments too? How would the comment that reads "the following is bar" be re-indented?

What if there are duplicate names? JSON does allow duplicates. If I attach comments to preceding (or following) names, then while parsing I find a dup... what should I do with the preceding comment and the new name?


You make the comment a child node of the object literal node, don’t associate it with the following key/value pair at all. When you pretty print it, it will have the same relative order with the following kv pair as it had originally (and you cannot reorder json keys safely because the handling of duplicate keys is implementation-defined).

As far as indentation goes, if I were writing a pretty printer, I’d make indentation handled by the object/array node not by the comment/kv pair node. So that doesn’t matter: the comment will start printing at whatever the current column is just like the kv pairs would.

Anyways, among other tools, prettier has already solved this problem.


> You make the comment a child node of the object literal node, don’t associate it with the following key/value pair at all. When you pretty print it, it will have the same relative order with the following kv pair as it had originally (and you cannot reorder json keys safely because the handling of duplicate keys is implementation-defined).

Many implementations parse objects into hash tables and lose duplicates and even order of appearance of name/value pairs, and all of that is allowed. Such implementations will not be able to keep relative order of comments.


Don’t use that sort of implementation for a pretty printer is my point. I’m not saying you can pretty print json that has been parsed by a generic parser, but that you can write a parser that can pretty print arbitrary json.


But pretty-printing is not all that one does with JSON. If that were all one ever did with JSON then JSON wouldn't exist.


The original reason is that they didn’t want people to use comments for processing information.


> > Strings may be single quoted.

Hmmm... It's super convenient that SQL uses single quotes for string literals while JSON uses double quotes. Changing that is going to cause pain.

> > Strings may span multiple lines by escaping new line characters.

I really can't recommend this. Yeah it's annoying to have to write \n, but still.


> Hmmm... It's super convenient that SQL uses single quotes for string literals while JSON uses double quotes. Changing that is going to cause pain.

Surely you're not constructing queries by concatenating JSON to text?

> I really can't recommend this. Yeah it's annoying to have to write \n, but still.

I can only disagree, the ability to just put newlines in a string literal in languages like rust is refreshing.


> Surely you're not constructing queries by concatenating JSON to text?

Of course not, but I do have code that generates SQL and which uses JSON. (And no, the code in question is not subject to SQL injection.)


I fail to see how this is relevant. Your code feeds JSON to SQL; that remains supported. Or your code fetches SQL-generated values from SQL, in which case, as literally stated right before the JSON5 listing:

> JSON text generated by [JSON] routines will always be strictly conforming to the canonical definition of JSON.


Does it have "contains" and paths yet?


No need to go Postel about it.


Is there an unavoidable reason why SQLite won't allow foreign keys to be added with ALTER TABLE, only at table creation? The whole "select * into temporary table and create new table with FK constraint" seems so verbose / convoluted, my layman's mind can't comprehend why one can't simply

    ALTER TABLE "foo" ADD FOREIGN KEY...
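For reference, the verbose dance in question looks roughly like this (made-up columns):

    CREATE TABLE "foo_new" (
      id     INTEGER PRIMARY KEY,
      bar_id INTEGER REFERENCES "bar"(id)
    );
    INSERT INTO "foo_new" SELECT id, bar_id FROM "foo";
    DROP TABLE "foo";
    ALTER TABLE "foo_new" RENAME TO "foo";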


The alter table docs have a general explanation kinda.

They don't store the parsed representation of the schema, since that would lock the table to a specific version. They store the create command itself, and each time you open a DB the schema is generated from that. Allows for flexible upgrades to the schema system without the need for migrations, and lets you use the same DB file with multiple sqlite versions.

The downside is that ALTER TABLE commands are just edits to the CREATE TABLE string, which, when accounting for all the different versions, is fairly difficult and risky, so alter operations are limited.
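You can see this directly, since the schema really is just the stored CREATE text:

    SELECT sql FROM sqlite_schema WHERE name = 'foo';
    -- returns the original CREATE TABLE statement; ALTER TABLE rewrites it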


Is there something special about ALTER DDLs that makes it impossible to store them and play them back along with CREATE DDLs to derive the final schema?


Exactly... there's no real apparent reason this can't be done


But also FKs remain half-assed and seen with some disdain: it's 2023, support was added (a bit under) 14 years ago, and you still need to activate foreign keys every time you open the database file. There is still no way to enable FKs by default for a database file, let alone prevent disabling it.
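i.e. every single connection has to remember to run

    PRAGMA foreign_keys = ON;  -- per-connection, off by default

or FK constraints silently aren't enforced.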


No, you can do it by defining SQLITE_DEFAULT_FOREIGN_KEYS=1 while compiling sqlite3.


”You can set it before compiling” is such a foreign concept to the real world that it’s almost meaningless; if that is the only way to change the behavior, it’s pretty crazy.


I generally agree when it comes to standalone software, but not for libraries.

SQLite is both.

Setting a compiler flag is reasonable for the common case where it is included as a header file in a C/C++ application.

It is certainly inconvenient when you design with that feature in mind, but can't rely on it being part of the standard binary distributions.


Many applications use SQLite as an embedded library; that's how it's intended to be used anyway.

It doesn't really jibe all that well with things like Python though, where you typically don't do that. But it's certainly not a "foreign concept" or "almost meaningless".

For e.g. Python you can still just set the desired parameters when connecting, which should usually be handled by the database connector/driver for you.


Maybe this comment is worthless and has no information content, but SQLite is in the pantheon of open source projects, along with linux itself.


How are people using sqlite within a multi-threaded, asynchronous runtime: are you using a synchronization lock? SQLite seems to "kinda, sorta" support multi-threading, requiring special configuration and caveats. Is there a good reference?


Create a connection per task. WAL is probably a good idea.

Even using SERIALIZED mode, sqlite has multiple APIs which are completely broken if two clients touch the same connection (https://github.com/rusqlite/rusqlite/issues/342#issuecomment...).

Don't bother with serialized mode; just don't share connections between threads, and use the regular multi-thread mode (do use that, though).

You can use a connection pool and move connections from task to task (and thread to thread), just not use connections concurrently.


1. Use WAL mode – this allows reading while practically never being blocked by writes.

2. Use application-level locks (in addition to the SQLite locking system). The queuing behavior from application-level locks can work better than the retry mechanism provided by SQLite.

3. Use connection pooling with a separate connection per thread – one write connection and multiple read connections.
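The pragma side of this is small; a minimal sketch (pairing synchronous = NORMAL with WAL is a common choice, assumed here):

    PRAGMA journal_mode = WAL;    -- persistent: stays set for the database file
    PRAGMA synchronous = NORMAL;  -- per-connection; common pairing with WAL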

Ideally, all of the above would be covered by a library, and not up to the app developer.

I wrote a blog post covering some of this recently: https://www.powersync.co/blog/sqlite-optimizations-for-ultra...


Be careful of WAL mode. There are specific limitations.

https://www.vldb.org/pvldb/vol15/p3535-gaffney.pdf

“To accelerate searching the WAL, SQLite creates a WAL index in shared memory. This improves the performance of read transactions, but the use of shared memory requires that all readers must be on the same machine [and OS instance]. Thus, WAL mode does not work on a network filesystem.”

“It is not possible to change the page size after entering WAL mode.”

“In addition, WAL mode comes with the added complexity of checkpoint operations and additional files to store the WAL and the WAL index.”

https://www.sqlite.org/lang_attach.html

SQLite does not guarantee ACID consistency with ATTACH DATABASE in WAL mode.

“Transactions involving multiple attached databases are atomic, assuming that the main database is not ":memory:" and the journal_mode is not WAL. If the main database is ":memory:" or if the journal_mode is WAL, then transactions continue to be atomic within each individual database file. But if the host computer crashes in the middle of a COMMIT where two or more database files are updated, some of those files might get the changes where others might not.”


Yeah, the WAL can also grow very large under a wrong checkpoint configuration, making read operations increasingly slow, which in turn makes checkpointing slower, and it grows out of proportion.

Inefficient queries, no gaps between readers, and a bad manual checkpointing system can bring the whole thing down.


Does WAL2 mitigate the performance reduction?

https://sqlite.org/cgi/src/doc/wal2/doc/wal2.md


That's neat. I haven't worked with SQLite for a few years, but this changes things. Of course you can still fall into the trap of reaching your checkpoints faster than the first WAL file can be written back to the database, but I think that's a less critical problem in practice.


Two methods:

1. Queue up writes to a single connection in a given thread.

2. Retry the write after a short sleep when you get a SQLITE_BUSY error. SQLite will do this for you, see busy_timeout docs.

WAL is single-writer, multiple-reader, so if you are doing SELECT queries they should not return SQLITE_BUSY.
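The retry knob from method 2 is a single pragma (value in milliseconds; 5000 is just an example):

    PRAGMA busy_timeout = 5000;  -- retry for up to 5s before returning SQLITE_BUSY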


https://stackoverflow.com/questions/25319031/what-is-the-def...

Just to clarify, busy_timeout = 0 by default


Each thread opens its own connection. Set a non-zero timeout. I think that's all you need.


Wat:

> Note that SQLite interprets NaN, QNaN, and SNaN as just alternative spellings for "null"


Wait! Can I store unicode escaped binary strings in JSON now?



