I've been using the JSON1 extension for some time now (in production projects) and it's truly remarkable.
I usually just dump the JSON-response data from an API to a "raw_data" table (typically one "updated_at" column and a second "json_data" one).
At that point you can normalize your schema further, but only if you really have to!
That's because, performance-wise, you can get away with a NoSQL-like denormalized schema by carefully defining indexes on expressions.
You can also partially normalize it with views (SQLite doesn't support materialized views).
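A rough sketch of that setup (the table, column names, and JSON paths here are purely illustrative):

create table raw_data (updated_at text, json_data text);
create index idx_raw_user_id on raw_data (json_extract(json_data, '$.user.id'));
create view users as
    select json_extract(json_data, '$.user.id')   as user_id,
           json_extract(json_data, '$.user.name') as user_name
    from raw_data;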
And of course it's almost always faster to query data where it already lives (via SQL) instead of fetching it from disk and querying it in, say, Python; you pay too much I/O cost. (Yes, my dear aspiring data scientist, do not load everything into a huge DataFrame, go learn yourself some SQL :-) )
The JSON dump is stored as text, not in a binary format, but honestly I haven't seen this be a problem; plus you can easily run queries at the CLI and pipe the output to jq, sed, etc.
If your application is data-warehouse-like and read-heavy (for example an internal reporting dashboard), I can't see any reason why you should pay the cost of setting up a Postgres or MongoDB instance (although I do love both).
It is true that SQLite does not support concurrent writes, but (and that's a big BUT) if you carefully open connections only when you need them and use prepared statements, I can't see how you could run into problems with modern SSD hardware (unless you're Google-scale, of course).
> It is true that SQLite does not support concurrent writes, but (and that's a big BUT) if you carefully open connections only when you need them and use prepared statements, I can't see how you could run into problems with modern SSD hardware (unless you're Google-scale, of course).
WAL mode is your friend. (Various SQLite drivers, including the Python one, are however somewhat buggy in their transaction handling and need workarounds; essentially they delay the BEGIN of a transaction until you issue DML statements, which obviously breaks snapshot isolation entirely.)
WAL mode allows one writer at a time without impeding readers.
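Enabling it is a one-liner, and the journal mode is persisted in the database file, so it only needs to be run once:

pragma journal_mode=WAL;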
I see your point, but I'm very hesitant to change any configuration of SQLite. It kinda feels like one step too close to doing devops, which is what one wants to avoid by using SQLite, I guess.
Having said that, I do play around with PRAGMA statements when it's really needed, but usually tweaking the code works fine - even increasing the timeout is probably enough :D
Often people use SQLite for sheer convenience rather than for perceived performance or resource efficiency, i.e. because SQLite is zero-config and works out of the box. Sure, spinning up a Docker container with PostgreSQL is easy enough, but why do that when you can use the standard library with SQLite already embedded and linked in?
> It’s like saying specifying the file directory SQLite uses is config.
Or, to continue the fopen analogy, like saying that the file you've given fopen is to be opened with mode "a+". The only real difference is that fopen's mode argument is required whereas SQLite's mode has a useful default.
> Enabling WAL can hardly be called config, it’s a one liner in every driver I’ve ever seen.
Even in the raw libsqlite3 C API (as long as you don't need any error checking ;) ):
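(something like the following, with db being an already-open sqlite3 handle)

sqlite3_exec(db, "PRAGMA journal_mode=WAL;", NULL, NULL, NULL);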
> (Yes, my dear aspiring data scientist, do not load everything into a huge DataFrame, go learn yourself some SQL :-) )
This hits a little bit too close to home for me. I am quite proficient at writing performant SQL queries and recently started using Pandas. I find the data frame abstraction better for certain data manipulation tasks. Assuming there is enough RAM available, is it still better to offload everything to the database engine?
As usual: it depends...
If you're doing prototyping and working in a Jupyter notebook, sure, go ahead and work at the Pandas level.
Once you're done with prototyping, however, and have settled on a "final_df" (I bet you have something like that in your last notebook cells), maybe you should think about transforming some of those "columns" into SQL queries, which are VCS-able, sometimes faster, and, most importantly, usable by other people too. Instead of 10 people loading 10 different DFs, you can have 10 people querying the same table/view.
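For example, one of those derived "columns" might just become a view (table and column names below are invented):

create view final_features as
    select user_id,
           price * quantity as revenue,
           date(created_at) as created_day
    from orders;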
> (Yes, my dear aspiring data scientist, do not load everything into a huge DataFrame, go learn yourself some SQL :-) )
You really don't even need to know SQL anymore to keep things out of memory. In R, the dplyr/dbplyr package has SQL translations so you can utilize the exact same syntax as you would on in-memory data frames and it will execute as SQL using the database as a backend.
Not saying people shouldn't learn SQL regardless, but even that shouldn't be an excuse for doing everything in-memory these days.
To check if you already have it installed, run "pragma compile_options;" and look for ENABLE_JSON1. On my Mac it was available both in the sqlite3 CLI and in Python :-)
import sqlite3
con = sqlite3.connect(':memory:')
con.enable_load_extension(True)
a = con.execute("pragma compile_options;")
for i in a: print(i)  # check for ENABLE_JSON1
I recently wrote an SQLite extension that lets me query stored Protobuf messages. I was inspired by the JSON extension. Adding virtual tables to SQLite with an extension is tricky but kind of magical when you finally have the full query language to play with.
I don’t, sorry. I did not build this with performance in mind, but for querying packet captures of a protocol built on Protobuf.
Any query is going to require deserializing every row in the table (JSON and Protobuf alike) which is basically a non-starter for any project with performance requirements. And I believe that Protobuf messages are decoded as a whole, rather than just the desired field, which is going to be slower.
> Any query is going to require deserializing every row in the table (JSON and Protobuf alike) which is basically a non-starter for any project with performance requirements.
Apparently you can create indexes on the json functions, so I assume on your proto ones too.
> I believe that Protobuf messages are decoded as a whole, rather than just the desired field, which is going to be slower.
With the official library, yes, but it's an implementation/API choice. The wire format [1] can be skimmed fairly efficiently. Most significantly, message/byte/string fields all are written as tag number, length, data. So if there are submessages (entire trees) you don't care about, you can just skip over them. I've seen custom proto decoding things do this.
I think the most efficient approach would be to parse the path once per SQL query into a tag number-based path. And then have the per-row function just follow those with custom decoding logic. Unfortunately from my quick skim, it looks like SQLite's extension API doesn't really support this. You'd want it to build some context object for a given path/proto, then call extract with it a bunch of times, then tear it down. Still, I suppose you could do a LRU cache of these or something.
btw, I see your code is compiling a regex [edit: originally wrote proto by mistake] in the per-row path. [2] I haven't profiled, but I'd bet that's slowing you down a fair bit. You could just make it a 'static const std::regex* kPathElementRegexp = new std::regex("...")' to avoid this. (static initialization is thread-safe in C++. The heap allocation is because it's good practice to ensure non-POD globals are never destructed. Alternatively, there's absl::NoDestructor for this.)
I can already use JSON in SQLite by parsing it in the client, outside of the SQLite API. Any examples of where I'd want to use this instead? I'm guessing for WHERE clauses in queries, perhaps?
Can I create an index on a field within a JSON tuple?
What I have found with any relational database that has this kind of JSON storage option (MySQL, PostgreSQL) is that you can take some kind of data that arrives via a 3rd party as JSON and just dump it straight into the DB as JSON, as opposed to having to create a schema for it.
It's nice to have the option for the data to still be queryable without having to make it a first class schema with all schema setup involved.
It isn't for 1st-class data that is used frequently. But when you might only need the data on occasion, it's nice to have productivity-wise.
So you pull out the fields you need frequently but keep the full JSON so you can dig into the details of the response on an ad-hoc basis. Not very efficient, as you can end up storing a big blob of JSON, but you're not throwing away any data when you parse the response and your schema stays very manageable. It also gives you the flexibility to add new fields to the schema later by pulling the data out of the JSON.
I use this pattern a lot, too. Just make sure not to `SELECT *`, which could be much slower and waste a lot of RAM in some cases (ORMs using e.g. active record pattern are often susceptible to this when used naively).
I've also done something similar where differing systems have different extra data, but the common bits are, well, common. For example, supporting direct CC, PayPal and Amazon Pay: put the common parts of the transaction in separate columns, with the full details in a JSON column.
It's nice for being stage 1 of an import: pull the JSON in and use queries to populate other columns/tables with the extracted portions you're working with, since you have the full power of a SQL database for many common cleanup tasks.
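A rough sketch of that staging step (table names and JSON paths are made up):

create table payments_raw (received_at text, payload text);
create table payments (id integer primary key, amount real, provider text, payload text);

insert into payments (amount, provider, payload)
    select json_extract(payload, '$.amount'),
           json_extract(payload, '$.provider'),
           payload
    from payments_raw;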
The JSON module is really a godsend! It allows you to do things that would otherwise be extremely painful, difficult, or not ergonomic.
Just to take a few examples, here (http://redbeardlab.tech/rediSQL/blog/JaaS/) is a simple way to store JSON and do manipulation on it; it would have been impossible without the JSON module.
RediSQL looks interesting and I like the pricing structure. I'm working on marketing my own set of products so I know it's difficult to keep it up. Good luck with this venture and I hope it works out for you and your team.
Please get in touch, my email is in my profile! There is a small community of people trying to make a living off open source software.
If your product is somehow usable with RediSQL we could do marketing together or bundle our products; there is a lot of synergy in software that can be enhanced.
Since you can make effectively arbitrary index expressions[1], I'd expect the answer to be "yes". The article doesn't explicitly state if they're deterministic or not, but it seems like some of them could / should be.
I gave it a shot and it appears to work as expected.
sqlite> create table a (id int primary key, j json);
sqlite> insert into a values (1, '{"hello":"world"}');
sqlite> create index idx_a on a (json_extract(j, '$.hello'));
sqlite> explain query plan select * from a where json_extract(j, '$.hello') = 'world';
QUERY PLAN
`--SEARCH TABLE a USING INDEX idx_a (<expr>=?)
sqlite> explain query plan select * from a where json_extract(j, '$.foo') = 'world';
QUERY PLAN
`--SCAN TABLE a
I'm using this in a mobile application (Android and iOS). The data model can change server-side at any time, so the data is stored as JSON to avoid complicated client-side migrations.
Before this I loaded and filtered all the data in memory. Now, with json_extract, I can both index and filter the data using SQLite, which is a massive performance boost.
It's really common to store JSON documents inside rows in normal relational databases.
This can be because, actually, you have a JSON blob that is associated with the row, e.g. I have a script that scrapes some public registries of historic monuments (I have dull hobbies!) and it just stores the responses in a JSON column. It's convenient.
Another way these 'dynamic columns' are used is to flatten one-to-many relationships. For example, I have a database where account managers can add arbitrary tags to customers. The classic approach would be to have a customer table, then a 1:M into a tag table with the key value, and then another M:1 for the key to go from the key id to the key names. Instead, I just have a JSON column with all the key values in it, right there in the customer record. It's convenient!
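And if you ever need to go the other way (tag -> customers), the json_each table-valued function handles it; a rough sketch, assuming the tags live in a "tags" column as a JSON object of key/value pairs (names are made up):

select customer.id
from customer, json_each(customer.tags)
where json_each.key = 'tier' and json_each.value = 'vip';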
Note that for databases that support array columns, this is much nicer to represent with an array. Postgres supports them, for example. (SQLite doesn't)
One thing I like is using it when getting data from multiple tables.
Let's say you have a product with multiple labels: usually you'll get one line per (product, label) tuple, so you'll have to do some work application-side if you want to get one [product, labels] object per product.
With the JSON aggregate functions you can get a (product, JSON array of labels) tuple per product and just have to do a json_decode in your application code.
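A rough sketch with json_group_array (table and column names are made up):

select p.id as product_id,
       json_group_array(l.name) as labels
from product p
join product_label pl on pl.product_id = p.id
join label l on l.id = pl.label_id
group by p.id;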
> "Let's say you have a product with multiple labels: usually you'll get one line per (product, label) tuple so you'll have to do some job application side if you what to get one [product, labels] object per product."
What's the payoff of doing [product, labels], particularly if that means a json array of labels? I briefly investigated doing that for my application but I found that it would increase disk space (not really a big problem) but it would also make efficient querying a big chore (e.g. labels->products queries.)
I can see this being a good tradeoff if querying by labels is extremely rare and instead you only ever want to query product->labels. But that's still pretty damn fast with the (product, label) tuple schema.
There is the possibility of storing a binary-encoded json object, which uses far less space than regular json, and since the server is doing the encoding you aren't going to need custom client side libraries to decode it.
There is also the possibility of magic 'compression' by having the server automagically extract common schema elements of the stored data to probably cut the storage size in half again.
Pivots and trees. Long skinny tables to wide sparse tables used to be something we did a lot, now we return a JSON Tree... we were always parsing the sparse table into a tree anyway.
At a glance, this looks like it's compatible with MySQL's JSON functions.
It isn't compatible, for example, with MariaDB's dynamic columns.
I'm not very familiar with the PostgreSQL json functionality, but I think that's subtly different again.
When I use local DBs for testing my JSON code I've used Derby and defined the json_extract() functions etc. myself just to test. With SQLite having compatible functions, people wanting to test MySQL stuff locally will be able to just point their code at an SQLite DB instead. Great!
The bit that seems to be missing is the shorthand for selecting column values; in MySQL, instead of doing SELECT JSON_EXTRACT(col, "$.this.is.ugly[12]"), ... you can just do SELECT col->"$.this.is.ugly[12]", ...
Now what I want to be able to write is SELECT col.this.is.nicer[12], ....
I think there is some 'standard' somewhere that MySQL - and now SQLite - is implementing? The functions and the 'path' syntax are standardized (although with only MySQL and now SQLite supporting them it's perhaps not a big deal). I just can't find any reference to that standard in the MySQL docs, nor in this SQLite doc.
Personally, I dislike the path syntax though! Every time I see an SQL snippet with string paths full of dollar signs it offends my retinas.
According to [1], ISO/IEC 9075:2016 specifies JSON functionality in ISO SQL, though I don't have access to it/didn't buy the standard text. I also don't know whether Postgres etc. is designed to follow the standard, as Postgres' implementation predates it, but would expect sqlite to.
Postgres 11 doesn't follow the standard at all (as per your link). A third party implemented the standard [0], which was submitted on pghackers [1] very late in the Postgres 11 cycle [2] (at 75k gzipped the patch is pretty humongous). A quick googling doesn't show any information pertaining to it being integrated (in part or in full) into pg12, so IDK.
There's a technical report ISO/IEC TR 19075-6:2017 corresponding to the standard, freely available from [1]. It's a much more accessible version of the standard if you're not looking to implement it.
> Backwards compatibility constraints mean that SQLite is only able to store values that are NULL, integers, floating-point numbers, text, and BLOBs. It is not possible to add a sixth "JSON" type.
I'd be interested to know what these constraints are? Does SQLite guarantee that files created with newer SQLite versions are still compatible with older SQLite versions?
> The SQLite database file format is also stable. All releases of SQLite version 3 can read and write database files created by the very first SQLite 3 release (version 3.0.0) going back to 2004-06-18. This is "backwards compatibility". The developers promise to maintain backwards compatibility of the database file format for all future releases of SQLite 3. "Forwards compatibility" means that older releases of SQLite can also read and write databases created by newer releases. SQLite is usually, but not completely forwards compatible.
And yet, "The json1 extension uses the sqlite3_value_subtype() and sqlite3_result_subtype() interfaces that were introduced with SQLite version 3.9.0 (2015-10-14) The json1 extension will not work in earlier versions of SQLite."
This just means you can't use the json1 extension with an old version of the library. But anything you write using it will still be readable with an older version.
Of course, if you do something like creating an index using a json1 expression, that will cause issues if you try to use the database with an older version.
SQLite is used in a LOT of environments, including low-overhead embedded systems. This means it's impractical to add the overhead for those systems. As to desktop and mobile, I absolutely agree, it should be the default in a lot of places.
The documentation specifies that that is a possible end state:
> The "1" at the end of the name for the json1 extension is deliberate. The designers anticipate that there will be future incompatible JSON extensions building upon the lessons learned from json1. Once sufficient experience is gained, some kind of JSON extension might be folded into the SQLite core. For now, JSON support remains an extension.
Especially for things like Python's embedded SQLite. Packaging and loading an extension module cross-platform with an otherwise pure Python script seems difficult.
JSON1 is compiled into every Linux distro's SQLite I tested and is also in Mac's SQLite, as well as the Windows binaries provided by sqlite.org. The only exception I found was the SQLite bundled with Python for Windows, which doesn't seem to have any extensions at all.
Performance-wise this is fast enough that you can do fairly complex un-indexed queries (so full table scans) against tables with a few tens of thousands to hundreds of thousands of rows and have them complete in a couple of tens of milliseconds.
> The json1 extension does not (currently) support a binary encoding of JSON. Experiments have been unable to find a binary encoding that is significantly smaller or faster than a plain text encoding. (The present implementation parses JSON text at over 300 MB/s.) All json1 functions currently throw an error if any of their arguments are BLOBs because BLOBs are reserved for a future enhancement in which BLOBs will store the binary encoding for JSON.