SQLite's Automatic Indexes

banana_giraffe · on Dec 28, 2022

Related: https://sqlite.org/cli.html#index_recommendations_sqlite_exp...

There's an experimental module and CLI command in the normal shell to use this logic to dump out a suggested list of indexes to create for a given SQL command to speed it up.

parentheses · on Dec 28, 2022

I'd like to see this in Postgres and the like.

masklinn · on Dec 28, 2022

Definitely would be nice in pg. I think it exists for sql server (likely also oracle), iirc the db can even be configured to optimistically create the index to try it out, and revert it if it’s bad.

One of the difficulties is indexes have a write cost so heuristically deciding on the trade off can be complicated.

chasil · on Dec 28, 2022

I was not aware that SQLite creates temporary indexes in places where other databases would perform merge joins.

https://sqlite.org/tempfiles.html#transient_indices

SQLite also creates background indexes for ROWID, of which I was aware.

https://sqlite.org/rowidtable.html

SigmundA · on Dec 28, 2022

SQLite uses a b-tree storage format for tables, so all tables are an index, similar to clustered indexs in MSSQL Server and Index Organized Tables in Oracle. The RowID is the true primary key unless you specify the table should use the actual primary key as the row identifier, this is legacy quirk of Sqlite.

Either way SQLite doesn't have heap tables like PG (which only has heap tables), all tables are clustered indexes.

Shorel · on Dec 28, 2022

Clustered indexes in MSSQL Server, Index Organized Tables in Oracle, and any table in MySQL using InnoDB as storage engine.

SigmundA · on Dec 28, 2022

InnoDB is similar to SQLite all tables are clustered indexes, if you don't have suitable primary key it creates a hidden one similar to rowid in SQLite.

MSSQL and Oracle give you a choice for table storage to either have unordered heaps or ordered b-trees, depending on whether a clustered index is defined.

PG only has unordered heaps, indexes are always secondary data structures. PG could really use true clustered indexes as an option.

winrid · on Dec 28, 2022

This is not true in my experience at all. You have to create indexes.

Also, a btree is created for rowid, according to these docs.

https://www.sqlite.org/withoutrowid.html#:~:text=In%20an%20o....

masklinn · on Dec 28, 2022

> This is not true in my experience at all. You have to create indexes.

Nobody said you don’t have to create other indexes.

> Also, a btree is created for rowid, according to these docs.

As the GP says that btree is the table itself, that’s what a clustered table is.

This is the bit that reveal this information:

> As an ordinary SQLite table, "wordcount" is implemented as two separate B-Trees. The main table uses the hidden rowid value as the key and stores the "word" and "cnt" columns as data.

So one of the btrees is “the main table”, of which rowid is the key.

SigmundA · on Dec 28, 2022

What's not true? All tables in SQLite are b-tree indexes, the unique key is either the rowid or the primary key in "WITHOUT ROWID" tables. SQLite does not have unordered heaps:

https://www.sqlite.org/lang_createtable.html#rowid

winrid · on Dec 28, 2022

I see what you're saying. Just the wording you're using is confusing IMO. The tables themselves aren't btree indexes, there isn't any sort of "clustered" index on all columns. It's just that the tables rows are entries in a btree, indexed by rowid.

masklinn · on Dec 28, 2022

> The tables themselves aren't btree indexes, there isn't any sort of "clustered" index on all columns.

What do you think a btree index is exactly? Especially a covering index?

> It's just that the tables rows are entries in a btree, indexed by rowid.

That’s what a clustered index (/ table) is. It’s when the table's storage is organised by the chosen clustering key. In SQLite, all tables are clustered, and the clustering key is the ROWID, or the primary key for tables WITHOUT ROWID. You can think of it as the table being its own covering index (keyed on the clustering key, and INCLUDE-ing all other columns).

By opposition a database which uses heap tables (like postgres) needs a separate index for its PK (it also needs one to enforce unicity so that's two birds one stone).

winrid · on Dec 28, 2022

I see. When I think of a covering index, I think of a covered query, where all query data can be fetched from the index itself. Doesn't the rowid btree just point to an offset on disk where the actual row is? Or is the row itself in the index?

masklinn · on Dec 28, 2022

> Doesn't the rowid btree just point to an offset on disk where the actual row is? Or is the row itself in the index?

The value of the rowid btree is the row.

winrid · on Dec 28, 2022

oh, nice! thx for the responses

SigmundA · on Dec 28, 2022

>The tables themselves aren't btree indexes, there isn't any sort of "clustered" index on all columns. It's just that the tables rows are entries in a btree, indexed by rowid.

That's what a clustered index is, the table is the index, the primary key is the key all the rest of the columns are in the value rather than the value being a row id that needs a second lookup to get to the data.

https://en.wikipedia.org/wiki/Database_index#Clustered

You can still have secondary indexes that point to the key of the clustered index.

SQLite is quirky in that even if you define a primary key, it has a hidden actual primary key called the rowid, unless you define it "WITHOUT ROWID" which was added later, this is what they say in the docs:

"WITHOUT ROWID is found only in SQLite and is not compatible with any other SQL database engine, as far as we know. In an elegant system, all tables would behave as WITHOUT ROWID tables even without the WITHOUT ROWID keyword. However, when SQLite was first designed, it used only integer rowids for row keys to simplify the implementation. This approach worked well for many years. But as the demands on SQLite grew, the need for tables in which the PRIMARY KEY really did correspond to the underlying row key grew more acute. The WITHOUT ROWID concept was added in order to meet that need without breaking backwards compatibility with the billions of SQLite databases already in use at the time (circa 2013)."

https://www.sqlite.org/withoutrowid.html#quirks

winrid · on Dec 28, 2022

Yeah I know about rowid, I've used (exploited?) :) that a few times.

Scaevolus · on Dec 28, 2022

rowid isn't really a background index, but rather the normal storage layout, with non-integer primary keys having a separate index structure.

yread · on Dec 28, 2022

would be cool to get a diagnostic message when that happens

masklinn · on Dec 29, 2022

Per the linked mailing list post there’s no dust but there is a back channel through which you can be notified:

> SQLite comes with instrumentation (specifically the SQLITE_STMTSTATUS_AUTOINDEX verb for sqlite3_stmt_status() - http://www.sqlite.org/c3ref/stmt_status.html) that can be used to detect when automatic indices are used and alert the developer through a back channel to this fact so that she can fix the problem with an appropriate CREATE INDEX. In other words, SQLite provides you with the tools to help you detect and eliminate the use of automatic indices.

chasil · on Dec 28, 2022

For transient indexes, or permanent ones?

The transient indexes will be created in a separate, temporary file. They will be erased on a commit, rollback, or crash recovery. They will not persist.

high_byte · on Dec 28, 2022

I think EXPLAIN is a little known feature that gives a lot of insight about database internals, especially in Sqlite.

Funny you posted this now, I just recently researched Sqlite internals for my own project. Maybe you will find it interesting: https://twitter.com/high_byte/status/1607853384123703296

Interesting blog, I love that you keep consistency writing for so long.

hinkley · on Dec 28, 2022

If EXPLAIN is little known then we have failed as educators, mentors and peers.

It does amaze me the gaps that people have, though.

krylon · on Dec 29, 2022

My gut feeling is that it's fairly well known, but few people use it. Including yours truly.

Then again, I have rarely run into performance issues with database engines. Either the ones I've had contact with (SQLite, MSSQL, Postgres) are very good at what they do (which I suspect they are), or my requirements were too puny for them to break a sweat (which I also suspect to be the case).

I did use MSSQL Studio's query analyzer once to fix a performance issue and was able to get the query in question from nearly an hour down to under a second. triumphant grin I don't remember the details of the query, but the solution was to use a hash join.

haggy102 · on Dec 29, 2022

I'm glad you said it because I was about to say the exact same thing.

jaggederest · on Dec 28, 2022

Cool details. Makes sense, since a database has to have index creation anyway, and if you're avoiding merge joins for code complexity reasons, why not use it to optimize.

An interesting project to manage indexes on postgres that seems tangentially related:

https://github.com/ankane/dexter

senderista · on Dec 29, 2022

As others already noted, this is equivalent to asymmetric hash join, except less efficient. Constructing a hash table on the fly instead of a B-tree would almost certainly be faster to both build and query. You don’t need an ordered lookup structure for join indexes (unless it’s a merge join, which SQLite doesn’t do anyway).

ayende · on Dec 28, 2022

That was interesting to read. RavenDB does something similar, if there isn't an index for the query, it will create one for you. The difference is that this is a persistent structure, which is reused across multiple queries and invocations.

winrid · on Dec 28, 2022

RavenDB is interesting. Neat that it's in C#.

forinti · on Dec 28, 2022

It might make sense to flag in the metadata that an index might be needed; then again someone using SQLite will probably know very well which queries are being executed.

Also, most of the time joins are made on foreign keys and these should always be indexed. So I guess this should be a rare occurrence if the database is properly thought-out.

takeda · on Dec 28, 2022

I keep seeing posts like these, and I'm wondering, whether "lite" still applies. It feels like all it is missing is the network layer and it would have more features than original MySQL.

samkon · on Dec 28, 2022

It’s not “lite”, it’s “ite”. SQLite, as if it were a mineral, like Azurite.

ckvjdonee · on Dec 29, 2022

That's not really it is it?

I thought they just dropped the extra L because it made more sense without it.

samkon · on Jan 1, 2023

The author of SQLite suggested that the name is meant to sound like a mineral:

https://changelog.com/podcast/201

masklinn · on Dec 28, 2022

The "lite" part is about being embedded, server-less, and single-file (at rest), it's not about being shitty, DBM did (and still does) that fine.

Furthermore this is exactly in furtherance of your stated goals: sqlite supports indexes, by using automatic indexes it avoids the need to implement merge joins and hash joins.

johndough · on Dec 28, 2022

EDIT: I misread. Nevermind!

dang · on Dec 28, 2022

"Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something."

https://news.ycombinator.com/newsguidelines.html

This could become a (very) good comment if you added enough explanation for people to understand what the issues are. But just posting a putdown isn't in the intended spirit of the site. If you know more than others, e.g. about nested loops in databases, that's great - but in that case please share some of what you know, so the rest of us can learn.

https://hn.algolia.com/?dateRange=all&page=0&prefix=true&sor...

SCLeo · on Dec 28, 2022

Maybe I am missing something but I think O(N * N) is the correct complexity for a nested loop implementation. If you believe that is wrong, can you explain a bit more?

senderista · on Dec 29, 2022

Well that assumes the same cardinality for both sides of the join, which may be far from the case, especially if selections can be pushed down to scans.

0x457 · on Dec 28, 2022

Pretty sure nested loops are O(^2).