Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
SQLite's Automatic Indexes (misfra.me)
216 points by preetamjinka on Dec 28, 2022 | hide | past | favorite | 40 comments


Related: https://sqlite.org/cli.html#index_recommendations_sqlite_exp...

There's an experimental module and CLI command in the normal shell to use this logic to dump out a suggested list of indexes to create for a given SQL command to speed it up.


I'd like to see this in Postgres and the like.


Definitely would be nice in pg. I think it exists for sql server (likely also oracle), iirc the db can even be configured to optimistically create the index to try it out, and revert it if it’s bad.

One of the difficulties is indexes have a write cost so heuristically deciding on the trade off can be complicated.


I was not aware that SQLite creates temporary indexes in places where other databases would perform merge joins.

https://sqlite.org/tempfiles.html#transient_indices

SQLite also creates background indexes for ROWID, of which I was aware.

https://sqlite.org/rowidtable.html


SQLite uses a b-tree storage format for tables, so all tables are an index, similar to clustered indexs in MSSQL Server and Index Organized Tables in Oracle. The RowID is the true primary key unless you specify the table should use the actual primary key as the row identifier, this is legacy quirk of Sqlite.

Either way SQLite doesn't have heap tables like PG (which only has heap tables), all tables are clustered indexes.


Clustered indexes in MSSQL Server, Index Organized Tables in Oracle, and any table in MySQL using InnoDB as storage engine.


InnoDB is similar to SQLite all tables are clustered indexes, if you don't have suitable primary key it creates a hidden one similar to rowid in SQLite.

MSSQL and Oracle give you a choice for table storage to either have unordered heaps or ordered b-trees, depending on whether a clustered index is defined.

PG only has unordered heaps, indexes are always secondary data structures. PG could really use true clustered indexes as an option.


This is not true in my experience at all. You have to create indexes.

Also, a btree is created for rowid, according to these docs.

https://www.sqlite.org/withoutrowid.html#:~:text=In%20an%20o....


> This is not true in my experience at all. You have to create indexes.

Nobody said you don’t have to create other indexes.

> Also, a btree is created for rowid, according to these docs.

As the GP says that btree is the table itself, that’s what a clustered table is.

This is the bit that reveal this information:

> As an ordinary SQLite table, "wordcount" is implemented as two separate B-Trees. The main table uses the hidden rowid value as the key and stores the "word" and "cnt" columns as data.

So one of the btrees is “the main table”, of which rowid is the key.


What's not true? All tables in SQLite are b-tree indexes, the unique key is either the rowid or the primary key in "WITHOUT ROWID" tables. SQLite does not have unordered heaps:

https://www.sqlite.org/lang_createtable.html#rowid


I see what you're saying. Just the wording you're using is confusing IMO. The tables themselves aren't btree indexes, there isn't any sort of "clustered" index on all columns. It's just that the tables rows are entries in a btree, indexed by rowid.


> The tables themselves aren't btree indexes, there isn't any sort of "clustered" index on all columns.

What do you think a btree index is exactly? Especially a covering index?

> It's just that the tables rows are entries in a btree, indexed by rowid.

That’s what a clustered index (/ table) is. It’s when the table's storage is organised by the chosen clustering key. In SQLite, all tables are clustered, and the clustering key is the ROWID, or the primary key for tables WITHOUT ROWID. You can think of it as the table being its own covering index (keyed on the clustering key, and INCLUDE-ing all other columns).

By opposition a database which uses heap tables (like postgres) needs a separate index for its PK (it also needs one to enforce unicity so that's two birds one stone).


I see. When I think of a covering index, I think of a covered query, where all query data can be fetched from the index itself. Doesn't the rowid btree just point to an offset on disk where the actual row is? Or is the row itself in the index?


> Doesn't the rowid btree just point to an offset on disk where the actual row is? Or is the row itself in the index?

The value of the rowid btree is the row.


oh, nice! thx for the responses


>The tables themselves aren't btree indexes, there isn't any sort of "clustered" index on all columns. It's just that the tables rows are entries in a btree, indexed by rowid.

That's what a clustered index is, the table is the index, the primary key is the key all the rest of the columns are in the value rather than the value being a row id that needs a second lookup to get to the data.

https://en.wikipedia.org/wiki/Database_index#Clustered

You can still have secondary indexes that point to the key of the clustered index.

SQLite is quirky in that even if you define a primary key, it has a hidden actual primary key called the rowid, unless you define it "WITHOUT ROWID" which was added later, this is what they say in the docs:

"WITHOUT ROWID is found only in SQLite and is not compatible with any other SQL database engine, as far as we know. In an elegant system, all tables would behave as WITHOUT ROWID tables even without the WITHOUT ROWID keyword. However, when SQLite was first designed, it used only integer rowids for row keys to simplify the implementation. This approach worked well for many years. But as the demands on SQLite grew, the need for tables in which the PRIMARY KEY really did correspond to the underlying row key grew more acute. The WITHOUT ROWID concept was added in order to meet that need without breaking backwards compatibility with the billions of SQLite databases already in use at the time (circa 2013)."

https://www.sqlite.org/withoutrowid.html#quirks


Yeah I know about rowid, I've used (exploited?) :) that a few times.


rowid isn't really a background index, but rather the normal storage layout, with non-integer primary keys having a separate index structure.


would be cool to get a diagnostic message when that happens


Per the linked mailing list post there’s no dust but there is a back channel through which you can be notified:

> SQLite comes with instrumentation (specifically the SQLITE_STMTSTATUS_AUTOINDEX verb for sqlite3_stmt_status() - http://www.sqlite.org/c3ref/stmt_status.html) that can be used to detect when automatic indices are used and alert the developer through a back channel to this fact so that she can fix the problem with an appropriate CREATE INDEX. In other words, SQLite provides you with the tools to help you detect and eliminate the use of automatic indices.


For transient indexes, or permanent ones?

The transient indexes will be created in a separate, temporary file. They will be erased on a commit, rollback, or crash recovery. They will not persist.


I think EXPLAIN is a little known feature that gives a lot of insight about database internals, especially in Sqlite.

Funny you posted this now, I just recently researched Sqlite internals for my own project. Maybe you will find it interesting: https://twitter.com/high_byte/status/1607853384123703296

Interesting blog, I love that you keep consistency writing for so long.


If EXPLAIN is little known then we have failed as educators, mentors and peers.

It does amaze me the gaps that people have, though.


My gut feeling is that it's fairly well known, but few people use it. Including yours truly.

Then again, I have rarely run into performance issues with database engines. Either the ones I've had contact with (SQLite, MSSQL, Postgres) are very good at what they do (which I suspect they are), or my requirements were too puny for them to break a sweat (which I also suspect to be the case).

I did use MSSQL Studio's query analyzer once to fix a performance issue and was able to get the query in question from nearly an hour down to under a second. triumphant grin I don't remember the details of the query, but the solution was to use a hash join.


I'm glad you said it because I was about to say the exact same thing.


Cool details. Makes sense, since a database has to have index creation anyway, and if you're avoiding merge joins for code complexity reasons, why not use it to optimize.

An interesting project to manage indexes on postgres that seems tangentially related:

https://github.com/ankane/dexter


As others already noted, this is equivalent to asymmetric hash join, except less efficient. Constructing a hash table on the fly instead of a B-tree would almost certainly be faster to both build and query. You don’t need an ordered lookup structure for join indexes (unless it’s a merge join, which SQLite doesn’t do anyway).


That was interesting to read. RavenDB does something similar, if there isn't an index for the query, it will create one for you. The difference is that this is a persistent structure, which is reused across multiple queries and invocations.


RavenDB is interesting. Neat that it's in C#.


It might make sense to flag in the metadata that an index might be needed; then again someone using SQLite will probably know very well which queries are being executed.

Also, most of the time joins are made on foreign keys and these should always be indexed. So I guess this should be a rare occurrence if the database is properly thought-out.


I keep seeing posts like these, and I'm wondering, whether "lite" still applies. It feels like all it is missing is the network layer and it would have more features than original MySQL.


It’s not “lite”, it’s “ite”. SQLite, as if it were a mineral, like Azurite.


That's not really it is it?

I thought they just dropped the extra L because it made more sense without it.


The author of SQLite suggested that the name is meant to sound like a mineral:

https://changelog.com/podcast/201


The "lite" part is about being embedded, server-less, and single-file (at rest), it's not about being shitty, DBM did (and still does) that fine.

Furthermore this is exactly in furtherance of your stated goals: sqlite supports indexes, by using automatic indexes it avoids the need to implement merge joins and hash joins.


EDIT: I misread. Nevermind!


"Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something."

https://news.ycombinator.com/newsguidelines.html

This could become a (very) good comment if you added enough explanation for people to understand what the issues are. But just posting a putdown isn't in the intended spirit of the site. If you know more than others, e.g. about nested loops in databases, that's great - but in that case please share some of what you know, so the rest of us can learn.

https://hn.algolia.com/?dateRange=all&page=0&prefix=true&sor...


Maybe I am missing something but I think O(N * N) is the correct complexity for a nested loop implementation. If you believe that is wrong, can you explain a bit more?


Well that assumes the same cardinality for both sides of the join, which may be far from the case, especially if selections can be pushed down to scans.


Pretty sure nested loops are O(^2).




Consider applying for YC's Winter 2026 batch! Applications are open till Nov 10

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: