Many small queries are efficient in SQLite (sqlite.org)
542 points by ArtTimeInvestor on Feb 16, 2021 | 342 comments



If you haven't tried SQLite, please do. For years I ignored SQLite and used MySQL (it does the job), but once you see how fast SQLite is, and the advantages of having a DB contained in a single file... just go play around with SQLite instead of ignoring it for years like me. It's neat.


Counterpoint: I over-used SQLite because it was the first database I encountered and spent waaaay longer working around its shortcomings than I eventually spent porting to postgres.

Long version: I couldn't get bulk insert performance above absolutely miserable levels. I tried tricks like deleting and recreating indices but without luck. The perf tooling wasn't there to quickly figure out where the problem was (this was 10 years ago, not sure if things have improved) so I wound up building a version of SQLite with debug symbols and profiling it with a C profiler. The problem turned out to be a default setting that made spill-to-disk very aggressive and basically guaranteed that any workflow like mine would grind along with miserable slowness and no outward indication of what to do about it. I found an email thread where someone in effectively the same situation made some constructive suggestions and got turned away on the principle that even casual users ought to just know performance knobs like this one. Yikes. I am probably munging some of the details, but it made me angry enough to learn postgres and port my code over despite having a fix for my immediate problem.


Like everyone else, I'm going to offer tuning tips even though that very much wasn't your point :-)

What you are describing sounds almost exactly like PRAGMA synchronous = FULL [1] (which is the default). That pragma controls when fsync occurs. Depending on your application, you might have got away with NORMAL or even OFF. Again depending on your application, you could have set journal_mode = OFF or increased the mmap_size (both also discussed on that page). Yes, those are fairly magical hacks, but synchronous and journal_mode at least are things that are always worth considering for a new SQLite database (and before building it with debugging symbols!).

Even without those tweaks, the really key thing is to use fairly large transactions. I'm surprised that alone wouldn't have got you decent performance.

One option you didn't have at the time but might help today is write-ahead logging [2] with journal_mode = WAL (though still presumably not as fast as journal_mode = OFF!). I believe the only reason it isn't enabled by default is backwards compatibility. According to that article, it was introduced on 2010-07-21 and improved to better handle large transactions (>100MB) on 2016-02-15.

[1] https://www.sqlite.org/pragma.html#pragma_synchronous

[2] https://sqlite.org/wal.html
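
If it helps anyone landing here, a minimal sketch of those knobs with Python's sqlite3 (file name, table and row count are made up; pick synchronous/journal_mode values that match your durability needs):

  import sqlite3

  conn = sqlite3.connect("bulk.db")            # hypothetical database file
  conn.execute("PRAGMA journal_mode = WAL")     # or OFF, if you can tolerate corruption on crash
  conn.execute("PRAGMA synchronous = NORMAL")   # the default FULL fsyncs far more often
  conn.execute("PRAGMA mmap_size = 268435456")  # 256 MB of memory-mapped I/O

  conn.execute("CREATE TABLE IF NOT EXISTS t (id INTEGER PRIMARY KEY, payload TEXT)")

  rows = (("row %d" % i,) for i in range(100_000))
  with conn:                                    # one big transaction instead of 100k implicit ones
      conn.executemany("INSERT INTO t (payload) VALUES (?)", rows)
  conn.close()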


I've been there. It doesn't work. I applied all those tips and tricks; they didn't improve performance that much, but I did lose data.

Sqlite is full of locks, and any writes are single-threaded stop-the-world, and there's no way around it. It's Sqlite's philosophy. Write-heavy databases just need something like MVCC, and Sqlite won't have that.


I think you make a point which is very true and valid but not relevant to this particular comment thread.

You are talking about multiple different processes/threads heavily and concurrently writing to a database. In that case, you're absolutely right, the point has come to switch to a client/server database like PostgreSQL or MySQL.

But the parent comment was not about that (or at least they didn't explicitly mention concurrency, and their mention of "a default setting that made spill-to-disk very aggressive" rather than locks suggested that concurrency wasn't the problem). They seemed to be talking about a single writer inserting at a high rate, which is something I'd expect SQLite to cope with very well given the right tricks (mostly batching multiple inserts in transactions - I acknowledge their comment that the defaults are unfortunate though). Yes, even with a single process there are locks, but if a lock is uncontended then it is not normally a problem.


> the point has come to switch to a client/server database

Windows has an embedded NoSQL DB engine which is fine with concurrent writes and multi-versioning: https://en.wikipedia.org/wiki/Extensible_Storage_Engine

There are disadvantages too. It does not implement SQL; queries need to be done manually on top of the various indices on these tables. The DB is much more than a single file. The API is way more complicated than SQLite's. The databases are portable from older to newer versions of Windows with automatic upgrades, but not the other way.


The JET database continually corrupts itself. It’s terrible.

The Windows search functionality uses this database and the only way Microsoft was able to make this feature somewhat reliable is by having it run very thorough checks at startup and tossing the database and reindexing at the first hint of trouble.

I don’t know why it is so terribly unreliable but I know it is.


Interesting, I have the opposite experience. At one point I had lots of experience with AD and Exchange. Later I shipped desktop software based on ESENT and supported it for a few years. I was generally happy with the technology.

Are you sure these issues are caused by the DB engine? Other possibilities include your PC (like interference from crappy AV software), or Microsoft’s code in these search indexing services.


This happens on a large number of PCs, and I don’t think the way Microsoft uses its own database engine should be an excuse for it corrupting its data.

I think the difference is that Windows Search is always running, including when the computer crashes, is turned off unexpectedly or doesn’t complete waking up from sleep or hibernation. I think JET just isn’t that robust against that. It’s quite conceivable people manually closed your software when they shut down their computer.

I remember Exchange failing the same way if its database disks suddenly disappear due to network issues.


> Microsoft uses its own database engine

Microsoft is large and software quality varies. For instance, look at Skype: they failed to use their own GUI frameworks and are using Electron, i.e. Google Chrome, to paint a few controls.

> I think JET just isn’t that robust against that.

I think it is, it's mentioned everywhere:

https://docs.microsoft.com/en-us/windows/win32/extensible-st...

https://stackoverflow.com/a/2369220/126995


Okay, ‘according to the documentation’ it probably is. In reality it isn’t.

Like they say,

In theory, there’s no difference between theory and practice. In practice, there is.

Here’s a list of things that can go wrong (here in the context of domain controllers):

https://docs.microsoft.com/en-us/troubleshoot/windows-server...

Some of them are just the unavoidable hardware failures; for others they suggest ‘Deploy the OS on server-class hardware’. Or the always helpful ‘restore from backup’. Not quite reasonable for a database containing a search index on a consumer device.


> Here’s a list of things that can go wrong (here in the context of domain controllers)

That article was written because DC is a business-critical infrastructure. Here’s a comparable one about Oracle: https://docs.oracle.com/cd/A87860_01/doc/server.817/a76965/c...

NTFS: https://docs.microsoft.com/en-us/previous-versions/windows/i...

> Not quite reasonable for a database containing a search index on a consumer device.

Before I switched to MS Outlook, I used Windows Live Mail (now discontinued) as an e-mail client for a decade or so, and it used ESENT for everything.

When I run Process Explorer and search for “esent.dll”, it finds a dozen system services using ESENT databases, many of them critical, like CryptSvc.

I try to buy good hardware, but that’s not server-grade components. I don’t use ECC RAM or a UPS, and I suffer brief power outages a couple of times a year. If ESENT were corrupting databases when the power is turned off suddenly, I would have noticed.


Your experiment with a sample size of 1 proves, with an uncertainty of 100%, that the database never corrupts.

Experiments with a larger sample size show a different result.


> Sqlite is full of locks, and any writes are single-threaded stop-the-world, and there's no way around it. It's Sqlite's philosophy.

SQLite is not the silver bullet of the DB world, but it's extremely useful in certain scenarios. Will you need a DB for a desktop app? Ditto. Want a portable file to carry some complex data structure? Ditto. Want to test your webapp during development, or provide a single-user version for users? Ditto.

Concurrent-writing, multiple-user, multiple-producer scenarios need something bigger: MySQL Embedded, MySQL, Postgres, MSSQL, Oracle... The list goes on.

SQLite makes databases accessible and useful in many more scenarios. I hated databases until I found SQLite, because I simply didn't see the point of running a big server, designed to handle much bigger data sets, to store 250KB of text tables.

SQLite is extremely underrated IMHO.


There's a way around it which is actively worked on apparently:

https://www.sqlite.org/src/doc/begin-concurrent/doc/begin_co...


Why would stop-the-world affect performance when you are doing batch inserts?

Stopping the database is even a common trick when bulk inserting into "real" databases.


I think they are referring to an OLTP-type scenario where you have lots of inserts and reads, but the reads only look at a couple of rows at a time, often only a single row. MVCC is good for such cases because you can avoid the overhead of taking lots of locks but still get fine-grained concurrency control.

You are right that the SQLite approach actually works quite well for bulk operations, since you only require a single lock. However, it's usually still better for bulk inserts without updates to use fine-grained locking or MVCC, since you can often avoid acquiring any locks at all beyond the basic ones guarding fundamental DB data structures (these aren't locks as far as SQL is concerned since they cannot cause deadlock; it's a big pet peeve of mine when people think lock-free = no use of mutexes).

As a side note, don't do the following pattern: "BEGIN TRAN; INSERT INTO T ...; SELECT max(id) FROM T; COMMIT;". I used to do this, but this is a very bad habit that may be incorrect (assuming that id is an autoincrementing column). It's only correct on systems with true serializability, when you have opted into full serializability, and where such systems consider autoincrementing IDs to be part of serializability. When I tested this, Postgres and MSSQL handled this as expected while MySQL allowed the select to return a different row. I just tested Sqlite, and it does seem to work there regardless of WAL since it only allows concurrent readers plus a single writer. Use last_insert_rowid() or the equivalent for your database[1].

[1]: https://sqlite.org/lang_corefunc.html#last_insert_rowid
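
For what it's worth, most bindings surface this directly; e.g. Python's sqlite3 exposes it on the cursor, so there's no need to query max(id) at all (table and column names here are invented):

  import sqlite3

  conn = sqlite3.connect(":memory:")
  conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)")

  cur = conn.execute("INSERT INTO t (name) VALUES (?)", ("example",))
  new_id = cur.lastrowid   # wraps last_insert_rowid() for this connection
  print(new_id)            # 1
  conn.commit()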


> As a side note, don't do the following pattern: "BEGIN TRAN; INSERT INTO T ...; SELECT max(id) FROM T; COMMIT;". I used to do this, but this is a very bad habit that may be incorrect (assuming that id is an autoincrementing column). It's only correct on systems with true serializability, when you have opted into full serializability, and where such systems consider autoincrementing IDs to be part of serializability.

And even when it works, it'll create a lot of unnecessary conflicts.

> Use last_insert_rowid() or the equivalent for your database[1].

RETURNING is the best approach for that in postgres (although lastval() also works). https://www.postgresql.org/docs/devel/sql-insert.html
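
A minimal sketch of the RETURNING approach, assuming psycopg2 as the driver (connection string, table and column are placeholders):

  import psycopg2

  conn = psycopg2.connect("dbname=example")   # hypothetical connection string
  with conn, conn.cursor() as cur:
      cur.execute(
          "INSERT INTO t (name) VALUES (%s) RETURNING id",
          ("example",),
      )
      new_id = cur.fetchone()[0]              # the id generated by this insert, race-free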


According to several heavy sqlite users on HN, performance is improved if you do locking at the application layer. No idea if this is true.


See what you need to do is stand up a Redis cache cluster in front of your app, and then a cassandra cluster to cache the writes and compact them, before synchronizing the data with your single-page SQLite datastore as the master data.

Obviously for your SQL queries we crank up a large SparkSQL cluster.

- AWS architects

You can thank me later.


Do you get more thanks for your complicated three-layer solution than for just telling people to use MySQL?


I'm sure they were being sarcastic.


One trick of `pragma journal_mode = WAL` is that even with synchronous normal, SQLite syncs when you close the last connection [1]. So re-using a connection is key to getting the full benefits if you're executing commands from a single thread/process.

[1] https://sqlite.com/wal.html#the_wal_file
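
Roughly what that looks like in practice, sketched with Python's sqlite3 (file name, pragma values and schema are placeholders): open one connection up front, switch it to WAL once, and route every statement through it instead of reconnecting per query.

  import sqlite3

  conn = sqlite3.connect("app.db")              # hypothetical file; open once, not per query
  conn.execute("PRAGMA journal_mode = WAL")
  conn.execute("PRAGMA synchronous = NORMAL")

  def run(sql, params=()):
      # every statement goes through the same long-lived connection, so the
      # WAL file isn't checkpointed and synced by a constant open/close churn
      with conn:
          return conn.execute(sql, params).fetchall()

  run("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")
  run("INSERT OR REPLACE INTO kv VALUES (?, ?)", ("answer", "42"))
  print(run("SELECT v FROM kv WHERE k = ?", ("answer",)))   # [('42',)]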


Out of curiosity, just how many rows were you trying to insert, for this to be a problem? My memory is a bit fuzzy, but on SQLite even standard INSERT statements can scale to hundreds of thousands per second, if you do them in one transaction. Just curious about the scenario here.


A neat little trick is to explicitly state “begin transaction” and “end transaction” in my app. Not sure how generally useful this is.


This is exactly how I accomplished high performance in sqlite too.

I'm surprised more people don't use transactions in sqlite given transactions are a staple of using any enterprise RDBMS.


Once you cross code boundaries (classes, functions, whatever), transaction management gets a bit hairy.

The question "prove this program reliably closes the transaction I started" starts to become equivalent to "prove this program halts"

Obviously they are useful tools and heavily used, but it's not like they are a zero-overhead feature.


This is quite a generalised point, so it is difficult to really respond to, but some frameworks do provide a mechanism to check whether a transaction has been closed. The stateless nature of web development also makes working with transactions easier, as you have a clear entry and exit point for each request. However, I appreciate not all uses of RDBMSs are stateless, let alone web applications. But a similar approach can sometimes be applied with stateful applications, where you can carve up the transactional operations into logical code paths that are easier to mentally visualise. As always though, it really depends on how your code is structured and the specific problems you're trying to solve.

Where you know code boundaries are an issue I've found functional designs tend to work a little better than OOP with regards to managing transactions but a lot of that could just be down to how my brain is wired (while I'm not a FP evangelist I do tend to favour breaking code down to stateless functions rather than stateful classes).

It's fair to say spaghetti code will be a problem on most reasonably mature code bases but there are approaches and frameworks that help somewhat with managing transactions across code boundaries -- just as there are tools that make working with transactions harder. But in my experience there are much harder problems to solve than working with transactions.

> but it's not like they are a zero-overhead feature.

Is there such thing as a zero overhead feature? (I say this semi-flippantly).


I mean, I have to deal with this crap at $dayjob, but I genuinely can't believe the terrible code I see that borrows a resource (connection, transaction, file handle) and then gives it back only on the happy path. I desperately wish that languages would make it a compile error if all code paths don't lead to the resource being freed.

The only thing that should ever stop you from returning a resource is a malicious scheduler.


> I desperately wish that languages would make this a compile error if all code paths don't lead to the resource being freed.

You might want to take a look at rust.


I can see the obvious parallels between memory management lifetimes and transaction management, does rust have explicit features for extending lifetimes to resources besides memory?


Yes, it works for files, network connections and database transactions[1].

To span a transaction over two functions, it has to be assigned to some variable and the lifetime of this variable is tracked by the compiler.

[1] https://docs.rs/sqlx/0.3.0/sqlx/struct.Transaction.html


You don’t need to solve the halting problem, you just need a way to construct programs that halt (or close the connection), which is way easier.

Many languages have some sort of `finally` or `with` construct tailored for this use case.

Remember, we’re writing the programs, not deciding halting for arbitrary ones.
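
For example, Python's sqlite3 connection already acts as such a `with` construct: the block commits on success and rolls back if anything inside raises, so the transaction can't leak down the unhappy path (the schema below is invented):

  import sqlite3

  conn = sqlite3.connect(":memory:")
  conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
  conn.execute("INSERT INTO accounts (balance) VALUES (100), (0)")
  conn.commit()

  try:
      with conn:  # BEGIN ... COMMIT, or ROLLBACK if the block raises
          conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 1")
          conn.execute("UPDATE accounts SET balance = balance + 50 WHERE id = 999")
          raise RuntimeError("no such account")   # simulate a failure mid-transaction
  except RuntimeError:
      pass

  print(conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone())  # (100,) -- rolled back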


I actually avoided transactions in MSSQL when I could because it escalates locks real quick. Which is a death knell for a busy system.


Eh? They exist, that doesn’t mean anyone actually uses them.


My point is that using transactions should be drilled into people who work on databases and where consistency is a requirement because transactions turn multiple complex SQL requests into one atomic operation:

http://db4beginners.com/blog/relationaldb-transaction/


Oh, I agree. It’s just that my recent experiences have included mostly people foreign to both transactions and foreign keys.


I've never worked at a place that doesn't use them. What if the third insert in a series fails, and the fourth statement consequently writes the wrong thing down with an update?

That's just my experience so I guess it may be meaningless but I'm surprised to hear it may not be often used.


> I've never worked at a place that doesn't use them.

Consider yourself lucky then. I know of a place that doesn’t use transactions in a homegrown ERP solution, of all things.


I've seen all kinds of wonders during my life so this one is no surprise. If however someone is doing stupid things it is their problem. They're free to complain to themselves.


Tangentially - it may be their problem, but I hesitate to say it's their fault. I'm continually dismayed at how spotty and superficial education about how to use an RDBMS can be. Even in formal education on the subject.


It does not take a PhD or rocket science to figure out that sometimes operations must be bunched together and executed as a single unit that succeeds or fails as a whole. That would come in as a business requirement. For a curious person, it would not take much searching on the subject to discover and read about transactions.


You hardly ever need transactions if you track validity explicitly in your schema.

Suppose T1 has a 1-many relationship with T2. Declare in your assumptions that any rows in T2 with no corresponding valid row in T1 are not valid.

Additionally, have an is_valid field on T1 so selecting all valid data from T2 is done with "select * from T2 inner join T1 on T2.t1id = T1.id where T1.is_valid".

To insert data, insert a row into T1 first but initially have is_valid be false. Then insert all necessary data into T2. Finally, do an update and change the original row in T1 to have is_valid be true.

For deletions to T1, just do an update and set is_valid to false. Thanks to the validity logic, this has the effect of also invalidating all T2 rows.

Updates are trickier, but you can allow them to work without transactions by having two ids for every table. The first id is the one we worked with before, which is used for joins. The second id is used by applications to look up explicit records. Therefore, never do any updates aside from the one setting is_valid to true (which is really storage logic and not application logic). Instead, just insert a new row into T1 whenever you want to update something in T1. The final update now just needs to flip the is_valid bit for the old row and the new row, and it will also need to verify that the old row is valid, as well as any other rows the current update relies on (basically you need to turn it into a complex CAS).

All of this is pretty messy, but it does let you have CRUD without any transaction support from your DB. Also, even if your DB has transactions, this scheme has the advantage of being lock-free so your application cannot deadlock.

If you have many updates/deletes, you can do garbage collection either by allowing the GC to use a transaction or by adding a check on insertions to T1 that verifies the number of associated rows in T2 before setting is_valid.

Unfortunately, while updates and inserts with GC can still be lock-free, they are not wait-free since an insert or update can fail. If you never do updates or GC though, this is actually wait-free and guarantees that every create, read, and delete operation will succeed in the absence of hardware/network failures.

Still, this overhead probably isn't worth it unless you already need to track the history explicitly for auditing or something. At the company where we used this, we didn't have an is_valid column; we had valid_from and valid_to, which were timestamps.
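
A bare-bones sketch of the insert path described above, with invented table names (SQLite used here only as a convenient way to run the SQL; the point is that no statement needs an enclosing transaction):

  import sqlite3

  db = sqlite3.connect(":memory:")
  db.executescript("""
      CREATE TABLE t1 (id INTEGER PRIMARY KEY, is_valid INTEGER NOT NULL DEFAULT 0);
      CREATE TABLE t2 (id INTEGER PRIMARY KEY, t1id INTEGER NOT NULL, payload TEXT);
  """)

  # 1. create the parent row, still invisible to readers
  t1id = db.execute("INSERT INTO t1 (is_valid) VALUES (0)").lastrowid
  # 2. insert all child rows; until step 3 they join against an invalid parent
  db.executemany("INSERT INTO t2 (t1id, payload) VALUES (?, ?)",
                 [(t1id, "a"), (t1id, "b")])
  # 3. flip the bit -- the whole group becomes visible at once
  db.execute("UPDATE t1 SET is_valid = 1 WHERE id = ?", (t1id,))

  # readers only ever ask for valid data
  rows = db.execute("""
      SELECT t2.payload FROM t2
      JOIN t1 ON t2.t1id = t1.id
      WHERE t1.is_valid = 1
  """).fetchall()
  print(rows)   # [('a',), ('b',)]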


Same as @mikeyjk! Can you point toward more information on this topic? The applications I'm dealing with in my day job are constantly plagued by deadlocks in the database surrounding bulk transactions. I've theorized that a "T1 1-many T2" refactoring of the tables - like what you've described - might be a solution, but haven't settled on how to implement it. Would absolutely love to learn more on this esoteric, hard-to-google topic.


I guess I'm not sure how you can get away without needing transactionality for these extra columns / is_valid columns


All of this sounds like a lot of additional work just to provide some of the functionality transactions provide out of the box. And I say "some" because you've not addressed rollbacks.

That said, it's still an interesting thought experiment so glad you shared.


Wow, this has broken my brain but has really piqued my interest - thanks. Do you have any reading on this topic / strategy to help a mouth breather like me out?


My sentiments exactly!


Not using transaction is just very bad practice. If people are using wrong approach to solve the task they should not complain about results.


Another trick, if you have multiple processes (users) accessing the DB and you don't want to lock it for a long time, is to insert into a temp table (possibly within a transaction, although the time savings are not as dramatic with temp tables), then copy the temp table into the main one (insert into ... select ... from ...). The advantage: say you are reading in a bunch of items from somewhere else; that takes a chunk of time beyond just the DB time, so by going through a temp table you aren't locking the target table for anyone else while gathering the data. Combine this with flushing the temp table every X rows or X seconds, and you get a number of efficient updates to the table without long lock times.

Also have WAL mode on to get multi-user access going.
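
Very roughly, something like this (file, table and batch size invented; WAL assumed as mentioned):

  import sqlite3

  db = sqlite3.connect("shared.db", timeout=30)   # hypothetical shared database
  db.execute("PRAGMA journal_mode = WAL")
  db.execute("CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, payload TEXT)")
  db.execute("CREATE TEMP TABLE staging (payload TEXT)")   # private to this connection

  def ingest(payload):
      # cheap, never touches the shared table
      db.execute("INSERT INTO staging (payload) VALUES (?)", (payload,))

  def flush():
      # one short write transaction moves the whole batch across
      with db:
          db.execute("INSERT INTO items (payload) SELECT payload FROM staging")
          db.execute("DELETE FROM staging")

  for i in range(1000):
      ingest("row %d" % i)
      if i % 250 == 249:        # flush every X rows (or every X seconds)
          flush()
  flush()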


This surprises me as I thought SQLite locks are db-level, not table-level. Is this not the case?


If you have wal-mode enabled then the automatic locks are table level. So one process can be updating one table and another one can work on a different one. Also, you only need to enable wal mode on the DB once (pragma journal_mode=wal), it "sticks" for each connection.

In my application that uses SQLite (Snebu backup), as data comes in (as a TAR format stream) I have one process extracting the data and metadata, then serializing the metadata to another process that owns the DB connection. This process dumps the metadata to a temp table, then every 10 seconds "flushes" the metadata to the various tables that it needs to go to. This way I can easily have multiple backups going simultaneously, as each process spends a small amount of time (relatively) flushing the data to the permanent tables, and a greater part of the time compressing and writing backup data to the disk vault directory.

I've been working with this for the past 8 years or so, and have picked up a few tricks on keeping as much as possible batched up in transactions, but also keeping the transaction times short relative to other operations. So far seems to work out fairly well.

Note, that in addition to journal_mode=wal, you need to have a busy handler defined that infinitely retries transactions with a 250 ms delay between each retry.

Edit: On further review of the docs, I'm not sure wal mode actually enables table-level locking; it may instead be that temp tables are part of a separate schema (or are otherwise separate from the main DB) -- which makes sense, as temp tables are only visible to the process that owns them. So a temp table can be locked in a transaction while the rest of the DB remains writable.
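
On the busy-handler point: Python's sqlite3 doesn't expose sqlite3_busy_handler directly, but a similar effect can be approximated with the connect() timeout plus an application-level retry loop. A rough sketch (table name invented; the 250 ms delay is just the figure from the comment above):

  import sqlite3
  import time

  db = sqlite3.connect("shared.db", timeout=30)   # timeout= maps to SQLite's busy timeout
  db.execute("PRAGMA journal_mode = WAL")
  db.execute("CREATE TABLE IF NOT EXISTS items (id INTEGER PRIMARY KEY, payload TEXT)")

  def with_retry(fn, delay=0.25):
      # crude equivalent of a busy handler: keep retrying while another
      # writer holds the lock, sleeping 250 ms between attempts
      while True:
          try:
              return fn()
          except sqlite3.OperationalError as e:
              if "locked" not in str(e):
                  raise
              time.sleep(delay)

  def flush_batch(rows):
      with db:
          db.executemany("INSERT INTO items (payload) VALUES (?)", rows)

  with_retry(lambda: flush_batch([("a",), ("b",)]))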


It seems like the locking is page-level in "wal" and "wal2" modes:

> Usually, SQLite allows at most one writer to proceed concurrently. The BEGIN CONCURRENT enhancement allows multiple writers to process write transactions simultaneously if the database is in "wal" or "wal2" mode, although the system still serializes COMMIT commands.

> When a write-transaction is opened with "BEGIN CONCURRENT", actually locking the database is deferred until a COMMIT is executed. This means that any number of transactions started with BEGIN CONCURRENT may proceed concurrently. The system uses optimistic page-level-locking to prevent conflicting concurrent transactions from being committed.

> When a BEGIN CONCURRENT transaction is committed, the system checks whether or not any of the database pages that the transaction has read have been modified since the BEGIN CONCURRENT was opened. In other words - it asks if the transaction being committed operates on a different set of data than all other concurrently executing transactions. If the answer is "yes, this transaction did not read or modify any data modified by any concurrent transaction", then the transaction is committed as normal. Otherwise, if the transaction does conflict, it cannot be committed and an SQLITE_BUSY_SNAPSHOT error is returned. At this point, all the client can do is ROLLBACK the transaction.

The page also mentions:

> The key to maximizing concurrency using BEGIN CONCURRENT is to ensure that there are a large number of non-conflicting transactions. In SQLite, each table and each index is stored as a separate b-tree, each of which is distributed over a discrete set of database pages. This means that:

> Two transactions that write to different sets of tables never conflict

Source: https://www.sqlite.org/cgi/src/doc/begin-concurrent/doc/begi...


+1. I was not aware of the 'stickiness' of the wal pragma. Thank you for the tip. I will be defaulting to this going forward. For a quick confirmation, I used:

  $ f=$(mktemp)
  $ sqlite3 $f "pragma journal_mode"
  delete
  $ sqlite3 $f "pragma journal_mode=wal"
  wal
  $ sqlite3 $f "pragma journal_mode"
  wal
Edit: struggling to format code markup correctly on my mobile.


+1, thanks for this and your previous comment. Have you written up your approach in any more detail? I'd love to reference it.


The code is on github.com/derekp7/snebu if you want to take a look. The next task on my list is to write up developer / contributor documentation on how the whole thing is put together along with various tricks (and simplified demonstration code).

But specifically, look in "snebu-main.c" that is where the opendb function is (so you can see the pragma statements), and there is a busy_retry function that gets referenced (all it does is sleep for .1 seconds). I believe that you don't need the busy-retry function, if you use the built-in busy handler, but I'm not really sure and don't want to take a chance and break working code.

For the temp tables, look in snebu-submitfiles.c -- the function at the top handles the DB operations, one towards the bottom handles the non-DB tar file consumption operations, and there is a circular queue in the middle to handle buffering so the data ingestion can keep going while the data is getting flushed (these three run as separate processes). I should learn threads, as there may be more flexibility in that, but not comfortable enough with thread programming yet.


Thanks so much! The phrase "unreasonably helpful" comes to mind. Also, kudos for the extraordinarily high-quality documentation in snebu!


> A neat little trick is to explicitly state “begin transaction” and “end transaction” in my app.

That is exactly what the parent comment already said:

> if you do them in one transaction

The commands you stated are exactly how to do (multiple) things in a transaction.


Indeed, without this, performance would seem quite lacklustre. I believe it's very common in the SQLite community.


Why would I do a hundred thousand insertions in a single transaction in a crud app?


I think parent means that for some careful selection of N (where N > 1) insertions per bulk transaction you can scale up to hundreds of thousands of insertions per second, rather than putting hundreds of thousands of insertions in a single transaction.


No, the opposite. In SQLite, starting and ending transactions that write things to the db is a relatively expensive operation, and running queries outside transaction is effectively the same as running each of them in an independent transaction.

If you need to do a lot of inserts (or updates, etc), the slowest possible way to do them is to do them outside of a transaction. The fastest way to do them is to wrap them all into a single transaction.


> If you need to do a lot of inserts (or updates, etc), the slowest possible way to do them is to do them outside of a transaction. The fastest way to do them is to wrap them all into a single transaction.

This doesn't seem to contradict the comment you're replying to. They're suggesting wrapping operations into transactions in batches e.g. (just making some numbers up) if you have 100,000 inserts maybe you'd do 100 transactions of 1000 inserts each. I wouldn't call that "the opposite" of your one mega-transaction suggestion. In fact I'd expect it to still have most or all of the speed benefit of using one single transaction, or potentially even be slightly faster.
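
Roughly the difference being discussed, sketched with Python's sqlite3 (table and batch size are arbitrary):

  import sqlite3

  db = sqlite3.connect(":memory:")
  db.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT)")

  rows = [("row %d" % i,) for i in range(100_000)]

  # slowest: one implicit transaction (and fsync) per statement
  # for r in rows:
  #     db.execute("INSERT INTO t (v) VALUES (?)", r)
  #     db.commit()

  # batched: 100 transactions of 1000 rows keeps most of the benefit
  BATCH = 1000
  for i in range(0, len(rows), BATCH):
      with db:
          db.executemany("INSERT INTO t (v) VALUES (?)", rows[i:i + BATCH])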


Oh fascinating, you actually put hundreds of thousands of statements in a single SQLite transaction in an online CRUD app (as opposed to offline processing)? I've never done more than a couple hundred, and even then usually they're "logically batched," both because I'm worried about forcing unnecessary read-to-write transaction promotions for concurrent reads and thereby increasing busy errors, and also because keeping a transaction open that long affects durability (it's not great to let your HTTP response hang for a second before responding as you keep your transaction open).

For serialized writers in any system I'm sure keeping a transaction open as long as possible is the ideal case for throughput, but there's other problems with that in a CRUD app no?


Well, that was my original question to the parent poster, too. You can achieve 100k+/sec, but they went looking for better performance, apparently.

But to answer your question at face value, one common use for SQLite is as a file format for complex desktop apps, or perhaps a game save format for certain types of games (mostly the non-realtime types). One great advantage to this is that if you use a DB migration library, you ensure backwards compatibility with previous versions of your saved files. However, it's easy to imagine a fair bit of data getting inserted into such a new file each time it is created. It might only be 5000 rows, but I'd prefer to not have to wait 5 seconds just to persist it, if a single change can make it 0.05 seconds.


db seeding with dummy data, migration, analytics db sync etc.


These are all one-time or once-a-year operations.


You still don't want to wait 5 hours to let your inefficient query complete.


because bulk kinda implies that you want all-or-nothing :)


I've recently had my Python script generate about 100GB of data and inserting them into SQLite was _far_ from being the bottleneck.


I suppose OP was committing too often or even had autocommit on.

Then it wouldn't actually have been a case of "bulk insert" at all.


How many rows? What is the "fast" you expect?

My main app syncs data across ERPs, and its main use case is batch loading of data. This means that I need to nearly mirror a SQL Server/Oracle/Cobol/Firebase/etc. database into SQLite, clean it, then upload it to PostgreSQL.

I have more trouble fast-loading into PG than into SQLite (not saying I didn't have trouble with SQLite in the past!), and SQLite is very, very fast for me.


I stumbled upon a similar issue last week.

I was building a small Rails project that has mostly reads, but here and there it has inserts that can involve 1000+ entries with related models. Things seemed fine when clicking around, but when I used JMeter to test the capacity of the server, I found it terrible with simultaneous requests.

Adding two workers made things even worse, and this is with ~10% writes and the remaining going to reads.

I quickly changed the db to postgres for comparison purposes, and from 3-4 requests per second with sqlite, it jumped to 60-70 as it did scale linearly with the number of workers.

I am by no means expert in database optimisations, but out of the box this behaviour was somewhat limiting the usability of sqlite for a webapp.


Postgres is pretty bad at plain bulk INSERTs. You need to use COPY and create an in-memory CSV file, or mess around with unnest(), to get decent performance.


`COPY`, yes; that's what bulk insert is. How does having the feature make Postgres bad at it? "In-memory CSV", no: you don't need to create an in-memory CSV file in order to load in bulk. Your language binding should provide a way to stream data into a `COPY` operation. I can't imagine how unnest is related to bulk loading; it's a query feature to unnest arrays.
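
For example, with psycopg2 you can hand copy_expert any file-like object, so the rows can come from a generator rather than a CSV built up in memory (connection string and table are placeholders; this is a sketch, not a tuned loader):

  import psycopg2

  class RowStream:
      """Minimal file-like object that feeds generated rows to COPY, so the
      whole data set never sits in memory at once."""
      def __init__(self, lines):
          self.lines = iter(lines)
      def read(self, size=-1):
          # copy_expert pulls data via read(); returning one tab-separated
          # line per call is fine, and "" signals end of input
          return next(self.lines, "")

  rows = ("%d\tvalue %d\n" % (i, i) for i in range(1_000_000))

  conn = psycopg2.connect("dbname=example")   # hypothetical connection string
  with conn, conn.cursor() as cur:
      cur.copy_expert("COPY t (id, v) FROM STDIN", RowStream(rows))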


The relationship between bulk loading and unnest is as follows.

A single INSERT that unnests an array can insert many rows. The performance is worse than COPY, but in the same ballpark. And there are use cases where you'd like to bulk load through a stored procedure for a variety of reasons; calling the procedure with an array and using unnest internally is then a straight win.
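
A sketch of the unnest flavour, again assuming psycopg2 (which adapts Python lists to arrays); table and columns are invented:

  import psycopg2

  ids = list(range(10_000))
  names = ["row %d" % i for i in ids]

  conn = psycopg2.connect("dbname=example")   # hypothetical connection string
  with conn, conn.cursor() as cur:
      cur.execute(
          """
          INSERT INTO t (id, name)
          SELECT * FROM unnest(%s::int[], %s::text[])
          """,
          (ids, names),   # one statement inserts the whole batch
      )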


Thanks for sharing! Did you have performance benefits using PostgreSQL out of the box, or did you have to tune specific settings? How much effort did you have to spend to do this migration?


Did you use transactions? I noticed a significant performance increase after wrapping blocks of queries with BEGIN TRANSACTION and COMMIT TRANSACTION.


If you like SQLite, then DuckDB [1] is probably worth looking at. It is very similar in many ways, but DuckDB is column-oriented rather than row-oriented, so it does have some performance advantages. It is quite new, so I might not go all in for mission-critical production yet, but it is worth exploring for analytics work.

[1] https://duckdb.org/


Question for people using DuckDB: are the use cases similar to what you'd use Apache Arrow for, but with the benefit of working in SQL, or are they meaningfully different?

I'm not currently using any of those, mind you, still on a pandas/dask* dataframe basis, but I'm trying to wrap my head around where the ecosystem is moving

*I know Dask is already using Arrow behind the scenes


I don't use Apache Arrow, so I'm in a poor position to compare it with DuckDB.

My use case for DuckDB is effectively querying R dataframes with SQL. DuckDB has the functionality to register virtual tables, with data from existing dataframes.

As I know SQL reasonably well, using DuckDB to query dataframes means I don't need to learn a bunch of new dplyr verbs or data.table constructs. There are some other R packages which also support this use case - sqldf and tidyquery are two I am aware of. Both of these follow a different approach, where they parse the SQL query. Using a DuckDB virtual table lets the database handle all of the SQL.

I've found so far that through using DuckDB, performance is much better than sqldf and tidyquery, nowhere near as quick as data.table and can be quicker than dplyr, depending on query complexity. I haven't really looked at anything approaching big data sizes though.


Wow! Seems to be around for a while as well. Regret not finding it earlier.


DuckDB is amazing.


I will re-consider it, once they have proper data types and data type checking.


This aspect will probably never change:

> Flexible typing is considered a feature of SQLite, not a bug. Nevertheless, we recognize that this feature does sometimes cause confusion and pain for developers who are accustomed to working with other databases that are more judgmental with regard to data types. In retrospect, perhaps it would have been better if SQLite had merely implemented an ANY datatype so that developers could explicitly state when they wanted to use flexible typing, rather than making flexible typing the default. But that is not something that can be changed now without breaking the millions of applications and trillions of database files that already use SQLite's flexible typing feature.

https://sqlite.org/quirks.html#flexible_typing


It is possible to create check constraints that do the type checking. Actually, if they shipped some predefined ones and said "if you want strong type checking, do this" - like syntax to automatically populate the table columns with type-check constraints - it would IMO be perfectly backward compatible.
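
For example, something like this (column names invented; typeof() is a built-in SQLite function), shown here via Python's sqlite3:

  import sqlite3

  db = sqlite3.connect(":memory:")
  db.execute("""
      CREATE TABLE measurements (
          id    INTEGER PRIMARY KEY,
          label TEXT    CHECK (typeof(label) = 'text'),
          grams INTEGER CHECK (typeof(grams) = 'integer')
      )
  """)
  -- note below is a Python comment in the sketch: typeof(NULL) is 'null',
  -- so these checks also reject NULLs; add OR col IS NULL if you want them

  db.execute("INSERT INTO measurements (label, grams) VALUES (?, ?)", ("sample", 125))

  try:
      db.execute("INSERT INTO measurements (label, grams) VALUES (?, ?)", ("sample", "oops"))
  except sqlite3.IntegrityError as e:
      print(e)   # CHECK constraint failed: the rejection stricter databases do by default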


At the very least we could get strict versions of data types, or some sort of key word used in the ddl to specify strict typing.



When I started a project with SQLite, the available data types struck me as odd (coming from MySQL), but as I learned I started asking the question: are there any more fundamental data types other than null, int, real, text and blob? For example dates are just a facade for an integer of some kind, JSON is really just text adhering to certain formatting rules, booleans are usually stored as some kind of byte anyway so why not drop that abstraction?

With this limited set of datatypes it really makes you think harder about the data you are processing, because in the end all your data is one of these types anyway.


> are there any more fundamental data types other than null, int, real, text and blob?

Those type names are just hints, they don't constrain the valid values in any way.

https://dbfiddle.uk/?rdbms=sqlite_3.27&fiddle=4634e3821676ed...


I don't really see it as such a problem. When interacting with the database you either use an ORM with your strongly-typed business objects, or you write the SQL and use parameters (again strongly typed).


> When interacting with the database you either use an ORM with your strongly-typed business objects

If you have strongly-typed business objects then why not have strongly typed storage? If your code can control the constraints that a correct data type ensures, then why have "strongly-typed business objects" to begin with? Why not store everything in your code as strings as well?

I see this misconception all the time. The database (and its data) lives way longer than most applications. And it's also wrong to assume that there is always only one application accessing the database. Bulk loads are a typical case of secondary applications.

Not choosing the proper data type in a relational database is a really bad decision, and we see questions on Stack Overflow and similar sites on a weekly (if not daily) basis asking how to fix invalid data in those "un-typed" columns.


I'm not saying it's useless to have type constraints in a database. Just that with a bit of care it's not such a big deal.

In the end there will always be some business rules that are not constrained by the database, so you always have to be a bit careful about what you store in it. Indeed, not being careful and hoping that your types, constraints and triggers are going to save you is more risky.

> Not choosing the proper data type in a relational database is a really bad decision

Well, then rejoice, you can't make this bad decision in SQLite because everything is +/- a number or a string


SQL parameters aren't strongly typed though, and in any case SQLite doesn't have stored procedures.



That is... interesting...


I can't believe that runs. What a misfeature.


You just put me off Sqlite for life. I can't believe someone thought this to be a good idea.


SQLite just does what JavaScript and other weakly typed languages do. The developer is responsible for what they input. If type checks are a must, SQLite is just not for you, easy as that.


But even JavaScript will reject data if it's considered not valid for a type.

It won't throw but it will set an "invalid" / "NaN" / etc. value.


I don't see how having "NaN" or "undefined" in a number field makes any difference to just putting the value there. If you are using a strongly typed language, this will not be a problem.

SQLite made a design choice in favor of simplicity. It's also missing basic date functions altogether. The only way to compare dates is by using Unix Epoch.


I use ISO 8601 full timestamps in UTC.

They were carefully designed so that collation order is identical to temporal order. Which is convenient!

If you need interval logic, though, SQLite won't help you, and epoch is the better choice. It's possible to solve some queries with a regex, but you won't love it.
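
A small illustration of that convention (schema invented): with fixed-width UTC ISO 8601 strings, plain string comparison gives temporal order.

  import sqlite3

  db = sqlite3.connect(":memory:")
  db.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, ts TEXT, note TEXT)")
  db.executemany("INSERT INTO events (ts, note) VALUES (?, ?)", [
      ("2021-02-16T09:15:00Z", "first"),
      ("2021-02-16T18:30:00Z", "second"),
  ])

  # lexicographic order == temporal order for these strings
  rows = db.execute(
      "SELECT note FROM events WHERE ts < ? ORDER BY ts",
      ("2021-02-16T12:00:00Z",),
  ).fetchall()
  print(rows)   # [('first',)]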


Chill out. This is a design choice to make it more viable to be used in dynamically typed languages. It's maybe not the design choice everyone would make, but it's not, like, insane: this is how all variables in Python/JavaScript/Ruby/whatever works.

If you really care about this, adding strongly typed columns is trivial: https://dbfiddle.uk/?rdbms=sqlite_3.27&fiddle=9baffa184672a7...


I mean, you can represent everything under the sun as a string (or a very big int) but that's a level of operating where I find my brain starts to get fatigued for absolutely no real gain at all.


We need to store weights in kg with three decimal places. If we sum to get a total, the total should be correct according to those three digits.

I'm sure we could get it to work in SQLite, but it sounds like we would have to have some layer of manual fudging that we couldn't forget about.

In our current database we just use "numeric(16, 3)" and no worries.


At the risk of sounding glib, you could store everything as integer grams?


Yes of course. That's why I said I'm sure we could get it to work. But then it has to be converted to kg for actual usage. Either directly in the queries or views.

Having a fixed-point decimal type allows us to not think about it once the table is created.

Same issue with money. Most of the time we need to store monetary values with two decimal places (cents).


What kind of application are you developing where there is so little middleware between the database storage layer and the user that converting to/from kilos adds significant additional complexity, but your users aren't technical enough to just deal with having weights in grams?


These are electronic versions of official forms. The user interface and reports should show what is actually being sent or has been sent.

Sure, we could deal with it, but there are quite a number of tables due to different forms and messages, and then there are all the reports. Many custom ones, thanks to local officials wanting data from a certain customer in a certain way...


Bureaucratic forms have a tendency to change unpredictably. Today they want weight in kilograms, tomorrow they'll want it in grams, and the day after that in standardized intergalactic weight units. I get what you're saying (and I don't like writing unnecessary mapping boilerplate too), but there should be a middle layer to protect the application from craziness like that.


Thankfully things in our sector are fairly static in that regard, so this is seldom an issue. That sounds quite horrible indeed.

And yes, a fatter middle layer would be nice. Our next generation software will probably have more of that, this code base is over 20 years old at this point...


I realize this response turned out a lot more harsh than I intended. I was in a slight rush so didn't formulate myself well, didn't mean to be abrasive.


> Most of the time we need to store monetary values with two decimal places (cents).

Yeah who cares about fractions of pennies anyway. Just makes things complicated.


Yeah, for being so uptight about just about everything, the officials seem quite relaxed about values slipping between the cracks due to a lack of decimal places.

Like, total invoice value can only be specified with two decimal digits, typically in foreign currency. Yet we also have to specify per-line value in local currency, also with only two digits. And then the per-line values are used to calculate taxes and whatnot...


The US IRS (tax agency) allows you to round nearly all values to dollars ignoring cents altogether.


You could store it in grams as an integer.


With this sort of perspective, why use a database at all? You can store literally any digital object as some sequence of bytes, and that's what file systems are for.

At some point, you actually like for the software you use to actually have meaningful features.


Querying filesystems is extremely primitive when compared to what sqlite can do.


So is storing dates as integers, and json as strings.


But dates are integers (all databases store dates as an integer that is an offset from some fixed date), and JSONs are strings. SQLite has the functions to work with these datatypes (dates, JSON, etc...) but fundamentally they are not different from integers or strings.


And integers are bytes and strings are bytes and tables are bytes and indexes are bytes. If you're not happy with databases being intelligent, why stop with integers and strings, when you can do everything you want with bytes?! Memcached, or LevelDB, or even a file system is the ultimate database if you follow your arguments to their logical conclusion.

But data representation isn't data. You can't look at a serialized sequence of bytes and know what it represents without context - without a serialization scheme. For relational databases, the table definition is the serialization scheme - it is the context. For example, you can't know what date an integer is representing without contextual information like the offset. By storing dates as dates in your database, that context is baked in.

It is helpful to have that full context inside your database because it allows you to operate that database more efficiently (by carefully indexing on the properties of the data, not the data representation), but it also allows you to use that database in ways that are not tightly coupled to your application, such as analytics, because you are storing data itself and not just the bare minimum required to represent that data.


Because SQLite lets you index and query the data without having to write your own layer to do that.


Having rich data types allows a lot of things without having to write your own layer. Spatial indexes are extremely efficient, and I can't query geometries efficiently when I'm storing geometries as a blob. The same goes for dates, json, XML, ranges, etc.

The problem with reserving specialized logic for the application layer is that it limits you to simplistic indexing schemes and you end up doing excessive IO and filtering in memory to get what you actually want.

The idea that databases shouldn't have specialized datatypes is really only an idea that works in simplistic crud apps. The world is much bigger than that.


FWIW, we use https://cgsql.dev/ to solve the type issue, among others. I totally understand that this is not for everyone.


I tried to switch from MySQL to SQLite for my Postfix/Dovecot installation but I ran into a very annoying problem: every now and then I lose an email because the "database is locked".

This is a show-stopper for me because the problem happens very rarely, only once every couple of days, but it's catastrophic: when this happens, the incoming message is not bounced, it is actually lost. The only reason I even realized it was happening is because I noticed there were emails in the root account, which is the error-reporting mechanism of last resort.

I've searched the web in vain for a solution. If you have any suggestions, I would love to be able to stick with sqlite, but at the moment I am about to begin migrating back to MySQL. :-(


This sounds like a problem with Postfix or Dovecot. Postfix shouldn't ack or Dovecot shouldn't delete the email until it is safely stored.

That being said there are many use cases where having high availability is critical and in the face of multiple writers SQLite isn't the best option for that.


I don't think this is a multiple-writer problem. Postfix is only reading. I am running a milter that is writing, but I control the code for that so I have it set up to retry if it fails. So the error is being generated by postfix itself, and so it must be happening on a read (because that is all postfix does).

I was hoping to find some kind of global switch that would make sqlite always wait for locks rather than throwing an error. But I've scoured the web for such a solution without success :-(


There is no such thing, to my knowledge.

What can be done is registering a busy callback:

https://www.sqlite.org/c3ref/busy_handler.html

I don't understand enough about your specific problem to know if this will actually help you, just sharing a tidbit I encountered working on a comparable issue.


It's terrific right until you need multiple processes writing to the same database. It's no accident that SQLite is fast for many small queries; it's not doing a lot of the work required to, say, be a good database backend for multiple web frontends.


Yeah, I don’t really see sqlite as competing with postgres or mysql for this reason. It’s an alternative to the file system.


Good observations from a MySQL perspective. Any thoughts from the other end, where the alternatives are JSON or XML or ZIP? SQLite tries hard to convince you to use it as an application file format, but it looks like a giant black box of overkill: why incorporate its 200k SLOC when the alternatives are a fraction of the size?


Is 200kLOC really a lot? Lots of software nowadays has hundreds of megabytes of dependencies, and people seem to be fine with that. Not that I think having tons of dependencies is really a good thing, but for an application that needs a file format, SQLite is a very sensible dependency.

Neither XML, JSON nor ZIP solves the problems SQLite does, though; if you use plain old files, you need to make sure any changes you make actually end up on the disk, consistently. This is not easy to do. SQLite also solves any consistency issues that might stem from someone reading the data while you're writing it.

On top of being just better, having a relational model for your data gives you much more freedom to use said data; you'll be able to do things efficiently that might require restructuring your JSON or XML format. Personally, I love SQLite-based application formats because I can explore them with SQL, which is often much easier than trying to make sense of a custom JSON or XML schema.


Good reply, thank you.

Yes, 200k SLOC is huge (modern development practices notwithstanding). SQLite creates temporary files on a whim - nine different kinds! https://sqlite.org/tempfiles.html

I know how to atomically write a JSON file. But when I read, for example:

"The temporary files associated with transaction control, namely the rollback journal, super-journal, write-ahead log (WAL) files, and shared-memory files, are always written to disk. But the other kinds of temporary files might be stored in memory only and never written to disk. Whether or not temporary files other than the rollback, super, and statement journals are written to disk or stored only in memory depends on the SQLITE_TEMP_STORE compile-time parameter, the temp_store pragma, and on the size of the temporary file..."

My eyes have completely glazed over. If I add this to my app, what will it actually do? How can I even know?


> I know how to atomically write a JSON file.

Are you sure? I’ve had a lot of trouble getting that to work reliably myself across multiple OSes. (In hindsight I wish I’d used SQLite!) This article gives a good explanation of the many difficulties:

https://danluu.com/deconstruct-files/

> My eyes have completely glazed over. If I add this to my app, what will it actually do? How can I even know?

Well, fundamentally it’s very hard to get it exactly right, and I imagine that’s why the implementation is a little involved.

But you could a) read through those docs, lengthy though they are, and/or b) trust the many testimonials saying SQLite is very, very robust and reliable.


Here's another good review of the pain you get if you want to get your data to disk safely.

(SQLite does this for you automatically, BTW.)

"Ensuring data reaches disk"

https://lwn.net/Articles/457667/


> (SQLite does this for you automatically, BTW.)

Unless you're on nfs. Remote file locking is hard, and I don't think that any nfs implementation has gotten to the point where you can trust SQLite on it.

SQLite does updates in place, which I would trust far less than a rename call.


> Are you sure?

No, and anyone who says yes is lying. (Lockless NFS exists and is no fun.)

> Well, fundamentally it’s very hard to get it exactly right, and I imagine that’s why the implementation is a little involved

SQLite has set itself the horrible task of updating files in-place. I know of two reliable, simpler alternatives:

1. Appending to files through O_APPEND

2. Rewriting files through rename()

If SQLite has different magic syscalls then I would very much like to learn.
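
For reference, a sketch of option 2 in Python (file name is a placeholder; for full crash safety you would also fsync the containing directory, and the details vary by platform):

  import json
  import os
  import tempfile

  def atomic_write_json(path, obj):
      d = os.path.dirname(os.path.abspath(path))
      fd, tmp = tempfile.mkstemp(dir=d, suffix=".tmp")
      try:
          with os.fdopen(fd, "w") as f:
              json.dump(obj, f)
              f.flush()
              os.fsync(f.fileno())   # file contents reach the disk before the rename
          os.replace(tmp, path)      # atomic rename over the old file (POSIX)
      except BaseException:
          try:
              os.unlink(tmp)
          except OSError:
              pass
          raise

  atomic_write_json("settings.json", {"hello": "world"})   # hypothetical file name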


I don’t think you can atomically append more than one byte to files in unixes (the write call can return after having written some but not all requested bytes)

(Haven’t googled, but if that’s possible, I don’t see why write would have that limitation)


Interesting and confirmed in the write() syscall man pages. Thanks!

Do you have any other resources regarding these types of low level "gotchas"?

I remember PostgreSQL having such an issue two years ago for example.


Yeah, and eventually we reach the best-effort bedrock. Maybe the file is on a NFS mount, you call write(), it goes over the wire, who knows what happens!


Dan Luu's article mentioned these two options as unreliable. What's the reason for the disparity?


> If I add this to my app, what will it actually do? How can I even know?

Be really, really, *really*, unambiguously sure about whether your data was written or not, AND have high confidence that I/O errors (e.g., power loss) in the middle of a bunch of deletes won't scramble (or truncate) existing data.

What you're looking at is the complexity required to solve for the wonderful tornado of "but is my data really written???". But you don't have to deal with SQLite's implementation details in order for it to do its thing, which is what makes it so awesome (given its public domain status, what's more!).


> I know how to atomically write a JSON file.

Sure. But that forces you to rewrite all the data at once. Once it becomes large or you require more frequent changes, that will impact performance.


File formats based on ZIP files only work for small files. For big files you have huge overheads; opening and saving a moderately sized document takes seconds (vs. milliseconds for writing changes to an SQLite database). There's a reason why Excel files are limited to a million rows, while Access databases aren't.

The complexity of including SQLite is trivial for practical purposes; it's already available on many systems, and if not you can include it by adding a single C file to your project.

Setting up a workflow for Google Protocol Buffers (another popular alternative for document file formats) is a lot more complex than building or linking with SQLite, and it doesn't stop people from using them.

One thing that speaks for SQLite is the quality of the project; it's one of the best maintained Open Source projects with fantastic quality assurance and support for almost every OS. This means that you are unlikely to run into issues compiling or working with SQLite, like you might have with alternative libraries like libxml2 or jsonc (which are still great libraries!!).

EDIT: The big downside of SQLite is that it's unsuitable for documents that are exposed to the user because of the temporary files (like the WAL). If you have a ZIP based file format that you atomically rewrite from scratch on every save, it's almost impossible to corrupt. Your users can just take the file and email it and nothing bad will happen. I'm not sure what happens if you email an SQLite database file that is currently being used. I've done that in the past and have been surprised that some data seemed to be missing, but I don't recall the details. Hence SQLite is often used for application data files that are not directly exposed to the user.


You could get SQLite to work as a document file format exposed to the user so long as you use sessions[1]. When a file is opened, copy the DB to a temporary file or to memory and write all changes during operation to this new DB, recording them all in a session. When the user explicitly saves the document, apply the session to the real DB.

[1]: https://www.sqlite.org/sessionintro.html#:~:text=1%20Introdu...


Doesn't that just combine the drawbacks of both approaches? You end up with slow open/save operations, and the writes to the DB still aren't necessarily atomic (e.g. if a user copies the file while changes are being applied).


JSON/XML quickly stop being alternatives as soon as you need any sort of index, a memory-mapped/on-disk data structure that doesn't have to be loaded into memory, transactional or even just incremental writes. ZIP is not even directly comparable.


That's not normally what you need for an application format though, is it?


It's becoming normal, as users coming from phones aren't trained to use a "save" function and expect every individual change to persist.

I actually consider that a good thing. Doing everything in volatile memory until user asks otherwise is a relic from diskette era.


It's not limited to phone users. I've been using computers since the 1980s (C64), and I appreciate not needing to habitually keep pressing "save" every few seconds in Google Docs or macOS Notes.


#1 best jetbrains idea feature IMO - save on focus lost. Just alt tab into your app, or into your terminal to git commit, no worrying about "did I remember to ctrl-s".


I grew up in the Win 3.1-Win 98 era. I don't think the save reflex will ever quite go away. :)


Sometimes when I'm working on paper I reach for ctrl+s out of reflex ;)


It is a nuisance unless you have persistent undo. Phone-like apps have barely any undo.

For example: cut some content from a file to paste it somewhere else. Now the program saves and system crashes.


It’s bad enough to have to keep pressing Save manually, but I also have to do it regularly while using LibreOffice Calc since it keeps crashing. :-)


> That’s not normally what you need for an application format though is it.

Well, it depends what you mean by “need”. But continuous, incremental updates generally provide a much better user experience, either instead of or in addition to active “save” actions.

So, yeah, I think it's exactly something that is commonly desirable in a file format for maintaining application state, even if there is a different interchange format that the application produces/consumes as a static input or output.


During the alpha Minecraft divided the world into 16x16x128 grids of blocks called chunks. Each chunk was its own file. Large worlds suffered from very poor performance because there were tens of thousands of files in a single folder.

Some random modder basically just put multiple chunks into one file so that each file is 2MB. If Notch had just put the game world into a SQLite database he wouldn't have had to reinvent the wheel. There are games that did that, such as the alpha of Cube World and they work just fine.

Heck, Notch went one step further and invented NBT, aka Named Binary Tag, which is basically a weirdo binary file format that stores JSON-like data.


> Large worlds suffered from very poor performance because there were tens of thousands of files in a single folder.

It was using subdirs for the chunks, two levels iirc, one was chunkX % 36, the next level chunkY % 36. So there weren't that many files per directory. The slowness came from the overhead of opening, read/write and closing so many files all the time.

> Some random modder basically just put multiple chunks into one file so that each file is 2MB.

Almost, it wasn't limited by file size, it was putting 32*32 chunks into one file that was similar to a simple file system. The format of the individual chunks within that file stayed almost the same. Yet it performed much better.

NBT is indeed a little weird but fairly straightforward overall; I guess designing and implementing it just scratched an itch. It was a hobby project after all.


I'm actually currently working on a user mode FS using Dokan for Windows that saves everything to a SQLite file for similar reasons. NTFS just doesn't do well at all with lots of small files.


Interesting. Did you test performance on any non-NTFS filesystems?

I remembered this post. https://github.com/microsoft/WSL/issues/873#issuecomment-425...


> Any thoughts from the other end, where the alternatives are JSON or XML or ZIP?

ZIP isn't a format alternative, it's just a compression and/or packaging technique for files which you still need to choose a format for.

JSON/YAML/XML are great for input and output formats, but not great for continuous, random read/write access.


My understanding is that SQLite doesn't impose any format either?


> My understanding is that SQLite doesn't impose any format either?

It's true that you can stuff any kind of string/blob data into any column of any table, so, yes, you still have to determine the data schema with SQLite much as you do with JSON, XML, or even CSV. I mean, I could have a CSV where each element is a base64-encoded ZIP containing SQLite database files that are each a single table with a single column of JSON files, each of which contains a JSON array of strings with XML documents in them.

But that's usually not what people would mean if they said their app was using CSV as its data storage format, nor is the version stripping out the CSV on top what people would mean if they say they are using SQLite.

With ZIP, you have to decide the format(s) for the file(s) in the ZIP, their hierarchical structure, and, if the files aren't themselves the atomic data elements, the schema applicable to each file.

Furthermore, in discussion of performance characteristics and other aspects of suitability, ZIP adds overhead, but you still also need to consider the access properties of the contained files.


I'm genuinely curious what you mean by this. Of course SQLite imposes a file format... That format is a SQLite database


> Of course SQLite imposes a file format... That format is a SQLite database

Well this is in a context that rejects zip as being a format. Do you do that? If the answer is no then skip the rest of my post and just note that they're talking about a different definition of 'format'.

-

But in that context:

The amount of structure imposed on you by the sqlite database format is not much more than the structure imposed on you by a zip. I think it's fair to rate them similarly as formats. A zip file is basically a key-value store.

"Zip full of csvs", while awful to use, would impose about the same amount of structure as sqlite does: not much. And zip+csv is not much more elaborate than zip on its own.


In SQLite, I can have a table with a column like `flag INTEGER NOT NULL DEFAULT 0 CHECK (flag == 0 OR flag == 1)`

Surely this is more structure than a ZIP file, which is merely a way of compressing a directory of files into a single entity, can provide?

Sure the INTEGER part doesn't really do anything...


That's a feature of the database engine, not the file format. You could attach a similar database engine to a zip backend.


This is an asinine objection.


I have a better one, if you want it.

A configuration like that comes from the program using sqlite. Just adding sqlite into a system doesn't set up any data formatting like that. Sqlite itself gives you a blank canvas. And a blank canvas is not much of a data format.


You're wrong about this.

The SQLite data format includes the schema, in plain ASCII. This self-documenting nature makes it an excellent data format; I've taken advantage of it numerous times in making use of SQLite-based application file formats.

SQLite is put forth as a basis for an application file format, and by definition it must be sufficiently flexible to accommodate any application. But by including the schema, it is self-documenting as to what the structure is, which ZIP isn't and can't be. QED.


I hope to make the modest point that, if you package your data in a SQLite file or a zip file, then you have both packaging and data problems.

MyCoolSQLApp may read and write a SQLite file with its own schema, but it can't handle an arbitrary SQLite file. Likewise MyCoolZipApp can't handle an arbitrary zip file.


I'm not clear on the point here. You can make an application that handles arbitrary SQLite files (for example: the SQLite shell, or a GUI database browser), or you can make an application that only uses SQLite files with certain content/schema. Does this mean that SQLite does or does not define a file format?

If not, what do you call the specification of how data is stored in a SQLite file besides a 'file format'?


It all depends on how much data you have and how complex it is. SQLite is a database after all; you can query it with SQL and do lots of fancy stuff that might be hard to do with regular file formats like JSON or XML.

If you just need a config file or only have a small amount of data you can use XML/JSON files that you parse yourself. If you are going to have loads of data that needs some structure (for example messages in a messaging app) I would use SQLite.


There's an enormous amount of comments and tests in that codebase. As installed on my Mac, sqlite comprises a 1.3MB command line utility and a 1MB shared library. That's absolutely tiny given the functionality it provides.


Go ahead and use text files and then have fun with data corruption issues. We use CSV for sending commands to IoT devices and it's an issue. If this had been done with SQLite, there would at least be no data corruption issues. One could even use SQLite as a storage container for JSON if one wishes to do so. They even have an extension that aids this with a useful set of functions:

https://www3.sqlite.org/json1.html
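
For illustration, here's roughly what that looks like from Python's stdlib sqlite3 module (untested sketch; it assumes your SQLite build has the json1 extension compiled in, which most recent ones do, and the table/field names are made up):

    import sqlite3

    con = sqlite3.connect("commands.db")
    con.execute("CREATE TABLE IF NOT EXISTS commands (id INTEGER PRIMARY KEY, payload TEXT)")

    # json() validates/normalizes the JSON text on the way in.
    con.execute("INSERT INTO commands (payload) VALUES (json(?))",
                ('{"device": "pump-1", "action": "start", "duration": 30}',))
    con.commit()

    # json_extract() pulls individual fields back out inside the query itself.
    for device, action in con.execute(
            "SELECT json_extract(payload, '$.device'), json_extract(payload, '$.action') FROM commands"):
        print(device, action)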


The JSON features of SQLite are extremely robust and performant. There is no reason to use raw JSON as the storage when you can just shove it into SQLite and lose almost nothing.


This sounds like your issue is avoiding data corruption: then atomic writes are sufficient; you don't need a SQL parser or query optimizer, etc.


Not only that, it enables us to do CRUD operations, list the commands, sort them by time, do limits, pagination, bundle a bunch of commands that enable a certain functionality in a transaction etc. SQLite has all of those and more and also avoids data corruption issues by design. Anyway, the path we took was to move everything to MySQL just because most of the other data is also in a MySQL database. Otherwise we would have definitely used SQLite.


How do you create structured data with ZIP?

why incorporate its 200k SLOC when the alternatives are a fraction of the size?

Performance, ACID and a superior declarative query language.


ZIP files are not a database: they are more like a directory hierarchy. But maybe all I need is named blobs: no query language parser, optimizer, indexing, etc.

SQLite positions itself as an improvement over ZIP for application file formats: https://www.sqlite.org/appfileformat.html . But minzip is so much smaller, easier to understand, debug and ship. So why use SQLite for an app if ZIP suffices?


If you're talking about something like cbr archives, you're right. It's comparing against usages like Word/Excel, which store a bunch of XML in an archive and call it a day.

If you’re not reading and writing out application state, then yes, you don’t need something to manage your non-existent state


Depends on your use case. XML and JSON are great for applications with simple data stores, having done this myself. But if you foresee a need for complex queries or locking and threads then SQLite might be a good choice.


The main challenge there is how you ensure your database is resilient to machine or datacentre outages, i.e. what happens if the one server with the database is in a datacentre that loses Internet connectivity?


SQLite "merely" assumes that the problems that come with distributed systems are handled at the application layer. You'll have to solve those problems for yourself, sure, but in practice I have rarely (I think never, actually) had data loss through a fault of SQLite.

Also, did you know you can use in-memory instances (and share them across threads!) with the right incantation? And that you can backup your on-disk instance to an in-memory one, do your expensive transactions without hitting the disk then backup the modified instance right back to disk, even in-place if you want!

Sqlite is amazing when you don't expect the DB to do replication or failover on its own.


No I did not know that, I've looked for a long time for a way to convert a sqlite3 database to an in memory database and then back again. Do you mean that there's support in sqlite3 for this? Could you point me in the right direction?


We both learned something new today. Looks like this is what you want in combination with using an in memory database.

I've been doing a handrolled in memory cache layer to speed data access, but with this, I can just call the db directly and then periodically sync to disk, redis rdb style. Sqlite is a staggeringly good piece of technology!

https://www.sqlite.org/backup.html
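
Something like this, as a rough sketch (Python's stdlib sqlite3 exposes the backup API as Connection.backup since Python 3.7; the file and table names are made up):

    import sqlite3

    disk = sqlite3.connect("app.db")
    mem = sqlite3.connect(":memory:")

    disk.backup(mem)   # load the on-disk database into memory

    # ... lots of cheap reads/writes against `mem` here ...
    mem.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)")
    with mem:
        mem.execute("INSERT OR REPLACE INTO kv VALUES ('last_run', datetime('now'))")

    mem.backup(disk)   # flush back to disk, e.g. on a timer or at shutdown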


In Python you just have to open the special file name “:memory:” to get a memory-based db. I don’t remember what the raw SQLite incantation is (or if it’s different). Also, pay attention to “ATTACH” - it’s the way to use multiple databases (file and/or memory) while still letting SQLite handle it all (e.g. join a memory db to a file db, insert result into 3rd file db - all without having to look at records in your own code)
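
A rough sketch of the ATTACH pattern (untested; the database and table names are invented for illustration, and src.users is assumed to have an `id` column):

    import sqlite3

    con = sqlite3.connect(":memory:")              # "main" is an in-memory db
    con.execute("ATTACH DATABASE 'users.db' AS src")
    con.execute("ATTACH DATABASE 'report.db' AS dst")

    # Build a scratch table in memory, join it against one file db,
    # and write the result straight into another file db.
    con.execute("CREATE TABLE wanted (id INTEGER PRIMARY KEY)")
    con.executemany("INSERT INTO wanted VALUES (?)", [(1,), (2,), (3,)])
    con.execute("""
        CREATE TABLE dst.selected AS
        SELECT u.* FROM src.users AS u JOIN wanted USING (id)
    """)
    con.commit()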


It's the same in plain sqlite.


In other words it is just a regular object you serialize from time to time...


There is always DRBD as well. You can make replication a lower-layer problem. Not that it's without drawbacks.


I feel like the sibling comments here are basically just saying "yep, that's the main challenge!" without providing useful tips. I personally haven't used it, but I'm aware that this library exists to help resolve this challenge. https://litestream.io


And as it happens this featured on HN just 4 days ago: https://news.ycombinator.com/item?id=26103776


Just open your SQLite database in read-only mode :)

SQLite works really well for static or semi-static data. For example, a blog where you have a small number of users writing and many users reading from the DB. If the authors are content to use one server to edit the DB then you can easily push that DB to the servers handling the reads.
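
In Python, for example, that's just a URI-style connection string (sketch; the blog.db/posts names are made up):

    import sqlite3

    # mode=ro opens the file read-only; attempted writes raise OperationalError.
    con = sqlite3.connect("file:blog.db?mode=ro", uri=True)
    posts = con.execute("SELECT title FROM posts ORDER BY published DESC LIMIT 10").fetchall()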


Yes this can work, however you are mostly relying on the operating system's file system cache for speed. Other databases will try harder to keep their own cache. But true, there is lots of room where SQLite works nicely.


Some say replication is an application layer concern, not a serialization concern.

I don't agree or disagree but it's something I've heard.

I've seen sqlite used as a cross language data frame solution. Store it in s3 and it's resilient if you are read only.



rqlite looks neat. I'd be interested in hearing any major success stories about it.


TBF: you don't. The moment you care about any shortcoming of SQLite, move away.

One of the cool things about it is that SQLite is very lax about what it accepts (mostly in the datatype area). You can write your SQL statements targeting whatever database you think you'll move to later and they'll work while you're still on SQLite. I believe having this migration work seamlessly towards PostgreSQL is one of the advertised features.


I would say that's a case for using a proper replicated RDBMS if you need that level of replication. Sqlite is not a hammer for all occasions.


I'm using SQLite at the moment; on the one side there's a 'legacy' (read: poorly written 2012) application, on the other there's the new and rebuilt version. The old one was not built very well, it does not use foreign keys or any kind of database constraints (it references other entries by name in a column of comma-separated values) and it runs like trash. But the performance problem is not in the dozen queries it runs to load the data, it's in the fact that it converts the query result to XML (via string concatenation, because of course) and that is converted to JSON; the conversion is at least 60% of each request. The other problem is that it writes and re-queries the data whenever you leave one of the hundreds of form fields in the application.

I'm rebuilding the application in a modern tech stack, still using SQLite but properly this time, along with Go and React. API requests take 20-40ms instead of 300-1500ms, and there are far fewer of them.

The main downside to using SQLite is that it does not support "proper" database migrations; you cannot alter a column. You can add columns to an existing table, but you can't change existing columns. The database abstraction I'm using at the moment, Gorm (a different subject entirely), works around this by moving stuff to a temp table, recreating the table with the updated columns and moving stuff back, I believe.

Anyway TL;DR sqlite is not the bottleneck.


SQLite did get support for renaming columns recently [1].

Anyways, the process you describe is also used in MySQL for doing online schema migrations. [2] "proper" database migrations cause downtime

[1]: https://stackoverflow.com/questions/805363/how-do-i-rename-a... [2]: https://github.com/github/gh-ost


I reach for SQLite if I need persisted state for a local application or a custom file format, but why use it for things that may need more write concurrency, like a web server?

Postgres is basically just as easy to use and backup.


> Postgres is basically just as easy to use and backup.

SQLite is way ahead on this: no daemon to run, no user / database to create, manage and administrate, no authentication to set, no socket connection to manage… backup is as easy as it gets: (copy one or two files).

Postgres is still largely manageable of course.


Well, backing up by copying doesn't necessarily result in a consistent state if there are writes to the database. For that you have to use the SQLite `.backup` command (or the backup API https://sqlite.org/c3ref/backup_finish.html), after which the backup database has to be copied over to backup storage (or backup storage has to be mounted to the production system, which is dangerous)


Use "rsync" instead of "cp". After it finishes copying, it will check to see if the file has changed since it started copying, and will restart the copy if so (with a limited number of retries).

If copying the entire database is faster than your average update rate, this will converge very quickly and will deliver a consistent copy.

For many small applications, this is perfectly fine. It's not much harder to just run `sqlite3 "$file" ".backup '$backupfile'"` (that's literally all it takes, and what you should do) and guarantee consistency. But it's nice to know that a simple rsync is sufficient for slowly-updating uses.

And as for the other side of backup, you know -- restore -- sqlite shines brighter than everything else. You can just take a good copy and put it back. You can examine the file everywhere, on a read only system, etc - without configuring anything if needed.


Yes, rsync is a great tool; in some cases it might even find which blocks changed and only retransmit those.

However if you have frequent writes I would be careful.


SQLite loses most of its edge in concurrent write scenarios, but its read performance is difficult to beat.

A lot of it comes from what TFA says: there's no network roundtrip, but a function call. Even in a local machine, a unix socket query will carry at least a couple of system calls with potential context switches, and that makes regular RDBMS lag behind when you do tons of sequential and small queries.

Of course, when you have large results or complex queries that eat a bigger chunk of the time cake and that technical advantage wanes. After that, which RDBMS has the performance lead is largely workload-dependent.


An interesting consequence of this characteristic of SQLite is that it makes it a really good fit for GraphQL.

GraphQL has the N+1 queries problem pretty much baked into it. You can work around it to a certain extent with a lot of effort implementing data loader patterns, but someone will still likely be able to come up with a query that triggers a cascade of SELECTs.

Or, because "many small queries are efficient in SQLite", you could use that instead.

I built a Datasette plugin that exposes a GraphQL API to an existing SQLite database and it works so well that I switched from being a GraphQL skeptic to a GraphQL advocate.

https://simonwillison.net/2020/Aug/7/datasette-graphql/

Here's an example nested GraphQL query that executes 32 select queries. It's pretty snappy: https://datasette-graphql-demo.datasette.io/graphql?query=%7...


hasura does fairly good query optimization using postgres json_agg, that's my preferred way - if more graphql engines did the same that'd be amazing.


It does not have that problem baked into it since it relies on resolvers. It would have to be a choice to have that problem.


This seems a little pedantic. It may not be literally required by the technology, but GraphQL certainly makes it easier to introduce N+1 queries than a traditional REST API.


In practice, I find REST APIs also have N+1 problems the same way GraphQL does. They just get spread across multiple requests.


If anything I'd say REST is dramatically worse in this respect. There's no structure for nested queries (or anything except thing/:id really), so the only broadly compatible option is to pull every piece as a separate GET request.

Sure, you can use query params... but there's no implied support nor semantics, so one site will do one thing and another will do something from a completely different universe of architectural patterns, while a third will just have nothing, and there's no way to reconcile those in a consistent way.


You don't need to have 1:1 equivalence between API resources and database models, though. For example, you can create an API resource called TweetActivity, and then the backend code for GET /tweet-activity could put together a database query that grabs all the tweet likes, comments and basic commenter info (name, profile image), from different database tables, into a single new object. You can give that object the same ID as the tweet itself, and you can put a cache layer around that endpoint to, for example, save that response for the next 1 minute.

That being said, one thing you give up is providing a standard way for the client to specify which fields it needs returned. For example, a client might want to dig deeper into the commenter profile info. GraphQL's resolvers architecture opens up that possibility immediately.


Agreed on pretty much all points, but you've just described a way to optimize an N+1 system without changing the fundamental N+1 aspect. The caller is still making N+1 requests and incurring round-trip latencies (end-to-end cost can at best be reduced to 2 with parallelism), and the server still has to receive N+1 requests. REST has no semantics to change this.


In a GraphQL API backed by a single database you could write resolvers that analyze the entire query and rephrase it as a single database query. But I'm not aware of anyone doing this. The common/straightforward solutions resolve each layer after the other, which you can optimize to 1 DB query per layer of the GraphQL query using dataloader patterns, with further optimizations for known common patterns.


Some people do try to generate a single SQL query that covers all nested resolves. See Join Monster [1]. I'm skeptical that it would ever work well in SQL.

Datomic, being close to a graph database, makes constructing a single deep query for all resolvers fairly straight forward. This is the approach my team is taking now. It's worked quite a bit better than my DataLoader biased intuition suggested.

[1] https://join-monster.readthedocs.io/en/latest/


> But I'm not aware of anyone doing this.

I have, for a bespoke internal application. PostgreSQL is actually capable of expressing GraphQL queries as (rather elaborate) SQL queries. You end up generating queries that use a lot of LATERAL sub-selects. The risk is generating queries with poor performance, and that risk is high enough that I don't think this is a viable approach for complex applications.

Which is a shame.


Yeah, I think there is a perception that people want direct queries into their database, but that is actually not the correct way to think about graphql.


A straightforward implementation certainly would have that problem. You have to put a decent amount of effort in to reduce query count to fill a nested query graph.


Not really. A straightforward implementation has nothing to do with database usage. A resolver is an endpoint. You can leverage that like you can in any other http backend server.


Nested resolvers can incur extra queries as you follow down the tree, unless you put extra effort into pre-fetching what child nodes need, or some kind of data loader as GP suggested.

  query {
    parent {
      child {
        field
      }
    }
  }

A straightforward, naive implementation would resolve `parent`, then a resolver for `child` would execute, then for `field`. If `child` is a table linked by an FK, a second query would be executed, unless you pre-fetch the results through a join when resolving `parent`.
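
A library-agnostic Python sketch of what that naive resolution looks like (the resolver names, tables, and the `db` handle are all made up for illustration):

    # `db` is assumed to be a DB-API connection (e.g. sqlite3).

    def resolve_parents(db):
        return db.execute("SELECT id, name FROM parent").fetchall()   # 1 query

    def resolve_children(db, parent_id):
        # Called once per parent row -> N additional queries.
        return db.execute(
            "SELECT field FROM child WHERE parent_id = ?", (parent_id,)
        ).fetchall()

    def run_query(db):
        result = []
        for pid, name in resolve_parents(db):
            children = [row[0] for row in resolve_children(db, pid)]
            result.append({"name": name, "children": children})
        return result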


No, replacing sqlite with leveldb/rocksdb/lmdb will give you ~10-50x improvement.


I'm interested. Any links with benchmarks or example applications demonstrating the difference?


> The Appropriate Uses For SQLite page says that dynamic pages on the SQLite website typically do about 200 SQL statements each. This has provoked criticism from readers. Examples:

> "200 SQL statements is a ridiculously high number for a single page"

> "For most sites, 200 queries is way, way, way too much."

> "[This is] bad design"

Funnily enough, at a recent workplace the guideline was "if it only does 200 queries, you've optimized it enough".


Django is really bad (good?) at abstracting this kind of stuff away.

    <ul>
    {% for friend in user.friends.all %}
        <li>{{ friend.name }}</li>
    {% endfor %}
    </ul>
The problem is that if you have 200 friends then it will make 200 queries. If you have 1,000,000 friends then, well, this will crash. The Django solution would be to fix-up your "user" queryset like:

    user = User.objects.filter(pk=<pk>).prefetch_related("friends").first()

Pay the query cost just once, instead of on every iteration of the loop!


This is very similar in Rails.

Only in Rails it is far easier to accidentally introduce interfaces that do this: the downside of having a very easy DSL for quickly adding relations between records/tables is that people very quickly add relations between records/tables and introduce spaghetti.

And in Rails it is very hard to debug, because ActiveRecord tries hard to be smart, but this "smart" quickly degrades into "black magic" when stuff gets complex (mostly caused by: see above, in my experience).

Edit: currently working on a codebase where 3000ms queries to get 200 items is no exception, where 500+ queries to build a page is common and where, if you change a foo on a bar, suddenly at the other end of the app, your blargs and blazzes start blurping out SQL that used to run 200ms but now adds some wheres and limits and group-bys that make it run over 2000ms. Yay for Rails trying to help!/s


You might check out the gem "fasterer" which warns about n+1 queries.


We use Bullet and ScoutAPM, so we are well aware of N+1 queries. Edit: but thanks for the suggestion. Will have a look to see if it works better than Bullet.

There's a difference, however, between being aware of them and fixing them. Especially when they were caused by an unrelated change; when you cannot just roll back that change.


Yep, that was pretty much exactly the issue.

Except this was in a home-rolled solution which made it incredibly opaque where the ".friends" access actually happened (or whether it was actually ".acquaintances.where(type='friend')").

Made it very easy to get pages on the road, which just turned incredibly slow later down that same road, and fixing it wasn't trivial without knowing exactly what to preload when.


I have never encountered this issue in Django. Perhaps your post lacks some context. Is `friends` a custom method/property? Asking because I'm used to seeing attributes like `friends_set`.

The only scenario I see where your example would need `prefetch_related` is if your snippet looked something like this:

    {% for user in all_users %}
        <ul>
        {% for friend in user.friends.all %}
        <li>{{ friend.name }}</li>
        {% endfor %}
        </ul>
    {% endfor %}
To avoid the classic N+1 query problem, all_users must be loaded like so:

    User.objects.all().prefetch_related("friends")
If you're writing verbose expressions such as this in Django:

    User.objects.filter(pk=<pk>).prefetch_related("friends").first()
You must be doing something wrong.


Based on context, GP's example is a ManyToMany field. "friend_set" is what you'd get on a reverse ForeignKey or reverse ManyToMany by default, but can be changed with the "related_name" kwarg.

Either way you're right, GP's example as given doesn't have the 1+N problem (it's 1+1 at worst, a single query for user and a second single query for all friends), there's definitely missing context if they are seeing it.


I currently have a project with Django and you made me look this up because I remembered it differently. And yes, there is in fact a cache layer between the QuerySet (.all) and the iteration.

See: https://docs.djangoproject.com/en/3.1/topics/db/queries/#cac...


With Hibernate, you have something called "open session in view" that should always be disabled to prevent bad practices like this. I really hope there is something similar for Django, as I may have to rewrite a Spring Boot application in Django (politics and lack of Java resources).


Why is that not already done by the framework? 99.9% of the people using that for loop would benefit from having it in 1 query instead of N queries.


Because that's extra work done by the DB for no reason if you don't intend on using the Friends field (say, you're only getting user data to render the "Welcome, {{username}}" text).


> "200 SQL statements is a ridiculously high number for a single page"

> "For most sites, 200 queries is way, way, way too much."

> "[This is] bad design"

Now, read these statements again, this time as excited enthusiasm expressed by a developer in awe of this rule-breaking architecture for web development, enabled by SQLite.

The submission really provides a practical example of the developers dogfooding SQLite, and of designs that would be impractical with other tools. Great!

Support your local


I can relate to this. It's so easy to write some code that fires off 10 queries, but it's obscured by the ORM; then it's in an index listing which isn't paginated, and eventually some user does something weird and ends up with 1000 items, so this one listing runs 10,000 queries.


In my experience, it doesn't take the user doing anything weird - all it takes is to use ORM enough in the project, and suddenly you have hundreds of separate queries being run to fetch something trivial, which could be done in one or two queries of hand-written SQL.


In my experience, those "one or two queries of hand-written SQL" quickly balloon out of control into a litany of crimes, because the assumption they will be better relies on having the kind of developer who can easily write "one or two queries of hand-written SQL" but are somehow incapable of reading the documentation or using an ORM correctly.


Writing SQL is so much easier than understanding ORMs. ORMs tend to be very complex, because the problem they're trying to solve requires that complexity.

Now maybe my perspective is biased too much, since the only ORM I've really used is Hibernate. But in that experience, I learned that Hibernate brings a ton of complexity with no real value. Seems like a high price to pay to avoid learning SQL.

The sweet spot seems to be using a library like JOOQ, which lets you write queries, as well as safely construct queries at runtime.


Just imagine the horror those using Magento would get when they see how many queries it runs for doing anything at all!


I have only worked with PostgreSQL, and when my wife (an accounting professor) needed a DB to teach her students SQL, I heartily took her down the SQLite path because of its simplicity. It may not have been the best decision, as I now see her fighting with its warts, like:

1. No type checking! Declare a column as int in the DDL, insert text, and SQLite happily stores it without an error :-(

2. No full outer join... instead my wife taught the students the standard SQL syntax and told them to use UNION with SQLite because it does not support that syntax. This is where it gets annoying.
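
For reference, the usual emulation for point 2 looks something like this (a sketch with made-up tables; UNION ALL plus a NULL filter is one common way to write it):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders    (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    """)

    # Left join in one direction, then left join the other way and keep only
    # the rows the first half missed; together that's a FULL OUTER JOIN.
    full_outer = """
        SELECT c.name, o.total
        FROM customers c LEFT JOIN orders o ON o.customer_id = c.id
        UNION ALL
        SELECT c.name, o.total
        FROM orders o LEFT JOIN customers c ON o.customer_id = c.id
        WHERE c.id IS NULL
    """
    for row in con.execute(full_outer):
        print(row)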

I can’t recollect, but there were issues with views as well which sounded really weird.

I now think the choice of SQLite is more nuanced than I believed it to be.


> No full outer join..

I'm kind of shocked that it doesn't offer a full outer join, as it is such a basic part of relational algebra, really in some ways the conceptually simplest kind of join... but I also am not sure that I have ever even once actually wanted a full outer join.

You really do mean full outer join there? Or, quite possibly, is it me who's become confused about what a full outer join is?


    No type checking! Declare a column as int in DDL, and
    insert a text, SQLite happily stores it without an 
    error :-(
This is the one that absolutely baffles me. They very explicitly view it as a "feature, not a bug" but I am not sure why -- I've not seen a deeper rationale behind this decision.

https://www.sqlite.org/faq.html#q3

Nonetheless though, SQLite is excellent overall and I'm so, so thankful for it.


Duckdb is like sqlite but with the postgres SQL parser.


Does it also have the features? Like a returning clause? Or is it similarly gimped and restricted to a half-assed thread-incoherent function?

edit: I'll take the downvotes as a no, and apparently last_insert_rowid somehow has fans.


The Database Design course I took in college started on Access and moved to SQL Server which I found to be quite nice, may not be an option for your wife though since we already had student licenses through the college.

Maybe there's a PaaS or SaaS option that would work for her? Something like Jupyter notebooks maybe.


try h2database.com if you don't mind Java, has pretty full SQL support (except for weak support for CTEs)

Disclaimer: H2 dev


H2 is great. Thanks for your work on it.

We used it on my last project as an in memory database for running interaction tests between Dropwizard resources and the database in our CI pipeline. We had Liquibase running migrations during the setup for each test and DbUnit setting up test data.


H2 rocks. It is fantastic. Thank you.


I don’t think MySQL has/had full outer join either?


> For a 50-entry timeline, the latency is usually less than 25 milliseconds.

There are a lot of good ideas which don't scale down to a smaller single-core system.

For example, there's a CROSS JOIN in that example, which is just fine at the row-counts we might encounter on sqlite.

The general low-scale workloads + the in-process operation (the SQL query is a function call, not RPC), means that sqlite can avoid needing the complex query to compress the user intent into a single round-trip.

This doesn't apply in general & the article doesn't claim to do so.

In the past, I've had production systems which are blocked on mysql_parse() rather than the actual disk IO (anyone remember the mysql Drizzle client side optimizations & the Dynamic SQL extensions?).

The simplicity of 1+N is basically "simple enough and it always works", plus in a few cases we can actually use a different projection for the N queries.

As a side note to Fossil, I've gone spelunking into the SVN repos with sqlite (there's a .db inside your .svn/ if you still use it) & what you'd end up using is a lot of 1+N queries (ACTUAL_NODE).


CROSS JOIN in sqlite3 is used to force the order of the tables in the join. Note the query has a WHERE condition so is the same as an INNER JOIN

https://www.sqlite.org/optoverview.html#manual_control_of_qu...


From a live table (there are about 10 of these in the database):

    sqlite> select count(*) from <redacted>;
    49862386

Curious what row count you expect in non-SQLite databases on a regular basis.


If you're using sqlite, but for a static website, you might consider using duckdb. It's still evolving, but it has huge potential.

It's also an embedded single file database like sqlite, but stores data in columns. Duckdb to sqlite is like redshift to postgres.

Hopefully they will add compression for string columns, so that you can host your entire static site from a single file in a very efficient way.

https://duckdb.org/

https://youtu.be/PFUZlNQIndo


It literally says OLAP on their website. Why would I use this for a static website?


If you have a static website with a lot of data behind it, eg. a map visualizing some larger dataset, then OLAP database will be actually a very good backing storage.


That seems like an oddly specific thing to assume when suggesting it for use by static websites generally — afaict most static sites would see no distinct benefit from columnar storage.

That said, SQLite for OLAP is very neat.


An important reminder is to use the right tool for the job. SQLite is indeed amazing at what it does, and fits with many scenarios where it isn't often considered, but it isn't a silver bullet.

If you have a scenario where you must have multiple machines (for redundancy or resiliency), then SQLite may not be the "best" choice. However, if these machines were needed for performance reason, you may find that a single well-written SQLite-based solution can be performant enough to run on a single machine (while reaping all the benefits of its simpler approach).


Creating read replicas has become easy with SQLite, thanks to Litestream (recently on HN: https://news.ycombinator.com/item?id=26103776).

What is really hard with SQLite is efficient concurrent updates. But you don't need them very often.


What I've never got to grips with is the "use a single writer with SQLite" advice. That seems doable with a long-running application server, but if your application boots up on every request (like PHP, Python & Ruby applications generally do) how do you do this, when you have multiple simultaneous users?

It feels like SQLite is missing a separate gatekeeper-binary to act as the single writer: the database server as it were.


SQLite uses file locking to coordinate between many processes. Your performance will be similar to a single writer if we ignore the repeated DB open/close due to the app booting on every request.

I am not sure about real-world DB open/close cost, because it is considered good practice to always use some kind of SQLite connection pool if you need many readers/writers. Didn't get a chance to try.


A solution is to append requests as files to a directory, and have a single process that reads the files and performs the write on SQLite. This way you can avoid locks and use the OS to manage the coordination.
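
A rough sketch of that single-writer loop (the `events` table and spool layout are hypothetical; a real version would have the request writers create files under a temp name and rename() them in, so the loop never sees half-written files):

    import json, os, sqlite3, time

    SPOOL = "spool"                       # request files dropped here by web workers
    con = sqlite3.connect("app.db")

    while True:
        for name in sorted(os.listdir(SPOOL)):
            path = os.path.join(SPOOL, name)
            with open(path) as f:
                req = json.load(f)
            with con:                     # one transaction per request file
                con.execute("INSERT INTO events (kind, payload) VALUES (?, ?)",
                            (req["kind"], json.dumps(req["payload"])))
            os.remove(path)
        time.sleep(0.1)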


With Python at least, an application server with long-running processes is the most common way to deploy, for example with gunicorn. Though a typical config will have green threads and/or multiple worker processes, so I am not quite sure how the single-writer rule is enforced.


File locking and waiting on the lock? Assuming that open + update + close is very fast and not very frequent, the wait time will be near zero, and it can be done synchronously pretty well.


With digiKam, I just got to feel the downsides of trying to add MySQL support to an app that was developed exclusively for SQLite first. digiKam has experimental MySQL support so you can host your photos and the DB on your NAS and access it from multiple computers.

I assume that digiKam makes many simple queries for each individual photo when showing an overview page with thumbnails, tags and dates, as it's all stored in the DB, and it's notably slower than with SQLite. With GBit LAN it's acceptable; wifi makes you want to rip your hair out. It's not the bandwidth, but the latency. The SQLite case never made you consider fetching multiple thumbnails in one query, and now it's probably hard to redesign the whole thing to do that.


It's ridiculously tempting to try and use SQLite for everything.

It's so ridiculously useful and capable.


Your smart phone most likely has it already.


I might be mistaken here, but I believe that Apple's CoreData and Android's equivalent are both powered by SQLite under the hood.

Messages on macOS stores all your message in a SQLite file under ~/Library/Messages/chat.db


macOS and iOS both use SQLite pervasively. There's tons of little databases all over.

Chrome uses SQLite heavily for history, cookies, bookmarks… etc. So does Firefox.

SQLite is everywhere. :)


It must be easy for SQLite developers to apply for jobs. Let me into your company, you've already let me into your home, your office, your data center, and your pocket.


As amusing as it is to imagine, their team page only lists 3 developers[0], so I bet that turnover does not happen particularly frequently!

[0]: https://sqlite.org/crew.html


It's everywhere to the extent that I find it hard to believe it's not lurking somewhere on a default Windows install. I would very much like to be able to say to people "do not use excel for that. It will make you cry" and not have the response "corporate IT won't let me install anything better."


It definitely exists within Edge, as that's Chromium-based. Not sure it's available in a standalone form, though.



Tip: you must grant your terminal application "Full Disk Access" in Security & Privacy settings in order to poke around this directory.


CoreData is such an awfully engineered piece of software with a bunch of pitfalls that you are much better off just using SQLite directly.


The limited type system and loose data validation always prevented me from using SQLite.

It's a nice data store but a poor database in my opinion.


> The limited type system and loose data validation always prevented me of using SQLite.

This.

"limited type system" is actually an exaggeration in my opinion. The "data types" seem to only serve documentation purposes. You can always store any value in any column regardless of the "data type"


Wait, you can for example put a Unicode string into integer or number columns? That can't be true? I mean, really? ...


It is. They even consider it a feature.

You don't actually need to specify column types at all. You can just leave them out. Or make up entirely new types like `BLIB` or whatever.

Foreign key constraints are also not enforced by default but there is at least an option to turn that on.

I'm keeping a close eye on DuckDB which looks like a modern saner version of SQLite.


> Wait? Your can for example put an unicode string into integer or number columns?

Yes

https://dbfiddle.uk/?rdbms=sqlite_3.27&fiddle=4634e3821676ed...
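
The fiddle shows it, and it's just as easy to see locally with nothing but the Python stdlib (sketch):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (n INTEGER)")
    con.execute("INSERT INTO t VALUES (?)", ("definitely not an integer",))

    # typeof() reports what actually landed in the column: 'text', not 'integer'.
    print(con.execute("SELECT n, typeof(n) FROM t").fetchall())
    # [('definitely not an integer', 'text')]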


Yeah. Execute program has a short and sweet explanation with in-browser demonstration: https://www.executeprogram.com/courses/sql/lessons/no-type-e...


Is anyone aware of efforts to build a "strict mode"-type of setting, or a SQLite competitor? I'd love to use a single file as a db with stricter type enforcement.



If you use a typed language (like Java), this is a non-issue. The data will be written in the correct format. But if you use a non-typed language (like Python), this is also not a problem, by the very nature of dynamic languages. So I don't see where the problem is, unless you use SQLite from the console. Moreover, they have constraints that can be added to the db.


Why wouldn't you validate before inserting it into a database?


If a tool can do something automatically (type checking, other constraints) then it is often nicer to let the tool do it for you, instead of doing it manually.

Sqlite is very unusual in that it uses static type declarations, but it doesn't actually enforce them. That can be quite unexpected.

See also: "Why wouldn't you check the type of your variable before passing it to a function?"


This doesn't make any sense. Why do you want to fail on insertion and not fail during the acquisition step itself?

You have to write tons of wrappers and boilerplate to handle those insertion failures instead of cleaning up your data during acquisition.

>See also: "Why wouldn't you check the type of your variable before passing it to a function?"

Completely different workflow. The types of the data are known during acquisition.


But programming language types are different than SQLite types so that's already a mismatch that could introduce errors.

Also, there is much more boilerplate in man-in-the-middle-ing each database query to validate data rather than having a single entry point which validates the data types: the database.

Also, using any in-database feature like functions or triggers should also have this security baked in but won't due to SQLite's missing validation.


>But programming language types are different than SQLite types so that's already a mismatch that could introduce errors.

No...because they can be boiled down to the sqlite types and again it doesn't matter if you properly validate the data on insertion which you should be doing.

>Also, there is much more boilerplate in man-in-the-middle-ing each database query to validate data rather than have a single entry point which validate the data types which is the database.

No...you don't validate queries..you validate insertion

>Also, using any in-database feature like functions or triggers should also have this security baked in but won't due to SQLite's missing validation.

You don't need to depend on it...because you validated on insertion.


Humans validate when using GUI tools, and that's very loose checking.


I've been thinking lately that, in the unlikely event that I'll ever write a big computer game, I'd like to use SQLite as the save file. Too many games take ages to save or load the state, whereas updating a database with every state change seems much more painless. I don't have any experience with SQLite, but this article makes it sound like my idea would work very well.


Saving is usually done almost instantly because you just write a file. Loading is another beast in itself because the game engine needs to prepare the world. That's where most of the time goes, not in reading a file that's like a couple of MB on disk.

Sure, go with SQLite if you want as a save, it won't matter anyway in the big picture.


The SQLite pages are all a pleasure to read; all the badass statements remind me of the classic:

    SQLite does not compete with client/server databases. SQLite competes with fopen().


I can imagine that.

However, if you ever need to move out from SQLite to anything that is behind the network, the performance will tank—and if you've really taken advantage of this fast round-trip, refactoring them into fewer larger queries may turn out to be quite a task.


If you put, for example, PostgreSQL on equal footing (application and database on the same server, over Unix sockets), then I guess 200 SQL statements are not a problem either for the timeline example.


Wouldn't it still be over the network/IPC, unlike SQLite?


No. It will be an internal port-to-port connection. And it also depends on the driver. You can definitely have a crappy SQLite driver for your obscure OS that can absolutely take a lot of time for those 200 queries.


It is still IPC (inter-process communication) even though it wont go over any network. Probably still order of magnitude more overhead than a function call, but quite possibly it will be acceptable.


Nothing is stopping you from writing a PostgreSQL driver that uses memory-mapped files, and then your calls are as fast as a function call.

Like I said, it all depends on the driver.


I've been running it in production successfully for years. I have a read-only service that exposes the Geonames gazetteer using SQLite and it's simply beautiful. I can deploy both code and dataset at the same time.


I just started work on a simple state management tool based on SQLite [0], utilising SQL.js under the hood. Just wrote it last week too! It intends to be much easier to learn than Redux and MobX, but more powerful than React's context.

In theory, it should support most of redux's ecosystem too, such as reselect, though I have yet to create examples for it.

[0]: https://github.com/ziinc/memlite


Question: I'm building a 3D procedural geometry generator and I need to store and index it by tiles for hundreds of kilometers.

I'm curious if it can be efficient to store polygonal data in a database instead of a dedicated file format.

I'm never sure how to properly copy raw struct binary data from SQL directly into RAM. I think avoiding parsing the text output results in better performance, but I'm not entirely sure...


It sounds like PostGIS might be a good fit. It gives you a lot of tools for accessing and indexing geometry data, and it can scale to huge sizes.

Parsing text formats is unlikely to be a bottleneck, accessing the data is more likely to be the bottleneck. The correct data structure (schema) and indexes are going to make the biggest difference.

If you do complex computations on the data, and it fits in RAM, you'll probably see the best performance if you first load the data into data structures optimized for the computation, and it's less important how you store the data on disk.


> store and index it by tiles

Options include the https://sqlite.org/rtree.html module, and building an application-specific mapping from geometric indexes to an integer keyspace (https://github.com/google/s2geometry or similar).

We're using SQLite archives of many GB successfully without issue. As long as the primary keyspace is well-designed (see also https://sqlite.org/withoutrowid.html), ranged queries are extremely fast.

> I'm never sure how to properly copy raw struct binary data from sql, directly in ram.

BLOB columns and an application-specific serialization/deserialization step work well. memcpy to a struct works if you are absolutely certain that you know what the layout will be. All of the standard perils apply - alignment, internal padding, platform-specific layout, endianness, etc.

We're using Protobuf with success. I imagine Flatbuffers would also work well. I'd put Protobuf/Flatbuf and their competitors on the front of the tool shelf.
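
A minimal sketch of the blob-per-tile idea in Python (the schema, the tile key layout and the serialized payload are all placeholders for whatever you actually choose):

    import sqlite3

    con = sqlite3.connect("tiles.db")
    con.execute("""
        CREATE TABLE IF NOT EXISTS tiles (
            tx   INTEGER NOT NULL,
            ty   INTEGER NOT NULL,
            mesh BLOB NOT NULL,           -- protobuf/flatbuffer/your own encoding
            PRIMARY KEY (tx, ty)
        ) WITHOUT ROWID
    """)

    def save_tile(tx, ty, payload: bytes):
        with con:
            con.execute("INSERT OR REPLACE INTO tiles VALUES (?, ?, ?)", (tx, ty, payload))

    def load_area(x0, x1, y0, y1):
        # Range scan on the leading key column; the ty condition is filtered within it.
        return con.execute(
            "SELECT tx, ty, mesh FROM tiles WHERE tx BETWEEN ? AND ? AND ty BETWEEN ? AND ?",
            (x0, x1, y0, y1),
        ).fetchall()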


I am working on an app that needs some GIS features. Spatialite and GDAL cover my use case. You can even eliminate GDAL if you don't retrieve data as binary but it is useful in other places too.


Storing small blobs is a use case where SQLite often excels. By small, I mean in the range of <400kB. That is, if those are usually write-once, read-only. I can't remember the paper right now, but one of the results was that if you are repeatedly overwriting rows, then the performance will degrade gradually (until you VACUUM, presumably).

I've used SQLite for this quite often, since it's convenient to store additional metadata that can be queried with these blobs. I know of a person who uses an SQLite database for storing emojis in a chat client, as well.

You can open blobs directly as binary data, as well. Almost like files. See sqlite3_blob_open and friends.

The question you should ask, though, is what is your access/write patterns? Once you know where SQLite excels, you can accurately assess whether it fits your use-case. Without knowing more details, I can't say one way or another.


I was thinking of implementing almost exactly this for mesh decimation. Certainly a decade or so ago you had enormous point clouds being generated that were too big to fit in ram, and none of the common mesh decimation tools worked off disk. I imagine it's less of a constraint now.


3dcitydb with PostGIS on Postgres is pretty neat for those kinds of operations.



My only complaint about SQLite is that on https://www.sqlite.org/datatype3.html it reads:

>REAL. The value is a floating point value, stored as an 8-byte IEEE floating point number.

When I read this I parsed 8-byte as 8-bit, and made several embarrassingly incorrect design decisions when creating a table as a result.


You exchange network access against filesystem access, but what if your filesystem is a network device?


SQLite recommends not using network storage:

> You are advised to avoid using SQLite on a network filesystem in the first place, since performance will be slow

https://sqlite.org/atomiccommit.html#_broken_locking_impleme...


That's a serious limitation since many VMs in datacenters have block devices attached over network.


I think this is true for any RDBMS, though. If your block device has high latency, you will see bad perf when the DB tries to fsync.


But the block device latency is usually an order of magnitude lower (microseconds) than the network latency between the database and its clients (milliseconds), except when all clients are located in the same datacenter.


”Network filesystems” means stuff like NFS. Networked block devices should generally be fine AFAIK.


You exchange network+filesystem access against only filesystem access.

The server needs to store your DB somehow too.


Hmm, Nextcloud always warns against SQLite (but it is the default in the LinuxServer.io Docker image). It would be very easy, and nice, if it turned out SQLite was fine. It makes backing up easier, and would clean up my docker-compose.yaml.


Cloud volumes are usually IOPS-challenged and have shitty durability guarantees. If you’re going this route you should use a VPS or dedicated server instead.


It's in my basement on a Core i3 with a decent Samsung NVMe SSD.


You said docker. Doesn’t matter where it’s physically running, just that the persistence guarantees are not the same for containers.


I was addressing the "IOPS-challenged and have shitty durability guarantees". I don't use cloud volumes, so this shouldn't be a problem. I guess I was hoping for someone to tell me sqlite is fine in my case ;)


Ah, “shitty durability guarantees” comes from the container orchestration, not any downside of cloud hardware. I mixed up two likely reasons :)


For anything but the smallest ones, I would prefer to use PostgreSQL.

SQLite has too many quirks, and I sincerely feel it messes with our thought process when working on Postgres or the like.

As a file format, it works awesome, but it has its uses and limits.


As an aside (or perhaps related note), over this past weekend I was doing my first real-world usage of SQLite, since I typically rely on conventional databases for my day-to-day work.

I've read the article shared here in this thread in the past and read about the usefulness of SQLite generally, particularly for read-heavy situations, so I thought it would be a nice experiment to pull in the "Have I Been Pwned Password" list, import it into SQLite and see how fast it could be queried.

At first I didn't find too many resources out there of the same thing having been done, but after searching some more throughout the weekend I did find out some nice nuggets of information that helped (in particular, I learned that adding the "WITHOUT ROWID" option at the end of a SQLite CREATE TABLE statement will help to reduce the overall database size considerably).

The desire to put things into SQLite for the "Have I Been Pwned Password" actually stemmed from a misinterpretation of the API page since it seemed at first that the API was not free (although I had thought it was), since one of the first things I read was the need for a $3.50/mo charge. Later on, I then figured out that the password checking API is still free, it's just the other pieces of the API that require the API Key / monthly charges (doh!).

If you run a query against Troy's API, it's pretty damn fast all things considered, so I'll likely revert to using his API directly, but I was thinking of uploading the SQLite versions to some cloud hosting somewhere and potentially share it with others if anybody is interested.

On to some quick stats collected over the weekend (using an import process powered by some PowerShell):

  50 million took ~6 hours 22 minutes to complete (with ROWID):
  Start Time: Sunday, February 14, 2021 9:12:29 AM
  End Time: Sunday, February 14, 2021 3:34:46 PM

  50 million took ~5 hours 46 minutes to complete (without ROWID):
  Start Time: Sunday, February 14, 2021 7:51:38 PM
  End Time: Monday, February 15, 2021 2:05:30 AM

File size comparison:

  5.28 GB (50 million rows, with ROWID)
  2.61 GB (50 million rows, without ROWID)

Search Time comparison:

Example performance of Select-String (line-by-line search):

  TotalMilliseconds : 216245.7246

Example performance of the SQL query:

  TotalMilliseconds : 20.8543

I then ended up finding this implementation of a Binary Search option (rather than the naive approach Select-String takes above) that Chris Dent put together here: https://www.indented.co.uk/powershell-file-based-binary-sear...

Example performance of this approach was more in line with the SQLite database:

  TotalMilliseconds : 40.3749

(For these runs comparing against the Binary Search option, the SQLite query ran in):

  TotalMilliseconds : 4.9491

All in all, now that I have discovered the Binary Search option, that approach would be workable without needing to do anything at all (except making sure to download the "Ordered By Hash" version of the hashed password list) and using it directly. That said, I believe that once the import of the full password file into SQLite (using the WITHOUT ROWID option) has completed, the database file size should be fairly close to the actual txt file size: the text file is currently 25.1 GB with close to 630 million lines in it, and based on the smaller batch conversions to SQLite, I think the full import should land in the low 30 GB range.

Tagging along with the top comment in this thread, however, I do see what they mean about insert performance. I tried a few experiments there as well, but went ahead and kept the 10k-row batch inserts: with individual inserts I calculated the full import could take 216 days or something ridiculous, and switching to batch inserts brought that down to about 3 days, though based on some comments I read elsewhere over the weekend even that seemed like it might be kind of high. At the moment I'm trying a full import outside of my PowerShell script using DBeaver's import functionality, so I'll see if it completes faster; it seems like it might, since judging by the current database size it's more than 50% complete and it's probably only been about 12 hours.
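
For what it's worth, the usual trick behind batching like this is simply to wrap each batch in one explicit transaction, so SQLite does one commit (and fsync) per batch instead of one per row. From application code you'd reuse a single parameterized INSERT, but the SQL shape is just the following sketch (reusing the hypothetical table from above, with made-up rows):

  -- Sketch only: many INSERTs inside one transaction = one commit per batch.
  BEGIN;
  INSERT INTO pwned_passwords (hash, prevalence)
    VALUES ('0000000000000000000000000000000000000001', 7);  -- made-up row
  INSERT INTO pwned_passwords (hash, prevalence)
    VALUES ('0000000000000000000000000000000000000002', 3);  -- made-up row
  -- ... repeat for the rest of the ~10k-row batch ...
  COMMIT;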


Eh, we used to do 20k mysql queries on a single page like 10 years ago and it worked well enough to not notice at first.

Of course eventually we “fixed” that, and things got faster, but it was still impressive.


> we used to do 20k mysql queries on a single page

As a non-web developer, this is something that surprises me greatly. Why does loading a single page ever need to do a sql query? Can't you most often just generate all the possible static pages that you want to serve? Is your db content really changing every second?


This was a page in an e-learning system showing student results. It was all calculated on page load and iirc there were a few too many ‘look all these things up for one student at a time’ kind of situations that compounded.

You could cache that information, but you’d need to update it every time someone submits a new result (in this case to a test that can have tens of questions), times 30 students in class, times multiple tests.

In hindsight I think we could maybe have precalculated the results after the last student finished, but that wouldn’t really be worth it for a page that took maybe a few seconds to load.


e-commerce sites.


That's exactly what I was thinking about. For example, last weekend I was looking at the products offered by an online bike shop. There were at most a few thousand pages showing product lists with prices. Some of these pages took more than a second (!) to load. What the hell is going on? What's the server doing all that time? That should be instantaneous. It can be solved by a small folder of static html pages, re-generated when stock changes a few times per week.

I'm not a web developer, so maybe what I say doesn't make sense and sounds stupid. But I'm just astonished that this turns out to be a complicated problem at all.


Based on the numerous e-commerce web sites I see on a daily basis, it's usually one or more of the following:

* Inefficient, disabled, or non-existing caching.

* Random SEO plug-ins/add-ons doing things like queries on non-indexed data (see the index sketch below) or insanely convoluted JOINs.

* Logging every single HTTP request into a single table that nobody ever remembers to truncate.

* Multiple queries for each page in order to recommend similar products in the page footer or on some widget.

* Checking whether the visitor is logged in, and if there's anything in the basket yet.

* Checking whether there's an active discount for the current user/product combination.

It's really quite ridiculous. The worst offenders are WordPress/WooCommerce sites. Dedicated e-commerce solutions usually fare better.
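
To make the "non-indexed data" point concrete, the fix is often nothing more than an index on whatever column the query filters by. A hedged sketch with made-up names, shown with SQLite's EXPLAIN QUERY PLAN since that's the topic here (the same idea applies to MySQL's EXPLAIN):

  -- Hypothetical: a plugin filtering a large table on an unindexed column.
  EXPLAIN QUERY PLAN SELECT * FROM orders WHERE coupon_code = 'BF2021';
  -- reports roughly: SCAN orders            (i.e. a full table scan)

  CREATE INDEX idx_orders_coupon ON orders(coupon_code);

  EXPLAIN QUERY PLAN SELECT * FROM orders WHERE coupon_code = 'BF2021';
  -- now roughly: SEARCH orders USING INDEX idx_orders_coupon (coupon_code=?)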


to quote you " ... re-generated when stock changes..." - that's exactly the correct wording, except it doesn't happen a few times per week, but a few times per millisecond if you're Amazon. Or a few times per minute if you're a national supplier, any country you want, and you have a sale on Black Friday. That, of course, assumes you want to have almost real-time inventory stock to show to your possible clients. Otherwise, if you don't care to show out of stock items, sure, it can be a few times per week.


But they don't even solve that problem! In both shops that I use (rosebikes.fr and decathlon.fr) stocks are explicitly marked as approximate, with a very visible warning in red when the stock is 2 or 1 items, asking you to call the shop to verify stock. Website is slow as molasses still.


In those particular cases then yeah, it might be a sloppy job, but you didn't make it sound like that initially; instead you let us believe you meant it generally speaking.


You can have a client/server database which is just as efficient as a single-process embedded database like SQLite, if that database uses shared memory.


Exactly. I have the feeling there is just a little bit too much sqlite-hype going on lately.


One of your queries is `SELECT uid FROM user WHERE cap LIKE '%s%';`. I can only assume your users table is tiny. How does it do with 1M users?
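
For context on why that query is worrying: the leading wildcard means no ordinary index on cap can be used, so SQLite falls back to scanning the whole table, which is fine on a tiny table and painful at 1M rows. Easy to check, though the exact output wording varies by SQLite version:

  -- Leading '%' defeats any normal index on cap: full table scan.
  EXPLAIN QUERY PLAN SELECT uid FROM user WHERE cap LIKE '%s%';
  -- typically: SCAN user

  -- An equality match (or, with the right collation/pragma, a prefix match)
  -- can use an index on cap instead:
  EXPLAIN QUERY PLAN SELECT uid FROM user WHERE cap = 's';
  -- typically: SEARCH user USING INDEX ...   (if an index on cap exists)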


Indeed. In my app using SQLite I log every query that takes longer than 5 ms (the DB is a few GBs, the largest table about 1M rows). It logs very little.



