SQLite or PostgreSQL? It's Complicated (twilio.com)
297 points by makaimc on June 28, 2022 | 244 comments



> Once again, SQLite happily stored my UUIDs on this column and even indexed them, but to get these across to PostgreSQL I needed to fix the schema and properly declare these keys as strings.

In PostgreSQL you're leaving performance on the table if you store UUIDs as strings instead of as the built-in UUID type (a 128-bit value under the hood).
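
A minimal sketch of what that looks like (table and column names are hypothetical; gen_random_uuid() is built in as of PostgreSQL 13, older versions need the pgcrypto extension):

    -- store UUIDs as the 16-byte uuid type rather than 36-character text
    CREATE TABLE sessions (
        id   uuid PRIMARY KEY DEFAULT gen_random_uuid(),
        data text
    );

    -- the literal is cast to uuid, so comparisons and index lookups work on 128-bit values
    SELECT * FROM sessions WHERE id = 'a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11';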


Plus PG and SQLite3 have fundamentally different approaches to types.

SQLite3 has very few types, doesn't let you create new types, and is duck-typed.

PG has many built-in types, lets you create new types, and is (mostly) statically typed.

You really have to know this going in. If you have to support both, you'll probably have to write two versions of your SQL -- or you'll end up using an ORM that supports the two RDBMSes, and then you may get stuck with the ORM.

SQLite3 is amazing, but it's mainly amazing because of how easy it is to embed in apps. If that's what you need, use SQLite3. If you need more power, if you're building a server, then think long and hard about why you wouldn't just go straight to PG.

EDIT: SQLite3 is amazing for other reasons too.


> As of version 3.37.0 (2021-11-27), SQLite provides STRICT tables that do rigid type enforcement

https://www.sqlite.org/datatype3.html


Thanks! Here's the link to the actual details

https://www.sqlite.org/stricttables.html


I'm aware of this, and it's a very welcome addition, though the types supported are still very limited. But...

> though new types might be added in future releases of SQLite

Intriguing! Can't wait for SQLite3 to have `CREATE TYPE`!
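
For reference, this is the kind of thing PostgreSQL's `CREATE TYPE` allows today (the type and table names below are just illustrative):

    -- an enumerated type
    CREATE TYPE mood AS ENUM ('happy', 'neutral', 'sad');

    -- a composite type, usable as a column type or a function return type
    CREATE TYPE dimensions AS (
        width  numeric,
        height numeric
    );

    CREATE TABLE survey_responses (
        id       bigserial PRIMARY KEY,
        feeling  mood,
        box_size dimensions
    );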


This is so 'on brand' for HN: the author for once actually uses the main selling point of the ORM concept and changes the underlying database by updating one config variable and nothing else in application code. One of the first comments warns "you may get stuck with the ORM".


I work with (and occasionally have made contributions to) a proprietary Rails (ORM) app that in principle supports SQLite3 and PG but, in practice, left SQLite3 behind long ago. The main problem is that the app in question needed something like SQLite3's `ANY` type in PG but all there was back then was `hstore` (and even now, all there is is `hstore`, `json`, and `jsonb`), so `hstore`-using code metastasized and no one kept the SQLite3 bits working.

Nowadays, of course, SQLite3 has JSON support, and PG has JSON and JSONB, and so it should be possible to fix that, but it would be quite the migration, so no one bothers.

This is real life with a multi-RDBMS ORM: the promise of RDBMS neutrality always fails in some way as the differences between the RDBMSes bleed through the ORM's abstractions, and that bleeding occurs the moment you need more expressive power than the ORM provides out of the box for dead-simple apps.

It's the same old story: dev meets ORM, ORM makes things easy (at first), dev needs more power, dev bumps up against ORM's limitations, dev works around those limitations, dev now has a frankenapp, dev wants to switch RDBMSes but can't, or dev wants to ditch the ORM but it's a lot of work.

ORMs are a nice crutch to get started, but that's it.


You're right. Calling out tradeoffs transparently in the comment section is 'on brand' and that's precisely why I keep coming back to HN and the comment section, specifically.

Blog articles are helpful but most often are a single person's perspective and no one person is perfect. I love HN because I can get some great anecdotes (or sometimes response articles) from many perspectives.


tbf GP went, "You really have to know this going in. If you have to support both, you'll probably have to write two versions of your SQL."

That is, the problem isn't just the ORM but the very different characteristics of the databases behind the ORM.

For any serious production app, I bet no one is touching that config, especially if it has been running for months with one db or the other.

And now, you're stuck with an extra database and an ORM.


And SQLAlchemy made the process of migrating from one to another quite simple too.

https://www.sqlalchemy.org/


Thank you for highlighting this. I learned this about a year ago when I tried porting my storage layer from postgres to sqlite: loads of little things that don't quite work the way you expect, or require a bit more code as workarounds.


Normally you expect the reverse migration, from SQLite3 to PG!


Is it really duck typed if no static type checking is performed on the SQL? Duck typing to me seems to imply checking quacks and waddles at "compile" time. I would think it's at most loosely typed and at worst dynamically typed.

EDIT: Without the new strict mode of course, where it is not duck typed but typed by name.

EDIT 2: Actually maybe duck typing is fine to use for dynamic languages too. The affinity of INT keys and stuff is sort of duck typing I guess.


You're getting downvoted because the most well known duck-typed languages, and probably where people first learned the term from, are all examples of very dynamic languages like python, ruby, and even js.

I'm sure there are others but the main language anyone knows static duck-typing for is go. In go at least, the interfaces get checked against implementations based on what instances are bound to what interface variables or parameters at compile time.


Just to add more detail. SQLite was designed for Tcl interop. Its typing model reflects that.


TypeScript is another mainstream programming language that allows, or even encourages, static duck typing.


[dead]


I'd say that structural typing is a type system feature/paradigm that enables you to use duck typing in a statically-verifiable way. Duck typing is a "pattern" which you can do in dynamic languages (because there's nobody checking), or in static structurally typed languages.

I'm not sure why you'd call duck typing "dogma"? It's just a way of writing code.

https://www.typescriptlang.org/play?#code/JYOwLgpgTgZghgYwgA...


There's also no word about indexing or analysis (and postgres is very sensitive to having up-to-date statistics).

And on sqlite you might be leaving memory / storage on the table if you use a UUID as PK and don't remove the implicit ROWID.
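
A minimal SQLite sketch of the difference (the table is hypothetical):

    -- default: a TEXT primary key is a unique index layered on top of a hidden rowid
    CREATE TABLE events (
        id   TEXT PRIMARY KEY,
        body TEXT
    );

    -- WITHOUT ROWID: the table b-tree is keyed directly on id,
    -- dropping the hidden rowid and the extra index
    CREATE TABLE events_compact (
        id   TEXT PRIMARY KEY,
        body TEXT
    ) WITHOUT ROWID;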


I did not feel that was a relevant topic for this article. I have spent a good amount of time over the life of this application optimizing indexes, though. But the article is about how I designed a test to benchmark these two databases, not a general discussion on all the optimizations you can make.


In postgres there is usually no good reason to use varchar unless you specifically want to enforce size limits. If you don't, use TEXT; the performance is the same.


For this particular application I discuss in the article it would make no difference, since there is only one not heavily used table that uses UUIDs. All the other tables use integer primary keys.

Also this may be not a very popular opinion, but I prefer to design the database without using database-specific features. Things such as using SQLAlchemy and sticking to basic types. Sure it may affect performance some, but thanks to having done that I was able to port the app from SQLite to PostgreSQL in like an hour.


> Also this may be not a very popular opinion, but I prefer to design the database without using database-specific features.

The problem with that is that SQL as a standard is so patchy and inconsistently implemented that effectively everything has database-specific features. You discovered that yourself when you inadvertently relied on SQLite's "feature" of letting you store UUID strings in an integer column. No matter how portable you imagine your schema is, it will work very, very differently between any two SQL implementations.

I can't see any value whatsoever in voluntarily tying one hand behind your back performance-wise, honestly. It's not like changing "text" to "uuid" would have noticeably increased the time this port-job took you.

As a wise man once said, if you're doing a job manually while the tool designed for that job is sitting idle beside you, the tool is smarter than you are.


> The problem with that is that SQL as a standard is so patchy and inconsistently implemented that effectively everything has database-specific features.

Noting that when it comes to such patchiness, sqlite's developers specifically prefer to follow PG's example. When they write new SQL features, their basis for comparison is literally always "do we get the same answer as PG?" Yes, there are differences between the two, but porting between sqlite and PG, provided no fancy PG features are involved, is very likely simpler than porting between any two other arbitrary SQL engines. (That's ignoring any performance factors - i'm only talking about semantic compatibility.)


> Also this may be not a very popular opinion, but I prefer to design the database without using database-specific features.

I prefer to leverage the DB schema to do as much as possible for me. Generally with PostgreSQL that means using a lot of domain types etc.
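
A minimal sketch of a domain type (the name and the validation rule are just illustrative):

    -- a reusable type that carries its own constraint
    CREATE DOMAIN email_address AS text
        CHECK (VALUE ~* '^[^@\s]+@[^@\s]+$');

    CREATE TABLE subscribers (
        id    bigserial PRIMARY KEY,
        email email_address NOT NULL
    );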


> Sure it may affect performance some, but thanks to having done that I was able to port the app from SQLite to PostgreSQL in like an hour.

This is unpopular because it's a bad tradeoff. In the real world, performance matters all day, every day, while switching from Postgres to SQLite is an incredibly unusual scenario (especially because their hosting models are completely different).

Tuning a database for performance (or just taking the hit and scaling the database) is far more expensive than the likely-not-going-to-happen migration from one RDBMS to another. The vast majority of companies will never do it. It makes no sense to optimize for it at great cost elsewhere.


That's ignoring the specific use case of TFA though, which is not "I'm setting out to build something for my bigcorp to scale infinitely" but "I built a little tool that ended up handling more than I thought it would, good thing I used an ORM so that I could switch to X based on actual metrics and A/B tests."

I'm as anti-ORM as the next jaded dev but this is a picture-perfect use case for it: use an ORM to rapidly prototype and get the job done and leave a path for an industrial solution. The adage "you never switch your RDBMS" does not apply here.


> this may be not a very popular opinion, but I prefer to design the database without using database-specific features

I used to be in this camp, having started my dev life in Rails when it ran Sqlite in dev. This turned out to be an anti-pattern for Rails because dev/prod parity was never quite there. Over time I also realized it just wasn't worth the effort to keep things agnostic. Sure a hypothetical migration will be easier later on, but that migration very, very, very (very) rarely occurs in my experience.


You can't completely avoid using database-specific features - even the most basic types such as strings effectively require database-specific treatment at least if you want strings that are long and/or unicode.


Prefer this approach too. Makes unit testing easier and makes data more portable. Stick to the commonly supported set of SQL data-types.


> Makes unit testing easier

In what language? Unit testing uses whatever DB client comes with your stack, and those clients have generic interfaces.

For example, they will transparently decode UUIDs into strings when you read them, and encode them back to the database's UUID type when you write. As far as your code is concerned, they were always just strings.

The same is true of dates for mainstream, mature stacks.

> Stick to the commonly supported set of SQL data-types.

Postgres is among the most vanilla SQL implementations. UUID/GUID/uniqueidentifier is available in every modern SQL-based RDBMS.

Refusing to use the idioms of your database is a guaranteed way to lose performance for no reason.


In languages like Go, Java, and C++. The product in question is an integration product where the customer can configure their choice of database in a public/private cloud.

Using basic SQL types that have common support across all database vendors simplified development, testing and support. I have seen folks waste weeks (to months!) in data migration/porting and still lose time troubleshooting issues across databases when they used db specific features. Too much tech debt incurred.

A minimal loss of performance, weighed against simplicity, maintainability, and no vendor lock-in. When you have the same tests, same schema, and same DB layer code working against multiple databases, life becomes wonderful and hassle-free.


> The product in question is an integration product where the customer can configure their choice of database in a public/private cloud.

I've seen this type of thing before, but it's not a common use case (and I'm not sure why people still do it).

It makes more sense to just say that the product uses a FOSS RDBMS (like Postgres) and leave it at that. There's no vendor lock-in.


Because many customers have in-house expertise in the database system of their choice. A shop that has hired Oracle DBA's doesn't want to hire Postgres folks. A MySQL based place doesn't want to hire Oracle folks. And they have already built extensive data-management processes around their choice of database systems and have the in-house experience to manage their database system on a wide variety of platforms.

If you want to sell a packaged, deployable product to the widest market, yet want to minimise support, you leverage the expertise that the customer already possesses.

This is probably more of a consideration in the business enterprise space than some simple normal end-consumer SaaS, so I am not saying this approach makes sense everywhere.


I've done this before, used sqlite as a unit testing stand-in for a read-only MSSQL database my app used. Allowed me to unit test the data transform code conveniently. But that's about as far as I'd go. Running tests against one database and production against another is a great way to discover bugs in production, like the text length constraint covered in the article. And if speed is of concern but data storage is not, just launch the test database with tmpfs backed storage.


Just wait until you learn that SQLite stores everything as a string :(


True, but there’s a difference between representing a UUID as a sequence of bytes in a string and as the ASCII encoded hex characters representing those bytes.


> The solution was to grow the slug column to 512 characters and retry.

No, the solution is to always use `text` as your column type, and use a check constraint if you need to enforce a limit. It's much easier and safer to alter a check constraint than it is to change a column type live in production. The `text` type and `varchar(x)` are identical under the hood; `text` takes up no more space on disk.
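
A minimal sketch of that approach (the slug column mirrors the article's example; names are otherwise hypothetical):

    CREATE TABLE articles (
        id   bigserial PRIMARY KEY,
        slug text NOT NULL
             CONSTRAINT slug_length CHECK (char_length(slug) <= 512)
    );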


For Postgres, there's a strong case to be made that VARCHAR(n) is preferable to TEXT+check constraint. As you note, they're stored the same on disk. But for a varchar, you can increase the length with just this:

    ALTER TABLE a ALTER COLUMN b TYPE VARCHAR(n)
This only requires ACCESS EXCLUSIVE for a fraction of a second while the metadata is updated.

Whereas updating a CHECK constraint will require a full scan of the table. If you do it naively, this locks the table for the entire scan. If you want to avoid that, it's three steps:

    ALTER TABLE a
        DROP CONSTRAINT b_c,
        ADD CONSTRAINT b_c CHECK (length(b) < n) NOT VALID;
    COMMIT;
    ALTER TABLE a
        VALIDATE CONSTRAINT b_c;
So as long as you're only increasing the length (which in my own experience is a safe assumption), VARCHAR is much easier to work with.


On the other hand, changing the length of a varchar column in Postgres is actually changing the type of the table, and if you have any other objects in the database that are dependent upon that type (e.g. views), you'll have to drop and recreate them (and transitively, any views that depend upon those views) in the same transaction where you alter the column. This can easily spiral into a lot of work.


Btw this procedure doesn't work for partitioned tables. For some reason, it locks the table and does a table scan anyway.

The workaround we came up with was to add check constraints for instant partition attachment, lock the table, detach all the partitions, update the type on each partition, update the parent table type, then reattach all the partitions.

In our case, we didn't already have those instant-attach check conditions, so we incurred table scans anyway to validate those, but if you were to make multiple updates, you only have to do that once.


Yep, that's fair enough. In my experience, however, in the vast, vast majority of cases, folks either don't actually care about the length, or they should be using business logic in code to do the validation, rather than waiting for the database to yell at them, invalidate their transaction, etc.


>they should be using business logic in code to do the validation

I trust myself to do this, I don't trust my coworkers and future devs writing applications for the same database to follow the same constraints in their code though. Constraints at the DB level keep your data clean(er) even if some other devs tries to shit it up.


If I had a dollar for every time I saw someone doing application-layer-only validation of data and end up being surprised that some data that wasn't supposed to be valid ended up in the database only to later blow up in their face, I'd be a fairly well-off man.

In practice, declaratively creating data constraints at schema-creation time has a much much higher success rate than trying to imperatively enforce data constraints at write time, in my experience. The missed cases and bugs always come back to haunt you.


I'm not talking about all validation, I'm talking about a length constraint on a string (that isn't something like a US state, which should be a two-character field) -- already a pretty uncommon thing. At work, our codebase has 975 instances of a column storing a text value, and the length is material only in a very small handful of those.


How do you handle user input in your case? Frontend-only length limits? Or just allow them to take up more storage if thats what they submit?


The benefit is that you can treat the database schema as the single source of truth as to what data is valid. Otherwise what tends to happen is that different code components will have different ideas of what exactly is valid. This is also important because databases tend to be longer-lived than the code that accesses them.


It's good practice to have a length constraint on all text fields, just to put some cap on field sizes. You don't want somebody uploading 100MB of garbage into the `first_name` field; a 1000-char max is more than enough.

Constraints generally should be enforced both application- and database-side wherever possible. Application constraints can be more robust and have better UX, but only database constraints can offer you guarantees about what is actually in your database.


I thought as of PG 9.2 expanding a varchar column was a metadata-only operation and therefore low overhead.

I know in SQL Server there are two issues with doing varchar(max) for everything, even though increasing a column's size is metadata-only. First, indexes have a limit of 900-byte values and will fail at runtime if you index a column with no max length and insert a value larger than 900 bytes. PG seems to have this issue as well but the limit is 2712 bytes.

Second, the query planner makes use of the size to determine how much memory to pre-allocate for the query; with an unlimited-length field it assumes something like 4096 bytes and wastes working memory if your values are not actually that size. Not sure if PG has the second issue. Having a max value defined is valuable information to a database engine, along with other type information such as UUIDs only taking 16 bytes instead of 36.


> PG seems to have this issue as well but the limit is 2712 bytes.

Note that this is only for btree indexes. Hash is not so limited.

Though at these sorts of sizes it seems unlikely such simple indexes are of much use. I guess the range-matching abilities of btrees could be, but that seems unlikely; a 2700-byte path-like datum is a lot.


Note: that's true for postgres and sqlite (which ignores the length limit entirely as they discovered anyway), not necessarily for other database systems.


How convenient that he's comparing PostgreSQL and SQLite, then!


I think it's even more complicated.

The reduction in latency brought on by in-process databases, combined with modern NVMe storage, means that SQLite is a substantially faster approach than any other solution which requires a trip through the network stack.

I've got services in production for several financial institutions right now that are resolving most SQL queries within 100-1000 microseconds by simply using SQLite on reasonable hardware.

How many more users could you support if the amount of time you had to await IO for each was reduced by ~2 orders of magnitude?

Obvious caveats here being the resilience angle. We addressed this with application-level protocols and additional instances.


Another angle is unbounded horizontal scalability across tenants. If your application can be divided cleanly between different customers and one customer can be served with one instance (see "How many more users could you support if the amount of time you had to await IO for each was reduced by ~2 orders of magnitude?") then scaling to more customers is extremely simple and linear.


Agreed, but a fairer comparison would be SQLite vs PostgreSQL installed on the same machine.


Even there you'll see the same: micros in SQLite vs millis for PostgreSQL. PostgreSQL is simply not the right choice if you want in-process speeds.


What stops you from running postgres on the same vm/machine as your app?! You avoid the network and reap all the benefits of the NVMe storage.


It still goes through the network stack, just not the network hardware.

Unix sockets are faster, but they still require some system calls that an in-process database can do without.


Can you share more info about resilience, app protocols and additional instances?


> How many more users could you support if the amount of time you had to await IO for each was reduced by ~2 orders of magnitude?

None - or practically not many, when using async (or sort of green threading) backend stacks. The waiting connections / users are just waiting and don't block any new connections.

If the network round trip to Postgres (in my experience around 3-10 ms, but of course highly dependent on your server infra) were a concern, Postgres could be placed on the same server as the backend, though I would not recommend it. But this relatively small IO overhead usually is not a concern for many apps.


Thank you for sharing! Anecdotes like this are very useful.

Can you share more about the context? How big/gnarly are the tables? How frequently/concurrently are they written to? Based on your experience here, when wouldn't you want to use this approach?


It’s difficult to accept the results without looking at the query plans to see how the queries are running and if they are optimal. Seems like it’s just a straight dump of data into PostgreSQL and letting SQLAlchemy handle some queries and indexes but no analysis of the indexes at all.

You could probably squeeze more perf out of both SQLite and PostgreSQL.


The point of this exercise was to determine how these databases perform with the same schema, and if it makes sense to make the jump from SQLite to PostgreSQL or not. As a side note and since you ask, the indexes on this database are fairly well thought out, I have exhausted all ideas on indexing improvements to help with performance. I do mention a major implementation enhancement which is to precalculate monthly totals.

Also, another important point I tried to make is that this benchmark is useful for this particular deployment. I don't really want to convince you the numbers that I've got will also apply to you, my proposal is that you should test your application and make decisions based on your own results, which can certainly be different than mine.


While the indexes /may/ be pretty well thought out, that doesn’t mean they are useful.

Indexes don’t work the same between SQLite and PostgreSQL so a 1:1 transition between the two databases on indexes probably isn’t going to yield the best results.

For example UUIDs were stored incorrectly in PostgreSQL.

SQLite does not enforce varchar lengths, and typically in PostgreSQL you just use text type and not impose any length restrictions unless absolutely necessary.

PostgreSQL can use multiple indexes in a single query, which may help with the 2 GB RAM limit.

PostgreSQL supports partitioning which could help with the range queries.
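
A minimal sketch of range partitioning (table names and the monthly split are hypothetical):

    CREATE TABLE page_views (
        viewed_at timestamptz NOT NULL,
        path      text NOT NULL
    ) PARTITION BY RANGE (viewed_at);

    CREATE TABLE page_views_2022_06 PARTITION OF page_views
        FOR VALUES FROM ('2022-06-01') TO ('2022-07-01');
    CREATE TABLE page_views_2022_07 PARTITION OF page_views
        FOR VALUES FROM ('2022-07-01') TO ('2022-08-01');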

All I’m saying is that I believe the performance you get from SQLite imo could be better, and the performance from PostgreSQL a lot better.


> typically in PostgreSQL you just use text type and not impose any length restrictions unless absolutely necessary.

There was a horror story posted on here a while back about some old API causing an outage because there happened to be a random text field that a user was effectively using as an S3 bucket. Their conclusion was to always set a sane length constraint (even if very large, say 1MiB for the smallest fields) on field sizes.

Hyrum's Law. Yes user input validation, blah blah, but that is infinitely more surface area than having a sane (or absurdly large) limit by default.


Right, but use a check constraint and not a data type constraint (i.e. `varchar(x)`).

That sounds more like a code review issue -- `blob` would've been an appropriate storage type in that case.


> That sounds more like a code review issue -- `blob` would've been an appropriate storage type in that case.

No, it was supposed to be a short text field. Not an infinite storage endpoint!


I use and prefer varchar(1000) personally, extremely easy to change in the future if needed.


There are also a shitload more indexing options available to you in Postgres.
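
A few examples beyond plain b-trees (table and column names are hypothetical):

    -- GIN index for jsonb containment queries
    CREATE INDEX ON events USING gin (payload jsonb_path_ops);

    -- BRIN index: a tiny index for huge, naturally ordered tables
    CREATE INDEX ON page_views USING brin (viewed_at);

    -- covering index: lets some queries be answered from the index alone
    CREATE INDEX ON orders (customer_id) INCLUDE (total);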


> As a side note and since you ask, the indexes on this database are fairly well thought out, I have exhausted all ideas on indexing improvements to help with performance. I do mention a major implementation enhancement which is to precalculate monthly totals.

So, this is what Data Warehousing is all about: reporting-related optimizations (typically what monthly totals and such are aimed at). The idea being that you trade disk for speed, so that kind of optimization is actually pretty normal if you've identified it as a bottleneck. There's no need necessarily to separate data warehousing constructs into another DB, though it's often good practice because sooner or later you're going to start running into problems on your important system related to storage inflation and non-indexed sorts/filters.


I'm surprised you are the only comment mentioning this. I don't have quite enough knowledge to speak up about it here, but it seems that while this article is about SQLite vs PostgreSQL, it's really that neither is the ideal answer in this scenario.


exactamundo


The fact that the performance is so close between a database that requires NO SERVER and one that requires a dedicated server + ops management is a signal to me that all projects should start on SQLite unless there is a compelling reason not to.


It's funny because to me it says pretty much the opposite: because sqlite has no server it works entirely in-process, whereas the client will be talking to postgres over a unix socket, or possibly a full-blown network socket.

Plus it seems to be a readonly workload, something for which postgres is not ultra relevant (not that it doesn't work, but...).


Postgres doesn't require a _dedicated_ server. If your requirements allow for SQLite but it didn't exist, you could conceivably run postgres on your app server to reduce the ops load if you really felt it was a significant effort.

If you might want to use multiple app instances connected to a shared DB, I would say it's probably easier to just use a local postgres container for dev and a managed cloud DB. Really not that much effort and you get automatic backups.

If you plan to never use a shared DB, SQLite is great though.


> you could conceivably run postgres on your app server to reduce the ops load if you really felt it was a significant effort.

Still, PGSQL requires some level of setup/maintenance: create user, permissions, database, service, with config/security effort. If there is a way to run PGSQL as a lib and tell it: just store your data in this dir, without other steps, I am very interested to learn.


You can set up Postgres with three lines of bash and a Unix socket. It is more work; I wouldn't say it's appreciably more work, and if you need more rigorous data modeling (I frequently do) then it's worth considering if this is the mode in which you'd like to run your data stores (I wouldn't, but latency isn't the most critical thing for stuff I do).


I think it is more than that.

Steps will be:

- sudo apt install postgresql (more steps are needed if you are not satisfied with the OS default version)

- sudo su postgres

- psql -> create db, user, grant permissions

- modify the pg_hba.conf file to change connection permissions

- modify postgresql.conf to change settings to your liking because defaults are very out of touch with modern hardware

Another option would be to build your own docker image, but I am not good with this.


Whereas with SQLite it's:

- sudo apt install sqlite3 # you're going to want the client

- sudo chown myapp db.sqlite3

- sudo chown myapp . # make sure you have directory write permissions for WAL/journal

- sudo -u myapp sqlite3 db.sqlite3 'PRAGMA journal_mode=WAL;'

Nothing is ever single-command easy in all cases.


I am not familiar with SQLite, and not familiar with how engineers embed it in apps, but I believe for many apps that ship SQLite you don't need all these steps; it is included as some kind of library and not as a standalone server.

I am familiar with Java and H2, you just say in config somewhere: store db in this directory, and you are all set.

I think there is no such route currently for PGSQL.


Sorry, but it's more complex than this. As an example, NPM's `sqlite3` package does not contain the actual library. Neither does Ruby's. You need the platform library for quite a few runtimes and now your single-command future has...multiple commands.

Which isn't bad. This is a pretty low bar to clear.

(All that said, I am of the opinion that separating data storage from application logic is better for operations, for security, and for disaster recovery unless you desperately need to prioritize avoiding network latency above everything else, so this is all pretty moot to me.)


> As an example, NPM's `sqlite3` package does not contain the actual library. Neither does Ruby's.

That seems pretty... odd on their part. I know that Ruby gems in particular can and do include C sources, so there's no reason it couldn't ship sqlite3.c (or even the various .c files that get combined into sqlite3.c); if you're able to install Nokogiri (i.e. you have a usable C toolchain), then you should be able to compile sqlite3.c no problem.

Meanwhile, Python ships with an "sqlite3" module in its standard library (last I checked), so if you have Python installed at all then you almost certainly already have SQLite's library installed.


Nokogiri itself depends on an OS-supplied libxml2, doesn't it? And I think--don't quote me, Python isn't my thing and I only looked briefly--that Python on Ubuntu requires the `python-sqlite` APT package, which depends on both `python` and `libsqlite0`. I think CentOS is the same way? As a sysadmin, I'd expect it to be this way, so I'm not surprised by seeing this.

I think the thinking is that packing The Entire Library rather than your necessary bindings is overkill/duplication/etc etc. - it's the static versus dynamic argument in a sense, even though you may be dynamically linking while you pack your own library. Duplication versus OS-controlled security patching? Admittedly when most deployables end up being Docker containers (for good or for ill) that's less important, but most of these tools are from a distinctly earlier era of deployment tools.


> Nokogiri itself depends on an OS-supplied libxml2, doesn't it?

It does, but it still compiles bindings for it. My point is that if you're able to compile C code at all (including Nokogiri's bindings to libxml2), then strictly requiring some external libsqlite3.so is unnecessary - because it's pretty dang simple to just compile sqlite3.c alongside whatever other C code you're compiling.

> And I think--don't quote me, Python isn't my thing and I only looked briefly--that Python on Ubuntu requires the `python-sqlite` APT package, which depends on both `python` and `libsqlite0`

Right, and my point there is that if you've installed Python, then in doing so you've already installed SQLite (because your package manager did so automatically, because it's a dependency), so you don't need to run an additional command to install SQLite.


> but it's more complex than this. As an example, NPM's `sqlite3` package does not contain the actual library. Neither does Ruby's. You need the platform library for quite a few runtimes and now your single-command future has...multiple commands.

I think the usual way to do it is to have a shared library file like sqlite.so in some directory or installed through the OS package manager, and a thin API binding for the specific runtime in the form of a runtime package (e.g. npm).

> you desperately need to prioritize avoiding network latency above everything else, so this is all pretty moot to me

The goal here is easy distribution of an embedded DB. Many apps do this with SQLite, for example the Dropbox client. Many other examples: https://www.sqlite.org/famous.html


I see what you're saying now. But unless I'm very much missing the plot, everyone else is talking about a web application. The post to which you originally replied even constrained it as such:

> you could conceivably run postgres on your app server

So what you've been saying has been taken in the context of packing sqlite or Postgres next to your web application, which has pretty limited use cases, and reads really weirdly in that light.


The discussion was about "app server" and ops maintenance. We've just gotten used to the idea that if you want to bootstrap/install a web/api stack, you need to change tons of knobs and configs and install many libs and software pieces.

But there is another approach: one self-contained binary/artifact which can include an embedded DB too, and be launched by one command.


As my comment says, you are going to want the stand-alone client though, for a variety of reasons. The psql client usually comes with the postgresql-server, but apps embedding SQLite3 definitely don't come with the stand-alone command-line sqlite3 client.


> you are going to want the stand-alone client though

I don't want a stand-alone client for an app which the customer downloaded from my site.


The official docker image seems quite good and handles a lot of the bootstrapping needs.


Could you be more specific about what exactly it handles from the steps above?

Also, docker adds complexity itself.


I can’t remember the exact command off the top of my head but it’s something like:

docker run -it -e POSTGRES_PASSWORD -p 5432:5432 postgres:12

And bam, you have a server


Another cool thing is you can initialize the Postgres db when it starts the container with your own SQL (or sh) scripts to create your schemas, tables, views etc by just copying the scripts into /docker-entrypoint-initdb.d/.

Like:

    FROM postgres:14.3

    COPY pg/pg-config.sql /docker-entrypoint-initdb.d/000-pg-config.sql
    COPY pg/schemas.sql /docker-entrypoint-initdb.d/010-schemas.sql
    COPY pg/meddra-tables.sql /docker-entrypoint-initdb.d/020-meddra-tables.sql
    # ...

    RUN chmod a+r /docker-entrypoint-initdb.d/*

    ENV POSTGRES_USER=myapp
    ENV POSTGRES_PASSWORD=xxx
    ENV POSTGRES_DB=mydb


I tend to do things like (after launching container):

docker exec postgres psql -h localhost -p 5432 -U ${DBADMINUSER} -c "some SQL statement"


There is a config change in the steps above (pgsql defaults suck), and with docker in the picture, changing them becomes harder.

Also, I usually change the pg_hba.conf file too, to disallow outside connections.


Keep in mind the suggestion was a managed product in production but if you really need to customize your local image, you can mount in arbitrary initialization scripts as well.

If you need all this customization then did SQLite really fit the bill in the first place?


> if you really need to customize your local image

I need to change the config for a prod pgsql instance, because the default config performance-wise is not necessarily suitable for serious prod machines.

> If you need all this customization then did SQLite really fit the bill in the first place?

I am not familiar with SQLite, but somewhat familiar with Java's H2 DB, which has a similar idea. You can embed it and all configs into a self-contained java jar file together with your app, and no changes to the host OS are needed.


I only use this for local development personally, we use RDS in Prod


You are likely wanting something like RDS [1], Azure [2], or even Linode (which I just discovered) [3]. Sure you have to pay a larger premium for it, but these products do allow you to focus on the app rather than the DBA stuff.

- [1] https://aws.amazon.com/rds/

- [2] https://azure.microsoft.com/en-us/services/postgresql/#overv...

- [3] https://www.linode.com/products/postgresql/


> pay a larger premium for it

yeah, the premium is very large if I want to do a lot of data crunching (lots of iops, cores and ram are needed), plus latency to those services will be much higher (and throughput lower) than to my local disk through the PCIe bus.


"never use a shared DB"

I think this is what it comes down to for me too. Yes there might be some use cases that really benefit from Postgres features like GIS and the native JSON type, but ultimately, for most use-cases, it's going to hinge on whether or not you ever expect to need to scale any part of the system horizontally, whether that's the frontends or the DB itself.


> Yes there might be some use cases that really benefit from Postgres features like GIS and the native JSON type

Or a proper UUID type, materialized views, enforced data-constraints, lateral joins, queries that consider multiple indexes, etc, etc.

It's not like there's just a couple of niches where PostgreSQL's features distinguish it from SQLite. SQLite is appropriate for very simple DB use-cases where you aren't too bothered by silent data inconsistency/corruption, but not a whole lot beyond that. It's a good little piece of software, but it hardly relegates PostgreSQL to "Just GIS things" or "Just JSON use-cases".
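
Materialized views in particular are easy to miss until you need them; a minimal sketch (names are hypothetical):

    -- precompute an expensive aggregation once...
    CREATE MATERIALIZED VIEW monthly_totals AS
        SELECT date_trunc('month', viewed_at) AS month, count(*) AS views
        FROM page_views
        GROUP BY 1;

    -- ...and refresh it on your own schedule
    REFRESH MATERIALIZED VIEW monthly_totals;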


> SQLite is appropriate for very simple DB use-cases where you aren't too bothered by silent data inconsistency/corruption, but not a whole lot beyond that.

SQLite is also appropriate for use cases where you are bothered by silent data inconsistency/corruption; it's just that the hardware running PostgreSQL (or any "serious" DBMS, relational or not) is usually less prone to random inconsistency/corruption (uninterruptible redundant power sources, a well-cooled environment, lower but more stable clock speeds on everything, ECC RAM, a RAID disk array, a corruption-resistant FS).

If you run PostgreSQL on the cheapest consumer hardware, expect random corruption of the database when power runs out several times in a row.


Sorry, "corruption" was probably misleading there. By "silent data inconsistency/corruption" I mean, "a bug in my application led to me storing strings in an integer column, or too-long data in a too-short column, and SQLite just silently let me do that"-type errors – the data consistency is logically (but not physically) corrupted. The author encountered 2 separate occurrences of this just in their simple use-case, evidently.


> led to me storing strings in an integer column

Recent SQLite versions have STRICT mode where they forbid that.

https://www.sqlite.org/stricttables.html

> or too-long data in a too-short column

You can use a constraint for that:

    CREATE TABLE test ( 
        name TEXT NOT NULL CHECK(length(name) <= 20)
    ) STRICT;


Neat, will definitely give this a try. One of my minor peeves about SQLite is having to set a bunch of pragmas on every new connection. Like, especially stuff such as foreign_keys=1... really, enforcement of foreign key constraints is something each connection has to opt into rather than being set once on the whole DB file and never thought about again? Blah.


Well, unless you care about disaster recovery or availability.

For most business applications used by more than one user there are usually some expectations that data will be retained and the application will be available.


I read that as "dedicated server [process]", not "dedicated server [hardware]". While the latter would obviously imply the former, it's really the former that increases the complexity of managing the whole affair. Desktop applications with a dependency on mysql/postgres running on the same machine are, in my experience without exception, a pain in the ass for users.


Managing a publicly accessible DB and user certs to connect to it securely is like someone’s full time job at some companies.


Why should the database ever be publicly accessible? I've only ever seen databases being accessible by applications and maybe users (but those only for analytics databases), and have never experienced anyone working full time on certs and user access for a single database. There's usually a team for that, but they do that for the whole server fleet.


There are enough SQL oopsies in this article to show that you should NEVER use sqlite for any production use except as filter databases or other sorts of throwaways. RDBMSs are there to enforce constraints on data.


That’s silly, SQLite is the single most widely deployed database engine in the world, possibly the most deployed software, period. Its unit test suite is second to none. It’s reliable enough that SMF, Solaris’ boot and service management (sort of like systemd), depends on it.

That does not mean it is applicable to all database use-cases, but production-grade it undoubtedly is.


How many installations SQLite has is not that relevant in a discussion of SQLite vs a full RDBMS.

By your example, a bank should use SQLite because it is deployed widely.

You have to look at where they are deployed and the use cases. They also come with dramatically different tooling, which makes a huge difference. More tooling does not mean that a full RDBMS is a better choice:

It is just about what you need, what use case you need to cover and what industry you are in.


> NEVER use sqlite for any production

> RDBMSs are there to enforce constraints on data.

If you did not ask SQLite to enforce type constraints on relations (with STRICT), it will keep the relaxed, backwards-compatible behavior of previous versions of SQLite.

That being said, if you want actual validation, you probably need more complex CHECK expressions anyways with your business rules, and those work by default on any database.

https://www.sqlite.org/stricttables.html

  * unless you're a madman that runs things with "PRAGMA ignore_check_constraints = true;" or the equivalent; in that case, no DB can help you.


RDBMS constraints are usually oversold. For instance, it is usually the case that things like foreign-key constraints are avoided in production because they cause performance and maintenance nightmares.

Widely-used constraints like UNIQUE are production-quality in SQLite.


Uh, in most distributed environments you have no choice but to use a server, because you'll have clients on a (relatively) uncontrollable set of other machines.

If you’re writing something to run in an embedded/client app environment, then yeah why would you use Postgres for your one machine? You could, but it’ll add a lot of moving parts you’ll never need and probably don’t want (like remotely accessible network ports you’ll need to secure)


yes!

It seems to me (or at least, I'm hoping this is happening) that the pendulum is swinging back to simple deployments and away from AWS/Cloud/K8s for every tiny little app.

One thing I'd love to see is that the 'default' MVP deployment is your code + SQLite running on a single VM. That will scale well into "traction" for the vast majority of applications people are building.


I have a fully working SaaS with 30+ companies paying, with 20 employees each on average. I wrote it as a self-taught programmer via YouTube and stackoverflow. It is hosted on $2-a-month namecheap shared hosting. Php back, jQuery front. I use cpanel to set things up.

It runs faster than 90% of webapps on the internet.


As a former php dev I absolutely believe this. I can't get over how slow and bloated stuff is these days. I can't even do front-end anymore.

We once did an app ages ago where the database was used to create a materialized view of code (PHP) + web pages for everyone and everything. We then rsynced that to 6 machines. This is ancient times, but this thing FLEW -> click click click click you could fly through the pages. It was a read heavy workload, but still, even the front-end (just hand coded) was somehow faster than all the latest and greatest going on these days.


Do you implement something akin to Role Based Access Control for those 30x20 employees? If so where do you store and manage that data?

I have zero experience with this but I am very curious how people do it in sqlite.


I think OP is saying their customers are mainly companies with ~20 employees, not that their company has 20 employees


Yeah, no, I meant that 30 orgs with an average of 20 employees is (likely) a trivial load, but it begins to add interesting complexity to the amount of code for RBAC.


Not OP, but like any other database? Even if the database supports it, you usually should not have users directly logged into the database anyway, so whether the database "has" or "lacks" access control doesn't matter.

What does matter, however, is enforcing parametrized queries everywhere. Unless all the db handles you pass to the client handling code are read-only, chaos will ensue from the DDL permissions.

https://xkcd.com/327/


I start by assuming a server-side bridge between the client and the database, and it's a given the queries are parameterized. So only the bridge is doing the "logging in".

Why is it superior to put all of the (bespoke) access control logic in the server side bridge rather than use what's available in the database (accessed by the bridge, not the client)?

I have been watching like a hawk for 6 months but I haven't stumbled upon a clear reason why this is done, except for "it helps source code db portability".

For a multiorg/multiuser application this seems like the crucial distinction between sqlite and postgresql.

Again I have no experience here, talk to me like I'm stupid (I really am!).


For a multiorg approach, you probably want isolated DBs anyways.

Within a single org, multiuser approach, there are 2 big problems that I remember with attempting to shoehorn DB auth into application auth:

* assuming you use a connection pool, you might run out of TCP connections/ports if you need to handle too much stuff;

say for example that your load balancer needs 3 application nodes behind it - you will need 2 (connections per user) x 3 (application nodes) connections just to handle a user - 6 connections/user. That will eat your database connection limit very fast, for no good reason.

* assuming you don't use a connection pool, you now have horrible latency on every connection (bad) and need to handle plain text passwords (assuming you use scram-sha-256), or md5 non-replay-resistant hashes of user passwords, either sent in every client request or kept in a shared session system. No matter what you pick, you have a security disaster in the making (very bad).


In Postgres at least you can use SET ROLE when getting a connection from the pool and RESET ROLE when returning a connection to the pool, and avoid the problems you describe.
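
A minimal sketch of that pattern (role names are hypothetical):

    CREATE ROLE app_login LOGIN PASSWORD 'change-me';  -- the pool's login role
    CREATE ROLE app_readonly NOLOGIN;
    GRANT SELECT ON ALL TABLES IN SCHEMA public TO app_readonly;
    GRANT app_readonly TO app_login;  -- SET ROLE only works for roles you're a member of

    -- per request, on a pooled connection authenticated as app_login:
    SET ROLE app_readonly;
    SELECT * FROM reports WHERE org_id = 42;
    RESET ROLE;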


Ah, thank you very much. I will happily investigate these and ditch postgresql for sqlite if still true, for my planned use case.

sqlite looks like great technology to me (as is postgresql) but I am a bit of a fanatic for keeping the overall system as understandable as possible, so these questions are important (for me, I'm stupid).


Yup, I believe it. As an industry, 90% of companies are spending 90% of their time on crap that only the remaining 10% of companies actually need.

Good resume fodder though I guess.


If people are paying for the app why not spring for a dedicated vm? It’s only like $5 more a month and is way more predictable and customizable.


It seems a lot harder to setup correctly and use, and I don't really know what benefit I'd get.


In my experience the pain will come later as it grows, I believe it.


That pain also comes if you're already deeply invested in cloud products. Nothing scales without effort. And pain points always arise where you don't expect them. But the simpler the setup, the simpler it often is to scale out the parts that need scaling. In this example maybe a dedicated server or a separate database server at some point. But there can be many customers before that's necessary.

Preemptive scaling never really works and most projects never scale enough to warrant more than one server (unless you write very inefficient code).


And it's a great pain to have


I believe it!


hmm in that case I would probably pick Postgres because despite the added network call, it is still faster than an embedded DB.


As I understand it the actual low-level database interface may be a shared library mechanism in either case. At least, for a local database I'm pretty sure that's a function of the connection type. You're still dlopen()-ing an ODBC library and the underlying glue between that and the rest of the userspace DB engine parts can be implemented in many ways. If there's a socket involved it would probably be AF_UNIX


I disagree:

Postgres and SQLite are very different in several aspects, e.g.

- types and enforcing type safety

- handling concurrency and transactions (available isolation levels)

- functions / procedures

- array support

Postgres can be much more valuable than just using it for "relational data storage". Its features can be a good reason for choosing it from the beginning.


Sure, you can start there. Once you scale though you're going to have lots of fun.


For many applications you're right. However, in the case of a dashboard for blog articles I don't think it could ever come to that.


I’ve never worked on a project that was scaled out that didn’t require at least moderate DB maintenance, such as a large migration or sharding.

A migration is all but expected. Might as well make things easy on yourself for the first year and start with SQLite.


We're talking about data from a blog, though. It's not like it's going to scale to have terabytes of entries anytime soon.


I have a question that wasn't answered on the blog, and might be silly for me to ask. Is SQLite running in WAL mode or not? I have usually found that WAL mode allows better performance even in read-heavy workloads, as it doesn't need to exclusively lock the database each time.

Otherwise, the post is a nice reminder for SQLite users that foresee a database switch to use the newly-introduced strict mode.


Yes, WAL mode can have a drastic impact.
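
For reference, a minimal sketch of enabling it (journal_mode=WAL persists in the database file; synchronous is per-connection and just a commonly paired setting):

    PRAGMA journal_mode=WAL;
    PRAGMA synchronous=NORMAL;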

Also this is apparently benchmarking Postgres and SQLite _with Python on top_. Some of the overhead here might not have much to do with either SQLite or Postgres at all. Just because one uses an ORM doesn't mean the same code is running and especially considering how sluggish Python is, I'm not convinced by any of this either way outside of this being an improvement for the very specific use-case presented. But we don't know _why_ that is.


I explain this in the post. The goal is to optimize the application for users. I really don't care if Postgres is faster than SQLite or vice versa, I just want users to have a snappier application. The experiment was to swap out the database while leaving everything else the same to see if that affects the overall user experience in a positive or negative way.

Also note that I'm not saying Postgres is faster than SQLite in absolute terms. That happened to be the case for this particular application and how it was deployed. If you perform a similar test with a different app or stack you will likely get different results. A main point I'm trying to make in the article is that people should do their own benchmarking instead of making decisions based on stuff they read online.


SQLite isn't running in WAL mode. This application isn't just read-heavy, it is read-only for users, so I wouldn't think this will make a difference. The writes occur while everybody is sleeping, from a cron job that runs once a day to upload the traffic for the previous 24 hours.


I had a look at SQLite around the time sqlite3 appeared, and my notes say I decided I didn't like it because of the lax and sometimes inconsistent way it can treat its input.

I thought I'd have another go at it today (sqlite 3.34) and I'm mildly surprised that the things I didn't like are still there:

  sqlite> select '1 apple' + '2 oranges';
  3

  sqlite> select count(*) where 0.9;
  1
  sqlite> select count(*) where 1.1;
  1
  sqlite> select count(*) where not 0.9;
  1
  sqlite> select count(*) where not 1.1;
  0


You may be interested in STRICT Tables [1], introduced just 6 months ago:

> Some developers appreciate the freedom that SQLite's flexible typing rules provide and use that freedom to advantage. But other developers are aghast at SQLite's flagrant rule-breaking and prefer the traditional rigid type system found in all other SQL database engines and in the SQL standard. For this latter group, SQLite supports a strict typing mode, as of version 3.37.0 (2021-11-27), that is enabled separately for each table.

[1]: https://www.sqlite.org/stricttables.html

Related past discussions:

Strict Tables – Column type constraints in SQLite - Draft | 2021-08-21 | 331 points, 108 comments | https://news.ycombinator.com/item?id=28259104

SQLite Release 3.37.0 | 2021-11-27 | 333 points, 112 comments | https://news.ycombinator.com/item?id=29363054


"I'm mildly surprised that the things I didn't like are still there"

That's because SQLite doesn't ship changes that break existing applications.


The debate over type conversion seems somewhat like arguing language features (such as weak typing vs strong), which has been a spirited debate since forever.

The fact that this is debatable at all means that there are arguments and counter-arguments that serve both sides well, which further implies that there are use cases where some pros outweigh the cons on each side.

(Of course, this entirely neglects the point that it's usually fun and even educational to have a technical argument.)


I don't have any great objection to a database that lets you store a value of any type in any column.

And if they want to say '3' + 5 = 8, I could imagine why that might be fine.

But I don't think that's a good reason to have behaviour like the above.

I think it's more likely that they wouldn't choose to do those things again today: I think this is what fell out of the initial TCL-based implementation and they feel they're stuck with keeping it backwards-compatible.


This is totally uncomplicated. The choice of database is a function of your application and its architecture, not a matter of all-things-equal performance comparison.

"I want an extensible versitile database platform for use as a component in an infrastructure that is part of a larger system" - postgres.

"I want an embeddable lightweight SQL storage backend for a narrow scope, single-purpose, probably single-user, probably single-system application" - sqlite

Postgres should perform well in any role in the hands of an expert - it's a question of whether or not that is overkill. Sqlite can also perform well in a subset of those roles.


Some context: I've worked with massive PostgreSQL db's, containing millions of daily records spanning multiple years (allowing for potentially massive numbers of returned rows). I also build MVPs with simple db architectures exclusively in SQLite (think: no massive joins or need to partition data).

Best takeaways here:

- SQLite kills it. It appears untuned PostgreSQL is only 2 seconds (???) faster than untuned SQLite; always try to kick off a project with SQLite

- PostgreSQL has a performance cliff when query responses leave memory. This is extra true when an index no longer fits in memory. work_mem is critical for pushing back this performance cliff (quick sketch at the end of this comment)

Side note: Tuning postgres can be really painful, but it's also a really dope tried-and-true db.
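
On the work_mem point, a rough sketch of how you might experiment with it (the value is illustrative, not a recommendation):

  -- per session, handy while testing a specific query
  SET work_mem = '64MB';

  -- persisted cluster-wide; work_mem changes are picked up with a reload, no restart needed
  ALTER SYSTEM SET work_mem = '64MB';
  SELECT pg_reload_conf();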


It seems a solid first step would be to use EXPLAIN to understand query planning/scanning issues.

Assuming 6.5M rows/day × 1.5 years ≈ 3.5B rows, how large is the dataset in both SQLite and PG? With indexes, does it fit in RAM?

Linode 2GB plan is shared vcpu, so results will vary depending on who is sharing it with your service.
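
For anyone following along, this is the kind of thing I mean (table and columns are made up; the real queries would come from the app):

  -- PostgreSQL: actual timings, row estimates, and whether sorts/hashes spill to disk
  EXPLAIN (ANALYZE, BUFFERS)
  SELECT path, count(*) FROM page_views GROUP BY path ORDER BY count(*) DESC LIMIT 10;

  -- SQLite: shows which indexes (if any) the planner chose
  EXPLAIN QUERY PLAN
  SELECT path, count(*) FROM page_views GROUP BY path ORDER BY count(*) DESC LIMIT 10;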


This isn't the topic of the article so I haven't included it, but I have taken several "solid steps" to optimize this application throughout its year and a half of being in production.

This article is about one of them only, how I designed a test system to evaluate the two databases head to head.


> This isn't the topic of the article

Query planning seems like it should be the topic of the article. SQLite and PostgreSQL do their query planning very differently - not to mention that SQLAlchemy is probably generating different queries. I can't really fathom how one can meaningfully "evaluate the two databases head to head" while seeming to actively avoid how they plan/optimize/execute queries.


This sounds like a great topic for your research, your article, and your passion to share with others if it exists. Get to work.


I don't have access to the code used in the article. Therefore, no amount of research on my part is capable of answering why PostgreSQL was faster than SQLite in the author's situation - certainly not without the queries and the plans / EXPLAINs thereof, those being the specific thing I and multiple others have suggested as something the article could've provided and which would've made it an actually interesting head-to-head.

Meanwhile, if all you have to offer in response to such a suggestion is "hurr durr do it yourself", then why bother commenting at all?


>If you are going to take one thing away from this article, I hope it is that the only benchmarks that are valuable are those that run on your own platform, with your own stack, with your own data, and with your own software. And even then, you may need to add custom optimizations to get the best performance.

Good advice, contextually.


Is it just me or was testing done with only four concurrent clients? In a mostly read-only environment?

Wouldn't moving to a static site generator have been a better direction in that case?

If you're maxing out at four people reading your blog, I can't imagine more infrastructure is better.


They're not benchmarking the blog; they're benchmarking an analytics application that aggregates statistics about the blogs.


You know, apart from some of the details about which data types are better to use, etc., I think that both PostgreSQL and SQLite are excellent choices for most use cases!

Of course, SQLite might be the best choice for embedded or local usage, they even have this lovely page about using it as an application file format: https://sqlite.org/appfileformat.html

And PostgreSQL might scale better for multiple writers and situations where you need it on a separate VM/server/container that might be shared between different apps through a network connection of some sort.

Though it's also curious, because you don't hear a lot about MySQL or MariaDB (comparatively, in my subjective opinion) on HN or many other places for that matter. And yet, it's still a decent option, has good tooling (MySQL Workbench is great), is widely supported and backs a significant number of production systems out there.

Do people just view it as old/legacy, or is there the belief that for many use cases PostgreSQL might simply be the better option, e.g. also because of projects like PostGIS and support for more complex setups, like custom data types, more advanced scripting language for in-database processing and whatnot?


I’m very happy with MySQL currently. I started using it because AWS Aurora supports it a bit better than postgres. I haven’t had any regrets since.

It’s great when you just want simple relational storage without a lot of features or ceremony, but need a bit more than SQLite. It’s fast, predictable, and gets the job done.

Postgres is great too and offers more, but sometimes that can tempt you to try to do too much in the DB layer.


I think it's more the latter: using Postgres for relatively simple workloads really isn't any more challenging than using something like MySQL, but then Postgres also supports more complex setups when you need it. So there would need to be a reason to lock yourself out of what PG offers, and I just don't think such a reason exists.


I really wish WordPress supported using SQLite as its database instead of MySQL, as this could make backups as simple as rsync'ing from prod server to backups.


There's a WordPress plugin that adds support for SQLite as an alternative to MySQL.

The implementation is (to my mind) simply astonishing: they run regular expressions against the SQL to convert it from MySQL dialect to SQLite! https://github.com/aaemnnosttv/wp-sqlite-db/blob/9a5604cce13...

I'm amazed that this works, but apparently it works really well. I would expect it to break the moment it ran into a WordPress plugin that executed custom SQL in a way they hadn't anticipated though.


WordPress became popular at a time when LAMP hosting was easy, widespread, and affordable. It's one reason WordPress largely supplanted Movable Type, which was an early popular blogging platform / CMS / static site generator. Because Movable Type was Perl-based, it was more difficult to host.

Internally, Movable Type had a really nice object persistence / ORM abstraction, with support for MySQL, Postgres, SQLite, and Oracle. Meanwhile, iirc around this time WordPress internally had raw queries strewn about its codebase, which certainly would have been difficult to support multiple DBs. (I haven't looked at WP since then so I don't know if this is still the case.)

Faced with shrinking market share and being out-executed on features by WordPress/Automattic, Movable Type actually removed Postgres and SQLite support in response! Then again, these DBs weren't nearly as popular back then as they are today.


Wouldn’t you run the (real) risk of a corrupted backup?


Yeah. SQLite's backup API is the right way to do this:

https://sqlite.org/backup.html
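
The sqlite3 shell exposes it as a dot-command, which makes it easy to script from cron (path is made up):

  sqlite> .backup /backups/blog-analytics.db

The copy goes through the backup API, so it stays consistent even if other connections are reading or writing at the time.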


And, if you can't use the backup API for some reason, try VACUUM INTO:

https://www.sqlite.org/lang_vacuum.html#vacuum_with_an_into_...
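
It's a single statement, e.g. (destination path made up):

  sqlite> VACUUM INTO '/backups/blog-analytics-2022-06-28.db';

As a bonus, the copy comes out defragmented, since it goes through the normal VACUUM machinery.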


Would creating a ZFS snapshot, then either copying from that snapshot or sending it to another machine with zfs send, also be reliable? IIUC, ZFS snapshots are point-in-time consistent.


Beat me to mentioning this. Rsync is not a safe database backup tool. Period. It can be used to move around backups created by whatever specific database tool generated them, but just blindly using rsync is a recipe for unrecoverable corruption.


It would be great to have the option to move easily between databases.

SQLite would be best for development, testing, and small sites. MySQL would be better for production. And SQLite would be great for archiving old sites.


To be fair, exporting and importing a database is a single command on MySQL (or Postgres), so it's just one command more than rsyncing the SQLite file.


Not only SQLite. I would've expected WordPress to be database agnostic.


Why? WordPress is from an era of PHP where the ties between PHP and MySQL were tighter than tight. What reason would they have to change away from that? I would love if WP worked with more than just MySQL as well, but I understand why it doesn't.


It being from an era doesn't mean it has to remain in that era. I also understand why it doesn't. I'm saying it isn't expected to (anymore).


Great blog post. A few observations (considering the blog post skips numerous details):

SQLite really shines when compared to running a database on a separate server from the app. This is when the network latency starts being noticeable. Latency between you and the app is usually constant. The latency between the app and the database is what matters. If, in this case, Postgres can run on the same host as the main app, I think it's a good thing to do. But very often that's not the case.

Even though not surprising, it's great to see how SQLite handles concurrency. From my experience, SQLite would be able to handle much more read concurrency. It would have been great to see more benchmarks (like 1000 concurrent users, even if the app would never see such traffic).

I was surprised that SQLite didn't beat Postgres on simpler queries. Without seeing the queries, if they contain multiple joins, that could be another reason why SQLite is slower since it can only do "nested loop" JOINs. The workloads also seem mostly analytical; we'll see where DuckDB goes in this space (looking promising for now).

One thing not mentioned in the blog post is "easy to maintain and update". If the app is deployed with Docker, adding Postgres means you go from worrying about:

  1 docker container + 1 docker volume
to:

  2 docker containers (docker-compose?) + 1 docker volume + internal docker networking
The fact that SQLite is so simple to operate also has some advantages. You can usually go back to an SQLite-based app 1 year later and understand everything. With other databases, you may need to review users, roles, permissions, connection string, etc.

Another cool thing about SQLite is that it scales mostly linearly as you add more CPU cores. It can scale as long as you can also scale the number of processes in your app. In other databases, scaling means adding connection pools and other tools in the middle.

Lastly, and related to the latency topic. The low latency in SQLite means that you can sometimes convert a big query into 5 separate, small and reusable queries without any performance hit. It's really remarkable how much cleaner your code can become with this.

In any case, I love working both with Postgres and SQLite, and we need more blog posts like this one showing real-world cases.


Ditto on the DuckDB point. These look like OLAP workloads to me, and a columnar database would work wonders. DuckDB if you're going embedded, ClickHouse if you're going with a server.


Here is a comparison that I'd like to see next: how much time was spent on the following...

- porting the application, making all these benchmarks and writing a blog post

vs.

- simply getting a VM with more RAM and more CPUs.

In the conclusion the author goes on to talk about the options "if I get the budget" vs "without throwing more money at the problem". I'd guess that the cost of a day of a Principal Engineer's time would far surpass the cost of years of running a 4-CPU server.


I understand and partially agree with all the criticism of the article (and could add my own, like: why did you tweak work_mem out of all the hundreds of settings on postgres?), but it's a blog post, not a scientific article. I still think it's a useful post. I might be in a similar situation at some point and it's nice to know that you haven't encountered any major gotchas and made it even 2x faster. Thanks for sharing your experience.


If you're using SQLite, there is always a risk of losing that data file on disk, right? How is that managed? Is this a dumb question, or is it a risk that PostgreSQL doesn't have?


The best answer I know of to this question is Litestream - https://litestream.io/ - you can use it to inexpensively replicate a backup of your database up to an S3 bucket (or various other storage providers) to give you a very solid robust recovery option should you lose the disk that your SQLite database lives on.


The big question here would be whether the filesystem that the data file resides on is mounted on a single-disk volume or a RAID-volume (or even a clustered SDS volume such as Ceph). On a single-disk volume, if the disk dies your DB is gone (at least until you can replace the disk and restore the backup). On a sensibly designed RAID-volume, if one of the disks dies the application's admin probably would never even know. The sysadmin would see it in their logging/alerting infrastructure and replace the dead disk. There would not be any loss of data or availability of the data (just a temporary degradation in the overall resiliency). Same is true for a clustered SDS volume such as Ceph.


Not only that, the easy mode we're all doing now is "run the application in a major cloud". Besides AWS, they all promise 99.999% or above durability for their block storage offerings.

Of course you still need backup for other incidents.


PostgreSQL stores data on disk too, so it has the same risk of loss. Both setups need a backup solution to mitigate the problem of disk failure.

My understanding of the situation is that SQLite's solution is a bit more low-level than PostgreSQL's, but it should be workable.


In combination with a base backup, you can use streaming WAL backups with Postgres, shipping compressed WAL files to S3 or wherever. I like WAL-G for this.

There are lower-level approaches built-in to SQLite, such as the Backup API, but there is at least one 3rd party project, Litestream, which does streaming WAL backups too.
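
A rough sketch of the Postgres side, assuming WAL-G is already installed and configured with storage credentials (check the WAL-G docs for the exact invocation):

  -- archive_mode requires a server restart; archive_command alone only needs a reload
  ALTER SYSTEM SET archive_mode = 'on';
  ALTER SYSTEM SET archive_command = 'wal-g wal-push %p';
  SELECT pg_reload_conf();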


Not a dumb question, but some people don’t care for their use case (either the data is cheap, or they don’t think it will happen)


These days, you could replicate it to something like S3 via something like Litestream.


No one ever got fired for just using PostgreSQL; maybe burned, though.


According to https://www.linode.com/pricing/ a shared instance with 2GB of RAM and 1 CPU is $10/month. I don't know how much a principal engineer at Twilio makes but I'm pretty sure this wasn't a cost-effective optimization... It's also a bit weird not to talk about indexing? Are SQLite and PostgreSQL similar enough that the same indexing strategy should be used?


I mean, SQLite is living on a single node. What exactly do you do when you have a failure, when more than one process needs to access the DB, when you need RBAC, etc.?


You tell the users of your internal statistics dashboard to wait a few minutes/hours while you fix it.


> what exactly do you do when you have failure

Depends on the failure. What do you have in mind?

> when more than one process need to acces the DB

Easy, you use `journal_mode` set to `WAL` when you connect to the sqlite database.

> when you need RBAC

If you need RBAC, you aren't going to be using sqlite. In theory you could have your application handle the RBAC part since sqlite is embedded in your application.

Another solution is to leverage unix users and groups and set permissions appropriately for the database file and the application running.


>> when more than one process need to acces the DB

> Easy, you use `journal_mode` set to `WAL` when you connect to the sqlite database.

and set isolation level and check_same_thread and PRAGMA synchronous, and then put all queries in retry loops because the DB may be locked https://sqlite.org/wal.html#sometimes_queries_return_sqlite_...

I wasted so much time trying to get multiple sqlite queries running at the same time
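
For what it's worth, the per-connection setup usually recommended looks roughly like this (values illustrative; it reduces, but doesn't eliminate, the need for retry handling):

  PRAGMA journal_mode = WAL;    -- persists in the db file; readers no longer block the writer
  PRAGMA busy_timeout = 5000;   -- wait up to 5000 ms for a lock instead of failing immediately
  PRAGMA synchronous = NORMAL;  -- the usual pairing with WAL; fewer fsyncs than FULL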


> I wasted so much time trying to get multiple sqlite queries running at the same time

You were trying to write at the same time right?


no, although the DB was not opened in read-only mode. I tested with apachebench with many parallel requests on endpoints using only SELECTs


It has everything to do with the nature of the application, architecture, business, and the business cost of such failures, doesn't it?

If the discussion is around failures for the use case of the article (dashboard for blogs which is read-only), my best guess at 'high availability' would be just to have a separate SQLite DB file on multiple app server nodes and call it done. The author mentions that the DB is updated once per day (probably during the night) by a cron job. Just as easy to then scp a few copies to other app server nodes.

If the discussion is around a 24x7 read-write workload with high business costs for performance, availability, and scalability then it's an entirely different problem.


> what exactly do you do when you have failure

Same thing you have for failure when your pg instance is running on a single node?


Interesting article. It should be noted that the testing done is with a 100% read/query workload (not surprising since the application is a website of blogs).


This is 100% correct. This is a dashboard application that is mostly read-only for users. There is a cron job that imports and writes new data once per day.


I don't know about now but at one time Firebird made this question moot. It never received the love it deserved in my view.


Many years ago, I was hired at a web shop running Firebird in production. It had the interesting property that if you ran the client software on your desktop to connect to the server software on another host, and used ^Z to stop the client, then the server process would hang until you resumed the client. I'm not entirely sure how I'd implement that as a feature if I wanted to.


Not good!


How about PostgreSQLite? Joke aside, is anyone building something that takes a stab at fixing the issues with both?


Honestly, the blogging dude wrote a web "application" in Python, known for its not-quite-good performance, and wonders whether he can make the application faster by... changing the database?


Am I the only one wondering why they developed an in-house system for this kind of tracking in the first place? This problem (analysis of traffic) could easily be solved with Segment (a Twilio product) and an analytics tool like Mixpanel or Amplitude.


Can SQLite be used outside of single threaded contexts? Last I checked having multiple concurrent readers and writers is very difficult. Multiple concurrent readers only is probably ok if that's all you need.


In practice, most SQLite writes take less than a few ms, so concurrent writes are only likely to be a problem for you if you need to handle thousands of writes per second.

I've implemented the pattern where writes sit in an in-memory queue and are applied to a dedicated write connection in series. I've not run into any problems with this so far, but I've also not tried to handle thousands of writes per second with it yet.


I’ve done this with retry logic in each concurrent writer (retry if you can’t lock the file with a short timeout) and it works fine enough.


If threads have to contend for db with retry loops, does concurrency actually buy you anything?


It definitely doesn't buy you performance. But I thought you were asking if it's possible, or how difficult it is. A use case might be if you have an existing architecture of N processes and want to collect stuff from them in a central file.


It can, but as the concurrency increases you'll start to see errors indicating that the database is locked.


This sounds like: there is some point where a feedback loop tends to lock it all up, but it's fixable with faster hardware.


The major drawback I encountered when building upon SQLite is concurrency of writes.

I have a lot of worker processes writing new data.

With SQLite, I was getting a lot of "Database locked" errors and ended up having to use pg.


Why is this limited to four clients? Maybe I missed something but it would be good to see what happens with higher numbers of connections (something both of these dbs are rumoured to be bad at).


It seems the queries are aggregate based and they’d be served better by DuckDB instead of SQLite.

DuckDB performance is astounding for doing group by, sum/count/avg with filters and sorts.


From my experience, SQLite shines (using it in Go) when it's in-process, local, accessed through the SQLite API, and running from an NVMe drive. It's probably not the right choice over networks and with ORMs.


My advice: go straight to PG. And avoid ORMs.


ORMs are fine but they are not a silver bullet that means you never need to look at the database or the queries generated.

I think too many people use ORMs and rely on them to do everything, then blame the database when things are slow. :(


Devs start with ORMs because they promise to make everything easy.

Then their apps grow complex enough that the ORMs get in the way.

Then the devs get sad.


Re: no ORMs - do you have any advice on how I could make this raw SQL insertion ( https://github.com/bbkane/starghaze/blob/54d255f524f8ef73384... ) more readable and less error prone? Right now I'm basically concatenating strings and hoping I don't misorder my parameters when I exec. 200 lines of this!! Am I doing raw SQL wrong?


> Right now I'm basically concatenating strings

Well, you're using query parameters, but I think I understand what you mean.

What I would do is write a stored procedure that takes whatever parameters, then I'd write PG SQL to generate bindings to that procedure for whatever language you're using (Golang here). I might write the bindings by hand if it's not a lot of work, but definitely would write a little compiler if it's a ton of work. That way you can get type-safety.


Thank you. I'm using SQLite3 to keep deployment simple, which unfortunately doesn't support stored procedures. Do you have any other suggestions to improve this code?


SQLite3 very much supports something like stored procedures though! Only, instead of calling them that it calls them `INSTEAD OF INSERT` triggers.

So, you'd do something like:

  CREATE VIEW IF NOT EXISTS things AS ...;
  CREATE TRIGGER IF NOT EXISTS create_thing
  INSTEAD OF INSERT ON things
  FOR EACH ROW
  BEGIN
      ...
  END;
and to invoke such a "stored procedure" you'd

  INSERT INTO things (...)
  VALUES (...), ...;
or `INSERT .. SELECT ..;`.

Now, with SQLite3 you don't get reflection via SQL, like you do in PG via the `pg_catalog` schema, but you do get reflection via the API. Also, because it's SQLite3, you don't get strong typing, and you get very few types, so even with reflection via the API, generating type-safe stub functions to invoke the "stored procedure" is not exactly easy, if it's possible at all.

FYI, SQLite3 is Turing-complete. You can loop via recursive triggers. And you can do conditionals via `WHERE` clauses on the statements in your trigger bodies (and also via `OR IGNORE` and such).

EDIT: I made a post of this: https://news.ycombinator.com/item?id=31913062


SQLite supports named parameters: https://sqlite.org/c3ref/bind_blob.html

I don't know if Go's SQLite support in turn supports named parameters, but if it does, then that would likely help considerably with readability.
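
So instead of counting `?` placeholders, the statement documents its own bindings, something like this (column names invented for illustration):

  -- the driver binds by name, so parameter order no longer matters
  INSERT INTO repos (full_name, stargazers_count, pushed_at)
  VALUES (:full_name, :stargazers_count, :pushed_at);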


It takes a lot to say very little in Go sometimes; I'd say this is one of those cases. Mainly the `if err != nil` boilerplate is at issue here.

I don't think you are doing anything wrong, in this noob's opinion.


If using Python, then I think Peewee strikes a perfect balance. In my view it actually helps with writing and reasoning about queries, which in turn you can get as query+params to use in other places such as pandas. It is neither in the way nor magic bloat.


In the JVM world, jOOQ is great happy-medium technology between the extremes of string SQL and full-blown ORM:

- Type-safe DSL generated from your schema that looks like SQL

- (If you can write the SQL, you can write it in jOOQ)

- SQL-injection protection

- Strong multi-dialect and vendor-specific support

- Custom-type mappers

- Composable queries

- Compile-time, schema-aware query safety

- Fine-grained transaction control

- And so much more...


There are various non-ORM database libraries for TypeScript too. I develop Zapatos, but Slonik and pg-typed are probably better known.

https://jawj.github.io/zapatos/

https://github.com/gajus/slonik

https://pgtyped.vercel.app/


My advice: advice with no reasoning or context is not useful to anyone.


Much, much e-ink has been spilled on the subject of ORMs-good-or-bad. I shan't reprise all of that here.


Literally you are the person who brought up ORMs.



Hibernate + Spring Data allow me to write native queries and map them to interfaces for me.

Using an ORM doesn't mean you have to rely on behaviour that generates slow (e.g. n+1) query issues.


I like that, but it's the testing that makes it hard. Being able to use an ORM and then mock your database with SQLite or similar is very useful. You can use Testcontainers, but not in my firm.


This is the promise, but in my experience nasty bugs occur when the abstraction inevitably doesn't work.

If it doesn't work with a real-life implementation, e.g. a Docker test container, then I don't consider it a robust test.


Yeah, one problem is my corporate employer doesn't let us use Docker.


Would plain virtual machines be acceptable?


You can test with PG. Trivially.


You do know an ORM is a thing that is used with a database, right? Perhaps you actually meant "don't use a relational database". Either way, LOL


You are comparing the wrong databases together! Compare MySQL and PostgreSQL instead


Do clickhouse next!


Is the app used by a single user or shared across many users? What question could be simpler?


Use the right tool for the right job.


Agreed. Performance comparison is interesting in a synthetic benchmark context but the premise of the title isn't that.



