> I really would like to see bi-temporal table support eventually come to PG
> Is anybody aware of somebody working on that for PG?
Not that I am aware of.
Postgres is a fantastic database with zero licensing headaches, but when people ask why companies are still paying for commercial DBMS engines, these are some of the features those engines have that Postgres lacks and that I often miss:
1)"temporal" / versioned tables / time consistent views of data . Oracle has a lot of "flashback" related options at the DB and table levels which can be really useful.
2) Easy, built-in partitioning for large datasets. (This is being worked on and will probably start to land in Postgres 10.)
3) Better backup options for large databases, such as incremental/differential backups.
I recall that in Yandex's migration from Oracle to Postgres, incremental backups were one of the things they mentioned it would be great to have.
4) Along with backups, the built-in recovery options in Postgres are fairly basic compared to the single-table recovery, corruption repair, etc. that other DBs offer.
> why companies are still paying for commercial DBMS engines
In my experience, in 90% of cases application code migration to a new DB engine is the real roadblock.
I've seen many cases where even migrating to a new Oracle version is a huge problem because of the required app code review and full user acceptance tests. So companies pay licence fees for old, unsupported versions, which is even more absurd.
And frankly speaking, for many simple apps even SQLite has all the features necessary to replace Oracle.
> 1)"temporal" / versioned tables / time consistent views of data
Oracle doesn't have transactional DDL, which means that if you drop a column or create a table, it commits your transaction automatically.
This is huge, because with PostgreSQL you can import a full database in a single transaction, and roll it back if anything goes wrong. You can upgrade a full application and roll back if anything goes wrong. No more partial-state headaches.
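Roughly what that looks like in practice (the table and column names here are just made-up examples):

    BEGIN;

    -- DDL is just another statement inside the transaction.
    CREATE TABLE import_staging (id int PRIMARY KEY, payload text);
    ALTER TABLE import_staging ADD COLUMN loaded_at timestamptz;

    -- If anything in the import or upgrade goes wrong, roll the whole
    -- thing back; the table and the column change simply never happened.
    ROLLBACK;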
That said, I'm not sure what "versioned tables" means here. Have you tried PostgreSQL's temporal_tables extension, which seems to do what you mean?
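For reference, a rough sketch of how that third-party temporal_tables extension is usually wired up; the table names below are made up, and the exact trigger arguments should be checked against the extension's docs:

    -- Requires the temporal_tables extension to be installed on the server.
    CREATE EXTENSION IF NOT EXISTS temporal_tables;

    CREATE TABLE employees (
        name       text PRIMARY KEY,
        salary     numeric,
        sys_period tstzrange NOT NULL DEFAULT tstzrange(current_timestamp, null)
    );

    -- Old row versions are moved here on UPDATE/DELETE.
    CREATE TABLE employees_history (LIKE employees);

    CREATE TRIGGER employees_versioning
        BEFORE INSERT OR UPDATE OR DELETE ON employees
        FOR EACH ROW EXECUTE PROCEDURE versioning('sys_period', 'employees_history', true);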
However there might be a lot of WAL to apply, as data gets added then deleted or changed.
A differential backup (see Note) solves this somewhat for large databases by just backing up the blocks changed since the last full backup.
So you might do a full backup once a week, and differentials nightly in between and also save the WAL. This means you are not backing up many TBs every night.
To recover, you would restore the full backup, plus the last differential backup plus any WAL since the last differential. This reduces the volume of WAL to be applied in a recovery.
Note:
Differential backups would include all the changes since the last FULL backup.
Incremental backups would include changes since the last backup, whether a full or the last incremental.
I have used PG for 10 years and have never needed incremental backups. But if someone asked me about it, I would respond that PG has no incremental backups.
So WAL backup is not well known or well documented, in my opinion. An average DB admin needs a tool that will generate an incremental backup as a single file named something like dbname_20160926.sql
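For what it's worth, the low-level pieces are exposed at the SQL level on 9.x; a rough sketch of the classic (exclusive) base-backup flow, assuming WAL archiving (archive_mode/archive_command) is already configured:

    -- Mark the start of a base backup (the label is arbitrary).
    SELECT pg_start_backup('nightly_base_backup');

    -- ...copy the data directory with a file-level tool of your choice
    -- while the server keeps running...

    -- Mark the end; the archived WAL from this point on is what makes
    -- point-in-time recovery possible.
    SELECT pg_stop_backup();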
Barman[1] actually gives you something better than incremental backups: it sets up the backup server as a database slave via PG's built-in replication, and it gives you point-in-time recovery with the nicest CLI of any backup tool for any DB I've ever used.
If you think backup tools are a minus for PG, I guess you've never used barman. Easy as pie and has saved my bacon several times when CASCADE DELETE wrecked a few dbs for me due to coworkers futzing around.
> If you think backup tools are a minus for PG, I guess you've never used barman
You are right, I've never used barman. But I have streaming replication configured manually. Generally I am quite happy with PG backup tools.
However, this discussion made me think about pg_dump's deficiencies. WAL backup covers the whole DB cluster, but how do I incrementally update a single DB, e.g. bring one prod DB into my dev environment?
Indeed, that is something WAL files don't work for, and there aren't a lot of great answers. I generally restore a point-in-time snapshot somewhere and then dump out that entire DB to restore via SQL or whatnot.
The corollary to the backup situation is that if PostgreSQL ever ended up operating in a similar fashion to Oracle's RMAN, particularly on Oracle's encouraged ASM, PostgreSQL would very quickly start seeing "PostgreSQL requires a full-time DBA, just use MariaDB" type blog posts. I can see how the middle ground is a hard problem.
I agree, Postgres having its own storage subsystem would be wrong. Fortunately the PG philosophy is to leave what the OS does best to the OS, so it's unlikely that situation would arise.
However, a middle ground of being able to back up just the changes made in a multi-terabyte database since the last full backup wouldn't be controversial, and it is a common use case.
I've read of a few attempts on the PG dev mailing list over the last couple of years, so hopefully one of them will be completed sometime soon and merged into core.
3) ZFS snapshots with send/receive are what I use for local and offsite backups. They can be as fine-grained as you like time-wise, and are so, so much easier to manage and restore than traditional backups.
This is probably easier, but it might be an issue when you want to skip some transactions (for example, someone drops a table). Make sure you also back up WALs.
> Parallel execution of sequential scans, joins and aggregates
This confuses me: I thought a sequential scan was slow because it was limited by disk I/O; why is it faster to have multiple threads waiting around for the same disk?
(I guessed maybe that it was running different sequential scans in the same query in parallel, but no, pages 3 and 4 of the PDF show parallel workers being used for a single scan of a single table)
> I thought a sequential scan was slow because it was limited by disk I/O
Not necessarily. Suppose you have a fairly big table that fits in the shared buffer cache and you run a query with a predicate that returns only a handful of rows. Then you would be better off scanning parts of it in parallel and then combining the results.
Parallel sequential scan is a building block that enables other work to be done in parallel. Once you've got a way to split up a sequential scan by handing chunks to different workers, those workers can then do other CPU intensive work like filtering (involving potentially expensive expressions), partial aggregation (for example summing a subtotal that will be added to other workers' subtotals at the end) and joining (say against a hash table).
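A rough illustration of that on 9.6, with a made-up table; the exact plan shape and worker count will vary with settings and table size:

    CREATE TABLE measurements (id bigserial, sensor int, reading numeric);
    INSERT INTO measurements (sensor, reading)
    SELECT (random() * 100)::int, random()
    FROM generate_series(1, 5000000);
    ANALYZE measurements;

    -- Allow up to 4 workers per Gather node (new in 9.6).
    SET max_parallel_workers_per_gather = 4;

    EXPLAIN (COSTS OFF)
    SELECT sensor, sum(reading)
    FROM measurements
    WHERE reading > 0.99
    GROUP BY sensor;

    -- Typical plan: each worker scans its chunk, applies the filter and
    -- computes a partial sum; the leader combines the partial results.
    --
    --  Finalize HashAggregate
    --    Group Key: sensor
    --    ->  Gather
    --          Workers Planned: 4
    --          ->  Partial HashAggregate
    --                Group Key: sensor
    --                ->  Parallel Seq Scan on measurements
    --                      Filter: (reading > 0.99)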
Because "the disk" is the wrong abstraction here; think instead of stuffing I/O requests into a queue, letting the storage controller re-order them, then giving you a callback when the blocks are ready.
I was wondering the exact same thing -- how can CPU-parallelism improve on a task which (to me at least) seems to be disk-IO bound?
I'm guessing that my assumption of the task being disk-IO bound is incorrect!
It's only disk IO bound if you're accessing the disk. With modern servers packing multiple TB of memory, it is likely you won't be accessing the disk at all.
In some cases (like a single HDD and a small amount of RAM) that may be true, but in others (from "the whole table is already loaded into RAM" to "it's on an SSD" to "the table is partitioned across multiple drives"), it should give very good improvements for analytical queries.
I am really glad to see the "searching for phrases" feature for full-text search, as doing that with POSIX regexes was a really painful operation for long text fields.
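A quick sketch of what the new phrase operators look like (the sample strings are made up):

    -- phraseto_tsquery() builds a query that requires adjacent words;
    -- <-> in a tsquery means "immediately followed by".
    SELECT to_tsvector('english', 'PostgreSQL 9.6 adds parallel query execution')
           @@ phraseto_tsquery('english', 'parallel query');          -- true

    SELECT to_tsvector('english', 'PostgreSQL 9.6 adds parallel query execution')
           @@ to_tsquery('english', 'parallel <-> query');            -- true

    -- Words that merely co-occur no longer satisfy a phrase query:
    SELECT to_tsvector('english', 'a parallel universe where every query is slow')
           @@ phraseto_tsquery('english', 'parallel query');          -- false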
https://wiki.postgresql.org/images/6/64/Fosdem20150130Postgr...