Making PostgreSQL tick: New features in pg_cron (citusdata.com)
194 points by thunderbong on Oct 26, 2023 | 23 comments



The ability to run a cron job every few seconds is pretty cool and avoids those OS cron scripts that loop and sleep, so I'm glad that was added. The only downside to cron jobs running inside Postgres is PL/pgSQL. It's never my first choice for scripting SQL operations; I'd much rather use Python, Ruby, etc. I'm glad it exists when there's no other option, but it's kind of its own beast.
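For reference, the second-granularity schedules mentioned above use pg_cron's new interval syntax. A sketch, assuming the pg_cron extension is installed; the job names and commands are made up for illustration:

```sql
-- Schedule a job every 10 seconds using pg_cron's interval syntax
-- (values from 1 to 59 seconds are supported in recent versions).
SELECT cron.schedule('refresh-rollup', '10 seconds',
                     $$REFRESH MATERIALIZED VIEW CONCURRENTLY rollup$$);

-- The classic five-field cron syntax still works for coarser schedules:
SELECT cron.schedule('nightly-vacuum', '0 3 * * *', 'VACUUM ANALYZE');
```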

That said, I'm very happy to see more and more stuff coming from Citus. We're looking to split a few growing databases and a bunch of folks are excited to try Citus again now that the enterprise features have been open sourced.


At the risk of sounding like a Golang `interface{}` "you're using it wrong" defender, most of the objection to PL/pgSQL comes from people who don't know what they're doing. They look at assignment, looping, and manipulation and think it would be much easier in JavaScript or Python. But by that point they've usually embarked on the wrong route: if they knew how to join in an UPDATE, or use CTEs, or use window functions, they wouldn't need a big ugly block of imperative code. 90% of the time you see a FOR loop, the whole block can be collapsed into a single, more performant update/upsert statement. That said, I still have 1000+ line PL/pgSQL functions, and the language constructs could definitely be improved; just a little sugar, please...
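To illustrate the FOR-loop collapse described above (the tables and columns here are hypothetical):

```sql
-- Imperative version: row-by-row loop in PL/pgSQL.
DO $$
DECLARE
  r record;
BEGIN
  FOR r IN SELECT customer_id, total FROM orders LOOP
    UPDATE customers
       SET last_order_total = r.total
     WHERE id = r.customer_id;
  END LOOP;
END $$;

-- Declarative version: a single UPDATE with a join, which the planner
-- can execute as one set-based operation instead of n round trips.
UPDATE customers c
   SET last_order_total = o.total
  FROM orders o
 WHERE c.id = o.customer_id;
```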


As a smell test, when writing a FOR loop in PL/pgSQL, you might ask yourself, "does the order of these rows matter?" and "do I need to see all of them?" (rather than, say, just the first or last). If the answer to either question is no, you can probably rewrite it declaratively. You might even be able to write the entire function in plain SQL rather than PL/pgSQL, at which point you might ask yourself whether it's better as a view instead of a function. (Similarly, if the answer to both questions is yes, you definitely cannot rewrite it declaratively, unless you figure out a different algorithm.)

If all of them matter but the order doesn't, you might be able to use aggregations like sum.

If the order matters but you don't need to see all of them, you might use aggregations like min and max.

You can refactor some of your logic into an aggregate function if your use case isn't covered by the built-ins. This is often helpful when you have core business logic that contains lots of if statements and so is easier to write imperatively.
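The custom-aggregate idea can be sketched with CREATE AGGREGATE. The "worst status wins" rule below is an invented example of if-laden business logic folded into an aggregate:

```sql
-- State transition function: under a hypothetical business rule,
-- errors dominate warnings, which dominate 'ok'.
CREATE FUNCTION worst_status(acc text, val text) RETURNS text
LANGUAGE sql IMMUTABLE AS $$
  SELECT CASE
    WHEN 'error'   IN (acc, val) THEN 'error'
    WHEN 'warning' IN (acc, val) THEN 'warning'
    ELSE 'ok'
  END
$$;

CREATE AGGREGATE worst(text) (
  SFUNC    = worst_status,
  STYPE    = text,
  INITCOND = 'ok'
);

-- The imperative loop then becomes a plain aggregation:
-- SELECT batch_id, worst(status) FROM events GROUP BY batch_id;
```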

But I also wish the language were more ergonomic.


Can someone explain the difference between declarative and imperative in this context? My understanding is SQL is declarative.

So is declarative something like “writing high level statements that postgres turns into more specific (imperative?) instructions”? And imperative being writing those specific “do this, then do this” yourself (although probably differently than how postgres does it under the hood).


SQL is declarative, but Postgres provides an alternative language for database-side programming called PL/pgSQL[1]. PL/pgSQL adds imperative features, like FOR loops and variables, as a superset of SQL's features.

So functions in the database can be written in this imperative language (or in SQL) and integrated into features like database triggers, or called by clients (or, commonly, used to write pg_cron jobs).

And your understanding of the distinction between them is correct. Declarative code describes what you want to do but not how you want to do it, and Postgres' query planner (or that of any fully featured SQL database) uses statistics it has recorded about your data, and indexes you've created, to optimize the execution of your query (i.e., decide how to do it). PL/pgSQL can tell Postgres both what to do and how, except in the places where you're writing declarative SQL statements (since it's a superset). Breaking things up into smaller units like that limits the scope the planner can optimize over, which may defeat or limit certain optimizations, so you may end up with something less performant.

For some straightforward examples: if you write a min or max function with a FOR loop, you have written an O(n) implementation of something the query planner can sometimes do in O(log n) using an index. If you write a sum function, you've written a sequential implementation of something the query planner may be able to parallelize.

[1] https://www.postgresql.org/docs/current/plpgsql.html
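To make the min example above concrete (the `products` table is hypothetical):

```sql
-- O(n): a hand-rolled min in PL/pgSQL that scans every row.
CREATE FUNCTION my_min_price() RETURNS numeric
LANGUAGE plpgsql AS $$
DECLARE
  r    record;
  best numeric;
BEGIN
  FOR r IN SELECT price FROM products LOOP
    IF best IS NULL OR r.price < best THEN
      best := r.price;
    END IF;
  END LOOP;
  RETURN best;
END $$;

-- Declarative: with a btree index on price, the planner can answer
-- min() by descending the index to a single entry.
CREATE INDEX ON products (price);
SELECT min(price) FROM products;
```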


By being declarative you tap into what's good about the database, into the reason you are using databases at all. You are doing it because databases have primitives for operating on data in bulk, searching, aggregating, updating.

When you do things imperatively instead, you degrade performance to pretty much what you could achieve with a piece of Python or JS operating on a data set pulled out of the database wholesale, without using any of the database's fast features.

The main trouble with imperative SQL code is that some people who are forced to do something in the database don't really know why SQL exists, and given the opportunity to write imperative code, they jump straight to it instead of solving the puzzle of understanding SQL's declarative primitives and applying them to their problem. So the project ends up with slow, buggy imperative code in horrible syntax. When somebody needs to fix it later, they first have to find the slow business logic (the database is not the first place one looks for it), then understand the ugly imperative code to figure out what it's doing, then actually solve the problem properly, which takes time and thinking. Some queries I wrote at my jobs took me a day or two of thinking; the imperative versions would have taken two hours but would have been orders of magnitude slower on medium-sized data sets.


Don't forget that if order matters, you can probably still use window functions and have no problem. I wrote an iterative but still set-based Newton's method least-squares fit in CTEs/window functions, and it was tens of thousands of times faster than the looping option.
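As a small illustration of order-sensitive but still set-based work (not the commenter's least-squares code; the `ledger_entries` table is hypothetical), here is a running balance that would otherwise tempt you into a loop:

```sql
-- Running balance per account, computed in one pass with a window
-- function instead of iterating over rows in PL/pgSQL.
SELECT account_id,
       posted_at,
       amount,
       sum(amount) OVER (PARTITION BY account_id
                         ORDER BY posted_at) AS running_balance
  FROM ledger_entries;
```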


And you are so right you decided to use a throwaway account to make this point.


PL/pgSQL is a great example of the kind of DSL that I would have avoided in the past but I'm actually much happier to use today because GPT-4 can explain code to me and even write basic things that work which I can then improve on.


Yeah, but it can't *easily* manipulate the file system, talk to the cache store, run bash commands, make HTTP requests, and so on.

Cron jobs are not just about DB updates, and once you have one cron job in Python, having them scattered across several languages is not ideal.


Yes, but there are advantages to doing some things inside your DBMS. Things have tradeoffs as always.


One way to use raw Python/JavaScript/Bash on top of a Postgres scheduler is to run a single container of https://github.com/windmill-labs/windmill. It only requires a PG connection and does not require a Postgres extension. It implements a scheduler on top of PG and can run any of the aforementioned languages in a very practical and observable interface.


An obvious answer would be PL/Python, though it is not available on most managed services because they prefer to prevent file system access. An exception is https://docs.crunchybridge.com/extensions-and-languages/pyth...

On the other hand, it's quite useful to have a pure SQL cron that cannot do everything but can run safely in managed services or under a restricted database user. That's probably one of the reasons why practically every managed PostgreSQL service offers pg_cron. Doing more things, like running shell scripts, might limit its usefulness.


> I'd much rather use python, ruby, etc.

I think I’m with you on this one


You can use PL/Python no?


I dunno about other hosted Postgres providers, but not on AWS RDS, no, as the Python extension is untrusted. It looks like you can use plv8 via pg_tle (Trusted Language Extensions) [1], though.

[1] https://github.com/aws/pg_tle


Good luck using a venv, loading your libs, upgrading the extensions when needed, etc.


Can't you just use it to call a webhook?


You need an extension, and making the HTTP call reliable with retries and jitter is going to be hard in PL/pgSQL.
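For example, with the pgsql-http extension installed (availability varies across managed services, and the retries/jitter the parent mentions are the part you'd still have to build yourself):

```sql
-- Requires the "http" extension (pgsql-http).
CREATE EXTENSION http;

-- One-shot webhook call: no retries, no jitter, and it blocks the
-- calling backend until the request completes.
SELECT status, content
  FROM http_get('https://example.com/webhook');
```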


plv8 lets you run JavaScript.
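A minimal plv8 sketch, assuming the plv8 extension is available on your instance:

```sql
CREATE EXTENSION plv8;

-- A trivial function written in JavaScript rather than PL/pgSQL.
CREATE FUNCTION plv8_add(a int, b int) RETURNS int
LANGUAGE plv8 AS $$
  return a + b;
$$;

SELECT plv8_add(2, 3);  -- 5
```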


Is that yet another way to place a backdoor in an infected system?


FWIW, pg_cron jobs run as the database user that scheduled the job, the jobs table is not writable, and you cannot see jobs created by other users unless you are the admin.

There is not much you can do with it that you could not via external infrastructure that connects to the database, except you don't need to set up external infrastructure that connects to the database. (and simpler systems are generally more secure)


If you mean that you might be able to use it to disassociate the moment of gaining entry and the moment of exploitation of the system, then yes. Indeed the DB contents may have been synced elsewhere in the meantime, meaning your exploit has spread before triggering, perhaps even to places you'd otherwise not be able to exploit as they are airgapped. You can use it as a persistence mechanism.



