
I've used PostgreSQL in the first two scenarios and would love an opportunity to in the third. It's worked really well and hasn't caused any problems under decent loads.

The one feature of Redis I'd love to have supported in PostgreSQL is being able to set a TTL on a record. On a recent project where we could only keep personal data for so long, it was a must-have feature, so we had to use Redis for that purpose instead.



Redis EXPIRE doesn't actually delete any data after it expires though. Active deletion happens at random, so you can easily still have expired values in memory months later:

> Redis keys are expired in two ways: a passive way, and an active way.

> A key is passively expired simply when some client tries to access it, and the key is found to be timed out.

> Of course this is not enough as there are expired keys that will never be accessed again. These keys should be expired anyway, so periodically Redis tests a few keys at random among keys with an expire set. All the keys that are already expired are deleted from the keyspace.

> Specifically this is what Redis does 10 times per second:

> 1. Test 20 random keys from the set of keys with an associated expire.
> 2. Delete all the keys found expired.
> 3. If more than 25% of keys were expired, start again from step 1.

So really it's not much better than doing `SELECT value from keys where key=? and expires > now()` with manual deletion. Though I agree that it can be more convenient.
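
For illustration, the manual-deletion side is a one-liner too (a sketch, assuming a keys table with key, value and expires columns like the query above):

-- run periodically from cron or any scheduler; the frequency just bounds how
-- long expired rows linger, much like Redis's sampling loop does
delete from keys where expires <= now();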


I would contend that it really depends on what one prioritizes most in that scenario. In my experience, what matters about Redis EXPIRE is that an expired key is no longer selectable, and that is the primary requirement for a lot of development around EXPIRE/TTL. It's OK if the value is still in memory in some form; it still won't be accessible through an application's SDK or the CLI. Since Redis 2.6 the expire error is from 0 to 1 milliseconds, which is accurate enough for many use cases. Not to mention, Redis handles that deletion for you: you don't need to run a deletion job and/or include an additional condition on every query.

Additionally, EXPIRE/TTL/GET/SET in Redis is incredibly easy to use (and abuse, hence the OP article). Some teams' main criterion is limiting the number of moving parts, and that's great: skip Redis and use a relational database for everything you mentioned, as a queue, a cache, a message broker, etc.

Other teams may care less about an extra moving part if it means their code will look simpler, while they keep leveraging relational databases for their more common use cases.


The fewer moving parts bit is key.

It was a government project, written by one team (us) to be maintained by another.

The data that needed to be expunged was user signup data; upon completion the record was sent to a CRM and the Redis record destroyed. If the signup wasn't finished, it was automatically removed after 12 hours.

Referential integrity wasn't really a problem: emails are unique, and if two records clash they are auto-merged by the CRM.

Setting up scheduled tasks, triggers, partitioning, cron, etc. just means more things that can go wrong. If they go wrong _and_ go unnoticed, we end up with piles of data we shouldn't have. That would be many different kinds of bad.

Doing `redis.set(k, v, ex: 12.hours)` or whatever is just easier.


You could very easily create a database view that applies that WHERE clause, and even prevent your db user from selecting from the underlying table.
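
Something like this (a rough sketch with made-up names, assuming a separate app role and that the view is owned by a more privileged role):

create table kv_raw (key text primary key, value text, expires timestamptz not null);

create view kv as
  select key, value, expires from kv_raw where expires > now();

-- the app role can only go through the view; simple single-table views
-- like this are auto-updatable, so inserts/updates/deletes still work
revoke all on kv_raw from app_user;
grant select, insert, update, delete on kv to app_user;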

You could also use a library like pg-boss to handle the cleanup task.


> Redis EXPIRE doesn't actually delete any data after it expires though.

I guess OP likes the simplicity that built-in expiration provides. In your example, every select that reads the value will need to include this expiration check, and some scheduled process will have to be written to actually delete the values.


I would access the table through a view that had that query built into it.

create table all_items(id integer, value text, expires timestamp);

create index all_item_expiry on all_items(expires);

create view items as (select id, value, expires from all_items where expires > now());

Then you can treat items as your base table; Postgres neatly allows INSERT/UPDATE/DELETE through it. You'll need a job to clean up expires < now() rows, but it can run at whatever arbitrary interval you like; it could even be a trigger in PG if you were feeling spicy.
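
And if the pg_cron extension happens to be available (an assumption; it isn't installed everywhere), the cleanup job can live inside the database too:

-- requires the pg_cron extension; runs the purge every 5 minutes
select cron.schedule('purge-expired-items', '*/5 * * * *',
  $$delete from all_items where expires <= now()$$);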


Hmmm, I disagree that it's not better. A SELECT implies an index scan, most likely O(log n), while a GET operation is essentially constant time (two or three operations). And you also have to run a DELETE in SQL to remove the expired keys.

Oh, and I'm not entirely sure about the part about Redis active expiry (disabled by default; the default is lazy removal of expired keys on lookup); you're talking about key eviction, which applies to all deleted keys and AFAIR happens only when certain watermarks are hit. Since it happens in RAM it's also faaaast, unlike a SQL DELETE, which will definitely involve disk...


Wouldn’t a simple scheduled batch job be enough to go through periodically and drop records older than N days?

This would also give you audit logs, etc., as well as flexibility to adjust business logic without updating the TTL on all records.


Deleting a large amount of data in PostgreSQL is expensive. First you need an index on a column to select the expired data; then you actually need to delete the rows, which creates a lot of garbage and is heavy and slow.

Creating and deleting a lot of data in PG is a pain because of MVCC and vacuum. One useful trick is to partition the data into separate tables and truncate/drop entire tables; drop/truncate is instant and reclaims space immediately.
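
A rough sketch of that pattern with declarative partitioning (table and column names are made up):

create table events (
  user_id bigint,
  payload jsonb,
  created_at timestamptz not null
) partition by range (created_at);

create table events_2024_06_01 partition of events
  for values from ('2024-06-01') to ('2024-06-02');

-- retiring a whole day of data is a metadata change, not a row-by-row delete
drop table events_2024_06_01;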


You can do the equivalent by adding a timestamp column with an index on your table and adding "where now() - timestamp <= TTL" (or some computationally easier form that doesn't require math on the indexed column).
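
The "computationally easier way" is usually just keeping the arithmetic off the indexed column so the index can actually be used (a sketch, assuming a created_at column and a 28-day TTL):

create index on user_data (created_at);

-- index-friendly: the comparison value is computed once, the column stays bare
select * from user_data where created_at > now() - interval '28 days';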

TTL on systems like Cassandra is pretty ugly and deleting data is hard as you scale no matter how you do it. I don’t think Postgres would be able to implement a TTL that is magically better than the rest.


Deleting whole partitions is a generally useful strategy. It's like the difference between single inserts and batch inserts (often a huge performance difference, and much lower IO).

Since you mentioned Cassandra and TTL, I'll mention ClickHouse: it has very nice TTL options, and splitting into smaller partitions and using "ttl_only_drop_parts=1" has proven itself in production with big data ingestion rates.

Last but not least, I almost always prefer Postgres for data storage needs; one can trust it to be safe and fast enough. Only some specific situations warrant other solutions, but it's a long way until that point (if ever), and it's better not to optimize too early.


> Creating and deleting a lot of data in PG is a pain because of MVCC and vacuum. One useful trick is to partition the data into separate tables and truncate/drop entire tables; drop/truncate is instant and reclaims space immediately.

In this case, the requirement is that user data must only be kept for a certain time and not longer.

If that time is a property of the record's creation and we're allowed to keep the data with an error of up to a day, I guess it's easy: we partition the table on a daily basis and drop any partition that is older than 28 days. Sometimes a record will be closer to 29 days old when it's deleted, but we've accepted that, and it's easy enough to write queries so that it's unavailable to the application once it's more than exactly 28*24*60*60 seconds old, if that's our constraint.

If the requirement is to keep it based on the last of a certain kind of use, we'd need to move records from one partition to another to keep them. For instance, if we can keep data for 28 days after the user last logs in, we can't just drop the partition, unless we've moved the user each day they log in.

If we have that kind of a constraint, where data lifespan is based on properties that change over the lifetime of the data, is partitioning + drop actually a useful approach? The drop will still be instant, I guess, but it's the movement of data over its lifetime that concerns me here.


You’d probably already have indices by user then. There are also other ways to store the data, for example not as a time series of user actions but as another data structure. It just depends on your requirements and data model.

The gist of the original article is asking whether you could reduce tech stack complexity and use a single set of technologies for more use cases, allowing you to understand the tools you are using better.

Also, note that a traditional database may or may not be the right tool for the job - there are different storage/behavior needs for writing lots of data, processing lots of data and serving the results.


> You’d probably already have indices by user then.

Do you mean "you'd probably already have indices by user then, so you won't be able to take advantage of quick drops"?

> There are also other ways to store the data, for example not as a time series of user actions but as another data structure. It just depends on your requirements and data model.

I suppose I want to be a bit more specific. Till now, if I wanted to deal with this issue, I would just have used a cron job and some indexes and taken on the load. But what is an example of a nice way of dealing with this? You get to make up plausible requirements and data model details, perhaps drawing on particular requirements you had when you faced a similar issue.

> The gist of the original article is asking whether you could reduce tech stack complexity and use a single set of technologies for more use cases, allowing you to understand the tools you are using better.

The legitimacy of a question in a free-form conversation doesn't depend on the original inspiration, several iterations ago. But even if it did, the question here is exactly about understanding the tools better and whether we can use one tool instead of two, so by your summary it's perfectly on topic.

> Also, note that a traditional database may or may not be the right tool for the job - there are different storage/behavior needs for writing lots of data, processing lots of data and serving the results.

The subquestion here is precisely if we can get away with just using Postgres for data with mandatory lifespan requirements that vary over the lifetime of the data.

Extra tools come at a cost - that is the presumption of this article and the thread it has spawned.

If we have to use Postgres, we need to assess the cost of doing this in Postgres before we can decide whether or not to pay the cost of other tools. Waving in the general direction of other tools isn't helpful; it probably takes as much work to enumerate candidates as it does to calculate the cost of doing it in Postgres.


I was going to say... the standard approach is to simply partition the data by time, at which point it gets really easy to manage.


Sure, there are different design patterns for different data sizes. This also adds querying complexity, so it just depends on what's needed.

Also, most applications have predictable peak and low periods of load (e.g. users concentrated in a given set of time zones), which make for good times to run otherwise disruptive functions, etc.


It would, and you'd also get nice features like proper transactions and referential integrity.

BUT that comes at a cost. Doing that in Redis is just so easy, and the vast majority of the time no-one is going to notice a few concurrency errors, YOLO!

;-)

To be serious: it's a nice tool to have in the box, and it's amazing for inexperienced/starting developers because you can do a lot with very little time investment or knowledge.


What's wrong with Redis' transactions?


The requirement for the project was that the deletion should be handled in as fail-safe a manner as possible. Relying on external jobs, cron, other scheduled tasks, etc. was deemed a risk.

We were already using Redis for other things, so it was the logical choice for this scenario.


You can do that using Postgres partitioning and a trigger to drop old partitions from the table.


Exactly this. Even MongoDB has a TTL index, which is amazingly nice.



