Recommending a tool without knowing the reasonable bounds of the domain, the current requirements, and how the tool (redis in this case) solves the problem isn't good advice.
Case in point: our team went with Redis as the default choice, only to use it blindly without fully understanding our requirements and how redis helps scale.
2 years later we spent 2 sprints, holding back the release: trying to understand RDB vs AOF and why we were seeing massive spikes in resource consumption and performance degradation that were triggering pod evictions; running comparison tests to prove which works better and explaining why; running QA tests (regression, performance, load); introducing postgres for queuing; redoing our code to bypass the sync mechanism governing how data flows between redis and postgres; updating dependencies; migrating existing customer data (various on-site locations); explaining all of this to team members, managers and their managers, installation technicians, and support engineers; and presenting it at engineering townhalls as a case study in bad decisions.
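(For context, "postgres for queuing" here means the usual SELECT ... FOR UPDATE SKIP LOCKED pattern; a minimal sketch, with a hypothetical jobs table rather than our actual schema:)

```python
# Minimal sketch of a Postgres-backed queue using SKIP LOCKED.
# Table and column names are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=app user=app")

def dequeue_one():
    # Grab one job that no other worker holds a lock on; SKIP LOCKED
    # lets concurrent workers pull different rows without blocking.
    with conn, conn.cursor() as cur:
        cur.execute("""
            DELETE FROM jobs
            WHERE id = (
                SELECT id FROM jobs
                ORDER BY enqueued_at
                FOR UPDATE SKIP LOCKED
                LIMIT 1
            )
            RETURNING id, payload
        """)
        return cur.fetchone()  # None when the queue is empty
```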
Well, by your admission, you used Redis for a problem domain it wasn't suited for in the first place. How is this an argument for using in-database queues?
> use it blindly without fully understanding our requirements and how redis helps scale
I'm sorry, I don't get how I could come across as advocating the use of Redis blindly. My point is: if your data flow looks like a queue, then use a queue; don't hack a relational DB into a queue. I think that's a reasonable rule of thumb, not going in blind.
We needed queues. We used Redis. That fits the domain.
Problem was, there wasn't a good answer to "How much redis does your team need to know to put it in production?"
We thought we knew it well enough, we thought we knew what we were getting into, and we thought that with so many others using it for this, we should be good. Clearly, it makes a difference.
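For what it's worth, our usage was the textbook Redis work-queue pattern, something like this (simplified sketch; the key names and handler are hypothetical):

```python
# Simplified sketch of the classic Redis reliable-queue pattern.
import json
import redis

r = redis.Redis()

def enqueue(task):
    r.lpush("jobs", json.dumps(task))

def work_one():
    # Atomically move a job to a processing list so a crashed
    # worker doesn't silently drop it.
    raw = r.brpoplpush("jobs", "jobs:processing", timeout=5)
    if raw is None:
        return  # queue empty
    task = json.loads(raw)
    print("processing", task)  # stand-in for the real handler
    r.lrem("jobs:processing", 1, raw)  # ack: remove once done
```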
Also, reading your reply I get the impression that the "sync mechanism between redis and postgres" was the bottleneck. Wondering if you can add some details around it, and also whether this was something that couldn't be fixed by fine-tuning the redis config, rather than completely removing it from your stack.
There were many problems, but at the core of it was redis writing huge amounts of data to disk very frequently.
We could not reduce the frequency (the product would not allow it), and we couldn't find a way to make the writes reliably fast.
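For reference, the knobs we'd have been fine-tuning are the persistence settings, roughly along these lines (illustrative values via redis-py's CONFIG SET, not what we actually ran in production):

```python
# The persistence settings in question, set at runtime for illustration.
import redis

r = redis.Redis()

# RDB: fork and dump the whole dataset on a schedule; the fork plus the
# bulk write is where the big periodic I/O and memory spikes come from.
r.config_set("save", "900 1 300 10")

# AOF: append every write to a log instead; the fsync policy trades
# durability against write latency.
r.config_set("appendonly", "yes")
r.config_set("appendfsync", "everysec")  # "always" is safest but slowest
r.config_set("no-appendfsync-on-rewrite", "yes")  # skip fsync during rewrites
```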
I like to believe there exists a way of handling this, but the point is, our team had no time to find out how redis works internally and build confidence that the new approach wouldn't bring up new surprises.
Not worth it.