Hacker News

If the promotion logic is wrong how would persistence have helped? Say the primary disk fails. If it's still considered primary when it's brought back up with a fresh disk, wouldn't you get the same empty-replication problem? (I know nothing about redis, just wondering.)


In the case described, there is no promotion logic.

The replicas will try to reconnect to their original master forever unless something else (like Sentinel) redirects them in an actual failover/promotion setup.

So, the master had data, it died, it restarted with no data, then the replicas immediately reconnected. If the master had persistence enabled, it would have reloaded the old dataset on startup and the replicas would have re-downloaded everything. Since they are replicas of the master, they will always prefer the master's data over their own, even if the master is empty.
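For reference, these are the standard redis.conf directives that determine whether the master has anything to reload after a restart (the exact schedule values below are just illustrative defaults):

```
# Disable RDB snapshots entirely -- the dangerous configuration
# discussed here: nothing is saved, so a restarted master comes
# back empty and the replicas resync that empty dataset.
save ""
appendonly no

# With either mechanism enabled instead, the master reloads its
# dataset on startup:
# save 900 1        -> RDB snapshot after 900s if >= 1 key changed
# appendonly yes    -> append-only file replayed at startup
```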

If you were in a strange case where the disk failed and you replaced it with an empty disk (is that what you mean by "fresh disk?"), then it's the same as starting with an empty dataset. That's not quite the scenario here, though: in that case the server is intentionally started empty after a maintenance action, rather than an already-populated process restarting empty because there's no saved dataset to load on startup.

The "all replicas resync an empty dataset" behavior is a logical consequence of the configuration they enabled, but one without obvious repercussions unless you either experience it directly or work through a longer multi-step thought experiment. (But fixes for such things are already on the way, soon!)


Just to add some more info:

Funny enough, what triggers this problem when master persistence is turned off is the lack of failover: if the reboot happens fast enough, Sentinel (in case you are using it) never gets the chance to fail over to a replica. So no failure was sensed at all; the master just magically wiped its data set.

So from the point of view of distributed systems, if you analyze the combination of Redis replicated nodes + Sentinel as a whole, the problem is that the system is not designed to cope with nodes losing state on restarts.
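The failure sequence described in this thread can be illustrated with a toy simulation. This is not Redis code, just a sketch of the semantics: a master that either reloads from disk or starts empty, and a replica whose full resync always replaces its own dataset with the master's.

```python
# Toy model of the failure mode discussed above (illustrative only).

class Master:
    def __init__(self, persistence=False):
        self.persistence = persistence
        self.data = {}   # in-memory dataset
        self.disk = {}   # stand-in for an RDB/AOF file

    def write(self, key, value):
        self.data[key] = value
        if self.persistence:
            self.disk[key] = value  # persisted copy survives restarts

    def crash_and_restart(self):
        # On restart the master reloads from disk, or starts empty.
        self.data = dict(self.disk) if self.persistence else {}

class Replica:
    def __init__(self, master):
        self.master = master
        self.data = {}

    def full_resync(self):
        # A replica always replaces its dataset with the master's,
        # even when the master's dataset is empty.
        self.data = dict(self.master.data)

# Without persistence: the replica's good copy is wiped after the restart.
m = Master(persistence=False)
r = Replica(m)
m.write("user:1", "alice")
r.full_resync()
m.crash_and_restart()  # fast reboot, no failover triggered
r.full_resync()        # replica resyncs the now-empty master
print(r.data)          # {}

# With persistence: the master reloads from disk, the replica recovers.
m2 = Master(persistence=True)
r2 = Replica(m2)
m2.write("user:1", "alice")
r2.full_resync()
m2.crash_and_restart()
r2.full_resync()
print(r2.data)         # {'user:1': 'alice'}
```

The point of the sketch is the asymmetry: the replica holds valid data right up until the resync, but the replication model trusts the master unconditionally.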

However it is possible to improve it, and I'm working on it. But before diskless replication it was IMHO pretty useless to support persistence-less operation in conjunction with replication, since for the slaves to synchronize, the master had to save an RDB file on disk anyway.
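For context, diskless replication is enabled with standard redis.conf directives on the master side (the delay value below is illustrative):

```
# Serve the RDB payload to replicas directly over the socket,
# instead of first writing it to the master's disk.
repl-diskless-sync yes

# Wait a few seconds before starting the transfer, so that
# multiple replicas arriving together can share one child process.
repl-diskless-sync-delay 5
```

This removes the on-disk RDB step that previously made persistence-less replication setups pointless.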



