The current default is data=ordered, which should prevent this problem if the ha...

charleshn · on Nov 22, 2023

It's likely because of delayed allocations (delalloc): https://issuetracker.google.com/issues/172227346#comment6

because the only guarantee which data=ordered provides is the security guarantee that stale data won't be revealed.

Yes, it's bad and breaks prefix append consistency, and does not match the documentation...

fulafel · on Nov 22, 2023

For more context, that's a comment from one of the ext4 main authors, Ted Ts'o. the other subsequent comment from him spells out the case more but sadly no spelled out NUL byte origin story I spotted from skimming.

charleshn · on Nov 30, 2023

The original report [0] shows the corruption due to NUL bytes at the end of the file (see the hexdump). This comment [1] from Ted Ts'o details the exact chain of events leading to it.

[0] https://issuetracker.google.com/issues/172227346 [1] https://issuetracker.google.com/issues/172227346#comment8

matja · on Nov 22, 2023

> which should prevent this problem if the hardware doesn't lie.

Or, one can take the ZFS approach and assume the hardware often lies :)

worthless-trash · on Nov 22, 2023

I do not know how zfs will overcome hardware lying. If its going to fetch data that is in the drives cache, how will it overcome the persistence problem ?

danarmak · on Nov 22, 2023

It will at the very least notice that the read data does not match the stored checksum and not return the garbage data to the application. In redundant (raidz) setups it will then read the data from another disk, and update the faulty disk. In a non-redundant setup (or if enough disks are corrupted) it will signal an IO error.

An error is preferred to silently returning garbage data!

nolist_policy · on Nov 22, 2023

The "zeroed-out file" problem is not about firmware lying though, it is about applications using fsync() wrongly or not at all. Look up the O_PONIES controversy.

Sure, due to their COW nature zfs and btrfs provide better behavior despite broken applications. But you can't solve persistence in the face of lying firmware.

Even thought zfs has some enhancements to not corrupt itself on such drives, if you run for example a database on top, all guarantees around commit go out the window.

Dylan16807 · on Nov 22, 2023

"Renaming a file should always happen-after pending writes to that file" is not a big pony. I think it's a reasonable request even in the absence of fsync.

nolist_policy · on Nov 22, 2023

Well, for one rename() is not always meant to be durable. It can also be used for IPC, for example some mail servers use it to move mails between queues. Flushing before every rename is unexpected in that situation.

Fun fact: rename() is atomic with respect to running applications per POSIX, that the on-disk rename is also atomic is only incidental.

Dylan16807 · on Nov 23, 2023

I'm not suggesting flushing for rename. If a file write and a rename happen shortly before power loss, and neither goes through, that's fine.

With this rule, three outcomes are acceptable: both occur, or neither occur, or just the file write happens. The unacceptable outcome is that just the rename happens.

("file write" here could mean a single write, or an open-write-close sequence, it doesn't particularly matter and I don't want to dig through old discussions in too much detail)

fulafel · on Nov 22, 2023

As an aside, can you still get the bad checksum file contents with zfs? Eg if it's a big database with its own checksums you might want to run a db level recovery on it.

gumby · on Nov 22, 2023

It can’t tell if the drive is lying to it.

worthless-trash · on Nov 22, 2023

This was my theory too, its not going to help in -these- situations. I can't see how.

matja · on Nov 22, 2023

Actual file data ends up in the same transaction group (txg) as metadata if both are changed within the same txg commit (either flushed explicitly, due to recordsize/buffer limit being reached, or txg commit timeout - 5 seconds by default), so if there is a write barrier violation caused by hardware lies followed by an untimely loss of power, the checksums for the txg updates won't match and they get rolled-back until the last valid one when importing the pool - which doesn't end up zero'ing out extents of a file (like in xfs) or ending up with a zero file size (like on ext3/ext4).

altfredd · on Nov 22, 2023

The "data" setting of ext filesystems isn't replacement for fsync().

Dylan16807 · on Nov 22, 2023

It's not a replacement but it can give you some guarantees.

Also fsync is a terrible API that should be replaced, but that's mostly a different topic.

the8472 · on Nov 22, 2023

At least on linux you can use io_uring to make fsync asynchronous. And you can initiate some preparatory flushing with sync_file_range and only do the final commit with fsync to cut down the latency.

consp · on Nov 22, 2023

My only n=1 observation is that null values in logs occur on nvme, ssd and spinning rust. All ext4 with defaults. I do have the idea it occurs more on nvme drives though. But maybe my systems settings are just booked.