The shutdown loop in one of the blog posts there (sync(), then sleep(2)) has me worried he may get occasional filesystem corruption under those circumstances. I could be wrong, but I recall that sync() can return immediately even though it hasn't finished synchronizing the filesystem writes (for a filesystem that needs that). If so, the sleep(2) gives it some time to get that done.
Is that a correct understanding? If so, is the risk I see (occasional filesystem corruption) a reasonable one?
One would hope that the completion of sync() would mean the data is written out, except I recently read this horror[1] on HN:
Unfortunately, most consumer-grade mass storage devices lie about syncing. Disk drives will report that content is safely on persistent media as soon as it reaches the track buffer and before actually being written to oxide. This makes the disk drives seem to operate faster (which is vitally important to the manufacturer so that they can show good benchmark numbers in trade magazines). And in fairness, the lie normally causes no harm, as long as there is no power loss or hard reset prior to the track buffer actually being written to oxide. But if a power loss or hard reset does occur, and if that results in content that was written after a sync reaching oxide while content written before the sync is still in a track buffer, then database corruption can occur.
… part of me hopes there's a very special hell for the people making disks where the OS can never be sure if the data is safe or not.
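To make the ordering problem in that quote concrete, here is a minimal Python sketch of the write, sync, write sequence a database depends on. The file name and the `COMMIT` marker are made up for illustration and don't come from any real database format; the only real calls are `flush()` and `os.fsync()`.

```python
import os
import tempfile

COMMIT_MARKER = b"COMMIT\n"  # hypothetical marker, not from any real database format


def write_then_commit(path, payload):
    """Write payload, flush it, and only then append the commit marker."""
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()              # push Python's buffer to the kernel
        os.fsync(f.fileno())   # ask the kernel to push it to the device
        # Everything below is "content written after the sync".
        f.write(COMMIT_MARKER)
        f.flush()
        os.fsync(f.fileno())


def is_committed(path):
    """A record counts as committed only if the marker is the last thing in the file."""
    with open(path, "rb") as f:
        return f.read().endswith(COMMIT_MARKER)


def demo():
    """Run the sequence against a throwaway file and return its contents."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "journal.bin")
        write_then_commit(path, b"record-1\n")
        assert is_committed(path)
        with open(path, "rb") as f:
            return f.read()
```

The code is only as trustworthy as the first `os.fsync()`. If the drive acknowledges it while the payload is still in its track buffer, and power is lost after the marker reaches the platter, the file can claim a commit for data that never made it. Nothing in userspace can detect that.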
This problem is nowhere near as widespread as most people claim. While bugs do happen and I can't speak for the SSD side of things, HDD manufacturers test their cache behavior quite thoroughly. This includes pulling the power immediately after flushing the cache to make sure the data made it to disk. 99% of people who report cases of HDDs "lying" about write integrity either have write cache enabled or are not actually issuing a flush cache command due to OS level issues.
I've definitely seen `sync` itself waiting/blocking (especially if you use fuse for something network based and disconnect the cable first), but whether it's guaranteed or not... that's an interesting question.
Edit: after some googling:
On Linux, sync is guaranteed only to schedule the dirty blocks for
writing; it can actually take a short time before all the blocks are
finally written. The reboot(8) and halt(8) commands take this into
account by sleeping for a few seconds after calling sync(2).
This page describes sync as found in the fileutils-4.0 package; other
versions may differ slightly.
So it doesn't look like the writes are guaranteed to take place. Just a best effort + wait + pray :)
According to the standard specification (e.g., POSIX.1-2001), sync()
schedules the writes, but may return before the actual writing is done.
However, since version 1.3.20 Linux does actually wait. (This still
does not guarantee data integrity: modern disks have large caches.)
So it seems like the sleep(2) is there to give the disk enough time to write the cache data.
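For anyone who wants to poke at this, here is a rough Python sketch of the sync-then-sleep pattern that also times how long the sync call itself blocks. The function name and the 2-second default are my own choices, mirroring the sleep(2) under discussion; `os.sync()` is the standard library wrapper around sync() on Unix.

```python
import os
import time


def sync_with_grace(grace_seconds=2.0):
    """Call sync(), then sleep; return how long the sync() call itself blocked."""
    start = time.monotonic()
    # Per the man page quoted above, Linux waits here for the kernel's writes.
    # Whatever is still sitting in the drive's own cache is a separate question.
    os.sync()
    blocked_for = time.monotonic() - start
    # The grace period: time for the device to drain its cache, with no
    # confirmation that it actually has.
    time.sleep(grace_seconds)
    return blocked_for
```

If the returned time is clearly non-zero after heavy writes, sync is blocking on your system and the sleep is not covering for the kernel. It tells you nothing about what the disk did afterwards.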
I changed an OS (before Linux) to sync when idle. So by the time you could type a shutdown command, it was already sync'd. I don't know why more OSs don't do that.
Ancient Unix lore has it that you need to do 'sync; sync; init 6' in order to sync the buffers and reboot. Sync was supposed to only schedule a sync, but would block if another sync was already running. I have no idea how applicable that lore is to modern 2013 Linux... I'd definitely like to see more careful research than just removing the sleep(2) and declaring victory, research that addresses whether that sleep was simply vestigial or not...
It's not sync() which is the problem. A correctly implemented `sync` should flush all writes to permanent storage before returning.
The issue here is storage devices may lie about it[0]. The 2s sleep is problematic, but IIRC devices don't (and have no way to, and would not anyway) report when the data is actually written to permanent storage, so you can't do much besides waiting a bit and hoping for the best.
Storage devices actually do have several ways of reporting when data is permanently stored, and Linux makes use of them. However, some storage device manufacturers found that if they lied and claimed data was permanently stored when it wasn't quite yet, they got better benchmark results.
> Storage devices actually do have several ways of reporting when data is permanently stored
Which `sync` uses. My comment was probably unclear, but the point I was trying to make is if they're lying to sync they're probably not going to provide other accurate ways to get the information.