the shutdown loop in one of the blog posts there - sync(), then sleep(2) - has m...

deathanatos · on Oct 9, 2013

One would hope that the completion of sync() would mean the data is written out, except I recently read this horror[1] on HN:

Unfortunately, most consumer-grade mass storage devices lie about syncing. Disk drives will report that content is safely on persistent media as soon as it reaches the track buffer and before actually being written to oxide. This makes the disk drives seem to operate faster (which is vitally important to the manufacturer so that they can show good benchmark numbers in trade magazines). And in fairness, the lie normally causes no harm, as long as there is no power loss or hard reset prior to the track buffer actually being written to oxide. But if a power loss or hard reset does occur, and if that results in content that was written after a sync reaching oxide while content written before the sync is still in a track buffer, then database corruption can occur.

… part of me hopes there's a very special hell for the people making disks where the OS can never be sure if the data is safe or not.

[1]: http://www.sqlite.org/howtocorrupt.html

magila · on Oct 9, 2013

This problem is nowhere near as widespread as most people claim. While bugs do happen and I can't speak for the SSD side of things, HDD manufacturers test their cache behavior quite thoroughly. This includes pulling the power immediately after flushing the cache to make sure the data made it to disk. 99% of people who report cases of HDDs "lying" about write integrity either have write cache enabled or are not actually issuing a flush cache command due to OS level issues.

mark-r · on Oct 9, 2013

Does anybody know if SSDs are subject to the same delayed sync issue?

masklinn · on Oct 9, 2013

As with regular drives, it depends on the device.

Early Intel SSDs were known to be particularly prone to this issue: http://www.mysqlperformanceblog.com/2009/03/02/ssd-xfs-lvm-f... http://www.evanjones.ca/intel-ssd-durability.html

viraptor · on Oct 9, 2013

I've definitely seen `sync` itself waiting/blocking (especially if you use fuse for something network based and disconnect the cable first), but whether it's guaranteed or not... that's an interesting question.

Edit: after some googling:

       On Linux, sync is guaranteed only to schedule the dirty blocks for
       writing; it can actually take a short time before all the blocks are
       finally written.  The reboot(8) and halt(8) commands take this into
       account by sleeping for a few seconds after calling sync(2).

       This page describes sync as found in the fileutils-4.0 package; other
       versions may differ slightly.

So it doesn't look like the writes are guaranteed to take place. Just a best effort + wait + pray :)

jabiko · on Oct 9, 2013

Interesting. http://linux.die.net/man/2/sync says

    According to the standard specification (e.g., POSIX.1-2001), sync()
    schedules the writes, but may return before the actual writing is done.
    
    However, since version 1.3.20 Linux does actually wait. (This still
    does not guarantee data integrity: modern disks have large caches.)

So it seems like the sleep(2) is there to give the disk enough time to write the cache data.
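
For illustration, the sync-then-sleep pattern being discussed might be sketched like this in Python. This is not the actual reboot(8)/halt(8) code: `flush_then_wait` and its `grace_seconds` parameter are hypothetical names, and the 2 second default only mirrors the sleep(2) from the post.

```python
import os
import time


def flush_then_wait(grace_seconds: float = 2.0) -> float:
    """Ask the kernel to write out dirty buffers, then wait a grace period.

    Hypothetical helper that mirrors the sync() + sleep(2) pattern.
    Returns the number of seconds spent inside os.sync() itself.
    """
    start = time.monotonic()
    os.sync()  # thin wrapper around sync(2), Unix only
    sync_elapsed = time.monotonic() - start

    # Best-effort grace period for the drive's own cache. Nothing here can
    # confirm that the data actually reached the medium.
    time.sleep(grace_seconds)
    return sync_elapsed
```

Calling `flush_then_wait()` just before power-off gives the drive a little slack, but the delay is a guess, not a guarantee.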

igravious · on Oct 9, 2013

Pity no-one thought to comment this little piece of reboot/halt magic - 'twould have saved a bit of digging.

JoeAltmaier · on Oct 9, 2013

I changed an OS (before Linux) to sync when idle. So by the time you could type a shutdown command, it was already sync'd. I don't know why more OSs don't do that.

lamontcg · on Oct 9, 2013

Ancient Unix lore has it that you need to do 'sync; sync; init 6' in order to sync the buffers and reboot. Sync was supposed to only schedule a sync, but would block if another sync was already running. I have no idea how applicable that lore is to modern 2013 Linux... I'd definitely like to see more careful research than just removing the sleep(2) and declaring victory, including a look at whether that sleep was simply vestigial or not...

JoshTriplett · on Oct 9, 2013

This example used a read-only squashfs filesystem, making sync() irrelevant.

Also, counting on sync() to write everything within two seconds seems problematic as well.

masklinn · on Oct 9, 2013

It's not sync() which is the problem. A correctly implemented `sync` should flush all writes to permanent storage before returning.

The issue here is storage devices may lie about it[0]. The 2s sleep is problematic, but IIRC devices don't (and have no way to, and would not anyway) report when the data is actually written to permanent storage, so you can't do much besides waiting a bit and hoping for the best.

[0] http://www.sqlite.org/howtocorrupt.html
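
For a single file, the same "flush before returning" idea can be sketched with fsync instead of a system-wide sync. This is a minimal sketch assuming a POSIX-like system; `write_durably` is a hypothetical helper name, and the code has no way to tell whether the drive honours the flush or merely claims to.

```python
import os
import tempfile


def write_durably(path: str, data: bytes) -> None:
    """Write data to path and ask the OS to push it to the device."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # drain Python's userspace buffer into the kernel
        os.fsync(f.fileno())  # ask the kernel to flush this file to the device
    # If the device reports completion early, this function cannot detect it.


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "demo.bin")
        write_durably(target, b"hello")
        print(os.path.getsize(target))
```

A fully careful version would probably also fsync the containing directory so the new directory entry is flushed; that step is left out here to keep the sketch short.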

JoshTriplett · on Oct 9, 2013

Storage devices actually do have several ways of reporting when data is permanently stored, and Linux makes use of them. However, some storage device manufacturers found that if they lied and claimed data was permanently stored when it wasn't quite yet, they got better benchmark results.

masklinn · on Oct 9, 2013

> Storage devices actually do have several ways of reporting when data is permanently stored

Which `sync` uses. My comment was probably unclear, but the point I was trying to make is if they're lying to sync they're probably not going to provide other accurate ways to get the information.

masklinn · on Oct 9, 2013

> if so, is that a reasonable risk i see (filesystem corruption at times)?

It is not a reasonable risk on a normal system, but TFA uses a readonly filesystem (squashfs) so it's not an issue: there's no data to be written.