Optimizing Magic Pocket for cold storage (dropbox.com)
104 points by el_duderino on May 6, 2019 | 13 comments



> We killed the project after more than nine months of active work. As an engineer, it is not easy to give up something you have been trying to make work for so long. Some of the early conversations were controversial but ultimately everyone agreed killing the project was the right decision for Dropbox and our users.

They found the solution didn't work once it was deployed to staging for testing. Being able to walk away from that sunk cost is to their credit.


They don’t provide that much info on why erasure coding didn’t work out for them. I’m curious to see how Backblaze will handle something similar as they scale up, now that they’re going to have multiple data centers (2 in the US and 1 in the EU).


The author's writing style in a well-structured post resonates positively with me. I've read some of their technical posts, but not methodically. I'm curious what the process is at Dropbox to bring such an article to life - whether it's done in isolation and left to the writer's abilities, or whether there are feedback loops with an editor. Certainly a document I can take cues from.


Most probably PR-approved. In my experience, pretty much all public (or soon-to-go-public) companies have a strict PR process in place to handle almost everything that goes out.


There are feedback loops with editors, both experts and non-experts on the subject matter. When we work on a given domain for a long time, we sometimes lose perspective on what is obvious and what is too much detail. Having someone to sanity-check the structure and edit does result in higher-quality blog posts.


Nice article, well written and a good read. Always nice to see a technical post on something many of us use daily. Extra points for them being honest about how they treat incoming files :-) “...such as perform OCR, parse content to extract search tokens, or generate web previews for Office documents...”


It looks like they use something very similar to Erasure Coding in Ceph with RadosGW [1].

[1] https://ceph.com/planet/using-erasure-coding-with-radosgw/


They have been doing erasure coding since at least 2016 (it is mentioned in their original Magic Pocket post), just like almost all distributed storage systems, so nothing really new there.

This article is about single-region vs. multi-region storage (and how to reduce the cost in the multi-region case). There is very little public info available about distributed storage systems in a multi-region setup with significant latency between the sites.


The Dropbox technical blog is always a very interesting read, and I love that they provide so much detail on the technical background of their solution. Still, I was left with quite a few questions when trying to understand this article:

- Dropbox did not want to be running a single version of the software / a single region, because they consider the risk of a single software bug / human error resulting in data loss too high. However, the alternative they chose introduces a completely new code base which will have to be battle-tested. This increases the risk of a data-loss bug, which would affect a smaller fraction of the data, but any significant data loss issue would be game-over for a company like Dropbox. Did they consider partitioning the system into smaller subsets (some single-region, others multi-region), using staged roll-outs of new software versions? Or is there really some fundamental incompatibility between Magic Pocket and multi-region?

- The "New Replication Model" story sounds a bit too simplified. It seems to re-introduce some issues that the single region Magic Pocket solution had already solved: the size of the IO operations becomes quite small again (fractions of the 4M block size), placement of data on the disks becomes less predictable, which could cause increasing rebuilt times when a disk fails. Also, the number of IOs to read or write an object increases significantly (2-3x in the example), which means that the observed advantages in latency go hand in hand with a 2-3x lower maximum supported load than in the Magic Pocket case, before the latency explodes due to running out of IOPS on the HDD's. The whole design seems to ask for far more IOPS than the Magic Pocket solution, which sounds like an odd match to SMR HDD's.

These issues are maybe alleviated by the fact that moving data to the cold tier happens asynchronously, and the cold data is accessed very infrequently, resulting in far fewer IOPS being required for the cold storage region. However, it also makes the option of combining hot and cold data on a single disk much more difficult (which for HDDs is the way to make optimal use of their limited IOPS vs. their huge capacity - I suspect Amazon / Google use this for their near-line storage solutions). Moving from the 2+1 example to e.g. 4+1, to reduce cross-region storage costs even more, now becomes a tough call, as it goes hand in hand with an even larger increase in IOPS cost (some back-of-the-envelope numbers in the sketch after this list).

- The claimed "simplicity" of deleting data in the proposed scenario is rather relative. If they are using SMR drives, deleting data and reclaiming space are complex and expensive operations. They might reduce it to a non-distributed problem (which is still a significant gain, of course), but it is far from trivial.

Probably a lot of the finer, left-out details of their cross-region system address these issues, and if not, maybe the cross region system and the single region Magic Pocket solution will converge again in a later phase.


This is Preslav from Dropbox here. All great questions! We would have absolutely loved to put all the interesting details in the blog post, but we needed to keep the length limited so as not to overwhelm the reader. I'll try to answer your questions here:

> Dropbox did not want to be running a single version of the software / a single region, because they consider the risk of a single software bug / human error resulting in data loss too high. However, the alternative they chose introduces a completely new code base which will have to be battle-tested. This increases the risk of a data-loss bug, which would affect a smaller fraction of the data, but any significant data loss issue would be game-over for a company like Dropbox. Did they consider partitioning the system into smaller subsets (some single-region, others multi-region), using staged roll-outs of new software versions? Or is there really some fundamental incompatibility between Magic Pocket and multi-region?

Magic Pocket already employs partitioning, staged rollouts, multiple versions, stringent operator controls, and extensive testing. This is discussed in more detail in https://blogs.dropbox.com/tech/2016/07/pocket-watch/

There’s no real incompatibility between Magic Pocket and multi-region, just a general trade-off in software that we’re not willing to make in this case. Globally replicated state would elevate availability and durability risks. It’s true that we can introduce protections to avoid this, and we do employ these protections, but it’s not a silver bullet - a single system would still be vulnerable to rare “black swan” events we may not anticipate. (There is a great example of how unexpected correlation triggered a subtle bug in third-party vendor software at the beginning of https://www.infoq.com/presentations/dropbox-infrastructure)

In our approach the additional codebase for cold storage is extremely small relative to the entire Magic Pocket codebase and importantly does not mutate any data in the live write path: data is written to the warm storage system and then asynchronously migrated to the cold storage system. This provides us an opportunity to hold data in both systems simultaneously during the transition and run extensive validation tests before removing data from the warm system.
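
A toy sketch of that flow (the names, the in-memory stores, and the 2+1 split are purely illustrative, not our actual code):

    warm = {}   # block_id -> bytes
    cold = {}   # (zone, block_id) -> fragment bytes

    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    def migrate(block_id):
        block = warm[block_id]
        half = len(block) // 2
        a, b = block[:half], block[half:]    # 2 data fragments
        frags = {"zone1": a, "zone2": b, "zone3": xor(a, b)}  # + parity
        for zone, frag in frags.items():     # one fragment per zone
            cold[(zone, block_id)] = frag
        # The block now lives in both systems: validate before removing
        # it from warm, so the live write path is never mutated.
        assert cold[("zone1", block_id)] + cold[("zone2", block_id)] == block
        del warm[block_id]

    warm["blk1"] = b"12345678"
    migrate("blk1")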

We use the exact same storage zones and codebase for storing each cold storage fragment as we use for storing each block in the warm data store. It’s the same system storing the data, just for a fragment instead of a block. In this respect we still have multi-zone protections since each fragment is stored in multiple zones.

> The "New Replication Model" story sounds a bit too simplified. It seems to re-introduce some issues that the single region Magic Pocket solution had already solved: the size of the IO operations becomes quite small again (fractions of the 4M block size), placement of data on the disks becomes less predictable, which could cause increasing rebuilt times when a disk fails. Also, the number of IOs to read or write an object increases significantly (2-3x in the example), which means that the observed advantages in latency go hand in hand with a 2-3x lower maximum supported load than in the Magic Pocket case, before the latency explodes due to running out of IOPS on the HDD's. The whole design seems to ask for far more IOPS than the Magic Pocket solution, which sounds like an odd match to SMR HDD's. > > These issues are maybe alleviated by the fact that moving data to the cold tier happens asynchronously, and the cold data is accessed very infrequently, resulting in far less IOPS being required for the cold storage region. However, it also makes the option of combining hot and cold data on a single disk much more difficult (which for HDDs is the way to make optimal use of the limited IOPS vs. their huge capacity - I suspect Amazon / Google use this for their near-line storage solution). Moving from the 2+1 example to e.g. 4+1, to reduce cross region storage costs even more, becomes now a though call as it now goes hand in hand with an even larger increase in IOPS cost.

Yes, the new replication model does change the average block size. There are implications for IO, the metadata-to-file-data ratio, and the memory-to-disk ratio, which we took into account when building the system. As you noted, the issues are largely alleviated by the data being cold. Also, even with SMR disks, Magic Pocket is limited not only by the IOs for serving live user requests but also by the load from background operations, such as repairs after a disk or machine failure, or compaction.

> The claimed "simplicity" of deleting data in the proposed scenario is rather relative. If they are using SMR drives, deleting data and reclaiming space are complex and expensive operations. They might reduce it to a non-distributed problem (which is still a significant gain, of course), but it is far from trivial.

Yes, compaction is a complicated problem in general, and the claimed simplicity is relative to the other proposals discussed in the blog post. We are not changing what each Magic Pocket region needs to do internally: after we delete a fragment from a region, that region needs to reclaim the space separately. This is the same problem for both the warm and cold storage systems.


Thank you for taking the time to write this reply, all very interesting!


>> Maintaining a globally available data structure with these pairs of blocks came with its own set of challenges. Dropbox has unpredictable delete patterns so we needed some process to reclaim space when one of the blocks gets deleted.

I wonder what that means.


If you're storing A, B, and A+B, and then the file holding block A is deleted, what happens? You can't immediately remove A, because then you'd lose the redundancy for B. You'd need to somehow find a _new_ partner for B - say C (maybe from another block in the same situation, where its partner D is supposed to be deleted?) - then write B+C, and only then delete A, A+B, D, and C+D.

Pretty fiddly.
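
Roughly, as a toy sketch (my names; equal-length blocks assumed; obviously not Dropbox's code):

    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    # Pairs (A, B) and (C, D), each with an XOR parity; A and D are dead.
    store = {"A": b"aaaa", "B": b"bbbb", "C": b"cccc", "D": b"dddd"}
    store["A+B"] = xor(store["A"], store["B"])
    store["C+D"] = xor(store["C"], store["D"])

    # Re-pair the survivors first, so B and C never lose redundancy...
    store["B+C"] = xor(store["B"], store["C"])
    # ...and only then reclaim the dead blocks and the stale parities.
    for key in ("A", "A+B", "D", "C+D"):
        del store[key]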



