Google seeks new disks for data centers (googlecloudplatform.blogspot.com)
160 points by dbcooper on Feb 25, 2016 | 66 comments



That blog post doesn't really say anything. You have to click through a couple of links to find the real paper: https://static.googleusercontent.com/media/research.google.c... which has some interesting ideas, although it's a bit too brief for the subject, I think.


"For example, for YouTube alone, users upload over 400 hours of video every minute, which at one gigabyte per hour requires more than one petabyte (1M GB) of new storage every day or about 100x the Library of Congress. As shown in the graph, this continues to grow exponentially, with a 10x increase every five years."

This blew my mind. What does this even mean in terms of logistics? How many people do you need just to add all those hard drives? How many new datacenters do you need to build every 5 years?


It's actually not that much. Current consumer desktop hard drives top out at 8T (you can get enterprise RAID boxes of up to 48T now, but leave them out of the equation). 1P/day ~= 128 hard drives, so at 4min/drive, that's one person.
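
(A quick back-of-the-envelope in a shell, assuming decimal units and 8TB drives:)

    $ echo $(( 10**15 / (8 * 10**12) ))   # drives needed per day at 1 PB/day
    125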

Of course, that's just YouTube, and Google has many other needs for data. But people forget just how big the denominators are on this quantity, and how effective Kryder's Law has been. They also forget how much reserve capacity there is in human labor; Google's datacenters have tiny employee counts because they are so automated, and could easily scale up into the exabyte/day range.

A more interesting question is what the differential rates of Kryder's Law vs. Moore's Law will do to how we architect software. Already, people in the know say that "disk is the new tape" - disk drive capacity has been increasing much faster than seek times, bus bandwidth, and available processing power, which means that you have to start treating the drive as a sequential storage device and not as a random-access platter. That's behind a lot of the shift from B-trees (as in conventional RDBMSes) to LSM-trees (as in BigTable/LevelDB), and also the resurgence of batch-processing frameworks like MapReduce. How does the software you build change when reading & writing data sequentially is really cheap, but accessing it randomly is expensive?
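
You can see that gap directly with a tool like fio; a rough sketch (/dev/sdX is a placeholder for a spare spindle, and the numbers in the comments are just typical 7200rpm ballparks):

    $ fio --name=seq  --filename=/dev/sdX --direct=1 --rw=read     --bs=1M --runtime=30 --time_based
    $ fio --name=rand --filename=/dev/sdX --direct=1 --rw=randread --bs=4k --runtime=30 --time_based
    # sequential: typically ~150-200 MB/s; 4k random: ~100-150 IOPS, i.e. well under 1 MB/s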


> It's actually not that much. Current consumer desktop hard drives top out at 8T (you can get enterprise RAID boxes of up to 48T now, but leave them out of the equation). 1P/day ~= 128 hard drives, so at 4min/drive, that's one person.

Of course, at this scale you provision by prebuilt rack or even by container.


> How does the software you build change when reading & writing data sequentially is really cheap, but accessing it randomly is expensive?

I thought we were already in that situation. Cache is king.


The real change is that with increasingly performant and large SSDs, it's becoming reasonable to have main storage for which random access is orders of magnitude faster than HDDs will ever be. Still much slower than cache, but a few orders of magnitude here and there are likely to shift what optimal strategies are.


I think you forget the redundancy. You probably need to double it to 256 drives per day.


Do they even need to do that? Couldn't the redundancy be engineered into the free space on other physical drives and change as needed? (I do a version of that on a very small scale, and noting that I am no expert in this area, it was just a common-sense approach.) Meaning it wouldn't have to be "doubled" or anything close to that.


They do, but not quite in such a simplistic way. Basically, there's one service that manages the individual drives, and then another service (Colossus) that sits on top of it and provides the abstraction of an infinite, highly-available, perfectly robust filesystem. So services like YouTube just specify a path, data, an owner, and a replication policy and then hand it off to Colossus to find free space on one of the zillions of disks. There's a side mechanism for quota & budgeting.

It's unclear from the article whether YouTube's 1P/day is pre-replication or post-replication. I'd just assumed post-replication; either way it doesn't really change the conclusion in a material way. (If it's pre-replication, rather than the answer being "1 person", it becomes "2-3 people".)

And yes, you do need 2x (or more commonly, 3x) the disk space. You can use error-correcting codes to correct single-bit errors (Colossus uses Reed-Solomon), but that won't help you if there's a fiber cut and a whole datacenter goes offline.
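
Back-of-the-envelope on the overhead, using an assumed (6,3) Reed-Solomon layout (6 data + 3 parity chunks per stripe) purely as an example:

    $ echo "scale=2; (6+3)/6" | bc   # RS(6,3): bytes stored per logical byte
    1.50
    $ echo "scale=2; 3/1" | bc       # plain 3-way replication
    3.00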


Hah! Thanks for making me feel somewhat smart. Reminds me of when I was a kid and I said to my mom about cancer "why don't they just cut it out?".


Given 8TB drives and chassis that can fit 90 3.5" drives into a 4U enclosure, you're talking about a petabyte fitting into 8U. Sledding takes about 8 hours for that many drives, so a small team would be able to handle hardware buildouts pretty easily. More so if you can lean on your integrator to do the work for you.


A big thing they are saying in the paper is that they want data center HDDs that care slightly less about their medium error rates, opting instead for better areal density and more consistent tail latencies, and that let the system above handle reliability, as it needs to do anyway.

I've been thinking along these lines for a while and tend to reduce ERC (Error Recovery Control) to a minimum, but disks still aren't designed to run with it at very low values; Google has several interesting ideas in their paper along these lines.
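
Today about the only knob commodity drives expose is SCT ERC (a.k.a. TLER), which just caps the internal retry time rather than handing the decision up the stack; a sketch with smartctl (device name is a placeholder):

    $ smartctl -l scterc /dev/sdX          # show the current read/write recovery time limits
    $ smartctl -l scterc,70,70 /dev/sdX    # cap both at 7.0 seconds (units of 100 ms)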

It would be really awesome if they get the HDD makers to go along with this.


Should be interesting to see what they come up with.

Somehow I don't think they will find things better than multi-million dollar research like HAMR for write-once read-many, but we'll see I guess.

I wonder if there will be 100TB spinning read/write drives by 2020, an exponential leap instead of incremental

Is there any reason for them to stay with 5.25 or 3.5 inch design for a datacenter?

Why not go back to 8 inch for massive surface area? Or is that too much mass to spin.


> I wonder if there will be 100TB spinning read/write drives by 2020, an exponential leap instead of incremental

The problem with spinning disks is that speed and reliability have not increased in line with capacity. There comes a point at which it doesn't make sense to make them any bigger (capacity-wise).

Just creating a filesystem on an 8TB disk takes hours. If you bring in disk (block level) encryption and the requirement to fill the disk with random data and then encrypt before creating the filesystem, you're looking at a multi-day task. Expanded to 100TB, you could be looking at a month just to bring a disk online.
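
For what it's worth, the dominant cost there is just writing the whole surface once at sequential speed; a sketch with made-up but typical numbers (device and mapping names are placeholders):

    $ echo $(( 8 * 10**12 / (200 * 10**6) / 3600 ))   # hours to fill 8 TB at ~200 MB/s
    11
    # one common shortcut: open the dm-crypt mapping first, then zero through it,
    # which lays down ciphertext (indistinguishable from random) at full disk speed
    $ cryptsetup luksFormat /dev/sdX
    $ cryptsetup open /dev/sdX cryptdisk
    $ dd if=/dev/zero of=/dev/mapper/cryptdisk bs=64M status=progress
    $ mkfs.xfs /dev/mapper/cryptdisk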

On the reliability front - a spinning 8TB disk is probably about as reliable as a 1TB disk, so that means you have an 8x increase in probability of data loss, as well as ~8x more data to recover/redistribute for every failure.


> Just creating a filesystem on an 8TB disk takes hours.

Are you sure you mean TB?

    $ for i in $(seq 8); do truncate -s 1T 8tb.$i.img; losetup --find --show 8tb.$i.img; done
    /dev/loop2
    /dev/loop3
    /dev/loop4
    /dev/loop5
    /dev/loop6
    /dev/loop7
    /dev/loop8
    /dev/loop9
    $ sudo mdadm --create --level linear -n 8 /dev/md8 /dev/loop{2..9}
    mdadm: Defaulting to version 1.2 metadata
    mdadm: array /dev/md8 started.
    $ sudo time mke2fs /dev/md8
    [lots of output snipped]
    real    42m0.590s
    user     0m2.470s
    sys      1m0.440s
    $ du -sh . # just for fun
    65G     .

That's certainly not fast, but it's not "hours" either. It's also a magnetic disk that was actively being used by other processes. Given how much data it wrote, it averaged 25MB/s, which is not really that fast. An 8TB SSD would be much faster.


Is that defaulting to ext4 or XFS?


He used `mke2fs`, which defaults to Ext4.


> Just creating a filesystem on an 8TB disk takes hours.

Am I being dense here, or why would you create a filesystem on the hard drive? The hard drive should only store file contents.

There are all kinds of seeking and syncing and random access needed for the filesystem. Seeks for file contents can't be avoided, but ones due to the filesystem can be.

If there's no metadata-on-ssd + data-on-disk filesystem for Linux, there should be.


ZFS allows for that level of control: assign a cache device to the pool, then set your secondarycache flag to metadata. Yes, the metadata will hit the primary cache first (memory), but it will spill onto the secondary as it is pressured out of ARC. I've got an SD card playing that role in my build machine; it works really well for keeping track of all the tiny files that make up the FreeBSD base and ports trees.
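
For anyone wanting to try it, the knobs look roughly like this (pool and device names are placeholders):

    $ zpool add tank cache /dev/ada2         # add an L2ARC device to pool "tank"
    $ zfs set secondarycache=metadata tank   # only let metadata spill onto it
    $ zfs get secondarycache tank            # verify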


So what you're advocating is that instead of creating an ext4/zfs/xfs filesystem on each drive, you'd have a sort of distributed filesystem with raw access to each drive. And the inode data of the distributed filesystem could be kept always available in RAM, with SSD backing. My guess is Google probably already does this.


Lots of distributed filesystems (I guess I'm talking about distributed object stores here rather than an actual clustered network filesystem) rely on a regular POSIX filesystem for the storage of blobs.

A huge amount of work has gone into making ext4/zfs/xfs etc fast and reliable, plus you get other benefits like journaling, filesystem caching, fsync(), metadata caching (a la ZFS), so there are credible arguments for not going down the path of NIH.


Interesting, I don't think much work has been done in this area. It's not a common way of dividing the workload, even though it sounds obvious.


> a spinning 8TB disk is probably about as reliable as a 1TB disk, so that means you have an 8x increase in probability of data loss

I don't understand that argument, not if you think at scale.

You need to store 8000TB. You can run 8000 x 1TB, or 1000 x 8TB. You lose 10% of the disks per year (to keep the math simple). So after a year, you have had to redistribute either 800 x 1TB or 100 x 8TB, which is the same amount of data to a software-defined distributed storage system.
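
Or in numbers (same flat 10%/year failure rate assumed):

    $ echo $(( 8000 / 1 * 10 / 100 * 1 ))   # 8000 x 1TB drives: TB to redistribute per year
    800
    $ echo $(( 8000 / 8 * 10 / 100 * 8 ))   # 1000 x 8TB drives: TB to redistribute per year
    800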

There are operational upsides to fewer disks: power usage, manpower, rack space, etc. But of course more disks are faster.

Google wants to have more options than just bigger vs faster available. Since they know exactly, and can influence, how their software behaves, they can gain much from using specialized disks that balance things differently.


> Just creating a filesystem on an 8TB disk takes hours.

XFS creates such a filesystem in seconds. As another reply showed, ext4 is quite a bit slower since it preallocates the inodes; more modern designs allocate them on the fly as needed.
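
Rough illustration of the difference (device name is a placeholder; ext4 can also defer the inode-table writes to first mount):

    $ time mkfs.xfs -f /dev/sdX                        # writes a handful of AG headers: seconds
    $ time mkfs.ext4 /dev/sdX                          # preallocates every inode table: much longer
    $ time mkfs.ext4 -E lazy_itable_init=1 /dev/sdX    # defers inode-table init to a kernel thread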


> you're looking at a multi-day task. Expanded to 100TB, you could be looking at a month just to bring a disk online.

That's what adding value looks like.


In the long run, anything that moves in a DC is way less reliable than systems without moving parts. On top of that, these systems also need more energy to maintain the movement. If you increase the size, then you need better materials to keep up with the increased mechanical load. I bet by 2020 we are going to have better ways of storing large amounts of data than a spinning disk. The new optical disc would be a good candidate for some use cases, mostly for archiving though:

http://www.sciencealert.com/this-new-5d-data-storage-disc-ca...

For continuous read/write workloads, SSDs are the best for many reasons (energy efficiency, latency, IO bandwidth), and I think the world is moving in that direction, replacing spinning disks with them. I am hoping to see a bump in SSD capacity as well. We are here now:

http://www.extremetech.com/computing/221303-the-worlds-bigge...


From the article:

"An obvious question is why are we talking about spinning disks at all, rather than SSDs, which have higher IOPS and are the “future” of storage."

"The root reason is that the cost per GB remains too high, and more importantly that the growth rates in capacity/$ between disks and SSDs are relatively close (at least for SSDs that have sufficient numbers of program-­erase cycles to use in data centers), so that cost will not change enough in the coming decade."


Well at least for YouTube video storage, I imagine the P/E cycles are not a problem. There have been issues with longevity like on the Samsung 840 EVO drives, but if you can avoid that then you can build some incredibly dense, fast, reliable and power efficient storage on top of V-NAND.

Cost per GB for SSDs is definitely catching up to HDD [1] and I think is expected to match or beat HDD within the decade. Power efficiency should be better on SSD and MTBF will be much better than HDD, so I would have thought TCO for SSD would have HDDs basically beat already.

[1] https://cms-images.idgesg.net/images/article/2015/12/ssd-vs-...


Indeed, it will be interesting to see what Google can think up compared to the other big researchers.

Though I wonder if the goals are slightly different, as Google is focusing on "Cloud Storage", rather than single-drive performance.

As they mentioned, the Cloud doesn't need individual drives to have great reliability (since data is assumed to already be redundant), which may allow them to trade some reliability for cheaper disks that fail more often.

Doesn't sound like they are expecting to increase size/performance alone with this effort.


I'd imagine it's more likely to be the engineering tolerances required to keep the vertical movement to a minimum at double the radius? But I have no real insight, just a guess.

It does strike me though - why do disks do serial reads, couldn't the head read several tracks at once? Why not have several heads angularly spaced around the drive? Presumably such things have been tried and found lacking in some respect. Looks like I need to do some research ...


> couldn't the head read several tracks at once?

The paper in fact suggests this and also cites an attempt by Conner to introduce multiple actuator drives:

http://www.pcguide.com/ref/hdd/op/actMultiple-c.html


There were some disks with two heads once. They were mechanically too complicated.


On Tuesday Eric Brewer gave a talk on this white paper at FAST. In the talk (and probably discussed more in the paper) he proposed looking at alternative disk form factors. He seemed to be particularly interested in "tall" disks, say 2U, that had the same surface area as today, but with more platters, and potentially more heads.


Ah, going back to the beginning of micros, where drives were often full-height things that could easily break feet when dropped. The thing that worries me there: having had a big full-height 9GB Seagate SCSI drive back in the day, you needed to give the drive about a hundred watts on its own to run it, with a spool-up that sounded like a jet engine about to take off. I know that motors have gotten faster since then, and platters have gotten lighter, but that's still quite a bit of mass to get up to speed.

The thing that interests me more is the idea of more read/write heads. Add a second, independent set of heads for read/write and you will decrease latency a good amount. Or, a second dependent set of heads which would not provide as much of a performance boost, but would still improve latency a good amount.


I for one am confident that SSDs and the technology behind them will be much more scalable and future-proof than spinning disks; stackable chips are a thing already. I'm sure that by 2020 there will be companies that can deliver a storage platform with petabytes of space in a 1U package. Just imagine one of those server things crammed full of SSD chips.


In the long run that may be true, but the same things they talk about (minus the mechanical head changes) will still hold in SSDs. The background activity is even worse for SSDs and the tail latency still very much exists.

Essentially the problem is that HDDs and SSDs are built as black boxes and are not vertically integrated with a data center storage scheme. The major win will be when HDDs and SSDs are slightly less self-sufficient and let the higher components in the stack manage them in more depth. Being able to tell the disk "do not bother with ECC for now", try to get the data from your RAID, and if that doesn't work come back to the disk and tell it "we really, really want the data, spend as much as needed on this" goes a long way toward both getting the tail latency down and keeping reliability at the same level.

For SSDs there is the added complication of garbage collection and the incredible amounts of over-provisioning needed to get consistent performance from them (at least 30% in an ME drive; the RI drives suck at consistent performance). If those can be reduced by more vertical integration, that would reduce the cost of the systems and make for better overall performance as well.


It amazes me that SSD drives still (mostly) conform to the same drive sizes as spinning disks.

My most recent computer build used an M.2 slot SSD, and plugging that tiny little card into the motherboard seemed like the future. No cables required, and it was even smaller than the RAM sticks.


> Is there any reason for them to stay with 5.25 or 3.5 inch design for a datacenter?

> Why not go back to 8 inch for massive surface area? Or is that too much mass to spin.

One problem AFAIK is that larger disks wobble too much, and more so the faster they spin. Which is why, if you break open a 3.5" 15k drive, you'll see that the platters are more like 2.5", with a lot of wasted space. For 7.2k you can use bigger platters, but how much bigger? I doubt 8" is feasible.

Of course you can reduce the rpm further, but that has downsides too.


Current disk densities are about 10^12 bits/square inch. To get 800 terabits on a single platter, it would need a diameter of about 32 inches. At 10^14 bits/square inch, only 3.2 inches.
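
Sanity check on the 32-inch figure, treating the platter as a full circle and ignoring the hub:

    # 100 TB = 8*10^14 bits; at 10^12 bits/in^2 that's 800 in^2 of platter
    $ echo 'scale=2; sqrt(4 * 800 / 3.14159)' | bc   # diameter in inches
    31.91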

Assuming a compound annual growth rate of 25%, densities will be 10-fold larger in 10 years, and 100-fold larger in 20 years. (This actually happened in hard disks between 1972 and 1992.) But that would take us to 2035, not 2020.

At those densities, a bit will have to be stored inside a square about 7 or 8 nanometers on a side. I don't know whether that's possible.


For 100TB disks to make sense, the read/write speed would have to improve considerably; otherwise the disk will be useless for anything other than cold storage. I haven't really seen spinning disk speeds improving much. It would take weeks to rebuild an array of 100TB disks at current speeds.


Scanning a 100TB drive end to end should only take about 5x (i.e. sqrt( 25 )) as long as scanning a 4TB drive, since sequential throughput grows roughly with the square root of areal density (although shingling will screw up this calculation, since it provides more tracks, but not more density on each track).

Which is a long time, but not totally out of bounds for the way we're currently using disks.

The coming decade is going to be very interesting: SSDs are pushing forward on an insane trajectory, which could either fall apart as the benefits of node shrinks disappear, or accelerate through a virtuous cycle as unit volume increases.

It's even plausible that spinning rust is on death's door in that time frame. The price floor of SSDs is much lower than that of hard drives (i.e. how much does an 8GB hard drive cost to manufacture today in volume? As much as a 500GB drive), meaning that both mildly performance-conscious and price-sensitive consumers move entirely over.

Enterprise customers will still chase the price/GB and storage density of traditional hard drives, but without the consumer dollars to invest in R&D, that could start to stall, quickly leaving them with a diminishing advantage and obsolete in the blink of an eye.


I am repeating a comment from another response, but if access speed doesn't keep up while you increase the amount of data stored on the drive, you are locking yourself out of your data. Which is fine if it is just for archival. But if it is data that is supposed to be randomly accessed (YouTube video), it doesn't help to have 100TB on a drive if you can only extract it as a single 50MB/s stream at a time. The more data on a single drive, the more likely you will need to access several segments of that data at the same time.


Most of YouTube (and photo archiving even more so) is 'write-once, read-never'. Only a comparatively small number of videos become popular.


or you use multiple drives at once to access the data?


I remember a washing-machine-sized IBM mainframe hard disk with two sets of heads. That's one trivial way to double bandwidth and cut latency (it'll read like a RAID-1 and write as a RAID-0). Also, the more platters, the more data you can read on a single rotation.


You double the bandwidth for a single access. But if you are storing 100TB on a single drive, it's unlikely to be 100TB related to a single user. You will want to make a lot of requests to this data simultaneously, a lot more requests than you would make to a disk containing only 1TB of data.


If they go with anything Arm and Platter I'd be shocked.

I'm imagining a 10 acre array of NAND


They are saying in the paper that they want arms and platters, as it's more economical. An HDD is four to ten times cheaper than an SSD of the same capacity, and I don't think the electricity cost will offset this if you care more about capacity than performance.


It is right now, but silicon and precision mechanics scale along different curves. There is a point where spinning metal will no longer be viable as a competitor, and I'll guess, based on a hunch, that it's not that far into the future.


Flash is not getting cheaper faster than HDDs, once you control for write endurance. One of the big messages in the paper is that IOPS matter. If you only have a petabyte or two of data, you can use a brute force solution such as flash or 15k drives. But at Cloud scale (we used to say Google scale, but there are other companies like Amazon and Microsoft and Facebook that run at these scales now) you just can't afford that much flash.

So the trick is to use every last IOPS that you can get out of each and every disk, and that means provisioning and spreading out your data across an entire fleet of disks. The alternative, where you silo your storage in disk pools, where some disks or disk pools have IOPS that go unused (wasted), and where you make up for that by using flash for your hot and warm workloads, is just a much more expensive way to do things.
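
Ballpark of what "every last IOPS" buys you (my assumptions: 8TB drives, ~120 random IOPS per 7200rpm spindle):

    $ echo $(( 10**15 / (8 * 10**12) * 120 ))   # random IOPS that come along with 1 PB of spindles
    15000

That capacity-driven IOPS budget is what a siloed layout leaves stranded.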

It's for that reason that SMR drives aren't all that interesting. Sure, you get 20% more capacity, but at the cost of burning most of your IOPS for GC. Or if you move all of your SMR-friendly data to the SMR drives, that's cold data, and it means that you've wasted all of the IOPS that those SMR drives are capable of. Which means the rest of your disk fleet is now much hotter, and you might have to buy more CMR spindles to cover the cost --- at which point the cost advantage of SMR drives disappears.

(I helped to author the section about SMR drives, and why we are interested in hybrid SMR/CMR drives as a solution to this problem; read the paper for more details.)


If you get any response from disk vendors, I'd be very interested to hear about it and to be somehow involved. I've been asking them similar things for some time and have gotten nothing beyond a "we'll take a look", never with a reply back.

One such request, also mentioned in your article, is to get more granular information in the IO responses: on a SAS disk you can get correctable errors that do not hamper behavior but give you information on the internal status and actions of the disk. SATA has a problem with this, since any IO error in SATA will cancel all outstanding IOs. Though I think there are device logs and such now that could be used to provide some semblance of a similar result.


> Flash is not getting cheaper faster than HDDs, once you control for write endurance.

Is the category of WORM-suitable data (like videos) not large enough to be interesting, or is there simply no WORM technology better than writable?


There's no WORM technology better than rewritable as far as I'm aware. Storage density is already well above the point where optical technologies are viable.


I think the question is: why does P/E cycle count matter for storing YouTube videos? As videos are uploaded, you shard them with FEC across your SSD array, but once the blocks are written, when would you ever need to erase them? The only concern is the longevity of a single block, which requires some amount of P/E (re-charging the cells), but let's say that's just 1 full disk write per month to keep everything fresh...
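
Back-of-the-envelope (my numbers): a once-a-month full refresh is only about 0.03 drive-writes-per-day, well under what even read-intensive SSDs are typically rated for:

    $ echo "scale=3; 1/30" | bc   # drive-writes-per-day for one full-drive write per month
    .033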


OK. I was thinking more along the lines of PROM (i.e. random access) but I can imagine there's nothing now more useful than flash.


There is "antifuse", although it's a little slow to program. Commonly available on chips in small amounts for secure key storage, serial numbers, calibration etc.

I'm not aware of anyone selling standalone parts, but here is an IEEE paper: http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=124097... (512Mb in 2003!)

If Google really wanted to, they could make a couple of petabytes of slow-to-write memory by commissioning it themselves for low numbers of millions of dollars. But I suspect they value the rewritability.

See also http://www.eetimes.com/document.asp?doc_id=1328234


The point about ECC and accepting lower error rates makes a lot of sense. It's the end-to-end principle again: since all of these HDDs are going into a global pool in which each of them is disposable and written content is protected by FEC spread across multiple drives, there is no need for each drive to spend a lot of resources, going deep into diminishing returns, trying to make itself as resilient as possible. If a network transmission fails, it is retried and doesn't need to be swathed in huge numbers of elaborate checksums and ultra-reliable links; if a hard drive fails, the content is recovered from the FEC and re-written out to a new drive.


Reminds me of the 5.25" Quantum Bigfoot. I'd be happy with a big disk like that in a storage network / RAID.

https://en.wikipedia.org/wiki/Quantum_Bigfoot


The article doesn't go into much detail on the performance. I'm curious whether reading was any faster because of the higher linear velocity at the outer tracks of a larger platter, or whether it was a wash, with seek times increasing due to the larger area.


Which reminds me of a teardown EEVblog did on a 1980s 11" mainframe drive: https://www.youtube.com/watch?v=CBjoWMA5d84


> For example, for YouTube alone, users upload over 400 hours of video every minute, which at one gigabyte per hour requires more than one petabyte (1M GB) of new storage every day or about 100x the Library of Congress

Hmm, something up with the sums in the middle of that.

400 hours of video every minute is much more than one gigabyte per hour. It's way more than one terabyte per hour.

Working backwards:-

1 PB/day =~ 42 TB/hour =~ 728 GB/minute


I believe they are saying that the 400 hours they upload every minute are 400GB worth of data (at one gigabyte per hour of video).


Ah, yes, good point.


That's only if you're storing one copy of the video.


Variable compression and video resolution?


400 hours video/min = 400 GB/min = 400 x 60 x 24 GB/day = 0.576 PB/day. Close enough.



