Backblaze is building a 270 TB storage pod (backblaze.com)
97 points by LukeLambert on Aug 12, 2014 | 85 comments



Man, I keep seeing awesome posts about Backblaze, and I'd really like to use them. However, I run exclusively on Linux these days, and they don't seem to have a Linux client.

Public pledge: if Backblaze releases a Linux client, even if it's command-line only and requires that I manually edit an XML file, I'll purchase a 1-year subscription.


The only issue I have with them is the removal of deleted files from backup archives after 30 days. CrashPlan doesn't have that restriction (and also has a Linux client), although the Java-based engine is pretty clunky, and it eats memory like crazy doing client-side dedupe if you have it enabled. They've been promising native clients for a long time, and a sales guy earlier this year said v4.0 would be out in Q1, which the engineering team later denied ever saying... So I still don't know when the non-Java version with its improved memory efficiency is coming.


I really wish the JVM would at some point stop being synonymous with "memory hog" on the desktop. It's certainly possible, e.g. http://hiroshiyamauchi.blogspot.com/2013/06/making-jvm-relea...


I second this. I would love to be able to run a tool a la rsnapshot to store about 500 GB worth of locally encrypted backups with a cloud provider. I am willing to pay about $100/year for this service; Backblaze offers it for $60/year. But the problem is that my home NAS (the origin point of all of these backups) is Linux/UNIX-based.


There _is_ Google Drive and the likes of Insync; that's $120/yr for 1 TB.


I'm a Backblaze user; I became one right after another post from their blog hit the HN front page about 8 months ago. Backblaze almost saved me once when my external HDD died after being dropped. The reason it didn't actually save me is that the initial backup wasn't yet complete (not their fault); otherwise I could easily have ordered from them a new HDD with the restored data, which is a very nice feature.

Sadly, their OS X client is kind of quirky. I'd rather have a command-line one that required hand-editing configuration files (I'd wish for something other than XML, though), but which would run more reliably and transparently.

(Currently it shows “Computer Missing” and says that I haven't backed up in more than 2 weeks, though I did actually press Backup Now quite a few times in the past few days. There are some other oddities in its behavior. I suspect I'm back in “searching for a decent online backup solution” mode lately…)


I have the same problems with the OS X client. It definitely has issues handling external drives: it no longer displays my exclusions of files on external drives, but somehow is still honoring them.

I'm really happy with their service (had it for ~2 years now), but would love to have a tool similar to how you describe with reduced bugginess and more tweaking ability.


To follow up on this, it turns out the configuration for Backblaze on OS X is in XML and tweakable. It can be found in /Library/Backblaze.bzpkg/bzdata/bzinfo.xml
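
If you want to poke at it without knowing the schema, here's a minimal sketch (assuming nothing beyond well-formed XML) that dumps the element tree so you can see what's tweakable:

    import xml.etree.ElementTree as ET

    # Path per the comment above; the schema is undocumented, so this
    # just walks the tree instead of assuming any particular elements.
    CONF = "/Library/Backblaze.bzpkg/bzdata/bzinfo.xml"

    def dump(elem, depth=0):
        # Print each element's tag and attributes, indented by nesting depth.
        print("  " * depth + elem.tag, dict(elem.attrib))
        for child in elem:
            dump(child, depth + 1)

    dump(ET.parse(CONF).getroot())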


Hi! @YevP from Backblaze here -> Have you tried writing in to support? They can take a look at the account and see if they can figure out why that oddness is occurring: https://www.backblaze.com/help.html.


Agreed on the awesome posts; they're one of the reasons I became a customer in the first place. I'm unlikely to renew my annual subscription this year, though, because they don't support external drives/NAS. I don't like cluttering up my laptop with photos, so I have to transfer everything I need backed up to my rarely used desktop PC, which is a pain. I can understand there's a risk people might abuse networked storage by backing up multiple devices, but in my case I wouldn't mind paying more for that service, or simply paying based on actual GBs stored.

Edit: They do support external drives but not NAS. My bad.


Here's their page on external drives: https://www.backblaze.com/edrive.html


They do support external drives but not network drives. I have a 3 TB external drive attached to my MBA that backs up just fine.


Ugh, sorry. Not sure if that was always the case and I've misremembered, or if there's been a change, but my NAS point remains. Personally, I'd rather not faff around with an external drive plugged directly into my laptop.

Edit: typo


Makes sense. Mine is connected through a thunderbolt display so I only have to make/break one connection.


> Public pledge, if backblaze releases a linux client, even if it's command line only and requires that I manually edit an XML file, I'll purchase a 1 year subscription.

Consider how many Linux users would need to do this to a) pay for the developer time to write/maintain the client and b) pay for the support time for Linux issues.


I'm a Backblaze engineer, and we keep our one 'C' and 'C++' cross-platform source tree building on three platforms every day: 1) Windows, 2) Macintosh, and 3) Debian Linux. (We also have iOS and some other future projects, but I'm omitting them from this discussion.) We use the Debian Linux binaries in our datacenter in production. The only parts that are missing are an installer and a GUI (or we could skip the GUI and let you hand-edit the XML file you can already find in the Mac and Windows versions).

There are a couple of technical problems I'd like to overcome in the next 12 months to get the Linux client out there. One technical problem is how many versions and flavors of Linux to support. For example, we're going to ship binaries for Debian simply because it's the flavor we use in our own datacenter, but after that we'll need CentOS, Red Hat, Ubuntu, probably Gentoo, others?

I'm not sure I understand the state of how others release shrink-wrap Linux products in 2014. We could require customers to compile it, but that seems kind of horrid and very unlike Backblaze. If you aren't familiar with us, we are known for being friendly and easy to use. Backblaze is not a developer tool for scripting backups; you simply install Backblaze and it backs up your entire computer in order to be "easy". It was the only way we could be sure we had everything you needed. This is also why we charge a flat rate of $5/month. If we charged "per GByte" we would be accused of backing up everything to jack up our profits. :-)


There are other things to consider too: developer community goodwill, people that may not be on Linux now, but only use systems that are cross-platform (to curb lock-in to a particular OS), etc.

How much revenue is Atom for Linux expected to generate for Github?


Developer/Linux community goodwill for a user-friendly backup service isn't as valuable as you'd think.

IT support staff? Those are the people you want to endear yourself to. And they're either on Apple or Windows gear.


Yes, it is. Assuming your IT guy at work runs Linux, then when you ask him what backup service you should get, if his answer is Backblaze, that brings them a lot of business.

Personally, I assume they don't release a Linux client because then people would back up their servers to it.


Those IT guys don't need to use Backblaze on Linux to recommend it; they're going to recommend whatever "just works" so they don't have to "go down the rabbit hole".

When I still used Android, I would recommend iPhones all the time to people. Why? Because it just works, and I don't want to be stuck troubleshooting their Android issues for the life of their phone.


What about server backups?


I suspect that a larger issue might be that Linux users tend to run servers with multi-terabyte collections of stuff.

It'd be difficult for them to make money at $5 a month off a guy who wants to back up 20TB of usenet archives or whatever.

Right now those kinds of collections are weeded out since Backblaze doesn't back up network shares.


Someone just pointed out that you can't download data from them without providing them your private key's password. That seems boneheaded to me, and I guess it would prevent me from honoring the pledge I made.


It sounds like you want http://www.tarsnap.com


Exactly. But that's all I've been seeing for three years: "We would love to support Linux at some point, but unfortunately, we don't have anything to report on a timeframe"


They could take the easy way out: publish an API and let people write their own open-source clients. That would work well with the Linux prosumer group.


But that would probably open them up to a lot of users using their service for general online storage instead of just backups, which I believe they're trying to avoid.


Just put a 6TB drive in my media server. I was going to put in another 4TB, but the cost per TB was not that much worse, so I decided to give it a go. There's also the issue that I only had room for one more drive anyway, and I wanted to RAID-1 the OS and important data partitions too.

I'm still learning the ins and outs of using btrfs too. I recently found this btrfs snapshot based backup utility that I'm going to give a try:

https://github.com/ruediste1/btrbck

This should be a lot, lot faster and more efficient than my current scheme of using rsync and rsnapshot for disk-to-disk backups.
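
For anyone curious, the mechanism underneath tools like this is btrfs send/receive with incremental snapshots. A rough sketch of the idea (the paths are hypothetical, and this is my understanding of the approach, not btrbck's actual code):

    import datetime
    import subprocess

    SUBVOL = "/data"                 # subvolume to back up (hypothetical)
    SNAPDIR = "/data/.snapshots"     # where read-only snapshots live
    DEST = "/mnt/backup"             # a btrfs filesystem on the backup disk

    def take_snapshot():
        # Read-only snapshots are required as the source of btrfs send.
        path = f"{SNAPDIR}/{datetime.date.today().isoformat()}"
        subprocess.run(["btrfs", "subvolume", "snapshot", "-r", SUBVOL, path],
                       check=True)
        return path

    def replicate(snapshot, parent=None):
        # With -p, send streams only the blocks changed since `parent`,
        # which is why this beats rsync-style file comparison.
        cmd = ["btrfs", "send"] + (["-p", parent] if parent else []) + [snapshot]
        send = subprocess.Popen(cmd, stdout=subprocess.PIPE)
        subprocess.run(["btrfs", "receive", DEST], stdin=send.stdout, check=True)
        send.wait()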


How is it going?


I haven't tried that backup utility yet. I'm frankly a little nervous about trusting it with valuable data, so I'll run it alongside something based on rsnapshot for the near term. Now I just have to figure out whether I've got enough external drives for that or need to buy more.

I'll likely first try to run things so that the btrfs snapshots are sent over the Internet, which should really cut down on the time and bandwidth needed, and use rsnapshot locally as a backup.


I always look forward to Backblaze's posts.

Does anybody know if drive manufacturers pay attention to their posts about reliability? I know when I worked with mainframes we sent every bad drive back to IBM for failure analysis, but would Backblaze's aggregate data be useful to somebody like Seagate?


Yev from Backblaze here -> We've since spoken to some of the device manufacturers after some of our more recent posts. :-p


>> would Backblaze's aggregate data be useful to somebody like Seagate?

Probably not. They posted before that they don't even buy enough drives to purchase directly from the manufacturers.


What is interesting is that Backblaze still uses the 45-drives-in-4U config.

We use Dell/LSI/NetApp/Engenio MD3260s, which pack 60 drives into the same space. We use RAID 6, as we want the storage to manage the redundancy, not the software. With four hot spares we achieve a density of 170 TB usable per 4U. (Technically it's 5U, as we have a file server sitting on top; however, assuming a 48U rack, it works out to the same density.)

With Backblaze, assuming RAID 6 (unless you are using FEC, there is no other way), you'll only get 120 TB per 4U (but with a file server as well).

The big advantage of the Dell/NetApp/etc. RAID is that the drives are easily hot-swappable. Ironically, at the scale Backblaze is operating, they are probably cheaper as well.
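
Back-of-envelope on those density numbers (the group layouts are my assumptions, not vendor specs):

    # Usable capacity for RAID 6: each group loses 2 drives to parity.
    def usable_tb(groups, drives_per_group, tb_per_drive, parity=2):
        return groups * (drives_per_group - parity) * tb_per_drive

    # MD3260: 60 bays minus 4 hot spares = 56 drives; assume 7 groups of 8.
    print(usable_tb(7, 8, 4))    # 168 TB with 4 TB drives, ~ the 170 quoted

    # Storage pod: 45 drives in 3 RAID-6 groups of 15 (Backblaze's layout).
    print(usable_tb(3, 15, 3))   # 117 TB with 3 TB drives, ~ the "120 TB"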


We also have some of those Engenio JBODs at my office for "big data", but I assume they're much more expensive than the Storage Pod. My general experience with enterprisey hot-swap JBODs is that the enclosure costs more than the drives.

Supermicro also has a 4x12-drive 4U box that looks a little less "homemade" than the Backblaze Storage Pod.


Yes and no; ours are pretty cheap, and they come with 24x7 4-hour response, which means we don't have to worry about keeping lots of spares.


I don't understand this: 6TB drives are more expensive per gigabyte, and they even mention it in their post. They say that they expect the drives to become cheaper over time, and that's why they are switching now. Wait, what? Switching now, when they are, I quote, "at the top of the curve" [1], does not make sense.

It might be that the physical space reduction compensates for it, but they don't mention that, so I don't get it.

[1] https://www.backblaze.com/blog/wp-content/uploads/2014/08/bl...


My read was that they aren't switching over everything; they are testing the new drives to figure out what reliability looks like.

Once the cost curve crosses over, they will know which vendor's drives to go all in on.


Oh, that explains it, thanks :)


Yeah, I don't think they replace drives in pods until they fail, even if they're old 1TB models. The drive costs just outstrip their infrastructure costs by so much that 4 racks of 1TB drives are still cheaper than retrofitting even 1 rack with new 4TB drives.


Yev from Backblaze here -> What we meant by "the top of the curve" was that the price per GB for the 6TB drives is still high, but will decrease greatly as time goes on. The "The Backblaze Cross Under Point" section of the blog will explain it a bit better. We're not quite fully switching over yet, just testing the hardware.


They're testing out 6TB drives early so that when the time comes to fully switch over, they know which drives to get from which manufacturer. They'll also have a better idea of when to switch over.


Not just the physical space, but also the power, cooling, and servicing requirements might tip the balance in favour of a higher upfront cost but lower lifetime cost.


BrianW from Backblaze here. Exactly. In rough numbers, one hard drive always takes the same power to spin up and to cool, regardless of its capacity. Also, going from 3 TByte drives up to 6 TByte drives means we have half as many hard drives (pods) to deploy every week, so we don't need to hire as many datacenter employees to deploy and maintain pods. Basically, we have a little spreadsheet (it isn't complicated, but it has 10 line items on it) that tells us whether a certain price for a hard drive is a good deal or not on a 2-year, 3-year, and 4-year timeframe.
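
The gist of that spreadsheet, in code form (the line items and numbers below are illustrative placeholders, not our actual figures):

    # Rough cost per TB-year for one drive over a given timeframe.
    def cost_per_tb_year(drive_price, tb, years, watts=8.0, kwh_price=0.10,
                         deploy_labor=5.0):
        electricity = watts / 1000 * 24 * 365 * years * kwh_price
        cooling = electricity          # crude rule of thumb: cooling ~ power
        total = drive_price + electricity + cooling + deploy_labor
        return total / (tb * years)

    for years in (2, 3, 4):
        for price, tb in [(130, 3), (270, 6)]:   # hypothetical street prices
            print(years, "yr,", tb, "TB:",
                  round(cost_per_tb_year(price, tb, years), 2), "$/TB-year")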


They mentioned power and that's a flat fee for them, so I didn't mention it.


I stopped using Backblaze when I learned that they require you to TRANSMIT YOUR PRIVATE KEY TO THEIR SERVER in order to restore your files from backup.


Brian from Backblaze here. To be clear, there are two levels of security/encryption at Backblaze:

1) The friendliest way we could design for people to restore their files was to allow customers to sign into a website with a username/password and recover one or more files. This is the default situation.

2) You can optionally turn on a "private encryption key", but if you do that, understand you MUST write down that key, because if you lose it, you can never recover it, and neither Backblaze nor any government organization will EVER be able to recover your files. NEVER. LOSE THAT PASSWORD AND THEY ARE GONE, GONE, GONE!

In the case of #2, as long as you don't need to recover from a crash, you never enter your private encryption key and nobody will ever have access to your files, period. However, if you lose a file, you have to sign into the Backblaze website and provide your passphrase, which is ONLY STORED IN RAM for a few seconds while your file is decrypted. Yes, you are now in a "vulnerable state" until you download and then "delete" the restore, at which point you are back to a secure state.

If you are even more worried about the privacy of your data, we highly recommend you encrypt it EVEN BEFORE BACKBLAZE READS IT on your laptop! Use TrueCrypt. Backblaze backs up the TrueCrypt-encrypted bundle having no idea at all what is in it (thank goodness), and you restore the TrueCrypted bundle to yourself later.
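
For the encrypt-before-upload approach, the general pattern looks like this (a sketch of the idea only; this is not Backblaze's or TrueCrypt's actual format):

    import hashlib
    import os
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    def encrypt_bundle(path, passphrase):
        # Derive a 32-byte key from the passphrase; the iteration count
        # stretches it so that guessing passphrases is expensive.
        salt, nonce = os.urandom(16), os.urandom(12)
        key = hashlib.pbkdf2_hmac("sha256", passphrase, salt, 600_000)
        with open(path, "rb") as f:
            ciphertext = AESGCM(key).encrypt(nonce, f.read(), None)
        with open(path + ".enc", "wb") as f:
            f.write(salt + nonce + ciphertext)  # salt/nonce travel with the data

    # The backup tool then only ever sees the opaque .enc bundle.
    encrypt_bundle("photos.tar", b"correct horse battery staple")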


I was considering signing up and just noticed this on their site... Definitely odd. I understand they want to be easy to use and feel seamless, but it defeats the purpose of the password-protected private key.

https://www.backblaze.com/backup-encryption.html

> To decrypt your data, you are required to enter your passphrase on our secure website. When you do so, it is passed over an encrypted connection to our datacenter where it is used to decrypt your private key, which in turn is used to decrypt your data. Your passphrase is never saved on disk and it is discarded once it is used. As before, once we decrypt your data on our secure restore servers we then zip it and send it over an encrypted SSL connection to your computer. Once it arrives on your computer, you can unzip it and you have your data back.


Seagate has promised 8 and 10 TB HDDs within the next 10 months. Can't wait to see that.


Yev from Backblaze here -> We've seen an 8TB drive; can't wait to get enough of them to drop into a pod and test :)


I wonder if the power savings going from 4TB -> 6TB could make up for the difference/GB.

Spinning disks still operate at relatively high power (~8W), and I don't think this value changes much with capacity (they don't add extra disk heads; I think it's just more platters/higher-capacity platters).

Pretty much the major cost running servers nowadays is simply the cost of the necessary electricity to run them and the cooling systems.

(Note that this doesn't apply to Backblaze's use case; they fill hard disks up, then pretty much power them down. That is why you can't instantly access your data stored with them.)
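
Quick math on the power point (assuming the ~8 W per spindle above, independent of capacity):

    # Drives needed for the same raw capacity, and the power they draw.
    capacity_tb, watts_per_drive = 240, 8
    for size_tb in (4, 6):
        drives = capacity_tb // size_tb
        print(f"{size_tb} TB drives: {drives} drives, {drives * watts_per_drive} W")
    # 60 drives / 480 W vs 40 drives / 320 W -- a third less power.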


I really do like this level of communicativeness from my backup service.


I was always a fan of Backblaze sharing details about how they built out their storage - it made me feel that if I wanted to, I could also build a redundant/massive storage array.


I have Backblaze-based storage pods in production at an old employer for a broadcast video automation system (plays out over the air). Quite the sturdy design.


For certain workloads these storage pods are much, much cheaper than S3. Anything where you are storing files that rapidly become stale, but still need to be instantly accessible for the rare random request.

I've only looked at it from the perspective of video files, though. Where I work we add a gigabyte of data per user per month. Eventually our S3 storage bill is going to be our largest cost due to compounding growth.


Couldn't you save a ton of money by explicitly moving old data to 'cold storage,' and keeping it on Glacier? Users would probably understand if data older than a year takes a few minutes to be retrieved.


According to the Glacier docs, "Most Amazon Glacier jobs take about four hours to complete" [1], so "a few minutes" is a bit of an underestimate.

[1] http://docs.aws.amazon.com/amazonglacier/latest/dev/download...


Any idea what the durability of Backblaze is? S3's is 99.999999999%; is Backblaze anywhere near that? I can't seem to find it in their FAQ.


S3's number is just marketing. It's going to be that durable until a black swan event wipes out 0.1% of all files. I imagine they keep multiple copies of every file, spread across multiple data centers. It's durable against hardware failure of all varieties, but it's still vulnerable to a software bug. It's also possibly vulnerable to certain types of natural disasters.

IIRC Backblaze employs a similar strategy. They have the advantage that your home computer is still storing a copy, so even if they have a minor catastrophe they can recover by simply having their client re-upload those files.


Human disasters too. I'd wager the odds of a catastrophic massive nuclear war are higher than one in ten million per year. Not that I expect Amazon (or much of anyone) to protect against that, nor would the integrity of my S3 data be a priority afterwards, but it makes the claim kind of absurd.

Edit: reading more closely, their durability number is per object: it's one object lost per 100 billion objects per year. The "one in ten million" comes from a hypothetical situation where you store 10,000 objects.
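
The arithmetic, for anyone checking (assuming independent per-object losses):

    # 99.999999999% durability = 1e-11 chance of losing a given object
    # in a year. For 10,000 objects:
    p_loss, objects = 1e-11, 10_000
    p_any = 1 - (1 - p_loss) ** objects
    print(p_any)   # ~1.0e-07, i.e. about one in ten million per year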

I don't know how many objects S3 stores altogether, but if we say it's a billion (presumably a vast underestimate) then that would imply that the probability of losing all objects in the system over the coming year is one in 100 quadrillion. I don't think this planet is that safe.


I'm not sure why it's absurd, given the redundancy they've implemented plus data checksumming. Plus, AWS hit two trillion objects a bit over a year ago.

http://aws.amazon.com/blogs/aws/amazon-s3-two-trillion-objec...


As great as S3 is, it's still confined to Earth. Seems to me that the odds of a planet-wide disaster taking out all Amazon infrastructure (as well as certain less important things like human civilization) are higher than the odds they're giving of a catastrophic S3 failure.


Sure. The AFR of your Western Digital desktop HD doesn't factor in and lower its reliability slightly because of the slim chance that a 747 is going to crash into your house and destroy your home.

Barring unforeseen acts of God, the 9's listed above apply, and you just have to personally weigh whether you think the risk of S3 losing multiple datacenters is high enough to keep you from storing your data there.


Amazon pretty explicitly includes unforeseen catastrophic events in their durability estimate. "In addition, Amazon S3 is designed to sustain the concurrent loss of data in two facilities." I sure hope the loss of two facilities doesn't fall into the "foreseen" category!


Sure, it says they account for that right there in their FAQ, so I guess I don't understand your point.

If you think events like the world being destroyed by a meteorite, the Sun dying, or a zombie apocalypse should factor into their 9's of reliability, they shouldn't.


OK, why not?

Serious question here. Things like gigantic hurricanes flooding their data centers should factor into it, right? The risk of war destroying a data center should factor into it, right? (I mean, would you trust S3 to the same degree if all of their data centers were located in Gaza?) So why shouldn't a scenario like "all of our data centers are simultaneously destroyed as part of a worldwide nuclear conflict" factor into it?


Extreme weather events, I'm sure, are calculated into their factors based on location (e.g. no hurricanes are going to happen in Indiana), but how are you going to predict a worldwide nuclear conflict?

Should your house insurance be higher because the world might be destroyed tomorrow by aliens? Something like this isn't quantifiable, and if it happens you have way bigger things to worry about than your MP3s in S3, so risks this minuscule aren't relevant in the grand scheme of things.


My house insurance calls out certain extreme circumstances as being ineligible for coverage. Yes, including nuclear war.

I agree, it's not really quantifiable. However, Amazon lists their durability to a number of significant figures that implies they are able to quantify the risk down to that level. Yet these unquantifiable risks give every appearance of being considerably larger than Amazon's figure.

Does Amazon's figure come with a "excluding loss due to ..." clause? If so, what do they exclude?


Standard exclusions: http://aws.amazon.com/s3/sla/

> The Service Commitment does not apply to any unavailability, suspension or termination of Amazon S3, or any other Amazon S3 performance issues: (i) that result from a suspension described in Section 6.1 of the AWS Agreement; (ii) caused by factors outside of our reasonable control, including any force majeure event or Internet access or related problems beyond the demarcation point of Amazon S3; (iii) that result from any actions or inactions of you or any third party; (iv) that result from your equipment, software or other technology and/or third party equipment, software or other technology (other than third party equipment within our direct control); or (v) arising from our suspension and termination of your right to use Amazon S3 in accordance with the AWS Agreement (collectively, the “Amazon S3 SLA Exclusions”). If availability is impacted by factors other than those used in our calculation of the Error Rate, then we may issue a Service Credit considering such factors at our discretion.


I think the S3 number is a floor, not a ceiling. That is, you should expect to lose that much data, but, given an unplanned event, obviously a great deal more data can be lost.


Amazon is nowhere near that. Software glitches have already destroyed S3 data. More importantly, anything over 99.999% has to take into account very low-probability events like nuclear war.


Their FAQ, which has been in place for years, needs to be updated: http://aws.amazon.com/s3/faqs/


Backblaze still shows a country block due to sanctions that were dropped by the US many years ago, against a country which no longer exists.


Can you elaborate on this? Where are they blocking their website?


BrianW from Backblaze -> Yes, this is a silly situation we need to find time to fix. We blocked a list of countries something like 6 years ago, and geopolitical boundaries and alliances have since changed. (sigh) Did I mention we have two open reqs for datacenter employees right now? Anybody want to come join us and help? We're in San Mateo, California. Hit up our "jobs" page...


I'm wondering why Backblaze hasn't been moving toward cold storage. Given that a wait time on recovering/downloading files would be acceptable, I don't see an obvious reason why not, and I can see some cost and energy savings to be had.


Likely because they keep deleting stuff, which is difficult in cold storage: they keep files for only 30 days after you delete them.

https://www.backblaze.com/remote-backup-everything.html

> Backblaze will keep versions of a file that changes for up to 30 days. However, Backblaze is not designed as an additional storage system when you run out of space. Backblaze mirrors your drive. If you delete your data, it will be deleted from Backblaze after 30 days.


This is really great! Does anyone have any idea what kind of RAID they are running on their pods, and what kind of read and write speeds one can get from one of these?


I'm a happy Backblaze customer, but I've noticed it typically uses around 240MB of memory. Is this normal? Why is it so high?


BrianW from Backblaze here. That is a little high, but it is plausible in some installations. When running Backblaze on a laptop, you will notice a couple of different processes. "bzserv" should always be very, very small; it doesn't do much. "bztransmit" is the process that reads your files from disk, encrypts them in RAM, and transmits them to the Backblaze datacenter. It tends to be about 30 MBytes of executable code, plus up to another 30 MBytes for the file being encrypted in RAM and transmitted, so most customers will see bztransmit bounce around at 60 MBytes of RAM.

HOWEVER, there are short moments when bztransmit must read a list of files into RAM, and it can reach larger sizes like 240 MBytes. bztransmit does the deduplication, which means it must store one SHA-1 for every file it has transmitted from your computer to our datacenter. If you have a lot of files that have been transmitted (over 1 million files is "a lot"), then Backblaze might reach 200 MBytes in size.
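
To put rough numbers on that dedupe index (the per-entry overhead here is a guess for illustration, not our exact layout):

    # One 20-byte SHA-1 per transmitted file, plus per-entry bookkeeping
    # (paths, pointers, allocator slack -- the 180 bytes is a rough guess).
    files = 1_000_000
    bytes_per_entry = 20 + 180
    print(files * bytes_per_entry / 1e6, "MB")   # 200 MB at 1M files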


The current desktop client runs atop a JVM.


Backblaze engineer here -> We absolutely DO NOT use a JVM in the client running on laptops. We love Java and use it in the datacenter on every web server and every pod. The reasons we don't use it in the client running on laptops are twofold:

1) Java doesn't deploy super smoothly: the initial download might be 30 MBytes to include the JVM, instead of 1 or 2 MBytes for a 'C' executable. Also, you have to keep updating the JVM separately, etc. It's friction for customers.

2) Java is hard to make look "native". Macintosh/Apple customers especially are sensitive to the look and feel of applications and like them to feel extremely "native".


Their web page disagrees with you:

"No Java Software Java is responsible for 91% of security attacks. Backblaze's code is native to Mac and PC and doesn't use Java."*

It even has a big logo with Java logo crossed out.


I'm pretty sure their clients are all native.


So straightforward but still so fascinating.



