Took a quick look at the API. For context, I was involved in the early days of Google Cloud Storage.
It is surprising that they didn't make it compatible with the S3 API -- at least for common object/bucket create/delete. This will require more code to be written and it will be harder to adapt client libraries.
* The lack of scalable front-end load balancing is shown by the fact that they require users to first make an API call to get an upload URL followed by doing the actual upload.
* They require a SHA1 hash when uploading objects. This is probably overkill compared to a cheaper CRC. In addition, it means that users have to make 2 passes to upload -- first to compute the hash, then another to upload. This can slow uploads of large objects dramatically. A better method is to allow users to omit the hash and return it in the upload response, then compare that response with a hash computed while uploading. In the rare case that the object was corrupted in transit, delete and retry. GCS docs here: https://cloud.google.com/storage/docs/gsutil/commands/cp#che...
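For illustration, here's a minimal sketch of that single-pass approach in Python; `upload` is a hypothetical helper that streams the chunks to the service and returns the hash the server computed on its side:

```python
import hashlib

def upload_with_inline_hash(path, upload, max_retries=3):
    """Single-pass upload: hash while streaming, then compare against the
    hash the server reports back; retry on a (rare) mismatch.
    `upload` is a hypothetical helper that consumes the chunk generator
    and returns the server-side hash as a hex string."""
    for attempt in range(max_retries):
        sha1 = hashlib.sha1()

        def chunks(chunk_size=1024 * 1024):
            with open(path, 'rb') as f:
                while True:
                    block = f.read(chunk_size)
                    if not block:
                        return
                    sha1.update(block)   # hash as we go -- no separate pass
                    yield block

        server_sha1 = upload(chunks())
        if server_sha1 == sha1.hexdigest():
            return server_sha1           # arrived intact
        # corrupted in transit: delete the bad object and retry (delete elided here)
    raise IOError("upload kept failing the integrity check: %s" % path)
```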
> It is surprising that they didn't make it compatible with the S3 API .... The lack of scalable front-end load balancing is shown by the fact that they require users to first make an API call to get an upload URL
And... you answered your own question. :-) We reduce our operating costs by not having as many load balancers in the datacenter and pushing off the responsibility to the API. It all comes from our traditional backup product where we wrote all the software on both sides so we could save money this way.
With that said, we are actively considering offering an S3 compatible API for a slightly higher cost (basically what it would cost us to deploy the larger load balancing tech).
I, for one, prefer the directness of not having to go through a front-end proxy. It probably eliminates some failure modes. I think that, instead of Backblaze providing an S3-compatible API, someone should do an open-source S3-compatible front-end for B2, that any interested user can run on a cheap VPS.
I work at https://kloudless.com. While not S3-compatible, we offer a similar proxy that provides a single API to multiple storage services such as Dropbox, Box, Google Drive, SharePoint, etc. We've also released support for S3 and Azure and are looking into B2.
> They require a SHA1 hash when uploading objects. This is probably overkill compared to a cheaper CRC.
Having been on the receiving end of entirely too many corrupted files in my life, I strongly approve of their use of a hash that's been standardized and fast for decades and remains cryptographically strong. "But it's cheaper" isn't very helpful if you end up storing corrupted data. TCP has a CRC too, and everyone serious has been wallpapering over it with better checks for years: it's time to accept that cheap CRCs aren't a good place to get stuck.
Improving the API to avoid the 2-pass problem is spot-on, though. One possible solution is to require either a subsequent API call, or a multipart-formatted first message, through which the caller submits the hash that confirms and commits the file to storage after the body upload. This would solve the 2-pass problem while still ensuring the client is actually doing the integrity check -- and since Backblaze is more than likely to take the heat on any corruption issues, it's probably good policy for them to make sure lazy client implementations aren't going to cause problems that their storage then gets blamed for.
Using a hash or CRC here is totally necessary. Oftentimes the CRC in TCP fails to catch corruption because it happens outside the network stack. Having an end-to-end check will catch, say, memory bit flips after the data comes off the wire.
But there is no call for a cryptographic hash here. This isn't being used as any sort of ID or to verify integrity outside of corruption.
The API works on top of TLS, which already includes cryptographic authentication of all data (usually via SHA-1/2 HMAC or AES-GCM).
The hash would be computed at the client right after reading from disk and right before TLS encryption, and since they seem to terminate TLS at the storage server it would be computed right after TLS decryption and right before storage, so it doesn't seem to provide any gain.
I think they should just remove it, or at least make it optional.
This is all rare, but it does happen. This is why the GCS team wants to know if you are seeing corruption on file upload as it might be some bad hardware failing in a non-obvious way.
As jbeda mentions, hardware errors are one big reason: with the scale S3/Azure/GCS/Backblaze operate at it's a matter of when and not if you're going to run into problems. Also: TLS may guarantee the bits your client sends are the ones their server receives, but that's just one cause of errors.
There's the write path from when B2 receives your bits to when they're stored on disk, for one. You could have unforeseen bugs in the code sitting on the other end of their upload URL (it's probably not all theirs, and even if it were, it was written by human developers).
Or B2's internal network path (if they have any) between that and the disk. Ideally that would provide integrity too, but maybe not. They offer a low price point and call out other compromises they make to achieve it (e.g. limited load balancing) - so while I really doubt it, it's remotely plausible they deem the internal overhead of SSL too high.
But then there's the potential for mismatch between "what the customer thinks they uploaded" and "what the customer actually uploaded" too! Less of an issue for now because their API only appears to support uploading files all at once, but eventually I'm sure they'll support a multipart upload scheme like the other platforms do. At which point uploads become more complicated since clients need to retain state and potentially resume. What if a client screws it up and there's some off-by-one error (or whatever)? If you can provide instant feedback, at upload time, that your clients provided bogus data, that's a good thing.
You can argue it's a painful requirement to force on users since it means they have to track/compute the hash themselves (which might be nontrivial for streaming applications), and that's fair. But there are enough points of failure, and the volumes are so large, that errors are a fact of life and you really need to insure against them. Especially here: your entire reason for existing is to reliably store bits, so it's kinda important to get it provably right.
It seems completely sensible to err on the side of caution, especially as a new and relatively unproven platform (as an object storage platform provider I mean, obviously they have tons of experience storing things).
There are many places in your stack where data corruption can and will occur. You are correct that TLS provides payload integrity on a per-packet basis - but it doesn't protect you against silent truncation (to fight this, always declare and check content-length, or use chunked encoding). I have seen corruption occur in NIC buffers, ECC'd main memory, Xen MMU'd memory pages (yes, Xen was responsible), and multiple places in HTTP server and client stacks. None of those failures manifested until hundreds of terabytes of data had successfully gone through the system.
If you're handling data on behalf of others, it's paramount that you checksum data end-to-end. Amazon S3 allows you to do this by sending the MD5 or SHA along with the data. Google GCS allows you to do this with CRCs (which, despite what others in this thread say, are more appropriate for the task than crypto hashes, as long as you use enough bits).
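As a rough sketch of how cheap the client side of this is (not any particular SDK's API), here's computing the base64-encoded MD5 that S3 expects in its Content-MD5 header:

```python
import base64
import hashlib

def content_md5(path, chunk_size=1024 * 1024):
    """Base64-encoded MD5 digest, the format S3's Content-MD5 header expects.
    S3 rejects the PUT if the body doesn't match, which gives you an
    end-to-end check for the price of one streaming pass over the file."""
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(chunk_size), b''):
            md5.update(block)
    return base64.b64encode(md5.digest()).decode('ascii')

# e.g. pass the result as the Content-MD5 header on the PUT request
```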
Why would you want 'just' a checksum? I want something I can rely on. If I have to dedicate half a core per gbps of internet-crossing upload, that's not a big deal.
The purpose here is not to secure your data against an attacker (that's what TLS is for), or even against errors in transmission (as others have noted, TLS has you covered there as well) - you need something simple and inexpensive to secure against errors in hardware/memory before/after it enters that pipeline. While you shouldn't under-solve a problem, there are real costs to over-solving the problem as well.
You don't need a real attacker to want safety from assumptions that will be true the vast majority of the time, such as "same hash = same file".
For example, I might have md5-colliding files on my hard drive somewhere, that someone else made as a proof of concept. I honestly don't know. But I would worry about using a storage system that depends on md5, because what if it deduplicates without checking every byte?
For the same reason that UTF-16 has encouraged so many broken implementations, at least in a pre-emoji world, it's a bad idea to almost but not quite support convenient features. Either clearly don't support something, or fully support it.
The CRC in TCP is weak, but CRCs can be made arbitrarily powerful while retaining their key advantage: they can be computed independently for multiple parts and then combined when the parts are concatenated. That property, combined with a strong CRC, is what provides an elegant solution to the 2-pass problem.
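A quick illustration of the streaming half of that property using plain `zlib.crc32`, which accepts a running value; combining two CRCs that were each computed from scratch additionally needs a crc32_combine-style routine (present in zlib's C API, not exposed by Python's `zlib` module):

```python
import zlib

part1 = b"first gigabyte of the object..."
part2 = b"...second gigabyte of the object"

# CRC over the concatenation, computed in one shot
whole = zlib.crc32(part1 + part2)

# Same CRC computed incrementally: feed part2 with part1's CRC as the
# running value -- no need to re-read part1 or hold both parts at once.
running = zlib.crc32(part2, zlib.crc32(part1))

assert whole == running
```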
Whether or not I decide to use this service, this is one of the most useful announcement blog posts I've read in a while. The tone is not just "ok we released this" but instead "we released this, and here are some practical use cases if you are already doing X, Y, or Z" -- nice job.
Is there a possibility to request the backed-up data in physical form?
Let's say I backed up 8 TB of data for a small business and need to restore it within 24 hours. Is it possible to request an overnight shipment of hard drives so I can do the restore locally instead of taking weeks to download all that data?
I know Amazon has this feature; not sure about Google.
Another question: what's the max number of buckets an account can hold?
> Is there a possibility to request the backed-up data in physical form?
Yes. In our traditional 8 year old product line of online backup, you can order a restore on an external USB hard drive for $189 (you keep the hard drive, and the cost includes world wide shipping). We FedEx the restore to you anywhere in the world. (We ship to Europe all the time, but add an EXTRA 24 hours for that to arrive.) Oh, if you only have 128 GBytes of data you can order a USB thumb drive of that FedEx'ed to you for only $99.
B2 absolutely supports this USB drive restore functionality. We call the feature "Snapshots" (you take a "Snapshot" of some of your data, then you can either download it as one large zip file or you can have it sent to you via USB Hard Drive).
Our online backup product tells you that the restore is up to 4 TBytes, but we have prepared Drobos as "special orders" for customers that were much larger than that. We aren't trying to make profit from that part of the business, just kind of break even on materials and shipping and create customer goodwill. You would be AMAZED how happy some customers are to receive 8 TBytes of Drobo with all their data they thought they might have lost. :-)
I'm really glad to hear this. I haven't kept up w/ Backblaze, other than to read blog posts on hard drive reliability and the storage pod designs. This comment made your services vastly more interesting to me, and much more applicable to my Customers. Gonna go read about your product offerings now.
> what's the max number of buckets an account can hold?
We are limiting each account to 200 buckets right now until we gain some experience. Each bucket can hold unlimited numbers of files.
We really want feedback on these decisions, so as you run into issues just let us know. Many of the limits are arbitrary, I think the 200 buckets was so we didn't have to create a paginated list on your logged in webpage for version 1.
Love to see this as well. Part of the selling point of Backblaze as an online backup is the ability to get a FedEx'd hard drive.
I could see this being practical for small-time video editors. I need to keep copies of old projects, but I'd be willing to pay $300 to get an overnight hard drive of the files, since whatever new project comes along will pay for that cost.
Up to the first 4 TBytes it is only $189 which includes you keeping the hard drive. I'll make it even better - if you ship us back the hard drive within 30 days (you pay return shipping) we'll refund the entire $189. The info on the program is published here: https://help.backblaze.com/entries/67512970-How-to-order-a-r...
Thanks for the information, I didn't know those drives were refundable!
I was always turned off by having to download a huge zip file from backblaze when I wanted to restore something, but this actually sounds really nice of you guys for large datasets.
We didn't originally offer refundable hard drives. About six months ago we started this quiet experiment of allowing full refunds. We were worried it would suddenly make the USB drives hugely attractive and overwhelm the currently deployed servers and Backblaze employees dedicated to the USB restore tasks. :-) It's been a successful test, and now we'll start promoting it more.
I want a way to be able to use this with ssh, rsync, scp, and so on, to preserve legacy backup routines that are now in place (including on systems that you will not be writing B2 clients for, like old versions of Solaris).
Backblaze generally does this very well with their product offerings and white papers. Would that the rest of the self-absorbed tech community did the same...
In addition to the good point raised by @jcreedon regarding their single datacenter (which I think is a bit of a bigger deal than he does, primarily because I don't think it scales linearly per-GB for the first few datacenters, though it might thereafter), I'm more concerned about the bandwidth.
There's no talk about their backbone or their network capacity. I get that they have terabytes of upload coming in, but as anyone who's used their software can tell you, it's throttled. I don't know how many users they have to tell you how much bandwidth they're actually handling, but can they handle people using B2 as a distribution point for large files for customers? For example, I have a huge S3/CF monthly bill from customers downloading ~400MiB ISO images tens of thousands of times a month. Amazon CloudFront is ~$0.085/GB for the first TB, while BackBlaze B2 is an incredible $0.05/GB - but at what performance? Will my technical support representatives be getting angry phone calls about halting download speeds or do they have the capacity for something like this?
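Back-of-the-envelope on the egress side, assuming roughly 10,000 downloads a month (my guess at "tens of thousands" -- scale to taste):

```python
# Rough monthly egress cost for ~10,000 downloads of a 400 MiB ISO.
iso_gb = 400 / 1024             # treating 400 MiB as ~0.39 GB for a rough estimate
downloads = 10_000              # assumed figure, not from the post
egress_gb = iso_gb * downloads  # ~3,900 GB/month

cloudfront = egress_gb * 0.085  # CloudFront rate quoted above
b2 = egress_gb * 0.05           # B2 download rate

print(f"~{egress_gb:,.0f} GB out: CloudFront ≈ ${cloudfront:,.0f}, B2 ≈ ${b2:,.0f}")
# roughly $330 vs $195 -- worthwhile only if B2 can actually sustain the throughput
```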
Hosting the world's data is no tiny task, I hope they're ready for it and I do, truly, wish them all the luck. I've been a BackBlaze customer for a few years now (at least 5 or 6, I imagine) as a tertiary or quaternary backup (haven't had to restore... yet), and B2 looks and sounds promising, but as far as technical details go, this post is nothing.
EDIT: In response to the reply below, I believe it's throttled by default in the client, though that can be turned off in the application settings. Also, you've replied to my claims of throttling but have ignored my question regarding backbone capacity and network readiness...
Sorry about skipping your network capacity question. I just got over-excited about throttling. :-)
We currently have about 100 Gbps symmetric capacity into our datacenter on a couple of redundant providers, but the key is we have open overhead and we'll purchase more as our customers need it.
But here is the best part (if you want OUTBOUND capacity) - our current product fills the INBOUND internet connection, but currently we only use a tiny, tiny fraction of the OUTBOUND connection. So if you want to serve files out of our datacenter we have a metric ton of unused bandwidth we would LOVE you to use. And if you fill it up, we promise to purchase more.
But also keep in mind, Backblaze is very experienced with STORAGE and I have a lot of confidence we won't lose any of your files. What we don't have a huge amount of experience with yet is serving up viral videos and such. So just bear with us during this beta period while we figure it all out. Personally I'm looking forward to that part (all the CDN/caching layers).
>But here is the best part (if you want OUTBOUND capacity) - our current product fills the INBOUND internet connection, but currently we only use a tiny, tiny fraction of the OUTBOUND connection. So if you want to serve files out of our datacenter we have a metric ton of unused bandwidth we would LOVE you to use. And if you fill it up, we promise to purchase more.
:) Well, yeah -- but that's also what B2 charges for... so the business model requires that bandwidth start getting consumed. :-)
That's the first thing I noticed: This is awesome if most of your data is never touched. The moment you want to serve it up a lot, of course they promise to purchase more: Their bandwidth prices are as outrageously high as Amazon's.
If you serve up viral videos etc. and start eating a ton of bandwidth, even a "do it yourself" CDN out of VPS's could quickly save you a fortune...
But if your inbound capacity is pretty full these days, how can you manage to onboard _large_ new clients at this point? Can you scale your inbound bandwidth as fast (and at the same cost) as adding a new vault a month?
Our inbound is not completely full, and we always try to have extra capacity/headroom for new customers. But if you plan to upload more than 5 petabytes at a rate of faster than 15 Gbps sustained, you probably want to contact us ahead of time to let us know it's coming and we'll increase our capacity for you. We can absorb anything less and it won't cause us any issues.
As somebody else mentioned, since we're in a commercial datacenter with a bunch of network providers already serving us, it's pretty easy to dial up our capacity as we need it.
I'd estimate a week? That's probably what you meant by "eons". :-)
It could go faster, but if we need to buy a new (expensive) network switch that can take a few days to arrive. And as you mention, the datacenter guys are happiest if you give them 3 - 4 days and a work order to do the cross connect.
Building out more vaults (the blocks of 20 storage pods we store data in) is usually about the same if we rush it, but we have a big (multi-petabyte) buffer spinning ready to accept data at anytime. We have a regularly scheduled delivery of pods once per month based on projections, but we have been known to tell our provider to go ahead and build three months worth of pod chassis (everything except for the drives) immediately and ship them to us. We supply the hard drives, so that either comes from our own stashes or we quickly order some more from various sources.
Not to speak for brian, but as someone who used to do physical datacenter operations, most facilities have a bunch of fiber already provisioned (in the ground). It's just a matter of getting the networking gear and provider provisioned. Turnup can be done as quickly as 24-72 hours, depending on the provider and the dollar amount involved.
> as anyone who's used their software can tell you, it's throttled
Brian from Backblaze here: no it is not throttled (by us). If you only have a 10 Mbit/sec upload capacity you are throttled by your ISP. Also make sure you visit our "Performance" tab in the online backup client and tweak a few settings, like increase the number of threads.
I have 100/100 up and down, and I barely push more than 3MB/s when uploading. And in that time the client is eating all cores alive. I appreciate that it may not be the ISP, but the client does seem to end up being a major bottleneck.
I moved to Linux a few months back and was going to basically cancel my Backblaze sub when I got around to it, since you have no interest in making a Linux client. Maybe B2 can act as a solution to this at a price penalty.
Or a price savings! If you do the math, I think the break even is at 1 TByte. If you only need to backup 500 GBytes from your Linux server then you'll save 50%.
Now with B2 we immediately support Linux and provide a client for it out of the box (written in python). Granted, it is only a command line interface so give us a little time to polish it up and add some features.
Another Linux user here who would love to use a command line Python tool to backup my data to Backblaze: with Python I can see how my stuff is encrypted. That's the only reason I'm not using Backblaze right now: closed source client.
My only interest in B2 is backing up for a lower cost than the ridiculousness of S3: at $0.022/GB, I might as well buy a 3TB hard drive myself, put it at a friend's and push my data there. Every month. At the end of the year, I'd have 36TB in hard drive capacity if I bought drives instead of paying for 3TB of S3 storage.
(All numbers are estimates and "roughly"s. Also I don't have external backups now because I'm too lazy to write the software myself, so there is something to say for paying instead of not having it.)
You are ignoring the electricity/bandwidth costs incurred by your friend, which would raise your costs.
I use backuplizard for personal data/photos, which works out to roughly the cost of one 2 TB disk per year. To me it's easy to pay that instead of owning disks and worrying about them breaking, etc.
I did the math properly once, the S3 costs still didn't work out by far. I have a server here at home and an external hard drive for it, so I know how much power it draws. Bandwidth is free, but in the calculation I did assume I would pay for bandwidth.
All in all, it's by far the cheapest option to store it at a friend's. Storage providers could also cheapen things a whole lot by offering reduced redundancy and whatever else it is they do to make it so expensive (glacier storage is also more expensive than the price I got). It's a backup after all, I don't need my backup to have five copies on spinning (versus offline, non-powered) disks. If my backup dies, I'll upload it again...
When I was a customer of BB I noticed no issue with the uploads, but when I had a flood and my hardware was destroyed, redownloading all my information was orders of magnitude slower.
I tried from multiple physical locations but I could not increase my downloads past 1-2 Mbps, and for terabytes of data that seemed like it was throttled by BB, considering I was easily uploading at 20 Mbps.
I contacted BB support and they ignored me, so I switched to a competing services and have had no issues ever since.
This really made me sad because BB's blog is amazing and their tech is really cool, but when you see people saying "it's throttled", it's because of real experiences out there, and not just ones limited to an ISP issue.
Brian from Backblaze here. I wonder if that was during the incredibly annoying "Comcast goes to war with Netflix" era that Backblaze got caught up in. That was Nov 2013 through Feb 2014, you can read a little about it here: https://www.backblaze.com/blog/obama-backs-net-neutrality/ (scroll down for our graphs showing our customers getting throttled). That seriously sucked for Backblaze.
But either way, we added threading to the bzdownloader (our custom application to download large restores), and if you try it today, crank it up to 10 threads and I swear you'll be happy with the download performance.
Brian, thank you for responding, I appreciate your clarity and honesty.
My issue did occur during that period, and I am impressed you can call that out from memory, it must have been a frustrating time for BB.
If that alone was the problem, you would have just 100% won back a customer, but the thing that irked me the most was the customer support response.
I know you do not work for your helpdesk, but their response was more the reason I left. Their apparent lack of concern was what turned an ordinary service-provider malfunction into a dissatisfied customer looking for a competitor.
I can laugh about that period now, but yeah, it was a bad few months. We bled out good customers like yourself and we felt helpless. My basic faith in the internet was shaken up - I always thought I would send packets and they would be delivered quickly, and here are these HUGE players in the space messing with each other throttling each other and changing routing to get around throttling (and hurting Backblaze as collateral damage).
> do not work for your helpdesk
It's unfortunate when a customer gets a bad experience. The helpdesk guys are faced with this monumental task of responding to tons and tons of basic questions by Mom & Pop customers that are not computer professionals. Then mixed in are competent programmers and IT guys that know what the heck they are talking about. The helpdesk guys sometimes get it wrong who they are dealing with and it infuriates the competent computer users.
I think we should issue "professional computer user" cards where you can get a different level of support from all these companies. If you were helpful on forums you could earn points for your card, but if you ask helpdesk too many dumb questions your card could be revoked and you would go back to the first tier support. :-)
Is there anywhere I can traceroute to / test upload & download speed? I'm on a 10/1 connection, and I can only use about .5 of that 1 before my connection is completely tanked. (Thanks, Australian Governments). If I could do a trickle upload and write some good scheduling, I'm definitely moving to BB2 - from Glacier.
I think it's also important to point out this is cheaper than AWS Glacier. You could think of B2 as pure backup for now, and after tracking metrics, expand it out to more products. I doubt even Backblaze would suggest you make this your primary, mission-critical storage, hence the Beta title. But even so, there are plenty of non-main use cases. Especially at this price.
I'm bothered by the whole idea of putting all my data with any one vendor (with Backblaze or Amazon) and thinking you don't need a backup. I claim "RAID / Reed-Solomon / real time mirrored copies" is NOT "Backup". If your programmer makes a mistake and a line of code deletes some mission critical data from Amazon S3, then all the Reed-Solomon encoding in the world doesn't help you, the data is still gone.
What you need is a copy of all your data from Amazon S3 in another vendor lagging behind for 24 hours that is NOT real time mirrored. Maybe you lose all the customer data generated that day, but your business survives by restoring from backup. (I chose 24 hours arbitrarily, each business needs to choose their upper limit of loss where they can survive.)
A good rule of thumb for a CONSUMER is three copies of your data: 1) primary, 2) onsite backup, and 3) offsite backup. If you are a business that will lose millions of dollars if a programmer makes a mistake or an IT guy is disgruntled, add 4) another offsite backup with a totally different vendor that doesn't share a single line of code with 1-3 and has separate passwords.
> If your programmer makes a mistake and a line of code deletes some mission critical data from Amazon S3, then all the Reed-Solomon encoding in the world doesn't help you, the data is still gone.
I'm surprised at the implication here, that you'd use Glacier on a non-versioned bucket. Making destructive updates impossible doesn't cost much extra in archive fees.
Ok, so let's say you forget to pay your Glacier bill because the IT guy left and the credit card changed and the alert emails go nowhere. Bye-Bye-Glacier! No payment, no customer data, Amazon might delete your data due to a tiny administrative screwup.
My point stands: if you don't mind losing your data, store it in one vendor. But if you would REALLY lose your business and put 10 people out of work if the data is lost, storing it in Amazon (or Backblaze) without a second copy backed up somewhere else and a third copy backed up in yet a third location (with a totally different vendor with a totally different payment system) is irresponsible.
But... let's say you have three accounts with three storage companies. Now let's say you outsource the management of those accounts through one company... or even, equivalently, delegate it to a subsidiary or partner of your company. And then you accidentally stop paying them, or they can't requisition the budget necessary to pay the providers, or whatever. Now you're still stuck, even though you're nominally doing things "in-house."
What you actually need is a provider that will guarantee the durability of your data even if they (temporarily) cut off your access to it for lack of payment[1]. Anything else is just a level of indirection that suffers the same problems.
---
[1] I don't actually know if anyone does this, let alone AWS. Here's a quote from Tarsnap's FAQ—where you'd think cperciva is someone who would have considered the "I had no idea my infrastructure was relying on this service until it shut off" use-case:
> You will be sent an email when your account balance falls below 7 days worth of storage costs warning you that you should probably add more money to your account soon. If your account balance falls below zero, you will lose access to Tarsnap, an email will be sent to inform you of this, and a 7 day countdown will start; if your account balance is still below zero after 7 days, it may be deleted (along with any data you have stored) at our discretion. (If you can't add money yet but will be able to later, contact us and explain the situation. We're reasonable people and simply knowing that you're alive and haven't forgotten that you were using Tarsnap is very helpful.)
7 days is probably reasonable in the case where there's an active IT staff who will notice when, say, servers stop backing up. But if nobody's watching for that...
> 7 days is probably reasonable in the case where there's an active IT staff who will notice when, say, servers stop backing up. But if nobody's watching for that...
Do you have a solution in mind for the case of a company where email is going to /dev/null and nobody is reading the output of their cron jobs?
I mean, if I can't contact someone, it doesn't really matter if I wait a week or a month...
I'm not sure; I think the main thrust of the solution would be getting the users who really, really care about their backed-up data (which should be most of the people who are using a back-up solution) to do some extra stuff up-front to insure it:
• Ask people to provide optional contact information for an "executor of their estate"—a person who can make decisions about what happens to their data on their behalf if they cannot be reached.
• Ask people for a secondary credit card that can be charged as a backup: specifically, suggest that this be the personal card of Someone Important in the company, who will be likely to notice the charge and flip out.
• Ask for a flat-fee deposit to enable a secondary "long-term storage, no uploads, no monthly billing" mode of usage. Make it enough money to be motivating when you imagine just schlepping this hunk of data around for the rest of your life. If the user has paid this deposit, and their regular card gets declined, switch them to this mode and consume the deposit. If they close their account, refund the deposit if it hasn't been consumed.
I'm thrilled that we can offer this raw cloud storage for just $0.005/GB/month. Would love to hear from the Hacker News community what you do with storage today & what you might do with B2.
Do you have any plans to offer some sort of WORM model where you could “seal” a bucket, file, etc. to prevent future changes? This is really nice for protecting against malware, human error or malice, etc. and a regulatory requirement for certain industries.
We're at the very beginning of this journey, and heard from a lot of folks that they wanted access to inexpensive cloud storage. I imagine we'll learn a lot more about needs & use cases from people over time. I can certainly understand the value of WORM storage and it is something we'll definitely consider adding. Appreciate the suggestion!
CAVEAT (PLEASE READ): this does NOT encrypt data yet!! This is just a quick technology demonstration, it isn't a polished backup client. Give us another month for that...
We expect Linux servers (and desktops) to make up a significant percentage of the things communicating to B2, so you can expect a lot more support than you have been getting from the traditional Backblaze Online Backup product line.
If I was going to tar and then gpg files, what would be the optimal "chunk size" from your point of view? Say I have 10GB that I'm encrypting and then splitting up, are you going to want 1GB chunks or many more 100mb chunks? (This also raises the question of how much data I want to lose to corruption...)
I would recommend 1 GByte chunks for several reasons. The first reason is that you'll get very good throughput and efficiency at that large size. If you go as small as 10 MBytes we see long distances to our datacenter not "ramping up" fully. In other words, we can't get a single threaded application between New Zealand and California to get the advertised bandwidth for small chunks, but up at 1 GByte we DO get the advertised bandwidth. But if you go too large (like 5 GBytes) we have seen that you start seeing too high a percentage of uploads rejected because of bit errors occurring somewhere in the networking between your computer and our servers (the SHA-1 checksum won't match).
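In case it helps anyone following the same tar + gpg route, here's a rough sketch of the split step; the `split_archive` helper and the naming scheme are just illustrative, and it assumes you've already produced the encrypted archive:

```python
import hashlib
import os

CHUNK = 1024 ** 3  # ~1 GB, per the recommendation above

def split_archive(path, out_dir):
    """Split an already-encrypted archive (e.g. tar piped through gpg) into
    ~1 GB chunks and return [(chunk_path, sha1_hex), ...] so each chunk's
    SHA-1 can be sent along with its upload. Holds one chunk in RAM at a time."""
    os.makedirs(out_dir, exist_ok=True)
    manifest = []
    with open(path, 'rb') as src:
        index = 0
        while True:
            data = src.read(CHUNK)
            if not data:
                break
            chunk_path = os.path.join(
                out_dir, '%s.%04d' % (os.path.basename(path), index))
            with open(chunk_path, 'wb') as dst:
                dst.write(data)
            manifest.append((chunk_path, hashlib.sha1(data).hexdigest()))
            index += 1
    return manifest
```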
One thing to consider is using ecryptfs as I do for my data at rest. It's mounted as a (mostly) regular filesystem at ~/Private (or similar) and the encrypted files that serve as the backend (~/.private) are then uploaded to the cloud service of my choice. Currently using crashplan, but B2 would be nice to create archives/snapshots of my data instead of just syncing it as most consumer backup solutions do.
This is a very low friction way to have accessible data and have it encrypted by a fairly popular mechanism.
> Would love to hear ... what you might do with B2.
Back stuff up, but properly encrypted instead of your Windows client's closed source stuff. Without B2 I would use S3 but at their rate I might as well rent a datacenter myself, so I'm going to do the math again with B2 soon.
Thank you for doing this. I've read some awesome stuff on Backblaze's infrastructure over the years, but never could use the service (a high level backup solution for Windows/Mac isn't so useful for me).
This announcement may mean I finally get to test your stuff! (I've been frustrated with the quality and feature creep in open source syncing solutions and procrastinating building my own, bare bones alternative).
Yeah, I'm really concerned about this too. I assume it's fairly good storage if they're dogfooding this for their backup product, but explicit is better than expectations in this case :)
Brian from Backblaze here. In general we are shooting for about the same reliability as Amazon S3. Backblaze is really transparent about our redundancy and pretty much everything we do. We use 17+3 Reed-Solomon across 20 computers in 20 different locations in our datacenter. You can read about it here: https://www.backblaze.com/blog/vault-cloud-storage-architect...
And for the record Backblaze only has one datacenter. This bothers some customers deeply so if it is a show stopper definitely don't use Backblaze B2. We just want to be transparent about what we do and what we don't do. One idea is you could use B2 as a primary copy of the data, and make another copy into Amazon Glacier in case the Backblaze datacenter is hit by a meteor (or a terrorist attack or an airplane crashes into it).
Oh, I said this elsewhere but I have a lot of confidence we won't lose your data, we've been perfecting that for 8 years. What Backblaze DOESN'T have much experience in is serving up viral videos and the CDN (Content Delivery) layer. I'm looking forward to that layer, I think it will be fun to polish, but especially over the next few months of invite only beta anybody using B2 needs to be able to work with us to get the kinks worked out.
Tarsnap seems to rely pretty heavily on Amazon's infrastructure, so I'm guessing it won't support this? Which is a shame because I'd really like to use it, but can't afford to right now as an individual.
Arq seems really good at supporting a broad variety of cloud providers though, so hopefully they'll add this too. I'm hesitant to use cloud backups generally; I've never seen an audit of how secure Arq's backup scheme is, for example (though it seems pretty simple - https://www.arqbackup.com/s3_data_format.txt). I've used CrashPlan a lot and basically take it on faith that it's secure. It's probably good enough for my use, given that I'm not storing state secrets or anything, but it's still a little unsettling to 'lose control' of one's data.
From Backblaze's point of view, I guess this is either smart (diversifying themselves–people can use other backup software if they like, and Backblaze still profits) or less smart (turning themselves into a commodity), but it seems like their software is still first rate, so I guess it'll work for them.
> Tarsnap seems to rely pretty heavily on Amazon's infrastructure, so I'm guessing it won't support this?
I'll be taking a look at this of course, but there are things which are more important than price -- for example, reliability. Tarsnap users trust me to not lose their data, and I trust S3 to not lose their data. That's a trust I don't have in B2 yet -- first, simply because B2 hasn't been around for long enough to prove itself, and second based on what I've heard from former Backblaze users.
> I've used CrashPlan a lot and basically take it on faith that it's secure
They use Blowfish. Says it all really - their default encryption is a long-obsolete 64-bit block cipher you might have picked in 1999 because it was faster than 3DES.
I can only assume they do this because migrating would cost them money, and being able to advertise "448 bit encryption" actually sounds like a plus to most people and not the glaring red flag it actually is.
> it seems like their software is still first rate
What, like their backup client that can't actually do restores? It's still all "log in to our website and let us decrypt your data for you" :/
> They use Blowfish. Says it all really - their default encryption is a long-obsolete 64-bit block cipher you might have picked in 1999 because it was faster than 3DES.
Not defending it, because I know it's old and there are weaknesses, but aren't Blowfish and 3DES both still technically secure? This is a genuine question. It was my understanding that if implemented correctly, with a random key etc., that neither has been formally broken. 3DES is 2^112 no? which is still not practically accessible by brute force. Not that this means anyone should use them, of course, AES is a standard for a reason...
As you say, I had just assumed the migration cost was too high to move to something newer, but I don't think it necessarily means data stored there is unsafe?
Sure, but it's not exactly putting them in a good light is it? Dressing up obsolete stuff as state of the art "same as your bank uses", while either being unwilling or unable to migrate to something more era-appropriate.
Calls into question their competence, their honesty and their architecture all at once.
Wait, what about Blowfish is insecure? BCrypt is built on top of Blowfish.
Blowfish supports key-lengths up to 448-bits. And I've never heard of a single criticism of the function. Its just kinda... less used than Rijndael because it didn't "officially" win the contest. But otherwise, it is a fine function.
EDIT: Confused Twofish with Blowfish in the AES finalists.
Obsolete is not the same as insecure. But it is old, it does have its weaknesses, and there have been better options out there for a very long time. Why continue to use it? Is upgrading your crypto that difficult that you'd rather just leave it for another decade or two?
It also calls into question the nature of all the other crypto they're using - is that all >20 years old too? Still tuned for a world of 486's and 68040's?
This is nice, but I have one critique: having the sorting indicators right-justified and the column header text left-justified, without any column borders, had me clicking on the sorting indicators for the wrong column for a few seconds and wondering why the sorting was wonky. You can click anywhere on the header column to sort, which makes it easy to use as intended once you realize that, but a visual indication of separate columns in the header (a right/left border, or horizontal striping) would make it just a tad more obvious.
Nice table, it provides a good overview of what is available. I would also suggest adding the protocol used (S3, Swift, ...) and coverage by libs/languages.
I like this, but I've gotten mixed messages from the people running it. I tried to email them asking if they're actually working on the project and didn't get a response. I've also heard that they may be migrating this to "OVH Public Cloud", which is a service that isn't available yet.
I like this offering, but I'm not getting good signals on its seriousness. It may be something they're going to sunset soon. I would need some reassurance as to what's going on here.
Seems they're currently reorganising RunAbove and merging some of the projects back into the main OVH product line-up. OVH Public Cloud[1] is being rolled out at the moment, but not yet available in all regions.
It is certainly interesting how OVH undercuts the competition by quite a large margin in everything. We use them for dedicated servers, but I wonder why they're not as well known as AWS...
Yeah, it is very interesting how everyone here loves AWS. And the reasons for it seem to be (1) the "trust" in AWS, (2) the existing knowledge about AWS's API etc. and (3) that it's just not worth the time to evaluate another provider because you could spend your time growing your startup/doing something more profitable.
And these are interestingly exactly the same reasons enterprises buy IBM and Oracle.
No! The reasons are absolutely valid. It's just interesting to see that the lean and cool startup is very similar to an established enterprise in that regard.
The trust is in quotes because I'm not sure what to think about it. Every month I see a post here about some AWS service outage but it looks like nobody is getting nervous because of this. People just wait until it is fixed. On the other hand, I have experienced that people begin to trust companies because the company advertises on TV. But AWS has earned its trust legitimately I think.
The whole concept of "confidence/trust in companies" is so important but I know so little about it.
The reason you hear about so many AWS outages is because it's a massive service with so many users. If you build appropriately, you can have extremely good uptime built on AWS. They've earned tons of trust from their users.
Most outages only happen in a single AZ, which is not really hard to handle.
In over a year we had one outage in Frankfurt, and that was just a small problem which a quick reboot of our machines fixed -- oh, and that happened automatically.
The thing is, you need to know that failures can happen.
Not only in the cloud -- but in the cloud these failures are easier to handle, since you can just create new boxes or use multi-cloud environments.
Just as a comparison to B2, with OVH dedicated storage servers you could get a cost per raw, un-replicated GB of around $0.006-0.007 per month, best case, if I am not mistaken.
First, they weren't in North America until recently. Having a server in France means high ping times for me and latency for the vast majority of my visitors. OVH started operations in Québec in 2013. So they've had less than three years to establish themselves. EC2 is 9 years old.
Second, it's hard to figure out what to buy. With EC2, they're all Xen instances and you decide on the right CPU/RAM configuration. DigitalOcean, Linode, Vultr, etc. all are easy. With OVH, what am I supposed to buy? Do I want a dedicated server or an infrastructure dedicated server? And then if I click for dedicated, I need to choose from Hosting, Enterprise, Infrastructure, Storage, Custom, or Game. I know computers - tell me the processor, RAM, and storage without breaking it into categories. So, I go with Hosting and half of the options are for "Delivery from September 30". Ok, that's more than a week out. Maybe I want more flexibility like hourly billing on VPSs. I can go to Cloud -> VPS. And now I can choose SSD or Cloud with different prices. Why is the SSD so much cheaper? $3.50 vs $9 and they're both 1 core, 2GB of RAM, 100Mbps network link KVM boxes. Then I wonder if these are the same things as the RunAbove labs vs regular. The labs ones shared the processor cores, but this seems to indicate that both don't have the noisy neighbor problem. So I check RunAbove. Wow, everything has changed. Looks like they don't offer the SSD of Ceph instances anymore, but they have SATA backed instances. So, they're running all sorts of different combinations. And should I be looking into Kimsufi or SYS brands? Do they still exist? What if I want object storage. Ok, the US site takes me to RunAbove which tells me that it's now part of OVH proper which brings me to their UK site with apparently no way of loading it on the American site. Compare that to DigitalOcean where you just get a very simple, "here are the plans, there's no complex stuff with weird names or categories, buy what you need." Even Vultr manages simple with SSD VPS, SATA VPS, and Dedicated Cloud. Perfect. Most likely I want the SSD VPS, but maybe I need more storage or maybe I want metal servers sold to me like cloud servers. Easy.
And to be fair, OVH used to be a lot more complicated and a lot worse. It looks like they're streamlining a ton. But they should still simplify a lot more.
Third, OVH is terrible at marketing. I want to define what I mean by marketing. DigitalOcean is a king of marketing. You go to their site and you see brief comments from the creator of jQuery, the creator of RailsCasts, the creator of Redis, and a Rails core member. You might not use those technologies or even like them, but you recognise that DigitalOcean can't be total crap given that these are people with options and a reasonable amount of taste. DigitalOcean sponsors hackathons like woah. Giving students a dozen or so dollars in credit makes them well-known and an easy service to try. DigitalOcean's site inspires confidence in its simplicity. You don't feel like there's some hidden thing because it's just simple plans that increase rather linearly. Finally, try searching for VPS + some tech term. "VPS Ansible" has a DigitalOcean blog article as #3. "VPS elasticsearch" has DO with the top two spots. The point is that you see that and it's an indication that they're part of the community (supporting some free content) and kinda get it.
OVH, on the other hand, inspires none of those good feelings. OVH has a generic site that you can't tell apart from other generic sites. It has the kind of "throw everything at the user and see what sticks" design that I don't think users want. We want DigitalOcean to say "this! this is good!". OVH is like, we have a lot of different things and someone has written "enterprise" or "cloud" on some of them without really indicating how some options are more "enterprise" or "cloud". And there are stock images of network switches and RAM and such like a pizza place that has a stock picture of a pizza on their take-away menu that isn't their pizza. Do they get it?
I really wish OVH well. More providers means downward pressure on pricing which is good for me. I mean, 2GB of RAM VPS for $3.50? Awesome! Glad to see that graduate from RunAbove. But OVH still has a ways to go. Lots of the time you have to wait for servers. If I want a dedicated SSD box, they're quoting a 10 day wait for all except one model. The entire "hosting" range has quotes of 3-12+ days. "Enterprise" has one box for 120 second provision, two that are 3 days out, and two that are 10 days out. It seems like OVH is a place to get a good deal if you're willing to deal with a complicated process, waiting for a box, and them switching things up on you. But maybe OVH is stabilizing. I'm hoping their VPS offering will be a lot more stable than it has been. Seems like they're cutting down on using alternative brands like SYS and Kimsufi.
I can see OVH being a good company, but it's no surprise to me that they aren't as well known as AWS.
Agree with everything above. Especially the complexity.
Runabove, OVH, VPS, all of those. The worst part is they keep posting about their streamlining and the deep thought behind their reorganization, and yet nobody understands their "why."
To many it seems more like reorganization for the sake of reorganization.
Then there is their network, which can range from very good to VERY VERY bad.
And the final thing? Their lack of support or communication. You fire off an email or a ticket -- at least other hosts give you a reply. OVH? None.
And when you add their unresponsive support, the inability to talk to sales or anyone for inquiries, and their overall complexity, it is not hard to see why they don't pick up as many customers as they should.
Their current "10 days" delivery time is quite unfortunate, but I believe that's explained from the fact that they just started upgrading all their machines to DDR4 (and unfortunately, as I've been hit with it, a price increase for dedicated servers still on DDR3).
In the past I have used their 120 seconds delivery time extensively, but it has a few problems: 1. You have to verify your account first. 2. That's only guaranteed for a single server; try ordering 20+ of their top-of-the-line servers and it'll take a few days to get all of them.
Their control panel is also very confusing and feels very sluggish, and I'm talking about the one they released quite recently.
They also have tons of country specific domains (perhaps for tax reasons?).
They're pretty great on hardware, but I've experienced a bit of downtime with them; as long as you have enough redundancy, you should be fine. I'm hosting game servers there, with an automatic fallback to some cloud providers if those go down, so I'm not too worried about that. Still, seeing some servers randomly lose connectivity like this: http://i.imgur.com/9uMOHnH.png doesn't inspire confidence.
The biggest shortcoming I see compared to the other big players (AWS, Azure, Google), and it is something they don't mention, is that they only have one datacenter, compared to the several from the other big players. The pricing is quite incredible though. I suspect if enough people hop on board with this they will probably look into setting up another datacenter.
Disclaimer: I work at Backblaze. We do mention that we only have one datacenter! We're very transparent, we also tell you that it is 17+3 Reed-Solomon error correction across 20 separate machines in 20 separate locations inside that one datacenter.
We are already looking for another datacenter, but mostly because we're running out of space in the current one due to our traditional business (online backup) doing so well.
Something to note: Unless you're storing data in us-east-1, all other regions in AWS are "one datacenter". Yes, they have AZs, but those aren't datacenters, they're just compartmentalized segments of the same datacenter.
So! If you can tolerate the loss of a datacenter, store in Blackblaze. If you need geo-redundancy until Backblaze can offer it? Store in us-east-1 (which is geo-redundant between Virginia and Oregon).
Do you have a source that all other regions are "one datacenter"?
All I could find on the S3 FAQ says "your objects are redundantly stored on multiple devices across multiple facilities." which seems to contradict the "one datacenter" claim.
Also, do you have a source that us-east-1 is geo-redundant between Virginia and Oregon? That was not my understanding of how it worked.
AWS considers multiple facilities to be separate AZs in the same region. If you want multi region durability (besides us-east-1), you need cross region replication enabled (from the same FAQ you read).
"You specify a region when you create your Amazon S3 bucket. Within that region, your objects are redundantly stored on multiple devices across multiple facilities. Please refer to Regional Products and Services for details of Amazon S3 service availability by region"
Note, "within that region". Separate AZs, same geographic location.
"CRR is an Amazon S3 feature that automatically replicates data across AWS regions. With CRR, every object uploaded to an S3 bucket is automatically replicated to a destination bucket in a different AWS region that you choose. You can use CRR to provide lower-latency data access in different geographic regions. CRR can also help if you have a compliance requirement to store copies of data hundreds of miles apart."
This post http://shlomoswidler.com/2009/12/read-after-write-consistenc... has a quote from Jeff Barr at AWS indicating that us-east-1 is bicoastal, which is also why its eventually consistent, instead of immediately after a write (EDIT: it appears this constraint no longer applies to the US standard region).
I asked for sources about your "one datacenter" claim. Just because several facilities are in the same geographic region does not mean they are the same datacenter.
Just because something is bicoastal does not mean your data is replicated on both coasts. It could also mean that your data is stored on either the west or the east coast.
I would have trouble believing they store twice the data as their other regions but charge the same (actually a bit less!).
"To solve latency, Amazon built Availability Zones on groups of tightly coupled data centres. Each data centre in a Zone is less than 25 microseconds away from its sibling and packs 102Tbps of networking."
25 microseconds at the speed of light (best case, through a vacuum; through fiber is significantly slower) is ~4.7 miles, and based on the quote, that is the furthest they are apart. If your buildings are within 1-2 miles of each other, they're essentially the same facility.
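The arithmetic, for anyone who wants to check it:

```python
# Distance light covers in 25 microseconds, in a vacuum and (roughly) in fiber.
C_VACUUM_KM_S = 299_792
C_FIBER_KM_S = C_VACUUM_KM_S * 2 / 3   # common ~2/3 c rule of thumb for fiber

for label, c in (("vacuum", C_VACUUM_KM_S), ("fiber", C_FIBER_KM_S)):
    km = c * 25e-6
    print(f"{label}: {km:.1f} km ≈ {km * 0.621:.1f} miles")
# vacuum: ~7.5 km ≈ 4.7 miles; fiber: ~5.0 km ≈ 3.1 miles
```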
Sure, it's not geographically redundant, but nobody in this thread claimed it was. DinkyG disputed your "one datacenter" claim, and it does appear to be false.
Given that they created the account to comment in the DynamoDB thread, I'd guess they're a DynamoDB developer, but that doesn't invalidate anything they've said in this thread -- they even provided a 3rd party source.
> If you need geo-redundancy until Backblaze can offer it? Store in us-east-1
Or store it in Amazon AND store another copy in Backblaze. This isn't necessarily an "either/or" question. Having two copies with two different vendors in two separate regions is probably more reliable than having two copies inside the same vendor. For example, if Amazon has a large outage that affects both your regions, you can still access the copy in Backblaze.
If you're going to pick two providers, use Backblaze and Google. Google's Nearline Storage is still more reliable (AWS only offers a 98% SLA on a monthly basis for the S3 IA storage class) and cheaper (if I recall properly) than AWS's Infrequent Access offering.
This is incorrect. If you look for news articles about Amazon constructing data centers or buying facilities you'll notice that they have multiple data center facilities in each region.
Fair warning: Backblaze has a habit of making major changes silently (without any indication to the user), and their customer support is TERRIBLE. I was a customer for several years and never had a positive interaction with their support staff. The final straw was losing several files last year after they changed their backup method without notifying users, a method that contradicted their documentation. I got full IDGAF treatment from their support. That's right - a backup solution failed to backup, then support basically shrugged and said "your loss, too bad, so sad" when I contacted them even though they admitted it was their fault.
They started picking and choosing which parts of AppData they backed up, but never informed their users and kept on claiming they backed up "everything." See, the backup app used to be set up so you could see what was being excluded, and modify it if need be. I had the entirety of AppData selected as "for backup," then had several bits of saved user data wiped out. One of those was an extension that stored its data in the browser user profile, something I used frequently and had restored a half-dozen times from backup. Then it was gone. Backblaze stopped backing it up, deleted it from their servers, and I had no way to restore it. Three years of data lost and zero effort from Backblaze to recover it, not even an apology.
The technology isn't bad, but their customer service is some of the worst I've ever seen. I was a Backblaze customer for three years and not once did I have what I'd consider a positive experience. If anything goes wrong they leave you hanging. They're not a company I'd ever trust with valuable data again.
Reading the API, it seems that I need to precalculate a SHA-1 before uploading? This makes it impossible to stream data to B2 from another source; I'll need to store it first and then send it to B2.
Right now your only option would be to "buffer" chunks of, say, 1 MByte in RAM, calculate the SHA-1 of each chunk, then store them as separate files in Backblaze B2.
We do plan to add file offset access and larger file support very soon, so you would be able to append a 1 MByte chunk to an existing file in Backblaze with a SHA-1 of only the 1 MByte chunk. That should allow you to stream?
All great feedback, by the way. We really want to hear about these shortcomings in our API right away.
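In case it helps anyone who wants to try that buffer-a-chunk, hash-it, upload-it-as-its-own-file workaround before the append API lands, here's a rough sketch in Python. The endpoint/header names are my guesses at the documented two-step flow (get an upload URL, then POST the bytes), so treat them as placeholders and check the real docs:

    import hashlib
    import requests

    CHUNK_SIZE = 1024 * 1024  # 1 MByte, per the suggestion above

    # Stream an unbounded source into B2 as numbered 1 MB files.
    # `get_upload_url` is assumed to be a callable that performs the first API
    # call and returns (upload_url, upload_auth_token); the header names below
    # are guesses at the documented flow -- verify against the docs before use.
    def stream_to_b2(source, base_name, get_upload_url):
        part = 0
        while True:
            chunk = source.read(CHUNK_SIZE)
            if not chunk:
                break
            sha1 = hashlib.sha1(chunk).hexdigest()  # hash only what's in RAM
            upload_url, upload_token = get_upload_url()
            requests.post(
                upload_url,
                data=chunk,
                headers={
                    "Authorization": upload_token,
                    "X-Bz-File-Name": f"{base_name}.part{part:06d}",
                    "X-Bz-Content-Sha1": sha1,  # required before the body goes up
                    "Content-Type": "b2/x-auto",
                },
            ).raise_for_status()
            part += 1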
Having to know the SHA1 in advance would be a show stopper for rclone ( http://rclone.org ) as it uses a streaming model internally (it can stream between cloud providers).
Being able to append to a file in 1 MByte chunks (or larger) would be perfect - that is exactly the way Amazon S3 multipart uploads and Google Drive multipart uploads work.
I'm planning on storing files as encrypted X megabyte chunks (where X is TBD) - would calculating the SHA-1 per chunk and then uploading each chunk solve it? There's metadata support which could store original filenames etc.
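For what it's worth, the per-chunk scheme should compose fine with encryption: encrypt a chunk, SHA-1 the ciphertext (that's what B2 would store and verify), upload, repeat, so only one chunk ever sits in memory. A rough sketch, where `encrypt` stands in for whatever cipher wrapper you pick:

    import hashlib

    CHUNK_SIZE = 8 * 1024 * 1024  # "X megabytes" -- whatever X ends up being

    # Yield (name_suffix, ciphertext, sha1_hex) for each chunk of `source`.
    # `encrypt` is a placeholder for any bytes -> bytes cipher wrapper.
    def encrypted_chunks(source, encrypt):
        index = 0
        while True:
            plaintext = source.read(CHUNK_SIZE)
            if not plaintext:
                break
            ciphertext = encrypt(plaintext)
            # Hash the ciphertext: that is what gets stored and verified.
            yield f".chunk{index:06d}", ciphertext, hashlib.sha1(ciphertext).hexdigest()
            index += 1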
Finally some more reasonable prices in this space.
Eventually it could make sense for Backblaze to partner with someone like DigitalOcean or Linode and offer low cost bulk storage and low cost virtualization colocated in the same datacenter: these services seem to be a perfect complement for each other.
Brian from Backblaze here. Yeah, our B2 storage may not be a good solution for an application that has to do a lot of analysis on the data over and over again. In Amazon S3 you don't pay for transfers between EC2 and S3, so computing on your data is only as expensive as buying the EC2 time. Since Backblaze doesn't yet offer the EC2 functionality you would need to download your data to analyze it.
What I'd really like is a deal with Amazon where we put a "virtual cross connect" from the Backblaze datacenter into Amazon's EC2 so you could use EC2 instances on B2 data without incurring a download charge (or not exposing that charge to our customers). But I don't know if Amazon is open to that kind of thing.
I would rate the likelihood of AWS offering free, multi-path, multi-Tbps connectivity into their AZs to a competitor to be effectively nil.
If the alternative cloud ecosystem wants to compete effectively against AWS, it desperately needs a more sophisticated authorization scheme. Don't forget that IAM/STS is a major enabling factor in integrating applications with EC2 and S3.
Yup. That is the deal breaker for us. Our infrastructure exists on AWS, so 100% of the data transfer is free (within the same region). While your per-GB storage is indeed less, the transfer costs would ultimately make it too expensive.
A real deal breaker is if you need to use an EC2 server to proxy the upload for any reason (e.g. content validation). The transfer into EC2 is free, but it's 9 cents for each GB out (the equivalent of 18 months of storage cost).
To elaborate: I think these two together could become a viable competitor to AWS. If you think about it, AWS launched with S3 first and then EC2.
They could differentiate themselves by staying as a pure IaaS play. Then companies like Dropbox would not be afraid of DigitalBlazeOcean moving up the stack and competing as AWS has done in several instances (e.g. WorkDocs).
The price is being heralded as the thing to get excited about, but I'm also very concerned about the security and availability of my data. I suppose those details are somewhere, but they're not in this article, which is quite lengthy.
I would like to know more about the implementation and about the policies that protect access to my data. And I would like to know where the data is stored. I suppose I could read the manual, but maybe some info tidbits could be included in the announcement.
You should allow periods (.) in your signup form email field. I know you can just remove the period and the email will still get delivered to my gmail, but plenty of people likely don't know this.
One feature that I'd love to see: the ability to update part of a file (e.g. replace bytes 512-1023 with new content of the same length) and/or append to an existing file! For whatever reason, these cloud storage products are always implemented as either block-based (so you can replace parts of the file) or as file-based (so you can create a hierarchy of files with names and metadata). Why can't I have my cake and eat it??
:-) We definitely plan to add an API to append to an existing file. The current largest file size is 5 GBytes, and we want to support much larger (imagine a 1 TByte encrypted disk image). That will be by appending chunks to files followed by a "commit" declaring the file as complete.
I think the reason most of us cloud providers don't like replacing parts of files is that it keeps our caching layer much simpler, and a partial replace would change the SHA-1 checksum of the whole file, which just means "more complexity". But it isn't out of the question; it might just come with a "cost" (like you can replace the span, but it might take a while and then we provide you the final checksum of the whole file in the response).
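To make the append-then-commit plan concrete, here's a purely illustrative sketch of what a client loop might look like. None of these calls exist yet; `start_file`, `append_chunk`, and `commit_file` are placeholders for the planned API described above:

    import hashlib

    def upload_large_file(api, bucket, name, source, chunk_size=1024 * 1024):
        # `api` is a hypothetical client for the planned append/commit API.
        file_id = api.start_file(bucket, name)
        while True:
            chunk = source.read(chunk_size)
            if not chunk:
                break
            # Per the plan above: each append carries the SHA-1 of just that chunk.
            api.append_chunk(file_id, chunk, hashlib.sha1(chunk).hexdigest())
        # The "commit" declares the file complete.
        return api.commit_file(file_id)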
I couldn't tell from the docs, but is it possible (or will it be possible) to generate a single use download URL to return to clients? I wouldn't want to have to pay double for outbound traffic going from B2 -> my server -> client.
Yev from Backblaze here -> Yes, you could get a URL and give it to somebody. Every time that person accesses it though, there'd be a charge (unless you're within our free parameters). If I understand correctly...
Yev at Backblaze -> So pleased to hear you say that! I'll forward it to the engineering team. They have LITERALLY been working nonstop on this for a year and deserve some praise! Thank you for signing up, I'm excited for you to start using it!
It is only on https://www.backblaze.com/b2/why-b2.html that I can find the following: "the B2 Cloud Storage service has layers of redundancy to ensure data is durable and available". What that exactly means or translates to is nowhere to be found. If you want corporations or developers to use your storage services for their precious data, I'd be a bit more specific.
Brian from Backblaze here: We're really transparent about our redundancy. We use 17+3 Reed Solomon across 20 computers in 20 different locations in our datacenter. You can read about it here: https://www.backblaze.com/blog/vault-cloud-storage-architect...
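For anyone not familiar with the notation: 17+3 means each file is split into 17 data shards plus 3 parity shards, so any 17 of the 20 machines are enough to reconstruct it. The implied numbers, as a quick sketch:

    data_shards, parity_shards = 17, 3
    total = data_shards + parity_shards        # 20 machines, one shard each

    storage_overhead = total / data_shards     # ~1.18x the raw bytes
    print(f"storage overhead: {storage_overhead:.2f}x")
    print(f"survives the loss of any {parity_shards} of {total} shards")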
I have read plenty of posts from Backblaze in the past including the linked post, but I admit I also wanted to see details about the replication factor on the marketing site for B2.
Compare that to S3: while Amazon claims 11 nines of durability, they only commit to 99% uptime on a monthly basis (98% for their new S3 Infrequent Access offering).
brianwski shares a good link below... but you make a good point, jrnkntl, that we should talk more about it on the site. We'll plan to add more content around that on the Why B2 page. Thanks!
Please go into detail about how failures are detected and handled – e.g. how often is the archive scrubbed, will bitrot be detected on access, etc. Those details are really important for comparing services.
This is a very important point. A lot of these products gloss over those "details". Because the numbers might not look good.
E.g. on B2 if you wanted to retrieve data to do your own scrub/validation it would cost you the equivalent of 10 months of storage just to do one retrieval: $0.005/GB/month to store, $0.05/GB to download.
Google Cloud Storage Nearline has the same problem: $0.01/GB to store, $0.12/GB for egress. But at least in this case you can egress for free to Compute Engine, so you would only need to pay $0.01/GB for retrieval.
So it's not possible (at reasonable cost) to do your own validation of what's stored in B2. In Google's case, as long as you're willing to use their cloud computers, validating your data once a month doubles your cost.
In conclusion, you're trusting the vendors to handle failures; it's very expensive to check your data yourself.
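Sanity-checking that with the prices quoted above (sketch only; prices as stated in this thread, and they may change):

    # $/GB figures as quoted in the comments above
    b2_storage_month, b2_download = 0.005, 0.05
    nearline_storage_month, nearline_retrieval = 0.01, 0.01  # egress to GCE assumed free

    # Cost of one full read, expressed in months of storage for the same data:
    print(b2_download / b2_storage_month)               # 10.0 -> one scrub = 10 months of storage
    print(nearline_retrieval / nearline_storage_month)  #  1.0 -> a monthly scrub doubles the bill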
I'm definitely asking because those are questions to ask every vendor. Don't forget, also, that this space includes things like running your own services using Swift, Ceph, etc., so it's certainly possible to answer those questions definitively for at least some other options.
Brian from Backblaze here. We can't find them either. We buy either 4 TB or 8 TB right now, and we're experimenting with 10 TB drives. We don't know why the 6 TB drives appeared and then disappeared...
Are those 8TB and 10TB the helium ones with HAMR? Very slow.
My understanding of HAMR is that it is probably perfectly fine for Backblaze's backup products, which are (more or less) write-once, read-rarely. Shingled magnetic recording should also be OK for that use case.
Just yesterday I was discussing with a friend that I'd like to use Backblaze as backend for our zero-knowledge backup at https://cloudfleet.io and that I'll ask them for a metered pricing model when we're big enough.
Is anyone aware of an enterprise(ish?) product that does encrypted off-site backups from a Windows environment to something like Glacier or B2? We're 75/25 Windows/Nix, and our backups (VMWare, NetApp) are all managed on the Windows side. We'd like to just fling weeklies at B2.
AltaVault is a backup target, not the backup product the OP was likely asking for. But many companies use products like Arq, BackupExec, or NetBackup in combination with AltaVault to achieve what the OP described.
Disclaimer: I worked on the predecessor to AltaVault at Riverbed
Is this going to work for static asset hosting, such as for low-cost FE assets? Can it be set to serve an index file from the root directory? Does it provide a public URL to your (stealing S3 terms) bucket?
Just curious as to why I would migrate from S3 for FE assets to Backblaze.
Yes! RIGHT NOW (during beta) we don't have the top-level folder feature yet; the URLs all come from https://f001.backblaze.com style URLs. But this is one of my favorite uses (I have a private website I plan to move over) and we'll definitely be adding this feature soon.
Are there any plans to add application hosting along with this offering? Specifically, it could be useful to have a shim application that has direct access to the data without traversing the Internet, to minimize what needs to be transferred. For example, if used for a backup application, each day's incremental delta may be in one archive, but a periodic operation would be to move files from one archive to another. Or a full system restore may be pulling some data out of multiple archives, and a "shim" app (running within Backblaze's data center) would eliminate unnecessary transfers out.
Brian from Backblaze here. We'll add a LITTLE bit of app hosting support around this, but you won't be seeing a full-blown EC2 type of product out of Backblaze for a while; realistically, we don't have a large enough team to charge down that path and still do a great job on B2 and on our traditional online backup product, which we still maintain.
What I'd really like in the short term is to do a deal with Amazon where we put a "virtual cross connect" from the Backblaze datacenter into Amazon's EC2 so you could use EC2 instances on B2 data without incurring a download charge (or not exposing that charge to our customers). But I don't know if Amazon is open to that kind of thing.
What provider does Backblaze use for SMS verification? I'm using a Twilio number and it seems that the two disagree with each other. (It might be due to short-code use, but I'm not sure what their outbound number is.)
Thanks! Twilio isn't able to handle incoming SMS from short codes at this point, so that would explain the issue. I'll use another method to verify for now, until they do.
This is a really cool service, I can already think of a few cool uses for this. I once participated in a project (movie to gif library), and one of the biggest constraints was the cost of cloud storage - hopefully more competition in this space can drive prices down to where they become even more negligible.
Please tell me you're working on a PowerShell module. Microsoft is pushing PowerShell HARD (because it's awesome) and admins that do use it know its power and want ALL their vendors to use it.
You can open yourself up to a large number of customers by making it easy to get started via PowerShell.
Congratulations on the launch! And I am also happy to announce that the official command-line tool, b2, has already been ported to FreeBSD: http://www.freshports.org/devel/b2/
This is awesome! I've been waiting for this for a while.
I saw the comment about getting drives shipped to you, which is pretty neat, but what about the other way? I have about 50 TB of data we'd like to store, but only 5 Mbps upstream. Can we ship drives to you?
We call that feature "drive seeding" and it is requested often. Realistically I don't think we'll get to implement it in the next few months, but maybe in the 6 month timeframe?
> I ask because I tried Backblaze a while back, and uploads from the UK were very slow.
Curiously from here in Japan, I've managed to clock 80 MBit/s backing up to Backblaze. I presume it all has to do with what kind of international peering your ISP has.
Brian from Backblaze here. It's pretty simple for now. You can have as many people reading and writing to the same bucket, but they would all need to share the same credentials. Buckets are either "allPublic" or "allPrivate" (and you can flip them back and forth at any time between those two settings). But that's it at this point.
We're actively looking for feedback in this area, so as developers ask us for something like Amazon's IAM (AWS Identity and Access Management) we'll be filling that functionality out. Hopefully without adding too much complexity to the simple model we have now.
Personally I'd like to use some access management, and there's one case that I've not seen solved particularly well (though would appreciate anyone chiming in with things I've missed):
Distinct write and create permissions.
I'd like to be able to grant someone permission to create files but not allow them to modify or delete them later. I end up generally adding this externally.
I think B2 is really close to this, as you've got the file ids for multiple versions, so I can effectively ignore the filenames and use the file ids instead. It'd need a difference between "upload new version" and "delete version" though.
It's cheap for blob storage. It's terribly expensive compared to hosting yourself.
So yeah, I'd agree with you. But for anyone prepared to use S3 for anything but cold storage, this is still a lot cheaper.
My suggestion would be to use this for cold storage + big cache boxes at a provider with low bandwidth charges. Especially if your "hot" objects make up a relatively small percentage.
Backblaze is the same cost as, or cheaper than, the cheapest tier of several other popular services like Amazon S3 and Microsoft Azure. We don't know of anybody with a lower cost for downloads.