Wow, that's a brutal under-the-bus-tossing. But well deserved. Most of the points are spot-on.
I've operated a fleet of several dozen machines on EC2 for 2+ years, and I can tell you that the number of times boxes have gone down never to come back (which most people seem to think happens regularly on EC2) is incredibly small.
We actually run a Hadoop cluster (including DFS) on spot instances. We never lose the spot bid. We pay way less than the going rate for the compute time. Less than the reserved instance rate. It's awesome. Yes, you obviously need a plan to deal with your cluster vanishing in the blink of an eye. It's not too hard.
I would second another commenter's caution on EBS though. We never put it into production. Personally I never experienced an ephemeral drive failure that had repercussions - when they (rarely) occurred, the drive was RAIDed or in our Hadoop cluster (i.e. redundant). We made two attempts at using EBS with our DBs, and both times, literally within 24 hours, we experienced a catastrophic failure of the EBS volume, one of them unrecoverable. So that put me pretty well off EBS.
I can't say that the performance is necessarily the best, and we do experience the occasional odd asymmetric inter-machine latency (e.g. 300ms to establish a TCP connection in one direction, but a normal <1ms in the other), but for the most part AWS is just awesome.
A spot instance should stay around until the auction price is higher than what you bid. That price has been quite stable recently, so as long as you bid slightly higher than that rate you'll probably have your instance for months.
I actually did a market analysis of the EC2 spot price (they give you a year of price history in JSON format). The spot price was under the reserved price for all but a few hours since it came out. The savings in general were around 30-50% (depending on instance size), with higher savings coming from using spot market pricing in the non-VA markets (CA, EU, Asia), where prices were higher to start with.
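If you want to repeat the exercise, the gist of the analysis is only a few lines. A rough sketch in Python (the JSON field names and the reserved rate here are placeholders; the actual price-history export may be shaped differently, so adapt accordingly):

    import json

    RESERVED_HOURLY = 0.227  # placeholder: effective hourly cost of a reserved instance

    # Assumed shape: a list of {"timestamp": ..., "price": ...} records
    # covering a year of spot prices for one instance type in one region.
    with open("price_history.json") as f:
        history = json.load(f)

    prices = [float(rec["price"]) for rec in history]
    hours_over = sum(1 for p in prices if p > RESERVED_HOURLY)
    avg_spot = sum(prices) / len(prices)

    print("hours above the reserved rate: %d of %d" % (hours_over, len(prices)))
    print("average spot price: $%.4f/hr" % avg_spot)
    print("average savings vs reserved: %.0f%%" % (100 * (1 - avg_spot / RESERVED_HOURLY)))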
Why don't people just bid the reserve price on the spot market, and never lose the bid? You'll still pay less...
Why don't people just bid the reserve price on the spot market, and never lose the bid?
Because you will lose the bid if a disaster occurs. A lot of reserved instances have been purchased for disaster-recovery purposes but aren't actively being used; those are instead being sold as spot instances. If a disaster occurs, the supply of spot instances will drop dramatically.
That's better than I was expecting. I'm surprised more people don't take advantage of this (or maybe they do, I don't know). Use S3 for permanent storage, ephemeral disk and RAM for cache, and pay spot rates. Sounds pretty good if one's design allows for it.
They only vanished once (the entire bid and therefore the entire cluster), and I think it was actually due to a glitch; the price never actually went over our bid. That's the only time we ever lost our spot instances. I think we actually bid above even the normal (non-reserved) price, because I've (almost) never seen prices cross the reserve price, and even if we had to pay full price for a day, that's fine -- it's still a tremendous savings in the long run.
My honest question is why there is this odd loyalty to virtual environments in this community. I realize that it may be boring, but you guys are passing up insane savings that can be had by using colocation. All cloud providers are very expensive when you actually do the math and you need more than 10 servers.
Our example may be a bit extreme, but we are just building out a new datacenter at a colocation facility and will recover the entire up-front investment (about 150k; we have the cash to not need leasing) in a bit over half a year.
Dedicated (rented) offers similar performance improvements and cost reductions without the extreme upfront cost (obviously costing more over its lifetime as a result). But these days it really seems like a blind spot. Maybe it's just that I never completely jumped on the cloud bandwagon (or was running servers far before it got rolling), but I really haven't found a generic use case for cloud hosting. Planning infrastructure upgrades isn't rocket science; you'll have more than 10 minutes' notice to get a new machine. There are certainly many specific use cases for cloud hosting, but as a generic hosting tier I find it incredibly overused—and companies waste money and are forced to deal with unreliable performance (EBS i/o anyone?!) as a result.
GoGrid lets you mix and match dedicated and cloud servers, which offers the performance and value of a baseline of dedicated, and the flexibility of the cloud. I find it to be unparalleled.
Softlayer (my preferred dedicated host for years) does this as well. I agree that it's currently impossible to beat the combination of bare metal performance and cost efficiency with the ability to spin up cloud servers within one's own VLAN.
I do colocation as well, but if you don't have someone there to babysit your server, it can sometimes be nightmarish. It's nice not to have to worry about a disk failing and needing someone to swap out one of your RAID spares, while hoping he doesn't pull the wrong drive bay.
With colocation companies, everyone is sending their own custom built servers to the data center so nobody knows the exact configuration per server. If a drive fails, or a memory chip goes bad, you need someone there that knows the layout of your hardware config to debug the issue. I have a bunch of servers at RippleWeb but I'm sure the guy there has to deal with 1U servers from Dell, Tyan, Supermicro, and all kinds of other manufacturers.
Agreed. We do colocation for about 50 servers. Up front costs higher, long term costs MUCH lower - but man can it be a pain in the ass sometimes when there's a problem. :)
Here's one from a friend of mine: a hard drive just failed in the middle of a RAID array. An OCD worker there decides that instead of just swapping out the dead drive, he's going to move all the drives up one slot and put the new drive at the bottom of the array. So instead of having 1 drive failure, the disk controller now thinks there are 3 drive failures, which ultimately leads to data loss and his removal from employment.
There's a big difference between the cost to install and operate and the cost to scale. If your hardware utilization characteristics are easy to calibrate for and capacity requirements are predictably steady, you can certainly optimize your own racks at a colo. However, hardware changes quickly, businesses typically change even quicker and capacity requirements are often highly variable. Unless the above conditions are met, it's insanely risky to go into a contract for a colo. Let us know how that one year investment recovery works out for you.
We also use colocation. I'm not a developer on the team that uses the infrastructure but we use OnRamp in Austin, TX (http://www.onr.com/) and I have rarely heard of any issues. Most delays are from international CDN requests which I believe we have also outsourced to speed it up.
I don't think you can compare everyone's case equally though. In my case it would be a lot more expensive to go with colocation, even ignoring the upfront costs.
Right now I pay $4K a month for ~93GB of RAM available per hour. Averaging 2GB per server for web, and a bit more for caches and application servers, we can fit a lot of servers in our 93GB.
Could I get a couple of beefy colocated servers, set up Xen, and run my own cloud? Sure, but then I have to pay someone to run that cloud and I have to manage and worry about the hardware, and those costs are a lot more for me than a small monthly markup.
There are a few things I don't host in the cloud (master DB and primary load balancers), but even for those I pay a bit more to have the hardware managed.
CPU is a major bottleneck for Rackspace Cloud. All instance sizes get the same 4 cores and about the same compute resources. CPU performance is roughly the same on a 1GB cloud server as on an 8GB cloud server; you are just paying for more memory. Rackspace also uses ONLY Opteron 2374 2.20 GHz processors. EC2, on the other hand, offers linear CPU performance improvement on larger instances. EC2 also uses heterogeneous hardware, ranging from the Opteron 2218 or Xeon E5430 for m1 instances, the Xeon E5410 for c1 instances, and the Xeon X5550 for m2 instances, to the Xeon X5570 (hyper-threaded to 16 cores) for the cluster compute instances. EBS on the cluster instance is also much faster than local disk IO in the Rackspace Cloud based on testing I've done (due to the non-blocking 10G network). Here are a couple of references for this:
I've also seen highly variable IO performance (i.e., slow/stuck) on EBS volumes. Sometimes all reads/writes block for several seconds or even minutes, with iowait going to 100% on an m1.small.
However, I've decided that the other benefits of EBS volumes are just too important to give up on (snapshotting, lazy-load, re-attachment). Instead, I plan to monitor for such situations and blow away the node when I detect it.
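Concretely, the monitor doesn't need to be anything fancier than sampling iowait and flagging the box when it stays pegged. A rough sketch (the thresholds are made up, and the actual replace-the-node step is left out):

    import time

    def cpu_times():
        # First line of /proc/stat: cpu user nice system idle iowait irq softirq ...
        with open("/proc/stat") as f:
            fields = f.readline().split()[1:]
        return [int(x) for x in fields]

    IOWAIT_THRESHOLD = 0.95   # fraction of time stuck in iowait (made up)
    BAD_SAMPLES_NEEDED = 6    # roughly a minute of consecutive bad samples (made up)

    bad = 0
    prev = cpu_times()
    while True:
        time.sleep(10)
        cur = cpu_times()
        deltas = [c - p for c, p in zip(cur, prev)]
        prev = cur
        total = sum(deltas) or 1
        iowait_frac = deltas[4] / float(total)   # index 4 is iowait
        bad = bad + 1 if iowait_frac > IOWAIT_THRESHOLD else 0
        if bad >= BAD_SAMPLES_NEEDED:
            print("EBS volume looks wedged; time to replace this node")
            break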
But really, I don't understand why Amazon doesn't fix this. It's happened to me 3-4 times in a relatively small installation. Surely their monitoring can detect this?
That is likely not intended, so I guess it's probably inherent to how Xen and/or their I/O layer do resource allocation.
I, too, have seen these freezes occasionally on small instances. If you think that is harsh, then try one of the new micros. ;-)
Anyways, I haven't had such a freeze last longer than a minute yet, and the instance would always recover. If yours does not recover then that would clearly be a bug.
Is it because of the network throughput on the small instance type? I haven't noticed anything nearly that severe with our 20-instance setup of larger instance sizes, but then again we've only run production on EC2 for a few months now...
As a heavy user of EBS, I heartily concur. The ability to move EBS volumes around is handy, but you pay for it with inconsistent, poor-to-mediocre performance. Good luck.
EBS definitely has performance limitations, however creating striped RAID arrays of EBS volumes significantly helps with performance.
This blog post walks through a bunch of things you can do to make the best out of the flexibility you get with EBS, and includes scheduler tuning, RAID block size planning, etc.
My team runs a large AWS/EBS backed implementation and, while we've seen some of the EBS performance issues, we've certainly engineered past them to get us where we need to be.
Wouldn't creating striped RAID arrays of EBS volumes add to the cost and remove the ability to do snapshots? How do you get a consistent snapshot of multiple striped EBS volumes?
Also keep in mind that creating a RAID0 of EBS volumes dramatically reduces your reliability too. If any one of the underlying EBS volumes misbehaves the entire RAID set has problems.
Snapshots are still doable, they just require coordination of all the underlying devices to make sure they are all in a consistent state. The way you do this can vary but it's not particularly hard.
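One way to do it, sketched with boto and xfs_freeze (the volume IDs and mount point are placeholders specific to a setup like ours; if you're not on XFS, substitute whatever quiesce mechanism your filesystem has): freeze the filesystem sitting on the stripe, kick off a snapshot of each member volume, then thaw.

    import subprocess
    import boto

    VOLUME_IDS = ["vol-11111111", "vol-22222222", "vol-33333333", "vol-44444444"]
    MOUNT_POINT = "/data"   # XFS filesystem sitting on top of the mdadm stripe

    conn = boto.connect_ec2()

    # Freeze the filesystem so all stripe members are in a consistent state,
    # then start a snapshot of each underlying EBS volume before thawing.
    subprocess.check_call(["xfs_freeze", "-f", MOUNT_POINT])
    try:
        for vol_id in VOLUME_IDS:
            conn.create_snapshot(vol_id, "raid member %s of %s" % (vol_id, MOUNT_POINT))
    finally:
        subprocess.check_call(["xfs_freeze", "-u", MOUNT_POINT])

Snapshots are point-in-time once initiated, so the freeze only needs to last long enough to start them all.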
This is a rather unpleasant review of a service we were about to move over to.
For companies that have moderate performance requirements (e.g. visitors in the range of 30k or 40k per day across a range of web apps and sites), reasonable but by no means expert level server administration skills, the need for a redundant environment to satisfy SLAs with clients (e.g. two app servers + load balancer + master/slave db servers), and the desire to focus mainly on software development instead of server admin, what companies does the HN community recommend?
We've been considering Rackspace Cloud and Linode, but are open to any suggestions. We also have a quote for a standard, managed four-server + hardware load balancer deployment in front of us but it is pricey ($3000+/month).
Softlayer (or AWS) + outsourced server management is a great fit for your scenario. Softlayer has great HW / network, and can provide a good base for a scalable infrastructure. AWS EC2 also sounds like a great fit for your environment.
Regardless, having a great server admin guide this implementation will be a key part of the chosen architecture's success.
I run a server management company, so I have a bit of a bias ;)
I've used Rackspace, Amazon and GoGrid. I'm currently almost entirely on GoGrid and just finishing up migrating back off of Amazon for the second time (I migrated some services back to them when they opened the west coast region).
There have been some stability issues in the past year (instances going down for maintenance regularly), but I haven't had a problem with that in a while and was advised it was due to them upgrading all the servers in their cluster.
The ability to mix dedicated and cloud systems easily has been the killer feature for me. We also prepay and the number of servers we get for the cost is pretty good.
Have you looked at just buying the hardware and colocating it? $400 should get you half a cabinet. Bandwidth can be found for a few dollars a meg. As you're buying redundant servers, you can get refurb/ebay stuff and save on the upfront costs (who cares about next-business-day service replacement if you have spare parts?).
I wrote a thing about the pros and cons of doing that, but instead I thought it would be better to give you some objective numbers.
1 full rack with 15A of 120V power (remember, don't exceed 75% of a circuit's rated capacity, so that's really 11.25A) at he.net in Fremont is going to set you back $400/month, or $500/month with 100Mbps bandwidth (more than you will need for now).
Now, you are talking 5-7 low-power servers there.
You want to buy a metered/switched PDU ($400 or so new, $150 used).
One of my 8-core, 32GiB RAM Opterons with 4 disks is about $2500 in parts, new. I can't imagine anyone charging more than $500 to put the thing together.
Note, unless you have really cheap hardware guys and downtime costs you little, I'd avoid used hardware. I started on used servers, and I no longer buy used hardware that is user-serviceable. (I avoid used servers. But I may buy used switches.)
If you need racker/stackers I can point you at experienced guys who will be happy to work for $50-$60/hr, and people who can get the job done at half that. If you want more advanced hardware troubleshooting, double that for someone really good. But for that kind of money they should be able to set you up with an automated provisioning system. For comparison, I'll work for $100/hr if you are willing to deal with my inconsistent schedule.
Setting up a server should not take more than a few hours, including hooking up the serial console, etc. And hardware problems? Figure on once every two years per server.
A lot depends on the quality of your provisioning setup (something like Cobbler is good) and your serial console/rebooter setup. But it /can/ be a lot cheaper than 'the cloud' in many instances.
Right, so, if they were quoted $3K a month, then within a couple of months of self-hosting they'd have recovered their investment, and then start saving a few K a month.
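To spell out the arithmetic with numbers pulled from this thread (refurb hardware per the suggestion above, the $3K/month quote, the he.net rack figure; all of these are obviously rough, so adjust for your own situation):

    MANAGED_QUOTE_PER_MONTH = 3000.0  # the quote mentioned upthread
    COLO_PER_MONTH = 500.0            # full rack + 100Mbps at he.net, roughly
    SERVERS_NEEDED = 4                # the quoted four-server + load balancer setup
    COST_PER_SERVER = 1500.0          # refurb/ebay pricing; roughly double for new

    upfront = SERVERS_NEEDED * COST_PER_SERVER
    monthly_savings = MANAGED_QUOTE_PER_MONTH - COLO_PER_MONTH
    print("upfront hardware: $%.0f" % upfront)
    print("monthly savings:  $%.0f" % monthly_savings)
    print("break-even after %.1f months" % (upfront / monthly_savings))

With new hardware at ~$3K a box, break-even moves out to roughly five months; either way it's well under a year.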
Your point about onsite hands is something I forgot. In the companies I've had, I've always had offices adjoining the datacenter. (I've been lucky to end up with such great spaces.)
If you don't have onsite hands, then fixing even minor hardware issues can be a major pain. If you do, then cheap used servers are fine, if your software is fault tolerant. If a machine goes down, the others take over in a minute and you have slightly less capacity. No big deal.
I just see these crazy high numbers for hosting, especially "cloud" stuff, and don't get it. Cloud in particular seems to only make sense if you're actually elastic, or scaling up very fast. Amazon's (and similar providers') pricing for "always on" servers does not look that appealing.
Personally, I think the very best use case for "cloud" is backup servers. Think of it, run a 'small' instance to slave your database and other stuff to EBS storage, and once a month practice bringing up production on "the cloud" - cheapest and most complete backup system I can think of.
As for onsite hands, 95% of what you would go to the co-lo for can be done from your comfy chair if you have a good serial console/rebooter setup. Also, the he.net price includes "remote hands" service. Note, don't expect more than "remote hands" and make sure you clearly label everything... but you can usually avoid leaving the office.
Why not outsource your OS management and use shared hosting, Engine Yard or App Engine? Hardware guys willing to do the work Amazon does can be had for very little money, even on a short-term contract basis.
There are situations where different levels of outsourcing are appropriate. And certainly, if your cost per compute node doesn't matter, few will fault you for going with Amazon. But you should be aware that you are paying for that, and in some situations, having a high cost per compute unit will kill you.
This is a fair question, so I don't see why it should be downvoted.
The reason outsourcing hardware to AWS or Rackspace makes sense, while hiring hardware guys (usually) doesn't, is that the latter introduces one more link in your failure chain. In my opinion it is far, far more likely that you and your hardware guy will mess up a ZFS configuration somehow and have it fail when your app gets black-swan traffic than that Amazon goes down. When it comes to Engine Yard, I'm all for that too, in the right circumstances, but you get fairly high lock-in. I agree that you get that with AWS too; it just seems more likely that AWS will be around in 10 years than Engine Yard. But sure, Engine Yard and Heroku are great.
Filesystem user error can be just as deadly on AWS as it can on your own hardware. Either way, you need good backups.
While I agree that Amazon itself is unlikely enough to go down that you can pretty much bet the company on it, any one server at Amazon may go down at any time. Now, Amazon has the huge advantage of being able to give you another server at any time, but you need to be prepared for that. Any data on your local disk would be gone, and you need to be able to re-connect to your EBS store from only one instance. (Many, many of the 'admin error' cases of data loss I've seen stemmed from mounting a network block device read/write on two different servers without a clustered file system.)
My point is just that Amazon doesn't free you from the need for (and mistakes of) *NIX admins (the way Heroku, App Engine and Engine Yard should).
That's a good point - you could argue it doesn't free you from the need of a good admin, and it actually makes it more difficult to hire someone who knows what they are doing; good cloud SA skills are much more rare than good traditional SA skills.
When you can't just drive to the datacenter and toss in some more disks or drop in your new F5 (or hack it together with CARP), but instead have to be familiar with exactly what the restrictions and use cases around each cloud provider's offerings are, the field of candidates just got painfully smaller, and probably more expensive.
Heroku and Engine Yard (for their cloud service) are just front ends to AWS anyway. You're basically paying them for the luxury of not having to deal with AWS. So as long as they don't get undercut by a competitor and all their customers flee, they should be around for a while. That, or their owners get greedy, sell the companies, and the new owners move to "more economical hosting". Nah, that never happens. ;-)
Right. Just like you are paying AWS for the luxury of not having to deal with hardware.
Both can be rational decisions (or not, depending on many factors). I'm just pointing out that the people who deal with racking and stacking (the sort Amazon allows you to fire) are rather a lot less expensive than Linux sysadmins, the sort App Engine, Heroku and Engine Yard allow you to fire.
If I were doing this with a Rails app, I'd use Engine Yard for sure. Engine Yard Cloud makes stuff like load-balanced app servers and master/slave replication really easy (you can do it from the web interface), and they really know what they're doing and have tuned the stack based on that knowledge.
I don't know if there are similar offerings that are reasonably priced yet awesome for other languages though.
I found Engine Yard very inflexible... Sure, you can create custom recipes to get around that, but since they update their system without warning, the custom recipes end up becoming stale...
Yes, but the problem is that they use their own Gentoo-based environment that they update as time goes on...
To do anything custom that they didn't plan for beforehand, you need to create custom Chef recipes that will be executed whenever a new instance is provisioned or when you deploy with their web interface (but not if you deploy with Capistrano).
The problem is that it's hard to make sure that the custom chef recipes will work when the underlying environment changes and you are not told ahead of time of those changes...
+1. In the past 24 months I've used Rackspace, ThePlanet, SoftLayer, Slicehost, Linode and the terrible Mediatemple. I'm doing all future development on Linode. Their basic VPS is pretty quick for the price and it's online almost instantly. It's still just basic Linux, so you can rig everything you want - no awkward managed interfaces to worry about.
I would consider switching to a different company, but for now Linode scratches my itch better than the rest.
Create an account. Open a service ticket. Wait hours for a response. Plus, having no firewall options on your Nitro servers at that price range is ridiculous and way below industry standards.
Another +1. We moved to Linode from Rackspace. They don't have that crazy backup limitation that Rackspace does, and they're always open to custom inquiries. Rackspace has the big-company mentality: if our interface doesn't have an option for it, it doesn't exist.
I like Linode, but they are wearing a little thin for me.
In general, uptimes have been excellent but the physical server hosting one of our VPSs had three emergency reboots in a little over a month before they replaced the hardware.
Perhaps they'd already planned to do a replacement after the second incident, but I had no way of knowing.
Which brings me to my second beef, which is that they don't share much when closing a ticket. Their explanations are pretty perfunctory. In some cases, like transient network issues, I generally can't tell if they've investigated the problem and done something about it, or if the first-tier tech just takes a quick look, sees it's working now, and punts it back to me to let them know if there are more problems.
Similarly, after a major screwup that took down most of the virtual servers they host (including the ones that hosted their own website and web-based management tool), all we got was a brief explanation of what had gone wrong and a status blog hosted on someone else's infrastructure.
What I didn't get was a sense of how they were going to guard against anything similar happening again.
I'm sticking with them for now, but I'm also feeling like the effort I've put into automating the configuration of our application and all its dependencies has been time well spent. I can boot a new infrastructure on another service in an hour or so and migrate to it with minimal downtime.
What are they like for CPU usage? I have been looking at VPS.NET and other such providers to ensure I get a reasonable slice of CPU which I can use for other things.
Most VPS providers I have come across don't like it if you start working out the millionth digit of Pi, but sometimes you have a task that takes a while and that you can't run offsite.
I believe each physical host has at least 8 real cores. They give each instance 4 virtual cores. As long as CPU time is available, an instance can use the full capacity of four cores. If there is contention for CPU time, every instance gets an equal share of CPU, which is also its fair share of CPU, because all the instances on a box are the same size/pricepoint. Further, larger instances are shared with a proportionally smaller number of other instances.
My experience, which seems to be shared by just about anyone who has published benchmarks, is that Linode is appropriately conservative on the degree to which they oversell CPU, because generally, it appears that a task that wants 100% of four cores does in fact get close to 100% of four cores.
The only thing to look out for is that there is quite a difference in single-threaded performance between their newer hardware and their older hardware. Our app is written in one of the popular interpreted dynamic languages. Generic benchmarks for that language showed a 2-3x difference on some tests when run on the different hardware, and our app showed similar differences on CPU-bound tasks.
There are a few implications of this:
First, it complicates things when you try to make your staging environment mirror production. The quick check is to look at /proc/cpuinfo and see if clock, cache and model line up (a small script for this is sketched below). If not, open a ticket and ask them to migrate instances so things match.
Second, while they try to size things appropriately so that your guaranteed CPU is the same regardless of which generation of host you are on, your peak CPU is going to vary dramatically since you get 4 cores regardless.
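For the /proc/cpuinfo check, something like this rough sketch, run on both boxes and diffed, does the job (the field names are as they appear on typical x86 Linux; adjust if yours differ):

    # Print the fields worth comparing between two Linode hosts.
    # Run on each box and diff the output.
    INTERESTING = ("model name", "cpu MHz", "cache size")

    seen = set()
    with open("/proc/cpuinfo") as f:
        for line in f:
            if ":" not in line:
                continue
            key, value = [part.strip() for part in line.split(":", 1)]
            if key in INTERESTING and (key, value) not in seen:
                seen.add((key, value))
                print("%s: %s" % (key, value))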
I hear lots of awesome things about Linode; however, they are actually hosted by Rackspace, so with Rackspace's known reliability issues... maybe an issue.
Another service to consider is VPS.net. We have 6 virtual servers with them. They are cheap and perform well, their control panel is great and they have tons of great pre-configured OS installs. However downtime can sometimes be an issue. Usually it's not major, e.g. a server will just be unavailable for say 30 seconds or maybe a few minutes, but this happens probably an average of once or twice a month for each of our servers so it can get annoying. Like I said though the outages are usually minor. There's only been one that I remember that was major, and lasted several hours. We've been using VPS.net for about 8 months and overall are very happy with them. Not sure I'd recommend them for your setup but we only use them for a custom built CDN and for that it works very well.
Rackspace cloud's DNS stuff stinks. No way to add TXT records -- you have to open a ticket! Sure you can host it yourself, but every other cloud provider has this in their UI.
I get the feeling they're just in "maintenance mode" over there and don't have anyone working hard on improving the offerings.
Their DNS used to be great, if simple. You could put in a record for your new server and by the time you'd logged in, the record was live.
Now, it's gotten to the point where their DNS updates so slowly that I just re-use old subdomains instead of creating new ones, just so that I can get my project tested today rather than tomorrow.
I'm working on moving everything over to Amazon; more flexibility. I like Rackspace Cloud, but they're falling behind and they either don't know, don't care, or can't catch up.
What is even sadder is that their DNS console for hosted servers and Slicehost is so much better. They are literally sitting on something better and giving their cloud customers crap.
Another distinguishing factor between Rackspace Cloud and EC2 or Linode is bandwidth caps. Rackspace limits outbound public network throughput to 10Mbps for 256MB instances, up to 70Mbps for 16GB instances. EC2 and Linode both provide an uncapped GigE Internet uplink for instances of any size.
http://cloudservers.rackspacecloud.com/index.php/Frequently_...
This actually becomes really important when you set up a small server as a proxy server. We had to bump up our HAProxy server to a 512MB instance in order to get adequate bandwidth to serve our sites.
Quoth the article: Lastly, we moved over to the Rackspace Cloud because they cut a deal with YCombinator (one of the many benefits of being part of YC).
Can anyone say what this deal is, or is it secret?
I had a similar experience: with minimal negotiating we were able to get 50% off their initial quote. I also renegotiated our terms later within the contract period with relative ease.
They pad their pricing pretty heavily, so don't be afraid to ask.
I am curious because it seems like almost any configuration & setup should be able to serve a blog post at the volume of requests that comes with being at the top of HN.
Another control panel complaint: DNS. For some reason beyond me you have to choose a server just to configure DNS. Added on top of the fact that the control panel is really slow, it just becomes a pain to use.
I moved all my services from EC2 to Rackspace Cloud about 2 years ago, but I'm regretting it.
Rackspace Cloud does one thing well - small instances have great value CPU & local IO performance. If your app is CPU or local IO bound, splitting it across multiple 256MB instances on Rackspace Cloud will get you huge performance relative to price. I've been worried that this would degrade as the service grew, but that hasn't been the case.
Unfortunately, many other 'features' of Rackspace Cloud have been poor to awful recently. Some anecdotal stories:
1) We haven't been able to make images of our servers or restore ANY backup of our servers for months. There is a bug in the backup service where, if you have spaces in the names of your Cloud Files containers (completely unrelated to the backup service), then no images can be restored. We can't remove the spaces from the container names because you can't rename containers (only delete them), and there's too much data tied to different parts of our infrastructure in there.
2) In relation to the issue above, we have had a ticket open for over 2 months which we continually update with new information, asking for resolution. We never receive updates to the ticket itself and only get information when contacting their live chat. The response is always "we're working on it". I could live with it if this were a short period, or not an absolutely vital part of their service, but come on - all backups broken for 2 months! No timelines on resolution. No ticket responses. No happy.
3) While CPU value is great on small instances you get the other end of the stick on large instances as other posters here have said. You don't get significantly improved performance above the 2GB servers. CPU capacity certainly does not double as their documents say.
4) Cloud Files latency is awful. Individual read/writes take 300-1000ms. Fine for a small number of large files. Impossible for a large number of small files. (Having said that, being able to upload files and publish to CDN in a click has saved me lots of time for static files I need to quickly publish).
5) Resizing mid to large instances is impossible. We recently tried to resize a 1GB (40GB disc) server to a 2GB (80GB disc) and it took OVER 3 DAYS. No really. It didn't complete. The resize process takes the server down towards the end. We had to get Rackspace to cancel the resize and manually spin up another server and transfer the files ourselves. To make it worse, we couldn't act on this issue initially because Rackspace insisted that the process was "almost complete" from 12 hours onwards. 2.5 days later we just gave up. We managed to do the manual transfer ourselves in a couple of hours. Even worse Rackspace seemed to not think that it was unusual for the process to take 3 days or express any desire to investigate further.
6) The web interface has awful performance at scale. Once you go above 20 cloud servers every single page load takes 10+ seconds. As the original poster says, the number of errors it spits out about not being able to complete operations is insane. It's rare I can go in there planning on doing something and not have to contact support to fix something broken on their end.
7) They're taking the entire web interface and API offline for 12 hours this week! You won't be able to spin up or take down any of your servers. Why? So they can fix a billing issue related to Cloud Sites (a service we don't use).
I've always been a champion of Rackspace Cloud and Rackspace in general, but sadly I would no longer recommend them to people. I'm starting to make contingency plans and looking for other providers again.
At a previous company, we were using Rackspace Cloud for one of our sites, and we decided to double the size of the server as we were under a lot of load and needed some capacity fast.
Each step completed (the new server was spun up, the old server was backed up, etc.). Then it said it was going to take the server offline to do the actual clone of the disk from one to another. Then it took us to a screen saying 'Did everything complete properly? Yes/No', saying that clicking 'no' would just boot the old server back up again.
We waited for an hour, then two, then three, and the server still hadn't come back online. I didn't want to click 'no' in case it was something simple, so I called their support. After explaining what had happened to the guy, he told me to click 'Yes', which I did. Suddenly, the new server booted up and came back online. Support told me that when you get to that screen, you just hit 'Yes' and then your server boots up.
I couldn't help but think this was the dumbest thing I'd ever heard. They take your server offline, tell you it's going to be back 'any minute now', and when it is, you tell them if it worked properly or not. In reality, your server stays offline until you say 'Yeah, everything's working great', at which point they bring it back online.
I still recommend them to people who need basic services, such as 'I need a dedicated server for my blog'. Amazon is flexible, but also very complicated for semi-pro users to deal with.
The whole resize thing is fraught with problems. It was one of the main reasons I moved to Rackspace Cloud, but it's just way too unreliable. I'm too scared to use it on any mission-critical servers due to unpredictable downtime.
If you're under high traffic and you try to resize the server, the operation time blows out hugely. There's no way to block this TCP traffic other than changing DNS settings, which is obviously undesirable, and you might as well just manually set up a new server if you're going to do that (which is what I now tend to do). This process should be automated. It's an obvious bottleneck, and fixing it would dramatically improve their offering.
I didn't really mind Rackspace having all these issues at the beginning, but after 2 years I see no progress anywhere. Unlike AWS there's very little apparent investment in the infrastructure tools. Other cloud services have matured, but Rackspace hasn't. While the support at Rackspace is awesome, there's only so often I can hear someone apologizing. I'd rather the issues just got fixed and all this friction was reduced. Incredibly disappointed.
Rackspace Cloud used to have the upper hand on smaller-sized instances. This is no longer the case with EC2's new 640MB micro instance at about $8/mo reserved. Not a top performer, but it gets the job done for lower-traffic servers.
For anything that needs much CPU at all that isn't going to be the case. Any Rackspace node will be literally an order of magnitude faster for anything CPU bound, compared to an EC2 micro instance.
On the other hand if you just need memory (without necessarily having the clock cycles to quickly access it), or just something that will eventually handle occasional requests, a micro instance should be just fine.
I can't give any benchmarks, but I was doing a compile (openssh) the other day, and I noticed something peculiar.
I started the ./configure, and it blasted through 40 or 50 checks almost instantly, then stopped. Suddenly, it started taking about 3-5 seconds for each check (i.e. 'check for blarg in -lfoo'). For the rest of the compile, it was like I'd gone back in time ten years; incredibly slow. Checking top, it showed the %steal at 99.5-100% - the host system was scheduling almost no CPU at all to my machine.
Playing around, I found that you basically get a specific allotment of CPU cycles per unit of time, and once you run out you don't get any more for a while (other than the bare minimum to keep the server working). This makes it great for 'bursty' load cases, but after you've burst, they cut you off, so it's terrible if you suddenly get a sustained burst of traffic.
It's so bad that I spun up a clone of the machine, compiled SSH there, and then spun it down afterwards, which turned out to take easily half the time as compiling it on the micro instance itself.
So: great for personal blogs; terrible for suddenly getting traffic to your personal blogs.
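If you want to watch the throttling happen, a quick-and-dirty way is to burn CPU for a minute and see how much of the time gets stolen. A sketch (field offsets per the standard /proc/stat layout; the burn duration is arbitrary):

    import time

    def steal_and_total():
        # /proc/stat cpu line: user nice system idle iowait irq softirq steal ...
        with open("/proc/stat") as f:
            fields = [int(x) for x in f.readline().split()[1:]]
        return fields[7], sum(fields)

    steal0, total0 = steal_and_total()
    deadline = time.time() + 60
    x = 0
    while time.time() < deadline:  # spin for a minute to use up the burst allotment
        x += 1
    steal1, total1 = steal_and_total()

    pct = 100.0 * (steal1 - steal0) / ((total1 - total0) or 1)
    print("steal during the burn: %.1f%%" % pct)

On a micro you'd expect this to start low and climb towards 100% on repeated runs; on a small instance it should stay fairly flat.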
I can't release the full results of the ones I've run, and they didn't cover Micro instances (this was about a week before they came out).
I can say that when executing kcbench (set to do 4 kernel compiles, same configuration on all machines, using a '-j' equal to the number of CPUs in /proc/cpuinfo) I got the following times:
EC2 Small Node: ~849 seconds
Rackspace 1GB Node: ~81 seconds
From what I've seen, a micro instance can supposedly (that is, according to Amazon) get more "burst" cycles than a small instance, but in the long term will be significantly slower. A sibling to this comment refers to some work that compared small nodes to micro nodes, and found the small node to be over 2 times faster than the micro node for large processing jobs.
Based on that, for CPU heavy jobs I would put a 1GB Rackspace node at something like 25x faster than the EC2 micro node.
I found that for the most part, in order to compete with Rackspace on CPU you needed to go with at least an Extra Large (which was about on par with Rackspace) or a High-CPU Extra Large (which managed kcbench in ~44 seconds).
This is true; generally, all Rackspace Cloud servers perform about the same in terms of CPU and disk IO, so small Rackspace instances will outperform small EC2 instances. However, they don't scale well, and on the high end they under-perform relative to comparable EC2 instances. Here is some sample data validating this (these web service links provide XML-formatted benchmark results):
The results reported by danudey should not come as a surprise. The EC2 micro instances are designed for situations where short bursts of CPU are the norm. They were not intended to be used for continuous, compute-intensive chores.
I think the reason I was 'surprised' is that I expected the 'good for sudden bursts of CPU' to be what it was good for, rather than an actual hard limitation on how it works. Perhaps this is because I'm not terribly familiar with how EC2 is managed behind the scenes, being a new convert from Rackspace.
My post was mostly meant to illustrate that Amazon puts hard limits on how your VM operates (which makes it inconsistent over time under load), vs. Rackspace, which gives you a constant amount of CPU capacity all the time.
I'm surprised to see a setup using 50+ instances running on the Rackspace Cloud; wouldn't it have made sense to start moving towards dedicated servers by then?
True, but in theory the advantage with Rackspace is that you can mix and match to balance the cost of your known requirements with the flexibility for your unknowns.
I have 50+ instances on another cloud provider. The service-cost benefit of going to dedicated servers is a fair bit, but the management cost goes through the roof. We also prepay to save some expense on the monthly bill.
I can easily manage those servers, spin up new ones as requirements change, drop ones I don't need anymore, etc with very little hassle.
We also have a few dedicated servers (one actually with Xen on it for running smaller instances), and the time and effort to manage instances on it make it pretty much not worth it for us without a dedicated sysadmin.
First of all, there are managed hardware providers who can get you hardware online in less than 4 hours. Second of all, capacity planning can save you from "needing elasticity". Third of all, if you were on machines that gave you reasonable performance (i.e. not shared hosting), you probably wouldn't "need" nearly as much elasticity.
you probably wouldn't "need" nearly as much elasticity.
You discard that as if nobody needed elasticity.
Have you ever dealt with a large B2C site? Traffic tends to be heavily seasonal there, and in many other genres, too.
"Capacity planning" is a nice word for what in reality commonly boils down a bit of educated guessing, followed by the decision to keep a $reasonable amount of over-capacity around. The exact value of $reasonable is hardly ever specified by much more than the equivalent of a dice-roll.
Moreover, the interesting question is not which provider can provision a pile of metal within 4 hours. The interesting question is which of them will take those machines back a day later, without charging for a full month. A cloud will do that. Will your managed ISP?
No one denies that capacity planning is hard. There are books written on the subject. The points you make are exactly the reason why you need to do capacity planning and plan for mitigating failures. If you aren't planning on 2x (in fact more) growth then I'm confused as to what kind of growth you really expect in your service.
If you aren't giving yourself room for expected and unexpected loads, you're doing it wrong. Add capacity and load testing to your process.
If you work on systems where you only have the occasional 2x spike in traffic, or where planning for 2x capacity requirements in the future is easy, then you don't have the same problems suhail has.
I work in advertising for example. We could have 10 partners at 1x. Add 10 more and be at 1.1x or 2x, then add a large partner and be at 7x. There isn't a pattern to when we get partners from any of these groups but when we get them they need to go live as quickly as possible and sourcing and prepping hardware in situations like that isn't feasible. Nor is it feasible to have hardware on standby for the occasional 7x partner since you don't know when they are coming along and they could end up being a 10x partner.
If you aren't giving yourself room for expected and *unexpected* loads, you're doing it wrong.
You're using that word, I'm not sure it means what you think it means.
Over here in the real world, many applications (and notably web-applications) have one thing in common: They change all the time.
Your capacity plan from October might have been amazingly accurate for the software that was deployed and the load signature that was observed then.
Sadly now, in November, we have these two new features that hit the database quite hard. Plus, to add insult to injury, there's another old feature (that we had basically written off already) that is suddenly gaining immense popularity - and nobody can really tell how far that will go.
Capacity planning isn't just hard, it is costly. You have to profile every new version of your app, and every new version of the software you depend on. You have to update your planning models with that data, and then you have to provision extra hardware to handle whatever traffic spikes you think you'll be faced with within your planning window. Most of the time, those resources will be idle, but you will still be paying for them. Plus in the face of an extraordinary event, you'll be giving users a degraded experience.
Using "the cloud" doesn't solve all those problems but your costs can track your needs more closely, and with less up-front investment. Rather than carefully planning revisions to your infrastructure you can build a new one test it, cut over to it, and then ditch the old one.
You should still profile your app under load so you can be confident that you can indeed scale up easily, but even that is easier. You can bring up a full-scale version to test for a day and then take it down again.
I'm not against capacity planning, but it has its time and place.
The fundamentals of capacity planning do not change based on the magnitude of your data growth. Why would they?
We're mostly talking about looking at your data growth curve and extrapolating points in the future. Why would that become impossible just because the curve is steep?
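As a toy version of the "extrapolate the curve" step, here's a sketch using a least-squares fit of log(traffic) against time, i.e. assuming roughly exponential growth (the data points and the per-server capacity are made up for illustration):

    import math

    # (month, peak requests/sec) observations -- made-up numbers for illustration.
    observations = [(0, 120), (1, 150), (2, 190), (3, 240), (4, 300)]

    # Least-squares fit of log(traffic) = a + b * month.
    n = len(observations)
    xs = [m for m, _ in observations]
    ys = [math.log(r) for _, r in observations]
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    b_num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b_den = sum((x - mean_x) ** 2 for x in xs)
    b = b_num / b_den
    a = mean_y - b * mean_x

    month_ahead = 10
    projected = math.exp(a + b * month_ahead)
    print("projected peak at month %d: %.0f req/s" % (month_ahead, projected))
    print("servers needed at 400 req/s each, with 2x headroom: %d"
          % int(math.ceil(2 * projected / 400.0)))

Whether your growth actually follows such a clean curve is, of course, exactly what's being argued about in this thread.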
If you weren't paying such an enormous premium for your hardware, you'd have a lot more cash. On a per-dollar basis, you're paying anywhere from 2-10x the price for computing power on the cloud, depending on which resource you look at (CPU, Memory, Disk IOPS, etc).
suhail, like myself, works in an industry where one partner turning on can 2-4x your capacity requirements and 10 other partners won't change your traffic profile much at all. It's easy to say plan in advance for growth, but when there is a lot of variance in your growth this becomes a problem. You will often find yourself overspending on unused capacity or struggling to meet new capacity. If your growth is a smooth line then yes, you can say it is easy to figure out. But not everyone's growth follows such a simple line.
The problem is not that it is hard; the math can be done, the necessary capacity can be calculated, servers can be ordered.
The problem is that buying 3x the number of servers (or number of data centers) that you need "baseline", to handle the spikes, is a staggering expense.
Yup planning is hard when the next guy that registers can double your load, but dedicateds give you a lot of bang for your buck and you can still use VPSs (or cloud stuff of course) to scale up in a hurry.
I'm pushing a ton of data (>200m events/day now) through 2 dedicateds + 1x VPS and if it's a bad day I get 1 or 2 extra VPSs deployed, each of which adds another x,xxx events per second capacity.
"Amazon has a CDN and servers distributed globally. This is important to Mixpanel as websites all over the world are sending us data. There’s nothing like this on Rackspace."
Actually, Rackspace Cloud offers CDN services with Cloud Files through Limelight, although it does not support some features that CloudFront does, like CNAMEs and streaming.
A CDN is probably the easiest service provider to switch. We've tried Panther, Voxel, Cloudfront and use Edgecast. It's just a matter of changing some CNAMEs.
Rackspace Cloud has data centers in Texas and Illinois. However, they don't let you choose (your account is assigned to one data center at the time of creation). EC2 certainly offers a lot more flexibility in this regard, with EC2 data center regions in California, Virginia, Ireland and Singapore. I spoke with Rackspace at Interop a few weeks ago and they told me they are working on expanding to additional data centers, including an international data center very soon, and offering choice.
You can "choose" insofar as you can email support and they can manually change what data-center your next VM and all subsequent ones will be built in. It really should be exposed through the API, but the people who write the customer-facing API haven't done it.
That is not the story I got when I contacted support and asked if I could deploy a VM specifically to the Chicago cloud. Rackspace's support told me this was not possible and I'd have to set up an entirely new account in order to deploy to that data center.
I'd also add that their billing software can't keep things straight. I had a couple of servers that I spun up for a demo at a meetup that somehow ended up with the same name. This prevented me from deleting them, and it tells me their control panel has concurrency issues, since you shouldn't be able to create two servers with the same name.
I didn't notice the issue until 2 months later (I thought I had successfully deleted the servers) because all of a sudden I received a huge bill that contained over 1100 hours of usage for each instance, for that month alone. WTF? It turns out their software had failed to bill me the previous month, so I didn't notice any change.
Their response to my ticket about not being able to delete servers was to tell me the steps that I had to take to fix it (renaming the servers). I really wish when you had a ticket for stuff like that they'd actually act on it instead of just telling you how to fix stuff and expecting you to do it yourself.
A good friend deep in the security community once told me, off hand, that EC2 was "owned." I didn't take this too seriously until another good friend, who has been working at Amazon for the past several years, told me that engineers at Amazon were generally forbidden from using AWS due to security concerns.
That much said, I still decided to use EC2/RDS/S3 to host the infrastructure of my latest startup. It is just too convenient to walk away from. Once it matters, I can move the critical stuff to dedicated servers.
EDIT: To clarify, I'm not suggesting that Amazon knows AWS is "owned" and offers it to others anyway. I'm only noting that, for certain critical services, they themselves do not appear willing to take the risk.
I've worked with Amazon Web Services security people in the past, and while they're not perfect (nobody is) I have always had the impression that they take security seriously. AWS has many very large customers, including the US government and companies handling HIPAA-restricted data; based on the assumption that Amazon employees don't want to be thrown in jail for 10 years, I think it's safe to say that if EC2 is "0wned" as you claim, it's certainly not well known within Amazon.
I agree -- but fraudulently violating HIPAA (e.g., if you advertised "this is a safe place to put your HIPAA data" while knowing that it wasn't safe) is probably a rather different matter.
Colin was implying that negligent management of EC2 could leave Amazon employees criminally liable. Obviously anybody who "owned up" EC2 is already a criminal.
"...told me that engineers at Amazon were generally forbidden from using AWS due to security concerns."
The opposite is the case: there has been a huge push for some time to move (significant) parts of Amazon retail to AWS. It's extremely complex and service quality is paramount, so it takes a while to make it all happen.
My friend from Amazon works in the supply-chain side of things, and he said he really wants to use it, but everything has to be encrypted and some stuff is off limits.
I take it you work on the retail side of things? I'd be interested to hear any more details that you can share.
That certain services can't yet be moved to AWS is not an indicator that AWS is compromised. Several services, for example the payments infrastructure, are subject to regulations that make them challenging to implement _at all_, much less in a shared environment like AWS. Again, this is not an argument that AWS is compromised, and teams at Amazon are absolutely using AWS.
"A good friend deep in the security community once told me, off hand, that EC2 was "owned." I didn't take this too seriously until another good friend, who has been working at Amazon for the past several years, told me that engineers at Amazon were generally forbidden from using AWS due to security concerns."
"EDIT: To clarify, I'm not suggesting that Amazon knows AWS is "owned" and offers it to others anyway. I'm only noting that, for certain critical services, they themselves do not appear willing to take the risk."
I may not be the smartest guy, but it seems to me that's exactly what you are saying.
I'm not sure where the confusion lies, but I'm guessing you see "security concerns" as equivalent to "knowledge of ownership"?
It seems to me those are entirely different things, as one can be concerned about potential threat without knowing if it is real or not. But I do not work in the security community myself and may be using language sloppily.
I would be much obliged if you could show me where the crux of the confusion lies.
To paraphrase what you said: "I didn't take [statement A] seriously until [statement B]."
statement A = EC2 was owned
statement B = engineers at Amazon forbidden from using AWS
Perhaps English isn't your first language, but the way you've phrased it, you're relying on statement B as evidence/proof of statement A, directly implying a connection between the two. It's difficult to read it any other way.
Rewording your original comment: "It was only when that I heard that engineers at Amazon were forbidden from using AWS that I took seriously the comment that EC2 was owned."
Thanks for the reply. There is a connection, of course, but it is not that Amazon knows. Statement B is evidence in the sense that it suggests Amazon does not believe security is sufficiently iron-clad around EC2, which would allow for statement A to be possible in the first place.
I honestly did not expect my comment to create such angst. I recognize that the wording was a bit confusing, but it seems the main thing people are upset about is that I am spreading FUD. Of course that would be quite inappropriate if it was completely unfounded, but I have stated exactly where my concerns came from, so it seems perfectly legit to me.
Your reply is very reasonable and polite, but I am disappointed at the bulk of knee-jerk reactions to this post, as well as their passive-aggressive/ad-hominem nature.
Perhaps I am just in a poor mood, but I believe I will be moving on from HN. It was one of the few excuses left for me to procrastinate, so at least I should be more productive. ;)
EDIT: This, by the way, is an excellent article, though somewhat dated, on some of the security shortcomings of EC2. Note it does not address the "nightmare scenario" that Xen (the virtual machine software) is itself vulnerable.
While this is disconcerting, I wouldn't make any business decisions based on such a claim. The idea that EC2 is "owned" without Amazon knowing about it is closing in on absurdity. I've worked directly with Amazon as an outside vendor and they are very security conscious, to the point of near paranoia.
While I agree it is hard to believe, it would be even more surprising if Amazon did know about it. The fact that they do not use AWS internally suggests that---at least with their level of paranoia---they seem to suspect AWS themselves.
I'm not sure how you can say that so matter-of-factly. My security friend was talking about something that Amazon does not (and presumably very few people do) know. Meanwhile my friend at Amazon was just stating the fact that he was not supposed to use AWS, or only with extreme caution. Of course that may differ from department to department, if that's what you mean.
Your resume is very impressive, and I see that you obviously know a lot about security at Amazon, yet this by itself does not discount my points. Those are:
1) AWS could be compromised, as my first friend claimed, without Amazon knowing about it.
2) My second friend is not allowed to use AWS for security reasons.
The truth of the first point is indeterminable, I think we may agree. Meanwhile, the second point may indeed be due to my friend being misinformed, if, for example, you are aware of an Amazon-wide policy that says engineers can use AWS willy-nilly, so long as they abide by the general security regulations that are used elsewhere.
On the off chance you're not trolling: the reason you're getting downvoted has nothing to do with resumes; it's because you are throwing out unfounded hearsay FUD. Come back with some actual evidence for debate, otherwise you're no different than any one of a million IRC script kiddies. Anyone with knowledge of such an exploit would either A) keep it secret or B) tell Amazon about it. Casually dropping it in conversation screams wannabe.
Resume - and the direct personal experience in the right department of the company you're smearing that resume includes - trumps unsourced (and frankly, hard to believe) hearsay.
I'm not sure there is any proof Rackspace Cloud is any more secure than EC2. AWS offers a Virtual Private Data Center service (VPC) which is highly secure. Rackspace Cloud has nothing like that. AWS also offers firewall management functionality which Rackspace Cloud does not. Amazon.com is run out of the same data centers as AWS.