Several people have asked for additional details. We just posted a quick follow-on:
[UPDATE] A central theme of the recent AWS issues has been the Amazon Elastic Block Store (EBS) service. We use EBS at Twilio, but only for non-critical, non-latency-sensitive tasks. We've been a slow adopter of EBS for core parts of our persistence infrastructure because it doesn't satisfy the "unit-of-failure is a single host" principle. If EBS were to experience a problem, all dependent services could also experience failures. Instead, we've focused on utilizing the ephemeral disks present on each EC2 host for persistence. If an ephemeral disk fails, that failure is scoped to that host. We are planning a follow-on post describing how we do RAID0 striping across ephemeral disks to improve I/O performance.
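For readers unfamiliar with striping, here is a minimal sketch of how RAID0 maps a logical byte offset onto a set of disks. The function name, the four-disk layout, and the 64 KiB chunk size are illustrative assumptions, not Twilio's actual configuration.

```python
def locate(logical_offset, num_disks, chunk_size):
    """Map a logical byte offset to (disk index, byte offset on that disk).

    RAID0 splits the logical volume into fixed-size chunks and deals them
    out to the disks round-robin, so sequential I/O is spread across all
    disks. There is no redundancy: losing one disk loses the whole array.
    """
    # Which chunk the offset falls in, and how far into that chunk.
    chunk_index, within_chunk = divmod(logical_offset, chunk_size)
    # Chunks are assigned round-robin: row 0 fills every disk once, then row 1...
    stripe_row, disk = divmod(chunk_index, num_disks)
    return disk, stripe_row * chunk_size + within_chunk


# Hypothetical layout: 4 ephemeral disks, 64 KiB chunks.
CHUNK = 64 * 1024
for offset in (0, CHUNK, 4 * CHUNK, 5 * CHUNK + 10):
    print(offset, locate(offset, num_disks=4, chunk_size=CHUNK))
```

Because consecutive chunks land on different disks, a large sequential read or write can be serviced by all disks in parallel, which is where the I/O improvement comes from.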
Ironically, this highlights one of the main issues we discuss in the post!
The Twilio Engineering blog is hosted off an external WordPress site with a single IP that's forwarded from our nginx load balancer pool. Since the load balancers assume that the external service can fail, they won't tie up resources and block access to other parts of the site.
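The idea of assuming that an external dependency can fail can be sketched in a few lines. This is not the nginx configuration itself, only an illustration of the pattern: bound how long a caller waits on the external service and fall back when it is slow or broken. The function name `fetch_with_deadline` and its parameters are hypothetical.

```python
import concurrent.futures
import time


def fetch_with_deadline(fetch, timeout_s, fallback):
    """Call fetch(), but stop waiting after timeout_s seconds.

    Returns fetch()'s result on success, or `fallback` if the call times
    out or raises. Note the worker thread keeps running in the background
    until fetch() returns; only the caller stops waiting on it.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fetch)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return fallback  # external service too slow: do not block the caller
    except Exception:
        return fallback  # external service failed outright
    finally:
        pool.shutdown(wait=False)  # do not wait for a stuck call to finish


def slow_blog():
    # Stand-in for an overloaded external site.
    time.sleep(0.3)
    return "blog page"


print(fetch_with_deadline(lambda: "blog page", 1.0, "blog unavailable"))
print(fetch_with_deadline(slow_blog, 0.05, "blog unavailable"))
```

The point of the design is that a slow external site costs the caller at most `timeout_s`, so requests for other parts of the site are not stuck behind it.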
Evan, I just noticed that your service seems to be running on Slicehost, not the AWS colo in Virginia. Is that correct? I got the opposite impression from your post, which seems to imply that Twilio is hosted on AWS, yet managed to weather the storm because of your design decisions.
Ah, I see it now. I just got a POST from one of your servers in the AWS US-West region. Is Twilio also hosted in US-East (the region affected by today's outage), and, if so, would Twilio have stayed up if it hadn't been spread across multiple regions?
We just enabled caching on the nginx proxy to the external site hosting our WordPress install for the engineering blog. Hopefully that should help performance.
When we launched the Twilio SMS Beta we tried hard to support sending SMS messages to both US and International destinations. When there were problems, we worked with our customers to collect forensic data on hundreds of carriers worldwide and pass it to our carrier partners to debug.
At Twilio we are dedicated to working with top quality carriers and technology. After months of working to fix problems, we're not able to deliver the reliable International SMS service our customers have come to expect.
We apologize for any problems this has caused for our customers, and we'll work to bring back International SMS service once we're able to deliver the same quality as the rest of Twilio's services.
Hi Anthony, Twilio service status is available via our public status page: http://status.twilio.com/. We communicated degraded international SMS service to customers on August 16.
You bring up a good point that the information might not be readily discoverable. We'll work to make the status page more findable and to extend the API (http://status.twilio.com/documentation/rest) with features such as RSS to let customers subscribe to up-to-the-minute status information.
Presumably they have a list of all their customers with contact info. Not saying you should spam them often, but for the message "no more international support"? I think it's worth telling everyone.
In a 100% ideal scenario an organization would only need developers as every possible failure/problem would be covered and handled by automation.
Obviously the real world is different. New code is deployed that has bugs, there are unpredicted events, there are complex failure scenarios that are difficult to automate, etc.
A DevOps engineer may write automation software in conjunction with developers to automate the operational aspects of business logic. Thus, it's a partnership between the DevOps engineer whose metrics are driven by availability/reliability/scalability/security and the developer who is trying to attain some business objective.
I first heard about OpSec from Roland Dobbins with Arbor. Really smart fellow who has many suggestions on network design / defence.
I just think people need to specialize in their field (+ a general knowledge of how the overall system works).
Almost like how a carpenter, plumber, and electrician each specialize but probably have a general idea how everything works. Do you really want a plumber building the foundation of your house? Just seems a little silly to me. Each domain is so large that I would rather have deep expert knowledge in an area than a general handyman who might cut corners or waste time with solutions he doesn't know about.
Correct. Stashboard is simply a lightweight frontend display for your API/service status. It has a GUI and REST API that allow you to update status information. Using the API, one could wire Stashboard into Nagios or any other alerting system.
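As a rough sketch of that wiring, the snippet below turns a Nagios check result into the URL and form body for a status update. The Nagios exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are standard. Everything on the Stashboard side is an assumption to check against your own install: the `/admin/api/v1/services/<service>/events` path, the `status` and `message` field names, and the `up`/`warning`/`down` status names. Authentication and the actual HTTP call are left out.

```python
from urllib.parse import quote, urlencode

# Standard Nagios plugin exit codes mapped to status names.
# The status names on the right are assumed, not confirmed for your install.
NAGIOS_TO_STATUS = {
    0: "up",       # OK
    1: "warning",  # WARNING
    2: "down",     # CRITICAL
    3: "warning",  # UNKNOWN: treated as a warning here by choice
}


def build_status_update(base_url, service, nagios_code, message):
    """Return (url, form_body) for posting a status event.

    The endpoint path and form field names are hypothetical placeholders.
    """
    if nagios_code not in NAGIOS_TO_STATUS:
        raise ValueError("unrecognized Nagios exit code: %r" % (nagios_code,))
    url = "%s/admin/api/v1/services/%s/events" % (
        base_url.rstrip("/"),
        quote(service, safe=""),
    )
    body = urlencode({"status": NAGIOS_TO_STATUS[nagios_code], "message": message})
    return url, body


url, body = build_status_update(
    "https://status.example.com", "sms", 2, "Carrier timeouts"
)
print(url)
print(body)
```

A Nagios event handler could call a function like this on each state change and send the result with whatever authenticated HTTP client the status install requires.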