As I noted in the blog post, the storage resides within the cluster. It is storing small values (log sequence numbers) so the space is used with great efficiency, but there is a finite limit. While I don't know the specifics, I would expect that the mapping (time to size) was chosen based on real-world historical data. Cause that's how we roll!
Database activity is volatile, so they have to allocate disk space on the fly, and that takes time. If you've had light activity and then start uploading many large BLOBs, it may take a while for them to allocate space, and during that window the backups might not go all the way back to X hours.
And, really, all of this is just servers in a datacenter, so you can have however many 9's of reliability, and still be That Guy that wins the fail lottery.
> It's good to know that Aurora will try. It's not like it needs to be reliable or anything.
"Reliable" isn't a binary concept. There's no service that is 100% reliable.
Having a solution that works 99.99% of the time (which is "only" four nines) is probably good enough that people are willing to use it as a last resort. (From what it seems, it's included for free, so it's not like this costs you anything anyway - at worst, you end up where you would be if they hadn't released this).
It has to be extremely reliable, simply because on large databases, it may take a considerable amount of time before you find out that something is wrong; and when you find out, it may be too late to fix things, even when you have backups.
3. Some actual guarantee that you can use to feel safe in the service, even if that guarantee is much smaller and less miraculous than the PR would suggest.
1) Point In Time - streamed to S3. Guaranteed restore to a new cluster with less than 5 minutes data loss
2) Aurora Backtrack - On cluster buffer, up to 24 hours retention (no guarantee). "Instant" restore to the same cluster
So you should feel safe using the service (up to the 5 minute worst case scenario).
Obviously no one ever wants to have to use either backup method. However, previously disaster recovery would involve spinning up a new cluster, deploying application changes with the new endpoint (or, in the best case, repointing a DNS entry), and waiting for the cache to warm up again. Now it becomes a one-click fix through the UI with no application changes.
Aurora keeps coming along in leaps and bounds, congratulations to the team, this is a fantastic achievement!
I only wish that every new feature didn't inevitably come with the caveat that it's only for the MySQL flavour of Aurora.
I understand both the engineering and product development reasons for doing so (different stack and MySQL is undoubtedly a much larger customer base), but it always makes these announcements a little underwhelming as an Aurora Postgres user.
> I only wish that every new feature didn't inevitably come with the caveat that it's only for the MySQL flavour of Aurora.
Not only MySQL, but only MySQL version 5.6, which was released in 2013 and superseded by 5.7 in 2015. Aurora finally released support for 5.7 in February this year after years of development, but all new features so far have been stuck on the 5.6 branch, which hasn't been very reassuring.
OK, Oracle has had this feature for at least a decade; it's called a "flashback query". Obviously Aurora costs 10% of Oracle, but still, I thought this was going to be a huge feature-add considering the HN comment count.
That being said, I love AWS, am Pro-Certified, and work with it everyday.
I know Oracle is a giant mean bully company, but at least their arrogance was never of the “world-destabilizing” kind like Facebook.
EDIT: changed rollback query to flashback query (flashback query can be used both to view or to actually change the DB)
> In Aurora, we have chosen a design point of tolerating (a) losing an entire AZ and one additional node (AZ+1) without losing data, and (b) losing an entire AZ without impacting the ability to write data. We achieve this by replicating each data item 6 ways across 3 AZs with 2 copies of each item in each AZ. We use a quorum model with 6 votes (V = 6), a write quorum of 4/6 (V_w = 4), and a read quorum of 3/6 (V_r = 3). With such a model, we can (a) lose a single AZ and one additional node (a failure of 3 nodes) without losing read availability, and (b) lose any two nodes, including a single AZ failure and maintain write availability. Ensuring read quorum enables us to rebuild write quorum by adding additional replica copies.
There are many 2 AZ regions in AWS, of course. I don't think you can stripe 3 copies per AZ, an AZ failure drops you to potentially 2/6, and if you allow for 2/6 and 3/6 writing you could have a split brain. Any thoughts how they manage that?
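To spell out the arithmetic I'm puzzling over (a toy check of the quoted numbers, nothing to do with AWS's actual placement logic):

```python
# Toy sanity check of the numbers quoted above: 6 copies (2 per AZ x 3 AZs),
# write quorum 4/6, read quorum 3/6. Just quorum math, nothing AWS-specific.
V, V_W, V_R = 6, 4, 3

assert V_R + V_W > V   # every read quorum overlaps every write quorum
assert 2 * V_W > V     # any two write quorums overlap, which is what rules out split-brain writes

def status(copies_lost):
    alive = V - copies_lost
    return {"reads_ok": alive >= V_R, "writes_ok": alive >= V_W}

print(status(2))  # whole AZ lost (2 copies): reads and writes both survive
print(status(3))  # whole AZ plus one more node: reads survive, writes blocked until repair
```

So the split-brain protection comes from every write needing 4 of the same 6 votes; the open question for me is how they keep 6 copies spread across 3 AZs in a region that only advertises 2.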
I left AWS before Aurora was introduced, but I suspect it's similar to how S3 fulfills a similar promise. From the S3 FAQ[1]:
> Amazon S3 Standard, S3 Standard-Infrequent Access, and Amazon Glacier storage classes replicate data across a minimum of three AZs to protect against the loss of one entire AZ. This remains true in Regions where fewer than three AZs are publicly available.
AWS has consistently stated Aurora storage spans 3 AZs. I believe some 2 AZ regions have a third AZ for internal AWS use only (not a 3rd DC in small regions but otherwise a separate AZ). I would imagine they might run the 5th + 6th copies in storage cells there...seems more likely than the other options...
This is nice but it appears that the entire database instance gets rolled back to that point. It'd be a lot nicer if it could be done at a per-db or per-table granularity.
Realistically I'd never use this feature because of the risk of data loss. I'd restore a new instance from backups and copy the lost data back over manually.
> It'd be a lot nicer if it could be done at a per-db or per-table granularity.
As another commenter pointed out, per-table would be scary for referential-integrity. And per-db kinda makes sense, but if you've got totally different use-cases hosted by the same MySQL you may be using the service incorrectly.
> I'd restore a new instance from backups and copy the lost data back over manually.
That's always been possible, but even with the best instances and most iops it could take hours to do that. So it's really designed for a different scenario.
Regarding referential integrity I don't disagree. It would be a process to be performed by an expert who understands the data structures and what the effect of rolling them back would be.
I might be wrong but I believe Aurora backups can be restored quite a bit quicker than that, right? We are evaluating Aurora at work, hence my interest.
Very interesting. They describe it as a rewind. Does anybody know if it's really a rewind, where each log record is reversible? Or do they do the easier thing of saving snapshots and then replaying the log from snapshot to desired point?
Probably reversible transactions where possible, but with original-state metadata attached to the transaction if it's required to reverse it (https://en.wikipedia.org/wiki/Memento_pattern).
No, we don't perform destructive writes, so we are able to simply mark a region of the log to be skipped. This approach also allows us to move back and forth within the time domain even after a backtrack.
So, yes, they save snapshots and take you back to the time of your choosing based off that.
Okay, but you're most likely wrong with that. If you read about how Aurora works (https://www.allthingsdistributed.com/files/p1041-verbitski.p...) it seems likely that they will be using log entries to rollback database state, rather than retaining "snapshots".
Amazon is the new IBM. Knock yourself out and jump into the AWS ecosystem. In a few years down the line, you'll understand that you've lost the leverage you had to potentially take your public cloud business somewhere else when you have so many dependencies on Amazon tech. Basic principles from my view: don't adopt anything but standard EC2/S3 services and create diversity not only in your teams but in your infrastructure policies.
Given the ten-ish years I've been using AWS everything has gotten progressively cheaper and more feature rich. I see no signs of that stopping.
I'll happily take on the risk of that theoretical future wherein for some reason I need to unwind from AWS, and in turn continue to save countless hours of my and my engineers' time not spent tackling solved problems, and focusing on our core business instead.
I’m okay with you being cool with it because where there’s chaos, there’s money to be made. People will eventually move out of AWS (no lead lasts forever), as the cycle goes, and someone will get paid to orchestrate that.
I encourage my competitors to marry themselves to a vendor’s proprietary stack.
Source: 18 years in tech. Have done my share of unwinding from vendors
In the long run, we're all dead. How do you make an impact given finite time and resources? Vendor lock in is already far down the list of existential risks to tech ventures. Long term risk due to adoption of AWS specifically is even farther given their track record so far.
What you're talking about here is survivorship bias -- yes, indeed you will find people paying big sums to migrate off of AWS for whatever reason. But they're still around! Unless of course you assume using AWS (or any vendor for that matter) provides no competitive advantage whatsoever, even in the short term.
The specific category of risk you are talking about here are tech ventures that fail because of a specific sequence of events:
- They choose AWS
- They, miraculously, create a successful business on AWS
- Some existential risk emerges due to their use of AWS
- They either fail or cannot afford to move off of AWS
- This singular decision causes their business to fail catastrophically
- And, they would have been successful had they not chosen AWS in the first place
This is a very specific path to failure and is highly unlikely relative to the more common cases of "they never found product-market fit" or "they had a toxic culture" or "they decided to build everything themselves and ran out of runway."
> In the long run, we're all dead. How do you make an impact given finite time and resources? Vendor lock in is already far down the list of existential risks to tech ventures. Long term risk due to adoption of AWS specifically is even farther given their track record so far.
I work for a ~billion dollar organization in risk management. If developers can't certify to our team that you can pull your app off of $cloud_provider and have it running on prem or in another cloud provider in a few weeks, it doesn't go to a cloud provider. We'll assist with their IaaS/PaaS tooling selection, infrastructure orchestration, container lifecycle process, and other ancillary needs to ensure this requirement can be met.
The issue isn't survivorship bias, it's risk management, business continuity, and cost management. It's entirely possible your organization doesn't place a high dollar value on these risks, but that's a business decision each organization needs to make for itself as a whole, and its products and services individually.
We're talking about different things: I'm talking about initial adoption of AWS for new, speculative ventures; it sounds like you are talking about transitioning to AWS from an already existing going concern. I agree with you on the latter case that the risk profile of a transition towards introducing AWS is totally different. Typically posts here that warn against adopting vendors due to lock-in concerns are targeted towards people asking about the risks for their new project.
I don't understand why anyone would start a company if they don't believe it will last at least 10 years. You should always be thinking about vendor lock in. And in the first 5 to 10 years you don't need AWS complexity. Keep your options open because in 5 to 10 years the landscape will be totally different.
A) I start a company, get some investors and sell the company to a larger company that has the resources to either keep maintaining it or change out the infrastructure (Instagram). Result: I profit
B) I start a company, gain product market-fit, become successful and I have the resources to gradually move my infrastructure when the need arises: Result - profit.
C) I worry about "vendor lock-in" from day one, spend a lot of time and resources on my own infrastructure and run out of money and never survive long enough for A or B.
I know which one I'm going to choose.
And as far as not "needing the complexity", the entire purpose of going to AWS is to avoid the complexity and up front cost of providing your own infrastructure.
The HN darling Dropbox started out on AWS because it was faster, and then when they grew they built their own infrastructure.
Netflix built their own infrastructure and decided that infrastructure wasn't their core competency and went entirely to AWS.
Your assumption that you'll build a unicorn or be able to untangle from vendor lock in at a time of your choosing with money you have no better use for is fairly optimistic.
The reality is you'll need to do it when you are tight on resources and you're stuck rebuilding what has become a very complex infrastructure (because you spent the last 5 years hooking into every proprietary bell and whistle) from scratch - with no internal expertise because you didn't start early when it was still a tractable problem.
I'm not saying you need to choose between AWS or running on your own physical hardware. But I am saying you don't want to lock in to the proprietary parts of AWS, which do tend to add lots of complexity to your infrastructure because they make that complexity easy to use.
Spinning up some ubuntu ec2 servers is fine - you can move those anywhere easily.
Have you seen the code that people write? When it comes to complexity, AWS is the least of their worries. If I happen to luck out and be part of a unicorn that grows so large that AWS can't handle it - something larger than Netflix - I think I'll have the resources needed.
I've worked for a company that had two issues - badly written software and a badly architected AWS environment. Guess which one was easier to fix?
All of the companies that tied their fate to Webforms, Ruby (Twitter), or some other deprecated tech have far bigger problems than someone who tied themselves to AWS.
As far as AWS what complexity is that?
Databases - all of their databases are compatible with standard databases except for DynamoDB. DynamoDB sucks anyway. If I wanted a managed NoSQL database, I would go with one from Mongo.
Messaging - You usually end up sending and receiving messages from one central place in your code, so changing your messaging system out is pretty simple (see the sketch after this list).
Caching - ElastiCache exposes either a Memcached or Redis interface.
Load balancing - the only thing special you do as far as software is have a health check endpoint. I used the same endpoint that I used on prem with Consul+Fabio.
ElasticSearch - same as using your own.
Autoscaling - I've got nothing. I've never tried doing autoscaling on prem.
CodePipeline/CodeBuild/CodeDeploy - the triggering event is nothing more than a standard webhook from your git repo of choice; CodeBuild is a preconfigured Docker container that executes commands based on a YAML file. I use a Python Lambda function for deployment, but I could run that in anything.
Even Lambda if written correctly - your handler should be skinny and treated like a Controller - shouldn't be that hard to port over.
Fargate/ECS - orchestration for standard Docker containers.
Parameter Store - yeah I could (and have) set up a cluster of Consul/Vault servers but why would I instead of using a service that ties in security roles and encryption with everything else?
Load Balancing - I've also set up my own Consul+Fabio clients on web servers/app servers for load balancing and service discovery. Not trying to do that anymore either when I can configure an ALB and an autoscaling group.
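To make the messaging point above concrete, this is the kind of thin seam I mean - my own sketch, where the class name and queue URL are made up but the boto3 calls are the standard SQS ones:

```python
import boto3

class MessageBus:
    """The one place in the codebase that knows which queue service is in use."""

    def __init__(self, queue_url: str):
        self._sqs = boto3.client("sqs")
        self._queue_url = queue_url

    def publish(self, body: str) -> None:
        self._sqs.send_message(QueueUrl=self._queue_url, MessageBody=body)

    def poll(self, max_messages: int = 10) -> list:
        resp = self._sqs.receive_message(
            QueueUrl=self._queue_url, MaxNumberOfMessages=max_messages
        )
        return [m["Body"] for m in resp.get("Messages", [])]

# Swapping to RabbitMQ/NATS/whatever means writing another class with the same
# publish/poll methods; nothing else in the application imports boto3.
bus = MessageBus("https://sqs.us-east-1.amazonaws.com/123456789012/example-queue")
```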
The software wouldn't be the hard part to move over. Even if we ran everything in EC2 instances, the hard part would be duplicating multiple geographically redundant VPCs in multiple locations in the US and an environment in Asia that was close to the outsourced developers.
Netflix went all in on AWS and much of their tooling is tied tightly into AWS. If Netflix can trust its entire business to AWS, I think anyone can.
> If Netflix can trust its entire business to AWS, I think anyone can.
This is a terrible argument. Netflix is simply a media streaming company, with very little importance compared to other businesses. Most of its data is served from its own CDN!
Check out a few of Netflix's blog posts, AWS re:Invent presentations (available on YouTube), the podcast interview I posted earlier, and their open source projects. Netflix uses a lot of AWS services.
I don’t think you have an understanding of the complexity behind Netflix. Their tech blog is a great resource that provides insight into some of their engineering challenges. I’d highly recommend.
Edit: just realized I duplicated scarface74’s comment. My bad!
In my opinion you've got the right idea, but I think some of your competitors might encourage you to keep spending cycles on pre-optimizing for vendor lock-in while they're creating features and being competitive.
Off topic: Ideally, infrastructure people optimize infrastructure and developers iterate on features with a robust CI/CD pipeline and a top notch local dev environment. That's the folly of developers being expected to build infrastructure and develop: it slows devs down. People and time are your most expensive resources; do not slow your team down, and shard responsibility as soon as you can afford to. Everyone cannot be an expert in all domains, and there are only so many working hours in a day.
You don't have to be an "expert". It takes relatively little time to get "good enough" at AWS to spin up a VPC, and use Beanstalk to spin up a web server with load balancing and set up an RDS instance with the database of your choosing.
On a more advanced level, using CloudFormation or Terraform is not rocket science and neither is setting up a CodePipeline.
When you grow big enough to need someone to handle netops full time, you can hire an overpriced consulting company or find a developer who knows the ins and outs of AWS and dev ops. But for smaller companies, managing AWS is not worthy of being a full time job if you know your stuff around automation.
I'm mostly either a "Senior Developer" or an "Architect" depending on how the wind is blowing, but from working with "AWS architects" and consulting companies, I know I can hold my own against most non-Netflix level AWS architects.
But I'm glad we have a third party consulting company to handle the drudgery that I don't want to do even though I do have admin access to the console.
The most likely outcome has already played out. A lot of people aren't moving out of their datacenters, they're just putting new workloads in AWS. Sure, some big players actually moved, like Netflix, because the gains outweighed the cost of moving, but most enterprises don't have the budget or don't care.
If something ever comes along that is significantly better, most people won't move out. They'll just put new workloads in the new thing.
Netflix workloads are a bit different than typical workloads (duck curve traffic right before the evening rush hits, bulk transcoding) and I’m confident Netflix doesn’t pay anywhere close to AWS retail prices. Also, Netflix has its own CDN for serving its video content using their OpenConnect appliances (you’re aware of that, but the audience might not be).
I think you may have missed my point (because everything you wrote supports it).
Netflix moved out of their datacenter and into the cloud because their workload was different enough to warrant the investment. Most enterprises don't have enough of a different workload for that to make sense, so they don't move out of their datacenter.
Dropbox didn't actually move out of AWS, they just started storing new data in their new datacenters. They did it because it was cheaper at their scale, it had nothing to do with lock-in.
And for the same reasons, when the next big thing comes along, most companies won't move out of AWS. Unless there is a significant cost savings or significant new architecture, they won't move. They will just put new stuff in the new thing. Lock-in isn't an issue for most enterprises today, and won't be in the future. Inertia is the biggest thing that stops them from moving old workloads to AWS.
And eventually any product you use will need to be changed. When that time comes you make the necessary changes. But as far as being locked in to IBM as an analogy, that doesn't really support the argument. There are companies still running 30+ year old COBOL programs and IBM still sells compatible solutions.
Perfection is the enemy of good. You have to have a product before improving it. Blah blah blah. God forbid you tie yourself to that git ecosystem. eyeroll
You can't compare a FOSS tool like git to a complete stack of proprietary services over which you have zero control and no knowledge of how it works under the hood.
You certainly should worry about the possibility of zero day attacks whether you're using FOSS or a proprietary solution.
If the provider of a proprietary solution stops offering it, you might have no easy path of migration, whereas hypothetically FOSS solutions are easier to self-host or find a new host for (thus reducing the cost of a potential migration).
If a FOSS solution is abandoned, it's doubtful that most companies are going to take on the maintenance themselves. They are going to start an initiative to migrate.
But that goes back to the old saying "no one ever got fired for buying IBM." The same can be applied to Microsoft or AWS. Yes MS has put plenty of technologies in maintenance mode but you didn't have to rush to migrate. IBM still sells system that are compatible with what they sold in the 80s.
On the other hand, if a FOSS solution is abandoned you typically have quite a long time to migrate since you can self-host. If AWS shuts down tomorrow and you rely on it heavily, you're basically fucked.
If AWS shuts down tomorrow. There are a lot of companies that are going to be in the same boat - including Amazon and Netflix.
I would be more worried about my one colocated data center getting shut down or getting destroyed than the entire global AWS infrastructure that has data centers in 50+ availability zones.
As far as software, most AWS services are just managed versions of available software. Even if you're using lambdas for API endpoints, if you treat your lambdas like Controller actions, it shouldn't be that much of a heavy lift to convert them into standard MVC API equivalents.
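For example, the "skinny handler" shape I mean looks roughly like this (a sketch of my own; the function names are made up, and the event shape assumes the usual API Gateway proxy format):

```python
import json

# Plain application code: no AWS types, no Lambda assumptions.
def create_order(customer_id: str, items: list) -> dict:
    # ...real business logic would live here...
    return {"customer_id": customer_id, "item_count": len(items), "status": "created"}

# Skinny Lambda handler: translate the event, delegate, format the response.
def lambda_handler(event, context):
    body = json.loads(event.get("body") or "{}")
    result = create_order(body["customer_id"], body.get("items", []))
    return {"statusCode": 200, "body": json.dumps(result)}
```

Porting off Lambda then means writing a new thin adapter (a Flask/FastAPI route, a plain WSGI view, whatever) around create_order(); the business logic itself doesn't move.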
For another perspective, listen to the podcast interview of Adrian Cockcroft, the guy who led the migration of Netflix's infrastructure from self hosted to AWS hosted.
Seems like you can both be right in different circumstances. Humble Bundle got to a nice acquisition married to GCP. Snapchat has successfully unwound. I haven't looked into it in depth but it seems examples on both sides abound.
That's the life cycle of proprietary stacks - when there's a lot of demand for them they dump resources into adding more and more and cheaper features. Nobody believes the party will end. But it always does. AWS is certainly more than 50% through that life cycle of "hot new thing", "stable feature rich thing that will be around forever", "ash heap".
If you went by the mantra "No one ever got fired for buying IBM in 1983" and decided to build your product on DB2 and COBOL, you could still get support and new compatible systems 35 years later.
If you decided to build your stack on top of MS SQL Server in 1990 you would have still had a relatively painless upgrade path over the years.
Well, in this case in particular, Aurora is "wire compatible" with MySQL. The only thing you have to do to move away from Aurora is migrate your data to another MySQL instance and change your connection string.
In general, if you're moving to AWS and just setting up some EC2 instances and using S3 in the fear of one day in the future it may not be what you want, you are wasting money. There are much cheaper pure hosting solutions. The benefits of using AWS is all of the services they provide that keep you from doing the "undifferentiated heavy lifting".
> Well, in this case in particular, Aurora is "wire compatible" with MySQL. The only thing you have to do to move away from Aurora is migrate your data to another MySQL instance and change your connection string.
Don't be fooled, the compatibility is there so that it's easy to port applications to Aurora, not so it's easy to take them out.
Amazon still offers MySQL RDS. If you move to the much-more-expensive Aurora, it's generally because your MySQL RDS application is struggling, and Amazon says "Aurora knows special tricks to make the engine faster."
Some management headaches, like adjusting for disk growth, are handled by Aurora transparently, which is nice, but for the most part, I saw workloads going to Aurora because they were too slow on standard MySQL.
That means that instead of optimizing or fixing your application, you're relying on a single vendor's Secret Go Juice to make your application usable/performant. In such cases, unless the mainline distribution catches up in the meantime, you're going to be stuck because your stuff will be too slow on other stacks.
From the risk management POV, Aurora is at best a temporary solution while the program is refactored/optimized to be performant on standard distributions.
My first maxim, which I've been repeating in comments, is that "no one ever changes infrastructure just to save a few dollars". You architect with plans to do it, but 99 times out of 100 you will do a risk/reward calculation and realize it's not worth it.
The second maxim is "no one ever got fired for buying IBM". If you're going to bet on a horse, there is always a chance that things go sideways. You might as well bet on the fastest horse with the best record.
If you had a choice between IBM, DEC VAX, and Stratus VOS back in the day and you went the safe route to go with IBM, you would have been vindicated two decades later. IBM is still selling compatible systems.
I could go through the trouble of staying "vendor neutral" and use tools like Consul+Vault with an AWS backend, Nomad, and Terraform (been there, done that), but in the time I'd waste staying vendor neutral, I can spin up equivalent services on AWS that cost less and that I don't have to maintain.
Most of the things that I depend on AWS for now, I was doing the equivalent of on prem before and after AWS existed. I'm glad to not have to do that anymore and just being able to press a few buttons.
This isn't "script all your resources in Terraform so you can move to Google Cloud, hypothetically". We're talking about a database system here. By now, we've learned many times over that an open-source platform is a huge benefit; if we're at the point where Microsoft is being forced to admit it via things like SQL Server and .NET Core, I'd say that the closed-source platform ship has more or less sailed at this point. Aurora moves you off an open platform and onto a closed one. This is a bigger concern than "Amazon probably won't shut it down tomorrow".
And on top of that, in the realm of database engineering, Amazon is absolutely a newcomer, they've got another good 15 years before they can even start to be in contention for "reliable and trustworthy database vendor". Goes doubly when you consider that Aurora is essentially just a bunch of performance hacks to (old versions of) InnoDB and Postgres. Presumably, upstreams like MariaDB and PostgreSQL have forgone similar enhancements for a better reason than "only Amazon employees are smart enough to figure it out".
> And on top of that, in the realm of database engineering, Amazon is absolutely a newcomer, they've got another good 15 years before they can even start to be in contention for "reliable and trustworthy database vendor"
They did hire a bunch of experienced engineers. I'd rather have them actually contribute to postgres (which amazon pretty much doesn't do, some bug reports aside).
> Presumably, upstreams like MariaDB and PostgreSQL have forgone similar enhancements for a better reason than "only Amazon employees are smart enough to figure it out".
Some of them are harder to do if you don't have as much control over the environment as amazon does...
Based on what I've heard, these aren't hacks on Postgres - more a pluggable storage engine. Haven't cross checked this, but I think both MySQL and Postgres have a swappable low level storage API. AWS has made engines that use their specific infrastructure to make all these changes - there are technically no changes in either the MySQL or Postgres engines themselves - only the storage layer.
That's why no announcements cover fundamental changes to either database - all the features we're seeing are the kinds you'd do on a personal scale with SSD RAIDs and ZFS (or other copy-on-write FSes), except at datacenter scale.
There are millions of companies that still use SQL Server and Oracle. Open source is not the be all and end all.
Have you seen the code to know that they are just "performance hacks"? Have you thought that when you can actually spend millions on a project you might be able to throw some top notch engineers at it?
That is a weird way to look at it. Companies that used IBM hardware and services became successful. If anything they became locked into success with a reliable business partner. It is odd to look at that as a negative thing. Likewise, Amazon has built a reliable and professional platform for cloud computing that is guaranteed to make your business successful for years to come. Why wouldn't you want that? What is the disadvantage really to becoming reliant on a superior architecture and support structure.
>In a few years down the line, you'll understand that you've lost the leverage you had to potentially take your public cloud business somewhere else when you have so many dependencies on Amazon tech
This is a non-issue for the vast majority of products. You do your eval and you chose what makes sense. Short of hiking prices by 1000% you'll never move anyway (this also applied to building ridiculous abstractions because "maybe we'll want to switch cloud providers someday!" No, you won't.)
Or "I'm going to put a facade over my database access because I don't want to be locked in to one vendor". Hardly anyone ever changes there database vendors just to save a few dollars.
In 23 years I've worked on one project that moved databases. Client requirement, huge system that Oracle was the best match for.
Also, I worked on one that was licensed software that needed to support DB2, Oracle and MS-SQL to integrate with various states' agencies. The former was a huge pain; the latter was horrible to write code for.
For the most part, you use what works for you and stick with it. The riskier a platform, the less you rely on features like stored procedures, etc. In the end, I'd rather have vendor lock-in than have to write complex queries three times and deal with really cumbersome abstractions slowing things down a lot! The same extrapolates to services beyond the DB.
In the end, the db migration was far less time/effort than maintaining untethered software would have been.
It's a form of technical debt, but on a larger scale. Like all technical debt, it lets you move fast to achieve your goals quickly and get to market sooner.
And like all technical debt, eventually you may find good reasons to rewrite, redesign, rethink, in order to achieve some different goals. But you might not have been around long enough to reach that stage if you hadn't taken on the debt early on.
I absolutely agree with you. I think we are in the position that AWS is IBM but in the time of the mainframe. That there were multiple competing mainframes and it wasn't certain who would be the winner but your best bet would be IBM.
I wonder if we will be able to recognize Amazon in ten years.
This is good advice -- but now that AWS has so many services, it's an extremely complex question. There are path dependencies: adopting an AWS service today may accelerate your product sufficiently to get it to the point of traction and sustainability, where it would have failed otherwise.
I think it's important to have a good framework for making these decisions. As you mentioned, EC2 and S3 are on one end of the spectrum: low switching costs and coupling, but also low in terms of differentiation at this point. Whereas other AWS services that are more bleeding edge could actually be the foundation of a temporary competitive advantage for your product, but today would introduce deep coupling of your service with AWS.
The trick is basically landing somewhere where you've made good bets: the ideal scenario is that you have a good chance of having an escape plan if/when the situation warrants it, while not missing out on the opportunity to take advantage of the services that your competitors (like, for example, those who are on the "EC2/S3 only" methodology you mention) may be forced to build themselves out of fear.
Unfortunately the best way to do this is to be able to predict the future: which services on AWS have high value and high longevity, and will have similar, cheap offerings on their competitors and/or via OSS in the coming years? S3 is an example of a good bet: if you adopted S3 when it first arrived, it may have felt like unnecessary coupling with AWS, but it was a huge accelerant if your business relied upon large, reliable file storage. (In fact many startups likely today exist because they were able to bootstrap off of S3.) S3 was a good, long term bet -- now, you can drop in an OSS replacement that is API-compatible in a worst-case scenario. Other services on AWS have come-and-gone through their heyday without going through that transition, because they were either (in retrospect) transitional technology or basically "the wrong thing." So, it comes down to predicting :)
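To illustrate the "drop-in OSS replacement" point: because self-hostable stores like MinIO speak the S3 API, the worst-case escape can be as small as an endpoint override. A sketch, with placeholder endpoint and credentials:

```python
import boto3

# The same boto3 client the application has always used...
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",    # ...pointed at a self-hosted S3-compatible store
    aws_access_key_id="EXAMPLE_KEY",          # placeholder credentials
    aws_secret_access_key="EXAMPLE_SECRET",
)

s3.put_object(Bucket="backups", Key="db/2018-05-10.dump", Body=b"...")
print(s3.list_objects_v2(Bucket="backups").get("KeyCount"))
```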
A couple of things that increase the probability an AWS service is destined for long term success + commoditization (in my view):
- Built on open protocols/standards
- Simplicity in APIs and feature set
- Versatility of use-cases
- Deeper in the stack, vs user-facing
- No existing open source alternative with traction
- Backchannel confirmation that Amazon itself is using it :)
For example, DynamoDB when it first came out was a huge "no" for me personally -- it was a very quirky design, with complex, bespoke concepts around indexing, with lots of good open source projects offering alternatives, and I heard Amazon didn't trust it internally :) Whereas other services like Redshift ticked all the right boxes out of the gate. So there's some hope of making good bets here, but sometimes you do need to throw caution to the wind if something extremely compelling comes along that feels like it could be a shot-in-the-arm to your product despite not having most of these attributes.
Also, FWIW since I like calling my shots: I think the current iteration of AWS server-less tech (while a necessary first step towards understanding the design space, and a technical marvel imho) is probably not a good long term bet as a foundational technology for a product you would like to de-risk in this aspect, but we'll see :)
Some great points in your blog-post-length comment ;)
I'd add that there's often a danger in the middle ground. Complexity and meaningful switching costs, without the benefit of interop, is likely the worst-case scenario. IOW, might be better to go all-in than muddle through with half-measures trying to avoid lock-in.
The long-term issue with adopting Amazon is pricing power--once you are committed AWS has it and you don't.
Take your Redshift example. It's an outstanding service but once you build apps around it and load a bunch of data the switching costs become very high. AWS can raise the price quite a bit before it makes economic sense for working apps to move off it.
Consequently a key question to ask is when do my favorite AWS services become cash cows? At that point your economics will change substantially. You may be living day-to-day in your business but the stock market thinks Amazon will extract large profits out of AWS in spite of growing competition. Food for thought...
AWS is a growth business, in a competitive space, so (particularly given their overall history) it seems unlikely they will increase margins if it could risk adoption.
Of course, in the "long term" (whatever that means), the market and shareholders expect to reap vast profits, and price that, along with the company's free cash flows, into the stock.
However, my personal thesis is that Amazon specifically is a bit of an anomaly in this regard: it seems unlikely to me that we will see any strategic shift in pricing from basically any Amazon-owned entity in our lifetimes -- the culture is too ingrained towards a strategy of complete and utter dominance over all markets, margin reduction being a key tool in the toolbox. So if I were to bet I would say pricing power is pretty low on the list of concerns of AWS lock in -- more concerning are things like unexpected conflicts of interest as Amazon eats the world (see also: Netflix), operational dependency (ultimately, you rely upon AWS ops' competency for your uptime, security, etc), end-of-lifing of services you depend on, opportunity cost vs other vendors who may offer better services, and other unknown unknowns.
Certainly there are other concerns--no argument there.
That said, Amazon Prime prices just went up 20%, so Amazon will clearly raise prices if they think the market will bear it.
Moreover, "complete and utter dominance" sounds like a monopoly. It's difficult to think of a monopoly that didn't raise prices and/or decrease service once they were established. I would not start a business on Amazon without assessing the risk of tying my fate to a single large vendor, for cost as well as other reasons you cited like Amazon deciding to compete with you.
> Take your Redshift example. It's an outstanding service but once you build apps around it and load a bunch of data the switching costs become very high. AWS can raise the price quite a bit before it makes economic sense for working apps to move off it.
That's a bad example. Redshift is "wire compatible" with Postgres. You use the same drivers. There are a few popular extensions they added to SQL to copy files directly from and to S3, but that's about it.
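Concretely, "same drivers" means something like this (a sketch with placeholder connection details and table names; port 5439 is Redshift's default):

```python
import psycopg2

# Same Postgres driver, same SQL; only the endpoint (and port 5439) says "Redshift".
conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
    port=5439,
    dbname="analytics",
    user="example_user",
    password="example_password",
)

with conn.cursor() as cur:
    cur.execute("SELECT date_trunc('day', created_at), count(*) FROM events GROUP BY 1")
    for day, n in cur.fetchall():
        print(day, n)

# The Redshift-specific bits are mostly the COPY-from-S3 loading statements;
# the query side is plain Postgres wire protocol.
```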
Redshift is a data warehouse. It does not require indexes and handles wide tables--e.g., fact tables with lots of columns--very efficiently since it uses column storage. For data warehouse operations you can only substitute native PostgreSQL if the dataset is quite small.
Assuming you seek wire compatibility, your choice would therefore be a PostgreSQL-compliant DW, of which there are several. However, they have greatly differing cost and operational profiles, which make switching non-trivial. They also tend to support different versions of PG depending on when they forked from PostgreSQL.
Yes, if for some reason you want to leave RedShift, you will lose the benefits that Redshift provides and you will lose the speed. But you shouldn't have to change your program.
I'm not saying it is a good idea to leave Redshift in some distant future where you might save a few pennies. Heck, I hardly ever say it's a good idea to rearchitect a core part of your infrastructure without having a very good reason -- saving a few dollars isn't one. But from the software side, you aren't stuck with Redshift and you aren't embedding a lot of proprietary AWS specific drivers in your code base.
> There are path dependencies: adopting an AWS service today may accelerate your product sufficiently to get it to the point of traction and sustainability, where it would have failed otherwise.
You can say the same thing about a loan shark or high interest loan. Just because you can doesn’t mean you should.
And you can say the same thing about mortgages and corporate financing -- sometimes you should. What is your point? Did you even read my post or just hit the "reply" button after reading the second sentence?
I think you have to bet on some things. You can't just make yourself completely technology agnostic. You have to pick a development language, an IDE, a database technology, an operating system, network infrastructure etc.
So then is it worth it spending momentum (dollars + thought) on that stuff, or on solving the business problem?
I second your opinion and add that even S3 is kinda risky. Even if there are open replacements, good old network-attached storage can cover all but the most outrageous scale, and it's even more trivial to replace as a building block.
RDS is fine too as long as one sticks to standard backends.
Architect your system as microservices. If AWS gets it wrong, it will likely not be bad for every part of your system simultaneously. Your migration strategy will be to stand up new services with a competing provider, and all the other services should see is a DNS change.
And, really, if you're doing that from the beginning, you might find you save a lot of money by taking advantage of other providers' strengths.
You're close. The real problem is Amazon may become your competitor, with their ever expanding reach into different markets. And the last thing you want to do is help your competitor succeed. I think Netflix made an incredible blunder in this regard.
Nope, we are going to run on Kubernetes. We don't care if it's on AWS, GC, Azure, DigitalOcean. We will abstract all provider-specific services so that when we move, the code base needs little or no modification.
I've used both in production, with all sorts of advanced features, and (to my surprise) I've never yet found a difference.
I would have a hard time leaving Aurora though. Not because of normal "vendor lock-in", but because of the operational worry-free scaling, backups, replicas, point-in-time restores etc. that it offers. Basically, I'm "stuck" on Aurora just because no one else offers MySQL/PostgreSQL databases like that.
There probably isn't a lot of fear with using RDS. They just do management for you, it isn't any fundamentally different technology that would cause heartache during a migration. Just stay away from Aurora if you care about this.
SQS: There's AWS MQ, which is based on ActiveMQ and supports AMQP. If you're going with a more CNCF-focused stack and want to use NATS, I'm not aware of any hosted options.
Features like database rewind aren't something that would be bound to your core app structure either. While you could structure business processes around it, it's more of a shit-hits-the-fan restore scenario, not a daily or weekly action.
Replicating it on a different platform could also be fine with a combination of logical and physical backups.
Point-in-time database restores are a best practice that is provided by maintaining database write logs for the timeframe that you expect to have point-in-time restoration for. You don't have to use Aurora to get them, Aurora just has the clicky buttons to make it a clicky-button matter. Any serious DBA should know how to do this without Amazon's platform wrapping it up in a GUI.
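For MySQL, for instance, the manual version is roughly "restore the last full backup, then replay the binary logs up to just before the mistake". A sketch, with placeholder paths and timestamps (assumes credentials come from ~/.my.cnf):

```python
import subprocess

SNAPSHOT = "/backups/nightly-2018-05-09.sql"           # placeholder paths
BINLOGS = ["/backups/binlog.000142", "/backups/binlog.000143"]
STOP_AT = "2018-05-10 14:31:00"                        # just before the bad statement

# 1. Restore the most recent full backup into a recovery database.
with open(SNAPSHOT, "rb") as dump:
    subprocess.run(["mysql", "recovered_db"], stdin=dump, check=True)

# 2. Replay the binary (write) logs up to the chosen point in time.
replay = subprocess.run(
    ["mysqlbinlog", f"--stop-datetime={STOP_AT}", *BINLOGS],
    check=True, capture_output=True,
)
subprocess.run(["mysql", "recovered_db"], input=replay.stdout, check=True)
```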
We've had point in time restore for quite some time. Backtrack is different. It moves you to a different point using the same instance. Since we don't do destructive writes to blocks (it is log-structured storage), we can simply mark a portion of the log as "ignored". It is a server feature, not a UI enhancement.
It's really hilarious to me that some new amazing technology comes out and people immediately are like "how can we not use this?"
The right approach would be to open an text file and start writing all your hardware drivers, an operating system, a database system, application services etc. so that you'll never ever be locked into anything, and you'll always be 100% in control of everything at all times.
Shit, at that point you'll also need to make sure you fabricate your own chips, lay your own pipes, invent your own transfer protocol, etc. as that's the only way to be absolutely certain you won't get locked in.
But then at that point, you've created a prison of your own demise.
That is the thing that the "never lock into a vendor" people don't understand. You build your own stuff and you can lock yourself into your own stuff, and it can be incredibly hard to get out.
I’ve seen it happen time and time again. Companies write their own frameworks, build systems, hardware provisioning systems, analytics systems, you name it... all to avoid “lock in”.
All of that home brew stuff usually sucks. Almost all of it becomes abandonware that one or two people in the company know how to change. Why? Cause all that stuff has nothing to do with how the business delivers value. All of it should have been provided by third party packages.
But now the company got big and they are locked into shitty homebrew garbage that would take a massive political and engineering effort to get out of.
Moral: it is just as easy to lock yourself into your own garbage as it is to lock yourself into a third party. Leave stuff that doesn’t add value to your business to people whose business it is to build that stuff. Beware of doing everything in-house.
It's even worse when you lock yourself into your own garbage. At least when you lock yourself into a framework that other people use, you have a chance of finding someone with the same issue, and there are people outside of your company who know how to use it - as opposed to your own bespoke framework.
You know, there's a sensible middle point between the two extremes!
It's important to ask questions like that – "what happens if this service goes away / becomes unaffordable / removes features we depend on". There are risks in that sense with using proprietary technology. Sometimes it'll be worth it, because the benefits of allowing you to deal more quickly with business problems will outweigh the technical risk. But I've definitely been in more than a few situations where a vendor has shut down a product or service that I depended on, resulting in a difficult or time-consuming migration.
Some of this risk is ameliorated by using open or popular standards and systems. Intel's not likely to stop producing x86 chips; GitHub is probably going to continue supporting git; Linux isn't going anywhere anytime soon. It's worth thinking about before using a service that nobody else offers!
"Some of this risk is ameliorated by using open or popular standards and systems."
This seems to be the middle ground. Making well-reasoned decisions about which vendors will have a long and well-maintained life of service at the expense of it being trendy or cool.
You're painting as if the only risk was them going away, but there are other risks with being tied to a single provider - like getting huge price increases: https://news.ycombinator.com/item?id=2533416
Meanwhile, the original post you replied to did not say to completely avoid AWS, only to use "standard EC2/S3 services" rather than provider-specific ones. It may not be the middle point you prefer, but it's hardly an extremist position worthy of ridicule.
GAE came out and people immediately built their entire businesses on it.
It's important to provide a detailed analysis weighing the pros vs cons of building an application on any brand new service like GAE etc.
For instance, "if they ever increase their prices by X amount, then we're fucked" would've been a pretty relevant con to consider before jumping into using it.
At the risk of being repetitive, I just want to point out, this is not "amazing new technology". Point-in-time database restores are provided by maintaining database write logs and should be a normal part of any serious backup/restore process, not just because they allow for point-in-time restores, but because they're an integral part of standard database recovery situations. They're used to replay changes made between the last backup snapshot and the time of the incident that warrants recovery.
AWS may have contributed some new technologies, but for the most part, they've just wrapped things up in a GUI. I'm not denigrating that as a valid market niche to fill, but people should know that there are non-AWS solutions for virtually everything that AWS provides, and that your company's admins probably already employ a good portion of them.
RDS already has PIT restore, and yes there is a UI for it. This is different. They would not be making an announcement for PIT restore.
A good question is: how is it different? With PIT restore, they create a new instance from a snapshot and play back the logs up to the relevant point. With this Aurora-only feature, they take advantage of Aurora's storage capabilities and rewind the same database to a previous point. No new instance, no cold recovery from snapshot, no long log replay.
Yeah, I know that Aurora already had point-in-time restore. It's disingenuous to pretend like this is something different. This is, at best, an upgrade to Amazon's processes to make the restore process smoother in the ways you've mentioned.
As the previous commenter said, this isn't point in time restore or a change in our processes. It is a change to our database and storage engine to mark regions of log-structured storage as if they did not occur. That's why we can complete a Backtrack in a few seconds - we're just moving some pointers around and restarting the database.
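A toy illustration of that idea (emphatically not the real storage engine, just the concept of marking a log range as if it never happened):

```python
class ToyLogStore:
    """Append-only toy store: writes are never overwritten; reads skip ignored ranges."""

    def __init__(self):
        self.log = []        # (lsn, key, value) records, never mutated
        self.ignored = []    # (start_lsn, end_lsn) ranges marked as "did not occur"

    def write(self, key, value):
        self.log.append((len(self.log), key, value))

    def backtrack(self, to_lsn):
        # Nothing is deleted; the tail is just marked as skipped, which is also
        # why you could later un-mark it and move forward in time again.
        self.ignored.append((to_lsn, len(self.log)))

    def read(self, key):
        for lsn, k, v in reversed(self.log):
            if k == key and not any(lo <= lsn < hi for lo, hi in self.ignored):
                return v
        return None

store = ToyLogStore()
store.write("row:1", "good value")
checkpoint = len(store.log)
store.write("row:1", "oops, bad deploy")
store.backtrack(checkpoint)
print(store.read("row:1"))   # "good value" -- the bad write still exists, it is just skipped
```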
Similar, but from what I understand it allows granularity down to seconds instead of minutes, and provisioning in minutes rather than potentially hours.
Speed of new cluster provisioning is neat; the amazon-rds console currently shows the option to do a point-in-time restore down to the second. Maybe that's actually not available already and they're truncating to the latest backup?
Edit: I guess it's also nice that it's in place. Some people probably don't have the flexibility to easily point applications at a new cluster in a repeatable way, so that may help folks prevent down time.
Point-in-time restore goes back to the last daily snapshot and replays logs since that snapshot; this can take quite some time (hours) if your database has a lot of writes, especially if the snapshot was 23 hours ago. The blog post doesn't quite make it clear, but I would hope that backtracking is somewhat quicker than this method.
I think the main difference is that restore will create a new DB instance, whereas backtrack updates the state of the DB instance in place. It is also probably a lot faster than restoring.
It's a relatively classic invention: take something that exists and repackage it. A snapshot and a log replay accomplishes something pretty similar. AWS slapped a UI and some orchestration around it. The cloud lock-in stuff makes sense (although if having an easy "undo button" on your db layer is mission critical to your business, you might have other interesting challenges).
That's not correct. We made a change to mark portions of our log-structured storage as though they should not have occurred. It is a totally different approach than point-in-time restore.
I don't know anything about Aurora and maybe I'm missing something. But why not just wrap everything in a TRANSACTION and then do a ROLLBACK if there's an issue?
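I mean something like this (sqlite3 just for illustration; the same BEGIN/ROLLBACK idea applies to MySQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100)")
conn.commit()

try:
    # The driver opens a transaction implicitly before the UPDATE.
    conn.execute("UPDATE accounts SET balance = balance - 500 WHERE id = 1")
    raise ValueError("whoops, noticed the mistake before committing")
except ValueError:
    conn.rollback()   # the uncommitted change is discarded

print(conn.execute("SELECT balance FROM accounts").fetchone())  # (100,)
# Only helps if the mistake is noticed before COMMIT; once committed,
# you're back to PITR/Backtrack territory.
```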
This would be for deploying a database migration that hoses something. For instance, we had a migration that touched a bunch of tables and, due to a triggered procedure, it blew away all the modification dates.
TBH, I'm not sure how useful it is compared to the normal PITR. My usual guidance on recovering after a database fuck up is "put the app into maintenance mode and the database in read-only, do your investigation, run PITR, port user activity to the backup, do spot checks to determine everything got through, roll the app back to before the change, take the old database down, point the app at the new database, bring the system back online."
I'm sure a manager is thinking, "oh, we can be back online in 5 minutes, amazing!" But just because something broke doesn't mean users stopped hitting your database! You can't just hit Undo and throw all that out!
You can also clone the database volume in Aurora and then just backtrack one of the volumes. That should help you ensure you have a version available for forensic analysis.
There is usually an option to disable delete statements which do not contain a WHERE clause, and if you really want to run such a statement you'd have to write the WHERE clause explicitly.
First, transactions are not free; there is overhead that comes with them. Second, it could be an upstream software bug that writes wrong data to the DB, possibly overwriting good data - a transaction won't protect you from that.
This is amazon's model. A cobbled together team of 1-3 low level engineers takes a feature of some open source software package or the entire package itself, wraps it in an ill-conceived web layer or existing service and makes a big announcement about it. Almost no one uses it, but there will be lots of conference talks about how great it is.
The seamlessness of this feature is quite amazing. Backups are usually a huge pain to deal with (I've recently been dealing with Postgres/Barman quite a bit). And disaster scenarios aside (for which AWS already does replications across regions), I think a frequent purpose of backups is really to do this "Undo", go back in time and pretend something didn't happen.
All this makes me really really wanna use Aurora. :)
Various issues with recovery during testing. Missing WALs in the data files that `list-files` command says should be present. Backup failure because of a different number of parallel workers and so on. Combined with the fact that I had to write my own code to push it to S3 (no out of the box support).
I suppose it works fine on its own for small to medium databases, and it's a fantastic product. I just wish it was a bit better. :)
When faced with the alternative of losing the entire database [1], losing 410 transactions is usually preferable.
Per-second granularity gives you the ability to restore as close to the fatal mistake (or poison pill transaction) as possible.
After that, recovering transactions lost by the rewind can be really tricky (but not impossible). If the entire database got dropped, for example, it’s possible for the unwound writes to contain conflicting IDs.
This is exactly my concern. There is no production system I have ever worked on for which this feature would ever be used.
What if you had an e-commerce site? Other customers were placing orders, right? Credit cards being charged? You didn't catch your mistake instantly. So you have a window of maybe 5 minutes or an hour in which other things happened. You can't simply forget those other transactions and throw away the data.
There have been many well publicized events where a production database was lost, or in some way corrupted. I’ve lived through one such event myself. When that happens, you usually go to a backup, and typically run into two problems:
1) Point in time back ups can be hours old.
2) More importantly, “backups” are useless, it’s the “restores” that are valuable. And very few organizations have a well practiced muscle memory for restoring from a backup.
A turnkey restore solution with per-second granularity can significantly decrease both the loss window and the recovery time. Hope nobody gets to use it, but when you have to, it can be the difference between a big and a small outage.
How is this better than their already existing PITR?
Why would someone want to rollback their own production database instead of PITR to a new database and switching over to it? Surely you would end up losing data because you wouldn't be able to reconcile the new data written to it.
That is a different feature (although a cool one). They provide the ability to run a query as of a point in time. We are moving the database backward in time (which matters for running applications)
In a database, Undo, in spite of the name, is the act of applying a transaction log to return to a particular state. I used to work on databases earlier in my career, so selling this as a new invention seems somewhat bizarre to me.
Sure – you can set up a couple of VPSs running MySQL, then set up a durable, replicated configuration. Then you need some system to spin up new instances when an instance fails. Then you need some kind of UI to allow that to happen. And you need to implement some kind of point-in-time recovery system, as well as backups, user authentication, monitoring… and so on.
The point is, AWS allows you to buy those services instead of spending time building them yourself. There's nothing wrong with building your own database cluster if it makes sense for your use case – but there's nothing wrong with buying one in if it makes sense, either!