What we pay for DataDog is many orders of magnitude higher than our AWS bill.
Wow! This is blowing my mind.
Do you think this is the case for most companies monitoring serverless applications with DataDog, or is there something specific about your infra that causes this?
The application I am currently working on is all Lambdas (~45), DynamoDB, S3, CloudFront, Cognito, SQS, and SNS. My employer has several serverless applications with relatively moderate use, and I work on one of them. Our total cloud costs for the product I work on, across our DEV/STAGE/PROD/SANDBOX environments, are currently less than $1,000/mo. Our estimated cost of DataDog monitoring for the next year, just for my application, is at least $23k/yr at negotiated rates. We don't have crazy traffic, but we do have global users invoking all of our Lambdas at least once an hour. DataDog charges a fixed monthly cost for each Lambda invoked at least once an hour on average. On top of that, you also pay for ingestion, storage, and custom metrics. With multiple isolated environments for my product alone, this gets expensive.
Many other product teams at my work have lightly used serverless apps. DataDog's costs simply aren't feasible for serverless apps. We're actively looking into alternatives such as just using CloudWatch, Elastic, etc., since it's a huge cost for us.
I was just checking Datadog's pricing for serverless, and it says $7.20 per active function per month. If you are using 45 Lambdas, is the number of functions much higher? I am guessing ~200 or so?
Though I can see how charging based on functions can quickly shoot up the bill.
The problem is that we have 4 isolated environments, so it's 4x the number of Lambdas. Plus, since you only pay for Lambdas while they're running, we also deploy developers' PRs to AWS so that we can test their API changes with integration tests before merging them. The fixed cost is a killer. We have developers on our team in India, Ukraine, and the US, so even our dev environment is essentially used 24x7.
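To make the arithmetic concrete: 45 functions across 4 environments is 180 active functions, close to the ~200 guessed above. Below is a minimal Python sketch, assuming the $7.20 list price quoted upthread, that every deployed function counts as active, and that ingestion, storage, and custom metrics are left out. The `pr_preview_functions` parameter is a hypothetical knob for the per-PR deployments, whose count varies day to day.

```python
# Rough estimate of DataDog's fixed per-function serverless fee.
# Assumptions: $7.20 list price per active function per month (quoted upthread),
# every deployed function counts as active, and ingestion, storage, and
# custom metrics (billed separately) are excluded.

PRICE_PER_ACTIVE_FUNCTION = 7.20  # USD per active function per month


def monthly_function_fee(functions_per_env: int,
                         environments: int,
                         pr_preview_functions: int = 0,  # hypothetical: functions in per-PR deployments
                         price: float = PRICE_PER_ACTIVE_FUNCTION) -> float:
    """Fixed monthly fee in USD for all active functions."""
    active_functions = functions_per_env * environments + pr_preview_functions
    return active_functions * price


if __name__ == "__main__":
    monthly = monthly_function_fee(functions_per_env=45, environments=4)
    print(f"${monthly:,.2f}/month, ${monthly * 12:,.2f}/year")
```

At list price that is $1,296/mo ($15,552/yr) in function fees alone, already more than the whole sub-$1,000/mo AWS bill, before any ingestion or custom metrics. The $23k/yr figure above uses negotiated rates, so the two numbers aren't directly comparable.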
@cebert have you checked out Scanner.dev? It uses sparse skip-list indexing and serverless components to let you query a terabyte of logs in seconds, and pricing is around the same as CloudWatch.
What are the 4 isolated environments in your case? Dev, test, accp, and prod? If so, you could cut the cost by almost 50% by only monitoring accp and prod. This can be done by introducing a toggle in your Lambdas.
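A minimal sketch of such a toggle for a Python Lambda, assuming a hypothetical `STAGE` environment variable set per environment at deploy time. `fake_tracing_wrapper` is a placeholder for whatever handler wrapper your monitoring vendor ships (for DataDog's Python library, something like `datadog_lambda_wrapper`); here it only records events so the on/off behavior can be observed.

```python
import os
from typing import Any, Callable, Optional

# A Lambda handler takes (event, context) and returns a response.
Handler = Callable[[dict, Any], Any]

# Hypothetical set of stages worth paying to monitor (names from the comment above).
MONITORED_STAGES = frozenset({"accp", "prod"})


def monitoring_enabled(stage: Optional[str] = None) -> bool:
    """True if this stage should be monitored; reads the hypothetical STAGE env var by default."""
    if stage is None:
        stage = os.environ.get("STAGE", "")
    return stage.lower() in MONITORED_STAGES


def with_optional_monitoring(wrapper: Callable[[Handler], Handler]) -> Callable[[Handler], Handler]:
    """Apply `wrapper` only in monitored stages; elsewhere return the bare handler.

    The stage check runs once, when the decorator is applied (at import, i.e. cold start).
    """
    def decorate(handler: Handler) -> Handler:
        return wrapper(handler) if monitoring_enabled() else handler
    return decorate


# Placeholder for a vendor's handler wrapper (hypothetical; swap in the real one).
TRACED_EVENTS: list = []


def fake_tracing_wrapper(handler: Handler) -> Handler:
    def traced(event: dict, context: Any) -> Any:
        TRACED_EVENTS.append(event)  # a real wrapper would emit metrics/traces here
        return handler(event, context)
    return traced


@with_optional_monitoring(fake_tracing_wrapper)
def handler(event: dict, context: Any) -> dict:
    return {"statusCode": 200}
```

One caveat: if the vendor counts a function as active based on metrics it pulls from CloudWatch through its AWS integration, rather than on what the in-function library sends, a code-level toggle may not be enough, and the unmonitored accounts or functions would also need to be excluded from the integration. Worth checking the vendor's billing docs before relying on this.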
Time to look into which DD outputs actually drive service/product decisions with financial impact, and look for an alternative that re-implements just those. Stop being hypnotized by the fancy blinkenlights.
@cebert Have you looked at serverless-specific solutions like Thundra (my company), Serverless.com, etc.? I think the cost for your use case may be an order of magnitude lower, since the pricing is based only on the number of invocations.
We typically use Sandbox for significant deployment changes such as upgrading to Node 16, updating security policies, etc. in an isolated environment. If things break, it doesn’t impact DEV/QA.
It also blows my mind, we are also heavy Datadog users and our Datadog bill is roughly 1/10 of the AWS one. Our architecture isn't fully based on serverless because we like to get work done, but I wonder if that's the only cause or if they are using custom metrics wrong or something along those lines.
If you're paying less than 10% of infra costs for monitoring, you probably don't have good enough monitoring. But if you're paying more than 25% of your infra costs for monitoring, someone is not doing their job.
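The rule of thumb above as a quick ratio check, in Python. The 10% and 25% thresholds are the commenter's; the function names are hypothetical. Plugging in the figures from upthread (under $1,000/mo on AWS vs. at least $23k/yr, about $1,917/mo, on DataDog) gives a ratio of at least 1.9, far past the 25% line, while the ~1/10 ratio reported in the previous comment sits right at the lower edge.

```python
def monitoring_spend_ratio(monitoring_monthly: float, infra_monthly: float) -> float:
    """Monitoring spend as a fraction of infrastructure spend (same period, same currency)."""
    if infra_monthly <= 0:
        raise ValueError("infra_monthly must be positive")
    return monitoring_monthly / infra_monthly


def verdict(ratio: float, low: float = 0.10, high: float = 0.25) -> str:
    """Classify a ratio against the 10%/25% rule of thumb."""
    if ratio < low:
        return "monitoring is probably too thin"
    if ratio > high:
        return "monitoring spend is probably out of control"
    return "within the rule-of-thumb band"


if __name__ == "__main__":
    # Figures from upthread: $23k/yr DataDog vs. a $1,000/mo AWS upper bound.
    r = monitoring_spend_ratio(23_000 / 12, 1_000)
    print(f"{r:.2f}: {verdict(r)}")
```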
I think he's being funny. I thought it was hilarious. There are huge boons to productivity if you know your stack. I don't know serverless so if I built anything around it it would probably just be shiny object syndrome.
I think it depends on your use case and organization. My employer has traditionally built on-prem software that customers run in their data centers. Everyone wants to move to the cloud now. However, we admittedly don't have a lot of cloud experience yet. Serverless works well for us as we have a smaller but lucrative customer base (not Netflix scale). Amazon does a lot of heavy lifting for you, such as 3 AZs by default, easy scaling, etc. We provide value to our customers by understanding their domain and business logic challenges. Using serverless helps us focus on that and allows us to grow our cloud expertise without needing to manage k8s clusters or staff large ops teams.
We have a lot of request/reply CRUD type requests that are heavier on reads than writes. We use API Gateway to manage websocket connections for us. This type of usage pattern and size of our customer base fits well with serverless.
A tangentially related anecdote: I heard from a guy at MS that if you turn on Azure's AKS monitoring without any event filtering, the monitoring will cost significantly more than running AKS itself.
I had Azure AKS monitoring turned on for a minuscule, essentially unused hobby project. After about four months the monitoring costs suddenly exploded from about $4/mo to about $4k/mo.
No idea what happened, and MSFT support couldn't tell me what was happening, because at a burn rate of more than $100/day on a hobby project I started deleting everything connected with the effort as fast as possible.
All I know is my AKS wasn't exploding. Services were still responsive and acting normally in their minuscule cluster, this was just a logging cost explosion.
Absolutely. Two of my customers over the last two years (a hypergrowth startup and a crypto marketplace) both had $20MM+/year DataDog bills, comparable in magnitude to both their AWS spend (both were built on AWS) and their Snowflake spend (which was my area of focus). DataDog is wonderful, but it is pricey, and that's why it has that beautiful target on it from all kinds of vendors.
Both used everything you can possibly get from AWS and then some. I didn't interact with Datadog much, except for once loading 4PB of archived DD data into Snowflake to search through it to satisfy a government records request. That was an illuminating project: Datadog couldn't handle that, but Snowflake sure could.