Paying for DataDog is many orders of magnitude higher than our AWS bill. Wow! Th...

cebert · on Oct 2, 2022

The application I am currently working on is about 45 Lambdas plus DynamoDB, S3, CloudFront, Cognito, SQS, and SNS. My employer has several serverless applications with relatively moderate use, and I work on one of them. Our total cloud cost for the product I work on, across our DEV/STAGE/PROD/SANDBOX environments, is currently less than $1,000/mo. Our estimated cost of DataDog monitoring for the next year, just on my application, is at least $23k/yr using negotiated rates. We don’t have crazy traffic, but we do have global users invoking all of our Lambdas at least once an hour. DataDog charges a fixed monthly cost for each Lambda invoked at least once an hour on average. Then you also need to pay for ingestion, storage, and custom metrics. Just on my product alone, with multiple isolated environments, this gets expensive.

Many other product teams at my work have lightly used serverless apps. The DataDog costs simply aren’t feasible for serverless apps. We’re actively looking into alternatives, such as just using CloudWatch, Elastic, etc., as it’s a huge cost for us.

VectorLock · on Oct 2, 2022

Wonder how much it would be if you just used all the native AWS tools for things you get from Datadog.

pranay01 · on Oct 2, 2022

As far as I understand, AWS native tools like CloudWatch are not as good.

pranay01 · on Oct 2, 2022

Thanks for the detailed note.

I was just checking Datadog pricing for serverless, where it says $7.20 per active function per month. If you are using 45 lambdas, is the number of functions much higher? I am guessing ~200 or so?

Though I can see how charging based on functions can quickly shoot up the bill.

cebert · on Oct 2, 2022

The problem is we have 4 isolated environments, so it’s 4x the number of lambdas. Plus, since you only pay for lambdas when they’re running, we also deploy developers’ PRs in AWS so that we can test their API changes with integration tests before merging those changes in. The fixed cost is a killer. We have developers on our team in India, Ukraine, and the US, so even our dev environment is used essentially 24x7.
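For a rough sense of scale, here is a back-of-the-envelope sketch of the fixed per-function charge alone, using only figures mentioned in this thread: about 45 lambdas, 4 environments, and the $7.20 list price per active function per month. It assumes every function in every environment counts as active, and it ignores PR deployments, ingestion, storage, custom metrics, and whatever negotiated rate actually applies, so treat it as an illustration rather than a real bill.

```python
# Back-of-the-envelope estimate of the fixed per-function fee only.
# Inputs come from the thread; the negotiated rate is unknown, so the
# list price is used as an assumption.
LAMBDAS_PER_ENV = 45
ENVIRONMENTS = 4
PRICE_CENTS_PER_FUNCTION_MONTH = 720  # $7.20, kept in cents to avoid float rounding


def function_fee_cents_per_month(lambdas_per_env, environments, price_cents):
    """Monthly fee if every function in every environment counts as active."""
    return lambdas_per_env * environments * price_cents


monthly = function_fee_cents_per_month(
    LAMBDAS_PER_ENV, ENVIRONMENTS, PRICE_CENTS_PER_FUNCTION_MONTH
)
yearly = monthly * 12

print(f"active functions: {LAMBDAS_PER_ENV * ENVIRONMENTS}")
print(f"per month: ${monthly / 100:,.2f}")
print(f"per year:  ${yearly / 100:,.2f}")
```

Under those assumptions, 180 active functions come to $1,296/mo, or $15,552/yr, before any usage-based charges are added.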

alb4 · on Oct 4, 2022

@cebert have you checked out Scanner.dev? It uses sparse skip-list indexing and serverless components to let you query a terabyte of logs in seconds, and pricing is around the same as CloudWatch.

xiwenc · on Oct 2, 2022

What are the 4 isolated environments in your case? Dev, test, accp, and prod? If so, you could slash the cost by almost 50% by only monitoring accp and prod. This can be accomplished by introducing a toggle in your lambdas.
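A minimal sketch of such a toggle, in Python for illustration (the same idea applies to Node). The `STAGE` environment variable, the stage names, and the `instrument` wrapper are all hypothetical placeholders, not any vendor's API. Whether an unwrapped function stops being billed depends on how the vendor counts active functions, which this thread doesn't settle.

```python
import functools
import os

# Hypothetical stage names; only these environments get monitoring.
MONITORED_STAGES = {"accp", "prod"}


def monitoring_enabled(env=None):
    """Return True when the (hypothetical) STAGE variable names a monitored stage."""
    env = os.environ if env is None else env
    return env.get("STAGE", "").lower() in MONITORED_STAGES


def instrument(handler):
    """Stand-in for a vendor's handler wrapper; it only marks the function."""
    @functools.wraps(handler)
    def wrapper(event, context):
        # A real wrapper would emit traces and metrics around this call.
        return handler(event, context)

    wrapper.instrumented = True
    return wrapper


def maybe_instrument(handler, env=None):
    """Wrap the handler only in monitored stages; otherwise return it untouched."""
    return instrument(handler) if monitoring_enabled(env) else handler


def handler(event, context):
    return {"statusCode": 200}


# Entry point the Lambda runtime would be configured to call.
lambda_handler = maybe_instrument(handler)
```

Setting `STAGE=prod` in one environment and `STAGE=dev` in another would then switch instrumentation on and off without code changes.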

mustyoshi · on Oct 2, 2022

Are you sure you need to monitor the PR lambdas, if they're going to make it to one of the 3 environments before prod?

smetj · on Oct 2, 2022

Time to look into what output DD delivers which drives service/product decisions with financial impact and look for an alternative to re-implement. Stop being hypnotized by the fancy blinkenlights.

berkay · on Oct 3, 2022

@cebert Have you looked at serverless-specific solutions like Thundra (my company), Serverless.com, etc.? I think the cost for your use case may be an order of magnitude lower, since the pricing is based only on the number of invocations.

masterofmisc · on Oct 2, 2022

Can you tell me what the difference is between your DEV and SANDBOX environments? Curious to know.

cebert · on Oct 4, 2022

We typically use Sandbox for significant deployment changes such as upgrading to Node 16, updating security policies, etc. in an isolated environment. If things break, it doesn’t impact DEV/QA.

masterofmisc · on Oct 4, 2022

Ahh gotcha.. Thanks for the info. That's useful to know.

vasco · on Oct 2, 2022

It also blows my mind. We are also heavy Datadog users, and our Datadog bill is roughly 1/10 of the AWS one. Our architecture isn't fully based on serverless because we like to get work done, but I wonder if that's the only cause or if they are using custom metrics wrong or something along those lines.

If you're paying less than 10% of infra costs for monitoring, you probably don't have good enough monitoring. But if you're paying more than 25% of your infra costs for monitoring, someone is not doing their job.
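As a quick sanity check, the rule of thumb above can be applied to the numbers cebert gave at the top of the thread. This is only a sketch: the 10% and 25% thresholds are the ones stated in this comment, and the inputs are the thread's bounds (under $1,000/mo for AWS, at least $23k/yr for DataDog), so the real ratio is at least what is computed here.

```python
def monitoring_share(monitoring_per_year, infra_per_year):
    """Monitoring spend as a fraction of infrastructure spend."""
    return monitoring_per_year / infra_per_year


def verdict(share, low=0.10, high=0.25):
    """Classify a share against the 10%-25% band from the comment above."""
    if share < low:
        return "probably under-monitored"
    if share > high:
        return "probably overspending"
    return "within the suggested band"


# Under $1,000/mo on AWS is at most $12,000/yr; DataDog is at least $23,000/yr.
share = monitoring_share(23_000, 12_000)
print(f"monitoring is {share:.0%} of infra spend: {verdict(share)}")
```

By that yardstick, cebert's monitoring estimate is nearly double the infrastructure bill, far outside the suggested band.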

salil999 · on Oct 2, 2022

> Our architecture isn't fully based on serverless because we like to get work done

What do you mean by this?

tepitoperrito · on Oct 2, 2022

I think he's being funny. I thought it was hilarious. There are huge boons to productivity if you know your stack. I don't know serverless, so if I built anything around it, it would probably just be shiny object syndrome.

cebert · on Oct 4, 2022

I think it depends on your use case and organization. My employer has traditionally built on-prem software that customers run in their data centers. Everyone wants to move to the cloud now. However, we admittedly don’t have a lot of cloud experience yet. Serverless works well for us as we have a smaller but lucrative customer base (not Netflix scale). Amazon does a lot of heavy lifting for you, such as 3 AZs by default, easy scaling, etc. We provide value to our customers by understanding their domain and business logic challenges. Using serverless helps us focus on that and allows us to grow our cloud expertise without needing to manage k8s clusters or having large ops teams.

We have a lot of request/reply CRUD type requests that are heavier on reads than writes. We use API Gateway to manage websocket connections for us. This type of usage pattern and size of our customer base fits well with serverless.

josephcooney · on Oct 2, 2022

A tangentially related anecdote: I heard from a guy from MS that if you turn on Azure's AKS monitoring without any filtering of events applied, the cost of the monitoring will be significantly more than running AKS itself.

yodon · on Oct 2, 2022

I had Azure AKS monitoring turned on for a minuscule, essentially unused hobby project. After about four months the monitoring costs suddenly exploded from about $4/mo to about $4k/mo.

No idea what happened and MSFT support couldn't tell me what was happening because at more than $100/day burn rate on a hobby project I started deleting everything connected with the effort as fast as possible.

All I know is my AKS wasn't exploding. Services were still responsive and acting normally in their minuscule cluster, this was just a logging cost explosion.

Also billing alerts are your friend.

malkia · on Oct 2, 2022

That might be okay if, say, you enable everything for 10-20 seconds so that all traces are sampled, logs logged, etc., and then it ramps down.

sandermvanvliet · on Oct 2, 2022

Yeah got burned by that once. Fortunately I had billing alerts in place so we found it quick…

pranay01 · on Oct 2, 2022

This is the default monitoring which comes with the AKS setup, right?

josephcooney · on Oct 3, 2022

I'm not sure....I don't think it was turned on by default for AKS. I can do some digging if you want.

danielodievich · on Oct 2, 2022

Absolutely. Two of my customers over the last two years (a hypergrowth startup and a crypto marketplace) both had 20+MM/year DataDog bills, comparable in magnitude to both their AWS spend (both were built on AWS) and their Snowflake spend (which was my area of focus). DataDog's wonderful, yet it is pricey, and that's why they have that beautiful target on them from all kinds of vendors.

pranay01 · on Oct 2, 2022

Very interesting! I never thought DataDog would be close to AWS spend.

Were these also on AWS Lambda or something else (EKS?)

danielodievich · on Oct 2, 2022

Both had everything you can possibly get from AWS and then more. I didn't interact with Datadog much, except for once loading 4PB of archived DD data into Snowflake to search through it to satisfy a govt records request. That was an illuminating project. Datadog can't handle that, but Snowflake sure could.