90 days of retention is only 2,160 hours. Even at 999 GB/hr that is only ~2,160 TB of storage. So, if we stretch the definition of "multiple gigabytes", that is maybe $100k in storage, which is around 3-6 developer-months. If we use a more reasonable definition like 10 GB/hr, then that is about 20 TB, so maybe $1k in storage, which is around 1 developer-day.
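Back-of-envelope, in case anyone wants to check the arithmetic (the ~$50/TB bulk-storage figure comes up later in this thread; the developer-cost equivalences are rough guesses):

```python
# Back-of-envelope for the numbers above. Assumes ~$50/TB for bulk storage
# (a figure used later in this thread); developer-cost equivalences are rough.
HOURS_90_DAYS = 90 * 24                    # 2,160 hours

for rate_gb_per_hr in (999, 10):           # "stretched" vs. reasonable reading
    total_tb = rate_gb_per_hr * HOURS_90_DAYS / 1000
    cost_usd = total_tb * 50               # assumed $/TB, one-time drive cost
    print(f"{rate_gb_per_hr:>4} GB/hr -> {total_tb:>7,.0f} TB, ~${cost_usd:,.0f}")

# 999 GB/hr ->   2,158 TB, ~$107,892
#  10 GB/hr ->      22 TB, ~$1,080
```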
A few years ago I joined a company aggressively trying to reduce their AWS costs. My jaw hit the floor when I realized they were spending over a million dollars a month in AWS fees. I couldn't understand how they got there with what they were actually doing. I feel like this comment perfectly demonstrates how that happens.
AWS also purposefully makes it easy to shoot yourself in the foot. Case in point that we were burned on recently:
- set up some service that talks to a s3 bucket
- set up that bucket in the same region/datacenter
- send a decent but not insane amount of traffic through there (several hundred GB per day)
- assume that you won’t get billed any data transfer fees since you’re talking to a bucket in the same data center
- receive massive bill under “EC2-Other” line item for NAT data transfer fees
- realize that, with the default setup, AWS routes all of that traffic through the NAT gateway even though it just turns around and goes back into the same data center it came from, and bills exorbitant fees for it (the usual fix, a gateway endpoint, is sketched after this list)
- come to the conclusion that this is obviously a racket designed to extract money from unsuspecting people, because there is almost no situation where you would want that default, and discover that hundreds to thousands of other people have been screwed in exactly the same way for years (and there is a documented trail of it[1])
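For anyone hitting this now: the usual workaround is a gateway VPC endpoint for S3, which routes same-region S3 traffic directly instead of through the NAT gateway and has no per-GB processing charge. A minimal boto3 sketch, with placeholder IDs and region:

```python
# Hypothetical sketch: create a gateway VPC endpoint for S3 so same-region
# S3 traffic bypasses the NAT gateway (and its per-GB processing fee).
# The region, VPC ID, and route table ID are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",              # placeholder
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],    # placeholder
)
```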
Developers will do the simplest thing that solves the problem.
If the solutions are:
* rewrite that part to add retention, use better compression, or spend the next month deciding which data to keep and which can be removed early
* wiggle a thing in a panel/API to give it more space
The second will win every single time, unless there is pushback or it hits the 5% of developers who actually care about good architecture rather than just delivering tickets.
They're pricing hot storage at $50/TB (not per month). That is definitely not AWS or anything like it.
On a per-month basis, the grossly exaggerated number is in the single-digit thousands of dollars. The non-exaggerated number is down in the double digits.
$50/TB is a lowball if you want much of the data on SSDs, but taking an analysis server and stuffing 20 TB of SSD into it (plus RAID, plus room for growth) is a very small cost compared to repeated debugging sessions, especially because the SSDs only have to deal with about 0.01 DWPD.
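Quick sanity check on that DWPD figure, assuming roughly 10 GB/hr of ingest landing on about 20 TB of flash:

```python
# Rough check of the DWPD claim: ~10 GB/hr of logs written to ~20 TB of flash.
ingest_gb_per_day = 10 * 24            # 240 GB/day
capacity_gb = 20_000                   # 20 TB of SSD
print(f"{ingest_gb_per_day / capacity_gb:.3f} DWPD")   # 0.012 -- the ~0.01 above
```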
Related to that, last year Uber's engineering blog mentioned very interesting results with their internal log service [1].
I wonder if there's anything as good in the open-source world. The closest thing I can think of is Clickhouse's "new" JSON type, which is backed by columnar storage with dynamic columns [2].
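Roughly what that looks like, as a sketch; this assumes a recent ClickHouse where the new JSON type still sits behind the allow_experimental_json_type setting, and the table/column names are made up:

```python
# Sketch only: ClickHouse's newer JSON type via the clickhouse_connect client.
# Assumes a recent server where the type is still gated behind
# allow_experimental_json_type; table and column names are made up.
import clickhouse_connect

client = clickhouse_connect.get_client(
    host="localhost",
    settings={"allow_experimental_json_type": 1},
)

client.command("""
    CREATE TABLE IF NOT EXISTS app_logs (
        ts  DateTime,
        msg JSON            -- dynamic subcolumns, stored columnar
    ) ENGINE = MergeTree ORDER BY ts
""")

client.command(
    """INSERT INTO app_logs VALUES (now(), '{"level":"error","user_id":42,"path":"/login"}')"""
)

# Dotted paths read individual dynamic subcolumns without re-parsing whole rows.
print(client.query("SELECT msg.level, msg.path FROM app_logs").result_rows)
```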
The design described there is what Uber should have been logging in the first place. Instead they log the fully resolved message and then compress it back into the templated form.
However, compressing back into the templated form is a good idea when you have third-party logs you want to store and cannot rewrite the logging to emit the templated form in the first place.
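A minimal sketch of the difference, with made-up names and a print() standing in for whatever sink you actually ship to: keep the static template and the dynamic arguments separate at the call site, rather than formatting the message and re-deriving the template later.

```python
# Hypothetical sketch of "log the template, not the resolved message".
# The function name and the print() sink are made up; the point is keeping
# the static template string and the dynamic arguments separate.
import json, time

def log_event(template: str, **args):
    record = {
        "ts": time.time(),
        "template": template,   # static string, deduplicates/compresses well
        "args": args,           # dynamic values kept out of the string
    }
    print(json.dumps(record))   # stand-in for whatever sink you actually use

# Templated form: the template can become a shared dictionary entry.
log_event("user {user_id} failed login from {ip}", user_id=42, ip="10.0.0.7")

# Fully resolved form: the template has to be recovered later, which is the
# compression step the Uber post describes.
print(json.dumps({"ts": time.time(), "msg": "user 42 failed login from 10.0.0.7"}))
```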
Neat! The only downside of this approach is having to force developers to use the library, which can work in some companies. On the other hand, other approaches discussed previously, like Uber's, don't require any change to the application code, which should make adoption way simpler.
Sure, whatever, a factor of 10 here or there hardly matters. I literally misinterpreted "multiple gigabytes per hour" as 999 GB/hr instead of a much more reasonable 10 GB/hr. I overestimated the data rate by a factor of ~100 and the number still comes out "reasonable", i.e. a cost that can be paid if the cost/benefit is there.
Unless you want to claim that storage capable of 3 MB/s of I/O costs $5,000/TB, "multiple gigabytes per hour" with 90-day retention for a team's worth of logging is not stupid on its face. Not to say it is an efficient or smart solution, but it is certainly not the "look at this insane request by developers" situation the person I was originally responding to was making it out to be.
Personally, I would probably question the competence of the team if they had that sort of logging rate with manual logging statements, but I am merely pointing out that “multiple gigabytes per hour” for 90 days is not crazy on its face and a plausible business case could be made for it even with a relatively modest engineering team.
My recent discussions with multiple SAN vendors, as well as quoting out the cost of DIY storage, put that number far away from "reasonable". I do not claim storage is $5,000/TB, but it is substantially higher than the $50/TB you're estimating.
It's difficult to estimate the log throughput in this scenario. Cisco with debug all enabled can overload the device's CPU; systems like sssd can generate megabytes of logs for a single login.
All of this is really missing the core issue though. A 2PB system is nontrivial to procure, nontrivial to run, and if you want it to be of any use at all you're going to end up purchasing or implementing some kind of log aggregation system like Splunk. That incurs lifecycle costs like training and implementation, and then you get asked about retention and GDPR.... and in the process, lose sight of whether this thing you've made actually provides any business value.
IT is not an end in itself, and if these logs are unlikely to be used, the question is less about dollars-per-developer-hour and more about preventing IT scope creep and the accumulation of cruft that can mature into technical debt.
But you wouldn't use a SAN here. SAN pricing is far away from reasonable for this situation.
For the 20TB case, you can fit that on 1 to 4 drives. It's super cheap. Plus probably a backup hard drive but maybe you don't even need to back it up.
For the 2PB case, you probably want multiple search servers that have all the storage built in. There's definitely cost increases here, but I wouldn't focus too much on it, because that was more of a throwaway. Focus more on the 20TB version.
> That incurs lifecycle costs like training and implementation
Those don't relate much to the amount of storage.
> and then you get asked about retention and GDPR....
It's 90 days. Maybe you throw in a filter. It's not too difficult.
> if these logs are unlikely to be used
The devs are complaining about the search features, so it sounds like the logs are being used.
> preventing IT scope creep and the accumulation of cruft that can mature into technical debt
Sure, that's reasonable. But that has nothing to do with the amount of storage.
For their use case it sounds like they also wanted to index the heck out of it for near-instant lookups and the like. So you probably need to double the data size (rough guess) to include the indexes, and it may need some dedicated server nodes just for processing/ingestion/indexing, etc.
The idea that anyone would find storing 20TB of plain text logs for a normal service reasonable is quite amusing.
Don't get me wrong, I understand that a single-digit kUSD/month is peanuts against developer productivity gains, but I still wouldn't be able to take a developer making that suggestion seriously. I would also seriously question internal processes, GDPR (or equivalent) compliance, and whether the system actually brings benefit or if it is just lazy "but what if" thinking.
Seems pretty reasonable to me.