90 days of retention is only 2,160 hours. Even at 999 GB/hr that is only ~2,160 TB of storage. So, if we stretch the definition of "multiple gigabytes", that is maybe $100k in storage, which is around 3-6 developer-months. If we use a more reasonable definition like 10 GB/hr, then that is about 20 TB, so maybe $1k in storage, which is around 1 developer-day.
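Back-of-envelope, in case anyone wants to check the arithmetic (the ~$50/TB bulk-storage figure comes up later in this thread; the developer-cost equivalences are rough guesses):

```python
# Back-of-envelope for the numbers above. Assumes ~$50/TB for bulk storage
# (a figure used later in this thread); developer-cost equivalences are rough.
HOURS_90_DAYS = 90 * 24                    # 2,160 hours

for rate_gb_per_hr in (999, 10):           # "stretched" vs. reasonable reading
    total_tb = rate_gb_per_hr * HOURS_90_DAYS / 1000
    cost_usd = total_tb * 50               # assumed $/TB, one-time drive cost
    print(f"{rate_gb_per_hr:>4} GB/hr -> {total_tb:>7,.0f} TB, ~${cost_usd:,.0f}")

# 999 GB/hr ->   2,158 TB, ~$107,892
#  10 GB/hr ->      22 TB, ~$1,080
```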
A few years ago I joined a company aggressively trying to reduce their AWS costs. My jaw hit the floor when I realized they were spending over a million dollars a month in AWS fees. I couldn't understand how they got there with what they were actually doing. I feel like this comment perfectly demonstrates how that happens.
AWS also purposefully makes it easy to shoot yourself in the foot. Case in point that we were burned on recently:
- set up some service that talks to a s3 bucket
- set up that bucket in the same region/datacenter
- send a decent but not insane amount of traffic through there (several hundred GB per day)
- assume that you won’t get billed any data transfer fees since you’re talking to a bucket in the same data center
- receive massive bill under “EC2-Other” line item for NAT data transfer fees
- realize that, with the default setup, AWS routes all of that traffic through the NAT gateway even though it just turns around and goes back into the same data center it came from, and bills exorbitant fees for it (the usual fix, a gateway endpoint, is sketched after this list)
- come to the conclusion that this is obviously a racket designed to extract money from unsuspecting people, because there is almost no situation where you would want that default, and discover that hundreds to thousands of other people have been screwed in exactly the same way for years (and there is a documented trail of it[1])
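For anyone hitting this now: the usual workaround is a gateway VPC endpoint for S3, which routes same-region S3 traffic directly instead of through the NAT gateway and has no per-GB processing charge. A minimal boto3 sketch, with placeholder IDs and region:

```python
# Hypothetical sketch: create a gateway VPC endpoint for S3 so same-region
# S3 traffic bypasses the NAT gateway (and its per-GB processing fee).
# The region, VPC ID, and route table ID are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",              # placeholder
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],    # placeholder
)
```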
Developers will do the simplest thing that solves the problem.
If the solutions are:
* rewrite that part to add retention, use better compression, or spend the next month deciding which data to keep and which can be removed early
* wiggle a thing in a panel/API to give it more space
The second will win every single time, unless there is pushback or it hits the 5% of developers who actually care about good architecture rather than just delivering tickets.
They're pricing hot storage at $50/TB (not per month). That is definitely not AWS or anything like it.
On a per-month basis, the grossly exaggerated number is in the single-digit thousands of dollars. The non-exaggerated number is down in the double digits.
$50/TB is a lowball if you want much of the data on SSDs, but taking an analysis server and stuffing 20 TB of SSD into it (plus RAID, plus room for growth) is a very small cost compared to repeated debugging sessions, especially because the SSDs only have to deal with about 0.01 DWPD.
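Quick sanity check on that DWPD figure, assuming roughly 10 GB/hr of ingest landing on about 20 TB of flash:

```python
# Rough check of the DWPD claim: ~10 GB/hr of logs written to ~20 TB of flash.
ingest_gb_per_day = 10 * 24            # 240 GB/day
capacity_gb = 20_000                   # 20 TB of SSD
print(f"{ingest_gb_per_day / capacity_gb:.3f} DWPD")   # 0.012 -- the ~0.01 above
```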
Related to that, last year Uber's engineering blog mentioned very interesting results with their internal log service [1].
I wonder if there's anything as good in the open-source world. The closest thing I can think of is Clickhouse's "new" JSON type, which is backed by columnar storage with dynamic columns [2].
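Roughly what that looks like, as a sketch; this assumes a recent ClickHouse where the new JSON type still sits behind the allow_experimental_json_type setting, and the table/column names are made up:

```python
# Sketch only: ClickHouse's newer JSON type via the clickhouse_connect client.
# Assumes a recent server where the type is still gated behind
# allow_experimental_json_type; table and column names are made up.
import clickhouse_connect

client = clickhouse_connect.get_client(
    host="localhost",
    settings={"allow_experimental_json_type": 1},
)

client.command("""
    CREATE TABLE IF NOT EXISTS app_logs (
        ts  DateTime,
        msg JSON            -- dynamic subcolumns, stored columnar
    ) ENGINE = MergeTree ORDER BY ts
""")

client.command(
    """INSERT INTO app_logs VALUES (now(), '{"level":"error","user_id":42,"path":"/login"}')"""
)

# Dotted paths read individual dynamic subcolumns without re-parsing whole rows.
print(client.query("SELECT msg.level, msg.path FROM app_logs").result_rows)
```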
The design described there is what Uber should have been logging in the first place. Instead they log the fully resolved message and then compress it back into the templated form.
However, compressing back into the templated form is a good idea when you have third-party logs you want to store and cannot rewrite the logging to emit the templated form in the first place.
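A minimal sketch of the difference, with made-up names and a print() standing in for whatever sink you actually ship to: keep the static template and the dynamic arguments separate at the call site, rather than formatting the message and re-deriving the template later.

```python
# Hypothetical sketch of "log the template, not the resolved message".
# The function name and the print() sink are made up; the point is keeping
# the static template string and the dynamic arguments separate.
import json, time

def log_event(template: str, **args):
    record = {
        "ts": time.time(),
        "template": template,   # static string, deduplicates/compresses well
        "args": args,           # dynamic values kept out of the string
    }
    print(json.dumps(record))   # stand-in for whatever sink you actually use

# Templated form: the template can become a shared dictionary entry.
log_event("user {user_id} failed login from {ip}", user_id=42, ip="10.0.0.7")

# Fully resolved form: the template has to be recovered later, which is the
# compression step the Uber post describes.
print(json.dumps({"ts": time.time(), "msg": "user 42 failed login from 10.0.0.7"}))
```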
Neat! The only downside of this approach is having to force developers to use the library, which can work in some companies. On the other hand, other approaches discussed previously, like Uber's, don't require any change to the application code, which should make adoption way simpler.
Sure, whatever, a factor of 10 here or there hardly matters. I literally misinterpreted "multiple gigabytes per hour" as 999 GB/hr instead of a much more reasonable 10 GB/hr. I overestimated the data rate by a factor of ~100 and the number still comes out "reasonable", i.e. a cost that can be paid if the cost/benefit is there.
Unless you want to claim that storage capable of 3 MB/s of I/O costs $5,000/TB, "multiple gigabytes per hour" with 90-day retention for a team's worth of logging is not stupid on its face. Not to say it is an efficient or smart solution, but it is certainly not the "look at this insane request by developers" situation the person I was originally responding to was making it out to be.
Personally, I would probably question the competence of the team if they had that sort of logging rate with manual logging statements, but I am merely pointing out that “multiple gigabytes per hour” for 90 days is not crazy on its face and a plausible business case could be made for it even with a relatively modest engineering team.
My recent discussions with multiple SAN vendors, as well as quoting out the cost of DIY storage, put that number far away from "reasonable". I do not claim storage is $5,000/TB, but it is substantially higher than the $50/TB you're estimating.
It's difficult to estimate the log throughput in this scenario. Cisco with debug all enabled can overload the device's CPU; systems like sssd can generate megabytes of logs for a single login.
All of this is really missing the core issue though. A 2PB system is nontrivial to procure, nontrivial to run, and if you want it to be of any use at all you're going to end up purchasing or implementing some kind of log aggregation system like Splunk. That incurs lifecycle costs like training and implementation, and then you get asked about retention and GDPR.... and in the process, lose sight of whether this thing you've made actually provides any business value.
IT is not an end in itself, and if these logs are unlikely to be used, the question is less about dollars-per-developer-hour and more about preventing IT scope creep and the accumulation of cruft that can mature into technical debt.
But you wouldn't use a SAN here. SAN pricing is far away from reasonable for this situation.
For the 20TB case, you can fit that on 1 to 4 drives. It's super cheap. Plus probably a backup hard drive but maybe you don't even need to back it up.
For the 2PB case, you probably want multiple search servers that have all the storage built in. There's definitely cost increases here, but I wouldn't focus too much on it, because that was more of a throwaway. Focus more on the 20TB version.
> That incurs lifecycle costs like training and implementation
Those don't relate much to the amount of storage.
> and then you get asked about retention and GDPR....
It's 90 days. Maybe you throw in a filter. It's not too difficult.
> if these logs are unlikely to be used
The devs are complaining about the search features, so it sounds like the logs are being used.
> preventing IT scope creep and the accumulation of cruft that can mature into technical debt
Sure, that's reasonable. But that has nothing to do with the amount of storage.
For their use case it sounds like they also wanted to index the heck out of it for near-instant lookups and the like. So you probably need to double the data size (rough guess) to include the indexes, and it may need some dedicated server nodes just for processing/ingestion/indexing, etc.
The idea that anyone would find storing 20TB of plain text logs for a normal service reasonable is quite amusing.
Don't get me wrong, I understand that a single-digit kUSD/month is peanuts against developer productivity gains, but I still wouldn't be able to take a developer making that suggestion seriously. I would also seriously question internal processes, GDPR (or equivalent) compliance, and whether the system actually brings benefit or if it is just lazy "but what if" thinking.
Seems pretty reasonable to me.