I was responsible for some DevOps stuff at a state's health department, and one of the more infuriating things about working at that place was that getting more storage allocated was like pulling teeth. Our backups would be running out of disk and they'd allocate me 50 or 100 GB at a time. I'm sure some Toyototian had been yelling for the past six months that this was going to happen.
I worked with an admin like that. We had a huge cluster, but he was stingy with storage space for a service that was critical to the operation of the org.
And I get that this is a good mindset for not wasting space overall, but if a single backup fills 90% of your storage space in test use, that machine is not ready for production. And we are not talking about a lot of space here: the backup was maybe 30 GB, the disk 40 GB. He could have easily just allocated 100 GB and called it a day; instead we had to go to him three times to scale it up in 10 GB steps, each time with the added stress of figuring out why things were failing (something the admin should have seen on his monitoring system).
Admins are my heroes, but please, if you allocate disk space, just take the biggest expected backup and multiply it by π. And if you need to be stingy with storage for some reason, be stingy, but decide when and where to be stingy, and at least keep an eye on the monitoring and upsize the storage before it is too late.
Having been on both sides (admin and developer), developers are notoriously bad at estimating how much space they need. You can't give them carte blanche to the storage because they'll waste it and consume as much as they're given without a thought to conserving it. And then when you put limits in, they'll whine and complain until they get what they want. Being an Artifactory service provider for a large IT dept gave me a direct view into how hard it is to manage storage for developers. And as a developer using Artifactory, I don't want to worry about storage, I just want my builds and CI pipelines to complete.
This reminds me of a time we were helping a dev team bring logging in house because they didn't like the features of their logs-as-a-service provider.
They set all applications to "debug" level logs in production and were generating multiple gigabytes of logs per hour.
They wanted 90 days retention, and the ability to do advanced searching through the live log data so they could debug in production (they didn't really use their dev or stage environments, or have a process for documenting and reproducing bugs).
90 days of retention is only 2,160 hours. Even at 999 GB/hr that is only ~2,160 TB of storage. So, if we stretch the definition of “multiple gigabytes”, that is maybe $100k in storage, which is around 3-6 developer-months. If we use a more reasonable definition like 10 GB/hr, then that is about 20 TB, so maybe $1k in storage, which is around one developer-day.
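Spelled out as a quick back-of-envelope sketch (the two GB/hr rates are the readings above; the $50/TB hot-storage price is an assumption, roughly in line with the figure quoted further down the thread):

```python
# Back-of-envelope for the numbers above. Rates and price are assumptions,
# not a quote from any real provider.
def retention_cost(gb_per_hour, retention_days=90, usd_per_tb=50.0):
    hours = retention_days * 24          # 90 days -> 2,160 hours
    tb = gb_per_hour * hours / 1000      # total volume retained, in TB
    return tb, tb * usd_per_tb

print(retention_cost(999))  # (~2,158 TB, ~$108,000)  the stretched reading
print(retention_cost(10))   # (~21.6 TB,  ~$1,080)    the reasonable reading
```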
A few years ago I joined a company aggressively trying to reduce their AWS costs. My jaw hit the floor when I realized they were spending over a million a month in AWS fees. I couldn't understand how they got there with what they were actually doing. I feel like this comment perfectly demonstrates how that happens.
AWS also purposefully makes it easy to shoot yourself in the foot. Case in point that we were burned on recently:
- set up some service that talks to a s3 bucket
- set up that bucket in the same region/datacenter
- send a decent but not insane amount of traffic through there (several hundred GB per day)
- assume that you won’t get billed any data transfer fees since you’re talking to a bucket in the same data center
- receive massive bill under “EC2-Other” line item for NAT data transfer fees
- realize that AWS routes all that traffic through the NAT gateway by default, billing exorbitant per-GB fees, even though it’s just turning around and going back into the same data center it came from
- come to the conclusion that this is obviously a racket designed to extract money from unsuspecting people, because there is almost no situation where you would want that default, and discover that hundreds to thousands of other people have been screwed in the exact same way for years (and there is a documented trail of it[1]); the usual workaround is sketched below
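For anyone hitting the same thing: assuming the traffic really is instance-to-S3 within a single region and a reasonably standard VPC, a gateway VPC endpoint for S3 (which is free) keeps that traffic off the NAT gateway. A minimal boto3 sketch, with placeholder IDs:

```python
# Minimal sketch: add a free S3 gateway endpoint so same-region S3 traffic
# stops flowing through the per-GB-billed NAT gateway. All IDs are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",             # the VPC the service runs in
    ServiceName="com.amazonaws.us-east-1.s3",  # S3 in the same region
    RouteTableIds=["rtb-0123456789abcdef0"],   # route tables of the affected subnets
)
# The endpoint injects routes for S3's prefix list into those route tables,
# so instances reach S3 directly instead of via the NAT gateway.
```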
Developers will do the simplest thing that solves the problem.
If the solutions are:
* rewrite that part to add retention, or use better compression, or spend the next month deciding which data to keep and which can be removed early
* wiggle a thing in the panel/API to give it more space
The second will win every single time unless there is pushback, or it hits the 5% of developers who actually care about good architecture rather than just delivering tickets.
They're pricing hot storage at $50/TB (not per month). That is definitely not AWS or anything like it.
On a per-month basis, the grossly exaggerated number is in the single thousands. The non-exaggerated number is down in the double digits.
$50/TB is a lowball if you want much of the data to be on SSDs, but taking an analysis server and stuffing in 20TB of SSD (plus RAID, plus room for growth) is a very small cost compared to repeated debugging sessions. Especially because the SSD has to deal with about 0.01 DWPD.
Related to that, last year Uber's engineering blog mentioned very interesting results with their internal log service [1].
I wonder if there's anything as good in the open-source world. The closest thing I can think of is Clickhouse's "new" JSON type, which is backed by columnar storage with dynamic columns [2].
The design described there is what Uber should have been logging in the first place. Instead they are logging the fully resolved message and then compressing it back into the templated form.
However, compressing back into the templated form is a good idea if you have third-party logs that you want to store and cannot rewrite the logging to generate the correct form in the first place.
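To make the distinction concrete, here is a minimal sketch (illustrative only, not the actual format from the Uber post) of logging the template plus its arguments instead of the fully resolved string:

```python
# Sketch of the idea: keep the message template and its arguments separate
# instead of storing the fully resolved string, so identical templates
# dedupe and compress extremely well.
import json

def log_structured(template: str, **args):
    # In a real system the static template would be stored once (e.g. keyed
    # by a hash) and only the dynamic arguments kept per record.
    record = {"template": template, "args": args}
    print(json.dumps(record))

# Resolved form would be "user 42 logged in from 10.0.0.7" -- unique every time.
# Templated form: the template repeats verbatim, the args are tiny.
log_structured("user {user_id} logged in from {ip}", user_id=42, ip="10.0.0.7")
```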
Neat! The only downside of this approach is having to force the developers to use the library, which can work in some companies. On the other hand, other approaches discussed previously like Uber's don't require any change in the application code, which should make adoption way simpler.
Sure, whatever, a factor of 10 here or there hardly matters. I deliberately read “multiple gigabytes per hour” as 999 GB/hr instead of a much more reasonable 10 GB/hr. I overestimated the data rate by a factor of 100 (about 10,000%) and the number still comes out “reasonable”, i.e. a cost that can be paid if the cost/benefit is there.
Unless you want to claim storage costs $5,000/TB, then 3 MB/s of I/O (“multiple gigabytes per hour”) with 90-day retention for a team’s worth of logging is not stupid on its face. Not to say that it is an efficient or smart solution, but it is certainly not the “look at this insane request by developers” that the person I was originally responding to was making it out to be.
Personally, I would probably question the competence of the team if they had that sort of logging rate with manual logging statements, but I am merely pointing out that “multiple gigabytes per hour” for 90 days is not crazy on its face and a plausible business case could be made for it even with a relatively modest engineering team.
My recent discussions with multiple SAN vendors, as well as quoting out the cost to DIY storage, put that number far away from "reasonable". I do not claim storage is $5,000/TB, but it is substantially higher than the $50/TB you're estimating.
It's difficult to estimate the log throughput in this scenario. Cisco on debug all can overload the device's CPU; systems like sssd can generate MB of logs for a single login.
All of this is really missing the core issue though. A 2PB system is nontrivial to procure, nontrivial to run, and if you want it to be of any use at all you're going to end up purchasing or implementing some kind of log aggregation system like Splunk. That incurs lifecycle costs like training and implementation, and then you get asked about retention and GDPR.... and in the process, lose sight of whether this thing you've made actually provides any business value.
IT is not an end in itself, and if these logs are unlikely to be used, the question is less about dollars-per-developer-hour and more about preventing IT scope creep and the accumulation of cruft that can mature into technical debt.
But you wouldn't use a SAN here. SAN pricing is far away from reasonable for this situation.
For the 20TB case, you can fit that on 1 to 4 drives. It's super cheap. Plus probably a backup hard drive but maybe you don't even need to back it up.
For the 2PB case, you probably want multiple search servers that have all the storage built in. There's definitely cost increases here, but I wouldn't focus too much on it, because that was more of a throwaway. Focus more on the 20TB version.
> That incurs lifecycle costs like training and implementation
Those don't relate much to the amount of storage.
> and then you get asked about retention and GDPR....
It's 90 days. Maybe you throw in a filter. It's not too difficult.
> if these logs are unlikely to be used
The devs are complaining about the search features, it sounds like the logs are being used.
> preventing IT scope creep and the accumulation of cruft that can mature into technical debt
Sure, that's reasonable. But that has nothing to do with the amount of storage.
For their use case it sounds like they wanted to index the heck out of it for near-instant lookups and similar, too. So you probably need to double the data size (rough guess) to include the indexes, and it may need some dedicated server nodes just for processing/ingestion/indexing, etc.
The idea that anyone would find storing 20TB of plain text logs for a normal service reasonable is quite amusing.
Don't get me wrong, I understand that a single-digit kUSD/month is peanuts against developer productivity gains, but I still wouldn't be able to take a developer making that suggestion seriously. I would also seriously question internal processes, GDPR (or equivalent) compliance, and whether the system actually brings benefit or if it is just lazy "but what if" thinking.
You have to fill out a load chart to fly a plane; they should have to fill out something like a storage chart to get a production allocation. What size are your objects? How many per unit of time and served entity? What is the lifetime of those objects? How is that lifetime managed?
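As a sketch of what such a chart boils down to (the function name, example numbers, and the 1.5x overhead factor are all made up for illustration):

```python
# A back-of-envelope "storage chart" calculator. Numbers are illustrative,
# not from any real form.
def steady_state_storage_gb(object_size_kb: float,
                            objects_per_day: float,
                            lifetime_days: float,
                            overhead_factor: float = 1.5) -> float:
    """Steady-state footprint once retention kicks in: size * rate * lifetime,
    padded for indexes, replication and growth."""
    raw_gb = object_size_kb * objects_per_day * lifetime_days / 1_000_000
    return raw_gb * overhead_factor

# e.g. 5 KB objects, 2 million per day, kept 90 days -> ~1,350 GB with overhead
print(f"{steady_state_storage_gb(5, 2_000_000, 90):.0f} GB")
```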
If you agree to add a few months of development time and reduce future velocity to make sure these limits are enforced, sure. Usually adding storage costs about as much as one developer’s salary for, what, an hour? A day?
You missed the part where I said they are "notoriously bad at estimating". We really do suck at estimating everything... storage, work estimates, etc. Why can't we just say "it'll be done when it's done and I'll use ALL the storage until I'm done"?
I mean, in my case it was literally a database filled with dummy records for roughly the number of people currently in our org, so that database was essentially the size of the whole project. He just didn't plan for the size of the backups (backups were his job, not ours).
Allocated storage should come directly from the consuming team's budget. Divvy up the total storage cost and allocate in proportion to requested limits.
Sure, what are you going to bill me for a 30GB VM on a 100TB cluster? Whether I want 30GB or 100GB for an absolute central service for the whole org shouldn't matter. If we are talking about personal pet projects or user accounts — sure — but that wasn't my complaint here.
> You can't give them carte blanche to the storage because they'll waste it
So what? Just buy more. Storage is cheap.
It's hard to have a discussion here without understanding the scales involved. Is the problem that they're wasting 100 GB or 100 TB? And if the issue is truly that they're wasting 100 TB, then clamp down on it as part of cost reduction efforts. The truth in most organizations is you get rewarded for eliminating mountains of waste, but trying to prevent the waste in the first place brands you as someone difficult to work with who is standing in the way. Why not lean into that?
Be stingy in a smart way. I'd like a description of why you need the storage, an estimate of how much, and a projection of growth over the next 6-12 months, though the latter can wait a month or three for something new. Beyond a certain scale we'd need a PO or a project to write up the cost too. And yes, we'll start bugging you again once it starts filling up to 80% or more, because we don't want your systems to fail :)
But this way, I directly get an idea of the increase in storage we will need over the next year in order to plan the next hardware expansion.
Since APFS, "large" empty files can also take only the 4K or so for the filesystem entry on disk. You've got to make sure it's not a sparse file, and doing so is a bit tricky; better to fill it with junk that doesn't compress well.
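For example, a minimal sketch of that (the path and size are placeholders; random bytes defeat both sparse allocation and filesystem compression):

```python
# Create a non-sparse, poorly-compressible reserve file that can be deleted
# in an emergency to free real disk space. Path and size are illustrative.
import os

RESERVE_PATH = "/var/reserve.bin"   # hypothetical location
RESERVE_BYTES = 2 * 1024**3         # 2 GB of breathing room
CHUNK = 1024 * 1024

with open(RESERVE_PATH, "wb") as f:
    written = 0
    while written < RESERVE_BYTES:
        # Random bytes don't compress and force real block allocation,
        # so neither sparse files nor filesystem compression can cheat.
        f.write(os.urandom(CHUNK))
        written += CHUNK
    f.flush()
    os.fsync(f.fileno())

# When the disk fills up, deleting this file instantly frees real space.
```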
It is not an entirely unreasonable idea. If a system runs out of disk space, an unexpectedly large number of operations will fail, which can make recovery more problematic than you would assume. If you can immediately recover some disk space and get breathing room, it could make the difference in restoring service.
Keeping a bit of disk reserved for recovery is extremely common with copy-on-write filesystems like ZFS & BTRFS. Even deletion takes some extra space, so without a reservation it's effectively impossible to delete any files from a full disk.
This is an alert that must be fixed immediately, with an escape hatch to fix it quickly in case you truly don’t have time to manage your actual files.
It is also trivial to set up, and does not require me to figure out how to set up an OS alert, or to trust that whatever alerting process is supposed to be running actually is. So it is an essentially failproof alert that works the same on any OS.
In my experience this makes the problem worse. People either compensate for it, or stop trusting clocks at all. Usually a mix of both of those resulting in even less punctuality.
So, you are not expecting that your co-workers have good reasons for what they are doing? Maybe the hiring bar at your place is too low then.
I prefer to work at places where my default assumption is that everybody around me is smart and responsible. Lifts lots of worries off my shoulders (and tends to benefit the stock price over time too and thereby my income).
My coworkers have called out gaps in my thinking thousands of times when I have explained perceived needs to them, that's one of the main value-adds one gets from working in a team.
If I wanted unquestioned control, I'd run my own shop. If I want the best product, then I hope that people question my assumptions.
We are not in disagreement here. Bouncing off ideas and thoughts is a good thing.
The way this was phrased was more from the angle "who knows what these guys were thinking; if they can't give me a good reason, no way they will get storage space as I don't trust that they make good decisions on their own".
Generally? No. Not because they are not smart, but because in a large company each individual has different goals and priorities (that's why we have e.g. SREs as dedicated roles), and it takes a bit of effort to find the intersection between all of these.
Let's say I work in DevOps and want to optimize cloud costs. In that case, I would challenge the size of everything, the use of higher-cost services, the number of regions, all of that. But the team might want more regions and bigger resources to improve latency and performance, use more high-cost services for developer experience, and ship features without having to think about utilization.
It's a tug of war, and only works when you have forces on both sides to balance out. Being too conservative might stall innovation or make things too slow to save a buck, not being conservative enough might drain funds or make things impossible to scale.
I believe you are intentionally misunderstanding. The term "tug of war" is not used to indicate armed conflict or even a problem. It indicates balancing forces that you want to maintain - pull the rope too far to one side, and you end up in a suboptimal extreme.
Unless you work with clones of yourself, there will always be differences in opinions and priorities, and not every feature and bug fix can be a company-wide stakeholder meeting, and you certainly will not get any social points for trying to micro-manage other teams.
Of course there will be differences. That's why you sit down and plan things together, pulling in and coordinating with all _relevant_ stakeholders. Of course not the whole company.
But the attitude needs to be "let's put the requirements on the table and see what we can do" instead of "you don't get what you want unless you give me a good reason". The latter comes from an angle of distrust which I'm arguing against. The former comes from an angle of collaborative problem solving.
In a company in which I go to a team relevant to a project and like to engage in a discussion and am met with an attitude of "unless you give us a good reason we'll stop talking to you", the atmosphere is not one that will keep me personally for long. YMMV.
> I believe you are intentionally misunderstanding.
You are free to believe what you like. Opening a reply with such a sentence is pretty sad though. It does not foster a healthy atmosphere, nor does it match reality, I might add.
> Opening a reply with such a sentence is pretty sad though. It does not foster a healthy atmosphere, nor does it match reality, I might add.
Your response latched onto a single word ("war") within a common phrase ("tug of war", a game). While it might have been accidental, such answers distract from the actual discussion (and tend to be used as distractions when no good answer is present).
> Of course there will be differences. That's why you sit down and plan things together, pulling in and coordinating with all _relevant_ stakeholders.
When you discuss new architectures or large projects, this is a given, but it covers only a small portion of company operation. The rest is organic day-to-day work, which slowly but surely distorts the initial assumptions; slowly boiling the frog, so to speak. Think of one team making changes that affect request patterns, another team making something that is accidentally quadratic, and a third team suddenly asking for a large number of cloud resources to carry all of this, which should absolutely be challenged.
And at the same time, teams sit under different organizational units with different budgets, schedules, leadership and priorities, and most certainly don't care about the daily scrum work of other teams.
> In a company in which I go to a team relevant to a project and like to engage in a discussion and am met with an attitude of "unless you give us a good reason we'll stop talking to you", the atmosphere is not one that will keep me personally for long. YMMV.
No one said "we'll stop talking to you", but rather "you get what can be justified". If you take offense at being challenged and would rather work somewhere else, you do you, but if you can't justify your request I'd argue that you are not doing your job properly in the first place.
There is a difference between spending $2000+ on a new computer and $10, which is about what a terabyte costs.
Probably just having the discussion itself would waste more resources than just giving the storage space.
Dealing with the gatekeeping often costs more in dev time than just approving, especially when the DevOps folks think they know better. Thank goodness a tech director can step in and break the impasse.
We tried. Devs, when given more space, just didn't bother to clean up old crap and the exact same thing happened, just with a few months of delay. So now we generally just ask how much they need and bill the project for it, so that's also on them.
But, unlike Toyota, we do have disk space alerts.
Sometimes the problem is also entirely political: management needs to tell the client and charge them for more storage, and won't accept the change until that happens. Meanwhile the clock is ticking...
In my case the data was a dummy database with more dummy users than there are people at our org (maybe 50% more), so once this went to production it would likely get smaller.
The problem in this case was twofold:
- the admin's job was to implement the database backups. He didn't factor the backup size into the disk allocation, so this was entirely his own fault.
- the database stores certain transactions for a certain period, so usage grew initially until it settled at a certain level. Because the storage margin was so slim, this caused the problem.
Most of our cases where that happened were either a lack of planning or a lack of communicating that plan. By far the most common one was "neither dev nor client knows the data volume over a longer period", which is fine as long as that is also communicated, but that's often a problem too.
But of course I'm not denying that there are shitty, incompetent ops departments. For another customer we were dealing with an ops department that had:
* backup storage (some remote FTPS server IIRC) provisioned so slowly that the backup wouldn't finish copying within 24 hours, and the backup was under a terabyte.
* weeks-long delays for any resize.
He was just training the users to go the shadow-IT route with self-bought, uncontrolled storage, most probably in the form of personal USB drives and/or departmental consumer NAS devices.
I've tested both π and g, and while they both work well, g results in far fewer disk full errors. I've heard c works even better, though I haven't tried it yet.
> I've tested both π and g, and while they both work well, g results in far fewer disk full errors. I've heard c works even better, though I haven't tried it yet.
Good to know. FWIW, i should also be avoided. It's tempting to use, since most programs use it as a counter, so it /should/ standardize the log file sizes. But in practice it's very tricky to get a definitive disk space requirement with it.
I work for a retailer where the service I’m responsible for is used by every cash register around the world for certain operations. When I came in, the RDS DB for this service had 60GB allocated to it, had literally just run out of space and caused an outage. The last team just gave it an additional 20GB. A month later, I was put in charge of it and it was already 5GB away from running out of space again. I put an end to that and gave it 250GB. The cost is minimal compared to a store not being able to open due to an outage.
The instances for the service itself had 20GB of EBS allocated to them. Luckily they don’t need much local storage. But that’s typical here. There’s a Jenkins instance that is even more of a pain. I’m not responsible for it but every week or two one of the worker nodes runs out of space because they’re given 8GB of storage space. I’m just watching the disaster unfold over the course of a year and a half as I’m constantly telling that team to just up the storage space on the worker nodes instead of constantly having to fiddle with cron jobs.
It’s not even an expense thing. They just… don’t want to increase the storage space. It drives me insane.
I'd guess the worry is that once you increase the storage, you never decrease it again. Ever. It's a one-way street. So, once everything is 5x over-provisioned, then the services tend to fill that space anyway (cause why not be wasteful if it doesn't cost anything) and a year later you are in the same seat again.
I'm not saying this is real, but the worry certainly is.
That's certainly real and something to consider when provisioning systems. I'm fully on board with that. The problem is when the cost of the cost-savings solution vastly outweighs the cost of over-provisioning infrastructure. Like this Jenkins issue bubbling up ~2-4 times a month vs just giving the worker nodes more storage space. There's been times where it happened during the night and people got paged.
Or comparing the cost of one store not being able to open on time because the RDS database's space ran out. VPs and directors start yelling and there's suddenly like 20+ people involved in figuring out why this one store didn't open on time. What's the cost of that compared to just giving the DB 250GB of space so this never comes up again?
But you are also 100% correct and I've seen that happen here, too. There's some instances I'm responsible for that were using EFS for their local storage. Costing thousands of dollars every month for absolutely no reason. I switched those to reasonably-sized EBS volumes and that alone was half of my annual savings goal.
I was completely flabbergasted seeing these instances using EFS while others were stuck on 8GB EBS volumes. Backups of the EFS drives had ballooned to many TBs. And the backups were worthless! The instances themselves are ephemeral: they use S3 for long-term storage and metadata lives in a database. Those are the things that should be backed up, and their cost compared to EFS is minuscule.
> compared to just giving the DB 250GB of space so this never comes up again?
As long as there is reasonable confidence in that this is actually the case, then just provision the space and be done with it. That requires a certain understanding of future space requirements/expectations, and anything even just so slightly running away / leaking space will hit any limit given enough time. So, due diligence requires looking at whether it's actually needed.
Yup, I implemented a bunch of graphs and alerts. Right now it's at 100GB of usage so it's still growing but at a fairly predictable rate. Another nice thing to know is if it's possible to reduce that usage. I haven't been able to look into that but I know one of the causes of the usage increase. The service uses the DB to store some indexing data. There's a team forcing it to re-index and I can tell when they deploy because the storage spikes a little bit every time they do a deployment. Nothing I can do about that, sadly.
There was probably an array with a few dozen terabytes to spare and the guy made you run back to him once a month for some misguided job security purpose.
Steelman possibility: There's 20 people in 10 departments giving him conflicting non-written requests for disk space, his own request for new disks is held up somewhere, and his boss keeps telling him "Just do what you're assigned" which doesn't include preventing a full disk condition.
Note that a few dozen terabytes is also nothing. 12 TiB is about $800 on an SSD and about $250 on a HDD. Plus some overhead for the enclosure and redundancy, of course. It costs on the order of a day's pay for an engineer, at most.
I don't excuse wanton waste of storage because it's easy for sloppy practices to balloon to massive confusion and inefficiency. But that discipline should be enforced by good engineering practices, not by limiting resources.
There is absolutely no technical reason to provide all of your storage needs with a single quality of disks. You may do that to get some economies of scale, but if the impact is that high, you should rethink it. You can just as well have two different solutions, one with a huge amount of disk space, little redundancy and low performance, and one with a limited amount of space, plenty of redundancy and high performance.
And yes, I know that goes against every common ops procedure, because disks are so cheap. But the next thing you hear is always that disks are not cheap at all, and you really can't have it both ways.
The problem is that staff time costs an order of magnitude more and most of these things can go wrong. For example, you don’t just want to buy random lots of whatever disks are cheapest because then you have to track recalls and firmware updates for everything.
If you have two tiers available, someone will use the less reliable one because that’s how they fit their budget but then because it’s “in production” they’ll expect the same level of service.
All of these are manageable but what you’re really hearing is that the technical issues are really the tip of the social iceberg most organizations have. One of the reasons people pick AWS isn’t just that it’s usually cheaper than the full cost of rolling your own but that lots of these things don’t affect you: you never fail to provision an EBS volume because the VPs of finance and IT are still arguing about procuring a new rack of disks, people can’t request endless customizations because the options are “take it or leave it”, etc.
Single disks are cheap. Multi-disk storage systems are not. Multi-tiered storage systems are even more expensive. I don't see an inherent conflict there.
Depends on the org. Some orgs are well managed, others don't want to pay for stuff behind the scenes until it bites them in the ass. They don't know what they don't know.
Somewhat recently I toyed with the idea of HDDs with an 'interposer', which would be just a dumb ATA-to-iSCSI interface with Ethernet or WiFi connectivity.
That would allow you to place those drives literally anywhere and with a minimal footprint.
Sadly it would be too costly for home usage (I assume about $30/unit at best), and for enterprise usage... 1 Gbit is too slow, 10 Gbit is too hot and bulky, and then you can't sell a $3 plastic case for $100 as 'vendor approved with light-path(r)(tm) diagnostic indication' for each drive.
Because the enterprise disks are rated for years of continuous service and have things like firmware which doesn’t lie about whether data has been committed durably for the sake of benchmarks on review sites.
None of this means that you should trust any particular disk enough not to need redundancy, backups, etc. Companies can and do make trade-offs based on their needs and management competency, and people have been shifting software for a generation to rely less on the hardware (back when Sun announced ZFS, one of the major appeals was that you could drop expensive hardware RAID controller dependencies in favor of cheap boxes of disks), but there isn’t a single global optimum point.
A lot of enterprise purchases are driven by being able to satisfy your most demanding users with the same service as everyone else so you can avoid needing your admins to be trained and experienced with dozens of different storage systems. That last part especially extends to testing: for example, does your rack of consumer drives with software redundancy come back up cleanly after a kernel panic or power outage, especially a nasty one like a fluctuating brownout?
Depending on your budget, needs, and technical bench depth you might reasonably conclude that the savings are worth the ops work, or that it’s safer to pay an enterprise storage vendor who’ll certify that they’ve done that testing and will have tech support on-site within an hour, or that you’ll use AWS/Azure/GCP because they do even more of that tedious but important work. All of those can be right, but I’ve typically found that people in the first two categories think they’re doing better than they are and would be paying less for better service in the cloud.
Well, "enterprise storage" can actually be multiple, redundant copies of the data behind the scenes.
So, internally the storage team might quote say $1000/TB (simple numbers for an example) for a given storage quantity. And behind the scenes they'll likely have at least redundant storage arrays, plus backups and 24/7 monitoring for all of the data, etc.
Because if you're providing services to a large business, you're typically not managing servers with physical disks attached. You're using a SAN with fiber connections or NFS mounts. Most SANs require specific drives sold by the vendor with firmware they've tested, mounted to custom sleds etc. You can't just connect a WD Mybook drive.
Mom and Pop businesses using single servers can do all they want in regards to drives, I don't care. I would argue they shouldn't have servers, but if you do have servers, you should at minimum use RAID etc, not a drive plugged into your USB port.
If you are in that situation, then obviously you have to do what you have to do.
Absolutely none of that has to apply to the sort of situation we are talking about here.
The only real reason to spend more per disk is when you operate enough disks that failures are a statistical certainty, and extending the average lifespan per disk will definitely save you more than the enterprise markup costs. So you had better have dozens or hundreds of disks in the first place.
For any truly valuable data, you should at minimum have a backup in a different physical location. That backup should include at least one redundant disk. None of that is worth spending a dime on more expensive hardware.
It's very unlikely for all of your disks to fail at the same time. Even if they do, that's why you have an offsite backup. The name of the game is redundancy, not longevity.
In practically every use case, two consumer disks will be better than one enterprise disk. Once you start failing enough disks often enough, longevity can be worth the additional cost. Until then, it just isn't.
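As a toy illustration of that trade-off (the annual failure rates below are made-up round numbers, and this ignores rebuild windows and correlated failures entirely):

```python
# Made-up annualized failure rates purely for illustration; real numbers vary
# by model, batch and workload, and this ignores rebuild time and correlated
# failures (same lot, same shelf, same power event).
afr_consumer = 0.015     # assumed 1.5% AFR per consumer disk
afr_enterprise = 0.005   # assumed 0.5% AFR per enterprise disk

# Mirrored pair of consumer disks: losing the data needs both to fail.
p_mirror_loss = afr_consumer ** 2
# Single enterprise disk: any failure is data loss (modulo backups).
p_single_loss = afr_enterprise

print(f"mirrored consumer pair: {p_mirror_loss:.4%} / year")   # ~0.02%
print(f"single enterprise disk: {p_single_loss:.2%} / year")   # 0.50%
```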
Actually, it's not uncommon for a batch of disks from a vendor (the same lot #) to have failures.
And again, a backup is fine for deleted data, or fire/ransomware. But for day to day operations, no one is really willing to wait for you to restore from a local backup, much less an offsite backup.
I think if you had worked with or managed a SAN, you might realize that a single disk failure is a non-event. I'm not talking about a JBOD or a storage shelf using RAID5, I'm talking about a NetApp or similar system that can easily handle disk failures without interrupting service.
It’s because all consumer disks are garbage. The way it works is that disks are tested. Disks without failures are sold as “Enterprise.” Those that have failures are labeled and sold as “consumer”, not thrown in the bin.
Are you talking consumer prices, enterprise redundant SSDs in a Data Center, or enterprise cloud storage? In my experience the latter two could add another digit to your price.
The audit department at my current gig says all records must be retained for eternity because of regulatory reasons but cannot direct anyone to the relevant regulation.
I can say the same thing, but as a sysadmin working at a place where it was like pulling teeth to get money to buy additional storage. Upper management was so daft/cheap that they couldn't see past six months, so they decided to buy the cheapest enterprise solution with the least amount of upgrade capacity.
My experience as an admin was quite similar. Developers just couldn't seem to grasp the need for someone to foot the bill for their constant requests for 2 TB LUNs. I'm just glad I'm done dealing with that shit.
I was considering what to call a Toyota employee and thought Toyototian sounded right. Accidentally added an additional to*, but oh well. It's pretty incredible how quickly Google adds stuff to its index!
Bank-employed chums have reported similar anecdotes, including cases where it's been quicker and easier for them to nip to the shops and buy an external drive to relieve the pressure whilst the 800 requisition forms slowly work their way through the systems.
Often this is intentional administrative backpressure. Finance tells IT to minimize costs. IT knows that the majority of users are poor at housekeeping, so tight control over storage allocation provides an incentive to delete-first-expand-later.
Unfortunately, this strategy frustrates users who have a genuine need to expand storage and who are unable to efficiently obtain an exception to the process.
It also makes you more likely to request far more than you need so you don't have to deal with them as frequently.
Everywhere I’ve seen that strategy tried had massive overspending and outages due to it. What’s worked best is cloud style usage billing because that aligns the incentives with the people making the decisions, but it’s really non-trivial to get the accurate full cost.
We had an on-prem server at a new factory I was working at, and less than a year in the whole thing shut down because we ran out of disk space for logs, so none of the automation could report back that it had completed its tasks and move on to the next one. It turned out there were apparently terabytes of unallocated disk space on the server, so it was a quick fix, but it just felt so avoidable.
In a past job I took over an enterprise Linux setup where all the log volumes were way too small. It was dozens of alerts every week just to grow a log volume or delete some stuff.
It took me a couple years to get everything resized properly and logs rotating correctly. This was around 700 servers with all the change requirements of a heavily audited environment.