I was responsible for some DevOps stuff at a state's health department, and one of the more infuriating things about working at that place was that getting more storage allocated was like pulling teeth. Our backups would be running out of disk and they'd allocate me 50 or 100 GB at a time. I'm sure some Toyototian had been yelling for the past six months that this was going to happen.
I worked with an admin like that. We had a huge cluster, but he was stingy with storage space for a service that was critical to the operation of the org.
And I get that this is a good mindset for not wasting space overall, but if a single backup fills 90% of your storage space in test use, that machine is not ready for production. And we are not talking about a lot of space here: the backup was maybe 30 GB, the disk 40 GB. He could have easily just allocated 100 GB and called it a day; instead we had to go to him three times to scale it up in 10 GB steps, each time with the added stress of figuring out why things were failing (something the admin should have seen on his monitoring system).
Admins are my heroes, but please, if you allocate disk space, just take the biggest expected backup and multiply it by π. And if you need to be stingy with storage for some reason, be stingy, but decide when and where to be stingy, and at least keep an eye on the monitoring and upsize the storage before it is too late.
Having been on both sides (admin and developer), developers are notoriously bad at estimating how much space they need. You can't give them carte blanche to the storage because they'll waste it and consume as much as they're given without a thought to conserving it. And then when you put limits in, they'll whine and complain until they get what they want. Being an Artifactory service provider for a large IT dept gave me a direct view into how hard it is to manage storage for developers. And as a developer using Artifactory, I don't want to worry about storage, I just want my builds and CI pipelines to complete.
This reminds me of a time we were helping a dev team bring logging in house because they didn't like the features of their logs-as-a-service provider.
They set all applications to "debug" level logs in production and were generating multiple gigabytes of logs per hour.
They wanted 90 days retention, and the ability to do advanced searching through the live log data so they could debug in production (they didn't really use their dev or stage environments, or have a process for documenting and reproducing bugs).
90 days of retention is only 2,160 hours. Even at 999 GB/hr that is only ~2,160 TB of storage. So, if we stretch the definition of “multiple gigabytes”, that is maybe $100k in storage, which is around 3-6 developer-months. If we use a more reasonable definition like 10 GB/hr, then that is about 20 TB, so maybe $1k in storage, which is around one developer-day.
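Spelled out as a quick back-of-envelope sketch (the two GB/hr rates are the readings above; the $50/TB hot-storage price is an assumption, roughly in line with the figure quoted further down the thread):

```python
# Back-of-envelope for the numbers above. Rates and price are assumptions,
# not a quote from any real provider.
def retention_cost(gb_per_hour, retention_days=90, usd_per_tb=50.0):
    hours = retention_days * 24          # 90 days -> 2,160 hours
    tb = gb_per_hour * hours / 1000      # total volume retained, in TB
    return tb, tb * usd_per_tb

print(retention_cost(999))  # (~2,158 TB, ~$108,000)  the stretched reading
print(retention_cost(10))   # (~21.6 TB,  ~$1,080)    the reasonable reading
```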
A few years ago I joined a company aggressively trying to reduce their AWS costs. My jaw hit the floor when I realized they were spending over a million a month in AWS fees. I couldn't understand how they got there with what they were actually doing. I feel like this comment perfectly demonstrates how that happens.
AWS also purposefully makes it easy to shoot yourself in the foot. Case in point that we were burned on recently:
- set up some service that talks to a s3 bucket
- set up that bucket in the same region/datacenter
- send a decent but not insane amount of traffic through there (several hundred GB per day)
- assume that you won’t get billed any data transfer fees since you’re talking to a bucket in the same data center
- receive massive bill under “EC2-Other” line item for NAT data transfer fees
- realize that AWS routes all that traffic through the NAT gateway by default, billing exorbitant per-GB fees, even though it’s just turning around and going back into the same data center it came from
- come to the conclusion that this is obviously a racket designed to extract money from unsuspecting people, because there is almost no situation where you would want that default, and discover that hundreds to thousands of other people have been screwed in the exact same way for years (and there is a documented trail of it[1]); the usual workaround is sketched below
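For anyone hitting the same thing: assuming the traffic really is instance-to-S3 within a single region and a reasonably standard VPC, a gateway VPC endpoint for S3 (which is free) keeps that traffic off the NAT gateway. A minimal boto3 sketch, with placeholder IDs:

```python
# Minimal sketch: add a free S3 gateway endpoint so same-region S3 traffic
# stops flowing through the per-GB-billed NAT gateway. All IDs are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",             # the VPC the service runs in
    ServiceName="com.amazonaws.us-east-1.s3",  # S3 in the same region
    RouteTableIds=["rtb-0123456789abcdef0"],   # route tables of the affected subnets
)
# The endpoint injects routes for S3's prefix list into those route tables,
# so instances reach S3 directly instead of via the NAT gateway.
```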
Developers will do the simplest thing that solves the problem.
If the solutions are:
* rewrite that part to add retention, or use better compression, or spend the next month deciding which data to keep and which can be removed early
* wiggle a thing in the panel/API to give it more space
The second will win every single time unless there is pushback, or it hits the 5% of developers who actually care about good architecture rather than just delivering tickets.
They're pricing hot storage at $50/TB (not per month). That is definitely not AWS or anything like it.
On a per-month basis, the grossly exaggerated number is in the single thousands. The non-exaggerated number is down in the double digits.
$50/TB is a lowball if you want much of the data to be on SSDs, but taking an analysis server and stuffing in 20TB of SSD (plus RAID, plus room for growth) is a very small cost compared to repeated debugging sessions. Especially because the SSD has to deal with about 0.01 DWPD.
Related to that, last year Uber's engineering blog mentioned very interesting results with their internal log service [1].
I wonder if there's anything as good in the open-source world. The closest thing I can think of is Clickhouse's "new" JSON type, which is backed by columnar storage with dynamic columns [2].
The design described there is what Uber should have been logging in the first place. Instead they are logging the fully resolved message and then compressing it back into the templated form.
However, compressing back into the templated form is a good idea if you have third-party logs that you want to store and cannot rewrite the logging to generate the correct form in the first place.
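To make the distinction concrete, here is a minimal sketch (illustrative only, not the actual format from the Uber post) of logging the template plus its arguments instead of the fully resolved string:

```python
# Sketch of the idea: keep the message template and its arguments separate
# instead of storing the fully resolved string, so identical templates
# dedupe and compress extremely well.
import json

def log_structured(template: str, **args):
    # In a real system the static template would be stored once (e.g. keyed
    # by a hash) and only the dynamic arguments kept per record.
    record = {"template": template, "args": args}
    print(json.dumps(record))

# Resolved form would be "user 42 logged in from 10.0.0.7" -- unique every time.
# Templated form: the template repeats verbatim, the args are tiny.
log_structured("user {user_id} logged in from {ip}", user_id=42, ip="10.0.0.7")
```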
Neat! The only downside of this approach is having to force the developers to use the library, which can work in some companies. On the other hand, other approaches discussed previously like Uber's don't require any change in the application code, which should make adoption way simpler.
Sure, whatever, a factor of 10 here or there hardly matters. I deliberately read “multiple gigabytes per hour” as 999 GB/hr instead of a much more reasonable 10 GB/hr. I overestimated the data rate by a factor of 100 (about 10,000%) and the number still comes out “reasonable”, i.e. a cost that can be paid if the cost/benefit is there.
Unless you want to claim storage costs $5,000/TB, then 3 MB/s of I/O (“multiple gigabytes per hour”) with 90-day retention for a team’s worth of logging is not stupid on its face. Not to say that it is an efficient or smart solution, but it is certainly not the “look at this insane request by developers” that the person I was originally responding to was making it out to be.
Personally, I would probably question the competence of the team if they had that sort of logging rate with manual logging statements, but I am merely pointing out that “multiple gigabytes per hour” for 90 days is not crazy on its face and a plausible business case could be made for it even with a relatively modest engineering team.
My recent discussions with multiple SAN vendors, as well as quoting out the cost to DIY storage, put that number far away from "reasonable". I do not claim storage is $5,000/TB, but it is substantially higher than the $50/TB you're estimating.
It's difficult to estimate the log throughput in this scenario. Cisco on debug all can overload the device's CPU; systems like sssd can generate MB of logs for a single login.
All of this is really missing the core issue though. A 2PB system is nontrivial to procure, nontrivial to run, and if you want it to be of any use at all you're going to end up purchasing or implementing some kind of log aggregation system like Splunk. That incurs lifecycle costs like training and implementation, and then you get asked about retention and GDPR.... and in the process, lose sight of whether this thing you've made actually provides any business value.
IT is not an end in itself, and if these logs are unlikely to be used, the question is less about dollars-per-developer-hour and more about preventing IT scope creep and the accumulation of cruft that can mature into technical debt.
But you wouldn't use a SAN here. SAN pricing is far away from reasonable for this situation.
For the 20TB case, you can fit that on 1 to 4 drives. It's super cheap. Plus probably a backup hard drive but maybe you don't even need to back it up.
For the 2PB case, you probably want multiple search servers that have all the storage built in. There's definitely cost increases here, but I wouldn't focus too much on it, because that was more of a throwaway. Focus more on the 20TB version.
> That incurs lifecycle costs like training and implementation
Those don't relate much to the amount of storage.
> and then you get asked about retention and GDPR....
It's 90 days. Maybe you throw in a filter. It's not too difficult.
> if these logs are unlikely to be used
The devs are complaining about the search features, it sounds like the logs are being used.
> preventing IT scope creep and the accumulation of cruft that can mature into technical debt
Sure, that's reasonable. But that has nothing to do with the amount of storage.
For their use case it sounds like they wanted to index the heck out of it for near-instant lookups and similar, too. So you probably need to double the data size (rough guess) to include the indexes, and it may need some dedicated server nodes just for processing/ingestion/indexing, etc.
The idea that anyone would find storing 20TB of plain text logs for a normal service reasonable is quite amusing.
Don't get me wrong, I understand that a single-digit kUSD/month is peanuts against developer productivity gains, but I still wouldn't be able to take a developer making that suggestion seriously. I would also seriously question internal processes, GDPR (or equivalent) compliance, and whether the system actually brings benefit or if it is just lazy "but what if" thinking.
You have to fill out a load chart to fly a plane; they should have to fill out something like a storage chart to get a production allocation. What size are your objects? How many per unit of time and served entity? What is the lifetime of those objects? How is that lifetime managed?
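As a sketch of what such a chart boils down to (the function name, example numbers, and the 1.5x overhead factor are all made up for illustration):

```python
# A back-of-envelope "storage chart" calculator. Numbers are illustrative,
# not from any real form.
def steady_state_storage_gb(object_size_kb: float,
                            objects_per_day: float,
                            lifetime_days: float,
                            overhead_factor: float = 1.5) -> float:
    """Steady-state footprint once retention kicks in: size * rate * lifetime,
    padded for indexes, replication and growth."""
    raw_gb = object_size_kb * objects_per_day * lifetime_days / 1_000_000
    return raw_gb * overhead_factor

# e.g. 5 KB objects, 2 million per day, kept 90 days -> ~1,350 GB with overhead
print(f"{steady_state_storage_gb(5, 2_000_000, 90):.0f} GB")
```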
If you agree to add a few months of development time and reduce future velocity to make sure these limits are enforced, sure. Usually adding storage costs about as much as one developer’s salary for, what, an hour? A day?
You missed the part where I said they are "notoriously bad at estimating". We really do suck at estimating everything... storage, work estimates, etc. Why can't we just say "it'll be done when it's done and I'll use ALL the storage until I'm done"?
I mean, in my case it was literally a database filled with dummy records for roughly the number of people currently in our org, so that database was essentially the size of the whole project. He just didn't plan for the size of the backups (backups were his job, not ours).
Allocated storage should come directly from the consuming team's budget. Divvy up the total storage cost and allocate in proportion to requested limits.
Sure, what are you going to bill me for a 30GB VM on a 100TB cluster? Whether I want 30GB or 100GB for an absolute central service for the whole org shouldn't matter. If we are talking about personal pet projects or user accounts — sure — but that wasn't my complaint here.
> You can't give them carte blanche to the storage because they'll waste it
So what? Just buy more. Storage is cheap.
It's hard to have a discussion here without understanding the scales involved. Is the problem that they're wasting 100 GB or 100 TB? And if the issue is truly that they're wasting 100 TB, then clamp down on it as part of cost reduction efforts. The truth in most organizations is you get rewarded for eliminating mountains of waste, but trying to prevent the waste in the first place brands you as someone difficult to work with who is standing in the way. Why not lean into that?
Be stingy in a smart way. I'd like a description of why you need the storage, an estimate of how much, and a projection of growth over the next 6-12 months, though the latter can wait a month or three for something new. Beyond a certain scale we'd need a PO or a project to write up the cost too. And yes, we'll start bugging you again once it starts filling up to 80% or more, because we don't want your systems to fail :)
But this way, I directly get an idea of the increase in storage we will need over the next year in order to plan the next hardware expansion.
Since APFS, "large" empty files can also take only the 4K or so for the filesystem entry on disk. You've got to make sure it's not a sparse file, and doing so is a bit tricky; better to fill it with junk that doesn't compress well.
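For example, a minimal sketch of that (the path and size are placeholders; random bytes defeat both sparse allocation and filesystem compression):

```python
# Create a non-sparse, poorly-compressible reserve file that can be deleted
# in an emergency to free real disk space. Path and size are illustrative.
import os

RESERVE_PATH = "/var/reserve.bin"   # hypothetical location
RESERVE_BYTES = 2 * 1024**3         # 2 GB of breathing room
CHUNK = 1024 * 1024

with open(RESERVE_PATH, "wb") as f:
    written = 0
    while written < RESERVE_BYTES:
        # Random bytes don't compress and force real block allocation,
        # so neither sparse files nor filesystem compression can cheat.
        f.write(os.urandom(CHUNK))
        written += CHUNK
    f.flush()
    os.fsync(f.fileno())

# When the disk fills up, deleting this file instantly frees real space.
```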
It is not an entirely unreasonable idea. If a system runs out of disk space, an unexpectedly large number of operations will fail, which can make recovery more problematic than you would assume. If you can immediately recover some disk space and get breathing room, it could make the difference in restoring service.
Keeping a bit of disk reserved for recovery is extremely common with copy-on-write filesystems like ZFS & BTRFS. Even deletion takes some extra space, so without a reservation it's effectively impossible to delete any files from a full disk.
This is an alert that must be fixed immediately, with an escape hatch to fix it quickly in case you truly don’t have time to manage your actual files.
It is also trivial to set up, and does not require me to figure out how to set up an OS alert, or to trust that whatever alerting process is supposed to be running actually is. So it is an essentially failproof alert that works the same on any OS.
In my experience this makes the problem worse. People either compensate for it, or stop trusting clocks at all. Usually a mix of both of those resulting in even less punctuality.
So, you are not expecting that your co-workers have good reasons for what they are doing? Maybe the hiring bar at your place is too low then.
I prefer to work at places where my default assumption is that everybody around me is smart and responsible. Lifts lots of worries off my shoulders (and tends to benefit the stock price over time too and thereby my income).
My coworkers have called out gaps in my thinking thousands of times when I have explained perceived needs to them, that's one of the main value-adds one gets from working in a team.
If I wanted unquestioned control, I'd run my own shop. If I want the best product, then I hope that people question my assumptions.
We are not in disagreement here. Bouncing off ideas and thoughts is a good thing.
The way this was phrased was more from the angle "who knows what these guys were thinking; if they can't give me a good reason, no way they will get storage space as I don't trust that they make good decisions on their own".
Generally? No. Not because they are not smart, but because in a large company each individual has different goals and priorities (that's why we have e.g. SREs as dedicated roles), and it takes a bit of effort to find the intersection between all of these.
Let's say I work in DevOps and want to optimize cloud costs. In that case, I would challenge the size of everything, the use of higher-cost services, the number of regions, all of that. But the team might want more regions and bigger resources to improve latency and performance, use more high-cost services for developer experience, and ship features without having to think about utilization.
It's a tug of war, and only works when you have forces on both sides to balance out. Being too conservative might stall innovation or make things too slow to save a buck, not being conservative enough might drain funds or make things impossible to scale.
I believe you are intentionally misunderstanding. The term "tug of war" is not used to indicate armed conflict or even a problem. It indicates balancing forces that you want to maintain - pull the rope too far to one side, and you end up in a suboptimal extreme.
Unless you work with clones of yourself, there will always be differences in opinions and priorities, and not every feature and bug fix can be a company-wide stakeholder meeting, and you certainly will not get any social points for trying to micro-manage other teams.
Of course there will be differences. That's why you sit down and plan things together, pulling in and coordinating with all _relevant_ stakeholders. Of course not the whole company.
But the attitude needs to be "let's put the requirements on the table and see what we can do" instead of "you don't get what you want unless you give me a good reason". The latter comes from an angle of distrust which I'm arguing against. The former comes from an angle of collaborative problem solving.
In a company in which I go to a team relevant to a project and like to engage in a discussion and am met with an attitude of "unless you give us a good reason we'll stop talking to you", the atmosphere is not one that will keep me personally for long. YMMV.
> I believe you are intentionally misunderstanding.
You are free to believe what you like. Opening a reply with such a sentence is pretty sad though. It does not foster a healthy atmosphere, nor does it match reality, I might add.
> Opening a reply with such a sentence is pretty sad though. It does not foster a healthy atmosphere, nor does it match reality, I might add.
Your response latched onto a single word ("war") within a common phrase ("tug of war", a game). While it might have been accidental, such answers distract from the actual discussion (and tend to be used as distractions when no good answer is present).
> Of course there will be differences. That's why you sit down and plan things together, pulling in and coordinating with all _relevant_ stakeholders.
When you discuss new architectures or large projects, this is a given, but it covers only a small portion of company operation. The rest is organic day-to-day work, which slowly but surely distorts the initial assumptions; slowly boiling the frog, so to speak. Think of one team making changes that affect request patterns, another team making something that is accidentally quadratic, and a third team suddenly asking for a large number of cloud resources to carry all of this, which should absolutely be challenged.
And at the same time, teams sit under different organizational units with different budgets, schedules, leadership and priorities, and most certainly don't care about the daily scrum work of other teams.
> In a company in which I go to a team relevant to a project and like to engage in a discussion and am met with an attitude of "unless you give us a good reason we'll stop talking to you", the atmosphere is not one that will keep me personally for long. YMMV.
No one said "we'll stop talking to you", but rather "you get what can be justified". If you take offense at being challenged and would rather work somewhere else, you do you, but if you can't justify your request I'd argue that you are not doing your job properly in the first place.
There is a difference between spending $2000+ on a new computer and $10, which is about what a terabyte costs.
Probably just having the discussion itself would waste more resources than just giving the storage space.
Dealing with the gatekeeping often costs more in dev time than just approving, especially when the DevOps folks think they know better. Thank goodness a tech director can step in and break the impasse.
We tried. Devs, when given more space, just didn't bother to clean up old crap and the exact same thing happened, just with a few months of delay. So now we generally just ask how much they need and bill the project for it, so that's also on them.
But, unlike Toyota, we do have disk space alerts.
Sometimes the problem is also entirely political: management needs to tell the client and charge them for more storage, and won't accept the change until that happens. Meanwhile the clock is ticking...
In my case the data was a dummy database with more dummy users than there are people at our org (maybe 50% more), so once this went to production it would likely get smaller.
The problem in this case was twofold:
- the admin's job was to implement the database backups. He didn't factor the backup size into the disk allocation, so this was entirely his own fault.
- the database stores certain transactions for a certain period, so usage grew initially until it settled at a certain level. Because the storage margin was so slim, this caused the problem.
Most of our cases where that happened were either a lack of planning or a lack of communicating that plan. By far the most common one was "neither dev nor client knows the data volume over a longer period", which is fine as long as that is also communicated, but that's often a problem too.
But of course I'm not denying that there are shitty, incompetent ops departments. For another customer we were dealing with an ops department that had:
* backup storage (some remote FTPS server IIRC) provisioned so slowly that the backup wouldn't finish copying within 24 hours, and the backup was under a terabyte.
* weeks-long delays for any resize.
He was just training the users to go the shadow-IT route with self-bought, uncontrolled storage, most probably in the form of personal USB drives and/or departmental consumer NAS devices.
I've tested both π and g, and while they both work well, g results in far fewer disk full errors. I've heard c works even better, though I haven't tried it yet.
> I've tested both π and g, and while they both work well, g results in far fewer disk full errors. I've heard c works even better, though I haven't tried it yet.
Good to know. FWIW, i should also be avoided. It's tempting to use, since most programs use it as a counter, so it /should/ standardize the log file sizes. But in practice it's very tricky to get a definitive disk space requirement with it.
I work for a retailer where the service I’m responsible for is used by every cash register around the world for certain operations. When I came in, the RDS DB for this service had 60GB allocated to it, had literally just run out of space and caused an outage. The last team just gave it an additional 20GB. A month later, I was put in charge of it and it was already 5GB away from running out of space again. I put an end to that and gave it 250GB. The cost is minimal compared to a store not being able to open due to an outage.
The instances for the service itself had 20GB of EBS allocated to them. Luckily they don’t need much local storage. But that’s typical here. There’s a Jenkins instance that is even more of a pain. I’m not responsible for it but every week or two one of the worker nodes runs out of space because they’re given 8GB of storage space. I’m just watching the disaster unfold over the course of a year and a half as I’m constantly telling that team to just up the storage space on the worker nodes instead of constantly having to fiddle with cron jobs.
It’s not even an expense thing. They just… don’t want to increase the storage space. It drives me insane.
I'd guess the worry is that once you increase the storage, you never decrease it again. Ever. It's a one-way street. So, once everything is 5x over-provisioned, then the services tend to fill that space anyway (cause why not be wasteful if it doesn't cost anything) and a year later you are in the same seat again.
I'm not saying this is real, but the worry certainly is.
That's certainly real and something to consider when provisioning systems. I'm fully on board with that. The problem is when the cost of the cost-savings solution vastly outweighs the cost of over-provisioning infrastructure. Like this Jenkins issue bubbling up ~2-4 times a month vs just giving the worker nodes more storage space. There's been times where it happened during the night and people got paged.
Or comparing the cost of one store not being able to open on time because the RDS database's space ran out. VPs and directors start yelling and there's suddenly like 20+ people involved in figuring out why this one store didn't open on time. What's the cost of that compared to just giving the DB 250GB of space so this never comes up again?
But you are also 100% correct and I've seen that happen here, too. There's some instances I'm responsible for that were using EFS for their local storage. Costing thousands of dollars every month for absolutely no reason. I switched those to reasonably-sized EBS volumes and that alone was half of my annual savings goal.
I was completely flabbergasted seeing these instances using EFS while others were stuck on 8GB EBS volumes. Backups of the EFS drives had ballooned to many TBs. And the backups were worthless! The instances themselves are ephemeral: they use S3 for long-term storage and metadata lives in a database. Those are the things that should be backed up, and their cost compared to EFS is minuscule.
> compared to just giving the DB 250GB of space so this never comes up again?
As long as there is reasonable confidence in that this is actually the case, then just provision the space and be done with it. That requires a certain understanding of future space requirements/expectations, and anything even just so slightly running away / leaking space will hit any limit given enough time. So, due diligence requires looking at whether it's actually needed.
Yup, I implemented a bunch of graphs and alerts. Right now it's at 100GB of usage so it's still growing but at a fairly predictable rate. Another nice thing to know is if it's possible to reduce that usage. I haven't been able to look into that but I know one of the causes of the usage increase. The service uses the DB to store some indexing data. There's a team forcing it to re-index and I can tell when they deploy because the storage spikes a little bit every time they do a deployment. Nothing I can do about that, sadly.
There was probably an array with a few dozen terabytes to spare and the guy made you run back to him once a month for some misguided job security purpose.
Steelman possibility: There's 20 people in 10 departments giving him conflicting non-written requests for disk space, his own request for new disks is held up somewhere, and his boss keeps telling him "Just do what you're assigned" which doesn't include preventing a full disk condition.
Note that a few dozen terabytes is also nothing. 12 TiB is about $800 on an SSD and about $250 on a HDD. Plus some overhead for the enclosure and redundancy, of course. It costs on the order of a day's pay for an engineer, at most.
I don't excuse wanton waste of storage because it's easy for sloppy practices to balloon to massive confusion and inefficiency. But that discipline should be enforced by good engineering practices, not by limiting resources.
There is absolutely no technical reason to provide all of your storage needs with a single quality of disks. You may do that to get some economies of scale, but if the impact is that high, you should rethink it. You can just as well have two different solutions, one with a huge amount of disk space, little redundancy and low performance, and one with a limited amount of space, plenty of redundancy and high performance.
And yes, I know that goes against every common ops procedure, because disks are so cheap. But the next thing you hear is always that disks are not cheap at all, and you really can't have it both ways.
The problem is that staff time costs an order of magnitude more and most of these things can go wrong. For example, you don’t just want to buy random lots of whatever disks are cheapest because then you have to track recalls and firmware updates for everything.
If you have two tiers available, someone will use the less reliable one because that’s how they fit their budget but then because it’s “in production” they’ll expect the same level of service.
All of these are manageable but what you’re really hearing is that the technical issues are really the tip of the social iceberg most organizations have. One of the reasons people pick AWS isn’t just that it’s usually cheaper than the full cost of rolling your own but that lots of these things don’t affect you: you never fail to provision an EBS volume because the VPs of finance and IT are still arguing about procuring a new rack of disks, people can’t request endless customizations because the options are “take it or leave it”, etc.
Single disks are cheap. Multi-disk storage systems are not. Multi-tiered storage systems are even more expensive. I don't see an inherent conflict there.
Depends on the org. Some orgs are well managed, others don't want to pay for stuff behind the scenes until it bites them in the ass. They don't know what they don't know.
Somewhat recently I toyed with the idea of HDDs with an 'interposer', which would be just a dumb ATA-to-iSCSI interface with Ethernet or WiFi connectivity.
That would allow you to place those drives literally anywhere and with a minimal footprint.
Sadly it would be too costly for home usage (I assume about $30/unit at best), and for enterprise usage... 1 Gbit is too slow, 10 Gbit is too hot and bulky, and then you can't sell a $3 plastic case for $100 as 'vendor approved with light-path(r)(tm) diagnostic indication' for each drive.
Because the enterprise disks are rated for years of continuous service and have things like firmware which doesn’t lie about whether data has been committed durably for the sake of benchmarks on review sites.
None of this means that you should trust any particular disk enough not to need redundancy, backups, etc. Companies can and do make trade-offs based on their needs and management competency, and people have been shifting software for a generation to rely less on the hardware (back when Sun announced ZFS, one of the major appeals was that you could drop expensive hardware RAID controller dependencies in favor of cheap boxes of disks), but there isn’t a single global optimum point.
A lot of enterprise purchases are driven by being able to satisfy your most demanding users with the same service as everyone else so you can avoid needing your admins to be trained and experienced with dozens of different storage systems. That last part especially extends to testing: for example, does your rack of consumer drives with software redundancy come back up cleanly after a kernel panic or power outage, especially a nasty one like a fluctuating brownout?
Depending on your budget, needs, and technical bench depth you might reasonably conclude that the savings are worth the ops work, or that it’s safer to pay an enterprise storage vendor who’ll certify that they’ve done that testing and will have tech support on-site within an hour, or that you’ll use AWS/Azure/GCP because they do even more of that tedious but important work. All of those can be right, but I’ve typically found that people in the first two categories think they’re doing better than they are and would be paying less for better service in the cloud.
Well, "enterprise storage" can actually be multiple, redundant copies of the data behind the scenes.
So, internally the storage team might quote say $1000/TB (simple numbers for an example) for a given storage quantity. And behind the scenes they'll likely have at least redundant storage arrays, plus backups and 24/7 monitoring for all of the data, etc.
Because if you're providing services to a large business, you're typically not managing servers with physical disks attached. You're using a SAN with fiber connections or NFS mounts. Most SANs require specific drives sold by the vendor with firmware they've tested, mounted to custom sleds etc. You can't just connect a WD Mybook drive.
Mom and Pop businesses using single servers can do all they want in regards to drives, I don't care. I would argue they shouldn't have servers, but if you do have servers, you should at minimum use RAID etc, not a drive plugged into your USB port.
If you are in that situation, then obviously you have to do what you have to do.
Absolutely none of that has to apply to the sort of situation we are talking about here.
The only real reason to spend more per disk is when you operate enough disks that failures are a statistical certainty, and extending the average lifespan per disk will definitely save you more than the enterprise markup costs. So you had better have dozens or hundreds of disks in the first place.
For any truly valuable data, you should at minimum have a backup in a different physical location. That backup should include at least one redundant disk. None of that is worth spending a dime on more expensive hardware.
It's very unlikely for all of your disks to fail at the same time. Even if they do, that's why you have an offsite backup. The name of the game is redundancy, not longevity.
In practically every use case, two consumer disks will be better than one enterprise disk. Once you start failing enough disks often enough, longevity can be worth the additional cost. Until then, it just isn't.
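As a toy illustration of that trade-off (the annual failure rates below are made-up round numbers, and this ignores rebuild windows and correlated failures entirely):

```python
# Made-up annualized failure rates purely for illustration; real numbers vary
# by model, batch and workload, and this ignores rebuild time and correlated
# failures (same lot, same shelf, same power event).
afr_consumer = 0.015     # assumed 1.5% AFR per consumer disk
afr_enterprise = 0.005   # assumed 0.5% AFR per enterprise disk

# Mirrored pair of consumer disks: losing the data needs both to fail.
p_mirror_loss = afr_consumer ** 2
# Single enterprise disk: any failure is data loss (modulo backups).
p_single_loss = afr_enterprise

print(f"mirrored consumer pair: {p_mirror_loss:.4%} / year")   # ~0.02%
print(f"single enterprise disk: {p_single_loss:.2%} / year")   # 0.50%
```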
Actually, it's not uncommon for a batch of disks from a vendor (the same lot #) to have failures.
And again, a backup is fine for deleted data, or fire/ransomware. But for day to day operations, no one is really willing to wait for you to restore from a local backup, much less an offsite backup.
I think if you had worked with or managed a SAN, you might realize that a single disk failure is a non-event. I'm not talking about a JBOD or a storage shelf using RAID5, I'm talking about a NetApp or similar system that can easily handle disk failures without interrupting service.
It’s because all consumer disks are garbage. The way it works is that disks are tested. Disks without failures are sold as “Enterprise.” Those that have failures are labeled and sold as “consumer”, not thrown in the bin.
Are you talking consumer prices, enterprise redundant SSDs in a Data Center, or enterprise cloud storage? In my experience the latter two could add another digit to your price.
The audit department at my current gig says all records must be retained for eternity because of regulatory reasons but cannot direct anyone to the relevant regulation.
I can say the same thing, but as a sysadmin working at a place where it was like pulling teeth to get money to buy additional storage. Upper management was so daft/cheap that they couldn't see past six months, so they decided to buy the cheapest enterprise solution with the least amount of upgrade capacity.
My experience as an admin was quite similar. Developers just couldn't seem to grasp the need for someone to foot the bill for their constant requests for 2 TB LUNs. I'm just glad I'm done dealing with that shit.
I was considering what to call a Toyota employee and thought Toyototian sounded right. Accidentally added an additional to*, but oh well. It's pretty incredible how quickly Google adds stuff to its index!
Bank-employed chums have reported similar anecdotes, including cases where it's been quicker and easier for them to nip to the shops and buy an external drive to relieve the pressure whilst the 800 requisition forms slowly work their way through the systems.
Often this is intentional administrative backpressure. Finance tells IT to minimize costs. IT knows that the majority of users are poor at housekeeping, so tight control over storage allocation provides an incentive to delete-first-expand-later.
Unfortunately, this strategy frustrates users who have a genuine need to expand storage and who are unable to efficiently obtain an exception to the process.
It also makes you more likely to request far more than you need so you don't have to deal with them as frequently.
Everywhere I’ve seen that strategy tried had massive overspending and outages due to it. What’s worked best is cloud style usage billing because that aligns the incentives with the people making the decisions, but it’s really non-trivial to get the accurate full cost.
We had an on-prem server at a new factory I was working at, and less than a year in the whole thing shut down because we ran out of disk space for logs, so none of the automation could report back that it had completed its tasks and move on to the next one. It turned out there were apparently terabytes of unallocated disk space on the server, so it was a quick fix, but it just felt so avoidable.
In a past job I took over an enterprise Linux setup where all the log volumes were way too small. It was dozens of alerts every week just to grow a log volume or delete some stuff.
It took me a couple years to get everything resized properly and logs rotating correctly. This was around 700 servers with all the change requirements of a heavily audited environment.