They do not try to blame it on complex systems or other factors.
Users lost about a day and a half of recent work (which doesn't seem to be that bad).
Regarding the file loss in the Lustre file system of your supercomputer system, we are 100% responsible.
We deeply apologize for the great inconvenience caused by this serious failure involving the loss of files.
We would like to report the background of the file disappearance, its root cause and future countermeasures as follows:
We believe that this file loss is 100% our responsibility.
We will offer compensation for users who have lost files.
[...]
Impact:
--
Target file system: /LARGE0
Deletion period: December 14, 2021 17:32 to December 16, 2021 12:43
Files deleted: files that had not been updated since 17:32 on December 3, 2021
[...]
Cause:
--
The backup script uses the find command to delete log files that are older than 10 days.
A variable name is passed to the delete process of the find command.
A new improved version of the script was applied on the system.
However, during deployment, there was a lack of consideration as the periodical script was not disabled.
The modified shell script was reloaded from the middle.
As a result, the find command containing undefined variables was executed and deleted the files.
[...]
Further measures:
--
In the future, programs to be applied to the system will be fully verified before being applied.
We will examine the extent of the impact and make improvements so that similar problems do not occur.
In addition, we will re-train the engineers in charge on human error and risk prediction/prevention to prevent recurrence.
We will thoroughly implement the measures.
Japanese companies structure apologies very differently from US ones, because the legal consequences are very different. In the US, an apology is considered an admission of responsibility and is often the starting point of legal action against the culprit, while in Japan, a sufficiently sincere* apology may well defuse the situation entirely.
* 真 makoto, a word often glossed as "sincere" but not identical in meaning: it's about the amount of effort you're willing to take on, not how "honestly" you feel something
Also, the culprit here is not HP proper but their consulting/SI wing HP Enterprise, which has a, uhh, less than stellar reputation for competence.
Apologies, both personal and corporate, are taken very seriously in Japanese culture[1]. They're a way of preserving honor that dates back to the samurai era. You can see this in the custom of bowing, where the length and extension of a bow reflects the gravity of the situation. The act of seppuku can be considered an extreme version of this.
I'm not a Japanophile, but find their culture fascinating.
Japanese firms are interesting to work with. I enjoyed the culture, personally. I have two anecdotes.
The old white box PC shop I worked for, back in the 90s, quoted PCs to a Japanese-owned auto parts manufacturer. The Customer accepted the quote, paid an up-front deposit, and requested the PCs not be built and delivered until some construction at their site was completed in a few months.
In the meantime component pricing went down and speeds/feeds went up. When the PCs were built we ended up being forced to source higher clock speed CPUs and larger hard disk drives. It took some cajoling to get the Customer to take delivery. They felt they should pay more for the upgrades. (We were actually making more money on the deal even with the upgraded components anyway!)
My current company once pitched a support agreement to a Japanese-owned firm. We offered a discount for annual commitment versus month-to-month. I'd copy/pasted the month-to-month terms for the annual but forgot to alter the minimum notice period for ending the annual option. Both month-to-month and annual indicated a notice period of 30 days rather than the intended 180 days for the annual option.
The Customer's contact questioned the notice period being the same. He asked why they wouldn't opt for annual commitment, get the discounted rate, and also have the 30 day notice period. My partner, who had worked with Japanese firms in the past, responded: "We know you wouldn't choose the annual option if your intention was not to work with us for at least a year." The Customer agreed and we ended up getting the gig on an annual basis.
> The Customer's contact questioned the notice period being the same. He asked why they wouldn't opt for annual commitment, get the discounted rate, and also have the 30 day notice period. My partner, who had worked with Japanese firms in the past, responded: "We know you wouldn't choose the annual option if your intention was not to work with us for at least a year." The Customer agreed and we ended up getting the gig on an annual basis.
That's a really cute answer by your partner, I like it a lot, even ignoring that it manages to save face while paying serious respect to the purported customer.
Writing the parent comment gave me, if nothing else, an excuse to publicly document that exchange. I was, and still am, in awe of his ability to think on his feet and come up with such a wholly appropriate response.
Because Japanese culture accepts failure. Admission and acceptance preserves honour. Western culture, particularly in north America, punishes the admission of failure. Everyone is supposed to fight things out to the last, to never give an inch. Even when companies settle cases they rarely admit wrongdoing publicly. Notice that this Japanese company talks about re-educating/training personnel. An American company would be expected to sack all involved and sue the consulting firm.
Maybe it depends on the region/prefecture, but after 3 years working in Japan, my experience is quite different. The project suffers massive delays because nobody wants to make any decision and take any responsibility. My former boss (Japanese, but had worked for many years in the US), told me that Japanese managers don't want to take risks, because if something fails they need to apologize in public, and that is pretty much social suicide there.
> isn’t America’s acceptance of failure often cited as an element in the success of its entrepreneurial spirit
Yes, the parent comment is entirely wrong. The US culture is hyper accepting of failure compared to most other prominent cultures, including Japanese culture (which in fact does not tolerate failure very well at all).
America accepts failed ventures, not individual mistakes. A business that collapses is a learning experience. But, while they are operating, a business must never admit to wrongdoing impacting customers or the public. There is great respect for failed business judgment, but very little for weakness.
Weren’t Japanese soldiers in WWII instructed to kill themselves instead of being captured to “preserve honor?” That doesn’t sound like the acceptance of failure to me.
It's far too much to go in to in the HN comments section from a phone, but removing the key military leadership who pushed this mindset, a new government structure, and the rejuvenation of business under e.g., Deming caused a significant shift in Japanese culture following the war.
Edit: This is not necessarily in support of the grandparent comment; rather, a caution against judging a culture based on historical anecdotes.
That sounds like saying that the ongoing US war crimes at Guantanamo are important for understanding Google app engine service contracts. Just because Japan is far away doesn't mean that everything that happened there happened in the same place at the same time.
Only certain classes of officers if I remember correctly, and it was more likely motivated by preventing the leaking of intelligence under interrogation by the enemy, rather than really being about preserving honor. Of course it's probably easier to carry out seppuku if you convince yourself it's honorable, rather than to just do it based on rationality.
Not all failure results in capture, so one particular failure scenario does not speak of the general view on failure.
There was a widespread belief that one would have little chance of dying painlessly if captured alive, based on Japanese norms and very scarce knowledge of American culture. Better to pull the pin while you can than to spend days begging your captors to stab just an inch closer to an artery; that's what "preserve honor" ultimately means here.
They were instructed to avoid capture by any means because they had been indoctrinated by propaganda depicting horrific treatment of POWs at the hands of Allied personnel. While this was obviously false, it's interesting to note that Japanese knowledge of America's history with slavery, segregation, and Indian removal would have made this assumption not unreasonable, and further, may have influenced Japanese treatment of American POWs. After all, a major consideration in Japan's decision to go to war in the first place was the leadership's understanding that their lack of status as a white power would hamper their colonial ambitions. They were only a few decades removed from being excluded from the Berlin Conference, for example.
That's an oversimplification. Pre-existing notions about warrior conduct certainly played a role, but "samurai" were a class (bordering on caste); attitudes descending from bushido were a top-down mandate, enforced by the officer class, not something widespread in common civilian life (save for knowledge of how one is supposed to act towards high-ranking personnel).
If you think that the domestic propaganda machine wasn't running at mach speed in order to shape public perceptions to what would be most beneficial to the Imperial Army and Navy, I don't know what to say. They definitely were, and they definitely pushed, from multiple angles, the idea that surrendering would result in a worse spiritual, material, physical, and psychological outcome than the alternatives (take your pick of whichever motivates you most effectively).
That's exactly what acceptance of failure sounds like to me. E.g. you abandon a chess game when you accept you'll lose. To just keep pushing for a situation you know is irredeemable is to deny you've failed.
When you learned about that, you must not have read the entire paragraph: being captured was traditionally considered a dishonor (for the past 500 years, at least); it is a huge failure for a soldier (or samurai). Ritual suicide is the solution to that dishonor.
There are degrees of failure, some is considered to be beyond fixing.
Unfortunately it can also be taken a bit too far. It’s not uncommon to essentially buy yourself out of punitive criminal justice via an apology and some cash (jidan/gomen money). It’s not totally unlike a settlement here, except it’s much more acceptable as an opening move (as a victim, you should seriously consider it). It seems to be inline with the goal of preserving social harmony and making “it” go away asap.
> a sufficiently sincere* apology may well defuse the situation entirely.
> * 真 makoto, a word often glossed as "sincere" but not identical in meaning: it's about the amount of effort you're willing to take on, not how "honestly" you feel something
While it's true that makoto can be translated as sincere(ly) in constructions like "makotoni moushiwakearimasen" (I sincerely apologize) (although this is written 誠に rather than 真に), it is unlikely that the word makoto would be used in a phrase like "a sincere apology" or in discussing how sincere an apology was, so I don't really think introducing the word "makoto" in your comment sheds any additional light on japanese culture surrounding apologies.
You could actually make the exact same comment about how "real" in English can also effectively mean sincere in "really sorry" and draw the same conclusions about American culture.
My take away (as westerner with zero direct Japanese culture exposure) from the comment you’re replying to was that in Japan, companies are incentivized to take on some measure of ownership and voluntary restitution, because there is some legal notion in Japan around “honest mistakes not being litigable if they are genuinely rectified.”
In the US "genuinely rectified" is the standard for civil suit ("actual damages" for negligence, and "specific performance" or monetary equivalent for contracts). Punitive damages are added only for malicious intent.
The reason many US companies don't apologize is because they don't want to make restitution, and can get away without paying.
“In the US, an apology is considered an admission of responsibility and is often the starting point of legal action against the culprit, while in Japan, a sufficiently sincere* apology may well defuse the situation entirely.”
——-
“hospital staff and doctors willing to discuss, apologize for and resolve adverse medical events through a “collaborative communication resolution program” experienced a significant decrease in the filing of legal claims, defense costs, liability costs and time required to close cases.”
HP and HPE are now two separate companies, split from post-Fiorina HP.
HP does consumer-grade stuff only, while HPE does the enterprise side (not just consulting; in fact, a non-trivial portion of HPE's consulting arm was spun off and merged into DXC).
I agree with your point, and want to add that HP does commercial grade end-user compute and printing along with the related enterprise services. They have a whole set of offerings for the medical industry [1], industrial printing [2], and enterprise PC fleet management services [3].
That depends on which trail of awesomeness you are after.
The trail that maintains the legacy of DEC and Silicon Graphics and Cray is in HPE (where I work). The Cray legend is still very much alive, but you can still detect the whiff of the spirit that made HP and DEC minicomputers extraordinary.
Well, I suspect the SGI legacy is now in better hands than when it was controlled by Rackable with the branding filed off. The only good parts they sold us were the Ultraviolets, and those were probably the most nonsensical purchase (protip: do not buy supercomputer modules just to run 8 VMs on them; it's a waste of money even if the hw is awesome).
> In the US, an apology is considered an admission of responsibility and is often the starting point of legal action against the culprit
source? this is a popular theory among non-lawyers, but not, as far as I can tell, well-supported by the evidence. http://jaapl.org/content/early/2021/05/19/JAAPL.200107-20 has extensive citations including for their claim that "In theory, telling a patient about an error may make patients more likely to pursue litigation. In practice, however, bad outcomes alone are typically not reason enough for patients or their families to file malpractice claims."
Medicine may not fit the pattern (personally I'd want to know where I stand, even if the news is bad), but I took the OP to be saying "American firms prefer not to go on record saying they screwed up, since that would naturally be brought up in subsequent legal proceedings".
I flip-table rage-quit my 15-year job at a university over a new C-suite from who-knows-where throwing HPE equipment/software/consultants at me for a big sort-of project, and finding them all such utter crap that I wouldn't be a part of that stupidity. I have my standards.
Before I left HPE post-Cray-acquisition, some of the folks in the consulting/"cloud" division insisted that, to use their tooling, we had to insert a Windows machine into ClusterStor that mounted Lustre, so that they could run their PowerShell script to gather usage metrics.
What I found working with these teams was that the desire to flip tables was quite strong after meeting with them. I tried to address their concerns, one point at a time. They were bewildered that Windows could not mount (modern) Lustre. Really bewildered. I offered to help rewrite their scripts in another (portable) language, so we could avoid these problems. Still they were bewildered.
They were not why I left. Merely a confirmation that my decision to leave was the right one.
HPE is known for its acquisitions and, with that, there are diverse hardware teams who approach engineering differently. This can be a source of amazing innovation or power battles. The folks inside HPE need to learn how to engage in technical conflict better to prevent these catastrophic scenarios.
No, I don’t work for Rakuten, but the in-person apology from an appropriately senior person is hugely important for large Japanese organizations.
In my current company I have the opportunity to work with several large Japanese banks. Various members of our C-Suite and senior management travel occasionally for this purpose.
A sufficiently sincere apology with the appropriate level of ceremony can be the difference between maintaining and losing a substantial contract.
I don't know a huge amount about Rakuten. Don't they purposefully adopt a pretty flat communication structure, and the CEO travels around a lot, as well as English demanded in senior roles?
I was deeply impressed some--30?--years ago when there was a minor scandal in sumo wrestling: some kid, who had been advanced too quickly, did a few stupid things; as I recall, the sort of thing you can find in American sports pages every week. The heads of the sumo wrestling association acknowledged that they had contributed to the situation, and docked their own pay. Do you think Roger Goodell is going to do that?
So this is something I’ve never understood. If you modify a shell script while it’s running, the shell executes the modified file. This normally but not always causes the script to fail.
Now I’ve known about this behaviour for a very long time and it always seemed very broken to me. It’s not how binaries work (at least not when I was doing that kind of thing).
So I guess bash or whatever does an mmap of the script it’s running, which is presumably why modifications to the script are visible immediately. But if a new file was installed eg using cp/tar/unzip, I’m surprised that this didn’t just unlink the old script and create a new one - which would create a new inode and therefore make the operation atomic, right? And this (I assume) is why a recompiled binary doesn’t have the same problem (because the old binary is first unlinked).
So, how could this (IMO) bad behaviour be fixed? Presumably mmap is used for efficiency, but isn’t it possible to mark a file as in use so it can’t be modified? I’ve certainly seen on some old Unices that you can’t overwrite a running binary. Why can’t we do the same with shell scripts?
Honestly, while it’s great that HP is accepting responsibility, and we know that this happens, the behaviour seems both arbitrary and unnecessary to me. Is it fixable?
> isn’t it possible to mark a file as in use so it can’t be modified?
That's the route chosen by Windows for binary executables (exe/dll) and various other systems. Locking a file against writes, delete/rename or even read is just another flag in the windows equivalent of fopen [1]. This makes for software that's quite easy to reason about, but hard to update. The reason why you have to restart Windows to install Windows updates or even install some software is largely due to this locking mechanism: you can't update files that are open (and rename tricks don't work because locks apply to files, not inodes).
With about three decades of hindsight I'm not sure if it's a good tradeoff. It makes it easy to prevent the race conditions that are an endless source of security bugs on unix-like systems; but otoh most software doesn't use the mechanism because it's not in the lowest-common-denominator File APIs of most programming languages; and MS is paying for it with users refusing to install updates because they don't want to restart their PC.
I've updated .so files on FreeBSD while they're running. They weren't busy, and a program which had one mmapped in order to run promptly crashed (my update wasn't intended to be hot-loaded and wasn't crafted to be safe, although it could have been if I had known it was possible). And now I won't forget why I should use install instead of cp (install unlinks before writing, by default; cp opens and overwrites the existing file).
This behavior in shell scripts predates mmap. In very early versions of Unix it was arguably even useful; there was a goto command which was implemented by seeking on the shell-script file descriptor rather than as a shell builtin, for example. I don't know of any use for it since the transition to the Bourne shell, but my knowledge is far from comprehensive. (I suppose if your shell script is not small compared to the size of RAM, it might be undesirable to read it all in at the start of execution; shar files are a real-life example even on non-PDP-11 machines.)
As I understand it, the reason for ETXTBSY ("on some old Unices...you can't overwrite a running binary") was to prevent segfaults.
cp usually just opens the file O_WRONLY|O_TRUNC, which seems like the wrong default; Emacs for example does create a new file and rename it over the old one when you save, usually, allocating a new inode as you say. By default it makes an exception if there are other hardlinks to the file.
Btrfs and xfs have a "reflink" feature that allows you to efficiently make a copy-on-write snapshot of a file, which would be ideal for this sort of thing, since the shell or whatever won't see any changes to the original file, even if it's overwritten in place. Unfortunately I don't think you can make anonymous reflinks, so for the shell to reflink a shell script when it starts executing it would need write access to somewhere in the filesystem to put the reflink, and then it would need to know how to find that place, somehow. And of course that wouldn't help if you were running on ext4fs or, I imagine, Lustre, though apparently an implementation was proposed in 02019: https://wiki.lustre.org/Lreflink_High_Level_Design
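For the common case, a minimal sketch of the difference, assuming a script at ./backup.sh (a made-up name, not the actual script from the incident):

    # Risky: cp opens the existing file O_WRONLY|O_TRUNC, so the same inode is
    # rewritten in place and a shell part-way through ./backup.sh sees the new
    # bytes at its old offset.
    cp backup.sh.new backup.sh

    # Safer: write a complete copy next to the target, then rename it over the
    # old name. rename(2) swaps the directory entry atomically; the running
    # shell keeps its open descriptor on the old inode and finishes with the
    # old content.
    cp backup.sh.new backup.sh.tmp && mv -f backup.sh.tmp backup.sh

    # install(1) gets a similar effect by unlinking the destination first.
    install -m 0755 backup.sh.new backup.sh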
> there was a goto command which was implemented by seeking on the shell-script file descriptor rather than as a shell builtin, for example.
Oh noooo I just realized you could probably implement a shared library loadable module for bash `enable` that does the same thing... just fseek()s the fd...
“Emacs for example does create a new file and rename it over the old one when you save, usually, allocating a new inode as you say. By default it makes an exception if there are other hardlinks to the file.”
Though the trade off is that all operation ceases on a full hard drive.
I don’t have a better solution, but it’s worth noting.
Emacs gives you an error message in that case rather than destroying the old version of the file and then failing to completely write the new version, in the cases where it does the tempfile-then-rename dance. This is usually vastly preferable if Emacs or your computer crashes before you manage to free up enough space for a successful save.
It doesn't cease all operation; other Emacs features work as they normally do. Bash, by contrast, stops being able to tab-complete filenames, at precisely the time when you most need to be able to rapidly manipulate your files. At least, that's the case with the default completion setup in a few recent versions of Ubuntu.
Well, it looks like creating another hard link is a nearly-free solution. And beyond that, since emacs already has both behaviors, presumably you can tell it you want the in-place modification.
The reason why modifying a script during execution can have unpredictable results, not demonstrated in this test, is that Unix shells traditionally alternate between reading commands and executing them, instead of reading the entire file (potentially very large compared to 1970s RAM sizes) and executing commands from the in-memory copy. On modern systems, shell script sizes are usually negligible compared to system RAM. Therefore, you can manually cause the entire file to be buffered by enclosing the script in a function or subshell:
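Roughly like this (a minimal sketch, not the poster's original example):

    #!/usr/bin/env bash
    # Defining a function forces bash to parse the whole body and hold it in
    # memory before any of it runs, so later edits to the file cannot change
    # this invocation's behaviour.
    main() {
      echo "step 1"
      sleep 10
      echo "step 2"
    }

    # Keep the call and the exit on one line: bash has already read this line,
    # so even if the file changes underneath it, nothing after the function
    # returns is re-read from disk.
    main "$@"; exit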
bash will read(), do its multi-step expansion-parsing thing and then lseek back so the next read starts on the next input it needs to handle. This is why the problems described in the story can happen.
The other way to fix this is to simply use editors that make a new file and move it over the target on save. I believe vim or neovim does this by default, but things like ed or vi do not. Emacs will do something similar on first save if you did not (setq backup-by-copying t), but any write after that will still be done in place. I tested this trivially, without reviewing the Emacs source, simply by doing the following; you can too, with your $EDITOR of choice:
#!/usr/bin/env bash
echo test
sleep 10
# evil command below, uncomment me and save
# echo test2
While it is sitting in the sleep, edit and save the script; if the change takes effect in the already-running instance (e.g. test2 gets printed), your editor writes the file in place and can cause the problem described.
> If you modify a shell script while it’s running, the shell executes the modified file
That is dependent on the OS. In this case wasn't the shell script just executed fresh from a cronjob?
I remember on Digital Unix - on an Alpha so this was a few years ago - that you could change a c program (a loop that printed something then slept, for example), recompile and it would change the running binary.
> wasn't the shell script just executed fresh from a cronjob?
The description said that the script changed while it was running, so certain newly introduced environment variables didn’t have values and this triggered the issue.
My reading was that this was just a terrible coincidence - the cron job must have started just before the upgrade.
Regarding changing a C program, now you mention it I think that the behaviour you describe might also have happened on DG/UX, after an upgrade. IIRC it used to use ETXTBSY and after an upgrade it would just overwrite.
Not really behaviour that you want (or expect) tho.
It's nice to see the same mistakes that people have been making for as long as I've been alive, on small and large systems all over the world, still happen on projects with professional teams from HPE or IBM that cost hundreds of millions of dollars.
From what I know, Linux so far doesn't have a mandatory exclusive-lock capability on files; Windows does, however. So on Linux you can't mark a file as being in the exclusive possession of a process.
Ahhh the joy of lustre and the accidental cronjob.
About 15 years ago I experienced the same thing. An updater script based on rsync was trying to keep one NFS machine image in sync with another. However, for whatever reason, the script accidentally tried to sync the entire NFS root directory with its own, deleting everything, show by show, in reverse alphabetical order.
At the time Lustre didn't really have any good monitoring tools for showing you who was doing what, so they had to wait till they hit a normal NFS server before they could figure out and stop what was deleting everything.
Needless to say, a lot of the backups may have been failing.
Huh. I may be remembering incorrectly, but I recall somebody somewhat entrenched in a related business telling me, roughly two years ago, that HP had been going downhill from an industry perspective…
Nice to see them completely own up to the mistake right away. I wonder who made the final call on doing so, companies admitting fault so transparently & immediately offering recourse seems pretty damn rare anymore.
Without the intent of sounding xenophobic, I wonder if it’s because it’s HP Japan where reputation is much more culturally important. US MBA’s admitting fault… haha…
PATH should always be set. Try: env -i sh -c 'echo $PATH'
If you're prioritizing convenience over correctness, prepare to face the consequences.
> Every tool has its place, and dogma is often unhelpful.
Visual Basic's "ON ERROR RESUME NEXT" perhaps also had its place. That doesn't mean that using it is good advice.
If anything, I would consider the often cited wooledge etc. advice of not using -e/-u as dogma. Case in point: no one lost 77TB of data because they should not have used -e/-u.
I said PATHs, not PATH. There are at least four I use on a regular basis.
Super not interested in a pedantic debate. It’s easy to armchair-analyze. I found flaws in 55 codebases at Matasano, and yours is no exception.
-e makes it super annoying to pass a variable number of args to a script, since shift will fail and cause an exit.
I do usually turn it on after, but you seem like the type to fail a code review if a script doesn’t start with it. I don’t think that’s a productive attitude.
I disagree. You can write shell scripts just fine and always set -euo pipefail
* I'm not sure what you mean by four PATHs, but if you really mean to be using unset variables for them, you should be using the "${V-}" or "${V:-}" syntax, which does not fail. But again, I don't know why you would do this other than maybe [[ "${1-}" ]]
* Variable arguments are still trivial with $#. Check (($#>3)), use while (($#>0)), etc
I also disagree that this is unproductive. With minor modifications (adding :- or -), you can prevent a whole class of bugs (undefined variables). This would have prevented real-world issues such as the one in the post here, as well as Steam wiping home directories when it ran (not sure of the exact syntax) rm -rf $STEAMROOT/* with an unset variable. A sketch of these patterns follows below.
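A minimal sketch of those patterns together (names and paths are illustrative only):

    #!/usr/bin/env bash
    set -euo pipefail

    # Optional first argument with a default; "${1-...}" does not trip nounset.
    target_dir="${1-/tmp/demo}"
    if (($# > 0)); then shift; fi

    # Handle however many arguments remain without tripping errexit.
    while (($# > 0)); do
      echo "extra arg: $1"
      shift
    done

    echo "would clean up ${target_dir} (not really)"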
I don't think the audience is interested in this. If you'd like to be specific, I'm happy to talk about specific critiques. Otherwise it's just posturing, and there are better things to do over the holidays.
The original assertion was that under no circumstances should a bash script not begin with -e. I gave a circumstance (passing optional arguments), and said dogma is often counterproductive. I stand by all of those.
I kind of agree with your point that there should be exceptions, but I think I also agree with OP that using -e as a general rule is probably a safe starting point.
If you mean as a shebang (#!/bin/sh -eu), I would suggest switching to using "set" instead, because the shebang will not be interpreted if the script is run as "sh script.sh" (as opposed to ./script.sh).
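A two-line illustration of the difference (not from the incident's script):

    #!/bin/sh -eu   # honoured when launched as ./script.sh; ignored by "sh script.sh"
    set -eu         # takes effect no matter how the script is invoked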
Just pointing out that those are most likely just the days the files were saved. There could still be some unlucky souls who ran computations for several days/weeks that happened to terminate on those days (and store the results). Those people could lose significantly more than a day and a half. On the flip side, HPC jobs tend to be frequently checkpointed unless the storage cost is prohibitive for the type of job.
> However, during deployment, there was a lack of consideration as the cronjob was not disabled.
I'm intrigued to see that the report you link (which is in Japanese) mentions `find` and `bash` by those names, but doesn't contain the word `cron`. How does the report refer to the idea of a "cronjob"? Why is it different?
The Japanese text in that PDF doesn't say anything about cron. It just says that the script was overwritten "while there was an executing script in existence" ("実行中のスクリプトが存在している状態で"), and doesn't say whether that was because that executing script was launched by cron or by hand.
The style of apology is very nice. It is not as extensive as some technical post-mortem analyses that I've read, but all of the important things are here.
And always, always, use ShellCheck (https://www.shellcheck.net/) to catch most pitfalls and common mistakes in this powerful but dangerous language that is shell scripting.
[^1]: I think this gist is better than the original article on which it is based, because the article also suggested changing the IFS variable, which is not that good a piece of advice, so sadly the original text becomes a bad recommendation!
Good point, except if an important part of your complex script is really just plumbing the outputs of one program to the inputs of another. Because that's what shell scripting excels at. Calling an external process is a first-class citizen in shell, whereas it is a somewhat clunky thing (or at the very least, much more verbose) to do in most other languages.
There is a reading which suggests that an environment variable being unset caused an overabundance of files being deleted. `set -u` causes the script to exit if any variables are unset.
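To illustrate that failure mode, here is a hypothetical reconstruction (illustration only, not the actual HPE script):

    #!/bin/bash
    # Hypothetical cleanup of backup logs older than 10 days.
    LOG_DIR=/work/backup/logs

    # If an updated copy of the script is written over this file while it is
    # running and the shell re-reads it from the middle, a renamed or newly
    # introduced variable may never have been assigned by the already-executed
    # top half. The expansion below is then empty, and the command becomes
    # "find / -mtime +10 -type f -delete", wiping far more than log files.
    find ${LOG_DIR}/ -mtime +10 -type f -delete

    # With "set -u" at the top, bash would abort on the unset variable instead
    # ("LOG_DIR: unbound variable").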
Everyone is mentioning error control for shell scripts or "don't use shell scripts", but neither of those are the solution to this problem. The solution to this problem is correctly implementing atomic deployment, which is important for any system using any programming language.
What I like to do is have two directories I ping pong between when deploying, and a `cur` symlink that points to the current version. The symlink is atomically replaced (new symlink and rename it over) whenever the deploy process completes. Any software/scripts using that tree will be written to first chdir() in, which will resolve the symlink at that time, and thus won't be affected by the deploy (at least as long as you don't do it twice in a row; if that is a concern due to long running processes, you could use timestamped directories instead and a garbage collection process that cleans stuff up once it is certain there are no users left).
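As a rough sketch of that scheme (directory names are made up, and this assumes GNU coreutils for mv -T):

    #!/usr/bin/env bash
    set -euo pipefail

    deploy_root=/opt/backup-scripts
    release="$deploy_root/releases/$(date +%Y%m%d%H%M%S)"

    # Stage the new version in its own directory; nothing is running from it yet.
    mkdir -p "$release"
    cp -r ./scripts/. "$release/"

    # Atomically repoint the "cur" symlink: create a new symlink, then rename it
    # over the old one (rename is atomic; "ln -sfn" alone is not).
    ln -s "$release" "$deploy_root/cur.new"
    mv -T "$deploy_root/cur.new" "$deploy_root/cur"

    # Consumers first cd into the tree, resolving the symlink once:
    #   cd /opt/backup-scripts/cur && ./backup.sh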
>the find command containing undefined variables was executed and deleted the files
Just a note that "set -u" at the beginning of a bash script will cause it to throw an error for undefined variables. A warning, though: this should of course be tested, as it will also cause [[ $var ]] to fail.
If that's the case:
[ -z "${VAR:-}" ] && echo "VAR is not set or is empty" || echo "VAR is set to $VAR"
I've been a Linux coder and user forever, and I didn't know that bash "reloads" a script while running if the file is modified. Good to learn before I also delete a whole filesystem due to this! :)
> However, during deployment, there was a lack of consideration as the periodical script was not disabled.
> The modified shell script was reloaded from the middle.
In my opinion, this is the wrong takeaway, and an important lesson was not learned.
It's not an operator "lack of consideration".
The lesson should be "when dealing with important data, do not use outrageously bad programming languages that allow run-time code rewriting, and that continue to execute even in the presence of undefined variables".
If you use shell scripting, this is bound to happen, and will happen again.
"We'll use Python or anything else instead of shell" would fundamentally remove the possibility of this category of failure.
> outrageously bad programming languages that allow run-time code rewriting
Almost all languages allow run-time code rewriting. Some of them just make it easier than others, and some of them make it a very useful feature. If you're very careful, updating a bash script while you're running it can be useful, but most often it's a mistake; in Erlang, hot loading is usually intentional and often useful. Most other languages don't make it easy, so you'll probably only do it if it's useful.
The problem was not that they used shell scripts. The problem was that the people writing the shell scripts were just bad programmers. If you hire a bad programmer to write them in Python, they'll still have tons of bugs.
The shell scripts I write have fewer bugs than the Python code I see other teams churn out. But that's because I know what I'm doing. Don't hire people who don't know what they're doing.
I have switched to F# for scripting tasks and have found F# scripts are (usually) either correct on the first try or fail at the type-checking stage. I would highly recommend it for anything near production.
In the process of a functional modification to the backup program by Hewlett-Packard Japan, the supplier of the supercomputer system, there were problems with an unintended modification of the program and with its deployment procedure, which caused a malfunction in which files under the /LARGE0 directory were deleted instead of the backup log files that were no longer needed.
Translated with www.DeepL.com/Translator (free version)
The cause of this is a known behavior of Unix/Linux shell scripts, but unfortunately not everyone knows it. If you change a script while it is running, the shell that runs it will read (what it thinks is) the next line at the position it expects in the old script, but from the new file contents. So what it reads and executes will probably not be what you wanted.
Assuming this was a "scratch" HPC filesystem, as I'd guess, "scratch" is used advisedly -- users should be prepared to lose anything on it, though not through finger trouble like this. However, if I understand correctly from the comments, I'm surprised at the tools, and that the vendor was managing the filesystem. I'd expect to use https://github.com/cea-hpc/robinhood/wiki with Lustre, though I thought I'd seen a Cray presentation about tools of their own.