Japan HP accidentally deleted 77TB data in Kyoto U. supercomputing system (kyoto-u.ac.jp)
260 points by rguiscard on Dec 30, 2021 | 153 comments



I really appreciate the announcement from Hewlett Packard, which is very apologetic: https://www.iimc.kyoto-u.ac.jp/services/comp/pdf/file_loss_i...

They do not try to blame it on complex systems or other factors.

Users lost about a day and a half of recent work (which doesn't seem to be that bad).

  Regarding the file loss in the Lustre file system of your supercomputer system, we are 100% responsible.
  We deeply apologize for the great deal of inconvenience caused by this serious file-loss failure.

  We would like to report the background of the file disappearance, its root cause and future countermeasures as follows:

  We believe that this file loss is 100% our responsibility.
  We will offer compensation for users who have lost files.

  [...]

  Impact:
  --

  Target file system: /LARGE0

  Deleted files: December 14, 2021 17:32 to December 16, 2021 12:43

  Files that were supposed to be deleted: Files that had not been updated since 17:32 on December 3, 2021

  [...]

  Cause:
  --

  The backup script uses the find command to delete log files that are older than 10 days.

  A variable name is passed to the delete process of the find command.

  A new improved version of the script was applied on the system.

  However, during deployment, there was a lack of consideration as the periodical script was not disabled.

  The modified shell script was reloaded from the middle.

  As a result, the find command containing undefined variables was executed and deleted the files.

  [...]

  Further measures:
  --

  In the future, the programs to be applied to the system will be fully verified and applied.

  We will examine the extent of the impact and make improvements so that similar problems do not occur.

  In addition, we will re-educate the engineers in charge on human error and risk prediction / prevention to prevent recurrence.

  We will thoroughly implement the measures.


Japanese companies structure apologies very differently from US ones, because the legal consequences are very different. In the US, an apology is considered an admission of responsibility and is often the starting point of legal action against the culprit, while in Japan, a sufficiently sincere* apology may well defuse the situation entirely.

* 真 makoto, a word often glossed as "sincere" but not identical in meaning: it's about the amount of effort you're willing to take on, not how "honestly" you feel something

Also, the culprit here is not HP proper but their consulting/SI wing HP Enterprise, which has a, uhh, less than stellar reputation for competence.


Apologies, both personal and corporate, are taken very seriously in Japanese culture[1]. They're a way of preserving honor that dates back to the samurai era. You can see this in the custom of bowing, where the length and depth of a bow reflect the gravity of the situation. The act of seppuku can be considered an extreme version of this.

I'm not a Japanophile, but find their culture fascinating.

[1]: https://theculturetrip.com/asia/japan/articles/sumimasen-beh...


Japanese firms are interesting to work with. I enjoyed the culture, personally. I have two anecdotes.

The old white box PC shop I worked for, back in the 90s, quoted PCs to a Japanese-owned auto parts manufacturer. The Customer accepted the quote, paid an up-front deposit, and requested the PCs not be built and delivered until some construction at their site was completed in a few months.

In the meantime component pricing went down and speeds/feeds went up. When the PCs were built we ended up being forced to source higher clock speed CPUs and larger hard disk drives. It took some cajoling to get the Customer to take delivery. They felt they should pay more for the upgrades. (We were actually making more money on the deal even with the upgraded components anyway!)

My current company once pitched a support agreement to a Japanese-owned firm. We offered a discount for annual commitment versus month-to-month. I'd copy/pasted the month-to-month terms for the annual but forgot to alter the minimum notice period for ending the annual option. Both month-to-month and annual indicated a notice period of 30 days rather than the intended 180 days for the annual option.

The Customer's contact questioned the notice period being the same. He asked why they wouldn't opt for annual commitment, get the discounted rate, and also have the 30 day notice period. My partner, who had worked with Japanese firms in the past, responded: "We know you wouldn't choose the annual option if your intention was not to work with us for at least a year." The Customer agreed and we ended up getting the gig on an annual basis.


> The Customer's contact questioned the notice period being the same. He asked why they wouldn't opt for annual commitment, get the discounted rate, and also have the 30 day notice period. My partner, who had worked with Japanese firms in the past, responded: "We know you wouldn't choose the annual option if your intention was not to work with us for at least a year." The Customer agreed and we ended up getting the gig on an annual basis.

That's a really cute answer by your partner, I like it a lot, even ignoring that it manages to save face while paying serious respect to the purported customer.


Writing the parent comment gave me, if nothing else, an excuse to publicly document that exchange. I was, and still am, in awe of his ability to think on his feet and come up with such a wholly appropriate response.


Because Japanese culture accepts failure. Admission and acceptance preserves honour. Western culture, particularly in north America, punishes the admission of failure. Everyone is supposed to fight things out to the last, to never give an inch. Even when companies settle cases they rarely admit wrongdoing publicly. Notice that this Japanese company talks about re-educating/training personnel. An American company would be expected to sack all involved and sue the consulting firm.


Maybe it depends on the region/prefecture, but after 3 years working in Japan, my experience is quite different. The project suffers massive delays because nobody wants to make any decision and take any responsibility. My former boss (Japanese, but had worked for many years in the US), told me that Japanese managers don't want to take risks, because if something fails they need to apologize in public, and that is pretty much social suicide there.


Hmm. I suppose it’s subjective, but isn’t America’s acceptance of failure often cited as an element in the success of its entrepreneurial spirit?

And, wrt samurai and failure, seppuku?


> isn’t America’s acceptance of failure often cited as an element in the success of its entrepreneurial spirit

Yes, the parent comment is entirely wrong. The US culture is hyper accepting of failure compared to most other prominent cultures, including Japanese culture (which in fact does not tolerate failure very well at all).


America accepts failed ventures, not individual mistakes. A business that collapses is a learning experience. But, while they are operating, a business must never admit to wrongdoing impacting customers or the public. There is great respect for failed business judgment, but very little for weakness.


Yes, I think op is conflating the terms of apology and failure in this instance.


Weren’t Japanese soldiers in WWII instructed to kill themselves instead of being captured to “preserve honor?” That doesn’t sound like the acceptance of failure to me.


It's far too much to go in to in the HN comments section from a phone, but removing the key military leadership who pushed this mindset, a new government structure, and the rejuvenation of business under e.g., Deming caused a significant shift in Japanese culture following the war.

Edit: This is not necessarily in support of the grandparent comment; rather, a caution against judging a culture based on historical anecdotes.


That sounds like saying that the ongoing US war crimes at Guantanamo are important for understanding Google app engine service contracts. Just because Japan is far away doesn't mean that everything that happened there happened in the same place at the same time.


Only certain classes of officers, if I remember correctly, and it more likely was motivated by preventing the leaking of intelligence under interrogation by the enemy, rather than really being about preserving honor. Of course it's probably easier to carry out seppuku if you convince yourself it's honorable, rather than to just do it based on rationality.

Not all failure results in capture, so one particular failure scenario does not speak of the general view on failure.


A captured soldier in war doesn't get to go home and try again after apologizing.

Also, going to war is inherently already a massive corruption of culture.


There was a misunderstanding that there would not be a good chance of dying painlessly if one were captured alive, based on Japanese norms and very scarce knowledge of American culture. Better to pull the pin while you can than to have to beg them for days to stab just an inch closer to an artery; that's what "preserve honor" ultimately means.


They were instructed to avoid capture by any means because they had been indoctrinated by propaganda that indicated horrific treatment by Allied POW personnel. While this was obviously false, it's interesting to note that Japanese knowledge of America's history with slavery, segregation, and Indian removal would have made this assumption not unreasonable, and further, may have influenced Japanese treatment of American POWs. After all, a major consideration in Japan's decision to go to war in the first place was the leadership's understanding that their lack of status as a white power would hamper their colonial ambitions. They were only a few decades removed from being excluded from the Berlin Conference, for example.


> they had been indoctrinated by propaganda that indicated horrific treatment by Allied POW personnel.

Sources? I have never heard of that before.


I can't remember where I initially read it, but it's mentioned on both https://en.wikipedia.org/wiki/Japanese_prisoners_of_war_in_W... and https://en.wikipedia.org/wiki/Propaganda_in_Japan_during_the... and presumably in the associated citations.

Can I ask where your disbelief is sourced?


Not true; there are lots of sources of information that give the entire picture: it is a tradition from the age of the samurai.


That's an oversimplification. Pre-existing notions about warrior conduct certainly played a role, but "samurai" were a class (bordering on caste); attitudes descending from bushido were a top-down mandate, enforced by the officer class, not something widespread in common civilian life (save for knowledge of how one is supposed to act towards high-ranking personnel).

If you think that the domestic propaganda machine wasn't running at mach speed in order to shape public perceptions to what would be most beneficial to the Imperial Army and Navy, I don't know what to say. They definitely were, and they definitely pushed, from multiple angles, the idea that surrendering would result in a worse spiritual, material, physical, and psychological outcome than the alternatives (take your pick of whichever motivates you most effectively).


That's exactly what acceptance of failure sounds like to me. E.g. you abandon a chess game when you accept you'll lose. To just keep pushing for a situation you know is irredeemable is to deny you've failed.


When you learned about that, you did not read the entire paragraph: being captured was traditionally considered a dishonor (for the past 500 years, at least), it is a huge failure for a soldier (or samurai). Ritual suicide is the solution to that dishonor.

There are degrees of failure; some are considered to be beyond fixing.


Citizens were also instructed to kill themselves.


But at least you won't end up in court.


On the other hand, failure is generally frowned upon and avoided at all costs.


Unfortunately it can also be taken a bit too far. It’s not uncommon to essentially buy yourself out of punitive criminal justice via an apology and some cash (jidan/gomen money). It’s not totally unlike a settlement here, except it’s much more acceptable as an opening move (as a victim, you should seriously consider it). It seems to be inline with the goal of preserving social harmony and making “it” go away asap.


> a sufficiently sincere* apology may well defuse the situation entirely.

> * 真 makoto, a word often glossed as "sincere" but not identical in meaning: it's about the amount of effort you're willing to take on, not how "honestly" you feel something

While it's true that makoto can be translated as sincere(ly) in constructions like "makotoni moushiwakearimasen" (I sincerely apologize) (although this is written 誠に rather than 真に), it is unlikely that the word makoto would be used in a phrase like "a sincere apology" or in discussing how sincere an apology was, so I don't really think introducing the word "makoto" in your comment sheds any additional light on japanese culture surrounding apologies.

You could actually make the exact same comment about how "real" in English can also effectively mean sincere in "really sorry" and draw the same conclusions about American culture.


My take away (as westerner with zero direct Japanese culture exposure) from the comment you’re replying to was that in Japan, companies are incentivized to take on some measure of ownership and voluntary restitution, because there is some legal notion in Japan around “honest mistakes not being litigable if they are genuinely rectified.”


In the US "genuinely rectified" is the standard for civil suit ("actual damages" for negligence, and "specific performance" or monetary equivalent for contracts). Punitive damages are added only for malicious intent.

The reason many US companies don't apologize is because they don't want to make restitution, and can get away without paying.


I think it’s more that legal measures would not be employed, rather than that they are legally unactionable.

You could have an uphill battle against an unsympathetic judge in front of you if you sue anyway though.


“In the US, an apology is considered an admission of responsibility and is often the starting point of legal action against the culprit, while in Japan, a sufficiently sincere* apology may well defuse the situation entirely.”

——-

“hospital staff and doctors willing to discuss, apologize for and resolve adverse medical events through a “collaborative communication resolution program” experienced a significant decrease in the filing of legal claims, defense costs, liability costs and time required to close cases.”

https://www.natlawreview.com/article/you-had-me-i-m-sorry-im...


HP and HPE are now two separate companies, split from post-Fiorina HP.

HP does consumer grade stuff only, while HPE does the enterprise side (not just consulting; in fact a non-trivial portion of HPE's consulting arm was spun off and merged into DXC)


I agree with your point, and want to add that HP does commercial grade end-user compute and printing along with the related enterprise services. They have a whole set of offerings for the medical industry [1], industrial printing [2], and enterprise PC fleet management services [3].

[1] https://www.hp.com/us-en/printers/3d-printers/industries/hea...

[2] https://www.hp.com/us-en/industrial-digital-presses.html

[3] https://www.hp.com/us-en/services/manageability.html


I think "sell the HP-35 for 3.14 × cost of materials" cool-HP became Agilent, right?

If I wanted to follow the trail of awesomeness what forest should I be sticking my nose to the ground in? :)


That depends on which trail of awesomeness you are after.

The trail that maintains the legacy of DEC and Silicon Graphics and Cray is in HPE (where I work). The Cray legend is still very much alive, but you can still detect the whiff of the spirit that made HP and DEC minicomputers extraordinary.


Well, I suspect the SGI legacy is now in better hands than when it was controlled by Rackable with the branding filed off. The only good parts they sold us were the UltraViolets, and those were probably the most nonsensical purchase (protip: do not buy supercomputer modules just to run 8 VMs on them; it's a waste of money even if the hw is awesome)


Now Keysight; Agilent is medical equipment.


More specifically, it spun off and merged with CSC to create DXC.


> In the US, an apology is considered an admission of responsibility and is often the starting point of legal action against the culprit

source? this is a popular theory among non-lawyers, but not, as far as I can tell, well-supported by the evidence. http://jaapl.org/content/early/2021/05/19/JAAPL.200107-20 has extensive citations including for their claim that "In theory, telling a patient about an error may make patients more likely to pursue litigation. In practice, however, bad outcomes alone are typically not reason enough for patients or their families to file malpractice claims."


Medicine may not fit the pattern (personally I'd want to know where I stand, even if the news is bad), but I took the OP to be saying "American firms prefer not to go on record saying they screwed up, since that would naturally be brought up in subsequent legal proceedings".


I flip-table rage-quit my 15-year job at a university over a new C-suite from who-knows-where throwing HPE equipment/software/consultants at me for a big project of sorts; I found it all such utter crap that I refused to be a part of that stupidity. I have my standards.


Before I left HPE post-Cray-acquisition, some of the folks in the consulting/"cloud" division insisted that, to use their tooling, we had to insert a Windows machine into ClusterStor that mounted Lustre, so that they could run their PowerShell script to gather usage metrics.

What I found working with these teams was that the desire to flip tables was quite strong after meeting with them. I tried to address their concerns, one point at a time. They were bewildered that Windows could not mount (modern) Lustre. Really bewildered. I offered to help rewrite their scripts in another (portable) language, so we could avoid these problems. Still they were bewildered.

They were not why I left. Merely a confirmation that my decision to leave was the right one.


HPE is known for its acquisitions and, with that, there are diverse hardware teams who approach engineering differently. This can be a source of amazing innovation or power battles. The folks inside HPE need to learn how to engage in technical conflict better to prevent these catastrophic scenarios.

And yes, your decision was the right one.


I once caused a blank box to appear on Rakuten's homepage for several hours. My boss had to fly to Japan to apologize in person to their CEO.


Or that is what he used to justify a free (likely business class) flight to JP.


No, I don’t work for Rakuten, but the in-person apology from an appropriately senior person is hugely important for large Japanese organizations.

In my current company I have the opportunity to work with several large Japanese banks. Various members of our C-Suite and senior management travel occasionally for this purpose.

A sufficiently sincere apology with the appropriate level of ceremony can be the difference between maintaining and losing a substantial contract.


I don't know a huge amount about Rakuten. Don't they purposefully adopt a pretty flat communication structure, and the CEO travels around a lot, as well as English demanded in senior roles?


I was deeply impressed some--30?--years ago when there was a minor scandal in sumo wrestling: some kid, who had been advanced too quickly, did a few stupid things, as I recall the sort of thing you can find in American sports pages every week. The heads of the sumo wrestling association acknowledged that they had contributed to the situation, and docked their own pay. Do you think Roger Goodell is going to do that?


> 弊社100%の責任により ("due to our company being 100% responsible")

Voluntarily stating 100% responsibility is consequential and not typical; it smells of politics.


I presume there were extensive discussions between the two parties about the wording before this statement was published.


So this is something I’ve never understood. If you modify a shell script while it’s running, the shell executes the modified file. This normally but not always causes the script to fail.

Now I’ve known about this behaviour for a very long time and it always seemed very broken to me. It’s not how binaries work (at least not when I was doing that kind of thing).

So I guess bash or whatever does an mmap of the script it’s running, which is presumably why modifications to the script are visible immediately. But if a new file was installed eg using cp/tar/unzip, I’m surprised that this didn’t just unlink the old script and create a new one - which would create a new inode and therefore make the operation atomic, right? And this (I assume) is why a recompiled binary doesn’t have the same problem (because the old binary is first unlinked).

So, how could this (IMO) bad behaviour be fixed? Presumably mmap is used for efficiency, but isn’t it possible to mark a file as in use so it can’t be modified? I’ve certainly seen on some old Unices that you can’t overwrite a running binary. Why can’t we do the same with shell scripts?

Honestly, while it’s great that HP is accepting responsibility, and we know that this happens, the behaviour seems both arbitrary and unnecessary to me. Is it fixable?


> isn’t it possible to mark a file as in use so it can’t be modified?

That's the route chosen by Windows for binary executables (exe/dll) and various other systems. Locking a file against writes, delete/rename or even read is just another flag in the windows equivalent of fopen [1]. This makes for software that's quite easy to reason about, but hard to update. The reason why you have to restart Windows to install Windows updates or even install some software is largely due to this locking mechanism: you can't update files that are open (and rename tricks don't work because locks apply to files, not inodes).

With about three decades of hindsight I'm not sure if it's a good tradeoff. It makes it easy to prevent the race conditions that are an endless source of security bugs on unix-like systems; but otoh most software doesn't use the mechanism because it's not in the lowest-common-denominator File APIs of most programming languages; and MS is paying for it with users refusing to install updates because they don't want to restart their PC.

1: Search for FILE_SHARE_DELETE in https://docs.microsoft.com/en-us/windows/win32/api/fileapi/n...


Files in use can be shadow updated and then will be actually replaced when possible.

Naturally no one reads MSDN docs.

Also worth noting that other non-UNIX systems follow a similar approach to file locking.


>and rename tricks don't work because locks apply to files, not inodes

My experience is the opposite: you can rename a locked file and place a new file with the name of the former one.

Depends on the lock type, I suppose.


On Unix/Linux you can't update a file mmaped for execution either - text files are busy.


I've updated .so files on FreeBSD while they're running. They weren't busy, and a program which had one mmapped to run promptly crashed (my update wasn't intended to be hot loaded and wasn't crafted to be safe, although it could have been if I knew it was possible). And now I won't forget why I should use install instead of cp (install unlinks before writing by default; cp opens and overwrites the existing file).


In my experience on Linux, shared libraries can be modified while running (often causing a crash), while executables cannot (ETXTBUSY).


This behavior in shell scripts predates mmap. In very early versions of Unix it was arguably even useful; there was a goto command which was implemented by seeking on the shell-script file descriptor rather than as a shell builtin, for example. I don't know of any use for it since the transition to the Bourne shell, but my knowledge is far from comprehensive. (I suppose if your shell script is not small compared to the size of RAM, it might be undesirable to read it all in at the start of execution; shar files are a real-life example even on non-PDP-11 machines.)

As I understand it, the reason for ETXTBSY ("on some old Unices...you can't overwrite a running binary") was to prevent segfaults.

cp usually just opens the file O_WRONLY|O_TRUNC, which seems like the wrong default; Emacs for example does create a new file and rename it over the old one when you save, usually, allocating a new inode as you say. By default it makes an exception if there are other hardlinks to the file.

Btrfs and xfs have a "reflink" feature that allows you to efficiently make a copy-on-write snapshot of a file, which would be ideal for this sort of thing, since the shell or whatever won't see any changes to the original file, even if it's overwritten in place. Unfortunately I don't think you can make anonymous reflinks, so for the shell to reflink a shell script when it starts executing it would need write access to somewhere in the filesystem to put the reflink, and then it would need to know how to find that place, somehow. And of course that wouldn't help if you were running on ext4fs or, I imagine, Lustre, though apparently an implementation was proposed in 02019: https://wiki.lustre.org/Lreflink_High_Level_Design


> there was a goto command which was implemented by seeking on the shell-script file descriptor rather than as a shell builtin, for example.

Oh noooo I just realized you could probably implement a shared library loadable module for bash `enable` that does the same thing... just fseek()s the fd...

*Runs for the hills screaming*


“Emacs for example does create a new file and rename it over the old one when you save, usually, allocating a new inode as you say. By default it makes an exception if there are other hardlinks to the file.”

Though the trade off is that all operation ceases on a full hard drive.

I don’t have a better solution, but it’s worth noting.


Emacs gives you an error message in that case rather than destroying the old version of the file and then failing to completely write the new version, in the cases where it does the tempfile-then-rename dance. This is usually vastly preferable if Emacs or your computer crashes before you manage to free up enough space for a successful save.

It doesn't cease all operation; other Emacs features work as they normally do. Bash, by contrast, stops being able to tab-complete filenames, at precisely the time when you most need to be able to rapidly manipulate your files. At least, that's the case with the default completion setup in a few recent versions of Ubuntu.


Well, it looks like creating another hard link is a nearly-free solution. And beyond that, since emacs already has both behaviors, presumably you can tell it you want the in-place modification.


I think you can just customize the backup-by-copying variable to t, though I haven't tried it. Check the manual.


Does it mean that I need to have extra free space? Does not sound good.


> So I guess bash or whatever does an mmap of the script it’s running

this is incorrect, and is relatively easy to test:

  $ strace -y -P /tmp/test.sh bash /tmp/test.sh
  ioctl(3</tmp/test.sh>, TCGETS, 0x7ffc6daea580) = -1 ENOTTY (Inappropriate ioctl for device)
  lseek(3</tmp/test.sh>, 0, SEEK_CUR)     = 0
  read(3</tmp/test.sh>, "#!/bin/sh\n", 80) = 10
  lseek(3</tmp/test.sh>, 0, SEEK_SET)     = 0
  dup2(3</tmp/test.sh>, 255)              = 255</tmp/test.sh>
  close(3</tmp/test.sh>)                  = 0
  fcntl(255</tmp/test.sh>, F_SETFD, FD_CLOEXEC) = 0
  fcntl(255</tmp/test.sh>, F_GETFL)       = 0x8000 (flags O_RDONLY|O_LARGEFILE)
  newfstatat(255</tmp/test.sh>, "", {st_mode=S_IFREG|0644, st_size=10, ...}, AT_EMPTY_PATH) = 0
  lseek(255</tmp/test.sh>, 0, SEEK_CUR)   = 0
  read(255</tmp/test.sh>, "#!/bin/sh\n", 10) = 10
  read(255</tmp/test.sh>, "", 10)         = 0
the reason why modifying a script during execution can have unpredictable results, not demonstrated in this test, is that Unix shells traditionally alternate between reading commands and executing them, instead of reading the entire file (potentially very large compared to 1970s RAM size) and executing commands from the in-memory copy. on modern systems, shell script sizes are usually negligible compared to system RAM. therefore, you can manually cause the entire file to be buffered by enclosing the script in a function or subshell:

  #!/bin/sh
  main() {
  # script goes here
  }
  main


>So, how could this (IMO) bad behaviour be fixed?

By reading in the whole file at once. Bash does not mmap the script it is parsing. You can see this behavior with

    strace -e read,lseek bash << EOF
    echo 1
    echo 2
    EOF
bash will read(), do its multi-step expansion-parsing thing and then lseek back so the next read starts on the next input it needs to handle. This is why the problems described in the story can happen.

The other way to fix this is to simply use editors that will just make a new file and move it over the target file on save. I believe vim or neovim does this by default, but things like ed or vi do not. Emacs will do something similar on first save if you did not (setq backup-by-copying t), but any write after will still be done in-place. I tested this trivially, without reviewing the emacs source, simply by doing the following; you can too with your $EDITOR of choice:

    #!/usr/bin/env bash
    echo test
    sleep 10
    # evil command below, uncomment me and save
    # echo test2
While the script is in the sleep, uncomment the line and save; if the change takes effect (i.e. "test2" is printed), your editor writes the file in place and can cause the problem described.


> If you modify a shell script while it’s running, the shell executes the modified file

That is dependent on the OS. In this case wasn't the shell script just executed fresh from a cronjob?

I remember on Digital Unix - on an Alpha, so this was a few years ago - that you could change a C program (a loop that printed something then slept, for example), recompile, and it would change the running binary.


> wasn't the shell script just executed fresh from a cronjob?

The description said that the script changed while it was running, so certain newly introduced environment variables didn’t have values and this triggered the issue.

My reading was that this was just a terrible coincidence - the cron job must have started just before the upgrade.

Regarding changing a C program, now you mention it I think that the behaviour you describe might also have happened on DG/UX, after an upgrade. IIRC it used to use ETXTBSY and after an upgrade it would just overwrite.

Not really behaviour that you want (or expect) tho.


It's nice to see the same mistakes that people have been making for as long as I've been alive, on small and large systems all over the world, still happen on projects with professional teams from HPE or IBM that cost hundreds of millions of dollars.


From what I know, Linux so far doesn't have an exclusive lock capability on a file, while Windows does. So on Linux you can't mark a file as being in the exclusive possession of a process.


Downvoters should read up on the state of mandatory locking in Linux, what conditions need to be met, and how reliable it is.


> The modified shell script was reloaded from the middle.

This is an incredible edge case. I'm amazed they hit this issue and just as amazed that they correctly identified that issue and reported on it.

This response is great; it's the exact opposite of the wishy-washy, mealy-mouthed response to the LastPass security incident.


Ahhh the joy of lustre and the accidental cronjob.

About 15 years ago I experienced the same thing. An updater script based on rsync was trying to keep one NFS machine image in sync with another. However, for whatever reason, the script accidentally tried to sync the entire NFS root directory with its own, deleting everything, show by show, in reverse alphabetical order.

At the time Lustre didn't really have any good monitoring tools for showing you who was doing what, so they had to wait till they hit a normal NFS server before they could figure out and stop what was deleting everything.

Needless to say, a lot of the backups may have been failing.


rsync has a number of safety and boundary options, not to mention --dry-run.

Options, as in I also found out about them the hard way.


For this reason I actually use simpler tools than rsync.


Huh. I may be remembering incorrectly, but I recall somebody somewhat entrenched in the related business telling me, roughly two years ago, that HP had been going downhill from an industry perspective…

Nice to see them completely own up to the mistake right away. I wonder who made the final call on doing so; companies admitting fault so transparently and immediately offering recourse seems pretty damn rare these days.

Without the intent of sounding xenophobic, I wonder if it’s because it’s HP Japan where reputation is much more culturally important. US MBA’s admitting fault… haha…


Every shell script should start with set -e and set -u


e doesn’t work for (subshell | commands), and u is inconvenient when appending to PATHs. Every tool has its place, and dogma is often unhelpful.


> e doesn’t work for (subshell | commands)

That's not an argument against enabling it.

In bash, -o pipefail addresses this.

> and u is inconvenient when appending to PATHs

PATH should always be set. Try: env -i sh -c 'echo $PATH'

If you're prioritizing convenience over correctness, prepare to face the consequences.

> Every tool has its place, and dogma is often unhelpful.

Visual Basic's "ON ERROR RESUME NEXT" perhaps also had its place. That doesn't mean that using it is good advice.

If anything, I would consider the often cited wooledge etc. advice of not using -e/-u as dogma. Case in point: no one lost 77TB of data because they should not have used -e/-u.


I said PATHs, not PATH. There are at least four I use on a regular basis.

Super not interested in a pedantic debate. It’s easy to armchair analyze. I found flaws in 55 codebases at Matasano, and yours is no exception.

e makes it super annoying to pass a variable number of args to a script, since shift will fail and cause an exit.

I do usually turn it on after, but you seem like the type to fail a code review if a script doesn’t start with it. I don’t think that’s a productive attitude.


I disagree. You can write shell scripts just fine and always set -euo pipefail

* I'm not sure what you mean by four PATHs, but if you really mean to be using unset variables for them, you should be using "${V-}" or "${V:-}" syntax, which does not fail. But again I don't know why you would do this other than maybe [[ "${1-}" ]]

* Variable arguments are still trivial with $#. Check (($#>3)), use while (($#>0)), etc

I also disagree that this is unproductive. With minor modifications (adding :- or -), you can prevent a whole class of bugs (undefined variables). This would have prevented real-world issues such as the one in this post, as well as Steam wiping home directories when it ran (not sure of the exact syntax) rm -rf $STEAMROOT/* with an unset variable.


That's quite a number of bad-faith assumptions in your comment, which are also incidentally wrong.

> e makes it super annoying

I rest my case?


I don't think the audience is interested in this. If you'd like to be specific, I'm happy to talk about specific critiques. Otherwise it's just posturing, and there are better things to do over the holidays.

The original assertion was that under no circumstances should a bash script not begin with -e. I gave a circumstance (passing optional arguments), and said dogma is often counterproductive. I stand by all of those.

Let's agree to disagree and move on.


I kind of agree with your point that there should be exceptions, but I think I also agree with OP that using -e as a general rule is probably a safe starting point.


I long adopted

    /bin/sh -eu
header in my scripts. It's a must-have.


If you mean as a shebang (#!/bin/sh -eu), I would suggest switching to using "set" instead, because the shebang will not be interpreted if the script is ran as "sh script.sh" (as opposed to ./script.sh).


Pardon my ignorance, but what do those do? Searching for it doesn't give me anything.


See https://www.gnu.org/software/bash/manual/bash.html#The-Set-B..., in short -e makes scripts exit if a command fails, and -u makes them exit if a variable is undefined.

If you think your colleagues won't know this, "set -o errexit; set -o nounset" would be easier for them to search on.

(Via "3.4 Shell Parameters" → "3.4.1 Positional Parameters" → "4 Shell Builtin Commands", or searching the whole page for "-e".)


Thank you very much, I appreciate it.



Agreed - no corporate-speak, sounds like it was written by an actual human.


It sounds a lot like Japanese corpo-speak.


Which is very formulaic, but also almost definitely written by a human :D


Just pointing out that those are most likely just the days the files were saved. There could still be some unlucky souls that ran computations for several days/weeks that happened to terminate on those days (and store the results). Those people could lose significantly more than a day and a half. On the flip side, HPC jobs tend to be frequently checkpointed unless the storage cost is prohibitive for the type of job.


Is checkpointing really that common anymore? When I talked to my local HPC admins they gave me the impression that nobody does it anymore.


> However, during deployment, there was a lack of consideration as the cronjob was not disabled.

I'm intrigued to see that the report you link (which is in Japanese) mentions `find` and `bash` by those names, but doesn't contain the word `cron`. How does the report refer to the idea of a "cronjob"? Why is it different?


The Japanese text in that PDF doesn't say anything about cron. It just says that the script was overwritten "while there was an executing script in existence" ("実行中のスクリプトが存在している状態で"), and doesn't say whether that was because that executing script was launched by cron or by hand.


I took: "bash は、シェルスクリプトの実行中に適時シェ", which means it's either cronjob or sleep with a loop ( https://iww.hateblo.jp/entry/20211229/file_lost_insident )


I read this sentence as “Bash reads the shell script just-in-time while executing it”, with no context as to why it was running (cron, loop, by hand…)


"シェルスクリプト" is "shell script", but my Japanese is too poor to understand 実行中 or 適時.


I guess the most correct context is that the script was running "periodically".

https://zenn.dev/mattn/articles/5af86b61004bdc https://iww.hateblo.jp/entry/20211229/file_lost_insident


実行中: while executing

適時: appropriate times, or as needed


Thank you very much!


Interesting, seems the shell script was executed from the cron job just as it was being replaced on the server itself?


1.5 days isn't too bad. If it were me my primary concern would be losing bash history :D


> As a result, the find command containing undefined variables was executed

And this is why the shell should not execute commands with "undefined" variables, and should give an error instead.


The sense of honor and responsibility shining through is refreshing.


77TB in a day and a half? Impressive.

The style of apology is very nice. It is not as extensive as some technical post-mortem analyses that I've read, but all of the important things are here.


What a strangely simple error


I think I found the problem:

“A new improved version of the script was applied on the system.”


I guess this is as good of a time as any other to remind people to use the "unofficial" Bash strict mode:

https://gist.github.com/robin-a-meade/58d60124b88b60816e8349... [^1]

And always, always, use ShellCheck (https://www.shellcheck.net/) to catch most pitfalls and common mistakes on this powerful but dangerous language that is shell scripting.

[^1]: I think this gist is better than the original article it is based on, because the article also suggested changing the IFS variable, which is not such good advice, so sadly the original text becomes a bad recommendation!


And don't use shell for writing complex scripts, there are better automation tools and languages.


Good point, except if an important part of your complex script is really just plumbing the outputs of one program to the inputs of another. Because that's what shell scripting excels at. Calling an external process is a first-class citizen in shell, whereas it is a somewhat clunky thing (or at the very least, much more verbose) to do in any other languages.


I'd say that as long as a bash script fits on a single screen, you won't get any benefit from switching to some other tool or language.


Such as?


Python


For example, take my project.

https://github.com/Mylab6/PiBluetoothMidSetup

While I could have done this in Bash:

1. I don't really like Bash

2. Python is much easier. I did challenge myself to only use Python's built-in libraries, but aside from being unable to use YAML, everything works.

I can imagine in some environments you might not have access to a Python interpreter though...


> I guess this is as good of a time as any other to remind people to use the "unofficial" Bash strict mode

Not really; the report doesn't mention any error in the script.


There is a reading which suggests that an environment variable being unset caused an overabundance of files being deleted. `set -u` causes the script to exit if any variables are unset.


When communicating non-critical data-loss to teammates, I like to do it with this haiku:

  Three things are certain:
  Death, taxes, and lost data.
  Guess which has occurred.
From https://www.gnu.org/fun/jokes/error-haiku.en.html


Everyone is mentioning error control for shell scripts or "don't use shell scripts", but neither of those is the solution to this problem. The solution to this problem is correctly implementing atomic deployment, which is important for any system using any programming language.

What I like to do is have two directories I ping pong between when deploying, and a `cur` symlink that points to the current version. The symlink is atomically replaced (new symlink and rename it over) whenever the deploy process completes. Any software/scripts using that tree will be written to first chdir() in, which will resolve the symlink at that time, and thus won't be affected by the deploy (at least as long as you don't do it twice in a row; if that is a concern due to long running processes, you could use timestamped directories instead and a garbage collection process that cleans stuff up once it is certain there are no users left).


The original blue-green deployment strategy. I have done a similar thing as well.


>the find command containing undefined variables was executed and deleted the files

Just a note that "set -u" at the beginning of a bash script will cause it to throw an error for undefined variables. Warning: of course this should be tested, as it will also cause [[ $var ]] to fail.

If that's the case

[ -z "${VAR:-}" ] && echo "VAR is not set or is empty" || echo "VAR is set to $VAR"

will help test that condition


I've been a Linux coder and user forever, and I didn't know that bash "reloads" a script while running if the file is modified. Good to learn before I also delete a whole filesystem due to this! :)


Is that what happened? I can't reproduce that by changing a bash script that's running a while [ 1 ] loop.

Is it maybe that they were editing or copying the file and a cron job kicked off?


That's because it's a loop, so it's already read. Try appending a line to a running script instead.


Ah, yep. This does work, and prints out both "one" and "two":

  printf "echo one\nsleep 3\n" > s1;(bash s1 &);sleep 1 && printf "echo two\n" >> s1
That's interesting. And changing the "sleep 1" to "sleep 4" make it only output "one".


    However, during deployment, there was a lack of consideration as the periodical script was not disabled.

    The modified shell script was reloaded from the middle.
In my opinion, this is the wrong takeaway, and an important lesson was not learned.

It's not an operator "lack of consideration".

The lesson should be "when dealing with important data, do not use outrageously bad programming languages that allow run-time code rewriting, and that continue to execute even in the presence of undefined variables".

If you use shell scripting, this is bound to happen, and will happen again.

"We'll use Python or anything else instead of shell" would fundamentally remove the possibility of this category of failure.


> outrageously bad programming languages that allow run-time code rewriting

Almost all languages allow run-time code rewriting. Some of them just make it easier than others, and some of them make it a very useful feature. If you're very careful, updating a bash script while you're running it can be useful, but most often it's a mistake; in Erlang, hot loading is usually intentional and often useful. Most other languages don't make it easy, so you'll probably only do it if it's useful.


The problem was not that they used shell scripts. The problem was that the people writing the shell scripts were just bad programmers. If you hire a bad programmer to write them in Python, they'll still have tons of bugs.

The shell scripts I write have fewer bugs than the Python code I see other teams churn out. But that's because I know what I'm doing. Don't hire people who don't know what they're doing.


It’s amazing how human errors scale with technology. Just imagine, one day we’ll be making mistakes at the Type III civilization level! :)


I have switched to F# for scripting tasks and have found F# scripts are (usually) either correct on the first try or fail at the type-checking stage. I would highly recommend it for anything near production.


In the process of a functional modification of the backup program by Hewlett-Packard Japan, the supplier of the supercomputer system, there was a problem in the unintended modification of the program and in its application procedure, which caused a malfunction whereby the files under the /LARGE0 directory were deleted instead of the backup log files that were no longer needed.

Translated with www.DeepL.com/Translator (free version)


The cause of this is a known behavior of Unix/Linux scripts, but unfortunately not everyone knows about it. If you change a script while it is running, the shell that runs it will read (what it thinks is) the next line at the expected position in the old script file, but from the new script file's contents. So what it reads and executes will probably not be what you wanted.


Yet another bug caused by programs using a command-line interface (which is designed for humans, not programs).


Who brought tres commas?


just in case someone didn’t see this masterpiece https://youtu.be/vvDK8tMyCic


Assuming this was a "scratch" HPC filesystem, as I'd guess, "scratch" is used advisedly -- users should be prepared to lose anything on it, not that it should happen with finger trouble. However, if I understand correctly from the comments, I'm surprised at the tools, and that the vendor was managing the filesystem. I'd expect to use https://github.com/cea-hpc/robinhood/wiki with Lustre, though I thought I'd seen a Cray presentation about tools of their own.


That's a lot of floppy disks!



This is HPE - not HP. Servers, not Printers.


Clearly, the only honorable thing for the CEO of HPE to do is to ... er, blame Sunny Balwani!


it's a lustre filesystem. the data would've been eaten eventually anyway.


What would make you think that?


experience (bitter).


Not really surprising. HPE has provided bottom-of-the-barrel support for decades.


Looks like 10 of 14 groups were restored from backup.


Ouch! That's a nice run for the backups, eh?


Did they have any other separate backups?


nani!



