The long road to recover Frogger 2 source from tape drives (github.com/kneesnap)
510 points by WhiteDawn on May 24, 2023 | 212 comments



Wow, this part makes my blood boil, emphasis mine:

> This issue doesn't affect tapes written with the ADR-50 drive, but all the tapes I have tested written with the OnStream SC-50 do NOT restore from tape unless the PC which wrote the tape is the PC which restores the tape. This is because the PC which writes the tape stores a catalog of tape information such as tape file listing locally, which the ARCserve is supposed to be able to restore without the catalog because it's something which only the PC which wrote the backup has, defeating the purpose of a backup.

Holy crap. A tape backup solution that doesn't allow the tape to be read by any other PC? That's madness.

Companies do shitty things and programmers write bad code, but this one really takes the prize. I can only imagine someone inexperienced wrote the code, nobody ever did code review, and then the company only ever tested reading tapes from the same computer that wrote them, because it never occurred to them to do otherwise?

But yikes.


> Holy crap. A tape backup solution that doesn't allow the tape to be read by any other PC? That's madness.

What is needed is the backup catalog. This is fairly standard on a lot of tape-related software, even open source; see for example "Bacula Tape Restore Without Database":

* http://www.dayaro.com/?p=122

When I was still doing tape backups the (commercial) backup software we were using would e-mail us the bootstrap information daily in case we had to do a from-scratch data centre restore.

The first step would get a base OS going, then install the backup software, then import the catalog. From there you can restore everything else. (The software in question allowed restores even without a license (key?), so that even if you lost that, you could still get going.)


Right, the on-PC database acts as an index to the data on the tape. That's pretty standard.

But having a format where you can't easily recreate the index from the data is just abhorrently bad coding...


Obviously to know what to restore, you need to index the data on the tapes. Tape is not a random access medium, there is no way around this.

This is only for a complete disaster scenario, if you’re restoring one PC or one file, you would still have the backup server and the database. But if you don’t, you need to run the command to reconstruct the database.


There is a way around this: You allocate enough space at the beginning (or the end, or both) of the tape for a catalog. There are gigabytes on these tapes; they could have reserved enough space to store millions of filenames and indices.


Then you would have to rewind the tape at the end, which is not what you want. You want to write the tape and keep it at that position and be done.

If you write the catalog at the end, you have to rewind and read the whole tape to find it and read it. Which is not an improvement over reading the tape and reconstructing it.

This is all either impossible or very difficult, and it's trying to fix something that isn't actually a problem: if there is a disaster and the database is lost, you just read the tapes to reconstruct it.


If the catalog was at the start of the tape, how would you expand it when adding more files to the tape?

And if the catalog was at the end of the tape, how would you add more files at all?


> And if the catalog was at the end of the tape, how would you add more files at all?

Modern zip software just removes the whole index at the end, adds the file, reconstructs the index, and appends it again.


I'm not sure about modern implementations, but it's not actually required to remove the old index at the end. It's perfectly legitimate (and somewhat safer, though the index should be reconstructable via a linear scan of record headers within the archive) to just append, and the old indexes in the middle of the archive will be ignored.
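
For anyone curious how that works in practice, here's a rough Python sketch (assuming the standard zip layout) of why only the last index matters: readers find the end-of-central-directory record by scanning backwards from the end of the file, so stale indexes left earlier in the archive are never consulted.

    import struct

    EOCD_SIG = b"PK\x05\x06"  # end-of-central-directory signature

    def find_central_directory(path):
        """Return (offset, entry_count) of the central directory a zip reader would use."""
        with open(path, "rb") as f:
            data = f.read()
        # Scan backwards: the *last* EOCD record wins, so older indexes left
        # in the middle of the archive are simply ignored.
        pos = data.rfind(EOCD_SIG)
        if pos < 0:
            raise ValueError("not a zip archive (no EOCD record found)")
        # EOCD layout: signature, disk numbers, entry counts, CD size, CD offset, comment length.
        fields = struct.unpack("<4sHHHHIIH", data[pos:pos + 22])
        entries, cd_offset = fields[4], fields[6]
        return cd_offset, entries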


But you had multiple backups on these tapes. If you rewrite the index how do you restore from a certain day?


Your end-of-stream index would remain in place with a backup number / id.

Your entire index would be the logical sum of all such indices. Think of the end-of-stream index as a write-ahead log of the index.
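
A rough sketch of that idea (hypothetical record layout, not any particular vendor's format): each backup session appends its own small index tagged with a session id, and a point-in-time restore just replays the per-session indices up to the day you want.

    from dataclasses import dataclass

    @dataclass
    class SessionIndex:
        session_id: int   # monotonically increasing backup number
        entries: dict     # path -> tape offset for files written in this session

    def catalog_as_of(session_indices, target_session):
        """Replay per-session indices in order, like a write-ahead log, to rebuild
        the catalog as it looked right after `target_session` was written."""
        catalog = {}
        for idx in sorted(session_indices, key=lambda s: s.session_id):
            if idx.session_id > target_session:
                break
            catalog.update(idx.entries)  # later sessions override earlier ones
        return catalog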


Put it in the middle and work your way in from the ends!


There are lots of append-only data structures that would support this, but would also require scanning the tape to reconstruct the catalog.


> but would also require scanning the tape to reconstruct the catalog.

If the index consists of a b+-tree/b*-tree interleaved with the data (with the root of the index the last record in the archive), a single backward pass across the tape (including rapid seeks to skip irrelevant data) is sufficient to efficiently restore everything. This should be very close in throughput and latency to restoring with an index on random-access storage. (Though, if you're restoring to a filesystem that doesn't support sparse files, writing the data back in reverse order is going to involve twice as much write traffic. On a side note, I've heard HFS+ supports efficient prepending of data to files.)

In other words, yes, you need to scan the tape to reconstruct the catalog, but since the tape isn't random access, you need to scan/seek through the entire tape anyway (even if you have a separate index on random-access media). If you're smart about your data structures, it can all be done in a single backward pass over the tape (with no forward pass). Keeping a second b+-tree/b*-tree interleaved with the data (keyed by archive write time) makes point-in-time snapshot backups just as easy, all with a single reverse pass over the tape and efficient seeks across unneeded sections of tape.
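
A much-simplified sketch of the same principle (backward-linked index records instead of a full B+-tree, purely illustrative): if every index record stores the position of the previous one, a reader starting at the last record can walk the chain backwards, rebuilding the catalog while seeking past the data blocks in between.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class IndexRecord:
        entries: dict                     # path -> (data_offset, length)
        prev_index_offset: Optional[int]  # tape position of the previous index record, or None

    def rebuild_catalog(read_record_at, last_index_offset):
        """Single backward pass: follow the chain of index records from the end of
        the tape, seeking over the data blocks instead of reading them."""
        catalog = {}
        offset = last_index_offset
        while offset is not None:
            record = read_record_at(offset)         # one seek + one read per index record
            for path, location in record.entries.items():
                catalog.setdefault(path, location)  # the newest entry for a path wins
            offset = record.prev_index_offset
        return catalog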


Beginning: write more entries into the allocated space I mentioned. End: write more entries into the allocated space I mentioned.


Wouldn't it make sense to also write the backup catalog to the tape though? Seems like a very obvious thing to do to me.


> Wouldn't it make sense to also write the backup catalog to the tape though? Seems like a very obvious thing to do to me.

The catalog would be written to tape regularly: this is what would get e-mailed out. But it wouldn't necessarily be written to every tape.

Remember that the catalog changes every day: you'd have Version 3142 of the catalog at the beginning of Monday, but then you'd back up a bunch of clients, so that catalog would now be out-of-date, so Version 3143 would have to be written out for disaster recovery purposes (and you'd get an e-mail telling you about the tape labels and offsets for it).

In a DR situation you'd go through your e-mails and restore the catalog listed in the most recent e-mail.


50GB was an enormous amount of space in the late 90s. Why wouldn't each file on the tape have something like begin/end sentinels and metadata about the file so that the current catalogue could be rebuilt in a DR scenario by just spinning through the whole tape?

I'm with the OP - depending 100% on a file that's not on the tape to restore the tape is bonkers. It's fine as an optimization, but there should have always been a way to restore from the tape alone in an emergency.


Isn't there a saying about limiting criticism before we thoroughly understand the trade-offs that had to be made?

One potential trade-off is being able to write a continuous datastream relatively unencumbered vs. having to insert data to delineate files, which is going to be time-consuming for some types of files.


There are trade-offs, but as someone who's been working in technology since the mid-90s and spent 10+ years as a systems engineer for a large corporation, "we have all of the backup tapes, and all of the production data is on them, but we can't restore some of it because we only have an older catalogue for those tapes" seems like a pretty unarguably huge downside.

I'm also having real trouble imagining any significant impediments to making the tape data capable of automatically regenerating the most recent catalogue in a disaster scenario, given the massive amount of storage that 50GB represented in that era. This sounds like a case where the industry had hit a local maximum plateau that worked well enough most of the time that no vendor felt compelled to spend the time and money to make something better.

I've written software to handle low-level binary data, and I can think of at least three independent methods for doing it. Either of the first two options could even be combined with the third to provide multiple fallback options.

1 - Sentinels + metadata header, as I originally described. The obvious challenge here is how to differentiate between "actual sentinel" and "file being backed up contains data that looks like a sentinel", but that seems solvable in several ways.

2 - Data is divided into blocks of a fixed size, like a typical filesystem. The block size is written to the beginning of the tape, but can be manually specified in a DR scenario if the tape is damaged. Use the block before each file's data to store metadata about the file. In a DR scenario, scan through the tape looking for the metadata blocks. In the corner case where the backed-up data contains the marker for a metadata block, provide the operator with the list of possible interpretations for the data, ranked by consistency of the output that would result from using it. This would sacrifice some space at the end of every file due to the unused space in the last block it occupies, but that's a minor tradeoff IMO. (Rough sketch after this list.)

3 - Optionally, write the most recent catalogue at the end of every tape, like a zip file.
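
To make option 2 concrete, here's a rough sketch (hypothetical marker and block layout, just to illustrate the DR scan): divide the tape into fixed-size blocks, tag each metadata block with a magic marker, and rebuild the catalogue by walking the blocks looking for those markers.

    import struct

    BLOCK_SIZE = 32 * 1024    # assumed; the real block size would be recorded at the start of the tape
    META_MAGIC = b"FILEMETA"  # hypothetical marker identifying a metadata block

    def rebuild_catalogue(tape, block_size=BLOCK_SIZE):
        """Scan fixed-size blocks for metadata markers and rebuild the catalogue.
        `tape` is any readable file-like object holding the raw tape image."""
        catalogue = []
        block_no = 0
        while True:
            block = tape.read(block_size)
            if not block:
                break
            if block.startswith(META_MAGIC):
                # Hypothetical metadata layout: magic, 2-byte name length, 8-byte file size, name.
                name_len, file_size = struct.unpack_from("<HQ", block, len(META_MAGIC))
                name_start = len(META_MAGIC) + 10
                name = block[name_start:name_start + name_len].decode("utf-8", "replace")
                catalogue.append({"name": name, "size": file_size,
                                  "data_starts_at_block": block_no + 1})
            block_no += 1
        return catalogue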



That's the one, thanks (Y)


you'd have to put the catalog at the end of the tape, but in that case you might as well rebuild the catalog by simply reading the tape on your way to the end (yeah, if the tape is partially unreadable blah blah backup of your backup...)


> you might as well rebuild the catalog by simply reading the tape on your way to the end

Right but is that actually possible? From what people are saying it sounds like it isn't, but you rightly assumed that it is because anything else would be incredibly dumb.


Storing the catalogue on the PC is standard. But being able to rebuild that catalogue from scratch is also standard. I’ve not used any tapes before now where you couldn’t recover the catalogue.


This type of thing is a surprisingly common mistake; I've come across it several times in industry.

An example of this done right: If you disconnect a SAN volume from VMware and attach it to a completely different cluster, it's readable. You can see the VM configs and disks in named folders. This can be used for DR scenarios, PRD->TST clones, etc...

Done wrong: XenServer. If you move a SAN volume to a new cluster, it gets shredded, with every file name replaced by a GUID instead. The file GUID to display name mapping is stored in a database that's only on the hosts! That database is replicated host-to-host and can become corrupted. Backing up just the SAN arrays is not enough!


I’d like to believe maybe that’s why the company went out of business but that’s just wishful thinking - a lot of incompetence is often ignored if not outright rewarded in business nowadays. Regardless, it’s at least somewhat of a consolation those idiots did go out of business in the end, even if that’s wasn’t the root cause.


I'm familiar with needing to re-index a backup if it's accessed from a 'foreign' machine and sometimes the procedure is non-obvious but just not having that option seems pretty bad.


I worked for an MSP a million years ago and we had a customer that thought they had lost everything. They had backup tapes, but the backup server itself had died. After showing them the 'catalog tape' operation, and keeping their fingers crossed for a few hours, they bought me many beers.


I always had the Customer keep a written log of which tapes were used on which days. It helped for accountability but also prevented the "Oh, shit, we have to catalog all the tapes because the log of which tapes were used on which day is on the now-failed server."


That is not terribly surprising. The cheap tape drives of that era were very picky like that. Even if I had the same tape drive as a friend it was not always certain that I could read back my tape on his box and the other way around. These drives were very affordable and the tapes were low priced as well. However, they were really designed for the 'oh no I messed up my computer let me restore it' or 'I deleted a file I should not have' scenarios. Not server side backup rotation solutions. Unless that was the backup/restore computer. Long term storage or off site type drives were decently more pricy.

My guess is they lacked the RAM buffer and CPU to keep up properly, with a side of assumptions on the software side.


this does not sound like a junior programmer error. this is not the kind of thing companies let inexperienced people come up with, at least not on their own. this is a lack of testing. any real world backup test would have caught this. and i would expect the more senior engineers of the project to ensure this was covered


If you’re making an “ought to be” argument, I agree.

If you’re making an “is” argument, I completely disagree. I see companies (including my own) regularly having junior programmers responsible for decisions that cast inadvertently or unexpectedly long shadows.


It's basically an index stored on faster media. You would have redundancy on that media, too.


I guess that's why the .zip format chucks its catalog index at the end of the archive. But it's still unnatural to use in a streaming format like tape.


In The Singularity Is Near (2005) Ray Kurzweil discussed an idea for the “Document Image and Storage Invention”, or DAISI for short, but concluded it wouldn't work out. I interviewed him a few years later about this and here's what he said:

The big challenge, which I think is actually important almost philosophical challenge — it might sound like a dull issue, like how do you format a database, so you can retrieve information, that sounds pretty technical. The real key issue is that software formats are constantly changing.

People say, “well, gee, if we could backup our brains,” and I talk about how that will be feasible some decades from now. Then the digital version of you could be immortal, but software doesn’t live forever, in fact it doesn’t live very long at all if you don’t care about it if you don’t continually update it to new formats.

Try going back 20 years to some old formats, some old programming language. Try resuscitating some information on some PDP1 magnetic tapes. I mean even if you could get the hardware to work, the software formats are completely alien and [using] a different operating system and nobody is there to support these formats anymore. And that continues. There is this continual change in how that information is formatted.

I think this is actually fundamentally a philosophical issue. I don’t think there’s any technical solution to it. Information actually will die if you don’t continually update it. Which means, it will die if you don’t care about it. ...

We do use standard formats, and the standard formats are continually changed, and the formats are not always backwards compatible. It’s a nice goal, but it actually doesn’t work.

I have in fact electronic information that in fact goes back through many different computer systems. Some of it now I cannot access. In theory I could, or with enough effort, find people to decipher it, but it’s not readily accessible. The more backwards you go, the more of a challenge it becomes.

And despite the goal of maintaining standards, or maintaining forward compatibility, or backwards compatibility, it doesn’t really work out that way. Maybe we will improve that. Hard documents are actually the easiest to access. Fairly crude technologies like microfilm or microfiche which basically has documents are very easy to access.

So ironically, the most primitive formats are the ones that are easiest.


In 2005 the computing world was much more in flux than it is now.

PNG is 26 years old and basically unchanged since then. Same with 30 year old JPEG, or for those with more advanced needs the 36 year old TIFF (though there is a newer 21 year old revision). All three have stood the test of time against countless technologically superior formats by virtue of their ubiquity and the value of interoperability. The same could be said about 34 year old zip or 30 year old gzip. For executable code, the wine-supported subset of PE/WIN32 seems to be with us for the foreseeable future, even as Windows slowly drops compatibility.

The latest Office365 Word version still supports opening Word97 files as well as the slightly older WordPerfect 5 files, not to mention 36 year old RTF files. HTML1.0 is 30 years old and is still supported by modern browsers. PDF has also got constant updates, but I suspect 29 year old PDF files would still display fine.

In 2005 you could look back 15 years and see a completely different computing landscape with different file formats. Look back 15 years today and not that much has changed. Lots of exciting new competitors as always (webp, avif, zstd), but only time will tell whether they will earn a place among the others or go the way of JPEG2000 and RAR. But if you store something today in a format that's survived the last 25 years, you have good chances to still be able to open it in common software 50 years down the line.


This is too shortsighted by archival standards. Even Word itself doesn't offer full compatibility. VB? 3rd party active components? Other Office software integration? It's a mess. HTML and other web formats are only readable by the virtue of being constantly evolved while keeping the backwards compatibility, which is nowhere near complete and is hardware-dependent (e.g. aspect ratios, colors, pixel densities). The standards will be pruned sooner or later, due to the tech debt or being sidestepped by something else. And I'm pretty sure there are plenty of obscure PDF features that will prevent many documents from being readable in a mere half-century. I'm not even starting on the code and binaries. And cloud storage is simply extremely volatile by nature.

Even 50 years (laughable for a clay tablet) is still pretty darn long in the tech world. We'll still probably see the entire computing landscape, including the underlying hardware, changing fundamentally in 50 years.

Future-proofing anything is a completely different dimension. You have to provide the independent way to bootstrap, without relying on the unbroken chain of software standards, business/legal entities, and the public demand in certain hardware platforms/architectures. This is unfeasible for the vast majority of knowledge/artifacts, so you also have to have a good mechanism to separate signal from noise and to transform volatile formats like JPEG or machine-executable code into more or less future proof representations, at least basic descriptions of what the notable thing did and what impact it had.


>Future-proofing anything is a completely different dimension. You have to provide the independent way to bootstrap, without relying on the unbroken chain of software standards, business/legal entities, and the public demand in certain hardware platforms/architectures. This is unfeasible for the vast majority of knowledge/artifacts, so you also have to have a good mechanism to separate signal from noise and to transform volatile formats like JPEG or machine-executable code into more or less future proof representations, at least basic descriptions of what the notable thing did and what impact it had.

I'd argue that the best way would be to not do that, but to make sure the format is ubiquitous enough that the knowledge will never be lost in the first place.


That, and use formats which can be accessed and explained concisely, like "read the first X bytes to metadata field A, then read the image payload by interpreting every three bytes as an RGB triplet until EOF", so that the information can be transmitted orally, on the off chance that becomes necessary.

Hey I think I just described Windows 3.0-era PCX format :P
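
A format like that really can be decoded from the one-sentence description alone; here's a sketch (made-up 8-byte header holding width and height, purely illustrative):

    def decode_image(raw: bytes):
        """Decode the hypothetical format described above: a small fixed header
        followed by raw RGB triplets until EOF."""
        width = int.from_bytes(raw[0:4], "little")   # metadata field A: image width
        height = int.from_bytes(raw[4:8], "little")  # metadata field B: image height
        payload = raw[8:]
        pixels = [tuple(payload[i:i + 3]) for i in range(0, len(payload), 3)]
        return width, height, pixels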


> HTML and other web formats are only readable by the virtue of being constantly evolved while keeping the backwards compatibility, which is nowhere near complete and is hardware-dependent (e.g. aspect ratios, colors, pixel densities).

HTML itself is relatively safe, by virtue of it being based on SGML. Though it's not ideal either because those who think it's their job to evolve HTML don't bother to maintain SGML DTDs or use other long established formal methods to keep HTML readable, but believe a hard-coded and (hence necessarily) erroneous and incomplete parsing description the size of a phone book is the right tool for the job.

Let me quote the late Yuri Rubinski's foreword to The SGML Handbook outlining the purpose of markup languages (from 1990):

> The next five years will see a revolution in computing. Users will no longer have to work at every computer task as if they had no need to share data with all their other computer tasks, they will not need to act as if the computer is simply a replacement for paper, nor will they have to appease computers or software programs that seem to be at war with one another.

However, exactly because evolving markup vocabularies requires organizing consensus, a task which W3C et al seemingly weren't up to (busy with XML, XHTML, WS-Star, and RDF-Star instead for over a decade), CSS and JS were invented and extended for the absurd purpose of basically redefining what's in the markup which itself didn't need to change, with absolutely disastrous results for long-term readability, or even readability on browsers other than those from the browser cartel today.


> Though it's not ideal either because those who think it's their job to evolve HTML don't bother to maintain SGML DTDs or use other long established formal methods to keep HTML readable, but believe a hard-coded and (hence necessarily) erroneous and incomplete parsing description the size of a phone book is the right tool for the job.

> a task which W3C et al seemingly weren't up to (busy with XML, XHTML

You realise XML/XHTML is actually delightfully simple to parse compared to WHATWG HTML?


There is something called the Lindy Effect, which states that a format's expected remaining lifespan is proportional to its current age.

I try to take advantage of this by only using older, open, and free things (or the most stable subsets of them) in my "stack".

For example, I stick to HTML that works across 20+ years of mainstream browsers.


While it's true that these standards are X years old, the software that encoded those formats yesteryear is very different from the software that decodes them today. It's a Ship of Theseus problem. They can claim an unbroken lineage since the distant future, the year 2000, but encoders and decoders had defects and opinions that were relied on--both intentionally and unintentionally--and that are different from the defects and opinions of today.

I have JPEGs and MP3s from 20 years ago that don't open today.


Are they really JPEGs and MP3s, or just bitrot?

I've found https://github.com/ImpulseAdventure/JPEGsnoop useful to fix corruption but I haven't come across a non-standard JFIF JPEG unless it was intentionally designed to accommodate non-standard features (alpha channel etc).


I personally never encountered JPEGs or MP3s which were totally unreadable due to being encoded by ancient software versions, but the metadata in common media formats is a total mess. Cameras and encoders are writing all sorts of obscure proprietary tags, or even things like X-Ray (STALKER Shadow of Chernobyl game engine) keeping gameplay-relevant binary metadata in OGG Vorbis comments. Which is even technically compliant with the standard I think, but that won't help you much.


> X-Ray (STALKER Shadow of Chernobyl game engine) keeping gameplay-relevant binary metadata in OGG Vorbis comments

Just, ew


In the case of individual files with non-conformant or corrupted elements, it seems like a fairly straightforward project to build an AI model that can fix up broken files with a single click. I suspect such a thing will be widely accessible in 10 years.


Just going to mention Pro/E forward compatibility here: https://youtu.be/tY_Gy-EElc0

"The roots of Creo Parametric. Probably one of the last running installations of PTC's revolutionary Pro/ENGINEER Release 7 datecode 9135 installed from tape. Release 7 was released in 1991 and is - as all versions of Pro/ENGINEER - fully parametric. Files created with this version can still - directly - be opened and edited in Creo Parametric 5.0 (currently the latest version for production).

This is a raw video, no edits, and shows a bit of the original interface (menu manager, blue background, yellow and red datum planes, no modeltree).

Hardware used: Sun SparcStation 5 running SunOS 4.1.3 (not OpenWindows), 128MB RAM

Video created on january 6, 2019."


That is great! NX has similar compatibility, though not quite as good.

In some cases, the actual version code of a feature is invoked by that data being encoded as part of the model data schema.

Literal encapsulation in action. That way bugs and output variances are preserved so that the regenerated model is accurate to what the software did decades ago.


I can't help but think bad thoughts whenever I see another "static site maker" posted on here, or a brand new way of using JavaScript to render a web page.

Talk about taking the simplest and most durable of (web) formats and creating a hellscape of tangled complexity which becomes less and less likely to be maintainable or easy to archive the more layers of hipster js faddishness you add...


One of the claimed benefits of the JVM (and obviously later VMs) was that it would solve this issue: Java programs written in 2000 should still be able to run in 2100. And as far as I know the JVM has continued to fulfill this promise.

An honest question: If you are writing a program that you want to survive for 100+ years, shouldn't you specifically target a well-maintained and well-documented VM that has backward compatibility as a top priority? What other options are there?


People routinely boot DOS in e.g. qemu. The x86 ISA is 45 years old, older if you consider the 8008/8080 part of the lineage. It's not pretty, but it's probably the most widespread backwards compatible system out there.


S/360 assembly programs probably would still run on a modern IBM mainframe. Punched cards kept in an inert atmosphere probably would last for centuries, and along with printed documentation in archival-quality paper would allow future generations to come up with card readers and an emulator to actually run the program.


While I love the JVM, and I also think it's one of the better runtimes in terms of backwards compatibility, there have been breakages. Most of the ones I've dealt with were easy to fix. But the ease of fixing is related to having access to the source code. When something in a data stream is broken, be it an MP3 or a JPEG, I guess you almost inherently need special tooling to fix it (realistically). I imagine that with an SVG it'd be easier to hand-fix it.


> An honest question: If you are writing a program that you want to survive for 100+ years, shouldn't you specifically target a well-maintained and well-documented VM that has backward compatibility as a top priority? What other options are there?

I'd be tempted to target a well-known physical machine - build a bootable image of some sort as a unikernel - although in the age of VMWare etc. there's not a huge difference.

IMO the "right" way to do this would be to endow an institution to keep the program running, including keeping it updated to the "live" version of the language it's writen in, or even porting it between languages as and when that becomes necessary.


Basically, just target the System/360 from IBM


But he seems to have written this before virtual machines became widespread.

I think the concern is becoming increasingly irrelevant now, because if I really need to access a file I created in Word 4.0 for the Mac back in 1990, it's not too hard to fire up System 6 with that version of Word and read my file. In fact it's much easier now than it was in 2005 when he was writing. Sure it might take half an hour to get it all working, but that's really not too bad.

Most of this is probably technically illegal and will sometimes even have to rely on cracked versions, but also nobody cares. All the OSes and programs are still around and easy to find on the internet.

Not to mention that while file formats changed all the time early on, these days they're remarkably long-lived -- used for decades, not years.

The outdated hardware concern was more of a concern (as the original post illustrates), but so much of everything important we create today is in the cloud. It's ultimately being saved in redundant copies on something like S3 or Dropbox or Drive or similar, that are kept up to date. As older hardware dies, the bits are moved to newer hardware without the user even knowing.

So the problem Kurzweil talked about has basically become less of an issue as time has marched on, not more. Which is kind of nice!


>I think the concern is becoming increasingly irrelevant now, because if I really need to access a file I created in Word 4.0 for the Mac back in 1990, it's not too hard to fire up System 6 with that version of Word and read my file. In fact it's much easier now than it was in 2005 when he was writing. Sure it might take half an hour to get it all working, but that's really not too bad.

And that was easy years ago.

Now you can WASM it and run it in a browser


You don't even need to set it up yourself: http://system6.app/

It's even got Word 5.1 installed. :)


> I think the concern is becoming increasingly irrelevant

I fear we may be right at that tipping point. With the "cloudification" where more and more software is run on servers one doesn't control, there is no way to run that software in a VM as you don't have access to the software anymore. And even getting the pure data for a custom backup becomes harder and harder.


I'm certain that 100 years from now, when the collapse really gets rolling, we'll still have cuneiform clay tablets complaining about Ea-Nassir's shitty copper but most of the digital information and culture we've created and tried to archive will be lost forever. Eventually, we're going to lose the infrastructure and knowledge base we need to keep updating everything, people will be too busy just trying to find food and fighting off mutants from the badlands to care.


Well, almost all early tablets are destroyed or otherwise lost now. Do you think we will lose virtually all digital age information within a century? Maybe from a massive CME, I suppose.


Clay tablets were usually used for temporary records, as you could erase them simply by smearing the clay a little bit (a lot easier than writing on papyrus). The tablets we have exist because of something that caused the clay to be baked into ceramic, which is generally some sort of catastrophic fire that caused the records to accidentally be preserved for much longer.


I know. My first iPad just stopped powering up. WTF!

I should etch something into its glass and bury it in my back yard. Perhaps a shopping list, or a complaint about how my neighbor inexplicably gets into his truck six or eight times a day and just sits there with it running.


I can see it happening. Not as a single catastrophic event but, like Rome falling bit by bit, our technological civilization fails and degenerates as climate change (in the worst possible scenario) wreaks havoc on everything.


I was able to backup/restore an old COBOL system via cpio between modern GNU cpio (man page last updated June 2018), and SCO's cpio (c. 1989). This is neither to affirm nor contradict Kurzweil, but rather to praise the GNU userland for its solid legacy support.



This is very very true. I have archived a number of books and magazines that were scanned and converted into "simplified" PDF, and archived on DVD disks with C source code.

There are external dependencies but one hopes that the descriptions are sufficient to figure out how to make those work.


Actually I'd argue it's wrong precisely because we do manage to retrieve even such old artifacts. The only problem is that nobody cared for 30 years, so the process was harder than it should have been, but in the end it was possible.

Sure, there is a risk that at some point, for example, every version of every PNG or H.264 decoder gets lost, and so re-creating a decoder for them would be significantly more complicated. But the chances of that are pretty slim; looking at `ffmpeg -codecs`, I'm not really worried about that ever happening.


> Hard documents are actually the easiest to access. Fairly crude technologies like microfilm or microfiche which basically has documents are very easy to access.

Maybe it isn't crude after all if it wins.


I do not consider microfiche or film crude at all.

They are just simple.

And what they do is very fully exploit the analog physics to yield high data density mere mortals can make effective use of.

And they make sense.

In my life, text, bitmaps and perhaps raw audio endure. Writing software to make use of this data is not difficult.

A quick scan later, microfiche type data ends up a bitmap.

Prior to computing, audio tape, pictures on film and ordinary paper, bonus points for things like vellum, had similar endurance and utility.

My own archives are photos, papers and film.


Modern backup would simply state "API keys and settings are here:", followed by a link to a collaboration platform that closed after 3 years of existence.


Hey, it's the cloud. Backups are "someone else's problem". That is until they are your problem, then you're up a creek.


> Hey, it's the cloud. Backups are "someone else's problem". That is until they are your problem, then you're up a creek.

The FSF used to sell these wonderful stickers that said "There is no cloud. It's just someone else's computer."


The sticker: https://static.fsf.org/nosvn/stickers/thereisnocloud.svg

"Stickers from various FSF campaigns - Print out copies of our stickers for your own uses, local conferences and more." https://www.fsf.org/resources/stickers


Honestly backup space is weirdly sparse for anything on enterprise scale.

For anything more than a few machines there is bacula/bareos (which pretends everything is tape, with mostly miserable results), backuppc (which pretends tapes are not a thing, with miserable results), and that's about it; everything else seems to be point-to-point backups only with no real central management.


Are we talking about open source only? Because there are loads of options available. Veritas has two products (NetBackup and BackupExec). There is also Commvault, Veeam, IBM Spectrum Protect and HP Data Protector. Admittedly only NetBackup and Commvault are what I would truly call enterprise, but your options are certainly not limited.


As someone who used to administer an ADSM server back a long time ago -- I'm curious what the gap is between Spectrum Protect (or whatever it's called now) and Commvault/NetBackup? I haven't really looked at that space for at least a decade.


You can add amanda to the "pretends everything is tape with mostly miserable results" list.


Absolutely amazing story. Fantastic!

I've actually long been stunned by the propensity of proprietary backup software to use undocumented, proprietary formats. It seems to me like the first thing one should make sure to solve when designing a backup format is to ensure it can be read in the future even if all copies of the backup software are lost.

I may be wrong but I think some open source tape backup software (Amanda, I think?) does the right thing and actually starts its backup format with emergency restoration instructions in ASCII. I really like this kind of "Dear future civilization, if you are reading this..." approach.

Frankly nobody should agree to use a backup system which generates output in a proprietary and undocumented format, but also I want a pony...

It's interesting to note that the suitability of file formats for archiving is also a specialised field of consideration. I recall some article by someone investigating this very issue who argued formats like .xz or similar weren't very suited to archiving. Relevant concerns include, how screwed you are if the archive is partly corrupted, for example. The more sophisticated your compression algorithm (and thus the more state it records from longer before a given block), the more a single bit flip can result in massive amounts of run-on data corruption, so better compression essentially makes things worse if you assume some amount of data might be damaged. You also have the option of adding parity data to allow for some recovery from damage, of course. Though as this article shows, it seems like all of this is nothing compared to the challenge of ensuring you'll even be able to read the media at all in the future.
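
A quick way to see that failure mode for yourself (using zlib here just as a stand-in for any stateful compressor; the same one-bit flip in an uncompressed copy would damage a single byte):

    import zlib

    original = b"important archival data " * 1000
    compressed = bytearray(zlib.compress(original))

    # Flip a single bit somewhere in the middle of the compressed stream.
    compressed[len(compressed) // 2] ^= 0x01

    try:
        zlib.decompress(bytes(compressed))
    except zlib.error as exc:
        # Everything from the damaged point onward is typically unrecoverable.
        print("decompression failed:", exc)
    else:
        print("decompressed, but the tail of the data may be silently wrong")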

At some point the design lifespan of the proprietary ASICs in these tape drives will presumably just expire(?). I don't know what will happen then. Maybe people will start using advanced FPGAs to reverse engineer the tape format and read the signals off, but the amount of effort to do that would be astronomical, far more even than the amazing effort the author here went to.


To add, thinking a bit more about it: Designing formats to be understandable by future civilizations actually reduces to a surprising degree to the same set of problems which METI has to face. As in, sending signals designed to be intelligible to extraterrestrials - Carl Sagan's Contact, etc.

Even if you write an ASCII message directly to a tape, that data is obviously going to be encoded before being written to the tape, and you have no idea if anyone will be able to figure out that encoding in future. Trouble.

What makes this particularly pernicious is the fact that LTO nowadays is a proprietary format(!!). I believe the spec for the first generation or two of LTO might be available, but last I checked, it's been proprietary for some time. The spec is only available to the (very small) consortium of companies which make the drives and media. And the number of companies which make the drives is now... two, I think? (They're often rebadged.) Wouldn't surprise me to see it drop to one in the future.

This seems to make LTO a very untrustworthy format for archiving, which is deeply unfortunate.


The best format for archiving is many formats.

Make an LTO tape... But also make a Bluray... And also store it on some hard drives... And also upload it to a web archive...

The same for the actual file format... Upload PDF's... But also upload word documents.. And also ASCII...

And same for the location... Try to get diversity of continents... Diversity of geopolitics (ie. some in USA, some in Russia). Diversity of custodians (friends, businesses, charities).


Even ASCII itself is a strange encoding that could be lost with enough time and need to be recovered through cryptographic analysis and signals processing. That doesn't look at all likely today given UTF-8's promised and mostly accomplished ubiquity and its permanent grandfathering of ASCII. But ASCII is still only one of a number of potential encoding schemes, and isn't necessarily obvious from first principles.

Past generations thought EBCDIC would last longer than it did.

Again, not that there any indications now that ASCII won't survive nearly as long as the English language does at this point, just that when we're talking about sending signals to the future, even assuming ASCII encoding is an assumption to question.


Baby's first cryptographic analysis, sure. Mapping letters to bits is easy, and the 8 bit repeating pattern is also easy.

The thing that might make it hard is if people have forgotten English itself, and in that case ASCII is one of the smallest barriers.

EBCDIC would also be fine.


These things make more sense because LTO is used for backup, not archival. Companies don't want to be able to read the tape data in 50 years, they want to be able to read it tomorrow, after the entire business campus burns down.


You mean the "if you are reading this in the distant future" instructions are written to the medium first? And are straight up ASCII?

Nice. That kind of thing makes too much sense. Wow. Such cheap insurance. Nice work from that team.


Yeah. If I ever wrote a backup system I'd do this too: write the whole spec for the format first to every medium. A 100k specification describing the format is nothing to waste on a medium which can store 10TB.
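
Something like this, say (hypothetical layout, not any real backup tool's format): a fixed-size plain-ASCII preamble written before the payload, so a future reader armed with nothing but a hex dump at least knows what they're looking at.

    SPEC_TEXT = """\
    DEAR FUTURE READER:
    This medium contains a backup written by a hypothetical tool.
    Layout: this ASCII preamble, padded to 4096 bytes with NUL bytes,
    followed by the archive payload described below.
    ...
    """

    def write_backup(out, payload: bytes, preamble_size: int = 4096):
        """Write a self-describing backup: human-readable spec first, data second."""
        preamble = SPEC_TEXT.encode("ascii")
        if len(preamble) > preamble_size:
            raise ValueError("spec text does not fit in the reserved preamble")
        out.write(preamble.ljust(preamble_size, b"\x00"))  # fixed-size ASCII header
        out.write(payload)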


Seriously. Scale really changes things.


It's kinda strange that we still don't have a technology that would allow one to scan a magnetic medium at high resolution and then process it in software. This would be nice for all kinds of things that use magnetic tapes and platters — data recovery, perfect analog tape digitization, etc. The closest I've seen to it is that project that captures the raw signal from the video head of a VCR and then decodes it into a picture.


Isn't there a subset of that at least for floppy discs with Kryoflux or GreaseWeazle style controllers? They read the raw flux transitions off the drive head, and then it's up to software to figure out that it's a Commodore GCR disc or a Kaypro MFM one.


LTO tape media itself is typically only rated at 30 years, so I suspect the tapes will die before the drives do.


I've always admired the tenacity of people who reverse engineer stuff. To be able to spend multiple months figuring out barely documented technologies with no promise of success takes a lot of willpower and discipline. It's something I wish I could improve more in myself.


I think you could. In some sense "easily". It may be about finding that thing you're naturally so interested in or otherwise drawn to, that the months figuring out become a type of driven joy, and so the willpower kinda automatic.

And if you find it, don't judge what it is or worry what others might think - or even necessarily tell anyone. Sometimes the most motivating things are highly personal, as with the OP; a significant part of their childhood.


You definitely have a point there; looking at some of my previous work, I was able to stick to projects for many months if I found the work interesting. I'll have to admit that in the past 5 or so years, any time I've tried to start a project there was always the thought in the back of my mind of 'will this benefit my career' or 'how can I make money on this in the future'. It seems having such thoughts adds anxiety whenever I try to start working on something for fun.

Looks like that is what I need to start looking for again, projects which I find interesting or fun to do in my spare time, without thinking about how it would affect my career or trying to find ways to monetize it.


I totally get this. Something I'm learning - slowly - as it is so counter to sound ethic, is that making some work into pure fun helps the other work by preventing burn-out, which for me at least is an ongoing risk. Certainly feels better!


Fascinating read that unlocked some childhood memories.

I'm secondhand pissed at the recovery company. I have a couple of ancient SD cards lying around, and this just reinforces my fear that if I send them away for recovery they'll be destroyed (the cards aren't recognized/readable by the readers built into MacBooks, at least).


My understanding is that flash memory does not do very well at all for long term unpowered data retention. Flash memory is basically a capacitor (it is not really a capacitor, but close to one) and will lose its charge after a few years.

And magnetic drives will seize up, optical disks get oxidized, and tapes stick together. Long-term archiving is a tricky endeavor.

It is, however, an interesting challenge. I think I would get acquainted with the low-level protocol used by SD cards, then modify a microcontroller SD/MMC driver to get me an image of the card (errors and all) - that is, without all the SCSI disk layers doing their best to give you normalized access to the device. Or, more realistically, hope someone more talented than me does the above.


Tapes hold up really well if they're not in absolutely awful storage conditions. And the claim at least was that the early CD-ROMs were quite durable, being a straight up laser carved diffraction grating. CDRs on the other hand rely on dye which will degrade rapidly.


M discs were made, quite possibly to meet the Mormon need for post-Armageddon lineage documentation, to last a minimum of 1000 years in reasonable conditions. They are just special, expensive CDRs that use a different dye system that won't break down in 4 years.

My optical disk drive was $17 and has M disc writing compatibility, and my understanding is they are meant to be read by any CD reader.


That is true about mag tape; I suspect tape to be one of the better choices for archival storage. In fact the biggest problem you will have with mag tape is making sure you have a working drive 10-20 years in the future when you want to look at your archives. To make things worse, tape drives are getting more and more flimsy and fragile as tape tech advances.

At my first job we had a vault of IBM reel-to-reel tapes of old business data. Our attitude, if we were ever asked to pull any of the data, was that we would probably only get one chance at it, as the ferric material on the tape had a disturbing tendency to flake off. Note that there are techniques to reduce this, but we did not have the means or motivation to apply them.

And a pedantic observation on your correct point about optical disks: you can't make backups on pressed disks, only recordable ones.


You or I can't make backups on pressed disks, but it is an interesting consideration if part of your archive consists of commercially released movies, games, music, etc. that could be stored as original Blu-Rays, DVDs or CDs.


> My understanding is that flash memory does not do very well at all for long term unpowered data retention

You need to let flash cells rest before writing again if you want to achieve long retention periods, see section 4 in [1]. The same document says 100 years is expected if you only cycle it once a day, 10k times over 20 years (Fig 8).

[1]: https://www.infineon.com/dgdl/Infineon-AN217979_Endurance_an...


Last year I helped a friend recover photos from a portable WD HDD. It was formatted in FAT32 and I was forced to run R-Studio to get reliable results. There were a lot of damaged files (readable, with artifacts) and corrupted ones (don't render, have the wrong size).


Painful lesson I've learned myself the hard way - don't rush something that doesn't need to be rushed.


This is giving me some anxiety about my tape backups.

I have backed up my blu-ray collection to a dozen or so LTO-6 tapes, and it's worked great, but I have no idea how long the drives are going to last for, and how easy it will be to repair them either.

Granted, the LTO format is probably one of the more popular formats, but articles like this still keep me up at night.


The only surefire method to keep the bits readable is to continue moving them onto new media every few years. Data has a built-in recurring cost. I'd love to see a solution to that problem but I think it's unlikely. It's at least possible, though, that we'll come up with a storage medium with sufficient density and durability that it'll be good enough.

I don't even want to think about the hairy issues associated with keeping the bits able to be interpreted. That's a human behavior problem more than a technology problem.


LTO is one of the best choices for compatibility. I remember just how awful DDS (same sort of media as DAT) tape backups were - due to differences in head alignments, it was a real lottery as to whether any given tape could be read on a different drive than the one that wrote it.


Do test restores. LTO is very good but without verification some will fail at some point.

But your original Blu-ray discs are also a backup.


LTO-7 drives read LTO-6, and will be available for quite a while.

In 2016 I used an LTO-3 drive to restore a bunch (150 or 200) of LTO-1/2 tapes from 2000-2003, and all but one or two worked fine.


I really wish they would name the data recovery company so that I can never darken their door with my business.


https://news.ycombinator.com/item?id=36063114 claims it's https://www.datarecovery.net/tape-data-recovery.aspx (and that https://news.ycombinator.com/item?id=36062785 had been edited to censor the information, so I'm duplicating it here). Caveat that I don't know if that's actually correct, since efforts to suppress it are only circumstantial evidence in favor.


> Over the span of about a month, I received very infrequent and vague communications from the company despite me providing extremely detailed technical information and questions.

Ahh the business model of "just tell them to send us the tape and we'll buy the drive on eBay"


To be honest as long as they are very careful about not doing any damage to the original media then it might work and be a win-win for both sides in a "no fix no fee" model where the customer only pays if the data is successfully recovered.

Their cardinal sin was that they irreparably damaged the tape without prior customer approval.


If you are advertising and attempting recovery from formats you are unfamiliar with, damaging the original medium is inevitable.


When a customer shows up with an unfamiliar format you invest the time to become familiar with it - this may involve buying test tapes (that you don't mind damaging) so you can test your recovery process on it and make sure it works before running it on the customer's tape.


Unless you clearly make the customer aware that you are doing so, I think it's slimy. It's very "fake it till you make it" to advertise a service that you only assume you can pick up as needed.


It’s not too hard to find with the following search, “we can recover data from tape formats including onstream”


The OP explicitly didn't name them (despite many people recommending it, even preservationists in this field on Reddit and Discord), but it's easy to find just by googling the text in the screenshots.



Name them and we can setup a thread or site to publicly shame them


The comment I replied to edited the link out: https://www.datarecovery.net/tape-data-recovery.aspx


>> The tape was the only backup for those things, and it completes Frogger 2's development archives, which will be released publicly.

In cases like this I can imagine some company yelling "copyright infringement" even though they don't possess a copy themselves. It's a really odd situation.


As a kid, I got this game as a gift and really, really wanted to play it. But after beating the second level, the game would always crash on my computer with an Illegal Operation exception. I remember sending a crash report to the developer, and even updating the computer, but I never got it working.


I adored this game as a kid, and I think I do have a faint memory of some stability issues, but I believe I was able to beat the game.


I work in the tape restoration space. My biggest piece of advice is never NEVER encrypt your tapes. If you think restoring data from an unknown format tape is hard, trying to do it when the drive will not let you read the blocks off the tape without a long lost decryption key is impossible.


TIL there are three completely different games named "Frogger 2". I assumed this was for the 1984 game, but this is for the 2000 game (there is also a 2008 game).


Thanks for that, it seems like a surprisingly modern format for such an old game.

Links for the games referenced:

- Frogger II: ThreeeDeep! (1984)

https://www.mobygames.com/game/7265/frogger-ii-threeedeep/

- Frogger 2: Swampy's Revenge (2000) [1]

https://www.mobygames.com/game/2492/frogger-2-swampys-reveng...

- Frogger 2 (2008) [2]

https://www.mobygames.com/game/47641/frogger-2/


> the ADR-50e drive was advertised as compatible, but there was a cave-at

I'm assuming the use of "cave-at" means the author has inferred an etymology of "caveat" being made up of "cave" and "at", as in: this guarantee has a limit beyond which we cannot keep our promises, if we ever find ourselves AT that point then we're going to CAVE. (As in cave in, meaning give up.) I can't think of any other explanation of the odd punctuation. Really quite charming, I'm sure I've made similar inferences in the past and ended up spelling or pronouncing a word completely wrong until I found out where it really comes from. There's an introverted cosiness to this kind of usage, like someone who has gained a whole load of knowledge and vocabulary from quietly reading books without having someone else around to speak things out loud.


Dang it. OP here, I saw this typo and swear I fixed this typo before posting it!!


I thought it might have been a transcription error of “carve out,” but your theory is more logical.


Truly noble effort. Hopefully the writeup and the tools will save others much heartbreak.


Wow, that backup software sounds like garbage. Why not just use tar? Why would anyone reinvent that wheel?


The world of tape backup was (is?) absolutely filled with all sorts of vendor-lock in projects and tools. It's a complete mess.

And even various versions of tar aren't compatible, and that's not even starting with star and friends.


It's not just limited to tape, most archiving and backup software is proprietary. It's impossible to open Acronis or Macrium Reflect images without their Windows software. In Acronis's case they even make it impossible to use offline or on a server OS without paying for a license. NTBackup is awfully slow and doesn't work past Vista, and it's not even part of XP POSReady for whatever reason, so I had to rip the exe from a XP ISO and unpack it (NTBACKUP._EX... I forgot microsoft's term for that) because the Vista version available on Microsoft's site specifically checks for longhorn or vista.

Then there's slightly more obscure formats that didn't take off in the western world, and the physical mediums too. Not many people had the pleasure of having to extract hundreds of "GCA" files off of MO disks using obscure Japanese freeware from 2002. The English version of the software even has a bunch of flags on virustotal that the standard one doesn't. And there's obscure LZH compression algorithms that no tool available now can handle.

I've found myself setting up one-time Windows 2000/XP VMs just to access backups made after 2000.


I can only speak for macrium but they have good reasons to use their own format, so that you can have differential mountable backups. That's very different from someone inventing tar-but-worse.


I have at various times considered a tape backup solution for my home, but always give up when it seems every tape vendor is only interested in business clients. It was a race to stay ahead of hard drives and oftentimes they seemed to be losing. The price points were clearly aimed at business customers, especially on the larger capacity tapes. In the end I do backup to hard drives instead because it's much cheaper and faster.


Tape absolutely isn't viable for the consumer at all, but definitely worth exploring for the novelty. Even if you manage to get a pretty good deal on a legacy LTO system (other formats don't even come close to the TB/$ of 10+ year old LTO, and drives are still fairly cheap), the drives aren't being made any more and aren't getting any cheaper. Backwards compatibility may be in your favor depending on your choice of tape generation, at least; I think there's at least two generations guaranteed. Optical will probably remain king: though the pricing is worse than HDDs, there's no shortage of DVD or BD readers, but you might run into issues with quad-layer 128GB BD as they only hit the market fairly recently.


The only reasonable solution is to keep migrating and checking the data on various media; but this is expensive and often deemed not worth it.


Depends on the size of the data.

If your dataset is below 1-2Tb it would cost you less than $200 in a decade to move to a newer HDD every 5 years.


That used to be my method, but I decided to "upgrade" to paying Backblaze to manage that method for me, and just in the nick of time too, as my SSD crib-deathed soon after.


Tape drive and Bareos/Bacula "just works"

Absolutely not worth it tho. Drives are hideously expensive which means they only start making sense where you have at least dozens of tapes.

There is an advantage in tapes not being electrically connected most of the time, so a lightning strike will not burn your archives. I have pondered making a separate box with a bunch of hard drives that boots once a month and just copies the last month's backups onto them, powered from solar or something, just to keep it separate from the network.


The only way to do tape at home is with used equipment and Linux/BSD. You can do quite a bit with tar and mt (iirc) - even controlling auto loaders.

What’s fun are the hard drive based systems designed to perfectly imitate a tape autoloader so you don’t have to buy new backup software (virtual tape libraries).
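
For the curious, driving a tape drive with the stock tools really is that simple; here's a minimal sketch wrapped in Python for illustration (the /dev/nst0 device path and the source directory are assumptions, and autoloaders would additionally need mtx, not shown):

    # Minimal sketch: write and verify a tar archive on a Linux tape device by
    # shelling out to the standard mt/tar tools. /dev/nst0 (the non-rewinding
    # device) and the source path are placeholders.
    import subprocess

    TAPE = "/dev/nst0"

    def run(*cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    run("mt", "-f", TAPE, "rewind")          # position at the start of the tape
    run("tar", "-cvf", TAPE, "/srv/backup")  # write one archive (= one tape file)
    run("mt", "-f", TAPE, "rewind")
    run("tar", "-tvf", TAPE)                 # read back the table of contents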


ARCServe was a Computer Associates product. That's all you need to know.

It had a great reputation on Novell NetWare, but the Windows product was a mess. I never had a piece of backup management software cause blue screens (i.e. kernel panics) before an unfortunate customer introduced me to ARCServe on Windows.


My favorite ArcServe bug which they released a patch for (and which didn’t actually fix the issue, as I recall) had a KB article called something along the lines of “The Open Database Backup Agent for Lotus Notes Cannot Backup Open Databases”.


IIRC tar has some Unixisms that don't necessarily map to Windows/NTFS. Not saying reinventing tar is appropriate, but there are Windows/NTFS features that a Windows-based tape backup needs to support.


Most of what makes NTFS different from FAT probably doesn't need to be backed up. Complex ACLs, alternate data streams, shadow copies, etc. are largely irrelevant when it comes to making a backup. Just a simple warning like "The data being backed up includes alternate data streams. These aren't supported and won't be included in the backup" would suffice.


All of that stuff matters when you're using the backup for its intended purpose: to restore a system after hardware failure.

Unix tar is obviously not the right solution, but a Windows tar seems like it shouldn't be that hard to do, and yet we are in the situation we are in today. I've been using dump/restore for decades now on Unix, including to actually recover from loss, but I admit that it's not that pleasant to use. I like that it is very simple and reliable, however, unlike the mess that is Time Machine (recovering from a hardware loss on a Mac is a roll of the dice, and I've gotten snakes) or, worse, Deja Dup. I'm not sure I've ever successfully recovered a system from a Deja Dup backup.


> using the backup for its intended purpose: to restore a system after hardware failure.

No. The intended purpose of a backup is to restore the data (such as the Frogger 2 source code) after a hardware failure. If it has the side effect of also producing a working system, that's good, but it's not the point. After all, the hardware necessary to build a working system may not exist any more; one (and only probably not the last) instance of said hardware just broke.


Your one trivial use case isn't all use cases, and it sure isn't my important one. If you're doing more than backing up your personal workstation, metadata is extremely important. If I ever have to restore something, even just data, out of the multi-petabytes we have on tape, I better not have to manually go through it to figure out who should actually have access to it before I make it available to the people who need it.


The metadata matters.


I think the use case for disaster recovery is a bit different than long-term archival.


Does anyone buying tape storage actually use it for archival rather than disaster recovery?


We do. We have vast amounts of data that we merely have to keep around for a decade or so for compliance purposes that would rarely be accessed, so it goes on tape and off to Iron Mountain. Then we have backups where we need to be able to recover a running system from some 'known good' state, which is somewhat complicated. The former is conceptually tape drive+tar/cpio/etc.; the latter is an expensive setup that includes some proprietary solutions.


If you’re backing up a db or something sure, but for a file server this can be just as important as the data itself (ex: now everyone can read HR’s personnel files which had strict permissions before)


That’s fair; I wasn’t really considering windows. It seems like there ought to be some equivalent by now though.


The format is extensible enough that it could be added


The company that made it probably was hoping for vendor lock-in


Vendor lock in for backup and archival products is so ridiculous. It increases R&D to ensure the lock-in, and the company won't exist by the time the lock-in takes effect.


Well yes, but the boss is probably willing to invest more money (meaning higher salaries, more people, better tools) when expecting a future return than when using reasonable formats.


Is there a way to read magnetic tapes like these in such a way as to get the raw magnetic flux at high resolution?

It seems like it would be easier to process old magnetic tapes by imaging them and then applying signal processing rather than finding working tape drives with functioning rollers. Most of the time, you're not worried about tape speed since you're just doing recovery read rather than read/write operations. So, a slow but accurate operation seems like it would be a boon for these kinds of things.


For anybody who is into this, this is a good excuse to share a presentation from Vintage Computer Fest West 2020 re: magnetic tape restoration: https://www.youtube.com/watch?v=sKvwjYwvN2U

The presentation explores using software-defined signal processing to analyze a digitized version of the analog signal generated from the flux transitions. It's basically moving the digital portion of the tape drive into software (a lot like software-defined radio). This is also very similar to efforts in floppy disk preservation. Floppies are amazingly like tape drives, just with tiny circular tapes.
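
To make the "SDR for tape" idea a bit more concrete: a toy sketch of the digital side, assuming you already have a high-rate ADC capture of the read head and a simple self-clocking code where a flux transition in a bit cell means 1 and an empty cell means 0 (the OnStream drives' real channel coding is far more elaborate than this):

    # Toy "software-defined" decoder for a digitized magnetic read-head signal.
    # Assumes a simple self-clocking code: a transition in a bit cell is a 1,
    # an empty cell is a 0. Real formats layer much more on top of this.
    import numpy as np

    def transition_times(samples, sample_rate):
        """Times (in seconds) of zero crossings in the analog head signal."""
        signs = np.sign(samples)
        idx = np.where(np.diff(signs) != 0)[0]
        return idx / sample_rate

    def decode(times, cell_time):
        """N bit cells between transitions -> (N-1) zeros followed by a one."""
        bits = []
        for gap in np.diff(times):
            cells = max(1, int(round(gap / cell_time)))
            bits.extend([0] * (cells - 1) + [1])
        return bits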


OP here! Yes I'd highly recommend this video, I stumbled across it early on when trying to familiarize myself with what the options were-- and it's a good video!


At the very least (though the cost for this would perhaps be prohibitive), some mechanism to duplicate the raw flux off the tape onto another tape in an identical format would give you a backup of the backup. This would allow for read attempts that may be potentially destructive to the media (for example, accidentally breaking the tape) without losing the original signal.


Sounds like, at least in this case, the ASIC in the drive was doing some (non-trivial) signal processing. It would be interesting to know how hard it would be to get from the flux pattern back to zeros and ones. I guess with a working drive you can at least write as many test patterns as you want until you maybe figure it out.


At the very least the drive needs to be able to lock onto the signal. It's probably encoded in a helix on the tape, and if the head isn't synchronized properly you won't get anything useful, even with a high sampling rate.


I would be surprised if it used helical recording. Data tape recorders rarely do because it's much more complex, increases tape wear, and the use cases don't usually demand that kind of linear bandwidth.


Wasn't the whole selling point of these drives that they were able to encode more on the tape than their competitors? I figure that must include some clever tricks.


The data format is documented at a high level here:

https://github.com/Kneesnap/onstream-data-recovery/blob/main...

It's a linear, 192-track format that's read 8 tracks at a time. The other files in that repo are also worth reading.
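
Purely as an illustration of what "read 8 tracks at a time" can mean mechanically (a guess at one possible striping scheme, not the documented OnStream framing; see the repo for that):

    # Hypothetical byte-striping across 8 parallel head channels, illustration
    # only; the actual on-tape layout is described in the linked repository.
    def interleave(track_streams):
        """Merge 8 per-track byte streams into one logical stream by taking
        one byte from each track in turn (stops at the shortest stream)."""
        assert len(track_streams) == 8
        out = bytearray()
        for chunk in zip(*track_streams):
            out.extend(chunk)
        return bytes(out)

    # e.g. interleave([b"A0", b"B1", b"C2", b"D3", b"E4", b"F5", b"G6", b"H7"])
    #      -> b"ABCDEFGH01234567"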


You still need to know where to look, know the format, and use specialized equipment whose cost wasn't driven down by mass manufacturing. So, in theory yes, in practice not.

(Completely guessing here with absolutely no knowledge of the real state of things)


Yes. There’s some guy on YouTube who does stuff like that (he reverse engineered the audio recordings from a 747 tape array) but it can be quite complicated.


Would you have a link by any chance? Thanks!


https://youtu.be/MU02pQe3E5Q is the one I'm thinking of; digital would require more work


F2 was a really neat game. It almost invented Crypt of the Necrodancer’s genre decades early.

It’s a little sad that it took such a monumental effort to bring the source code back from the brink of loss. It’s times like that that should inspire lawmakers to void copyright in the case that the copyright holders can’t produce the thing they’re claiming copyright over.


Heh, I remember playing .mp3 files directly from QIC-80 tapes, somewhere around 1996. One tape could store about 120 MB, which is equal to about two compact discs' worth of audio. The noise of the tape drive was slightly annoying, though. And it made me appreciate what the 't' in 'tar' stands for.


Did you mean 1200 MB? That would make sense wrt. 2x CD capacity.


No, it was really only 120 MB. I was referring to the length of an audio compact disc, not the capacity of a CD-ROM. At 128 kbps, you'd get about 2 hours of play time.

Of course it didn't really make sense to use digital tapes for that use case, even back then. It was just for fun, and the article sparked some nostalgic joy, which felt worth sharing :)


They reference MP3, and a CD ripped down to MP3 probably fits in the 50-100 MB envelope for size. It has been a very long time since I last ripped an album, but that size jibes with my memory.
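
Both numbers check out as rough arithmetic:

    # Sanity check: play time of 120 MB of 128 kbps MP3, and the size of one
    # ripped ~74-minute audio CD at the same bitrate.
    TAPE_MB = 120
    BITRATE_KBPS = 128

    seconds = TAPE_MB * 1024 * 1024 * 8 / (BITRATE_KBPS * 1000)
    print(f"{TAPE_MB} MB at {BITRATE_KBPS} kbps ~ {seconds / 3600:.1f} hours")    # ~2.2 h

    cd_minutes = 74
    cd_mb = cd_minutes * 60 * BITRATE_KBPS * 1000 / 8 / (1024 * 1024)
    print(f"one {cd_minutes}-minute CD at {BITRATE_KBPS} kbps ~ {cd_mb:.0f} MB")  # ~68 MB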


This is just random, but reading this and the backup discussion made me think about SGI IRIX and how it could do incremental backups.

One option was to specify a set of files, and that spec could just be a directory. Once done, the system built a mini filesystem and would write that to tape.

XFS was the filesystem in use at the time I was doing systems level archival.

On restores, each tape, each record was a complete filesystem.

One could do it in place and literally see the whole filesystem build up and change as each record was added. Or, restore to an empty directory and you get whatever was in that record.

That decision was not as information dense as others could be, but it was nice and as easy as it was robust.

What our team did to back up some data-management engineering software was perform a full system backup every week, maybe two. Then incrementals every day, written twice to the tape.

Over time, full backups were made and sent off site. One made on a fresh tape, another made on a tape that needed to be cycled out of the system before it aged out. New, fresh tapes entered the cycle every time one aged out.

Restores were done to temp storage and rather than try and get a specific file, it was almost always easier to just restore the whole filesystem and then copy the desired file from there into its home location. The incrementals were not huge usually. Once in a while they got really big due to some maintenance type operation touching a ton of files.

The nifty thing was no real need for a catalog. All one needed was the date to know which tapes were needed.

Given the date, grab the tapes, run a script and go get coffee and then talk to the user needing data recovery to better understand what might be needed. Most of the time the tapes were read and the partial filesystem was sitting there ready to go right about the time those processes completed.

Having each archive, even if it were a single file, contain a filesystem data set was really easy to use and manage. Loved it.
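
A sketch of the "all you need is the date" selection logic described above; the weekly-full/daily-incremental schedule comes from the comment, but the Sunday full day is an assumption, just to show why no catalog is required when the schedule itself encodes everything:

    # Hypothetical tape-selection helper for a weekly-full / daily-incremental
    # rotation: given a restore date, you need the most recent full backup on
    # or before it, plus every incremental written since.
    from datetime import date, timedelta

    def tapes_needed(restore_date, full_weekday=6):
        """full_weekday: 0=Monday .. 6=Sunday (assumed day of the weekly full)."""
        days_since_full = (restore_date.weekday() - full_weekday) % 7
        full_day = restore_date - timedelta(days=days_since_full)
        tapes = [("full", full_day)]
        tapes += [("incremental", full_day + timedelta(days=i))
                  for i in range(1, days_since_full + 1)]
        return tapes

    # tapes_needed(date(1999, 3, 10)) -> the Sunday full plus Mon-Wed incrementals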


A few months ago I was looking for an external backup drive and thought that an SSD would be great because it's fast and shock resistant. Years ago I killed a MacBook Pro HD by throwing it on my bed from a few inches high. Then I read a comment on Amazon about SSDs losing information when unpowered for a long time. I couldn't find any quick confirmation on the product page; it took me a few hours of research to find a paper about this phenomenon. If I remember correctly, it takes a few weeks for a stored SSD to start losing its data. So I bought a mechanical HD.

Another tech tip is not buying 2 backup devices from the same batch or even the same model. Chances are these will fail in the same way.


To the last bit, I've seen this first hand. Had a whole RAID array of the infamous IBM DeathStar drives fail one after the other while we frantically copied data off.

Last time I ever had the same model drives in an array.


Heh, I remember in the early 1990s having a RAID array with a bunch of 4 GB IBM drives come up dead after a weekend powerdown for a physical move due to "stiction". I was on the phone with IBM, and they were telling me to physically bang the drives on the edge of a desk to loosen them up. It didn't seem to be working, so their advice was "hit it harder!" When I protested, they said, "hey, it already doesn't work, what have you got to lose?" So I hit it harder. Eventually got enough drives to start up to get the array online, and you better believe the first thing I did after that was create a fresh backup (not that we didn't have a recent backup anyway), and the 2nd thing I did was replace those drives, iirc, with Seagate Barracudas.


Ouch. I knew someone who claimed to have dealt with that or a similar effect (after their cleaning person had pulled the plug on their servers) by putting the drive in an oven while connected and heating it slowly.

Personally my most nailbiting period was when I got my first (20MB!) drive as a kid and it was too big an investment to replace even when it refused to spin up without me opening up the drive(!) and nudging the platter with my finger to help the motor spin it up... I backed everything up (on floppies), and stored everything important straight to floppies, but it was still more convenient to hold on to the HD for the next 6 months or so until I'd saved up enough to replace it...

It's remarkable what drives can survive if you're lucky... Also remarkable how quickly that luck can run out, though.

> "hey, it already doesn't work, what have you got to lose?"

This attitude has saved me more than once. Recognising when you can afford to do things that seems ridiculous helps surprisingly often.


When I was still relatively familiar with flash memory technologies (in particular NAND flash, the type used in SSDs and USB drives), the retention specs were something like 10 years at 20C after 100K cycles for SLC, and 5 years at 20C after 5-10K cycles for MLC. The more flash is worn, the leakier it becomes. I believe the "few weeks" number for modern TLC/QLC flash, but I suspect that is still after the specified endurance has been reached. In theory, if you only write to the flash once, then the retention should still be many decades.

Someone is trying to find out with an experiment, however: https://news.ycombinator.com/item?id=35382252


Indeed. The paper everyone gets the "flash loses its data in a few years" claim from wasn't dealing with consumer flash and consumer use patterns. Remember that having the drive powered up wouldn't stop that kind of degradation without explicitly reading and re-writing the data. Surely you have a file on an SSD somewhere that hasn't been re-written in several years, go check yourself whether it's still good.

Even the utter trash that is thumb drives and SD cards seem to hold data just fine for many years in actual use.

IIRC, the paper was explicitly about heavily used and abused storage.


CD-R drives were already common in 2001: https://en.wikipedia.org/wiki/CD-R

I wonder, would a CD-R disc retain data for these 22 years?


Only if you kept the disk in a refrigerator. Bits are stored by melting the plastic slightly and the dye seeping in. Over time, the warmth of "room temperature" will cause the pits to become less well-defined so the decoder has to spend more time calculating "well, is that really a 1 or is it a sloppy 0". There's a lot of error detection/correction built into the CD specs, but eventually, there will be more error than can be corrected for. If you've ever heard the term "annealing" when used in machine learning, this is equivalent.

Living in South Florida, ambient temperatures were enough to erase CD-Rs - typically in less than a year. I quickly started buying the much more expensive "archival" discs, but that wasn't enough. One fascinating "garage band" sold their music on CD-Rs and all of my discs died (it was a surfer band from Alabama).


The recording is made in the dye layer, a chemical change, and the dye degrades (particularly in sunlight) so the discs have a limited shelf life. Checking Wikipedia, it appears azo dye formulations can be good for tens of years.

Melting polycarbonate would call for an absurdly powerful laser, a glacial pace, or both, and you wouldn't have to use dye at all. I'd guess such a scheme would be extremely durable, though.


I recently tried a couple of CD-Rs that had been stored in a dry, closed drawer for most of the last 20 years, and they seemed to at least initially come up. Now I can reinstall Windows 2000 with Service Pack 4 slipstreamed!


On the topic of Froggers, I enjoyed https://www.youtube.com/watch?v=FCnjMWhCOcA


This brings back (unpleasant) memories. I remember trying to get those tape drives working with FreeBSD back in 1999, and it going nowhere.


This will be fun in 20 years, trying to recover 'cloud' backups from servers found in some warehouse.


Nah it will be very simple:

....What do you mean "nobody paid for the bucket for last 5 years" ?

There is some chance someone might stash an old hard drive or a tape with a backup somewhere in the closet. There is no chance there will be anything left when someone stops paying for the cloud.


Those drives will all be encrypted and most likely shredded.


I'm pretty sure that even with the substantial damage done by the recovery company, a professional team like Kroll Ontrack can still recover the complete tape data, although it probably won't be cheap.


As the other comment here says, any company claiming to do data recovery, and damaging the original media to that extent, should be named and shamed. I can believe that DR companies have generic drives and heads to read tapes of any format they come across, but even if they couldn't figure out how the data was encoded, there was absolutely no need to cut and splice the tape. I suspect they did that just out of anger at not likely being able to recover anything (and thus having spent a bunch of time for no profit.)

Melted pinch rollers are not uncommon, and there is plenty of other (mostly audio) equipment with similar problems and solutions --- dimensions are not absolutely critical and suitable replacements/substitutes are available.

As an aside, I think that prominent "50 Gigabytes" capacity on the tape cartridge, with a small asterisk-note at the bottom saying "Assumes 2:1 compression", should be outlawed as a deceptive marketing practice. It's a good thing HDD and other storage media didn't go down that route.


Name and shame the company, you had a personal experience, you have proof. Name and shame. It helps nobody if you don't publicize it. Let them defend it, let them say whatever excuse, but your review will stand.


I don't want to even remotely tempt them to sue. They have no grounds, but I'm not taking risks-- companies are notorious for suing when they know they'll lose. Others who have posted it here have identified the right company though.


This is a masterful recovery effort. The README should be shared as an object lesson far and wide to every data restoration and archival service around.


I've been suffering through something similar with a DLT IV tape from 1999. Luckily I didn't send it out to the data recovery company. But still unsuccessful.


Is anyone else calling it “froggering/to frogger” if they have to cross a bigger street by foot without a dedicated crossing?


DVDs should not be overlooked for backup. The Millennium type has been simulated to withstand 1,000 years.


The author has fantastic endurance; what a marathon to get the files off the tape.


Someone was wise enough to erase the evidence in Party.


Nice catch! I think it was a little less juvenile than it might sound. I believe this was for a different game, Fusion Frenzy, which was a party minigame collection.


While I didn't understand the parent you are replying to (not your answer), your mention of Fusion Frenzy caught my eye. I've had a soft spot for that game since spending hours playing the "xbox magazine" demo with a childhood friend. Could you clarify? Is there any history gem about that one? I'd dig a PC port!


In one of the screenshots was an empty folder called "Party", with the commenter suggesting they deleted it before the backup as some way to hide it, but most likely it was just because it was for another game.

I don't have much knowledge about Fusion Frenzy, but I am looking for options for archiving its development too, if that's even a possibility, which I'm not certain of yet.


Ah, that makes more sense, thank you for explaining.

I would love to see some development archives, and source code for it one day.


> Yet, despite ARCserve showing a popup which says "Restoration Successful", it restores up to the first 32KB of every file on the tape, but NO MORE.

From 10,000 feet, this sounds suspiciously like ARCserve is reading a single tape block or transfer buffer's worth of data for each file, writing out the result, then failing and proceeding to the next file.

Success popup notwithstanding, I'd expect to find errors in either the ARCserve or Windows event logs in this case — were there none?
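
As a quick way to confirm that exact symptom, a throwaway scan like this (the restore path is a placeholder) would show whether everything is being capped at one 32 KB transfer:

    # Throwaway diagnostic for the "first 32 KB of every file" symptom: if the
    # restore truncates at one transfer buffer, every file that was originally
    # larger comes out as exactly 32,768 bytes.
    import os, sys

    LIMIT = 32 * 1024
    restore_root = sys.argv[1]   # path to the restored tree (placeholder)

    sizes = [os.path.getsize(os.path.join(dirpath, name))
             for dirpath, _, names in os.walk(restore_root) for name in names]
    capped = sum(1 for s in sizes if s == LIMIT)
    print(f"{capped} of {len(sizes)} restored files are exactly {LIMIT} bytes")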

While it's been decades since I've dealt with ARCserve specifically, I've seen similar behavior caused by any number of things. Off the top of my head,

(1) Incompatibilities between OS / backup software / HBA driver / tape driver.

In particular, if you're using a version of Windows much newer than Windows 2000, try a newer version of ARCserve.

In the absence of specific guidance, I'd probably start with the second* ARCserve version that officially supports Windows Server 2003:

(a) Server 2003 made changes to the SCSI driver architecture that may not be 100% compatible with older software.

(b) The second release will likely fix any serious Server 2003 feature-related bugs the first compatible version may have shipped with, without needing to install post-release patches that may be hard to find today.

(c) Significantly newer ARCserve versions are more likely to introduce tape drive / tape format incompatibilities of their own.

(2) Backup software or HBA driver settings incompatible with the hardware configuration (e.g., if ARCserve allows it, try reducing the tape drive transfer buffer size or switching from fixed block (= multiple tape blocks per transfer) to variable block (= single tape block per transfer) mode; if using an Adaptec HBA, try increasing the value of /MAXIMUMSGLIST[1]).

(3) Shitty modern HBA driver support for tape (and, more generally, non-disk) devices.

For example, modern Adaptec Windows HBA drivers have trouble with large tape block sizes that AFAIK cannot be resolved with configuration changes (though 32 kB blocks, as likely seen here, should be fine).

In my experience with PCIe SCSI HBAs, LSI adapters are more likely to work with arbitrary non-disk devices and software out-of-the-box, whereas Adaptec HBAs often require registry tweaks for "unusual" circumstances (large transfer sizes; concurrent I/O to >>2 tape devices; using passthrough to support devices that lack Windows drivers, especially older, pre-SCSI 2 devices), assuming they can be made to work at all.

LSI20320IE PCIe adapters are readily available for $50 or less on eBay and, in my experience, work well for most "legacy" applications.

(To be fair to Adaptec, I've had nothing but good experiences using their adapters for "typical" applications: arbitrary disk I/O, tape backup to popular drive types, CD/DVD-R applications not involving concurrent I/O to many targets, etc.)

(4) Misconfigured or otherwise flaky SCSI bus.

In particular, if you're connecting a tape drive with a narrow (50-pin) SCSI interface to a wide (68-pin) port on the HBA, make sure the entire bus, including the unused pins, is properly terminated.

The easiest way to ensure this is to use a standard 68-pin Ultra320 cable with built-in active LVD/SE termination, make sure termination is enabled on the HBA, disabled on the drive, that the opposite end of the cable from the built-in terminator is connected to the HBA, and, ideally, that the 68-to-50-pin adapter you're using to connect the drive to the cable is unterminated.

You can also use a 50-pin cable connected to the HBA through a 68-to-50-pin adapter, but then you're either relying on the drive properly terminating the bus — which it may or may not do — or else you need an additional (50-pin) terminator for the drive end, which will probably cost as much as a Ultra320 cable with built-in termination (because the latter is a bog-standard part that was commonly bundled with both systems and retail HBA kits).

Note that I have seen cases where an incorrect SCSI cable configuration works fine in one application, but fails spectacularly in another, seemingly similar application, or even the same application if the HBA manages to negotiate a faster transfer mode. While this should be far less likely to occur with a modern Ultra160 or Ultra320 HBA, assume nothing until you're certain the bus configuration is to spec (and if you're using an Ultra2 or lower HBA, consider replacing it).

With all that said, reversing the tape format may well be easier than finding a compatible OS / ARCserve / driver / HBA combination.

In any case, good job with that, and thanks for publishing source code!

[1] http://download.adaptec.com/pdfs/readme/relnotes_29320lpe.pd...


On a related note, I own a few older tape drives[1], have access to many more[2], and would be happy to volunteer my time and equipment to small-scale hobbyist / retrocomputing projects such as this — tape format conversions were a considerable part of my day job for several years, and tape drives are now a minor hobby.

See my profile for contact information.

[1] 9-track reel, IBM 3570, IBM 3590, early IBM 3592, early LTO, DLT ranging from TK-50 to DLT8000.

[2] IBM 3480/3490/3490E, most 4mm and 8mm formats, most full-sized QIC formats including HP 9144/9145, several QIC MC/Travan drives with floppy controllers of some description, a Benchmark DLT1 assuming it still works, probably a few others I'm forgetting about.


At some point, I feel as if it may be easier just to rewrite the code from the ground up vs. going through all that computational archaeology....

Or in a few years, just have an AI write the code...


> This is where the story should probably have stopped. Given up and called it a day, right? Maybe, but I care about this data, and I happen to know a thing or two about computers.

Hahaha awwwww yeah :muscle:



