Absolutely amazing story. Fantastic!

I've long been stunned by the propensity of proprietary backup software to use undocumented, proprietary formats. It seems to me that the first thing to solve when designing a backup format is ensuring it can be read in the future even if all copies of the backup software are lost.

I may be wrong but I think some open source tape backup software (Amanda, I think?) does the right thing and actually starts its backup format with emergency restoration instructions in ASCII. I really like this kind of "Dear future civilization, if you are reading this..." approach.
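To make the idea concrete, here is a minimal sketch of a container that leads with plain ASCII restore instructions. The layout, the 32 KiB header size, and the names `write_backup` and `read_backup` are all made up for illustration; this is not Amanda's actual on-tape format.

```python
import io

BLOCK_SIZE = 32 * 1024  # hypothetical header size, chosen only for this sketch


def write_backup(out, payload: bytes, block_size: int = BLOCK_SIZE) -> None:
    """Write an ASCII note padded to one block, then the raw payload."""
    instructions = (
        "HYPOTHETICAL BACKUP FORMAT v1\n"
        f"The first {block_size} bytes of this stream are this ASCII note, "
        "padded with spaces.\n"
        "Everything after that is the raw, uncompressed payload.\n"
        f"To restore: skip {block_size} bytes and copy the rest to a file.\n"
    ).encode("ascii")
    if len(instructions) > block_size:
        raise ValueError("instructions do not fit in the header block")
    out.write(instructions.ljust(block_size, b" "))
    out.write(payload)


def read_backup(inp, block_size: int = BLOCK_SIZE):
    """Return (instructions, payload) from a stream made by write_backup."""
    header = inp.read(block_size).decode("ascii").rstrip()
    return header, inp.read()
```

Someone who finds the medium without the software only needs to dump the first block as text to learn how to get at the rest.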

Frankly nobody should agree to use a backup system which generates output in a proprietary and undocumented format, but also I want a pony...

It's interesting to note that the suitability of file formats for archiving is also a specialised field of consideration. I recall an article by someone investigating this very issue who argued that formats like .xz weren't well suited to archiving. Relevant concerns include how screwed you are if the archive is partly corrupted. The more sophisticated your compression algorithm (and thus the more state it carries over from long before a given block), the more a single bit flip can result in massive amounts of run-on data corruption, so better compression essentially makes things worse if you assume some amount of data might be damaged. You also have the option of adding parity data to allow for some recovery from damage, of course. Though as this article shows, all of this seems like nothing compared to the challenge of ensuring you'll even be able to read the media at all in the future.
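As a rough illustration of the parity idea, the simplest scheme is a single XOR parity block over equal-sized data blocks (the same principle as RAID 5). This is only a sketch with made-up helper names; real archival tools generally use stronger codes such as Reed-Solomon, and you would still need per-block checksums to know which block is the damaged one.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(blocks):
    """Parity block: the XOR of all data blocks."""
    return xor_blocks(blocks)


def recover(blocks, parity, missing_index):
    """Rebuild the one block at missing_index from the survivors and parity."""
    survivors = [b for i, b in enumerate(blocks) if i != missing_index]
    return xor_blocks(survivors + [parity])
```

One parity block can rebuild exactly one lost block per group, so the overhead and the protection both depend on how many data blocks share a parity block.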

At some point the design lifespan of the proprietary ASICs in these tape drives will presumably just expire(?). I don't know what will happen then. Maybe people will start using advanced FPGAs to reverse engineer the tape format and read the signals off, but the amount of effort to do that would be astronomical, far more even than the amazing effort the author here went to.




To add, thinking a bit more about it: Designing formats to be understandable by future civilizations actually reduces to a surprising degree to the same set of problems which METI has to face. As in, sending signals designed to be intelligible to extraterrestrials - Carl Sagan's Contact, etc.

Even if you write an ASCII message directly to a tape, that data is obviously going to be encoded before being written to the tape, and you have no idea if anyone will be able to figure out that encoding in future. Trouble.

What makes this particularly pernicious is the fact that LTO nowadays is a proprietary format(!!). I believe the spec for the first generation or two of LTO might be available, but last I checked, it's been proprietary for some time. The spec is only available to the (very small) consortium of companies which make the drives and media. And the number of companies which make the drives is now... two, I think? (They're often rebadged.) Wouldn't surprise me to see it drop to one in the future.

This seems to make LTO a very untrustworthy format for archiving, which is deeply unfortunate.


The best format for archiving is many formats.

Make an LTO tape... But also make a Bluray... And also store it on some hard drives... And also upload it to a web archive...

The same for the actual file format... Upload PDFs... But also upload Word documents... And also ASCII...

And the same for the location... Try to get diversity of continents... Diversity of geopolitics (i.e. some in the USA, some in Russia). Diversity of custodians (friends, businesses, charities).


Even ASCII itself is a strange encoding that could be lost with enough time and need to be recovered through cryptographic analysis and signals processing. That doesn't look at all likely today given UTF-8's promised and mostly accomplished ubiquity and its permanent grandfathering of ASCII. But ASCII is still only one of a number of potential encoding schemes, and it isn't necessarily obvious from first principles.

Past generations thought EBCDIC would last longer than it did.

Again, not that there are any indications now that ASCII won't survive nearly as long as the English language does at this point, just that when we're talking about sending signals to the future, even assuming ASCII encoding is an assumption to question.


Baby's first cryptographic analysis, sure. Mapping letters to bits is easy, and the 8 bit repeating pattern is also easy.

The thing that might make it hard is if people have forgotten English itself, and in that case ASCII is one of the smallest barriers.

EBCDIC would also be fine.
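To show how little analysis the "8 bit repeating pattern" needs, here is a small sketch that finds the character width of a raw bit stream by looking for a bit position that never changes. It relies on the fact that ASCII text always has the top bit clear; the function names are invented for this example, and a real analysis would need to tolerate noise and unknown alignment.

```python
def to_bits(data: bytes) -> list[int]:
    """Flatten bytes into a list of bits, most significant bit first."""
    return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]


def constant_positions(bits, period):
    """Offsets within the period whose bit has the same value every time."""
    return [
        offset
        for offset in range(period)
        if len(set(bits[offset::period])) == 1
    ]


def guess_period(bits, max_period=16):
    """Smallest period at which at least one offset is constant, else None."""
    for period in range(2, max_period + 1):
        if constant_positions(bits, period):
            return period
    return None


sample = to_bits(b"Dear future civilization, if you are reading this")
print(guess_period(sample), constant_positions(sample, 8))
```

Once the width is known, the rest is a substitution cipher with a very generous amount of ciphertext.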


These things make more sense because LTO is used for backup, not archival. Companies don't want to be able to read the tape data in 50 years, they want to be able to read it tomorrow, after the entire business campus burns down.


You mean the "if you are reading this in the distant future" instructions are written to the medium first? And are straight up ASCII?

Nice. That kind of thing makes too much sense. Wow. Such cheap insurance. Nice work from that team.


Yeah. If I ever wrote a backup system I'd do this too, write the whole spec for the format first to every medium. A 100k specification describing the format is nothing to waste on a medium which can store 10TB.
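A quick back-of-the-envelope check of those numbers, assuming decimal units (100 kB and 10 TB):

```python
SPEC_BYTES = 100 * 1000        # 100 kB specification
MEDIUM_BYTES = 10 * 1000**4    # 10 TB medium

fraction = SPEC_BYTES / MEDIUM_BYTES           # share of the medium used
copies_that_fit = MEDIUM_BYTES // SPEC_BYTES   # how many specs would fill it

print(f"fraction of medium: {fraction:.0e}")
print(f"copies that would fit: {copies_that_fit:,}")
```

That works out to one hundred-millionth of the medium, so you could afford to repeat the spec many times along the tape and still not notice the cost.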


Seriously. Scale really changes things.


It's kinda strange that we still don't have a technology that would allow one to scan a magnetic medium at high resolution and then process it in software. This would be nice for all kinds of things that use magnetic tapes and platters — data recovery, perfect analog tape digitization, etc. The closest I've seen to it is that project that captures the raw signal from the video head of a VCR and then decodes it into a picture.


Isn't there a subset of that at least for floppy discs with KryoFlux or Greaseweazle style controllers? They read the raw flux transitions off the drive head, and then it's up to software to figure out that it's a Commodore GCR disc or a Kaypro MFM one.
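For a feel of what that software step involves, here is a very simplified sketch of turning flux timings into MFM data bits. It assumes a 2 µs bit cell (typical for double-density MFM), a stream that starts on a known clock/data boundary, and clean timings. Real decoders do clock recovery and hunt for sync marks; none of the names below come from the KryoFlux or Greaseweazle tools.

```python
def mfm_encode(data_bits):
    """Encode data bits as MFM cells: a clock cell, then a data cell."""
    cells, prev = [], 0
    for bit in data_bits:
        clock = 1 if (bit == 0 and prev == 0) else 0  # clock only between two 0s
        cells += [clock, bit]
        prev = bit
    return cells


def cells_to_intervals(cells, cell_ns=2000):
    """Times between flux transitions (the 1 cells), in nanoseconds.

    The first interval is measured from an imaginary transition one cell
    before the start, which keeps the clock/data alignment in this sketch.
    """
    intervals, last = [], -1
    for index, cell in enumerate(cells):
        if cell:
            intervals.append((index - last) * cell_ns)
            last = index
    return intervals


def decode_mfm(intervals_ns, cell_ns=2000):
    """Quantise each interval to whole cells, then keep the data cells."""
    cells = []
    for t in intervals_ns:
        n = round(t / cell_ns)           # valid MFM gives 2, 3 or 4 cells
        cells += [0] * (n - 1) + [1]     # zeros, then the transition itself
    return cells[1::2]                   # every second cell is a data bit
```

Round-tripping `decode_mfm(cells_to_intervals(mfm_encode(bits)))` returns the original bits as long as the last bit is a 1, since trailing cells without a transition leave no trace in the timings. A GCR disc would yield the same kind of interval list but need a different cell-to-data mapping, which is why the format guess lives in software.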


LTO tape media itself is typically only rated at 30 years, so I suspect the tapes will die before the drives do.



