In The Singularity Is Near (2005) Ray Kurzweil discussed an idea for the “Document Image and Storage Invention”, or DAISI for short, but concluded it wouldn't work out. I interviewed him a few years later about this and here's what he said:
The big challenge, which I think is actually an important, almost philosophical challenge — it might sound like a dull issue, like how do you format a database so you can retrieve information, that sounds pretty technical. The real key issue is that software formats are constantly changing.
People say, “well, gee, if we could back up our brains,” and I talk about how that will be feasible some decades from now. Then the digital version of you could be immortal, but software doesn’t live forever, in fact it doesn’t live very long at all if you don’t care about it, if you don’t continually update it to new formats.
Try going back 20 years to some old formats, some old programming language. Try resuscitating some information on some PDP-1 magnetic tapes. I mean even if you could get the hardware to work, the software formats are completely alien and [using] a different operating system and nobody is there to support these formats anymore. And that continues. There is this continual change in how that information is formatted.
I think this is actually fundamentally a philosophical issue. I don’t think there’s any technical solution to it. Information actually will die if you don’t continually update it. Which means, it will die if you don’t care about it. ...
We do use standard formats, and the standard formats are continually changed, and the formats are not always backwards compatible. It’s a nice goal, but it actually doesn’t work.
I have in fact electronic information that in fact goes back through many different computer systems. Some of it now I cannot access. In theory I could, or with enough effort, find people to decipher it, but it’s not readily accessible. The more backwards you go, the more of a challenge it becomes.
And despite the goal of maintaining standards, or maintaining forward compatibility, or backwards compatibility, it doesn’t really work out that way. Maybe we will improve that. Hard documents are actually the easiest to access. Fairly crude technologies like microfilm or microfiche which basically has documents are very easy to access.
So ironically, the most primitive formats are the ones that are easiest.
In 2005 the computing world was much more in flux than it is now.
PNG is 26 years old and basically unchanged since then. Same with 30-year-old JPEG, or for those with more advanced needs the 36-year-old TIFF (though there is a newer 21-year-old revision). All three have stood the test of time against countless technologically superior formats by virtue of their ubiquity and the value of interoperability. The same could be said about 34-year-old zip or 30-year-old gzip. For executable code, the Wine-supported subset of PE/WIN32 seems to be with us for the foreseeable future, even as Windows slowly drops compatibility.
The latest Office 365 version of Word still supports opening Word 97 files as well as the slightly older WordPerfect 5 files, not to mention 36-year-old RTF files. HTML 1.0 is 30 years old and is still supported by modern browsers. PDF has also received constant updates, but I suspect 29-year-old PDF files would still display fine.
In 2005 you could look back 15 years and see a completely different computing landscape with different file formats. Look back 15 years today and not that much has changed. There are lots of exciting new competitors as always (webp, avif, zstd), but only time will tell whether they will earn a place among the others or go the way of JPEG2000 and RAR. But if you store something today in a format that's survived the last 25 years, you have a good chance of still being able to open it in common software 50 years down the line.
This is too shortsighted by archival standards. Even Word itself doesn't offer full compatibility. VB? 3rd-party active components? Other Office software integration? It's a mess. HTML and other web formats are only readable by virtue of being constantly evolved while keeping backwards compatibility, which is nowhere near complete and is hardware-dependent (e.g. aspect ratios, colors, pixel densities). The standards will be pruned sooner or later, due to tech debt or to being sidestepped by something else. And I'm pretty sure there are plenty of obscure PDF features that will prevent many documents from being readable in a mere half century. I'm not even starting on the code and binaries. And cloud storage is simply extremely volatile by nature.
Even 50 years (laughable for a clay tablet) is still pretty darn long in the tech world. We'll still probably see the entire computing landscape, including the underlying hardware, changing fundamentally in 50 years.
Future-proofing anything is a completely different dimension. You have to provide an independent way to bootstrap, without relying on an unbroken chain of software standards, business/legal entities, and public demand for certain hardware platforms/architectures. This is unfeasible for the vast majority of knowledge/artifacts, so you also have to have a good mechanism to separate signal from noise and to transform volatile formats like JPEG or machine-executable code into more or less future-proof representations, or at least basic descriptions of what the notable thing did and what impact it had.
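A minimal sketch of what one such transformation could look like, assuming Pillow is available for the JPEG decoding step (the file names are made up); the ASCII PPM (P3) output is simple enough to re-describe from scratch in a couple of sentences:

```python
# Decode a JPEG with Pillow and re-save it as ASCII PPM (P3): a text
# header with width, height and max channel value, followed by one
# "R G B" line per pixel, a format you can describe in two sentences.
from PIL import Image

def jpeg_to_ascii_ppm(src_path: str, dst_path: str) -> None:
    img = Image.open(src_path).convert("RGB")
    width, height = img.size
    with open(dst_path, "w") as out:
        out.write(f"P3\n{width} {height}\n255\n")
        for r, g, b in img.getdata():   # row-major (r, g, b) tuples
            out.write(f"{r} {g} {b}\n")

# jpeg_to_ascii_ppm("photo.jpg", "photo.ppm")  # hypothetical file names
```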
>Future-proofing anything is a completely different dimension. You have to provide an independent way to bootstrap, without relying on an unbroken chain of software standards, business/legal entities, and public demand for certain hardware platforms/architectures. This is unfeasible for the vast majority of knowledge/artifacts, so you also have to have a good mechanism to separate signal from noise and to transform volatile formats like JPEG or machine-executable code into more or less future-proof representations, or at least basic descriptions of what the notable thing did and what impact it had.
I'd argue that the best way would be not to do that, but to make sure the format is ubiquitous enough that the knowledge will never be lost in the first place.
That, and use formats which can be accessed and explained concisely, like "read the first X bytes into metadata field A, then read the image payload by interpreting every three bytes as an RGB triplet until EOF", so that the information can be transmitted orally, on the off chance that becomes necessary.
Hey I think I just described Windows 3.0-era PCX format :P
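Something like the following, as a sketch of the hypothetical format described above (the header layout and field sizes are made up for illustration; this isn't actual PCX):

```python
import struct

HEADER_SIZE = 8  # hypothetical: 4-byte width + 4-byte height, little-endian

def decode_raw_rgb(path: str):
    """Read the first HEADER_SIZE bytes as metadata, then interpret every
    three bytes of the remaining payload as an (R, G, B) triplet until EOF."""
    with open(path, "rb") as f:
        width, height = struct.unpack("<II", f.read(HEADER_SIZE))
        payload = f.read()
    pixels = [tuple(payload[i:i + 3]) for i in range(0, len(payload) - 2, 3)]
    return width, height, pixels
```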
> HTML and other web formats are only readable by virtue of being constantly evolved while keeping backwards compatibility, which is nowhere near complete and is hardware-dependent (e.g. aspect ratios, colors, pixel densities).
HTML itself is relatively safe, by virtue of being based on SGML. Though it's not ideal either because those who think it's their job to evolve HTML don't bother to maintain SGML DTDs or use other long established formal methods to keep HTML readable, but believe a hard-coded and (hence necessarily) erroneous and incomplete parsing description the size of a phone book is the right tool for the job.
Let me quote the late Yuri Rubinsky's foreword to The SGML Handbook (from 1990), outlining the purpose of markup languages:
> The next five years will see a revolution in computing. Users will no longer have to work at every computer task as if they had no need to share data with all their other computer tasks, they will not need to act as if the computer is simply a replacement for paper, nor will they have to appease computers or software programs that seem to be at war with one another.
However, exactly because evolving markup vocabularies requires organizing consensus, a task which W3C et al seemingly weren't up to (busy with XML, XHTML, WS-Star, and RDF-Star instead for over a decade), CSS and JS were invented and extended for the absurd purpose of basically redefining what's in the markup, which itself didn't need to change, with absolutely disastrous results for long-term readability, or even readability today on browsers other than those from the browser cartel.
> Though it's not ideal either because those who think it's their job to evolve HTML don't bother to maintain SGML DTDs or use other long established formal methods to keep HTML readable, but believe a hard-coded and (hence necessarily) erroneous and incomplete parsing description the size of a phone book is the right tool for the job.
> a task which W3C et al seemingly weren't up to (busy with XML, XHTML
You realise XML/XHTML is actually delightfully simple to parse compared to WHATWG HTML?
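For what it's worth, here is a sketch of the difference using only the Python standard library (html.parser is just a rough stand-in for a full WHATWG parser, but it makes the point):

```python
# Well-formed XHTML parses with a small, strict XML parser; real-world
# "tag soup" needs a forgiving parser that implements error recovery.
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

xhtml = '<html xmlns="http://www.w3.org/1999/xhtml"><body><p>hi</p></body></html>'
print(ET.fromstring(xhtml).tag)    # strict parse succeeds on well-formed input

soup = "<p>hi<p>unclosed <b>tags"  # fine in browsers, not well-formed XML

class TagLogger(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("start:", tag)

TagLogger().feed(soup)             # tolerant parse succeeds
# ET.fromstring(soup) would raise a ParseError here
```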
While it's true that these standards are X years old, the software that encoded those formats yesteryear is very different from the software that decodes it today. It's a Ship of Theseus problem. They can claim an unbroken lineage since the distant future, the year 2000, but encoders and decoders had defects and opinions that were relied on--both intentionally and unintentionally--that are different from the defects and opinions of today.
I have JPEGs and MP3s from 20 years ago that don't open today.
I've found https://github.com/ImpulseAdventure/JPEGsnoop useful to fix corruption but I haven't come across a non-standard JFIF JPEG unless it was intentionally designed to accommodate non-standard features (alpha channel etc).
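As a sketch of the most basic sanity check involved (just the SOI/EOI markers, nothing close to what JPEGsnoop actually does):

```python
# Quick sanity check for a JFIF/JPEG file: the stream should begin with
# the Start Of Image marker (FF D8) and end with End Of Image (FF D9).
# A missing EOI usually means the file was truncated at some point.
def jpeg_looks_intact(path: str) -> bool:
    with open(path, "rb") as f:
        data = f.read()
    has_soi = data.startswith(b"\xff\xd8")
    has_eoi = data.rstrip(b"\x00").endswith(b"\xff\xd9")  # tolerate zero padding
    return has_soi and has_eoi
```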
I personally never encountered JPEGs or MP3s which were totally unreadable due to being encoded by ancient software versions, but the metadata in common media formats is a total mess. Cameras and encoders write all sorts of obscure proprietary tags, or you even get things like X-Ray (the STALKER Shadow of Chernobyl game engine) keeping gameplay-relevant binary metadata in OGG Vorbis comments. Which is even technically compliant with the standard, I think, but that won't help you much.
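A sketch of how you might dump those comment tags, assuming the third-party mutagen library (the file name is made up):

```python
# Dump every Vorbis comment from an Ogg Vorbis file, including whatever
# proprietary tags an encoder or application decided to stuff in there.
from mutagen.oggvorbis import OggVorbis

audio = OggVorbis("some_track.ogg")
for key, values in (audio.tags or {}).items():
    for value in values:
        print(f"{key}={value[:60]}")  # truncate long binary-ish payloads
```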
In the case of individual files with non-conformant or corrupted elements, it seems like a fairly straightforward project to build an AI model that can fix up broken files with a single click. I suspect such a thing will be widely accessible in 10 years.
"The roots of Creo Parametric. Probably one of the last running installations of PTC's revolutionary Pro/ENGINEER Release 7 datecode 9135 installed from tape. Release 7 was released in 1991 and is - as all versions of Pro/ENGINEER - fully parametric. Files created with this version can still - directly - be opened and edited in Creo Parametric 5.0 (currently the latest version for production).
This is a raw video, no edits, and shows a bit of the original interface (menu manager, blue background, yellow and red datum planes, no modeltree)."
That is great! NX has similar compatibility, though not quite as good.
In some cases, the version-specific code for a feature is invoked because that version is encoded as part of the model data schema.
Literal encapsulation in action. That way bugs and output variances are preserved so that the regenerated model is accurate to what the software did decades ago.
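A toy sketch of the idea (not PTC's actual scheme): each stored feature carries the schema/algorithm version it was created with, and regeneration dispatches on that version so old quirks are reproduced rather than silently "fixed":

```python
# Each feature record keeps the version it was created with; regeneration
# dispatches to the matching era-specific implementation.
def regen_v7(value: float) -> float:
    return float(int(value))      # pretend Release 7 truncated toward zero

def regen_v20(value: float) -> float:
    return float(round(value))    # pretend a later release rounds instead

REGENERATORS = {7: regen_v7, 20: regen_v20}

def regenerate(feature: dict) -> float:
    return REGENERATORS[feature["schema_version"]](feature["value"])

print(regenerate({"schema_version": 7, "value": 2.9}))   # 2.0, the 1991 behaviour
print(regenerate({"schema_version": 20, "value": 2.9}))  # 3.0, the modern behaviour
```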
I can't help but think bad thoughts whenever I see another "static site maker" posted on here, or a brand new way of using JavaScript to render a web page.
Talk about taking the simplest and most durable of (web) formats and creating a hellscape of tangled complexity which becomes less and less likely to be maintainable or easy to archive the more layers of hipster js faddishness you add...
One of the claimed benefits of the JVM (and obviously later VMs) was that it would solve this issue: Java programs written in 2000 should still be able to run in 2100. And as far as I know the JVM has continued to fulfill this promise.
An honest question: If you are writing a program that you want to survive for 100+ years, shouldn't you specifically target a well-maintained and well-documented VM that has backward compatibility as a top priority? What other options are there?
People routinely boot DOS in e.g. qemu. The x86 ISA is 45 years old, older if you consider the 8008/8080 part of the lineage. It's not pretty, but it's probably the most widespread backwards compatible system out there.
S/360 assembly programs probably would still run on a modern IBM mainframe.
Punched cards kept in an inert atmosphere probably would last for centuries, and along with printed documentation in archival-quality paper would allow future generations to come up with card readers and an emulator to actually run the program.
While I love the JVM, and I also think it's one of the better runtimes in terms of backwards compatibility, there have been breakages. Most of the ones I've dealt with were easy to fix. But the ease of fixing is related to the access to source code. When something in a data stream is broken, be it an MP3 or a JPEG, I guess you almost inherently need special tooling to fix it (realistically). I imagine that with an SVG it'd be easier to hand-fix it.
> An honest question: If you are writing a program that you want to survive for 100+ years, shouldn't you specifically target a well-maintained and well-documented VM that has backward compatibility as a top priority? What other options are there?
I'd be tempted to target a well-known physical machine - build a bootable image of some sort as a unikernel - although in the age of VMWare etc. there's not a huge difference.
IMO the "right" way to do this would be to endow an institution to keep the program running, including keeping it updated to the "live" version of the language it's writen in, or even porting it between languages as and when that becomes necessary.
But he seems to have written this before virtual machines became widespread.
I think the concern is becoming increasingly irrelevant now, because if I really need to access a file I created in Word 4.0 for the Mac back in 1990, it's not too hard to fire up System 6 with that version of Word and read my file. In fact it's much easier now than it was in 2005 when he was writing. Sure it might take half an hour to get it all working, but that's really not too bad.
Most of this is probably technically illegal and will sometimes even have to rely on cracked versions, but nobody cares, and all the OSes and programs are still around and easy to find on the internet.
Not to mention that while file formats changed all the time early on, these days they're remarkably long-lived -- used for decades, not years.
Outdated hardware used to be more of a concern (as the original post illustrates), but so much of the important stuff we create today is in the cloud. It's ultimately being saved in redundant copies on something like S3 or Dropbox or Drive or similar, which are kept up to date. As older hardware dies, the bits are moved to newer hardware without the user even knowing.
So the problem Kurzweil talked about has basically become less of an issue as time has marched on, not more. Which is kind of nice!
>I think the concern is becoming increasingly irrelevant now, because if I really need to access a file I created in Word 4.0 for the Mac back in 1990, it's not too hard to fire up System 6 with that version of Word and read my file. In fact it's much easier now than it was in 2005 when he was writing. Sure it might take half an hour to get it all working, but that's really not too bad.
> I think the concern is becoming increasingly irrelevant
I fear we may already be past the peak on that point. With the "cloudification" where more and more software is run on servers one doesn't control, there is no way to run that software in a VM, as you don't have access to the software anymore. And even getting at the pure data for a custom backup becomes harder and harder.
I'm certain that 100 years from now, when the collapse really gets rolling, we'll still have cuneiform clay tablets complaining about Ea-nasir's shitty copper, but most of the digital information and culture we've created and tried to archive will be lost forever. Eventually, we're going to lose the infrastructure and knowledge base we need to keep updating everything; people will be too busy just trying to find food and fighting off mutants from the badlands to care.
Well, almost all early tablets are destroyed or otherwise lost now. Do you think we will lose virtually all digital age information within a century? Maybe from a massive CME, I suppose.
Clay tablets were usually used for temporary records, as you could erase them simply by smearing the clay a little bit (a lot easier than writing on papyrus). The tablets we have exist because something caused the clay to be baked into ceramic, generally some sort of catastrophic fire that accidentally preserved the records for much longer.
I know. My first iPad just stopped powering up. WTF!
I should etch something into its glass and bury it in my back yard. Perhaps a shopping list, or a complaint about how my neighbor inexplicably gets into his truck six or eight times a day and just sits there with it running.
I can see it happening. Not as a single catastrophic event but, like Rome falling bit by bit, our technological civilization fails and degenerates as climate change (in the worst possible scenario) wreaks havoc on everything.
I was able to back up and restore an old COBOL system via cpio, between modern GNU cpio (man page last updated June 2018) and SCO's cpio (c. 1989). This is neither to affirm nor contradict Kurzweil, but rather to praise the GNU userland for its solid legacy support.
This is very, very true. I have archived a number of books and magazines that were scanned and converted into "simplified" PDF, and archived on DVD disks along with C source code.
There are external dependencies but one hopes that the descriptions are sufficient to figure out how to make those work.
Actually I'd argue it's wrong precisely because we do manage to retrieve even such old artifacts. The only problem is that nobody cared for 30 years, so the process was harder than it should have been, but in the end it was possible.
Sure, there is a risk that at some point, for example, every PNG or H.264 decoder gets lost, and re-creating a decoder would then be significantly more complicated, but the chances of that are pretty slim; looking at `ffmpeg -codecs`, I'm not really worried about that ever happening.