Software Heritage (softwareheritage.org)
92 points by ingve on June 30, 2016 | 12 comments


If only the software itself weren't so fragile.

Because of the recent hype around conversational UIs, I was curious to start up an example of one from half a century ago, hook it up to Slack, and give people some historical context. But it was written for an interpreter I couldn't find a copy of, for a machine I couldn't find a working emulator for. (BTW, if anyone has a working way to emulate a PDP-10 on 64-bit Linux, and an image for that machine providing MacLisp, please let me know - <username>@gmail.com.)

Maybe that will never happen to today's languages, operating systems, or architectures, but I wonder whether a curious engineer in 2065 would be able to get something working, even if a solution to his curiosity existed in this archive (within some economic limit on his effort).

Code is too rarely written in a way that does much other than happen to work, and almost never in a fashion that makes it likely future code archeologists will be able to reapply its core learnings (algorithms, patterns).


One emulator is available here: http://klh10.trailing-edge.com/ A tutorial on getting TOPS-20 running under SIMH is here: http://gunkies.org/wiki/Running_TOPS-20_V4.1_under_SIMH (or ITS: http://gunkies.org/wiki/ITS)
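(For reference, a SIMH session typically looks something like this rough sketch; the disk image filename here is made up, and the right device names depend on which tutorial and image you follow:)

    sim> attach rp0 tops20.dsk     ; attach a disk image to an RP drive
    sim> boot rp0                  ; boot from the attached drive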

If those don't provide MacLisp (I think the TOPS-20 image should), then you might be able to find it (or a link that sets you on the right track) here: http://www.softwarepreservation.org/projects/LISP/maclisp_fa...

A manual for MacLISP is here, by the way: http://www.maclisp.info/pitmanual/meta-index.html Some assorted software for the PDP-10 is here (including TECO!): http://pdp-10.trailing-edge.com/

I've posted this as a comment rather than via email because too often I've been looking for solutions to a problem, only to discover that a solution had been found and emailed to someone else privately, denying me and others access to it!


Thank you! I somehow dead-ended before finding these.


Yeah definitely -- I think the more ambitious project is not just to save the source code, or even search it, but to save a complete build AND runtime environment for the software.

Of course, it's underspecified turtles all the way down, but I bet you can bottom out with QEMU, which I believe can be run in a mode where it's mostly portable ANSI C (i.e., the slow mode without hardware acceleration). QEMU seems to be the most portable emulator, AFAICT.
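For instance (a rough sketch; the disk image name is made up), simply leaving KVM off keeps QEMU in its portable software-emulation mode:

    # No -enable-kvm / -accel kvm, so QEMU falls back to TCG, its
    # portable software emulation: slow, but it runs wherever QEMU
    # itself compiles.
    qemu-system-x86_64 -m 512 -hda guest-disk.img

    # Or ask for software emulation explicitly:
    qemu-system-x86_64 -machine accel=tcg -m 512 -hda guest-disk.img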


The name is somewhat unfortunate, as I assumed it meant historical software (like archive.org's old BBS collections and such), but the roadmap looks very promising.

Specifically, they're planning search across their entire collection. I just watched a talk by Yegge about Google's internal code search (for their gazillion lines of code), and it struck me that if we had something like that at web scale, it could potentially be a force multiplier...maybe not quite comparable to the Internet itself or to the Open Source community (both of which are likely 10x or 100x or even bigger multipliers to productivity), but certainly a productivity booster.


Google used to have a public Code Search.


Yes, and they've open sourced part of that here (at least, I think it is related, and is also called "Code Search"): https://github.com/google/codesearch
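If I remember the README right, it ships cindex (builds a trigram index) and csearch (runs regexp queries against it); usage looks roughly like this (the paths and patterns here are made up):

    # Fetch and build the tools (assumes a Go toolchain)
    go get github.com/google/codesearch/cmd/...

    # Index a source tree once...
    cindex ~/src

    # ...then run fast regexp searches against the index.
    # -f restricts matches to file names matching a regexp.
    csearch -f '\.c$' 'pread|pwrite'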

But, their public code search was not the same as the internal thing, as far as I know. The internal beast sounds kind of awesome. Here's his talk about it from a few years ago: https://www.youtube.com/watch?v=KTJs-0EInW8

I think the really useful thing (beyond a mere search engine that has indexed a lot of source code) is comprehension of the "guzzintas" and "guzzouttas" of the code (its inputs and outputs), rather than merely a naive word search; i.e., knowing how pieces interact, so you can search for functions that work with a particular kind of object, or for objects that have the same properties, or for code that interacts with the same API you're working with, whatever.



Software Heritage is a good idea, and I'm heartened to see that they are collecting provenance information and restricting their collection to open source.

My enthusiasm for code search outside an environment like Google's is tempered, even for an archive that is exclusively open source (unlike GitHub), because it facilitates some bad behavior in the wild, and because its usefulness over simply having source available is questionable.

The tendency of users to copy-paste without regard to license leads to a lot of unethical and legally hazardous behavior. I need a fizzbuzz? Great, here it is on my screen; why not just paste it in without checking the license or even adding attribution? If I can see it, it must be open source, and if it's open source I can do whatever I want. That is the behavioral problem with finding code in this bottom-up way, instead of starting from relevant projects and working down through properties like licensing to the code itself.

On the other hand, the productivity bonus of wholesale copy-paste is easy to overstate, since it so often comes without any understanding of what is being pasted or what it will do in the context it is being pasted into, and it results in a great deal of duplicated effort and in failures to patch serious bugs. (That's aside from the whole argument over whether it is somehow okay for a project to have 150 versions of the same routine.) Again, there is a strong advantage to finding code in a top-down way, based on an understanding of its function.

Most of the perceived benefit of code search comes simply from having readable source code available for relevant projects. And those projects are not really discoverable from code search (how would I know what someone else named the variable of interest?).

Most of what remains is copy-paste. Although copy-paste might be a LOC multiplier in typical corporate environments with poor review and bad incentives, that is not a good thing for quality.

There seem to be few unexploited technological productivity "multipliers" which are not within, say, the range of 0 to 1.5. They keep being promised, but they either don't materialize in general practice, or they do not have the promised benefits in general practice.


Presumably the only things that would show up would be things with a known open source license. It'd be trivial to query based on compatible licenses (e.g., Apache-licensed code is compatible with almost everything, GPL code needs to stay in GPL projects, BSD may need attribution, etc.). GitHub strongly encourages a license for new repositories, and hopefully most folks are including one if they want their code to be re-used.

I think you and I are picturing things differently. I'm imagining the entirety of the world's code in whatever language I'm working in as my "standard library". It's not "copy-paste coding" to use the standard library; it's using the tools that are available to get the job done effectively and quickly. The library is just much bigger.

There needs to be a lot of additional metadata to make this workable, which is why I said merely searching code isn't the full picture. We need to know how popular a function or library is (so we can choose the most "standard" one, unless we have compelling reasons to choose a less-used one), whether it's compatible with our versions of stuff, what its interfaces are, whether it's got good test coverage, what types or kinds of objects it works on or with, whatever. And we need the machine to be able to weed out inappropriate options for us and bubble the possible good matches up to the top.

This is stuff we already do every day in an ad hoc fashion. I recently wrote a few hundred lines of Ruby code for the first time in a decade; I had no idea coming into it which testing framework to use, which option parser to use, how to interact with the JSON encoder (it has some weird quirks for advanced use cases, and I needed to do some searching and reading), etc. I figured it all out, but it took a lot of reading of code that wasn't really related to what I was doing, just to rule it out as irrelevant. And I still don't know what popularity looks like for RSpec vs. test-unit, for example. There's a wide variety of metrics you can look at, but the number of projects that actually use something would be cool to know.

Another example: I'm currently updating old Perl code to be more modern. It took quite a bit of searching and reading to figure out what to do with bareword filehandles (because it's different for STDOUT/STDIN than for normal filehandles, which can have lexical scope, for example). If I'd been able to easily search for examples of code that used those idioms in those contexts, it would have been nicer.
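For what it's worth, the modernization I settled on looks roughly like this (a sketch; the filename is made up):

    # Old style: bareword filehandle; FH is a package global.
    open(FH, '<', 'data.txt') or die "can't open data.txt: $!";
    close(FH);

    # Modern style: lexical filehandle, scoped like any other variable.
    open(my $fh, '<', 'data.txt') or die "can't open data.txt: $!";
    while (my $line = <$fh>) {
        print $line;
    }
    close($fh) or die "close failed: $!";

    # STDIN/STDOUT/STDERR remain barewords; they really are global.
    print STDOUT "done\n";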

I guess if you only work in one language, and you know the ecosystem very, very, well, such a thing might become less useful over time. But, it'd be wonderful for learning, I think.

"There seem to be few unexploited technological productivity "multipliers" which are not within, say, the range of 0 to 1.5. They keep being promised, but they either don't materialize in general practice, or they do not have the promised benefits in general practice."

I'm not sure I agree with you. I've been writing software off and on for most of my life; rarely big projects, but rarely do I go more than a few months without building something of some sort. I was just thinking back on how much work it took when I started writing code (on a C64) vs. now. At every step along the way, I can look back and think: the amount of work (not just the lines of code, but the reading, the experimentation, etc.) would have been much higher 5, 10, 15 years ago, and the bar to entry on some projects was so high I wouldn't even have started them unless it was my full-time job. Now I can pick up and throw together a useful thing in a weekend that would have taken weeks or months without the APIs, the tooling, the resources like Stack Overflow and GitHub, and the massive libraries of re-usable code. I think the force multipliers are all around us... and maybe accelerating in how rapidly they're coming (though staying aware of them gets harder every day, because there's so much happening).

Yes, the Internet was a huge force multiplier all at once, probably the biggest single boost to all sorts of knowledge productivity; and the Open Source community was another huge one, possibly bigger than the Internet in our particular industry, though requiring the Internet as a prerequisite.

But the library ecosystem for damned near every language is vast today. Perl's CPAN was the first truly huge plug-and-play software ecosystem I ever saw, but the CPAN of a decade or two ago is a blip on the radar compared to something like npm today (or any other major language ecosystem). Even CPAN itself has grown remarkably with the times and would dwarf the CPAN of yesteryear (though it, too, is dwarfed by the likes of npm). I think the missing link is often discoverability.

What I'm trying to say is: there's a tremendous body of prior art out there, and we keep reinventing the wheel because we don't know about it.


I was hoping it would be for older software (say, from before GitHub or even SourceForge) that's currently scattered across the Internet. I've been trying to locate the source code for the MC6839 [1] and have yet to come across it. I've found the binary, but sadly, no source.

[1] A floating-point package (IEEE 754) for the 6809, in an 8K, position-independent ROM. Apparently Motorola wrote it and made datasheets available, but never actually sold any. Then, in the very late 80s, they (from what I understand) released the code into the public domain.


It seems they're somewhat new and it's natural for them to start with the low-hanging fruit.



