Does anyone here work at Archive.org? Can you speak to how well-funded the organization is and what sort of measures are in place to keep it afloat? I think it's a fantastic service and I donate, but I worry that it could vanish the next day if funding suddenly dries up. I feel like a large corner of the Internet collectively takes the site for granted and doesn't bother doing its own in-house archiving because "The Archive will just suck it up for us."
I can't speak formally for the Internet Archive, but the existing content and services are not going to disappear overnight: funding comes from several sources, thought has been put into organizational structure, and things have been designed to keep core access and preservation infrastructure running with minimal cost and effort (e.g., if the economy tanks).
Getting the content coverage people sometimes assume we already have is another matter. Additional funding (thanks for your donation!) goes towards additional crawling and keeping up with the endless treadmill of media types and protocols. E.g., headless browser crawling development and deployment to capture JavaScript-heavy sites (https://github.com/internetarchive/brozzler); this is much more expensive than "classic" crawling.
For more on increasing storage costs and the under-funded state of web archiving in general, I recommend David Rosenthal's blog, eg:
Far more effective and robust than hoping the archive will "suck it up for us" is to upload snapshots/dumps/exports yourself! Anybody can create an archive.org account and upload content (I recommend https://github.com/jjjake/internetarchive over the HTML form), within reasonable limits. Obviously, care needs to be taken to remove sensitive (and personal) information first.
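For anyone who wants to try this, here is a minimal sketch using the internetarchive Python library linked above. It assumes the library is installed and credentials have been set up with `ia configure`; the item identifier, file name, and the helper names (`build_metadata`, `upload_dump`) are made up for illustration. By default it does a dry run that only reports what would be sent.

```python
from pathlib import Path


def build_metadata(title, description, mediatype="data"):
    """Build a metadata dict for a new archive.org item (values are illustrative)."""
    return {"title": title, "description": description, "mediatype": mediatype}


def upload_dump(identifier, paths, metadata, dry_run=True):
    """Upload files to an archive.org item, or just describe the upload if dry_run."""
    files = [str(Path(p)) for p in paths]
    if dry_run:
        # Nothing is sent; return the plan so it can be reviewed first.
        return {"identifier": identifier, "files": files, "metadata": metadata}
    # Imported lazily so the dry run works without the library installed.
    from internetarchive import upload
    return upload(identifier, files=files, metadata=metadata)


if __name__ == "__main__":
    md = build_metadata("Example forum export", "Scrubbed export of a public forum")
    # Hypothetical identifier and file name; review the plan before a real upload.
    print(upload_dump("example-forum-export-2018", ["forum-dump.tar.gz"], md))
```

Passing `dry_run=False` would perform the actual upload, so it is worth double-checking that the dump has been scrubbed of personal information before flipping that switch.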
> I don't work for them but I think the move towards decentralization is the best way to address those concerns.
How feasible would a reliable, distributed archive be, given the massive amount of data Archive.org holds? After all, it was created precisely because the already-decentralized web was too ephemeral and unstable. I don't think decentralization is a panacea in this case.
Yeah, spot on. Governance and durability through org structure is a thing.
The way to solve this is to provide the Internet Archive with enough resources to build out a globally distributed storage system. Could you hack something together using their torrent tracker for every item served? Yes. But you don't hack together something made to preserve digital human culture in perpetuity.
> Yeah, spot on. Governance and durability through org structure is a thing.
> The way to solve this is to provide the Internet Archive with enough resources to build out a globally distributed storage system.
Yeah, I agree. I do see a space for other appropriate institutions (such as the Library of Congress, British Library, etc.) to pool resources and facilities with the Internet Archive to achieve that goal.
Ultimately, it'd be awesome to see each national library run a semi-autonomous IA copy that synchronizes with all the others, but can continue to operate independently (scrapers and all), if need be.
To be clear, this is not the state government of California, but a division of the University of California (the California Digital Library) working with the Internet Archive and Code for Science & Society.
> To be clear, this is not the state government of California, but a division of the University of California
The University of California is a part of the government of the State of California established in the State Constitution, whose governing body is comprised of 18 members appointed by the Governor and confirmed by the Senate, plus seven ex-officio members, three of whom are State elected Constitutional officers (Governor, Lt. Governor, and Superintendent of Public Instruction) and one of whom is the Speaker of the Assembly.
(That said, it is unusual and potentially misleading to refer to UC as “California”, but not because UC is actually separate from the government of the State.)
I hope archive.org doesn't fall victim to GDPR and the upcoming EU "copyright" reform (e.g., by having to stop serving the EU). I'm not a fan of the vague "right to be forgotten" concept as it applies to individuals, and think history rewriting is a much more serious issue going forward.
Though I've heard credible complaints from "copyright" holders against archive.org.
Yes. The popular sentiment here is "GDPR good, EU copyright reform bad". And that's understandable.
But both data privacy and copyright try to create ownership of information, and must do so through intrusive legal measures because the physical nature of information works against it.
The project aims to demonstrate how members of a cooperative, decentralized network can leverage shared services to ensure data preservation while reducing storage costs and increasing replication counts.
Again, that's simply not true. That's not how copyrights work. Take Project Gutenberg, for example: Project Gutenberg hosts public domain works. A copy of such a work is still in the public domain. What's not in the public domain is the fact that PG hosts the public domain work.
For example, you wouldn't be able to use Archive.org's trademarks to advertise a public domain work. But Archive.org does not gain a copyright over a copy of a public domain work.
So Getty Images has a whole lot of photos of the paintings in the Sistine Chapel. I bet those photos are copyright the photographer. ‘Yeah, but that’s because it’s Getty, that’s allowed.’
Taking a photograph is the creation of a new copyrighted work under some now-very-old United States law. The key idea is "transformation": it's understood that photography includes an editor's eye for what is worth taking a photograph of.
Making a copy of a public domain book does not create an independent claim of copyright to the contents of the book. You can make a case that maybe it should, but it legally does not; that's just not how it works.
When a DMCA takedown notice is filed, the item will go dark. It will still be archived, but inaccessible. If your Internet Archive uploader account has frequent DMCA requests against it, it's disabled.
Items don't lose their copyright status by being uploaded or stored in the Internet Archive.
It will be inaccessible until it becomes public domain, but the copy that you download from archive.org will be copyright archive.org. You can get the music from somewhere else if you can, but you most likely won’t be able to.
If I take a photo of an original work of Shakespeare, then I own the copyright on the photo. And I could license the usage of my photo. If I make a copy of anything, then I own the copyright on the copy. What if, in time, the only copy you could get hold of was my copy? My copy would be copyright me and you’d need to wait for it to become public domain before you could use it under your terms.
> If I take a photo of an original work of Shakespeare, then I own the copyright on the photo. And I could license the usage of my photo.
That's 100% not true. Please stop spreading misinformation because you have no idea what you are talking about.
"According to a landmark 1999 federal district court ruling, The Bridgeman Art Library, Ltd., Plaintiff v. Corel Corporation, 'exact reproductions of public domain artworks are not protected by copyright.'"
You're just flat out wrong, dude. You've got some very bizarre beliefs about copyright law and I can't fathom where you got them. Whoever you learned from did you a great disservice, or if you 'taught' yourself then you didn't do a very good job at reading.
I believe that in the context of copyright, 'exact' doesn't mean a 100% perfect copy. If I make a film, I hold the copyright for it whether it's distributed on VHS, DVD, or online streaming, even if the quality on VHS is different than the film shown in theaters. Otherwise it'd be as if changing an image from JPEG to TIFF changed its copyright status. Your example elsewhere of correcting a few typos and saying it's a different work is similar: it's still the same work, even with the minor tweaks.
Anyways, copyright is weird and complicated and there are many people who don't really understand it. But your theory that archive.org is going to get a copyright on everything they scan and store is not rooted in any sort of legal reality.
> If I take a photo of an original work of Shakespeare, then I own the copyright on the photo.
If the photo meets the requirement for independent copyrightability, which requires being a distinct creative work, that's true, and for a photo that would often but not always be the case.
For a simple non-transformative digital copy, there's no copyrightable new work and no copyright in such a work.
> What if, in time, the only copy you could get hold of was my copy? My copy would be copyright me and you’d need to wait for it to become public domain before you could use it under your terms.
If someone was copying elements that were original in your work of the photograph, true, but if they were merely copying the public domain text of which you had taken a photograph, no, not true at all. Copyright protects your original work.
> It will be inaccessible until it becomes public domain, but the copy that you download from archive.org will be copyright archive.org. You can get the music from somewhere else if you can, but you most likely won’t be able to.
No.
> For example, if a book published in 1995 is a reprint of a book published in 1900, then it is eligible. However, the onus is on us to prove that it is a reprint, and if it doesn't say on the TP&V that it is a reprint, confirming its eligibility may be impractical.
What if the 1995 copy had a few deliberate errors thrown in, like spelling mistakes and reworded sentences, and your new copy reproduced those errors verbatim? Then I could say that you copied the 1995 version, which is still under copyright by the publisher, and not the 1900 version, which we can all agree is public domain.
I'm not sure how worried I am about archive.org, specifically, in this regard. But the concern does reflect a trend in copyright, promulgated by those with significant vested interests (IP value they're seeking to maintain or grow) and so-called "maximalists".
There's a growing push to legitimize copyright claims for "instances" of a work, even after the base work has entered the public domain.
If this "sounds ridiculous" to you, just recall how most of us are increasingly worried that, in the U.S., Congress and the Executive are going to... "re-Mickey" the copyright term. As in, they already pushed it to life plus 70 years when certain Disney copyrights were about to expire. (And keep using "trade agreements" as one mechanism to try to "back door" increases to "plus 70" into other countries' IP terms.)
A separate concern I have is that archive.org currently continues to "retroactively respect" robots.txt changes.
404 your once public content, and archive.org "disappears" it from their corresponding records/copies.
As long as that's true, you can't really view it as a permanent, unbiased archive.
As politicians, commercial interests, and their lawyers continue to have a field day constraining "online rights" (and IP rights, etc.), currently the only "guarantee" the public has of continued access and a more complete historical record is, ironically, local copies.
They say, "History is written by the winners."
Well, unless they can't find the copy you have squirreled away.