1.3B Worldcat scrape and data science mini-competition (annas-blog.org)
248 points by crtasm on Oct 4, 2023 | 88 comments



I looked into using Dewey Decimal for a hobby project. OCLC has a de facto monopoly on it due to the Worldcat database. They're a non-profit, but they're supported by having libraries pay a subscription fee for Worldcat.

Back when OCLC was founded, the idea that people would want to have a copy of a card catalog for personal use was laughable, so I'm sympathetic to the people that set up their funding model. It's far cheaper for a library to subscribe to Worldcat than to hire a team to maintain such a database, so it created a win-win situation.

However, keeping the world's books' metadata a secret (and leaving control in the hands of a monopoly) is an anachronism.

It's well past the time when someone (such as an international coalition of Libraries of Congress) should figure out how to sustainably fund OCLC while also releasing their work into the public domain.


I have been told that my organization developed a system, HABS, that pre-dated OCLC [0], and that OCLC used this system as an inspiration. However, I cannot confirm this. The closest I can get is a footnote that thanks Fred Kilgour, the founder of OCLC [1]. I should reach out to Koh, a friend of a friend, while she is still alive to confirm the story. Nevertheless, we have a collection of punch cards in a dusty attic room that was once the HABS system. I think it is a pretty fascinating legacy, and I wish it were better preserved.

[0] https://journals.sagepub.com/doi/pdf/10.1177/106939716900400... [1] https://journals.sagepub.com/doi/abs/10.1177/106939717300800...


Neat! I hope you can learn more from Koh.

I know a bit about Henriette Avram and her work at LoC developing MARC, but it of course makes sense that other libraries were thinking along similar lines at the time.


I suppose one way to do it would be to allow patrons of subscriber libraries to access the database dumps and API.

The downside is that this would still make access harder than necessary and leave some people out. The upside is that it's not that much of a change from their existing model. I'm sure there would also be concerns about database dumps being shared publicly, although Anna's Archive has already released their entire database, and I suspect most people who would pay for formal access wouldn't use an unauthorized copy anyway. Ultimately, I suspect OCLC would still be resistant to this change, as it would feel like a huge shift, even if I'm not sure it would change much from their perspective.


The Melvil Decimal System popularized by https://www.librarything.com may be of interest.

Here's an explanation from their footer: "Although Dewey invented his system in 1876, recent editions of his system are in copyright. LibraryThing's Melvil Decimal System is based on the classification work of libraries around the world, whose assignments are not copyrightable. The "schedules" (the words that describe the numbers) come from a pre-copyright edition of his system, John Mark Ockerbloom's Free Decimal System, and member contributions."


One strange thing about the Dewey Decimal System is that it's copyrighted and libraries pay a fee to use it.

This came to light when the Library Hotel in NYC used Dewey notation for its room numbers and was sued. Everyone was like, WTF?


OCLC is a nonprofit membership cooperative and would argue that it itself is that international coalition of national libraries and archives.


OCLC is a parasitic company masquerading as a "membership cooperative".

Libraries (often publicly funded) produce all the work, OCLC claims ownership of the results of that work, and libraries pay to get it back (though they do get a discount if they contribute).

The only reason OCLC continues to exist is because libraries don't have the support or resources to fight them. It's very similar to the Elsevier issue in academic publishing, but OCLC does a better job with PR.


I mean you’ve clearly read Aaron Swartz’s diatribes, but you also clearly have no clue about OCLC’s business model. The catalog data is intellectually interesting, but the value is in the holdings data and more importantly the interlibrary loan service it enables.

OCLC is exactly what happens when the libraries want to avoid another EBSCOhost or Proquest situation with ILL.


What were the EBSCOhost / ProQuest situations, if you don't mind?


ISBN is the default ID when it comes to book-related projects. It's convenient, but not without caveats. The often overlooked fact is that the ISBN was only introduced in the late 1960s, so books published before then obviously don't have one; not all countries adopted ISBN from day one (China, for example, stayed on its own catalog systems until the 1980s); and because ISBNs are usually centrally managed by government or commercial agencies, censorship for political or commercial reasons is not uncommon: some books could not get published at all, or only saw the world without an ISBN.

For obvious reasons, older / non-English / suppressed books may be the ones that need the most care when it comes to preservation.


A second issue is that ISBNs identify a specific SKU (different formats will have different ISBNs, different printings may even get different ISBNs, etc.), but book-related projects typically want some way to identify "the same book" across all these different formats, printings, and sometimes even editions, translations, and collections. OCLC IDs identify a different space than ISBNs do.


The biggest problem of all is that there are many ISBNs that have been reused for either a later edition of "the same" book or for a totally different book, which should never happen.

Sometimes it is because people are sloppy, sometimes people try to save a little money (because ISBNs cost money).


What you're describing is sometimes referred to as "work" vs. "edition":

https://openlibrary.org/help/faq/editing#work-edition


It gets much, much more complicated than that. There are never-ending discussions about FRBR (Functional Requirements for Bibliographic Records): https://www.ifla.org/references/best-practice-for-national-b... A rough sketch of the hierarchy follows the definitions below.

* Work is defined as the intellectual or artistic content of a distinct creation. It refers to a very abstract idea of a creation e.g. Shakespeare’s Romeo and Juliet and not a specific expression.

* Expression is the intellectual or artistic realization of a work. The realization may take the form of text, sound, image, object, movement, etc., or any combination of such forms.

* Manifestation is the embodiment of an expression of a work. For example a particular edition of a book or a specific music recording.

* Item is a single exemplar of a manifestation. Cataloguing is generally done based on an item directly available to a cataloguer.
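
For anyone who thinks better in code, here's a minimal sketch of that Work -> Expression -> Manifestation -> Item hierarchy in Python; the class names, field names, and sample values are purely illustrative, not any standard schema.

  # Minimal sketch of the FRBR hierarchy: Work -> Expression -> Manifestation -> Item.
  # Class names, field names, and sample values are illustrative only, not a standard schema.
  from dataclasses import dataclass, field
  from typing import List, Optional

  @dataclass
  class Item:                      # a single exemplar of a manifestation
      barcode: str
      location: str

  @dataclass
  class Manifestation:             # a particular edition, printing, or recording
      isbn: Optional[str]
      publisher: str
      year: int
      items: List[Item] = field(default_factory=list)

  @dataclass
  class Expression:                # a realization of the work: text, sound, image, ...
      form: str
      language: str
      manifestations: List[Manifestation] = field(default_factory=list)

  @dataclass
  class Work:                      # the abstract creation itself
      title: str
      creator: str
      expressions: List[Expression] = field(default_factory=list)

  # One Work -> one English text Expression -> one paperback Manifestation -> one shelf copy.
  romeo = Work("Romeo and Juliet", "Shakespeare, William", [
      Expression("text", "en", [
          Manifestation("9780000000000", "Example Press", 2004, [
              Item("39999000123456", "Main stacks")])])])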


Along with other limitations, ISBNs are, well, book numbers. They're specific to books and exclude many other forms of published material.

OCLC spans books, articles, audio recordings, videos, and other catalogued artefacts and documents.


How does Anna's Archive keep all their lawyers from quitting?

> Even though OCLC is a non-profit, their business model requires protecting their database. Well, we’re sorry to say, friends at OCLC, we’re giving it all away. :-) [...]

> This included a substantial overhaul of their backend systems, introducing many security flaws. We immediately seized the opportunity, and were able scrape hundreds of millions (!) of records in mere days. [...]

> PS: We do want to give a genuine shout-out to the Worldcat team. Even though it was a small tragedy that your data was locked up, you did an amazing job at getting 30,000 libraries on board to share their metadata with you. As with many of our releases, we could not have done it without the decades of hard work you put into building the collections that we now liberate. Truly: thank you.


For real. Openly bragging about exploiting security flaws to scrape data en masse, which undoubtedly put massive strain on back-end systems, is a far cry from what is generally considered legal (politely scraping public information).


I think this is the least of their concerns considering the rest of their activities. I guess they've got a sort of pirate's privilege in that they can openly brag about this stuff since they're already starting from the point of openly flaunting the law.

Also, I wouldn't be surprised if there simply are no lawyers working at, for or with Anna's Archive.


They could possibly get important ongoing support from institutions like the EFF and Berkman.

But I think that's less likely if they're gleefully bragging about civil and criminal liabilities.

And also chilling future data-sharing willingness for other orgs (because someone is coming along and ignoring assurances).


Scraping and giving away the content is a better look than scraping and selling it as well.

This is being done for public benefit, not private profit.


I'm not commenting on the morality, only the legality.


The legality is itself a function of commercial and political power amongst publishers.

In front of a jury of peers, the moral arguments might well be persuasive.


flaunting -> flouting


Nice catch, thanks. I suppose they're flaunting their flouting of the law ;)


You realize they're not a business, right?


Yes, why do you ask?


What lawyers?


From the end:

> We do want to give a genuine shout-out to the Worldcat team. Even though it was a small tragedy that your data was locked up, you did an amazing job at getting 30,000 libraries on board to share their metadata with you.

I wonder what the story is behind Worldcat getting so many libraries across the world on board? I don't know much about the software but it must be pretty compelling.


Disclaimer, I was a Linux admin at OCLC for a few years. The WorldCat database has been around since the early 70s, so I think that helps the numbers a bit. I don't have any insight into their marketing/sales/end user experience though.


It's not the software per se, which is generally fit for purpose but not amazing, but the traditions and economics underpinning how libraries maintain their bibliographic metadata.

Libraries sharing metadata for their catalogs has a long history, dating back to at least 1902 when the Library of Congress started selling catalog cards for use by other libraries. In the 1960s, the Library of Congress embarked on various projects to computerize their catalog, leading to the creation of the MARC format as a common metadata format for exchanging bibliographic records. (And there is a straight line between how card catalogs were put together and much of how library metadata is conceptualized, although that's been (slowly) changing.)

One problem is that bibliographic metadata from the Library of Congress is mostly generated in-house, and LoC does not catalog everything; not even close. In the late 1960s, OCLC, the organization behind Worldcat, was started to operate a union catalog. The idea is that libraries could download bibliographic records needed for their own catalogs ("copy cataloging") and contribute new records for the unique stuff they cataloged ("original cataloging"). Under the aegis of OCLC as a non-profit organization, it was a pretty good deal for libraries, and over time led to additional services such as brokering interlibrary loan requests. After all, since Worldcat had a good idea of the holdings of libraries in North America (and over time, a good chunk of Europe and other areas), it was straightforward to set up an exchange for ILL requests.

Tie this to a general trend over the past couple of decades of libraries cutting the funding and staffing for maintaining their local catalogs, and the need for sharing in the creation and maintenance of library metadata has only gotten more important.

However, OCLC has had a long history of trying to control access and use of the metadata in WorldCat, to the point of earning a general perception in many library quarters of trying to monopolize it. To give a taste, Aaron Swartz tangled with them back in the day. [1] One irony, among many, is that the majority of metadata in Worldcat has its origins in the efforts by publicly-funded libraries and as such shouldn't have been enclosed in the first place. OCLC also has a focus on growing itself, to the point where it does far more than run Worldcat. Its various ventures have earned itself a reputation for charging high prices to libraries, to the point where it can be too expensive for smaller libraries to participate in Worldcat. (Fortunately for them, there are various alternative ways of getting MARC records for free or very cheap, but nobody has a database more comprehensive than Worldcat.)

That said, OCLC does do quite a bit itself to improve the overall quality of Worldcat and to try to push libraries past the 1960s-era MARC format. But one of the ironies of the scraping is that it's not going to be immediately helpful to the libraries who are unable to afford to participate in Worldcat. This is because the scrape didn't (and quite possibly never could have) capture the data in MARC format, which is what most library catalog software uses. While MARC records could be cross-walked from the JSON, they will undoubtedly omit some data elements found in the original MARC.

[1] http://www.aaronsw.com/weblog/oclcreply
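
To make the cross-walking point a bit more concrete, here's a rough sketch of turning a scraped JSON record into a MARC record with the pymarc library. The JSON keys below are hypothetical (a real record from the scrape will look different and carry far more fields), and this assumes pymarc 5.x, where subfields are Subfield objects rather than the flat lists older versions used.

  # Rough sketch: crosswalk a scraped JSON record into binary MARC with pymarc (5.x assumed).
  # The JSON keys ("title", "author", "publisher", "year", "isbn") are hypothetical stand-ins.
  from pymarc import Record, Field, Subfield

  def json_to_marc(doc: dict) -> Record:
      rec = Record()
      if doc.get("isbn"):
          rec.add_field(Field(tag="020", indicators=[" ", " "],
                              subfields=[Subfield("a", doc["isbn"])]))
      if doc.get("author"):
          rec.add_field(Field(tag="100", indicators=["1", " "],
                              subfields=[Subfield("a", doc["author"])]))
      rec.add_field(Field(tag="245", indicators=["0", "0"],
                          subfields=[Subfield("a", doc.get("title", ""))]))
      if doc.get("publisher") or doc.get("year"):
          rec.add_field(Field(tag="264", indicators=[" ", "1"],
                              subfields=[Subfield("b", doc.get("publisher", "")),
                                         Subfield("c", str(doc.get("year", "")))]))
      return rec

  record = json_to_marc({"title": "An example record", "author": "Doe, Jane",
                         "publisher": "Example Press", "year": 2020, "isbn": "9780000000000"})
  with open("out.mrc", "wb") as fh:
      fh.write(record.as_marc())   # binary MARC, loadable by most catalog software

Even then, anything that never made it into the scraped JSON in the first place (fixed fields, authority links, and so on) can't be recovered this way.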


If you liked the comment-length analysis of OCLC and want more, there's a whole essay on the subject. [1]

>But one of the ironies of the scraping is that it's not going to be immediately helpful to the libraries who are unable to afford to participate in Worldcat. This is because the scrape didn't (and quite possibly never could have) capture the data in MARC format, which is what most library catalog software uses. While MARC records could be cross-walked from the JSON, they will undoubtedly omit some data elements found in the original MARC.

While it would have been ideal to get all the data in MARC & as many other formats as possible, I wonder how true this is worldwide - many libraries don't use MARC or have a digital catalog at all. Maybe there are some ways the data could be processed that make it easier to integrate into such places, but of course local needs/desires will vary widely.

[1] https://core.ac.uk/download/pdf/11883899.pdf - it was also published in this book: https://archive.org/details/radicalcatalogin0000unse


> While it would have been ideal to get all the data in MARC & as many other formats as possible, I wonder how true this is worldwide - many libraries don't use MARC or have a digital catalog at all. Maybe there are some ways the data could be processed that make it easier to integrate into such places, but of course local needs/desires will vary widely.

Indeed, MARC is not universal (and for that matter, it wouldn't surprise me if at this point the majority of records in Worldcat were _not_ derived from MARC sources), and there are certainly non-MARC library catalog platforms out there. That said, as the growth of Koha shows, for better or worse MARC has become close to a global baseline for a lot of libraries.


Worse, definitely worse.


> (Fortunately for them, there are various alternative ways of getting MARC records for free or very cheap, but nobody has a database more comprehensive than Worldcat.)

what are some of these sources? isbndb and open library? proquest?


Many libraries [1], including the likes of the Library of Congress and the National Library of Australia, make their catalogs' MARC records freely available via a library-specific protocol called Z39.50. The Library of Congress makes their catalog metadata available in other ways [2][3] and the Internet Archive has a collection of MARC records as well [4]. There are also a couple commercial services that provide them, and publishers, particularly of digital collections used by libraries, will sometimes supply MARC records (though they tend to be low-quality).

[1] https://irspy.indexdata.com/ [2] https://www.loc.gov/cds/products/marcDist.php [3] https://id.loc.gov/ [4] https://archive.org/details/ol_data
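
And once you have a file of MARC records from one of those sources (say a bulk download), reading it in Python takes only a few lines with pymarc; a quick sketch, assuming a local file called records.mrc.

  # Quick sketch: iterate over a file of binary MARC records with pymarc.
  # "records.mrc" is an assumed local filename, e.g. from a bulk download.
  from pymarc import MARCReader

  with open("records.mrc", "rb") as fh:
      for record in MARCReader(fh):
          titles = record.get_fields("245")   # 245 = title statement
          isbns = record.get_fields("020")    # 020 = ISBN
          print(titles[0].value() if titles else "(no title)",
                "|", "; ".join(f.value() for f in isbns))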


they probably have a good interface for personal library tracking


It's not a LibraryThing or GoodReads; it's meant for libraries that are institutions. That said, I don't think there is anything stopping an individual person signing up and entering their collection, but there would be no point in paying the fees unless you had (say) a unique scholarly collection and wanted to lend books to other libraries - and if so, in the long run you'd likely be better off seeing if a library wanted to acquire your collection.


> We scraped ISBNdb, and downloaded the Open Library dataset, but the results were unsatisfactory. The main problem was that there was not a ton of overlap of ISBNs.

What prevents ISBN collisions between authors? Is there a central authority assigning them, or is there, say, a national prefix, with each government assigning ISBNs for local publications (perhaps delegating this to another body in that nation)?

Surely such bodies would have the most complete view on all this data.

It's also bizarre that this simple metadata is not available from whatever authority assigns ISBNs.


There are regional ISBN agencies. The US agency, Bowker, assigns ISBN prefixes by publisher, and publishers assign within their prefix as they please. They're supposed to use one ISBN per edition and format, but many publishers use ISBN as a kind of SKU so you can't 100% count on that.

If that sounds sloppy...I went to publishing conferences fairly regularly from the late 90's into the teens, and I never saw a program that didn't have at least one session or panel titled something like "Publishers must improve their metadata practices."


It's exactly like this. Publishers get a range, and do whatever they want with it.

Also, some agencies sell number ranges to publishers (I believe the US is in that case) and others give them away at no cost (France). As a result, some small publishers get ISBNs from agencies more liberal than their own country's: one can never be 100% sure a French ISBN matches a French publisher, for example.

It's also possible some publishers re-use old numbers or affix the same number to different releases/editions of a book.

It's a mess.

But a global centralized system would probably be way worse, so we have to live with that mess.


Why not use UUIDs?

People never enter ISBNs manually anyway, so it might as well be a longer string. Or a QR code.


As someone who's worked in the field of used books, I can say from personal experience that's not quite true. It's pretty common to have to type in an ISBN for a variety of reasons. Many times the barcodes have been covered up or defaced, and many publications don't have barcodes in the first place.


It was true, for sure. But the question is, is it still true?


Schools often ask for a specific edition of a classic book and the only way to be reasonably sure you're buying the correct one is to search by ISBN.


Well, just copy and paste from the email you got from school.

I mean, if IT can do anything, it is to solve this problem.


same


The point is not just to have universally unique identifiers, but to collect common metadata that's associated with the identifiers. Like this:

https://www.bowker.com/siteassets/files/pdf-files/datasubmis...

Since this is the second time I've mentioned Bowker, let me just say that I do not and have never worked for them or with them, although I did meet with their reps several times when I worked in a different part of digital publishing. It's just that they're inescapable when you're talking about ISBNs.


ISBNs contain a check digit, to prevent typos. So you’d have to invent a new format.
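
For reference, the check is a simple weighted sum; here's a small validator showing both schemes, which is also why a raw UUID couldn't just be dropped into the existing format.

  # ISBN-13: digits weighted alternately 1 and 3; the total must be divisible by 10.
  def isbn13_valid(isbn: str) -> bool:
      digits = [int(c) for c in isbn if c.isdigit()]
      if len(digits) != 13:
          return False
      return sum(d * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits)) % 10 == 0

  # ISBN-10: weights 10 down to 1; the total must be divisible by 11 ('X' stands for 10).
  def isbn10_valid(isbn: str) -> bool:
      chars = [c for c in isbn.upper() if c.isdigit() or c == "X"]
      if len(chars) != 10 or "X" in chars[:-1]:
          return False
      return sum((10 if c == "X" else int(c)) * (10 - i) for i, c in enumerate(chars)) % 11 == 0

  print(isbn13_valid("978-0-306-40615-7"))  # True (the textbook example ISBN)
  print(isbn10_valid("0-306-40615-2"))      # True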


Sounds a bit like how DOIs are assigned.


Or vice versa! Although DOI RAs are organized functionally rather than regionally. CrossRef -- for journal articles -- is by far the largest RA.

DOIs are far, far more centralized in that CrossRef issues many orders of magnitude more DOIs than all of the others.

The cite-and-be-cited-by use case for scholarly articles is way more compelling than any other use case devised for DOIs.


ISBNs are messy.

The International ISBN Agency coordinates assigning ISBN ranges to national agencies, who in turn will assign subranges to publishers. The publishers in turn assign specific numbers to their own works. However, the international agency does not itself maintain a universal database of assigned ISBNs - the most it operates is a global database of publishers and their assigned ranges. And since it's the publishers who are assigning numbers from their allocations, various errors can crop up, including reusing ISBNs for different works and failing to issue distinct ISBNs for different formats. (For example, if you publish hardcover, paperback, and ebook versions of a book, you should assign three ISBNs. That rule is not always observed.)

Also, libraries hold many books that long predate ISBNs; it wasn't until 1965 that the immediate predecessor of the ISBN, the SBN, was a twinkle in a bookseller's eye.


Yes.

And while in most countries you can't properly publish a book without an ISBN (ie, have it sold in bookshops), you can publish a Kindle book without it (if you opt to only offer the ebook).

That leaves a huge part of publications completely out of the system. Kindle-only books are on Amazon servers and nowhere else.


>while in most countries you can't properly publish a book without an ISBN (ie, have it sold in bookshops)

I'm quite skeptical of this, given the amount of books I've personally seen published in recent decades without ISBNs, along with the limited & haphazard attempts to regulate what it means to 'publish' something or even to be a 'proper' bookseller. But if you have some experience I don't with this, I'm interested in hearing about it.


bookstores don't want to carry a book without an isbn because no isbn means it's not available to purchase from their distributor and it's easiest for a store to order through real distribution channels (like ingram in the u.s.).

but most stores carry a small number of self-published books and sometimes those books have no isbn. those books are typically by local writers. but in my experience as a bookseller, self-published books are a pain to work with. some self-published books aren't returnable, but returns are an important part of the bookstore business since a lot of books don't sell. working with a lot of writers individually about ordering etc. is more involved than going through a single distributor, and it takes a lot of time for whoever has to do it.

> given the amount of books I've personally seen published in recent decades without ISBNs

i'd bet this was on amazon? iirc, you can't always return self-published amazon books. i think the author decides this, bc they get charged a processing fee for returns.

> or even to be a 'proper' bookseller

you can totally sell your collection online without any isbns and you'd be considered a bookseller. you'll just need an sku system. there's a difference between a used/collectible seller and a bookseller who carries new books as well as used/collectibles. the new books require proper distribution channels.


There are lots of agencies; they hand out blocks of numbers from their allocations. It seems there's no central database for the metadata:

https://www.isbn-international.org/content/isbn-users-manual... ISBN_FAQs_to_7ed_Manual_Absolutely_final.docx

> Will people in other countries be able to search for my books in search engines in those countries?

> This does not happen automatically ... In order for your book to be listed in other countries you should contact the respective ISBN Agency and ask them for details of how to be entered into their national catalogue for books in circulation (books in print). Sometimes you will have to obtain a distributor from that country or have an address in that country before this is possible. In some circumstances in order to be listed, the book must be in the language of that country. As well as catalogues of books in circulation, you may also want to ensure that you are listed by internet retailers... . Again, you will need to contact each of these organisations directly (including each separate international branch) with details of your book.


For US/UK/NZ/Aus/SA, ISBNs are granted through Bowker who does maintain their "Books In Print" data set that, in theory, contains metadata for all of the ISBNs they've granted. In practice though it's a mess. It's expensive to access and relies on publishers to enter in accurate and consistent metadata, which is...variable in quality to say the least. Often publishers buy blocks of ISBNs to use later so no metadata is entered up front and has to be pushed back to Bowker at a later date. To be somewhat fair to Bowker, the history of ISBNs far predates modern data standards and I can imagine wrangling publishers to get accurate data is a difficult task. But on the other hand, you'd think they'd have a lot to gain for doing it right. As someone who runs a book website, it is endlessly frustrating.


what is your website?


Companies can get a range of ISBNs before deciding what to publish under each ISBN, or whether to publish something at all. So the authorities assigning ISBNs don't necessarily know what they're being used for.


I can't get the .torrent file to work in my client. Can anyone give me a magnet link for it?

I need the magazine ISSNs for my magazine encyclopedia.

edit: got the .torrent working in qTorrent
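
In case it helps anyone else who hits that: you can derive a magnet link from the .torrent file yourself, since the (v1) infohash is just the SHA-1 of the bencoded "info" dictionary. A rough sketch, assuming the bencode.py package (imported as bencodepy) and a hypothetical local filename:

  # Rough sketch: build a magnet URI from a local .torrent file (BitTorrent v1).
  # Assumes the bencode.py package (pip install bencode.py), imported as bencodepy.
  import hashlib
  import urllib.parse
  import bencodepy

  def magnet_from_torrent(path: str) -> str:
      with open(path, "rb") as fh:
          meta = bencodepy.decode(fh.read())
      infohash = hashlib.sha1(bencodepy.encode(meta[b"info"])).hexdigest()
      name = meta[b"info"].get(b"name", b"").decode("utf-8", "replace")
      return "magnet:?xt=urn:btih:%s&dn=%s" % (infohash, urllib.parse.quote(name))

  print(magnet_from_torrent("worldcat.torrent"))  # hypothetical filename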


It's unclear what exactly the competition is about. Just to poke around the dataset?


As a note, I wish I had enough space to mirror their library. Looking at this brings out the collector in me...a tendency that I've successfully suppressed. You can only keep so many terabytes of archive around.


Yes.


An earlier study that addressed the scope of all published works:

J-B Michel, et al. "Quantitative Analysis of Culture Using Millions of Digitized Books". Science (the journal) (16 Dec 2010). https://www.science.org/doi/10.1126/science.1199644

Their focus was on words but along the way they analyzed the number of published texts, a study that "includes serials and sets but excludes kits, mixed media, and periodicals such as newspapers".

They concluded that the world had published 129 million "editions" (one book may have multiple editions).


Noob question: Isn't this going to be a great source for training language models? Is it safe to assume that OpenAI/Google/Meta etc. already have these?

In any case great work!


Worldcat is a database of books, not the books themselves. The summary and description text might be useful though?


If you could somehow download the entire archive you could feed it into your LLM for training. This is a huge corpus and is sort of ill-gotten. That said, it would be pretty awesome.

Google has this sort of thing already, since they have that whole "let's digitize the world's books" project. It's interesting to wonder why Google never developed a ChatGPT, given that they literally have a large amount of the world's books digitized.


Google launched Bard earlier this year.


Yes, but why weren't they first?


It's like asking why wasn't X invented earlier.

Google and everyone else had no idea how successful LLMs could be until OpenAI did it.


Untrue. The whole transformer idea came from the goog team


yes, of course. AA has a special program for llm developers.


Question for anyone from Anna's Archive or elsewhere: are catalogue metadata available from national library collections such as the British Library or US Library of Congress?

(I've ... worked a bit with LoC classification and subject headings data, for which the publicly available data are only offered in PDF or word-processing (MS Word or WordPerfect, if memory serves) formats. Which is ... somewhat unfortunate.)


Check out Z39.50; you can use it to search and pull data from most national and university library catalogs. But also be prepared for some code archaeology along the way, as the protocol is around 50 years old.

https://en.wikipedia.org/wiki/Z39.50 https://z-brary.com/ https://github.com/asl2/PyZ3950/blob/master/PyZ3950/zoom.py
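
A minimal sketch of what that looks like with PyZ3950, assuming the commonly published Library of Congress endpoint details (host z3950.loc.gov, port 7090, database VOYAGER), which may have changed; the library itself is old, Python 2-era code, so treat this as archaeology.

  # Minimal sketch: query the Library of Congress catalog over Z39.50 with PyZ3950.
  # Endpoint details are the commonly published ones and may have changed; verify before use.
  from PyZ3950 import zoom

  conn = zoom.Connection('z3950.loc.gov', 7090)
  conn.databaseName = 'VOYAGER'
  conn.preferredRecordSyntax = 'USMARC'

  query = zoom.Query('CCL', 'ti="dewey decimal classification"')
  results = conn.search(query)
  for i in range(min(5, len(results))):
      print(results[i].data)   # raw MARC for each hit
  conn.close()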


Depending on what you're looking for, a _lot_ more is being published by the Library of Congress as Linked Data nowadays, including LoC classification and subject headings. Check out https://id.loc.gov/.


That seems to be just metadata about metadata.

Where do you go to actually find the Classifications or Subject Headings themselves, in whatever format?


https://id.loc.gov/download/ offers bulk downloads of many of the vocabularies, including the subject headings. In addition, individual records of various types can be searched for.


Thanks.

That seems to have Subject Headings, but not the Classification, unless I'm blind (a fairly high likelihood).

I've spent quite some time poking around the LoC website over the years without finding anything more accessible than the PDFs of the Library of Congress Classification. Are you aware of that being generally available?


Near as I can tell, the best that's available is starting with the class(es) you want (e.g., https://id.loc.gov/authorities/classification/H.html for social sciences) and crawling recursively to grab whatever serialization of the RDF you want to consume. There's no SPARQL endpoint for id.loc.gov, alas, but it beats resorting to the PDFs.
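
For what it's worth, the crawl can be plain HTTP: individual id.loc.gov records can be fetched in machine-readable form by appending a suffix such as .json (this works for the subject headings; I'm assuming the classification records behave the same way). A small sketch with requests, starting from the social-sciences class page mentioned above:

  # Small sketch: fetch an id.loc.gov record as JSON-LD via the .json suffix.
  # Assumes the suffix works for classification records as it does for subject headings.
  import requests

  url = "https://id.loc.gov/authorities/classification/H.json"  # LCC class H (social sciences)
  resp = requests.get(url, timeout=30)
  resp.raise_for_status()
  nodes = resp.json()   # a list of JSON-LD node objects; walk these for narrower classes
  print(len(nodes), "JSON-LD nodes")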


Thanks.


this is an obvious prediction, but with the writers class action lawsuit against openai for using their books, the internet will become more closed. it's gonna be so hard to scrape websites in the future. we were trending in this direction before gpt, but gpt exacerbated this and put the issue into the mainstream.


It's infuriating seeing non-profits gatekeep datasets that were compiled with grant money. At least Elsevier doesn't present itself as a charity.

I was recently trying to get my hands on the Switchboard and Fisher conversational speech datasets. Both were funded by DARPA grants, and maintained by the non-profit LDC, which charges you thousands of dollars for access (and no discounts for individual researchers) - that is, if they'll even pay attention to you without a .edu email address. And both are standard corpora in the field of audio NLP, which makes replicating studies impossible.

Sadly, I couldn't find any way to pirate the datasets - they're too niche. So I applaud the authors for sticking it to Worldcat and scraping their data.


I hope one day there will be a Pirate Bay for datasets (The Pile) and AI models (Llama).


Good news, such a site was launched very recently: https://www.thenose.cc/

There's plenty of discussion about it by 'nostril' here on HN: https://news.ycombinator.com/threads?id=nostril


This is theft.


Why wouldn't you download a car?


>Over the past year, we’ve meticulously scraped all Worldcat records. At first, we hit a lucky break. Worldcat was just rolling out their complete website redesign (in Aug 2022). This included a substantial overhaul of their backend systems, introducing many security flaws. We immediately seized the opportunity, and were able scrape hundreds of millions (!) of records in mere days.

>After that, security flaws were slowly fixed one by one, until the final one we found was patched about a month ago. By that time we had pretty much all records, and were only going for slightly higher quality records.

OCLC carelessly fiddlefarted around with their moat and lost it. Poof!


I don't think anyone is (legally) going to prop up a business or non-profit using data that was admittedly taken from them using their security holes.



