Show HN: Goodreads Data Pipeline (github.com/san089)
213 points by san089 on Feb 27, 2020 | 65 comments



As an aside, Goodreads is the slowest site I use on a regular basis. It's genuinely shocking that there are 5 to 6 second page load times. I'm not sure what their stack is, but I'm always blown away that anyone continues to use this. It feels like a competitor could beat them by just being faster.



That stackshare site won't let me view content without creating an account.

So that's a downvote from me. Please try to find a source that doesn't require you to register an account.


Downvoting someone for contributing to the conversation, without providing any better source. Great.


That's sort of what the downvote button is for: removing things that don't contribute to the conversation. A source you can't look at without capitulating to their dark patterns is exactly the kind of thing it's meant for.

Also explaining why you're downvoting is a good idea.


Or you could have just found the site yourself and linked to it here like everyone else does.

OP: Paywall link.
You: Here’s the non-paywall link.

Not that hard. And if you think about it, OP typed the answer “Ruby on Rails”. Would you have downvoted if he hadn’t provided a source?


I just accessed the stackshare URL without signing in.


I was able to access that link but couldn't do anything else without having to sign in. I also can no longer go back to that link without signing in.


I had to open it in a private browser window


Couldn’t agree more. It must have one of the worst usage-to-enjoyment ratios of any site.


This is potentially quite interesting to me. We've been having on-and-off conversations at work about data reporting, visualisation, etc., which has led me to pay attention to related topics.

However, it lacks any context explaining what you're trying to achieve and why.

It's probably obvious to some people but for me, it's not, which I think is a shame.


Beyond just data replication/archival purposes, it seems you can use this to run analysis against Goodreads' entire public dataset. This is much more efficient than using their API alone.


The architecture also seems pretty complex - I'm wondering at what level of requirements or data complexity people should consider something like this, as opposed to running a little cron job on a $5 server somewhere.

Not dissing the author; I'm genuinely trying to understand the spectrum of data science needs.


Someone, somewhere has to be able to make a better alternative to Goodreads, right? The site is slow, ugly, and buggy. The functionality is so simple: I tell you when I read a book and what I think about it.

I'm just shocked Amazon has been able to own this niche with so little effort.


I've been working on a competitor for a while now, and the hardest part of replicating the functionality is the data. OpenLibrary is probably the best source for book metadata online, but even their library dumps are riddled with mistakes that manifest in weird ways as you start building your own library. The Goodreads site sucks, but they've got surprisingly good data quality that I don't think anyone else has, and they have a super restrictive data policy, so you can't repurpose book data, reviews, shelf data, etc., even when users auth with a Goodreads account.

It's a small moat, but definitely penetrable with more than a little effort.


The good data quality is actually an artifact of humans being involved in every step of the cataloging process. There's a large group on Goodreads called the Goodreads Librarians, and that group has around a hundred thousand dedicated people who go through and flag anomalies, correct titles and indexes, etc.

Book publishers, or people who've worked in book publishing, will know that the book database is one area you don't want to mess with unless you know what you are doing. ISBNs are not the be-all and end-all of the story, and once you start taking into account special editions, covers, ebook editions, and language translations, you'll start to realize that book cataloging systems throughout history, including the Dewey Decimal System, are a marvel of human achievement.

Of course establishing a good-quality index is going to take work. People often forget that quality takes human work and effort.

EDIT: I lied. I changed the number from my original estimate of "a few hundred" to "a hundred thousand". The Goodreads Librarians group has 103,718 members as of when I just peeked - so it's actually a large number of humans submitting fixes to their catalog.

https://www.goodreads.com/group/show/220-goodreads-librarian...

If you take a look at the kind of discussions taking place, those are the kinds of things any competitor to Goodreads needs to know about.


I wonder if a partnership with Kobo, or even better Nook (Barnes & Noble) could help solve both the data problem and the issue with Kindle integration - while potentially bolstering the e-reader that integrates with it as well.


I for sure would take that sort of integration into account when looking at a new e-reader, although the primary thing I'd be looking for is library integration à la OverDrive.


Facts are not copyrightable and scraping has been determined to be legal. IANAL, but I'm not sure the law would prevent the factual metadata about books from being repurposed.


Not being illegal wouldn't protect you from a crushing lawsuit though. Especially since the details likely differ (LinkedIn's data was publicly accessible; I'm not sure if Goodreads requires a login).


Demanding perfect data is a waste of time that will let you procrastinate your product indefinitely.


If you're aiming for "like X, but for people who are actually interested in the market X supposedly serves" then maybe this isn't true.


You could nick the scraping code from Calibre...


The scraping part is probably not the complicated portion of the endeavor.


As a regular Goodreads user, I've never cared about the site's relatively slow load times. What's important to me is the trust that the site will still be around in 20 years, largely thanks to the Amazon ownership and Kindle integrations. I wouldn't have that same faith in an anonymous competitor.


Goodreads was that anonymous site once upon a time. You're just not an early adopter and that's okay. That's no reason to not create a better alternative.


I think it's a little insincere to compare Goodreads' release 13 years ago with a competitor launching against it now.


The kindle integration, the amount of correct data, and the fact that it's not going to vanish in the next year is what keeps me using GoodReads.


Not to imply this functionality is complex, but really the most important thing for me is the lists:

I _love_ that I can take a book I enjoyed, see it's on a list of "Best Magic Systems", and note what was rated even better for its magic system.

A simple method of discovery for me.

https://www.goodreads.com/list/show/871.Most_Interesting_Mag...

https://www.goodreads.com/list/show/8497.Aliens_First_Contac...


I find the site decently fast and definitely ugly, but then again I don't want it to get a Reddit-style redesign either. The information density is OK right now, and I'm actually impressed by the wide range of functionality they have for reviewing books and updating your progress.


The thing that goodreads has that will be hard to replicate is the Kindle integration.


This is the Achilles heel of any potential competitor. The frictionless integration means there is a big subset of users who simply won't engage with a competitor because it requires more work. Couple that with the social graph Goodreads already has, and you're looking at a huge moat.


Have you tried looking? There's LibraryThing and a couple others.

I don't think there's much value left on the table in the niche, though. Kindles have first-class Goodreads sync and even a Goodreads button in their global navbar. And Goodreads' competitors, for the few people who don't want to use Goodreads, already have a deep rut of incumbency.

Even you, who supposedly has great issues with Goodreads, apparently wasn't bothered enough to even see if competitors existed all this time, much less before writing your comment. Doesn't bode well for the Goodreads-competitor market, lol.


None of the competitors I'm aware of have fuzzy search though, which is pretty annoying.

"color prple"

LibraryThing: 0 results

Goodreads: 2,000+ results and they're well sorted
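
A minimal sketch of what I mean by fuzzy matching, using Python's standard-library difflib (this is obviously not how Goodreads implements search, and the titles here are made up for illustration):

    # Rank candidate titles by similarity to a (possibly misspelled) query.
    # Real search engines use n-gram indexes, but the idea is the same.
    from difflib import get_close_matches

    titles = [
        "The Color Purple",
        "The Color of Magic",
        "A Clockwork Orange",
        "Purple Hibiscus",
    ]

    # "color prple" should still surface "The Color Purple".
    matches = get_close_matches("color prple", [t.lower() for t in titles])
    print(matches)  # ['the color purple']

An exact-match search, by contrast, returns nothing for a query like that.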


Reddit has terrible search too, but you can appreciate that "Reddit but with good search" isn't all it takes to compete with Reddit. That's 0.001% of the work.

And of course Goodreads has issues of its own, but none of them are show-stoppers for most people, especially few of the people who just use it as a glorified Excel spreadsheet.

I only chuckle about this because, like many enterprising HNers, I myself have considered building a Goodreads competitor in the past and even managed to build the ol' weekend prototype (i.e. 0.001% of the work). It's one of those projects where you start and, after you get some of the easy things done like fuzzy search, you go "wait, wtf am I doing? Who would switch to this?"

Using and improving OpenLibrary is also alluring, but pretty hard to do without an application with actual users that have some sort of "edit book" functionality that you can then moderate and submit upstream to the OpenLibrary data source.

For example, look how ListenNotes.com lets users edit its podcast database: https://www.listennotes.com/podcasts/the-joe-rogan-experienc... -> the "Edit" tab.


I think most people use Reddit to just browse the subreddits. Goodreads is about searching for books and adding them to your shelves; many of those might be books that someone just mentioned to you in passing, or whose full/correct name you don't remember.

Different usage than Reddit.


Goodreads already doesn't look awesome, but whoa, LibraryThing looks like it hasn't been updated in the past decade.


Just signed up for FediReads[1] last week. It's a decentralized Goodreads built on ActivityPub, and it's open source[2].

[1] http://fedireads-test.glitch.me/

[2] https://github.com/mouse-reeve/fedireads/


> This is just a demo, any data here may be deleted without warning. sign up for email updates

So basically, keep using GoodReads for now?


I've been working on something like this. Super simple, like a spreadsheet of what you read, but as a SaaS. I was thinking of monetizing it à la Pinboard: focused on privacy. Like, $3 per month and you have it, without Amazon or Google knowing what books you read and how you rate them.


Nice - you can use UNION ALL instead of UNION in the query at the end if you're confident the datasets don't overlap; the query is less expensive. I'm also curious what the backfilling/recovery process is if something goes wrong and you have to stop your 10-minute load jobs.
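
To make the difference concrete (a toy sketch using sqlite3; the table and column names are made up, not the repo's actual schema): UNION deduplicates the combined result, which costs an extra sort/hash pass, while UNION ALL just concatenates the rows.

    # Toy demonstration of UNION vs UNION ALL.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE reviews_batch_a (review_id INTEGER);
        CREATE TABLE reviews_batch_b (review_id INTEGER);
        INSERT INTO reviews_batch_a VALUES (1), (2), (3);
        INSERT INTO reviews_batch_b VALUES (3), (4);
    """)

    # UNION removes the duplicate id 3 (extra work for the engine): 4 rows.
    print(conn.execute(
        "SELECT review_id FROM reviews_batch_a "
        "UNION SELECT review_id FROM reviews_batch_b").fetchall())

    # UNION ALL keeps every row, so it's cheaper -- but it's only safe when
    # the inputs are known not to overlap: 5 rows, the duplicate 3 is kept.
    print(conn.execute(
        "SELECT review_id FROM reviews_batch_a "
        "UNION ALL SELECT review_id FROM reviews_batch_b").fetchall())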


Really similar to the pipelines that I engineer/manage at my current company, although we run our Airflow on Kubernetes.

One optimization though is separating your loading tasks from compute tasks. This makes the pipeline more resilient and makes backfilling/reprocessing less of a headache.
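
A rough sketch of the shape I mean, assuming Airflow 1.10-style imports; the task names and the three-way split are illustrative, not taken from the repo:

    # Illustrative DAG: keep "land the raw data" separate from "compute over
    # it", so a failed transform can be re-run or backfilled without
    # re-extracting or re-loading anything.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    def stage_raw_files(**context):
        pass  # hypothetical: copy source files into a raw/staging area (e.g. S3)

    def load_to_warehouse(**context):
        pass  # hypothetical: bulk-load the staged files into warehouse tables

    def run_transformations(**context):
        pass  # hypothetical: run the Spark/SQL transformations over loaded data

    with DAG(
        dag_id="etl_split_load_and_compute",
        start_date=datetime(2020, 1, 1),
        schedule_interval="*/10 * * * *",
        catchup=False,
    ) as dag:
        stage = PythonOperator(task_id="stage_raw_files",
                               python_callable=stage_raw_files,
                               provide_context=True)
        load = PythonOperator(task_id="load_to_warehouse",
                              python_callable=load_to_warehouse,
                              provide_context=True)
        transform = PythonOperator(task_id="run_transformations",
                                   python_callable=run_transformations,
                                   provide_context=True)

        stage >> load >> transform

With this layout, clearing just the transform task re-runs the compute over data that has already been landed.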


Thanks for the tip. I actually thought about such a separation, but it was too late to make those changes - I had already laid down the architecture by that time. But you make a good point.


Super interesting. Surely there are some business cases where someone could use this data for good (?)

For example, someone could show the disparity between a New York Times bestseller and the book getting the most activity on Goodreads (added to the most shelves, for example).


Is this limited strictly to the GoodReads API or does it pull in more interesting data like the shelf-tags? When I did https://www.gwern.net/GoodReads the other month, I had to literally scrape shelves by hand because the API doesn't cover them and they lie to bots.


3 xlarge EMR instances sounds like overkill for a volume of around 11 GB every 10 minutes. Using Postgres COPY I've loaded larger files into tables in seconds. Semi-complex queries will also take seconds, depending on indexes. My understanding is that EMR doesn't make sense unless you're processing terabytes.
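
For reference, a minimal sketch of the COPY approach (assumes psycopg2; the table, columns, file path, and connection string are placeholders):

    # Bulk-load a CSV into Postgres with COPY via psycopg2.
    import psycopg2

    conn = psycopg2.connect("dbname=books user=etl host=localhost")
    with conn, conn.cursor() as cur, open("reviews.csv") as f:
        cur.copy_expert(
            "COPY reviews (review_id, book_id, rating, review_text) "
            "FROM STDIN WITH (FORMAT csv, HEADER true)",
            f,
        )
    # COPY streams the whole file in one round trip, which is why it is so
    # much faster than row-by-row INSERTs.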


I read through the README but didn’t see any volume or velocity figures (I saw the entity count, but what does this translate into w/r/t bytes?)

Anyone run this who could comment on the metrics and, consequently, server sizing?


Nice overview. I'd suggest that anyone interested in doing something like this also consider the much simpler managed approach of using tools like:

* Stitch [ETL/ELT]
* Snowflake [data warehouse]
* dbt [transformations]

I'd recommend taking a look at dbt [1] for a refreshing approach to this domain. The AWS EMR Redshift approach is great if you _know_ you'll need all the configurability, but chances are you won't, and even with that said, the above stack provides it as necessary.

[1] https://blog.getdbt.com/analytics-engineering-for-everyone/


The worst thing about Goodreads is that it is horribly biased by terrible people. The Historical Fiction category is exceptionally terrible. Unfortunately it is heavily integrated with Amazon.


Biased toward what?


I've seen many cases of massive 1-star rating campaigns against books that were not even published yet, because people didn't like the author as a person, or because the book dealt with a sensitive subject.

I've also seen the opposite, with 5-star ratings on an unreleased book because the author (as a person) is liked by the community.


The historical fiction category, in particular, is wildly female-biased towards historical romance.


Interesting - I'd like some detail on the cost of this ETL setup on AWS; unfortunately I can't see anything about this on the project page.


Since they are owned by Amazon, I would think their cost is close to nothing.


That's a bit of a misconception. From what I understand, Amazon's non-AWS branches don't get deeply-discounted services from AWS. There is a discount, but it's not enough to turn dark skies into sunshine and rainbows.

Amazon tends to want every part of itself to be in ship-shape, and giving itself a massive discount would discourage efficiency in non-AWS parts of the business.

Disclosure: neither a current nor former Amazon employee.


This is a misconception. Amazon wants to depict itself as wanting every part to be in ship-shape, but it does not operate that way, and AWS is treated like any other internal resource, such as printers and staplers.


AWS basically finances the rest of Amazon; it's around 70% of Amazon's operating income (that is public info). Except for retail, the rest is all losses. So the discounts don't matter much - other branches just try to save money (frugality is one of Amazon's core values) but basically get what they need.


This repository is not associated with Amazon.


No, but Goodreads (the subject of the pipeline under discussion) is.


Amazon literally siphons off the data and has invested so little in its users.

Any recommendations for an alternative?


LibraryThing, if you don't mind the ugly website.


For my own education, what is a "data lake"? Is "data warehouse" now a has-been, and this is the new hype term for it?


My understanding is that a data lake holds raw, unstructured data, vs. a warehouse where everything is already parsed, processed, and readily queryable.


Thanks. I see that there is a Wikipedia page now. I only started to hear about it 1 or 2 years ago and didn't find too much on it at the time.


I'm not a data science guy, so I need an ELI5 on this - is it scraping all of Goodreads and passing it into a data pipeline? It seems like a third-party project. Is this just to demonstrate data/ETL skills? What are some practical uses of this? Sorry, it's not obvious to me.



