As an aside, Goodreads is the slowest site I use on a regular basis. It's genuinely shocking to see 5 to 6 second page load times. I'm not sure what their stack is, but I'm always blown away that anyone continues to use it. It feels like a competitor could beat them by just being faster.
That's sort of what the downvote button is for: removing things that don't contribute to the conversation. A source you can't look at without capitulating to its dark patterns more or less fits that description.
Also explaining why you're downvoting is a good idea.
This is potentially quite interesting to me. We are having on/off conversations at work about data reporting, visualising, etc., which is leading me to pay attention to related topics.
However, it's lacking in any context explaining what you're trying to achieve and why.
It's probably obvious to some people but for me, it's not, which I think is a shame.
Beyond just data replication/archival purposes, it seems you can use this to run analysis against Goodreads' entire public dataset. This is much more efficient than using their API alone.
The architecture also seems pretty complex - I am wondering at what level of requirements or data complexity people should consider something like this, as opposed to running a little cronjob on a $5 server somewhere (roughly the sort of thing sketched below).
Not dissing the author, genuinely trying to understand the spectrum of data science needs.
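For a sense of scale, the "little cronjob on a $5 server" end of that spectrum might look roughly like the sketch below. The endpoint, fields, and schema here are placeholders, not the real Goodreads API; the point is only that a bounded fetch-and-store loop run from crontab covers a lot of ground before a full pipeline is needed:

    import sqlite3
    import time

    import requests

    # Placeholder endpoint and schema, not the actual Goodreads API.
    DB = sqlite3.connect("books.db")
    DB.execute(
        "CREATE TABLE IF NOT EXISTS reviews "
        "(id TEXT PRIMARY KEY, rating INTEGER, fetched_at REAL)"
    )

    def fetch_page(page):
        resp = requests.get("https://example.com/reviews",
                            params={"page": page}, timeout=30)
        resp.raise_for_status()
        return resp.json()

    def run_once():
        for page in range(1, 11):  # small, bounded batch per cron invocation
            for r in fetch_page(page):
                DB.execute("INSERT OR REPLACE INTO reviews VALUES (?, ?, ?)",
                           (r["id"], r["rating"], time.time()))
        DB.commit()

    if __name__ == "__main__":
        run_once()  # scheduled via crontab, e.g. */10 * * * *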
Someone, somewhere has to be able to make a better alternative to Goodreads right? The site is slow, ugly, and buggy. The functionality is so simple: I tell you when I read a book and what I think about it.
I'm just shocked Amazon has been able to own this niche with so little effort.
I've been working on a competitor for a while now, and the hardest part of replicating functionality is the data. OpenLibrary is probably the best source for book metadata online, but even their library dumps are riddled with mistakes that manifest in weird ways as you start building your own library. The Goodreads site sucks, but they've got surprising data quality that I don't think anyone else has; and they have a super restrictive data policy so you can't repurpose book data, reviews, shelf data, etc, even when users auth with a Goodreads account.
It's a small moat, but definitely penetrable with more than a little effort.
The good data quality is actually an artifact of humans being involved in every step of the cataloging process. There's a large group on Goodreads called the Goodreads Librarians, and that group has around a hundred thousand dedicated people who go through and flag anomalies, correct titles and indexes, etc.
Book publishers, or people who've worked in book publishing, will know that the book database is one area you don't want to mess with unless you know what you are doing. ISBNs are not the be-all and end-all of the story, and once you start taking into account special editions, covers, ebook editions, and language translations, you'll realize that the book catalog system going back through history, including the Dewey Decimal System, is a marvel of human achievement.
Of course establishing a good quality index is going to take work. People often forget that quality takes human work and effort.
EDIT: I lied. I changed the number from my original estimate of a "few hundred" to "a hundred thousand". The Goodreads Librarians group has 103,718 members as of when I just checked - so it's actually a large number of humans submitting fixes to their catalog.
I wonder if a partnership with Kobo, or even better Nook (Barnes & Noble) could help solve both the data problem and the issue with Kindle integration - while potentially bolstering the e-reader that integrates with it as well.
I for sure would take that sort of integration into account when looking at a new ereader. Although the primary thing I'd be looking for is library integration a la OverDrive.
Facts are not copyrightable, and scraping has been determined to be legal. IANAL, but I'm not sure the law would protect them from having the factual metadata about books repurposed.
Not being illegal wouldn't protect you from a crushing lawsuit, though. Especially since the details likely vary (the LinkedIn data was publicly accessible; not sure if Goodreads requires a login).
As a regular Goodreads user, I've never cared about the site's relatively slow load times. What's important to me is the trust that the site will still be around in 20 years, largely thanks to the Amazon ownership and Kindle integrations. I wouldn't have that same faith in an anonymous competitor.
Goodreads was that anonymous site once upon a time. You're just not an early adopter and that's okay. That's no reason to not create a better alternative.
I find the site decently fast, definitely ugly but then again I don't want it to get a reddit-style redesign either. The information density is ok right now, and I'm actually impressed by the wide range of functionalities they have, related to reviewing books and updating your progress.
This is the Achilles heel of any potential competitor. The lazy integration means there is a big subset of users who simply won't engage with a competitor because it requires more work. Couple that with the social graph Goodreads already has and you're looking at a huge moat.
Have you tried looking? There's LibraryThing and a couple others.
I don't think there's much value left on the table in the niche, though. Kindles have first-class Goodreads sync and even a Goodreads button in their global navbar. And Goodreads' competitors, for the few people who don't want to use Goodreads, already have a deep rut of incumbency.
Even you, who supposedly has great issues with Goodreads, apparently weren't bothered enough to even see if competitors existed all this time, much less before writing your comment. Doesn't bode well for the Goodreads-competitor market, lol.
Reddit has terrible search too, but you can appreciate that "Reddit but with good search" isn't all it takes to compete with Reddit. That's 0.001% of the work.
And of course Goodreads has issues of its own, but none of them are show-stoppers for most people, least of all the people who just use it as a glorified Excel spreadsheet.
I only chuckle about this because, like many enterprising HNers, I myself have considered building a Goodreads competitor in the past and even managed to build the ol' weekend prototype (i.e. 0.001% of the work). It's one of those projects where you start and, after you get some of the easy things done like fuzzy search, you go "wait, wtf am I doing? Who would switch to this?"
Using and improving OpenLibrary is also alluring, but pretty hard to do without an application with actual users that have some sort of "edit book" functionality that you can then moderate and submit upstream to the OpenLibrary data source.
I think most people use Reddit just to browse the subreddits. Goodreads is about searching for books and adding them to your shelves; many of those might be books that someone just mentioned to you in passing, or whose full or correct name you don't remember.
I've been working on something like this. Super simple, like a spreadsheet of what you read but as a SaaS. I was thinking of monetizing it à la Pinboard: focused on privacy. Like, $3 per month and you have it, without Amazon or Google knowing what books you read and how you rate them.
Nice - you can use UNION ALL instead of UNION in your query at the end if you're confident the datasets don't overlap; the query is less expensive because it skips the deduplication step (quick sketch below).
I'm also curious what the backfilling/recovery process is if something goes wrong and you have to stop your 10 min load jobs.
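To make the UNION ALL point concrete, here's a minimal sketch with hypothetical table and column names (psycopg2 speaks to both Postgres and Redshift). UNION deduplicates the combined rows with an extra sort/hash step, while UNION ALL simply concatenates them, which is cheaper when the inputs can't overlap:

    import psycopg2

    # Hypothetical table/column names; the point is only UNION vs UNION ALL.
    # UNION deduplicates (extra sort/hash), UNION ALL just appends the rows.
    QUERY = """
        SELECT book_id, shelf_name FROM shelf_adds_current
        UNION ALL
        SELECT book_id, shelf_name FROM shelf_adds_history
    """

    with psycopg2.connect("dbname=goodreads") as conn, conn.cursor() as cur:
        cur.execute(QUERY)
        rows = cur.fetchall()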
Really similar to the pipelines that I engineer/manage at my current company, although we have our Airflow on Kubernetes.
One optimization, though, is separating your loading tasks from your compute tasks. This makes the pipeline more resilient and makes backfilling/reprocessing less of a headache.
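A hedged sketch of that separation as an Airflow DAG (Airflow 2-style imports; the DAG id, task names, and callables are hypothetical). Because extraction/loading and transformation are independent tasks, either one can be cleared and re-run on its own when backfilling:

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract_and_load(**context):
        # Land the raw batch for this run unchanged (e.g. to S3 or a
        # staging table). Placeholder body.
        ...

    def transform(**context):
        # Read only what the load step landed, so this can be re-run for
        # any past window without re-fetching from the source. Placeholder.
        ...

    with DAG(
        dag_id="goodreads_pipeline",
        start_date=datetime(2020, 1, 1),
        schedule_interval=timedelta(minutes=10),
        catchup=False,
    ) as dag:
        load = PythonOperator(task_id="extract_and_load",
                              python_callable=extract_and_load)
        compute = PythonOperator(task_id="transform",
                                 python_callable=transform)
        load >> compute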
Thanks for the tip. I actually thought about such a separation, but it was too late to make that change; I had already laid down the architecture by then. But you made a good point.
Super interesting. Surely there are some business cases for how someone could use this data for good(?)
For example, someone could show the disparity between a New York Times bestseller and the book getting the most activity on Goodreads (added to the most shelves, for example).
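As a rough illustration of that kind of comparison, with hypothetical file and column names (assuming one table of NYT ranks and one of shelf-add counts produced by the pipeline):

    import pandas as pd

    # Hypothetical inputs: NYT bestseller ranks and aggregated shelf-adds.
    nyt = pd.read_csv("nyt_bestsellers.csv")   # columns: title, nyt_rank
    shelves = pd.read_csv("shelf_adds.csv")    # columns: title, shelf_adds

    merged = nyt.merge(shelves, on="title", how="inner")
    merged["shelf_rank"] = merged["shelf_adds"].rank(ascending=False)
    merged["rank_gap"] = merged["shelf_rank"] - merged["nyt_rank"]

    # Biggest disparities: bestsellers few people actually shelve, and vice versa.
    print(merged.sort_values("rank_gap", ascending=False).head(10))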
Is this limited strictly to the GoodReads API or does it pull in more interesting data like the shelf-tags? When I did https://www.gwern.net/GoodReads the other month, I had to literally scrape shelves by hand because the API doesn't cover them and they lie to bots.
Three xlarge EMR instances sound like overkill assuming a volume of around 11 GB every 10 minutes. Using Postgres COPY, I've loaded larger files into tables in seconds. Semi-complex queries will also take seconds, depending on indexes. My understanding is that EMR doesn't make sense unless you're processing terabytes.
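For reference, a sketch of that single-box load using COPY via psycopg2; the file, table, and column names are hypothetical:

    import psycopg2

    # Hypothetical file/table names. COPY streams the whole file in one
    # statement instead of row-by-row INSERTs, which is what makes
    # single-node Postgres handle batches this size comfortably.
    with psycopg2.connect("dbname=goodreads") as conn, conn.cursor() as cur:
        with open("reviews_batch.csv") as f:
            cur.copy_expert(
                "COPY reviews (review_id, book_id, rating, added_at) "
                "FROM STDIN WITH (FORMAT csv, HEADER true)",
                f,
            )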
Nice overview. I'd suggest that anyone interested in doing something like this also consider the much simpler managed approach of using tools like:
* Stitch [etl/elt]
* Snowflake [data warehouse]
* dbt [transformations]
I'd recommend taking a look at dbt [1] for a refreshing approach to this domain. The AWS EMR/Redshift approach is great if you _know_ you'll need all the configurability, but chances are you won't, and even then, the above stack provides it as necessary.
The worst thing about Goodreads is that it is horribly biased by terrible people. The Historical Fiction category is exceptionally bad. Unfortunately it is heavily integrated with Amazon.
I've seen many cases of massive 1 star ratings of books that were not even published yet, because people didn't like the author as a person, or because it dealt with a sensitive subject.
I've also seen the opposite, with 5-star ratings on an unreleased book because the author (as a person) is liked by the community.
That's a bit of a misconception. From what I understand, Amazon's non-AWS branches don't get deeply-discounted services from AWS. There is a discount, but it's not enough to turn dark skies into sunshine and rainbows.
Amazon tends to want every part of itself to be in ship-shape, and giving itself a massive discount would discourage efficiency in non-AWS parts of the business.
Disclosure: neither a current nor former Amazon employee.
This is a misconception. Amazon wants to depict itself as wanting every part to be in ship-shape, but it does not operate that way; AWS is treated like any other internal resource, like printers and staplers.
AWS basically finances the rest of Amazon; it's around 70% of Amazon's operating income (that's public info). Except for retail, the rest is all losses. So the discounts don't matter much; other branches just try to save money (frugality is one of Amazon's core values) but basically get what they need.
I'm not a data science guy, so I need an ELI5 on this - is this scraping all of Goodreads and passing it into a data pipeline? It seems like a third-party project. Is this just to demonstrate data/ETL skills? What are some practical uses of this? Sorry, it's not obvious to me.