I think 11m monthly users is a lot. Sure, compared to Facebook in 2023 it's minuscule. But assuming those 11m monthly users log in twice a week (~8 times per month) and view 15 pages per session, that comes out to about 1.3bn pageviews a month. Divided by (60s/m * 60m/h * 24h/d * 30d/m), that's roughly 500 sustained pageviews per second.
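A quick back-of-envelope of that arithmetic in Python (the login frequency and pages per session are the assumptions stated above, not measured numbers):

    # Back-of-envelope: monthly pageviews and sustained pageviews per second.
    # All inputs are assumptions from the comment, not measured data.
    monthly_users = 11_000_000
    sessions_per_month = 8       # ~2 logins per week
    pages_per_session = 15

    monthly_pageviews = monthly_users * sessions_per_month * pages_per_session
    seconds_per_month = 60 * 60 * 24 * 30

    print(f"{monthly_pageviews / 1e9:.2f}bn pageviews/month")           # ~1.32bn
    print(f"{monthly_pageviews / seconds_per_month:.0f} pageviews/sec") # ~509 sustained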
I assume that since they're a viral / addictive platform, the real numbers will be higher. Plus they're not a blog or the like, but a personalized social media platform, which makes things more complicated (you can't just cache the news feed once for everyone; every user's feed is different).
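To make the caching point concrete, here is a minimal sketch (my own illustration, assuming the Redis cache mentioned later in the thread; the key scheme, TTL, and build_feed stub are made up): with a personalized feed you can at best cache per user, so the cache working set grows with the user count instead of being one shared entry.

    import json
    import redis

    r = redis.Redis()

    def build_feed(user_id: int) -> list:
        # Stand-in for the expensive personalized query against the DB.
        return [{"pin": 1, "for_user": user_id}]

    def get_feed(user_id: int) -> list:
        # One cache entry per user -- a shared page cache buys you nothing here.
        key = f"feed:{user_id}"               # illustrative key scheme
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)
        feed = build_feed(user_id)
        r.setex(key, 60, json.dumps(feed))    # short TTL, feeds go stale quickly
        return feed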
It's easy to say "meh, 11m is nothing, there are other platforms with more users" (especially since you don't give examples of how you managed it otherwise). But I think it's a big technical feat to do this with 6 people.
Afaik a transcoding session can hold an entire CPU core; a malformed GIF avatar, for example, can bring a single-server web forum to its knees for a few seconds.
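A common mitigation (my own sketch, not something claimed about any particular forum software) is to run transcoding in a separate process with a hard timeout, so a pathological file costs you a worker process instead of pinning a request thread. The transcode body here is a hypothetical placeholder:

    import multiprocessing as mp

    def transcode(path: str) -> None:
        # Hypothetical placeholder for the real CPU-bound work,
        # e.g. shelling out to ffmpeg or resizing with Pillow.
        ...

    def transcode_with_timeout(path: str, timeout_s: float = 5.0) -> bool:
        # Run the transcode in its own process and kill it if it overruns,
        # so a malformed GIF can't hold a CPU core indefinitely.
        proc = mp.Process(target=transcode, args=(path,))
        proc.start()
        proc.join(timeout_s)
        if proc.is_alive():
            proc.terminate()
            proc.join()
            return False
        return proc.exitcode == 0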
500 pageviews per second is big, but 90 Web Engines + 50 API Engines? That's roughly 6 requests per second per web server and ~10 per API server, plus ~250 servers for DB + cache. I know my comment smells very HN, and kudos to the team for building and scaling it, but as an outsider: hmmm.
A single SSD can serve over 100k ops/s. Pub/sub systems built for the finance industry operate in excess of 60M messages/second [0]. I understand having multiple machines for failover, but I can't help feeling that much of the scale-out comes down to the developers not having the skills to properly optimise their systems.
There's an element of "how can they be so inefficient" in this thread, but hardware has come a long way in the last 12 years. I'd bet you could handle this scale on modern hardware with a single primary DB and 3-4 read replicas, for example.
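Since the stack mentioned further down is Django + Postgres, a primary-plus-replicas topology roughly looks like this in practice: a database router sends writes to the primary and spreads reads across replicas. The aliases, hostnames, and module path are made up for illustration, built on Django's standard router hooks:

    # settings.py (illustrative hostnames, not a real deployment)
    DATABASES = {
        "default":  {"ENGINE": "django.db.backends.postgresql", "HOST": "db-primary",   "NAME": "app"},
        "replica1": {"ENGINE": "django.db.backends.postgresql", "HOST": "db-replica-1", "NAME": "app"},
        "replica2": {"ENGINE": "django.db.backends.postgresql", "HOST": "db-replica-2", "NAME": "app"},
    }
    DATABASE_ROUTERS = ["myproject.routers.PrimaryReplicaRouter"]  # hypothetical module path

    # routers.py
    import random

    class PrimaryReplicaRouter:
        # Writes go to the primary, reads are spread across the replicas.
        def db_for_read(self, model, **hints):
            return random.choice(["replica1", "replica2"])

        def db_for_write(self, model, **hints):
            return "default"

        def allow_relation(self, obj1, obj2, **hints):
            return True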
Our stack was similar to Pinterest's: Django, but with Postgres and Redis. Team of 2.
Indeed, keep it simple. Cache things. Use queues, etc.
You can scale vertically a lot. DB performance and disk space were our main points of focus.
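As a concrete (hypothetical) illustration of "cache things, use queues" on that Django + Redis stack: cache a hot, non-personalized view and push slow work onto a task queue. The Pin model, cache timeout, and Celery task are my own example choices, not details from the comment:

    # views.py -- cache a hot, non-personalized page via Django's cache framework
    from django.http import JsonResponse
    from django.views.decorators.cache import cache_page

    @cache_page(60)  # serve from the (Redis-backed) cache for 60s instead of hitting the DB
    def popular_pins(request):
        pins = list(Pin.objects.order_by("-likes").values("id", "title")[:50])  # hypothetical model
        return JsonResponse({"pins": pins})

    # tasks.py -- move slow work off the request path onto a Celery worker
    from celery import shared_task

    @shared_task
    def resize_avatar(user_id: int) -> None:
        ...  # e.g. image resizing or email sending; runs asynchronously on a worker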