Hacker News
How we successfully handled 2.5x traffic in a week (khanacademy.org)
154 points by talonx on May 13, 2020 | 94 comments



If Khan Academy uses YouTube to serve their video and uses Fastly to serve static content, what makes it hard to scale?

I mean, being able to scale that easily is a great thing, but is there anything worth sharing with the world in their case?


I need to write a blog post about this :)

A lot of people seem to think of Khan Academy as a bunch of videos. Many have also seen the exercises and articles. Those things are all pretty static (though it gets more complex when you consider how much content there is and how many languages it's localized into).

There's a whole bunch of dynamic behavior around that static content. Keeping track of progress to tell a learner how they're doing, plus to help recommend the next place to go in the content. Reporting on progress to parents and teachers. Letting teachers create assignments and manage their classrooms. Bubbling up information to school districts.

Content pages have discussions and clarifications. There are notifications to tell students about new assignments, for example.

There are connections to tests, like the SAT prep or integration with the MAP test, which involve connecting our accounts with external accounts in order to help students based on those test results.

And a bunch of other stuff that isn't coming to mind right now because I'm just naming things off the top of my head.

Doing all these things across a user base of millions of monthly users can get quite involved.


Thank you! I'm curious what the cost was to run this stack for one month?

In the blog post Marta said:

> In the month of April, we served 30 million learners on our platform

How much did it cost to run this whole stack on GCP in April 2020? Was it $150,000? More or less?


Unfortunately, I don't think I've seen this publicly reported and I'm not in a position to decide to make that public.

I will say that the amount we pay for our infrastructure, especially right now, is reduced by generous support from both Fastly and Google. It still remains a substantial expense after that support.

One final note on cost: we are actively working on a Python to Go transition[1] that will reduce our hosting costs, among other benefits.

[1]: http://engineering.khanacademy.org/posts/goliath.htm


Yes. In this case, they share the fact they used common sense. This is not saying that all of the other major re-architecture blog posts are flawed. However, it is good to know that scaling when using cloud vendor specific tools is as easy as advertised.


Yes, for the same reason that researchers should publish null results: all of the data is useful. Getting confirmation that a particular formulation of a strategy works or does not work is valuable in and of itself, regardless of the exact outcome. The only reason why it wouldn't be valuable would be if there were a plethora of similar reports of successfully scaling this solution, which I do not believe there are, so their experience is very welcome.

To turn it around, why would you ever want someone to not share their experience with the world if they took time to write it down? It's not like you must read it; its existence doesn't cost you anything. But if someone's experience adds to the library of human knowledge, even a little bit, why would one try to reject that?


"How to scale: Make it somebody else's problem."


A surprisingly legitimate solution.


Yes, with the caveat that you may have to check that they're actually capable of handling the load and you don't get a surprise notice that "uh we can't do this, you're on your own".


I am pretty confident that YouTube will be able to scale to handle any increased load coming from my service.


But they might not at any time, at least without charging. With an apparent internal push for some services to become self-sustaining (see Google Maps API, Recaptcha), YouTube embeds might be next.


Khan Academy today supports serving video outside of YouTube, which is blocked in some schools. We could essentially flip a switch to not use YouTube, but the cost would be substantial because those videos go Fastly->S3, so anything not in cache is going to result in S3 egress charges.
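
To put rough numbers on the Fastly->S3 concern, here is a back-of-the-envelope sketch. Only cache misses reach the origin, so the cache hit ratio dominates the bill. The traffic volume, hit ratios, and per-GB price below are made-up assumptions for illustration, not Khan Academy's or AWS's actual figures.

```python
def origin_egress_cost(total_gb_served, cache_hit_ratio, egress_price_per_gb):
    """Estimate origin egress cost when a CDN fronts an object store.

    Only cache misses reach the origin, so only that share of traffic
    is billed as egress. All inputs are hypothetical.
    """
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    miss_gb = total_gb_served * (1.0 - cache_hit_ratio)
    return miss_gb * egress_price_per_gb


if __name__ == "__main__":
    # Assumed: 500 TB served per month at an assumed $0.09 per GB of egress.
    for hit_ratio in (0.80, 0.95, 0.99):
        cost = origin_egress_cost(500_000, hit_ratio, 0.09)
        print(f"hit ratio {hit_ratio:.0%}: ~${cost:,.0f} per month in origin egress")
```

Under these assumptions, moving from an 80% to a 99% hit ratio cuts origin egress by a factor of 20, which is why "anything not in cache" is the expensive part.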


Something to consider in your tooling for this is to target Backblaze's B2 object store, which supports an S3 compatibility layer [1] (so you shouldn't need to change too much code). I'm unsure if Fastly supports B2 in this configuration yet though.

[1] https://news.ycombinator.com/item?id=23069114


Yeah, I'm sure that if we needed to serve more videos through our "fallback player", we'd investigate more cost effective ways to run it. With thousands of videos, it'd be a pain to move … but at least that would be a one time pain!


Something else that can also help cost and quality is EasyBroadcast’s viewer assisted streaming in addition to Fastly. Adding a JS snippet to the player pages enables CDN offloading by making each viewer act as a potential source based on QoE/QoS metrics. Disclaimer: I am one of the cofounders. Happy to help.


It seems unlikely Khan Academy doesn't have the technical competency to deploy their video content to an alternate host rapidly, whether that be an object store (Backblaze) or dedicated servers with very cheap bandwidth (Hetzner, OVH, and similar), perhaps even using PeerTube.

There's a reason other non profits like Wikipedia and the Internet Archive run their own hardware, networking, and connectivity to transit providers. And before the "doing that is expensive!" argument comes up, note how expensive having someone else do these things is. Lots of margin built into cloud services.


I am loosely involved with one of the language teams. This language team, and others like it, are groups of volunteers and not Khan Academy employees. I don't have all the details but I know that the teams are responsible for creating the content and uploading the content to YouTube. Changing to an alternate host would require more coordination than you might think.

My contribution is that I am helping with automating video uploads to the team YouTube channel, which is allowed to have "Khan Academy {language}" branding.

Shameless plug: If any Google or YouTube employees can help me raise our API quota please get in touch!!! :)


They can. Doesn't mean they will.


Or a bill, please pay one million.

Kthxby


"And give them bookoo $$$ to do so."


Which is usually known in the industry as "good problems to have." If you're staring down a huge bill because your company is blowing up you pay it, throw a party, and then figure out how to reduce costs once the hangover wears off.


I'd rename it to "pay someone else to do it". In the case of YouTube, even if the hosting is free, YT still makes a profit (ads, tracking, etc.) so it's a win-win for all.


Or, to phrase it differently: their technical problem, your financial problem.


There's quite a bit of dynamic content, no? Things like exercise grading, and progression through the skill tree.


Which is powered by Google's AppEngine, which scales very easily, at least technically easily.


s/easily/expensively/

Khan Academy doesn't have infinite money.


what makes GAE expensive?


Anything is expensive if you have to scale up beyond your financial ability to do so.


Not sure why you’re being downvoted. Scaling systems at companies with a lot of money is very different from designing systems to be reliable and performant when reducing spend is a real goal.

Some companies need to take a much harder look at their spend than others regardless of where you deploy.


"Just use external video hosting and deploy a CDN for static content" sounds like trivial advice, but I constantly see businesses here and there not following it. So, as the Russian proverb says, "repetition is the mother of learning".


I'd like to hear engineering stories from Youtube in that case. What does their operation look like? Do they ever tell?


Make an educated guess about the design of YouTube's serving infrastructure, and I'm pretty certain you're right.

There's only really one sensible way to do it, and that's the way it's done.


That's not terribly helpful to a large portion of people that don't have the requisite knowledge or experience to make an educated guess, or don't have the knowledge to eliminate most of a few educated guesses.

It's sort of like someone asking how you rebuild the stock carburetor on a 1967 Mustang, and replying "Make an educated guess about the design of YouTube's serving infrastructure, and I'm pretty certain you're right." For the vast majority of people, that's no use at all.


Like the rest of the system that isn't just videos and static content?


It's a cookie cutter app engine stack which scales automatically


This is a good example of how cloud tools make this kind of scaling easy.

The trickier part can be the cost, which this piece notes will increase roughly linearly with the number of users. If Khan Academy is free, I think this means those who are generous are going to need to keep giving to keep it that way. Let's step up, everyone.


Wouldn’t you generally deploy your services with a certain safety margin? I am pretty sure most systems I am working on could handle 10x pretty easily. Then it would get hard, but it seems 2.5x is a pretty normal and expected fluctuation.


Provisioning for 2.5x of peak load would be pretty expensive and, in most cases, overly cautious (good luck selling 10x to your financiers!). For a large service that wants to be able to handle unexpected spikes, I'd expect something more like a 1.1-1.5x margin.

As their blog post says, they weren't unprepared. They had the means to grow and used it. Burning money before that time wasn't necessary.
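
A quick way to see why large standing margins are a hard sell: the sketch below computes how much of the provisioned capacity sits idle even at peak load for a given headroom factor. The definition of "headroom factor" (provisioned capacity divided by expected peak) is an assumption made for this illustration.

```python
def idle_fraction(headroom_factor):
    """Fraction of provisioned capacity sitting idle at peak load.

    headroom_factor is provisioned capacity divided by expected peak,
    e.g. 1.5 means 50% more capacity than the peak needs.
    """
    if headroom_factor < 1.0:
        raise ValueError("headroom_factor below 1.0 cannot serve the peak")
    return 1.0 - 1.0 / headroom_factor


if __name__ == "__main__":
    for factor in (1.1, 1.5, 2.5, 10.0):
        print(f"{factor:>4}x headroom: {idle_fraction(factor):.0%} of capacity idle at peak")
```

At 2.5x headroom, 60% of what you pay for is unused even at the busiest moment; at 10x it is 90%. Off-peak the idle share is higher still.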


Makes sense. I guess I was thinking about smaller deployments where the absolute server cost isn't that high.


2.5x is a surprisingly small jump for having all of your brick and mortar competitors shut down for an indeterminate period of time. Either Khan Academy had amazing penetration into the education space, or the follow-through rates for kids educating at home are abysmally low.

Disclaimer: I don’t have children, so I have no real world experience with Khan.


There's a lot going on in your observation, and this is all speculation on my part (even though I am a Khan Academy employee).

We did have quite a bit of usage and awareness among schools already before the shutdowns started. Couple that with there being many options for teaching online… I wouldn't be surprised if a lot of schools just switched to having their teachers attempt to do their normal teaching via Zoom (which sounds really hard to me!). Many schools had contracts of various sorts with other online learning platforms.

Some schools or classes haven't had great follow through rates, which is unfortunate, but educators all over have had to quickly adjust. I suspect that more robust plans will be in place by the fall, given how much uncertainty there is for fall classes. Khan Academy is, at least, an always-free resource that's there for people if they need it.

That 2.5x is starting from a large base, and there's also a lot of activity in online education generally.


"I wouldn't be surprised if a lot of schools just switched to having their teachers attempt to do their normal teaching via Zoom"

This is exactly what is happening in our school district and it is a big failure. That and teachers emailing their lesson plans for parents to print or parents can go to the school and pick up printed packets. We then have the joy of taking photos of the completed work and emailing those back to the teachers.

It is extremely inefficient and I have already informed our school that we will not be doing that if we are stuck in this scenario come fall. We will be using Khan for math and other online learning platforms for LA.


What are the other platforms you mention?


It's also coming up to exam season and summer holidays in a lot of places. Seeing the same with my product. We run it internationally so there is a spectrum (country dependent) of usage changes. Some are the same, or less, some are 2-3x more but only certain days of the week. One was 1000x more, but only for one week and is now just normal. It's been pretty crazy.


All the kids I know of (Jr High / High School) already actively participate in Khan Academy - on some topics at school requirement, on other topics at their own discretion because they're already accustomed to the platform.


I can see myself using it to learn things that my (admittedly lousy) teachers weren't able to teach. But, is it really true that teachers are straight up assigning Khan Academy material as part of the course requirements? That's interesting to me and I had no idea that was going on.


My nieces had it explicitly assigned as part of their math homework.


Khan Academy is actively soliciting donations right now, as is referenced in the footnote to the article:

> Khan Academy's increased usage has also increased our hosting costs, and we're a not-for-profit that relies on philanthropic donations from folks like you.


I love it when people have both the inclination and the political pull to keep an environment super minimalist like this. Fastly to AppEngine is a blazing fast combo and so well sorted to "just work".


Khan Academy is great, don't get me wrong, but what I see is an engineering blog that doesn't force HTTPS and an ad for Google products. All that Khan Academy did was optimize code, set up the partner console properly and pay the (likely enormous) bill. As I read in another comment here: "How to scale: Make it somebody else's problem" + pay the bill.

Edit: ah, a case study from Google about Khan Academy. This post was definitely an ad: https://cloud.google.com/customers/khan-academy Another: https://cloudplatform.googleblog.com/2013/08/khan-academy-ru...


It does enforce HTTPS - what are you talking about?


Notably:

- No Rust/Go rewrite

- GC not disabled

- Didn't apply the latest research on k/v storage

Jokes aside, these are the fun parts of hosted software, and I'm glad to hear the "things don't have to be so hard" side of things. Hope it continues working out!


If you're already using GCP, my general advice for new projects is almost always some form of "just throw it on AppEngine". No, you don't need multi-region deployments. No, you don't need 32TB of memory per instance. No, you do not need Kubernetes. No, Istio is not going to solve this. No, you're not hosting your own Kafka cluster.

I've found devs are always trying to over-engineer complex solutions to dead simple problems. Just let Google do it and get some sleep.


For new projects sure, but you need an escape hatch. App Engine costs can spiral out of control. I know of at least one startup that was pretty successful in finding product market fit but sunk their own ship because they weren't able to migrate off of App Engine quickly enough.


If you run on app engine flexible it shouldn’t be hard to migrate.


Not even flex. AppEngine standard added the ability to deploy containers in 2018.


You make it sound as if AppEngine solves all your problems. The prime source of complexity in most web apps is the database, which you don't mention.


In larger organizations you’re rarely building an entirely new web app from scratch. Yes, that’s a more complex endeavor. A lot of the problems boil down to “sit on a queue, grab some data, mix in some other data, send it somewhere else”. But GAE at the very least solves the scaling and cert management.


[flagged]


Starting simple isn't putting all your eggs in one basket. In fact it's such a common recommendation (and so commonly forgotten about) that there are several phrases designed just to teach this one lesson:

KISS - Keep it simple, silly

YAGNI - You ain't gonna need it

MVP - Minimum viable product

Overengineering

Real artists ship

I'm sure there's a ton more examples but that's just off the top of my head. Point is, until you know that you need high availability and multi-zone disaster recovery and etc etc, just engineer it for the problems you actually have.


Nope, just one or two eggs. When those eggs hatch and you have a massive chicken farm, you can start putting your eggs in multiple baskets.


People who count their chickens before they are hatched act very wisely because chickens run about so absurdly that it's impossible to count them accurately... – Oscar Wilde, letter to R. Ross, 31 May 1898


AppEngine doesn't scale fast; it is a better fit for gradual traffic increases. We have spikes of 1000x and back within one minute, and a stable set of 10 boxes with average hardware (Compute Engine) handles them more reliably than 100+ AE Golang instances, which end up hammering everything downstream. Also, GAE costs will be insane compared to Compute Engine.


If you can predict when the spike is going to occur, you can actually use the GAE admin API to modify the number of idle instances on the fly. We used this exact method to handle spikes when new games were released at my last job. Then a cron or delayed cloud task would reset the instance count afterward.
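
The pre-warm-then-reset idea can be sketched as a small schedule. This is not the commenter's code: `set_min_idle_instances` is a hypothetical callable standing in for whatever admin API call or deploy step changes the idle instance setting, and the times and counts are assumptions.

```python
from datetime import datetime, timedelta


def plan_prewarm(spike_start, spike_duration, warmup_lead, baseline_idle, spike_idle):
    """Return (when, idle_instance_count) actions around a predictable spike.

    Scale up `warmup_lead` before the spike so instances are already
    warm, then drop back to the baseline once the spike has passed.
    """
    if spike_idle < baseline_idle:
        raise ValueError("spike_idle should not be lower than baseline_idle")
    scale_up_at = spike_start - warmup_lead
    reset_at = spike_start + spike_duration
    return [(scale_up_at, spike_idle), (reset_at, baseline_idle)]


def apply_due_actions(plan, now, set_min_idle_instances):
    """Apply every action whose time has come; return the ones still pending.

    `set_min_idle_instances` is a hypothetical callable, not a real
    App Engine client function.
    """
    pending = []
    for when, count in plan:
        if when <= now:
            set_min_idle_instances(count)
        else:
            pending.append((when, count))
    return pending


if __name__ == "__main__":
    release = datetime(2020, 5, 13, 18, 0)  # assumed release time
    plan = plan_prewarm(release, timedelta(hours=2), timedelta(minutes=15), 2, 40)
    for when, count in plan:
        print(f"{when:%H:%M}: set idle instances to {count}")
```

In practice a cron job or delayed task would call something like `apply_due_actions` on a timer, which matches the "cron or delayed cloud task would reset the instance count" part of the comment.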


Just don't pretend that the two eggs you start with are twenty just yet.


it's easy to put them all in one basket if you only have one or two eggs



:facepalm:

And their reason is because their code is currently Python 2. To avoid some manual code refactoring to support Python 3, they shall rewrite in an entirely different language. So they'll train their entire programming staff in Go and task these new inexperienced Go programmers to do a straight port of their code base from one highly opinionated language to a different highly opinionated language.

They've certainly done an excellent job enumerating all of the very lucrative benefits of a Go rewrite, while ignoring or hand-waving away any of the challenges. It all screams version 2 syndrome.

Who knows, maybe this is a great move for them. I’m only going on their blog post about it. But at minimum they've done a poor job communicating how they've looked at this challenge objectively and without rose-tinted glasses.


I know you're joking, but they actually are doing a Go rewrite. https://engineering.khanacademy.org/posts/goliath.htm


But notably missing from the blog post is how much this stack costs to run at that scale. $100k/month? $200k/month?


Does anyone know which stack they are actually using?


Khan Academy uses Python [1], Google App Engine [2], React.js [3] and recently Go [4] among other languages [5].

They have used Backbone.js [6] and experimented with other programming languages like Kotlin [7].

Read their engineering blog [8] and you will know more and maybe learn a few things.

[1] http://engineering.khanacademy.org/posts/python-refactor-1.h...

[2] http://engineering.khanacademy.org/posts/transaction-safety....

[3] http://engineering.khanacademy.org/posts/upgrade-buttons-lin...

[4] http://engineering.khanacademy.org/posts/goliath.htm

[5] https://github.com/khan/

[6] http://engineering.khanacademy.org/posts/upgrade-buttons-lin...

[7] http://engineering.khanacademy.org/posts/kotlin-adoption.htm

[8] http://engineering.khanacademy.org/


They are now suffering a major connectivity outage:

https://status.khanacademy.org/

Unfortunate timing for the blog post to reach HN...


I haven't actively thought about Khan Academy for several years and only just remembered its existence. I do think that it's all sorts of brilliant and that's why I just signed up as a volunteer translator. I hope some of you other people here will do the same.


On a largely content based app/site, most of "scaling" comes down to caching. However you do that is up to you, but somewhere between caching at the browser layer, proxy layer, web server layer, or memcache layer, things should be fast and scalable without getting too fancy.
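
As a minimal sketch of the layering idea, the code below checks caches from nearest to farthest and only calls the origin when every layer misses. The class and function names are made up for this illustration, and in-memory dictionaries stand in for the browser, proxy, and memcache layers.

```python
import time


class TTLCache:
    """A tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._entries[key]  # expired; treat as a miss
            return None
        return value

    def set(self, key, value):
        self._entries[key] = (self.clock() + self.ttl_seconds, value)


def layered_get(key, layers, load_from_origin):
    """Check caches from nearest to farthest, then fall back to the origin.

    On a hit (or an origin load) the value is copied into every layer
    that missed, so the next request is served closer to the user.
    A value of None is treated as a miss and is never served from cache.
    """
    missed = []
    for cache in layers:
        value = cache.get(key)
        if value is not None:
            break
        missed.append(cache)
    else:
        # No layer had the key (or there are no layers): ask the origin.
        value = load_from_origin(key)
    for cache in missed:
        cache.set(key, value)
    return value
```

Short TTLs on the layers nearest the user and longer TTLs further back keep most requests away from the application servers while limiting how stale a page can get.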


Some fun facts from scaling and optimizing the dominant school management site in our country, which is used by schools, kindergartens, parents and kids. For years now, schools have not had to keep physical journals because they use this system.

The peak usually came as the end of a semester approached. Currently we must sustain 2x that peak every day, and we hit 3x on the first day "remote schools opened". Everyday traffic is currently almost 3x what it would have been, measured by requests per second received on the frontends.

We usually struggled at the end of a semester. Luckily, we started some upgrades and optimizations to handle that just before the lockdown started. As a result, we can now sustain many times the traffic we have. The hardest part wasn't adding more resources (that was a MUST for sure); it took much more effort to handle the parts that only scale vertically (SQL) and some file share issues.

So there were some intermittent issues that frustrated users, sometimes bringing the whole site down for minutes. ("Sometimes" is the trickiest part to handle.) The work included hunting down expensive queries (not so hard), issuing fewer or more efficient queries, taming the SQL plan cache, and dealing with some NTFS issues on the file share.

Some of the issues couldn't be solved solely by throwing more hardware at them.

I'm still puzzled by the file share issue on Windows. Yeah, it's not so clever to store millions of files in a single folder on an NTFS filesystem. We have an append-only share, with no deletes ever happening. Things like timestamps and short paths were disabled, we defragmented $mft and the like, but... every ~24 hours the drive would become sooo painfully slow and inaccessible. Some access denied errors would get thrown, etc. And it would continue for some minutes. Maybe 15. Sometimes more. So it's like a wave with a regular period. But the thing is, outside of that ~15-minute window, the file share works very well (with all those millions of files within a folder). They were never deleted and we didn't need to enumerate, just fetch file by file.

No, antivirus wasn't at fault, and no, the backup system wasn't messing around. Is there something NTFS may do under the hood on SSD drives? I'm aware there is a TRIM process, but as I understand it, that only matters when data is being deleted from the SSD?

We moved to rotating disks (!) and split those files across a few folders, and that got rid of those issues. But still, one folder contains way more files than you would hear anyone call healthy on NTFS.
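
One common way to keep any single folder small, sketched here as an assumption rather than as what this team did, is to derive the subfolder from a hash of the filename. Because the path is computed from the name alone, files can still be fetched one by one without enumerating directories. The function name, root path, and fan-out are hypothetical.

```python
import hashlib
from pathlib import PureWindowsPath


def sharded_path(root, filename, levels=2, chars_per_level=2):
    """Map a filename to root/<h0h1>/<h2h3>/filename using a hash prefix.

    The hash only decides the subfolder, so a file can be located
    directly by name without listing any directory.
    """
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    path = PureWindowsPath(root)
    for level in range(levels):
        start = level * chars_per_level
        path = path / digest[start:start + chars_per_level]
    return path / filename


if __name__ == "__main__":
    # Hypothetical share root and file name.
    print(sharded_path("D:/share", "report-123.pdf"))
```

With the defaults this spreads files over 256 * 256 = 65,536 folders, so 10 million files would average roughly 150 per folder.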


Love these sort of explanations on how companies and people run their infra. Great job KA!


It's a rather high-level article, low on insightful detail.


TL;DR: Load balancers and a clear policy mean the cloud works as advertised.

Seven years ago I was at a medical conference in Portland, Oregon with a panel of "experts" discussing the security and accessibility of medical record systems and wearable devices. There was a principal engineer from Intel on the panel. When someone asked about the cloud, this tall, lanky, long-bearded man with a thick accent stood up and said:

"The cloud? (chuckles) What is the cloud? Where is the cloud? Is it over here? (Points to a table) Is it over there? (points to another table). The cloud is a joke, man. It's a complete joke."

EDIT: added an anecdote for SEO. :)


is fastly the best service for caching?


I'm very happy with Cloudflare's workers so far.

You can store stuff in workers KV (sessions, images, complete static sites, etc) even interact with their global cache with an API.


I haven’t used them but I think one of the bigger benefits (for some) is that they’re running (a forked version of) varnish as their caching proxy, and you can provide your own VCL for doing “fancy” stuff.


Depends on who you ask, but... no, they are not. But I think they're pretty good from a cost/performance perspective.


from what i've heard Fastly is cheaper


If they're just referencing YouTube videos, what is there to scale up? Speedier downloads of static content and repeat visits from what is likely a relatively stable user base?


This question is not uncommon, so I really should write a blog post I can refer to. I've got another comment in this thread about this: https://news.ycombinator.com/item?id=23171877


Is there a good alternative video host or platform to YouTube? I always worry about their fickle content management/censorship practices, and also just don't like the idea of massive centralization around Google.


What is the use case, does it need to be free as in beer? Cloudflare Streaming starts at $5/month for 1000 minutes stored and $1/1000 minutes streamed. Though I have to admit I had performance problems with their built-in analytics API (and their support was unable/unwilling to look into it). I'm sure there are plenty of others around.
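
For a feel of what those two prices add up to, here is a rough estimator. It takes the figures quoted in this comment at face value and assumes simple proportional billing with no rounding to 1000-minute blocks and no free tier; actual billing may differ, and the example volumes are made up.

```python
def monthly_stream_cost(minutes_stored, minutes_streamed,
                        storage_price_per_1000=5.0, delivery_price_per_1000=1.0):
    """Estimate a monthly bill from the two prices quoted in the comment.

    Assumes proportional billing (no block rounding, no free tier),
    which is a simplification made for this sketch.
    """
    storage = minutes_stored / 1000 * storage_price_per_1000
    delivery = minutes_streamed / 1000 * delivery_price_per_1000
    return storage + delivery


if __name__ == "__main__":
    # Assumed library: 2,000 minutes of video, watched 50,000 minutes a month.
    print(f"${monthly_stream_cost(2_000, 50_000):,.2f} per month")
```

Delivery dominates quickly: in this made-up case storage is $10 of the total and streaming is the rest.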


Have you tried Vimeo? The community is not great, but the player is decent.


2.5x in a week is news? I have worked on many things from viral apps, blogs that get picked up by large news orgs, etc that need to scale sometimes 100x or more in a day.


1 to 100 is easier than 100M to 250M (made up numbers, but you get the point).


It depends on what you have to scale. If this is just PHP rendering, this is still relatively easy to do.


As an extreme, can you imagine if Facebook doubled its usage in a week? It depends on what the site does and what floor it's starting from.

We (Khan Academy) are certainly not at Facebook's scale, but we do run a site with a lot of dynamic behavior and millions of monthly users. If we had our own bare metal in data centers, we either would have been way overprovisioned or would have been scrambling to keep up.


You mean like scaling from 1 user to 100 users?


No, something like 1,000 to 100,000



