A lot of people seem to think of Khan Academy as a bunch of videos. Many have also seen the exercises and articles. Those things are all pretty static (though it gets more complex when you consider how much content there is and how many languages it's localized into).
There's a whole bunch of dynamic behavior around that static content: keeping track of progress to tell a learner how they're doing and to help recommend the next place to go in the content; reporting on progress to parents and teachers; letting teachers create assignments and manage their classrooms; bubbling up information to school districts.
Content pages have discussions and clarifications. There are notifications to tell students about new assignments, for example.
There are connections to tests, like the SAT prep or integration with the MAP test, which involve connecting our accounts with external accounts in order to help students based on those test results.
And a bunch of other stuff that isn't coming to mind right now because I'm just naming things off the top of my head.
Doing all these things across a user base of millions of monthly users can get quite involved.
Unfortunately, I don't think I've seen this publicly reported and I'm not in a position to decide to make that public.
I will say that the amount we pay for our infrastructure, especially right now, is reduced by generous support from both Fastly and Google. It still remains a substantial expense after that support.
One final note on cost: we are actively working on a Python to Go transition[1] that will reduce our hosting costs, among other benefits.
Yes. In this case, they share the fact they used common sense.
This is not to say that all of the other major re-architecture blog posts are flawed. Still, it's good to have confirmation that scaling with cloud-vendor-specific tools is as easy as advertised.
Yes, for the same reason that researchers should publish null results: all of the data is useful. Getting confirmation that a particular formulation of a strategy works or does not work is valuable in and of itself, regardless of the exact outcome. The only reason it wouldn't be valuable would be if there were already a plethora of similar reports of successfully scaling this solution, which there are not, so their experience is very welcome.
To turn it around, why would you ever want someone who took the time to write down their experience not to share it with the world? It's not like you must read it; its existence doesn't cost you anything. But if someone's experience adds to the library of human knowledge, even a little bit, why would one try to reject that?
Yes, with the caveat that you may have to check that they're actually capable of handling the load and you don't get a surprise notice that "uh we can't do this, you're on your own".
But they might stop at any time, at least without charging. With an apparent internal push for some services to become self-sustaining (see the Google Maps API, reCAPTCHA), YouTube embeds might be next.
Khan Academy today supports serving video outside of YouTube, which is blocked in some schools. We could essentially flip a switch to not use YouTube, but the cost would be substantial because those videos go Fastly->S3, so anything not in cache is going to result in S3 egress charges.
Something to consider in your tooling for this is to target Backblaze's B2 object store, which supports an S3 compatibility layer [1] (so you shouldn't need to change too much code). I'm unsure if Fastly supports B2 in this configuration yet though.
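For anyone weighing that option: the appeal of the S3 compatibility layer is that the only change to typical boto3-style code is the endpoint URL. A minimal sketch of what that looks like (the endpoint, key names, and bucket are hypothetical placeholders, not real credentials):

```python
# Sketch: for S3-compatible stores like Backblaze B2, you point your existing
# S3 client at a different endpoint and everything else stays the same.

def s3_client_kwargs(endpoint_url: str, key_id: str, secret: str) -> dict:
    """Build the kwargs you would pass to boto3.client("s3", **kwargs)
    to target an S3-compatible endpoint instead of AWS."""
    return {
        "endpoint_url": endpoint_url,
        "aws_access_key_id": key_id,
        "aws_secret_access_key": secret,
    }

kwargs = s3_client_kwargs(
    "https://s3.us-west-002.backblazeb2.com",  # B2's region-specific endpoint
    "<keyID>",
    "<applicationKey>",
)
# With boto3 installed, boto3.client("s3", **kwargs).upload_file(...) and
# friends then behave the same as they would against AWS S3.
```

The CDN side is the open question: the origin swap is easy, but you'd want to confirm your CDN can authenticate to the new origin.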
Yeah, I'm sure that if we needed to serve more videos through our "fallback player", we'd investigate more cost effective ways to run it. With thousands of videos, it'd be a pain to move … but at least that would be a one time pain!
Something else that can help with both cost and quality is EasyBroadcast's viewer-assisted streaming, used in addition to Fastly. Adding a JS snippet to the player pages enables CDN offloading by letting each viewer act as a potential source, based on QoE/QoS metrics.
Disclaimer: I am one of the cofounders. Happy to help.
It seems unlikely Khan Academy doesn't have the technical competency to deploy their video content to an alternate host rapidly, whether that be an object store (Backblaze) or dedicated servers with very cheap bandwidth (Hetzner, OVH, and similar), perhaps even using PeerTube.
There's a reason other non-profits like Wikipedia and the Internet Archive run their own hardware, networking, and connectivity to transit providers. And before the "doing that is expensive!" argument comes up, note how expensive having someone else do these things is. Lots of margin is built into cloud services.
I am loosely involved with one of the language teams. This language team, and others like it, are groups of volunteers, not Khan Academy employees. I don't have all the details, but I know the teams are responsible for creating the content and uploading it to YouTube. Changing to an alternate host would require more coordination than you might think.
My contribution is that I am helping automate video uploads to the team's YouTube channel, which is allowed to carry "Khan Academy {language}" branding.
Shameless plug: If any google or youtube employees can help me raise our API quota please get in touch!!! :)
Which is usually known in the industry as "good problems to have." If you're staring down a huge bill because your company is blowing up you pay it, throw a party, and then figure out how to reduce costs once the hangover wears off.
I'd rename it to "pay someone else to do it". In the case of YouTube, even if the hosting is free, YT still makes a profit (ads, tracking, etc.) so it's a win-win for all.
Not sure why you're being downvoted. Scaling systems at companies with a lot of money is very different from designing systems to be reliable and performant with reducing spend as a real goal.
Some companies need to take a much harder look at their spend than others regardless of where you deploy.
"Just use external video hosting and deploy a CDN for static content" sounds like trivial advice, but I constantly see businesses here and there not following it. So, as the Russian proverb says, "repetition is the mother of learning".
That's not terribly helpful to the large portion of people who don't have the requisite knowledge or experience to make an educated guess, or to narrow down a few educated guesses.
It's sort of like someone asking how to rebuild the stock carburetor on a 1967 Mustang and replying, "Make an educated guess about the design of YouTube's serving infrastructure, and I'm pretty certain you're right." For the vast majority of people, that's no use at all.
This is a good example of how cloud tools make this kind of scaling easy.
The trickier part can be the cost, which this piece notes will increase roughly linearly with the number of users. If Khan Academy is to stay free, I think this means those who are generous are going to need to keep giving to keep it that way. Let's step up, everyone.
Wouldn’t you generally deploy your services with a certain safety margin? I am pretty sure most systems I work on could handle 10x pretty easily. Beyond that it would get hard, but 2.5x seems like a pretty normal, expected fluctuation.
Provisioning for 2.5x of peak load would be pretty expensive and, in most cases, overly cautious (good luck selling 10x to your financiers!). For a large service that wants to handle unexpected spikes, I'd expect something more like a 1.1-1.5x margin.
As their blog post says, they weren't unprepared. They had the means to grow and used it. Burning money before that time wasn't necessary.
2.5x is a surprisingly small jump for having all of your brick-and-mortar competitors shut down for an indeterminate period. Either Khan Academy already had amazing penetration into the education space, or the follow-through rate for kids educating at home is abysmally low.
Disclaimer: I don’t have children, so I have no real world experience with Khan.
There's a lot going on in your observation, and this is all speculation on my part (even though I am a Khan Academy employee).
We did have quite a bit of usage and awareness among schools already before the shutdowns started. Couple that with there being many options for teaching online… I wouldn't be surprised if a lot of schools just switched to having their teachers attempt to do their normal teaching via Zoom (which sounds really hard to me!). Many schools had contracts of various sorts with other online learning platforms.
Some schools or classes haven't had great follow through rates, which is unfortunate, but educators all over have had to quickly adjust. I suspect that more robust plans will be in place by the fall, given how much uncertainty there is for fall classes. Khan Academy is, at least, an always-free resource that's there for people if they need it.
That 2.5x is starting from a large base, and there's also a lot of activity in online education generally.
"I wouldn't be surprised if a lot of schools just switched to having their teachers attempt to do their normal teaching via Zoom"
This is exactly what is happening in our school district, and it is a big failure. That, and teachers emailing their lesson plans for parents to print, or having parents go to the school and pick up printed packets. We then have the joy of taking photos of the completed work and emailing those back to the teachers.
It is extremely inefficient and I have already informed our school that we will not be doing that if we are stuck in this scenario come fall. We will be using Khan for math and other online learning platforms for LA.
It's also coming up to exam season and summer holidays in a lot of places. I'm seeing the same with my product. We run it internationally, so there's a spectrum of usage changes depending on the country. Some are the same or less, some are 2-3x more but only on certain days of the week. One was 1000x more, but only for one week, and is now back to normal. It's been pretty crazy.
All the kids I know of (junior high / high school) already actively participate in Khan Academy - on some topics as a school requirement, on other topics at their own discretion because they're already accustomed to the platform.
I can see myself using it to learn things that my (admittedly lousy) teachers weren't able to teach. But, is it really true that teachers are straight up assigning Khan Academy material as part of the course requirements? That's interesting to me and I had no idea that was going on.
Khan Academy is actively soliciting donations right now, as is referenced in the footnote to the article:
> Khan Academy's increased usage has also increased our hosting costs, and we're a not-for-profit that relies on philanthropic donations from folks like you.
I love it when people have both the inclination and the political pull to keep an environment super minimalist like this. Fastly to AppEngine is a blazing fast combo and so well sorted to "just work".
Khan Academy is great, don't get me wrong, but what I see is an engineering blog that doesn't force HTTPS and an ad for Google products.
All Khan Academy did was optimize code, set up the partner console properly, and pay the (likely enormous) bill.
As I read in another comment here: "How to scale: Make it somebody else's problem" + pay the bill.
Jokes aside, this is the fun part of hosted software, and I'm glad to hear the "things don't have to be so hard" side of things. Hope it continues working out!
If you're already using GCP, my general advice for new projects is almost always some form of "just throw it on AppEngine". No, you don't need multi-region deployments. No, you don't need 32TB of memory per instance. No, you do not need kubernetes. No, istio is not going to solve this. No, you're not hosting your own kafka cluster.
I've found devs are always trying to over-engineer complex solutions to dead simple problems. Just let Google do it and get some sleep.
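To make that concrete: for a simple Python service, the whole deployment can be one config file. A sketch of a minimal `app.yaml` for the standard environment (the scaling numbers are illustrative, not a recommendation):

```yaml
runtime: python39

automatic_scaling:
  min_instances: 0      # scale to zero when idle
  max_instances: 20     # cap runaway cost

handlers:
  - url: /static
    static_dir: static  # served from Google's edge, not your instances
  - url: /.*
    script: auto        # everything else goes to your app
```

Then `gcloud app deploy` and provisioning, TLS, and scaling are Google's problem.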
For new projects sure, but you need an escape hatch. App Engine costs can spiral out of control. I know of at least one startup that was pretty successful in finding product market fit but sunk their own ship because they weren't able to migrate off of App Engine quickly enough.
In larger organizations you’re rarely building an entirely new web app from scratch. Yes, that’s a more complex endeavor. A lot of the problems boil down to “sit on a queue, grab some data, mix in some other data, send it somewhere else”. But GAE at the very least solves the scaling and cert management.
Starting simple isn't putting all your eggs in one basket. In fact it's such a common recommendation (and so commonly forgotten about) that there are several phrases designed just to teach this one lesson:
KISS - Keep it simple, silly
YAGNI - You ain't gonna need it
MVP - Minimum viable product
Overengineering
Real artists ship
I'm sure there's a ton more examples but that's just off the top of my head. Point is, until you know that you need high availability and multi-zone disaster recovery and etc etc, just engineer it for the problems you actually have.
People who count their chickens before they are hatched act very wisely because chickens run about so absurdly that it's impossible to count them accurately... – Oscar Wilde, letter to R. Ross, 31 May 1898
AppEngine doesn't scale up fast; a gradual traffic increase fits AE better.
We have 1000x spikes that subside within a minute. A stable set of 10 boxes with average hardware (Compute Engine) handles them more reliably than 100+ GAE Go instances, which would hammer everything downstream. GAE costs would also be insane compared to Compute Engine.
If you can predict when the spike is going to occur, you can actually use the GAE admin API to modify the number of idle instances on the fly. We used this exact method to handle spikes when new games were released at my last job. Then a cron or delayed cloud task would reset the instance count afterward.
And their reason is that their code is currently Python 2. To avoid some manual refactoring to support Python 3, they are instead rewriting in an entirely different language. So they'll train their entire programming staff in Go and task these newly minted, inexperienced Go programmers with a straight port of the code base from one highly opinionated language to a different highly opinionated language.
They've certainly done an excellent job enumerating all of the very lucrative benefits of a Go rewrite, while ignoring or hand-waving away any of the challenges. It all screams version 2 syndrome.
Who knows, maybe this is a great move for them. I’m only going on their blog post about it. But at minimum they've done a poor job communicating how they've looked at this challenge objectively and without rose-tinted glasses.
I haven't actively thought about Khan Academy for several years and only just remembered its existence. I do think it's all sorts of brilliant, and that's why I just signed up as a volunteer translator. I hope some of you will do the same.
On a largely content based app/site, most of "scaling" comes down to caching. However you do that is up to you, but somewhere between caching at the browser layer, proxy layer, web server layer, or memcache layer, things should be fast and scalable without getting too fancy.
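Most of that layering falls out of sending the right Cache-Control header per kind of content, so browsers, proxies, and the CDN can each do their job. A minimal sketch (the path prefixes and TTLs are illustrative, not from the article):

```python
def cache_control(path: str) -> str:
    """Pick a Cache-Control policy for a mostly-static content site."""
    if path.startswith("/static/"):
        # Fingerprinted assets (hash in filename): cache "forever" everywhere.
        return "public, max-age=31536000, immutable"
    if path.startswith("/api/"):
        # Per-user dynamic data: never cache in shared layers.
        return "private, no-store"
    # Content pages: browsers keep them briefly; s-maxage lets shared
    # caches (a CDN or proxy) hold them longer and absorb most traffic.
    return "public, max-age=60, s-maxage=600"

asset_policy = cache_control("/static/app.3f9c.js")
page_policy = cache_control("/math/algebra")
```

With headers like these, a CDN in front ends up serving the bulk of requests without the origin getting fancy at all.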
Some fun facts from scaling and optimizing the dominant school management site in our country, used by schools, kindergartens, parents, and kids. As of some years ago, schools may no longer keep physical journals, because they use this system.
Peak usually came as semester end approached. Currently we must sustain 2x that peak daily, and we hit 3x of it on the first day "remote schools" opened. Everyday traffic is now almost 3x what it would otherwise have been, measured by requests per second on the frontends.
We usually struggled at the end of a semester. Luckily, we started some upgrades and optimizations to handle that just before the lockdown began. In the end, we can now sustain many times the traffic we have. The hardest part wasn't adding more resources (that was a must, for sure); much more effort went into the parts that scale vertically (SQL) and some file-share issues.
There were some intermittent issues frustrating users, sometimes bringing the whole site down for minutes. ("Sometimes" is the trickiest part to handle.) That included hunting down expensive queries (not so hard), calling queries less often or making them more optimal, taming the SQL plan cache, and dealing with some NTFS issues on the file share.
Some of the issues couldn't be solved solely by throwing more hardware at them.
I'm still puzzled by the file-share issue on Windows. Yeah, it's not so clever to store millions of files in a single folder on an NTFS filesystem. We have an append-only share, with no deletes ever happening. Stuff like last-access timestamps and short (8.3) paths was disabled, we defragmented the $MFT and the like, but... every ~24 hours the drive would become painfully slow and inaccessible. Some access-denied errors get thrown, etc. And it continues for some minutes, maybe 15, sometimes more. Between those episodes it works, so it's like a wave with some period. Outside of this window, the file share works very well (with all those millions of files in one folder). Files are never deleted, and we never need to enumerate them; we just fetch file by file.
No, the antivirus wasn't at fault, and no backup system was messing around. Is there stuff NTFS may do under the hood on SSD drives? I'm aware there is a TRIM process, but as I understand it, that only comes into play when data is deleted from the SSD.
We moved to rotating disks (!) and split the files across a few folders, and the issues went away. But still, one folder contains way more files than anyone would call healthy on NTFS.
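The usual fix for the too-many-files-per-folder problem is to fan files out into hash-prefix subfolders so no single directory holds millions of entries. A quick sketch (two hex levels gives 65,536 buckets; the names are illustrative):

```python
import hashlib
from pathlib import PurePosixPath

def sharded_path(root: str, filename: str) -> PurePosixPath:
    """Map a filename to root/ab/cd/filename, where ab/cd come from the
    MD5 of the name, keeping any one folder to a manageable size."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    return PurePosixPath(root) / digest[:2] / digest[2:4] / filename

p = sharded_path("share", "report-2020-05.pdf")
# The mapping is deterministic, so "fetch file by file" lookups need no
# index or enumeration - a good match for an append-only, no-delete share.
```

The hash is only for bucketing, not security, so MD5 (or anything fast and well-distributed) is fine here.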
TL;DR: Load balancers and a clear policy mean the cloud works as advertised.
Seven years ago I was at a medical conference in Portland, Oregon with a panel of "experts" discussing the security and accessibility of medical record systems and wearable devices. There was a principal engineer from Intel on the panel. When someone asked about the cloud, this tall, lanky, long-bearded man with a thick accent stood up and said:
"The cloud? (chuckles) What is the cloud? Where is the cloud? Is it over here? (Points to a table) Is it over there? (points to another table). The cloud is a joke, man. It's a complete joke."
I haven’t used them, but I think one of the bigger benefits (for some) is that they run (a forked version of) Varnish as their caching proxy, and you can provide your own VCL for doing “fancy” stuff.
If they're just referencing YouTube videos, what is there to scale up? Speedier downloads of static content and repeat visits from what is likely a relatively stable user base?
This question is not uncommon, so I really should write a blog post I can refer to. I've got another comment in this thread about this: https://news.ycombinator.com/item?id=23171877
Is there a good alternative video host or platform to YouTube? I always worry about their fickle content management/censorship practices, and also just don't like the idea of massive centralization around Google.
What is the use case? Does it need to be free as in beer? Cloudflare Streaming starts at $5/month per 1,000 minutes stored and $1 per 1,000 minutes streamed. Though I have to admit I had performance problems with their built-in analytics API (and their support was unable/unwilling to look into it).
Sure there are plenty around.
2.5x in a week is news? I have worked on many things (viral apps, blogs that get picked up by large news orgs, etc.) that sometimes need to scale 100x or more in a day.
As an extreme, can you imagine if Facebook doubled its usage in a week? It depends on what the site does and what floor it's starting from.
We (Khan Academy) are certainly not at Facebook's scale, but we do run a site with a lot of dynamic behavior and millions of monthly users. If we had our own bare metal in data centers, we either would have been way overprovisioned or would have been scrambling to keep up.
I mean, being able to scale that easily is a great thing, but is there anything worth sharing with the world in their case?