You have thousands of dollars, they have tens of billions: $1,000 vs $10,000,000,000. They have 7 more zeros than you, which is one less zero than the scale difference in users: 1 user (you) vs 700,000,000 users (OpenAI). So they've managed to squeeze out at least one or two zeros' worth of efficiency at scale versus what you're doing.
Also, you CAN run local models that are as good as GPT-4 was at launch on a MacBook with 24 gigs of RAM: https://artificialanalysis.ai/?models=gpt-oss-20b%2Cgemma-3-...
You can knock off a zero or two just by time-shifting the 700 million distinct users across a day/week and accounting for the mere minutes of compute time they actually use in each interaction. So they might not see peaks higher than 10 million active inference sessions at the same time.
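Back-of-the-envelope, with numbers I'm assuming rather than anything published (a few minutes of actual compute per user per day, a 5x peak-to-average ratio):

    # Hypothetical figures to illustrate the time-shifting argument above.
    users = 700_000_000            # claimed user count
    minutes_per_user_per_day = 3   # assumed actual compute time per user
    peak_to_average = 5            # assumed peak load vs the daily average

    user_minutes_per_day = users * minutes_per_user_per_day
    average_concurrent = user_minutes_per_day / (24 * 60)
    peak_concurrent = average_concurrent * peak_to_average

    print(f"average concurrent sessions: {average_concurrent:,.0f}")  # ~1.5 million
    print(f"peak concurrent sessions:    {peak_concurrent:,.0f}")     # ~7 million

So even a generous peak lands in the ballpark of 10 million simultaneous sessions, not 700 million.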
Conversely, as a self-hosted user you can't do the same thing: you can't really bank your idle compute for a week and consume it all in a single serving, hence the much more expensive local hardware needed to reach the peak generation rate you want.
During times of high utilization, how do they handle more requests than they have hardware? Is the software granular enough that they can round-robin the hardware per token generated? UserA token, then UserB, then UserC, back to UserA? Or is it more likely that everyone goes into a big FIFO, processing the entire request before switching to the next user?
I assume the former has massive overhead, but maybe it is worthwhile to keep responsiveness up for everyone.
Inference is essentially a very complex matrix algorithm run repeatedly on its own output: each time, the input matrix (context window) is shifted and the newly generated tokens are appended to the end. That makes it easy to multiplex all active sessions over limited hardware; a typical server can hold hundreds of thousands of active contexts in main system RAM, each less than 500KB, and ferry them to the GPU nearly instantaneously as required.
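A toy sketch of the per-token round-robin idea from the question above. Real servers batch many contexts into every forward pass (continuous batching) and keep KV caches on the GPU rather than looping over users one at a time, but the interleaving looks roughly like this:

    from collections import deque

    def generate_next_token(context):
        # Stand-in for one forward pass of the model: returns one new token.
        return f"<tok{len(context)}>"

    def serve_round_robin(sessions, max_len=64):
        # sessions: dict of user_id -> list of tokens (the context).
        # Each loop iteration produces one token for one user, so no single
        # long request can starve the others.
        active = deque(sessions)
        while active:
            user = active.popleft()
            ctx = sessions[user]
            token = generate_next_token(ctx)   # ferry ctx to the GPU, get a token back
            ctx.append(token)                  # append to the context, as described above
            if token != "<eos>" and len(ctx) < max_len:
                active.append(user)            # not finished: back of the queue

    sessions = {"userA": ["hi"], "userB": ["hello"], "userC": ["hey"]}
    serve_round_robin(sessions, max_len=8)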
The context after application of the algorithm is just text, something like 256k input tokens, each token representing a group of roughly 2-5 characters, encoded into 18-20 bits.
The active context during inference, inside the GPUs, explodes each token into a 12,288-dimensional vector, so roughly 4 orders of magnitude more VRAM, and is combined with the model weights, gigabytes in size, across multiple parallel attention heads. The final result is just more textual tokens, which you can easily ferry around main system RAM and send to the remote user.
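Rough arithmetic for that blow-up, assuming a ~200k-entry vocabulary, the 12,288 hidden dimensions quoted above, and fp16 activations (my assumptions, not published figures):

    import math

    vocab_size  = 200_000   # assumed; gives ~18 bits per token ID
    context_len = 256_000   # tokens, as above
    hidden_dim  = 12_288    # hidden size quoted above
    bytes_fp16  = 2

    bits_per_token = math.ceil(math.log2(vocab_size))       # 18 bits
    text_bytes     = context_len * bits_per_token / 8        # ~576 KB as token IDs
    act_bytes      = context_len * hidden_dim * bytes_fp16   # ~6.3 GB as activations

    print(f"context as token IDs:   {text_bytes / 1e3:,.0f} KB")
    print(f"context as activations: {act_bytes / 1e9:,.1f} GB "
          f"(~{act_bytes / text_bytes:,.0f}x larger)")

That ratio of roughly 10,000x is where the "4 orders of magnitude" comes from.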
First of all, they never “handle more requests than they have hardware.” That’s impossible (at least as I’m reading it).
The vast majority of usage is via their web app (and free accounts, at that). The web app defaults to “auto” selecting a model. The algorithm for that selection is hidden information.
As load peaks, they can divert requests to different levels of hardware and less resource-hungry models.
Only a very small minority of requests actually specify the model to use.
There are a hundred similar product design hacks they can use to mitigate load. But this seems like the easiest one to implement.
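Pure speculation about what that "auto" selection might look like internally; the model names and thresholds below are made up for illustration:

    def pick_model(user_requested, current_load):
        # Hypothetical "auto" routing: divert to cheaper models as load rises.
        # current_load is utilization of the expensive tier, 0.0-1.0.
        if user_requested:                    # the small minority that pins a model
            return user_requested
        if current_load < 0.7:
            return "big-expensive-model"
        if current_load < 0.9:
            return "mid-tier-model"
        return "small-cheap-model"            # peak load: everyone gets the light model

    print(pick_model(None, 0.95))                 # cheap model at peak
    print(pick_model("pinned-model-name", 0.95))  # pinned requests keep their model

Since the selection logic is hidden, a downgrade like this would be invisible to most free-tier users.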
> But this seems like the easiest one to implement.
Even easier: Just fail. In my experience the ChatGPT web page fails to display (request? generate?) a response between 5% and 10% of the time, depending on time of day. Too busy? Just ignore your customers. They’ll probably come back and try again, and if not, well, you’re billing them monthly regardless.
I don't usually see responses fail. But what I did see shortly after the GPT-5 release (when servers were likely overloaded) was the model "thinking" for over 8 minutes. It seems like (if you manually select the model) you're simply getting throttled (or put in a queue).
In addition to stuff like that, they also handle it with rate limits: the message Claude would throw almost all the time saying "demand is high so you have automatically been switched to concise mode", making batch inference cheaper for API customers to nudge them toward that instead of real-time replies, the site erroring out during periods of high demand, prioritizing business customers during a rollout, the service simply degrading. It's not like any provider has a track record of effortlessly keeping responsiveness super high. Usually it's more the opposite.
It's not special, and fine-tuning a foundation model isn't destructive when you have checkpoints. LoRA lets you approximate the end result of a full fine-tune while saving memory.
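A minimal sketch of the LoRA idea, just to show where the memory savings come from (numpy only; real implementations live in libraries like PEFT, and the layer sizes here are arbitrary):

    import numpy as np

    d_in, d_out, rank = 4096, 4096, 8            # rank << d, hence the savings

    W = np.random.randn(d_out, d_in) * 0.02      # frozen pretrained weight, never updated
    A = np.random.randn(rank, d_in) * 0.01       # trainable low-rank factor
    B = np.zeros((d_out, rank))                  # starts at zero, so the adapter is a no-op

    def lora_forward(x, alpha=16):
        # y = Wx + (alpha/rank) * B(Ax): the original layer plus a low-rank correction.
        return W @ x + (alpha / rank) * (B @ (A @ x))

    y = lora_forward(np.random.randn(d_in))

    # Only A and B receive gradient updates: rank*(d_in + d_out) params per layer
    # instead of d_in*d_out, and the base checkpoint W stays untouched on disk.
    print(f"full fine-tune params (this layer): {W.size:,}")          # 16,777,216
    print(f"LoRA params (this layer):           {A.size + B.size:,}") # 65,536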
I know the basilisk is well trodden ground, but it comes up in conversations in my day to day with people who I'm relaying AI concepts to and I'd like to put my thoughts into words. I haven't read anything that makes my exact argument before, so I'm making my first substack contribution now.
Mentioning philosophy on the internet is a good way to start arguments, so if you're trying to do that you'll probably succeed.
There is a more concise phrasing for the idea you seem to be getting at: decision theory has to account for the fact that decision makers in the real world have finite resources. There are an infinite number of potential Roko's Basilisks (all existing religions plus all theoretically describable ones, of which Roko's is one). Since the decision maker doesn't have enough resources to deal with all of them, they end up having to ignore the basilisk in practice.
There isn't really a question about alignment in the Roko scenario; the argument falls apart a lot earlier than that. The problem is that there are an infinite number of unaligned AIs with different punishment schemes, which makes the threat of any given one worth devoting zero resources to in practice, including the time spent arguing about or even considering it.
I'd like to introduce you to my new invention, Roko's Cockatrice.
If Roko's Basilisk is real, then it stands to reason that there will eventually be resistance to it. Presumably this resistance, aware of the promise that the Basilisk is supposed to have made towards its followers, will target them, and ensure that they are never uploaded or recreated in virtuality.
In short, Roko's Cockatrice cancels out the Basilisk. Even if the Basilisk never comes into existence, the immense risk of it arising—and the fact that everyone knows it could happen—ensures that, if the cult of the Basilisk grows enough, it will be ostracised and stamped out by the majority. Believers are the perfect scapegoat for an authoritarian culture, and a legitimate risk to a rational one—after all, wouldn't True Believers eventually strive to bring about the Basilisk on their own, so they can have the rewards they've promised themselves?
At this point you're probably thinking—"Hey, that's just Pascal's Mugging with extra steps!"—fine. But it's way funnier than the way the article presented the idea.
Stay tuned for Roko's Gorgon, which is just a T-800 sent back in time to kill Roko before he makes the original forum post. (Given enough time, and the possibility of time travel being invented, it has to happen eventually, right?)
> (Given enough time, and the possibility of time travel being invented, it has to happen eventually, right?)
I like to imagine that if time travel were real, we wouldn't necessarily know what was changed, because it would have been changed before we knew about it. Time marches on, so a clone or fork of the universe is made and time just carries on from whatever the tamper was.
So the fact that we're discussing it means either time travel exists but Roko's Basilisk never will, or time travel does not exist and the actuality of Roko's Basilisk is unknowable.
I was going to say any permutation of those four things, but I am very tired and unsure whether that is a weaker argument. See you.
Too abstract. Try Battlestar Galactica 'reimagined'. Stories sell better, especially to all the Incels fapping to Cylon Caprica 6. Or their 'resurrection technology' if in range of a receiver. Or their 'belief' that they are the true life, because of perfection.
Otherwise one could just brood about Supervolcanoes, Solar Storms, Gamma Ray Bursts, Killer Asteroids, all sorts of bad weather, Pandemics, blocks of frozen shit from defective passenger jets plunging out of the skies, fucking Islam...
Presumably the artist is a human who directly or indirectly paid money to view a film containing an archaeologist with the whip.
I don't think this is about reproduction as much as how you got enough data for that reproduction. The RIAA sent people to jail and ruined their lives for pirating. Now these companies are doing it and being valued at hundreds of billions of dollars.
A human friend can get tired; there are only so many requests he or she can fulfill, and only at some maximum rate. Even a team of human artists has a relatively low limit.
But Gen AI has very high limits and speeds, and it never gets tired. It seems unfair to me.
Yeah ok, I get not wanting to do the grunt work. I take classes for fun. But if it's not for a credential and I don't want to do coursework, I'm just going to buy a textbook.
Why are you upset about how a frame is generated? We're not talking about free range versus factory farming. Here, a frame is a frame, and if your eye can't tell the difference then it's as good as any other.
"a frame is a frame" - of course it isn't that makes no sense. The point of high frame rate is to have more frames that render the game state accurately for more subdivisions
Otherwise you could just duplicate every frame 100 times and run at 10k fps
Or hell, just generate a million black frames every second; a frame's a frame, right?
let me just apply my super proprietary machine learning architecture... ah yes it's done, behold, I can generate 3.69 trillion frames per second, because I compressed each frame to a single bit and that's how fast the CPU's memory bus is
The main point of more fps is lower latency. If you're getting 1000 fps but they're all AI-generated from a single real frame per second, your latency will be 500ms and the experience will suck.
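The arithmetic behind that, under the (deliberately extreme) assumption that your input only affects the next real frame:

    real_fps      = 1       # frames actually rendered from game state
    generated_fps = 1000    # frames shown, mostly interpolated/extrapolated

    # Input can only land on a real frame, so on average you wait half a
    # real-frame interval before anything you did shows up on screen.
    avg_input_latency_ms = (1000 / real_fps) / 2
    frame_interval_ms    = 1000 / generated_fps

    print(f"perceived smoothness: {frame_interval_ms:.0f} ms between frames")  # 1 ms
    print(f"avg input latency:    {avg_input_latency_ms:.0f} ms")              # 500 ms

Buttery smooth motion, unplayable input lag.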