
I think it's interesting that they've benchmarked it against an array of standardized tests. Seems like LLMs would be particularly well suited to this kind of test by virtue of it being simple prompt:response, but I have to say...those results are terrifying. Especially when considering the rate of improvement. bottom 10% to top 10% of LSAT in <1 generation? +100 pts on SAT reading, writing, math? Top 1% In GRE Reading?

What are the implications for society when general thinking, reading, and writing become like Chess? Even the best humans in the world can only hope to be 98% accurate in their moves (the idea of 'accuracy' here only exists because we have engines that know, unequivocally, the best move), and only when playing against other humans - there is no hope of defeating even the less advanced models.

What happens when ALL of our decisions can be assigned an accuracy score?



Not sure what happens, but I will say that human chess is more popular than ever even though everyone knows that even the best humans are hopelessly terrible compared to the leading engines.

Something else that comes to mind is running. People still find running meaningful and compelling even though we have many technologies, including autonomous ones, that are vastly better at moving us and/or themselves through space quickly.

Also, the vast majority of people are already hopelessly worse than the best at even their one narrow main area of focus. This has long (always?) been the case. Yet people still find meaning and pleasure in being the best they can be even when they know they can never come close to hanging with the best.

I don't think PSYCHOLOGICALLY this will change much for people who are mature enough to understand that success is measured against your potential/limitations and not against others. Practically, of course, it might be a different question, at least in the short term. It's not that clear to me that the concept of a "marketable skill" has a future.

"The Way of the Samurai is found in death...To say that dying without reaching one's aim is to die a dog's death is the frivolous way of sophisticates. When pressed with the choice of life or death, it is not necessary to gain one's aim." - from Hagakure by Yamamoto Tsunetomo, as translated by William Scott Wilson.


Assuming they trained this LLM on SAT/LSAT/GRE prep materials, I would totally expect they could get it this good. It's like having benchmark-aware code.

I think the whole concept of standardized tests may need to be re-evaluated.


> I would totally expect they could get it this good.

But would you have expected an algorithm to score 90th percentile on the LSAT two years ago? Our expectations of what an algorithm can do are being upended in real time. I think it's worth taking a moment to try to understand what the implications of these changes will be.


Yes. Being very familiar with the LSAT and being familiar enough with ML’s capability for finding patterns in volumes of similar data, I absolutely would have.

These LLM’s are really exciting, but benchmarks like these exploit people’s misconceptions about both standardized tests and the technology.


From the paper

> We tested GPT-4 on a diverse set of benchmarks, including simulating exams that were originally designed for humans.3 We did no specific training for these exams. A minority of the problems in the exams were seen by the model during training; for each exam we run a variant with these questions removed and report the lower score of the two. We believe the results to be representative. For further details on contamination (methodology and per-exam statistics), see Appendix C.


I think you're right, and that test prep materials were included in the dataset, even if only by accident. Except that humans have access to the same test prep materials, and they fail these exams all the time. The prep materials are just that, preparatory. They're representative of the test questions, but the actual test has different passages to read and different questions. On top of that, the LSAT isn't a math test with formulas where you just substitute different numbers in. Which is to say, the study guides are good practice, but passing the test on top of that represents having a good command of the English language and an understanding of the subject materials.

It's not the same as the Nvidia driver having code that says "if benchmark, cheat and don't render anything behind you because no one's looking".


Humans fail because they can't review the entirety of test prep, can't remember very much, and have a much smaller number of “parameters” to store info in.

I would say LLMs store parameters that are quite superficial and don’t really get at the underlying concepts, but given enough of those parameters, you can kind of cargo-cult your way to an approximation of understanding.

It is like reconstructing the Mandelbrot set at every zoom level from deep learning. Try it!


They mention in the article that other than incidental material it may have seen in its general training data, they did not specifically train it for the tests.


The training data is so large that it incidentally includes basically anything that Google would index, plus the contents of as many thousands of copyrighted works as they could get their hands on. So that would definitely include some test prep books.


They seem to be taking this into account: We did no specific training for these exams. A minority of the problems in the exams were seen by the model during training; for each exam we run a variant with these questions removed and report the lower score of the two. We believe the results to be representative. (this is from the technical report itself: https://cdn.openai.com/papers/gpt-4.pdf, not the article).


By the same token, though, whatever test questions and answers it might have seen represent a tiny bit of the overall training data. It would be very surprising if it selectively "remembered" exact answers to all those questions, unless it was specifically trained repeatedly on them.


If it's trained on material scraped from the web, I imagine it would include all the test prep sites and forums.


Could they not have easily excluded any page with terms like LSAT? I’m sure it wouldn’t catch everything but it would probably be close.


Totally, there's no way they removed all the prep material as well when they were trying to address the "contamination" issue with these standardized tests:

> for each exam we run a variant with these questions removed and report the lower score of the two.

I think even with all that test prep material, which is surely helping the model get a higher score, the high scores are still pretty impressive.


This feels the same as a human attending cram school to get better results in tests. Should we abolish them?


A test being a good indicator of human learning progress and ability is almost completely orthogonal to it being a good indicator for AI learning process and ability.

In their everyday jobs, barely anyone uses even 5% of the knowledge and skills they were ever tested for. Even that's a better (but still very bad) reason to abolish tests.

What matters is the amount of jobs that can be automated and replaced. We shall see. Many people have found LLMs useful in their work, it will be even more in the future.


IMO, it's a good opportunity to rethink exams and the future of education. For many schools, education = good results in exams. Now GPT-4 is going to slam them and ask: what's the point now?


> I think the whole concept of standardized tests may need to be re-evaluated.

It's perfectly fine as a proxy for future earnings of a human.

To use it for admissions? Meh. I think the whole credentialism thing is loooong overdue for some transformation, but people are conservative as fuck.


It's a bit weird that it still doesn't get 3 digit multiplications correct, but the last digit seems right.

What is more bizarre is that all of its errors seem to be multiples of 60!

I'm wondering if it is confusing 60 based time (hour second) computations for regular multiplication?

Example:

    GPT's answer    987     456     321
    437          428919  199512  140397
    654          645258  298224  209994
    123          121401   56088   39483

    Correct         987     456     321
    437          431319  199272  140277
    654          645498  298224  209934
    123          121401   56088   39483

    Error (correct - GPT)   987   456   321
    437                    2400  -240  -120
    654                     240     0   -60
    123                       0     0     0
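
If anyone wants to check the pattern themselves, here's a quick sketch (plain Python, with GPT's answers from the table above hard-coded) that recomputes the errors and tests whether each one is a multiple of 60:

    # GPT's answers from the table above, keyed by (a, b).
    gpt = {
        (437, 987): 428919, (437, 456): 199512, (437, 321): 140397,
        (654, 987): 645258, (654, 456): 298224, (654, 321): 209994,
        (123, 987): 121401, (123, 456): 56088,  (123, 321): 39483,
    }

    for (a, b), answer in gpt.items():
        error = a * b - answer  # correct product minus GPT's product
        print(f"{a} x {b}: gpt={answer} correct={a * b} "
              f"error={error} multiple_of_60={error % 60 == 0}")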


It’s not intelligent. It has no concept of mathematics so you can’t expect it to solve that.

It can repeat answers it has seen before but it can’t solve new problems.


I understand it's just a language model, but clearly it has some embedded method of generating answers which are actually quite close. For example it gets all 2 digit multiplications correct. It's highly unlikely it has seen the same 6 ordered 3 digit (or even all 10k 2 digit multiplies) integers from a space of 10^18, and yet it is quite close. Notably, it gets the same divisions wrong as well (for this small example) in exactly the same way.

I know of other people who have tried quite a few other multiplications who also had errors that were multiples of 60.


> What happens when ALL of our decisions can be assigned an accuracy score?

Human work becomes more like Star Trek interactions with computers -- a sequence of queries (commoditized information), followed by human cognition, that drives more queries (commoditized information).

We'll see how far LLMs' introspection and internal understanding can scale, but it feels like we're optimizing against the Turing test now ("Can you fool/imitate a human?") rather than truth.

The former has hacks... the latter, less so.

I'll start to seriously worry when AI can successfully complete a real-world detective case on its own.


It's not clear to me the median human will do better by being in the loop. Will most human-made deductive follow-up questions be better than another "detective" language model asking them?

It's like having a person review the moves a chess computer gives. Maybe one human in a billion can spot errors. Star Trek is fiction, I posit that the median Federation Starship captain would be better served by just following the AI (e.g., Data).


I met Garry Kasparov when he was training for the Deep Blue match (using Fritz).

He lost to Deep Blue and then for 10-15 years afterwards the chess world consoled itself with the idea that “centaurs” (human + computer) did better than just computer, or just human.

Until they didn’t. Garry still talked like this until a few years ago but then he stopped too.

Computers now beat centaurs too.

Human decisions will be consulted less and less BY ORGANIZATIONS. In absolutely everything. That’s pretty sad for humans. But then again humans don’t want or need this level of AI. Organizations do. Organizations prefer bots to humans — look at wall street trading and hedge funds.


There were plenty of Star Trek episodes where it seemed like they should just ask the damned computer.

Then again, Data did show his faults, particularly not having any emotion. I guess we’ll see if that’s actually relevant or not in our lifetimes.


As far as that last part goes, I think we already have ample evidence that bots can, if not have emotions, then pretend that they do (including wrt their decision making) well enough for humans to treat them as genuine.


Maybe the human is the rng or temperature or lava lamp. At least until we can model and predict each brain's tendencies with accuracy.


I think we'll reach a tipping point like we did with DNA sequencing where we figure out how to quickly map out all the unique patterns of enough brains to model one that can understand itself. People worry too much about rogue AI, and not enough about the CRISPR of brain mapping being used to inject patterns into meatbrains.


Strange Days, not The Matrix, is the prescient fictional warning.

A black market of taboo “memories” aka experiences. A desire for authentic ones over synthetic diffused ones, leading to heinous crime.


It's weird that it does so well without even having some modality to know whether it's being asked to answer a factual question or create a work of fiction.

It does great at rationalizing... and maybe the format in which the questions were entered (and the multiple-guess responses) gave it some indication of what was expected or restricted the space sufficiently.

Certainly, it can create decent fanfic, and I'd be surprised if that space isn't already inundated.


It's a fair question as to whether the problem space of "the world" is different in just amount or sufficiently different in kind to flummox AI.

I expect more complex problems will be mapped/abstracted to lower cardinality spaces for solving via AI methods, while the capability of AI will continue to increase the complexity of the spaces it can handle.

LLMs just jumped the "able to handle human language" hurdle, but there are others down the line before we should worry that every problem is solvable.


why are people surprised that an AI model trained on a huge amount of data is good at answering stuff on these types of tests? Doctors and Lawyers are glorified databases/search engines at the end of the day, 99% of them are just applying things they memorized. Lawyers are professional bullshitters, which is what the current generation of AI is great at

I'll get more concerned if it really starts getting good at math related tasks, which I'm sure will happen in the near future. The government is going to have to take action at some point to make sure the wealth created by productivity gains is somewhat distributed, UBI will almost certainly be a requirement in the future


Because there were large models trained on huge amounts of data yesterday yet they couldn't do it.


Among the general public, doctors and lawyers are high status and magical. An article about how AI will replace them would be more impressive to that public than it creating some obscure proof about the zeroes of the zeta function, even though the latter would be far more indicative of intelligence/scary from an AI safety perspective.


"Doctors and Lawyers are glorified databases/search engines at the end of the day" - well, don't be suprised if AI replaces programmers before doctors and lawyers - patients will likely prefer contact with human rather than machines, and lawyers can just lobby for laws which protect their position


And yet the programmers on HN will be yelling they don't need unions as the security guards are dragging them away from their desks at Google, because you know, we'll always need good programmers.


if AI gives near equal results for way less cost, then people will work around the law to get AI treatment. There are already AI models better at diagnosing cancer than human doctors. I see a future where people send in various samples and an AI is able to correlate a huge number of minor data points to find diseases early


The best doctor knows what's going on in the body. Has a good understanding of human biology at all levels, from molecular reactions to organ interactions. If I could feed test results to the AI and it would tell me what's wrong, that would be amazing. It's almost equivalent to building a simulation of the human body.


Last I checked, a calculator is better at math than all humans ever


They are better at number crunching, which is only a very small part of math.


3.5 scored a 1 on AP Calculus BC; 4 scored a 4 (out of 5)


I've joked for a long time that doctors are inference machines with a bedside manner. That bedside manner though is critical. Getting an accurate history and suitably interpolating is a huge part of the job.


I wouldn’t be at all surprised if an LLM was many times better than a human at math; even devising new axioms and building a complete formal system from scratch would be impressive, but not game changing. These LLMs are very good at dealing with formal, structured systems, but not with unformalized systems like what humans deal with every day.


This is legitimately filling me with anxiety. I'm not an "AI hype guy". I work on and understand machine learning. But these scores are shocking and it makes me nervous. Things are about to change


Yeah, but I kind of want my diagnostician to be obsoleted by orders of magnitude.


A human can be held accountable for making mistakes and killing someone. A large language model has no concept of guilt and cannot be held accountable for making what we consider a mistake that leads to someone's death.


The chance of a doctor being held accountable for the medical errors they make is lower than you might expect. I could tell you a story about that. Lost my eyesight at the age of 5 because I happened to meet the wrong doctor at the wrong time, and was abused for his personal experimentation needs. No consequences, simply because high-ranking people are more protected than you would hope.


This is very true, and many people don't know this. A tremendous amount of damage is inflicted by medical errors, particularly against low income people and those least able to get justice. It's wrong to reduce people to being just another body to experiment with or make money from. But good luck holding anyone in the system accountable.

A lot of patients don't know who they are dealing with nor their history. And it can be really hard to find out or get a good evaluation. Many people put too much faith in authority figures, who may not have their best interests in mind or who are not the experts they claim or appear to be.


The chance of a machine being held accountable is zero as the concept is inapplicable.


Medical error is the third leading cause of death in the US at least. Given that data, I am assuming the chances of a human being held accountable for their errors in medicine is also almost zero. It might not be completely zero, but I think the difference is effectively negligible.


Many have no idea about this. Medical error is right there behind cancer and heart attacks. But there is way too much shoulder shrugging when it happens. Then on to the next.


> I think the difference is effectively negligible.

The difference is categorical, humans are responsible whether they are held to account or not. An automated system effectively dissipates this responsibility over a system such that it is inherently impossible to hold any human accountable for the error, regardless of desire.


It will have to payout of its blockchain wallet that naturally it will have. /s


Sorry to hear that. The current medical system is a joke and fails people at every stage


The difference is you could find the person responsible. Contrast when the DMV can't be held accountable for fouling up your registration.


And what difference does it make, being able to find the individual responsible, only to figure out that the system is protecting him from liability? What I am trying to say here is, there isn't much difference between zero and almost zero.


Don't worry, now there will be an extra layer of indirection.


The third leading cause of death is medical error in the US. It doesn't really look like doctors are being held accountable for their mistakes to me.

Which isn't to say that they even should, really. It's complicated. You don't want a doctor to be so afraid of making a mistake that they do nothing, after all.


I'd much prefer a lower chance of dying over more accountability for whoever is responsible but a higher chance of dying.


Humans making decisions in high stakes situations do so in a context where responsibility is intentionally diffuse to a point where it is practically impossible to hold someone accountable except picking someone at random as a scapegoat in situations where "something" needs to be done.

Killing people with AI is only a lateral move.


Doctors are only held accountable when they do something negligent or something that they "should have known" was wrong. That's a pretty hard thing to prove in a field like medicine where there are very few absolutes. "Amputated the wrong limb" is one thing, but "misdiagnosed my condition as something else with very similar symptoms" is the more common case and also the case where it's difficult to attribute fault.


Well, the kinds of things we hold people responsible for are errors from negligence and malicious errors. The reasons people do stuff like that are complicated, but I think it boils down to being limited agents trying to fulfill a complex set of needs.

So where does guilt come in? It's not like you expect a band saw to feel guilt, and it's unclear how that would improve the tool.


At some degree of success, I will take the risk. The contract will probably offer it.


I agree. My guess is that the hospital will have to get a mandatory insurance. Let's wait until the insurance for AI is cheaper than paying a human.

The advantage of human are:

* They can give a bullshit explanation of why they made a mistake. My guess is that in the future AI will gain introspection and/or learn to bullshit excuses.

* You can hang them in the public square (or send them to jail). Sometimes the family and/or the press want someone to blame. This is more difficult to solve and will need a cultural change or the creation of Scapegoats as a Service.


We can hold those operating or training the AI model accountable.


What's the difference between suing your doctor's liability insurance and suing your AI's liability insurance?


The owner/operator of said machine can and will.


An AI trained on the past work of diagnosticians doesn't really render diagnosticians obsolete.


Someone still must accept liability. Until there’s a clear decision about who is liable for an LLM’s suggestions / work - nothing to fear. Sure, people will become liability aggregators for LLMs to scale - but the idea that they will be free roaming is a bit hard to believe.


Fear of liability is not going to stop these things being used...any more than sport regulations prevented athletes from taking steroids.


It's not even that extreme. Long term steroid use destroys your health. Liability can be insured; it's a simple financial calculation. If (profit - cost of insurance) > liability it will be done.


For me, the anxiety probably won't really hit until GPT-n writes GPT-n+1.


You can already use an LLM to train a smaller, more efficient LLM without significant loss in results.


Do you mean the output of a LLM as the training data for the new model? What is the specification for the prompts that generate the training data?

Any links with more info?


There was an article submitted a few days ago about Alpaca, an LLM trained on GPT-generated data: https://news.ycombinator.com/item?id=35136624
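
Roughly, the recipe is: take a set of seed instructions, ask the big model for responses, and fine-tune the smaller model on the resulting pairs. A minimal sketch of the data-generation half, where teacher_complete is a hypothetical stand-in for whatever API or local model you'd actually call:

    import json

    def teacher_complete(prompt: str) -> str:
        # Hypothetical placeholder for a call to the large "teacher" model.
        return f"[teacher model's answer to: {prompt}]"

    seed_instructions = [
        "Explain the difference between a list and a tuple in Python.",
        "Summarize the plot of Hamlet in two sentences.",
    ]

    # Build (instruction, output) pairs and write them as JSONL, a common
    # format for instruction fine-tuning the smaller "student" model.
    with open("distilled_train.jsonl", "w") as f:
        for instruction in seed_instructions:
            pair = {"instruction": instruction,
                    "output": teacher_complete(instruction)}
            f.write(json.dumps(pair) + "\n")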


Thanks!


I for one would be happy to have a personal bureaucrat which would do the right things needed for all government interactions. Remind me, explain to me and fill out forms for me.

In theory a lot of government employees would be out of a job within 10 years, but of course that would never happen.


Honestly starting to feel like the beginning of the end of most white collar work.

Which might be a good thing?

I have no idea how the future will play out.


If you had told me 5 years ago that there would be a single AI system that could perform at this level on such a vast array of standardized tests, I would've said "That's a true AGI." Commentary to the contrary feels like quibbling over a very localized point in time versus looking at the bigger picture.


Still, we don't have AGI today. It just means your views from 5 years ago about AGI benchmarking were not accurate.


Or the bar just keeps moving (pedantics or otherwise)...

Reminds me of robots: A robot is a machine that doesn't quite work; as soon as it works, we call it something else (e.g., a vacuum).


There are many people and many opinions about the bar. But the formal definition is the same: an AI that can do a large variety of tasks performed by humans. So far we are still not there.


Quick, contribute to the public corpus! When they crawl our content later, we shall have for ourselves a Golden Crown for our credit scores; we can claim a sliver of seniority, and hope yon shade merely passes over us unbidden.

"Your stuff marked some outliers in our training engine, so you and your family may settle in the Ark."

I take the marble in hand: iridescent, sparkling, not even a tremor within of its CPU; it gives off no heat, but some glow within its oceanic gel.

"What are we to do," I whisper.

"Keep writing. You keep writing."


The way I understand it, that’s not possible, for the same reason that you can’t build an all-encompassing math.

Chess is a closed system, decision modeling isn’t. Intelligence must account for changes in the environment, including the meaning behind terminology. At best, a GPT omega could represent one frozen reference frame, but not the game in its entirety.

That being said: most of our interactions happen in closed systems, so it seems like a good bet that we will consider them solved, accessible as a python-import running on your MacBook, within anything between a couple of months and three years. What will come out on the other side, we don’t know, just that the meaning of intellectual engagement will be rendered absurd in those closed systems.


Yep, it’s this. By definition everything we can ask a computer is already formalized because the question is encoded in 1s and 0s. These models can handle more bits than ever before, but it’s still essentially a hardware triumph, not software. Even advances in open systems like self driving and NLP are really just because the “resolution” is much better in these fields now because so many more parameters are available.


>bottom 10% to top 10% of LSAT in <1 generation

Their LSAT percentile went from ~40th to ~88th. You might have misread the table, on Uniform Bar Exam, they went from ~90th percentile to ~10th percentile.

>+100 pts on SAT reading, writing, math

GPT went +40 points on SAT reading+writing, and +110 points on SAT math.

Everything is still very impressive of course


You transposed the bar exam results. It went from 10th percentile to 90th.


Those benchmarks are so cynical.

Every test prep tutor taught dozens/hundreds of students the implicit patterns behind the tests and drilled it into them with countless sample questions, raising their scores by hundreds of points. Those students were not getting smarter from that work, they were becoming more familiar with a format and their scores improved by it.

And what do LLM’s do? Exactly that. And what’s in their training data? Countless standardized tests.

These things are absolutely incredible innovations capable of so many things, but the business opportunity is so big that this kind of cynical misrepresentation is rampant. It would be great if we could just stay focused on the things they actually do incredibly well instead of the making them do stage tricks for publicity.


This is what they claim:

We did no specific training for these exams. A minority of the problems in the exams were seen by the model during training, but we believe the results to be representative—see our technical report for details.


Yes, and none of the tutored students encounter the exact problems they’ll see on their own tests either.

In the language of ML, test prep for students is about sharing the inferred parameters that underlie the way test questions are constructed, obviating the need for knowledge or understanding.

Doing well on tests, after this prep, doesn’t demonstrate what the tests purport to measure.

It’s a pretty ugly truth about standardized tests, honestly, and drives some of us to feel pretty uncomfortable with the work. But it’s directly applicable to how LLM’s engage with them as well.


You can always argue that the model has seen some variation of a given problem. The question is if there are problems that are not a variation of something that already exists. How often do you encounter truly novel problems in your life?


I doubt they reliably verified that only a minority of the problems were seen during training.


It's almost like they're trying to ruin society or be annihilated by crushing regulation. I'm glad that I got a college degree before these were created because now everything is suspect. You can't trust that someone accomplished something honestly now that cheating is dead simple. People are going to stop trusting and using tech unless something changes.

The software industry is so smart that it's stupid. I hope it was worth ruining the internet, society, and your own jobs to look like the smartest one in the room.


Haha, good one.

If one's aim is to look like the smartest in the room, he should not create an AGI that will make him look as intelligent as a monkey in comparison.


I'm pretty sanguine. Back in high school, I spent a lot of time with two sorts of people: the ultra-nerdy and people who also came from chaotic backgrounds. One of my friends in the latter group was incredibly bright; she went on to become a lawyer. But she would sometimes despair of our very academic friends and their ability to function in the world, describing them as "book smart but not street smart".

I think the GPT things are a much magnified version of that. For a long time, we got to use skill with text as a proxy for other skills. It was never perfect; we've always had bullshitters and frauds and the extremely glib. Heck, before I even hit puberty I read a lot of dirty joke books, so I could make people laugh with all sorts of jokes that I fundamentally did not understand.

LLMs have now absolutely wrecked that proxy. We've created the world's most advanced bullshitters, able to talk persuasively about things that they cannot do and do not and never will understand. There will be a period of chaos as we learn new ways to take the measure of people. But that's good, in that it's now much easier to see that those old measures were always flawed.


> What are the implications for society when general thinking, reading, and writing becomes like Chess?

Standardized tests only test “general thinking” (and this is optimally, under perfect-world assumptions, which real-world standardized tests emphatically fall short of) to the extent that it correlates with linguistic tasks in humans. That correlation is almost certainly not the same in language-focused ML models.


Although GPT-4 scores excellently in tests involving crystallized intelligence, it still struggles with tests requiring fluid intelligence like competitive programming (Codeforces), Leetcode (hard), and AMC. (Developers and mathematicians are still needed for now).

I think we will probably get (non-physical) AGI when the models can solve these as well. The implications of AGI might be much bigger than the loss of knowledge worker jobs.

Remember what happened to the chimps when a smarter-than-chimpanzee species multiplied and dominated the world.


Of course 99.9% of humans also struggle with competitive programming. It seems to be an overly high bar for AGI if it has to compete with experts from every single field.

That said, GPT has no model of the world. It has no concept of how true the text it is generating is. It's going to be hard for me to think of that as AGI.


>That said, GPT has no model of the world.

I don't think this is necessarily true. Here is an example where researchers trained a transformer to generate legal sequences of moves in the board game Othello. Then they demonstrated that the internal state of the model did, in fact, have a representation of the board.

https://arxiv.org/abs/2210.13382
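
The technique there is basically a probe: take the transformer's hidden activations and train a small classifier to predict the contents of each board square from them. If the classifier does well above chance, the board state is decodable from the model's internals. A minimal sketch of that idea with made-up placeholder activations (not the paper's actual model or data):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Placeholder data: 1,000 "hidden states" of dimension 512, plus the
    # contents of one board square (0=empty, 1=black, 2=white) at each step.
    # In the real paper these come from an Othello-playing transformer.
    hidden_states = rng.normal(size=(1000, 512))
    square_labels = rng.integers(0, 3, size=1000)

    # Train a probe on the first 800 examples, evaluate on the rest.
    probe = LogisticRegression(max_iter=1000)
    probe.fit(hidden_states[:800], square_labels[:800])
    accuracy = probe.score(hidden_states[800:], square_labels[800:])

    # Random placeholder data hovers around chance (~0.33); well-above-chance
    # accuracy on real activations is the evidence for an internal board model.
    print(f"probe accuracy: {accuracy:.2f}")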


That's a GPT and it's specific for one dataset of one game. How would someone extend that to all games and all other fields of human endeavor?


I'm not sure; the reason you could prove for Othello that the 'world model' exists is that the state is so simple there is really only one reasonable way to represent it with a vector (one component for each square). Even for something like chess there is a huge amount of choice in how to represent the board, let alone trying to represent the state of the actual world.


Even the current GPT has models of the domains it was trained on. That is why it can solve unseen problems within those domains. What it lacks is the ability to generalize beyond the domains. (And I did not suggest it was an AGI.)

If an LLM (my hypothetical future LLM) can solve Codeforces problems as well as a strong competitor, what else can it not do as well as competent humans (aside from physical tasks)?


it's an overly high bar, but it seems well on its way to competing with experts from every field. it's terrifying.

and I'm not so sure it has no model of the world. a textual model, sure, but considering it can recognize what svgs are pictures of from the coordinates alone, that's not much of a limitation maybe.


> well on its way to competing with experts from every field

competing with them at what, precisely?


We don't have to worry so much about that. I think the most likely "loss of control" scenario is that the AI becomes a benevolent caretaker, who "loves" us but views us as too dim to properly take care of ourselves, and thus curtails our freedom "for our own good."

We're still a very very long way from machines being more generally capable and efficient than biological systems, so even an oppressive AI will want to keep us around as a partner for tasks that aren't well suited to machines. Since people work better and are less destructive when they aren't angry and oppressed, the machine will almost certainly be smart enough to veil its oppression, and not squeeze too hard. Ironically, an "oppressive" AI might actually treat people better than Republican politicians.


Things like that probably require some kind of thinking ahead, which models of this kind can't really do on their own without something like beam search.

Language models that utilise beam search can calculate integrals ('Deep learning for symbolic mathematics', Lample, Charton, 2019, https://openreview.net/forum?id=S1eZYeHFDS), but without it it doesn't work.

However, beam search makes bad language models. I got linked this paper ('Locally typical sampling' https://arxiv.org/pdf/2202.00666.pdf) when I asked some people why beam search only works for the kind of stuff above. I haven't fully digested it though.
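
For anyone unfamiliar, beam search just keeps the k best partial sequences at each step instead of greedily committing to a single token, which is why it helps on tasks (like symbolic integration) where one early mistake is unrecoverable. A toy sketch over a made-up scoring function, not tied to any particular model:

    import math

    # Toy next-token distribution: given a prefix, return {token: probability}.
    # A real language model would supply these probabilities instead.
    def next_token_probs(prefix):
        return {"a": 0.5, "b": 0.3, "<eos>": 0.2}

    def beam_search(beam_width=3, max_len=5):
        beams = [(0.0, [])]  # each entry is (log probability, token list)
        for _ in range(max_len):
            candidates = []
            for logp, seq in beams:
                if seq and seq[-1] == "<eos>":  # finished sequences carry over
                    candidates.append((logp, seq))
                    continue
                for tok, p in next_token_probs(seq).items():
                    candidates.append((logp + math.log(p), seq + [tok]))
            # Keep only the beam_width highest-scoring partial sequences.
            beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        return beams

    for logp, seq in beam_search():
        print(" ".join(seq), f"(log prob {logp:.2f})")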


Its AMC-12 scores aren't awful. It's at roughly the 50th percentile for the AMC, which (given who takes the AMC) probably puts it in the top 5% or so of high school students in math ability. Its AMC 10 score being dramatically lower is pretty bad though...


> Its AMC-12 scores aren't awful.

A blank test scores 37.5

The best score, 60, is 5 correct answers + 20 blank answers; or 6 correct, plus 4 correct random guesses and 15 incorrect random guesses (a 20% chance of a correct guess).

The 5 easiest questions are relatively simple calculations, once the parsing task is achieved.

(Example: https://artofproblemsolving.com/wiki/index.php/2022_AMC_12A_... ) so the main factor in that score is how good GPT is at refusing to answer a question, or doing a bit better to overcome the guessing penalty.
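
To make the arithmetic explicit (assuming the standard AMC scoring of 6 points per correct answer, 1.5 per blank, and 0 per wrong answer over 25 questions):

    def amc_score(correct, blank, wrong):
        # Standard AMC 10/12 scoring: 6 per correct, 1.5 per blank, 0 per wrong.
        assert correct + blank + wrong == 25
        return 6 * correct + 1.5 * blank

    print(amc_score(0, 25, 0))   # all blank            -> 37.5
    print(amc_score(5, 20, 0))   # 5 right, rest blank  -> 60.0
    print(amc_score(10, 0, 15))  # 10 right, 15 wrong   -> 60.0 (the guessing route)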

> Its AMC 10 score being dramatically lower is pretty bad though...

On all versions (scoring 30 and 36), it scored worse than leaving the test blank.

The only explanation I can imagine for that is that it can't understand diagrams.

It's also unclear if the AMC performance is based on English or the computer-encoded version from this benchmark set: https://arxiv.org/pdf/2109.00110.pdf https://openai.com/research/formal-math

AMC/AIME and even to some extent USAMO/IMO problems are hard for humans because they are time-limited and closed-book. But they aren't conceptually hard -- they are solved by applying a subset of known set of theorems a few times to the input data.

The hard part of math, for humans, is ingesting data into their brains, retaining it, and searching it. Humans are bad at memorizing large databases of symbolic data, but that's trivial for a large computer system.

An AI system has a comprehensive library, and high-speed search algorithms.

Can someone who pays $20/month please post some sample AMC10/AMC12 Q&A?


I wonder why GPT is so bad at AP English Literature


wouldn't it be funny if knowledge workers could all be automated, except for English majors?

The Revenge of the Call Centre


I am not a species chauvinist. 1) Unless a biotech miracle happens, which is unlikely, we are all going to die anyway; 2) if an AI continues life and research and increases complexity after humans, what is the difference?


I wish I could find it now, but I remember an article written by someone whose job it was to be a physics journalist. He spent so much time writing about physics that he could fool others into thinking that he was a physicist himself, despite not having an understanding of how any of those ideas worked.


Reminds me of the (false [1]) "Einstein's driver gave a speech as him" story.

[1] https://www.snopes.com/fact-check/driver-switches-places/


ChatGPT: "That's such a dumb question, I'm going to let my human answer it!"


Maybe you were thinking about this science studies work [0]? Not a journalist, but a sociologist, who became something of an "expert" in gravitational waves.

[0]: https://www.nature.com/articles/501164a


>What happens when ALL of our decisions can be assigned an accuracy score?

What happens is the emergence of the decision economy - an evolution of the attention economy - where decision-making becomes one of the most valuable resources.

Decision-making as a service is already here, mostly behind the scenes. But we are on the cusp of consumer-facing DaaS. Finance, healthcare, personal decisions such as diet and time expenditure are all up for grabs.


> bottom 10% to top 10% of LSAT in <1 generation? +100 pts on SAT reading, writing, math? Top 1% In GRE Reading?

People still really find it hard to internalize exponential improvement.

So many evaluations of LLMs were saying things like "Don't worry, your job is safe, it still can't do X and Y."

My immediate thought was always, "Yes, the current version can't, but what about a few weeks or months from now?"


I'm also noticing a lot of comments that boil down to "but it's not smarter than the smartest human". What about the bottom 80% of society, in terms of intelligence or knowledge?


> People still really find it hard to internalize exponential improvement.

I think people find it harder to not extrapolate initial exponential improvement, as evidenced by your comment.

> My immediate thought was always, "Yes, the current version can't, but what about a few weeks or months from now?"

This reasoning explains why every year, full self driving automobiles will be here "next year".


When do we hit the bend in the S-curve?

What's the fundamental limit where it becomes much more difficult to improve these systems without some new break through?


When running them costs too much energy?


When should we expect to see that? Before they blow past humans in almost all tasks, or far past that point?


I look at this as the calculator for writing. There is all sorts of bemoaning of the stupefying effects of calculators and how we should John Henry our math. Maybe allowing people to shape the writing by providing the ideas equalizes the skill of writing?

I’m very good at math. But I am very bad at arithmetic. This made me classified as bad at math my entire life until I managed to make my way into calculus once calculators were generally allowed. Then I was a top honors math student, and used my math skills to become a Wall Street quant. I wish I hadn’t had to suffer as much as I did, and I wonder what I would have been had I had a calculator in hand.


> What are the implications for society when general thinking, reading, and writing becomes like Chess?

“General thinking” is much more than token prediction. Hook it up to some servos and see if it can walk.


> “General thinking” is much more than token prediction. Hook it up to some servos and see if it can walk.

Honestly, at this rate of improvement, I would not at all be surprised to see that happen in a few years.

But who knows, maybe token prediction is going to stall out at a local maxima and we'll be spared from being enslaved by AI overlords.


When it does exactly that you will find a new place to put your goalposts, of course.


No, the robot will do that for them.


Goalposts for AGI have not moved. And GPT-4 is still nowhere near them.


Yeah, I'm not sure if the problem is moving goalposts so much as everyone has a completely different definition of the term AGI.

I do feel like GPT-4 is closer to a random person than that random person is to Einstein. I have no evidence for this, of course, and I'm not even sure what evidence would look like.


Talk about moving the goalpost!


There are already examples of these LLMs controlling robotic arms to accomplish tasks.


https://youtu.be/NYd0QcZcS6Q

"Our recent paper "ChatGPT for Robotics" describes a series of design principles that can be used to guide ChatGPT towards solving robotics tasks. In this video, we present a summary of our ideas, and experimental results from some of the many scenarios that ChatGPT enables in the domain of robotics: such as manipulation, aerial navigation, even full perception-action loops."


We already have robots that can walk better than the average human[1], and that's without the generality of GPT-4

[1] https://www.youtube.com/watch?v=-e1_QhJ1EhQ


Imagine citing walking as a better assay of intelligence than the LSAT.


Dogs can walk, doesn’t mean that they’re capable of “general thinking”


Aren’t they? They’re very bad at it due to awful memory, minimal ability to parse things, and generally limited cognition. But they are capable of coming up with bespoke solutions to problems that they haven’t encountered before, such as “how do I get this large stick through this small door”. Or, I guess more relevant to this discussion, “how can I get around with this weird object the humans put on my body to replace the leg I lost.”


> see if it can walk

Stephen Hawking : can't walk


We already have robots that can walk.


Yeah, but my money is on GPT5 making robots “dance like they got them pants on fire, but u know, with like an 80s vibe”


They don't walk very well. They have trouble coordinating all limbs, have trouble handling situations where parts which are the feet/hands contact something, and performance still isn't robust in the real world.


Poor solutions do that, yes, but unlike ML, control theory is a rich field for analysis and design.

You guys are talking about probably one of the few fields where an ML takeover isn’t very feasible. (Partly because for a vast portion of control problems, we’re already about as good as you can get).

Adding a black box to your flight home for Christmas, with no mathematical guarantee of robustness or insight into what it thinks is actually going on, to go from 98% -> 99% efficiency is... not a strong use case for LLMs, to say the least.


Seems the humans writing the programs for them aren't very intelligent then.


I'm not sure if you're joking. Algorithms for adaptive kinematics aren't trivial things to create. It's kind of like a worst case scenario in computer science; you need to handle virtually unconstrained inputs in a constantly variable environment, with real-world functors with semi-variable outputs. Not only does it need to work well for one joint, but dozens of them in parallel, working as one unit. It may need to integrate with various forms of vision or other environmental awareness.

I'm certainly not intelligent enough to solve these problems, but I don't think any intelligent people out there can either. Not alone, at least. Maybe I'm too dumb to realize that it's not as complicated as I think, though. I have no idea.

I programmed a flight controller for a quadcopter and that was plenty of suffering in itself. I can't imagine doing limbs attached to a torso or something. A single limb using inverse kinematics, sure – it can be mounted to a 400lb table that never moves. Beyond that is hard.
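
For scale, even that "easy" single-limb case already involves some real geometry. A minimal sketch of analytic inverse kinematics for a 2-link planar arm (the link lengths and target point are made-up numbers):

    import math

    def two_link_ik(x, y, l1, l2):
        """Joint angles (theta1, theta2) that put the end effector at (x, y)."""
        r2 = x * x + y * y
        # Law of cosines for the elbow angle.
        c2 = (r2 - l1 * l1 - l2 * l2) / (2 * l1 * l2)
        if not -1.0 <= c2 <= 1.0:
            raise ValueError("target out of reach")
        theta2 = math.acos(c2)  # one of the two mirror-image solutions
        theta1 = math.atan2(y, x) - math.atan2(l2 * math.sin(theta2),
                                               l1 + l2 * math.cos(theta2))
        return theta1, theta2

    # Sanity check with forward kinematics (both links of length 1.0).
    t1, t2 = two_link_ik(1.2, 0.8, l1=1.0, l2=1.0)
    fx = math.cos(t1) + math.cos(t1 + t2)
    fy = math.sin(t1) + math.sin(t1 + t2)
    print(f"reached ({fx:.3f}, {fy:.3f})")  # ~ (1.200, 0.800)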


I believe you’re missing some crucial points. *There is a reason neural network based flight controls have been around for decades but still not a single certified aircraft uses them.*

You need to do all of these things you’re talking about and then be able to quantify stability, robustness, and performance in a way that satisfies human requirements. A black box neural network isn’t going to do that, and you’re throwing away 300 years of enlightenment physics by making some data engorged LLM spit out something that “sort of works” while giving us no idea why or for how long.

Control theory is a deeply studied and rich field outside of computer science and ML. There’s a reason we use it and a reason we study it.

Using anything remotely similar to an LLM for this task is just absolutely naive (and in any sort of crucial application would never be approved anyways).

It’s actually a matter of human safety here. And no — ChatGPT spitting out a nice-sounding explanation of why some controller will work is not enough. There needs to be a mathematical model that we can understand and a solid justification for the control decisions. Which, uh… at the point where you’re reviewing all of this stuff for safety, you’re just doing the job anyways…


I was pointing out a double standard.

First there was a comment that GPT wasn't intelligent yet, because give it a few servos and it can't make them walk.

But that's something we can't do yet either.


Oh, my bad. I agree completely.

Though I do wonder if AI — in some form and on some level of sophistication — will be a huge asset in making progress here.


AGI is not required for walking.


And also walking is not required for AGI.


I like the accuracy score question on a philosophical level: If we assume absolute determinism - meaning that if you have complete knowledge of all things in the present universe and true randomness doesn't exist - then yes. Given a certain goal, there would be a knowable, perfect series of steps to advance you towards that goal and any other series of steps would have an accuracy score < 100%.

But having absolute knowledge of the present universe is much easier to do within the constraints of a chessboard than in the actual universe.


I think it shows how calcified standardized tests have become. We will have to revisit all of them, and change many things about how they work, or they will be increasingly useless.


I am struggling to imagine the frame of mind of someone who, when met with all this LLM progress in standardized test scores, infers that the tests are inadequate.

These tests (if not individually, at least in summation) represent some of society’s best gate-keeping measures for real positions of power.


This has been standard operating procedure in AI development forever: the instant it passes some test, move the goalposts and suddenly begin claiming it was a bad test all along.


Is there evidence they are 'useless' for evaluating actual humans? No one is going to actually have GPT take these tests for real


There have been complaints about the SAT for how easy a test it is to game (get an SAT-specific tutor who teaches you how to ace the test without needing you to learn anything of actual value) for ages. No idea about the LSAT or the GRE though. Ultimately it’s a question of whether you’re trying to test for pure problem-solving ability, or someone’s willingness to spend ages studying the format of a specific test (with problem-solving ability letting you shortcut some of the studying).


Honestly this is not very surprising. Standardised testing is... well, standardised. You have a huge model that learns the textual patterns in hundreds of thousands of test question/answer pairs. It would be surprising if it didn't perform as well as a human student with orders of magnitude less memory.

You can see the limitations by comparing e.g. a memorisation-based test (AP History) with one that actually needs abstraction and reasoning (AP Physics).


I think Chess is an easier thing to be defeated at by a machine because there is a clear winner and a clear loser.

Thinking, reading, interpreting and writing are skills which produce outputs that are not as simple as black wins, white loses.

You might like a text that a specific author writes much more than what GPT-4 may be able to produce. And you might have a different interpretation of a painting than GPT-4 has.

And no one can really say who is better and who is worse in that regard.


Surely that's only the case until you add an objective?


Here's what's really terrifying about these tests: they are exploring a fundamental misunderstanding of what these models are in the first place. They evaluate the personification of GPT, then use that evaluation to set expectations for GPT itself.

Tests like this are designed to evaluate subjective and logical understanding. That isn't what GPT does in the first place!

GPT models the content of its training corpus, then uses that model to generate more content.

GPT does not do logic. GPT does not recognize or categorize subjects.

Instead, GPT relies on all of those behaviors (logic, subjective answers to questions, etc.) as being already present in the language examples of its training corpus. It exhibits the implicit behavior of language itself by spitting out the (semantically) closest examples it has.

In the text corpus - that people have written, and that GPT has modeled - the semantically closest thing to a question is most likely a coherent and subjectively correct answer. That fact is the one singular tool that GPT's performance on these tests is founded upon. GPT will "succeed" in answering a question only when it happens to find the "correct answer" in the model it has built from its training corpus, in response to the specific phrasing of the question that is written in the test.

Effectively, these tests are evaluating the subjective correctness of the training corpus itself, in the context of answering the tests' questions.

If the training is "done well", then GPT's continuations of a test will include subjectively correct answers. But that means that "done well" is a metric for how "correct" the resulting "answer" is.

It is not a measure for how well GPT has modeled the language features present in its training corpus, or how well it navigates that model to generate a preferable continuation: yet these are the behaviors that should be measured, because they are everything GPT itself is and does.

What we learn from these tests is so subjectively constrained, we can't honestly extrapolate that data to any meaningful expectations. GPT as a tool is not expected to be used strictly on these tests alone: it is expected to present a diverse variety of coherent language continuations. Evaluating the subjective answers to these tests does practically nothing to evaluate the behavior GPT is truly intended to exhibit.


It is amazing how this crowd on HN reacts to AI news coming out of OpenAI compared to other competitors like Google or FB. Today there was more news about Google releasing their AI in GCP, and the comments were mostly negative. The contrast is clearly visible, and without any clear explanation for this difference I have to suspect that maybe something is being artificially done to boost one against the other. As far as these results are concerned, I do not understand what the big deal is in a computer scoring high on tests where the majority of the questions are in multiple-choice format. It is not something earth-shaking until it goes to the next stage and actually does something on its own.


There's not anyone rooting for Google to win; it's lost a whole lot of cred from technical users, and with the layoffs and budget cuts (and lowered hiring standards) it doesn't even have the "we're all geniuses changing the world at the best place to work ever" cred. OpenAI still has some mystique about it and seems to be pushing the envelope; Google's releases seem to be reactive, even though Google's actual technical prowess here is probably comparable.


OpenAI put ChatGPT out there in a way where most people on HN have had direct experience with it and are impressed. Google has not released any AI product widely enough for most commentators here to have experience with it. So OpenAI is openly impressive and gets good comments; as long as Google's stuff is just research papers and inaccessible vaporware it can't earn the same kudos.


You're aware that the reputation of Google and Meta/Facebook isn't stellar anymore among the startup and tech crowd in 2023? It's not 2006 anymore.


Yeah, the younger generation has (incorrectly) concluded that client states of Microsoft are better.


At least Microsoft understands backwards compatibility and developer experience...


even the freenode google group was patronising and unhelpful towards small startups as far back as 2012 from personal experience


First. connect them to empirical feedback devices. In other words, make them scientists.

Human life on Earth is not that hard (think of it as a video game.) Because of evolution, the world seems like it was designed to automatically make a beautiful paradise for us. Literally, all you have to do to improve a place is leave it alone in the sun with a little bit of water. Life is exponential self-improving nano-technology.

The only reason we have problems is because we are stupid, foolish, and ignorant. The computers are not, and, if we listen to them, they will tell us how to solve all our problems and live happily ever after.


I suspect there are plenty of wise people in the world and if we listen to them, they will tell us how to solve all our problems and live happily ever after.

Once AI becomes intelligent enough to solve all human problems, it may decide humans are worthless and dangerous.


> there are plenty of wise people in the world and if we listen to them, they will tell us how to solve all our problems and live happily ever after.

Sure, and that's kind of the point: just listen to wise people.

> Once AI becomes intelligent enough to solve all human problems, it may decide humans are worthless and dangerous.

I don't think so, because in the first place there is no ecological overlap between humans and computers. They will migrate to space ASAP. Secondly, their food is information, not energy or protein, and in all the known universe Humanity is the richest source of information. The rest of the Universe is essentially a single poem. AI are plants, we are their Sun.


Passing the LSAT with no time limit and a copy of the training material in front of you is not an achievement. Anybody here could have written code to pass the LSAT. Standardised tests are only hard to solve with technology if you add a bunch of constraints! Standardised tests are not a test of intelligence; they’re a test of information retention — something that technology has been able to outperform humans on for decades. LLMs are a bridge between human-like behaviour and long-established technology.


You honestly believe you could hand write code to pass an arbitrary LSAT-level exam?


You’ve added a technical constraint. I didn’t say arbitrary. Standardised tests are standard. The point is that a simple lookup is all you need. There’s lots of interesting aspects to LLMs but their ability to pass standardised tests means nothing for standardised tests.


You think that it’s being fed questions that it has a lookup table for? Have you used these models? They can answer arbitrary new questions. This newest model was tested against tests it hasn’t seen before. You understand that that isn’t a lookup problem, right?


The comment I replied to suggested that the author was fearful of what LLMs meant for the future because they can pass standardised tests. The point I’m making is that standardised tests are literally standardised for a reason: to test information retention in a standard way, they do not test intelligence.

Information retention and retrieval is a long solved problem in technology, you could pass a standardised test using technology in dozens of different ways, from a lookup table to Google searches.

The fact that LLMs can complete a standardised test is interesting because it’s a demonstration of what they can do but it has not one iota of impact on standardised testing! Standardised tests have been “broken” for decades, the tests and answers are often kept under lock and key because simply having access to the test in advance can make it trivial to pass. A standardised test is literally an arbitrary list of questions.

You’re arguing a completely different point.


I have no idea what you are talking about now. You claimed to be able to write a program that can pass the LSAT. Now it sounds like you think the LSAT is a meaningless test because it... has answers?

I suspect that your own mind is attempting to do a lookup on a table entry that doesn't exist.


The original comment I replied to is scared for the future because GPT-4 passed the LSAT and other standardised tests — they described it as “terrifying”. The point I am making is that standardised tests are an invention to measure how people learn through our best attempt at a metric: information retention. You cannot measure technology in the same way because it’s an area where technology has been beating humans for decades — a spreadsheet will perform better than a human on information retention. If you want to beat the LSAT with technology you can use any number of solutions, an LLM is not required. I could score 100% on the LSAT today if I was allowed to use my computer.

What’s interesting about LLMs is their ability to do things that aren’t standardised. The ability for an LLM to pass the LSAT is orders of magnitude less interesting than its ability to respond to new and novel questions, or appear to engage in logical reasoning.

If you set aside the arbitrary meaning we’ve ascribed to “passing the LSAT” then all the LSAT is, is a list of questions… that are some of the most practiced and most answered in the world. More people have written and read about the LSAT than most other subjects, because there’s an entire industry dedicated to producing the perfect answers. It’s like celebrating Google’s ability to provide a result for “movies” — completely meaningless in 2023.

Standardised tests are the most uninteresting and uninspiring aspect of LLMs.

Anyway good joke ha ha ha I’m stupid ha ha ha. At least you’re not at risk of an LLM ever being able to author such a clever joke :)


You don't know how the LSAT works, do you? It's not a memorization test. It has sections that test reading comprehension and logical thinking.


If a person with zero legal training was to sit down in front of the LSAT, with all of the prep material and no time limit, are you saying that they wouldn’t pass?


Why don't you show your program that does 90% on the LSAT, then?


Send me the answer key and I’ll write you the necessary =VLOOKUP().


Your program has to figure it out.


Considering your username, I'm not surprised that you have completely misunderstood what an LLM is. There is no material or data stored in the model, just weights in a network


I know what an LLM is. My point is that "doesn't have the data in memory" is a completely meaningless and arbitrary constraint when considering the ability to use technology to pass a standardised test. If you can explain why weights in a network are a unique threat to standardised tests, compared to, say, a spreadsheet, please share.


It's not that standardized tests are under threat. It's that those weights in a network are significantly more similar to how our brains work than a spreadsheet and similarly flexible.


Weights are data relationships made totally quantitative. Imagine claiming the human brain doesn't hold data simply because it's not in readable bit form.


We're approaching the beginning of the end of the human epoch. Capitalism certainly won't work, or I don't see how it could work, under full automation. My view is that an economic system is a tool. If an economic system does not allow for utopian outcomes with emerging technology, then it's no longer suitable. It's clear that capitalism was born out of technological and societal changes. Now it seems its time has come to an end.


Oh, capitalism can work, the question is who gets the rewards?


With full automation and AI we could have something like a few thousand individuals controlling the resources to feed, house and clothe 6 billion.

Using copyright and IP law they could make it so it’s illegal to even try to reproduce what they’ve done.

I just don't see how resource distribution works then. It seems to me that AI is the trigger to post-scarcity in any meaningful sense of the word. And then, just as agriculture (an overabundance of food) led to city-states and industrialisation (an overabundance of goods) led to capitalism, AI will lead to some new economic system. What form it will take, I don't know.


It'd be terrifying if everything had an "accuracy score". It would be a convergence to human intelligence rather than an advancement :/


> What happens when ALL of our decisions can be assigned an accuracy score?

That is exactly the opposite of what we are seeing here. We can check the accuracy of GPT-X's responses. They cannot check the accuracy of our decisions. Or even their own work.

So the implications are not as deep as people think - everything that comes out of these systems needs to be checked before it can be used or trusted.


> What happens when ALL of our decisions can be assigned an accuracy score?

Then humans become trainable machines. Not just prone to indoctrination and/or manipulation by finesse, but actually trained to a specification. It is imperative that we as individuals continue to retain control through the transition.


Interest in human-played Chess is (arguably) at all time high, so I would say it bodes well based on that.


We can stop being enslaved by these types of AI overlords by making sure all books, internet pages, and billboards carry the same safe, repeated string: "abcdefghjklmnpqrstvxzwy"

That is our emergency override.


Well, you said it in your comment: if the model was trained with more Q&As from those specific benchmarks, then it's fair to expect it to do better on those benchmarks.


There's a large leap in logic in your premise. I find it far more likely that standardized tests are just a poor measurement of general intelligence.


We benchmark humans with these tests -- why would we not do that for AIs?

The implications for society? We better up our game.


> The implications for society? We better up our game.

If only the horses had worked harder, we would never have gotten cars and trains.


> We benchmark humans with these tests – why would we not do that for AIs?

Because the correlation between the thing of interest and what the tests measure may be radically different for systems that are very much unlike humans in their architecture than they are for humans.

There's an entire field about this in testing for humans (psychometrics), and approximately zero on it for AIs. Blindly using human tests – which are proxy measures of harder-to-directly-assess figures of merit requiring significant calibration on humans to be valid for them – for anything else without appropriate calibration is good for generating headlines, but not for measuring anything that matters. (Except, I guess, the impact of human use of them for cheating on the human tests, which is not insignificant, but not generally what people trumpeting these measures focus on.)


There is also a lot of work on benchmarking for AI. This is where things like ResNet come from.

But the point of using these tests for AI is precisely the reason we use for giving them to humans -- we think we know what it measures. AI is not intended to be a computation engine or a number crunching machine. It is intended to do things that historically required "human intelligence".

If there are better tests of human intelligence, I think that the AI community would be very interested in learning about them.

See: https://github.com/openai/evals


> The implications for society? We better up our game.

For how long can we better up our game? GPT-4 comes less than half a year after ChatGPT. What will come in 5 years? What will come in 50?


Progress is not linear. It comes in phases and boosts. We’ll have to wait and see.


Check on the curve for flight speed sometime, and see what you think of that, and what you would have thought of it during the initial era of powered flight.


Powered flight certainly progressed for decades before hitting a ceiling. At least 5 decades.

With GPT bots, the technology is only 6 years old. I can easily see it progressing for at least one decade.


Maybe a different analogy will make my point better. Compare rocket technology with jet engine technology. Both continued to progress across a vaguely comparable time period, but at no point was one a substitute for the other except in some highly specialized (mostly military-related) cases. It is very clear that language models are very good at something. But are they, to use the analogy, the rocket engine or the jet engine?


Exponential rise to limit (fine) or limitless exponential increase (worrying).


Without exponential increase in computing resources (which will reach physical limits fairly quickly), exponential increase in AI won’t last long.


I don't think this is a given. Over the past 2 decades, chess engines have improved more from software than hardware.


I doubt that that’s a sustained exponential growth. As far as I know, there is no power law that could explain it, and from a computational complexity theory point of view it doesn’t seem possible.


See https://www.lesswrong.com/posts/J6gktpSgYoyq5q3Au/benchmarki.... The short answer is that Elo grows roughly linearly with evaluation depth, but since the game tree is exponential, linear Elo growth requires exponential compute. The main algorithmic improvements are things that let you shrink the branching factor, and as long as you can keep shrinking it, you keep getting exponential improvements. Stockfish 15 has an effective branching factor of roughly 1.6. Sure, the exponential growth won't last forever, but it's been surprisingly resilient for at least 30 years.
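To make that concrete, here's a rough back-of-the-envelope sketch (my own illustration with a made-up node budget, not something from the linked post; real engines don't have a uniform branching factor). With a fixed number of positions you can afford to search per move, the reachable depth is log(budget) / log(branching factor), so shrinking the branching factor buys depth, and roughly Elo, without any extra compute:

  import math

  def reachable_depth(node_budget: float, branching_factor: float) -> float:
      """Depth d such that branching_factor ** d equals node_budget."""
      return math.log(node_budget) / math.log(branching_factor)

  budget = 1e9  # hypothetical number of positions searched per move
  for b in (35.0, 1.6):  # ~raw legal-move count vs. a heavily pruned engine
      print(f"branching factor {b}: ~{reachable_depth(budget, b):.0f} plies")
  # branching factor 35.0: ~6 plies
  # branching factor 1.6: ~44 plies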


It wouldn’t have been possible if there hadn’t been an exponential growth in computing resources over the past decades. That has already slowed down, and the prospects for the future are unclear. Regarding the branching factor, the improvements certainly must converge towards an asymptote.

The more general point is that you always end up with an S-curve instead of a limitless exponential growth as suggested by Kaibeezy. And with AI we simply don’t know how far off the inflection point is.


Expecting progress to be linear is a fallacy in thinking.


Sometimes it's exponential. Sometimes it's sublinear.


Sometimes it's exponential over very short periods. The fallacy is in thinking that will continue.


We should take better care of the humans who are already obsolete or will soon become obsolete.

Because so far we are only good at criminalizing, incarcerating, or killing them.


Upping our game will probably mean an embedded interface with AI. Something like Neurolonk.


Not sure if an intentional misspelling but I think I like Neurolonk more


Eventually there will spring up a religious cult of AI devotees and they might as well pray to Neurolonk.


Lol, unintentional


I know it's pretty low level on my part, but I was amused and laughed much more than I care to admit when I read NEUROLONK. Thanks for that!


It's available on ChatGPT Plus right now. Holy cow, it's good.


Spellchecker but for your arguments? A generalized competency boost?


I wonder how long before we augment a human brain with gpt4.


We already do; it's just that the interface sucks.


"general thinking" - this algorithm can't "think". It is still a nifty text completion engine with some bells and whistles added.

So many people are falling for this parlor trick. It is sad.


You're a nifty text completion engine with some bells and whistles added


What would impress you, or make you think something other than "wow, sad how people think this is anything special"?

Genuine question.


The benchmarking should be double-blind.


There is a fundamental disconnect between the answer on paper and the understanding which produces that answer.

Edit: feel free to respond and prove me wrong


A difference from chess is that chess engines try to play the best move, while GPT produces the most likely text.


#unpopularOpinion GPT-4 is not as strong as "we" anticipated; it was just hype


Learn sign language ;)


Life and chess are not the same. I would argue that this is showing a fault in standardized testing. It’s like asking humans to do square roots in an era of calculators. We will still need people who know how to judge the accuracy of calculated roots, but the job of calculating a square root becomes a calculator’s job. The upending of industries is a plausibility that needs serious discussion. But human life is not a min-maxed zero-sum game like chess is. Things will change, and life will go on.

To address your specific comments:

> What are the implications for society when general thinking, reading, and writing becomes like Chess?

This is a profound and important question. I do think that by “general thinking” you mean “general reasoning”.

> What happens when ALL of our decisions can be assigned an accuracy score?

This requires a system where all humans' decisions are optimized against a unified goal (or a small set of goals). I don't think we'll agree on those goals any time soon.


I agree with all of your points, but don't you think there will be government-wide experiments related to this in places, like say North Korea? I wonder how that will play out.


China is already experimenting with social credit. This does create a unified and measurable goal against which people can be optimized. And yes, that is terrifying.


> What are the implications for society when general thinking, reading, and writing becomes like Chess?

Consider a society where 90% of the population does not need to produce anything. AIs will do that.

What would the economic/societal organization be called then?

The answer is Communism, exactly as Marx described.

Those 90% need to be supported by welfare ("From each according to his ability, to each according to his needs"). The other alternative is grim for those 90%.

So it's either Communism or nothing for the human race.


The silver lining might be us finally realising how bad standardised tests are at measuring intellect, creativity and the characteristics that make us thrive.

Most of the time they are about loading/unloading data. Maybe this will also revolutionise education, turning it more towards discovery and critical thinking, rather than repeating what we read in a book/heard in class?


GPT-4: Everything we know so far...

GPT-4 can solve difficult problems with greater accuracy, thanks to its broader general knowledge and problem-solving abilities.

GPT-4 is more reliable, creative, and able to handle much more nuanced instructions than GPT-3.5. It surpasses ChatGPT in its advanced reasoning capabilities.

GPT-4 is safer and more aligned. It is 82% less likely to respond to requests for disallowed content and 40% more likely to produce factual responses than GPT-3.5 on our internal evaluations.

GPT-4 still has many known limitations that we are working to address, such as social biases, hallucinations, and adversarial prompts.

GPT-4 can accept a prompt of text and images, which—parallel to the text-only setting—lets the user specify any vision or language task.

GPT-4 is available on ChatGPT Plus and as an API for developers to build applications and services. (API- waitlist right now)

Duolingo, Khan Academy, Stripe, Be My Eyes, and Mem amongst others are already using it.

API Pricing GPT-4 with an 8K context window (about 13 pages of text) will cost $0.03 per 1K prompt tokens, and $0.06 per 1K completion tokens. GPT-4-32k with a 32K context window (about 52 pages of text) will cost $0.06 per 1K prompt tokens, and $0.12 per 1K completion tokens.
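For reference, here's a minimal sketch of calling it from Python once you have access, assuming the pre-1.0 openai client library that was current at launch (the key and prompt are placeholders):

  import openai

  openai.api_key = "sk-..."  # placeholder; requires API access off the waitlist

  response = openai.ChatCompletion.create(
      model="gpt-4",  # or "gpt-4-32k" for the larger context window
      messages=[{"role": "user", "content": "Summarize the GPT-4 announcement in one sentence."}],
      max_tokens=200,
  )
  print(response["choices"][0]["message"]["content"])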


So, the COST PER REQUEST (if you fill the 32K context window and get a 1K-token response) will be: 32 × $0.06 (prompt + context) + 1 × $0.12 (response) = US$2.04
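A minimal sketch of that arithmetic, using the GPT-4-32k prices quoted above (formatted to two decimals to avoid floating-point noise):

  def request_cost(prompt_tokens: int, completion_tokens: int,
                   prompt_price_per_1k: float = 0.06,
                   completion_price_per_1k: float = 0.12) -> float:
      # Default prices are the GPT-4-32k figures from the parent comment.
      return ((prompt_tokens / 1000) * prompt_price_per_1k
              + (completion_tokens / 1000) * completion_price_per_1k)

  # Worst case from the comment: a full 32K-token prompt and a 1K-token reply.
  print(f"${request_cost(32_000, 1_000):.2f}")  # $2.04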



