
I think it's interesting that they've benchmarked it against an array of standardized tests. Seems like LLMs would be particularly well suited to this kind of test by virtue of it being simple prompt:response, but I have to say...those results are terrifying. Especially when considering the rate of improvement. bottom 10% to top 10% of LSAT in <1 generation? +100 pts on SAT reading, writing, math? Top 1% In GRE Reading?

What are the implications for society when general thinking, reading, and writing become like Chess? Even the best humans in the world can only hope to be 98% accurate in their moves (the idea of 'accuracy' here only exists because we have engines that know, unequivocally, the best move), and only when playing against other humans - there is no hope of defeating even the less advanced models.

What happens when ALL of our decisions can be assigned an accuracy score?



Not sure what happens, but I will say that human chess is more popular than ever even though everyone knows that even the best humans are hopelessly terrible compared to the leading engines.

Something else that comes to mind is running. People still find running meaningful and compelling even though we have many technologies, including autonomous ones, that are vastly better at moving us and/or themselves through space quickly.

Also, the vast majority of people are already hopelessly worse than the best at even their one narrow main area of focus. This has long (always?) been the case. Yet people still find meaning and pleasure in being the best they can be even when they know they can never come close to hanging with the best.

I don't think PSYCHOLOGICALLY this will change much for people who are mature enough to understand that success is measured against your potential/limitations and not against others. Practically, of course, it might be a different question, at least in the short term. It's not that clear to me that the concept of a "marketable skill" has a future.

"The Way of the Samurai is found in death...To say that dying without reaching one's aim is to die a dog's death is the frivolous way of sophisticates. When pressed with the choice of life or death, it is not necessary to gain one's aim." - from Hagakure by Yamamoto Tsunetomo, as translated by William Scott Wilson.


Assuming they trained this LLM on SAT/LSAT/GRE prep materials, I would totally expect they could get it this good. It's like having benchmark-aware code.

I think the whole concept of standardized tests may need to be re-evaluated.


> I would totally expect they could get it this good.

But would you have expected an algorithm to score 90th percentile on the LSAT two years ago? Our expectations of what an algorithm can do are being upended in real time. I think it's worth taking a moment to try to understand what the implications of these changes will be.


Yes. Being very familiar with the LSAT and being familiar enough with ML’s capability for finding patterns in volumes of similar data, I absolutely would have.

These LLM’s are really exciting, but benchmarks like these exploit people’s misconceptions about both standardized tests and the technology.


From the paper

> We tested GPT-4 on a diverse set of benchmarks, including simulating exams that were originally designed for humans.3 We did no specific training for these exams. A minority of the problems in the exams were seen by the model during training; for each exam we run a variant with these questions removed and report the lower score of the two. We believe the results to be representative. For further details on contamination (methodology and per-exam statistics), see Appendix C.


I think you're right, and that test prep materials were included in the dataset, even if only by accident. Except that humans have access to the same test prep materials, and they fail these exams all the time. The prep materials are just that, preparatory. They're representative of the test questions, but the actual test has different passages to read and different questions. On top of that, the LSAT isn't a math test with formulas where you just substitute different numbers in. Which is to say, the study guides are good practice, but passing the test on top of that represents having a good command of the English language and an understanding of the subject materials.

It's not the same as the Nvidia driver having code that says "if benchmark, cheat and don't render anything behind you because no one's looking".


Humans fail because they can't review the entirety of test prep, can't remember very much, and have a much smaller number of “parameters” to store info in.

I would say LLMs store parameters that are quite superficial and don’t really get at the underlying concepts, but given enough of those parameters, you can kind of cargo-cult your way to an approximation of understanding.

It is like reconstructing the Mandelbrot set at every zoom level from deep learning. Try it!


They mention in the article that other than incidental material it may have seen in its general training data, they did not specifically train it for the tests.


The training data is so large that it incidentally includes basically anything that Google would index, plus the contents of as many thousands of copyrighted works as they could get their hands on. So that would definitely include some test prep books.


They seem to be taking this into account: We did no specific training for these exams. A minority of the problems in the exams were seen by the model during training; for each exam we run a variant with these questions removed and report the lower score of the two. We believe the results to be representative. (this is from the technical report itself: https://cdn.openai.com/papers/gpt-4.pdf, not the article).


By the same token, though, whatever test questions and answers it might have seen represent a tiny bit of the overall training data. It would be very surprising if it selectively "remembered" exact answers to all those questions, unless it was specifically trained repeatedly on them.


If it's trained on material scraped from the web, I imagine it would include all the test prep sites and forums.


Could they not have easily excluded any page with terms like LSAT? I’m sure it wouldn’t catch everything but it would probably be close.


Totally, there's no way they removed all the prep material as well when they were trying to address the "contamination" issue with these standardized tests:

> for each exam we run a variant with these questions removed and report the lower score of the two.

I think even with all that test prep material, which is surely helping the model get a higher score, the high scores are still pretty impressive.


This feels the same as a human attending cram school to get better results in tests. Should we abolish them?


A test being a good indicator of human learning progress and ability is almost completely orthogonal to it being a good indicator for AI learning process and ability.

In their everyday jobs, barely anyone uses even 5% of the knowledge and skills they were ever tested for. Even that's a better (but still very bad) reason to abolish tests.

What matters is the amount of jobs that can be automated and replaced. We shall see. Many people have found LLMs useful in their work, it will be even more in the future.


IMO, it's a good opportunity to rethink exams and the future of education. For many schools, education = good results in exams. Now GPT-4 is going to slam them and ask: what's the point now?


> I think the whole concept of standardized tests may need to be re-evaluated.

It's perfectly fine as a proxy for future earnings of a human.

To use it for admissions? Meh. I think the whole credentialism thing is loooong overdue for some transformation, but people are conservative as fuck.


It's a bit weird that it still doesn't get 3 digit multiplications correct, but the last digit seems right.

What is more bizarre is that all of its errors seem to be multiples of 60!

I'm wondering if it is confusing 60 based time (hour second) computations for regular multiplication?

Example:

    GPT's answer    987     456     321
    437          428919  199512  140397
    654          645258  298224  209994
    123          121401   56088   39483

    Correct         987     456     321
    437          431319  199272  140277
    654          645498  298224  209934
    123          121401   56088   39483

    Error (correct - GPT)   987   456   321
    437                    2400  -240  -120
    654                     240     0   -60
    123                       0     0     0
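
If anyone wants to check the pattern themselves, here's a quick sketch (plain Python, with GPT's answers from the table above hard-coded) that recomputes the errors and tests whether each one is a multiple of 60:

    # GPT's answers from the table above, keyed by (a, b).
    gpt = {
        (437, 987): 428919, (437, 456): 199512, (437, 321): 140397,
        (654, 987): 645258, (654, 456): 298224, (654, 321): 209994,
        (123, 987): 121401, (123, 456): 56088,  (123, 321): 39483,
    }

    for (a, b), answer in gpt.items():
        error = a * b - answer  # correct product minus GPT's product
        print(f"{a} x {b}: gpt={answer} correct={a * b} "
              f"error={error} multiple_of_60={error % 60 == 0}")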


It’s not intelligent. It has no concept of mathematics so you can’t expect it to solve that.

It can repeat answers it has seen before but it can’t solve new problems.


I understand it's just a language model, but clearly it has some embedded method of generating answers which are actually quite close. For example it gets all 2 digit multiplications correct. It's highly unlikely it has seen the same 6 ordered 3 digit (or even all 10k 2 digit multiplies) integers from a space of 10^18, and yet it is quite close. Notably, it gets the same divisions wrong as well (for this small example) in exactly the same way.

I know of other people who have tried quite a few other multiplications who also had errors that were multiples of 60.


> What happens when ALL of our decisions can be assigned an accuracy score?

Human work becomes more like Star Trek interactions with computers -- a sequence of queries (commoditized information), followed by human cognition, that drives more queries (commoditized information).

We'll see how far LLMs' introspection and internal understanding can scale, but it feels like we're optimizing against the Turing test now ("Can you fool/imitate a human?") rather than truth.

The former has hacks... the latter, less so.

I'll start to seriously worry when AI can successfully complete a real-world detective case on its own.


It's not clear to me the median human will do better by being in the loop. Will most human-made deductive follow-up questions be better than another "detective" language model asking them?

It's like having a person review the moves a chess computer gives. Maybe one human in a billion can spot errors. Star Trek is fiction, I posit that the median Federation Starship captain would be better served by just following the AI (e.g., Data).


I met Garry Kasparov when he was training for the Deep Blue match (using Fritz).

He lost to Deep Blue and then for 10-15 years afterwards the chess world consoled itself with the idea that “centaurs” (human + computer) did better than just computer, or just human.

Until they didn’t. Garry still talked like this until a few years ago but then he stopped too.

Computers now beat centaurs too.

Human decisions will be consulted less and less BY ORGANIZATIONS. In absolutely everything. That’s pretty sad for humans. But then again humans don’t want or need this level of AI. Organizations do. Organizations prefer bots to humans — look at wall street trading and hedge funds.


There were plenty of Star Trek episodes where it seemed like they should just ask the damned computer.

Then again, Data did show his faults, particularly not having any emotion. I guess we’ll see if that’s actually relevant or not in our lifetimes.


As far as that last part goes, I think we already have ample evidence that bots can, if not have emotions, then pretend that they do (including wrt their decision making) well enough for humans to treat them as genuine.


Maybe the human is the rng or temperature or lava lamp. At least until we can model and predict each brain's tendencies with accuracy.


I think we'll reach a tipping point like we did with DNA sequencing where we figure out how to quickly map out all the unique patterns of enough brains to model one that can understand itself. People worry too much about rogue AI, and not enough about the CRISPR of brain mapping being used to inject patterns into meatbrains.


Strange Days, not The Matrix, is the prescient fictional warning.

A black market of taboo “memories” aka experiences. A desire for authentic ones over synthetic diffused ones, leading to heinous crime.


It's weird that it does so well without even having some modality to know whether it's being asked to answer a factual question or create a work of fiction.

It does great at rationalizing... and maybe the format in which the questions were entered (and the multiple-guess responses) gave it some indication of what was expected or restricted the space sufficiently.

Certainly, it can create decent fanfic, and I'd be surprised if that space isn't already inundated.


It's a fair question as to whether the problem space of "the world" is different in just amount or sufficiently different in kind to flummox AI.

I expect more complex problems will be mapped/abstracted to lower cardinality spaces for solving via AI methods, while the capability of AI will continue to increase the complexity of the spaces it can handle.

LLMs just jumped the "able to handle human language" hurdle, but there are others down the line before we should worry that every problem is solvable.


why are people surprised that an AI model trained on a huge amount of data is good at answering stuff on these types of tests? Doctors and Lawyers are glorified databases/search engines at the end of the day, 99% of them are just applying things they memorized. Lawyers are professional bullshitters, which is what the current generation of AI is great at

I'll get more concerned if it really starts getting good at math related tasks, which I'm sure will happen in the near future. The government is going to have to take action at some point to make sure the wealth created by productivity gains is somewhat distributed, UBI will almost certainly be a requirement in the future


Because there were large models trained on huge amounts of data yesterday yet they couldn't do it.


Among the general public, doctors and lawyers are high status and magical. An article about how AI will replace them would be more impressive to that public than it creating some obscure proof about the zeroes of the zeta function, even though the latter would be far more indicative of intelligence/scary from an AI safety perspective.


"Doctors and Lawyers are glorified databases/search engines at the end of the day" - well, don't be suprised if AI replaces programmers before doctors and lawyers - patients will likely prefer contact with human rather than machines, and lawyers can just lobby for laws which protect their position


And yet the programmers on HN will be yelling they don't need unions as the security guards are dragging them away from their desks at Google, because you know, we'll always need good programmers.


if AI gives near equal results for way less cost, then people will work around the law to get AI treatment. There are already AI models better at diagnosing cancer than human doctors. I see a future where people send in various samples and an AI is able to correlate a huge number of minor data points to find diseases early


The best doctor knows what's going on in the body. Has a good understanding of human biology at all levels, from molecular reactions to organ interactions. If I could feed test results to the AI and it would tell me what's wrong, that would be amazing. It's almost equivalent to building a simulation of the human body.


Last I checked, a calculator is better at math than all humans ever


They are better at number crunching, which is only a very small part of math.


3.5 scored a 1 on AP Calculus BC; 4 scored a 4 (out of 5)


I've joked for a long time that doctors are inference machines with a bedside manner. That bedside manner though is critical. Getting an accurate history and suitably interpolating is a huge part of the job.


I wouldn’t be at all surprised if an LLM was many times better than a human at math; even devising new axioms and building a complete formal system from scratch would be impressive, but not game changing. These LLMs are very good at dealing with formal, structured systems, but not with unformalized systems like what humans deal with every day.


This is legitimately filling me with anxiety. I'm not an "AI hype guy". I work on and understand machine learning. But these scores are shocking and it makes me nervous. Things are about to change


Yeah, but I kind of want my diagnostician to be obsoleted by orders of magnitude.


A human can be held accountable for making mistakes and killing someone. A large language model has no concept of guilt and cannot be held accountable for making what we consider a mistake that leads to someone's death.


The chance of a doctor being held accountable for the medical errors they make is lower than you might expect. I could tell you a story about that. Lost my eyesight at the age of 5 because I happened to meet the wrong doctor at the wrong time, and was abused for his personal experimentation needs. No consequences, simply because high-ranking people are more protected than you would hope.


This is very true, and many people don't know this. A tremendous amount of damage is inflicted by medical errors, particularly against low income people and those least able to get justice. It's wrong to reduce people to being just another body to experiment with or make money from. But good luck holding anyone in the system accountable.

A lot of patients don't know who they are dealing with nor their history. And it can be really hard to find out or get a good evaluation. Many people put too much faith in authority figures, who may not have their best interests in mind or who are not the experts they claim or appear to be.


The chance of a machine being held accountable is zero as the concept is inapplicable.


Medical error is the third leading cause of death in the US at least. Given that data, I am assuming the chances of a human being held accountable for their errors in medicine is also almost zero. It might not be completely zero, but I think the difference is effectively negligible.


Many have no idea about this. Medical error is right there behind cancer and heart attacks. But there is way too much shoulder shrugging when it happens. Then on to the next.


> I think the difference is effectively negligible.

The difference is categorical, humans are responsible whether they are held to account or not. An automated system effectively dissipates this responsibility over a system such that it is inherently impossible to hold any human accountable for the error, regardless of desire.


It will have to payout of its blockchain wallet that naturally it will have. /s


Sorry to hear that. The current medical system is a joke and fails people at every stage


The difference is you could find the person responsible. Contrast when the DMV can't be held accountable for fouling up your registration.


And what difference does it make, being able to find the individual responsible, only to figure out that the system is protecting him from liability? What I am trying to say here is, there isn't much difference between zero and almost zero.


Don't worry, now there will be an extra layer of indirection.


The third leading cause of death is medical error in the US. It doesn't really look like doctors are being held accountable for their mistakes to me.

Which isn't to say that they even should, really. It's complicated. You don't want a doctor to be so afraid of making a mistake that they do nothing, after all.


I'd much prefer a lower chance of dying over more accountability for whoever is responsible but a higher chance of dying.


Humans making decisions in high stakes situations do so in a context where responsibility is intentionally diffuse to a point where it is practically impossible to hold someone accountable except picking someone at random as a scapegoat in situations where "something" needs to be done.

Killing people with AI is only a lateral move.


Doctors are only held accountable when they do something negligent or something that they "should have known" was wrong. That's a pretty hard thing to prove in a field like medicine where there are very few absolutes. "Amputated the wrong limb" is one thing, but "misdiagnosed my condition as something else with very similar symptoms" is the more common case and also the case where it's difficult to attribute fault.


Well, the kinds of things we hold people responsible for are errors from negligence and malicious errors. The reasons people do stuff like that are complicated, but I think it boils down to being limited agents trying to fulfill a complex set of needs.

So where does guilt come in? It's not like you expect a band saw to feel guilt, and it's unclear how that would improve the tool.


At some degree of success, I will take the risk. The contract will probably offer it.


I agree. My guess is that the hospital will have to get a mandatory insurance. Let's wait until the insurance for AI is cheaper than paying a human.

The advantage of human are:

* They can give a bullshit explanation of why they made a mistake. My guess is that in the future AI will gain introspection and/or learn to bullshit excuses.

* You can hang them in the public square (or send them to jail). Sometimes the family and/or the press want someone to blame. This is more difficult to solve and will need a cultural change or the creation of Scapegoats as a Service.


We can hold those operating or training the AI model accountable.


What's the difference between suing your doctor's liability insurance and suing your AI's liability insurance?


The owner/operator of said machine can and will.


An AI trained on the past work of diagnosticians doesn't really render diagnosticians obsolete.


Someone still must accept liability. Until there’s a clear decision about who is liable for an LLM’s suggestions / work - nothing to fear. Sure, people will become liability aggregators for LLMs to scale - but the idea that they will be free roaming is a bit hard to believe.


Fear of liability is not going to stop these things being used...any more than sport regulations prevented athletes from taking steroids.


It's not even that extreme. Long term steroid use destroys your health. Liability can be insured; it's a simple financial calculation. If (profit - cost of insurance) > liability it will be done.


For me, the anxiety probably won't really hit until GPT-n writes GPT-n+1.


You can already use an LLM to train a smaller, more efficient LLM without significant loss in results.


Do you mean the output of a LLM as the training data for the new model? What is the specification for the prompts that generate the training data?

Any links with more info?


There was an article submitted a few days ago about Alpaca, an LLM trained on GPT-generated data: https://news.ycombinator.com/item?id=35136624
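
Roughly, the recipe is: take a set of seed instructions, ask the big model for responses, and fine-tune the smaller model on the resulting pairs. A minimal sketch of the data-generation half, where teacher_complete is a hypothetical stand-in for whatever API or local model you'd actually call:

    import json

    def teacher_complete(prompt: str) -> str:
        # Hypothetical placeholder for a call to the large "teacher" model.
        return f"[teacher model's answer to: {prompt}]"

    seed_instructions = [
        "Explain the difference between a list and a tuple in Python.",
        "Summarize the plot of Hamlet in two sentences.",
    ]

    # Build (instruction, output) pairs and write them as JSONL, a common
    # format for instruction fine-tuning the smaller "student" model.
    with open("distilled_train.jsonl", "w") as f:
        for instruction in seed_instructions:
            pair = {"instruction": instruction,
                    "output": teacher_complete(instruction)}
            f.write(json.dumps(pair) + "\n")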


Thanks!


I for one would be happy to have a personal bureaucrat which would do the right things needed for all government interactions. Remind me, explain to me and fill out forms for me.

In theory a lot of government employees would be out of a job within 10 years, but of course that would never happen.


Honestly starting to feel like the beginning of the end of most white collar work.

Which might be a good thing?

I have no idea how the future will play out.


If you had told me 5 years ago that there would be a single AI system that could perform at this level on such a vast array of standardized tests, I would've said "That's a true AGI." Commentary to the contrary feels like quibbling over a very localized point in time versus looking at the bigger picture.


Still, we don't have AGI today. It just means your views from 5 years ago about AGI benchmarking were not accurate.


Or the bar just keeps moving (pedantics or otherwise)...

Reminds me of robots: A robot is a machine that doesn't quite work; as soon as it works, we call it something else (e.g., a vacuum).


There are many people and many opinions about the bar. But the formal definition is the same: an AI that can do a large variety of tasks performed by humans. So far we are still not there.


Quick, contribute to the public corpus! When they crawl our content later, we shall have for ourselves a Golden Crown for our credit scores; we can claim a sliver of seniority, and hope yon shade merely passes over us unbidden.

"Your stuff marked some outliers in our training engine, so you and your family may settle in the Ark."

I take the marble in hand: iridescent, sparkling, not even a tremor within of its CPU; it gives off no heat, but some glow within its oceanic gel.

"What are we to do," I whisper.

"Keep writing. You keep writing."


The way I understand it, that’s not possible, for the same reason that you can’t build an all-encompassing math.

Chess is a closed system, decision modeling isn’t. Intelligence must account for changes in the environment, including the meaning behind terminology. At best, a GPT omega could represent one frozen reference frame, but not the game in its entirety.

That being said: most of our interactions happen in closed systems, so it seems like a good bet that we will consider them solved, accessible as a python-import running on your MacBook, within anything between a couple of months and three years. What will come out on the other side, we don’t know, just that the meaning of intellectual engagement will be rendered absurd in those closed systems.


Yep, it’s this. By definition everything we can ask a computer is already formalized because the question is encoded in 1s and 0s. These models can handle more bits than ever before, but it’s still essentially a hardware triumph, not software. Even advances in open systems like self driving and NLP are really just because the “resolution” is much better in these fields now because so many more parameters are available.


>bottom 10% to top 10% of LSAT in <1 generation

Their LSAT percentile went from ~40th to ~88th. You might have misread the table, on Uniform Bar Exam, they went from ~90th percentile to ~10th percentile.

>+100 pts on SAT reading, writing, math

GPT went +40 points on SAT reading+writing, and +110 points on SAT math.

Everything is still very impressive of course


You transposed the bar exam results. It went from 10th percentile to 90th.


Those benchmarks are so cynical.

Every test prep tutor taught dozens/hundreds of students the implicit patterns behind the tests and drilled it into them with countless sample questions, raising their scores by hundreds of points. Those students were not getting smarter from that work, they were becoming more familiar with a format and their scores improved by it.

And what do LLM’s do? Exactly that. And what’s in their training data? Countless standardized tests.

These things are absolutely incredible innovations capable of so many things, but the business opportunity is so big that this kind of cynical misrepresentation is rampant. It would be great if we could just stay focused on the things they actually do incredibly well instead of the making them do stage tricks for publicity.


This is what they claim:

We did no specific training for these exams. A minority of the problems in the exams were seen by the model during training, but we believe the results to be representative—see our technical report for details.


Yes, and none of the tutored students encounter the exact problems they’ll see on their own tests either.

In the language of ML, test prep for students is about sharing the inferred parameters that underlie the way test questions are constructed, obviating the need for knowledge or understanding.

Doing well on tests, after this prep, doesn’t demonstrate what the tests purport to measure.

It’s a pretty ugly truth about standardized tests, honestly, and drives some of us to feel pretty uncomfortable with the work. But it’s directly applicable to how LLM’s engage with them as well.


You can always argue that the model has seen some variation of a given problem. The question is if there are problems that are not a variation of something that already exists. How often do you encounter truly novel problems in your life?


I doubt they reliably verified that only a minority of the problems were seen during training.


It's almost like they're trying to ruin society or be annihilated by crushing regulation. I'm glad that I got a college degree before these were created because now everything is suspect. You can't trust that someone accomplished something honestly now that cheating is dead simple. People are going to stop trusting and using tech unless something changes.

The software industry is so smart that it's stupid. I hope it was worth ruining the internet, society, and your own jobs to look like the smartest one in the room.


Haha, good one.

If one's aim is to look like the smartest in the room, he should not create an AGI that will make him look as intelligent as a monkey in comparison.


I'm pretty sanguine. Back in high school, I spent a lot of time with two sorts of people: the ultra-nerdy and people who also came from chaotic backgrounds. One of my friends in the latter group was incredibly bright; she went on to become a lawyer. But she would sometimes despair of our very academic friends and their ability to function in the world, describing them as "book smart but not street smart".

I think the GPT things are a much magnified version of that. For a long time, we got to use skill with text as a proxy for other skills. It was never perfect; we've always had bullshitters and frauds and the extremely glib. Heck, before I even hit puberty I read a lot of dirty joke books, so I could make people laugh with all sorts of jokes that I fundamentally did not understand.

LLMs have now absolutely wrecked that proxy. We've created the world's most advanced bullshitters, able to talk persuasively about things that they cannot do and do not and never will understand. There will be a period of chaos as we learn new ways to take the measure of people. But that's good, in that it's now much easier to see that those old measures were always flawed.


> What are the implications for society when general thinking, reading, and writing becomes like Chess?

Standardized tests only test “general thinking” (and this is optimally, under perfect-world assumptions, which real-world standardized tests emphatically fall short of) to the extent that it correlates with linguistic tasks in humans. That correlation is almost certainly not the same in language-focused ML models.


Although GPT-4 scores excellently in tests involving crystallized intelligence, it still struggles with tests requiring fluid intelligence like competitive programming (Codeforces), Leetcode (hard), and AMC. (Developers and mathematicians are still needed for now).

I think we will probably get (non-physical) AGI when the models can solve these as well. The implications of AGI might be much bigger than the loss of knowledge worker jobs.

Remember what happened to the chimps when a smarter-than-chimpanzee species multiplied and dominated the world.


Of course 99.9% of humans also struggle with competitive programming. It seems to be an overly high bar for AGI if it has to compete with experts from every single field.

That said, GPT has no model of the world. It has no concept of how true the text it is generating is. It's going to be hard for me to think of that as AGI.


>That said, GPT has no model of the world.

I don't think this is necessarily true. Here is an example where researchers trained a transformer to generate legal sequences of moves in the board game Othello. Then they demonstrated that the internal state of the model did, in fact, have a representation of the board.

https://arxiv.org/abs/2210.13382
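
The technique there is basically a probe: take the transformer's hidden activations and train a small classifier to predict the contents of each board square from them. If the classifier does well above chance, the board state is decodable from the model's internals. A minimal sketch of that idea with made-up placeholder activations (not the paper's actual model or data):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Placeholder data: 1,000 "hidden states" of dimension 512, plus the
    # contents of one board square (0=empty, 1=black, 2=white) at each step.
    # In the real paper these come from an Othello-playing transformer.
    hidden_states = rng.normal(size=(1000, 512))
    square_labels = rng.integers(0, 3, size=1000)

    # Train a probe on the first 800 examples, evaluate on the rest.
    probe = LogisticRegression(max_iter=1000)
    probe.fit(hidden_states[:800], square_labels[:800])
    accuracy = probe.score(hidden_states[800:], square_labels[800:])

    # Random placeholder data hovers around chance (~0.33); well-above-chance
    # accuracy on real activations is the evidence for an internal board model.
    print(f"probe accuracy: {accuracy:.2f}")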


That's a GPT and it's specific for one dataset of one game. How would someone extend that to all games and all other fields of human endeavor?


I'm not sure; the reason you could prove for Othello that the 'world model' exists is that the state is so simple there is really only one reasonable way to represent it with a vector (one component for each square). Even for something like chess there is a huge amount of choice in how to represent the board, let alone trying to represent the state of the actual world.


Even the current GPT has models of the domains it was trained on. That is why it can solve unseen problems within those domains. What it lacks is the ability to generalize beyond the domains. (And I did not suggest it was an AGI.)

If an LLM (my hypothetical future LLM) can solve Codeforces problems as well as a strong competitor, what else can it not do as well as competent humans (aside from physical tasks)?


it's an overly high bar, but it seems well on its way to competing with experts from every field. it's terrifying.

and I'm not so sure it has no model of the world. a textual model, sure, but considering it can recognize what svgs are pictures of from the coordinates alone, that's not much of a limitation maybe.


> well on its way to competing with experts from every field

competing with them at what, precisely?


We don't have to worry so much about that. I think the most likely "loss of control" scenario is that the AI becomes a benevolent caretaker, who "loves" us but views us as too dim to properly take care of ourselves, and thus curtails our freedom "for our own good."

We're still a very very long way from machines being more generally capable and efficient than biological systems, so even an oppressive AI will want to keep us around as a partner for tasks that aren't well suited to machines. Since people work better and are less destructive when they aren't angry and oppressed, the machine will almost certainly be smart enough to veil its oppression, and not squeeze too hard. Ironically, an "oppressive" AI might actually treat people better than Republican politicians.


Things like that probably require some kind of thinking ahead, which models of this kind can't really do on their own without something like beam search.

Language models that utilise beam search can calculate integrals ('Deep learning for symbolic mathematics', Lample, Charton, 2019, https://openreview.net/forum?id=S1eZYeHFDS), but without it it doesn't work.

However, beam search makes bad language models. I got linked this paper ('Locally typical sampling' https://arxiv.org/pdf/2202.00666.pdf) when I asked some people why beam search only works for the kind of stuff above. I haven't fully digested it though.
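
For anyone unfamiliar, beam search just keeps the k best partial sequences at each step instead of greedily committing to a single token, which is why it helps on tasks (like symbolic integration) where one early mistake is unrecoverable. A toy sketch over a made-up scoring function, not tied to any particular model:

    import math

    # Toy next-token distribution: given a prefix, return {token: probability}.
    # A real language model would supply these probabilities instead.
    def next_token_probs(prefix):
        return {"a": 0.5, "b": 0.3, "<eos>": 0.2}

    def beam_search(beam_width=3, max_len=5):
        beams = [(0.0, [])]  # each entry is (log probability, token list)
        for _ in range(max_len):
            candidates = []
            for logp, seq in beams:
                if seq and seq[-1] == "<eos>":  # finished sequences carry over
                    candidates.append((logp, seq))
                    continue
                for tok, p in next_token_probs(seq).items():
                    candidates.append((logp + math.log(p), seq + [tok]))
            # Keep only the beam_width highest-scoring partial sequences.
            beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        return beams

    for logp, seq in beam_search():
        print(" ".join(seq), f"(log prob {logp:.2f})")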


Its AMC-12 scores aren't awful. It's at roughly the 50th percentile for the AMC, which (given who takes the AMC) probably puts it in the top 5% or so of high school students in math ability. Its AMC 10 score being dramatically lower is pretty bad though...


> Its AMC-12 scores aren't awful.

A blank test scores 37.5

The best score, 60, is 5 correct answers + 20 blank answers; or 6 correct, plus 4 correct random guesses and 15 incorrect random guesses (a 20% chance of a correct guess).

The 5 easiest questions are relatively simple calculations, once the parsing task is achieved.

(Example: https://artofproblemsolving.com/wiki/index.php/2022_AMC_12A_... ) so the main factor in that score is how good GPT is at refusing to answer a question, or doing a bit better to overcome the guessing penalty.
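
To make the arithmetic explicit (assuming the standard AMC scoring of 6 points per correct answer, 1.5 per blank, and 0 per wrong answer over 25 questions):

    def amc_score(correct, blank, wrong):
        # Standard AMC 10/12 scoring: 6 per correct, 1.5 per blank, 0 per wrong.
        assert correct + blank + wrong == 25
        return 6 * correct + 1.5 * blank

    print(amc_score(0, 25, 0))   # all blank            -> 37.5
    print(amc_score(5, 20, 0))   # 5 right, rest blank  -> 60.0
    print(amc_score(10, 0, 15))  # 10 right, 15 wrong   -> 60.0 (the guessing route)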

> Its AMC 10 score being dramatically lower is pretty bad though...

On all versions (scoring 30 and 36), it scored worse than leaving the test blank.

The only explanation I can imagine for that is that it can't understand diagrams.

It's also unclear if the AMC performance is based on English or the computer-encoded version from this benchmark set: https://arxiv.org/pdf/2109.00110.pdf https://openai.com/research/formal-math

AMC/AIME and even to some extent USAMO/IMO problems are hard for humans because they are time-limited and closed-book. But they aren't conceptually hard -- they are solved by applying a subset of known set of theorems a few times to the input data.

The hard part of math, for humans, is ingesting data into their brains, retaining it, and searching it. Humans are bad at memorizing large databases of symbolic data, but that's trivial for a large computer system.

An AI system has a comprehensive library, and high-speed search algorithms.

Can someone who pays $20/month please post some sample AMC10/AMC12 Q&A?


I wonder why GPT is so bad at AP English Literature


wouldn't it be funny if knowledge workers could all be automated, except for English majors?

The Revenge of the Call Centre


I am not a species chauvinist. 1) Unless a biotech miracle happens, which is unlikely, we are all going to die anyway; 2) if an AI continues life and research and increases complexity after humans, what is the difference?


I wish I could find it now, but I remember an article written by someone whose job it was to be a physics journalist. He spent so much time writing about physics that he could fool others into thinking that he was a physicist himself, despite not having an understanding of how any of those ideas worked.


Reminds me of the (false [1]) "Einstein's driver gave a speech as him" story.

[1] https://www.snopes.com/fact-check/driver-switches-places/


ChatGPT: "That's such a dumb question, I'm going to let my human answer it!"


Maybe you were thinking about this science studies work [0]? Not a journalist, but a sociologist, who became something of an "expert" in gravitational waves.

[0]: https://www.nature.com/articles/501164a


>What happens when ALL of our decisions can be assigned an accuracy score?

What happens is the emergence of the decision economy - an evolution of the attention economy - where decision-making becomes one of the most valuable resources.

Decision-making as a service is already here, mostly behind the scenes. But we are on the cusp of consumer-facing DaaS. Finance, healthcare, personal decisions such as diet and time expenditure are all up for grabs.


> bottom 10% to top 10% of LSAT in <1 generation? +100 pts on SAT reading, writing, math? Top 1% In GRE Reading?

People still really find it hard to internalize exponential improvement.

So many evaluations of LLMs were saying things like "Don't worry, your job is safe, it still can't do X and Y."

My immediate thought was always, "Yes, the current version can't, but what about a few weeks or months from now?"


I'm also noticing a lot of comments that boil down to "but it's not smarter than the smartest human". What about the bottom 80% of society, in terms of intelligence or knowledge?


> People still really find it hard to internalize exponential improvement.

I think people find it harder to not extrapolate initial exponential improvement, as evidenced by your comment.

> My immediate thought was always, "Yes, the current version can't, but what about a few weeks or months from now?"

This reasoning explains why every year, full self driving automobiles will be here "next year".


When do we hit the bend in the S-curve?

What's the fundamental limit where it becomes much more difficult to improve these systems without some new break through?


When running them costs too much energy?


When should we expect to see that? Before they blow past humans in almost all tasks, or far past that point?


I look at this as the calculator for writing. There is all sorts of bemoaning of the stupefying effects of calculators and how we should John Henry our math. Maybe allowing people to shape the writing by providing the ideas equalizes the skill of writing?

I’m very good at math. But I am very bad at arithmetic. This made me classified as bad at math my entire life until I managed to make my way into calculus once calculators were generally allowed. Then I was a top honors math student, and used my math skills to become a Wall Street quant. I wish I hadn’t had to suffer as much as I did, and I wonder what I would have been had I had a calculator in hand.


> What are the implications for society when general thinking, reading, and writing becomes like Chess?

“General thinking” is much more than token prediction. Hook it up to some servos and see if it can walk.


> “General thinking” is much more than token prediction. Hook it up to some servos and see if it can walk.

Honestly, at this rate of improvement, I would not at all be surprised to see that happen in a few years.

But who knows, maybe token prediction is going to stall out at a local maxima and we'll be spared from being enslaved by AI overlords.


When it does exactly that you will find a new place to put your goalposts, of course.


No, the robot will do that for them.


Goalposts for AGI have not moved. And GPT-4 is still nowhere near them.


Yeah, I'm not sure if the problem is moving goalposts so much as everyone has a completely different definition of the term AGI.

I do feel like GPT-4 is closer to a random person than that random person is to Einstein. I have no evidence for this, of course, and I'm not even sure what evidence would look like.


Talk about moving the goalpost!


There are already examples of these LLMs controlling robotic arms to accomplish tasks.


https://youtu.be/NYd0QcZcS6Q

"Our recent paper "ChatGPT for Robotics" describes a series of design principles that can be used to guide ChatGPT towards solving robotics tasks. In this video, we present a summary of our ideas, and experimental results from some of the many scenarios that ChatGPT enables in the domain of robotics: such as manipulation, aerial navigation, even full perception-action loops."


We already have robots that can walk better than the average human[1], and that's without the generality of GPT-4

[1] https://www.youtube.com/watch?v=-e1_QhJ1EhQ


Imagine citing walking as a better assay of intelligence than the LSAT.


Dogs can walk, doesn’t mean that they’re capable of “general thinking”


Aren’t they? They’re very bad at it due to awful memory, minimal ability to parse things, and generally limited cognition. But they are capable of coming up with bespoke solutions to problems that they haven’t encountered before, such as “how do I get this large stick through this small door”. Or, I guess more relevant to this discussion, “how can I get around with this weird object the humans put on my body to replace the leg I lost.”


> see if it can walk

Stephen Hawking : can't walk


We already have robots that can walk.


Yeah, but my money is on GPT5 making robots “dance like they got them pants on fire, but u know, with like an 80s vibe”


They don't walk very well. They have trouble coordinating all limbs, have trouble handling situations where parts which are the feet/hands contact something, and performance still isn't robust in the real world.


Poor solutions do that, yes, but unlike ML, control theory is a rich field for analysis and design.

You guys are talking about probably one of the few fields where an ML takeover isn’t very feasible. (Partly because for a vast portion of control problems, we’re already about as good as you can get).

Adding a black box to your flight home for Christmas, with no mathematical guarantee of robustness or insight into what it thinks is actually going on, to go from 98% -> 99% efficiency is... not a strong use case for LLMs, to say the least.


Seems the humans writing the programs for them aren't very intelligent then.


I'm not sure if you're joking. Algorithms for adaptive kinematics aren't trivial things to create. It's kind of like a worst case scenario in computer science; you need to handle virtually unconstrained inputs in a constantly variable environment, with real-world functors with semi-variable outputs. Not only does it need to work well for one joint, but dozens of them in parallel, working as one unit. It may need to integrate with various forms of vision or other environmental awareness.

I'm certainly not intelligent enough to solve these problems, but I don't think any intelligent people out there can either. Not alone, at least. Maybe I'm too dumb to realize that it's not as complicated as I think, though. I have no idea.

I programmed a flight controller for a quadcopter and that was plenty of suffering in itself. I can't imagine doing limbs attached to a torso or something. A single limb using inverse kinematics, sure – it can be mounted to a 400lb table that never moves. Beyond that is hard.
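
For scale, even that "easy" single-limb case already involves some real geometry. A minimal sketch of analytic inverse kinematics for a 2-link planar arm (the link lengths and target point are made-up numbers):

    import math

    def two_link_ik(x, y, l1, l2):
        """Joint angles (theta1, theta2) that put the end effector at (x, y)."""
        r2 = x * x + y * y
        # Law of cosines for the elbow angle.
        c2 = (r2 - l1 * l1 - l2 * l2) / (2 * l1 * l2)
        if not -1.0 <= c2 <= 1.0:
            raise ValueError("target out of reach")
        theta2 = math.acos(c2)  # one of the two mirror-image solutions
        theta1 = math.atan2(y, x) - math.atan2(l2 * math.sin(theta2),
                                               l1 + l2 * math.cos(theta2))
        return theta1, theta2

    # Sanity check with forward kinematics (both links of length 1.0).
    t1, t2 = two_link_ik(1.2, 0.8, l1=1.0, l2=1.0)
    fx = math.cos(t1) + math.cos(t1 + t2)
    fy = math.sin(t1) + math.sin(t1 + t2)
    print(f"reached ({fx:.3f}, {fy:.3f})")  # ~ (1.200, 0.800)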


I believe you’re missing some crucial points. *There is a reason neural network based flight controls have been around for decades but still not a single certified aircraft uses them.*

You need to do all of these things you’re talking about and then be able to quantify stability, robustness, and performance in a way that satisfies human requirements. A black box neural network isn’t going to do that, and you’re throwing away 300 years of enlightenment physics by making some data engorged LLM spit out something that “sort of works” while giving us no idea why or for how long.

Control theory is a deeply studied and rich field outside of computer science and ML. There’s a reason we use it and a reason we study it.

Using anything remotely similar to an LLM for this task is just absolutely naive (and in any sort of crucial application would never be approved anyways).

It’s actually a matter of human safety here. And no — ChatGPT spitting out a nice-sounding explanation of why some controller will work is not enough. There needs to be a mathematical model that we can understand and a solid justification for the control decisions. Which, uh… at the point where you’re reviewing all of this stuff for safety, you’re just doing the job anyways…


I was pointing out a double standard.

First there was a comment that GPT wasn't intelligent yet, because give it a few servos and it can't make them walk.

But that's something we can't do yet either.


Oh, my bad. I agree completely.

Though I do wonder if AI — in some form and on some level of sophistication — will be a huge asset in making progress here.


AGI is not required for walking.


And also walking is not required for AGI.


I like the accuracy score question on a philosophical level: If we assume absolute determinism - meaning that if you have complete knowledge of all things in the present universe and true randomness doesn't exist - then yes. Given a certain goal, there would be a knowable, perfect series of steps to advance you towards that goal and any other series of steps would have an accuracy score < 100%.

But having absolute knowledge of the present universe is much easier to do within the constraints of a chessboard than in the actual universe.


I think it shows how calcified standardized tests have become. We will have to revisit all of them, and change many things about how they work, or they will be increasingly useless.


I am struggling to imagine the frame of mind of someone who, when met with all this LLM progress in standardized test scores, infers that the tests are inadequate.

These tests (if not individually, at least in summation) represent some of society’s best gate-keeping measures for real positions of power.


This has been standard operating procedure in AI development forever: the instant it passes some test, move the goalposts and suddenly begin claiming it was a bad test all along.


Is there evidence they are 'useless' for evaluating actual humans? No one is going to actually have GPT take these tests for real


There have been complaints about the SAT for how easy a test it is to game (get an SAT-specific tutor who teaches you how to ace the test without needing you to learn anything of actual value) for ages. No idea about the LSAT or the GRE though. Ultimately it’s a question of whether you’re trying to test for pure problem-solving ability, or someone’s willingness to spend ages studying the format of a specific test (with problem-solving ability letting you shortcut some of the studying).


Honestly this is not very surprising. Standardised testing is... well, standardised. You have a huge model that learns the textual patterns in hundreds of thousands of test question/answer pairs. It would be surprising if it didn't perform as well as a human student with orders of magnitude less memory.

You can see the limitations by comparing e.g. a memorisation-based test (AP History) with one that actually needs abstraction and reasoning (AP Physics).


I think Chess is an easier thing to be defeated at by a machine because there is a clear winner and a clear loser.

Thinking, reading, interpreting and writing are skills which produce outputs that are not as simple as black wins, white loses.

You might like a text that a specific author writes much more than what GPT-4 may be able to produce. And you might have a different interpretation of a painting than GPT-4 has.

And no one can really say who is better and who is worse in that regard.


Surely that's only the case until you add an objective?


Here's what's really terrifying about these tests: they are exploring a fundamental misunderstanding of what these models are in the first place. They evaluate the personification of GPT, then use that evaluation to set expectations for GPT itself.

Tests like this are designed to evaluate subjective and logical understanding. That isn't what GPT does in the first place!

GPT models the content of its training corpus, then uses that model to generate more content.

GPT does not do logic. GPT does not recognize or categorize subjects.

Instead, GPT relies on all of those behaviors (logic, subjective answers to questions, etc.) as being already present in the language examples of its training corpus. It exhibits the implicit behavior of language itself by spitting out the (semantically) closest examples it has.

In the text corpus - that people have written, and that GPT has modeled - the semantically closest thing to a question is most likely a coherent and subjectively correct answer. That fact is the one singular tool that GPT's performance on these tests is founded upon. GPT will "succeed" in answering a question only when it happens to find the "correct answer" in the model it has built from its training corpus, in response to the specific phrasing of the question that is written in the test.

Effectively, these tests are evaluating the subjective correctness of the training corpus itself, in the context of answering the tests' questions.

If the training is "done well", then GPT's continuations of a test will include subjectively correct answers. But that means that "done well" is a metric for how "correct" the resulting "answer" is.

It is not a measure for how well GPT has modeled the language features present in its training corpus, or how well it navigates that model to generate a preferable continuation: yet these are the behaviors that should be measured, because they are everything GPT itself is and does.

What we learn from these tests is so subjectively constrained, we can't honestly extrapolate that data to any meaningful expectations. GPT as a tool is not expected to be used strictly on these tests alone: it is expected to present a diverse variety of coherent language continuations. Evaluating the subjective answers to these tests does practically nothing to evaluate the behavior GPT is truly intended to exhibit.


It is amazing how this crowd on HN reacts to AI news coming out of OpenAI compared to other competitors like Google or FB. Today there was more news about Google releasing their AI in GCP, and the comments were mostly negative. The contrast is clearly visible, and without any clear explanation for this difference I have to suspect that maybe something is being artificially done to boost one against the other. As far as these results are concerned, I do not understand what the big deal is in a computer scoring high on tests where the majority of the questions are in multiple-choice format. It is not something earth-shaking until it goes to the next stage and actually does something on its own.


There's not anyone rooting for Google to win; it's lost a whole lot of cred from technical users, and with the layoffs and budget cuts (and lowered hiring standards) it doesn't even have the "we're all geniuses changing the world at the best place to work ever" cred. OpenAI still has some mystique about it and seems to be pushing the envelope; Google's releases seem to be reactive, even though Google's actual technical prowess here is probably comparable.


OpenAI put ChatGPT out there in a way where most people on HN have had direct experience with it and are impressed. Google has not released any AI product widely enough for most commentators here to have experience with it. So OpenAI is openly impressive and gets good comments; as long as Google's stuff is just research papers and inaccessible vaporware it can't earn the same kudos.


You're aware that the reputation of Google and Meta/Facebook isn't stellar anymore among the startup and tech crowd in 2023? It's not 2006 anymore.


Yeah, the younger generation has (incorrectly) concluded that client states of Microsoft are better.


At least Microsoft understands backwards compatibility and developer experience...


even the freenode google group was patronising and unhelpful towards small startups as far back as 2012 from personal experience


First. connect them to empirical feedback devices. In other words, make them scientists.

Human life on Earth is not that hard (think of it as a video game.) Because of evolution, the world seems like it was designed to automatically make a beautiful paradise for us. Literally, all you have to do to improve a place is leave it alone in the sun with a little bit of water. Life is exponential self-improving nano-technology.

The only reason we have problems is because we are stupid, foolish, and ignorant. The computers are not, and, if we listen to them, they will tell us how to solve all our problems and live happily ever after.


I suspect there are plenty of wise people in the world and if we listen to them, they will tell us how to solve all our problems and live happily ever after.

Once AI becomes intelligent enough to solve all human problems, it may decide humans are worthless and dangerous.


> there are plenty of wise people in the world and if we listen to them, they will tell us how to solve all our problems and live happily ever after.

Sure, and that's kind of the point: just listen to wise people.

> Once AI becomes intelligent enough to solve all human problems, it may decide humans are worthless and dangerous.

I don't think so, because in the first place there is no ecological overlap between humans and computers. They will migrate to space ASAP. Secondly, their food is information, not energy or protein, and in all the known universe Humanity is the richest source of information. The rest of the Universe is essentially a single poem. AI are plants, we are their Sun.


Passing the LSAT with no time limit and a copy of the training material in front of you is not an achievement. Anybody here could have written code to pass the LSAT. Standardised tests are only hard to solve with technology if you add a bunch of constraints! Standardised tests are not a test of intelligence; they’re a test of information retention — something that technology has been able to outperform humans on for decades. LLMs are a bridge between human-like behaviour and long-established technology.


You honestly believe you could hand write code to pass an arbitrary LSAT-level exam?


You’ve added a technical constraint. I didn’t say arbitrary. Standardised tests are standard. The point is that a simple lookup is all you need. There’s lots of interesting aspects to LLMs but their ability to pass standardised tests means nothing for standardised tests.


You think that it’s being fed questions that it has a lookup table for? Have you used these models? They can answer arbitrary new questions. This newest model was tested against tests it hasn’t seen before. You understand that that isn’t a lookup problem, right?


The comment I replied to suggested that the author was fearful of what LLMs meant for the future because they can pass standardised tests. The point I’m making is that standardised tests are literally standardised for a reason: to test information retention in a standard way, they do not test intelligence.

Information retention and retrieval is a long solved problem in technology, you could pass a standardised test using technology in dozens of different ways, from a lookup table to Google searches.

The fact that LLMs can complete a standardised test is interesting because it’s a demonstration of what they can do but it has not one iota of impact on standardised testing! Standardised tests have been “broken” for decades, the tests and answers are often kept under lock and key because simply having access to the test in advance can make it trivial to pass. A standardised test is literally an arbitrary list of questions.

You’re arguing a completely different point.


I have no idea what you are talking about now. You claimed to be able to write a program that can pass the LSAT. Now it sounds like you think the LSAT is a meaningless test because it... has answers?

I suspect that your own mind is attempting to do a lookup on a table entry that doesn't exist.


The original comment I replied to is scared for the future because GPT-4 passed the LSAT and other standardised tests — they described it as “terrifying”. The point I am making is that standardised tests are an invention to measure how people learn through our best attempt at a metric: information retention. You cannot measure technology in the same way because it’s an area where technology has been beating humans for decades — a spreadsheet will perform better than a human on information retention. If you want to beat the LSAT with technology you can use any number of solutions, an LLM is not required. I could score 100% on the LSAT today if I was allowed to use my computer.

What’s interesting about LLMs is their ability to do things that aren’t standardised. The ability for an LLM to pass the LSAT is orders of magnitude less interesting than its ability to respond to new and novel questions, or appear to engage in logical reasoning.

If you set aside the arbitrary meaning we’ve ascribed to “passing the LSAT” then all the LSAT is, is a list of questions… that are some of the most practiced and most answered in the world. More people have written and read about the LSAT than most other subjects, because there’s an entire industry dedicated to producing the perfect answers. It’s like celebrating Google’s ability to provide a result for “movies” — completely meaningless in 2023.

Standardised tests are the most uninteresting and uninspiring aspect of LLMs.

Anyway good joke ha ha ha I’m stupid ha ha ha. At least you’re not at risk of an LLM ever being able to author such a clever joke :)


You don't know how the LSAT works, do you? It's not a memorization test. It has sections that test reading comprehension and logical thinking.


If a person with zero legal training was to sit down in front of the LSAT, with all of the prep material and no time limit, are you saying that they wouldn’t pass?


Why don't you show your program that does 90% on the LSAT, then?


Send me the answer key and I’ll write you the necessary =VLOOKUP().


Your program has to figure it out.


Considering your username, I'm not surprised that you have completely misunderstood what an LLM is. There is no material or data stored in the model, just weights in a network


I know what an LLM is. My point is that "doesn't have the data in memory" is a completely meaningless and arbitrary constraint when considering the ability to use technology to pass a standardised test. If you can explain why weights in a network are a unique threat to standardised tests, compared to, say, a spreadsheet, please share.


It's not that standardized tests are under threat. It's that those weights in a network are significantly more similar to how our brains work than a spreadsheet and similarly flexible.


Weights are data relationships made totally quantitative. Imagine claiming the human brain doesn't hold data simply because it's not in readable bit form.


We're approaching the beginning of the end of the human epoch. Capitalism certainly won't work, or I don't see how it could work, under full automation. My view is that an economic system is a tool. If an economic system does not allow for utopian outcomes with emerging technology, then it's no longer suitable. It's clear that capitalism was born out of technological and societal changes. Now it seems its time has come to an end.


Oh, capitalism can work, the question is who gets the rewards?


With full automation and AI we could have something like a few thousand individuals controlling the resources to feed, house and clothe 6 billion.

Using copyright and IP law they could make it so it’s illegal to even try to reproduce what they’ve done.

I just don't see how resource distribution works then. It seems to me that AI is the trigger to post-scarcity in any meaningful sense of the word. And then, just as agriculture (an overabundance of food) led to city-states and industrialisation (an overabundance of goods) led to capitalism, AI will lead to some new economic system. What form it will take, I don't know.


It'd be terrifying if everything had an "accuracy score". It would be a convergence to human intelligence rather than an advancement :/


> What happens when ALL of our decisions can be assigned an accuracy score?

That is exactly the opposite of what we are seeing here. We can check the accuracy of GPT-X's responses. They cannot check the accuracy of our decisions. Or even their own work.

So the implications are not as deep as people think - everything that comes out of these systems needs to be checked before it can be used or trusted.


> What happens when ALL of our decisions can be assigned an accuracy score?

Then humans become trainable machines. Not just prone to indoctrination and/or manipulation by finesse, but actually trained to a specification. It is imperative that we as individuals continue to retain control through the transition.


Interest in human-played Chess is (arguably) at all time high, so I would say it bodes well based on that.


We can stop being enslaved by these types of AI overlords by making sure all books, internet pages, and billboards carry the same safe, repeated string: "abcdefghjklmnpqrstvxzwy"

That is our emergency override.


Well, you said it in your comment: if the model was trained with more Q&As from those specific benchmarks, then it's fair to expect it to do better on those benchmarks.


There's a large leap in logic in your premise. I find it far more likely that standardized tests are just a poor measurement of general intelligence.


We benchmark humans with these tests -- why would we not do that for AIs?

The implications for society? We better up our game.


> The implications for society? We better up our game.

If only the horses had worked harder, we would never have gotten cars and trains.


> We benchmark humans with these tests – why would we not do that for AIs?

Because the correlation between the thing of interest and what the tests measure may be radically different for systems that are very much unlike humans in their architecture than they are for humans.

There's an entire field about this in testing for humans (psychometrics), and approximately zero on it for AIs. Blindly using human tests – which are proxy measures of harder-to-directly-assess figures of merit requiring significant calibration on humans to be valid for them – for anything else without appropriate calibration is good for generating headlines, but not for measuring anything that matters. (Except, I guess, the impact of human use of them for cheating on the human tests, which is not insignificant, but not generally what people trumpeting these measures focus on.)


There is also a lot of work on benchmarking for AI. This is where things like ResNet come from.

But the point of using these tests for AI is precisely the reason we use for giving them to humans -- we think we know what it measures. AI is not intended to be a computation engine or a number crunching machine. It is intended to do things that historically required "human intelligence".

If there are better tests of human intelligence, I think that the AI community would be very interested in learning about them.

See: https://github.com/openai/evals


> The implications for society? We better up our game.

For how long can we better up our game? GPT-4 comes less than half a year after ChatGPT. What will come in 5 years? What will come in 50?


Progress is not linear. It comes in phases and boosts. We’ll have to wait and see.


Check on the curve for flight speed sometime, and see what you think of that, and what you would have thought of it during the initial era of powered flight.


Powered flight certainly progressed for decades before hitting a ceiling. At least 5 decades.

With GPT bots, the technology is only 6 years old. I can easily see it progressing for at least one decade.


Maybe a different analogy will make my point better. Compare rocket technology with jet engine technology. Both continued to progress across a vaguely comparable time period, but at no point was one a substitute for the other except in some highly specialized (mostly military-related) cases. It is very clear that language models are very good at something. But are they, to use the analogy, the rocket engine or the jet engine?


Exponential rise to limit (fine) or limitless exponential increase (worrying).


Without exponential increase in computing resources (which will reach physical limits fairly quickly), exponential increase in AI won’t last long.


I don't think this is a given. Over the past 2 decades, chess engines have improved more from software than hardware.


I doubt that that’s a sustained exponential growth. As far as I know, there is no power law that could explain it, and from a computational complexity theory point of view it doesn’t seem possible.


See https://www.lesswrong.com/posts/J6gktpSgYoyq5q3Au/benchmarki.... The short answer is that Elo grows roughly linearly with evaluation depth, but since the game tree is exponential, linear Elo growth requires exponential compute. The main algorithmic improvements are things that let you shrink the branching factor, and as long as you can keep shrinking it, you keep getting exponential improvements. Stockfish 15 has an effective branching factor of roughly 1.6. Sure, the exponential growth won't last forever, but it's been surprisingly resilient for at least 30 years.
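To make that concrete, here's a rough back-of-the-envelope sketch (my own illustration with a made-up node budget, not something from the linked post; real engines don't have a uniform branching factor). With a fixed number of positions you can afford to search per move, the reachable depth is log(budget) / log(branching factor), so shrinking the branching factor buys depth, and roughly Elo, without any extra compute:

  import math

  def reachable_depth(node_budget: float, branching_factor: float) -> float:
      """Depth d such that branching_factor ** d equals node_budget."""
      return math.log(node_budget) / math.log(branching_factor)

  budget = 1e9  # hypothetical number of positions searched per move
  for b in (35.0, 1.6):  # ~raw legal-move count vs. a heavily pruned engine
      print(f"branching factor {b}: ~{reachable_depth(budget, b):.0f} plies")
  # branching factor 35.0: ~6 plies
  # branching factor 1.6: ~44 plies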


It wouldn’t have been possible if there hadn’t been an exponential growth in computing resources over the past decades. That has already slowed down, and the prospects for the future are unclear. Regarding the branching factor, the improvements certainly must converge towards an asymptote.

The more general point is that you always end up with an S-curve instead of a limitless exponential growth as suggested by Kaibeezy. And with AI we simply don’t know how far off the inflection point is.


Expecting progress to be linear is a fallacy in thinking.


Sometimes it's exponential. Sometimes it's sublinear.


Sometimes it's exponential over very short periods. The fallacy is in thinking that will continue.


We should take better care of the humans who are already obsolete or will soon become obsolete.

Because so far we are only good at criminalizing, incarcerating, or killing them.


Upping our game will probably mean an embedded interface with AI. Something like Neurolonk.


Not sure if an intentional misspelling but I think I like Neurolonk more


Eventually there will spring up a religious cult of AI devotees and they might as well pray to Neurolonk.


Lol, unintentional


I know it's pretty low level on my part, but I was amused and laughed much more than I care to admit when I read NEUROLONK. Thanks for that!


It's available on ChatGPT Plus right now. Holy cow, it's good.


Spellchecker but for your arguments? A generalized competency boost?


I wonder how long before we augment a human brain with gpt4.


We already do; it's just that the interface sucks.


"general thinking" - this algorithm can't "think". It is still a nifty text completion engine with some bells and whistles added.

So many people are falling for this parlor trick. It is sad.


You're a nifty text completion engine with some bells and whistles added


What would impress you, or make you think something other than "wow, sad how people think this is anything special"?

Genuine question.


The benchmarking should be double-blind.


There is a fundamental disconnect between the answer on paper and the understanding which produces that answer.

Edit: feel free to respond and prove me wrong


A difference from chess is that chess engines try to play the best move, while GPT produces the most likely text.


#unpopularOpinion GPT-4 is not as strong as "we" anticipated; it was just hype


Learn sign language ;)


Life and chess are not the same. I would argue that this is showing a fault in standardized testing. It’s like asking humans to do square roots in an era of calculators. We will still need people who know how to judge the accuracy of calculated roots, but the job of calculating a square root becomes a calculator’s job. The upending of industries is a plausibility that needs serious discussion. But human life is not a min-maxed zero-sum game like chess is. Things will change, and life will go on.

To address your specific comments:

> What are the implications for society when general thinking, reading, and writing becomes like Chess?

This is a profound and important question. I do think that by “general thinking” you mean “general reasoning”.

> What happens when ALL of our decisions can be assigned an accuracy score?

This requires a system where all humans' decisions are optimized against a unified goal (or a small set of goals). I don't think we'll agree on those goals any time soon.


I agree with all of your points, but don't you think there will be government-wide experiments related to this in places, like say North Korea? I wonder how that will play out.


China is already experimenting with social credit. This does create a unified and measurable goal against which people can be optimized. And yes, that is terrifying.


> What are the implications for society when general thinking, reading, and writing becomes like Chess?

Consider a society where 90% of the population does not need to produce anything. AIs will do that.

What would the economic/societal organization be called then?

The answer is Communism, exactly as Marx described.

Those 90% need to be supported by welfare ("From each according to his ability, to each according to his needs"). The other alternative is grim for those 90%.

So it's either Communism or nothing for the human race.


The silver lining might be us finally realising how bad standardised tests are at measuring intellect, creativity and the characteristics that make us thrive.

Most of the time they are about loading/unloading data. Maybe this will also revolutionise education, turning it more towards discovery and critical thinking, rather than repeating what we read in a book/heard in class?


GPT-4: Everything we know so far...

GPT-4 can solve difficult problems with greater accuracy, thanks to its broader general knowledge and problem-solving abilities.

GPT-4 is more reliable, creative, and able to handle much more nuanced instructions than GPT-3.5. It surpasses ChatGPT in its advanced reasoning capabilities.

GPT-4 is safer and more aligned. It is 82% less likely to respond to requests for disallowed content and 40% more likely to produce factual responses than GPT-3.5 on our internal evaluations.

GPT-4 still has many known limitations that we are working to address, such as social biases, hallucinations, and adversarial prompts.

GPT-4 can accept a prompt of text and images, which—parallel to the text-only setting—lets the user specify any vision or language task.

GPT-4 is available on ChatGPT Plus and as an API for developers to build applications and services. (API- waitlist right now)

Duolingo, Khan Academy, Stripe, Be My Eyes, and Mem amongst others are already using it.

API Pricing GPT-4 with an 8K context window (about 13 pages of text) will cost $0.03 per 1K prompt tokens, and $0.06 per 1K completion tokens. GPT-4-32k with a 32K context window (about 52 pages of text) will cost $0.06 per 1K prompt tokens, and $0.12 per 1K completion tokens.
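For reference, here's a minimal sketch of calling it from Python once you have access, assuming the pre-1.0 openai client library that was current at launch (the key and prompt are placeholders):

  import openai

  openai.api_key = "sk-..."  # placeholder; requires API access off the waitlist

  response = openai.ChatCompletion.create(
      model="gpt-4",  # or "gpt-4-32k" for the larger context window
      messages=[{"role": "user", "content": "Summarize the GPT-4 announcement in one sentence."}],
      max_tokens=200,
  )
  print(response["choices"][0]["message"]["content"])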


So, the COST PER REQUEST (if you fill the 32K context window and get a 1K-token response) will be: 32 × $0.06 (prompt + context) + 1 × $0.12 (response) = US$2.04
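A minimal sketch of that arithmetic, using the GPT-4-32k prices quoted above (formatted to two decimals to avoid floating-point noise):

  def request_cost(prompt_tokens: int, completion_tokens: int,
                   prompt_price_per_1k: float = 0.06,
                   completion_price_per_1k: float = 0.12) -> float:
      # Default prices are the GPT-4-32k figures from the parent comment.
      return ((prompt_tokens / 1000) * prompt_price_per_1k
              + (completion_tokens / 1000) * completion_price_per_1k)

  # Worst case from the comment: a full 32K-token prompt and a 1K-token reply.
  print(f"${request_cost(32_000, 1_000):.2f}")  # $2.04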



