
Scoring in the 96th percentile among humans taking the exam would, without any goalpost moving, have been science fiction two years ago. Now it's suddenly not good enough, and the fact that a computer program can score decently among passing lawyers and first-time test takers is something to sneer at.

The fact that I can talk to the computer and it responds to me idiomatically and understands my semantic intent well enough to be nearly indistinguishable from a human being is breathtaking. Anyone who views it as anything less in 2024 and asserts with a straight face that they wouldn't have said the same thing in 2020 is lying.

I do, however, find the paper really useful for contextualizing the score at a much finer grain. Personally, I didn't take the 96th-percentile figure to mean anything other than "among the mass of people who take the test," and I have enough experience with professional licensing exams to know that a huge percentage of test takers fail and are repeat takers. Placing the goalposts quantitatively for the next levels of achievement is a useful exercise. But the profusion of jaded nerds makes me sad.




On any topic that I understand well, LLM output is garbage: it requires more energy to fix it than to solve the original problem to begin with.

Are we sure these exams are not present in the training data? (ability to recall information is not impressive for a computer)

Still, I'm terrible at many, many tasks, e.g., drawing from a description, and the models significantly widen the range of problems I can even try (where results can be verified easily and no precision is required).


> On any topic that I understand well, LLM output is garbage: it requires more energy to fix it than to solve the original problem to begin with.

That's probably true, which is why most human knowledge workers aren't going away any time soon.

That said, I have better luck with a different approach: I use LLMs to learn things that I don't already understand well. This forces me to actively understand and validate the output, rather than consume it passively. With an LLM, I can easily ask questions, drill down, and try different ideas, like I'm working with a tutor. I find this to be much more effective than traditional learning techniques alone (e.g., textbooks, videos, blog posts, etc.).


It might be better to think of the LLM as the student, with you as an impostor tutor. You're trying to assess the kid's knowledge without knowing the material yourself, but the kid is likely to lie when he doesn't know something to try to impress you, hoping that you don't know your stuff either. So you just have to keep probing him with more questions to suss out whether he's at least consistent.


I would classify all of those as "non-traditional" learning techniques, unless you actually mean using a textbook while taking a class with a human teacher.

Well-written textbooks are consumable on their own for some people, but most textbooks aren't written for that.


That's a good observation about textbooks and helps explain why I had difficulties trying to teach myself topics from a textbook alone!


A lot just aren't very good, but they also tend to assume prior knowledge in line with the typical prerequisites for a class, plus some degree of guidance.


I've had teachers who didn't understand the subject they were teaching. It's not a good experience and replicating that seems like a terrible idea.


A key advantage is that LLMs don't have emotional states that need to be managed.


It depends on the topic (and the LLM: something at least ChatGPT-4 equivalent; any model equivalent to 3.5 or earlier is just a toy in comparison), but I've had plenty of success using it as a productivity-enhancing tool for programming and AWS infrastructure, both to generate very useful code and as an alternative to Google for finding answers, or at least a direction toward answers. But I only use it where I'm confident I can vet the answers it provides.


> On any topic that I understand well, LLM output is garbage

I've heard that claim many times, but there's never any specific follow-up on which topics they mean. Of course, there are areas like math and programming where LLMs might not perform as well as a senior programmer or mathematician, sometimes producing programs that don't compile or calculations and ideas that are incorrect. However, this isn't exactly "garbage" as some suggest. At worst, it's more like a freshman-level answer, and at best, it can be a perfectly valid and correct response.


> At worst, it's more like a freshman-level answer

That is garbage.


I hope you don't hold a teaching position at a university then.


I did. The growth students show from first to second year is enormous. Everyone knows freshmen produce garbage answers; that's why they're freshmen and not out doing the work: they are there to learn, not to produce answers. If freshman answers were good enough, people wouldn't bother hiring college grads; they'd just hire dropouts and high school grads.

> I hope you don't hold a teaching position at a university then.

You think teachers shouldn't have a growth mindset for their students? I think students can grow from producing garbage answers to good answers; that is what they are there for. An LLM, however, doesn't grow, so while such students are worth teaching even though they produce garbage answers, the LLM isn't.


> You think teachers shouldn't have a growth mindset for their students? I think students can grow from producing garbage answers to good answers; that is what they are there for.

I think many students, including freshmen, have interesting and sometimes thought-provoking ideas. And they come up with creative solutions based on their previous life experience. I would never describe that as garbage.


On what topics that you understand well does GPT-4o or Claude Opus produce garbage?


I do run into the issue where the longer the conversation goes, the more inaccurate the information becomes.

But a common situation with code generation is that it fails to understand the context of where the code belongs, so you get a function that will compile but makes no sense.


Yeah. I often springboard into a new context by having the LLM compose the next prompt based on the discussion, then restarting the context. It's remarkably effective if you ask it to incorporate "prompt engineering" terms from research.


Anything deeper than surface level in medicine.

Try getting it to properly select crystalloids with proper additives for a patient with a given history and lab results and watch in horror as it confidently gives instructions that would kill the patient.

What is even more irritating is that I've had GPT-4 debate me on things it was completely wrong about, and only when I responded with a stern rebuke did it hit me with the usual "Apologies for the misunderstanding..."


LLMs are not good at answering expert level questions at the forefront of human knowledge.


Unfortunately it would be considered basic medicine in this case.


Is it basic but not documented? Basic, to me, means the first Google search result is generally correct.


That's not how medicine operates.

Medical problems are highly contextual, so you are not going to get much valuable information at the level of what a doctor is thinking from the first page of Google. That doesn't mean it isn't simple within our area of expertise.


In my area of expertise, a well-formulated Google search can return a first page full of academic articles on the general topic, but there isn't necessarily consensus. This might be a case of the curse of knowledge :)


To be fair, I have not found MDs to be particularly reliable for answering basic questions about medicine either.


OK. I can't speak for what you've experienced. I can only offer what I see from LLMs given what I know.


High school math problems.


I suspect by garbage you mean not perfect.

To be more precise, can you please give a topic you know well and your rough estimate of how often the answers on that topic are wrong?


I would take their meaning as 'contains enough errors to not be useful', which doesn't need a very high percentage of wrong answers.


Even better: it looks right, might even compile, but will be doing the subtly (or obviously) wrong thing.


Functional linear analysis: it has a tendency to produce proofs for unprovable statements; the proofs will be logically argued and well structured, and then step 8 will contain a statement that is obvious nonsense even to a beginning student, i.e., me. The professor, on the other hand, will ask why I'm trying to prove a false statement and expertly help me find my logic error.


Specifics like this make it much easier to agree on LLM capabilities, thank you.

Automatic proof generation is a massive open problem in computer science and is not close to being solved. It's true that LLMs aren't great at it and that more is required, as with the geometry system DeepMind has been making progress on, for example.

On the other hand, they can be very useful for explaining concepts and for interactive questioning to drill down and build understanding of complex mathematical concepts, all during a morning commute via the voice interface.


How do you debug its hallucinated misinformation via the voice interface while you commute?


I just use my memory and verify later. Unlike an LLM, I have persistent, durable, long-term storage of knowledge. Typically I can pick out a hallucination pretty easily, though, because there's often a very clear inconsistency or a logical leap that is nonsense.


I'm not the parent, but depending on the context, GPT-4 will often make up functions that then require research and correction; in other cases, like the time I asked it to show me an example of a class of x86 assembly instructions, it just added a label and skipped the actual instruction and implementation!

Yesterday I was looking for some help on an issue with the unshare command; it repeatedly made bad assumptions about the nature of the error even after I provided it with the full error message, and one could already guess the initial cause just by looking at that.

I guess such errors can be frighteningly common once you get outside of typical web development.


The models that you have tried... are garbage? Hmm. Maybe you are not among the many, many, many inside professionals and uniformed services that have different access than you do? Money talks?


It is remarkable that folks who tried a garbage LLM like Copilot, 3.5, or Gemini, or who made Meta's LLMs say naughty words, seem to think these are still state of the art. Sometimes I stumble onto them and am shocked at the degradation in quality, then realize my settings are wrong. People are vastly underestimating the rate of change here.


People have tried GPT-4; it makes the same kinds of errors as GPT-3, it just has a bigger set of known things where it does OK, so it is immensely more useful.

It is like a calculator that only worked on one digit and now works on two: the improvement is immense, but it's still nowhere close to replacing mathematicians, since it isn't even working on the same kind of problems.

Edit: In several years we might have a perfect calculator that is better than any human at such tasks, but it still won't beat humans at things unrelated to calculation. In the case of LLMs, the task is pattern matching on text: humans don't pattern match text in order to plan or mentally simulate scenarios, etc., so that part isn't covered by LLMs. Human-level planning combined with today's LLM-level pattern matching on text would be really useful, and we see a lot of humans work that way by using the LLM as a pattern matcher, but there has been no progress on automating human-level planning so far, and LLMs aren't it.


> People are vastly underestimating the rate of change here

GPT-3.5 was released in March 2022. We are now in June 2024. Over 2 years later.

And on average GPT-4 is about 40% more accurate.

For me, LLMs are very much like self-driving cars: on the journey toward perfect accuracy, it gets progressively harder to make advances.

And for it to replace the status quo, it really does need to be perfect. And there is no evidence or research showing that this is possible.


It's enough to decrease the number of people you need in IT by 20-30%.

People don't want to hear that, but you see fewer and fewer job offers, and not only for junior positions.

The hard truth is that, as with any tool or automation, the more performance improves, the fewer people are needed for this kind of work.

Just look at how some parts of manual labor were made redundant.

Why people think it won't be the same with mental work is beyond me.


Not yet, because the reliability isn't there. You still need to validate everything it does.

E.g., I had it autocompleting a set of 20 variables today, something like output.blah = tostring(input[blah]). The kind of work you'd give to a regex.

In the middle of the list, it decided to emit output.blah = some long, weird piece of code, completely unexpected and syntactically invalid.
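
For illustration only, the repetitive mapping I mean looks roughly like this minimal Python sketch (field names invented, not the actual code):

    # Minimal sketch (invented field names) of the repetitive mapping described above.
    input_record = {"name": "Ada", "age": 42, "city": "Paris"}  # stand-in input
    output = {}
    output["name"] = str(input_record["name"])
    output["age"] = str(input_record["age"])
    output["city"] = str(input_record["city"])
    # ...the same one-line pattern would repeat for all 20 fields, which is why a
    # sudden, syntactically invalid deviation in the middle of the list is so jarring.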

I am still in my AI evaluation phase, and sometimes I am impressed with what it does. But an unexpected total failure is just as possible. As long as it does that, I can't trust it.


> On any topic that I understand well, LLM output is garbage: it requires more energy to fix it than to solve the original problem to begin with.

Is it generally because the LLM was not trained on that data and therefore has no knowledge of it, or because it can't reason well enough?


LLMs don't reason and are not built to reason; they are next-token predictors.
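
In the mechanical sense, generation is roughly the following loop: score the candidate next tokens given everything so far, pick one, append it, repeat. A toy Python sketch (the scorer is a made-up stand-in, not any real model or API):

    # Toy sketch of autoregressive next-token prediction; toy_model is a stand-in
    # for a real LLM and only scores what token might come next.
    import random

    def toy_model(tokens):
        vocab = ["the", "cat", "sat", "on", "mat", "."]
        rng = random.Random(" ".join(tokens))  # deterministic stand-in for learned weights
        return {tok: rng.random() for tok in vocab}

    def generate(tokens, n_new):
        for _ in range(n_new):
            scores = toy_model(tokens)
            tokens.append(max(scores, key=scores.get))  # greedy choice of the next token
        return tokens

    print(generate(["the"], 6))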


The real problem is that tests used for humans are calibrated based on the way different human abilities correlate: they aren't objectives in themselves, they are convenient proxies.

But they aren't meaningful for anything other than humans, since the correlations between abilities that make them reasonable proxies are not the same elsewhere.

The idea that these kinds of test results prove anything (other than the utility of the tested LLM for humans cheating on the exam) is only valid if you assume not only that the LLM is actually an AGI, but that it's an AGI that is psychometrically indistinguishable from a human.

(Which makes a nice circular argument, since these test results are often cited to prove that the LLMs are, or are approaching, AGI.)


This is a good point.

One thing I've noticed LLMs seem to have trouble with is going "off task".

There are often very structured evaluation scenarios, with a structured set of items and possible responses (even if defined in an abstract sense). Performance in those settings is often OK to excellent, but when the test scenario changes, the LLM seems unable to recognize it, or fails miserably.

The Obama pictures were a good example of that. Humans could recognize what was going on when the task frame changed, but the AI started to fail miserably.

My friends and I, similarly, often trick LLMs in interactive tasks by starting to go "off script," where the "script" is some assumption that we're acting in good faith with regard to the task. My guess is that humans would have a "WTF?" response, or start to recognize what was happening, but an LLM does not.

In the human realm there's an extra-test world, like you're saying, but for the LLM there's always a test world, and nothing more.

If I'm being honest with myself, my guess is that a lot of these gaps will be filled over the next decade or so, but there will always be some model boundaries, defined not by the data used to estimate the model but by the framework the model exists within.


I have difficulty being optimistic about LLMs because they don’t benefit my work now, and I don’t see a way that they enhance our humanity. They’re explicitly pitched as something that should eat all sorts of jobs.

The problem isn’t the LLMs per se, it’s what we want to do with them. And, being human, it becomes difficult to separate the two.

Also, they seem to attract people who get really aggressive about defending them and seem to attach part of their identity to them, which is weird.


By 96th percentile, do you mean 69th? From the abstract:

> data from a recent July administration of the same exam suggests GPT-4’s overall UBE percentile was below the 69th percentile, and 48th percentile on essays. Third, examining official NCBE data and using several conservative statistical assumptions, GPT-4’s performance against first-time test takers is estimated to be 62nd percentile, including 42nd percentile on essays. Fourth, when examining only those who passed the exam (i.e. licensed or license-pending attorneys), GPT-4’s performance is estimated to drop to 48th percentile overall, and 15th percentile on essays.


It scored below the 50th percentile when compared only to people who had passed the exam.


The nerds aren't jaded; they're worried. I'd be too if my job needed nothing more than a keyboard to get done. There are a lot of people here who need to squeeze another 20-40 years out of a keyboard job.


You're assuming that keyboard jobs are easier to automate simply because the models were built to output text, but nothing prevents physical motion from being easier, simply due to sheer repetitiveness. In fact, you can get away with building dedicated robots, e.g., for drywall spraying and sanding, whereas the keyboard guys tend to have to switch tasks all the time.


Similar claims were made that microwaves would eliminate cooking.

At the end of the day, (a) LLMs aren't accurate enough for many use cases, and (b) there is far more to knowledge workers' jobs than simply generating text.


The profusion of jaded nerds, although saddening at times, seems to be pushing science forward. I have a feeling that a prolonged sense of awe can hinder progress at times, and the lack of it is usually a sign of a group's adaptability (how quickly are new developments normalized?).


It's the hype. We could invent warp drive, but if it were hyped as the cure for cancer, poverty, and war, and as the gateway to untold riches and immortality, while simultaneously being the most dangerous invention in history, destined to completely destroy humanity, people would be saying "oh, ho hum, we made it to Centauri in a week" pretty fast.

Add some obnoxious pseudo-intellectual windbags building a cult around it, and people would be downright turned off.

Hype is also taken as a strong contrarian indicator by most scientific and engineering types. A lot of hype means it’s snake oil. This heuristic is actually correct more often than it’s not, but it is occasionally wrong.


Yeah, it's insane. I am actually scared the LLM is, like, sentient and secretly plotting to kill me. I bet we have, like, full AGI next year because Elon said so, and Sam Altman probably has AGI already internally at OpenAI. I am actually selling my house now, going all in on Nvidia, and just living in my car until we get the AGI.


> The fact that I can talk to the computer and it responds to me idiomatically and understands my semantic intent well enough to be nearly indistinguishable from a human being is breathtaking

That's called a programming language. It's nothing new.


It's a programming language, except for the programming part and the language part.


How is the text you write not a language, and how is writing instructions that computers follow not programming?

Edit: An LLM's biggest feat is being a natural-language interpreter, so it can run natural-language scripts. It is far from perfect at it, but that is still programming.


Sure, in the way that you program your dog to play fetch.



