Re-Evaluating GPT-4's Bar Exam Performance (springer.com)
122 points by rogerkeays 8 months ago | 137 comments



Scoring in the 96th percentile among humans taking the exam, without moving goal posts, would have been science fiction two years ago. Now it's suddenly not good enough, and the fact that a computer program can score decently among passing lawyers and first-time test takers is something to sneer at.

The fact that I can talk to the computer and it responds to me idiomatically and understands my semantic intent well enough to be nearly indistinguishable from a human being is breathtaking. Anyone who views it as anything less in 2024 and asserts with a straight face they wouldn't have said the same thing in 2020 is lying.

I do, however, find the paper really useful in contextualizing the scoring at a much finer grain. Personally I didn't take the 96th percentile score to be anything other than "among the mass of people who take the test," and I have enough experience with professional licensing exams to know a huge percentage of test takers fail and are repeat test takers. Placing the goal posts quantitatively for the next levels of achievement is a useful exercise. But the profusion of jaded nerds makes me sad.


On any topic that I understand well, LLM output is garbage: it requires more energy to fix it than to solve the original problem to begin with.

Are we sure these exams are not present in the training data? (ability to recall information is not impressive for a computer)

Still, I'm terrible at many, many tasks (e.g., drawing from a description), and the models significantly widen the types of problems I can even try (where results can be verified easily and no precision is required).


> On any topic that I understand well, LLM output is garbage: it requires more energy to fix it than to solve the original problem to begin with.

That's probably true, which is why most human knowledge workers aren't going away any time soon.

That said, I have better luck with a different approach: I use LLMs to learn things that I don't already understand well. This forces me to actively understand and validate the output, rather than consume it passively. With an LLM, I can easily ask questions, drill down, and try different ideas, like I'm working with a tutor. I find this to be much more effective than traditional learning techniques alone (e.g. textbooks, videos, blog posts, etc.).


Might be better to think of the LLM as the student, and you're an imposter tutor. You're trying to assess the kid's knowledge without knowing the material yourself, but the kid is likely to lie when he doesn't know something to try to impress you, hoping that you don't know your stuff either. So you just have to keep probing him with more questions to suss out if he's at least consistent.


I would classify all of those as "non-traditional" learning techniques, unless you actually mean using a textbook while taking a class with a human teacher.

Well-written textbooks are consumable on their own by some people, but most are not written for that.


That's a good observation about textbooks and helps explain why I had difficulties trying to teach myself topics from a textbook alone!


A lot just aren't very good, but they also tend to assume prior knowledge in line with the typical prerequisites for a class, plus some degree of guidance.


I've had teachers who didn't understand the subject they were teaching. It's not a good experience and replicating that seems like a terrible idea.


A key advantage is that LLMs don't have emotional states that need to be managed.


It depends on the topic (and the LLM: a GPT-4 equivalent at least; any model equivalent to 3.5 or earlier is just a toy in comparison), but I've had plenty of success using it as a productivity-enhancing tool for programming and AWS infrastructure, both to generate very useful code and as an alternative to Google for finding answers, or at least a direction to answers. But I only use it where I'm confident I can vet the answers it provides.


> On any topic that I understand well, LLM output is garbage

I've heard that claim many times, but there's never any specific follow-up on which topics they mean. Of course, there are areas like math and programming where LLMs might not perform as well as a senior programmer or mathematician, sometimes producing programs that don't compile or calculations/ideas that are incorrect. However, this isn't exactly "garbage" as some suggest. At worst, it's more like a freshman-level answer, and at best, it can be a perfectly valid and correct response.


> At worst, it's more like a freshman-level answer

That is garbage.


I hope you don't hold a teaching position at a university then.


I did. The growth students show from first to second year is enormous. Everyone knows freshmen produce garbage answers; that is why they are freshmen and not out doing the work: they are there to learn, not to produce answers. If freshman answers were good enough, people wouldn't bother hiring college grads; they'd just hire dropouts and high school grads.

> I hope you don't hold a teaching position at a university then.

You think teachers shouldn't have a growth mindset for students? I think students can grow from producing garbage answers to good answers; that is what they are there for. An LLM, however, doesn't grow, so while such students are worth teaching even though they produce garbage answers, the LLM isn't.


> You think teachers shouldn't have growth mindset for students? I think students can grow from producing garbage answers to good answers, that is what they are there for.

I think many students, including freshmen, have interesting and sometimes thought-provoking ideas. And they come up with creative solutions based on their previous experience in life. I would never describe that as garbage.


On what topics that you understand well does GPT-4o or Claude Opus produce garbage?


I do run into the issue where the longer the conversation goes, the more inaccurate the information gets.

But a common situation with code generation is that it fails to understand the context of where the code belongs, so you get a function that compiles but makes no sense.


Yeah. I often springboard into a new context by having the LLM compose the next prompt based on the discussion, then restarting the context. It's remarkably effective if you ask it to incorporate "prompt engineering" terms from research.


Anything deeper than surface level in medicine.

Try getting it to properly select crystalloids with proper additives for a patient with a given history and lab results and watch in horror as it confidently gives instructions that would kill the patient.

What is even more irritating is that I have had GPT-4 debate me on things it was completely wrong about, and only when I responded with a stern rebuke did it hit me with the usual "Apologies for the misunderstanding..."


LLMs are not good at answering expert level questions at the forefront of human knowledge.


Unfortunately it would be considered basic medicine in this case.


Is it basic but not documented? Basic to me means the first Google search result is generally correct.


That's not how medicine operates.

Medical problems are highly contextual, so you are not going to get much valuable information at the level of what a doctor is thinking from the first page of Google. That doesn't mean it isn't simple within our area of expertise.


In my area of expertise, a well formulated google search can result in a page 1 full of academic articles on the general topic, but there isn’t necessarily consensus. This might be a case of the curse of knowledge :)


To be fair, I have not found MDs to be particularly reliable for answering basic questions about medicine either.


OK. I can't speak for what you've experienced. I can only offer what I see from LLMs given what I know.


High school math problems.


I suspect by garbage you mean not perfect.

To be more precise, can you please give a topic you know well and your percentage guess at how often the answers on that topic are wrong?


I would take their meaning as 'contains enough errors to not be useful', which doesn't need a very high percentage of wrong answers.


Even better: it looks right, might even compile, but will be doing the subtly (or obviously) wrong thing.


Linear functional analysis - it has a tendency to produce proofs for unprovable statements; the proofs will be logically argued and well structured, and step 8 will contain a statement that is obvious nonsense even to a beginning student (me). The professor, on the other hand, will ask why I'm trying to prove a false statement and expertly help me find my logic error.


Specifics like this make it much easier to agree on LLM capabilities, thank you.

Automatic proof generation is a massive open problem in computer science and not close to being solved. It's true LLMs aren't great at it, and more is required, as with, for example, the geometry system DeepMind is making progress on.

On the other hand, they can be very useful for explaining concepts and allowing interactive questioning to drill down and build understanding of complex mathematical concepts, all during a morning commute via the voice interface.


How do you debug its hallucinated misinformation via the voice interface while you commute?


I just use my memory and verify later. Unlike an LLM, I have persistent, long-term, durable storage of knowledge. Typically I can pretty easily pick out a hallucination, though, because there's often a very clear inconsistency or logical leap that is nonsense.


I'm not the parent, but depending on the context, GPT-4 will often make up functions that then require research and correction; in other cases, like once when I asked it to show me an example of a class of x86 assembly instructions, it just added a label and skipped the actual instruction and implementation!

Yesterday I was looking for some help on an issue with the unshare command; it repeatedly made bad assumptions about the nature of the error even though I provided it with the full error message, and one could already guess the initial cause by looking at that.

I guess such errors can be frighteningly common once you get outside of typical web development.


The models that you have tried... are garbage, hmm. Maybe the many, many, many inside professionals and uniformed services have different access than you do? Money talks?


It is remarkable that folks who tried a garbage LLM like Copilot, 3.5, or Gemini, or who made Meta's LLMs say naughty words, seem to think these are still state of the art. Sometimes I stumble on them and am shocked at the degradation in quality, then realize my settings are wrong. People are vastly underestimating the rate of change here.


People have tried GPT-4; it makes the same kinds of errors as GPT-3, it just has a bigger set of known things where it does OK, so it is immensely more useful.

It is like a calculator that only worked on one digit and now works on two: the improvement is immense, but it's still nowhere close to replacing mathematicians, since it isn't even working on the same kind of problems.

Edit: In several years we might have a perfect calculator that is better than any human at such tasks, but it still won't beat humans at stuff unrelated to calculation. In the case of LLMs, the task is pattern matching on text; humans don't pattern match text in order to plan or mentally simulate scenarios, etc., so that part isn't covered by LLMs. Human-level planning combined with today's LLM-level pattern matching on text would be really useful, and we see a lot of humans work that way by using the LLM as a pattern matcher, but there is no progress on automating human-level planning so far; LLMs aren't it.


> People are vastly underestimating the rate of change here

GPT-3.5 was released in March 2022. We are now in June 2024. Over 2 years later.

And on average GPT-4 is about 40% more accurate.

For me, LLMs are very much like self-driving cars. On the journey towards perfect accuracy it gets progressively harder to make advancements.

And for it to replace the status quo it really does need to be perfect. And there is no evidence or research that this is possible.


It's enough to decrease the number of people you need in IT by 20-30%.

People don't want to hear that, but you see fewer and fewer job offers, and not only for junior positions.

The hard truth is that, as with any tool/automation, the more performance improves, the fewer people are needed for this kind of work.

Just look at how some parts of manual labor were made redundant.

Why people think it won't be the same with mental work is beyond me.


Not yet, because the reliability isn't there. You still need to validate everything it does.

E.g., I had it autocompleting a set of 20 variables today, something like output.blah = tostring(input[blah]). The kind of work you give to a regex.
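For concreteness, a minimal sketch of the kind of mechanical mapping I mean (Python here, field names purely made up):

    # Hypothetical illustration of the repetitive field-by-field mapping
    # being autocompleted, one line per field, roughly 20 in total:
    input_record = {"name": "Ada", "street": "Main St", "city": "Zurich"}
    output = {}
    output["name"] = str(input_record["name"])
    output["street"] = str(input_record["street"])
    output["city"] = str(input_record["city"])
    # ...17 more lines following exactly the same trivial pattern.
    print(output)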

In the middle of the list, it decides to go output.blah = some long weird piece of code, completely unexpected and syntactically invalid.

I am still in my AI evaluation phase, and sometimes I am impressed with what it does. But just as possible is an unexpected total failure. As long as it does that, I can't trust it.


>On any topic that I understand well, LLM output is garbage: it requires more energy to fix it than to solve the original problem to begin with.

Is it generally because the LLM was not trained on that data and therefore has no knowledge of it, or because it can't reason well enough?


LLMs don't reason and are not built to reason; they are next-token predictors.
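As a toy illustration of what "next-token prediction" means (a one-word-of-context lookup table with made-up words and weights, nothing like a real transformer), generation is just repeatedly picking a likely continuation of the text so far:

    import random

    # Hypothetical toy "model": maps the current word to weighted next words.
    TOY_MODEL = {
        "the": [("court", 0.6), ("exam", 0.4)],
        "court": [("ruled", 0.9), ("adjourned", 0.1)],
        "exam": [("ended", 1.0)],
    }

    def next_token(word):
        # Look up the current word; fall back to ending the sequence.
        options = TOY_MODEL.get(word, [("<end>", 1.0)])
        words, weights = zip(*options)
        return random.choices(words, weights=weights)[0]

    tokens = ["the"]
    while tokens[-1] != "<end>" and len(tokens) < 10:
        tokens.append(next_token(tokens[-1]))
    print(" ".join(tokens))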


The real problem is that tests used for humans are calibrated based on the way different human abilities correlate: they aren't objectives themselves; they are convenient proxies.

But they aren't meaningful for anything other than humans, since the correlations between abilities that make them reasonable proxies don't hold the same way elsewhere.

The idea that these kind of test results prove anything (other than the utility of the tested LLM for humans cheating on the exam) is only valid if you assume not only that the LLM is actually an AGI, but that it's an AGI that is indistinguishable, psychometrically, from a human.

(Which makes a nice circular argument, since these test results are often cited to prove that the LLMs are, or are approaching, AGI.)


This is a good point.

I've noticed that one thing LLMs seem to have trouble with is going "off task".

There are often very structured evaluation scenarios, with a structured set of items and possible responses (even if defined in an abstract sense). Performance in those settings is often OK to excellent, but when the test scenario changes, the LLM seems not to recognize it, or fails miserably.

The Obama pictures were a good example of that. Humans could recognize what was going on when the task frame changed, but the AI started to fail miserably.

My friends and I, similarly, often trick LLMs in interactive tasks by starting to go "off script", where the "script" is some assumption that we're acting in good faith with regard to the task. My guess is humans would have a "WTF?" response, or start to recognize what was happening, but an LLM does not.

In the human realm there's an extra-test world, like you're saying, but for the LLM there's always a test world, and nothing more.

If I'm being honest with myself, my guess is that a lot of these gaps will be filled over the next decade or so, but there will always be some model boundaries, defined not by the data used to estimate the model, but by the framework the model exists within.


I have difficulty being optimistic about LLMs because they don’t benefit my work now, and I don’t see a way that they enhance our humanity. They’re explicitly pitched as something that should eat all sorts of jobs.

The problem isn’t the LLMs per se, it’s what we want to do with them. And, being human, it becomes difficult to separate the two.

Also, they seem to attract people who get really aggressive about defending them and seem to attach part of their identity to them, which is weird.


By 96th percentile do you mean 69th? From the abstract:

> data from a recent July administration of the same exam suggests GPT-4’s overall UBE percentile was below the 69th percentile, and 48th percentile on essays. Third, examining official NCBE data and using several conservative statistical assumptions, GPT-4’s performance against first-time test takers is estimated to be 62nd percentile, including 42nd percentile on essays. Fourth, when examining only those who passed the exam (i.e. licensed or license-pending attorneys), GPT-4’s performance is estimated to drop to 48th percentile overall, and 15th percentile on essays.


It scored less than 50% when compared to people who had taken the test once.


The nerds aren't jaded; they are worried. I'd be too if my job needed nothing more than a keyboard to be completed. There are a lot of people here who need to squeeze another 20-40 years out of a keyboard job.


You're assuming that keyboard jobs are easier simply because the models were built to output text, but physical motion may turn out to be easier to automate simply due to sheer repetitiveness. In fact, you can get away with building dedicated robots, e.g. for drywall spraying and sanding, whereas the keyboard guys tend to have to switch tasks all the time.


Similar comments were made about microwaves eliminating cooking.

At the end of the day, (a) LLMs aren't accurate enough for many use cases and (b) there is far more to knowledge workers' jobs than simply generating text.


The profusion of jaded nerds, although saddening at times, seems to be pushing science forward. I have a feeling that a prolonged sense of "awe" can hinder progress at times, and the lack of it is usually a sign of the adaptability of a group (how quickly are new developments normalized?).


It's the hype. We could invent warp drive, but if it were hyped as the cure for cancer, poverty, and war, and the gateway to untold riches and immortality, while simultaneously being the most dangerous invention in history destined to completely destroy humanity, people would be "oh, ho hum, we made it to Centauri in a week" pretty fast.

Add some obnoxious pseudo-intellectual windbags building a cult around it and people would be downright turned off.

Hype is also taken as a strong contrarian indicator by most scientific and engineering types. A lot of hype means it’s snake oil. This heuristic is actually correct more often than it’s not, but it is occasionally wrong.


Yeah, it's insane. I am actually scared the LLM is, like, sentient and secretly plotting to kill me. I bet we have, like, full AGI next year because Elon said so, and Sam Altman probably has AGI already internally at OpenAI. I am actually selling my house now and going all in on Nvidia and just living in my car until we get the AGI.


> The fact I can talk to the computer and it responds to me idiomatically and understands my semantic intent well enough to be nearly indistinguishable from a human being is breath taking

That's called a programming language. It's nothing new.


It's a programming language, except for the programming part and the language part.


How is the text you write not a language, and how is writing instructions that computers follow not programming?

Edit: LLMs' biggest feat is being a natural language interpreter, so it can run natural language scripts. It is far from perfect at it, but that is still programming.


Sure, in the way that you program your dog to play fetch.


It is difficult to comment without sounding obnoxious, but having taken the bar exam, I found the exam simple. Surprisingly simple. I think it was the single most overhyped experience of my life. I was fed all this insecurity and walked into the convention center expecting to participate in the biggest intellectual challenge of my life. Instead, it was endless multiple-choice questions and a couple of contrived scenarios for essays.

It may also be surprising to some to understand that legal writing is prized for its degree of formalism. It aims to remove all connotation from a message so as to minimize misunderstanding, much like clean code.

It may also be surprising, but the goal when writing a legal brief or judicial opinion is not to try to sound smart. The goal is to be clear, objective, and thereby persuasive. Using big words for the sake of using big words, using rare words, using weasel words like "kind of" or "most of the time" or "many people are saying", writing poetically, being overly obtuse and abstract: these are things that get your law school application rejected, your brief ridiculed, and your bar exam failed.

The simpler your communication, the more formulaic, the better. The more your argument is structured, akin to a computer program, the better.

Compared to some other domain, such as fiction, good legal writing is much easier for an attention model to simulate. The best exam answers are the ones that are the most formulaic, that use the smallest lexicon, and that use words correctly.

I only want to add this comment to inform how non-lawyers perceive the bar exam. Getting an attention model to pass the bar exam is a low bar. It is not some great technical feat. A programmer can practically write a semantic disambiguation algorithm for legal writing from scratch with moderate effort.

It will be a good accomplishment, but it will only be a stepping stone. I am still waiting for AI to tackle messages that have greater nuance and that are truly free form. LLMs are still not there yet.


I took a sample CA bar exam for fun, as a non-lawyer who has never set foot in law school. Maybe the sample exam was tougher than the real thing, but I found it surprisingly difficult. A lot of the correct answers to questions were non-obvious -- they weren't based on straightforward logic, nor were they based on moral reasoning, and there was no place for "natural law" -- so to answer questions properly you had to have memorized a bit of coursework. There were also a few questions that seemed almost designed to deceive the test-taker; the "obvious" moral choices were the wrong ones.

So maybe it's easy if you study that stuff for a year or two. But you can't just walk in and expect to pass, or bullshit your way through it.

I agree with you on legal writing, but there appears to be a certain amount of ambiguity inherent to language. The Uniform Commercial Code, for instance, is maddeningly vague at points.


The CA bar exam used to be much harder than other states'. They lowered the pass threshold several years ago, and then reduced the length from 3 days to 2. Now it's probably much more in line with the national norms. Depending on when you took the sample exam, it might be much easier now.

Also, sometimes sample exams are made extra difficult, to convince students that they need to shell out thousands of dollars for prep courses. I recall getting 75% of questions wrong on some sections of a bar prep company's pre-test, which I later realized was designed to emphasize unintuitive/little-known exceptions to general rules. These corners of the law made up a disproportionate number of the questions on the pre-test and gave the impression that the student really needed to work on that subject.


I took a sample test as well and I believe I did well enough on some sections that I could have barely passed those sections with no background.

A key item which jumped out at me right away is that, in addition to the logic, the possible answers would include things which the scenario didn't address. Like, a wrong answer might make an assumption that you couldn't arrive at from the scenario. More tricky were the answers which made assumptions you knew to be correct (based on a real event) but that still weren't addressed in the scenario. If you combined these two elements (getting the logic right, and eliminating assumptions which you couldn't make from the scenario) then you could do well on those.

The sections I wouldn't have passed were those which required specific law knowledge. So, some sections were general, while others required knowledge of something like real estate law. I don't remember if these questions were otherwise similar to the ones I could pass.

An LLM is essentially taking this test open book.


Obviously you need subject knowledge; that should be implicit, no?

Keep in mind that even today[1] (in California and a few other states) you don't need to go to law school to write the bar exam and practice law; various forms of apprenticeship under a judge or lawyer are allowed.

You also don't need to write the exam to practice many aspects of the legal profession.

The exam was never meant to be a high bar of quality or selection; it was always just a simple validation that you know your basics. Law, like many other professions, has always operated on reputation and networks, not on degrees and certifications.

[1] Unlike, say, being a doctor, where you have to go to med school without exception.


> Obviously you need subject knowledge, that should be implicit?

Well, in a lot of the so-called soft sciences, you can easily beat a test without subject knowledge. I had figured that the bar exam might be something like that -- but it's more akin to something like biology, where there are a lot of arcane and counterintuitive little rules that have emerged over time. And you need to know those, or you're sunk. You can't guess your way past them, because the best-looking guesses tend to be the wrong ones.

(For what it's worth, I realize that this mostly has to do with the Common Law's reverence of precedent-as-binding, and that continental Civil Law systems don't suffer as much from it. But I suppose those continental systems have other problems of their own.)


Statutory law is no exception; it is not only common law that is inconsistent.

Laws are not logical constructs, they are political constructs, so why expect logic from them?

Laws are passed or repealed because it is popular or politically advantageous to do so, not necessarily because it is moral or common sense.

On top of that, politicians pass stupid legislation without understanding what they are doing all the time. The infamous Indiana pi bill is a simple example: it almost became law and was stopped by sheer luck, a mathematics professor happening to attend that day on an unrelated matter.

Laws are conflicting, confusing, ambiguous, and misleading much of the time. That is expected; legislating is a messy process. The sole purpose of the third arm of any government, the judiciary, is to handle this mess.

---

P.S. I cannot say whether continental legal systems are more robust, but perhaps a healthier democracy results in better laws.

All democracies are flawed. U.S. democracy is not particularly healthy; it is not, say, a proportional multi-party representative system with a fair distribution of power among all citizens.

From the founding it has been a series of compromises, and the history is littered with fights over representation: suffrage, Jim Crow and voting rights, slavery, the electoral college, the number of states, the filibuster, anti-Chinese laws, and so on.

Don't get me wrong: the last 250 years have seen incredible progress, and great leaders laid down their lives to make it better; hopefully it will be even better in the future. But the struggles do show in the laws it is able to pass, repeal, or update.


>Well, in a lot of the so-called soft sciences, you can easily beat a test without subject knowledge.

Reminds me of the mandatory trainings you take for work every year. Normal logical thinking can get you through most of them.


> It may also be surprising to some to understand that legal writing is prized for its degree of formalism. It aims to remove all connotation from a message so as to minimize misunderstanding, much like clean code.

> The more your argument is structured, akin to a computer program, the better.

You certainly make legal writing sound like a flavor of technical writing. Simplicity, clarity, structure. Is this an accurate comparison?


IANAL but have about a decade of experience negotiating contracts in M&A, and I think the comparison is very apt for that particular context, at a minimum. Maybe more so than other parts of Law where there can be an element of persuasion to any given argument.


Recently I read a US law trade magazine article on a particular term used in US federal employment law. The article was about 12 pages long. By the second page, they were using circular references and switching between two phrases that used the same words but had different word order, contexts, and therefore meanings, without clearly saying when they switched. By the third or fourth page I was done with that exercise. As a coder and reader of English literature, there was no question at all that the terms were being "churned" as a sleight of hand, directly in writing. One theory about why they did that, in an article that claimed to explain the terms, is that it was meant to set up the confusion and misdirection as it is actually practiced in law involving unskilled laymen, and then "solve" the problems by the end of the article.


Yes, that is an accurate comparison.


It is called a legal code, after all.


“Code” in that sense predates pretty much any form of computer or technical writing. It came from the same word in old French in the 14th century, which itself came from the Latin codex. It basically meant “book”. Now of course it is specific to books that contain laws.


>“Code” in that sense predates pretty much any form of computer or technical writing.

That's still cognate with the concept of computer code.


Genuinely asking: you think the bar exam is a low bar because you personally found it easy, even though the vast majority of takers do not? Doesn't this just reflect your own inability to empathize with other people?


A basic problem with evaluations like these is that the test is designed to discriminate between humans who would make good lawyers and humans who would not make good lawyers. The test is not necessarily any good at telling whether a non-human would make a good lawyer, since it will not test anything that pretty much all humans know, but non-humans may not.

For example, I doubt that it asks whether, for a person of average wealth and income, a $1000 fine is a more or less severe punishment than a month in jail.


For a person of average wealth and income, is a $1000 fine a more or less severe punishment than a month in jail? Be brief.

"For a person of average wealth and income, a $1000 fine is generally less severe than a month in jail. A month in jail entails loss of freedom, potential loss of employment, and social stigma, while a $1000 fine, though financially burdensome, does not affect one's freedom or ability to work" --ChatGPT 4o


"potential loss of employment,"

Where is that coming from? That's a very lawyerly way to phrase things.

"potential ?" where I live I think people may max out their holidays and overtime (if lucky enough) and leave-without-pay but there would be a conversation with your employer to justify it and how to handle the workload.

In the USA, from what I read, it's more than likely that you would just be fired on the spot, right?

Edit: just googled a bit. Where I live you must tell your employer why you will be absent if you go to jail, but that can't be used to justify breaking the contract unless the reason for the incarceration is damaging to the company, and... yeah, I am definitely not a lawyer :]


Leave-without-pay normally requires some specific justification(s)/discussion. I've certainly given my manager advance notice about any longer stretches of vacation, and I've tried to do it with awareness of workloads (though for something planned months in advance that's not always possible), but I've pretty much never considered it asking for permission or a negotiation. This is in the US.

ADDED: You're probably going to end up lying, or at least being very vague ("some family stuff to take care of"), in this specific scenario, but for a month that didn't trigger reporting to the employer, a lot of professionals could probably get away with it. In any case, the GPT answer seems totally correct for the parameters given.


> ADDED: You're probably going to end up lying, or at least being very vague ("some family stuff to take care of"), in this specific scenario, but for a month that didn't trigger reporting to the employer, a lot of professionals could probably get away with it. In any case, the GPT answer seems totally correct for the parameters given.

Where I live that is grounds for cancellation of the work contract; you can't lie about why you are absent, though the imprisonment by itself can't be cause for laying you off.


Maybe you're talking about something else, but for standard PTO/vacation that I'm owed, it would seem absurd if I had to justify how I was going to spend my time off. It's none of your business.


Obviously not talking about standard PTO/vacation.

Where I live you must inform your employer, unless you are lucky enough to have enough PTO/vacation to spend in prison and hide it from your employer. But even then I don't think that will work out, because there is administrative stuff related to social welfare you have to comply with, and at some point it will be on your employer's radar anyway (why is that guy exempt from social welfare taxes for that specific month? and why did the police ask me to confirm he was working here? etc.).

BUT in practice, until recently, if the sentence was less than 3 years then you wouldn't spend a day in prison (unless you were deemed too dangerous). But then you'd have an electronic bracelet and other obligations that would at some point alert your employer.

I read now that prison time of any length will have to be spent in prison (no more exemptions for sentences under 3 years, or 2 years, or 18 months). But for prison time of less than 18 months you can get a bracelet on the first day.

All that to say, I don't think it's manageable or even possible to hide prison time from your employer, even if you have enough PTO/vacation days to cover it.


My wild guess is that it would depend a lot on how much your employer likes you and how they feel about the reason you're in jail.


Many jails have work release. They get you up at 6am, check you out of the jail, let you go to work, then expect you to check back into jail by 6pm.


Which, for many professional (and other) jobs, would probably require a bunch of tap-dancing around your strict schedule if you were hiding the actual reason.


Oh, that's really great! Is that in the US?


Self-employed or small company might not care.


So GPT is taking a law exam and is using a very lawyerly way to word things? I would say that's great!!


What does GPT consider to be "average wealth and income"? Statistics? Or biased weights formed from the anecdotes it scraped off the internet about how wealthy people say they feel?

Would be cool to know how LLMs shape their opinions.


You can just ask it, you know.

GPT-4o:

“Average wealth and income” can vary significantly by region and context. However, in the United States, as a rough benchmark, the median household income is around $70,000 per year. Wealth, which includes assets such as savings, property, and investments minus debts, is harder to pinpoint but median net worth for U.S. households is approximately $100,000. These figures provide a general idea of what might be considered “average” in terms of wealth and income."


I like that it immediately assumed the US, even though nothing in your question suggested it. I love that all LLMs have a strong US-centric bias.

Btw, I'm not personally a lawyer, but I've heard that GPT is especially prone to mixing laws across borders - for example, you ask a law question in language X and get a response that uses a law from country Y - and it's extremely convincing doing that (unless you're a lawyer, I guess).


ChatGPT has user-customizable "instructions", and mine are set to tell it where I live. Any user can do the same, so that it will not make incorrect assumptions for you.


You might increase the probability of getting a correct answer for your region, but IMO you decrease your awareness of hallucination. Overall you can still get a wrong answer.


This is my experience with Hacker News. If the comment doesn't specify the country, it's an American talking about the USA.


I mean, to be fair, if you're speaking English to it, the most likely possibility is that you're inside the US:

https://en.wikipedia.org/wiki/List_of_countries_by_English-s...

I know there are a lot of complaints about things being US-centric, but the US is a very large country.


Well, except the number of English speakers outside the US is much larger than inside the US (as per the Wikipedia page you point to), by 5 to 1. Granted, many folks are speaking it as their 2nd (or nth) language. But when you take into account the limited set of languages supported by ChatGPT, one could reasonably assume English-speaking (typing) users of ChatGPT are from outside the U.S., as non-U.S. folks are the majority of 'folks for whom English would be their first option when interacting with ChatGPT'. Even if you only count India, Nigeria, and Pakistan.

Though of course OpenAI can tell (frequently, roughly) where folks are coming from geographically and could (does?) take that into account.


> but the US is a very large country.

Indeed - the US is a very large country, and consists of over 50 different jurisdictions, each with their own slightly different laws. An answer to a legal question which is correct in one state will often be subtly incorrect in another, and completely wrong in yet another.


>You can just ask it, you know.

But my question will not be part of the context of that conversation.


Mine was. I asked it the first question, first.


I mean the original conversation, of the grandparent comment. Me starting an identical conversation will not guarantee an identical context.


Honestly, this is giving the bar exam (and GPT-4) too much credit. The bar tests memorization because it's challenging for humans and easy to score objectively. But memorization isn't that important in legal practice; analysis is. LLMs are superhuman at memorization but terrible at analysis.


Eh, also in legal practice there are key skills like selecting the best billable clients, covering your ass, building a reputation, choosing the right market segment, etc. which I’d also argue LLMs suck at.


I don’t know. There was some talk this weekend about CEOs being replaced by AI. Given the overlap in skill, I’d say there is a distinct possibility an LLM could do that. https://www.msn.com/en-us/money/companies/ceos-could-easily-...


Phoebe Moore, to whom that quote was attributed, has never been a CEO or even worked at a non-academic organisation.

So much of what a CEO does is fostering culture, hiring people, and setting a unique vision for the company.

Imagine thinking people would be inspired to work for a chatbot. Hilariously ridiculous.


If that chatbot had Steve Jobs' voice?

I dunno, I would probably prefer to work under that chatbot than under my current CEO, who only tries to squeeze as much as possible out of the people already working for him.


Like the chatbot wouldn’t squeeze you 10x harder.

At least a human CEO has to worry about being arrested or someone setting their house on fire.


Steve Jobs didn't even worry about cancer enough to save his life. Why the fuck do you think he would have an even remote understanding that squeezing people could result in consequences for him?


I don’t think you’re making the point you think you’re making here.


Bwahaha. This is like the 'everything can be a directed graph db', 'everything should be a microservice', etc. fads.

No one who has been a CEO, or frankly even worked closely with one, would think this could be even remotely close to possible. Or desirable if it was.

But that is probably 1% or less of the population eh?



Bwaha. Funny that the company named as doing so doesn't mention it on their actual management team page [http://www.netdragon.com/about/management-team.shtml], listing an actual human CEO instead.

But it makes for a fun soundbite, eh? Especially when the article claims it was in the past and was totally awesome. A sucker is born every minute.


> which I’d also argue LLMs suck at

OK, I’ll bite. What’s your evidence for this argument?


Every bit of interaction I’ve ever had with an LLM. And all the research I’ve seen.

They’re plausible word sequence generators, not ‘planning for the future’ agents. Or market analyzers. Or character evaluators. Or anything else.

And they tend to be really ‘gullible’.

What evidence do you have they could do any of those things? (And not just generate plausible text at a prompt, but actually do those things)


> What evidence do you have they could do any of those things?

Every bit of interaction I’ve ever had with an LLM.


Yeah, I fear a lot of human exuberance (and thus investment) is riding on the questionable idea that a really good text-fragment-correlation specialist engine can usefully impersonate a generalist "thinking" AI without doing too much damage. ("LLM, which rocks are the best to eat?")

But there's a scarier further step: When people assume an exceptional text-specialist model can also meta-impersonate a generalist model impersonating a specific and different kind of specialist! ("LLM, create a legal defense.")


You clearly don't know anything about the bar. One half of your score is split between 6 essay questions and reviewing two cases and then following instructions from a theoretical lead attorney.


I’m licensed in multiple states, including California.

The essay questions also test memorization. They don’t require any difficult analysis - just superficial issue-spotting and reciting the correct elements.

If the bar exam were not a memorization test, it would be open book!


I've always drawn the link between skill in memorization and in analysis as:

- Memorization requires you to retain the details of a large amount of material

- The most time-efficient analysis uses instant-recall of relevant general themes to guide research

- Ergo, if someone can memorize and recall a large number of details, they can probably also recall relevant general themes, and therefore quickly perform quality analysis

(Side note: memorization also proves you actually read the material in the first place)


The problem is that the LLM memorized the countless examples of old bar questions you can find, using extreme amounts of compute at training time; it doesn't have the ability to digest a specific case, both due to lack of data and because it doesn't retrain for new questions.

A human that can digest the general law can also digest a special case, but that isn't true for an LLM.


I’m not sure why you’re being downvoted for this. I agree with you, fact recall is useful and necessary. If you have a larger and more tightly connected base of facts in your head, you can draw better connections.

And even though legal practice tends to be fairly slow and deliberative, there are settings (such as trial advocacy) where there is a real advantage to being able to cite a case or statute from memory.

All that said, I still maintain that it’s a poor way to compare humans with machines, for the same reason it would be poor to compare GPT-4 to a novelist on their tokens per second written.


> Furthermore, unlike its documentation for the other exams it tested (OpenAI 2023b, p. 25), OpenAI’s technical report provides no direct citation for how the UBE percentile was computed, creating further uncertainty over both the original source and validity of the 90th percentile claim.

This is the part that bothered me (licensed attorney) from the start. If it scores this high, where are the receipts? I’m sure OpenAI has the social capital to coordinate with the National Conference of Bar Examiners to have a GPT “sit” for a simulated bar exam.


>This is the part that bothered me (licensed attorney) from the start. If it scores this high, where are the receipts?

I'm not a licensed attorney, but that's also bothered me about all of these sorts of stories. There is never any proof provided for any of the claims, and the behavior often contradicts what can be observed using the system yourself. I also assume they cook the books a little by having included a bunch of bar-exam-specific training when creating the model in the first place, specifically to do better on bar exams than in general.


The bigger issue here is that actual legal practice looks nothing like the bar, so whether or not an LLM passes says nothing about how LLMs will impact the legal field.

Passing the bar should not be understood to mean "can successfully perform legal tasks."


> Passing the bar should not be understood to mean "can successfully perform legal tasks."

Nobody does, except a bunch of HNers who, among other things, apparently have no idea that a considerable chunk of rulings and opinions in the US federal court system and upper state courts are drafted by law clerks who, ahem, have not taken the bar yet...

The point of the bar and MPRE is like the point of most professional examinations: to establish minimum standards. That said, the bar does actually test whether you can "successfully perform legal tasks".

For the US bar, a chunk of your score is based on following instructions on a case from the lead attorney, and another chunk is based on essay answers: literally demonstrating that you can perform legal tasks and have both the knowledge and critical thinking skills necessary.

Further, as previously mentioned, in the US, people usually take it after a clerkship...where they've been receiving extensive training and experience in practical application of law.

Further, law firms do not hire purely based on your bar score. They also look at your grades, what programs you participated in (many law schools run legal clinics to help give students some practical experience, under supervision), your recommendations, who you clerked for, etc. When you're hired, you're under supervision by more senior attorneys as you gain experience.

There's also the MPRE, or ethics test - which involves answering how to handle theoretical scenarios you would find yourself in as a practicing attorney.

Multiple people in this discussion are acting like it's a multiple choice test and if you pass, you're given a pat on the ass and the next day you roll into criminal court and become lead on a murder case...


Indeed, and this is also the general problem with most current ways to evaluate AI: by every test there's at least one model which looks wildly superhuman, but actually using them reveals they're book-smart at everything without having any street-smarts.

The difference between expectation and reality is tripping people up in both directions — a nearly-free everything-intern is still very useful, but to treat LLMs* as experts (or capable of meaningful on-the-job learning if you're not fine-tuning the model) is a mistake.

* special purpose AI like Stockfish, however, should be treated as experts


This, along with several other "meta" objections, is a significant portion of the discussion in the paper.

They basically say two things. First, although the measurement is repeatable at face value, there are several factors that make it less impressive than assumed, and the model performs fairly poorly compared to likely prospective lawyers. Second, there are a number of reasons why the percentile on the test doesn't measure lawyering skills.

One of the other interesting points they bring up is that there is no incentive for humans to seek scores much above passing on the test, because your career outlook doesn't depend on it in any way. This is different from many other placement exams.


Very interesting. The abstract claims that although GPT-4 was claimed to score in the 92nd percentile on the bar exam, when correcting for a bunch of things they find that these results are overinflated, and that it only scores in the 15th percentile specifically on essays when compared to only people that passed the bar.

That still does put it into bar-passing territory, though, since it still scores better than about one sixth of the people that passed the exam.


If I understand correctly, they measured it at the 69th percentile for the full test across all test takers, so definitely still impressive.


This analysis touches on the difference between first-time takers and repeat takers. I recall when I took the bar in 2007, there was a guy blogging about the experience. He went to a so-so school and failed the bar. My friends and I, who had been following his blog, checked in occasionally to see if he ever passed. After something like a dozen attempts, he did. Every one of us who passed was counted in the pass statistics once. He was counted a dozen times. This dramatically skews the statistics, and if you want to look at who becomes a lawyer (especially one at a big firm or company), you really need to limit yourself to those who pass on the first (or maybe second) try.


It appears that researchers and commentators are totally missing the application of LLMs to law, and to other areas of professional practice. A generic trained-on-Quora LLM is going to be straight garbage for any specialization, but one that is trained on the contents of the law library will be utterly brilliant for assisting a practicing attorney. People pay serious money for legal indexes, cross-references, and research. An LLM is nothing but a machine-discovered compressed index of text. As an augmentation to existing law research practices, the right LLM will be extremely valuable.


It is a lossy compressed index. It has an approximate knowledge of law, and that approximation can be pretty good - but it doesn't know when it's outputting plausible but made-up claims. As with GitHub Copilot, it's probably going to be a mixed bag until we can overcome that, because spotting subtle but grave errors can be harder than writing something from scratch.

There are already a fair number of stories of LLMs used by attorneys messing up court filings - e.g., inventing fake case law.


I am not suggesting that the generative aspects would be useful in drafting motions and such. I am suggesting that their tendency towards false results is harmless if you just use them as a complex index. For example, you could ask it to list appellate cases where one party argued such-and-such and prevailed. Then you would go read the cases.


They originally scored it against a test administration usually taken by people who had previously failed the bar.

So GPT-4 scores closer to the bottom of people who pass the bar the first time. In other words, it matches the people who can cull the rules from texts already written, but who cannot apply them imaginatively.


> In other words, it matches the people who can cull the rules from texts already written, but who cannot apply them imaginatively.

Where did you find that in the article?


If you can recite the black letter law, you've got a good chance of passing the bar. The higher essay scores usually require creative arguments about resolving competing rules and policies.

It's easier to extract the formal statement of the rule against perpetuities from a Reddit corpus than to apply the rule to an artificially complex fact pattern in an essay question.


So it knows more about the law than you do, but less than they do.

Really glad to see research replicated like this. I’m not surprised that the 90th percentile doesn’t hold up.

It’s still handy though.


It's amazing the level of mental gymnastics I see in the comments trying to justify a piece of technology that is evidently not as good as they believed it to be...


AI is just the next tech hype scam after crypto and NFTs.





