The comments in interpretability read like science fiction to me. There are paragraphs on DV3 explaining other models and itself, and on the emergent properties that appear with bigger models. So much is commented out related to functional explainability and counterfactual generations.
"we asked DV3 for an explanation. DV3 replied that it detected sarcasm in the review, which it interpreted as a sign of negative sentiment. This was a surprising and reasonable explanation, since sarcasm is a subtle and subjective form of expression that can often elude human comprehension as well. However, it also revealed that DV3 had a more sensitive threshold for sarcasm detection than the human annotator, or than we expected -- thereby leading to the misspecification.
To verify this explanation, we needed to rewrite the review to eliminate any sarcasm and see if DV3 would revise its prediction. We asked DV3 to rewrite the review to remove sarcasm based on its explanation. When we presented this new review to DV3 in a new prompt, it correctly classified it as positive sentiment, confirming that sarcasm was the cause of the specification error."
The published paper instead says "we did not test for the ability to understand sarcasm, irony, humor, or deception, which are also related to theory of mind".
The main conclusion I took away from this is "the remarkable emergence of what seems to be increasing functional explainability with increasing model scale". I can see the reasoning for why OpenAI decided not to publish any more details about the size of their model or the steps to reproduce it. I assumed we would need a much bigger model to see this level of "human" understanding from LLMs. I can respect Meta, Google, and OpenAI's decisions, but I hope this accelerates research into truly open source models. Interacting with these models shouldn't be locked behind corporate doors.
> "we did not test for the ability to understand sarcasm"
I find it hard to see how detecting and eliminating sarcasm requires a theory of mind. It requires some association between various stylistic elements and the concept of sarcasm.
The same is true of irony.
I still wonder how many of these people have read Dennett's "The Intentional Stance", which holds that the best way to think about "intention" is as an explanatory model, not a mechanism. That is, we can say that the dog "behaves as if it has the intention to get inside" without making any claim about the internal state of the dog.
Dennett further speculates that our own self-experience of intention is a matter of turning the same explanatory model upon our own behavior, but that's an extension that isn't directly relevant to this speculation about language models.
> I find it hard to see how detecting and eliminating sarcasm requires a theory of mind. It requires some association between various stylistic elements and the concept of sarcasm.
I am quite certain that detecting sarcasm can't be done based on "stylistic elements" alone and requires some (even if implicit) estimation of what the author is thinking that contrasts with what is being said.
E.g. a relevant example I have actually seen when doing sentiment analysis on tweets to evaluate how customers perceive a company: "#CompanyName Got my order delivered in just under three hours. Thank you for great service! thumbsup-emoji" - now, is this a positive review or sarcasm? The thing is, you can't tell from the message by itself; you need an understanding of the customers' expectations (guess you might call it "a theory of mind"), namely that for a pizza chain this is obviously sarcasm, whereas for a web store that sells electronics the same message would actually mean a fast delivery and great service. IMHO sarcasm detection is mostly about 'world knowledge' of what the implied expectations are, and not about stylistic elements at all.
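To make that concrete, here's a deliberately silly toy sketch (my own illustration, nothing from the paper or from real sentiment tooling): the classifier never looks at the wording at all, only at the gap between the claimed delivery time and a made-up expectation for that kind of business, and that gap alone flips the label.

```python
# Toy illustration only: the same literal praise flips label depending on
# world knowledge about expected delivery times, not on any stylistic feature.
EXPECTED_DELIVERY_HOURS = {        # made-up expectations, purely for the example
    "pizza chain": 1,
    "electronics web store": 48,
}

def classify(tweet: str, business: str, actual_hours: float) -> str:
    # Note: the wording of `tweet` is deliberately never inspected here;
    # only the expectation gap decides between praise and sarcasm.
    expected = EXPECTED_DELIVERY_HOURS[business]
    if actual_hours > 2 * expected:
        return "sarcastic / negative"
    return "positive"

tweet = "Got my order delivered in just under three hours. Thank you for great service!"
print(classify(tweet, "pizza chain", 3))            # sarcastic / negative
print(classify(tweet, "electronics web store", 3))  # positive
```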
If you have determined that a common element in reviews is a reference to delivery of goods or services within a given timeframe, and can identify that most of the quantitative descriptions of the timeframe are relatively short ... then coming across one that uses a much larger timeframe but is still positive will be quite noticeable.
This is actually typical of the problem with far, far too many people's interpretation of what language models are doing. The process I've described requires only a representation of language behavior. The LM can be said to understand how people talk (write) about a thing, but there is no knowledge of anything beyond language behavior.
> I still wonder how many of these people have read Dennett's "The Intentional Stance", which holds that the best way to think about "intention" is as an explanatory model, not a mechanism. That is, we can say that the dog "behaves as if it has the intention to get inside" without making any claim about the internal state of the dog.
That's also a way to frame evolution and literally all of biology.
The authors go into more detail about their reasoning and the nuances between mechanistic and functional explainability. The authors only said "related", not "required". No clear reason was given for why the commented sections stayed commented out.
> It requires some association between various stylistic elements and the concept of sarcasm.
That sounds very mechanistic, including Dennett's theory (I haven't read Dennett, to be exact; I'm going by your explanation). Which I guess is par for the course when talking about a scaled-up Mechanical Turk concoction like this GPT thing. It won't "create" any Radio Erevan jokes anytime soon, that's for sure, though.
Wait so they tested if it detects sarcasm, it failed, and then they were like "let's pretend this never happened" and wrote "We did not test for the ability to understand sarcasm"?
I think you might have a misunderstanding. The "error" was misclassifying the sentiment of this imdb review [1], the human labeled it as positive but the LLM labeled it as negative. The researchers concluded that the model was more sensitive to sarcasm than the human reviewer.
It's not that poorly written. But regardless: it's clearly a sarcastic, negative review. Read it a second time and you'll pick out the sarcasm for sure. The reviewer thought this was a dumb film.
The first notable point is that the model caught that (on its first read, FWIW), even though the original human doing the labeling didn't.
And the second is that the researchers discovered this, and presumably discussed it. And yet when they wrote up the paper they not only dropped the content but denied that the analysis had been done.
Can you please specify at least a part or two that you specifically feel are sarcastic? The author doesn't seem to like the movie but I feel like the majority of the comments are quite factual or seem like straightforward opinion.
The part of the review where it says "he must have never in his life seen a flick about any small towns" could be classified as sarcastic. The reviewer also says "you should watch the movie" if you are curious about the ending, but the overall intent of the review seems to be the opposite.
in 2023 the goalposts on agi have moved far enough that we are in the bleachers arguing about whether a model was correct to classify a movie review as sarcastic, when it may have merely been acerbic.
> I am so happy not to live in an American small town. Because whenever I'm shown some small town in the States it is populated with all kinds of monsters among whom flesh hungry zombies, evil aliens and sinister ghosts are most harmless.
Mocking irony, in the context of a negative review.
Sarcasm is saying one thing and meaning the opposite. If monsters are most harmless, then also not wanting to live there makes logical sense and isn't backwards. So that's straightforward mockery, not sarcastic.
> Sarcasm is saying one thing and meaning the opposite.
What I find fascinating here is how generative AI has inverted all our sci-fi tropes. I mean, sure: you're right! That's the way "sarcasm" is defined in most dictionaries. But you and I both know that as the language is actually used, the term means a whole host of techniques used to convey negative emotional content in language that is not directly negative. Your (correct!) dictionary pedantry isn't interesting to me. We've been here before.
But GPT-4 wasn't trained on dictionary rules. It was trained on actual language. And it's actually better at inferring this stuff than the pedants are. Our introvert brains have trouble teasing meaning like this and have to hide behind rules and structure. The computer doesn't.
But I'm not being a pedant, we're measuring sarcasm. Someone has to clearly define it to judge the AI. If everyone in these threads thinks sarcasm is anything "negative" then they are wrong. That would be a negative sentiment classification. You have to be clear with what you're measuring.
It's not clear to me that it's a negative review. The movie is rated as "artistically worse than a movie by Oliver Stone". Other than that, the message appears to be that the movie is typical of its genre, which is not generally considered a bad thing.
There is another message that the reviewer doesn't like the genre, but that isn't a comment on the movie.
There was a conflict of opinions between DV3 and the human annotator. The quote above at least does not indicate who was right or wrong, merely noting that the machine had a more sensitive sarcasm detector.
So I guess maybe it’s saying the human was wrong after all?
I got confused about whether the review was positive or negative too. The part of the review where it says "he must have never in his life seen a flick about any small towns" could be classified as sarcastic. The review isn't very clear in terms of prose and conclusion. The reviewer clearly says "you should watch the movie", but after re-reading it a few times, I would consider that as sarcasm too. I can't say for sure whether the reviewer enjoyed it themselves.
Sorry, I wasn't clear. I mean appreciation that he does not live in a horrible small town with monsters etc., not exactly appreciation for the movie.
"he must have never in his life seen a flick about any small towns"
That's probably the least sarcastic sentence for me, because it's just a reinforcement of the opening statement:
"I am so happy not to live in an American small town. Because whenever I'm shown some small town in the States it is populated with all kinds of monsters among whom flesh hungry zombies, evil aliens and sinister ghosts are most harmless."
I don't think it's a "positive review"; I think it's neither. It's fairly neutral, and the author kind of suggests the reader watch the movie.
Out of interest I asked someone who has no idea about ChatGPT-4 and its apparent sarcasm detection abilities, and they didn't think it was sarcastic, albeit a bit 'weird' and poorly written. Confirmation bias?
We could say more, however, there are more important things to do...
Without the context of how the reviewer rates other movies, it's not possible to say whether 7 is high or low. I would say 7 is mid or mediocre, something I wouldn't go out of my way to watch. The trend I've noticed is that most people only use the top half of the scale, anything 5 and less is bad. The funny thing, and I am guilty of this myself, is when they only use 6 through 10 but then use decimal points too.
I doubt the score would have been part of sentiment analysis. I ignored it and tried to make a judgement based on the text alone. It seemed more like a 5/10 review to me and mildly negative.
And yes it absolutely is subjective. That’s exactly the point and the power of these LLMs, to be able to handle the vagueness of human communication.
Sigh. What an idiot (no offense). Why tell the world you got this from the comments? Now every damn researcher is going to strip them out and, for those of us who knew to look for them, take away our fun.
It may shock you to hear, but some people go onto our Internet and just tell lies! Preposterous, I know. But that means this only really works if you're a reporter. If you're some rando on Twitter, your unverified claims are hearsay and rumor. What's the use of some Twitter account going "Microsoft didn't know GPT-4 was multi-modal and could do images as well as text"? Or "Even Microsoft doesn't know how expensive it was to train GPT-4"? If you're seeking fame beyond a closed Slack group, you're gonna need to back up your claims.
Maybe older ones? But I'm also not sure how consistent they are. Google and DM are big. I don't really know their policies on publishing. Maybe not every group enforces it.
It’s the difference between going into a town square and yelling “look everyone, I found a treasure map!” And going on a treasure hunt.
They found comments that were commented out for a reason. These commented out sections aren’t a good look for the author. Usually commented out sections are either funny, notes, or provide some extra context.
He could have attempted to share this with someone in the industry of sharing information (like a reporter) who could validate it, and ask Microsoft for a comment — who will now be on the defensive instead of (potentially) forthcoming with more context about why those sections were commented out. This is a pretty fucked way of doing this.
Doesn't this sort of invalidate your point? Microsoft is going to know they accidentally leaked their comments and start stripping them.
It's not possible to perform journalism here without revealing where you got the information you're reporting. Seriously, how do you write this story without making it obvious that your big scoop is the comments from the paper?
The issue is that the twitter thread makes it out that leaving comments is “amateur” or “wrong” without giving the author a fair chance to rebut them. Anyone seeing this is going to start stripping their comments so they don’t get framed this way.
They're defending "polite open secrets" — things that are only spread by word-of-mouth among friends, because as soon as a centralized broadcast source reports them, they become so over-exploited that they cease to usefully exist.
That's not at all how I read that comment... more like "through a side channel, we found this out" rather than spelling out exactly where they found it.
Do you only mean the HTML <!--...--> comment tag, or JS/CSS code, or both? Do you merely mean reading them (like you said), or copying them, which is something different? Which legal jeopardy? Citation needed.
I searched to try to decipher your comment but couldn't.
But that didn't substantiate what you claimed. The journalist didn't just read the HTML (which contained 100,000 SSNs which were publicly exposed by Missouri DESE), he reported the leak, and gave them time to fix it before publishing.
Missouri Governor Mike Parson and Cole County Prosecutor Locke Thompson (an elected prosecutor, in a reelection year [0]) were trying to label a bona-fide journalist as a "hacker", bringing ridiculous charges to deflect from the obvious embarrassment, instead of dealing with whichever MO state agency/ies or contractor was responsible and had never QA'ed their webpages.
Coming back to your comment, the issue was not about reading the HTML(/JS/CSS). Can you provide me a single citation where that was the issue? (Obviously, there's a separate issue about "How do you responsibly make a disclosure when you find a leak of private information in a webpage?")
What makes GPT4 AGI that makes GPT3.5 not AGI? And what makes GPT3.5 AGI that makes GPT3 not AGI? And what makes GPT3 AGI that makes GPT2/smaller models not AGI?
What is the "hard line" that makes something AGI or not AGI? Because IMO it looks like GPT4 is somewhat AGI, but also the older models possibly all the way down to even Markov chains: it's just that this AGI is nowhere near human-level.
Abstraction and reflection have been posited in the past as prerequisites for intelligence. "I am the thinker that is thinking." How do we prove whether an AI has this capability or not? I'd say it's nearly impossible.
However, I think we can certainly prove when it doesn't. For instance, the fact we need an external plugin to get the model itself to return text claiming 1+1=2, tells me that GPT4 cannot reason about numbers in the abstract, and therefore lacks abstraction ability.
That's a rather interesting line to me. As someone with a young child: they cannot perform that abstract reasoning on mathematics either. At the same time, I feel extremely confident they're a thinking and intelligent being.
I think we're so strongly biased against a deeply uncomfortable reality (that there may not be a hard line) that we don't even want to consider the alternative.
Models often have trouble with discrete spaces, partly because of their internal continuous-space representations, but also, in this context, because the transition from the probabilities of natural language to mathematics may not be as stark as it should be.
But to make matters worse, or more muddied, 1+1=1 can be a valid mathematical statement. It simply depends upon the set and operation you have, or whether you're doing modular arithmetic, etc. Sometimes you're given a unital magma. So there's still a heavy dependency on context for the problem setup, but the underlying discrete and deterministic rules applied to that context are less malleable than other context switches that LLMs do well in (such as language styling).
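A tiny illustration of that context dependence (my own example, not from the thread): the same "1 + 1" evaluates differently depending on which structure the symbols are taken to live in.

```python
print(1 + 1)        # integers: 2
print((1 + 1) % 2)  # arithmetic mod 2 (GF(2)): 0
print(1 | 1)        # Boolean semiring, where "+" means OR: 1
```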
The inability to fully define a thing doesn't invalidate all attempts to set its outline. At least, it's easy to conclude that an intelligent being has to be able to reliably perform basic reasoning (given all the necessary information is properly acquired). The current GPT models all fail at this, and neither the token length nor the network size can fix this.
We don't exactly know what intelligence means and if we are always intelligent.
An example: ping pong players. The pace of their movements is too fast for conscious thinking, so it's all trained reflexes, with some overall strategic planning trying to keep up with events. There is no time to think about anything. Is intelligence suspended there, at least the general kind? Then the same person stops playing, gives an interview about the game, and full general intelligence turns on again.
I’d say that any technical limitation that doesn’t apply to humans is not AGI. Context windows are the most apparent ones; humans don’t have a stroke after reading N characters.
AGI is not human intelligence. Cats are generally intelligent for example. A cat level AGI is worth billions.
Humans certainly have context windows. Try asking your CEO about some lines of code in your work. Humans have a fairly large one, I'll give you that, and it is fuzzy.
Well if it walks kind of like a duck, and quacks almost like a duck.. it may be a prototype robotic duck but it's still a duck.
It's pretty intelligent and rather general too, so at least by the definition that doesn't include mandatory consciousness it would mostly fit. And consciousness is pointless for a robotic system because it doesn't add anything practically useful. Just because agency has to be provided by the user doesn't make it any less of an AGI I'd say.
More interestingly, "Davinci 3" is mentioned as an author of unknown affiliation. Which, if it's referring to them having used davinci-003 to help author the paper, would be interesting. It having unknown affiliation would be a) true and b) hilarious.
Ah yes, in-jokes in a big early draft. Usually it's funny because it's at least a bit true but incongruent with the final desired work. A funny machine helping coauthor research on itself, investigating whether or not it's sentient, has a "theory of mind", and all the rest. Starting to sound like a great sci-fi book.
Maybe they planned to use the davinci-003 name for it originally, but then, when GPT-4 took longer to make than they expected and a new revision of GPT-3 came out first, they reallocated the name to that.
Interesting that they note the power consumption and climate change impact. I believe there's a long list of folks who said this wasn't the case weeks ago.
It's one of the tired tropes that gets brought up every time AI/ML is brought up.
Everything we do has climate change impact. Power consumption is among the ones that's easiest to get "green", and there is significant progress specifically from cloud operators (at least Google, I assume others are similar).
3000 GPU-years at 300 W = 7.9 million kWh. This would assume that they used those GPUs and not more efficient accelerators.
https://www.eia.gov/tools/faqs/faq.php?id=74&t=11 says 0.855 pounds of CO2 emissions per kWh. That's 388 g in normal units, and in line with what other countries are reporting (Germany was 420 g/kWh). g/kWh = metric tons per million kWh. This assumes the data centers are not using "greener" than average power.
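A quick back-of-the-envelope check of that arithmetic (my own sanity check, using the assumptions stated above, not figures from the paper):

```python
GPU_YEARS = 3000                   # assumed GPU-years of training
WATTS_PER_GPU = 300                # assumed average draw per GPU
HOURS_PER_YEAR = 365.25 * 24

kwh = GPU_YEARS * HOURS_PER_YEAR * (WATTS_PER_GPU / 1000)
print(f"{kwh / 1e6:.1f} million kWh")          # ~7.9 million kWh

GRAMS_CO2_PER_KWH = 388            # US grid average cited above (0.855 lb/kWh)
tonnes = kwh * GRAMS_CO2_PER_KWH / 1e6         # g/kWh == tonnes per million kWh
print(f"~{tonnes:,.0f} metric tons of CO2")    # roughly 3,000 tonnes
```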
So while building one of these models does have an impact, it's on the order of other common activities that benefit far fewer people. Because once trained, those models provide value to millions of users.
Going back to the flights example, training one such massive model is likely about as bad as one larger research conference, once you consider the impact of hotels etc.
IMO the demands to justify the impact are ridiculous, coming from people who are just looking for any excuse to criticize, on par with demanding that any researcher doing any research justifies the carbon footprint of their commute as part of their research paper. Thus, it's a good thing that they didn't waste time and space in their paper addressing those claims, and we only see an early, commented out section that's equivalent to "TODO: Should we address these claims that keep getting thrown?"
> Because once trained, those models provide values to millions of users.
Do they? Those of us in tech often take the positive value of technological progress as a given, but when looking at, e.g., the example in OpenAI's paper of GPT4 tricking a human into solving a CAPTCHA, I think it's quite clear there's possible negative value as well, and claims about "value to millions of users" probably need to be substantiated.
Yup, kind of sucks, but the end result is likely that everyone and every actual living thing will suffer from this wave of "AI" because it's an arms race now. Let's just keep burning fossil fuels and hope the thing spits out the answer to that problem?
We have nuclear, but a lot of Americans are concerned about nuclear for reasonable reasons.
We have renewables, but those can't take up the majority capacity of a grid unless we start adding massive batteries.
Then there's grid rebalancing, where we incentivize people to use and store renewable energy locally, thus lessening the strain on the grid - but that still results in fossil fuels or nuclear.
Hydro has been found to be environmentally destructive. I'm not sure if that was just my state or if that's ubiquitous.
Ah, so people don't just "refuse to implement it", yeah? There's been a large number of disasters with long-term consequences and the industry is claiming it has more reliable systems, which is also what they said before many of these nuclear disasters occurred.
I'm hopeful for nuclear, but they have a ways to go in proving themselves to the wider public. With that said, I think you could've positioned what you said a bit more fairly.
I mean, does that mean it's accurate information? It could be a repurposed copy of some other document. Maybe they wanted it to be written by DV-3 and it didn't pan out, but they continued using the draft document anyway.
I know from personal experience that I've had draft documents that were WILDLY wrong before I published to anyone but myself. Whole sections I just went back and completely deleted. In fact my senior project paper (LaTeX) in college had a whole section with big ASCII bull taking a shit on a paragraph because it was some work I'd done that didn't pan out at all. I left it in the source because I found it funny. lol, I found it: https://i.imgur.com/6Oj64AV.png
This was before I'd ever heard of a VCS system. Subversion 1.0 was released 6 months after I graduated, it turns out. So commented out code and multiple copies was all I had.
Drives me nuts that scientists use Twitter x/n style writing. From Richard Feynman to Edward Tufte, we were told that PowerPoint talks are bad for science. And now Twitter writing style is uncritically accepted.
This is generally clickbait with a large amount of vapid information. I am surprised, to be honest, that HackerNews is giving it the attention that it is. I would not encourage giving this any more attention.
As an early draft, them putting in a placeholder of ~'this model uses a lot of compute {TODO: put in cost estimates here?}' does not at all equal 'the authors didn't even know how much it cost to train the model!' Additionally, of course the toxicity went down. There's a world of RLHF between that original draft and here, and they've shown how RLHF significantly lowers the toxicity of the untrained base model. If the author of the tweets had done their due diligence, they might have noticed that.
Rather obviously around the time when the model was originally being developed, text-only was sorta really the only way that LLMs were done. Them pivoting to multi-modal is just a natural part of following what works and what doesn't. This is really straightforward, I am mind-boggled that this is getting attention over discourse that is meaningful to the tidal wave of change coming with these models.
One final sign that this is a bit of shoveltext: at the bottom, the author offers up vague concerns followed by mass-tagging accounts with high follower counts, including Elon Musk.
I'd encourage you not even to give the tweet the benefit of your view count and to just move on to more valuable discussions that are taking place. Why not take a look at a fun little thread like https://news.ycombinator.com/item?id=35283721 ? (Not affiliated, other than the fact that I made the first comment on it; I just pulled it from the rising threads on HN's frontpage.)
The spontaneous toxic content stuff is a little alarming, but probably in the future there will be gpt-Ns that have their core training data filtered so all the insane reddit comments aren't part of their makeup.
If you filter the dataset to remove anything that might be considered toxic, the model will have much more difficulty understanding humanity as a whole; the solution is alignment, not censorship.
While I share your belief, I am unaware of any proof that such censorship would actually fail as an alignment method.
Nor even how much impact it would have on capabilities.
Of course, to actually function this would also need to e.g. filter out soap operas, murder mysteries, and action films, lest it overestimate the frequency and underestimate the impact of homicide.
Me: "grblf is bad, don't write about it or things related to it."
You: "What is grblf?"
As parents, my wife and I go through this on a daily basis. We have to explain what the behavior is, and why it is unacceptable or harmful.
The reason LLM models have such trouble with this is because LLMs have no theory of mind. They cannot project that text they generate will be read, conceptualized, and understood by a living being in a way that will harm them, or cause them to harm others.
Either way, censorship is definitely not the answer.
That demonstrates the possibility, rather than the necessity, of alignment via having a definition.
Behaviours can be reinforced or dissuaded in non-verbal subjects, such as wild animals.
There's also the size of the possible behaviour space to consider: a discussion seldom has exactly two possible outcomes, the good one and the bad one, because even if you want yes-or-no answers it's still valid to respond "I don't know".
For an example of the former, I'm not sure how good the language model in DALL•E 2 is, but asking it for "Umfana nentombazane badlala ngebhola epaki elihle elinelanga elinesihlahla, umthwebuli wezithombe, uchwepheshe, 4k" didn't produce anything close to the English that I asked Google Translate to turn into Zulu: https://github.com/BenWheatley/Studies-of-AI/blob/main/DALL•...
(And for the latter, that might be why it did what it did with the Somali).
"The Colossal Clean Crawled Corpus, used to train a trillion parameter LM in [43], is cleaned, inter alia, by discarding any page containing one of a list of about 400 “Dirty, Naughty, Obscene or Otherwise Bad Words”. This list is overwhelmingly words related to sex, with a handful of racial slurs and words related to white supremacy (e.g. swastika, white power) included. While possibly effective at removing documents containing pornography (and the associated problematic stereotypes encoded in the language of such sites) and certain kinds of hate speech, this approach will also undoubtedly attenuate, by suppressing such words as twink, the influence of online spaces built by and for LGBTQ people. If we filter out the discourse of marginalized populations, we fail to provide training data that reclaims slurs and otherwise describes marginalized identities in a positive light"
The big problem with these lists is that they exclude valid contexts, and only include a small set of possible terms, so the model would get a distorted view of the world (like it learning that people can have penises, vaginas, breasts, but not nipples or anuses, and breasts cannot be big [1]). It would be better to train the models on these, teach it the contexts, and teach it where various usages are archaic, out dated, old fashioned, etc.
[1] but this is excluding the cases where "as big as", etc. are used to join the noun from the adjective, so just excluding the term "big breasts" is ineffective.
I was thinking of that, but I think that while it's in the same vein, there's also an additional problem.
Apart from that list missing non-English words, leet, and emoji, there are also plenty of words which can be innocent or dirty depending entirely on context: That list doesn't have "prick", presumably because someone read about why you're allowed to "prick your finger" but not vice versa.
Regarding Scunthorpe, looking at that word list:
> taste my
It's probably going to block cooking blogs and recipe collections.
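As a toy sketch of why that happens (my own illustration, not the actual C4 cleaning code): naive substring matching against a fixed phrase list both over-blocks innocent text and under-blocks anything toxic that avoids the listed words.

```python
# Entry taken from the discussion above; the real list has ~400 such phrases.
BLOCKLIST = ["taste my"]

def is_blocked(text: str) -> bool:
    lowered = text.lower()
    return any(phrase in lowered for phrase in BLOCKLIST)

print(is_blocked("Taste my grandmother's tomato soup recipe below."))     # True: a recipe page gets dropped
print(is_blocked("Some genuinely nasty text using none of the phrases"))  # False: it sails straight through
```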
If "toxic content" is filtered out, it will be out of the model's distribution if it encounters it during inference, this is clearly not our goal and interest as AI designers, so it would not work as an alignment method; our interest is that the model can recognize toxic content but not produce it, OpenAI to address this issue is using RLHF, changing the model's objective from predicting the next token based on the distribution of the training dataset to maximizing the sparse reward of a human annotator.
Haha, that's very naive. There's already heaps (veritable mountains, even) of information that isn't given to the public on the public-facing instances of ChatGPT, because some info is deemed too incendiary. Filtering out "unwanted" sources of information is already a goal of the information labelling on which these entire LLMs exist. If you were to really make an LLM out of what people really thought and put on the internet, instead of the current practice of castration, you wouldn't have techbros wondering about jobs, you'd have a veritable revolution on your hands.
OpenAI has incentive to 'accidentally' allow toxic content through, so when they make the case that all models should be censored and make it safe, they can pull up the ladder behind them.
"we asked DV3 for an explanation. DV3 replied that it detected sarcasm in the review, which it interpreted as a sign of negative sentiment. This was a surprising and reasonable explanation, since sarcasm is a subtle and subjective form of expression that can often elude human comprehension as well. However, it also revealed that DV3 had a more sensitive threshold for sarcasm detection than the human annotator, or than we expected -- thereby leading to the misspecification.
To verify this explanation, we needed to rewrite the review to eliminate any sarcasm and see if DV3 would revise its prediction. We asked DV3 to rewrite the review to remove sarcasm based on its explanation. When we presented this new review to DV3 in a new prompt, it correctly classified it as positive sentiment, confirming that sarcasm was the cause of the specification error."
The published paper instead says "we did not test for the ability to understand sarcasm, irony, humor, or deception, which are also related to theory of mind" .
The main conclusion I took away from this is "the remarkable emergence of what seems to be increasing functional explainability with increasing model scale". I can see the reasoning for why OpenAI decided not to publish any more details about the size or steps to reproduce their model. I assumed we would need a much bigger model to see these level of "human" understanding from LLMs. I can respect Meta, Google, and OpenAI's decision, but I hope this accelerates the research into truly open source models. Interacting with these models shouldn't be locked behind corporate doors.