> how few bad things people can point to that he actually did..
I'd encourage you to read The Power Broker, if you haven't. Some examples that aren't "policy decisions" we can only disagree with given our knowledge of how things turned out (e.g., building or not building public transit in a given place), but rather things that were clearly morally wrong at the time:
- repeated wholesale destruction of low-income neighborhoods through a variety of bridge, park, and highway construction projects
- evicting farmers and poor rural landowners through opaque legal methods to build highways and parks atop their land
- running a "slum clearance" program that primarily evicted people from slums and demolished them without providing any real place for the humans to go afterwards
- funneling vast sums of money into the pockets of collaborators, friends, and, in the end, himself
The Power Broker paints a nuanced picture, but he did some pretty terrible things in his time.
Sometimes! My attempt with GPT-4 yields a response where it acknowledges the print/len swap, but does not produce correct code in the end - it sort of loses track of what the original goal was. https://chat.openai.com/share/300382cb-ac72-4a75-847c-ecbf5a...
And if we're doing Science, i.e., trying to explain how ChatGPT works and what its intrinsic properties are --- this case is far more significant than the other.
Inasmuch as the hypothesis that ChatGPT works "so as to be actually sensitive to the meaning of the code" is here falsified -- by a single case.
An infinite number of apparent confirmations of this hypothesis are now Invalid!
I'm not comfortable with this introduction of falsificationism to what is not a scientific experiment, but only an experiment testing the predictive accuracy of a classifier. Of course the classifier will get it wrong sometimes because it's only approximating a function: that's by definition, and even by design, i.e. we build classifiers as function approximators because we know that learning precise definitions of target concepts is really hard. Under PAC-Learning assumptions, we expect a classifier to have some probability of some error, and we are only trying to estimate the probability of a certain degree of error in the classifier's decision.
The practical problem of course is that, in good practice, we estimate the error of a classifier by testing it on (ostensibly) unseen data, i.e. data that was not available to the classifier during training. With LLMs that kind of testing is impossible because nobody knows what's in their training data and so nobody can safely assume that success, or failure, on a specific task, is predictive of the performance of the model on an arbitrarily chosen task.
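To make that concrete, here's a minimal sketch (the model, data, and numbers are made up for illustration, not taken from anywhere above) of what estimating a classifier's error on held-out data looks like, with the kind of confidence bound a PAC-style analysis gives you:

    import math

    def estimate_error(model, held_out):
        # held_out: (input, true_label) pairs the model never saw during training
        mistakes = sum(1 for x, y in held_out if model(x) != y)
        return mistakes / len(held_out)

    def hoeffding_radius(n, delta=0.05):
        # With probability >= 1 - delta, the empirical error is within this
        # radius of the true error probability (Hoeffding's inequality).
        return math.sqrt(math.log(2 / delta) / (2 * n))

    # e.g. 37 mistakes on 1000 unseen examples: empirical error 0.037,
    # and the true error is within +/- 0.043 of that, with 95% confidence.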
To make matters worse, everybody should understand very well by now that LLMs' performance varies, even wildly varies, with their prompt, and there is no known way to systematically create prompts that maximise the probability of a desired response. The result is that every observation of an LLM failing to carry out a task may be just that, or it may be an observation of the user failing to prompt the LLM so as to maximise the probability of the correct response.
In a sense, testing LLMs by hand-crafted prompts risks measuring the experimenter's ability to craft a prompt, rather than the LLM's ability to respond correctly. In that sense, we can't really falsify any hypothesis about LLMs' capabilities.
Of course, the flip side of that is that people should refrain from making any such hypotheses and instead work on the best method to systematically and rigorously test LLMs. Too bad very few people are willing to do that. Too bad for most, that is. I'm pretty sure that at some point someone will come up with a way to rigorously test LLMs and take the cookie, leaving everyone else feeling like fools for wasting all that time poking LLMs for nothing.
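For what it's worth, the sort of systematic testing I have in mind looks roughly like the sketch below; ask_model() and check() are hypothetical placeholders for whatever API call and correctness check you actually use:

    def evaluate(paraphrases, ask_model, check, samples=20):
        # Pass rate per prompt phrasing, plus the spread across phrasings,
        # so prompt sensitivity is measured rather than hidden in one lucky prompt.
        rates = {}
        for prompt in paraphrases:
            passes = sum(check(ask_model(prompt)) for _ in range(samples))
            rates[prompt] = passes / samples
        spread = max(rates.values()) - min(rates.values())
        return rates, spread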
It's not black and white with these probabilistic models. The same input generated two outputs. Both were "actually sensitive to the meaning of the code", to varying degrees. One got it exactly right; the other made an error but still partly got it right.
I suspect that this comment section is going to fill up with such anecdotes, but his OCW lecture series on Linear Algebra got me through a college course with a less-than-inspired lecturer. Had no idea he'd been doing it for so long - extremely impressive.
What gives you confidence its explanations are accurate?
> write and explain me more optimised algorithms for certain cryptographic operations
This domain in particular strikes me as a poor choice for this approach. "Don't roll your own crypto... but definitely don't let a language model roll it for you, either"
Well, it gives me a direction to dig in - often papers use inscrutable notation or seemingly magical variables whose origin I can't figure out.
Will ChatGPT always be right? Probably not - but these are things I can validate, which is better than having no info at all!
Re: crypto algorithms, the query in question was implementing exponentiation for arbitrary-sized integers. My own implementation was taking until the heat death of the universe to finish for big integers, and I didn't want to just copypasta an impl from elsewhere.
ChatGPT's worked flawlessly, and it was able to explain to me certain tricks it used in depth (which I could independently verify from other sources).
Would I ship it to prod? Not without a security audit, but that ought to be the case regardless when rolling your own (or even someone else's) cryptography :)
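For reference, the trick typically involved here is square-and-multiply (binary exponentiation), for crypto usually with modular reduction at every step; this is a generic sketch of that idea, not ChatGPT's actual output, and it's equivalent to Python's built-in pow(base, exp, mod):

    def mod_exp(base, exp, mod):
        # O(log exp) multiplications instead of exp of them; reducing mod
        # at every step keeps the intermediate values small.
        result = 1
        base %= mod
        while exp > 0:
            if exp & 1:                      # current bit set: fold base in
                result = (result * base) % mod
            base = (base * base) % mod       # square for the next bit
            exp >>= 1
        return result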
> Maybe a real life reading group or something would help.
This is what it took for me - having a few other people agreeing to a schedule and meeting to talk about what we had read. I ultimately very much enjoyed GR but it is really difficult to read alone for the first time. Both the social pressure to keep at it and the ability to have "what the hell was that" conversations with other people really helped.
(My father's favorite author is Pynchon, so while he slightly prefers Mason & Dixon to GR, it's been in my awareness for a very long time; I think my first attempt was in my teens and I didn't succeed at finishing until my mid-twenties.)
This doesn't seem accurate - the scenario needs to be imaginable, but that doesn't mean it's a prediction by the author. I could write a story about the downstream effects of discovering FTL travel in 2030, but that doesn't mean I think it's going to happen. Even for something more realistic - something that will probably happen on some timeline, just not right away, like a Mars colony - the author might not want to explore or think about the other technological or societal changes that would happen over a longer period of time. Fiction is just that: fiction.
I think it's more than this - it feels pretty unclear that the "league" feature is real in any sense.
There's no particular reason to believe that the other users and scores in the league you're in represent other real-world humans; they could simply be generated algorithmically to put you at a specific point in a score distribution, based on A/B testing for what works best to keep people engaged. And if they do pull real human scores into that list, they don't necessarily need to make that list consistent between users; so if you get second place, the real human whose score is shown in fourth place could be looking at their own wholly separate list in which they were second (with a userbase as large as Duolingo's, I think of these two things as largely isomorphic). As far as I know, Duolingo doesn't document or discuss the mechanics of league formation, so even if they were manipulating outcomes like this, it wouldn't be outright lying.
My experience doing Duolingo regularly was that my own score would vary significantly week-over-week based on my time and effort, and I would always land somewhere in the top four-ish spots in the "league" I was in regardless. If I were really being put together with a set of humans at the beginning of the league and the scores just played out organically, I would expect to occasionally win big or get demolished, but that never happened to me.
And my guess is that being competitive towards the top of the league but not consistently winning is the best for user engagement, so they'd have every reason to fake/engineer that outcome.
Do you have a source for this? I've seen content around GMs finding stronghold-style setups that the AI cannot find a way out of, theoretically reaching a draw by repetition or the 50-move rule. But I'm not aware of any examples of Hikaru or anyone else actually winning against the best computers (e.g. modern Stockfish) on even footing.