GPT-3 can run code (mayt.substack.com)
271 points by maytc on March 29, 2022 | 149 comments


> GPT-3 struggles with large numbers, decimal numbers, and negative numbers. When used it returns answers that are close but often incorrect.

Regarding GPT-3's "guesstimates," intuitively it feels like the network has to guess because it hasn't been given a way to do exact computation--a neural network is built out of nonlinear functions--even if it "understands" the prompt (for whatever value you want to give to "understand").

Are there any techniques that involve giving the model access to an oracle and allowing it to control it? To continue the analogy, this would be the equivalent of giving GPT-3 a desk calculator.

If this is a thing, I have other questions. How do you train against it? Would the oracle have to be differentiable? (There are multiple ways to operate a desk calculator to evaluate the same expression.) Also, what control interface would the model need so that it can learn to use the oracle? (Would GPT-3 emit a sequence of one-hot vectors that represent functions to perform, and would the calculator have "registers" that can be fed directly from the input text? Some way of indirectly referring to operands so the model doesn't have to handle them lossily.)
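Roughly the loop I'm imagining, as a sketch (query_model is a hypothetical stand-in for the completion API, and CALC is a made-up convention the model would have to be prompted or trained to use):

  import re

  def query_model(prompt):
      # Hypothetical stand-in for a language-model completion call.
      raise NotImplementedError

  def answer_with_calculator(question, max_rounds=5):
      # The prompt tells the model to delegate exact arithmetic to CALC(...).
      prompt = ("Use CALC(expression) whenever you need exact arithmetic.\n"
                f"Question: {question}\nAnswer:")
      completion = ""
      for _ in range(max_rounds):
          completion = query_model(prompt)
          call = re.search(r"CALC\(([^)]*)\)", completion)
          if call is None:
              return completion  # no oracle request: treat this as the final answer
          result = eval(call.group(1), {"__builtins__": {}})  # toy oracle, trusted input only
          # Append the oracle's result so the model can continue from there.
          prompt += completion[:call.end()] + f" = {result}\n"
      return completion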


There are many papers trying to couple language models with external modules.

In the Retrieval-Enhanced Transformer (RETRO) paper, a large language model was coupled with a similarity-based text index. It can populate the prompt with relevant information from the index, making the model more grounded and easier to update.

In another paper (AlphaCode) the language model was coupled with a compiler, so it could run generated programs and check whether they matched the expected outputs for a few test cases. The model was able to solve competition-style coding problems at above the average human score.

In another paper (Language Models as Zero-Shot Planners) a language model generates commands to navigate a virtual home environment and perform tasks. The knowledge in the LM helps it learn tasks quickly.

A recent one can learn new concepts by simple conversation, then apply them where necessary. You can talk-train your model. (Memory assisted prompt editing to improve GPT 3 after deployment)

So the trend is to add "toys" to language models - a simulator, a compiler, a search engine, a long-term memory module.

I'd like to see a recursive language model, that can sub-call itself to decompose problems.


You forgot all the inner monologue (https://www.gwern.net/docs/ai/gpt/inner-monologue/index) & scratchpad papers which give it additional steps or access to Python REPL etc: eg https://arxiv.org/abs/2112.15594 https://arxiv.org/abs/2111.08267 https://arxiv.org/abs/2111.08171


AI Chains really takes it to the next level.


Yeah, but I didn't bring it up because I wasn't sure how much is really the model choosing and how much is the human workflow: they emphasize the interactive part heavily.

Anyway, today another great paper dropped on self-distillation: "STaR: Bootstrapping Reasoning With Reasoning" https://arxiv.org/abs/2203.14465 , Zelikman et al 2022.


> I'd like to see a recursive language model, that can sub-call itself to decompose problems.

I tried a very simple and specific version of this a few years ago (Recursive Application of Recurrent Neural Networks) and it worked great for intent parsing: https://github.com/spro/RARNN

Would like to see what "real" researchers with more modern models could do with the concept.


> The model was able to solve competition style coding problems above average human score.

I am not sure if I am thinking of the right study, but as far as I remember the pipeline included a human wading through and filtering solutions, and while there may have been a compiler attached, they also scored themselves. The marketing blurb of course tried to make it sound as if they had competed.


The model generates a large number of solutions, then they filter those that actually compile and generate the right output when executed, then they cluster to select a few (<10 solutions) and submit them. They are not allowed to present too many attempts.

Here's a good analysis of the paper: https://www.youtube.com/watch?v=s9UAOmyah1A
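Very roughly, the selection stage looks like this (a sketch, not AlphaCode's actual code; in the paper the clustering uses model-generated test inputs rather than the public examples):

  import collections, subprocess

  def run(program, stdin_text):
      # Execute one candidate program and capture its stdout.
      p = subprocess.run(["python3", "-c", program], input=stdin_text,
                         capture_output=True, text=True, timeout=5)
      return p.stdout.strip()

  def passes(program, examples):
      try:
          return all(run(program, x) == y for x, y in examples)
      except Exception:
          return False  # crashed or timed out

  def select_submissions(candidates, examples, k=10):
      # Keep only candidates that run and reproduce the example outputs.
      passing = [c for c in candidates if passes(c, examples)]
      # Cluster by observed behaviour and submit one program per cluster,
      # capped at k attempts.
      clusters = collections.defaultdict(list)
      for c in passing:
          key = tuple(run(c, x) for x, _ in examples)
          clusters[key].append(c)
      return [group[0] for group in list(clusters.values())[:k]]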


Ah, the paper describes a fixed method for the last selection step and also AI generated tests to reduce the results even more before that. Quite a bit better, even if the participation is still only simulated.


> A recent one can learn new concepts by simple conversation, then apply them where necessary. You can talk-train your model.

Which model? Sauce please


"Memory assisted prompt editing to improve GPT 3 after deployment"

paper: https://arxiv.org/abs/2201.06009

video: https://www.youtube.com/watch?v=gYxJEd3EUKs


I believe the dominant thinking is that GPT-3 has trouble with math because it doesn't see individual digits. It obviously has no trouble working on words, which are much more discrete than numbers. I wouldn't be surprised if it had trouble carrying a long equation though. When writing, it can reconsider the whole context with each new word, externalizing that memory, but with most computations it would have to carry out the whole thing in one go. That's a lot of dedicated parameters for a single subtask.


Even the tokenization is wonky. Imagine if you had no concept of math characters and instead had a lookup table of common n-grams (BPE encoding). For example, the binary addition function "a+b" may be tokenized as a unary "3+b" because "3+b" occurs commonly. That tokenization is vastly different from "3.00000001+b". GPT has to invert this tokenization artifact with finite training data.
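You can see the wonkiness directly with a GPT-2-style BPE tokenizer, e.g. via the Hugging Face tokenizer (exact splits depend on the tokenizer version):

  from transformers import GPT2TokenizerFast

  tok = GPT2TokenizerFast.from_pretrained("gpt2")
  for text in ["3+5", "317+589", "3.00000001+5"]:
      # Superficially similar expressions get chopped into very different chunks.
      print(text, "->", tok.tokenize(text))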


Yeah, I think that's the most accepted explanation. Everything after my first sentence was total speculation, the tokenization is usually cited as the issue.


> with most computations it would have to carry out the whole thing in one go

Is there a way to allow models to say "let me think about this some more"? With language models like GPT-3 you emit one token per inference iteration, with its previous output fed back in as input/state. Can models opt out of providing a token, but still update state? That would allow it to break up the computation into discrete steps.


Here it is: https://arxiv.org/abs/1611.06188

RNN outputs "confidence" bit which can guide computation to perform more steps to obtain more confidence in the result. Essentially, RNN asks "let me think about that some more".

But, separate ablation study found that if you just drop confidence bit altogether and allow RNN to compute some more every time (e.g., always perform 4 computations on single input for 1 output), you get same or better results without extra complexity of training.

There is also a Microsoft Research paper I can't find right now about variable computation for image classification, where there is a "confidence" bit at some of the later layers - if a lower layer is confident enough, its output will be used for classification; otherwise the output of that layer is passed through further transformations in the upper layers.
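The fixed-extra-steps variant from that ablation is basically this (a toy numpy sketch, not the paper's code):

  import numpy as np

  def rnn_step(h, x, W_h, W_x):
      # One recurrent update of the hidden state.
      return np.tanh(W_h @ h + W_x @ x)

  def forward(xs, W_h, W_x, ponder_steps=4):
      # No learned confidence bit: just run the cell a fixed number of
      # times on each input before moving on.
      h = np.zeros(W_h.shape[0])
      for x in xs:
          for _ in range(ponder_steps):
              h = rnn_step(h, x, W_h, W_x)
      return h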


> But, separate ablation study found that if you just drop confidence bit altogether and allow RNN to compute some more every time (e.g., always perform 4 computations on single input for 1 output), you get same or better results without extra complexity of training.

Do they say what happens if you do both? Perhaps the "benefit from higher computation per cycle" phenomenon and the "benefit from signalling relative computation resource allocation" one are different.

I guess I’ll have to try and read the paper, but I’m new to the literature and am clueless about the current state of research.


I believe GPT-3 has a transformer-based architecture. So it doesn't recursively ingest its own output in each iteration. I believe attention-based transformer models have enough complexity to be able to learn what you are talking about on their own.


GPT-3's transformers only recur some finite amount. Attention does a lot compared to a bog standard RNN, and probably if the numbers were tokenized it would be enough for most reasonable computations, but eventually you definitely would hit a cap. That's probably a good thing, of course. The network and training are Turing complete together, but it would suck if the network itself could fail to terminate.


Thank you for pointing out the difference. I went and reread about transformers; previously I thought they were a kind of RNN. (I am not an ML engineer.)


That would be neat. You could give it backspace and "let me think more" tokens that would signal the inference program to run it again on the prompt plus its own output. That way it could generate "thoughts thoughts thoughts [THINKMORE] thoughts thoughts thoughts [THINKMORE] [BACKSPACE]x8 (the real output would go here)".

It would of course have to be penalized in some way for [THINKMORE]ing to avoid infinite processing time. It would have to learn to reason about the point at which diminishing returns kick in from continuing to [THINKMORE] vs. recording its best answer. The penalization function would have to take into account the remaining tokens that would fit in the transformer prompt.
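The decode loop for that could look something like this (a sketch; the token names and the model.next_token API are made up, and the training-time penalty isn't shown):

  THINK, BACKSPACE, EOS = "[THINKMORE]", "[BACKSPACE]", "[EOS]"

  def decode(model, prompt, max_tokens=256):
      context, output = prompt, []
      for _ in range(max_tokens):
          tok = model.next_token(context)   # hypothetical single-step API
          context += tok                    # every token stays in the context window
          if tok == EOS:
              break
          elif tok == THINK:
              continue                      # burns a step of compute, emits nothing
          elif tok == BACKSPACE:
              if output:
                  output.pop()              # retract the last visible token
          else:
              output.append(tok)
      return "".join(output)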


I think it would work, but backprop would be computed in a different way every time. I'm not an expert, so there may be sneaky ways around it, but I'm pretty sure you'd lose out on a long history of little efficiency improvements when you could just make it more recurrent instead.


Hardcoding a tokenization tweak that keeps individual digits separate would be a trivial change to the preprocessing that would not affect the rest of the model training process.
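Something like this ahead of the BPE step, for example (a sketch; a real fix would live inside the tokenizer itself):

  import re

  def split_digits(text):
      # Put a space between consecutive digits so BPE can't merge them,
      # e.g. "12345+678" -> "1 2 3 4 5+6 7 8".
      return re.sub(r"(\d)(?=\d)", r"\1 ", text)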


Can it do math on "prose numbers", eg. "two thousand three hundred and four"?


Not super well in the GPT-2 based models I have access to. It falls into different error modes though, diving into prose rather than even making an attempt. Makes sense in retrospect!


And that's where you see the man behind the curtain.


Next year: GPT-NG offloads its answers to Amazon Mechanical Turk, and we've come full circle.


Yeah for sure. With energy prices soaring, Moore's law being morally over since 2010, wages being so completely destroyed by the hatred Democrats have for them, and the sneaky little misconceptions and errors the golem's makers did not fight hard enough to keep out, AI will be supplanted by plain I.


Check out my project https://github.com/Thopliterce/transformer-arithmetic. This is a concrete implementation based on the GPT-2 model that does multiplication accurately, digit by digit. It does so by generating a dataset that teaches the model how to do multiplication step by step. Doing arithmetic actually works with just GPT-2, without an oracle.


Call it the uncanny valley, but I find this mildly disturbing… and absolutely fascinating.


That's actually pretty straightforward: (Tested with EleutherAI GPT-J-6B because why use a closed model when an open one exists?)

Prompt: "Question: Solve three plus six.

Answer:

a=3

b=6

a+b

Question: Solve twelve times fifteen. Answer: a="

And the model dutifully answered:

"a=12

b=15

a*b"

Which you could feed directly to a python console.

This kind of approach, where you craft a long prompt to make the model understand the kind of result you want, is called "prompt engineering", and I find it crazy how close we're getting to robopsychology.
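The whole loop fits in a few lines (a sketch; generate() stands in for whatever GPT-J or OpenAI completion call you're using, and the exec/eval is obviously a toy, not something to run on untrusted output):

  FEW_SHOT = (
      "Question: Solve three plus six.\n"
      "Answer:\na=3\nb=6\na+b\n"
      "Question: Solve {question}.\nAnswer:\n"
  )

  def solve(question, generate):
      # `generate` is a placeholder for the language-model completion call.
      code = generate(FEW_SHOT.format(question=question))  # e.g. "a=12\nb=15\na*b"
      *assignments, expression = code.strip().splitlines()
      namespace = {}
      for line in assignments:
          exec(line, {"__builtins__": {}}, namespace)
      return eval(expression, {"__builtins__": {}}, namespace)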


Well, the theory around neural nets strongly suggests that enough nonlinear activation functions combined in the right way should be able to learn any function, including basic arithmetic. Now, whether or not you have the right approach to training the network to get the right set of weights is a different story...


Any computable function I assume? I wonder what other limitations there might be.


An intriguing thought is that a GAI will behave very much like a well-read smart individual. With the faults, mystery and foibles that implies.


A well-read, smart human won't guess things; they will look it up, find software that gets the correct answer (like a calculator), or refer to a colleague.


If they have enough time.


From the article it seems GPT produces the correct output when the instruction is as follows:

  # Instruction
  def f(x):
      if x > 30:
          return "too large"
      else:
          return x + 3

How the hell is this different from the programmer writing the Python function herself, and where exactly is the "intelligence" in this?


People are asking this question to see if it has an evaluator like wolfram alpha.


The point of the article is that gpt3 can run code?


It's bad at math in a similar way brains are, the hell.


> GPT-3 seems to have issues with large numbers. Moyix’s gist covers this in detail. GPT-3 tends to guesstimate an algebraic function instead of evaluating the numbers, so the answer is only correct to a certain approximation.

There are two issues here. One is the lack of working memory, which means that there is very little scratch space for calculating things with a meaningful sequential depth. GPT-3 is very unlike traditional evaluation methods in this regard, in that it is easier for it to interpret the meaning of a program you give it and then intuit the result given the context than it is to mechanically execute its steps.

The other issue is the text encoding, which makes it much harder for GPT-3 to do digit-by-digit operations. Many arbitrary numbers are just their own token. A fixed-length number to us looks like a fixed number of characters, but for GPT-3 it can be an almost arbitrary number of tokens divided into almost arbitrary chunks. Using thousands separators is very helpful for it.

If you account for these and design a prompt that mitigates them you can get much stronger results. Here is an example: https://news.ycombinator.com/item?id=30299360#30309302. I managed an accuracy of 42% for 3-by-3 digit multiplication.
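Not the exact prompt from the linked comment, but the two mitigations amount to something like this (hypothetical wording):

  def with_separators(n):
      # "123456" -> "123,456": separators give the model regular
      # three-digit chunks instead of arbitrary BPE pieces.
      return format(n, ",")

  a, b = 1234, 5678
  prompt = (f"Multiply {with_separators(a)} by {with_separators(b)} step by step, "
            "writing each partial product on its own line before the final answer.\n")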


>> There are two issues here. One is the lack of working memory, which means that there is very little scratch space for calculating things with a meaningful sequential depth.

It's a language model. It can generate text, not "calculate things".

If you give it the right prompt, it will generate the right text, but if there's any computation going on, that's you computing the right prompt.

See Clever Hans:

https://en.wikipedia.org/wiki/Clever_Hans


If this were true, then engineered prompts would fail for held-out problem instances. But they don’t.


I don't understand what you mean by that. Which held-out problem instances?


Suppose you engineer a prompt to make GPT3 do arithmetic. You design the prompt to work for a particular set of training examples like 1+1 and 2+3. If all the computation is in the prompt engineering, and GPT3 is just Clever Hans, then this engineered prompt should do no better than chance if you then hand it new instances like 4+5 with the same prompt.


>> Suppose you engineer a prompt to make GPT3 do arithmetic

Oh, I think I see what you mean. Thank you for clarifying. So, no, I didn't mean that the prompt is engineered to make it look like the model is performing a calculation. I meant that GPT-3 has memorised instances of arithmetic operations and in order to retrieve them from its memory the human user must figure out the right prompt. I wrote "that's you computing the prompt", not "that's you computing the result".

The prompt is like a SQL query, right? If you don't enter the right query, you don't get the right results. That's the point of all those people on the internets fiddling with their prompts- it's like they're trying to query a database, but they don't know what the right syntax is for their query, so they tweak it until it returns the results they want.

For example, the OP mentioned thousands separators being very helpful to the model. That's because it's memorised more arithmetic results with thousands separators, than without. So you're more likely to get the right results out of it if you use thousands separators.

Also, as the OP says, GPT-3 has one concept for a digit, another for a string of digits, and another again for a string of digits mixed with other symbols. "9999" is, in its model, a different thing than "9,999".

Which, btw, is why it can't calculate. Because to calculate, a system must have a representation of the concept of a number. Otherwise, calculate- with what?


So, for people unfamiliar with deep language models like GPT, it's essentially a program that takes in a prompt and predicts the next set of words based on a training corpus -- which in GPT-3's case is a large portion of the internet. In these examples GPT is not executing any python code, it has just been trained on enough Python code/output to successfully predict what kinds of outputs these functions would produce.
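At inference time the whole thing boils down to a loop like this (a sketch; `lm` is a placeholder for the trained network and its tokenizer):

  import random

  def complete(lm, prompt, n_tokens=50):
      # The model only ever maps a token sequence to a distribution over
      # the next token; "running code" here is just repeated sampling.
      tokens = lm.tokenize(prompt)
      for _ in range(n_tokens):
          vocab, probs = lm.next_token_distribution(tokens)  # one forward pass
          tokens.append(random.choices(vocab, weights=probs)[0])
      return lm.detokenize(tokens)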


> GPT is not executing any python code, it has just been trained on enough Python code/output to successfully predict what kinds of outputs these functions would produce.

This distinction is not that clear though. If you can predict well the output of a function, that's equivalent to executing the code.


For that to be somewhat true, I can see at least two prerequisites: 1) the function must be pure(no side-effects); 2) predicting well is not enough, GPT must predict perfectly a hundred percent of the time.

Still, technically you're not executing the code.


Predicting perfectly 100% of the time is impossible as it is equivalent to solving the halting problem - and in O(1) time at that!


Computers don't execute code perfectly 100% of the time.

I agree that it's fundamentally different, but I'm not exactly sure how, and I think it's subtler than you're suggesting.


It is fundamentally different in its mathematical foundations; some functions are formally verified and therefore will execute 100% correctly (I guess you are talking about actual bugs like hardware issues?); what GPT-3 does is not even close to that; if you give GPT-3 the same input multiple times it comes up with different answers. That is nowhere close to a computer executing an algorithm.


I'm not talking about GPT-3, I'm discussing the theoretical question raised by the grandparent of my comment: How is predicting the output of a function fundamentally different from executing the code?

We call computers deterministic despite the fact that they don't perform the calculations we set them with perfect reliability. The probability that they'll be correct is very high, but it's not 1. So the requirement we have for something to be considered deterministic is certainly not "perfectly a hundred percent of the time", as the parent to my comment suggested.


> if you put the same input to gpt3 multiple times it comes up with different answers. That is nowhere close to a computer executing an algorithm.

It's a non-deterministic algorithm, of which many kinds exist. Producing different answers that are close-ish to correct is in fact what a Monte Carlo algorithm does. Not that you'd use GPT3 as a Monte Carlo algorithm though, but it's not that different.


Sure, but if you have something as clear as some of the actual deterministic python code from the article, this doesn’t fly.

Close-ish to correct makes sense for some problems and makes no sense at all for others.


Assuming no hardware errors, they do.


I don't think that's a reasonable assumption. If we allow ourselves to assume no errors, we could just assume GPT-3 makes no errors and declare it equivalent to a code interpreter.


Interpreter? Sure. That interpretation is not "equivalent to executing the code", though.

Imagine a C compiler that does aggressive optimizations - sacrificing huge amounts of memory for speed. On one hand, it even reduces computational complexity, on the other it produces incorrect results for many cases.

GPT-3 as presented here would be comparable to that. Neither are equivalent to executing the original code.

Meanwhile, the result of something like gcc is, even if it runs on a computer with faulty RAM.


I've lost track of what point you're making.

Speed and memory is orthogonal to my point, which is about the output of two methods of arriving at an answer. I'm obviously not saying GPT-3 is anything like as efficient as running a small function.

What distinction are you drawing between the output of an interpreted program and a compiled program?


If you can predict that for any input then indeed you are right. Gpt3 cannot (yet) though. It is eerily (but logically) flawed in its guessing.


I find it quite interesting that in the JSON to YAML example it reordered the list. If this was an access control list that could be a serious security issue that could have easily been missed in review. (Especially if dozens of files like this were changed at once). Of course a malicious user could have done this as well and likely got by code review but the fact that it was accidental is scarier in a way.
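A toy first-match ACL shows how bad a silent reorder can be (hypothetical rules):

  import fnmatch

  def decide(rules, path):
      # First matching rule wins.
      for rule in rules:
          if fnmatch.fnmatch(path, rule["pattern"]):
              return rule["action"]
      return "deny"

  original = [  # order as written in the JSON
      {"pattern": "/admin/*", "action": "deny"},
      {"pattern": "/*",       "action": "allow"},
  ]
  converted = list(reversed(original))  # the "converted" YAML swapped the entries

  print(decide(original,  "/admin/users"))   # deny
  print(decide(converted, "/admin/users"))   # allow: same rules, different meaning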


For folks wanting to play around with the GPT-3 code-editing capabilities referenced in the article within your own codebase, I wrote a simple open source VS Code plugin that lets you run commands against your currently-open file and get GPT-3's suggested edits back in a diff: https://marketplace.visualstudio.com/items?itemName=clippy-a...


I think I'm going to pair this with Copilot and see what happens. Hopefully I don't accidentally unlock something bizarre.


Yes, I use both together! They work together perfectly fine.


I assume that GPT-3 is just exhibiting rote memory. For small numbers, it has accurate answers memorized from the training set, but for larger numbers, it just "remembers" whatever is close... hence the ability to estimate.

My take is not that GPT-3 can run code, but rather that GPT-3 has memorized what code looks like and what the output looks like.


Theorem: There are infinitely many prime numbers.

Proof:

[GPT-3's "proof"]

Let p be a prime number. Then by definition, p is a positive integer that has no positive divisors other than 1 and p.

Now consider the integer q = p + 1. We will show that q is not a prime number, which will prove that there are infinitely many prime numbers.

Since p is a positive divisor of q, q cannot be prime.

[...another attempt with the same prompt]

We know that there are infinitely many natural numbers. Let's suppose that there are only finitely many prime numbers. Then we can list them all:

p_1, p_2, p_3, ..., p_n

Now, consider the number

N = p_1 p_2 p_3 ... p_n + 1

This number is greater than any of the prime numbers in our list, so it cannot be divisible by any of them. But it is also a natural number, so it must be prime. This contradicts our assumption that there are only finitely many prime numbers, so we must conclude that there are infinitely many prime numbers.

~~Personal remark: I had to give it the same prompt like 5 times to get a correct-ish proof. The first attempts came out like the first example.~~ The proof is still wrong.


Interestingly, these attempts are about the same as what pops up when I try to remember the proof:

- It's a proof by contradiction

- The key step is in taking the finite list of primes, multiplying them together, and adding 1

I then try to flesh out the details: it might take a second to realize that this new number is also prime, and then a few moments more to remember the exact rationale why.

Along the way the proof lives in a kind of superposition where I'm not clear on the exact details. The "proofs" you gave here seem to be serializations of a similar superposition! GPT-3 seems to remember the proof about as well as I do, but it's missing the final sanity check which tweaks the proof until all the pieces correctly fit together.

In this case, you seem to be performing a version of this sanity check by running the prompt multiple times until a correct answer comes out. I wonder if it's possible to prove something more obscure using a similar process: GPT-3 comes up with ideas and the human sanity checks.


>this new number is also prime

Not necessarily, it might be composite, but in that case one of its prime factors will necessarily not lie in the supposed list of primes, therefore also a contradiction.

The first counter example to "If L := {P0,P1,..,Pn} is a list of primes, then prod(L)+1 is prime" is {2,3,5,7,11,13}, their product is 30030, and 30031 is a composite of 2 primes, none of which are in the list.


It's somewhat silly semantics, but I believe it is a valid deductive step on the way to the contradiction - if the number is not divisible by any other prime, then it must be a new prime, ⊥.


The issue is that it is not divisible by any other prime *from the list*. The two cases (prime or composite) must be handled separately since they do not use the same logic to infer there is one more prime.

For instance, 2 * 3 * 5 * 7 * 11 * 13 + 1 = 30031 = 59 * 509.


You don't need two separate cases.

Assume p1, ..., pn is a finite list of primes. The sum p1+...+pn+1 is divisible by a prime, because every natural number> 1 is. However, it's not divisible by p1,...,pn, hence there must be an additional prime not in the list.

(I think you're right though that GP's "contradiction" doesn't work)


Oh, you're right.

Never thought of using "by definition, all numbers can be divided by a prime" to merge the two cases. It's not that much shorter, but it is IMHO quite elegant, I'll remember it. Thanks for correcting me.


Well, it's not by definition, but "every number is divisible by a prime" is fairly obvious (just keep dividing until you reach a prime) and can technically be proven by using (strong) induction.


too late to edit: p1 * ... * pn+1, of course. not plus.


But to get the contradiction, you assume a finite number of primes. As each of them does not divide the new one, the new one is not divisible by a prime. It seems like your method is some kind of induction? Which probably gets a little closer to the "reason" for it, but isn't the standard proof I've seen.


These are really just logically equivalent ways of getting at the same result. You can either prove the statement "for every finite list of primes, there exists a prime not in this list" directly from the axioms of arithmetic, or you can add its negation "there are finitely many primes" as an assumption, derive a contradiction, and therefore conclude the negation of that new assumption. Nothing substantially changes about the proof either way.


I mean, yeah? It's still true you don't need to prove the composite case separately if you structure it a little different. Plus the original comment was clearly angling for the contradiction, so pivoting without warning to induction is just misleading


There is no induction involved (I mean, yes, for some of the lemmas about divisibility, probably you'll need induction, but not for the main proof).


Oh I see! I was talking about bootstrapping from "there's always another prime" to "there's a countably infinite number of primes", but you can just piggyback off the naturals.


Well, I suppose it matters which definition of "infinity" you want to use. The modern definition of an infinite set is that it's a set into which there exists an injection from the natural numbers. But that definition brings you into the territory of set theory, which seems unnecessarily complex when you're just trying to prove something about arithmetic.

Euclid's original proof of the theorem is of the form "for any list of primes, I can find an additional prime" [0], and for good reason: in Ancient Greece, thinking of infinity, or infinite sets, as a concrete object that you could manipulate would have seemed weird.

But the proof variant where you produce a contradiction doesn't really get into the set-theoretic details either. All it does is say: "Assume there is a finite list of all primes. Derive a contradiction. Therefore there is no such list." That's pretty much equivalent to the direct proof, it's just using different logical inference rules.

[0]: http://aleph0.clarku.edu/~djoyce/java/elements/bookIX/propIX...


I feel that calling the final step a "sanity check" underrates its significance. To me, it implies that you essentially have the proof, and you are just looking for confirmation that it is sound and whether there are some edge cases to finish off. In contrast, I would say that it is the first point at which you understand how the half-remembered fragments of the proof can be put together to make an unassailable case for the proposition. Until then, it is as if you are groping around in the dark, trying to remember what the room looked like when the light was on (I know what it's like, as I have frequently been in that situation!)

These answers are the sort one might expect from something that has a vast memory for what it has seen before, and an ability to draw huge networks of syntax-level associations and generalizations from all that text, but is not so strong on semantic associations and generalizations that are not manifest at the level of syntax. What surprises me is how successful that has been.


The thing I find interesting about the proof attempts in the GP comment is that they very much resemble what you'd expect to see coming from a hypothetical somewhat confused undergrad. I think that ties into what you say about the proof living "in a kind of superposition where I'm not clear on the exact details," because that's where I imagine said hypothetical confused undergrad's understanding being.


It’s imitation rather than true understanding. Still, even imitation is a remarkable ability for a computer.


I believe this recent paper demonstrates a method for allowing these large language models to perform this "sanity check" automatically[0].

[0]: Self-Consistency Improves Chain of Thought Reasoning in Language Models https://arxiv.org/abs/2203.11171


Both proofs are wrong; the second one is closest. The second one should not claim that N is prime (it likely isn't). It should say that N is not divisible by any of the p_i, and since by the Fundamental Theorem of Arithmetic N is a product of primes q_i, none of which are among the p_i, a finite list containing all primes is impossible to construct.


This isn't really the "human level mathematician" equivalent task anyway. A human mathematician's main purpose isn't to memorize and reproduce proofs generated by other people. It's to prove original results no one else has proven before. To remember and reproduce existing proofs, I just typed "proof infinitely many primes" into DuckDuckGo and it gave me plenty of correct results.


That's like saying "standing still" isn't a human-level sprinter's task. In principle, yes, nothing in the 100m sprint requires that you need to be able to stand still. In practice, I would be very skeptical of someone who can't stand claiming they can sprint.


It's a human-level mathematics-student problem. If it can't determine that its proof is nonsense here, there's little hope it could produce any worthwhile original work.


What does GPT-3 come up with if you ask it for a proof that there are a finite number of primes? Or that pi is rational?

I guess it would stitch together some more seemingly sensible statements that also don’t quite add up to a rigorous proof?


I keep asking GPT-3 to prove that the LR algorithm (for finding eigenvalues and eigenvectors) converges for PSD matrices. It keeps insisting that it's a form of gradient descent. Is that true?


I'm actually taking a proofs class right now, and edit my LaTeX in VS Code with Copilot enabled. Its syntax is always perfect, but most of the time it produces stuff that doesn't make a ton of sense. There have been a few times when it gets the next couple of lines correct for repetitive proofs with a lot of "boilerplate", but it doesn't really make big logical/creative jumps.


Can someone explain for a dummy how this is possible? How does it know that range() is zero indexed? Was it specifically trained on Python input/function/output data? Or did it just "learn" it? Do the researchers know how it learned it?

Does it actually "run" the code? Like, if it was looping over 1 billion iterations would it take 1B times longer than if it was just one iteration? I have so many questions.


> How does it know that range() is zero indexed?

If you read through all of the internet once, would you know that range() is zero indexed?

> Like, if it was looping over 1 billion iterations would it take 1B times longer than if it was just one iteration?

It clearly cannot, because querying the network for a token executes the exact same sequence of operations every time.

But it's very impressive that it can basically recognize the Collatz Conjecture in the code and mostly guess in the right ballpark for the results.

The fact that it's just likening (in a loose sense) inputs to inputs it has seen is quite visible in the f(g(x)) vs g(f(x)) behavior - the former is significantly more common, so it struggles to work with the latter.


https://alphacode.deepmind.com/ gives you a glimpse inside of what emerged from a similar attention net trained on code. however, whether the attention net has been forced upon pixels, language, amino acid sequences, the resultant representations are a bit beyond human reasoning, even if we can examine what individual attention heads are 'looking' at


It seems more likely that it learned it. If you knew nothing about Python, but understood the word "for" a little, and understood code a little, you're likely to figure out that range() is zero-indexed after you see something like this a few times

  >>> for i in range(3): print(i)
  0
  1
  2


My mind is just blown that it learned a language runtime based on examples. What would happen if you gave it an infinitely recursive function? It can't stack overflow, there's no stack! Wait, is there?


My guess is it would respond with the standard stack overflow error, from examples of similar output posted in its training set.


It hasn't. It's memorised the examples and can generate them with some variation and according to what output is more likely given the input. But that's generation, not computation.

I don't know if this example helps, but a computer can generate (pseudo-)random numbers by executing an algorithm. A pair of dice can also generate random numbers because they're thrown so that they land at random and someone has marked pips on their faces that a human can read as numbers. The result may be similar, but one is generated by a computation and the other by a random process. The random process is not a computation. It's a random process.

(Or just unpredictable).


How do you know a range is zero indexed? (As in how is it stored in your brain)


I have no idea how it's stored in my brain. Is that the same way it's stored in GPT-3?


Probably similar. GPT3 also doesn't know how it's stored in its brain, and neither do we.


GPT3 is a really impressive auto-complete. It takes inputs and predicts what text should be output. It's super impressive and it looks like it's smart but it is not running code, it's not Turing complete and if you understand how it works it's very easy to cause it to produce significant errors.


It has a ton of programming books in its training data. It only "runs" anything that's close enough to any samples it has seen that included output. Anything complex, and it fails, because it does not reason about it logically. It's bad at the same things humans are bad at.


Human programmers rely on intuition and experience much more than some people give them credit for. An experienced programmer can find common errors quickly, simply because they’ve seen (and made) so many.

Being able to intuit what a block of code does is actually a core skill; having to actually step through code in your head is slow and difficult.


I struggle to understand how GPT-3 executes code. Is it simply running a python (or any other language) interpreter? Or is GPT-3 itself interpreting and executing python code? If the latter question is true that would be amazing.


It does not execute code, it "guesses" what the output of the code should be, given all the data it has seen during training - and, surprisingly, for many types of problems these guesses are accurate or close to that.


It is the latter.


GPT-3 is starting to remind me of SCP-914. Give it an input, and its millions of tiny wheels churn and it produces something like what you want, but otherwise quite unexpected.

Let's hope it doesn't turn into something like SCP-079...


What year will GPT be able to take an app written in Swift/SwiftUI and output a spectacular Android translation? 3 years? 5 years? 10 years?

This is an interesting benchmark because it is a very difficult problem; however, GPT has everything it needs to do this without a fundamental improvement to its core (this process is more of a science than an art), and using automated UI testing GPT can check whether its solution worked.

Thus this challenge is in the realm of what GPT already is; however, once it can do this it will have massive implications for how software is built.


A terrible prospect.

It's hard enough for people to faithfully port an application. People who participate and live in the world that makes up our reality. Leaving this up to an AI will at best flood us with low quality junk. At worst it's actively harmful.


Nit, but YAML is a superset of JSON, so no conversion required :)
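e.g. with PyYAML (modulo a few edge cases, since PyYAML implements YAML 1.1 rather than 1.2):

  import json, yaml

  doc = '{"name": "example", "ports": [80, 443]}'
  # The YAML parser accepts the JSON document as-is.
  assert yaml.safe_load(doc) == json.loads(doc)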


This sort of "do what I mean" situation, where doing the thing the user intended is different from doing something technically correct, is a place GPT-3 excels. Even though returning the input would be easiest, it has the pragmatic judgement to predict that's not what the user wants.


This is fascinating. I feel that we are still in the infancy of the field, however. These observations are analogous to naturalists of the past describing an animal's behavior; we need to get to the point where more accurate estimates are made (i.e., how often it does each thing, how accurate it is after 100+ tries, etc). Every day we see a new observation showing what GPTs can do, but we also need a good way to make these observations systematic.


It would be remarkable if it got the right answers.

But it can't, because it doesn't have the right structure (e.g. GPT-3 finishes in finite time; a program in a real programming language doesn't necessarily!)

GPT-3's greatest accomplishment is that it has "neurotypical privilege": if it gets an answer that is 25% or 95% correct, people give it credit for the whole thing. People see a spark of intelligence in it the way that people see faces in leaf axils or in Martian rock formations, or how G.W. Bush looked in Vladimir Putin's eyes and said he got a sense of Putin's soul. (That was about the only thing in his presidency that he later said he regretted!)

As an awkward person I am envious because sometimes it seems I get an answer 98% correct or 99.8% correct and get no credit at all.


GPT-3 does not think like a human, but it definitely executes code in a way that is more similar to a human than to a computer.

Proof is that humans do indeed get the wrong answer on quizzes like these sometimes!

So I cannot understand this point of view that diminishes it as a "spark of intelligence". It is exactly what is advertised: a very big step forward towards real AI, even if definitely not the last one.


>> Proof is, that indeed humans do get the wrong answer in quizzes like these sometimes!

GPT-3 gets the wrong answer because it has memorised answers and it generates variations of what it has memorised. It generates variations by sampling at random from a probability distribution over what it's memorised. If it has the correct answer memorised, sometimes it will generate the correct answer, sometimes it will generate a slight variation of it, sometimes it will generate a large variation of it, sometimes it will generate something completely irrelevant (i.e. with a very small probability).

Failure is not an exclusive characteristic of humans. In particular, any mechanical device will fail, eventually. For example, a flashlight will stop functioning when it runs out of battery. But not because it is somehow like a human and it just got it wrong that one time.


It is the Emperor's New Clothes incarnate.

It has the special talent of hijacking your own intelligence to make you think it is intelligent.

People understood this about the 1966 ELIZA program but intellectual standards have dropped greatly since then.


Is there a search engine for the training data, so that one can verify that it is actually performing novel operations and not just quoting back stuff from its incredibly large training set?


> For example, it seems to understand how to find a sum, mean, median, and mode.

> Input: 1, 4, 5, 6, 2, 1, 1

> Output: 2.28571428571

Well, even with those small numbers, it's wrong. The first "2" after the dot should not be there. The result it gives is 16/7, not 20/7.
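Quick check:

  >>> xs = [1, 4, 5, 6, 2, 1, 1]
  >>> sum(xs) / len(xs)   # 20/7, the actual mean
  2.857142857142857
  >>> 16 / 7              # what GPT-3 printed, give or take a trailing digit
  2.2857142857142856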


I wonder how much of this is an illusion of precision that comes from pattern matching on content from filler sites like https://www.free-hosting.biz/division/16-divided-7.html (I do not recommend clicking the link, but the result appears there).


I was thinking the same thing, especially as we are talking about division, and the result is "correct" for 16/7 to a great number of digits.

See also the "x = x + x three times" example, for which the result is not random but is the result for the same thing done two times instead of three (so result/2). That heavily smells like it has read sites that had nearly the same code on them.


Seems that it can convert from Python to Perl:

https://beta.openai.com/playground/p/o4qZWSXVz8JMmVaI9j9NMIK...


Oddly, "convert to C" failed completely for me (wrapping the Python in main() { }) but C++ worked.


Anyone have any ideas on how they're doing text insertion using an auto-regressive model?


Yes, they are most likely fine-tuning with this type of pretraining: https://arxiv.org/abs/2103.10360. Quite easy to build.


A quick question for anyone familiar with the architecture of these Transformer-based models -- I've heard that one reason why they don't work well with numbers is how the inputs are tokenized (i.e. as "chunks" rather than individual words/numbers). Is there anything architecturally preventing an exception in this form of tokenizing in the data preprocessing step, and passing numbers into the model in the format of 1 digit == 1 token? It seems like such a change could possibly result in a better semantic "understanding" of digits by the model.


Nothing prevents it, no. Transformers are certainly capable of learning mathematical tasks; consider [1] as an example, which uses big but regular token lengths.

Alternatively you could just scale 'till the problem solves itself.

[1] https://arxiv.org/abs/2201.04600


An interesting research direction would be to see how much GPT-3 deviates as we demand more precision on various computational tasks. Possibly this would give some measure of the concepts the model has learned.


Do we today have any test suites/benchmarks for models along those lines?


>Is GPT-3 Turing complete? Maybe.

It's obviously not. To handle infinite loops it needs to solve the halting problem. Which is not possible.


I don't quite understand your answer. You don't need to solve the halting problem to be Turing complete, quite obviously. Why would GPT-3 need to in order to be?


GPT-3 always gives an output after a certain amount of time. What should GPT-3 return when running:

  print("A");
  potentiallyHalts();
  print("B");


Potentially time out? I don’t see the difference to, say, a python interpreter with a timeout. What would a human do, are we not Turing complete?

I mean, in the strictest sense that isn’t Turing complete either, because when you have a timeout you cannot run every program a theoretical Turing machine could. But then no practical computer is, because resources are always constrained (e.g. finite memory instead of an infinite tape). So when we talk about something being Turing complete, we usually disregard the resource limitations and effectively substitute something like “we mean Turing complete in the sense that it would be if we also had infinite memory and time”.

So, I still don’t understand why GPT-3 would have to (impossibly) solve the halting problem to be Turing complete[1], but everything else including a python interpreter or lambda calculus doesn’t.

[1] Note that I don’t assert that GPT-3 could or could not be Turing complete, I just don’t know why the halting problem predicates that.


Probably easier to just observe that, if GPT-3 isn't reliably correct, then it's not consistent enough to simulate a Turing-machine and therefore isn't Turing-complete.

As for loops: a Turing-machine could do infinitely many loops, so Turing-completeness implies that a system can do the same. If GPT-3 can't do infinitely many loops, it's not strictly Turing-complete; and if it can't do many loops, then it wouldn't seem like a meaningful approximation of a Turing-complete system.


Yeah, that I agree with.


Similar to how my four year old can read books, because he’s memorized the words I’ve read to him through repeated story times.


If I remember rightly, the AlphaCode paper includes a list of benchmarks, including the results of a finetuned GPT-3 for coding. I think they did it because Codex wasn't available to them when they were doing their tests, but I might be wrong there.


Their paper has results from Codex.

(see p. 21) https://arxiv.org/pdf/2203.07814v1.pdf


Just because you can, doesn't mean that you should. For some things it's just better to use a rules-based engine that is always correct, rather than a heuristics based algorithm that gives answers that are merely close.


I don't think the author of the piece (or anyone for that matter) thinks GPT-3 should be used for running programs or evaluating functions.

It is being discussed because it is surprising that GPT-3 can do it at all. It is worth investigating what types of emergent knowledge and behavior are encoded in the trained network, as the boundaries of its capabilities may help illuminate future neural network architecture design.


This is such an interesting field but I think there needs to be more focus on determinism and correctness. The stuff that’s happening with retrieval transformers is likely where this is heading


Has anyone tried using it for SAT problems yet?


My recollection is that the original paper announcing GPT-3 included some data on how it performed against SAT-style questions.


Apparently these were college SAT questions, I'm wondering about https://en.wikipedia.org/wiki/SAT_solver


Very far from an expert on ML, but isn't GPT-3 trivially not Turing Complete since it halts deterministically?


Even a stopped clock tells the right time, twice a day.


Great so how do I run GPT-3 on my own hardware at home?


It’s not available to the public or open source so you can’t. Only the smallest models might run on a single GPU, the largest would need a large grid.


I have many computers and hundreds of threads, and not afraid to acquire 3090 cards. However GPT-3 seems elusive, and I can't find out what it takes to run it myself.



