Surprising. If only there were a way that we could have foreseen that an AI trained to write code in part by looking at people who, self-admittedly, don’t know how to write code, and people who write code for others with minimal context (Stack Overflow), would produce buggy code. It is a case of GIGO: garbage in, garbage out.

Most developers do not learn much from Stack Overflow. Why do we expect AI to fare better? In my experience, only one in ten Stack Overflow answers (optimistically) gives anything more than a code snippet with just enough information to get the asker through their immediate issue. That can be helpful if you already have the necessary understanding and only want the snippet, but it is no way for humans or machines to learn.

Also, having an “AI Assistant” must lower programmers’ guards against buggy code. After all, it is an assistant - it must assist you, right? Subordinating humans to machines will not work in this domain until there is better training data and the machines can be taught the reason they are writing specific code. Until then, I have low hopes for AI-generated code.

Even if AI could generate correct, bug-free code the vast majority of the time (say 99.9%), I expect finding and correcting the remaining bugs will be difficult for humans. For example, how many bugs are found and corrected by the author of the code during development, versus how many in peer review? I’m reminded of the saying: “Ask someone to review 5,000 lines of code: no bugs. Ask someone to review 5 lines of code: 5 bugs.” We are poor critical reviewers, and AI cannot fix that. AI assistants probably make reviews worse, because reviewers will expect high-quality code from their AI assistants.



If I had a little robot riding in the passenger seat that could tell me whether to go left, straight, or right, and it was correct 90% of the time, I'd think that was pretty great. I'd get where I needed to be, even with a couple mishaps.

ML code suggestions are the same thing to me. If I don't know where I am going, I can just ask it for suggestions. And it's probably going to be what I want.

In both cases, I am annoyed with myself for having started before I knew where I want to end up.


The problem with ML is that it's pattern recognition, an approximation. Code is absolute; it's logic that is interpreted very literally and very exactly. This is what makes it so dangerous for coding: it creates code that's convincing to humans but with deviations that allow for all sorts of bugs. And the worst part is, since you didn't write the code, you may not have the skills (or time) to figure out whether those bugs exist, especially if the ML is extremely convincing/clever in what it writes. I would argue that this overhead is even worse for productivity than just writing it yourself.


Expand your view. AI can write tests, read error messages, and find bugs in your code; we just need to give it those tasks.

Let's think about tests. You write a function, and the AI writes a few tests for you; maybe you need to add a few more. But it's better to have tests, and you might have missed some of those cases yourself.

Error messages - we rely on error messages to make the leap from "code parrots" to "bug free". Most of our code fails the first time we run it. We're just fancy pattern matchers too, but we have a runtime. So the AI could also fix its own bugs, given the opportunity.

Finding bugs - we can train AI to spot bugs. It can become an excellent tool to check not just AI code, but also human code. Having a bug detector running in the background would be great, even if it is not perfect.
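
As a rough sketch of that "give it a runtime" loop (not any particular tool's API - generate_code and run_tests here are hypothetical stand-ins for your model call and test harness):

    # Hypothetical sketch: let the model repair its own output using test failures.
    def self_repair(task, generate_code, run_tests, max_rounds=5):
        code = generate_code(prompt=task)
        for _ in range(max_rounds):
            ok, error_log = run_tests(code)
            if ok:
                return code  # tests pass, accept this candidate
            # feed the failure back so the next attempt sees the error as context
            code = generate_code(
                prompt=f"{task}\n\nPrevious attempt:\n{code}\n\n"
                       f"Test failures:\n{error_log}\n\nFix the code."
            )
        return None  # give up after max_rounds attempts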


All those things are susceptible to the same issue: the ML model can generate test cases and error messages that are convincing to you and me regardless of whether they're actually right. Don't get me wrong, ML will reach a point someday where it catches up to humans in this respect; this is merely a shortcoming of today's ML, not tomorrow's, and it highlights where ML's weaknesses currently lie (in contrast to more vaguely defined goals like artwork).


It doesn't work in "closed book" form; the only way you can do that now is to have good tests and let the language model make many attempts.


Yup, the next iteration of ML needs to focus more on proofs. Okay, you know this, but how do you know it? Show me the proof.


It's about refining the approximations it makes until those approximations are too tiny/negligible to create failures in code.


Your example hinges on at least two things:

1) How many turns do you take on a particular trip

2) How do those wrong turns end up? Whether it's "travel time extended by 30 seconds" or "my car, the car I hit, and the side of this building are all in shambles" changes what a 10% failure rate means a lot.


Right? Took a turn the wrong way down a one-way street. Hit a lady with a stroller who was looking the other way. She is a surgeon. Her husband is a lawyer. You killed the kid. Your life is over.


On the other hand, you did just describe the early days of both consumer GPS navigation and things like MapQuest. There were countless stories and gleeful news articles about people turning the wrong way down one-way streets, driving into ponds, getting stuck on logging roads, and all sorts of other chaos caused by inattentive drivers and blind faith in flawed systems. But the key takeaway, I think, is that in the end consumer GPS succeeded. Bugs were fixed, and now everyone really does have a little robot in their pocket that can tell them how to get somewhere.


90% right is superficially impressive but in actual practice is abysmal.

Voice recognition software needed to get to 99.9% accuracy to be actually usable.


The difference is that with code suggestions, you don't necessarily notice errors, even extremely important errors.


> If I had a little robot riding in the passenger seat that could tell me whether to go left, straight, or right, and it was correct 90% of the time, I'd think that was pretty great. I'd get where I needed to be, even with a couple mishaps.

Pretty sure you wouldn't once it starts telling you to make illegal turns, or to turn where there are no roads, etc.; that is the state of language models for code. You'd likely run over a person or get stopped by the police at some point if you listened to it, and then it's easier to just turn it off so you can focus on driving. A bad assistant is much worse than no assistant.

Edit: And these models are more like Tesla's Autopilot than an assistant giving you directions, since you have to oversee what they do instead of letting them tell you what to do. An autopilot that does the wrong thing 10% of the time is horrible.


One in ten turns being told to do the wrong thing sounds incredibly annoying. Maybe if you’re driving on the highway for long stretches, but otherwise that would be awful.


> having an “AI Assistant” must lower programmers’ guards against buggy code

Why would you assume that?

If it’s buggy a couple times, if everyone talks about how buggy and unreliable it is, it can easily become common knowledge and common practice to triple check the output.


Then how much time are you actually saving if you have to review everything it produces? The bottleneck was never typing speed; at that point, all the AI is letting you do is produce more buggy code more quickly.


I use Copilot daily and experimented with using ChatGPT for real work code.

It’s an incredibly valuable tool even with having to rewrite the larger outputs… the small stuff like autocompleting variables and keys is highly accurate and what it most often generates (it’s scary how good it is at finishing the exact line you had in your head x50 times a day).

What you need to be careful about is when it generates entire functions or whole mini-modules. This is still extremely useful because it gets your brain running. It provides a simple template to suggest how it might look.

That way you’re no longer starting from scratch, you see a toy example with real code - for intellectual/creative work having that sort of seed is super useful.

Imagine a writer with writers block staring at a blank page vs a generated sample of dialogue between two characters or an intro paragraph to get the ball rolling.

Usually you have to burn a few cycles, fail a couple times writing some code, to get to the point where you’ve written something good. So it’s a normal part of the process to throw Version 0.1 away IMO.


1) verifying code is harder than writing it and

2) verifying code requires domain knowledge, which implies that the utility of these models is limited to things I could write myself if I weren't too lazy. That's hugely constricting.


Yes but I don’t see it as generating entire blocks of code you’re supposed to copy and paste into your project.

It’s like a template, a suggestion from which you can build your own version.

Only rarely does it have the context or understanding of the wider codebase to do a programmer's job for them.

When it does generate a copy/pastable function, it's usually some isolated utility like "format date as DD-YYYY" - something really simple and easy to verify. The type of stuff you'd copy entirely from Stack Overflow rather than finding a specific solution to adapt.
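
For concreteness, the scale of snippet meant here is something like this (a hypothetical sketch, taking the "DD-YYYY" format above literally):

    from datetime import date

    def format_date_dd_yyyy(d: date) -> str:
        # Trivial, isolated, easy to verify by eye - the kind of output you'd
        # accept wholesale rather than adapt. ("DD-YYYY" is just the example
        # format named above, not a real standard.)
        return f"{d.day:02d}-{d.year}"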

It’s mostly a glorified autocomplete and example suggestion service. It is not a full code writing service.

Domain expertise will obviously still be a job requirement. It’s an assistant to the programmer, not an occasional replacement for the programmer (and if you have domain expertise you usually use a 3rd party library).

Maybe future versions will try to do more but that’s not what we have today.


I think that many people will treat it as something that can generate entire blocks of code. Unfortunately it can be quite broken even just writing basic functions. One of my tests was doing some quaternion rotations. It did them, but refused to stop multiplying by the conjugate afterwards.

Another was converting dates to "years ago", which was broken for BCE because ChatGPT doesn't understand the underlying concept.
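
(For reference, rotating a vector by a unit quaternion applies the conjugate exactly once, v' = q·v·q*. A minimal NumPy sketch of the standard formula - roughly what I'd expect instead - looks like this:)

    import numpy as np

    def quat_mul(a, b):
        # Hamilton product of quaternions given as (w, x, y, z)
        w1, x1, y1, z1 = a
        w2, x2, y2, z2 = b
        return np.array([
            w1*w2 - x1*x2 - y1*y2 - z1*z2,
            w1*x2 + x1*w2 + y1*z2 - z1*y2,
            w1*y2 - x1*z2 + y1*w2 + z1*x2,
            w1*z2 + x1*y2 - y1*x2 + z1*w2,
        ])

    def rotate(v, q):
        # v' = q * v * q_conjugate, applied once - no extra conjugate products.
        q_conj = np.array([q[0], -q[1], -q[2], -q[3]])
        v_quat = np.array([0.0, v[0], v[1], v[2]])
        return quat_mul(quat_mul(q, v_quat), q_conj)[1:]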


> I think that many people will treat it as something that can generate entire blocks of code. Unfortunately it can be quite broken even just

Have you tried to use it for this purpose?

It basically can’t unless you’re only building a toy app. Even after multiple levels of refinement it still requires tons of real programming work.

Which is largely my point: it won't, because it's fundamentally incapable of providing that in its current state. Even setting aside the bugs, it mostly just outputs generic stuff that will always need to be integrated into the wider codebase and adapted to what you're specifically trying to build.


I have, actually. Today I had it write a basic JS map for my wedding site. Yesterday I had it produce a puzzle solver with CLI and DSL. Obviously I'm still doing manual interventions at key points, but it's changed my personal cost/benefit calculation on whether various random ideas are worth doing.


> Surprising.

For me it was the least surprising thing I've read in a while.

One thing that has become clear is that AIs learn to cover the common cases very well, and the less common ones not at all well. So you get a lot of mistakes in the areas where there is not a lot of training data. Sometimes those mistakes don't matter. Creating art is a great example - how do you even define a mistake in art? (Perhaps it's not so difficult, as apparently AI output is recognisable, but it takes an experienced eye to spot where it's gone wrong.) So is playing a game like Go - playing the 2nd best move occasionally doesn't matter. Apparently protein folding fits into the same class, as getting the basic shape right matters a lot. Ditto for voice to text - humans compensate, and the same for translation. There are lots of places mistakes don't matter.

But some applications are very intolerant of mistakes. Driving a car appears to be one. Programming is probably one of the worst, as minor mistakes don't just degrade the program; they can destroy it in a way that's utterly non-obvious. It seems to me the current generation of AIs is always going to struggle in areas where mistakes, even minor ones, are costly.

Interestingly, one area of programming where mistakes are tolerable is review. Therefore I'd predict programmers will find AIs reviewing code a net positive.


AI can learn to do code review; there is plenty of data on GitHub. It could also write tests and suggest possible bugs on its own. Overall, using it might be better than doing it by hand.

If you are using the AI just to write snippets of code, then it is suboptimal. What it needs is to monitor execution errors and fix its code over a few iterations, just like humans do.


Humans fix the code by understanding the model it represents. In fact, bug fixing is often where you are forced to understand what you wrote glibly from memory/pattern matching.


Does it look at the questions on Stack Overflow? That would be silly. But yeah, even the answers are far from perfect - they might solve the immediate problem but lack error checking, use undocumented features, etc.


reCaptcha v5: Which lines have errors?


> If only there were a way that we could have foreseen that an AI trained to write code in part by looking at people who, self-admittedly, don’t know how to write code, and people who write code for others with minimal context (Stack Overflow), would produce buggy code. It is a case of GIGO.

So, I'll claim the real issue is just that this generation of AI isn't able to "learn", it merely "trains": if I were alone in a room for years and you gave me a book on how to program that has an error in it, during my careful study of the book (without a computer to test on!), I am likely to notice the error, get annoyed at the author trying to figure out if I failed to understand some special case, and then eventually decide the author was wrong. With only the knowledge from the book, I will also be able to study the concepts of programming and will eventually be able to design large complex systems; again: I will be able to do this even if I don't have a computer, in the same way people have studied math for millennia.

And like, this is how we all learned to program, right? The books and tutorials we learn to program with often suck; but, after years dedicated to our craft synthesizing the best of what we learn, we not only can become better than any one of the sources we learned from, given enough time to devote to practice and self-study we can become better than all of them, both combined and on average (and if we couldn't, then of course no progress could ever be made by a human).

With a human, garbage in can lead to something fully legitimate out! A single sentence by someone saying "never do X, because Y can happen, where Y is extremely important" can cause us to throw out immense amounts of material we already learned. Somewhere, GitHub Copilot has seen code that was purposefully documented with bugs (the kind we use to train humans for "capture the flag events") as well as correct code with comments explaining how to avoid potential bugs... it just didn't "give a shit", and so it is more likely to do something ridiculous like generate code with a bug in it and a comment explaining the bug it just generated than to generate correct code, because it doesn't have any clue what the hell it is doing and isn't analyzing or thinking critically about the training input.

> Even if AI could generate correct, bug-free code the majority (say 99.9% of the time), I expect finding and correcting bugs will be difficult for humans.

There is some error rate below which you beat the chance of a human making a dumb mistake just because they are distracted or tired, and at that point the AI will simply beat the humans. I don't know if that is 99.9% or 99.9999% (it might be extremely tight, as humans make thousands and thousands of individual decisions in their code every work session), but past that point you are actually better off than the current situation, where I first program something myself and then hire a team of auditors to verify I coded it correctly (and/or a normal company where someone is tasked to build something and then every now and then someone like me is hired to figure out if there are serious mistakes).


> With a human, garbage in can lead to something fully legitimate out!

Because we get to see the error messages, fix, and try again. You can try this on ChatGPT - give it a task, run the code (it probably fails), copy the error back, and let it fix its errors. After a few rounds it gets the result with much higher probability than when you allow it one single shot.

A language model can write programs, and we can then run those programs to check whether they pass tests; that gives the language model a special signal - execution feedback. If you retrain the model on this new data, it will learn to code better and better. That is reinforcement learning, not just language modelling.

AlphaGo was able to generate its own data and beat humans at Go by doing this exact thing. It's an evolutionary method as well, because you are cultivating populations of problems and solutions through generate + execute + validate.
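
A rough sketch of that generate + execute + validate step as data collection (sample_solutions and run_tests are hypothetical helpers; the retraining itself is out of scope here):

    def collect_training_data(problems, sample_solutions, run_tests, k=8):
        # Sample k candidate programs per problem and keep only those that pass
        # their tests; the surviving (problem, solution) pairs become the
        # fine-tuning data, so execution feedback shapes the next model.
        dataset = []
        for problem in problems:
            for candidate in sample_solutions(problem, n=k):
                passed, _log = run_tests(problem, candidate)
                if passed:
                    dataset.append((problem, candidate))
        return dataset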


> Because we get to see the error messages, fix and try again.

As I noted explicitly, a human will get better even with garbage input even without access to a computer. I also explicitly noted how we are able to learn from a single well-reasoned note.

I recommend you seriously evaluate how you yourself learn if you truly believe that you only learn things using active feedback from external sources of truth via trial runs.



