I have a YouTube channel with a playlist about coding with AI, where I show what I do with it (a small part, actually, but representative enough, I hope). This is the first video of the series, I believe:
I don't get it. Watched the first video, and it seems like the LLMs provided no value at all? Like, you already knew where the bug was, and what the bug was, and they couldn't find it? So 20+min of messing around with prompts and reading through pages of output.. for nothing? How is this supposed to help?
In the video I show what happened AFTER the LLM fixed the bug, which it did while I was not recording. Of course I had no idea where the bug was when the LLM found it. During a refactoring I had removed a line by mistake, causing memory corruption. The LLM, seeing the code and the patch, immediately pointed out what the issue was.
Data point of one: ChatGPT 3.5, even the free product, is so much better at answering technical questions than Google.
Some questions I had successfully answered recently:
> "I would like to animate changing a snippet of code. I'll probably be using Remotion. Is there a JavaScript library that can animate changing one block of text into another?"
> "In Golang, how can I unit test a http mux? How can I test the routes I've registered without making real http calls?"
> "Hello ChatGPT. I have an application using OpenTelemetry. I want to test locally whether it can collect logs and metrics. The application is running in Tilt. How can my local integration test read the logs and metrics from the OTel collector?"
ChatGPT is better on average than Google for arriving at a correct answer, for sure, but they fail in different ways. When Google fails, it's usually in the form of "I cannot find an answer; better ask someone smart for help," but when ChatGPT fails, it's often by giving an incorrect answer.
Depending on your fault tolerance and timeline, one will be better than the other. If you have low tolerance for faults, ChatGPT is bad, but if you are in a crunch and decide it's OK to be confidently incorrect some small percentage of the time, then ChatGPT is a great tool.
Most industry software jobs, at least the high-paying ones, generally have low fault tolerance, and that's why ChatGPT is not entirely replacing anyone yet.
So, even in your example, and even if you write all the code yourself, there is still a risk that you are operating above your own competence level, do exactly as ChatGPT instructs, and then it fails miserably down the line because ChatGPT provided a set of steps whose flaws an expert would have seen.
I would also use the axis of how easy it is to tell if it’s wrong. If you ask an LLM for code and you quickly get a syntax error or the wrong result, it’s not going to waste much time or, usually, make you look bad. If you ask it to do some analysis on a topic where you don’t have enough knowledge to tell if it’s right, however, that’s a lot riskier, because you’re the one who ends up with the negative reputation.
This is the big problem with Google’s AI results: before, the wrong answer came from seoscum.com and people would learn to ignore it. Now the wrong answer carries Google’s corporate reputation, and there’s no way to conditionally distrust it, so you learn not to trust them for anything.
Google doesn't really say it can't find an answer; instead it finds less relevant (irrelevant) search results. LLMs hallucinate, while search engines display irrelevance.
> Data point of one: ChatGPT 3.5, even the free product, is so much better at answering technical questions than Google.
That's not the point of Google. It gives you a start to research the answer you need. ChatGPT just gives you an answer that might not be correct. So how do you define "successfully answered"?
In programming there are always tradeoffs. It's not about picking the 1st answer that looks like it "runs".
One example: I had to send a report to a Slack webhook, showing how many Oban jobs were completed in the last 24 hours for specific use cases, based on Oban job params.
That's:
an SQL query, reading the Slack webhook API docs, an Ecto query for Oban jobs with complex param filtering, an Oban cron job, and cron syntax.
Easily a 2-hour job, right?
It took me 5 minutes with AI. Then we decided to send the Slack alert at 7am EST instead of 12pm PST. Instead of doing all that math, I just hit Ctrl+K and asked it to change it. A 1-second change.
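For anyone who hasn't touched these pieces before, the webhook half really is small. Here's a rough sketch in Go (not the Elixir/Oban code in question; the URL and message are placeholders), with the schedule side being nothing more than a cron expression like "0 7 * * *" evaluated in the America/New_York timezone:

    package main

    import (
        "bytes"
        "encoding/json"
        "fmt"
        "log"
        "net/http"
    )

    // postToSlack sends a plain-text message to a Slack incoming webhook.
    // Incoming webhooks accept a JSON body of the form {"text": "..."}.
    func postToSlack(webhookURL, message string) error {
        body, err := json.Marshal(map[string]string{"text": message})
        if err != nil {
            return err
        }
        resp, err := http.Post(webhookURL, "application/json", bytes.NewReader(body))
        if err != nil {
            return err
        }
        defer resp.Body.Close()
        if resp.StatusCode != http.StatusOK {
            return fmt.Errorf("slack webhook returned %s", resp.Status)
        }
        return nil
    }

    func main() {
        // Placeholder URL; a real one comes from the Slack app's settings.
        err := postToSlack("https://hooks.slack.com/services/XXX/YYY/ZZZ",
            "42 jobs completed in the last 24 hours")
        if err != nil {
            log.Fatal(err)
        }
    }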
These gains are compounding. If you're an experienced engineer, you let go of minutiae and FLY. I believe syntax memorization is dead.
Hold up, if you don't know a language's syntax, how can you verify at a glance that the answer returned by the LLM is correct? Because a) nobody writes exhaustive tests, LLMs included, and b) you wouldn't be able to read the tests to confirm their validity either.
I struggle to think of a case where explaining a task to an LLM in natural language is somehow faster than writing it yourself, specifically when you know a programming language and the related libs needed to accomplish the task, implying non-zero ROI on learning these things.
Adapt to? I'm not saying not to use AI, but whether you use AI, do it manually, or outsource it, it's never 5 minutes. It's about being responsible. Double-check what happens. Review it.
antirez's series looks awesome. My two cents w/ composer:
Don't rely on the LLM for design. Start by defining some tests via comments or whatever your tools allow. The tab completion models can be helpful here. Start the interfaces / function defs and constrain them as much as possible. Let the LLM fill in the rest using TDD. The compiler / test loop along with the constraints can get you pretty far. You'll always need to do a review but I find this cuts out a lot of the wasted cycles you'll inevitably get if you "vibe" with it.
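A minimal sketch of what that constraining can look like (Go here, with invented names, everything in one _test.go file for brevity): pin the signature and the tests first, leave the body unimplemented, and let the model fill in only that gap, with the compile/test loop checking the result.

    package ledger

    import (
        "errors"
        "testing"
    )

    // The constrained surface: the model only has to fill in Apply's body,
    // not invent the design.
    type Account struct {
        Balance int64 // cents
    }

    var ErrOverdraft = errors.New("overdraft")

    // Apply adds amount to the balance, or returns ErrOverdraft if the
    // result would go negative. Left unimplemented on purpose.
    func Apply(a Account, amount int64) (Account, error) {
        panic("TODO: have the model fill this in against the tests below")
    }

    // The tests pin the behavior before any implementation exists.
    func TestApplyCredits(t *testing.T) {
        got, err := Apply(Account{Balance: 100}, 50)
        if err != nil || got.Balance != 150 {
            t.Fatalf("got %+v, %v", got, err)
        }
    }

    func TestApplyRejectsOverdraft(t *testing.T) {
        if _, err := Apply(Account{Balance: 100}, -200); !errors.Is(err, ErrOverdraft) {
            t.Fatalf("expected ErrOverdraft, got %v", err)
        }
    }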
For Cursor specifically, you'll want to aim for prompts that will get you from A to B with the least amount of interactions. Cursor will do 25 tool calls (run the tests / compile, whatever) with one interaction. A little planning goes a long way to doing less and paying less.
I agree. I'm not suggesting to optimize for that. We want the best acceptable outcome in the least amount of time.
More tool calls per interaction are typically a product of planning ahead, which in my experience produces a better outcome. Each tool call is a verification of the last step. Without those guardrails I find I waste a tremendous amount of time.
> Are you willing to elaborate about how they can accelerate you?
A few examples from my experience:
- Here is a SQL query. Translate this back into our janky ORM
- Write tests that cover cases X, Y, Z
- Wtf is this code trying to do?
- I want to do X, write the boilerplate to get me started
- Reshape this code to follow the new conventions
And it often picks up on me doing a refactor and starts making suggestions, so refactoring feels like tab tab tab instead of type type type.
- "I've never used this language / framework, this is what I'm trying to do, how would I do it?"
- "The documentation for these libraries is ... not useful. How do I do X?"
Followed by: "these parts didn't work due to these restrictions, tell me more".
(I'm currently using this one to navigate Unity and UdonSharp's APIs. It is far from perfect, but having *something* that half-works and moves me in the right direction of understanding how everything connects together is much, much faster than sitting there, confused, unable to take a single step forward.)
I find that most of the cases where "just read the documentation" is the best route are situations where there is good (or any) documentation, organized in a single, usable place, that doesn't require literal days' worth of study to understand enough context to do what is, with all that context, a very simple task.
I'm reminded a bit of the days when I was a brand new Java programmer and I would regularly Google / copy-paste:
    public class Foo {
        public static void main(String[] args) {
        }
    }
Or new Python devs when they constantly have to look up
    if __name__ == '__main__':
        run_me()
because it's just a weird, magical incantation that's blocking their ability to do what they want to do