My experience (almost exclusively with Claude) has been so different that I don't know what to say. Some of the examples are the kinds of things I explicitly wouldn't expect LLMs to be particularly good at, so I wouldn't use them for those; for others, she says it just doesn't work for her, and that experience is so different from mine that I don't know how to respond.
I think that there are two kinds of people who use AI: people who are looking for the ways in which AIs fail (of which there are still many) and people who are looking for the ways in which AIs succeed (of which there are also many).
A lot of what I do is relatively simple one off scripting. Code that doesn't need to deal with edge cases, won't be widely deployed, and whose outputs are very quickly and easily verifiable.
LLMs are almost perfect for this. It's generally faster than me looking up syntax/documentation, when it's wrong it's easy to tell and correct.
Look for the ways that AI works, and it can be a powerful tool. Try and figure out where it still fails, and you will see nothing but hype and hot air.
Not every use case is like this, but there are many.
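As a concrete (made-up) illustration of the kind of one-off script I mean, where the output is trivially checkable against the source data:

```python
import csv, io

# Hypothetical one-off: total a numeric column from a CSV export.
# Easy to sanity-check against the original spreadsheet, so any
# mistake the LLM makes is obvious the moment you run it.
data = io.StringIO("name,hours\nalice,3.5\nbob,2.0\n")
total = sum(float(row["hours"]) for row in csv.DictReader(data))
print(total)  # -> 5.5
```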
-edit- Also, when she says "none of my students has ever invented references that just don't exist"...all I can say is "press X to doubt"
> Look for the ways that AI works, and it can be a powerful tool. Try and figure out where it still fails, and you will see nothing but hype and hot air. Not every use case is like this, but there are many.
The problem is that I feel I am constantly being bombarded by people bullish on AI saying "look how great this is", but when I try to do the exact same things they are doing, it doesn't work very well for me.
Of course I am skeptical of positive claims as a result.
I don't know what you are doing or why it's failed. Maybe my primary use cases really are in the top whatever percentile for AI usefulness, but it doesn't feel like it. All I know is that frontier models have already been good enough for more than a year to increase my productivity by a fair bit.
Your use case is in fact in the top whatever percentile for AI usefulness. Short simple scripting that won't have to be relied on due to never being widely deployed. No large codebase it has to comb through, no need for thorough maintenance and update management, no need for efficient (and potentially rare) solutions.
The only use case that would beat yours is the office worker who can't write professional-sounding emails but has to send them out manually on a regular basis.
I fully believe it's far better at the kind of coding/scripting that I do than the kind that real SWEs do. If for no other reason than the coding itself that I do is far far simpler and easier, so of course it's going to do better at it. However, I don't really believe that coding is the only use case. I think that there are a whole universe of other use cases that probably also get a lot of value from LLMs.
I think that HN has a lot of people who are working on large software projects that are incredibly complex and have a huge numbers of interdependencies etc., and LLMs aren't quite to the point that they can very usefully contribute to that except around the edges.
But I don't think that generalizing from that failure is very useful either. Most things humans do aren't that hard. There is a reason that SWE is one of the best paid jobs in the country.
Even a 1 month project with one good senior engineer working on it will get 20+ different files and 5,000+ loc.
Real programming is on a totally different scale than what you're describing.
I think that's true for most jobs. Superficially, an AI looks like it can do the job well.
But LLMs:
1. Hallucinate all the time. If they were human we'd call them compulsive liars
2. They are consistently inconsistent, so they are useless for automation
3. Are only good at anything they can copy from their data set. They can't create, only regurgitate other people's work
4. AI influencing hasn't happened at scale yet, but it will very soon start making LLMs useless, much like SEO has ruined search. You can bet people are already seeding the internet with advertising and misinformation aimed squarely at AIs and AI training
> Even a 1 month project with one good senior engineer working on it will get 20+ different files and 5,000+ loc.
For what it's worth, I mostly work on projects in the 100-200 files range, at 20-40k LoC. When using proper tooling with appropriate models, it boosts my productivity by at least 2x (being conservative). I've experimented with this by going a few days without using them, then using them again.
Definitely far from the massive codebases many on here work on, small beans by HN standards. But also decidedly not just writing one-off scripts.
> Real programming is on a totally different scale than what you're describing.
How "real" are we talking?
When I think of "real programming" I think of flight control software for commercial airplanes and, I can assure you, 1 month != 5,000 LoC in that space.
And... I know people who now use AI to write their professional-sounding emails, and they often don't sound as professional as they think they do. It can be easy to just skim what an AI generates and think it's okay to send if you aren't careful, but the people you send those emails to actually have to read what was written and attempt to understand it, and doing that makes you notice things that a brief skim doesn't catch.
It's actually extremely irritating that I'm only half talking to the person when I email with these people.
It's kinda like machine translated novels. You have to really be passionate about the novel to endure these kinds of translations. That's when you realize how much work novel translators do to get a coherent result.
It's especially jarring when you've read translations that had real thought put into them. I noticed this in xianxia (Chinese power-fantasy), where the choice of what to translate versus what to transliterate can have a huge impact. Editorial work also matters: sometimes something translated earlier needs to be revised based on information revealed later.
I literally had a developer of an open source package I’m working with tell me “yeah that’s a known problem, I gave up on trying to fix it. You should just ask ChatGPT to fix it, I bet it will immediately know the answer.”
Annoying response of course. But I’d never used an LLM to debug before, so I figured I’d give it a try.
First: it regurgitated a bunch of documentation and basic debugging tips, which might actually have been helpful if I had just encountered the problem and hadn't put any thought into debugging it yet. In reality, I had already spent hours on the problem. So: not helpful.
Second: I provided some further info on environment variables I thought might be the problem. It latched on to that. “Yes that’s your problem! These environment variables are (causing the problem) because (reasons that don’t make sense). Delete them and that should fix things.” I deleted them. It changed nothing.
Third: It hallucinated a magic numpy function that would solve my problem. I informed it this function did not exist, and it wrote me a flowery apology.
Clearly AI coding works great for some people, but this was purely an infuriating distraction. Not only did it not solve my problem, it wasted my time and energy, and threw tons of useless and irrelevant information at me. Bad experience.
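A cheap habit that would have short-circuited that third failure: before chasing a function the LLM names, check that the module actually exposes it. Shown here with the stdlib `math` module, but the same check works on `numpy` (the function names are made up for illustration):

```python
import math

def really_exists(module, name):
    # True only if the module exposes a callable with this name.
    return callable(getattr(module, name, None))

print(really_exists(math, "isqrt"))        # real function -> True
print(really_exists(math, "magic_solve"))  # hallucinated -> False
```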
The biggest thing I've found is that if you give any hint at all as to what you think the problem is, the LLM will immediately and enthusiastically agree, no matter how wildly incorrect your suggestion is.
If I give it all my information and add "I think the problem might be X, but I'm not sure", the LLM always agrees that the problem is X and will reinterpret everything else I've said to 'prove' me right.
Then the conversation is forever poisoned and I have to restart an entirely new chat from scratch.
98% of the utility I've found in LLMs is getting it to generate something nearly correct, but which contains just enough information for me to go and Google the actual answer. Not a single one of the LLMs I've tried have been any practical use editing or debugging code. All I've ever managed is to get it to point me towards a real solution, none of them have been able to actually independently solve any kind of problem without spending the same amount of time and effort to do it myself.
> The biggest thing I've found is that if you give any hint at all as to what you think the problem is, the LLM will immediately and enthusiastically agree, no matter how wildly incorrect your suggestion is.
I'm seeing this sentiment a lot in these comments, and frankly it shows that very few here have actually gone and tried the variety of models available. Which is totally fine, I'm sure they have better stuff to do, you don't have to keep up with this week's hottest release.
To be concrete - the symptom you're talking about is very typical of Claude (or earlier GPT models). o3-mini is much less likely to do this.
Secondly, prompting absolutely goes a huge way to avoiding that issue. Like you're saying - if you're not sure, don't give hints, keep it open-minded. Or validate the hint before starting, in a separate conversation.
I literally got this problem earlier today on ChatGPT, which claims to be based on o4-mini. So no, does not sound like it's just a problem with Claude or older GPTs.
And on "prompting", I think this is a point of friction between LLM boosters and haters. To the uninitiated, most AI hype sounds like "it's amazing magic!! just ask it to do whatever you want and it works!!" When they try it and it's less than magic, hearing "you're prompting it wrong" seems more like a circular justification of a cult follower than advice.
I understand that it's not - that, genuinely, it takes some experience to learn how to "prompt good" and use LLMs effectively. I buy that. But some more specific advice would be helpful. Cause as is, it sounds more like "LLMs are magic!! didn't work for you? oh, you must be holding it wrong, cause I know they infallibly work magic".
> I understand that it's not - that, genuinely, it takes some experience to learn how to "prompt good" and use LLMs effectively
I don't buy this at all.
At best "learning to prompt" is just hitting the slot machine over and over until you get something close to what you want, which is not a skill. This is what I see when people "have a conversation with the LLM"
At worst, you're a victim of the sunk cost fallacy: because you spent time on the thing, you believe you've developed a skill at it, when really no skill is involved. As a result you delude yourself into thinking the output is better, not because it actually is, but because you spent time on it, so it must be.
On the other hand, when it works it's darn near magic.
I spent like a week trying to figure out why a livecd image I was working on wasn't initializing devices correctly. Read the docs, read source code, tried strace, looked at the logs, found forums of people with the same problem but no solution, you know the drill. In desperation I asked ChatGPT. ChatGPT said "Use udevadm trigger". I did. Things started working.
For some problems it's just very hard to express them in a googleable form, especially if you're doing something weird almost nobody else does.
I started (re)using AI recently. It mostly failed until I settled on a rule:
if it's "dumb and annoying", I ask the AI; otherwise I do it myself.
Since then the AI has been saving me a lot of time on dumb and annoying things.
A few models are also pretty good for basic physics/modeling stuff (getting basic formulas, fetching constants, doing some calculations). I recently used one for ventilation/CO2 calculations in my room and the results matched observed values pretty well. Then it pumped out a formula in broken Desmos syntax, I fixed that by hand, and we were good to go!
---
(dumb and annoying thing -> time-consuming to generate with no "deep thought" involved, easy to check)
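The kind of calculation I mean, as a rough sketch. The steady-state formula (indoor = outdoor + generation/ventilation) is standard; the specific numbers below are typical textbook assumptions, not my actual measurements:

```python
# Steady-state indoor CO2: concentration = outdoor + generation / ventilation
outdoor_ppm = 420        # assumed outdoor CO2 level
gen_m3_per_h = 0.018     # one seated adult exhales roughly 18 L of CO2 per hour
vent_m3_per_h = 36.0     # assumed fresh-air supply (about 10 L/s)

indoor_ppm = outdoor_ppm + gen_m3_per_h / vent_m3_per_h * 1_000_000
print(round(indoor_ppm))  # -> 920
```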
> For some problems it's just very hard to express them in a googleable form
I had an issue where my Mac would report that my tethered iPhone's batteries were running low when the battery was in fact fine. I had tried googling an answer, and found many similar-but-not-quite-the-same questions and answers. None of the suggestions fixed the issue.
I then asked the 'MacOS Guru' model for chatGPT my question, and one of the suggestions worked. I feel like I learned something about chatGPT vs Google from this - the ability of an LLM to match my 'plain English question without a precise match for the technical terms' is obviously superior to a search engine. I think google etc try synonyms for words in the query, but to me it's clear this isn't enough.
Google isn't the same for everyone. Your results could be very different from mine. They're probably not quite the same as months ago either.
I may also have accidentally made it harder by using the wrong word somewhere. A good part of the difficulty of googling for a vague problem is figuring out how to even word it properly.
Also of course it's much easier now that I tracked down what the actual problem was and can express it better. I'm pretty sure I wasn't googling for "devices not initializing" at the time.
But this is where I think LLMs offer a genuine improvement -- being able to deal with vagueness better. Google works best if you know the right words, and sometimes you don't.
This morning I was using an LLM to develop some SQL queries against a database it had never seen before. I gave it a starting point, and outlined what I wanted to do. It proposed a solution, which was a bit wrong, mostly because I hadn't given it the full schema to work with. Small nudges and corrections, and we had something that worked. From there, I iterated and added more features to the outputs.
At many points, the code would have an error; to deal with this, I just supply the error message, as-is to the LLM, and it proposes a fix. Sometimes the fix works, and sometimes I have to intervene to push the fix in the right direction. It's OK - the whole process took a couple hours, and probably would have been a whole day if I were doing it on my own, since I usually only need to remember anything about SQL syntax once every year or three.
A key part of the workflow, imo, was that we were working in the medium of the actual code. If the code is broken, we get an error, and can iterate. Asking for opinions doesn't really help...
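The loop, as a toy sketch (sqlite here, with a table I've invented for illustration; the real thing was against a production schema):

```python
import sqlite3

# Toy stand-in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL, placed_at TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 9.5, '2024-01-02')")

query = """
    SELECT strftime('%Y-%m', placed_at) AS month, SUM(total)
    FROM orders
    GROUP BY month
"""
try:
    print(conn.execute(query).fetchall())  # -> [('2024-01', 9.5)]
except sqlite3.OperationalError as err:
    # On failure, this message is exactly what gets pasted back to the LLM.
    print("feed back to the LLM:", err)
```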
I often wonder if people who report that LLMs are useless for code haven't cracked the fact that you need to have a conversation with it - expecting a perfect result after your first prompt is setting it up for failure, the real test is if you can get to a working solution after iterating with it for a few rounds.
As someone who has finally found a way to increase productivity by adding some AI, my lesson has sort of been the opposite. If the initial response after you've provided the relevant context isn't obviously useful: give up. Maybe start over with slightly different context. A conversation after a bad result won't provide any signal you can do anything with, there is no understanding you can help improve.
It will happily spin forever responding in whatever tone is most directly relevant to your last message: provide an error and it will suggest you change something (it may even be correct every once in a while!), suggest a change and it'll tell you you're obviously right, suggest the opposite and you will be right again, ask if you've hit a dead end and yeah, here's why. You will not learn anything or get anywhere.
A conversation will only be useful if the response you got just needs tweaks. If you can't tell what it needs feel free to let it spin a few times, but expect to be disappointed. Use it for code you can fully test without much effort, actual test code often works well. Then a brief conversation will be useful.
Because once you get good at using LLMs you can write it with 5 rounds with an LLM in way less time than it would have taken you to type out the whole thing yourself, even if you got it exactly right first time coding it by hand.
Most of the code in there is directly copied and pasted in from https://claude.ai or https://chatgpt.com - often using Claude Artifacts to try it out first.
Some changes are made in VS Code using GitHub Copilot
If you do a basic query to GPT-4o every ten seconds it uses a blistering... hundred watts or so. More for long inputs, less when you're not using it that rapidly.
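Back-of-envelope, assuming the often-cited ~0.3 Wh per GPT-4o query (an external estimate, not an official figure):

```python
wh_per_query = 0.3            # assumed energy per query, in watt-hours
queries_per_hour = 3600 / 10  # one query every ten seconds
avg_draw_watts = wh_per_query * queries_per_hour  # Wh per hour == watts
print(round(avg_draw_watts))  # -> 108
```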
I know. That's why I've consistently said that LLMs give me a 2-5x productivity boost on the portion of my job which involves typing code into a computer... which is only about 10% of what I do. (One recent example: https://simonwillison.net/2024/Sep/10/software-misadventures... )
(I get boosts from LLMs to a bunch of activities too, like researching and planning, but those are less obvious than the coding acceleration.)
> That's why I've consistently said that LLMs give me a 2-5x productivity boost on the portion of my job which involves typing code into a computer... which is only about 10% of what I do
This explains it then. You aren't a software developer
You get a productivity boost from LLMs when writing code because it's not something you actually do very much
That makes sense
I write code for probably between 50-80% of any given week, which is pretty typical for any software dev I've ever worked with at any company I've ever worked at
So we're not really the same. It's no wonder LLMs help you, you code so little that you're constantly rusty
I very much doubt you spend 80% of your working time actively typing code into a computer.
My other activities include:
- Researching code. This is a LOT of my time - reading my own code, reading other code, reading through documentation, searching for useful libraries to use, evaluating if those libraries are any good.
- Exploratory coding in things like Jupyter notebooks, Firefox developer tools etc. I guess you could call this "coding time", but I don't consider it part of that 10% I mentioned earlier.
- Talking to people about the code I'm about to write (or the code I've just written).
- Filing issues, or updating issues with comments.
- Writing documentation for my code.
- Straight up thinking about code. I do a lot of that while walking the dog.
- Staying up-to-date on what's new in my industry.
- Arguing with people about whether or not LLMs are useful on Hacker News.
You must not be learning very many new things then if you can't see a benefit to using an LLM. Sure, for the normal crud day-to-day type stuff, there is no need for an LLM. But when you are thrown into a new project, with new tools, new code, maybe a new language, new libraries, etc., then having an LLM is a huge benefit. In this situation, there is no way that you are going to be faster than an LLM.
Sure, it often spits out incomplete, non-ideal, or plain wrong answers, but that's where having SWE experience comes into play: recognizing when it's wrong.
> But when you are thrown into a new project, with new tools, new code, maybe a new language, new libraries, etc., then having an LLM is a huge benefit. In this situation, there is no way that you are going to be faster than an LLM.
In the middle of this thought, you changed the context from "learning new things" to "not being faster than an LLM"
It's easy to guess why. When you use the LLM you may be productive quicker, but I don't think you can argue that you are really learning anything
But yes, you're right. I don't learn new things from scratch very often, because I'm not changing contexts that frequently.
I want to be someone who had 10 years of experience in my domain, not 1 year of experience repeated 10 times, which means I cannot be starting over with new frameworks, new languages and such over and over
Exactly! I learn all kinds of things besides coding-related things, so I don't see how it's any different. ChatGPT 4o does an especially good job of walking thru the generated code to explain what it is doing. And, you can always ask for further clarification. If a coder is generating code but not learning anything, they are either doing something very mundane or they are being lazy and just copy/pasting without any thought--which is also a little dangerous, honestly.
It really depends on what you're trying to achieve.
I was trying to prototype a system and created a one-pager describing the main features, objectives, and restrictions. This took me about 45 minutes.
Then I feed it into Claude and asked to develop said system. It spent the next 15 minutes outputting file after file.
Then I ran "npm install" followed by "npm run" and got a "fully" (API was mocked) functional, mobile-friendly, and well documented system in just an hour of my time.
It'd have taken me an entire day of work to reach the same point.
Yeah, nah. The endless loop of useless suggestions or "solutions" is very easy to hit and common, at least in my use cases, no matter how much you iterate with it. Iterating gets counter-productive pretty fast, imo. (Using 4o.)
When I use Claude to iterate/troubleshoot I do it in a project and in multiple chats. So if I test something and it throws an error or gives an unexpected result, I'll start a new chat to deal with that problem, correct the code, update that in the project, then go back to my main thread and say "I've updated this" and provide it the file, "now let's do this". When I started doing this it massively reduced the LLM getting lost or going off on weird quests. Iteration in side chats, regroup in the main thread. And then possibly another overarching "this is what I want to achieve" thread where I update it on the progress and ask what we should do next.
I have been thinking about this a lot recently. I have a colleague who simply can’t use LLMs for this reason - he expects them to work like a logical and precise machine, and finds interacting with them frustrating, weird and uncomfortable.
However, he has a very black and white approach to things and he also finds interacting with a lot of humans frustrating, weird and uncomfortable.
The more conversations I see about LLMs the more I’m beginning to feel that “LLM-whispering” is a soft skill that some people find very natural and can excel at, while others find it completely foreign, confusing and frustrating.
It really requires self-discipline to ignore the enthusiasm of the LLM as a signal for whether you are moving in the direction of a solution. I blame myself for lazy prompting, but have a hard time not just jumping in with a quick project, hoping the LLM can get somewhere with it, and not attempt things that are impossible, etc.
> OK - the whole process took a couple hours, and probably would have been a whole day if I were doing it on my own, since I usually only need to remember anything about SQL syntax once every year or three
If you have any reasonable understanding of SQL, I guarantee you could brush up on it and write it yourself in less than a couple of hours unless you're trying to do something very complex
Obviously to a mega super genius like yourself an LLM is useless. But perhaps you can consider that others may actually benefit from LLMs, even if you’re way too talented to ever see a benefit?
You might also consider that you may be over-indexing on your own capabilities rather than evaluating the LLM’s capabilities.
Lets say an llm is only 25% as good as you but is 10% the cost. Surely you’d acknowledge there may be tasks that are better outsourced to the llm than to you, strictly from an ROI perspective?
It seems like your claim is that since you’re better than LLMs, LLMs are useless. But I think you need to consider the broader market for LLMs, even if you aren’t the target customer.
Knowing SQL isn't being a "mega super genius" or "way talented". SQL is flawed, but being hard to learn is not among its flaws. It's designed for untalented COBOL mainframe programmers on the theory that Codd's relational algebra and relational calculus would be too hard for them and prevent the adoption of relational databases.
However, whether SQL is "trivial to write by hand" very much depends on exactly what you are trying to do with it.
Sure, I could do that. But I would learn where to put my join statements relative to the where statements, and then forget it again in a month because I have lots of other things that I actually need to know on a daily basis. I can easily outsource the boilerplate to the LLM and get to a reasonable starting place for free.
Think of it as managing cognitive load. Wandering off to relearn SQL boilerplate is a distraction from my medium-term goal.
edit: I also believe I'm less likely to get a really dumb 'gotcha' if I start from the LLM rather than cobbling together knowledge from some random docs.
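For what it's worth, the specific boilerplate I mean, as a minimal runnable reminder (toy tables, invented for illustration): JOINs live in the FROM clause and come before WHERE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, name TEXT);
    CREATE TABLE orders (user_id INTEGER, total REAL);
    INSERT INTO users VALUES (1, 'ada'), (2, 'bob');
    INSERT INTO orders VALUES (1, 5.0), (1, 7.0), (2, 3.0);
""")

# JOIN goes in the FROM clause, before WHERE -- the ordering I keep forgetting.
rows = conn.execute("""
    SELECT u.name, SUM(o.total)
    FROM users u
    JOIN orders o ON o.user_id = u.id
    WHERE o.total > 4
    GROUP BY u.name
""").fetchall()
print(rows)  # -> [('ada', 12.0)]
```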
If you don’t take care to understand what the LLM outputs, how can you be confident that it works in the general case, edge cases and all? Most of the time that I spend as a software engineer is reasoning about the code and its logic to convince myself it will do the right thing in all states and for all inputs. That’s not something that can be offloaded to an LLM. In the SQL case, that means actually understanding the semantics and nuances of the specific SQL dialect.
That makes sense, and from what I’ve heard this sort of simple quick prototyping is where LLM coding works well. The problem with my case was I’m working with multiple large code bases, and couldn’t pinpoint the problem to a specific line, or even file. So I wasn’t gonna just copy multiple git repos into the chat
(The details: I was working with running a Bayesian sampler across multiple compute nodes with MPI. There seemed to be a pathological interaction between the code and MPI where things looked like they were working, but never actually progressed.)
I wonder if it breaks like this: people who don't know how to code find LLMs very helpful and don't realize where they are wrong. People who do know immediately see all the things they get wrong and they just give up and say "I'll do it myself".
This is exactly my experience, every time! If I offer it the slightest bit of context it will say 'Ah! I understand now! Yes, that is your problem, …' and proceed to spit out some non-existent function, sometimes the same one it has just suggested a few prompts ago which we already decided doesn't exist/work. And it just goes on and on giving me 'solutions' until I finally realise it doesn't have the answer (which it will never admit unless you specifically ask it to – forever looking to please) and give up.
I’ve followed your blog for a while, and I have been meaning to unsubscribe because the deluge of AI content is not what I’m looking for.
I read the linked article when it was posted, and I suspect a few things that are skewing your own view of the general applicability of LLMs for programming. One, your projects are small enough that you can reasonably provide enough context for the language model to be useful. Two, you’re using the most common languages in the training data. Three, because of those factors, you’re willing to put much more work into learning how to use it effectively, since it can actually produce useful content for you.
I think it’s great that it’s a technology you’re passionate about and that it’s useful for you, but my experience is that in the context of working in a large systems codebase with years of history, it’s just not that useful. And that’s okay, it doesn’t have to be all things to all people. But it’s not fair to say that we’re just holding it wrong.
"my experience is that in the context of working in a large systems codebase with years of history, it’s just not that useful."
It's possible that changed this week with Gemini 2.5 Pro, which is equivalent to Claude 3.7 Sonnet in terms of code quality but has a 1 million token context (with excellent scores on long-context benchmarks) and an increased output limit too.
I've been dumping hundreds of thousands of tokens of codebase into it and getting very impressive results.
See this is one of the things that’s frustrating about the whole endeavor. I give it an honest go, it’s not very good, but I’m constantly exhorted to try again because maybe now that Model X 7.5qrz has been released, it’ll be really different this time!
It’s exhausting. At this point I’m mostly just waiting for it to stabilize and plateau, at which point it’ll feel more worth the effort to figure out whether it’s now finally useful for me.
Not going to disagree that it's exhausting! I've been trying to stay on top of new developments for the past 2.5 years and there are so many days when I'll joke "oh, great, it's another two new models day".
Just on Tuesday this week we got the first widely available high quality multi-modal image output model (GPT-4o images) and a new best-overall model (Gemini 2.5) within hours of each other. https://simonwillison.net/2025/Mar/25/
> One, your projects are small enough that you can reasonably provide enough context for the language model to be useful. Two, you’re using the most common languages in the training data. Three, because of those factors, you’re willing to put much more work into learning how to use it effectively, since it can actually produce useful content for you.
Take a look at the 2024 StackOverflow survey.
70% of professional developer respondents had only done extensive work over the last year in a handful of mainstream languages.
LLMs are of course very strong in all of these. 70% of developers only code in languages LLMs are very strong at.
If anything, for the developer population at large, this number is even higher than 70%. The survey respondents are overwhelmingly American (where the dev landscape is more diverse), and self-select to those who use niche stuff and want to let the world know.
A similar argument can be made for median codebase size, in terms of LOC written every year. A few days ago he also gave Gemini Pro 2.5 a whole codebase (at ~300k tokens) and it performed well. Even in huge codebases, if any kind of separation of concerns is involved, that's enough to give it all the context relevant to the part of the code you're working on. [1]
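(Rough conversion, assuming ~10 tokens per line of code, which is a common rule of thumb rather than a fixed ratio:

```python
context_tokens = 300_000
tokens_per_loc = 10   # assumed average; varies by language and code density
print(context_tokens // tokens_per_loc)  # -> 30000 lines, very roughly
```

So a ~300k-token context window fits on the order of a few tens of thousands of LOC.)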
What’s 300k tokens in terms of lines of code? Most codebases I’ve worked on professionally have easily eclipsed 100k lines, not including comments and whitespace.
But really that’s the vision of actual utility that I imagined when this stuff first started coming out and that I’d still love to see: something that integrates with your editor, trains on your giant legacy codebase, and can actually be useful answering questions about it and maybe suggesting code. Seems like we might get there eventually, but I haven’t seen that we’re there yet.
We hit "can actually be useful answering questions about it" within the last ~6 months with the introduction of "reasoning" models with 100,000+ token context limits (and the aforementioned Gemini 1 million/2 million models).
The "reasoning" thing is important because it gives models the ability to follow execution flow and answer complex questions that span many different files and classes. I'm finding it incredible for debugging, eg: https://gist.github.com/simonw/03776d9f80534aa8e5348580dc6a8...
I built a files-to-prompt tool to help dump entire codebases into the larger models and I use it to answer complex questions about code (including other people's projects written in languages I don't know) several times a week. There's a bunch of examples of that here: https://simonwillison.net/search/?q=Files-to-prompt&sort=dat...
After more than a few years working on a codebase? Quite a lot. I know which interfaces I need and from where, what the general areas of the codebase are, and how they fit together, even if I don’t remember every detail of every file.
> But it’s not fair to say that we’re just holding it wrong.
<troll>Have you considered that asking it to solve problems in areas it's bad at solving problems is you holding it wrong?</troll>
But, actually seriously, yeah, I've been massively underwhelmed with the LLM performance I've seen, and just flabbergasted with the subset of programmer/sysadmin coworkers who ask it questions and take those answers as gospel. It's especially frustrating when it's a question about something that I'm very knowledgeable about, and I can't convince them that the answer they got is garbage because they refuse to so much as glance at supporting documentation.
LLMs need to stay bad. What is going to happen if we have another few GPT-3.5 to Gemini 2.5 sized steps? You're telling people who need to keep the juicy SWE gravy train running for another 20 years to recognize that the threat is indeed very real. The writing is on the wall and no one here (here on HN especially) is going to celebrate those pointing to it.
I don't think people really realize the danger of mass unemployment
Go look up what happens in history when tons of people are unemployed at the same time with no hope of getting work. What happens when the unemployed masses become desperate?
Naw I'm sure it will be fine, this time will be different
Just wanted to chime in and say how appreciative I’ve been about all your replies here, and overall content on AI. Your takes are super reasonable and well thought out.
I see people say, "Look how great this is," and show me an example, and the example they show me is just not great. We're literally looking at the same thing, and they're excited that this LLM can do a college grad's job to the level of a third grader, and I'm just not excited about that.
What changed my point of view regarding LLMs was when I realized how crucial context is in increasing output quality.
Treat the AI as a freelancer working on your project. How would you ask a freelancer to create a Kanban system for you? By simply asking "Create a Kanban system", or by providing them a 2-3 pages document describing features, guidelines, restrictions, requirements, dependencies, design ethos, etc?
Which approach will get you closer to your objective?
The same applies to an LLM (when it comes to code generation). When well instructed, it can quickly generate a lot of working code, and apply the necessary fixes/changes you request inside that same context window.
It still can't generate senior-level code, but it saves hours when doing grunt work or prototyping ideas.
"Oh, but the code isn't perfect".
Nor is the code of the average jr dev, but their code still makes it to production in thousands of companies around the world.
They're sophisticated tools, as much as any other software.
About 2 weeks ago I started on a streaming markdown parser for the terminal because none really existed. I've switched to human coding now but the first version was basically all llm prompting and a bunch of the code is still llm generated (maybe 80%). It's a parser, those are hard. There's stacks, states, lookaheads, look behinds, feature flags, color spaces, support for things like links and syntax highlighting... all forward streaming. Not easy
> LLMs are almost perfect for this. It's generally faster than me looking up syntax/documentation, when it's wrong it's easy to tell and correct.
Exactly this.
I once had a function that would generate several .csv reports. I wanted these reports to then be uploaded to s3://my_bucket/reports/{timestamp}/.csv
I asked ChatGPT "Write a function that moves all .csv files in the current directory to an old_reports directory, calls a create_reports function, then uploads all the csv files in the current directory to s3://my_bucket/reports/{timestamp}/.csv with the timestamp in YYYY-MM-DD format"
And it created the code perfectly. I knew what the correct code would look like, I just couldn't be fucked to look up the exact calls to boto3, whether moving files was os.move or os.rename or something from shutil, and the exact way to format a datetime object.
It created the code far faster than I would have.
Like, I certainly wouldn't use it to write a whole app, or even a whole class, but individual blocks like this, it's great.
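A minimal sketch of the kind of function described above (function names and the S3 key layout are assumptions; the uploader is injected as a callable so the actual boto3 `upload_file` call stays out of the testable part):

```python
import datetime
import shutil
from pathlib import Path

def refresh_reports(create_reports, upload, directory="."):
    """Archive existing .csv files, regenerate them, then upload.

    `upload(path, key)` is injected so the real S3 call (e.g.
    boto3's client.upload_file) can be swapped in by the caller.
    """
    cwd = Path(directory)
    archive = cwd / "old_reports"
    archive.mkdir(exist_ok=True)
    # shutil.move works across filesystems, unlike os.rename
    for csv_file in cwd.glob("*.csv"):
        shutil.move(str(csv_file), str(archive / csv_file.name))
    create_reports()
    stamp = datetime.date.today().strftime("%Y-%m-%d")
    for csv_file in cwd.glob("*.csv"):
        upload(csv_file, f"reports/{stamp}/{csv_file.name}")
```

This is exactly the mix of `shutil` vs `os.rename` and `strftime` trivia the parent comment didn't want to look up by hand.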
I have been saying this about llms for a while - if you know what you want, how to ask for it, and what the correct output will look like, LLMs are fantastic (at least Claude Sonnet is). And I mean that seriously, they are a highly effective tool for productive development for senior developers.
I use it to produce whole classes, large sql queries, terraform scripts, etc etc. I then look over that output, iterate on it, adjust it to my needs. It's never exactly right at first, but that's fine - neither is code I write from scratch. It's still a massive time saver.
> they are a highly effective tool for productive development for senior developers
I think this is the most important bit many people miss. It is advertised as an autonomous software developer, or something that can take a junior to senior levels, but that's just advertising.
It is actually most useful for senior developers, as it does the grunt work for them, while grunt work is actually useful work for a junior developer as a learning tool.
Precisely -- you have to be experienced in your field to use these tools effectively.
These are power tools for the mind. We've been working with the equivalent of hand tools, now something new came along. And yeah, a hole hawg will throw you clear off a ladder if you're not careful -- does that mean you're going to bore 6" holes in concrete ceilings by hand? Think not.
> It is advertised as an autonomous software developer
By a few currently niche VC players, I guess. I don't see Anthropic, the overwhelming revenue leader in dollars spent on LLM-related tools for SWE, claiming that.
> I don't see Anthropic, the overwhelming revenue leader in dollars spent on LLM-related tools for SWE, claiming that.
Are you sure about that? [1]:
> "I think we will be there in three to six months, where AI is writing 90% of the code. And then, in 12 months, we may be in a world where AI is writing essentially all of the code," Amodei said at a Council of Foreign Relations event on Monday.
"How to ask for it" is the most important part. As soon as you realize that you have to provide the AI with CONTEXT and clear instructions (you know, like a top-notch story card on a scrum board), the quality and assertiveness of the results increase a LOT.
Yes, it WON'T produce senior-level code for complex tasks, but it's great at tackling junior- to mid-level code generation/refactoring, with minor adjustments (just like a code review).
So, it's basically the same thing as having a freelancer jr dev at your disposal, but it can generate working code in 5 min instead of 5 hours.
I've had so many cases exactly like your example here. If you build up an intuition that knows that e.g. Claude 3.7 Sonnet can write code that uses boto3, and boto3 hasn't had any breaking changes that would affect S3 usage in the past ~24 months, you can jump straight into a prompt for this kind of task.
It doesn't just save me a ton of time, it results in me building automations that I normally wouldn't have taken on at all because the time spent fiddling with os.move/boto3/etc wouldn't have been worthwhile compared to other things on my plate.
I think you have an interesting point of view and I enjoy reading your comments, but it sounds a little absurd and circular to discount people's negativity about LLMs simply because it's their fault for using an LLM for something it's not good at. I don't believe in the strawman characterization of people giving LLMs incredibly complex problems and being unreasonably judgmental about the unsatisfactory results.

I work with LLMs every day. Companies pay me good money to implement reliable solutions that use these models, and it's a struggle. Currently I'm working with Claude 3.5 to analyze customer support chats. Just as many times as it makes impressive, nuanced judgments, it fails to correctly make simple, trivial judgments. Just as many times as it follows my prompt to a tee, it forgets or ignores important parts of my prompt. So the problem for me is that it's incredibly difficult to know when it'll succeed and when it'll fail for a given input.

Am I unreasonable for having these frustrations? Am I unreasonable for doubting the efficacy of LLMs to address problems that many believe are already solved? Can you understand my frustration at seeing people characterize me as such because ChatGPT made a really cool image for them once?
It's a weird circle with these things. If you _can't_ do the task you are using the LLM for, you probably shouldn't.
But if you can do the task well enough to at least recognize likely-to-be-correct output, then you can get a lot done in less time than you would do it without their assistance.
Is that worth the second order effects we're seeing? I'm not convinced, but it's definitely changed the way we do work.
I think this points to much of the disagreement over LLMs. They can be great at one-off scripts and other similar tasks like prototypes. Some folks who do a lot of that kind of work find the tools genuinely amazing. Other software engineers do almost none of that and instead spend their coding time immersed in large messy code bases, with convoluted business logic. Looping an LLM into that kind of work can easily be net negative.
Maybe they are just lazy around tooling. Cursor with Claude works well for project sizes much larger than I expected but it takes a little set up. There is a chasm between engineers who use tools well and who do not.
I don't really agree with framing it as lazy. Adding more tools and steps to your workflow isn't free, and the cost/benefit of each tool will be different for everyone. I've lost count of how many times someone has evangelized a software tool to me, LLM or not. Once in a while they turn out to be useful and I incorporate them into my regular workflow, but far more often I don't. This could be for any number of reasons: the tool doesn't fit my workflow well, I already have a better way of doing whatever it does, or it adds more friction than it removes.
I'm sure spending more time fiddling with the setup of LLM tools can yield better results, but that doesn't mean that it will be worth it for everyone. In my experience LLMs fail often enough at modestly complex problems that they are more hassle than benefit for a lot of the work I do. I'll still use them for simple tasks, like if I need some standard code in a language I'm not too familiar with. At the same time, I'm not at all surprised that others have a different experience and find them useful for larger projects they work on.
I'm tired of people bashing LLMs. AI is so useful in my daily work that I can't understand where these people are coming from. Well, whatever...
As you said, examples where I wouldn't expect LLMs to be good at from people who dismiss the scenarios where LLMs are great at. I don't want to convince anyone, to be honest - I just want to say they are incredibly useful for me and a huge time saver. If people don't want to use LLMs, it's fine for me as I'll have an edge over them in the market. Thanks for the cash, I guess.
I'll give you a simple and silly example which could give you additional ideas. LLMs can be great for checking whether people can understand something.
One day I came up with a joke and wondered whether people would "get it". I told the joke to ChatGPT and asked it to explain it back to me. ChatGPT did a great job and nailed what's supposedly funny about the joke. I used it in an email so I have no idea whether anyone found it funny, but at least I know it wasn't too obscure. If an AI can understand a joke, there's a good chance people will understand it too.
This might not be super useful, but it demonstrates that LLMs aren't only about generating text for copy-and-paste or retrieving information. It's "someone" you can bounce ideas off and ask for opinions, and that's how I use it most frequently.
every time someone brings up "Code that doesn't need to deal with edge cases" I like to point out that such code is not likely to be used for anything that matters
Oh, but it is. I can have code that does something nice to have and need not be 100% correct. For example, I want a background for my playful webpage. Maybe a WebGL shader. It might not be exactly what I asked for, but I can have it up and running in a few minutes. Or some non-critical internal tools - like a scraper for lunch menus from restaurants around the office. Or a simple parking spot sharing app. Or any kind of prototype, which in some companies are being created all the time. There are so many use cases that are forgiving regarding correctness and are much more sensitive to development effort.
There is a cost burden to not being 100% correct when it comes to programming. You simply have chosen to ignore that burden, but it still exists for others. Whether it's for example a percent of your users now getting stalled pages due to the webgl shader, or your lunch scraper ddosing local restaurants. They aren't actually forgiving regarding correctness.
Which is fine for actual testing you're doing internally, since that cost burden is then remedied by you fixing those issues. However, no feature is as free as you're making it sound, not even the "nice to have" additions that seem so insignificant.
I never said it's free. (But aiming for 100% correctness is also very, very expensive.) I'm talking about trading correctness, readability, security, and maybe other qualities for different metrics. What I said is just that not every project that has value should be optimized for the same metrics. Bank or medical software needs to be as close to 100% correct as possible. Some tool I'm creating for my team to simplify a process does not necessarily need to be. I would not mind my WebGL shader possibly causing problems for some users. It would get reported and fixed. Or not. It's my call what I spend my effort on.
Of course the tradeoffs should be well considered. That's why it may get out of hand real bad if software will be created (or vibe coded) by people with little understanding of these metrics and tradeoffs. I'm absolutely not advocating for that.
I’m always amazed in these discussions how many people apparently have jobs doing a bunch of stuff that either doesn’t need to be correct or is simple enough that it doesn’t require any significant amount of external context.
The point is more that everyone seems to acknowledge that a) output is spotty, and b) it’s difficult to provide enough context to work on anything that’s not fairly self-contained. And yet we also constantly have people saying that they’re using AI for some ridiculous percentage of their actual job output. So, I’m just curious how one reconciles those two things.
Either most people’s jobs consist of a lot more small, self-contained mini-projects than my jobs generally have, or people’s jobs are more accepting of incorrect output than I’m used to, or people are overstating their use of the tool.
Automating the easy 80% sounds useful, but in practice I'm not convinced that's all that helpful. Reading and putting together code you didn't write is hard enough to begin with.
The things I'm wary of are pitfalls that are often only in the command/function docs. Kinda like rsync with how it handles terminating slashes at the end of the path. Which is why I always took a moment to read them.
Not GP, but more often than not I reach out to tools I already know (sed,awk,python) or read the docs which don't take that much time if you know how to get to the sections you need.
I write code like that all the time. It's used for very specific use cases, only by myself or something I've also written. It's not exposed to random end users or inputs.
> Also, when she says "none of my students has ever invented references that just don't exist"...all I can say is "press X to doubt"
I’ve never seen it from my students. Why do you think this? It’s trivial to pick a real book/article. No student is generating fake material whole cloth and fake references to match. Even if they could, why would they risk it?
TBD whether that makes the effort to spot-check their references greater (does it actually say what the student - explicitly or implicitly - claims it does?) or lesser (proving the non-existence of an obscure reference is proving a negative).
Look for the ways that AI works, and it can be a powerful tool. Try and figure out where it still fails, and you will see nothing but hype and hot air.
Perfectly put, IMO.
I know arguments from authority aren't primary, but I think this point highlights some important context: Dr. Hossenfelder has gained international renown by publishing clickbait-y YouTube videos that ostensibly debunk scientific and technological advances of all kinds. She's clearly educated and thoughtful (not to mention otherwise gainfully employed), but her whole public persona kinda relies on assuming the exclusively-critical standpoint you mention.
I doubt she necessarily feels indebted to her large audience expecting this take (it's not new...), but that certainly does seem like a hard cognitive habit to break.
More often than not, when I inquire deeper, I find their prompting isn't very good at all.
"Garbage in, garbage out" as the law says.
Of course, it took a lot of trial and error for me to get to my current level of effectiveness with LLMs. It's probably our responsibility to teach these who are willing.
It seems hard to be bullish on LLMs as a generally useful tool if the solution to problems people have is "use trial and error to improve how you write your prompts, no, it's not obvious how to do so, yes, it depends heavily on the exact model you use."
A Mitre Saw is an amazing thing to have in a woodshop, but if you don't learn how to use it you're probably going to cut off a finger.
The problem is that LLMs are power tools that are sold as being so easy to use that you don't need to invest any effort in learning them at all. That's extremely misleading.
> Are legally liable for defects in design or manufacture that cause injury, death, or property damage
Except when you use them for purposes other than those declared by them - then it's on you. Similarly, you get plenty of warnings about the limitations and suitability of LLMs from the major vendors, including warnings directly in the UI. The limitations of LLMs are common knowledge. Like almost everyone, you ignore them, but then the consequences are on you too.
> Provide manuals that instruct the operator how to effectively and safely use the power tool
LLMs come with manuals much, much more extensive than any power tool ever (or at least since 1960s or such, as back then hardware was user-serviceable and manuals weren't just generic boilerplate).
As for:
> Know how they work
That is a real difference between power tool manufacturers and LLM vendors, but if you switch to comparing against the pharmaceutical industry, they don't know how most of their products work either. So that's not a requirement for products to be useful and worth having available.
Using LLMs to write SQL is a fascinating case because there are so many traps you could fall into that aren't really the fault of the LLM.
My favorite example: you ask the LLM for "most recent restaurant opened in California", give it a schema and it tries "select * from restaurants where state = 'California' order by open_date desc" - but that returns 0 results, because it turns out the state column uses two-letter state abbreviations like CA instead.
There are tricks that can help here - I've tried sending the LLM an example row from each table, or you can set up a proper loop where the LLM gets to see the results and iterate on them - but it reflects the fact that interacting with databases can easily go wrong no matter how "smart" the model you are using is.
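The trap is easy to reproduce with a toy table (sketch; the table and values here are invented for illustration):

```python
import sqlite3

# Toy data reproducing the mismatch: the model saw the schema,
# but never saw that `state` holds two-letter abbreviations.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE restaurants (name TEXT, state TEXT, open_date TEXT)")
conn.execute("INSERT INTO restaurants VALUES ('Example Diner', 'CA', '2024-11-02')")

# What the LLM plausibly writes from the schema alone: zero rows, no error.
naive = conn.execute(
    "SELECT * FROM restaurants WHERE state = 'California' ORDER BY open_date DESC"
).fetchall()

# With a sample row included in the prompt, it can produce the working query.
fixed = conn.execute(
    "SELECT * FROM restaurants WHERE state = 'CA' ORDER BY open_date DESC"
).fetchall()
```

The dangerous part is that the naive query is syntactically valid and fails silently, which is exactly why feeding the model example rows helps.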
> that returns 0 results, because it turns out the state column uses two-letter state abbreviations like CA instead.
As you’ve identified, rather than just giving it the schema, you give it the schema and some data when you tell it what you want.
A human might make exactly the same error - based on misassumption - and would then look at the data to see why it was failing.
If we expect the LLM to magically realise that a query you told it should match ‘California’ should actually be based on ‘CA’, rather than what you told it, then that failure is not really the fault of the LLM.
Agreed. If one compares ChatGPT to, say, the Cline IDE plugin backed by Claude 3.7, they might well be blown away by how far behind ChatGPT seems. A lot of the difference has to do with prompting, for sure -- Cline helps there by generating prompts from your IDE and project context automatically.
Every once in a while I send a query off to ChatGPT and I'm often disappointed and jam on the "this was hallucinated" feedback button (or whatever it is called). I have better luck with Claude's chat interface but nowhere near the quality of response that I get with Cline driving.
I want to sit next to you and stop you every time you use your LLM and say, “Let me just carefully check this output.” I bet you wouldn’t like that. But when I want to do high quality work, I MUST take that time and carefully review and test.
What I am seeing is fanboys who offer me examples of things working well that fail any close scrutiny— with the occasional example that comes out actually working well.
I agree that for prototyping unimportant code LLMs do work well. I definitely get to unimportant point B from point A much more quickly when trying to write something unfamiliar.
What's also scary is that we know LLMs do fail, but nobody (even the people who wrote the LLM) can tell you how often it will fail at any particular task. Not even an order of magnitude. Will it fail 0.2%, 2%, or 20% of the time? Nobody knows!

A computer that will randomly produce an incorrect result to my calculation is useless to me because now I have to separately validate the correctness of every result. If I need to ask an LLM to explain some fact to me, how do I know if this time it's hallucinating? There is no "LLM just guessed" flag in the output. It might seem "miraculous" to people that it will summarize a random scientific paper down to 5 bullet points, but how do you know if its output is correct? No LLM proponent seems to want to answer this question.
> What's also scary is that we know LLMs do fail, but nobody (even the people who wrote the LLM) can tell you how often it will fail at any particular task. Not even an order of magnitude. Will it fail 0.2%, 2%, or 20% of the time?
Benchmarks could track that too - I don't know if they do, but that information should actually be available and easy to get.
When models are scored on e.g. "pass@10", i.e. at least one of 10 attempts passes the challenge, and the benchmark is rerun periodically, that literally produces the information you're asking for: how frequently a given model fails at a particular task.
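For reference, the standard unbiased pass@k estimator these benchmarks use (introduced with the HumanEval benchmark): given n sampled attempts of which c passed, estimate the probability that at least one of k random samples succeeds.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples passes, estimated
    from n attempts of which c were correct: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so any k samples must include a pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Its complement is exactly the per-task failure rate the parent comment says nobody publishes.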
> A computer that will randomly produce an incorrect result to my calculation is useless to me because now I have to separately validate the correctness of every result.
For many tasks, validating a solution is order of magnitudes easier and cheaper than finding the solution in the first place. For those tasks, LLMs are very useful.
> If I need to ask an LLM to explain to me some fact, how do I know if this time it's hallucinating? There is no "LLM just guessed" flag in the output. It might seem to people to be "miraculous" that it will summarize a random scientific paper down to 5 bullet points, but how do you know if it's output is correct? No LLM proponent seems to want to answer this question.
How can you be sure whether a human you're asking isn't hallucinating/guessing the answer, or straight up bullshitting you? Apply the same approach to LLMs as you apply to navigating this problem with humans - for example, don't ask it to solve high-consequence problems in areas where you can't evaluate proposed solutions quickly.
I think part of it is that, from eons of experience, we have a pretty good handle on what kinds of mistakes humans make and how. If you hire a competent accountant, he might make a mistake like entering an expense under the wrong category. And since he's watching for mistakes like that, he can double-check (and so can you) without literally checking all his work. He's not going to "hallucinate" an expense that you never gave him, or put something in a category he just made up.
I asked Gemini for the lyrics to a song that I knew was on all the lyrics sites. To make a long story short, it gave me the wrong lyrics three times, apparently making up new ones the last two times. Someone here said LLMs may not be allowed to look at those sites for copyright reasons, which is fair enough; but then it should have just said so, not "pretended" it was giving me the right answer.
I have a python script that processes a CSV file every day, using DictReader. This morning it failed, because the people making the CSV changed it to add four extra lines above the header line, so DictReader was getting its headers from the wrong line. I did a search and found the fix on Stack Overflow, no big deal, and it had the upvotes to suggest I could trust the answer. I'm sure an LLM could have told me the answer, but then I would have needed to do the search anyway to confirm it--or simply implemented it, and if it worked, assume it would keep working and not cause other problems.
That was just a two-line fix, easy enough to try out and see if it worked, and guess how it worked. I can't imagine implementing a 100-line fix and assuming the best.
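The two-line fix in question looks roughly like this (a sketch; the junk lines and column names are made up, since the original CSV wasn't shown):

```python
import csv
import io

# A CSV that suddenly grew four junk lines above its real header row.
raw = """generated: daily export
vendor: example
run id: 12345
do not edit
name,amount
widget,3
gadget,7
"""

f = io.StringIO(raw)
# The fix: consume the extra lines so DictReader sees the real header
for _ in range(4):
    next(f)
rows = list(csv.DictReader(f))
```

Trivially verifiable by running it once against the new file, which is the point the parent comment is making.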
It seems to me that some people are saying, "It gives me the right thing X% of the time, which saves me enough developer time (mine or someone else's) that it's worth the other (100-X)% of the time when it gives me garbage that takes extra time to fix." And that may be a fair trade for some folks. I just haven't found situations where it is for me.
Even better than the whole 'unknown, fluctuating, and non-deterministic rates of failure' problem is the whole 'agentic' shtick. People proposing to chain together these fluctuating plausibility engines should study probability theory a bit more deeply to understand just what they are in for with these Rube Goldberg machines of text continuation.
I think it’s very odd that you think that people using LLMs regularly aren’t carefully checking the outputs. Why do you think that people using LLMs don’t care about their work?
> invented references that just don't exist"...all I can say is "press X to doubt
This doesn’t include lying and cheating, which LLMs can’t do.
On the other hand, AI is used to solve problems that are already solved. I recently got an ad for process modeling software where one claim was that you don't always need to start from the ground up, but can tell the AI "give me the customer order process" and start from that point. That is basically what templates are for, with much less energy consumption.
I've noticed there seems to be a gatekeeping archetype that operates as a hard cynic to nearly everything, so that when they finally judge something positively they get heaps of attention.
It doesn't always correlate with narcissism, but it happens much more than chance.
>A lot of what I do is relatively simple one off scripting. Code that doesn't need to deal with edge cases, won't be widely deployed, and whose outputs are very quickly and easily verifiable.
Yes, somewhat. It's good for PowerShell/bash/cmd scripts and configs, but early models especially would hallucinate PowerShell cmdlets.
One thing I think is clear is society is now using a lot of words to describe things when the words being used are completely devoid of the necessary context. It's like calling a powder you've added to water "juice" and also freshly-squeezed fruit just picked perfectly ripe off a tree "juice". A word stretched like that becomes nearly devoid of meaning.
"I write code all day with LLMs, it's amazing!" is in the exact same category. The code you (general you, I'm not picking on you in particular) write using LLMs, and the code I write apart from LLMs: they are not the same. They are categorically different artifacts.
all fun and games until your AI-generated script deletes the production database. I think that's the point: the cost of faults in academic and financial settings is too high for LLMs to be useful
The point is that given the current valuations, being good at a bunch of narrow use cases is just not good enough. It needs to be able to replace humans in every role where the primary output is text or speech to meet expectations.
I don't think that "replacing humans in every role" is the line for "being bullish on AI models". I think they could stop development exactly where they are, and they would still make pretty dramatic improvements to productivity in a lot of places. For me at least, their value already exceeds the $20/month I'm paying, and I'm pretty sure that way more than covers inference costs.
> I think they could stop development exactly where they are, and they would still make pretty dramatic improvements to productivity in a lot of places.
Yup. Not to mention, we don't even have time to figure out how to effectively work with one generation of models before the next generation gets released and raises the bar. If development stopped right now, I'd still expect LLMs to get better for years, as people slowly figure out how to use them well.
Completely agree. As is, Cursor and ChatGPT and even Bing Image Create (for free generation of shoddy ideas, styles, concepts, etc) are very useful to me. In fact, it would suit me if everything stalled at this point rather than improve to the point that everyone can catch up in how they use AI.