Hey, thanks for linking this! I'm a study author, and I greatly appreciate that this author dug into the appendix and provided feedback so that other folks can read it as well.
A few notes if it's helpful:
1. This post is primarily worried about ordering considerations -- I think this is a valid concern. We explicitly call this out in the paper [1] as a factor we can't rule out -- see "Bias from issue completion order (C.2.4)". We have no evidence this occurred, but we also don't have evidence it didn't.
2. "I mean, rather than boring us with these robustness checks, METR could just release a CSV with three columns (developer ID, task condition, time)." Seconded :) We're planning on open-sourcing pretty much this data (and some core analysis code) later this week here: https://github.com/METR/Measuring-Early-2025-AI-on-Exp-OSS-D... - star if you want to dig in when it comes out.
3. As I said in my comment on the post, the takeaway at the end of the post is that "What we can glean from this study is that even expert developers aren’t great at predicting how long tasks will take. And despite the new coding tools being incredibly useful, people are certainly far too optimistic about the dramatic gains in productivity they will bring." I think this is a reasonable takeaway from the study overall. As we say in the "We do not provide evidence that:" section of the paper (Page 17), we don't provide evidence across all developers (or even most developers) -- and ofc, this is just a point-in-time measurement that could totally be different by now (from tooling and model improvements in the past month alone).
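To give a sense of the analysis the point-2 CSV should support once it's out, here's a minimal sketch in Python. The column names (developer_id, condition, time_minutes) and the filename are placeholders rather than the final schema, and this is a simplified stand-in for the actual analysis code we'll release:

    # Rough sketch of the kind of analysis the released CSV should support.
    # Column names and filename are placeholders, not the final schema.
    import numpy as np
    import pandas as pd

    df = pd.read_csv("metr_task_times.csv")
    df["log_time"] = np.log(df["time_minutes"])

    # Point estimate: ratio of geometric-mean completion times (AI vs. no-AI).
    mean_log = df.groupby("condition")["log_time"].mean()
    ratio = np.exp(mean_log["AI"] - mean_log["no-AI"])
    print(f"AI / no-AI geometric-mean time ratio: {ratio:.2f}")  # >1 means slowdown

    # Uncertainty: bootstrap over developers (not rows), since tasks from the
    # same developer aren't independent observations.
    rng = np.random.default_rng(0)
    devs = df["developer_id"].unique()
    ratios = []
    for _ in range(2000):
        sample = rng.choice(devs, size=len(devs), replace=True)
        boot = pd.concat([df[df["developer_id"] == d] for d in sample])
        m = boot.groupby("condition")["log_time"].mean()
        ratios.append(np.exp(m["AI"] - m["no-AI"]))
    lo, hi = np.percentile(ratios, [2.5, 97.5])
    print(f"95% bootstrap CI for the ratio: [{lo:.2f}, {hi:.2f}]")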
Thanks again for linking, and to the original author for their detailed review. It's greatly appreciated!
Thanks for the response, you make some very good points. Sorry, I had missed your response on the original post. I don't know if it wasn't there yet, or if it's because for some reason their blog is configured to only show the first two comments by default. :/ Either way, my bad.
I think my bias as someone who spends too much time looking at social science papers is that the protocol allows for spillover effects that, to me, imply that the results must be interpreted much more cautiously than a lot of people are doing. (And then on top of that I'm trying to be hyper-cautious and skeptical when I see a paper whose conclusions align with my biases on this topic.)
Granted, that sort of thing is my complaint about basically every study on developer productivity when using LLMs that I've seen so far. So I appreciate how difficult this is to study in practice.
Hey, thanks for digging into the details here! Copying a relevant comment (https://news.ycombinator.com/item?id=44523638) from the other thread on the paper, in case it's helpful on this point.
1. Some prior studies that find speedup do so with developers that have similar (or less!) experience with the tools they use. In other words, the "steep learning curve" theory doesn't differentially explain our results vs. other results.
2. Prior to the study, 90+% of developers had reasonable experience prompting LLMs. Before we found slowdown, the only experience-related concern most external reviewers raised was about prompting, as prompting was considered the primary skill. In general, the standard wisdom was/is that Cursor is very easy to pick up if you're used to VSCode, which most developers used prior to the study.
3. Imagine all these developers had a TON of AI experience. One thing this might do is make them worse programmers when not using AI (relatable, at least for me), which in turn would raise the speedup we find (not because AI got better, but because the without-AI baseline got worse). In other words, we're sorta in between a rock and a hard place here -- it's just plain hard to figure out what the right baseline should be!
4. We shared information on developer prior experience with expert forecasters. Even with this information, forecasters were still dramatically over-optimistic about speedup.
5. As you say, it's totally possible that there is a long-tail of skills to using these tools -- things you only pick up and realize after hundreds of hours of usage. Our study doesn't really speak to this. I'd be excited for future literature to explore this more.
In general, these results being surprising makes it easy to read the paper, find one factor that resonates, and conclude "ah, this one factor probably just explains slowdown." My guess: there is no one factor -- there's a bunch of factors that contribute to this result -- at least 5 seem likely, and at least 9 we can't rule out (see the factors table on page 11).
I'll also note that one really important takeaway -- that developer self-reports after using AI are overoptimistic to the point of being on the wrong side of speedup/slowdown -- isn't a function of which tool they use. The need for robust, on-the-ground measurements to accurately judge productivity gains is a key takeaway here for me!
(You can see a lot more detail in section C.2.7 of the paper ("Below-average use of AI tools") -- where we explore the points here in more detail.)
1. That does not support these results in any way
2. Having experience prompting is only a small part of being able to use agentic IDE tools. It's like equating chopping an onion with being a good cook.
I think we should all focus on how the effectiveness is going to change in the long term. We all know AI tooling is not going to disappear; it's only going to get better and better. I wouldn't be afraid to lose some productivity for a few months if it meant acquiring new skills for the future.
Thanks for the feedback! I strongly agree this is not the only measure of developer productivity -- but it's certainly one of them. I think this measure speaks very directly to how _many_ developers (myself included) currently understand the impact of AI tools on their own work (e.g. just speeding up implementation).
(The SPACE [1] framework is a pretty good overview of considerations here; I agree with a lot of it, although I'll note that METR [2] has different motivations for studying developer productivity than Microsoft does.)
Hey HN -- study author here! (See previous thread on the paper here [1].)
I think this blog post is an interesting take on one specific factor that is likely contributing to slowdown. We discuss this in the paper [2] in the section "Implicit repository context (C.1.5)" -- check it out if you want to see some developer quotes about this factor.
> This is why AI coding tools, as they exist today, will generally slow someone down if they know what they are doing, and are working on a project that they understand.
I made this point in the other thread discussing the study, but in general, these results being surprising makes it easy to read the paper, find one factor that resonates, and conclude "ah, this one factor probably just explains slowdown." My guess: there is no one factor -- there's a bunch of factors that contribute to this result -- at least 5 seem likely, and at least 9 we can't rule out (see the full factors table on page 11).
> If there are no takers then I might try experimenting on myself.
This sounds super cool! I'd be very excited to see how you set this up + how it turns out... please do shoot me an email (in the paper) if you do this!
> AI slows down open source developers. Peter Naur can teach us why
Nit: I appreciate how hard it is to write short titles summarizing the paper (the graph title is the best I was able to do after a lot of trying) -- but I might have written this as "Early-2025 AI slows down experienced open-source developers. Peter Naur can give us more context about one specific factor." It's admittedly less of a catchy title, but I think getting the qualifications right is really important!
Thanks again for the sweet write-up! I'll hang around in the comments today as well.
If this makes sense, how is the study able to give a reasonable measure of how long an issue/task should have taken without AI, versus how long it actually took with AI, in order to determine that using AI was slower?
Or is it comparing how long the dev thought it should take with AI vs. how long it actually took, which now includes the dev's guess of how AI impacts their productivity?
When it's hard to estimate how difficult an issue should be to complete, how does the study account for this? What percent speed up or slow down would be noise due to estimates being difficult?
I do appreciate that this stuff is very hard to measure.
An easier way to think about it: imagine you timed how long it took to complete each ticket in your backlog. You also recorded whether you were drunk or not when you worked on each one, with the drunk/sober condition assigned at random. The assumption (null hypothesis) is that being drunk has no effect on ticket completion time.
Using the magic of statistics, if you have completed enough tickets, we can determine whether the null hypothesis holds (at a given level of statistical certainty), and if it doesn't, how large the difference is (with a margin of error).
That's not to say there couldn't be other causes for the difference (if there is one), but that's how science proceeds, generally.
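If it helps to see the machinery, here's a toy version of that drunk/sober setup with simulated numbers (entirely made-up data, and a permutation test standing in for whatever model the study actually fits):

    # Toy illustration of the drunk/sober ticket analogy on simulated data.
    import numpy as np

    rng = np.random.default_rng(42)

    # 60 tickets: condition assigned at random, lognormal completion times,
    # with drunk tickets taking ~30% longer by construction.
    drunk = rng.random(60) < 0.5
    times = rng.lognormal(mean=1.0, sigma=0.5, size=60) * np.where(drunk, 1.3, 1.0)

    observed_diff = times[drunk].mean() - times[~drunk].mean()

    # Permutation test: under the null hypothesis the drunk/sober labels are
    # exchangeable, so shuffling them shows how big a gap arises by chance.
    perm_diffs = []
    for _ in range(10_000):
        shuffled = rng.permutation(drunk)
        perm_diffs.append(times[shuffled].mean() - times[~shuffled].mean())
    p_value = np.mean(np.abs(perm_diffs) >= abs(observed_diff))

    print(f"Observed difference: {observed_diff:.2f} hours, p = {p_value:.3f}")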
The challenge with “controlled experiments” is that saying to developers to “use AI for all of your tickets for a month” forces a specific tool onto problems that may not benefit from that tool.
Slowing down isn't necessarily bad, maybe slow programming (literate/Knuth comes to mind as another early argument) encourages better theory formation. Maybe programming today is like fast food, and proper theory and abstraction (and language design) requires a good measure of slow and deliberate work that has not been the norm in industry.
Thanks for the response, and apologies for misrepresenting your results somewhat! I'm probably not going to change the title since I am at heart a polemicist and a sloppy thinker, but I'll update the article to call out this misrepresentation.
That said, I think that what I wrote more or less encompasses three of the factors you call out as being likely to contribute: "High developer familiarity with repositories", "Large and complex repositories", and "Implicit repository context".
I thought more about experimenting on myself, and while I hope to do it, I think it will be very hard to create a controlled environment whilst also responding to the demands the job puts on me. I also don't have the luxury of a list of well-scoped tasks that could feasibly be completed in a few hours.
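If I do end up trying it, the mechanics would probably be no fancier than flipping a coin per task and logging the time. A rough sketch of what I mean (the filename and example task are made up):

    # Minimal self-experiment harness: randomly assign each task to an AI or
    # no-AI condition and append timings to a CSV for later comparison.
    import csv
    import random
    import time
    from datetime import datetime

    LOG_FILE = "self_experiment_log.csv"

    def start_task(task_name: str) -> dict:
        """Flip a coin for the condition and record the start time."""
        condition = random.choice(["AI", "no-AI"])
        print(f"Task '{task_name}': work {condition} this time.")
        return {"task": task_name, "condition": condition, "start": time.time()}

    def finish_task(entry: dict) -> None:
        """Append elapsed minutes for the task to the log."""
        minutes = (time.time() - entry["start"]) / 60
        with open(LOG_FILE, "a", newline="") as f:
            csv.writer(f).writerow([datetime.now().isoformat(), entry["task"],
                                    entry["condition"], f"{minutes:.1f}"])

    # Usage:
    # entry = start_task("fix flaky integration test")
    # ... do the work ...
    # finish_task(entry)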
I would expect any change to an optimized workflow (developing own well understood project) to initially be slower. What I'd like to see is how these same developers do 6 months or a year from now after using AI has become the natural workflow on these same projects. The article mentions that these results don't extrapolate to other devs, but it's important to note that it may not extrapolate over time to these same devs.
I myself am just getting started, and I can see how so many things can be scripted with AI that would be very difficult to (semi-)automate without it. You gotta ask yourself "Is it worth the time?"[0]
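For anyone who hasn't seen the chart behind that question, it's essentially one multiplication: time saved per use, times how often you use it, over some horizon (the five-year horizon and the numbers below are just illustrative):

    # Back-of-envelope "Is it worth the time?" calculation.
    def break_even_hours(seconds_saved_per_use: float,
                         uses_per_day: float,
                         horizon_days: float = 5 * 365) -> float:
        """Hours you could spend automating before it stops paying off."""
        return seconds_saved_per_use * uses_per_day * horizon_days / 3600

    # e.g. a task that AI scripting shaves 30 seconds off, done 5 times a day:
    print(f"Break-even: {break_even_hours(30, 5):.0f} hours")  # ~76 hours over 5 years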
> Early-2025 AI slows down experienced open-source developers.
Even that's too general, because it'll depend on what the task is. It's not as if open source developers in general never work on tasks where AI could save time.
We call this over-generalization out specifically in the "We do not provide evidence that:" table in the blog post and paper - I agree there are tasks these developers are likely sped up on with early-2025 tools.
I think this will be the key. Finding appropriate tasks. Even on code bases I know, I can find tedious things for the AI to do. Sometimes I can find tedious things for it to do that I would never have dreamt of doing in the past. Now, I think “will it do it?”.
Once I got the hang of identifying problems, or being more targeted, I was spending less time messing about and got things done quicker.
Honestly, this is a fair point -- and speaks to the difficulty of figuring out the right baseline to measure against here!
If we studied folks with _no_ AI experience, then we might underestimate speedup, as these folks are learning tools (see a discussion of learning effects in section (C.2.7) - Below-average use of AI tools - in the paper). If we studied folks with _only_ AI experience, then we might overestimate speedup, as perhaps these folks can't really program without AI at all.
In some sense, these are just two separate and interesting questions - I'm excited for future work to really dig in on both!
We explore this factor in section (C.2.5) - "Trading speed for ease" - in the paper [1]. It's labeled as a factor with an unclear effect: some developers seem to think so, and others don't!
> like the developers deliberately picked "easy" tasks that they already knew how to do
We explore this factor in (C.2.2) - "Unrepresentative task distribution." I think the effect here is unclear; these are certainly real tasks, but they are sampled from the smaller end of tasks developers would work on. I think the relative effect on AI vs. human performance is not super clear...
Yeah, I'll note that this study does _not_ capture the entire OS dev workflow -- you're totally right that reviewing PRs is a big portion of the time that many maintainers spend on their projects (and thanks to them for doing this [often hard] work). In the paper [1], we explore this factor in more detail -- see section (C.2.2) - Unrepresentative task distribution.
There's some existing lit about increased contributions to OS repositories after the introduction of AI -- I've also personally heard a few anecdotes about an increase in the number of low-quality PRs from first-time contributors, seemingly as a result of AI making it easier to get started -- ofc, the tradeoff is that making it easier to get started has pros too!
[1] https://metr.org/Early_2025_AI_Experienced_OS_Devs_Study.pdf