
This is a really nice study! It's very cool that they were able to get professional programmers to participate; that's something that is really hard to set up as an academic team. And yes, 47 participants is a small number, but apparently sufficient in this case to detect the effect (as evidenced by the p-values). It also lines up nicely with work we did last year, which looked at the security of Copilot's completions without any humans in the loop [1] and found that roughly 40% of its completions in security-sensitive scenarios were vulnerable.

One thing I'm not sure of is how well the setup reflects how people actually use IDE-integrated tools like Copilot. In the experiment, users had to explicitly ask the assistant for answers rather than getting them as inline completions (see Figure 7(b) in the appendix). I don't know whether this would change the results; I could see it going either way: inline completions appear automatically, so they might be accepted automatically, but on the other hand programmers are used to inline completions being wrong and might be more likely to reject or repair them. It also means it was up to each user to figure out how to prompt the AI, so the results depend a lot on how users chose to phrase their prompts and how much context they provided.

Full disclosure: I'm one of the authors of a very similar study [2] that didn't find any large effects on security :) The main differences were:

- We only looked at C, rather than multiple languages. C is notoriously hard to write secure code in (see the short example after this list), so the base rate of vulnerabilities is likely to be higher. It's worth noting that the Stanford study also didn't find a statistically significant difference in security for C.

- Our study environment was designed to mimic Copilot much more closely: we had participants use VSCode with a plugin providing inline completions from OpenAI's Codex API. This is also why we used the Cushman model rather than DaVinci: Cushman's latency is much lower, which matters for real-time use. It looks like GitHub made the same decision, since reverse engineering of the Copilot plugin indicates it also uses Cushman [3].

- We had participants code up a full library with 11 different functions, rather than a set of smaller independent tasks. This meant the AI model had more context to work with, and it may also have affected how users approached the problem.

- We unfortunately only managed to recruit undergraduate and graduate students as participants, so the baseline skill and experience level of our user population may have been lower.
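To make the "C is hard to write securely" point concrete, here is a small illustrative sketch (not taken from either study) of the kind of memory-safety bug that both humans and code models produce readily in C; the fixed-size buffer with an unbounded copy is a classic CWE-120 / CWE-787 pattern, and it's easy to accept because it compiles and works on short inputs.

    #include <stdio.h>
    #include <string.h>

    /* Illustrative only, not code from either study.
       Overflows the stack buffer for any input of 64 bytes or more. */
    void greet_user(const char *input) {
        char buf[64];
        strcpy(buf, input);            /* no bounds check: CWE-120 / CWE-787 */
        printf("hello, %s\n", buf);
    }

    /* A safer variant bounds the copy and guarantees NUL termination. */
    void greet_user_safe(const char *input) {
        char buf[64];
        snprintf(buf, sizeof buf, "%s", input);
        printf("hello, %s\n", buf);
    }

In languages with bounds checking the equivalent mistake is harder to make silently, which is part of why a C-only study starts from a higher vulnerability base rate.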

Overall I think it's clear that these models are prone to spitting out insecure code right now, and this is an important problem to fix (and one we're working on)! But it's still not clear to me what effect this actually has on the security of code written by programmers using tools like Copilot, and more research is needed to figure that out.

[1] https://arxiv.org/abs/2108.09293

[2] https://arxiv.org/abs/2208.09727

[3] https://thakkarparth007.github.io/copilot-explorer/posts/cop...



> users had to explicitly ask the assistant for answers

Renders the study useless IMO.



