I worry our Copilot is leaving some passengers behind (joshcollinsworth.com)
249 points by headalgorithm 11 months ago | 137 comments



I think the problem is that any tool like this (even one theoretically much more powerful) is most beneficial to those who need it the least and least beneficial to those who need it the most. If you're an expert, you can identify the mistakes, and they are not generally a roadblock. But if you're a novice, you can't, and you'll simply be unaware of any hallucinations. The benefit SO has over this is just the extra friction of needing to copy-paste or retype, because it slows you down and forces an opportunity to think.

My worry is that we become too reliant on tools and outsource our thinking to them before they are ready to take on that task. This will only accelerate the enshittification of the things we already have. More apps that use far too many resources. Things that are security nightmares. Interfaces with more friction. All of it.

The problem is the Pareto principle: 80% of your code is written in 20% of your time, but 80% of your time is required for 20% of your code. The devil is in the details. So even a 95% or 99% accurate code generator is going to make for hard work. That's 1 in every hundred lines of code. I hope the compiler people are writing good error messages.


> 80% of your code is written in 20% of your time but 80% of your time is required for 20% of your code

This is not correct. It is well known that the first 90% of the code takes the first 90% of the time, and the remaining 10% of the code takes the other 90% of the time. [0]

[0] https://en.wikipedia.org/wiki/Ninety%E2%80%93ninety_rule


Good catch, I fell victim to one of the classic 4 problems in CS: off by integer overflow, segfault. I'm sorry I can't complete this task, I'm a model trained by OpenAI and my knowledge cutoff date is Sept 2021.


This is a bit suspect... you're not patronizing enough to have been trained by OpenAI!


I always wondered if that meant the first 90% takes 50% of the time or 10% of the time.

You could read it as the last 10% taking another 90%, the same as the first 90%, so the total work was actually 180% and the first part took 50% of the total.

Or you could read it as the last 10% genuinely taking 90% of the time, so the first 90% of the code took only 10%.

That's a factor of 5 difference between the two interpretations, so it really matters a ton.


When you think you're almost done, you're halfway there. However long you expect X to take, you should double that.

Hofstadter's Law: it always takes longer than you expect, even when you take into account Hofstadter's Law.


It makes a lot more sense when you understand two things:

1) Untrained software engineers' unconscious inclination is to estimate the amount of time after which it's probably time to reassess, rather than the time to actually complete.

2) The distribution of actual completion times is not normal with respect to estimates; it is log-normal: there's a really, really long tail.


Hofstadter's Law implies that every task takes an infinite amount of time.


Software is a kind of art. It's never finished, but abandoned. You can continue to polish the code without adding functions (or even fixing bugs), so yes, it can take forever.

OTOH, when you give that ~180%, you reach a maturity level most of your users perceive as "done", and that point is what most developers are after.

Then, there are passion projects, which go on for 30+ years (Vim, BBEdit*, etc.), where people work on them because they love the project and they're able to.

*: BBEdit is closed source/commercial software, but it sells because it's well crafted. It's not crafted to sell well.


> I always wondered if that meant the first 90% takes 50% of the time or 10% of the time.

Essentially the latter; what you thought would be 90% of the time ends up being just the first 10%. And the rest is spent on the 10% of the code, plus everything else that is part of the project besides the code: reviews, testing, revisions, deployment issues, etc.


This is one of the golden rules of software development, yet many people still think this one-liner is meant to be just for fun.


I think a better definition of a cliché ends up being "something everyone can recite but no one really knows." I think it tells us something about intelligence too, because you can know things, but that doesn't mean the information is actually useful.


In practice they both turn out to take 90%


But the total is 180%. This is why developers multiply their estimates by 2, by default.


> The benefit SO has over this is just the extra friction of needing to copy paste or retype because it slows you down and forces an opportunity to think.

The benefit of SO is the opportunity for both learning and doubt: There are multiple answers and comments on answers. It's not at all a site with nothing but code blocks to copy.

Certainly some people copy-paste without reading any more, but you don't have to use it that way if you want to learn. LLMs treat every user closer to the laziest user, which seems like a problem.


The key issue is not the power of the tool but the tool powerfully amplifying practices that ought to be resisted but exist in the majority of the code in the wild.


The problem with move fast and break things is that at some point you need to slow down and fix things. But we've developed systems that incentivize never stopping and so just enshittify everything. You win not by having no shit, but by being ankle-deep in shit rather than waist-deep. By being less shitty.


Preach. The hardest problems in computer science may be cache invalidation and naming things, but the hardest problem in modern application development is navigating the ocean of enshittification caused by short-term thinking and a socioeconomic backdrop that empowers non-technical managers and commoditizes engineers.

My goal is not release cadence. My goal is to be able to write "this repository is stable, secure, optimized, and feature-complete" in every project readme.

We should do this for ourselves, and for the future. We could build a world of stable, feature-complete ecosystems that move (voluntarily) together on a quarterly release cycle. We could focus on writing nearly perfect software with a long shelf life.

I take a tremendous amount of inspiration from the Jump team building Firedancer, though my understanding of their work barely qualifies as surface-level. What a public demonstration of software engineering excellence while doing cutting-edge work.

I also think younger engineers are being brainwashed by modern engineering culture. I am fortunate to have a mentor who had a career before Agile and worked in zero-bugs-tolerated environments. I realize this level of quality is not always realistic or optimal, but I suspect many younger web engineers just assume Agile is the best way. I did.

Younger engineers: agile has merits, but it has become the mechanism that managers (a) use to keep you keyed-up and short-term focused, and (b) deal with the fact that neither they nor their clients know what they are doing. Find the people who can rebuild the Information Age from scratch, and listen to them.


There are definitely systems out there that can get away with being built primarily through kanban boards and fungible coders of average skill. The industry has decided that this is the default "best practices" for keeping a team productive. I'd prefer to never work on such a team myself ever again.

The problem is when project leads assume that their project should also be built like that, when actually it's mission-critical infrastructure that has to be architected up-front for scale, stability, security, graceful degradation, testability, provability, etc. and if you tried to chart that in story points and sprints it would be a comical farce. So they just don't do that, they try to wing it like it's a much simpler and lower-stakes project.

Now not only is it a total writeoff technically, but the kind of people who build projects this way are also sold on the idea that you never rewrite a project (thanks, Joel), so they're stuck with it forever. Major refactoring will never be a user story on the board, and even if it were, it would be hard to do safely without better testability, and testability can't be improved without that refactoring, and so on; the project has passed the event horizon of crushing failure from which no light can escape.

I have seen FAANG teams build world-scale mission-critical production infrastructure this way, including by people who claimed to be oldschool and claimed to know better, yet their names are in the git blame clear as day. I take nothing for granted any more.


In my experience, it’s the opposite. Asking GPT4 for help is most helpful when I don’t know how to do something. Once I know what I’m doing, the mistakes become more obvious and annoying. I’ve learned something, but the chatbot makes the same mistakes as before, and it will keep making them.

Ironically, it’s because people can learn and chatbots don’t. (In the short term, that is; new releases will be better.)


The very article you comment on shows an example where copilot teaches you the wrong thing. That's why you shouldn't learn with AI. AI is very useful as autocompletion, search engine, bug detector... Imagine, very pessimistically, that AI hates you and wants to sabotage you; are there still some use cases, where such an entity could help you? If so, that's the workflow where you can safely use an incompetent (rather than malevolent) AI as well.

Let's say there's some not very readable code. You don't know what it does. You ask the AI what it does. It tells you. Now you read the code again and it all makes sense, and based on your experience with coding you are certain the AI hasn't misled you into misinterpreting the code. That's one example of safe usage of AI.


> My worry is that we become too reliant on tools and outsource our thinking to them before they are ready to take on that task.

Industry greats like Spolsky have been beating this drum for decades [0] with no success. Those with natural curiosity will gravitate towards understanding the low-level mechanisms of things, just as they always have. Others won’t.

[0] https://www.joelonsoftware.com/2001/12/11/back-to-basics/


The most vocal anti-Copilot person on my team is by far the best coder among us. His reasoning is that typing and coding is not his bottleneck; it's process and politics within the organization.

I tend to agree. I just got access to copilot within the org last week, and I haven’t had a chance to use it yet due to a bunch of process and politics I’ve been dealing with. When I get back to coding, I’m not sure how helpful it will actually be. Time will tell, but I’ve never felt like I needed any more help than a search engine gives me. Usually what happens is Stack Overflow gets me in the ballpark and I need to use that info to go back to the actual documentation to get what I really need. I see copilot working in much the same way. The problems I’m trying to solve are always how to integrate with internal systems and existing other code, which I don’t see copilot helping with. It will lack the context.

A lot of people on my team are supposed to train up on what I do so we can get more people involved and I have a feeling they are overestimating how much copilot will help, and I anticipate we’ll need to do extremely detailed code reviews when copilot starts getting involved. Better code reviews aren’t a bad thing, but it will be one more thing on my plate, as not many will be qualified.


> My worry is that we become too reliant on tools and outsource our thinking to them before they are ready to take on that task.

Personally I have never tried any of the AI assistants, but I have noticed a large uptick in developers attempting to secretly use them in remote coding interviews. I'm curious how the larger companies are dealing with this.


I see it fairly often when doing code review, because sometimes a line of code or a function stands out that just doesn’t seem in line with the rest of the PR. So I add a comment like “what is this doing exactly?” because it’s usually something that’s difficult to understand, and the answer is usually “It’s what GPT/Copilot suggested shrug”. It’s not really something I approve of because it’s actively defying codebase standards that are intended to help the team. At least make the effort to clean it up so it meets basic expectations.

I imagine it’s quite easy to ask the same question during a code test because you shouldn’t have to stop and think about code you consciously wrote, and you wouldn’t have to wait for GPT to feed you an answer.


Hopefully by less lazy interviewing tactics and trying to hire via nuanced understanding of candidates instead of hackable metrics like memorizing leet code.

The traditional engineering interview is more fuzzy and is basically an engineer asking you how you'd solve a problem they are currently working on or recently did. The interest is to see how you think and problem-solve. It's inherently unmeasurable, but I think it is better than using a metric that ends up not meaning much. If it is explicit fuzziness vs. implicit, I'll choose explicit every time, because it is far harder to trick myself into thinking I'm doing the right thing when I'm not.


> attempting to secretly use them in remote coding interviews.

We have from time to time simply asked people to write pseudo-code in something like Etherpad or a Google Doc. I'm sure that you can get an AI to type in your answer, but I feel it's going to be pretty obvious what's happening.


I ask them to share their whole screen.


Yeah I thought about that too. I suppose it could still be a problem if they have a second monitor.

I guess there's the opposite perspective. By not actively trying to prevent it, we can weed out people who would choose to cheat in a remote coding interview. Those same candidates would likely do fine if they were physically unable to cheat, but might ultimately be a net negative for the team.


If you practice an open-book exam you will have to ask much harder questions, and the actual exam becomes fishing for ChatGPT's mistakes. This lacks repeatability because you don't know if and how it's going to hallucinate on any given day. And the level of questions you'd need to ask would be beyond many candidates. In a closed-book setting I can ask candidates to implement a basic dynamic data structure and get all the signal I need.


I think the signal there is going to be how developers perform with assistance. The goal of the software is to solve the problem, after all. If they do it faster and better than everyone not using it, well, I guess we’ve figured out who to hire.


Enshittification doesn't happen because of the tools; it happens because the market will bear it.

"More apps that use far too many resources", "Interfaces with more friction" Let's extend this to just performance/latency/ux at large for web based tools/sites/resources. The market has shown in some cases people will tolerate a lot of this, like phone support, can take >30 mins, but in other cases/scales, like search, every millisecond matters. The garbage apps I encounter like this now I feel are on the wrong side of this line are generally are enterprise apps that deal with payroll, HR training, etc. These apps are allowed to be bad because their users are generally captive/don't choose the apps and their developers aren't likely to care much because nobody is passionate about making sexual harassment quizzes. I actually like the chances of a 2-3 person team using LLM coding tools being able to upset these entrenched garbage piles. The likelihood of a tool/site with good UX and performance now degrading because junior engineers are using LLM code seems to be about zero, if you've built these tools you know how hard it is to drive the culture/ethos on shipping the code that powers these projects before copilot/chatGPT was around, and that isn't going to change. So ultimately, I think

-Garbage apps that exist now will become slightly worse -The chances of slightly better, cheaper apps replacing those apps will grow -Good apps that exist now won't regress


Because of the tools? No, of course not. Do the tools make enshittification easier, and does the current incentive structure create an environment where I expect these types of tools to accelerate enshittification? Certainly. These are two very different things. I hope we can understand the difference, because these types of details are important to preventing enshittification.


I am pretty sure enshittification is a specific process with a specific meaning; what you're talking about, quality going down for whatever reason, you can just call "going to shit."

Let's not lose the useful concept enshittification is a pointer to by overloading the word :)


The inevitable enshittification of enshittification.


I don't think I'm really pushing the bounds here. The fast pace does help make things sticky. We love shiny new features, even if it is just a polished turd. My worry is about more polished turds, which I think is pretty in line with enshittification, since the de facto tech relies on network effects. But I guess the thought is more general.

Words shift meanings, and once you have coined something you lose control over it. Bittersweet.


> most beneficial to those that need it the least and least beneficial to those that need it the most

along those same lines, as someone who is rabidly anti-AI for coding, I've come up with a couple of standards. If you can't explain to an expert the code you have produced with the use of "AI" (read: LLM), and/or you haven't tested it thoroughly, then I don't want you using Copilot or any other automated tool.


> hallucinations

There's got to be a better term than this.


I wonder if these issues will still exist in 5 years. The power of NLP models has improved by so much these past 5 years, it's really insane.


Human advancement is tied to tool use. People said the same things about calculators, computers, math engines and solvers. Tools can only make us smarter, able to tackle big challenges.

All of this holds for good tools though, where a good tool is one that helps and then gets out of the way. Copilot isn't there yet, but over time, a successive version will get there.


If you think this is what I'm saying then you've gravely misunderstood.


"Once men handed their thinking over to machines in the hopes that this would set them free..."


Copilot is a competent coder. If I tell it to generate a function with certain parameters, a class that follows a Gang of Four pattern, mass rename variables, or refactor loops into maps, then it does a pretty good job.

Copilot is a bad engineer. If I tell it to build something, unless it's exceedingly simple, it usually fails. Its ability to create something seems correlated with how many 5-minute tutorials for that thing exist on the internet. Which, given its training, makes perfect sense. So if I think I could find an answer on Stack Overflow, then I will just ask Copilot instead.

10 years ago everyone was afraid of the Stack Overflow developer; now it's the GPT developer. I think it's a combination of actual worry and hurt pride that your job can be accomplished by someone copy/pasting. But as usual, good engineers will learn to think for themselves when leveraging tools. And a great deal of the code produced works but is bad by some arbitrary metric.

I think the footer example is hilarious, because it's exactly in line with web development trends of the last decade. Why use native elements when I can script my own behavior in JavaScript on a div? And in a rush to "not use tables for formatting," I bet there are some 25-nested-div websites out there. Even on Google sites, I have seen grids built using absolutely positioned boxes with JavaScript layout logic. The web is a wild place once you start looking past the tutorials and best practices.


> Even on Google sites, I have seen grids built using absolutely positioned boxes with JavaScript layout logic

Possibly a side-effect of the framework used. Some frameworks used HTML like a CANVAS and drew/laid out everything using absolute positioning. Ugggh.


Ultimately these are the kinds of things programmers care about. Code debt is real, and anyone with any experience of having to pay that debt off usually learns their lesson and does a better job on the next try.


The vacuum cleaner analogy didn't land for me. I've bought that vacuum cleaner, and I didn't return it.

I'm referring to one of the countless models of robot vacuum, of course. They clean the floor, most of it, most of the time, but they miss spots, they get stuck on things, and they don't have the suction of a full-size vacuum.

I wish none of those things were true, but it saves me labor nonetheless, so I kept it. I can detail corners and pull the thing off the corner of the rug, and still get a mostly-clean floor, automatically. Sure, it doesn't get all the schmutz out of carpets, but it gets enough that I can go over them monthly instead of weekly.

Yes, I'm talking about LLM code assistants. They have embarrassing failure modes, but experienced developers get a sense of what they can and can't do, and the result is something which saves time. I've found they're particularly good at "dumb debugging", where there's some fat-fingered error in the code and I can't spot it just by looking. I can copypasta the function into ChatGPT in seconds, and it gives a step-by-step description of what the code does, which routinely points out exactly where the bug is.

I have my concerns about what these tools will do to the up-and-coming generation of developers; it's easy to imagine them as a crutch, training wheels that never come off. But that's a separate matter, and I trust that the more natively talented juniors will recognize the hazard there, and understand that a chatbot can't substitute for becoming a skilled programmer.


> but it saves me labor nonetheless, so I kept it.

Yep. Trading away accessibility and usefulness for the people who disable JavaScript in exchange for 20% more productivity is a bargain. Most companies make that trade for far less every day.

I've never worked in a place that gave much thought to those. At most there was a contracted dev in some low-cost country to slap ARIA tags around.


I use them quite a bit: write these tests for this function in format X, transform this struct with lots of fields in manner Y, or just good old rubber-ducking about a problem I'm having trouble debugging.


Copilot is a decent tool for experienced developers (though it hasn't replaced Google-fu by any stretch) and a trap for inexperienced ones. Sure, it may be able to speed things up in the beginning, but it's a crutch that undermines long-term sustainability in the industry. You inevitably have to understand the paradigms and patterns that LLMs regurgitate; taking them at face value (which I suspect is what most LLM users do) is a recipe for disaster and unfounded confidence.


I work with a guy who is absolutely dedicated to using LLMs to generate C++ code. If I ask him for a specific small thing I'll get back a PR with hundreds of lines of irrelevant crap and when I ask why it has this move constructor or whatever, they won't have a good reason. Even though my colleague is an industry veteran, their new habit has made it feel like they are delegating all their work to the stupidest teammate I've ever had.

I feel like we are going to need to work out some norms and customs in this industry for using code-generating systems in a way that respects the time and attention of coworkers.


This. Good code is clean and has a well thought through internal architecture. LLM-ifying the code and treating it as a black box (if it passes the tests, it is acceptable) is tempting, but it works until it does not and the "does not" might come pretty quickly: once a human cannot easily untangle the logic the only fix is a rewrite.

I think there is a way to extend the useful life of such an approach by setting up a good architecture with lean, strict interfaces and thorough tests. Then one can treat any module that is compliant as a black box and give a computer the power to insert as much crap as it can generate. You should then be ready and willing to rewrite any box that has become so convoluted that the LLM can no longer fix it, likely by splitting it into smaller externally observable and testable elements.
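
Roughly what I mean by lean interfaces plus thorough tests, as a made-up sketch (the names and the kotlin.test setup are only illustrative, not anything from the article):

  // A lean interface plus a contract test that any implementation, whether
  // hand-written or LLM-generated, must pass before it's treated as a black box.
  import kotlin.test.Test
  import kotlin.test.assertEquals

  interface SlugGenerator {
      fun slugFor(title: String): String
  }

  abstract class SlugGeneratorContract {
      abstract fun subject(): SlugGenerator

      @Test
      fun lowercasesAndHyphenates() =
          assertEquals("hello-world", subject().slugFor("Hello World"))

      @Test
      fun stripsPunctuation() =
          assertEquals("whats-new", subject().slugFor("What's new?"))
  }

Any module that passes its contract can be swapped or regenerated freely; one that can't be made to pass gets split up and rewritten.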

I doubt that this is a long-term viable approach, but this is just a personal hunch. It would be interesting to see how such approaches develop. My 2c.


Request changes on the PR with the exact same reasoning you would use with any other developer who works like that?


Offloading all of the actual code reasoning onto your team because you cannot be bothered to write the code yourself and are trusting an LLM should get you fired on the spot.

I cannot imagine a worse teammate or a worse developer.


I had one such coworker until recently, and he was actually fired because nobody on the team felt he was pulling his own weight.

He produced massive amounts of code which did not fit the style of the codebase at all, and when questioned point-blank if it was LLM-generated he denied it (even though it was undeniable).

I'm all for using tools to boost your productivity, but IMO when you offload generated junk to be reviewed by your team it's a sign of disrespect.


> absolutely dedicated

Hard to reason with developers like this, especially if they're more senior than you.


It really doesn't matter how many years they have been working, how old they are, or how long they have been at the organization. We should all agree that someone who offloads the error correction of LLMs onto their teammates isn't someone who's really "senior".


Then don’t approve their PR since it’s unreviewable in its current form.


Ask for the PR to be reviewed by an LLM. Enjoy your new life with lots of free time.


until you get tasked with fixing the buggy code


Maybe he needs to get a stern talking to by his manager? Has that happened?


Hope the manager won't be like: "According to ChatGPT..."


if he's doing that the company might as well save his salary and get an intern with a ChatGPT subscription


One concern that I have is that Copilot is inherently additive in nature. It is unable to suggest that blocks of code be deleted, which creates a bias that adding more code is always the solution, and a lot of code that shouldn't be written ends up in the codebase.

I believe this problem works against less experienced engineers, because more senior engineers are better at recognizing it. In my experience, the most senior engineers respond to it by just turning the tool off.


Would be very cool to be able to highlight sections or even an entire file and then have a right-click option to "Refactor Code" which, rather than being additive, would clean up and condense the code according to the idioms of the language.


  cd /source/commercialOS/
  lintbug -fix **/*.code
  refactor **/*.code
  echo -e "\a"


You can do this with github copilot. It works exactly like you describe.


cody.dev can do that and it's pretty good. Running it on some late night coding sessions and I like the output around 90% of the time


To be fair, this is a common human bias as well. I'm generally the lone voice advocating for solving problems by subtraction rather than addition. Though I will admit, automating this addition, and treating Copilot's suggestion as permission for it, does make the problem worse.


I always turn off AI features in autocomplete. What I need is consistency, not fanciness.

If it's not consistent, then it slows me down.

I think writing code for me is useful; making inconsistent suggestions is not.


I'll use a locally hosted Llama 2 or CodeLlama instance as a 'consultant', via a chat window. These models can be great for that! A well-formulated question often elicits a precise and accurate answer, even from the unspecialized model.

I won't use Copilot or anything else that integrates that tightly into my workflow, even though it is now possible to do so without losing the incremental-cost and customizability benefits of selfhosting.

The context switch is important. To a very good first approximation, our task as engineers is to think before we assume, and I have found Copilot recklessly encourages the latter at the expense of the former.


Personally I think of LLM code helpers as a warning smell.

If I’m working on something where I’m tempted to generate a bunch of boilerplate from a bot that knows very little about the context of the project, am I really spending my time on the right thing? Either I should be working on something higher level, or the amount of boilerplate should be so low that I can write it myself. Anything else suggests that there’s a problem and the LLM bloat band-aid isn’t the solution.


Depends. We have a large CRUD-y service with lots of endpoints that has a lot of boilerplate and not much business logic, but it doesn't change often. It's annoying to add a new endpoint, but it's not common enough that I want someone to spend a month+ refactoring it.


I'm sure you generate all your serialization code by hand. Not to mention the object model. Why let the compiler make a virtual call if you can load the class pointer and look up the right function yourself, AMIRITE?


But that’s the whole point of what I meant by “I should be working on something higher level.”

Nobody (hopefully) uses an LLM to write mounds of x86 assembly code instead of letting the compiler do it. An AI model is a very blunt tool for most jobs. If I need code generation, most often I should write my own tool for it rather than copy-paste reams of barely inspected code from a language model.


Serialization code should be generic; no boilerplate needed. Object models must reflect the problem domain well, something to think carefully about. Current code assistants are only able to do that automatically for very generic classes.


> I'm sure you generate all your serialization code by hand.

Is this sarcasm? Who generates serialization/deserialization by hand now? Even Java has mature annotation libraries so it can be done in a single line. Maybe legacy code, but I'd argue using IDE method generation would be much preferred on a legacy codebase than AI.
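
For instance, with kotlinx.serialization (assuming the compiler plugin is applied; Jackson or Gson annotations work much the same way), it's roughly one annotation and one call:

  import kotlinx.serialization.Serializable
  import kotlinx.serialization.encodeToString
  import kotlinx.serialization.json.Json

  @Serializable
  data class User(val id: Long, val name: String)

  fun main() {
      // No hand-written serialization code anywhere.
      println(Json.encodeToString(User(1, "Ada")))  // {"id":1,"name":"Ada"}
  }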


You can't know that, nor is it a code smell in any way.

The first thing I did with ChatGPT was try to get it to generate a file with all 6502 opcodes defined as an enum. If I'm going to write an NES emulator, I'm going to need that done, and either I do it or ChatGPT does it.

Many parts of programming are mechanical in nature.


I agree with most of the stuff in this article but I'm a bit puzzled by some of the attitudes of the author. They seem to care more about the LLM delivering code with poor accessibility than they care about the LLM delivering completely wrong answers.

The "any good developer would realize this is bad code" rings strongly as "no true developer would think the LLM's bad answer was correct". Seems like a short sighted opinion to me.

I also think you can replace "accessibility" with any number of programming meta concepts and find problems too.

How about "internationalization"? Are LLMs any good at producing code that is nicely internationalized?

Or more importantly "security". Are LLMs going to produce millions of lines of poorly secured code that people never double check? Almost assuredly.

The fact is that LLMs are prediction engines. They run off of probabilities based on the prompt and the training model. Thus, unless the training model is weighted towards cherry picked examples of excellent code, it's going to follow the masses.

And the masses write bad-to-average code mostly.


> They seem to care more about the LLM delivering code with poor accessibility than they care about the LLM delivering completely wrong answers.

I think the idea with this is that if it gives you completely wrong answers and the code doesn't work, it will obviously not work and you'll have to figure out how to fix it.

Meanwhile when it gives you code that appears to do what you wanted except the accessibility is broken, you'll ship it because you don't realize there's anything wrong with it.


To me this sounds just like human-written code that mostly works but has a couple of issues. I don't see why we couldn't apply the same techniques to deal with it: unit testing, review, QA.


The problem is that often it will give answers that are only subtly wrong, and those will get shipped too.

I think my puzzlement is with the focus on accessibility as though it was a high priority item. In my experience it's usually an afterthought, if it's a thought at all. Personally I've never worked on a codebase where accessibility was in the top 5 priorities. No one would ever block a prod release for an accessibility mistake.

But like I said, you could take this whole argument, find+replace "accessibility" with "security" and you would have a much more compelling argument imo. Given time constraints, code should prioritize security over accessibility basically always.


I don't think you are in a minority in web development, but if you are shipping to a wide user base on the open web, you are definitely in the wrong, and you may be doing something illegal.

As far as web development goes, accessibility is actually something you must actively screw up, rather than something you have to build up. In most cases what you do is accessible by default (as is talked about in this article) and you have to do something weird to break it. What you build by not thinking about accessibility might not have the best usability for assistive technology, but it should at least work.

That said, throughout my 10+ years as a web developer, I have consistently been reminded about accessibility. It is all over the literature; if you go to a random page on MDN there is probably a bullet point about accessibility implications. As a student, accessibility was at the forefront.

In fact, as an expert front-end developer, it is my responsibility to make sure what I build is accessible. Project managers often don't know this, and I have to explain it to them. A good project manager would know to take an expert's advice.

> you could take this whole argument, find+replace "accessibility" with "security" and you would have a much more compelling argument imo. Given time constraints, code should prioritize security over accessibility basically always.

I'm sorry, but this mentality demonstrates a massive disrespect for a portion of your user base (given you are targeting a general audience, as opposed to internal tools). If your work can't be used by somebody with a disability (or because their touchpad stopped working), you are not only being rather rude, but you may be breaking the law. Everybody deserves the possibility to use your work equally.


I'll just quote from Stevey's Google Platforms Rant

> Like anything else big and important in life, Accessibility has an evil twin who, jilted by the unbalanced affection displayed by their parents in their youth, has grown into an equally powerful Arch-Nemesis (yes, there's more than one nemesis to accessibility) named Security. And boy howdy are the two ever at odds.

> But I'll argue that Accessibility is actually more important than Security because dialing Accessibility to zero means you have no product at all, whereas dialing Security to zero can still get you a reasonably successful product such as the Playstation Network.

I'm fully aware that I'm commenting with a drive-by facetious block quote, but it is a reality that "insecure but accessible" has more users than "secure but inaccessible".


So ask it to rewrite the code so it's accessible/secure/has unit tests. The concept of SQL injection is in the training data, so it can protect against that and other attacks. It's able to rewrite the code it produces to fix problems when you point them out.


> Copilot loves suggesting about 25 nested divs as a starting point.

> I assume this is because of a flaw in how LLMs work.

I know with some LLM implementations, you can configure the sampling to penalise repetitions – this is making me wonder if Copilot might benefit from that?

> What does it say about Copilot’s knowledge of accessibility when it will hand us code even basic checking tools would flag?

Maybe it could do with some fine-tuning based on those checking tools? e.g. sample many answers to same prompt, run them through checking tool, and then fine-tune it to prefer the answers which caused the least warnings?

Or: run the suggestion through checking tools, and if it triggers warnings, sample a new suggestion, and see if the new one doesn't. This could be done on the client side in a loop – run suggestion through checks, if it fails, ask the LLM for a new and different suggestion, repeat until we get one which passes checks, or we give up.
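
Roughly something like this, where requestSuggestion and runChecks are hypothetical stand-ins for whatever completion API and checking tool (an ARIA linter, say) you'd actually wire in:

  // Sketch of the client-side "generate, check, resample" loop described above.
  fun suggestWithChecks(
      prompt: String,
      maxAttempts: Int = 5,
      requestSuggestion: (String) -> String,
      runChecks: (String) -> List<String>  // returns a list of warnings
  ): String? {
      var currentPrompt = prompt
      repeat(maxAttempts) {
          val suggestion = requestSuggestion(currentPrompt)
          val warnings = runChecks(suggestion)
          if (warnings.isEmpty()) return suggestion  // passes all checks
          // Feed the warnings back so the next sample can try to avoid them.
          currentPrompt = prompt + "\n\nThe previous suggestion was rejected:\n" +
              warnings.joinToString("\n")
      }
      return null  // give up and fall back to the human
  }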


If you look at the code for many sites with high "production value" today, 25 nested divs is about right. As a non-web-dev, I have always been surprised at how often you need to throw in a new layer of divs to get some simple visual thing to work across device sizes.


As a web dev, I honestly can't figure out the value here. It makes things considerably less legible from an inspection standpoint. And while, yes, each div provides a handle by which to independently control some layout property, the law of diminishing returns kicks in hard by about three layers deep (three block-level elements deep will cover 99.9% of things you might want to do with a given visual block element).

I can only imagine this kind of markup is tooling generated and assumes that no human eyes will ever have to review/write it.


It's a good point that the HTML and CSS of all public websites was probably fed into copilot, and I assume a lot of that came from template engines or other tools rather than being handwritten.

I guess it would be like teaching a computer C++ by looking at the output of Cython.


The issue is that often you do want heavy repetition when programming. Think about a list of strings where they mostly have a common prefix. Or JSON, or a bunch of imports and exports. Good code often has these low entropy sections.


> The issue is that often you do want heavy repetition when programming.

Up to a certain number of tokens, yes. But, I doubt any high quality code would have the exact same sequence of N tokens repeated 25 times consecutively. There's a certain heaviness of repetition at which it is unlikely to be genuinely useful.

> Or JSON, or a bunch of imports and exports

A human programmer, when evaluating whether code is repetitive, doesn't treat all tokens as equal – they ignore "expected"/"necessary" repetitions, and focus on the "unexpected"/"unnecessary" ones. So, penalising repetitions in sampling doesn't have to treat all tokens equally either. For example, in a JSON document, one might choose to ignore the tokens required by JSON syntax. In Java, one might penalise repetitions less in the import block than in a method body.

Of course, this means the sampling actually has to be aware of the syntax of the language being generated – which is possible, and can have some other advantages (e.g. if sampling only samples tokens which are allowed by the language grammar, you can eliminate many possibilities of generating syntactically invalid code.)
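
As a crude sketch of what that could look like at the sampling stage (the token IDs and the "structural" set are hypothetical; a real version would take them from the tokenizer and the language grammar):

  // Standard repetition penalty (divide positive logits, multiply negative ones),
  // except tokens in the structural set -- braces, commas, import keywords, JSON
  // punctuation -- are exempt, so "expected" repetition isn't punished.
  fun penalizeRepetitions(
      logits: DoubleArray,
      previousTokens: List<Int>,
      structuralTokens: Set<Int>,
      penalty: Double = 1.3
  ): DoubleArray {
      val seen = previousTokens.toSet()
      return DoubleArray(logits.size) { id ->
          val logit = logits[id]
          when {
              id !in seen || id in structuralTokens -> logit
              logit > 0 -> logit / penalty
              else -> logit * penalty
          }
      }
  }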


All true. The raw transformer architecture isn't enough to write sane code. I'd love to see changes made to have them guided by the compiler, custom linter rules, etc.

I still like them as is. I don’t let them write too much code for me. They’re really good translators (JSON to TypeScript interface definitions, shell command to Python string list) and quick documentation lookups (I like to write quick one-line comments for something short that I need that I would previously have looked up).


The biggest problem with Copilot/LLMs is that they effectively operate against anything that programming languages were designed for. What makes programming languages special is that they're well defined, semantically and syntactically rigorous and intended for machine execution. They give us the capacity to formally reason. Instead what we've got now is tools that literally argue with us, rather than anything that actually augments my capacity to reason about, inspect and understand the real performance and hardware of a system my code runs on. What I need is more Coq and less of something that just makes natural language suggestions.

What makes a good engineering tool is something that can look at the code right there as it is, use the formal guarantees that programming languages were designed for and give me some verifiably correct suggestions. Not average out 90% of Stackoverflow answers and then hallucinate up some statistical response.

Contrast Copilot with tree-sitter. What makes tree-sitter so good as a tool is that it leverages the regularity of programming languages. It can parse and correctly reason about code, instead of relying on some random regex collections and prayers. We've had so many good advances in recent years like the borrow checker in Rust. Why are we going back now and introducing tools that are by design incapable of ensuring correctness? Just to type a little bit faster?


I think a lot of these are actually solvable problems today, Copilot just hasn't prioritized actually improving the product (don't need to improve the product for breakneck growth when you have github.com as a distribution channel!)

It feels like there are a lot of well-intended AI coding products that just don't pay attention to getting the details right. I actually started building my own extension recently, with an emphasis on getting all the little things right, because I got so frustrated at Copilot. Things like closing brackets properly, not interrupting me and destroying my train of thought when writing comments, not suggesting imports unless it's highly certain (or verified with IntelliSense), etc. Like, why am I wasting my precious time talking to Copilot chat with GPT-3 when GPT-4 exists?

It's still a pretty early version, and ultimately we're using the same underlying model for completion, but I think getting these details right make a huge difference (at least to my biased self).

If you want to try it: https://marketplace.visualstudio.com/items?itemName=doublebo...

You'll need to install the pre-release version for auto-complete.


> Copilot is encouraging us to block users unnecessarily, by suggesting obviously flawed code, which is wrong on every level: wrong ethically, wrong legally, and the wrong way to build software.

I share many of the same worries as the author. This is why I think teams need to build and run their own Copilot-like systems, so that they can guide the suggestions they receive. Each developer and team has their own way of building software, and they need to be able to shape and evolve the suggestions they receive to fit their definition of the "right" way: https://blog.continue.dev/its-time-to-collect-data-on-how-yo...


I don't know if it's mindset or my own ignorance, but I find myself using Copilot and other language model tools as a teacher, a debugger, a reviewer, and an idea brainstormer. I find each use case enhances my ability to think more deeply about my code and helps keep me more engaged in problem solving.

For some reason I find an inference from a compressed model which contains almost every notable open source program written in the history of humanity to be a decent sidekick.

My experience tells me no software engineer is an expert at everything. Having a tool which allows us to try new things faster is a good thing.


I've come to largely agree with the author. These days, I keep it off by default, but it's a Cmd+' away from being flipped on and filling in what I _know_ to be boilerplate that's well suited. If I were younger with less money, I probably couldn't justify the price, but these days if it can save me a half hour of busywork per month on my personal projects the $10 is more than worth it.

Leaving it on while doing any thoughtful or challenging coding is super distracting for me.


As someone younger with less money... well, I'm "lucky" enough to get it for free (fun fact: GitHub just gives it away in perpetuity to accounts above a certain threshold of "karma"), so I use it.

If I didn't get it for free... well I'd be lying if I said I don't get any value out of it, but you're spot on, definitely not enough to justify its perpetual subscription.


It’s appalling that most website/app developers even have to deal with those kinds of low-level considerations, after decades of web-tech evolution, instead of using a UI builder tool (or a better UI modeling language) that provides all the building blocks for the most common 97% of use cases, and where you would need to go out of your way to create a non-accessible link.

TFA is right about LLMs, but it’s also an indictment of the web UI stack.


By a UI modelling language, do you mean HTML or Javascript or CSS?


I mean something more suitable than the HTML+JS+CSS combination. HTML is a document markup language, not a UI definition language. CSS mixes layout with styling, which are largely orthogonal. (One should be able to specify a UI independently from styling/theming.) A programming language like Javascript shouldn’t be needed for building a UI, in the majority of cases. Most UI components and behaviors should be standard (built into browsers, or whatever is used as the UI runtime) and declarative.


So basically you’re saying someone should make mui or any other component library except built into browser. Why is that an improvement?


It would be more than a component library, but yes, more akin to traditional GUI frameworks in spirit than the HTML+CSS approach. The improvement is that it would dramatically reduce the effort needed to implement and maintain web applications, make app behaviors more consistent for users, and make adherence to accessibility and other best practices the default, rather than something developers have to actively put significant effort into.


It's only a cause for concern if its capabilities are going to plateau.

More likely, advances in the field will mean that we end up in a more accessible world, where developers who don't normally think about accessibility have a generation engine doing a pass over their work adding appropriate labeling, fixing elements to work with screen readers, etc.

We just had a big paper about using genAI to improve test coverage.

And we haven't even really hooked LLM code generators up to linters and test suites broadly yet.

I can foresee a future where language-specific Copilot features might include running suggested HTML generations through an ARIA checker while running Python generations through a linter, etc. Especially when costs decrease and speed increases such that we see multiple passes of generation, this stuff is going to be really neat.

I still mostly consider the tech (despite its branding) in the "technical preview" stage moreso than a "finished product," and given the capabilities at this stage plus the recent research trends and the pace of acceleration, it's a very promising future even if there's valid and significant present shortcomings.


It is clear that, despite these tools having flaws, on the whole they save a lot of time. It is not clear what the tradeoff of introducing poorly understood or faulty code will bring, but given the utility, we're never going back.


Copilot is bad at accessibility because web engineers are bad at accessibility. All of the bad habits in this post were learned from its training data.

That's not to say this can't be fixed: a recurring lesson of LLMs is that the quality of the training data is /everything/. OpenAI made their models better at chess by feeding in higher quality chess data - they could absolutely make it better at accessible frontend code by curating and boosting better code examples.

I doubt they'll do that any time soon, purely because there are so many other training data projects they could take on.

Thankfully we aren't nearly as dependent on a few closed research labs as we used to be.

It would be very exciting to see fine-tuned openly licensed models that target exactly this kind of improvement.


> Copilot is bad at accessibility because web engineers are bad at accessibility. All of the bad habits in this post were learned from its training data.

100%, and this is why Copilot is damn-near unusable for Bash scripting (yeah, the real problem is Bash scripting, use a better scripting language etc etc, but I do it, you've probably done it, and we've all definitely worked with codebases with Bash script linchpins) - there's a lot of bad Bash out there.


Does anyone who has found a good workflow with copilot have a good resource to share that demonstrates how to get the most out of it?

I really want it to be more useful, but rarely find that it's helpful for completing more than a single line or two.

Do you write out comments for everything you're going to do and then just write it yourself if the suggestion isn't useful?

Is there a trick to getting it to read your code itself across files?


> Shouldn’t the results I get from a paid service at least be better than a bad StackOverflow suggestion that got down-voted to the bottom of the page (and which would probably come with additional comments and suggestions letting me know why it was ranked lower)?

I don't know why you would expect this, when the model is likely trained on StackOverflow material (or similar publicly available code examples).


Redmonk says Kotlin is the 17th most popular programming language ( https://redmonk.com/sogrady/2023/05/16/language-rankings-1-2... ). So can any of these LLMs and whatnot, even the ones supposedly geared toward programming do something like this:

"Write a function in Kotlin that take a Long as a parameter, and sends back a List containing Long types. The parameter is a number, and the return is a list of prime numbers less than that number. All in one function."

It seems it should be pretty simple, in fact I have written this program a number of times. If you think a list of prime numbers might take up too much memory, I have also done prompts only asking it to just give the largest prime under the input parameter.

It is not a difficult task, and Kotlin is between Objective-C and Rust in popularity. Have any neural network programming tools been able to complete this? No. Some can, if the number input is 18L or the like. None have been able to handle 600851475143L (taken from the third Project Euler). If the program runs at all I get "java.lang.OutOfMemoryError: Java heap space". Even if I warn it to watch heap memory, it still is the same result.

As I said, this is a prompt for a list, but even if I ask for only the largest prime number before 600851475143L, or any long such as that number, I have not seen any LLM or the like that can write that function. Especially ChatGPT 4, which I have tried it on extensively.

I'm not saying LLMs will not get there, but this is part of the third question on the Project Euler site, in a fairly popular language. It's a pretty simple question, a straightforward function to write. They can't do it yet.

I see people worrying about AI being on the verge of taking programmers' jobs. Until it can do something as incredibly specific and simple as this, I am not worried at all.


The list of primes below 600851475143 contains 23038900221 elements. If each element is a long, that takes a little over 184GB (decimal) of storage. May I ask how you managed it without running out of memory?

(Project Euler 3 asks for a factorisation, which, using even the most memory-hungry reasonable algorithm, would require a list of merely sqrt(600851475143) ≈ 775146 elements, which is much more manageable.)
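
For what it's worth, the factorisation variant needs no list at all; a plain trial-division sketch along these lines runs in constant memory:

  // Largest prime factor by trial division up to sqrt(n): no list of primes,
  // no heap pressure. For 600851475143L this prints 6857.
  fun largestPrimeFactor(n: Long): Long {
      var remaining = n
      var largest = 1L
      var factor = 2L
      while (factor * factor <= remaining) {
          if (remaining % factor == 0L) {
              largest = factor
              while (remaining % factor == 0L) remaining /= factor
          }
          factor++
      }
      return if (remaining > 1) remaining else largest
  }

  fun main() {
      println(largestPrimeFactor(600851475143L))
  }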


These code generation systems should probably prepend a hidden "Generate accessible, secure, maintainable etc code" prompt

Of course that doesn't provide any guarantee, and no developer should rely on it, but the average results would probably be a little better


What we will see is that LLMs become so good at writing code that LLM-first development will emerge.

LLM-first means we will test it against our libraries and best practices, and potentially even create a new language for it.

Then programming in the classical sense won't exist anymore.

The era of code will end when we deploy the first code written with an LLM to write new code.

Javallm or #llm.

It might be full of examples for an LLM; it might focus on analyzing logic and fixing it at a higher level, until the AI is good enough to write, evaluate, and deploy it itself.

After that it will no longer be understandable by us, and researchers will start analyzing it after it was written.

Historians will start tracking when AI started to create more efficient abstractions, etc.


This is one of the more thoughtful, nuanced criticisms of the current LLM fad that I've read, and I'm delighted to see it show up on HN. The author starts off with a series of well-thought-out experiments that show GitHub Copilot generating _pretty valid_ frontend code, code that works and fulfills the prompt: but code that ignores every web accessibility rule of thumb in the most egregious ways. Sure, yes, bad web devs write bad code, and Copilot is -- on its best day -- a perfectly cromulent bad developer. Yawn, news at 11, etc.

But where he takes those examples and where his thoughts end up is where this essay really hit home for me:

> As more and more of the internet is generated by LLMs, more and more of it will reinforce biases. Then more and more LLMs will consume that biased content, use it for their own training, and the cycle will accelerate exponentially.

And 'biases' here isn't the usual "models are woke-lobotomized!" yammering, but rather a thoughtful take on how the use of LLMs for code generation may, at least for the current state of LLMs, slowly normalize _writing worse code_.


So that's nothing new. Code has been getting worse for decades. Moore's law is making worse code acceptable.

In the meantime, some people write better code and do care about it and LLMs aren’t going to change that.

So there will be worse code and there will be better code as always. LLM is just a tool.


This is true, but let's not forget that it is a race to the bottom.

Even this blog reshaped itself while I was reading it. Moore's law is lagging here.


> Why do we accept a product that not only misfires regularly, but sometimes catastrophically?

RIGHT?!!

ChatGPT went completely bonkers today, and people are basically ":shruggie: it happens"

THAT IS WILD TO ME.

People are using ChatGPT for medical advice. People are using ChatGPT for activities that have financial implications. How is this okay?!


> Copilot loves suggesting about 25 nested divs as a starting point.

To be fair it costs a huge amount of money to hire a React/Tailwind person to create 25 nested divs as a starting point.


On a personal note, I'm in this picture. I've been doing back-end code for decades, and beyond a bit of 2003-style AJAX/CSS, my front-end skills were non-existent. One of those itches, standing up a blog as I work on my Rust skills: I looked at the Tailwind components and started using them, and down the rabbit hole I went to understand what was there and how to use it. On a whim, I asked Copilot to interpret this style, and I'll be damned if it did not produce useful results. I also understand there is an entire ecosystem I'm oblivious to.

I'll grumble about the Java code it generates. Mostly meh. As I look at the Rust suggestions, it seems fine. Is it that Rust has better training data, or is it that I'm still a weak Rust coder and don't know right from merely working? My money is on the latter.

Anyhow... considering the hours I spent yesterday trying to implement light/dark mode on some simple pages, your Tailwind comment resonates.


> In a lot of ways, in fact, “AI” is just the newest iteration of a very old form of colonial capitalism; build a wall around something you didn’t create, call it yours, and charge for access. (And when the natives complain, call them primitive and argue they’re blocking inevitable progress.)

What a wonderful analogy. LLMs also feel very Pythagorean, where a secret cult (of capital owners; the bourgeoisie) guards the secrets of forbidden math, using it for their own benefit and denying it to the masses. The amount of data and computing power needed to train a good model means it is pretty much inaccessible to the masses; the public can only ever hope to use an already-trained model provided to us by this secret cult.


No mention of testing in the article. It seems odd how often accessibility advocates talk about following rules rather than testing. Shouldn’t we be testing with screen readers or something?

If a website doesn’t work in Firefox, we fault the developer for not testing it in Firefox. Similarly for mobile browsers.

If testing is in place, LLM’s are much safer to use. You’ll notice when they give you code that doesn’t work.


100% failure isn't really a useful test, is it?


I don't know what you mean. With test-first development, you write a failing test and then you fix it.
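
A tiny made-up illustration (kotlin.test; the markup and names are only examples): the test is written first and fails, then the function below it is written to make it pass.

  import kotlin.test.Test
  import kotlin.test.assertEquals

  class ImageMarkupTest {
      // Written first; it fails until renderImage exists and emits the markup we want.
      @Test
      fun imageAlwaysGetsAltText() {
          assertEquals(
              """<img src="/cat.png" alt="A sleeping cat">""",
              renderImage(src = "/cat.png", alt = "A sleeping cat")
          )
      }
  }

  // Written second, to make the test above pass.
  fun renderImage(src: String, alt: String): String =
      """<img src="$src" alt="$alt">"""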


It's a very powerful autocomplete. "It doesn't generate all the code I need in full and if it does I have to poke at it" is just poor criticism. You don't have to press tab and insert everything it suggests. It will usually generate me half a line after typing the first half - that's pretty awesome in my opinion.

If you stick to using it to merely speed-spell out what you were in fact already in the process of writing, and ignore 90% of the terrible crap it proposes, it's a nice productivity boost and has no way to make code worse by itself.

Basically, instead of writing a big comment and then a function signature and expect it to do the rest, just start writing out the function, tab when it gets it, don't when it doesn't, or (most of the time) tab then delete half of it and keep the lines you intended, likely with some small tweak.

Surely LLMs will be able to do so much more, and without constant supervision, in the future, but we're not there. That doesn't mean they're bad. Especially Copilot, since it's just there with its suggestions and doesn't require breaking flow to start spelling out in regular text what you're doing.


This sounds like it mirrors my usage. Basically treat it like pairing with a really junior dev: assume everything it writes will be wrong and then go from there. If you do that then best case it speeds you up and worst case you waste a little time reading what it wrote that was wrong and ignoring the suggestion and moving on.


That's fine, but it already exists, e.g. ReSharper.


Half the time copilot doesn’t even return a solution


This is a good thing in the context of the article (it even explicitly says "They’re not made to give you verifiable facts or to say 'I don’t know'" in a context which suggests this is in fact a bad trait). Better to return no code than to return crap code.


It would be a good thing if the requested solution wasn’t extremely simple


Better no-solution than cognitive overload with bad solutions.


I wonder, though, whether Copilot would have fared better here had it been told to pay attention to accessibility.

I mean, maybe it should do it by default (and maybe it could be part of its system prompt or otherwise in its material), but it's still a tool that needs some expertise for using, even if it's trying its best to trick people into believing otherwise. Ultimately I don't think there's a solution to people misusing tools.

Paraphrasing sentiment I don't quite recall exactly: "If anyone can do it, then anyone will."


> In a lot of ways, in fact, “AI” is just the newest iteration of a very old form of colonial capitalism; build a wall around something you didn’t create, call it yours, and charge for access. (And when the natives complain, call them primitive and argue they’re blocking inevitable progress.)

This is pithy, but the dynamic between OSS devs and Microsoft/OpenAI is not exactly comparable to the dynamic between a colonial government and an indigenous population. I don’t think it really needs to be said, but open-source maintainers are not colonized natives.

Even overlooking the very questionable metaphor, they’re not building a wall around existing repositories of code and selling them back to us. They spent a lot of money training an AI model on that code, and now they’re selling access to that model. You don’t need to pay Microsoft for access to the GitHub repos or Stack Overflow answers that they trained on.


Is it just me or has Copilot gotten progressively worse lately? It used to feel like it was making well-informed guesses; now they feel like literal guesses with no context at all. For example, in my Phoenix LiveView (Elixir) app it guesses "xxx@xxxxx" for _any_ attribute I pass in to a component.


I am somewhat amused by all of the "copeelot bad" articles, and I dearly hope they keep proliferating, so that those of us who enjoy its frankly insane productivity boost get to stay ahead of the competition. I perceive no quality/reliability drawbacks in my own code. If anything, the ability to iterate more quickly makes my code better than ever.

It's a skill issue. (You had it coming.)


> I perceive no quality/reliability drawbacks in my own code.

How can you be sure that doesn't say more about you than it does about copilot?


I'm pretty sure, as I constantly judge and monitor the quality of my code. But thanks for immediately disregarding my personal experience and inserting your own uninformed prejudged assessment, random internet guy.


> thanks for immediately disregarding my personal experience and inserting your own uninformed prejudged assessment

Isn't that exactly what your toplevel post is doing? Physician, heal thyself!


I hesitate to engage in this hopelessly fruitless discussion, but the answer is no. I don't even express my opinion of the article, arguably barring one humorous phrase that refers to the currently-fashionable wave of Copilot criticism. I don't mind the article. It's actually pretty well-written. None of this is incompatible with my statement that in my experience, Copilot lets me do my job better. Time to get off the internet, physician.



