One major limitation the LLM has is that it can't run a profiler on the code,
but we can. (This would be a fun thing to do in the future - feed the output
of perf to the LLM and say 'now optimize').
This has been a feature of the Scalene Python profiler (https://github.com/plasma-umass/scalene) for some time (at this point, about 1.5 years) - bring your own API key for OpenAI / Azure / Bedrock, also works with Ollama. Optimizing Python code to use NumPy or other similar native libraries can easily yield multiple order of magnitude improvements in real-world settings. We tried it on several of the success stories of Scalene (before the integration with LLMs); see https://github.com/plasma-umass/scalene/issues/58 - and found that it often automatically yielded the same or better optimizations - see https://github.com/plasma-umass/scalene/issues/554. (Full disclosure: I am one of the principal designers of Scalene.)
I think with new agent capabilities of AI IDEs like Cursor and Cline, you can simply instruct the agent to run a profiler in the shell and make optimizations based on the output of the profiler. You can even instruct the agent to try to make the program run faster than a certain time limit and will try to reach that goal iteratively.
This same approach can work for eliminating errors in LLM generated code. We've had good luck doing it with Lattice [1] going from 5% to 90% success rate of creating a fully functional application.
I'm drafting a blog post that talks about how but for now the documentation will have to do.
There are many LLMs that have tools (function calling) available in their API. Some time ago, I was building AI agent that has profiler connected by function calling, and it works exactly as you described. First produce the code, then call profiler, and last step was to optimize code.
> I think it's revealing how limited the LLMs were in algorithmically optimizing this problem vs approaches like "add some parallelism" or "use numpy", for which they were quite decent in Python and somewhat less decent in Rust. It's surprising to me that unless I prompted it overly-specifically, GPT 4o didn't suggest using a faster rand implementation, as that was a pretty effective, easy step.
As a human performing optimization one of the steps might be to generate a list of candidate optimizations. After that you might explore each idea in parallel. By using the LLM iteratively, you're side stepping that approach and constraining the LLM to instead picking a certain optimization first and then build on that. So one cure might be to adjust your prompt and iteration strategy to take into account that there are multiple candidate solutions rather than just one. This leads to more of an agentic / chain approach than a completions approach to the task though.
The other missing part of this is that each time you optimize the code (even if it's a failure), you learn something more about the optimization. Again this context seems missing when the LLM looks at the code (unless that history is provided in CoPilot completions automatically - it probably depends on what approach you're taking there).
I wonder if there are any AI tools out there doing context + planning + history + RAG + test + evaluation + tool use + feedback approaches. I've seen various tools do 2-3 of those. It's not a stretch to think that we might eventually see tools that start to emulate more of the developer inner loops.
> I wonder if there are any AI tools out there doing context + planning + history + RAG + test + evaluation + tool use + feedback approaches. I've seen various tools do 2-3 of those. It's not a stretch to think that we might eventually see tools that start to emulate more of the developer inner loops.
Aside: the original question begins "Given a list of 1 million random integers...". For me, this suggests that we ought to be benchmarking from the point after the random numbers have been generated. I don't know how easy this benchmarking is to do in Rust, but I know how I would do it in dotnet using something like BenchmarkDotNet.
Yup, that was my comment / I'm the author of the blog post. The discussion here inspired me to do the study more carefully and turn it into a longer post.
I think it's fair to leave the random generation in or out, really, as long as we're still comparing apples to apples. But I agree with you that it probably would have made the benchmarking more clear had I left them out.
i agree that excluding the test data generation from the test should have been required. additionally, the original task should have been formulated much more general, i.e:
Given a function that takes a list of n random integers between x and y,
find the difference between the smallest and the largest numbers whose digits sum up to m.
then define the parameter values for testing and performance evaluation. that should have prevented at least some of the proposed compile time optimizations.
edit: i'm aware though that the performance characteristics might be very different depending on the value of "n", especially when it comes to multithreading, so there should be a clause for that (e.g. n usually > 1m).
I think few jr. devs would be able to apply the optimizations that were in the blog post. So maybe already LLMs are performing on a level that is beyond human jr. devs, and quickly catching up with the next tier.
For me, the show stopper is that every single token is suspect. Adding layers of more of it compounds this suspicion, leading to fatigue and automation bias, amplifying the speed of error or technical debt while possibly even acting against the understanding and control of stakeholders. They are actually in my mind far above human skill, actually, and far below basic microbes in basic existing and engagement of course. I think the true societal task now is to give names to these non-alive qualities and risks, the echo chamber aspects, what the total information of a model plus a generated result is, and things like X/Y problems and unknown unknowns.
> LLMs are performing on a level that is beyond human jr. devs, and quickly catching up with the next tier.
I am a little more pessimistic. I would say that LLMs have been hovering above junior for a year or two, with some progress towards intermediate in the context of a few methods/classes, but not much evidence of the wider architectural/systems-level thinking that I'd expect from an intermediate.
It's a good question. I want to think that most of the students we graduate could do this (once they learned Rust, as it's not part of our curriculum, but they could do it in C or Java or something). But our (CMU CS) students get a fair bit of algorithmic experience, and I'm not sure how you're defining "junior dev" (and the experience of a junior dev probably varies widely given the many different paths to get there).
Refining and adding to the text and code they train on is a major area of focus at the moment. Things will definitely get better, but it will take time.
Nice case study of how far LLMs are from AGI. I gave it to o1 instead of 4o, testing whether the thinking tokens help. Alas, they don't. It never came up with the "obvious" "skip-if-between" improvement over the naive algorithm.
The skipping was obvious to me at first glance, before I read the full blog post. So o1 failing it even after prodding for algorithmic improvements is a sign it's not that smart (in an AGI sense).
Note: the initial problem statement is ambiguous. Instead of saying "Given a list of random integers" - which the LLM somewhat reasonably interpreted as parsing from stain - you should phrase it as "Generate a list of random integers and then..."
reply