Good point. Also (tangent), I followed your profile link to https://sibylline.dev and am thoroughly impressed. Stoked to have found your treasure trove of repos and insights.
>GPT-5 showed significant improvement only in one benchmark domain - which is Telecom. The other ones have been somehow overlooked during model presentation - therefore we won’t bother about them either.
I work at OpenAI and you can partly blame me for our emphasis on Telecom. While we no doubt highlight the evals that make us look good, let me defend why the emphasis on Telecom isn't unprincipled cherry picking.
Telecom was made after Retail and Airline, and fixes some of their problems. In Retail and Airline, the model is graded against a ground truth reference solution. Grading against a reference solution makes grading easier, but has the downside that valid alternative solutions can receive scores of 0 by the automatic grading. This, along with some user model issues, is partly why Airline and Retail scores stopped climbing with the latest generations of models and are stuck around 60% / 80%. I'd bet you $100 that a superintelligence would probably plateau around here too, as getting 100% requires perfect guessing of which valid solution is written as the reference solution.
In Telecom, the authors (Barres et al.) made the grading less brittle by grading against outcome states, which can be reached via multiple valid solutions, rather than by matching against a single reference solution. They also improved the user modeling, among other things. So Telecom is the much better eval, with a much cleaner signal, which is partly why models can score as high as 97% instead of getting mired at 60%/80% by brittle grading and other issues.
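To make the distinction concrete, here's a minimal sketch (my own illustration, not the actual tau-bench code) of reference-matching versus outcome-state grading:

    # Reference matching: brittle, because a valid alternative
    # sequence of actions scores 0.
    def grade_by_reference(agent_actions, reference_actions):
        return agent_actions == reference_actions

    # Outcome grading: any solution path that reaches the required
    # final state passes.
    def grade_by_outcome(final_state, expected_state):
        return all(final_state.get(k) == v
                   for k, v in expected_state.items())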
Even if I had never seen GPT-5's numbers, I like to think I would have said ahead of time that Telecom is much better than Airline/Retail for measuring tool use.
Incidentally, another thing to keep in mind when looking critically at OpenAI and others reporting their scores on these evals is that the evals give no partial credit - so sometimes you can have very good models that do all but one thing perfectly, which results in very poor scores. If you tried generalizing to tasks that don't trigger that quirk, you might get much better performance than the eval scores suggest (or vice versa, if your tasks trigger a quirk not present in the eval).
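To put a number on how harsh all-or-nothing grading is, here's a back-of-the-envelope illustration (the numbers are mine, purely hypothetical):

    # A model that executes each of 20 required steps correctly 95% of
    # the time still fails roughly 64% of tasks outright.
    per_step_accuracy = 0.95
    steps = 20
    print(per_step_accuracy ** steps)  # ~0.358 task pass rate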
"During our conversation, I was continually struck by the degree to which Hotz and his company are anti-mimetic. Like many founders of tech startups—thanks to the influence of Peter Thiel – Hotz has a passing familiarity with René Girard’s theory of mimetic desire. The theory, now supported by a trove of empirical evidence, posits that our desires do not originate in us but are always learned from models."
That makes me think of https://store.steampowered.com/app/2262930/Bombe/ which is a version of Minesweeper where instead of clicking on squares you define (parametric!) rules that propagate information around the board automatically. Your own rules skip all the easy parts for you. As a result, every challenge you get is by definition a problem that you've never considered before. It's fun, but also exhausting.
Nothing for the use cases I have in production other than more platform support, but those can be compile-time features for the specific environments that need them. I want 0 lines of dead code in production, for easy auditing.
Before anyone puts the blame on Nx, or Anthropic, I would like to remind you all what actually caused this exploit. The malicious code was shipped in a package that was uploaded using a stolen "token" (a string of characters used as a sort of "username+password" to access a programming-language package-manager repository).
But that's just the delivery mechanism of the attack. What caused the attack to be successful were:
1. The package manager repository did not require signing of artifacts to verify they were generated by an authorized developer (a sketch of what such a check could look like follows this list).
2. The package manager repository did not require code signing to verify the code was signed by an authorized developer.
3. (presumably) The package manager repository did not implement any heuristics to detect and prevent unusual activity (such as uploads coming from a new source IP or country).
4. (presumably) The package manager repository did not require MFA for the use of the compromised token.
5. (presumably) The token was not ephemeral.
6. (presumably) The developer whose token was stolen did not store the token in a password manager that requires the developer to manually authorize unsealing of the token by a new requesting application and session.
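For the signing points (1 and 2), here's a minimal sketch of what a registry-side check could look like, using Ed25519 from Python's `cryptography` package; the function and its role are my own illustration, not any real registry's API:

    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import (
        Ed25519PublicKey,
    )

    def verify_upload(tarball: bytes, signature: bytes,
                      registered_public_key: bytes) -> bool:
        # The registry keeps the developer's registered public key, so a
        # stolen upload token alone is useless without the offline
        # signing key.
        try:
            key = Ed25519PublicKey.from_public_bytes(registered_public_key)
            key.verify(signature, tarball)
            return True
        except InvalidSignature:
            return False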
Now, after all those failures, if you were affected and a GitHub repo was created in your account, that is a failure of:
1. You, for not keeping your GitHub tokens/auth in a password manager that requires you to manually authorize unsealing of the token for each new requesting application and session.
So what really caused this exploit is a series of missing security mechanisms, all completely preventable, all of which could have been easily added years ago by any competent programmer. The fact that they were not in place and mandatory is a fundamental failure of the entire software industry, because 1) this is not a new attack; it has been going on for years, and 2) we are software developers; there is nothing stopping us from fixing it.
This is why I continue to insist there need to be building codes for software, with inspections and fines for not following through. This attack could have been used on tens of thousands of institutions to bring down finance, power, telecommunications, hospitals, the military, etc. And the scope of these attacks and their impact will only increase with AI. Clearly we are not responsible enough to write software safely and securely on our own, so we must have a building code that forces us to do it.
Bingo. The best trials are those that allow the user to determine whether the product is capable of solving the user’s immediate problem without actually solving it unless the product is purchased.
I'd recommend anyone interested in Confidential Computing read the work of Rodrigo Branco (@BSDaemon) to understand why it's mostly a failure and a PR stunt by cloud providers: it gives the illusion that the customer stays in control, while the hardware capabilities CC is built upon are insecure (and, most of the time, can't be fixed by a firmware or microcode update).
Top 5 codebases for changing my mind about things:
Wietse Venema's Postfix mail server. It taught me tons about security posture. Its architecture I'd describe as microservices before microservices was a thing - but contrary to the modern take on microservices (mostly a tool for decomposing work across large, semi-isolated groups), this was primarily about security and simplicity.
Spring framework - this opened my eyes to ways of working that I hadn't really thought enough about before. The developers on that project have a culture of deeply considering the needs of their users (who are Java developers, often in an enterprise environment).
Git - the thing I like about the Git codebase is that once you've covered the objects database (e.g. blobs, trees, and commits) and the implementation of refs, everything else just feels like additional incremental features. With those core concepts, everything else is built harmoniously on top (see the sketch at the end of this comment).
Varnish by Poul-Henning Kamp is another one - it feels like he went to great lengths to make that codebase a teaching tool, despite the fact that it's also a top-tier reverse proxy.
The last one isn't a codebase, but it will help with software design in the large: studying how the lieutenants model works in the Linux kernel.
Thinking about my answers, I think I've highlighted something subtly different from "well designed codebases" - it's more a list of codebases that left a notable, long-lasting impression on me because of the design decisions they made.
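As a footnote to the Git item above: the objects database is simple enough that you can reproduce its core in a few lines. This sketch mirrors what `git hash-object` does for a blob:

    import hashlib

    # Every Git object is "<type> <size>\0<payload>", addressed by the
    # SHA-1 of that whole byte string.
    def git_blob_hash(content: bytes) -> str:
        header = b"blob %d\x00" % len(content)
        return hashlib.sha1(header + content).hexdigest()

    # Matches: echo 'hello' | git hash-object --stdin
    print(git_blob_hash(b"hello\n"))
    # ce013625030ba8dba906f756967f9e9ca394464a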
Glad to see a fellow fundamental indexer on HN! As a US-based investor, I personally invest in the RAFI US broad market fundamental index (FNDB ETF), which has kept up with the Vanguard US total market over the past 10 years - except in the bubbly years of 2020/2021 and 2024/2025 - even with a higher expense ratio.
In my case, after observing the Covid-19 craziness in the market, I decided to dig further into value strategies and discovered this gem from Research Affiliates in the Journal of Portfolio Management circa 2012, which completely convinced me of fundamental indexation as a superior alternative to a market-cap-weighted total market index.
My abstract algebra class had it exactly backwards. It started with a lot of needless formalism culminating in Galois theory. This was boring to most students, as they had no clue why the formalism was invented in the first place.
Instead, I wish it had shown how the sausage was actually made in the original writings of Galois [1]. That would have been far more interesting to students, as it shows the struggles that went into making the product - not to mention the colorful personality of the founder.
The history of how concepts were invented for the problems their creators faced does far more to motivate students to build a mental model than canned capsules of knowledge.
Also, if anyone wants to go down the rabbit hole of why SAML is hard to implement, this is a pretty interesting writeup of a major 0-day vuln we discovered earlier this year: https://workos.com/blog/samlstorm
>> When people talk about "unit tests", a unit doesn't refer to the common pattern of "a single class". A unit is a piece of the software with a clear boundary. It might be a whole microservice, or a chunk of a monolith that is internally consistent.
It's OK to dislike unit testing, but please don't redefine the term to avoid it. That's not helpful. Instead, try to find the papers (by NASA or IBM?) that show unit testing finds only very few actual bugs, making it low value.
That said, there are IMHO some units more worth testing than others.
I disagree with the title; loops are tail-recursive functions, but tail-recursive functions are not loops (in the sense that squares are rectangles, but rectangles are not squares).
It is true that every tail recursive function can be converted into a semantically equivalent loop via a transformation like CPS (which the author mentions). However, for mutually tail-recursive functions, this conversion loses control flow information. This is because after the CPS transformation, calls to the other function become calls to a continuation; this call usually must be implemented as an indirect jump. On the other hand, mutually tail-recursive functions can call each other with direct/statically-known jumps.
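Here's a minimal sketch of that conversion in Python, with mutually tail-recursive functions rewritten trampoline-style so each tail call becomes a returned thunk (the example functions are mine, purely illustrative):

    # Each "tail call" returns a thunk instead of calling directly.
    def is_even(n):
        return True if n == 0 else (lambda: is_odd(n - 1))

    def is_odd(n):
        return False if n == 0 else (lambda: is_even(n - 1))

    # The loop equivalent: `step()` is an indirect call, because the
    # target is only known at runtime - exactly the control flow
    # information loss described above.
    def trampoline(step):
        while callable(step):
            step = step()
        return step

    print(trampoline(lambda: is_even(10 ** 6)))  # True, constant stack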
This loss of information might appear trivial, but in practice it has some important consequences. Classic examples are interpreter loops. It is well known that computed gotos can result in modest to large speedups for interpreters [1]. The reason is that computed gotos create one indirect jump per opcode, so a branch predictor can take advantage of correlations between opcodes. For example, looking at Python disassembly, the header of a standard range for loop compiles down to three opcodes in sequence: GET_ITER, FOR_ITER, STORE_FAST [2]. A branch predictor can recognize that the target of the "FOR_ITER" indirect jump will likely be the "STORE_FAST" instruction pointer; it cannot predict this in the naive implementation where jumps for all instructions are "merged" into a single indirect jump / switch at the top of the loop body. In this case, computed goto is effectively equivalent to a CPS transformation whose closures require no storage on the heap.
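You can see that opcode sequence for yourself with the `dis` module (exact output varies across CPython versions, but the loop header is these three opcodes):

    import dis

    def f(xs):
        total = 0
        for x in xs:  # header: GET_ITER / FOR_ITER / STORE_FAST
            total += x
        return total

    dis.dis(f)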
Suppose, however, we know even more information about the instruction sequence; for example, we know ahead of time that every FOR_ITER opcode will be followed by a STORE_FAST opcode. We could completely replace the indirect jump with a direct jump to the instruction pointer for the STORE_FAST opcode. Because modern branch predictors are very good, this will have about the same performance in practice as the computed goto loop.
However, consider the limiting case where we know the entire instruction sequence beforehand. If we write our interpreter as many mutually tail-recursive functions, with one function for every instruction, an optimizing compiler can replace every indirect call with a direct (tail-recursive) call to the function that implements the next instruction's opcode. With a good enough optimizer / partial evaluator, you can turn an interpreter into a compiler! This is known as the first Futamura projection [3].
To see an example of this in action, I wrote a prototype of a Brainfuck compiler via the Futamura projection; it uses LLVM as a partial evaluator [4]. The main interesting function is `interpret`, which is templated on the program counter / instruction. That is, `interpret` is really a family of mutually tail-recursive functions which statically call each other as described above. For short Brainfuck programs, the LLVM optimizer is able to statically compute the output of the Brainfuck program. (The one in the Godbolt link compiles to a loop, likely because LLVM does not want to unroll the mutual recursion too much.) You can play around with different Brainfuck programs by modifying the `program` string on line 5.
I believe you've covered some working solutions in your presentation. They limit LLMs to providing information/summaries and taking tightly curated actions.
There are currently no fully general solutions to data exfiltration, so things like local agents or computer use/interaction will require new solutions.
My personal perspective is that the best we can do is build secure frameworks that LLMs can operate within, carefully controlling their inputs and interactions with untrusted third party components. There will not be inherent LLM safety precautions until we are well into superintelligence, and even those may not be applicable across agents with different levels of superintelligence. Deception/prompt injection as offense will always beat defense.
When the chatbot can also make cutting remarks pointing out your insecurities, nag you about chores and responsibilities, withhold affection, make you waste your time doing things the chatbot wants to do, or have you make soul-crushing smalltalk with the chatbot's parents, and you can't leave because you had children with it, and who knows if you can even do better - you're getting too old to start over anyway - then you can call it real love.
The real crime is an economic system that limits the spread of knowledge and access to other "human rights" by requiring everyone to hustle to survive (and, if possible, increase capital gains for the financial overlords), when we are already technologically equipped to feed and house all of mankind well - instead we let thousands of children starve to death each day and restrict access to education so that billions miss out on their intellectual development, a void easily filled with addictive media full of rage and distraction. Pirating books is just a symptom of this wretched system. And it is not enough - RISE, HN! .. towards RBE & beyond..
Thanks. For a while there, it wasn't clear to me which side of the line I was walking.
Something that stuck with me from Poor Charlie’s Almanack is that low expectations are a cornerstone of a happy life. I built this for myself first, so when people actually signed up and paid, it was incredibly motivating. I was thrilled to spend my free time treating those early customers like royalty and building more of what they wanted.
If I had instead come into this with the expectation of quick success, I doubt I would have made it through those early years.
And cheers from one bootstrapper to another. It's not easy, but I can't imagine a more rewarding way to build.
My absolute favorite use of MCP so far is Bruce Hauman's clojure-mcp. In short, it gives the LLM (a) a bash tool, (b) a persistent Clojure REPL, and (c) structural editing tools.
The effect is that it's far more efficient at editing Clojure code than any purely string-diff-based approach, and if you write a good test suite it can rapidly iterate back and forth just editing files, reloading them, and then re-running the test suite at the REPL -- just like I would. It's pretty incredible to watch.
Something I've realized about LLM tool use: if you can reduce a problem to something that can be solved by an LLM in a sandbox using tools in a loop, you can brute-force that problem.
The job then becomes identifying those problems and figuring out how to configure a sandbox for them, what tools to provide and how to define the success criteria for the model.
That still takes significant skill and experience, but it's at a higher level than chewing through that problem using trial and error by hand.
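A minimal sketch of that shape, with `llm` and `tools` as hypothetical stand-ins rather than any real API:

    def solve_in_sandbox(llm, tools, task, success, max_steps=50):
        history = [task]
        for _ in range(max_steps):
            action = llm.next_action(history, tools)  # hypothetical call
            result = tools.run(action)  # executed inside the sandbox
            history.append((action, result))
            if success(result):  # the caller-defined success criteria
                return result
        return None  # brute force failed within budget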
> I have an unusually high need to own the understanding of any thing I'm learning
This is called deprivation sensitivity. It's different from intellectual curiosity: the former is a need to understand, while the latter is a need to know.
Deprivation sensitivity comes with anxiety and stress, whereas intellectual curiosity is associated with joyous exploration.
I score very high on deprivation sensitivity. I have an unbridled drive to acquire and retain important information.
It's a blessing and a curse - an exhausting way to live. I love it, but sometimes I wish I was not neurodivergent.