Hacker News | spion's comments

I think agents have a curve where they're kinda bad at bootstrapping a project, very good when used in a small-to-medium-sized existing project, and then it slowly goes downhill from there as size increases.

Something about a brand-new project often makes LLMs drop to "example grade" code, the kind you'd never put in production. (An example: Claude implemented per-task file logging in my prototype project by pushing to an array of log lines, serializing the entire thing to JSON, and rewriting the entire file, for every logged event.)
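
To make the failure mode concrete, here's a paraphrased sketch (in Rust, using serde_json; the function names are mine, not the actual generated code) of rewriting the whole file per event versus just appending a line:

    use std::fs::{File, OpenOptions};
    use std::io::Write;

    // Roughly the shape of the generated approach (paraphrased): keep every
    // entry in memory and rewrite the whole file as one JSON document on
    // every logged event.
    fn log_naive(entries: &mut Vec<String>, path: &str, event: &str) -> std::io::Result<()> {
        entries.push(event.to_string());
        let json = serde_json::to_string(entries)?; // re-serializes everything so far
        File::create(path)?.write_all(json.as_bytes()) // truncates and rewrites the file
    }

    // The boring production version: append one JSON line per event.
    fn log_append(path: &str, event: &str) -> std::io::Result<()> {
        let mut file = OpenOptions::new().create(true).append(true).open(path)?;
        writeln!(file, "{}", serde_json::to_string(&event)?)
    }

The naive version does work in an example, which is presumably why it shows up, but its I/O cost grows with the size of the log on every single event.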


There are a few languages where this is not too tedious (although other things tend to be a bit more tedious than needed in those)

The main problem with these is how you actually get the verification needed when data comes in from outside the system. Check with the database every time you want to turn a string/UUID into an ID type? That can get prohibitively expensive.
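
To illustrate the tension, a minimal Rust sketch (the type and function names are hypothetical, and the database is faked with a HashSet): a newtype constructor can cheaply prove a value is well-formed, but proving it actually refers to an existing record requires a round-trip.

    use std::collections::HashSet;

    // Cheap, offline: only guarantees the value is *well-formed*.
    struct UserId(u64);

    impl UserId {
        fn parse(raw: &str) -> Option<UserId> {
            raw.parse::<u64>().ok().map(UserId)
        }
    }

    // Stronger claim: the id *exists*. Only obtainable by asking the
    // system of record, which is the expensive part.
    struct VerifiedUserId(UserId);

    fn verify(id: UserId, db: &HashSet<u64>) -> Option<VerifiedUserId> {
        db.contains(&id.0).then(|| VerifiedUserId(id))
    }

    fn main() {
        let db: HashSet<u64> = [1, 2, 3].into_iter().collect();
        let id = UserId::parse("2").expect("well-formed id");
        assert!(verify(id, &db).is_some());
    }

The `verify` round-trip is exactly the per-conversion check that becomes prohibitive if done every time.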


The OP is the author of grugbrain.dev

What you think is an absurd question may not be as absurd as it seems, given the trillions of tokens of data on the internet, including its darkest corners.

In my experience, it's better to simply try using LLMs in areas where they don't have a lot of training data (e.g. reasoning about the behaviour of terraform plans). It's not a hard cutoff of being _only_ able to reason about already-solved things, but it's not too far off as a first approximation.

The researchers took existing, known problems and parameterised their difficulty [1]. While most of these are by no means easy for humans, the interesting observation to me was that the failure point N was not proportional to the complexity of the problem, but correlated more with how commonly solution "printouts" for that size of the problem can be encountered in the training data. For example, "Towers of Hanoi", which has published solutions for a variety of sizes, kept working up to a very large number of steps N, while the river crossing, which is almost entirely absent from the training data for N larger than 3, failed above pretty much that exact number.

[1]: https://machinelearning.apple.com/research/illusion-of-think...


It doesn't help that, thanks to RLHF, every time a good example of this gains popularity (e.g. "How many Rs are in 'strawberry'?"), it quickly gets snuffed out. If I worked at a company with an LLM product, I'd build tooling to look for these kinds of examples on social media or directly in usage data so they can be prioritized for fixes. I don't know how to feel about this.

On the one hand, it's sort of like red teaming. On the other hand, it clearly gives consumers a false sense of ability.


Indeed. Which is why I think the only way to really evaluate the progress of LLMs is to curate your own personal set of example failures that you don't share with anyone else and only use it via APIs that provide some sort of no-data-retention and no-training guarantees.


Vibe-wise, it seems like progress is slowing down and recent models aren't substantially better than their predecessors. But it would be interesting to take a well-trusted benchmark and plot max_performance_until_date(foreach month). (Too bad aider's benchmark changed recently and doesn't cover many older models; https://aider.chat/docs/leaderboards/by-release-date.html has not been updated in a while with newer models, and the new benchmark doesn't have the classics such as GPT-3.5, GPT-3.5 Turbo, GPT-4, or Claude 3 Opus.)


I think that we can't expect continuous progress either, though. Often in computer science it's more discrete and unexpected: computer chess was basically stagnant until one team made a breakthrough, and even the evolution of species often behaves in a punctuated way rather than as a sum of many small adaptations. I'm much more interested in (worried about) what the world will be like in 30 years than in the next 5.


It's hard to say. Historically, new discoveries in AI often generated great excitement and high expectations, followed by some progress, then stalling, disillusionment, and an AI winter. Maybe this time it will be different. Either way, what has been achieved so far is already a huge deal.


Why dagger and not just... any language? (Nushell for example https://www.nushell.sh/)


Because I'm typically building and running tests in containers in CI, which is what dagger is for.

nu is my default shell. Note that I am not talking about dagger shell. https://dagger.io/blog/a-shell-for-the-container-age-introdu...


This is why TC39 needs to work on fundamental language features like protocols. In Rust, you can define a new trait and impl it for existing types. This still has flaws (the orphan rule prevents issues but causes bloat), but in a dynamic language with unique-symbol capabilities it would definitely be easier to come up with something.


Dynamic languages don't need protocols. If you want to make an existing object "conform to Disposable", you can:

    function DisposableImageBitmap(bitmap) {
      // add a [Symbol.dispose] method unless the object already provides one
      bitmap[Symbol.dispose] ??= () => bitmap.close()
      return bitmap
    }
    
    using bitmap = DisposableImageBitmap(await createImageBitmap(image))
Or if you want to ensure every ImageBitmap conforms to Disposable:

    ImageBitmap.prototype[Symbol.dispose] = function() { this.close() }
But this does leak the "trait conformance" globally; it's unsafe because we don't know whether some other code wants its own implementation of dispose injected into this class, whether we'd end up fighting over it, whether some key iteration is going to get confused, etc...

How would a protocol work here? To say something like "oh in this file or scope, `ImageBitmap.prototype[Symbol.dispose]` should be value `x` - but it should be the usual `undefined` outside this scope"?


You could potentially use the module system to bring protocol implementations into scope. This could finally solve the monkey-patching problem. But it's a fairly novel idea, TC39 is risk-averse, the browser side is feature-averse, and the language has complexities that create issues with most of the more interesting ideas.
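
For a sense of what "bringing an implementation into scope" looks like, Rust (mentioned upthread) already gates trait method calls on the trait being imported: the impl itself is global under the coherence rules, but the method is only callable where the trait is in scope. A minimal illustrative sketch (using String purely for illustration):

    mod disposable {
        pub trait Dispose {
            fn dispose(&mut self);
        }

        // Conformance for an existing type.
        impl Dispose for String {
            fn dispose(&mut self) {
                self.clear();
            }
        }
    }

    fn main() {
        // The method call only compiles because the trait is in scope here.
        use disposable::Dispose;
        let mut s = String::from("resource");
        s.dispose();
        // In a module that doesn't `use disposable::Dispose`,
        // `s.dispose()` would be a compile error.
    }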


Isn't disconnecting a resize observer a poor example of this feature?


I couldn't come up with a reasonable one off the top of my head, but it's for illustration - please swap in a better web api in your mind

(edit: changed to ImageBitmap)


How are async closures / closure types, especially WRT future pinning?


While I'd like to have it, it doesn't stop me from writing a great deal of production code without those niceties.

When it came time for me to undo all the async-trait library hack stuff I wrote after the feature landed in stable, I realized I wasn't really held back by not having it.


Async closures landed in stable recently and have been a nice QoL improvement, although I had gotten used to working around their absence well enough previously that they haven't been revolutionary yet from an "enabling new architectural patterns" perspective or anything like that.

I very rarely have to care about future pinning, mostly just to call the pin macro when working with streams sometimes.
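
For reference, a minimal sketch of both points (assuming the futures 0.3 crate and a Rust version with stable async closures, 1.85+):

    use futures::{stream, StreamExt};
    use std::pin::pin;

    async fn demo() {
        // Async closures: called like a closure, but the call returns a future.
        // This one borrows `base` from the environment.
        let base = 10u32;
        let add_base = async |x: u32| x + base;
        assert_eq!(add_base(32).await, 42);

        // The pin-macro-with-streams case: this unfold stream holds an async
        // block internally, so it is !Unpin, and StreamExt::next() requires
        // the stream to be pinned first.
        let counter = stream::unfold(0u32, |n| async move {
            if n < 3 { Some((n, n + 1)) } else { None }
        });
        let mut counter = pin!(counter);

        while let Some(n) = counter.next().await {
            println!("got {n}");
        }
    }

    fn main() {
        futures::executor::block_on(demo());
    }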


I'm an incredibly happy user of nushell, which brings the best features of other shells (terse pipelining syntax) and the best features of better-designed scripting languages (functions, modules, lexical scope, data structures, completely optional types) in one awesome package that also comes with editor (LSP) support and excellent documentation.

https://www.nushell.sh/

(The intro page may be a bit misleading. You can freely mix and match existing, unstructured commands with nushell's built-in structured commands in a pipeline, as long as you convert to/from string streams - it's not mandatory to use the structured built-ins. For example, if an existing CLI tool has JSON output, you can use `tool | from json` to turn it into structured data. There are also commands like `detect columns` that parse classic column output, and so on - the tools for mixing and matching structured and unstructured data are convenient and expressive.)

Some highlights:

- automatic command-line argument parsing and help generation by defining a main function and adding comments to each argument - e.g. https://github.com/nushell/nushell/discussions/11969

- run commands with controlled parallelism: https://www.nushell.sh/commands/docs/par-each.html

- easy parsing of raw input https://www.nushell.sh/commands/docs/parse.html

- support for a wide variety of data formats https://www.nushell.sh/commands/categories/formats.html

- built-in support for talking to SQLite databases: https://www.nushell.sh/book/loading_data.html#sqlite

edit: it looks like Mitchell Hashimoto was recently impressed too (https://x.com/mitchellh/status/1907849319052386577). There's also a rich functional programming library that blends with the pipeline syntax: https://www.nushell.sh/book/nushell_map_functional.html

Addendum: It's not my login shell. I run it ad hoc as soon as the command pipeline I'm writing starts getting too complicated, or to write scripts (which of course can be run from outside nushell too, as long as they have the correct shebang).


Google doesn't really say it can't find an answer; instead it finds less relevant (irrelevant) search results. LLMs hallucinate, while search engines display irrelevance.

