Other than possibly losing social capital, why does this surprise you? They are intimately familiar with the intricacies of how the technology is used to hoard and catalogue every aspect of our digital lives.
Tree-sitter is more or less a universal AST parser you can run queries against. Writing queries against an AST that you incrementally rebuild is massively more powerful and precise for generating the correct context than hand-writing endless shell-pipeline one-liners and trying to handle all of the edge cases yourself.
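For concreteness, a rough sketch of what that looks like with the py-tree-sitter bindings; the tree_sitter_python grammar package and the exact constructor signatures are assumptions that vary between binding versions, so treat this as illustrative rather than definitive:

    # Illustrative only: binding APIs have shifted across versions, so the
    # constructors and return shapes below are approximate.
    import tree_sitter_python as tspython
    from tree_sitter import Language, Parser

    PY_LANGUAGE = Language(tspython.language())
    parser = Parser(PY_LANGUAGE)

    source = b"def greet(name):\n    return 'hi ' + name\n"
    tree = parser.parse(source)

    # Queries use tree-sitter's S-expression pattern language; this one
    # captures every function name in the file.
    query = PY_LANGUAGE.query(
        "(function_definition name: (identifier) @func.name)"
    )
    captures = query.captures(tree.root_node)  # return shape varies by version
    print(captures)

    # On edit, you reparse against the old tree (after tree.edit(...)), which
    # is the "incrementally rebuild" part:
    # new_tree = parser.parse(new_source, tree)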
I agree with you, but the question is more whether existing LLMs have enough training on AST queries to be more effective with that approach. It's not like LLMs were designed to be precise in the first place.
I see where you're coming from, but I think teasing out something that looks like a clear objective function that generalizes to improved intelligence from LLM interaction logs is going to be hellishly difficult. Consider that most of the best LLM pretraining comes from being very, very judicious with the training data. Selecting the right corpus of interaction logs and then defining an objective function that correctly models...? Being helpful? That sounds far harder than just working from scratch with RLHF.
The way I see it, you use hindsight rather than coming in with predefined criteria. The criterion is the usefulness of an LLM response in the interactions that follow it down the line.
For example, the model might propose "try doing X", and I come back later and say "I tried X but this and that happened"; it can use that as feedback. The feedback might come from the real-world outcomes of the X suggestion, or even from my own experience; maybe I have seen X in practice and know whether it works. The longitudinal analysis can span multiple days, and the more context, the better for this kind of self-analysis.
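As a toy illustration of deriving such a label from later turns (the phrase matching here is a made-up placeholder; a real setup would use a human or a model-based judge):

    # Toy hindsight labeler: score an assistant suggestion by what the user
    # reports about it in later turns. The phrase lists are placeholders.
    def hindsight_score(suggestion: str, followups: list[str]) -> tuple[str, float]:
        positive = ("that worked", "fixed it", "this helped")
        negative = ("didn't work", "that failed", "made it worse")
        score = 0.0
        for msg in followups:
            text = msg.lower()
            score += sum(phrase in text for phrase in positive)
            score -= sum(phrase in text for phrase in negative)
        return suggestion, score  # keep the label attached to the response it scores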
The cool thing is that generating preference scores for LLM responses, training a judge model on them, and then doing RLHF with this judge model on the base LLM ensures isolation, so personal data leaks might not be an issue. Another beneficial effect is that the judge model learns to transfer its judgment skills across similar contexts, so there might be some generalization going on.
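To make the judge-model step concrete, here's a minimal sketch in plain PyTorch using the standard pairwise (Bradley-Terry) preference loss; the encoder, hidden size, and input fields are placeholders rather than any particular library's API:

    # Sketch of a judge/reward model trained on hindsight preference pairs.
    # `encoder` stands in for any text encoder producing one vector per response.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class JudgeModel(nn.Module):
        def __init__(self, encoder: nn.Module, hidden_size: int):
            super().__init__()
            self.encoder = encoder
            self.score_head = nn.Linear(hidden_size, 1)

        def forward(self, input_ids, attention_mask):
            hidden = self.encoder(input_ids, attention_mask)  # (batch, hidden_size)
            return self.score_head(hidden).squeeze(-1)        # scalar score per response

    def preference_loss(score_chosen, score_rejected):
        # Bradley-Terry objective: the response judged more useful in hindsight
        # should receive the higher score.
        return -F.logsigmoid(score_chosen - score_rejected).mean()

A judge trained like this would then stand in as the reward model during the RLHF step described above.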
Of course there is always the risk of systematic bias and random noise in the data, but I believe AI researchers are equipped to deal with that. It won't be as simple as I've described, but the size of the interaction dataset, the human in the loop, and real-world testing are certainly useful for LLMs.
Not in my experience. Running qwen3:32b is good, but it's not as coherent or useful as 3.5 at a 4-bit quant. The gap is a lot narrower than it was with Llama 70B, though.
I think you’re arguing against something the author didn’t actually say.
You seem to be claiming that this is a binary, that either we will or won't use LLMs, but the author is mostly talking about risk mitigation.
By analogy, it seems like you're saying the author is fundamentally against the development of the motor car because they've pointed out that some have exploded, whereas before we had horses, which didn't explode, and maybe we should work on making the cars explode less before we fire up the glue factories.