Middling performance? Do you actually understand the benchmarks you saw? Assuming you even read them. 88% on HumanEval is not middling lmao. Fuck, I really have seen everything.
But this is not raw Reflexion (it's not a result from the paper, but from follow-on work). The project uses significantly more scaffolding to guide the agent through the code generation problem. They design special prompts including worked examples to steer the model into generating test cases, prompt it separately to generate a function body, run the generated code through those tests, collate the test results to make self-reflection easier, off-load the decision of whether to submit the code or try to refine it to hand-crafted logic, and so on.
This is hardly an example of minimal hand-holding. I'd go so far as to say this is MORE hand-holding than in the paper this thread is about.
For me, an unsupervised pipeline is not hand-holding. The thoughts drive actions. If you can't control how those thoughts form or process memories, then I don't see what is hand-holding about it. A pipeline is one-and-done.
I would say that if you have to direct the steps of the agent's thought process:
- Generate tests
- Run tests (performed automatically)
- Gather results (performed automatically)
- Evaluate results, branch to either accept or refine
- Generate refinements
etc., then that's hand-holding. It's task-specific reasoning that the agent can't perform on its own. It presents a big obstacle to extending the agent to more complex domains, because you'd have to hand-implement a new guided thought process for each new domain, and as the domains become more complex, so do the necessary thought processes. The sketch below makes the point concrete.
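Here's a minimal sketch of what that scaffolding amounts to, assuming a hypothetical `ask_llm(prompt) -> str` helper standing in for whatever completion API you use; the prompts and the accept/refine rule here are illustrative, not the project's actual ones:

```python
from typing import Callable

def run_tests(candidate_src: str, test_src: str) -> list[str]:
    """Run each assert line against the candidate, collating every failure."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # define the candidate function
    except Exception as e:
        return [f"candidate failed to load: {type(e).__name__}: {e}"]
    failures: list[str] = []
    for line in test_src.splitlines():
        if not line.strip().startswith("assert"):
            continue  # sketch assumes one self-contained assert per line
        try:
            exec(line, namespace)
        except Exception as e:
            failures.append(f"{line.strip()!r} -> {type(e).__name__}: {e}")
    return failures

def solve(problem: str, ask_llm: Callable[[str], str], max_rounds: int = 3) -> str:
    # Step 1: hand-written prompt (with worked examples, in the real project)
    # steering the model to generate unit tests first.
    tests = ask_llm(f"Write assert-based unit tests, one per line, for:\n{problem}")
    # Step 2: separate prompt for the function body itself.
    code = ask_llm(f"Implement this function:\n{problem}")
    for _ in range(max_rounds):
        failures = run_tests(code, tests)  # steps 3-4: run tests, collate results
        if not failures:                   # step 5: hand-crafted accept/refine rule
            return code
        # Step 6: feed the collated failures back for self-reflection.
        code = ask_llm(
            f"Your code:\n{code}\nfailed these tests:\n" + "\n".join(failures)
            + "\nReflect on why, then return a corrected implementation."
        )
    return code  # stop refining and submit the last attempt
```

Note where the control flow lives: when to test, when to refine, when to stop is all decided by the hand-written Python, not by the model.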
You can call it hand-holding. Or call it having control over the direction of the LLM's 'thought'.
You can train another LLM that creates the hand-holding pipeline steps. Then the combined system, LLM squared, can be treated as a new LLM.
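If you wanted to sketch that idea, it might look like one meta-call that asks the model to write the pipeline, plus a generic executor that runs it. `ask_llm` is the same hypothetical helper as above, and the JSON step format is purely an assumption for illustration:

```python
import json
from typing import Callable

def build_and_run_pipeline(task: str, ask_llm: Callable[[str], str]) -> str:
    # Meta-step: the model designs the hand-holding instead of a human.
    spec = ask_llm(
        "Return only a JSON list of prompt strings that would guide a model "
        f"through this task, in order:\n{task}"
    )
    steps = json.loads(spec)  # assumes the model actually returned valid JSON
    context = task
    for step in steps:  # generic executor: each step sees the prior output
        context = ask_llm(f"{step}\n\nContext so far:\n{context}")
    return context
```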