Middling performance? Do you actually understand the benchmarks you saw? Assuming you even read them. 88% on HumanEval is not middling lmao. Fuck, I really have seen everything.
But this is not raw Reflexion (it's not a result from the paper, but from follow-on work). The project uses significantly more scaffolding to guide the agent through the code generation problem. They design special prompts including worked examples to steer the model into generating test cases, prompt it separately to generate a function body, run the generated code through those tests, collate the test results to make self-reflection easier, off-load the decision of whether to submit the code or try to refine it to hand-crafted logic, and so on.
This is hardly an example of minimal hand-holding. I'd go so far as to say this is MORE hand-holding than in the paper this thread is about.
For me, an unsupervised pipeline is not hand-holding. The thoughts drive actions. If you can't control how those thoughts form or process memories, then I don't see what is hand-holding about it. A pipeline is one-and-done.
I would say that if you have to direct the steps of the agent's thought process:
- Generate tests
- Run tests (performed automatically)
- Gather results (performed automatically)
- Evaluate results, branch to either accept or refine
- Generate refinements
etc., then that's hand-holding. It's task-specific reasoning that the agent can't perform on its own. It presents a big obstacle to extending the agent to more complex domains, because you'd have to hand-implement a new guided thought process for each new domain, and as the domains become more complex, so do the necessary thought processes. The sketch below makes the point concrete.
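Here's a minimal sketch of what that scaffolding amounts to, assuming a hypothetical `ask_llm(prompt) -> str` helper standing in for whatever completion API you use; the prompts and the accept/refine rule here are illustrative, not the project's actual ones:

```python
from typing import Callable

def run_tests(candidate_src: str, test_src: str) -> list[str]:
    """Run each assert line against the candidate, collating every failure."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # define the candidate function
    except Exception as e:
        return [f"candidate failed to load: {type(e).__name__}: {e}"]
    failures: list[str] = []
    for line in test_src.splitlines():
        if not line.strip().startswith("assert"):
            continue  # sketch assumes one self-contained assert per line
        try:
            exec(line, namespace)
        except Exception as e:
            failures.append(f"{line.strip()!r} -> {type(e).__name__}: {e}")
    return failures

def solve(problem: str, ask_llm: Callable[[str], str], max_rounds: int = 3) -> str:
    # Step 1: hand-written prompt (with worked examples, in the real project)
    # steering the model to generate unit tests first.
    tests = ask_llm(f"Write assert-based unit tests, one per line, for:\n{problem}")
    # Step 2: separate prompt for the function body itself.
    code = ask_llm(f"Implement this function:\n{problem}")
    for _ in range(max_rounds):
        failures = run_tests(code, tests)  # steps 3-4: run tests, collate results
        if not failures:                   # step 5: hand-crafted accept/refine rule
            return code
        # Step 6: feed the collated failures back for self-reflection.
        code = ask_llm(
            f"Your code:\n{code}\nfailed these tests:\n" + "\n".join(failures)
            + "\nReflect on why, then return a corrected implementation."
        )
    return code  # stop refining and submit the last attempt
```

Note where the control flow lives: when to test, when to refine, when to stop is all decided by the hand-written Python, not by the model.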
You can call it hand-holding. Or call it having control over the direction of the LLM's 'thought'.
You can train another LLM that creates the hand-holding pipeline steps. Then the combined system, LLM squared, can be treated as a new LLM.
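If you wanted to sketch that idea, it might look like one meta-call that asks the model to write the pipeline, plus a generic executor that runs it. `ask_llm` is the same hypothetical helper as above, and the JSON step format is purely an assumption for illustration:

```python
import json
from typing import Callable

def build_and_run_pipeline(task: str, ask_llm: Callable[[str], str]) -> str:
    # Meta-step: the model designs the hand-holding instead of a human.
    spec = ask_llm(
        "Return only a JSON list of prompt strings that would guide a model "
        f"through this task, in order:\n{task}"
    )
    steps = json.loads(spec)  # assumes the model actually returned valid JSON
    context = task
    for step in steps:  # generic executor: each step sees the prior output
        context = ask_llm(f"{step}\n\nContext so far:\n{context}")
    return context
```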