What this does is optimize a prompt given a dataset (here Banking77). (The example is taken from the new website, https://dspy.ai/#__tabbed_3_3, and simplified a bit.)
The optimizer, BootstrapFewShot, simply selects a bunch of random subsets from the training set, and measures which gives the best performance on the rest of the dataset when used as few-shot examples.
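Roughly, that setup might look like the following minimal sketch (assuming the current DSPy API; the signature, metric and tiny trainset here are invented stand-ins):

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# Assumes a LiteLLM-style model string; swap in whatever model you use.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# A tiny classifier over a string signature (field names invented for the sketch).
classify = dspy.ChainOfThought("text -> label")

def accuracy(example, prediction, trace=None):
    return example.label == prediction.label

# Stand-in for the Banking77 trainset: a few dspy.Example objects.
trainset = [
    dspy.Example(text="I lost my card", label="lost_or_stolen_card").with_inputs("text"),
    dspy.Example(text="How do I top up by bank transfer?", label="top_up").with_inputs("text"),
]

optimizer = BootstrapFewShot(metric=accuracy, max_bootstrapped_demos=4)
compiled_classify = optimizer.compile(classify, trainset=trainset)
```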
There are also fancier optimizers, including ones that first optimize the prompt and then use the improved model as a teacher to optimize the weights. This has the advantage that you don't need to pay for a super long prompt on every inference call.
dspy has more cool features, such as the ability to train a large composite LLM program "end to end", similar to backprop.
The main advantage, imo, is just not having "stale" prompts everywhere in your code base. You might have written some neat few-shot examples for the middle layers of your pipeline, but then you change something at the start, and you have to manually rewrite all the examples for every other module. With dspy you just keep your training datasets around, and the rest is automated.
I use DSPy often, and it’s the only framework that I have much interest in using professionally.
Evaluations are first class and have a natural place in optimization. I still usually spend some time adjusting initial prompts, but more time doing traditional ML things… like working with SMEs, building training sets, evaluating models and developing the pipeline. If you’re an ML engineer who’s frustrated by the “loose” nature of developing applications with LLMs, I recommend trying it out.
With assertions and suggestions, there are also additional pathways you can use to enforce constraints on the output and build in requirements from your customer.
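A sketch of what that can look like, assuming the 2.4/2.5-era assertions API (the module and constraints are invented for illustration):

```python
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class ConstrainedSummary(dspy.Module):
    def __init__(self):
        super().__init__()
        self.summarize = dspy.ChainOfThought("document -> summary")

    def forward(self, document):
        result = self.summarize(document=document)
        # Suggest = soft constraint: on failure, DSPy backtracks and retries with this feedback.
        dspy.Suggest(len(result.summary) <= 280, "Keep the summary under 280 characters.")
        # Assert = hard constraint: fails the call once the retry budget is exhausted.
        dspy.Assert("as an ai" not in result.summary.lower(), "No boilerplate disclaimers.")
        return result

program = ConstrainedSummary().activate_assertions()
```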
Every time I check the docs, I feel like it obfuscates so many things that it puts me off and I decide to just not try it out.
Behind the scenes it's using LLMs to find the proper prompting. I find that it uses terminology and abstractions that are way too complicated for what it is.
What do you actually use it for? I've never been able to actually get it to perform on anything remotely close to what it claims. Sure, it can help optimize few-shot prompting... but what else can it reliably do?
It isn’t for every application, but I’ve used it for tasks like extraction, summarization and generating commands where you have specific constraints you’re trying to meet.
Most important to me is that I can write evaluations based on feedback from the team, build them into the pipeline using suggestions, and track them with LLM-as-a-judge (and other) metrics. With some of the optimizers, you can use stronger models to help propose and test new instructions for your student model to follow, as well as optimize the N-shot examples to use in the prompt (the MIPROv2 optimizer).
It’s not that a lot of that can’t be done other ways, but as a framework it provides a non-trivial amount of value to me when I’m trying to keep track of requirements that grow over time, instead of playing whack-a-mole in the prompt.
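For reference, a rough sketch of that pattern (hypothetical signatures and data; MIPROv2's exact arguments differ between versions):

```python
import dspy
from dspy.teleprompt import MIPROv2

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# LLM-as-a-judge metric: a judge program grades the student's output
# (you could point the judge at a stronger model via dspy.context(lm=...)).
judge = dspy.ChainOfThought("question, answer -> meets_requirements: bool")

def judge_metric(example, prediction, trace=None):
    verdict = judge(question=example.question, answer=prediction.answer)
    return bool(verdict.meets_requirements)

program = dspy.ChainOfThought("question -> answer")
trainset = [
    dspy.Example(question="What is our refund window?", answer="30 days").with_inputs("question"),
]

# MIPROv2 proposes candidate instructions and few-shot demos, scored by the metric.
optimizer = MIPROv2(metric=judge_metric, auto="light")
optimized = optimizer.compile(program, trainset=trainset, requires_permission_to_run=False)
```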
Yeah so like I said, I get the in-context optimization bit…which is nice, but pretty limited.
I have had precisely zero success with the LLM-prompt-writer elements. I would love to be wrong, but DSPy makes huge promises and falls painfully short on basically all of them.
Every time I've seen a dspy article, I end up thinking: ok, but what does it do exactly?
I've been using guidance, outlines, GBNF grammars, etc. What advantage does dspy have over those alternatives?
I've learnt that the best package for using LLMs is just Python. These "LLM packages" just make it harder to do customizations, as they all make opinionated assumptions and decisions.
Question from a casual AI user, if you have a minute. It seems to me that I could get much more productive by making my own personal AI "system". For example, write a simple pipeline where Claude would scrutinize OpenAI's answers and vice versa.
Are there any beginner-friendly Python packages that you would recommend to facilitate fast experimentation with such ideas?
Not the person you asked, but I will take a shot at your question. The best Python "package" for getting what you want out of these systems is original-gangster Python itself, with libraries to help with your goals.
For your example: write a Python script with requests that hits the OpenAI API. You can even hardcode the API key, because it's just a script on your computer! Now you have the GPT-4proLight-mini-deluxe response in JSON. You can pipe that into a bazillion and one different places, including another API request to Anthropic. Once that returns, you now have TWO LLM responses to analyze.
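A bare-bones version of that script might look like this (endpoints and fields follow the public OpenAI and Anthropic REST APIs; the model names and prompt are just examples):

```python
import requests

OPENAI_KEY = "sk-..."        # hardcoded, because it's just a script on your computer
ANTHROPIC_KEY = "sk-ant-..."

question = "Explain CRDTs in two sentences."

# 1) Ask OpenAI.
openai_resp = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {OPENAI_KEY}"},
    json={"model": "gpt-4o-mini",
          "messages": [{"role": "user", "content": question}]},
).json()
answer = openai_resp["choices"][0]["message"]["content"]

# 2) Pipe the answer into Anthropic and ask Claude to scrutinize it.
claude_resp = requests.post(
    "https://api.anthropic.com/v1/messages",
    headers={"x-api-key": ANTHROPIC_KEY, "anthropic-version": "2023-06-01"},
    json={"model": "claude-3-5-sonnet-latest", "max_tokens": 1024,
          "messages": [{"role": "user",
                        "content": f"Critique this answer to '{question}':\n\n{answer}"}]},
).json()
print(claude_resp["content"][0]["text"])
```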
I tried haystack, langchain, txtai, langroid, CrewAI, Autogen, and more that I am forgetting. One day while I was reading r/LocalLLaMA, someone wrote: "All these packages are TRASH, just write Python!"... Lightbulb moment for me. Duh! Now I don't need to learn a massive framework only to use 1/363802983th of it while cursing that I can't figure out how to make it do what I want it to do.
Just write Python. I tell you, that has been massive for my usage of these LLMs outside of chat interfaces like LibreChat and OpenWebUI. You can even have Claude or DeepSeek write the script for you. That often gets me within striking distance of what I really want to achieve at that moment.
I've had good luck with a light "shim layer" library that handles the actual interfacing with the api and implements the plumbing on any fun new features that get introduced.
I've settled on the Mirascope library (https://mirascope.com/), which suits my use cases and lets me implement structured inputs/outputs via pydantic models, which is nice. I really like using it, and the team behind it is really responsive and helpful.
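From memory, structured extraction with Mirascope's v1 decorators looks roughly like this (a sketch; check the docs for the exact decorator signature, and the model name is just an example):

```python
from mirascope.core import openai
from pydantic import BaseModel

class Book(BaseModel):
    title: str
    author: str

# response_model tells Mirascope to parse the completion into the pydantic model.
@openai.call("gpt-4o-mini", response_model=Book)
def extract_book(text: str) -> str:
    return f"Extract the book from this text: {text}"

book = extract_book("The Name of the Wind, by Patrick Rothfuss.")
print(book.title, "-", book.author)
```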
That being said, Pydantic just released an AI library of their own (https://ai.pydantic.dev/) that I haven't checked out, but I'd love to hear from someone who has! Given their track record, it's certainly worth keeping an eye on.
I assume you know how to program in Python? I would start with just the client libraries of the model providers you want to use. LLMs are conceptually simple when treated as black boxes. String in, string out. You don't necessarily need a framework.
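For example, with nothing but the provider's own client library (model name is just an example):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize the plot of Dune in one sentence."}],
)
print(resp.choices[0].message.content)  # string in, string out
```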
You can have a look at Langroid -- it's an agent-oriented LLM programming framework from CMU/UW-Madison researchers. We started building it in Apr 2023 out of frustration with the bloat of then-existing libs.
In langroid you set up a ChatAgent class which encapsulates an LLM-interface plus any state you'd like. There's a Task class that wraps an Agent and allows inter-agent communication and tool-handling. We have devs who've found our framework easy to understand and extend for their purposes, and some companies are using it in production (some have endorsed us publicly). A quick tour gives a flavor of Langroid: https://langroid.github.io/langroid/tutorials/langroid-tour/
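A rough sketch of that shape, based on the current Langroid API (the config fields and messages here are invented; see the tour for real examples):

```python
import langroid as lr
import langroid.language_models as lm

# One agent wrapping an LLM, plus a Task to drive it.
llm_config = lm.OpenAIGPTConfig(chat_model="gpt-4o-mini")
agent = lr.ChatAgent(
    lr.ChatAgentConfig(
        llm=llm_config,
        system_message="You critique answers for factual errors.",
    )
)
task = lr.Task(agent, interactive=False, single_round=True)
result = task.run("Critique: 'The Great Wall of China is visible from the Moon.'")
print(result.content)
```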
> For example, write a simple pipeline where Claude would scrutinize OpenAI's answers and vice versa.
I'm working on a naive approach to identifying errors in LLM responses, which I talk about at https://news.ycombinator.com/item?id=42313401#42313990 and which can be used to scrutinize responses. It's written in JavaScript, though, but you will be able to create a new chat by calling an HTTP endpoint.
I'm hoping to have the system in place in a couple of weeks.
It's based on pydantic and aims to make writing LLM queries as easy/compact as possible by using type annotations, including for structured outputs and streaming. If you use it please reach out!
I tried many Python frameworks, but the lack of customization and observability limited their utility. Now I only use Instructor (Jason Liu's library). That, and concurrent.futures for parallel processing.
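That combination stays pretty compact; something like this (the extraction schema and inputs are invented for the sketch):

```python
import concurrent.futures

import instructor
from openai import OpenAI
from pydantic import BaseModel

class Person(BaseModel):
    name: str
    age: int

client = instructor.from_openai(OpenAI())

def extract(text: str) -> Person:
    # response_model makes Instructor parse and validate the output into the pydantic model.
    return client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=Person,
        messages=[{"role": "user", "content": f"Extract the person from: {text}"}],
    )

texts = ["Alice is 31.", "Bob turned 45 last week."]
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    people = list(pool.map(extract, texts))
```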
You could use the plain API libraries for each LLM plus IPython notebooks. Conceptually, each cell can be a node or link in the prompt chain, and the input/output of each cell is printable and visible, so you can check which part of the chain is failing or producing suboptimal outputs.
Yes, Anthropic just released the Model Context Protocol, and MCP is perfect for this kind of thing. I actually wrote an MCP server for Claude to call out to OpenAI just yesterday.
I've seen a couple of talks on DSPy and tried to use it for one of my projects, but the structure always feels somewhat strained. It seems to be suited for tasks that are primarily "show, don't tell", but what do you do when you have significant prior instruction you want to tell?
e.g. tests I want applied to anything retrieved from the database. What I'd like is to optimise the prompt around those (or maybe even the tests themselves), but I can't seem to express that in DSPy signatures.
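One place to put that kind of prior instruction is a class-based signature's docstring, which optimizers such as MIPROv2 treat as the instruction text they can rewrite. A hedged sketch with invented field names:

```python
import dspy

class CheckRetrievedRecord(dspy.Signature):
    """Apply these checks to any record retrieved from the database:
    1. Dates must be ISO-8601.
    2. The customer_id must match the one in the question.
    Reject the record and explain why if any check fails."""

    question: str = dspy.InputField()
    record: str = dspy.InputField(desc="raw record retrieved from the database")
    verdict: str = dspy.OutputField(desc="'accept' or 'reject', plus the reason")

checker = dspy.ChainOfThought(CheckRetrievedRecord)
```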
And how do you have the chat history end up as an input field, instead of each request/response being prepared individually for the LLM endpoint?
I'm new to the library, but from what I can see the ChatAdapter will do this automatically if I use the forward call.
Quoting the docs: "Though rarely needed, you can write custom LMs by inheriting from dspy.BaseLM. Another advanced layer in the DSPy ecosystem is that of adapters, which sit between DSPy signatures and LMs. A future version of this guide will discuss these advanced features, though you likely don't need them."
I could be wrong; I could be looking for complexity where there is none. Have I fundamentally misunderstood how this all works? It sometimes feels like I have.
I wouldn't try to convince the naysayers. I have a few comments (or opinions, rather), coming from a recent DSPy user hitting its pain points:
- The GitHub page is very busy.
- A clear example should come up early on the page. It's only when I got to the fiddle that I could see a motivating example, i.e. the extractions, functions and tests.
- Then a section for running tests/evaluations.
- Then deployment, or running with/without the BAML CLI.
- I do wonder if all the functions have to be so tightly coupled with the model. In DSPy my modules are model-agnostic and I can evaluate behaviour across different models. It's not so clear how to do this here.
Can someone explain what DSPy does that fine-tuning doesn’t? Structured I/O, optimized for better results, sure. But why not just go straight to the weights, instead of trying to optimize the few-shot space?
The main idea behind DSPy is that you can’t modify the weights, but you can perhaps modify the prompts. DSPy’s original primary customer was multi-LLM-agent systems, where you have a chain/graph of LLM calls (perhaps mostly or entirely to OpenAI GPT) and some metric (perhaps vague) that you want to increase. While the idea may seem a bit weird, there have been various success stories, such as a UoT team winning a medical-notes-oriented competition using DSPy: https://arxiv.org/html/2404.14544v1
It has multiple optimization strategies. One is optimizing the few-shot list. Another is to let the model write prompts and pick the best one based on the given eval. I find the latter much more intriguing, although I have no idea how practical it is.
How does it work? I can see the goal and the results, but is it in fact the case that it's still "LLMs all the way down"? That is, is there a supplementary bot here that's fine-tuned on DSPy syntax, doing the actual work of turning the code into a prompt? I'm trying to figure out how else it would work. But if that is the case, this really feels like a Wizard-of-Oz, man-behind-the-curtain thing.
I think you read it right. It is, in my mind, a kind of wishcasting that adding other modeling on top of LLMs can improve their use, but the ideas all sound like playing with your food at best, and deliberately confusing people to prey on their excitement at worst.
My go-to framework. I wish we could use global metrics in DSPy, for example an F1 score over the whole evaluation set (instead of per-query metrics, as it is at the moment). The recent async support has been a lifesaver.
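A rough workaround sketch (the label fields are invented): run the compiled program over the devset yourself and compute the corpus-level score outside dspy.Evaluate.

```python
def corpus_f1(program, devset):
    # Micro-averaged F1 over the whole devset, rather than a per-example metric.
    tp = fp = fn = 0
    for example in devset:
        pred = program(**example.inputs())
        predicted = set(pred.labels)   # assumes a multi-label output field called `labels`
        gold = set(example.labels)
        tp += len(predicted & gold)
        fp += len(predicted - gold)
        fn += len(gold - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```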
DSPy is both great and frustrating at times. It's designed around single prompt-and-response turns; TextGrad and DSPy share this design decision, and this is where the frustration begins.
I'd like the modules (DSPy-speak for an inference step) to be aware of the chat history. Without this, I'm not able to test the accuracy of the responses in the various chat contexts in which they appear.
It's been asked for in the TextGrad and DSPy GitHub issues.