Show HN: OpenLLMetry – OpenTelemetry-based observability for LLMs (github.com/traceloop)
154 points by nirga on Oct 11, 2023 | 31 comments
Hey HN, Nir, Gal and Tomer here. We’re open-sourcing a set of extensions we’ve built on top of OpenTelemetry that provide visibility into LLM applications - prompts, vector DB calls, and more. Here’s the repo: https://github.com/traceloop/openllmetry.

There’s already a decent number of tools for LLM observability, some open-source and some not. But what we found missing in all of them is that they’re closed-protocol by design, locking you into their observability platform or their proprietary framework for running your LLMs.

It’s still early in the gen-AI space, so we think it’s the right time to define an open protocol for observability. That’s why we built OpenLLMetry. It extends OpenTelemetry and provides instrumentations for LLM-specific libraries that automatically monitor and trace prompts, token usage, embeddings, etc.

Two key benefits with OpenTelemetry are (1) you can trace your entire system execution, not just the LLM (so you can see how requests to DBs, or other calls affect the overall result); and (2) you can connect to any monitoring platform—no need to adopt new tools. Install the SDK and plug it into Datadog, Sentry, or both. Or switch between them easily.

We’ve already built instrumentations for LLM providers like OpenAI, Anthropic and Cohere, vector DBs like Pinecone, and LLM frameworks like LangChain and Haystack. And we’ve built an SDK that makes it easy to use all of these instrumentations in case you’re not too familiar with OpenTelemetry.
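
To give a feel for it, here’s a rough sketch of using the SDK with OpenAI (Traceloop.init and the workflow decorator come from our docs linked below; treat the exact signatures as illustrative):

    # Rough sketch; see the docs for the exact API.
    import openai
    from traceloop.sdk import Traceloop
    from traceloop.sdk.decorators import workflow

    # One init call sets up the LLM instrumentations and an OTLP exporter,
    # so traces can go to any OpenTelemetry-compatible backend.
    Traceloop.init(app_name="joke_app")

    @workflow(name="tell_joke")
    def tell_joke(topic: str) -> str:
        # The OpenAI instrumentation records the prompt, model and token
        # usage as span attributes automatically.
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": f"Tell me a joke about {topic}"}],
        )
        return response["choices"][0]["message"]["content"]

    print(tell_joke("OpenTelemetry"))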

Everything is written in Python (with TypeScript around the corner) and licensed under Apache-2.0.

We’re using this SDK for our own platform (Traceloop), but our hope is that OpenLLMetry can evolve and thrive independently, giving everyone (including our users) the power of choice. We’ll be working with the OpenTelemetry community to get this to become a first-class citizen of OpenTelemetry.

Would love to hear your thoughts and opinions!

Check it out -

Docs: https://www.traceloop.com/docs/python-sdk/introduction

GitHub: https://github.com/traceloop/openllmetry




LLM observability strikes me as an extremely, extremely crowded space. And YC has funded an enormous number of them [0].

What do you think is the key differentiator between you and everyone else? Is vendor lock-in really that huge of an issue?

[0] https://hegel-ai.com, https://www.vellum.ai/, https://www.parea.ai, http://baserun.ai, https://www.trychatter.ai, https://talc.ai, https://github.com/BerriAI/bettertest, https://langfuse.com


Note that these products aren't the same, even though they all fall under the category of observability - similarly to how you'd use Datadog AND Sentry, although both can be called "observability platforms".

I do think vendor lock-in is a key differentiator, which is one of the reasons OpenTelemetry succeeded in the first place. I know that my previous company switched to OpenTelemetry for exactly this reason. You get the flexibility of using any platform you want (since we're compatible with OpenTelemetry), so you're not locked into a specific platform with specific capabilities. Why use any of the ones you mention - maybe Datadog is enough if your use case is simple?

But there are more advantages: you get much more than observability of the LLM itself - you can see calls to vector DBs, network calls, DB queries, etc. This can be extremely useful IMO for RAG and autonomous agents, for example.


Just to add to Nir's answer here:

Let's say your application takes several steps to build up a prompt dynamically, such as a RAG pipeline. Depending on the application, you could end up with a different prompt for each user.

The result is you've likely increased the accuracy of the LLM, but at the expense of understanding the whole system's behavior by introducing more steps upstream of the LLM call. Those steps could be super simple, or they could be (like in our case) dozens of steps that could all potentially fail or have a bug or whatever.

And so how do you wrangle all of this in context? You need something like OpenLLMetry that treats a call to an LLM as one of several components that make up a request and/or user experience. Otherwise you're just throwing stuff at the wall, guessing at what could improve things (or guessing at what could make an eval score better).
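
Concretely, the shape is something like this with plain OpenTelemetry spans - one trace per user request, with each pipeline step as a child span (the functions here are just stand-ins for a hypothetical RAG pipeline):

    from opentelemetry import trace

    tracer = trace.get_tracer("rag_pipeline")

    # Stubs standing in for real retrieval / prompt-building / LLM code.
    def retrieve_documents(question: str) -> list:
        return ["some retrieved context"]

    def build_prompt(question: str, docs: list) -> str:
        return f"Context: {docs}\nQuestion: {question}"

    def call_llm(prompt: str) -> str:
        return "an answer"

    def answer_question(question: str) -> str:
        with tracer.start_as_current_span("answer_question"):
            with tracer.start_as_current_span("retrieve_documents"):
                docs = retrieve_documents(question)
            with tracer.start_as_current_span("build_prompt"):
                prompt = build_prompt(question, docs)
            with tracer.start_as_current_span("llm_call"):
                # With the LLM libraries instrumented, this span also carries
                # the prompt and token usage, so a bad answer can be traced
                # back to the step that caused it.
                return call_llm(prompt)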


Any thoughts of contributing this upstream directly or to CNCF?

We would be interested in hosting and supporting this type of work.

You can reach out to me via cra@linuxfoundation.org if you want to chat


Sure, would love to! I'll ping you.


What is the difference from using OpenLLMetry versus using OTel directly? Is the issue that there aren't conventions for the needed attributes?


2 differences:

1. With vanilla OTel you don't have instrumentations for libraries like OpenAI, LangChain, etc., so you need to manually open spans.

2. As you said, there are no semantic conventions for logging things like prompts and chains.

What we did is define a new set of semantic conventions and build the instrumentations. But we're using vanilla OpenTelemetry, so it's fully compatible with standard OpenTelemetry.
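
To illustrate the difference (the attribute keys below are just examples of the kind of conventions we defined; the real ones live in the repo):

    from opentelemetry import trace

    # Manual approach with vanilla OpenTelemetry: wrap every call yourself
    # and pick your own attribute keys.
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("openai.chat") as span:
        span.set_attribute("llm.vendor", "OpenAI")
        span.set_attribute("llm.request.model", "gpt-3.5-turbo")
        span.set_attribute("llm.temperature", 0.7)
        # ... make the actual OpenAI call here and record token usage ...

    # With OpenLLMetry, the instrumentation creates those spans and
    # attributes automatically for every OpenAI call after this line
    # (package name as in the repo):
    from opentelemetry.instrumentation.openai import OpenAIInstrumentor

    OpenAIInstrumentor().instrument()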


Cool! It looks like you effectively do auto instrumentation. Have you found there to be interesting nuances between LLM providers? Tracing is great, and trace aggregates (with context!) cross-vendor would be even more awesome.


Wow, where do I start? The APIs are not that similar, but we're trying to use the same set of semantic conventions for everyone, so for example you'll always get the model version or the temperature in the same attribute. Which kinda means it's identical cross-vendor, at least on the o11y side.

Here are all the semantic conventions we've defined so far - https://github.com/traceloop/openllmetry/tree/main/packages/...


Pretty neat! I assume it's just measuring traces right now? Any plans to add some top level metrics like build times, prompt length, etc?


Yes, only traces for now. We do want to send out metrics for prompt length, token usage, etc. like you mentioned. Hopefully they'll be available soon (and we welcome contributions :) )


Hello,

Is it possible to use Traceloop's LLM instrumentations with an already existing OpenTelemetry implementation?


Yes, of course. The LLM instrumentations work just like all other OpenTelemetry instrumentations.
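
Something like this, assuming you already have a TracerProvider configured (the instrumentor class comes from our repo; the tracer_provider kwarg is the usual OpenTelemetry instrumentation pattern):

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
    from opentelemetry.instrumentation.openai import OpenAIInstrumentor

    # Whatever OpenTelemetry setup you already have.
    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)

    # Point the LLM instrumentation at your existing provider; spans from
    # OpenAI calls then flow through the same processors and exporters as
    # everything else.
    OpenAIInstrumentor().instrument(tracer_provider=provider)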


Thank you,

Does it work on Azure OpenAI calls through LangChain? It seems it didn't work for me, or I'm missing something.


It should work, but LangChain has many quirks so it can depend on which syntax you're using. Ping us on Slack and we'll assist -

https://join.slack.com/t/traceloopcommunity/shared_invite/zt...



Great idea!

Observability (AKA a debug/proxy/statistics/logging/visualization layer) -- for LLMs (AKA chat AIs)...

Hmmm, you know, I would love something for ChatGPT (and other AI chatbots) -- where you could open a second tab or window -- and see (and potentially interact with) debug info and statistics from prompts given to that AI in its main input window, in realtime...

Sort of like what Unix's STDERR is for programs running on Unix -- but an "AI STDERR" AKA debug channel, for AIs...

I'm guessing (but not knowing) that in the future, there will be standards defined for debug interfaces to AIs, standards defined for the data formats and protocols traversing those interfaces, and standards defined for such things as error, warning, hint, and informational messages...

Oh sure, a given AI company could define its own interfaces, data protocols, and ways of interpreting that data.

But if so, that "AI debug interface" -- wouldn't be universal.

Of course, on the flip side, if a universal "AI debug interface" were ever established, perhaps such a thing would eventually suffer from the complexities, over-engineering and bloatedness that plague many "designed-by-committee" standards in today's world.

So, it will be interesting to see what the future holds...

To take an Elon Musk quote and twist it around (basically abuse it! <g>):

"Proper engineering of future designed-by-committee standards with respect to AI interfaces and protocols is NOT guaranteed -- but excitement is!"

:-) <g> :-)

Anyway, with respect to the main subject/article/authors, it's a very interesting and future-thinking idea what you're doing, you're breaking new ground, and I wish you all of the future success with your company, business, product and product ideas!


Thanks! Related to what you're saying, I was actually expecting some reactions from devs who'd ask "why is it a separate repo and not part of OpenTelemetry from day 1?".

And for that my answer would be that I think having a separate repo allows this to evolve in a more natural way, and faster (whereas OpenTelemetry, given its massive adoption already, evolves much slower, with committees etc.).

Then, at some point when this is stabilized and useful - we can merge.

Kind of like Tesla's NACS vs. CCS


Any smooth way to get this to work with JavaScript? Would love to use this in a project but my inference calls are all in JS.


Definitely! (Tomer from Traceloop here)

We've already started developing the typescript SDK. Would love to see exactly what your use case is, so we can prioritize specific instrumentation and collaborate on it. We'll ping you.


Nice!

Does Traceloop support the OpenTelemetry Protocol File Exporter?

I'm the maintainer of Insomnium (https://github.com/ArchGPT/insomnium) and I'm building a LanceDB-based prompt orchestration framework for automated software development that I'm integrating into Insomnium over the next few weeks. (The orchestration framework will also be open-source soon.) Traceloop cloud looks good, but I think for simple cases my users will prefer a 100% local solution.

Would be nice to have a simple API to export locally; thanks!


Yes, since we're using vanilla OpenTelemetry, you can set your exporter to whatever you want, including the OpenTelemetry Protocol File Exporter. But I'd still use some sort of dashboard, like Jaeger, or one of the open-source observability platforms you can run locally like SigNoz or HyperDX.
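
For example, a fully local setup could point a standard OTLP exporter at a locally running Jaeger (or SigNoz/HyperDX); the endpoint below is just the default OTLP gRPC port:

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

    # Everything stays on your machine: spans go to a local collector
    # (e.g. Jaeger with OTLP enabled on port 4317). Swap in a console or
    # file exporter if you don't want to run a server at all.
    provider = TracerProvider()
    provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
    )
    trace.set_tracer_provider(provider)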


Any plans for pgvector? Grafana Tempo?


Hey Gal from Traceloop here,

We definitely have pgvector on our roadmap (which tbh I think we'd better publish in the repo). For Grafana Tempo, it's just a matter of making sure that it works as a destination - we'll do it today/tomorrow.


Will vLLM be supported as well?


Hey it's Gal from Traceloop,

That's a good question tbh. I wonder whether we should implement instrumentations for LLM "hosting solutions" or for specific LLMs (e.g. LLaMA/Falcon) and ignore the hosting solution (not sure that's even possible though, as it sort of dictates the inference API).

wdyt?


Love it!


Would've preferred LLMetry, My Dear Watson.


but it's open! :)


Worst pun ever, starred.


> observability

I really don't like that word for some reason. It's abstracting away something simple. Logs? Graphs? Debug data? Telemetry data? There are way better words for "this".



