
I think that's the key benefit of using OpenTelemetry - it's pretty efficient and the performance footprint is negligible.


Thanks for spotting those! We'll fix them asap


I think you can (pretty) easily set this up with an otel collector and something that replays data from S3 - there's a native implementation that converts otel to clickhouse


Our scenario would be more like using Clickhouse / a dwh for session cohort/workflow filtering and then populating otel tools for viz goodies. Interestingly, to your point, the otel python exporter libs are pretty simple, so SQL results -> otel spans -> Grafana temp storage should be simple!
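
Something like this for the "SQL results -> otel spans" hop (a sketch only - `rows` and its fields are made up, and the exporter endpoint is a placeholder):

    # Sketch: replay rows from a warehouse query as OTel spans.
    # `rows` and its fields are hypothetical; timestamps are epoch nanoseconds.
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

    provider = TracerProvider()
    provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True)))
    tracer = provider.get_tracer("warehouse-replay")

    rows = [{"name": "workflow.step", "start_ns": 1700000000_000000000,
             "end_ns": 1700000001_500000000, "session_id": "abc"}]

    for row in rows:
        span = tracer.start_span(row["name"], start_time=row["start_ns"])
        span.set_attribute("session.id", row["session_id"])
        span.end(end_time=row["end_ns"])

From there any OTLP-capable backend (e.g. Tempo behind a collector) can take over for the viz side.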


You can do it and it's a good approach - in our experiments it catches most errors. You don't even need to use different models - even re-running the same workflow with the same model (I don't mean asking "are you sure?") will give you good results. The only problem is that it's super expensive to run on all your traces, so I wouldn't recommend it as a monitoring tool.
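
Roughly what I mean by re-running the same workflow (a sketch - `run_workflow` is a placeholder for whatever LLM pipeline you're checking, and the agreement check is deliberately naive):

    # Sketch: self-consistency check by re-running the same workflow N times.
    def looks_consistent(run_workflow, prompt, runs=3):
        answers = [run_workflow(prompt) for _ in range(runs)]
        # Naive agreement check; in practice compare semantically
        # (embeddings, exact match on extracted fields, etc.).
        return len({a.strip().lower() for a in answers}) == 1

    # Example: flagged = not looks_consistent(my_pipeline, user_prompt)

Since this multiplies cost by the number of re-runs, you'd sample a small fraction of traffic rather than run it on every trace.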


Thanks! It can vary greatly between use cases - but we've seen extremely high detection rates for tagged texts (>95%). In production this gets trickier since you don't know what you don't know (so it's hard to tell how many "bad examples" we're missing). Our false positive rate (examples tagged as bad that turned out to be fine) has been around 2-3% of all examples tagged as bad (positive), and we're always working to decrease it.
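
To make that concrete (the counts below are purely illustrative, matching the rough figures above):

    # Hypothetical counts, only to illustrate the definition above.
    flagged_bad = 1000      # examples our checks tagged as bad
    actually_fine = 25      # of those, how many turned out to be fine
    false_positive_share = actually_fine / flagged_bad   # 0.025, i.e. ~2-3%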


You're right. We faced those same issues. So we plan to send prompts and completions as log events that reference the trace/span, rather than putting them on the span itself.

The span then only contains the most important data - the prompt template, the model that was used, token usage, etc. You can then split the metadata (spans and traces) and the large payloads (prompts + completions) into different data stores.
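
A rough sketch of the "reference to the trace/span" part, using the stdlib logger plus the current span's context (how the log records get shipped and stored is up to you):

    # Sketch: keep the span lean, emit the heavy payload as a log record
    # that carries the trace/span IDs so the two can be joined later.
    import logging
    from opentelemetry import trace

    logger = logging.getLogger("llm.payloads")

    def log_completion(prompt, completion):
        ctx = trace.get_current_span().get_span_context()
        logger.info("llm completion", extra={
            "trace_id": format(ctx.trace_id, "032x"),
            "span_id": format(ctx.span_id, "016x"),
            "prompt": prompt,
            "completion": completion,
        })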


It's the same logic as saying you don't want to use a computer to monitor or test your code, since that would mean a computer monitoring a computer. AI is a broad term - I agree you can use GPT (or any LLM) to grade an LLM accurately, but that's not the only way to monitor.


> computer to monitor or test your code since it will mean that a computer will monitor a computer

I mean... you don't trust the computer in that case, you trust the person who wrote the test code. Computers do what they're told to do, so there's no trust required of the computer itself. If you swap out the person (that you're trusting) writing that code with an AI writing that test code, then it's closer to your analogy - and in that case, I (and the guy above me, it seems) wouldn't trust it for anything impactful.

Even if you're not using an LLM specifically (which no one in this chain even said you were), an AI built from some training set to eliminate hallucinations is still just an AI. So you're still using an AI to keep an AI in check, which raises the question (posed above): what keeps your AI in check?

Poking fun at a chain of AI's all keeping each other in check isn't really a dig at you or your company. It's more of a comment on the current industry moment.

Best of luck to you in your endeavor anyway, by the way!


Thanks! I wasn’t offended or anything, don’t get the wrong impression.

What strikes me as odd is the claim that an AI checking an AI is inherently an issue. AI can mean a lot of things - an encoder architecture, a neural network, or a simple regression function. And at the end of the day, similar to what you said, there was a human building and fine-tuning that AI.

Anyway, this feels more like a philosophical question than an engineering one.


I'm sorry but this is not what we do. We don't use LLMs to grade your LLM calls.


I think LLMs hallucinate by design. I'm not sure we'll ever get to 0% hallucinations, and we should be OK with that (at least for the coming years?). So getting an alert on a single hallucination becomes less interesting. What's more interesting is knowing the rate at which this happens, and keeping track of whether that rate increases or decreases over time or with changes to models.
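
Something as simple as this is what I have in mind for rate tracking (a sketch - `traces` and its fields are made up, and it assumes some upstream check already flagged each trace):

    # Sketch: per-day hallucination rate from already-evaluated traces.
    from collections import defaultdict

    def hallucination_rate_by_day(traces):
        totals, flagged = defaultdict(int), defaultdict(int)
        for t in traces:
            day = t["timestamp"][:10]              # e.g. "2024-01-31"
            totals[day] += 1
            flagged[day] += int(t["hallucination"])
        return {day: flagged[day] / totals[day] for day in totals}

    # Alert on a sustained rise (or a jump after a model/prompt change),
    # not on any single hallucinated response.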


I think it depends on the use case and how you define hallucinations. We've seen our metrics perform well (= correlate with human feedback) for use cases like summarization, RAG question-answering pipelines, and entity extraction.

At the end of the day, things like "answer relevancy" are pretty binary, in the sense that for a human evaluator it's usually clear whether an answer actually addresses the question or not.

I wonder if you can elaborate on why you claim that hallucinations can't be detected with any certainty.

