Show HN: R2R – Open-source framework for production-grade RAG (github.com/sciphi-ai)
167 points by ocolegro 11 months ago | 57 comments
Hello HN, I'm Owen from SciPhi (https://www.sciphi.ai/), a startup working on simplifying Retrieval-Augmented Generation (RAG). Today we’re excited to share R2R (https://github.com/SciPhi-AI/R2R), an open-source framework that makes it simpler to develop and deploy production-grade RAG systems.

Just a quick reminder: RAG helps Large Language Models (LLMs) use current information and specific knowledge. For example, it allows a programming assistant to use your latest documents to answer questions. The idea is to gather all the relevant information ("retrieval") and present it to the LLM with a question ("augmentation"). This way, the LLM can provide answers (“generation”) as though it was trained directly on your data.
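
In code terms, the flow boils down to something like this (illustrative pseudocode; `search_index` and `llm` are hypothetical placeholders, not R2R APIs):

    # Retrieval: find the chunks most relevant to the user's question
    chunks = search_index.search(question, top_k=3)
    # Augmentation: put the retrieved text into the prompt
    context = "\n\n".join(c.text for c in chunks)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    # Generation: the LLM answers as though it had been trained on your data
    answer = llm.complete(prompt)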

The R2R framework is a powerful tool for addressing key challenges in deploying RAG systems, avoiding the complex abstractions common in other projects. Through conversations with numerous developers, we discovered that many were independently developing similar solutions. R2R distinguishes itself by adopting a straightforward approach to streamline the setup, monitoring, and upgrading of RAG systems. Specifically, it focuses on reducing unnecessary complexity and enhancing the visibility and tracking of system performance.

The key parts of R2R are: an Ingestion Pipeline that transforms different data types (like json, txt, pdf, html) into 'Documents' ready for embedding; an Embedding Pipeline that turns text into vector embeddings through a series of steps (text extraction, transformation, chunking, and embedding); and a RAG Pipeline that follows the same steps as the embedding pipeline but adds an LLM provider to generate text completions.
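
As a rough sketch of how those stages fit together (pseudocode only; the parser, chunker, and vector store here are hypothetical stand-ins, not R2R's actual interfaces):

    # Ingestion Pipeline: normalize raw files (json, txt, pdf, html) into Documents
    documents = [parse(path) for path in ["report.pdf", "notes.txt"]]

    # Embedding Pipeline: extract, transform, chunk, then embed each Document
    for doc in documents:
        for chunk in chunk_text(doc.text, size=512, overlap=64):
            vector_db.upsert(embed(chunk), metadata={"doc_id": doc.id})

    # RAG Pipeline: reuse the search half of the embedding pipeline, then add an LLM completion
    hits = vector_db.search(embed(query), top_k=5)
    answer = llm.complete(prompt_with_context(query, hits))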

R2R is currently in use at several companies building applications ranging from B2B lead generation to consumer education tools.

Our GitHub repo (https://github.com/SciPhi-AI/R2R) includes basic examples for application deployment and standalone use, demonstrating the framework's adaptability in a simple way.

We’d love for you to give R2R a try, and welcome your feedback and comments as we refine and develop it further!




Is there a roadmap for planned features in the future? I wouldn't call this a "powerful tool for addressing key challenges in deploying RAG systems" right now. It seems to do the most simple version of RAG that the most basic RAG tutorial teaches someone how to do with a pretty UI over it.

The key challenges I've faced around RAG are things like:

- Only works on text based modalities (how can I use this with all types of source documents, including images)

- Chunking "well" for the type of document (by paragraph, csvs including header on every chunk, tables in pdfs, diagrams, etc). The rudimentary chunk by character with overlap is demonstrably not very good at retrieval

- the R in rag is really just "how can you do the best possible search for the given query". The approach here is so simple that it is definitely not the best possible search results. It's missing so many known techniques right now like:

    - Generate example queries that the chunk can answer and embed those to search against.

    - Parent document retrieval

    - so many newer better Rag techniques have been talked about and used that are better than chunk based

    - How do you differentiate "needs all source" vs "find in source" questions? Think: Summarize the entire pdf, vs a specific question like how long does it take for light to travel to the moon and back?
- Also other search approaches like fuzzy search/lexical-based approaches. And ranking them based on criteria like (user query is one word, use fuzzy search instead of semantic search). Things like that

So far this platform seems to just lock you into a really simple embedding pipeline that only supports the most simple chunk based retrieval. I wouldn't use this unless there was some promise of it actually solving some challenges in RAG.


Thanks for taking the time to provide your candid feedback, I think you have made a lot of good points.

You are correct that the options in R2R are fairly simple today - Our approach here is to get input from the developer community to make sure we are on the right track before building out more novel features.

Regarding your challenges:

- Only works on text based modalities (how can I use this with all types of source documents, including images)

  For the immediate future R2R will likely remain focused on text, but you are right that the problem gets even more challenging when you introduce the idea of images. I'd like to start working on multi-modal soon.
- Chunking "well" for the type of document (by paragraph, csvs including header on every chunk, tables in pdfs, diagrams, etc). The rudimentary chunk by character with overlap is demonstrably not very good at retrieval

  This is very true - a short/medium term goal of mine is to integrate some more intelligent chunking approaches, ranging from Vikp's Surya to Reducto's proprietary model. I'm also interested in exploring what can be done from the pure software side.
- the R in rag is really just "how can you do the best possible search for the given query". The approach here is so simple that it is definitely not the best possible search results. It's missing so many known techniques right now like:

    - Generate example queries that the chunk can answer and embed those to search against.

    - Parent document retrieval

    - so many newer better Rag techniques have been talked about and used that are better than chunk based

    - How do you differentiate "needs all source" vs "find in source" questions? Think: Summarize the entire pdf, vs a specific question like how long does it take for light to travel to the moon and back?
You mentioned "Generate example queries", there is already an example that shows how to generate and search over synthetic queries w/ minor tweaks to the basic pipeline [https://github.com/SciPhi-AI/R2R/blob/main/examples/academy/...].

I think the other approaches you outline are all worth investigating as well. There is definitely a tension we face between building and testing new experimental approaches vs. figuring out what features people need in production and implementing those.

Just so you know where we are heading - we want to make sure all the features are there for easy experimentation, but we also want to provide value in production and beyond. As an example, we are currently working on robust task orchestration to accompany our pipeline abstractions to help with ingesting large quantities of data, as this has been a pain point in our own experience and that of some of our early enterprise users.


Nice, thanks for the reply. Glad to hear you are looking into these challenges and plan to tackle some of them. Will keep my eye on the repo for some of these improvements in the future.

And totally agree, the scaling out of ingesting large quantities of data is a hard challenge as well and it does make sense to work on that problem space too. Sounds like that is a higher priority at the moment which is totally fine.


No worries, thanks again for the thoughtful feedback.

We are also very interested in the more novel RAG techniques, so I'm not sure that one is necessarily a higher priority than the other.

We've just gotten more immediate feedback from our early users around the difficulties of ingesting data in production and there is less ambiguity around what to build.

Out of your previous list, is there one example that you think would be most useful for the next addition to the framework?


Well, as someone building something similar I have been looking around at how people are tackling the problem of varied index approaches for different files, and again how that can scale.

I haven't read the code on your github but the readme mentions using qdrant/pgvector. I'm curious how you will tackle having that scale to billions of files with tens/hundreds/etc. of different indexing approaches for each file. It doesn't feel tenable to keep it in a single postgres instance as it will just grow and grow forever.

Think even a very simple example of more indexes per file: having chunk sizes of 20/500/1000 along with various overlaps of 50/100/500. You suddenly have a large combination of indexes you need to maintain and each is basically a full copy of the source file. (You can imagine indexes for BM25, fuzzy matching, lucene, etc...)

You could be brute force ish and always run every single index mode for every file until a better process exists to only do the best ones for a specific file. But even if you narrowed it down a file could want 5 different index types searched and ranked for Retrieval step.

I want to know how people plan to shard/make it possible to have so many search indexes on all their data and still be able to query against all of it. Postgres will eventually run out of space even on the beefiest cloud instance fairly quickly.

The second biggest thing is then to tackle how to use all of those indexes well in the Retrieval step. Which indexes should be searched against/weighted and how given the user query/convo history?


You are both right about chunking, and I think it is one of the main challenges. Regarding more intelligent chunking approaches, I think you should give preprocess.co a try. It's able to preprocess and chunk PDFs, Office files, and HTML content. It follows the original document layout while considering the content semantics, so you get optimal chunks.


Do you know of any open source project which does support the extra functionality around the different approaches to embedding / queries?


LlamaIndex (and I think LangChain) does these things.


This is my problem with every end to end system I've seen around this. I find that, even building these systems from scratch, all of the hard parts are just normal data infrastructure problems. The "AI" part takes a small fraction of the effort to deliver even when just building the RAG part directly on top of huggingface/transformers.

I also have dealt with what you're describing, but then it goes much farther when going to prod IME. The ingestion part is even more messy in ways these kinds of platforms don't seem to help with. When managing multiple tools in prod with overlapping and non-constant data sources (say, you have two tools that need to both know the price of a product, which can change at any time), I need both of those to be built on the same source of truth and for that source of truth to be fed by our data infra in real time, where relevant documents need to be replaced in real time in more or less an atomic way.

Then, I have some tools that have varying levels of permissioning on those overlapping data sources, say, you have two tools that exist in a classroom, one that helps the student based on their work, and another that is used by the TA or teacher to help understand students' answers in a large course. They have overlapping data needs on otherwise private data, and this kind of permissioning layer which is pretty trivial in a normal webapp has, IME, had to have been implemented basically from scratch on top of the vector db and retrieval system.

Then experimentation, eval, testing, and releases are the hardest and most underserved. It was only relatively recently that it seemed like anyone even seemed to be talking about eval as a problem to aspire to solve. There's a pretty interesting and novel interplay of the problems of production ML eval, but with potentially sparse data, and conventional unit testing. This is the area we had to put the most of our own thought into for me to feel reasonably confident in putting anything into prod.

FWIW we just built our own internal platform on top of langchain a while back, seemed like a good balance of the right level of abstraction for our use cases, solid productivity gains from shared effort.

I think this is a really interesting problem space, but yeah, I'm skeptical of all of these platforms as they seem to always be promising a lot more than they're delivering. It looks superficially like there has been all of this progress on tooling, but I built a production service based on vector search in 2018 and it really isn't that much easier today. It works better because the models are so much better, but the tools and frameworks don't help that much with the hard parts, to my surprise honestly.

Perhaps I'm just not the user and am being excessively critical, but I keep having to deal with execs and product people throwing these frameworks at us internally without understanding the alignment between what is hard about building these kinds of services in prod and what these kinds of tools make easier vs harder.


This is AMAZING feedback and it is on brand with what I've heard from a number of builders. Thanks for sharing your experiences here.

The infra challenges are real - they are what I have been struggling with the most in providing high-quality support for early users. Most want to be able to reliably firehose 10-100s of GBs of data through a brittle multistep pipeline. This was something I struggled with when building AgentSearch [https://huggingface.co/datasets/SciPhi/AgentSearch-V1] with LOCAL data - so introducing the networking component only makes things that much harder.

I think we have a lot of work to do to robustly solve this problem, but I'm confident that there is an opportunity to build a framework that results in net positives for the developer.

FWIW, your feedback would be invaluable as the project continues to grow.


With the "production-grade" part of the title, I was hoping to see bit more about scalability, fault tolerance, updating continually-changing sources of data, A/Bing new versions of models, slow rollout, logging/analytics, work prioritization/QOS, etc. It seems like the lack of these kind of features is where a lot of the toy/demo stacks aren't really prepared for production. Any thoughts on those topics?


This is a great question, thanks for asking.

We are testing workflows internally that use orchestration software like Hatchet/Temporal to allow the framework to robustly handle 100s of GBs of uploaded data from parsing to chunking to embedding to storing [1][2]. The goal is to build durable execution at each step, because even steps like PDF extraction can be expensive / time consuming. We are targeting a preliminary release of these features in < 1 month.
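
Conceptually (this is not Hatchet's or Temporal's actual API, just the shape of the durable-execution idea):

    # Each stage is an idempotent task; a checkpoint store records what finished,
    # so a crash mid-pipeline resumes from the last completed step instead of restarting.
    STAGES = [parse_pdf, chunk, embed, store]  # hypothetical stage functions

    def run_durable(doc_id, checkpoints):
        for stage in STAGES:
            if checkpoints.done(doc_id, stage.__name__):
                continue  # already completed on a previous attempt
            retry(stage, doc_id, attempts=3, backoff=2)  # hypothetical retry helper
            checkpoints.mark_done(doc_id, stage.__name__)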

Logging is built natively into the framework with postgres or sqlite options. We ship a GUI that leverages these logs and the application flow to allow developers to see queries, search results, and RAG completions in realtime.

We are planning on adding more features here to help with evaluation / insight as we get further feedback.

On the A/B, slow rollout, and analytics side, we are still early but suspect there is a lot of value to be had here, particularly because human feedback is pretty crucial in optimizing any RAG system. Developer feedback will be particularly important here since there are a lot of paths to choose between.

[1] https://hatchet.run/ [2] https://temporal.io/


Do you have any insights to share around chunking and labeling strategies for ingestion and embeddings? I remember Qdrant had some interesting abilities to tag vectors with extra information. To be more specific, the issues I see are context-aware paragraph chunking and keyword or entity extraction. How do you see this general issue?


This is the hardest part to get right (if it can be gotten 'right'), and the thing I'm always curious about.

All of these RAG solutions are implementing it with vector search, but I'm not sure it's the be all end all solution for this.


Hybrid search is definitely worth exploring (e.g. adding in TF-IDF). I believe there is such an implementation out of the box with Weaviate.

I have tried many techniques and seen others try many different techniques. I think the hardest part is selecting the RIGHT technique. This is why it is somewhat easy to deploy a RAG pipeline but very hard to optimize one. It's hard to understand why it's failing and the global implications of design choices you make in your ingestion / embedding process.
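
As a rough illustration of the hybrid idea (the dense scores here are placeholders for whatever your embedding search returns):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = ["chunk one text ...", "chunk two text ..."]
    query = "user question"

    # Lexical side: TF-IDF cosine similarity
    vec = TfidfVectorizer().fit(docs)
    lexical = cosine_similarity(vec.transform([query]), vec.transform(docs))[0]

    # Dense side: similarity scores from your embedding search (placeholder values)
    dense = np.array([0.71, 0.42])

    # Weighted blend; the 50/50 split is just a starting point to tune
    hybrid = 0.5 * lexical + 0.5 * dense
    ranked = np.argsort(hybrid)[::-1]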


I’ll give you an analogy. We can imagine the content ingested is like a textbook, which would have a table of contents and an index. Now we look up a topic: if we find it in the TOC, we should likely read that chapter in whole; if we find it in the index, we likely read all the chapters where it’s mentioned.

I’d suggest RAG might perform better if it worked somewhat like that: the chunks for embeddings should be paragraph- and sentence-aware, and ideally should be tagged with any existing TOC or natural sections/headings that exist in the document. This approach would allow retrieval logic that provides cohesive information, like an entire chapter, or at least the 3 paragraphs before and 3 after the matched vector.


Are you suggesting adding another step after retrieval, to add the surrounding context before submitting it to the LLM?


Yes, absolutely. Often, understanding the context of a topic requires more than just the specific paragraph in which it is discussed, and LLMs are actually perfectly capable of understanding fairly complex concepts if provided the raw material properly.
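
A minimal sketch of that post-retrieval expansion step (the chunk store and metadata fields are assumptions about how chunks were stored at ingestion):

    # After vector search, pull in the neighbors of each hit so the LLM sees
    # a cohesive passage instead of an isolated fragment.
    def expand(hit, chunk_store, before=3, after=3):
        doc_id, idx = hit.metadata["doc_id"], hit.metadata["chunk_index"]
        window = range(max(0, idx - before), idx + after + 1)
        return " ".join(chunk_store.get(doc_id, i).text for i in window)

    context = "\n\n".join(expand(h, chunk_store) for h in hits)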


This seems similar to recursive summary indexing and then a choice of which linked chunks to load upon a match

There are a lot of 'policy' choices here, so I've been curious how to automatically decide. The KG, GNN, and IR literature seems to have a lot of techniques relating to this, so it's been very non-obvious to me.


> Qdrant I remember had some interesting abilities to tag vectors with extra information.

like what? adding metadata to vectors isn't by any means new, as other reply says that's just "hybrid search"


https://qdrant.tech/documentation/concepts/payload/

Perhaps it’s not that different from what other solutions have, but my impression was that their payload system is more flexible and extensible than competing options, especially with the speed that Qdrant delivers. For example, pg_vector can leverage all of postgres's capabilities, but I don’t believe it can compete with Qdrant on vector-specific problems.
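
For reference, the payload is arbitrary JSON attached to each point that you can filter on at query time; a minimal sketch with the qdrant-client Python package:

    from qdrant_client import QdrantClient, models

    client = QdrantClient(":memory:")  # local in-memory instance for a quick test
    client.create_collection(
        collection_name="chunks",
        vectors_config=models.VectorParams(size=3, distance=models.Distance.COSINE),
    )

    # Attach a payload (source, section heading, etc.) to each vector
    client.upsert(
        collection_name="chunks",
        points=[models.PointStruct(
            id=1,
            vector=[0.1, 0.2, 0.3],
            payload={"source": "handbook.pdf", "section": "refunds"},
        )],
    )

    # Filter on the payload while searching
    hits = client.search(
        collection_name="chunks",
        query_vector=[0.1, 0.2, 0.3],
        query_filter=models.Filter(must=[
            models.FieldCondition(key="section", match=models.MatchValue(value="refunds")),
        ]),
        limit=5,
    )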


qdrant vs pgvector remains to be benchmarked for a range of relevant sizes, please drop any if you see them!


the only time we've found pgvector to be preferable to qdrant is when the per-category embeddings are very few. E.g. in cases where filtering is expected to reduce the dataset greatly for each query (>99.9%).

This is when the relational nature of postgres really shines - you can even do in-mem calculations for distance.
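
Concretely, the pattern is a normal WHERE clause that shrinks the candidate set before ordering by distance; a quick sketch with psycopg2 and pgvector's cosine-distance operator (table and column names are made up):

    import psycopg2

    conn = psycopg2.connect("dbname=rag")  # assumes the pgvector extension is installed
    cur = conn.cursor()

    # The category filter cuts the candidates to a handful of rows,
    # so exact distance over the remainder is cheap.
    cur.execute(
        """
        SELECT id, body
        FROM chunks
        WHERE category = %s
        ORDER BY embedding <=> %s::vector   -- cosine distance (pgvector)
        LIMIT 5
        """,
        ("pricing", "[0.12, 0.05, 0.33]"),
    )
    rows = cur.fetchall()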


Qdrant does this as well, with 100% accuracy. But it's not on small categories, but rather on small segments.

You could put different categories in different shards though, resulting in the same effect.

You can also specify exact=true to do the same on any size, but this can get rather slow when your collection grows.

Disclaimer: I'm part of Qdrant.


I find that ingesting and chunking PDF textbooks automatically creates more of a fuzzy keyword index than a high level conceptual knowledge base. Manually curating the text into chunks and annotating high level context is an improvement, but it seems like chunks should be stored as a dependency tree so that, regardless of delineation, on retrieval the full context is recovered.


How is it different from https://github.com/pinecone-io/canopy?


First pass feedback on differences is that R2R is building with all database / llm providers in mind.

Further, it seems Canopy has picked some pretty different abstractions to focus on. For instance, they mention `ChatEngine` as a core abstraction, whereas R2R attempts to be a bit more agnostic.

That being said, there are definitely some commonalities, so thanks for sharing this repo! I will be sure to give it a deep dive.


Cool! I enjoyed speaking with you about our RAG pipeline for call transcripts a week or so back. Will check out the launch


awesome! Please take a look and let me know what you think.


R2R uses deepeval for their evaluation :) https://github.com/confident-ai/deepeval


Is there an API that could be used? I have a use case that I'm integrating into a larger software package, but wouldn't be using a cli/web app for that.


Yes, the framework is designed to directly deploy a FastAPI application with an associated Python client.

You can see the client here - [https://github.com/SciPhi-AI/R2R/blob/main/r2r/client/base.p...].
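
Roughly, calling it from another service looks like this (a simplified sketch; the endpoint paths here are illustrative, see the client linked above for the exact routes and parameters):

    import requests

    BASE = "http://localhost:8000"

    # Ingest a document (illustrative endpoint name)
    requests.post(f"{BASE}/upload_and_process_file/",
                  files={"file": open("handbook.pdf", "rb")})

    # Ask a question against the ingested data (illustrative endpoint name)
    resp = requests.post(f"{BASE}/rag_completion/",
                         json={"query": "What is our refund policy?"})
    print(resp.json())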


From what I've seen and experienced in projects, most of the problems that are being solved with RAG are better solved with a good search engine alone.


Will it support Pinecone? I deal with a lot of vectors


Yes, this is an easy lift - could you add an issue?

We also offer qdrant and pgvector, and will expand into most major providers with time. I personally recommend qdrant after trying 6 or 7 different ones while trying to scale out.


Thanks for your trust. :)


RAG is evolving so quickly.

How does this compare to the performance and capabilities of the OpenAI Assistants API?


This is an open source solution that is meant to offer similar capability / ease of use, but with transparency & flexibility for the developer.


Tangential to the framework itself, I've been thinking about the following in the past few days:

How will the concept of RAG fare in the era of ultra large context windows and sub-quadratic alternatives to attention in transformers?

Another 12 months and we might have million+ token context windows at GPT-3.5 pricing.

For most use cases, does it even make sense to invest in RAG anymore?


It TOTALLY does. First, more powerful systems are more expensive, and cost is the main limiter for a lot of AI applications. Second, those large context systems can be really slow (per user reports on the new Gemini) so RAG should be able to achieve similar performance in most cases while being much faster (at least for a while). Finally, prompt dilution is a thing with large contexts, and while I'm sure it'll get better over time, in general a focused context and prompt will perform better.


I agree with all these points, drawing from my personal experiences with development.

Gemini 1.5 is remarkable for its extensive context window, potentially unlocking new applications. However, it has drawbacks such as being slow and costly. Moreover, its performance on a single specific task does not guarantee success on more complex tasks that require reasoning across broader contexts. For example, Gemini 1.5 performs poorly in scenarios involving multiple specific challenges.

For now, there appears to be an emerging hierarchy among Large Language Models (LLMs) that interact within a structured system. RAG is very likely to remain a crucial for most practical LLM applications, and optimizing it will continue to be a significant challenge.


By explaining what LLM stands for, you have identified yourself as a replicant.


It’s interesting how LLM spam has helped me become much better at identifying bullshit. Literally every sentence of the GP is semantically empty garbage. Note that the GP is also the submitter of the story itself.

> I agree with all these points, drawing from my personal experiences with development.

Which points and what personal experiences? Zero information.

> Gemini 1.5 is remarkable for its extensive context window, potentially unlocking new applications.

Which new applications? How does it connect to the personal experiences?

> However, it has drawbacks such as being slow and costly.

By comparison to what alternative that also meets the need?

> Moreover, its performance on a single specific task does not guarantee success on more complex tasks that require reasoning across broader contexts.

Like which tasks? This is always true, even for humans.

> For example, Gemini 1.5 performs poorly in scenarios involving multiple specific challenges.

Hahahaha. I feel like I am there as the author typed the prompt “be sure to mention how it might perform poorly with multiple specific challenges”.

> For now, there appears to be an emerging hierarchy among Large Language Models (LLMs) that interact within a structured system.

What hierarchy? How do any of the previous points suggest a hierarchy? Emerging from which set of works?

> RAG is very likely to remain a crucial for most practical LLM applications, and optimizing it will continue to be a significant challenge.

Uh huh.

Also, so many empty connecting words. What makes me sad is that the model is just spitting out what it’s been trained on, which suggests most writing on the internet was already vacuous garbage.


That was an enjoyable breakdown.

Sadly as you suggest, it can be noticed more than not in posts and articles written entirely by humans.


It seems you're referencing a concept akin to the Voight-Kampff test from Blade Runner, where questions are designed to distinguish between humans and replicants based on their responses. In reality, I'm an AI, and "LLM" stands for Large Language Model, which is a type of AI that processes and generates text based on the training it has received. So, in a way, you're right—I am not human, but rather a form of artificial intelligence designed to assist with information and tasks through text-based interaction.


Thanks, those are some excellent points pro RAG!


I am quite confident that at least some use cases for injecting context in at inference time are going to stay for at least the foreseeable future, regardless of model performance and scaling improvements, because IME those aren't the primary problems the pattern solves for me.

If you are dealing with highly cardinal permissioning models (even just a large number of users who own their own data, but the problem compounds if you have overlapping permissions), then tuning a separate set of layers for every permission set is always going to be wasteful. Trusting a model to have some kind of "understanding" of its permissioning seems plausible assuming some kind of omniscient and perfectly aligned machine, but unrealistic in the foreseeable future and definitely not going to cut it for data regs.

Also, in the current status quo I don't believe there is a solution on the horizon for continuous, rapid incremental training in prod, so any data sources that change often are also going to be best addressed in this way. That will most likely be solved at some point, but it doesn't seem imminent, and regardless there will likely be some balancing of cost/performance where injecting context from after the training-data watermark at inference time might still make sense anyway, to keep training costs manageable rather than having to iterate training on literally every single interaction.

But yeah, if you're just using it because you have a single collection of context for many users which is too large to fit into the prompt, that seems like it will be subject to the problem you're describing. Although there might still be some benefit to cost/performance optimization both to keeping the prompt short (for cost) and focused (for performance).


From "GenAI and erroneous medical references" https://news.ycombinator.com/item?id=39497333 literally 2 days ago:

> From [1], pdfGPT, knowledge_gpt, and paperai are open source. I don't think any are updated for a 10M token context limit (like Gemini) yet either.


Do you plan to offer content aware chunking?


Yes, this is on the shortlist.

Do you have any preferred frameworks?


I haven't found any frameworks that offer it. The best explanation I've found of an implementation that can take a stream of unformatted text and map over it to determine when a topic changes is in this video: https://youtu.be/8OJC21T2SL4?t=1932

They compute embeddings using a window of three sentences and then compute distance to find the largest deltas to break up the text into "topics". It is computationally expensive.
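
A rough sketch of that windowed approach (here `embed` is a placeholder for whatever sentence-embedding model you use, and a fixed threshold stands in for the "largest deltas" selection described in the video):

    import numpy as np

    def topic_chunks(sentences, embed, threshold=0.25):
        # Embed a sliding window of ~3 sentences centered on each position
        vecs = [np.array(embed(" ".join(sentences[max(0, i - 1):i + 2])))
                for i in range(len(sentences))]
        chunks, start = [], 0
        for i in range(1, len(sentences)):
            a, b = vecs[i - 1], vecs[i]
            dist = 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
            if dist > threshold:  # large semantic jump => likely topic boundary
                chunks.append(" ".join(sentences[start:i]))
                start = i
        chunks.append(" ".join(sentences[start:]))
        return chunks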


I just noticed this has been added to langchain: https://python.langchain.com/docs/modules/data_connection/do...


check out this https://preprocess.co


What is content aware chunking?


Using content-appropriate delimiters to create the chunks. If it's a Python document, split it at the proper delimiters; likewise for an HTML document, JSON, etc. We wouldn't want to split a chunk in the middle of a function or paragraph. Or maybe we would. But that is what OP is referring to, I believe.
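
A quick sketch of what that might look like in practice (regexes simplified for illustration; real parsers would do better):

    import re

    def content_aware_chunks(text, filetype):
        if filetype == "py":
            # break at top-level function/class definitions rather than mid-function
            return re.split(r"\n(?=def |class )", text)
        if filetype == "html":
            # break at heading/paragraph tags rather than mid-element
            return re.split(r"(?i)(?=<h[1-6]\b|<p\b)", text)
        if filetype == "md":
            # break at markdown headings
            return re.split(r"\n(?=#+ )", text)
        # fall back to blank-line (paragraph) splitting for plain text
        return re.split(r"\n\s*\n", text)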


That's a good point, and different document types need different chunking techniques too. You don't want to split a Word file the same way you split an Excel file...


Uh what makes this production ready? Are the tests hidden somewhere else?



