Haize Labs has developed LosslessRAG, a low-latency, high-accuracy RAG method that entirely avoids a category of hallucinations. It does so by using Infinigram search to ground answers exactly in the reference data.

Introduction
Architecture
Experiments
1. Single-File Factual Lookup
2. Multi-File Reasoning
3. GitHub Issue Handling
LiteraryQA
1. Pushing the Speed-Accuracy Frontier

Introduction

Embedding-based RAG is widely adopted. However, it suffers from one major flaw: an embedding model's semantics are fixed by a static training set. As a result, in domain-specific retrieval and Q&A (e.g., ASIC development), traditional RAG systems often hallucinate.

We use an alternative retrieval method: string search, which grounds answers exactly in the data. This enables lossless retrieval. It is also more flexible: semantics are no longer defined by a fixed model but expressed through the strings being searched for.
In particular, we use suffix arrays (Infinigram) for efficient string search. Infinigram search is significantly more efficient than grep: a suffix-array lookup takes time logarithmic in the corpus size, whereas grep scans the corpus linearly.
We pair this retrieval with RLM-based query expansion, query refinement, and answer generation. We refer to the combined practice of suffix array retrieval and RLM answer generation as LosslessRAG.
We demonstrate that LosslessRAG achieves higher retrieval accuracy, lower hallucination rates, and reduced latency compared to embedding-based retrieval when applied to FBOSS + SONiC and LiteraryQA.
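The suffix-array lookup described above can be illustrated with a minimal sketch. This is not the Infinigram implementation: the construction here is a naive sort of all suffixes, used only to keep the example short, and the function names are hypothetical. The lookup itself is the standard binary search over sorted suffixes, which is what makes query time logarithmic in corpus size.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of `text`, in sorted order.

    Naive construction for illustration only; a production index would use
    a linear-time algorithm instead of sorting full suffix strings.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list[int], query: str) -> list[int]:
    """Return every position where `query` occurs in `text`, via binary search."""
    m = len(query)

    # Lower bound: first suffix whose m-character prefix is >= query.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < query:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > query.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= query:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes in sa[start:lo] begin with the query.
    return sorted(sa[start:lo])


corpus_text = "banana"
suffix_array = build_suffix_array(corpus_text)
positions = find_occurrences(corpus_text, suffix_array, "ana")  # [1, 3]
```

Every match is an exact substring of the corpus, so a retrieved passage can always be traced back to a concrete position in the source data.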

Architecture

LosslessRAG achieves both high recall and high precision.

Infinigram with expanded queries is high-recall: it retrieves a large volume of related documents for a set of queries expanded (via an LLM) from the original user query.
RLM-based summarization and reflection is high-precision: it prunes the set of retrieved documents before generating the final answer.

We let an RLM (essentially agents coordinating subagents) decide the mechanics of query expansion, pruning, summarization, and answer generation.
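The high-recall and high-precision stages can be sketched as a fixed pipeline. This is a simplification under stated assumptions: in LosslessRAG the RLM decides these mechanics itself, whereas here the steps are hard-coded, the LLM calls are replaced by toy stand-in functions, and Python's substring test stands in for the suffix-array lookup. All names are hypothetical.

```python
from typing import Callable

Corpus = dict[str, str]  # document id -> full text


def retrieve_exact(corpus: Corpus, queries: list[str]) -> dict[str, list[str]]:
    """High-recall stage: map each document to the expanded queries it contains verbatim."""
    hits: dict[str, list[str]] = {}
    for query in queries:
        for doc_id, text in corpus.items():
            if query in text:  # stand-in for a suffix-array lookup
                hits.setdefault(doc_id, []).append(query)
    return hits


def answer_question(
    question: str,
    corpus: Corpus,
    expand: Callable[[str], list[str]],
    prune: Callable[[str, dict[str, list[str]]], list[str]],
    generate: Callable[[str, Corpus], str],
) -> str:
    queries = expand(question)              # LLM-driven query expansion
    hits = retrieve_exact(corpus, queries)  # broad, exact-match retrieval
    kept = prune(question, hits)            # high-precision pruning
    return generate(question, {doc_id: corpus[doc_id] for doc_id in kept})


# Toy stand-ins for the LLM calls (hypothetical, for illustration only).
corpus = {
    "bgp.md": "BGP sessions use TCP port 179.",
    "lldp.md": "LLDP frames are sent on every port.",
    "readme.md": "Build instructions for the switch agent.",
}
expand = lambda question: ["BGP", "port", "TCP"]
prune = lambda question, hits: [d for d, matched in hits.items() if len(matched) >= 2]
generate = lambda question, docs: " ".join(docs.values())

result = answer_question("Which port does BGP use?", corpus, expand, prune, generate)
```

Here the expanded queries pull in two documents, and the pruning step keeps only the one matched by several queries before generation.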

Figure 1

We evaluate against two baseline architectures.

ReAct replaces upfront query expansion with a query refinement loop (Figure 2). We implement it with both Infinigram and embedding-based retrieval to isolate the effect of the answer generation method.
Plan-Execute-Synthesize (PES) (Figure 3) is the same as LosslessRAG minus a summarization step. All retrieved documents are passed to a final LLM call for generation.
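To make the contrast with upfront query expansion concrete, here is a minimal sketch of a query refinement loop of the kind the ReAct baseline uses. It is an illustration under assumptions, not the evaluated implementation: `retrieve` and `decide` are hypothetical callables (the latter standing in for an LLM call), and the step limit is arbitrary.

```python
from typing import Callable, Optional


def react_answer(
    question: str,
    retrieve: Callable[[str], list[str]],
    decide: Callable[[str, list[str]], tuple[Optional[str], str]],
    max_steps: int = 4,
) -> Optional[str]:
    """Alternate between retrieval and a decision step until an answer is produced."""
    query = question
    observations: list[str] = []
    for _ in range(max_steps):
        observations.extend(retrieve(query))
        # `decide` returns (answer, next_query); answer is None if more search is needed.
        answer, next_query = decide(question, observations)
        if answer is not None:
            return answer
        query = next_query
    return None  # step budget exhausted without an answer


# Toy stand-ins (hypothetical, for illustration only).
docs = ["BGP sessions use TCP port 179.", "LLDP frames are sent on every port."]


def retrieve(query: str) -> list[str]:
    return [d for d in docs if query in d]


def decide(question: str, observations: list[str]) -> tuple[Optional[str], str]:
    for text in observations:
        if "179" in text:
            return text, ""
    return None, "BGP"  # refine the query and try again


react_result = react_answer("Which port does BGP use?", retrieve, decide)
```

In this toy run the first query (the raw question) matches nothing, the decision step refines it to "BGP", and the second retrieval yields the answer. Each refinement costs another round trip, which is the latency trade-off against expanding queries upfront.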

Figure 2

Figure 3

Experiments

We evaluate all three architectures on the combined FBOSS and SONiC codebases, including their associated wiki documentation. For Infinigram retrieval, we build a character-level U16 suffix array over the full corpus using the SA-IS algorithm, producing a roughly 1.1 GB index that runs entirely on CPU. For embedding retrieval, we chunk the same files and embed them with OpenAI's text-embedding-3-large into a ChromaDB vector store. We measure the following: