Your RAG system might be slow because it's doing too much unnecessary work.
Recently, Meta's research team released the REFRAG framework, demonstrating a key finding: 99% of cross-passage attention computations in RAG systems are wasted.
As context windows continue to grow, time-to-first-token (TTFT) increases quadratically with input length, becoming a performance bottleneck for RAG systems. REFRAG achieves a 30.85x TTFT acceleration through a new compression strategy while maintaining model accuracy.
Core Technical Solution
The traditional RAG pipeline is simple: a query comes in, it is encoded into a vector, similar text blocks are retrieved from a vector database, and all of them are fed to the LLM. This works, but at considerable cost: most retrieved blocks contain irrelevant text, so the LLM processes far more tokens than necessary, wasting compute, increasing latency, and needlessly consuming context capacity.
REFRAG's core idea is to avoid feeding the retrieved raw tokens directly into the generative model. Instead, it adopts the following strategy (a code sketch follows the list):
Divide the context into fixed-size blocks.
Generate compressed block embeddings using a lightweight encoder (e.g., RoBERTa).
Input these embeddings along with query tokens into the decoder.
Selectively expand important blocks via a reinforcement learning strategy.
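The sketch below illustrates this style of input construction. The block size, the encoder (a single transformer layer standing in for a lightweight RoBERTa-style model), and the projection into the decoder's embedding space are all placeholder assumptions for clarity, not the paper's exact configuration.

```python
# Illustrative sketch of REFRAG-style input construction (not the official code).
# Shapes and module choices are assumptions for clarity.
import torch
import torch.nn as nn

BLOCK_SIZE = 16      # tokens per block (fixed-size chunking)
ENC_DIM = 768        # lightweight encoder hidden size (e.g., RoBERTa-base)
DEC_DIM = 4096       # decoder (LLM) embedding size

class BlockCompressor(nn.Module):
    """Compress each fixed-size block of token embeddings into one block embedding."""
    def __init__(self):
        super().__init__()
        # Stand-in for a lightweight encoder such as RoBERTa: one transformer
        # layer, mean pooling, and a projection into the decoder's embedding space.
        self.encoder = nn.TransformerEncoderLayer(d_model=ENC_DIM, nhead=12, batch_first=True)
        self.proj = nn.Linear(ENC_DIM, DEC_DIM)

    def forward(self, block_tokens: torch.Tensor) -> torch.Tensor:
        # block_tokens: (num_blocks, BLOCK_SIZE, ENC_DIM)
        encoded = self.encoder(block_tokens)          # contextualize within each block
        pooled = encoded.mean(dim=1)                  # (num_blocks, ENC_DIM)
        return self.proj(pooled)                      # (num_blocks, DEC_DIM)

def build_decoder_input(query_emb: torch.Tensor, retrieved_tokens: torch.Tensor,
                        compressor: BlockCompressor) -> torch.Tensor:
    """Concatenate compressed block embeddings with query token embeddings."""
    num_blocks = retrieved_tokens.shape[0] // BLOCK_SIZE
    blocks = retrieved_tokens[: num_blocks * BLOCK_SIZE].view(num_blocks, BLOCK_SIZE, ENC_DIM)
    block_embs = compressor(blocks)                   # (num_blocks, DEC_DIM)
    # The decoder now attends over num_blocks context positions instead of num_tokens.
    return torch.cat([block_embs, query_emb], dim=0)

# Toy usage: 512 retrieved tokens become 32 context positions.
compressor = BlockCompressor()
retrieved = torch.randn(512, ENC_DIM)
query = torch.randn(20, DEC_DIM)
decoder_input = build_decoder_input(query, retrieved, compressor)
print(decoder_input.shape)  # torch.Size([52, 4096])
```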
This design makes attention computation scale with the number of blocks rather than the number of tokens. With a 16x compression ratio, the system achieves a 16.53x acceleration while improving performance by 9.3% over existing methods.
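A rough back-of-the-envelope shows why: if prefill attention cost grows quadratically with the number of positions the decoder attends over, compressing s context tokens into s/k block embeddings shrinks that term by roughly k squared. The numbers below are illustrative and ignore the encoder's own cost and all non-attention compute.

```latex
% q = query tokens, s = retrieved context tokens, k = compression ratio.
\[
\text{TTFT}_{\text{full}} \;\propto\; (q + s)^2,
\qquad
\text{TTFT}_{\text{REFRAG}} \;\propto\; \Bigl(q + \tfrac{s}{k}\Bigr)^2 .
\]
% Example with q = 32, s = 4096, k = 16:
\[
\frac{(32 + 4096)^2}{(32 + 4096/16)^2} \;=\; \frac{4128^2}{288^2} \;\approx\; 205 .
\]
% Attention cost alone could shrink by two orders of magnitude; measured
% end-to-end speedups (e.g., 16.53x) are smaller because other costs do not compress.
```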
So, how is this different from a re-ranker?
In a typical RAG pipeline with re-ranking, the re-ranker merely reorders or prunes blocks at the text level without changing the representation fed to the LLM. The LLM still receives the full text of the top-ranked blocks, token by token.
REFRAG, however, performs compression, filtering, and replacement at the embedding level. Rather than having the LLM consume every token embedding of every block, a single compressed embedding stands in for each block, and an RL policy decides which blocks are worth expanding back into full token form. More importantly, REFRAG moves relevance filtering into the LLM's representation space, not just the retrieval space: the LLM itself is trained to understand compressed embeddings and reason over them.
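A minimal sketch of what such selective expansion could look like is below. The policy network, its inputs, and the expansion budget are assumptions for illustration, not REFRAG's actual policy.

```python
# Illustrative selective-expansion step (assumed interface, not REFRAG's actual policy).
import torch
import torch.nn as nn

DEC_DIM = 4096

class ExpansionPolicy(nn.Module):
    """Scores each compressed block; high-scoring blocks are expanded to full tokens."""
    def __init__(self):
        super().__init__()
        self.scorer = nn.Linear(2 * DEC_DIM, 1)

    def forward(self, query_summary: torch.Tensor, block_embs: torch.Tensor) -> torch.Tensor:
        # query_summary: (DEC_DIM,), block_embs: (num_blocks, DEC_DIM)
        q = query_summary.expand(block_embs.shape[0], -1)
        return self.scorer(torch.cat([block_embs, q], dim=-1)).squeeze(-1)  # (num_blocks,)

def mix_context(block_embs, block_token_embs, scores, expand_budget=2):
    """Replace the top-scoring compressed blocks with their full token embeddings."""
    expand_idx = set(scores.topk(expand_budget).indices.tolist())
    pieces = []
    for i, emb in enumerate(block_embs):
        if i in expand_idx:
            pieces.append(block_token_embs[i])        # full tokens for important blocks
        else:
            pieces.append(emb.unsqueeze(0))           # one embedding for the rest
    return torch.cat(pieces, dim=0)

# Toy usage: 8 blocks of 16 tokens each, expand the 2 most relevant ones.
policy = ExpansionPolicy()
block_embs = torch.randn(8, DEC_DIM)
block_token_embs = torch.randn(8, 16, DEC_DIM)   # token embeddings already in decoder space
scores = policy(torch.randn(DEC_DIM), block_embs)
context = mix_context(block_embs, block_token_embs, scores)
print(context.shape)  # 2 expanded blocks (32 tokens) + 6 compressed blocks = 38 positions
```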
Innovations
The research team discovered an important characteristic of RAG systems: attention between retrieved passages exhibits a block-diagonal structure. Tokens within a passage have high mutual attention, but cross-passage attention is almost zero. This sparsity provides a theoretical basis for compression optimization.
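If you want to check this observation on your own model, a small diagnostic like the following can measure how much attention mass crosses passage boundaries. It assumes you can extract an attention matrix (for example via output_attentions=True in Hugging Face transformers); a random matrix stands in for it here.

```python
# Diagnostic sketch: measure how much attention mass crosses passage boundaries.
import torch

def cross_block_attention_mass(attn: torch.Tensor, block_bounds: list[int]) -> float:
    """Fraction of attention probability that falls outside same-passage blocks.

    attn: (seq_len, seq_len) row-stochastic attention matrix.
    block_bounds: token index where each passage starts, plus seq_len at the end.
    """
    seq_len = attn.shape[0]
    same_block = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    for start, end in zip(block_bounds[:-1], block_bounds[1:]):
        same_block[start:end, start:end] = True
    total = attn.sum()
    within = attn[same_block].sum()
    return float((total - within) / total)

# Toy usage with a random attention matrix over 3 retrieved passages.
attn = torch.rand(96, 96)
attn = attn / attn.sum(dim=-1, keepdim=True)      # normalize rows
print(cross_block_attention_mass(attn, [0, 32, 64, 96]))
# On a real RAG prompt, the paper's observation predicts this fraction to be near zero.
```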
For training, the team adopted a curriculum learning strategy. The model first learns to reconstruct single blocks, then gradually increases to multiple blocks. This progressive training is crucial for the model to master compression capabilities. Furthermore, the reinforcement learning strategy can dynamically determine which content blocks need full expansion, achieving adaptive adjustment of the compression ratio.
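Such a curriculum might be scheduled roughly as follows; the stage sizes and doubling schedule here are assumptions, not the paper's exact recipe.

```python
# Illustrative curriculum schedule for the reconstruction objective (assumed stage sizes).
from dataclasses import dataclass

@dataclass
class CurriculumStage:
    num_blocks: int    # how many compressed blocks the decoder must reconstruct
    steps: int         # training steps spent at this difficulty

def build_curriculum(max_blocks: int = 8, steps_per_stage: int = 10_000):
    """Start with single-block reconstruction, then double the block count."""
    stages = []
    n = 1
    while n <= max_blocks:
        stages.append(CurriculumStage(num_blocks=n, steps=steps_per_stage))
        n *= 2
    return stages

for stage in build_curriculum():
    print(f"reconstruct {stage.num_blocks} block(s) for {stage.steps} steps")
```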
Experimental Validation
In multiple benchmark tests, REFRAG demonstrated consistent performance improvements:
RAG tasks: Performance improved by 1.22% (strong retriever) to 1.93% (weak retriever) compared to LLaMA under the same latency conditions.
Multi-turn dialogue: The advantage became more pronounced as the number of dialogue turns increased, due to the ability to retain more historical context through compression.
Document summarization: In long document processing tasks, REFRAG could process more content within the same computational budget.
Summary
Putting REFRAG into practice still requires some engineering work. For example:
To enable reuse across queries, compressed block embeddings can be precomputed and stored in a vector database. This "compress anywhere" capability makes REFRAG particularly suitable for multi-turn dialogue and agent applications.
To improve interpretability, it is important to explain which compressed contexts influenced an answer. The compression pipeline therefore needs a traceability mechanism similar to retrieval's, storing block hashes and version information (a combined sketch of these two points follows the examples).
At the same time, while reinforcement learning strategies offer better performance, a fixed compression ratio version might be more stable and reliable for actual deployment.
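As a sketch of the first two points, compressed block embeddings could be cached and keyed by a content hash that also carries encoder-version metadata. The class names, fields, and encoder interface here are illustrative assumptions, not part of REFRAG.

```python
# Sketch of a precomputed block-embedding cache with traceability metadata.
import hashlib
from dataclasses import dataclass

import numpy as np

ENCODER_VERSION = "block-encoder-v1"   # bump when the compressor changes

@dataclass
class CachedBlock:
    block_hash: str
    encoder_version: str
    embedding: np.ndarray

class BlockEmbeddingCache:
    """Compute each block's embedding once and reuse it across queries."""
    def __init__(self, encode_fn):
        self.encode_fn = encode_fn          # text -> np.ndarray
        self._store: dict[str, CachedBlock] = {}

    def get(self, block_text: str) -> CachedBlock:
        key = hashlib.sha256(block_text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self._store[key] = CachedBlock(
                block_hash=key,
                encoder_version=ENCODER_VERSION,
                embedding=self.encode_fn(block_text),
            )
        return self._store[key]

# Toy usage with a dummy encoder; block hashes can later be logged alongside answers
# so that the compressed contexts that influenced them remain traceable.
cache = BlockEmbeddingCache(encode_fn=lambda text: np.random.rand(768).astype(np.float32))
block = cache.get("Paris is the capital of France.")
print(block.block_hash[:12], block.encoder_version, block.embedding.shape)
```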
Overall, REFRAG's success indicates that optimizations tailored to specific application scenarios are highly necessary. For RAG systems, understanding and utilizing their unique attention sparsity is more effective than broadly expanding the context window.