Achieving Lossless Mathematical Reasoning with 10% of the KV Cache: An Open-Source Method to Resolve "Memory Overload" in Large Reasoning Models

Contributed by the R-KV Team | QbitAI Official Account

Large reasoning models are powerful, but a simple arithmetic problem can produce three full pages of repetitive, "rambling" output, making it hard to find the key points...

An efficient compression method has emerged that can transform the "rambling" of large models into controllable memory entries!

R-KV open-source debut: VRAM ↓90%, Throughput ×6.6, Accuracy = 100%.

It dynamically sorts tokens in real-time, considering both importance and non-redundancy, retaining only information-rich and diverse tokens, thereby solving the redundancy problem during large model inference.

Making "long inference" no longer a luxury.


Project details can be found via the links at the end of the article.

R-KV's Three Steps: Redundancy Identification + Importance Evaluation + Dynamic Eviction

Chain-of-Thought (CoT) reasoning makes LLM problem-solving transparent, but it also causes output length to balloon dramatically.

Taking DeepSeek-R1-Llama-8B as an example: a single AIME math problem can generate 32,000 tokens. With model weights at 15.5 GB, the KV cache consumes another 4.1 GB, and VRAM is exhausted almost instantly.
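Where does a figure like 4.1 GB come from? Here is a back-of-the-envelope check, assuming Llama-3-8B-style dimensions (32 layers, 8 grouped-query KV heads of size 128, FP16 caches); these architectural numbers are illustrative assumptions, not taken from the article:

```python
# Rough KV-cache size estimate for a 32K-token reasoning trace.
# Assumed (not from the article): 32 layers, 8 KV heads, head_dim 128, FP16 (2 bytes/value).
layers, kv_heads, head_dim, bytes_per_val = 32, 8, 128, 2
tokens = 32_000

bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_val  # K and V
total_gb = bytes_per_token * tokens / 1e9

print(f"{bytes_per_token} bytes/token -> {total_gb:.1f} GB for {tokens} tokens")
# ~131 KB per token -> ~4.2 GB, in line with the ~4.1 GB quoted above.
```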

Existing KV-compression methods (SnapKV, StreamingLLM, H2O, etc.) are designed primarily for long inputs. Once the model starts "rambling" on the output side, however, similar sentences assign high attention scores to one another, which defeats the "evict low-attention tokens" strategy:

This leads to problems such as critical steps being mistakenly deleted, repetitive content being retained, and accuracy plummeting.

R-KV instead compresses the KV cache in real time during decoding: it scores cached key/value (KV) tokens and retains only those that are both important and non-redundant, through the following steps:


Decoding-Time Compression: Whether to keep or evict KV entries is decided as tokens are generated, so the cache never balloons during decoding.

Importance Scoring: Multi-head attention comprehensively evaluates each token's contribution to subsequent answers.

Redundancy Scoring: Computes the cosine similarity between Key vectors to flag content that merely repeats earlier statements.

Joint Eviction: The KV budget is allocated in real time by prioritizing "high importance + low redundancy" tokens, with the best results reported around λ ≈ 0.1 (see the scoring sketch after this list).
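As a rough illustration of how such a joint score could be computed, here is a minimal PyTorch sketch. The function name, tensor shapes, normalization, and the way λ mixes the two scores are assumptions for illustration; the R-KV paper and repository define the exact formulation.

```python
import torch
import torch.nn.functional as F

def joint_kv_selection(keys, attn_from_recent, budget, lam=0.1):
    """Illustrative joint importance/redundancy selection (not the authors' code).

    keys:             [n, d]  cached Key vectors for one head/layer
    attn_from_recent: [n]     attention mass recent queries place on each cached token
    budget:           number of KV entries to keep
    lam:              importance/redundancy trade-off (the article cites lambda ~= 0.1)
    """
    # Importance: how strongly recent decoding steps attend to each cached token
    # (scaled to [0, 1]; one arbitrary normalization choice among several).
    importance = attn_from_recent / (attn_from_recent.max() + 1e-8)

    # Redundancy: cosine similarity between Key vectors; a token whose Key closely
    # matches another cached Key is likely "broken-record" repetition.
    k = F.normalize(keys, dim=-1)
    sim = k @ k.T
    sim.fill_diagonal_(0.0)
    redundancy = sim.max(dim=-1).values  # similarity to the most similar other token

    # Joint score: keep tokens that are important AND non-redundant.
    score = lam * importance + (1.0 - lam) * (1.0 - redundancy)

    keep = torch.topk(score, k=min(budget, keys.shape[0])).indices
    return keep.sort().values  # preserve original token order in the cache
```

In this sketch the importance term reuses attention weights the model already computes, so the dominant extra cost is the pairwise Key similarity, which is consistent with the "moderate overhead" described in the performance section below.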

The entire process is training-free and model-agnostic: it requires no changes to the model architecture and can be integrated plug-and-play. It can therefore also be dropped directly into reinforcement-learning sampling, which makes it very flexible.

Visualization: R-KV vs. SnapKV

[Figure: token-selection visualization comparing R-KV and SnapKV]

The figure above shows which tokens R-KV and the pure-attention baseline SnapKV select at the same decoding step. Gray = not selected; light to dark red = selected by more attention heads.

As the figure shows, SnapKV focuses on local segments closest to the current query and even repeatedly retains useless self-restatements such as "3 students are leaving early…".

In contrast, R-KV selects tokens across the entire reasoning process: keywords from the problem (30 students), crucial intermediate values (24, 12), and the final answer are all retained, with broader semantic coverage.

By combining attention strength with redundancy filtering, R-KV preserves the important context, removes noise, and completes the task correctly, whereas SnapKV mistakenly evicts critical information and produces a wrong answer.

The results show: R-KV has a broader coverage, higher information diversity, and significantly stronger redundancy reduction capabilities.

Performance Test: Accuracy Increases Rather Than Decreases

[Figures: accuracy on challenging mathematical benchmarks, R-KV vs. baselines and full KV]

As shown, R-KV significantly outperforms baselines in challenging mathematical benchmarks, even surpassing the full KV method.


In terms of computational overhead, R-KV introduces additional calculations for importance and redundancy scoring, but the overall overhead is moderate and is typically offset by the reduced attention costs from compressed KV cache. This trade-off becomes increasingly favorable as sequence length increases.

An analysis of memory savings and end-to-end throughput shows that even at a batch size of 1, R-KV slightly outperforms FullKV in throughput. This indicates that the speedup from computing attention over a compressed cache more than offsets R-KV's own scoring overhead.

However, this direct speedup accounts for only a small portion of the overall gain. The main throughput improvement comes from KV-cache compression allowing the model to run much larger inference batches.
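To see why compression translates into batch size, here is a toy capacity estimate using the figures quoted earlier (15.5 GB of weights, roughly 4.1 GB of KV cache per 32K-token sequence). The 80 GB GPU and the neglect of activation memory are assumptions for illustration, not numbers from the article:

```python
# Toy capacity estimate: how many concurrent 32K-token sequences fit in VRAM?
# Assumed: 80 GB GPU; weights and per-sequence KV sizes taken from the article;
# activations and memory fragmentation ignored for simplicity.
gpu_gb, weights_gb, kv_per_seq_gb = 80.0, 15.5, 4.1
kv_budget = 0.10  # R-KV keeps roughly 10% of the KV cache

free_gb = gpu_gb - weights_gb
batch_full = int(free_gb // kv_per_seq_gb)                # full KV cache
batch_rkv  = int(free_gb // (kv_per_seq_gb * kv_budget))  # compressed cache

print(batch_full, batch_rkv)  # ~15 vs ~157 concurrent sequences
```

Larger batches keep the GPU busier, which is where most of the reported throughput gain comes from; the actual speedup (the ×6.6 in the headline) also depends on scheduling, activation memory, and sequence lengths, so treat this purely as intuition.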


R-KV is a good fit for scenarios such as:

Long chain-of-thought inference on edge devices: drastically lower VRAM lets consumer GPUs, and even mobile NPUs, run reasoning models.

Multi-turn agents: complex loops such as reflect, rewrite, and self-evaluate are no longer constrained by VRAM.

Accelerating reinforcement-learning sampling: the training-free method plugs in directly.

Paper PDF: https://arxiv.org/pdf/2505.24133.pdf
Project Homepage: https://zefan-cai.github.io/R-KV.page/
Code Repository: https://github.com/Zefan-Cai/R-KV


— End —


Main tag: LLM Optimization
Sub tags: KV Cache, Computational Efficiency, Memory Management, Inference Optimization


