Researchers at Nvidia have developed a technique that can reduce the memory costs of reasoning in large language models by up to eight times. Their technique, called dynamic memory sparsification (DMS), compresses the key-value (KV) cache, the temporary memory that LLMs generate and store as they process prompts and reason through problems and documents.
Although researchers have previously proposed several methods to compress this cache, most have struggled to do so without affecting the model’s intelligence. Nvidia’s approach manages to throw away a large portion of the cache while preserving (and in some cases improving) the model’s reasoning capabilities.
Experiments show that DMS allows LLMs to think longer and explore more solutions without the usual penalties in speed or memory costs.
The bottleneck of reasoning
LLMs improve their performance on complex tasks by generating “chain of thought” tokens, essentially writing down their reasoning steps before arriving at a final answer. Inference-time scaling techniques take advantage of this by giving the model a larger budget to generate these thinking tokens or to explore multiple potential reasoning paths in parallel.
However, this improved reasoning comes at a significant computational cost. As the model generates more tokens, it builds a KV cache.
For real-world applications, the KV cache is a major bottleneck. As the reasoning chain grows, the cache grows linearly with it, consuming enormous amounts of GPU memory. This forces the hardware to spend more time reading data from memory than actually computing, which slows down generation and increases latency. It also limits the number of users a system can serve at the same time, because running short of VRAM will cause the system to slow down or fail.
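To see why this adds up, here is a rough back-of-the-envelope sketch of how KV cache size scales with sequence length; the model dimensions and fp16 storage are illustrative assumptions, not the exact configuration of any model named in this article.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_value=2):
    """Keys and values are stored for every layer, KV head, and generated token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value

# Illustrative figures for an 8B-class model with grouped-query attention and fp16 storage
# (assumed numbers, not the published configuration of any specific model).
size = kv_cache_bytes(num_layers=36, num_kv_heads=8, head_dim=128,
                      seq_len=32_768, batch_size=8)
print(f"{size / 2**30:.1f} GiB")  # -> 36.0 GiB for just eight concurrent 32k-token reasoning traces
```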
Nvidia researchers see this not only as a technical hurdle, but also as a fundamental economic one for enterprises.
“The question isn’t just about the amount of hardware; it’s about whether your infrastructure handles 100 reasoning threads or 800 threads for the same cost,” Piotr Nawrot, Senior Deep Learning Engineer at Nvidia, told VentureBeat.
Previous attempts to solve this have focused on heuristics-based approaches. These methods use strict rules, such as a “sliding window” that caches only the most recent tokens and discards the rest. While this reduces memory usage, it often forces the model to discard critical information needed to solve the problem, reducing the accuracy of the output.
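For reference, a minimal sketch of such a sliding-window heuristic might look like the following; the window size is an arbitrary illustrative choice.

```python
from collections import deque

class SlidingWindowKVCache:
    """Heuristic eviction: keep KV entries only for the most recent `window` tokens.
    Memory stays bounded, but anything outside the window is gone for good,
    even if the model needs it again later in the reasoning chain."""

    def __init__(self, window: int = 4096):
        self.entries = deque(maxlen=window)  # oldest entries are silently dropped

    def append(self, key, value):
        self.entries.append((key, value))
```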
“Standard eviction methods attempt to select old and unused tokens for eviction using heuristics,” the researchers said. “They simplify the problem, hoping that by approximating the inner workings of the model, the answer will remain correct.”
Other solutions offload unused portions of the KV cache to slower memory, but constantly swapping data back and forth introduces latency overhead that makes real-time applications sluggish.
Dynamic memory sparsification
DMS takes a different approach by “adapting” existing LLMs to intelligently manage their own memory. Rather than applying a hard and fast rule for what to remove, DMS trains the model to identify which tokens are essential for future reasoning and which are disposable.
“It doesn’t just estimate importance; it learns a policy that explicitly preserves the model’s final output distribution,” Nawrot said.
The process transforms a standard, pre-trained LLM such as Llama 3 or Qwen 3 into a self-compressing model. Crucially, this does not require training the model from scratch, which would be prohibitively expensive. Instead, DMS reuses existing neurons within the model’s attention layers to output a “keep” or “evict” signal for each token.
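Conceptually, the learned eviction policy can be pictured as reading a per-token keep/evict logit off an existing projection, as in the rough sketch below; the specific projection, slicing, and names are illustrative assumptions, not Nvidia's actual implementation.

```python
import torch

def eviction_logits(hidden_states: torch.Tensor, q_proj: torch.nn.Linear,
                    gate_channel: int = 0) -> torch.Tensor:
    """Read a per-token keep/evict logit off one channel of an existing attention
    projection, so no new parameters are added. hidden_states: (batch, seq, hidden).
    Returns (batch, seq); sigmoid(logit) ~ probability the token's KV entry is evicted."""
    projected = q_proj(hidden_states)    # weights the attention layer already owns
    return projected[..., gate_channel]  # one output dimension doubles as the gate
```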
For teams concerned about the complexity of retrofitting, the researchers noted that the process is designed to be lightweight. “To improve the efficiency of this process, the model weights can be frozen, making the process similar to Low-Rank Adaptation (LoRA),” Nawrot said. This means that a standard enterprise model such as Qwen3-8B “can be equipped with DMS on a single DGX H100 within hours.”
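A hypothetical retrofit setup along those lines might freeze the backbone and train only the small set of gate parameters, roughly as follows; the parameter naming is invented for illustration.

```python
import torch

def freeze_all_but_gates(model: torch.nn.Module, gate_suffix: str = "evict_gate") -> list[str]:
    """Freeze the backbone and leave only the (hypothetical) eviction-gate parameters
    trainable, so the retrofit touches a tiny fraction of the model's weights."""
    trainable = []
    for name, param in model.named_parameters():
        is_gate = name.endswith(gate_suffix)
        param.requires_grad = is_gate
        if is_gate:
            trainable.append(name)
    return trainable  # only these few parameters receive gradients during the retrofit
```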
One of the key components of DMS is a mechanism called “delayed eviction.” Under standard sparsification, a token deemed unimportant is deleted immediately. This is risky because the model may still need a brief window to integrate that token’s context into its current state.
DMS addresses this by marking a token for eviction but keeping it accessible for a short period (e.g., a few hundred steps). This delay allows the model to extract any remaining useful information from the token and merge it into the current context before the token is flushed from the KV cache.
“The ‘deferred eviction mechanism’ is crucial because not all tokens are simply ‘important’ (keep forever) or ‘useless’ (delete immediately). Many fall in between – they contain some information, but not enough to justify taking up an entire memory slot,” Nawrot said. “This is where the redundancy lies. By keeping these tokens in a local window for a short time before evicting them, we ensure that the model can attend to them and redistribute their information to future tokens.”
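A rough sketch of what deferred eviction could look like at the cache level is shown below; the delay length and data structures are assumptions made purely for illustration.

```python
from collections import deque

class DeferredEvictionCache:
    """Tokens flagged as evictable stay attendable for `delay` further decoding steps,
    giving the model a chance to fold their information into newer tokens first."""

    def __init__(self, delay: int = 256):
        self.kept = []          # KV entries retained indefinitely
        self.pending = deque()  # [entry, steps_remaining] pairs awaiting eviction
        self.delay = delay

    def add(self, entry, evict: bool):
        if evict:
            self.pending.append([entry, self.delay])
        else:
            self.kept.append(entry)

    def step(self):
        """Advance one decoding step: age pending entries and drop the expired ones."""
        for item in self.pending:
            item[1] -= 1
        while self.pending and self.pending[0][1] <= 0:
            self.pending.popleft()

    def attendable(self):
        """Everything the model can still attend to at this step."""
        return self.kept + [entry for entry, _ in self.pending]
```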
The researchers discovered that this retrofit process is very efficient. They were able to equip a pre-trained LLM with DMS in just 1,000 training steps, a small fraction of the computing power required for the original training. The resulting models use standard kernels and can be placed directly into existing high-performance inference stacks without custom hardware or complex software rewriting.
DMS in action
To validate the technique, the researchers applied DMS to several reasoning models, including the Qwen-R1 series (distilled from DeepSeek R1) and Llama 3.2, and tested them on difficult benchmarks such as AIME 24 (math), GPQA Diamond (science), and LiveCodeBench (coding).
The results show that DMS effectively pushes the Pareto frontier, the optimal trade-off between cost and performance. On the AIME 24 math benchmark, a Qwen-R1 32B model equipped with DMS achieved a score 12.0 points higher than a standard model when limited to the same memory bandwidth budget. By compressing the cache, the model could afford to “think” much deeper and broader than the standard model with the same memory and computing budget.
Perhaps most surprisingly, DMS defied the common wisdom that compression impairs comprehension of long contexts. In “needle-in-a-haystack” tests, which measure a model’s ability to find a specific piece of information hidden in a large document, DMS variants actually performed better than the standard models. By actively managing memory instead of passively accumulating noise, the model maintained a cleaner, more useful context.
For enterprise infrastructure, the efficiency gains translate directly into throughput and hardware savings. Because the memory cache is significantly smaller, the GPU spends less time retrieving data, reducing the wait time for users. In tests with the Qwen3-8B model, DMS matched the accuracy of the vanilla model and delivered up to 5x higher throughput. This means that a single server can process five times as many customer requests per second without losing quality.
The future of memory
Nvidia released DMS as part of its KVPress library. On how companies can get started with DMS, Nawrot emphasized that the barrier to entry is low. “The ‘minimum viable infrastructure’ consists of standard Hugging Face pipelines – no custom CUDA kernels are required,” Nawrot said, noting that the code is fully compatible with standard FlashAttention.
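As an illustration only, getting started might look roughly like the sketch below, written in the spirit of the KVPress documentation; the pipeline task, press class, and argument names are assumptions to verify against the library itself, and the press shown is a generic stand-in rather than the DMS integration specifically.

```python
# Usage sketch modeled loosely on the KVPress README; verify class, pipeline task,
# and argument names against https://github.com/NVIDIA/kvpress before relying on them.
from transformers import pipeline
from kvpress import ExpectedAttentionPress  # stand-in press; DMS ships in the same library

pipe = pipeline(
    "kv-press-text-generation",   # custom pipeline task registered by kvpress (assumed)
    model="Qwen/Qwen3-8B",        # any compatible Hugging Face model
    device="cuda",
)
press = ExpectedAttentionPress(compression_ratio=0.5)  # fraction of KV entries to drop

long_context = "..."  # a long document or reasoning trace to answer questions about
result = pipe(long_context, question="What does the document conclude?", press=press)
print(result["answer"])
```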
Looking ahead, the team sees DMS as part of a larger shift in which memory management becomes a separate, intelligent layer of the AI stack. Nawrot also confirmed that DMS is “fully compatible” with newer architectures such as the multi-head latent attention (MLA) used in DeepSeek’s models, suggesting that combining these approaches could yield even greater efficiency gains.
As companies move from simple chatbots to complex agentic systems that require extensive reasoning, the cost of inference is becoming a primary concern. Technologies such as DMS offer a way to sustainably scale these capabilities.
“We have barely scratched the surface of what is possible,” Nawrot said, “and we expect inference time scaling to continue to evolve.”