- DeepSeek’s Engram separates static memory from computation, increasing efficiency in large AI models
- The method reduces the need for high-bandwidth memory by letting models fetch static knowledge through simple lookups instead of recomputing it
- Engram supports asynchronous prefetching across multiple GPUs with minimal performance overhead
DeepSeek, in collaboration with Peking University, introduced a new method called Engram, designed to decouple static memory storage from computation.
Traditional large language models require high-bandwidth memory for knowledge retrieval and basic computation, creating a bottleneck in both performance and cost.
This HBM bottleneck is widely cited as a key reason DRAM prices rose roughly fivefold in just ten weeks, as demand for hardware to support large AI models surged.
Validation and technical approach
The researchers say existing models waste sequential depth on trivial operations, which could otherwise support higher-level reasoning.
Engram allows models to efficiently “look up” essential information without overloading GPU memory, freeing up capacity for more complex reasoning tasks.
The system was tested on a 27-billion-parameter model and showed measurable improvements over standard baselines on industry benchmarks.
By performing knowledge retrieval via hashed N-grams, Engram provides static memory access independent of the current context.
The retrieved information is then adjusted using a context-aware gating mechanism to align with the hidden state of the model.
This design allows models to process long context inputs more efficiently and supports system-level prefetching with minimal performance overhead.
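To make the lookup-and-gate idea concrete, here is a minimal sketch. The table sizes, the CRC32 hash, and the sigmoid gate are illustrative assumptions, not DeepSeek's published implementation; the sketch only shows the shape of the mechanism: a deterministic hash maps an N-gram to a slot in a static embedding table, and a gate conditioned on the hidden state decides how much of the retrieved vector to blend in.

```python
import zlib

import numpy as np

D_MODEL = 8        # hidden size (toy value, not from the paper)
NUM_SLOTS = 1024   # size of the static memory table (toy value)

rng = np.random.default_rng(0)
memory_table = rng.standard_normal((NUM_SLOTS, D_MODEL))   # static store; trained offline
gate_weights = rng.standard_normal((D_MODEL, D_MODEL)) * 0.1

def ngram_slot(tokens: tuple) -> int:
    """Deterministically map an N-gram to a memory slot via a stable hash."""
    return zlib.crc32(" ".join(tokens).encode()) % NUM_SLOTS

def engram_lookup(tokens: tuple, hidden: np.ndarray) -> np.ndarray:
    """Fetch a static embedding and gate it against the current hidden state."""
    retrieved = memory_table[ngram_slot(tokens)]
    gate = 1.0 / (1.0 + np.exp(-(hidden @ gate_weights)))   # sigmoid gate in [0, 1]
    return hidden + gate * retrieved                        # gated residual update

hidden = rng.standard_normal(D_MODEL)
out = engram_lookup(("large", "language", "model"), hidden)
```

Because `ngram_slot` depends only on the input tokens, not on model state, the address of every lookup is known as soon as the text is tokenized, which is what makes the prefetching described below possible.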
The Engram method complements other hardware-efficient approaches, including solutions like Phison’s AI inference accelerators.
Engram minimizes the amount of fast memory required by using lookups for static information, making memory usage more efficient.
Phison provides a cost-effective way to expand overall memory using SSDs, and supports memory-hungry architectures such as Engram-style lookup tables or Mixture-of-Experts systems.
Combined, these approaches allow AI systems to optimize high-speed memory usage while affordably increasing overall memory capacity.
It also works with the emerging CXL (Compute Express Link) standards, which aim to overcome GPU memory bottlenecks in large-scale AI workloads.
The method separates static pattern storage from dynamic computation, improving the Transformer backbone without increasing the number of FLOPs or parameter counts.
DeepSeek formalized a U-shaped expansion rule to optimize the allocation of parameters between the MoE conditional calculation module and the Engram memory module.
Tests show that reallocating about 20-25% of the scarce parameter budget to Engram yields better performance than pure MoE models, maintaining stable gains across different scales.
Expanding memory slots provides predictable improvements without additional computing costs.
This confirms the scalability of conditional memory as an independent axis for sparse models.
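A toy illustration of the reported allocation rule follows. The 20-25% range comes from the article; the exact U-shaped formula is not described there, and the function below is a hypothetical helper that simply splits a sparse parameter budget at a chosen fraction.

```python
def split_budget(total_params: int, engram_frac: float = 0.225) -> dict:
    """Split a sparse parameter budget between MoE experts and Engram memory.

    engram_frac defaults to the midpoint of the 20-25% range the article
    reports as optimal; the true U-shaped rule is not public.
    """
    assert 0.20 <= engram_frac <= 0.25, "article reports a 20-25% sweet spot"
    engram = int(total_params * engram_frac)
    return {"engram_memory": engram, "moe_experts": total_params - engram}

# Example: the 27B-parameter test model mentioned above
print(split_budget(27_000_000_000))
```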
Engram’s deterministic fetch mechanism allows memory capacity to scale linearly across multiple GPUs, while supporting asynchronous prefetching during inference.
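The value of a deterministic fetch is that slow-memory reads can be issued ahead of time and overlapped with computation. The sketch below is an assumed design, not DeepSeek's code: a dictionary stands in for host RAM or SSD, and a background thread fetches embeddings while the main thread would be running compute.

```python
import queue
import threading
import time

# Stand-in for a large embedding store held in host RAM or on SSD.
SLOW_STORE = {i: f"embedding-{i}" for i in range(100)}

def prefetch(slots, out_q):
    """Fetch embeddings for known slot indices on a background thread."""
    for s in slots:
        time.sleep(0.001)             # simulated slow-memory latency
        out_q.put((s, SLOW_STORE[s]))

def run_layer(slots):
    """Overlap slow-memory fetches with (simulated) layer computation."""
    q = queue.Queue()
    t = threading.Thread(target=prefetch, args=(slots, q))
    t.start()                         # fetches proceed while compute runs
    results = []
    for _ in slots:
        _, emb = q.get()              # ideally ready by the time compute needs it
        results.append(emb)
    t.join()
    return results

print(run_layer([3, 17, 42]))
```

Since slot indices are a pure function of the input, the prefetch list can be built for an entire sequence before the forward pass begins, which is what allows memory capacity to scale across devices without stalling inference.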
It offloads the static knowledge reconstruction from lower layers, freeing up attention mechanisms to focus on the global context.
Hierarchical caching of frequently used embeddings improves efficiency, and the module works with existing GPU and system memory architectures, potentially avoiding costly HBM upgrades.
This technique can ease pressure on expensive memory hardware, especially in regions like China, where access to HBM from suppliers such as Samsung, SK Hynix and Micron is constrained.
Early validation of Engram suggests that models can increase parameter scale and reasoning power while managing memory needs more efficiently.
This approach can help alleviate memory constraints within AI infrastructure, potentially reducing sharp price fluctuations for DDR5 DRAM.
Via South China Morning Post