MILPITAS, Calif., June 12, 2024 -- ScaleFlux is set to transform AI and data center RAS (reliability, availability, and serviceability) with its groundbreaking ECC (Error Correction Coding) technology. As conventional ECC methods buckle under the pressure of climbing memory error rates, undermining system reliability, ScaleFlux's innovative approach using list decoding shatters the limitations, offering rapid and efficient correction of complex errors. This disruptive technology not only boosts DRAM reliability and security but also slashes costs by making lower-cost DRAM chips viable, paving the way for a more resilient and cost-effective computing infrastructure. ScaleFlux is not just meeting the demands of the future; it is redefining them.
MILPITAS, Calif., June 12, 2024 /PRNewswire-PRWeb/ -- The global AI market is set to add $15.7 trillion to the world economy by 2030 (1). As data centers and artificial intelligence (AI) technology evolve rapidly, reliable, high-performance memory solutions are more critical than ever for avoiding the increasing costs of downtime from component and system failures. "Under the combination of growing DRAM densities and increasing challenges with DRAM error rates, we saw the need for a new style of error correction coding," said Tong Zhang, Chief Scientist of ScaleFlux. He continued, "the hyperbolic growth of AI infrastructure and the emergence of memory expansion with Compute Express Link (CXL) has the cascading effect of driving up DRAM capacities and traffic, exacerbating the need for innovation in ECC."
As DRAM technology advances towards 10nm and beyond and the error rates in the media grow, DRAM fault tolerance becomes ever more important. This is where error correction coding (ECC) steps in, playing a vital role in ensuring the reliability of the data and, subsequently, the data centers. Without this crucial error correction, data can readily become corrupted, resulting in "garbage in, garbage out" calculations and even costly system crashes. Traditional ECC methods struggle to handle these increased error rates while meeting stringent latency constraints. A new solution is needed.
Error Rates on the Rise
Four key trends are multiplying the frequency of memory errors:
1. Increasing memory capacity density: Current server and GPU systems can support a terabyte (1TB) or more of DRAM! This capacity expands even further with Compute Express Link (CXL) memory modules.
2. Increasing fault-caused blast radius: As cloud computing infrastructure continues to scale out, the crash of one server caused by memory errors can compromise more and more connected servers. This amplifies the fatal impact of memory bit errors many times over.
3. Increasing memory access speeds: As the industry has progressed from DDR3 through DDR4 to DDR5, transfer rates have quadrupled from 1600Mb/s to 6400Mb/s with further accelerations on the way.
4. Increasing vulnerability to soft errors and defects in the memory media: As memory manufacturers move to newer manufacturing lithography, the memory bit cells shrink, introducing more susceptibility (2) to soft errors and defects.
Errors per second are a function of the memory capacity, the blast radius, the rate of access to the memory, and the inherent rate of errors in the memory media. Increases in all of these factors at once clearly make for a significant challenge to reliability.
Throw on top of that the combination of increasing complexity of error detection with the system-level performance hits from uncorrectable errors and it's a real nightmare situation for maintaining system reliability and avoiding costly downtime.
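To make the compounding effect of the four trends concrete, here is a minimal sketch. It assumes, purely for illustration, that error exposure scales as the simple product of the four factors; the release only says errors are "a function of" them, so the multiplicative model and the function name are hypothetical.

```python
def relative_error_exposure(capacity_x: float,
                            blast_radius_x: float,
                            access_rate_x: float,
                            media_error_rate_x: float) -> float:
    """Relative growth in error exposure versus a baseline system.

    Assumption (not from the source): the four factors combine
    multiplicatively. Each argument is a growth multiple, e.g. 2.0
    means "twice the baseline".
    """
    return capacity_x * blast_radius_x * access_rate_x * media_error_rate_x


# The DDR3-to-DDR5 transfer-rate increase cited above, taken on its own.
ddr_speedup = 6400 / 1600
speed_only = relative_error_exposure(1, 1, ddr_speedup, 1)

# If every factor merely doubled, exposure would grow sixteen-fold.
all_doubled = relative_error_exposure(2, 2, 2, 2)
```

Under this simplified model, modest growth in each factor multiplies into a much larger growth in overall exposure, which is the point the four trends above are making.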
Conventional ECC can no longer cut it
Conventional ECC methods typically use bounded-distance decoding, which can correct up to 't' symbol errors when the code has a minimum distance of '2t+1'. To meet stringent latency constraints, many DRAM ECC designs protect each data access unit (e.g., a 64-byte cache line) by interleaving multiple short ECC codewords, each of which corrects only one or a few symbols at very low latency (1~3 clock cycles).
However, when it comes to tolerating more errors from DRAM devices, conventional methods will become largely inadequate, leading to uncorrectable errors and hence catastrophic failures in data centers.
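The limitation of interleaved short codewords can be sketched in a few lines. The example below assumes a cache line split across several independent codewords that all share the same minimum distance; the function names and the error counts are hypothetical and chosen only to illustrate the '2t+1' relationship.

```python
from typing import Sequence


def correctable_symbols(min_distance: int) -> int:
    """Bounded-distance decoding corrects t = floor((d - 1) / 2) symbol errors."""
    if min_distance < 1:
        raise ValueError("minimum distance must be at least 1")
    return (min_distance - 1) // 2


def line_is_correctable(errors_per_codeword: Sequence[int], min_distance: int) -> bool:
    """A cache line built from interleaved codewords is recovered only if
    every individual codeword sees at most t symbol errors."""
    t = correctable_symbols(min_distance)
    return all(errors <= t for errors in errors_per_codeword)


# Four interleaved codewords, each with minimum distance 3 (so t = 1).
spread_out = line_is_correctable([1, 1, 0, 1], min_distance=3)    # 3 errors, one per codeword
concentrated = line_is_correctable([2, 0, 0, 0], min_distance=3)  # 2 errors in one codeword
```

Here three errors spread across the line are recoverable, while just two errors that land in the same short codeword are not: the correction budget of the other codewords cannot be pooled.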
In layman's terms, think of a computer's memory as a document. ECC functions as a spell checker, detecting and correcting errors like typos that may occur when saving or retrieving data. However, conventional ECC, like a basic spell checker, has its limitations. It can only catch and fix certain types of errors, like single-letter mistakes. If multiple errors occur or the errors are too complex, conventional ECC may struggle to correct them effectively, leaving your data vulnerable to inaccuracies.
Enter an innovative ECC methodology
ScaleFlux, a fabless semiconductor company innovating in the field of data center storage and memory technology, has developed a groundbreaking ECC solution that will revolutionize DRAM fault tolerance. ScaleFlux presented this solution at the IEEE RAS in Data Centers Summit on June 12, 2024. Unlike conventional ECC approaches, ScaleFlux's solution leverages list decoding, a branch of coding theory with origins dating back to the 1950s but largely forgotten in modern applications due to its computational complexity.
Central to ScaleFlux's innovation is the application of list decoding, which aims at correcting more-than-'t' errors. ScaleFlux's solution departs from the conventional ECC paradigm by protecting each 64-byte cache line with a single codeword while ensuring decoding latency as low as 1~3 clock cycles. This approach enables the correction of more-than-'t' errors from any combination of two DRAM devices at very high speed and with low computational complexity. Moreover, list decoding offers the added benefit of avoiding mis-correction by detecting decoding errors when the list contains two codewords of similar likelihood.
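The general idea of list decoding, including the mis-correction safeguard, can be illustrated with a toy example. This is not ScaleFlux's decoder: it assumes a hypothetical four-word binary codebook with minimum distance 4 (so t = 1), searches it by brute force, and uses Hamming distance as a stand-in for likelihood.

```python
from typing import Optional, Sequence

# Hypothetical toy code: any two codewords differ in at least 4 bits, so t = 1.
CODEBOOK = ("00000000", "11110000", "00001111", "11111111")


def hamming(a: str, b: str) -> int:
    """Number of positions in which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def list_decode(received: str, radius: int,
                codebook: Sequence[str] = CODEBOOK) -> Optional[str]:
    """Collect every codeword within `radius` of the received word.

    Returns the closest candidate, or None when the list is empty or
    when the two best candidates are equally close. The tie case is
    reported as a detected error instead of risking a mis-correction.
    """
    candidates = sorted(
        (hamming(received, word), word)
        for word in codebook
        if hamming(received, word) <= radius
    )
    if not candidates:
        return None
    if len(candidates) > 1 and candidates[0][0] == candidates[1][0]:
        return None  # ambiguous list: flag as uncorrectable
    return candidates[0][1]


one_error = list_decode("10000000", radius=2)      # within t, corrected
two_errors = list_decode("10001000", radius=2)     # beyond t, still corrected
ambiguous = list_decode("11000000", radius=2)      # tie between two codewords, detected
bounded_only = list_decode("10001000", radius=1)   # radius t alone finds no candidate
```

With the search radius set to t, the two-error word is simply rejected. Widening the radius lets the decoder recover it when only one codeword is plausible, and report a detected error when two are equally plausible. A hardware decoder would not enumerate a codebook like this; the brute-force search is only there to keep the sketch short.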
To realize this innovative ECC technology, ScaleFlux developed a robust mathematical framework to analyze correction, detection, and mis-correction probabilities. They also engineered a highly parallel VLSI-friendly architecture for ultra-low-latency decoding. Successful demonstrations on FPGA platforms verified more-than-'t' error correction with extremely low mis-correction probability. As a part of the development effort, ScaleFlux collaborated with key ecosystem partners including memory suppliers, CPU vendors, and hyperscalers.
"Innovations like ScaleFlux's decoding ECC methodology may be capable of enabling high-reliability CXL solutions," said Jim Pappas, CXL Consortium Chairperson. "CXL technology-based advancements ultimately enable the industry to meet the demands of expanding memory capacity and bandwidth."
Remember, it's not just about memory
The impact of ScaleFlux's ECC technology extends beyond improving DRAM reliability. By accommodating less reliable, lower-cost DRAM chips, it can reduce the total cost of ownership (TCO) for data center operators. Furthermore, it enhances immunity to security risks associated with malicious DRAM RowHammer attacks, bolstering data center security.
ScaleFlux's innovative ECC design solution is set to revolutionize DRAM fault tolerance, offering new levels of reliability and performance benefits for computing infrastructure. With its potential to lower costs, enhance security, and push the boundaries of RAS (Reliability, Availability, and Serviceability), this technology is poised to shape the future of data center and AI computing infrastructure.
Summing it up
ScaleFlux's ECC technology represents a significant advancement in DRAM fault tolerance, addressing the challenges faced by data centers while unlocking new possibilities for innovation at the system level. "As we embark on this journey towards more resilient and efficient computing infrastructure, ScaleFlux's innovations are blazing the trail for progress in the ever-evolving landscape of AI and data center technology," says JB Baker, VP of Products at ScaleFlux.
About ScaleFlux
In an era where data reigns supreme, ScaleFlux emerges as the vanguard of computational storage and memory solutions, poised to redefine the landscape of IT infrastructure. With a commitment to innovation, ScaleFlux introduces a revolutionary approach to storage and memory that seamlessly combines hardware and software, designed to unlock unprecedented performance, efficiency, reliability, and scalability for data-intensive applications. As the world stands on the brink of a data explosion, ScaleFlux's cutting-edge technology offers a beacon of hope, promising not just to manage the deluge but to transform it into actionable insights and value, heralding a new dawn for businesses and data centers worldwide. For more details, visit https://scaleflux.com/.
References:
1. Downie, Chris. "How Data Centers Can Simultaneously Enable AI Growth and ESG Progress." 13 May 2024, http://www.datacenterdynamics.com/en/opinions/how-data-centers-can-simultaneously-enable-ai-growth-and-esg-progress/. Accessed 7 June 2024.
2. Meixner, Anne. "DRAM Test and Inspection Just Gets Tougher." Semiconductor Engineering, 7 Nov. 2023, semiengineering.com/dram-test-and-inspection-just-gets-tougher/. Accessed 7 June 2024.