What’s New: Today, MLCommons published results of its industry AI performance benchmark in which both the 4th Generation Intel® Xeon® Scalable processor (code-named Sapphire Rapids) and Habana® Gaudi®2 dedicated deep learning accelerator logged impressive training results.
The 4th Generation Intel Xeon Scalable processor with Intel® Advanced Matrix Extensions (AMX), a new built-in AI accelerator, allows customers to extend the general-purpose Xeon server platform to cover even more DL use cases, including DL training and fine tuning. AMX is a dedicated matrix multiplication engine built into every core of 4th Gen Intel Xeon Scalable processors. This dedicated AI engine is optimized to deliver up to 6x higher gen-to-gen DL training model performance using industry standard frameworks.1
In cases where the server or a cluster of servers are predominantly used for DL training and inference compute, the Habana Gaudi2 accelerator is the optimal accelerator. It is purpose-designed to deliver the best DL performance and TCO for these dedicated use cases.
About the Results for Xeon: Intel submitted MLPerf Training v2.1 results on the 4th Gen Intel Xeon Scalable processor product line across a range of workloads. Intel Xeon Scalable Processor was the only CPU submitted for MLPerf v2.1, once again demonstrating it is the best server CPU for AI training, which enables customers to use their shared infrastructure to train anywhere, anytime. The 4th Gen Intel Xeon Scalable processors with Intel AMX deliver this performance out of the box across multiple industry standard frameworks and integrated with end-to-end data science tools and a broad ecosystem of smart solutions from partners. Developers only need to use the latest framework releases of TensorFlow and PyTorch to unleash this performance. Intel Xeon Scalable can now run any AI workload.
Intel’s results show that 4th Gen Intel Xeon Scalable processors are expanding the reach of general-purpose CPUs for AI training, so customers can do more with Xeons that are already running their businesses.This is especially true for training medium to small models or transfer learning (aka fine tuning). The DLRM results are great examples of where we were able to train the model in less than 30 minutes (26.73) with only four server nodes. Even for mid-sized and larger models, 4th Gen Xeon processors could train BERT and ResNet-50 models in less than 50 minutes (47.26) and less than 90 minutes (89.01), respectively. Developers can now train small DL models over a coffee break, mid-sized models over lunch and use those same servers connected to data storage systems to utilize other analytics techniques like classical machine learning in the afternoon. This allows the enterprise to conserve deep learning processors, like Gaudi2, for the largest, most demanding models.
About the Results for Habana Gaudi2: Gaudi2, Habana’s second-generation DL processor, launched in May and submitted leadership results on MLPerf v2.0 training 10 days later. Gaudi2, produced in 7 nanometer process and featuring 24 tensor processor cores, 96GB on-board HBM2e memory and 24 100 integrated gigabit Ethernet ports, has again shown leading eight-card server performance on the benchmark compared to Nvidia’s A100.
As shown here, Gaudi2 improved by 10% for time-to-train in TensorFlow for both BERT and ResNet-50, and reported results on PyTorch, which achieved 4% and 6% TTT advantage for BERT and ResNet-50, respectively, over the May Gaudi2 submission. Both sets of results were submitted in the closed and available categories.
As further evidence of the strength of the results, Gaudi2 continued to outperform the Nvidia A100 for both BERT and ResNet-50, as it did in the May submission and shown here. In addition, it’s notable that Nvidia’s H100 ResNet-50 TTT is only 11% faster than the Gaudi2 performance. And though the H100 is 59% faster than Gaudi2 on BERT, it is worth noting that Nvidia reported BERT TTT in the FP8 data type, while Gaudi2 TTT is on standard, verified BF16 data type (with FP8 enablement in the software plans for Gaudi2). Gaudi2 offers meaningful price-performance improvement versus both A100 and H100.
The Intel and Habana team look forward to its next MLPerf submissions for Intel AI portfolio solutions.
More Context: Performance metrics based on MLPerf v2.1 Training benchmark results. | See today’s MLCommons announcement regarding its latest benchmark. | For information about 4th Gen Intel Xeon Scalable processors, check out " Chalk Talk Covers Strategy and Design Behind 4th Gen Intel Xeon Scalable Processors." | For information on Gaudi2, see today's blog, the website and live demonstration video from the Intel Innovation conference.