MLCommons recently released the results of the latest MLPerf Training v5.1 benchmark round, which serves not only as an annual report card for the AI computing field but also as an arena for NVIDIA and AMD to show off the strength of their next-generation architectures. NVIDIA swept the competition with its Blackwell Ultra architecture, while AMD entered the training benchmark for the first time with its CDNA 4-based MI350 series, demonstrating that it is closing the gap.
NVIDIA: Blackwell Ultra paired with NVFP4 precision delivers a generational leap in performance
In this round, NVIDIA unsurprisingly posted the fastest training times in all seven tests, including large language model (LLM) training and image generation, and was the only platform to submit results for every benchmark.
NVIDIA's trump card this time is the GB300 NVL72 rack-scale system built on the Blackwell Ultra GPU architecture. To extract maximum performance, NVIDIA adopted NVFP4 low-precision computing for the first time in the history of MLPerf Training.
According to official data released by NVIDIA, Blackwell Ultra delivers a significant performance improvement over the previous-generation Hopper architecture at the same GPU count:
• Llama 3.1 405B pre-training: more than a 4x performance improvement.
• Llama 2 70B LoRA fine-tuning: nearly a 5x performance improvement.
Blackwell Ultra's architectural advantages lie in its new Tensor Cores, which deliver 15 petaflops of NVFP4 AI compute, and its up to 279GB of HBM3e high-bandwidth memory. In the Llama 3.1 405B test, NVIDIA also deployed more than 5,000 Blackwell GPUs, setting a record by completing training in just 10 minutes.
AMD: CDNA 4 architecture debuts, MI355X delivers 2.8x the performance of its predecessor
AMD, for its part, also delivered impressive results. This round marks the first time AMD has submitted MLPerf Training results with its Instinct MI350 series GPUs (the MI355X and MI350X).
The AMD Instinct MI355X GPU is built on a 3nm manufacturing process with the CDNA 4 architecture and is equipped with 288GB of HBM3e high-bandwidth memory. AMD highlighted the following performance gains:
• Generational speedup: Compared to its predecessor, the MI300X, the MI355X delivers a 2.8x improvement in training performance.
• Llama 2 70B LoRA fine-tuning: The MI355X platform finished in 10.18 minutes, a sharp reduction from the MI300X's 27.97 minutes.
While NVIDIA's B200 platform edged out the AMD MI355X in absolute speed with a time of 9.85 minutes, the MI355X's 10.18 minutes is highly competitive, indicating that the gap between the two is narrowing.
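As a quick sanity check on these figures, the speedups implied by the fine-tuning times reported above can be computed directly (a minimal sketch using only the times quoted in this article):

```python
# Llama 2 70B LoRA fine-tuning times (minutes), as reported in MLPerf Training v5.1 coverage above.
mi300x_min = 27.97  # AMD MI300X (previous generation)
mi355x_min = 10.18  # AMD MI355X
b200_min = 9.85     # NVIDIA B200

# Speedup = old time / new time (lower time is better).
gen_speedup = mi300x_min / mi355x_min   # MI355X vs. MI300X
b200_lead = mi355x_min / b200_min       # B200 vs. MI355X

print(f"MI355X vs MI300X: {gen_speedup:.2f}x faster")  # ~2.75x, consistent with AMD's ~2.8x claim
print(f"B200 vs MI355X:   {b200_lead:.2f}x faster")    # ~1.03x, a narrow lead
```

The roughly 2.75x ratio on this single workload is broadly consistent with AMD's claimed 2.8x generational training improvement, while NVIDIA's lead on the same workload works out to only a few percent.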
Ecosystem and Future Layout
This round also highlighted the expansion of both companies' ecosystems. Fifteen NVIDIA partners submitted results, including ASUS, Dell, Quanta Computer, and Wistron. AMD was not to be outdone, with 9 partners (including ASUS, Dell, and GIGABYTE) submitting results based on AMD Instinct hardware.
Looking ahead, AMD also updated its product roadmap at its Financial Analyst Day, confirming that it will maintain a once-a-year release cadence: the MI400 series is expected to launch in 2026, and the MI500 series is planned for 2027, to compete further with NVIDIA.