Following its keynote announcement of the "Blackwell" GPU architecture, NVIDIA explained the details of the architecture after the session and announced three accelerated computing designs built on it: the B100, the B200, and the GB200 Superchip.
NVIDIA CEO Jensen Huang explained that the "Blackwell" GPU architecture was created by pushing the limits of physics while balancing real-world performance and cost.
The "Blackwell" display architecture is designed for the needs of AI with tera-scale parameters. It is produced using TSMC's customized 4nm process and can achieve 20 PetaFLOPS of computing power through a single GPU design. The Superchip integrated with this GPU design contains 2080 billion transistors, which can increase training efficiency by 4 times, inference computing efficiency by 30 times, and energy utilization efficiency by 25 times compared to the previously launched "Hopper".
Architecturally, "Blackwell" integrates a second-generation Transformer Engine for AI, Tensor Cores that support FP4/FP6 low-precision floating-point operations, and fifth-generation NVLink interconnect technology. It can link up to 576 GPUs together, supports data decompression at up to 800GB per second, and adds a stronger data-encryption protection mechanism along with features to ensure operational stability.
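To illustrate what the FP4 support mentioned above means in practice, the sketch below emulates quantization to a 4-bit floating-point grid in plain Python. This is a hand-rolled illustration, not NVIDIA's implementation: the value grid assumes the common E2M1 layout for FP4, and the scaling scheme is a simple per-tensor max-abs choice made up for this example.

```python
# Illustrative sketch only (assumption: E2M1-style FP4 grid, simple
# max-abs per-tensor scaling) -- not NVIDIA's Tensor Core implementation.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]   # positive E2M1 values
FP4_VALUES = sorted({s * v for v in FP4_GRID for s in (1.0, -1.0)})

def quantize_fp4(x, scale):
    """Snap x/scale to the nearest representable FP4 value, then rescale."""
    t = x / scale
    nearest = min(FP4_VALUES, key=lambda v: abs(v - t))
    return nearest * scale

weights = [0.12, -0.9, 2.4, -5.1]
scale = max(abs(w) for w in weights) / 6.0   # map the largest weight to +/-6
quantized = [quantize_fp4(w, scale) for w in weights]
```

The payoff of such a format is that each weight occupies only 4 bits instead of 16 or 32, which is what allows the same silicon datapath and memory to move and process far more values per cycle.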
In addition, "Blackwell" also has a special design with two sets of masks corresponding to the die core unit. The internal communication is carried out using the NVLink-HBI interface with a data transmission rate of 10TB per second, and it can operate as a single GPU.
Jensen Huang noted that combining two dies into a single GPU clearly strikes a balance between the yields of existing process technology and manufacturing cost, while the pairing also scales up the computing performance of the "Blackwell" architecture.
In FP8 mode, "Blackwell" delivers 10 PetaFLOPS of computing power, and in FP4 mode it delivers 20 PetaFLOPS. The GPU integrates 192GB of HBM3e high-bandwidth memory with 8TB per second of memory bandwidth, and can exchange data over NVLink at 1.8TB per second.
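The figures above hang together in a simple way: halving the operand width from FP8 to FP4 doubles how many values fit through the same datapath, and the quoted HBM3e capacity and bandwidth imply how often the full memory can be swept. A back-of-envelope check:

```python
# Back-of-envelope check of the figures quoted above.
# Halving operand width (FP8 -> FP4) doubles values processed per cycle:
fp8_pflops = 10
bits_fp8, bits_fp4 = 8, 4
fp4_pflops = fp8_pflops * (bits_fp8 // bits_fp4)   # -> 20 PFLOPS

# 192 GB of HBM3e at 8 TB/s means the entire memory contents can be
# read roughly 40+ times per second:
hbm_capacity_tb = 0.192
hbm_bandwidth_tbs = 8
full_sweeps_per_second = hbm_bandwidth_tbs / hbm_capacity_tb
```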
To further enhance "Blackwell's" efficiency in multimodal AI applications, NVIDIA also provides data transfer rates of up to 100GB per second through the HDR InfiniBand interface, allowing computing data to be synchronized between every 15 GPUs in a large-scale computing cluster. Combined with the fifth-generation NVLink design, this lets computing nodes comprising up to 576 GPUs maintain computational accuracy.
Three accelerated computing designs launched: B100, B200, and GB200 Superchip
The current "Blackwell" display architecture is used to create accelerated computing element designs, which are divided into B100, B200, and a combination of a single "Grace" CPU and two "Blackwell" GPUs.GB200 Superchip.
Both the B100 and B200 are equipped with 192GB of HBM3e high-bandwidth memory delivering 8TB per second, matching the GPU's own data-transfer rate so the architecture can sustain faster data processing.
The biggest difference between the B100 and B200 is operating power. The former draws up to 700W, can be air-cooled, and drops directly into the HGX rack space designed for the H100 accelerator. The latter typically draws 1000W and can still be air-cooled, though whether it fits existing H200 rack space depends on the deployment. Pushed further to 1200W, it must be water-cooled, which requires a redesigned rack.
The GB200 Superchip is designed to accelerate AI training and operates in a fully water-cooled configuration.
The GB200 Superchip must be fully liquid-cooled, but this reduces the need for space-consuming heat sinks while the water-cooling loop maintains operational stability. Compared with the DGX H100 system, which consumes 10.2kW in an 8U rack design, the space occupied is reduced to one-eighth at similar computing performance. The water-cooling system also shrinks the space needed for heat exchange and lowers noise during operation.
Using the H100 as the baseline, the GB200 Superchip offers roughly 6 times the computing power when processing GPT-3 with approximately 175 billion parameters. For multimodal workloads in specific domains, its performance advantage can reach 30 times, handling models of up to 1.8 trillion parameters.
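A rough sizing sketch shows why trillion-parameter models push past a single GPU's memory. Assuming 4-bit (FP4) weight storage, the arithmetic below estimates the footprint of a 1.8-trillion-parameter model against the 192GB of HBM3e on one "Blackwell" GPU; the exact storage format in deployment is an assumption here.

```python
# Rough sizing sketch (assumption: weights stored in 4-bit FP4).
params = 1.8e12                     # 1.8 trillion parameters
bytes_per_param_fp4 = 0.5           # 4 bits = half a byte
weight_bytes = params * bytes_per_param_fp4   # ~0.9 TB for weights alone

# Against one GPU's 192 GB of HBM3e, weights alone need several GPUs --
# before counting activations and KV caches -- hence the NVLink domains.
gpus_needed = weight_bytes / 192e9
```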
The GB200 NVL72, which connects 36 GB200 Superchips via NVLink, can achieve 720 PFLOPS of computing power for training and 1440 PFLOPS for inference. It supports models of up to 27 trillion parameters, with multi-node bandwidth of 130TB per second and a maximum transfer rate of 260TB per second.
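These rack-level totals follow directly from the per-GPU figures quoted earlier (10 PetaFLOPS in FP8, 20 PetaFLOPS in FP4), assuming each Superchip contributes two "Blackwell" GPUs:

```python
# Sanity check of the NVL72 figures: 36 Superchips x 2 GPUs each,
# with per-GPU peaks of 10 PFLOPS (FP8) and 20 PFLOPS (FP4).
superchips = 36
gpus = superchips * 2                # 72 GPUs per NVL72 rack
training_pflops = gpus * 10          # FP8 mode -> 720 PFLOPS
inference_pflops = gpus * 20         # FP4 mode -> 1440 PFLOPS
```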
In addition, connecting eight GB200 NVL72 systems in series builds a DGX GB200 SuperPOD, integrating 288 "Grace" CPUs and 576 "Blackwell" GPUs with 240TB of high-speed memory. In FP4 mode it delivers 11.5 ExaFLOPS of computing power, achieving 30 times the inference efficiency, 4 times the training efficiency, and 25 times the energy efficiency of the previous generation.
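The SuperPOD totals likewise follow from stacking eight NVL72 racks, each with 36 "Grace" CPUs and 72 "Blackwell" GPUs at 20 PetaFLOPS (FP4) apiece:

```python
# Sanity check of the SuperPOD figures by stacking eight NVL72 racks.
racks = 8
grace_cpus = racks * 36              # 288 "Grace" CPUs
blackwell_gpus = racks * 72          # 576 "Blackwell" GPUs
fp4_exaflops = blackwell_gpus * 20 / 1000   # 20 PFLOPS each -> ~11.5 EFLOPS
```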
Maintaining configuration flexibility, but favoring Arm-based combinations as AI development trends continue
Currently, NVIDIA keeps the "Blackwell" GPU architecture flexible, offering the option of combining it with either an x86 or Arm CPU. The B100 is also compatible with existing H100 racks, and the B200 can in certain circumstances reuse existing racks, preserving deployment and upgrade flexibility while significantly improving computing performance.
However, for AI deployments, NVIDIA states that the best combination is still an Arm-architecture CPU. This is mainly due to the limits of x86 CPUs' I/O ports and channel designs, the cap on the number of NVLink connections they support, and the additional cooling infrastructure that x86 CPUs require. For AI inference and training purposes, pairing with the "Grace" CPU therefore remains the primary recommendation.


