Semiconductor industry analyst firm SemiAnalysis has released results from its InferenceMAX v1 benchmark, showing that GPUs built on NVIDIA's Blackwell architecture swept every test, setting a new bar for performance, energy efficiency, and overall economics.
The new benchmark is billed as the first independent evaluation to genuinely reflect the total cost of AI inference, covering a range of models and real-world application scenarios and focusing on efficiency rather than raw speed.
The AI Factory Formula for a 15x ROI
The report indicates that a $5 million investment in an NVIDIA GB200 NVL72 system can generate as much as $75 million in DSR1 (DeepSeek R1) token revenue from AI applications, a return on investment of 15x. Inference performance, in other words, is no longer just a technical metric but a key engine of enterprise operating profit.
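For readers checking the arithmetic, the quoted figures reduce to a one-line calculation:

```python
# Back-of-the-envelope check of the ROI claim cited above.
investment_usd = 5_000_000       # GB200 NVL72 deployment (figure from the report)
token_revenue_usd = 75_000_000   # projected DSR1 token revenue (figure from the report)

roi = token_revenue_usd / investment_usd
print(f"Return on investment: {roi:.0f}x")  # -> Return on investment: 15x
```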
“Inference is at the heart of how AI delivers value every day,” said Ian Buck, vice president of Hyperscale and High-Performance Computing at NVIDIA. “Blackwell’s achievements demonstrate that our end-to-end approach enables customers to achieve both extreme performance and optimal efficiency when deploying AI at scale.”
Blackwell Architecture: Driving Performance and Efficiency in Tandem
In the InferenceMAX v1 benchmark, Blackwell-based B200 GPUs delivered impressive results across multiple models, reaching 60,000 tokens per second per GPU and up to 1,000 tokens per second (TPS) per user. That is a 4x performance gain over the previous-generation H200 GPU, while compute cost per million tokens fell 15x, to an industry-low $0.02 per million tokens.
This performance is enabled by NVIDIA's new TensorRT-LLM v1.0 inference framework and NVLink Switch high-speed interconnect, which provides 1,800 GB/s of bidirectional bandwidth and allows up to 72 GPUs to operate as a single super GPU.
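For context, a minimal sketch of how an application might drive inference through TensorRT-LLM's high-level Python LLM API, following its documented usage pattern; the model name and prompt here are illustrative assumptions, not part of the benchmark setup:

```python
from tensorrt_llm import LLM, SamplingParams

# Load a model through TensorRT-LLM's high-level LLM API.
# The model name is an illustrative assumption.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# Standard sampling controls: cap the output length, mild randomness.
params = SamplingParams(max_tokens=128, temperature=0.8)

outputs = llm.generate(["Explain NVLink in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)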
Open Source Collaboration Advances the Inference Revolution
NVIDIA has also collaborated with several AI research teams, including OpenAI (gpt-oss 120B), Meta (Llama 3 70B), and DeepSeek AI (DeepSeek R1), to optimize open-source inference performance. Co-development with communities such as FlashInfer, SGLang, and vLLM has likewise enabled TensorRT-LLM to fully exploit Blackwell's parallelization potential.
In addition, the newly released gpt-oss-120B-Eagle3-v2 model introduces speculative decoding, which predicts several tokens ahead to cut latency substantially, tripling per-user throughput.
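To illustrate the core idea (a generic greedy variant, not NVIDIA's EAGLE-3 implementation), a minimal sketch of speculative decoding: a cheap draft model proposes a few tokens, the full target model verifies them, and the longest agreeing prefix is accepted. The `draft_next` and `target_next` callables are hypothetical stand-ins for real models:

```python
from typing import Callable, List

def speculative_step(
    draft_next: Callable[[List[int]], int],   # cheap draft model (hypothetical)
    target_next: Callable[[List[int]], int],  # full target model (hypothetical)
    context: List[int],
    k: int = 4,
) -> List[int]:
    # 1. The draft model speculates k tokens autoregressively.
    proposal, ctx = [], list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2. The target model checks each proposed token. In a real engine this
    #    verification is a single batched forward pass, which is where the
    #    latency win comes from.
    accepted, ctx = [], list(context)
    for tok in proposal:
        expected = target_next(ctx)
        if expected != tok:
            accepted.append(expected)  # keep the target's token and stop
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted

# Toy usage: when draft and target agree, all k proposals are accepted at once.
draft = lambda ctx: (ctx[-1] + 1) % 100
target = lambda ctx: (ctx[-1] + 1) % 100
print(speculative_step(draft, target, [1, 2, 3]))  # -> [4, 5, 6, 7]
```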
The Balance Between Economics and Sustainability
InferenceMAX uses a Pareto-frontier model to evaluate the trade-off between performance, energy consumption, and responsiveness. The results show that Blackwell not only leads in throughput but also sets new marks in energy efficiency and cost control, including a 10x increase in throughput per megawatt over the previous generation and a significant rise in tokens per watt, easing the energy burden on data centers.
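As a rough illustration of the evaluation model, a minimal sketch of a Pareto-frontier filter over hypothetical (throughput, energy-per-token) configurations; the data points are invented for demonstration:

```python
from typing import List, Tuple

def pareto_frontier(points: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """points = (tokens_per_second, joules_per_token).

    A configuration is on the frontier if no other configuration is at
    least as good on both axes (throughput higher, energy lower) while
    being a different point.
    """
    frontier = []
    for p in points:
        dominated = any(
            q[0] >= p[0] and q[1] <= p[1] and q != p
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return frontier

# Invented example configurations: (tokens/s, joules/token).
configs = [(1000, 0.5), (800, 0.3), (1200, 0.9), (900, 0.6)]
print(pareto_frontier(configs))  # (900, 0.6) is dominated by (1000, 0.5)
```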
Conclusion: Benchmarks in the AI Factory Era
As AI evolves from single-shot generation to multi-step reasoning and toolchain integration, inference performance will directly determine the economies of scale of AI services. NVIDIA, through its Blackwell architecture, has successfully translated "performance" into "revenue," making the concept of the AI factory a reality.
The debut of InferenceMAX is not just a technology showcase; it signals that NVIDIA is leading the industry into a new era of the "inference economy."