At Google Cloud Next '26 in Las Vegas this year, in addition to the keynote address, Google's infrastructure team gave a deeper dive into the latest TPU development roadmap in a breakout session. Echoing the main theme of the event, "How to Design TPU Architecture for Cutting-Edge AI," Google further explained the hardware architecture details of the 8th generation TPUs (TPU 8t and TPU 8i). It also showcased the practical application results of the AI team Decart in frontier world models.

From an overall architecture perspective, Google has concluded that a single chip architecture cannot simultaneously meet the needs of training massive models with enormous parameter counts and serving inference at extremely low latency. The 8th generation TPU therefore adopts a more explicit dual-architecture split and completely overhauls its network interconnect technology.
Continuity and Breakthrough: Ironwood Architecture, 3D Torus, and Virgo Network
In this technical session, Google emphasized that the 8th generation TPU builds on the foundation laid by the previous-generation Ironwood architecture and further enhances RDMA (Remote Direct Memory Access) performance. With the upgraded TPUDirect Storage technology, the latency of data transfer between the chip and the storage cluster is minimized. For cutting-edge AI models that frequently read large datasets, this significantly reduces the time computing units sit idle "waiting for data".

Furthermore, the TPU 8t incorporates a large language model (LLM) decoding engine in its SparseCore co-processing core, increasing its arithmetic intensity by up to 30 times and delivering a 5-fold increase in computational efficiency for models such as DLRM DCN v2. The TPU 8i uses SparseCore as a collective acceleration engine, paired with 384MB of integrated SRAM (Static Random Access Memory) for key-value caching. This cache acts as short-term memory during inference in large AI models, reducing the compute and memory consumed when processing repeated data and improving inference efficiency.
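To give a rough sense of why on-chip SRAM capacity matters for key-value caching, the sketch below estimates how many tokens of KV cache would fit in 384MB. Only the 384MB figure comes from the session as reported above. The model dimensions (layer count, KV heads, head dimension, bytes per element) are hypothetical values chosen for illustration, and the sketch assumes the whole SRAM is available for the cache, which may not hold in practice.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    # Each layer stores one key vector and one value vector per KV head,
    # hence the leading factor of 2.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def tokens_that_fit(sram_bytes, per_token_bytes):
    # Whole tokens only; a partially stored token is not usable.
    return sram_bytes // per_token_bytes


# 384MB from the article, treated here as MiB (an assumption).
SRAM_BYTES = 384 * 1024 * 1024

# Hypothetical model configuration, NOT taken from Google's session.
per_token = kv_bytes_per_token(
    num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_elem=2
)
capacity = tokens_that_fit(SRAM_BYTES, per_token)

print(f"KV cache per token: {per_token // 1024} KiB")
print(f"Tokens that fit in SRAM: {capacity}")
```

Under these assumed dimensions each token costs 128 KiB of cache, so a few thousand tokens of recent context could stay on chip. A larger model or a wider element type would shrink that number proportionally.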


In terms of scaling, Google adopts a two-pronged strategy: vertical and horizontal scaling.
• Vertical scaling (Scale-Up): Within a single Pod, Google still uses the mature and efficient 3D Torus network topology. With ICI (Inter-Chip Interconnect) technology that doubles bandwidth once again, adjacent TPU chips can exchange data at ultra-high bandwidth and extremely low latency. This lets all TPUs in a single cluster operate like one giant chip, which is well suited to tightly coupled computing tasks.

• Horizontal scaling (Scale-Out): To overcome the physical limits of a single Pod and build data-center-scale computing power by linking Pods together, enough to train next-generation cutting-edge AI models, Google detailed its new Virgo Network technology. Virgo Network is designed for massive pools of computing power spanning clusters and even data centers, sustaining stable, high network throughput and fault tolerance even for distributed computing across tens of thousands of TPUs.
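
The 3D Torus topology used for scale-up can be sketched in a few lines: each chip sits at an (x, y, z) coordinate and links to its two neighbors along every axis, with the edges wrapping around. The 4x4x4 grid size and the function names below are illustrative assumptions, not the dimensions or software interface of a real TPU Pod.

```python
def torus_neighbors(coord, dims):
    """Return the six direct neighbors of a chip in a 3D torus.

    Assumes every dimension has size >= 3 so the six neighbors are distinct.
    """
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            n = list(coord)
            # The modulo implements the wraparound link at the grid edge.
            n[axis] = (n[axis] + step) % dims[axis]
            neighbors.append(tuple(n))
    return neighbors


def torus_hops(a, b, dims):
    """Shortest hop count between two chips, using wraparound links."""
    total = 0
    for p, q, d in zip(a, b, dims):
        direct = abs(p - q)
        total += min(direct, d - direct)  # go around if that is shorter
    return total


DIMS = (4, 4, 4)  # hypothetical grid: 64 chips

print(torus_neighbors((0, 0, 0), DIMS))
print(torus_hops((0, 0, 0), (3, 3, 3), DIMS))
```

In this grid the opposite corners (0, 0, 0) and (3, 3, 3) are only 3 hops apart, compared with 9 hops in a mesh without wraparound links. That bounded distance is one reason a torus suits tightly coupled collective operations.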


Dual architectures: TPU 8t specializes in extreme training, while TPU 8i optimizes inference cost.
Google's past TPU designs mostly tried to strike a balance between training and inference (for example the earlier v4 and Trillium, and even the previous-generation Ironwood). As the generative AI industry has matured, however, workloads have diverged significantly, making it necessary to fully separate "training" and "inference" in order to achieve greater efficiency in "training" and greater cost-effectiveness in "inference". Unlike in the past, it is therefore no longer enough to take the same TPU and vary its memory and other architectural settings to cover both "training" and "inference" tasks.
• TPU 8t (training): The "t" suffix stands for training. This architecture is designed for pre-training massive language models and multimodal models, featuring maximum-capacity HBM high-bandwidth memory and maximizing the peak computing power of the matrix multiplication unit (MXU). Its design philosophy is to pursue computational density and memory bandwidth at all costs, thereby shortening the training cycle of cutting-edge models.



• TPU 8i (inference): The "i" suffix stands for inference. This architecture drops some complex instruction sets dedicated to training and instead spends chip area on larger SRAM (static random access memory) caches and higher-speed I/O throughput. The goal is to deliver the lowest time to first token (TTFT) and the highest data throughput during the model deployment phase. It even uses HBM3e high-bandwidth memory (which provides higher data transfer bandwidth than the HBM3 high-bandwidth memory used in the TPU 8t), while significantly reducing the operating cost per API call.



The Moat of the Software Ecosystem: The Advantages of PyTorch on TPU
Of course, even the most powerful hardware is useless without a supportive developer ecosystem. Facing the formidable ecosystem barrier erected by NVIDIA CUDA, Google devoted a significant part of the session to the design advantages of the upcoming PyTorch on TPU.

By continuously optimizing the XLA (Accelerated Linear Algebra) compiler, Google has enabled PyTorch developers to move models that originally ran on GPUs to TPU computing clusters with "zero code modification" or "minimal modification," reducing the porting cost of migration.
PyTorch/XLA can now automatically translate dynamic graphs into static graphs that TPUs can execute efficiently. It also supports advanced techniques such as Automatic Mixed Precision (AMP) training and Fully Sharded Data Parallel (FSDP), allowing startups to easily migrate existing PyTorch projects to the TPU 8t for large-scale training.
A review of the history of TPU development
Looking back at the development of Google TPU, its architectural evolution is almost a microcosm of the history of modern AI development:
• TPU v1 (2015): Focused on inference, it accelerated AlphaGo and Google's internal search capabilities.
• TPU v2/v3 (2017-2018): By adding floating-point operations and HBM high-bandwidth memory, the TPU officially entered the field of model training and introduced the concept of Pod clusters.
• TPU v4 (2021): Introduced optical circuit switches and the 3D Torus architecture, marking a milestone of exaflop-level computing power.
• TPU v5e / v5p (2023): The first attempt to split the product line in two, with one part focused on cost-effectiveness (v5e) and the other on extreme performance (v5p), laying the groundwork for the later separate tracks. At this stage, however, the products were still differentiated mainly by memory configuration and were essentially based on the same TPU design.
• TPU v6 "Trillium" (2024): Fully embracing generative AI, it significantly improved energy efficiency and memory bandwidth while handling both "training" and "inference", and became the design basis for the subsequent "Ironwood".
• TPU 8t / 8i (2026): The dual-architecture (t/i) strategy for training and inference was formally established, and horizontal scaling of computing power was pushed to a whole new level through the Virgo Network.


Decart Frontier World Model Applications
In the second half of the session, Google also invited Decart, a well-known AI team, to share its results from running a cutting-edge world model on the 8th generation TPU.
Decart points out that world models must handle extremely complex physical laws and time-series generation, which places stringent demands on memory bandwidth and interconnect latency. Using the TPU 8t's ICI interconnect technology and TPUDirect Storage data handling, Decart minimized data loading bottlenecks and achieved a near-real-time interactive inference experience on the TPU 8i, demonstrating the practical value of Google's dual-architecture design and underlying network upgrades.
The choice of Decart's cutting-edge world model to illustrate the performance of the 8th generation TPU is clearly a response to earlier criticism along the lines of "Google's TPUs struggle to handle workloads like Decart's that require real-time rendering." Google also emphasized that current TPU workloads can be fully supported by PyTorch on TPU, which is compatible with computing projects built with PyTorch. This also means that Google will continue to collaborate with many open source computing companies to optimize applications.



