NVIDIA GB200 NVL72 Delivers 10x Performance Boost for MoE-Based Kimi K2 Thinking LLM
NVIDIA has announced a major leap forward in accelerating Mixture of Experts (MoE) frontier models, showcasing how its Blackwell-based GB200 NVL72 rack-scale system delivers transformative performance gains for large-scale AI deployments. As detailed in the company's official press release, the GB200 NVL72 achieves a 10x generational performance improvement over the previous-generation Hopper H200 platform when running Kimi K2 Thinking, one of the leading open-source MoE LLMs, which activates 32 billion of its roughly one trillion parameters per forward pass.
The AI industry has rapidly expanded foundation model capability by growing parameter counts, training-token budgets, and training infrastructure, but this approach is reaching practical compute limits. MoE models offer a more resource-efficient architecture: for each token, a learned router activates only a small subset of the model's experts, so the compute spent per token is a fraction of what the total parameter count would suggest. However, scaling MoEs introduces significant communication and compute bottlenecks, especially as models grow into frontier-scale deployments. NVIDIA's GB200 NVL72 tackles this challenge head on.
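To make the routing idea concrete, here is a minimal NumPy sketch of top-k expert routing. The dimensions, weights, and `top_k` value are illustrative only and do not reflect Kimi K2 Thinking's actual architecture:

```python
import numpy as np

def moe_layer(tokens, experts, router_w, top_k=2):
    """Route each token to its top-k experts and mix their outputs.

    tokens:   (n_tokens, d_model) activations entering the MoE layer
    experts:  list of per-expert weight matrices, each (d_model, d_model)
    router_w: (d_model, n_experts) learned routing weights
    """
    logits = tokens @ router_w                      # (n_tokens, n_experts)
    # Softmax over experts gives each token a routing distribution.
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)

    out = np.zeros_like(tokens)
    top = np.argsort(-probs, axis=-1)[:, :top_k]    # top-k expert ids per token
    for i, tok in enumerate(tokens):
        # Only top_k of the n_experts run for this token, so activated
        # parameters are a small fraction of the total parameter count.
        gate = probs[i, top[i]]
        gate /= gate.sum()                          # renormalize kept weights
        for w, e in zip(gate, top[i]):
            out[i] += w * (tok @ experts[e])
    return out

rng = np.random.default_rng(0)
d, n_experts = 64, 8
experts = [rng.normal(0, 0.02, (d, d)) for _ in range(n_experts)]
router_w = rng.normal(0, 0.02, (d, n_experts))
tokens = rng.normal(size=(4, d))
print(moe_layer(tokens, experts, router_w).shape)   # (4, 64)
```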
A performance chart referenced in the announcement highlights the magnitude of the improvement: the GB200 platform surpasses five million tokens per megawatt on Kimi K2 Thinking, dramatically outperforming the H200. This reflects NVIDIA's co-design approach, which couples 72 tightly integrated Blackwell GPUs with 30 terabytes of high-speed shared memory. That combination enables aggressive expert parallelism, where token batches are split and scattered across GPUs, while sustaining the all-to-all communication volume that grows steeply as expert parallelism widens.
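The following toy sketch shows the dispatch half of that scatter: tokens are grouped by the rank that owns their assigned expert, which is the payload an all-to-all exchange would carry. The round-robin expert-to-rank mapping and the sizes here are hypothetical, not NVIDIA's actual sharding scheme:

```python
import numpy as np

def ep_dispatch(token_expert_ids, n_ranks):
    """Group tokens by the rank that owns their assigned expert.

    The per-rank send lists built here are what an all-to-all exchange
    (e.g. over NVLink) would carry before each rank runs its expert
    shard on the received tokens and results are gathered back.
    """
    sends = [[] for _ in range(n_ranks)]
    for tok_idx, expert in enumerate(token_expert_ids):
        owner = expert % n_ranks           # hypothetical round-robin sharding
        sends[owner].append((tok_idx, expert))
    return sends

rng = np.random.default_rng(1)
expert_ids = rng.integers(0, 64, size=16)  # top-1 expert id per token
for rank, batch in enumerate(ep_dispatch(expert_ids, n_ranks=8)):
    print(f"rank {rank}: receives {len(batch)} tokens")
```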
Additional innovations in NVIDIA's full-stack architecture also play a major role. The NVIDIA Dynamo framework coordinates disaggregated serving by routing prefill and decode operations to different GPUs, ensuring that each stage runs with the optimal degree of parallelism: compute-bound prefill uses the parallelism techniques best suited to processing long prompts, while bandwidth-bound decode scales out across large expert-parallel groups. Meanwhile, the NVFP4 4-bit precision format delivers improved performance and efficiency while maintaining inference accuracy.
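A simplified sketch of the disaggregated-serving idea follows. The class, queue, and field names are invented for illustration and are not Dynamo's actual API:

```python
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    prompt_len: int
    kv_cache: object = None        # filled in by the prefill stage

class DisaggregatedServer:
    """Toy prefill/decode separation: prefill workers build the KV cache
    (compute-bound), then hand requests to decode workers (bandwidth-bound),
    so each pool can be sized and parallelized for its own workload."""

    def __init__(self):
        self.prefill_queue: list[Request] = []
        self.decode_queue: list[Request] = []

    def submit(self, req: Request):
        self.prefill_queue.append(req)

    def step(self):
        # Prefill: one full pass over the prompt, emitting a KV cache.
        for req in self.prefill_queue:
            req.kv_cache = f"kv[{req.prompt_len} tokens]"   # placeholder
            self.decode_queue.append(req)
        self.prefill_queue.clear()
        # Decode: token-by-token generation against the cached KV.
        for req in self.decode_queue:
            print(f"req {req.rid}: decoding with {req.kv_cache}")

srv = DisaggregatedServer()
srv.submit(Request(rid=1, prompt_len=512))
srv.step()
```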
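And a toy illustration of block-scaled 4-bit quantization in the spirit of NVFP4: the real format pairs 4-bit E2M1 values with FP8 block scales plus a tensor-level scale, whereas this sketch uses float scales purely to show how values round onto the E2M1 grid:

```python
import numpy as np

# The magnitudes representable by a 4-bit E2M1 float, the grid NVFP4 uses.
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_block(x, block=16):
    """Toy block-scaled 4-bit quantize/dequantize round trip.

    Each 16-element block gets its own scale so outliers in one block
    don't destroy precision in the others; values are then rounded to
    the nearest representable E2M1 magnitude, keeping the sign.
    """
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / E2M1[-1]  # per-block scale
    scale[scale == 0] = 1.0
    scaled = x / scale
    mag = np.abs(scaled)
    idx = np.abs(mag[..., None] - E2M1).argmin(axis=-1)      # nearest grid point
    q = np.sign(scaled) * E2M1[idx]
    return (q * scale).reshape(-1)                           # dequantized values

w = np.random.default_rng(2).normal(size=64).astype(np.float32)
wq = quantize_fp4_block(w)
print("max abs error:", np.abs(w - wq).max())
```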
This milestone arrives at a pivotal moment, as GB200 NVL72 systems move deeper into the global supply chain to power the next generation of frontier AI models. MoE architectures are becoming increasingly attractive thanks to their scalability and efficiency, which support broad deployment in both enterprise and cloud environments. NVIDIA's work positions the Blackwell platform as a central enabler of this shift, giving developers and AI providers the tools to expand capabilities without incurring prohibitive compute costs.
NVIDIA’s breakthrough signals a new performance era for frontier AI and marks a decisive step toward infrastructure purpose-built for massive MoE workloads.
Do you think MoE architectures will become the dominant standard for next-generation AI models? Share your views.
