FuriosaAI and Broadcom Partner on 2nm AI Inference Accelerator With HBM4/E Memory
FuriosaAI is moving beyond the traditional GPU playbook with a new AI accelerator designed around high bandwidth inference performance, advanced chiplet architecture, and next generation HBM4/E memory. The company has announced a partnership with Broadcom to develop its 3rd generation AI accelerator, targeting large scale AI inference clusters as demand for agentic AI workloads continues to accelerate.
The new platform builds on FuriosaAI’s 2nd generation RNGD accelerator, which is already in mass production using TSMC’s 5nm process technology. The current RNGD platform is a 180W PCIe based accelerator designed for large language model and agentic AI workloads, with customers including Samsung SDS and LG AI Research. With its next generation design, FuriosaAI is shifting deeper into the AI inference market, where token throughput, latency, bandwidth, and energy efficiency are becoming the core battlegrounds for data center infrastructure.
At the center of FuriosaAI’s 3rd generation platform is a 2nm compute architecture paired with HBM4/E memory. The company says this combination is designed to enable high bandwidth, rack scale networking across massive AI compute clusters. Rather than following the conventional GPU approach, FuriosaAI is focusing on bandwidth driven data movement and inference optimization, claiming its architecture can deliver higher performance per watt and greater token density than even the most efficient GPU designs.
Platform highlights
2nm compute technology paired with HBM4/E memory for advanced AI inference workloads
High bandwidth architecture designed for rack scale AI compute clusters
Optimized data movement for higher performance per watt and greater token density
Broadcom infrastructure support, including Ethernet and PCIe IP
Advanced packaging designed to integrate multiple silicon dies into a single AI accelerator platform
Software stack designed to quickly deploy new AI models while meeting throughput and latency requirements
The teaser image shared by FuriosaAI shows a complex chip design featuring 12 HBM4/E memory sites, 2 large 2nm compute chiplets, and 2 I/O controllers. If each of the 12 sites carries a 12-high, 36GB memory stack, the total memory capacity could reach 432GB (12 × 36GB). That level of memory capacity and bandwidth would position the accelerator for demanding inference workloads where models need fast access to large amounts of data across the compute platform.
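The 432GB estimate is simple multiplication, and a short Python sketch makes the assumptions explicit. The site count comes from the teaser image and the 36GB, 12-high stack is the article's hypothetical configuration, not a confirmed specification; the function and constant names are illustrative only.

```python
# Back-of-the-envelope HBM capacity estimate.
# Assumptions (not confirmed specs): 12 memory sites as seen in the teaser
# image, each populated with one 12-high stack of 36 GB.
HBM_SITES = 12
GB_PER_STACK = 36
DIES_PER_STACK = 12


def total_hbm_capacity_gb(sites: int, gb_per_stack: int) -> int:
    """Total capacity when every site holds one stack of the same size."""
    return sites * gb_per_stack


total_gb = total_hbm_capacity_gb(HBM_SITES, GB_PER_STACK)
gb_per_die = GB_PER_STACK / DIES_PER_STACK  # capacity of each DRAM die in a stack

print(f"{HBM_SITES} sites x {GB_PER_STACK} GB = {total_gb} GB total")
print(f"Each die in a {DIES_PER_STACK}-high stack holds {gb_per_die:.0f} GB")
```

Changing `GB_PER_STACK` or `HBM_SITES` shows how the total would shift if the final product ships with a different stack height or fewer populated sites.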
Broadcom’s role is also important beyond manufacturing support. FuriosaAI plans to use Broadcom’s Ethernet and PCIe technologies to support high bandwidth connectivity across large AI clusters. This suggests the platform is being designed not just as a single accelerator, but as part of a broader data center scale system for AI inference deployment.
The company says the chip is optimized for demanding real world AI workloads, including post-training sampling. For modern AI infrastructure, inference is becoming more than simple model execution. It now requires low latency, efficient token generation, and scalable deployment across many connected accelerators. FuriosaAI’s decision to focus heavily on bandwidth reflects this shift in the AI market.
FuriosaAI also claims that its architecture avoids some of the complexity found in GPU based programming. Traditional GPUs rely heavily on thread management, which can add development complexity and performance variability. FuriosaAI says its own software stack gives developers a faster way to deploy models while still meeting strict performance and latency requirements.
The company’s SDK uses a general compiler that automatically maps high level PyTorch code to silicon. For developers who need more control, FuriosaAI’s Virtual ISA provides a declarative programming model that offers hardware level control without the nondeterministic complexity commonly associated with traditional GPU programming.
"Bringing together Broadcom’s infrastructure capabilities and Furiosa’s Tensor Contraction Processor architecture and its industry defining software stack allows us to move beyond the chip level and deliver a comprehensive solution for the token factory era."
— June Paik
The phrase "token factory era" reflects where the AI hardware market is heading. As AI models become more deeply integrated into enterprise workflows, consumer products, autonomous agents, and cloud services, the focus is no longer only on training the largest models. Companies now need infrastructure that can generate tokens efficiently, reliably, and at massive scale. This makes inference performance one of the most important areas of future AI competition.
FuriosaAI’s strategy is ambitious because it directly challenges the dominance of GPUs in AI acceleration. NVIDIA remains the strongest player in AI data center hardware, while AMD and other custom silicon vendors are also pushing aggressively into the market. FuriosaAI is trying to differentiate itself by focusing on a bandwidth first architecture, advanced packaging, HBM4/E memory, and a software stack designed specifically for inference deployment.
The 3rd generation FuriosaAI accelerator is expected to begin sampling in the first half of 2028. While that timeline places the product several years away from broad deployment, it also aligns with the expected growth of next generation AI data centers, where HBM4/E memory, 2nm silicon, and rack scale networking will become increasingly important.
If FuriosaAI and Broadcom can execute on the platform’s performance and efficiency claims, this accelerator could become a serious alternative for companies looking beyond traditional GPU infrastructure. The real test will come when silicon samples arrive, performance numbers are validated, and customers begin evaluating whether FuriosaAI’s bandwidth driven inference design can deliver the token density and efficiency needed for the next phase of AI compute.
Do you think dedicated inference accelerators like FuriosaAI’s 2nm HBM4/E platform can challenge GPUs in future AI data centers?
