AMD and Intel Publish ACE Specification to Standardize AI Acceleration Across Future x86 CPUs

AMD and Intel have published the detailed specification for AI Compute Extensions, better known as ACE, establishing a shared matrix acceleration architecture that could become a major part of future x86 processors. Developed through the x86 Ecosystem Advisory Group, ACE is designed to increase matrix multiplication performance, scalability, and energy efficiency while giving software developers a consistent optimization target across processors from both companies.

Matrix multiplication is one of the core operations behind neural networks, large language models, image generation, inference, and other machine learning workloads. Existing vector extensions such as AVX10 can already perform matrix calculations, but their compute density and scalability are limited compared with dedicated matrix hardware. ACE addresses that limitation by introducing tile-based outer product operations that work directly alongside AVX vectors and scalar code.
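For readers unfamiliar with the term, the link between outer products and matrix multiplication can be shown in a few lines of Python. This is only an illustration of the underlying math, not ACE code: the function name `matmul_by_outer_products` is made up for this example, and real hardware would operate on fixed-size tiles rather than Python lists.

```python
def matmul_by_outer_products(a, b):
    """Compute C = A x B as a sum of rank-1 (outer product) updates.

    a is an m-by-k matrix and b is a k-by-n matrix, both as lists of rows.
    Each step p takes column p of A and row p of B and adds their outer
    product into the accumulator C, which plays the role of a tile.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]          # accumulator ("tile")
    for p in range(k):
        col = [a[i][p] for i in range(m)]    # column p of A
        row = b[p]                           # row p of B
        for i in range(m):
            for j in range(n):
                c[i][j] += col[i] * row[j]   # one outer product element
    return c


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    print(matmul_by_outer_products(a, b))    # [[19, 22], [43, 50]]
```

The point of the formulation is that every pair of input vectors updates the whole accumulator at once, which is why an outer product step performs far more multiplications per input vector than a lane-by-lane vector operation.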

The newly published official ACE specification defines dedicated tile registers, block scale registers, matrix processing operations, data movement instructions, and system management capabilities. The architecture allows AVX registers to supply data directly to ACE tile operations, combining the broader data manipulation capabilities of AVX10 with much denser matrix processing.

According to the earlier ACE technical white paper, an ACE outer product operation can provide 16x the compute density of an equivalent AVX10 multiply-accumulate operation while consuming the same number of input vectors. For INT8 calculations, the comparison rises from 64 multiplications with AVX10 to 1,024 multiplications through ACE. BF16 operations increase from 32 multiplications to 512 under the same comparison.
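The quoted figures can be reproduced with a small counting model. The sketch below assumes 512-bit input vectors, 32-bit accumulators (so 16 accumulator lanes), and input elements grouped into short dot products (4 INT8 values or 2 BF16 values per 32-bit lane). That layout is an assumption chosen because it matches the numbers in the white paper, not a description taken from the specification, and all names are hypothetical.

```python
VECTOR_BITS = 512                  # width of one input vector
ACC_BITS = 32                      # assumed accumulator width
LANES = VECTOR_BITS // ACC_BITS    # 16 accumulator lanes


def vector_mac(acc, a, b, group):
    """Lane-wise multiply-accumulate: each lane sums `group` products."""
    mults = 0
    for lane in range(LANES):
        for g in range(group):
            acc[lane] += a[lane * group + g] * b[lane * group + g]
            mults += 1
    return mults


def outer_product_mac(tile, a, b, group):
    """Outer product: every (row, col) pair gets a `group`-deep dot product."""
    mults = 0
    for row in range(LANES):
        for col in range(LANES):
            for g in range(group):
                tile[row][col] += a[row * group + g] * b[col * group + g]
                mults += 1
    return mults


def count_multiplies(elem_bits):
    """Return (vector multiplies, outer product multiplies) for one input pair."""
    group = ACC_BITS // elem_bits          # 4 for 8-bit, 2 for 16-bit
    n = VECTOR_BITS // elem_bits           # elements per input vector
    a, b = [1] * n, [2] * n                # placeholder data; only counts matter
    acc = [0] * LANES
    tile = [[0] * LANES for _ in range(LANES)]
    return vector_mac(acc, a, b, group), outer_product_mac(tile, a, b, group)


if __name__ == "__main__":
    for name, bits in (("INT8", 8), ("BF16", 16)):
        v, o = count_multiplies(bits)
        print(name, v, o, o // v)
    # INT8 64 1024 16
    # BF16 32 512 16
```

Both operations consume the same two input vectors; the 16x gap comes from the outer product pairing each of the 16 groups in one vector with all 16 groups in the other.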

ACE adds 8 tile registers, each organized as 16 rows of 512 bits, alongside a 1,024-bit block scale register. These tile registers hold intermediate matrix results, while AVX10 registers prepare, convert, and move input data. The architecture is exposed to software through a new palette under the existing AMX accelerator framework, allowing operating systems and software tools to reuse much of the current state management model.
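As a rough sense of scale, the register figures above translate into the following amount of architectural state. The arithmetic reads each tile as 16 rows of 512 bits, which is an interpretation of the description above rather than a quoted figure, and the constant and function names are hypothetical.

```python
TILE_COUNT = 8            # tile registers
TILE_ROWS = 16            # rows per tile (assumed reading of the description)
ROW_BITS = 512            # bits per row
BLOCK_SCALE_BITS = 1024   # block scale register width


def tile_state_summary():
    """Derive byte sizes from the register dimensions above."""
    bytes_per_tile = TILE_ROWS * ROW_BITS // 8
    return {
        "bytes_per_tile": bytes_per_tile,
        "all_tiles_bytes": TILE_COUNT * bytes_per_tile,
        "block_scale_bytes": BLOCK_SCALE_BITS // 8,
        "acc32_per_tile": TILE_ROWS * ROW_BITS // 32,  # 32-bit accumulators
    }


if __name__ == "__main__":
    for key, value in tile_state_summary().items():
        print(key, value)
    # bytes_per_tile 1024
    # all_tiles_bytes 8192
    # block_scale_bytes 128
    # acc32_per_tile 256
```

Under this reading, an operating system would need to save and restore roughly 8 KiB of tile data plus the block scale register per thread that uses ACE, which is the kind of state the AMX-style palette mechanism is meant to describe.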

The specification also reflects the growing importance of reduced precision formats in AI. Smaller numerical formats lower memory consumption, reduce bandwidth requirements, and increase throughput, which is particularly valuable for inference and quantized models. ACE supports native matrix operations for several major AI formats while using AVX10 instructions for conversion between additional formats.

| Format | Description | Notes |
| --- | --- | --- |
| INT8 | 8-bit integer | |
| INT32 | 32-bit integer | |
| FP32 | SE8M23 | As defined by IEEE-754 |
| BF16 | SE8M7 | |
| FP16 | SE5M10 | |
| E8M0 | 8-bit unsigned exponent | Used for power-of-two block scale formats |
| FP8 | 8-bit floating point | Defined in OCP 8-bit Floating Point Specification (OFP8) [1]. Also refer to OCP Microscaling Formats (MX) Specification [2]. |
| MX FP8 | 8-bit floating point formats (SE5M2, SE4M3) | |
| MX FP6 | 6-bit floating point formats (SE3M2, SE2M3) | |
| MX FP4 | 4-bit floating point format (SE2M1) | |
| MX INT8 | 8-bit fixed-point fractional format | |
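The SExMy shorthand in the format list means one sign bit, x exponent bits, and y mantissa bits. A minimal decoder for these small formats might look like the sketch below. It assumes the conventional bias of 2^(x-1) - 1 (1 for E2, 3 for E3, 7 for E4, 15 for E5) and deliberately ignores the infinity and NaN encodings that some FP8 variants reserve, so it is an illustration of the notation rather than a conforming implementation. The function name is hypothetical.

```python
def decode_minifloat(code, exp_bits, man_bits, bias):
    """Decode a small sign/exponent/mantissa float from its integer bit pattern.

    Special encodings (infinity, NaN) are NOT handled; every bit pattern is
    treated as a finite number. Subnormals (exponent field 0) are supported.
    """
    sign = -1.0 if (code >> (exp_bits + man_bits)) & 1 else 1.0
    exp = (code >> man_bits) & ((1 << exp_bits) - 1)
    man = code & ((1 << man_bits) - 1)
    fraction = man / (1 << man_bits)
    if exp == 0:
        # Subnormal: no implicit leading 1, fixed exponent of 1 - bias.
        return sign * fraction * 2.0 ** (1 - bias)
    return sign * (1.0 + fraction) * 2.0 ** (exp - bias)


if __name__ == "__main__":
    # All non-negative FP4 (SE2M1) values.
    print([decode_minifloat(c, 2, 1, 1) for c in range(8)])
    # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

    # Largest finite SE4M3 value: exponent 15, mantissa 6.
    print(decode_minifloat(0x7E, 4, 3, 7))   # 448.0
```

The FP4 output shows how coarse the smallest formats are: only eight magnitudes exist, which is why such formats are normally paired with a shared per-block scale.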

One of the most important additions is native support for Open Compute Project microscaling formats. MX FP8 and MX INT8 operations can apply block scaling directly during matrix multiplication, reducing the amount of extra software processing needed around quantized models. Dedicated conversion instructions also cover FP4, FP6, and FP8, giving developers more flexibility as AI models increasingly adopt smaller data formats.
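To illustrate what block scaling saves in software, here is a hedged sketch of a block-scaled dot product in the MX INT8 style. The details are assumptions drawn from the OCP Microscaling specification as commonly described, not from the ACE document: blocks of 32 elements share one E8M0 scale, the scale decodes as 2^(e - 127) with 255 reserved for NaN, and MX INT8 elements carry an implied factor of 2^-6. Function names are hypothetical, and actual ACE instructions may organize the work differently.

```python
BLOCK_SIZE = 32            # elements sharing one scale (assumed from OCP MX)
E8M0_BIAS = 127            # exponent bias of the shared scale
INT8_IMPLIED_SCALE = 2.0 ** -6   # assumed MX INT8 fixed-point factor


def e8m0_to_scale(e):
    """Decode an 8-bit unsigned exponent into a power-of-two scale."""
    if e == 255:
        return float("nan")
    return 2.0 ** (e - E8M0_BIAS)


def mx_int8_block_dot(a_elems, a_scale, b_elems, b_scale):
    """Dot product of two MX INT8 blocks with scaling applied once per block.

    The element products are accumulated as plain integers; the fixed-point
    factor and both block scales are applied to the sum at the end.
    """
    if len(a_elems) != BLOCK_SIZE or len(b_elems) != BLOCK_SIZE:
        raise ValueError("each block must hold exactly 32 elements")
    acc = sum(x * y for x, y in zip(a_elems, b_elems))
    elem_factor = INT8_IMPLIED_SCALE * INT8_IMPLIED_SCALE
    return acc * elem_factor * e8m0_to_scale(a_scale) * e8m0_to_scale(b_scale)


if __name__ == "__main__":
    a = [64] * BLOCK_SIZE   # each element is 64 * 2^-6 = 1.0, scale 2^0
    b = [32] * BLOCK_SIZE   # each element is 32 * 2^-6 = 0.5, scale 2^1
    print(mx_int8_block_dot(a, 127, b, 128))   # 32.0
```

Without hardware support, a kernel would have to apply those scales in separate vector instructions around the matrix multiply; folding them into the matrix operation removes that extra pass.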

ACE does not replace AVX10. The two technologies are designed to work together. AVX10 handles preprocessing, format conversion, vector calculations, and other operations surrounding the matrix kernel, while ACE performs the dense outer product calculations. Data can move directly between AVX and ACE register states, reducing unnecessary memory transfers and allowing both execution paths to work inside the same processing pipeline.

The architecture also gives AMD and Intel flexibility when designing future CPUs. The programming model and instruction results remain consistent, but each company can scale the internal implementation according to the power, area, and performance targets of a specific product. A laptop processor may include a smaller ACE implementation focused on local inference and battery efficiency, while server CPUs could dedicate more silicon to higher-throughput matrix hardware.

Software support will be essential to ACE adoption. The x86 Ecosystem Advisory Group says enablement work is underway across compilers, debuggers, profilers, deep learning libraries, scientific computing tools, and frameworks such as PyTorch and TensorFlow. Standardizing ACE across AMD and Intel could reduce the need for developers to maintain separate matrix acceleration paths for each processor vendor.

The initiative follows the formation of the x86 Ecosystem Advisory Group, which brought AMD, Intel, Microsoft, Google, Meta, Lenovo, HP, Dell, and other ecosystem partners together to coordinate future x86 features. Its first major priorities included FRED for interrupt handling, AVX10 for vector consistency, ChkTag for memory safety, and ACE for AI matrix acceleration. ACE also sits beside wider x86 modernization projects such as Intel Advanced Performance Extensions, which expand the number of general-purpose registers and introduce more efficient instruction formats. Together, these technologies show that AMD and Intel are working to improve both general computing and specialized AI execution without abandoning the software compatibility that remains x86’s strongest advantage.

The full specification now gives operating system developers, compiler teams, and software companies a more complete foundation for implementation, but it does not identify the first commercial AMD or Intel processors that will support the architecture.

ACE represents one of the most meaningful outcomes of the AMD and Intel partnership. The two companies will continue competing aggressively in processors, data centers, and AI, but agreeing on a common instruction architecture can strengthen the entire x86 ecosystem. A shared standard gives developers a larger addressable hardware base and gives x86 CPUs a clearer role in AI workloads that do not always require a discrete GPU or dedicated accelerator.

Dedicated accelerators will continue to dominate the largest AI training systems, but much of the future AI market will involve inference on laptops, workstations, enterprise servers, and edge devices. ACE is AMD and Intel’s attempt to ensure that x86 remains relevant across those environments by bringing scalable matrix acceleration directly into future CPU architectures.

Can ACE make future x86 CPUs more competitive for AI inference, or will dedicated GPUs and NPUs continue to handle most serious AI workloads?

Angel Morales

Founder and lead writer at Duck-IT Tech News, dedicated to delivering the latest news, reviews, and insights in the world of technology, gaming, and AI. Brings experience in the tech and business sectors, combining a deep passion for technology with a talent for clear and engaging writing.
