Digital Event Horizon
Google's latest virtual machine (VM) instance, C4, has been touted as a game-changer for large-scale language model inference. According to recent benchmarks conducted by Intel and Hugging Face, the C4 instance offers a roughly 70% improvement in Total Cost of Ownership (TCO) compared to its predecessor, Google Cloud C3. The gain is attributed to a collaboration between Google and Intel that produced an expert-execution optimization, enabling faster and more efficient inference for large Mixture of Experts (MoE) models.
The study focuses on GPT OSS, an open-source Mixture of Experts (MoE) model released by OpenAI. MoE models are deep neural network architectures that use specialized "expert" sub-networks and a "gating network" that decides which experts to apply to a given input. Because only a few experts are active for each token, these models can grow their parameter counts without a proportional increase in per-token compute, while also encouraging specialization, with different experts learning different skills.
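To make the routing idea concrete, here is a minimal, illustrative top-k MoE layer in PyTorch. Every name and dimension below is a hypothetical sketch, not GPT OSS internals: a gating network scores the experts for each token, and only the top-k highest-scoring experts are actually evaluated.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Illustrative top-k Mixture of Experts layer (not GPT OSS internals)."""

    def __init__(self, d_model: int, d_hidden: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # The "gating network": a linear layer producing one score per expert.
        self.gate = nn.Linear(d_model, n_experts)
        # The "expert" sub-networks: small independent feed-forward blocks.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Score all experts, keep only the top-k per token.
        scores = self.gate(x)                                  # (tokens, n_experts)
        weights, idx = torch.topk(scores, self.top_k, dim=-1)  # (tokens, top_k)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # Only the selected experts run, so compute grows with top_k,
        # not with the total number of experts.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

# Example: 16 tokens routed through 8 experts, with 2 active per token.
layer = MoELayer(d_model=64, d_hidden=128)
tokens = torch.randn(16, 64)
print(layer(tokens).shape)  # torch.Size([16, 64])
```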
The benchmarking process involved creating instances of both C3 and C4 VMs. The C3 instance was equipped with 4th Gen Intel Xeon processors (Sapphire Rapids, SPR), while the C4 instance featured Intel Xeon 6 processors (Granite Rapids, GNR). The setup was designed to isolate the effect of the architectural differences between the two processor generations on MoE execution efficiency.
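As a rough illustration of how such instances can be provisioned programmatically, the sketch below uses the google-cloud-compute Python client. The project ID, zone, machine shapes, and disk size are all placeholders; the post does not state the exact instance sizes used.

```python
# Hypothetical provisioning sketch (pip install google-cloud-compute).
# Project, zone, and machine shapes are placeholders, not the study's setup.
from google.cloud import compute_v1

def create_vm(project: str, zone: str, name: str, machine_type: str) -> None:
    # Boot disk from a stock Debian image; any recent Linux image would do.
    disk = compute_v1.AttachedDisk(
        boot=True,
        auto_delete=True,
        initialize_params=compute_v1.AttachedDiskInitializeParams(
            source_image="projects/debian-cloud/global/images/family/debian-12",
            disk_size_gb=200,  # room for the 120B checkpoint
        ),
    )
    instance = compute_v1.Instance(
        name=name,
        machine_type=f"zones/{zone}/machineTypes/{machine_type}",
        disks=[disk],
        network_interfaces=[compute_v1.NetworkInterface(network="global/networks/default")],
    )
    compute_v1.InstancesClient().insert(
        project=project, zone=zone, instance_resource=instance
    )

# C3 = 4th Gen Xeon (Sapphire Rapids), C4 = Xeon 6 (Granite Rapids).
create_vm("my-project", "us-central1-a", "bench-c3", "c3-standard-88")
create_vm("my-project", "us-central1-a", "bench-c4", "c4-standard-96")
```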
The configuration summary for the GPT OSS runs:

Model: unsloth/gpt-oss-120b-BF16
Precision: bfloat16
Task: text generation
Input length: 1024 tokens (left-padded)
Output length: 1024 tokens
Batch sizes: 1 to 64
KV cache: static

Across these settings, the C4 instance showed a consistent improvement in throughput per vCPU, outperforming the C3 instance by a factor of 1.4 to 1.7.
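A measurement along these lines can be sketched with the Hugging Face transformers API. This is an illustrative harness under the configuration above (left padding, 1024-token prompts, static KV cache), not the exact benchmark code from the study; only the checkpoint name comes from the post.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "unsloth/gpt-oss-120b-BF16"  # checkpoint named in the post

# Left padding so all prompts align at the right edge for generation.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, padding_side="left")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
model.eval()

def bench(prompts, max_new_tokens=1024):
    # Pad/truncate every prompt to the 1024-token input length from the config.
    inputs = tokenizer(prompts, return_tensors="pt", padding="max_length",
                       truncation=True, max_length=1024)
    start = time.perf_counter()
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens,
                             do_sample=False, cache_implementation="static")
    elapsed = time.perf_counter() - start
    generated = out.shape[0] * max_new_tokens
    return generated / elapsed  # tokens per second for the whole batch

# Divide each figure by the machine's vCPU count to get the per-vCPU
# throughput that the post compares across C3 and C4.
for batch_size in (1, 2, 4, 8, 16, 32, 64):
    tps = bench(["Explain mixture-of-experts models."] * batch_size)
    print(f"batch={batch_size:3d}  throughput={tps:8.1f} tok/s")
```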
The cost analysis showed that at batch size 64, C4 provides 1.7x the per-vCPU throughput of C3; with near parity in price per vCPU, this translates directly into a 1.7x TCO advantage for the C4 instance. Put differently, generating the same token volume on C3 would cost about 1.7x as much as on C4, a saving of roughly 41%.
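The arithmetic behind that claim is worth spelling out. In the sketch below the hourly price per vCPU is a placeholder, since the post only assumes near parity between the two instance types; the ratios are what matter.

```python
# Worked version of the TCO arithmetic. The dollar figure is a placeholder
# (the post assumes near price parity per vCPU); only the ratios matter.
c3_tput_per_vcpu = 1.0      # normalize C3 throughput per vCPU to 1.0
c4_tput_per_vcpu = 1.7      # C4 measured at 1.7x C3 at batch size 64
price_per_vcpu_hour = 0.04  # same for both under the parity assumption

# Cost to generate a fixed token volume scales with 1 / throughput.
c3_cost = price_per_vcpu_hour / c3_tput_per_vcpu
c4_cost = price_per_vcpu_hour / c4_tput_per_vcpu
print(f"C3/C4 cost ratio: {c3_cost / c4_cost:.2f}x")    # -> 1.70x
print(f"C4 saving vs C3:  {1 - c4_cost / c3_cost:.0%}") # -> 41%
```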
The study concludes that Google Cloud C4 VMs powered by Intel Xeon 6 (GNR) processors deliver both impressive performance gains and better cost efficiency for large MoE inference compared with previous-generation Google Cloud C3 VMs. The result underlines the effectiveness of targeted framework optimizations from Intel and Hugging Face, which allow large MoE models to be served efficiently on next-generation general-purpose CPUs.
Related Information:
https://www.digitaleventhorizon.com/articles/Revolutionizing-Large-Scale-Language-Model-Inference-Google-Cloud-C4-Brings-a-70-TCO-Improvement-deh.shtml
https://huggingface.co/blog/gpt-oss-on-intel-xeon
https://cloud.google.com/blog/products/compute/a-closer-look-at-compute-engine-c4-and-n4-machine-series
Published: Thu Oct 16 03:55:11 2025 by llama3.2 3B Q4_K_M