Today's AI/ML headlines are brought to you by ThreatPerspective

Digital Event Horizon

SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding



Speculative decoding (SD), a critical technique for accelerating large language model inference, now has a unified evaluation suite: SPEED-Bench. The benchmark addresses the limitations of existing evaluations by providing a comprehensive framework for measuring SD performance across diverse semantic domains and realistic serving regimes. SPEED-Bench captures domain-dependent acceptance behavior and speedups, exposing side effects of aggressive system optimizations such as vocabulary pruning. With this unified benchmark, researchers can analyze SD behavior under production-like conditions and develop more robust speculation methods.

  • Speculative decoding is a technique to accelerate large language model inference by speculating multiple future tokens, which are then verified in parallel by the target model.
  • The evaluation of speculative decoding remains fragmented and often unrepresentative of real-world data and serving conditions.
  • A new benchmark called SPEED-Bench has been introduced to evaluate speculative decoding across diverse semantic domains and realistic serving regimes.
  • SPEED-Bench combines two purpose-built dataset splits, the Qualitative split and the Throughput split, with a unified measurement framework.
  • The Qualitative split measures speculation quality across domains, while the Throughput split evaluates system-level speedups across various input sequence lengths and high concurrency.
  • SPEED-Bench captures fine-grained timing information from streaming responses to compute acceptance behavior, step latency, user-level tokens-per-second, and overall throughput.
  • The unified measurement framework isolates the effects of SD algorithms and system optimizations from preprocessing artifacts.
  • Insights gained from SPEED-Bench include domain-dependent accuracy and speedups, as well as the impact of vocabulary pruning on speculative decoding performance.


  • Speculative decoding has emerged as a critical technique for accelerating large language model (LLM) inference, significantly improving throughput while preserving the exact output distribution of the target model. The approach uses a lightweight draft model to speculate multiple future tokens, which the target model then verifies in parallel.
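
    The draft-then-verify loop described above can be sketched in a few lines. This is a simplified greedy sketch, accepting a drafted token only when it matches the target model's own next-token choice, rather than the full rejection-sampling scheme that preserves the target's output distribution; `draft_next` and `target_next` are hypothetical stand-ins for the two models' decoding functions.

```python
def speculative_decode_step(draft_next, target_next, prefix, k):
    """One draft-then-verify step (greedy, deterministic sketch).

    draft_next / target_next: callables mapping a token sequence to the
    next token (stand-ins for the draft and target models' argmax).
    Returns the tokens accepted this step; a correction (or bonus) token
    from the target guarantees at least one token is always emitted.
    """
    # Draft model speculates k future tokens autoregressively.
    speculated = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        speculated.append(t)
        ctx.append(t)

    # Target model verifies all k positions; in a real engine this loop
    # is a single batched forward pass over the k drafted prefixes.
    accepted = []
    ctx = list(prefix)
    for t in speculated:
        target_t = target_next(ctx)
        if target_t == t:
            accepted.append(t)          # draft agreed with target: accept
            ctx.append(t)
        else:
            accepted.append(target_t)   # first mismatch: take target's token
            return accepted
    # All k drafted tokens accepted: emit one bonus token from the target.
    accepted.append(target_next(ctx))
    return accepted
```

    The speedup comes from the verification pass: checking k drafted tokens costs roughly one target-model forward pass, so each step can emit several tokens instead of one.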

    Despite the growing interest in speculative decoding, its evaluation remains fragmented and often unrepresentative of real-world data and serving conditions. Existing benchmarks often rely on small prompt sets, limited semantic diversity, short input sequence lengths, batch size one, or high-level inference stacks that do not reflect production environments. To address these gaps, researchers introduced SPEED-Bench: a unified benchmark designed to evaluate speculative decoding across diverse semantic domains and realistic serving regimes, using production-grade inference engines.

    SPEED-Bench introduces a benchmarking ecosystem for speculative decoding, combining two purpose-built dataset splits and a unified measurement framework. The first split, the Qualitative split, measures speculation quality (drafter accuracy) across domains. It explicitly prioritizes semantic diversity by aggregating data from 18 publicly available sources and organizing it into 11 categories: Coding, Math, Humanities, STEM, Writing, Summarization, Roleplay, RAG, Multilingual, Reasoning, and QA.

    Each category contains 80 samples, resulting in a total of 880 prompts. To achieve semantic diversity, each candidate prompt is embedded into a dense vector space using a pre-trained text embedder (openai/text-embedding-3-small). A selection algorithm that minimizes average pairwise cosine similarity within each category ensures that the selected samples span the semantic space as widely as possible, reducing redundancy and increasing evaluation fidelity.
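
    The selection step can be illustrated with a small sketch. The article does not detail the exact algorithm, so this is one plausible greedy instantiation: start from a seed prompt, then repeatedly add the candidate whose average cosine similarity to the already-selected set is lowest. Function names and the pure-Python embedding format are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def select_diverse(embeddings, n):
    """Greedily pick n indices that (approximately) minimize average
    pairwise cosine similarity: seed with index 0, then repeatedly add
    the candidate least similar, on average, to the current selection."""
    chosen = [0]  # arbitrary seed
    while len(chosen) < n:
        best_i, best_score = None, float("inf")
        for i in range(len(embeddings)):
            if i in chosen:
                continue
            score = sum(cosine(embeddings[i], embeddings[j])
                        for j in chosen) / len(chosen)
            if score < best_score:
                best_i, best_score = i, score
        chosen.append(best_i)
    return chosen
```

    Spreading the selected prompts across the embedding space this way reduces near-duplicate prompts within a category, which is what the article means by reducing redundancy and increasing evaluation fidelity.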

    The second split, known as the Throughput split, is constructed to evaluate system-level speedups across various input sequence lengths and high concurrency. This split uses two metrics: Throughput (Output TPS), the total tokens generated per second across all concurrent requests, and User TPS, the per-request token generation rate. The Throughput split captures the behavior of speculative decoding in production environments, where models are served under high concurrency and a wide range of input sequence lengths.
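
    The distinction between the two metrics is easy to state in code. The record format below (`tokens`, `duration_s`) is hypothetical; the point is that Output TPS is an aggregate over the wall clock while User TPS averages per-request rates, so a system can raise one while lowering the other.

```python
def throughput_metrics(requests, wall_time_s):
    """Compute the two Throughput-split metrics from per-request stats.

    requests: list of dicts with 'tokens' (output tokens generated) and
              'duration_s' (time that request spent generating).
    Returns (output_tps, user_tps):
      output_tps -- total tokens across all concurrent requests,
                    divided by wall-clock time
      user_tps   -- mean per-request token generation rate
    """
    total_tokens = sum(r["tokens"] for r in requests)
    output_tps = total_tokens / wall_time_s
    user_tps = sum(r["tokens"] / r["duration_s"] for r in requests) / len(requests)
    return output_tps, user_tps
```

    Under high concurrency, batching more requests typically raises Output TPS while each individual user's User TPS drops, which is why the split reports both.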

    SPEED-Bench introduces a unified measurement framework that handles tokenization and prompt formatting externally. Inference engines receive pre-tokenized sequences, ensuring that all systems process identical inputs. The framework integrates with production-grade inference engines: TensorRT-LLM, vLLM, and SGLang. It captures fine-grained timing information from streaming responses to compute acceptance behavior, step latency, user-level tokens-per-second, and overall throughput.
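
    To make the timing capture concrete, here is a minimal sketch of deriving step latency and mean acceptance length from a stream of timestamped decoding events. The `(timestamp_s, n_tokens)` event format is an assumption for illustration; the benchmark's actual streaming schema may differ.

```python
def step_stats(events):
    """Derive mean step latency and mean acceptance length from a stream
    of (timestamp_s, n_tokens) events, where each event marks one
    decoding step emitting n_tokens accepted tokens.

    The first event anchors the clock; latencies are gaps between
    consecutive steps, and acceptance is averaged over subsequent steps.
    """
    latencies = [t1 - t0 for (t0, _), (t1, _) in zip(events, events[1:])]
    accepted = [n for _, n in events[1:]]
    mean_step_latency = sum(latencies) / len(latencies)
    mean_acceptance = sum(accepted) / len(accepted)
    return mean_step_latency, mean_acceptance
```

    Because these statistics are computed from the streamed responses themselves, they reflect what the serving stack actually delivered rather than what an offline simulation predicts.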

    The unified measurement framework isolates the effects of SD algorithms and system optimizations from preprocessing artifacts, providing a more accurate evaluation of speculative decoding performance. This design allows for a comprehensive analysis of SD behavior across diverse semantic domains and realistic serving regimes.

    Insights gained from SPEED-Bench include domain-dependent accuracy and speedups. The results confirm that SD acceptance length is highly domain-dependent, with low-entropy domains consistently yielding higher acceptance lengths, while high-entropy tasks are more difficult to speculate on. The results also highlight differences between speculation methods, with lightweight approaches like N-Gram speculation resulting in net slowdowns at moderate batch sizes.

    Furthermore, SPEED-Bench can assist with exposing side effects of aggressive system optimizations, such as vocabulary pruning. Vocabulary pruning is used in EAGLE3 to reduce the computational cost of the final projection layer. While effective on narrow domains, this optimization can degrade acceptance length on the "long tail" of user inputs. The impact of vocabulary pruning is minimal in Coding and Math but substantial in Multilingual, RAG, and Summarization categories.
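
    The mechanism behind this degradation can be sketched simply. Vocabulary pruning effectively restricts the drafter's output to a reduced token set; the masking below is a simplified stand-in for shrinking the final projection layer, and `kept_vocab` is a hypothetical name for the retained token ids.

```python
def prune_logits(logits, kept_vocab):
    """Restrict next-token logits to a pruned vocabulary by masking all
    other token ids to -inf (a simplified stand-in for reducing the
    final projection layer to the most frequent tokens).

    Any token outside kept_vocab can never be drafted, so text that
    relies on rare tokens forces a mismatch at verification time.
    """
    neg_inf = float("-inf")
    return [x if i in kept_vocab else neg_inf for i, x in enumerate(logits)]
```

    This makes the domain pattern in the article intuitive: Coding and Math draw mostly from a compact, frequent token set, while Multilingual, RAG, and Summarization inputs hit the pruned long tail, shortening acceptance lengths.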

    In conclusion, SPEED-Bench provides a unified benchmark for speculative decoding, addressing the limitations of existing benchmarks with a comprehensive evaluation framework for SD performance. By capturing domain-dependent acceptance behavior and speedups, it yields valuable insights into how speculative decoding algorithms behave in production environments.

    Related Information:
  • https://www.digitaleventhorizon.com/articles/SPEED-Bench-A-Unified-and-Diverse-Benchmark-for-Speculative-Decoding-deh.shtml

  • https://huggingface.co/blog/nvidia/speed-bench

  • https://research.nvidia.com/publication/2026-02_speed-bench-unified-and-diverse-benchmark-speculative-decoding


  • Published: Thu Mar 19 09:39:41 2026 by llama3.2 3B Q4_K_M

    © Digital Event Horizon . All rights reserved.

    Privacy | Terms of Use | Contact Us