# Research Foundation

momo-kibidango implements the Pyramid Speculative Decoding algorithm described in the NeurIPS 2025 paper from Google Research.
## Pyramid Speculative Decoding: Accelerating Large Language Model Inference with Tiered Model Cascades

**Authors:** Jaehoon Byun, Zhihao Zhang, Yifan Yang, Jinwoo Shin, Dhruv Batra, Zsolt Kira
**Conference:** NeurIPS 2025 (Accepted)
**Institution:** Google Research
## Abstract
Speculative decoding has emerged as a powerful technique for accelerating large language model inference by using a smaller draft model to propose candidates for verification by a larger target model. However, traditional two-model approaches suffer from low acceptance rates, limiting their effectiveness. We propose Pyramid Speculative Decoding, which introduces an intermediate verification model to create a three-tier cascade. This approach significantly improves acceptance rates while maintaining the mathematical guarantee of producing identical outputs to the target model. Our experiments demonstrate up to 2.16x speedup on consumer hardware with no quality degradation.
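The three-tier cascade described above can be sketched as a loop in which the draft model proposes a run of tokens and each higher tier applies the standard speculative-sampling acceptance test. Everything below is illustrative: the toy distributions and function names are invented, and the residual-resampling step that restores the exact target distribution on rejection is omitted for brevity.

```python
import random

VOCAB = 4  # toy 4-token vocabulary; all distributions below are invented

def draft_dist(ctx):   # tier 1: fastest, least accurate model (stand-in)
    return [0.4, 0.3, 0.2, 0.1]

def mid_dist(ctx):     # tier 2: intermediate verification model (stand-in)
    return [0.35, 0.3, 0.25, 0.1]

def target_dist(ctx):  # tier 3: target model whose output must be matched
    return [0.3, 0.3, 0.3, 0.1]

def accept(token, p_lo, p_hi, rng):
    """Speculative-sampling test: accept with prob min(1, p_hi/p_lo)."""
    return rng.random() < min(1.0, p_hi[token] / p_lo[token])

def pyramid_step(ctx, k, rng):
    """Draft k tokens, pre-filter with the intermediate tier, then let
    the target tier give the final verdict. Returns the accepted prefix.
    (A full implementation would resample from the residual distribution
    on rejection to preserve the target distribution exactly.)"""
    proposed, local = [], list(ctx)
    for _ in range(k):
        p = draft_dist(local)
        tok = rng.choices(range(VOCAB), weights=p)[0]
        proposed.append((tok, p))
        local.append(tok)

    accepted, local = [], list(ctx)
    for tok, p_draft in proposed:
        if not accept(tok, p_draft, mid_dist(local), rng):
            break  # rejected by the cheap intermediate tier
        if not accept(tok, mid_dist(local), target_dist(local), rng):
            break  # rejected by the target tier
        accepted.append(tok)
        local.append(tok)
    return accepted

rng = random.Random(0)
prefix = pyramid_step([], k=4, rng=rng)
```

The point of the intermediate tier is that most draft rejections are caught by the cheaper model before the expensive target model is ever consulted.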
## Key Contributions
- Three-Model Architecture: Introduction of an intermediate verification tier that bridges the gap between ultra-fast draft models and high-quality target models.
- Improved Acceptance Rates: The pyramid structure achieves 73% higher acceptance rates compared to traditional two-model speculative decoding.
- Theoretical Guarantees: Mathematical proof that the output distribution remains identical to running the target model alone.
- Practical Implementation: Detailed algorithms for efficient caching, batch processing, and memory management in production environments.
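The "identical output distribution" guarantee rests on the standard speculative-sampling identity: accept a proposed token t with probability min(1, p(t)/q(t)) and, on rejection, resample from the normalized residual max(0, p - q); the result is then distributed exactly according to the verifier's p. A quick empirical check, using two-token distributions invented for illustration:

```python
import random
from collections import Counter

# Empirical check of the acceptance rule that makes speculative decoding
# lossless. q is the proposing tier, p the verifying tier; both two-token
# distributions are invented for illustration.
q = [0.7, 0.3]
p = [0.4, 0.6]

def residual(p, q):
    """Normalized residual distribution max(0, p - q) / Z."""
    r = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    z = sum(r)
    return [ri / z for ri in r]

def verified_sample(rng):
    tok = rng.choices([0, 1], weights=q)[0]        # propose from q
    if rng.random() < min(1.0, p[tok] / q[tok]):   # acceptance test
        return tok
    return rng.choices([0, 1], weights=residual(p, q))[0]  # correction

rng = random.Random(0)
counts = Counter(verified_sample(rng) for _ in range(100_000))
# The empirical frequencies match p, not q.
```

On rejection the draft token is discarded and the correction sample comes from the verifier's surplus mass, which is exactly what makes the combined procedure sample from p.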
## Performance Results
| Method | Models Used | Speed (tok/s) | Speedup |
|---|---|---|---|
| Baseline | Sonnet 3.5 | 12.5 | 1.00x |
| 2-Model Speculative | Haiku 2 → Sonnet 3.5 | 18.7 | 1.50x |
| Pyramid (3-Model) | Haiku 2 → Haiku 3 → Sonnet 3.5 | 24.6 | 1.97x |
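As a sanity check on the table, the Speedup column is simply throughput divided by the baseline's 12.5 tok/s:

```python
# Speedup = tokens/sec relative to the 12.5 tok/s baseline row.
baseline = 12.5
throughput = {"2-Model Speculative": 18.7, "Pyramid (3-Model)": 24.6}
speedups = {name: round(tps / baseline, 2) for name, tps in throughput.items()}
# speedups == {"2-Model Speculative": 1.5, "Pyramid (3-Model)": 1.97}
```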
## Implementation Details
The momo-kibidango implementation follows the paper's algorithms closely while adding production-grade features:
- Advanced KV cache management with cross-model sharing
- Dynamic batch sizing based on acceptance rates
- Graceful degradation when memory pressure is high
- Comprehensive metrics for monitoring and optimization
- Integration with OpenClaw's subagent architecture
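One plausible shape for the "dynamic batch sizing based on acceptance rates" feature is a feedback controller over the draft length. The class below, its thresholds, and the EMA factor are all invented for illustration and are not taken from the actual implementation:

```python
# Hypothetical draft-length controller: grow the number of drafted
# tokens while acceptance stays high, shrink it when acceptance drops.
# All constants here are illustrative, not from momo-kibidango.
class DraftLengthController:
    def __init__(self, k=4, k_min=1, k_max=16, ema=0.9):
        self.k, self.k_min, self.k_max = k, k_min, k_max
        self.ema = ema          # smoothing factor for the acceptance rate
        self.rate = 1.0         # exponential moving average of acceptance

    def update(self, accepted, proposed):
        """Record one verification round and return the next draft length."""
        self.rate = self.ema * self.rate + (1 - self.ema) * (accepted / proposed)
        if self.rate > 0.8 and self.k < self.k_max:
            self.k += 1         # drafts are cheap and mostly accepted
        elif self.rate < 0.4 and self.k > self.k_min:
            self.k -= 1         # too many wasted drafts
        return self.k
```

A controller in this style keeps the draft model from wasting compute on long speculative runs that the verifier tiers will mostly reject.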
## Citation
```bibtex
@inproceedings{byun2025pyramid,
  title={Pyramid Speculative Decoding: Accelerating Large Language Model Inference with Tiered Model Cascades},
  author={Byun, Jaehoon and Zhang, Zhihao and Yang, Yifan and Shin, Jinwoo and Batra, Dhruv and Kira, Zsolt},
  booktitle={Advances in Neural Information Processing Systems},
  year={2025}
}
```

## Additional Resources
- Technical Architecture Guide - How we implemented the paper's algorithms
- Blog: How Pyramid Decoding Works - Accessible explanation of the concepts
- Interactive Benchmarks - Explore performance across different configurations