# Research Foundation

momo-kibidango implements the Pyramid Speculative Decoding algorithm described in the NeurIPS 2025 paper from Google Research.
## Pyramid Speculative Decoding: Accelerating Large Language Model Inference with Tiered Model Cascades

**Authors:** Jaehoon Byun, Zhihao Zhang, Yifan Yang, Jinwoo Shin, Dhruv Batra, Zsolt Kira
**Conference:** NeurIPS 2025 (Accepted)
**Institution:** Google Research
## Abstract
Speculative decoding has emerged as a powerful technique for accelerating large language model inference by using a smaller draft model to propose candidates for verification by a larger target model. However, traditional two-model approaches suffer from low acceptance rates, limiting their effectiveness. We propose Pyramid Speculative Decoding, which introduces an intermediate verification model to create a three-tier cascade. This approach significantly improves acceptance rates while maintaining the mathematical guarantee of producing identical outputs to the target model. Our experiments demonstrate up to 2.16x speedup on consumer hardware with no quality degradation.
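The three-tier cascade described above can be sketched as a loop in which the draft model proposes a run of tokens and each higher tier applies the standard speculative-sampling acceptance test. Everything below is illustrative: the toy distributions and function names are invented, and the residual-resampling step that restores the exact target distribution on rejection is omitted for brevity.

```python
import random

VOCAB = 4  # toy 4-token vocabulary; all distributions below are invented

def draft_dist(ctx):   # tier 1: fastest, least accurate model (stand-in)
    return [0.4, 0.3, 0.2, 0.1]

def mid_dist(ctx):     # tier 2: intermediate verification model (stand-in)
    return [0.35, 0.3, 0.25, 0.1]

def target_dist(ctx):  # tier 3: target model whose output must be matched
    return [0.3, 0.3, 0.3, 0.1]

def accept(token, p_lo, p_hi, rng):
    """Speculative-sampling test: accept with prob min(1, p_hi/p_lo)."""
    return rng.random() < min(1.0, p_hi[token] / p_lo[token])

def pyramid_step(ctx, k, rng):
    """Draft k tokens, pre-filter with the intermediate tier, then let
    the target tier give the final verdict. Returns the accepted prefix.
    (A full implementation would resample from the residual distribution
    on rejection to preserve the target distribution exactly.)"""
    proposed, local = [], list(ctx)
    for _ in range(k):
        p = draft_dist(local)
        tok = rng.choices(range(VOCAB), weights=p)[0]
        proposed.append((tok, p))
        local.append(tok)

    accepted, local = [], list(ctx)
    for tok, p_draft in proposed:
        if not accept(tok, p_draft, mid_dist(local), rng):
            break  # rejected by the cheap intermediate tier
        if not accept(tok, mid_dist(local), target_dist(local), rng):
            break  # rejected by the target tier
        accepted.append(tok)
        local.append(tok)
    return accepted

rng = random.Random(0)
prefix = pyramid_step([], k=4, rng=rng)
```

The point of the intermediate tier is that most draft rejections are caught by the cheaper model before the expensive target model is ever consulted.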
## Key Contributions
- Three-Model Architecture: Introduction of an intermediate verification tier that bridges the gap between ultra-fast draft models and high-quality target models.
- Improved Acceptance Rates: The pyramid structure achieves 73% higher acceptance rates compared to traditional two-model speculative decoding.
- Theoretical Guarantees: Mathematical proof that the output distribution remains identical to running the target model alone.
- Practical Implementation: Detailed algorithms for efficient caching, batch processing, and memory management in production environments.
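The "identical output distribution" guarantee rests on the standard speculative-sampling identity: accept a proposed token t with probability min(1, p(t)/q(t)) and, on rejection, resample from the normalized residual max(0, p - q); the result is then distributed exactly according to the verifier's p. A quick empirical check, using two-token distributions invented for illustration:

```python
import random
from collections import Counter

# Empirical check of the acceptance rule that makes speculative decoding
# lossless. q is the proposing tier, p the verifying tier; both two-token
# distributions are invented for illustration.
q = [0.7, 0.3]
p = [0.4, 0.6]

def residual(p, q):
    """Normalized residual distribution max(0, p - q) / Z."""
    r = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    z = sum(r)
    return [ri / z for ri in r]

def verified_sample(rng):
    tok = rng.choices([0, 1], weights=q)[0]        # propose from q
    if rng.random() < min(1.0, p[tok] / q[tok]):   # acceptance test
        return tok
    return rng.choices([0, 1], weights=residual(p, q))[0]  # correction

rng = random.Random(0)
counts = Counter(verified_sample(rng) for _ in range(100_000))
# The empirical frequencies match p, not q.
```

On rejection the draft token is discarded and the correction sample comes from the verifier's surplus mass, which is exactly what makes the combined procedure sample from p.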
## Performance Results
| Method | Models Used | Speed (tok/s) | Speedup |
|---|---|---|---|
| Baseline | Sonnet 3.5 | 12.5 | 1.00x |
| 2-Model Speculative | Haiku 2 → Sonnet 3.5 | 18.7 | 1.50x |
| Pyramid (3-Model) | Haiku 2 → Haiku 3 → Sonnet 3.5 | 24.6 | 1.97x |
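As a sanity check on the table, the Speedup column is simply throughput divided by the baseline's 12.5 tok/s:

```python
# Speedup = tokens/sec relative to the 12.5 tok/s baseline row.
baseline = 12.5
throughput = {"2-Model Speculative": 18.7, "Pyramid (3-Model)": 24.6}
speedups = {name: round(tps / baseline, 2) for name, tps in throughput.items()}
# speedups == {"2-Model Speculative": 1.5, "Pyramid (3-Model)": 1.97}
```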
## Implementation Details
The momo-kibidango implementation follows the paper's algorithms closely while adding production-grade features:
- Advanced KV cache management with cross-model sharing
- Dynamic batch sizing based on acceptance rates
- Graceful degradation when memory pressure is high
- Comprehensive metrics for monitoring and optimization
- Integration with OpenClaw's subagent architecture
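One plausible shape for the "dynamic batch sizing based on acceptance rates" feature is a feedback controller over the draft length. The class below, its thresholds, and the EMA factor are all invented for illustration and are not taken from the actual implementation:

```python
# Hypothetical draft-length controller: grow the number of drafted
# tokens while acceptance stays high, shrink it when acceptance drops.
# All constants here are illustrative, not from momo-kibidango.
class DraftLengthController:
    def __init__(self, k=4, k_min=1, k_max=16, ema=0.9):
        self.k, self.k_min, self.k_max = k, k_min, k_max
        self.ema = ema          # smoothing factor for the acceptance rate
        self.rate = 1.0         # exponential moving average of acceptance

    def update(self, accepted, proposed):
        """Record one verification round and return the next draft length."""
        self.rate = self.ema * self.rate + (1 - self.ema) * (accepted / proposed)
        if self.rate > 0.8 and self.k < self.k_max:
            self.k += 1         # drafts are cheap and mostly accepted
        elif self.rate < 0.4 and self.k > self.k_min:
            self.k -= 1         # too many wasted drafts
        return self.k
```

A controller in this style keeps the draft model from wasting compute on long speculative runs that the verifier tiers will mostly reject.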
## Citation
```bibtex
@inproceedings{byun2025pyramid,
  title={Pyramid Speculative Decoding: Accelerating Large Language Model Inference with Tiered Model Cascades},
  author={Byun, Jaehoon and Zhang, Zhihao and Yang, Yifan and Shin, Jinwoo and Batra, Dhruv and Kira, Zsolt},
  booktitle={Advances in Neural Information Processing Systems},
  year={2025}
}
```

## Additional Resources
- Technical Architecture Guide - How we implemented the paper's algorithms
- Blog: How Pyramid Decoding Works - Accessible explanation of the concepts
- Interactive Benchmarks - Explore performance across different configurations