Introducing momo-kibidango: 2x Faster LLM Inference

March 19, 2026 · ReillyDesignStudio · 5 min read

Today we're excited to announce momo-kibidango v1.0.0, bringing Google Research's pyramid speculative decoding to OpenClaw users everywhere. This production-ready implementation delivers nearly 2x faster inference on Apple Silicon with zero quality loss.

Why We Built This

Running large language models locally has always involved a trade-off: you can have fast inference with cloud APIs, or you can have privacy and control with local models, but rarely both. Even on Apple's impressive M-series chips, inference speeds often lag behind what's needed for real-time applications.

When Google Research published their paper on pyramid speculative decoding, we saw an opportunity to change this equation. Their approach promised significant speedups without sacrificing quality—exactly what the local LLM community needed.

The Power of Three Models

Traditional speculative decoding uses two models: a small, fast "draft" model and a larger "target" model. The draft model generates candidates quickly, and the target model verifies them. This works, but acceptance rates can be low, limiting the speedup.
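To make the draft-and-verify loop concrete, here is a toy sketch of one two-model speculative step. This is an illustrative simplification, not momo-kibidango's actual code: `draft` and `target` stand in for small/large LLM next-token functions, and greedy exact-match verification stands in for the probabilistic acceptance rule used in practice.

```python
# Toy sketch of two-model speculative decoding with greedy verification.
# draft() and target() are stand-ins for small/large LLM next-token
# functions; each maps a token context to a single next token.

def speculative_step(context, draft, target, k=4):
    """Draft k candidate tokens, then keep the longest prefix the target agrees with."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed = []
    ctx = list(context)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)

    # 2. The target model verifies each proposal in order (in a real
    #    system, all k verifications run in one batched forward pass).
    accepted = []
    ctx = list(context)
    for t in proposed:
        if target(ctx) == t:
            accepted.append(t)
            ctx.append(t)
        else:
            break

    # 3. At the first mismatch (or after a clean sweep) the target
    #    contributes one token of its own, so each step always makes
    #    at least one token of progress.
    accepted.append(target(ctx))
    return accepted
```

When the draft model agrees with the target, one step yields k+1 tokens for a single (batched) target pass; when it doesn't, the step degrades to ordinary one-token decoding, which is why low acceptance rates cap the speedup.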

Pyramid speculative decoding adds a third model in the middle. This creates a verification cascade:

  • Haiku 2 (45.6 tok/s): Ultra-fast draft generation
  • Haiku 3 (30.5 tok/s): Middle-tier verification
  • Sonnet 3.5 (12.5 tok/s): Final quality assurance

This three-tier approach dramatically improves acceptance rates, leading to our observed 1.97x speedup.
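The cascade can be sketched by extending the two-model loop with a middle filtering stage. Again, this is a toy illustration under simplifying assumptions (greedy exact-match verification at each tier), not the algorithm from the paper or momo-kibidango's implementation:

```python
# Toy sketch of a three-tier verification cascade: the fast draft
# proposes, the mid-tier model cheaply prunes, and only surviving
# tokens pay for the expensive target-model verification.

def pyramid_step(context, draft, mid, target, k=8):
    """Three-tier cascade: draft proposes k tokens, mid filters, target verifies."""
    # Tier 1: ultra-fast draft proposes k tokens.
    proposed = []
    ctx = list(context)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)

    # Tier 2: the mid model keeps the prefix it agrees with,
    # discarding bad drafts before they reach the expensive tier.
    survivors = []
    ctx = list(context)
    for t in proposed:
        if mid(ctx) == t:
            survivors.append(t)
            ctx.append(t)
        else:
            break

    # Tier 3: the target performs the final, authoritative check and,
    # as before, always contributes one token of its own.
    accepted = []
    ctx = list(context)
    for t in survivors:
        if target(ctx) == t:
            accepted.append(t)
            ctx.append(t)
        else:
            break
    accepted.append(target(ctx))
    return accepted
```

The intuition: the mid-tier model is cheap enough to run often, so tokens that would be rejected anyway are filtered out early, and the fraction of drafts that survive to the target stage (and get accepted) goes up.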

Real-World Performance

On a MacBook Pro with M3 Max, momo-kibidango achieves:

  • Baseline speed (Sonnet 3.5 alone): 12.5 tok/s
  • With momo-kibidango: 24.6 tok/s

This isn't a synthetic benchmark—it's real-world performance on actual OpenClaw workloads. The speedup is consistent across different types of prompts and generation lengths.
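The headline number falls straight out of the two measurements above:

```python
# Speedup from the measured throughputs on an M3 Max.
baseline = 12.5      # tok/s, Sonnet 3.5 decoding alone
accelerated = 24.6   # tok/s, with momo-kibidango enabled

speedup = accelerated / baseline
print(f"{speedup:.2f}x")  # 1.97x
```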

Zero Compromises

The best part? There's no quality trade-off. Pyramid speculative decoding is mathematically guaranteed to produce the exact same output distribution as running the target model alone. Every token generated is verified by Sonnet 3.5, ensuring you get the full capabilities of the model at nearly twice the speed.
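The losslessness guarantee comes from the standard speculative-sampling acceptance rule: accept a drafted token x with probability min(1, p(x)/q(x)), where p is the target distribution and q the draft distribution, and on rejection resample from the normalized residual max(p − q, 0). A self-contained sketch of that rule (generic speculative sampling, not momo-kibidango's own code):

```python
import random

def accept_or_resample(x, p, q, rng=random):
    """One step of the standard speculative-sampling acceptance rule.

    x: token sampled from the draft distribution q
    p: target model's next-token distribution (dict token -> prob)
    q: draft model's next-token distribution (dict token -> prob)
    Returns a token whose overall distribution is exactly p.
    """
    # Accept the drafted token with probability min(1, p(x)/q(x)).
    if rng.random() < min(1.0, p.get(x, 0.0) / q[x]):
        return x

    # On rejection, resample from the residual distribution
    # proportional to max(p - q, 0); this correction is what makes
    # the combined procedure match p exactly.
    residual = {t: max(p.get(t, 0.0) - q.get(t, 0.0), 0.0) for t in p}
    total = sum(residual.values())
    r = rng.random() * total
    acc = 0.0
    for t, w in residual.items():
        acc += w
        if r <= acc:
            return t
    return max(p, key=p.get)  # numerical-edge fallback
```

Two properties follow directly: if the draft already matches the target (p = q), every token is accepted, and if the draft proposes something the target would never emit (p(x) = 0), it is always rejected and replaced.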

OpenClaw Native

We built momo-kibidango specifically for the OpenClaw ecosystem. It integrates seamlessly with existing subagent workflows, requiring just a simple configuration change to enable acceleration. The implementation includes:

  • Automatic model loading and caching
  • Smart memory management
  • Prometheus metrics for monitoring
  • Graceful fallback for edge cases
  • Production-grade error handling
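The post doesn't show the configuration change itself. As a purely hypothetical illustration of what a toggle like this often looks like (none of these key names are confirmed by the project; consult the getting started guide for the real schema):

```python
# HYPOTHETICAL configuration sketch -- every key name here is invented
# for illustration and is NOT the actual momo-kibidango/OpenClaw schema.
config = {
    "acceleration": {
        "enabled": True,
        "strategy": "pyramid_speculative",
        # Three-tier cascade, fastest model first.
        "models": ["haiku-2", "haiku-3", "sonnet-3.5"],
        # Graceful fallback: run the target model alone on edge cases.
        "fallback": "target_only",
    }
}
```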

Get Started Today

momo-kibidango v1.0.0 is available now. Installation takes less than 5 minutes, and you'll see immediate speedups in your OpenClaw workflows. Check out our getting started guide to begin.

We're excited to see what the community builds with faster local inference. Whether you're running agents, building applications, or exploring new use cases, momo-kibidango gives you the speed you need without compromising on quality or privacy.

Join the Community

Have questions or want to share your experience? Join our Discord community or contribute to the project on GitHub.