Documentation
Welcome to the momo-kibidango documentation. Learn how to integrate and use 3-model speculative decoding to accelerate your LLM inference on Apple Silicon.
🏗️ Architecture
Understand how pyramid speculative decoding works under the hood.
Technical Overview →
Key Features
- ✅ 1.97x faster inference with zero quality degradation
- ✅ OpenClaw native integration - works out of the box
- ✅ Memory efficient - runs on 16GB MacBooks
- ✅ Production ready - v1.0.0 with monitoring and metrics
- ✅ Smart fallback - gracefully handles edge cases
- ✅ MIT licensed - free for commercial use
How It Works
momo-kibidango implements Google Research's 3-model pyramid architecture:
1. Draft Generation: Haiku 2 (fastest model) generates multiple draft tokens at 45.6 tok/s
2. Middle Verification: Haiku 3 verifies drafts, correcting obvious errors at 30.5 tok/s
3. Final Authority: Sonnet 3.5 validates the final output, ensuring quality matches baseline
This approach achieves a near-2x speedup while mathematically guaranteeing the same output distribution as running Sonnet 3.5 alone.
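The acceptance logic behind that guarantee can be sketched in a few lines. The following is a minimal illustration of speculative-sampling verification chained across two stages, not momo-kibidango's actual implementation; the function names (`verify_draft`, `pyramid_verify`) and the toy dict-based token distributions are assumptions for the sake of the example. A draft token is accepted with probability min(1, p(x)/q(x)); on rejection, a replacement is drawn from the normalized residual max(0, p − q), which is what makes the final output provably match the target model's distribution.

```python
import random

def verify_draft(draft_token, p_target, q_draft, rng):
    """Accept draft token x with probability min(1, p(x)/q(x)).
    On rejection, resample from the normalized residual max(0, p - q).
    Either way, the returned token is distributed exactly as p_target."""
    if rng.random() < min(1.0, p_target[draft_token] / q_draft[draft_token]):
        return draft_token, True
    residual = {t: max(0.0, p_target[t] - q_draft.get(t, 0.0)) for t in p_target}
    tokens, weights = zip(*residual.items())
    return rng.choices(tokens, weights=weights)[0], False

def pyramid_verify(draft_token, q_draft, p_middle, p_final, rng):
    """Two-stage (pyramid) verification of a single draft token.
    Stage 1 leaves the token distributed as p_middle, so p_middle is
    the correct proposal distribution for stage 2; the final output
    therefore matches p_final exactly."""
    tok, _ = verify_draft(draft_token, p_middle, q_draft, rng)
    tok, _ = verify_draft(tok, p_final, p_middle, rng)
    return tok

rng = random.Random(0)
p = {"hello": 0.5, "world": 0.5}
# When draft and target agree, the draft token is always accepted.
tok, accepted = verify_draft("hello", p, p, rng)
```

In the real pipeline the same test runs over a whole block of draft tokens, stopping at the first rejection, which is where the speedup comes from: accepted tokens cost only one target-model forward pass per block.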
Need Help?
Check out our FAQ or join our Discord community for support.