Documentation
Welcome to the momo-kibidango documentation. Learn how to integrate and use 3-model speculative decoding to accelerate your LLM inference on Apple Silicon.
🏗️ Architecture
Understand how pyramid speculative decoding works under the hood.
Technical Overview →
Key Features
- ✅ 1.97x faster inference with zero quality degradation
- ✅ OpenClaw native integration - works out of the box
- ✅ Memory efficient - runs on 16GB MacBooks
- ✅ Production ready - v1.0.0 with monitoring and metrics
- ✅ Smart fallback - gracefully handles edge cases
- ✅ MIT licensed - free for commercial use
How It Works
momo-kibidango implements Google Research's 3-model pyramid architecture:
1. Draft Generation: Haiku 2 (fastest model) generates multiple draft tokens at 45.6 tok/s
2. Middle Verification: Haiku 3 verifies drafts, correcting obvious errors at 30.5 tok/s
3. Final Authority: Sonnet 3.5 validates the final output, ensuring quality matches baseline
This approach achieves a near-2x speedup while mathematically guaranteeing the same output distribution as running Sonnet 3.5 alone.
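The acceptance logic behind that guarantee can be sketched in a few lines. The following is a minimal illustration of speculative-sampling verification chained across two stages, not momo-kibidango's actual implementation; the function names (`verify_draft`, `pyramid_verify`) and the toy dict-based token distributions are assumptions for the sake of the example. A draft token is accepted with probability min(1, p(x)/q(x)); on rejection, a replacement is drawn from the normalized residual max(0, p − q), which is what makes the final output provably match the target model's distribution.

```python
import random

def verify_draft(draft_token, p_target, q_draft, rng):
    """Accept draft token x with probability min(1, p(x)/q(x)).
    On rejection, resample from the normalized residual max(0, p - q).
    Either way, the returned token is distributed exactly as p_target."""
    if rng.random() < min(1.0, p_target[draft_token] / q_draft[draft_token]):
        return draft_token, True
    residual = {t: max(0.0, p_target[t] - q_draft.get(t, 0.0)) for t in p_target}
    tokens, weights = zip(*residual.items())
    return rng.choices(tokens, weights=weights)[0], False

def pyramid_verify(draft_token, q_draft, p_middle, p_final, rng):
    """Two-stage (pyramid) verification of a single draft token.
    Stage 1 leaves the token distributed as p_middle, so p_middle is
    the correct proposal distribution for stage 2; the final output
    therefore matches p_final exactly."""
    tok, _ = verify_draft(draft_token, p_middle, q_draft, rng)
    tok, _ = verify_draft(tok, p_final, p_middle, rng)
    return tok

rng = random.Random(0)
p = {"hello": 0.5, "world": 0.5}
# When draft and target agree, the draft token is always accepted.
tok, accepted = verify_draft("hello", p, p, rng)
```

In the real pipeline the same test runs over a whole block of draft tokens, stopping at the first rejection, which is where the speedup comes from: accepted tokens cost only one target-model forward pass per block.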
Need Help?
Check out our FAQ or join our Discord community for support.