In a significant move within the competitive AI landscape, Chinese tech giant Xiaomi has officially launched and open-sourced its latest large language model, MiMo-V2-Flash. Announced at the 2025 Xiaomi Human-Car-Home Ecosystem Partners Conference, the model is positioned as a high-performance, cost-effective alternative to leading open-source offerings, boasting impressive benchmark scores and a radically efficient architecture designed for speed.
A New Contender in the Open-Source Arena
Xiaomi's MiMo-V2-Flash enters a crowded field with a bold proposition: matching the capabilities of top-tier models like DeepSeek-V3.2 and Kimi-K2 while dramatically reducing operational costs. The model utilizes a Mixture of Experts (MoE) architecture with a total of 309 billion parameters, of which 15 billion are active during inference. This design choice is central to its efficiency, allowing it to deliver complex reasoning without the computational burden of activating its full parameter set for every task. Under a permissive MIT license, the model's base weights are now available on Hugging Face, inviting developers and researchers to experiment and build upon it.
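The efficiency of activating only 15 billion of 309 billion parameters comes from routing each token to a small subset of experts. As a rough illustration, here is a generic top-k MoE router; the expert count and k below are invented for the example and are not Xiaomi's actual configuration:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route(router_logits, k=2):
    """Pick the top-k experts for one token; only those experts run,
    so only a fraction of the total parameters is 'active'."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    # Renormalize the selected experts' gate weights so they sum to 1.
    return [(i, probs[i] / norm) for i in top]

# Toy example: 8 experts, 2 active per token (illustrative numbers only).
logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.2]
selected = route(logits, k=2)
active_fraction = 2 / 8  # only a quarter of expert parameters run per token
```

In a real MoE layer the same routing decision happens independently for every token at every expert layer, which is what lets total capacity grow without growing per-token compute.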
Model Specifications & Pricing
- Architecture: Mixture of Experts (MoE)
- Total Parameters: 309 Billion
- Active Parameters: 15 Billion
- Context Window: 256k tokens
- Inference Speed: ~150 tokens/second
- Pricing: USD 0.1 per million tokens (input), USD 0.3 per million tokens (output)
- License: MIT (open-source)
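At the listed rates, per-request costs are straightforward to estimate; the workload in this sketch is hypothetical:

```python
INPUT_PRICE = 0.1 / 1_000_000   # USD per input token (USD 0.1 per million)
OUTPUT_PRICE = 0.3 / 1_000_000  # USD per output token (USD 0.3 per million)

def request_cost(input_tokens, output_tokens):
    """Estimated USD cost of one request at the listed rates."""
    return input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE

# Hypothetical workload: a 200k-token context producing a 2k-token answer.
cost = request_cost(200_000, 2_000)  # 0.02 + 0.0006 = 0.0206 USD
```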
Engineered for Speed and Affordability
The core innovation of MiMo-V2-Flash lies in its architectural optimizations, which target the twin goals of blistering inference speed and low cost. Xiaomi claims the model achieves a generation speed of 150 tokens per second. More strikingly, it pushes the price of AI inference down to USD 0.1 per million tokens for input and USD 0.3 per million tokens for output, setting a new benchmark for affordability. This is achieved through two key technologies. First, a hybrid sliding window attention mechanism drastically reduces the memory required for processing long contexts by a factor of nearly six, while still supporting a 256k token context window. Second, a native multi-token prediction (MTP) module allows the model to predict several future tokens in parallel, speeding up inference by 2 to 2.6 times.
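The "factor of nearly six" memory saving can be sanity-checked with simple arithmetic, assuming the layer mix Xiaomi describes: five 128-token sliding-window attention layers for every one global-attention layer over the full context.

```python
CONTEXT = 256 * 1024      # 256k-token context window
WINDOW = 128              # sliding-window span per local layer
LOCAL_PER_GLOBAL = 5      # 5 sliding-window layers per global layer

def kv_cache_ratio(context=CONTEXT, window=WINDOW, ratio=LOCAL_PER_GLOBAL):
    """Cached key/value entries per group of (ratio local + 1 global)
    layers: full-attention baseline vs. the hybrid scheme."""
    full = (ratio + 1) * context       # every layer caches the whole context
    hybrid = ratio * window + context  # local layers cache only their window
    return full / hybrid

reduction = kv_cache_ratio()  # just under 6x, matching the claimed figure
```

Because the sliding-window layers' cache size is constant regardless of context length, the saving grows with longer prompts and is what makes the 256k window practical.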
Core Technical Innovations
- Hybrid Sliding Window Attention: Uses a 5:1 ratio of sliding window (128 tokens) to global attention layers, reducing KV cache memory by ~6x.
- Multi-Token Prediction (MTP): Natively predicts 2.8-3.6 tokens on average in parallel, accelerating inference by 2.0-2.6x.
- Multi-Teacher Online Policy Distillation (MOPD): A training method claimed to be 50x more compute-efficient than traditional RL pipelines.
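The relationship between the 2.8-3.6 tokens predicted per step and the 2.0-2.6x end-to-end speedup can be sketched with a rough cost model; the drafting-and-verification overhead figure below is an assumption chosen for illustration, not a published number:

```python
def mtp_speedup(avg_tokens_per_step, draft_overhead):
    """Rough throughput model for multi-token prediction: each main-model
    forward pass now yields several tokens instead of one, discounted by
    the relative cost of running the lightweight MTP head and verifying
    its proposals. `draft_overhead` is an illustrative assumption."""
    return avg_tokens_per_step / (1 + draft_overhead)

# With the reported 2.8-3.6 tokens per step and an assumed ~35% overhead,
# the modeled speedup lands roughly in the reported 2.0-2.6x range.
low = mtp_speedup(2.8, 0.35)
high = mtp_speedup(3.6, 0.35)
```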
Benchmark Performance and Capabilities
Initial benchmark results paint a picture of a highly capable model, particularly in technical domains. In programming, MiMo-V2-Flash scored 73.4% on the SWE-bench Verified test, which involves fixing real-world software bugs—a result that reportedly surpasses all other open-source models and approaches the performance of advanced closed-source systems. It also performs strongly in mathematics and scientific knowledge tests, ranking in the top two among open-source models. Beyond raw benchmarks, the model is equipped for practical application, supporting deep thinking, web search, and complex multi-turn agent interactions. Its performance in agent-based tasks, such as communication and retail simulations, further demonstrates its ability to understand and execute multi-step logical operations.
Key Benchmark Scores
- SWE-bench Verified (Code/Bug Fixing): 73.4%
- SWE-Bench Multilingual: 71.7%
- Agent Benchmarks (τ²-Bench):
  - Communication: 95.3
  - Retail: 79.5
  - Aviation: 66.0
- BrowseComp Search Agent: 45.4 (58.3 with context management)
A Novel Approach to Model Training
Xiaomi's technical report highlights an unconventional and efficient training methodology dubbed Multi-Teacher Online Policy Distillation (MOPD). This approach moves away from the traditional, computationally expensive pipeline of supervised fine-tuning followed by reinforcement learning. Instead, the student model (MiMo-V2-Flash) generates its own outputs, and multiple expert teacher models provide dense, per-token feedback on those outputs. This method is claimed to be 50 times more compute-efficient than the conventional pipeline, allowing the student model to rapidly learn and reach peak teacher performance with far fewer resources. The framework also allows for a self-reinforcing cycle in which a proficient student can later become a teacher for the next model iteration.
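In spirit, dense per-token feedback resembles a token-level divergence between the student's output distribution and each teacher's, computed over sequences the student itself generated (on-policy). The sketch below illustrates that idea only; the equal teacher weighting and toy numbers are assumptions, not the reported method's exact formulation:

```python
import math

def token_kl(student_probs, teacher_probs):
    """KL(teacher || student) at one token position: a dense per-token
    signal, unlike a single scalar reward at the end of a rollout."""
    return sum(t * math.log(t / s)
               for t, s in zip(student_probs, teacher_probs) if t > 0)

def mopd_loss(student_seq, teacher_seqs):
    """Average per-token KL against several teachers over a sequence the
    student generated. Equal teacher weighting is an illustrative choice."""
    total = 0.0
    for pos, s_probs in enumerate(student_seq):
        for teacher in teacher_seqs:
            total += token_kl(s_probs, teacher[pos])
    return total / (len(student_seq) * len(teacher_seqs))

# Toy setup: vocabulary of 3 tokens, 2 positions, 2 teachers.
student = [[0.5, 0.3, 0.2], [0.6, 0.2, 0.2]]
teachers = [
    [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]],
    [[0.6, 0.3, 0.1], [0.7, 0.2, 0.1]],
]
loss = mopd_loss(student, teachers)  # positive; zero iff student matches
```

Minimizing such a loss pulls the student toward the teachers at every token it emits, which is the intuition behind learning faster than from sparse end-of-episode rewards.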
Positioning for the Future of AI Agents
Xiaomi executives, including President Lu Weibing and newly appointed MiMo lead Luo Fuli, framed the release as more than just another model. They described MiMo-V2-Flash as a "new language foundation for the Agent era," emphasizing its role in building systems that don't just simulate language but understand and interact with the world. The model's long context window and integration with developer tools like Claude Code and Cursor are aimed at making it a practical, daily assistant for coding and complex task automation. With the model's API currently free for a limited time, Xiaomi is clearly aiming for rapid adoption and community feedback to fuel its evolution in the fast-moving AI space.
