MiniMax M2.7 Helped Build Itself

MiniMax's M2.7 participated in its own training loop, running 100+ autonomous iteration rounds on its own scaffold for a 30% performance boost, and hits 56.22% on SWE-Pro.

The claim that makes M2.7 interesting isn’t a benchmark number — it’s the process. MiniMax says this is the first model that deeply participated in its own evolution. Not as a slogan. As an engineering workflow.

During development, MiniMax had an internal version of M2.7 build its own research agent harness, manage data pipelines and training environments, launch reinforcement learning experiments, monitor results, read logs, debug failures, analyze metrics, fix code, submit merge requests, and then decide what to try next based on what worked. The model ran over 100 autonomous iteration rounds on its own scaffold — analyzing failure trajectories, modifying code, running evaluations, comparing results, and deciding whether to keep or revert changes. No human in the loop for any of it.
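The loop described above — evaluate, modify, compare, keep or revert — is, at its core, greedy hill-climbing over the scaffold. A minimal sketch of that control flow (all names and the toy scoring function are illustrative, not MiniMax's actual harness):

```python
import random

def evaluate(config):
    """Stand-in for running the internal eval suite on a scaffold config."""
    # Toy objective: higher is better, peaks when both knobs are near 0.5.
    return 1.0 - abs(config["knob_a"] - 0.5) - abs(config["knob_b"] - 0.5)

def propose_change(config):
    """Stand-in for the model editing its own scaffold (code, prompts, settings)."""
    key = random.choice(list(config))
    candidate = dict(config)
    candidate[key] = min(1.0, max(0.0, candidate[key] + random.uniform(-0.2, 0.2)))
    return candidate

def self_iterate(config, rounds=100):
    """Run autonomous rounds: try a change, keep it if evals improve, else revert."""
    best_score = evaluate(config)
    for _ in range(rounds):
        candidate = propose_change(config)
        score = evaluate(candidate)
        if score > best_score:
            config, best_score = candidate, score  # keep the change
        # else: revert, i.e. carry the previous config forward
    return config, best_score
```

The keep-or-revert gate is what makes the loop safe to run unsupervised: a bad edit can never make the scaffold worse than the last accepted state.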

The result: a 30% performance improvement on MiniMax’s internal evaluation sets, driven by optimizations the model itself discovered — things like systematically searching for optimal sampling parameter combinations and designing workflow-specific guidelines for itself.
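One of those discovered optimizations — systematically searching sampling parameter combinations — is, at its simplest, a grid search. A sketch assuming an `evaluate(temperature, top_p)` scoring function is available (the name and signature are illustrative):

```python
from itertools import product

def search_sampling_params(evaluate, temperatures, top_ps):
    """Score every (temperature, top_p) combination and return the best pair."""
    best_score, best_params = float("-inf"), None
    for t, p in product(temperatures, top_ps):
        score = evaluate(t, p)
        if score > best_score:
            best_score, best_params = score, {"temperature": t, "top_p": p}
    return best_params, best_score
```

In practice each `evaluate` call means re-running a benchmark subset, so a coarse grid followed by a finer grid around the best cell keeps the cost manageable.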

The Self-Evolution Thesis

MiniMax is framing this as the beginning of a cycle: the model improves its harness, the improved harness produces a better model, the better model improves the harness further. Right now, M2.7 handles 30-50% of the workflow autonomously. The rest still requires human researchers for critical decisions.

They tested the outer boundary of this approach by having M2.7 compete in 22 machine learning competitions from OpenAI’s MLE Bench Lite. Each trial got 24 hours for iterative self-evolution on a single A30 GPU. The best run produced 9 gold, 5 silver, and 1 bronze medal — a 66.6% medal rate that ties Gemini 3.1 and sits behind only Opus 4.6 (75.7%) and GPT-5.4 (71.2%).

That’s competitive with frontier models on a task that requires not just code generation but experimental design, hypothesis testing, and iterative refinement — exactly the capabilities the self-evolution loop is designed to strengthen.

Software Engineering Numbers

On coding benchmarks, M2.7 is genuinely strong. SWE-Pro (multi-language) hits 56.22%, matching GPT-5.3-Codex. VIBE-Pro, which measures end-to-end project delivery across Web, Android, iOS, and simulation tasks, lands at 55.6% — near Opus 4.6. SWE Multilingual reaches 76.5%, and Terminal Bench 2 (deep system-level comprehension) hits 57.0%.

The practical demonstration matters more than the scores: MiniMax says M2.7 has reduced live production incident recovery time to under three minutes on multiple occasions. The model correlates monitoring metrics with deployment timelines, does statistical analysis on trace sampling, connects to databases to verify root causes, and even uses non-blocking index creation to stop the bleeding before submitting a fix. That’s SRE-level reasoning, not just code generation.
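The first triage step described — correlating a metric anomaly with the deployment timeline — amounts to finding the most recent deploy that preceded the anomaly. A sketch with assumed data shapes; the mitigation the article alludes to maps to, e.g., PostgreSQL's `CREATE INDEX CONCURRENTLY`, which builds an index without blocking writes:

```python
from datetime import datetime

def likely_culprit(anomaly_at, deploys):
    """Return the most recent deployment that preceded the anomaly, if any."""
    prior = [d for d in deploys if d["at"] <= anomaly_at]
    return max(prior, key=lambda d: d["at"]) if prior else None

# Non-blocking mitigation of the kind described (PostgreSQL syntax;
# the table and column names here are hypothetical):
MITIGATION_SQL = "CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders (user_id);"
```

Timestamp correlation only nominates a suspect; the article's point is that M2.7 then verifies the hypothesis against traces and the database before acting.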

Agent Teams and Professional Work

M2.7 ships native multi-agent collaboration — what MiniMax calls Agent Teams. Multiple model instances take distinct roles (product manager, engineer, QA) and work together on complex tasks while maintaining role boundaries and challenging each other’s reasoning. This is trained as a native capability, not achieved through prompting.
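MiniMax doesn't document how Agent Teams is wired internally, but the pattern described — distinct roles drafting and challenging each other's reasoning — can be sketched around a generic `respond(role, prompt)` model call (a placeholder, not MiniMax's API):

```python
def team_round(task, roles, respond):
    """One collaboration round: each role drafts, then is critiqued by every other role."""
    drafts = {role: respond(role, f"Draft a solution for: {task}") for role in roles}
    critiques = {
        role: {
            other: respond(other, f"As {other}, challenge this draft: {drafts[role]}")
            for other in roles if other != role
        }
        for role in roles
    }
    return drafts, critiques
```

The role boundary lives in the prompt each instance receives; the claim in the article is that M2.7 maintains those boundaries as a trained behavior rather than relying on prompting alone.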

On the professional work side, M2.7 scores an Elo of 1495 on GDPval-AA — the highest among open-source models, behind only Opus 4.6, Sonnet 4.6, and GPT-5.4. The model handles Word, Excel, and PPT workflows with multi-round high-fidelity editing, and maintains 97% skill adherence across 40+ complex skills (each over 2,000 tokens).

The TSMC demo is the most concrete illustration: M2.7 autonomously reads annual reports and earnings calls, cross-references research reports, builds a revenue forecast model, and produces a PowerPoint and Word research report using templates — handling it like a junior financial analyst with self-correction capabilities.

OpenRoom and the Entertainment Play

MiniMax also open-sourced OpenRoom (openroom.ai) — a Web GUI interaction system where AI characters exist in a visual space with real-time scene generation, proactive environmental engagement, and conversation-driven experiences. It’s built on M2.7’s character consistency and emotional intelligence capabilities. Most of the code was written by AI.

This isn’t just a demo — it’s a signal that MiniMax sees agentic models extending beyond productivity into interactive entertainment. Xiaomi’s MiMo-V2 is going after the full perception-reasoning-voice stack; MiniMax is going after the self-evolving agent that can also be your companion.

The Competitive Position

M2.7 integrates with OpenClaw and scored 62.7% on MiniMax’s own MM Claw benchmark — close to Sonnet 4.6. On Toolathon, it hit 46.3%, placing it in the global top tier for tool use.

The self-evolution angle is what separates this from yet another Chinese lab matching Western frontier benchmarks. Every lab is chasing SWE-bench and MMLU scores. MiniMax is claiming something different: that the model’s ability to improve its own development process is itself a capability worth measuring and optimizing for. Whether that claim holds up under external scrutiny is an open question — but the benchmark results suggest the approach is producing genuinely competitive output.

Available now on MiniMax Agent (agent.minimax.io) and the API platform (platform.minimax.io), with a dedicated coding plan for developers.

Tags: AI