MiniMax M2 vs M3: What's Actually Different and Which One Should You Use?

By OpsMatters

Jun 8, 2026

If you've been following open-source AI in 2026, MiniMax has probably crossed your radar at least once. The Shanghai-based lab has been quietly releasing models that punch well above their weight — and now, with M3 dropping on June 1, 2026, the question everyone's asking is: does it replace M2, or do they serve different purposes?

Let's break it down clearly, without the hype.

A Quick Background on the M-Series

MiniMax launched the original M2 in October 2025 with a simple pitch: frontier-level coding and agentic performance at a fraction of the price. It wasn't just marketing — M2 genuinely delivered. With 230 billion total parameters but only 10 billion active parameters per forward pass, it kept costs low while staying competitive with models like Claude Sonnet on agentic tasks.

The M-series then went through several iterations (M2.1, M2.5, and M2.7), each improving tool use, coding reliability, and long-chain agent behavior. By the time M3 arrived, M2.7 had become a go-to for developers who needed a reliable, cost-efficient model for production workflows.

Then M3 changed the conversation.

What MiniMax M2 Actually Does Well

Before talking about M3, it's worth being honest about why M2 still matters.

M2 was built for speed and efficiency. At roughly $0.30 per million input tokens and $1.20 per million output tokens, it's one of the most cost-effective serious models available. It generates output at around 129 tokens per second — fast enough for interactive agents that need to read files, plan changes, execute commands, and retry on failures without burning through your budget.
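To make those numbers concrete, here is a minimal sketch that turns the quoted prices and throughput into a per-run estimate. The function name and the example workload (150K tokens read, 8K tokens written) are assumptions for illustration, and the constants are simply the figures quoted above, so adjust them if pricing changes.

```python
# Prices and throughput as quoted in this article for M2; not fetched from any API.
INPUT_PRICE_PER_M = 0.30    # USD per million input tokens
OUTPUT_PRICE_PER_M = 1.20   # USD per million output tokens
TOKENS_PER_SECOND = 129     # approximate output speed


def estimate_run(input_tokens: int, output_tokens: int) -> tuple[float, float]:
    """Return (cost in USD, generation time in seconds) for one agent run."""
    cost = (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000
    seconds = output_tokens / TOKENS_PER_SECOND
    return cost, seconds


# Hypothetical workload: 150K tokens read, 8K tokens written per run.
cost, seconds = estimate_run(150_000, 8_000)
print(f"cost: ${cost:.4f}, generation time: {seconds:.0f}s")
```

Under these assumptions a single run costs about five and a half cents and spends roughly a minute generating output. The time estimate ignores prompt processing and network latency.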

On benchmarks, M2 performed well across tool use and deep search, coming close to the best models available. Its coding numbers were slightly behind the top-tier closed models, but it was already best-in-class among open-weight options at launch. If you want a full picture of how far the M-series has come since then, the MiniMax M3 model page is a good reference — it shows exactly where the capability ceiling moved.

More importantly, M2's behavior is well-understood at this point. If you're running production workflows (bug fixes, test writing, refactoring, repeated automation), M2 and M2.7 give you predictable results. You know the failure modes. You know the costs. That stability has real value, especially when M3 is only about a week old and its real-world behavior in edge cases hasn't been stress-tested at scale yet.

What M3 Changes (And Why It Matters)

M3 is not a point release. MiniMax describes it as a generational shift, and the architecture backs that up.

Context Window: 197K → 1 Million Tokens

This is the biggest headline number. M2 shipped with a roughly 197K-token context window. M3 goes to 1 million tokens, and makes that usable thanks to a new attention mechanism called MiniMax Sparse Attention (MSA).

Here's why that matters for agents specifically: standard transformer attention scales quadratically with context length. Every token attends to every other token, so doubling context quadruples the compute cost. MSA breaks that by having the model focus on selected blocks of context rather than treating everything equally. The result is up to 15.6x faster decoding at long contexts compared to full attention, making 1M tokens economically viable rather than just theoretically possible.
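The scaling argument can be sketched with a simple count of query-key pairs. This is a toy model, not MSA itself: the block size and the number of selected blocks are made-up values, the cost of choosing which blocks to attend to is ignored, and the ratios it prints are not meant to reproduce the 15.6x figure.

```python
def full_attention_pairs(n_tokens: int) -> int:
    """Query-key pairs scored by dense attention: every token against every token."""
    return n_tokens * n_tokens


def block_sparse_pairs(n_tokens: int, block_size: int = 128, top_k_blocks: int = 16) -> int:
    """Pairs scored if each query attends to only top_k_blocks blocks of keys.

    block_size and top_k_blocks are illustrative values, not MSA's real settings.
    """
    n_blocks = -(-n_tokens // block_size)  # ceiling division
    keys_per_query = min(top_k_blocks, n_blocks) * block_size
    return n_tokens * min(keys_per_query, n_tokens)


for n in (197_000, 394_000, 1_000_000):
    dense = full_attention_pairs(n)
    sparse = block_sparse_pairs(n)
    print(f"{n:>9,} tokens: dense={dense:.2e} sparse={sparse:.2e} ratio={dense / sparse:.0f}x")
```

Doubling the context from 197K to 394K tokens quadruples the dense count but only doubles the block-sparse count, because each query looks at a fixed number of keys. That is the general reason block selection keeps very long contexts affordable.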

If you're working with large codebases, long research sessions, multi-document comparisons, or lengthy browser automation sessions, this isn't a nice-to-have — it's a different capability class.

Native Multimodality: Text, Images, and Video

M2 was text-only. M3 accepts text, image, and video inputs natively. For coding agents and operations assistants, this opens up workflows that weren't possible before: reading UI screenshots, analyzing charts, parsing dashboard errors, doing visual QA on rendered components, and understanding browser states without needing a separate vision model.

This is more significant than it sounds. Many real-world agent tasks involve some visual component. M3 handles them in a single model call.

Benchmark Results: Where M3 Lands

On SWE-Bench Pro — widely considered the most meaningful real-world coding benchmark — M3 scores 59.0%. That puts it ahead of GPT-5.5 (58.6%) and Gemini 3.1 Pro. Claude Opus 4.7 still leads at 64.3%, so Anthropic keeps the edge on raw coding performance. But M3 is the first open-weight model to even be in that conversation.

On BrowseComp, which tests autonomous web search and browsing, M3 scores 83.5 — ahead of Claude Opus 4.7's 79.3. That's notable. Autonomous web research is one of the hardest agentic tasks, and M3 outperforms the closed frontier model on it.

Other numbers worth knowing:

Terminal-Bench 2.1 (command-line agent tasks): 66.0%
MCP Atlas (tool use): 74.2%
OSWorld-Verified (desktop GUI operation): 70.0%

These aren't cherry-picked soft metrics. They represent the workflows where agents actually break down.

One caveat worth mentioning: most of these figures are vendor-published and haven't been fully independently verified yet. That's not unusual for a launch-week model, but it's a reason to run your own benchmarks on your actual tasks before committing.

M2 vs M3: The Real Comparison

Here's how the two models actually compare where it counts:

Context window: M2 caps at around 197K tokens. M3 goes to 1 million. If your tasks stay comfortably under M2's limit, M2 is fine. If you're loading entire repos or processing long video transcripts, only M3 works.

Multimodality: M2 is text-only. M3 handles images and video. If your workflows are purely text-based, this doesn't change anything. If visual inputs matter, M3 is the only option.

Coding performance: M2.7 reached 56.2% on SWE-Bench Pro. M3 pushes to 59.0%, a gain of 2.8 points. The improvement is real but not dramatic for well-scoped tasks. For large, cross-cutting changes that touch many files and systems, M3's advantage grows.

Price: Both models currently sit at roughly the same promotional price — $0.30 per million input tokens and $1.20 per million output tokens. Standard M3 pricing is higher, so treat current rates as temporary. M2 is the safer long-term cost assumption.

Stability: M2 is battle-tested. M3 is about a week old. For production systems, that difference matters.
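The comparison above boils down to a small routing rule. The sketch below is one way to express it, under stated assumptions: the model labels "m2.7" and "m3" are placeholders rather than real API model IDs, the context limits are the approximate figures from this article, and the 80% headroom factor is an arbitrary allowance for the model's own output.

```python
from dataclasses import dataclass

M2_CONTEXT_LIMIT = 197_000     # approximate figure from this article
M3_CONTEXT_LIMIT = 1_000_000


@dataclass
class Task:
    prompt_tokens: int
    has_images_or_video: bool = False


def choose_model(task: Task, headroom: float = 0.8) -> str:
    """Pick a model label; labels are placeholders, not real API model IDs."""
    if task.prompt_tokens > M3_CONTEXT_LIMIT * headroom:
        raise ValueError("task exceeds the usable M3 context budget; split it first")
    if task.has_images_or_video:
        return "m3"      # M2 is text-only
    if task.prompt_tokens > M2_CONTEXT_LIMIT * headroom:
        return "m3"      # too large for M2's window once output is accounted for
    return "m2.7"        # text-only and fits: keep the battle-tested default


print(choose_model(Task(prompt_tokens=40_000)))
print(choose_model(Task(prompt_tokens=300_000)))
print(choose_model(Task(prompt_tokens=40_000, has_images_or_video=True)))
```

Defaulting to the older model and escalating only when a task needs the larger window or visual input keeps the stable path as the common case.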

Deep Search Specifically: Which Model Wins?

Deep search — multi-step information retrieval combined with web interactions and reasoning — is one of the areas where MiniMax has invested the most. Both M2 and M3 score well here compared to other open-weight models, but M3's BrowseComp score of 83.5 is genuinely impressive.

If you're building a research agent that needs to gather, synthesize, and reason across many web sources, M3 is worth testing seriously. The 1M context window means it can hold more source material in memory before losing track of earlier context. Combined with the improved tool-use numbers, this makes M3 the stronger choice for long-horizon research tasks.

For short, well-defined search tasks — pulling current prices, checking a specific fact, summarizing a single article — M2 is still fast and accurate enough that the upgrade may not be worth the added complexity of switching.

What This Means for You

The practical answer depends on what you're building:

Stick with M2 or M2.7 if you have working production workflows, your tasks fit comfortably within M2's roughly 197K-token window, you don't need image or video inputs, and you value cost predictability above everything else.

Test M3 if you're working with large codebases, long documents, browser research pipelines, or any workflow that benefits from multimodal inputs. Run it on your actual tasks before fully migrating — benchmarks tell part of the story, real task completion tells the rest.

The safest path right now is running both in parallel for a few weeks. Use M2.7 as your fallback. Test M3 on the tasks where you've historically hit context limits or needed visual understanding. Let the results from your own workload make the decision.
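A side-by-side trial can be as simple as running the same task list through both models and comparing pass rates. The sketch below assumes you supply a "runner" per model that calls the model and applies your own success check. The stub runners and their outcomes are invented purely so the example executes, and say nothing about how either model actually behaves.

```python
from typing import Callable

# A runner takes a task description and returns True if the task succeeded.
Runner = Callable[[str], bool]


def compare_models(tasks: list[str], runners: dict[str, Runner]) -> dict[str, float]:
    """Run every task through every model and return each model's pass rate."""
    if not tasks:
        raise ValueError("need at least one task")
    rates = {}
    for name, run in runners.items():
        passed = sum(1 for task in tasks if run(task))
        rates[name] = passed / len(tasks)
    return rates


# Stub runners stand in for real API calls plus your own success check.
def stub_m27(task: str) -> bool:
    return "long-context" not in task  # invented behavior, for illustration only


def stub_m3(task: str) -> bool:
    return True  # invented behavior, for illustration only


tasks = ["fix flaky test", "refactor module", "long-context repo audit", "write unit tests"]
print(compare_models(tasks, {"m2.7": stub_m27, "m3": stub_m3}))
```

In a real trial, each runner would wrap an API call and a task-specific check such as a passing test suite, and the task list would come from the workloads where you have hit context limits or needed visual understanding.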

For a practical starting point on running these models in a managed environment, MyClaw is worth looking at — it handles the hosting, tooling, and agent runtime so you can focus on evaluating model behavior rather than infrastructure.

The Bigger Picture

MiniMax's M-series has quietly become one of the more interesting stories in open-weight AI. M2 proved that a Chinese open-source lab could compete with proprietary frontier models on agentic tasks. M3 narrows the gap further — and on autonomous web browsing, it actually leads.

The broader implication is that the cost advantage of open-weight models keeps growing. If M3's benchmark numbers hold up under independent verification, developers will have access to a model that matches or beats most proprietary options on real coding and agent tasks, at a fraction of the price and with the ability to self-host.

M2 was the model that made people take MiniMax seriously. M3 is the one that puts them in the same sentence as OpenAI, Anthropic, and Google. That's a meaningful shift, even if you don't upgrade right away.