Meituan Released General 365: A Rigorous New Benchmark

Meituan released General 365, a rigorous new benchmark — and honestly, it’s already making a lot of AI researchers uncomfortable. In a good way. The Chinese tech giant didn’t just throw together another multiple-choice test. They built something that makes today’s best models look surprisingly, humblingly limited.

Even Gemini 3 Pro — the top scorer in initial testing — could only manage around 62%. Twenty-six mainstream models were evaluated, and not one came close to acing it. Consequently, the AI community is asking a pointed question: have we been grading our models on a curve this whole time?

This benchmark lands at exactly the right moment. Companies routinely claim their models “beat” existing tests, while researchers increasingly doubt whether those tests measure anything resembling real intelligence. General 365 changes the conversation entirely.

Why Meituan Released General 365 as a Rigorous New Benchmark

Meituan isn’t a name most Americans associate with AI research. Nevertheless, the company — China’s largest food delivery and local services platform — has been quietly building serious AI capabilities for years. Their decision to release General 365 reflects growing frustration with evaluation tools that just aren’t pulling their weight anymore.

The core problem is straightforward. Popular benchmarks like MMLU (Massive Multitask Language Understanding) have become too easy. Top models now score above 90% on MMLU, which sounds impressive until you realize those same models still fumble basic common-sense reasoning in real-world applications. I’ve seen this firsthand — a model aces a knowledge test and then completely falls apart on a three-step logic problem.

Meituan released General 365 as a rigorous new benchmark specifically to close that gap. The test focuses on complex, multi-step reasoning across 365 carefully curated problems. Each one requires genuine understanding — not pattern matching. Importantly, the questions span diverse domains: mathematics, logic, science, language comprehension, and practical problem-solving.

Here’s what sets it apart structurally:

  • Anti-contamination measures: Questions are original, so models can’t have memorized them during training
  • Multi-step reasoning required: Surface-level recall won’t get you far here
  • Human expert validation: Domain specialists signed off on every question
  • Balanced difficulty distribution: Problems range from challenging to genuinely brutal
  • Cross-domain coverage: Being great at one thing won’t save you

Furthermore, Meituan designed General 365 to resist “teaching to the test.” You can’t memorize your way to a good score — you have to actually reason. This directly challenges the benchmark saturation problem that’s been quietly undermining AI evaluation for years. Fair warning, though: this also makes it harder to use as a quick sanity check during development cycles.

How General 365 Compares to Existing AI Benchmarks

Understanding why Meituan released General 365 as a rigorous new benchmark requires some context. Specifically, you need to see how badly current benchmarks have drifted from being useful.

Benchmark | Focus Area | Top Model Score | Year Created | Key Limitation
----------|------------|-----------------|--------------|---------------
MMLU | Multitask knowledge | ~90%+ | 2020 | Saturated; too easy for frontier models
ARC (AI2 Reasoning Challenge) | Science reasoning | ~95%+ | 2018 | Limited to grade-school science questions
GSM8K | Math word problems | ~95%+ | 2021 | Narrow scope; only arithmetic reasoning
GPQA | Graduate-level Q&A | ~55-65% | 2023 | Small question set; limited domains
General 365 | Complex multi-domain reasoning | ~62% (Gemini 3 Pro) | 2025 | New; needs longitudinal validation

The pattern is hard to ignore. Older benchmarks have hit ceiling effects — models score so high that the tests can’t tell you anything useful about which one actually reasons better. Conversely, General 365 creates real, meaningful separation between models. That’s rarer than it should be.
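To make the ceiling effect concrete, here’s a minimal sketch that turns the approximate top scores from the comparison above into remaining headroom. The 90% saturation cutoff and the function names are my own assumptions for illustration, not part of any benchmark’s methodology, and GPQA is pegged at the upper end of its ~55-65% range.

```python
# Approximate top scores (percent) taken from the comparison above.
TOP_SCORES = {
    "MMLU": 90.0,
    "ARC": 95.0,
    "GSM8K": 95.0,
    "GPQA": 65.0,  # assumption: upper end of the ~55-65% range
    "General 365": 62.0,
}

# Hypothetical cutoff for calling a benchmark "saturated"; not a published standard.
SATURATION_THRESHOLD = 90.0


def headroom(top_score: float) -> float:
    """Percentage points between the best model's score and a perfect score."""
    return 100.0 - top_score


def is_saturated(top_score: float, threshold: float = SATURATION_THRESHOLD) -> bool:
    """True when the best score is at or above the chosen cutoff."""
    return top_score >= threshold


if __name__ == "__main__":
    for name, score in TOP_SCORES.items():
        label = "saturated" if is_saturated(score) else "still discriminating"
        print(f"{name}: {headroom(score):.0f} points of headroom ({label})")
```

By this rough measure, MMLU, ARC, and GSM8K count as saturated, while GPQA and General 365 still leave more than 30 points of headroom.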

MMLU’s collapse as a useful metric is particularly telling. When it launched in 2020, GPT-3 scored around 43%. Today, multiple models exceed 90%. Although that represents genuine progress in some areas, it also means MMLU can no longer tell a good model from a great one. It’s become a checkbox, not a challenge.

GSM8K tells a similar story. This math benchmark once seemed tough. Now models routinely solve 95% or more of its problems — and notably, researchers have shown that some of them are essentially memorizing solution patterns rather than understanding mathematics. This surprised me when I first dug into the research on it.

General 365 deliberately avoids these pitfalls. Because resistance to saturation is baked into its design, it has a good chance of staying useful for years rather than months. Gemini 3 Pro’s 62% top score supports the point: there’s still enormous room for improvement, which is exactly what you want from an evaluation tool.

Additionally, the cross-domain approach matters more than it might seem. MMLU tests knowledge breadth, GSM8K tests math, ARC tests science. General 365 tests whether a model can reason flexibly across all these areas at the same time. That’s a fundamentally harder challenge — and a much more honest one.

The 62% Ceiling: What Gemini 3 Pro’s Score Reveals

That 62% score deserves a closer look. And not for the reason you might expect.

Gemini 3 Pro is Google DeepMind’s frontier model — it represents billions of dollars in research investment and tops most existing benchmarks. Yet on General 365, it barely cleared 60%. I’ve tested dozens of AI evaluation setups over the years, and watching a top-tier model struggle this visibly on a well-designed benchmark is genuinely instructive.

This isn’t a failure of Gemini 3 Pro. It’s a success of benchmark design. When Meituan released General 365 as a rigorous new benchmark, they calibrated difficulty specifically to expose genuine reasoning limitations. The result tells us something important — and a little sobering — about where AI actually stands right now.

Specifically, the scores across all 26 tested models clustered in revealing ways:

  • Top tier (55–62%): Frontier models like Gemini 3 Pro, GPT-4 class models, and Claude 3.5 Sonnet
  • Mid tier (40–55%): Strong open-source models and slightly older commercial models
  • Lower tier (below 40%): Smaller models and older architectures
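The three tiers above translate into a simple bucketing function. Treat this as a sketch under assumptions: the reported ranges share their boundary values (55% and 40%), so I’ve assigned boundary scores to the higher tier, and the function name `tier` is hypothetical.

```python
def tier(score: float) -> str:
    """Map a General 365 score (percent) to the tier names used in this article.

    Assumption: scores exactly on a boundary (55 or 40) go to the higher tier.
    """
    if not 0.0 <= score <= 100.0:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score >= 55.0:
        return "top"
    if score >= 40.0:
        return "mid"
    return "lower"


if __name__ == "__main__":
    for example in (62.0, 48.0, 35.0):
        print(f"{example:.0f}% -> {tier(example)} tier")
```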

The compressed range at the top is the real kicker. It suggests that current scaling approaches (more data, more compute, more parameters) may be hitting diminishing returns for complex reasoning. Models that differ dramatically in size and training cost performed surprisingly similarly. That’s not what the “just scale it” crowd wants to hear.

Several failure patterns emerged from the initial assessment:

  1. Chain-of-reasoning breakdowns: Models started problems correctly but lost coherence across multiple steps
  2. Cross-domain transfer failures: Strong math performance didn’t carry over to logical reasoning tasks
  3. Ambiguity handling: Models struggled when problems required reading nuanced language carefully
  4. Novel problem structures: Unfamiliar question formats caused disproportionately large error rates

The 62% ceiling, then, isn’t just a number; it’s a roadmap. It shows where model architectures need to improve, and that’s precisely what a good benchmark should do. Few recent tests have been this specific about where the gaps actually are.

How Benchmarks Drive Model Development and Geopolitical Competition

Benchmarks aren’t academic exercises. They shape where companies invest billions of dollars, influence national AI strategies, and determine which capabilities get prioritized.

The benchmark-development feedback loop works like this: researchers create a test, companies optimize models to beat it, the test becomes saturated, and someone builds a harder one. With General 365, the cycle enters a new phase, and companies now have a concrete, honest target for improving complex reasoning.

This matters geopolitically. The AI race between the US and China increasingly plays out through benchmark performance. The National Institute of Standards and Technology (NIST) has stressed the importance of solid AI evaluation frameworks. Meanwhile, Chinese companies like Meituan, Alibaba, and Baidu are increasingly setting their own evaluation standards rather than deferring to Western ones.

Consider the strategic implications:

  • Benchmark creators set the agenda — by defining what “intelligence” means in measurable terms, they steer global research priorities
  • National prestige is genuinely at stake — countries want their models at the top of leaderboards
  • Funding follows scores — venture capital and government grants flow toward teams showing benchmark improvements
  • Standards emerge from benchmarks — today’s tests quietly become tomorrow’s regulatory requirements

The fact that a Chinese company created a benchmark where American and international frontier models struggle also sends a message. It shows that Chinese AI research has reached a level of sophistication where it can credibly evaluate, not just compete with, global frontier models. That’s a notable shift from even three years ago.

Nevertheless, benchmark-driven development has real downsides. Companies sometimes optimize narrowly for test performance rather than genuine capability. This is Goodhart’s Law in action: when a measure becomes a target, it stops being a good measure. General 365’s anti-contamination design tries to reduce this risk. No benchmark is immune to gaming forever, but Meituan’s approach makes gaming significantly harder than most benchmarks do.

The broader trend is unmistakable. AI evaluation is becoming more sophisticated, more international, and more consequential. When Meituan released General 365 as a rigorous new benchmark, they didn’t just create a test — they made a statement about who gets to define AI progress.

What General 365 Means for AI Developers and Enterprises

Look, if you’re building with AI professionally, this benchmark matters to you. Here’s why.

For AI developers, General 365 creates both challenges and real opportunities. Models that perform well here show genuine reasoning capability, which is exactly what enterprise customers need, even if they don’t always know to ask for it.

Think about real-world applications where complex reasoning genuinely matters:

  • Legal analysis: Reviewing contracts requires multi-step logical reasoning across domains
  • Medical diagnosis: Connecting symptoms to conditions demands cross-domain knowledge integration
  • Financial modeling: Evaluating investment scenarios involves handling ambiguity and uncertainty
  • Software architecture: Designing systems means reasoning about trade-offs across multiple constraints at once
  • Scientific research: Generating hypotheses demands novel problem-solving — not pattern recall

Current benchmarks don’t adequately test these capabilities. General 365 comes closer, because it demands the same multi-step, cross-domain reasoning. Consequently, performance here is likely to be a better predictor of real-world usefulness than a 90%+ MMLU score.

For enterprise buyers, General 365 offers a more honest assessment tool. When a vendor claims their model is “state of the art,” you can now ask a specific question: what’s their General 365 score? A model at 62% versus one at 45% represents a meaningful, practical capability difference — that distinction was invisible when everyone was scoring 90%+ on saturated benchmarks. Bottom line: you now have a sharper lens.

Practical recommendations for different stakeholders:

  1. AI researchers: Study General 365’s failure patterns to find the most promising research directions
  2. ML engineers: Use General 365 as a supplementary evaluation metric during model fine-tuning
  3. Product managers: Factor General 365 scores into model selection for reasoning-heavy applications
  4. CTOs and technical leaders: Push for multi-benchmark evaluation rather than relying on any single score
  5. Policymakers: Consider General 365-style evaluations when developing AI capability standards

Additionally, the benchmark highlights an important — and somewhat humbling — truth. We’re still far from artificial general intelligence. The best models in the world can’t solve roughly 4 out of every 10 problems on this test. That should meaningfully shape expectations and investment decisions alike.

Importantly, Meituan released General 365 as an open evaluation. This transparency benefits the entire ecosystem. Open benchmarks allow independent verification, support genuine competition, and speed up real progress. Closed evaluations, by contrast, can quietly hide weaknesses and inflate perceived capabilities, which, frankly, has happened more than once in this industry.

The Future of AI Benchmarking After General 365

General 365 represents a broader shift in how we think about AI evaluation. The era of simple, easily saturated benchmarks is ending. What comes next will be more demanding, more diverse, and — hopefully — more honest.

Several trends are converging here:

  • Dynamic benchmarks: Tests that update regularly to prevent memorization and contamination
  • Process evaluation: Scoring how models reason, not just whether they land on the right answer
  • Multi-modal challenges: Problems requiring integrated reasoning across text, images, code, and data
  • Adversarial testing: Questions deliberately designed to exploit known model weaknesses
  • Cultural and linguistic diversity: Tests that don’t implicitly assume Western, English-language knowledge as the baseline

Because Meituan released General 365 as a rigorous new benchmark with many of these principles already built in, it serves as a genuine template for future evaluation tools. Other organizations will follow — and they should. The competitive pressure to build better benchmarks is, somewhat ironically, one of the healthiest dynamics in AI research right now.

Moreover, the AI community is moving toward benchmark suites rather than single tests. No one benchmark captures everything that matters. The combination of MMLU for breadth, GSM8K for math, GPQA for graduate-level reasoning, and now General 365 for complex multi-domain reasoning creates a meaningfully more complete picture than any single score ever could.
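Here’s a rough sketch of why a suite tells you more than any single score. It computes the gap between the best and worst model on each benchmark, so you can see which tests actually separate models. Everything in it is illustrative: `model_a` and `model_b` are placeholders, the scores are made up (the 62 versus 45 pair simply echoes the example earlier in this article), and `spread_by_benchmark` is a hypothetical helper.

```python
from typing import Dict

# Hypothetical scores (percent) for two placeholder models; NOT real results.
SCORES: Dict[str, Dict[str, float]] = {
    "model_a": {"MMLU": 91.0, "GSM8K": 95.0, "General 365": 62.0},
    "model_b": {"MMLU": 90.0, "GSM8K": 94.0, "General 365": 45.0},
}


def spread_by_benchmark(scores: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Return max-minus-min score per benchmark, using only benchmarks every model ran."""
    if not scores:
        return {}
    per_model = list(scores.values())
    shared = set(per_model[0])
    for results in per_model[1:]:
        shared &= set(results)
    return {
        bench: max(r[bench] for r in per_model) - min(r[bench] for r in per_model)
        for bench in sorted(shared)
    }


if __name__ == "__main__":
    for bench, gap in spread_by_benchmark(SCORES).items():
        print(f"{bench}: {gap:.0f}-point gap between models")
```

With these placeholder numbers, the saturated benchmarks show a one-point gap while General 365 shows a 17-point gap, which is the kind of separation a suite is supposed to surface.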

The stakes keep rising. As AI systems take on more consequential tasks — medical decisions, legal judgments, financial trades — we need evaluation tools that genuinely test capability rather than just producing impressive-looking numbers. A model scoring 95% on an easy test but 45% on General 365 may not be ready for high-stakes deployment. That distinction matters enormously, and for a long time we didn’t have good tools to see it.

Some researchers argue we need to move beyond benchmarks entirely, pushing instead for evaluation through real-world task completion. That approach has real merit, but standardized benchmarks remain essential for fair, reproducible comparison across models. General 365 shows that well-designed benchmarks still carry tremendous value; they just need to be built with considerably more rigor than most have been.

Conclusion

When Meituan released General 365 as a rigorous new benchmark, they exposed an uncomfortable truth the AI industry has been quietly dancing around. Our best models aren’t nearly as capable as saturated benchmarks suggest. Even Gemini 3 Pro’s 62% score — the highest among 26 tested models — reveals specific reasoning limitations that matter for real-world deployment.

This benchmark matters for several reasons. It provides honest evaluation, drives research toward genuine reasoning improvement, and reshapes geopolitical AI competition in ways that will play out over years. Furthermore, it gives developers and enterprises a more reliable tool for assessing what models can actually do — not just what their marketing decks claim.

Here are your actionable next steps:

  1. Track General 365 scores alongside traditional benchmark results whenever you’re evaluating models
  2. Test your current AI implementations against complex, multi-step reasoning tasks — you might be surprised
  3. Avoid over-relying on any single benchmark — use multiple evaluation frameworks and triangulate
  4. Follow Meituan’s ongoing research for updated results and methodology insights as the benchmark matures
  5. Advocate for transparent, rigorous evaluation in your organization’s AI procurement process

The fact that Meituan released General 365 as a rigorous new benchmark is a genuine turning point — not just another press release. It raises the bar for what we expect from AI systems and reminds us that the gap between impressive demo performance and reliable real-world reasoning is still wide. Closing that gap is the real work ahead.

FAQ

What is General 365, and why did Meituan create it?

General 365 is a benchmark containing 365 carefully curated problems designed to test complex, multi-step reasoning in AI models. Meituan released General 365 as a rigorous new benchmark because existing tests like MMLU and GSM8K had become too easy for frontier models. Top models were scoring above 90% on older benchmarks, making it essentially impossible to tell them apart in any meaningful way. General 365 restores honest evaluation by testing genuine reasoning ability across multiple domains at once — not just isolated knowledge recall.

Why did Gemini 3 Pro only score 62% on General 365?

Gemini 3 Pro scored approximately 62% because General 365 tests fundamentally different capabilities than traditional benchmarks do. The problems require multi-step reasoning, cross-domain knowledge integration, and handling of real ambiguity — areas where even the most advanced models still genuinely struggle. Notably, this score was the highest among all 26 models tested, which suggests the benchmark is appropriately challenging rather than unfairly constructed.

Can companies game or cheat on the General 365 benchmark?

Meituan released General 365 as a rigorous new benchmark with specific anti-contamination measures built in from the start. The questions are original and weren’t publicly available before the benchmark’s release. However, no benchmark is completely immune to gaming over time — that’s just the nature of the field. As models train on more internet data, some test information may eventually leak into training sets. Meituan has designed safeguards against this, but the AI community will need to watch for contamination as the benchmark matures and gains wider adoption.

Does General 365 mean current AI models aren’t useful?

Absolutely not. Current AI models are remarkably capable for many practical tasks — they genuinely excel at text generation, translation, coding assistance, and information retrieval. General 365 specifically tests complex reasoning, which is one important dimension of intelligence, not the whole picture. A model scoring 62% on General 365 can still be incredibly valuable for a wide range of business applications. The benchmark simply highlights where further improvement is needed, particularly for high-stakes reasoning tasks where errors carry real consequences.

Where can I find the General 365 benchmark results and methodology?

Meituan has shared initial results through their research publications and AI community channels. For the most current information, check Meituan’s official technology blog and major AI research repositories like Papers With Code, which tracks benchmark results across the broader AI ecosystem. Additionally, the AI research community on platforms like X (formerly Twitter) and at academic conferences discusses new benchmark findings and methodology details regularly — worth following if you want to stay current.
