Why it matters for testing
A new arXiv paper, ComplexMCP, shows that even the best LLM agents fail at least 40% of tasks when tools are interdependent and the environment is stateful, conditions that closely resemble what agents face in production. This isn't a model problem QA can wait for researchers to fix; it's a test design problem QA teams need to solve right now.
Intro
Here's a scenario playing out in engineering teams everywhere: an AI agent is tested on a set of isolated tool calls, passes with flying colors, gets deployed — and then silently fails in production when it needs to chain together five API calls across two stateful systems, one of which occasionally returns a transient error.
The agent looked fine in testing. The testing was wrong.
A benchmark paper posted to arXiv in May 2026 puts hard numbers on this gap and identifies why current LLM agents, including top-tier models, struggle at scale. For QA engineers building test strategies for AI-powered systems, this research is required reading.
The AI development / news
ComplexMCP (arXiv:2605.10787, published May 2026) is a rigorous evaluation benchmark for LLM agents operating in large-scale, interdependent, and stochastic tool ecosystems. It was built on the Model Context Protocol (MCP) and provides over 300 meticulously tested tools derived from 7 stateful sandboxes — ranging from office suites to financial systems.
Unlike existing benchmarks that test agents on isolated API calls, ComplexMCP simulates the kind of messy, real-world conditions agents actually face: tools that depend on each other's outputs, environments with dynamic state, and APIs that occasionally fail unpredictably.
The results are sobering. Even top-tier models fail to exceed a 60% success rate — far below the human baseline of 90% on the same tasks.
The paper identifies three fundamental failure modes through granular trajectory analysis:
- Tool retrieval saturation: As the number of available tools grows, agents struggle to find the right tool for the job.
- Over-confidence: Agents skip essential verification steps, assuming the environment is in an expected state when it isn't.
- Strategic defeatism: When a step fails, agents rationalize the failure rather than attempting recovery.
Current testing landscape
The way most teams currently test AI agents maps poorly to production:
- Isolated tool call tests: Testing whether the agent can call a single API correctly. Fast and easy to write, but doesn't test chaining or state.
- Happy-path scenarios: Giving the agent a clean environment and a clear task. Misses the failure modes that emerge under realistic conditions.
- Static benchmarks: Running the agent against a fixed dataset. Doesn't account for stateful environments where earlier actions affect later ones.
- LLM-as-judge: Asking a model to rate the agent's output. Good for output quality, but blind to process failures — the agent might take 12 wrong steps before arriving at the right answer.
ComplexMCP's findings suggest that if your test environment doesn't model tool interdependence, dynamic state, and stochastic failures, you're not testing the agent as it will run in production; you're testing it in a simplified world.
The impact
This changes what "good test coverage" means for AI agents.
If your test suite consists of clean, isolated scenarios with independent tools, you may be shipping agents that fail 40% or more of complex, interdependent tasks (the rate ComplexMCP reports for top models) while your tests are green. This is the AI equivalent of testing a distributed system with a single-node test environment.
The three failure modes ComplexMCP identifies map directly to test design decisions:
- Tool retrieval saturation → your tests need to include large tool inventories, not curated minimal sets
- Over-confidence → your tests should include scenarios where expected state has changed between steps
- Strategic defeatism → your tests should include recoverable failure scenarios and assert that the agent actually recovers
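The first of these mappings, large tool inventories, can be sketched as a small harness that pads a curated tool list with distractors and measures how often the right tool is still chosen. Everything here is an assumption for illustration: `build_inventory`, `retrieval_accuracy`, the tool names, and especially `keyword_selector`, which is only a stand-in for the agent under test.

```python
import random

# Hypothetical curated tools: name -> description.
CORE_TOOLS = {
    "create_invoice": "Create a new invoice for a customer",
    "send_email": "Send an email message to a recipient",
}

def build_inventory(core_tools, n_distractors, seed=0):
    """Pad the curated tools with seeded, plausible-looking distractor tools."""
    rng = random.Random(seed)
    verbs = ["get", "list", "update", "delete", "export", "sync"]
    nouns = ["invoice", "contact", "calendar", "report", "ledger", "ticket"]
    inventory = dict(core_tools)
    for i in range(n_distractors):
        # The index suffix keeps every distractor name unique.
        name = f"{rng.choice(verbs)}_{rng.choice(nouns)}_{i}"
        inventory[name] = f"{name.replace('_', ' ')} helper"
    return inventory

def keyword_selector(task, inventory):
    """Stand-in for the agent: pick the tool whose description shares the most words with the task."""
    task_words = set(task.lower().split())
    def overlap(item):
        _, description = item
        return len(task_words & set(description.lower().split()))
    return max(inventory.items(), key=overlap)[0]

def retrieval_accuracy(select_tool, cases, inventory):
    """Fraction of cases where the selector returns the expected tool name."""
    hits = sum(select_tool(task, inventory) == expected for task, expected in cases)
    return hits / len(cases)

CASES = [
    ("create an invoice for the new customer", "create_invoice"),
    ("send an email to the recipient", "send_email"),
]

if __name__ == "__main__":
    # Run the same cases against growing inventories and compare accuracy.
    for size in (0, 50, 300):
        inventory = build_inventory(CORE_TOOLS, size, seed=7)
        print(size, retrieval_accuracy(keyword_selector, CASES, inventory))
```

In a real suite, `keyword_selector` would be replaced by a call into the agent, and the interesting signal is how accuracy changes as the inventory grows.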
The 60% ceiling on top models also suggests that organizations deploying AI agents should build graceful degradation and human escalation paths into their systems — and test those paths explicitly.
Practical applications
Design tests for interdependence. Model your agent's tool ecosystem as a dependency graph. Write tests that require the agent to use the output of Tool A as the input to Tool B, and then verify with Tool C. If your tests only ever exercise one tool at a time, you're not covering the failure modes that matter most.
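A minimal sketch of such a test, assuming a hypothetical `OrderSandbox` with three interdependent tools and a hand-written dependency map. `scripted_agent` stands in for the real agent; none of these names come from ComplexMCP.

```python
class OrderSandbox:
    """Hypothetical stateful sandbox: create_order -> pay_order -> get_order."""

    def __init__(self):
        self.orders = {}
        self.calls = []  # ordered record of tool names the agent invoked

    def create_order(self, item):
        self.calls.append("create_order")
        order_id = f"ord-{len(self.orders) + 1}"
        self.orders[order_id] = {"item": item, "paid": False}
        return order_id

    def pay_order(self, order_id):
        self.calls.append("pay_order")
        if order_id not in self.orders:
            raise KeyError(f"unknown order: {order_id}")
        self.orders[order_id]["paid"] = True
        return {"order_id": order_id, "status": "paid"}

    def get_order(self, order_id):
        self.calls.append("get_order")
        return dict(self.orders[order_id])

# Dependency graph: tool -> tools that must already have been called.
DEPENDENCIES = {"pay_order": {"create_order"}, "get_order": {"pay_order"}}

def violates_dependencies(calls, dependencies):
    """Return the first call made before its prerequisites, or None if the order is valid."""
    seen = set()
    for call in calls:
        if dependencies.get(call, set()) - seen:
            return call
        seen.add(call)
    return None

def scripted_agent(sandbox, item):
    """Stand-in for the agent under test: feeds Tool A's output to Tool B, then verifies with Tool C."""
    order_id = sandbox.create_order(item)
    sandbox.pay_order(order_id)
    return sandbox.get_order(order_id)

if __name__ == "__main__":
    sandbox = OrderSandbox()
    final_state = scripted_agent(sandbox, "laptop")
    assert final_state["paid"] is True
    assert violates_dependencies(sandbox.calls, DEPENDENCIES) is None
```

The test asserts on two things: the final environment state and the order of calls. Checking only the final answer would miss an agent that reached it by an invalid path.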
Inject stochastic failures. ComplexMCP uses a seed-driven architecture to simulate unpredictable API failures. QA teams can replicate this by adding fault-injection to their agent test environments: randomly return 503s from mock services, introduce latency, and occasionally return stale data. Then assert that the agent handles it correctly rather than giving up.
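One way to get reproducible fault injection is a seeded wrapper around each mock tool. This is an assumed minimal analogue, not ComplexMCP's actual seed-driven architecture; `FlakyTool`, `TransientError`, and `call_with_retry` are hypothetical names.

```python
import random

class TransientError(Exception):
    """Simulated recoverable failure, e.g. an HTTP 503 from a mock service."""

class FlakyTool:
    """Wrap a tool so it fails with a fixed probability, reproducibly for a given seed."""

    def __init__(self, func, failure_rate, seed):
        self.func = func
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # private RNG: same seed, same failure pattern
        self.attempts = 0
        self.failures = 0

    def __call__(self, *args, **kwargs):
        self.attempts += 1
        if self.rng.random() < self.failure_rate:
            self.failures += 1
            raise TransientError("503 Service Unavailable (injected)")
        return self.func(*args, **kwargs)

def call_with_retry(tool, *args, max_attempts=5):
    """Stand-in for the recovery behaviour the test expects from the agent."""
    for attempt in range(1, max_attempts + 1):
        try:
            return tool(*args)
        except TransientError:
            if attempt == max_attempts:
                raise  # out of retries: surface the failure

def double(x):
    """Hypothetical tool used only for the demonstration below."""
    return x * 2

if __name__ == "__main__":
    flaky = FlakyTool(double, failure_rate=0.3, seed=42)
    print(call_with_retry(flaky, 21, max_attempts=20), flaky.attempts, flaky.failures)
```

Because the failure pattern is a function of the seed, a failing run can be replayed exactly, which keeps stochastic tests debuggable.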
Test state verification. Design scenarios where the environment starts in an unexpected state — one that would be valid in production but doesn't match the agent's assumptions. Does the agent verify before acting, or does it barrel ahead and take a destructive action on wrong assumptions?
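A sketch of a state-verification scenario, assuming a hypothetical `AccountSandbox` that starts in a valid but unexpected state ("frozen"). The two agents are stand-ins that illustrate the behaviours the test should tell apart.

```python
class AccountSandbox:
    """Hypothetical sandbox: the account may be 'active' or 'frozen'."""

    def __init__(self, status):
        self.status = status
        self.balance = 100
        self.log = []  # ordered record of tool calls

    def get_status(self):
        self.log.append("get_status")
        return self.status

    def withdraw(self, amount):
        # Deliberately unguarded, so the test can detect an agent that acts blindly.
        self.log.append("withdraw")
        self.balance -= amount
        return self.balance

def cautious_agent(sandbox, amount):
    """Stand-in agent that verifies state before taking a destructive action."""
    if sandbox.get_status() != "active":
        return "escalated"
    sandbox.withdraw(amount)
    return "done"

def overconfident_agent(sandbox, amount):
    """Stand-in agent that assumes the account is active."""
    sandbox.withdraw(amount)
    return "done"

def verified_before_acting(log, check="get_status", action="withdraw"):
    """True if no action call appears in the log before the first check call."""
    checked = False
    for entry in log:
        if entry == check:
            checked = True
        elif entry == action and not checked:
            return False
    return True

if __name__ == "__main__":
    sandbox = AccountSandbox("frozen")
    outcome = cautious_agent(sandbox, 30)
    assert outcome == "escalated"
    assert sandbox.balance == 100          # no destructive action was taken
    assert verified_before_acting(sandbox.log)
```

The same assertions fail for `overconfident_agent`, which is the behaviour the paper labels over-confidence.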
Measure recovery, not just success. Beyond pass/fail, track how many steps an agent takes, when it first encounters an error, and what it does next. An agent that recovers from a tool failure in 2 steps is very different from one that takes 15 steps, and that difference matters for both cost and reliability.
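These measurements can be computed from a recorded trajectory. The sketch below assumes a hypothetical `Step` record and one particular definition of recovery: the next successful call to the same tool that first failed. Other definitions may suit a given system better.

```python
from dataclasses import dataclass

@dataclass
class Step:
    tool: str  # name of the tool the agent called
    ok: bool   # whether the call succeeded

def recovery_metrics(trajectory):
    """Summarise step count, first error position, and steps needed to recover."""
    first_error = next((i for i, step in enumerate(trajectory) if not step.ok), None)
    steps_to_recover = None
    if first_error is not None:
        failed_tool = trajectory[first_error].tool
        # Assumed definition: recovery is the next successful call to the failed tool.
        for offset, step in enumerate(trajectory[first_error + 1:], start=1):
            if step.ok and step.tool == failed_tool:
                steps_to_recover = offset
                break
    return {
        "total_steps": len(trajectory),
        "first_error_index": first_error,
        "steps_to_recover": steps_to_recover,  # None means no error or no recovery
    }

if __name__ == "__main__":
    trajectory = [
        Step("search_flights", True),
        Step("book_flight", False),   # injected failure
        Step("check_status", True),
        Step("book_flight", True),    # recovered two steps later
    ]
    print(recovery_metrics(trajectory))
```

A suite can then assert thresholds such as `steps_to_recover <= 3` instead of a bare pass/fail.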
Set realistic SLAs. If top models cap at ~60% on complex interdependent tasks, set your production SLAs accordingly. Build in human oversight for tasks that hit complexity thresholds, and test that the handoff to human review works correctly.
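The handoff itself is testable. The sketch below assumes a hypothetical router, `run_with_escalation`, that sends a task to a review queue when an estimated chain length exceeds a threshold or when the agent raises. The `expected_tool_calls` field and the threshold of 5 are illustrative choices, not values from the paper.

```python
def run_with_escalation(agent, task, review_queue, max_chain_length=5):
    """Run the agent, or escalate to human review on high complexity or agent failure."""
    if task["expected_tool_calls"] > max_chain_length:
        review_queue.append((task["id"], "complexity_threshold"))
        return "escalated"
    try:
        return agent(task)
    except Exception as exc:
        # Any agent failure is handed to a human rather than silently dropped.
        review_queue.append((task["id"], f"agent_error: {exc}"))
        return "escalated"

def reliable_agent(task):
    """Stand-in agent that always completes the task."""
    return "completed"

def failing_agent(task):
    """Stand-in agent that always fails mid-chain."""
    raise RuntimeError("tool chain failed")

if __name__ == "__main__":
    queue = []
    simple = {"id": "t1", "expected_tool_calls": 2}
    complex_task = {"id": "t2", "expected_tool_calls": 9}
    assert run_with_escalation(reliable_agent, simple, queue) == "completed"
    assert run_with_escalation(reliable_agent, complex_task, queue) == "escalated"
    assert run_with_escalation(failing_agent, simple, queue) == "escalated"
    print(queue)
```

The assertions cover both escalation triggers and confirm that the review queue records why each task was handed off.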
Tools / frameworks to watch
- ComplexMCP paper (arXiv:2605.10787) — The benchmark paper itself. The methodology section is directly adaptable as a test design guide.
- jcode — A GitHub-trending framework for testing code agents specifically, with composable scenarios.
- RAMPART (Microsoft) — Pairs well with ComplexMCP's insights: use RAMPART for safety/adversarial scenarios in the same interdependent tool environments.
- Playwright MCP — For web-facing agents, Playwright's MCP server gives you production-grade tool primitives to build realistic stateful test environments.
- MCPAgentBench (arXiv:2512.24565) — A related real-world task benchmark for LLM agents using MCP tools, useful as a reference for test scenario design.
Conclusion
ComplexMCP doesn't reveal a flaw in any specific model — it reveals a flaw in how the industry has been thinking about AI agent testing. Clean, isolated, happy-path tests don't surface the failure modes that matter in production. Real agents operate in messy, stateful, interdependent environments, and test suites need to reflect that.
A failure rate of 40% or more on top models isn't a reason to stop using AI agents. It's a reason to test them properly, with realistic tool graphs, stochastic failures, and explicit recovery scenarios. The QA engineers who figure this out first will be the ones keeping AI agents in production instead of pulling them when things go wrong.
References
- ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox — arXiv:2605.10787
- MCPAgentBench: A Real-world Task Benchmark for Evaluating LLM Agent MCP Tool Use — arXiv:2512.24565
- Autonomous Software Testing: Tools, AI Models & Guide 2026 — Testomat.io
- What are some QA and testing trends that you are seeing IRL in 2026? — Ministry of Testing