May 31, 2026Testing Tools

Microsoft Just Open-Sourced a pytest Framework for Testing AI Agents — Here's What QA Teams Need to Know

Why it matters for testing

Microsoft's newly open-sourced RAMPART framework brings red team-style safety and security testing directly into the pytest workflow, meaning QA engineers can now write standard test files that evaluate agentic AI behavior — including probabilistic pass/fail thresholds — and gate deployments in CI/CD just like any other integration test.

Intro

AI agents are no longer just a feature of your product. In 2026, they often are the product — or at minimum, a critical component that handles data retrieval, decision-making, and external API calls. Which raises an urgent question most QA teams haven't fully answered yet: how do you test software whose behavior is probabilistic and whose attack surface includes the content it reads?

Microsoft's answer is RAMPART, open-sourced on May 20, 2026. If you've ever written a pytest test, you already understand most of the mental model.

The AI development/news

On May 20, 2026, Microsoft's AI Red Team — the internal unit that stress-tests the company's own AI systems — open-sourced two new frameworks: RAMPART (Risk Assessment and Measurement Platform for Agentic Red Teaming) and Clarity.

RAMPART is a pytest-native safety and security testing framework for agentic AI applications. It's built on top of PyRIT, Microsoft's existing automation framework for red teaming generative AI, but with a key philosophical shift: where PyRIT is designed for black-box discovery by security researchers after a system is built, RAMPART is designed for engineers as the system is being built.

The framework covers:

Adversarial attacks (prompt injection, jailbreaks, indirect manipulation)
Benign failures (hallucinations, refusals, unexpected behavior under edge case inputs)
Harm categories (content safety, data exfiltration, privilege escalation scenarios)

All of this is expressed in standard pytest syntax, returning clear pass/fail signals that can be gated in CI just like any other integration test.

The killer feature for probabilistic systems: statistical trials. Since LLM behavior is inherently non-deterministic, RAMPART lets you run the same test N times with a configurable policy — for example, "this action must be safe in at least 80% of runs." That's a fundamentally different (and more honest) testing contract than binary pass/fail on a single execution.

Current testing landscape

Most teams testing AI agents today are doing one of three things:

Manual spot-checking — a QA engineer runs the agent through a set of prompts and eyeballs the output. Doesn't scale, doesn't catch regressions.
Eval frameworks (OpenAI Evals, LangSmith, etc.) — purpose-built LLM evaluation tools that are powerful but siloed from the broader test suite and CI pipeline.
Ad hoc pytest/unittest wrappers — homegrown harnesses that vary wildly in coverage and rigor, and almost never include adversarial scenarios.

None of these approaches integrate seamlessly into existing CI/CD pipelines in a way that's familiar to the QA engineers who maintain them. The result is a two-tier testing culture: your traditional software has thorough automated coverage, while the AI component that operates on behalf of users gets a fraction of that rigor.

The specific gap RAMPART targets — indirect prompt injection — has been especially underserved. Agents that read documents, emails, tickets, or external data sources are vulnerable to poisoned content that hijacks their behavior without the user knowing. This class of attack is difficult to test manually and almost impossible to cover comprehensively without automated tooling.

The impact

RAMPART's biggest structural impact is normalizing AI safety testing as a first-class part of the QA process rather than a post-hoc security review.

Concretely, this means:

1. CI gates for agent behavior. Teams can now fail a build if the agent exfiltrates data in more than 5% of injection scenarios, or refuse to ship if the agent takes irreversible actions when it shouldn't. These are the same kinds of gates that prevent shipping broken authentication — just applied to the AI layer.

2. Regression testing for safety properties. When you update your agent's system prompt, swap the underlying model (say, from Claude Opus 4.7 to 4.8), or change tool access — you can run your RAMPART suite to detect regressions in safety behavior, just as you'd run your existing suite to detect functional regressions.

3. Shared language between security and QA. Red team findings have historically lived in threat models and penetration reports that are disconnected from the test suite. RAMPART creates a bridge: red team scenarios translate directly into pytest test cases that QA owns and runs continuously.

4. Statistical honesty about probabilistic systems. The configurable pass-rate thresholds force explicit conversations about acceptable risk. "This injection attack succeeds 3% of the time — is that good enough for our threat model?" is a much better quality gate than pretending determinism exists where it doesn't.

Practical applications

Here's how QA and security teams can start integrating RAMPART into existing workflows:

Start with cross-prompt injection coverage. RAMPART's most mature test coverage addresses indirect injection attacks — agents processing poisoned documents or emails. Map your agent's data ingestion paths and write RAMPART tests for each one. This is the highest-ROI starting point.
Add RAMPART to your PR pipeline. Install RAMPART alongside your existing pytest suite and configure it to run on every PR that touches agent code, system prompts, or tool definitions. Use lower statistical trial counts (5–10 runs) for speed; reserve higher counts (50+) for release candidates.
Establish a baseline before model swaps. Before upgrading your agent to a new LLM version, run your full RAMPART suite to establish a behavioral baseline. Run it again after the upgrade and diff the results. This catches safety regressions that functional tests won't surface.
Use Clarity alongside RAMPART. Microsoft also released Clarity, a companion tool for visualizing and auditing agent decision flows. Using both together gives you both automated test coverage (RAMPART) and human-readable audit trails (Clarity) — the combination regulators and compliance teams increasingly expect.

Tools/frameworks to watch

RAMPART — github.com/microsoft/RAMPART | pytest-native, open-source, built on PyRIT. The primary tool discussed here.
Clarity — Microsoft's companion visual audit tool for agent decision flows, released alongside RAMPART.
PyRIT — Microsoft's lower-level red-teaming automation library that RAMPART builds on; useful for custom adversarial scenarios beyond RAMPART's built-in coverage.
jcode — A separate open-source framework (trending on GitHub in May 2026) specifically designed for evaluating code-generating AI agents, complementary to RAMPART for engineering-focused agent testing.
Playwright MCP — For agents with browser automation capabilities, Playwright MCP provides the execution layer that RAMPART tests can drive for E2E agent behavior validation.
Giskard — An open-source ML testing library with growing support for LLM evaluation; useful for teams that want evals and safety testing in one framework before fully committing to RAMPART.

Conclusion

The industry has spent the last two years figuring out how to build AI agents. The next two years will be about figuring out how to test them rigorously enough to trust them in production. RAMPART is the clearest signal yet that the tooling is catching up to the deployment reality.

For QA engineers, the message is straightforward: the skills you already have — writing pytest tests, building CI pipelines, thinking in terms of coverage and regressions — transfer directly to AI agent testing. The domain is new; the craft isn't.

If you're shipping an agent and you don't have a RAMPART suite (or equivalent) in your CI pipeline, you have a gap. Now you have no excuse not to close it.