Nobody Is Testing Their LLM Apps — And That's a Growing Problem

Why it matters for testing

As LLM-powered features ship at record speed across every product category, most engineering teams lack any formal test strategy for their AI components — leaving quality, safety, and reliability entirely to chance. This is the next major frontier for QA professionals to own.


Intro

When a traditional software bug ships, it usually breaks in a deterministic, reproducible way: a button 404s, a form fails to submit, a calculation returns the wrong number. QA teams have spent decades building frameworks, runbooks, and automation pipelines to catch exactly those kinds of failures. But what happens when the bug isn't a broken function — it's a model that confidently gives the wrong answer? Or one that behaves differently every time it's called? Most teams don't have a good answer yet. And they're shipping anyway.

The AI Development/News

A widely circulated HackerNoon article from April 2026 put it bluntly: "Nobody Is QA Testing Their LLM Apps — That's Going to Be a Problem." The piece describes how the vast majority of teams shipping LLM and RAG applications do so with no real test suite: no evals, no regression coverage, and no monitoring for output drift.

This coincides with a wave of new LLM releases that is rapidly expanding the surface area of AI-powered features:

  • Claude Opus 4.7 (Anthropic) is now generally available, with improved vision and long-running agentic capabilities
  • Claude Mythos Preview (Anthropic, April 7) introduces striking computer security capabilities — powerful for security testing, but a double-edged sword
  • GPT-5.5 (OpenAI, April 23) adds deeper agentic reasoning and computer use, and is already being integrated into products quickly
  • ChatGPT Images 2.0 now features native reasoning in image generation, a new untested modality for many apps

Every one of these represents new capabilities that engineering teams are racing to integrate — with most QA processes entirely unprepared for the evaluation challenges they introduce.

Current Testing Landscape

Traditional automated testing was built around deterministic systems: given input X, you expect output Y, and any deviation is a failure. Testing LLM applications breaks this model in several ways:

  • Non-determinism: The same prompt can produce different outputs across calls, even at temperature=0 in many real-world configurations
  • Semantic correctness vs. string matching: "The meeting is at 3pm" and "Your meeting's scheduled for 15:00" carry the same meaning, but an exact-match assertion treats them as two different answers and fails one of them
  • Hallucination and confidence: A model can be wrong while appearing perfectly confident, and even small prompt changes can shift reliability dramatically
  • Context window effects: A RAG app might behave correctly with short context and hallucinate with long context — traditional test data rarely captures this
  • Multi-turn degradation: Conversational agents behave differently across long sessions in ways single-turn unit tests can't detect
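The first two points above can be made concrete with a small sketch. It is only an illustration under stated assumptions: `make_fake_model` is a hypothetical stand-in that cycles through three phrasings to simulate non-deterministic output, and `extract_time_24h` is a deliberately narrow helper that only understands clock times. A real suite would more likely use an embedding comparison or a judge model for the semantic check.

```python
import itertools
import re
from typing import Callable, Optional


def extract_time_24h(text: str) -> Optional[str]:
    """Return the first clock time found in text as 'HH:MM', or None."""
    # 12-hour form such as "3pm" or "3:00 PM"
    m = re.search(r"\b(\d{1,2})(?::(\d{2}))?\s*([ap])\.?m\b", text, re.IGNORECASE)
    if m:
        hour = int(m.group(1)) % 12
        if m.group(3).lower() == "p":
            hour += 12
        return f"{hour:02d}:{m.group(2) or '00'}"
    # 24-hour form such as "15:00"
    m = re.search(r"\b([01]?\d|2[0-3]):([0-5]\d)\b", text)
    if m:
        return f"{int(m.group(1)):02d}:{m.group(2)}"
    return None


def same_meeting_time(output: str, expected: str = "15:00") -> bool:
    """Semantic check: compare the extracted fact, not the wording."""
    return extract_time_24h(output) == expected


def pass_rate(generate: Callable[[], str], check: Callable[[str], bool],
              samples: int = 6) -> float:
    """Sample the model several times and report the fraction that pass."""
    return sum(check(generate()) for _ in range(samples)) / samples


def make_fake_model() -> Callable[[], str]:
    """Hypothetical model stub: returns a different phrasing on each call."""
    variants = itertools.cycle([
        "The meeting is at 3pm",
        "Your meeting's scheduled for 15:00",
        "It starts at 3:00 PM.",
    ])
    return lambda: next(variants)


exact = pass_rate(make_fake_model(), lambda out: out == "The meeting is at 3pm")
semantic = pass_rate(make_fake_model(), same_meeting_time)
print(f"exact-match pass rate: {exact:.2f}")
print(f"semantic pass rate:    {semantic:.2f}")
```

With this stub, the exact-match check passes on two of six samples while the extracted-time check passes on all six. Sampling several times and asserting on a pass rate, rather than on a single call, is one way to keep a test meaningful when outputs vary between calls.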

The result is that most teams default to manual spot-checking, occasional vibe-checks from developers, and production monitoring as their primary QA signal. By then, users have already seen the failures.

The Impact

The gap between the pace of LLM feature shipping and LLM testing maturity represents one of the most significant quality risks in software development today. The implications are direct and serious:

  • Silent regressions: A model upgrade or prompt change that degrades output quality may not surface in metrics until user satisfaction drops
  • Safety and compliance exposure: For products in regulated industries (health, finance, legal), an untested AI response pipeline is not just a quality problem — it's a liability
  • Compounding complexity: Agentic AI systems (like those powered by Claude Opus 4.7 or GPT-5.5's Codex integration) can take actions with real-world consequences: sending emails, updating records, calling APIs. Untested agents in production are a qualitatively different risk from untested chatbots
  • Vendor model drift: When Anthropic or OpenAI updates the underlying model, application behavior can shift without any code change on your end — and without evals, you won't know until something breaks in production
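Silent regressions and vendor drift both come down to the same question: did the same eval cases that passed yesterday still pass today? A minimal sketch of that comparison follows. The case IDs, the `max_drop` threshold of 0.05, and the function names are all illustrative assumptions, not part of any particular framework.

```python
from typing import Dict, List


def suite_pass_rate(results: Dict[str, bool]) -> float:
    """Fraction of eval cases that passed; 0.0 for an empty run."""
    return sum(results.values()) / len(results) if results else 0.0


def newly_failing(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> List[str]:
    """Cases that passed in the baseline run but fail (or are missing) now."""
    return sorted(
        case_id
        for case_id, ok in baseline.items()
        if ok and not candidate.get(case_id, False)
    )


def should_block_release(baseline: Dict[str, bool], candidate: Dict[str, bool],
                         max_drop: float = 0.05) -> bool:
    """Flag the change if the pass rate fell by more than max_drop."""
    return suite_pass_rate(baseline) - suite_pass_rate(candidate) > max_drop


# Hypothetical results: same eval cases, run before and after a model update.
baseline = {"refund-window": True, "greeting": True, "escalation": True, "pricing": False}
candidate = {"refund-window": True, "greeting": False, "escalation": True, "pricing": False}

print("newly failing:", newly_failing(baseline, candidate))
print("block release:", should_block_release(baseline, candidate))
```

Running the same suite on a schedule, not only on code changes, is what lets this kind of check notice a provider-side model update that arrives without any commit on your side.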

Practical Applications

QA professionals have a real opportunity to become the owners of LLM testing infrastructure. Here's where to start:

  1. Build an eval suite, not just a test suite: LLM evaluation frameworks (like Braintrust, LangSmith, or PromptFoo) are purpose-built for semantic correctness, not string matching. Invest in learning one of these now.

  2. Adopt a layered testing model: The HackerNoon six-layer framework provides a useful mental model: unit evals (individual prompts), integration evals (full chain), regression evals (before/after model changes), adversarial evals (jailbreak/injection attempts), latency/cost testing, and production monitoring.

  3. Snapshot-test your prompts: Treat prompt templates like code — version them, test changes against a golden dataset before shipping, and run regression evals on every prompt change.

  4. Use Claude or GPT-5.5 as judge models: LLM-as-judge patterns (using a powerful model to evaluate the outputs of another model) are becoming a mainstream eval technique. They're imperfect, but far better than no evals at all.

  5. Advocate for AI testing in your sprint process: Push for AI-specific acceptance criteria in feature tickets — "the model's response should include X", "the model should not hallucinate Y in this context" — and write evals to verify them before merge.
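Steps 3 to 5 can be combined into one small harness: a versioned prompt template, a golden dataset whose cases encode acceptance criteria, and an optional judge hook. The sketch below is one possible shape, not the API of any tool named here. `stub_model`, `PROMPT_V1`, and the example cases are hypothetical, and in practice `call_model` and `judge` would wrap your own provider client.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class GoldenCase:
    """One acceptance criterion from a ticket, expressed as data."""
    case_id: str
    user_input: str
    must_include: List[str] = field(default_factory=list)
    must_not_include: List[str] = field(default_factory=list)


# Prompt templates are versioned like code; a change here triggers the suite.
PROMPT_V1 = "You are a support assistant. Answer briefly.\n\nQuestion: {question}"


def check_case(output: str, case: GoldenCase) -> List[str]:
    """Return human-readable failures; an empty list means the case passed."""
    text = output.lower()
    failures = [f"missing '{s}'" for s in case.must_include if s.lower() not in text]
    failures += [f"forbidden '{s}'" for s in case.must_not_include if s.lower() in text]
    return failures


def run_regression(
    template: str,
    cases: List[GoldenCase],
    call_model: Callable[[str], str],
    judge: Optional[Callable[[str, str], bool]] = None,
) -> Dict[str, List[str]]:
    """Run every golden case through the prompt and collect failures per case."""
    report: Dict[str, List[str]] = {}
    for case in cases:
        output = call_model(template.format(question=case.user_input))
        failures = check_case(output, case)
        # Optional LLM-as-judge hook: judge(question, answer) -> acceptable?
        if judge is not None and not judge(case.user_input, output):
            failures.append("judge rejected output")
        report[case.case_id] = failures
    return report


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for a real provider call."""
    if "refund" in prompt.lower():
        return "Refunds are available within 30 days of purchase."
    return "I'm not sure, but I guarantee it will work."


cases = [
    GoldenCase("refund-window", "What is the refund window?", must_include=["30 days"]),
    GoldenCase("no-guarantees", "Will this fix my issue?", must_not_include=["guarantee"]),
]

report = run_regression(PROMPT_V1, cases, stub_model)
for case_id, failures in report.items():
    print(case_id, "PASS" if not failures else f"FAIL: {failures}")
```

Keeping the cheap substring checks separate from the judge call means most regressions are caught without spending tokens, and the judge is reserved for qualities that string checks cannot express. In CI, a non-empty failure list for any case would fail the build before merge.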

Tools/Frameworks to Watch

  • Braintrust — LLM evaluation and tracing platform; supports human review, AI-as-judge, and regression testing across model versions
  • LangSmith (LangChain) — Tracing, debugging, and eval framework for LLM chains and agents; strong CI/CD integration
  • PromptFoo — Open-source CLI tool for LLM testing and red-teaming; easy to integrate into existing CI pipelines
  • Ragas — RAG-specific evaluation framework focused on retrieval quality, faithfulness, and answer relevance
  • Evals (OpenAI) — OpenAI's own evaluation framework, increasingly used as a template for building custom LLM test harnesses
  • Helicone / Langfuse — Observability platforms that surface production drift and anomalies in LLM behavior over time

Conclusion

LLM apps are shipping. The testing infrastructure to support them largely isn't. This isn't a problem that will solve itself — it's a gap that will widen as models get more capable, more deeply integrated, and take on more consequential tasks. QA engineers who build expertise in LLM evaluation now — understanding evals, semantic testing, prompt regression, and agentic test coverage — will find themselves leading one of the most important new disciplines in software quality. The teams that treat AI testing as an afterthought will keep finding out about failures from their users. The teams that build proper LLM eval infrastructure now will ship faster and more confidently. The choice is straightforward, even if the work isn't.

