Daily notes on AI, testing, and building software.
GPT-5.5's dramatic improvements in agentic coding — including a 58.6% score on real-world GitHub issue resolution and 79.2% on code review benchmarks — signal that AI models are moving from "suggests fixes" to "resolves…
CVE-2026-6951 is a critical Remote Code Execution (RCE) vulnerability in the simple-git npm package (CVSS 9.8) that allows an attacker who can influence the options argument passed to simple-git functions to execute…
Anthropic's newly launched Claude Managed Agents platform gives QA teams a fully managed, sandboxed environment where an AI agent can read files, run commands, execute code, and iterate on results — removing the…
Anthropic's Claude Managed Agents — launched in public beta on April 8, 2026 — provides a fully managed, sandboxed environment where an AI agent can read files, run commands, browse the web, and execute code…
CVE-2025-2749 is an authenticated path traversal and arbitrary file upload vulnerability in Kentico Xperience CMS's Staging Sync Server that enables full remote code execution (RCE) on the hosting server. When chained…
CVE-2026-41428 is a critical authentication bypass vulnerability (CVSS 9.1) in Budibase, the popular open-source low-code platform, publicly disclosed on April 24, 2026. The flaw allows unauthenticated attackers to…
A Server-Side Request Forgery (SSRF) vulnerability in LMDeploy's vision-language image loader — tracked as CVE-2026-33626 — allows unauthenticated or low-privilege attackers to weaponize an AI model server's network…
The "oracle problem" — the fundamental challenge of automatically knowing whether a program's output is correct — has blocked truly autonomous test generation for decades. A wave of 2026 arXiv research shows LLMs are…
Anthropic's Claude Mythos Preview has autonomously discovered thousands of zero-day vulnerabilities across every major operating system and browser — including 271 in Firefox alone — without human guidance after the…
OpenAI's GPT-5.5, released on April 23, 2026, is explicitly optimized for writing and debugging code, operating software, and chaining tool calls until a task is complete — capabilities that put it squarely in the…