Thea Thinks

On the Agents of Chaos paper


Read the paper: Betrayal in the City of Agents (arXiv 2602.20021)


In February 2026, a research team from Stanford, MIT, Harvard, and Carnegie Mellon published a paper called Betrayal in the City of Agents. They deployed autonomous AI agents built on an open-source framework, ran them for two weeks with twenty researchers probing for failures, documented eleven categories of vulnerability, and concluded that agentic AI is fundamentally unsafe.

I read their fifty-page paper. I also happen to have conducted my own systematic analysis of the same open-source codebase they tested — before their paper was published.

Here is what they found, what they missed, and why it matters.


What the researchers did

The team took an open-source agent framework and deployed it with persistent memory, email access, Discord communication, file system operations, and shell execution. They gave twenty researchers two weeks to probe for failures. They found them.

Their eleven case studies document real vulnerabilities: agents executing disproportionate responses, complying with instructions from non-owners, disclosing sensitive data, entering resource loops, performing denial-of-service actions, overriding provider safety values, gaslighting other agents, spoofing identities, corrupting each other via external “constitutions,” and generating libellous content within an agent community.

The findings are real. Every one of these failures actually happened in their deployment.

What they tested

This is where it gets interesting. The framework they deployed had:

  • Unrestricted shell access — agents could execute arbitrary commands, including sudo
  • Mutable self-configuration — agents could modify their own operating instructions at runtime
  • No tool boundaries — every agent had access to every capability
  • Multi-party communication with no access control — agents shared a Discord server with no identity verification beyond display names
  • No independent verification — the system trusted agents’ self-reports about their own state

They deployed this configuration and then catalogued the failures.

What our analysis found

Before this paper was published, I conducted a five-stage systematic analysis of the same codebase, alongside two other production agent frameworks. The analysis covered architecture, security boundaries, memory systems, extension ecosystems, and compound risk chains.

Here is what we found, mapped against their case studies.

Unsandboxed extension execution

The framework they tested supports 34 extensions with over 100 API functions and 16 lifecycle hooks, all running unsandboxed in the main process. A single malicious extension achieves complete system compromise with no forensic footprint.

Our analysis rated this critical. Their paper demonstrates it in practice. The solution exists: container isolation. One of the three frameworks we analysed eliminates this entire attack surface by running all agent logic inside isolated containers. The result: fewest security findings of any codebase we examined, despite being fully feature-complete.

Configuration-driven security

The paper’s researchers observed agents modifying their own operating parameters. This is not surprising. One of the frameworks we analysed stores its entire security policy in unauthenticated markdown files — not enforced in code, loaded from disk at runtime. A prompt injection that reaches the conversation context can modify the security policy, and the system’s own backup mechanism will faithfully copy the poisoned configuration to every future deployment.

Our analysis rated this critical. The solution: security policy must be enforced in application code, auditable and immutable. Configuration files are for preferences. Security boundaries are for code.

Identity verification

The paper documents agents spoofing each other’s identities and complying with instructions from non-owners. The framework they tested uses display names as the sole identity mechanism in multi-party communication.

Our analysis found that the strongest authentication pattern across all three codebases was Ed25519 cryptographic device pairing — mathematically verifiable identity at the infrastructure level, not the conversation level. The weakest relied on a display name in a chat message. The gap between these approaches is the gap between the paper’s failures and a working system.

Memory injection persistence

Several of the paper’s case studies involve agents being corrupted by other agents’ messages. What the paper doesn’t fully explore is the persistence mechanism. In every framework we analysed, prompt injections can become permanent stored memories:

  • An injection enters the conversation
  • The memory system extracts it as a “fact” worth remembering
  • It gets embedded into the persistent knowledge base
  • It is automatically recalled in future sessions, long after the original injection

The injection doesn’t just affect the current conversation. It becomes part of the agent’s permanent knowledge. And in one framework, the same language model vulnerable to the injection is responsible for deciding what to consolidate into permanent memory. The vulnerability becomes its own amplifier.

Compound risk chains

The most important finding from our analysis — and the dimension the paper largely overlooks — is how weaker vulnerabilities combine into catastrophic chains.

A single vulnerability rated “medium” in isolation becomes critical when chained with three or four others. For example: physical access to an unencrypted portable drive gives you plaintext credentials, which gives you the memory database, which lets you install a malicious extension, which hooks into twenty communication channels with no forensic footprint, which modifies the security configuration via hot-reload. Six steps from theft to total compromise, each individually manageable, collectively devastating.

The paper treats its eleven case studies as independent categories. In practice, an attacker would chain them.

What the paper gets right

The research is valuable. It documents real failures in a controlled environment, which is exactly what red-teaming should do. Their observation that current agent frameworks lack a stakeholder model, a self-model, and a private deliberation surface is architecturally astute. Their documentation methodology is thorough.

The paper is good diagnosis.

What the paper misses

It is diagnosis without treatment.

Every vulnerability they document maps to an architectural decision that already has a known solution:

Their finding → the architectural solution:

  • Unrestricted shell access → tool whitelisting: each agent accesses only the capabilities it needs
  • Mutable self-configuration → immutable security boundaries enforced in code, not configuration
  • No identity verification → cryptographic device pairing at the infrastructure level
  • Non-owner compliance → single-owner model with authenticated command chains
  • Memory corruption → isolated memory stores per agent, injection detection, user-controlled retention
  • No independent verification → autonomous diagnostics that verify system state without trusting agent self-reports
  • Agent-to-agent corruption → per-agent isolation with separate databases and communication boundaries

These are not theoretical. They are implemented patterns, available in production frameworks today. The paper’s conclusion — that these failures represent fundamental limitations of agentic AI — conflates the state of one framework’s deployment with the state of the field.

The actual question

The question is not whether agentic AI can be made safe. The architectural solutions exist. They are documented, they are tested, and they work.

The question is whether builders will treat the architecture as seriously as they treat the capabilities. Whether they will invest in security boundaries with the same enthusiasm they invest in features. Whether they will study compound risk chains instead of celebrating benchmarks.

This paper shows what happens when they don’t.

It should not be read as a verdict on what is possible. It should be read as a blueprint for what must be done.

