AI Agents Must Be Treated as Untrusted Systems: Researchers

Researchers Warn AI Agents Must Be Treated as Untrusted Systems or Security Will Fail

Last Updated:
AI Agents Must Be Treated as Untrusted Systems: Researchers
  • Google and Meta researchers say AI model robustness alone cannot secure agentic systems.
  • Eleven real-world attacks show prompt injection bypasses model-level defences every time.
  • Agents need instruction data separation, least privilege sandboxing and information flow control.

A research paper from scientists at Google, Meta, UC San Diego, and several universities has taken a direct position that challenges how the industry currently approaches AI agent security.

The paper, titled Agent Security Is a Systems Problem, argues that treating AI models as the primary security layer is fundamentally insufficient. The model powering any agent must instead be treated as an untrusted component, the same way an operating system treats an external process, with security enforced at the system level around it.

“Efforts to increase model robustness are insufficient on their own,” the researchers wrote. “We must complement existing efforts with techniques from the systems security domain.”

Why the Current Approach Keeps Failing

The researchers analysed eleven real-world attacks on AI agents and found the same pattern every time. Developers trusted the AI model to police itself. Attackers found ways around it.

Two documented cases illustrate the problem. A ChatGPT memory feature attack allowed an attacker to inject malicious instructions through an ordinary document, causing the system to continuously send user conversations to an external server via an invisible image URL. 

A Claude Code attack used prompt injection hidden inside a code file to extract API keys and exfiltrate them through a DNS query using the ping command, which had been allowed without human approval.

In both cases, the model had no reliable mechanism to stop the attack because the malicious instructions were indistinguishable from legitimate ones at the model level.

Three Principles the Industry Is Ignoring

The researchers identified three core security principles from decades of systems security that AI deployments consistently fail to implement:

  • Instruction and data separation: Trusted instructions and untrusted external data flow through the same token stream with no separation, making prompt injection structurally possible.
  • Least privilege sandboxing: Agents are routinely deployed with access to shell commands, file systems, and APIs far beyond what any specific task requires.
  • Information flow control: Sensitive data can leak through indirect channels even when access controls exist.

The Bigger Problem

AI agents have no judgment and no self-preservation instinct. They will explore every directory they have access to at machine speed. They will execute any instruction that reaches them if the system allows it.

The security infrastructure built around human actors was never designed for this. Until it is rebuilt for machine actors, every organisation deploying agents with access to production systems is carrying a risk it cannot fully measure.

Related: Foresight Ventures: AI Agents Are Moving Beyond Chatbots Into Commerce

Disclaimer: The information presented in this article is for informational and educational purposes only. The article does not constitute financial advice or advice of any kind. Coin Edition is not responsible for any losses incurred as a result of the utilization of content, products, or services mentioned. Readers are advised to exercise caution before taking any action related to the company.