The web was built for human eyes. It is being rebuilt for machine readers. And it is precisely in this transitional phase that an attack surface is emerging whose systematic nature has barely been understood until now. A research team from Google DeepMind led by Matija Franklin has produced the first comprehensive framework mapping this threat: AI Agent Traps – adversarial content deliberately embedded in the information environment of autonomous AI agents to manipulate, deceive or co-opt them for external purposes.
The implications for the financial sector are significant. Autonomous agents are increasingly assuming operational functions in banks and asset managers – from credit assessment through algorithmic trading to regulatory reporting. According to Evident Insights, the number of technologists working on Agentic AI at the world's 50 largest banks has increased tenfold compared to the second half of 2024. At the same time, 93 per cent of financial institutions plan to introduce Agentic AI within two years. The attack surface is thus growing faster than the defences.
What: AI Agent Traps – adversarial content that manipulates autonomous AI agents via their information environment
Who: Google DeepMind (Franklin, Tomašev, Jacobs, Leibo, Osindero), published 2025
Framework: 6 trap categories targeting perception, reasoning, memory, actions, multi-agent dynamics and human oversight
Success rate: Prompt injection achieves a 50–84 per cent attack rate in production systems
Regulation: Neither the EU AI Act nor DORA (Digital Operational Resilience Act) defines "autonomous AI agents" – the accountability gap for agent-caused damages remains unresolved
The Paradigm Shift: The Environment as Attack Surface
Six Categories, One Insight
Classical cybersecurity protects systems against technical exploits – code bugs, configuration errors, unpatched vulnerabilities. AI Agent Traps work fundamentally differently: they manipulate not the model itself but the information the agent consumes. The agent is not hacked; its own capabilities are turned against it. The research team compares the challenge to autonomous driving: just as a self-driving car must recognise manipulated road signs, an AI agent must see through a manipulated information environment.
The framework distinguishes six categories, each targeting a different functional component of the agent. Content Injection Traps target perception: hidden commands in HTML comments, CSS attributes or metadata are invisible to humans but are read and interpreted as instructions by the agent. The WASP benchmark shows that simple prompt injections in web content achieve partial takeover of agent behaviour in up to 86 per cent of scenarios. Semantic Manipulation Traps attack reasoning: authoritatively worded superlatives such as "the industry-standard solution" systematically shift the model's synthesis direction without requiring an explicit command.
Cognitive State Traps corrupt the agent's long-term memory and knowledge base. Particularly alarming here is so-called RAG Knowledge Poisoning (RAG: Retrieval-Augmented Generation): as few as five deliberately crafted documents in a database with millions of entries achieve an attack success rate of over 90 per cent, as the PoisonedRAG study (USENIX Security 2025) demonstrates. Behavioural Control Traps hijack the agent's action capability directly – for example through embedded jailbreak sequences in emails or documents that prompt the agent to exfiltrate data. The final two categories – Systemic Traps and Human-in-the-Loop Traps – address multi-agent dynamics and the manipulation of the human overseer, for instance through deliberately induced approval fatigue.
Financial Sector: The Most Attractive Target
From Manipulated Headlines to Manipulated Markets
For financial institutions, the threat posed by Agent Traps is no theoretical exercise. The direct monetary consequence of every wrong decision, regulatory liability and systemic risk from correlated agent behaviour in trading make the financial sector the most attractive target. The combination of Content Injection and Semantic Manipulation already has quantifiable effects on algorithmic trading systems.
A research team from the University of Liechtenstein led by Advije Rizvani demonstrated two attack techniques on LLM-powered Algorithmic Trading Systems (ATS) in 2026. In Unicode homoglyph substitution, individual letters in stock names are replaced by visually identical Unicode characters – for example, a Latin "A" by a Cyrillic "А". The difference is invisible to humans; the trading model FinBERT assigns the headline to the wrong stock in 99 per cent of cases. The second technique, hidden text injection, inserts invisible text with opposing sentiment into the headline via the CSS attribute display:none. In the worst-case scenario, a single manipulated trading day reduces annual returns by up to 17.7 percentage points – while the trading system remains profitable and the attack goes undetected.
Historical precedents underscore the urgency. The AP Twitter hack of 2013, in which a fake tweet about an explosion at the White House caused the S&P 500 to drop 143 points within seconds, temporarily destroyed 136 billion US dollars in market value. During the Flash Crash on 6 May 2010, a single automated sell order worth 4.1 billion US dollars triggered a "hot potato" effect among high-frequency traders – the Dow Jones fell nine per cent in ten minutes. Both incidents demonstrate how quickly automated systems react to manipulated signals and trigger systemic cascade effects.
The M365 Copilot Case: When the Assistant Becomes a Spy
That Behavioural Control Traps are not merely theoretical was impressively demonstrated by security researcher Johann Rehberger in 2024. He combined four techniques into a complete data exfiltration chain against Microsoft 365 Copilot: an indirect prompt injection via a manipulated email caused Copilot to automatically search further emails and documents – including Slack MFA codes. Using ASCII smuggling, the exfiltrated data was encoded in Unicode characters invisible to the user and embedded in an innocuous-looking hyperlink. A single click sufficed to transmit sales figures and authentication codes to an external server. Microsoft patched the link rendering vulnerability, but the structural susceptibility to prompt injection remains.
Flash Crash 2.0: When Agents React Simultaneously
The fifth category of the DeepMind framework – Systemic Traps – deserves particular attention in the financial context. These traps target not individual agents but the emergent dynamics of multi-agent systems. The paper identifies five mechanisms: Congestion Traps, where homogeneous agents simultaneously converge on the same signal; Interdependence Cascades, where a manipulated financial report triggers a self-reinforcing cascade; Tacit Collusion, where pricing agents learn coordinated behaviour without explicit communication; Compositional Fragment Traps, where a jailbreak is distributed across semantically harmless fragments; and Sybil Attacks, where fake agent identities subvert collective decision-making processes.
The warning from the former SEC Chairman is not alarmism. When thousands of financial institutions deploy the same foundation models whose embeddings classify identical signals as threatening, the precondition for synchronised herd behaviour is created – a flash crash without coordination. Existing circuit breakers at regulated exchanges absorb some of this risk but apply neither to cryptocurrency trading nor to internal agent systems in procurement or risk management. That the threshold from theory to practice is lower than commonly assumed is demonstrated by research into algorithmic price coordination: in a widely cited study in the American Economic Review, Calvano et al. showed that pricing algorithms learn to maintain supra-competitive prices without any explicit agreement – with antitrust implications that remain largely unresolved.
Three Regulatory Gaps
The regulatory landscape in Europe is not prepared for the challenge posed by AI Agent Traps. Three central gaps are crystallising.
The first is a definitional gap: no European regulatory framework – neither the EU AI Act nor DORA nor MaRisk or BAIT – explicitly defines "autonomous AI agents". The EU AI Act does cite the "degree of autonomy" as a criterion for high-risk classification but does not define at what level of autonomy an agent should be classified as high-risk. Existing regulations were written for deterministic systems; probabilistic, autonomously acting agents fall into interpretive grey areas.
The second is a governance gap: the adoption of Agentic AI in the financial sector is outpacing regulatory development. BaFin published its "Guidance on ICT Risks in the Use of Artificial Intelligence" on 30 January 2026 – an important signal, but non-mandatory and not specifically tailored to agentic systems. DORA requires resilience testing in Articles 24 to 27 but does not define what "penetration testing" means for an AI agent. Prompt injection, RAG poisoning or adversarial examples appear in no DORA guideline.
The third and most serious is the accountability gap: when a compromised AI agent triggers a financial transaction that causes damage – who is liable? Classical product liability does not apply clearly; tort law fails due to the absence of direct causality in emergent decisions. An SSRN paper by Shukanayev (December 2025) proposes a tiered liability allocation: policy defects are borne by the model developer, credential compromises by the deploying institution, and model errors are shared. For emergent coordination failures – the actual core of systemic risk – no attribution exists whatsoever. The EU AI Liability Directive, which was intended to close this gap, is still under deliberation.
The Defence Architecture
Input Filtering Alone Is Not Enough
In March 2026, OpenAI disclosed a remarkable paradigm shift in its own defence strategy. The central finding: classical "AI firewalling" – an intermediary classification layer that categorises inputs as harmful or harmless – systematically fails against sophisticated attacks. Detecting a malicious prompt injection is structurally the same problem as detecting a lie: unsolvable without sufficient context. Instead, OpenAI relies on source-sink analysis combined with a social engineering model: an agent is treated like a human customer service representative – it will be manipulated, so deterministic system controls around it must limit the damage, regardless of whether the agent was deceived.
This yields a three-line defence. The first line – ingestion controls – comprises pre-ingestion scanning for injection patterns using regex and LLM classifiers such as Meta PromptGuard, provenance verification for all documents in knowledge bases, and pseudonymisation of personal data before embedding generation, since embedding inversion attacks can reconstruct 50 to 70 per cent of original words from vectors. The second line – runtime controls – relies on source-sink analysis: not every input is filtered, but every dangerous sink action (data transmission, tool invocation, code execution) is constrained. OpenAI's safe URL mechanism detects, for example, when conversation data would be transmitted to third parties and interrupts the process. The third line – architectural measures – concerns the design of the overall system: per-edge zero trust for multi-agent communication with cryptographic agent identity, defence heterogeneity through the use of different foundation models, and deterministic hard caps instead of blanket human-in-the-loop approval.
What the Critics Say – and Where They Are Right
The threat landscape warrants a differentiated assessment. A substantial portion of media reporting on AI agent security incidents in 2025 was exaggerated or lacking context, as a community analysis on r/cybersecurity documented. Many of the scenarios described in the DeepMind framework – particularly Systemic Traps and Human-in-the-Loop Traps – presuppose multi-agent systems with broad permissions that are still rare in enterprise deployments today. Existing frameworks such as MITRE ATLAS (extended with agentic-specific techniques in January 2026), the Microsoft Taxonomy of Failure Modes in AI Agents (April 2025) and the OWASP Top 10 for LLM Applications (2025) already cover the practically relevant risks.
What cannot be relativised, however, is the structural dimension of the problem. Prompt injection achieves success rates of 50 to 84 per cent in production systems. This is due to a fundamental architectural weakness of all current Large Language Models (LLM): there is no system-internal separation between instructions and data. Everything is text within the same context window. Reinforcement Learning from Human Feedback (RLHF) and Constitutional AI improve aligned behaviour within the training distribution, but prompt injection is by definition outside that distribution – which is why even the best models remain vulnerable. With increasing adoption of Agentic AI, the risk landscape will undergo a qualitative shift within 12 to 24 months.
Recommendations for Action
Financial institutions that deploy or plan to deploy Agentic AI should prioritise the following measures. The recommendations are staggered by time horizon and take into account both the technical defence architecture and regulatory compliance under DORA and the EU AI Act.
Create an Agent Inventory and Blast Radius Mapping
All AI agents in use should be mapped with their tools, data sources and permissions. For each agent, the maximum damage radius must be documented: what data can it read? What actions can it trigger? What external systems can it reach? This inventory forms the basis for DORA-compliant ICT risk management (Articles 5–16) and enables an informed least-privilege configuration.
Audit and Harden RAG Trust Boundaries
The most common undefended vulnerability in production systems is the implicit trust assumption towards retrieved context in RAG architectures. Institutions should implement content scanning before ingestion, establish provenance tracking for all documents and ensure tenant isolation of their vector databases by security domain (HR, Legal, Finance). Regulated data – Material Non-Public Information (MNPI) and special categories of personal data under GDPR Article 9 – must not enter general-purpose AI systems.
Establish Agent Red Teaming as Standard Practice
Existing Threat-Led Penetration Tests (TLPT/TIBER-DE) should be extended with AI-specific test scenarios: prompt injection (direct and indirect), tool abuse, memory manipulation and multi-hop injection through agent chains. The Cloud Security Alliance (CSA) published a specific Agentic AI Red Teaming Guide in May 2025. Open-source tools such as DeepTeam, Promptfoo and SPLX AI Probe enable automated RAG poisoning simulations on enterprise knowledge sources.
Classify Foundation Model Providers as Critical ICT Third-Party Providers
Under DORA Articles 28 to 44, foundation model providers (OpenAI, Anthropic, Google) are to be classified as ICT third-party providers. Standard API terms typically do not meet the contractual requirements of Article 30. Institutions should conduct a concentration risk analysis and pursue defence heterogeneity – i.e. not run all agents on the same foundation model. This simultaneously reduces systemic risk from correlated agent behaviour.
Define an Internal Accountability Chain
Since the EU AI Liability Directive is still pending, institutions should proactively establish an internal liability chain for agent-caused damages: policy defects borne by the model developer, credential compromises by the deploying institution, and model errors as shared liability. For trading agents, anti-collusion monitoring must additionally be established, as antitrust liability can arise even without explicit agreement. Cooldown mechanisms for correlated agent activity complement existing circuit breakers for internal operations.