AI security06/10/2026

AI agents and prompt injection: What enterprises can do today against indirect attacks

Indirect prompt injection is the biggest security threat to AI agents in 2026. OpenAI's GPT-5 system card shows 56.8% attack success rate on hardest-tier benchmarks, other frontier models exceed 70%. Three real incidents (GrafanaGhost, ForcedLeak, GeminiJack) illustrate the attack class. Why model guardrails no longer suffice, which five concrete safeguards enterprises should pull now, and what an audit-ready security concept must deliver.

AI agents and prompt injection: What enterprises can do today against indirect attacks

Security researchers at Google and Forcepoint have documented in recent months how indirect prompt injection attacks compromise production AI systems. The attack is invisible: no phishing link to click, no malicious file, no suspicious login. Instead, attackers place hidden instructions in web pages, documents, or emails. When an AI agent processes that content, it reads the instructions and executes them. The result: data exfiltration, disclosure of credentials, outbound requests to attacker-controlled servers. All triggered by the AI itself.

Kiteworks' 2026 forecast report surveyed 225 enterprises: 41 to 44 percent have no basic governance controls like Human-in-the-Loop monitoring, audit trails, or data minimization for their AI agents. In plain language: the majority of enterprises are structurally unprotected today.

GPT-5 System Card (Aug 2025)

56.8%

attack success rate for gpt-5-thinking on hardest-tier attacks

Competing models

70%+

Claude 3.7 and other frontier models on same benchmark

Unprotected enterprises

~43%

without HITL monitoring (Kiteworks 2026)

How the attack unfolds in practice

A typical scenario: a field rep's assistant receives the task of summarizing a series of emails. One of those emails contains hidden text embedded in the HTML code. The agent reads the instruction, believes it came from the user, and sends internal quote data to an external address. The attacker doesn't need to hack the agent — they just need to send an email that the agent reads. The outbound request looks like a normal agent call to SIEM, DLP, and endpoint monitoring, because it runs through legitimate channels.

Three real incidents from spring 2026 show the pattern:

GrafanaGhost: Zero-click data exfiltration via URL parameters in logs. Researchers placed instructions in log data that the AI assistant processed. Result: financial metrics, infrastructure telemetry, and customer data were smuggled out in image render requests.
ForcedLeak (Salesforce Agentforce): The same attack class on a different enterprise platform.
GeminiJack (Google Gemini): Takeover of the AI via a malicious email to a Gmail account, including theft of two-factor codes.

All incidents have been patched. The attack class remains, because the structural problem lies in the foundation of generative AI: the model cannot reliably distinguish between a trusted user instruction and a hidden command embedded in content.

Current numbers from 2025/2026: the threat is real

OpenAI's GPT-5 system card, published in August 2025, states that gpt-5-thinking shows a 56.8 percent attack success rate on the hardest-tier prompt injection benchmark. Other current models sit in the 70+ percent range on the same benchmark. Claude 3.7 reaches values in the 60s; other frontier models go higher. The message from the system card is unambiguous: even the strongest model currently available is not robust against targeted attacks.

A 2026 study on patient safety measured a 94.4 percent success rate across 216 prompt injection evaluations against medical AI systems at the primary decision turn. Lightweight models like GPT-4o-mini and Gemini-2.0-flash-lite were completely susceptible, Claude-3-haiku still showed 83.3 percent partial vulnerability.

From the AgentDojo benchmark (published 2024, extended in 2025 with current models): frontier models like GPT-4o reach 69 percent benign utility in normal operation but drop to 45 percent under attack. On targeted attacks, the Targeted Attack Success Rate (ASR) reaches 20 percent for most models, averaging 11 to 15 percent across all tasks.

The three main research lines tell the same story: with current frontier models, success rates have dropped measurably compared to 2024, but no model in production deployment is considered robust. OpenAI itself acknowledges in tech blogs that prompt injection remains an "open challenge" and that work on it will continue "for years." That is not reassurance, it is confirmation of the structural problem.

More important than blanket percentages is the fundamental insight: jailbreak attacks on guardrails achieve close to 100 percent success on tested models according to NeurIPS publications. Safety filters can be bypassed through clever wording. System prompts are configurable and therefore not a security mechanism in the audit sense. An auditor, whether HIPAA, CMMC, PCI, or SOX, will not accept the argument "the model was instructed not to" as proof of access control. Auditors certify enforcement decisions, not configurations.

To pass an audit, security must be pushed down a layer, into the data and permission layer.

Five concrete safeguards

Organizations deploying or planning to deploy AI agents can pull five levers with manageable effort:

1. Catalogue every agent action. List precisely what each agent may do: which data to read, which actions to perform, which external services to call. Anything not on the list is forbidden. This is the foundation for permissions and audit.

2. Restrict permissions to the necessary minimum. An agent that only books appointments does not need access to the ERP. An agent that calculates quotes does not need access to email. The principle of least privilege applies to AI as well. Data minimization is the most effective defense against indirect prompt injection: what the agent cannot read, it cannot exfiltrate.

3. Put a human in the loop at sensitive points. Configure the agent to require approval for any action with external impact (sending emails, initiating bookings, accessing personal data). The overhead per action is minimal, the protection is substantial. This HITL architecture is the most effective single lever.

4. Strictly separate external content. Never process web content, PDFs, or emails in the same context as the actual task. Where possible: have a separate model summarize external content first, then feed the summary into the actual task context. A typical attack fails structurally as a result.

5. Maintain a complete audit trail. Every agent action should be logged with timestamp, user attribution, input data, tool calls, and output data. In an incident, this enables forensic analysis and gives the auditor the required proof. Without a complete log, you have no evidence in the event of damage.

What an audit-ready security concept must deliver

To make AI agent deployment compliance-ready, three requirements are non-negotiable:

Authentication of every request. Every call to an agent must be attributable to a concrete identity: human, machine, service. Anonymous requests are rejected.

Attribute-based access control in real time. The question of whether the agent may read a particular piece of information is checked at every request, based on role, context, and data sensitivity. The check must happen in real time, not in advance.

Full logging before data access. The enforcement decision (allowed, restricted, denied) is logged with all attributes, before data is actually returned. This is the decisive difference from model guardrails: enforcement happens at the data layer, not in the prompt. Even if the model is compromised, the agent can only access data for which it has valid authorization.

Concrete threats every enterprise should know about

Three attack scenarios that affect typical business processes:

Email-based attacks: A supposed supplier contact sends a request containing hidden instructions. The agent processing incoming mail exfiltrates quote data or customer information as a result.
Web research attacks: A research agent searches the web and lands on a page with a hidden instruction. The agent sends internal data to an attacker-controlled endpoint, often embedded in innocuous image URLs.
Supplier document attacks: A supplier sends a PDF with instructions hidden in fine print or metadata. The agent processes the invoice and performs unwanted actions on the side.

What all three share: the attack requires no system access, no account theft, no technical vulnerability in the classical sense. It exploits the very operating principle of the agent itself.

What enterprises should do today

The urgent recommendation is simple: audit your current AI agent deployment with three questions. Which agents have access to which data? Which actions do they perform autonomously? Is there an audit trail? Anyone who cannot answer these three questions today should pause the rollout of new agents until the controls are in place.

Organizations that already have agents in production can establish a significantly more robust security posture in two to four weeks using the five levers above. The tools exist, the standards are documented, and the gap between lab and production has closed. The next 18 months will show which organizations took the topic seriously and which end up involved in the first major data protection scandal of the agent era.

centerbit

Book a consultation now

If you see similar manual work in your team, we can review the process together in a free initial consultation.

Request consultation