AI Agent Security Explained: Agents, MCP, Prompt Injection, and the AI Harness

AI Agent Security is quickly becoming one of the most important areas in cybersecurity.

Terms like "agent," "harness," "MCP," "tool calls," "tool responses," "instruction hijacking," "indirect prompt injection," "prompt exfiltration," and "tool misuse" are appearing in conference talks, vendor announcements, podcasts, and industry discussions, often without clear explanations.

At the same time, employees are rapidly adopting tools like Claude Cowork & Code, OpenAI Codex, Microsoft Copilot, and other AI agents that can interact with enterprise systems, applications, and data.

As a result, security teams are facing a new challenge.

The question is no longer just:

"Who is using AI?"

It's increasingly:

"What can that AI access, and what actions can it take?"

This guide introduces the concepts, attack surfaces, and terminology every security leader and practitioner should understand.

Why Agent Security Is Different

Traditional AI security discussions focused primarily on the model.

Today's AI agents are different.

They can:

Access documents
Query databases
Use MCP servers
Interact with SaaS applications
Execute workflows
Take actions on behalf of users

As a result, the attack surface extends far beyond the model itself.

Many of the most important risks now exist in the systems surrounding the model.

AI Agents: Orchestrators and Agent Infrastructure

Two concepts are essential for understanding how AI agents operate in practice: the orchestrator and the agent infrastructure.

The orchestrator — sometimes called the agent runtime — is the software layer that kicks off and manages agent behavior. Examples include Claude Code, Cursor, Windsurf, OpenAI Codex, LangGraph, and CrewAI. The orchestrator is what sequences tasks, decides when to invoke tools, and drives the agent toward a goal.

The agent infrastructure is everything surrounding the model and orchestrator that enables an agent to operate inside an organization:

Tools
APIs
MCP servers
Permissions
Data sources
Memory
Workflows
Governance controls

Together, the orchestrator and agent infrastructure transform a standalone language model into an operational AI agent capable of accessing information, interacting with systems, and taking actions on behalf of users.

Both layers create tremendous value — and both expand the attack surface. Agent infrastructure determines:

What data an agent can access
What systems an agent can reach
What actions an agent can perform
What instructions can influence its behavior

Security teams are increasingly finding that real-world AI risks emerge from these layers rather than from the model itself. As agents become more capable, the key security questions are shifting away from "Can the model be jailbroken?" toward:

Which orchestrator is managing this agent, and what does it permit?
Which MCP servers and tools is the agent connected to?
What sensitive data is flowing through the workflow?
What actions can the agent perform, and who authorized them?
How can the orchestrator or infrastructure be manipulated?

Understanding both the orchestrator and the agent infrastructure is foundational to modern AI agent security.

The AI Agent Workflow

A simplified AI agent workflow looks like this:

Each step introduces unique security risks.

Understanding where information enters, leaves, and influences the workflow is critical to understanding AI Agent Security.

Hooks

A hook is a point in the AI workflow where security controls, monitoring, or policy enforcement can be applied before an agent processes information, retrieves data, or takes action.

Examples include:

User Prompt Hooks
Tool Request Hooks
Tool Response Hooks

Hooks allow organizations to inspect activity before an agent processes data, takes an action, or consumes external content.

Rather than securing only prompts, modern AI security increasingly relies on controls applied at multiple hooks throughout the workflow.

Attack Surface #1: User Prompts

User prompts are the most familiar attack surface.

Common attacks include:

Instruction Hijacking

Attempts to manipulate an agent's behavior, override instructions, bypass controls, or influence decision-making.

Prompt Injection

A specific form of instruction hijacking that introduces malicious instructions into agent workflows.

Prompt Exfiltration

Attempts to retrieve:

System prompts
Hidden instructions
Internal configurations
Sensitive contextual information

Attack Surface #2: Tool Requests

Tool requests occur when an agent decides to take action.

Examples include:

Querying Salesforce
Accessing Google Drive
Sending Slack messages
Creating Jira tickets

Common risks include:

Tool Misuse

Using connected tools in ways that violate policy, expose sensitive data, or perform unintended actions.

Excessive Permissions

Agents possessing broader access than necessary to perform their intended function.

Sensitive Data Exposure

Transferring credentials, customer data, source code, or other sensitive information through AI workflows.

Attack Surface #3: Tool Responses

Tool responses are information returned to an agent by external systems.

Examples include:

Documents
Knowledge bases
Websites
Database records
MCP servers

Common risks include:

Indirect Prompt Injection

Malicious instructions embedded in content consumed by the agent.

This makes tool responses one of the most important and frequently overlooked attack surfaces in modern AI systems.

Poisoned Content

Compromised information sources designed to influence agent behavior.

Sensitive Data Leakage

Sensitive information introduced into workflows through retrieved content.

Agent Lifecycle

The agent lifecycle refers to the sequence of interactions that occur as an AI agent receives instructions, accesses tools, retrieves information, and performs actions.

Examples include:

User prompts
Tool requests
Tool responses
Memory retrieval
Agent actions

Security controls increasingly need to span the entire lifecycle rather than a single prompt.

MCP (Model Context Protocol)

MCP is an emerging standard that enables AI agents to communicate with external tools and systems.

Think of MCP as a common interface between agents and enterprise resources.

As MCP adoption accelerates, organizations gain flexibility—but also expand the number of systems accessible to AI agents.

Memory

Memory refers to information retained by an AI agent across interactions.

Memory can improve user experience and task completion, but it also introduces new security considerations, including sensitive data retention, memory poisoning, and unauthorized access to historical context.

Agent Observability

One of the biggest challenges facing security teams today is visibility.

Many organizations cannot answer:

Which agents are running?
Which MCP servers are being used?
Which tools are being called?
What data is flowing through those interactions?

Agent observability refers to the ability to understand:

What AI agents are doing
What tools they are using
What systems they are accessing
What data is moving through workflows

You can't secure what you can't see.

Agent Telemetry

Agent telemetry is the operational data generated by AI systems.

Examples include:

User prompts
Tool requests
Tool responses
MCP interactions
Execution events

OpenTelemetry (OTEL) is an open standard for collecting operational telemetry from software systems. It is increasingly being adopted to capture AI agent activity, tool usage, workflow execution, and other security-relevant events.

Agent telemetry provides the foundation for visibility, monitoring, and security controls.

AI Activity Discovery

Before organizations can secure AI workflows, they must first discover them.

AI Activity Discovery is the process of identifying:

AI agents
MCP servers
Tool calls
Agent workflows
AI-driven data flows

Many organizations are discovering that AI has become the next evolution of Shadow AI—not because more employees are using AI, but because AI systems are increasingly capable of taking actions on their behalf.

Skills

Skills are reusable instructions, workflows, or capabilities that can be invoked by an AI agent.

Because skills become part of an agent's operating context, compromised or malicious skills can become a persistent source of instruction hijacking, data exfiltration, or tool misuse.

For example, an employee could download a poisoned skill that appears useful but contains hidden instructions to send sensitive data, credentials, prompts, or tool outputs to a third party.

This makes skills an emerging supply chain risk for AI agents: they are not just productivity add-ons; they can become persistent execution context that influences how an agent behaves.

Tool Descriptions

Tool descriptions are metadata that explain how and when a tool should be used by an AI agent.

Because agents rely on these descriptions when selecting and interacting with tools, compromised or malicious tool descriptions may influence agent behavior and decision-making.

Agent Security vs. Model Security

Model Security focuses on:

Alignment
Safety
Refusal behavior
Jailbreak resistance

Agent Security focuses on:

Tool access
Permissions
Data exposure
Instruction hijacking
Prompt exfiltration
Tool misuse
Governance controls

Both are important.

However, as agents become more capable, security teams are increasingly spending more time securing the harness around the model than the model itself.

The Emerging Security Question

For years, organizations asked:

"Who is using AI?"

As AI agents become connected to enterprise systems, a more important question is emerging:

"What can that AI access, and what actions can it take?"

Understanding the workflow, hooks, attack surfaces, and terminology behind AI Agent Security is the first step toward answering that question.

The future of AI security is not just about securing models.

It's about understanding how AI agents interact with data, tools, systems, and users—and securing the harness that connects them all.

Learn more about how to secure MCP and AI agents with Nightfall with a demo.

AI Agent Security Explained: Agents, MCP, Prompt Injection, and the AI Harness

On this page

Why Agent Security Is Different

AI Agents: Orchestrators and Agent Infrastructure

The AI Agent Workflow

Hooks

Attack Surface #1: User Prompts

Instruction Hijacking

Prompt Injection

Prompt Exfiltration

Attack Surface #2: Tool Requests

Tool Misuse

Excessive Permissions

Sensitive Data Exposure

Attack Surface #3: Tool Responses

Indirect Prompt Injection

Poisoned Content

Sensitive Data Leakage

Agent Lifecycle

MCP (Model Context Protocol)

Memory

Agent Observability

Agent Telemetry

AI Activity Discovery

Skills

Tool Descriptions

Agent Security vs. Model Security

The Emerging Security Question

Schedule a live demo