How To Keep Your AI Agents From Breaking The Internet
PagerDuty Executive Chair Jenn Tejada explains why AI's next challenge isn't building smarter agents—it's keeping them from drifting, failing and disrupting business.
Martine Paris, Contributor
Forbes
7/10
Key Takeaways
PagerDuty Executive Chair Jenn Tejada identifies 'AI drift' as a key risk: agents gradually deviate from intended behavior, causing hard-to-detect failures.
Unlike traditional software bugs, AI failures often emerge slowly and probabilistically, requiring new observability and incident response paradigms.
PagerDuty is adapting its incident management platform to include AI-specific guardrails, such as automated triggers for behavioral deviation thresholds.
Tejada advocates for 'chaos engineering' for AI—deliberately injecting failures in test environments to build agent resilience.
The broader AI reliability market is expected to surge as enterprises demand tools to prevent agentic AI from disrupting critical business processes.
The next big threat to the internet isn't a new cyberattack or a server outage. It's your own AI agents silently going rogue. PagerDuty Executive Chair Jenn Tejada warns that as companies race to deploy autonomous AI agents, the real challenge isn't making them smarter; it's keeping them from drifting, failing, and disrupting critical business operations.
In a new interview, Tejada explains that AI agents, unlike traditional software, can gradually deviate from their intended behavior, causing cascading failures across systems. This phenomenon, known as 'AI drift,' is already causing headaches for enterprises that rushed to implement agentic AI without robust monitoring and incident response protocols. The stakes are high: a single misbehaving AI agent could take down an e-commerce platform, corrupt financial transactions, or even compromise healthcare decisions. Tejada argues that the solution lies in applying the same operational rigor used for cloud infrastructure to AI systems: real-time observability, automated rollbacks, and clear human escalation paths.
PagerDuty, a company known for incident management, is repositioning itself to help enterprises handle the unique failure modes of AI agents. Tejada points out that unlike traditional software bugs, AI agent failures often manifest slowly and subtly, making them harder to detect. For example, a customer service AI might start giving slightly wrong answers after a model update, or a supply chain agent might optimize for the wrong metric after data drift. Traditional monitoring tools can't catch these issues because they're designed for deterministic systems, not probabilistic ones.
The PagerDuty approach involves setting up 'guardrails': automated triggers that flag when an agent's behavior deviates beyond acceptable thresholds. These guardrails integrate with existing incident management workflows, so that human teams are alerted before the agent causes real damage. Tejada also emphasizes the importance of 'chaos engineering' for AI: deliberately introducing failures in testing environments to understand how agents behave under stress.
The broader implication is that the AI industry must mature its operational practices quickly. As more companies embed AI agents into core business processes, from coding assistants to fraud detection, the cost of failure rises sharply. Analysts draw parallels to the early days of cloud computing, when companies learned the hard way that agility without reliability leads to disaster. The next wave of AI adoption will hinge not on model accuracy but on resilience.
Looking ahead, Tejada predicts that the market for AI reliability tools will explode, with every major cloud provider and observability platform adding AI-specific monitoring features. Regulators may also step in, requiring companies to demonstrate that their AI agents have adequate fail-safes. For now, the message to CIOs is clear: before you let your AI agents loose, make sure you can pull the plug, or they may break the internet.
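To make the guardrail idea concrete, here is a minimal Python sketch of a behavioral deviation threshold. It is an illustration only, not PagerDuty's implementation or API: the class name DriftGuardrail, the escalate_to_human hook, and the idea of scoring each agent response between 0 and 1 are all assumptions made for the example.

```python
from collections import deque
from statistics import mean


class DriftGuardrail:
    """Hypothetical guardrail: fires when a rolling quality score
    falls too far below an agreed baseline."""

    def __init__(self, baseline, max_deviation, window=50):
        self.baseline = baseline            # expected score, e.g. 0.95
        self.max_deviation = max_deviation  # tolerated drop before alerting
        self.scores = deque(maxlen=window)  # most recent per-response scores

    def record(self, score):
        """Add one score; return True if the guardrail should fire."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data yet to judge drift
        return self.baseline - mean(self.scores) > self.max_deviation


def escalate_to_human(agent_name, rolling_score):
    """Hypothetical hook; a real system would open an incident here."""
    return f"ALERT: {agent_name} rolling score {rolling_score:.2f} is below the guardrail"


if __name__ == "__main__":
    guard = DriftGuardrail(baseline=0.95, max_deviation=0.10, window=5)
    # 1 = response judged correct, 0 = judged wrong (scoring method assumed)
    for score in [1, 1, 1, 1, 1, 0, 0, 1, 0, 0]:
        if guard.record(score):
            print(escalate_to_human("support-bot", mean(guard.scores)))
```

The rolling window is the design choice that matters here: a single bad answer does not page anyone, but a sustained slide does, which matches the slow, subtle failures described above.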
Frequently Asked Questions
What is AI drift?
AI drift refers to the gradual deviation of an AI agent's behavior from its intended function over time. This can happen due to model updates, data changes, or environmental shifts, and unlike software bugs, it often goes unnoticed until it causes significant disruption.
How can companies prevent AI agent failures?
Companies can prevent AI agent failures by implementing real-time observability, setting behavioral thresholds that trigger alerts, using automated rollbacks, and applying chaos engineering to test resilience. Integrating these practices into existing incident management workflows is critical.
Why can't traditional monitoring tools catch AI agent failures?
Traditional monitoring tools are designed for deterministic systems with clear failure modes. AI agents are probabilistic and their failures are often subtle and gradual, requiring specialized observability that can detect drift in model outputs and decision patterns.
How is PagerDuty responding to AI drift?
PagerDuty is adapting its incident management platform to include AI-specific features such as adaptive thresholds, anomaly detection, and automated escalation paths to catch AI drift before it causes business disruption.
What is chaos engineering for AI?
Chaos engineering for AI involves deliberately injecting failures or edge cases into AI agent test environments to observe how they respond. This helps teams understand failure modes and build more resilient systems before deployment.
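As a rough illustration of chaos engineering in an agent test environment, the Python sketch below wraps a tool the agent depends on so that a chosen fraction of calls fail, then counts how the agent copes. Everything in it is hypothetical: FlakyTool, toy_agent, and run_chaos_trial are stand-ins invented for the example, and the "agent" is a simple retry-then-escalate function rather than a real model.

```python
import random


class FlakyTool:
    """Wraps a tool function and fails a fraction of calls (test use only)."""

    def __init__(self, tool, failure_rate, seed=0):
        self.tool = tool
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # seeded so test runs are repeatable

    def __call__(self, *args):
        if self.rng.random() < self.failure_rate:
            raise TimeoutError("injected failure")
        return self.tool(*args)


def toy_agent(lookup_price, item, retries=2):
    """Stand-in agent: retry the tool, then escalate instead of guessing."""
    for _ in range(retries + 1):
        try:
            return {"status": "ok", "price": lookup_price(item)}
        except TimeoutError:
            continue
    return {"status": "escalated", "price": None}


def run_chaos_trial(failure_rate, trials=200):
    """Count how often the agent answers versus escalates under faults."""
    tool = FlakyTool(lambda item: 9.99, failure_rate)
    outcomes = [toy_agent(tool, "widget")["status"] for _ in range(trials)]
    return {status: outcomes.count(status) for status in ("ok", "escalated")}


if __name__ == "__main__":
    for rate in (0.0, 0.3, 0.9):
        print(rate, run_chaos_trial(rate))
```

The point of a trial like this is to confirm the failure mode before deployment: under injected faults the agent should either return a correct answer or hand off to a human, never invent a price.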