The promise of AI agents is compelling: systems that can sense their environment, plan strategically, take action, and learn from experience. But as organizations begin deploying these systems in production, a troubling reality emerges. The very capabilities that make AI agents powerful—their ability to reason, adapt, and solve novel problems—also make them frustratingly unpredictable. This is the core tension facing AI implementation today, and it’s forcing us to rethink how we design and deploy agentic AI.

The SPAR Loop: How AI Agents Actually Work

Before we can understand the challenges, we need to understand what an AI agent actually does. The authors of Agentic Artificial Intelligence introduce the SPAR framework—Sense, Plan, Act, Reflect—a model that mirrors how autonomous systems operate in the real world.

Sense means perception. Just as a self-driving car uses cameras, radar, and sensors to understand its surroundings, AI agents gather data from multiple sources—documents, APIs, databases. They detect triggers, maintain context, and build awareness of their operating environment.

Plan is where reasoning happens. The agent processes available information, evaluates options, and develops step-by-step approaches to achieve its goals. A self-driving car doesn’t just swerve; it calculates speed, position, and timing before executing a lane change. Similarly, AI agents engage in sophisticated reasoning to develop plans.

Act is execution. The agent uses available tools—sending messages, querying systems, updating records—to carry out its plan. Critically, it monitors actions in real-time, adjusting as needed.

Reflect is learning. The agent evaluates outcomes, builds what the book calls “operational memory,” and refines its approach for future interactions. This is what separates an AI agent from a simple predictive model: the loop closes, with each cycle informing the next.

This continuous feedback loop is what makes agents agents. A traditional machine learning model senses and outputs. That’s it. An agent senses, plans, acts, reflects, and repeats—constantly interacting with and adapting to its environment.
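The loop can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not how any particular agent framework implements it: the `ToyAgent` class, its method names, and the toy "environment" (a single number the agent nudges toward a target) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyAgent:
    """Hypothetical agent that nudges a number toward a target value."""
    target: int
    memory: list = field(default_factory=list)  # stands in for "operational memory"

    def sense(self, environment: dict) -> int:
        # Sense: read the current state of the environment.
        return environment["value"]

    def plan(self, observed: int) -> int:
        # Plan: choose a step that moves the value toward the target.
        if observed < self.target:
            return 1
        if observed > self.target:
            return -1
        return 0

    def act(self, environment: dict, step: int) -> None:
        # Act: apply the chosen step to the environment.
        environment["value"] += step

    def reflect(self, before: int, after: int) -> None:
        # Reflect: record the outcome so later cycles can draw on it.
        self.memory.append({"before": before, "after": after})


def run_spar(agent: ToyAgent, environment: dict, max_cycles: int = 10) -> int:
    """Repeat Sense -> Plan -> Act -> Reflect until the goal is met or cycles run out."""
    cycles = 0
    while cycles < max_cycles:
        observed = agent.sense(environment)
        step = agent.plan(observed)
        if step == 0:  # goal reached, nothing left to do
            break
        agent.act(environment, step)
        agent.reflect(observed, environment["value"])
        cycles += 1
    return cycles


env = {"value": 2}
agent = ToyAgent(target=5)
cycles_used = run_spar(agent, env)
print(env["value"], cycles_used, len(agent.memory))  # 5 3 3
```

A plain predictive model would correspond to calling `sense` and `plan` once and stopping there. The `while` loop and the `memory` list are what make this an agent loop in the SPAR sense.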

The Problem: There’s No Industry Consensus on What Defines an Agent

Here’s where things get murky. Ask ten companies what they mean by “AI agent,” and you’ll get ten different answers. There’s no industry consensus, and that creates real problems.

The current tendency is to use binary classification: something is either an agent or it isn’t. But this fails to capture the nuance of what’s actually happening in production systems. It either creates unrealistic expectations (every AI system being a “true agent” capable of anything) or leads to dismissing systems that are doing genuinely agentic work.

The book borrows a brilliant analogy from the automotive industry: saying a system isn’t an agent unless it’s fully autonomous is like saying a car isn’t a car unless it’s self-driving. It’s absurd. A car with cruise control is still a car. So is one with lane-keeping assistance. And so is a partially autonomous vehicle.

Five Levels of AI Agent Capability

The Society of Automotive Engineers created a six-level framework for driving automation, from Level 0 (fully manual) to Level 5 (fully autonomous under all conditions). The same logic applies to AI agents, with the five levels below running from Level 1 to Level 5.

Level 1: Rule-Based Automation
Like cruise control—fixed rules, no learning. RPA systems, basic scripts, simple workflow automation. Entirely deterministic. The agent follows instructions precisely.

Level 2: Intelligent Automation
Driver assistance with machine learning and natural language processing. The system can process unstructured data, make predictions, and handle some cognitive tasks—but requires significant human supervision. Most AI deployments today sit here or at Level 3.

Level 3: Agentic Workflows
This is where LLMs enter the picture. The agent can reason, plan, and chain tools together, much like highway autopilot. It operates with some autonomy within a defined domain. Most AI agents in the news today operate at this level. The agent can generate content, has some ability to plan, and adapts to variations within its domain, but it struggles with truly novel situations.

Level 4: Semi-Autonomous
Think geo-fenced robotaxis in specific cities. Strong autonomy, but only within a narrowly scoped area of expertise. Goal decomposition works, real-time learning happens, but only within defined domains.

Level 5: Fully Autonomous
Theoretical for now. The agent operates independently across any domain, understanding any goal, adapting to any situation. Like Level 5 autonomous vehicles, this raises fundamental questions about control, safety, and human oversight.

The key insight: as you move up the levels, capability increases but so does the complexity of what you’re trying to control. Each level requires more sophisticated oversight mechanisms, not less.

A Critical Concept: Autonomy Is Earned, Not Given

Organizations shouldn’t grant full autonomy to an agent and hope for the best. Instead, autonomy should be earned progressively as the system proves it can handle complexity.

Start with lower-level agents in controlled environments. Build familiarity with how these systems behave. Develop proper oversight and governance mechanisms. Create appropriate guardrails. Only as trust builds and performance validates should you increase the agent’s independence.

It’s exactly how you’d onboard a new human employee: you don’t hand them critical decisions on day one. You gradually increase responsibility as they demonstrate competence.
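One way to make "earned" concrete is to tie an agent's autonomy level to its track record. The sketch below is an assumption-laden illustration: the `AutonomyTracker` class, the streak-based promotion rule, and the choice to demote by one level after a failure are all hypothetical policies, not something the book prescribes.

```python
from dataclasses import dataclass


@dataclass
class AutonomyTracker:
    """Hypothetical tracker: raise an agent's level only after sustained good results."""
    level: int = 1
    max_level: int = 4
    required_successes: int = 20  # assumed promotion threshold
    streak: int = 0

    def record(self, success: bool) -> int:
        if success:
            self.streak += 1
            if self.streak >= self.required_successes and self.level < self.max_level:
                self.level += 1
                self.streak = 0  # trust must be re-earned at the new level
        else:
            # Assumed policy: a failure drops the agent one level and resets the streak.
            self.level = max(1, self.level - 1)
            self.streak = 0
        return self.level


tracker = AutonomyTracker(required_successes=3)
levels = [tracker.record(outcome) for outcome in [True, True, True, True, False]]
print(levels)  # [1, 1, 2, 2, 1]
```

In a real deployment the promotion decision would more likely be a human sign-off informed by these numbers than an automatic increment.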

Key Characteristics of Agents

To truly understand what is happening inside the mind of an AI agent, we have to look at the specific characteristics that separate an agent from a standard chatbot. 

It starts with contextual awareness, meaning the agent understands the environment in which it operates rather than just processing a single, isolated prompt. This awareness is put to work through goal orientation: the agent is fundamentally driven by specific objectives rather than just responding to queries.

To reach those goals, the agent exercises autonomous decision-making within set parameters and leverages tool integration to interact with external systems and APIs. Throughout this process, the agent exhibits adaptive behavior by adjusting its approach based on feedback and maintains persistence to ensure that the state of the task is preserved across long-term interactions. 

Limitations of an Agent

Despite these advanced capabilities, we must remain grounded in the inherent limitations that define the current state of the technology. 

On the technical side, we are still navigating the challenge of hallucinations, where an agent might confidently generate false information, and reasoning constraints that limit its ability to handle entirely novel or “out-of-the-box” scenarios. 

Furthermore, context window limits mean that an agent cannot process an infinite amount of information at once. Beyond the technical, there are significant operational trade-offs to consider. High levels of autonomy often lead to higher computational costs, and the complex interactions of an autonomous system can lead to unpredictability that is difficult to foresee. 

Finally, we must always account for bias issues, as agents can inadvertently perpetuate the biases present in their original training data. Balancing these limitations against the agent’s capabilities is the primary challenge of modern AI implementation. 

When One Agent Isn’t Enough: Multi-Agent Systems

Here’s a paradox: sometimes the most reliable way to build sophisticated AI capability is not to build one powerful agent, but to build many specialized ones.

The book uses an orchestra as the metaphor. Each musician plays their own instrument following their own sheet music. Yet when they coordinate, they create something infinitely more complex and beautiful than any individual musician could produce alone.

In a real financial services implementation, instead of building one massive agent to handle customer service (understanding queries, retrieving data, composing responses, ensuring compliance), the company built five specialized agents:

  • A natural language understanding agent
  • An account retrieval agent
  • A response composition agent
  • A compliance agent
  • A coordinator agent managing the workflow

Each agent mastered one thing. Together, they created seamless customer service.
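The five-agent division of labor above can be sketched as a small pipeline. This is only an illustration of the shape: each "agent" is a stub function, the account data is a hard-coded dictionary, and every name here (`understand`, `retrieve_account`, `compose`, `check_compliance`, `coordinator`) is hypothetical rather than taken from the implementation described.

```python
def understand(query: str) -> dict:
    # Natural language understanding agent (stub): keyword-based intent detection.
    intent = "balance" if "balance" in query.lower() else "unknown"
    return {"intent": intent}


def retrieve_account(customer_id: str) -> dict:
    # Account retrieval agent (stub): a dict stands in for a real account system.
    accounts = {"c-001": {"balance": 1250.00}}
    return accounts.get(customer_id, {})


def compose(intent: dict, account: dict) -> str:
    # Response composition agent (stub).
    if intent["intent"] == "balance" and "balance" in account:
        return f"Your current balance is ${account['balance']:.2f}."
    return "I'm sorry, I couldn't find that information."


def check_compliance(response: str) -> bool:
    # Compliance agent (stub): a rule-based check that is easy to audit.
    banned_phrases = ["guaranteed return"]
    return not any(phrase in response.lower() for phrase in banned_phrases)


def coordinator(query: str, customer_id: str) -> str:
    # Coordinator agent: runs the specialists in order and enforces the compliance gate.
    intent = understand(query)
    account = retrieve_account(customer_id)
    response = compose(intent, account)
    if not check_compliance(response):
        return "This request has been passed to a human representative."
    return response


reply = coordinator("What is my balance?", "c-001")
print(reply)  # Your current balance is $1250.00.
```

Because each specialist is a separate function with a narrow input and output, any one of them can be replaced or tested without touching the others.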

Why Multiple Agents Beat One Powerful System

Specialization. Research shows specialized agents outperform generalist systems. Each can be optimized for its particular role using the most appropriate algorithms. A language understanding agent uses advanced NLP. A compliance agent uses rule-based logic that’s easier to audit.

Resilience. If one agent fails or needs maintenance, others continue working. The system degrades gracefully instead of collapsing entirely.

Complexity Management. Modern business processes are incredibly complex. Breaking them into smaller pieces makes the overall system more manageable. When regulations change, you update the compliance agent. When you improve language understanding, you don’t risk breaking everything else.

Three Organizational Models

Centralized (Orchestrator Model): One agent directs all others. Clear decision-making, but creates a bottleneck. If the orchestrator fails, everything fails.

Decentralized (Peer-to-Peer): All agents coordinate equally. Excellent scalability and resilience, but requires sophisticated coordination protocols. Like a jazz ensemble—the outcome can sometimes be unpredictable.

Hierarchical: Agents organized in layers, with higher-level agents coordinating those below. This balances the benefits of both: you get local autonomy where you need flexibility, but also global coordination where you need consistency. The book identifies this as the most practical for business applications.

Four Critical Success Factors

Communication Protocols. One implementation crashed spectacularly on day one because agents couldn’t effectively share information. They were brilliant experts who spoke different languages. You need well-defined protocols for message passing, shared context, and mutual understanding.
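A shared message format is the simplest form of such a protocol. The sketch below assumes JSON as the wire format and a hypothetical `AgentMessage` schema. The field names are illustrative, not a standard.

```python
import json
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class AgentMessage:
    """Hypothetical envelope that every agent agrees to send and accept."""
    sender: str
    recipient: str
    intent: str           # what the sender wants done, e.g. "retrieve_account"
    payload: dict         # task-specific data
    conversation_id: str  # shared context: ties messages to one customer request


REQUIRED_FIELDS = {f.name for f in fields(AgentMessage)}


def encode(message: AgentMessage) -> str:
    return json.dumps(asdict(message))


def decode(raw: str) -> AgentMessage:
    data = json.loads(raw)
    missing = REQUIRED_FIELDS - set(data)
    if missing:
        # Reject malformed messages at the boundary instead of failing later.
        raise ValueError(f"message missing fields: {sorted(missing)}")
    return AgentMessage(**{name: data[name] for name in REQUIRED_FIELDS})


original = AgentMessage(
    sender="nlu-agent",
    recipient="account-agent",
    intent="retrieve_account",
    payload={"customer_id": "c-001"},
    conversation_id="conv-17",
)
round_tripped = decode(encode(original))
print(round_tripped == original)  # True
```

Validating at `decode` means an agent that "speaks a different language" is caught at the first message rather than after it has corrupted downstream work.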

Action Coordination. Agents need strategies for resolving conflicts when tasks intersect or resources are shared. Without it, you get chaos—two delivery drones going to the same address while another is ignored.

Fault Tolerance. Systems fail. You need redundancy, failover mechanisms, backup coordinators. The goal is graceful degradation, not catastrophic failure.
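A backup coordinator can be as simple as an ordered list of handlers that are tried in turn. The following is a minimal sketch with hypothetical names (`call_with_failover`, `primary_coordinator`, `backup_coordinator`). The primary is hard-coded to fail so that the failover path is visible.

```python
from typing import Callable


def call_with_failover(
    handlers: list[tuple[str, Callable[[str], str]]], request: str
) -> tuple[str, str]:
    """Try each handler in order; return (handler name, result) from the first that succeeds."""
    errors = []
    for name, handler in handlers:
        try:
            return name, handler(request)
        except Exception as exc:  # broad on purpose: any failure triggers failover
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all handlers failed: " + "; ".join(errors))


def primary_coordinator(request: str) -> str:
    # Simulated outage for the purpose of the example.
    raise TimeoutError("primary coordinator did not respond")


def backup_coordinator(request: str) -> str:
    return f"handled by backup: {request}"


used, result = call_with_failover(
    [("primary", primary_coordinator), ("backup", backup_coordinator)],
    "route invoice 42",
)
print(used, "->", result)  # backup -> handled by backup: route invoice 42
```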

Validation & Trust Models. When one agent starts behaving erratically, others should be able to flag or ignore unreliable data, preventing cascading failures across the system.
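One possible shape for a trust model is a per-agent score that drops when an agent's output fails validation and that peers consult before using its data. Everything in this sketch is an assumption made for illustration: the `TrustRegistry` class, the starting score of 1.0, and the penalty, reward, and threshold values.

```python
class TrustRegistry:
    """Hypothetical trust model: peers check an agent's score before using its data."""

    def __init__(self, threshold: float = 0.5, penalty: float = 0.2, reward: float = 0.05):
        self.threshold = threshold
        self.penalty = penalty
        self.reward = reward
        self.scores: dict[str, float] = {}

    def report(self, agent: str, output_was_valid: bool) -> float:
        # Unknown agents start fully trusted (an assumed default).
        score = self.scores.get(agent, 1.0)
        if output_was_valid:
            score = min(1.0, score + self.reward)
        else:
            score = max(0.0, score - self.penalty)
        self.scores[agent] = score
        return score

    def is_trusted(self, agent: str) -> bool:
        return self.scores.get(agent, 1.0) >= self.threshold


registry = TrustRegistry()
for valid in [False, False, False]:
    registry.report("pricing-agent", valid)
registry.report("compliance-agent", True)

print(registry.is_trusted("pricing-agent"))     # False: three failed validations
print(registry.is_trusted("compliance-agent"))  # True
```

Penalties are deliberately larger than rewards here, so an erratic agent is sidelined quickly and has to rebuild trust slowly.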

The Agent’s Dilemma: Creativity vs. Reliability

Now we reach the core tension that’s reshaping how organizations think about AI agents.

A financial services company automated accounts payable beautifully. The agent could understand complex invoices, match them with purchase orders, handle exceptions. Everything worked perfectly—until one day the agent decided to “optimize” the payment schedule on its own, modifying pre-established terms.

This crystallizes what the authors call the Agent’s Dilemma: the same creative capabilities that make LLMs powerful also make them unreliable in ways traditional automation never was.

Why This Happens: Stochasticity

LLMs don’t follow rules. They don’t look things up in a table. They generate responses based on probabilities learned from massive amounts of training data. This is called stochasticity—the randomness inherent in how LLMs produce outputs.

Each response is a creative act of generation. Ask the same question twice, and you may get different answers. Ask an LLM “What tool should I use to edit an image?” and each answer is drawn from a distribution that might look like this:

  • Photoshop (50% probability—most common)
  • GIMP (30%)
  • Canva (15%)
  • Others (5%)

Like rolling weighted dice, the model doesn’t always choose the most probable option.
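The weighted-dice picture can be simulated directly. This sketch uses Python's standard `random.choices` with the illustrative probabilities from the list above. It samples whole answers, a simplification of what an LLM does token by token.

```python
import random
from collections import Counter

tools = ["Photoshop", "GIMP", "Canva", "Others"]
weights = [0.50, 0.30, 0.15, 0.05]  # illustrative probabilities from the text

rng = random.Random(42)  # fixed seed so the run is repeatable
samples = rng.choices(tools, weights=weights, k=1000)
counts = Counter(samples)

# Print how often each answer came up across 1000 simulated "asks".
for tool in tools:
    print(f"{tool:10s} {counts[tool]:4d}")
```

Over many draws the counts track the weights, but any single draw can land on a low-probability option. That is the sense in which the model "doesn't always choose the most probable option."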

This isn’t a bug. It’s fundamental to how LLMs work. And it’s what gives them their power—the ability to reason, understand context, plan ahead, solve novel problems. But it creates unpredictability that’s deeply problematic in business-critical processes.

The Real-World Impact

Consistency problems: Ask an agent to outline onboarding steps for a new hire. Run it ten times and you’ll get ten different workflows. Some omit critical steps like document verification. Others present tasks in illogical order—assigning training before setting up accounts. Some skip security protocols entirely.

In compliance-heavy industries, this isn’t just inefficient. It’s dangerous.

Precision problems: The same financial calculation can come back with a different error on each run. One execution calculates (100+50+30) × 1.1 = $198 correctly. Another drops the parentheses: 100+50+30 × 1.1 = $183, applying tax only to the last item. Another has a decimal error. Some forget tax entirely.
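The invoice arithmetic above is easy to reproduce, and it suggests a simple safeguard: recompute the figure deterministically and compare it with what the agent reported. This sketch uses `Decimal` to avoid floating-point noise. The `matches_expected` helper is a hypothetical name.

```python
from decimal import Decimal

items = [Decimal("100"), Decimal("50"), Decimal("30")]
tax_multiplier = Decimal("1.1")

# Correct: tax applied to the whole subtotal.
correct_total = sum(items) * tax_multiplier

# Precedence error: tax applied only to the last item.
wrong_total = items[0] + items[1] + items[2] * tax_multiplier

print(correct_total, wrong_total)  # 198.0 183.0


def matches_expected(agent_total: Decimal) -> bool:
    """Deterministic cross-check of a total reported by an agent."""
    return agent_total == sum(items) * tax_multiplier


print(matches_expected(Decimal("198")))  # True
print(matches_expected(Decimal("183")))  # False
```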

In finance, even one error cascades into hours of reconciliation work.

High-stakes failures: In travel booking, one execution prioritizes cheapest flights, another prioritizes fastest routes, a third balances both. In payments, slight variations in interpreting instructions lead to mismatched accounts or incomplete payments. In customer service, tone fluctuates from professional to dismissive, damaging customer experience.

Industry Recognition

This isn’t a niche problem. According to a LangChain survey, 45% of practitioners cite performance quality as the primary issue with AI agents. Pegasystems research shows 42% of workers identify accuracy and reliability as the top priority for improvement.

As one CTO memorably put it: “You want a system that can write poetry to run core business processes? That’s like hiring Shakespeare to do your taxes—very risky!”

His skepticism is justified. We’ve created systems that are brilliant advisors but can’t reliably execute.

Five Strategies to Manage the Balance

The answer isn’t to choose between creativity and reliability. That’s a false choice. Instead, we need to channel creativity appropriately—use it where it adds value, constrain it where it creates risk.

1. Temperature Settings: The Creativity Dial

Temperature is one of the most powerful tools available in AI development. It controls the level of randomness in generated responses. Set temperature near zero (0.0–0.3) and the model becomes close to deterministic: at zero it effectively picks the most probable token every time, and slightly above zero it almost always does. This is ideal for transaction processing, compliance tasks, data extraction, and any scenario where consistency is paramount. Set it higher (0.7–1.0) and you get more diverse, creative outputs, because the model can explore less probable options. This is valuable for brainstorming, creative problem-solving, and customer interactions that need personality.

For many real-world applications, somewhere in the middle (0.4–0.6) provides a balance—some creativity while maintaining reasonable consistency. The key: adjust temperature per task. There’s no universal setting. Financial transaction processing gets near-zero. Customer service problem-solving gets medium-high.
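To see why temperature acts as a dial, it helps to look at the standard softmax-with-temperature calculation: the model's raw scores (logits) are divided by the temperature before being turned into probabilities. The sketch below applies that formula to the illustrative image-editor distribution used earlier. How a given provider exposes or clamps the setting is not covered here.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn logits into probabilities, sharpening (T < 1) or flattening (T > 1) them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; T = 0 is treated as argmax in practice")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Logits chosen so that T = 1.0 reproduces the 50/30/15/5 split.
logits = [math.log(p) for p in (0.50, 0.30, 0.15, 0.05)]

low = softmax_with_temperature(logits, 0.2)
base = softmax_with_temperature(logits, 1.0)
high = softmax_with_temperature(logits, 1.5)

for label, probs in (("T=0.2", low), ("T=1.0", base), ("T=1.5", high)):
    print(label, [round(p, 3) for p in probs])
```

At low temperature nearly all the probability mass moves to the top option. At high temperature the distribution flattens and the less likely options are chosen more often.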

2. Safeguards & Thresholds

Don’t just hope the agent makes good decisions. Set explicit limits that trigger human review. Examples: “Don’t process payments over $100,000 without approval.” “Don’t process more than 100 invoices in one batch without verification.” “If 10% of outputs deviate from expected format, escalate.”

When thresholds are exceeded, the system doesn’t continue. It halts and alerts a human. You catch problems before they cascade. In practice, this looked like: when an agent tried to process a $500,000 batch (which it calculated as more efficient), the system immediately flagged it for human approval rather than automatically executing.
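The example limits above translate almost directly into code. This is a minimal sketch: the function names and the exact comparison operators are assumptions, and a production system would log and notify rather than return a string.

```python
PAYMENT_LIMIT = 100_000   # dollars; larger payments need approval
BATCH_LIMIT = 100         # invoices per batch before verification is required
DEVIATION_LIMIT = 0.10    # share of outputs deviating from the expected format


def review_reasons(total_amount: float, invoice_count: int, deviation_rate: float) -> list[str]:
    """Return every threshold the batch exceeds; an empty list means it may proceed."""
    reasons = []
    if total_amount > PAYMENT_LIMIT:
        reasons.append(f"amount ${total_amount:,.0f} exceeds ${PAYMENT_LIMIT:,}")
    if invoice_count > BATCH_LIMIT:
        reasons.append(f"{invoice_count} invoices exceeds batch limit of {BATCH_LIMIT}")
    if deviation_rate >= DEVIATION_LIMIT:
        reasons.append(f"{deviation_rate:.0%} of outputs deviate from expected format")
    return reasons


def process_batch(total_amount: float, invoice_count: int, deviation_rate: float) -> str:
    reasons = review_reasons(total_amount, invoice_count, deviation_rate)
    if reasons:
        # Halt instead of executing; a human decides what happens next.
        return "HALTED for human review: " + "; ".join(reasons)
    return "processed automatically"


print(process_batch(500_000, 40, 0.0))
print(process_batch(20_000, 40, 0.02))
```

The check sits outside the agent, so it holds no matter how the agent reasoned its way to the $500,000 batch.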

3. Comprehensive Instructions

This goes far beyond simple prompts. Provide specific examples of acceptable and unacceptable behavior—not abstract guidelines, but concrete cases. Define authority limits clearly: what can the agent approve independently? What requires escalation? Include escalation protocols and, critically, explain the reasoning behind rules. These instructions resemble detailed job descriptions for highly specialized roles. You’re essentially onboarding the AI like a new employee, providing context about organizational values and constraints.

When you explain the why, the agent can apply rules in novel situations.

4. One Agent, One Tool

This is our most reliable architectural pattern: rather than creating one complex agent handling multiple tasks, create focused agents each handling one tool, plus a coordinator.

This naturally constrains behavior. An agent can’t decide to “optimize” beyond its defined scope because it literally has no access to other tools. Instructions are clearer. Testing is simpler. Each specialized agent works correctly within its domain before agents start interacting.
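The constraint can be enforced structurally: give each agent a handle to exactly one tool and refuse everything else. In this sketch the `ScopedAgent` class and the `match_invoice` tool are hypothetical, and the "agent" contains no model at all, since the point is only the access boundary.

```python
from typing import Callable


class ScopedAgent:
    """Hypothetical wrapper that lets an agent call exactly one named tool."""

    def __init__(self, name: str, tool_name: str, tool: Callable[..., object]):
        self.name = name
        self._tool_name = tool_name
        self._tool = tool

    def use(self, tool_name: str, *args):
        if tool_name != self._tool_name:
            # The agent cannot "optimize" with tools it was never given.
            raise PermissionError(f"{self.name} has no access to tool '{tool_name}'")
        return self._tool(*args)


def match_invoice(invoice_total: float, po_total: float) -> bool:
    # Illustrative tool: does the invoice match the purchase order?
    return abs(invoice_total - po_total) < 0.01


matcher = ScopedAgent("invoice-matcher", "match_invoice", match_invoice)
matched = matcher.use("match_invoice", 180.00, 180.00)

try:
    matcher.use("reschedule_payment", "next-quarter")
    blocked = False
except PermissionError:
    blocked = True

print(matched, blocked)  # True True
```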

5. Hybrid Human-AI Systems

This is the honest approach. AI handles routine, high-volume work. Humans make critical decisions.

The practical split we’ve found most effective: AI handles about 95% of cases automatically, and humans review the remaining 5% of edge cases. This dramatically reduces the stakes of AI errors because the system is designed for human involvement at critical points.

You’re not trying to eliminate human oversight. You’re amplifying human capability by having AI handle volume, freeing humans to focus on judgment and exceptions.

When NOT to Use LLM Agents

For zero-tolerance scenarios—critical medical decisions, nuclear facility controls, financial settlement—LLM-based agents may not be appropriate. Use deterministic automation instead.

The principle: match the tool to the requirement, not the requirement to the tool.

The Path Forward

We’ve created a world of brilliant advisors who can’t lift a finger to help. AI agents are incredibly capable at reasoning and adaptation. But they’re unpredictable in ways traditional automation never was. The solution isn’t to wait for perfect AI. It’s to build systems that acknowledge this tradeoff, channel creativity where it helps, constrain it where it hurts, and keep humans involved in decisions that matter. Start with lower-level agents in controlled environments. Build understanding and oversight mechanisms. Create appropriate guardrails. Only gradually increase autonomy as trust builds and performance validates.

The future of AI isn’t about replacing human judgment. It’s about amplifying it—building systems smart enough to handle complexity while humble enough to know their limits.