L1: Troubleshooting Common Issues
Troubleshooting Common Issues & Cross-Agent Error Propagation
In Module 9, we successfully integrated our agents into the CI/CD pipeline, GitHub, and Rally. However, enterprise systems are chaotic. APIs time out, databases lock, and edge cases inevitably break your prompts. When an autonomous system fails, it can fail silently and catastrophically.
This lesson covers how to troubleshoot the most common agentic failures and how to architect Cross-Agent Error Propagation so that a single subagent's error does not cause a cascading collapse of your entire pipeline.
1. Troubleshooting Common Agentic Failures
Before you can architect error handling, you must be able to diagnose the three most common ways an agentic workflow breaks in production:
The Infinite Tool Loop: The agent encounters an error, attempts to fix it by calling the exact same tool with the exact same parameters, and gets stuck in a loop until it exhausts your token budget.
- The Fix: Implement a programmatic max_retries counter in your application code. If the counter hits 3, forcefully inject a system message: "You have attempted this action 3 times and failed. Stop calling this tool and return a FATAL_ERROR."
Hallucinated Tool Calls: The agent attempts to use a tool that does not exist in its provided schema, or it invents parameters.
- The Fix: This usually indicates an overly vague System Prompt. You must tighten the prompt or use the tool_choice parameter to restrict which tools the agent is allowed to call.
The Cascading Failure: The Coordinator (Hub) relies on data from a Subagent (Spoke). The Subagent encounters a database timeout and, instead of failing gracefully, hallucinates a fake dataset to "please" the Coordinator. The Coordinator then deploys broken code based on fake data.
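The first two fixes above can be enforced in application code before a tool call is ever executed. The following is a minimal sketch, not a definitive implementation: the ToolCallGuard class, the TOOL_SCHEMAS registry, and the tool names run_sql and read_file are hypothetical. It also assumes that an identical repeated call means the previous attempt failed.

```python
import json
from collections import Counter

MAX_RETRIES = 3  # threshold taken from the lesson text

# Hypothetical tool registry: tool name -> set of allowed parameter names.
TOOL_SCHEMAS = {
    "run_sql": {"query"},
    "read_file": {"path"},
}


class ToolCallGuard:
    """Screens each tool call before the application executes it."""

    def __init__(self, schemas, max_retries=MAX_RETRIES):
        self.schemas = schemas
        self.max_retries = max_retries
        self.seen = Counter()  # identical (tool, params) call -> times seen

    def check(self, tool_name, params):
        """Return None if the call may run, else a corrective message for the agent."""
        # Hallucinated tool: the name is not in the provided schema.
        if tool_name not in self.schemas:
            return f"Tool '{tool_name}' does not exist. Available tools: {sorted(self.schemas)}."
        # Hallucinated parameters: names the tool does not declare.
        unknown = set(params) - self.schemas[tool_name]
        if unknown:
            return f"Tool '{tool_name}' does not accept parameters: {sorted(unknown)}."
        # Canonical key, so the same parameters in a different order still match.
        key = (tool_name, json.dumps(params, sort_keys=True))
        self.seen[key] += 1
        if self.seen[key] > self.max_retries:
            return (
                f"You have attempted this action {self.max_retries} times and failed. "
                "Stop calling this tool and return a FATAL_ERROR."
            )
        return None
```

When check returns a message, the application would skip execution and pass that message back to the agent instead of a tool result. Because the count lives outside the model, the agent cannot talk its way past the limit.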
2. Cross-Agent Error Propagation (The Architectural Standard)
To prevent cascading failures, errors must be explicit, structured, and forcefully propagated up the chain of command.
When a Subagent fails, it must never return conversational apologies (e.g., "I'm sorry, I couldn't find the data"). Conversational text forces the Coordinator agent to guess whether the task succeeded or failed.
The Implementation:
When a Subagent terminates unsuccessfully, your application code must intercept the failure and format the final output as a strict JSON error schema before passing it back to the Coordinator's context window.
Example Cross-Agent Error Payload:
JSON
{
  "status": "FATAL_SUBAGENT_ERROR",
  "failing_agent": "SQL_Query_Subagent",
  "error_type": "SCHEMA_MISMATCH",
  "details": "The requested table 'q3_revenue' does not exist in the database.",
  "attempted_steps": 4
}
Architectural Advantage: When the Coordinator receives this JSON block as a tool_result, the rules in its System Prompt can match on the status and error_type fields and trigger an exception-handling protocol, rather than blindly continuing the workflow.
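The interception step described in this section might look like the sketch below. It is one possible shape under stated assumptions: SubagentError, ErrorPayload, and run_subagent are hypothetical names, and the "SUCCESS" status and "UNEXPECTED_EXCEPTION" error type are illustrative additions that the lesson does not define.

```python
import json
from dataclasses import asdict, dataclass


class SubagentError(Exception):
    """Hypothetical exception raised by a subagent runner on failure."""

    def __init__(self, error_type, details, attempted_steps=0):
        super().__init__(details)
        self.error_type = error_type
        self.details = details
        self.attempted_steps = attempted_steps


@dataclass
class ErrorPayload:
    """Mirrors the fields of the example error payload."""

    status: str
    failing_agent: str
    error_type: str
    details: str
    attempted_steps: int


def run_subagent(agent_name, task):
    """Run a subagent callable and return a JSON string for the Coordinator."""
    try:
        result = task()
        # "SUCCESS" is an assumed status value, not part of the lesson's schema.
        return json.dumps({"status": "SUCCESS", "agent": agent_name, "result": result})
    except SubagentError as err:
        payload = ErrorPayload(
            "FATAL_SUBAGENT_ERROR", agent_name,
            err.error_type, err.details, err.attempted_steps,
        )
    except Exception as err:
        # Unanticipated failures are reported in the same structure.
        payload = ErrorPayload(
            "FATAL_SUBAGENT_ERROR", agent_name, "UNEXPECTED_EXCEPTION", str(err), 0,
        )
    return json.dumps(asdict(payload))
```

Because the wrapper is the only path back to the Coordinator, a failed subagent cannot substitute an apology or an invented dataset for the error.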
3. The Coordinator as an Exception Handler
To make the Hub-and-Spoke model resilient, the Coordinator agent must be explicitly programmed to act as a router for errors, not just a router for tasks.
Your Coordinator's prompt must include a Fault Tolerance Matrix:
- "If a Subagent returns a TIMEOUT error, wait and retry the subagent exactly once."
- "If a Subagent returns a SCHEMA_MISMATCH error, do NOT retry with the same request. Trigger the Database_Introspection_Subagent to find the correct table name, then run the original Subagent again with the corrected name."
- "If a Subagent returns any other FATAL_SUBAGENT_ERROR, immediately halt the overarching workflow, log the failure, and escalate to the human operator."
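The lesson places this matrix in the Coordinator's prompt. As an assumption beyond the lesson, the same rules can also be mirrored in application code so that routing does not depend solely on the model following instructions. In this sketch the Action enum and route_error function are hypothetical, the payload is assumed to carry an error_type field as in the earlier example, and escalating on a second TIMEOUT is an assumed reading of "retry exactly once".

```python
from enum import Enum


class Action(Enum):
    RETRY_ONCE = "retry_once"
    INTROSPECT_THEN_RERUN = "introspect_then_rerun"
    HALT_AND_ESCALATE = "halt_and_escalate"


# error_type -> action, mirroring the first two rules of the matrix.
FAULT_TOLERANCE_MATRIX = {
    "TIMEOUT": Action.RETRY_ONCE,
    "SCHEMA_MISMATCH": Action.INTROSPECT_THEN_RERUN,
}


def route_error(payload, retries_so_far=0):
    """Pick the next action for a subagent error payload (a dict)."""
    # Any error type without an explicit rule halts and escalates.
    action = FAULT_TOLERANCE_MATRIX.get(
        payload.get("error_type"), Action.HALT_AND_ESCALATE
    )
    # A TIMEOUT may be retried exactly once; a second one escalates.
    if action is Action.RETRY_ONCE and retries_so_far >= 1:
        return Action.HALT_AND_ESCALATE
    return action
```

Defaulting to HALT_AND_ESCALATE means an error type nobody anticipated stops the workflow rather than slipping through.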
4. Multi-Tiered Escalation Patterns
When an error reaches the threshold where the agent cannot self-correct, architects deploy multi-tiered patterns to ensure the system degrades gracefully.
Tier 1: Alternative Routing (The Pivot)
- If the primary path fails, the Coordinator attempts a secondary path. For example, if the Search_Internal_Wiki_Agent returns NOT_FOUND, the Coordinator pivots and triggers the Search_Jira_Tickets_Agent before giving up.
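A rough sketch of the pivot, assuming each agent is a callable that returns a dict with a status field. The search_with_fallback function and the answered_by and tried keys are hypothetical.

```python
def search_with_fallback(query, agents):
    """Try each (name, agent) pair in order and return the first real result.

    Each agent is assumed to return a dict whose "status" is "NOT_FOUND"
    when it has nothing to offer.
    """
    tried = []
    for name, agent in agents:
        result = agent(query)
        tried.append(name)
        if result.get("status") != "NOT_FOUND":
            # Record which agent answered and which were tried before it.
            return {**result, "answered_by": name, "tried": tried}
    # Every path was exhausted; report that explicitly.
    return {"status": "NOT_FOUND", "answered_by": None, "tried": tried}
```

Returning the list of agents already tried lets the Coordinator, or a human reading the logs, see that the pivot happened instead of inferring it.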
Tier 2: Graceful Degradation (Partial Fulfillment)
- If a multi-step task partially fails, the system returns what it successfully completed. "I successfully generated the backend code and unit tests, but the API_Docs_Agent failed due to a token limit. The code is ready for review, but documentation is pending."
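Partial fulfillment can be sketched as a runner that keeps successful results even when a later step raises. This assumes the steps are independent of each other; the run_steps function and the COMPLETE, PARTIAL, and FAILED status values are hypothetical.

```python
def run_steps(steps):
    """Run every (name, callable) step, keeping successes when others fail.

    Assumes the steps do not depend on each other's output.
    """
    completed, failed = {}, {}
    for name, step in steps:
        try:
            completed[name] = step()
        except Exception as err:
            # Record the failure and continue with the remaining steps.
            failed[name] = str(err)
    if not failed:
        status = "COMPLETE"
    elif completed:
        status = "PARTIAL"
    else:
        status = "FAILED"
    return {"status": status, "completed": completed, "failed": failed}
```

The returned dict gives the Coordinator what it needs to write a summary like the one quoted above: what is ready, what is pending, and why.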
Tier 3: The Dead-Letter Queue (DLQ)
- For asynchronous CI/CD tasks, if an agent gets completely stuck, your system should package the Agent's entire state (Session ID, current context window, and the exact JSON error payload) and push it into a Dead-Letter Queue (e.g., an AWS SQS queue or a dedicated database table). A human engineer can review the DLQ later to debug the architectural flaw without halting the rest of the pipeline.
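As an illustration of the "dedicated database table" option, the sketch below stores the three pieces of state named above in SQLite using Python's standard library. The table name agent_dlq, its columns, and both function names are assumptions made for this example.

```python
import json
import sqlite3
from datetime import datetime, timezone


def init_dlq(conn):
    """Create the dead-letter table if it does not exist yet."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS agent_dlq (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               session_id TEXT NOT NULL,
               context_window TEXT NOT NULL,
               error_payload TEXT NOT NULL,
               created_at TEXT NOT NULL
           )"""
    )


def push_to_dlq(conn, session_id, context_window, error_payload):
    """Store the stuck agent's state and return the new row id."""
    cur = conn.execute(
        "INSERT INTO agent_dlq (session_id, context_window, error_payload, created_at) "
        "VALUES (?, ?, ?, ?)",
        (
            session_id,
            json.dumps(context_window),  # full message history as JSON
            json.dumps(error_payload),   # the exact structured error
            datetime.now(timezone.utc).isoformat(),
        ),
    )
    conn.commit()
    return cur.lastrowid
```

A pipeline worker would call push_to_dlq and move on to the next task, while an engineer later queries the table to replay the failure.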
5. State Preservation During Human Escalation
When a Coordinator finally escalates an issue to a human engineer in a live session (like a Claude Code terminal), context preservation is critical.
The Anti-Pattern: The agent outputs, "An error occurred. What should I do?" The human has no idea what broke and has to scroll through hundreds of lines of terminal output.
The Architectural Standard: The agent must synthesize the cross-agent error logs. "I asked the Code Review Agent to analyze your PR, but it encountered a token limit error on massive_legacy_file.py. Should I ignore that specific file and review the rest, or would you like to chunk the file first?"
By preserving and summarizing the failure state of the subagent, you transform a frustrating system crash into a highly actionable decision prompt for the human operator.
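One way to make that summary consistent is to build it from the structured error payload rather than leaving the wording entirely to the model. This is a sketch under assumptions: build_escalation_prompt is a hypothetical helper, and the payload is assumed to contain the failing_agent, error_type, and details fields from the earlier example.

```python
def build_escalation_prompt(requested_task, payload, options):
    """Turn a subagent error payload into an actionable question for a human."""
    lines = [
        # What was asked, who failed, and why, in one sentence.
        f"I asked the {payload['failing_agent']} to {requested_task}, "
        f"but it failed with {payload['error_type']}: {payload['details']}",
        "How would you like to proceed?",
    ]
    # Offer concrete, numbered choices instead of an open-ended question.
    lines += [f"  {i}. {option}" for i, option in enumerate(options, start=1)]
    return "\n".join(lines)
```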