L1: Troubleshooting Common Issues

Troubleshooting Common Issues & Cross-Agent Error Propagation

In Module 9, we successfully integrated our agents into the CI/CD pipeline, GitHub, and Rally. However, enterprise systems are chaotic. APIs time out, databases lock, and edge cases inevitably break your prompts. When an autonomous system fails, it can fail silently and catastrophically.

This lesson covers how to troubleshoot the most common agentic failures and how to architect Cross-Agent Error Propagation so that a single subagent's error does not cause a cascading collapse of your entire pipeline.

1. Troubleshooting Common Agentic Failures

Before you can architect error handling, you must be able to diagnose the three most common ways an agentic workflow breaks in production:

  • The Infinite Tool Loop: The agent encounters an error, attempts to fix it by calling the exact same tool with the exact same parameters, and gets stuck in a loop until it exhausts your token budget.

    • The Fix: Implement a programmatic max_retries counter in your application code. If the counter hits 3, forcefully inject a system message: "You have attempted this action 3 times and failed. Stop calling this tool and return a FATAL_ERROR."
  • Hallucinated Tool Calls: The agent attempts to use a tool that does not exist in its provided schema, or it invents parameters.

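    • The Fix: This usually indicates an overly vague System Prompt or unclear tool descriptions. Tighten both, use the tool_choice parameter to constrain which tool the model may call, and validate each call's name and parameters against the schema in your application code before executing it.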
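  • The Cascading Failure: The Coordinator (Hub) relies on data from a Subagent (Spoke). The Subagent encounters a database timeout and, instead of failing gracefully, hallucinates a fake dataset to "please" the Coordinator. The Coordinator then deploys broken code based on fake data.

    • The Fix: Never let a failed Subagent answer in free text. Intercept the failure in application code and return a structured error to the Coordinator, as described in Section 2.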
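The first two fixes above can be sketched in application code. This is a minimal illustration, not a definitive implementation: the ToolCallGuard class, its method names, and the run_sql tool are hypothetical, and the sketch assumes your agent loop calls check() before executing a tool and record_failure() whenever a tool call errors.

```python
import json


class ToolCallGuard:
    """Hypothetical helper that tracks tool calls for a single agent run."""

    def __init__(self, allowed_tools, max_retries=3):
        self.allowed_tools = set(allowed_tools)
        self.max_retries = max_retries
        self.failures = {}  # (tool name, serialized arguments) -> failure count

    def _key(self, tool_name, arguments):
        # Sort keys so the same arguments always serialize identically.
        return (tool_name, json.dumps(arguments, sort_keys=True))

    def check(self, tool_name, arguments):
        """Return None if the call may proceed, else a message to inject."""
        if tool_name not in self.allowed_tools:
            # Hallucinated tool: reject before anything is executed.
            return (
                f"Tool '{tool_name}' does not exist. "
                f"Available tools: {sorted(self.allowed_tools)}."
            )
        if self.failures.get(self._key(tool_name, arguments), 0) >= self.max_retries:
            # Infinite tool loop: same tool, same parameters, too many failures.
            return (
                f"You have attempted this action {self.max_retries} times and failed. "
                "Stop calling this tool and return a FATAL_ERROR."
            )
        return None

    def record_failure(self, tool_name, arguments):
        """Count one failed attempt for this exact tool and argument pair."""
        key = self._key(tool_name, arguments)
        self.failures[key] = self.failures.get(key, 0) + 1


guard = ToolCallGuard(allowed_tools=["run_sql"])
call_args = {"query": "SELECT * FROM q3_revenue"}
for _ in range(3):
    guard.record_failure("run_sql", call_args)
print(guard.check("run_sql", call_args))
```

Keying the counter on the tool name plus its serialized arguments means a retry with different parameters is still allowed; only exact repeats are cut off.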

2. Cross-Agent Error Propagation (The Architectural Standard)

To prevent cascading failures, errors must be explicit, structured, and forcefully propagated up the chain of command.

When a Subagent fails, it must never return conversational apologies (e.g., "I'm sorry, I couldn't find the data"). Conversational text forces the Coordinator agent to guess whether the task succeeded or failed.

The Implementation:

When a Subagent terminates unsuccessfully, your application code must intercept the failure and format the final output as a strict JSON error schema before passing it back to the Coordinator's context window.

Example Cross-Agent Error Payload:

{
  "status": "FATAL_SUBAGENT_ERROR",
  "failing_agent": "SQL_Query_Subagent",
  "error_type": "SCHEMA_MISMATCH",
  "details": "The requested table 'q3_revenue' does not exist in the database.",
  "attempted_steps": 4
}
  

Architectural Advantage: When the Coordinator receives this JSON block as a tool_result, it can branch on the status and error_type fields and follow an explicit exception-handling protocol from its System Prompt, rather than guessing from free text and blindly continuing the workflow.
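One way to intercept a Subagent failure and produce the payload above is a thin wrapper around the Subagent call. This is a sketch under assumptions: run_subagent_safely, build_error_payload, and the run_fn callable are hypothetical names, and the "OK" status and "UNKNOWN" error type are illustrative values that do not appear in the schema above.

```python
import json


def build_error_payload(failing_agent, error_type, details, attempted_steps):
    """Serialize a failure using the field names from the example payload."""
    return json.dumps({
        "status": "FATAL_SUBAGENT_ERROR",
        "failing_agent": failing_agent,
        "error_type": error_type,
        "details": details,
        "attempted_steps": attempted_steps,
    })


def run_subagent_safely(name, task, run_fn, attempted_steps=1):
    """Run a subagent and always return JSON, never conversational text.

    run_fn is a hypothetical callable that executes the subagent and
    raises an exception when it cannot complete the task.
    """
    try:
        result = run_fn(task)
    except TimeoutError as exc:
        return build_error_payload(name, "TIMEOUT", str(exc), attempted_steps)
    except Exception as exc:
        # Assumed catch-all type; map your own exceptions to specific types.
        return build_error_payload(name, "UNKNOWN", str(exc), attempted_steps)
    # Assumed success envelope so the Coordinator can branch on "status".
    return json.dumps({"status": "OK", "agent": name, "result": result})


def failing_query(task):
    raise TimeoutError("database did not respond within 30s")


print(run_subagent_safely("SQL_Query_Subagent", "fetch revenue", failing_query))
```

Because the wrapper lives in application code, the Subagent never gets the chance to paper over the failure with invented data.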

3. The Coordinator as an Exception Handler

To make the Hub-and-Spoke model resilient, the Coordinator agent must be explicitly programmed to act as a router for errors, not just a router for tasks.

Your Coordinator's prompt must include a Fault Tolerance Matrix:

  • "If a Subagent returns a TIMEOUT error, wait and retry the subagent exactly once."

  • "If a Subagent returns a SCHEMA_MISMATCH error, do NOT retry. Trigger the Database_Introspection_Subagent to find the correct table name, then deploy the original Subagent again."

  • "If a Subagent returns a FATAL_SUBAGENT_ERROR, immediately halt the overarching workflow, log the failure, and escalate to the human operator."
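The same matrix can also be mirrored in application code so that routing does not depend on the prompt alone. The sketch below is one possible encoding, not a prescribed design: the action names, the decide function, and the choice to route on the error_type field are assumptions, as is treating any unlisted error type as halt-and-escalate.

```python
# Hypothetical encoding of the Fault Tolerance Matrix, keyed on error_type.
FAULT_TOLERANCE_MATRIX = {
    "TIMEOUT": {"action": "RETRY", "max_retries": 1},
    "SCHEMA_MISMATCH": {
        "action": "INTROSPECT_THEN_RETRY",
        "helper": "Database_Introspection_Subagent",
    },
    "FATAL_SUBAGENT_ERROR": {"action": "HALT_AND_ESCALATE"},
}

# Assumption: anything the matrix does not list is treated as fatal.
DEFAULT_POLICY = {"action": "HALT_AND_ESCALATE"}


def decide(payload, retries_so_far=0):
    """Pick the Coordinator's next action for a subagent error payload."""
    policy = FAULT_TOLERANCE_MATRIX.get(payload.get("error_type"), DEFAULT_POLICY)
    if policy["action"] == "RETRY" and retries_so_far >= policy["max_retries"]:
        # The retry budget is spent, so stop retrying and escalate.
        return DEFAULT_POLICY
    return policy


print(decide({"error_type": "TIMEOUT"}))
print(decide({"error_type": "TIMEOUT"}, retries_so_far=1))
print(decide({"error_type": "SCHEMA_MISMATCH"}))
```

Keeping the retry budget in code rather than in the prompt means "exactly once" is enforced even if the model miscounts its own attempts.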

4. Multi-Tiered Escalation Patterns

When an error reaches the threshold where the agent cannot self-correct, architects deploy multi-tiered patterns to ensure the system degrades gracefully.

  1. Tier 1: Alternative Routing (The Pivot)

    • If the primary path fails, the Coordinator attempts a secondary path. For example, if the Search_Internal_Wiki_Agent returns NOT_FOUND, the Coordinator pivots and triggers the Search_Jira_Tickets_Agent before giving up.
  2. Tier 2: Graceful Degradation (Partial Fulfillment)

    • If a multi-step task partially fails, the system returns what it successfully completed. For example: "I successfully generated the backend code and unit tests, but the API_Docs_Agent failed due to a token limit. The code is ready for review, but documentation is pending."
  3. Tier 3: The Dead-Letter Queue (DLQ)

    • For asynchronous CI/CD tasks, if an agent gets completely stuck, your system should package the Agent's entire state (Session ID, current context window, and the exact JSON error payload) and push it into a Dead-Letter Queue (e.g., an AWS SQS queue or a dedicated database table). A human engineer can review the DLQ later to debug the architectural flaw without halting the rest of the pipeline.
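The three tiers can be sketched together as follows. This is illustrative only: run_with_escalation, partial_result, and the "OK", "PARTIAL", "FAILED", and "DEAD_LETTERED" status values are hypothetical, the primary and fallback callables stand in for real Subagents, and an in-memory deque stands in for an SQS queue or database table.

```python
import json
from collections import deque

# In-memory stand-in for a real Dead-Letter Queue (assumption for this sketch).
dead_letter_queue = deque()


def run_with_escalation(task, primary, fallback, session_id, context):
    """Tier 1, then Tier 3: try the primary path, pivot, then dead-letter.

    primary and fallback are hypothetical callables that return a dict
    with at least a "status" key.
    """
    first = primary(task)
    if first["status"] == "OK":
        return first
    second = fallback(task)  # Tier 1: alternative routing
    if second["status"] == "OK":
        return second
    # Tier 3: package the full state for later human review.
    dead_letter_queue.append(json.dumps({
        "session_id": session_id,
        "context": context,
        "errors": [first, second],
    }))
    return {"status": "DEAD_LETTERED", "session_id": session_id}


def partial_result(step_results):
    """Tier 2: report which steps completed and which are still pending."""
    completed = [name for name, r in step_results.items() if r["status"] == "OK"]
    pending = [name for name, r in step_results.items() if r["status"] != "OK"]
    if not pending:
        status = "OK"
    elif completed:
        status = "PARTIAL"
    else:
        status = "FAILED"
    return {"status": status, "completed": completed, "pending": pending}


not_found = lambda task: {"status": "FATAL_SUBAGENT_ERROR", "error_type": "NOT_FOUND"}
print(run_with_escalation("find runbook", not_found, not_found, "session-1", ["..."]))
```

Returning a small DEAD_LETTERED marker instead of raising lets the rest of the pipeline continue while the stuck task waits for review.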

5. State Preservation During Human Escalation

When a Coordinator finally escalates an issue to a human engineer in a live session (like a Claude Code terminal), context preservation is critical.

  • The Anti-Pattern: The agent outputs, "An error occurred. What should I do?" The human has no idea what broke and has to scroll through hundreds of lines of terminal output.

  • The Architectural Standard: The agent must synthesize the cross-agent error logs. For example: "I asked the Code Review Agent to analyze your PR, but it encountered a token limit error on massive_legacy_file.py. Should I ignore that specific file and review the rest, or would you like to chunk the file first?"
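A small helper can turn the structured error payload from Section 2 into a decision prompt of this kind. The build_escalation_prompt function and the sample payload values below are hypothetical; in practice the Coordinator model would usually phrase the summary itself, and this sketch only shows which fields it should draw on.

```python
def build_escalation_prompt(payload, options):
    """Summarize a subagent error payload and list concrete next steps."""
    lines = [
        f"{payload['failing_agent']} failed with {payload['error_type']} "
        f"after {payload['attempted_steps']} step(s).",
        f"Details: {payload['details']}",
        "How would you like to proceed?",
    ]
    # Number the options so the human can answer with a single digit.
    lines += [f"  {i}. {option}" for i, option in enumerate(options, start=1)]
    return "\n".join(lines)


# Illustrative payload modeled on the example in the bullet above.
payload = {
    "status": "FATAL_SUBAGENT_ERROR",
    "failing_agent": "Code_Review_Agent",
    "error_type": "TOKEN_LIMIT",
    "details": "massive_legacy_file.py exceeds the context window.",
    "attempted_steps": 2,
}
print(build_escalation_prompt(
    payload,
    ["Skip that file and review the rest", "Chunk the file first"],
))
```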

By preserving and summarizing the failure state of the subagent, you transform a frustrating system crash into a highly actionable decision prompt for the human operator.