How to Diagnose Task Failures in LLM Multi-Agent Systems: A Step-by-Step Guide
Introduction
When an LLM-powered multi-agent system fails, identifying which agent caused the failure and when it happened is like searching for a needle in a haystack. Recent research from Penn State University and Duke University, in collaboration with Google DeepMind and other institutions, introduces automated failure attribution as a solution. Their benchmark dataset Who&When and open-source tools provide a systematic way to pinpoint the root cause. This guide walks you through the process of using these methods to diagnose task failures efficiently.

What You Need
- Multi-agent system logs – Detailed interaction records between agents (e.g., message exchanges, tool calls).
- Task definitions – Clear description of the task the system was trying to complete.
- Failure event description – A record of when and how the system failed (e.g., incorrect output, deadlock).
- Access to the Who&When benchmark – Open-source code and dataset for testing attribution methods.
- Basic programming environment – Python, Jupyter notebook, or similar to run attribution scripts.
- Familiarity with LLM multi-agent architectures – Understanding of agent roles, communication patterns, and failure modes.
Step-by-Step Instructions
Step 1: Collect and Organize Interaction Logs
Gather all logs from the multi-agent run that ended in failure. Include timestamps, agent IDs, messages sent/received, and any internal agent states. If your system uses a centralized orchestrator, extract the full conversation history. For decentralized systems, merge logs from each agent by timestamp to create a unified timeline. Save the logs in a structured format (e.g., JSON or CSV) for easy processing.
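For decentralized systems, the timestamp merge can be sketched in a few lines of Python. The event fields shown here (`timestamp`, `agent_id`, `message`) are illustrative; adapt them to whatever your agents actually log:

```python
def merge_agent_logs(per_agent_events):
    """Merge per-agent event lists into one timeline sorted by timestamp.

    Each event is assumed to be a dict with at least 'timestamp',
    'agent_id', and 'message' keys (an illustrative schema, not a
    required format).
    """
    merged = [event for events in per_agent_events for event in events]
    return sorted(merged, key=lambda event: event["timestamp"])
```

Once merged, the unified timeline can be dumped to JSON or CSV for the attribution step.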
Step 2: Define the Failure Event
Clearly specify what constitutes a failure for your task. Examples: the final answer is incorrect, an agent halted without completing its subtask, or the system entered an infinite loop. Document the exact point where the failure became observable – this will be your ground truth for evaluating attribution methods. The Who&When dataset provides labeled failures for benchmarking, but you need to create your own labels for custom systems.
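A minimal failure label for a custom system might look like the record below. The field names are hypothetical, chosen for illustration rather than taken from the Who&When schema:

```python
failure_label = {
    "task_id": "demo-001",                # hypothetical task identifier
    "failure_description": "Final answer contradicts the retrieved source",
    "observed_at_step": 47,               # message index where the failure became visible
    "ground_truth_agent": "agent_3",      # responsible agent, if known
    "ground_truth_step": 12,              # decisive step, if known
}
```

Keeping the observable failure point separate from the ground-truth cause matters: attribution methods are evaluated on how well they recover the latter from the former.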
Step 3: Apply an Automated Attribution Method
Choose one of the attribution approaches from the research:
- Perturbation-based: Re-run the system after removing or swapping agents’ contributions to see which change fixes the failure.
- LLM-based reasoning: Use a strong LLM (e.g., GPT-4) to analyze the logs and provide a textual explanation for the failure, then extract the responsible agent and timestamp.
- Gradient-based (if applicable): For systems with differentiable components, compute gradients to identify sensitive agents or interactions.
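The perturbation-based approach can be sketched as a simple ablation loop. The `run_system` hook below is hypothetical; it stands in for whatever re-runs your pipeline with one agent's contribution removed and reports whether the task succeeds:

```python
def perturbation_attribution(run_system, agent_ids):
    """Ablate each agent in turn; agents whose removal fixes the task are candidates.

    run_system(ablate=agent_id) -> bool is a hypothetical hook that re-runs
    the pipeline without the given agent's contribution and returns True
    if the task now succeeds.
    """
    candidates = []
    for agent_id in agent_ids:
        if run_system(ablate=agent_id):  # failure disappeared without this agent
            candidates.append(agent_id)
    return candidates
```

Note that re-running an LLM system is stochastic, so in practice you may want several repetitions per ablation before counting an agent as a candidate.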
Implement the method using the open-source code. For LLM-based reasoning, craft a prompt that includes the failure description, the full log, and asks for the “who” and “when” in a structured answer.
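For the LLM-based approach, the prompt construction might look like the sketch below. The wording and output format are illustrative, not the prompt used in the research:

```python
def build_attribution_prompt(task, failure, log_events):
    """Assemble a failure-attribution prompt from the task, failure, and log.

    log_events is a list of dicts with 'agent_id' and 'message' keys
    (an assumed schema).
    """
    transcript = "\n".join(
        f"[step {i}] {e['agent_id']}: {e['message']}"
        for i, e in enumerate(log_events)
    )
    return (
        "You are debugging a multi-agent LLM system.\n"
        f"Task: {task}\n"
        f"Observed failure: {failure}\n"
        f"Conversation log:\n{transcript}\n\n"
        "Identify the agent responsible for the failure and the decisive step. "
        'Answer as JSON: {"who": "<agent_id>", "when": <step_index>}.'
    )
```

Requesting a structured JSON answer makes the "who" and "when" easy to parse programmatically in the next step.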
Step 4: Analyze the Attribution Results
Examine the output of your chosen method. It should indicate an agent ID and a timestep (or message index). Compare this with your own analysis or ground truth labels if available. If the attribution method suggests multiple candidates, prioritize those that appear consistently across different methods. Document the reasoning: e.g., “Agent 3 failed because it received incorrect information from Agent 2 at timestep 47, leading to a cascading error.”
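Cross-method agreement can be computed with a simple vote over (agent, step) pairs; this is one plausible way to prioritize consistent candidates, not a procedure prescribed by the research:

```python
from collections import Counter

def consensus_attribution(method_results):
    """Return the (agent_id, step) pair most attribution methods agree on.

    method_results: list of (agent_id, step) tuples, one per method run.
    """
    counts = Counter(method_results)
    (best, votes), = counts.most_common(1)
    return best, votes
```

A low vote count relative to the number of methods is itself a useful signal that the failure may have multiple interacting causes.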
Step 5: Iterate and Fix the System
Based on the attribution, modify the responsible agent’s instructions, tool access, or communication protocol. Re-run the system to verify the fix. Automated attribution is not a one-time analysis; use it as a debugging loop. The Who&When benchmark includes multiple failure scenarios, allowing you to test your attribution method’s robustness across different tasks.
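The debugging loop described above can be expressed as a small driver. All three callables are hypothetical hooks into your own system:

```python
def debug_loop(run_system, attribute, apply_fix, max_rounds=5):
    """Repeat run -> attribute -> fix until the task succeeds or rounds run out.

    run_system() -> (success: bool, logs), attribute(logs) -> (who, when),
    and apply_fix(who, when) are hypothetical hooks into your system.
    Returns the number of fixes applied before success, or None.
    """
    for round_idx in range(max_rounds):
        success, logs = run_system()
        if success:
            return round_idx
        who, when = attribute(logs)
        apply_fix(who, when)
    return None  # still failing after max_rounds
```

Capping the number of rounds guards against fixes that merely shift the failure elsewhere instead of resolving it.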
Tips for Success
- Start with simple failure cases – Single-agent mistakes are easier to attribute. Gradually move to complex multi-step failures.
- Combine multiple attribution methods – Cross-validating results from perturbation and LLM reasoning increases confidence.
- Automate log collection – Integrate logging into your system from the start to avoid manual archaeology.
- Use the Who&When dataset for training – Even if your system is different, the dataset helps you refine your attribution pipeline.
- Document your attribution process – Keep a record of what methods worked and why; this builds institutional knowledge.
- Don’t rely solely on automated attribution – Manual inspection of a few key logs is still valuable for catching patterns the automation might miss.