
Observability Beyond Logs: Tracing Decisions Across Multi-Agent Workflows

Master distributed tracing for multi-agent AI systems. Learn how to trace decisions, debug workflows, and achieve full observability beyond basic logs.

The Padiso Team
15 minutes read

Understanding the Problem: Why Logs Aren't Enough

You've deployed a multi-agent workflow. It ran for six hours, processed 10,000 tasks, and suddenly failed on decision #7,432. Your logs show error messages, but not why the agent made that decision, what context it had, or which other agents influenced the outcome.

This is the core problem with traditional logging in agent systems. Logs are linear, point-in-time records. They tell you what happened, but not why it happened or how it connects to the broader workflow. When you're running always-on AI agents at scale, especially in headless companies where agents are your operational backbone, you need visibility that goes beyond text records.

Traditional observability tools were built for monolithic applications and microservices. They assume linear request flows and synchronous call chains. Multi-agent systems are fundamentally different: agents operate asynchronously, make decisions based on incomplete information, delegate work to other agents, and iterate through reasoning loops. A single user request might trigger dozens of agent invocations, tool calls, and decision branches, and you need to trace all of it.

This is where distributed tracing becomes essential. Distributed tracing is an instrumentation technique that captures the entire execution path of a workflow across multiple components, showing not just what happened, but the causal relationships between events. For agent systems, this means you can see exactly which agent made which decision, what data it considered, which tools it called, and how that decision cascaded through your workflow.

Distributed Tracing Fundamentals for Agent Systems

Distributed tracing works by assigning a unique identifier to each workflow execution and propagating that identifier across every component that touches that workflow. Think of it like a tracking number on a package: the same number follows the package through every warehouse, delivery vehicle, and checkpoint, creating a complete audit trail.

In agent systems, a trace typically has this structure:

Trace: A single end-to-end workflow execution (e.g., "process customer inquiry")

Spans: Discrete units of work within that trace. A span might represent:

  • A single agent invocation
  • A tool call (API request, database query, file read)
  • A decision point (e.g., "should this task be escalated?")
  • A reasoning loop (e.g., "agent iterated 3 times before deciding")

Attributes: Key-value metadata attached to spans, such as:

  • Agent name and version
  • Model used (Claude, GPT-4, custom)
  • Decision made and confidence score
  • Input data and output data
  • Latency and token count
  • Success or failure status

Context propagation: The mechanism that carries the trace ID from one component to the next. When Agent A calls Agent B, the trace ID travels with that call, so both agents' work appears in the same trace.

The critical insight is that distributed tracing captures the causal chain of events. You're not just logging that Agent B was called; you're recording that Agent B was called because Agent A made Decision X, which was based on Data Y. This causality is what lets you answer the question: "Why did my workflow produce this outcome?"
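As a rough illustration of the structure above, the sketch below models spans as plain records and walks the parent links to recover the causal chain behind one span. It uses only the Python standard library; the Span class, its field names, and the causal_chain helper are hypothetical names chosen for this example, not part of any tracing SDK.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One unit of work in a trace (hypothetical model, not an SDK type)."""
    trace_id: str
    span_id: str
    name: str
    parent_id: Optional[str] = None
    attributes: dict = field(default_factory=dict)


def causal_chain(spans: list[Span], span_id: str) -> list[str]:
    """Return span names from the root down to the given span."""
    by_id = {s.span_id: s for s in spans}
    chain = []
    current = by_id.get(span_id)
    while current is not None:
        chain.append(current.name)
        current = by_id.get(current.parent_id) if current.parent_id else None
    return list(reversed(chain))


spans = [
    Span("t1", "a", "agent_a.invoke", attributes={"decision": "X"}),
    Span("t1", "b", "agent_b.invoke", parent_id="a"),
    Span("t1", "c", "tool.web_search", parent_id="b"),
]
print(" -> ".join(causal_chain(spans, "c")))
```

Because every span carries its parent's ID, the question "what led to this tool call?" becomes a lookup rather than a timestamp-matching exercise.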

Why Logs Fail for Multi-Agent Workflows

Conventional logging in agent systems typically produces output like this:

[2024-01-15 14:32:15] Agent: research_agent | Status: started | Task: investigate_competitor
[2024-01-15 14:32:18] Tool: web_search | Query: "acme_corp_product_roadmap" | Results: 12
[2024-01-15 14:32:22] Agent: research_agent | Status: completed | Output: 2 key findings
[2024-01-15 14:32:23] Agent: analysis_agent | Status: started | Input: 2 key findings
[2024-01-15 14:32:45] Agent: analysis_agent | Status: completed | Decision: ESCALATE

On the surface, this looks reasonable. But consider the problems:

1. No parent-child relationships: You can see that analysis_agent ran after research_agent, but you can't programmatically trace the data flow. If analysis_agent failed, you don't know which specific output from research_agent caused the failure.

2. No reasoning visibility: Logs show the final decision (ESCALATE) but not the reasoning that led to it. What data did the agent consider? Did it iterate multiple times? What confidence did it have?

3. No cross-agent context: If you have 50 agents running in parallel, logs become a soup of timestamps. You can't easily reconstruct which agents were working on the same logical task.

4. No tool call chains: When an agent calls a tool, which in turn triggers another agent, which calls another tool, logs don't show that nested relationship. You get a flat sequence instead of a tree.

5. Difficult aggregation: To answer questions like "how many times did agents make this decision type?" or "what's the average time from decision to outcome?", you need to parse and correlate logs manually. It's slow and error-prone.

6. No automatic alerting on decision patterns: With logs, you can alert on keywords ("ERROR", "FAILED"), but you can't easily alert on decision patterns ("agent made this decision 100 times in a row") without custom parsing.

Distributed tracing addresses these problems by capturing structure, causality, and relationships as first-class data.
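Once decisions are recorded as structured span attributes, a pattern such as "the same decision N times in a row" takes a few lines of code rather than a log-parsing job. The sketch below assumes decision spans arrive as dictionaries in time order; the decision_type field, the longest_run function, and the alert threshold are illustrative choices, not a feature of any particular backend.

```python
def longest_run(decision_spans: list[dict], decision_type: str) -> int:
    """Length of the longest consecutive run of one decision type."""
    best = current = 0
    for span in decision_spans:
        if span.get("decision_type") == decision_type:
            current += 1
            best = max(best, current)
        else:
            current = 0
    return best


decisions = [
    {"decision_type": "approve"},
    {"decision_type": "escalate"},
    {"decision_type": "escalate"},
    {"decision_type": "escalate"},
    {"decision_type": "approve"},
]
run = longest_run(decisions, "escalate")
if run >= 3:  # threshold is an arbitrary example value
    print(f"ALERT: 'escalate' chosen {run} times in a row")
```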

Implementing Distributed Tracing in Agent Orchestration

When you're running agent teams on an orchestration platform, distributed tracing should be built into the platform itself. Here's what a production-grade implementation looks like:

Instrumentation at the Agent Level

Every agent invocation should create a span. That span should capture:

  • Agent metadata: Name, version, model, system prompt hash
  • Input context: What triggered the agent? What data did it receive?
  • Decision points: When the agent reaches a decision (e.g., "should I retry?", "should I escalate?"), log both the decision and the reasoning
  • Tool calls: Each tool invocation is a child span, with the tool name, parameters, latency, and result
  • Output: What did the agent produce? What was its confidence?
  • Iteration count: Did the agent loop? How many times? Why did it stop?

For example, a research agent might produce a trace like this:

Trace: process_customer_inquiry_12345
├─ Span: research_agent.invoke
│  ├─ Attribute: agent_version = "2.1.0"
│  ├─ Attribute: model = "claude-3-opus"
│  ├─ Span: tool.web_search
│  │  ├─ Attribute: query = "customer_issue_type"
│  │  ├─ Attribute: duration_ms = 450
│  │  └─ Attribute: results_count = 15
│  ├─ Span: tool.internal_kb_search
│  │  ├─ Attribute: query = "similar_cases"
│  │  ├─ Attribute: duration_ms = 120
│  │  └─ Attribute: results_count = 3
│  ├─ Span: agent.decision_point
│  │  ├─ Attribute: decision = "escalate_to_human"
│  │  ├─ Attribute: confidence = 0.78
│  │  └─ Attribute: reasoning = "issue_matches_3_critical_cases"
│  └─ Attribute: total_duration_ms = 1200
└─ Span: escalation_agent.invoke
   ├─ Attribute: triggered_by = "research_agent.decision"
   ├─ Span: tool.create_ticket
   │  └─ Attribute: ticket_id = "TICKET-98765"
   └─ Attribute: status = "completed"

Notice the structure: the escalation_agent span has an attribute triggered_by that explicitly links it to the research_agent's decision. This is context propagation at work: the trace ID follows the workflow, and each span records its parent and the reason it was invoked.
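To show how parent links turn a flat list of span records into a tree like the one above, here is a small standard-library sketch. The record layout (id, parent, name) and the render_tree helper are assumptions made for this example.

```python
from collections import defaultdict


def render_tree(spans: list[dict]) -> list[str]:
    """Return indented lines, one per span, with children under their parent."""
    children = defaultdict(list)
    for span in spans:
        children[span["parent"]].append(span)

    lines = []

    def walk(parent_id, depth):
        for span in children[parent_id]:
            lines.append("  " * depth + span["name"])
            walk(span["id"], depth + 1)

    walk(None, 0)  # root spans have parent None
    return lines


flat_spans = [
    {"id": "1", "parent": None, "name": "research_agent.invoke"},
    {"id": "2", "parent": "1", "name": "tool.web_search"},
    {"id": "3", "parent": "1", "name": "tool.internal_kb_search"},
    {"id": "4", "parent": "1", "name": "agent.decision_point"},
    {"id": "5", "parent": None, "name": "escalation_agent.invoke"},
    {"id": "6", "parent": "5", "name": "tool.create_ticket"},
]
print("\n".join(render_tree(flat_spans)))
```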

Decision Tracing: The Heart of Agent Observability

For operators running agent-operated companies, decision tracing is critical. Every time an agent makes a decision, especially one that affects business logic, customer experience, or resource allocation, that decision must be traceable.

A production decision trace should include:

  • Decision type: A categorical label (e.g., "escalate", "approve", "retry", "delegate")
  • Decision data: The input data and context that informed the decision
  • Decision reasoning: A structured explanation of why this decision was made (not just free text, but structured reasoning steps)
  • Confidence score: How confident was the agent in this decision?
  • Alternative decisions considered: What else could the agent have decided, and why were those rejected?
  • Decision outcome: What happened as a result? Did it succeed?
  • Feedback loop: If the decision was wrong, what was the correction, and did the agent learn from it?

For example, an approval agent might trace a decision like this:

{
  "trace_id": "approve_request_99999",
  "span_id": "approval_decision_1",
  "decision_type": "approve",
  "decision_confidence": 0.92,
  "reasoning_steps": [
    {
      "step": 1,
      "description": "Check request against policy",
      "result": "matches_policy",
      "evidence": "request_amount < approval_limit"
    },
    {
      "step": 2,
      "description": "Verify requester credentials",
      "result": "valid",
      "evidence": "requester_role = manager, tenure > 1_year"
    },
    {
      "step": 3,
      "description": "Check for fraud patterns",
      "result": "no_anomalies",
      "evidence": "request_matches_historical_pattern"
    }
  ],
  "alternatives_considered": [
    {
      "decision": "deny",
      "confidence": 0.05,
      "reason": "no_policy_violations_found"
    },
    {
      "decision": "escalate",
      "confidence": 0.03,
      "reason": "confidence_threshold_met"
    }
  ],
  "outcome": "success",
  "timestamp": "2024-01-15T14:35:22Z"
}

This level of detail lets operators understand not just what the agent decided, but why. When a decision turns out to be wrong, you can audit the reasoning and adjust the agent's logic. This is essential for compliance, debugging, and continuous improvement.
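One way to keep decision records consistent is to build them from a typed structure and validate them before they are attached to a span. The sketch below is a minimal version under that assumption: the DecisionRecord and Alternative classes and the validate method are hypothetical, and only a subset of the fields from the JSON example is modeled.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Alternative:
    decision: str
    confidence: float
    reason: str


@dataclass
class DecisionRecord:
    trace_id: str
    span_id: str
    decision_type: str
    decision_confidence: float
    alternatives_considered: list = field(default_factory=list)

    def validate(self) -> None:
        """Raise ValueError if any confidence falls outside [0, 1]."""
        scores = [self.decision_confidence]
        scores += [alt.confidence for alt in self.alternatives_considered]
        for score in scores:
            if not 0.0 <= score <= 1.0:
                raise ValueError(f"confidence out of range: {score}")


record = DecisionRecord(
    trace_id="approve_request_99999",
    span_id="approval_decision_1",
    decision_type="approve",
    decision_confidence=0.92,
    alternatives_considered=[
        Alternative("deny", 0.05, "no_policy_violations_found"),
        Alternative("escalate", 0.03, "confidence_threshold_met"),
    ],
)
record.validate()
print(json.dumps(asdict(record), indent=2))
```

Validating at write time means a malformed confidence value fails loudly in the agent rather than silently skewing later aggregations.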

Building Observable Multi-Agent Workflows

When multiple agents work together, tracing becomes considerably more valuable, because the interactions between agents are exactly what flat logs fail to capture. Here's how to structure observable multi-agent workflows:

Agent Delegation and Context Flow

When Agent A delegates work to Agent B, the trace should clearly show:

  1. The delegation point: Where in Agent A's execution did it decide to delegate?
  2. The delegation reason: Why was Agent B chosen? Were there alternatives?
  3. The context passed: What data did Agent A give to Agent B?
  4. The result integration: How did Agent A use Agent B's output?

For example, a customer service workflow might look like:

Trace: handle_customer_complaint_77777
├─ Span: triage_agent.invoke
│  ├─ Attribute: input = "customer_complaint_text"
│  ├─ Span: agent.decision_point ("classify_complaint")
│  │  └─ Attribute: classification = "technical_issue"
│  ├─ Span: agent.decision_point ("delegate_to_specialist")
│  │  └─ Attribute: delegated_to = "technical_agent"
│  └─ Span: technical_agent.invoke (child of triage_agent)
│     ├─ Attribute: context_from_parent = "complaint_classification"
│     ├─ Span: tool.diagnose_system
│     │  └─ Attribute: diagnosis = "database_connection_timeout"
│     ├─ Span: agent.decision_point ("recommend_solution")
│     │  └─ Attribute: solution = "increase_connection_pool"
│     └─ Span: agent.decision_point ("escalate_to_engineering")
│        └─ Attribute: escalated_to = "engineering_agent"
└─ Span: engineering_agent.invoke
   ├─ Attribute: context_from_parent = "diagnosis_and_solution"
   ├─ Span: tool.create_incident
   │  └─ Attribute: incident_id = "INC-55555"
   └─ Attribute: status = "completed"

This structure shows the complete chain of delegation, including why each agent was chosen and what context was passed along. An operator can look at this trace and understand the entire decision flow without reading code.

Parallel Agent Execution and Synchronization

When agents run in parallel, tracing must show:

  1. Parallel spans: Which spans ran concurrently?
  2. Synchronization points: Where did the workflow wait for parallel agents to complete?
  3. Result aggregation: How were parallel results combined?

For example, a market research workflow might run multiple research agents in parallel:

Trace: market_research_88888
├─ Span: research_orchestrator.invoke
│  ├─ Span: competitor_research_agent.invoke (parallel, starts at T+0)
│  │  ├─ Span: tool.web_search
│  │  ├─ Span: tool.financial_data_lookup
│  │  └─ Attribute: duration_ms = 3500
│  ├─ Span: industry_analysis_agent.invoke (parallel, starts at T+0)
│  │  ├─ Span: tool.market_reports
│  │  ├─ Span: tool.trend_analysis
│  │  └─ Attribute: duration_ms = 2800
│  ├─ Span: customer_sentiment_agent.invoke (parallel, starts at T+0)
│  │  ├─ Span: tool.social_media_monitoring
│  │  ├─ Span: tool.review_aggregation
│  │  └─ Attribute: duration_ms = 4200
│  ├─ Span: synchronization_point (waits for all parallel agents)
│  │  └─ Attribute: wait_time_ms = 4200
│  ├─ Span: synthesis_agent.invoke
│  │  ├─ Attribute: input_from_agents = ["competitor_research", "industry_analysis", "customer_sentiment"]
│  │  └─ Span: agent.decision_point ("market_opportunity_assessment")
│  │     └─ Attribute: recommendation = "enter_market"
│  └─ Attribute: total_duration_ms = 5800

This shows that three agents ran in parallel: the parallel stage took 4.2 seconds of wall-clock time (the duration of the slowest agent) rather than the 10.5 seconds the same work would take sequentially. The synthesis agent started only after all three had completed. An operator can see where parallelization is working efficiently and where bottlenecks exist.
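The arithmetic behind that comparison can be sketched in a few lines, using the durations from the trace above. The parallel_stage_summary function is a hypothetical helper, and it assumes all spans in the stage start at the same moment.

```python
def parallel_stage_summary(durations_ms: dict[str, int]) -> dict:
    """Summarize a stage whose spans all start at the same time."""
    slowest = max(durations_ms, key=durations_ms.get)
    return {
        "wall_clock_ms": durations_ms[slowest],       # stage ends with the slowest span
        "sequential_ms": sum(durations_ms.values()),  # cost if run one after another
        "bottleneck": slowest,
    }


summary = parallel_stage_summary({
    "competitor_research_agent": 3500,
    "industry_analysis_agent": 2800,
    "customer_sentiment_agent": 4200,
})
print(summary)
```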

Querying and Analyzing Traces

Capturing traces is only half the battle. You need to be able to query and analyze them efficiently. A production observability system should support:

Trace Search and Filtering

You should be able to find traces by:

  • Trace ID: "Show me trace ABC123"
  • Span attributes: "Show me all traces where decision_type = 'escalate'"
  • Time range: "Show me all traces from 2-3 PM today"
  • Agent name: "Show me all invocations of the approval_agent"
  • Decision confidence: "Show me all traces where decision_confidence < 0.5"
  • Duration: "Show me all traces that took more than 10 seconds"
  • Error status: "Show me all traces with failed spans"

For example, if you want to investigate why approvals are taking longer than expected, you might query:

Span type = "approval_decision"
AND duration_ms > 5000
AND timestamp > "2024-01-15T10:00:00Z"
AND timestamp < "2024-01-15T12:00:00Z"

This returns all approval decisions that took longer than 5 seconds during the 10 AM-12 PM window, letting you investigate the slowdown.
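Query syntax differs between backends, so as a backend-neutral illustration, the same filter can be written against in-memory span records. The field names (span_type, duration_ms, timestamp), the slow_approvals function, and the sample data are assumptions for this sketch.

```python
from datetime import datetime, timezone


def slow_approvals(spans, min_ms, start, end):
    """Return approval_decision spans slower than min_ms within (start, end)."""
    return [
        s for s in spans
        if s["span_type"] == "approval_decision"
        and s["duration_ms"] > min_ms
        and start < s["timestamp"] < end
    ]


def utc(hour, minute=0):
    """Build a UTC timestamp on 2024-01-15 for the sample data."""
    return datetime(2024, 1, 15, hour, minute, tzinfo=timezone.utc)


spans = [
    {"span_type": "approval_decision", "duration_ms": 7200, "timestamp": utc(10, 30)},
    {"span_type": "approval_decision", "duration_ms": 900, "timestamp": utc(11, 0)},
    {"span_type": "approval_decision", "duration_ms": 6100, "timestamp": utc(13, 0)},
    {"span_type": "tool.web_search", "duration_ms": 9000, "timestamp": utc(11, 15)},
]
matches = slow_approvals(spans, 5000, utc(10), utc(12))
print(len(matches))
```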

Aggregation and Metrics

Beyond individual trace inspection, you need aggregated metrics:

  • Decision distribution: "What percentage of escalation decisions result in successful resolution?"
  • Latency percentiles: "What's the p95 latency for the approval workflow?"
  • Error rates by agent: "Which agent has the highest failure rate?"
  • Tool usage: "Which tools are called most frequently? Which are slowest?"
  • Agent efficiency: "How many iterations does the research agent typically need?"

These metrics let you identify patterns and trends that wouldn't be visible from individual traces.
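Two of these metrics, a latency percentile and a per-agent error rate, can be sketched with the standard library alone. The percentile function below uses the nearest-rank method, which is one of several common definitions; the record fields and sample values are illustrative assumptions.

```python
import math
from collections import defaultdict


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile (pct in 1-100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def error_rate_by_agent(spans: list[dict]) -> dict[str, float]:
    """Fraction of failed spans per agent."""
    totals, failures = defaultdict(int), defaultdict(int)
    for span in spans:
        totals[span["agent"]] += 1
        if span["status"] == "failed":
            failures[span["agent"]] += 1
    return {agent: failures[agent] / totals[agent] for agent in totals}


latencies_ms = list(range(100, 2100, 100))  # 20 sample latencies: 100..2000
spans = [
    {"agent": "approval_agent", "status": "ok"},
    {"agent": "approval_agent", "status": "failed"},
    {"agent": "research_agent", "status": "ok"},
    {"agent": "research_agent", "status": "ok"},
]
print(percentile(latencies_ms, 95))
print(error_rate_by_agent(spans))
```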

Trace Visualization and Waterfall Diagrams

A good observability tool renders traces as waterfall diagrams, showing:

  • The timeline of each span
  • Parent-child relationships
  • Which spans ran in parallel
  • Where time was spent

This visual representation makes it much easier to spot bottlenecks and understand workflow structure.

Implementing Tracing in Your Agent Stack

If you're building agent systems on Padiso's orchestration platform, distributed tracing is built in. But if you're building custom agent systems, here's how to implement tracing:

Instrumentation Libraries

Use an instrumentation library that supports OpenTelemetry, the industry standard for distributed tracing. OpenTelemetry provides SDKs for Python, Node.js, and other languages.

Basic instrumentation looks like this (Python example):

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def research_agent(query):
    # web_search and decide_next_action are placeholders for your own tool and agent logic
    with tracer.start_as_current_span("research_agent.invoke") as span:
        span.set_attribute("agent_name", "research_agent")
        span.set_attribute("query", query)

        # Tool call: recorded as a child span of research_agent.invoke
        with tracer.start_as_current_span("tool.web_search") as tool_span:
            tool_span.set_attribute("query", query)
            results = web_search(query)
            tool_span.set_attribute("result_count", len(results))

        # Decision point: here the decision function is assumed to return
        # both the decision and its confidence score
        with tracer.start_as_current_span("agent.decision_point") as decision_span:
            decision, confidence_score = decide_next_action(results)
            decision_span.set_attribute("decision", decision)
            decision_span.set_attribute("confidence", confidence_score)

        return decision

Each tracer.start_as_current_span() creates a new span, and attributes are attached with set_attribute(). The nesting automatically creates parent-child relationships.

Exporting Traces

Instrumented code needs to export traces to a backend. Common backends include:

  • Jaeger: Open-source, self-hosted
  • Zipkin: Open-source, self-hosted
  • Datadog: Managed service with advanced features
  • New Relic: Managed service with AI-specific features
  • Lightstep: Managed service focused on observability

Export configuration typically looks like:

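from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Note: the Jaeger Thrift exporter is deprecated in recent OpenTelemetry Python
# releases, and newer Jaeger versions accept OTLP, so check which exporter your setup needs.
jaeger_exporter = JaegerExporter(
    agent_host_name="localhost",
    agent_port=6831,
)

# Register the SDK provider, then batch finished spans to the exporter
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)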

Once configured, finished spans are batched and sent to the backend in the background.

Context Propagation

When agents call other services (other agents, APIs, databases), the trace context must be propagated. This is typically done via HTTP headers or message metadata.

For HTTP calls:

import requests
from opentelemetry.propagate import inject

headers = {}
inject(headers)  # Adds trace context headers
response = requests.get(url, headers=headers)  # url: the downstream agent or API endpoint

For message queues (if agents communicate via Kafka, RabbitMQ, etc.):

from opentelemetry.propagate import inject

message_headers = {}
inject(message_headers)
# queue stands in for your message-broker client; attach the headers however it supports metadata
queue.publish(message, headers=message_headers)

Context propagation ensures that when Agent A calls Agent B, the same trace ID flows through both, creating a unified view of the entire workflow.
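Injection is only half of propagation: the receiving agent has to read the context back out and parent its own spans to it. In OpenTelemetry that is the job of extract in the same opentelemetry.propagate module. The sketch below shows the round trip without any library, using the W3C traceparent header layout (version, trace ID, parent span ID, flags). The helper names are hypothetical and the validation is deliberately simplified.

```python
import re
import secrets

_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject_context(headers: dict, trace_id: str, span_id: str) -> None:
    """Caller side: write the trace context into outgoing headers."""
    headers["traceparent"] = f"00-{trace_id}-{span_id}-01"  # 01 = sampled


def extract_context(headers: dict):
    """Receiver side: return (trace_id, parent_span_id), or None if absent or invalid."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    return (match.group(1), match.group(2)) if match else None


# Agent A starts a trace and calls Agent B.
trace_id = secrets.token_hex(16)   # 32 hex characters
span_a = secrets.token_hex(8)      # 16 hex characters
outgoing = {}
inject_context(outgoing, trace_id, span_a)

# Agent B continues the same trace with a new span parented to A's span.
received_trace_id, parent_span_id = extract_context(outgoing)
span_b = {"trace_id": received_trace_id, "span_id": secrets.token_hex(8),
          "parent_id": parent_span_id}
print(span_b["trace_id"] == trace_id)
```

If the receiving side skips extraction, Agent B starts a fresh trace and the link back to Agent A's decision is lost.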

Real-World Tracing Scenarios

Here are concrete examples of how distributed tracing solves real problems in agent systems:

Scenario 1: Debugging a Failed Approval Workflow

An approval workflow failed after 45 minutes. The error log just says "timeout". With distributed tracing, you can:

  1. Query for the failed trace
  2. See that the approval_agent delegated to a compliance_check_agent
  3. See that compliance_check_agent called an external API
  4. See that the external API call took 44 minutes (about 98% of the total time)
  5. Identify the root cause: the external API was slow
  6. Make a decision: add a timeout, use a cached result, or optimize the API call

Without tracing, you'd be guessing. With tracing, the root cause is obvious.
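The "where did the time go" step can be automated by computing each span's self time, meaning its duration minus the time spent in its direct children. The sketch below assumes children run one after another and that span names are unique within the trace; the 44.5-minute figure for compliance_check_agent and the tool.external_api span name are illustrative values added for the example.

```python
def self_times(spans: list[dict]) -> dict[str, float]:
    """Duration of each span minus the time spent in its direct children."""
    result = {s["name"]: s["minutes"] for s in spans}
    by_id = {s["id"]: s for s in spans}
    for span in spans:
        parent = by_id.get(span["parent"])
        if parent is not None:
            result[parent["name"]] -= span["minutes"]
    return result


trace = [
    {"id": 1, "parent": None, "name": "approval_agent.invoke", "minutes": 45.0},
    {"id": 2, "parent": 1, "name": "compliance_check_agent.invoke", "minutes": 44.5},
    {"id": 3, "parent": 2, "name": "tool.external_api", "minutes": 44.0},
]
times = self_times(trace)
culprit = max(times, key=times.get)
share = times[culprit] / trace[0]["minutes"]
print(f"{culprit}: {share:.1%} of total")
```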

Scenario 2: Understanding Decision Patterns

You notice that your escalation rate is 15%, but you expected 5%. With distributed tracing, you can:

  1. Query for all escalation decisions in the past week
  2. Aggregate by decision_confidence
  3. See that 80% of escalations have confidence < 0.6
  4. Drill into those low-confidence escalations to see the reasoning
  5. Discover that a recent change to the triage_agent's prompt is causing it to be overly cautious
  6. Adjust the prompt and re-test

Again, without tracing, you'd only see the aggregate escalation rate. With tracing, you can understand the decision patterns and make targeted improvements.

Scenario 3: Optimizing Parallel Agent Execution

Your market research workflow has a synchronization point where it waits for three parallel agents to finish. Traces show:

  • competitor_research_agent: 2.5 seconds
  • industry_analysis_agent: 1.8 seconds
  • customer_sentiment_agent: 8.2 seconds

The workflow is bottlenecked by customer_sentiment_agent. Traces show that it spends 7 seconds calling an external sentiment API. You can:

  1. Optimize the API call (batch requests, use caching)
  2. Start fetching sentiment data earlier so it overlaps with upstream work (pre-fetch data)
  3. Use a faster sentiment API

Without tracing, you wouldn't know where the time was being spent.

Best Practices for Agent Tracing

Here are key practices for getting the most out of distributed tracing:

1. Trace Every Decision

Every time an agent makes a decision that affects downstream behavior, create a decision span with reasoning, confidence, and alternatives considered. This is the core of agent observability.

2. Attach Structured Data

Don't just log free text. Attach structured attributes:

span.set_attribute("decision_type", "escalate")
span.set_attribute("decision_confidence", 0.78)
span.set_attribute("reason_code", "POLICY_VIOLATION")

Structured data is queryable and aggregatable. Free text is not.

3. Include Input and Output

For each span, include what went in and what came out:

span.set_attribute("input_size_bytes", len(input_data))
span.set_attribute("output_size_bytes", len(output_data))
span.set_attribute("output_type", type(output_data).__name__)

This helps you understand data flow and spot anomalies.

4. Track Iteration and Retry Loops

If an agent iterates multiple times before reaching a decision, trace each iteration:

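for iteration in range(max_iterations):
    # Keep the span name constant and put the counter in an attribute
    with tracer.start_as_current_span("agent.reasoning_iteration") as span:
        span.set_attribute("iteration", iteration)
        result = reason_step()
        if result.done:
            span.set_attribute("stop_reason", "goal_reached")
            break

This shows how many iterations were needed, and the stop_reason attribute records why the loop ended.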

5. Correlate Traces with Business Outcomes

After a workflow completes, record the business outcome in the trace:

span.set_attribute("business_outcome", "customer_satisfied")
span.set_attribute("resolution_time_minutes", 12)
span.set_attribute("escalation_required", False)

This lets you correlate decision quality with actual outcomes.

6. Set Appropriate Span Names

Span names should be descriptive and consistent:

  • Good: approval_agent.invoke, tool.database_query, decision.escalate
  • Bad: process, call, execute

Consistent naming makes traces easier to search and analyze.

7. Use Baggage for Workflow Context

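Baggage is key-value metadata that travels with the trace context across process boundaries. Use it for workflow-level context:

from opentelemetry import baggage, context

# set_baggage returns a new context rather than modifying the current one
ctx = baggage.set_baggage("workflow_id", "process_customer_inquiry_12345")
ctx = baggage.set_baggage("customer_id", "CUST-99999", context=ctx)
ctx = baggage.set_baggage("priority", "high", context=ctx)
token = context.attach(ctx)  # make this baggage current; call context.detach(token) when done

Downstream code can read these values with baggage.get_baggage(). Baggage is not copied onto spans automatically, so also set the entries you want to filter and correlate on as span attributes.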

Integrating Tracing with Your Agent Orchestration Platform

When you use Padiso for agent orchestration, tracing is integrated into the platform. You get:

  • Automatic span creation for every agent invocation
  • Built-in tool call tracing for all integrated tools
  • Decision capture with reasoning and confidence scores
  • Trace visualization in the Padiso dashboard
  • Queryable trace data for analysis and debugging

You can also export traces to external backends (Datadog, New Relic, etc.) for deeper analysis.

For custom agent implementations, you'll need to instrument your code with an observability library. The Padiso documentation includes examples of integrating OpenTelemetry with Padiso agents.

Observability as a Competitive Advantage

For founders building headless companies and operators scaling agent teams, observability isn't optional. It's a competitive advantage.

Here's why:

Speed of debugging: When something goes wrong, tracing lets you identify the root cause in minutes, not hours. This means faster fixes and less downtime.

Continuous improvement: Traces show you exactly how your agents are behaving, letting you optimize decisions, reduce latency, and improve quality systematically.

Compliance and auditability: For regulated industries, traces provide a complete audit trail of every decision an agent made and why. This is essential for compliance.

Cost optimization: By understanding where time and resources are being spent, you can optimize your agent workflows and reduce costs.

Scaling with confidence: As you add more agents and workflows, tracing gives you visibility into system behavior. You can scale confidently because you understand what's happening.

Getting Started with Agent Tracing

If you're ready to implement distributed tracing in your agent systems, here's the roadmap:

Phase 1: Basic instrumentation (1-2 weeks)

  • Add span creation for agent invocations
  • Capture basic attributes (agent name, model, input/output size)
  • Export traces to a backend (Jaeger or managed service)

Phase 2: Decision tracing (2-3 weeks)

  • Add decision spans with reasoning and confidence
  • Capture alternatives considered
  • Correlate decisions with outcomes

Phase 3: Advanced analysis (3-4 weeks)

  • Build dashboards for decision patterns
  • Create alerts for anomalies
  • Integrate with your incident response process

Phase 4: Continuous optimization (ongoing)

  • Use trace data to identify optimization opportunities
  • A/B test agent prompts and logic
  • Track improvements over time

For Padiso users, much of this is already built in. Check the pricing page to see which tiers include advanced observability features, and review the integrations available for external observability tools.

Conclusion: From Black Box to Transparent Systems

Distributed tracing transforms agent systems from black boxes into transparent, auditable, and optimizable systems. You move from asking "What happened?" to answering "Why did this happen, and how can we do better?"

For operators running always-on agent teams, this transparency is essential. It's the difference between managing systems you don't understand and managing systems you can reason about, debug, and optimize.

Start with basic tracing: capture agent invocations and tool calls. Then add decision tracing: capture the reasoning behind every important decision. Finally, build analysis and optimization on top of your trace data.

The investment in observability pays dividends in speed, quality, and confidence. And as your agent systems grow, observability becomes your most valuable operational tool.

For more on building observable agent systems, explore Padiso's resources on agent orchestration and operational best practices. The Padiso team can also help you design an observability strategy for your specific use case.