ToolTrace - Kevin Zhang

The Problem

When LLM agents fail at tool use (selecting the wrong tool, passing invalid parameters, violating policy constraints), the standard response is to improve the agent: better prompts, fine-tuning, reinforcement learning. But the agent is often frozen and API-only, and many failures are not the agent's fault. A tool that accepts invalid parameters without validation will be misused by any model. A schema that describes a function ambiguously will mislead any agent.

These are infrastructure problems. They demand infrastructure solutions.

The Approach

ToolTrace is an automated feedback loop that analyzes failed execution traces from a frozen agent and generates persistent, programmatic modifications to the tool layer. The agent and model weights are never touched. All improvements come from modifying the tool infrastructure: wrappers that enforce preconditions, composite tools that bundle multi-step sequences, rewritten schemas, and runtime linter rules.
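The loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not ToolTrace's actual implementation: `Trace`, `Modification`, `ToolLayer`, `improve`, and `stub_analyst` are hypothetical names, and the analyst (an LLM reading the failed trace in the real system) is replaced by a hard-coded stub.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

Tool = Callable[[str], str]


@dataclass
class Trace:
    task_id: str
    tool_calls: list  # (tool name, arguments dict) pairs
    passed: bool


@dataclass
class Modification:
    tool_name: str
    note: str
    make_wrapper: Callable[[Tool], Tool]


@dataclass
class ToolLayer:
    tools: dict
    applied: list = field(default_factory=list)

    def apply(self, mod: Modification) -> None:
        # Persist the change in the tool layer; skip duplicates of the same fix.
        if mod.note in self.applied:
            return
        self.tools[mod.tool_name] = mod.make_wrapper(self.tools[mod.tool_name])
        self.applied.append(mod.note)


def improve(layer: ToolLayer, traces: list, analyst) -> ToolLayer:
    # Only failed traces are analyzed; the agent itself is never modified.
    for trace in traces:
        if trace.passed:
            continue
        mod = analyst(trace)
        if mod is not None:
            layer.apply(mod)
    return layer


ELIGIBLE = {"R2001"}  # hypothetical stand-in for a real policy lookup


def raw_cancel(reservation_id: str) -> str:
    return "Reservation cancelled."


def stub_analyst(trace: Trace) -> Optional[Modification]:
    # Stand-in for the LLM analyst: hard-codes a single diagnosis.
    if not any(name == "cancel_reservation" for name, _ in trace.tool_calls):
        return None

    def make_wrapper(inner: Tool) -> Tool:
        def guarded(reservation_id: str) -> str:
            if reservation_id not in ELIGIBLE:
                return "Blocked: not eligible for cancellation."
            return inner(reservation_id)
        return guarded

    return Modification("cancel_reservation",
                        "enforce cancellation eligibility", make_wrapper)


layer = ToolLayer(tools={"cancel_reservation": raw_cancel})
failed = Trace("t1", [("cancel_reservation", {"reservation_id": "R1042"})],
               passed=False)
improve(layer, [failed, failed], stub_analyst)
print(layer.tools["cancel_reservation"]("R1042"))
print(layer.applied)
```

The modification lives in the tool registry rather than in the prompt or the weights, so it persists across tasks and applies to whichever agent calls the tool next.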

The key asymmetry: diagnosis with hindsight is easier than real-time prevention. A strong model that fails at a task during live execution can still correctly diagnose why it failed when examining the trace post-hoc. This makes it productive to use the same model as both the failing agent and the analyst.

1. Agent fails

👨 Cancel my reservation R1042.
🤖 Sure, I'll cancel that now.
Tool call: cancel_reservation("R1042")
Tool result: Reservation cancelled.
✗ economy, no insurance, ineligible

2. Agent analyzes → wraps

🤖 API doesn't enforce cancellation policy. I cancelled an ineligible reservation.
↓
def cancel(id):
  r = get(id)
  if not (r.biz or r.ins or r.recent):
    return "Not eligible"
  return _cancel(id)

3. Wrapper catches it

👨 Cancel my reservation R1042.
🤖 Sure, I'll cancel that now.
Tool call: cancel_reservation("R1042")
Tool result: Blocked: not eligible for cancellation.
🤖 Your economy booking isn't eligible. I can help you explore other options instead.
✓ Policy enforced, invalid action prevented

Figure 1. Left: the agent cancels an ineligible reservation and the API silently succeeds. Center: the agent analyzes the failed trace and generates a wrapper. Right: the wrapper blocks the invalid call and the agent recovers.
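The wrapper in the center panel is shorthand. A self-contained version might look like the sketch below. The `Reservation` fields, the in-memory `RESERVATIONS` store, and `_raw_cancel_reservation` are hypothetical stand-ins for the real airline API, and the eligibility test simply mirrors the three conditions in the figure (business cabin, insurance, recent booking).

```python
from dataclasses import dataclass


@dataclass
class Reservation:
    reservation_id: str
    cabin: str
    has_insurance: bool
    booked_recently: bool
    cancelled: bool = False


# Hypothetical in-memory store standing in for the airline backend.
RESERVATIONS = {
    "R1042": Reservation("R1042", "economy", False, False),
    "R2001": Reservation("R2001", "business", False, False),
}


def _raw_cancel_reservation(reservation_id: str) -> str:
    # The original tool: cancels whatever it is given, with no policy check.
    RESERVATIONS[reservation_id].cancelled = True
    return "Reservation cancelled."


def cancel_reservation(reservation_id: str) -> str:
    # Wrapper exposed to the agent under the original tool name.
    reservation = RESERVATIONS.get(reservation_id)
    if reservation is None:
        return "Blocked: reservation not found."
    eligible = (
        reservation.cabin == "business"
        or reservation.has_insurance
        or reservation.booked_recently
    )
    if not eligible:
        # Return a readable message so the agent can recover in conversation.
        return "Blocked: not eligible for cancellation."
    return _raw_cancel_reservation(reservation_id)


print(cancel_reservation("R1042"))
print(cancel_reservation("R2001"))
```

Returning a message instead of raising keeps the failure inside the normal tool-result channel, which is what lets the agent in the right panel explain the refusal and offer alternatives.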

A Negative Result: Text Doesn't Work

Before building code-based modifications, we tried the natural approach: injecting natural language conventions (preconditions, parameter rules, sequencing hints) into tool schema descriptions. This failed. Models systematically ignore instructions embedded in tool descriptions, treating them as metadata rather than directives. Conventions appended to a one-line description, sometimes pushing it past 3,000 characters, are simply not read.

Convention quantity is inversely correlated with performance: 549 rules perform worse than 74, which in turn perform worse than 56. More text in tool descriptions adds noise that degrades decision-making on tasks the agent already handles.

The one text intervention that works is system-level rules in the system prompt. Models treat system prompts as directives, not metadata. Short, strategic rules placed in the system prompt outperform extensive per-tool conventions.
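To make the two placements concrete, here is a minimal sketch of both text interventions. The schema shape (a dict with `name` and `description` keys), the helper names, and the example rule text are all assumptions for illustration. The source does not show its schema format or its actual rules.

```python
import copy


def add_conventions_to_description(tool_schema: dict, conventions: list) -> dict:
    # Per-tool placement: conventions are appended to the tool's description.
    schema = copy.deepcopy(tool_schema)
    schema["description"] = schema["description"] + " " + " ".join(conventions)
    return schema


def add_rules_to_system_prompt(system_prompt: str, rules: list) -> str:
    # System-level placement: a short numbered list at the end of the prompt.
    numbered = "\n".join(f"{i}. {rule}" for i, rule in enumerate(rules, 1))
    return f"{system_prompt}\n\nRules:\n{numbered}"


# Hypothetical tool schema and rule text.
schema = {"name": "cancel_reservation", "description": "Cancel a reservation."}
rule = "Check cancellation eligibility before cancelling."

long_schema = add_conventions_to_description(schema, [rule])
prompt = add_rules_to_system_prompt("You are an airline agent.", [rule])

print(long_schema["description"])
print(prompt)
```

Both helpers carry the same text. Only the placement differs, and placement is the variable the negative result above points to.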

Results

We evaluate on τ-bench, a multi-turn conversational agent benchmark with complex domain policies. The frozen GPT-5.1 agent interacts with simulated users while following airline domain policies and calling 12 tools. We use 30 train tasks for trace analysis and 20 held-out test tasks for evaluation.

The deployed modifications consist of 4 policy-enforcing wrappers (cancellation eligibility, baggage validation, economy flight restrictions, certificate amount checks) and 8 system-level rules.

Configuration	Pass^1	Pass^2	Pass^3
Baseline	51.7%	38.3%	30.0%
+ Tool desc. conventions (37 rules)	53.3%	43.3%	35.0%
+ System prompt rules (8)	55.0%	46.7%	45.0%
+ Wrapped tools (4)	58.3%	45.0%	40.0%
+ Wrappers + rules	58.3%	50.0%	45.0%
Improvement (wrappers + rules vs. baseline)	+6.6pp	+11.7pp	+15.0pp

Table 1. Infrastructure modification results on held-out test tasks (GPT-5.1, 20 tasks, pass^k for k = 1, 2, 3). Code-based wrappers and system-level rules together achieve a +50% relative improvement at pass^3 (30.0% to 45.0%). Improvements compound with consistency: larger gains at higher k.
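For readers unfamiliar with pass^k, the sketch below assumes the usual τ-bench estimator: for each task with c successes in n trials, take C(c, k) / C(n, k), then average over tasks. The per-task counts are hypothetical, since the source does not report them. They are chosen only so the output is easy to check against the baseline row, assuming 3 trials per task.

```python
from math import comb


def pass_hat_k(successes_per_task: list, n_trials: int, k: int) -> float:
    # Mean over tasks of C(c, k) / C(n, k): the chance that k trials drawn
    # without replacement from the n recorded trials all succeed.
    if not 1 <= k <= n_trials:
        raise ValueError("k must be between 1 and n_trials")
    total = sum(comb(c, k) / comb(n_trials, k) for c in successes_per_task)
    return total / len(successes_per_task)


# Hypothetical success counts (out of 3 trials) for 20 tasks.
counts = [3] * 6 + [2] * 5 + [1] * 3 + [0] * 6

for k in (1, 2, 3):
    print(f"pass^{k} = {pass_hat_k(counts, 3, k):.1%}")

# Relative improvement at pass^3, using the table's 30.0% and 45.0%.
relative_gain = (45.0 - 30.0) / 30.0
print(f"relative gain = {relative_gain:.0%}")
```

Because pass^k only credits tasks solved on every one of the k sampled trials, a task the agent solves two times out of three counts toward pass^1 and pass^2 but contributes nothing at pass^3. This is why reliability fixes show up most strongly at higher k.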

Code beats tool-description text. Wrappers alone (+10pp at pass^3) outperform tool description conventions (+5pp). System prompt rules alone (+15pp) outperform both at pass^3. The combination matches the rules-only result at pass^3 (45.0%), ties wrappers for the best pass^1 (58.3%), and reaches a pass^2 (50.0%) that neither achieves alone. Improvements compound with consistency: the gap between baseline and the combined configuration grows with k (+6.6pp, +11.7pp, +15.0pp), because infrastructure modifications make agent behavior reliable, not just occasionally correct. The +15pp improvement at pass^3 is equivalent to the jump from GPT-4 to GPT-5 on τ-bench, achieved without touching the model.

Why This Matters

Infrastructure modifications are not model-specific. A wrapper that validates cancellation eligibility helps any model that calls that tool. Conventions learned from GPT-4.1 failures transferred to GPT-5-mini (+50% relative improvement) and GPT-5.1. Train once, benefit every agent.

As tool ecosystems scale (MCP already exposes thousands of tools across hundreds of servers), manual tool improvement becomes infeasible. ToolTrace automates what production engineering teams already do: when an API is persistently misused, you don't retrain the caller. You add server-side validation, build higher-level endpoints, and fix misleading documentation. The fix must be structural: modify what the tools do, not what the descriptions say.
