Failure clusters as interventions

Failure clustering becomes useful when it changes the system. Until then, it’s a label on a plot. This article is about the move that turns a cluster into a fix.

From output clustering to system diagnosis

Most failure clustering I’ve read groups similar wrong outputs. That’s a start, but it stops short. A stronger version clusters system behavior: task, model choice, routing decision, tool usage, memory state, prompt strategy, evaluation signal — the whole trajectory, together. Each failed run is then a point in a behavioral space, not a single string.

The difference matters. Wrong outputs that look alike are a description. Behavioral trajectories that fail in the same way are a diagnosis. One tells you what happened; the other tells you which part of the system caused it.
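
To make "a point in a behavioral space" concrete, here is a minimal sketch. The `FailureTrace` fields and the `behavior_features` helper are illustrative names I'm assuming for this example, not the schema used in any of the repos. The representation is deliberately crude: a set of `field=value` tokens stands in for whatever embedding you actually use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureTrace:
    """One failed run, recorded as a trajectory rather than one output string.

    Field names are hypothetical; adapt them to your own trace schema.
    """
    task: str
    model: str
    route: str                     # routing decision taken for this run
    tools: tuple[str, ...]         # tools called during the run
    memory_keys: tuple[str, ...]   # memory entries that were read
    prompt_strategy: str
    eval_signal: str               # e.g. "wrong_answer", "timeout"


def behavior_features(trace: FailureTrace) -> frozenset[str]:
    """Flatten a trace into a set of 'field=value' tokens."""
    tokens = {
        f"task={trace.task}",
        f"model={trace.model}",
        f"route={trace.route}",
        f"prompt={trace.prompt_strategy}",
        f"eval={trace.eval_signal}",
    }
    tokens.update(f"tool={t}" for t in trace.tools)
    tokens.update(f"mem={k}" for k in trace.memory_keys)
    return frozenset(tokens)


a = FailureTrace("math_qa", "small", "direct", ("calculator",),
                 ("user_prefs",), "zero_shot", "wrong_answer")
b = FailureTrace("math_qa", "large", "retrieve", ("search", "calculator"),
                 ("stale_fact",), "cot", "wrong_answer")

# Both runs produced the same eval signal, but their behavior barely overlaps.
shared = behavior_features(a) & behavior_features(b)
print(sorted(shared))
```

Output clustering would put `a` and `b` in the same bucket because both ended in `wrong_answer`. In the behavioral representation they share only the task, the eval signal, and one tool, which is the distinction the rest of the loop depends on.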

Possible failure modes

When you cluster trajectories, the clusters tend to correspond to recognizable patterns:

reasoning breakdown
retrieval failure
overconfident or hallucinated content
tool misuse
memory misalignment
routing failure

The point isn’t that these are the canonical categories. It’s that each cluster is a candidate failure mode of the architecture you’re running — a hypothesis about which part of the system needs to change, generated by the system’s own behavior.
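
A rough sketch of how a cluster becomes a hypothesis, assuming each failed run has already been reduced to a set of behavioral tokens. The greedy Jaccard clustering is a stand-in I chose because it fits in a few lines; it is order-dependent and not what a real pipeline would necessarily use. The token names and the `greedy_cluster` and `shared_signature` helpers are hypothetical.

```python
def jaccard(a: frozenset, b: frozenset) -> float:
    """Overlap between two feature sets, in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def greedy_cluster(items: list[frozenset], threshold: float = 0.5) -> list[list[int]]:
    """Put each item in the first cluster whose seed item is similar enough."""
    clusters: list[list[int]] = []
    for i, feats in enumerate(items):
        for members in clusters:
            if jaccard(feats, items[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])  # no cluster matched: start a new one
    return clusters


def shared_signature(items: list[frozenset], members: list[int]) -> frozenset:
    """Tokens present in every member: raw material for a failure-mode hypothesis."""
    return frozenset.intersection(*(items[i] for i in members))


# Made-up failed runs, already reduced to behavioral tokens.
runs = [
    frozenset({"route=retrieve", "tool=search", "mem=stale_fact", "eval=wrong_answer"}),
    frozenset({"route=retrieve", "tool=search", "mem=stale_fact", "eval=unsupported_claim"}),
    frozenset({"route=direct", "tool=calculator", "eval=wrong_answer"}),
    frozenset({"route=direct", "tool=calculator", "eval=timeout"}),
]

clusters = greedy_cluster(runs)
signatures = [shared_signature(runs, members) for members in clusters]
for members, sig in zip(clusters, signatures):
    print(members, sorted(sig))
```

The signature is not a label. It is what a person reads before deciding whether the cluster looks like memory misalignment, retrieval failure, or nothing in particular.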

Closing the loop

flowchart LR
  T[failure trace] --> B[behavior embedding]
  B --> C[failure cluster]
  C --> I[interpreted failure mode]
  I --> X{intervention}
  X --> R[routing policy]
  X --> M[memory schema]
  X --> Tc[tool constraint]
  X --> Bw[bandit weights]
  X --> G[guardrail / fallback]
  R --> RE[re-evaluate]
  M --> RE
  Tc --> RE
  Bw --> RE
  G --> RE
  RE -.->|next run produces new traces| T

Once a cluster is identified, it can suggest interventions — not in the abstract, but in the same code paths the runtime already exposes:

adjust a routing policy
inspect or revise a memory schema
tighten a tool constraint
update bandit or exploration weights
add a guardrail or fallback path
update prompt strategy

That framing turns evaluation from passive reporting into testable system change. It’s the same loop the rest of the stack uses: observe, diagnose, intervene, re-evaluate.
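
One way to picture the step from interpreted failure mode to intervention is a lookup table that a reviewer can read and override. This is a sketch under assumptions: the `Intervention` type, the `PLAYBOOK` pairings, and the `propose` function are all hypothetical, and the pairings are my guesses at plausible defaults, not a validated mapping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intervention:
    target: str   # which part of the system the change touches
    action: str   # human-readable proposal, still subject to review


# Illustrative pairings only; a real playbook would be earned empirically.
PLAYBOOK = {
    "routing failure": Intervention("routing policy", "send this task type to a different route"),
    "memory misalignment": Intervention("memory schema", "inspect the entries read by cluster members"),
    "tool misuse": Intervention("tool constraint", "tighten argument validation for the tool"),
    "retrieval failure": Intervention("bandit weights", "down-weight the arm that keeps failing"),
    "overconfident or hallucinated content": Intervention("guardrail / fallback", "add a fallback path"),
    "reasoning breakdown": Intervention("prompt strategy", "try a different prompt strategy"),
}

NEEDS_REVIEW = Intervention("human review", "no known intervention; inspect cluster members by hand")


def propose(failure_mode: str) -> Intervention:
    """Return a candidate change for a failure mode, or defer to a human."""
    return PLAYBOOK.get(failure_mode, NEEDS_REVIEW)


print(propose("tool misuse"))
print(propose("something new"))
```

The default matters as much as the table. A cluster that matches no known mode should land on a person's desk, not trigger the nearest available policy change.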

It also aligns with the design rule in Memory-guided evaluation: the slow loop is the only thing that mutates memory and policy. Cluster-driven interventions are slow-loop work: they read the failure log and propose a change, and the change takes effect on the next run, where the next trace can confirm or refute that the intervention worked.
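
The slow-loop rule can be sketched with an immutable policy object. Everything here is assumed for illustration: the `Policy` type, the `fast_loop` and `slow_loop` names, and the crude rule of blocking a route after three logged failures. This is not the scaffold in memory-guided-eval, only the shape of the constraint.

```python
from collections import Counter
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Policy:
    """Frozen, so the fast loop cannot mutate it mid-run."""
    version: int
    blocked_routes: frozenset[str] = frozenset()


def fast_loop(policy: Policy, candidate_routes: list[str]) -> str:
    """Per-request path: reads the policy, never writes it."""
    for route in candidate_routes:
        if route not in policy.blocked_routes:
            return route
    return "fallback"


def slow_loop(policy: Policy, failure_log: list[dict], min_failures: int = 3) -> Policy:
    """Between-run path: reads the failure log, returns a new policy version."""
    counts = Counter(entry["route"] for entry in failure_log)
    newly_blocked = {route for route, n in counts.items() if n >= min_failures}
    if not newly_blocked:
        return policy  # nothing to change; keep the same version
    return replace(
        policy,
        version=policy.version + 1,
        blocked_routes=policy.blocked_routes | newly_blocked,
    )


policy_v1 = Policy(version=1)
failure_log = [{"route": "retrieve"}] * 3 + [{"route": "direct"}]
policy_v2 = slow_loop(policy_v1, failure_log)

print(fast_loop(policy_v1, ["retrieve", "direct"]))
print(fast_loop(policy_v2, ["retrieve", "direct"]))
```

Because `slow_loop` returns a new version and leaves the old one intact, the traces from the next run can be attributed to a specific policy version, which is what makes the intervention checkable.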

Why memory matters here

Without memory state in the traces, clustering is mostly descriptive. The same wrong answer can come from a thousand directions and the cluster won’t tell you which one.

With memory snapshots in the trace (Structured failure traces shows the field shape), the question shifts from “what failed?” to “what internal state made this failure more likely?” The memory layer turns failure clustering from a sociology of outputs into a forensics of behavior.
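
A small sketch of the "what internal state made this failure more likely?" question, assuming successful runs are logged with their memory snapshots too, not only failures. The run shape (`memory_keys`, `failed`) and the `failure_lift` helper are hypothetical. Lift here is P(fail | key was read) divided by the overall failure rate.

```python
from collections import Counter


def failure_lift(runs: list[dict]) -> dict[str, float]:
    """For each memory key, P(fail | key was read) / P(fail overall).

    Lift well above 1 flags a memory state that co-occurs with failure.
    It is a correlation to investigate, not a cause.
    """
    if not runs:
        return {}
    base_rate = sum(int(r["failed"]) for r in runs) / len(runs)
    if base_rate == 0:
        return {}  # no failures, nothing to attribute
    seen: Counter = Counter()
    failed: Counter = Counter()
    for r in runs:
        for key in set(r["memory_keys"]):
            seen[key] += 1
            failed[key] += int(r["failed"])
    return {key: (failed[key] / seen[key]) / base_rate for key in seen}


# Made-up runs: two failures, two successes.
runs = [
    {"memory_keys": ["stale_fact"], "failed": True},
    {"memory_keys": ["stale_fact", "user_prefs"], "failed": True},
    {"memory_keys": ["user_prefs"], "failed": False},
    {"memory_keys": ["user_prefs"], "failed": False},
]

lift = failure_lift(runs)
for key, value in sorted(lift.items(), key=lambda kv: -kv[1]):
    print(f"{key}: {value:.2f}")
```

Without the `memory_keys` field there is nothing to compute this over, which is the practical meaning of "mostly descriptive".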

A note on validity

Clustering always produces some structure. That’s a feature of the math, not evidence about the system. The validation question is whether clusters remain stable across:

models
datasets
prompts
memory states

Stable clusters may point to meaningful failure regimes. Unstable clusters may be representation artifacts — the embedding picked up something the runtime doesn’t actually care about. Treat every cluster as a hypothesis. Manually inspect the members. Hold out a subset for validation. Check whether the proposed intervention actually changes the next run’s traces.
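
One cheap stability check, sketched under assumptions: cluster the failures separately for each slice (say, two models), summarize each cluster as a set of behavioral tokens that excludes the slicing variable, then ask how well each signature from one slice is matched in the other. The `signature_stability` name and the token sets are hypothetical, and mean best-match Jaccard is my choice of score, not an established standard for this.

```python
def jaccard(a: frozenset, b: frozenset) -> float:
    """Overlap between two token sets, in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def signature_stability(sigs_a: list[frozenset], sigs_b: list[frozenset]) -> float:
    """Mean, over clusters in A, of the best Jaccard match among clusters in B.

    Asymmetric by design: it asks whether A's failure regimes reappear in B.
    """
    if not sigs_a or not sigs_b:
        return 0.0
    best = [max(jaccard(a, b) for b in sigs_b) for a in sigs_a]
    return sum(best) / len(best)


# Made-up cluster signatures from two models (model identity left out of the tokens).
model_x = [
    frozenset({"route=retrieve", "mem=stale_fact"}),
    frozenset({"tool=calculator", "prompt=zero_shot"}),
]
model_y = [
    frozenset({"route=retrieve", "mem=stale_fact"}),
    frozenset({"tool=search", "prompt=cot"}),
]

score = signature_stability(model_x, model_y)
print(f"stability X -> Y: {score:.2f}")
```

In this toy case the stale-memory cluster reappears under both models and the calculator cluster does not. That makes the first a better candidate for a real failure regime and the second a candidate for a representation artifact, pending manual inspection of the members.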

This isn’t optional discipline. It’s what keeps failure-driven evaluation from drifting into folk taxonomy.

What this page doesn’t claim

This is a research framing, not a proof that every system benefits from automatic intervention selection. Human review and held-out evaluation still matter. The interesting result isn’t clusters → policies; it’s clusters that reliably point to policies that reliably reduce the failure mode. That’s an empirical chain, and the only way to verify it is to run it.
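
The last link in that chain is a before/after comparison on held-out runs. A minimal sketch, with made-up counts and hypothetical helper names: it computes how often runs land in the targeted failure cluster before and after the intervention, plus a pooled two-proportion z-score as a rough sanity check on whether the difference is larger than noise.

```python
import math


def mode_rate(in_mode: int, total: int) -> float:
    """Fraction of runs that fell into the targeted failure mode."""
    return in_mode / total if total else 0.0


def two_proportion_z(fail_a: int, n_a: int, fail_b: int, n_b: int) -> float:
    """z-score for the difference between two rates, using a pooled standard error."""
    pooled = (fail_a + fail_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0  # both rates are 0 or 1; no variation to test
    return (fail_a / n_a - fail_b / n_b) / se


# Made-up counts: held-out runs that land in the targeted cluster.
before_fail, before_n = 30, 200
after_fail, after_n = 12, 200

reduction = mode_rate(before_fail, before_n) - mode_rate(after_fail, after_n)
z = two_proportion_z(before_fail, before_n, after_fail, after_n)
print(f"absolute reduction: {reduction:.3f}, z = {z:.2f}")
```

A single comparison like this says the failure mode got rarer once. "Reliably" needs the same result across repeated runs and slices, and a check that the failures did not simply move to a different cluster.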

Where the working code lives

Trace shapes and validated examples: memoryevalguided — see Structured failure traces.
Routing / memory / fast-loop / slow-loop scaffold: memory-guided-eval — see Memory-guided evaluation. The interventions/failure_cluster_interventions.py module is where cluster signals get translated into router-readable changes.
Sliced metrics and the binary-reasoning failure-discovery toy: failure-sliced-eval — see Failure-sliced eval.

The same data, used to generate harder questions instead of changing routing, becomes failure-induced benchmarks. Same loop, different direction.