Failure discovery on binary reasoning

Small experiment, big question: when a model fails, do its failures have recoverable structure?

If the answer is no, and failures are essentially random, idiosyncratic mistakes, then everything else in the evaluation lane is built on sand. So this experiment is the cheapest possible test of the assumption that justifies the rest of the work. The plan: use a tiny synthetic dataset with known structure, train a deliberately weak model, look at the failures, and see whether unsupervised clustering recovers the known structure without being told what to look for.

flowchart LR
  D["synthetic binary reasoning data<br/>(hidden reasoning_type labels)"] --> M["weak model<br/>(BoW + logistic regression)"]
  M --> R[evaluation records]
  R --> F[failure extraction]
  F --> C[clustering · K-means on text features]
  C --> V["validation<br/>cluster purity vs hidden labels"]

Dataset

Controlled binary reasoning examples: simple factual checks, negation, compositional statements, transitive comparisons. Each example carries a hidden reasoning_type label (simple, negation, compositional, transitive).
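To make the shape of the data concrete, here is a minimal sketch of what such a generator could look like. The vocabulary, the sentence templates, and the names Example, make_example, and generate are all hypothetical; the generator in the repository may be built quite differently.

```python
import random
from dataclasses import dataclass

REASONING_TYPES = ("simple", "negation", "compositional", "transitive")
# Hypothetical toy world: a fixed size ordering used to compute true labels.
SIZE_RANK = {"mouse": 0, "cat": 1, "dog": 2, "horse": 3, "whale": 4}


@dataclass(frozen=True)
class Example:
    text: str
    label: bool           # ground-truth answer the model must predict
    reasoning_type: str   # hidden label, reserved for validation only


def bigger(a: str, b: str) -> bool:
    return SIZE_RANK[a] > SIZE_RANK[b]


def make_example(rng: random.Random, reasoning_type: str) -> Example:
    a, b, c = rng.sample(sorted(SIZE_RANK), 3)
    if reasoning_type == "simple":
        text, label = f"a {a} is bigger than a {b}", bigger(a, b)
    elif reasoning_type == "negation":
        text, label = f"a {a} is not bigger than a {b}", not bigger(a, b)
    elif reasoning_type == "compositional":
        text = f"a {a} is bigger than a {b} and a {b} is bigger than a {c}"
        label = bigger(a, b) and bigger(b, c)
    elif reasoning_type == "transitive":
        # Symbolic premises; the conclusion either follows or is reversed.
        x, y, z = rng.sample("ABCDE", 3)
        holds = rng.random() < 0.5
        conclusion = f"{x} > {z}" if holds else f"{z} > {x}"
        text, label = f"{x} > {y} and {y} > {z} so {conclusion}", holds
    else:
        raise ValueError(f"unknown reasoning type: {reasoning_type}")
    return Example(text, label, reasoning_type)


def generate(n: int, seed: int = 0) -> list:
    """Return n examples, cycling through the four reasoning types."""
    rng = random.Random(seed)
    return [
        make_example(rng, REASONING_TYPES[i % len(REASONING_TYPES)])
        for i in range(n)
    ]
```

Calling generate(40, seed=0) yields ten examples of each type, and the same seed always reproduces the same dataset, which keeps later failure analysis repeatable.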

The hidden label is the trick that makes the experiment work. It’s not used during clustering. It’s reserved for validation — checking after the fact whether the clusters the algorithm found correspond to the categories that actually exist in the data.

Model

The first model is intentionally weak. Bag-of-words featurizer, logistic regression. That’s a feature, not a bug. If failures cluster in interpretable ways, the effect is easier to inspect through a weak model than buried inside a large black box. You want the structure in the failures to be recoverable in principle; you don’t want it to merely look recovered because a more capable model did the organizing for you.
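The weakness is easy to see in code. Below is a dependency-free sketch of a bag-of-words logistic regression trained with plain stochastic gradient descent. It is an illustration under assumptions, not the repository's model (which may well use a library implementation); the function names, the toy training sentences, and the hyperparameters are all made up for the example.

```python
import math
from collections import Counter


def bow(text: str) -> Counter:
    """Bag of words: token counts only, word order is discarded."""
    return Counter(text.lower().split())


def predict_proba(weights: dict, bias: float, feats: Counter) -> float:
    # Sorted iteration keeps the floating-point sum order deterministic.
    z = bias + sum(weights.get(tok, 0.0) * n for tok, n in sorted(feats.items()))
    z = max(-30.0, min(30.0, z))  # clamp so exp() stays well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def train(texts, labels, epochs: int = 200, lr: float = 0.1):
    """Fit weights by SGD on the logistic (log) loss."""
    weights, bias = {}, 0.0
    feats = [bow(t) for t in texts]
    for _ in range(epochs):
        for x, y in zip(feats, labels):
            err = predict_proba(weights, bias, x) - y  # d(loss)/d(logit)
            bias -= lr * err
            for tok, n in x.items():
                weights[tok] = weights.get(tok, 0.0) - lr * err * n
    return weights, bias


# Hypothetical toy training set: 1 = statement is true, 0 = false.
train_texts = [
    "a whale is bigger than a cat",
    "a whale is not bigger than a cat",
    "a dog is bigger than a mouse",
    "a dog is not bigger than a mouse",
]
train_labels = [1, 0, 1, 0]
weights, bias = train(train_texts, train_labels)

# Swapping the arguments flips the truth value but not the bag of words,
# so the model is forced to give both sentences the same score.
p_forward = predict_proba(weights, bias, bow("a cat is bigger than a mouse"))
p_swapped = predict_proba(weights, bias, bow("a mouse is bigger than a cat"))
```

Because p_forward and p_swapped are computed from identical features, at least one of the two sentences must be misclassified. That kind of built-in blind spot is what makes the failures of a weak model easy to inspect.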

Failure extraction

After the eval, isolate the failed records (incorrect predictions). Those rows become the input to clustering. Correctly predicted records are dropped, because the experiment is asking what failures look like as a population.
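A minimal sketch of this step, assuming each evaluation record carries the input text, the true label, the model's prediction, and the hidden type. The EvalRecord fields and both helper names are hypothetical. The second helper separates the text that goes to clustering from the hidden labels, so the labels cannot leak into the clustering inputs by accident.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRecord:
    text: str
    label: bool
    prediction: bool
    reasoning_type: str  # hidden label, never passed to clustering


def extract_failures(records):
    """Keep only the records where the prediction disagrees with the label."""
    return [r for r in records if r.prediction != r.label]


def split_for_clustering(failures):
    """Return (clustering inputs, held-out hidden labels) as parallel lists."""
    texts = [r.text for r in failures]
    hidden = [r.reasoning_type for r in failures]
    return texts, hidden


records = [
    EvalRecord("a cat is bigger than a mouse", True, True, "simple"),
    EvalRecord("a mouse is not bigger than a cat", True, False, "negation"),
    EvalRecord("A > B and B > C so C > A", False, True, "transitive"),
]
failures = extract_failures(records)
texts, hidden = split_for_clustering(failures)
```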

Clustering

A first pass uses lightweight text features and standard K-means. The goal is not to claim K-means discovers cognition. The goal is to test whether unsupervised groups align with known reasoning types held out from the clustering inputs.
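As a rough illustration, here is a dependency-free sketch of K-means (Lloyd's algorithm) over three surface features of each failed example. The feature set (token count, a negation flag, a count of comparison symbols), the function names, and the example sentences are assumptions made for this sketch; the repository's features and clustering implementation may differ.

```python
import random

NEGATIONS = {"not", "no", "never"}
SYMBOLS = {">", "<", "="}


def featurize(text: str) -> list:
    """Map a text to [token count, has negation, number of symbol tokens]."""
    tokens = text.lower().split()
    return [
        float(len(tokens)),
        float(any(t in NEGATIONS for t in tokens)),
        float(sum(t in SYMBOLS for t in tokens)),
    ]


def sq_dist(p, q) -> float:
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k: int, seed: int = 0, iters: int = 50) -> list:
    """Lloyd's algorithm; returns one cluster index per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assign = []
    for _ in range(iters):
        new_assign = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_assign == assign:  # assignments stable, so we have converged
            break
        assign = new_assign
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign


failed_texts = [
    "a cat is not bigger than a dog",
    "a mouse is not bigger than a horse",
    "A > B and B > C so C > A",
    "B > D and D > E so E > B",
]
clusters = kmeans([featurize(t) for t in failed_texts], k=2)
```

Only the text-derived features reach kmeans. The hidden reasoning_type never appears in the inputs, which is what makes the later comparison against it meaningful.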

Validation

For each cluster, summarize size, average length, negation rate, symbol rate, counts of each hidden type, and a purity-style view of overlap with those types (for example, the share of a cluster taken up by its most common hidden type). The central question is one line:

Do discovered failure clusters recover known reasoning categories better than chance?

That’s it. That’s the experiment.
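One way to make "better than chance" operational is a permutation baseline: compute purity for the discovered clusters, then recompute it many times with the hidden labels shuffled, and see how often the shuffled score matches or beats the real one. The sketch below assumes that approach. The function names, the exact purity definition (fraction of failures that belong to the majority hidden type of their cluster), and the definitions of negation rate and symbol rate are assumptions, not necessarily what the repository computes.

```python
import random
from collections import Counter, defaultdict


def cluster_summary(texts, clusters, hidden_types) -> dict:
    """Per-cluster size, average length, surface rates, type counts, purity."""
    groups = defaultdict(list)
    for text, c, t in zip(texts, clusters, hidden_types):
        groups[c].append((text, t))
    rows = {}
    for c, members in sorted(groups.items()):
        n = len(members)
        counts = Counter(t for _, t in members)
        rows[c] = {
            "size": n,
            "avg_len": sum(len(x.split()) for x, _ in members) / n,
            "negation_rate": sum("not" in x.lower().split() for x, _ in members) / n,
            "symbol_rate": sum(any(ch in x for ch in "<>=") for x, _ in members) / n,
            "type_counts": dict(counts),
            "purity": max(counts.values()) / n,
        }
    return rows


def purity(clusters, hidden_types) -> float:
    """Fraction of items whose hidden type is the majority in their cluster."""
    by_cluster = defaultdict(Counter)
    for c, t in zip(clusters, hidden_types):
        by_cluster[c][t] += 1
    return sum(max(cnt.values()) for cnt in by_cluster.values()) / len(clusters)


def null_purities(clusters, hidden_types, trials: int = 1000, seed: int = 0) -> list:
    """Purity scores under random relabeling (the chance baseline)."""
    rng = random.Random(seed)
    shuffled = list(hidden_types)
    scores = []
    for _ in range(trials):
        rng.shuffle(shuffled)
        scores.append(purity(clusters, shuffled))
    return scores


def p_value(observed: float, null_scores) -> float:
    """Share of shuffled runs scoring at least as high (add-one corrected)."""
    hits = sum(s >= observed for s in null_scores)
    return (1 + hits) / (1 + len(null_scores))
```

Raw purity drifts upward as the number of clusters grows, since small clusters are easier to fill with a single type. Comparing against shuffled labels with the same cluster sizes corrects for that, which is why the baseline is computed on the actual cluster assignment rather than against a fixed constant.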

What this shows

If clusters align with hidden reasoning labels, that supports a careful claim: some failures are structured enough to be discovered from failure data alone in a controlled setting. That’s the load-bearing assumption underneath structured failure traces, failure clusters as interventions, and failure-induced benchmarks. If failures can be clustered into something recognizable here, in a tiny controlled setting, then the bigger versions of the loop (with real models, real traces, real interventions) have a foundation to stand on.

This experiment doesn’t establish a general theory of reasoning failure. It’s a starting point — the cheapest test that lets the rest of the work proceed without lying to itself.

What it doesn’t claim

Results depend on the synthetic generator, the model class, and the clustering choices. External validity to open-ended tasks is a separate experiment. Take this as a proof of concept for the type of question you can ask of failure data, not as a proof of any specific cognitive taxonomy.

Repository

The concrete harness, synthetic generator, clustering pipeline, and purity validation against held-out reasoning types live in github.com/obversary/failure-sliced-eval. Runnable entry point:

python experiments/run_failure_discovery.py

The longer write-up is in the repo at docs/failure_discovery_binary_reasoning.md.