Evaluation Systems

Evaluation is where assumptions meet evidence. This is the lane hub — the index page for the evaluation work in the stack. The articles linked from here treat metrics, traces, slices, and induced tests as engineering surfaces: you’re not only scoring a model, you’re asking what failures the score still hides.

Why evaluation is a layer, not a step

Most stacks I’ve read run evaluation at the end. Build, run, measure, ship. That’s fine for a function. It’s wrong for a system. A system that is supposed to learn from its own behavior needs evaluation to be continuous, structured, and addressable — every part of the trajectory has to be measurable, not just the final answer.

That’s why this lane has multiple repos and multiple articles. Each one covers a different axis of what gets measured:

  • routing and memory decisions, in real time, with logs

  • failures, recorded as full trajectories with comparable fields

  • slices of behavior across predefined and discovered subsets

  • clusters of failures, treated as candidate interventions

  • induced benchmarks, built from the failures themselves

Together they’re the apparatus for turning mistakes into the substrate’s second memory — not just “this failed,” but what route, state, tool, and evaluator were involved. Some pieces are already runnable scaffolds; the full closed loop is the research direction.

If you only have one pass through this lane

You can read every article eventually, but a single coherent path looks like this:

  1. Memory-guided evaluation — why fast-loop routing and slow-loop learning must stay separable if you want inspectability.

  2. Failure-sliced eval — make rare failures visible in the metrics before you invest in heavyweight trace machinery.

  3. Structured failure traces — once slices show where it hurts, traces give you comparable objects for postmortems and repos.

  4. Failure-induced benchmarks or Failure clusters as interventions — pick the direction that matches whether you’re generating harder tests from failures or changing the system because a cluster keeps recurring.

That order isn’t doctrine; it’s a cheap-to-expensive gradient. Skipping sliced metrics early is how aggregate dashboards stay green while something important stays broken — the opening argument of Failure-sliced eval.
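To make the "green dashboard, broken slice" point concrete, here is a minimal sketch of a sliced metric. Everything in it is hypothetical: the `task_type` and `correct` field names, the `sliced_accuracy` helper, and the toy data are illustrations, not the format used in failure-sliced-eval.

```python
from collections import defaultdict


def sliced_accuracy(records, slice_key):
    """Return (overall_accuracy, {slice_value: (accuracy, count)}).

    `records` is a list of dicts with a boolean "correct" field and a
    field named by `slice_key` (both field names are assumptions).
    """
    if not records:
        raise ValueError("need at least one record")
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        totals[rec[slice_key]] += 1
        hits[rec[slice_key]] += int(rec["correct"])
    overall = sum(hits.values()) / sum(totals.values())
    per_slice = {k: (hits[k] / totals[k], totals[k]) for k in totals}
    return overall, per_slice


# Toy data: 95 easy lookups that pass, 5 multi-hop cases that all fail.
records = (
    [{"task_type": "lookup", "correct": True}] * 95
    + [{"task_type": "multi_hop", "correct": False}] * 5
)
overall, per_slice = sliced_accuracy(records, "task_type")
for name, (acc, n) in sorted(per_slice.items()):
    print(f"{name}: accuracy={acc:.2f} n={n}")
print(f"overall: accuracy={overall:.2f}")
```

On this toy data the aggregate is 0.95 while the `multi_hop` slice is at 0.00, which is the situation an average alone would not surface. Reporting the count next to each slice matters too, since a small slice is easy to dismiss unless its size is visible.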

In this lane

  • Memory-guided evaluation — fast-loop routing and slow-loop learning, kept separate as a hard architectural rule. Lives in memory-guided-eval.

  • Failure-sliced eval — measuring performance on slices so averages don’t smother rare failures. Sliced metrics, MVP failure-episode log format, the binary-reasoning failure-discovery toy. Lives in failure-sliced-eval.

  • Structured failure traces — the canonical schema for failures-as-comparable-objects. Validated examples, validator script. Lives in memoryevalguided.

  • Failure clusters as interventions — turning recurring failure shapes into routing, memory, or tool changes. Cross-cuts all three repos.
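As a rough illustration of treating recurring failure shapes as candidate interventions, the sketch below groups failure records by a signature and keeps the signatures that recur. The signature fields (`route`, `tool`, `error_kind`), the `recurring_clusters` helper, and the threshold are assumptions made for the example; the repos may cluster on different or richer fields.

```python
from collections import Counter


def recurring_clusters(failures, keys=("route", "tool", "error_kind"), min_count=2):
    """Group failures by a tuple signature and return the recurring ones.

    Returns a list of (signature, count) pairs, most frequent first,
    keeping only signatures seen at least `min_count` times.
    """
    signatures = Counter(tuple(f.get(k) for k in keys) for f in failures)
    return [(sig, n) for sig, n in signatures.most_common() if n >= min_count]


# Hypothetical failure records, for illustration only.
failures = [
    {"route": "retrieval", "tool": "search", "error_kind": "stale_result"},
    {"route": "retrieval", "tool": "search", "error_kind": "stale_result"},
    {"route": "direct", "tool": None, "error_kind": "arithmetic"},
    {"route": "retrieval", "tool": "search", "error_kind": "stale_result"},
]
clusters = recurring_clusters(failures)
for signature, count in clusters:
    print(count, signature)
```

Exact-match grouping is the cheapest possible notion of a cluster. It only works if failures are recorded with comparable fields in the first place, which is the job of the structured traces.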

The thread connecting all of this

Aggregate accuracy hides structure. The number on the dashboard tells you something is wrong; it doesn’t tell you whether the problem is routing, retrieval, the tool, the model, or the evaluator. The only way to tell those apart is to record the layers separately and let yourself read them as comparable objects later.
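One way to picture "record the layers separately" is a trace object with one outcome per layer. This is a sketch under assumptions: the `LayeredTrace` class, its fields, and the pipeline order in `LAYERS` are hypothetical and are not the canonical schema from the structured-traces work.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Assumed pipeline order, taken from the layers named in the paragraph above.
LAYERS = ("routing", "retrieval", "tool", "model", "evaluator")


@dataclass
class LayeredTrace:
    episode_id: str
    # layer name -> True (ok) or False (failed); a missing key means "not recorded"
    layer_ok: Dict[str, bool] = field(default_factory=dict)

    def first_failing_layer(self) -> Optional[str]:
        """Return the earliest layer (in LAYERS order) recorded as failed."""
        for layer in LAYERS:
            if self.layer_ok.get(layer) is False:
                return layer
        return None

    def unrecorded_layers(self):
        """Return the layers with no recorded outcome, in LAYERS order."""
        return [layer for layer in LAYERS if layer not in self.layer_ok]


trace = LayeredTrace(
    episode_id="ep-001",  # hypothetical identifier
    layer_ok={"routing": True, "retrieval": False, "model": False},
)
print(trace.first_failing_layer())
print(trace.unrecorded_layers())
```

The earliest failing layer is a starting point for a postmortem, not proof of root cause: here the model failure may simply follow from the retrieval failure. Keeping "not recorded" distinct from "ok" also stops missing instrumentation from reading as a pass.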

That’s also why I think static benchmarks are going to be short-lived (see Why memory is the substrate). I don’t mean static sets become useless overnight; I mean they stop being enough once systems can adapt around fixed tasks. A system that can keep up with its own failures has to measure them in the same shape it measures everything else. Evaluation isn’t separate from the substrate. It’s the part of the substrate that asks whether what just happened was useful.

For how this lane fits the wider studio picture, see Projects overview.