Why memory is the substrate

I’m Brian. I’m 31. I came to research engineering the long way around.

I worked retail, hospitality, and sales before I worked in code. I spent real time in psych settings early in life, watching how people actually operate when the polite layer is off. I’ve used computers since I was a kid but I didn’t go deep on them until a couple of years ago, and I never went the traditional school route for any of this. I’ve failed at more things than most people have tried, and that’s where most of what I know actually came from.

That’s relevant, because it’s the reason I think about AI systems the way I do.

The bet

I don’t learn from rules. I learn from a memory of failing, a memory of succeeding, and the slow accumulation of which one tends to follow which. That accumulation is what people call intuition, and it’s the only thing that has ever reliably steered me. So when I look at a model (how it answers, how it routes, how it forgets), I’m not asking “did it get the right token?” I’m asking “what did it remember, and what did that memory let it become?”

My bet is simple, and I’ll say it plainly so it’s on the record:

Intelligence is built from memory. Specifically, from experience that points back at a cognitive ability. When you remember a thing well enough to feel it the next time the situation reappears, you have something resembling instinct. When a system can do that — preserve its own experience with enough structure that it can act differently the next time — the system has the substrate intuition forms on. Without that substrate, you don’t get intelligence. You get a fast lookup.

Most stacks I see treat memory as retrieval. Store chunks, embed, search, return. That’s not memory. That’s a filing cabinet with a search box. Memory in the sense I care about has provenance, has events, has the trace of when and why and what changed because of it. Memory means the system can, in principle, explain itself back to itself. That’s the only foundation I trust to build judgment on.
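A minimal sketch of the distinction, under stated assumptions: the `MemoryEvent` and `MemoryLog` names, their fields, and the example events are hypothetical illustrations, not code from any of the repos. It shows one way a record can carry when, why, and what led to it, so that a later decision can be traced back through the experience that produced it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MemoryEvent:
    """One remembered experience. All field names are hypothetical."""
    event_id: str
    what: str                        # what happened
    why: str                         # the reason recorded at the time
    when: str                        # ISO-8601 timestamp, kept as a plain string
    source: str                      # provenance: which component observed it
    caused_by: Optional[str] = None  # id of the earlier event that led here


class MemoryLog:
    """Append-only log: events are added, never edited in place."""

    def __init__(self) -> None:
        self._events: dict[str, MemoryEvent] = {}

    def append(self, event: MemoryEvent) -> None:
        if event.event_id in self._events:
            raise ValueError(f"duplicate event id: {event.event_id}")
        self._events[event.event_id] = event

    def explain(self, event_id: str) -> list[MemoryEvent]:
        """Walk caused_by links back to the root, returning oldest first."""
        chain: list[MemoryEvent] = []
        seen: set[str] = set()  # guards against accidental cycles
        current = self._events.get(event_id)
        while current is not None and current.event_id not in seen:
            seen.add(current.event_id)
            chain.append(current)
            current = self._events.get(current.caused_by) if current.caused_by else None
        chain.reverse()
        return chain


log = MemoryLog()
log.append(MemoryEvent("e1", "ingested invoice.pdf", "user upload",
                       "2024-03-01T10:00:00Z", "pdf-lane"))
log.append(MemoryEvent("e2", "table extraction failed", "merged header cells",
                       "2024-03-01T10:00:05Z", "evaluator", caused_by="e1"))
log.append(MemoryEvent("e3", "routed similar PDFs to a fallback parser",
                       "same pattern as e2", "2024-03-02T09:30:00Z", "router",
                       caused_by="e2"))

trail = [event.event_id for event in log.explain("e3")]
```

A plain chunk store could return the text of `e3`, but it could not answer why the routing changed. Here `explain("e3")` recovers the ingestion and the failure that led to it.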

Python, and why I can’t stay in one domain

Before I say the rest of this I want to name where my own thinking comes from, because it’s the template for everything below.

The thing that actually unlocked this work for me was learning Python. Not because of Python. Because once I understood the syntax and landscape of a domain, I could suddenly move through it. The rules stopped being foreign. I could read the math and the code at the same time and see which parts corresponded. And once I had that lever, I couldn’t stop using it on everything else — research papers, RL, tokenizers, phishing payloads, mangrove fishing on the dock outside my rental this week in Florida. I lock onto whatever I’m focused on and go straight through the wall with it. Some people would call it OCD. I don’t know what to call it. I never learned constraint. It’s been a blessing and a curse.

But the pattern I eventually noticed inside it is the one I care about for this work:

If you can learn the syntax and landscape of a domain, you can transition into it. What makes that possible isn’t the domain — it’s the cognitive ability underneath, the one that lets you map unfamiliar structure onto structure you already know.

Domains are skills. Cognitive ability is what decides how far your skills can range. That distinction is the thing most people miss, and it’s why I think the way we evaluate AI systems right now is about to go out of date.

Why static benchmarks are going to be short-lived

Here’s the consequence.

If intelligence is a function of range across domains, not performance inside one, then a benchmark that measures a model’s score on a fixed task list is measuring the wrong thing. It’s measuring how well the model performs the syntax of one domain. It’s not measuring whether the cognitive substrate underneath can pick up the syntax of a domain it hasn’t seen yet.

Static benchmarks will keep giving numbers. Leaderboards will keep existing. But the number will stop correlating with anything useful, because every serious model will saturate the task list long before it saturates the underlying capability. The interesting question in a few years won’t be “what did the model score on benchmark X.” It’ll be “how does this model aggregate and use the experience it was handed, and how well does that aggregation transfer when the domain changes.” That’s a memory question and a cognitive-range question. It’s not a task-list question.
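One hypothetical way to put a number on "transfer when the domain changes" is to compare accuracy on domains the system has experience with against accuracy on domains it has not seen. The function names, the `(domain, correct)` result format, and the example data below are assumptions made for illustration. This is a rough sketch of the idea, not a proposed metric.

```python
from collections import defaultdict


def accuracy_by_domain(results):
    """results: iterable of (domain, correct) pairs -> {domain: accuracy}."""
    totals = defaultdict(lambda: [0, 0])  # domain -> [correct, attempted]
    for domain, correct in results:
        totals[domain][0] += int(correct)
        totals[domain][1] += 1
    return {d: c / n for d, (c, n) in totals.items()}


def transfer_gap(results, familiar_domains):
    """Mean accuracy on familiar domains minus mean accuracy on novel ones.

    A gap near zero suggests the capability carries over; a large positive
    gap suggests the score mostly reflects the familiar task list.
    """
    acc = accuracy_by_domain(results)
    familiar = [a for d, a in acc.items() if d in familiar_domains]
    novel = [a for d, a in acc.items() if d not in familiar_domains]
    if not familiar or not novel:
        raise ValueError("need at least one familiar and one novel domain")
    return sum(familiar) / len(familiar) - sum(novel) / len(novel)


results = [
    ("python", True), ("python", True), ("python", True), ("python", False),
    ("tokenizers", True), ("tokenizers", False),
]
gap = transfer_gap(results, familiar_domains={"python"})
```

With this toy data the familiar domain scores 0.75 and the novel one 0.5, so the gap is 0.25. A static leaderboard would report only the first number.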

This is the research problem I’m actually working on, across all the repos on this site. The way I’d phrase it:

The best measurement of an AI system isn’t what it answers today. It’s what it decides is important about the experience it’s handling, what it lets shape its later behavior, and what it stays consistent about when the task surface changes.

That’s why the work on failure-induced benchmarks exists. Not because I want another leaderboard. Because I think benchmarks need to become functions of the system’s own failure distribution — they have to respond to where a specific model breaks, not sit on a shelf for everyone. That’s the only kind of evaluation that can keep up with models that are getting better at static sets faster than the static sets can be designed.
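A minimal sketch of a benchmark that responds to where a specific model breaks, assuming failures have already been grouped into named clusters. The cluster names, the `floor` parameter, and both function names are hypothetical; the actual harness may work quite differently.

```python
import random
from collections import Counter


def failure_weights(failures, clusters, floor=0.05):
    """Turn observed failures per cluster into sampling weights.

    failures: list of cluster names, one entry per observed failure.
    floor: small constant so clusters with no failures stay reachable.
    """
    counts = Counter(failures)
    total = sum(counts.values())
    raw = {c: (counts[c] / total if total else 0.0) + floor for c in clusters}
    norm = sum(raw.values())
    return {c: w / norm for c, w in raw.items()}


def next_benchmark(item_pool, weights, k, seed=0):
    """Draw k items, choosing the cluster for each draw by failure weight.

    item_pool: {cluster: [candidate items]}.
    """
    rng = random.Random(seed)  # seeded so a run can be reproduced
    clusters = [c for c in weights if item_pool.get(c)]
    picks = rng.choices(clusters, weights=[weights[c] for c in clusters], k=k)
    return [(c, rng.choice(item_pool[c])) for c in picks]


observed = ["tables", "tables", "tables", "dates"]
weights = failure_weights(observed, clusters=["tables", "dates", "units"])
pool = {
    "tables": ["t1", "t2", "t3"],
    "dates": ["d1", "d2"],
    "units": ["u1"],
}
batch = next_benchmark(pool, weights, k=5)
```

The next benchmark for this model leans toward table problems because that is where it failed. A different model with a different failure history would get a different set from the same pool.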

How the layers stack

Visually, the bet looks like this:

flowchart TB
  subgraph long["long-horizon · what the system becomes"]
    intuition["intuition · emotional intelligence as observable code"]
  end

  subgraph adaptive["adaptive feedback · failure as the second memory"]
    benchmarks["failure-induced benchmarks"]
    interventions["failure clusters as interventions"]
  end

  subgraph eval["evaluation · making behavior measurable"]
    routing["memory-guided routing"]
    traces["structured failure traces"]
    slices["sliced metrics"]
  end

  subgraph runtime["runtime · what to do, given memory"]
    obvos["Obversary-OS"]
  end

  subgraph ingestion["applied ingestion · real input"]
    pdf["PDF Intelligence Core"]
    chats["chat / notes / datasets · later"]
  end

  subgraph substrate["substrate · two sibling memory repos"]
    earthdb["earth-database<br/>local canonical core<br/>SQLite · trust · observability"]
    memdrop["memory-dropbox<br/>event-sourced substrate experiment<br/>Postgres · Redis · Qdrant"]
    earthdb <-.-> memdrop
  end

  ingestion --> substrate
  substrate --> runtime
  runtime --> eval
  eval --> adaptive
  adaptive --> long
  adaptive -. closes the loop .-> substrate

Reading bottom to top: ingestion lands real input on the substrate; the runtime decides what to do and writes its decisions back to the substrate; evaluation reads those decisions and produces traces; failure feedback turns those traces into interventions and harder benchmarks; intuition is what the long-horizon thing eventually starts to look like when the substrate has accumulated enough experience to compose across domains.

The dotted line is the one I care about most. Failure feedback closes back onto the substrate. That’s not a flourish — it’s the whole point. The system’s mistakes become memory the next run can use.
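The loop can be sketched in a few lines. Everything here is a stand-in: the `Substrate` class, the route names, and the `succeeds` callback are hypothetical, and the real layers are far richer. The point is only the shape: a failure recorded in one run changes the routing decision in the next.

```python
from collections import defaultdict


class Substrate:
    """Stand-in for the memory layer: remembers failures per (kind, route)."""

    def __init__(self):
        self.failures = defaultdict(int)

    def record_failure(self, kind, route):
        self.failures[(kind, route)] += 1

    def failure_count(self, kind, route):
        return self.failures[(kind, route)]


def choose_route(substrate, kind, routes):
    """Pick the route with the fewest remembered failures for this kind.

    Ties go to the route listed first, because min() keeps the first minimum.
    """
    return min(routes, key=lambda r: substrate.failure_count(kind, r))


def run_once(substrate, kind, routes, succeeds):
    """Runtime decides, evaluation checks, failure is written back to memory."""
    route = choose_route(substrate, kind, routes)
    ok = succeeds(kind, route)
    if not ok:
        substrate.record_failure(kind, route)  # the dotted line in the diagram
    return route, ok


substrate = Substrate()
routes = ["fast_parser", "careful_parser"]


def succeeds(kind, route):
    # Toy world: scanned PDFs only work on the careful parser.
    return route == "careful_parser"


first = run_once(substrate, "scanned_pdf", routes, succeeds)
second = run_once(substrate, "scanned_pdf", routes, succeeds)
```

The first run picks `fast_parser` and fails. The second run reads that failure from the substrate and picks `careful_parser`. No rule was edited; only memory changed.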

Why modular ingestion matters

Here’s the part I think most people miss.

A model trained on one corpus is a model. A system that ingests many kinds of experience (documents, conversations, traces of its own failures, notes from the human running it, signals from tools it used) and holds them all in the same memory substrate, with provenance, is something different. The pieces start to interact. A failure trace from last week informs a routing decision today. A document ingested in March changes how a chat ingested in September gets understood. The substrate becomes the thing that lets emergent capability show up, because the ingested inputs are no longer isolated datasets; they’re a single accumulating experience.

Modularity is the scalability mechanism. Not because modules are easier to debug (though they are). Because modules with a shared memory substrate let the system grow in ways the original architect didn’t plan for, the same way a person who’s worked retail and hospitality and psych and code ends up reading rooms differently than someone who only worked one of those. The crossover is where the interesting thing happens. That’s the same pattern as the Python-to-everything-else jump — an accumulated substrate produces range, and range is what looks like intelligence.
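A minimal sketch of separate ingestion lanes writing into one shared store with provenance, using Python's built-in `sqlite3` module and an in-memory database. The table layout, lane names, and helper functions are assumptions made for illustration and are not the schema of earth-database or memory-dropbox.

```python
import json
import sqlite3


def open_substrate():
    """Create an in-memory store with one table shared by every lane."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE events (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               lane TEXT NOT NULL,     -- which ingestion module wrote this
               origin TEXT NOT NULL,   -- provenance: file, chat id, run id
               topic TEXT NOT NULL,
               payload TEXT NOT NULL   -- JSON-encoded content
           )"""
    )
    return conn


def ingest(conn, lane, origin, topic, payload):
    """Any lane can call this; none needs to know about the others."""
    conn.execute(
        "INSERT INTO events (lane, origin, topic, payload) VALUES (?, ?, ?, ?)",
        (lane, origin, topic, json.dumps(payload)),
    )
    conn.commit()


def recall(conn, topic):
    """Return everything known about a topic, across lanes, oldest first."""
    rows = conn.execute(
        "SELECT lane, origin, payload FROM events WHERE topic = ? ORDER BY id",
        (topic,),
    ).fetchall()
    return [(lane, origin, json.loads(payload)) for lane, origin, payload in rows]


conn = open_substrate()
ingest(conn, "pdf", "invoice_march.pdf", "invoices", {"total": "1200.00"})
ingest(conn, "chat", "session-17", "travel", {"note": "prefers trains"})
ingest(conn, "failure_trace", "run-42", "invoices",
       {"error": "merged header cells"})

about_invoices = recall(conn, "invoices")
lanes = [lane for lane, _origin, _payload in about_invoices]
```

A question about invoices now pulls in both the document and the failure trace, even though the two lanes were written independently. Adding a new lane means calling `ingest` with a new lane name; nothing else has to change.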

The work on this site is what that bet looks like when I try to build it.

What I’m actually doing here

A few threads, each on its own page, all pointing at the same substrate question:

earth-database: the local canonical memory core. Local, embedded, SQLite-backed, with deterministic trust-boundary work and a JSONL event log that makes decisions inspectable. The smallest honest version of the substrate idea, and the one I’d point at first if I had to pick a single canonical memory repo.
Memory Dropbox: the event-sourced substrate experiment at larger scope. Postgres, Redis, Qdrant, worker. Where derived memories, observation memories, and agent-facing memory experiments get tested beyond what fits in one embedded database. The companion page, “What memory substrates are for,” treats six domains as research-speed experiments on the same substrate idea: not as finished product claims, but as directions where provenance and replay could change the pace of research.
Obversary-OS: the runtime layer. Roles, workflows, tools, model interfaces. The prototype records decisions so you can inspect them later; the substrate wiring is the next boundary, not something I’m treating as done.
PDF Intelligence Core: the first applied ingestion lane. Documents are messy, common, and structurally honest about being hard. Turning them into inspectable artifacts is a real test of whether the substrate has useful material to preserve.
Memory-guided evaluation and structured failure traces: failure as the second memory. When the system gets something wrong, that wrongness is data about how it currently thinks. That data is the same shape as everything else in the substrate. You don’t throw it away; you let it shape what comes next.
Failure-induced benchmarks: using that failure data to build harder questions, instead of running yesterday’s benchmarks forever. Three pre-committed hypotheses, runnable harness, real bootstrap CIs, committed run artifacts. This is where “static benchmarks are short-lived” becomes a claim I have to test, not a line I get to say for free.
Math Boundaries for AI Systems: the layer where I keep myself honest about where math actually buys you something and where engineering judgment has to take over. I built it as a translator, because that’s how I read papers.
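For readers unfamiliar with the bootstrap CIs mentioned for the failure-induced benchmarks, here is a minimal sketch of a percentile bootstrap for accuracy. The choice of the percentile variant, the function name, and the defaults are assumptions; the harness may use a different bootstrap method or different settings.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of 0/1 outcomes.

    outcomes: list of 1 (pass) and 0 (fail), one entry per benchmark item.
    Returns (low, high) bounds for a (1 - alpha) interval.
    """
    if not outcomes:
        raise ValueError("need at least one outcome")
    rng = random.Random(seed)  # seeded so the interval is reproducible
    n = len(outcomes)
    # Resample the items with replacement and record each resample's accuracy.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    low_idx = round((alpha / 2) * (n_resamples - 1))
    high_idx = round((1 - alpha / 2) * (n_resamples - 1))
    return means[low_idx], means[high_idx]


outcomes = [1] * 70 + [0] * 30  # 70 passes out of 100 items
low, high = bootstrap_ci(outcomes)
```

The interval says how much the 70% figure could move if the benchmark items had been a different draw from the same distribution. This matters when two runs differ by only a few points.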

The Projects overview page has the full map of how the pieces fit.

The long-horizon thing

The reason I care about all of this isn’t a benchmark score.

I think the real measurement of an AI system in a few years won’t be accuracy on a static set. It’ll be how the system aggregates the experience it’s handling — what it decides is important, what it lets shape its later behavior, and what tunnels of logic it follows that it isn’t fully aware of. Steering that is the actual research problem. Words and tokens are constructs sitting on top of a much older architecture: live signals, mappings, the things that came before language and made language possible. If we want to build something that resembles real intelligence, we have to get back to the substrate, and we have to make it observable.

The horizon I’m working toward is whether you can make something that looks like intuition — and eventually something that looks like emotional intelligence — show up as code you can read. Not as a vibe. Not as a prompt. As an inspectable substrate where you can point to the experience that produced the behavior. I won’t pretend I’m there. I’ll pretend less than that — I’ll say I think it’s the right direction and I’m willing to be wrong in public about it.

Why this site exists

I’m using this site to put my thinking somewhere other than my laptop. Some of it will be useful to other research engineers who think the way I do and want to fork something. Some of it will be useful to me in six months when I’ve forgotten why I made a particular decision. And some of it is here because I believe enough in this direction that I’d rather take the risk of saying it out loud than keep it as a private notebook.

I’ve lived more lifetimes than the timeline suggests, and I’ve come around to thinking my purpose as a researcher is pretty specific: compress what we’ve already figured out into something the next generation can pick up faster, and ask the questions that make the transition easier for whoever comes after. The work here is the start of that.

If any of it lands for you, the repos are live and the writing is honest about what’s a working artifact and what’s still a bet.

— Brian