Why agent memory needs receipts

On this page

When an agent tells you "we deploy with npm run deploy:staging", the useful follow-up is: where did it learn that? From a session last Tuesday? From the ops/deploy.md you wrote six months ago? From a hallucination it produced at 2 a.m.?

Without the source, you cannot calibrate trust. The recall is a black box — useful sometimes, dangerous when it is wrong.

The provenance rule #

honeycomb enforces a simple invariant: every memory cell stores a source_path, a session_id, and a score. The recall query returns them alongside the snippet. Nothing enters the store without one of those anchors.

In the dashboard you see it immediately:

What "verified" actually means #

The verified state (green dot) means the memory is anchored to a file or commit that still exists on disk. Every Dreaming consolidation pass re-checks anchors. If the file is deleted or the session is purged, the cell downgrades to unverified — still retrievable, but visually flagged.

This matters because codebases drift. A deploy script that existed in April may not be the right one in June. Verification catches stale anchors before they mislead.

Implementing it in practice #

typescript
// Every memory write must supply provenance.
interface MemoryWrite {
  content: string;
  sourcePath?: string;   // file path, e.g. "ops/deploy-staging.md:12"
  sessionId?: string;    // e.g. "session:8f2a"
  scope: 'personal' | 'team' | 'org';
}

// The store rejects writes without at least one anchor.
function validateProvenance(write: MemoryWrite): void {
  if (!write.sourcePath && !write.sessionId) {
    throw new Error('Memory write must include sourcePath or sessionId.');
  }
}

The constraint feels strict at first. Some facts — a team convention established in conversation, never written down — have only a session anchor. That is fine. The session id is the receipt. You can look it up, see the context, decide if the memory is still true.

The cost of skipping this #

Early prototypes of honeycomb stored plain text without provenance. In load-testing we saw the expected pattern: recall was helpful 90% of the time and confidently wrong 10% of the time — with no way to distinguish the two cases programmatically. The 10% was silent. That is the failure mode that breaks teams.

Receipts are not a nice-to-have. They are the feature.