Data and storage

Derived from the honeycomb knowledge base, captured 2026-06. Written for an external practitioner. The DDL shapes shown are the logical table shapes from the knowledge base; the runtime source of truth is the daemon's schema module, which the lazy heal pass converges every table toward. Confirm exact columns against your installed version.

#The concept

All of honeycomb's durable state lives in tables on a GPU-backed SQL and vector store. The daemon is the only process that opens that store; everything else reaches it through the daemon. The storage layer has a few unusual properties that shape every table and every write pattern, so it pays to understand them before reading the catalog.

#Storage properties a practitioner must know

  • Lazy schema healing. Tables and columns are created on first write, not through an upfront migration. A new column added with a safe default is filled in on the next heal pass, so adding a field does not require a migration step ahead of the worker that writes it. Schema changes are additive.
  • No parameterized queries. The query endpoint takes no bound parameters, so the daemon builds SQL by string composition and escapes every value itself through dedicated helpers. This is why all SQL construction lives in one place (the daemon) and never in a client.
  • Append-only, version-bumped writes. The backend coalesces updates in a way that can silently drop concurrent edits, so honeycomb does not lean on naive in-place updates for hot tables. The current state of a versioned row is its highest version; a change appends a new version rather than mutating the old one.
  • Select-before-insert with drift detection. Writes that must be unique check for an existing row first and re-verify after, making concurrent-writer races observable rather than silent, because the backend has no server-side unique constraint to lean on.
  • Tenant isolation at the storage layer. Organization and workspace isolation is enforced at the storage partition, so two workspaces never share a row, partition, or index. Most tables therefore do not need explicit tenancy columns; a few cross-cutting tables carry explicit organization and workspace ids.
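
The string-composition and select-before-insert properties above can be sketched together. This is a minimal illustration, not the daemon's code: `sql_str`, `insert_unique`, and `sqlite_executor` are hypothetical names, the quote-doubling rule is standard SQL and may differ from what the real helpers do, and an in-memory SQLite database stands in for the backend only so the sketch runs.

```python
import sqlite3
from typing import Callable, List, Tuple

# Hypothetical executor type: takes composed SQL text, returns result rows.
Execute = Callable[[str], List[Tuple]]


def sql_str(value: str) -> str:
    """Quote a string literal by doubling single quotes (standard SQL rule)."""
    return "'" + value.replace("'", "''") + "'"


def insert_unique(execute: Execute, table: str, key_col: str,
                  key: str, content: str) -> str:
    """Select-before-insert with a post-write re-check.

    `table` and `key_col` must be trusted identifiers from code, never user
    input. Returns 'exists', 'inserted', or 'drift' (duplicate seen after write).
    """
    count_sql = (
        f'SELECT COUNT(*) FROM "{table}" WHERE {key_col} = {sql_str(key)}'
    )
    if execute(count_sql)[0][0] > 0:
        return "exists"
    execute(
        f'INSERT INTO "{table}" ({key_col}, content) '
        f"VALUES ({sql_str(key)}, {sql_str(content)})"
    )
    # Re-verify: a racing writer may have inserted the same key in between.
    return "inserted" if execute(count_sql)[0][0] == 1 else "drift"


def sqlite_executor(conn: sqlite3.Connection) -> Execute:
    """Stand-in backend for the sketch; the real store sits behind the daemon."""
    def run(sql: str) -> List[Tuple]:
        return conn.execute(sql).fetchall()
    return run
```

A caller that gets `drift` back has learned that two writers raced; what to do next (keep the oldest row, soft-delete the duplicate, retry) is a policy choice this sketch leaves open.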

#The three "memory" tables

Three tables are easy to confuse because they all hold something called memory. Get them straight before reading further.

| Table | Holds | Written by |
| --- | --- | --- |
| sessions | The raw capture stream, one row per event | Capture |
| memories | The distilled engine output, the facts the pipeline decided to keep | The pipeline |
| memory | Wiki summaries and the virtual-filesystem file rows | The summary worker |

Capture writes sessions; the pipeline reads sessions and writes memories; the summary worker writes memory.

```mermaid
flowchart LR
    sessions["sessions (raw events)"] --> pipeline["pipeline"]
    pipeline --> memories["memories (distilled facts)"]
    pipeline --> entities["entities plus ontology"]
    sessions --> summary["summary worker"]
    summary --> memory["memory (wiki plus browse)"]
    memories --> skills["skillify -> skills"]
```

  • sessions holds one row per prompt, tool call, or response. Its message body is structured JSON, with an optional vector. Rows are append-only inserts; readers concatenate by path in time order.
  • memory holds wiki summaries and browse-surface file rows. It is update-or-insert keyed by path and carries a one-line key for fast session priming.
  • memories is the engine's distilled output, with confidence, importance, provenance, a dedup hash, a soft-delete flag, and scope columns. It is the table recall ranks over. Each row carries a durable one-sentence key written at distillation time so the session-priming digest can skim durable keys with a pure SQL select and no generation at read time.

#The distilled-memory schema (illustrative)

```sql
CREATE TABLE IF NOT EXISTS "memories" (
  id                 TEXT NOT NULL DEFAULT '',
  type               TEXT NOT NULL DEFAULT 'fact',
  content            TEXT NOT NULL DEFAULT '',
  key                TEXT NOT NULL DEFAULT '',
  normalized_content TEXT NOT NULL DEFAULT '',
  content_hash       TEXT NOT NULL DEFAULT '',
  confidence         FLOAT4 NOT NULL DEFAULT 1.0,
  importance         FLOAT4 NOT NULL DEFAULT 0.5,
  tags               TEXT NOT NULL DEFAULT '[]',
  project            TEXT NOT NULL DEFAULT '',
  project_id         TEXT NOT NULL DEFAULT '',
  source_id          TEXT NOT NULL DEFAULT '',
  source_type        TEXT NOT NULL DEFAULT '',
  pinned             BIGINT NOT NULL DEFAULT 0,
  is_deleted         BIGINT NOT NULL DEFAULT 0,
  agent_id           TEXT NOT NULL DEFAULT 'default',
  visibility         TEXT NOT NULL DEFAULT 'global',
  content_embedding  FLOAT4[],
  created_at         TEXT NOT NULL DEFAULT '',
  updated_at         TEXT NOT NULL DEFAULT ''
) USING deeplake;
```
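
Almost every column in this shape is `NOT NULL` with a default, which is what makes additive healing safe. The sketch below shows one way a heal pass could plan its work: compare the declared columns with what the table already has and emit only additions. The function name and the `ALTER TABLE ... ADD COLUMN` syntax are assumptions for illustration; the daemon's schema module is the authority on how healing actually runs.

```python
from typing import Dict, List


def heal_statements(table: str, declared: Dict[str, str],
                    existing: List[str]) -> List[str]:
    """Return additive statements for declared columns the table lacks.

    `declared` maps column name to its type-and-default clause. Columns that
    exist but are not declared are left alone: healing never drops anything.
    """
    present = set(existing)
    return [
        f'ALTER TABLE "{table}" ADD COLUMN {name} {clause}'
        for name, clause in declared.items()
        if name not in present
    ]
```

Running the plan twice is harmless: once the missing columns exist, the second call returns an empty list.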

The key column is additive and heal-compatible; a row with no derived key falls back to its content at read time, so a legacy un-keyed row is still primeable.
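
The read-time fallback can be sketched as follows. Both function names are hypothetical, and three details are assumptions rather than documented behavior: ranking the digest by `importance`, using only the first line of the content for un-keyed rows, and the `max_chars` cap.

```python
from typing import Mapping


def digest_sql(limit: int) -> str:
    """Compose a plain select for the priming digest (no generation involved)."""
    return (
        'SELECT key, content FROM "memories" '
        "WHERE is_deleted = 0 "
        f"ORDER BY importance DESC LIMIT {int(limit)}"
    )


def priming_line(row: Mapping[str, str], max_chars: int = 160) -> str:
    """Return the one-line text used to prime a session for this row."""
    key = (row.get("key") or "").strip()
    if key:
        return key
    # Legacy row with no derived key: fall back to the content.
    content = (row.get("content") or "").strip()
    first_line = content.splitlines()[0] if content else ""
    return first_line[:max_chars]
```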

#The rest of the catalog

| Group | Tables | What they hold |
| --- | --- | --- |
| Engine support | memory_history, memory_jobs, embeddings | The audit trail of every proposal, the durable distillation job queue, and the vectors mirrored for GPU search. |
| Knowledge graph | entities, entity_aspects, entity_attributes, entity_dependencies, memory_entity_mentions, epistemic_assertions, ontology_proposals | The ontology, with supersession by appended attribute version. |
| Sources and documents | memory_artifacts, documents, document_memories, connectors | Source-backed rows keyed by source id, the ingest lifecycle, the document-to-chunk join, and external-connector sync cursors. |
| Product tables | skills, rules, goals, kpis, codebase | Mined skill versions, org-wide rules, goals and KPIs, and codebase-graph snapshots. |
| Tenancy and auth | agents, api_keys, projects, synced_assets | The within-workspace agent roster and read policies, hashed connector keys, the per-workspace project registry, and the team asset-sync substrate. |
| Telemetry | (opt-in counters and an optional recall-quality ledger) | Usage counters and diagnostics; never carries secrets or request bodies. |

Skills and rules are append-only and version-bumped (the current state for a logical key is the highest version). Goals and KPIs are update-or-insert by their logical key. Snapshots in codebase are one row per repository-checkout identity, deduped by a content hash.
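
To make the version-bumped pattern concrete, here is a minimal in-memory sketch. The `name` and `version` field names and both function names are assumptions chosen for illustration; confirm the real columns against your installed schema.

```python
from typing import Any, Dict, Iterable, Mapping

Row = Mapping[str, Any]


def current_by_key(rows: Iterable[Row], key_field: str = "name") -> Dict[str, Row]:
    """Pick the highest-version row for each logical key."""
    current: Dict[str, Row] = {}
    for row in rows:
        key = row[key_field]
        if key not in current or row["version"] > current[key]["version"]:
            current[key] = row
    return current


def next_version(rows: Iterable[Row], key: str, changes: Row,
                 key_field: str = "name") -> Dict[str, Any]:
    """Build the row to append for a change; existing rows are never mutated."""
    latest = current_by_key(rows, key_field).get(key)
    base = dict(latest) if latest else {key_field: key, "version": 0}
    return {**base, **changes, key_field: key, "version": base["version"] + 1}
```

Because a change is an append, two writers that race each produce their own row instead of overwriting each other, and a reader resolves the winner by version.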

#Per-project scoping

Tenancy has a third, soft ring inside a workspace: the project. A projects registry records the projects a folder can bind to. Memory and skills carry a resolved project id that the scope clause segments on, defaulting to a reserved per-workspace inbox so a capture is never dropped when no project resolves. A project is a registry-backed identity, not a repository id; a canonical git remote is only an optional auto-bind signal. Cross-project sharing of a skill is an explicit, auditable opt-in recorded directly on the row.
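
A rough sketch of that resolution order, under stated assumptions: the inbox id value, the two binding maps, and the precedence of an explicit folder binding over a git-remote auto-bind are all hypothetical choices made for the example, not documented behavior.

```python
from typing import Mapping, Optional, Set

# Hypothetical reserved id for the per-workspace inbox.
INBOX_PROJECT_ID = "__inbox__"


def resolve_project_id(folder: str,
                       registry: Set[str],
                       folder_bindings: Mapping[str, str],
                       remote_bindings: Mapping[str, str],
                       git_remote: Optional[str] = None) -> str:
    """Resolve a capture's project id, never returning 'no project'."""
    bound = folder_bindings.get(folder)
    if bound in registry:
        return bound
    # The git remote is only an optional auto-bind hint, not an identity.
    if git_remote is not None:
        auto = remote_bindings.get(git_remote)
        if auto in registry:
            return auto
    # Nothing resolved: land in the inbox so the capture is not dropped.
    return INBOX_PROJECT_ID
```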

#The memory virtual filesystem

honeycomb presents the team-shared database as an ordinary directory and intercepts the shell commands that touch that mount, so an assistant browses memory with cat, ls, grep, and find while every operation is really a scoped query. No real files exist at these paths: every read hits an in-memory cache, a pending-write buffer, or a query, and every write is buffered and flushed on a timer.
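
The read and write paths described here can be sketched with a small class. `BufferedFiles` and its callables are hypothetical; the sketch checks the pending buffer before the cache so a reader always sees its own latest write, and it exposes `flush()` as an explicit call where the real system would drive it from a timer.

```python
from typing import Callable, Dict, Optional


class BufferedFiles:
    """Virtual files backed by a query function, with buffered writes."""

    def __init__(self,
                 query: Callable[[str], Optional[str]],
                 store_batch: Callable[[Dict[str, str]], None]) -> None:
        self._query = query              # reads one path from storage
        self._store_batch = store_batch  # writes a batch of paths to storage
        self._pending: Dict[str, str] = {}
        self._cache: Dict[str, str] = {}

    def write(self, path: str, text: str) -> None:
        # Writes only touch the buffer; storage is updated on flush.
        self._pending[path] = text

    def read(self, path: str) -> Optional[str]:
        if path in self._pending:        # read-your-own-writes
            return self._pending[path]
        if path in self._cache:
            return self._cache[path]
        text = self._query(path)
        if text is not None:
            self._cache[path] = text
        return text

    def flush(self) -> int:
        """Send buffered writes to storage; returns how many were flushed."""
        if not self._pending:
            return 0
        batch, self._pending = self._pending, {}
        self._store_batch(batch)
        self._cache.update(batch)
        return len(batch)
```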

Three things the intercept hides from the agent:

  • Write batching. A read immediately after a write reads from the pending buffer, so the agent sees its own write even before it reaches storage.
  • The multi-row session layout. A session "file" is dozens of rows concatenated transparently. Session files are read-only at this layer; attempts to write, append, remove, copy, or move them are rejected, because they are an append-only event log owned by capture.
  • The structured goals and KPIs tables. Goals and KPIs appear as plain markdown files, so an agent manages objectives with file operations while the CLI reads the same state from typed columns. Goal lifecycle is expressed through file verbs: removing a goal file is a soft close (status flipped, the row preserved for the audit trail), and moving a goal between status folders is a status transition that may change only the status component.
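
The goal file verbs in the last item can be illustrated with a small validator. The `goals/<status>/<file>` path layout, the status names, and the function names are assumptions invented for this sketch; only the two rules come from the text above: a move may change nothing but the status, and a remove flips the status while keeping the row.

```python
from pathlib import PurePosixPath
from typing import Any, Dict, Tuple

# Hypothetical status folders; the real set may differ.
STATUSES = {"active", "done", "closed"}


def parse_goal_path(path: str) -> Tuple[str, str]:
    """Split 'goals/<status>/<file>' into (status, file name)."""
    parts = PurePosixPath(path).parts
    if len(parts) != 3 or parts[0] != "goals" or parts[1] not in STATUSES:
        raise ValueError(f"not a goal path: {path}")
    return parts[1], parts[2]


def move_goal(src: str, dst: str) -> str:
    """Validate a move as a pure status transition; return the new status."""
    _, src_name = parse_goal_path(src)
    dst_status, dst_name = parse_goal_path(dst)
    if src_name != dst_name:
        raise ValueError("a move may change only the status folder")
    return dst_status


def remove_goal(row: Dict[str, Any]) -> Dict[str, Any]:
    """Soft close: flip the status and keep the row for the audit trail."""
    return {**row, "status": "closed"}
```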

A synthesized index file at the mount root lists the most recent summaries and sessions, and a synthesized subtree renders the codebase-graph queries from the local snapshot. The same browse view is produced by both the long-lived shell object and the stateless pre-tool hook, sharing one renderer so they never disagree.

#Retention

Because the backend exposes no transactions at this layer, retention runs as batched, idempotent sweeps in a daemon worker rather than cascading deletes.

| Data | Default behavior |
| --- | --- |
| sessions raw events | Pruned by the sessions-prune operation; summaries retained in memory |
| memories | Soft-delete window before purge; history retained longer |
| memory_jobs | Completed jobs purged after a window; dead jobs later |
| memory_artifacts | Soft-delete on source-file removal, hard purge on source disconnect by source id |
| skills / rules | Append-only version history retained |
| Embeddings / vectors | Purged with their owning row during retention sweeps |
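
A batched, idempotent sweep might look like the sketch below, which works on in-memory dictionaries rather than the real store. The function name, the use of `updated_at` as the soft-delete timestamp, and the batch size are assumptions; timestamps are taken to be ISO-8601 UTC strings, which compare correctly as text.

```python
from typing import Any, Dict, List


def sweep_purge(rows: Dict[str, Dict[str, Any]],
                embeddings: Dict[str, List[float]],
                cutoff: str,
                batch_size: int = 100) -> int:
    """Purge soft-deleted rows older than `cutoff`, one batch at a time.

    Each batch is selected fresh, so an interrupted sweep can simply be run
    again: rows that are already gone are never selected a second time.
    """
    purged = 0
    while True:
        batch = [
            row_id for row_id, row in rows.items()
            if row["is_deleted"] and row["updated_at"] < cutoff
        ][:batch_size]
        if not batch:
            return purged
        for row_id in batch:
            rows.pop(row_id, None)
            embeddings.pop(row_id, None)  # vectors go with their owning row
        purged += len(batch)
```

Selecting by predicate on every pass, instead of walking a saved cursor, is what lets the sweep stay correct without transactions: the worst outcome of a crash is that some eligible rows wait for the next run.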