How doctor works

The mechanics behind doctor's watchdog loop: how it checks each daemon, tells down from wedged, climbs the repair ladder, and escalates when it's stuck.

doctor probes each daemon every 30 seconds, classifies the result as healthy, degraded, refused, or timed out, and if unhealthy climbs a repair ladder: restart, then reinstall, then remove a conflicting package, backing off between tries and stopping the moment health returns.

The short version: doctor checks, doctor decides what kind of broken it found, and doctor tries the smallest fix first. Here are the mechanics underneath that.

#Why the OS supervises doctor instead of doctor supervising itself

Every self-monitoring scheme hits the same wall: who restarts the restarter? If doctor tried to watch itself, a bug that took down doctor would also take down the thing meant to notice. So doctor doesn't. Your operating system's own service manager (launchd on macOS, systemd on Linux, Scheduled Tasks on Windows) restarts doctor on a crash and starts it on boot. doctor has no "restart myself" code path, on purpose.

#One supervisor per daemon

doctor reads a list of daemons it's responsible for and builds one fully independent supervisor per entry. Each one has its own probe, its own backoff clock, its own repair ladder, and its own incident log. A crash loop in one daemon can never bleed into another daemon's state: nectar failing repeatedly does not touch honeycomb's record.
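Here is a minimal sketch of that isolation in Python. The names `Supervisor` and `build_supervisors` are hypothetical, chosen for illustration; doctor's real types aren't shown in this document.

```python
from dataclasses import dataclass, field


@dataclass
class Supervisor:
    """Hypothetical per-daemon state; nothing here is shared between daemons."""
    daemon: str
    consecutive_failures: int = 0
    backoff_seconds: float = 0.0
    incidents: list = field(default_factory=list)  # each instance gets its own list

    def record_failure(self, reason: str) -> None:
        self.consecutive_failures += 1
        self.incidents.append(reason)


def build_supervisors(daemons: list) -> dict:
    # One independent Supervisor per configured daemon name.
    return {name: Supervisor(daemon=name) for name in daemons}


supervisors = build_supervisors(["nectar", "honeycomb"])
for _ in range(5):
    supervisors["nectar"].record_failure("connection refused")
```

After five failures on nectar, honeycomb's counter and incident list are still empty, because the two objects share no state.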

#What a check actually finds

Every 30 seconds, doctor sends one bounded request to each daemon and reads back exactly one of four answers:

  • Healthy. The daemon answered cleanly.
  • Degraded. The daemon answered, but something under the hood is off; doctor gets the specific reason (storage, embeddings, schema).
  • Unreachable, refused. The connection was flat-out refused. The daemon is down.
  • Unreachable, timed out. The socket accepted the connection but never answered. The daemon isn't dead; it's wedged, and that is a different problem from being down.

That refused-versus-timeout split matters. A dead daemon needs a restart. A wedged one might need something else entirely, and doctor's repair ladder treats them differently instead of blindly restarting everything.
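The four answers can be sketched as a small classifier. This is an illustration, not doctor's actual code: `Health`, `classify`, and the `probe` callable are hypothetical names, and the sketch assumes the probe returns None when healthy, returns a reason string when degraded, and raises Python's built-in `ConnectionRefusedError` or `TimeoutError` when the daemon is unreachable.

```python
from enum import Enum
from typing import Callable, Optional, Tuple


class Health(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    REFUSED = "unreachable_refused"
    TIMED_OUT = "unreachable_timed_out"


def classify(probe: Callable[[], Optional[str]]) -> Tuple[Health, Optional[str]]:
    """Run one probe and map its outcome to exactly one of the four answers."""
    try:
        reason = probe()
    except ConnectionRefusedError:
        return Health.REFUSED, None    # nothing is listening: the daemon is down
    except TimeoutError:
        return Health.TIMED_OUT, None  # connected but no reply: the daemon is wedged
    if reason is None:
        return Health.HEALTHY, None
    return Health.DEGRADED, reason     # answered, with a specific reason


def wedged_probe() -> Optional[str]:
    # Stand-in for a daemon that accepts the connection and never answers.
    raise TimeoutError("no reply within the deadline")


state, reason = classify(wedged_probe)
```

Keeping refused and timed out as separate values is what lets a later stage choose a different repair for each.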

A daemon that was just started gets a grace window (60 seconds by default) before doctor judges it unhealthy, so a slow boot doesn't trigger a false alarm.
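The grace window amounts to one comparison. In this sketch the function name and the choice to treat the boundary as exclusive are assumptions; only the 60-second default comes from the text above.

```python
GRACE_SECONDS = 60.0  # default grace window


def in_grace_window(started_at: float, now: float, grace: float = GRACE_SECONDS) -> bool:
    """True while a freshly started daemon should not yet be judged unhealthy."""
    return (now - started_at) < grace


still_booting = in_grace_window(started_at=100.0, now=130.0)  # 30 s after start
past_grace = in_grace_window(started_at=100.0, now=175.0)     # 75 s after start
```

A failed probe during the window would simply be ignored rather than counted as an incident.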

#The repair ladder

When a daemon is genuinely unhealthy, doctor climbs a ladder, cheapest fix first:

  1. Restart it. This is rung one and it's where almost every problem gets solved. doctor won't restart a daemon it just restarted (a cooldown prevents doctor from fighting the daemon's own restart), and it won't restart a daemon that already looks healthy.
  2. Reinstall it. If restarts keep failing (three in a row, by default), doctor reinstalls the daemon from the approved release.
  3. Remove a conflict. If a conflicting global package is detected, doctor removes it, but only after writing a backup record first, and only the package, never your data directories.
  4. Escalate. If nothing above worked, doctor stops trying blind fixes and writes a plain report instead.
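One way to picture the ladder is as a function that picks the cheapest rung not yet exhausted. This is a sketch under assumptions: `LadderState`, `next_rung`, and `may_restart` are hypothetical names, the 30-second cooldown is an invented value (the text gives no number), and the exact ordering rules inside doctor may differ.

```python
from dataclasses import dataclass

RESTART_GIVE_UP = 3      # failed restarts in a row before reinstalling (stated default)
RESTART_COOLDOWN = 30.0  # assumed value for illustration only


@dataclass
class LadderState:
    failed_restarts: int = 0
    reinstall_failed: bool = False
    conflict_detected: bool = False
    conflict_removal_failed: bool = False


def next_rung(state: LadderState) -> str:
    """Pick the cheapest rung that has not been exhausted yet."""
    if state.failed_restarts < RESTART_GIVE_UP:
        return "restart"
    if not state.reinstall_failed:
        return "reinstall"
    if state.conflict_detected and not state.conflict_removal_failed:
        return "remove_conflict"
    return "escalate"


def may_restart(healthy: bool, seconds_since_last_restart: float) -> bool:
    """Rung one's two guards: never restart a healthy or just-restarted daemon."""
    return (not healthy) and seconds_since_last_restart >= RESTART_COOLDOWN


state = LadderState(failed_restarts=3)
rung = next_rung(state)  # restarts are exhausted, so the next rung is "reinstall"
```

Note that conflict removal only appears when a conflict was actually detected; otherwise a failed reinstall leads straight to escalation.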

Between rungs, doctor waits longer each time (a doubling backoff with a floor and a ceiling) so a struggling daemon isn't hammered with retries. The moment a health check comes back clean, the ladder resets to zero. There's no lingering penalty for a daemon that's already fixed itself.
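A doubling backoff with a floor and a ceiling might look like the sketch below. The class name and the 5-second floor and 300-second ceiling are assumptions made for illustration; the text doesn't state doctor's actual values.

```python
class Backoff:
    """Doubling delay, clamped between a floor and a ceiling."""

    def __init__(self, floor: float = 5.0, ceiling: float = 300.0) -> None:
        self.floor = floor
        self.ceiling = ceiling
        self.attempt = 0

    def next_delay(self) -> float:
        delay = min(self.ceiling, self.floor * (2 ** self.attempt))
        if delay < self.ceiling:
            self.attempt += 1  # stop counting once the ceiling is reached
        return delay

    def reset(self) -> None:
        # Called when a health check comes back clean: no lingering penalty.
        self.attempt = 0


backoff = Backoff()
delays = [backoff.next_delay() for _ in range(8)]
# delays == [5.0, 10.0, 20.0, 40.0, 80.0, 160.0, 300.0, 300.0]
backoff.reset()
first_after_reset = backoff.next_delay()  # back to the floor: 5.0
```

The reset is what makes a recovered daemon start from the floor again the next time it fails.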

#What escalation actually means

Escalation isn't a bigger hammer; it's the point where doctor stops swinging. It writes a record: what it diagnosed, every step it tried and whether each one worked, and what it recommends you do next. For a deferred action (like the credential check below), it notes what it would have done. That record shows up on the local status page at 127.0.0.1:3852, so you see it the moment you look instead of digging it out of a log file.
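As a rough picture of what such a record could hold, here is a sketch in Python. The field names, the `render` layout, and the sample values are all hypothetical; the document describes the record's contents but not its actual format.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    rung: str
    outcome: str  # "ok", "failed", or "skipped"


@dataclass
class EscalationReport:
    daemon: str
    diagnosis: str
    steps: list = field(default_factory=list)     # every step tried, in order
    deferred: list = field(default_factory=list)  # what doctor would have done
    recommendation: str = ""

    def render(self) -> str:
        lines = [f"{self.daemon}: {self.diagnosis}"]
        lines += [f"  tried {s.rung}: {s.outcome}" for s in self.steps]
        lines += [f"  deferred: {d}" for d in self.deferred]
        lines.append(f"  next: {self.recommendation}")
        return "\n".join(lines)


report = EscalationReport(
    daemon="nectar",
    diagnosis="unreachable, timed out",
    steps=[Step("restart", "failed"), Step("reinstall", "failed")],
    deferred=["credential check"],
    recommendation="inspect the daemon manually",
)
text = report.render()
```

Because the report lists each step with its outcome, a reader can see what was already tried before deciding what to do next.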

#The blessed-update gate

doctor can update the memory daemon automatically, but only through a gate: a version has to be explicitly approved before doctor will roll it out, and after updating, doctor verifies the daemon is actually healthy. If that verification fails, doctor rolls back to the last version that worked, so a bad release can't spread itself across your machine. doctor never updates its own package this way: there's a single explicit command for that, and nothing else triggers it.
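The gate boils down to approve, install, verify, and roll back on failure. The sketch below assumes hypothetical names throughout (`gated_update`, the `blessed` set, and the `install` and `is_healthy` callables) and invented version strings; it illustrates the control flow, not doctor's real updater.

```python
from typing import Callable, List, Set


def gated_update(
    candidate: str,
    current: str,
    blessed: Set[str],
    install: Callable[[str], None],
    is_healthy: Callable[[], bool],
) -> str:
    """Try one update and return the version left installed afterwards."""
    if candidate not in blessed:
        return current    # not explicitly approved: do nothing
    install(candidate)
    if is_healthy():
        return candidate  # verified healthy: keep the new version
    install(current)      # verification failed: roll back to the last good version
    return current


installed: List[str] = []  # records every install call, for illustration
result = gated_update(
    candidate="2.1.0",
    current="2.0.3",
    blessed={"2.1.0"},
    install=installed.append,
    is_healthy=lambda: False,  # simulate a release that fails verification
)
```

In this run the bad release is installed once and then replaced by the previous version, which is what stops it from sticking.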

#What doctor will never touch

There is no code path in doctor that reads, writes, or deletes your credentials file. If doctor suspects a credential problem, the only thing it does is tell you and stop. There is deliberately no command that clears credentials, because that's not a decision a watchdog should be allowed to make on its own.

#How doctor feeds hive

doctor is the single source of truth for fleet health. Each daemon writes its own local telemetry; doctor polls all of it, merges it into one picture, and streams that picture out over one feed. hive, the portal, reads that feed and renders it live. Nothing in hive probes the daemons directly; doctor is the one place that decides what "healthy" means.
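The merge step can be pictured as folding per-daemon statuses into one snapshot. Everything about the shape here is an assumption: the function name, the input mapping of daemon name to status string, and the keys of the returned dictionary are invented for illustration, since the feed's real format isn't described.

```python
from typing import Dict


def merge_fleet(telemetry: Dict[str, str]) -> dict:
    """Fold per-daemon statuses into one snapshot suitable for a single feed."""
    unhealthy = sorted(name for name, status in telemetry.items() if status != "healthy")
    return {
        "daemons": dict(telemetry),     # copy, so callers can't mutate the input
        "unhealthy": unhealthy,
        "fleet_healthy": not unhealthy,  # the one place that decides "healthy"
    }


snapshot = merge_fleet({"nectar": "degraded", "honeycomb": "healthy"})
```

A reader like hive would then render the snapshot as-is instead of probing any daemon itself.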

#Common questions

Why doesn't doctor just restart everything at the first sign of trouble? Because down and wedged are different diseases. A restart fixes a dead daemon fast, but a wedged one might need a different fix, and blindly restarting everything wastes time and risks masking a real problem.

What stops doctor from getting stuck in a restart loop? The give-up threshold. After a set number of failed restarts, doctor stops restarting and moves to the next rung, reinstall, instead of retrying the same fix forever.

Does doctor ever guess at a fix it isn't sure about? No. Every rung either succeeds, fails, or is skipped for a specific documented reason. A deliberate skip never counts against the give-up threshold; only a genuine failure does.
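The skip rule can be sketched as a counter that ignores skips. The function name and the outcome strings are hypothetical, and the sketch assumes the threshold counts consecutive genuine failures at the end of the history.

```python
from typing import List


def consecutive_failures(outcomes: List[str]) -> int:
    """Count genuine failures at the end of the history, ignoring skips."""
    count = 0
    for outcome in reversed(outcomes):
        if outcome == "skipped":
            continue  # a deliberate skip neither counts nor breaks the streak
        if outcome != "failed":
            break     # a success ends the streak
        count += 1
    return count


streak = consecutive_failures(["ok", "failed", "skipped", "failed", "skipped"])
# streak == 2: the two skips are not counted
```

With a give-up threshold of three, this history would still leave one more restart attempt before the next rung.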