Supervision Trees
When agents crash, what happens? In v1.x Sage, the answer is simple: the error propagates to whoever spawned the agent, and it’s your problem. This works for task agents — short-lived workers that do one thing and exit.
But steward agents — long-lived agents that maintain a domain — need something better. A DatabaseSteward that crashes because of a transient connection error should restart, not bring down the whole program.
Supervision trees provide declarative crash recovery. You declare how agents should be restarted when they fail, and the runtime handles it automatically.
The Supervisor Declaration
A supervisor is declared with the supervisor keyword:
supervisor AppSupervisor {
    strategy: OneForOne
    children {
        DatabaseSteward {
            restart: Permanent
            connection_string: "postgres://localhost/myapp"
            schema_version: 0
        }
        APISteward {
            restart: Permanent
            port: 8080
        }
        MetricsCollector {
            restart: Transient
            interval_ms: 5000
        }
    }
}
run AppSupervisor;
When you run a supervisor, it spawns its children in declaration order and monitors them. When a child exits, the supervisor checks that child’s restart policy and, if a restart is required, applies its restart strategy.
Restart Strategies
The strategy determines what happens when a child fails.
OneForOne
Restart only the failed child. Other children continue running.
supervisor WebApp {
    strategy: OneForOne
    children {
        Worker1 { restart: Permanent }
        Worker2 { restart: Permanent }
        Worker3 { restart: Permanent }
    }
}
If Worker2 crashes, only Worker2 restarts. Worker1 and Worker3 are unaffected.
Use when: Children are independent. A database connection agent doesn’t affect an API server agent.
OneForAll
When one child fails, restart all children.
supervisor TightlyCoupled {
    strategy: OneForAll
    children {
        ConfigLoader { restart: Permanent }
        Worker1 { restart: Permanent }
        Worker2 { restart: Permanent }
    }
}
If any child crashes, all children are stopped and restarted together.
Use when: Children share state and can’t function correctly if one fails. If your config loader crashes, the workers have stale config and should restart too.
RestForOne
Restart the failed child and all children declared after it.
supervisor Pipeline {
    strategy: RestForOne
    children {
        DatabaseSteward { restart: Permanent } // Position 1
        APISteward { restart: Permanent }      // Position 2
        FrontendSteward { restart: Permanent } // Position 3
    }
}
If APISteward (position 2) crashes:
- DatabaseSteward (position 1) continues: it’s before the failure
- APISteward (position 2) restarts: it failed
- FrontendSteward (position 3) restarts: it’s after the failure
Use when: Children have dependencies in declaration order. The API steward depends on the database steward, and the frontend steward depends on the API steward. If the database fails, everything downstream should restart.
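The three strategies differ only in which children get selected for restart. The Python sketch below models that selection outside of Sage. The function name `children_to_restart` and the list-of-names representation are illustrative assumptions, not part of the Sage runtime.

```python
from typing import List


def children_to_restart(strategy: str, children: List[str], failed: str) -> List[str]:
    """Return the children a supervisor would restart, in declaration order.

    Hypothetical model of the strategies described above; not Sage runtime code.
    """
    index = children.index(failed)  # raises ValueError if the child is unknown
    if strategy == "OneForOne":
        return [failed]                # only the failed child
    if strategy == "OneForAll":
        return list(children)          # every child
    if strategy == "RestForOne":
        return children[index:]        # the failed child and everything declared after it
    raise ValueError(f"unknown strategy: {strategy}")


pipeline = ["DatabaseSteward", "APISteward", "FrontendSteward"]
print(children_to_restart("RestForOne", pipeline, "APISteward"))
# ['APISteward', 'FrontendSteward']
```

With the same `pipeline`, a failure of DatabaseSteward under RestForOne selects all three children, which matches the "everything downstream should restart" rule.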
Restart Policies
Each child has a restart policy that determines when it should be restarted.
Permanent
Always restart, regardless of exit reason.
DatabaseSteward {
    restart: Permanent
    // ...
}
If the agent exits cleanly (calls yield), restart it. If it crashes (calls yield in on error), restart it. Permanent agents run forever — until the supervisor itself stops.
Use for: Core steward agents that must always be running.
Transient
Restart only if the agent exited with an error.
MigrationRunner {
    restart: Transient
    // ...
}
If the agent exits cleanly, don’t restart — it completed its work. If it crashes, restart it to retry.
Use for: Agents that do work and then should stop, but should retry on failure.
Temporary
Never restart.
OneTimeSetup {
    restart: Temporary
    // ...
}
Run once. If it succeeds, fine. If it fails, fine. Don’t restart either way.
Use for: Initialisation agents, cleanup agents, or agents that shouldn’t retry.
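The three policies reduce to a small decision table keyed on how the agent exited. The Python sketch below is an illustrative model only; `should_restart` and its `exited_with_error` flag are hypothetical names, not a Sage API.

```python
def should_restart(policy: str, exited_with_error: bool) -> bool:
    """Decide whether a child is restarted, following the policies described above.

    Hypothetical model; the Sage runtime makes this decision internally.
    """
    if policy == "Permanent":
        return True                  # always restart
    if policy == "Transient":
        return exited_with_error     # restart only after a crash
    if policy == "Temporary":
        return False                 # never restart
    raise ValueError(f"unknown restart policy: {policy}")


for policy in ("Permanent", "Transient", "Temporary"):
    print(policy, should_restart(policy, False), should_restart(policy, True))
# Permanent True True
# Transient False True
# Temporary False False
```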
Restart Intensity Limiting
A crashing agent that keeps crashing creates a restart storm. To prevent this, supervisors have a circuit breaker:
# grove.toml
[supervision]
max_restarts = 5
within_seconds = 60
If a supervisor performs more than max_restarts restarts within within_seconds seconds, it gives up and terminates. If the supervisor has a parent supervisor, the parent treats this as a child failure and applies its own strategy.
Default: 5 restarts within 60 seconds.
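One way to picture the circuit breaker is as a sliding window over restart timestamps. The Python sketch below is an assumption about how such a limiter could work, not the Sage implementation. The class name `RestartLimiter`, its method, and the choice to treat a restart exactly `within_seconds` old as outside the window are all hypothetical.

```python
from collections import deque
from typing import Deque


class RestartLimiter:
    """Sliding-window restart counter (illustrative model, not Sage runtime code)."""

    def __init__(self, max_restarts: int = 5, within_seconds: float = 60.0) -> None:
        self.max_restarts = max_restarts
        self.within_seconds = within_seconds
        self._restarts: Deque[float] = deque()

    def record_restart(self, now: float) -> bool:
        """Record a restart at time `now` (seconds).

        Returns True while the supervisor may keep going, False once more than
        `max_restarts` restarts fall inside the window.
        """
        self._restarts.append(now)
        # Assumption: restarts that are `within_seconds` old or older no longer count.
        while now - self._restarts[0] >= self.within_seconds:
            self._restarts.popleft()
        return len(self._restarts) <= self.max_restarts


limiter = RestartLimiter()  # defaults from the text: 5 restarts within 60 seconds
results = [limiter.record_restart(float(t)) for t in range(6)]
print(results)
# [True, True, True, True, True, False]
```

Under the default settings the first five restarts are tolerated and the sixth within the same minute trips the breaker.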
Integration with Persistence
Supervision and persistent beliefs work together to provide crash recovery with state.
When a Permanent agent with @persistent fields restarts:
- The supervisor spawns a fresh agent instance
- @persistent fields are loaded from the last checkpoint
- on waking runs (validate recovered state, reconnect)
- on start runs (normal operation)
agent DatabaseSteward {
    @persistent schema_version: Int
    @persistent migration_log: List<String>
    on waking {
        print("Recovered at schema v{self.schema_version.get()}");
        reconnect_to_database();
    }
    on start {
        // Resume normal operation
        yield(0);
    }
}
Without @persistent, a restarted agent starts fresh with zero-valued fields. This may be fine for stateless workers, but steward agents typically need persistence.
Belief Initialisation
When declaring children in a supervisor, you provide initial values for their beliefs:
supervisor AppSupervisor {
    strategy: OneForOne
    children {
        QueryMonitor {
            restart: Permanent
            check_count: 0
            slow_query_threshold_ms: 100
            alert_count: 0
        }
    }
}
These are the initial values used on the first run. If the agent has @persistent fields and a checkpoint exists, the checkpoint values are used instead.
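The precedence rule can be modelled as a merge in which checkpointed values override the declared ones for @persistent fields. The Python sketch below is illustrative only: `resolve_beliefs` is a hypothetical helper, the checkpoint contents are made up, and it assumes that fields without @persistent keep the value declared in the supervisor.

```python
from typing import Any, Dict, Optional, Set


def resolve_beliefs(
    declared: Dict[str, Any],
    persistent_fields: Set[str],
    checkpoint: Optional[Dict[str, Any]],
) -> Dict[str, Any]:
    """Pick starting belief values for an agent (illustrative model, not Sage code).

    Assumption: only @persistent fields are ever read from a checkpoint;
    every other field keeps the value declared in the supervisor.
    """
    beliefs = dict(declared)  # start from the supervisor's initial values
    if checkpoint is not None:
        for name in persistent_fields:
            if name in checkpoint:
                beliefs[name] = checkpoint[name]  # the checkpoint wins
    return beliefs


declared = {"check_count": 0, "slow_query_threshold_ms": 100, "alert_count": 0}
persistent = {"check_count", "alert_count"}

print(resolve_beliefs(declared, persistent, None))
# {'check_count': 0, 'slow_query_threshold_ms': 100, 'alert_count': 0}

# Hypothetical checkpoint contents, for illustration only.
print(resolve_beliefs(declared, persistent, {"check_count": 42, "alert_count": 3}))
# {'check_count': 42, 'slow_query_threshold_ms': 100, 'alert_count': 3}
```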
Practical Example
A database guardian with multiple monitoring agents:
// Query Monitor - tracks slow queries
agent QueryMonitor {
    @persistent check_count: Int
    @persistent alert_count: Int
    on waking {
        trace("Resuming with {self.check_count.get()} previous checks");
    }
    on start {
        let count = self.check_count.get() + 1;
        self.check_count.set(count);
        trace("Check #{count}");
        // Actual monitoring logic...
        yield(count);
    }
    on error(e) {
        let alerts = self.alert_count.get() + 1;
        self.alert_count.set(alerts);
        trace("Error (alert #{alerts})");
        yield(-1);
    }
}

// Pool Monitor - watches connection pool
agent PoolMonitor {
    @persistent max_connections_seen: Int
    on start {
        let current = check_pool_connections();
        if current > self.max_connections_seen.get() {
            self.max_connections_seen.set(current);
        }
        yield(current);
    }
    on error(e) {
        yield(-1);
    }
}

// Supervisor
supervisor DbGuardian {
    strategy: OneForOne
    children {
        QueryMonitor {
            restart: Permanent
            check_count: 0
            alert_count: 0
        }
        PoolMonitor {
            restart: Permanent
            max_connections_seen: 0
        }
    }
}
run DbGuardian;
Configure in grove.toml:
[project]
name = "db-guardian"
[persistence]
backend = "sqlite"
path = ".sage/db_guardian.db"
[supervision]
max_restarts = 10
within_seconds = 30
Running a Supervisor
Use run SupervisorName; at the end of your file:
run DbGuardian;
The supervisor starts all children and monitors them. The program runs until:
- All children have exited (and none need restarting)
- The circuit breaker trips (too many restarts)
- The process is killed externally
Nested Supervisors
Supervisors can be children of other supervisors, creating a supervision tree:
supervisor DatabaseSection {
    strategy: OneForAll
    children {
        QueryMonitor { restart: Permanent }
        PoolMonitor { restart: Permanent }
    }
}

supervisor ApiSection {
    strategy: OneForOne
    children {
        RouterAgent { restart: Permanent }
        HandlerPool { restart: Permanent }
    }
}

supervisor AppRoot {
    strategy: RestForOne
    children {
        DatabaseSection { restart: Permanent }
        ApiSection { restart: Permanent }
    }
}
run AppRoot;
If the DatabaseSection supervisor’s circuit breaker trips, AppRoot sees it as a child failure and applies RestForOne — restarting DatabaseSection and ApiSection.
Maximum nesting depth: 8 levels (to prevent pathological trees).
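To get a feel for the depth limit, a tree can be modelled as a mapping from each supervisor to its children. The Python sketch below is an illustration under stated assumptions: the helper names are hypothetical, only supervisors count as levels (the root is level 1, plain agents add none), and the tree is assumed to contain no cycles. How the Sage compiler actually counts depth is not specified here.

```python
from typing import Dict, List

MAX_DEPTH = 8  # limit stated above


def supervisor_depth(tree: Dict[str, List[str]], name: str) -> int:
    """Count nested supervisor levels starting at `name`.

    Assumption: names that are not keys of `tree` are plain agents and add no level.
    """
    if name not in tree:
        return 0
    return 1 + max((supervisor_depth(tree, child) for child in tree[name]), default=0)


def check_depth(tree: Dict[str, List[str]], root: str) -> int:
    """Return the depth of the tree, or raise ValueError if it exceeds MAX_DEPTH."""
    depth = supervisor_depth(tree, root)
    if depth > MAX_DEPTH:
        raise ValueError(f"supervision tree is {depth} levels deep (max {MAX_DEPTH})")
    return depth


tree = {
    "AppRoot": ["DatabaseSection", "ApiSection"],
    "DatabaseSection": ["QueryMonitor", "PoolMonitor"],
    "ApiSection": ["RouterAgent", "HandlerPool"],
}
print(check_depth(tree, "AppRoot"))
# 2
```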
Best Practices
- Start with OneForOne. It’s the simplest and usually correct. Escalate to RestForOne or OneForAll only when you have clear dependencies.
- Use Permanent for core stewards. Your main agents should always be running.
- Use Transient for retry-on-failure workers. Agents that do work and exit should be Transient.
- Pair Permanent with @persistent. An always-restart agent without persistence restarts from scratch, which is probably not what you want.
- Tune restart intensity. The default (5 restarts in 60 seconds) may be too aggressive or too lenient for your use case.
- Keep supervisors shallow. Deep nesting is a code smell. If you need more than 2-3 levels, reconsider your architecture.
Related
- Persistent Beliefs: State that survives restarts
- The Steward Pattern: Building long-lived agents
- Lifecycle Hooks: on waking, on resting, and friends