Monitoring & failures
Track a campaign from the cockpit — progress, ETA, workers — and understand quarantine, failure diagnostics, and one-click retries.
The Plan cockpit shows every campaign and lets you see exactly what’s happening — and why something failed.
Campaign cards
Each campaign is a card with:
- a progress bar (done / total) and percentage;
- active sessions and an ETA (derived from throughput);
- a readable quota line when parked: “quota exhausted · auto-resume ~14:30”;
- a failed count when units are quarantined.
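To make "derived from throughput" concrete, here is a minimal sketch of how a percentage and an ETA could be computed from the done / total counters. The function names and the choice of averaging throughput since the campaign start are assumptions for illustration, not Plan's actual formula.

```python
from datetime import datetime, timedelta
from typing import Optional


def progress_percent(done: int, total: int) -> float:
    """Percentage shown next to the progress bar (0.0 for an empty campaign)."""
    return 100.0 * done / total if total else 0.0


def estimate_eta(done: int, total: int,
                 started_at: datetime, now: datetime) -> Optional[datetime]:
    """Project a finish time from average throughput so far.

    Assumption: throughput is averaged over the whole run; a real
    implementation might use a sliding window instead.
    """
    elapsed = (now - started_at).total_seconds()
    if done <= 0 or elapsed <= 0:
        return None  # nothing completed yet, so there is no rate to extrapolate
    if done >= total:
        return now   # already finished
    remaining = total - done
    # remaining / (done / elapsed), rearranged to avoid an extra division
    seconds_left = remaining * elapsed / done
    return now + timedelta(seconds=seconds_left)
```

With 50 of 100 units done after one hour, this sketch projects one more hour of work. An ETA of None would be rendered as "unknown" until the first unit completes.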
The engine strip above shows engine state, the global pool, and a pause switch. Cards are clickable to edit the campaign.
Drill-down: workers & failures
Open a card’s Workers view to see the live sessions (current unit, heartbeat, lease) and the failed units list — never a table of thousands of rows, just the workers and the exceptions.
Quarantine vs parking
These are different on purpose:
- Parking = a quota pause. The unit stays “to do”, attempts is unchanged, and it auto-resumes. Not a failure.
- Quarantine = a genuine failure: after max_attempts (crash loops, or the agent returning idle without producing the output). The unit is set to failed and listed.
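The distinction can be sketched as two separate transitions on a unit. The Unit class, the "todo" status string, the parked flag, and the >= comparison against max_attempts are hypothetical; only attempts, max_attempts, and the failed status come from the text above.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    status: str = "todo"   # hypothetical values: "todo", "done", "failed"
    attempts: int = 0
    parked: bool = False   # hypothetical flag for a quota pause


def on_quota_exhausted(unit: Unit) -> None:
    """Parking: pause the unit without touching status or attempts."""
    unit.parked = True


def on_quota_restored(unit: Unit) -> None:
    """Auto-resume: the unit is simply eligible to be claimed again."""
    unit.parked = False


def on_failed_attempt(unit: Unit, max_attempts: int) -> None:
    """Quarantine: count the attempt and fail the unit once the limit is hit.

    Assumption: the limit is inclusive (attempts >= max_attempts).
    """
    unit.attempts += 1
    if unit.attempts >= max_attempts:
        unit.status = "failed"
```

Keeping the two paths separate is what lets a long quota outage leave the attempt counter untouched.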
Failure diagnostics
When a unit is quarantined, Plan records why — the reason plus the last screen captured from the dead session. So instead of guessing, you see the actual cause (a wrong path, a missing file, a CLI error) right in the drill-down.
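As an illustration of what such a record might hold, the sketch below pairs a reason with the captured screen and trims it for display. The FailureDiagnostic class, its field names, and the summarize helper are hypothetical, not Plan's storage format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureDiagnostic:
    unit_id: str
    reason: str        # why the unit was quarantined
    last_screen: str   # final terminal output captured from the dead session


def summarize(diag: FailureDiagnostic, tail_lines: int = 5) -> str:
    """Render the reason plus the last few lines of the captured screen."""
    tail = diag.last_screen.splitlines()[-tail_lines:]
    return f"{diag.unit_id}: {diag.reason}\n" + "\n".join(tail)
```

Showing only the tail keeps the drill-down short, since the actual error usually sits in the final lines of the screen.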
Resilience built in
- Anti-spiral: Plan only attempts --resume for a session that actually started; a never-established session is relaunched fresh, so a bad first launch can’t loop into quarantine.
- Idle grace: a session that briefly goes idle isn’t failed immediately; Plan waits a few observations before counting an attempt, to avoid false failures.
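Both safeguards can be sketched in a few lines. The Session class, the launch_flags and observe helpers, and the grace value of 3 are assumptions made for illustration; the source only says "a few observations" and names the --resume flag.

```python
from dataclasses import dataclass
from typing import List

IDLE_GRACE_OBSERVATIONS = 3  # hypothetical value for "a few observations"


@dataclass
class Session:
    started: bool = False        # True once the session was actually established
    idle_observations: int = 0   # consecutive idle sightings


def launch_flags(session: Session) -> List[str]:
    """Anti-spiral: resume only a session that really started, else launch fresh."""
    return ["--resume"] if session.started else []


def observe(session: Session, is_idle: bool) -> bool:
    """Idle grace: return True only when an attempt should be counted."""
    if not is_idle:
        session.idle_observations = 0  # any activity resets the grace window
        return False
    session.idle_observations += 1
    if session.idle_observations >= IDLE_GRACE_OBSERVATIONS:
        session.idle_observations = 0  # count the attempt once, then start over
        return True
    return False
```

In this sketch a session that is idle for two observations and then becomes active again never costs an attempt.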
Retrying
Retry failed clears the quarantine (and, for a brief campaign, re-reads the brief so your fixes apply). Failed units are then re-claimed on the next tick. Because the run is idempotent, already-done units are never repeated.
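A minimal sketch of that behaviour, assuming units are held in a dictionary keyed by id: only failed units are put back in the queue, and done units are left alone. The retry_failed function, the "todo" status string, and resetting attempts to zero are assumptions, not confirmed details of Plan.

```python
from typing import Dict, List


def retry_failed(units: Dict[str, dict]) -> List[str]:
    """Clear quarantine on failed units and return the ids that were requeued."""
    requeued = []
    for unit_id, unit in units.items():
        if unit["status"] != "failed":
            continue  # done and pending units are untouched, so nothing is repeated
        unit["status"] = "todo"
        unit["attempts"] = 0  # assumption: a retry starts with a fresh attempt budget
        requeued.append(unit_id)
    return requeued
```

Calling it a second time returns an empty list, which is the idempotence property in miniature.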