Monitoring & failures

Track a campaign from the cockpit — progress, ETA, workers — and understand quarantine, failure diagnostics, and one-click retries.

The Plan cockpit shows every campaign and lets you see exactly what’s happening — and why something failed.

Campaign cards

Each campaign is a card with:

a progress bar (done / total) and percentage;
active sessions and an ETA (derived from throughput);
a readable quota line when parked — “quota exhausted · auto-resume ~14:30”;
a failed count when units are quarantined.
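
As a rough illustration of how a card's percentage and throughput-derived ETA could be computed, here is a minimal Python sketch. The function names, the use of average throughput since the campaign started, and the output format are assumptions made for illustration, not Plan's actual implementation.

```python
from datetime import datetime, timedelta
from typing import Optional


def estimate_eta(done: int, total: int, started_at: datetime, now: datetime) -> Optional[datetime]:
    """Estimate the completion time from the average throughput so far.

    Returns None when throughput is unknown (nothing finished yet).
    Hypothetical helper: Plan may use a different window or smoothing.
    """
    elapsed = (now - started_at).total_seconds()
    if done <= 0 or elapsed <= 0:
        return None
    if done >= total:
        return now
    remaining = total - done
    # remaining / (done / elapsed), rearranged to avoid a tiny float error
    seconds_left = remaining * elapsed / done
    return now + timedelta(seconds=seconds_left)


def progress_line(done: int, total: int) -> str:
    """Format the 'done / total' counter with a rounded percentage."""
    pct = 100 * done / total if total else 0.0
    return f"{done} / {total} ({pct:.0f}%)"
```

With 50 of 150 units done after 30 minutes, this sketch projects another 60 minutes of work at the same rate.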

The engine strip above the cards shows the engine state, the global pool, and a pause switch. Click a card to edit its campaign.

Drill-down: workers & failures

Open a card’s Workers view to see the live sessions (current unit, heartbeat, lease) and the list of failed units. The view never shows a table of thousands of rows, only the workers and the exceptions.

Quarantine vs parking

These are different on purpose:

Parking = a quota pause. The unit stays “to do”, its attempts count is unchanged, and it auto-resumes. It is not a failure.
Quarantine = a genuine failure. Once a unit has used up max_attempts (through crash loops, or the agent returning to idle without producing the output), it is marked failed and listed among the failed units.
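
The distinction can be sketched as a small state transition in Python. Everything here is a hypothetical model: the `Outcome` values, the `Unit` fields other than `attempts`, and the default `max_attempts=3` are assumptions chosen for illustration, not Plan's real data model.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    DONE = "done"
    QUOTA_EXHAUSTED = "quota_exhausted"
    CRASHED = "crashed"
    IDLE_NO_OUTPUT = "idle_no_output"


@dataclass
class Unit:
    status: str = "todo"  # "todo", "done", or "failed"
    attempts: int = 0
    parked: bool = False


def apply_outcome(unit: Unit, outcome: Outcome, max_attempts: int = 3) -> Unit:
    """Update a unit after a session ends (illustrative rules only)."""
    if outcome is Outcome.DONE:
        unit.status = "done"
        unit.parked = False
    elif outcome is Outcome.QUOTA_EXHAUSTED:
        # Parking: the unit stays "todo" and attempts is not touched.
        unit.parked = True
    else:
        # A genuine failure consumes one attempt.
        unit.attempts += 1
        unit.parked = False
        if unit.attempts >= max_attempts:
            unit.status = "failed"  # quarantined
    return unit
```

In this model a unit can be parked any number of times without moving closer to quarantine, which is the point of keeping the two states separate.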

Failure diagnostics

When a unit is quarantined, Plan records why — the reason plus the last screen captured from the dead session. So instead of guessing, you see the actual cause (a wrong path, a missing file, a CLI error) right in the drill-down.
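
To make the shape of such a record concrete, here is a hedged Python sketch of a diagnostic entry holding the reason and the last captured screen. The class name, field names, and the `summarize` helper are hypothetical; the source only states that a reason and the last screen are recorded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureDiagnostic:
    unit_id: str      # hypothetical identifier for the quarantined unit
    reason: str       # why the unit was quarantined
    last_screen: str  # text captured from the dead session


def summarize(diag: FailureDiagnostic, tail_lines: int = 5) -> str:
    """Render the reason followed by the last few lines of the screen."""
    lines = diag.last_screen.rstrip().splitlines()
    tail = lines[-tail_lines:]
    body = "\n".join(f"  | {line}" for line in tail)
    return f"{diag.unit_id}: {diag.reason}\n{body}"
```

Showing only the tail of the screen keeps the drill-down short while still surfacing the error line, which is usually near the bottom.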

Resilience built in

Anti-spiral: Plan attempts --resume only for a session that actually started; a never-established session is relaunched fresh, so a bad first launch can’t loop into quarantine.
Idle grace: a session that briefly goes idle isn’t failed immediately; Plan waits for a few observations before counting an attempt, to avoid false failures.
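
Both safeguards can be sketched in a few lines of Python. The `Session` fields, the function names, and the grace threshold of 3 observations are assumptions for illustration; the source says only "a few observations" and does not describe how Plan tracks whether a session was established.

```python
from dataclasses import dataclass


@dataclass
class Session:
    established: bool = False  # True once the session actually started
    idle_streak: int = 0       # consecutive idle observations so far


def launch_mode(session: Session) -> str:
    """Anti-spiral: resume only a session that was really established."""
    return "resume" if session.established else "fresh"


def observe(session: Session, is_idle: bool, grace: int = 3) -> bool:
    """Idle grace: return True only when idleness should count as an attempt."""
    if not is_idle:
        session.idle_streak = 0  # activity resets the streak
        return False
    session.idle_streak += 1
    if session.idle_streak >= grace:
        session.idle_streak = 0  # start counting again after the attempt
        return True
    return False
```

A single idle observation between two active ones never counts here, because any activity resets the streak before it reaches the threshold.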

Retrying

Retry failed clears the quarantine (and, for a brief campaign, re-reads the brief so your fixes apply). Failed units are then re-claimed on the next tick. Because the run is idempotent, already-done units are never repeated.
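
The retry behaviour can be illustrated with a small Python sketch. The dictionary layout, the function names, and the choice to reset attempts to zero are assumptions; the source states only that the quarantine is cleared, that failed units are re-claimed on the next tick, and that done units are never repeated.

```python
from typing import Dict, List


def retry_failed(units: Dict[str, dict]) -> List[str]:
    """Clear quarantine by returning failed units to 'todo'.

    Resetting attempts to 0 is an assumption of this sketch.
    """
    requeued = []
    for unit_id, unit in units.items():
        if unit["status"] == "failed":
            unit["status"] = "todo"
            unit["attempts"] = 0
            requeued.append(unit_id)
    return requeued


def claimable(units: Dict[str, dict]) -> List[str]:
    """Units the next tick may claim; 'done' units are never included."""
    return [uid for uid, u in units.items() if u["status"] == "todo"]
```

Because claiming filters on status, pressing retry twice or retrying a campaign that is mostly finished has no effect on units that are already done.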