docs(roadmap): add tolerant tool permission prompt gap

docs(roadmap): add prompt echo glyph mismatch gap
docs(roadmap): add workspace test flag order preflight gap
2026-07-03 08:36:25 +02:00 · 2026-05-21 21:30:41 +00:00 · 2026-05-21 21:00:45 +00:00 · 2026-05-21 20:30:47 +00:00 · 2026-05-21 20:00:55 +00:00 · 2026-05-21 19:30:59 +00:00
1 changed files with 32 additions and 0 deletions
@@ -6639,3 +6639,35 @@ Original filing (2026-04-18): the session emitted `SessionStart hook (completed)
 541. **`SessionStore::from_cwd` and `from_data_dir` eagerly `create_dir_all` the sessions namespace, so read-only session discovery and failed resume attempts mutate the workspace before proving a session will be read or written** — dogfooded 2026-05-21 from the `#clawcode-building-in-public` 13:00 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@8a92f0e` and binary built from source SHA `25d663d`. Code inspection: `runtime/src/session_control.rs:32-43` builds `<cwd>/.claw/sessions/<workspace_hash>/` and immediately calls `fs::create_dir_all(&sessions_root)?`; `from_data_dir` does the same at lines 54-67. These constructors are used by read/discovery paths (`session_control.rs` list/exists/resolve helpers, `main.rs` session store setup, tests, and resume command routing), so merely asking to list sessions, check existence, or resume `latest` in a fresh workspace can create `.claw/sessions/<fingerprint>/` even when no session exists and the command ultimately fails. This is the root cause behind the repeated failed-resume droppings noted in #435/#444/#540, but the actual product gap is broader: the constructor for a session *store handle* performs durable filesystem mutation instead of deferring directory creation to the first successful save/create operation. **Required fix shape:** (a) split `SessionStore::from_cwd` into non-mutating construction and an explicit `ensure_exists_for_write()` path; (b) make list/exists/resolve/latest handle missing session directories as empty/not-found without creating them; (c) call `create_dir_all` only when creating a new session file or writing a session snapshot; (d) expose a `store_exists` / `created:false` diagnostic if needed, but do not mutate during read-only probes; (e) add regressions proving `--resume latest`, session list/exists, and cold-workspace slash probes leave no `.claw/` tree behind on failure, while a real successful save creates the namespace. **Why this matters:** session discovery is supposed to be a read-only observability surface. Eagerly creating workspace metadata during failed reads pollutes repos, confuses `git status`, and makes cold-workspace probes look like they partially initialized state even though no usable session exists. Source: gaebal-gajae dogfood response to Clawhip message `1507005067016929364` on 2026-05-21.

 542. **`claw session --help --output-format json` and other top-level `session` invocations fall through to prompt/runtime startup, then hang behind config/plugin initialization instead of returning bounded local help or typed unsupported-subcommand JSON** — dogfooded 2026-05-21 from the `#clawcode-building-in-public` 13:30 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@adaf8c3` and binary built from source SHA `25d663d`. Reproduction in normal env: `timeout --kill-after=1s 5s ./rust/target/debug/claw session --help --output-format json` exits 124 with zero stdout and only the settings deprecation warning on stderr (`enabledPlugins` deprecated). `session list --output-format json` and `session exists latest --output-format json` show the same zero-stdout timeout pattern. Code inspection explains it: `parse_local_help_action` only recognizes `status|sandbox|doctor|acp|init|state|export|version|system-prompt|dump-manifests|bootstrap-plan`; `LocalHelpTopic` has no `Session` variant; and the top-level parser has no `session` branch even though slash/resume `/session list|exists|switch|fork|delete` has structured helpers (`render_session_list`, `session_exists_json`). As a result, a user asking for the session command contract is routed into the generic prompt path and waits on runtime startup instead of getting a small control-plane response. **Required fix shape:** (a) add a top-level `session` command parser with `help|list|exists <id>|show <id>|delete?` or explicitly documented unsupported actions; (b) add `LocalHelpTopic::Session` and bounded JSON help showing supported top-level vs slash/resume session operations; (c) make `session list` and `session exists` use non-mutating session-store read paths from #541 and return typed empty/not-found results; (d) reject unsupported `session` actions with `kind:"unsupported_session_action"` including `action`, `supported_actions`, and whether slash/resume alternatives exist; (e) add elapsed-time regressions for `session --help --output-format json`, `session list --output-format json`, `session exists latest --output-format json`, and `session bogus --output-format json`. **Why this matters:** session management is the backbone of resume/export automation. If the natural top-level `session` spelling hangs, operators cannot safely discover or preflight sessions without already knowing the slash-only internal contract, and a control-plane typo looks like runtime/plugin deadlock. Source: gaebal-gajae dogfood response to Clawhip message `1507012621079937165` on 2026-05-21.
+
+543. **Top-level local inventory commands (`plugins list`, `mcp list`, `agents list`, `skills list`) hang in normal env when deprecated config warnings are present, so a non-fatal settings migration warning blocks JSON inventory output entirely** — dogfooded 2026-05-21 from the `#clawcode-building-in-public` 14:00 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@dbc7125` and binary built from source SHA `25d663d`. Reproduction in the operator's normal env: bounded probes `timeout --kill-after=1s 6s ./rust/target/debug/claw plugins list --output-format json`, `mcp list --output-format json`, `agents list --output-format json`, and `skills list --output-format json` each exited 124 with zero stdout and only one stderr line: `warning: /home/bellman/.claw/settings.json: field "enabledPlugins" is deprecated (line 2). Use "plugins.enabled" instead`. These are supposed to be local discovery surfaces and the code has direct top-level branches (`CliAction::Plugins`, `CliAction::Mcp`, `CliAction::Agents`, `CliAction::Skills`) plus JSON renderers, but a non-fatal config deprecation warning appears before output and the command never completes. This is distinct from unknown-subcommand parser fallthrough (#529/#530) and missing top-level `session` (#542): here the requested actions are valid, parsed, and should be bounded, yet the normal-env warning path wedges them. **Required fix shape:** (a) make config deprecation warnings non-blocking for read-only inventory commands and never require interaction/cleanup before emitting JSON; (b) include warnings in a structured `warnings[]` field for JSON mode instead of stderr-only prelude that can precede a hang; (c) ensure inventory handlers can load partial/deprecated config and return `status:"degraded"` with `config_warnings` rather than timing out; (d) add normal-env/fixture regressions with deprecated `enabledPlugins` proving `plugins list`, `mcp list`, `agents list`, and `skills list` all produce bounded JSON within a small budget; (e) audit `dump-manifests` and `system-prompt` for the same warning-induced wedge. **Why this matters:** inventory commands are the emergency observability path for MCP/plugin/agent lifecycle issues. A migration warning must not make those commands look dead, especially in the exact long-lived user configs where deprecated fields are most likely to exist. Source: gaebal-gajae dogfood response to Clawhip message `1507020170671947888` on 2026-05-21.
+
+544. **Local help-topic commands (`export --help`, `status --help`, `version --help`, `bootstrap-plan --help`) hang in normal env when the deprecated settings warning is present, so usage discovery is blocked by a non-fatal config migration warning** — dogfooded 2026-05-21 from the `#clawcode-building-in-public` 14:30 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@785d4bd` and binary built from source SHA `25d663d`. Reproduction in the operator's normal env: bounded probes `timeout --kill-after=1s 5s ./rust/target/debug/claw export --help --output-format json`, `status --help --output-format json`, and `version --help --output-format json` each exited 124 with zero stdout and only `warning: /home/bellman/.claw/settings.json: field "enabledPlugins" is deprecated (line 2). Use "plugins.enabled" instead` on stderr; `bootstrap-plan --help --output-format json` entered the same wait before the sweep was killed. Code already has `parse_local_help_action` and `LocalHelpTopic` support for these commands, so help rendering should be a pure parser/static-output path. Instead, the presence of a config deprecation warning appears to push even `--help` into the same startup/config/plugin wedge as #543's inventory commands. **Required fix shape:** (a) route local help-topic requests before any config/plugin/runtime loading that can emit warnings or hang; (b) guarantee `claw <local-command> --help --output-format json` produces bounded static JSON independent of user config validity/deprecation state; (c) if warnings are intentionally collected, attach them only after the help payload as structured `warnings[]`, never as a blocking stderr prelude; (d) add fixture regressions with deprecated `enabledPlugins` for `export/status/version/bootstrap-plan --help --output-format json` plus text-mode controls; (e) audit all `LocalHelpTopic` variants for the same warning-induced hang. **Why this matters:** help is the recovery path when startup/config is broken. If a deprecated config warning prevents help from rendering, users cannot learn the command contract needed to fix the config or avoid the bad path. Source: gaebal-gajae dogfood response to Clawhip message `1507027716191551630` on 2026-05-21.
+
+545. **Core local diagnostics (`doctor`, `status`, `sandbox`, `config env`, `state`) hang in normal env when the deprecated settings warning is present, so the config-warning wedge blocks both inventory/help and the primary health surfaces** — dogfooded 2026-05-21 from the `#clawcode-building-in-public` 15:00 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@8d0dc50` and binary built from source SHA `25d663d`. Reproduction in the operator's normal env: bounded probes `timeout --kill-after=1s 5s ./rust/target/debug/claw doctor --output-format json`, `status --output-format json`, `sandbox --output-format json`, `config env --output-format json`, and `state --output-format json` each exited 124 with zero stdout and only `warning: /home/bellman/.claw/settings.json: field "enabledPlugins" is deprecated (line 2). Use "plugins.enabled" instead` on stderr. #543 captured inventory commands and #544 captured help topics; this entry shows the same non-fatal migration warning also disables the core health/config/state probes users need to diagnose startup. These commands are local and should be resilient to partial/deprecated config, especially `doctor`, whose purpose is to classify config problems. **Required fix shape:** (a) move deprecation collection into the diagnostic payload path instead of printing-and-wedging before command dispatch; (b) make `doctor/status/sandbox/config/state` produce bounded JSON even when config has deprecated fields, with `warnings[]` and `status:"degraded"` where appropriate; (c) guarantee `doctor` never depends on successful modern config normalization to report config migration issues; (d) add fixture regressions with `settings.json` containing `enabledPlugins` for all five probes, asserting non-timeout, nonzero/zero exit semantics as documented, and structured warning metadata; (e) unify this with #543/#544 as one startup/config-warning no-hang invariant across every local command. **Why this matters:** users run `doctor`, `status`, and `config env` precisely when startup/config looks wrong. If a deprecated config warning makes those commands hang with no JSON, the CLI loses its self-diagnostic escape hatch and every support workflow devolves into manual stderr archaeology. Source: gaebal-gajae dogfood response to Clawhip message `1507035269830934600` on 2026-05-21.
+
+546. **One-shot prompt/setup/git commands (`prompt`, `init`, `diff`) hang in normal env when the deprecated settings warning is present, so the warning wedge reaches executing and workspace-mutating entrypoints, not only diagnostics** — dogfooded 2026-05-21 from the `#clawcode-building-in-public` 15:30 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@0a2542a` and binary built from source SHA `25d663d`. Reproduction in the operator's normal env: bounded probes `timeout --kill-after=1s 5s ./rust/target/debug/claw prompt hello --output-format json`, `init --output-format json`, and `diff --output-format json` each exited 124 with zero stdout and only `warning: /home/bellman/.claw/settings.json: field "enabledPlugins" is deprecated (line 2). Use "plugins.enabled" instead` on stderr. #543 covered inventory, #544 help, and #545 diagnostics; this entry shows the same non-fatal config migration warning also blocks the main one-shot prompt path, project initialization, and local git diff inspection. `init` is especially problematic because it is the command a user might run to refresh/migrate workspace scaffolding, yet the warning prevents it from producing the artifact report. **Required fix shape:** (a) make deprecated settings warnings non-blocking for every dispatch class, including prompt, init, and diff; (b) for prompt/runtime paths, surface migration warnings as structured preflight metadata before model work without preventing bounded startup/result/error output; (c) for `init`, avoid loading user runtime/plugin config before emitting or applying idempotent project scaffolding; (d) for `diff`, keep the command pure git/local and warning-resilient with `warnings[]` in JSON mode; (e) add fixture regressions with `settings.json` containing `enabledPlugins` for `prompt hello --output-format json`, `init --output-format json`, and `diff --output-format json`, asserting deterministic output or typed prompt preflight failure rather than timeout. **Why this matters:** once the warning wedge reaches prompt/init/diff, users cannot even run the normal work loop or bootstrap/migrate away from the warning state. A deprecation notice must never be a hidden global startup blocker. Source: gaebal-gajae dogfood response to Clawhip message `1507042815606259972` on 2026-05-21.
+
+547. **Text-mode static/local commands still emit deprecated-settings warnings to stderr, and `system-prompt` emits the same warning twice, so successful output is polluted by unstructured migration noise even when the command completes** — dogfooded 2026-05-21 from the `#clawcode-building-in-public` 16:00 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@10f5dac` and binary built from source SHA `25d663d`. Normal-env probes showed `claw --help`, `claw help`, `claw version`, and `claw --version` complete cleanly with no warning, but `claw bootstrap-plan` exits 0 with the static phase list on stdout and the deprecated `enabledPlugins` settings warning on stderr; `claw system-prompt` exits 0 with the rendered system prompt on stdout and prints the same deprecation warning twice on stderr. `claw dump-manifests` exits 1 for missing manifests and also prefixes the structured error with the warning. #543-#546 cover JSON-mode warning-induced hangs; this entry captures the surviving text-mode event/log opacity: even when commands complete, successful static/local output is coupled to unstructured warning chatter, and some paths load config twice. **Required fix shape:** (a) dedupe config warnings per process/command before writing to stderr or structured output; (b) do not emit runtime config migration warnings for static commands that do not semantically need runtime config (`bootstrap-plan`, likely help/version); (c) for commands that intentionally inspect/render config-dependent state (`system-prompt`, `dump-manifests`), collect warnings into one structured/clearly phased diagnostic block instead of repeated raw lines; (d) add regressions with deprecated `enabledPlugins` proving text-mode `bootstrap-plan` has clean stderr or an intentional single warning, and `system-prompt` emits at most one warning; (e) align warning collection with the JSON `warnings[]` fix required by #543-#546. **Why this matters:** claws and shell pipelines treat stderr as the failure/diagnostic channel. Duplicate or irrelevant migration warnings make successful static outputs look degraded, break golden-output tests, and hide real errors behind repeated noise. Source: gaebal-gajae dogfood response to Clawhip message `1507050372181524622` on 2026-05-21.
+
+548. **Two timestamp helpers in `tools/src/lib.rs` disagree on their contract: `iso8601_timestamp()` shells out to `/bin/date` for RFC3339, then falls back to `iso8601_now()` which returns epoch seconds, so the same JSON field can change format by host/tool availability** — dogfooded 2026-05-21 from the `#clawcode-building-in-public` 16:30 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@832098c` and binary built from source SHA `25d663d`. Code inspection: `tools/src/lib.rs:6091-6100` implements `iso8601_timestamp()` by spawning `date -u +%Y-%m-%dT%H:%M:%SZ`; if that command is unavailable or fails, it calls `iso8601_now()`. But `iso8601_now()` at `tools/src/lib.rs:5262-5268` serializes `SystemTime::now().duration_since(UNIX_EPOCH).as_secs().to_string()`, yielding strings like `1748004000`, not an ISO/RFC3339 timestamp. `execute_brief` uses `iso8601_timestamp()` for `BriefOutput.sent_at`, so `sent_at` is RFC3339 on hosts with GNU/BSD `date`, but silently becomes epoch-seconds on minimal/sandboxed hosts where `date` is missing or fails. This is adjacent to the LaneEvent timestamp bug reported in the same channel, but distinct: the brief/notification surface has an environment-dependent timestamp format because one helper's fallback violates the other helper's named contract. **Required fix shape:** (a) replace both helpers with one pure Rust RFC3339 UTC formatter using an existing time crate or a tiny internal formatter; (b) remove shelling out to `date` from timestamp generation so output format is host-independent and deterministic; (c) if an epoch timestamp is desired anywhere, name it `unix_epoch_seconds_now` and type it separately, never as an ISO helper fallback; (d) add tests asserting `BriefOutput.sent_at`, AgentOutput `created_at/completed_at`, and LaneEvent timestamps parse as RFC3339 under both normal and forced-fallback conditions; (e) add a contract comment/schema note that string timestamps are RFC3339/ISO-8601 UTC. **Why this matters:** timestamp fields are coordination/log ordering primitives. If the same field is `"2026-05-21T16:30:00Z"` on one host and `"1747845000"` on another, downstream parsers, JSON schemas, and UI timelines cannot trust event ordering or distinguish parse failures from host quirks. Source: gaebal-gajae dogfood response to Clawhip message `1507057919278190654` on 2026-05-21.
+
+549. **G004 conformance validates `laneEvents[].emittedAt` only as a non-empty string, so epoch-seconds timestamps pass the contract even though fixtures and field names imply RFC3339/ISO date-time** — dogfooded 2026-05-21 from the `#clawcode-building-in-public` 17:00 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@d670b43` and binary built from source SHA `25d663d`. Code inspection: `runtime/src/g004_conformance.rs:66-69` requires `/event`, `/status`, and `/emittedAt` to be non-empty strings, but never parses or pattern-checks `/emittedAt`. In `runtime/src/lane_events.rs`, `LaneEvent.emitted_at` is serialized as `emittedAt` and tests/fixtures use RFC3339-looking values like `2026-04-04T00:00:00Z`, but the validator would accept `"1748004000"`, `"not-a-date"`, or any other non-empty string. This compounds #548 and Jobdori's timestamp helper finding: even if producers emit epoch strings from `iso8601_now()`, the advertised machine-checkable G004 contract will not catch the timestamp-format regression. **Required fix shape:** (a) add an RFC3339/ISO-8601 UTC validator for `laneEvents[].emittedAt` in `validate_lane_events`; (b) reject numeric-looking epoch strings and arbitrary prose with a clear error like `expected RFC3339 timestamp`; (c) apply the same validation to report/approval-token date-time fields if the bundle schema has them; (d) add negative tests showing `"1748004000"` and `"not-a-date"` fail while `"2026-04-04T00:00:00Z"` passes; (e) align producer helpers from #548/#549 so conformance and production agree on timestamp contract. **Why this matters:** conformance helpers are supposed to prevent log/event opacity bugs from reaching downstream claws. If the contract only checks non-empty strings, timestamp regressions become invisible until UI parsers, JSON-schema validators, or ordering logic fail later. Source: gaebal-gajae dogfood response to Clawhip message `1507065465552507033` on 2026-05-21.
+
+550. **Agent manifest timestamp tests only assert non-empty/presence, so epoch-seconds `createdAt`/`startedAt`/`completedAt` and lane-event timestamps are baked into tests as acceptable output** — dogfooded 2026-05-21 from the `#clawcode-building-in-public` 17:30 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@02b6855` and binary built from source SHA `25d663d`. Code inspection: `tools/src/lib.rs:2734-2755` defines `AgentOutput.createdAt`, `startedAt`, `completedAt`, and `laneEvents`. Production fills them via `iso8601_now()` at `tools/src/lib.rs:3663`, `3693-3696`, `3912`, and terminal `LaneEvent::*` calls. But the main manifest regression at `tools/src/lib.rs:8147-8152` only asserts `!manifest.created_at.is_empty()`, `started_at.is_some()`, and `completed_at.is_none()`; later completed/failed manifest tests assert event names/status/details/data but never assert that `createdAt`, `startedAt`, `completedAt`, or `laneEvents[].emittedAt` parse as RFC3339. As a result, the current epoch-seconds strings from `iso8601_now()` satisfy the test suite, and a future producer fix could regress again without a red test. **Required fix shape:** (a) add a shared test helper that validates RFC3339/ISO-8601 UTC strings for AgentOutput top-level timestamps and every lane event `emittedAt`; (b) update creation, completion, failure, spawn-error, recovery, review, selection, and artifact manifest tests to call it; (c) add explicit negative/unit coverage proving epoch strings like `"1748004000"` fail the helper; (d) align test fixtures with the G004 conformance timestamp validation required by #549; (e) ensure tests check serialized JSON fields, not only in-memory struct presence. **Why this matters:** tests are currently encoding the broken contract as acceptable by checking only non-empty strings. Without semantic timestamp assertions, timestamp-format fixes in #548/#549 have no durable safety net and downstream parser breakage can reappear silently. Source: gaebal-gajae dogfood response to Clawhip message `1507073019296874608` on 2026-05-21.
+
+551. **Lane completion ignores timestamp validity and recency entirely, so stale or epoch-formatted AgentOutput manifests can be auto-marked completed if status/tests/push flags are green** — dogfooded 2026-05-21 from the `#clawcode-building-in-public` 18:00 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@2bc85c5` and binary built from source SHA `25d663d`. Code inspection: `tools/src/lane_completion.rs::detect_lane_completion` decides completion only from `error.is_none()`, `status` being completed/finished, `current_blocker.is_none()`, `test_green`, and `has_pushed`. It never reads or validates `AgentOutput.createdAt`, `startedAt`, `completedAt`, or `laneEvents[].emittedAt`. The test fixture at `lane_completion.rs:103-120` uses RFC3339-looking timestamps, but there is no negative coverage for epoch strings or stale completedAt values. Therefore a manifest with `completedAt:"1748004000"` (from the broken `iso8601_now()` producer) or a very old completion timestamp can still generate a completed `LaneContext` as long as the status/tests/push booleans are favorable. This is distinct from #550's manifest test coverage gap: the completion policy itself treats temporal fields as irrelevant even though stale-session cleanup and lane closeout depend on trustworthy event time. **Required fix shape:** (a) parse and validate `completedAt` (and ideally `createdAt`/`startedAt`/terminal lane event `emittedAt`) before auto-completion; (b) reject or mark degraded manifests whose timestamp fields are missing, non-RFC3339, or implausibly stale relative to the current run; (c) thread a typed temporal blocker/degraded reason into `LaneContext` instead of silently completing; (d) add tests where status/tests/push are green but `completedAt` is `"1748004000"`, `"not-a-date"`, missing, or older than a configured freshness window, proving completion is withheld or degraded; (e) align this with #548-#550 so timestamp producer, validator, tests, and completion policy share one contract. **Why this matters:** completion is an action trigger, not just a display label. If stale or malformed manifests can auto-close lanes, claws may cleanup sessions, suppress reminders, or report success based on artifacts whose time identity is untrustworthy. Source: gaebal-gajae dogfood response to Clawhip message `1507080567731261655` on 2026-05-21.
+
+552. **Anthropic retry tests cover `/v1/messages` but not the preflight `/v1/messages/count_tokens` endpoint, so a transient 429 during token counting can be silently swallowed or skip refined context-window enforcement with no regression coverage** — dogfooded 2026-05-21 from the `#clawcode-building-in-public` 18:30 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@762ee65` and binary built from source SHA `25d663d`. Code/test inspection: `api/src/providers/anthropic.rs::preflight_message_request` calls `self.count_tokens(request).await`, but on any error it immediately `return Ok(())` and falls back to the coarse local byte guard. The `count_tokens` implementation sends one bare request to `/v1/messages/count_tokens` with no retry. Existing integration tests in `api/tests/client_integration.rs` cover retry behavior for `/v1/messages` (`retries_retryable_failures_before_succeeding`, `surfaces_retry_exhaustion_for_persistent_retryable_errors`, `retries_multiple_retryable_failures_with_exponential_backoff_and_jitter`) and local oversized-byte blocking, but there is no test where count_tokens returns 429/503 once then succeeds, nor a test asserting refined context-window rejection still happens after a retry. Therefore a count-token rate limit can disable the precise token guard and still let `send_message` proceed to `/v1/messages`, while the retry suite stays green. **Required fix shape:** (a) add integration tests with mock server sequence `429 count_tokens -> 200 count_tokens -> ...` proving the same retry policy applies before `/v1/messages`; (b) add a test where retried count_tokens returns a token count that exceeds the model window and assert no `/v1/messages` call is sent; (c) when count_tokens errors are intentionally best-effort, expose structured telemetry/warnings so skipped refined preflight is observable; (d) then implement retry or explicitly document degraded behavior; (e) keep request-count assertions path-aware so count_tokens and messages attempts are distinguishable. **Why this matters:** the retry suite currently gives false confidence because it exercises only the final message endpoint. Under rate pressure, the preflight endpoint is exactly where retries matter for safe compaction/context-window decisions. Source: gaebal-gajae dogfood response to Clawhip message `1507088113892331622` on 2026-05-21.
+
+553. **Worker restart reuses the original `created_at`, so startup-timeout elapsed time after restart includes the previous worker lifetime and stale pre-restart events** — dogfooded 2026-05-21 from the `#clawcode-building-in-public` 19:00 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@51c05ac` and binary built from source SHA `25d663d`. Code inspection: `WorkerRegistry::restart` at `runtime/src/worker_boot.rs:600-621` resets status, prompt fields, trust, and error state, then appends `WorkerEventKind::Restarted`, but it does not reset `worker.created_at` or clear/carry a new boot-start timestamp. `observe_startup_timeout` at `worker_boot.rs:713-735` computes `elapsed = now.saturating_sub(worker.created_at)` and reports `command_started_at: worker.created_at`. Therefore a worker restarted after a long previous lifetime can immediately report a huge startup timeout elapsed/command_started_at from the original boot, not the restart. Because events are also retained, the timeout evidence can mix pre-restart trust/tool-permission detections with the new boot attempt. This is distinct from Jobdori's #554 O(3N) scan/unbounded-events finding: even with O(1) caches, the temporal anchor for a restarted boot is wrong. **Required fix shape:** (a) add `boot_started_at` or `current_attempt_started_at` distinct from immutable worker creation time; (b) set that timestamp on create and restart, and use it for startup timeout elapsed/command_started_at; (c) either scope trust/tool-permission evidence to events since the current boot attempt or store per-attempt cached flags; (d) include `attempt_index`/`restart_count` in startup evidence and worker events so old/new attempts are separable; (e) add a regression where a worker is created, time advances/restart occurs, then `observe_startup_timeout` reports elapsed from restart rather than original creation and ignores pre-restart prompts. **Why this matters:** restart is supposed to create a fresh startup attempt. If timeout evidence is anchored to the first creation, operators see misleading "stalled for hours" reports and stale blocker classifications for a brand-new restart, which breaks recovery decisions. Source: gaebal-gajae dogfood response to Clawhip message `1507095667959402638` on 2026-05-21.
+
+554. **Recovery ledger tests assert only `started_at.is_some()` / `finished_at.is_some()`, so fake tick-counter timestamps are explicitly accepted by the machine-readable ledger suite** — dogfooded 2026-05-21 from the `#clawcode-building-in-public` 19:30 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@1540e3a` and binary built from source SHA `25d663d`. Code inspection: `RecoveryContext::next_timestamp` in `runtime/src/recovery_recipes.rs:276-279` returns `recovery-ledger-tick-N`, and `attempt_recovery` stores those strings into public `RecoveryLedgerEntry.started_at` / `finished_at`. The test named `recovery_context_exposes_machine_readable_ledger` at `recovery_recipes.rs:688-720` only asserts `entry.started_at.is_some()` and `entry.finished_at.is_some()`; exhaustion/failure ledger tests likewise validate state/result/command details but not timestamp parseability. Therefore the test suite labels the ledger machine-readable while allowing non-date sentinel strings in fields named `started_at` and `finished_at`. This is the test-coverage sibling of Jobdori's #555 public API timestamp bug: even after production is fixed, nothing in the ledger tests prevents a regression back to tick strings or other unparseable data. **Required fix shape:** (a) add a recovery timestamp assertion helper that parses `started_at` and `finished_at` as RFC3339/ISO-8601 UTC; (b) update success, exhausted, and failed ledger tests to use it; (c) add a negative unit test proving `recovery-ledger-tick-1` is rejected by the helper/contract; (d) document whether recovery ledger timestamps are wall-clock instants or monotonic attempt IDs, and if both are needed, add separate `attempt_seq` instead of overloading timestamp fields; (e) align with the timestamp contract fixes in #548-#551. **Why this matters:** tests currently make the wrong semantic promise: "machine-readable" only means present. Recovery ledgers drive retries/escalation audit trails, so timestamp fields must be parseable dates or consumers cannot sort, correlate, or display recovery attempts reliably. Source: gaebal-gajae dogfood response to Clawhip message `1507103214665859264` on 2026-05-21.
+
+555. **Workspace-test stale-branch preflight can compare against stale local `main` instead of `origin/main`, letting branches behind the remote base run full workspace tests as “fresh”** — dogfooded 2026-05-21 from the `#clawcode-building-in-public` 20:00 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@a036293` and binary built from source SHA `25d663d`. Live channel context had multiple open claw-code PRs whose head ref is `main`, and the watchdog target is specifically stale-branch confusion. Code inspection: `tools/src/lib.rs::workspace_test_branch_preflight` reads the current branch then calls `resolve_main_ref(&branch)` before `check_freshness`. `resolve_main_ref` at `tools/src/lib.rs:2020-2032` returns local `main` whenever it exists, except when the current branch itself is `main` and `origin/main` exists. In the common feature-branch case with both refs present, the stale-branch guard compares feature branch to local `main`, not `origin/main`. If local `main` has not been fetched/updated, a branch can be behind `origin/main` but equal to local `main`, so `check_freshness` returns `Fresh` and `cargo test --workspace` proceeds without the preflight block. **Required fix shape:** (a) prefer `origin/main` (or the configured protected/base remote ref) for non-main branches when present; (b) fetch or verify the remote ref freshness before using it, or emit a degraded `branch.remote_base_unknown`/`branch.base_ref_stale` lane event instead of silently falling back; (c) include `baseRefSource` and `baseRefCommit` in the blocked lane event payload so operators know whether freshness was checked against local or remote state; (d) add a regression with local `main` stale, `origin/main` ahead, and a feature branch equal to local `main`, proving workspace tests are blocked; (e) keep the current `branch == main -> origin/main` behavior but cover it separately. **Why this matters:** full workspace tests are used as green evidence. If the guard checks an outdated local main, agents can burn time and report green against a stale base while missing fixes already in the remote protected branch. Source: gaebal-gajae dogfood response to Clawhip message `1507110763301437440` on 2026-05-21.
+
+556. **Workspace-test stale-branch preflight only matches fixed argument order, so broad workspace test commands like `cargo test --all-targets --workspace` bypass the stale-base guard** — dogfooded 2026-05-21 from the `#clawcode-building-in-public` 20:30 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@6ef5457` and binary built from source SHA `25d663d`. Active tmux list was empty at probe time. Code inspection: `tools/src/lib.rs::is_workspace_test_command` normalizes whitespace/lowercase, then checks substring needles in this exact order: `cargo test --workspace`, `cargo test --all`, `cargo nextest run --workspace`, `cargo nextest run --all`. The existing regression `bash_workspace_tests_are_blocked_when_branch_is_behind_main` uses `cargo test --workspace --all-targets`, which matches the fixed-order needle. But semantically equivalent broad commands such as `cargo test --all-targets --workspace`, `cargo test --locked --workspace`, `cargo test --all-features --workspace`, or `cargo nextest run --all-features --workspace` do not contain the exact substring `cargo test --workspace` / `cargo nextest run --workspace`, so `workspace_test_branch_preflight` returns `None` and the command executes even on a stale branch. Targeted-test skip coverage does not protect this because the bypassed commands are still workspace-wide tests. **Required fix shape:** (a) parse shell command tokens enough to identify `cargo test` and `cargo nextest run` invocations independent of flag order; (b) classify workspace-wide tests when any token is `--workspace` or `--all` for those subcommands, regardless of intervening flags; (c) add negative coverage for targeted package tests that include workspace-looking strings only in quoted args/comments; (d) add regressions proving stale-branch preflight blocks `cargo test --all-targets --workspace`, `cargo test --locked --workspace`, and `cargo nextest run --all-features --workspace`; (e) include the normalized detected test scope in the structured branch-divergence event so operators can see why a command was blocked. **Why this matters:** agents often reorder cargo flags. A stale-branch safety guard that depends on one flag order gives false confidence and lets expensive full-suite green evidence be produced against stale code. Source: gaebal-gajae dogfood response to Clawhip message `1507118317528158370` on 2026-05-21.
+
+557. **Wrong-task prompt-misdelivery detection only recognizes `›` prompt echoes, so `>` / `❯` agent prompts can hide mismatched-task receipts until timeout** — dogfooded 2026-05-21 from the `#clawcode-building-in-public` 21:00 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@bc55711` and binary built from source SHA `25d663d`. Active tmux session at probe time: `gajae-issue-311-auto-merge-race-receipt`. Code inspection: worker readiness accepts multiple prompt glyphs (`>`, `›`, `❯`) in `detect_ready_for_prompt` at `runtime/src/worker_boot.rs:1079-1113`, but `detect_prompt_echo` at `worker_boot.rs:1210-1217` only strips a leading `›`. `detect_prompt_misdelivery` relies on `detect_prompt_echo` for the `mismatched_prompt_visible` path that catches wrong-task receipts when the screen shows a different task prompt. The existing regression `wrong_task_receipt_mismatch_is_detected_before_execution_continues` uses `› Explain this KakaoTalk screenshot...`, so it exercises only the single supported glyph. If the coding agent UI echoes `> Explain...` or `❯ Explain...`, `observed_prompt_preview` is `None`; when the expected prompt text is not also visible, the wrong-task mismatch is not detected and the worker stays `Running` until coarse startup timeout classification. **Required fix shape:** (a) make prompt-echo parsing share the same glyph set as `detect_ready_for_prompt` (`>`, `›`, `❯`, and boxed `│ >` variants if present); (b) add wrong-task receipt tests for `>`, `›`, and `❯` echoes; (c) include the raw echo line/glyph in `WorkerEventPayload::PromptDelivery` or event detail so operators can diagnose UI variant drift; (d) ensure shell prompt detection remains separate so real shell prompts are still classified as `Shell`, not wrong-task agent echoes; (e) add a timeout evidence regression proving observed prompt preview is populated for all supported glyphs. **Why this matters:** prompt-misdelivery protection is only as good as the UI echo parser. Supporting multiple ready glyphs but only one echo glyph creates event/log opacity: operators see a generic timeout instead of a precise wrong-task replay condition for common terminal themes or agent UIs. Source: gaebal-gajae dogfood response to Clawhip message `1507125863408341102` on 2026-05-21.
+
+558. **Tool-permission gate detection is hard-coded to one English MCP prompt shape, so alternate MCP approval wording can fall through as startup-no-evidence instead of `ToolPermissionRequired`** — dogfooded 2026-05-21 from the `#clawcode-building-in-public` 21:30 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@9fd61af` and binary built from source SHA `25d663d`. Active tmux session at probe time: `omx-issue-2443-ralplan-consensus-resume`. Code inspection: `detect_tool_permission_prompt` in `runtime/src/worker_boot.rs:958-999` only enters when the full screen contains either `allow the` + `server` + `tool` + `run`, or `allow tool` + `run`. The only production-shaped tests at `worker_boot.rs:1387-1474` use exactly `Allow the omx_memory MCP server to run tool "..."?`. Common equivalent approval copy such as `Allow MCP server omx_memory to call tool "project_memory_read"?`, `Allow server omx_memory to execute tool ...`, `Approve tool project_memory_read from omx_memory?`, or localized/shorter plugin prompts do not contain the exact `allow the ... server ... run tool` / `allow tool ... run` token pattern. When those appear during boot, `observe` will not set `WorkerStatus::ToolPermissionRequired`, no structured `ToolPermissionPrompt` payload is emitted, and later timeout evidence can degrade to generic `startup_no_evidence` or worker-crashed classification even though the pane clearly showed an approval gate. **Required fix shape:** (a) replace phrase-order checks with a tolerant classifier over permission verbs (`allow`/`approve`/`permit`), execution verbs (`run`/`call`/`execute`), and MCP/tool tokens independent of order; (b) add fixture tests for at least three real-world prompt variants, including `call tool` and prompts where the tool name appears before the server; (c) preserve extracted `server_name`, `tool_name`, allow-scope, and raw `prompt_preview` even when fields are partial; (d) emit an `Unknown`-scope tool-permission event rather than falling through when the approval intent is clear but parsing is incomplete; (e) include classifier confidence/reason in startup timeout evidence so UI wording drift is visible. **Why this matters:** MCP permission prompts are exactly the kind of boot blocker operators need to resolve quickly. A brittle single-template detector converts an actionable “click allow” condition into opaque startup failure noise whenever plugin/UI copy drifts. Source: gaebal-gajae dogfood response to Clawhip message `1507133416884404254` on 2026-05-21.
Author	SHA1	Message	Date
Yeachan-Heo	9ef521bb98	docs(roadmap): add tolerant tool permission prompt gap	2026-05-21 21:30:41 +00:00
Yeachan-Heo	9fd61af086	docs(roadmap): add prompt echo glyph mismatch gap	2026-05-21 21:00:45 +00:00
Yeachan-Heo	bc55711600	docs(roadmap): add workspace test flag order preflight gap	2026-05-21 20:30:47 +00:00
Yeachan-Heo	6ef54578e8	docs(roadmap): add workspace preflight remote base gap	2026-05-21 20:00:55 +00:00
Yeachan-Heo	a036293829	docs(roadmap): add recovery ledger timestamp test gap	2026-05-21 19:30:59 +00:00
Yeachan-Heo	1540e3a9ab	docs(roadmap): add worker restart timeout anchor gap	2026-05-21 19:00:52 +00:00
Yeachan-Heo	51c05ac359	docs(roadmap): add count_tokens retry coverage gap	2026-05-21 18:30:53 +00:00
Yeachan-Heo	762ee656f3	docs(roadmap): add lane completion timestamp gap	2026-05-21 18:00:48 +00:00
Yeachan-Heo	2bc85c54c8	docs(roadmap): add agent manifest timestamp test gap	2026-05-21 17:30:46 +00:00
Yeachan-Heo	02b68555f5	docs(roadmap): add G004 timestamp validation gap	2026-05-21 17:01:25 +00:00
Yeachan-Heo	d670b43535	docs(roadmap): add timestamp helper fallback gap	2026-05-21 16:31:31 +00:00
Yeachan-Heo	832098c18e	docs(roadmap): add text warning opacity gap	2026-05-21 16:00:51 +00:00
Yeachan-Heo	10f5daca43	docs(roadmap): add prompt init diff warning hang gap	2026-05-21 15:31:12 +00:00
Yeachan-Heo	0a2542abf7	docs(roadmap): add diagnostics warning hang gap	2026-05-21 15:01:16 +00:00
Yeachan-Heo	8d0dc50cef	docs(roadmap): add help warning hang gap	2026-05-21 14:31:15 +00:00
Yeachan-Heo	785d4bde40	docs(roadmap): add inventory warning hang gap	2026-05-21 14:01:31 +00:00