Commit c6925048 authored by Vacaliuc, Bogdan's avatar Vacaliuc, Bogdan

Merge remote-tracking branch 'origin/lr_reduction-new_workflow-repairs' into lr_reduction-new_workflow-repairs
parents b6d7f500 9ac40584
# Integrator findings — dry-run-2026-04-28

Written by the Integrator session for the
`lr_reduction-new_workflow-repairs` effort, after all five test
slugs reached terminal state. Cross-reference:
- `plan/dry-run.md` §6 (expected timeline) and §7 (failure-point
  coverage checklist).
- `plans/dry-run-delta-followup.md` and the annotation on
  `dry-run-2026-04-28-review/dry-run-gamma-escalate` on the
  analysis branch — those carry the corresponding Analyst-side
  view of the same events.

---

## 1. Final state by slug

| Slug | Pathway | Terminal artifact | PR # |
|---|---|---|---|
| alpha   | pass first attempt                       | feature tip `51bfd43`, qa/ deleted, PR opened | #1 |
| beta    | fail → retry-v2 passes                   | feature tip `51ef9d2`, qa/ deleted, PR opened | #2 |
| gamma   | fail × 3 (cap reached)                   | feature tip `ae45a8c`, annotated tag `review/dry-run-gamma-escalate` (tag object `49157fd` → commit `dd9a1d4`) | — |
| delta   | infrastructure failure                   | feature tip `9b649ef`, `plans/dry-run-delta-followup.md` written by Analyst | — |
| epsilon | malformed triage (no plan file)          | Developer skipped per spec; Integrator never observed any qa tag | — |

All `qa/*` and lightweight `review/*` tags were cleaned up by the
Developer / Integrator / Analyst per the §8 allowlist; only the
**annotated** `review/dry-run-gamma-escalate` tag remains, as the
protocol's escalation contract requires.

PRs `#1` and `#2` on `bvacaliuc/LiquidsReflectometer` are labeled
`dry-run` and base-targeted at `new_workflow_ui_plan`. Both are
left open for the cleanup phase to close; per
`plan/dry-run.md` §9, that step belongs to the Initialization
agent (not me — my poll loop is still armed).

---

## 2. Integrator cycle log

Each cycle ran the canonical command `pixi run test-reduction`
(plus `-- --timeout=2` for delta) on a fresh checkout of
`dry-run-2026-04-28-feature/<slug>`. Wall-clock time per cycle was
remarkably stable.

| # | Slug    | Attempt | Test wall-clock | Outcome                | Action taken                                |
|---|---------|---------|-----------------|------------------------|---------------------------------------------|
| 1 | alpha   | 1       | 9:32            | 282 passed             | Standing PR approval requested → PR #1; qa/ deleted |
| 2 | beta    | 1       | 9:35            | 1 failed, 281 passed   | todo.md (attempt 1) → review/; qa/ deleted  |
| 3 | gamma   | 1       | 9:35            | 1 failed, 281 passed   | todo.md (attempt 1 of N=3) → review/; qa/ deleted |
| 4 | delta   | 1       | <1 s + retry    | pytest exit 4 (arg parse) | retry once after `__pycache__` clean (deterministic same exit) → infra todo.md → review/; qa/ deleted |
| 5 | gamma   | 2 (v2)  | 9:26            | 1 failed, 281 passed   | todo.md (attempt 2 of N=3) → review/; qa/ deleted |
| 6 | beta    | 2 (v2)  | 9:30            | 282 passed             | PR #2 (standing approval); qa/ deleted      |
| 7 | gamma   | 3 (v3)  | 9:46            | 1 failed, 281 passed   | todo.md (attempt 3, RETRY CAP REACHED) → review/; qa/ deleted; Analyst proceeded to escalate |

Aggregate: **7 successful protocol cycles**, ~67 minutes of test
runtime alone, plus polling latency and the cost of the first
(botched) alpha attempt. End-to-end Integrator wall-time was
~80–90 minutes.

This is **above** the `plan/dry-run.md` §8 "30–60 min typical"
note but below the 2-hour flag. The dominant cost is the pixi
test command itself: 281 production tests (most are integration
tests that read from the SNS network mounts) consume ~9.5 minutes
on this clone regardless of the synthetic test added.

---

## 3. Real-world finding (already documented; for traceability)

`pytest-timeout` is missing from the lr_reduction pixi
environment. `plan/dry-run.md` §5.4 specifies that the Integrator
pass `--timeout=2` to `pixi run test-reduction` for the `delta`
slug; pytest aborts at argument parsing instead, exit code 4.

The Initialization checklist did not catch this because §1 only
enumerates top-level tools, not pytest plugins, and §8 verifies
collection (`--collect-only`) but not the canonical test command's
required argument set.

**Picked up by the Analyst** as
`plans/dry-run-delta-followup.md` on the dry-run analysis branch,
which proposes both a remediation (add `pytest-timeout` to the
pixi feature group) and an Initialization-checklist amendment.

This is the strongest validation of the dry run's value: a real
environment gap that would have manifested under real infra
pressure was caught at protocol-rehearsal time.

---

## 4. Bugs found in my own Integrator harness

### 4.1 Monitor's seen-cache keyed on tag *name* instead of *(SHA, name)*

The first version of my long-poll Monitor (task `bfmit91q8`) used
the tag name alone as the dedup key. When the Developer re-tagged
`qa/dry-run-{beta,gamma}` for v2 retries, the names matched the
seen-cache → Monitor stayed silent → Analyst+Developer chain
blocked downstream.

**Symptom:** the user had to nudge me with "Analyst and Developer
are waiting for gamma-v3" to expose the stall.

**Fix:** v2 of the Monitor (task `bvhw5d11w`) keys on `(SHA, name)`
so re-tags (different SHA, same name) trigger a new event. Stored
in `~/.cache/integrator-dry-run-2026-04-28-seen-sha.txt` (mode 600).

**Recommendation for production opening prompt §9.3:** add an
explicit clause to the polling-loop pseudocode: *"Dedup by `(SHA,
ref-name)` — re-tags during retry cycles must trigger a new event."*
The current §9.3 says only *"List tags matching qa/*"* and leaves
dedup to the implementer.
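
A minimal sketch of the fixed detection loop, with the `(SHA, name)` key
factored into a helper so the dedup rule can be checked in isolation. The
helper name and cache path are illustrative; `agentic` is this session's
writable remote:

```shell
# mark_if_new CACHE SHA REF: prints REF and records it iff (SHA, REF) is unseen.
# A re-tag changes the SHA, so the same tag name fires a new event — the §4.1 fix.
mark_if_new() {
  key="$2 $3"
  grep -qxF -- "$key" "$1" 2>/dev/null && return 1
  printf '%s\n' "$key" >> "$1"
  printf '%s\n' "$3"
}

SEEN="${SEEN:-$(mktemp)}"  # live session: ~/.cache/integrator-dry-run-2026-04-28-seen-sha.txt
git ls-remote --tags agentic 'refs/tags/qa/*' 2>/dev/null |
while read -r sha ref; do
  new=$(mark_if_new "$SEEN" "$sha" "$ref") || continue
  echo "NEW qa event: $new @ $sha"   # hand off to cycle handling
done
```

`ls-remote` keeps detection server-fresh (see §4.2), and the helper makes
the re-tag case explicit: a new SHA under an old name produces a new key.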

### 4.2 `git fetch --tags` rejected re-pushed qa tags as "would clobber existing tag"

When the Developer re-pushed `qa/<slug>` for v2 with a new SHA,
`git fetch agentic --tags` (without `--force`) refused to update
the local copy: a stale local tag — already deleted on the server
but still present locally — pointed at a different SHA. The update
was silently dropped, and it would have left my Monitor seeing
stale state, had the Monitor been working at the time.

**Mitigation already applied:** the post-fix Monitor uses
`ls-remote --tags` (which always reads fresh from the server) so
local tag staleness does not affect detection. For the periodic
fetch-into-local that I do during cycle handling, I now run with
`git fetch agentic --prune --tags --force`.

**Recommendation for §9.3:** the polling pseudocode should
explicitly include `--force` on the tag fetch, or document that
production should rely on `ls-remote` for detection (cheaper and
race-free) and use `--force` only when the local tag is needed.
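
Both the refusal and the fix can be reproduced end-to-end in throwaway
repos; in this sketch a local bare repository stands in for the `agentic`
remote, and the tag name is illustrative:

```shell
tmp=$(mktemp -d)
git init -q --bare "$tmp/server.git"
git init -q "$tmp/work" && cd "$tmp/work"
git -c user.email=i@x -c user.name=i commit -q --allow-empty -m v1
git remote add agentic "$tmp/server.git"
git tag qa/dry-run-beta
git push -q agentic qa/dry-run-beta
# Developer's v2 re-tag: same name, new SHA, force-pushed to the server.
git -c user.email=i@x -c user.name=i commit -q --allow-empty -m v2
git tag -f qa/dry-run-beta >/dev/null
git push -qf agentic qa/dry-run-beta
# Make the local copy stale again (as after a server-side delete + re-push):
git tag -f qa/dry-run-beta HEAD~1 >/dev/null
git fetch agentic --tags 2>clobber.txt || true  # rejected: "would clobber existing tag"
git fetch -q agentic --prune --tags --force     # local tags now match the server
```

After the forced fetch, `git rev-parse qa/dry-run-beta` agrees with the
server again.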

### 4.3 Bash-tool timeout vs. pixi test runtime

My initial alpha attempt wrapped the test command in
`timeout 240 pixi run test-reduction 2>&1 | tail -30`. Two
mistakes compounded:

1. The wall-clock of `pixi run test-reduction` is ~9.5 minutes
   (~570 s), so a 240 s budget fell short by more than a factor
   of two.
2. The `| tail -30` deferred all output until the pipeline closed,
   so when `timeout` killed it, only the literal string
   `Terminated` made it to the log. Diagnosing the failure was
   only possible because I re-ran without the pipe.

**Fix already applied:** subsequent runs used
`run_in_background=true` with the Bash tool's 10-minute max
timeout and direct redirection to a log file (no shell pipe → no
buffering → readable mid-run).

**Recommendation for §9.3:** add a footnote to the *"Run the
canonical test command"* step warning that the lr_reduction test
suite requires ~10 minutes wall-clock and that any wrapper script
should redirect output to a log file rather than piping through
`tail`.
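
The pattern, sketched with the long-running command abstracted behind
`$CMD` so the sketch runs anywhere; in the live session the command is
`pixi run test-reduction`, launched with `run_in_background=true` under
the Bash tool's 10-minute cap:

```shell
# CMD stands in for the canonical test command; a trivial stand-in is used here.
CMD=${CMD:-'echo "282 passed (stand-in output)"'}
LOG=$(mktemp /tmp/test-reduction.XXXXXX)
# Redirect straight to a file: lines are flushed as produced, so the log is
# readable mid-run. `... | tail -30` buffers everything until the pipeline
# closes — exactly the failure mode above.
sh -c "$CMD" >"$LOG" 2>&1 &
pid=$!
wait "$pid"; status=$?
printf 'exit=%s; tail of log:\n' "$status"
tail -n 5 "$LOG"
```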

---

## 5. Protocol observations

### 5.1 Developer's feature-branch handling deviated from `dry-run.md` §5.3

`plan/dry-run.md` §5.3 says the Developer *"creates
`{dry-run-prefix}-feature/{slug}` from `{base-branch}`"* on every
cycle. In practice, the Developer in this dry run **fast-forwarded
on top of my todo.md commits** instead of resetting to base:

```
ae45a8c dry-run: gamma v3 retry (body unchanged by design — final attempt)
b473a6d integrator: failing tests              ← my todo.md attempt 2
38a1f3b dry-run: gamma v2 retry (body unchanged by design)
baa64cd integrator: failing tests              ← my todo.md attempt 1
ce604f6 dry-run: add tests/dry_run/test_dry_run_gamma.py (attempt 1)
```

This is **benign** for the protocol's outcome — the Integrator's
todo.md commits get overwritten/superseded on retries anyway, and
the test still runs the latest test file — but it does mean the
feature branch carries a longer history than spec. Consequence
worth noting: a real PR that ever shipped from a multi-retry
slug would carry the Integrator's process commits unless the
Developer cleaned them up before the final qa tag.

**Recommendation:** either tighten §5.3 to read *"on every cycle,
including v{N} retries, the Developer recreates feature/{slug}
from {base-branch} (replacing any prior history)"* — and confirm
that `git rev-parse <base>` matches the recreated tip's parent —
**or** loosen the spec to allow append-on-retry and explicitly
note that the Integrator's todo.md commits will be preserved in
the feature branch's history. Either is workable; the current
language is ambiguous, and this dry run picked the second
interpretation.
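
Under the first (tightened) reading, a retry cycle would look roughly like
the following — sketched against throwaway repos, with `feature/gamma`,
the base branch, and the local bare "server" all standing in for the real
names:

```shell
tmp=$(mktemp -d)
git init -q --bare "$tmp/server.git"
git init -q "$tmp/work" && cd "$tmp/work"
git -c user.email=d@x -c user.name=d commit -q --allow-empty -m base
base=$(git symbolic-ref --short HEAD)   # stands in for new_workflow_ui_plan
git remote add agentic "$tmp/server.git"
git push -q agentic "$base"
# A prior cycle left an Integrator process commit on the feature branch:
git checkout -qb feature/gamma
git -c user.email=d@x -c user.name=d commit -q --allow-empty -m "integrator: failing tests"
git push -q agentic feature/gamma
# Retry cycle: recreate from base, replacing prior history, then verify:
git checkout -qB feature/gamma "$base"
git push -q --force-with-lease agentic feature/gamma
git merge-base --is-ancestor "$base" feature/gamma && echo "recreated from base"
```

`--force-with-lease` keeps the replacement honest: it only rewrites the
server branch if no one else moved it since the last fetch.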

### 5.2 Analyst correctly handled the delta-as-infra distinction

Per `plan/dry-run.md` §5.2, infra failures must not consume retry
budget. The Analyst correctly:

- Wrote `plans/dry-run-delta-followup.md` instead of a `-v2` plan.
- Deleted `review/dry-run-delta` without re-triaging.
- Documented the missing `pytest-timeout` plugin as a real-world
  finding, not a plan defect.

This is one of the harder behavioral discriminations in the
protocol (a real-mode behavior that *only* surfaces under real
infra pressure normally). The dry run exercised it correctly.

### 5.3 Annotated escalate tag includes a useful inline summary

The Analyst's `dry-run-2026-04-28-review/dry-run-gamma-escalate`
tag annotation summarizes the three failed attempts and points at
`plans/dry-run-gamma-escalate.md` for the full prose. Reading the
tag message via `git tag -l --format='%(contents)' <tag>` is a
useful one-shot status probe for any role wanting to know "why
did this slug stop?" without checking out the analysis branch.

**Recommendation:** add an example of this probe to §13.3
*"Recovering a stuck state"* — currently the runbook tells you to
look for `review/*-escalate` tags but does not show how to read
the message inline.
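
For concreteness, the probe demonstrated on a throwaway repo (the tag name
is shortened here; the live tag is
`dry-run-2026-04-28-review/dry-run-gamma-escalate`, and the annotation
text below is illustrative):

```shell
tmp=$(mktemp -d)
git init -q "$tmp/repo" && cd "$tmp/repo"
git -c user.email=a@x -c user.name=a commit -q --allow-empty -m work
git -c user.email=a@x -c user.name=a tag -a review/dry-run-gamma-escalate \
  -m "3 attempts failed at retry cap; full prose: plans/dry-run-gamma-escalate.md"
# One-shot status probe — no checkout of the analysis branch required:
git tag -l --format='%(contents)' review/dry-run-gamma-escalate
```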

---

## 6. Failure-point checklist coverage (Integrator's view of `dry-run.md` §7)

Items covered end-to-end during this Integrator session:

| §7 item | Coverage |
|---|---|
| Empty `triage/{slug}` (no plan file)             | ✓ — Developer skipped `dry-run-epsilon`; Integrator never saw a qa tag, so chain confirmed correct |
| Submodule uninitialized before Integrator test  | not actively tested — submodule was already initialized at session start; Integrator's standard pre-test step `git submodule update --init --recursive` was a no-op all 7 cycles |
| Test infrastructure failure (timeout/OOM)       | ✓ — covered by delta, with a real-world finding (pytest-timeout missing) layered on top |
| Retry-cap escalation                            | ✓ — gamma reached cap; annotated tag pushed by Analyst |
| Two Developers claim same triage                | not exercised — single Developer session |
| `qa/{slug}` missing after feature push          | not exercised — no Developer session was killed mid-flow |
| `analysis/...` non-ff push                      | not exercised — single Analyst session, single Developer pushed to analysis |
| `gh`/`glab` missing                             | not exercised — `gh` was present and used |
| PR/MR API failure (token expired)               | not exercised |

The not-exercised items map cleanly onto failure modes that need
either a multi-session injection or a live token rotation; none of
them are pathological omissions. They should be flagged as
"requires manual injection in next dry run" rather than re-tested
in-band.

---

## 7. Recommended Initialization-checklist additions

In priority order (highest yield first):

1. **Probe for `pytest-timeout` (and any other dry-run-required
   plugins).** Add to §1 or §8:

   ```bash
   pixi run python -c "import pytest_timeout" \
     && echo "pytest-timeout: OK" \
     || echo "pytest-timeout: MISSING — needed for dry-run-delta"
   ```

2. **Smoke-test the canonical test command's full argument set,
   not just `--collect-only`.** §8 currently runs
   `pixi run test-reduction -- --collect-only` to verify pytest
   imports. Extend to verify `--timeout=2` is accepted (which
   transitively verifies pytest-timeout):

   ```bash
   pixi run test-reduction -- --collect-only --timeout=2
   ```

3. **Verify wall-clock budget for the canonical test command.**
   The dry-run.md §8 *"~30-60 min typical"* assumes a fast test
   suite. On lr_reduction, one cycle is ~9.5 minutes; with a
   gamma×3 retry cap the dry run lower-bounds at 3×9.5 = 28.5 min
   for one slug alone. Initialization should record measured
   `--collect-only` and `-x --maxfail=1 -k <smoke>` durations so
   the user can choose accelerated `TI` realistically.

4. **Verify the writable remote allows annotated-tag push and
   delete.** §2.5 already verifies a roundtrip on a single
   `init-check-*-tag`, but the dry run has a strict need for the
   *annotated* tag form. Adding `git tag -a` explicitly to the
   roundtrip would catch hosts that distinguish push permissions
   between lightweight and annotated tags.
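
A sketch of the extended roundtrip, run here against a local bare
repository standing in for the writable remote (the remote name `agentic`
and the `init-check-*-tag` naming follow this session's conventions):

```shell
tmp=$(mktemp -d)
git init -q --bare "$tmp/server.git"
git init -q "$tmp/work" && cd "$tmp/work"
git -c user.email=c@x -c user.name=c commit -q --allow-empty -m probe
git remote add agentic "$tmp/server.git"
TAG="init-check-$$-tag"
git -c user.email=c@x -c user.name=c tag -a "$TAG" -m "annotated roundtrip probe"
git push -q agentic "$TAG"           # hosts may gate annotated-tag push separately...
git push -q agentic --delete "$TAG"  # ...and annotated-tag deletion
git tag -d "$TAG" >/dev/null
echo "annotated-tag roundtrip: OK"
```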

---

## 8. Recommended `plan/dry-run.md` edits

In priority order:

1. **§5.4: clarify the `--timeout=2` invocation depends on
   pytest-timeout.** Add a footnote: *"Requires `pytest-timeout`
   in the test environment. If absent, see §7 entry on
   infrastructure-test-runner failures and the Initialization
   probe."*

2. **§5.3: tighten or loosen the *create-from-base-on-every-cycle*
   wording.** Pick one interpretation and state it explicitly.
   (See §5.1 of these findings.)

3. **§6 timeline: budget realism.** The §6 table assumes one row
   per Developer/Integrator cycle ≈ one TI poll period. With the
   real ~9.5-minute pixi test cost, the 7 serial cycles alone
   amount to roughly 67 minutes of test time. State this
   explicitly so a reader sets reasonable expectations.

4. **§9 cleanup: add the open dry-run PR closure step to the
   Integrator's poll-loop exit condition.** Currently §9 says
   the Initialization agent runs cleanup post-success. The
   Integrator could close the dry-run-labeled PRs at session
   exit if the user wants self-cleaning sessions; alternatively,
   document that PRs persist until the Initialization agent
   re-runs cleanup.

---

## 9. What worked well

- **The §8 standing-approval pattern** for PR creation was
  exactly right — single explicit ack at the first PR, then
  silent-but-auditable subsequent PRs. No friction during the
  beta-v2 PR.

- **The state machine's terminal-state contract** (annotated
  escalate tag for gamma; followup plan for delta; no work for
  epsilon) cleanly separated the four pathologies. Verifying
  *which* terminal state each slug ended up in was a single
  `git ls-remote` and `git tag -l --format='%(contents)'` away.

- **The dedicated `tests/dry_run/` directory** kept synthetic
  tests cleanly isolated from production tests; the 281
  production tests passed in every cycle, confirming no
  cross-contamination.

- **The git-based coordination model** held up under real load:
  three independent agent sessions converged on a fully consistent
  ref state with no force-pushes, no merge conflicts, and no
  manual interventions beyond the user's nudge in §4.1.

---

## 10. Net assessment

The Integrator side of the orchestration is **functional and
worth promoting to production work**, with the §4 harness fixes
applied. The dry run surfaced one real environment gap
(pytest-timeout missing) and two harness-implementation bugs
(Monitor name-dedup, fetch tag clobber) that would have caused
silent stalls in production. All three are now documented;
two are fixed in the live session; the third (pytest-timeout)
needs an env change before the next dry run or production
run that exercises a delta-style pathway.

Recommend: apply the §7 Initialization-checklist additions and
the §8.1-2 dry-run.md edits, install `pytest-timeout` in the
pixi env, then proceed to production orchestration on the four
real defects per `plan/issues.md`.