Commit c6925048 authored by Vacaliuc, Bogdan's avatar Vacaliuc, Bogdan

Merge remote-tracking branch 'origin/lr_reduction-new_workflow-repairs' into lr_reduction-new_workflow-repairs
parents b6d7f500 9ac40584
# Integrator findings — dry-run-2026-04-28

Written by the Integrator session for the
`lr_reduction-new_workflow-repairs` effort, after all five test
slugs reached terminal state. Cross-reference:
- `plan/dry-run.md` §6 (expected timeline) and §7 (failure-point
  coverage checklist).
- `plans/dry-run-delta-followup.md` and the annotation on
  `dry-run-2026-04-28-review/dry-run-gamma-escalate` on the
  analysis branch — those carry the corresponding Analyst-side
  view of the same events.

---

## 1. Final state by slug

| Slug | Pathway | Terminal artifact | PR # |
|---|---|---|---|
| alpha   | pass first attempt                       | feature tip `51bfd43`, qa/ deleted, PR opened | #1 |
| beta    | fail → retry-v2 passes                   | feature tip `51ef9d2`, qa/ deleted, PR opened | #2 |
| gamma   | fail × 3 (cap reached)                   | feature tip `ae45a8c`, annotated tag `review/dry-run-gamma-escalate` (tag object `49157fd` → commit `dd9a1d4`) | — |
| delta   | infrastructure failure                   | feature tip `9b649ef`, `plans/dry-run-delta-followup.md` written by Analyst | — |
| epsilon | malformed triage (no plan file)          | Developer skipped per spec; Integrator never observed any qa tag | — |

All `qa/*` and lightweight `review/*` tags were cleaned up by the
Developer / Integrator / Analyst per the §8 allowlist; only the
**annotated** `review/dry-run-gamma-escalate` tag remains, as the
protocol's escalation contract requires.

PRs `#1` and `#2` on `bvacaliuc/LiquidsReflectometer` are labeled
`dry-run` and base-targeted at `new_workflow_ui_plan`. Both are
left open for the cleanup phase to close; per
`plan/dry-run.md` §9, that step belongs to the Initialization
agent (not me — my poll loop is still armed).

---

## 2. Integrator cycle log

Each cycle ran the canonical command `pixi run test-reduction`
(plus `-- --timeout=2` for delta) on a fresh checkout of
`dry-run-2026-04-28-feature/<slug>`. Wall-clock time per cycle was
remarkably stable.

| # | Slug    | Attempt | Test wall-clock | Outcome                | Action taken                                |
|---|---------|---------|-----------------|------------------------|---------------------------------------------|
| 1 | alpha   | 1       | 9:32            | 282 passed             | Standing PR approval requested → PR #1; qa/ deleted |
| 2 | beta    | 1       | 9:35            | 1 failed, 281 passed   | todo.md (attempt 1) → review/; qa/ deleted  |
| 3 | gamma   | 1       | 9:35            | 1 failed, 281 passed   | todo.md (attempt 1 of N=3) → review/; qa/ deleted |
| 4 | delta   | 1       | <1 s + retry    | pytest exit 4 (arg parse) | retry once after `__pycache__` clean (deterministic same exit) → infra todo.md → review/; qa/ deleted |
| 5 | gamma   | 2 (v2)  | 9:26            | 1 failed, 281 passed   | todo.md (attempt 2 of N=3) → review/; qa/ deleted |
| 6 | beta    | 2 (v2)  | 9:30            | 282 passed             | PR #2 (standing approval); qa/ deleted      |
| 7 | gamma   | 3 (v3)  | 9:46            | 1 failed, 281 passed   | todo.md (attempt 3, RETRY CAP REACHED) → review/; qa/ deleted; Analyst proceeded to escalate |

Aggregate: **7 successful protocol cycles**, ~67 minutes of test
runtime alone, plus polling latency and the cost of the first
(botched) alpha attempt. End-to-end Integrator wall-time was
~80–90 minutes.

This is **above** the `plan/dry-run.md` §8 "30–60 min typical"
note but below the 2-hour flag. The dominant cost is the pixi
test command itself: 281 production tests (most are integration
tests that read from the SNS network mounts) consume ~9.5 minutes
on this clone regardless of the synthetic test added.

---

## 3. Real-world finding (already documented; for traceability)

`pytest-timeout` is missing from the lr_reduction pixi
environment. `plan/dry-run.md` §5.4 specifies that the Integrator
pass `--timeout=2` to `pixi run test-reduction` for the `delta`
slug; pytest aborts at argument parsing instead, exit code 4.

The Initialization checklist did not catch this because §1 only
enumerates top-level tools, not pytest plugins, and §8 verifies
collection (`--collect-only`) but not the canonical test command's
required argument set.

**Picked up by the Analyst** as
`plans/dry-run-delta-followup.md` on the dry-run analysis branch,
which proposes both a remediation (add `pytest-timeout` to the
pixi feature group) and an Initialization-checklist amendment.

This is the strongest validation of the dry run's value: a real
environment gap that would have manifested under real infra
pressure was caught at protocol-rehearsal time.

---

## 4. Bugs found in my own Integrator harness

### 4.1 Monitor's seen-cache keyed on tag *name* instead of *(SHA, name)*

The first version of my long-poll Monitor (task `bfmit91q8`) used
the tag name alone as the dedup key. When the Developer re-tagged
`qa/dry-run-{beta,gamma}` for v2 retries, the names matched the
seen-cache → Monitor stayed silent → Analyst+Developer chain
blocked downstream.

**Symptom:** the user had to nudge me with "Analyst and Developer
are waiting for gamma-v3" to expose the stall.

**Fix:** v2 of the Monitor (task `bvhw5d11w`) keys on `(SHA, name)`
so re-tags (different SHA, same name) trigger a new event. Stored
in `~/.cache/integrator-dry-run-2026-04-28-seen-sha.txt` (mode 600).

**Recommendation for production opening prompt §9.3:** add an
explicit clause to the polling-loop pseudocode: *"Dedup by `(SHA,
ref-name)` — re-tags during retry cycles must trigger a new event."*
The current §9.3 says only *"List tags matching qa/*"* and leaves
dedup to the implementer.
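
A minimal sketch of the fixed detection loop, with the `(SHA, name)` key
factored into a helper so the dedup rule can be checked in isolation. The
helper name and cache path are illustrative; `agentic` is this session's
writable remote:

```shell
# mark_if_new CACHE SHA REF: prints REF and records it iff (SHA, REF) is unseen.
# A re-tag changes the SHA, so the same tag name fires a new event — the §4.1 fix.
mark_if_new() {
  key="$2 $3"
  grep -qxF -- "$key" "$1" 2>/dev/null && return 1
  printf '%s\n' "$key" >> "$1"
  printf '%s\n' "$3"
}

SEEN="${SEEN:-$(mktemp)}"  # live session: ~/.cache/integrator-dry-run-2026-04-28-seen-sha.txt
git ls-remote --tags agentic 'refs/tags/qa/*' 2>/dev/null |
while read -r sha ref; do
  new=$(mark_if_new "$SEEN" "$sha" "$ref") || continue
  echo "NEW qa event: $new @ $sha"   # hand off to cycle handling
done
```

`ls-remote` keeps detection server-fresh (see §4.2), and the helper makes
the re-tag case explicit: a new SHA under an old name produces a new key.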

### 4.2 `git fetch --tags` rejected re-pushed qa tags as "would clobber existing tag"

When the Developer re-pushed `qa/<slug>` for v2 with a new SHA,
`git fetch agentic --tags` (without `--force`) refused to update
the local copy: a stale local tag — already deleted on the server
but still present locally — pointed at a different SHA. The update
was silently dropped, and it would have left my Monitor seeing
stale state, had the Monitor been working at the time.

**Mitigation already applied:** the post-fix Monitor uses
`ls-remote --tags` (which always reads fresh from the server) so
local tag staleness does not affect detection. For the periodic
fetch-into-local that I do during cycle handling, I now run with
`git fetch agentic --prune --tags --force`.

**Recommendation for §9.3:** the polling pseudocode should
explicitly include `--force` on the tag fetch, or document that
production should rely on `ls-remote` for detection (cheaper and
race-free) and use `--force` only when the local tag is needed.
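
Both the refusal and the fix can be reproduced end-to-end in throwaway
repos; in this sketch a local bare repository stands in for the `agentic`
remote, and the tag name is illustrative:

```shell
tmp=$(mktemp -d)
git init -q --bare "$tmp/server.git"
git init -q "$tmp/work" && cd "$tmp/work"
git -c user.email=i@x -c user.name=i commit -q --allow-empty -m v1
git remote add agentic "$tmp/server.git"
git tag qa/dry-run-beta
git push -q agentic qa/dry-run-beta
# Developer's v2 re-tag: same name, new SHA, force-pushed to the server.
git -c user.email=i@x -c user.name=i commit -q --allow-empty -m v2
git tag -f qa/dry-run-beta >/dev/null
git push -qf agentic qa/dry-run-beta
# Make the local copy stale again (as after a server-side delete + re-push):
git tag -f qa/dry-run-beta HEAD~1 >/dev/null
git fetch agentic --tags 2>clobber.txt || true  # rejected: "would clobber existing tag"
git fetch -q agentic --prune --tags --force     # local tags now match the server
```

After the forced fetch, `git rev-parse qa/dry-run-beta` agrees with the
server again.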

### 4.3 Bash-tool timeout vs. pixi test runtime

My initial alpha attempt wrapped the test command in
`timeout 240 pixi run test-reduction 2>&1 | tail -30`. Two
mistakes compounded:

1. The wall-clock of `pixi run test-reduction` is ~9.5 minutes
   (~570 s), so a 240 s budget fell short by more than a factor
   of two.
2. The `| tail -30` deferred all output until the pipeline closed,
   so when `timeout` killed it, only the literal string
   `Terminated` made it to the log. Diagnosing the failure was
   only possible because I re-ran without the pipe.

**Fix already applied:** subsequent runs used
`run_in_background=true` with the Bash tool's 10-minute max
timeout and direct redirection to a log file (no shell pipe → no
buffering → readable mid-run).

**Recommendation for §9.3:** add a footnote to the *"Run the
canonical test command"* step warning that the lr_reduction test
suite requires ~10 minutes wall-clock and that any wrapper script
should redirect output to a log file rather than piping through
`tail`.
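
The pattern, sketched with the long-running command abstracted behind
`$CMD` so the sketch runs anywhere; in the live session the command is
`pixi run test-reduction`, launched with `run_in_background=true` under
the Bash tool's 10-minute cap:

```shell
# CMD stands in for the canonical test command; a trivial stand-in is used here.
CMD=${CMD:-'echo "282 passed (stand-in output)"'}
LOG=$(mktemp /tmp/test-reduction.XXXXXX)
# Redirect straight to a file: lines are flushed as produced, so the log is
# readable mid-run. `... | tail -30` buffers everything until the pipeline
# closes — exactly the failure mode above.
sh -c "$CMD" >"$LOG" 2>&1 &
pid=$!
wait "$pid"; status=$?
printf 'exit=%s; tail of log:\n' "$status"
tail -n 5 "$LOG"
```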

---

## 5. Protocol observations

### 5.1 Developer's feature-branch handling deviated from `dry-run.md` §5.3

`plan/dry-run.md` §5.3 says the Developer *"creates
`{dry-run-prefix}-feature/{slug}` from `{base-branch}`"* on every
cycle. In practice, the Developer in this dry run **fast-forwarded
on top of my todo.md commits** instead of resetting to base:

```
ae45a8c dry-run: gamma v3 retry (body unchanged by design — final attempt)
b473a6d integrator: failing tests              ← my todo.md attempt 2
38a1f3b dry-run: gamma v2 retry (body unchanged by design)
baa64cd integrator: failing tests              ← my todo.md attempt 1
ce604f6 dry-run: add tests/dry_run/test_dry_run_gamma.py (attempt 1)
```

This is **benign** for the protocol's outcome — the Integrator's
todo.md commits get overwritten/superseded on retries anyway, and
the test still runs the latest test file — but it does mean the
feature branch carries a longer history than spec. Consequence
worth noting: a real PR that ever shipped from a multi-retry
slug would carry the Integrator's process commits unless the
Developer cleaned them up before the final qa tag.

**Recommendation:** either tighten §5.3 to read *"on every cycle,
including v{N} retries, the Developer recreates feature/{slug}
from {base-branch} (replacing any prior history)"* — and confirm
that `git rev-parse <base>` matches the recreated tip's parent —
**or** loosen the spec to allow append-on-retry and explicitly
note that the Integrator's todo.md commits will be preserved in
the feature branch's history. Either is workable; the current
language is ambiguous, and this dry run picked the second
interpretation.
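
Under the first (tightened) reading, a retry cycle would look roughly like
the following — sketched against throwaway repos, with `feature/gamma`,
the base branch, and the local bare "server" all standing in for the real
names:

```shell
tmp=$(mktemp -d)
git init -q --bare "$tmp/server.git"
git init -q "$tmp/work" && cd "$tmp/work"
git -c user.email=d@x -c user.name=d commit -q --allow-empty -m base
base=$(git symbolic-ref --short HEAD)   # stands in for new_workflow_ui_plan
git remote add agentic "$tmp/server.git"
git push -q agentic "$base"
# A prior cycle left an Integrator process commit on the feature branch:
git checkout -qb feature/gamma
git -c user.email=d@x -c user.name=d commit -q --allow-empty -m "integrator: failing tests"
git push -q agentic feature/gamma
# Retry cycle: recreate from base, replacing prior history, then verify:
git checkout -qB feature/gamma "$base"
git push -q --force-with-lease agentic feature/gamma
git merge-base --is-ancestor "$base" feature/gamma && echo "recreated from base"
```

`--force-with-lease` keeps the replacement honest: it only rewrites the
server branch if no one else moved it since the last fetch.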

### 5.2 Analyst correctly handled the delta-as-infra distinction

Per `plan/dry-run.md` §5.2, infra failures must not consume retry
budget. The Analyst correctly:

- Wrote `plans/dry-run-delta-followup.md` instead of a `-v2` plan.
- Deleted `review/dry-run-delta` without re-triaging.
- Documented the missing `pytest-timeout` plugin as a real-world
  finding, not a plan defect.

This is one of the harder behavioral discriminations in the
protocol (a real-mode behavior that *only* surfaces under real
infra pressure normally). The dry run exercised it correctly.

### 5.3 Annotated escalate tag includes a useful inline summary

The Analyst's `dry-run-2026-04-28-review/dry-run-gamma-escalate`
tag annotation summarizes the three failed attempts and points at
`plans/dry-run-gamma-escalate.md` for the full prose. Reading the
tag message via `git tag -l --format='%(contents)' <tag>` is a
useful one-shot status probe for any role wanting to know "why
did this slug stop?" without checking out the analysis branch.

**Recommendation:** add an example of this probe to §13.3
*"Recovering a stuck state"* — currently the runbook tells you to
look for `review/*-escalate` tags but does not show how to read
the message inline.
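
For concreteness, the probe demonstrated on a throwaway repo (the tag name
is shortened here; the live tag is
`dry-run-2026-04-28-review/dry-run-gamma-escalate`, and the annotation
text below is illustrative):

```shell
tmp=$(mktemp -d)
git init -q "$tmp/repo" && cd "$tmp/repo"
git -c user.email=a@x -c user.name=a commit -q --allow-empty -m work
git -c user.email=a@x -c user.name=a tag -a review/dry-run-gamma-escalate \
  -m "3 attempts failed at retry cap; full prose: plans/dry-run-gamma-escalate.md"
# One-shot status probe — no checkout of the analysis branch required:
git tag -l --format='%(contents)' review/dry-run-gamma-escalate
```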

---

## 6. Failure-point checklist coverage (Integrator's view of `dry-run.md` §7)

Items covered end-to-end during this Integrator session:

| §7 item | Coverage |
|---|---|
| Empty `triage/{slug}` (no plan file)             | ✓ — Developer skipped `dry-run-epsilon`; Integrator never saw a qa tag, so chain confirmed correct |
| Submodule uninitialized before Integrator test  | not actively tested — submodule was already initialized at session start; Integrator's standard pre-test step `git submodule update --init --recursive` was a no-op all 7 cycles |
| Test infrastructure failure (timeout/OOM)       | ✓ — covered by delta, with a real-world finding (pytest-timeout missing) layered on top |
| Retry-cap escalation                            | ✓ — gamma reached cap; annotated tag pushed by Analyst |
| Two Developers claim same triage                | not exercised — single Developer session |
| `qa/{slug}` missing after feature push          | not exercised — no Developer session was killed mid-flow |
| `analysis/...` non-ff push                      | not exercised — single Analyst session, single Developer pushed to analysis |
| `gh`/`glab` missing                             | not exercised — `gh` was present and used |
| PR/MR API failure (token expired)               | not exercised |

The not-exercised items map cleanly onto failure modes that need
either a multi-session injection or a live token rotation; none of
them are pathological omissions. They should be flagged as
"requires manual injection in next dry run" rather than re-tested
in-band.

---

## 7. Recommended Initialization-checklist additions

In priority order (highest yield first):

1. **Probe for `pytest-timeout` (and any other dry-run-required
   plugins).** Add to §1 or §8:

   ```bash
   pixi run python -c "import pytest_timeout" \
     && echo "pytest-timeout: OK" \
     || echo "pytest-timeout: MISSING — needed for dry-run-delta"
   ```

2. **Smoke-test the canonical test command's full argument set,
   not just `--collect-only`.** §8 currently runs
   `pixi run test-reduction -- --collect-only` to verify pytest
   imports. Extend to verify `--timeout=2` is accepted (which
   transitively verifies pytest-timeout):

   ```bash
   pixi run test-reduction -- --collect-only --timeout=2
   ```

3. **Verify wall-clock budget for the canonical test command.**
   The dry-run.md §8 *"~30-60 min typical"* assumes a fast test
   suite. On lr_reduction, one cycle is ~9.5 minutes; with a
   gamma×3 retry cap the dry run lower-bounds at 3×9.5 = 28.5 min
   for one slug alone. Initialization should record measured
   `--collect-only` and `-x --maxfail=1 -k <smoke>` durations so
   the user can choose accelerated `TI` realistically.

4. **Verify the writable remote allows annotated-tag push and
   delete.** §2.5 already verifies a roundtrip on a single
   `init-check-*-tag`, but the dry run has a strict need for the
   *annotated* tag form. Adding `git tag -a` explicitly to the
   roundtrip would catch hosts that distinguish push permissions
   between lightweight and annotated tags.
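
A sketch of the extended roundtrip, run here against a local bare
repository standing in for the writable remote (the remote name `agentic`
and the `init-check-*-tag` naming follow this session's conventions):

```shell
tmp=$(mktemp -d)
git init -q --bare "$tmp/server.git"
git init -q "$tmp/work" && cd "$tmp/work"
git -c user.email=c@x -c user.name=c commit -q --allow-empty -m probe
git remote add agentic "$tmp/server.git"
TAG="init-check-$$-tag"
git -c user.email=c@x -c user.name=c tag -a "$TAG" -m "annotated roundtrip probe"
git push -q agentic "$TAG"           # hosts may gate annotated-tag push separately...
git push -q agentic --delete "$TAG"  # ...and annotated-tag deletion
git tag -d "$TAG" >/dev/null
echo "annotated-tag roundtrip: OK"
```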

---

## 8. Recommended `plan/dry-run.md` edits

In priority order:

1. **§5.4: clarify the `--timeout=2` invocation depends on
   pytest-timeout.** Add a footnote: *"Requires `pytest-timeout`
   in the test environment. If absent, see §7 entry on
   infrastructure-test-runner failures and the Initialization
   probe."*

2. **§5.3: tighten or loosen the *create-from-base-on-every-cycle*
   wording.** Pick one interpretation and state it explicitly.
   (See §5.1 of these findings.)

3. **§6 timeline: budget realism.** The §6 table assumes one row
   per Developer/Integrator cycle ≈ one TI poll period. With the
   real ~9.5-minute pixi test cost, the 7 serial cycles alone
   amount to roughly 67 minutes of test time. State this
   explicitly so a reader sets reasonable expectations.

4. **§9 cleanup: add the open dry-run PR closure step to the
   Integrator's poll-loop exit condition.** Currently §9 says
   the Initialization agent runs cleanup post-success. The
   Integrator could close the dry-run-labeled PRs at session
   exit if the user wants self-cleaning sessions; alternatively,
   document that PRs persist until the Initialization agent
   re-runs cleanup.

---

## 9. What worked well

- **The §8 standing-approval pattern** for PR creation was
  exactly right — single explicit ack at the first PR, then
  silent-but-auditable subsequent PRs. No friction during the
  beta-v2 PR.

- **The state machine's terminal-state contract** (annotated
  escalate tag for gamma; followup plan for delta; no work for
  epsilon) cleanly separated the four pathologies. Verifying
  *which* terminal state each slug ended up in was a single
  `git ls-remote` and `git tag -l --format='%(contents)'` away.

- **The dedicated `tests/dry_run/` directory** kept synthetic
  tests cleanly isolated from production tests; the 281
  production tests passed in every cycle, confirming no
  cross-contamination.

- **The git-based coordination model** held up under real load:
  three independent agent sessions converged on a fully consistent
  ref state with no force-pushes, no merge conflicts, and no
  manual interventions beyond the user's nudge in §4.1.

---

## 10. Net assessment

The Integrator side of the orchestration is **functional and
worth promoting to production work**, with the §4 harness fixes
applied. The dry run surfaced one real environment gap
(pytest-timeout missing) and two harness-implementation bugs
(Monitor name-dedup, fetch tag clobber) that would have caused
silent stalls in production. All three are now documented;
two are fixed in the live session; the third (pytest-timeout)
needs an env change before the next dry run or production
run that exercises a delta-style pathway.

Recommend: apply the §7 Initialization-checklist additions and
the §8.1-2 dry-run.md edits, install `pytest-timeout` in the
pixi env, then proceed to production orchestration on the four
real defects per `plan/issues.md`.