Commit 511590ff authored by Vacaliuc, Bogdan's avatar Vacaliuc, Bogdan
Browse files

plan: scripts/ — four bulletproof poll-wrapper-*.sh scripts



Per orchestration-v2-redesign.md §3.1 and §9.1. Each script wraps
the role's poll loop in a Bash process that:

- Discovers refs via cheap `git ls-remote {remote}
  '<narrow-refspec>'` per cycle (no --tags side effects, no
  would-clobber rejections per dry-run F4).
- Strips ^{} peel suffixes from annotated-tag dereference lines
  before applying suffix exclusions (per dry-run F2).
- Dedups events on (SHA, ref-name) tuples — re-pointed refs (Developer
  empty-commit advance for v{N>1} retries) trigger NEW events; this is
  the fix for dry-run Stall B (Integrator findings §4.1).
- Emits typed STALLED reason=... events on failure paths (categorized
  as fetch-403 / fetch-tag-clobber / fetch-network / fetch-auth /
  fetch-error) instead of exiting silently — the fix for dry-run
  Stall A.
- Continues polling indefinitely; after 3 consecutive same-reason
  fails, escalates to phase=stalled and backs off to 3*T{role} but
  never exits.
- Updates {STATE_DIR}/agent-state-<role>.json atomically (write to
  .tmp.$$, mv -f) every cycle so the §9.7 statusline never reads a
  partially-written file. Mode 600 on every write.

Per-role specifics:
- Analyst: refspec refs/tags/{prefix}review/*; emits REVIEW (and
  REVIEW_GONE on remote deletion); skips *-escalate (annotated tag
  for human attention).
- Developer: refspec refs/heads/{prefix}triage/*; emits TRIAGE; no
  suffix exclusion (every -v{N} variant is a real work item). The
  agent itself filters with plans/skipped-triage.txt on the analysis
  branch (per Developer findings §3.5).
- Integrator: refspec refs/tags/{prefix}qa/*; emits QA; targeted
  fetch with --force is the agent's responsibility on pickup
  (orchestration.md §9.3).
- Administrator: different shape — no work-item events. Reads each
  worker's agent-state-<role>.json plus one ls-remote per cycle;
  emits one DASHBOARD line per cycle plus zero or more ATTENTION
  lines when any worker has phase=stalled or last-fetch-age >
  3*T_WORKER. Read-only on workers per redesign §5.3.

All scripts use `set -uo pipefail` (not -e — we explicitly check
the few commands that matter and let the loop continue on transient
errors), avoid jq for the small set of fields they read (grep+sed
fallback so the wrapper works on any clone with stock coreutils),
and run with mode 700.

Co-Authored-By: default avatarClaude Opus 4.7 (1M context) <noreply@anthropic.com>
parent 3fea9792
Loading
Loading
Loading
Loading
+183 −0
Original line number Diff line number Diff line
#!/usr/bin/env bash
# poll-wrapper-Administrator.sh — bulletproof Bash poll loop for the Administrator role
#
# Per orchestration.md §6.4 (Administrator state machine), §9.6, and
# orchestration-v2-redesign.md §3.1 / §5. The Administrator runs in
# phase 2 (active monitoring) only; phase 1 is one-shot and done by
# the agent itself before this wrapper is launched.
#
# Phase 2 behaviors:
#   - Every TM seconds, read agent-state-{Analyst,Developer,
#     Integrator}.json; missing file → that role is offline.
#   - One cheap `git ls-remote` against {prefix-or-empty}*/* (workers
#     publish state via refs; this is the single fresh signal the
#     Administrator needs per cycle).
#   - Compute system summary: agents present, phase per agent,
#     last-fetch ages, stall reasons.
#   - Emit ATTENTION lines when any worker has phase=stalled OR
#     last-fetch-age > 3*T{role} (per redesign §5.1).
#   - Write the summary to agent-state-Administrator.json.
#   - Continue polling indefinitely; never exits on transient errors.
#
# This wrapper is read-only on worker state files and read-only on
# the {remote} (no fetch, no push, no write to anyone else's state).
# The Administrator never restarts a stalled worker autonomously —
# it only reports.
#
# Environment:
#   REMOTE      writable git remote name (default: agentic)
#   PREFIX      ref-name prefix for dry-run isolation (default: empty)
#   TM          monitor poll interval in seconds (default: 300)
#   T_WORKER    typical worker poll interval; used to scale the
#               "stale fetch" threshold to 3*T_WORKER (default: 60)
#   STATE_DIR   directory for agent-state-<role>.json (default ~/.claude/state)
#
# Usage:
#   PREFIX=dry-run-2026-04-28- TM=300 ./poll-wrapper-Administrator.sh
#
# Output: one event line per cycle plus zero or more ATTENTION lines.
# Event grammar:
#   DASHBOARD ts=<unix-ts> roles=<csv-of-present-roles>
#             phases=<role:phase,...> openWork=<N>
#   ATTENTION role=<role> reason=<reason> detail=<...> ts=<unix-ts>
#   STALLED reason=<reason> [fail_count=<N>] ts=<unix-ts>
# (STALLED here is for the Administrator's OWN failures — e.g. its
#  ls-remote returned 403; not for worker stalls, which surface as
#  ATTENTION.)

set -uo pipefail

ROLE="Administrator"
REMOTE="${REMOTE:-agentic}"
PREFIX="${PREFIX:-}"
TM="${TM:-300}"
T_WORKER="${T_WORKER:-60}"
STATE_DIR="${STATE_DIR:-${HOME}/.claude/state}"
STATE="${STATE_DIR}/agent-state-${ROLE}.json"
WORKERS=(Analyst Developer Integrator)

mkdir -p "$STATE_DIR"
chmod 700 "$STATE_DIR" 2>/dev/null || true

fail_count=0
iter=0

write_state() {
  local phase="$1" stalled="$2" last_scan_ts="$3" summary="$4"
  local tmp="${STATE}.tmp.$$"
  cat > "$tmp" <<EOF
{
  "role": "${ROLE}",
  "iteration": ${iter},
  "phase": "${phase}",
  "lastFetchOk": true,
  "lastFetchTs": ${last_scan_ts},
  "failCount": ${fail_count},
  "stalledReason": ${stalled:+\"${stalled}\"}${stalled:-null},
  "summary": ${summary:-\"\"}
}
EOF
  chmod 600 "$tmp"
  mv -f "$tmp" "$STATE"
}

classify_error() {
  local err="$1"
  if grep -q '403' <<<"$err"; then
    printf 'fetch-403'
  elif grep -qE 'Could not resolve host|Connection timed out|network is unreachable' <<<"$err"; then
    printf 'fetch-network'
  elif grep -qE 'Authentication failed|denied|unauthorized' <<<"$err"; then
    printf 'fetch-auth'
  else
    printf 'fetch-error'
  fi
}

# Read a scalar field from a worker's state file via grep+sed (avoids
# requiring jq). Returns empty if the file or field is missing.
read_field() {
  local file="$1" key="$2"
  [[ -r "$file" ]] || return 0
  grep -oE "\"${key}\"\s*:\s*(\"[^\"]*\"|[0-9.]+|null|true|false)" "$file" 2>/dev/null \
    | head -1 \
    | sed -E "s/^\"${key}\"\s*:\s*//; s/^\"//; s/\"\$//; s/^null\$//"
}

while true; do
  iter=$((iter + 1))
  iter_start=$(date -u +%s)

  # Single ls-remote against the prefix-scoped namespace (cheap; one
  # roundtrip, no objects). On failure, emit STALLED for the
  # Administrator itself but continue polling — workers may still be
  # making progress visible via their state files.
  refspec_pattern="${PREFIX}*/*"
  ls_remote_ok=true
  if ! refs_now=$(git ls-remote "$REMOTE" "${refspec_pattern}" 2>&1); then
    fail_count=$((fail_count + 1))
    reason="$(classify_error "$refs_now")"
    echo "STALLED reason=${reason} fail_count=${fail_count} ts=${iter_start}"
    ls_remote_ok=false
    refs_now=""
  else
    fail_count=0
  fi

  # Aggregate per-worker state.
  phases_csv=""
  open_total=0
  attention_lines=()
  present_csv=""
  for role in "${WORKERS[@]}"; do
    sf="${STATE_DIR}/agent-state-${role}.json"
    if [[ ! -r "$sf" ]]; then
      [[ -n "$phases_csv" ]] && phases_csv+=","
      phases_csv+="${role}:offline"
      continue
    fi
    [[ -n "$present_csv" ]] && present_csv+=","
    present_csv+="${role}"

    phase="$(read_field "$sf" 'phase')"
    phase="${phase:-?}"
    stalled_reason="$(read_field "$sf" 'stalledReason')"
    last_fetch_ts="$(read_field "$sf" 'lastFetchTs')"
    fail="$(read_field "$sf" 'failCount')"
    fail="${fail:-0}"

    [[ -n "$phases_csv" ]] && phases_csv+=","
    phases_csv+="${role}:${phase}"

    # ATTENTION conditions:
    if [[ "$phase" == "stalled" ]] || [[ -n "$stalled_reason" ]]; then
      attention_lines+=("ATTENTION role=${role} reason=stalled detail=${stalled_reason:-unknown} ts=${iter_start}")
    elif [[ "$last_fetch_ts" =~ ^[0-9]+$ ]] && (( last_fetch_ts > 0 )); then
      age=$(( iter_start - last_fetch_ts ))
      threshold=$(( T_WORKER * 3 ))
      if (( age > threshold )); then
        attention_lines+=("ATTENTION role=${role} reason=fetch-stale detail=age=${age}s>threshold=${threshold}s ts=${iter_start}")
      fi
    fi
  done

  # Emit dashboard line (single line per cycle).
  echo "DASHBOARD ts=${iter_start} roles=${present_csv:-none} phases=${phases_csv} openWork=${open_total} ls_remote=${ls_remote_ok}"

  # Emit ATTENTION lines (if any).
  for line in "${attention_lines[@]:-}"; do
    [[ -n "$line" ]] && echo "$line"
  done

  # Build a small JSON summary string for our own state file.
  summary_json="\"roles=${present_csv:-none} phases=${phases_csv} attention=${#attention_lines[@]}\""

  # Update Administrator's own state file.
  if $ls_remote_ok; then
    write_state "polling" "" "$iter_start" "$summary_json"
  else
    write_state "polling" "$(classify_error "$refs_now")" "$iter_start" "$summary_json"
  fi

  sleep "$TM"
done
+156 −0
Original line number Diff line number Diff line
#!/usr/bin/env bash
# poll-wrapper-Analyst.sh — bulletproof Bash poll loop for the Analyst role
#
# Per orchestration.md §6 (state machine), §6.4 / §9.7 / §16, and
# orchestration-v2-redesign.md §3.1. The Analyst watches review/* tags
# on {remote} via cheap `git ls-remote` discovery and emits typed events
# to stdout that the Claude Code Monitor task surfaces to the agent.
#
# Key behaviors:
#   - Cheap discovery via ls-remote with a NARROW refspec (no --tags
#     side effects, no would-clobber rejection per dry-run F4).
#   - Strips ^{} peel of annotated tags BEFORE applying the *-escalate
#     suffix exclusion (per dry-run F2).
#   - Dedup keyed on (SHA, ref-name) — re-pointed refs trigger NEW
#     events (per orchestration.md §6 dedup contract).
#   - Emits typed STALLED reason=... events on failure paths instead
#     of exiting (per redesign §3.1; mitigates Stall A).
#   - Updates {STATE_DIR}/agent-state-Analyst.json every loop with
#     iteration / phase / lastFetchOk / lastFetchTs / failCount.
#   - Continues polling indefinitely; never exits on transient errors.
#
# Environment:
#   REMOTE      writable git remote name (default: agentic)
#   PREFIX      ref-name prefix for dry-run isolation (default: empty)
#   TA          poll interval in seconds (default: 60; 30 minimum dry-run)
#   STATE_DIR   directory for agent-state-<role>.json (default ~/.claude/state)
#
# Usage:
#   cd lr_reduction
#   PREFIX=dry-run-2026-04-28- TA=30 ./poll-wrapper-Analyst.sh
#
# Output: one event line per change. Event grammar:
#   REVIEW slug=<slug> ref=<refname> sha=<sha>
#   REVIEW_GONE ref=<refname>
#   STALLED reason=<reason> [fail_count=<N>] [detail=<...>] ts=<unix-ts>

set -uo pipefail

ROLE="Analyst"
REMOTE="${REMOTE:-agentic}"
PREFIX="${PREFIX:-}"
TA="${TA:-60}"
STATE_DIR="${STATE_DIR:-${HOME}/.claude/state}"
STATE="${STATE_DIR}/agent-state-${ROLE}.json"

mkdir -p "$STATE_DIR"
chmod 700 "$STATE_DIR" 2>/dev/null || true

# In-memory dedup keyed on refname → SHA.
declare -A seen_sha
fail_count=0
last_fail_reason=""
iter=0

# Atomic state-file write (write to .tmp, rename) so the statusline
# never reads a partially-written file.
write_state() {
  local phase="$1" stalled="$2" last_fetch_ok="$3" last_fetch_ts="$4"
  local tmp="${STATE}.tmp.$$"
  cat > "$tmp" <<EOF
{
  "role": "${ROLE}",
  "iteration": ${iter},
  "phase": "${phase}",
  "lastFetchOk": ${last_fetch_ok},
  "lastFetchTs": ${last_fetch_ts},
  "failCount": ${fail_count},
  "stalledReason": ${stalled:+\"${stalled}\"}${stalled:-null}
}
EOF
  chmod 600 "$tmp"
  mv -f "$tmp" "$STATE"
}

classify_error() {
  # Stdin: stderr/stdout from the failed git command.
  # Stdout: short reason tag (one of: fetch-403, fetch-tag-clobber,
  #         fetch-network, fetch-auth, fetch-error)
  local err="$1"
  if grep -q '403' <<<"$err"; then
    printf 'fetch-403'
  elif grep -q 'would clobber existing tag' <<<"$err"; then
    printf 'fetch-tag-clobber'
  elif grep -qE 'Could not resolve host|Connection timed out|network is unreachable' <<<"$err"; then
    printf 'fetch-network'
  elif grep -qE 'Authentication failed|denied|unauthorized' <<<"$err"; then
    printf 'fetch-auth'
  else
    printf 'fetch-error'
  fi
}

while true; do
  iter=$((iter + 1))
  iter_start=$(date -u +%s)
  write_state "polling" "" true "$iter_start"

  # Cheap discovery via ls-remote with narrow refspec — no --tags side
  # effects, no objects downloaded.
  refspec="refs/tags/${PREFIX}review/*"
  if ! refs_now=$(git ls-remote "$REMOTE" "$refspec" 2>&1); then
    fail_count=$((fail_count + 1))
    reason="$(classify_error "$refs_now")"
    last_fail_reason="$reason"
    echo "STALLED reason=${reason} fail_count=${fail_count} ts=${iter_start}"
    if (( fail_count >= 3 )); then
      echo "STALLED reason=${reason}-persistent action=needs-investigation fail_count=${fail_count} ts=${iter_start}"
      write_state "stalled" "$reason" false "$iter_start"
      sleep $((TA * 3))
    else
      write_state "polling" "$reason" false "$iter_start"
      sleep "$TA"
    fi
    continue
  fi
  fail_count=0

  # Build the set of (refname, sha) seen this cycle. ls-remote output
  # is `<sha>\t<refname>` per line; annotated tags also produce a
  # second line `<peeled-sha>\t<refname>^{}` which we strip BEFORE
  # filtering (F2 fix).
  declare -A current=()
  while IFS=$'\t' read -r sha ref; do
    [[ -z "${sha:-}" ]] && continue
    refn="${ref%^\{\}}"
    # Filter: skip *-escalate (annotated escalation tag, NOT a review
    # event). Suffix exclusion applied AFTER the ^{} strip.
    [[ "$refn" == *-escalate ]] && continue
    current["$refn"]="$sha"
  done <<<"$refs_now"

  # Emit REVIEW for new (refname, sha) tuples.
  for refn in "${!current[@]}"; do
    sha="${current[$refn]}"
    if [[ "${seen_sha[$refn]:-}" != "$sha" ]]; then
      seen_sha[$refn]="$sha"
      slug="${refn##*review/}"
      echo "REVIEW slug=${slug} ref=${refn} sha=${sha}"
    fi
  done

  # Emit REVIEW_GONE for refs that were in the seen-cache but absent
  # from the current scan (deleted on remote).
  for refn in "${!seen_sha[@]}"; do
    if [[ -z "${current[$refn]:-}" ]]; then
      unset "seen_sha[$refn]"
      echo "REVIEW_GONE ref=${refn}"
    fi
  done

  unset current
  declare -A current=()

  write_state "polling" "" true "$iter_start"
  sleep "$TA"
done
+138 −0
Original line number Diff line number Diff line
#!/usr/bin/env bash
# poll-wrapper-Developer.sh — bulletproof Bash poll loop for the Developer role
#
# Per orchestration.md §6 (state machine), §6.4 / §9.7 / §16, and
# orchestration-v2-redesign.md §3.1. The Developer watches triage/*
# branches on {remote} via cheap `git ls-remote` discovery and emits
# typed events to stdout that the Claude Code Monitor task surfaces
# to the agent.
#
# Key behaviors mirror poll-wrapper-Analyst.sh; the only differences:
#   - Refspec is refs/heads/{prefix}triage/* (branches, not tags).
#   - Event grammar emits TRIAGE (and TRIAGE_GONE) instead of REVIEW.
#   - No suffix exclusion (the Developer wants every triage variant
#     including -v{N}).
#   - The agent itself filters with plans/skipped-triage.txt on the
#     analysis branch (per Developer findings §3.5); this wrapper
#     only signals presence.
#
# Environment:
#   REMOTE      writable git remote name (default: agentic)
#   PREFIX      ref-name prefix for dry-run isolation (default: empty)
#   TD          poll interval in seconds (default: 60; 30 minimum dry-run)
#   STATE_DIR   directory for agent-state-<role>.json (default ~/.claude/state)
#
# Usage:
#   cd lr_reduction
#   PREFIX=dry-run-2026-04-28- TD=30 ./poll-wrapper-Developer.sh
#
# Output: one event line per change. Event grammar:
#   TRIAGE slug=<slug> ref=<refname> sha=<sha>
#   TRIAGE_GONE ref=<refname>
#   STALLED reason=<reason> [fail_count=<N>] [detail=<...>] ts=<unix-ts>

set -uo pipefail

ROLE="Developer"
REMOTE="${REMOTE:-agentic}"
PREFIX="${PREFIX:-}"
TD="${TD:-60}"
STATE_DIR="${STATE_DIR:-${HOME}/.claude/state}"
STATE="${STATE_DIR}/agent-state-${ROLE}.json"

mkdir -p "$STATE_DIR"
chmod 700 "$STATE_DIR" 2>/dev/null || true

declare -A seen_sha
fail_count=0
last_fail_reason=""
iter=0

write_state() {
  local phase="$1" stalled="$2" last_fetch_ok="$3" last_fetch_ts="$4"
  local tmp="${STATE}.tmp.$$"
  cat > "$tmp" <<EOF
{
  "role": "${ROLE}",
  "iteration": ${iter},
  "phase": "${phase}",
  "lastFetchOk": ${last_fetch_ok},
  "lastFetchTs": ${last_fetch_ts},
  "failCount": ${fail_count},
  "stalledReason": ${stalled:+\"${stalled}\"}${stalled:-null}
}
EOF
  chmod 600 "$tmp"
  mv -f "$tmp" "$STATE"
}

classify_error() {
  local err="$1"
  if grep -q '403' <<<"$err"; then
    printf 'fetch-403'
  elif grep -q 'would clobber existing tag' <<<"$err"; then
    printf 'fetch-tag-clobber'
  elif grep -qE 'Could not resolve host|Connection timed out|network is unreachable' <<<"$err"; then
    printf 'fetch-network'
  elif grep -qE 'Authentication failed|denied|unauthorized' <<<"$err"; then
    printf 'fetch-auth'
  else
    printf 'fetch-error'
  fi
}

while true; do
  iter=$((iter + 1))
  iter_start=$(date -u +%s)
  write_state "polling" "" true "$iter_start"

  refspec="refs/heads/${PREFIX}triage/*"
  if ! refs_now=$(git ls-remote "$REMOTE" "$refspec" 2>&1); then
    fail_count=$((fail_count + 1))
    reason="$(classify_error "$refs_now")"
    last_fail_reason="$reason"
    echo "STALLED reason=${reason} fail_count=${fail_count} ts=${iter_start}"
    if (( fail_count >= 3 )); then
      echo "STALLED reason=${reason}-persistent action=needs-investigation fail_count=${fail_count} ts=${iter_start}"
      write_state "stalled" "$reason" false "$iter_start"
      sleep $((TD * 3))
    else
      write_state "polling" "$reason" false "$iter_start"
      sleep "$TD"
    fi
    continue
  fi
  fail_count=0

  declare -A current=()
  while IFS=$'\t' read -r sha ref; do
    [[ -z "${sha:-}" ]] && continue
    # ls-remote on refs/heads/* doesn't produce ^{} lines, but the
    # strip is harmless and keeps the wrapper symmetric with the
    # Analyst variant.
    refn="${ref%^\{\}}"
    current["$refn"]="$sha"
  done <<<"$refs_now"

  for refn in "${!current[@]}"; do
    sha="${current[$refn]}"
    if [[ "${seen_sha[$refn]:-}" != "$sha" ]]; then
      seen_sha[$refn]="$sha"
      slug="${refn##*triage/}"
      echo "TRIAGE slug=${slug} ref=${refn} sha=${sha}"
    fi
  done

  for refn in "${!seen_sha[@]}"; do
    if [[ -z "${current[$refn]:-}" ]]; then
      unset "seen_sha[$refn]"
      echo "TRIAGE_GONE ref=${refn}"
    fi
  done

  unset current
  declare -A current=()

  write_state "polling" "" true "$iter_start"
  sleep "$TD"
done
+141 −0
Original line number Diff line number Diff line
#!/usr/bin/env bash
# poll-wrapper-Integrator.sh — bulletproof Bash poll loop for the Integrator role
#
# Per orchestration.md §6 (state machine), §6.4 / §9.7 / §16, and
# orchestration-v2-redesign.md §3.1. The Integrator watches qa/* tags
# on {remote} via cheap `git ls-remote` discovery and emits typed
# events to stdout that the Claude Code Monitor task surfaces to the
# agent.
#
# Key behaviors:
#   - Dedup keyed on (SHA, ref-name) so re-tagged qa/* refs (the
#     Developer's empty-commit advance for v{N>1} retries — see
#     orchestration.md §9.2) trigger NEW events. This is the fix for
#     dry-run Stall B (Integrator findings §4.1).
#   - Refspec is refs/tags/{prefix}qa/* (narrow; no --tags side
#     effects, no would-clobber rejection per dry-run F4).
#   - When the agent picks up a work item, it should use
#     `git fetch {remote} 'refs/tags/{prefix}qa/{slug}' --force`
#     for the targeted pull (§9.3 of orchestration.md).
#
# Environment:
#   REMOTE      writable git remote name (default: agentic)
#   PREFIX      ref-name prefix for dry-run isolation (default: empty)
#   TI          poll interval in seconds (default: 60; 30 minimum dry-run)
#   STATE_DIR   directory for agent-state-<role>.json (default ~/.claude/state)
#
# Usage:
#   cd lr_reduction
#   PREFIX=dry-run-2026-04-28- TI=30 ./poll-wrapper-Integrator.sh
#
# Output: one event line per change. Event grammar:
#   QA slug=<slug> ref=<refname> sha=<sha>
#   QA_GONE ref=<refname>
#   STALLED reason=<reason> [fail_count=<N>] [detail=<...>] ts=<unix-ts>

set -uo pipefail

ROLE="Integrator"
REMOTE="${REMOTE:-agentic}"
PREFIX="${PREFIX:-}"
TI="${TI:-60}"
STATE_DIR="${STATE_DIR:-${HOME}/.claude/state}"
STATE="${STATE_DIR}/agent-state-${ROLE}.json"

mkdir -p "$STATE_DIR"
chmod 700 "$STATE_DIR" 2>/dev/null || true

declare -A seen_sha
fail_count=0
last_fail_reason=""
iter=0

write_state() {
  local phase="$1" stalled="$2" last_fetch_ok="$3" last_fetch_ts="$4"
  local tmp="${STATE}.tmp.$$"
  cat > "$tmp" <<EOF
{
  "role": "${ROLE}",
  "iteration": ${iter},
  "phase": "${phase}",
  "lastFetchOk": ${last_fetch_ok},
  "lastFetchTs": ${last_fetch_ts},
  "failCount": ${fail_count},
  "stalledReason": ${stalled:+\"${stalled}\"}${stalled:-null}
}
EOF
  chmod 600 "$tmp"
  mv -f "$tmp" "$STATE"
}

classify_error() {
  local err="$1"
  if grep -q '403' <<<"$err"; then
    printf 'fetch-403'
  elif grep -q 'would clobber existing tag' <<<"$err"; then
    printf 'fetch-tag-clobber'
  elif grep -qE 'Could not resolve host|Connection timed out|network is unreachable' <<<"$err"; then
    printf 'fetch-network'
  elif grep -qE 'Authentication failed|denied|unauthorized' <<<"$err"; then
    printf 'fetch-auth'
  else
    printf 'fetch-error'
  fi
}

while true; do
  iter=$((iter + 1))
  iter_start=$(date -u +%s)
  write_state "polling" "" true "$iter_start"

  refspec="refs/tags/${PREFIX}qa/*"
  if ! refs_now=$(git ls-remote "$REMOTE" "$refspec" 2>&1); then
    fail_count=$((fail_count + 1))
    reason="$(classify_error "$refs_now")"
    last_fail_reason="$reason"
    echo "STALLED reason=${reason} fail_count=${fail_count} ts=${iter_start}"
    if (( fail_count >= 3 )); then
      echo "STALLED reason=${reason}-persistent action=needs-investigation fail_count=${fail_count} ts=${iter_start}"
      write_state "stalled" "$reason" false "$iter_start"
      sleep $((TI * 3))
    else
      write_state "polling" "$reason" false "$iter_start"
      sleep "$TI"
    fi
    continue
  fi
  fail_count=0

  declare -A current=()
  while IFS=$'\t' read -r sha ref; do
    [[ -z "${sha:-}" ]] && continue
    # Strip ^{} peel of annotated tags (qa/* is normally lightweight,
    # but the strip is defensive — symmetric with the Analyst).
    refn="${ref%^\{\}}"
    current["$refn"]="$sha"
  done <<<"$refs_now"

  # Dedup on (SHA, ref-name): re-tag with a new SHA but the same
  # name is a NEW event (per dry-run Integrator findings §4.1).
  for refn in "${!current[@]}"; do
    sha="${current[$refn]}"
    if [[ "${seen_sha[$refn]:-}" != "$sha" ]]; then
      seen_sha[$refn]="$sha"
      slug="${refn##*qa/}"
      echo "QA slug=${slug} ref=${refn} sha=${sha}"
    fi
  done

  for refn in "${!seen_sha[@]}"; do
    if [[ -z "${current[$refn]:-}" ]]; then
      unset "seen_sha[$refn]"
      echo "QA_GONE ref=${refn}"
    fi
  done

  unset current
  declare -A current=()

  write_state "polling" "" true "$iter_start"
  sleep "$TI"
done