Commit 1070df2c authored by Bogdan Vacaliuc's avatar Bogdan Vacaliuc
Browse files

plan: revise for UID/GID guard, -e/--das3 classifier, make-based RELEASE parsing



Responds to prompt 1.1 feedback and resolves open questions Q4 and Q5.

- §2: document the single-dash hostname pattern (bl\d+[a-z]?-<machine>);
  multi-dash names like ref-l-vtwin1 are explicitly invalid.
- §4.5 (new): "create if missing, never mutate" rule for the controls
  user and slowcontrols_developers group. Pins canonical UID=88888 /
  GID=43647 on fresh VMs but leaves existing non-canonical IDs alone
  (e.g., bl4b-vtwin1 today has 166/55121) to avoid chown storms and
  orphaning running processes with ruid=166.
- §6.3 step 2: replace the grep-based RELEASE aggregator (which would
  miss $(SUPPORT)-style variable refs) with a GNU-make helper
  epics/dump-deps.mk. Uses make as the make-syntax interpreter, so
  $(VAR) expansion "just works" without reinventing a parser or needing
  EPICS base built (msi is unavailable pre-build).
- §6.3 step 3 (new): per-path classifier table for git-release.sh flags.
  -e and --das3 are mutually exclusive; the classifier picks exactly
  one branch per unique /home/controls/... path. Notes the latent bug
  in share/scripts/dependencies.sh which passes --das3 on *common*
  paths but has no -e branch for *epics* paths; fix lives in our own
  build_modules.sh, not upstream.
- §6.3 step 8: move the existing 'git-release.sh -r --all --das3 nED
  rel2.4.10_20241203' line out of the top-level provision target into
  epics-modules so it runs after all transitive deps exist. No behavior
  change for non-vtwin roles because epics-modules is ROLE-gated.
- §9 (verify) check 0: assert user/group existence and membership by
  name, emit NOTE for non-canonical IDs, still pass. Check 7: flag-only
  PS71/PS72 TCP 4571 collision (bootstrap never mutates beamline repo).
- §11.1: add epics/dump-deps.mk to deliverables; §5.2 tree updated.
- §12 Q4, Q5: resolved inline.
- §13 step 0 (new): validate the aggregator dry-run on this machine
  before touching any live state — catches classifier/dump-deps bugs
  while they're cheap.
- Appendix B: add dry-run cheat-sheet lines for dump-deps.mk,
  build_modules.sh --show-only, and the ID-guard sanity check.

Co-Authored-By: default avatarClaude Opus 4.6 (1M context) <noreply@anthropic.com>
parent cc5787d1
Loading
Loading
Loading
Loading
+239 −50
Original line number Diff line number Diff line
@@ -63,12 +63,22 @@ agree perfectly.
beamline's `${beamline}-ProcServ-${hostid}` IOC without any parameter changes.
This is the key "configure-by-hostname" property the plan must preserve.

**Hostname pattern (single-dash only):** the `cut -f1 -d-` tokenization is
strictly a one-dash parser. Valid hostnames must match `bl\d+[a-z]?-<machine>`
(e.g., `bl4b-vtwin1`, `bl11a-dasopi1`, `bl100-daq2`). Multi-dash names like
`ref-l-vtwin1` yield `BEAMLINE=ref, MACHINE=l` which is nonsense. The one
explicit carve-out is `rgb-dev-*` (Ray's VMs) which the Makefile already
special-cases. Do not introduce new multi-dash conventions; the beamline
registry treats names as single-dash tokens.

**Preserve also:** the `slowcontrols_developers` GID pinning at 43647 (used on
real instruments - currently 55121 on bl4b-vtwin1 but sub-optimal).
Establish this as a fixed GID in the `dependencies`
target so newly created `/home/controls/*` files survive replication to sibling
machines. Likewise, the `controls` UID shall be pinned to 88888 (used on real
instruments - currently 166 on bl4b-vtwin1 but sub-optimal).
instruments - currently 166 on bl4b-vtwin1 but sub-optimal). See §4.5 for
the "create-if-missing, never mutate" guard that makes this safe on machines
where different IDs are already in use.

---

@@ -169,6 +179,40 @@ The user's directive: *"please keep a record of [OS installs] in the
listed in the top-level `dependencies` target so a fresh VM can always run
`make dependencies` to reach the prerequisite state.

### 4.5 User/group ID guard rule — create if missing, never mutate

The canonical IDs for real beamline nodes are `controls` UID=88888 and
`slowcontrols_developers` GID=43647. A fresh VM should be created with
these exact values so that files owned by `controls` are portable across
sibling machines. However, on a machine where the user and/or group
**already exist with different IDs** (e.g., `bl4b-vtwin1` today has 166 /
55121), attempting to change them would require chown-ing every file
under `/home/controls/**` and would orphan any running process whose
`ruid` equals the old UID — highly disruptive and unnecessary.

**The rule, applied in three places:**

1. **`dependencies` target (create-if-missing).** Use `getent` to check
   existence. If missing, `groupadd -g 43647 slowcontrols_developers` and
   `useradd -u 88888 -g slowcontrols_developers …`. If present with
   non-canonical IDs, emit a `WARN:` line and continue (no `usermod -u`,
   no `groupmod -g`).

2. **`install`/`provision` — reference by name.** The existing Makefile
   already uses `install -dm6775 --owner=controls --group=slowcontrols_developers`
   which resolves names via `getent` at runtime. Keep this; never hard-code
   numeric IDs in ownership arguments.

3. **`verify` target — membership, not exact IDs.** Assert that
   `controls` exists, the group exists, and `controls ∈ slowcontrols_developers`.
   If the UID/GID diverge from canonical, emit a `NOTE:` explaining the
   deviation and still pass. Concrete checks are specified in §9.

This makes the bootstrap correct on both a freshly-pinned VM and on the
current `bl4b-vtwin1` (and any other machine where legacy IDs are already
in use). Migration — if ever needed on a given host — is an operator-driven
re-image operation, not a bootstrap action.

---

## 5. Proposed new Makefile layout
@@ -228,8 +272,9 @@ bootstrap/
│   └── README.md
├── epics/                         # NEW — EPICS module builds
│   ├── Makefile
│   ├── modules.list               # ordered dependency list (base → asyn → …)
│   ├── build_modules.sh           # idempotent wrapper around git-release.sh
│   ├── dump-deps.mk               # GNU-make helper: expand $(VAR) in RELEASE files
│   ├── modules.list               # topological order (base → asyn → …)
│   ├── build_modules.sh           # aggregator + classifier + git-release.sh driver
│   └── README.md
├── iocs/                          # NEW — IOC application builds
│   ├── Makefile
@@ -347,12 +392,8 @@ builds each one using its own `Makefile`.

**The generalization:** rather than treating nED as special, the bootstrap
walks every `applications/*/configure/RELEASE*` file, extracts the *unique set*
of module paths, and calls `git-release.sh` on each in topological order. The
existing `/home/controls/share/scripts/dependencies.sh` already implements
this loop — it reads `configure/RELEASE[.local]`, extracts unique paths, and
calls `git-release.sh` or `ics-deploy` per module. We just run it from each
IOC's Makefile (for IOCs that have a `dependencies:` target) *or* run it
once against the aggregate set.
of module paths, classifies each path to decide which `git-release.sh` flags
apply, and calls `git-release.sh` per path in topological order.

**Concrete steps:**

@@ -360,34 +401,134 @@ once against the aggregate set.
   If `SSHPASS` is unset and stdin is a TTY, prompt once at the start of the
   target so a single password unlocks every subsequent `git clone`.

2. **Aggregate unique module paths** across `applications/*/configure/RELEASE*`:
2. **Aggregate unique module paths** across `applications/*/configure/RELEASE*`
   using `epics/dump-deps.mk` — a 10-line helper that leverages GNU make
   itself to expand `$(VAR)` references. This handles IOCs that write
   `SUPPORT=/home/controls` + `ASYN=$(SUPPORT)/common/asyn/main` — plain
   grep-based aggregation would miss these because `$(...)` is unexpanded
   text. `msi` cannot be used here (it only materializes after EPICS base
   is built → chicken-and-egg); GNU make is always present because it's
   required for EPICS builds anyway, listed in §7.

   `epics/dump-deps.mk`:
   ```make
   # Expand /home/controls/... paths from an IOC's configure/RELEASE{.local}.
   # Uses GNU make as the make-syntax interpreter, so $(VAR) substitution
   # "just works" without reinventing a parser or needing EPICS base built.
   #
   # Usage:  make -f dump-deps.mk IOC_DIR=/path/to/ioc dump

   IOC_DIR ?= $(CURDIR)
   TOP     := $(IOC_DIR)

   # Seed common root variables so IOCs that rely on them still expand.
   # These are ?= so any explicit definition in RELEASE wins.
   SUPPORT    ?= /home/controls
   SUPPORTDIR ?= /home/controls
   CONTROLS   ?= /home/controls

   -include $(IOC_DIR)/configure/RELEASE
   -include $(IOC_DIR)/configure/RELEASE.local

   .PHONY: dump
   dump:
   	@$(foreach v,$(.VARIABLES), \
   	    $(if $(filter /home/controls/%,$(value $(v))), \
   	        $(info $(value $(v))) \
   	    ) \
   	)

   .DEFAULT_GOAL := dump
   ```

   Aggregator loop in `epics/build_modules.sh`:
   ```bash
   cat /home/controls/${BEAMLINE}/applications/*/configure/RELEASE* \
       /home/controls/${BEAMLINE}/RELEASE.local* \
     | grep '=' | grep -v '^#' | grep -v '\$(' \
     | tr -d ' \t' | cut -f2 -d= | sort -u
   unique_paths() {
       for iocdir in /home/controls/${BEAMLINE}/applications/${BEAMLINE}-*; do
           [ -f "$iocdir/configure/RELEASE" ] || \
               [ -f "$iocdir/configure/RELEASE.local" ] || continue
           make -f "$HERE/dump-deps.mk" IOC_DIR="$iocdir" dump 2>&1 \
               | grep -oE '^/home/controls/[^[:space:]]+' || true
       done | sort -u
   }
   ```
   This is the *same* aggregation `dependencies.sh` does for a single IOC,
   applied to the whole applications tree.

3. **Topologically order** them. EPICS Base must be first; then the "tier 1"
   modules (asyn, seq, autosave, busy, calc, sscan, streamdevice, ioc-stats,
   generic-io, pvmediator, procservcontrol); then the per-hardware
   specialty modules (galil, lakeshore, ether-ip, modbus, hplc, …); and
   finally nED (which pulls in many of the above anyway). Encode the order
   in `epics/modules.list`.

4. **For each module path:** if the directory already exists with a non-empty
   `bin/linux-x86_64/` (or is `base/` with `bin/linux-x86_64/softIocPVA`),
   **skip**. Otherwise call:

   **Known limitations (document, don't fix):** (a) `$(shell …)` constructs
   aren't evaluated (none observed in the BL4B tree); (b) `ifdef`/`ifeq`
   branches take the default-taken branch under a stock environment, which
   matches how a vanilla build would resolve them; (c) RELEASE files that
   `include` beamline-level files referencing env vars set elsewhere are
   handled if we seed those vars in `dump-deps.mk`, otherwise the unresolved
   variable silently drops from the result. Any deps the helper misses
   surface as a clear error when the IOC itself is built in step (D).

3. **Classify each unique path** to pick the correct `git-release.sh` args.
   `-e` (`--epics`) redirects `PROD` to `$PREFIX/epics`; `--das3` redirects
   `PROD` to `/home/controls/prod/common` and switches the git URL to
   `code.ornl.gov`. **These two flags are mutually exclusive** — each
   targets a different output layout and combining them would produce
   `/home/controls/prod/common/epics/` which matches no IOC's RELEASE
   path. The classifier picks exactly one branch per path:

   | Matched path shape                                           | `git-release.sh` args                              |
   |--------------------------------------------------------------|----------------------------------------------------|
   | `/home/controls/prod/common/<mod>/<tag>`                     | `--das3 -r --all <mod> <tag>`                      |
   | `/home/controls/prod/epics/<mod>/<tag>`                      | `-e -r <mod> <tag>`                                |
   | `/home/controls/prod/<mod>/<tag>` (neither common nor epics) | `-r <mod> <tag>` (default PREFIX=`/home/controls/prod`) |
   | `/home/controls/common/<mod>/<tag>`                          | `--prefix /home/controls/common -r <mod> <tag>`    |
   | `/home/controls/epics/<mod>/<tag>`                           | `-e --prefix /home/controls -r <mod> <tag>`        |
   | anything else                                                | warn, surface to operator, skip                    |

   Classifier implementation in `build_modules.sh`:
   ```bash
   classify() {
       local d="$1" mod tag
       mod=$(basename "$(dirname "$d")")
       tag=$(basename "$d")
       case "$d" in
           /home/controls/prod/common/*/*) echo "--das3 -r --all $mod $tag" ;;
           /home/controls/prod/epics/*/*)  echo "-e -r $mod $tag" ;;
           /home/controls/prod/*/*)        echo "-r $mod $tag" ;;
           /home/controls/common/*/*)      echo "--prefix /home/controls/common -r $mod $tag" ;;
           /home/controls/epics/*/*)       echo "-e --prefix /home/controls -r $mod $tag" ;;
           *) echo "UNHANDLED $d" >&2 ; return 1 ;;
       esac
   }
   ```

   **Note on recursion:** once `git-release.sh -r` starts building a
   top-level module, its internal sub-dependency resolver already handles
   the `-e`/`--das3` distinction per nested dependency (see lines 326, 365
   of `git-release.sh` where `ISEPICS` is detected from the module path).
   The classifier is therefore only for **the top-level call per unique
   path** that the aggregator discovered; transitive dependencies self-route.

   **Latent bug in `/home/controls/share/scripts/dependencies.sh`, noted but
   not fixed here:** that script passes `--das3` when `*common*` is in the
   path but has no corresponding `-e` branch for `*epics*`. Our own
   `build_modules.sh` implements the correct split. If the shared script is
   ever fixed upstream, the bootstrap can switch to calling `dependencies.sh`
   directly and shrink. For now, we deliberately do our own classification.

4. **Topologically order** the classified set. EPICS Base must be first;
   then the "tier 1" modules (asyn, seq, autosave, busy, calc, sscan,
   streamdevice, ioc-stats, generic-io, pvmediator, procservcontrol); then
   per-hardware specialty modules (galil, lakeshore, ether-ip, modbus, hplc,
   …); and finally nED (which pulls in many of the above anyway). Encode
   the order in `epics/modules.list`.

5. **For each path in order:** if the destination directory already exists
   with a non-empty `bin/linux-x86_64/` (or the base-specific sentinel
   `bin/linux-x86_64/softIocPVA`), **skip**. Otherwise run the classifier
   output as arguments to `git-release.sh`:
   ```bash
   git-release.sh --das3 --prefix $(dirname $(dirname $d)) \
       $(basename $(dirname $d)) $(basename $d)
   unique_paths | while read d; do
       test -s "$d/bin/linux-x86_64"/* 2>/dev/null && continue
       eval git-release.sh $(classify "$d")
   done
   ```
   (or `ics-deploy --release --build` in the DAS3 case — the existing
   `dependencies.sh` already picks the right one based on `which`.)

5. **Handle the three already-known failures** (streamdevice, scantools, occ).
6. **Handle the three already-known failures** (streamdevice, scantools, occ).
   Their `build_err_*.txt` files contain only "Cloning into…" which indicates
   a network/auth problem at provision time, not a real build failure. The
   target should:
@@ -396,14 +537,17 @@ once against the aggregate set.
   - If retry fails, surface a clear error pointing at the correct code.ornl.gov
     repo URL for manual debugging

6. **Retry once on authentication error** (HTTP 401) after re-prompting for
7. **Retry once on authentication error** (HTTP 401) after re-prompting for
   password. This is the common failure mode on ORNL trac pushes.

7. **After all modules:** run `git-release.sh -r --all --das3 nED
   rel2.4.10_20241203` exactly once — this is the existing provision command,
   and it will now succeed because all nED's transitive deps are in place.
8. **After all modules:** run `git-release.sh -r --all --das3 nED
   rel2.4.10_20241203` exactly once. This is the existing provision
   command, **moved here** from the top-level `provision` target so it
   runs after all nED transitive deps are in place (§12 Q4 resolution).
   No behavior change for non-vtwin roles because `epics-modules` is
   ROLE-gated (vtwin/epics only).

8. **Verify:** for each module in `modules.list`, check
9. **Verify:** for each module in `modules.list`, check
   `test -d $d/bin/linux-x86_64` (or a module-appropriate sentinel).

**Note on the "modules.list":** I strongly prefer **not** hard-coding the
@@ -630,6 +774,22 @@ Key edges:

Augment the existing `verify` target with these checks:

0. **User/group existence and membership** (by name, not exact UID/GID) —
   implements the §4.5 guard rule:
   ```bash
   getent passwd controls >/dev/null       || { echo FAIL: controls user; exit 1; }
   getent group slowcontrols_developers >/dev/null \
                                           || { echo FAIL: group; exit 1; }
   id -nG controls | tr ' ' '\n' | grep -qx slowcontrols_developers \
                                           || { echo FAIL: membership; exit 1; }
   uid=$(id -u controls)
   gid=$(getent group slowcontrols_developers | cut -d: -f3)
   if [ "$uid" != "88888" ] || [ "$gid" != "43647" ]; then
       echo "NOTE: non-canonical IDs (controls=$uid, slowcontrols=$gid) — tolerated"
   fi
   ```
   On a fresh canonical VM both `uid` and `gid` equal the target values.
   On `bl4b-vtwin1` today a `NOTE:` is emitted but verify still passes.
1. `systemctl is-active` for: `vsmserver vsmagent tlwebaccess tlwebadm
   scan-server epics-controls-env` → all must be `active`.
2. `test -f /var/run/controls.env && grep -q EPICS_BASE /var/run/controls.env`.
@@ -639,6 +799,10 @@ Augment the existing `verify` target with these checks:
   → matches expected IOC count.
5. `ss -tln | grep -E ':(300|904|4501|4810|9000)\b'` → all listeners present.
6. Re-run `make everything-vtwin` → exit 0 with zero file changes reported.
7. **Flag-only checks** (emit `NOTE:` or `WARN:`, do not fail):
   - PS71/PS72 TCP 4571 collision in
     `bl4b-ProcServ-vtwin1/iocBoot/ioc/st.cmd` (§12 Q5 resolution — the
     bootstrap never mutates the beamline repo).

Failures should print the **exact** command to re-run to fix, not just "FAIL".

@@ -683,7 +847,7 @@ Failures should print the **exact** command to re-run to fix, not just "FAIL".
  `dependencies`. Extend `verify`.
- `thinlinc/Makefile` + hconf templates + systemd overrides + README
- `css/Makefile` + README
- `epics/Makefile` + `modules.list` + `build_modules.sh` + README
- `epics/Makefile` + `dump-deps.mk` + `modules.list` + `build_modules.sh` + README
- `iocs/Makefile` + `build_iocs.sh` + `register_iocs.sh` +
  `epics-controls-env.service` + README
- `README.md` — document `everything-vtwin` as the canonical entry point
@@ -734,18 +898,20 @@ change the plan in minor but real ways.
   repo alone (e.g., presence of `/home/controls/${BEAMLINE}/alarms/` or a
   specific config file)?

4. **The existing nED provision line.** The current
   `git-release.sh -r --all --das3 nED rel2.4.10_20241203` in the
   `provision` target silently failed on this machine (leaving an empty
   `rel2.4.10_20241203/` dir). Should the new `epics-modules` target
   **replace** that line, or run *in addition* to it and just add the
   remaining modules? I'd prefer replacement for cleanliness, but that
   changes behavior for non-vtwin roles.

5. **The PS71/PS72 collision** in the pre-populated `st.cmd` — do you want
   me to patch it as part of the bootstrap, or flag it for you to fix in
   the beamline repo? My preference is flag-only; the bootstrap should not
   mutate the beamline repo.
4. ~~**The existing nED provision line.**~~ **Resolved (replacement):** the
   `git-release.sh -r --all --das3 nED rel2.4.10_20241203` line moves from
   the top-level `provision` target into `epics-modules` step 8, where it
   runs after the classifier-driven loop has built all nED transitive
   dependencies. Because `epics-modules` is ROLE-gated (vtwin/epics), the
   behavior for `dassrv1`/`opi`/`daq`/`remote` roles does not change — they
   never ran `epics-modules` in the first place.

5. ~~**The PS71/PS72 collision.**~~ **Resolved (flag-only):** the bootstrap
   does not read or mutate `bl4b-ProcServ-*/iocBoot/*/st.cmd` or any other
   beamline-repo file. The `verify` target emits a `NOTE:` when it detects
   the duplicate 4571 allocation (see §9 check 7). Repairing the substitutions
   / st.cmd is Bogdan's call on the beamline repo, out-of-band from the
   bootstrap.

6. **Passwordless sudo on vtwins.** Is it OK for the bootstrap to assume
   `6ov` (or the running user) has NOPASSWD sudo on a vtwin, so that
@@ -762,6 +928,18 @@ change the plan in minor but real ways.

Begin on the `vtwin` branch of `~/bootstrap`:

0. **Validate the aggregator dry-run** on this machine before touching any
   live state. Ship only `epics/dump-deps.mk` + a `--show-only` mode for
   `epics/build_modules.sh` and run them against
   `/home/controls/bl4b/applications/`. Sanity-check that:
   - every unique `/home/controls/…` path classifies to exactly one row
   - no `UNHANDLED` lines in stderr
   - the aggregated set contains the expected core modules (base, asyn,
     autosave, seq, streamdevice, ioc-stats, …)
   - IOCs that use `$(SUPPORT)`-style variable references do resolve
   This catches classifier / dump-deps bugs while they're cheap (zero
   writes to the filesystem).

1. Create `iocs/epics-controls-env.service` and the `controls-env` target —
   **smallest, safest, highest-value** change. After this, running
   `make controls-env` once on bl4b-vtwin1 gives us the env file and proves
@@ -819,4 +997,15 @@ ioc unregister bl4b-Foo # remove one
ioc start|stop|restart bl4b-Foo
ioc log bl4b-Foo              # paginate log file
ioc console bl4b-Foo          # interactive procServ telnet

# Dry-run the dependency aggregator against a single IOC:
make -f ~/bootstrap/epics/dump-deps.mk \
     IOC_DIR=/home/controls/bl4b/applications/bl4b-Det-nED dump

# Show the full classified set of modules to build, no side effects:
~/bootstrap/epics/build_modules.sh --show-only

# Quick ID-guard check (§4.5):
getent passwd controls ; getent group slowcontrols_developers
id -nG controls | tr ' ' '\n' | grep -x slowcontrols_developers
```