Commit ae1562cd authored by Vacaliuc, Bogdan

tthd analysis: correct zd ownership, flesh out CLAUDE.md dual-Galil map



I initially claimed BL4B:Mot:zd was a decommissioned ghost PV because the
`bl4b-Galil1` st.cmd comments out DMC5 and the substitutions comment out
the zd row. That was wrong: `bl4b-Galil2`'s st.cmd creates DMC5 on the
same 10.112.9.49 IP and serves zd as a live motor. The 1002.289 mm value
used by the tthd forward kinematics is a real, current cross-IOC CA read,
not a frozen legacy value. Report updated:

 - Executive summary fix #3 and the "third issue" section reframe zd from
   "ghost PV" to "split-ownership CA-write leaks" — the reads work, only
   the Galil1 -> zd.STOP/zd_Ctrl writes fail.
 - Priority 4 recommended fix reframed from "restore or decommission zd"
   to "remove Galil1's orphaned CA-write links to zd".
 - Dead-ends section captures the misread so future investigators don't
   make the same inference.

tasking/CLAUDE.md:
 - Fill in the two TODO tables (authoritative files, Galil axis map) with
   both Galil1 and Galil2 contents from the actual st.cmd/substitutions.
 - Add a "cross-IOC quirks" subsection noting that tthd's forward kinematics
   reads zd.RBV from Galil2, and pointing to `dbpr BL4B:Mot:tthd:REQ_pd,4`
   as the only reliable way to see the actual inputs.
 - Correct the Galil command log path (was pointing at a pre-upgrade file
   that has been stale since 2019; the real log is
   /home/controls/var/log/dassrv1/galil_dmc.log, shared by both IOCs).
 - Add summary pointer to tasking/tthd-Motion-Failure-Analysis.md for the
   2026-04-10 incident.

tthd-Motion-Failure-Analysis.pdf regenerated via tasking/pdf-tools/md2pdf.py.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
parent f34e1a23
+59 −11
@@ -52,24 +52,72 @@ Full docs: `setup/docs/sns-archiver-query.md` on `main`.

| Purpose | Path |
|---|---|
| Motor record autosave (VELO, BVEL, BDST, RDBD, RTRY…) | `/home/controls/var/bl4b-Galil1/bl4b-Galil1.sav` + `.sav0/.sav1/.sav2` (rotating) |
| Motor record dated snapshots | `/home/controls/var/bl4b-Galil1/bl4b-Galil1.sav_YYMMDD-hhmmss` (on restart) |
| Motor record pass0 autosave (MRES, ERES, DVAL, OFF) | `/home/controls/var/bl4b-Galil1/bl4b-Galil1_pass0.sav*` |
| Galil command log (every PR/BG/MG/SH per controller) | `/home/controls/var/log/dassrv1/ioc_bl4b-Galil1.log` |
| Galil IOC stdout/stderr | `/home/controls/var/log/dassrv1/ioc_bl4b-Galil1.log` |
| Scan server stdout (has real RBV values in `TimeoutException`) | `/home/controls/var/scan/console.log` |
| Motor record autosave (`Galil1`) — VELO, BVEL, BDST, RDBD, RTRY… | `/home/controls/var/bl4b-Galil1/bl4b-Galil1.sav` + `.sav0/.sav1/.sav2` (rotating) |
| Motor record pass0 autosave (`Galil1`) — MRES, ERES, DVAL, OFF | `/home/controls/var/bl4b-Galil1/bl4b-Galil1_pass0.sav*` |
| Motor record dated snapshots (`Galil1`) — on IOC restart | `/home/controls/var/bl4b-Galil1/bl4b-Galil1.sav_YYMMDD-hhmmss` |
| Motor record autosave (`Galil2`) | `/home/controls/var/bl4b-Galil2/bl4b-Galil2.sav` + rotating |
| Motor record pass0 autosave (`Galil2`) | `/home/controls/var/bl4b-Galil2/bl4b-Galil2_pass0.sav*` |
| **Galil command log** — every PR/BG/MG/SH on every controller, **shared by both IOCs** | `/home/controls/var/log/dassrv1/galil_dmc.log` (from `GALIL_DEBUG_FILE` env in `st.cmd`) |
| Galil1 IOC stdout (Encoder-stall, move-begin-failure, CA Link Exceptions) | `/home/controls/var/log/ioc_bl4b-Mot-Galil1.log` |
| Galil2 IOC stdout | `/home/controls/var/log/ioc_bl4b-Galil2.log` |
| Scan server stdout (real RBV values in `TimeoutException`) | `/home/controls/var/scan/console.log` |
| Scan device definitions (tolerances, timeouts) | `/home/controls/bl4b/python/scantools/devices.py` |
| Motor substitutions | `/home/controls/bl4b/applications/bl4b-Galil1/bl4b-Galil1App/Db/bl4b-Galil1.substitutions` |
| Motor substitutions — `Galil1` | `/home/controls/bl4b/applications/bl4b-Galil1/bl4b-Galil1App/Db/bl4b-Galil1.substitutions` |
| Motor substitutions — `Galil2` | `/home/controls/bl4b/applications/bl4b-Galil2/bl4b-Galil2App/Db/bl4b-Galil2.substitutions` |
| IOC startup — `Galil1` | `/home/controls/bl4b/applications/bl4b-Galil1/iocBoot/iocbl4b-Galil1/st.cmd` |
| IOC startup — `Galil2` | `/home/controls/bl4b/applications/bl4b-Galil2/iocBoot/iocbl4b-Galil2/st.cmd` |

TODO: The above table needs to be expanded by incorporating the 2nd Galil IOC at '/home/controls/bl4b/applications/bl4b-Galil2/'
Note: the old path `/home/controls/var/log/dassrv1/ioc_bl4b-Galil1.log` is a
pre-RHEL9-upgrade artefact (last-written 2019). The live IOC stdout is at
`/home/controls/var/log/ioc_bl4b-Mot-Galil1.log`, written to by the current
`procServ` wrapper.

### Galil controller-to-axis mapping
### Galil controller-to-axis mapping (two IOCs, overlapping numbering)

BL4B uses **two** Galil IOCs. Each creates different controllers on the same
beamline network (10.112.9.x). DMC5 (10.112.9.49) is *commented out* in
`bl4b-Galil1`'s st.cmd but is *active* in `bl4b-Galil2` — so `BL4B:Mot:zd`
and siblings are live motors served by Galil2, not ghosts. Always consult
both st.cmd files when mapping a PV to its controller.

#### `bl4b-Galil1` IOC

| Controller | IP | Notable axes |
|---|---|---|
| DMC1 | 10.112.9.45 | A=s1t, B=s1b, C=s1l, D=s1r, E=s2b, F=s2t, G=s2l, H=s2r |
| DMC2 | 10.112.9.46 | A–D = s3 (t/b/r/l), E–H = s4 (t/b/r/l) |
| DMC3 | 10.112.9.47 | A–D = incident slits (sib/sit/sir/sil), **E = hs** (Sample Height) |
| DMC4 | 10.112.9.48 | A=zm, B=chim, C=thm, D=xi, E=ysc, **F=p_d**, **G=tthd_enc**, H=tphd |

#### `bl4b-Galil2` IOC

TODO: This table needs to be constructed by inspection of /home/controls/bl4b/applications/bl4b-Galil1/iocBoot/iocbl4b-Galil1/st.cmd
and /home/controls/bl4b/applications/bl4b-Galil2/iocBoot/iocbl4b-Galil2/st.cmd
| Controller | IP | Notable axes |
|---|---|---|
| DMC5 | 10.112.9.49 | D=p_i, E=zi, **F=zd** (used by `tthd` formula on Galil1 via CA), G=att |
| DMC6 | 10.112.9.44 | A=xs, B=ys, C=chis, D=ths, E=zs |

#### Cross-IOC quirks to watch for

* **`BL4B:Mot:zd.RBV`** is read by `bl4b-Galil1`'s `virtual_pdLift.template`
  (for the `tthd` forward kinematics) from `bl4b-Galil2` over CA. If you see
  a suspicious `zd` value in a tthd calc, `dbpr BL4B:Mot:tthd:REQ_pd,4` on
  the running Galil1 IOC is the quickest way to see the actual input value;
  inspecting the `Galil1` substitutions alone is misleading because they
  comment out `zd`.
* **`BL4B:Mot:zd.STOP` and `BL4B:Mot:zd_Ctrl` CA writes from Galil1 fail
  continuously** (log noise in `ioc_bl4b-Mot-Galil1.log` since at least
  2026-03-25). They originate from `theta_out:Abort` and `ctrl.template`'s
  `zd_toggle`. Harmless for normal operation; cleanup is a separate task.

### The `tthd` virtual-motor retry-runaway (incident 2026-04-10)

See `tasking/tthd-Motion-Failure-Analysis.md` for the full analysis. Summary:
`virtual_pdLift.template`'s `NewPD1` retry record applies `(Target − Readback)`
as a correction to the forward-kinematic `tthdCalc` without verifying the
physical encoder actually moved. When the `p_d` stage encoder-stalls, each
retry adds ~6.2° to `tthdCalc` and ~119 mm to `REQ_p_d`. On the 4th retry the
stall transient cleared, the Galil began executing the (now huge) move, and
`p_d` ran up at 2 mm/s for 3.5 minutes before the operator could hit Stop All.
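
The accumulation described above can be illustrated with a toy model. This is a hedged sketch, not the EPICS database logic: the function name `retry_corrections`, the `guard` and `min_motion` parameters, and the sample numbers (a readback stuck 6.2° short of target) are assumptions chosen to mirror the figures quoted in the summary.

```python
def retry_corrections(target, readbacks, guard=False, min_motion=1e-3):
    """Accumulate (target - readback) corrections over successive retries.

    readbacks:  readback seen before each retry (hypothetical data).
    guard:      if True, stop retrying when the readback moved by less than
                min_motion since the previous attempt (motor did not move).
    Returns the commanded value after each retry that was allowed to run.
    """
    commanded = target
    history = []
    previous = None
    for rbv in readbacks:
        if guard and previous is not None and abs(rbv - previous) < min_motion:
            # No physical motion between attempts: treat as a motor failure
            # rather than a calibration error, and refuse to accumulate.
            break
        commanded += target - rbv
        history.append(commanded)
        previous = rbv
    return history


# Stalled stage: readback never changes, so every retry adds the same 6.2.
unguarded = retry_corrections(10.0, [3.8] * 4)
guarded = retry_corrections(10.0, [3.8] * 4, guard=True)
```

Without the guard the commanded value grows by the full residual on every retry; with it, the loop gives up after the first retry shows no motion. Where such a check would live in `virtual_pdLift.template` is not something this sketch tries to answer.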

## Secure Temporary Files

+84 −49
@@ -26,7 +26,7 @@ The fix is at three layers, in priority order:

1. **Make `virtual_pdLift.template` distinguish "calibration is slightly off" from "motor failed to move"** — refuse to accumulate corrections when `tthd_enc.RBV` did not change between the previous and the current attempt.
2. **Investigate the chronic Encoder-stall on `p_d` (motor F)** — bursts on 2026-03-27, 2026-03-30, and 2026-04-10. This is a hardware/mechanical issue independent of the runaway.
3. **Restore or cleanly remove `BL4B:Mot:zd`** — DMC5 was decommissioned but `tthd` still reads `zd.RBV` for its forward kinematics, currently frozen at 1002.289 mm. That number is used in every `REQ_p_d` calculation and is not visible to anyone reviewing the file structure.
3. **Disentangle the split ownership of `BL4B:Mot:zd`** — DMC5 (which hosts `zd`) was *commented out* of `bl4b-Galil1`'s startup, but the motor itself is actually alive on `bl4b-Galil2`'s DMC5 (also IP `10.112.9.49`). So `zd.RBV` is a real, live readback (1002.289 mm, stationary since 2026-03-24). What *is* broken is that the `bl4b-Galil1` IOC keeps trying to CA-write `zd.STOP` and `zd_Ctrl` and keeps failing (hundreds of `DB CA Link Exception` entries in the Galil1 log). Those failing writes are harmless for this incident but are persistent log noise and suggest the `bl4b-Galil1` template set should be cleaned up to match the real ownership.

---

@@ -117,11 +117,17 @@ that moves the arm is `p_d` (linear stage on DMC4 axis F).
| `p_d.OFF` | 139.157 mm | autosave |
| `p_d.DHLM / DLLM` | **2087.77 / 1460.55 mm** | autosave |
| `p_d.DVAL` (resting) | 1524.39 mm | pass0 (later state, after recovery) |
| `BL4B:Mot:zd.RBV` | **1002.289 mm — frozen** | live `dbpr` from 2026-04-12 20:16 |
| `BL4B:Mot:zd.RBV` | **1002.289 mm (live, stationary since 2026-03-24)** | bl4b-Galil2 IOC, confirmed in archiver |

---

## Galil controller / axis map (from `iocBoot/iocbl4b-Galil1/st.cmd`)
## Galil controller / axis map — two IOCs share the DMC numbering

Two IOCs serve the BL4B Galil controllers. All of the controllers and their
axes live on the beamline's private controls network (10.112.9.x); which IOC
*owns* an axis depends on which `st.cmd` creates it.

### `bl4b-Galil1` IOC — `iocBoot/iocbl4b-Galil1/st.cmd`

| Controller | IP | Axes (notable) |
|---|---|---|
@@ -129,11 +135,19 @@ that moves the arm is `p_d` (linear stage on DMC4 axis F).
| DMC2 | 10.112.9.46 | A–D=s3 (tblr), E–H=s4 (tblr) |
| DMC3 | 10.112.9.47 | A–D=sib/sit/sir/sil, **E=hs** (Sample Height), F–H unused |
| DMC4 | 10.112.9.48 | A=zm, B=chim, C=thm, D=xi, E=ysc, **F=p_d**, **G=tthd_enc**, H=tphd |
| ~~DMC5~~ | ~~10.112.9.49~~ | **commented out — `zd` no longer exists** |
| ~~DMC5~~ | ~~10.112.9.49~~ | commented out in **this** IOC — but the motors are served by Galil2 (see below) |

### `bl4b-Galil2` IOC — `iocBoot/iocbl4b-Galil2/st.cmd`

| Controller | IP | Axes (notable) |
|---|---|---|
| DMC5 | 10.112.9.49 | D=p_i (incident arm translation), E=zi, **F=zd** (detector height), G=att |
| DMC6 | 10.112.9.44 | A=xs, B=ys, C=chis, D=ths, E=zs |

The error strings reported by the user map cleanly:
* `move begin failure axis E` + `axis=4` ⇒ DMC3 axis E = `hs` (Sample Height).
* `Encoder stall stop motor F` ⇒ DMC4 axis F = `p_d` (detector arm).
* `move begin failure axis E` + `axis=4` ⇒ DMC3 axis E = `hs` (Sample Height), on bl4b-Galil1.
* `Encoder stall stop motor F` ⇒ DMC4 axis F = `p_d` (detector arm), on bl4b-Galil1.
* The runaway formula's `INPB` (`zd.RBV`) resolves to DMC5 axis F = `zd`, on bl4b-Galil2 — i.e. the virtual motor on Galil1 reads from a different IOC over EPICS CA.
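
The lookups in the list above can be sketched as a small table. This is illustrative only: `AXIS_MAP`, `axis_letter`, and `lookup` are hypothetical names, the table holds only the axes spelled out explicitly in the two maps, and the zero-based reading of `axis=4` is inferred from the `axis E` + `axis=4` pairing in the error string.

```python
# Axes transcribed from the controller tables (only entries listed by letter).
AXIS_MAP = {
    ("bl4b-Galil1", "DMC3"): {"E": "hs"},
    ("bl4b-Galil1", "DMC4"): {
        "A": "zm", "B": "chim", "C": "thm", "D": "xi",
        "E": "ysc", "F": "p_d", "G": "tthd_enc", "H": "tphd",
    },
    ("bl4b-Galil2", "DMC5"): {"D": "p_i", "E": "zi", "F": "zd", "G": "att"},
    ("bl4b-Galil2", "DMC6"): {
        "A": "xs", "B": "ys", "C": "chis", "D": "ths", "E": "zs",
    },
}


def axis_letter(index):
    """Convert a zero-based axis index (e.g. axis=4) to its letter (E)."""
    return "ABCDEFGH"[index]


def lookup(controller, letter):
    """Return (ioc, motor) for a controller/axis pair, or None if unlisted."""
    for (ioc, dmc), axes in AXIS_MAP.items():
        if dmc == controller and letter in axes:
            return ioc, axes[letter]
    return None
```

For example, `lookup("DMC3", axis_letter(4))` resolves to `hs` on `bl4b-Galil1`, while `lookup("DMC5", "F")` resolves to `zd` on `bl4b-Galil2`.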

---

@@ -266,7 +280,7 @@ record(calcout, "$(VM):NewPD1") {

The retry assumes the *only* reason the readback is not at the target is that
the kinematic constants (`constL_d`, `constA_d`, `constB_d`, `constRO_A`,
`constRO_B`) — combined with the (frozen) `zd.RBV` — are slightly off, so the
`constRO_B`) — combined with the current `zd.RBV` — are slightly off, so the
tthdCalc value sent through the formula needs to be nudged. With those
assumptions the iteration converges quickly: every cycle the readback moves
toward target, the residual shrinks, and at most a handful of refinements get
@@ -308,15 +322,21 @@ narrative the user hit Stop All, then re-clicked Move Align — and the
sequencer happily started the runaway again. This is what produced burst 2
identically to burst 1.

### A third (less critical, but still real) issue: stale `BL4B:Mot:zd.RBV`
### A third (cosmetic but real) issue: `zd` has split ownership between the two Galil IOCs

`REQ_pd` reads `INPB = $(S):Mot:$(M2).RBV NPP` where `$(M2) = zd`. DMC5 was
removed from `iocBoot/iocbl4b-Galil1/st.cmd` (line 38, commented out) and the
`zd` substitution row (`bl4b-Galil1.substitutions:45`) is also commented out.
The `BL4B:Mot:zd` PV is still served by *something* (likely the
`bl4b-Parker1` or another retired IOC that still has a soft record), and its
`.RBV` is permanently **1002.289 mm** — the value confirmed by `dbpr
BL4B:Mot:tthd:REQ_pd,4` from the live IOC on 2026-04-12 at 20:16:59:
`REQ_pd` reads `INPB = $(S):Mot:$(M2).RBV NPP` where `$(M2) = zd`. The `zd`
row in `bl4b-Galil1.substitutions:45` is commented out, and the
`GalilCreateController("DMC5", ...)` call in
`iocBoot/iocbl4b-Galil1/st.cmd:38` is also commented out. On the face of
it, this looks like `zd` is a retired / ghost PV. It is not.

The `bl4b-Galil2` IOC's `st.cmd` actually creates DMC5 on `10.112.9.49`
(the same physical controller the Galil1 startup disclaims) and its
substitutions file defines `zd` on DMC5 axis F along with `p_i` on D, `zi`
on E, and `att` on G. So `BL4B:Mot:zd` is a **live motor record served by
`bl4b-Galil2`**, with a recent `.RBV` of 1002.289 mm that has been
stationary since `2026-03-24T07:29:53`. This was confirmed by a live
`dbpr BL4B:Mot:tthd:REQ_pd,4` from 2026-04-12 20:16:59:

```
B   : 1002.289      ALST: 1776.95309410205
@@ -327,13 +347,13 @@ E : 468.91 ← constA_d
CALC: ((B-C*SIN(A-D))**2+(C*COS(A-D)-E)**2)**0.5
```
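
For checking a `dbpr` dump by hand, the `CALC` expression can be re-evaluated offline. The sketch below is a direct transcription of the expression into Python; the function name `req_pd` is hypothetical, and it assumes `A` and `D` are already in radians (EPICS calc trig functions take radians). Only `B` (1002.289) and `E` (468.91, `constA_d`) appear in the excerpt above, so the other inputs must come from the live record.

```python
import math


def req_pd(a, b, c, d, e):
    """Evaluate ((B-C*SIN(A-D))**2+(C*COS(A-D)-E)**2)**0.5.

    a, d: angles in radians (assumption, see lead-in).
    b, c, e: lengths in the same unit as the result.
    """
    angle = a - d
    return math.hypot(b - c * math.sin(angle), c * math.cos(angle) - e)
```

Feeding in the `A` through `E` values printed by `dbpr BL4B:Mot:tthd:REQ_pd,4` and comparing the result with the record's own output is a quick way to confirm which inputs the calc is actually using.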

If the geometry actually does still depend on `zd`, then driving `tthd` is
running open-loop on a number that no one is updating. If it doesn't, then
the formula is carrying around a 1002 mm constant that nobody knows about.
Either way it deserves cleanup.
So the read path of the kinematic formula is **correct**: `zd.RBV` is a
real current reading from Galil2. The runaway is *not* caused by a stale
input; it is caused purely by the retry-without-motion-verification loop
analysed above.

The `IOC` log corroborates that `zd` has been throwing CA-link exceptions
continuously since at least 2026-03-25:
The write path, on the other hand, is leaky. The `Galil1` IOC log has been
throwing CA-link exceptions continuously since at least 2026-03-25:

```
Wed Mar 25 15:30:26 2026  DB CA Link Exception: ... context "BL4B:Mot:zd"
@@ -342,10 +362,16 @@ Wed Mar 25 15:30:27 2026 DB CA Link Exception: ... context "BL4B:Mot:zd_Ctrl"
... (recurring every few hours)
```

These come from the `theta_out:Abort` and `ctrl.template` `zd_toggle` records,
which both still try to `CA`-write `BL4B:Mot:zd.STOP` and `BL4B:Mot:zd_Ctrl`
that no longer have a writeable backing. The reads (`zd.RBV`) succeed
silently because some other IOC is still publishing the value.
These come from the `theta_out:Abort` and `ctrl.template` `zd_toggle`
records loaded by `Galil1`, which try to CA-write `BL4B:Mot:zd.STOP` and
`BL4B:Mot:zd_Ctrl` on `Galil2`. The `zd.STOP` target exists (it's a field
on the real `Galil2` motor record) but the CA write from `Galil1` is
failing; `zd_Ctrl` may or may not exist as a writeable PV depending on
how `ctrl.template` was instantiated historically. Neither write matters
for this incident — they would only matter on a Stop All abort where we'd
want to stop `zd` too — but the log noise is a symptom of a template set
on `Galil1` that no longer matches the real ownership (now split across
two IOCs). Cleanup is a separate, non-urgent task.
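
To put a number on the log noise, the exceptions can be tallied per target PV. This is a minimal sketch that assumes every relevant entry carries the `DB CA Link Exception: ... context "<PV>"` shape shown in the excerpt above; the elided middle of each line is unknown, so the pattern skips over it. `count_link_exceptions` is a hypothetical helper name.

```python
import re
from collections import Counter

# Match the PV name inside context "..." on a link-exception line.
LINK_EXC = re.compile(r'DB CA Link Exception:.*context "([^"]+)"')


def count_link_exceptions(lines):
    """Count CA link exceptions per context PV in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = LINK_EXC.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    'Wed Mar 25 15:30:26 2026  DB CA Link Exception: ... context "BL4B:Mot:zd"',
    'Wed Mar 25 15:30:27 2026  DB CA Link Exception: ... context "BL4B:Mot:zd_Ctrl"',
    'Wed Mar 25 15:30:28 2026  unrelated message',
]
tally = count_link_exceptions(sample)
```

Passing an open handle on `/home/controls/var/log/ioc_bl4b-Mot-Galil1.log` instead of `sample` would give per-PV totals for the live log.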

---

@@ -489,21 +515,21 @@ Diagnostic candidates:
* **Mechanical resistance / cable tray** — the detector arm is heavy and rides on an air pad; if pressure drops or a cable snags, momentary stall is plausible.
* **Galil controller temperature / power-cycle history** — the controller has been replaced before on BL4A for similar behaviour; rule out controller hardware first.

### Priority 4 — Decommission or restore `BL4B:Mot:zd` cleanly
### Priority 4 — Clean up `Galil1`'s CA-write links to `zd`

Decision required: does the detector arm geometry still depend on a vertical
adjustment `zd`? Two options:
`zd` is a live motor on `bl4b-Galil2` (DMC5 axis F), not a ghost — the kinematic
formula reads its `.RBV` correctly. But the `Galil1` IOC is still trying to
CA-write `zd.STOP` and `zd_Ctrl` from `theta_out:Abort` and the `zd_toggle`
`ctrl.template` instance, and those writes fail continuously (log noise since
at least 2026-03-25). Two follow-ups:

* **If yes** (i.e. the height adjustment still exists mechanically, just on a
  different controller / readout): wire `INPB` of `tthd:REQ_pd` to whatever
  PV reads the actual current value, and remove the dependency on the dead
  `BL4B:Mot:zd.RBV`.
* **If no** (the geometry no longer changes in `zd` because the height
  is fixed): bake the `1002.289` value into the kinematic formula directly
  (or into a new ASG=BEAMLINE constant `tthd:const_zd_fixed`) and remove
  `INPB` entirely. Then delete the orphaned `BL4B:Mot:zd*` records and the
  `theta_out:Abort` link to `Mot:zd.STOP`, which will also clean up the
  CA-link exceptions in the IOC log.
* **Verify** that a real Stop All sequence reaches `zd` via whatever path the
  operators actually trust — not the broken cross-IOC CA writes from Galil1.
  If operators think Stop All stops `zd`, confirm that claim (or fix it).
* **Remove the orphaned links** from `Galil1`'s template instantiations of
  `tthd-virtual.template` and `ctrl.template`. They refer to a local `zd`
  that no longer exists on this IOC; the CA failures are harmless today but
  mask real write failures you might care about later.

### Priority 5 — Sanity bound on `REQ_p_d`

@@ -530,9 +556,10 @@ physical motor.
   matches the DPF-based reconstruction of motion (1663 mm → 2085 mm in
   3 m 33 s = 213 s; 422.45 mm / 213 s = 1.98 mm/s ≈ 2 mm/s).
4. **Software-stack**. `tthd:Status` was archived as STATUS_39 (LINK_ALARM)
   throughout, consistent with the broken `BL4B:Mot:zd` CA links — the
   alarm severity was MAJOR but the calc still *evaluated* (because reads
   succeeded), so the runaway wasn't blocked by alarm-state guarding.
   throughout, consistent with the (harmless) `Galil1 → zd.STOP` cross-IOC
   CA-write failures. The alarm severity was MAJOR but the calc still
   *evaluated* (because the CA read of `zd.RBV` from Galil2 succeeds), so
   the runaway wasn't blocked by alarm-state guarding.

---

@@ -548,8 +575,15 @@ physical motor.
* My second hypothesis was that **bad zd readback** alone caused the
  unexpected motion. It does not — the same formula with the same `zd =
  1002.289` produces the *correct* `REQ_p_d = 1777.08` for the first
  attempt. The runaway is downstream of the formula, in the retry loop. zd
  is a separate cleanup item, not the proximate cause.
  attempt. The runaway is downstream of the formula, in the retry loop.

* I also initially misread the system as having `zd` **decommissioned**
  (because it was commented out in `bl4b-Galil1`'s `st.cmd` and
  `substitutions`). That was wrong: `bl4b-Galil2`'s `st.cmd` actually hosts
  DMC5 on the same IP (`10.112.9.49`) and serves `BL4B:Mot:zd` as a live
  motor. Always cross-check a "decommissioned" claim against the sibling
  IOC's startup — BL4B has two Galil IOCs with overlapping controller
  numbering.

* The `hs` failures (red-light, BGE failures on axis E) consumed a lot of
  attention in the email thread but they are an independent, separate
@@ -566,12 +600,13 @@ physical motor.
  include `tthd:REQ_p_d` so the retry sequence was invisible from his
  vantage point.

* The frozen `BL4B:Mot:zd.RBV = 1002.289` would be very easy to miss in
  routine inspection. The number only surfaces by `dbpr BL4B:Mot:tthd:REQ_pd`
  on the live IOC — *not* by reading the substitutions or DB files (which
  reference a `zd` that no longer exists in the local IOC). Keep this in
  mind for future BL4B virtual-motor investigations: always `dbpr` the
  composite calcouts to check what their inputs are actually evaluating to.
* The cross-IOC dependency on `zd.RBV = 1002.289` would be very easy to
  miss in routine inspection. The number only surfaces by `dbpr
  BL4B:Mot:tthd:REQ_pd` on the live IOC — *not* by reading the
  `bl4b-Galil1` substitutions or DB files (which refer to a commented-out
  `zd`). Keep this in mind for future BL4B virtual-motor investigations:
  always `dbpr` the composite calcouts to see which PVs they are actually
  resolving to across IOCs.

---

+8.31 KiB (117 KiB)

File changed.

No diff preview for this file type.