Commit 6ddbf5a9 authored by Vacaliuc, Bogdan's avatar Vacaliuc, Bogdan
Browse files

tthd analysis: incorporate Galil2 deep-dive — thi already has the fix



The user pushed back on whether bl4b-Galil2 changed anything. Going
deeper: yes, in two important ways.

1) Galil2 is actively running (command log updated 09:02 today, 322 MB
   total) and writes to a SEPARATE command log
   (/home/controls/var/log/bl4b-Galil2.log), not the shared file I
   incorrectly described in tasking/CLAUDE.md last commit. Fixed.

2) BL4B has TWO parallel virtual angular motors driving heavy arms via
   linear stages: tthd (Galil1, detector arm via p_d) and thi (Galil2,
   incident arm via p_i). They were originally implemented with the
   same general design, but Matthew Pearson rewrote the thi side in
   January 2026 (RedMine #4822, virtual_thi_to_pi.template) to remove
   the retry runaway. tthd was left in the 2018-era
   virtual_pdLift.template. The thi rewrite has no NewPD1, no MaxTries,
   no second-motor-RBV dependency in the formula — just a one-shot
   law-of-cosines calc that commits to the physical motor and lets the
   motor record's own RTRY mechanism handle convergence.

Report changes:
 - Executive summary fix #1 reframed: "retire virtual_pdLift.template"
   instead of "make NewPD1 distinguish stall from miscalibration", with
   pointer to the existing thi rewrite.
 - New "Two parallel virtual-motor stacks" subsection in System
   Architecture comparing the two templates side-by-side.
 - Recommended Fix 1 split into 1a (minimal patch — same as before, as
   stopgap) and 1b (full rewrite modeled on virtual_thi_to_pi.template).
   1b includes concrete steps and a note that bl4b-VMotCalc already has
   a tthdFake harness instance for safe pre-deployment testing.
 - "Files referenced" expanded to include both Galil1 and Galil2
   template paths and command logs.
 - Dead-ends adds the lesson: when investigating template-driven
   behavior, grep for sibling template instantiations before designing
   from scratch — the pattern may already be in the tree.

tasking/CLAUDE.md:
 - Correct the shared-command-log error from prior commit.
 - Add "Two parallel virtual-motor stacks" subsection to the cross-IOC
   quirks block, pointing future tthd-investigators at thi as the
   reference design.

PDF regenerated.

Co-Authored-By: default avatarClaude Opus 4.6 (1M context) <noreply@anthropic.com>
parent c0fb7a94
Loading
Loading
Loading
Loading
+23 −1
Original line number Diff line number Diff line
@@ -57,7 +57,8 @@ Full docs: `setup/docs/sns-archiver-query.md` on `main`.
| Motor record dated snapshots (`Galil1`) — on IOC restart | `/home/controls/var/bl4b-Galil1/bl4b-Galil1.sav_YYMMDD-hhmmss` |
| Motor record autosave (`Galil2`) | `/home/controls/var/bl4b-Galil2/bl4b-Galil2.sav` + rotating |
| Motor record pass0 autosave (`Galil2`) | `/home/controls/var/bl4b-Galil2/bl4b-Galil2_pass0.sav*` |
| **Galil command log** — every PR/BG/MG/SH on every controller, **shared by both IOCs** | `/home/controls/var/log/dassrv1/galil_dmc.log` (from `GALIL_DEBUG_FILE` env in `st.cmd`) |
| Galil1 command log — every PR/BG/MG/SH on DMC1-DMC4 (env `GALIL_DEBUG_FILE` in `st.cmd`) | `/home/controls/var/log/dassrv1/galil_dmc.log` |
| Galil2 command log — every PR/BG/MG/SH on DMC5-DMC6 (separate file from Galil1) | `/home/controls/var/log/bl4b-Galil2.log` |
| Galil1 IOC stdout (Encoder-stall, move-begin-failure, CA Link Exceptions) | `/home/controls/var/log/ioc_bl4b-Mot-Galil1.log` |
| Galil2 IOC stdout | `/home/controls/var/log/ioc_bl4b-Galil2.log` |
| Scan server stdout (real RBV values in `TimeoutException`) | `/home/controls/var/scan/console.log` |
@@ -109,6 +110,24 @@ both st.cmd files when mapping a PV to its controller.
  2026-03-25). Originated from `theta_out:Abort` and `ctrl.template`'s
  `zd_toggle`. Harmless for normal operation; cleanup is a separate task.

#### Two parallel virtual-motor stacks — `tthd` is the un-rewritten one

BL4B has two structurally similar virtual angular motors that drive a
linear stage to position a heavy arm:

| | `tthd` (detector arm) | `thi` (incident arm) |
|---|---|---|
| Lives on | `bl4b-Galil1` | `bl4b-Galil2` |
| Composite-motor sequencer template | `tthd-virtual.template` (theta_out) | `thi-virtual.template` (theta_in) |
| Per-axis kinematic template | `virtual_pdLift.template` (2018, John Ankner) | **`virtual_thi_to_pi.template`** (2026-01, Matthew Pearson, RedMine #4822) |
| Has `NewPD1` retry runaway? | **Yes** | No — rewritten |
| Drives | `p_d` (DMC4 axis F, on Galil1) | `p_i` (DMC5 axis D, on Galil2) |

When investigating `tthd`-related issues, **always look at `thi`'s
template too** — it is the existing reference design that the `tthd`
rewrite should follow. Don't design a fix from scratch when the pattern
is already in the repo.

### The `tthd` virtual-motor retry-runaway (incident 2026-04-10)

See `tasking/tthd-Motion-Failure-Analysis.md` for the full analysis. Summary:
@@ -118,6 +137,9 @@ physical encoder actually moved. When the `p_d` stage encoder-stalls, each
retry adds ~6.2° to `tthdCalc` and ~119 mm to `REQ_p_d`. On the 4th retry the
stall transient cleared, the Galil began executing the (now huge) move, and
`p_d` ran up at 2 mm/s for 3.5 minutes before the operator could hit Stop All.
**Recommended fix** — emulate `virtual_thi_to_pi.template` (above): retire
the retry loop entirely and let the physical motor's `RTRY` mechanism handle
on-target convergence.

## Secure Temporary Files

+115 −14
Original line number Diff line number Diff line
@@ -24,9 +24,9 @@ On one of those repeated retries the encoder stall transient cleared, the Galil

The fix is at three layers, in priority order:

1. **Make `virtual_pdLift.template` distinguish "calibration is slightly off" from "motor failed to move"** — refuse to accumulate corrections when `tthd_enc.RBV` did not change between the previous and the current attempt.
1. **Retire `virtual_pdLift.template` and replace it with a tthd-equivalent of `virtual_thi_to_pi.template`.** BL4B *already* has the right pattern in the tree: Matthew Pearson's Jan-2026 `thi` rewrite (RedMine #4822, loaded by `bl4b-Galil2`) replaced the old `virtual_piLift` template with a one-shot kinematic calc that has **no retry loop at all** — the physical motor handles its own retries via the normal motor-record `RTRY` mechanism. `tthd` was simply left behind in the 2018-era template with the `NewPD1` retry record that causes the runaway. A minimal-patch version of this fix (add a "did the readback move?" guard to `NewPD1`) is given later as Recommended Fix 1a as a stopgap; the full template rewrite is Recommended Fix 1b.
2. **Investigate the chronic Encoder-stall on `p_d` (motor F)** — bursts on 2026-03-27, 2026-03-30, and 2026-04-10. This is a hardware/mechanical issue independent of the runaway.
3. **Disentangle the split ownership of `BL4B:Mot:zd`** — DMC5 (which hosts `zd`) was *commented out* of `bl4b-Galil1`'s startup, but the motor itself is actually alive on `bl4b-Galil2`'s DMC5 (also IP `10.112.9.49`). So `zd.RBV` is a real, live readback (1002.289 mm, stationary since 2026-03-24). What *is* broken is that the `bl4b-Galil1` IOC keeps trying to CA-write `zd.STOP` and `zd_Ctrl` and keeps failing (hundreds of `DB CA Link Exception` entries in the Galil1 log). Those failing writes are harmless for this incident but are persistent log noise and suggest the `bl4b-Galil1` template set should be cleaned up to match the real ownership.
3. **Clean up `bl4b-Galil1`'s broken CA writes to `zd`.** `BL4B:Mot:zd` is a live motor on `bl4b-Galil2` DMC5 axis F (not decommissioned — the commented-out DMC5 block in Galil1's `st.cmd` is misleading). The kinematic formula's *read* of `zd.RBV` works correctly over cross-IOC CA. But Galil1's `theta_out:Abort` and `ctrl.template` instance try to CA-write `zd.STOP` and `zd_Ctrl` continuously and fail (hundreds of `DB CA Link Exception` entries). Harmless for this incident, but persistent log noise that masks other write failures.

---

@@ -144,11 +144,37 @@ live on the beamline's private controls network (10.112.9.x); which IOC
| DMC5 | 10.112.9.49 | D=p_i (incident arm translation), E=zi, **F=zd** (detector height), G=att |
| DMC6 | 10.112.9.44 | A=xs, B=ys, C=chis, D=ths, E=zs |

`bl4b-Galil2` is actively running (its command log
`/home/controls/var/log/bl4b-Galil2.log` was updated as recently as
2026-04-13 09:02 — note: this is a *separate* command log from Galil1's
`/home/controls/var/log/dassrv1/galil_dmc.log`; each Galil IOC writes its
own `GALIL_DEBUG_FILE`).

The error strings reported by the user map cleanly:
* `move begin failure axis E` + `axis=4` ⇒ DMC3 axis E = `hs` (Sample Height), on bl4b-Galil1.
* `Encoder stall stop motor F` ⇒ DMC4 axis F = `p_d` (detector arm), on bl4b-Galil1.
* The runaway formula's `INPB` (`zd.RBV`) resolves to DMC5 axis F = `zd`, on bl4b-Galil2 — i.e. the virtual motor on Galil1 reads from a different IOC over EPICS CA.

### Two parallel virtual-motor stacks — only one was rewritten

The two Galil IOCs each provide a virtual angular motor that drives a
linear stage to position a heavy arm. They were originally implemented
with the same general design, but the `thi` side was **rewritten in
January 2026** to remove the retry runaway, while `tthd` was left in the
2018-era design.

| Concern | `tthd` (Galil1) | `thi` (Galil2) |
|---|---|---|
| Outermost composite motor | `BL4B:Mot:theta_out` (`tthd-virtual.template`) — coordinates `tthd` + `zd` + `thi` checks | `BL4B:Mot:theta_in` (`thi-virtual.template`) — coordinates `thi` + `zi` |
| Per-axis virtual motor template | `virtual_pdLift.template` (≈790 lines, 2018, John Ankner geometry) | `virtual_thi_to_pi.template` (≈300 lines, **2026-01, Matthew Pearson, RedMine #4822**) |
| Has `NewPD1` retry loop? | **Yes** (this is the runaway) | **No** |
| Has `MaxTries`/`NumTries`? | **Yes** (5) | **No** |
| Forward kinematic uses second motor's RBV? | **Yes** (`zd.RBV`) | **No** (4 pure constants) |
| Where does retry on missed target happen? | Virtual motor re-runs the formula with adjusted `tthdCalc` | Physical motor's own `RTRY` mechanism |

The thi rewrite is the existing pattern that the tthd rewrite should
follow — see Recommended Fix 1b.

---

## The Specific Failure — Timeline by Wall Clock
@@ -452,10 +478,12 @@ Move Align attempts — but the `hs` failure mode itself is *separate* from the

## Recommended fixes — by priority

### Priority 1 — Make the virtual motor refuse to retry when the readback didn't move
### Priority 1a — Minimal patch: make the virtual motor refuse to retry when the readback didn't move

Add a "did the readback actually change?" check in the sequence between
`CheckIfThere` and `NewPD1`. The simplest version is one new calcout:
This is the **stopgap** — a surgical edit to the existing
`virtual_pdLift.template` that closes the runaway without touching the rest
of the template. Add a "did the readback actually change?" check in the
sequence between `CheckIfThere` and `NewPD1`:

```epics
record(ai, "$(VM):Readback_atMoveStart") {
@@ -485,10 +513,63 @@ record(seq, "$(VM):MotorStalled") {
}
```

Insert `TrustOrNot` between `CheckContinue` (which currently goes directly to
`NewPD1`) and `NewPD1`. With this in place the very first encoder-stall
Insert `TrustOrNot` between `CheckContinue` (which currently goes directly
to `NewPD1`) and `NewPD1`. With this in place the very first encoder-stall
attempt would fail cleanly and stop, instead of accumulating four corrections.

### Priority 1b — Proper fix: retire `virtual_pdLift.template`, replicate `virtual_thi_to_pi.template`

The more durable fix is to **do for `tthd` what Matthew Pearson did for
`thi` in January 2026** (RedMine #4822). Diff the two templates side by
side:

| Element | `virtual_pdLift.template` (tthd, 2018) | `virtual_thi_to_pi.template` (thi, 2026-01) |
|---|---|---|
| Retry loop (`NewPD1``REQ_twoTheta_d``DoMove``VerifyDMov`) | Yes, up to `MaxTries=5` | **None** |
| "At target?" check (`CheckIfThere`) that can re-trigger motion | Yes (`TTHD_TOL=0.0005°`) | **None** — DMOV is `p_i.DMOV` directly |
| Kinematic formula reads a second motor's RBV | Yes (`zd.RBV` as `INPB`) | **No** — four pure constants A, B, L, b |
| Move commit to physical motor | `DoMove` sseq with `State`, `NumTries`, `VerifyDMov` | `CalcP` calcout writes `OUT = p_i.VAL PP` |
| Who handles retry on miss | Virtual motor re-fires the kinematic calc with adjusted `tthdCalc` (the runaway path) | The physical motor record's own `RTRY`/`RDBD` |
| Forward kinematic (update `.RBV` from physical) | Separate retry-capable path | Pure display-only calc from `p_i.RBV` |

The thi rewrite eliminates the runaway by eliminating the retry loop that
causes it. The physical-motor RTRY mechanism is *already* correct for
handling the "motor didn't quite hit target, try again" case; the virtual
motor adds no value by retrying at its own level, and in the runaway case
it causes active harm.

**Concrete steps for the tthd rewrite:**

1. Derive the tthd forward/inverse kinematics in the same shape as thi's
   (law of cosines with constants A, B, L, b, and a phi offset). The
   existing tthd geometry uses `L_d = 2186.53`, `A_d = 468.91`, `B_d =
   394.51`, plus `constRO_A = 0.19` and `constRO_B = 0.96` for a
   run-out correction. These don't map 1:1 onto the thi template — the
   engineers who own the geometry (Erik Watkins et al. on the thi
   redesign; John Ankner wrote the original tthd formulas) will need
   to collaborate.
2. Decide what happens to the `zd.RBV` dependency. Two options:
   - **Keep it as a formula input, but plumb it through a calc record
     that reads zd.RBV from Galil2 and exposes it as a local
     read-only PV**, eliminating the cross-IOC dependency inside the
     kinematic chain. Then the formula can still compensate for zd's
     vertical position.
   - **Drop it if the detector geometry no longer depends on zd**
     (i.e., zd is permanently fixed at its current 1002.289 mm for
     this configuration). This simplifies the formula to four
     constants.
3. Create a `BL4B:Mot:tthdNew` / `BL4B:Mot:tthd2` test instance in
   `bl4b-VMotCalc` (which already carries a `tthdFake` pattern) to
   validate the new template against the real geometry before
   swapping over.
4. Swap `BL4B:Mot:tthd` from `virtual_pdLift.template` to the new
   template in `bl4b-Galil1.substitutions:209`; archive the old
   template as `virtual_pdLift.template.old`.

The 1a patch is *fine* as an emergency measure and can be deployed
today. The 1b rewrite is a 1-2 week task best scheduled at the next
maintenance period; it's the right long-term fix.

### Priority 2 — Fix the Stop / Abort semantics

`InitMoveSeq` should *not* unconditionally clear `AbortFlag`. Either:
@@ -608,19 +689,39 @@ physical motor.
  always `dbpr` the composite calcouts to see which PVs they are actually
  resolving to across IOCs.

* The "thi got the fix, tthd didn't" finding only surfaced after the
  `bl4b-Galil2` IOC was examined — not something the user originally
  flagged. If I'd only looked at `bl4b-Galil1` (which is where `tthd`
  lives) the recommendation would have been the narrower "patch
  `NewPD1` with a readback-moved guard" rather than "replicate what
  Matthew Pearson already did for the sibling virtual motor." The
  broader lesson: **when a file references a template, always look
  for sibling/parallel templates in the codebase** — the pattern you
  need to emulate may already be in the tree, written by someone
  who already thought through the same problem. Grep the project
  for `file virtual_*.template` instantiations before designing
  from scratch.

---

## Files referenced

| Purpose | Path |
|---|---|
| Virtual tthd motor template | `/home/controls/bl4b/applications/bl4b-Galil1/bl4b-Galil1App/Db/virtual_pdLift.template` |
| **Virtual tthd motor template (the runaway lives here)** | `/home/controls/bl4b/applications/bl4b-Galil1/bl4b-Galil1App/Db/virtual_pdLift.template` |
| theta_out (Move Align glue) | `/home/controls/bl4b/applications/bl4b-Galil1/bl4b-Galil1App/Db/tthd-virtual.template` |
| zd_toggle ctrl logic | `/home/controls/bl4b/applications/bl4b-Galil1/bl4b-Galil1App/Db/ctrl.template` |
| Motor substitutions | `/home/controls/bl4b/applications/bl4b-Galil1/bl4b-Galil1App/Db/bl4b-Galil1.substitutions` |
| IOC startup | `/home/controls/bl4b/applications/bl4b-Galil1/iocBoot/iocbl4b-Galil1/st.cmd` |
| Motor record runtime values | `/home/controls/var/bl4b-Galil1/bl4b-Galil1.sav{,0,1,2}` |
| Pass0 (MRES, ERES, OFF, DVAL) | `/home/controls/var/bl4b-Galil1/bl4b-Galil1_pass0.sav{,0,1}` |
| Galil command log | `/home/controls/var/log/dassrv1/galil_dmc.log` |
| Galil IOC stdout (Encoder stalls) | `/home/controls/var/log/ioc_bl4b-Mot-Galil1.log` |
| Galil1 motor substitutions | `/home/controls/bl4b/applications/bl4b-Galil1/bl4b-Galil1App/Db/bl4b-Galil1.substitutions` |
| Galil1 IOC startup | `/home/controls/bl4b/applications/bl4b-Galil1/iocBoot/iocbl4b-Galil1/st.cmd` |
| **Existing reference design for tthd rewrite** (Matthew Pearson, RedMine #4822) | `/home/controls/bl4b/applications/bl4b-Galil2/bl4b-Galil2App/Db/virtual_thi_to_pi.template` |
| theta_in (thi/zi composite glue) | `/home/controls/bl4b/applications/bl4b-Galil2/bl4b-Galil2App/Db/thi-virtual.template` |
| Galil2 motor substitutions (defines `zd` on DMC5 axis F) | `/home/controls/bl4b/applications/bl4b-Galil2/bl4b-Galil2App/Db/bl4b-Galil2.substitutions` |
| Galil2 IOC startup | `/home/controls/bl4b/applications/bl4b-Galil2/iocBoot/iocbl4b-Galil2/st.cmd` |
| Galil1 motor record runtime values | `/home/controls/var/bl4b-Galil1/bl4b-Galil1.sav{,0,1,2}` |
| Galil1 pass0 (MRES, ERES, OFF, DVAL) | `/home/controls/var/bl4b-Galil1/bl4b-Galil1_pass0.sav{,0,1}` |
| Galil2 motor record runtime values | `/home/controls/var/bl4b-Galil2/bl4b-Galil2.sav{,0,1,2}` |
| Galil1 command log | `/home/controls/var/log/dassrv1/galil_dmc.log` |
| Galil2 command log (separate from Galil1) | `/home/controls/var/log/bl4b-Galil2.log` |
| Galil1 IOC stdout (Encoder stalls, CA Link Exceptions) | `/home/controls/var/log/ioc_bl4b-Mot-Galil1.log` |
| Galil2 IOC stdout | `/home/controls/var/log/ioc_bl4b-Galil2.log` |
| Email thread (PDF) | `/SNS/users/6ov/BL4B/2026/04/12/WangHarveyHicksGeng-2026-04-13.pdf` |
+22.3 KiB (140 KiB)

File changed.

No diff preview for this file type.