Commit 2527553b authored by Vacaliuc, Bogdan

plan: hs-investigation briefing for fresh-context session



Andre Parizzi (2026-04-13) reviewed the incident plots and proposed:
the hs high limit got triggered at ~23:35 on 2026-04-10, suppressing
positive moves until ~00:10 on 2026-04-11 when something cleared it.
Suspected cause: noise / flaky cable.

The hypothesis is already 80% confirmed by data the parallel tthd
investigation extracted: every BGE failure on DMC3 axis E in the window
is on a positive PRE, the single negative move (PRE=-12801 at
23:50:34) succeeded, and even a 3-motor-step positive move was refused.
That is the textbook signature of an asserted Galil hardware-limit
input.

What's NOT yet confirmed (and what the new session should pursue):
 - Exact HLS 0->1 and 1->0 transition times from the archiver
 - Whether the motor was physically anywhere near DHLM=250mm at the
   moment of assertion (autosave shows DVAL=76mm — likely far below)
 - The proximate physical cause (noise vs cable vs Galil bug vs
   EPICS misconfig)
 - What cleared HLS at ~00:10 (operator action vs IOC restart vs
   passing event)
 - Whether this is a single event or chronic

The plan is self-contained — designed for a fresh session that has
not seen the conversation that produced the parallel tthd report.
It carries forward what we know, points at the right files / PVs /
commands, and includes a "first command to run" bootstrap.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
parent 6ddbf5a9
+375 −0
# BL4B `hs` Motion Failure — Investigation Plan

**Date prepared:** 2026-04-13
**Prepared by:** Automated investigation assistance (carrying over context from the parallel `tthd` investigation)
**Instrument:** BL4B (Liquids Reflectometer)
**Affected PV:** `BL4B:Mot:hs` (Sample Height) — DMC3 axis E, on `bl4b-Galil1` IOC
**Working theory:** **High limit (HLS) was spuriously asserted at ~23:35 on
2026-04-10, suppressing all positive moves until ~00:10 on 2026-04-11 when
something cleared it.** Hypothesised cause: noise / flaky cable / loose
connection on the limit-switch input. Source: Andre Parizzi review of the
incident plots, 2026-04-13.

This document is a **plan**, not an analysis. It is meant to be picked up by
a fresh investigation session that has not seen the conversation that
produced the parallel `tthd-Motion-Failure-Analysis.md` document. It carries
forward the relevant context.

---

## The hypothesis to test (Andre Parizzi, paraphrased)

> Looking at the top half of the plots (the `hs` traces) for the
> 2026-04-10 incident: it looks like a **high limit got triggered at
> 23:35 on Friday** — possibly noise, possibly a flaky cable
> connection, possibly something else. From 23:35 to about 00:10 on
> Saturday, **positive moves on `hs` were suppressed** while
> **negative moves were fine**. At 00:10, **something cleared the
> high-limit trigger** and `hs` came back to full operation.

The plan below is to confirm or refute each of those four claims
(time of trigger, suppression of positive direction only, time of
clear, root cause), and then derive a fix.

---

## Why the hypothesis is already 80% confirmed (smoking-gun evidence from the existing data)

In the course of investigating the parallel `tthd` failure on the same
night (see `tasking/tthd-Motion-Failure-Analysis.md` for the full
context), the `bl4b-Galil1` command log
(`/home/controls/var/log/dassrv1/galil_dmc.log`) was extracted around
the same window. The DMC3 axis E (=`hs`) commands look like this:

```
23:37:39  PRE=36690    BGE → ?  ERROR     (positive move, +2.866 mm)   ← suppressed
23:43:54  PRE=36690    BGE → ?  ERROR     (positive move)              ← suppressed
23:45:14  PRE=11108    BGE → ?  ERROR     (positive move)              ← suppressed
23:46:32  PRE=23907    BGE → ?  ERROR     (positive move)              ← suppressed
23:47:22  PRE=23907    BGE → ?  ERROR     (positive move)              ← suppressed
23:50:34  PRE=-12801   BGE → OK            (NEGATIVE move, -1.000 mm)   ← allowed
23:50:35  PRE=12781    BGE → ?  ERROR     (positive move, +1.000 mm)   ← suppressed
23:51:56  PRE=3        BGE → ?  ERROR     (positive move, +0.0002 mm!) ← suppressed
23:55:57  PRE=23887    BGE → ?  ERROR     (positive move)              ← suppressed
23:56:02  PRE=23888    BGE → ?  ERROR     (positive move)              ← suppressed
00:03:45  PRE=23889    BGE → ?  ERROR     (positive move)              ← suppressed
00:06:18  PRE=105664   BGE → ?  ERROR     (positive move, +8.255 mm)   ← suppressed
00:07:07  PRE=12781    BGE → ?  ERROR     (positive move)              ← suppressed
00:08:34  PRE=12792    BGE → ?  ERROR     (positive move)              ← suppressed
00:09:12  PRE=127994   BGE → ?  ERROR     (positive move, +10.0 mm)    ← suppressed
[after ~00:10, no more BGE failures on axis E that night]
```

Three observations from this pattern that match Andre's hypothesis:

1. **Every BGE failure during the window is on a *positive* PRE.** The
   single negative move in the same window (23:50:34, `PRE=-12801`)
   is the only `BGE` that succeeded. This is the textbook signature
   of an asserted Galil hardware limit input: the controller refuses
   `BG` in the limit direction with a generic `?` error response (the
   reason code can be read back with `TC1`), while moves in the
   opposite direction still work.
2. **Even a 3-motor-step (≈ 0.234 µm) positive move is refused.**
   So the suppression is not a "go further than the limit" issue;
   the controller has decided the axis is *at* the limit and refuses
   any motion in that direction.
3. **The episode ends sometime between 00:09:12 (last failure) and
   the next move attempt.** The exact clear time needs to be
   recovered from the archiver.
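
The step-to-millimetre figures quoted in the log excerpt can be re-derived from `MRES`. A minimal sketch, assuming the `hs` resolution of 7.8125e-05 mm/step listed in this plan; `steps_to_mm` is a hypothetical helper name, not an existing tool:

```bash
# steps_to_mm STEPS [MRES]
# Convert a Galil PRE step count to millimetres.
# MRES defaults to 7.8125e-05 mm/step (the hs value quoted in this plan);
# pass a second argument if the live value differs.
steps_to_mm() {
    awk -v steps="$1" -v mres="${2:-7.8125e-05}" \
        'BEGIN { printf "%.4f\n", steps * mres }'
}
```

For example, `steps_to_mm 36690` and `steps_to_mm 3` reproduce the +2.866 mm and +0.0002 mm annotations in the excerpt.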

What is *not* yet established and the new investigation needs to
nail down:

- What `hs` was actually commanded to do at ~23:35 (the proposed
  trigger time), and whether the limit went hot at that exact moment
  or earlier.
- The physical position of `hs` at the moment HLS asserted. From the
  autosave on 2026-04-13, `hs.DVAL = 76.027` mm and `hs.DHLM = 250` mm,
  so **the motor was probably nowhere near the soft high limit
  position** (the autosave postdates the incident, so the position at
  ~23:35 still has to be confirmed from the archived `hs.RBV`). If HLS
  asserted while the motor was at ~76 mm with the limit configured at
  ~250 mm, **this cannot have been a physical limit contact**: it has
  to be electrical (noise, EMI, cable fault) or a Galil-program /
  EPICS-record misconfiguration.
- What cleared the limit at ~00:10. (User intervention via the OPI?
  An IOC restart? A retract command that physically backed the
  switch off? A `caput` to the appropriate field?)

---

## What is `hs` and where does it live?

| Attribute | Value | Source |
|---|---|---|
| PV | `BL4B:Mot:hs` | `bl4b-Galil1.substitutions:32` |
| Description | "Sample Height (E)" | substitutions |
| Engineering units | mm | substitutions |
| Galil controller | DMC3 (10.112.9.47) | `iocBoot/iocbl4b-Galil1/st.cmd` |
| Galil axis | E (axis index 4) | substitutions |
| Hosting IOC | `bl4b-Galil1` | st.cmd |
| `MRES` | 7.8125e-05 mm/step | `_pass0.sav` |
| `ERES` | 0.0001 mm/cnt | `_pass0.sav` |
| `VELO` | 2 mm/s | autosave |
| `BDST` | **1 mm** | autosave (worth a separate look — could be a backlash trap candidate) |
| `RDBD` | 0.01 mm | autosave |
| `RTRY` | 5 | autosave |
| `UEIP` | 1 (encoder feedback) | autosave |
| `DHLM` / `DLLM` | **250 / 0 mm** (soft limits) | autosave |
| `DVAL` (resting) | 76.027 mm | `_pass0.sav` |

`hs` is one of the simpler Galil1 motors — no virtual motor wrapper, no
kinematic transform. The signal chain is direct: motor record →
`asynMotorController` → Galil DMC3 axis E (controller command `BGE`).

---

## Investigation tasks

### Task 1 — Confirm the trigger time and the directional asymmetry from the archiver

Pull the archiver traces for these PVs over the window 2026-04-10
22:00 → 2026-04-11 01:00 EDT:

```bash
./setup/archiver-query.sh \
    --pv 'BL4B:Mot:hs,BL4B:Mot:hs.RBV,BL4B:Mot:hs.DMOV,BL4B:Mot:hs.MOVN,BL4B:Mot:hs.HLS,BL4B:Mot:hs.LLS,BL4B:Mot:hs.MSTA,BL4B:Mot:hs.LVIO,BL4B:Mot:hs.STAT,BL4B:Mot:hs.SEVR' \
    --start '2026-04-10 22:00:00' --end '2026-04-11 01:00:00' \
    -o /tmp/hs_arch.jsonl
```

Then check:

* `hs.HLS` — does it transition 0→1 at ~23:35 and 1→0 at ~00:10?
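* `hs.MSTA`: in the standard EPICS motor record layout, bit 3
  (`PLUS_LS`, mask `0x0004`) is "plus limit switch hit" and bit 14
  (`MINUS_LS`, mask `0x2000`) is its minus-direction counterpart;
  verify against the motor record `MSTA` bit definitions for the
  version in use. `MSTA` reflects the controller-reported limit input,
  whereas a soft-limit violation surfaces as `LVIO`, so checking both
  pins down whether it's the Galil's hardware-limit-input bit or a
  soft-limit assertion.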
* `hs.SEVR` and `hs.STAT` — was a MAJOR/MINOR alarm active during the
  window? What kind?
* `hs` (the .VAL) and `hs.RBV` — did the user keep writing setpoints
  while the limit was active? (The BGE-failure list above already shows
  14 refused positive-direction attempts.)

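Once `MSTA` values are in hand, the two limit bits can be pulled out with shell arithmetic. This is a sketch that assumes the standard EPICS motor record layout (bit 3, mask 0x0004, for the plus limit switch; bit 14, mask 0x2000, for the minus limit switch) and an integer `MSTA` value. `msta_limits` is a hypothetical helper, and the bit positions should be checked against the motor record version deployed on `bl4b-Galil1`:

```bash
# msta_limits MSTA
# Print the plus/minus limit-switch bits of an integer MSTA value.
# Assumed layout (standard motor record, bits counted from 1):
#   bit 3  (0x0004) = plus limit switch hit
#   bit 14 (0x2000) = minus limit switch hit
msta_limits() {
    msta=$(( $1 ))
    plus=$(( ($msta >> 2) & 1 ))
    minus=$(( ($msta >> 13) & 1 ))
    echo "plus_ls=$plus minus_ls=$minus"
}
```

If the archiver returns `MSTA` as a floating-point number, truncate it to an integer before passing it in.
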
**Acceptance:** Andre's hypothesis is *confirmed* if and only if
`hs.HLS = 1` (or the equivalent MSTA bit) goes hot at ~23:35 and clears
at ~00:10 with no intervening physical motion in the high-limit
direction.
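
The 0→1 and 1→0 checks above amount to edge detection on a sampled series. A minimal sketch, assuming the archiver output has first been reduced to whitespace-separated `DATE TIME VALUE` lines for a single PV (the JSONL schema produced by `archiver-query.sh` is not shown here, so that reduction step is left out); `edges` is a hypothetical helper name:

```bash
# edges: read "DATE TIME VALUE" lines on stdin (one PV, time-ordered)
# and print one line per change of VALUE, as "DATE TIME old->new".
edges() {
    awk 'NR > 1 && $3 != prev { print $1, $2, prev "->" $3 }
         { prev = $3 }'
}
```

Fed the `hs.HLS` series, the hypothesis predicts exactly two output lines: one `0->1` near 23:35 and one `1->0` near 00:10.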

### Task 2 — Find what was happening at exactly 23:35

The first observed BGE failure was at 23:37:39 (per
`/home/controls/var/log/ioc_bl4b-Mot-Galil1.log`). The hypothesis says
the limit went hot at 23:35 — earlier than the first failed move. Two
options:

* The user issued `hs` moves during 23:35 → 23:37 while the limit was
  already hot, but they happened to be negative-direction moves, so
  BGE returned OK and they look normal in the log.
* The limit was triggered by something completely independent of any
  user move (electrical event with no software origin).

To distinguish: look at all axis-E activity in the
`/home/controls/var/log/dassrv1/galil_dmc.log` from 23:30 onward for
DMC3 (`controller="10.112.9.47"`), and correlate with the archiver's
`hs.HLS` transition time.

```bash
# Covers 23:30-23:39 only; widen the minute pattern (e.g. 23:[3-5][0-9])
# to follow axis-E activity through the rest of the hour.
awk '/^2026-04-10 23:3[0-9]/' /home/controls/var/log/dassrv1/galil_dmc.log \
  | grep -E 'controller="10\.112\.9\.47".*([PB]GE|PRE|MG _LFE|MG _LRE)'
```

The Galil `_LFE` and `_LRE` are the runtime forward / reverse limit
input states for axis E. If the IOC poll is sampling these (it does on
every poll cycle, see `Galil` driver source), they'll be visible in
the command log.
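
Because the window of interest runs past midnight, a minute-pattern match is awkward to extend. A sketch of a timestamp-range filter, assuming (as the `awk` pattern above does) that each log line begins with `YYYY-MM-DD HH:MM:SS`; `log_window` is a hypothetical helper name:

```bash
# log_window START END
# Pass through stdin lines whose leading "YYYY-MM-DD HH:MM:SS" timestamp
# lies in [START, END]. ISO-style timestamps sort lexicographically, so a
# plain string comparison is enough, including across midnight.
# Any fractional seconds are dropped before comparing.
log_window() {
    awk -v start="$1" -v end="$2" \
        '{ ts = $1 " " substr($2, 1, 8) } ts >= start && ts <= end'
}

# Example:
#   log_window '2026-04-10 23:30:00' '2026-04-11 00:15:00' \
#       < /home/controls/var/log/dassrv1/galil_dmc.log \
#       | grep 'controller="10.112.9.47"'
```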

### Task 3 — Establish what cleared the limit at ~00:10

Three plausible mechanisms:

1. **An operator did something on the OPI** — set `caput hs.STOP 1`
   then `caput hs.SPMG Go`, or hit a Galil "clear faults" button if
   the BL4B OPI exposes one. Check archiver for `hs.STOP`, `hs.SPMG`
   transitions around 00:10.
2. **An operator manually moved the encoder reference** — e.g.,
   `caput hs.SET Set`, `caput hs.VAL <new>`, then `caput hs.SET Use`, to
   redefine where the motor "thinks" it is, so the soft-limit logic
   stops complaining.
3. **An IOC or controller restart** — would clear all latched state
   transparently. Check `ioc_bl4b-Mot-Galil1.log` for restart
   markers around 00:10.

The Galil program code on DMC3 may also contain a `LIMSWI` interrupt
routine that requires explicit re-arming after the limit is hit;
inspect:

```bash
ls -la /home/controls/bl4b/applications/bl4b-Galil1/iocBoot/iocbl4b-Galil1/10.112.9.47_*.dmc
```

(There are `_gen.dmc` and `_prd.dmc` files — `_gen` is the auto-generated
default, `_prd` would be a custom production override if one exists.)

### Task 4 — Distinguish "noise / cable" from "Galil program bug" from "genuine physical limit contact"

The key data point: at the moment HLS asserted, what was `hs.RBV`?

* If `hs.RBV ≈ DHLM = 250 mm`, the motor was actually at the high limit
  — investigate what move pushed it there.
* If `hs.RBV << DHLM` (e.g., the resting value of ~76 mm we saw in
  autosave), then the limit assertion was *not* due to physical
  contact. Candidates:
  - **Electrical noise on the limit-switch input** — common for
    long cable runs; can be confirmed by inspecting how often `_LFE`
    transitions even without motion in `galil_dmc.log` (it should hold
    a constant 0 or a constant 1 during stationary periods; flips
    are the smoking gun for noise).
  - **Wiring fault** — a broken or oxidised connection presenting as
    an open-collector "active" state.
  - **Galil program bug** — the auto-generated `LIMSWI` interrupt on
    the DMC3 might be mis-handling axis E specifically. Inspect
    `10.112.9.47_gen.dmc` for the `LIMSWI`, `#LIMSWI`, or
    `MO`/`SH`-after-limit logic.
  - **EPICS soft-limit configuration mismatch** — the motor record's
    `HLM`, `DHLM`, `EHLM`, `OFF`, `DIR` could be configured such that
    a mid-range encoder reading triggers a soft-limit assertion which
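    propagates as `LVIO=1` and prevents `BG`. Check `hs.LVIO` in the
    archiver and walk the User-vs-Dial coordinate transform. Note that
    the motor record normally rejects a soft-limit violation before
    sending anything to the controller, whereas the log shows `BGE`
    being sent and refused, which makes this candidate less likely.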

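The first branch of this task reduces to comparing the position at the moment of assertion with `DHLM`. A rough sketch, under the assumptions that the position passed in is in dial coordinates (`DRBV`, or `RBV` only if `OFF` is 0 and `DIR` is positive) and that the hardware switch sits near the soft limit. The 1 mm default tolerance and the name `classify_hls` are illustrative, not taken from the BL4B configuration:

```bash
# classify_hls DIAL_POS DHLM [TOL_MM]
# Report whether the axis was within TOL_MM of the dial high limit.
# TOL_MM defaults to 1.0 mm, an arbitrary illustrative threshold.
classify_hls() {
    awk -v pos="$1" -v dhlm="$2" -v tol="${3:-1.0}" 'BEGIN {
        if (dhlm - pos <= tol)
            print "near-limit: physical contact plausible"
        else
            print "far-from-limit: suspect electrical or configuration cause"
    }'
}
```

With the autosave values quoted earlier (`classify_hls 76.027 250`), the function takes the far-from-limit branch; the archived position at the assertion time is the value that actually matters.
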
### Task 5 — Look for prior occurrences

If this is a recurring noise/cable issue, there should be earlier
HLS transients. Pull a longer archiver window:

```bash
./setup/archiver-query.sh \
    --pv 'BL4B:Mot:hs.HLS,BL4B:Mot:hs.LLS,BL4B:Mot:hs.MSTA' \
    --start '2026-01-01 00:00:00' --end '2026-04-13 00:00:00' \
    -o /tmp/hs_hls_history.jsonl
```

Count HLS 0→1 transitions per day. A flaky cable would produce a
sprinkling over many days; a single noise-burst electrical event would
produce one (April 10) and nothing else.
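
Counting rising edges per day can be done with a short `awk` filter. As before, this is a sketch that assumes the `hs.HLS` history has been reduced to time-ordered `DATE TIME VALUE` lines (the reduction from the archiver JSONL is not shown because its schema is not given here); `rising_per_day` is a hypothetical helper name:

```bash
# rising_per_day: read "DATE TIME VALUE" lines on stdin (time-ordered)
# and print "DATE count" for every day with at least one 0->1 transition.
rising_per_day() {
    awk 'NR > 1 && prev == 0 && $3 == 1 { n[$1]++ }
         { prev = $3 }
         END { for (d in n) print d, n[d] }' | sort
}
```

A single output line dated 2026-04-10 would point to a one-off event; many lines spread over weeks would point to a chronic fault.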

### Task 6 — Hardware inspection (post-software-investigation)

Once the software side establishes "limit asserted electrically with
the motor mechanically far from the limit", the next step is a
physical inspection of the `hs` limit-switch wiring on DMC3 axis E.
Likely candidates (in order of probability based on similar facility
experience):

* The limit-switch cable shielding integrity at the connector to
  DMC3 axis E.
* Any cable tray crossings near high-current sources (e.g., ion-pump
  supplies, helium liquefier, motors with VFDs) that could couple
  noise into the limit signal.
* The limit-switch microswitch itself (oxidation, debris).
* The Galil's debounce / digital-filter setting for the limit input,
  if configurable.

---

## Files and tools the new session will need

| Purpose | Path |
|---|---|
| Galil command log (DMC1-4) — every PR/BG/MG/SH | `/home/controls/var/log/dassrv1/galil_dmc.log` |
| Galil1 IOC stdout — Encoder-stall, BGE failures, CA Link Exceptions | `/home/controls/var/log/ioc_bl4b-Mot-Galil1.log` |
| Galil1 motor substitutions | `/home/controls/bl4b/applications/bl4b-Galil1/bl4b-Galil1App/Db/bl4b-Galil1.substitutions` |
| Galil1 IOC startup | `/home/controls/bl4b/applications/bl4b-Galil1/iocBoot/iocbl4b-Galil1/st.cmd` |
| Generated Galil DMC3 program (default) | `/home/controls/bl4b/applications/bl4b-Galil1/iocBoot/iocbl4b-Galil1/10.112.9.47_gen.dmc` |
| Production Galil DMC3 program (override, if any) | `/home/controls/bl4b/applications/bl4b-Galil1/iocBoot/iocbl4b-Galil1/10.112.9.47_prd.dmc` |
| Galil1 motor record runtime values | `/home/controls/var/bl4b-Galil1/bl4b-Galil1.sav{,0,1,2}` |
| Galil1 pass0 (MRES, ERES, OFF, DVAL) | `/home/controls/var/bl4b-Galil1/bl4b-Galil1_pass0.sav{,0,1}` |
| Email thread that triggered the original report | `/SNS/users/6ov/BL4B/2026/04/12/WangHarveyHicksGeng-2026-04-13.pdf` |
| Sister investigation (tthd runaway, same night) | `tasking/tthd-Motion-Failure-Analysis.md` |
| Archiver query tool | `setup/archiver-query.sh` (see parent `CLAUDE.md` "SNS CSS archiver direct query") |

---

## What we already know to *exclude* from the hs investigation

These were established during the parallel `tthd` investigation and don't
need to be re-derived:

* **`hs`'s BGE failures did not cause the unexpected `tthd` motion.** The
  unexpected upward motion at 23:55-23:59 was a separate `p_d`
  encoder-stall + `virtual_pdLift.template` retry-runaway event. See
  `tasking/tthd-Motion-Failure-Analysis.md`. The `hs` problem and the
  `tthd` problem coincided on the same night because the user's
  frustration with `hs` led them to re-trigger the Move Align procedure,
  which is what fired the `tthd` runaway — but mechanistically they are
  two distinct faults.
* **`hs` is on a different controller from `p_d` and `tthd_enc`.** `hs`
  is on DMC3 (10.112.9.47) axis E; `p_d` and `tthd_enc` are on DMC4
  (10.112.9.48) axes F and G respectively. So the limit-switch issue
  on `hs` is electrically and logically isolated from the `p_d` chain
  — there's no common-mode failure to look for between them.
* **The "hs motor red light was on" the user reported** is consistent
  with HLS being asserted (the OPI lights up on `hs.HLS=1` or
  `hs.SEVR=MAJOR`). It is not necessarily a separate fault; it may just
  be the same limit-asserted state surfacing in the OPI.

---

## Definition of "done"

The investigation is complete when the report can answer:

1. **What time did `hs.HLS` go from 0→1?** (Should be at or near 23:35.)
2. **Was the motor physically at the high limit at that moment?** (Almost
   certainly no.)
3. **What was the proximate cause of the HLS assertion?** (Noise pulse on
   limit input cable / loose connector / EMI from a coincident event /
   Galil firmware bug / EPICS configuration error / other.)
4. **What time did `hs.HLS` go from 1→0?** (Should be at or near 00:10.)
5. **Who or what cleared it?** (Operator intervention via specific
   field/button / IOC restart / passing event.)
6. **Has this happened before?** (Single-event vs chronic.)
7. **Recommended fix.** (Cable inspection / replacement, Galil debounce
   tuning, EPICS configuration change, hardware replacement, etc.)

The deliverable should be a markdown report following the same
structure as `tasking/tthd-Motion-Failure-Analysis.md` (executive
summary → architecture → timeline → root cause → recommended fixes →
cross-checks → dead ends and lessons learned), plus a PDF render via
`tasking/pdf-tools/md2pdf.py`.

---

## Pointer to cross-project patterns

The parent project `CLAUDE.md` (on `main`) carries a section
"EPICS motor record investigation patterns" with multiple reusable
techniques. The most relevant ones for this investigation:

* **`MSTA` bit decoding** — to translate the bitfield into "which
  specific limit/alarm/state is asserted".
* **EPICS motor record: runtime vs substitutions ground truth** — the
  substitutions file is *not* authoritative; runtime values from
  autosave or live `caget` are. Don't rely on substitutions-file
  defaults for `DHLM`, `DLLM`, etc.
* **EPICS autosave file archaeology** — dated backup files
  (`_YYMMDD-hhmmss`) preserve snapshots; useful for reconstructing
  what `hs.DHLM` / `hs.HLM` / `hs.OFF` were at any point in the past.
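
For the autosave archaeology, a field can be pulled out of each snapshot with a one-line filter. This sketch assumes the usual autosave `.sav` layout of one `PVname value` pair per line; `sav_value` is a hypothetical helper name:

```bash
# sav_value PV FILE...
# Print "FILE VALUE" for each autosave file that contains PV.
# Assumes one "PVname value" pair per line; header and "<END>" lines
# never match because their first field is not a PV name.
sav_value() {
    pv="$1"; shift
    awk -v pv="$pv" '$1 == pv { print FILENAME, $2 }' "$@"
}

# Example:
#   sav_value BL4B:Mot:hs.DHLM /home/controls/var/bl4b-Galil1/bl4b-Galil1.sav*
```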

---

## Suggested first command for the new session

To bootstrap quickly, the very first archiver pull to do is:

```bash
cd /media/ssd2/Projects/Claude/1
./setup/archiver-query.sh \
    --pv 'BL4B:Mot:hs.HLS,BL4B:Mot:hs.LLS,BL4B:Mot:hs.MSTA,BL4B:Mot:hs.LVIO,BL4B:Mot:hs.STAT,BL4B:Mot:hs.SEVR,BL4B:Mot:hs,BL4B:Mot:hs.RBV,BL4B:Mot:hs.DMOV,BL4B:Mot:hs.STOP,BL4B:Mot:hs.SPMG' \
    --start '2026-04-10 22:00:00' --end '2026-04-11 01:00:00' \
    -o /tmp/hs_window.jsonl
```

The first thing to look at in the resulting JSONL is the
`BL4B:Mot:hs.HLS` timeline. If it shows a 0→1 transition near
23:35 and a 1→0 near 00:10, Andre's hypothesis is confirmed and the
investigation pivots to *why*.