**Last failure:** 2026-03-28 ~23:48 UTC (and recurring since at least 2026-03-25)
---
## Executive Summary
The recurring scan failures on `BL4A:Mot:S3:X:Gap` are caused by a **tolerance budget mismatch** between three software layers, not a hardware defect. The maximum combined positioning error of the two physical slit blades (0.015 mm) exceeds the scan server's acceptance tolerance (0.010 mm). When the two blade errors add rather than cancel in the gap, the virtual gap readback falls outside the scan tolerance, the scan server times out after 300 seconds, and the scan is declared failed, even though every motor settled within its own deadband.
Replacing the Galil controller (done twice) cannot fix this because the motors are performing within specification. The fix is to reconcile the tolerance values across the software stack.
---
## System Architecture
The S3 horizontal slit gap is controlled by a three-layer software stack:
From the scan server log (reported by Xiaosong Geng, 2026-03-30):
```
2026-03-29 00:15:08 WARNING Command failed:
Set 'BL4A:Mot:S3:X:Gap' = 4.143068 with completion in 300.0 sec
(check for 'BL4A:Mot:S3:X:Gap.RBV' +-0.01)
Caused by: TimeoutException:
Timeout while waiting for BL4A:Mot:S3:X:Gap.RBV (4.1279) = 4.143068
```
| Quantity | Value |
|---|---|
| Requested gap | 4.143068 mm |
| Achieved gap (RBV) | 4.1279 mm |
| **Gap error** | **0.01517 mm** |
| Scan tolerance | ±0.010 mm |
| **Result** | **0.01517 > 0.010 — FAIL** |
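
The figures in the table can be recovered directly from the log text. The following is a minimal sketch, assuming the message format shown above stays stable; the `parse_timeout` helper and its regular expressions are illustrative and not part of the scan server.

```python
import re

# The three relevant lines of the scan server message quoted above.
LOG = """\
Set 'BL4A:Mot:S3:X:Gap' = 4.143068 with completion in 300.0 sec
(check for 'BL4A:Mot:S3:X:Gap.RBV' +-0.01)
Timeout while waiting for BL4A:Mot:S3:X:Gap.RBV (4.1279) = 4.143068
"""

def parse_timeout(text):
    """Extract setpoint, last readback, tolerance and gap error (all in mm)."""
    setpoint = float(re.search(r"Set '[^']+' = ([-\d.]+)", text).group(1))
    tolerance = float(re.search(r"\+-([\d.]+)\)", text).group(1))
    readback = float(re.search(r"\.RBV \(([-\d.]+)\)", text).group(1))
    error = abs(setpoint - readback)
    return setpoint, readback, tolerance, error

setpoint, readback, tolerance, error = parse_timeout(LOG)
print(f"error = {error:.5f} mm, tolerance = {tolerance} mm, "
      f"within tolerance: {error <= tolerance}")
```

Running the same helper over other timeout messages in the log would show whether the gap errors cluster just above the scan tolerance, as this one does.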
---
## Root Cause: Tolerance Budget Mismatch
The gap is computed as the *difference* of two blade readbacks. Each blade position has an independent error up to its RDBD (retry deadband — the threshold below which the motor record considers the position "close enough" and stops retrying). In the worst case, these errors **add**:
```
Max gap error = LSlit3.RDBD + RSlit3.RDBD
= 0.005 + 0.010
= 0.015 mm
Scan tolerance = 0.010 mm
0.015 > 0.010 → Failure is statistically inevitable
```
When the two blade errors add rather than cancel in the gap (e.g., LSlit3 undershoots while RSlit3 overshoots, or vice versa), the combined gap error can exceed the scan tolerance. The scan server cannot distinguish between "still moving" and "settled at the wrong position"; it simply waits 300 seconds and times out.
### Why the Virtual Motor Layer Doesn't Help
The virtual Gap motor has `RTRY=0` (no retries). Once both physical motors declare DMOV=1, the virtual motor immediately reports done. It never checks whether the *combined* gap is correct. There is no retry at the virtual level to correct the gap by re-positioning the blades.
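
A toy model makes this concrete. It is not the EPICS motor record code: `move_blade`, the `landing_errors` lists (the residual error after each attempt), the retry limit of 3, and the symmetric blade geometry are all hypothetical, chosen only to show how both blades can finish with DMOV=1 while the gap is still outside the scan tolerance.

```python
SCAN_TOL = 0.010  # mm, scan server tolerance

def move_blade(target, landing_errors, rdbd, rtry):
    """Toy model of one blade move with retries.

    landing_errors: hypothetical residual error (mm) after each attempt.
    Returns (readback, miss); the blade reports DMOV=1 in both outcomes.
    """
    readback = None
    for err in landing_errors[: rtry + 1]:   # first move plus up to rtry retries
        readback = target + err              # where the blade actually stopped
        diff = target - readback             # target minus readback
        if abs(diff) < rdbd:
            return readback, False           # inside retry deadband: accepted
    return readback, True                    # retries exhausted: MISS

gap_target = 4.143068
# Assumed geometry: blades symmetric about zero, gap = right - left.
left_target, right_target = -gap_target / 2, gap_target / 2

left_rbv, left_miss = move_blade(left_target, [0.020, 0.0045], rdbd=0.005, rtry=3)
right_rbv, right_miss = move_blade(right_target, [-0.0095], rdbd=0.010, rtry=3)

# Both blades are done, so a virtual motor with RTRY=0 reports done as well,
# without comparing the combined gap to its own target.
gap_rbv = right_rbv - left_rbv
gap_error = abs(gap_target - gap_rbv)
print(f"gap error {gap_error:.4f} mm, MISS flags {left_miss}/{right_miss}, "
      f"scan accepts: {gap_error <= SCAN_TOL}")
```

In this example each blade is accepted inside its own deadband, neither raises MISS, and the gap is still about 0.014 mm off, which the scan server rejects.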
### Why It Appears Intermittent
The actual blade positioning error is random within ±RDBD. On many moves, the errors partially cancel or both blades land close to target, and the gap falls within ±0.01. On ~10-30% of moves (estimated from the error distribution), the errors combine unfavorably and exceed the tolerance.
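
The order of magnitude of that estimate can be checked with a quick simulation. This is a sketch under a strong assumption that the source does not confirm: each blade error is independent and uniformly distributed within ±RDBD. The real distribution may differ (for example, errors may cluster near the deadband edge after a retry), so the result is only indicative.

```python
import random

L_RDBD = 0.005    # mm, LSlit3 retry deadband
R_RDBD = 0.010    # mm, RSlit3 retry deadband
SCAN_TOL = 0.010  # mm, scan server tolerance

def failure_fraction(n_moves=200_000, seed=1):
    """Fraction of simulated moves whose gap error exceeds SCAN_TOL.

    Assumption: blade errors are independent and uniform in [-RDBD, +RDBD].
    """
    rng = random.Random(seed)
    failures = 0
    for _ in range(n_moves):
        err_l = rng.uniform(-L_RDBD, L_RDBD)
        err_r = rng.uniform(-R_RDBD, R_RDBD)
        # The gap is a difference of blade positions, so its error is the
        # difference of the two blade errors.
        if abs(err_r - err_l) > SCAN_TOL:
            failures += 1
    return failures / n_moves

print(f"simulated failure fraction: {failure_fraction():.3f}")
```

For two uniform errors with these widths the exact value is 12.5%, which sits at the lower end of the range quoted above.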
---
## Evidence from Network Trace (PCAP Analysis)
A Wireshark capture of traffic between the IOC (10.11.8.130) and the Galil DMC1-2 (10.112.8.42) during the failure shows the complete motion sequence:
| Phase | Timestamp | Axis G (LSlit3) | Axis H (RSlit3) | Interpretation |
**Phase 3 is the critical event.** RSlit3 (axis H) settled within its 0.010 mm RDBD, so the motor record accepted its position and issued `MOH` (motor off). LSlit3 still needed a small correction (782 counts = 0.015 mm). After LSlit3's final correction, both motors are DMOV=1. But the combined gap readback (4.1279 mm) is 0.015 mm off target — outside the scan's ±0.01 mm window.
The Phase 2 correction (~20% of the initial move) indicates a significant first-move overshoot, likely related to backlash compensation or step-count accumulation error in open-loop (UEIP=0) mode. This is a secondary concern but contributes to landing near the edge of the RDBD.
---
## RDBD Tuning History
Analysis of the autosave backup files shows engineers have been adjusting the RDBDs:
The RSlit3 RDBD was doubled from 0.005 to 0.010, likely to reduce excessive retries. This reduced the retry count but made the gap tolerance budget worse (combined max error went from 0.010 to 0.015). The significant changes in motor OFF values across dates indicate multiple recalibrations, consistent with controller replacements.
---
## Why Hardware Replacement Cannot Fix This
The Galil controller has been replaced twice. The problem persists because:
1. **The motors are reaching their targets** within their individual deadbands. The hardware is performing correctly.
2. **The issue is in the tolerance bookkeeping** across software layers: specifically, the scan server tolerance is tighter than the sum of the physical motor deadbands.
3. **Any new controller** with the same motor resolution and RDBD settings will exhibit the identical behavior.
---
## Related Failures
The email thread documents similar failures on other slits:
- **S1:Y:Gap** (2026-03-25): `BL4A:Mot:S1:Y:Gap.RBV (0.3906) = 0.4001`, error = 0.0095 mm, tolerance ±0.005 mm. Same root cause: blade RDBDs sum to more than the scan tolerance.
- **S3:X:Gap** (2026-03-25): Reported by Haile Ambaye, same target (~4.14 mm).
- **S3:X:Gap** (2026-03-28): Second failure at exact same setpoint, reported by Asmaa Qdemat.
All slit virtual motors (`S0:X`, `S1:X`, `S1:Y`, `S2:X`, `S2:Y`, `S3:X`, `S3:Y`, `SS:X`) use the same `tolerance=0.01` in `devices.py`, making any slit with blade RDBD sum > 0.01 vulnerable.
---
## Recommended Fixes
### Option A — Increase Scan Server Tolerance (Recommended)
The simplest and most correct fix. In `devices.py`, set the tolerance for all virtual slit Gap and Center devices to at least the sum of the blade RDBDs plus margin:
```python
# Current (line 58):
Device("BL4A:Mot:S3:X:Gap", ..., tolerance=0.01)
# Proposed:
Device("BL4A:Mot:S3:X:Gap", ..., tolerance=0.02)
```
**Rule of thumb:** `scan_tolerance >= blade1_RDBD + blade2_RDBD + margin`
Apply to all virtual slit devices (lines 44-60 in `devices.py`). A tolerance of 0.02 mm provides adequate margin for all current RDBD settings.
**Impact:** Scan acceptance window widens from ±0.01 to ±0.02 mm. For a reflectometry instrument with typical slit gaps of 0.5-10+ mm, this has negligible impact on data quality.
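
The rule of thumb can be turned into a small check to run whenever an RDBD or a scan tolerance changes. This is a sketch only: the function names and the 0.002 mm margin are illustrative assumptions, and the RDBD values are the S3:X ones quoted in this report.

```python
def min_scan_tolerance(blade_rdbds, margin=0.002):
    """Smallest scan tolerance (mm) that covers the worst-case gap error."""
    return sum(blade_rdbds) + margin

def check(name, blade_rdbds, scan_tolerance, margin=0.002):
    """Print and return whether scan_tolerance satisfies the rule of thumb."""
    required = min_scan_tolerance(blade_rdbds, margin)
    ok = scan_tolerance >= required
    print(f"{name}: required >= {required:.3f} mm, "
          f"configured {scan_tolerance:.3f} mm -> {'OK' if ok else 'TOO TIGHT'}")
    return ok

s3x_rdbds = (0.005, 0.010)  # LSlit3, RSlit3 retry deadbands (mm)
current_ok = check("S3:X:Gap current ", s3x_rdbds, 0.01)   # devices.py today
proposed_ok = check("S3:X:Gap proposed", s3x_rdbds, 0.02)  # Option A value
```

The same check could be repeated for each slit pair once its blade RDBDs have been read from autosave.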
### Option B — Tighten Physical Motor RDBDs
Set both blade RDBDs low enough that their sum stays within the scan tolerance:
```
LSlit3.RDBD = 0.004 mm
RSlit3.RDBD = 0.004 mm
Sum = 0.008 < 0.010 ✓
```
**Risk:** Tighter RDBDs mean more retries and potential oscillation for steppers running open-loop (UEIP=0). The Mar 20 attempt at 0.002 mm may have triggered exactly this behavior. Before adopting this option, verify that the steppers can reliably position within 0.004 mm (205 steps) without encoder feedback.
### Option C — Enable Virtual Motor Retries
Set `RTRY > 0` on the Gap soft motor record. This would allow the virtual motor to detect the gap mismatch and re-issue corrected blade positions.
**Risk:** Changes the behavior of the `slit_nondrifting.template` and needs testing to ensure it doesn't interact badly with the physical motor retries. The template was specifically designed with `RTRY=0` to avoid double-retry loops.
### Option D — Combination Approach (Most Robust)
1. Set `devices.py` tolerance to 0.02 mm (Option A) — prevents scan timeouts
2. Set both blade RDBDs to 0.004 mm (Option B) — reduces typical gap error
3. Monitor with archiver data to confirm improvement
---
## Appendix A: Slit Tolerance Template Logic
The `slit_nondrifting.template` uses a two-stage calcout for each blade:
**Stage 1** — Calculate target blade position from gap demand:
```
OOPT: When Non-zero (only send move if outside tolerance)
```
With `Tolerance = 0`, the check always evaluates to 1 (always move), so every gap change sends commands to both physical motors.
## Appendix B: Motor Record Retry Mechanism
When a physical motor (e.g., LSlit3) finishes a move:
1. Motor record computes `DIFF = DVAL - DRBV` (dial target - dial readback)
2. If `|DIFF| >= RDBD`: retry (up to RTRY times)
   - Default mode: sends the full remaining difference as a relative move
3. If `|DIFF| < RDBD`: accept position, set DMOV=1
4. If retries exhausted (`RCNT > RTRY`): set MISS=1, set DMOV=1
The virtual Gap motor's DMOV is the AND of both blade DMOVs. Once both blades are done (either accepted or MISS), the gap is declared done — regardless of whether the combined gap is correct.
## Appendix C: Key Runtime Values (from autosave)
```
BL4A:Mot:S3:X:Tolerance.VAL 0
BL4A:Mot:S3:X:Gap:SP.VAL 4.143068 (the failed setpoint)