Commit bf2e73a4 authored by Vacaliuc, Bogdan's avatar Vacaliuc, Bogdan

results: Step 1 live read test against snsoroda-scan



All exit criteria met. Captures the eight verification queries against
the production Instruments archiver and the resulting findings.

Key results:

* Connect: oracledb 3.4.2 thin mode connects to
  snsoroda-scan.sns.gov:1521/scprod_controls as sns_reports in 0.09 s.
  Server is Oracle Database 19.27.0.0.0.

* Schema: chan_arch.channel has 428,106 rows. All four DANGLE-related
  PVs (DANGLE, DANGLE.RBV, DANGLE.DMOV, AirPadStatus) resolve cleanly
  to channel IDs 80399-80402.

* Sample query: 120 rows for BL4A:Mot:DANGLE.RBV in the 13:35-13:37
  EDT incident window. First row matches the operator-exported CSV
  exactly: (13:35:48.698608, 4.4057) <-> CSV row "1775669748698  4.4057".

* Timezone: Oracle DB is configured in UTC; session inherits client
  OS timezone. Round-trip works on dragonfly because the OS is Eastern.
  Tool MUST issue ALTER SESSION SET TIME_ZONE = 'America/New_York'
  immediately after connect for portability to non-Eastern hosts.

* Severity/status integer maps captured live so the tool can hard-code
  them instead of joining on every row. Gap-marker status IDs are
  {2: Archive_Off, 3: Write_Error, 4: Disconnected}. Normal-state is
  severity_id=1 (NONE), NOT severity_id=0 (which is Unused).

* AirPadStatus has NO enum_metadata rows -- contradicts the original
  plan and the source-analysis doc, which both assumed it would have
  (0, "Off") (1, "On"). The tool must not assume binary PVs are enums.

* Real Disconnected gap markers found in the historical record. They
  have severity=2 INVALID, status=4 Disconnected, all numeric value
  columns NULL, and str_val='Disconnected'. CSS coalescing returns the
  string naturally, but the tool should additionally surface
  is_marker=true via integer comparison.

* num_metadata for DANGLE.RBV: display range [-1000, 1000] deg, prec 4.
  A historical recovery sample shows float_val=3390.0 -- wildly out of
  range, the kind of bad value an investigator wants flagged. Tool should
  set out_of_range=true when num_metadata is present.

* Stored procedure archive_reader_pkg.get_actual_start_time(cid,ts,te)
  works for the carry-forward initial-sample lookup. The inline-SQL
  fallback hangs >60 s on the production schema -- Oracle doesn't push
  LIMIT-1 down through the reverse scan. Tool must use the stored proc.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
parent a15bddb2
# Step 1 — Live Read Test Results

**Date:** 2026-04-11
**Branch:** `tasking/css-archiver-query-tool-development`
**Target:** `snsoroda-scan.sns.gov:1521/scprod_controls` as `sns_reports`
**Driver:** Python `oracledb` 3.4.2 thin mode

This document records the result of the Step 1 feasibility test prescribed by
`plan/archiver-query-tool.md`. All temporary scripts (`/tmp/*-xZRkXM.py`,
mode 600) were deleted after the run; this file is the durable record.

---

## Summary

| Check                                          | Result                                                                                                                                  |
|------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------|
| `oracledb` thin-mode connect                   | ✅ 0.09 s                                                                                                                                |
| Server identification                          | Oracle Database 19.27.0.0.0                                                                                                              |
| `chan_arch.channel` row count                  | **428,106 channels** in the Instruments DB                                                                                               |
| DANGLE channel resolution                      | ✅ 24 BL4A:Mot:DANGLE\* channels found; all four target PVs resolve cleanly                                                              |
| Sample query in DANGLE incident window         | ✅ 120 rows for `BL4A:Mot:DANGLE.RBV` between 13:35:00–13:37:00 EDT, matches operator CSV exactly                                        |
| Stored procedure `get_actual_start_time`       | ✅ Returns datetime `2026-04-08 13:20:24.921311` (15 min before incident start), confirms initial-sample carry-forward path is usable    |
| `enum_metadata` for AirPadStatus               | ⚠️ **No rows.** AirPadStatus is stored as a raw integer in `num_val` — CSS would render it as `0`/`1`, not as an enum string             |
| `num_metadata` for DANGLE.RBV                  | ✅ Display range -1000..1000, prec=4, unit=`'deg'`                                                                                       |
| Session/DB timezone                            | Session = `-04:00` (EDT, client local), DB = `+00:00` (UTC). Naive datetimes round-trip to operator CSV epoch-ms exactly.                 |
| Round-trip vs operator CSV                     | ✅ **Exact** — first DANGLE.RBV row `(13:35:48.698608, 4.4057)` matches CSV row `1775669748698 → 4.4057` within rounding (μs vs ms)       |
| Disconnected gap markers (last 30 days)        | ✅ Multiple `Disconnected` rows visible from March 15-20 — all value columns NULL, severity=2 INVALID, status=4 Disconnected             |

**All Step 1 exit criteria met.** Proceed to Step 2 implementation.

---

## Detail: timezone — the headline finding

```sql
SELECT sessiontimezone, dbtimezone, TO_CHAR(SYSTIMESTAMP, '...') FROM DUAL
-- session_tz: -04:00
-- db_tz:      +00:00
-- systs:      2026-04-11 13:40:32.455327 -04:00
```

The Oracle database is configured in **UTC** (`db_tz=+00:00`); the connection
session inherits the **client OS timezone** (`session_tz=-04:00` because the
dragonfly machine is in `America/New_York`). When the JDBC/Python driver
fetches a `TIMESTAMP` column, Oracle converts the stored UTC value into the
session timezone wall-clock and returns it as a naive datetime in that
timezone.

The naive `datetime` we receive is therefore **already in client local time**.
Converting it to epoch milliseconds with `.timestamp() * 1000` (in a process
whose `time.tzname` is also `America/New_York`) yields the same number as the
operator-exported CSV, whose `epoch_ms` column is a true UTC epoch.

**This works correctly only when the client OS timezone matches what was used
when the query was constructed.** If the tool ever runs on a UTC host (or
Pacific, etc.) without a timezone fix, both the input window timestamps *and*
the output sample timestamps will be off by the local-vs-Eastern offset.
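The round-trip can be reproduced offline with the stdlib `zoneinfo` module; a minimal sketch using the sample values quoted above, where attaching `America/New_York` explicitly stands in for running on an Eastern-timezone host:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# First DANGLE.RBV sample in the incident window, as fetched from Oracle:
# a naive datetime already expressed in Eastern wall-clock time.
naive = datetime(2026, 4, 8, 13, 35, 48, 698608)

# Attach the Eastern zone explicitly (equivalent to a client OS timezone
# of America/New_York) and convert to epoch milliseconds.
aware = naive.replace(tzinfo=ZoneInfo("America/New_York"))
epoch_ms = int(aware.timestamp() * 1000)
# epoch_ms == 1775669748698, the operator CSV's epoch_ms for this row
```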

**The fix is mandatory and one line:**

```python
cur.execute("ALTER SESSION SET TIME_ZONE = 'America/New_York'")
```

This pins the session to Eastern regardless of client OS, so naive datetimes
always behave the same way. The tool MUST do this immediately after connect,
before any sample queries.

(Verified working in the second probe script.)

---

## Detail: severity and status enum tables

These were captured live from `chan_arch.severity` and `chan_arch.status` so
the tool can hard-code the integer→string mapping instead of doing a JOIN on
every row.

### `chan_arch.severity`

```
0  Unused
1  NONE          ← the normal state for a healthy sample
2  INVALID
3  UNDEFINED
4  MAJOR
5  MINOR
6  OK
7  MINOR_ACK
8  MAJOR_ACK
9  INVALID_ACK
10 UNDEFINED_ACK
```

### `chan_arch.status` (first 20 — there are more)

```
1  NO_ALARM
2  Archive_Off       ★ gap marker
3  Write_Error       ★ gap marker
4  Disconnected      ★ gap marker
5  HIHI_ALARM
6  HIGH_ALARM
7  LOW_ALARM
8  STATE_ALARM
9  LOLO_ALARM
10 READ_ALARM
11 TIMEOUT_ALARM
12 UDF_ALARM
13 OK
14 HIGH
15 HIHI
16 LOLO
17 LOW
18 READ
19 WRITE
20 STATE
...
```

**Gap-marker status IDs are `{2, 3, 4}`**: `Archive_Off`, `Write_Error`, and
`Disconnected`. Phoebus's `AbstractRDBValueIterator.filterSeverity()` does a
case-insensitive string comparison; we can use an integer set-membership test
instead, which is faster and avoids accidental name drift.

**The tool should hard-code these constants** rather than re-query the lookup
tables on every invocation. They are unlikely to change — they are set at
archive-engine schema-creation time and CSS depends on them.
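A sketch of those constants, transcribed from the live dump above (the names `SEVERITY`, `STATUS`, and `GAP_MARKER_STATUS_IDS` are ours, not part of any existing module):

```python
# Integer -> string maps captured from chan_arch.severity / chan_arch.status.
SEVERITY = {
    0: "Unused", 1: "NONE", 2: "INVALID", 3: "UNDEFINED", 4: "MAJOR",
    5: "MINOR", 6: "OK", 7: "MINOR_ACK", 8: "MAJOR_ACK",
    9: "INVALID_ACK", 10: "UNDEFINED_ACK",
}
STATUS = {
    1: "NO_ALARM", 2: "Archive_Off", 3: "Write_Error", 4: "Disconnected",
    5: "HIHI_ALARM", 6: "HIGH_ALARM", 7: "LOW_ALARM", 8: "STATE_ALARM",
    9: "LOLO_ALARM", 10: "READ_ALARM", 11: "TIMEOUT_ALARM", 12: "UDF_ALARM",
    13: "OK", 14: "HIGH", 15: "HIHI", 16: "LOLO", 17: "LOW",
    18: "READ", 19: "WRITE", 20: "STATE",
}
GAP_MARKER_STATUS_IDS = frozenset({2, 3, 4})

def is_gap_marker(status_id: int) -> bool:
    """Integer membership test replacing Phoebus's string comparison."""
    return status_id in GAP_MARKER_STATUS_IDS
```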

---

## Detail: AirPadStatus is *not* an enum in the archive

We expected `chan_arch.enum_metadata` for `BL4A:Mot:AirPadStatus` to return
`(0, "Off"), (1, "On")` rows so the tool could render it as enum strings
like CSS does for other mbbi PVs. **It returned no rows at all.**

Sample data for the channel (in the DANGLE incident window):

```
(2026-04-08 13:35:44.992699, sev=5 MINOR, st=8 STATE_ALARM, num_val=1, float_val=None, str_val=None)
(2026-04-08 13:36:26.710821, sev=1 NONE,  st=1 NO_ALARM,    num_val=0, float_val=None, str_val=None)
(2026-04-08 13:38:42.793420, sev=5 MINOR, st=8 STATE_ALARM, num_val=1, float_val=None, str_val=None)
(2026-04-08 13:38:53.590670, sev=1 NONE,  st=1 NO_ALARM,    num_val=0, float_val=None, str_val=None)
```

Observations:

- **Storage column:** `num_val` (integer), not `float_val`. Confirms the
  coalescing rule must look at both columns — analog motors use `float_val`,
  integer status PVs use `num_val`.
- **The channel is alarming when value=1.** The IOC has STATE_ALARM
  configured for the "On" state with severity MINOR. This is the *intended*
  behaviour per the operator: when air pads are pressurized, the operator is
  reminded by a minor alarm to release them after the move.
- **No enum_metadata.** CSS Phoebus would render these samples as raw `1`s and
  `0`s, not as `"On"`/`"Off"`. The tool must do the same: render the integer
  value, and document that the operator can use `--describe-channel` to
  discover whether enum_metadata exists for a channel before assuming enum
  semantics.
- **The values match the operator CSV exactly** (column 5 in the CSV =
  `0,1,0` for the three transitions visible in the same window).

**Plan revision needed:** the original plan and the analysis doc both said
"AirPadStatus is an enum with `(0, Off) (1, On)` labels". That was wrong.
AirPadStatus has no enum_metadata. The tool must not assume any binary or
mbbi-style PV has enum_metadata; it must check on each channel and fall back
to integer rendering when absent.
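The two-column storage observation reduces to a small coalescing helper; a sketch assuming the CSS precedence `float_val` → `num_val` → `str_val` (the helper name is ours):

```python
def coalesce_value(float_val, num_val, str_val):
    """Pick the stored value from the three sample columns using the CSS
    coalescing precedence: float_val, then num_val, then str_val."""
    if float_val is not None:
        return float_val
    if num_val is not None:
        return num_val
    return str_val

# AirPadStatus sample: the integer lives in num_val.
assert coalesce_value(None, 1, None) == 1
# Analog motor sample: float_val wins.
assert coalesce_value(4.4057, None, None) == 4.4057
```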

---

## Detail: real Disconnected gap markers in DANGLE.RBV history

The query `SELECT ... WHERE channel_id=80401 AND status_id != 1` over the
last 30 days returned multiple rows with status `Disconnected` and **all
value columns NULL**:

```
(2026-03-15 10:06:25, sev=2 INVALID, st=4 Disconnected, num_val=None, float_val=None, str_val='Disconnected')
(2026-03-15 10:08:23, sev=4 MAJOR,   st=39 (??),         num_val=None, float_val=5.116,  str_val=None)
... (more pairs of Disconnected → recovery)
(2026-03-16 18:46:17, sev=2 INVALID, st=4 Disconnected, num_val=None, float_val=None, str_val='Disconnected')
(2026-03-16 18:46:40, sev=4 MAJOR,   st=39,              num_val=None, float_val=3390.0, str_val=None)  ← suspicious recovery position
```

Two important takeaways:

1. **`Disconnected` rows have `str_val='Disconnected'`** (not null), and both
   `num_val` and `float_val` are NULL. So the CSS coalescing precedence
   (float → num → str) would naturally select the `str_val='Disconnected'`
   string for these rows. The tool should still surface the status column
   alongside, and set the `is_marker=true` flag, so the agent doesn't have
   to know that the string value `"Disconnected"` is special.

2. **One recovery row has `float_val=3390.0`** for DANGLE.RBV, which is far
   outside the motor's `[-1000, 1000]` display range from `num_metadata`. This
   is an obvious bad value and is exactly the kind of thing an agent
   investigating motion failures wants to find. **The tool should flag any
   value outside `num_metadata.[low_disp_rng, high_disp_rng]` as a potential
   anomaly** (or at least make the metadata available so the agent can
   compare).
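Both flags can come out of one classification step; a hedged sketch (the function and flag names are ours, mirroring the proposed JSONL fields):

```python
GAP_MARKER_STATUS_IDS = {2, 3, 4}  # Archive_Off, Write_Error, Disconnected

def classify_sample(status_id, value, low_disp=None, high_disp=None):
    """Return (is_marker, out_of_range) for one coalesced sample.
    out_of_range is only computed when num_metadata display limits exist."""
    is_marker = status_id in GAP_MARKER_STATUS_IDS
    out_of_range = (
        not is_marker
        and isinstance(value, (int, float))
        and low_disp is not None
        and high_disp is not None
        and not (low_disp <= value <= high_disp)
    )
    return is_marker, out_of_range
```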

The recovery `status_id=39` is not in the first 20 rows of `chan_arch.status`
that I dumped. That status code is unknown to me; the tool's
`--describe-schema` should dump the full status table on first run so the
agent can decode it.

---

## Detail: query performance notes

| Query                                                                              | Rows  | Latency |
|------------------------------------------------------------------------------------|-------|---------|
| `SELECT COUNT(*) FROM chan_arch.channel`                                           | 1     | < 0.5 s |
| `SELECT … FROM chan_arch.channel WHERE LOWER(name) LIKE 'bl4a:mot:dangle%'`        | 24    | < 0.5 s |
| `SELECT ... FROM chan_arch.sample WHERE channel_id=:cid AND smpl_time BETWEEN ...` | 120   | < 0.2 s |
| `archive_reader_pkg.get_actual_start_time(:cid, :ts, :te)`                         | 1     | < 0.1 s |
| Inline `WHERE smpl_time <= :ts ORDER BY smpl_time DESC` wrapped with `ROWNUM = 1`  | 1     | **>60 s, hangs** |

**The inline at-or-before query is unusable** on the production schema —
Oracle's optimizer doesn't push the LIMIT-1 down into the inner reverse-scan
efficiently. The stored procedure is dramatically faster.

**Plan revision:** the tool must use `archive_reader_pkg.get_actual_start_time`
for the carry-forward step (matching CSS exactly), not the fallback inline
SQL pattern. The fallback is only useful for non-SNS deployments that don't
have the stored procedure installed, which is not our use case.

Alternatively, `FETCH FIRST 1 ROW ONLY` (Oracle 12c+) may also work — it
gives the optimizer a different hint. We didn't measure this in the live test
because the stored procedure already worked. The tool should default to the
stored procedure and offer `--no-stored-procedures` as an escape hatch that
uses `FETCH FIRST 1 ROW ONLY`.
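A sketch of that escape hatch (not measured against production; the SQL shape and bind names are ours, and the stored procedure remains the default):

```python
# Fallback at-or-before lookup for --no-stored-procedures (Oracle 12c+).
FALLBACK_AT_OR_BEFORE_SQL = """
SELECT smpl_time, severity_id, status_id, num_val, float_val, str_val
  FROM chan_arch.sample
 WHERE channel_id = :cid
   AND smpl_time <= :ts
 ORDER BY smpl_time DESC
 FETCH FIRST 1 ROW ONLY
"""

def at_or_before(cur, channel_id, ts):
    """Fetch the newest sample at or before ts using the fallback query.
    `cur` is any DB-API cursor (e.g. from python-oracledb thin mode)."""
    cur.execute(FALLBACK_AT_OR_BEFORE_SQL, cid=channel_id, ts=ts)
    return cur.fetchone()
```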

---

## Refinements to fold into `plan/archiver-query-tool.md`

| § | Refinement |
|---|---|
| Credentials | Add an `ALTER SESSION SET TIME_ZONE = 'America/New_York'` directive immediately after connect. **Mandatory.** Without this, the tool only works on Eastern-timezone hosts. |
| CLI surface | No `--strict-window` is needed in v1 — carry-forward via stored procedure is fast and the agent benefit is universal. |
| Severity/status decoding | Hard-code the integer→string maps for `severity` (all 11 entries) and `status` (first 20, captured above) as Python dicts at module top. Avoid the JOIN. Marker IDs are `{2, 3, 4}`. |
| Enum rendering | When a PV has no `enum_metadata` rows, **do not** assume the integer is an enum — return it as a number. AirPadStatus is the canonical example. |
| Anomaly flagging | When a sample's coalesced value is outside `num_metadata.[low_disp_rng, high_disp_rng]`, set `out_of_range=true` in the JSONL output. Only emit if the metadata is present. |
| Status table dump | `--describe-schema` should fetch the full status and severity tables and print them — the tool's hard-coded constants are based on a snapshot, and an agent investigating an unfamiliar status code (e.g. status 39 above) needs the up-to-date mapping. |
| Carry-forward implementation | Use `archive_reader_pkg.get_actual_start_time(channel_id, start, end)` as the default. Inline SQL is unusable on the production schema. |
| Verification table T8 | Update — AirPadStatus rows should be `0`/`1` integers, not enum strings. |

The plan refinements are a small delta on top of what's already there. I'll
apply them in a follow-up commit.