Commit 83469746 authored by Vacaliuc, Bogdan

results: Step 2/3/4 implementation captured

Documents the as-built deltas from plan/archiver-query-tool.md:

- The one bug found in implementation: Python datetime binds as Oracle DATE
  (second precision) by default in oracledb thin mode. Without explicit
  setinputsizes(DB_TYPE_TIMESTAMP), the carry-forward result is silently
  truncated and BETWEEN pulls in extra rows from the same wall-clock second.
- CSS CSV format value-propagation rules — why the database has 121 rows
  for DANGLE.RBV but the operator CSV has 127 (the 6 extras are propagated
  from other PVs' events in the multi-PV export).
- Float precision: CSS uses num_metadata.prec for fixed-precision rendering,
  trailing zeros kept, not stripped. f"{v:.{prec}f}" matches byte-for-byte.
- status_id=39 from the Step 1 live test was LINK_ALARM — EPICS standard
  "input link broken" alarm seen on first samples after Disconnected.
- The full T6/T7/T8/T10 verification gates from the plan all pass.

Tool is committed to main as setup/archiver-query and setup/archiver-query.sh.
This branch's role in the development is now complete; the per-tasking-branch
knowledge capture (Step 6 of the plan) follows in the next commits to other
tasking branches.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Step 2 / Step 3 / Step 4 — Implementation Results

**Date:** 2026-04-11
**Branch:** `tasking/css-archiver-query-tool-development`
**Outcome:** Tool committed to `main` as `setup/archiver-query` and
`setup/archiver-query.sh` with byte-for-byte parity to the operator-exported
DANGLE incident CSV.

This document captures the as-built deltas from `plan/archiver-query-tool.md`
that any future maintainer (or follow-up investigation) needs to know.

---

## Summary

| Plan step | Status | Notes |
|-----------|--------|-------|
| Step 1 (live read feasibility) | ✅ already done | See `step1-live-test-results.md` |
| Step 2 (MVP) | ✅ done | One bug fixed live (TIMESTAMP bind precision) |
| Step 3 (CSV correctness gate) | ✅ done | T6, T7, T8, T10 all pass; byte-for-byte CSV match |
| Step 4 (CLI flesh-out) | ✅ done | Multi-PV, csv/tsv, search, describe-* implemented |
| Step 5 (wrapper + docs + commit) | ✅ done | Tool on `main`, merged into `uvdl3` |
| Step 6 (knowledge capture) | in progress | Per-branch tasking/CLAUDE.md updates pending |

The tool source is `setup/archiver-query` (~1090 lines of Python) plus
`setup/archiver-query.sh` (a uv-managed bash wrapper). The per-archive
credentials file lives at `~/.config/sns-archiver/credentials` (mode 600) and
is bootstrapped from `~/opt/css/product-sns-4.7.4-SNAPSHOT/settings.ini` on
first run.
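A rough sketch of what that first-run bootstrap amounts to. The key names and
the credentials-file layout below are assumptions for illustration, not the
tool's actual format; only the two paths and the 600 mode come from this
document:

```python
from pathlib import Path

CRED = Path.home() / ".config/sns-archiver/credentials"
SETTINGS = Path.home() / "opt/css/product-sns-4.7.4-SNAPSHOT/settings.ini"

def bootstrap_credentials() -> Path:
    """First run only: seed the credentials file from the CSS settings."""
    if not CRED.exists():
        # Hypothetical flat key=value parse; the real keys in settings.ini
        # may differ.
        kv = dict(
            line.split("=", 1)
            for line in SETTINGS.read_text().splitlines()
            if "=" in line and not line.startswith("#")
        )
        CRED.parent.mkdir(parents=True, exist_ok=True)
        CRED.write_text(f"user={kv.get('user', '')}\npassword={kv.get('password', '')}\n")
        CRED.chmod(0o600)  # per-archive credentials must stay private
    return CRED
```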

---

## The one live bug found in implementation

The Step 1 live test caught the inline-SQL carry-forward hang and the
mandatory `ALTER SESSION SET TIME_ZONE` requirement. **Step 2 found one
additional issue** that the source review missed:

### Python `datetime` binds as Oracle `DATE`, not `TIMESTAMP`

By default, `oracledb` thin-mode binds Python `datetime.datetime` as Oracle
`DATE` (second precision) rather than `TIMESTAMP` (sub-second precision).
When the tool first ran, `get_actual_start_time()` correctly returned
`2026-04-08 13:20:24.921311` from the stored procedure — but passing that
datetime back to the `BETWEEN` clause as a bind variable truncated it to
`2026-04-08 13:20:24` (microseconds → 0). The 1-second-wider window pulled in
3 extra rows from the same second (4.3931, 4.396, 4.3988) before the
carry-forward sample, producing 124 rows instead of the correct 121.

**Fix:** an explicit `setinputsizes(... oracledb.DB_TYPE_TIMESTAMP)` on every
cursor that binds a datetime, in both `get_actual_start_time` and
`fetch_samples`. Without it, the bind goes through Oracle's implicit DATE
conversion and the microseconds are silently dropped.

```python
import oracledb

# Without this, thin-mode oracledb binds the datetimes as DATE and the
# microseconds are truncated before the BETWEEN comparison ever runs.
cur.setinputsizes(
    cid=None,                        # channel_id: default bind type is fine
    sts=oracledb.DB_TYPE_TIMESTAMP,  # window start, sub-second precision kept
    ets=oracledb.DB_TYPE_TIMESTAMP,  # window end
)
cur.execute(sql, cid=channel_id, sts=start, ets=end)
```
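A no-database illustration of what the implicit `DATE` bind does to the
carry-forward start time, using the values from the failure above:

```python
from datetime import datetime

actual = datetime(2026, 4, 8, 13, 20, 24, 921311)  # from get_actual_start_time()
as_date = actual.replace(microsecond=0)            # what a DATE bind keeps
assert as_date == datetime(2026, 4, 8, 13, 20, 24)
# BETWEEN with the truncated start also matches the ~0.92 s before the
# carry-forward sample: the 3 extra same-second rows observed in Step 2.
```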

**Why the source analysis didn't catch this:** the Phoebus Java code uses
`PreparedStatement.setTimestamp(int, Timestamp)` which binds as `TIMESTAMP`
unconditionally. Python `oracledb` uses `DATE` as the default Python-datetime
mapping for backwards compatibility with `cx_Oracle`. This is a Python-side
gotcha, not a schema issue.

**Generalization:** any future Python tool that talks to Oracle TIMESTAMP
columns via `oracledb` bind variables needs `DB_TYPE_TIMESTAMP` setinputsizes
or it will silently lose sub-second precision. Worth promoting to the parent
`CLAUDE.md` Cross-Project Patterns once enough Python-Oracle work has been
done to confirm the pattern applies broadly. **For now, captured here.**

---

## CSS Data Browser CSV format — what we learned matching it byte-for-byte

CS-Studio's `--format csv` export wraps the deltas-only sample stream in a
sparse table with one column per PV and one row for every unique timestamp
from any PV's events. The single-PV view has more rows than the database has
samples because CSS propagates last-known values forward into rows triggered
by *other* PVs' events.

Concrete numbers from the DANGLE window:
- Database has **121** `BL4A:Mot:DANGLE.RBV` samples in the 13:35–13:37 window
- Operator CSV has **127** rows where the DANGLE.RBV column is non-`#N/A`
- The 6 "extra" rows are CSS-propagated values triggered by changes to one
  of the other 3 PVs in the same export

Our `write_css_csv()` does the same; a minimal sketch follows the list:
1. Build per-PV `{ts_ms: value}` maps from the actual database samples
2. Compute the sorted union of all timestamps across all PVs
3. At each timestamp, walk forward: if a PV has a sample at that timestamp,
   update its "last-known" value; otherwise keep the previous last-known
4. Emit `#N/A` for any PV that has no last-known yet (no carry-forward and
   no sample-at-this-time)
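The names below are illustrative rather than the tool's actual internals; only
the four steps themselves come from the implementation:

```python
# Carry-forward sparse table, exactly the four steps above.
# `samples` maps pv -> {timestamp_ms: rendered value string}.
def build_sparse_table(samples: dict[str, dict[int, str]]) -> list[list[str]]:
    pvs = list(samples)
    # Step 2: sorted union of every timestamp seen by any PV.
    all_ts = sorted({ts for per_pv in samples.values() for ts in per_pv})
    last: dict[str, str | None] = {pv: None for pv in pvs}
    rows = []
    for ts in all_ts:
        row = [str(ts)]
        for pv in pvs:
            if ts in samples[pv]:           # step 3: update last-known
                last[pv] = samples[pv][ts]
            row.append(last[pv] if last[pv] is not None else "#N/A")  # step 4
        rows.append(row)
    return rows
```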

Float rendering also matters for byte-for-byte parity:
- CSS uses `num_metadata.prec` (4 for DANGLE.RBV) and emits fixed-precision
  decimals — `4.5710` not `4.571`, and `4.4277` not `4.427700000000001`
- Trailing zeros are **kept**, not stripped
- `_stringify_value(v, prec=4)` does `f"{v:.4f}"` which matches exactly
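The fixed-precision rule is easy to verify with the exact values above:

```python
prec = 4                                    # num_metadata.prec for DANGLE.RBV
assert f"{4.427700000000001:.{prec}f}" == "4.4277"
assert f"{4.571:.{prec}f}" == "4.5710"      # trailing zero kept, not stripped
```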

JSONL output is unaffected — it still emits raw IEEE 754 floats so the agent
has full precision when doing arithmetic on values.
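For instance, Python's `json` module serializes floats via their `repr`, so
nothing is rounded on the way out:

```python
import json
# repr-based float serialization: the stored double survives untouched
print(json.dumps({"value": 4.427700000000001}))  # {"value": 4.427700000000001}
```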

---

## Recovery samples after `Disconnected` are `LINK_ALARM (39)`

The Step 1 live test surfaced `status_id=39` on recovery samples for
`BL4A:Mot:DANGLE.RBV` immediately after `Disconnected` markers and noted it
was unknown. **Step 4's `--describe-schema` revealed it: `LINK_ALARM`**.

This is the EPICS standard "input link broken" alarm — exactly what you'd
expect when an IOC is just coming back online from a network glitch and the
record's input link hasn't re-validated yet. The agent investigating a
Disconnected window can now expect to see `LINK_ALARM` recovery samples and
treat them as the "first sample after the IOC came back" rather than a
mysterious code.

The tool's hard-coded `STATUS_NAMES` dict was extended to include `LINK_ALARM`
during Step 4 work. The full live status table (47 entries) is reachable via
`./setup/archiver-query.sh --describe-schema`.
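The shape of that lookup, showing only the entry this document confirms (the
real dict carries far more of the 47 live statuses):

```python
# Only status 39 is confirmed here; --describe-schema dumps all 47 live entries.
STATUS_NAMES: dict[int, str] = {
    39: "LINK_ALARM",  # EPICS "input link broken", expected right after Disconnected
}

def status_name(status_id: int) -> str:
    # Fall back to the raw id rather than guessing at unknown statuses.
    return STATUS_NAMES.get(status_id, f"UNKNOWN({status_id})")
```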

---

## What's NOT in the v1 tool (deliberate)

1. **No waveform / `array_val` BLOB support.** The BLOB format is documented
   in the analysis doc and would be straightforward to add, but no v1 use
   case needs it.
2. **No OPTIMIZED stored-procedure path** (`get_browser_data` with
   min/max/avg buckets). Raw is fine for minute-to-hour windows.
3. **No PV-name wildcards in `--pv`.** `--search GLOB` is the discovery
   path; `--pv` takes explicit names.
4. **No `--decimate N` for histogram-style compression.** Future v2 if a
   multi-day window investigation needs it.
5. **No write access.** Read-only by design; the credentials file
   authenticates as a read-only Oracle account anyway.

---

## How to use the tool from this branch's investigation context

Even though this branch is for the *development* of the tool, the tool is
already usable from any clone after `git checkout main && git pull`. The
tool's wrapper handles its own venv, so you don't need to be on this branch
to use it.

```bash
# Smoke test
./setup/archiver-query.sh \
    --pv 'BL4A:Mot:DANGLE.RBV' \
    --start '2026-04-08 13:35:00' \
    --end   '2026-04-08 13:37:00'

# Reproduce the operator CSV exactly
./setup/archiver-query.sh \
    --pv 'BL4A:Mot:DANGLE.RBV,BL4A:Mot:DANGLE,BL4A:Mot:DANGLE.DMOV,BL4A:Mot:AirPadStatus' \
    --start '2026-04-08 13:35:00' --end '2026-04-08 13:37:00' \
    --format csv -o /tmp/dangle.csv

# Diff against the original
grep -v '^#' /tmp/dangle.csv | grep -v '^$' > /tmp/ours-data.tsv
grep -v '^#' ~/analysis/BL4A/2026/04/09/bl4a-DANGLE-operation-fault-2026-04-08_1335.csv | grep -v '^$' > /tmp/csv-data.tsv
diff /tmp/ours-data.tsv /tmp/csv-data.tsv && echo "✅ exact match"
```

---

## What this unblocks for future investigations

Any branch on the tasking submodule that needs archive-time-series data can
now:

1. Run `./setup/archiver-query.sh --pv ... --start ... --end ...` directly
   from any session, on any machine that can reach `snsoroda-scan:1521`
2. Get JSONL output the agent can parse without operator help
3. Get sub-millisecond timestamp precision and per-PV anomaly flags
4. Discover unfamiliar PVs via `--search` and inspect their metadata via
   `--describe-channel` before assuming semantics
5. Reproduce a CS-Studio-equivalent CSV for human consumption when needed
   (`--format csv`)

The 3-day operator-CSV-export round-trip that dominated the DANGLE
investigation collapses to a one-shot bash invocation. Each tasking branch
that touches archiver data should be told about this; that's what Step 6
("Knowledge capture") of the plan does.

---

## File inventory

| File | Branch | Purpose |
|------|--------|---------|
| `setup/archiver-query` | `main` | Python entrypoint (~1090 lines) |
| `setup/archiver-query.sh` | `main` | bash wrapper (uv venv mgmt) |
| `setup/docs/sns-archiver-query.md` | `main` | user-facing docs |
| `CLAUDE.md` (Cross-Project Patterns) | `main` | quick-reference summary |
| `plan/archiver-query-tool.md` | `main` | original plan + refinements |
| `tasking/css-rdb-reader-analysis.md` | this branch | Phoebus source review |
| `tasking/step1-live-test-results.md` | this branch | live feasibility test |
| `tasking/step2-3-implementation-results.md` | this branch | this file |

---

## Bottom line

The plan was **mostly** implementable as written — the source review and
Step 1 live test had already eliminated the major uncertainties. The single
implementation surprise was the Python `datetime` → Oracle `DATE` bind
truncation, which would have been invisible in a side-by-side diff against
CSS Java's JDBC binds. Caught and fixed in ~10 minutes by walking the row
count down to the expected 121 and noticing the discrepancy was off-by-3, not
the off-by-one it first looked like: the signature of a window widened by up
to a second.

Tool is committed to `main`, merged into `uvdl3`, ready for Step 6 knowledge
capture across the other tasking branches.