Commit bf2e73a4 authored by Vacaliuc, Bogdan's avatar Vacaliuc, Bogdan

results: Step 1 live read test against snsoroda-scan



All exit criteria met. Captures the eight verification queries against
the production Instruments archiver and the resulting findings.

Key results:

* Connect: oracledb 3.4.2 thin mode connects to
  snsoroda-scan.sns.gov:1521/scprod_controls as sns_reports in 0.09 s.
  Server is Oracle Database 19.27.0.0.0.

* Schema: chan_arch.channel has 428,106 rows. All four DANGLE-related
  PVs (DANGLE, DANGLE.RBV, DANGLE.DMOV, AirPadStatus) resolve cleanly
  to channel IDs 80399-80402.

* Sample query: 120 rows for BL4A:Mot:DANGLE.RBV in the 13:35-13:37
  EDT incident window. First row matches the operator-exported CSV
  exactly: (13:35:48.698608, 4.4057) <-> CSV row "1775669748698  4.4057".

* Timezone: Oracle DB is configured in UTC; session inherits client
  OS timezone. Round-trip works on dragonfly because the OS is Eastern.
  Tool MUST issue ALTER SESSION SET TIME_ZONE = 'America/New_York'
  immediately after connect for portability to non-Eastern hosts.

* Severity/status integer maps captured live so the tool can hard-code
  them instead of joining on every row. Gap-marker status IDs are
  {2: Archive_Off, 3: Write_Error, 4: Disconnected}. Normal-state is
  severity_id=1 (NONE), NOT severity_id=0 (which is Unused).

* AirPadStatus has NO enum_metadata rows -- contradicts the original
  plan and the source-analysis doc, which both assumed it would have
  (0, "Off") (1, "On"). The tool must not assume binary PVs are enums.

* Real Disconnected gap markers found in the historical record. They
  have severity=2 INVALID, status=4 Disconnected, all numeric value
  columns NULL, and str_val='Disconnected'. CSS coalescing returns the
  string naturally, but the tool should additionally surface
  is_marker=true via integer comparison.

* num_metadata for DANGLE.RBV: display range [-1000, 1000] deg, prec 4.
  A historical recovery sample shows float_val=3390.0 -- wildly out of
  range, the kind of bad value an investigator wants flagged. Tool should
  set out_of_range=true when num_metadata is present.

* Stored procedure archive_reader_pkg.get_actual_start_time(cid,ts,te)
  works for the carry-forward initial-sample lookup. The inline-SQL
  fallback hangs >60 s on the production schema -- Oracle doesn't push
  LIMIT-1 down through the reverse scan. Tool must use the stored proc.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
parent a15bddb2
# Step 1 — Live Read Test Results

**Date:** 2026-04-11
**Branch:** `tasking/css-archiver-query-tool-development`
**Target:** `snsoroda-scan.sns.gov:1521/scprod_controls` as `sns_reports`
**Driver:** Python `oracledb` 3.4.2 thin mode

This document records the result of the Step 1 feasibility test prescribed by
`plan/archiver-query-tool.md`. All temporary scripts (`/tmp/*-xZRkXM.py`,
mode 600) were deleted after the run; this file is the durable record.

---

## Summary

| Check                                          | Result                                                                                                                                  |
|------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------|
| `oracledb` thin-mode connect                   | ✅ 0.09 s                                                                                                                                |
| Server identification                          | Oracle Database 19.27.0.0.0                                                                                                              |
| `chan_arch.channel` row count                  | **428,106 channels** in the Instruments DB                                                                                               |
| DANGLE channel resolution                      | ✅ 24 BL4A:Mot:DANGLE\* channels found; all four target PVs resolve cleanly                                                              |
| Sample query in DANGLE incident window         | ✅ 120 rows for `BL4A:Mot:DANGLE.RBV` between 13:35:00–13:37:00 EDT, matches operator CSV exactly                                        |
| Stored procedure `get_actual_start_time`       | ✅ Returns datetime `2026-04-08 13:20:24.921311` (15 min before incident start), confirms initial-sample carry-forward path is usable    |
| `enum_metadata` for AirPadStatus               | ⚠️ **No rows.** AirPadStatus is stored as a raw integer in `num_val` — CSS would render it as `0`/`1`, not as an enum string             |
| `num_metadata` for DANGLE.RBV                  | ✅ Display range -1000..1000, prec=4, unit=`'deg'`                                                                                       |
| Session/DB timezone                            | Session = `-04:00` (EDT, client local), DB = `+00:00` (UTC). Naive datetimes round-trip to operator CSV epoch-ms exactly.                 |
| Round-trip vs operator CSV                     | ✅ **Exact** — first DANGLE.RBV row `(13:35:48.698608, 4.4057)` matches CSV row `1775669748698 → 4.4057` within rounding (μs vs ms)       |
| Disconnected gap markers (last 30 days)        | ✅ Multiple `Disconnected` rows visible from March 15-20 — all value columns NULL, severity=2 INVALID, status=4 Disconnected             |

**All Step 1 exit criteria met.** Proceed to Step 2 implementation.

---

## Detail: timezone — the headline finding

```sql
SELECT sessiontimezone, dbtimezone, TO_CHAR(SYSTIMESTAMP, '...') FROM DUAL
-- session_tz: -04:00
-- db_tz:      +00:00
-- systs:      2026-04-11 13:40:32.455327 -04:00
```

The Oracle database is configured in **UTC** (`db_tz=+00:00`); the connection
session inherits the **client OS timezone** (`session_tz=-04:00` because the
dragonfly machine is in `America/New_York`). When the JDBC/Python driver
fetches a `TIMESTAMP` column, Oracle converts the stored UTC value into the
session timezone wall-clock and returns it as a naive datetime in that
timezone.

The naive `datetime` we receive is therefore **already in client local time**.
Converting it to epoch milliseconds with `.timestamp() * 1000` (in a process
whose `time.tzname` is also `America/New_York`) yields the same number as the
operator-exported CSV, whose `epoch_ms` column is a true UTC epoch.

**This works correctly only when the client OS timezone matches what was used
when the query was constructed.** If the tool ever runs on a UTC host (or
Pacific, etc.) without a timezone fix, both the input window timestamps *and*
the output sample timestamps will be off by the local-vs-Eastern offset.
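The round-trip can be reproduced offline with the stdlib `zoneinfo` module; a minimal sketch using the sample values quoted above, where attaching `America/New_York` explicitly stands in for running on an Eastern-timezone host:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# First DANGLE.RBV sample in the incident window, as fetched from Oracle:
# a naive datetime already expressed in Eastern wall-clock time.
naive = datetime(2026, 4, 8, 13, 35, 48, 698608)

# Attach the Eastern zone explicitly (equivalent to a client OS timezone
# of America/New_York) and convert to epoch milliseconds.
aware = naive.replace(tzinfo=ZoneInfo("America/New_York"))
epoch_ms = int(aware.timestamp() * 1000)
# epoch_ms == 1775669748698, the operator CSV's epoch_ms for this row
```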

**The fix is mandatory and one line:**

```python
cur.execute("ALTER SESSION SET TIME_ZONE = 'America/New_York'")
```

This pins the session to Eastern regardless of client OS, so naive datetimes
always behave the same way. The tool MUST do this immediately after connect,
before any sample queries.

(Verified working in the second probe script.)

---

## Detail: severity and status enum tables

These were captured live from `chan_arch.severity` and `chan_arch.status` so
the tool can hard-code the integer→string mapping instead of doing a JOIN on
every row.

### `chan_arch.severity`

```
0  Unused
1  NONE          ← the normal state for a healthy sample
2  INVALID
3  UNDEFINED
4  MAJOR
5  MINOR
6  OK
7  MINOR_ACK
8  MAJOR_ACK
9  INVALID_ACK
10 UNDEFINED_ACK
```

### `chan_arch.status` (first 20 — there are more)

```
1  NO_ALARM
2  Archive_Off       ★ gap marker
3  Write_Error       ★ gap marker
4  Disconnected      ★ gap marker
5  HIHI_ALARM
6  HIGH_ALARM
7  LOW_ALARM
8  STATE_ALARM
9  LOLO_ALARM
10 READ_ALARM
11 TIMEOUT_ALARM
12 UDF_ALARM
13 OK
14 HIGH
15 HIHI
16 LOLO
17 LOW
18 READ
19 WRITE
20 STATE
...
```

**Gap-marker status IDs are `{2, 3, 4}`**: `Archive_Off`, `Write_Error`, and
`Disconnected`. Phoebus's `AbstractRDBValueIterator.filterSeverity()` does a
case-insensitive string comparison; we can use an integer set-membership test
instead, which is faster and avoids accidental name drift.

**The tool should hard-code these constants** rather than re-query the lookup
tables on every invocation. They are unlikely to change — they are set at
archive-engine schema-creation time and CSS depends on them.
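A sketch of those constants, transcribed from the live dump above (the names `SEVERITY`, `STATUS`, and `GAP_MARKER_STATUS_IDS` are ours, not part of any existing module):

```python
# Integer -> string maps captured from chan_arch.severity / chan_arch.status.
SEVERITY = {
    0: "Unused", 1: "NONE", 2: "INVALID", 3: "UNDEFINED", 4: "MAJOR",
    5: "MINOR", 6: "OK", 7: "MINOR_ACK", 8: "MAJOR_ACK",
    9: "INVALID_ACK", 10: "UNDEFINED_ACK",
}
STATUS = {
    1: "NO_ALARM", 2: "Archive_Off", 3: "Write_Error", 4: "Disconnected",
    5: "HIHI_ALARM", 6: "HIGH_ALARM", 7: "LOW_ALARM", 8: "STATE_ALARM",
    9: "LOLO_ALARM", 10: "READ_ALARM", 11: "TIMEOUT_ALARM", 12: "UDF_ALARM",
    13: "OK", 14: "HIGH", 15: "HIHI", 16: "LOLO", 17: "LOW",
    18: "READ", 19: "WRITE", 20: "STATE",
}
GAP_MARKER_STATUS_IDS = frozenset({2, 3, 4})

def is_gap_marker(status_id: int) -> bool:
    """Integer membership test replacing Phoebus's string comparison."""
    return status_id in GAP_MARKER_STATUS_IDS
```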

---

## Detail: AirPadStatus is *not* an enum in the archive

We expected `chan_arch.enum_metadata` for `BL4A:Mot:AirPadStatus` to return
`(0, "Off"), (1, "On")` rows so the tool could render it as enum strings
like CSS does for other mbbi PVs. **It returned no rows at all.**

Sample data for the channel (in the DANGLE incident window):

```
(2026-04-08 13:35:44.992699, sev=5 MINOR, st=8 STATE_ALARM, num_val=1, float_val=None, str_val=None)
(2026-04-08 13:36:26.710821, sev=1 NONE,  st=1 NO_ALARM,    num_val=0, float_val=None, str_val=None)
(2026-04-08 13:38:42.793420, sev=5 MINOR, st=8 STATE_ALARM, num_val=1, float_val=None, str_val=None)
(2026-04-08 13:38:53.590670, sev=1 NONE,  st=1 NO_ALARM,    num_val=0, float_val=None, str_val=None)
```

Observations:

- **Storage column:** `num_val` (integer), not `float_val`. Confirms the
  coalescing rule must look at both columns — analog motors use `float_val`,
  integer status PVs use `num_val`.
- **The channel is alarming when value=1.** The IOC has STATE_ALARM
  configured for the "On" state with severity MINOR. This is the *intended*
  behaviour per the operator: when air pads are pressurized, the operator is
  reminded by a minor alarm to release them after the move.
- **No enum_metadata.** CSS Phoebus would render these samples as raw `1`s and
  `0`s, not as `"On"`/`"Off"`. The tool must do the same: render the integer
  value, and document that the operator can use `--describe-channel` to
  discover whether enum_metadata exists for a channel before assuming enum
  semantics.
- **The values match the operator CSV exactly** (column 5 in the CSV =
  `0,1,0` for the three transitions visible in the same window).

**Plan revision needed:** the original plan and the analysis doc both said
"AirPadStatus is an enum with `(0, Off) (1, On)` labels". That was wrong.
AirPadStatus has no enum_metadata. The tool must not assume any binary or
mbbi-style PV has enum_metadata; it must check on each channel and fall back
to integer rendering when absent.
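The two-column storage observation reduces to a small coalescing helper; a sketch assuming the CSS precedence `float_val` → `num_val` → `str_val` (the helper name is ours):

```python
def coalesce_value(float_val, num_val, str_val):
    """Pick the stored value from the three sample columns using the CSS
    coalescing precedence: float_val, then num_val, then str_val."""
    if float_val is not None:
        return float_val
    if num_val is not None:
        return num_val
    return str_val

# AirPadStatus sample: the integer lives in num_val.
assert coalesce_value(None, 1, None) == 1
# Analog motor sample: float_val wins.
assert coalesce_value(4.4057, None, None) == 4.4057
```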

---

## Detail: real Disconnected gap markers in DANGLE.RBV history

The query `SELECT ... WHERE channel_id=80401 AND status_id != 1` over the
last 30 days returned multiple rows with status `Disconnected` and **all
value columns NULL**:

```
(2026-03-15 10:06:25, sev=2 INVALID, st=4 Disconnected, num_val=None, float_val=None, str_val='Disconnected')
(2026-03-15 10:08:23, sev=4 MAJOR,   st=39 (??),         num_val=None, float_val=5.116,  str_val=None)
... (more pairs of Disconnected → recovery)
(2026-03-16 18:46:17, sev=2 INVALID, st=4 Disconnected, num_val=None, float_val=None, str_val='Disconnected')
(2026-03-16 18:46:40, sev=4 MAJOR,   st=39,              num_val=None, float_val=3390.0, str_val=None)  ← suspicious recovery position
```

Two important takeaways:

1. **`Disconnected` rows have `str_val='Disconnected'`** (not null), and both
   `num_val` and `float_val` are NULL. So the CSS coalescing precedence
   (float → num → str) would naturally select the `str_val='Disconnected'`
   string for these rows. The tool should still surface the status column
   alongside, and set the `is_marker=true` flag, so the agent doesn't have
   to know that the string value `"Disconnected"` is special.

2. **One recovery row has `float_val=3390.0`** for DANGLE.RBV, which is far
   outside the motor's `[-1000, 1000]` display range from `num_metadata`. This
   is an obvious bad value and is exactly the kind of thing an agent
   investigating motion failures wants to find. **The tool should flag any
   value outside `num_metadata.[low_disp_rng, high_disp_rng]` as a potential
   anomaly** (or at least make the metadata available so the agent can
   compare).
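Both flags can come out of one classification step; a hedged sketch (the function and flag names are ours, mirroring the proposed JSONL fields):

```python
GAP_MARKER_STATUS_IDS = {2, 3, 4}  # Archive_Off, Write_Error, Disconnected

def classify_sample(status_id, value, low_disp=None, high_disp=None):
    """Return (is_marker, out_of_range) for one coalesced sample.
    out_of_range is only computed when num_metadata display limits exist."""
    is_marker = status_id in GAP_MARKER_STATUS_IDS
    out_of_range = (
        not is_marker
        and isinstance(value, (int, float))
        and low_disp is not None
        and high_disp is not None
        and not (low_disp <= value <= high_disp)
    )
    return is_marker, out_of_range
```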

The recovery `status_id=39` is not in the first 20 rows of `chan_arch.status`
that I dumped. That status code is unknown to me; the tool's
`--describe-schema` should dump the full status table on first run so the
agent can decode it.

---

## Detail: query performance notes

| Query                                                                              | Rows  | Latency |
|------------------------------------------------------------------------------------|-------|---------|
| `SELECT COUNT(*) FROM chan_arch.channel`                                           | 1     | < 0.5 s |
| `SELECT … FROM chan_arch.channel WHERE LOWER(name) LIKE 'bl4a:mot:dangle%'`        | 24    | < 0.5 s |
| `SELECT ... FROM chan_arch.sample WHERE channel_id=:cid AND smpl_time BETWEEN ...` | 120   | < 0.2 s |
| `archive_reader_pkg.get_actual_start_time(:cid, :ts, :te)`                         | 1     | < 0.1 s |
| Inline `WHERE smpl_time <= :ts ORDER BY smpl_time DESC` wrapped with `ROWNUM = 1`  | 1     | **>60 s, hangs** |

**The inline at-or-before query is unusable** on the production schema —
Oracle's optimizer doesn't push the LIMIT-1 down into the inner reverse-scan
efficiently. The stored procedure is dramatically faster.

**Plan revision:** the tool must use `archive_reader_pkg.get_actual_start_time`
for the carry-forward step (matching CSS exactly), not the fallback inline
SQL pattern. The fallback is only useful for non-SNS deployments that don't
have the stored procedure installed, which is not our use case.

Alternatively, `FETCH FIRST 1 ROW ONLY` (Oracle 12c+) may also work — it
gives the optimizer a different hint. We didn't measure this in the live test
because the stored procedure already worked. The tool should default to the
stored procedure and offer `--no-stored-procedures` as an escape hatch that
uses `FETCH FIRST 1 ROW ONLY`.
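A sketch of that escape hatch (not measured against production; the SQL shape and bind names are ours, and the stored procedure remains the default):

```python
# Fallback at-or-before lookup for --no-stored-procedures (Oracle 12c+).
FALLBACK_AT_OR_BEFORE_SQL = """
SELECT smpl_time, severity_id, status_id, num_val, float_val, str_val
  FROM chan_arch.sample
 WHERE channel_id = :cid
   AND smpl_time <= :ts
 ORDER BY smpl_time DESC
 FETCH FIRST 1 ROW ONLY
"""

def at_or_before(cur, channel_id, ts):
    """Fetch the newest sample at or before ts using the fallback query.
    `cur` is any DB-API cursor (e.g. from python-oracledb thin mode)."""
    cur.execute(FALLBACK_AT_OR_BEFORE_SQL, cid=channel_id, ts=ts)
    return cur.fetchone()
```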

---

## Refinements to fold into `plan/archiver-query-tool.md`

| § | Refinement |
|---|---|
| Credentials | Add an `ALTER SESSION SET TIME_ZONE = 'America/New_York'` directive immediately after connect. **Mandatory.** Without this, the tool only works on Eastern-timezone hosts. |
| CLI surface | No `--strict-window` is needed in v1 — carry-forward via stored procedure is fast and the agent benefit is universal. |
| Severity/status decoding | Hard-code the integer→string maps for `severity` (all 11 entries) and `status` (first 20, captured above) as Python dicts at module top. Avoid the JOIN. Marker IDs are `{2, 3, 4}`. |
| Enum rendering | When a PV has no `enum_metadata` rows, **do not** assume the integer is an enum — return it as a number. AirPadStatus is the canonical example. |
| Anomaly flagging | When a sample's coalesced value is outside `num_metadata.[low_disp_rng, high_disp_rng]`, set `out_of_range=true` in the JSONL output. Only emit if the metadata is present. |
| Status table dump | `--describe-schema` should fetch the full status and severity tables and print them — the tool's hard-coded constants are based on a snapshot, and an agent investigating an unfamiliar status code (e.g. status 39 above) needs the up-to-date mapping. |
| Carry-forward implementation | Use `archive_reader_pkg.get_actual_start_time(channel_id, start, end)` as the default. Inline SQL is unusable on the production schema. |
| Verification table T8 | Update — AirPadStatus rows should be `0`/`1` integers, not enum strings. |

The plan refinements are a small delta on top of what's already there. I'll
apply them in a follow-up commit.