RAPS 2.0 (!134) · Merge requests · ExaDigiT / sim-raps

Generated Diff since last update of main:

Release summary — merge develop → main

This merge brings develop into main as a large feature and stabilization release. It adds more robust dataloader and telemetry support (Philly, MIT, BlueWaters, Frontier, Marconi, Lassen, Fugaku, Google Cluster V2); network topology modeling and plotting (fat-tree, dragonfly, torus3d) with inter-job congestion and slowdown modeling; HPL analytical model integration; an RL scheduler and training tooling; major engine/tick refactors (time-delta and sub-second support); dataset-download tooling; and extensive test and quality fixes. See the migration notes at the end for config migrations and testing recommendations.

⸻

Highlights:
• Much-improved dataloader & telemetry support for Philly, MIT/mit_supercloud, BlueWaters, Marconi100, Frontier, Lassen, AdastraMI250, Fugaku, and Google Cluster V2; more robust handling of start/end/trace times and npz playback.
• Network modeling & plotting: fat-tree/dragonfly/torus3d topologies, topology plotting helpers, inter-job congestion synthetic workloads and network slowdown/dilation reporting.
• Simulation engine and tick refactors: clearer tick/time-delta semantics, engine-driven current_timestep, batching for performance, sub-second downscale factor, and improved prepare/fast-forward handling.
• Replay / reschedule / policy improvements: explicit PolicyType handling (replay vs reschedule), fixes to replay policy/config handling, arrival/poisson arrival rate support, and more consistent handling of telemetry_start vs sim_config.start.
• HPL analytical model: initial integration of Hao’s HPL model (per-iteration model calls) and added HPL test cases.
• RL framework & schedulers: RL scheduler, train-rl subcommand, PPO exposure, SB3-style stats logging and training metrics.
• Datasets & convenience: dataset-download skeleton & dataset downloaders (Frontier/Marconi100/Lassen/Fugaku/Adastra), --start/--end cli options, yaml experiment files, shell completion, /opt/data default paths.
• Extensive test & quality work: lots of pytest additions/fixes (network topology tests, telemetry tests, smoke tests), flake8/PEP8 formatting, test hardening and long-markers for slow tests.
• Various UX/CLI fixes: -o/--output, --noui, improved README examples, clearer error messages, progress bars and improved Rich live layouts.

⸻

New features & notable additions:
• Philly traces dataloader + parse_philly_traces.py and example run commands.
• Network plotting (fat-tree, dragonfly, torus3d) + -w network_test and -w inter_job_congestion synthetic workloads and plotting outputs dumped to output_dir.
• HPL analytical model (Hao 2025) integrated and exercised in tests.
• Added Calculon workload generator.
• --time-delta / downscale factor to control tick frequency and allow sub-second simulations.
• YAML experiment files support and cli validation improvements.
• --arrival (including Poisson) and --job-arrival-rate / --job-arrival-time support for arrival-time modifications.
• RL scheduler and training tooling (PPO/metrics output, train-rl).
• Ability to override dataloader via config; dataloader performance/robustness improvements.
• Telemetry plotting (Gantt charts for nodes/jobs, arrival Gantt) and various new statistics (EDP/EDP^2, slowdown-per-job, network stats).
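
The Poisson arrival option listed above can be illustrated with a minimal sketch. It assumes arrivals follow a homogeneous Poisson process, so inter-arrival gaps are exponentially distributed. The function name, its parameters, and the seed handling are hypothetical and are not taken from the RAPS code base.

```python
import random


def poisson_arrival_times(rate_per_s, horizon_s, seed=0):
    """Return sorted arrival times (seconds) in [0, horizon_s).

    Hypothetical helper: gaps between arrivals are drawn from an
    exponential distribution with mean 1 / rate_per_s.
    """
    rng = random.Random(seed)  # local RNG so results are reproducible
    times = []
    t = rng.expovariate(rate_per_s)
    while t < horizon_s:
        times.append(t)
        t += rng.expovariate(rate_per_s)
    return times


# Example: on average one job every 100 s over a one-hour horizon.
arrivals = poisson_arrival_times(rate_per_s=0.01, horizon_s=3600.0)
```

A fixed seed keeps synthetic workloads repeatable between runs, which helps when comparing schedulers on the same arrival pattern.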

⸻

Bug fixes & stability improvements (selected):
• Fixed get_current_utilization regression for trace_quanta=None (works for numeric traces again).
• Fixes for dragonfly and torus3d logic & tests.
• Fixed jobs killed prematurely in replay; corrected job state/run-time naming (current_run_time).
• Telemetry start/end handling: telemetry_start is shifted back to match sim_config.start when needed; trace_start/trace_end handling was rewritten to avoid NaN/zero-padding issues.
• Fixes to running_time remnants; scheduler stats now display seconds again.
• Many loader-specific fixes: npz loading, arrival/scale bugs, start/end time handling, and missing-trace defaults handled in engine.
• Validation & tests: added long-marker for tests that run long even when no data is present.

⸻

Tests & QA:
• Added/updated many pytest tests: telemetry, network topologies, multi-part simulations, workloads, smoke tests, and RL-related test improvements.
• Flake8/PEP8 formatting passes and other lint fixes.
• Added sample commands and test-friendly defaults (e.g., /opt/data default path used in README/tests).

⸻

CLI / UX changes:
• run-multi-part renamed to run-parts (tests updated).
• --start / --end simulation options for dataloaders (e.g., MIT loader).
• -o/--output added to manage output directories during tests.
• --noui to disable UI; improved Rich-based live layout with progress bar and speed controls (space, +/−, j/l keys).
• Shell completion added.
• Default scheduler behavior clarified: PolicyType determines replay vs reschedule; requested_nodes removed in favor of PolicyType/scheduled_nodes.

⸻

Performance & scaling:
• Batching run_simulation loop into 6-hour windows to improve performance. Updated to bisect input jobs.
• Added option to generate jobs during simulation.
• Added a parallelized multi-part-sim-mpi.py variant and fastsim parallel integration efforts.
• Added downscale/time-delta features for simulating at different time resolutions and sub-second behavior.
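
The windowed batching with bisection mentioned above could look roughly like the following sketch. It assumes jobs are kept sorted by submit time, so each 6-hour window can locate its jobs with two binary searches instead of scanning the full list. The function and variable names are hypothetical, not the actual run_simulation internals.

```python
from bisect import bisect_left

WINDOW_S = 6 * 3600  # 6-hour window, as described in the summary


def jobs_by_window(submit_times, sim_start, sim_end, window_s=WINDOW_S):
    """Yield (window_start, window_end, job_indices) per window.

    Hypothetical helper; submit_times must be sorted in ascending order.
    """
    t = sim_start
    while t < sim_end:
        t_next = min(t + window_s, sim_end)
        lo = bisect_left(submit_times, t)       # first job submitted at or after t
        hi = bisect_left(submit_times, t_next)  # first job submitted at or after t_next
        yield t, t_next, range(lo, hi)
        t = t_next


submit_times = [0, 100, 21600, 30000, 50000]  # seconds, sorted
windows = [list(idx) for _, _, idx in jobs_by_window(submit_times, 0, 86400)]
```

Each window then only touches the jobs whose submit time falls inside it, so the per-window cost depends on the window's job count plus a logarithmic lookup.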

⸻

Datasets & downloads:
• Added dataset download helpers and configs (Frontier, Marconi100, Lassen, Fugaku, AdastraMI250, Marconi FMU integration).
• create_trace / generate_local_metadata safeguards (avoid overwriting) and better documentation for Google Cluster V2 traces.

⸻

Migration / merge notes (important):
• Config keys and defaults changed: check sim_config keys and scheduler JSONs (e.g., nodes_per_blade, arrival vs older reschedule naming) and the change from requested_nodes to scheduled_nodes / PolicyType logic.
• Telemetry & start/end handling: dataloaders and engine now enforce stricter start/end semantics (trace_start/trace_end). Replay behaviour may raise errors if simulation accesses times outside recorded telemetry; rescheduling uses idle defaults. Validate your common replay experiments after the merge.
• Run smoke tests: run the smoke test matrix and the new network tests (torus3d/dragonfly/fattree) — some implementations were adjusted and tests were added/modified.
• Slow tests: some network tests and dataset-driven tests can be long — use the long marker and the -m pytest filters in CI to avoid unexpected timeouts.
• HPL model: initial integration — validate HPL-related experiments if you depend on previous HPL behavior; it now calls the new analytical model per iteration.
• RL changes: training & PPO metrics are now exposed — training pipelines may require updated arguments and SB3-style logging adjustments.
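
The stricter start/end semantics noted above can be sketched as follows: a lookup outside the recorded telemetry raises an error under replay and falls back to an idle default under rescheduling. This is an illustration only. The PolicyType enum here is a stand-in that may differ from the real one, and sample_trace, IDLE_VALUE, and the one-sample-per-second layout are assumptions.

```python
from enum import Enum


class PolicyType(Enum):
    """Hypothetical stand-in; the real RAPS enum may differ."""
    REPLAY = "replay"
    RESCHEDULE = "reschedule"


IDLE_VALUE = 0.0  # assumed idle default, for illustration only


def sample_trace(trace, trace_start, trace_end, t, policy, idle=IDLE_VALUE):
    """Return the trace sample at time t (one sample per second assumed)."""
    if trace_start <= t < trace_end:
        return trace[t - trace_start]
    if policy is PolicyType.REPLAY:
        # Replay must not read times that were never recorded.
        raise ValueError(
            f"time {t} is outside recorded telemetry [{trace_start}, {trace_end})"
        )
    return idle  # rescheduling falls back to the idle default


trace = [0.5, 0.7, 0.9]  # covers t = 100, 101, 102
inside = sample_trace(trace, 100, 103, 101, PolicyType.REPLAY)
outside = sample_trace(trace, 100, 103, 200, PolicyType.RESCHEDULE)
```

Checking a few replay experiments against boundaries like these is one way to confirm that simulation start and end times stay within the recorded telemetry after the merge.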

Edited Nov 05, 2025 by Maiterth, Matthias
