Verified Commit fde21fba authored by Hines, Jesse

Merge branch 'develop' into rl3

parents c80d121d 15aacee4
+1 −1
[flake8]
exclude = .git, __pycache__, venv*, simulation_results, third_party, models
exclude = .git, __pycache__, venv*, simulation_results, third_party, models, .venv
max-line-length = 120
+1 −0
@@ -5,3 +5,4 @@ venv
*.npz
*.prof
simulation_results/
models/*.fmu
+62 −37
@@ -19,37 +19,37 @@ Note: Requires python3.12 or greater.

## Usage and help menu

    python main.py -h
    raps run -h

## Run simulator with default synthetic workload

    python main.py
    raps run

## Run simulator with telemetry replay

    # Frontier
    DATEDIR="date=2024-01-18"
    DPATH=~/data/frontier-sample-2024-01-18
    python main.py -f $DPATH/slurm/joblive/$DATEDIR,$DPATH/jobprofile/$DATEDIR
    raps run -f $DPATH/slurm/joblive/$DATEDIR,$DPATH/jobprofile/$DATEDIR

## Open Telemetry dataset

For the Marconi100 supercomputer, download `job_table.parquet` from https://zenodo.org/records/10127767

    # Marconi100
    python main.py --system marconi100 -f ~/data/marconi100/job_table.parquet
    raps run --system marconi100 -f ~/data/marconi100/job_table.parquet

For the Adastra MI250 supercomputer, download `AdastaJobsMI250_15days.parquet` from https://zenodo.org/records/14007065

    # Adastra MI250
    python main.py --system adastraMI250 -f AdastaJobsMI250_15days.parquet
    raps run --system adastraMI250 -f AdastaJobsMI250_15days.parquet

For the Google cluster trace v2 dataset

    python main.py --system gcloudv2 -f ~/data/gcloud/v2/google_cluster_data_2011_sample --ff 600
    raps run --system gcloudv2 -f ~/data/gcloud/v2/google_cluster_data_2011_sample --ff 600

    # analyze dataset
    python -m raps.telemetry --system gcloudv2 -f ~/data/gcloud/v2/google_cluster_data_2011_sample -v
    raps telemetry --system gcloudv2 -f ~/data/gcloud/v2/google_cluster_data_2011_sample -v

For MIT Supercloud

@@ -62,29 +62,29 @@ For MIT Supercloud
    python -m raps.dataloaders.mit_supercloud.cli download --start 2021-05-21T13:00 --end 2021-05-21T14:00

    # Load data and run simulation - will save data as part-cpu.npz and part-gpu.npz files
    python multi-part-sim.py -x mit_supercloud -f $DPATH --start 2021-05-21T13:00 --end 2021-05-21T14:00
    raps run-parts -x mit_supercloud -f $DPATH --start 2021-05-21T13:00 --end 2021-05-21T14:00
    # or simply
    python multi-part-sim.py experiments/mit.yaml
    raps run-parts experiments/mit-replay-25hrs.yaml
    # Note: if no start/end dates are provided, the run defaults to the 24 hours between
    # 2021-05-21T00:00 and 2021-05-22T00:00, as set in raps/dataloaders/mit_supercloud/utils.py

    # Re-run simulation using npz files (much faster load)
    python multi-part-sim.py -x mit_supercloud -f part-*.npz
    raps run-parts -x mit_supercloud -f part-*.npz

    # Synthetic tests for verification studies:
    python multi-part-sim.py -x mit_supercloud -w multitenant
    raps run-parts -x mit_supercloud -w multitenant

For Lumi

    # Synthetic test for lumi multi-part-sim:
    python multi-part-sim.py -x lumi/*
    # Synthetic test for Lumi:
    raps run-parts -x lumi

## Perform Network Simulation

Lassen is one of the few datasets that include networking data. See `raps/dataloaders/lassen.py` for how to
obtain the dataset. To run a network simulation, use the following command:

    python main.py -f ~/data/lassen/Lassen-Supercomputer-Job-Dataset --system lassen --policy fcfs --backfill firstfit --ff 365d -t 12h --arrival poisson --net
    raps run -f ~/data/lassen/Lassen-Supercomputer-Job-Dataset --system lassen --policy fcfs --backfill firstfit --ff 365d -t 12h --arrival poisson --net

## Snapshot of extracted workload data

@@ -92,8 +92,7 @@ To reduce the expense of extracting the needed data from the telemetry parquet f
RAPS saves a snapshot of the extracted data in NPZ format. The NPZ file can be
given instead of the parquet files to run subsequent simulations much more quickly, e.g.:

    python main.py -f jobs_2024-02-20_12-20-39.npz

    raps run -f jobs_2024-02-20_12-20-39.npz

## Cooling models

@@ -104,37 +103,29 @@ We provide several cooling models in the repo https://code.ornl.gov/exadigit/POW
This will install the POWER9CSM in the models folder. To activate cooling when running RAPS,
use the `--cooling` or `-c` argument, e.g.,

    python main.py --system marconi100 -c
    raps run --system marconi100 -c

    python main.py --system lassen -c
    raps run --system lassen -c

    python main.py --system summit -c
    raps run --system summit -c
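
A minimal sketch combining cooling with telemetry replay, assuming the `-c` flag composes with `-f` the same way as in the replay examples above (paths as in the Frontier example):

    raps run -f $DPATH/slurm/joblive/$DATEDIR,$DPATH/jobprofile/$DATEDIR -c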

## Support for multiple system partitions

Multi-partition systems are supported by running the `multi-part-sim.py` script, where a list of configurations can be specified using the `-x` flag as follows:
Multi-partition systems are supported by running the `raps run-parts ...` command, where a list of partitions can be specified using the `-x` flag as follows:

    python multi-part-sim.py -x setonix/part-cpu setonix/part-gpu
    raps run-parts -x setonix/part-cpu setonix/part-gpu

or simply:

    python multi-part-sim.py -x setonix/* # bash

    python multi-part-sim.py -x 'setonix/*' # zsh

To run this in parallel use:

    mpiexec -n 2 python multi-part-sim-mpi.py -x setonix/part-cpu setonix/part-gpu

*Note: first install `mpi4py` via pip or conda.*
    raps run-parts -x setonix

This will simulate synthetic workloads on two partitions as defined in `config/setonix-cpu` and `config/setonix-gpu`. To replay telemetry workloads from another system, e.g., Marconi100's PM100 dataset, first create a .npz snapshot of the telemetry data, e.g.,

    python main.py --system marconi100 -f /path/to/marconi100/job_table.parquet
    raps run --system marconi100 -f /path/to/marconi100/job_table.parquet

This will dump a .npz file with a randomized name, e.g. ac23db.npz. Let's rename this file to pm100.npz for clarity. Note: can control-C when the simulation starts. Now, this pm100.npz file can be used with `multi-part-sim.py` as follows:
This will dump a .npz file with a randomized name, e.g., ac23db.npz. Rename this file to pm100.npz for clarity. Note: you can Ctrl-C once the simulation starts. Now, this pm100.npz file can be used as follows:

    python multi-part-sim.py -x setonix/* -f pm100.npz --arrival poisson --scale 192
    raps run-parts -x setonix -f pm100.npz --arrival poisson --scale 192

## Modifications to telemetry replay

@@ -142,7 +133,8 @@ There are three ways to modify replaying of telemetry data:

1. `--arrival`. Changing the arrival time distribution - replay cases will default to `--arrival prescribed`, where the jobs will be submitted exactly as they were submitted on the physical machine. This can be changed to `--arrival poisson` to change when the jobs arrive, which is especially useful in cases where there are gaps in time, e.g., when the system goes down for several days or is underutilized.
python main.py -f $DPATH/slurm/joblive/$DATEDIR,$DPATH/jobprofile/$DATEDIR --arrival poisson
2. `--policy`. Changing the way the jobs are scheduled. The `--policy` flag will be set by default to `replay` in cases where a telemetry file is provided, in which case the jobs will be scheduled according to the start times provided. Changing the `--policy` to `fcfs` or `backfill` will use the internal scheduler.

2. `--policy`. Changing the way the jobs are scheduled. The `--policy` flag will be set by default to `replay` in cases where a telemetry file is provided, in which case the jobs will be scheduled according to the start times provided. Changing the `--policy` to `fcfs` or `backfill` will use the internal scheduler, e.g.:

    python main.py -f $DPATH/slurm/joblive/$DATEDIR,$DPATH/jobprofile/$DATEDIR --policy fcfs --backfill firstfit -t 12h

@@ -152,11 +144,11 @@ python main.py -f $DPATH/slurm/joblive/$DATEDIR,$DPATH/jobprofile/$DATEDIR --pol

## Job-level power output example for replay of single job

    python main.py -f $DPATH/slurm/joblive/$DATEDIR,$DPATH/jobprofile/$DATEDIR --jid 1234567 -o
    raps run -f $DPATH/slurm/joblive/$DATEDIR,$DPATH/jobprofile/$DATEDIR --jid 1234567 -o

## Compute stats on telemetry data, e.g., average job arrival time

    python -m raps.telemetry -f $DPATH/slurm/joblive/$DATEDIR,$DPATH/jobprofile/$DATEDIR
    raps telemetry -f $DPATH/slurm/joblive/$DATEDIR,$DPATH/jobprofile/$DATEDIR

## Build and run Docker container

@@ -176,6 +168,39 @@ See instructions in [server/README.md](https://code.ornl.gov/exadigit/simulation

See instructions in [dashboard/README.md](https://code.ornl.gov/exadigit/simulation-dashboard)

## Running Tests

RAPS uses [pytest](https://docs.pytest.org/) for its test suite.  
Before running tests, ensure that you have a valid data directory available (e.g., `/opt/data`) and set the environment variable `RAPS_DATA_DIR` to point to it.

### Run all tests
```bash
RAPS_DATA_DIR=/opt/data pytest -n auto -x
```

By default, tests are parallelized with `pytest-xdist` (`-n auto`) to speed up execution.
The `-x` flag stops execution after the first failure. Add `-v` to run in verbose mode.
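
For example, keeping the parallel run and stop-on-first-failure behavior while adding verbose output:

```bash
RAPS_DATA_DIR=/opt/data pytest -n auto -x -v
```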

### Run tests on multi-partition systems

```bash
pytest -v -k "multi_part_sim"
```

### Run only network-related tests

```bash
RAPS_DATA_DIR=/opt/data pytest -n auto -x -m network
```

See `pytest.ini` for the different options for `-m`.
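
As a rough sketch, markers such as `network` are typically registered in `pytest.ini` along the following lines (only the `network` marker is confirmed above; the description text is illustrative):

```ini
[pytest]
markers =
    network: tests that exercise the network simulation
```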

### Run a specific test file

```bash
RAPS_DATA_DIR=/opt/data pytest tests/systems/test_engine.py
```
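
pytest also accepts a single test within a file via its `::` node-id syntax; the test name below is purely illustrative:

```bash
RAPS_DATA_DIR=/opt/data pytest tests/systems/test_engine.py::test_hypothetical_case
```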

### Contributing Code

Install pre-commit hooks as set by the project:
+3 −1
@@ -49,7 +49,9 @@ scheduler:
    NODE_FAIL: 0.01
network:
  topology: torus3d
  network_max_bw: 9600000000.0
  #topology: capacity
  #network_max_bw: 9.6E9
  network_max_bw: 1E7
  torus_x: 24
  torus_y: 24
  torus_z: 24

config/kestrel.yaml

0 → 100644
+53 −0
system:
  num_cdus: 6
  racks_per_cdu: 6
  nodes_per_rack: 80
  rectifiers_per_rack: 6
  chassis_per_rack: 1
  nodes_per_blade: 1
  switches_per_chassis: 5
  nics_per_node: 2
  rectifiers_per_chassis: 5
  nodes_per_rectifier: 4
  missing_racks: []
  down_nodes: []
  cpus_per_node: 1
  gpus_per_node: 4
  cpu_peak_flops: 396800000000.0
  gpu_peak_flops: 7800000000000.0
  cpu_fp_ratio: 0.69
  gpu_fp_ratio: 0.69

power:
  power_gpu_idle: 75
  power_gpu_max: 300
  power_cpu_idle: 100
  power_cpu_max: 800
  power_mem: 74.26
  power_nic: 21
  power_nvme: 45
  power_switch: 250
  power_cdu: 0
  power_update_freq: 20
  rectifier_peak_threshold: 13670
  sivoc_loss_constant: 0
  sivoc_efficiency: 1
  rectifier_loss_constant: 0
  rectifier_efficiency: 1
  power_cost: 0.094

scheduler:
  seed: 42
  job_arrival_time: 20
  mtbf: 11
  trace_quanta: 20
  min_wall_time: 3600
  max_wall_time: 43200
  ui_update_freq: 3600
  max_nodes_per_job: 3000
  job_end_probs:
    COMPLETED: 0.63
    FAILED: 0.13
    CANCELLED: 0.12
    TIMEOUT: 0.11
    NODE_FAIL: 0.01
 No newline at end of file