Merge branch 'run-parts' into 'develop' (36c16e0f) · Commits · ExaDigiT / sim-raps

README.md

+20 −17

Original line number	Diff line number	Diff line
		@@ -62,21 +62,20 @@ For MIT Supercloud
		python -m raps.dataloaders.mit_supercloud.cli download --start 2021-05-21T13:00 --end 2021-05-21T14:00

		# Load data and run simulation - will save data as part-cpu.npz and part-gpu.npz files
		raps run-multi-part -x 'mit_supercloud/*' -f $DPATH --system mit_supercloud \
		--start 2021-05-21T13:00 --end 2021-05-21T14:00
		raps run-parts -x mit_supercloud -f $DPATH --system mit_supercloud --start 2021-05-21T13:00 --end 2021-05-21T14:00
		# Note: if no start or end dates are provided, the run defaults to the 24 hours from
		# 2021-05-21T00:00 to 2021-05-22T00:00, as set in raps/dataloaders/mit_supercloud/utils.py

		# Re-run simulation using npz files (much faster load)
		raps run-multi-part -x mit_supercloud/* -f part-*.npz --system mit_supercloud
		raps run-parts -x mit_supercloud -f part-*.npz --system mit_supercloud

		# Synthetic tests for verification studies:
		raps run-multi-part -x 'mit_supercloud/*' -w multitenant
		raps run-parts -x mit_supercloud -w multitenant

		For Lumi

		# Synthetic test for lumi multi-part-sim:
		raps run-multi-part -x lumi/*
		# Synthetic test for Lumi:
		raps run-parts -x lumi

		## Perform Network Simulation

		@@ -93,7 +92,6 @@ given instead of the parquet files for more quickly running subsequent simulatio

		raps run -f jobs_2024-02-20_12-20-39.npz


		## Cooling models

		We provide several cooling models in the repo https://code.ornl.gov/exadigit/POWER9CSM
		@@ -111,23 +109,21 @@ use `--cooling` or `-c` argument. e.g.,

		## Support for multiple system partitions

		Multi-partition systems are supported by running the `multi-part-sim.py` script, where a list of configurations can be specified using the `-x` flag as follows:
		Multi-partition systems are supported by running the `raps run-parts ...` command, where a list of partitions can be specified using the `-x` flag as follows:

		raps run-multi-part -x setonix/part-cpu setonix/part-gpu
		raps run-parts -x setonix/part-cpu setonix/part-gpu

		or simply:

		raps run-multi-part -x setonix/* # bash

		raps run-multi-part -x 'setonix/*' # zsh
		raps run-parts -x setonix

		This will simulate synthetic workloads on two partitions as defined in `config/setonix-cpu` and `config/setonix-gpu`. To replay telemetry workloads from another system, e.g., Marconi100's PM100 dataset, first create a .npz snapshot of the telemetry data, e.g.,

		raps run-multi-part --system marconi100 -f /path/to/marconi100/job_table.parquet
		raps run-parts --system marconi100 -f /path/to/marconi100/job_table.parquet

		This will dump a .npz file with a randomized name, e.g. ac23db.npz. Let's rename this file to pm100.npz for clarity. Note: can control-C when the simulation starts. Now, this pm100.npz file can be used with `multi-part-sim.py` as follows:
		This will dump a .npz file with a randomized name, e.g. ac23db.npz. Let's rename this file to pm100.npz for clarity. Note: you can press Ctrl-C once the simulation starts. Now, this pm100.npz file can be used as follows:

		raps run-multi-part -x setonix/* -f pm100.npz --arrival poisson --scale 192
		raps run-parts -x setonix -f pm100.npz --arrival poisson --scale 192

		## Modifications to telemetry replay

		@@ -135,7 +131,8 @@ There are three ways to modify replaying of telemetry data:

		1. `--arrival`. Changing the arrival time distribution - replay cases will default to `--arrival prescribed`, where the jobs will be submitted exactly as they were submitted on the physical machine. This can be changed to `--arrival poisson` to change when the jobs arrive, which is especially useful in cases where there may be gaps in time, e.g., when the system goes down for several days, or the system is underutilized.
		python main.py -f $DPATH/slurm/joblive/$DATEDIR,$DPATH/jobprofile/$DATEDIR --arrival poisson
		2. `--policy`. Changing the way the jobs are scheduled. The `--policy` flag will be set by default to `replay` in cases where a telemetry file is provided, in which case the jobs will be scheduled according to the start times provided. Changing the `--policy` to `fcfs` or `backfill` will use the internal scheduler.

		2. `--policy`. Changing the way the jobs are scheduled. The `--policy` flag will be set by default to `replay` in cases where a telemetry file is provided, in which case the jobs will be scheduled according to the start times provided. Changing the `--policy` to `fcfs` or `backfill` will use the internal scheduler, e.g.:

		python main.py -f $DPATH/slurm/joblive/$DATEDIR,$DPATH/jobprofile/$DATEDIR --policy fcfs --backfill firstfit -t 12h
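To illustrate the difference between `--arrival prescribed` and `--arrival poisson`, here is a minimal sketch of re-drawing submit times from a Poisson process. It is not taken from the RAPS source: `poisson_arrivals`, `mean_interarrival_s`, and the sample trace are hypothetical names and values chosen for this example.

```python
import random


def poisson_arrivals(num_jobs, mean_interarrival_s, seed=None):
    """Return num_jobs submit times (seconds) drawn from a Poisson process.

    Illustrative only (assumed, not RAPS code): inter-arrival gaps are
    exponential with the given mean, so long gaps recorded in a trace
    (e.g. a multi-day outage) do not carry over.
    """
    rng = random.Random(seed)
    t = 0.0
    times = []
    for _ in range(num_jobs):
        # expovariate takes the rate, i.e. 1 / mean gap.
        t += rng.expovariate(1.0 / mean_interarrival_s)
        times.append(t)
    return times


# Hypothetical prescribed submit times with an outage between jobs 2 and 3.
prescribed = [0.0, 60.0, 400_000.0, 400_050.0]
resampled = poisson_arrivals(len(prescribed), mean_interarrival_s=60.0, seed=0)
```

With a fixed seed the resampled times are reproducible, which helps when comparing scheduling policies on the same synthetic arrival pattern.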

		@@ -182,6 +179,12 @@ RAPS_DATA_DIR=/opt/data pytest -n auto -x
		By default, tests are parallelized with `pytest-xdist` (`-n auto`) to speed up execution.
		The `-x` flag stops execution after the first failure. Add `-v` to run in verbose mode.

		### Run tests on multi-partition systems

		```bash
		pytest -v -k "multi_part_sim"
		```

		### Run only network-related tests

		```bash

main.py

+2 −2

		@@ -3,7 +3,7 @@ ExaDigiT Resource Allocator & Power Simulator (RAPS)
		"""
		import argparse
		from raps.helpers import check_python_version
		from raps.run_sim import run_sim_add_parser, run_multi_part_sim_add_parser, show_add_parser
		from raps.run_sim import run_sim_add_parser, run_parts_sim_add_parser, show_add_parser
		from raps.workload import run_workload_add_parser
		from raps.telemetry import run_telemetry_add_parser

		@@ -20,7 +20,7 @@ def main(cli_args: list[str] | None = None):
		subparsers = parser.add_subparsers(required=True)

		run_sim_add_parser(subparsers)
		run_multi_part_sim_add_parser(subparsers)
		run_parts_sim_add_parser(subparsers)
		show_add_parser(subparsers)
		run_workload_add_parser(subparsers)
		run_telemetry_add_parser(subparsers)
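The registration pattern in this hunk (each `*_add_parser` function attaches a subcommand and stores its handler via `set_defaults(impl=...)`) can be sketched in isolation as follows. This is a simplified stand-in, not the RAPS code: the `--partitions` long option and the string-returning handler are assumptions made so the example is self-contained.

```python
import argparse


def run_parts_add_parser(subparsers):
    """Register a stand-in `run-parts` subcommand (hypothetical, not RAPS code)."""
    parser = subparsers.add_parser("run-parts", description="Run partitions.")
    # `--partitions` is an assumed long name; the README only shows `-x`.
    parser.add_argument("-x", "--partitions", nargs="+", default=[])
    # Store the handler on the namespace so main() can dispatch without if/else.
    parser.set_defaults(impl=lambda args: f"simulating {', '.join(args.partitions)}")


def main(cli_args=None):
    parser = argparse.ArgumentParser(prog="raps")
    subparsers = parser.add_subparsers(required=True)
    run_parts_add_parser(subparsers)
    args = parser.parse_args(cli_args)
    return args.impl(args)


result = main(["run-parts", "-x", "setonix/part-cpu", "setonix/part-gpu"])
```

Because the subcommand name lives only in the `add_parser` call, renaming `run-multi-part` to `run-parts` touches the registration function and its callers, as this commit does.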

raps/run_sim.py

+13 −5

		@@ -7,6 +7,7 @@ import json
		import pandas as pd
		import sys
		import yaml
		import warnings
		from pathlib import Path
		from raps.ui import LayoutManager
		from raps.plotting import Plotter
		@@ -73,7 +74,7 @@ def run_sim(sim_config: SimConfig):
		    if sim_config.verbose or sim_config.debug:
		        print(f"SimConfig: {sim_config.model_dump_json(indent=4)}")
		    if len(sim_config.system_configs) > 1:
		        print("Use run-multi-part to run multi-partition simulations")
		        print("Use run-parts to run multi-partition simulations")
		        sys.exit(1)

		    engine, workload_data, time_delta = Engine.from_sim_config(sim_config)
		@@ -221,8 +222,8 @@ def run_sim(sim_config: SimConfig):
		print("Output directory is: ", out) # If output is enabled, the user wants this information as last output


		def run_multi_part_sim_add_parser(subparsers: SubParsers):
		parser = subparsers.add_parser("run-multi-part", description="""
		def run_parts_sim_add_parser(subparsers: SubParsers):
		parser = subparsers.add_parser("run-parts", description="""
		Simulates multi-partition (heterogeneous) systems. Supports replaying telemetry or
		generating synthetic workloads across CPU-only, GPU, and mixed partitions. Initializes
		per-partition power, FLOPS, and scheduling models, then advances simulations in lockstep.
		@@ -237,11 +238,18 @@ def run_multi_part_sim_add_parser(subparsers: SubParsers):
		"cli_shortcuts": shortcuts,
		})
		parser.set_defaults(
		impl=lambda args: run_multi_part_sim(model_validate(args, read_yaml(args.config_file)))
		impl=lambda args: run_parts_sim(model_validate(args, read_yaml(args.config_file)))
		)


		def run_multi_part_sim(sim_config: SimConfig):
		def run_parts_sim(sim_config: SimConfig):

		    if len(sim_config.system_configs) == 1:
		        warnings.warn(
		            "run_parts_sim is usually for multiple partitions. Did you mean to run with one?",
		            UserWarning
		        )

		    multi_engine, workload_results, timestep_start, timestep_end, time_delta = \
		        MultiPartEngine.from_sim_config(sim_config)
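The new single-partition check emits a `UserWarning` rather than exiting, so the simulation still runs. A minimal sketch of how that behaviour could be verified is shown below; `check_partition_count` is a hypothetical helper that mirrors the check, not a function in `raps/run_sim.py`.

```python
import warnings


def check_partition_count(system_configs):
    """Hypothetical helper mirroring the single-partition check in run_parts_sim."""
    if len(system_configs) == 1:
        warnings.warn(
            "run_parts_sim is usually for multiple partitions. Did you mean to run with one?",
            UserWarning,
        )


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure the warning is not suppressed
    check_partition_count(["setonix/part-cpu"])                      # warns
    check_partition_count(["setonix/part-cpu", "setonix/part-gpu"])  # silent

messages = [str(w.message) for w in caught]
```

In a pytest suite the same check could be written with `pytest.warns(UserWarning)`.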

tests/smoke.py

+1 −1

		@@ -54,7 +54,7 @@ def synthetic_workload_tests():
		def hetero_tests():
		    """Run heterogeneous workload tests."""
		    print("Starting heterogeneous workload tests...")
		    run_command(f"python main.py run-multi-part -x setonix/part-cpu setonix/part-gpu -t {DEFAULT_TIME}")
		    run_command(f"python main.py run-parts -x setonix/part-cpu setonix/part-gpu -t {DEFAULT_TIME}")


		def main():

tests/systems/test_multi_part_sim_basic_run.py

+1 −1

		@@ -18,7 +18,7 @@ def test_multi_part_sim_basic_run(system, system_config):

		    os.chdir(PROJECT_ROOT)
		    result = subprocess.run([
		        "python", "main.py", "run-multi-part",
		        "python", "main.py", "run-parts",
		        "--time", "1h",
		        "-x", f"{system}/*",
		    ], capture_output=True, text=True, stdin=subprocess.DEVNULL)