Improve documentation of different ways to modify telemetry replay, and better... (b3dcd88b) · Commits · ExaDigiT / sim-raps

README.md

+12 −2

		@@ -70,9 +70,19 @@ This will simulate synthetic workloads on two partitions as defined in `config/s

		This will dump a .npz file with a randomized name, e.g., ac23db.npz. Let's rename this file to pm100.npz for clarity. Note: you can press Ctrl-C once the simulation starts. The pm100.npz file can then be used with `multi-part-sim.py` as follows:

		python multi-part-sim.py -x setonix/* -f pm100.npz --reschedule --scale 192
		python multi-part-sim.py -x setonix/* -f pm100.npz --arrival poisson --scale 192

		The `--reschedule` flag will use the internal scheduler to determine what nodes to schedule for each job, and the `--scale` flag will specify the maximum number of nodes for each job (generally set this to the max number of nodes of the smallest partition).
		## Modifications to telemetry replay

		There are four ways to modify replaying of telemetry data:

		1. `--arrival`. Changing the arrival time distribution. Replay cases default to `--arrival prescribed`, where the jobs are submitted exactly as they were submitted on the physical machine. This can be changed to `--arrival poisson` to change when the jobs arrive, which is especially useful when there are gaps in time, e.g., when the system goes down for several days or is underutilized.

		2. `--policy`. Changing the way the jobs are scheduled. The `--policy` flag will be set by default to `replay` in cases where a telemetry file is provided, in which case the jobs will be scheduled according to the start times provided. Changing the `--policy` to `fcfs` or `backfill` will use the internal scheduler.

		3. `--scale`. Changing the scale of each job in the telemetry data. The `--scale` flag will specify the maximum number of nodes for each job (generally set this to the max number of nodes of the smallest partition), and randomly select the number of nodes for each job from one to max nodes. This flag is useful when replaying telemetry from a larger system onto a smaller system.

		4. `--shuffle`. Shuffling the jobs before replaying them.
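
The following is a minimal sketch of what the arrival, scale, and shuffle modifications described above do to a job list. It is not the RAPS implementation: the job dictionaries, the `modify_replay` helper, and the `mean_interarrival_s` parameter are hypothetical and exist only for illustration. Poisson arrivals are modeled here as exponentially distributed gaps between submissions, and scaling draws each job's node count uniformly from one to the maximum.

```python
import random


def modify_replay(jobs, arrival="prescribed", scale=0, shuffle=False,
                  mean_interarrival_s=60.0, seed=0):
    """Return a modified copy of `jobs` (hypothetical helper, not RAPS code)."""
    rng = random.Random(seed)          # seeded for repeatable results
    jobs = [dict(job) for job in jobs]  # copy so the input is left untouched

    if shuffle:
        # Reorder the jobs before replaying them.
        rng.shuffle(jobs)

    if arrival == "poisson":
        # A Poisson arrival process has exponentially distributed gaps
        # between consecutive submissions.
        t = 0.0
        for job in jobs:
            t += rng.expovariate(1.0 / mean_interarrival_s)
            job["submit_time"] = t

    if scale > 0:
        # Pick a node count for each job from 1 to `scale` (inclusive).
        for job in jobs:
            job["nodes"] = rng.randint(1, scale)

    return jobs


# Five jobs recorded on a larger machine, submitted 1000 s apart.
jobs = [{"id": i, "submit_time": 1000.0 * i, "nodes": 512} for i in range(5)]
modified = modify_replay(jobs, arrival="poisson", scale=192, shuffle=True)
for job in modified:
    print(job)
```

With `arrival="prescribed"`, `scale=0`, and `shuffle=False`, the sketch returns the jobs unchanged, which mirrors the default replay behaviour described above.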

		## Job-level power output example for replay of single job

args.py

+29 −17

		@@ -3,41 +3,53 @@ import sys
		from raps.schedulers.default import PolicyType

		parser = argparse.ArgumentParser(description='Resource Allocator & Power Simulator (RAPS)')

		# System configurations
		parser.add_argument('--system', type=str, default='frontier', help='System config to use')
		parser.add_argument('-x', '--partitions', nargs='+', default=None, help='List of machine configurations to use, e.g., -x setonix-cpu setonix-gpu')
		parser.add_argument('-c', '--cooling', action='store_true', help='Include FMU cooling model')
		parser.add_argument('--start', type=str, help='ISO 8601 string for start of simulation')
		parser.add_argument('--end', type=str, help='ISO 8601 string for end of simulation')

		# Simulation runtime options
		parser.add_argument('-t', '--time', type=str, default=None, help='Length of time to simulate, e.g., 123, 123s, 27m, 3h, 7d')
		parser.add_argument('-d', '--debug', action='store_true', help='Enable debug mode and disable rich layout')
		parser.add_argument('-e', '--encrypt', action='store_true', help='Encrypt any sensitive data in telemetry')
		parser.add_argument('-n', '--numjobs', type=int, default=1000, help='Number of jobs to schedule')
		parser.add_argument('-t', '--time', type=str, default=None, help='Length of time to simulate, e.g., 123, 123s, 27m, 3h, 7d')
		parser.add_argument('-ff', '--fastforward', type=str, default=None, help='Fast-forward by time amount (uses same units as -t)')
		parser.add_argument('-v', '--verbose', action='store_true', help='Enable verbose output')
		choices = ['layout1', 'layout2']
		parser.add_argument('--layout', type=str, choices=choices, default=choices[0], help='Layout of UI')
		parser.add_argument('--start', type=str, help='ISO 8601 string for start of simulation')
		parser.add_argument('--end', type=str, help='ISO 8601 string for end of simulation')
		parser.add_argument('--seed', action='store_true', help='Set random number seed for deterministic simulation')
		parser.add_argument('-f', '--replay', nargs='+', type=str, help='Either: path/to/joblive path/to/jobprofile' + \
		' -or- filename.npz (overrides --workload option)')
		choices = ['prescribed', 'poisson']
		parser.add_argument('--arrival', default=choices[0], type=str, choices=choices, help=f'Modify arrival distribution ({choices[1]}) or use the original submit times ({choices[0]})')
		parser.add_argument('-u', '--uncertainties', action='store_true',
		help='Change from floating point units to floating point units with uncertainties.' + \
		' Very expensive w.r.t simulation time!')
		parser.add_argument('--jid', type=str, default='*', help='Replay job id')
		parser.add_argument('--validate', action='store_true', help='Use node power instead of CPU/GPU utilizations')

		# Output options
		parser.add_argument('-o', '--output', action='store_true', help='Output power, cooling, and loss models for later analysis')
		parser.add_argument('-p', '--plot', nargs='+', choices=['power', 'loss', 'pue', 'temp', 'util'],
		help='Specify one or more types of plots to generate: power, loss, pue, util, temp')
		choices = ['png', 'svg', 'jpg', 'pdf', 'eps']
		parser.add_argument('--imtype', type=str, choices=choices, default=choices[0], help='Plot image type')

		# Telemetry data
		parser.add_argument('-f', '--replay', nargs='+', type=str, help='Either: path/to/joblive path/to/jobprofile' + \
		' -or- filename.npz (overrides --workload option)')
		parser.add_argument('-ff', '--fastforward', type=str, default=None, help='Fast-forward by time amount (uses same units as -t)')
		parser.add_argument('-e', '--encrypt', action='store_true', help='Encrypt any sensitive data in telemetry')
		parser.add_argument('--validate', action='store_true', help='Use node power instead of CPU/GPU utilizations')
		parser.add_argument('--jid', type=str, default='*', help='Replay job id')
		parser.add_argument('--scale', type=int, default=0, help='Scale telemetry to max nodes specified in order to run telemetry on a smaller target system/partition, e.g., --scale 192')
		parser.add_argument('--system', type=str, default='frontier', help='System config to use')

		# Synthetic workloads
		choices = ['random', 'benchmark', 'peak', 'idle']
		parser.add_argument('-w', '--workload', type=str, choices=choices, default=choices[0], help='Type of synthetic workload')

		# Scheduling options
		choices = ['default', 'nrel', 'anl', 'flux']
		parser.add_argument('--scheduler', type=str, choices=choices, default=choices[0], help='Name of scheduler')
		policies = [policy.value for policy in PolicyType]
		choices = ['prescribed', 'poisson']
		parser.add_argument('--arrival', default=choices[0], type=str, choices=choices, help=f'Modify arrival distribution ({choices[1]}) or use the original submit times ({choices[0]})')
		parser.add_argument('--policy', type=str, choices=policies, default=None, help='Schedule policy to use')
		choices = ['random', 'benchmark', 'peak', 'idle']
		parser.add_argument('-w', '--workload', type=str, choices=choices, default=choices[0], help='Type of synthetic workload')
		choices = ['layout1', 'layout2']
		parser.add_argument('-x', '--partitions', nargs='+', default=None, help='List of machine configurations to use, e.g., -x setonix-cpu setonix-gpu')
		parser.add_argument('--layout', type=str, choices=choices, default=choices[0], help='Layout of UI')
		parser.add_argument('--accounts', action='store_true', help='Flag indicating if accounts should be tracked')
		parser.add_argument('--accounts-json', type=str, help='Json of account stats generated in previous run. see raps/accounts.py')
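
To see how the replay-related options above parse together, here is a reduced, self-contained sketch that re-declares only a handful of them. The `build_parser` function is hypothetical, and the policy list is an assumption taken from the README text (`replay`, `fcfs`, `backfill`); in the real file the choices come from `PolicyType`, which is not reproduced here.

```python
import argparse


def build_parser():
    """Reduced sketch of the replay-related options (not the full args.py)."""
    parser = argparse.ArgumentParser(description="Replay option sketch")
    parser.add_argument("-x", "--partitions", nargs="+", default=None,
                        help="List of machine configurations to use")
    parser.add_argument("-f", "--replay", nargs="+", type=str,
                        help="Telemetry paths or a filename.npz")
    arrivals = ["prescribed", "poisson"]
    parser.add_argument("--arrival", type=str, choices=arrivals,
                        default=arrivals[0],
                        help="Arrival distribution for replayed jobs")
    # Assumption: stand-in for [policy.value for policy in PolicyType].
    policies = ["replay", "fcfs", "backfill"]
    parser.add_argument("--policy", type=str, choices=policies, default=None,
                        help="Schedule policy to use")
    parser.add_argument("--scale", type=int, default=0,
                        help="Maximum number of nodes per replayed job")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is repeatable.
args = build_parser().parse_args([
    "-x", "setonix-cpu", "setonix-gpu",
    "-f", "pm100.npz",
    "--arrival", "poisson",
    "--scale", "192",
])
print(args.partitions, args.replay, args.arrival, args.policy, args.scale)
```

Because `--policy` defaults to `None` in the parser, any fallback to `replay` when a telemetry file is given has to be resolved elsewhere in the code after parsing; this sketch does not attempt to reproduce that step.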
