Commit 121c9cb5 authored by Maiterth, Matthias's avatar Maiterth, Matthias
Browse files

Merge branch 'shell-completion' into 'develop'

Shell Completion

See merge request !122
parents e7a6d22e 1090bf97
Loading
Loading
Loading
Loading
+1 −0
Original line number Diff line number Diff line
@@ -6,3 +6,4 @@ venv
*.prof
simulation_results/
models/*.fmu
.shell-completion-cache
+4 −3
Original line number Diff line number Diff line
@@ -75,7 +75,7 @@ For MIT Supercloud
    raps run-parts -x mit_supercloud -w multitenant

    # Reinforcement learning test case
    python main.py train-rl --system mit_supercloud/part-cpu -f /opt/data/mit_supercloud/202201
    raps train-rl --system mit_supercloud/part-cpu -f /opt/data/mit_supercloud/202201

For Lumi

@@ -135,11 +135,12 @@ This will dump a .npz file with a randomized name, e.g. ac23db.npz. Let's rename
There are three ways to modify replaying of telemetry data:

1. `--arrival`. Changing the arrival time distribution - replay cases will default to `--arrival prescribed`, where the jobs will be submitted exactly as they were submitted on the physical machine. This can be changed to `--arrival poisson` to change when the jobs arrive, which is especially useful in cases where there may be gaps in time, e.g., when the system goes down for several days, or the system is underutilized.
python main.py -f $DPATH/slurm/joblive/$DATEDIR,$DPATH/jobprofile/$DATEDIR --arrival poisson

    raps run -f $DPATH/slurm/joblive/$DATEDIR,$DPATH/jobprofile/$DATEDIR --arrival poisson

2. `--policy`. Changing the way the jobs are scheduled. The `--policy` flag will be set by default to `replay` in cases where a telemetry file is provided, in which case the jobs will be scheduled according to the start times provided. Changing the `--policy` to `fcfs` or `backfill` will use the internal scheduler, e.g.:

    python main.py -f $DPATH/slurm/joblive/$DATEDIR,$DPATH/jobprofile/$DATEDIR --policy fcfs --backfill firstfit -t 12h
    raps run -f $DPATH/slurm/joblive/$DATEDIR,$DPATH/jobprofile/$DATEDIR --policy fcfs --backfill firstfit -t 12h

3. `--scale`. Changing the scale of each job in the telemetry data. The `--scale` flag will specify the maximum number of nodes for each job (generally set this to the max number of nodes of the smallest partition), and randomly select the number of nodes for each job from one to max nodes. This flag is useful when replaying telemetry from a larger system onto a smaller system.

+1 −1
Original line number Diff line number Diff line
# python main.py run-multi-part experiments/mit-replay-24hrs.yaml
# raps run-multi-part experiments/mit-replay-24hrs.yaml
partitions: ["mit_supercloud/part-cpu", "mit_supercloud/part-gpu"]
replay:
  - /opt/data/mit_supercloud/202201
+1 −1
Original line number Diff line number Diff line
# python main.py run-multi-part experiments/mit-synthetic.yaml
# raps run-multi-part experiments/mit-synthetic.yaml
partitions: ["mit_supercloud/part-cpu", "mit_supercloud/part-gpu"]
workload: multitenant
+67 −7
Original line number Diff line number Diff line
#!/usr/bin/env python3
# PYTHON_ARGCOMPLETE_OK
"""
ExaDigiT Resource Allocator & Power Simulator (RAPS)
"""
import argparse
from pathlib import Path
import os
import textwrap
import copy
import gzip
import dill
import argcomplete

# Implement shell completion using argcomplete
# Importing all of raps' dependencies like pandas etc can be rather slow, often taking 1-2 seconds. So for snappy shell
# completion we need to avoid imports on the shell completion path. We could do this by shuffling the code around to
# create the parser without importing any heavy-weight libraries. But that would be a pain to maintain and track that
# pandas or scipy aren't accidentally imported transitively. Pandas can also be convenient to use in validating SimConfig
# etc, which is needed to build the argparser. So instead, we cache the generated argparser object so that shell
# completion can run without importing the rest of raps.
PARSER_CACHE = Path(__file__).parent / '.shell-completion-cache'


def shell_completion_add_parser(subparsers):
    """Attach the `shell-completion` subcommand to *subparsers*.

    The subcommand registers argcomplete's global bash completion hook, so
    tab-completion for RAPS works in new shells.
    """
    description = textwrap.dedent("""
        Register shell completion for RAPS.
    """).strip()
    sub = subparsers.add_parser(
        "shell-completion",
        description=description,
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )

    def impl(args):
        # argcomplete ships this helper; it edits ~/.bash_completion so the
        # completion hook is loaded for every future shell session.
        os.system("activate-global-python-argcomplete")

    sub.set_defaults(impl=impl)


def shell_complete():
    """Run argcomplete against the cached parser for snappy tab-completion.

    During a completion request argcomplete prints candidates and exits the
    process; during a normal run this returns immediately.
    """
    try:
        raw = PARSER_CACHE.read_bytes()
        parser = dill.loads(gzip.decompress(raw))
    except Exception:
        # Missing or corrupted cache: discard it and fall back to an empty
        # parser so argcomplete still handles the tab-complete request (and
        # its sys.exit) cleanly. The cache is rebuilt on the next normal run
        # of `main.py`.
        PARSER_CACHE.unlink(missing_ok=True)
        parser = argparse.ArgumentParser()

    argcomplete.autocomplete(parser, always_complete_options=False)


def cache_parser(parser: argparse.ArgumentParser):
    """Persist a completion-ready copy of *parser* to PARSER_CACHE.

    A deep copy is stripped of the `impl` callbacks (shell completion never
    calls them), dill-pickled, gzip-compressed deterministically, and written
    to disk only when the bytes differ from the existing cache.
    """
    clone = copy.deepcopy(parser)
    sub_action = next(
        action for action in clone._actions
        if isinstance(action, argparse._SubParsersAction)
    )
    for child in sub_action.choices.values():
        child.set_defaults(impl=lambda args: None)

    # mtime=0 keeps gzip output byte-stable, so the equality check below
    # avoids rewriting an unchanged cache.
    blob = gzip.compress(dill.dumps(clone), compresslevel=4, mtime=0)
    if not PARSER_CACHE.exists() or PARSER_CACHE.read_bytes() != blob:
        try:
            PARSER_CACHE.write_bytes(blob)
        except Exception:
            # Best effort: a read-only install must not break normal runs.
            pass


def main(cli_args: list[str] | None = None):
    shell_complete()  # will output shell completion and sys.exit during tab complete

    from raps.helpers import check_python_version
    check_python_version()

    from raps.run_sim import run_sim_add_parser, run_parts_sim_add_parser, show_add_parser
    from raps.workloads import run_workload_add_parser
    from raps.telemetry import run_telemetry_add_parser
    from raps.train_rl import train_rl_add_parser

check_python_version()


def main(cli_args: list[str] | None = None):
    parser = argparse.ArgumentParser(
        description="""
            ExaDigiT Resource Allocator & Power Simulator (RAPS)
@@ -27,8 +86,9 @@ def main(cli_args: list[str] | None = None):
    run_workload_add_parser(subparsers)
    run_telemetry_add_parser(subparsers)
    train_rl_add_parser(subparsers)
    shell_completion_add_parser(subparsers)

    # TODO: move other misc scripts into here
    cache_parser(parser)

    args = parser.parse_args(cli_args)
    assert args.impl, "subparsers should add an impl function to args"
Loading