Commit 7755164c authored by Cianciosa, Mark

Reduce the length of the paper and reformat for new requirements.

title: 'graph_framework: A Domain Specific Compiler for Building Physics Applications'
tags:
    - C++
    - Auto differentiation
    - GPU
    - RF Ray Tracing
    - Energetic particles

![Mathematical operations are defined as a tree of operations. A df method transforms the tree by applying the derivative chain rule to each node. A reduce method applies algebraic rules removing nodes from the graph.\label{tree}](../graph_docs/Tree.png){width=60%}

This framework focuses on the domain of problems where the same physics is 
applied to a large ensemble of independent particles or rays. Applications have 
been developed for tracing large numbers of Radio Frequency (RF) rays in fusion 
devices and for particle tracing to understand how particle distributions are 
lost or evolve over time. The exploitation of GPU resources afforded by this 
framework allows high fidelity simulations at low computational cost.

[^1]:Notice of Copyright This manuscript has been authored by UT-Battelle, LLC 
under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The 
United States Government retains and the publisher, by accepting the article 
for publication, acknowledges that the United States Government retains a 
non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce 
the published form of this manuscript, or allow others to do so, for United 
States Government purposes. DOE will provide public access to these results of 
federally sponsored research in accordance with the DOE 
Public Access Plan ([http://energy.gov/downloads/doe-public-access-plan](http://energy.gov/downloads/doe-public-access-plan)).

# Statement of need

Fusion energy is a grand engineering challenge to make into a viable power 
source. Beyond the technical challenges of making it work in the first place, 
it needs to be economically competitive in the energy market. Addressing this 
economic challenge is tackled through design optimization. However, a barrier 
to optimization is the computational cost associated with exploring different 
configurations. 

Low fidelity models like systems codes[@Kovari],[@Kovari2] can miss critical 
physics that enables optimized designs. High fidelity models are too costly to 
challenges of adopting GPUs. As more supercomputers diminish CPU capacity in 
favor of GPU support, we are losing the capacity computing needed to explore 
the large ensembles necessary for device optimization.

The goal of the `graph_framework` is to lower the barrier of entry for adopting 
GPU code. While there are many solutions to the problem of performance 
portable code, each has its own drawbacks and trade-offs. With that in mind, 
the `graph_framework` was developed to provide these specific capabilities:

- Transparently support multiple CPUs and GPUs, including Apple GPUs.
- Use an API that is as simple as writing equations.
- Allow easy embedding into legacy code (does not rely on Python).
- Enable automatic differentiation.

With these design goals in mind, this framework is limited to the class of 
problems in which the same physics is applied to a large ensemble of particles. 
This limitation reduces the framework's complexity, making future extension 
simpler as needs arise for new problem domains.

# State of the field

| Framework       | Language           | NVidia             | Apple                 | AMD                | Auto Differentiation |
|-----------------|--------------------|--------------------|-----------------------|--------------------|----------------------|
| Cuda            | C                  | Official           | None                  | None               | No                   |
| Metal           | Objective C, Swift | None               | Official              | Deprecated         | No                   |
| OpenCL          | C                  | Official           | Deprecated            | Official           | No                   |
| Vulkan          | C                  | Official           | Unofficial            | Official           | No                   |
| HIP             | C                  | Official           | None                  | Official           | No                   |
| OpenACC         | C, C++, Fortran    | Official           | None                  | None               | No                   |
| OpenMP          | C, C++, Fortran    | Compiler Dependent | None                  | Compiler Dependent | No                   |
| Kokkos          | C++                | Official           | None                  | Official           | No                   |
| TensorFlow      | Python, C++        | Official           | Unofficial/Incomplete | Unofficial         | Yes                  |
| JAX             | Python             | Official           | Unofficial/Incomplete | Official           | Yes                  |
| PyTorch         | Python, C++, Java  | Official           | Official              | Official           | Yes                  |
Table: Overview of GPU-capable frameworks. \label{frameworks}

Standardized programming languages such as Fortran[@Backus], C[@Ritchie], 
C++[@Stroustrup], simplify the development of cross-platform programs. 
Scientific codes have relied on this ability to support multiple processor 
architectures and operating systems (OSs) with little or no change given an 
appropriate compiler. However, modern supercomputers rely on GPUs to achieve 
exascale performance[@Hines],[@Yang],[@Schneider]. Unlike CPUs, the instruction 
sets of GPUs are proprietary information. Additionally, since accelerators are 
typically hardware accessories, an OS requires device drivers which are also 
proprietary. NVidia GPUs are best programmed using CUDA[@Cuda], while Apple 
GPUs use Metal[@Metal] and AMD GPUs use HIP[@Hip].

There are many potential solutions to cross-platform GPU support. Low-level 
general purpose GPU (GPGPU) programming frameworks such as OpenCL[@Munshi] and 
Vulkan[@Vulkan] require direct vendor support. HIP can support NVidia GPUs by 
abstracting the driver API and rewriting kernel code. However, these frameworks 
are the lowest level and require GPU programming expertise that a domain 
scientist may not have. A higher-level approach, used in OpenACC[@Farber] and 
OpenMP[@OpenMP], uses source code annotations to transform loops and code 
blocks into GPU kernels. The drawback of this approach is that source code 
written for CPUs can result in poor GPU performance. Kokkos[@Edwards] is a 
collection of performance portable array operations for building device 
agnostic applications. However, the framework only supports AMD and NVidia GPUs 
and doesn't have out-of-the-box support for auto differentiation.

With the advent of machine learning, several frameworks have been created such 
as TensorFlow[@Abadi], JAX[@Bradbury], PyTorch[@Paszke], and MLX[@Hannun]. 
These frameworks build a graph representation of operations that can be 
auto-differentiated and compiled to GPUs. They are intended to be used through 
a Python interface, which lowers one barrier to using them but also introduces 
new barriers. For instance, it is not straightforward to embed these frameworks 
in non-Python codes, and their non-Python APIs do not always support all the 
features of, or are as well documented as, their Python APIs. Additionally, 
performance is not guaranteed since it is not always straightforward to 
understand what the framework is doing under the hood, and cross-platform 
support is often unofficial and can be incomplete. Table \ref{frameworks} shows 
an overview of these frameworks.

# Software design

The core of this software is built around a graph data structure representing 
mathematical expressions. In graph form, the expressions can be treated 
symbolically, enabling two critical functions. Algebraic rules can be applied to 
simplify the graph, and any expression can be differentiated with 
respect to any other expression.

After expressions are built, workflows are created. A workflow is defined from 
one or more workflow items. A workflow item is defined from input nodes, output 
nodes, and maps between inputs and outputs. For each input and output node, 
device buffers are allocated. Then, starting from a given output, 
device-specific kernel source code is created by traversing the graph and 
adding a line appropriate for each expression. Duplicate expressions are 
avoided by tracking a 

# Research impact statement 

The `graph_framework` enables domain scientists to create portable 
high-performance code by simply writing out equations. Symbolic mathematical 
reductions simplify expressions, which are then JIT compiled to device code. 
The high-performance code generated enables higher fidelity simulations or the 
generation of large datasets for training reduced machine learning models. To 
demonstrate the performance of this framework we explored two physics examples: 
RF ray tracing in a realistic tokamak geometry and simulating gyro motion in a 
uniform magnetic field.

## Strong Scaling

![Left: Strong scaling wall time for 100000 rays traced in a realistic tokamak equilibrium. Right: Strong scaling speedup normalized to the wall time for a single device or core. The dashed diagonal line references the best possible scaling. The M2 Max has 8 fast performance cores and 4 slower energy efficiency cores, resulting in a drop off in improvement beyond 8 cores.\label{strong}](../graph_docs/StrongScaling.png){width=90%}

To measure strong scaling we traced $10^{6}$ rays for $10^{3}$ time steps using 
the cold plasma dispersion relation in a realistic tokamak equilibrium. A 
benchmarking application is available in the git repository. The figure above 
shows the strong scaling of wall time and the normalized speedup 
$$SpeedUp = \frac{time\left(1\right)}{time\left(n\right)}$$
as the number of GPU and CPU devices is increased.

Benchmarking was prepared on two different setups. The first setup is a Mac 
Studio with an Apple M2 Max chip. The M2 chip contains a 12-core CPU where 8 
cores are faster performance cores and the remaining 4 are slower efficiency 
cores. The M2 Max also contains a single 38-core GPU which only supports single 
precision operations. The second setup is a server with 4 NVidia A100 GPUs. 
Benchmarking measures the time to trace $10^{6}$ rays but does not include 
the setup and JIT times. Figure \ref{strong} shows the advantage even a single 
GPU has over CPU execution.

## Comparison to other frameworks

To compare against other frameworks we benchmarked the achieved throughput for 
simulating gyro motion in a uniform magnetic field $\vec{B}=B_{0}\hat{z}$.
$$\frac{\partial\vec{v}}{\partial t} = \vec{v}\times\vec{B}$$
$$\frac{\partial\vec{x}}{\partial t} = \vec{v}$$
We compared the `graph_framework` against the MLX framework since it supports 
Apple GPUs, JAX due to its popularity, and Kokkos for its performance 
portability. Source codes for this benchmark case are available in the 
`graph_framework` documentation. Figure \ref{throughput} shows the throughput 
of pushing $10^{8}$ particles for $10^{3}$ time steps. The `graph_framework` 
consistently shows the best throughput on both CPUs and GPUs.

# AI usage disclosure

No AI technology was used in the development of this software.

# Acknowledgements

The authors would like to thank Dr. Yashika Ghai, Dr. Rhea Barnett, and Dr. 
David Green for their valuable insights when setting up test cases for the 
RF ray tracing.