Commit 7755164c authored by Cianciosa, Mark

Reduce the length of the paper and reformat for new requirements.

title: 'graph_framework: A Domain Specific Compiler for Building Physics Applications'
tags:
    - C++
    - Auto differentiation
    - GPU
    - RF Ray Tracing
    - Energetic particles

![Mathematical operations are defined as a tree of operations. A df method transforms the tree by applying the derivative chain rule to each node. A reduce method applies algebraic rules removing nodes from the graph.\label{tree}](../graph_docs/Tree.png){width=60%}

This framework focuses on the domain of problems where the same physics is 
applied to a large ensemble of independent particles or rays. Applications have 
been developed for tracing large numbers of Radio Frequency (RF) rays in fusion 
devices and for particle tracing to understand how particle distributions are 
lost or evolve over time. The exploitation of GPU resources afforded by this 
framework allows high fidelity simulations at low computational cost.

[^1]:Notice of Copyright This manuscript has been authored by UT-Battelle, LLC 
under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The 
United States Government retains and the publisher, by accepting the article 
for publication, acknowledges that the United States Government retains a 
non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce 
the published form of this manuscript, or allow others to do so, for United 
States Government purposes. DOE will provide public access to these results of 
federally sponsored research in accordance with the DOE 
Public Access Plan ([http://energy.gov/downloads/doe-public-access-plan](http://energy.gov/downloads/doe-public-access-plan)).

# Statement of need

Fusion energy is a grand engineering challenge to make into a viable power 
source. Beyond the technical challenges of making it work in the first place, 
it needs to be economically competitive in the energy market. Addressing this 
economic challenge is tackled through design optimization. However, a barrier 
to optimization is the computational cost associated with exploring different 
configurations. 

Low fidelity models like systems codes[@Kovari],[@Kovari2] can miss critical 
physics that enables optimized designs. High fidelity models are too costly to 
challenges of adopting GPUs. As more supercomputers diminish CPU capacity in 
favor of GPU support, we are losing the capacity computing needed to explore 
the large ensembles necessary for device optimization.

The goal of the `graph_framework` is to lower the barrier of entry for adopting 
GPU code. While there are many solutions to the problem of performance 
portable code, each has its own drawbacks and trade-offs. With that in mind, 
the `graph_framework` was developed to provide these specific capabilities:

- Transparently support multiple CPUs and GPUs, including Apple GPUs.
- Use an API that is as simple as writing equations.
- Allow easy embedding into legacy code (does not rely on Python).
- Enable automatic differentiation.

With these design goals in mind, this framework is limited to the class of 
problems in which the same physics is applied to a large ensemble of particles. 
This limitation reduces the framework's complexity, making future extension 
simpler as needs arise for new problem domains.

# State of the field

| Framework       | Language           | NVidia             | Apple                 | AMD                | Auto Differentiation |
|-----------------|--------------------|--------------------|-----------------------|--------------------|----------------------|
| Cuda            | C                  | Official           | None                  | None               | No                   |
| Metal           | Objective C, Swift | None               | Official              | Deprecated         | No                   |
| OpenCL          | C                  | Official           | Deprecated            | Official           | No                   |
| Vulkan          | C                  | Official           | Unofficial            | Official           | No                   |
| HIP             | C                  | Official           | None                  | Official           | No                   |
| OpenACC         | C, C++, Fortran    | Official           | None                  | None               | No                   |
| OpenMP          | C, C++, Fortran    | Compiler Dependent | None                  | Compiler Dependent | No                   |
| Kokkos          | C++                | Official           | None                  | Official           | No                   |
| TensorFlow      | Python, C++        | Official           | Unofficial/Incomplete | Unofficial         | Yes                  |
| JAX             | Python             | Official           | Unofficial/Incomplete | Official           | Yes                  |
| PyTorch         | Python, C++, Java  | Official           | Official              | Official           | Yes                  |
Table: Overview of GPU-capable frameworks. \label{frameworks}

Standardized programming languages such as Fortran[@Backus], C[@Ritchie], 
C++[@Stroustrup], simplify the development of cross-platform programs. 
Scientific codes have relied on this ability to support multiple processor 
architectures and operating systems (OSs) with little or no change given an 
appropriate compiler. However, modern supercomputers rely on GPUs to achieve 
exascale performance[@Hines],[@Yang],[@Schneider]. Unlike CPUs, the instruction 
sets of GPUs are proprietary information. Additionally, since accelerators are 
typically hardware accessories, an OS requires device drivers which are also 
proprietary. NVidia GPUs are best programmed using CUDA[@Cuda], while Apple 
GPUs use Metal[@Metal] and AMD GPUs use HIP[@Hip].

There are many potential solutions to cross-platform GPU support. Low-level 
general purpose GPU (GPGPU) programming frameworks such as OpenCL[@Munshi] and 
Vulkan[@Vulkan] require direct vendor support. HIP can support NVidia GPUs by 
abstracting the driver API and rewriting kernel code. However, these frameworks 
are the lowest level and require GPU programming expertise that a domain 
scientist may not have. A higher-level approach, used in OpenACC[@Farber] and 
OpenMP[@OpenMP], uses source code annotations to transform loops and code 
blocks into GPU kernels. The drawback of this approach is that source code 
written for CPUs can result in poor GPU performance. Kokkos[@Edwards] is a 
collection of performance portable array operations for building device 
agnostic applications. However, the framework only supports AMD and NVidia GPUs 
and doesn't have out-of-the-box support for auto differentiation.

With the advent of machine learning, several frameworks have been created such 
as TensorFlow[@Abadi], JAX[@Bradbury], PyTorch[@Paszke], and MLX[@Hannun]. 
These frameworks build a graph representation of operations that can be 
auto-differentiated and compiled to GPUs. They are intended to be used through 
a Python interface, which lowers one barrier to using them but also introduces 
new barriers. For instance, it is not straightforward to embed these frameworks 
in non-Python codes, and their non-Python APIs do not always support all the 
features of, or are as well documented as, their Python APIs. Additionally, 
performance is not guaranteed since it is not always straightforward to 
understand what the framework is doing under the hood, and cross-platform 
support is often unofficial and can be incomplete. Table \ref{frameworks} shows 
an overview of these frameworks.

# Software design

The core of this software is built around a graph data structure representing 
mathematical expressions. In graph form, the expressions can be treated 
symbolically, enabling two critical functions. Algebraic rules can be applied to 
simplify the graph, and any expression can be differentiated with 
respect to any other expression.

After expressions are built, workflows are created. A workflow is defined from 
one or more workflow items. A workflow item is defined from input nodes, output 
nodes, and maps between inputs and outputs. For each input and output node, 
device buffers are allocated. Then, starting from a given output, 
device-specific kernel source code is created by traversing the graph and 
adding a line appropriate for each expression. Duplicate expressions are 
avoided by tracking a 

# Research impact statement 

The `graph_framework` enables domain scientists to create portable 
high-performance code by simply writing out equations. Symbolic mathematical 
reductions simplify expressions, which are then JIT compiled to device code. 
The high-performance code generated enables higher fidelity simulations or the 
generation of large datasets for training reduced machine learning models. To 
demonstrate the performance of this framework we explored two physics examples: 
RF ray tracing in a realistic tokamak geometry and simulating gyro motion in a 
uniform magnetic field.

## Strong Scaling

![Left: Strong scaling wall time for 100000 rays traced in a realistic tokamak equilibrium. Right: Strong scaling speedup normalized to the wall time for a single device or core. The dashed diagonal line references the best possible scaling. The M2 Max has 8 fast performance cores and 4 slower energy efficiency cores, resulting in a drop off in improvement beyond 8 cores.\label{strong}](../graph_docs/StrongScaling.png){width=90%}

To measure strong scaling we traced $10^{6}$ rays for $10^{3}$ time steps using 
the cold plasma dispersion relation in a realistic tokamak equilibrium. A 
benchmarking application is available in the git repository. The figure above 
shows the strong scaling of wall time and the normalized speedup 
$$SpeedUp = \frac{time\left(1\right)}{time\left(n\right)}$$
as the number of GPU and CPU devices is increased.

Benchmarking was prepared on two different setups. The first setup is a Mac 
Studio with an Apple M2 Max chip. The M2 chip contains a 12-core CPU where 8 
cores are faster performance cores and the remaining 4 are slower efficiency 
cores. The M2 Max also contains a single 38-core GPU which only supports single 
precision operations. The second setup is a server with 4 NVidia A100 GPUs. 
Benchmarking measures the time to trace $10^{6}$ rays but does not include 
the setup and JIT times. Figure \ref{strong} shows the advantage even a single 
GPU has over CPU execution.

## Comparison to other frameworks

To compare against other frameworks we benchmarked the achieved throughput for 
simulating gyro motion in a uniform magnetic field $\vec{B}=B_{0}\hat{z}$.
$$\frac{\partial\vec{v}}{\partial t} = \vec{v}\times\vec{B}$$
$$\frac{\partial\vec{x}}{\partial t} = \vec{v}$$
We compared the `graph_framework` against the MLX framework since it supports 
Apple GPUs, JAX due to its popularity, and Kokkos for its performance 
portability. Source codes for this benchmark case are available in the 
`graph_framework` documentation. Figure \ref{throughput} shows the throughput 
of pushing $10^{8}$ particles for $10^{3}$ time steps. The `graph_framework` 
consistently shows the best throughput on both CPUs and GPUs.

# AI usage disclosure

No AI technology was used in the development of this software.

# Acknowledgements

The authors would like to thank Dr. Yashika Ghai, Dr. Rhea Barnett, and Dr. 
David Green for their valuable insights when setting up test cases for the 
RF ray tracing.