{width=60%}
This framework focuses on the domain of physics problems where the same
physics is applied to a large ensemble of independent particles or rays.
Applications have been developed for tracing large numbers of Radio Frequency
(RF) rays in fusion devices and for particle tracing to understand how particle
distributions are lost or evolve over time. The exploitation of GPU resources
afforded by this framework allows high fidelity simulations at low
computational cost.
[^1]:Notice of Copyright This manuscript has been authored by UT-Battelle, LLC
under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The
United States Government retains and the publisher, by accepting the article
for publication, acknowledges that the United States Government retains a
non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce
the published form of this manuscript, or allow others to do so, for United
States Government purposes. The Department of Energy will provide public access
to these results of federally sponsored research in accordance with the DOE
Public Access Plan ([http://energy.gov/downloads/doe-public-access-plan](http://energy.gov/downloads/doe-public-access-plan)).
# Statement of need
Modern supercomputers are increasingly relying on Graphics Processing Units
(GPUs) and other accelerators to achieve exa-scale performance at reasonable
energy usage. A major challenge of exploiting these accelerators is the
incompatibility between different vendors: a scientific code written using
CUDA will not run on an AMD GPU. Frameworks that can abstract the physics
from the accelerator kernel code are needed to exploit current and future
hardware. In the world of machine learning, several auto differentiation
frameworks have been developed that promise to abstract the math from the
compute hardware. In practice, however, these frameworks often lag in
supporting non-CUDA platforms, and their reliance on python makes them
challenging to embed within non-python based applications.
Fusion energy is a grand engineering challenge to make into a viable power
source. Beyond the technical challenges of making it work in the first place,
it needs to be economically competitive in the energy market. Addressing the
economic challenge is tackled through design optimization. However, a barrier
to optimization is the computational cost associated with exploring different
configurations.
Low fidelity models like systems codes[@Kovari],[@Kovari2] can miss critical
physics that enable optimized designs. High fidelity models are too costly to
[...] challenges of adopting GPUs. As more supercomputers diminish CPU capacity
in favor of GPU support, we are losing the capacity computing needed to explore
the large ensembles required for device optimization.
The goal of the `graph_framework` is to lower the barrier of entry for adopting
GPU code. While there are many different solutions to the problem of performance
portable code, different solutions have different drawbacks or trade offs. With
that in mind the `graph_framework` was developed to address the specific
capabilities of:
- Transparently support multiple CPUs and GPUs including Apple GPUs.
- Use an API that is as simple as writing equations.
- Allow easy embedding into legacy code (Doesn't rely on python).
- Enables automatic differentiation.
With these design goals in mind, this framework is limited to the class of
problems in which the same physics is applied to a large ensemble of particles.
This limitation reduces the complexity of the framework, making future
extensibility simpler as the need arises for new problem domains.
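The "equations as code" and automatic differentiation goals can be illustrated
with a toy forward-mode example using dual numbers. This is purely
illustrative: all names here are invented, and `graph_framework` builds an
expression graph rather than propagating dual numbers.

```cpp
// Toy forward-mode autodiff with dual numbers: writing a formula as an
// ordinary arithmetic expression also yields its derivative. Illustrative
// only; this is not the framework's actual interface.
struct Dual {
    double value;  // f(x)
    double deriv;  // df/dx
};

Dual operator+(Dual a, Dual b) { return {a.value + b.value, a.deriv + b.deriv}; }
Dual operator*(Dual a, Dual b) {
    // Product rule: (fg)' = f'g + fg'
    return {a.value * b.value, a.deriv * b.value + a.value * b.deriv};
}

// A physics-style expression written the way the equation reads:
// E = (1/2) k x^2, so dE/dx = k x.
Dual spring_energy(Dual x, double k) {
    Dual half{0.5, 0.0}, stiffness{k, 0.0};
    return half * stiffness * x * x;
}
```

Seeding `x` with derivative one makes `spring_energy(x, k).deriv` carry $kx$
alongside the energy, with no code beyond the equation itself.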
| Framework | Language | CUDA | Metal | HIP | Auto-differentiation |
|-----------|----------|------|-------|-----|----------------------|
| Kokkos | C++ | Official | None | Official | No |
| TensorFlow | Python, C++ | Official | Unofficial/Incomplete | Unofficial | Yes |
| JAX | Python | Official | Unofficial/Incomplete | Official | Yes |
| PyTorch | Python, C++, Java | Official | Official | Official | Yes |

Table: Overview of GPU capable frameworks. \label{frameworks}
Standardized programming languages such as Fortran[@Backus], C[@Ritchie],
C++[@Stroustrup], simplify the development of cross platform programs.
Scientific codes have relied on this ability to support multiple processor
architectures and operating systems (OSs) with little or no change given an
appropriate compiler. However, modern supercomputers rely on GPUs to achieve
exa-scale performance[@Hines],[@Yang],[@Schneider]. Unlike CPUs, the
instruction sets of GPUs are proprietary information. Additionally, since
accelerators are typically hardware accessories, an OS requires device drivers
which are also proprietary. NVidia GPUs are best programmed using CUDA[@Cuda]
while Apple GPUs use Metal[@Metal] and AMD GPUs use HIP[@Hip].
There are many potential solutions for cross platform GPU support. Low level
general purpose GPU (GPGPU) programming frameworks such as OpenCL[@Munshi] and
Vulkan[@Vulkan] require direct vendor support. HIP can support NVidia GPUs by
abstracting the driver API and rewriting kernel code. However, these frameworks
are the lowest level and require GPU programming expertise, which a domain
scientist may not have, to utilize them effectively. A higher level approach,
used in OpenACC[@Farber] and OpenMP[@OpenMP], uses source code annotations to
transform loops and code blocks into GPU kernels. The drawback of this approach
is that source code written for CPUs can result in poor GPU performance.
Kokkos[@Edwards] is a collection of performance portable array operations for
building device agnostic applications. However, the framework only supports AMD
and NVidia GPUs and doesn't have out of the box support for auto
differentiation.
With the advent of machine learning, several frameworks have been created such
as TensorFlow[@Abadi], JAX[@Bradbury], PyTorch[@Paszke], and MLX[@Hannun].
These frameworks build a graph representation of operations that can be
auto-differentiated and compiled to GPUs. They are intended to be used through
a python interface, which lowers one barrier to using them but also introduces
new barriers. For instance, it is not straightforward to embed these frameworks
in non-python codes, and their non-python APIs don't always support all the
features, nor are they as well documented as their python APIs. Additionally,
performance is not guaranteed, as it is not always straightforward to
understand what the framework is doing under the hood. Cross platform support
is also often unofficial and can be incomplete. Table \ref{frameworks} shows an
overview of these frameworks.
# Software design
The core of this software is built around a graph data structure representing
mathematical expressions. In graph form, the expressions can be treated
symbolically, enabling two critical functions: algebraic rules can be applied
to reduce the expressions, and any expression can be differentiated with
respect to any other expression.
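A minimal sketch of such an expression graph, with evaluation and symbolic
differentiation, might look as follows. The types and names are hypothetical,
not the framework's actual classes, and only addition and multiplication are
shown.

```cpp
#include <memory>
#include <string>

// Minimal expression-graph sketch: nodes represent constants, variables, and
// binary operations; df() builds the derivative graph symbolically.
struct Node {
    enum Kind { CONSTANT, VARIABLE, ADD, MULTIPLY } kind;
    double value = 0.0;           // used by CONSTANT
    std::string name;             // used by VARIABLE
    std::shared_ptr<Node> left, right;
};
using NodePtr = std::shared_ptr<Node>;

NodePtr constant(double v) {
    auto n = std::make_shared<Node>(); n->kind = Node::CONSTANT; n->value = v; return n;
}
NodePtr variable(const std::string &name) {
    auto n = std::make_shared<Node>(); n->kind = Node::VARIABLE; n->name = name; return n;
}
NodePtr add(NodePtr a, NodePtr b) {
    auto n = std::make_shared<Node>(); n->kind = Node::ADD; n->left = a; n->right = b; return n;
}
NodePtr multiply(NodePtr a, NodePtr b) {
    auto n = std::make_shared<Node>(); n->kind = Node::MULTIPLY; n->left = a; n->right = b; return n;
}

// Evaluate the graph for a single variable x (every VARIABLE node reads x
// in this toy version).
double eval(const NodePtr &n, double x) {
    switch (n->kind) {
    case Node::CONSTANT: return n->value;
    case Node::VARIABLE: return x;
    case Node::ADD:      return eval(n->left, x) + eval(n->right, x);
    case Node::MULTIPLY: return eval(n->left, x) * eval(n->right, x);
    }
    return 0.0;
}

// Symbolic derivative with respect to the variable named wrt: sum rule for
// ADD, product rule for MULTIPLY. The result is itself a graph.
NodePtr df(const NodePtr &n, const std::string &wrt) {
    switch (n->kind) {
    case Node::CONSTANT: return constant(0.0);
    case Node::VARIABLE: return constant(n->name == wrt ? 1.0 : 0.0);
    case Node::ADD:      return add(df(n->left, wrt), df(n->right, wrt));
    case Node::MULTIPLY: return add(multiply(df(n->left, wrt), n->right),
                                    multiply(n->left, df(n->right, wrt)));
    }
    return nullptr;
}
```

Because the derivative is another graph, the same reduction and code
generation machinery can be applied to it unchanged.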
After expressions are built, workflows are created. A workflow is defined from
one or more workflow items. A workflow item is defined from input nodes, output
nodes, and maps between inputs and outputs. For each input and output node,
device buffers are allocated. Then starting from a given output, device specific
kernel source code is created by traversing the graph and adding a line
appropriate for the expression. Duplicate expressions are avoided by tracking a
registry of the nodes already emitted. The workflow is then run by iterating
through the workflow items.
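The traversal-with-deduplication step can be sketched as follows. This is an
illustrative stand-in, not the framework's implementation: each node is lowered
to one line of kernel source, and a visited map lets any duplicate
subexpression reuse the variable that was already emitted for it.

```cpp
#include <map>
#include <string>
#include <vector>

// Illustrative expression node for code generation (hypothetical layout).
struct Expr;
using ExprPtr = const Expr *;
struct Expr {
    std::string op;    // "var", "add", or "mul"
    std::string name;  // variable name when op == "var"
    ExprPtr left = nullptr, right = nullptr;
};

// Emit one source line per node, depth first. Returns the name holding the
// node's value; previously visited nodes are not re-emitted.
std::string emit(ExprPtr node,
                 std::map<ExprPtr, std::string> &visited,
                 std::vector<std::string> &lines) {
    auto found = visited.find(node);
    if (found != visited.end()) {
        return found->second;  // duplicate expression: reuse its variable
    }
    std::string result;
    if (node->op == "var") {
        result = node->name;
    } else {
        const std::string a = emit(node->left, visited, lines);
        const std::string b = emit(node->right, visited, lines);
        result = "v" + std::to_string(lines.size());
        const char symbol = node->op == "add" ? '+' : '*';
        lines.push_back(result + " = " + a + " " + symbol + " " + b + ";");
    }
    visited[node] = result;
    return result;
}
```

For an input like $(x \cdot x) + (x \cdot x)$ with a shared subgraph, the
multiplication is emitted once and the addition line reuses its variable.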
# Research impact statement
The `graph_framework` enables domain scientists to create portable high
performance code by simply writing out equations. Symbolic mathematical
reductions simplify expressions, which are then JIT compiled to device code.
The generated high performance code enables higher fidelity simulations or the
generation of large datasets for training reduced machine learning models. To
demonstrate the performance of this framework we explored two physics examples:
RF ray tracing in a realistic tokamak geometry and simulating gyro motion in a
uniform magnetic field.
## Strong Scaling
{width=90%}
To measure strong scaling we traced $10^{6}$ rays for $10^{3}$ time steps using
the cold plasma dispersion relation in a realistic tokamak equilibrium. A
benchmarking application is available in the git repository. The figure above
shows the strong scaling of wall time and normalized speed up as the number of
GPU and CPU devices is increased.