Commit 36f47f43 authored by Cianciosa, Mark's avatar Cianciosa, Mark

Merge branch 'benchmark' into 'main'

Fix errors found in the benchmarking and fix several issues with documentation.

See merge request !88
parents c0f6a84a be514142
+16.3 KiB (48 KiB)
+54 −8
@@ -45,12 +45,16 @@
 * @f{equation}{\vec{v}_{n+1} = \vec{v}_{n} + dt\,\vec{v}_{n}\times\vec{B}@f}
 * @f{equation}{\vec{x}_{n+1} = \vec{x}_{n} + dt\,\vec{v}_{n+1}@f}
 *
 * We compared the graph framework against the
 * <a href="https://ml-explore.github.io/mlx/build/html/index.html">MLX</a>
 * framework since it supports Apple GPUs,
 * <a href="https://docs.jax.dev/en/latest/">JAX</a> due to its popularity,
 * and <a href="https://kokkos.org">Kokkos</a> for its performance
 * portability. Source code for this benchmark case is available in the
 * appendix. Figure \ref{fig:compare} shows the throughput of pushing $10^{8}$
 * particles for $10^{3}$ time steps. The graph framework consistently shows the
 * best throughput on both CPUs and GPUs. Note MLX CPU throughput could be
 * improved by splitting the problem across multiple threads.
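As a sanity check on the benchmark kernels listed in the appendix, the same forward-Euler push can be sketched for a single particle in plain C++ (a hypothetical helper, not part of the framework; @f$\vec{B}=(0,0,1)@f$ and the initial state @f$\vec{v}=(1,0,1)@f$ are assumed, matching the source codes below). Since @f$\vec{B}@f$ is parallel to @f$\hat{z}@f$, the in-plane speed @f$v_{x}^{2}+v_{y}^{2}@f$ should stay nearly constant over the run.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical single-particle reference for the benchmark push:
// v <- v + dt (v x B), x <- x + dt v_next, with B = (0, 0, 1).
struct particle {
    float x = 0.0f, y = 0.0f, z = 0.0f;
    float vx = 1.0f, vy = 0.0f, vz = 1.0f; // initial state used by the benchmarks
};

void push(particle &p, const float dt, const std::size_t steps) {
    for (std::size_t i = 0; i < steps; i++) {
        // v x B with B = z_hat is (vy, -vx, 0).
        const float vx_next = p.vx + dt*p.vy;
        const float vy_next = p.vy - dt*p.vx;
        const float vz_next = p.vz;
        p.x += dt*vx_next;
        p.y += dt*vy_next;
        p.z += dt*vz_next;
        p.vx = vx_next;
        p.vy = vy_next;
        p.vz = vz_next;
    }
}
```

Forward Euler grows @f$\left|\vec{v}\right|@f$ by roughly @f$(1 + dt^{2})^{n/2}@f$, which is negligible for @f$dt=10^{-6}@f$ over @f$10^{3}@f$ steps, so any large drift in the in-plane speed indicates a transcription error in a ported kernel.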
 *
 * @subsection code_performance_comparison_codes Source codes for throughput benchmark comparison
 * @subsubsection code_performance_comparison_graph Graph Framework
@@ -93,7 +97,7 @@ for (size_t i = 0, ie = threads.size(); i < ie; i++) {
        auto v_next = v + dt*lorentz;
        auto pos_next = pos + dt*v_next;
            
        workflow::manager<float> work(thread_number);
        work.add_item({
            graph::variable_cast(x),
            graph::variable_cast(y),
@@ -177,7 +181,7 @@ const auto total_time = end - start;
def push(x, y, z, vx, vy, vz):
    dt = 0.000001
    vx_next = vx + dt*(vy*1 - vz*0)
    vy_next = vy + dt*(vz*0 - vx*1)
    vz_next = vz + dt*(vx*0 - vy*0)
    return vx_next, vy_next, vz_next, \
           x + dt*vx_next, y + dt*vy_next, z + dt*vz_next
@@ -201,6 +205,48 @@ jax.block_until_ready([x, y, z, vx, vy, vz])
end = time.time()

print(end - start)
 @endcode
 *
 * @subsubsection code_performance_comparison_kokkos Kokkos
 * @code
const size_t size = 100000000;
const size_t steps = 1000;

using ViewVectorType = Kokkos::View<float *, Kokkos::SharedSpace>;
ViewVectorType x("x", size);
ViewVectorType y("y", size);
ViewVectorType z("z", size);

ViewVectorType vx("vx", size);
ViewVectorType vy("vy", size);
ViewVectorType vz("vz", size);

Kokkos::parallel_for(size, KOKKOS_LAMBDA(const int64_t index) {
    vx[index] = 1;
    vz[index] = 1;
});

const std::chrono::high_resolution_clock::time_point start = std::chrono::high_resolution_clock::now();

for (size_t i = 0; i < steps; i++) {
    Kokkos::parallel_for(size, KOKKOS_LAMBDA(const int64_t index) {
        const float dt = 0.000001;
        const float vx_next = vx[index] + dt*(vy[index]*1 - vz[index]*0);
        const float vy_next = vy[index] + dt*(vz[index]*0 - vx[index]*1);
        const float vz_next = vz[index] + dt*(vx[index]*0 - vy[index]*0);
        x[index] += dt*vx_next;
        y[index] += dt*vy_next;
        z[index] += dt*vz_next;
        vx[index] = vx_next;
        vy[index] = vy_next;
        vz[index] = vz_next;
    });
}

Kokkos::fence();

std::chrono::high_resolution_clock::time_point end = std::chrono::high_resolution_clock::now();
const auto total_time = end - start;
 @endcode
 */
+9 −8
@@ -5,8 +5,8 @@
 * @section discription_introduction Introduction
 * The basic functionality of this framework is to build expression graphs
 * representing mathematical equations. Reduce those graphs to simpler forms.
 * Transform those graphs to take derivatives. Just-In-Time (JIT) compile them
 * to available compute device kernels. Then run those kernels in workflows. The
 * code is written using C++23 features. To simplify embedding into legacy
 * codes, there are additional language bindings for C and Fortran.
 *
@@ -48,9 +48,10 @@
 * be reduced to a single constant by calling the evaluate method. Sub-graph
 * expressions are combined, factored out, or moved to enable better reductions
 * on subsequent passes. As new ways of reducing the graph are implemented,
 * existing code built using this framework will benefit from improved speed.
 * The figure above shows a visualization of the tree data structure for the
 * equation of a line, the derivative, and the subsequent reductions.
 *
 * @subsubsection discription_graphs_builds Building Graphs
 * As an example, building an expression for the line @f$y=mx+b@f$ is accomplished by
@@ -79,8 +80,8 @@ auto dydmx = y->df(0.5*x);
 * running them in order. One @ref workflow::manager is created for each device
 * or thread. The user is responsible for creating threads. Each kernel is
 * generated through a @ref workflow::work_item. A work item is defined by
 * kernel @ref graph::input_nodes, @ref graph::output_nodes, and
 * @ref graph::map_nodes. Map items are used to take the results of a kernel and
 * update an input buffer. Using our example of the line equation, we can create
 * a workflow to compute @f$y@f$ and @f$\frac{\partial y}{\partial x}@f$.
 * @code
@@ -99,7 +100,7 @@ work.add_item({
 * elements in the inputs. Multiple work items can be created and will be
 * executed in order of creation.
 *
 * Once the work items are defined they can be JIT compiled to a backend device.
 * The graph framework supports back ends for generic CPUs, Apple Metal GPUs,
 * Nvidia Cuda GPUs, and initial HIP support for AMD GPUs. Each back end supplies
 * relevant driver code to build the kernel source, compile the kernel, build
+9 −8
@@ -39,7 +39,7 @@
 * as either variables @f$x@f$ or constants @f$m,b@f$. These nodes are connected
 * by nodes for multiply and addition operations. The output @f$y@f$ represents
 * the entire graph of operations.
 * @image{} html line_graph.png "The graph structure for y = mx + b."
 * Evaluation of graphs starts from the topmost node, in this case the @f$+@f$
 * operation. Evaluation of a node is not performed until all sub-nodes are
 * evaluated, starting with the left operand. Evaluation starts by recursively
@@ -58,9 +58,10 @@
 * graphs of a function derivative. For an example of taking derivatives see the
 * @ref tutorial_derivatives "auto differentiation tutorial". Let's say that we
 * want to take the derivative @f$\frac{\partial y}{\partial x}@f$. This is
 * achieved by evaluating the graph until the bottom left-most node is reached.
 * Then a new graph is constructed starting with
 * @f$\frac{\partial m}{\partial x}=0@f$. Applying the first half of the chain
 * rule we build a new graph for @f$0x@f$
 * @image{} html line_graph_dydf1.png ""
 * Then we take the derivative of the right operand and apply the second half
 * of the chain rule to build a new graph for @f$0x=0@f$.
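The two chain-rule halves described above can be sketched with a toy node hierarchy (hypothetical classes, not the framework's actual @ref graph node types): each node's <tt>df()</tt> builds a new derivative graph, and a multiply node emits the two product-rule halves @f$\frac{\partial l}{\partial x}r + l\frac{\partial r}{\partial x}@f$.

```cpp
#include <cassert>
#include <memory>

// Toy expression nodes (illustrative only): df() builds a new derivative
// graph with respect to the single variable x, as described above.
struct node {
    virtual ~node() = default;
    virtual double eval() const = 0;
    virtual std::shared_ptr<node> df() const = 0;
};
using node_ptr = std::shared_ptr<node>;

struct constant final : node {
    double v;
    explicit constant(const double c) : v(c) {}
    double eval() const override { return v; }
    node_ptr df() const override { return std::make_shared<constant>(0.0); } // dc/dx = 0
};

struct variable final : node {
    double v;
    explicit variable(const double value) : v(value) {}
    double eval() const override { return v; }
    node_ptr df() const override { return std::make_shared<constant>(1.0); } // dx/dx = 1
};

struct add final : node {
    node_ptr l, r;
    add(node_ptr a, node_ptr b) : l(a), r(b) {}
    double eval() const override { return l->eval() + r->eval(); }
    node_ptr df() const override { return std::make_shared<add>(l->df(), r->df()); }
};

struct multiply final : node {
    node_ptr l, r;
    multiply(node_ptr a, node_ptr b) : l(a), r(b) {}
    double eval() const override { return l->eval()*r->eval(); }
    node_ptr df() const override { // product rule: dl*r + l*dr
        return std::make_shared<add>(std::make_shared<multiply>(l->df(), r),
                                     std::make_shared<multiply>(l, r->df()));
    }
};
```

With @f$y = mx + b@f$ built from these toy nodes, <tt>y->df()</tt> produces the graph @f$0x + m1 + 0@f$, which evaluates to @f$m@f$.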
@@ -73,8 +74,8 @@
 * The final expression for @f$\frac{\partial y}{\partial x}@f$ contains many
 * unnecessary nodes in the graph. Instead of building full graphs, we can
 * simplify and eliminate nodes as we build them. For instance, when the
 * expression @f$0\times x@f$ is created, this can be immediately reduced to a
 * single node @f$0@f$.
 * @image{} html line_graph_reduce1.png ""
 * Applying all possible reductions reduces the final expression to
 * @f$\frac{\partial y}{\partial x}=m@f$.
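The reduce-as-you-build idea can be sketched with hypothetical smart constructors (not the framework's API) that pattern-match identities such as @f$0\times e\rightarrow 0@f$, @f$1\times e\rightarrow e@f$, and @f$0 + e\rightarrow e@f$ before ever allocating the full node:

```cpp
#include <cassert>
#include <memory>

// Tiny illustrative expression type: a node is either a known constant
// or an opaque (non-constant) subexpression.
struct expr {
    bool is_constant;
    double value; // meaningful only when is_constant is true
};
using expr_ptr = std::shared_ptr<expr>;

expr_ptr constant(const double v) { return std::make_shared<expr>(expr{true, v}); }
expr_ptr opaque() { return std::make_shared<expr>(expr{false, 0.0}); }

// Smart constructor for multiplication: applies 0*e -> 0 and 1*e -> e
// (and the mirrored forms) instead of building a multiply node.
expr_ptr mul(expr_ptr l, expr_ptr r) {
    if ((l->is_constant && l->value == 0.0) ||
        (r->is_constant && r->value == 0.0)) {
        return constant(0.0);
    }
    if (l->is_constant && l->value == 1.0) { return r; }
    if (r->is_constant && r->value == 1.0) { return l; }
    if (l->is_constant && r->is_constant) { return constant(l->value*r->value); }
    return opaque(); // a real framework would allocate a multiply node here
}

// Smart constructor for addition: applies 0 + e -> e.
expr_ptr add(expr_ptr l, expr_ptr r) {
    if (l->is_constant && l->value == 0.0) { return r; }
    if (r->is_constant && r->value == 0.0) { return l; }
    if (l->is_constant && r->is_constant) { return constant(l->value + r->value); }
    return opaque();
}
```

With these constructors the derivative expression @f$0x + m1@f$ collapses directly to the single node @f$m@f$, without any separate reduction pass.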
@@ -109,7 +110,7 @@
 * @subsection general_concepts_compile_maps Maps
 * Maps enable the results of an output node to be stored in an input node. This
 * is used for a wide variety of cases. For instance, take a gradient descent step.
 * @f{equation}{y_{i+1} = y_{i} + \frac{\partial f}{\partial x}@f}
 * In this case the output of the expression
 * @f$y + \frac{\partial f}{\partial x}@f$
 * can be mapped to update @f$y@f$.
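A map can be pictured (plain C++ sketch, not the framework API) as a copy from the kernel's output buffer back into one of its input buffers between evaluations:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One "work item" evaluation followed by its map: the kernel computes
// out = y + dfdx element-wise, then the map feeds out back into the
// input buffer y so the next evaluation sees the updated values.
void mapped_step(std::vector<float> &y,
                 const std::vector<float> &dfdx,
                 std::vector<float> &out) {
    for (std::size_t i = 0; i < y.size(); i++) {
        out[i] = y[i] + dfdx[i]; // kernel: evaluate the output node
    }
    y = out;                     // map: update the input from the output
}
```

Repeated calls then implement the iteration @f$y_{i+1} = y_{i} + \frac{\partial f}{\partial x}@f$ without the caller ever copying buffers by hand.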
@@ -122,7 +123,7 @@
 * <hr>
 * @section general_concepts_safe_math Safe Math
 * There are some conditions where mathematically, a graph should evaluate to a
 * normal number. However, when evaluated using floating point precision, it can
 * lead to <tt>Inf</tt> or <tt>NaN</tt>. An example of this is the
 * @f$\exp\left(x\right)@f$ function. For large argument values,
 * @f$\exp\left(x\right)@f$ overflows the maximum floating point precision and
+19 −19
@@ -13,7 +13,7 @@
 * executable target which can be used to test out the APIs of this framework.
 * The playground starts with a blank main function.
 * @code
#include "graph_framework.hpp"

int main(int argc, const char * argv[]) {
    START_GPU
@@ -30,7 +30,7 @@ int main(int argc, const char * argv[]) {
 * main. This will allow us to play with different floating point types. For now
 * we will start with a simple float type.
 * @code
#include "graph_framework.hpp"

template<jit::float_scalar T>
void run_tutorial() {
@@ -84,7 +84,7 @@ void run_tutorial() {
 * so all methods are called using the <tt>-></tt> operator.
 *
 * @subsection tutorial_constant Constant Nodes
 * Next we want to define a constant. There are two methods to define constants:
 * explicitly or implicitly.
 * @code
template<jit::float_scalar T>