Commit be514142 authored by Cianciosa, Mark

Fix errors found in the benchmarking and fix several issues with documentation.

parent c0f6a84a
[Image diff: +16.3 KiB (48 KiB)]
+54 −8
@@ -45,12 +45,16 @@
 * @f{equation}{\frac{\partial\vec{v}}{\partial t} = \vec{v}\times\vec{B}@f}
 * @f{equation}{\frac{\partial\vec{x}}{\partial t} = \vec{v}@f}
 *
 * We compared the graph framework against the MLX framework since it supports
 * Apple GPUs and JAX due to it's popularity. Source codes for this benchmark
 * case is available in the appendix. Figure \ref{fig:compare} shows the through put of
 * pushing $10^{8}$ particles for $10^{3}$ time steps. The graph framework
 * consistently shows the best throughput on both CPUs and GPUs. Note MLX CPU
 * throughput could by improved by splitting the problem to multiple threads.
 * We compared the graph framework against the
 * <a href="https://ml-explore.github.io/mlx/build/html/index.html">MLX</a>
 * framework since it supports Apple GPUs,
 * <a href="https://docs.jax.dev/en/latest/">JAX</a> due to its popularity,
 * and <a href="https://kokkos.org">Kokkos</a> for its performance
 * portability. Source code for this benchmark case is available in the
 * appendix. Figure \ref{fig:compare} shows the throughput of pushing
 * @f$10^{8}@f$ particles for @f$10^{3}@f$ time steps. The graph framework
 * consistently shows the best throughput on both CPUs and GPUs. Note that MLX
 * CPU throughput could be improved by splitting the problem across multiple
 * threads.
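 * In the benchmark codes below, these equations are advanced with a simple
 * explicit step, a forward Euler velocity update followed by a position update
 * using the new velocity:
 * @f{equation}{\vec{v}_{n+1} = \vec{v}_{n} + \Delta t\,\vec{v}_{n}\times\vec{B}@f}
 * @f{equation}{\vec{x}_{n+1} = \vec{x}_{n} + \Delta t\,\vec{v}_{n+1}@f}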
 *
 * @subsection code_performance_comparison_codes Source codes for throughput benchmark comparison
 * @subsubsection code_performance_comparison_graph Graph Framework
@@ -93,7 +97,7 @@ for (size_t i = 0, ie = threads.size(); i < ie; i++) {
        auto v_next = v + dt*lorentz;
        auto pos_next = pos + dt*v_next;
            
        workflow::manager<float> work(0);
        workflow::manager<float> work(thread_number);
        work.add_item({
            graph::variable_cast(x),
            graph::variable_cast(y),
@@ -177,7 +181,7 @@ const auto total_time = end - start;
def push(x, y, z, vx, vy, vz):
    dt = 0.000001
    vx_next = vx + dt*(vy*1 - vz*0)
    vy_next = vy + dt*(vz*0 - vy*1)
    vy_next = vy + dt*(vz*0 - vx*1)
    vz_next = vz + dt*(vx*0 - vy*0)
    return (vx_next, vy_next, vz_next,
            x + dt*vx_next, y + dt*vy_next, z + dt*vz_next)
@@ -201,6 +205,48 @@ jax.block_until_ready([x, y, z, vx, vy, vz])
end = time.time()

print(end - start)
 @endcode
 *
 * @subsubsection code_performance_comparison_kokkos Kokkos
 * @code
const size_t size = 100000000;
const size_t steps = 1000;

using ViewVectorType = Kokkos::View<float *, Kokkos::SharedSpace>;
ViewVectorType x("x", size);
ViewVectorType y("y", size);
ViewVectorType z("z", size);

ViewVectorType vx("vx", size);
ViewVectorType vy("vy", size);
ViewVectorType vz("vz", size);

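// Give every particle a unit initial velocity along x and z (vy stays zero).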
Kokkos::parallel_for(size, KOKKOS_LAMBDA(const int64_t index) {
    vx[index] = 1;
    vz[index] = 1;
});

const std::chrono::high_resolution_clock::time_point start = std::chrono::high_resolution_clock::now();

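// Advance all particles through the time steps: update the velocity from
// v x B (B along z), then the position using the new velocity.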
for (size_t i = 0; i < steps; i++) {
    Kokkos::parallel_for(size, KOKKOS_LAMBDA(const int64_t index) {
        const float dt = 0.000001;
        const float vx_next = vx[index] + dt*(vy[index]*1 - vz[index]*0);
        const float vy_next = vy[index] + dt*(vz[index]*0 - vx[index]*1);
        const float vz_next = vz[index] + dt*(vx[index]*0 - vy[index]*0);
        x[index] += dt*vx_next;
        y[index] += dt*vy_next;
        z[index] += dt*vz_next;
        vx[index] = vx_next;
        vy[index] = vy_next;
        vz[index] = vz_next;
    });
}

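// Make sure all device work has finished before reading the final time.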
Kokkos::fence();

std::chrono::high_resolution_clock::time_point end = std::chrono::high_resolution_clock::now();
const auto total_time = end - start;
 @endcode
 */
+9 −8
@@ -5,8 +5,8 @@
 * @section discription_introduction Introduction
 * The basic functionality of this framework is to build expression graphs
 * representing mathematical equations. Reduce those graphs to simpler forms.
 * Transform those graph to take derivatives. Just-In-Time (JIT) compile them to
 * available compute device kernels. Then run those kernels in workflow. The
 * Transform those graphs to take derivatives. Just-In-Time (JIT) compile them
 * to available compute device kernels. Then run those kernels in workflows. The
 * code is written using C++23 features. To simplify embedding into legacy
 * codes, there are additional language bindings for C and Fortran.
 *
@@ -48,9 +48,10 @@
 * be reduced to a single constant by calling the evaluate method. Sub-graph
 * expressions are combined, factored out, or moved to enable better reductions
 * on subsequent passes. As new ways of reducing the graph are implemented,
 * current and existing code built using this framework benefit from improved
 * speed. The figure above shows a visualization of the tree data structure for
 * the equation of a line, the derivative, and the subsequent reductions.
 * both new and existing code built using this framework will benefit from
 * improved speed. The figure above shows a visualization of the tree data
 * structure for the equation of a line, its derivative, and the subsequent
 * reductions.
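 * For example, a sub-expression such as @f$0\times x@f$ or @f$x+0@f$ collapses
 * immediately to @f$0@f$ or @f$x@f$, and a purely constant sub-graph such as
 * @f$2\times 3@f$ folds to the single constant @f$6@f$, leaving later passes a
 * smaller graph to work with.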
 *
 * @subsubsection discription_graphs_builds Building Graphs
 * As an example, building an expression for the line @f$y=mx+b@f$ is accomplished by
@@ -79,8 +80,8 @@ auto dydmx = y->df(0.5*x);
 * running them in order. One @ref workflow::manager is created for each device
 * or thread. The user is responsible for creating threads. Each kernel is
 * generated through a @ref workflow::work_item. A work item is defined by
 * kernel @ref graph::input_nodes, @ref graph::output_nodes and
 * @ref graph::map_nodes. Map items are used to take the results of kernel and
 * kernel @ref graph::input_nodes, @ref graph::output_nodes, and
 * @ref graph::map_nodes. Map items are used to take the results of a kernel and
 * update an input buffer. Using our line equation example, we can create a
 * workflow to compute @f$y@f$ and @f$\frac{\partial y}{\partial x}@f$.
 * @code
@@ -99,7 +100,7 @@ work.add_item({
 * elements in the inputs. Multiple work items can be created and will be
 * executed in order of creation.
 *
 * Once the work items are defined that can be JIT compiled to a backend device.
 * Once the work items are defined, they can be JIT compiled to a backend
 * device. The graph framework supports back ends for generic CPUs, Apple Metal
 * GPUs, Nvidia CUDA GPUs, and has initial HIP support for AMD GPUs. Each back
 * end supplies
 * relevant driver code to build the kernel source, compile the kernel, build
+9 −8
@@ -39,7 +39,7 @@
 * as either variables @f$x@f$ or constants @f$m,b@f$. These nodes are connected
 * by nodes for multiply and addition operations. The output @f$y@f$ represents
 * the entire graph of operations.
 * @image{} html line_graph.png "The graph structure for @f$y=mx+b@f$."
 * @image{} html line_graph.png "The graph structure for y = mx + b."
 * Evaluation of a graph starts from the top-most node, in this case the
 * @f$+@f$ operation. A node is not evaluated until all of its sub-nodes have
 * been evaluated, starting with the left operand. Evaluation starts by recursively
@@ -58,9 +58,10 @@
 * graphs of a function derivative. For an example of taking derivatives, see
 * the @ref tutorial_derivatives "auto differentiation tutorial". Let's say that
 * we want to compute the derivative @f$\frac{\partial y}{\partial x}@f$. This is
 * achieved by evaluating the until bottom left most node is reached. Then a new
 * graph is build starting with @f$\frac{\partial m}{\partial x}=0@f$. Applying
 * the first half of the chain rule we build a new graph for @f$0x@f$
 * achieved by evaluating the graph until the bottom-left-most node is reached.
 * Then a new graph is constructed starting with
 * @f$\frac{\partial m}{\partial x}=0@f$. Applying the first half of the chain
 * rule, we build a new graph for @f$0\times x@f$
 * @image{} html line_graph_dydf1.png ""
 * Then we take the derivative of the right operand and apply the second half
 * of the chain rule to build a new graph for @f$0x=0@f$.
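 * Written out in full, the derivative graph built this way represents
 * @f{equation}{\frac{\partial y}{\partial x} = \frac{\partial m}{\partial x}x + m\frac{\partial x}{\partial x} + \frac{\partial b}{\partial x} = 0\times x + m\times 1 + 0,@f}
 * which the reductions described below collapse to the single node @f$m@f$.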
@@ -73,8 +74,8 @@
 * The final expression for @f$\frac{\partial y}{\partial x}@f$ contains many
 * unnecessary nodes in the graph. Instead of building full graphs, we can
 * simplify and eliminate nodes as we build them. For instance, when the
 * expression @f$0x@f$ this created can be immediately reduce it  to a single
 * node.
 * expression @f$0\times x@f$ is created, it can be immediately reduced to a
 * single node @f$0@f$.
 * @image{} html line_graph_reduce1.png ""
 * Applying all possible reductions reduces the final expression to
 * @f$\frac{\partial y}{\partial x}=m@f$.
@@ -109,7 +110,7 @@
 * @subsection general_concepts_compile_maps Maps
 * Maps enable the results of an output node to be stored in an input node. This
 * is used for a wide variety of cases. For instance, take a gradient descent step.
 * @f{equation}{y = y + \frac{\partial f}{\partial x}@f}
 * @f{equation}{y_{i+1} = y_{i} + \frac{\partial f}{\partial x}@f}
 * In this case the output of the expression
 * @f$y + \frac{\partial f}{\partial x}@f$
 * can be mapped to update @f$y@f$.
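 * As a framework-agnostic sketch (plain C++, not the graph framework API), the
 * effect of such a map is to write the evaluated output back into the buffer
 * backing the input node, so the next iteration sees the updated value. The
 * gradient here is a stand-in (@f$2y@f$) purely for illustration.
 * @code
#include <cstddef>
#include <vector>

int main() {
    const std::size_t size = 8;
    const std::size_t steps = 10;

    std::vector<float> y(size, 1.0f);   // buffer backing the input node y
    std::vector<float> dfdx(size);      // buffer backing the output df/dx

    for (std::size_t step = 0; step < steps; step++) {
        // Evaluate the output expression (stand-in gradient for illustration).
        for (std::size_t i = 0; i < size; i++) {
            dfdx[i] = 2.0f*y[i];
        }
        // The "map": the output is written back to update the input buffer,
        // implementing y_{i+1} = y_{i} + df/dx for the next iteration.
        for (std::size_t i = 0; i < size; i++) {
            y[i] += dfdx[i];
        }
    }
}
 * @endcode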
@@ -122,7 +123,7 @@
 * <hr>
 * @section general_concepts_safe_math Safe Math
 * There are some conditions where mathematically, a graph should evaluate to a
 * normal number. However, when evaluated suing floating point precision, can
 * normal number. However, when evaluated using floating point precision, it
 * can lead to <tt>Inf</tt> or <tt>NaN</tt>. An example of this is the
 * @f$\exp\left(x\right)@f$ function. For large argument values,
 * @f$\exp\left(x\right)@f$ overflows the maximum floating point precision and
+19 −19
@@ -13,7 +13,7 @@
 * executable target which can be used to test out the APIs of this framework.
 * The playground starts with a blank main function.
 * @code
#include "../graph_framework/jit.hpp"
#include "graph_framework.hpp"

int main(int argc, const char * argv[]) {
    START_GPU
@@ -30,7 +30,7 @@ int main(int argc, const char * argv[]) {
 * main. This will allow us to play with different floating point types. For now
 * we will start with a simple float type.
 * @code
#include "../graph_framework/jit.hpp"
#include "graph_framework.hpp"

template<jit::float_scalar T>
void run_tutorial() {
@@ -84,7 +84,7 @@ void run_tutorial() {
 * so all methods are called using the <tt>-></tt> operator.
 *
 * @subsection tutorial_constant Constant Nodes
 * Next we want to define a constant. There are two method to define constants
 * Next we want to define a constant. There are two methods to define
 * constants: explicitly or implicitly.
 * @code
template<jit::float_scalar T>