graph_docs/Comparison.png +16.3 KiB (48 KiB) (binary image; no text diff)

graph_docs/code_performance.dox +54 −8

@@ -45,12 +45,16 @@
  * @f{equation}{\frac{\partial\vec{v}}{\partial t} = \vec{v}\times\vec{B}@f}
  * @f{equation}{\frac{\partial\vec{x}}{\partial t} = \vec{v}@f}
  *
- * We compared the graph framework against the MLX framework since it supports
- * Apple GPUs and JAX due to it's popularity. Source codes for this benchmark
- * case is available in the appendix. Figure \ref{fig:compare} shows the through put of
- * pushing $10^{8}$ particles for $10^{3}$ time steps. The graph framework
- * consistently shows the best throughput on both CPUs and GPUs. Note MLX CPU
- * throughput could by improved by splitting the problem to multiple threads.
+ * We compared the graph framework against the
+ * <a href="https://ml-explore.github.io/mlx/build/html/index.html">MLX</a>
+ * framework since it supports Apple GPUs,
+ * <a href="https://docs.jax.dev/en/latest/">JAX</a> due to its popularity,
+ * and <a href="https://kokkos.org">Kokkos</a> for its performance
+ * portability. Source code for this benchmark case is available in the
+ * appendix. Figure \ref{fig:compare} shows the throughput of pushing $10^{8}$
+ * particles for $10^{3}$ time steps. The graph framework consistently shows the
+ * best throughput on both CPUs and GPUs. Note that MLX CPU throughput could be
+ * improved by splitting the problem across multiple threads.
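Every benchmark listing in this section implements the same explicit update: v_next = v + dt (v × B) with a uniform field B = (0, 0, 1), followed by x_next = x + dt v_next. A minimal sketch of one push step (plain Python as a neutral stand-in for illustration only; the benchmarks use each framework's own array types, and dt = 0.000001 is taken from the listings):

```python
def push(x, y, z, vx, vy, vz, dt=1.0e-6):
    # v_next = v + dt * (v x B), written out for B = (0, 0, 1)
    vx_next = vx + dt * (vy * 1 - vz * 0)
    vy_next = vy + dt * (vz * 0 - vx * 1)
    vz_next = vz + dt * (vx * 0 - vy * 0)
    # x_next = x + dt * v_next; return order matches the JAX listing
    return (vx_next, vy_next, vz_next,
            x + dt * vx_next, y + dt * vy_next, z + dt * vz_next)
```

With vx = 1, the y-velocity decreases by dt each step, which is what the corrected `vy_next` line in the JAX hunk computes (the old line used `vy` instead of `vx` and was not a cross product with B = (0, 0, 1)).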
  *
  * @subsection code_performance_comparison_codes Source codes for throughput benchmark comparison
  * @subsubsection code_performance_comparison_graph Graph Framework

@@ -93,7 +97,7 @@ for (size_t i = 0, ie = threads.size(); i < ie; i++) {
     auto v_next = v + dt*lorentz;
     auto pos_next = pos + dt*v_next;

-    workflow::manager<float> work(0);
+    workflow::manager<float> work(thread_number);
     work.add_item({
         graph::variable_cast(x),
         graph::variable_cast(y),

@@ -177,7 +181,7 @@ const auto total_time = end - start;
 def push(x, y, z, vx, vy, vz):
     dt = 0.000001
     vx_next = vx + dt*(vy*1 - vz*0)
-    vy_next = vy + dt*(vz*0 - vy*1)
+    vy_next = vy + dt*(vz*0 - vx*1)
     vz_next = vz + dt*(vx*0 - vy*0)
     return vx_next, vy_next, vz_next, x + dt*vx_next, y + dt*vy_next, z + dt*vz_next

@@ -201,6 +205,48 @@ jax.block_until_ready([x, y, z, vx, vy, vz])
 end = time.time()
 print(end - start)
 @endcode
+ *
+ * @subsubsection code_performance_comparison_kokkos Kokkos
+ * @code
+const size_t size = 100000000;
+const size_t steps = 1000;
+
+using ViewVectorType = Kokkos::View<float *, Kokkos::SharedSpace>;
+ViewVectorType x("x", size);
+ViewVectorType y("y", size);
+ViewVectorType z("z", size);
+ViewVectorType vx("vx", size);
+ViewVectorType vy("vy", size);
+ViewVectorType vz("vz", size);
+
+Kokkos::parallel_for(size, KOKKOS_LAMBDA(const int64_t index) {
+    vx[index] = 1;
+    vz[index] = 1;
+});
+
+const std::chrono::high_resolution_clock::time_point start =
+    std::chrono::high_resolution_clock::now();
+for (size_t i = 0; i < steps; i++) {
+    Kokkos::parallel_for(size, KOKKOS_LAMBDA(const int64_t index) {
+        const float dt = 0.000001;
+        const float vx_next = vx[index] + dt*(vy[index]*1 - vz[index]*0);
+        const float vy_next = vy[index] + dt*(vz[index]*0 - vx[index]*1);
+        const float vz_next = vz[index] + dt*(vx[index]*0 - vy[index]*0);
+        x[index] += dt*vx_next;
+        y[index] += dt*vy_next;
+        z[index] += dt*vz_next;
+        vx[index] = vx_next;
+        vy[index] = vy_next;
+        vz[index] = vz_next;
+    });
+}
+Kokkos::fence();
+std::chrono::high_resolution_clock::time_point end =
+    std::chrono::high_resolution_clock::now();
+const auto total_time = end - start;
+ * @endcode
 */

graph_docs/discription.dox +9 −8

@@ -5,8 +5,8 @@
  * @section discription_introduction Introduction
  * The basic functionality of this framework is to build expression graphs
  * representing mathematical equations. Reduce those graphs to simpler forms.
- * Transform those graph to take derivatives. Just-In-Time (JIT) compile them to
- * available compute device kernels. Then run those kernels in workflow. The
+ * Transform those graphs to take derivatives. Just-In-Time (JIT) compile them
+ * to available compute device kernels. Then run those kernels in workflows. The
  * code is written using C++23 features. To simplify embedding into legacy
  * codes, there are additional language bindings for C and Fortran.
  *

@@ -48,9 +48,10 @@
  * be reduced to a single constant by calling the evaluate method. Sub-graph
  * expressions are combined, factored out, or moved to enable better reductions
  * on subsequent passes. As new ways of reducing the graph are implemented,
- * current and existing code built using this framework benefit from improved
- * speed. The figure above shows a visualization of the tree data structure for
- * the equation of a line, the derivative, and the subsequent reductions.
+ * current and existing code built using this framework will benefit from
+ * improved speed. The figure above shows a visualization of the tree data
+ * structure for the equation of a line, the derivative, and the subsequent
+ * reductions.
  *
  * @subsubsection discription_graphs_builds Building Graphs
  * As an example, building an expression for the line @f$y=mx+b@f$ is accomplished by

@@ -79,8 +80,8 @@ auto dydmx = y->df(0.5*x);
  * running them in order. One @ref workflow::manager is created for each device
  * or thread. The user is responsible for creating threads. Each kernel is
  * generated through a @ref workflow::work_item. A work item is defined by
- * kernel @ref graph::input_nodes, @ref graph::output_nodes and
- * @ref graph::map_nodes. Map items are used to take the results of kernel and
+ * kernel @ref graph::input_nodes, @ref graph::output_nodes, and
+ * @ref graph::map_nodes. Map items are used to take the results of a kernel and
  * update an input buffer. Using our example of the line equation, we can create a
  * workflow to compute @f$y@f$ and @f$\frac{\partial y}{\partial x}@f$.
  * @code

@@ -99,7 +100,7 @@ work.add_item({
  * elements in the inputs. Multiple work items can be created and will be
  * executed in order of creation.
  *
- * Once the work items are defined that can be JIT compiled to a backend device.
+ * Once the work items are defined they can be JIT compiled to a backend device.
  * The graph framework supports back ends for generic CPUs, Apple Metal GPUs,
  * Nvidia Cuda GPUs, and initial HIP support for AMD GPUs. Each back end supplies
  * relevant driver code to build the kernel source, compile the kernel, build

graph_docs/general.dox +9 −8

@@ -39,7 +39,7 @@
  * as either variables @f$x@f$ or constants @f$m,b@f$. These nodes are connected
  * by nodes for multiply and addition operations. The output @f$y@f$ represents
  * the entire graph of operations.
- * @image{} html line_graph.png "The graph structure for @f$y=mx+b@f$."
+ * @image{} html line_graph.png "The graph structure for y = mx + b."
  * Evaluation of graphs starts from the top most node, in this case the @f$+@f$
  * operation. Evaluation of a node is not performed until all sub-nodes are
  * evaluated, starting with the left operand. Evaluation starts by recursively

@@ -58,9 +58,10 @@
  * graphs of a function's derivative. For an example of taking derivatives see the
  * @ref tutorial_derivatives "auto differentiation tutorial". Let's say that we
  * want to take the derivative @f$\frac{\partial y}{\partial x}@f$. This is
- * achieved by evaluating the until bottom left most node is reached. Then a new
- * graph is build starting with @f$\frac{\partial m}{\partial x}=0@f$. Applying
- * the first half of the chain rule we build a new graph for @f$0x@f$
+ * achieved by evaluating the graph until the bottom left most node is reached.
+ * Then a new graph is constructed starting with
+ * @f$\frac{\partial m}{\partial x}=0@f$. Applying the first half of the chain
+ * rule we build a new graph for @f$0x@f$
  * @image{} html line_graph_dydf1.png ""
  * Then we take the derivative of the right operand and apply the second half
  * of the chain rule to build a new graph for @f$0x=0@f$.

@@ -73,8 +74,8 @@
  * The final expression for @f$\frac{\partial y}{\partial x}@f$ contains many
  * unnecessary nodes in the graph. Instead of building full graphs, we can
  * simplify and eliminate nodes as we build them. For instance, when the
- * expression @f$0x@f$ this created can be immediately reduce it to a single
- * node.
+ * expression @f$0\times x@f$ is created, it can be immediately reduced to a
+ * single node @f$0@f$.
  * @image{} html line_graph_reduce1.png ""
  * Applying all possible reductions reduces the final expression to
  * @f$\frac{\partial y}{\partial x}=m@f$.

@@ -109,7 +110,7 @@
  * @subsection general_concepts_compile_maps Maps
  * Maps enable the results of an output node to be stored in an input node. This
  * is used for a wide variety of cases. For instance, take a gradient descent step.
- * @f{equation}{y = y + \frac{\partial f}{\partial x}@f}
+ * @f{equation}{y_{i+1} = y_{i} + \frac{\partial f}{\partial x}@f}
  * In this case the output of the expression
  * @f$y + \frac{\partial f}{\partial x}@f$
  * can be mapped to update @f$y@f$.

@@ -122,7 +123,7 @@
  * <hr>
  * @section general_concepts_safe_math Safe Math
  * There are some conditions where mathematically, a graph should evaluate to a
- * normal number. However, when evaluated suing floating point precision, can
+ * normal number. However, when evaluated using floating point precision, it can
  * lead to <tt>Inf</tt> or <tt>NaN</tt>. An example of this is the
  * @f$\exp\left(x\right)@f$ function. For large argument values,
  * @f$\exp\left(x\right)@f$ overflows the maximum floating point precision and

graph_docs/tutorial.dox +19 −19

@@ -13,7 +13,7 @@
  * executable target which can be used to test out the APIs of this framework.
  * The playground starts with a blank main function.
  * @code
-#include "../graph_framework/jit.hpp"
+#include "graph_framework.hpp"

 int main(int argc, const char * argv[]) {
     START_GPU

@@ -30,7 +30,7 @@ int main(int argc, const char * argv[]) {
  * main. This will allow us to play with different floating point types. For now
  * we will start with a simple float type.
  * @code
-#include "../graph_framework/jit.hpp"
+#include "graph_framework.hpp"

 template<jit::float_scalar T>
 void run_tutorial() {

@@ -84,7 +84,7 @@ void run_tutorial() {
  * so all methods are called using the <tt>-></tt> operator.
  *
  * @subsection tutorial_constant Constant Nodes
- * Next we want to define a constant. There are two method to define constants
+ * Next we want to define a constant. There are two methods to define constants,
  * explicitly or implicitly.
  * @code
 template<jit::float_scalar T>