Commit 36f47f43 authored by Cianciosa, Mark's avatar Cianciosa, Mark

Merge branch 'benchmark' into 'main'

Fix errors found in the benchmarking and fix several issues with documentation.

See merge request !88
parents c0f6a84a be514142
+16.3 KiB (48 KiB)
+54 −8
@@ -45,12 +45,16 @@
 * @f{equation}{\vec{v}_{n+1} = \vec{v}_{n} + dt\,\vec{v}_{n}\times\vec{B}@f}
 * @f{equation}{\vec{x}_{n+1} = \vec{x}_{n} + dt\,\vec{v}_{n+1}@f}
 *
 * We compared the graph framework against the
 * <a href="https://ml-explore.github.io/mlx/build/html/index.html">MLX</a>
 * framework since it supports Apple GPUs,
 * <a href="https://docs.jax.dev/en/latest/">JAX</a> due to its popularity,
 * and <a href="https://kokkos.org">Kokkos</a> for its performance
 * portability. Source code for this benchmark case is available in the
 * appendix. Figure \ref{fig:compare} shows the throughput of pushing $10^{8}$
 * particles for $10^{3}$ time steps. The graph framework consistently shows the
 * best throughput on both CPUs and GPUs. Note MLX CPU throughput could be
 * improved by splitting the problem across multiple threads.
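As a sanity check on the benchmark kernels listed in the appendix, the same forward-Euler push can be sketched for a single particle in plain C++ (a hypothetical helper, not part of the framework; @f$\vec{B}=(0,0,1)@f$ and the initial state @f$\vec{v}=(1,0,1)@f$ are assumed, matching the source codes below). Since @f$\vec{B}@f$ is parallel to @f$\hat{z}@f$, the in-plane speed @f$v_{x}^{2}+v_{y}^{2}@f$ should stay nearly constant over the run.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical single-particle reference for the benchmark push:
// v <- v + dt (v x B), x <- x + dt v_next, with B = (0, 0, 1).
struct particle {
    float x = 0.0f, y = 0.0f, z = 0.0f;
    float vx = 1.0f, vy = 0.0f, vz = 1.0f; // initial state used by the benchmarks
};

void push(particle &p, const float dt, const std::size_t steps) {
    for (std::size_t i = 0; i < steps; i++) {
        // v x B with B = z_hat is (vy, -vx, 0).
        const float vx_next = p.vx + dt*p.vy;
        const float vy_next = p.vy - dt*p.vx;
        const float vz_next = p.vz;
        p.x += dt*vx_next;
        p.y += dt*vy_next;
        p.z += dt*vz_next;
        p.vx = vx_next;
        p.vy = vy_next;
        p.vz = vz_next;
    }
}
```

Forward Euler grows @f$\left|\vec{v}\right|@f$ by roughly @f$(1 + dt^{2})^{n/2}@f$, which is negligible for @f$dt=10^{-6}@f$ over @f$10^{3}@f$ steps, so any large drift in the in-plane speed indicates a transcription error in a ported kernel.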
 *
 * @subsection code_performance_comparison_codes Source codes for throughput benchmark comparison
 * @subsubsection code_performance_comparison_graph Graph Framework
@@ -93,7 +97,7 @@ for (size_t i = 0, ie = threads.size(); i < ie; i++) {
        auto v_next = v + dt*lorentz;
        auto pos_next = pos + dt*v_next;
            
        workflow::manager<float> work(thread_number);
        work.add_item({
            graph::variable_cast(x),
            graph::variable_cast(y),
@@ -177,7 +181,7 @@ const auto total_time = end - start;
def push(x, y, z, vx, vy, vz):
    dt = 0.000001
    vx_next = vx + dt*(vy*1 - vz*0)
    vy_next = vy + dt*(vz*0 - vx*1)
    vz_next = vz + dt*(vx*0 - vy*0)
    return vx_next, vy_next, vz_next, \
           x + dt*vx_next, y + dt*vy_next, z + dt*vz_next
@@ -201,6 +205,48 @@ jax.block_until_ready([x, y, z, vx, vy, vz])
end = time.time()

print(end - start)
 @endcode
 *
 * @subsubsection code_performance_comparison_kokkos Kokkos
 * @code
const size_t size = 100000000;
const size_t steps = 1000;

using ViewVectorType = Kokkos::View<float *, Kokkos::SharedSpace>;
ViewVectorType x("x", size);
ViewVectorType y("y", size);
ViewVectorType z("z", size);

ViewVectorType vx("vx", size);
ViewVectorType vy("vy", size);
ViewVectorType vz("vz", size);

Kokkos::parallel_for(size, KOKKOS_LAMBDA(const int64_t index) {
    vx[index] = 1;
    vz[index] = 1;
});

const std::chrono::high_resolution_clock::time_point start = std::chrono::high_resolution_clock::now();

for (size_t i = 0; i < steps; i++) {
    Kokkos::parallel_for(size, KOKKOS_LAMBDA(const int64_t index) {
        const float dt = 0.000001;
        const float vx_next = vx[index] + dt*(vy[index]*1 - vz[index]*0);
        const float vy_next = vy[index] + dt*(vz[index]*0 - vx[index]*1);
        const float vz_next = vz[index] + dt*(vx[index]*0 - vy[index]*0);
        x[index] += dt*vx_next;
        y[index] += dt*vy_next;
        z[index] += dt*vz_next;
        vx[index] = vx_next;
        vy[index] = vy_next;
        vz[index] = vz_next;
    });
}

Kokkos::fence();

std::chrono::high_resolution_clock::time_point end = std::chrono::high_resolution_clock::now();
const auto total_time = end - start;
 @endcode
 */
+9 −8
@@ -5,8 +5,8 @@
 * @section discription_introduction Introduction
 * The basic functionality of this framework is to build expression graphs
 * representing mathematical equations. Reduce those graphs to simpler forms.
 * Transform those graphs to take derivatives. Just-In-Time (JIT) compile them
 * to available compute device kernels. Then run those kernels in workflows. The
 * code is written using C++23 features. To simplify embedding into legacy
 * codes, there are additional language bindings for C and Fortran.
 *
@@ -48,9 +48,10 @@
 * be reduced to a single constant by calling the evaluate method. Sub-graph
 * expressions are combined, factored out, or moved to enable better reductions
 * on subsequent passes. As new ways of reducing the graph are implemented,
 * existing code built using this framework will benefit from improved speed.
 * The figure above shows a visualization of the tree data structure for the
 * equation of a line, the derivative, and the subsequent reductions.
 *
 * @subsubsection discription_graphs_builds Building Graphs
 * As an example, building an expression for the line @f$y=mx+b@f$ is accomplished by
@@ -79,8 +80,8 @@ auto dydmx = y->df(0.5*x);
 * running them in order. One @ref workflow::manager is created for each device
 * or thread. The user is responsible for creating threads. Each kernel is
 * generated through a @ref workflow::work_item. A work item is defined by
 * kernel @ref graph::input_nodes, @ref graph::output_nodes, and
 * @ref graph::map_nodes. Map items are used to take the results of a kernel and
 * update an input buffer. Using our example of the line equation, we can create
 * a workflow to compute @f$y@f$ and @f$\frac{\partial y}{\partial x}@f$.
 * @code
@@ -99,7 +100,7 @@ work.add_item({
 * elements in the inputs. Multiple work items can be created and will be
 * executed in order of creation.
 *
 * Once the work items are defined they can be JIT compiled to a backend device.
 * The graph framework supports back ends for generic CPUs, Apple Metal GPUs,
 * Nvidia Cuda GPUs, and initial HIP support for AMD GPUs. Each back end supplies
 * relevant driver code to build the kernel source, compile the kernel, build
+9 −8
@@ -39,7 +39,7 @@
 * as either variables @f$x@f$ or constants @f$m,b@f$. These nodes are connected
 * by nodes for multiply and addition operations. The output @f$y@f$ represents
 * the entire graph of operations.
 * @image{} html line_graph.png "The graph structure for y = mx + b."
 * Evaluation of graphs starts from the topmost node, in this case the @f$+@f$
 * operation. Evaluation of a node is not performed until all sub-nodes are
 * evaluated, starting with the left operand. Evaluation starts by recursively
@@ -58,9 +58,10 @@
 * graphs of a function derivative. For an example of taking derivatives see the
 * @ref tutorial_derivatives "auto differentiation tutorial". Let's say that we
 * want to take the derivative @f$\frac{\partial y}{\partial x}@f$. This is
 * achieved by evaluating the graph until the bottom left-most node is reached.
 * Then a new graph is constructed starting with
 * @f$\frac{\partial m}{\partial x}=0@f$. Applying the first half of the chain
 * rule we build a new graph for @f$0x@f$
 * @image{} html line_graph_dydf1.png ""
 * Then we take the derivative of the right operand and apply the second half
 * of the chain rule to build a new graph for @f$0x=0@f$.
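The two chain-rule halves described above can be sketched with a toy node hierarchy (hypothetical classes, not the framework's actual @ref graph node types): each node's <tt>df()</tt> builds a new derivative graph, and a multiply node emits the two product-rule halves @f$\frac{\partial l}{\partial x}r + l\frac{\partial r}{\partial x}@f$.

```cpp
#include <cassert>
#include <memory>

// Toy expression nodes (illustrative only): df() builds a new derivative
// graph with respect to the single variable x, as described above.
struct node {
    virtual ~node() = default;
    virtual double eval() const = 0;
    virtual std::shared_ptr<node> df() const = 0;
};
using node_ptr = std::shared_ptr<node>;

struct constant final : node {
    double v;
    explicit constant(const double c) : v(c) {}
    double eval() const override { return v; }
    node_ptr df() const override { return std::make_shared<constant>(0.0); } // dc/dx = 0
};

struct variable final : node {
    double v;
    explicit variable(const double value) : v(value) {}
    double eval() const override { return v; }
    node_ptr df() const override { return std::make_shared<constant>(1.0); } // dx/dx = 1
};

struct add final : node {
    node_ptr l, r;
    add(node_ptr a, node_ptr b) : l(a), r(b) {}
    double eval() const override { return l->eval() + r->eval(); }
    node_ptr df() const override { return std::make_shared<add>(l->df(), r->df()); }
};

struct multiply final : node {
    node_ptr l, r;
    multiply(node_ptr a, node_ptr b) : l(a), r(b) {}
    double eval() const override { return l->eval()*r->eval(); }
    node_ptr df() const override { // product rule: dl*r + l*dr
        return std::make_shared<add>(std::make_shared<multiply>(l->df(), r),
                                     std::make_shared<multiply>(l, r->df()));
    }
};
```

With @f$y = mx + b@f$ built from these toy nodes, <tt>y->df()</tt> produces the graph @f$0x + m1 + 0@f$, which evaluates to @f$m@f$.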
@@ -73,8 +74,8 @@
 * The final expression for @f$\frac{\partial y}{\partial x}@f$ contains many
 * unnecessary nodes in the graph. Instead of building full graphs, we can
 * simplify and eliminate nodes as we build them. For instance, when the
 * expression @f$0\times x@f$ is created, this can be immediately reduced to a
 * single node @f$0@f$.
 * @image{} html line_graph_reduce1.png ""
 * Applying all possible reductions reduces the final expression to
 * @f$\frac{\partial y}{\partial x}=m@f$.
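The reduce-as-you-build idea can be sketched with hypothetical smart constructors (not the framework's API) that pattern-match identities such as @f$0\times e\rightarrow 0@f$, @f$1\times e\rightarrow e@f$, and @f$0 + e\rightarrow e@f$ before ever allocating the full node:

```cpp
#include <cassert>
#include <memory>

// Tiny illustrative expression type: a node is either a known constant
// or an opaque (non-constant) subexpression.
struct expr {
    bool is_constant;
    double value; // meaningful only when is_constant is true
};
using expr_ptr = std::shared_ptr<expr>;

expr_ptr constant(const double v) { return std::make_shared<expr>(expr{true, v}); }
expr_ptr opaque() { return std::make_shared<expr>(expr{false, 0.0}); }

// Smart constructor for multiplication: applies 0*e -> 0 and 1*e -> e
// (and the mirrored forms) instead of building a multiply node.
expr_ptr mul(expr_ptr l, expr_ptr r) {
    if ((l->is_constant && l->value == 0.0) ||
        (r->is_constant && r->value == 0.0)) {
        return constant(0.0);
    }
    if (l->is_constant && l->value == 1.0) { return r; }
    if (r->is_constant && r->value == 1.0) { return l; }
    if (l->is_constant && r->is_constant) { return constant(l->value*r->value); }
    return opaque(); // a real framework would allocate a multiply node here
}

// Smart constructor for addition: applies 0 + e -> e.
expr_ptr add(expr_ptr l, expr_ptr r) {
    if (l->is_constant && l->value == 0.0) { return r; }
    if (r->is_constant && r->value == 0.0) { return l; }
    if (l->is_constant && r->is_constant) { return constant(l->value + r->value); }
    return opaque();
}
```

With these constructors the derivative expression @f$0x + m1@f$ collapses directly to the single node @f$m@f$, without any separate reduction pass.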
@@ -109,7 +110,7 @@
 * @subsection general_concepts_compile_maps Maps
 * Maps enable the results of an output node to be stored in an input node. This
 * is used for a wide variety of cases. For instance, take a gradient descent step.
 * @f{equation}{y_{i+1} = y_{i} + \frac{\partial f}{\partial x}@f}
 * In this case the output of the expression
 * @f$y + \frac{\partial f}{\partial x}@f$
 * can be mapped to update @f$y@f$.
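A map can be pictured (plain C++ sketch, not the framework API) as a copy from the kernel's output buffer back into one of its input buffers between evaluations:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One "work item" evaluation followed by its map: the kernel computes
// out = y + dfdx element-wise, then the map feeds out back into the
// input buffer y so the next evaluation sees the updated values.
void mapped_step(std::vector<float> &y,
                 const std::vector<float> &dfdx,
                 std::vector<float> &out) {
    for (std::size_t i = 0; i < y.size(); i++) {
        out[i] = y[i] + dfdx[i]; // kernel: evaluate the output node
    }
    y = out;                     // map: update the input from the output
}
```

Repeated calls then implement the iteration @f$y_{i+1} = y_{i} + \frac{\partial f}{\partial x}@f$ without the caller ever copying buffers by hand.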
@@ -122,7 +123,7 @@
 * <hr>
 * @section general_concepts_safe_math Safe Math
 * There are some conditions where mathematically, a graph should evaluate to a
 * normal number. However, when evaluated using floating point precision, it can
 * lead to <tt>Inf</tt> or <tt>NaN</tt>. An example of this is the
 * @f$\exp\left(x\right)@f$ function. For large argument values,
 * @f$\exp\left(x\right)@f$ overflows the maximum floating point precision and
+19 −19
@@ -13,7 +13,7 @@
 * executable target which can be used to test out the APIs of this framework.
 * The playground starts with a blank main function.
 * @code
#include "graph_framework.hpp"

int main(int argc, const char * argv[]) {
    START_GPU
@@ -30,7 +30,7 @@ int main(int argc, const char * argv[]) {
 * main. This will allow us to play with different floating point types. For now
 * we will start with a simple float type.
 * @code
#include "graph_framework.hpp"

template<jit::float_scalar T>
void run_tutorial() {
@@ -84,7 +84,7 @@ void run_tutorial() {
 * so all methods are called using the <tt>-></tt> operator.
 *
 * @subsection tutorial_constant Constant Nodes
 * Next we want to define a constant. There are two methods to define constants:
 * explicitly or implicitly.
 * @code
template<jit::float_scalar T>