Summary
-------

The artifact described in this section includes the source code for the
Juggler host and device runtimes, and the sources for the applications
used in our evaluation. The scripts to compile the source code, generate
inputs, execute binaries, validate results, and parse the outputs are
also included in the artifact and explained below in detail.

Description
-----------

### Check-list (artifact meta information)

- **Program:** CUDA 8.0 APIs and above.
- **Compilation:** NVIDIA `nvcc` version 8.0 and above.
- **Binary:** CUDA host (x86-64) and device executables. A Linux binary
  (CentOS 7 recommended) is included. Source code and scripts to
  regenerate the binaries are also included.
- **Data set:** Dynamically generated data, prior to execution.
- **Runtime environment:** CUDA 8.0 APIs and drivers. They are included
  with the CUDA 8.0 toolkit distribution.
- **Hardware:** NVIDIA Tesla P100 (12GB, PCI-e v3; tested with an Intel
  Xeon E5-2683 CPU) or a newer GPU with at least 12GB of memory, and a
  modern x86-64 CPU.
- **Output:** Verification results and detailed timings such as
  execution times and runtime overhead.
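The requirements above can be sanity-checked before building. The sketch
below is not part of the artifact; it only assumes that `nvcc` and
`nvidia-smi` are on the `PATH` once the toolkit and driver are
installed:

```bash
# Illustrative pre-flight check for the listed requirements
# (not part of the artifact).
check_prereqs() {
  if command -v nvcc >/dev/null 2>&1; then
    # The release number is printed on the last line of `nvcc --version`.
    echo "nvcc: $(nvcc --version | tail -n 1)"
  else
    echo "nvcc: not found (install the CUDA 8.0 toolkit)"
  fi
  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
  else
    echo "nvidia-smi: not found (is the NVIDIA driver installed?)"
  fi
}
check_prereqs
```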
The source code for the Juggler host & device runtime and the
experimented applications (both the baseline and the Juggler-integrated
versions) can be accessed via: *https://code.ornl.gov/fub/juggler*

### Hardware dependencies

We performed our experiments on an NVIDIA P100 GPU. While Juggler is
compatible with the Kepler architecture and later, the provided artifact
will work only on the Pascal architecture or later, with at least 12GB
of memory.

### Software dependencies

The CUDA 8.0 toolkit is required for compilation and profiling. At the
time of CUDA driver installation, GCC 5.2 was installed on the system.
To run the scripts, `bash` and `egrep` are sufficient.

The scripts assume that `CUDA_HOME` is properly set to the installation
directory of CUDA. The default location:

``` {.bash language="bash" numbers="none"}
$ export CUDA_HOME=/usr/local/cuda
```

### Datasets

Due to their large sizes, each application dynamically generates and
populates its input dataset as part of its initialization stage.

Installation
------------

Clone Juggler from the ORNL repository:

``` {.bash language="bash" numbers="none"}
$ git clone https://code.ornl.gov/fub/juggler.git
$ export JUGGLER_HOME=$(pwd)/juggler
```

Experiment workflow
-------------------
To repeat the main experiments presented in the evaluation section, we
have created a script named `exp`. It is located under
`$JUGGLER_HOME/build`, and the parameters to the script are as follows:

``` {.bash language="bash" numbers="none"}
exp {scriptMode} {outFilePrefix} {runGB} {runJG} {nRuns} {nProfRuns}
```

To repeat the main experiments presented in Figures 4 and 5, run:

``` {.bash language="bash" numbers="none"}
$ cd $JUGGLER_HOME/build
$ bash exp 0 output 1 1 5 1
```

When run with `scriptMode=0`, the `exp` script compiles each application
for each scheduling policy; it then runs each of them five times (i.e.,
`nRuns=5`) for global barriers (i.e., `runGB=1`) and also five times for
the Juggler-integrated versions (i.e., `runJG=1`). It also performs an
additional run with profiling enabled (i.e., `nProfRuns=1`). The program
output for each application is written into separate files prefixed by
`outFilePrefix`.

Evaluation and expected result
------------------------------
Execution of the `exp` script with the parameters above will produce a
series of output files, named `output.$APPNAME` and
`output.PROF.$APPNAME`, for each application. The same `exp` script can
also be used in parsing mode (i.e., `scriptMode=1`) to parse these
output files and combine the values from all runs in a *tab separated
value* (TSV) format.

1.  **To list the kernel execution times** (i.e., the values used to
    draw Figure 4) for all seven applications, in separate columns:

    ``` {.bash language="bash" numbers="none"}
    $ bash exp 1 output execTime 2 formattedResults1.tsv
    ```

    Parsed values will be written into `formattedResults1.tsv`, in a
    tabular format. There will be seven columns in total, one for each
    application. The number of rows in the TSV file will be equal to
    $nRuns \times 4$ (i.e., 20, when the main experiment is run five
    times). The first 15 rows will be for $LRR$, $GRR$, and $LF$,
    respectively, in groups of five. The last five rows will be for
    global barriers.
2.  **To see verification results** against serial execution:

    ``` {.bash language="bash" numbers="none"}
    $ bash exp 1 output check 2 formattedResults2.tsv
    ```

    The parsed output will be written into the `formattedResults2.tsv`
    file, and the values will be either SUCCESS or FAIL. Please note
    that, since serial execution takes too long, verification is run
    only once per application, even if multiple (i.e., five) runs are
    specified.

3.  **To parse task load deviations** (i.e., the data used to draw
    Figure 5), run:

    ``` {.bash language="bash" numbers="none"}
    $ bash exp 1 output minTaskLoad 2 formattedResults3.tsv
    $ bash exp 1 output maxTaskLoad 2 formattedResults4.tsv
    ```

    Figure 5 in the paper is drawn as a "HIGH-LOW-CLOSE" chart in
    Microsoft Excel, where HIGH, LOW, and CLOSE are the highest, lowest,
    and average values, respectively, of the differences between
    `maxTaskLoad` and `minTaskLoad`, across five runs.

Experiment customization
------------------------

**Number of runs** for each type of run in the main experiment can be
modified by changing the input parameters of the `exp` script, as
explained above. Additionally, if only a specific subset of applications
is desired to be tested, the values in the bash array named `$APPS` in
the `exp` script can be modified.
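As a sketch, assuming `$APPS` uses the same application names as the
`switchAPP` options (the actual contents of the array in `exp` may
differ), restricting the suite to two applications would look like:

```bash
# Hypothetical excerpt from the exp script: run only two applications.
# The names follow the switchAPP options; the real array may differ.
APPS=(SW LUD)
for app in "${APPS[@]}"; do
  echo "selected application: $app"
done
```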
**More information** can be parsed from the output files by providing
the *key string* and *column number* to the value parser (i.e., *"exp
1"*). A few examples:

1.  Cache miss data. Keys can be `write_sector_misses`,
    `write_sector_queries`, `read_sector_misses`, and
    `read_sector_queries`:

    ``` {.bash language="bash" numbers="none"}
    $ bash exp 1 output.PROF write_sector_misses 7 formatted.tsv
    $ bash exp 1 output.PROF write_sector_queries 7 formatted.tsv
    $ bash exp 1 output.PROF read_sector_misses 7 formatted.tsv
    $ bash exp 1 output.PROF read_sector_queries 7 formatted.tsv
    ```

    Please note that profiling runs take longer to execute, and they may
    sometimes not run to completion, because `nvprof` does not interact
    well with the persistent-threads approach that Juggler uses.

2.  Host runtime and inspection loop overhead breakdown:

    ``` {.bash language="bash" numbers="none"}
    $ bash exp 1 output initAppContext_H 2 formatted.tsv
    $ bash exp 1 output initAppContext_D 2 formatted.tsv
    $ bash exp 1 output initRtContext_D 2 formatted.tsv
    ```

3.  Total application runtime, including user data initialization and
    transfers:

    ``` {.bash language="bash" numbers="none"}
    $ bash exp 1 output totalTime 2 formatted.tsv
    ```

    Please note that this data includes large data transfers that exist
    in both Juggler and the baseline. Therefore, timings obtained via
    this method do not accurately represent the benefits of Juggler.
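The parsed TSV files can be post-processed further with standard tools.
As a sketch (the helper name is ours, not part of the artifact),
per-column averages, e.g., the mean execution time per application
across runs, can be computed with `awk`:

```bash
# Average each tab-separated column of its input; prints one value
# per column. (Post-processing helper, not part of the artifact.)
avg_columns() {
  awk -F'\t' '{ for (i = 1; i <= NF; i++) sum[i] += $i; n++ }
              END { for (i = 1; i <= NF; i++)
                      printf "%s%.2f", (i > 1 ? "\t" : ""), sum[i] / n
                    print "" }' "$@"
}

# Inline demonstration with two columns and two rows:
printf '1\t3\n3\t5\n' | avg_columns   # prints 2.00 and 4.00, tab-separated
```

Applied to the artifact's outputs, `avg_columns formattedResults1.tsv`
would give one mean per application column.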
**Compiling and running a single application:** In our test suite,
applications are distinguished with compiler directives to optimize the
resource usage for the ones that share common kernels. Similarly, the
Juggler runtime requires a re-compilation if internal runtime parameters
(e.g., scheduling policy) are changed.

1.  To recompile Juggler for the desired application and scheduling
    policy:

    ``` {.bash language="bash" numbers="none"}
    $ cd $JUGGLER_HOME/build
    $ bash switchAPP {DTW|HEAT|INT|JACOBI|SAT|SW|LUD} {LRR|GRR|LF}
    ```

2.  To run the compiled application with the Juggler runtime:

    ``` {.bash language="bash" numbers="none"}
    $ cd $JUGGLER_HOME/build
    $ ./OMP_CUDART -n {matrix_size} -b {block_size} -d {1|2} [-c]
    ```

    The `-d` parameter indicates the run mode, which is 1 for Juggler
    and 2 for global barriers. The optional `-c` parameter enables
    verification against the serial version and compares the two
    outputs. By default, `-c` is enabled.