SPARK ON DEMAND

This is a software setup tool to deploy and execute Spark jobs on Rhea, a commodity cluster at the NCCS/OLCF. Spark applications written in Python, Scala, and Java are supported.


1. INSTALLATION & USE

Two scripts are provided by this utility:

spark_setup.py - Configures a Spark deployment and generates a PBS script
spark_deploy.py - Deploys Spark on Rhea (called by the generated PBS script)

Note: Spark 1.6.x + Hadoop 2.6.x (stand-alone) must be installed, and the SPARK_HOME environment variable must be set to point to the Spark installation directory. The Rhea production environment will eventually include a Spark module that will configure your environment to point to a global installation of Spark. The two scripts mentioned above will be pre-installed in the SPARK_HOME/bin and SPARK_HOME/sbin directories, respectively. Until Spark is officially supported, it must be installed and configured manually.
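
For example, with a manually installed Spark distribution, the environment
might be prepared as follows (the installation path below is a placeholder):

    export SPARK_HOME=/path/to/spark-1.6.3-bin-hadoop2.6
    export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH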

To configure a Spark job, run spark_setup.py with the following parameters (an example invocation is shown after the list):

    -s <file>     : The PBS script file to generate
    -a <account>  : Name of account to charge
    -n <num>      : Number of nodes*
    -w <time>     : Maximum walltime
    -d <path>     : Spark deployment directory

    *Note: number of nodes must be 2 or greater (tasks are not run on master node)
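
For example, a four-node deployment might be configured as follows (the
account name, walltime, and deployment path are placeholders, not
recommended values):

    spark_setup.py -s spark_job.pbs -a ABC123 -n 4 -w 01:00:00 \
        -d /path/to/scratch/spark_job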

The deployment directory must be unique for each Spark batch job being executed and should be located in a scratch space (Spark uses this directory to write temporary files). After running spark_setup.py, the specified deployment directory will be created (or re-initialized if it already exists) and template configuration files/scripts will be copied into the "templates" subdirectory under the deployment directory. If needed, these template files may be modified before the Spark job is submitted. When the job is submitted, the template files are copied into per-node configuration directories and used by Spark to configure the worker nodes.
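
As an illustration, a copied template such as spark-defaults.conf could be
edited to adjust executor memory before the job is submitted (whether this
particular file appears among the templates depends on the Spark version and
how it was installed):

    # templates/spark-defaults.conf (hypothetical edit)
    spark.executor.memory   16g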

The PBS script generated by spark_setup.py must be edited to configure any module dependencies and to specify the Spark application and any arguments on the spark-submit line (this line is initially commented out). You may execute multiple Spark jobs within one PBS script (just copy/paste the spark-submit line). When the PBS script exits, the deployed Spark cluster will be shut down.
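
An uncommented spark-submit line might look roughly like the following (the
master URL, application path, and arguments are placeholders; use the
commented-out line in the generated script as the authoritative starting
point):

    spark-submit --master spark://<master-node>:7077 \
        /path/to/my_app.py arg1 arg2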


2. NOTES

* The PBS script uses mpirun to launch the spark_deploy.py script on all allocated nodes. The default MPI implementation is OpenMPI. If a different MPI implementation is used, the mpirun parameters may need to be changed.
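
  For illustration only (the generated PBS script contains the actual launch
  line, and the arguments passed to spark_deploy.py are specific to this
  tool), an OpenMPI launch of one deployment process per node could look
  like:

    mpirun -npernode 1 spark_deploy.py <deployment-directory>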

* The environment configured in the PBS script (i.e. modules) is automatically exported to Spark nodes by mpirun.

* Spark is configured in "Standalone" cluster mode.

* Hadoop is not utilized, but must be present for Spark to run.

* Scala is the best supported and potentially the highest-performance language to use. Java requires translation to/from Spark/Scala objects.

* Python is not as well supported by the Spark libraries as Scala and Java.