This is a setup tool for deploying and executing Spark jobs on Rhea.

Two scripts are provided:

spark_setup.py - Configures a Spark deployment and generates a PBS script
spark_deploy.py - Deploys Spark on Rhea (called by the generated PBS script)

Note: Spark 1.6 (stand-alone) must be installed and the SPARK_HOME environment variable set to point to the Spark installation directory. The Rhea production environment will eventually include a Spark module that configures your environment to point to a global installation of Spark, with the two scripts above pre-installed in the SPARK_HOME/bin and SPARK_HOME/sbin directories, respectively. Until Spark is officially supported, Spark must be installed and configured manually.
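
For example, assuming Spark has been unpacked manually into a directory of your choosing (the path below is only a placeholder), the environment can be prepared along these lines:

    export SPARK_HOME=/path/to/spark-1.6.0-bin-hadoop2.6   # placeholder install location
    export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH     # makes spark-submit and related tools available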

To configure a Spark job, run spark_setup.py with the following parameters (an example invocation follows the list below):

    -s <file>     : The PBS script file to generate
    -a <account>  : Name of the account to charge
    -n <num>      : Number of nodes
    -w <time>     : Maximum walltime
    -d <path>     : Spark deployment directory
    -p            : Include Python support (if needed for the Spark application)
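
For example, a setup run for a four-node job might look like the following (the account name, walltime, and scratch path are placeholders, and the script is assumed to be on your PATH or in the current directory):

    spark_setup.py -s spark_job.pbs -a ABC123 -n 4 -w 01:00:00 \
                   -d /path/to/scratch/spark_run1 -p

The generated spark_job.pbs can then be submitted with qsub once the spark-submit line has been filled in (see below).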

The deployment directory must be unique for each Spark job and should be located in scratch space (Spark uses this directory to write temporary files). After spark_setup.py runs, the specified deployment directory is created (or re-initialized if it already exists) and template configuration files/scripts are copied into the "templates" subdirectory under the deployment directory. If needed, these template files may be modified before the Spark job is submitted. When the job is submitted, the templates are copied into per-node configuration directories and used by Spark to configure the worker nodes.
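
For instance, if the copied templates include Spark's standard spark-env.sh (an assumption; check the templates subdirectory for the actual file names), per-worker settings could be tuned before submitting the job:

    # templates/spark-env.sh -- assumed file name and location
    export SPARK_WORKER_MEMORY=100g   # illustrative value; size to the memory of a Rhea node
    export SPARK_WORKER_CORES=16      # illustrative value; match the cores available per node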


The PBS script generated by spark_setup.py must be edited to specify the Spark application and any arguments on the spark-submit line (this line is initially commented out). You may execute multiple Spark jobs within one PBS script (just copy/paste the spark-submit line). When the PBS script exits, the deployed Spark cluster is shut down.
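
As a sketch, the uncommented spark-submit line might end up looking something like this (the master URL, application path, and arguments are placeholders; the generated script may already supply the correct master address):

    $SPARK_HOME/bin/spark-submit --master spark://<master-host>:7077 \
        /path/to/my_spark_app.py arg1 arg2

Port 7077 is Spark's default standalone master port; the actual master host will depend on the nodes PBS allocates for the job.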