Commit d8c78aa0 authored by Stansberry, Dale

- Cleanup

- Added test programs (python and scala)
parent 84f69a1e
Pipeline #3720 skipped
SPARK ON DEMAND

This is a software setup tool to deploy and execute Spark jobs on Rhea, a commodity cluster at NCCS/OLCF. Spark applications written in Python, Scala, and Java are supported.
1. INSTALLATION & USE
Two scripts are provided by this utility:
spark_setup.py - Configures a Spark deployment and generates a PBS script
spark_deploy.py - Deploys Spark on Rhea (called by the generated PBS script)
Note: Spark 1.6.x + Hadoop 2.6.x (stand-alone) must be installed and the SPARK_HOME environment variable set to point to the Spark installation directory. The Rhea production environment will eventually include a Spark module which will configure your environment to point to a global installation of Spark. The two scripts mentioned above will be pre-installed in the SPARK_HOME/bin and SPARK_HOME/sbin directories, respectively. Until Spark is officially supported, Spark must be installed and configured manually.
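Until then, a minimal manual setup might look like the following (the installation path is a placeholder for wherever the Spark + Hadoop distribution was unpacked):

  export SPARK_HOME=/path/to/spark-1.6.x-bin-hadoop2.6
  export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH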
To configure a Spark job, run spark_setup.py with the following parameters:

-n <num>  : Number of nodes
-w <time> : Maximum walltime
-d <path> : Spark deployment directory
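For example, a deployment might be configured with a command along the following lines (the account, node count, walltime, and scratch path are placeholders; the -a/--account option is taken from spark_setup.py's built-in usage text):

  spark_setup.py -a ABC123 -n 4 -w 1:00:00 -d /path/to/scratch/spark_test

This creates (or re-initializes) the deployment directory and generates the PBS script (spark.pbs by default).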
The deployment directory must be unique for each Spark batch job being executed and should be located in a scratch space (Spark uses this directory to write temporary files). After running spark_setup.py, the specified deployment directory will be created (or re-initialized if it already exists) and template configuration files/scripts will be copied into the "templates" subdirectory under the deployment directory. If needed, these template files may be modified before the Spark job is submitted. When the job is submitted, these template files are copied into per-node configuration directories and used by Spark to configure worker nodes.
The PBS script generated by spark_setup.py must be edited to configure any module dependencies and to specify the Spark application and its arguments on the spark-submit line (this line is initially commented out). You may execute multiple Spark jobs within one PBS script (just copy/paste the spark-submit line). When the PBS script exits, the deployed Spark cluster will be shut down.
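As a sketch only (the master URL, application path, and input file below are placeholders, and the exact form of the generated script may differ), an uncommented spark-submit line for the Python test program might look like:

  $SPARK_HOME/bin/spark-submit --master spark://<master-node>:7077 /path/to/wordcount.py /path/to/input.txt

For a Scala or Java application, the application jar and main class are passed instead (e.g. --class WordCount /path/to/wordcount.jar /path/to/input.txt).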
2. NOTES
* The PBS script uses mpirun to launch the spark_deploy.py script on all allocated nodes. The default MPI implementation is OpenMPI; if a different MPI implementation is used, the mpirun parameters may need to be changed.
* The environment configured in the PBS script (i.e. modules) is automatically exported to Spark nodes by mpirun.
* Spark is configured in "Standalone" cluster mode.
* Hadoop is not utilized, but must be present for Spark to run.
* Scala is the best supported and potentially the highest-performance language to use. Java requires translation to/from Spark/Scala objects.
* Python is not as well supported by Spark libraries.
--- a/spark_setup.py
+++ b/spark_setup.py
@@ -7,7 +7,6 @@ import time
 import shutil
 verbose = False
-python_support = False
 script_file = 'spark.pbs'
 spark_home = None
 deploy_dir = None
@@ -32,7 +31,6 @@ def printUsage():
     print ' -w,--wall <time> : Maximum walltime'
     print ' -d,--deploy-dir <path> : Spark deployment directory *'
     print ' -t,--timeout <sec> : Set deployment timeout (default = 30 sec)'
-    print ' -p,--python : Include python support'
     print ' --spark-home <path> : Override/set SPARK_HOME env variable **\n'
     print ' --worker-mem <arg> : Worker memory\n'
     print ' --worker-cores <arg> : Worker cores\n'
@@ -52,7 +50,6 @@ def printOptions():
     print 'Num nodes :', num_nodes
     print 'Wall time :', walltime
     print 'Deployment dir :', deploy_dir
-    print 'Python support :', python_support
     print 'SPARK_HOME :', spark_home
     print 'Worker memory :', worker_memory
     print 'Worker cores :', worker_cores
@@ -157,8 +154,8 @@ def setupDeployDir():
 try:
     opts, args = getopt.getopt( sys.argv[1:],
-        '?hvs:d:a:n:w:p',
-        ['verbose','help','script=','deploy-dir=','account=','wall=','nodes=','spark-home=','python'] )
+        '?hvs:d:a:n:w:',
+        ['verbose','help','script=','deploy-dir=','account=','wall=','nodes=','spark-home='] )
     if len(args) > 0:
         print 'Unrecognized options: ', args
@@ -186,8 +183,6 @@ for opt, arg in opts:
         walltime = arg
     elif opt in ( '--spark-home' ):
         spark_home = arg
-    elif opt in ( '-p', '--python' ):
-        python_support = True
     elif opt in ( '-t', '--timeout' ):
         deploy_timeout = arg
     elif opt in ( '-v', '--verbose' ):
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
from __future__ import print_function
import sys
from operator import add
from pyspark import SparkContext
if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: wordcount <file>", file=sys.stderr)
        exit(-1)

    sc = SparkContext(appName="WordCount")

    # Split each line into words, map each word to a count of 1, and sum the counts per word.
    lines = sc.textFile(sys.argv[1], 1)
    counts = lines.flatMap(lambda x: x.split(' ')) \
                  .map(lambda x: (x, 1)) \
                  .reduceByKey(add)

    # Print the 100 most frequent words.
    output = counts.takeOrdered(100, key=lambda x: -x[1])
    for (word, count) in output:
        print("%s: %i" % (word, count))

    sc.stop()
import org.apache.spark._
import org.apache.spark.SparkContext._

object WordCount {
  def main(args: Array[String]) {
    if ( args.length != 1 )
    {
      println( "Usage: wordcount <file>" )
      sys.exit(1)
    }

    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)

    // Split each line into words, map each word to a count of 1, and sum the counts per word.
    val lines = sc.textFile(args(0))
    val counts = lines.flatMap(line => line.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey( _ + _ )

    // Print the 100 most frequent words.
    val output = counts.takeOrdered(100)(Ordering[Int].reverse.on(x => x._2))
    output.foreach( println )
  }
}
name := "WordCount"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % "1.1.0"
)
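Presumably the Scala test program is built from the directory containing this build definition with a command like the following, producing (per sbt defaults for Scala 2.10) a jar under target/scala-2.10/ that can then be passed to spark-submit:

  sbt package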