Sahara/SparkPluginNotes
Savanna SPARK plugin
This page describes the design choices for the Apache Spark plugin.
SPARK version
The image builder currently builds a VM image with Spark 0.8.1 (the incubator release). Spark runs in standalone mode (no Mesos, no YARN). It was compiled from source and then packaged into a "distribution" using the make-distribution.sh utility shipped with Spark; this generates the configuration, libraries and scripts needed to set up a standalone cluster, plus an interactive shell for working with the cluster.
Spark has been compiled against the Cloudera distribution of Hadoop, namely CDH 4.5, which is installed from Ubuntu packages (no Cloudera Manager for now).
Note that the Spark cluster is deployed using the scripts available in the Spark distribution, which can start all services (master and slaves), stop all services, and so on. As a result, and unlike CDH 4.5, Spark is not deployed as a Linux service.
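As a rough illustration of driving the cluster through the distribution scripts rather than a Linux service, the sketch below builds the path of the script to run on the master. The helper name cluster_script and the /opt/spark install path are hypothetical; the bin directory reflects where Spark 0.8.x keeps start-all.sh and stop-all.sh, and is left as a parameter because other Spark releases place them elsewhere.

```python
import posixpath


def cluster_script(spark_home, action, script_dir="bin"):
    """Return the path of the Spark script that starts or stops all services.

    spark_home is wherever the Spark distribution was unpacked on the
    master VM (an assumption; this page does not name the location).
    """
    if action not in ("start", "stop"):
        raise ValueError("action must be 'start' or 'stop', got %r" % action)
    # start-all.sh / stop-all.sh act on the master and on every host
    # listed in the slaves file.
    return posixpath.join(spark_home, script_dir, "%s-all.sh" % action)


if __name__ == "__main__":
    # /opt/spark is a placeholder install path.
    print(cluster_script("/opt/spark", "start"))
    print(cluster_script("/opt/spark", "stop"))
```

The returned path would then be executed on the master node as the VM user, for example over SSH.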
SPARK configuration
Spark needs a handful of parameters to work:
- The scripts used to interact with the cluster need a file (slaves) that lists the worker (virtual) machines
- Several parameters control access to cluster information (at the master), worker status, and monitoring
- Important configuration parameters tell the Spark slave processes how much RAM and how many "cores" to use. These should be set appropriately, according to the VM flavor selected for the Savanna cluster deployment.
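The parameters listed above could be generated along these lines. This is only a sketch, not the plugin's actual code: the function names, the host names, and the choice to hold back 1024 MB of RAM for the operating system and HDFS daemons are assumptions made for illustration. The environment variable names (SPARK_MASTER_IP, SPARK_WORKER_CORES, SPARK_WORKER_MEMORY and the port settings) are the ones read by Spark's standalone mode, shown here with their default port values.

```python
def render_slaves(worker_hosts):
    """Return the contents of the slaves file: one worker host per line."""
    return "".join(host + "\n" for host in worker_hosts)


def render_spark_env(master_host, flavor_ram_mb, flavor_vcpus,
                     reserved_ram_mb=1024):
    """Return spark-env.sh contents sized for a given VM flavor.

    reserved_ram_mb is an assumed amount of memory left for the OS and
    other daemons; it is not a value taken from this page.
    """
    # Never hand the worker less than 512 MB, even on tiny flavors.
    worker_ram_mb = max(flavor_ram_mb - reserved_ram_mb, 512)
    lines = [
        "export SPARK_MASTER_IP=%s" % master_host,
        "export SPARK_MASTER_PORT=7077",
        "export SPARK_MASTER_WEBUI_PORT=8080",
        "export SPARK_WORKER_WEBUI_PORT=8081",
        "export SPARK_WORKER_CORES=%d" % flavor_vcpus,
        "export SPARK_WORKER_MEMORY=%dm" % worker_ram_mb,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical cluster: one master, two workers, 4 GB / 2 VCPU flavor.
    print(render_slaves(["worker-1", "worker-2"]))
    print(render_spark_env("master-1", flavor_ram_mb=4096, flavor_vcpus=2))
```

Deriving the worker settings from the flavor keeps the Spark slave processes from claiming more memory or cores than the VM actually has.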
Additional Notes
Spark is installed and executed by the VM user defined at cluster setup.