Sahara/SparkPluginNotes

Savanna SPARK plugin

This page describes the design choices for the Apache Spark plugin.

SPARK version

The image builder currently produces a VM image with Spark 0.8.1, an incubator release. Spark runs in standalone mode (no Mesos, no YARN). It was compiled from source and then packaged into a "distribution" with the make-distribution.sh utility shipped with Spark. This generates the configuration files, libraries and scripts needed to set up a standalone cluster, plus an interactive shell for working with the cluster.

Spark has been compiled against the Cloudera distribution of Hadoop, CDH 4.5, which is installed from Ubuntu packages (no CDH manager for now).

Note that the Spark cluster is deployed using the scripts shipped in the Spark distribution, which can start all services (master and slaves), stop all services, and so on. As a result, and unlike CDH 4.5, Spark is not deployed as a Linux service.
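
As an illustration of the script-based deployment described above, here is a minimal Python sketch of how a plugin could build the path of the script that starts or stops the whole cluster. The function name cluster_script and the /opt/spark install location are hypothetical. The start-all.sh and stop-all.sh names, and the bin directory they are assumed to live in for the 0.8.x line, should be checked against the actual image.

```python
import posixpath


def cluster_script(spark_home, action, script_dir="bin"):
    """Return the path of the Spark script that starts or stops all services.

    spark_home and script_dir are assumptions about the image layout;
    they are not taken from the plugin itself.
    """
    if action not in ("start", "stop"):
        raise ValueError("action must be 'start' or 'stop', got %r" % action)
    # The distribution ships one script per action: start-all.sh, stop-all.sh.
    return posixpath.join(spark_home, script_dir, "%s-all.sh" % action)


if __name__ == "__main__":
    # /opt/spark is a hypothetical install location.
    print(cluster_script("/opt/spark", "start"))
    print(cluster_script("/opt/spark", "stop"))
```

Because Spark is not registered as a Linux service, a command like this would have to be run on the master node by the user that owns the Spark installation.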

SPARK configuration

Spark needs a handful of parameters to work:

The scripts used to interact with the cluster need a file (slaves) listing the worker (virtual) machines.
Several parameters control access to cluster information (on the master), worker status, and monitoring.
Important configuration parameters tell the Spark slave processes how much RAM and how many "cores" to use. These should be set according to the VM flavor selected for the Savanna cluster deployment.
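
The parameters listed above can be sketched as two generated files. The following Python example is only an illustration, not the plugin's actual code: the function names, the host names and the 1024 MB reserved for the operating system are assumptions. The variable names (SPARK_MASTER_IP, SPARK_MASTER_PORT, SPARK_MASTER_WEBUI_PORT, SPARK_WORKER_WEBUI_PORT, SPARK_WORKER_MEMORY, SPARK_WORKER_CORES) and the default ports are those documented for Spark standalone mode.

```python
def render_slaves(worker_hosts):
    """Return the content of the slaves file: one worker host per line."""
    return "\n".join(worker_hosts) + "\n"


def render_spark_env(master_host, flavor_ram_mb, flavor_vcpus,
                     reserved_ram_mb=1024):
    """Return spark-env.sh content sized for a given VM flavor.

    reserved_ram_mb is an assumed amount of memory left to the OS and
    other daemons; tune it for the real image.
    """
    # Never hand a worker less than 512 MB, even on very small flavors.
    worker_mem_mb = max(flavor_ram_mb - reserved_ram_mb, 512)
    settings = [
        ("SPARK_MASTER_IP", master_host),
        ("SPARK_MASTER_PORT", 7077),        # Spark default master port
        ("SPARK_MASTER_WEBUI_PORT", 8080),  # cluster information on the master
        ("SPARK_WORKER_WEBUI_PORT", 8081),  # per-worker status page
        ("SPARK_WORKER_MEMORY", "%dm" % worker_mem_mb),
        ("SPARK_WORKER_CORES", flavor_vcpus),
    ]
    return "".join("export %s=%s\n" % (key, value) for key, value in settings)


if __name__ == "__main__":
    # Hypothetical cluster: one master, two workers, 4 GB / 2 vCPU flavor.
    print(render_slaves(["worker-001", "worker-002"]))
    print(render_spark_env("master-001", flavor_ram_mb=4096, flavor_vcpus=2))
```

Deriving the worker memory and core count from the flavor, rather than hard-coding them in the image, keeps a single image usable across flavors of different sizes.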

Additional Notes

Spark is installed and executed by the VM user defined at cluster setup.
