Sahara/SparkPluginNotes
Sahara Spark plugin
This page describes how to use the Apache Spark plugin for Sahara.
Disk image
The sahara-image-builder tool contains elements for installing and pre-configuring Spark and Cloudera's HDFS (CDH).
To generate an Ubuntu image compatible with the Sahara plugin, run:
# diskimage-builder.sh -p spark
Spark will be installed from the binary distribution and configured to run in standalone mode (no Mesos, no YARN).
Note that the Spark cluster is deployed using the scripts available in the Spark distribution, which can start all services (master and slaves), stop all services, and so on. As a result (and unlike the CDH HDFS daemons), Spark is not deployed as a standard Ubuntu service, and if the virtual machines are rebooted, Spark will not be restarted.
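Because Spark is not restarted after a reboot, it has to be started again by hand from the master node. The following is a minimal sketch, not the plugin's own procedure. It assumes the install path /opt/spark mentioned on this page and that the distribution's start-all.sh script sits under sbin/ (newer Spark releases) or bin/ (older ones); the helper name find_start_script is made up for illustration.

```shell
#!/bin/sh
# Sketch: restart the standalone Spark daemons after a reboot.
# Run on the master node as the user that owns the Spark installation.
# SPARK_HOME defaults to /opt/spark, the install path used by the image.
SPARK_HOME="${SPARK_HOME:-/opt/spark}"

# Hypothetical helper: print the path of start-all.sh, wherever this
# Spark release keeps it (assumed to be sbin/ or bin/).
find_start_script() {
    for dir in sbin bin; do
        if [ -x "$SPARK_HOME/$dir/start-all.sh" ]; then
            echo "$SPARK_HOME/$dir/start-all.sh"
            return 0
        fi
    done
    return 1
}

if script=$(find_start_script); then
    # start-all.sh launches the master and then the slaves
    "$script"
else
    echo "start-all.sh not found under $SPARK_HOME" >&2
fi
```

The matching stop-all.sh script in the same directory stops the services again.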
Configuration
Spark needs only a few parameters to work and has sensible defaults. If needed, they can be changed when creating the Sahara cluster template. No node group options are available.
A Spark cluster needs exactly one Spark master and at least one Spark slave. The tested configuration co-locates the NameNode with the master and a DataNode with each slave to maximize data locality.
Running
Once the cluster is ready, connect with ssh to the master using the 'ubuntu' user and the appropriate ssh key. Spark is installed in /opt/spark and should be completely configured and ready to start executing jobs. At the bottom of the cluster information page from the OpenStack dashboard, a link to the Spark web interface is provided.
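As a rough illustration of the steps above, the sketch below prints the commands one might use to log in and open an interactive Spark shell against the standalone master. The IP address and key path are placeholders, spark_master_url is a helper made up for this example, and port 7077 is assumed because it is Spark's default standalone master port. The exact master URL is shown at the top of the Spark web interface and should be preferred if it differs.

```shell
#!/bin/sh
# Placeholder values (assumptions): replace with the master's address
# and the private key registered for the cluster.
MASTER_IP="${MASTER_IP:-192.0.2.10}"
SSH_KEY="${SSH_KEY:-$HOME/.ssh/sahara_key.pem}"

# Hypothetical helper: build a standalone master URL for a host,
# assuming the default master port 7077.
spark_master_url() {
    echo "spark://$1:7077"
}

# Print the commands instead of running them, so the sketch is safe
# to execute anywhere.
echo "ssh -i $SSH_KEY ubuntu@$MASTER_IP"
echo "MASTER=$(spark_master_url "$MASTER_IP") /opt/spark/bin/spark-shell"
```

The first printed command is run from a workstation; the second is run on the master after logging in.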