Sahara/SparkPluginNotes

Sahara Spark plugin

This page describes how to use the Apache Spark plugin for Sahara.

Disk image

The sahara-image-builder tool contains elements for installing and pre-configuring Spark and Cloudera's HDFS (CDH).

To generate an Ubuntu image compatible with the Sahara plugin, run:

# diskimage-builder.sh -p spark

Spark is installed from the binary distribution and configured to run in standalone mode (no Mesos, no YARN).

Note that the Spark cluster is deployed using the scripts shipped with the Spark distribution, which can start all services (master and slaves), stop all services, and so on. As a result, and unlike the CDH HDFS daemons, Spark is not deployed as a standard Ubuntu service: if the virtual machines are rebooted, Spark will not be restarted automatically.
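Because Spark does not come back on its own after a reboot, it has to be restarted by hand on the master. The following is a minimal sketch, not part of the plugin: it assumes Spark lives in /opt/spark and that the distribution keeps its cluster scripts (start-all.sh, stop-all.sh) under sbin/, which may differ in older Spark releases. The function name restart_spark is made up for illustration.

```shell
#!/bin/sh
# Sketch: restart a standalone Spark cluster by hand, e.g. after a reboot.
# Assumption: Spark is installed in /opt/spark unless SPARK_HOME says otherwise.
SPARK_HOME="${SPARK_HOME:-/opt/spark}"

# restart_spark is a hypothetical helper, not something the image provides.
restart_spark() {
    # Assumption: the cluster scripts live in sbin/ (older releases used bin/).
    if [ ! -x "$SPARK_HOME/sbin/start-all.sh" ]; then
        echo "start-all.sh not found under $SPARK_HOME/sbin" >&2
        return 1
    fi
    # Stop whatever is still running; ignore errors if nothing was up.
    "$SPARK_HOME/sbin/stop-all.sh" || true
    # Start the master and all slaves again.
    "$SPARK_HOME/sbin/start-all.sh"
}
```

Run restart_spark on the master node as the user that owns the Spark installation. The function only wraps the distribution's own scripts, so it inherits their behaviour, including the need for SSH access from the master to the slaves.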

Configuration

Spark needs few parameters to work and has sensible defaults. If needed, they can be changed when creating the Sahara cluster template. No node group options are available.

A Spark cluster needs exactly one Spark master and at least one Spark slave. In the tested configuration, the HDFS NameNode is co-located with the Spark master and a DataNode runs alongside each Spark slave, which maximizes data locality.
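The topology rule above can be written down as a small check. This is only an illustration of the constraint, not Sahara's own validation code; the function name check_topology and its two arguments (number of masters, number of slaves) are assumptions made for this sketch.

```shell
#!/bin/sh
# Sketch: check a planned cluster layout against the rule
# "exactly one Spark master, at least one Spark slave".
# Usage: check_topology MASTERS SLAVES   (hypothetical helper)
check_topology() {
    masters="$1"
    slaves="$2"
    if [ "$masters" -ne 1 ]; then
        echo "need exactly one Spark master, got $masters" >&2
        return 1
    fi
    if [ "$slaves" -lt 1 ]; then
        echo "need at least one Spark slave, got $slaves" >&2
        return 1
    fi
    return 0
}
```

For example, check_topology 1 3 succeeds, while check_topology 2 3 and check_topology 1 0 both fail with a message on standard error.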

Running

Once the cluster is ready, connect to the master over SSH as the 'ubuntu' user with the appropriate SSH key. Spark is installed in /opt/spark and should be fully configured and ready to run jobs. A link to the Spark web interface is provided at the bottom of the cluster information page in the OpenStack dashboard.
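As a sketch of the connection step, the snippet below assembles the ssh command for the master. The key path and the IP address are placeholders invented for this example (the address comes from a range reserved for documentation); replace them with the key pair chosen at cluster creation and the master's address shown in the dashboard.

```shell
#!/bin/sh
# Placeholders (assumptions): substitute your own key file and master address.
KEY_FILE="$HOME/.ssh/sahara-key.pem"
MASTER_IP="203.0.113.10"

# 'ubuntu' is the login user on the image; -i selects the private key.
ssh_cmd="ssh -i $KEY_FILE ubuntu@$MASTER_IP"

# Print the command instead of running it, so it can be reviewed first.
echo "$ssh_cmd"
```

Running the printed command opens a shell on the master, where the Spark installation can be inspected under /opt/spark.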
