Difference between revisions of "Sahara/SparkPluginNotes"

m (Sergey Lukjanov moved page Savanna/SparkPluginNotes to Sahara/SparkPluginNotes: Savanna project was renamed due to the trademark issues.)
 

Latest revision as of 16:36, 26 May 2014

== Sahara Spark plugin ==

This page describes how to use the Apache Spark plugin for Sahara.

=== Disk image ===

The sahara-image-builder tool contains elements for installing and pre-configuring Spark and the Cloudera distribution of HDFS (CDH).

To generate an Ubuntu image compatible with the Sahara plugin, run:

<pre><nowiki>
# diskimage-builder.sh -p spark
</nowiki></pre>

Spark will be installed from the binary distribution and configured to run in standalone mode (no Mesos, no Yarn).

Note that the Spark cluster is deployed using the scripts bundled with the Spark distribution, which start and stop all services (master and slaves). As such (and as opposed to the CDH HDFS daemons), Spark is not deployed as a standard Ubuntu service: if the virtual machines are rebooted, Spark will not be restarted automatically.
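After a reboot, the daemons therefore have to be brought back by hand. A minimal sketch, assuming the <code>/opt/spark</code> install path used by the image and the script layout of the Spark 0.8.x binary distribution:

```shell
# Hypothetical sketch: restart the standalone daemons after a VM reboot.
# The /opt/spark path matches the image layout described on this page;
# the bin/*-all.sh script names are those of the Spark 0.8.x distribution.
SPARK_HOME=/opt/spark
if [ -x "$SPARK_HOME/bin/start-all.sh" ]; then
    "$SPARK_HOME/bin/stop-all.sh"   # clear any stale master/slave processes
    "$SPARK_HOME/bin/start-all.sh"  # start the master, then every host listed in conf/slaves
else
    echo "Spark scripts not found under $SPARK_HOME"
fi
```

The scripts must be run on the master node, since <code>start-all.sh</code> reaches the slaves over SSH using the hostnames in <code>conf/slaves</code>.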

=== Configuration ===

Spark needs only a few parameters to work and ships with sensible defaults. If needed, they can be changed when creating the Sahara cluster template. No node group options are available.
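These settings end up in Spark's standalone configuration files on the nodes. As an illustration only (the values below are assumptions, not Sahara defaults), a worker's <code>conf/spark-env.sh</code> might contain:

```shell
# conf/spark-env.sh -- hypothetical values; in a Sahara deployment these are
# derived from the cluster template, not edited by hand.
# Worker hostnames live in conf/slaves, one per line.
SPARK_MASTER_PORT=7077        # port the standalone master listens on
SPARK_WORKER_MEMORY=4g        # RAM each worker may hand out to executors
SPARK_WORKER_CORES=2          # CPU cores each worker may hand out
```

Memory and cores should match the VM flavor chosen for the cluster, otherwise the workers will either waste or oversubscribe the instance's resources.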

A Spark cluster needs exactly one Spark master and at least one Spark slave. The tested configuration co-locates the NameNode with the master and a DataNode with each slave, to maximize data locality.

=== Running ===

Once the cluster is ready, connect to the master via SSH as the 'ubuntu' user with the appropriate SSH key. Spark is installed in <code>/opt/spark</code> and should be completely configured and ready to execute jobs. At the bottom of the cluster information page in the OpenStack dashboard there is a link to the Spark web interface.
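As a quick smoke test, one could point the tools shipped with the distribution at the standalone master. This is only a sketch: the IP below is a placeholder for the master's actual address, and the script layout is that of the Spark 0.8.x distribution assumed above.

```shell
# Hypothetical first run from the master node. 192.0.2.10 is a placeholder;
# substitute the master's real IP. Port 7077 is the standalone master default.
MASTER_URL="spark://192.0.2.10:7077"
echo "Using master: $MASTER_URL"
# Interactive shell against the cluster:
#   MASTER="$MASTER_URL" /opt/spark/spark-shell
# Bundled SparkPi example:
#   /opt/spark/run-example org.apache.spark.examples.SparkPi "$MASTER_URL"
```

Progress of either job can then be followed in the Spark web interface linked from the dashboard.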