
Sahara/SparkImageBuilder

Sahara image elements for Spark

This page documents the image builder utility for Sahara, with a focus on the Hadoop CDH and Spark elements.

Hadoop CDH

Spark can be deployed alongside the two main Hadoop distributions, CDH and HDP. To support this, the image builder contains a new CDH element that deploys a Cloudera-based Hadoop installation. Note that the element uses Ubuntu packages rather than Cloudera parcels or Cloudera Manager.

Spark

Currently, the image builder supports the 0.9.1 release of Spark. By default, the official binary distribution is downloaded from the Spark website. Environment variables can be set to use a different distribution package instead, for example one created by compiling Spark with the make-distribution.sh script.

(*) Note that Spark uses the hadoop-client library to talk to HDFS. Because the HDFS protocol differs between Hadoop versions, you must build Spark against the same Hadoop version that your cluster runs. By default, Spark links against Hadoop 1.0.4. You can change this by setting the SPARK_HADOOP_VERSION variable when compiling. A list of supported Hadoop distributions is available here: [1]
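As an illustration of setting SPARK_HADOOP_VERSION at compile time, the following Python sketch prepares a build invocation without running it. It assumes the sbt-based build ("sbt/sbt assembly") described in the Spark 0.9.x documentation; the helper names and the CDH version string are illustrative only and should be replaced with the version your cluster actually runs.

```python
import os
import shlex


def spark_build_invocation(hadoop_version, base_env=None):
    """Return (command, env) for building Spark against one Hadoop version.

    Hypothetical helper: it only prepares the invocation and runs nothing.
    """
    # Copy the environment so the caller's mapping is never modified.
    env = dict(os.environ if base_env is None else base_env)
    env["SPARK_HADOOP_VERSION"] = hadoop_version
    # Assumption: run from the root of a Spark 0.9.x source tree.
    command = ["sbt/sbt", "assembly"]
    return command, env


def as_shell_line(command, env_overrides):
    """Render the invocation as a single line that can be pasted into a shell."""
    assignments = " ".join(
        f"{name}={shlex.quote(value)}"
        for name, value in sorted(env_overrides.items())
    )
    return f"{assignments} {' '.join(shlex.quote(part) for part in command)}"


if __name__ == "__main__":
    # Example value only: a CDH4 MRv1 version string.
    command, env = spark_build_invocation("2.0.0-mr1-cdh4.2.0")
    overrides = {"SPARK_HADOOP_VERSION": env["SPARK_HADOOP_VERSION"]}
    print(as_shell_line(command, overrides))
```

To actually run the build, the returned pair could be passed to subprocess.run(command, env=env, cwd=...) with the working directory set to the Spark source tree.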

Additional notes:

Spark is deployed in standalone mode, which means that one Spark worker process runs on each slave machine. Depending on the flavor, a misconfigured worker process may underutilize the provisioned VM. Sahara exposes Spark configuration options to adjust how many cores and how much memory Spark uses.
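To make the sizing concern concrete, here is a minimal Python sketch that derives worker settings from a flavor. SPARK_WORKER_CORES and SPARK_WORKER_MEMORY are the standalone-mode variables read from spark-env.sh; the helper name, the 1024 MB reserved for the operating system, and the flavor figures are assumptions for illustration, not values taken from Sahara.

```python
def worker_settings(flavor_vcpus, flavor_ram_mb, reserved_ram_mb=1024):
    """Suggest standalone worker settings for one VM flavor.

    Hypothetical helper: reserved_ram_mb is an assumed allowance for the
    operating system and other daemons on the slave, not a Sahara default.
    """
    if flavor_vcpus < 1:
        raise ValueError("flavor must have at least one vCPU")
    usable_ram_mb = flavor_ram_mb - reserved_ram_mb
    if usable_ram_mb < 512:
        raise ValueError("flavor leaves too little memory for a Spark worker")
    return {
        # Let the worker offer every vCPU of the VM to Spark applications.
        "SPARK_WORKER_CORES": str(flavor_vcpus),
        # Spark accepts memory sizes with a unit suffix such as "m" or "g".
        "SPARK_WORKER_MEMORY": f"{usable_ram_mb}m",
    }


if __name__ == "__main__":
    # Illustrative flavor: 4 vCPUs and 8192 MB of RAM.
    for name, value in sorted(worker_settings(4, 8192).items()):
        print(f"{name}={value}")
```

The resulting values would be applied through the Spark configuration options that Sahara exposes rather than by editing spark-env.sh on the image.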