Sahara/SparkPlugin
Introduction
Spark is a fast and general engine for large-scale data processing.
This blueprint proposes a Sahara provisioning plugin for Spark that can launch and resize Spark clusters and run EDP jobs.
EDP support is in progress, as some changes to the Sahara core code are needed to support Spark jobs.
We are currently testing a more general plugin that also supports Shark, one of the Spark-related projects. Shark is an open source distributed SQL query engine for Hadoop data. It brings state-of-the-art performance and advanced analytics to Hive users.
Supported releases
This plugin supports Spark version 1.0.2. Currently, the only deployment mode is "standalone": the Spark cluster is suitable for EDP jobs and for individual Spark applications, but it is not intended for a multi-tenant setup. There is no support yet for "Mesos" or "YARN" based deployments. Additionally, this plugin only supports a Cloudera-based HDFS (CDH4, CDH5) data layer. Future releases will relax these limitations.
The companion DIB element provided with this plugin generates disk images according to the configuration described above.
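As an illustration of what the "standalone" deployment mode means for an individual Spark application, the following minimal sketch builds a spark-submit command line that targets a standalone master. It is not part of the plugin: the host name, jar, class name and HDFS path are hypothetical, and the master is assumed to listen on Spark's default standalone port (7077).

```python
import shlex

# Default port of a Spark standalone master (assumed unchanged on the cluster).
SPARK_STANDALONE_PORT = 7077


def standalone_master_url(master_host, port=SPARK_STANDALONE_PORT):
    """Return the spark:// URL used to reach a standalone master."""
    return "spark://{}:{}".format(master_host, port)


def spark_submit_command(master_host, app_jar, main_class, app_args=()):
    """Build the argument list for submitting a JVM application."""
    cmd = [
        "spark-submit",
        "--master", standalone_master_url(master_host),
        "--class", main_class,
        app_jar,
    ]
    # Application arguments follow the jar on the command line.
    cmd.extend(app_args)
    return cmd


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    cmd = spark_submit_command(
        "spark-master-001",
        "wordcount.jar",
        "org.example.WordCount",
        ["hdfs://spark-master-001:8020/input"],
    )
    print(" ".join(shlex.quote(part) for part in cmd))
```

The command is built as a list so that it can be passed to a process launcher without shell quoting issues; it is joined into a string only for display.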
Documentation
- How to use the Spark plugin: Sahara/SparkPluginNotes
- Notes about the changes to sahara-image-elements: Sahara/SparkImageBuilder
Status
Development is done by Daniele Venzano (Research Engineer at Eurecom) and Pietro Michiardi (Prof. at Eurecom). A preliminary version of the plugin was developed with the additional help of two Master's students at Eurecom, Do Huy-Hoang and Vo Thanh Phuc. This work is partially supported by the BigFoot project, an EC-funded research project with grant agreement n. 317858.