Difference between revisions of "Sahara/SparkPlugin"

Revision as of 14:45, 24 October 2014

Introduction

Spark is a fast and general engine for large-scale data processing.
This blueprint proposes a Sahara provisioning plugin for Spark that can launch and resize Spark clusters and run EDP jobs.

EDP support is in progress, because some changes to the Sahara core code are needed before Spark jobs can be supported.

We are currently testing a more general plugin to support Shark, one of the Spark-related projects. Shark is an open source distributed SQL query engine for Hadoop data. It brings state-of-the-art performance and advanced analytics to Hive users.

Supported releases

This plugin supports Spark version 1.0.1. The only deployment mode currently available is "standalone", so the Spark cluster is suitable for EDP jobs and for individual Spark applications, but it is not intended for a multi-tenant setup. "Mesos" and "YARN" based deployments are not supported yet. Additionally, this plugin only supports a Cloudera-based HDFS (CDH4, CDH5) data layer. Future releases will relax these limitations.

The companion DIB element provided with this plugin generates disk images according to the configuration described above.
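
As a rough illustration of what "standalone" mode means for a user of such a cluster, the sketch below builds the command line that would submit an application to a standalone Spark master. It is not part of the plugin. It assumes the master listens on port 7077 (Spark's default for standalone masters) and that the `spark-submit` script shipped with Spark 1.0 and later is on the path. The host name, jar path, class name, and HDFS URL are hypothetical placeholders.

```python
import shlex

# Default port of a Spark standalone master (assumed unchanged on the cluster).
DEFAULT_MASTER_PORT = 7077


def standalone_master_url(host, port=DEFAULT_MASTER_PORT):
    """Return the spark://HOST:PORT URL used to reach a standalone master."""
    return "spark://{}:{}".format(host, port)


def spark_submit_argv(master_host, app_path, app_args=(), main_class=None):
    """Build a spark-submit argument list targeting a standalone master.

    Only the --master and --class options of spark-submit are used here.
    """
    argv = ["spark-submit", "--master", standalone_master_url(master_host)]
    if main_class is not None:
        argv += ["--class", main_class]
    argv.append(app_path)
    argv.extend(app_args)
    return argv


if __name__ == "__main__":
    # Hypothetical host, jar, class, and HDFS input path, for illustration only.
    argv = spark_submit_argv(
        "spark-master.example.org",
        "/tmp/wordcount.jar",
        app_args=["hdfs://spark-master.example.org:8020/user/demo/input.txt"],
        main_class="org.example.WordCount",
    )
    # Print a shell-safe version of the command instead of running it.
    print(" ".join(shlex.quote(arg) for arg in argv))
```

The script only prints the command, so it can be inspected before anything is run against a real cluster.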

Documentation

How to use the Spark plugin: Sahara/SparkPluginNotes
Notes about the changes to sahara-image-elements: Sahara/SparkImageBuilder

Status

Development is done by Daniele Venzano (Research Engineer at Eurecom) and Pietro Michiardi (Prof. at Eurecom). A preliminary version of the plugin was developed with the additional help of two Master's students at Eurecom, Do Huy-Hoang and Vo Thanh Phuc. This work is partially supported by the BigFoot project, an EC-funded research project with grant agreement no. 317858.

@@ Line 4: / Line 4: @@
 [https://blueprints.launchpad.net/sahara/+spec/spark-plugin This blueprint] proposes a Sahara provisioning plugin for Spark that can launch and resize Spark clusters and run EDP jobs.
-From the Sahara perspective, in the first iteration no support for scaling and EDP will be available, but those features are planned and will be integrated later.
+EDP support is in-progress, as some Sahara core code changes are needed to support Spark jobs.
 We are currently testing a more general plugin to support [http://shark.cs.berkeley.edu/ Shark], one of the Spark related projects. Shark is an open source distributed SQL query engine for Hadoop data. It brings state-of-the-art performance and advanced analytics to Hive users.
@@ Line 10: / Line 10: @@
 == Supported releases ==
-This plugin supports Spark version 0.9.1. Currently, the deployment mode is "stand alone": as such, the Spark cluster will be suitable for EDP jobs and for individual spark applications (the cluster is not intended for a multi-tenant setup). Currently, there is no support for "Mesos" or "YARN" based deployments. Additionally, this plugin only supports a Cloudera-based HDFS (CDH4, CDH5) data layer. Future releases will relax such limitations.
+This plugin supports Spark version 1.0.1. Currently, the deployment mode is "stand alone": as such, the Spark cluster will be suitable for EDP jobs and for individual spark applications (the cluster is not intended for a multi-tenant setup). Currently, there is no support for "Mesos" or "YARN" based deployments. Additionally, this plugin only supports a Cloudera-based HDFS (CDH4, CDH5) data layer. Future releases will relax such limitations.
 The companion DIB element provided with this plugin generates disk images according to the configuration described above.