Sahara/Incubation

Revision as of 19:03, 11 March 2014

Project codename

Sahara

Summary

Elastic Hadoop cluster provisioning and management on OpenStack, plus elastic data processing (on-demand Hadoop job workflows).

Mission Statement

Program name and mission statement discussions are currently in progress (see the mailing list thread).

Program: Data Processing

Mission: To provide the OpenStack community with an open, cutting edge, performant and scalable data processing stack and associated management interfaces.

Detailed Description

Sahara provides users with a simple means to run Hadoop on OpenStack. The project's primary goal is to provide provisioning of on-demand Hadoop clusters as well as on-demand Hadoop job workflows (similar to Amazon Elastic MapReduce).

Cluster operations cover full cluster lifecycle management: provisioning, scaling and termination. Sahara supports various configuration options that are specific to each Hadoop distribution. Sahara also provides a built-in cluster template mechanism that allows a configuration to be created once and deployed multiple times in a single-click fashion.
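
As a rough illustration of the template mechanism, the sketch below creates a cluster template through the REST API and then launches a cluster from it. The endpoint paths, payload fields and values are assumptions modelled on the Sahara REST API of this era, not text from this proposal.

  # Illustrative sketch only: endpoint paths, field names and values are
  # assumptions based on the Sahara REST API of this era.
  import json
  import requests

  SAHARA = "http://sahara.example.com:8386/v1.1/<tenant-id>"   # hypothetical endpoint
  HEADERS = {"X-Auth-Token": "<keystone-token>", "Content-Type": "application/json"}

  # 1. Describe the cluster layout once as a template...
  cluster_template = {
      "name": "vanilla-small",
      "plugin_name": "vanilla",          # which provisioning plugin to use
      "hadoop_version": "1.2.1",
      "node_groups": [
          {"name": "master", "flavor_id": "2", "count": 1,
           "node_processes": ["namenode", "jobtracker"]},
          {"name": "workers", "flavor_id": "2", "count": 3,
           "node_processes": ["datanode", "tasktracker"]},
      ],
  }
  resp = requests.post(SAHARA + "/cluster-templates",
                       headers=HEADERS, data=json.dumps(cluster_template))
  template_id = resp.json()["cluster_template"]["id"]

  # 2. ...then launch any number of clusters from that template.
  cluster = {
      "name": "demo-cluster",
      "plugin_name": "vanilla",
      "hadoop_version": "1.2.1",
      "cluster_template_id": template_id,
      "default_image_id": "<glance-image-id>",   # image built with diskimage-builder
  }
  requests.post(SAHARA + "/clusters", headers=HEADERS, data=json.dumps(cluster))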

Data operations allow users to be insulated from cluster creation tasks and work directly with data. That is, users specify a job to run and a data source, and Sahara takes care of the entire job and cluster lifecycle: bringing up the cluster, configuring the job and data source, running the job and shutting down the cluster. Sahara supports different types of jobs: MapReduce jobs, Hive queries, Pig jobs and Oozie workflows. Data can be taken from various sources: Swift, remote HDFS, NoSQL and SQL databases.
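
A minimal sketch of that flow follows, assuming hypothetical endpoint paths and payload fields; the data source and job objects below illustrate the concept rather than Sahara's exact API.

  # Conceptual sketch of the EDP ("data operations") flow described above.
  # Endpoint paths and payload fields are assumptions for illustration only.
  import json
  import requests

  SAHARA = "http://sahara.example.com:8386/v1.1/<tenant-id>"   # hypothetical endpoint
  HEADERS = {"X-Auth-Token": "<keystone-token>", "Content-Type": "application/json"}

  # Register a data source the job will read from (here a path in Swift).
  input_source = {
      "name": "logs-2013-09",
      "type": "swift",
      "url": "swift://analytics-container/raw-logs/",
      "credentials": {"user": "<swift-user>", "password": "<swift-password>"},
  }
  requests.post(SAHARA + "/data-sources", headers=HEADERS,
                data=json.dumps(input_source))

  # Define the job itself; Sahara brings up (or reuses) a cluster, configures the
  # job and data source, runs the job and shuts the cluster down afterwards.
  pig_job = {
      "name": "top-visitors",
      "type": "Pig",                    # MapReduce, Hive, Pig or an Oozie workflow
      "mains": ["<job-binary-id>"],     # script previously uploaded as a job binary
  }
  requests.post(SAHARA + "/jobs", headers=HEADERS, data=json.dumps(pig_job))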

There are a number of vendors offering customized Hadoop distributions. Sahara is designed to support any of these distributions via a plugin mechanism. At the moment, plugins for two distributions are implemented: Vanilla Apache Hadoop and Hortonworks Data Platform. Modern Hadoop installations include not only Hadoop itself but also many products from the Apache Hadoop ecosystem, such as Pig, Hive, Oozie, Sqoop and Hue. Sahara aims to provide support for all of these services.
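
In practice a distribution plugin is a Python class implementing the provisioning SPI. The sketch below is an assumption-based outline of such a plugin for a hypothetical "ACME" distribution; the method names approximate the plugin SPI and may not match the real interface exactly.

  # Assumption-based outline of a distribution plugin; method names approximate
  # Sahara's plugin SPI and the vendor distribution is hypothetical.
  class AcmeHadoopPlugin(object):        # would subclass the provisioning SPI base class
      def get_title(self):
          return "ACME Hadoop Distribution"

      def get_versions(self):
          return ["1.0", "2.0"]          # distribution versions this plugin supports

      def get_node_processes(self, hadoop_version):
          # Which Hadoop processes may be placed into a node group.
          return {"HDFS": ["namenode", "datanode"],
                  "MapReduce": ["jobtracker", "tasktracker"]}

      def configure_cluster(self, cluster):
          # Push distribution-specific configuration to the provisioned instances,
          # e.g. through the vendor's management console REST API.
          pass

      def start_cluster(self, cluster):
          # Start all Hadoop services on the configured instances.
          pass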

Sahara is tightly integrated with core OpenStack components, including Nova, Keystone, Glance, Cinder and Horizon. Sahara also enables Hadoop to use Swift as storage for MapReduce jobs. In addition, Sahara uses diskimage-builder to build images with Hadoop pre-installed. Sahara has full technical support for i18n. Sahara utilizes oslo.config, oslo.messaging, other Oslo utilities, and PBR for packaging.
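
For instance, configuration handling follows the usual oslo.config pattern of OpenStack projects of this era; the option names and defaults below are made up for illustration and are not Sahara's actual options.

  # Minimal oslo.config sketch in the style used by OpenStack projects of this
  # era; the option names and defaults are illustrative, not Sahara's actual ones.
  from oslo.config import cfg

  opts = [
      cfg.StrOpt('plugins_base_path',
                 default='/etc/sahara/plugins',
                 help='Hypothetical path where distribution plugins are looked up.'),
      cfg.IntOpt('cluster_provision_timeout',
                 default=300,
                 help='Hypothetical timeout (in seconds) for cluster provisioning.'),
  ]

  CONF = cfg.CONF
  CONF.register_opts(opts)

  if __name__ == '__main__':
      # Parses --config-file and CLI overrides the same way other OpenStack services do.
      CONF(project='sahara')
      print(CONF.plugins_base_path, CONF.cluster_provision_timeout)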

We hope to eventually consume Ceilometer for metrics, Heat for orchestration and potentially Ironic for provisioning bare metal or hybrid Hadoop clusters. Sahara is committed to operating and integrating with the OpenStack ecosystem.

Basic roadmap for the project

The current release provides:

  • REST API for templates/clusters management
  • Plugin mechanism to support multiple Hadoop distributions
  • Multi-tenancy support for all objects
  • Integration with Keystone for authentication
  • Integration with Cinder
  • Alpha version of Python bindings
  • nova-network support
  • diskimage-builder elements for building pre-installed images
  • OpenStack Dashboard plugin with all Sahara functionality supported

Next release plans

Sahara 0.3 (then known as Savanna) is targeted to be released with OpenStack Havana. It will include the following features:

  • Elastic Data Processing - provide Hadoop as a Service: allow users to run SQL-like Hive queries, Pig or MapReduce jobs through Sahara. Users will be able to run jobs of different types without explicitly starting a cluster.
  • Provisioning of complex Hadoop clusters including Hadoop ecosystem products such as Hive, Pig and Oozie.
  • New architecture - Sahara will support a multi-host installation, which will increase availability and speed up provisioning. That is an intermediate step toward High Availability.
  • Full support of both Nova Network and Neutron.
  • Extended OpenStack Dashboard plugin with support of all new functionality.
  • Python bindings.

Location of project source code

  • Sahara on Launchpad: https://launchpad.net/Sahara/
  • Savanna at StackForge: https://github.com/stackforge/savanna/
  • OpenStack Dashboard plugin at StackForge: https://github.com/stackforge/savanna-dashboard/

Programming language, required technology dependencies

  • Language: Python
  • Dependencies: alembic, eventlet, flask, jsonschema, paramiko, pbr, sqlalchemy, a message queue and an SQL database

Is project currently open sourced? What license?

Yes, under the Apache 2.0 license.

Level of maturity of software and team

  • Team: Working together for more than six months, including people from three different companies: Mirantis, Red Hat and Hortonworks.

Proposed project technical lead and qualifications

Sergey Lukjanov (SergeyLukjanov on irc) is the Tech Lead of the Sahara project at Mirantis. His main responsibilities are architecture design and community-related work in Sahara. He is also a top contributor and reviewer of Sahara and oversees all Launchpad and Gerrit activity. Sergey is experienced in Big Data projects and technologies (Hadoop, HDFS, HBase, Cassandra, Twitter Storm, etc.) and enterprise-grade solutions. He has been elected as the Sahara PTL by the community - https://wiki.openstack.org/wiki/Sahara/PTL.

Other project developers and qualifications

Current sahara-core team

In addition to Sergey Lukjanov:

Alexander Ignatov (aignatov on irc) is a Senior Software Engineer at Mirantis. He has expertise in networks, Java and distributed systems such as Hadoop and HBase. Alexander has been involved in the project since its beginning. He is the main author of the Vanilla Hadoop plugin.

Matthew Farrellee (mattf on irc) is a Principal Software Engineer and Engineering Manager at Red Hat with over a decade of experience in distributed and computational system development and management. Matt has been involved with Savanna since it was renamed from EHO. He is a major contributor to the diskimage-builder elements for Sahara and an active participant in architecture design discussions. He is integrating Sahara within the Fedora Big Data SIG.

John Speidel (jspeidel on irc) is a Senior Member of Technical Staff at Hortonworks. He has 15 years of experience developing commercial middleware systems with a focus on distributed transaction processing. John is a co-author of the Hortonworks Data Platform plugin for Sahara.

Active Code Contributors

Dmitry Mescheryakov (dmitryme on irc) is a Senior Software Engineer at Mirantis. His primary expertise is Java, Linux and networking. He has been involved in the Sahara project since its beginning. Dmitry has made major contributions to the core and UI parts of Sahara.

Alexander Kuznetsov (akuznetsov on irc) is a Principal Software Engineer at Mirantis. He has expertise in Hadoop, Machine Learning and in building robust and scalable applications. Alexander is one of the initiators of the project and is responsible for the general architecture of Sahara.

Nadya Privalova (nadya on irc) is a Software Engineer at Mirantis. Her expertise includes Java, Hadoop, HBase, Pig and networking. Nadya has made several major contributions to the project and is currently working on EDP for Sahara 0.3 (then known as Savanna).

Nikita Konovalov (NikitaKonovalov on irc) is a Software Engineer at Mirantis. His expertise includes Python, Java, Twitter Storm and UX. He is the main author of the Sahara Dashboard plugin for Horizon.

Ruslan Kamaldinov (ruhe on irc) is a Development Manager at Mirantis. He has expertise in Linux, networks and distributed systems such as Hadoop and HBase. Ruslan contributed a major part of the Sahara documentation.

Ilya Tyaptin (ityaptin on irc) is a Software Engineer at Mirantis. He has experience in Java and Python. Ilya is working on the EDP feature for Sahara 0.3 (then known as Savanna).

Ivan Berezovskiy (ivan on irc) is a Deployment Engineer at Mirantis. He is the main author of diskimage-builder elements for Sahara.

Nikolay Mahotkin (nmakhotkin on irc) is a Software Engineer at Mirantis. Nikolay has worked on Sahara since its inception and researched Oozie, Hive and Pig. He implemented Hive and Oozie support for Sahara 0.3 (then known as Savanna).

Sergey Reshetnyak (sreshetniak on irc) is a Software Engineer at Mirantis. His skills include Linux, networks and Python. Sergey has worked on Sahara since April 2013 and is working on the core parts of Sahara.

Vadim Rovachev (vrovachev on irc) is a Quality Assurance Engineer at Mirantis. His expertise includes Python and Selenium. He is a co-author of the integration tests for Sahara.

Yaroslav Lobankov (ylobankov on irc) is a Quality Assurance Engineer at Mirantis. His expertise includes Python and Selenium. He is a co-author of the integration tests for Sahara.

Trevor McKay (tmckay on irc) is a Senior Software Engineer at Red Hat with experience in distributed computing, user interface development, client server applications and control systems. He is working on the EDP part of Sahara 0.3 (then known as Savanna).

Chad Roberts (crobertsrh on irc) is a Senior Software Engineer at Red Hat. His expertise includes Python, C/C++, Java and JavaScript. He has been involved with client server applications for over 13 years and is currently focused on integrating EDP functionality into the Sahara Dashboard UI for the 0.3 release.

Jonathan Maron (jmaron on irc) is a Sr. Member of Technical Staff at Hortonworks. Over the years Jon has participated in a number of JCP Expert Groups, published multiple articles, and was a co-author of "Java Transaction Processing: Design and Implementation". He is a co-author of the Hortonworks Data Platform plugin for Sahara.

Architecture Design Contributors (no code submitted)

Ilya Elterman (ielterman on irc) is a Senior Director, Cloud Services at Mirantis. He is one of the initiators of the project and participates in general architecture discussions of Sahara.

Erik Bergenholtz (ebergenholtz on irc) is a Director of Engineering at Hortonworks and brings more than 20 years of experience in developing software for the enterprise. Erik is excited to be bridging the gap between Hadoop and OpenStack through development of the HDP Sahara plugin. He is involved in architecture design discussions.

Infrastructure requirements (testing, etc)

Our code and reviews are hosted on OpenStack Gerrit, and our bugs and specs on Launchpad. Unit tests and all flake8/hacking checks run on OpenStack Jenkins, and integration tests run on our own Jenkins server for each patch set. We hope to move our integration tests to the OpenStack infrastructure. We have Sphinx-based docs published at readthedocs (http://savanna.rtfd.org) for 0.3 and earlier, with newer docs at http://docs.openstack.org/developer/sahara; they consist of dev, admin and user guides along with descriptions of the REST API, the plugin SPI, etc.

No additional infrastructure requirements are expected.

Have all current contributors agreed to the OpenStack CLA?

Yes.

Related Links

  • Savanna Puppet Module (planned): https://github.com/stackforge/puppet-savanna/
  • Code reviews: https://review.openstack.org/#/q/project:stackforge/savanna,n,z
  • Documentation (0.3 and earlier): https://savanna.readthedocs.org/en/latest/index.html
  • Community
    • Wiki: https://wiki.openstack.org/wiki/Sahara
    • General mailing list: http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
    • Development mailing list: http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Raised Questions + Answers

"Clustering" API / commons

We are planning to contribute to this activity and participate in the Design Summit session on this topic. We would like to prepare our vision for clustering before the summit. Some thoughts about clustering have already been posted in a mailing list thread.

Why both provisioning + EDP? && Intersections with Heat

Currently Sahara provisions instances, installs a management console (such as Apache Ambari) on one of them, and communicates with that console over its REST API to prepare and run all requested services on all instances. So the only provisioning Sahara does itself is instance and volume creation plus their initial configuration, such as generating /etc/hosts for all instances. Most or all of these operations will eventually be replaced by Heat integration during the potential incubation in the Icehouse cycle; after that we will concentrate on EDP (Elastic Data Processing) operations with an extremely small provisioning part.
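
To make the "initial configuration" step concrete, the /etc/hosts generation mentioned above amounts to roughly the following; this is an illustrative sketch with made-up instance data, not Sahara's actual code.

  # Illustrative sketch of the /etc/hosts generation mentioned above; not
  # Sahara's actual implementation, just the general idea.
  def generate_etc_hosts(instances):
      """Map each cluster instance's private IP to its hostname so that
      Hadoop daemons on different nodes can resolve each other."""
      lines = ["127.0.0.1 localhost"]
      for instance in instances:
          lines.append("%s %s" % (instance["internal_ip"], instance["hostname"]))
      return "\n".join(lines) + "\n"

  # The resulting snippet would then be written to every instance over SSH
  # (e.g. with paramiko, which is already among the project's dependencies).
  cluster_instances = [
      {"hostname": "demo-master-001", "internal_ip": "10.0.0.11"},
      {"hostname": "demo-worker-001", "internal_ip": "10.0.0.12"},
      {"hostname": "demo-worker-002", "internal_ip": "10.0.0.13"},
  ]
  print(generate_etc_hosts(cluster_instances))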

Here is a wiki page with our plans on how to integrate with Heat: https://wiki.openstack.org/wiki/Sahara/HeatIntegration

Intersections with Trove

Hadoop isn't a database or just data storage, but a huge ecosystem with a multitude of data-processing tools. Additionally, we are looking at integration with other data processing tools such as Twitter Storm. So there is no intersection with Trove, which is DBaaS, and we have no plans to deploy databases. Moreover, the aim of the EDP part of Sahara is to enable Hadoop to process data located on an arbitrary store, including SQL and NoSQL databases. That is a natural connection point between Sahara and Trove: we think it will not require much effort to make Hadoop deployed by Sahara consume data from a database deployed by Trove. The idea was already discussed in a mailing list thread: http://lists.openstack.org/pipermail/openstack-dev/2013-September/thread.html#14958

Integration with other OpenStack projects

We're planning to integrate with Ceilometer to store some metrics in it. Blueprints related to the Ceilometer integration: https://blueprints.launchpad.net/sahara/+spec/ceilometer-integration and https://blueprints.launchpad.net/sahara/+spec/hadoop-cluster-tracking