Sahara/Incubation

Revision as of 19:20, 24 September 2013 by Sergey Lukjanov (talk | contribs) (Q&A section added)

Project codename

Savanna

Summary

Elastic Hadoop clusters provisioning and management on OpenStack and elastic data processing (on-demand Hadoop job workflow).

Mission Statement

Discussions of the program name and mission statement are currently in progress (see the mailing list thread).

Program: Data Processing

Mission: To provide the OpenStack community with an open, cutting edge, performant and scalable data processing stack and associated management interfaces.

Detailed Description

Savanna provides users with simple means to use Hadoop on OpenStack. The project’s primary goal is to provide provisioning of on-demand Hadoop clusters as well as on-demand Hadoop job workflow (similar to Amazon Elastic MapReduce).

Cluster operations include full cluster lifecycle management. That includes cluster provisioning, scaling and termination. Savanna supports various configuration options which are Hadoop distribution specific. Savanna also provides a built-in cluster template mechanism that allows configuration to be created once and deployed multiple times in a single-click fashion.
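The "create once, deploy many times" idea behind cluster templates can be illustrated with a minimal sketch. This is not Savanna's actual data model or API: the class names, field names, flavor names and process names below are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeGroup:
    """A group of identically configured instances (hypothetical model)."""
    name: str
    flavor: str
    count: int
    processes: tuple


@dataclass(frozen=True)
class ClusterTemplate:
    """Reusable cluster description: defined once, launched many times."""
    name: str
    plugin: str
    node_groups: tuple
    configs: tuple = ()  # (key, value) pairs, kept immutable

    def instantiate(self, cluster_name, overrides=None):
        """Return a concrete cluster spec as a plain dict."""
        configs = dict(self.configs)
        configs.update(overrides or {})  # per-cluster tweaks win
        return {
            "name": cluster_name,
            "plugin": self.plugin,
            "node_groups": [
                {"name": g.name, "flavor": g.flavor, "count": g.count,
                 "processes": list(g.processes)}
                for g in self.node_groups
            ],
            "configs": configs,
        }


# Illustrative values only; none of these names come from Savanna itself.
template = ClusterTemplate(
    name="small-hadoop",
    plugin="vanilla",
    node_groups=(
        NodeGroup("master", "m1.medium", 1, ("namenode", "jobtracker")),
        NodeGroup("worker", "m1.small", 3, ("datanode", "tasktracker")),
    ),
    configs=(("dfs.replication", 2),),
)
dev = template.instantiate("dev-cluster")
test = template.instantiate("test-cluster", {"dfs.replication": 3})
```

The template is immutable, so every cluster launched from it starts from the same definition, and per-cluster overrides never leak back into the template.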

Data operations insulate users from cluster creation tasks and let them work directly with data. Users specify a job to run and a data source, and Savanna takes care of the entire job and cluster lifecycle: bringing up the cluster, configuring the job and data source, running the job and shutting down the cluster. Savanna supports several job types: MapReduce jobs, Hive queries, Pig jobs and Oozie workflows. The data can be taken from various sources: Swift, remote HDFS, NoSQL and SQL databases.
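The job and cluster lifecycle described here (bring up the cluster, configure, run, shut down) can be sketched as follows. This is a toy model under stated assumptions, not Savanna's implementation: `FakeCluster`, `transient_cluster` and `run_job` are hypothetical names, and the data source string is a placeholder.

```python
from contextlib import contextmanager


class FakeCluster:
    """Stand-in for a provisioned Hadoop cluster (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.state = "Active"
        self.log = []

    def configure(self, job, data_source):
        self.log.append(("configure", job["type"], data_source))

    def run(self, job):
        self.log.append(("run", job["type"]))
        return {"job": job["name"], "status": "SUCCEEDED"}

    def terminate(self):
        self.state = "Terminated"


@contextmanager
def transient_cluster(name):
    """Bring a cluster up for one job and always shut it down afterwards."""
    cluster = FakeCluster(name)
    try:
        yield cluster
    finally:
        cluster.terminate()  # runs even if the job raises


def run_job(job, data_source):
    """The user supplies only a job and a data source."""
    with transient_cluster("edp-" + job["name"]) as cluster:
        cluster.configure(job, data_source)
        result = cluster.run(job)
    return result, cluster


# Placeholder job and data source, for illustration only.
result, cluster = run_job(
    {"name": "wordcount", "type": "MapReduce"},
    "swift://example-container/input",
)
```

Wrapping the cluster in a context manager is one way to guarantee that the shutdown step happens even when the job fails, which matters when clusters exist only for the duration of a job.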

There are a number of vendors offering customized Hadoop distributions. Savanna is designed to support any of these distributions via a plugin mechanism. At the moment, plugins for two distributions are implemented: Vanilla Apache Hadoop and Hortonworks Data Platform. Modern Hadoop installations include not only Hadoop itself but also many products from the Apache Hadoop ecosystem, such as Pig, Hive, Oozie, Sqoop and Hue. Savanna aims to provide support for all of these services.

Savanna is tightly integrated with core OpenStack components, including Nova, Keystone, Glance, Cinder and Horizon. Savanna also enables Hadoop to use Swift as storage for MapReduce jobs. In addition, Savanna uses diskimage-builder to build images with Hadoop pre-installed. Savanna has full technical support for i18n. Savanna utilizes oslo.config, oslo.messaging, other Oslo utilities, and PBR for packaging.

We hope to eventually consume Ceilometer for metrics, Heat for orchestration and potentially Ironic for provisioning bare metal or hybrid Hadoop clusters. Savanna is committed to operating and integrating with the OpenStack ecosystem.

Basic roadmap for the project

The current release provides

  • REST API for templates/clusters management
  • Plugin mechanism to support multiple Hadoop distributions
  • Multi-tenancy support for all objects
  • Integration with Keystone for authentication
  • Integration with Cinder
  • Alpha version of Python bindings
  • nova-network support
  • diskimage-builder elements for building pre-installed images
  • OpenStack Dashboard plugin with all Savanna functionality supported

Next release plans

Savanna 0.3 is targeted to be released with OpenStack Havana. It will include the following features:

  • Elastic Data Processing - provide Hadoop as a Service: allow users to run SQL-like Hive queries, Pig or MapReduce jobs through Savanna. Users will be able to run jobs of different types without explicitly starting a cluster.
  • Provisioning of complex Hadoop clusters including Hadoop ecosystem products such as Hive, Pig and Oozie.
  • New architecture - Savanna will support multi-host installation which will increase its availability and speed up provisioning. That is an intermediate step to High Availability.
  • Full support of both Nova Network and Neutron.
  • Extended OpenStack Dashboard plugin with support of all new functionality.
  • Python bindings.

Location of project source code

Programming language, required technology dependencies

Language
Python
Dependencies
alembic, eventlet, flask, jsonschema, paramiko, pbr, sqlalchemy, message queue, sql db

Is project currently open sourced? What license?

Yes, under the Apache 2.0 license.

Level of maturity of software and team

  • Team: Working together for more than six months including people from three different companies: Mirantis, Red Hat and Hortonworks.

Proposed project technical lead and qualifications

Sergey Lukjanov (SergeyLukjanov on irc) is the Tech Lead of the Savanna project at Mirantis. His main responsibilities are architecture design and community-related work in Savanna. He is also a top contributor to and reviewer of Savanna, and he oversees all Launchpad and Gerrit activity. Sergey is experienced in Big Data projects and technologies (Hadoop, HDFS, HBase, Cassandra, Twitter Storm, etc.) and enterprise-grade solutions. He has been elected Savanna PTL by the community - https://wiki.openstack.org/wiki/Savanna/PTL.

Other project developers and qualifications

Current savanna-core team

In addition to Sergey Lukjanov:

Alexander Ignatov (aignatov on irc) is a Senior Software Engineer at Mirantis. He has expertise in networks, Java and distributed systems such as Hadoop and HBase. Alexander has been involved in the project since its beginning. He is the main author of the Vanilla Hadoop plugin.

Matthew Farrellee (mattf on irc) is a Principal Software Engineer and Engineering Manager at Red Hat with over a decade of experience in distributed and computational system development and management. Matt has been involved with Savanna since it was renamed from EHO. He is a major contributor to the diskimage-builder elements for Savanna and an active participant in architecture design discussions. He is integrating Savanna within the Fedora Big Data SIG.

John Speidel (jspeidel on irc) is a Senior Member of Technical Staff at Hortonworks. He has 15 years of experience developing commercial middleware systems with a focus on distributed transaction processing. John is a co-author of the Hortonworks Data Platform plugin for Savanna.

Active Code Contributors

Dmitry Mescheryakov (dmitryme on irc) is a Senior Software Engineer at Mirantis. His primary expertise is Java, Linux and networking. He has been involved in the Savanna project since its beginning. Dmitry has made major contributions to the core and UI parts of Savanna.

Alexander Kuznetsov (akuznetsov on irc) is a Principal Software Engineer at Mirantis. He has expertise in Hadoop, Machine Learning and in building robust and scalable applications. Alexander is one of the initiators of the project and is responsible for the general architecture of Savanna.

Nadya Privalova (nadya on irc) is a Software Engineer at Mirantis. Her expertise includes Java, Hadoop, HBase, Pig and networking. Nadya has made several major contributions to the project and is currently working on EDP for Savanna 0.3.

Nikita Konovalov (NikitaKonovalov on irc) is a Software Engineer at Mirantis. His expertise includes Python, Java, Twitter Storm and UX. He is the main author of the Savanna Dashboard plugin for Horizon.

Ruslan Kamaldinov (ruhe on irc) is a Development Manager at Mirantis. He has expertise in Linux, networks and distributed systems such as Hadoop and HBase. Ruslan contributed a major part of the Savanna documentation.

Ilya Tyaptin (ityaptin on irc) is a Software Engineer at Mirantis. He has experience in Java and Python. Ilya is working on the EDP feature for Savanna 0.3.

Ivan Berezovskiy (ivan on irc) is a Deployment Engineer at Mirantis. He is the main author of diskimage-builder elements for Savanna.

Nikolay Mahotkin (nmakhotkin on irc) is a Software Engineer at Mirantis. Nikolay has worked on Savanna since its inception and researched Oozie, Hive and Pig. He implemented Hive and Oozie support for Savanna 0.3.

Sergey Reshetnyak (sreshetniak on irc) is a Software Engineer at Mirantis. His skills include Linux, networks and Python. Sergey has worked on Savanna since April 2013. He is working on core parts of Savanna.

Vadim Rovachev (vrovachev on irc) is a Quality Assurance Engineer at Mirantis. His expertise includes Python and Selenium. He is a co-author of the integration tests for Savanna.

Yaroslav Lobankov (ylobankov on irc) is a Quality Assurance Engineer at Mirantis. His expertise includes Python and Selenium. He is a co-author of the integration tests for Savanna.

Trevor McKay (tmckay on irc) is a Senior Software Engineer at Red Hat with experience in distributed computing, user interface development, client server applications and control systems. He is working on the EDP part of Savanna 0.3.

Chad Roberts (crobertsrh on irc) is a Senior Software Engineer at Red Hat. His expertise is Python, C/C++, Java and JavaScript. He has been involved with client server applications for over 13 years and is currently focused on the Savanna Dashboard UI integrating EDP functionality for the 0.3 release.

Jonathan Maron (jmaron on irc) is a Sr. Member of Technical Staff at Hortonworks. Over the years Jon has participated in a number of JCP Expert groups, published multiple articles, and was co-author of "Java Transaction Processing: Design and Implementation". He is a co-author of the Hortonworks Data Platform plugin for Savanna.

Architecture Design Contributors (no code submitted)

Ilya Elterman (ielterman on irc) is a Senior Director, Cloud Services at Mirantis. He is one of the initiators of the project and participates in general architecture discussions of Savanna.

Erik Bergenholtz (ebergenholtz on irc) is Director of Engineering at Hortonworks and brings more than 20 years of experience in developing software for the enterprise. Erik is excited to be bridging the gap between Hadoop and OpenStack through development of the HDP Savanna plugin. He is involved in architecture design discussions.

Infrastructure requirements (testing, etc)

All our code and reviews are hosted on OpenStack Gerrit, and our bugs and specs on Launchpad. Unit tests and all flake8/hacking checks run on OpenStack Jenkins, and integration tests run on our own Jenkins server for each patch set. We hope to move our integration tests to the OpenStack infrastructure. Our Sphinx-based docs are published at readthedocs and consist of dev, admin and user guides along with descriptions of the REST API, plugin SPI, etc.

No additional infrastructure requirements are expected.

Have all current contributors agreed to the OpenStack CLA?

Yes.

Related Links

Raised Questions + Answers

"Clustering" API / commons

We are planning to contribute to this activity and participate in the Design Summit session on this topic. We would like to prepare our vision for clustering before the summit. Some thoughts about clustering have already been posted in a mailing list thread.

Why both provisioning + EDP? && Intersections with Heat

Currently, Savanna provisions instances, installs a management console (such as Apache Ambari) on one of them, and communicates with it through the console's REST API to prepare and run all requested services on all instances. So the only provisioning that Savanna itself performs is the creation of instances and volumes and their initial configuration, such as generating /etc/hosts for all instances. Most or even all of these operations will eventually be handed over to Heat through integration work during the potential incubation in the Icehouse cycle. After that, we will concentrate on EDP (Elastic Data Processing) operations, with only a very small provisioning part.
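To give a sense of how small this initial configuration step is, here is a minimal sketch of generating /etc/hosts content for a set of instances. It is not Savanna's actual code: the function name, the input shape (pairs of IP address and hostname) and the `novalocal` domain suffix are assumptions made for illustration.

```python
def generate_etc_hosts(instances, domain="novalocal"):
    """Build /etc/hosts content so every instance can resolve the others.

    `instances` is an iterable of (ip, hostname) pairs (assumed shape).
    """
    lines = ["127.0.0.1 localhost"]
    # Sort by hostname so the same file is generated on every instance.
    for ip, hostname in sorted(instances, key=lambda item: item[1]):
        lines.append("{} {}.{} {}".format(ip, hostname, domain, hostname))
    return "\n".join(lines) + "\n"


# Illustrative addresses and hostnames only.
hosts = generate_etc_hosts([
    ("10.0.0.3", "worker-1"),
    ("10.0.0.2", "master"),
])
```

Each instance would receive the same generated file, which lets Hadoop services address each other by name without relying on external DNS.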

Here is a wiki page with our plans on how to integrate with Heat https://wiki.openstack.org/wiki/Savanna/HeatIntegration

Intersections with Trove

Hadoop isn't a database or just data storage, but a huge ecosystem with many data processing tools. Additionally, we are looking at integration with other data processing tools such as Twitter Storm. So there is no overlap with Trove, which is DBaaS, and we have no plans to deploy databases. Moreover, the aim of the EDP part of Savanna is to enable Hadoop to process data located in an arbitrary store, including SQL and NoSQL databases. That is a natural connection point between Savanna and Trove: we think it will not require much effort to make Hadoop deployed by Savanna consume data from a DB deployed by Trove. The idea has already been discussed in a mailing list thread.

Integration with other OpenStack projects

We're planning to integrate with Ceilometer to store some metrics in it. Blueprints related to the Ceilometer integration: https://blueprints.launchpad.net/savanna/+spec/ceilometer-integration and https://blueprints.launchpad.net/savanna/+spec/hadoop-cluster-tracking