Sahara/ClusterHA

Summary

The system shall provide system-level HA: even if a component fails during Hadoop provisioning, the system shall be able to complete the provisioning without defect or error.

Release Note

When this is implemented, the system will be able to complete Hadoop provisioning without defect or error, even if a component fails along the way.

User stories

  • The operator gets the list of failed clusters through the Savanna web UI.
  • The operator clicks the resume icon.
  • The cluster is recreated by this operation (a minimal sketch of this resume flow follows the list).
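
A minimal, self-contained sketch of the resume flow these stories describe. The Cluster class, the status strings, and resume_cluster() are illustrative assumptions, not Savanna's actual API:

  # Illustrative sketch only; Cluster and the status values are assumptions.
  from dataclasses import dataclass

  @dataclass
  class Cluster:
      id: str
      status: str  # e.g. 'Active', 'Error', 'Resuming'

  def resume_cluster(cluster: Cluster) -> Cluster:
      """Recreate a failed cluster when the operator clicks 'resume'."""
      if cluster.status != 'Error':
          raise ValueError("only failed clusters can be resumed")
      cluster.status = 'Resuming'
      # ... re-run provisioning from the recorded failure point ...
      cluster.status = 'Active'
      return cluster

  print(resume_cluster(Cluster(id='c-1', status='Error')))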

Design

(Figure: ClusterHA1.png)
(Figure: ClusterHA2.png)

Implementation

  • Check the cluster status:
    • Instance state (is it up and accessible?)
    • Volume creation, attachment, and mounting
    • Ambari server/agent installation and configuration
  • If an error is detected, the following steps are performed (see the sketch after this list):
    • Update the DB table used by the ClusterHA module (table name: ClusterHA).
    • Delete the instance.
    • Detach and delete the volume.
    • Return the value (cluster_id, status).
  • Resume cluster creation.
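
A runnable sketch of the check-and-rollback flow above, assuming each check can be written as a per-instance predicate. The check functions, record_failure(), and the status strings stand in for the real Savanna calls and are assumptions, not the project's API:

  # Sketch only; instances are plain dicts and every name is assumed.
  def check_instance(instance):   # is the instance up and accessible?
      return instance.get('up', False) and instance.get('accessible', False)

  def check_volume(instance):     # volume created, attached and mounted?
      return instance.get('volume_mounted', False)

  def check_ambari(instance):     # Ambari server/agent installed and configured?
      return instance.get('ambari_ok', False)

  def record_failure(cluster_id, instance):
      # Placeholder for the DB update against the ClusterHA table.
      print("ClusterHA <- cluster=%s instance=%s" % (cluster_id, instance['id']))

  def check_cluster(cluster_id, instances):
      for instance in instances:
          if not (check_instance(instance) and check_volume(instance)
                  and check_ambari(instance)):
              record_failure(cluster_id, instance)
              # ... delete the instance, detach and delete its volume ...
              return cluster_id, 'Error'
      return cluster_id, 'Active'

  cluster_id, status = check_cluster('c-1', [{'id': 'i-1', 'up': True,
                                              'accessible': True,
                                              'volume_mounted': False}])
  print(cluster_id, status)  # -> c-1 Error; the operator can then resume creation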

Code Changes

  • service/api.py
  • service/instances.py
  • service/volumes.py
  • plugins/hdp/hadoopserver.py
  • plugins/hdp/ambariplugin.py
  • conductor/api.py
  • conductor/manager.py
  • db/api.py
  • db/sqlalchemy/api.py
  • db/sqlalchemy/models.py (a possible ClusterHA model sketch follows this list)

...etc...
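
One possible shape for the ClusterHA table mentioned in the Implementation section, written as a db/sqlalchemy/models.py-style SQLAlchemy model (SQLAlchemy 1.4+ assumed). The columns are guesses based on the (cluster_id, status) return value and are not taken from any actual Savanna schema:

  # Hypothetical model; every column here is an assumption.
  import sqlalchemy as sa
  from sqlalchemy.orm import declarative_base

  Base = declarative_base()

  class ClusterHA(Base):
      __tablename__ = 'ClusterHA'

      id = sa.Column(sa.Integer, primary_key=True)
      cluster_id = sa.Column(sa.String(36), nullable=False)
      instance_id = sa.Column(sa.String(36))
      status = sa.Column(sa.String(80))  # e.g. 'Error', 'Resuming'

  # Create the table in an in-memory SQLite DB to show the model is valid.
  engine = sa.create_engine('sqlite://')
  Base.metadata.create_all(engine)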

Test/Demo Plan

This need not be added or completed until the specification is nearing beta. 

Unresolved issues

TBD