Sahara/Templates

Hadoop has a large number of parameters, and it is hard for end users to find a configuration that makes a cluster perform well. The template mechanism simplifies the creation and configuration of a Hadoop cluster: the end user only specifies a cluster template and provides the parameters that need to be changed, and the rest of the cluster configuration comes from the template. It is assumed that templates are created by experienced Hadoop administrators. If a user needs to redefine a parameter, they can create a custom template or override the parameter during cluster creation.

Template usage is limited by two things: plugin and Hadoop version. A template is always plugin-specific, because it contains configurations specific to that plugin; a template can therefore be used only with the plugin it was created for. The same applies to the Hadoop version.
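As a sketch, this compatibility rule can be expressed as a simple check. The field names follow the template examples on this page; the helper itself is illustrative and not part of the Sahara API.

```python
def is_compatible(template, plugin_name, hadoop_version):
    """Return True if the template may be used with the given plugin
    and Hadoop version (illustrative helper, not Sahara API)."""
    return (template["plugin"] == plugin_name and
            template["hadoop_version"] == hadoop_version)


template = {"plugin": "apache-hadoop", "hadoop_version": "1.1.1"}
print(is_compatible(template, "apache-hadoop", "1.1.1"))  # True
print(is_compatible(template, "apache-hadoop", "2.0.0"))  # False
```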

Sahara has two types of templates: cluster templates and node group templates.

Node Group Templates
A node group template contains the configuration for a node in the cluster. It has “Node Group” in its name because a cluster consists of groups of nodes sharing the same configuration. The template includes configuration for Hadoop processes and VM characteristics (e.g. the number of reduce slots for a task tracker, the number of CPUs, and the amount of RAM). The VM characteristics are specified with an OpenStack flavor.

A node group template contains the following parameters: id, flavor, image, name, description, plugin, hadoop_version, node_processes, and node_configs.

Example:

    {
        "id": "aee4-strf-o14s-fd34",
        "flavor": "4",
        "image": "ah91-aij1-u78x-iunm",
        "name": "fat task tracker + data node",
        "description": "a template for big nodes ...",
        "plugin": "apache-hadoop",
        "hadoop_version": "1.1.1",
        "node_processes": ["task tracker", "data node"],
        "node_configs": {
            "service:mapreduce": {
                "mapred.tasktracker.map.tasks.maximum": 8,
                "mapred.tasktracker.reduce.tasks.maximum": 3,
                ...
            },
            "service:hdfs": {
                ...
            },
            "general": {
                ...
            }
        }
    }

Cluster Templates
A cluster template contains configuration that applies to the whole cluster, e.g. the HDFS replication factor or the HDFS block size. It also contains a list of node group templates. Ideally, this allows a user to create a cluster in one click, just by specifying the cluster template.

Example:

    {
        "id": "asdf-wdvc-9as0-q23w",
        "name": "small cluster",
        "description": "a template for a small cluster",
        "plugin": "apache-hadoop",
        "hadoop_version": "1.1.1",
        "configs": {
            "service:mapreduce": {
                "compression": "snappy"
            },
            "service:hdfs": {
                "hdfs_replication_factor": 3
            },
            "general": {
                ...
            }
        },
        "node_groups_templates": [
            {
                "name": "master node",
                "node-group-template": "aee4-strf-o14s-fd34",
                "count": 1
            },
            {
                "name": "workers",
                "node-group-template": "fe1t-2t4f-1oa4-fdik",
                "count": 3
            }
        ]
    }

= Plugin integration =

A plugin should provide the following facilities to support templates:
 * Provide the list of configs by implementing the get_configs(...) method.
 * Verify the user inputs specified for configs in the validate_cluster(...) method.
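The two hooks above can be sketched as a Python interface. The method names come from the list above; the base-class name, signatures, and the flattened config layout in the example plugin are assumptions for illustration.

```python
class ProvisioningPluginBase:
    """Hypothetical base class showing the two template-related hooks."""

    def get_configs(self, hadoop_version):
        """Return the list of config specs editable by users."""
        raise NotImplementedError

    def validate_cluster(self, cluster):
        """Verify the user inputs specified for configs."""
        raise NotImplementedError


class ExamplePlugin(ProvisioningPluginBase):
    def get_configs(self, hadoop_version):
        return [{"name": "mapred.tasktracker.map.tasks.maximum",
                 "applicable_target": "service:mapreduce",
                 "scope": "node", "default": 2,
                 "required": True, "type": "int"}]

    def validate_cluster(self, cluster):
        # Reject config names the plugin does not know about.
        # (Per-target nesting is flattened here for brevity.)
        known = {c["name"] for c in self.get_configs(cluster["hadoop_version"])}
        for name in cluster.get("configs", {}):
            if name not in known:
                raise ValueError("unknown config: %s" % name)
```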



Configs
A plugin should provide the list of configs editable by users. Essentially, a “config” is the specification of a single parameter. The parameter can target either the general configuration or the configuration of a specific service (mapreduce, hdfs). It also has a scope of either the whole cluster or a specific node group; the scope determines in which type of template the config is presented. 'cluster' scoped configs are presented in the cluster template.

'node' scoped configs appear both in cluster templates and in node group templates. When they are specified in a cluster template, they serve as new defaults for the node group templates used in the cluster.
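A minimal sketch of that precedence, assuming the dict layouts from the JSON examples above (the helper name is hypothetical):

```python
def effective_node_configs(cluster_configs, node_group_configs):
    """Merge node-scoped configs: values from the cluster template act
    as new defaults, and values set on the node group template override
    them (illustrative helper, not Sahara API)."""
    merged = {}
    for target, params in cluster_configs.items():
        merged.setdefault(target, {}).update(params)
    for target, params in node_group_configs.items():
        merged.setdefault(target, {}).update(params)
    return merged


cluster = {"service:mapreduce": {"mapred.tasktracker.map.tasks.maximum": 4,
                                 "mapred.tasktracker.reduce.tasks.maximum": 2}}
node = {"service:mapreduce": {"mapred.tasktracker.map.tasks.maximum": 8}}
# map.tasks.maximum comes from the node group (8);
# reduce.tasks.maximum falls back to the cluster-level default (2).
print(effective_node_configs(cluster, node))
```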

Since the plugin provides the list of configs, it must also be able to apply them to the cluster when the user supplies values.

Example config:

    {
        "name": "mapred.tasktracker.map.tasks.maximum",
        "applicable_target": "service:mapreduce",
        "scope": "node",
        "default": 2,
        "required": true,
        "type": "int",
        "description": "amount of map tasks per node"
    }
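For illustration, checking a user-supplied value against such a spec might look like the sketch below. The type table and the helper name are assumptions covering only the fields shown in the example above.

```python
def check_user_input(config, value):
    """Validate one user input against a config spec (illustrative)."""
    types = {"int": int, "string": str, "bool": bool}
    if value is None:
        if config.get("required"):
            raise ValueError("%s is required" % config["name"])
        return config.get("default")
    # Note: bool is a subclass of int in Python; a production check
    # would distinguish them, which is out of scope for this sketch.
    if not isinstance(value, types[config["type"]]):
        raise TypeError("%s expects type %s" % (config["name"], config["type"]))
    return value


spec = {"name": "mapred.tasktracker.map.tasks.maximum",
        "applicable_target": "service:mapreduce",
        "scope": "node", "default": 2, "required": True, "type": "int"}
print(check_user_input(spec, 8))  # 8
```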