

Beyond Metering - Extending Ceilometer

Etherpad: http://etherpad.openstack.org/grizzly-ceilometer-beyond-metering

Over the past few months, numerous requests have been made to the Ceilometer project to extend its scope from metering to monitoring or alerting. This raises quite a few challenges, but since we are instrumenting more and more OpenStack components to extract data from them, it makes sense to think about how we could extend our capabilities over time.

  • CloudWatch type functionality
    • what granularity of data is needed
    • should we collect data the same way
    • should we store collected data in our database
    • what other components would be needed
  • Alerting type functionality
    • what granularity of data is needed
    • should we collect data the same way
    • should we store collected data in our database
    • what other components would be needed
  • Other types?

What is the difference between monitoring and metering?

metering

  • used by a billing system which is a part of the deployment
  • has a fixed interval (normally 5-15 min)
  • the data must be persistent
  • it is important that the data comes from a trusted source (not spoofed)

monitoring

  • can be used by anyone (real user)
  • variable interval (set by the user) as low as 1 min
  • data can expire (deleted after a week or two)
  • data can be user generated or from a trusted source (configurable)
  • alarms can be configured to trigger notifications

What are alarms?

The purpose of alarms is to notify a user when a meter matches certain criteria.

Some examples:

  • "Tell me when the maximum disk utilization exceeds 90%"
  • "Tell me when the average CPU utilization exceeds 80% over 120 seconds"
  • "Tell me when my web app is becoming unresponsive" (loadbalancer latency meter)
  • "Tell me when my httpd daemon dies" (custom user script that checks daemon health)

How can you use alarms?

Create an alarm

{
 'period': '300',
 'eval_periods': '2',
 'meter': 'CPUUtilization',
 'function': 'average',
 'operator': 'gt',
 'threshold': '50',
 'resource_id': 'inst-002',
 'source': 'OS/compute',
 'alarm_actions': ['rpc/my_notify_topic', 'http://bla.com/bla'],
 'ok_actions': ['rpc/my_notify_topic']
}


This will check the "CPUUtilization" meter events every 300 seconds; if the average CPUUtilization for inst-002 was > 50% in both of the last two 300-second periods, it will send an RPC notification on the "my_notify_topic" topic and POST the alarm details to http://bla.com/bla.

Then, when the metric drops back below the threshold, the alarm will execute its "ok_actions".
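
As a minimal sketch (not the project's actual evaluator), the following shows how such an alarm definition could be applied to recent samples; the sample list and the way actions get fired are assumptions.

import operator

OPERATORS = {'gt': operator.gt, 'ge': operator.ge, 'lt': operator.lt, 'le': operator.le}
FUNCTIONS = {'average': lambda vals: sum(vals) / len(vals), 'max': max, 'min': min, 'sum': sum}


def evaluate_alarm(alarm, samples, now):
    """Return 'alarm' or 'ok' for one alarm dict (fields as in the example above).

    samples is a hypothetical list of (timestamp, value) pairs already filtered
    to the alarm's meter and resource_id.
    """
    period = int(alarm['period'])
    eval_periods = int(alarm['eval_periods'])
    compare = OPERATORS[alarm['operator']]
    aggregate = FUNCTIONS[alarm['function']]
    threshold = float(alarm['threshold'])

    for n in range(eval_periods):
        start, end = now - (n + 1) * period, now - n * period
        values = [v for t, v in samples if start <= t < end]
        if not values or not compare(aggregate(values), threshold):
            return 'ok'      # one non-breaching period is enough to stay in / return to OK
    return 'alarm'           # threshold breached in every one of the last eval_periods

# A transition to 'alarm' would fire alarm_actions; a transition back to 'ok'
# would fire ok_actions.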

Why integrate them into one project?

  • one statistics api
  • have one vibrant community rather than two that are less so
  • code reuse

Requirements

Metrics

Kinds of metrics to monitor: http://docs.amazonwebservices.com/AmazonCloudWatch/latest/DeveloperGuide/CW_Support_For_AWS.html

Sample at intervals between 10 s and 60 s; transmit at intervals between 1 min and 5 min.

User Rest API

Metrics:

  • list metrics

Data:

  • get stats
  • put custom metric data

Alarms:

  • create alarm
  • delete alarm
  • show alarm
  • show alarm history
  • enable/disable alarm
  • set alarm state (temporary override)
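
Purely as an illustration of how these operations could look to a client, here is a hedged sketch using Python and requests; the base URL, paths, and payloads are hypothetical, not an agreed API.

import requests

BASE = 'http://ceilometer.example.com/v2'          # hypothetical endpoint
HEADERS = {'X-Auth-Token': 'my-keystone-token'}    # hypothetical auth

# Metrics: list metrics
requests.get(BASE + '/metrics', headers=HEADERS)

# Data: get stats
requests.get(BASE + '/metrics/CPUUtilization/statistics',
             params={'resource_id': 'inst-002', 'period': 300, 'function': 'average'},
             headers=HEADERS)

# Data: put custom metric data
requests.post(BASE + '/metrics/httpd_health',
              json={'value': 1, 'units': 'count', 'resource_id': 'inst-002'},
              headers=HEADERS)

# Alarms: create, show, show history, enable/disable, set state, delete
alarm = {'meter': 'CPUUtilization', 'function': 'average', 'operator': 'gt',
         'threshold': '50', 'period': '300', 'eval_periods': '2',
         'resource_id': 'inst-002'}
requests.post(BASE + '/alarms', json=alarm, headers=HEADERS)
requests.get(BASE + '/alarms/alarm-001', headers=HEADERS)
requests.get(BASE + '/alarms/alarm-001/history', headers=HEADERS)
requests.put(BASE + '/alarms/alarm-001', json={'enabled': False}, headers=HEADERS)
requests.put(BASE + '/alarms/alarm-001/state', json={'state': 'ok'}, headers=HEADERS)
requests.delete(BASE + '/alarms/alarm-001', headers=HEADERS)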

What fields are needed?

  • namespace.name
  • dimensions (resource_id, ++)
  • time stamp
  • units
  • user_id
  • project_id
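
For illustration, one custom data point carrying these fields might look like the sketch below; the exact field names and values are assumptions, not a settled schema.

# Hypothetical data point showing the fields listed above (plus the measured value).
sample = {
    'namespace.name': 'OS/compute.CPUUtilization',
    'dimensions': {'resource_id': 'inst-002', 'instance_type': 'm1.small'},
    'timestamp': '2012-10-15T12:00:00Z',
    'value': 42.0,
    'units': 'Percent',
    'user_id': 'user-123',
    'project_id': 'project-456',
}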

Groupings (AWS dimensions)

Some examples:

  • Autoscaling group name
  • image id
  • instance id (resource_id)
  • instance type

Some of these are already collected by Ceilometer in the counter, as resource_id or metadata.

Horizontal Scale

Hash on the resource_id to access the storage for a particular metric. But how do we know where a user's metrics are stored? Ask nova/cinder/quantum/...? And how do we know where the data is for an autoscaling group?
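
A minimal sketch of that placement, assuming a fixed list of storage hosts and simple modulo hashing; a real deployment would presumably use a consistent-hash ring so hosts can be added without remapping everything.

import hashlib

STORAGE_HOSTS = ['db-on-host-1', 'db-on-host-2', 'db-on-host-3']   # assumed topology


def storage_host_for(resource_id):
    """Map a resource_id (real or virtual) to the host holding its metric data."""
    digest = hashlib.md5(resource_id.encode('utf-8')).hexdigest()
    return STORAGE_HOSTS[int(digest, 16) % len(STORAGE_HOSTS)]

# Both the collector (when writing samples) and the API (when answering a stats
# query for inst-002) compute the same mapping, e.g. storage_host_for('inst-002').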

Re: grouping/dimensions (an idea)

  1. make a virtual resource for each dimension that we don't explicitly measure (it gets hashed and stored in a known place)
  2. on events such as instance create/resize/destroy, modify the db entry so the virtual resource points to the actual resources
  3. when queried we just look for the resource "autoscaling_group_4" and we can then "see" how to find the statistics for that dimension

So when we see an instance create, we also need to update the image_id and instance_type virtual resources. Heat will also need to emit RPC events like Nova does, but for scaling groups (create/resize/destroy, essentially membership changes).

If we do this, I think we can do better than CW regarding querying based on dimensions: http://docs.amazonwebservices.com/AmazonCloudWatch/latest/DeveloperGuide/cloudwatch_concepts.html#Di...

We could also make the user a virtual resource to make finding the location of the user's resources easier. Basically we end up with a number of fields that we can use to locate the metric data.

So logically we hash "image-num-4" and get db-on-host-1, then we hash each of the children to go find the hosts where each resource's data is stored.


db-on-host-1:

  resources:
    id: image-num-4
    type: image_id
    children: {inst-094, bla, bla}

db-on-host-2:

  data:
    name: cpu_util
    value:
    units:
    timestamp:
    resource_id: inst-094
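
Putting the two records above together, a query for a dimension could first resolve the virtual resource to its children and then fan out to wherever each child's data lives. The sketch below assumes the storage_host_for() hash from the Horizontal Scale section, plus hypothetical get_resource_record() and get_samples() accessors.

def statistics_for_dimension(virtual_resource_id, meter, host_for,
                             get_resource_record, get_samples):
    """Collect samples for every child of a virtual resource such as 'image-num-4'.

    host_for(resource_id) is the hash-based placement sketched earlier;
    get_resource_record(host, resource_id) and get_samples(host, resource_id, meter)
    are hypothetical storage accessors.
    """
    host = host_for(virtual_resource_id)
    record = get_resource_record(host, virtual_resource_id)   # e.g. {'children': ['inst-094', ...]}

    samples = []
    for child in record['children']:
        child_host = host_for(child)       # each child resource hashes to its own storage host
        samples.extend(get_samples(child_host, child, meter))
    return samples

# e.g. statistics_for_dimension('image-num-4', 'cpu_util',
#                               storage_host_for, db_read_record, db_read_samples)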


As an extension we could allow the user to create user-groupings of their own and alarm on those.

Alarms

Initially this can be poll-based and run on the node that the resource (including virtual resources/dimensions) is hashed to.

Instrumentation

How does this fit in with Ceilometer? We will need to modify the agents to send the information we need at the rate we need. The agent would hash the resource_id to find out where to send the metric data.

Optimisations (reducing network activity)

Assume data aggregation by default, but provide an API to turn it off.

What does that mean? The agent could aggregate multiple samples locally (in memory) into one data point (sum=31, min=2, max=5, sample_count=10) and send only that, at a much longer interval. An alternative is to still send all the data points, just delayed and in bulk.

We could automatically turn this off when we have alarms attached to metrics.
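
A minimal sketch of that in-memory aggregation, assuming the agent flushes on its transmit timer; the aggregate fields follow the sum/min/max/sample_count example above.

class LocalAggregator:
    """Collapse many samples for one meter into a single aggregated data point."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.sum = 0.0
        self.min = None
        self.max = None
        self.sample_count = 0

    def add(self, value):
        self.sum += value
        self.min = value if self.min is None else min(self.min, value)
        self.max = value if self.max is None else max(self.max, value)
        self.sample_count += 1

    def flush(self):
        """Return one aggregated point (called on each transmit interval) and start over."""
        point = {'sum': self.sum, 'min': self.min,
                 'max': self.max, 'sample_count': self.sample_count}
        self.reset()
        return point

# agg = LocalAggregator(); agg.add(2); agg.add(5); send(agg.flush())
# If an alarm is attached to the meter, the agent would bypass the aggregator
# and send raw samples instead.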