

Beyond Metering - Extending Ceilometer

Over the past few months, numerous requests have been made to the Ceilometer project to extend its scope from just metering to monitoring or alerting. This raises quite a few challenges, but as we instrument more and more OpenStack components to extract data from them, it makes sense to think about how we could extend our capabilities over time.

  • CloudWatch type functionality
    • what granularity of data is needed
    • should we collect data the same way
    • should we store collected data in our database
    • what other components would be needed
  • Alerting type functionality
    • what granularity of data is needed
    • should we collect data the same way
    • should we store collected data in our database
    • what other components would be needed
  • Other types?

What is the difference between monitoring and metering?

metering

  • used by a billing system which is a part of the deployment
  • has a fixed interval (typically 5 to 15 min)
  • the data must be persistent
  • it is important that the data comes from a trusted source (not spoofed)

monitoring

  • can be used by anyone (real user)
  • variable interval (set by the user) as low as 1 min
  • data can expire (deleted after a week or two)
  • data can be user generated or from a trusted source (configurable)
  • alarms can be configured to trigger notifications

What are alarms?

The purpose of alarms is to notify a user when a meter matches certain criteria.

Some examples

"Tell me when the maximum disk utilization exceeds 90%" "Tell me when the average CPU utilization exceeds 80% over 120 seconds" "Tell me when my web app is becoming unresponsive" (loadbalancer latency meter) "Tell me when my httpd daemon dies" (custom user script that checks daemon health)

Why integrate them into one project?

  • reuse the data collectors / pollsters
  • reduce the overall resource consumption compared with having duplicated systems
  (more messaging of stats data and CPU usage)
  • have one vibrant community rather than two that are less so
  • code reuse

Why bring alarms into ceilometer?

1) to improve the responsiveness

The sooner you get data, the sooner you can trigger a possible alarm. If the alarming system sits outside of the statistics collection, it only sees older data, and the delay between the alarm-causing event and the generated alarm becomes too long.

So consider:

    "HttpFailureAlarm": {
     "Type": "AWS::CloudWatch::Alarm",
     "Properties": {
        "AlarmDescription": "Restart the WikiDatabase if httpd fails >= 1 time in 5 minutes",
        "MetricName": "ServiceFailure",
        "Namespace": "system/linux",
        "Statistic": "SampleCount",
        "Period": "300",
        "EvaluationPeriods": "1",
        "Threshold": "1",
        "AlarmActions": [ { "Ref": "WebServerRestartPolicy" } ],
        "ComparisonOperator": "GreaterThanOrEqualToThreshold"
      }

If the event (ServiceFailure) happens near the beginning of the 5 minute period, you will only get the alarm at the 5 minute mark when alarming is outside of the stats collection (assuming stats are collected every 5 minutes). However, if the alarm calculation is near the source, then such alarms can be triggered as soon as the event happens.
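
A minimal sketch of where that evaluation would happen, using made-up helper names (this is an illustration, not proposed code):

    # Illustration: evaluating the HttpFailureAlarm per sample at the source
    # means the action fires as soon as a ServiceFailure sample is seen,
    # rather than when the 300 second period is rolled up centrally.
    def on_sample(sample_count, trigger_action):
        # Threshold and operator taken from the alarm above:
        # SampleCount >= 1 within the period
        if sample_count >= 1:
            trigger_action("WebServerRestartPolicy")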

2) to reduce network usage

When you have alarms you don't tend to poll the API as much, because you create alarms to monitor things for you. Instead you tend to:

  • do historic queries (averages/max/min for different intervals)
  • act on alarms

If the alarms are checked/generated on the compute host, then the alarms can be checked often but stats don't have to be sent straight away. So just because the alarm interval is 60 sec does not mean that we have to send stats at that frequency.
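
For example (figures assumed purely for illustration): with an alarm evaluated locally every 60 seconds and a 10 minute metering interval, a compute host sends one batched stats message every 10 minutes instead of ten per-minute messages, while the alarm still reacts at the 60 second granularity.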

Possible Changes

API

  • add alarm api
  • add alarm history api
  • add auth (make the api public)

Agent

  • agent with a REST API to post user meter data (source == user/$user_id)
  • agents get alarms from the collector (including the alarms for its resources); the agent then polls at the required frequency per resource, checks for an alarm (sending an alarm notification if needed), and aggregates the data into a single message (see the sketch at the end of this section)
  • sends the aggregated message at the normal metering interval

{ resource_id ... counter_volume [2,4,5,6,7,3,9,76,76,4,3,5,1] }

- counter_volume is now a list

  • the collector stores the data as normal
  • now the API can return finer-grained data points if asked to (though not available immediately, which is OK)
  • a periodic task *could* clean out the extra data at some time later
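
As a rough sketch of the agent flow above (all names are assumptions for illustration, not proposed code), the agent could keep the per-poll values in a list and flush them as one message at the metering interval:

    # Illustrative sketch only: poll per resource at the alarm frequency,
    # check alarms locally, and batch the samples into one metering message.
    import time

    def run_agent(resource_id, poll_meter, check_alarm, send_to_collector,
                  poll_interval=60, metering_interval=600):
        volumes = []
        elapsed = 0
        while True:
            value = poll_meter(resource_id)
            volumes.append(value)
            check_alarm(resource_id, value)   # may emit an alarm notification now
            elapsed += poll_interval
            if elapsed >= metering_interval:
                # counter_volume is now a list, as described above
                send_to_collector({"resource_id": resource_id,
                                   "counter_volume": volumes})
                volumes = []
                elapsed = 0
            time.sleep(poll_interval)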