Beyond Metering - Extending Ceilometer

Etherpad: http://etherpad.openstack.org/grizzly-ceilometer-beyond-metering

Over the past few months, numerous requests have been made to the Ceilometer project to extend its scope from just metering to monitoring or alerting. This raises quite a few challenges, but as we are instrumenting more and more OpenStack components to extract data from them, it makes sense to think about how we could extend our capabilities over time.

  • CloudWatch-type functionality
    • what granularity of data is needed
    • should we collect data the same way
    • should we store collected data in our database
    • what other components would be needed
  • Alerting-type functionality
    • what granularity of data is needed
    • should we collect data the same way
    • should we store collected data in our database
    • what other components would be needed
  • Other types?

What is the difference between monitoring and metering?

metering

  • used by a billing system which is a part of the deployment
  • has a fixed interval (normally 5 to 15 min)
  • the data must be persistent
  • it is important that the data comes from a trusted source (not spoofed)

monitoring

  • can be used by anyone (real user)
  • variable interval (set by the user) as low as 1 min
  • data can expire (deleted after a week or two)
  • data can be user generated or from a trusted source (configurable)
  • alarms can be configured to trigger notifications

What are alarms?

The purpose of alarms is to notify a user when a meter matches certain criteria.

Some examples

  • "Tell me when the maximum disk utilization exceeds 90%"
  • "Tell me when the average CPU utilization exceeds 80% over 120 seconds"
  • "Tell me when my web app is becoming unresponsive" (loadbalancer latency meter)
  • "Tell me when my httpd daemon dies" (custom user script that checks daemon health)

How can you use alarms?

Create an alarm

{
 'period': '300',
 'eval_periods': '2',
 'meter': 'CPUUtilization',
 'function': 'average',
 'operator': 'gt',
 'threshold': '50',
 'resource_id': 'inst-002',
 'source': 'OS/compute',
 'alarm_actions': ['rpc/my_notify_topic', 'http://bla.com/bla'],
 'ok_actions': ['rpc/my_notify_topic']
}


This will check the "CPUUtilization" meter events every 300 seconds, and if the average CPUUtilization was > 50% (for inst-002) for both of the last two 300-second periods, it will send an RPC notification on the "my_notify_topic" topic and post the alarm details to http://bla.com/bla.

Then, when the average drops back below this level, it will perform the "ok_actions".
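
A minimal sketch of that evaluation, assuming a hypothetical helper get_averages() (an invented placeholder, not an existing Ceilometer API) that returns one aggregate value per evaluation period:

 import operator

 # Sketch only: get_averages() would return, e.g., the average
 # CPUUtilization over each of the last two 300-second windows.
 OPERATORS = {'gt': operator.gt, 'lt': operator.lt}

 def evaluate(alarm, get_averages, notify):
     values = get_averages(meter=alarm['meter'],
                           resource_id=alarm['resource_id'],
                           period=int(alarm['period']),
                           periods=int(alarm['eval_periods']))
     compare = OPERATORS[alarm['operator']]
     breached = all(compare(v, float(alarm['threshold'])) for v in values)
     # A real implementation would track state and fire ok_actions only
     # on the transition back below the threshold, not on every check.
     for action in alarm['alarm_actions' if breached else 'ok_actions']:
         notify(action, alarm)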

Why integrate them into one project?

  • one statistics api
  • have one vibrant community rather than two that are less so
  • code reuse

Requirements

Metrics

Kinds of metrics to monitor: http://docs.amazonwebservices.com/AmazonCloudWatch/latest/DeveloperGuide/CW_Support_For_AWS.html

Sample every 10 to 60 seconds; transmit every 1 to 5 minutes.

User REST API

Other APIs to take into account:

  • http://docs.amazonwebservices.com/AmazonCloudWatch/latest/APIReference/Welcome.html
  • http://dev.librato.com/v1/post/metrics
  • http://dmtf.org/sites/default/files/standards/documents/DSP0263_1.0.1.pdf (page ~150)

Metrics:

  • list metrics

Data:

  • get stats
  • put custom metric data

Alarms:

  • create alarm
  • delete alarm
  • show alarm
  • show alarm history
  • enable/disable alarm
  • set alarm state (temporary override)
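
As a rough illustration only, these operations might map onto REST endpoints along the following lines; the paths and verbs here are assumptions for discussion, not a defined API:

 POST   /alarms                     create alarm
 DELETE /alarms/<alarm_id>          delete alarm
 GET    /alarms/<alarm_id>          show alarm
 GET    /alarms/<alarm_id>/history  show alarm history
 PUT    /alarms/<alarm_id>/enabled  enable/disable alarm
 PUT    /alarms/<alarm_id>/state    set alarm state (temporary override)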

What fields are needed?

  • namespace.name
  • dimensions (resource_id, ++)
  • timestamp
  • units
  • user_id
  • project_id
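
For example, a single custom metric data point carrying these fields might look like the following; the exact field names and the extra 'value' field are assumptions based on the list above:

 {
  'namespace.name': 'user/web-app.request_latency',
  'dimensions': {'resource_id': 'inst-002'},
  'timestamp': '2013-02-17T23:29:00Z',
  'units': 'ms',
  'user_id': 'some-user',
  'project_id': 'some-project',
  'value': '42'
 }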

Groupings (AWS dimensions)

Some examples:

  • Autoscaling groupname
  • image id
  • instance id (resource_id)
  • instance type

Some of these are already collected in Ceilometer in the counter, as resource_id or metadata.

Alarms

Initially this can be poll-based and run on the node that the resource is hashed to (including virtual resources/dimensions).

Instrumentation

How does this fit in with Ceilometer? We will need to modify the agents to send the information we need at the rate we need. The agent would need to hash the resource_id to find out where to send the metric data.
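
A minimal sketch of that routing step, assuming a simple hash ring over a known list of collector nodes (the class, node names, and replica count are placeholders, not an existing implementation):

 import hashlib
 from bisect import bisect

 # Hypothetical hash ring: each collector node owns slices of the hash
 # space, so every resource_id maps deterministically to one node.
 class HashRing(object):
     def __init__(self, nodes, replicas=100):
         self.ring = sorted(
             (self._hash('%s-%d' % (node, i)), node)
             for node in nodes for i in range(replicas))
         self.keys = [key for key, _ in self.ring]

     @staticmethod
     def _hash(value):
         return int(hashlib.md5(value.encode()).hexdigest(), 16)

     def node_for(self, resource_id):
         index = bisect(self.keys, self._hash(resource_id)) % len(self.ring)
         return self.ring[index][1]

 ring = HashRing(['collector-1', 'collector-2', 'collector-3'])
 print(ring.node_for('inst-002'))  # always the same node for inst-002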

Optimisations (reducing network activity)

Assume data aggregation, but have an API to turn that off.

What does that mean? The agent could locally (in memory) aggregate multiple samples into one data point (sum=31, min=2, max=5, sample_count=10) and send only that, at a much longer period. An alternative is to still send all the data points (but in bulk), just delayed.

We could automatically turn this off when we have alarms attached to metrics.
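
A minimal sketch of the in-memory aggregation idea, with invented names (Aggregator, add, flush) purely for illustration:

 # Hypothetical in-memory aggregator: fold many samples into one
 # data point (sum/min/max/sample_count) and emit it on flush.
 class Aggregator(object):
     def __init__(self):
         self.samples = []

     def add(self, value):
         self.samples.append(value)

     def flush(self):
         samples, self.samples = self.samples, []
         if not samples:
             return None
         return {'sum': sum(samples),
                 'min': min(samples),
                 'max': max(samples),
                 'sample_count': len(samples)}

 agg = Aggregator()
 for value in [2, 4, 5, 6, 7, 3, 9, 76]:
     agg.add(value)
 print(agg.flush())  # one data point instead of eight samples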