Ceilometer/blueprints/api-group-by

Summary
Enhance API v2 so it accepts new arguments to perform GROUP BY operations when calculating meter statistics.

If the user requests query filtering and/or period grouping, these operations are applied first, then the GROUP BY operations are applied second.

User stories
I had an instance running for 6 hours. It started as an m1.tiny flavor during the first 2 hours and then grew to an m1.large flavor for the next 4 hours. I need to get these two durations so I can bill them at different rates.

Design example
For example, add:

g[]=

That solves the user story above with:

/v2/meters/instance/statistics?
    q[0].field=resource&q[0].op=eq&q[0].value=&
    q[1].field=timestamp&q[1].op=lt&q[1].value=&
    q[2].field=timestamp&q[2].op=gt&q[2].value=&
    g[0]=metadata.flavor&period=360

Would return

[
  { "m1.tiny": { "min": 1, "max": 1, "avg": 1, "sum": 1 } },
  { "m1.tiny": { "min": 1, "max": 1, "avg": 1, "sum": 1 } },
  { "m1.large": { "min": 1, "max": 1, "avg": 1, "sum": 1 } },
  { "m1.large": { "min": 1, "max": 1, "avg": 1, "sum": 1 } },
  { "m1.large": { "min": 1, "max": 1, "avg": 1, "sum": 1 } },
  { "m1.large": { "min": 1, "max": 1, "avg": 1, "sum": 1 } }
]

Furthermore, dropping the q[0] filter that narrows the search to only one resource allows retrieving this information for all instances over that period of time:

/v2/meters/instance/statistics?
    q[0].field=timestamp&q[0].op=lt&q[0].value=&
    q[1].field=timestamp&q[1].op=gt&q[1].value=&
    g[0]=metadata.flavor&g[1]=resource&period=360

If there was another large instance, that would return:

[
  { "m1.tiny": { "min": 1, "max": 1, "avg": 1, "sum": 1 },
    "m1.large": { "min": 1, "max": 1, "avg": 1, "sum": 1 } },
  { "m1.tiny": { "min": 1, "max": 1, "avg": 1, "sum": 1 },
    "m1.large": { "min": 1, "max": 1, "avg": 1, "sum": 1 } },
  { "m1.large": { "min": 1, "max": 1, "avg": 1, "sum": 2 } },
  { "m1.large": { "min": 1, "max": 1, "avg": 1, "sum": 2 } },
  { "m1.large": { "min": 1, "max": 1, "avg": 1, "sum": 2 } },
  { "m1.large": { "min": 1, "max": 1, "avg": 1, "sum": 2 } }
]

Angus's comments/ramblings
1) I assume we can't group by more than one field? If so, this should be a plain parameter (not an array): groupby=metadata.flavor&

You can group by more than one field, see the second example -- jd

2) period is not yet impl. - I'd better get on that ;)

3) Currently we return:

[
  {
    "min": 1, "max": 1, "avg": 1, "sum": 1,
    "count": 1, "duration": 1
  }
]

To show the groupby we could return the following:

[
  {
    "min": 1, "max": 1, "avg": 1, "sum": 1,
    "count": 1, "duration": 1,
    "groupby": "m1.tiny"
  }
]

If there is no groupby, that can just be None.

Fine with me, but you probably want "groupby": [ "m1.tiny" ] since you can group by multiple values. -- jd

We probably want that to be a mapping between the field name and its value. {'metadata.instance_type': 'm1.tiny'} -- dhellmann

4) From an implementation point of view (mongo) we have:

    MAP_STATS = bson.code.Code("""
        function() {
    -       emit('statistics', { min : this.counter_volume,
    +       emit(groupby_field, { min : this.counter_volume,
                                 max : this.counter_volume,
                                 qty : this.counter_volume,
                                 count : 1,
                                 timestamp_min : this.timestamp,
                                 timestamp_max : this.timestamp } )
        }
        """)

If we can pass the groupby field into the above function then this will be super easy. Can we generate this bson code dynamically?

I don't see why you couldn't :) -- jd

We will need to be careful about injection attacks. -- dhellmann
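One way to generate the map function dynamically while guarding against the injection attacks mentioned above is to whitelist the groupby field before substituting it into the JavaScript source. A minimal Python sketch (the template and function names here are illustrative, not the actual driver code; the resulting string would be wrapped in bson.code.Code):

```python
import string

# Illustrative template; $groupby_field is substituted before the
# JavaScript source is handed to MongoDB.
MAP_STATS_TEMPLATE = string.Template("""
    function() {
        emit(this.$groupby_field, { min : this.counter_volume,
                                    max : this.counter_volume,
                                    qty : this.counter_volume,
                                    count : 1,
                                    timestamp_min : this.timestamp,
                                    timestamp_max : this.timestamp } )
    }
""")

# Whitelist of fields guards against JavaScript injection via the API.
ALLOWED_GROUPBY_FIELDS = ('user_id', 'resource_id', 'project_id', 'source')

def make_map_stats(groupby_field):
    # Refuse anything outside the whitelist instead of interpolating
    # arbitrary user input into JavaScript source.
    if groupby_field not in ALLOWED_GROUPBY_FIELDS:
        raise ValueError('unknown groupby field: %s' % groupby_field)
    return MAP_STATS_TEMPLATE.substitute(groupby_field=groupby_field)
```

Since only whitelisted field names ever reach the substitution, the user-supplied groupby value can never alter the JavaScript.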

Metadata fields
Decided not to implement group by for metadata fields for now and to do this at a later date.

Ordering of parameters applied
Meter statistics can be called with three parameters: query filter, period, and group by. Each parameter corresponds to an operation, and it is important to note the order in which these operations are applied. Query filtering is always applied first. Then what about period and group by?

Since period grouping is basically a group by for time range, there is an ambiguity when **both** period and group by for other field(s) are requested. Conceivably, you could have


 * 1) Period grouping is applied first, followed by group by on other field(s)
 * 2) Group by on field(s) is applied first, followed by period grouping

We've chosen to implement the first possibility, where period grouping is performed first.

To summarize, the order of application is


 * 1) Query filters
 * 2) Period grouping
 * 3) Group by on other field(s)
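Driver-agnostically, that ordering can be sketched in a few lines of Python (plain dicts stand in for samples; this is an illustration, not driver code):

```python
from collections import defaultdict

def group_statistics(samples, query_filter=None, period=None, groupby=()):
    """Apply query filter, then period grouping, then group by."""
    # 1) Query filters
    if query_filter is not None:
        samples = [s for s in samples if query_filter(s)]
    # 2) Period grouping and 3) group by on other fields: each sample
    # lands in a bucket keyed by (period index, groupby field values).
    buckets = defaultdict(list)
    for s in samples:
        period_key = s['timestamp'] // period if period else None
        group_key = tuple(s[f] for f in groupby)
        buckets[(period_key, group_key)].append(s['volume'])
    return {key: {'min': min(v), 'max': max(v), 'avg': sum(v) / len(v),
                  'sum': sum(v), 'count': len(v)}
            for key, v in buckets.items()}
```

With period=360 and groupby=('user_id',), a sample at timestamp 400 for user 'u1' ends up in the bucket (1, ('u1',)), i.e. the second period for that user.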

Storage driver tests to check group by statistics
Addressed by: https://review.openstack.org/41597 "Add SQLAlchemy implementation of groupby"

Created a new class StatisticsGroupByTest in tests/storage/base.py that contains the storage tests for group by statistics and has its own test data.

The storage tests check group by statistics for
 * 1) single field, "user-id"
 * 2) single field, "resource-id"
 * 3) single field, "project-id"
 * 4) single field, "source"
 * 5) single field with invalid/unknown field value
 * 6) single metadata field (not yet implemented)
 * 7) multiple fields
 * 8) multiple metadata fields (not yet implemented)
 * 9) multiple mixed fields, regular and metadata (not yet implemented)
 * 10) single field groupby with query filter
 * 11) single metadata field groupby with query filter (not yet implemented)
 * 12) multiple field group by with multiple query filters
 * 13) multiple metadata field group by with multiple query filters (not yet implemented)
 * 14) single field with period
 * 15) single metadata field with period (not yet implemented)
 * 16) single field with query filter and period
 * 17) single metadata field with query filter and period (not yet implemented)

The test data is constructed such that the measurements are integers (specified by the "volume" attribute of the sample) and the averages in the statistics are also integers. This helps avoid floating point errors when checking the statistics attributes (e.g. min, max, avg) in the tests.

Currently, metadata group by tests are not implemented. Supporting metadata fields is a more complicated case, so we leave that for future work. The test data contains metadata fields, as a starting point for future work on metadata group by.

The group by period tests and test data are constructed so that there are periods with no samples. For the group by period tests, statistics are calculated for the periods 10:11 - 12:11, 12:11 - 14:11, 14:11 - 16:11, and 16:11 - 18:11. However, there are no samples with timestamps in the period 12:11 - 14:11. It's important to have this case to check that the storage drivers behave properly when there are no samples in a period.
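The period boundaries in those tests can be reproduced with a small helper that enumerates every period between the start and end timestamps, whether or not it contains samples (the concrete date is illustrative):

```python
from datetime import datetime, timedelta

def period_starts(start, end, period_seconds):
    """Yield the start of every period between start and end,
    including periods that will contain no samples."""
    step = timedelta(seconds=period_seconds)
    current = start
    while current < end:
        yield current
        current += step

# Four two-hour periods from 10:11 to 18:11, as in the tests above;
# the 12:11 - 14:11 period exists even though it holds no samples.
starts = list(period_starts(datetime(2013, 8, 1, 10, 11),
                            datetime(2013, 8, 1, 18, 11),
                            2 * 3600))
```

A driver must emit a result (or an explicit gap) for each of these period starts, not just for the periods where data happens to exist.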

SQL Alchemy group by implementation
Addressed by: https://review.openstack.org/41597 "Add SQLAlchemy implementation of groupby"

Decided to only implement group by for the "user-id", "resource-id", and "project-id" fields. The "source" and metadata fields are not supported. It turned out that supporting "source" in SQL Alchemy is much more complicated than "user-id", "resource-id", and "project-id".
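In SQL terms, the supported cases reduce to a plain GROUP BY over columns of the meter table. A toy sqlite3 sketch of the query shape (schema and helper are simplified stand-ins; the real driver builds the query through SQLAlchemy models):

```python
import sqlite3

# Toy schema standing in for Ceilometer's meter table; the real
# SQLAlchemy models differ, this only illustrates the GROUP BY shape.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE meter (user_id TEXT, project_id TEXT, '
             'resource_id TEXT, counter_volume REAL)')
conn.executemany('INSERT INTO meter VALUES (?, ?, ?, ?)',
                 [('u1', 'p1', 'r1', 1.0),
                  ('u1', 'p1', 'r2', 2.0),
                  ('u2', 'p1', 'r3', 4.0)])

def stats_groupby(conn, fields):
    # fields must come from a trusted whitelist before being
    # interpolated into SQL.
    cols = ', '.join(fields)
    sql = ('SELECT %s, MIN(counter_volume), MAX(counter_volume), '
           'AVG(counter_volume), SUM(counter_volume), COUNT(*) '
           'FROM meter GROUP BY %s' % (cols, cols))
    return conn.execute(sql).fetchall()
```

Grouping by "user-id", "resource-id", or "project-id" maps directly onto columns like these, which is why they were straightforward to support, while "source" involves extra joins in the real schema.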

MongoDB driver group by implementation
Addressed by: https://review.openstack.org/43043 "Adds group by statistics for MongoDB driver"

Adds group by meter statistics to the MongoDB driver for the case where the groupby fields are a combination of 'user-id', 'resource-id', 'project-id', and 'source' (metadata fields are not implemented, as noted in the "Metadata fields" section above).

Design for aggregation method
Summary: Decided to continue using the mapReduce MongoDB aggregation method, even though there are other options in the API.

There are three types of MongoDB aggregation commands: aggregate, mapReduce, group

The MongoDB manual has a comparison of these three types.

Apparently, MongoDB now recommends aggregate (aka "the aggregation pipeline", a new feature since MongoDB version 2.2) when possible:

"For most aggregation operations, the Aggregation Pipeline provides better performance and more coherent interface. However, map-reduce operations provide some flexibility that is not presently available in the aggregation pipeline."

Ceilometer currently uses the mapReduce method to calculate meter statistics, but we could conceivably switch to using aggregate or group.

Decided to stick with mapReduce because it's the most flexible. mapReduce can support non-standard aggregation operators (operations that are not min, max, avg, etc.), whereas aggregate cannot.

For example, the blueprint "Improvements for API v2" suggests an improvement:

"Provide additional statistical function (Deviation, Median, Variation, Distribution, Slope, etc...) which could be given as multiple results for a given data set collection"

Functions like deviation and median are not standard aggregation operators in the MongoDB aggregation pipeline aggregate.

The group aggregation command is less flexible than mapReduce, slower in performance than `aggregate`, and doesn't support sharded collections (i.e. a database distributed across multiple servers).

Also, for the aggregate and group methods, the result set must fit within the maximum BSON document size limit (16 MB), but:

"Additionally, map-reduce operations can have output sets that exceed the 16 megabyte output limitation of the aggregation pipeline."

Design for map functions
It's straightforward to implement group by statistics in MongoDB. The statistics are calculated using the mapReduce method. The way that mapReduce is implemented in Ceilometer, mapReduce needs a map function, a reduce function, and a finalize function.

To compute meter statistics in MongoDB, there are four cases that need to be accounted for:


 * 1) no period, no group by
 * 2) period only
 * 3) group by only
 * 4) period and group by

All the cases can be implemented by using slightly different map functions with the same reduce and finalize functions. The map function works by processing each document and emitting a key-value pair. Each case requires a different key.


 * 1) no period, no group by --> key can be anything as long as it's a constant, e.g. 'statistics'
 * 2) period only --> key is the variable "period_start"
 * 3) group by only --> key is the variable "groupby"
 * 4) period and group by --> key is the combination of variables "period_start" and "groupby"

Then we just need to pass the right values for the "groupby", "period_start", and "period_end" objects in the emitted values.

Tried to minimize duplicate code by using string substitution as much as possible in the map functions MAP_STATS, MAP_STATS_PERIOD, MAP_STATS_GROUPBY, and MAP_STATS_PERIOD_GROUPBY.
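That substitution scheme can be sketched as follows (the names and snippets are illustrative, not the actual driver source): the emitted value body is shared, and only the key expression differs between the four map functions.

```python
# Shared body emitted by every map function; only the key differs.
EMIT_BODY = """{ min : this.counter_volume,
                 max : this.counter_volume,
                 qty : this.counter_volume,
                 count : 1,
                 timestamp_min : this.timestamp,
                 timestamp_max : this.timestamp }"""

# Case 1 (no period, no group by): constant key 'statistics'.
MAP_STATS = "function() { emit('statistics', %s) }" % EMIT_BODY

# Case 3 (group by only): key taken from the groupby field, which is
# filled in later from a whitelist ('%%' escapes the placeholder so it
# survives the first substitution).
MAP_STATS_GROUPBY = "function() { emit(this.%%(field)s, %s) }" % EMIT_BODY
```

MAP_STATS_PERIOD and MAP_STATS_PERIOD_GROUPBY would reuse the same body with period_start (or period_start plus the groupby values) as the key.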

API tests to check group by statistics
Addressed by: https://review.openstack.org/44130 "Add group by statistics tests in API v2 tests"

Add API tests for group by statistics
The API group by statistics tests are in a new class StatisticsGroupByTest in tests/api/v2/test_statistics_scenarios.py

The tests implemented are group by


 * 1) single field, "user-id"
 * 2) single field, "resource-id"
 * 3) single field, "project-id"
 * 4) single field, "source" (*)
 * 5) single field with invalid/unknown field value
 * 6) multiple fields
 * 7) single field groupby with query filter
 * 8) multiple field group by with multiple query filters
 * 9) single field with start timestamp after all samples
 * 10) single field with end timestamp before all samples
 * 11) single field with start timestamp
 * 12) single field with end timestamp
 * 13) single field with start and end timestamps
 * 14) single field with start and end timestamps and query filter
 * 15) single field with start and end timestamps and period
 * 16) single field with start and end timestamps, query filter, and period

(*) Group by source isn't supported in SQLAlchemy at this time, so we have to put this test in its own class TestGroupBySource

The tests use the same data and test cases as the groupby storage tests in class StatisticsGroupByTest in tests/storage/test_storage_scenarios.py

Group by metadata fields is not implemented at this time, so there aren't any tests for metadata fields.

Add tests for new function _validate_groupby_fields in test_query.py
A new function _validate_groupby_fields was added in ceilometer/api/controllers/v2.py, so there need to be tests for it. The logical place to put the tests is tests/api/v2/test_query.py

The tests check for valid fields, invalid fields, and duplicate fields.

Add groupby parameter in stubs in test_compute_duration_by_resource_scenarios.py
In tests/api/v2/test_compute_duration_by_resource_scenarios.py, the function _stub_interval_func stubs out get_meter_statistics. Since the get_meter_statistics function now accepts a groupby parameter, the stubs should also have a groupby parameter.

An additional groupby parameter was added to the get_interval stub functions.

Revise get_json to accept groupby parameter
The method get_json in ceilometer/tests/api.py simulates an HTTP GET request for testing purposes. It has been modified to accept a groupby parameter.

API group by statistics implementation
Addressed by: https://review.openstack.org/44130 "Add group by statistics tests in API v2 tests"

The additions below were made to ceilometer/api/controllers/v2.py

Add groupby attribute to class Statistics
The API has a class Statistics that holds all the computed statistics from a meter/meter_name/statistics request. The class has been updated to include a "groupby" attribute for the group, so that we know which group the statistics are associated with. For example, if we request group by user_id, "groupby" might be {'user_id': 'user-1'}, indicating that these are the statistics for all samples with user_id 'user-1'.

Add groupby parameter to API method statistics
The API has a method statistics which is called when the user submits an HTTP GET request of the form "meter/meter_name/statistics"

This method has been updated so it can accept groupby parameters like

/v2/meters/instance/statistics?groupby=user_id&groupby=source

The groupby fields are assumed to be unicode strings, so the groupby parameter passed to statistics is a list of unicode strings. For the above example, the groupby parameter would be ['user_id', 'source'].

The API method statistics then validates the groupby fields using a new method _validate_groupby_fields and if the fields are valid, calls the get_meter_statistics method corresponding to the current storage driver with those groupby fields.

The method _validate_groupby_fields validates the groupby parameter and removes duplicate fields. This method is useful because it throws an error if an invalid field is given, i.e. a field that is not in the set ['user_id', 'resource_id', 'project_id', 'source']. Note that the duplicate fields are removed using list(set(groupby_fields)), which does not preserve the order of the groupby fields. So if a request

/v2/meters/instance/statistics?groupby=user_id&groupby=source

is made, the order could be switched from ['user_id', 'source'] to ['source', 'user_id'].