NovaQueryCache

= Query Cache in Nova (Draft) =

Overview
In large-scale OpenStack deployments, a large number of users may interact with the OpenStack API node(s) concurrently. In particular, they may issue a large number of queries, such as "describe_instances" (to retrieve details of all the instances that they own). For example, a higher-level management tool or a Web console is expected to poll instances and their state frequently. In the current OpenStack implementation, all such queries end up hitting the Database, which may create significant congestion (e.g., thousands of requests per second just for instance polling). In many cases, the responses to consecutive queries from a given user will be identical, because no changes were applied in between. Hence, caching the results of such queries could significantly improve the scalability of the OpenStack fabric, as well as the round-trip time of such queries.

In order to address the above deficiency, we introduce a Query Cache layer between the API layer and the Data Cache layer (the Database, in the current OpenStack implementation). Each instance of nova-api will have its own instance of the query cache, which keeps the latest queries that were sent to that nova-api instance, together with the corresponding responses. Each entry in the query cache is a key-value pair. The key comprises the ID of the user who initiated the query, concatenated with the query itself, while the value holds the result returned to the user.
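
As an illustration of the key layout described above, the following is a minimal Python sketch. The function name make_cache_key, the ":" separator and the JSON serialization of the query arguments are assumptions made for this example; the draft itself only states that the key is the user ID concatenated with the query.

```python
import json


def make_cache_key(user_id, query_name, args=None):
    """Build a cache key: the user ID concatenated with the query itself.

    Hypothetical helper; the separator and argument encoding are assumptions.
    """
    # Sorting the argument names makes the serialized form deterministic,
    # so the same logical query always maps to the same key.
    canonical_args = json.dumps(args or {}, sort_keys=True)
    return f"{user_id}:{query_name}:{canonical_args}"
```

With this layout, two requests by the same user for the same query and arguments produce the same key regardless of the order in which the arguments were supplied.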

Query cache usage and population
When a new query request arrives, and there is a matching valid entry in the query cache, the cached result is immediately returned. Otherwise, the query is forwarded to the Data Cache layer, and the result is saved in the query cache (as well as returned to the user). If there is not enough space in the query cache to insert the new entry, a cache replacement algorithm is applied, which discards one or more cache entries and frees up space for the new entry.
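
The lookup and population flow above can be sketched as follows. This is only a sketch under stated assumptions: the draft does not name a cache replacement algorithm, so least-recently-used (LRU) eviction is used here as one possible choice, and the class name QueryCache, the max_entries limit and the fetch callback (standing in for the call to the Data Cache layer) are hypothetical.

```python
from collections import OrderedDict


class QueryCache:
    """Per-API-node query cache with LRU replacement (assumed policy)."""

    def __init__(self, max_entries=1024):
        self.max_entries = max_entries
        self._entries = OrderedDict()  # key -> cached result, oldest first
        self.hits = 0
        self.misses = 0

    def lookup(self, key, fetch):
        """Return the cached result, or call fetch() and cache its result."""
        if key in self._entries:
            # Cache hit: mark the entry as most recently used.
            self._entries.move_to_end(key)
            self.hits += 1
            return self._entries[key]
        # Cache miss: forward the query to the Data Cache layer.
        self.misses += 1
        result = fetch()
        self._entries[key] = result
        # Free up space by discarding the least recently used entries.
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)
        return result
```

Here the cache size is counted in entries for simplicity; a real implementation would more likely bound the memory occupied by the cached results.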

Cache invalidation
When there is a change in the Data Cache (observed by the corresponding Data Model Manager), a notification is sent to *all* the query cache instances (e.g., via fanout_cast on the corresponding topic, defined in the messaging system for this purpose). Each query cache instance then discards (invalidates) the corresponding cache entries, based on a set of invalidation rules and the nature of the Data Cache update. For example, when an instance owned by a certain user is changed, the cache entries to be invalidated are those whose key comprises this user's ID concatenated with a query that may refer to that instance (such as "describe_instances", or "get_instance_details" when the argument points to the ID of the instance which has been changed).
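
The example invalidation rule above could look roughly like the sketch below. The class InvalidatingCache, the handler name on_instance_changed and the tuple-shaped keys are assumptions made for illustration; the handler is meant to be called by whatever code receives the notification from the messaging system, which is not shown here.

```python
def _is_affected(key, user_id, instance_id):
    """Invalidation rule for an instance update (illustrative only)."""
    key_user, query, arg = key
    if key_user != user_id:
        return False  # entries of other users are untouched
    if query == "describe_instances":
        return True  # lists all instances of the user, so always stale
    # Per-instance queries are stale only for the changed instance.
    return query == "get_instance_details" and arg == instance_id


class InvalidatingCache:
    """Query cache keyed by (user_id, query, arg); names are hypothetical."""

    def __init__(self):
        self._entries = {}

    def put(self, user_id, query, arg, result):
        self._entries[(user_id, query, arg)] = result

    def get(self, user_id, query, arg=None):
        return self._entries.get((user_id, query, arg))

    def on_instance_changed(self, user_id, instance_id):
        """Discard entries affected by the change; return how many."""
        stale = [k for k in self._entries
                 if _is_affected(k, user_id, instance_id)]
        for key in stale:
            del self._entries[key]
        return len(stale)
```

Scanning all entries on every notification is linear in the cache size; a per-user index of keys would avoid the full scan, at the cost of extra bookkeeping.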

Cache usage optimization
In some cases query caching can be optimized by disabling caching in certain situations that are likely to cause 'noise' in the query cache without significantly improving the cache hit ratio. For example, when "describe_instances" is requested by the admin user, it returns all the instances of all the users. The result occupies a lot of cache space, while the chance that it will remain valid until the next such request is rather low (any change in any instance will invalidate such an entry). An optimization mechanism would identify the conditions (e.g., users and/or query types) under which queries should not be cached, and would manage the caching accordingly. These conditions might be determined via configuration, or dynamically (e.g., by observing the historical cache hit ratio, and predicting which queries are likely to increase the cache hit ratio if cached).
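
One possible shape for such a mechanism is sketched below, combining a configured exclusion list with a simple observed hit ratio per (user, query) pair. The class name CachingPolicy and the parameters excluded, min_hit_ratio and min_samples are hypothetical, and the thresholds are arbitrary values chosen for the example.

```python
class CachingPolicy:
    """Decides whether a query result should be cached (illustrative)."""

    def __init__(self, excluded=(), min_hit_ratio=0.1, min_samples=20):
        self.excluded = set(excluded)  # (user_id, query) pairs never cached
        self.min_hit_ratio = min_hit_ratio
        self.min_samples = min_samples
        self._stats = {}  # (user_id, query) -> [hits, lookups]

    def record(self, user_id, query, hit):
        """Record the outcome of one cache lookup."""
        stats = self._stats.setdefault((user_id, query), [0, 0])
        if hit:
            stats[0] += 1
        stats[1] += 1

    def should_cache(self, user_id, query):
        # Static rule, e.g. taken from configuration.
        if (user_id, query) in self.excluded:
            return False
        hits, lookups = self._stats.get((user_id, query), (0, 0))
        # Too little history to judge: cache by default.
        if lookups < self.min_samples:
            return True
        # Dynamic rule: stop caching queries that rarely hit.
        return hits / lookups >= self.min_hit_ratio
```

Note that in this simple form a query that falls below the threshold is never cached again, since uncached queries cannot produce hits. A real mechanism would need to re-sample such queries periodically or decay old statistics.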

Load balancing optimization
When multiple nova-api nodes sit behind a load balancer, the load balancing algorithm may be optimized to prefer dispatching the requests of a given user to the same nova-api node. This increases the locality of each user's requests, and hence the cache hit ratio.
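
A simple way to obtain such user affinity is to derive the target node from a hash of the user ID, as in the sketch below. The function name pick_api_node and the node names in the usage note are assumptions; the draft does not prescribe a particular algorithm.

```python
import hashlib


def pick_api_node(user_id, nodes):
    """Map a user to one of the given API nodes, always the same one.

    Hypothetical helper; `nodes` is a non-empty list of node identifiers.
    """
    # A stable hash is used on purpose: Python's built-in hash() of a
    # string differs between processes, which would break affinity.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]
```

For example, pick_api_node("alice", ["api-1", "api-2", "api-3"]) returns the same node on every call as long as the node list is unchanged. With plain modulo hashing, adding or removing a node remaps most users; consistent hashing would reduce that churn.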

Alternatives
One seemingly valid alternative to the above approach is to use the query cache built into the Database implementation. However, the cache invalidation policy commonly used in database query cache implementations is often inefficient for our purposes. For example, query cache invalidation in MySQL is applied at the granularity of an entire table: for the Instances table, even if a single instance is updated, all cached queries on that table are invalidated. In our implementation, by contrast, the granularity of cache invalidation is limited to the records associated with a specific user (or is even finer in some cases).