Zaqar/bp/placement-service


Overview

Rationale: Marconi has a storage bottleneck

Proposal goal: Remove that bottleneck

The placement service aims to address this by handling storage transparently and dynamically.

Transparency

  • User transparency: availability and use of the Marconi service must not be interrupted when a migration is taking place.
  • Implementation transparency: the storage driver is handed a location/connection and cares only about serializing and deserializing data to that storage location (a minimal interface sketch follows this list).
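
To make the second point concrete, here is a minimal sketch of what implementation transparency implies for a storage driver. The class, method names, and the connection object's read/write calls are illustrative assumptions, not Marconi's actual driver API.

# Illustrative sketch only: the names below are hypothetical, not Marconi's
# real driver interface. The point is that the driver is *handed* a
# location/connection by the placement layer and only (de)serializes data.
import json


class QueueStorageDriver(object):
    def __init__(self, connection):
        # `connection` is whatever client the placement/proxy layer opened for
        # the assigned storage location; the driver never picks a location.
        self._conn = connection

    def put_message(self, queue_name, message):
        # Serialize and hand off to the given location.
        self._conn.write(queue_name, json.dumps(message))

    def get_messages(self, queue_name):
        # Deserialize on the way back out.
        return [json.loads(raw) for raw in self._conn.read(queue_name)]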

Terminology

  • Marconi partition: one Marconi master, a set of Marconi workers, and a storage deployment. This is the minimum unit of deployment: one adds a whole Marconi partition, not an individual storage node or Marconi worker.
  • Marconi master: receives requests and forwards them round-robin to Marconi workers (a minimal forwarding sketch follows this list)
  • Marconi workers: process requests and communicate with storage
  • Storage deployment: a set of storage nodes - one or many, as long as they're addressable with a single client connection
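
As a small illustration of the master/worker split, round-robin dispatch can be as simple as the sketch below; the class and worker URLs are illustrative, not part of Marconi.

# Illustrative only: round-robin selection of a Marconi worker by the master.
import itertools


class RoundRobinMaster(object):
    def __init__(self, worker_urls):
        self._workers = itertools.cycle(worker_urls)

    def next_worker(self):
        # Each incoming request goes to the next worker in the cycle.
        return next(self._workers)


master = RoundRobinMaster(["http://worker-1:8888", "http://worker-2:8888"])
print(master.next_worker())  # http://worker-1:8888
print(master.next_worker())  # http://worker-2:8888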

Reference Deployment: Smart Proxy and Partition as a Unit

This approach is emerging as the leading reference implementation for scaling the Marconi service. The primary components are:

  • A load balancer that can redirect tenant requests to a cluster URL
  • Operating Marconi at the partition level

Partitions

  • One master to round-robin tasks to workers
  • N Marconi web servers
  • A storage deployment

Operators can optimize N to match their storage configuration and persistence needs.

Smart Proxy

The smart proxy maintains a mapping from tenants/projects (ID-based) to partition URLs.
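
A minimal sketch of that lookup-and-forward step, assuming an in-memory dict for the mapping and the requests library for relaying; the project IDs, URLs, and forward() helper are illustrative assumptions.

# Illustrative sketch: the mapping below would in practice be backed by the
# partition registry/catalogue rather than a hard-coded dict.
import requests

# project/tenant ID -> partition URL
PROJECT_TO_PARTITION = {
    "12345": "http://partition-1.example.com:8888",
    "67890": "http://partition-2.example.com:8888",
}


def forward(project_id, method, path, **kwargs):
    # Relay the tenant's request to the partition that owns its queues.
    partition_url = PROJECT_TO_PARTITION[project_id]
    return requests.request(method, partition_url + path, **kwargs)


# e.g. list queues on behalf of project 12345
response = forward("12345", "GET", "/v1/queues")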

Migration Strategy

Freezing Export: have a migration service running on each Marconi partition. The service, when given a queue and a destination partition, launches an export worker. The export worker then communicates the desired data to the new partition's migration service, which in turn launches an import worker to bring in the data. In summary (a sketch of this flow follows below):

  • "Freeze" the source queue
  • Export the queue from the source
  • Import the queue to the destination
  • "Thaw" the queue

Freeze: set a particular queue as read-only at the proxy layer.
Thaw: restore a particular queue to normal status at the proxy layer.
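
A sketch of the freezing-export flow; the proxy and migration-service calls below are hypothetical placeholders, since the page does not define their APIs, and the catalogue repointing step is implied rather than listed above.

# Hypothetical helpers throughout: this only illustrates the ordering of the
# freeze / export / import / thaw steps described above.

def migrate_queue(proxy, source, destination, project_id, queue_name):
    # 1. "Freeze" the source queue: the proxy marks it read-only.
    proxy.freeze(project_id, queue_name)

    # 2. Export the queue data via the source partition's migration service.
    data = source.migration_service.export(project_id, queue_name)

    # 3. Import the data via the destination partition's migration service.
    destination.migration_service.import_queue(project_id, queue_name, data)

    # 4. Repoint the catalogue entry at the destination partition (implied).
    proxy.update_catalogue(project_id, queue_name, destination.url)

    # 5. "Thaw" the queue: the proxy restores normal read/write status.
    proxy.thaw(project_id, queue_name)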

Advantages

  • Easier to implement
  • No changes to Marconi
  • Scalable
  • Transparent

Disadvantages

  • Requires the implementation of a smart proxy - this includes: routing requests, partition management, catalogue management, regeneration, and synchronization
  • Benefits from having access to raw_read and raw_write functions with respect to the storage layer

Current State

Concepts

Partitions

Partitions have: 1) a name, 2) a weight, and 3) a list of node URIs. For example:

{
  "default": {
    "weight": 100,
    "nodes": [
      "http://localhost:8889",
      "http://localhost:8888",
      "http://localhost:8887",
      "http://localhost:8886"
    ]
  }
}
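
The page does not spell out how the weight is used; assuming it is proportional to a partition's chance of receiving a newly created queue, selection might look like this sketch (partition names and values here are illustrative).

# Sketch only: assumes a partition's weight is proportional to its chance of
# being assigned a newly created queue; the page does not define this exactly.
import random

PARTITIONS = {
    "default": {"weight": 100, "nodes": ["http://localhost:8889", "http://localhost:8888"]},
    "overflow": {"weight": 25, "nodes": ["http://localhost:8887"]},
}


def select_partition(partitions):
    names = list(partitions)
    weights = [partitions[name]["weight"] for name in names]
    pick = random.uniform(0, sum(weights))
    running = 0.0
    for name, weight in zip(names, weights):
        running += weight
        if pick <= running:
            return name
    return names[-1]


# With the weights above, "default" is chosen roughly 100/125 = 80% of the time.
print(select_partition(PARTITIONS))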

Catalogue

Catalogue entries have: 1) a key, 2) a node URI, and 3) metadata. For example:

{
  "{project_id}.{queue_name}": {
    "href": "http://localhost:8889",
    "metadata": {
      "awesome": "sauce"
    }
  }
}
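
A sketch of how an entry might be built when a queue is created: the key concatenates project ID and queue name as shown above, and the href points at one node of the owning partition. How that node is chosen is an assumption here, and the helper is illustrative.

# Illustrative only: demonstrates the "{project_id}.{queue_name}" key shape and
# the entry layout; picking a random node from the partition is an assumption.
import random


def make_catalogue_entry(project_id, queue_name, partition, metadata=None):
    key = "{0}.{1}".format(project_id, queue_name)
    entry = {
        "href": random.choice(partition["nodes"]),
        "metadata": metadata or {},
    }
    return key, entry


key, entry = make_catalogue_entry(
    "12345", "fizbit",
    {"weight": 100, "nodes": ["http://localhost:8889", "http://localhost:8888"]},
    metadata={"awesome": "sauce"})
# key == "12345.fizbit"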

API

GET /v1/partitions  # list all registered partitions

GET /v1/partitions/{name}  # fetch details for a single partition
PUT /v1/partitions/{name}  # register a new partition
DELETE /v1/partitions/{name}  # unregister a partition

# the catalogue is updated by operations routed through /v1/queues/{name}
GET /v1/catalogue  # list all entries in the catalogue for the given project ID
GET /v1/catalogue/{name}  # fetch info for the given catalogue entry
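
A hedged example of exercising these routes with the requests library; the proxy address, the PUT payload shape (taken from the partition example above), and the project-ID header name are assumptions rather than documented values.

# Assumptions: the proxy listens on localhost:8000, PUT /v1/partitions/{name}
# accepts the weight/nodes document shown earlier, and the project is passed
# via an X-Project-ID header. None of these are confirmed by this page.
import requests

PROXY = "http://localhost:8000"

# Register a partition
requests.put(PROXY + "/v1/partitions/default",
             json={"weight": 100,
                   "nodes": ["http://localhost:8889", "http://localhost:8888"]})

# List all registered partitions
print(requests.get(PROXY + "/v1/partitions").json())

# List catalogue entries for a project
print(requests.get(PROXY + "/v1/catalogue",
                   headers={"X-Project-ID": "12345"}).json())

# Unregister the partition again
requests.delete(PROXY + "/v1/partitions/default")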

Implementation

Needs Review

  • Proxy (partition, catalogue, queues handling): https://review.openstack.org/#/c/43909/
  • Proxy (v1, health): https://review.openstack.org/#/c/44356/
  • Proxy (forward the rest of the routes): https://review.openstack.org/#/c/44364/

To Do

  • Hierarchical caching: store data in the authoritative store (a mongo replica set) on write operations and cache it locally in a Redis instance, hitting the authoritative store only on failed lookups (see the sketch after this list)
  • Benchmarking
  • Unit tests
  • Functional tests
  • Configuration
  • Catalogue and partition registry regeneration
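
A sketch of the hierarchical caching item above, using redis-py for the local cache and pymongo for the authoritative replica set; the database, collection, and key naming are assumptions, not the actual implementation.

# Sketch of a write-through / read-through catalogue cache. The database,
# collection, and key scheme below are assumptions.
import json

import pymongo
import redis

local = redis.StrictRedis(host="localhost", port=6379)
catalogue = pymongo.MongoClient("mongodb://localhost:27017")["proxy"]["catalogue"]


def put_entry(key, entry):
    # Writes land in the authoritative store, then refresh the local cache.
    catalogue.replace_one({"_id": key}, {"_id": key, "entry": entry}, upsert=True)
    local.set(key, json.dumps(entry))


def get_entry(key):
    # Hit the local Redis cache first; fall back to the replica set on a miss.
    cached = local.get(key)
    if cached is not None:
        return json.loads(cached)
    doc = catalogue.find_one({"_id": key})
    if doc is None:
        return None
    local.set(key, json.dumps(doc["entry"]))
    return doc["entry"]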

Deployment

  • Bring up the authoritative replica set
  • Bring up redis-server on each box
  • Launch marconi.proxy.app:app using a WSGI/HTTP server (see the example below)
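
A minimal way to serve the proxy with the standard library's WSGI server, assuming marconi.proxy.app exposes a WSGI callable named app, as the module path marconi.proxy.app:app suggests; a production deployment would more likely sit behind a dedicated WSGI/HTTP server such as gunicorn or uwsgi.

# Minimal sketch: serve the proxy app with the stdlib WSGI server. Assumes
# marconi.proxy.app exposes a WSGI callable named `app`; host and port are
# arbitrary choices here.
from wsgiref.simple_server import make_server

from marconi.proxy.app import app

if __name__ == "__main__":
    make_server("0.0.0.0", 8000, app).serve_forever()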

More Ideas/Deprecated

Deprecated