Manila/Replication Design Notes

Intro
The replication design isn't complete yet: we have a vision for the feature and are working out the details. After collecting feedback, we will rework these notes into a formal design doc.

Threats to data

 * Hardware failures
 * Network failures
 * Power failures
 * Natural disasters (fire, flood, hurricanes, meteors)
 * Accidental corruption (bugs, human error)
 * Malicious users (viruses, hackers, disgruntled employees)

Solution for protecting data

 * Highly available storage systems
   * Strategies
     * RAID/erasure coding (protection from media failures)
     * Clusters (protection from component failures)
     * Multipath network topologies (protection from connection failures)
     * Redundant power (protection from power failures)
   * Advantages
     * Transparent to clients
     * Zero RPO/RTO (except for maybe a brief pause)
   * Disadvantages
     * Typically limited distance (weak against site-wide failures)
   * Manila
     * HA storage solutions fit into Manila today without changes
     * Use "share types" to indicate that certain storage backends are highly available
 * Backups
   * Strategies
     * Tape archive (in the old days)
     * Virtual tape archive (Amazon Glacier or similar)
     * Local snapshots (standard Cinder/Manila features)
     * Remote snapshots (copy snapshot to an object store, like Cinder backup)
   * Advantages
     * Can be very cheap
     * Stores multiple points in time (protection from corruption/malicious destruction)
   * Disadvantages
     * RPO is typically high
     * Local snapshots don't protect against equipment/site failures
     * Remote snapshots typically have to be restored before they become accessible -- high RTO
   * Manila
     * Local snapshots are implemented today
     * Remote snapshots (aka "backup") are planned for the future
 * Replication
   * Strategies
     * Synchronous mirroring
     * Asynchronous mirroring
   * Advantages
     * Can handle much longer distances
     * Can offer very low RPO/RTO
   * Disadvantages
     * Not transparent to network clients
   * Manila
     * This is what we're proposing!

Overview of the proposal

 * Start from the user experience
   * If it doesn't address a user's problem, then the rest of the design is pointless
 * Also think about the administrator's needs and responsibilities
 * Consider vendors'/driver authors' practical issues
   * The design is intentionally open-ended to make it as easy as possible for vendors to implement

User Experience

 * Users will be able to create "replicated" shares and non-replicated shares by specifying a share type
   * All existing shares are non-replicated
   * Administrator must specifically create share types that include the replication extra_spec
 * Open question -- should the "replication" extra spec be visible to tenants?
   * We could do this similarly to how driver_handles_share_servers is visible to tenants
   * Alternatively: rely on the administrator to communicate which types are replicated, and rely on the "replicated" attribute appearing on the shares after they're created
 * Open question -- what should the "replication" extra spec be called?
   * Vendors should be free to offer additional capabilities for different types of replication, but there must be a standard capability/extra_spec that controls the Manila replication feature
 * Replicated shares will have a replicated=true flag returned by the API
 * Replicated shares will also have a replication_state field
   * In Sync - stable state - share data is being replicated to 1 or more secondary controllers
   * Out Of Sync - stable state - share data is NOT being replicated
   * Resyncing - transitional state - backend is trying to reestablish replication
   * Failing Over - transitional state - share is changing to a different primary
 * Two new tenant-visible APIs
   * Failover
     * Can only be called on shares in the In Sync state
     * Causes existing export locations to be removed (the user must unmount first to avoid data loss)
     * Causes the share to go into the Failing Over state
     * Causes a new export location to appear (presumably on a different storage controller in another location)
     * Expected to succeed whether the storage controller hosting the share is online or not
     * After a successful failover the share may be in 1 of 2 states:
       * In Sync -- if the primary storage controller was online and the backend was able to reverse the replication from the secondary
       * Out Of Sync -- the primary storage controller was offline, or the backend wasn't able to immediately reestablish replication from the secondary
   * Resync
     * Can only be called on shares in the Out Of Sync state
     * Causes the share to go into the Resyncing state
     * Causes the backend to attempt to reestablish replication (if possible)
     * On success, the share goes to the In Sync state
     * On failure, the share goes back into the Out Of Sync state
       * This would be expected as long as the primary remains down
 * Replicated shares will have a primary_location=True/False flag
   * Indicates whether the share is being served by the original (primary) storage controller
   * After failing over, this field would be set to False to indicate that the share is being served by a secondary storage controller
   * Secondary locations may not have all of the capabilities of the primary
     * For example, the share_type may specify an SSD extra_spec, but the secondary storage controller may have spinning disks
     * It is up to the administrator to configure this as desired
     * Manila doesn't schedule the secondary location, so this should be okay
   * If the share is not being served by the primary storage controller, a failover should always attempt to move it back to the primary, if possible
 * This proposal allows replication to more than 1 place (at the administrator's option, if the backend allows it)
   * Users aren't aware of how many replication locations there are or which one their share is at -- they only know whether it's at the primary or not
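The states and transitions above can be sketched as a small state machine. This is a minimal illustration of the proposed semantics, not actual Manila code; the class and method names are assumptions made for the example.

```python
# Sketch of the proposed replication_state machine. State names and
# transition rules come from the notes above; everything else is
# illustrative only.

IN_SYNC = 'in_sync'
OUT_OF_SYNC = 'out_of_sync'
RESYNCING = 'resyncing'
FAILING_OVER = 'failing_over'


class ReplicatedShare(object):
    def __init__(self):
        self.replication_state = IN_SYNC
        self.primary_location = True

    def failover(self, reverse_replication_ok):
        # Failover may only be called on In Sync shares.
        if self.replication_state != IN_SYNC:
            raise ValueError('failover requires the In Sync state')
        self.replication_state = FAILING_OVER
        # ... backend makes a secondary location accessible ...
        self.primary_location = False
        # End state depends on whether the backend could reverse replication.
        self.replication_state = (
            IN_SYNC if reverse_replication_ok else OUT_OF_SYNC)

    def resync(self, success):
        # Resync may only be called on Out Of Sync shares.
        if self.replication_state != OUT_OF_SYNC:
            raise ValueError('resync requires the Out Of Sync state')
        self.replication_state = RESYNCING
        # ... backend attempts to reestablish replication ...
        self.replication_state = IN_SYNC if success else OUT_OF_SYNC
```

Note that failover never leaves the share in a transitional state: it always lands on In Sync or Out Of Sync, matching the two possible outcomes described above.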

Administrator Experience

 * Administrator's job today
   * Install/configure hardware
     * Understand the physical layout of the infrastructure
     * Understand network connections and the logical topology of the infrastructure
     * Think about failure domains and contingencies in case of failures
       * Today if a storage controller hosting Manila shares fails, there's not much an admin can do other than try to get it back online
   * Configure Manila
     * Set up storage controllers
     * Install software
     * Configure backends in manila.conf (typically hostnames, logins, passwords, etc.)
 * Administrator's new responsibilities with Manila DR
   * Choose primary/secondary sites for replication
     * Could be between racks, between aisles, between floors, between buildings, between cities, or between continents
   * Decide whether to do symmetric (active/active) or asymmetric (active/passive) replication
     * Individual shares always have a primary (accessible) and secondary (inaccessible) location
     * Active/active refers to having 2 controllers where some primaries are on each one and they replicate to each other
     * Active/passive refers to having all of the primaries on one controller and all of the secondaries on the other
   * Find a driver that supports replication
     * It is very important for the generic driver to support replication
       * We want to offer this functionality to everyone
       * It's needed for the gate to be able to test this feature
       * Looking for volunteers to help with the generic driver enhancement
   * Set up hardware with sufficient bandwidth to accommodate mirroring
   * Configure Manila
     * No new config flags for replication
     * Each driver can decide how replication relationships should be expressed
       * Assume that replication will most likely be between same-vendor backends
       * Could be as simple as 1 new config option with a list of names of other backends that can be replicated to
     * It would be a really good idea to have an HA configuration of Manila in case a site failure could affect controller nodes
 * Respond to outages
   * Administrators typically have significantly more information than tenants about the actual infrastructure
   * Administrators should communicate with their tenants in the event of an outage
   * If the administrator decides that failover is appropriate given the nature of the outage, he can/should initiate it
     * Sometimes an outage may be brief enough that waiting for the primary to come back is better than failing over
     * This is one reason we don't propose automated failover
   * Open question: how can we optimize failing over a large number of shares?
   * Users can initiate a failover on their own, but we believe that would only be wise for testing purposes
 * Fix outages and recover
   * At the end of an outage, the administrator should Resync all Out Of Sync shares
     * Open question: how can we optimize resyncing a large number of shares?
   * Users should be notified that the outage has ended and it is safe to fail back to the primary
     * Administrator should not fail back shares unilaterally
       * Failing over shares causes a brief loss of connection
       * Better to let the user choose the least disruptive time
 * Permanent outages
   * Sometimes outages are so long that it makes more sense to pick a new replication site instead of reconstructing the primary
     * Destruction of the building due to fire/flood/tornado/meteor
   * Admin/user does a failover to the secondary; the share goes to Out Of Sync
   * Administrator changes the list of replication relationships in manila.conf, restarts manila-share, and invokes the update_replication API
   * Shares move to the Resyncing state (update replication is like resync++)
   * Eventually the share becomes In Sync again
   * If the current location is not the new primary location, the user may fail over to the new primary
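Since the proposal adds no Manila-wide config flags and leaves replication wiring to each driver, a driver-specific configuration could look something like this. The option name `replication_targets` is purely hypothetical; the proposal only says a driver "could" use a single option listing peer backends:

```ini
# manila.conf -- hypothetical driver-specific replication wiring.
# Option and driver names are illustrative only; each driver defines its own.
[backend_site_a]
share_driver = manila.share.drivers.somevendor.SomeVendorDriver
share_backend_name = site_a
# List of other same-vendor backends this one can replicate to:
replication_targets = backend_site_b

[backend_site_b]
share_driver = manila.share.drivers.somevendor.SomeVendorDriver
share_backend_name = site_b
replication_targets = backend_site_a
```

In this sketch the two backends name each other, which would correspond to the symmetric (active/active) layout described above; an asymmetric layout would list targets in one direction only.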

Driver Maintainers / Vendor Concerns

 * Replication is not a required feature
   * It only has to work if the backend advertises the "replication" capability
 * Still only 1 database row / 1 UUID per share
 * Only 3 new DB fields
   * replicated=true/false
   * replication_state=In Sync/Out Of Sync/Resyncing/Failing Over
   * primary_location=true/false
 * Drivers should store any needed information about share replication using the driver private data feature
 * Drivers have 3 new methods
   * failover_share
     * Called after the manager deletes the existing export_locations
     * Manager sets the share state to Failing Over before invoking this method
     * Driver should do whatever is necessary to make the secondary accessible
       * The primary may still be accessible, or it may not
       * Failover is expected to succeed in both cases
     * Driver should return the new export location in a model update
     * Driver MAY update the share's host field, if a different backend should own the share after the failover
     * Driver MAY reinitialize replication in the reverse direction immediately if the primary is accessible
     * Driver should update the replication state to In Sync or Out Of Sync using a model update
       * In Sync indicates the failover was successful and replication was reestablished in the reverse direction
       * Out Of Sync indicates the failover was successful but replication was NOT reestablished
     * On failure, the share goes into the ERROR state
   * resync_share
     * Manager sets the share state to Resyncing before invoking this method
     * Driver should attempt to establish replication again
     * Driver should update the replication state to In Sync or Out Of Sync using a model update
       * In Sync indicates the resync was successful
       * Out Of Sync indicates the resync failed
   * update_share_replication
     * Admin-only API
     * Informs the driver that the topology has changed, and that obsolete relationships should be cleaned up and new ones created
     * Driver should set a new primary_location if the old primary_location isn't part of the replication relationship anymore
       * primary_location should only change when the replication topology changes
       * Open question: how does the driver know which location to make the primary?
     * Also does everything else that resync does
 * Changes to existing methods
   * create_share
     * Manager will set replicated=true on the share if the share type has that extra spec
     * Driver should set up replication as needed and should set the replication state to In Sync in the model update
   * ensure_share
     * This method is called for each share on driver startup
     * In addition to other cleanup, shares with a replication state of Resyncing or Failing Over should be set to a stable state
 * Drivers have a lot of flexibility
   * Alternative topologies
     * Replicate to more than 1 other site
     * Fan-out replication or replication chains
   * Secondary backends
     * Two Manila backends can replicate to each other
     * One Manila backend can manage two controllers
     * A backend could have a list of possible replication destinations and choose one (but with no involvement from the Manila scheduler)
   * Synchronous/asynchronous RPO/RTO times
   * All options about different types of replication can be (driver-specific) backend capabilities
   * Drivers can be aggressive or lazy about repairing broken replication relationships
 * The point of all this flexibility is to enable a wide variety of technologies to fit the design
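The three new driver entry points could look roughly like this. The method names come from the notes above, but the signatures and the shape of the model-update dict are assumptions made for illustration:

```python
# Rough sketch of the proposed driver interface. Method names are from
# the notes; signatures and the model-update dict shape are assumptions.

class ReplicationDriverMixin(object):
    def failover_share(self, context, share):
        """Make a secondary location the accessible one.

        Called after the manager has removed the export locations and
        set the replication state to Failing Over. Should return a
        model update containing the new export location, the resulting
        replication state ('in_sync' or 'out_of_sync'), and optionally
        a new host if a different backend now owns the share.
        """
        raise NotImplementedError()

    def resync_share(self, context, share):
        """Attempt to reestablish replication for an Out Of Sync share."""
        raise NotImplementedError()

    def update_share_replication(self, context, share):
        """Clean up obsolete replication relationships, create new ones."""
        raise NotImplementedError()


class FakeReplicationDriver(ReplicationDriverMixin):
    """Trivial driver showing the assumed model-update shape."""

    def failover_share(self, context, share):
        # A real driver would activate the secondary here; this fake
        # just reports a failover where the old primary was unreachable.
        return {
            'export_locations': ['10.0.1.5:/shares/%s' % share['id']],
            'replication_state': 'out_of_sync',
            'primary_location': False,
        }
```

A fake driver along these lines is also roughly what the generic driver enhancement and gate testing mentioned above would need as a starting point.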

API

 * Three new REST APIs
   * User facing: failover, resync
   * Admin facing: update replication
   * Validate share state and invoke manager RPC
 * Create
   * Set the replicated state depending on the share_type
 * Additional fields for share views
   * replicated=true/false
   * replication_state
   * primary_location=true/false

Scheduler

 * No changes
 * Existing extra_specs/capabilities logic ensures that appropriate backends are chosen for replicated shares

Share Manager

 * Add RPCs for failover/resync/update_replication
 * Implement appropriate replication state changes before calling driver methods
 * Clear export_locations before failover
 * Add new driver entry points
 * Validate model updates regarding replication states
 * Make sure that changing a share's host field doesn't cause problems
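The manager-side failover sequence described above (clear exports, set the transitional state, call the driver, validate the model update) can be sketched as follows. All names are illustrative; `db` and `driver` stand in for the real manager collaborators:

```python
# Sketch of the share manager's failover RPC handler, following the
# sequence in the notes. Names and interfaces are assumptions.

VALID_END_STATES = ('in_sync', 'out_of_sync')


def failover_share(driver, db, context, share):
    # 1. Remove export locations so clients can't keep writing to the
    #    old primary (users are expected to have unmounted already).
    db.export_locations_delete(context, share['id'])
    # 2. Mark the share as transitioning.
    share['replication_state'] = 'failing_over'
    try:
        # 3. Driver makes a secondary accessible and reports the result.
        update = driver.failover_share(context, share)
    except Exception:
        # On failure the share goes into the ERROR state.
        share['status'] = 'error'
        raise
    # 4. Validate the model update: failover must end in a stable state.
    if update.get('replication_state') not in VALID_END_STATES:
        raise ValueError('driver returned an invalid replication state')
    share.update(update)
    return share
```

The validation step in (4) is where the manager would enforce that drivers only report the two stable end states, which is the "validate model updates regarding replication states" item above.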