
Manila/Replication Design Notes

Warning: Old Design Page

This page was used to help design a feature for a previous release of OpenStack. It may or may not have been implemented. As a result, this page is unlikely to be updated and could contain outdated information. It was last updated on 2015-10-21.

Current Design page is at Manila/design/manila-mitaka-data-replication

Intro

The design for replication isn't complete. We have a vision for the feature, and we're trying to define the details. After collecting feedback we will reformat this into a design doc.

Threats to data

  • Hardware failures
  • Network failures
  • Power failures
  • Natural disasters (fire, flood, hurricanes, meteors)
  • Accidental corruption (bugs, human error)
  • Malicious users (viruses, hackers, disgruntled employees)


Solution for protecting data

  • Highly available storage systems
    • Strategies
      • RAID/Erasure coding (protection from media failures)
      • Clusters (protection from component failures)
      • Multipath network topologies (protection from connection failures)
      • Redundant power (protection from power failures)
    • Advantages
      • Transparent to clients
      • Zero RPO/RTO (except for maybe a brief pause)
    • Disadvantages
      • Typically limited distance (weak against site-wide failures)
    • Manila
      • HA storage solutions fit into Manila today without changes
      • Use "share types" to indicate certain storage backends are highly available
  • Backups
    • Strategies
      • Tape archive (in the old days)
      • Virtual tape archive (Amazon Glacier or similar)
      • Local snapshots (standard Cinder/Manila features)
      • Remote snapshots (copy snapshot to object store, like Cinder backup)
    • Advantages
      • Can be very cheap
      • Stores multiple points in time (protection from corruption/malicious destruction)
    • Disadvantages
      • RPO typically high
      • Local snapshots don't protect against equipment/site failures
      • Remote snapshots typically have to be restored before they become accessible -- high RTO
    • Manila
      • Local snapshots implemented today
      • Remote snapshots (aka "backup") planned for future
  • Replication
    • Strategies
      • Synchronous mirroring
      • Asynchronous mirroring
    • Advantages
      • Can handle much longer distances
      • Can offer very low RPO/RTO
    • Disadvantages
      • Not transparent to network clients
    • Manila
      • This is what we're proposing!!!


Overview of the proposal

  • Start from user experience
    • If it doesn't address a user's problem, then the rest of the design is pointless
  • Also think about administrator's needs and responsibilities
  • Consider vendor/driver authors' concerns and practical issues
    • Design is intentionally open-ended to make it as easy as possible for vendors to implement


User Experience

  • Users will be able to create "replicated" shares and non-replicated shares by specifying a share type
    • All existing shares are non-replicated
    • Administrator must specifically create share types that include replication extra_spec
    • Open question -- should the "replication" extra spec be visible to tenants?
      • We could do this similarly to how driver_handles_share_servers is visible to tenants
      • Alternatively: rely on administrator to communicate which types are replicated and rely on the "replicated" attribute appearing on the shares after they're created
    • Open question -- what should the "replication" extra spec be called?
      • Vendors should be free to offer additional capabilities for different types of replication
      • There must be a standard capability/extra_spec that controls the Manila replication feature though
  • Replicated shares will have a replicated=true flag returned by the API
  • Replicated shares will also have a replication_state field
    • In Sync - stable state - share data is being replicated to 1 or more secondary controllers
    • Out Of Sync - stable state - share data is NOT being replicated
    • Resyncing - transitional state - backend is trying to reestablish replication
    • Failing Over - transitional state - share is changing to a different primary
  • Two new tenant-visible APIs
    • Failover
      • Can only be called on shares in the In Sync state
      • Causes existing export locations to be removed (must unmount first to avoid data loss)
      • Causes share to go into Failing Over state
      • Causes new export location to appear (presumably on a different storage controller in another location)
      • Expected to succeed whether storage controller hosting the share is online or not
      • After a successful failover the share may be in 2 states:
        • In sync -- if the primary storage controller was online and the backend was able to reverse the replication from the secondary
        • Out of sync -- maybe the primary storage controller was offline, or the backend wasn't able to immediately establish replication again from the secondary
    • Resync
      • Can only be called on shares in the Out Of Sync state
      • Causes shares to go into Resyncing state
      • Causes backend to attempt to reestablish replication (if possible)
      • On success, share goes to In Sync state
      • On failure, share goes back into Out Of Sync state
        • This would be expected as long as the primary remains down
  • Replicated shares will have a primary_location=True/False flag
    • Indicates if the share is being served by the original (primary) storage controller
    • After failing over, this field would be set to False to indicate that the share is being served by a secondary storage controller
      • Secondary locations may not have all of the capabilities of the primary
      • For example, the share_type may specify SSD disks extra_spec, but the secondary storage controller may have spinning disks
        • It is up to the administrator to configure this as desired
        • Manila doesn't schedule the secondary location, so this should be okay
    • If the share is not being served by the primary storage controller, a failover should always attempt to move it back to the primary, if possible
      • This proposal allows replication to more than 1 place (at the administrator's option, if the backend allows it)
      • Users aren't aware of how many replication locations there are or which one their share is at -- they only know if it's at the primary or not
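The states and API transitions described above can be sketched as a small state machine. This is an illustration only, not Manila code; the state names mirror the ones in these notes.

```python
# Hypothetical sketch of the replication_state machine described above.
# Two stable states, two transitional states, and the two tenant-facing
# API calls that start a transition.
IN_SYNC = 'in_sync'
OUT_OF_SYNC = 'out_of_sync'
RESYNCING = 'resyncing'
FAILING_OVER = 'failing_over'

# API call -> (required starting state, transitional state it moves to)
TRANSITIONS = {
    'failover': (IN_SYNC, FAILING_OVER),
    'resync': (OUT_OF_SYNC, RESYNCING),
}


def start_transition(api_call, current_state):
    """Validate an API call against the share's replication_state.

    Returns the transitional state to set, or raises ValueError if the
    call is not allowed from the current state.
    """
    required, transitional = TRANSITIONS[api_call]
    if current_state != required:
        raise ValueError('%s requires state %s, but share is %s'
                         % (api_call, required, current_state))
    return transitional
```

For example, `start_transition('failover', IN_SYNC)` yields the Failing Over state, while calling failover on an Out Of Sync share is rejected, matching the rules above.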


Administrator Experience

  • Administrator's job today
    • Install/configure hardware
    • Understand physical layout of infrastructure
    • Understand network connections and logical topology of infrastructure
    • Think about failure domains and contingencies in case of failures
      • Today if a storage controller hosting Manila fails, there's not much an admin can do other than try to get it back online
    • Configure Manila
      • Setup storage controllers
      • Install software
      • Configure backends in manila.conf (typically hostnames, logins, passwords, etc)
  • Administrator's new responsibilities with Manila DR
    • Choose primary/secondary sites for replication
      • Could be between racks, between aisles, between floors, between buildings, between cities, or between continents
    • Decide whether to do symmetric (active/active) or asymmetric (active/passive) replication
      • Individual shares always have a primary (accessible) and secondary (inaccessible) location
      • Active/active refers to having 2 controllers where some primaries are on each one and they replicate to each other
      • Active/passive refers to having all of the primaries on one controller and all of the secondaries on the other
    • Find a driver that supports replication
      • It is very important for generic driver to support replication
        • We want to offer this functionality to everyone
        • It's needed for the gate to be able to test this feature
        • Looking for volunteers to help with the generic driver enhancement
    • Setup hardware with sufficient bandwidth to accommodate mirroring
    • Configure Manila
      • No new config flags for replication
      • Each driver can decide how replication relationships should be expressed
        • Assume that replication will most likely be between same-vendor backends
        • Could be as simple as 1 new config option with a list of names of other backends that can be replicated to
      • It would be a really good idea to have an HA configuration of Manila in the case that a site failure could affect controller nodes
    • Respond to outages
      • Administrators typically have significantly more information than tenants about the actual infrastructure
      • Administrators should communicate with their tenants in the event of an outage
      • If the administrator decides that failover is appropriate given the nature of the outage, they can/should initiate it
        • Sometimes an outage may be brief enough that waiting for the primary to come back is better than failing over
          • This is one reason we don't propose automated failover
        • Open question: how can we optimize failing over a large number of shares?
        • Users can initiate a failover on their own, but we believe that would only be wise for testing purposes
    • Fix outages and recover
      • At the end of an outage, administrator should Resync all Out Of Sync shares
        • Open question: how can we optimize resyncing a large number of shares?
      • Users should be notified that the outage has ended and it is safe to fail back to the primary
      • Administrator should not fail back shares unilaterally
        • Failing over shares causes a brief loss of connection
        • Better to let the user choose the least disruptive time
    • Permanent outages
      • Sometimes outages are so long that it makes more sense to pick a new replication site instead of reconstructing the primary
        • Destruction of the building due to fire/flood/tornado/meteor
      • Admin/user does a failover to secondary, share goes to Out Of Sync
      • Administrator changes the list of replication relationships in manila.conf and restarts manila-share, invokes update_replication API
      • Shares move to resyncing state (update replication is like resync++)
      • Eventually share becomes In Sync again
      • If the current location is not the new primary location, the user may failover to the new primary
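Since each driver defines its own replication options, the "replication relationships in manila.conf" mentioned above could look something like the fragment below. The option name replication_targets, the backend names, and the driver path are purely hypothetical illustrations.

```ini
# Hypothetical manila.conf fragment: two same-vendor backends configured
# to replicate to each other (active/active). The replication_targets
# option name is an illustration only -- each driver chooses its own.
[backend_site_a]
share_driver = manila.share.drivers.example.ExampleDriver
replication_targets = backend_site_b

[backend_site_b]
share_driver = manila.share.drivers.example.ExampleDriver
replication_targets = backend_site_a
```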


Driver Maintainers / Vendor Concerns

  • Replication is not a required feature
    • It only has to work if the backend advertises the "replication" capability
  • Still only 1 database row/1 UUID per share
  • Only 3 new DB fields
    • Replicated=true/false
    • Replication state=In Sync/Out Of Sync/Resyncing/Failing Over
    • Primary_location=true/false
  • Drivers should store any needed information about share replication using driver private data feature
  • Drivers have 3 new methods
    • failover_share
      • Called after the manager deletes the existing export_locations
      • Manager sets the share state to Failing Over before invoking this method
      • Driver should do whatever is necessary to make the secondary accessible
        • The primary may still be accessible, or it may not
        • Failover is expected to succeed in both cases
      • Driver should return new export location in a model update
      • Driver MAY update the share's host field, if a different backend should own the share after the failover
      • Driver MAY reinitialize replication in the reverse direction immediately if the primary is accessible
      • Driver should update replication state to In Sync or Out Of Sync using a model update
        • In Sync indicates the failover was successful and replication was reestablished in the reverse direction
        • Out Of Sync indicates the failover was successful but replication was NOT reestablished
      • On failure, the share goes into ERROR state
    • resync_share
      • Manager sets the share state to Resyncing before invoking this method
      • Driver should attempt to establish replication again
      • Driver should update replication state to In Sync or Out Of Sync using a model update
        • In Sync indicates the resync was successful
        • Out Of Sync indicates the resync failed
    • update_share_replication
      • Admin only API
      • Informs driver that the topology has changed, and obsolete relationships should be cleaned up and new ones created
      • Driver should set a new primary_location if the old primary_location isn't part of the replication relationship anymore
        • Primary_location should only change when the replication topology changes
        • Open question: how does the driver know which location to make the primary?
      • Also does everything else that resync does
  • Changes to existing methods
    • create_share
      • Manager will set replication=true on share if share type has that extra spec
      • Driver should setup replication as needed and should set the replication state to In Sync in the model update
    • ensure_share
      • This method is called for each share on driver startup
      • In addition to other cleanup, shares with a replication state of Resyncing or Failing Over should be set to a stable state
  • Drivers have a lot of flexibility
    • Alternative topologies
      • Replicate to more than 1 other site
      • Fan-out replication or replication chains
    • Secondary backends
      • Two Manila backends can replicate to each other
      • One Manila backend can manage two controllers
      • A backend could have a list of possible replication destinations and choose one (but no involvement from Manila scheduler)
    • Synchronous/asynchronous RPO/RTO times
      • All options about different types of replication can be (driver-specific) backend capabilities
    • Driver can be aggressive or lazy about repairing broken replication relationships
    • The point of all this flexibility is to enable a wide variety of technologies to fit the design
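The three new entry points described above might be expressed as an optional mixin on a driver. This is a sketch, not the Manila driver interface; the method names come from these notes, but the signatures and docstring details are assumptions.

```python
# Hypothetical sketch of the three new driver entry points. A driver only
# needs these if it advertises the 'replication' capability.

class ReplicationDriverMixin(object):
    """Optional replication support for a share driver."""

    def failover_share(self, context, share, share_server=None):
        """Make the secondary copy of the share accessible.

        Called after the manager has removed the old export locations
        and set replication_state to Failing Over. Must succeed whether
        or not the primary controller is reachable. Returns a model
        update containing the new export locations and the resulting
        replication state: In Sync if replication was reestablished in
        the reverse direction, otherwise Out Of Sync.
        """
        raise NotImplementedError()

    def resync_share(self, context, share, share_server=None):
        """Attempt to reestablish replication.

        Called with replication_state set to Resyncing. Returns a model
        update with In Sync on success or Out Of Sync on failure.
        """
        raise NotImplementedError()

    def update_share_replication(self, context, share, share_server=None):
        """Handle a topology change (admin-only API).

        Clean up obsolete replication relationships, create new ones,
        pick a new primary_location if needed, then do everything
        resync_share does.
        """
        raise NotImplementedError()
```

Drivers that do not advertise the replication capability simply never have these methods invoked.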

API

  • Three new REST APIs
    • User facing: failover, resync
    • Admin facing: update replication
    • Validate share state and invoke manager RPC
  • Create
    • Set the replicated state depending on the share_type
  • Additional fields for share views
    • Replicated=true/false
    • Replication state
    • Primary_location=true/false


Scheduler

  • No changes
  • Existing extra_specs/capabilities logic ensures that appropriate backends are chosen for replicated shares
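The reason no scheduler changes are needed is that the existing extra_specs/capabilities filter already keeps replicated shares on replication-capable backends. A much-simplified sketch of that matching, using the (still open-question) "replication" key as an assumed example:

```python
# Simplified sketch of extra_specs/capabilities matching. The real
# scheduler's capability filter has richer matching rules; this only
# illustrates why replicated shares land on replication-capable backends.

def backend_passes(share_type_extra_specs, backend_capabilities):
    """Return True if every extra_spec is satisfied by the backend."""
    for key, required in share_type_extra_specs.items():
        if backend_capabilities.get(key) != required:
            return False
    return True
```

A share type carrying `{'replication': True}` would only match backends reporting that capability, so non-replicating backends are filtered out with no new scheduler logic.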


Share Manager

  • Add RPCs for failover/resync/update_replication
  • Implement appropriate replication state changes before calling driver methods
  • Clear export_locations before failover
  • Add new driver entry points
  • Validate model updates regarding replication states
  • Make sure that changing a share's host field doesn't cause problems
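The manager-side failover flow listed above can be sketched as follows. This is an illustration under assumed names: the database helpers, the driver object, and the function shape are stand-ins (in the real manager they would hang off `self`), not Manila's actual interfaces.

```python
# Hypothetical sketch of the failover RPC handler in the share manager,
# following the steps in these notes: set the transitional state, clear
# export locations, call the driver, then validate and apply its update.

def failover_share(db, driver, context, share_id):
    share = db.share_get(context, share_id)
    # 1. Mark the share as transitioning before touching the backend.
    db.share_update(context, share_id,
                    {'replication_state': 'failing_over'})
    # 2. Remove old export locations so clients must unmount/remount
    #    and cannot keep writing to the old primary.
    db.share_export_locations_update(context, share_id, [])
    try:
        # 3. Driver makes the secondary accessible; expected to succeed
        #    whether or not the primary controller is online.
        model_update = driver.failover_share(context, share)
    except Exception:
        db.share_update(context, share_id, {'status': 'error'})
        raise
    # 4. Validate the replication state the driver reported, then persist
    #    the update (new export locations, possibly a new host field).
    if model_update['replication_state'] not in ('in_sync', 'out_of_sync'):
        raise ValueError('driver returned invalid replication state')
    db.share_update(context, share_id, model_update)
```

Resync would follow the same pattern with the Resyncing state and without clearing export locations.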