Manila/Replication Use Cases

Answered Questions

Q: From where to where do we allow replication? Is it intra-cloud or inter-cloud? Do we allow replication to something that's not managed by Manila?

A: Intra-cloud. Replicating to something outside of Manila allows a bit more freedom, but with significantly less value, because there's practically nothing we can do to automate the failover/failback portion of a disaster. For use cases involving replication outside of Manila, we would need to involve other tools with more breadth/scope to manage the process.

Q: Do we support unplanned failovers? Do we support planned failovers? Are failovers disruptive or not?

A: Failovers can be planned or unplanned, but they are always disruptive at the data level. With application integration, they could be made nondisruptive at the application level, but unfortunately we've chosen not to use any intermediary technology (like VirtFS) in the data path, which means we have no options for nondisruptive failovers.

Q: Who configures the replication? The admin? The end user? The Manila scheduler?

A: The end user. In the original design we presumed that the actual replication relationships should be hidden from the end user, but this doesn't match well with the concept of AZs that we are adding to Manila. If the users need to have control over which AZ the primary copy of their data lives in, then they also need to control where the other copies live. This means that the administrator's job is to ensure that for any share type that is replicated, it can be replicated from any AZ to any other AZ.
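The "any AZ to any other AZ" requirement can be checked mechanically. The following is a minimal sketch, not Manila code: the function name `missing_replication_paths` and the representation of supported paths as a set of (source, destination) AZ pairs are assumptions made for illustration.

```python
from itertools import permutations


def missing_replication_paths(azs, supported_paths):
    """Return the ordered (source, destination) AZ pairs that lack a path.

    `supported_paths` is a hypothetical set of (source_az, destination_az)
    tuples describing where a replicated share type can currently replicate.
    """
    return [pair for pair in permutations(azs, 2) if pair not in supported_paths]


azs = ["az1", "az2", "az3"]
supported = {
    ("az1", "az2"), ("az2", "az1"),
    ("az1", "az3"), ("az3", "az1"),
    ("az2", "az3"),
}
# An empty result would mean the share type satisfies the requirement.
print(missing_replication_paths(azs, supported))
```

In this example the only gap is replication from az3 to az2, which the administrator would need to close before offering the share type as replicated.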

Q: Who triggers a failover? Is it a manual button the admin presses? Can Manila failover automatically? If so, when? Can the end users control failovers at all?

A: Failovers are manual, triggered by either an administrator or a user. Generally speaking, it's more appropriate for the administrator to initiate a failover because the administrator has more knowledge about the nature of an outage. However, it's also essential for end users to test failovers, so they need the capability to initiate failovers of their own shares.

Q: During replication (before failover) is the secondary even visible/accessible?

A: Yes, but possibly with significant limitations. Some backends may not support accessing the secondary side of a replicated share. Some backends may allow access, but read-only. We know of at least one backend that can support write access to the secondary (in which case calling it a secondary isn't really accurate because it's more of an active-active relationship). Amazon's EFS has the model of active-active replication so it's something we don't want to disallow.
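The three cases above (no access, read-only, and writable secondaries) could be modeled as a per-backend capability. This is only a sketch under assumed names: `SecondaryAccess` and `secondary_allows` are hypothetical and do not correspond to an existing Manila API.

```python
from enum import Enum


class SecondaryAccess(Enum):
    """Hypothetical capability a backend reports for its secondary copies."""
    NONE = "none"              # secondary cannot be accessed before failover
    READ_ONLY = "read-only"    # secondary can be mounted, but not written
    READ_WRITE = "read-write"  # effectively an active-active relationship


def secondary_allows(access, write):
    """Return True if a client may access the secondary in the given mode."""
    if access is SecondaryAccess.NONE:
        return False
    if write:
        return access is SecondaryAccess.READ_WRITE
    return True
```

Keeping the writable case as a first-class value, rather than treating it as an error, leaves room for active-active backends.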

Q: What is the granularity of the failover? Whole backend? Single pools?

A: Individual shares. There's no technical reason to prevent failover/failback on a share-by-share basis. To make the administrator's life easier, we also have to support whole-backend failover (which could be essential to minimize downtime in an actual disaster). The ability to do single-share failover is nice because failovers are disruptive, and it allows testing of the DR system without triggering an outage that affects other users.
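One way to picture the two granularities is to treat whole-backend failover as a loop over single-share failovers. The sketch below assumes a simplified model in which each share has exactly one primary and one secondary backend; the `Share` class and both function names are hypothetical, not part of Manila.

```python
from dataclasses import dataclass


@dataclass
class Share:
    name: str
    primary: str    # backend currently serving the share
    secondary: str  # backend holding the replica


def fail_over_share(share):
    """Fail over a single share by swapping its primary and secondary."""
    share.primary, share.secondary = share.secondary, share.primary


def fail_over_backend(shares, backend):
    """Fail over every share whose primary lives on `backend`.

    Returns the names of the shares that were failed over.
    """
    affected = [s for s in shares if s.primary == backend]
    for share in affected:
        fail_over_share(share)
    return [s.name for s in affected]
```

Shares that only keep their secondary on the failed backend are left untouched here; how to treat them is outside the scope of this sketch.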

Unanswered Questions

There are some major unanswered questions (or areas of investigation).

1) Is there no way to achieve nondisruptive failover? I would love to find out that our initial intuition here is wrong, because it would change a lot of aspects of the design. It's worth spending time to brainstorm and research possibilities in this area. So far the most promising ideas involve:

Using VirtFS to mediate filesystem access and achieving nondisruptive failover that way
Using some kind of agent inside the guests to mediate file access

2) How do we deal with recovery after a disaster and failover? Assuming a successful failover, and a repair of the original primary, failing back will cause another outage. How can we orchestrate that to minimize pain and suffering?

3) Assuming the disruptive aspect of failovers is unavoidable, how can we invest to make them less painful at the application level? Application quiescing and mount automation could make failovers nondisruptive for at least a select few applications.