CinderWallabyPTGSummary
Revision as of 20:06, 3 November 2020


Introduction

This page contains a summary of the subjects covered during the Cinder project sessions at the Wallaby PTG, held virtually October 26-30, 2020. The Cinder project team met from Tuesday 27 October to Friday 30 October, for 3 hours (1300-1600 UTC) with an extra hour scheduled for "hallway time", though Wednesday and Thursday's hallway time was spent in cross-project meetings with the Glance team.

The Cinder Team at the Wallaby (Virtual) PTG, October 2020.


This document aims to give a summary of each session. More context is available on the cinder PTG etherpad:


The sessions were recorded, so to get all the details of any discussion, you can watch/listen to the recording. Links to the recordings are located at appropriate places below.

Tuesday 27 October

recordings

Summit Follow-up

Divisive language in code/docs/infrastructure

Basic idea (from [0]): "The OpenInfra Foundation (OIF) Board of Directors supports removal of wording identified as oppressive, racist and sexist by members of our communities from the software and documentation they produce. While we know there will be challenges, both technical and non-technical, this is an action we feel is important."

The working group driving this effort had a Forum session last week, and met at the PTG yesterday:


Rosmaita reported on a quick grep of the cinder codebase for the obvious offensive usages of master/slave and blacklist/whitelist

  • in the main cinder code, 'slave' shows up as a parameter to some oslo.db library calls, so we can't change this until they fix it
  • there is usage of master/slave in some of the drivers
  • also, blacklist/whitelist occurs in some drivers


Amy Marrich (spotz), who's on the working group driving the effort, attended the session and emphasized some points to the team:

  • the working group is proposing "The Stance" to the TC. When approved, it will give the individual project teams some guidance
  • we're not the only open source community doing this; there's a list on the etherpad of other open source projects and what changes they've made/are making
  • don't need to go back and change everything in our history, but projects should follow the guidance in new code/docs
  • projects can make the documentation more inclusive regardless of what happens in the code
  • the idea is to leave it to the projects to make appropriate changes; no plans to introduce churn by generated patches changing 'master' to something else for all projects
  • waiting to see what upstream git does about 'master' being the default trunk branch name. Github has changed it to 'main' for new projects created on github

The current Cinder PTL is personally on board with this effort because it's being done in a thoughtful way and all changes will go through the normal code review process [1]. Additionally, some employers (for example, Red Hat [2]) are encouraging this effort in open source projects in which they are involved.

[0] https://etherpad.opendev.org/p/vSummit2020__DivisiveLanguage
[1] http://lists.openstack.org/pipermail/openstack-discuss/2020-October/018181.html
[2] https://www.redhat.com/en/blog/making-open-source-more-inclusive-eradicating-problematic-language

Conclusions
  • My sense of the cinder PTG session was that the cinder team is generally supportive of the effort (or, at least not opposed to it). What this means in practice is not allowing non-inclusive terms into new docs/code
  • Changes to existing code will be fought out in Gerrit as usual
  • ACTION - rosmaita - update the cinder code review guide to mention this once the TC proposal has been accepted

The openstackclient situation

This was discussed at the Forum last week: https://etherpad.opendev.org/p/vSummit2020_FeatureGap

We seem to still be in the same situation, namely:

  • osc needs better support for microversions
  • osc is still too big (it requires many, many dependencies)


Some operators, at least, would really like to only have one client to work with, but first there needs to be feature parity between the OSC CLI and the individual project CLIs.

Conclusions
  • The key thing from my point of view is that OSC-everywhere is not going to be a community goal this cycle
  • Cinder team members interested in this effort should contact Stephen Finucane (nova core, core on some other projects) to help make this happen
  • ACTION - rosmaita - this came up later in the PTG, but we need to make sure that the team working on the openstackclient is aware that python-brick-cinderclient-ext support in OSC is an important issue for cinder

Community Goal for Wallaby

As of right now, only one goal has been accepted (though it looks like there will be another, see the next section), namely, "Migrate from oslo.rootwrap to oslo.privsep": https://review.opendev.org/#/c/755590/

Cinder is on the list of projects still using rootwrap, but I took a look and think we only use rootwrap to escalate privileges to start the privsep daemon (and there doesn't seem to be another way to do that). As part of the goal, there will be a hacking check introduced to make sure people aren't using rootwrap.

The spirit of the goal (though it's not required) is to make sure that projects are using good privsep rules, not wide-open rules. It would be good to use python commands for what we need instead of just calling a CLI to accomplish them. So we may want to invest some time into making our current privsep calls use python code.
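As a sketch of the difference, here is a hypothetical helper (the function name and the 'dd' comparison are illustrative, not actual cinder code) that does its work with native Python I/O rather than shelling out to a CLI tool:

```python
import os
import tempfile

def zero_volume(path, length, chunk=1024 * 1024):
    """Overwrite 'length' bytes at 'path' with zeros.

    Inside a privsep entrypoint, this kind of native Python I/O is
    preferable to executing a CLI tool like 'dd' under a wide-open
    privsep rule.
    """
    with open(path, 'r+b') as f:
        buf = b'\0' * chunk
        remaining = length
        while remaining > 0:
            n = min(remaining, chunk)
            f.write(buf[:n])
            remaining -= n

# Demonstrate on a scratch file standing in for a volume device.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'x' * 100)
zero_volume(path, 100)
with open(path, 'rb') as f:
    data = f.read()
os.unlink(path)
# data is now 100 zero bytes
```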

Conclusions
  • Looks like cinder can meet the "letter" of the goal easily
  • Anyone on the team interested in security ... would be good to look into satisfying the "spirit" of the goal because that would be good for the project

"consistent and secure policies" update

There's a "Policy Popup Group" that's working on this. They held a Forum session and met at the PTG.


There's pretty much general consensus that this work needs to be completed to finally close Bug #968696 ("admin anywhere is admin everywhere"), a bug so old that it only has 6 digits (and the one that Adam Young wrote a song about). It's looking like it will happen in two stages:


During Wallaby, in Cinder we need to work on test coverage to define the baseline policy behavior. Rajat has started some of this work, and incorporated the system-scope into the new project-level-default-volume-types API. Lance Bragstad has proposed some patches to help us along: https://review.opendev.org/#/q/topic:secure-rbac+project:openstack/cinder

Conclusions
  • we look OK for W
  • ACTION - rosmaita - put up the policy.json deprecation note when the language is available
  • getting the foundational work done during Wallaby is important for making this goal accomplishable during the X cycle

Removal of nested quota driver

Rajat (whoami-rajat) pointed out that the nested quota driver was deprecated in Train and scheduled to be removed in Ussuri, with an effort instead to support scopes and unified limits. Rajat has a patch up to remove it: https://review.opendev.org/#/c/758913/

There was no objection to the proposed removal; the deprecation period has long passed.

Rajat intends to follow up with some code to fix how we query/set quotas. There is currently no validation in this code, and as a result, an operator can use a project name when setting a quota and the API reports success. But since you should be using a UUID, such a quota is never applied since cinder has no record of a quota being assigned to that project. This is both confusing to operators and annoying once they figure out what happened. Gorka pointed out that he has seen quite a few downstream bugs around this issue.
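A minimal sketch of the kind of validation being talked about here (the function name is hypothetical; the real change in mv 3.62 also checks the project against Keystone):

```python
import uuid

def validate_project_id(value):
    # Reject quota targets that are not UUIDs, so setting a quota
    # against a project *name* fails loudly instead of "succeeding"
    # and then never being applied.
    try:
        uuid.UUID(value)
    except (TypeError, ValueError, AttributeError):
        raise ValueError('%r is not a valid project id' % (value,))
    return value
```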

Conclusions

  • Rajat added project-id validation as part of microversion 3.62 and there were no objections, so it looks like we've agreed that a cinder API call is allowed to make a call to another service (within reason, of course).
  • By default, this is an operator-facing call, so it's not like the API is going to be hammered with these requests.

Cleanup improvements

Gorka (geguileo) pointed out that our resource cleanup on startup (or when we request job cleanups) is quite poor. (We're talking about what to do when the service has gone down and is restarted. This is important for Active/Active configurations.)

For example, on startup, we look at volumes in the "-ing" statuses:

  • if the status is 'creating', we set the volume to error
  • if the status is 'downloading', we set the volume to error
  • if the status is 'uploading', we change the status to 'available' or 'in-use' depending on whether the volume has any attachments
  • if the status is 'deleting', we retry the operation (which is exactly what we should do, so this status is being handled correctly)


Similarly for backups:

  • creating => Detach volume + Restore volume to previous status + Set backup to error
  • restoring => Detach volume + Set volume to error_restoring + Set backup to available
  • deleting => This one is ok, we retry the operation


Gorka's idea is that we have the metadata available for each of these "-ing" statuses, so we could try to handle them more like the deleting case, where we retry the operation. Someone pointed out that for 'creating', it could be possible that the volume was actually created on the backend, but cinder went down before it could find out. But we could handle that by issuing a 'delete' first and then retrying the create. Also, there is some retry mechanism in the scheduler, so we are doing some of this already. It would be good to be a bit more thorough about it.
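The current behavior described above amounts to a status-to-action table; a sketch (names are illustrative, not cinder's internals):

```python
def volume_cleanup_action(status, has_attachments):
    # Current startup handling of transient volume statuses:
    # only 'deleting' is retried; the others are forced to a
    # terminal state. Gorka's proposal is to make more of these
    # behave like 'deleting'.
    if status == 'uploading':
        return 'set_in-use' if has_attachments else 'set_available'
    return {
        'creating': 'set_error',
        'downloading': 'set_error',
        'deleting': 'retry_delete',
    }.get(status, 'ignore')
```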

Conclusions

  • No one thought this requires a spec, so we'll just track with a blueprint.
  • No one had an objection to implementing the above, so Gorka should feel free to move forward with it.

Quotas: "Remove quota usage cache"

We store quota and usage information in the database so that we can respond quickly to various queries, but the DB "cache" drifts out of sync with actual usage and as a result, people get anomalous results and operators have to go in and clear things out by hand. Sam Morrison proposed a spec [0] to address this, but some requested revisions weren't made and it didn't get approved for Victoria. The substantive objection to Sam's proposal at the Victoria PTG concerned the performance impact of dynamically querying the database every time cinder needed the info that would ordinarily be in the cache, and Sam did add some performance info on the spec.

The interest in discussing this at the Wallaby PTG is to determine whether the performance info supplied by Sam is sufficient, or whether we need better performance info, and if so, what exactly we would want, so we could be in a position to move forward with this. Eric pointed out that, given how good modern databases are at things like summing a properly indexed column, the burden of proof should be on anyone who thinks this could have a bad performance impact to explain why.
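Eric's point can be illustrated with a toy schema (sqlite here; table and column names are made up): with an index covering the filter, summing a column on demand is cheap, and there is no cache row to drift out of sync:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE volumes '
             '(project_id TEXT, size INTEGER, deleted INTEGER)')
# The index lets the SUM be answered without scanning unrelated rows.
conn.execute('CREATE INDEX ix_proj ON volumes (project_id, deleted)')
conn.executemany('INSERT INTO volumes VALUES (?, ?, 0)',
                 [('p1', 10), ('p1', 5), ('p2', 7)])
used, = conn.execute(
    'SELECT COALESCE(SUM(size), 0) FROM volumes '
    'WHERE project_id = ? AND deleted = 0', ('p1',)).fetchone()
# used == 15 for project p1
```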

Gorka pointed out that if there is some performance impact, we could balance it out by refactoring the DB code associated with the way we're using oslo versioned objects; there are a lot of places where we query the database several times to get info that we already have from an earlier query.

Eric thinks that although the spec gives you the impression that this is pretty straightforward, it's actually going to have a notable code impact. The question then came up whether the spec needs more details, but Eric pointed out that prototype code will be way more helpful than a really detailed spec. Rosmaita asked whether anyone is currently working on this ... and apparently no one is. But we continue to hope someone will pick it up.

[0] https://review.opendev.org/#/c/730701/

Conclusions

  • ACTION - rosmaita - revise the current spec (or shoot Sam an email asking him to do it; the current objections are formatting problems), targeting it for Wallaby

Robustifying the volume driver interface

What brought this up was a bug caused by a backport that slipped through in which a driver function was removed; this was OK in the most recent branch, because the function had been refactored into a base class, but it was a problem in earlier branches where the refactor hadn't been done, and the removal wasn't caught by any unit tests.

It would help to have a more detailed driver API. Sean pointed out that that was actually the goal of the interface classes defined in [0]. These are used to generate documentation of the API [1] and they are also used by the 'tox -e compliance' testenv which verifies that drivers satisfy the interface.
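A toy version of what the compliance check does (the real interface classes live in cinder/interface; the class and function names below are stand-ins):

```python
import inspect

class VolumeDriverInterface:
    # Stand-in for one of the documented driver interface classes.
    def create_volume(self, volume): ...
    def delete_volume(self, volume): ...

def missing_methods(driver_cls, iface=VolumeDriverInterface):
    # Report interface methods the driver does not implement, the way
    # the 'tox -e compliance' testenv flags non-compliant drivers.
    required = {name for name, _ in
                inspect.getmembers(iface, inspect.isfunction)}
    provided = {name for name, _ in
                inspect.getmembers(driver_cls, inspect.isfunction)}
    return sorted(required - provided)

class IncompleteDriver:
    def create_volume(self, volume): ...

# missing_methods(IncompleteDriver) reports ['delete_volume']
```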

Our current practice is to just make changes to the driver API when we need to, and test the in-tree drivers and fix any breakage. This doesn't help out of tree drivers, but this didn't seem to be a major area of concern among those present.

Some of the current driver API is unclear with respect to object vs. dictionary for parameters and returned values. Eric's work on adding type checking should eventually improve this situation.

A complicating factor is that we still have all the Abstract Base Class code in place. It's a problem because new driver writers can use the ABC classes or just the base driver class when they implement a driver. Eric is pretty sure the type-checking code will get us many of the benefits that were supposed to come from the ABC code.

[0] https://opendev.org/openstack/cinder/src/branch/master/cinder/interface
[1] https://docs.openstack.org/cinder/latest/contributor/drivers.html#base-driver-interface

Conclusions

  • The interface classes that Sean put together will be considered the source of truth for the driver interface
  • ACTION - rosmaita - review the interface classes and their docstrings and add type info to the docstrings where appropriate; update the documentation in cinder/volume/driver.py to make sure people look in the right place for type information
  • Eventually, the ABC code should be removed

Vocabulary Minute

There are some ambiguous terms that we need to be careful about in order to avoid confusions in specs, code, and policies.

  • 'user'
    • could be an end user or an operator or both; be clear about what you are talking about
  • 'admin'
    • this is complicated by the new scoping added to Keystone. There are now different kinds of 'admins'.
      • 'system admin' ... an administrator with system scope. (This is the new name for what we used to just call the 'cinder admin'.)
      • 'project admin' ... a user who has some kind of administrative powers over a particular project (what we used to call a tenant), but they don't extend beyond that. An example would be in the project-level default-volume-type API (mv 3.62), the project admin could be allowed to set the default volume-type for the project (tenant), but a "regular" user in that project would not be allowed; the project admin would not be able to set the default volume-type for a different project (whereas a system admin would).
      • 'domain admin' ... basically, can do project-admin stuff on all the projects grouped together into a domain (so more than a simple project-admin, but way less than a system admin)
    • at this point, the plumbing is in place in Keystone for these different kinds of admins, but policy changes have to happen on the Cinder side to make them effective (see the "consistent and secure policies" update section above).
    • there's a primer in the keystone docs that explains each role/scope permutation and how they should be expected to behave: https://docs.openstack.org/keystone/latest/admin/service-api-protection.html
  • 'migrate'
    • it's a name for a cinder-specific operation; maybe we should refer to it as 'move volume', which can be done with retype or migrate operations
  • 'backend'
    • we use this all over the place, and it's even in our logo ... but do we mean:
      • storage array
      • cinder volume service
      • backend section in the host/cluster DB field and message queues?
      • volume_backend_name
    • even the spelling has been questioned: some tech writers advise "back end" (two words) when used as a noun, "backend" (one word) when used as an adjective
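For illustration, the admin variants above roughly correspond to oslo.policy-style check strings like the following (these are generic secure-RBAC patterns, not cinder's actual policy defaults):

```python
# 'system admin': an admin operating with system scope.
SYSTEM_ADMIN = 'role:admin and system_scope:all'
# 'project admin': an admin whose powers stop at one project.
PROJECT_ADMIN = 'role:admin and project_id:%(project_id)s'
```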

Conclusions

  • When writing specs, docs, and code, we need to say exactly what is meant when we use the word
  • Reviewers should be aware of this, too, and push authors for clarification
  • ACTION - geguileo - draft an initial doc for the contributor docs explaining these distinctions


The team then adjourned to meetpad for Happy Hour, and there was much rejoicing.

Wednesday 28 October

recordings

Updates from TC Meeting

Jay (jungleboyj) gave us an update on the TC meeting. It's looking like Wallaby will have two goals: migrating from oslo.rootwrap to oslo.privsep, and migrating the default policy file from JSON to YAML.

For the privsep goal, https://review.opendev.org/#/c/755590/ :

  • it's OK to use rootwrap to escalate privileges to start the privsep process, though hopefully there will be a better way to do this in the future.


The policy file migration goal is proposed by https://review.opendev.org/#/c/759881/ ; it's pretty much what we discussed yesterday (see above).
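Mechanically the migration is small; a minimal sketch of converting a flat policy.json to YAML (the real goal uses oslo.policy's tooling, which also handles defaults and comments properly):

```python
import json

def policy_json_to_yaml(text):
    # Each JSON rule becomes one quoted 'key': 'value' YAML line.
    rules = json.loads(text)
    return '\n'.join('"%s": "%s"' % (k, v) for k, v in rules.items())

print(policy_json_to_yaml('{"volume:create": "rule:admin_or_owner"}'))
```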

With respect to the OpenStack Client situation, the TC is considering a resolution to articulate its "stance" on the issue: https://review.opendev.org/#/c/759904/. As far as cinder itself goes:

  • openstackclient does have support for adding plugins
  • unfortunately, the current volume support in osc doesn't use a plugin, and in fact the command structure is designed in such a way that it would be difficult to write a simple plugin adapting the python-cinderclient code. It would require some kind of translation layer (which would expand the complexity and hence the bug surface)
  • not clear that there's a way to handle the os-brick-ext functionality in osc
  • there is microversion support in osc, and it seems to work for nova

Conclusions

  • ACTION - rosmaita - update the cinderclient-openstackclient feature gap spreadsheet to assess current gaps (need to look at mv 3.55-3.62 support): https://ethercalc.openstack.org/cinderclient-osc-gap
  • ACTION - rosmaita - follow up with osc team about microversion support and whether it works for cinder requests or not
  • ACTION - rosmaita - make sure the osc team is aware of the brick-cinderclient-ext and that something like it must be supported by osc

Kioxia: Add new driver

The Kioxia KumoScale team wants to contribute a cinder driver in Wallaby. They gave a presentation [0] outlining how KumoScale works and what they'll need to implement, and got feedback from the cinder team.

The driver will be a pretty standard cinder driver. It will require a new os-brick connector, and Kioxia proposes enhancing the current nvme-of connector with some features based on improvements in the nvme-of protocol, which would make these available to the community, and then using it as a base class for the kioxia connector.

They want to open source all the code for this and are in the process of getting the licensing worked out. They pointed out that operators aren't interested in running non-open code on their compute nodes. Kioxia will deploy a healing agent, so they are very motivated to make it open source.

See the slides [0] and the PTG etherpad for more details.

While we were discussing this, Gorka noticed a possible bug in the current os-brick code, namely that it looks like we are not flushing the nvme-of connector before disconnecting. See the etherpad for details.

[0] https://docs.google.com/presentation/d/18zBxXfDTOieuD-lQmz4Cx2xBiTErjfXy9eRsMuYZxJI/

Conclusions

  • The Kioxia team will file blueprints for their driver and os-brick connector work (and do the work!)
  • ACTION - geguileo - file a bug for the non-flushing issue

Update on Image Encryption Efforts

Josephine (Luzi) gave us an update on the in-flight image encryption effort. Barbican decided that they needed microversions in order to introduce the "Secret Consumer API" (it's a consumer API for barbican secrets, not a way to have unknown consumers); this API is important to the encryption effort because it will allow services to register that they have a resource relying on the secret, and that way users can get a warning if they want to delete a secret that's currently in use (because deleting the secret would obviously make the resource unusable).

The encryption and decryption code is going into os-brick (we approved the spec maybe two cycles ago). It's really independent from the other ongoing work, and can be reviewed independently.

There are still some challenges in Glance, but once those and the castellan support for the Barbican secret consumer API happen, the Cinder code for consuming and uploading these images can be added. So it's looking like this may all come together in Wallaby.

Conclusions

  • the os-brick patch adding support for gpg encryption is ready for review: https://review.opendev.org/#/c/709432/
  • we are also interested in the Barbican Secret Consumer API as a way to make our current encryption key handling for LUKS more robust

Sizing encrypted volumes (continued)

Sofia (enriquetaso) reported on her ongoing work to fix a bug that at first looks simple, but which turns out to be complicated on several different levels, which are discussed on this etherpad: https://etherpad.opendev.org/p/sizing-encrypted-volumes

Briefly, the problem is that a LUKS encrypted container must contain a header. So currently, when a user requests a 1GB volume of an encrypted volume type, they actually get something slightly smaller, namely, 1GB minus the space required by the header (which varies with the LUKS format). We haven't received any complaints about that (!), but we do get complaints when users try to retype a volume from a non-encrypted volume-type to an encrypted volume-type; the volume-create fails when cinder tries to fit a 1GB volume into a volume that doesn't have a full 1GB of free space.

We discussed the idea of introducing a 'migrate_size' so that a user doing a retype from an unencrypted to an encrypted volume type could specify a larger size. Gorka pointed out that this is just passing the problem on to the user. Additionally, cinder specifies size in GB, whereas what we really need is only a few MB at most. Further, we would have this weird situation where a user has a 1GB unencrypted volume, retypes to an encrypted 2GB volume, retypes to an unencrypted 2GB volume, decides to retype to encrypted again, and now the volume is 3GB even though that's not what the user needs.

Gorka thinks it's a driver problem to handle the necessary extra space so that an x GB volume actually holds x GB, whether it's starting life as an encrypted volume or is being retyped from an unencrypted volume to an encrypted volume. So we need to figure out a way to let each driver tell cinder what size increments it can allocate space in.

While this means that a 2GB encrypted volume will take up more than 2GB of space on the backend, and it will be reported as only being a 2GB volume, operators could price encrypted volume types slightly higher to take into account the actual space consumption (or it could turn out that on some backends the overhead is too small to be worth worrying about).
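The arithmetic of the problem, assuming a 16 MiB header (the LUKS2 default; the actual overhead varies with the LUKS format and parameters):

```python
GiB = 1024 ** 3
MiB = 1024 ** 2
LUKS_HEADER = 16 * MiB  # assumed LUKS2 default; format-dependent

def usable_bytes(requested_gib):
    # Capacity left for data after the LUKS header is written.
    return requested_gib * GiB - LUKS_HEADER

# A full 1 GiB of unencrypted data no longer fits after retyping to an
# encrypted type of the same nominal size:
fits = 1 * GiB <= usable_bytes(1)
# fits is False: the data is LUKS_HEADER bytes too big for the container
```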

Conclusions

Cross-project meeting with Glance team

We started out discussing the situation of the cinder glance_store when cinder is using an NFS backend, but it became apparent that this needed a longer discussion (to be held on Thursday).

We also talked about the status of the 'cursive' library, which nova, glance, and cinder all use for image-signature-validation. It's currently in the 'x' namespace: https://opendev.org/x/cursive but really should be part of openstack.

Conclusions

  • the cursive library should live in oslo with the consuming teams sharing maintenance
  • ACTION - rosmaita - bring this up with the oslo team

Thursday 29 October (Gorka's Day)

We declared this to be "Gorka's Day" because he contributed so many topics to be discussed.

recordings

Reset state robustification

Tushar Gite (TusharTgite) proposed this topic as a developer who's done some bug fixes, but would like to submit a feature and is looking for some guidance on how to do so. He's looking at picking up this spec: https://review.opendev.org/#/c/682456 (Reset state robustification).

The current state of the spec is that it was proposed for victoria, got positive comments, but wasn't revised in time for the Victoria spec freeze. So the first step will be taking over the spec and re-proposing it for Wallaby and making the requested revisions. Someone pointed out that the Cinder developer guide will be helpful, particularly this part: https://docs.opendev.org/opendev/infra-manual/latest/developers.html#rebasing-a-commit

The basic idea behind the spec is that Cinder has a "reset state" functionality for volumes, but there are several conditions (identified on the spec) where Cinder should know that resetting to the requested state will result in an invalid condition. (An example is taking a volume with an active attachment and resetting the state to 'available'.) In such cases, Cinder should reject the request with an informative error message.

It was pointed out on the spec that sometimes a knowledgeable operator may need to reset the state anyway and then manually make the current state valid. An open question on the spec was whether there should be a '--force' flag in the API for this, or whether this should be moved out of the API to cinder-manage. The team discussed this and decided that cinder-manage is the place where forcing a change will be allowed.

Gorka pointed out that we need to be careful about how we handle this and not simply make changes in the Cinder DB that don't reflect the state of the resource on the backend (for example, we can delete all of a volume's attachments in the database, which will free up the volume to be deleted, but if the backend isn't aware of this and thinks the attachments are still active, some backends will not allow the volume to be deleted).

Another point that came up is that we need to make operators aware of the "new" Attachments API (introduced in Ocata at microversion 3.27!). If they use the Attachments API to delete all the attachments, the volume should become available again without needing to reset state.
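A sketch of the kind of check a robustified reset-state would perform (illustrative names, not cinder's actual API):

```python
def validate_reset(requested_status, attachments):
    # Reject resets that would leave the DB inconsistent with reality,
    # e.g. marking a volume 'available' while attachments still exist.
    if requested_status == 'available' and attachments:
        raise ValueError(
            "cannot reset to 'available': volume has %d active "
            "attachment(s); remove them first (e.g. via the "
            "Attachments API, mv 3.27)" % len(attachments))
    return requested_status
```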

Conclusions

  • Forcing a status reset will be allowed, but not through the API ... you will have to use cinder-manage
  • Jay (jungleboyj) will be happy to answer questions about updating and moving the spec
  • Eric (eharney), as the original author, will be available to answer questions about content
  • ACTION - TusharTgite - take over https://review.opendev.org/#/c/682456 and re-propose it as discussed above
  • ACTION - rosmaita - there's a patch outstanding associated with the "volume list optimization" spec (which turned out to be a set of bugs, not a new feature). The point here is that it deals with some internal-only status values that are not supposed to be exposed to users; need to get that patch reviewed and make sure we don't have leakage of these internal-only states through the API

Ceph: Minimum supported version

For the cinder RBD driver, we don't specify a range of versions for what version of Ceph we support. This is becoming a problem because there are features in newer Ceph versions that we'd like to take advantage of (for example, the clone v2 API), and if we are committed to supporting older versions of Ceph without these features, we either have to add a bunch of compatibility code or possibly create two subclasses of the RBD driver and select which one to use based on the minimum Ceph cluster version. But it would be better to avoid extra complexity, especially if no one is using the older versions anyway.

The Ceph community does one release each year, targeting March 1. "The lifetime of a stable release series is calculated to be approximately 24 months (i.e., two 12 month release cycles) after the month of the first release" [0]. So right now they are officially supporting only Nautilus and Octopus.

We sent out an operator survey; results are available here: https://docs.google.com/spreadsheets/d/1bjh4o4piczet2u6EjiPRvgmS10HdXluFruoGmLSEaMw

It looks like customers who install their Ceph clusters along with OpenStack get updated when they update OpenStack. Also, people with external clusters tend to follow what's available for their operating system distribution, which doesn't necessarily align with what the Ceph project is currently supporting.

It does look like the people running very old Ceph are planning to get upgraded to at least Mimic. This is good for us because Mimic contains the clone v2 API that we want to use in the RBD driver, and it would allow us to say that we support the versions that the Ceph project supports + 1 year, which seems reasonable.

The question also came up of what to do about the Ceph client version. There was unanimous agreement that the client and server versions should be aligned in an OpenStack installation. We should, however, state this clearly.

[0] https://docs.ceph.com/en/latest/releases/general/

Conclusions

  • Proposal: cinder supports only the current Ceph project active stable releases plus one release older
  • ACTION - rosmaita - draft an email to operators outlining this plan and see what kind of response we get
  • ACTION - rosmaita - figure out where this will go in the docs and make sure the client-server version alignment is clearly stated

Ceph: Should we change rbd_exclusive_cinder_pool default?

The 'rbd_exclusive_cinder_pool' option was added back in Queens by Change-Id: I32c7746fa9149bce6cdec96ee9aa87b303de4271 with a default value of False.

One implication of this being False is that as the number of volumes grows, the service takes longer and longer to start, because all volumes have to be examined to see whether they are Cinder volumes or not. A majority of deployments use a dedicated Ceph pool for cinder, so having this option False just slows things down for no benefit unless the operator realizes that the option needs to be changed. This can be really hard to debug: everything is fine until, at some point as the deployment grows, the cinder service starts going up and down. Further, there's a good argument that sharing Cinder's Ceph volume pool with any other service is an operator error, and it should not be done in the first place.

The general consensus was that this is a change that should be made, and we should backport the change (though with a release note that makes it obvious that the default is being changed and what exactly this option does).

Conclusions

  • ACTION - geguileo - propose the change

Replication Canary Test

Gorka (geguileo) asked how many people have customers using replication. Nobody present was sure. Some bugs in replication have been fixed in the SolidFire driver recently, but those were found by the dev team and had not been reported by customers.

Currently Dell (PowerMax) and NetApp (SolidFire) are doing replication testing, but it's all manual.

Gorka is interested in having some kind of canary test so that operators could test against a live installation and verify that replication is working. He asked whether current drivers could support such a test, and the answer was that it depends on what the test is like. For example, canary testing for asynchronous replication on PowerMax would be difficult because all volumes using that type would require synchronization to be paused--so the test would have to be sure to use a volume type that wasn't currently being used by any users.

Another issue that came up is that the replication documentation needs to be reviewed and clarified. It needs to be clear that it depends on the backend. We also need to make clear the difference between synchronous and asynchronous replication.

Another complication is that while we could implement a canary test for cinder, that doesn't really address real life cases where other services are involved and would have to make sure a failover was implemented seamlessly.

Conclusions

  • The replication documentation needs to be enhanced.
  • It would be good to have some kind of canary test for replication, because otherwise it's difficult to tell if its configured correctly and working until it's too late.
  • ACTION - geguileo - write a spec proposing a canary test for replication and the team can work out the details on the spec
  • ACTION - geguileo - longer term, once we have something in Cinder, follow up with other groups so we can come up with a global OpenStack solution

Availability zone and volume type for backup restore

There are several things going on here:

  1. There was a spec hanging around for adding AZ and volume-type specification for backup restore [0].
  2. In the meantime, microversion 3.47 added support into the create-volume API [1] so that you can create volume from a backup and at the same time specify the volume type and AZ that you want for the volume, so the spec became kind of redundant.
  3. Recently, Jeremy Liu has proposed an implementation [2] of the spec to add AZ and volume-type specification to the backup-restore call [3].
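For reference, the existing 3.47 path already covers this use case at volume creation: the create-volume request body accepts a backup_id alongside volume_type and availability_zone. A sketch of such a request payload follows; the UUID, type name, and AZ name are placeholders, not real values:

```python
import json

# Sketch of a create-volume-from-backup request using microversion 3.47.
headers = {
    "Content-Type": "application/json",
    "OpenStack-API-Version": "volume 3.47",  # minimum version for backup_id
}
body = {
    "volume": {
        "backup_id": "00000000-0000-0000-0000-000000000000",  # placeholder
        "volume_type": "some-type",      # placeholder volume-type name
        "availability_zone": "some-az",  # placeholder AZ name
    }
}
payload = json.dumps(body)
```

The point of Jeremy's patch is to accept the same two fields on the restore call, so both entry points end up with equivalent expressiveness.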


Gorka raised the question: what are we going to do about Jeremy's patch [2]?

  • we could accept the feature, but then we'll have an additional code path for creating a volume from backup with volume-type and AZ
    • but we already have two paths, and this change would simply put the 'restore' path on parity with the 'create' path
    • and it's not like the 'restore' path is only for *restoring* a backup to an existing volume whereas the 'create' path is the proper way to create a new volume from a backup; the 'restore' path currently gives you the option of creating a new volume
  • reject the feature
    • this is not very nice because the spec had been sitting around and now someone put in a bunch of work on it
  • on the other hand, we don't need this added to the API; we could still fake it in the client (if volume-type or AZ is passed to the backup-restore call, the client would actually call the 'create' API)
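The client-side workaround mentioned in the last bullet could look roughly like this. This is an illustrative sketch, not actual cinderclient code; the function and attribute names are invented:

```python
def restore_backup(client, backup_id, volume_id=None,
                   volume_type=None, availability_zone=None):
    """Sketch of faking AZ/type support for backup restore in the client.

    If the caller requests a volume type or AZ, fall back to the
    create-volume-from-backup path (microversion 3.47), since the
    restore API does not currently accept those fields.
    """
    if volume_type or availability_zone:
        # 'create' path: new volume with the requested type/AZ
        return client.volumes.create(
            backup_id=backup_id,
            volume_type=volume_type,
            availability_zone=availability_zone,
        )
    # plain 'restore' path: existing behavior
    return client.restores.restore(backup_id, volume_id=volume_id)
```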


Nobody had a really strong feeling about what to do. Rajat pointed out that the code change is pretty small--not much code is needed to add this second path, so it doesn't look like it will create a maintenance nightmare. And Gorka is right that it would be unfair to reject the patch at this point, given the work already put into it.

[0] https://blueprints.launchpad.net/cinder/+spec/availabilityzone-and-volumetype-for-backup-restore
[1] https://docs.openstack.org/api-ref/block-storage/v3/index.html#create-a-volume
[2] https://review.opendev.org/#/c/754925
[3] https://docs.openstack.org/api-ref/block-storage/v3/index.html#restore-a-backup

Conclusions

  • Anyone who feels that duplicating code is a big problem here needs to go comment on the patch and make that point. Otherwise, we'll just continue to review the patch in the normal way (that is, getting it in shape to be merged).
  • ACTION - rosmaita - need to review any old BPs hanging around in Launchpad to avoid this happening again
  • ACTION - rosmaita - make sure our contributor docs and the specs docs indicate that a spec in 'untargeted' or not implemented in a previous cycle must be re-proposed for the current cycle

Cinder glance_store when the cinder backend is NFS

This was a cross-project meeting with the Glance team. The problem is that the cinder glance_store works fine for the most part, but when the cinder backend is NFS there are some issues.

I'm not going to summarize the entire discussion, just the conclusions. If you want the full experience, recordings are available (see the links in the "recordings" sections above).

Conclusions

  • The cinder glance_store driver must only allow volumes to be stored as raw files when NFS is the cinder backend. This will allow Glance to put anything it wants in there, and due to the Glance use case of storing immutable image data, these volumes won't need to support efficient snapshots because Glance is never going to request that operation.

Friday 30 October

recordings

Cinder/PTG business

  • All cinder-specific deadlines have been added to the main Wallaby release schedule: https://releases.openstack.org/wallaby/schedule.html
  • During Victoria, the midcycle-1 was held 1 week before spec freeze and midcycle-2 was 4 weeks before Feature Freeze & final client library release. For Wallaby that would be:
    • midcycle-1: R-18 (week of 7 December)
    • midcycle-2: R-9 (week of 8 February)

The midcycle format would be the same as the past two cycles, namely a two-hour video conference.

The first midcycle is slightly early because the spec freeze is a week earlier than Victoria to take into account low reviewer bandwidth as we head into winter holidays.

Conclusions

  • People should look over the cinder dates in the Wallaby schedule to verify that they are reasonable
  • People should look at the proposed midcycle dates and indicate if there may be conflicts
    • 7-8 December are holidays in Spain, so if we want to stick with that week, we're probably looking at Wednesday or Thursday (Friday has traditionally been unpopular for midcycles)

Removing deprecated encryptors from os-brick

The deprecation was in Ocata. Lee Yarwood has patches up for brick, cinder, and nova, but he has a question about whether we'll need to do an online migration so that existing encrypted volume types are adjusted appropriately.


Consensus was that an upgrade check should be added to cinder to verify that the migration will be successful, plus an actual migration. This should be part of the cinder patch.
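Such an upgrade check would essentially scan encryption types for the deprecated class-path provider names and report the short format name each should be migrated to. The sketch below is illustrative only; the exact legacy strings and replacement names should be confirmed against Lee's os-brick and cinder patches:

```python
# Illustrative mapping of deprecated encryptor provider names to the
# short format names os-brick expects; verify the exact strings against
# the actual patches before relying on this.
LEGACY_PROVIDERS = {
    "nova.volume.encryptors.luks.LuksEncryptor": "luks",
    "os_brick.encryptors.luks.LuksEncryptor": "luks",
    "nova.volume.encryptors.cryptsetup.CryptsetupEncryptor": "plain",
    "os_brick.encryptors.cryptsetup.CryptsetupEncryptor": "plain",
}

def check_encryption_providers(providers):
    """Sketch of an upgrade check: return the providers still using a
    deprecated name, mapped to the suggested replacement value."""
    return {p: LEGACY_PROVIDERS[p]
            for p in providers if p in LEGACY_PROVIDERS}
```

An online migration would then rewrite the offending rows; the upgrade check lets operators confirm beforehand that the migration will succeed.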

It's possible that removing support of the legacy names may break heat, python-openstackclient, tempest, patrole, or rally, so we will need to check for this and submit patches to affected projects. Hound can be used to find usage: http://codesearch.openstack.org/?q=LuksEncryptor&i=nope&files=&repos=

Conclusions

  • ACTION - rosmaita - leave feedback on the patches
  • ACTION - anyone interested - patches to affected projects

Improvements to service down

new specs (and other stuff that has been sitting around)

"volume list query optimization"

Two proposed specs on the same topic (mutually assured destruction)

digression: deprecate thick-provisioned LVM?

Support revert any snapshot to the volume

Backend Capabilities

Replace md5 with oslo version

how to test the optimized glance-cinder image-volume workflow?

Team roles & business

current roles

proposed

tactical (short term)

Late Topics

attachments API - exposing connection info

keeping connection info up to date

cinder attach/detach service

autospeccing mocks