CinderVictoriaPTGSummary

Introduction

This page contains a summary of the subjects covered during the Cinder project sessions at the Victoria PTG, held virtually June 2-5, 2020.

The Cinder Team at the Victoria (Virtual) PTG, June 2020.


This document aims to give a summary of each session. More context is available on the cinder PTG etherpad: https://etherpad.opendev.org/p/victoria-ptg-cinder


The sessions were recorded, so to get all the details of any discussion, you can watch/listen to the recording. Links to the recordings are located at appropriate places below.

Tuesday 02 June: Current Cinder Issues

recordings

Discuss the security issue when an attached volume exposes host information to non-admins

Discussion led by whoami-rajat

This was about https://bugs.launchpad.net/cinder/+bug/1740950 and the related bug https://bugs.launchpad.net/cinder/+bug/1736773.

The compute host is leaked in the volume-detail response, and also in the REST Attachments API.

We've already agreed that the volume-detail response will only display the host when called in an administrative context (the API response is already allowed to have a null value there, so there is no API change).

For the Attachments API response, we need to do some research to see whether this field is needed and in what contexts. It doesn't look like Nova needs it, and it doesn't look like any of our os-brick connectors are using it, but the info has been available in the response for so long that we need to be careful: something could be using it. We also have to be careful about what info is left out of the response, because if it's too restricted, the caller won't be able to connect to the backend.
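To make the policy idea concrete, a rule along these lines could govern the field. This is a minimal sketch with oslo.policy; the rule name, check string, and path are illustrative only, not the actual fix:

from oslo_policy import policy

# Illustrative sketch only: the real rule name and defaults would be
# decided in the actual fix.  The idea is that fields revealing the
# compute host default to admin-only, and operators whose users have a
# use case for the info can relax the rule in their policy files.
attachment_host_rule = policy.DocumentedRuleDefault(
    name='volume:attachment:show_host_info',  # hypothetical name
    check_str='rule:admin_api',
    description='Expose compute host details in attachment responses.',
    operations=[{'path': '/attachments/{attachment_id}',
                 'method': 'GET'}],
)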

Conclusions

  • short term, need to get policies in place to govern this
  • open issue: the proposal was to keep the default behavior, but that continues to allow the leakage of compute host name; would probably be better to have a more restrictive policy (since we don't believe that any services depend on it), and then operators can adjust the policies to allow users who have use cases that need this info to see it
  • ACTION - whoami-rajat to continue working on this

Sizing encrypted volumes

Discussion led by eharney and enriquetaso

When you create an encrypted volume, you lose some space for the LUKS header. This has shown up in some migration bugs. If you have an unencrypted volume and migrate to an encrypted type, it fails because there isn't enough space on the target to fit the entire source volume.


The exact amount of space lost isn't clear; it may vary by LUKS format and also between iscsi (cryptsetup) and rbd (luks). It's probably on the order of 1-2 MB, but we allocate space in GB. Whether you can do really fine-grained space allocation or not probably depends on the backend containing the volume. We also need to keep the gigabyte boundaries for accounting purposes (usage, quotas, charges, etc.) because that's what everyone's tooling expects.

Walt suggested that we change the API to allow a new size to be specified during a volume migration. For retype, Gorka suggested that we could allow a flag similar to the current migration_policy in the retype request, which gives the user the option to say it's OK to perform a migration if it's required. We could have a flag indicating that it's OK for the volume to be increased in size if it's required for the retype to succeed. This way we don't go behind the user's back -- we fail if the volume won't fit, and allow them to retype to a new size if they want the retype to succeed.
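To make that concrete, here's a rough sketch of the request shape: the os-retype action and the migration_policy key are today's API, while the size flag shown is purely hypothetical and would be settled in a spec.

# Body of POST /v3/volumes/{volume_id}/action (a Python dict for illustration).
retype_body = {
    "os-retype": {
        "new_type": "encrypted-type",
        "migration_policy": "on-demand",  # existing opt-in: migrate if needed
    }
}

# The proposal adds an analogous opt-in for growing the volume when the
# encryption header wouldn't otherwise fit; the flag name is hypothetical:
retype_body["os-retype"]["size_policy"] = "grow-on-demand"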

Conclusions

  • Take the Walt/Gorka approach. It will allow us to fix a bunch of bugs, and we can move to something more elaborate later if there's demand for it.
  • ACTION - Sofia - add a note to the docs explaining that retyping an unencrypted volume to the same size encrypted volume will most likely fail (so that users know not to try this in older releases)

Quotas!!!

Discussion led by rosmaita

The problem is that there are race conditions and bugs in the code, so operators see stuff like duplicate entries in the quotas table, incorrect usage reports, etc.

There is a spec proposed for removing the quota usage cache (database table): https://review.opendev.org/730701

The idea is that instead of trying to keep this table updated, we determine usage on the fly. Some people are suspicious that this could cause a performance problem, but Eric thinks that with appropriate indices, this would be very fast for a database to compute. It would definitely be a very small number relative to the amount of time it takes to actually build a volume.
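As a sketch of what computing usage on the fly might look like (a simplified stand-in model, not cinder's actual DB layer):

from sqlalchemy import Boolean, Column, Integer, String, func
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class Volume(Base):
    """Simplified stand-in for cinder's Volume model."""
    __tablename__ = 'volumes'
    id = Column(String(36), primary_key=True)
    project_id = Column(String(255), index=True)
    size = Column(Integer)  # gigabytes
    deleted = Column(Boolean, default=False)


def volume_usage(session, project_id):
    # Aggregate directly from the volumes table instead of maintaining a
    # cached quota_usages row; with an index on project_id this should be
    # cheap relative to the cost of actually building a volume.
    count, gigabytes = session.query(
        func.count(Volume.id),
        func.coalesce(func.sum(Volume.size), 0),
    ).filter(
        Volume.project_id == project_id,
        Volume.deleted == False,  # noqa: E712
    ).one()
    return {'volumes': count, 'gigabytes': gigabytes}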

There may still be some race conditions we need to worry about with the new approach. Our usage code was based on Nova's, but they've rewritten theirs to eliminate the database table. But they also accepted that it's OK for users to occasionally be over quota. We may have to do something similar.

Two related issues came up during the discussion:

  • the keystone "unified limits" work -- this has to do with where a user's limits are stored, but doesn't impact usage computations. Moving the user limits to keystone would be an independent effort
  • the REST API currently returns 413 or 500 for usages. The API SIG suggests returning 403 instead of 413, but there doesn't seem to be a point in us changing. (You'd need a new microversion to return the "correct" response code, but you'd still need to be able to handle the "incorrect" one if that microversion weren't available, so why not just stick with what we have.) The 500s should be fixed, though. (But that wouldn't require a microversion change.)

Conclusions

  • The approach in https://review.opendev.org/730701 looks promising. It would be nice if we had some performance numbers to assess the impact of the change.
  • Need to figure out what new race conditions we will see and how to deal with them. (Looking at what Nova did should help here.)
  • ACTION - continue discussion on the spec


(The team adjourned early so that the Cinder core security team could review the patches and plan for rolling out the changes for the bugfix for Launchpad Bug #1823200, which would become public the next day.)

Wednesday 03 June: Third Party Drivers

recordings

Bug: Fail to extend attached volume using generic NFS driver

Discussion led by lseki

The bug: https://bugs.launchpad.net/cinder/+bug/1870367

It affects the Quobyte and NetApp drivers that rely on the generic NFS implementation. We need to fix it in two steps:

(1) don't allow extend of an attached volume for nfs drivers: https://review.opendev.org/#/c/725805/

(2) figure out how to do the online extend safely, and then re-enable that functionality

This was working at some point, but has stopped. We suspect changes in qemu: qemu-img seems to be doing more locking (around versions 2.10 and 2.11). The hypothesis is that nova locks the file and then the cinder process can't write to it.

Need to research whether there is a qemu operation that we could ask nova to do, like,

virsh qemu-monitor-command <DOMAIN> --hmp "block_resize <DISK> 2G"

The idea would be to model this on snapshots, where you have a nova API that's implemented through libvirt.

Conclusions

  • continue the two-stage approach, that is, block the operation for nfs drivers so that you don't get failures, and research how to actually fix it
  • seems like we need to get nova cooperation on this, we shouldn't try to get around the locking somehow on our own
  • testing: we run devstack-plugin-nfs-tempest-full in the cinder gates; when ready, should enable volume-feature-enabled.extend_attached_volume in tempest on that job

Improvement proposal for LVM volume driver, direct attached storage via cinder

The proposer wasn't available, but we discussed the issue anyway because it's come up several times.


The basic idea is that you can get some optimization by opening volumes as devices instead of connecting via iSCSI when they're on the same node. The problem is that this is a very special case.

The team discussed a few ways to do this:

  • introduce this as an optimization in the LVM driver, if you are on same node, skip iscsi layer
    • pro: would allow more functionality
    • con: makes the driver more complex and likely to introduce bugs
  • add an option in cinder to enable "try local connection" on the LVM driver
    • if it's enabled, the driver adds the local device path to the connection info
    • if os-brick receives the path, it tries to use it if it exists; if it doesn't, it does the normal iSCSI connection (see the sketch after the next list)
  • do something similar in the lio target


What is the attitude of the project to this?

  • if it's a new target, could clone the lio job and run those tests (because it's a single node case)
    • need to have good documentation, especially about the limitations
  • if it's the "optional" one, it would use the same lio job but setting the "try local connection" config
  • bottom line is that we want as little interference with other drivers as possible
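Here's a rough sketch of the "try local connection" fallback, under the assumption that the LVM driver adds a hypothetical local_path key to the connection info when the option is enabled; this is not actual os-brick code:

import os

def connect_volume(connection_info):
    data = connection_info['data']
    local_path = data.get('local_path')  # hypothetical key, only present
    if local_path and os.path.exists(local_path):  # when the option is on
        # Same node as the backend: open the LV device directly and
        # skip the iSCSI layer entirely.
        return {'type': 'block', 'path': local_path}
    return iscsi_attach(data)  # otherwise, the normal iSCSI connection


def iscsi_attach(data):
    # Placeholder for the existing iSCSI login/scan logic.
    raise NotImplementedError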

Conclusions

  • we want to know more about the use case -- it really sounds like using cinder to manage nova ephemeral storage
  • proposer can bring this up again after addressing the above points

Ceph iSCSI driver

Discussion led by hemna

Walt began work on this a while back but got stalled because some of the dependencies weren't packaged for a distribution, so there wasn't a way to get infra-approved CI for it.

It came up again on the mailing list in April: http://lists.openstack.org/pipermail/openstack-discuss/2020-April/014278.html

Walt wrote an update this morning: http://lists.openstack.org/pipermail/openstack-discuss/2020-June/015237.html

Ubuntu Focal now has packages, so we can aim for Focal as the CI target. Packages may be coming soon for Fedora.

There's an effort to get the ceph devstack plugin updated to a modern version of ceph: https://review.opendev.org/#/c/676722/ -- would be good to help get this merged so we have a modern ceph to work with.

Conclusions

  • Walt has some time to continue with this now; he'll coordinate with the Linaro developers to work on it

Remove volume driver abc classes

Discussion led by eharney

The design of the base driver was to include a bunch of interfaces as abstract classes, for example: https://opendev.org/openstack/cinder/src/commit/be9d5f3fa54b1b7948f44a9c91cf3f4753c0a03b/cinder/volume/driver.py#L1998

It's a whole bunch of code that didn't work out well (abc didn't do exactly what we had hoped), and it would be good to get rid of it. (This has been a TODO for a few years.)

The general idea is to start removing these various "xxxVD" abstract classes and have a single volume driver class instead of all these subclasses with abc on them.

General observations:

  • probably phase this over a cycle or so
  • will definitely affect out-of-tree drivers, will need to publicize this
  • maybe leave the classes there, but remove abc from them and make them no-op classes
  • change in-tree drivers to inherit only from the base driver class, instead of the base class plus a string of xxxVD, yyyVD classes

We have two sets of classes that describe the interface for drivers:

  • the abc classes in cinder.volume.driver
  • the classes in cinder.interface

It would be better to have a single source of interface class that defines what you need to do to implement a driver.
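A condensed sketch of the change being discussed; BaseVD-style abc classes like these exist in cinder.volume.driver today, but the bodies here are heavily simplified and the driver names are hypothetical:

import abc


# Today (simplified): interface fragments are abc classes, and a driver
# inherits a list of them.
class BaseVD(object, metaclass=abc.ABCMeta):
    @abc.abstractmethod
    def create_volume(self, volume):
        pass


class ManageableVD(object, metaclass=abc.ABCMeta):
    @abc.abstractmethod
    def manage_existing(self, volume, existing_ref):
        pass


class FooDriver(BaseVD, ManageableVD):  # hypothetical driver
    def create_volume(self, volume): ...
    def manage_existing(self, volume, existing_ref): ...


# Direction discussed: one concrete base class with no abc, leaving
# cinder.interface as the single description of the driver API.
class VolumeDriver(object):
    def create_volume(self, volume):
        raise NotImplementedError()

    def manage_existing(self, volume, existing_ref):
        raise NotImplementedError()


class FooDriver2(VolumeDriver):  # drivers inherit just this
    def create_volume(self, volume): ...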

Conclusions

  • doesn't seem to be conceptually difficult, but will touch a lot of code
  • e0ne is interested in seeing this happen
  • need to work out the points in "General observations" above more clearly before starting, but the general sense of the community is this is a good idea

Backup Driver issues

Discussion led by Shatadru

Well, one issue anyway. The issue is the container parameter for the backup drivers. It's handled differently depending on what driver you're using:

  • swift creates a bucket
  • nfs creates a directory
  • ceph uses it as a pool name


What happens with Ceph is that if the pool doesn't exist, the backup will eventually fail, which can be mystifying to users:

  • they expect it to act like the other drivers, where you can use it to organize your backups
  • it's not documented anywhere that for Ceph, you need to give it an existing pool name or your request will fail


Most operators using Ceph probably don't want this used by end users because they've probably set aside a particular pool they want used for backups and don't want users to put their backups in some other pool. Also, they probably don't want users to even know any of the pool names.

Would like a configuration setting to allow the Ceph driver to ignore the container if it's passed. Problem is, this doesn't make so much sense for the other backend drivers, so we don't want to make it a global setting -- and that means it won't be the kind of thing we can detect and reject at the API layer.

It's a real problem for Ceph because an end user can probe to find out what pools they have access to, and they can put things into the "wrong" pool. But it seems perfectly OK for the other drivers.
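With oslo.config, the option could look roughly like this; 'ignore_container' is just the placeholder name from the discussion, and the final name and help text belong to the actual fix:

from oslo_config import cfg

# Sketch only; the final option belongs to the fix for bug #1881934.
ceph_backup_opts = [
    cfg.BoolOpt('ignore_container',
                default=False,
                help='If True, ignore the container parameter of backup '
                     'requests and always store backups in the configured '
                     'Ceph backup pool.'),
]

CONF = cfg.CONF
CONF.register_opts(ceph_backup_opts)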

Conclusions

  • should add a configuration option for the Ceph driver, something like 'ignore_container' and then the configured backup pool is used even when a user specifies something else
  • we'll track this as a Launchpad bug: https://bugs.launchpad.net/cinder/+bug/1881934
  • ACTION - Shatadru - will put up a documentation patch
  • ACTION - rosmaita - will update the bug with this discussion to point the direction the fix should take

Backporting 'supported' status

We allowed this during Ussuri in https://review.opendev.org/#/c/700775/ but that may have been an oversight. But then we did it again with https://review.opendev.org/#/c/718758/ because a precedent had been set. Just want to verify that we think this is a good idea.

Conclusions

  • it's OK to do this, but we'll only allow a backport to the most recent stable branch
  • need to see evidence that the driver actually runs correctly in that stable branch (because most of the third-party CI only test master)
  • ACTION - rosmaita - update the driver policy to clearly state the above

Keeping 'unsupported' drivers in-tree

We changed our next-cycle-removal policy in January 2020 to keep the drivers around longer, for the convenience of operators who want to use the driver, but also in the hope that the vendor would ultimately take responsibility for getting them 'supported' again. And in fact that did happen: several drivers that had problems meeting the CI-must-run-python-3 requirement in Train got their CIs fixed during Ussuri and their drivers marked 'supported' again.

However, we ran into a problem that caused some reviewers to express second thoughts, namely, a cinder feature entailed a change in some drivers, including at least one 'unsupported' driver: https://review.opendev.org/#/c/720722/

So the point of this discussion is to make sure that people are on-board with the idea of keeping 'unsupported' drivers in tree, even if that means the cinder team (not the specific driver maintainer) may have to make changes to the driver to support general cinder features.

Conclusions

  • we already know that we need to keep the 'unsupported' drivers passing tests, which means the cinder team (not the specific driver maintainer) may have to make changes to the driver code; adding code to support a cinder feature doesn't seem relevantly different
  • the current policy is that we can remove an 'unsupported' driver if maintenance becomes a burden, so we will continue to assess this on a case-by-case basis
    • the case-by-case business is because in the case discussed above, some unsupported drivers were fixed because they inherited from a class that was fixed, whereas there was at least one unsupported driver that required its own change. If the change was complicated, it would make sense to keep the former but remove the latter.
  • ACTION - looks like this is already covered by the current policy, so no action needed on this one

Ussuri Development Cycle Retrospective for Driver Developers

We didn't get a lot of feedback.

Broken RBD live migration

This wasn't actually a discussion, someone left a question on the etherpad and Gorka answered it. See the etherpad for details: https://etherpad.opendev.org/p/victoria-ptg-cinder

Thursday 04 June: Cinder Improvements

recordings

Recent Security Issues

Quick discussion of OSSN-0086, which was announced yesterday. We had a similar issue last cycle, which is described in OSSN-0085.

Conclusions

  • It would be good for driver maintainers to look over those OSSNs and make sure that their drivers aren't vulnerable to a similar problem.

Reviewing our reviewing strategies

Discussion led by eharney

Eric reminded the team about 'tox -e cover' and that you can use it to make sure that the code you're reviewing is being tested. It helped him find a problem with some code that looked like it was being tested, but actually was not.

This prompted a discussion of using spec or autospec when mocking stuff; without either, mock will allow you to call non-existent functions on a mocked object without complaining, which was the problem here: it allowed a call to .has_calls() to look like it was doing something when in fact it wasn't (the correct method is .assert_has_calls(), which would have made the test correctly fail).
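Here's a self-contained illustration of the failure mode (not cinder code):

import unittest
from unittest import mock


class Resource(object):
    def refresh(self):
        pass


def do_work(res):
    res.refresh()


class TestDoWork(unittest.TestCase):
    def test_unspecced_mock_hides_the_typo(self):
        res = mock.Mock()
        do_work(res)
        # Typo: .has_calls() is not an assertion method; an unspecced
        # Mock just records it as another call, so this "check" does
        # nothing and the test passes no matter what do_work() did.
        res.refresh.has_calls([mock.call()])

    def test_autospec_catches_the_typo(self):
        res = mock.create_autospec(Resource, instance=True)
        do_work(res)
        # The correct assertion:
        res.refresh.assert_has_calls([mock.call()])
        # With a specced mock, res.refresh.has_calls(...) would raise
        # AttributeError instead of silently passing.


if __name__ == '__main__':
    unittest.main()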

There was also some discussion about the pylint job, which is always failing; it would be good if it ran only on the changes in the current patch so that you'd know if the failure was being caused by the patch you're reviewing.

Conclusions

  • would be good to have the coverage results right there on the review; we can add a non-voting coverage job in Victoria and revisit at the next PTG to see whether people are finding it useful
  • ACTION - rosmaita - set up the job for ^^
  • we should make "patch" and "mock_object" methods in "TestCase" use "autospec=True" (though if you do it now, you get 77 unit test failures)
  • ACTION - ^^
  • need to revisit the pylint job, because it can detect problems, but in its current state it isn't that useful
  • ACTION - ^^
  • some teams have implemented a before-and-after coverage test with a threshold (so that you don't allow changes that decrease coverage)
  • ACTION - ^^
  • what it comes down to is that these static analysis jobs aren't going to increase the amount of time it takes for the current zuul "check" to complete, so they're worth a try

Interop WG interlude

This was a reminder from the Interoperability Working Group to make sure they are up-to-date with respect to Cinder. (We haven't had any major REST API changes during Ussuri, so we are probably OK.) A recent change removed Block Storage API v2 from the interop tests, which is fine, since the v2 API was deprecated in Pike.

Conclusions

  • the position of Interop WG Cinder project liaison is still open!
  • ACTION - rosmaita - verify that the functionality covered by the removed v2 tests continues to be covered by the current v3 tests

Looking at type annotation

Discussion led by eharney

Eric discussed using type annotations, which are part of the Python 3 language, to help improve code by allowing static checking of parameters and return types.

He's got a patch up to introduce a 'mypy' tox environment that will execute mypy type checking against cinder code: https://review.opendev.org/#/c/733620/
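As a small illustration of what annotations buy us (not code from the patch):

# Minimal example: with annotations, mypy can reject bad callers
# before the code ever runs.  Names here are made up for illustration.
def clone_size_gb(source_size: int, headroom_pct: float = 0.0) -> int:
    """Return the size in GB needed for a clone of the source."""
    return int(source_size * (1 + headroom_pct))

ok = clone_size_gb(10, 0.2)
# bad = clone_size_gb("10")   # mypy: incompatible type "str"; expected "int"

With the tox environment from the patch, running 'tox -e mypy' would execute the checker over the annotated code.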

References:

  • https://docs.python.org/3/library/typing.html
  • PEP 3107 -- Function Annotations: https://www.python.org/dev/peps/pep-3107/

The question came up that code using these annotations won't be directly backportable to stable branches that still support Python 2.7, and what about our policy of not using Python 3-only language features in code used by drivers? The consensus was that we should use Python 3. Some backports may be more complicated, but that's the way it goes.

Conclusions

  • Type checking will enable us to improve the cinder code; we should take advantage of it.
  • ACTION - rosmaita - revise the py2-to-3 transition guidelines: https://docs.openstack.org/cinder/latest/contributor/gerrit.html#transition-guidelines

Cinderclient CLI usability issue

Discussion led by rosmaita

Many Block Storage API requests return a 202 (Accepted) response. This doesn't really tell you whether your request has succeeded or not, and end users don't always understand this.

One thing we can do is enhance the 202 response so that it gives a user a hint about what to do next. In the cinderclient, we can add the --poll option to more commands.

The question came up: maybe we shouldn't enhance the cinderclient, but instead put the effort into the openstackclient. And that led to ...

OpenStack Client tangent

Because we were talking about working on the cinderclient, the question came up: what about the openstackclient? Since Jay is on the TC, and the TC is thinking about the unified client issue, we talked about this a bit so he could bring our input back to the TC. Here are a few points; see the etherpad for more.

  • There is a feature gap right now between the openstackclient and the cinderclient. We can't just stop work on the cinderclient and instead put new commands into the openstackclient because a user will still need to use the cinderclient for the "gap" features that aren't in openstackclient yet. Before we stop work on the cinderclient, the openstackclient must have feature parity with the cinderclient.
  • People want access to new features now, not next cycle or the one after that. While it would be nice to have a unified client, when faced with the choice between the convenience of a CLI that doesn't do everything, and the inconvenience of having to use a different client for cinder in order to use new features, new features are going to win.
  • The osc, openstack sdk, and shade have been promoted as "opinionated" tools. While that may be appropriate for some users, operators at least are going to want full access to what the APIs offer. So it's not obvious that a move to the openstackclient makes sense until it's clear what the design philosophies of these tools are.

Conclusions

  • At this point, it doesn't make sense to stop working on the python-cinderclient.
  • To get back to the session topic, it doesn't look like it would be too difficult to implement --poll for other resources.

Community Goals

Discussion led by Jay Bryant.

There is one goal approved so far for Victoria: tosky's goal to move all CI jobs to native Zuul v3 jobs. That's good from the cinder point of view, because tosky has already converted most of our jobs.

The likely second goal is to move all CI/CD jobs to Ubuntu 20.04 (Focal): https://review.opendev.org/#/c/731213/

For the W cycle, the goals on https://etherpad.opendev.org/p/YVR-v-series-goals are all candidates. An early favorite is the removal of rootwrap. There's some talk about the openstackclient, but that will probably require more discussion (see the previous session).

Ussuri Development Cycle Retrospective

Didn't get a lot of feedback on this one, either. One issue that did come up was that while we might not be able to speed up reviewing, we could make reviews more predictable.


(We adjourned early for the Virtual Happy Hour, which was not recorded.)

Friday 05 June 2020: XP & TCB (cross-project and taking care of business)

recordings

Cinder/Glance creating image from volume with Ceph

Discussion led by abhishekk

When glance doesn't know the size of an image in advance and it's using the Ceph backend, you get thrashing: glance acts conservatively and makes really small resize requests as the image is being written. At the Shanghai summit, we discussed two approaches to fix this: (1) pass the size in the request (since cinder already knows how large the volume needs to be; an Image API change), or (2) a glance_store change (make much bigger resize requests, like 1 GB, and then shrink at the end). In Shanghai, the cinder team was satisfied with solution (2). The Glance team didn't have time to implement it in Ussuri and wanted to make sure we're still good with that option. We are.

Make cinder store of glance compatible with multiple stores

Discussion led by abhishekk

Glance now has the ability to offer different backend image stores to users. One use case is for an operator to offer different tiers which are presented to users as a store_id and a description (for example, 'bronze': 'Less expensive storage suitable for whatever'; an operator doesn't need to expose the backend technology to end users, just the fact that there are different stores available). If an operator is using the cinder glance_store driver, currently only one volume_type can be configured for use by that store, even though there may be multiple storage tiers defined by various volume_types in cinder.

It looks like one way to address this would be to continue to use a single cinder glance_store, but give it the ability to pass different volume_types when it makes a volume-create request. The volume_type would not be exposed directly to end users -- instead, each volume_type could be associated with a store_id in the glance stores-info response so that the stores abstraction layer would not be broken.
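As a rough sketch of how that could look to an end user: glance's existing store discovery call (GET /v2/info/stores) already returns opaque store ids and descriptions, and the volume_type association would live in glance's configuration rather than in the response. The store names and volume_types below are made up for illustration:

# Shape of a stores-info response (Python dict for illustration).
stores_info_response = {
    "stores": [
        {"id": "bronze", "description": "Less expensive storage"},
        {"id": "gold", "description": "Fast replicated storage"},
    ]
}

# Internally (configuration, not API), each store id would map to a
# cinder volume_type (hypothetical examples: bronze to 'cheap-nfs',
# gold to 'ssd-mirrored'), and the cinder glance_store driver would
# pass that volume_type in its volume-create request.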

Gate job for glance cinder store

Discussion led by whoami-rajat

Rajat says:

Currently we support the cinder store in glance, but there isn't any gate job to verify that image creation/deletion works as expected when glance receives a request. I have been working on a gate job on glance_store for cinder[1] (which is passing finally), but I feel it would make sense to have a job on the glance gate to test the whole flow of image creation when glance receives a request.

[1] https://review.opendev.org/#/c/329825/

Conclusions

  • The cinder glance_store seems to be increasing in use; both the Cinder and Glance teams agree that it would be good to make sure it's being tested well.

Glance image co-location

Discussion led by jokke_

In Ussuri, cinder implemented a scheme to assist Glance in image co-location. What this means is that if a volume was created from an image, if that volume was later uploaded as an image, cinder passes along the image_id of the original image so that Glance can determine what store(s) that original image is located in, and put copies of the new image in the same locations. The Glance team would like to modify the implementation on the cinder side so that instead of passing the base_image_ref as a header in the image-create call to glance, if the base_image_ref is present, cinder makes a call to the glance Import API. This would enable Glance to reuse a code path which would be better tested and easier to maintain.

Conclusions

  • The proposal seems reasonable. It shouldn't make the image-create take longer on the cinder side; the Import API returns a 202 (Accepted) and does its work asynchronously, so it would just be a quick API call. We need to look more closely, though.

Refresh connection_info

Discussion led by eharney (lyarwood couldn't be here)

We briefly discussed the proposal Lee outlined on the ML and in a comment on a review:

  • http://lists.openstack.org/pipermail/openstack-discuss/2020-April/014348.html
  • https://review.opendev.org/#/c/579004/

It seems reasonable. Gorka mentioned that there's an email thread out there where he discusses some aspects of this in more detail.

Conclusions

  • Follow up with lyarwood when he gets back about putting this into a spec
  • Dig up Gorka's email thread to make sure those points are addressed

DR operations and broken volume connections

Related topic introduced by sfernand

Disaster recovery operations leave broken volume connections in the VMs. Maybe it would be useful for nova to automatically reestablish iSCSI connections after a failover/failback?

  • could maybe do this if multipath is enabled ... but is not going to be easy
  • Fernando is interested in working on this ... Gorka volunteered to talk through the problem with him and bring in Lee at some point because he's familiar with this too
  • work items:
    • Cinder drivers must report whether they do async or sync, A/A (nothing to do since the multipath already has access to both paths) or A/P (we need to refresh the connection) replication
    • Cinder must either tell Nova which volumes have a new "source" or automatically update the initialized connection and tell Nova to refresh specific instances
    • OS-Brick needs to be able to switch from the old to the new connection as cleanly as possible

Dynamic front-end QoS

We have the ability to specify front-end QoS in Cinder that are applied by Nova when a volume is attached. There have been requests for a feature that would allow these to be changed while a volume is attached. This would be useful when QoS specs are changed in Cinder, or when a volume is extended and is using iops_per_gb QoS specs.

This would require:

  • a Nova spec to modify the (nova) External Events API (see the sketch below): https://docs.openstack.org/api-ref/compute/?expanded=#create-external-events-os-server-external-events
  • writing Nova code to get the info out of the Events API request and down to libvirt
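For context, here's a sketch of what such an event might look like. The os-server-external-events request format is Nova's existing API, but the event name is hypothetical and is exactly what the spec would have to define:

# Body of POST /os-server-external-events (Python dict for illustration).
event_request = {
    "events": [{
        "server_uuid": "9dd2ec5f-33e1-4f68-9929-c32d34bcc099",  # example
        "name": "volume-qos-changed",   # hypothetical new event type
        "tag": "<volume attachment id>",
    }]
}
# Nova would then pull the new QoS values for the tagged volume and
# push them down to libvirt for the running instance.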

Conclusions

  • There was widespread agreement that this is a great idea, but no one present expressed interest in picking it up

Victoria spec review

We quickly went through the currently proposed specs: https://review.opendev.org/#/q/project:openstack/cinder-specs+status:open

See the etherpad for comments.

Technical debt

Alembic!

We need to do this during this cycle. Rosmaita will work on it, pulling in people as needed.

Block Storage API v2 Removal

It's been deprecated since Pike. We should at least remove the external-facing stuff. (Internally, the v3 API is using some of the same code as the v2 contributed extensions.)

Other Stuff

There are some open items from the Ussuri PTG that are still viable: https://etherpad.opendev.org/p/cinder-ussuri-ptg-actions

Cycle Priorities and Schedule

Schedule

Patch: https://review.opendev.org/#/c/733730/

This is kind of a short cycle, so while the spec freeze is at R-15 just like in Ussuri, it falls about 3 weeks earlier in calendar terms. So make sure you mark the date on your calendar.

This time I paid attention to holidays that might fall around important deadlines:

  • We will be low on reviewers right around the time of the Spec Freeze, so the freeze will happen on Wednesday this cycle (instead of the usual Friday). Keep that in mind as you work on getting specs revised.
  • We will also be low on reviewers around the weekend before the Feature Freeze; there's a holiday, but with the pandemic situation (hopefully winding down at that point), people may be inclined to take an extra day to make the weekend even longer. So last-minute reviews may not be available.


To avoid the cavalcade of driver feature reviews that happened right before Ussuri feature freeze, there's a new deadline on the schedule: Cinder Driver Features Declaration. This will help us organize the reviews better. See the description on the schedule for details, and ask me if you have any questions.

Priorities

(This list is subject to revision, but should be pretty much set by 19 June.)

We have some open items begun in Ussuri that we'd like to finish early in Victoria (target is milestone-1, but that is coming up really fast):

  • volume-local-cache
  • nfs-volume-encryption
  • os-brick changes to support in-flight image encryption/decryption


There are two deprecations that need to be addressed:

  • https://blueprints.launchpad.net/cinder/+spec/rbd-remove-option-causing-ossn-0085
  • https://blueprints.launchpad.net/cinder/+spec/os-reset-status-remove-back-compat