CinderYogaPTGSummary

Introduction
This page contains a summary of the subjects covered during the Cinder project sessions at the Project Team Gathering for the Yoga development cycle, held virtually October 18-22, 2021. The Cinder project team met from Tuesday 19 October to Friday 22 October, for 4 hours each day (1300-1700 UTC).



This document aims to give a summary of each session. More context is available on the cinder Yoga PTG etherpad:
 * https://etherpad.opendev.org/p/yoga-ptg-cinder

The sessions were recorded, so to get all the details of any discussion, you can watch/listen to the recording. Links to the recordings are located at appropriate places below.

recordings

 * https://www.youtube.com/watch?v=Bon4Y-Us_to
 * https://www.youtube.com/watch?v=EvW1go6Xa_w
 * https://www.youtube.com/watch?v=oPngR_8RIJI

Greetings, user survey discussion
For the benefit of people who haven't attended, this is the way the cinder team works at the PTG:
 * sessions are recorded
 * please sign in on the "Attendees" section of this etherpad for each day
 * all notes, questions, etc. happen in the etherpad; try to remember to preface your comment with your irc nick
 * anyone present can comment or ask questions in the etherpad
 * also, anyone present should feel free to ask questions or make comments during any of the discussions
 * we discuss topics in the order listed in the etherpad, making adjustments as we go for sessions that run longer or shorter
 * we stick to the scheduled times for cross-project sessions, but for everything else we are flexible

Next, we took a look at the Project Specific Feedback Responses from the latest User Survey. Here's an ethercalc that's organized to make the cinder-relevant responses easier to see: https://ethercalc.openstack.org/=2021-user-survey

Our question on the survey was: "If there was one thing you would like to see changed (added, removed, fixed) in Cinder, what would it be?"

We received 39 responses (out of about 425 responses to the survey).

Looking through the responses, there were requests for features that we already have and some comments that we didn't understand. We decided to send a response to the mailing list mentioning the implemented features and starting a discussion on the items we didn't understand. (Survey responses are anonymous, so we can't contact operators directly.) Amy Marrich (spotz, who facilitates the operator meetups) has mentioned that operators tend to follow the meetup twitter account, so a good way to contact operators is to post to the ML and then notify her to tweet out the link from the ops meetup account.

The quantitative responses (for example, how many deployments include cinder) are on the OpenStack analytics page: https://www.openstack.org/analytics/

Some feedback for the User Survey team:
 * The data in the report on the website is really difficult to consume (it's displayed in non-resizable graphs, and it's difficult to distinguish the percentages for "interested", "testing" and "production"). There's an option to download as PDF, but that gives you the same non-resizable graphs.  It would be helpful to be able to download the data as a CSV file.
 * For the next survey, we want to add the question: What driver(s) are you using for your Cinder environment?
 * We like our current question, but a lot of the answers are too vague -- do you have any suggestions on how to indicate to people that they should be clear and specific?

conclusions

 * action (rosmaita): start an etherpad for the response to operators
 * action (rosmaita): communicate our feedback to the User Survey team

In-flight image encryption update
Josephine Seifert (Luzi) updated us on the status of the in-flight encryption effort. The current plan is to have "experimental Image Encryption without Secret Consumers". The reason is to allow coding and reviewing of the Image Encryption work (and set up CI) while waiting for the Secret Consumers API. (The Secret Consumers API in Barbican will allow services to register that a secret is in use, though the secret owner can still delete it by using a --force flag. The holdup is that microversioning needs to be introduced to the Barbican API before the new API can be added.)

The current idea is to release this as an Experimental feature and "officially" release when the Secret Consumers API is ready. This strategy is described in a Glance spec-lite: https://review.opendev.org/c/openstack/glance-specs/+/792134/

What's required from the cinder team for this work is:
 * os-brick -- will have the PGP encryption code to be used by the services. The patch for this is available for review: https://review.opendev.org/709432
 * cinder -- download image from glance will need to decrypt such images to write them to volumes
 * cinder -- upload volume to image (maybe? need to check the spec)
 * cinder -- will also need to use Secret Consumers when available (if cinder does encryption on upload)
 * may also want to add Secret registration to our current luks encrypted volume code to protect encryption key ids
 * what about the glance cinder backend?
 * we have an optimized path that clones instead of downloads and copies onto the volume
 * need to handle this case
 * what about the image-volume cache?
 * should these things not be included in the cache?
 * need to see what the spec says about this

The current cinder spec is: https://specs.openstack.org/openstack/cinder-specs/specs/xena/image-encryption.html

There isn't a patch yet for the cinder changes, though there is a glance PoC patch with placeholders for Secret Consumers: https://review.opendev.org/c/openstack/glance/+/705445

conclusions

 * action (rosmaita) Review the cinder spec again. It was approved in Train, and hasn't really been looked at since.
 * action (cinder team) Interested parties should also look at the spec again.
 * action (whoami-rajat) Review the spec again specifically with cinder glance_store cases in mind.
 * action (Luzi) Gorka pointed out that it will be easier to review if some cinder POC patches are available:
 * cinder patch (pending?)
 * cinder-spec (done)
 * os-brick patch (ready)
 * glance patch (ready)

Support for quiesced snapshot/backup
Arthur Outhenin-Chalandre (Mr_Freezeex) wants to support quiescing volumes for backup or snapshot, following up on some previous proposals:


 * Call `quiesce`/`unquiesce` from nova (old spec: https://wiki.openstack.org/wiki/Cinder/QuiescedSnapshotWithQemuGuestAgent#Cinder)
 * Related (old) bp: https://blueprints.launchpad.net/cinder/+spec/quiesced-snapshots-with-qemu-guest-agent

The netapp/vmware volume drivers already support quiesced snapshots without nova calls, and this could be extended to other drivers.

Some points that came up during discussion were:


 * consistency groups: give you only crash consistency, the same as current regular snapshots made from the cinder side. If crash consistency is all people need, then this proposal isn't necessary
 * question: how is this supposed to work for multiple volumes attached to an instance?
 * generic groups allow arbitrary grouping of volumes, how will this impact them?
 * is cinder the correct entrypoint for this request?
 * Simon mentioned that quiescing the entire VM is overkill for taking a snapshot (and is possibly risky), especially if most people are set up to deal with crash-consistent snapshots anyway
 * Gorka pointed out that we need to consider some broader use cases and decide whether they need to be addressed, for example:
 * Single volume: snapshot with quiesce (easy case)
 * All volumes in a single VM
 * Belong to the same group (generic or consistency)
 * Are independent volumes (including boot)
 * All volumes in a group (could be attached to different VMs)
 * multiattach (single volume attached to multiple VMs)

An implementation doesn't have to handle all of the above (and there are probably some more use cases that will come up), but we do want to be clear on exactly what use cases are being addressed and which ones we aren't going to handle.

conclusions

 * action (Mr_Freezeex) will propose a spec addressing the above issues

default types retrospective
Rajat Dhasmana (whoami-rajat) led a discussion about where we are currently with default volume types. Here are his slides if you want to follow along: https://docs.google.com/presentation/d/1qEI3PApyzYWBnoUUTOnxR8zQlI1s3YwNjt5CZqHVuSA/edit?usp=sharing

Back in Train, we decided that cinder would no longer allow untyped volumes (there are places in the code, particularly in some drivers, where a volume type is assumed, and when it's not there bad stuff happens). So to address this, a __DEFAULT__ type (a very minimal type, basically just name, id, and description) was introduced in the Train release to guarantee that there was at least one volume type in every deployment, and any untyped volumes were assigned this type in the database. Further, if the default_volume_type cinder option wasn't set, the __DEFAULT__ type would be used.

A problem was that some deployment tools (and operators) had already created and set a default_volume_type. Operators reported that end users were confused when they saw __DEFAULT__ in the volume-type-list response and would explicitly create volumes of type __DEFAULT__, which wasn't what operators wanted; they wanted end users to simply use the configured default type. Compounding this, the __DEFAULT__ type couldn't be deleted (because its purpose was to make sure that there was always some volume type available in the deployment).

We reworked the logic on this later (and backported it to Train) so that the default_volume_type cinder option is required (with a default value of __DEFAULT__) and cinder will not allow the volume type that's the value of default_volume_type to be deleted (while it's the default), and that there will always be at least one volume type. So it's possible for operators to treat the __DEFAULT__ type like any other type (for example, it can be deleted if there are no existing volumes of that type).
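
The reworked deletion rules can be sketched roughly as follows. This is a minimal illustration of the rules described above, not cinder's actual code: the helper name `can_delete_type` and its arguments are hypothetical.

```python
DEFAULT_TYPE_NAME = "__DEFAULT__"


def can_delete_type(name, all_types, configured_default=DEFAULT_TYPE_NAME,
                    types_in_use=()):
    """Return True if deleting volume type `name` respects the rules above.

    Hypothetical helper for illustration only; cinder's real checks live
    elsewhere and may differ in detail.
    """
    if name == configured_default:
        # The type named by default_volume_type is protected while it is
        # the configured default.
        return False
    if name in types_in_use:
        # A type that still has volumes cannot be deleted.
        return False
    # There must always be at least one volume type left.
    return len(set(all_types) - {name}) >= 1
```

Under these assumptions, __DEFAULT__ is protected only while it is the value of default_volume_type; once an operator points the option at another type and no volumes use __DEFAULT__, it can be deleted like any other type.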

However, __DEFAULT__ is still there out of the box, and it's causing confusion for deployments that don't want to use it, or where it's unnecessary.

We discussed this a bit and concluded that it's a deployment responsibility to decide what to do about the __DEFAULT__ volume type in a particular deployment.

But there are definitely some things we can still do on the cinder side to improve the situation:


 * Make sure it's clear in the operator docs that we already have a mechanism in place to avoid creating untyped volumes, so it's OK to delete the __DEFAULT__ type if it's not used anymore. (We picked its name so that it wouldn't clash with any existing volume types, but "__DEFAULT__" looks official and scary, and can lead operators to think that it's necessary for cinder's correct functioning.)
 * add this info somewhere in the operator configuration docs
 * In victoria, cinder introduced default types per project: https://specs.openstack.org/openstack/cinder-specs/specs/victoria/default-volume-type-overrides.html. We need to promote the idea that if a user wants to see what the effective default volume type is, they need to make the GET /v3/{project_id}/types/default API call, not look at the volume-type-list and try to figure it out from the name or description.  We can improve the documentation to promote this call:
 * upgrade the API-REF
 * add something into the volume-create section
 * probably also in the type list
 * upgrade Client help as well?
 * check to see what the volume-create help text says, add something there
 * in the installation docs, say something about the importance of the default type config and that the __DEFAULT__ type can (and should) be removed (by the operator/deployment tool) after startup, once a default has been created and set in cinder.conf
 * Gorka suggested that maybe we could introduce a microversion that somehow highlights your effective default volume type when you make the volume-type-list request
 * horizon may already be doing something like this (where "like this" means highlighting the default volume type)? In any case, we should do it too.
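
As an illustration of the call being promoted above, here is a minimal sketch that builds (but does not send) the GET /v3/{project_id}/types/default request using only the Python standard library. The endpoint URL, project id, and token are placeholder assumptions; a real client would obtain them from keystone and would normally use python-cinderclient or openstacksdk instead.

```python
import urllib.request


def build_default_type_request(endpoint, project_id, token):
    """Build the request for a project's effective default volume type.

    `endpoint`, `project_id`, and `token` are placeholders supplied by the
    caller; nothing is sent over the network here.
    """
    url = f"{endpoint.rstrip('/')}/v3/{project_id}/types/default"
    return urllib.request.Request(
        url,
        headers={"X-Auth-Token": token, "Accept": "application/json"},
        method="GET",
    )


# Example with made-up values:
req = build_default_type_request(
    "https://cinder.example.com/volume", "my-project-id", "my-token")
print(req.get_method(), req.full_url)
```

The point of the sketch is that the effective default is asked for directly, rather than inferred from names or descriptions in the volume-type-list response.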

conclusions

 * action (rosmaita) make sure people follow up on this ... we've had inquiries about small features and documentation work from various community members, and this would be ideal for such people

Driver Support Matrix update
This was a participatory activity. The idea (that has come up several times in past PTGs) is that we need to review the content of the support matrix. There are features called out that aren't really "missing" if a driver doesn't support them, and there may be some capabilities that are good to know about that aren't specified. We've discussed this a lot, let's just sit down and do it.

The aim of the support matrix is to be useful to operators when selecting a backend. What it currently looks like is this: https://docs.openstack.org/cinder/latest/reference/support-matrix.html

The team made notes in an etherpad, you can see the full discussion there: https://etherpad.opendev.org/p/yoga-cinder-support-matrix

conclusions

 * It would be a lot easier to read if we swapped columns and rows. Chuck did a quick mockup and it's pretty clear that he's right.  We have some custom python sphinx code that generates the matrix from an RST and a config file, so this would have to be changed in our custom code.
 * action (rosmaita): Update the note on what happens to drivers that don't have a working CI.  It is no longer accurate.
 * current removal policy: https://docs.openstack.org/cinder/latest/drivers-all-about.html#driver-removal
 * Open question: should we call out microversions associated with features? For example, [operation.online_extend_support] is associated with microversion 3.42, should we mention that in support matrix?
 * Looks like the recent gerrit upgrade broke the script that monitors our CI stats: http://cinderstats.ivehearditbothways.com/cireport.txt
 * action (jungleboyj): Follow up with Sean on if he can fix this and if he plans to continue maintaining it.
 * action (rosmaita): update the basic cinder features list

Interop!
The plan was to meet with the Interop WG about adding capabilities to the trademark guidelines.

Some links to remind us what we have questions about:


 * previous cinder team discussion: http://eavesdrop.openstack.org/meetings/cinder/2021/cinder.2021-05-12-14.00.log.html#l-56
 * next interop guidelines: https://opendev.org/osf/interop/src/branch/master/guidelines/next.json
 * general info about the Interop WG: https://wiki.openstack.org/wiki/Governance/InteropWG
 * Presentation: https://docs.google.com/presentation/d/18NEDcZUFttCee564DDSVSfA6fINcYDLjZjGjAwEemFA/edit#slide=id.p1
 * Draft of proposed guideline: https://review.opendev.org/c/openinfra/interop/+/811049/3/guidelines/2021.11.json
 * List of tempest API coverage by interop: https://etherpad.opendev.org/p/refstack-test-analysis

The representative from the Interop WG was delayed, and then had to drop, but left these requests:


 * review and comment on https://review.opendev.org/c/openinfra/interop/+/811049/3/guidelines/2021.11.json for volume/cinder sections and if these should still be required, if there are some known issues or tempest changes for coverage
 * What new functionality and tests were added in the Wallaby and Xena cycles (listed separately), so we can consider them for future guidelines?
 * What are your recommendations on the functionality not covered by current guidelines that reached maturity and usage by the customers?
 * Finally, how does cinder want to handle microversions? Currently we cover API version 3.0 for interop. What is the range of API microversions that each release supports? Then we can define the overlap of these for the 3 latest + more future releases that we can claim every cloud implementation must support.
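
The "overlap" asked about in the last request can be computed mechanically once each release's supported range is known. The sketch below is only an illustration: the function names are hypothetical, and the ranges in the usage note are made-up values, not the real maxima for any release.

```python
def parse(version):
    """Turn a microversion string such as "3.42" into a comparable tuple."""
    major, minor = version.split(".")
    return int(major), int(minor)


def common_range(ranges):
    """Return the (min, max) microversion range shared by every release.

    `ranges` is a list of (min_version, max_version) string pairs, one per
    release. Returns None if the releases have no microversion in common.
    """
    low = max(parse(lo) for lo, _ in ranges)
    high = min(parse(hi) for _, hi in ranges)
    if low > high:
        return None
    return f"{low[0]}.{low[1]}", f"{high[0]}.{high[1]}"
```

For example, `common_range([("3.0", "3.60"), ("3.0", "3.62"), ("3.0", "3.64")])` yields `("3.0", "3.60")` for those illustrative inputs. Versions are compared as integer tuples so that 3.9 sorts before 3.10.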

conclusions

 * We could still use input from the Interop WG to satisfy the above requests.
 * action (rosmaita): reach out to the Interop WG and invite them to an upcoming cinder meeting

Community goals and the "secure and consistent RBAC" effort
At the beginning of the session, it looked like we had everything planned out:
 * The Xena spec was a bit ambitious, and has been revised for what we did in Xena and what's planned for Yoga
 * https://review.opendev.org/c/openstack/cinder-specs/+/809741/2
 * general info at the top of the doc; details in the "Implementation Schedule"
 * https://docs.openstack.org/cinder/xena/configuration/block-storage/policy-personas.html
 * implementation details: see comment in https://opendev.org/openstack/cinder/src/branch/stable/xena/cinder/policies/base.py

But, Dan Smith and Lance Bragstad attended and hipped us to the current discussion:
 * there's an ongoing issue across projects, namely, the misuse of system scope to allow admins to perform actions within projects
 * system-* personas should not be allowed to do stuff within projects
 * system-scope personas operate on the system as a whole (and must respect project boundaries)
 * system-scope allows you to operate on system resources (e.g., services, clusters, volume-types)
 * project-scope allows you to operate on project resources (e.g., volumes, snapshots, backups)
 * example: system-* personas can interact with volume-types (CRUD), but not use them to create volumes (because a volume is a project resource)
 * only project-* personas can act within projects
 * the plan is to rely on inheritance from higher level domains in order to create a persona who can act within individual projects without being "in" those projects
 * important to understand keystone hierarchical boundaries: https://wiki.openstack.org/wiki/Hierarchical_administrative_boundary
 * https://docs.openstack.org/api-ref/identity/v3/index.html#os-inherit
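
The scope boundary described in this session can be illustrated with a deliberately simplified sketch. It only models which kind of resource each scope may operate on, using the examples from the list above; it ignores roles (reader/member/admin), read-only access, and inheritance, and the names are hypothetical rather than taken from cinder's policy code.

```python
# Resource kinds taken from the examples in the discussion above.
SYSTEM_RESOURCES = {"service", "cluster", "volume-type"}
PROJECT_RESOURCES = {"volume", "snapshot", "backup"}


def scope_allows(token_scope, resource):
    """Return True if a token with the given scope may operate on `resource`.

    Simplified illustration: system scope covers system resources only,
    project scope covers project resources only.
    """
    if token_scope == "system":
        return resource in SYSTEM_RESOURCES
    if token_scope == "project":
        return resource in PROJECT_RESOURCES
    return False
```

In this model a system-scoped persona can manage volume types but `scope_allows("system", "volume")` is false, which mirrors the example that system-* personas cannot create volumes.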

We continued the discussion here: https://etherpad.opendev.org/p/cinder-yoga-secure-rbac-more-thoughts

conclusions

 * action (rosmaita, abishop): follow up to figure out what's going on

recordings

 * https://www.youtube.com/watch?v=vZ2S0SLS2jQ
 * https://www.youtube.com/watch?v=jR1bLGH1zic

os-brick for NVMe - The Next Steps
Simon Dodsley (simondodsley) gave an update on the working group of people interested in extending the NVMe support in os-brick. Right now, Pure, Kioxia, and Dell/EMC are all interested in this technology, and working on support. We are gathering info and ideas here: https://docs.google.com/document/d/1_xXgYOElC5G8RawWyEEHKOF1A5E3RmtlUlmcqCtuUzA/

Ping simondodsley in IRC if you can't access the doc. Anyone interested in NVMe is welcome to participate in this effort.

Gorka has put up two patches so that we'll be able to do testing using LVM. These use the NVMe tools/protocols to export LVM volumes, using NVMe instead of iSCSI to make the volumes available. This way we can have testing independent of NVMe solution vendors as they are still getting their third-party CI systems up (there are problems getting network cards due to covid-19):


 * DevStack NVMe (RDMA & TCP) support: https://review.opendev.org/c/openstack/devstack/+/814193
 * LVM Cinder driver support for NVMe TCP: https://review.opendev.org/c/openstack/cinder/+/791929

One thing that came up in discussion is that when reviewing os-brick patches, there's not a clear way to know what connector is being used. You can usually tell from the CI name what cinder driver is being used, but the connector isn't so obvious. This has been a problem in the past with the nvmeof connector, where it hasn't been clear which third party CI results should all pass to indicate that the connector is working. Gorka suggested that we create a reviewer chart for os-brick connectors and the CIs that test them (including description on variants, for example, iSCSI shared targets, iSCSI individual targets). It might also be possible to require that os-brick CI jobs add to their name the connector they are using.

This led to a general discussion of what we want to get done in os-brick in the Yoga cycle:


 * gpg encryption support (discussed Tuesday)
 * https://review.opendev.org/709432
 * connection agent (missed Xena due to reviewer bandwidth problems, so let's get it reviewed and out of the way early in Yoga)
 * https://review.opendev.org/c/openstack/os-brick/+/802691
 * Zohar will get some more info onto the review about how the agent is being tested

Note that neither of the above efforts impacts the mainline os-brick code, so they are very low risk for anyone who doesn't use them.


 * nvme next steps
 * nvmeof fixes
 * critical patch for NVMe with kernels that have ANA (native multipathing) enabled: https://review.opendev.org/c/openstack/os-brick/+/806687
 * others: https://review.opendev.org/q/project:openstack/os-brick+status:open+file:os_brick/initiator/connectors/nvmeof.py
 * CI patches
 * https://review.opendev.org/c/openstack/devstack/+/814193
 * https://review.opendev.org/c/openstack/cinder/+/791929
 * watch for patches to os-brick .zuul.yaml setting up jobs
 * native multipathing
 * stretch goal in Yoga is to implement DM multipathing

conclusions

 * reminder: Changes to be included in the Yoga release of the os-brick library must be merged by Thursday 10 February 2022 (20:00 UTC)
 * action: create the reviewer chart of os-brick connectors and the CIs that test them

Clarify the Volume Driver API, Part 1
This was another participatory activity. Over the past few PTGs, we've all agreed that it would be helpful to driver contributors if the cinder project team were to more clearly specify the interface that drivers are expected to implement. We have some tooling in place to make sure that specific functions are implemented, but there are cases where the documentation describing the semantics and even what's expected to be returned are either outdated or incomplete. Everyone always agrees that this is a good idea, so instead of all waiting for someone else to get this going, we decided to discuss this in a working session at the PTG to get something hammered out.

The idea is to get started today, and follow up on Friday. The notes from the session are here: https://etherpad.opendev.org/p/yoga-volume-driver-API

The Friday session was cancelled so that we could attend the TC session about the Yoga community goal ("secure and consistent RBAC").

conclusions

 * The discussion was productive, as we identified two functions right away that should be removed. We'll continue to do this as an ongoing activity throughout the cycle, maybe as a "volume driver API function of the week", and incrementally get the entire interface documented.

Happy Hour and mascot/team name discussion
We moved to meetpad for this discussion since it didn't need to be recorded.

Simon Dodsley (simondodsley) led us in a discussion of what to name the cinder mascot and what to name the cinder team. You can see suggestions and the votes on the etherpad: https://etherpad.opendev.org/p/cinder-yoga-happy-hour

conclusions

 * the cinder mascot is named "Argo" (which happens to be the name of the horse of Xena, Warrior Princess)
 * https://twitter.com/jungleboyj/status/1450870018720288768
 * the cinder team will henceforth be known as the "Argonauts"
 * https://usercontent.irccloud-cdn.com/file/wFrlqPKu/image.png

recordings

 * https://www.youtube.com/watch?v=VCcqPb-VYV4
 * cross-project session with Glance: https://www.youtube.com/watch?v=VTc4Do0aY6k
 * cross-project session with Nova: https://www.youtube.com/watch?v=pvj8joZyhJ

Optimize create bootable volume from qcow2 image [cinder backed]
Rajat Dhasmana (whoami-rajat) discussed a WIP patch he's been working on to optimize a path for creating a bootable volume from a Glance image stored in the cinder glance_store: https://review.opendev.org/c/openstack/cinder/+/805949

The current workflow is roughly:
 * attach image-volume to glance host
 * attach new vol to cinder host
 * copy data from glance to cinder

The new workflow would be:
 * attach both image-volume and new volume to cinder host
 * copy data between them

One issue that came up is that it's not obvious how much of a performance improvement this would yield.

During the course of the discussion, another possible performance improvement came up. If cinder knows the virtual size of the image, it could convert directly to the destination without using the image_conversion_dir that gets filled up. (This may not be possible for all drivers, for example, RBD, that doesn't actually present the volume under /dev, but it's worth looking into if it averts the conversion directory space issue that seems to happen a lot).

conclusions

 * action (whoami-rajat): will look into quantifying the performance improvement he is after

Remove the exception mapping from user messages?
Briefly, cinder has a lot of API calls that return 202 (Accepted), but could still fail. You wind up with a volume in an error state and wonder what happened. The user messages API allows a developer to generate a message containing some info that an end user can understand and possibly give to a support person to troubleshoot further. To make sure that deployment details that shouldn't be exposed to end users aren't leaked, you must specify a pre-defined message field for an Action and a Detail that show up in the response to the end user.

Complete info about this feature can be found in the cinder developer docs: https://docs.openstack.org/cinder/latest/contributor/user_messages.html

The topic of this particular discussion is a comment in the code: "Also, use exception-to-detail mapping to decrease the workload of classifying event in cinder's task code." https://github.com/openstack/cinder/blob/master/cinder/message/message_field.py#L20-L21

The original idea for the exception mapping sounded good, but it's turning out to be not so good in practice. First off, it's a bit confusing to use, as described in the dev docs: https://docs.openstack.org/cinder/latest/contributor/user_messages.html#cinder-exception-in-context

If you pass an exception AND a Detail, if the exception is in the mapping, your Detail is ignored, and the mapped Detail is used. We don't want people to pass *only* an exception, because if it's not in the mapping, the result is a useless "Unknown Error" Detail message. But, if the exception may be raised in several places, the mapped Detail is going to be kind of generic. In some situations, you may know exactly what the Detail message should be, and so you need to *not* pass the exception (or your more precise Detail won't be used). In other situations, though, using the mapping may be fine.

You can see some examples on this review: https://review.opendev.org/c/openstack/cinder/+/786627/19

I think Gorka made a good point there that we want to encourage people to pass the exception, because at some point we may want an administrator to see more info in a user message response than a regular end user, and that extra info would be populated from the exception. But the current way the exception mapping works discourages a developer from passing the exception in cases where you know what Detail should be used.

So the question is: do we want to remove the exception mapping?

conclusions

 * Consensus was that we should keep the mapping, but introduce a new parameter to allow the developer to specify whether to prefer the passed Detail (new behavior) or the mapped Detail (old behavior).
 * action (rosmaita): put up a patch for this
 * Further, we want to pass the exception in all cases so that the User Message response can be enhanced at some point for administrators.
 * action (rosmaita): put up a followup patch adding back the exceptions to message_create calls
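
The behavior described in this section, together with the proposed parameter, can be sketched as follows. This is an illustration under stated assumptions, not cinder's message_field code: the function name `resolve_detail`, the parameter name `prefer_detail`, and the one-entry mapping are all hypothetical.

```python
UNKNOWN_ERROR = "An unknown error has occurred."

# Hypothetical mapping from exception class name to a generic Detail.
EXCEPTION_DETAIL_MAPPING = {
    "TimeoutError": "The operation timed out.",
}


def resolve_detail(exception=None, detail=None, prefer_detail=False):
    """Pick the Detail text shown to the end user.

    With prefer_detail=False (old behavior) a mapped exception overrides
    the passed detail. With prefer_detail=True (proposed behavior) the
    passed detail wins and the mapping is only a fallback.
    """
    mapped = None
    if exception is not None:
        mapped = EXCEPTION_DETAIL_MAPPING.get(type(exception).__name__)
    if prefer_detail and detail is not None:
        return detail
    if mapped is not None:
        return mapped
    if detail is not None:
        return detail
    # Only an unmapped exception (or nothing at all) was passed.
    return UNKNOWN_ERROR
```

With a flag like this, a developer can always pass the exception (so it is available for a future administrator-only view) without losing a more precise Detail.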

Completing sqlalchemy-migrate -> alembic migration
The Cinder project was really lucky that Stephen Finucane (stephenfin) (whose name you will recognize from nova, openstackdocstheme, a bunch of oslo libraries, and some other stuff) became a Hero of Cinder in the Xena cycle and helped us address a boatload of technical debt in moving away from using sqlalchemy-migrate (which is no longer supported) and moving to alembic (which in addition to being supported, has a lot of nice features). Stephen gave us a quick update about where this is going and what we need to do in Yoga to complete the transition.

First off, you can find the patches here: https://review.opendev.org/q/topic:%2522bp/remove-sqlalchemy-migrate%2522+(status:open+OR+status:merged)+project:openstack/cinder

While working on converting our sqlalchemy-migrate migrations to alembic, some gaps between our models and the results of the series of migrations have been revealed. Stephen has some patches up addressing these, and also added tests (based on oslo.db-provided stuff) to prevent this regressing in the future: https://review.opendev.org/c/openstack/cinder/+/813223

Stephen pointed out that SQLAlchemy 2.0 is a breaking change. There are A LOT of deprecations in 1.4 (that we're using now) that need to be addressed before we can start using 2.0. Some of this is being done in oslo_db, but some will have to be done in cinder. (We can use some nova changes as a model.) Also, sqlalchemy-migrate is NOT compatible with 2.0, so the transition to alembic must be complete by the time openstack makes the transition to SQLAlchemy 2.0.

For cinder devs who need to write new migrations for Yoga, Stephen said that the alembic docs are really good, and we can use the initial migration as an example. One big difference with sqlalchemy-migrate is that alembic doesn't rely on a filename convention for ordering; rather, metadata is maintained in the migration file. An example of the metadata is here: https://github.com/openstack/cinder/blob/master/cinder/db/migrations/versions/921e1a36b076_initial.py#L13-L18
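
To show what "ordering from metadata rather than filenames" means, here is a small self-contained sketch. It does not use alembic itself: it only mimics the `revision` / `down_revision` identifiers that alembic keeps in each migration file and walks the chain. Apart from the initial revision id taken from the filename linked above, the revision ids are made up, and a linear history is assumed.

```python
# Each dict mimics the identifiers kept at the top of a migration file.
# The list is deliberately out of order: order comes from the metadata.
MIGRATIONS = [
    {"revision": "bbbb00000002", "down_revision": "aaaa00000001"},
    {"revision": "921e1a36b076", "down_revision": None},  # initial migration
    {"revision": "aaaa00000001", "down_revision": "921e1a36b076"},
]


def upgrade_order(migrations):
    """Return revision ids in upgrade order by following down_revision links.

    Assumes a single linear chain (no branches or merges).
    """
    child_of = {m["down_revision"]: m["revision"] for m in migrations}
    order = []
    current = child_of.get(None)  # the revision with no parent comes first
    while current is not None:
        order.append(current)
        current = child_of.get(current)
    return order


print(upgrade_order(MIGRATIONS))
```

Because each file names its parent, a new migration only needs its `down_revision` set to the current head; no numeric filename prefix is involved.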

To generate a new migration file, you can use alembic directly, though Stephen has a patch up with some docs showing how to do it using tox.

As far as overall documentation goes, we have:
 * Alembic docs: https://alembic.sqlalchemy.org/en/latest/
 * The nova documentation lives here: https://docs.openstack.org/nova/latest/reference/database-migrations.html
 * Stephen has proposed equivalent cinder documentation in these two patches:
 * https://review.opendev.org/c/openstack/cinder/+/813225 (makes some slight tweaks to the cinder upgrade doc)
 * https://review.opendev.org/c/openstack/cinder/+/813226 (adds docs about how to generate a new migration file, mostly copy-paste from nova)

conclusions

 * Thanks to Stephen from the Cinder Team for addressing this and setting us up for success
 * action (everyone!): review patches that show up in this gerrit query: https://review.opendev.org/q/topic:%2522bp/remove-sqlalchemy-migrate%2522+(status:open+OR+status:merged)+project:openstack/cinder
 * (stephenfin) Don't hesitate to ping me on IRC/email if anything doesn't make sense. I have a good handle on this stuff rn, though I can't say for how long this will remain the case :)

Cross-project session with the Glance team
The Glance PTG etherpad is here: https://etherpad.opendev.org/p/yoga-glance-ptg

Optimize Upload volume to image in RBD backend
Rajat Dhasmana (whoami-rajat) explained that when we upload a volume as an image to glance's rbd backend, the upload starts with a zero-size rbd image and resizes it in chunks of 8 MB, which makes the operation very slow. The current idea is to pass the volume size as the image size to avoid these resize operations.
 * cinder spec: https://review.opendev.org/c/openstack/cinder-specs/+/810363
 * cinder patch: https://review.opendev.org/c/openstack/cinder/+/809523
 * performance impact table (in HTML): https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_131/810363/6/check/openstack-tox-docs/131e18d/docs/specs/yoga/optimize-upload-volume-to-rbd-store.html#performance-impact
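
A back-of-the-envelope sketch of why the chunked resize is costly. It assumes one resize per 8 MB chunk written, treats "8 MB" as 8 MiB, and assumes that no resize is needed when the final size is known up front; these are simplifications of the description above, not measurements, and the function name is hypothetical.

```python
MiB = 1024 ** 2
GiB = 1024 ** 3


def resize_count(image_bytes, chunk_bytes=8 * MiB, size_known=False):
    """Estimate how many resize operations an upload triggers.

    If the final size is known, the image can be created at full size and
    no resize is needed; otherwise assume one resize per chunk written.
    """
    if size_known:
        return 0
    # Ceiling division without floating point.
    return -(-image_bytes // chunk_bytes)


print(resize_count(10 * GiB))                   # chunked growth
print(resize_count(10 * GiB, size_known=True))  # size passed up front
```

Under these assumptions a 10 GiB volume goes from over a thousand resize operations to none, which is the effect the proposal is after; the spec's performance table linked above has the actual measurements.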

To do this, cinder needs to know that the store where the image is being put is RBD. On the glance side, the backend details of the store should not be exposed to end users (just like cinder doesn't expose backend details to end users). Maybe use a service token to get this info? Somehow we need a secure way to get this information.

One problem is that glance doesn't currently recognize service tokens, so that support would have to be added. Also, there's a more general question about how the secure RBAC work impacts using service tokens (we don't know).

Another general problem with this RBD shortcut around Glance (which also arises with nova using ceph) is that glance doesn't have the multihash properties for such images. Thus users who include checksum validation in their workflow can't consume these images.

We could handle the security/performance tradeoff by allowing this optimization to be turned off in cinder. It will be slower, but you'll get the multihash properties for later validation.

conclusions

 * action (whoami-rajat) revise cinder spec in light of this discussion
 * action (whoami-rajat) add a glance spec for the Stores API change

Optimization issue in Upload volume to image when glance backend is cinder
Rajat Dhasmana (whoami-rajat) explained that when the glance backend is cinder and we upload a volume to an image, there's an optimization that clones the volume as an image-volume and registers the location in glance. This causes a problem when the volume is cloned in a different backend (let's say RBD) than glance's default backend (let's say cinder-lvm). The solution currently under discussion is to pass the volume type as glance metadata and compare it on the glance side to decide whether the operation passes or fails.

conclusions

 * If we implement the code to expose the store type from glance API, we can query for the default store and do the optimization (or not) accordingly, because this scheme has the same performance/security tradeoff as the one discussed earlier. So we could use the same glance API change and cinder config option to control the optimization as the previous proposal.

General discussion

 * What's up with the virtual size in Glance?
 * glance reads the image headers and pulls the virtual size from there and populates it (maybe not for all formats, though)
 * patch introducing this behavior: https://review.opendev.org/c/openstack/glance/+/744234 (victoria)
 * virtual_size is a read-only property in glance (can't be modified via API, so it must be set by glance)
 * image co-location (put snapshot in same store as base image)
 * this may be important for the edge case
 * (rosmaita) I would've sworn that I had read a spec for this, but I can't find it anywhere

Secure RBAC second thoughts
We resumed the discussion of the complications Dan and Lance brought up during the 'Community goals and the "secure and consistent RBAC" effort' session on Tuesday, in order to start preparing for the OpenStack-wide discussion of this issue at the TC sessions on Friday. Thoughts were collected on this etherpad: https://etherpad.opendev.org/p/cinder-yoga-secure-rbac-more-thoughts

Cross-project session with the Nova team
The Nova PTG etherpad: https://etherpad.opendev.org/p/nova-yoga-ptg

Volume re-image
Rajat Dhasmana (whoami-rajat) is interested in extending the current nova instance re-image feature to volume backed instances. (It is currently only supported for instances that boot from the system disk.) There are some old blueprints for this:
 * Cinder BP: https://blueprints.launchpad.net/cinder/+spec/add-volume-re-image-api
 * Nova BP: https://blueprints.launchpad.net/nova/+spec/volume-backed-server-rebuild

The workflow would be:
 * user calls nova -> reimage server
 * nova calls cinder -> reimage volume
 * cinder tells nova -> volume is ready (nova needs updated connection info)
 * nova resumes
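
The four steps above can be sketched as a toy simulation. Everything here is hypothetical: the FakeNova and FakeCinder classes and their method names are stand-ins invented for illustration, not the real nova or cinder APIs, and the callback runs synchronously whereas the real services would communicate asynchronously.

```python
class FakeCinder:
    """Stand-in for the volume service (hypothetical, not the cinder API)."""

    def __init__(self, on_volume_ready):
        self.on_volume_ready = on_volume_ready

    def reimage_volume(self, volume_id, image_id):
        # A real service would rewrite the volume from the image here.
        connection_info = {"volume_id": volume_id, "image_id": image_id}
        # Step 3: tell nova the volume is ready, with updated connection info.
        self.on_volume_ready(volume_id, connection_info)


class FakeNova:
    """Stand-in for the compute service (hypothetical, not the nova API)."""

    def __init__(self):
        self.cinder = FakeCinder(on_volume_ready=self.volume_ready)
        self.steps = []
        self.connection_info = None

    def reimage_server(self, volume_id, image_id):
        self.steps.append("reimage server requested")        # step 1
        self.steps.append("asked cinder to reimage volume")  # step 2
        self.cinder.reimage_volume(volume_id, image_id)
        self.steps.append("resumed")                         # step 4
        return self.steps

    def volume_ready(self, volume_id, connection_info):
        self.steps.append("volume ready")                    # step 3
        self.connection_info = connection_info


nova = FakeNova()
print(nova.reimage_server("vol-1", "img-1"))
```

The point of the sketch is only the ordering: nova cannot resume until cinder has reported back with the updated connection info.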

conclusions

 * The nova team was supportive of this feature. It will close a gap in the current server-reimage feature.

General discussion

 * Luigi Toscano (tosky) mentioned that the devstack-plugin-nfs gates are currently blocked for ussuri. The ussuri job consistently fails.  It's being tracked by https://bugs.launchpad.net/nova/+bug/1916750
 * you can see the blocked patches stacking up here: https://review.opendev.org/q/project:openstack/devstack-plugin-nfs+status:open
 * the nova team said they'd look into it soon
 * The nova team mentioned that cinder should be using an admin client to contact the nova Events API (at some point it was using only the user-level token)
 * specifically happens for resize
 * there's a workaround: https://bugs.launchpad.net/openstack-ansible/+bug/1902914 (but we should really fix this)
 * what's weird is that it looks like we're using an elevated client on the cinder side: https://github.com/openstack/cinder/blob/01183a171776257b7aaf27220ce4113403f257bc/cinder/compute/nova.py#L147
 * Rajat can look into this because he'll be using this code to notify nova that the volume re-image has completed

recordings

 * https://www.youtube.com/watch?v=qU-rIt_5fjI
 * https://www.youtube.com/watch?v=HdUqlIVRbrQ

Quotas!!!
Gorka Eguileor (geguileo) did a bunch of quotas work in Xena, but the reservations code (inherited from nova) is way more complicated than is necessary for cinder. Gorka wanted to get some agreement on the direction he'd like to take in Yoga to improve cinder quotas.

Gorka gave a quick presentation on how quotas are used in cinder and why they are complicated. (Watch the recording!)

Some issues that came up were:
 * Need to consult with an actual expert to get the indexes right and do a comparison with current performance. Simon was really worried that what's considered acceptable performance in an OpenStack API is pretty slow relative to backend performance, and he's worried about anything that might make the API slower.  As Eric points out each time this comes up, though, databases are optimized for counting, and counting is better than using a possibly stale stored value.  Gorka is OK with checking with an expert to make sure we've got the correct indices to make this fast.
 * We agreed that it's acceptable for the quotas to work inconsistently while doing a rolling upgrade. There doesn't seem to be a point in building in a compatibility layer to preserve the "old" quota system during an upgrade (especially since nobody likes the old system because it already gives you inconsistent results!)
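
As an illustration of the "count instead of trusting a stored value" idea, here is a minimal sketch using Python's built-in sqlite3 module. It is not cinder's schema or code: the table, columns, index, and function names are all hypothetical, and sqlite is used only so the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE volumes (
        id INTEGER PRIMARY KEY,
        project_id TEXT NOT NULL,
        size_gb INTEGER NOT NULL,
        deleted INTEGER NOT NULL DEFAULT 0
    );
    -- Hypothetical index covering the columns the usage query touches.
    CREATE INDEX volumes_quota_idx ON volumes (project_id, deleted, size_gb);
""")
conn.executemany(
    "INSERT INTO volumes (project_id, size_gb, deleted) VALUES (?, ?, ?)",
    [("proj-a", 10, 0), ("proj-a", 20, 0), ("proj-a", 5, 1), ("proj-b", 50, 0)],
)


def current_usage(conn, project_id):
    """Count live volumes and gigabytes for a project straight from the table."""
    count, gigabytes = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(size_gb), 0) FROM volumes "
        "WHERE project_id = ? AND deleted = 0",
        (project_id,),
    ).fetchone()
    return count, gigabytes


def within_quota(conn, project_id, new_size_gb, max_volumes, max_gigabytes):
    """Check a new volume request against limits using freshly counted usage."""
    count, gigabytes = current_usage(conn, project_id)
    return count + 1 <= max_volumes and gigabytes + new_size_gb <= max_gigabytes


print(current_usage(conn, "proj-a"))               # (2, 30)
print(within_quota(conn, "proj-a", 100, 10, 100))  # False
```

Because usage is recomputed on each check, there is no separately stored counter that can drift out of sync; whether this is fast enough in practice depends on having the right indexes, which is the question raised above.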

conclusions

 * action (geguileo) write up a spec for the Yoga quotas improvements

community goal: secure RBAC
The cinder team adjourned to the TC meeting in the "Juno" room. We needed to participate in this discussion, because the concept of the system-* personas has changed and we need to get a better understanding of where the community is going with this.

The developing discussion is on this etherpad: https://etherpad.opendev.org/p/policy-popup-yoga-ptg

This recording link is queued up to the RBAC discussion: https://www.youtube.com/watch?v=RT492bi6Xto&t=2470s

We also stuck around for the community goals discussion; looks like there will be only the one goal for Yoga.

conclusions

 * We were hoping to get the Yoga work done very early in the cycle, but that is at risk (because we no longer know exactly what the work is).
 * action (rosmaita, abishop): participate in the policy-popup meetings to get this hammered out as soon as possible

Clarify the Volume Driver API, Part 2
We ran out of time for this one, but what we learned in Part 1 is that this is likely to take some time. It will be obvious what to do with some functions and documentation, while others may take a bit of research.

conclusions

 * We'll continue to do this as an ongoing activity throughout the cycle, maybe as a "volume driver API function of the week", and incrementally get the entire interface documented.

Yoga priorities and responsibilities
The official release schedule has been updated with cinder-specific events, so you can check there for dates and deadlines: https://releases.openstack.org/yoga/schedule.html

Cycle priorities for the Cinder project
 * secure RBAC
 * our official position is that we are against resorting to hacks to get secure RBAC done
 * we will work to get the requirements better defined, hopefully quickly
 * possibly leave out all_tenants=true on the list calls if the correct way to do this is unclear, and then we'll fix it in Z
 * os-brick
 * focus early in the cycle (before Milestone 1) on the patches that cannot break anything that exists now (because they are a completely different path)
 * gpg encryption
 * NVMe agent
 * NVMe multipathing
 * cinder
 * Gorka plans to work on quotas ... will need reviews of the POC patches and the spec as high priority, though we may not be able to get the actual work done in Yoga
 * sqlalchemy-migrate -> alembic patches : want to merge early in Yoga
 * volume driver API clarification ... will work on this throughout the cycle ("function of the week") and incrementally clarify/improve the volume driver API definition
 * re-image volumes
 * this also includes better handling of the way we contact the nova Events API
 * Rajat's cinder/glance interaction improvements/optimizations
 * CI situation
 * Jay will follow up with Sean about the CI stats
 * Need to encourage third-party CI maintainers to fix the gerrit-comment-pollution problem
 * add an 'autogenerated' tag to their gerrit comments
 * instructions here: http://lists.openstack.org/pipermail/openstack-discuss/2021-May/022733.html
 * static type checking
 * we are convinced that this will improve code quality once we have sufficient coverage, so we want to continue this work
 * Eric has been posting and rebasing patches, but not getting much review action
 * we will have a "mypy patch of the week" throughout the cycle to get work merged on this
 * get 2 cores + one non-core to commit to reviewing the patch and quickly re-reviewing if there are problems to get it merged within the next week
 * default type improvements
 * (as discussed on Tuesday)
 * mostly documentation changes
 * may introduce a new microversion to enhance the volume-type-list response to indicate what the default type is

conclusions

 * action (rosmaita) update the driver docs with the info about using the 'autogenerated' tag when CIs contact gerrit
 * a few people are in favor of picking an additional "patch of the week" (non-mypy), but I want to hold off until we see how the mypy and volume-driver-API-revision work goes