CinderYogaPTGSummary
Revision as of 16:19, 28 October 2021

Introduction

This page contains a summary of the subjects covered during the Cinder project sessions at the Project Team Gathering for the Yoga development cycle, held virtually October 18-22, 2021. The Cinder project team met from Tuesday 19 October to Friday 22 October, for 4 hours each day (1300-1700 UTC).

Subset of the Cinder Team at the Yoga (Virtual) PTG, October 2021.


This document aims to give a summary of each session. More context is available on the cinder Yoga PTG etherpad:


The sessions were recorded, so to get all the details of any discussion, you can watch/listen to the recording. Links to the recordings are located at appropriate places below.

Tuesday 19 October

recordings

Greetings, user survey discussion

For the benefit of people who haven't attended, this is the way the cinder team works at the PTG:

  • sessions are recorded
  • please sign in on the "Attendees" section of this etherpad for each day
  • all notes, questions, etc. happen in the etherpad; try to remember to preface your comment with your irc nick
  • anyone present can comment or ask questions in the etherpad
  • also, anyone present should feel free to ask questions or make comments during any of the discussions
  • we discuss topics in the order listed in the etherpad, making adjustments as we go for sessions that run longer or shorter
  • we stick to the scheduled times for cross-project sessions, but for everything else we are flexible


Next, we took a look at the Project Specific Feedback Responses from the latest User Survey. Here's an ethercalc that's organized to make the cinder-relevant responses easier to see: https://ethercalc.openstack.org/=2021-user-survey

Our question on the survey was: "If there was one thing you would like to see changed (added, removed, fixed) in Cinder, what would it be?"

We received 39 responses (out of about 425 responses to the survey).

Looking through the responses, there were requests for features that we already have and some comments that we didn't understand. We decided to send a response to the mailing list mentioning the implemented features and starting a discussion on the items we didn't understand. (Survey responses are anonymous, so we can't contact operators directly.) Amy Marrich (spotz, who facilitates the operator meetups) has mentioned that operators tend to follow the meetup twitter account, so a good way to contact operators is to post to the ML and then notify her to tweet out the link from the ops meetup account.

The quantitative responses (for example, how many deployments include cinder) are on the OpenStack analytics page: https://www.openstack.org/analytics/

Some feedback for the User Survey team:

  • The data in the report on the website is really difficult to consume (it's displayed in non-resizable graphs, and it's difficult to distinguish the percentages for "interested", "testing" and "production"). There's an option to download as PDF, but that gives you the same non-resizable graphs. It would be helpful to be able to download the data as a CSV file.
  • For the next survey, we want to add the question: What driver(s) are you using for your Cinder environment?
  • We like our current question, but a lot of the answers are too vague -- do you have any suggestions on how to indicate to people that they should be clear and specific?

conclusions

  • action (rosmaita): start an etherpad for the response to operators
  • action (rosmaita): communicate our feedback to the User Survey team

In-flight image encryption update

Josephine Seifert (Luzi) updated us on the status of the in-flight encryption effort. The current plan is to have "experimental Image Encryption without Secret Consumers". The reason is to allow coding and reviewing of the Image Encryption work (and set up CI) while waiting for the Secret Consumers API. (The Secret Consumers API in Barbican will allow services to register that a secret is in use, though the secret owner can still delete it by using a --force flag. The holdup is that microversioning needs to be introduced to the Barbican API before the new API can be added.)

The current idea is to release this as an Experimental feature and "officially" release when the Secret Consumers API is ready. This strategy is described in a Glance spec-lite: https://review.opendev.org/c/openstack/glance-specs/+/792134/

What's required from the cinder team for this work is:

  • os-brick -- will have the PGP encryption code to be used by the services. The patch for this is available for review: https://review.opendev.org/709432
  • cinder -- when downloading an image from glance, cinder will need to decrypt such images to write them to volumes
  • cinder -- upload volume to image (maybe? need to check the spec)
  • cinder -- will also need to use Secret Consumers when available (if cinder does encryption on upload)
    • may also want to add Secret registration to our current luks encrypted volume code to protect encryption key ids
  • what about the glance cinder backend?
    • we have an optimized path that clones instead of downloads and copies onto the volume
    • need to handle this case
  • what about the image-volume cache?
    • should these things not be included in the cache?
    • need to see what the spec says about this


The current cinder spec is: https://specs.openstack.org/openstack/cinder-specs/specs/xena/image-encryption.html

There isn't a patch yet for the cinder changes, though there is a glance PoC patch with placeholders for Secret Consumers: https://review.opendev.org/c/openstack/glance/+/705445

conclusions

  • action (rosmaita) Review the cinder spec again. It was approved in Train, and hasn't really been looked at since.
  • action (cinder team) Interested parties should also look at the spec again.
  • action (whoami-rajat) Review the spec again specifically with cinder glance_store cases in mind.
  • action (Luzi) Gorka pointed out that it will be easier to review if some cinder POC patches are available:
    • cinder patch (pending?)
    • cinder-spec (done)
    • os-brick patch (ready)
    • glance patch (ready)

Support for quiesced snapshot/backup

Arthur Outhenin-Chalandre (Mr_Freezeex) wants to support quiescing volumes for backup or snapshot, following up on some previous proposals:


The NetApp and VMware volume drivers already support quiesced snapshots without nova calls, and this could be extended to other drivers.

Some points that came up during discussion were:

  • consistency groups: give you only crash consistency, the same as current regular snapshots made from the cinder side. If crash consistency is all people need, then this proposal isn't necessary
  • question: how is this supposed to work for multiple volumes attached to an instance?
  • generic groups allow arbitrary grouping of volumes, how will this impact them?
  • is cinder the correct entrypoint for this request?
  • Simon mentioned that quiescing the entire VM is overkill for taking a snapshot (and is possibly risky), especially if most people are set up to deal with crash-consistent snapshots anyway
  • Gorka pointed out that we need to consider some broader use cases and decide whether they need to be addressed, for example:
    • Single volume: snapshot with quiesce (easy case)
    • All volumes in a single VM
      • Belong to the same group (generic or consistency)
      • Are independent volumes (including boot)
    • All volumes in a group (could be attached to different VMs)
    • multiattach (single volume attached to multiple VMs)


An implementation doesn't have to handle all of the above (and there are probably some more use cases that will come up), but we do want to be clear on exactly what use cases are being addressed and which ones we aren't going to handle.

conclusions

  • action (Mr_Freezeex) will propose a spec addressing the above issues

default types retrospective

Rajat Dhasmana (whoami-rajat) led a discussion about where we are currently with default volume types. Here are his slides if you want to follow along: https://docs.google.com/presentation/d/1qEI3PApyzYWBnoUUTOnxR8zQlI1s3YwNjt5CZqHVuSA/edit?usp=sharing

Back in Train, we decided that cinder would no longer allow untyped volumes (there are places in the code, particularly in some drivers, where a volume type is assumed, and when it's not there bad stuff happens). So to address this, a __DEFAULT__ type (a very minimal type, basically just name, id, and description) was introduced in the Train release to guarantee that there was at least one volume type in every deployment, and any untyped volumes were assigned this type in the database. Further, if the default_volume_type cinder option wasn't set, the __DEFAULT__ type would be used.

A problem was that some deployment tools (and operators) had already created and set a default_volume_type, and operators reported that end users were confused when they saw __DEFAULT__ in the volume-type-list response, and would explicitly create volumes of type __DEFAULT__. This wasn't what operators wanted; they wanted end users to simply use the configured default type. The problem was that the __DEFAULT__ type couldn't be deleted (because its purpose was to make sure that there was always some volume type available in the deployment).

We reworked the logic on this later (and backported it to Train) so that the default_volume_type cinder option is required (with a default value of __DEFAULT__) and cinder will not allow the volume type that's the value of default_volume_type to be deleted (while it's the default), and that there will always be at least one volume type. So it's possible for operators to treat the __DEFAULT__ type like any other type (for example, it can be deleted if there are no existing volumes of that type).
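The reworked behavior described above turns on a single option. A minimal cinder.conf sketch (the type name "premium" is a made-up example, not a recommendation):

```ini
[DEFAULT]
# Required since the Train-era rework (defaults to __DEFAULT__ if unset).
# Whatever type is named here cannot be deleted while it is the default.
default_volume_type = premium
```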

However, __DEFAULT__ is still there out of the box, and it's causing confusion for deployments that don't want to use it, or where it's unnecessary.

We discussed this a bit and concluded that it's a deployment responsibility to decide what to do about the __DEFAULT__ volume type in a particular deployment.

But there are definitely some things we can still do on the cinder side to improve the situation:

  • Make sure it's clear in the operator docs that we already have a mechanism in place to avoid creating untyped volumes, so it's OK to delete the __DEFAULT__ type if it's not used anymore. (We picked its name so that it wouldn't clash with any existing volume types, but "__DEFAULT__" looks official and scary, and can lead operators to think that it's necessary for cinder's correct functioning.)
    • add this info somewhere in the operator configuration docs
  • In victoria, cinder introduced default types per project: https://specs.openstack.org/openstack/cinder-specs/specs/victoria/default-volume-type-overrides.html. We need to promote the idea that if a user wants to see what the effective default volume type is, they need to make the GET /v3/{project_id}/types/default API call, not look at the volume-type-list and try to figure it out from the name or description. We can improve the documentation to promote this call:
    • upgrade the API-REF
      • add something into the volume-create section
      • probably also in the type list
    • upgrade Client help as well?
      • check to see what the volume-create help text says, add something there
    • in the installation docs, say something about the importance of the default type config and that the __DEFAULT__ type can (and should) be removed (by the operator/deployment tool) after startup, once a default has been created and set in cinder.conf
  • Gorka suggested that maybe we could introduce a microversion that somehow highlights your effective default volume type when you make the volume-type-list request
    • horizon may already be doing something like this (where "like this" means highlighting the default volume type)? In any case, we should do it too.
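The per-project lookup behind GET /v3/{project_id}/types/default can be illustrated with a small sketch. This is a hypothetical illustration of the resolution order, not cinder's actual code; the function and data shapes are invented:

```python
# Hypothetical sketch: a per-project override (Victoria's
# default-volume-type-overrides feature) wins over the deployment-wide
# default_volume_type config option. None of these names are cinder internals.

def effective_default_type(project_id, project_defaults,
                           config_default="__DEFAULT__"):
    """Return the effective default volume type name for a project."""
    return project_defaults.get(project_id, config_default)

# A project with an override sees its own default; everyone else sees the
# configured deployment-wide default.
overrides = {"tenant-a": "fast-ssd"}
print(effective_default_type("tenant-a", overrides))  # fast-ssd
print(effective_default_type("tenant-b", overrides))  # __DEFAULT__
```

This is why users should call the types/default API rather than guess from the volume-type-list: the answer depends on which project is asking.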

conclusions

  • action (rosmaita): make sure people follow up on this ... we've had inquiries about small features and documentation work from various community members, and this would be ideal for such people

Driver Support Matrix update

This was a participatory activity. The idea (that has come up several times in past PTGs) is that we need to review the content of the support matrix. There are features called out that aren't really "missing" if a driver doesn't support them, and there may be some capabilities that are good to know about that aren't specified. We've discussed this a lot, let's just sit down and do it.

The aim of the support matrix is to be useful to operators when selecting a backend. What it currently looks like is this: https://docs.openstack.org/cinder/latest/reference/support-matrix.html

The team made notes in an etherpad, you can see the full discussion there: https://etherpad.opendev.org/p/yoga-cinder-support-matrix

conclusions

  • It would be a lot easier to read if we swapped columns and rows. Chuck did a quick mockup and it's pretty clear that he's right. We have some custom python sphinx code that generates the matrix from an RST and a config file, so this would have to be changed in our custom code.
  • action (rosmaita): Update the note on what happens to drivers that don't have a working CI. It is no longer accurate.
  • Open question: should we call out microversions associated with features? For example, [operation.online_extend_support] is associated with microversion 3.42, should we mention that in support matrix?
  • Looks like the recent gerrit upgrade broke the script that monitors our CI stats: http://cinderstats.ivehearditbothways.com/cireport.txt
    • action (jungleboyj): Follow up with Sean on if he can fix this and if he plans to continue maintaining it.
  • action (rosmaita): update the basic cinder features list
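Swapping the matrix's rows and columns, as in Chuck's mockup, is a simple transform on the data our custom sphinx code already builds. A sketch of the idea (the data here is illustrative, not the real support-matrix config):

```python
# Illustrative only: turn a feature->driver support mapping into a
# driver->feature mapping, which is what swapping rows and columns of the
# rendered support matrix amounts to.

def transpose(matrix):
    """Turn {row: {col: value}} into {col: {row: value}}."""
    out = {}
    for row, cols in matrix.items():
        for col, value in cols.items():
            out.setdefault(col, {})[row] = value
    return out

features = {
    "online_extend": {"lvm": True, "nfs": False},
    "multiattach": {"lvm": True, "nfs": True},
}
by_driver = transpose(features)
print(by_driver["nfs"])  # {'online_extend': False, 'multiattach': True}
```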

Interop!

The plan was to meet with the Interop WG about adding capabilities to the trademark guidelines.

Some links to remind us what we have questions about:


The representative from the Interop WG was delayed, and then had to drop, but left these requests:

  • review and comment on https://review.opendev.org/c/openinfra/interop/+/811049/3/guidelines/2021.11.json for volume/cinder sections and if these should still be required, if there are some known issues or tempest changes for coverage
  • What is the new functionality and tests added in the Wallaby and Xena cycles (separately), so we can consider it for future guidelines?
  • What are your recommendations on the functionality not covered by current guidelines that reached maturity and usage by the customers?
  • Finally, how does cinder want to handle microversions? Currently we cover API version 3.0 for interop. What is the range of API microversions that each release supports? Then we can define the overlap of these for the 3 latest + more future releases that we can claim every cloud implementation must support.

conclusions

  • We could still use input from the Interop WG to satisfy the above requests.
  • action (rosmaita): reach out to the Interop WG and invite them to an upcoming cinder meeting

Community goals and the "secure and consistent RBAC" effort

At the beginning of the session, it looked like we had everything planned out:


But, Dan Smith and Lance Bragstad attended and hipped us to the current discussion:

  • there's an ongoing issue across projects, namely, the misuse of system scope to allow admins to perform actions within projects
    • system-* personas should not be allowed to do stuff within projects
    • system scope covers operations on the system (and must respect project boundaries)
  • system-scope allows you to operate on system resources (e.g., services, clusters, volume-types)
  • project-scope allows you to operate on project resources (e.g., volumes, snapshots, backups)
    • example: system-* personas can interact with volume-types (CRUD), but not use them to create volumes (because a volume is a project resource)
  • only project-* personas can act within projects
  • the plan is to rely on inheritance from higher level domains in order to create a persona who can act within individual projects without being "in" those projects
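In oslo.policy terms, the split sketched above might look roughly like this. The rule names and check strings are illustrative, not cinder's actual policy defaults:

```yaml
# Illustrative only: system-scoped personas manage system resources
# (e.g. volume types); project-scoped personas act on project resources
# (e.g. volumes).
"volume_extension:types_manage": "role:admin and system_scope:all"
"volume:create": "role:member and project_id:%(project_id)s"
```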


We continued the discussion here: https://etherpad.opendev.org/p/cinder-yoga-secure-rbac-more-thoughts

conclusions

  • action (rosmaita, abishop): follow up to figure out what's going on

Wednesday 20 October

recordings

os-brick for NVMe - The Next Steps

Simon Dodsley (simondodsley) gave an update on the working group of people interested in extending the NVMe support in os-brick. Right now, Pure, Kioxia, and Dell/EMC are all interested in this technology, and working on support. We are gathering info and ideas here: https://docs.google.com/document/d/1_xXgYOElC5G8RawWyEEHKOF1A5E3RmtlUlmcqCtuUzA/

Ping simondodsley in IRC if you can't access the doc. Anyone interested in NVMe is welcome to participate in this effort.

Gorka has put up two patches so that we'll be able to do testing using LVM. These use the NVMe tools/protocols to export LVM volumes, using NVMe instead of iSCSI to make the volumes available. This way we can have testing independent of NVMe solution vendors as they are still getting their third-party CI systems up (there are problems getting network cards due to covid-19):


One thing that came up in discussion is that when reviewing os-brick patches, there's not a clear way to know what connector is being used. You can usually tell from the CI name what cinder driver is being used, but the connector isn't so obvious. This has been a problem in the past with the nvmeof connector, where it hasn't been clear which third party CI results should all pass to indicate that the connector is working. Gorka suggested that we create a reviewer chart for os-brick connectors and the CIs that test them (including description on variants, for example, iSCSI shared targets, iSCSI individual targets). It might also be possible to require that os-brick CI jobs add to their name the connector they are using.
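The reviewer chart Gorka suggested could start life as simple structured data before anyone builds tooling around it. A hypothetical sketch (the connector variants come from the discussion above; the CI names are placeholders):

```python
# Hypothetical sketch of a reviewer chart mapping each os-brick connector
# (including variants) to the third-party CIs that exercise it.
# All CI names below are placeholders, not real systems.

CONNECTOR_CIS = {
    "nvmeof": ["Vendor-A CI", "Vendor-B CI"],
    "iscsi (shared targets)": ["Vendor-C CI"],
    "iscsi (individual targets)": ["Vendor-D CI"],
}

def cis_for_connector(connector):
    """Which third-party CI results should all pass for this connector?"""
    return CONNECTOR_CIS.get(connector, [])

print(cis_for_connector("nvmeof"))  # ['Vendor-A CI', 'Vendor-B CI']
```

A reviewer looking at an os-brick patch touching a given connector could then see at a glance which CI results matter.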

This led to a general discussion of what we want to get done in os-brick in the Yoga cycle:

Note that neither of the above efforts impacts the mainline os-brick code, so they are very low risk for anyone who doesn't use them.

conclusions

  • reminder: Changes to be included in the Yoga release of the os-brick library must be merged by Thursday 10 February 2022 (20:00 UTC)
  • action: create the reviewer chart of os-brick connectors and the CIs that test them

Clarify the Volume Driver API, Part 1

This was another participatory activity. Over the past few PTGs, we've all agreed that it would be helpful to driver contributors if the cinder project team were to more clearly specify the interface that drivers are expected to implement. We have some tooling in place to make sure that specific functions are implemented, but there are cases where the documentation describing the semantics and even what's expected to be returned are either outdated or incomplete. Everyone always agrees that this is a good idea, so instead of all waiting for someone else to get this going, we decided to discuss this in a working session at the PTG to get something hammered out.
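As an example of the kind of specification the team has in mind, here is a hedged sketch of how one driver method's contract could be documented. create_volume is a real driver method, but the docstring wording and the class below are ours, invented for illustration; the real interface lives in cinder's driver base classes:

```python
# Illustrative sketch only: what a clearly specified driver interface
# method might look like, with semantics and return value spelled out.
import abc


class VolumeDriverInterface(abc.ABC):
    """Sketch of a documented volume driver interface (not cinder's real one)."""

    @abc.abstractmethod
    def create_volume(self, volume):
        """Create a volume on the backend.

        :param volume: the Volume object to create
        :returns: a dict of model updates to persist (for example,
                  provider_location), or None if nothing needs updating
        :raises: an exception on failure; the volume is then set to
                 'error' status
        """
```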

The idea is to get started today, and follow up on Friday. The notes from the session are here: https://etherpad.opendev.org/p/yoga-volume-driver-API

The Friday session was cancelled so that we could attend the TC session about the Yoga community goal ("secure and consistent RBAC").

conclusions

  • The discussion was productive, as we identified two functions right away that should be removed. We'll continue to do this as an ongoing activity throughout the cycle, maybe as a "volume driver API function of the week", and incrementally get the entire interface documented.

Happy Hour and mascot/team name discussion

Thursday 21 October

recordings


Friday 22 October

recordings