CinderZedPTGSummary

Introduction

The fifth virtual PTG for the Zed cycle of Cinder was conducted from Tuesday, 5th April, 2022 to Friday, 8th April, 2022, 4 hours each day (1300-1700 UTC). This page will provide a summary of all the topics discussed throughout the PTG.

File:Cinder-zed-ptg.png

This document aims to give a summary of each session. More context is available on the cinder Zed PTG etherpad:

https://etherpad.opendev.org/p/zed-ptg-cinder

The sessions were recorded, so to get all the details of any discussion, you can watch/listen to the recording. Links to the recordings are located at appropriate places below.

Tuesday 05 April

recordings

For the benefit of people who haven't attended, this is the way the cinder team works at the PTG:

sessions are recorded
please sign in on the "Attendees" section of this etherpad for each day
all notes, questions, etc. happen in the etherpad; try to remember to preface your comment with your irc nick
anyone present can comment or ask questions in the etherpad
also, anyone present should feel free to ask questions or make comments during any of the discussions
we discuss topics in the order listed in the etherpad, making adjustments as we go for sessions that run longer or shorter
we stick to the scheduled times for cross-project sessions, but for everything else we are flexible

Release cadence discussion: tick-tock model

https://governance.openstack.org/tc/resolutions/20220210-release-cadence-adjustment.html

There was a PTL-TC session about this on 4th April, 2022 (Monday) and the following points were discussed:

This only affects the upgrade path, not the release model, which remains the same (i.e. a release every 6 months)
It proposes a tick-tock release model in which releases alternate: if a release is a tick, the subsequent release is a tock, and so on
The new model makes it possible to upgrade directly from one tick release to the next tick release (skipping the tock in between), but a direct tock->tock upgrade is not supported
There is a job in place, grenade-skip-level, that runs on tick releases and checks the tick->tick upgrade (i.e. N-2 to N)
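The upgrade rules above can be sketched in a few lines of Python. This is only an illustration: the release names and the helper functions are hypothetical, the first release in the list is assumed to be a tick, and the usual N-1 to N upgrade is assumed to remain supported for every release.

```python
# Hypothetical release names; the first one is assumed to be a "tick".
RELEASES = ["AA", "BB", "CC", "DD", "EE"]


def kind(release):
    """Return "tick" or "tock" for a release, alternating from the first."""
    return "tick" if RELEASES.index(release) % 2 == 0 else "tock"


def can_upgrade_directly(src, dst):
    """Return True if a direct upgrade from src to dst is supported."""
    gap = RELEASES.index(dst) - RELEASES.index(src)
    if gap == 1:
        # Assumption: upgrading to the very next release stays supported.
        return True
    if gap == 2:
        # Skipping one release is only supported from tick to tick.
        return kind(src) == "tick" and kind(dst) == "tick"
    return False
```

With these names, AA to CC is a supported skip-level upgrade, while BB to DD has to go through CC.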

There's a patch up by Gorka documenting the impact of the new release cadence on Cinder (https://review.opendev.org/c/openstack/cinder/+/830283). Based on the discussion at the PTG, it needs changes covering the following points:

removal of configuration option
deprecating or removing a driver
Python version support
Backports (works the same way)
New deprecation policy (deprecation policy in the project team guide: https://docs.openstack.org/project-team-guide/deprecation.html)

There will be a 2-cycle deprecation process. For example, suppose a config option "cinder_option_foo" is deprecated in AA (tick). The deprecation must continue through BB (tock), and the option can then be removed in CC (the next tick).
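As a rough illustration of the example above, the earliest removal release can be computed from the release in which the deprecation started. The release list and the function name are hypothetical, and only the case described in the session (deprecation starting in a tick) is modelled.

```python
# Hypothetical release names; even positions are ticks, odd ones are tocks.
RELEASES = ["AA", "BB", "CC", "DD", "EE"]


def earliest_removal(deprecated_in):
    """Return the first release in which a deprecated option may be removed."""
    idx = RELEASES.index(deprecated_in)
    if idx % 2 != 0:
        # The summary only describes deprecations that start in a tick.
        raise ValueError("only the deprecated-in-a-tick case is described")
    # Deprecated in a tick, still deprecated in the tock, removable in the
    # following tick.
    return RELEASES[idx + 2]
```

For the option in the example, `earliest_removal("AA")` gives `"CC"`.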

conclusions

action: geguileo to update the patch with the current discussion points

Best review practices doc

whoami-rajat is putting together a review doc to help new reviewers review changes efficiently and so raise the quality of reviews. Link: https://review.opendev.org/c/openstack/cinder/+/834448

One point from the discussion should be included in the doc: a reviewer doesn't have to review everything in a change, but stating what they did review helps the other reviewers a lot. E.g. if someone has reviewed the release note, the other reviewers can spend their time elsewhere. It was also suggested to add review points specific to the tick/tock release cadence.

conclusions

action: whoami-rajat to update the review doc with the suggested points

Secure RBAC

We made the project ID optional in the URL to support the system scope use case, with the plan to expand scopes from project level to system level in Zed. System-level personas will deal with system-level resources that are not project specific, e.g. host information. We also have to take mixed personas into account for some resources. For example, a volume type is a system-level resource, but a private volume type acts at the project level, and volume types also need to be listed by project members so they can create resources like volumes.

The community goal is divided into different phases and the goals for every phase are defined as follows:

Phase 1: project scope support -- COMPLETED
Phase 2: project manager and service role
Phase 3: (in AA) implement system-member and system-reader personas

The two new roles, manager and service, are intended to serve the following use cases:

Manager: has more authority than a member but less than an admin. Currently, it is useful for setting the default volume type for a project.
Service: useful for service-to-service interaction. E.g. we currently require an admin token for cinder-nova interaction, which lets a service like cinder do anything in nova as an admin.
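One way to picture where the manager role sits is a simple ordering of project roles. The sketch below is not how oslo.policy evaluates rules; the role levels, the action names, and the policy table are all hypothetical, and only the ordering "member < manager < admin" comes from the discussion.

```python
# Assumed ordering, lowest to highest authority. "reader" is included for
# context; the discussion only states that manager sits between member and
# admin.
ROLE_LEVELS = {"reader": 0, "member": 1, "manager": 2, "admin": 3}

# Hypothetical policy table: action name -> minimum project role required.
POLICIES = {
    "volume:show": "reader",
    "volume:create": "member",
    "default_volume_type:set": "manager",
}


def is_allowed(role, action):
    """Return True if the role meets the minimum role for the action."""
    # Roles outside the hierarchy (e.g. "service") get no user-level access
    # in this sketch; that is an assumption, not a decision from the PTG.
    level = ROLE_LEVELS.get(role, -1)
    return level >= ROLE_LEVELS[POLICIES[action]]
```

Here a project manager can set the default volume type for a project without being granted admin.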

There were doubts about resource filtering, which we can propose as an additional work item for the current SRBAC goal. Currently our resource filtering behaves the same for everyone, i.e. if a filter doesn't work for non-admins then it doesn't work for admins either. There was another concern about attribute-level granularity. E.g. the host field in the volume show response is a system-scoped attribute and should not be returned in response to a request made with a project-scoped token.
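The attribute-level concern can be illustrated with a small view filter. This is a sketch under assumptions: the function, the scope strings, and the set of system-scoped attributes are hypothetical, and the volume is represented as a plain dict with a simplified "host" key.

```python
# Hypothetical set of volume attributes treated as system-scoped.
SYSTEM_SCOPED_ATTRS = {"host"}


def volume_view(volume, token_scope):
    """Build the volume show response for the given token scope."""
    if token_scope == "system":
        # System-scoped tokens see every attribute.
        return dict(volume)
    # Project-scoped tokens get the response without system-level fields.
    return {
        key: value
        for key, value in volume.items()
        if key not in SYSTEM_SCOPED_ATTRS
    }
```

The same set of attribute names could also drive list and filtering behaviour, so that a project-scoped user cannot filter on a field they are not allowed to see.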

conclusions

action: rosmaita to update policy matrix
action: consider attributes (system level like host) associated to personas (for show, list, filtering...)

More "Cloudy" like actions for Cinder

Walt discussed certain cases that cinder should handle automatically.

1) Certain actions should invoke an automatic migration when space is limited. For example, there are multiple pools on the same backend and a user wants to extend a volume; the volume no longer fits in its existing pool, but there is space for it in another pool on the same backend. The concern is that if moving between pools takes a considerable amount of time, the extend operation would take that long as well. Points to consider:

moving between pools with dd will not be efficient
a user message could be useful in this case
some backends (like RBD) can do this efficiently
it would be better to have a generic mechanism plus an efficient way to do it in the driver

There was a concern that this could trigger many concurrent migrations and cause performance issues, but we already run concurrent migrations when migrating volumes and that works fine. There was also a suggestion not to migrate the original volume if it is large, and instead migrate a smaller volume to free up the space required for the extend. There are a lot of things to consider in that case, the major one being that the other volume might belong to another project, and any failure during the operation might corrupt that volume as well.
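The extend-with-migration idea can be sketched as a pool selection step. Everything here is hypothetical (function name, data shapes, and the "most free space wins" tie-break), and the sketch ignores thin provisioning and reserved space for brevity.

```python
def pick_pool_for_extend(volume, new_size_gb, pools):
    """Choose where an extended volume should live.

    volume: dict with "pool" and "size_gb" keys.
    pools: dict mapping pool name -> free GB, all on the same backend.
    Returns (pool_name, needs_migration), or None if nothing fits.
    """
    growth = new_size_gb - volume["size_gb"]
    if pools[volume["pool"]] >= growth:
        # The current pool can absorb the growth; no migration is needed.
        return volume["pool"], False
    # Otherwise look for another pool that can hold the whole extended volume.
    candidates = {
        name: free
        for name, free in pools.items()
        if name != volume["pool"] and free >= new_size_gb
    }
    if not candidates:
        return None
    # Assumed tie-break: prefer the pool with the most free space.
    return max(candidates, key=candidates.get), True
```

A result with `needs_migration` set to true is the point where a user message could tell the user that the extend will take longer than usual.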

2) Backups sometimes induce a snapshot of the volume, and a snapshot has to live in the same pool as the original volume. An optimization is to let the volume drivers say whether they want to use snapshots or clones, depending on what's best for them. The driver can report whether it requires the full space for a temporary resource. If it does, the request goes through the scheduler to check for free space; otherwise we just proceed with the efficient cloning.
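The decision described for temporary backup resources could look roughly like this. The function name, the return labels, and the `scheduler_has_space` callable are hypothetical stand-ins; no such driver capability or scheduler call is implied to exist under these names.

```python
def plan_temp_resource(requires_full_space, size_gb, scheduler_has_space):
    """Decide how to create the temporary resource used by a backup.

    requires_full_space: what the driver reports for its temp resources.
    scheduler_has_space: callable taking a size in GB and returning a bool.
    """
    if not requires_full_space:
        # The backend clones efficiently; no free-space check is needed.
        return "efficient-clone"
    if scheduler_has_space(size_gb):
        # The temp resource consumes real space, and the scheduler agreed.
        return "full-space-copy"
    return "rejected"
```

Passing the scheduler check in as a callable keeps the sketch independent of how the real capacity check would be wired up.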

conclusions

action: Walt to write a spec describing the design and working of it

Unifying and fixing of Capacity factors calculations and reporting

There are some inconsistencies in our scheduler stats. For example, allocated_capacity_gb is maintained by the volume manager to let the scheduler know what cinder has already allocated against a backend/pool, but this value isn't being updated for migrations. It can also go negative, because the init_host calculations currently only account for in-use and available volumes. Patch: https://review.opendev.org/c/openstack/cinder/+/826510

We also have an issue where a few places in Cinder calculate the virtual free space for a pool, but the capacity filter and the capacity weigher do it differently. Patch: https://review.opendev.org/c/openstack/cinder/+/831247

The backend's stats may show that there is lots of space free/available, but cinder's view might be different due to:

reserved_percentage
thin vs. thick provisioning
lazy volume creation (space unused until volume is actually created)
max_over_subscription_ratio

These calculations need to be corrected so that operators have an accurate idea of which backends are low on space and require attention, and so that we don't hit resource creation failures even though there is space available in the backend.
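To show how the factors listed above interact, here is a minimal sketch of a single shared "virtual free space" calculation. It is one plausible formulation and is not claimed to match the code in either the capacity filter or the capacity weigher; the function name and parameter layout are assumptions.

```python
import math


def virtual_free_gb(total_gb, free_gb, provisioned_gb,
                    reserved_percentage, max_over_subscription_ratio, thin):
    """Estimate the space cinder can still hand out from a pool, in GB."""
    # Space held back by the operator, rounded down to whole GB.
    reserved_gb = math.floor(total_gb * reserved_percentage / 100.0)
    if thin:
        # Thin provisioning: capacity is oversubscribed, and what counts is
        # how much has already been promised (provisioned), not used.
        return (total_gb * max_over_subscription_ratio
                - provisioned_gb - reserved_gb)
    # Thick provisioning: only the physically free space can be used.
    return free_gb - reserved_gb
```

For a 100 GB pool with 40 GB physically free, 150 GB provisioned, 10% reserved and a ratio of 2.0, this gives 40 GB for thin volumes and 30 GB for thick ones. Having every caller go through one function like this is the kind of unification the patches aim for.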

conclusions

action: review the patches proposed by Walt

Volume affinity/anti-affinity issues

We have the affinity filters, but affinity is only enforced when creating volumes, not when migrating them. One way to handle this is to preserve the scheduler hint (for affinity/anti-affinity) for later operations on the volume. There are a lot of things to consider with this approach:

What happens if the original volume (we kept affinity from) is deleted/migrated?
Should we keep it as a UUID or host?
Should we consider the scheduler hint only for the volume create operation, or preserve it for all operations for the rest of the life of that volume?
How to go about the design -- should we store it in the metadata or create a separate table to store the volume UUIDs provided for affinity/anti-affinity?
What should we do when this requires cascading operations? Should we move a lot of resources during the operation to maintain the affinity/anti-affinity?
We also need to think about cases like backup/restore and when replication is enabled
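If the hints were preserved, a migration could be checked against them along these lines. This is a sketch of one possible design, not an agreed one: the function and data shapes are hypothetical, hints are assumed to be stored as volume UUIDs under the same_host and different_host keys, and volumes that no longer exist are simply ignored, which is just one answer to the first question above.

```python
def migration_respects_hints(dest_host, hints, volume_hosts):
    """Check a migration destination against preserved scheduler hints.

    hints: dict with optional "same_host" / "different_host" lists of UUIDs.
    volume_hosts: dict mapping existing volume UUID -> current host.
    """
    for vol_id in hints.get("same_host", []):
        host = volume_hosts.get(vol_id)
        # Deleted volumes (no known host) no longer constrain placement.
        if host is not None and host != dest_host:
            return False
    for vol_id in hints.get("different_host", []):
        if volume_hosts.get(vol_id) == dest_host:
            return False
    return True
```

Storing UUIDs rather than hosts means the check follows the referenced volume when it is migrated, at the cost of a lookup per hint.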

conclusions

action: define in our docs that we only honor hints on creation
action: ask Nova team if they have already solved this problem. Nova has a spec up for a similar case: https://specs.openstack.org/openstack/nova-specs/specs/rocky/implemented/complex-anti-affinity-policies.html
action: continue discussion in upcoming meetings and collect the points gathered in a spec

Wednesday 06 April

recordings

For Driver Maintainers: How Cinder (the project) Works

Brian provided a quick overview of Cinder's software development cycle, when key deadlines occur in each cycle, the difference between features and bugfixes, when bugfixes are backportable, things you can do to make sure your patches are ready for review, where key information about the project is located, etc.

link: https://etherpad.opendev.org/p/how-cinder-works

Documenting the Driver Interface

During this session, the team reviewed documentation patches for the driver interface class, which was a pending item from the last PTG. We got valuable feedback and plan to do this again to get these types of changes in. Patches to review:

conclusions

action: do the review session again and review current patches

Third-party CI: testing

We found out that most of our third-party driver CIs are not testing encryption. A fixed key should be enough to test it. The discussion was later generalized to what should be tested in third-party CI, and the list is as follows:

compute API -- attachments, bfv
volume API
image API -- glance configured to use cinder as the backend (nice to have but not required)
scenario tests
cinder-tempest-plugin

One suggestion was that a Python script would be helpful to compare what a CI promises with the tests actually running in tempest (the tool would check tempest.conf and cinder.conf). For this, we would need to require third-party CI systems to store these files in a specific location.
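A very small version of such a checker might look like the sketch below. It only covers the tempest.conf side, and it is an illustration rather than the proposed tool: the function name and the shape of the "promised" list are hypothetical, and the section/option names a CI would be checked against are left to the caller.

```python
import configparser


def missing_features(tempest_conf_text, promised):
    """Return the promised feature flags that are not enabled.

    tempest_conf_text: contents of a tempest.conf file.
    promised: iterable of (section, option) pairs the CI claims to test.
    """
    # Interpolation is disabled so "%" characters in values are left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(tempest_conf_text)
    missing = []
    for section, option in promised:
        # A missing section or option counts as "not enabled".
        if not parser.getboolean(section, option, fallback=False):
            missing.append((section, option))
    return missing
```

Taking the file contents as a string keeps the function easy to test; a wrapper that reads the file from the agreed location would sit on top of it.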

upstream tempest tests: https://etherpad.opendev.org/p/cinder-community-CI-tests
example downstream tempest tests: https://etherpad.opendev.org/p/cinder-3rd-party-CI-tests-rh

conclusions

action: add the current discussion points to the third-party CI document
action: post a list of desired tests for comments

Third-party CI: infrastructure

The NetApp team gave a great presentation on Software Factory. <placeholder for link of presentation>

Thursday 07 April

recordings
