CinderBobcatPTGSummary

Introduction
The seventh virtual PTG, for the 2023.2 Bobcat cycle of Cinder, was conducted from Tuesday, 28th March 2023, to Friday, 31st March 2023, for 4 hours each day (1300-1700 UTC). This page provides a summary of all the topics discussed throughout the PTG.



This document aims to give a summary of each session. More context is available on the cinder Bobcat PTG etherpad:
 * https://etherpad.opendev.org/p/bobcat-ptg-cinder

The sessions were recorded, so to get all the details of any discussion, you can watch/listen to the recording. Links to the recordings are located at appropriate places below.

recordings

 * https://www.youtube.com/watch?v=HektgZH_C2g
 * https://www.youtube.com/watch?v=_gk_bAs50M4

Announcements
ML: https://lists.openstack.org/pipermail/openstack-discuss/2023-March/032872.html Link: https://www.youtube.com/watch?v=YdLTUTyJ1eU
 * 2023.1 (Antelope) is released!
 * 2023.1 (Antelope) Project update in OpenInfra live

PTL and TC Interaction Summary
We discussed the challenges faced by new contributors in OpenStack and what steps we can take to improve the process. Some of the difficulties discussed were:


 * Gerrit interface isn't very intuitive
 * Devstack errors are not easy to debug and resolve: the Cinder team didn't mandate installing devstack for the Outreachy contribution this time, which significantly improved contributions.
 * OpenStack requires more devops/linux-style knowledge than many other OSS projects

There is a crash course for learning Linux concepts

Link: https://missing.csail.mit.edu/

Manila team is working on a guide for outreachy applicants

Link: https://wiki.openstack.org/wiki/Outreachy_Applicants_Guide#Outreachy_Applicants_Guide

2023.1 (Antelope) Retrospective
What was good?
 * Added a new core -- Jon Bernard

What was bad?
 * Delay of RC1 and RC2 due to certain fixes

What should we stop doing?
 * Someone mentioned that they don't like the recent practice of adding reviews to the meeting agenda; on the other hand, it does get them some attention
 * Keystone has a "reviewathon" on Fridays. They separate managing bugs/reviews from doing reviews in a meeting. - We also have "festivals", but not every week.
 * It would be good to have more people joining
 * It was also mentioned that we should have a meeting where people can bring their own patches for reviews (not specifically XS)
 * There was a concern that driver patches wait a long time for third party CI to respond, and keep waiting even after CI passes
 * We can discuss each CI's status and, if it is not reporting, flag it and warn the maintainer to fix it

What should we continue doing?
 * Festival of XS reviews
 * once-a-month video team meeting

Cinder contribution Information:

Link: https://tiny.cc/cinder-info

Outreachy Overview
Sofia provided a great presentation on Outreachy, which is available at the link below.

Link: https://docs.google.com/presentation/d/e/2PACX-1vRrCWvWw6YV13LafHBBSu9EHm8deZu4WTjIebWt0AZEOkovbjhIY9ft9TIk75gL7HZa3lp2apRMQIli/pub?start=false&loop=false&delayms=3000

Quick question about NFS encryption
Are Dell and NetApp developers interested in getting encryption support for their drivers? If we enable encryption in the generic NFS driver, drivers inheriting from it automatically get the support, which is not something we want. Responses from driver vendors:


 * NetApp
 * NetApp already has backend encryption but doesn't enable it since they don't have any customer requests for encryption


 * Dell
 * No plan to use NFS encryption as there is no real ask from their customers

Cinder backup improvements
Christian couldn't attend the meeting, so here are the specs he mentioned that require attention.


 * Spec for dedicated status tracking to decouple backup process from other tasks
 * Link: https://review.opendev.org/c/openstack/cinder-specs/+/868761


 * Spec to allow backups to be encrypted
 * https://review.opendev.org/c/openstack/cinder-specs/+/862601
 * ML Thread
 * https://lists.openstack.org/pipermail/openstack-discuss/2022-September/030263.html
 * #action: Gorka will be updating the spec

tobias-urdin brought up a problem with cinder backup/restore and availability zones.
 * Bug: https://bugs.launchpad.net/cinder/+bug/1949313
 * Gorka thinks the source of the bug is that we don't pass the availability zone while creating the volume to restore to
 * One solution is to have a config option allowing cross-AZ backup restores, for example, enable_cross_az_backups = true (default)
 * #action: Someone to take up the task to fix the bug
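A minimal sketch of how such an option might look in cinder.conf. Note that enable_cross_az_backups is only an example name raised in the discussion, not a merged option:

```ini
[DEFAULT]
# Hypothetical option from the PTG discussion (not merged): when true
# (the proposed default), a backup taken in one availability zone may be
# restored to a volume in a different availability zone.
enable_cross_az_backups = true
```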

EM vs EOL for rocky and stein and in general
We thought that it would be a good idea to remove all jobs from a branch but still keep it for collaboration purposes, but there were a few points opposing that idea:


 * if there are multiple patches proposed to a branch where we aren't merging anything, we will end up with patches conflicting with each other
 * Keeping branches in EM signals that they are still maintained (based on the name "extended maintenance"), which isn't a good message from our side

There was also a mention of the idea that we mark a branch as EOL but still keep it for collaboration:
 * If we mark branches as EOL and still keep them, we will need to convince other projects of the proposal of marking a branch as EOL but not deleting it

Another discussion concerned the stable branches for which third party CI should report runs:
 * There are 3 active stable branches at any point; currently, for 2023.2 development, they are 2023.1, Zed and Yoga (Xena will move to EM)
 * We need to keep track of the Ubuntu and Python versions while doing this testing

Action Items:
 * #action: reply to the ML regarding Cinder's take on the situation i.e. to EOL rocky and stein
 * #action: Brian to summarize the discussion in an etherpad and send it out to the ML
 * https://etherpad.opendev.org/p/cinder-EOL-rocky-stein

recordings

 * https://www.youtube.com/watch?v=7xV_UaAVI_A
 * https://www.youtube.com/watch?v=YaNXq08_u5A

Image Encryption - Current State
Patches in python-barbicanclient and castellan have merged and castellan release will be out soon.

From cinder perspective, we will have patches for os-brick and cinder (for the create bootable volume operation).

The glance and cinder changes will be dependent upon the os-brick patch so the priority should be os-brick > glance and cinder.

The team feels tempest scenario tests would be good to have including glance, cinder, os-brick code paths.

Action Items
 * #action: review the os-brick change
 * Link: https://review.opendev.org/c/openstack/os-brick/+/709432

FIPS jobs
We have ubuntu and centos jobs proposed.


 * Ubuntu
 * https://review.opendev.org/c/openstack/tempest/+/873697
 * https://review.opendev.org/c/openstack/devstack/+/871606
 * CentOS
 * https://review.opendev.org/c/openstack/cinder-tempest-plugin/+/847086
 * https://review.opendev.org/c/openstack/devstack-plugin-nfs/+/847087
 * https://review.opendev.org/c/openstack/cinder/+/790535

Since Ubuntu Focal (20.04) doesn't have a kernel supporting anything other than MD5, we can't use LVM + iSCSI target. We can probably try LVM + NVMe-TCP or LVM + NVMe-RDMA using Soft-RoCE. Also, FIPS is only enabled on Focal, so Jammy isn't qualified for FIPS testing as of now.

Action Items
 * #action: review and merge proposed patches to start running jobs as non-voting but also keep an eye on failures

Operator Hour
Etherpad: https://etherpad.opendev.org/p/march2023-ptg-operator-hour-cinder

There were no operators that joined the cinder operator hour. To make better use of the time, we discussed topics that were proposed by an operator.


 * (How) to reduce memory footprint for "all" deployment tools, as done here for
 * Devstack: https://review.opendev.org/c/openstack/devstack/+/848290
 * TripleO: https://review.opendev.org/c/openstack/tripleo-common/+/845807
 * There are two other ways to reduce memory consumption in a deployment
 * 1) Keep the chunk size small
 * a) The option name is different for each backup driver, example, for RBD it is backup_ceph_chunk_size
 * 2) limit number of backup/restore operations
 * a) this can be done with backup_max_operations config option
 * b) https://docs.openstack.org/cinder/latest/admin/volume-backups.html#backup-max-operations
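The two tuning knobs above can be combined in cinder.conf. This is an illustrative sketch; the values shown are examples, not recommendations, and backup_ceph_chunk_size applies only to the Ceph/RBD backup driver:

```ini
[DEFAULT]
# 2) Cap the number of concurrent backup/restore operations so fewer
#    chunks are held in memory at once.
backup_max_operations = 5

# 1) Smaller chunks mean less memory per in-flight chunk; this option
#    is specific to the Ceph backup driver (value in bytes, 32 MiB here).
backup_ceph_chunk_size = 33554432
```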


 * Memory usage in CI jobs
 * We did some tweaking in mysql config to reduce memory usage in a devstack deployment
 * https://review.opendev.org/c/openstack/devstack/+/873646
 * This is disabled by default but can be enabled in gate jobs with a devstack variable, MYSQL_REDUCE_MEMORY: true
 * We've also added indexes to our DB improving the query performance
 * https://review.opendev.org/c/openstack/cinder/+/819669
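For a local devstack deployment, the mysql tuning mentioned above can be enabled through local.conf (the variable name comes from the devstack change linked above):

```ini
[[local|localrc]]
# Opt in to the reduced-memory mysql configuration; it is disabled by
# default and can likewise be set as a devstack variable in gate jobs.
MYSQL_REDUCE_MEMORY=true
```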

Are we ready for SQLAlchemy 2?
oslo.db 13.0.0 will be released during 2023.2 Bobcat development; it will remove sqlalchemy-migrate support and formally add support for SQLAlchemy 2.x. For cinder to adapt to this change, we will need to merge the following patches.
 * https://review.opendev.org/q/topic:sqlalchemy-20%20project:openstack/cinder%20status:open

There is also an effort to remove the abstraction in the DB code and make SQLAlchemy our only DB ORM. The team agrees that we should move forward with this.
 * https://review.opendev.org/c/openstack/cinder/+/846173

Action Items
 * #action: Rajat to look into cinderlib for open patches and release it
 * #action: Brian to change job definitions to test against 2023.1 instead of cinder/os-brick master
 * cinderlib release model, changes: https://docs.openstack.org/cinderlib/latest/contributor/contributing.html
 * 878943: Continue 2023.1 (Antelope) development | https://review.opendev.org/c/openstack/cinderlib/+/878943

recordings

 * https://www.youtube.com/watch?v=Wkl9aT8XcOA
 * https://www.youtube.com/watch?v=gU83byXgej8

OpenStack Client update
We added the missing commands to OSC in the 2023.1 Antelope release, achieving parity between cinderclient and the OpenStack client. The following are the changes we are planning for the 2023.2 Bobcat development cycle:


 * We will make OSC the default CLI and only add new commands to OSC, not cinderclient
 * We will still need to add python bindings to cinderclient
 * We will improve openstacksdk to add support for missing cinder operations

Action Items
 * #action: go forward with the plan of working towards parity with SDK

Quotas
Partial work has been done but unfortunately Gorka won't be able to continue working on it due to other priorities. Rajat has proposed to work on it and appropriate handover will be done for continuing the work.


 * #action: Rajat to understand the current state and take handover from Gorka
 * #action: All cores to read the spec when it's finalized after the handover

Active/Active support with NFS
This should be doable but will require a lock to avoid two services working on the same resource.
 * https://docs.openstack.org/cinder/latest/contributor/drivers_locking_examples.html
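The kind of inter-service lock needed here can be sketched with a file-based lock. This is only an illustration of the pattern; in cinder itself, locks across active/active services go through the tooz-backed coordination layer described in the guide linked above, and the paths and names below are hypothetical:

```python
import fcntl
import os
from contextlib import contextmanager

# Illustrative only: real A/A deployments need a shared coordination
# backend (e.g. via tooz), not a node-local directory.
LOCK_DIR = "/tmp/cinder-locks"


@contextmanager
def volume_lock(volume_id: str):
    """Serialize work on one volume across cooperating volume services.

    With NFS, two active/active cinder-volume services share the same
    backend files, so without a lock like this both could operate on the
    same volume file at once.
    """
    os.makedirs(LOCK_DIR, exist_ok=True)
    path = os.path.join(LOCK_DIR, f"volume-{volume_id}.lock")
    with open(path, "w") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until the holder releases
        try:
            yield
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)


with volume_lock("6f4b0c2e"):
    pass  # e.g. extend or clone the NFS-backed volume file here
```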

Gorka has a series of posts written on Active-Active that should be helpful.
 * https://gorka.eguileor.com/?s=active-active

Testing:
 * A sanity test by running tempest with multiple volume services
 * More thorough testing should be done with browbeat or rally

RBD deletion issues
When cinder and glance both use RBD as their backend and we create a bootable volume from an image, COW cloning is performed, which creates a dependency chain. This is also true for the clone-from-source-volume operation. This dependency causes problems when deleting the parent resource. The current work in cinder allows the deletion to happen by using RBD's trash functionality.

Currently there is a cinder patch in progress. We need changes in glance, similar to those in cinder, to allow deletion of parent images that have dependent volumes.
 * Link: https://review.opendev.org/c/openstack/cinder/+/835384
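Conceptually, the trash-based deletion works like this. The snippet below is a toy Python model of the behavior, not the actual RBD API; in Ceph the corresponding operations are the trash move and a deferred purge:

```python
class FakeRBDPool:
    """Toy model of RBD COW clone chains and the trash namespace."""

    def __init__(self):
        self.parents = {}  # child image name -> parent image name
        self.trash = set()

    def clone(self, parent, child):
        # COW clone: the child shares data with the parent, creating a
        # dependency that blocks outright deletion of the parent.
        self.parents[child] = parent

    def delete(self, image):
        if image in self.parents.values():
            # Parent with live children: can't be removed outright, so it
            # is moved to trash instead of failing the delete request.
            self.trash.add(image)
        else:
            self.parents.pop(image, None)
            # Trashed parents with no remaining children can be purged.
            self.trash = {t for t in self.trash
                          if t in self.parents.values()}


pool = FakeRBDPool()
pool.clone("glance-image", "bootable-volume")
pool.delete("glance-image")     # has a dependent volume -> goes to trash
assert "glance-image" in pool.trash
pool.delete("bootable-volume")  # last child gone -> parent is purged
assert not pool.trash
```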

Action Items
 * #action: Eric to do a POC for glance RBD store and propose a spec accordingly

Glance Image Direct URL access
The work started on the glance side and the spec was merged; however, the implementation wasn't started.
 * 2023.1 (Antelope) spec: https://specs.openstack.org/openstack/glance-specs/specs/2023.1/approved/glance/new-location-info-apis.html

Nova team requires a separate spec for the nova changes to handle their upgrade and backward compatibility scenarios.

Some requirements from nova team for nova side changes:
 * Use SDK instead of glanceclient
 * Keep backward compatibility to handle API requests between SLURP releases
 * Keep the legacy credentials for the new APIs
 * nova has testing in place to check if thin clone is working
 * https://docs.openstack.org/nova/latest/configuration/config.html#workarounds.never_download_image_if_on_rbd
 * https://opendev.org/openstack/nova/src/commit/29de62bf3b3bf5eda8986bc94babf1c94d67bd4e/.zuul.yaml#L634

Action Items
 * #action: Repropose glance spec to 2023.2
 * #action: Propose a nova spec handling nova specific use cases
 * #action: Glance team to start working on implementation

NFS encryption
This is an effort to enable encryption for the generic NFS driver. This feature will require changes on both nova and cinder side.
 * Link: https://review.opendev.org/q/topic:bp%252Fnfs-volume-encryption
 * Nova changes
 * https://review.opendev.org/c/openstack/nova/+/854030
 * https://review.opendev.org/c/openstack/nova/+/870012

Nova team would require a blueprint to track the work. A spec wouldn't be required since there are no DB or API changes. It would be good to share as much code as possible with the nova provisioned disk encryption feature.

Action Items:
 * #action: Propose a nova blueprint
 * #action: Handle the upgrade concern about making sure we are not scheduling to an older compute (using a specific trait + a prefilter)
 * #action: For testing, cinder will enable encryption in their existing NFS job and nova could run it on our periodic jobs

Allow specifying a hardware model for a cinder volume on a per-volume basis
Currently nova allows us to select the disk model via the image using the hw_disk_bus image property, for example, hw_disk_bus=virtio or hw_disk_bus=sata. If we wanted to support this as a per-volume attribute, we can use volume metadata for it. The validation of the value will be done on the nova side.
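The proposed precedence could look roughly like the sketch below. The hw_disk_bus property name comes from the discussion; the helper function, the default, and the set of accepted bus values are illustrative assumptions, not nova's actual implementation:

```python
# Illustrative subset of disk bus values; the real accepted set depends
# on nova's hypervisor driver and architecture.
VALID_DISK_BUSES = {"virtio", "scsi", "sata", "ide", "usb"}


def pick_disk_bus(volume_metadata: dict, image_properties: dict) -> str:
    """Volume metadata takes precedence over the image property.

    Invalid values are rejected, modeling the validation that the action
    item below places on the nova side.
    """
    bus = (volume_metadata.get("hw_disk_bus")
           or image_properties.get("hw_disk_bus"))
    if bus is None:
        return "virtio"  # assumed default, for illustration only
    if bus not in VALID_DISK_BUSES:
        raise ValueError(f"invalid hw_disk_bus: {bus}")
    return bus


print(pick_disk_bus({"hw_disk_bus": "sata"}, {"hw_disk_bus": "virtio"}))
print(pick_disk_bus({}, {"hw_disk_bus": "virtio"}))
```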

Action Items:
 * #action: The disk model can be set in volume metadata and nova would validate if the value is correct
 * #action: Implement a precedence order in nova to provide higher priority to volume metadata field than glance image metadata field

recordings

 * https://www.youtube.com/watch?v=fLHCqlQ6mAI

Release notes guidelines for SLURP/NON-SLURP cadence
We need to handle the release notes case for SLURP/non-SLURP releases. Brian has a documentation proposal up for this.
 * Link: https://review.opendev.org/c/openstack/project-team-guide/+/843457/1/doc/source/release-management.rst

Gorka also has a documentation patch for cinder-related changes in SLURP vs non-SLURP releases
 * https://review.opendev.org/c/openstack/cinder/+/830283

Action Items
 * #action: Review documentation proposed by Brian and Gorka

Upload volume to image optimization for RBD
Currently the work is on hold, waiting for the service role to be available and for the RBD deletion fixes (breaking the dependency chain) to be merged. We can also use the service role without keystone bootstrapping it, and document that as a required prerequisite for this feature to work.


 * Spec: https://specs.openstack.org/openstack/cinder-specs/specs/yoga/optimize-upload-volume-to-rbd-store.html
 * POC patch: https://review.opendev.org/c/openstack/cinder/+/809523

Action Items:
 * #action: Rajat to go through nova docs and add documentation regarding using this with service role and service token
 * #action: Go through the RBD patch to see if we need to include any custom changes to make RBD delete work for this

Cinder retype for migration, passing the new_volume_type_id to the drivers
The concern was that cinder does not pass new_volume_type_id when calling the migrate volume functionality. The driver team wanted to replicate the generic migration flow, where a new volume is created on the new host and data is copied from the old volume to the new one. This doesn't seem like a reasonable approach for a driver, since migration done by a driver is expected to be efficient. The driver can always fall back on the generic migration by not implementing the migrate_volume method.
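A driver that cannot migrate a volume efficiently can simply decline, and cinder then falls back to the generic flow. The sketch below assumes the migrate_volume contract of returning a (migrated, model_update) tuple; the helper methods and the capability check are hypothetical:

```python
class ExampleDriver:
    """Sketch of a cinder volume driver's migrate_volume behavior."""

    def migrate_volume(self, context, volume, host):
        if not self._can_migrate_natively(volume, host):
            # (False, None) tells cinder the driver declined, so the
            # generic migration (new volume + data copy) is used instead.
            return False, None
        self._do_efficient_backend_migration(volume, host)
        return True, None

    def _can_migrate_natively(self, volume, host):
        # Hypothetical check, for illustration: only migrate natively
        # between backends of the same vendor.
        return host.get("capabilities", {}).get("vendor_name") == "example"

    def _do_efficient_backend_migration(self, volume, host):
        pass  # the backend-specific fast path would go here


driver = ExampleDriver()
print(driver.migrate_volume(None, {"id": "vol-1"},
                            {"capabilities": {"vendor_name": "other"}}))
```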

Action Items
 * #action: work on the patch to allow drivers to return extra_specs properties that are OK for retyping (with migration) a volume