
CinderXenaMidCycleSummary

Introduction

At the Xena Virtual PTG we decided to continue the practice of holding two mid-cycle meetings, each two hours long, during weeks R-18 and R-9.

What follows is a brief summary of what was discussed. Full details are on the etherpad: https://etherpad.opendev.org/p/cinder-xena-mid-cycles

Session One: R-18: 2 June 2021

We met in BlueJeans from 1400 to 1600 UTC.
etherpad: https://etherpad.openstack.org/p/cinder-xena-mid-cycles
recording: https://www.youtube.com/watch?v=bSs5AHz2Iq8

General Cinder Project Business

Quick discussion of the following issues:

  • Reminder that we are in OFTC now for #openstack-cinder and #openstack-meeting-alt. There are 69 people showing in the old Freenode cinder room, but no one has said anything in there, so it looks like people have made the transition.
  • The TC has asked teams to consider holding their team meetings in their project channel instead of one of the dedicated #openstack-meeting* rooms. Sounded like most people are OK with that or have no opinion. Since we already have one change people need to be aware of right now (our next meeting will be the first in OFTC), we'll continue to meet in #openstack-meeting-alt for now and take a vote about moving in the next meeting or so.
  • The release team likes to make sure libraries get a release early in the cycle, so that changes that have accumulated in master since the most recent stable branch was cut can get a workout and we can find out whether we've broken anyone. We did a quick review of unmerged os-brick patches and picked two that should be included in this early release. Rajat will update the release patch when the changes have been merged. (We will wait to release an early python-cinderclient until after v2 support has been removed; see below.)
  • A question came up about a backport proposed to stable/queens: https://review.opendev.org/c/openstack/cinder/+/760362. The issue is that it's a pretty big patch (on the order of 1K lines, though half of that is unit tests) and that it's not a clean backport from rocky (and hence will require more careful reviewing). On the other hand, the change is isolated to a single driver. The general feeling was that since we've already allowed backporting from victoria (where it was introduced) back to rocky, we should be consistent and allow it to be backported to queens. Plus, it's a significant bugfix, and people are still using queens.
  • It turns out that os-brick is not tagged 'vulnerability:managed', which was a surprise to me. The question for the team was: did anyone remember this being done intentionally, or was it simply an oversight (and we didn't notice because most people file security bugs directly against cinder)? No one thought this had been done intentionally, and the team agreed that it makes sense for private security bugs for os-brick to be managed by the VMT, as is currently done for cinder and the cinderclient.
  • A discussion about the default envlist in tox.ini for cinder project repos was prompted by a proposed patch to os-brick: https://review.opendev.org/c/openstack/os-brick/+/793221. We're inconsistent about this across repos, although to a certain extent it doesn't matter because no one present admitted to ever using the default tox environment when running tests locally. The default environment is aimed at new contributors, and the general consensus is that we'd like them to run both (a) the unit tests with the latest python version we support, plus (b) pep8.
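
For concreteness, that consensus would translate into a default along these lines in tox.ini (the python version shown is illustrative and should track the latest one the project supports):

  [tox]
  envlist = py39,pep8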

Block Storage API v2 Removal Update

Patches have been posted for cinder and the cinderclient, but they aren't seeing much review action. So we'll take an adopt-a-patch strategy so that people have responsibility for reviewing a specific patch. For each patch, we need two core reviewers (as usual), and it would be nice to have another reviewer (doesn't have to be core) do a sanity-check review, since most of the patches are large.

what about the enable_v3_api option?

We discussed what to do about the cinder enable_v3_api option (see the etherpad for links and details).

Some options are:

  1. Deprecate it in Xena for removal in Y, and honor it in Xena. This means that if it's disabled, the only Block Storage API call available on nodes running cinder-api will be the /versions request, which will return an empty list of available versions. Not sure there's any utility in that. (This is what's done in the current patch.)
  2. Deprecate it in Xena for removal in Y, but ignore it in Xena (that is, act as if enable_v3_api=true even if it's actually set false in cinder.conf).
  3. Just go ahead and remove it in Xena, because if it's ignored, there's no point in it being there at all. On the other hand, option 2 would log a deprecation message explaining that the option is now a no-op, so the log message would be available to people who didn't read the release notes carefully.


We decided to think about this some more and handle it in a follow up patch.
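
For reference, option 2 would amount to a standard oslo.config deprecation while the code simply stops consulting the value; a minimal sketch (the wording and metadata here are illustrative, not the eventual patch):

  from oslo_config import cfg

  # Keep the option so setting it still produces a deprecation warning in the
  # logs, but nothing in the code reads its value any more.
  enable_v3_opt = cfg.BoolOpt(
      'enable_v3_api',
      default=True,
      deprecated_for_removal=True,
      deprecated_since='Xena',
      deprecated_reason='The Block Storage v3 API can no longer be disabled, '
                        'so this option is ignored and will be removed.')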

cgroupv1 -> cgroupv2 update

Follow up from the PTG: libcgroup v2.0, which supports cgroup v2 and provides the 'cgexec' command that cinder uses, was released on May 6, 2021. This would allow us to keep using the current code with minimal changes. However, it won't be packaged for some distros, so we need to switch to Plan B.

The alternative is to use systemd: we would define drop-in slice files that set the I/O throttling, and then use systemd-run to run the copy/convert commands currently being run with cgexec from libcgroup. Eric pointed out that we could have cinder generate the files on demand, as we do elsewhere in the code. That way we could continue to use the config option that sets the I/O throttling value, which would preserve backward compatibility with current config files.
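
A minimal sketch of what that could look like (the slice name, paths, and helper names are hypothetical, not the actual implementation):

  import os

  from oslo_concurrency import processutils

  SLICE = 'cinder-throttle.slice'                      # hypothetical name
  DROPIN_DIR = '/etc/systemd/system/cinder-throttle.slice.d'


  def write_io_limits(device, bps_limit):
      """Generate the drop-in on demand from the existing throttle option."""
      os.makedirs(DROPIN_DIR, exist_ok=True)
      with open(os.path.join(DROPIN_DIR, 'io.conf'), 'w') as f:
          # On cgroup v2, systemd maps these properties onto io.max.
          f.write('[Slice]\n')
          f.write('IOReadBandwidthMax=%s %d\n' % (device, bps_limit))
          f.write('IOWriteBandwidthMax=%s %d\n' % (device, bps_limit))
      # Privilege escalation (rootwrap/privsep) elided for brevity.
      processutils.execute('systemctl', 'daemon-reload')


  def throttled_convert(src, dst):
      """Run the copy/convert command in the slice instead of under cgexec."""
      processutils.execute('systemd-run', '--scope', '--slice=%s' % SLICE,
                           'qemu-img', 'convert', '-O', 'raw', src, dst)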

Just for the record, the places where throttling is handled in the current code are noted on the etherpad.

Xena Specs Review

(Note: the R-18 midcycle is timed so that spec proposers can get feedback in advance of the spec freeze at the end of week R-15, but the word apparently hasn't gotten out. We need to figure out a better way to communicate this, both so that proposers know to show up at the midcycle and so that they realize they should participate in the midcycle scheduling process very early in the cycle if the current times prevent their attendance.)

Here's a summary of the specs we discussed. Anyone with a spec proposal that wasn't discussed, and who needs more feedback than is currently on the Gerrit review, should reach out to the Cinder team for help (by putting a topic on the weekly meeting agenda, asking in the OFTC #openstack-cinder channel, or via the openstack-discuss mailing list).

Also, here's a link to the midcycle recording for more context to the summaries that follow: https://www.youtube.com/watch?v=bSs5AHz2Iq8

Update original volume az

spec proposal: https://review.opendev.org/c/openstack/cinder-specs/+/778437

  • The team is generally supportive of this idea, but we need more details.
  • This spec could use a more accurate title, because it's more like "Migrate all the stuff associated with a backend to a new availability zone when the backend has been moved to a new AZ", which gives the reader a better idea of the complexity of the issue being addressed.
  • A question came up about in-use volumes: will this only work for available volumes? What steps does an operator need to take before running this cinder-manage operation?
  • We need a clear description of how other resources that are AZ-aware will be handled, for example:
    • snapshots
    • backups
    • groups
  • Need to add a documentation note that an operator will have to update any volume_types that include an availability zone.

Support revert any snapshot to the volume

spec proposal: https://review.opendev.org/c/openstack/cinder-specs/+/736111

  • The team remains supportive of this idea, but continues to stress that we need a much more thorough proposal, especially with respect to:
    • Vendors need to be aware of the implications of this functionality, especially for deletion
    • It would be extremely helpful to have an outline of tempest test scenarios sketched out in the spec, because it will give a clear statement of the expected pre-conditions and post-conditions of various applications of this operation (a rough illustration follows this list)
      • It must be clear that all the normal Cinder API semantics are respected when this operation is added
      • A suite of tests in the cinder-tempest-plugin (that can be implemented from the outline) will help driver maintainers assess the impact of this change on their drivers
  • A question came up about the 'safe_revert_middle_snapshot' property. Is it a static property that can be defined for each driver, or does it depend on other factors (for example, licensing, backend API version, etc.) that must be determined dynamically from the driver?
  • Please explain clearly what happens to the more recent snapshots under this proposal. For example,
    • volume -> snap A -> snap B -> snap C
    • revert the volume to snap B and then write data to the volume
    • what effect does this have on snap C?
  • All the normal cinder API semantics should still work. For example,
    • if you revert volume v1 to A, you should still be able to delete A
    • if you then revert volume v2 to A, you should have the original data from A (that is, it shouldn't contain anything that v1 has changed after its revert)
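
To make the requested test outline concrete, here is a very rough sketch of one such scenario (the client object and its methods are hypothetical stand-ins, not the cinder-tempest-plugin API):

  def outline_revert_to_middle_snapshot(client):
      """Rough scenario outline; 'client' and its methods are hypothetical."""
      vol = client.create_volume(size=1)
      snap_a = client.create_snapshot(vol)            # volume -> snap A
      client.write_marker(vol, 'data-b')
      snap_b = client.create_snapshot(vol)            # -> snap B
      client.write_marker(vol, 'data-c')
      snap_c = client.create_snapshot(vol)            # -> snap C

      # Revert to the middle snapshot, then write to the volume again.
      client.revert_to_snapshot(vol, snap_b)
      client.write_marker(vol, 'data-after-revert')

      # Post-conditions the spec needs to pin down explicitly:
      assert client.read_marker(snap_c) == 'data-c'   # is snap C unaffected?
      client.delete_snapshot(snap_b)                  # is snap B still deletable?
      vol_from_a = client.create_volume_from_snapshot(snap_a)
      assert client.read_marker(vol_from_a) != 'data-after-revert'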

Migration support for a volume with replication status enabled

spec proposal: https://review.opendev.org/c/openstack/cinder-specs/+/766130

  • A key question here is: What happens to the old replica?
  • There will need to be some documentation making operators aware of what questions to ask vendors about their specific backend. For example, if replication is async, there will be a time when the migrated volume is not replicated. There are probably some other issues that you can think of.
  • There's a comment at the end of the spec: "We could follow the multiattach implementation for this spec proposal". Could you please explain in what way multiattach is helpful to your implementation? It's not clear to us what you're thinking here, and an explanation will help us understand your proposal better.

Session Two: R-9: 4 August 2021

We met in BlueJeans from 1400 to 1600 UTC.
etherpad: https://etherpad.openstack.org/p/cinder-xena-mid-cycles
recording: https://www.youtube.com/watch?v=9Aj8FzqFuHM

Reset state robustification for snapshot

The issue is that there are a *lot* of state transitions to consider, and it looks infeasible to account for them all. So what should we do?

Eric mentioned that an alternative would be to just remove this capability from the REST API and make it available only via cinder-manage, where presumably it would be used only by people who know the implications of what they're doing. The problem with that is that you need access to a cinder node to run cinder-manage, which would mean only top-tier support could use it, and that probably isn't what we want.

The recommendation is to pick a small, clear case (for example, the two in the spec) and implement this to at least get started. Also, there are some destination states (particularly those ending in "-ing") that should *never* be transitioned to, so we can also block those. (These states are used by cinder to mark the middle of a workflow, and you don't want to allow someone to pretend to put the volume into the middle of a desired workflow simply by changing the state in the DB, because that's not really going to do anything.)
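
A minimal sketch of the "block those states" part (the status list is illustrative; the real set would come from cinder's snapshot status field definitions):

  # Transient "-ing" statuses mark the middle of a running workflow, so a
  # reset-state request targeting one of them should simply be rejected.
  BLOCKED_TARGET_STATUSES = frozenset(
      ['creating', 'deleting', 'restoring', 'backing-up', 'unmanaging'])


  def validate_reset_target(target_status):
      if target_status in BLOCKED_TARGET_STATUSES:
          raise ValueError('cannot reset to transient status %r'
                           % target_status)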

action

TusharTgite: do the implementation along the lines discussed above

what to do about the enable_v3_api option?

This is a follow-up from the Xena R-18 midcycle, discussed here: https://wiki.openstack.org/wiki/CinderXenaMidCycleSummary#what_about_the_enable_v3_api_option.3F
We also asked the TC about this on the mailing list: http://lists.openstack.org/pipermail/openstack-discuss/2021-July/023898.html

action

rosmaita: put up a patch removing the option: https://review.opendev.org/c/openstack/cinder/+/803533

python-cinderclient next steps

action

rosmaita: put up a patch adjusting the config: https://review.opendev.org/c/openstack/python-cinderclient/+/803522

os-brick Xena release

This is coming up fast: os-brick Xena release is at R-7 (week of 16 August)

Priorities:

action

everyone: review the 2 above patches pronto

sqlalchemy-migrate -> alembic

Stephen Finucane put up a bunch of patches to get this moving. He probably won't be able to help further due to a job switch.

Luigi pointed out that sqlalchemy-migrate may already be unsupported. We could organize a review festival to make this happen, since we need it ASAP.

action

rosmaita: organize a review festival or something around this

what to do about qcow2-v2 support being yanked from various linux distros

"Yanked" may be a bit strong; the proposal in RHEL, for example, is to have read-only support.

Not really a cinder problem, because qcow2-v3 has been the default for qcow2 creation in qemu-img since QEMU 1.7 (Dec 2013), so for anyone running openstack since Icehouse (spring 2014), cinder has only been creating qcow2-v3 images for remotefs-based drivers.

The issue for cinder arises if an operator tries to manage an existing qcow2-v2 format image.

Eric asked: does the cinder manage API even support managing qcow2 images with the NFS drivers? He suspects it may not; if that's the case, we should simply continue not to support this!
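
If we do need to detect this case, a helper along these lines could work (a hypothetical sketch; the JSON field names follow qemu-img's --output=json format):

  import json

  from oslo_concurrency import processutils


  def is_qcow2_v2(image_path):
      """Return True if the file is an old (compat 0.10) qcow2-v2 image."""
      out, _err = processutils.execute('qemu-img', 'info', '--output=json',
                                       image_path)
      info = json.loads(out)
      compat = info.get('format-specific', {}).get('data', {}).get('compat')
      return info.get('format') == 'qcow2' and compat == '0.10'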

action

rosmaita: look into whether we already disallow an operator from managing an NFS-driver volume that's a qcow2, and report back

keystone "trusts" support in cinder

The context for this is that there's an etherpad collecting operator pain points to assess the feasibility of having a "fix some pain points in your project" community goal (that is, each project fixes its own pain points instead of trying to find an openstack-wide pain point to address). This patch is on the list: https://review.opendev.org/c/openstack/cinder/+/785362

It's still not clear that the problem reported requires trusts. I'd rather use service tokens, which we already support and which are less complicated.
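
For context, here is a minimal sketch of the [service_user] section that drives cinder's existing service token support (the auth values below are placeholders, not recommendations for any particular deployment):

  [service_user]
  send_service_user_token = true
  auth_type = password
  auth_url = http://keystone.example.com/identity
  username = cinder
  user_domain_name = Default
  password = <service-password>
  project_name = service
  project_domain_name = Default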

action

rosmaita: leave a comment on the patch and/or associated bug explaining this and asking for a reason why what's proposed *cannot* be accomplished via service tokens

surfacing backend info to operators

The context for this one is a patch that is using the 'disabled_reason' field of the volume service to explain why replication is disabled on the backend, which seems to be an unintended use of that field: https://review.opendev.org/c/openstack/cinder/+/791281/5/cinder/volume/manager.py#636

GirishChilukuri explained: during initialization of a replicated backend, one of the sites may be down for some reason, and in that case we set the service's replication_status to disabled. To give the user some information, we use disabled_reason to explain why replication is disabled. We saw that disabled_reason is already used during failover to update the reason, so we used the same field to record the replication status.

Some points that came up during discussion:

  • this is not an appropriate use of disabled_reason on a service
  • surfacing this info (particularly about replication status) is a common use case across drivers
  • not clear that the service object is the place this should be recorded, though

action

Girish and team will make a proposal for how to surface this info, consulting with other driver developers (simondodsley from Pure expressed some interest)

general project maintenance issues

action

everyone: read through and act on the above!