CinderUssuriMidCycleSummary

Introduction

Since the Ussuri mid-cycle meeting was held virtually, we didn't have to worry about physical arrangements and decided to hold it as two separate two-hour sessions. The first session was scheduled around the Cinder Spec Freeze (the last week of January) and the second around the "Cinder New Feature Status Checkpoint" (the week of 16 March 2020), which is 3 weeks before the final release for client libraries.

Session One: 21 January 2020

We met in BlueJeans from 1300 to 1500 UTC.
etherpad: https://etherpad.openstack.org/p/cinder-ussuri-mid-cycle-planning
recording: https://www.youtube.com/watch?v=Dz28U1pQnqA

"Drop Py2 Support" Community Goal

We've been tracking this on an etherpad: https://etherpad.openstack.org/p/cinder-ussuri-community-goal-drop-py27-support

At this point, all Cinder project deliverables no longer check or gate on Python 2. The official Ussuri Python runtimes are 3.6 and 3.7; tox has been configured with unit and functional test environments for both, and the check/gate config in each project's .zuul.yaml runs both versions as well.

There's going to be a major version bump on all components to indicate that Python 2.7 is no longer supported.

The upper-constraints files on the stable branches prevent py3-only libraries from being used with a stable release. Deployers who manually patch systems need to pay attention to upper-constraints so that they don't inadvertently drop in a recent library release that hasn't been tested with Python 2.

actions

  • rosmaita - while changing tox.ini and .zuul.yaml, it turned out that for some of our previous py3 testing, the version actually being tested depended on the system python3. The later patches for "drop-py2-support" were set up to test against both py3.6 and py3.7 for unit and functional tests in the gate; need to verify that this is true for all Cinder deliverables.

Volume Local Cache Spec

Liang has updated the spec. We're trying to keep things simple on the Cinder and os-brick side because the configuration of the cache, including the cache mode(s) used, is completely independent of Cinder (and could change dynamically). The current proposal is to have a 'cacheable' volume type, with users controlling whether a volume is actually cached by selecting an appropriate flavor (or whatever nova decides is OK). For some cache modes you cannot live migrate or retype the volume, so it looks like info about the cache mode will somehow have to make it into cinder so that inappropriate operations can be blocked. Maybe for the first implementation, os-brick could allow only safe cache modes at attach time, and we can document that operators should only use safe modes and should not change cache modes dynamically.
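
As a rough illustration of that last idea, here is a minimal sketch of an attach-time guard. Everything in it (the mode names, the constant, the function) is hypothetical, since the spec is still in flux and no such code exists in os-brick:

# Hypothetical sketch only -- the set of "safe" modes below is an
# assumption for illustration, not anything from the spec.
SAFE_CACHE_MODES = {'writethrough', 'writearound'}

def validate_cache_mode(requested_mode):
    """Reject cache modes that would make live migration/retype unsafe.

    By refusing unsafe modes at attach time, Cinder never needs to know
    about cache modes in order to block inappropriate operations.
    """
    if requested_mode not in SAFE_CACHE_MODES:
        raise ValueError(
            "cache mode '%s' is unsafe for migration/retype; "
            "allowed modes: %s" % (requested_mode,
                                   ', '.join(sorted(SAFE_CACHE_MODES))))
    return requested_mode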

actions

  • Liang will update the spec. It's still not clear how exactly cache-mode info will be propagated to Cinder; we'll discuss at the weekly meeting.

Using six

At the PTG we agreed to some guidelines about using python-3-only language features, but they have turned out to be kind of vague, especially with respect to how to handle new code in drivers that uses the six compatibility library. For example, there is this patch: https://review.opendev.org/#/c/701542/

These are the PTG guidelines:

  • Reminder to reviewers: we need to check the code coverage of test cases very carefully so that new code has excellent coverage and is likely to fail when tested with py2 on stable branches when a backport is proposed
  • Reminder to committers: be patient with reviewers when they ask for more tests!
  • Reminder to community: new features can use py3 language constructs; bugfixes likely to be backported should be more conservative and written for py2 compatibility
  • Reminder to driver maintainers: ^^
  • Ivan and Sean have a green light to start removing py2 compatibility

After some discussion, we decided that:

  • we will allow drivers to continue to use six at their discretion
  • we will not remove six from code that impacts the drivers (e.g., classes they inherit from)
  • we can remove six from code that doesn't impact drivers, keeping in mind that backports may be more problematic, and hence making sure that we have really good test coverage
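
For illustration, here is a minimal sketch of the kind of pattern at issue; the function names are invented for the example:

import six

# The py2/py3-compatible form a driver may keep using at its discretion:
def to_text_compat(value):
    if isinstance(value, six.binary_type):
        return value.decode('utf-8')
    return six.text_type(value)

# The py3-only equivalent, fine in code that doesn't impact drivers:
def to_text(value):
    if isinstance(value, bytes):
        return value.decode('utf-8')
    return str(value)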

actions

  • rosmaita - write this up as a change to the contributor docs and we can all fight it out on the review

The current state of 3rd Party CIs

Jay has been working on removing drivers that were marked as 'unsupported' in Train and subject to removal in Ussuri, as well as marking drivers whose CIs are not being maintained as 'unsupported' in Ussuri. It is turning out to be a distressing number of drivers.

We discussed a few issues around this general topic.

Do we need upgrade checks for unsupported drivers?

No; we decided that the current log notice, the release note, and the fact that you have to explicitly enable an unsupported driver in config give operators plenty of notice of impending driver removal.

Should we hold off on removing drivers right now?

Yes, at least for a few weeks. Eric suggested that we reconsider the removal policy, and Walt pointed out that some of the drivers in question are pretty widely used (looking at you, HPE), and it's going to be a PITA for operators who want to use them. So it might make sense to extend the 'unsupported' period. On the other hand, Sean pointed out that we could have problems with keeping non-maintained drivers in-tree, especially as libraries in master are updated to versions that have only been tested on python 3.6 & 3.7.

What about drivers that have already been removed?

Maybe the ones that have been removed in the past few weeks could be restored if the vendors are not completely uninterested. For example, Brocade issued a statement that they will not support their OpenStack driver beyond the Train release (they only want to support Python 2), so that's an easy call. In addition to assessing vendor "interestedness", Sean pointed out that this brings up a fairness issue, namely, what about the drivers removed in the last cycle (or the cycle before that ...)? We need to discuss this some more so we can have a consistent policy.

Can we make running 3rd Party CI easier?

Luigi reported on the "Software Factory" effort: https://www.softwarefactory-project.io/

Software Factory is basically a packaging of a CI system so that it doesn't have to be assembled and built from disparate parts. It is used by RDO to do their CI. There's a doc patch up for Software Factory that could use some reviews/feedback: https://softwarefactory-project.io/r/#/c/17097/

There's already some documentation up there that explains how to hook the CI up to openstack gerrit: https://www.softwarefactory-project.io/docs/3.4/operator/quickstart.html#third-party-ci-quickstart

Our hope is that this will make it easier to set up and maintain 3rd party CI. Plus, if a lot of people use Software Factory, it will be easier for maintainers to get help.

Should we reconsider the 3rd Party CI requirements?

The current requirement is that 3rd Party CI must run on all changes, whether or not they impact that driver. What we wind up with on most patches is a big list of CI failures that you scroll past to get to the comments on the review; you only pay attention to the results when you're reviewing a driver change, and even then you only look at that vendor's CI.

Sean pointed out that this isn't a new discussion. Here are a few things to think about from 2017: https://wiki.openstack.org/wiki/CinderPikePTGSummary#3rd_Party_CI_Requirements http://eavesdrop.openstack.org/meetings/cinder/2017/cinder.2017-03-15-16.00.log.html#l-267

Some other things came up as well: if this is a resource issue, maybe only require the CI to run once a day? What it comes down to is that we have a big constraint satisfaction problem here.

actions

  • everyone - look over the Software Factory patch https://softwarefactory-project.io/r/#/c/17097/
  • jungleboyj - will try to contact vendors with problematic CIs and report back at the 29 January meeting
  • rosmaita - get something together to organize this discussion (either a patch to our docs or an etherpad) for the 29 January meeting

Session Two: 16 March 2020

We met in BlueJeans from 1200 to 1400 UTC.
etherpad: https://etherpad.openstack.org/p/cinder-ussuri-mid-cycle-planning
recording: https://www.youtube.com/watch?v=cA_VfYnS77o

Welcome and some Cinder project business

Reminder of where we are in the cycle:

  • this is week R-8 (Cinder New Feature Status Checkpoint)
  • 2 weeks to final non-client library releases at R-6
  • 3 weeks to final client library release for Ussuri at R-5
  • 3 weeks to M-3 and feature freeze (R-5)
  • 3 weeks to 3rd Party CI Compliance Checkpoint (R-5)
  • 5 weeks to RC-1 target week (R-3)


Releases from stable/train and stable/stein will be proposed later this week. The final Rocky release was earlier this month.

PTL self-nomination starts next week. Anyone interested, feel free to talk to Jay, Sean, or Brian to get an idea of what the job entails.


cinder core team: people interested in taking on more responsibility in the Cinder project should reach out to any of the current cores to get some guidance on how to make it happen

actions

  • rosmaita - get the stein and train releases proposed this week

The resource_filters response

This was a follow-up from discussion earlier in the cycle at the weekly meeting: http://eavesdrop.openstack.org/meetings/cinder/2020/cinder.2020-02-05-14.00.log.html#l-177

Briefly, the issue is that the API call is meant to help the API be self-documenting by giving a user a programmatic way to determine what filters are available when requesting a list of various Cinder resources. The current default response (it's operator-configurable) looks like this:

{
    "volume": ["name", "status", "metadata",
               "bootable", "migration_status", "availability_zone",
               "group_id", "size", "created_at", "updated_at"],
    "backup": ["name", "status", "volume_id"],
    "snapshot": ["name", "status", "volume_id", "metadata",
                 "availability_zone"],
    "group": ["name"],
    "group_snapshot": ["name", "status", "group_id"],
    "attachment": ["volume_id", "status", "instance_id", "attach_status"],
    "message": ["resource_uuid", "resource_type", "event_id",
                "request_id", "message_level"],
    "pool": ["name", "volume_type"],
    "volume_type": ["is_public"]
}

Each element is a resource name with a list of the filters that can be applied to it.

The resource names in the URLs, however, don't match these:

GET /v3/{project_id}/volumes
GET /v3/{project_id}/volumes/detail
GET /v3/{project_id}/backups
GET /v3/{project_id}/backups/detail
GET /v3/{project_id}/snapshots
GET /v3/{project_id}/snapshots/detail
GET /v3/{project_id}/groups
GET /v3/{project_id}/groups/detail
GET /v3/{project_id}/group_snapshots
GET /v3/{project_id}/group_snapshots/detail
GET /v3/{project_id}/attachments
GET /v3/{project_id}/attachments/detail
GET /v3/{project_id}/messages
GET /v3/{project_id}/scheduler-stats/get_pools
GET /v3/{project_id}/types
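
To make the mismatch concrete, here is a small sketch of the translation a client currently has to maintain by hand; the mapping and function are illustrative, trimmed to a few resources from the default response above:

# A few entries from the default resource_filters response shown above.
RESOURCE_FILTERS = {
    'volume': ['name', 'status', 'metadata', 'bootable'],
    'backup': ['name', 'status', 'volume_id'],
    'volume_type': ['is_public'],
}

# Hand-maintained mapping from URL resource names to response keys --
# exactly the lookup a self-documenting API shouldn't require.
URL_NAME_TO_FILTER_KEY = {
    'volumes': 'volume',
    'backups': 'backup',
    'types': 'volume_type',
}

def allowed_filters(url_resource):
    """Return the query filters Cinder accepts for a URL resource name."""
    return RESOURCE_FILTERS[URL_NAME_TO_FILTER_KEY[url_resource]]

print(allowed_filters('types'))  # ['is_public']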

Some points from the discussion:

  • Ideally, you'd want to go from the $resource you're querying on (e.g., 'volumes') to look up the list of applicable filters in the JSON response. But if people have already adapted to the current situation, then "fixing" this will break them.
  • maybe do a poll of operators to see how people are using this
  • the key issue is that when we add new resources (like for https://review.opendev.org/#/c/708845/), we need to be consistent
  • consistency with URL resource names is supposed to make this self-documenting; we could give up on that and just beef up the documentation

actions

  • rosmaita - put together a poll of operators to get an idea of usage

Fallout from Bug #1852106

https://bugs.launchpad.net/nova/+bug/1852106

This has been fixed for now (you can no longer create an image via nova that, when deleted, removes the barbican secret that an image created via cinder upload-volume-to-image depends on). But it did raise the issue of whether we need to check whether a single barbican secret is associated with more than one resource (either by preventing that from happening at create time or by declining to delete such a secret).

Eric pointed out that even if the current cinder code that keeps a 1-1 relation between resources and barbican secrets is correct, it's possible for users to "hack" the workflow and break the 1-1 correspondence, so we should anticipate that.

Glance is planning to wait for the Barbican Secret Consumer API to be implemented (was approved in Train, expected to be completed in Ussuri) to address a similar problem on their side.

Consensus was that when the new Barbican API is available, we'll look into this some more. Maybe that plus a uniqueness constraint in the DB would be the way to go.

The problem is that whatever we do will have some limitations, because other services (or the end user) may decide to use a particular barbican ID for something. So even if we know that *Cinder* is no longer using the ID, and the secret can be deleted as far as Cinder is concerned, we can't be certain that absolutely nothing in the cloud is holding a reference to the secret.
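
As a sketch of the sort of guard being discussed, assuming the future Secret Consumer API can tell us who is registered against a secret (the data shape and helper here are purely hypothetical):

# Hypothetical: 'consumers' stands in for whatever the not-yet-implemented
# Barbican Secret Consumer API will return for a secret.
def cinder_may_delete_secret(consumers):
    """Allow deletion only when no non-cinder consumer is registered.

    Even then, as noted above, something outside the consumer registry
    may still hold a reference, so this can only reduce the risk.
    """
    return all(c.get('service_type') == 'cinder' for c in consumers)

# Example: a secret still registered to glance must not be deleted.
assert not cinder_may_delete_secret(
    [{'service_type': 'cinder'}, {'service_type': 'glance'}])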

actions

  • rosmaita - remember to raise this issue again when the Barbican Secret Consumer API is available
  • rosmaita - look at our current documentation -- iirc, the Glance and Cinder Train release notes emphasized that the cinder_encryption_key_id shouldn't be used by anything other than cinder; make sure this is somewhere prominent in the regular documentation

Review of recently removed drivers

In January, we amended the driver removal policy to give vendors a bit more time to address Third-Party CI problems (that's been the major reason for drivers being marked 'unsupported' and then removed): https://docs.openstack.org/cinder/latest/drivers-all-about.html#driver-removal

Previously, a driver that was deprecated and marked 'unsupported' was removed during the next development cycle. The new policy says that the driver may remain in-tree at the discretion of the Cinder team, depending on factors like whether the vendor is communicating an effort to get the CI fixed, whether the backend it supports has gone EOL, etc. The policy does say that a driver is subject to removal as soon as it breaks the gate.

Before the policy was revised in January, we removed some drivers subject to the previous policy; we need to figure out what to do with those.

The following drivers were marked 'unsupported' in Train and subject to removal in Ussuri.

Datera  <-- being re-supported by the vendor: https://review.opendev.org/#/c/704153/
Huawei Fusionstorage   <-- vendor fixed CI and it is now 'supported' in U
IBM Flashsystem  <-- vendor would like to, but no resources  (leave in)
IBM GPFS   <-- no plan from IBM to support it  (leave in)
IBM Storage XIV  <-- vendor would like to, but no resources   (leave in)
IBM Storage DS8k   <-- vendor fixed CI and it is now 'supported' in U
IBM Storwize    <-- vendor fixed CI and it is now 'supported' in U
HPE LeftHand  <-- product line may have gone EOL; so, remove
Nimble          <-- has been removed in U  --- part of HPE -- find out intention and report back
Oracle ZFSSA  <-- no longer supported (leave out)
Sheepdog        <-- has been removed in U (project no longer active) (leave out)
Prophetstor     <-- has been removed in U    (no response) -> restore
Veritas Access  <-- has been removed in U  (no response) -> restore
Virtuozzo       <-- has been removed in U  (no response) -> restore

The following were marked 'unsupported' in Ussuri:

  • Brocade FCZM driver
    • the vendor announced no support after Train (driver runs only under py27)
    • Gorka is going to test to see whether it runs under py36/37
    • decision: revisit just before RC-time; leave if confirmed to run on py3
  • MacroSAN
  • IET iSCSI
    • project is no longer active, will remove in Victoria
  • Dell EMC PS Series
    • product going EOL, will remove in Victoria
  • Veritas Clustered NFS
    • no response from vendor


Walt raised the point that operators use these drivers and need to have them available; it might be worth keeping them around even if they break the gate. Otherwise operators may fork Cinder and then get stuck down that path. We decided to revisit this at the PTG, because we're probably OK right now. But we do need to get a plan in place for when one of them breaks the gate. Some possibilities:

  • move the driver to a special directory
  • disable the unit tests for the failing driver

actions

  • make sure the drivers noted to be restored above are in fact restored
  • rosmaita - contact the "no response" vendors again to encourage them to get 3rd Party CI running again and find out what their intentions are

handling unsupported/re-supported drivers in the upgrade check

The main thing here is that the patch removing a driver is usually separate from the patch adding that driver to the upgrade check. We need to review at M-3 to make sure that the restored drivers are removed from the upgrade checker.
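
For context, here is a minimal sketch of the shape such a check takes with the real oslo.upgradecheck framework; the driver list and the config inspection are illustrative, not Cinder's actual check code:

from oslo_upgradecheck import upgradecheck

# Illustrative list; entries like this are what must be pruned again
# when a removed driver is restored in-tree.
REMOVED_DRIVERS = {'cinder.volume.drivers.example.ExampleDriver'}

class Checks(upgradecheck.UpgradeCommands):
    def _check_removed_drivers(self):
        configured = set()  # stand-in for drivers parsed from cinder.conf
        in_use = configured & REMOVED_DRIVERS
        if in_use:
            return upgradecheck.Result(
                upgradecheck.Code.FAILURE,
                'Configured drivers have been removed: %s' % ', '.join(in_use))
        return upgradecheck.Result(upgradecheck.Code.SUCCESS)

    _upgrade_checks = (('Removed drivers', _check_removed_drivers),)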

actions

  • rosmaita - review the upgrade check removed driver list

Third-Party CI and "Python 3"

During the push to get 3rd Party CI testing with Python 3 in Train, we said: "Because the Train release of OpenStack is supposed to support both Python 3.6 and Python 3.7, it's acceptable to have your CI running 3.7 only on the theory that anything that runs in 3.7 will also run in 3.6" (https://wiki.openstack.org/wiki/Cinder/3rdParty-drivers-py3-update)
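
One concrete counterexample (illustrative; not taken from a Cinder driver):

# Runs on Python 3.7 but fails on 3.6: dataclasses joined the standard
# library in 3.7, so the import itself raises ModuleNotFoundError on 3.6.
import dataclasses

@dataclasses.dataclass
class Volume:
    name: str
    size_gb: int

print(Volume('vol-1', 10))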

As the sketch above shows, this isn't a good inference, but it was a big problem to get 3rd Party CIs running a py3 that isn't their distro's default, so we only asked for one, namely 3.7. The question has come up of what runtime(s) a new driver needs to be tested against when it is added. The answer is:

  • ideally, all of the cycle python runtimes
  • otherwise, one of the cycle runtimes
  • but, they need to keep an eye on the OpenStack python runtimes each cycle -- it won't be python 3.6 forever


Luigi mentioned that 3rd Party CIs need to make sure they're running the cinder-tempest-plugin.

Walt mentioned that there's an easy way to stand up zuulv3 and everything you need to run a CI: https://01.org/openstack/blogs/manjeets/2019/how-test-open-source-hardware-drivers-zuul-v3-and-docker-compose

This is similar to the Software Factory solution we've been mentioning to 3rd Party CI maintainers.

actions

  • rosmaita - make sure the above is clear in the documentation

security problem no longer a security bug

https://bugs.launchpad.net/cinder/+bug/1740950

This has been around for a long time and was made public a few weeks ago due to the change in VMT policy under which security bugs stay private for only 90 days.

Rajat has a patch up: https://review.opendev.org/713231

actions

  • everyone - review https://review.opendev.org/713231
  • rosmaita - verify with the VMT that they don't want to do anything about this - they did a "won't fix" based on it being a class Y vulnerability (only present in development version) but that was several cycles ago

cinder-tempest-plugin on 3rd party CIs

Luigi mentioned that 3rd Party CIs need to make sure they're running the cinder-tempest-plugin.

There's also room for more thorough tests so that we can have more certainty that the code works beyond not getting 500s. For example, Rajat has a patch up to add a snapshot data integrity test: https://review.opendev.org/#/c/702495/
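
The idea behind such a test, sketched in plain Python rather than the actual tempest code (the device paths and flow are illustrative):

import hashlib

def checksum(device_path, length=1024 * 1024):
    """Hash the first `length` bytes of a block device or file."""
    with open(device_path, 'rb') as f:
        return hashlib.sha256(f.read(length)).hexdigest()

# Illustrative flow (the volume operations stand in for real API calls):
#   1. write a known pattern to an attached volume, e.g. /dev/vdb
#   2. before = checksum('/dev/vdb')
#   3. snapshot the volume, create a new volume from the snapshot,
#      and attach it as /dev/vdc
#   4. assert checksum('/dev/vdc') == before  # data survived the snapshot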

actions