
CinderUssuriPTGSummary

Revision as of 02:50, 28 November 2019 by Brian-rosmaita (talk | contribs)

Introduction

This page contains a summary of the subjects covered during the Ussuri PTG held in Shanghai, China, November 7-8, 2019. It also contains a summary of the Virtual PTG held November 25 and 27, 2019.

The sessions were recorded. Links to the recordings will be added here when they are available.

The full etherpad and all associated notes may be found here:

Thursday (Shanghai)

Cinder project onboarding and "meet the Cinder developers" session

Everyone who attended was 100% satisfied and very complimentary about Cinder. Unfortunately, no one attended, so we spent the time figuring out how to get the recording equipment connected and positioned properly.

(recording 1 starts here)

Python 2 support & Work remaining to remove Py27 support

I came into this arguing that we need to keep Python 2 testing in master for a while -- at least while we are still supporting Python 2 in stable branches, because otherwise backports become a big problem (won't have a clean backport if any py3-only language features are used in a patch to master). Pretty much no one agreed with this.

Sean pointed out that as libraries drop py2 support, we won't be able to use them in py2 testing anyway. Ivan and Sean can't wait to start ripping out py2-compatibility code. Gorka didn't think that the extra effort to modify backports would be that big a deal, and that if we're going to start using py3 for real, we might as well start now.

Actions

  • Reminder to reviewers: we need to be checking the code coverage of test cases very carefully so that new code has excellent coverage and will be likely to fail when tested with py2 in stable branches when a backport is proposed
  • Reminder to committers: be patient with reviewers when they ask for more tests!
  • Reminder to community: new features can use py3 language constructs; bugfixes likely to be backported should be more conservative and be written for py2 compatibility
  • Reminder to driver maintainers: ^^
  • Ivan and Sean have a green light to start removing py2 compatibility
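
As a rough illustration of the reminder to the community above, the sketch below contrasts py3-only syntax with an equivalent that py2.7 also accepts. Both function names are hypothetical and are not taken from the Cinder code base.

```python
# Hypothetical helpers for illustration only; not Cinder code.

def describe_volume_py3(name, *, size_gb):
    # Keyword-only arguments and f-strings are py3-only syntax, so a
    # patch containing them will not backport cleanly to a py2 branch.
    return f"{name}: {size_gb} GB"


def describe_volume_compat(name, size_gb=None):
    # Produces the same string using constructs that py2.7 also accepts.
    return "%s: %s GB" % (name, size_gb)


print(describe_volume_py3("vol-1", size_gb=10))
print(describe_volume_compat("vol-1", size_gb=10))
```

A new feature could reasonably use the first form; a bugfix that is likely to be backported is safer in the second.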

Policy migration

Some background: https://etherpad.openstack.org/p/policy-migration-steps . Keystone has added a default read-only role and service-scoped roles, but they don't do anything until projects write policies that use them. The oslo.policy code has a way to define a default policy plus a deprecated default policy; during deprecation, the most permissive wins. This will allow easy migration to new policies for operators.
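
The "most permissive wins" behavior can be pictured with a simplified model. This is only a sketch of the idea, not the oslo.policy API: the function, its parameters, and the role sets are all assumptions made for illustration.

```python
# Simplified model only; this is NOT the oslo.policy API.

def allowed(user_roles, new_default_roles, deprecated_default_roles,
            in_deprecation=True):
    """Return True if the request passes the policy check."""
    passes_new = bool(user_roles & new_default_roles)
    if not in_deprecation:
        # After the deprecation period only the new default applies.
        return passes_new
    # During deprecation a request passes if either default allows it,
    # so the more permissive of the two effectively wins.
    passes_old = bool(user_roles & deprecated_default_roles)
    return passes_new or passes_old


# Hypothetical defaults: the old rule accepted "member", the new rule
# expects "reader".
print(allowed({"member"}, {"reader"}, {"member"}))
print(allowed({"member"}, {"reader"}, {"member"}, in_deprecation=False))
```

In this model an operator who has not yet assigned the new roles keeps working access during the deprecation period and loses it only once the deprecated default is dropped.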

There are still questions about how to set up testing for these. Keystone did only unit tests, but the tests were very heavyweight; had to set up all the users in the DB each time; they wish they had done things in tempest. But there is a concern that it may not be practical to do in tempest, either. (It may depend on the project.)

A pop-up team is going to be started to help get the larger projects moved to the new Policy Code.

Actions

  • rosmaita will investigate the different testing approaches. Note: It's possible that tempest will add the methods to create different users and the different projects will have to do their own testing using those.
  • rosmaita To look at the scoping options and understand what the impact on Cinder will be.
    • Need to create a matrix of our policies and different scopes.
    • Need to figure out how the administrative context fits in
    • Need to check current test coverage and where testing needs to be enhanced.
    • Don't have to have one person do it. It is possible to split up the work.
  • rosmaita and e0ne (and anyone else interested in this) to join the pop-up team to get more info and help get Cinder started.

Cinder V2 API removal

We can't just remove V2 code right now (e.g., the V2 extensions need to be moved to V3) but we can remove access to the API. Actually, that's not true either: there is a lot of background work that needs to be done before we remove the V2 API:

  • Tempest still assumes that the V2 API will be there. Need to fix it.
  • OpenStack Client also has some V2 API assumptions.
  • Devstack also will not work.

Sean has a patch to see how badly things break with V2 removal: https://review.opendev.org/554372

V3 is pretty much exactly the same as V2. We should be able to change and just have people switch the endpoint and have it work. It would be nice if we could just update the catalog but that doesn't appear to be the case.

Actions

  • follow up on this for the virtual PTG. What did we find with Sean's patch?
    • Create a list of the specific work items that need to be completed.
    • At that point we may be able to split the work up to an intern (if we have an intern).

Cinder REST API V4

We had talked in Vancouver about getting to a point after we have enough micro-versions piling up to move to V4.

Actions

  • We don't need to do that in this release (let's get rid of V2 first), but it is something we need to keep in mind as a future goal.

(recording 2 starts here)

Volume local cache

Requires both Cinder and Nova work:

There are now various fast NVMe SSDs, such as the Intel Optane SSD, with read/write throughput of 2.x~3.x GB/s and latency of about 10 us. A typical remote volume for a VM (iSCSI / RBD), by contrast, delivers hundreds of MB/s with millisecond-level latency. So these fast SSDs can be mounted locally on the compute node and used as a cache for remote volumes. On the storage side, we need to add support in os-brick.

Consensus was: there are some storage solutions this cannot be done for (Ceph, no mount point on host machine), some that might not require this (some vendors already have super-fast caching), and some it's worth doing for, so the overall feeling was supportive for this effort.

See the PTG etherpad for details. Picture of the flip chart used during the discussion: https://twitter.com/jungleboyj/status/1192323512238776320

Actions

  • Liang Fang to continue working on this

Mutable options

The context for this is a request from a NetApp customer who wants to be able to change backend credentials without restarting any services.

Problem is that the current mutable config can be done for the REST API, but doesn't extend beyond that. Further, changing driver credentials is a little more work since it may require reloading the driver or having a mechanism in all drivers to recognize and handle that change. Also, we don't want config options that are shared across drivers to be mutable.

Gorka pointed out that a driver supporting Active-Active would not need mutable options for this purpose. It would be better to implement Active-Active instead of refreshing credentials this way. A/A HA support has been ready for several releases now, but so far RBD has been the only driver to test and enable it.

The team feels that using Active/Active is the best way to go.

Actions

  • Gorka volunteered to support the NetApp team if they choose to implement A/A
  • need to add to the developer docs that just making an option mutable in oslo.config does not solve the problem for drivers (more info on the etherpad)

(recording 3 starts here)

Cross Project Discussion with Edge Working Group

Apparently the next version of TripleO will support storage at the edge. They were wondering if we knew anything about that. We don't.

As far as edge persistent storage goes, telcos are thinking about using NFS only and keeping it in the core, which is a scary concept.

In considering the edge use case, it is important to understand the physical limitations of what people have in mind. For example, one small telco rack, or a smaller DC with air conditioning, or a bigger DC with AC and bigger storage unit, etc. You really can't talk about "the edge" (insert U2 joke here).

See the etherpad for more.

Default volume types depending on project or user

Having a single volume type default is too restrictive for bigger clouds with multiple AZs and many tenants/projects. Operators want more defaults to use in particular situations.

The selection of which default to use is easy; the hard part of this will be the code enabling creation of the default at the end user/project level. Will need:

  • new API calls (create, show, list, update, delete), new microversion
  • client support
  • tell horizon about it
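
The "easy" selection step might look something like the sketch below. The spec had not been written at this point, so the precedence order (user, then project, then the global default) and all names here are assumptions.

```python
# Hypothetical sketch; the precedence order is an assumption, since the
# spec for this feature had not been written yet.

def pick_default_volume_type(user_id, project_id, user_defaults,
                             project_defaults, global_default):
    """Choose the default volume type for a request."""
    if user_id in user_defaults:
        return user_defaults[user_id]
    if project_id in project_defaults:
        return project_defaults[project_id]
    # Fall back to the single cloud-wide default that exists today.
    return global_default


user_defaults = {"alice": "fast-ssd"}
project_defaults = {"proj-a": "az1-standard"}
print(pick_default_volume_type("alice", "proj-a", user_defaults,
                               project_defaults, "lvm"))
print(pick_default_volume_type("bob", "proj-a", user_defaults,
                               project_defaults, "lvm"))
```

The lookups are trivial; the real work is the API, client, and Horizon support needed to let these mappings be created and managed.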

Actions

  • geguileo - write the spec
    • request from Glance: we may also want a per-service default (triggered when a service token is passed)

Cinder retype doesn't use driver assisted migration

Gorka thinks this doesn't depend on the driver; he thinks it's broken for all drivers. There is code in the manager that prevents the efficient path from being taken: https://github.com/openstack/cinder/blob/ca5c2ce4e8ae9fbc92181ac4ba09cec3429a71e6/cinder/volume/manager.py#L2490 There was a reason for it; we need to review and see if it still holds.

Ivan thinks this is just a bug, though we don't have a bug open for it.

Actions

  • e0ne to investigate and fix it if he can verify that it is broken.

EOL some of the currently open branches

We have 8 open branches plus master (ussuri). Sent an email to the ML asking for data so we can make a good decision about this: http://lists.openstack.org/pipermail/openstack-discuss/2019-October/010385.html

Got zero responses, so this apparently isn't seen as a big deal by the community.

The policy is that we need to announce 6 months ahead of time the fact that we are planning to EOL a branch. This allows time for a vendor to come in and pick it up if necessary. So, if we want to drop branches we just need to announce that we are planning to EOL branches and then we can do it in 6 months.

The driverfixes branches have not been used in quite a while.

  • Should we delete those? No, we don't really want to lose that history of commits.
  • Could we re-name them? Put 'archived' in the title or something to make it clear that they no longer take code. (Or just document that they are an archive of old driver fixes.)
  • When we EOL a driver we should probably make it a driverfixes branch. (Not clear on exactly what's being proposed here, need to follow up at VPTG.)

Actions

  • rosmaita - find out about renaming branches from infra team; also, about read-only branches (change to gerrit so no patches can be proposed to the branch)?
    • proposal: EOL o, p and rename them archived-ocata, archived-pike
  • rosmaita - send proposal to ML that o, p are due to exit EM status in 6 months
  • revisit this at the Virtual PTG
    • the EOL policy was revised recently, no longer requires the 6 month waiting period
    • want to reconsider whether not deleting the EOL branches is a good idea if we're not going to merge anything into them

(recording 4 starts here)

Discuss the latest User Survey Results

Here's a handy compiled list of only the Cinder responses: https://etherpad.openstack.org/p/cinder-2019-user-survey-question-responses

Actions

  • replication needs better documentation so that people know we can failover and fail back correctly
  • ivan is planning to continue the generic backup driver work

Meeting with the Nova team

When we fail over in Cinder, volumes are no longer usable in Nova, but we don't tell Nova that the failover has occurred. Any procedure in Nova to correct the situation needs to be done manually. It would be better if we let Nova know that a failover has occurred so they can do something.

A complication is that Nova can't simply detach and attach the volume because data that is in flight would be lost.

How about boot from volume? In that case the instance is dead anyway because access to the volume has been lost. Could go through the shutdown, detach, attach, reboot path. Problem is that detach is going to fail. Need to force it or handle the failure. But we aren't sure that Nova will allow a detach of a boot volume. And we don't currently have a force detach API.

Also discussed a possible Nova bug for images created from encrypted volumes: https://bugs.launchpad.net/nova/+bug/1852106 , though it's not clear that the scenario described in the bug can actually happen.

Actions

  • need to figure out how to pass the force to os-brick to detach volume and when rebooting a volume
  • rosmaita to investigate Bug #1852106

Friday (Shanghai)

Meeting with the Glance team

Support for Glance multiple stores in Cinder

References: (cinder spec) https://review.openstack.org/#/c/641267/

The Cinder team is still OK with this idea (which was approved for Train).

Actions
  • retarget spec for Ussuri
  • get Abhishek's patch reviewed

Image snapshot co-location

For the Edge use case, Glance is planning to use info provided by Nova about what image a server was booted from to co-locate snapshots of that server in the same store as the original image. Would like to do the same with Cinder volumes uploaded as images. Just need a header that specifies the "base" image of the volume being uploaded as an image. We agreed that this is a separate use case from the above.

Actions
  • Abhishek will write the spec for Cinder

Glance Cinder driver is very limited

We think it uses only the default volume type, and it is also not very well tested. We all agreed that this is a sad state of affairs.

Actions
  • somebody should do something

Meet with Horizon about their proposed implementation of Cinder user messages

Horizon is interested in exposing the User Messages API. We agreed that this is a great idea.

There's a question about having the message displayed in a requested language. It's possible that this is already handled at the REST API layer via the "Accept-Language" header. If it's not, that's probably the place to support this.

Actions

  • rosmaita determine whether this would require a change to the API code, or whether existing code handles this already

Attach/Detach speed

Gorka was wondering whether there are any complaints about attach/detach speed in OpenStack, particularly since people are now using Cinder to provide volumes for Kubernetes (cinder in-tree driver, Cinder-CSI, Ember-CSI) and may be seeing a lot more attach/detach requests.

Everybody seems to be OK with it; it's only geguileo who's complaining.

Actions

  • not a concern at the moment

Topics from Train mid-cycle: status and carry-over to Ussuri

Notes about the Train mid-cycle: https://wiki.openstack.org/wiki/CinderTrainMidCycleSummary

Mid-cycle etherpad: https://etherpad.openstack.org/p/cinder-train-mid-cycle-planning

Multiattach

All items need followup. Goals are:

  • short-term: document some guidance for how this feature should be tested
  • long-term: get some new tests into the cinder-tempest-plugin for this

Actions
  • rosmaita draft the short-term document

iSCSI Ceph driver

Due to some changes in downstream priorities, Walt is having trouble finding time to work on this. Ivan suggested that we encourage Walt to post whatever he's got, even if it's not working, so what he's learned isn't lost. There are some patches up and a github repo for some code Walt had to write that doesn't have a home in OpenStack or Ceph yet.

Actions
  • rosmaita: follow up with Walt
  • rosmaita: put together an etherpad with links to the work done so far

3rd Party CI Irregularities

Third-party testing by backend vendors of their driver code is very important to the project. But most of the 3rd Party CI systems appear to be pretty unstable.

For most vendors, updating their 3rd Party CI to run python 3.7 in Train was not a simple task. It would be good if we could offer them better guidance about how to set up & maintain their 3rd Party CI. Would also like vendors to be running the cinder-tempest-plugin, but don't want to make it a demand unless we can make the path easier. (BTW, Datera is running the cinder-tempest-plugin in their CI!)

Third Party CI Docs (partial list)

Actions
  • Luigi has some ideas about using RDO Software Factory as a basis for 3rd Party CI; need to follow up with him on that
  • Gorka: will check about what RDO has available
  • e0ne: will look to see who's using cinder-tempest-plugin
  • the team: after gorka and e0ne report back, reorganize & update the 3rd party CI docs

Improve Automated Test Coverage

We want to do this via the cinder-tempest-plugin. Sophia (enriquetaso) is mentoring an Outreachy intern who has begun some work on this. Eric has been writing bugs to suggest test cases that need to be addressed.

SQLAlchemy to Alembic migration

No progress on this. Put in a proposal for a summer intern to work on this; maybe we'll get lucky.

See https://etherpad.openstack.org/p/cinder-train-ptg-planning (line #247) for more info.

Capabilities Reporting

Operators need to read the vendor's manual to figure out which extra specs they can write for a particular backend, and what they're used for. It would be nice to have drivers report their capabilities in a way that lets the operator figure out this info from the CLI.

Everyone agreed that we still want to do this. It will require an API change and there's already a spec for this: https://review.opendev.org/#/c/655939/1/specs/train/backend_capabilities.rst

Actions
  • revisit at the Virtual PTG and figure out who's interested in working on it

Cinder Business

Cinder Ussuri Priorities

We will finalize this after the Virtual PTG, but here's the initial list:

  • Increase testing coverage
  • Increase number of CIs running cinder-tempest-plugins
  • Better support for third party CIs: Make their life easier by having a way to deploy a robust system
  • Volume types per user/project/service-token
    • better documentation
  • Generic Backups
  • Improve HA Active-Active documentation
    • want to make it easier to test it
  • remove V2 API
  • remove python 2 support

Cinder-core update

See http://lists.openstack.org/pipermail/openstack-discuss/2019-November/010519.html . We are at roughly the same review strength we had in Train.

Meeting Time Change update

We were holding off on this until after the Summit so that new contributors could participate in a poll. We'll consider the options from Liang Fang's original proposal at the Cinder weekly meeting: http://eavesdrop.openstack.org/meetings/cinder/2019/cinder.2019-10-23-16.00.log.html#l-166 These are to move the meeting 1 or 2 hours earlier. There has also been some discussion on the ML: http://lists.openstack.org/pipermail/openstack-discuss/2019-October/010328.html

Actions
  • rosmaita put together a community poll

Virtual PTG

We discussed what the format should be. Consensus was to do it over 2 consecutive days, using 2 hours each day. This should make it easier for people to participate in at least part of the meeting. We want to do it soon; consensus was the week after KubeCon to avoid conflicts. So that would be the last week in November.

Actions
  • rosmaita put together a community poll to determine days/times

Virtual Mid-Cycle

There is interest in having a midcycle. Although everyone recognizes that face-to-face is the best, contributors have been having trouble getting travel support. So we decided to do a completely Virtual Mid-Cycle meetup for Ussuri. We decided to figure out the format after we see how the Virtual PTG works out.

Monday (Virtual)

Forum session recap: Are You Using Upgrade Checks?

Jay gave a quick recap of the Forum session. There are a number of action items in the etherpad above. They are assigned to jungleboyj right now as a TC action.

There are still some questions about (a) how operators are using these, and (b) what kind of checks we should be providing from the development side. The Cinder team was seeing this as a pre-check. Others are seeing it as a check that is used along the way while an upgrade is in process to ensure that things are ready before operators start up their services. Pre-checks seem to make sense for us; Sean noted that we could add an option to do some pre-checks to the cinder-status command.

So what should the Cinder team do during the Ussuri cycle (before we have the above issues settled)? At the very least, we should still add them when a driver is unsupported and subject to removal:

  • inform operators that in order to use an unsupported driver, a flag has to be set in cinder.conf
  • inform operators that they need to contact the vendor about whether they have plans to have the driver re-instated; otherwise, the operator needs to prepare to migrate the affected volumes to a backend with a supported driver for the next Cinder release
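
A pre-check along these lines could be sketched as follows. This is a hypothetical illustration, not the oslo.upgradecheck or cinder-status API: the driver path is made up, and the dictionary keys volume_driver and enable_unsupported_driver are assumed to mirror the corresponding cinder.conf settings.

```python
# Hypothetical sketch; not the oslo.upgradecheck or cinder-status API.
# The driver path below is made up for illustration.
UNSUPPORTED_DRIVERS = {"cinder.volume.drivers.example.ExampleDriver"}


def check_unsupported_driver(backend_conf):
    """Return a (status, message) pair for one backend's settings."""
    driver = backend_conf.get("volume_driver")
    if driver not in UNSUPPORTED_DRIVERS:
        return ("success", "")
    if not backend_conf.get("enable_unsupported_driver", False):
        # First bullet above: the flag has to be set to use the driver.
        return ("failure",
                "Driver is unsupported; a flag must be set in cinder.conf "
                "to keep using it.")
    # Second bullet above: the operator should plan ahead.
    return ("warning",
            "Driver is unsupported; contact the vendor or prepare to "
            "migrate volumes to a backend with a supported driver.")


print(check_unsupported_driver(
    {"volume_driver": "cinder.volume.drivers.example.ExampleDriver"}))
```

Run before the upgrade, a check like this would surface both messages while the operator still has time to act on them.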

Actions

  • jungleboyj - Start a discussion on the mailing list to find out if anyone is actually using or has used the upgrade checks in production
  • need to figure out where the documentation for this goes

Snapshot co-location

The spec for this is: https://review.opendev.org/#/c/695630/

This is related to glance multi-store support in Cinder, but the spec needs to specify more carefully what the use case is. We think it is: a user has a volume that was created from a glance image, and wants to upload it as an image; we want to give Glance info so that it can put the new image in the same store as the original image. (So use of the term "snapshot" here may be inaccurate.)

This feature depends on the implementation of the other glance multistore spec: https://review.opendev.org/#/c/661676/

Actions

  • Rajat, Abhishek - update the spec

Python 2 support removal

Gave a quick summary of what we discussed in Shanghai (see above) so that we're all on the same page.

There's a patch up now removing py2 testing from Cinder: https://review.opendev.org/695317 . Once that's approved, will do the same for the other components.

General advice to Cinder developers about using Python 3 language features: https://wiki.openstack.org/wiki/CinderUssuriPTGSummary#Actions

Actions

  • rosmaita - get the testing/gate patches merged, then let the good times roll

User messages

Quick discussion of the admin action "leakage" issue discussed on https://review.opendev.org/#/c/694954/

Consensus was that it would be useful to expose admin-oriented actions in user messages that only admins would be able to view. Maybe set a special flag when the message is created, and then use the admin context to decide whether this gets shown or not. Agreed that the message content will be the same as we have currently (that is, don't expose any sensitive information even to admins). We can wait until admin-facing user messages are being used and get feedback about whether more info is required or not.

Ivan pointed out that this change should not require a new microversion, since there's no change to the user message API and no change to the current response.
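
The flag idea discussed above could be sketched roughly like this. The admin_only key and the function name are hypothetical; the real design is left to the spec.

```python
# Hypothetical sketch; "admin_only" is an assumed flag name.

def visible_messages(messages, is_admin):
    """Filter user messages according to the caller's context."""
    return [m for m in messages
            if is_admin or not m.get("admin_only", False)]


messages = [
    {"id": 1, "detail": "volume create failed"},
    {"id": 2, "detail": "migration failed", "admin_only": True},
]
print([m["id"] for m in visible_messages(messages, is_admin=False)])
print([m["id"] for m in visible_messages(messages, is_admin=True)])
```

The response body for each message is unchanged; only the set of messages returned depends on the context, which fits Ivan's point that no new microversion should be needed.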

Actions

  • rosmaita - write up a spec

3rd Party CI irregularities

The issue we want to address is that the 3rd Party CI systems seem pretty unstable. We'd like to be able to provide some more support to make the infrastructures more reliable. Luigi suggested using RDO Software Factory as a basis for 3rd Party CI.

References:

Actions

  • Luigi - follow up with RDO team and get some feedback on how plausible this scenario is
  • e0ne - will look to see who's using cinder-tempest-plugin

Extending default volume type support for tenants

Quick recap of the Shanghai discussion (see above). Simon had mentioned that he might have a developer at Pure who'd be interested in doing the implementation. Rajat volunteered to help support the implementation.

Actions

  • Gorka - write up the spec
  • rosmaita - follow up with Simon

Quotas!

Eric has a patch up that may fix one of many problems: https://review.opendev.org/#/c/695096/ Eric thinks the patch could be optimized if someone is interested.

The general problem is that we update multiple tables and there can be (are) race conditions, so you wind up with strange situations like negative quota values or multiple quotas for the same project. Operators have posted some scripts to be used occasionally to clean up the database, but it would be better to fix this in Cinder.
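
To illustrate the kind of race involved, here is a minimal sketch of one common mitigation: doing the check and the update in a single guarded SQL statement, plus a uniqueness constraint to rule out duplicate rows. It uses sqlite3 so that it is self-contained. It is not Eric's patch, and the table and column names are assumptions for the example.

```python
import sqlite3

# Illustrative schema; names are assumptions, not Cinder's actual tables.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE quota_usages ("
    " project_id TEXT, resource TEXT, in_use INTEGER,"
    " UNIQUE (project_id, resource))"  # prevents duplicate quota rows
)
conn.execute("INSERT INTO quota_usages VALUES ('p1', 'volumes', 1)")


def release_quota(conn, project_id, resource, amount):
    """Decrement usage; return False if that would go negative."""
    # The guard (in_use >= ?) and the write happen in one statement, so
    # there is no separate read whose result could be stale by the time
    # the write runs.
    cur = conn.execute(
        "UPDATE quota_usages SET in_use = in_use - ? "
        "WHERE project_id = ? AND resource = ? AND in_use >= ?",
        (amount, project_id, resource, amount),
    )
    conn.commit()
    return cur.rowcount == 1


first = release_quota(conn, "p1", "volumes", 1)
second = release_quota(conn, "p1", "volumes", 1)
print(first, second)
```

Here the first release succeeds and the second is refused rather than driving the value below zero. A read-then-write version of the same logic could let both callers pass the check.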

EOL for driverfixes/{m,n} and stable/{o,p}

Since the Shanghai discussion, a patch has merged that removes the 6 month waiting period for the transition from EM -> EOL: https://review.opendev.org/#/c/682381/

There was a discussion about this in #openstack-tc last week: http://eavesdrop.openstack.org/irclogs/%23openstack-tc/%23openstack-tc.2019-11-22.log.html#t2019-11-22T15:35:01

Consensus is that we should go ahead and do this.

Actions

Driver support matrix

Follow-up from the discussion in Shanghai. A suggestion was made that multipath should be a specific category in the support matrix.

Consensus is that multipath is more a feature of the backend than of the driver. It is useful to know whether drivers do it, but unlike replication, it's not something a driver simply does or doesn't do. Also, there are options that have to be set in nova in order for it to be useful (volume_use_multipath in the [libvirt] section of nova.conf). So there doesn't seem to be a point in adding this to the support matrix.

Wednesday (Virtual)