CinderCaracalPTGSummary

Introduction

The Eighth virtual PTG for the 2024.1 (Caracal) cycle of Cinder was conducted from Tuesday, 24th October, 2023 to Friday, 27th October, 2023, 4 hours each day (1300-1700 UTC). This page will provide a summary of all the topics discussed throughout the PTG.

Cinder Bobcat Virtual PTG 29 March 2023


This document aims to give a summary of each session. More information is available on the cinder 2024.1 Caracal PTG etherpad:


The sessions were recorded, so to get all the details of any discussion, you can watch/listen to the recording. Links to the recordings for each day are below their respective day's heading.


Tuesday 24 October

recordings

Bobcat Retrospective

We categorized the discussion into the following subsections:

  • What went well?
    • Active contributions across different companies
    • We had 2 Outreachy interns in the summer internship who made great contributions to the api-ref sample tests
    • Releases happened on time
  • What went badly?
    • CI failure rate impacting productivity
    • Sofia leaving the team really affected review bandwidth
  • What should we continue doing?
    • Sponsoring outreachy interns
      • proposal accepted with at least one applicant
  • What should we stop doing?
    • Lack of structure around the review request section in cinder meetings
      • Too many patches discourage the reviewers from taking a look
      • Authors should add an explanation if the patch is complicated, and ask questions if they have any doubts
      • Authors should add only the patches relevant to the current milestone instead of adding every possible patch
      • Authors should be active in reviews, since the core team prioritizes patches from contributors who review actively
      • #action: whoami-rajat to follow up on this in a cinder meeting

Gate Issues

Continuing the discussion from the 2023.2 Bobcat midcycle, we are still seeing gate issues consisting of OOMs and timeouts. Looking at a sample gate job failure, we decided on the following points:

  • Increase the system memory of the VM if possible; 8GB is not enough for tempest tests
  • Increase swap space (make it the same size as RAM for a one-to-one mapping)
  • Change cinder to use file locks for coordination in order to get rid of etcd (see the sketch after this list)
  • Reduce number of processes
    • We see a pattern of multiple services each running several processes:
      • neutron-server: 5
      • nova-conductor: 6
      • nova-scheduler: 2
      • swift: 3 for each of its services
  • Reduce concurrency to 2 for testing purposes to see how many VMs we end up running
  • #action: rosmaita to propose a patch for a few straightforward tasks, like increasing swap space
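
For the coordination change, a minimal sketch of the cinder.conf setting that switches the tooz coordinator from etcd to file locks (devstack normally points this at etcd3; the value shown is cinder's upstream default and is only illustrative):

  [coordination]
  # Use file-based locks instead of etcd3 for the tooz coordinator.
  # $state_path expands to cinder's state_path option.
  backend_url = file://$state_path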

Backup/Restore performance

There is a bug reported against the S3 and swift backends complaining that they are very slow. Launchpad: https://bugs.launchpad.net/cinder/+bug/1918119

The general discussion was around the following points:

  • If there are issues in backup/restore, report a bug; it helps the team be aware of all potential improvements
  • Using stream instead of chunks
  • We don't have a backup or restore progress status anywhere
  • We can work on something to get data about long running operations and their current status.
  • #action: zaitcev to investigate what we have now and propose a spec for observability, in particular for restores ~ we have percent-progress notifications already, but no way to query the current percentage (see the sketch after this list)
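
A minimal sketch of how the existing notifications could be consumed today with oslo.messaging, as a starting point for the observability work (the transport URL, topic name, and event-type prefixes are assumptions based on a typical devstack setup, not details of the eventual spec):

  import oslo_messaging
  from oslo_config import cfg

  # Assumed RabbitMQ transport URL; adjust for the actual deployment.
  transport = oslo_messaging.get_notification_transport(
      cfg.CONF, url='rabbit://stackrabbit:secret@controller:5672/')
  targets = [oslo_messaging.Target(topic='notifications')]

  class BackupEndpoint(object):
      def info(self, ctxt, publisher_id, event_type, payload, metadata):
          # Event-type prefixes are assumptions; cinder-backup emits
          # start/end (and progress) notifications for backup operations.
          if event_type.startswith(('backup.', 'restore.')):
              print(event_type, payload.get('status'))

  listener = oslo_messaging.get_notification_listener(
      transport, targets, [BackupEndpoint()], executor='threading')
  listener.start()
  listener.wait()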


A few of the specs related to backup were also mentioned:

Wednesday 25 October

recordings

Cinder multi-tenancy support

Use Case: An operator wants to partition and isolate their storage array on a per tenant basis.

Manila has DHSS (driver_handles_share_servers) support, which allows its backends to create virtual storage instances.

Currently, for cinder, operators need to create all the virtual storage instances manually and map each to a backend in cinder.conf.

They can use a private volume type to tie a particular volume type to a project, and also to a backend with the "volume_backend_name" extra spec (see the example below).
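
A rough sketch of that with python-openstackclient (the type, backend, and project names are placeholders):

  # Create a private volume type pinned to one backend
  openstack volume type create --private \
      --property volume_backend_name=netapp-tenant-a tenant-a-type

  # Grant a single project access to that type
  openstack volume type set --project tenant-a tenant-a-type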

Conclusion:

  • The use case is NetApp specific, since the virtual storage instance creation happens on the storage side and manila's workflow includes a step to create virtual instances (share servers)
  • The Manila architecture allows using neutron for network partitioning, which is completely different from how cinder exposes LUNs
  • The use case is not suitable for Cinder because
    • Cinder doesn't interact with neutron at all
    • Even if we add the support, other drivers won't be able to make use of it

NetApp LUN space allocation

The NetApp team wants a way to configure a space_allocation property for their volumes in the NetApp backend.

Space allocation enables ONTAP to reclaim space automatically when the host deletes a file.

The current proposal is to configure it via a volume type extra spec and the volumes created with that type will have space allocation enabled.

Another case is that we need to be able to notify nova that the LUN is sparse.

If the LUN is sparse and nova guest supports TRIM/DISCARD, the guest will send the TRIM/DISCARD commands to the storage array to reclaim the storage space.

By default thin provisioning is enabled in the NetApp backend but space allocation is disabled.

Agreed Design

  • Add the space_allocation property in volume type extra specs (see the sketch below)
  • Report the discard value the same as space_allocation when returning connection info
  • By default, report_discard_supported will be reported via the "discard" parameter (if the operator sets it in cinder.conf)
  • #action: NetApp team to add support for space_allocation and report discard.

NOTE: If we enable/change the configuration on the backend side, we need to remap the LUN (detach then attach) for the space_allocation/discard value to be reflected
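
A hedged sketch of what the agreed design could look like from the operator side; report_discard_supported is an existing per-backend cinder.conf option, while the extra-spec key shown is illustrative rather than a finalized name:

  # cinder.conf, NetApp backend section (section name is illustrative)
  [ontap-iscsi]
  report_discard_supported = True

  # Proposed usage: enable space allocation through a volume type extra spec
  # (the exact key, e.g. netapp:space_allocation, is still to be defined)
  openstack volume type set --property netapp:space_allocation=true ontap-type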

Operator Hour

No operator showed up in the operator hour.

Operators not showing up doesn't seem to be a cinder-specific problem.

Manila also had similar problems of operators not showing up.

  • #action: Ask the opendev team for stats about operator attendance in other projects' operator hours.

Block device driver

The block device driver allowed creating local block devices and making them available to nova instances.

It supported providing a device path and also exporting the device via iSCSI.

It had limited features and people were not using it, so it was removed from the cinder driver tree.

The use case to bring it back is replicated databases, and it is not limited to etcd.

We can modify the LVM driver to do local attach (see the configuration sketch below).
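
For context, a typical LVM backend stanza in cinder.conf that such a local-attach mode would build on (section and backend names are illustrative; the local-attach behaviour itself is the proposed work, not an existing option):

  [lvm-local]
  volume_driver = cinder.volume.drivers.lvm.LVMVolumeDriver
  volume_group = cinder-volumes
  target_helper = lioadm
  volume_backend_name = lvm-local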

Problems with using nova ephemeral

  • not configurable: we can either use ceph or lvm
  • deployed computes may not have local storage available
  • we don't have a dedicated block device for etcd; we would just allocate part of the total ephemeral storage to etcd


Some of the constraints of using this configuration are:

  • the cinder-volume service needs to be deployed on the compute nodes hosting the instances
  • a local connector that returns a path is needed -- we already have one
  • No volume migration and data persistence (it is not a priority)
  • We will have some k8s controllers with cinder+LVM and some without it
  • running cinder-volume on every compute node and scaling it to something like 500 nodes might put load on the scheduler if each reports every 60 seconds (the default)


Conclusion

  • LVM with local storage should be a good option
    • the block device driver might be a future option if the latency issue persists (currently we don't have a performance claim that the block device driver performs better than local LVM)
  • #action: eharney to work on this

Cross project with nova

Wednesday: 1600-1700 UTC

Improper attachment cleanup during failure in some operations

Some of the nova operations that interact with cinder attachments don't do proper cleanup in failure scenarios.

  • Live migration
  • Evacuate
  • Cold migration

Improper cleanup leads to inconsistencies in the deployment that need to be addressed manually, which has become harder since the CVE-2023-2088 fix, because attachment delete only works when requested by a service user with a service token.

Solution:

  • at compute init_host, we could identify and log each attachment that was not cleaned up
  • We have a cinder API that returns all the attachments for a given compute host (see the example request after this list)
    • /attachments?all_tenants=1&attach_host=?
    • We will require admin + service token to get it
  • #action: nova team can report a bug and decide who would like to work on it
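
A rough sketch of querying that API with curl, assuming the filter names mentioned above and valid admin and service tokens (the attachments API requires microversion 3.27 or later):

  curl -s "$CINDER_ENDPOINT/v3/attachments?all_tenants=1&attach_host=compute-01" \
      -H "X-Auth-Token: $ADMIN_TOKEN" \
      -H "X-Service-Token: $SERVICE_TOKEN" \
      -H "OpenStack-API-Version: volume 3.27"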

Fixing RBD retype regression introduced in Wallaby

Cinder has 2 migration paths:

  • generic: cinder does the migration
  • optimized: backend driver does the migration

A change in Wallaby caused the operation to use the optimized path for drivers where it wasn't used before.

This exposed a new bug in the optimized migration path.

We have two ways to fix the issue:

This will also require a certain level of testing:

Image Metadata in Cinder, Nova, and Glance

  • In glance, image metadata is unrestricted, but cinder and nova restrict it to 255 characters in their schema validation. This is because glance doesn't have any schema validation for image properties (see the illustration after this list).
  • Currently it's a text field (64k) in the nova, cinder, glance, etc. databases, causing a lot of IO when reading it, which isn't ideal.
  • #action: (rosmaita) bring it to the mailing list as to what should be the correct value for this field (based on how it is used)
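
To illustrate the mismatch (the property and image names are placeholders): glance will accept a property value longer than 255 characters, while the corresponding cinder/nova columns are capped at 255, so operations that copy the metadata into those databases can fail:

  # ~300-character value: accepted by glance, over the 255-character limit
  # that cinder and nova enforce in their schemas
  openstack image set --property big_prop="$(python3 -c 'print("x" * 300)')" my-image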

Gate Issues

The gate job shows 6 nova-conductor processes running. Based on the configuration, the workers are set to two:

  [conductor]
  workers = 2

The conductor processes should not cause gate failures but might consume memory causing other processes to starve.

Proposed Solutions:

  • Increase swap space. Recommendation: 8 GB (see the job sketch after this list)
  • keystone and neutron DB queries can be optimized with caching
  • The 14 qemu processes are possibly a result of higher tempest concurrency, but with the current concurrency (6) it shouldn't reach 14 instances, so we need to check if there are any cleanup issues
  • Request higher-RAM systems (16GB) for gate jobs
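
A hedged sketch of what the swap-space change could look like in a devstack-based zuul job (the job name is illustrative; configure_swap_size is the standard devstack job variable, in MiB):

  - job:
      name: cinder-tempest-example   # illustrative
      parent: devstack-tempest
      vars:
        # 8 GB of swap, as recommended above
        configure_swap_size: 8192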


Another gate issue that requires attention is LVM calls getting stuck in nova gate jobs; grenade is an example.

  • Proposal to move lvm calls to privsep
  • Switch gate jobs to use ceph
    • we won't be testing the iSCSI code path (os-brick connector)
  • check whether devstack configures LVM device filters (see the illustration after this list)
    • Melanie checked it and fixed an issue around it
  • #action: cinder team to look into gate failures faced by nova regarding LVM
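
For reference on the device-filter point, an illustrative lvm.conf snippet showing the kind of filter devstack sets up (the exact pattern devstack writes differs; this only shows the mechanism):

  devices {
      # Accept only the loop device backing the stack volume group and reject
      # everything else, so LVM scans don't touch attached iSCSI devices.
      global_filter = [ "a|^/dev/loop2$|", "r|.*|" ]
  }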


Thursday 26 October

recordings

New Quota system

Force volume unmanage