Cinder/Specs/NVMEMDHealingAgent

Latest revision as of 17:17, 29 November 2020

OpenStack Healing Agent

init
  host_uuid
  host_nqn
  <other params such as version>

Get the host UUID and NQN, then schedule the main method to run every X seconds.
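
A minimal sketch of the init and scheduling step, assuming a plain loop with an interruptible wait. Only host_uuid and host_nqn come from the spec; the class name HealingAgent, the interval and max_runs parameters, and the runs counter are hypothetical and exist only for illustration.

```python
import threading


class HealingAgent:
    """Hypothetical agent skeleton; not the actual Cinder/os-brick code."""

    def __init__(self, host_uuid, host_nqn, version="unknown", interval=30.0):
        self.host_uuid = host_uuid
        self.host_nqn = host_nqn
        self.version = version
        self.interval = interval  # the "X seconds" from the spec (assumed value)
        self.runs = 0
        self._stop = threading.Event()

    def main(self):
        # Placeholder for the hostprobe and monitor_host sub-methods.
        self.runs += 1

    def run(self, max_runs=None):
        # Event.wait() acts as a sleep that stop() can interrupt.
        while not self._stop.is_set():
            self.main()
            if max_runs is not None and self.runs >= max_runs:
                break
            self._stop.wait(self.interval)

    def stop(self):
        self._stop.set()
```

Calling run() blocks the caller, so a real agent would probably start it in a background thread or use whatever periodic-task facility the hosting service already provides.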


main method

This is scheduled to run every X seconds on the connector host and calls the following sub-methods:

hostprobe

Call storage provisioner /hostprobe API with stored info:

  host_nqn
  host_uuid
  host_name
  client_type
  duration
  version
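
As an illustration, the hostprobe request body could be assembled as below. The six field names come from the list above; the function name, the default client_type value, and the use of the local hostname are assumptions, and the spec does not say what duration measures or what format the provisioner expects.

```python
import socket


def build_hostprobe_payload(host_nqn, host_uuid, duration, version,
                            host_name=None, client_type="OpenStack"):
    """Collect the stored host info into one dict for the /hostprobe call.

    The "OpenStack" client_type default is a guess, not a documented value.
    """
    return {
        "host_nqn": host_nqn,
        "host_uuid": host_uuid,
        # Fall back to the local hostname when none is supplied (assumption).
        "host_name": host_name or socket.gethostname(),
        "client_type": client_type,
        "duration": duration,
        "version": version,
    }
```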


monitor_host

  Query the storage provisioner for metadata on all volumes belonging to this host (uuid)
  Inspect all KS volume NVMe connections / hook into their events
  Inspect the host MD of every KS replicated volume for the states of its legs


Call self_healing spec below with provisioner metadata + inspected host volume devices info:

self_healing

If the storage provisioner metadata shows a different set of legs for the volume than what was inspected on the host, reconcile the volume’s MD state:

  1. Connect to the targets of new replicas if not already connected
  2. Remove replica legs from MD that the provisioner says are no longer part of the volume
  3. Re-assemble MD with the provisioner’s replica info for the volume
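
The reconcile decision can be sketched as a set comparison. This only computes a plan; actually connecting targets and manipulating MD is out of scope here. The function name and the representation of legs as plain string identifiers are assumptions.

```python
def plan_reconcile(provisioner_legs, host_legs, connected_targets):
    """Return the reconcile actions, or None when both leg sets agree.

    All three arguments are iterables of leg identifiers (hypothetical form).
    """
    wanted = set(provisioner_legs)
    present = set(host_legs)
    if wanted == present:
        return None  # nothing to reconcile

    new_legs = wanted - present
    return {
        # Step 1: new replicas whose targets are not yet connected.
        "connect": sorted(new_legs - set(connected_targets)),
        # Step 2: legs in MD that the provisioner no longer lists.
        "remove": sorted(present - wanted),
        # Step 3: the leg set MD should be re-assembled with.
        "assemble": sorted(wanted),
    }
```

For example, if the provisioner lists legs a, b, c while the host MD holds a, b, d, the plan connects c, removes d, and re-assembles with a, b, c.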

Active self healing:

If the host MD shows one of its legs as failed, but metadata from the storage provisioner says it is supposed to be available, report the failed/missing leg to the provisioner (and vice versa for an available leg that the provisioner says is supposed to be failed/missing).
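
A minimal sketch of the mismatch check, assuming each side is reduced to a mapping from leg identifier to a boolean "available" flag. A leg absent from a mapping is treated as not available. The function name and the "available"/"failed" labels are illustrative.

```python
def legs_to_report(host_available, provisioner_available):
    """List (leg, observed_state) pairs where host and provisioner disagree."""
    reports = []
    for leg in sorted(set(host_available) | set(provisioner_available)):
        observed = host_available.get(leg, False)
        expected = provisioner_available.get(leg, False)
        if observed != expected:
            # Report what the host actually sees so the provisioner can update.
            reports.append((leg, "available" if observed else "failed"))
    return reports
```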

If the volume has maxDownTime>0, the provisioner reports a leg as missing for more than maxDownTime, and the volume is not being migrated, try to replace the leg:

  1. Call provisioner add_replica (with node’s host uuid / topology)
  2. Publish the replica and connect to it
  3. If successful, call provisioner delete_replica for the missing leg
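
The leg replacement rule can be sketched as a guard plus a three-step flow. The names add_replica and delete_replica come from the spec, but their signatures here are assumptions, so they are passed in as plain callables. publish_and_connect is a hypothetical helper standing in for step 2.

```python
def should_replace_leg(max_down_time, missing_for, migrating):
    """True when all three conditions from the spec hold (same time unit)."""
    return max_down_time > 0 and missing_for > max_down_time and not migrating


def replace_leg(add_replica, publish_and_connect, delete_replica,
                volume_id, missing_leg, host_uuid):
    """Run the replacement steps; return True only if the old leg was deleted."""
    # Step 1: ask the provisioner for a new replica near this host.
    new_leg = add_replica(volume_id, host_uuid)
    # Step 2: publish the replica and connect to it.
    if not publish_and_connect(new_leg):
        return False  # keep the missing leg registered so it can be retried
    # Step 3: only now drop the missing leg from the volume.
    delete_replica(volume_id, missing_leg)
    return True
```

Deleting the old leg last means a failed connect leaves the provisioner's view of the volume unchanged apart from the extra replica, which a later run could reuse or clean up.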


Also in monitor_host report to provisioner any of the detected events below:

  Target connect / disconnect
  Replicated volume degraded / healed
  Replicated volume started / finished sync
  NVMe session established / closed

(This is for monitoring/telemetry purposes)
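
One possible way to represent these events for reporting is an enumeration plus a small report record. The event set mirrors the list above; the enum name, the string values, and the report fields are all hypothetical.

```python
from enum import Enum


class HostEvent(Enum):
    """Events from the monitor_host list; the string values are illustrative."""
    TARGET_CONNECTED = "target_connected"
    TARGET_DISCONNECTED = "target_disconnected"
    VOLUME_DEGRADED = "volume_degraded"
    VOLUME_HEALED = "volume_healed"
    SYNC_STARTED = "sync_started"
    SYNC_FINISHED = "sync_finished"
    NVME_SESSION_ESTABLISHED = "nvme_session_established"
    NVME_SESSION_CLOSED = "nvme_session_closed"


def build_event_report(event, host_uuid, volume_id=None):
    """Bundle one detected event into a dict for the provisioner (assumed shape)."""
    return {"event": event.value, "host_uuid": host_uuid, "volume_id": volume_id}
```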