Cinder/Specs/NVMEMDHealingAgent
Page created by Zohar.cloud.
Latest revision as of 17:17, 29 November 2020
OpenStack Healing Agent
init
Parameters: host_uuid, host_nqn, <other params such as version>
Get the host UUID and NQN, then schedule the main method to run every X seconds.
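As a rough illustration of the init step, the sketch below stores the host identity and re-runs a main method on a fixed interval. The class name `HealingAgent`, the `interval_s` field (standing in for "X seconds"), the 30-second default, and the use of a `threading.Event` loop are all assumptions made for this example; the spec does not prescribe a scheduling mechanism.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class HealingAgent:
    """Hypothetical agent object; holds the identity gathered at init."""
    host_uuid: str
    host_nqn: str
    version: str = "unknown"   # stands in for "<other params such as version>"
    interval_s: float = 30.0   # the spec's "X seconds"; 30 is an arbitrary placeholder
    runs: int = 0              # number of completed main passes (for illustration)
    _stop: threading.Event = field(default_factory=threading.Event, repr=False)

    def main(self) -> None:
        """One pass of the main method (hostprobe, monitor_host, ...)."""
        self.runs += 1

    def run_forever(self) -> None:
        # Event.wait() returns False on timeout and True once stop() was called,
        # so the loop runs main() every interval_s seconds until stopped.
        while not self._stop.wait(self.interval_s):
            self.main()

    def stop(self) -> None:
        self._stop.set()
```

A caller would typically start `run_forever` in a background thread and call `stop()` on shutdown.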
main method
This is scheduled to run every X seconds on the connecting host and calls the following sub-methods:
hostprobe
Call the storage provisioner /hostprobe API with the stored info:
host_nqn, host_uuid, host_name, client_type, duration, version
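The hostprobe call could be prepared roughly as follows. The six field names come from the list above; everything else (a JSON body, the POST method, the `base_url` argument, and the helper name `build_hostprobe_request`) is an assumption for illustration, since the spec does not define the wire format. The sketch only builds the request and does not send it.

```python
import json
import urllib.request
from urllib.parse import urljoin


def build_hostprobe_request(base_url, host_nqn, host_uuid, host_name,
                            client_type, duration, version):
    """Build (but do not send) a hostprobe request.

    JSON-over-POST is assumed here; the real provisioner API may differ.
    """
    payload = {
        "host_nqn": host_nqn,
        "host_uuid": host_uuid,
        "host_name": host_name,
        "client_type": client_type,
        "duration": duration,
        "version": version,
    }
    return urllib.request.Request(
        urljoin(base_url, "/hostprobe"),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would perform the call; error handling and authentication are left out of this sketch.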
monitor_host
Query the storage provisioner for metadata on all volumes belonging to this host (uuid).
Inspect all KS volume NVMe connections / hook into their events.
Inspect the host MD of every KS replicated volume for the state of its legs.
Call self_healing (specified below) with the provisioner metadata and the inspected host volume device info.
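The spec does not say how the host MD leg states are inspected. One possible approach, shown here purely as an assumption, is to parse the text of `/proc/mdstat`, where each array line lists its member devices and a failed member carries an `(F)` flag. The function name and the `"active"` / `"failed"` labels are made up for this sketch.

```python
import re

# Matches one member entry on an mdstat array line, e.g. "nvme2n1[1](F)".
_MEMBER = re.compile(r"(?P<dev>[\w-]+)\[\d+\](?P<flags>(?:\([A-Z]\))*)")


def parse_mdstat_legs(mdstat_text):
    """Return {md_name: {member_device: "failed" | "active"}}."""
    arrays = {}
    for line in mdstat_text.splitlines():
        name, sep, rest = line.partition(" : ")
        if not sep or not name.startswith("md"):
            continue  # skip "Personalities", block-count and trailer lines
        arrays[name] = {
            m.group("dev"): "failed" if "(F)" in m.group("flags") else "active"
            for m in _MEMBER.finditer(rest)
        }
    return arrays


SAMPLE = """Personalities : [raid1]
md127 : active raid1 nvme2n1[1](F) nvme1n1[0]
      1048512 blocks super 1.2 [2/1] [U_]
unused devices: <none>
"""
```

With the sample text above, `parse_mdstat_legs(SAMPLE)` maps `md127` to one failed leg (`nvme2n1`) and one active leg (`nvme1n1`). Mapping member devices back to provisioner replicas is outside the scope of this sketch.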
self_healing
If the storage provisioner metadata shows a different set of legs for the volume than what was inspected on the host, reconcile the volume’s MD state:
1. Connect to the targets of new replicas if not already connected.
2. Remove from the MD any replica legs that the provisioner says are no longer part of the volume.
3. Re-assemble the MD with the provisioner’s replica info for the volume.
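The reconciliation above boils down to set differences between what the provisioner reports and what the host sees. A minimal sketch, assuming legs and targets can be compared by a shared identifier (the identifier scheme, `ReconcilePlan`, and `plan_reconcile` are hypothetical names, not part of the spec):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReconcilePlan:
    connect: frozenset   # replicas whose targets still need connecting (step 1)
    remove: frozenset    # legs to drop from the MD (step 2)
    reassemble: bool     # whether the MD must be re-assembled (step 3)


def plan_reconcile(provisioner_legs, host_md_legs, connected_targets):
    """Compute what has to change so the host MD matches the provisioner."""
    wanted = frozenset(provisioner_legs)
    present = frozenset(host_md_legs)
    return ReconcilePlan(
        connect=wanted - frozenset(connected_targets),
        remove=present - wanted,
        reassemble=wanted != present,
    )
```

For example, if the provisioner lists legs `a` and `c` while the host MD holds `a` and `b` (both connected), the plan is to connect `c`, remove `b`, and re-assemble.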
Active self healing:
If the host MD shows one of its legs as failed, but metadata from the storage provisioner says it is supposed to be available, report the failed/missing leg to the provisioner (and vice versa for an available leg that the provisioner says is supposed to be failed/missing).
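The two-way comparison described here can be sketched as a simple diff over per-leg states. The state strings (`"available"`, `"failed"`) and the function name are assumptions for illustration; how the mismatch is actually reported to the provisioner is not specified.

```python
def find_state_mismatches(host_states, provisioner_states):
    """Return (leg, host_state, provisioner_state) for legs that disagree.

    Only legs known to both sides are compared; differences in the set of
    legs are handled by the reconcile step, not here.
    """
    mismatches = []
    for leg in sorted(host_states.keys() & provisioner_states.keys()):
        if host_states[leg] != provisioner_states[leg]:
            mismatches.append((leg, host_states[leg], provisioner_states[leg]))
    return mismatches
```

Each returned tuple would become one report to the provisioner, covering both directions (failed on the host but available in metadata, and the reverse).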
If the volume has maxDownTime>0, and the provisioner reports a leg as missing for more than maxDownTime, and the volume is not being migrated, try to replace the leg:
1. Call provisioner add_replica (with node’s host uuid / topology)
2. Publish the replica and connect to it
3. If successful, call provisioner delete_replica for the missing leg
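The replacement rule and its three steps might look like the sketch below. `add_replica` and `delete_replica` are the call names used in the spec, but their signatures here are guesses; `publish_and_connect` is a hypothetical stand-in for step 2, and `provisioner` is any client object offering those three methods. Time values are assumed to be in seconds.

```python
def should_replace_leg(max_down_time_s, missing_for_s, migrating):
    """True when all three conditions from the spec hold."""
    return max_down_time_s > 0 and missing_for_s > max_down_time_s and not migrating


def replace_leg(provisioner, volume_id, missing_leg, host_uuid):
    """Run the replacement steps; return True if the leg was replaced."""
    # Step 1: ask for a new replica, passing the node's host uuid.
    new_leg = provisioner.add_replica(volume_id, host_uuid)
    # Step 2: publish the replica and connect to it (hypothetical helper).
    if not provisioner.publish_and_connect(volume_id, new_leg):
        return False  # keep the missing leg registered if step 2 failed
    # Step 3: only after success, drop the missing leg.
    provisioner.delete_replica(volume_id, missing_leg)
    return True
```

Deleting the old replica last means a failed publish or connect leaves the provisioner metadata unchanged apart from the newly added replica, which a real implementation would probably also want to clean up.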
Also in monitor_host report to provisioner any of the detected events below:
Target connect / disconnect
Replicated volume degraded / healed
Replicated volume started / finished sync
NVMe session established / closed
(This is for monitoring/telemetry purposes)
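For the telemetry reports, the eight events listed above could be modelled as an enumeration with a small report builder. The enum values, the report field names, and the `make_event_report` helper are all illustrative assumptions; the spec does not define the report format.

```python
import time
from enum import Enum


class HostEvent(Enum):
    TARGET_CONNECT = "target_connect"
    TARGET_DISCONNECT = "target_disconnect"
    VOLUME_DEGRADED = "replicated_volume_degraded"
    VOLUME_HEALED = "replicated_volume_healed"
    SYNC_STARTED = "replicated_volume_sync_started"
    SYNC_FINISHED = "replicated_volume_sync_finished"
    NVME_SESSION_ESTABLISHED = "nvme_session_established"
    NVME_SESSION_CLOSED = "nvme_session_closed"


def make_event_report(event, host_uuid, volume_id=None, timestamp=None):
    """Build a report dict for one detected event (hypothetical format)."""
    return {
        "event": event.value,
        "host_uuid": host_uuid,
        "volume_id": volume_id,
        # Default to the current wall-clock time when none is supplied.
        "timestamp": time.time() if timestamp is None else timestamp,
    }
```

Keeping the event names in one enumeration makes it harder for the agent and any consumer of the telemetry to drift apart on spelling.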