
Difference between revisions of "Meetings/InfraTeamMeeting"

(Agenda for next meeting)
(Agenda for next meeting)
 
(13 intermediate revisions by 4 users not shown)
Line 19: Line 19:
 
*** https://etherpad.opendev.org/p/opendev-server-upgrade-planning Central tracking document which may link to more host specific documents
 
 
*** Next on the list are graphite and backup servers
 
*** Can probably spin up new backup servers alongside the old ones then migrate the old volumes off the old servers to the new ones and finally delete the old servers. Just need to double check borg version support matrix details and also what adding new backup servers will do to our cron job setups for backups.
+
*** backup03.ca-ymq-1.vexxhost.opendev.org has been launched and is being backed up to
 +
**** https://review.opendev.org/c/opendev/system-config/+/995420 Starting backup02 removal here
 
*** Remember to use launch-node's --config-drive flag when booting new Noble nodes in Rax Classic
 
** Dealing with web crawlers (clarkb 20251216)
 
*** We have seen ghcr.io hosted anubis images return errors during some deployment jobs. If this becomes consistent we may need to mirror the image
 
*** static02 and static04 are being cleaned up.
 
*** Anything else to monitor or can we close this item up for now?
 
 
** Deploying a Prometheus for Server Metrics (clarkb 20260331)
 
 
*** https://review.opendev.org/c/opendev/system-config/+/980840
 
 
*** This change and its child deploy prometheus with node exporter to collect server metrics
 
*** Napkin math says that a 1TB volume should get us about 60 days of metrics. mnasiadka also indicates that Prometheus doesn't handle longer term metrics super well
+
*** These two changes simplify the setup and testing of prometheus and node exporter
*** Ideally we would collect at least a year's worth of data. Can we make that happen with Prometheus?
+
**** https://review.opendev.org/c/zuul/zuul-jobs/+/994564 manage /etc/hosts with public IPs
*** Do we need to look at Prometheus adjacent tools like Mimir or Thanos?
+
**** https://review.opendev.org/c/opendev/system-config/+/994565 Use public IPs in system-config-run jobs
**** Both of these solutions use Prometheus as the data collection system, then store the data in a separate system that handles long-term storage better. For queries they speak PromQL and the Prometheus APIs, so tools like Grafana can be pointed at them as if they were Prometheus.
+
** Larger VM sizes for tests (corvus 20260618)
** Upgrade Ansible to v9 (clarkb 20260310)
+
*** corvus has been testing python 3.14 with zuul; zuul unit tests now use slightly more than 8GB under 3.14
*** https://docs.ansible.com/projects/ansible/latest/reference_appendices/release_and_maintenance.html#ansible-core-support-matrix
+
*** We have 16gb nodes, but in two clouds, rax-classic and vexxhost, they have fewer vcpus than their 8gb counterparts, so we need to use 32gb nodes to compensate
*** This has mostly gone well.
+
*** Are we okay with this?  Alternatives?
*** We had to force some hosts to use python2 due to their python3 being too old
+
** Dealing with alien zuul config errors in the openstack tenant (frickler 20260617)
*** https://review.opendev.org/c/opendev/ansible-role-puppet/+/989028 ansible-role-puppet needs to stop using deprecated include tasks.
+
*** Currently there are still 185 zuul config errors in the openstack tenant, despite my year-long struggle to get rid of them.
** Gerrit Account Cleanups (clarkb 20260317)
+
*** Most of these are from "alien" repos (74 airship, 29 starlingx) that I have no motivation to fix myself with my OpenStack hats on
*** Since the upgrade to Gerrit notedb we've had account inconsistencies that prevent us from pushing to the external ids ref/table directly.
+
*** Efforts to motivate these projects to clean up their errors themselves have mostly failed
*** clarkb did a bunch of work to get the number down from hundreds to about 33 consistency errors before stalling out.
+
*** I still believe that cleaning these up and being able to easily identify fresh errors is important for the health of the CI setup as a whole
*** The tail was the most difficult as it wasn't clear what the more appropriate fix for each account would be
+
*** One pretty strong action would be to move these repos into their own tenant(s) or a different shared one like opendev
*** Since then it has been years and those accounts are likely inactive and unused. We can rerun the Gerrit consistency check, feed the info back through our audit script then decide if we need to be careful with any of these accounts
+
*** I acknowledge that without further work this would break their CI setup, but I'm questioning now whether that impact would be worse than the impact the current situation has on my ability to maintain the OpenStack CI
*** Chances are we can simply disable them all and remove the conflicting external ids.
+
*** Other ideas or opinions are welcome
*** If we take good notes we can reconstruct the accounts as appropriate after the fact without Gerrit downtime should one of these users show up and wonder what happened.
+
*** clarkb reached out to starlingx and airship about this
** Gerrit 3.13 Upgrade Planning (clarkb 20260414)
+
**** Airship indicated they would like to avoid the extra work involved in setting up a separate tenant
*** clarkb would like to target a 3.13 upgrade for the end of May/early June. How does Friday June 5 look for others?
+
**** clarkb pointed out to them that they would need to fix their zuul config errors and be reachable via email or matrix at a bare minimum if we want to make that work.
*** Gerrit 3.13 removes support for Robot comments so Zuul will start making normal inline comments
+
**** https://lists.starlingx.io/archives/list/starlingx-discuss@lists.starlingx.io/thread/YQVACUR4OCX74ZULHAJ4AD44MHGY37YI/
*** This also means that the Zuul restarts performed as part of the upgrade process are actually required when we upgrade to 3.13 to get Zuul's Gerrit version detection sorted out.
+
** Gitea 1.26.4 Upgrade (clarkb 20260622)
*** https://etherpad.opendev.org/p/gerrit-upgrade-3.13 Beginnings of an upgrade plan document
+
*** https://review.opendev.org/c/opendev/system-config/+/994326 Upgrade Gitea to 1.26.4
*** Clarkb will be retesting the upgrade process now that 3.12.7 and 3.13.6 images are available.
+
*** It's time to upgrade to the next Gitea bugfix release
** Etherpad 3.1.0 Upgrade (clarkb 20260519)
+
** Bump Anubis difficulty to 5 (clarkb 20260630)
*** Upgrade to 2.7.3 seems to have worked well enough
+
*** There is some evidence that bots are regularly solving the Anubis challenge
*** 3.0.0 and 3.1.0 have been released
+
*** The challenges are slowing them down enough that services continue to be mostly responsive
*** https://github.com/ether/etherpad/blob/v3.1.0/CHANGELOG.md Big change appears to be the ability for etherpad to self update. I assume we would disable this and control etherpad via container images.
+
*** Should we increase the difficulty one level to slow them down even further?
** Zuul reporting empty public_v6 addresses for test nodes (clarkb 20260519)
+
*** This will impact regular users too, which is likely our primary consideration.
*** Zuul is reporting public_v6 values of '' for test nodes that do have working ipv6 in clouds like rax classic and ovh
+
*** https://review.opendev.org/c/opendev/system-config/+/995096
*** This may be an openstack api bug, an openstacksdk bug, or a zuul-launcher bug.
+
** Planning Gerrit Project Renames (clarkb 20260622)
*** Be aware this may impact the behavior of some test jobs.
+
*** We have a request to rename x/cursive to openstack/cursive
*** We will need to dig into why this is happening to understand it better.
+
*** Any concern with project ownership doing that? The current group membership includes people from Johns Hopkins University and OpenStack Barbican
 +
*** Aiming for July 9 at ~2100 UTC
  
 
* Open discussion
 
Line 66: Line 64:
 
Changes should have their topic set to project-rename.
 
  
* Rename example/foo -> example/bar: https://review.opendev.org/c/openstack/project-config/+/654321
+
* Rename x/cursive -> openstack/cursive: https://review.opendev.org/c/openstack/project-config/+/990122 (stephenfin, fungi)
  
 
== Previous meetings ==
 
 
Previous meetings, with their notes and logs, can be found at http://eavesdrop.openstack.org/meetings/infra/ and earlier at http://eavesdrop.openstack.org/meetings/ci/
 

Latest revision as of 14:53, 30 June 2026

Weekly Project Infrastructure team meeting

The OpenDev Team holds public weekly meetings in #opendev-meeting on OFTC, Tuesdays at 1900 UTC. Everyone interested in infrastructure and process surrounding automated testing and deployment is encouraged to attend.

Please feel free to add agenda items (and your IRC nick in parentheses).

Agenda for next meeting

  • Announcements
  • Actions from last meeting
  • Specs Review
  • Topics
    • Upgrading Old Servers (clarkb 20230627)
    • Deploying a Prometheus for Server Metrics (clarkb 20260331)
    • Larger VM sizes for tests (corvus 20260618)
      • corvus has been testing python 3.14 with zuul; zuul unit tests now use slightly more than 8GB under 3.14
      • We have 16gb nodes, but in two clouds, rax-classic and vexxhost, they have fewer vcpus than their 8gb counterparts, so we need to use 32gb nodes to compensate
      • Are we okay with this? Alternatives?
    • Dealing with alien zuul config errors in the openstack tenant (frickler 20260617)
      • Currently there are still 185 zuul config errors in the openstack tenant, despite my year-long struggle to get rid of them.
      • Most of these are from "alien" repos (74 airship, 29 starlingx) that I have no motivation to fix myself with my OpenStack hats on
      • Efforts to motivate these projects to clean up their errors themselves have mostly failed
      • I still believe that cleaning these up and being able to easily identify fresh errors is important for the health of the CI setup as a whole
      • One pretty strong action would be to move these repos into their own tenant(s) or a different shared one like opendev
      • I acknowledge that without further work this would break their CI setup, but I'm questioning now whether that impact would be worse than the impact the current situation has on my ability to maintain the OpenStack CI
      • Other ideas or opinions are welcome
      • clarkb reached out to starlingx and airship about this
    • Gitea 1.26.4 Upgrade (clarkb 20260622)
    • Bump Anubis difficulty to 5 (clarkb 20260630)
      • There is some evidence that bots are regularly solving the Anubis challenge
      • The challenges are slowing them down enough that services continue to be mostly responsive
      • Should we increase the difficulty one level to slow them down even further?
      • This will impact regular users too, which is likely our primary consideration.
      • https://review.opendev.org/c/opendev/system-config/+/995096
    • Planning Gerrit Project Renames (clarkb 20260622)
      • We have a request to rename x/cursive to openstack/cursive
      • Any concern with project ownership doing that? The current group membership includes people from Johns Hopkins University and OpenStack Barbican
      • Aiming for July 9 at ~2100 UTC
  • Open discussion

Upcoming Project Renames

(any additions should mention original->new full names and link to the corresponding project-config rename change in Gerrit) Changes should have their topic set to project-rename.

Previous meetings

Previous meetings, with their notes and logs, can be found at http://eavesdrop.openstack.org/meetings/infra/ and earlier at http://eavesdrop.openstack.org/meetings/ci/