Weekly Project Infrastructure team meeting

The OpenDev Team holds public weekly meetings in #opendev-meeting on OFTC, Tuesdays at 1900 UTC. Everyone interested in infrastructure and process surrounding automated testing and deployment is encouraged to attend.

Please feel free to add agenda items (and your IRC nick in parentheses).

Agenda for next meeting

Announcements
- Clarkb will be out on Friday

Actions from last meeting

Specs Review

Topics
- Upgrading Old Servers (clarkb 20230627)
  - https://etherpad.opendev.org/p/opendev-server-upgrade-planning Central tracking document, which may link to more host-specific documents
  - Next on the list are graphite and backup servers
  - We can probably spin up new backup servers alongside the old ones, migrate the volumes from the old servers to the new ones, and finally delete the old servers. We still need to double-check the borg version support matrix and what adding new backup servers will do to our backup cron job setup.
  - mnasiadka has been working to replace some of the older mirror nodes. Please look out for changes related to this effort.
  - Remember to use launch-node's --config-drive flag when booting new Noble nodes in Rax Classic
- Dealing with web crawlers (clarkb 20251216)
  - We have deployed anubis to lists.opendev.org and it seems to be working well enough there
  - We have also deployed anubis on the gitea servers. This came after spending much of last week fighting the crawler flood
    - It would be good to get PROXY protocol support working with gitea to make debugging easier, since gitea could then see the original client IP addresses
  - Should we be looking at adding anubis to other services like static?
- Deploying a Prometheus for Server Metrics (clarkb 20260331)
  - https://review.opendev.org/c/opendev/system-config/+/980840
  - This change and its child deploy Prometheus with node exporter to collect server metrics
  - Napkin math says that a 1TB volume should get us about 60 days of metrics. mnasiadka also indicates that Prometheus doesn't handle longer-term metrics very well
  - Ideally we would collect at least a year's worth of data. Can we make that happen with Prometheus?
  - Do we need to look at Prometheus adjacent tools like Mimir or Thanos?
    - Both of these seem to use Prometheus as the data collection system, then store the data in a separate system that handles long-term storage better. For queries they speak PromQL and the Prometheus APIs, so tools like Grafana can be pointed at them as if they were Prometheus.
- Upgrade Ansible to v9 (clarkb 20260310)
  - https://docs.ansible.com/projects/ansible/latest/reference_appendices/release_and_maintenance.html#ansible-core-support-matrix
  - https://review.opendev.org/c/opendev/system-config/+/976282
  - Based on Ansible's Python support matrix, Ansible 9 gives us a good deal of flexibility in Python versions on both bridge and remote nodes
  - Ansible 9 also fixes problems with the use of pkg_resources in the Ansible ip module
  - Tests look good. Any concerns with proceeding with the upgrade?
- Gerrit Account Cleanups (clarkb 20260317)
  - Since the upgrade to Gerrit NoteDb we've had account inconsistencies that prevent us from pushing to the external IDs ref/table directly.
  - clarkb did a bunch of work to get the number of consistency errors down from hundreds to about 33 before stalling out.
  - The tail was the most difficult because it wasn't clear what the appropriate fix for each account would be.
  - Years have passed since then, and those accounts are likely inactive and unused. We can rerun the Gerrit consistency check, feed the results back through our audit script, then decide whether we need to be careful with any of these accounts.
  - Chances are we can simply disable them all and remove the conflicting external ids.
  - If we take good notes we can reconstruct the accounts as appropriate after the fact without Gerrit downtime should one of these users show up and wonder what happened.
- Gerrit 3.12 Upgrade Followup (clarkb 20260310)
  - https://review.opendev.org/q/hashtag:upgrade-gerrit-3.12+status:open Followup changes for Gerrit 3.12 upgrade here.
  - Things seem to have gone well. In addition to the changes above, there are a few things we should plan to do:
    - Monitor H2 v2 cache file sizes and get a sense for whether or not they grow like H2 v1 cache files did
    - Clean up files that were created during the upgrade process. In particular the old H2 v1 caches.
- Gerrit 3.13 Upgrade Planning (clarkb 20260414)
  - Due to the PTG next week and travel the week after, I am targeting the end of May/early June for the 3.13 upgrade
  - Gerrit 3.13 adds an unconfigurable AI Code Review helper button to changes
    - https://groups.google.com/g/repo-discuss/c/duY8pKj3qBg discusses this a bit more
  - Gerrit 3.13 removes support for robot comments, so Zuul will start making normal inline comments
  - This also means that the Zuul restarts performed as part of the upgrade process are actually required for the 3.13 upgrade, so that Zuul's Gerrit version detection picks up the new version.
- Ubuntu Resolute Test Nodes (clarkb 20260331)
  - https://review.opendev.org/c/openstack/diskimage-builder/+/982231 Add Resolute testing to dib
  - https://review.opendev.org/c/opendev/zuul-providers/+/982182 Add Resolute images to Zuul
  - Ubuntu Bionic mirror content has been removed. We can probably start the process of mirroring Resolute packages.
- OpenInfra PTG Prep (clarkb 20260331)
  - The next PTG is happening April 20-24, which is next week.
  - We will want to put the meetpad and jvb nodes in the emergency file prior to the event to prevent unwanted upgrade disruptions.
  - Is there any other prep work that we think should be done ahead of the event?
- Noble Docker Not Talking to Podman Socket for all Operations (clarkb 20260414)
  - During the Gerrit 3.12 upgrade we noticed that `docker image ls` no longer works on Noble nodes due to API version support mismatches between Docker and Podman
  - `podman image ls` works just fine and is what we used instead
  - The problem appears to be due to noble-updates shipping a newer docker.io package than noble proper
  - The problem does not appear to affect all docker subcommands; `docker ps -a` works just fine.
  - Please keep an eye out for problems in configuration management caused by this.
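For the crawler item: the point of PROXY protocol is that the proxy in front of a backend prepends a one-line header to each connection carrying the original client and destination addresses, so the backend can log the real client IP instead of the proxy's. The sketch below parses the human-readable v1 form of that header as described in HAProxy's PROXY protocol specification. It only illustrates what information the header carries; it is not how gitea implements PROXY protocol support, and the class and function names are hypothetical.

```python
"""Illustrative parser for a PROXY protocol v1 (human-readable) header line.

Handles only the TCP4/TCP6 forms; "PROXY UNKNOWN" and the binary v2
format are deliberately out of scope for this sketch.
"""
from typing import NamedTuple

MAX_V1_HEADER = 107  # spec limit for a v1 header, including the trailing CRLF


class ProxyV1Header(NamedTuple):
    family: str    # "TCP4" or "TCP6"
    src_addr: str  # original client address, as seen by the proxy
    dst_addr: str  # address the client connected to (the proxy side)
    src_port: int
    dst_port: int


def parse_proxy_v1(line: bytes) -> ProxyV1Header:
    """Parse b'PROXY TCP4 <src> <dst> <src_port> <dst_port>' ending in CRLF."""
    if len(line) > MAX_V1_HEADER or not line.endswith(b"\r\n"):
        raise ValueError("not a PROXY v1 header line")
    # decode() raises UnicodeDecodeError (a ValueError) on non-ASCII input
    fields = line[:-2].decode("ascii").split(" ")
    if len(fields) != 6 or fields[0] != "PROXY" or fields[1] not in ("TCP4", "TCP6"):
        raise ValueError(f"unsupported PROXY v1 header: {fields!r}")
    _, family, src, dst, src_port, dst_port = fields
    return ProxyV1Header(family, src, dst, int(src_port), int(dst_port))
```

The `src_addr` field is the piece that would make crawler debugging easier, since without it every request appears to come from the proxy.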
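The Prometheus retention napkin math can be checked with a back-of-the-envelope sketch. This is not necessarily how the 60-day estimate was produced: it assumes disk use grows linearly with retention time, following the sizing rule of thumb in the Prometheus storage documentation (retention seconds × ingested samples per second × bytes per sample), and the 2 bytes per sample value is an assumption (the docs quote roughly 1-2 bytes per sample on average). Function and constant names are illustrative.

```python
"""Back-of-the-envelope Prometheus disk sizing, scaled from 1TB ~= 60 days.

Uses the rule of thumb from the Prometheus storage docs:
    disk_bytes = retention_seconds * samples_per_second * bytes_per_sample
BYTES_PER_SAMPLE is an assumption (the docs quote roughly 1-2 bytes/sample).
"""

SECONDS_PER_DAY = 86_400
TB = 10**12             # decimal terabyte
BYTES_PER_SAMPLE = 2.0  # assumed average on-disk cost per sample


def implied_samples_per_second(disk_bytes: float, retention_days: float) -> float:
    """Ingestion rate that would exactly fill `disk_bytes` in `retention_days`."""
    return disk_bytes / (retention_days * SECONDS_PER_DAY * BYTES_PER_SAMPLE)


def disk_needed(retention_days: float, samples_per_second: float) -> float:
    """Bytes needed to keep `retention_days` of data at a constant ingestion rate."""
    return retention_days * SECONDS_PER_DAY * samples_per_second * BYTES_PER_SAMPLE


if __name__ == "__main__":
    rate = implied_samples_per_second(1 * TB, 60)
    print(f"implied ingestion rate: ~{rate:,.0f} samples/s")
    print(f"one year of retention: ~{disk_needed(365, rate) / TB:.1f} TB")
```

With these assumptions it reports an implied rate of about 96,000 samples/s and roughly 6.1TB for a year. Because the same bytes-per-sample value is used in both directions, it cancels out of the one-year figure: a year is simply 365/60 ≈ 6.1 times the 60-day volume. That only holds if the number of scraped series stays flat, and it leaves out overhead such as the write-ahead log.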
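For the Gerrit account cleanup, the "take good notes" step could start from the output of Gerrit's consistency check REST endpoint (`POST /config/server/check.consistency`, requested with `check_accounts` and `check_account_external_ids`). The sketch below is hypothetical: it is not our existing audit script, the note format is made up, and the response field names (`check_accounts_result`, `check_account_external_ids_result`, `problems`, `status`, `message`) are taken from the Gerrit REST documentation as we understand it and should be checked against the deployed version.

```python
"""Hypothetical helper: turn Gerrit consistency-check output into cleanup notes.

Assumes the input is the body returned by
    POST /config/server/check.consistency
with {"check_accounts": {}, "check_account_external_ids": {}} as the request.
"""
import json

XSSI_PREFIX = ")]}'"  # Gerrit prefixes JSON responses with this guard line


def parse_gerrit_json(body: str) -> dict:
    """Strip Gerrit's XSSI guard (if present) and decode the JSON."""
    if body.startswith(XSSI_PREFIX):
        body = body[len(XSSI_PREFIX):]
    return json.loads(body)


def problems_to_notes(check_info: dict, statuses=("FATAL", "ERROR")) -> list:
    """Collect problems worth acting on, tagged with the check that reported them."""
    notes = []
    for section in ("check_accounts_result", "check_account_external_ids_result"):
        for problem in check_info.get(section, {}).get("problems", []):
            if problem.get("status") in statuses:
                notes.append({
                    "check": section,
                    "status": problem["status"],
                    "message": problem.get("message", ""),
                })
    return notes
```

Saving these records before disabling accounts would give us the information needed to reconstruct an account later if its owner shows up.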
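For the Noble Docker/Podman item, one way to look at the API version mismatch directly is to ask the Podman socket which Docker API versions it serves. This is a hypothetical diagnostic sketch, assuming the rootful Podman socket lives at `/run/podman/podman.sock` (this may differ depending on how our nodes are configured) and that, like the Docker Engine API, its `/version` endpoint reports `ApiVersion` and `MinAPIVersion`. The helper names are illustrative, and the range check is a simplification rather than a description of the docker client's actual negotiation logic.

```python
"""Hypothetical diagnostic: which Docker API versions does the Podman socket serve?"""
import http.client
import json
import socket

PODMAN_SOCKET = "/run/podman/podman.sock"  # assumed rootful default; may differ per host


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that talks to a Unix domain socket instead of TCP."""

    def __init__(self, path: str):
        super().__init__("localhost")  # host is only used for the Host header
        self._path = path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.connect(self._path)
        self.sock = sock


def server_version_info(path: str = PODMAN_SOCKET) -> dict:
    """GET /version from the Docker-compatible API listening on `path`."""
    conn = UnixHTTPConnection(path)
    try:
        conn.request("GET", "/version")
        return json.loads(conn.getresponse().read())
    finally:
        conn.close()


def parse_api_version(version: str) -> tuple:
    """'1.41' -> (1, 41) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in version.split("."))


def server_accepts(info: dict, client_api: str) -> bool:
    """True if `client_api` falls within [MinAPIVersion, ApiVersion] of the server."""
    wanted = parse_api_version(client_api)
    low = parse_api_version(info.get("MinAPIVersion", info["ApiVersion"]))
    high = parse_api_version(info["ApiVersion"])
    return low <= wanted <= high
```

On an affected node, `server_accepts(server_version_info(), client_api)` could be called with the client API version reported by `docker version` to see whether the two ranges overlap.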

Open discussion

Upcoming Project Renames

(Any additions should mention the original->new full names and link to the corresponding project-config rename change in Gerrit.) Changes should have their topic set to project-rename.

Rename example/foo -> example/bar: https://review.opendev.org/c/openstack/project-config/+/123456

Previous meetings

Previous meetings, with their notes and logs, can be found at http://eavesdrop.openstack.org/meetings/infra/ and earlier at http://eavesdrop.openstack.org/meetings/ci/
