
Difference between revisions of "Meetings/InfraTeamMeeting"

(Agenda for next meeting)
(Agenda for next meeting)
 
(27 intermediate revisions by 4 users not shown)
Line 10: Line 10:
  
 
* Announcements
 
* Announcements
** OpenStack release is scheduled to happen April 1. https://releases.openstack.org/gazpacho/schedule.html
 
** Clarkb will be out Monday March 23
 
  
 
* Actions from last meeting
 
* Actions from last meeting
Line 21: Line 19:
 
*** https://etherpad.opendev.org/p/opendev-server-upgrade-planning Central tracking document which may link to more host specific documents
 
*** https://etherpad.opendev.org/p/opendev-server-upgrade-planning Central tracking document which may link to more host specific documents
 
*** Next on the list are graphite and backup servers
 
*** Next on the list are graphite and backup servers
*** Can probably spin up new backup servers alongside the old ones then migrate the old volumes off the old servers to the new ones and finally delete the old servers. Just need to double check borg version support matrix details and also what adding new backup servers will do to our cron job setups for backups.
+
*** backup03.ca-ymq-1.vexxhost.opendev.org has been launched and is being backed up to
*** In addition to the backups servers and graphite, clarkb can work with mnasiadka to do some of the outstanding cleanup for the mirror nodes.
+
**** https://review.opendev.org/c/opendev/system-config/+/995420 Starting backup02 removal here
 
*** Remember to use launch-node's --config-drive flag when booting new Noble nodes in Rax Classic
 
*** Remember to use launch-node's --config-drive flag when booting new Noble nodes in Rax Classic
** Adding Bad Crawler Honeypots to our Sites (clarkb 20251216)
+
** Deploying a Prometheus for Server Metrics (clarkb 20260331)
*** A DDoS against static hosted sites resulted in new WAF rules and approaches on static02
+
*** https://review.opendev.org/c/opendev/system-config/+/980840
*** It is possible we may wish to apply similar approaches to sites like lists though the specific details will be different
+
*** This change and its child deploy prometheus with node exporter to collect server metrics
*** https://review.opendev.org/q/hashtag:%22apache-waf%22+status:open
+
*** These two changes simplify the setup and testing of prometheus and node exporter
*** There was also discussion about subscribing to common mod security rulesets that are already packaged and available via Ubuntu repos.
+
**** https://review.opendev.org/c/zuul/zuul-jobs/+/994564 manage /etc/hosts with public IPs
*** Should we consider a larger static server, or multiple servers behind a load balancer?
+
**** https://review.opendev.org/c/opendev/system-config/+/994565 Use public IPs in system-config-run jobs
** Upgrade Ansible to v9 (clarkb 20260310)
+
** Larger VM sizes for tests (corvus 20260618)
*** https://docs.ansible.com/projects/ansible/latest/reference_appendices/release_and_maintenance.html#ansible-core-support-matrix
+
*** corvus has been testing python 3.14 with zuul; zuul unit tests now use slightly more than 8GB under 3.14
*** https://review.opendev.org/c/opendev/system-config/+/976282
+
*** We have 16GB nodes, but in two clouds, rax-classic and vexxhost, they have fewer vCPUs than their 8GB counterparts, so we need to use 32GB nodes to compensate
*** Based on Ansible's python support Matrix Ansible 9 gives us a good deal of flexibility for bridge and remote nodes
+
*** Are we okay with this?  Alternatives?
*** Ansible 9 also fixes problems with the use of pkg_resources in the Ansible ip module
+
** Dealing with alien zuul config errors in the openstack tenant (frickler 20260617)
*** Any concerns with proceeding with the upgrade since tests look good?
+
*** Currently there are still 185 zuul config errors in the openstack tenant, despite my year-long struggle to get rid of them.
** Gerrit Account Cleanups (clarkb 20260317)
+
*** Most of these are from "alien" repos (74 airship, 29 starlingx) that I have no motivation to fix myself with my OpenStack hats on
*** Since the upgrade to Gerrit notedb we've had account inconsistencies that prevent us from pushing to the external ids ref/table directly.
+
*** Efforts to motivate these projects to clean up their errors themselves have mostly failed
*** clarkb did a bunch of work to get the number down from hundreds to about 33 consistency errors before stalling out.
+
*** I still believe that cleaning these up and being able to easily identify fresh errors is important for the health of the CI setup as a whole
*** The tail was the most difficult as it wasn't clear what the more appropriate fix for each account would be
+
*** One pretty strong action would be to move these repos into their own tenant(s) or a different shared one like opendev
*** Since then it has been years and those accounts are likely inactive and unused. We can rerun the Gerrit consistency check, feed the info back through our audit script then decide if we need to be careful with any of these accounts
+
*** I acknowledge that without further work this would break their CI setup, but I'm questioning now whether that impact would be worse than the impact the current situation has on my ability to maintain the OpenStack CI
*** Chances are we can simply disable them all and remove the conflicting external ids.
+
*** Other ideas or opinions are welcome
*** If we take good notes we can reconstruct the accounts as appropriate after the fact without Gerrit downtime should one of these users show up and wonder what happened.
+
*** clarkb reached out to starlingx and airship about this
** Gerrit 3.12 and 3.13 Upgrade Planning (clarkb 20260310)
+
**** Airship indicated they would like to avoid the extra work involved in setting up a separate tenant
*** Targeting April 5/6 and April 12/13 for upgrade to 3.12 and 3.13 respectively.
+
**** clarkb pointed out to them that they would need to fix their zuul config errors and be reachable via email or matrix at a bare minimum if we want to make that work.
*** Goal is to catch back up to being only one release behind Gerrit upstream. 3.14 is expected to release in May
+
**** https://lists.starlingx.io/archives/list/starlingx-discuss@lists.starlingx.io/thread/YQVACUR4OCX74ZULHAJ4AD44MHGY37YI/
*** Will probably need to start building 3.13 images earlier than usual and test both the 3.11 -> 3.12 and 3.12 -> 3.13 upgrades
+
** Gitea 1.26.4 Upgrade (clarkb 20260622)
*** Would rather not do them all in one go to simplify rollbacks if necessary and reduce the total downtime as >1 release upgrade requires offline reindexing.
+
*** https://review.opendev.org/c/opendev/system-config/+/994326 Upgrade Gitea to 1.26.4
**** The big risk currently on the radar is that H2 is upgraded from v1 to v2 in 3.12. But will need to do more digging through release notes as well as testing.
+
*** It's time to upgrade to the next Gitea bugfix release
** Purging backups on the smaller backup server (clarkb 20260310)
+
** Bump Anubis difficulty to 5 (clarkb 20260630)
*** Purging review02 and paste01 backups did free up some additional space
+
*** There is some evidence that bots are regularly solving the Anubis challenge
*** Should we do the same for eavesdrop01 and refstack01 backups?
+
*** The challenges are slowing them down enough that services continue to be mostly responsive
 +
*** Should we increase the difficulty one level to slow them down even further?
 +
*** This will impact regular users too which is likely the primary consideration we should make.
 +
*** https://review.opendev.org/c/opendev/system-config/+/995096
 +
** Planning Gerrit Project Renames (clarkb 20260622)
 +
*** We have a request to rename x/cursive to openstack/cursive
 +
*** Any concern with project ownership doing that? The current group membership includes people from Johns Hopkins University and OpenStack Barbican
 +
*** Aiming for July 9 at ~2100 UTC
  
 
* Open discussion
 
* Open discussion
Line 59: Line 64:
 
Changes should have their topic set to project-rename.
 
Changes should have their topic set to project-rename.
  
* Rename example/foo -> example/bar: https://review.opendev.org/c/openstack/project-config/+/123456
+
* Rename x/cursive -> openstack/cursive: https://review.opendev.org/c/openstack/project-config/+/990122 (stephenfin, fungi)
  
 
== Previous meetings ==
 
== Previous meetings ==
 
Previous meetings, with their notes and logs, can be found at http://eavesdrop.openstack.org/meetings/infra/ and earlier at http://eavesdrop.openstack.org/meetings/ci/
 
Previous meetings, with their notes and logs, can be found at http://eavesdrop.openstack.org/meetings/infra/ and earlier at http://eavesdrop.openstack.org/meetings/ci/

Latest revision as of 14:53, 30 June 2026

Weekly Project Infrastructure team meeting

The OpenDev Team holds public weekly meetings in #opendev-meeting on OFTC, Tuesdays at 1900 UTC. Everyone interested in infrastructure and process surrounding automated testing and deployment is encouraged to attend.

Please feel free to add agenda items (and your IRC nick in parentheses).

Agenda for next meeting

  • Announcements
  • Actions from last meeting
  • Specs Review
  • Topics
    • Upgrading Old Servers (clarkb 20230627)
    • Deploying a Prometheus for Server Metrics (clarkb 20260331)
    • Larger VM sizes for tests (corvus 20260618)
      • corvus has been testing python 3.14 with zuul; zuul unit tests now use slightly more than 8GB under 3.14
      • We have 16GB nodes, but in two clouds, rax-classic and vexxhost, they have fewer vCPUs than their 8GB counterparts, so we need to use 32GB nodes to compensate
      • Are we okay with this? Alternatives?
    • Dealing with alien zuul config errors in the openstack tenant (frickler 20260617)
      • Currently there are still 185 zuul config errors in the openstack tenant, despite my year-long struggle to get rid of them.
      • Most of these are from "alien" repos (74 airship, 29 starlingx) that I have no motivation to fix myself with my OpenStack hats on
      • Efforts to motivate these projects to clean up their errors themselves have mostly failed
      • I still believe that cleaning these up and being able to easily identify fresh errors is important for the health of the CI setup as a whole
      • One pretty strong action would be to move these repos into their own tenant(s) or a different shared one like opendev
      • I acknowledge that without further work this would break their CI setup, but I'm questioning now whether that impact would be worse than the impact the current situation has on my ability to maintain the OpenStack CI
      • Other ideas or opinions are welcome
      • clarkb reached out to starlingx and airship about this
    • Gitea 1.26.4 Upgrade (clarkb 20260622)
    • Bump Anubis difficulty to 5 (clarkb 20260630)
      • There is some evidence that bots are regularly solving the Anubis challenge
      • The challenges are slowing them down enough that services continue to be mostly responsive
      • Should we increase the difficulty one level to slow them down even further?
      • This will impact regular users too which is likely the primary consideration we should make.
      • https://review.opendev.org/c/opendev/system-config/+/995096
    • Planning Gerrit Project Renames (clarkb 20260622)
      • We have a request to rename x/cursive to openstack/cursive
      • Any concern with project ownership doing that? The current group membership includes people from Johns Hopkins University and OpenStack Barbican
      • Aiming for July 9 at ~2100 UTC
  • Open discussion
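
For the Anubis difficulty item above, a rough way to reason about the impact on bots and regular users alike is to estimate how much extra work one difficulty level costs. The sketch below is not Anubis code. It assumes that the difficulty is the number of leading zero hex digits required in a SHA-256 digest of a challenge string plus a nonce, and that the current setting is 4 (one level below the proposed 5). The function names and the challenge string are hypothetical and only for illustration.

```python
import hashlib
from itertools import count


def expected_attempts(difficulty: int) -> int:
    """Average number of hashes needed to find a digest with `difficulty`
    leading zero hex digits (assumption: each hex digit is uniform, so the
    chance of success per attempt is 1 / 16**difficulty)."""
    return 16 ** difficulty


def solve(challenge: str, difficulty: int) -> int:
    """Brute-force a nonce so that sha256(challenge + nonce) starts with
    `difficulty` zero hex digits. Illustrative only, not the Anubis client."""
    prefix = "0" * difficulty
    for nonce in count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce


if __name__ == "__main__":
    for level in (4, 5):
        print(f"difficulty {level}: ~{expected_attempts(level)} hashes on average")
    ratio = expected_attempts(5) // expected_attempts(4)
    print(f"one level up costs about {ratio}x more work per challenge")
```

Under these assumptions each additional level multiplies the average work per challenge by 16, for crawlers and for ordinary browsers equally, which is why the effect on regular users is the main thing to weigh.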

Upcoming Project Renames

(any additions should mention original->new full names and link to the corresponding project-config rename change in Gerrit) Changes should have their topic set to project-rename.

Previous meetings

Previous meetings, with their notes and logs, can be found at http://eavesdrop.openstack.org/meetings/infra/ and earlier at http://eavesdrop.openstack.org/meetings/ci/