Downtime on the computer system, October 19-22

Status:

Monday evening 23.10.23, 16:00: All machines are up except owl25-28. The Ganglia cluster monitoring web pages are still down for the reinstalled hyades, pleiades and eagle clusters.

Sunday evening 22.10.23, 20:30:

These compute nodes have been upgraded to RHEL9: eagle4-9, nekkar, orion, electra3,4, viscacha, mimosa3, owl18-37, pleiades3-12.

These nodes will be ready with RHEL9 in a day or two: hyades-16, euclid21-30, pleiades13-29

These compute nodes are still at RHEL7: beehives, euclid-16, hercules2-15
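
To check whether a particular node is in one of the lists above, the shorthand can be expanded into individual host names. The following is a minimal sketch, not an official tool: the function name `expand_nodes` is hypothetical, and it assumes that `name4-9` means every node from 4 to 9 inclusive and that a bare number after a comma (as in `electra3,4`) reuses the previous cluster name. Entries that do not fit this pattern are kept exactly as written.

```python
import re

# "eagle4-9" -> prefix "eagle", first 4, last 9
RANGE = re.compile(r"^([a-z]+)(\d+)-(\d+)$")
# "mimosa3" -> prefix "mimosa"
NAME = re.compile(r"^([a-z]+)\d+$")


def expand_nodes(spec):
    """Expand a list such as 'eagle4-9, electra3,4, orion' into host names."""
    hosts = []
    prefix = None  # cluster name of the previous entry, reused by bare numbers
    for token in (t.strip() for t in spec.split(",")):
        if not token:
            continue
        match = RANGE.match(token)
        if match:
            prefix = match.group(1)
            first, last = int(match.group(2)), int(match.group(3))
            hosts.extend(f"{prefix}{n}" for n in range(first, last + 1))
        elif token.isdigit() and prefix is not None:
            # Bare number: same cluster as the previous entry (assumption).
            hosts.append(f"{prefix}{token}")
        else:
            # Single host, or notation this sketch does not interpret:
            # keep it literally instead of guessing.
            hosts.append(token)
            name = NAME.match(token)
            prefix = name.group(1) if name else None
    return hosts


# The RHEL9 list from the status update of Sunday 22.10.23.
RHEL9_NODES = expand_nodes(
    "eagle4-9, nekkar, orion, electra3,4, viscacha, mimosa3, owl18-37, pleiades3-12"
)
```

With this, `"owl20" in RHEL9_NODES` tells you whether owl20 is among the upgraded nodes.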

Workstations are all available.

 

21.10.23, 15:55 The StorNext upgrade is complete. login.astro.uio.no (tsih4), login2.astro.uio.no (tsih2), scp.uio.no and all workstations are now open for use. We are still working on the Red Hat upgrades on the compute nodes, so those remain unavailable.

20.10.23, 19:00 Progress according to plan today. Work continues tomorrow.

20.10.23, 07:00 StorNext upgrade in progress

19.10.23, 19:00 System shut down

19.10.23, 18:00 Starting shutdown

Downtime:

We are planning downtime on the computer system from Thursday October 19 at 18:00 to Sunday October 22.

During the downtime we will:

  • Upgrade the StorNext storage system to version 7.1.1
  • Upgrade the InfiniBand switches in the clusters with new firmware and software.
  • Upgrade the operating system from RHEL 7 to RHEL 9 on as many compute nodes as we can manage (hopefully all).

We will reinstall as many nodes and clusters with RHEL9 as we can manage, working our way through this prioritised list:

To RHEL9:

hyades-16
eagle4-9
orion, nekkar, mimosa3, viscacha
owl18-37
euclid21-30
pleiades3-29, electra3,4

 

Afterwards, if time permits (or, more likely, at a later date), some of the oldest clusters will be reinstalled with RHEL8, since they cannot run RHEL9:

To RHEL8:

beehive-31
euclid-16
hercules2-15

 

This is a major upgrade, and you will likely have to recompile your codes.

We already upgraded hyades17-21 to RHEL9 on Tuesday September 12, so you can test your codes well in advance of the main upgrade.
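
While some nodes run RHEL9 and others are still on RHEL7, it can help to check which release a node runs before choosing which build of your code to start. The following is a minimal sketch, not an official tool: the function names are hypothetical, and it assumes the node provides the standard `/etc/os-release` file with a `VERSION_ID` field, as RHEL 7 and later do.

```python
def rhel_major_version(os_release_text):
    """Return the major OS version from os-release content, or None."""
    fields = {}
    for line in os_release_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip('"')
    # VERSION_ID looks like "9.2" or "7.9"; keep only the part before the dot.
    major = fields.get("VERSION_ID", "").split(".")[0]
    return int(major) if major.isdigit() else None


def current_rhel_major(path="/etc/os-release"):
    """Read the os-release file on the node where this script runs."""
    with open(path) as handle:
        return rhel_major_version(handle.read())
```

A job script could call `current_rhel_major()` and pick, for example, a binary compiled on hyades17-21 when the result is 9, and the old binary when it is 7.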


Early warning:


In early November 2024 we will retire the following cluster nodes:

  • All beehive nodes
  • All hercules nodes
  • euclid-euclid16

When we retire them in a little over a year, some of them will be nearly 11 years old and some nearly 12. They have been remarkably stable and have served us well.

 

By Torben Leifsen
Published Sep. 10, 2023 11:37 AM - Last modified Oct. 24, 2023 2:34 PM