Downtime 16/3 - 19/3 and 30/3 - 3/4

We are replacing our two pairs of StorNext metadata controllers in the storage system. This requires some downtime.

Status

11.04.23, 17:30 beehive11 is ready for use

04.04.23, 14:35 zedaron and hamal are fixed. beehive11 has a new disk and will be ready after Easter.

03.04.23, 17:45 All Linux workstations are now updated with the new StorNext version and are available for use, except zedaron and hamal, which have some network issues that need to be looked at. beehive11 is also down with a broken system disk that needs replacement. Otherwise all systems are up.

03.04.23, 13:35 The StorNext system is up now. Login from outside is enabled. All macOS workstations have been upgraded to version 13.3 (Ventura). You can use Mac workstations and compute nodes now, but not Linux workstations, which still need StorNext upgrades.

03.04.23, 08:35 Yesterday we ran into problems with the metadata controllers while we were upgrading workstations to the new version of StorNext. Quantum is investigating the issue and will need a couple of hours to analyse the problem. We want to understand the cause of the problems before we bring the system up. Expect about another 2 hours of downtime.

02.04.23, 21:20 We have some problems, and the system will not be available before tomorrow.

01.04.23, 18:05 Migration to new metadata servers complete. Compute nodes and workstations will be started tomorrow. Some work remains before we open for login.

01.04.23, 09:35 Progress according to plan so far

30.03.23, 17:30 Shutdown complete

30.03.23, 14:45 System shutdown starts in 15 min

29.03.23 Downtime is scheduled for 17.00 tomorrow (Thursday).

22.03.23 Everything is now up and ready again, and we are preparing for the next downtime, starting the evening of Thursday March 30.

22.03.23 hyades17-21 are up. They used to be beehive43-47, but they are now on the fast InfiniBand network together with the rest of the hyades cluster.

19.03.23, 17:30 enir is up

19.03.23, 14:00 All workstations and compute nodes are up, with the following exceptions: enir (we are still working on it) and beehive43-beehive47. We are upgrading the latter with 200 Gb InfiniBand and moving them to the hyades cluster. They will reappear there as hyades17-hyades21.

18.03.23, 17:05 Migration to new metadata servers complete. Compute nodes and workstations will be started tomorrow morning.

18.03.23, 13:15 login.astro.uio.no is up now

18.03.23, 11:25 We expect login.astro.uio.no to be back around 12.45-13.00

18.03.23, 10:58 Shutdown of login.astro.uio.no in 10 min

18.03.23, 09:40 Shutdown of login.astro.uio.no will be around 11.00. We will give a 10 minute warning.

18.03.23, 00:15 Shutdown of login.astro.uio.no will most likely be between 10.00 and 12.00 on Saturday. Expected downtime is 1-1.5 hours.
When it comes back, all disks will be available.

17.03.23, 17:55 Good progress on the migration from old to new servers today. Be aware that we need to take down login.astro.uio.no when we re-enable the disks that are currently down. We don't know yet when that will happen, but we will let you know.

17.03.23, 10:10 login.astro.uio.no is now available for login. Please, no computing on the login node!

17.03.23, 08:30 We expect to be able to start login.astro.uio.no at around 9.30-10.00

16.03.23, 17:30 Shutdown complete

16.03.23, 17:00 Starting shutdown

Replacing metadata controller hardware

Our StorNext storage system is controlled by two pairs of metadata controllers. Each pair controls half of our file systems.

The controllers are now old and need to be replaced with new hardware. This requires some downtime. We have chosen to split this into two periods, one for each metadata controller pair:

  1. Thursday evening, March 16, to Sunday, March 19, for the first pair
  2. Thursday evening, March 30, to Monday, April 3, for the second pair, plus file system upgrades and tape capacity expansion

During the work we will:

  • Replace the metadata controllers with new hardware
  • Migrate metadata from the old to the new hardware
  • Upgrade StorNext on the controllers to the latest version (7.1.0)
  • Increase the capacity of the tape storage system

Downtime

We will start shutdown on Thursday at 17.00. All systems (including tsih2) will be shut down.

During the first weekend, tsih2 (login.astro.uio.no) will be up from Friday morning with the following file systems:

/mn/stornext/u3
/mn/stornext/d8
/mn/stornext/d7
/mn/stornext/d9
/mn/stornext/d11
/mn/stornext/d10
/mn/stornext/d21
/mn/stornext/d22

All other compute nodes and workstations will be down. You cannot do any work on tsih2, but you will have access to your files on these file systems (including the home directories on u3).

Later Linux upgrades

The upgrade to StorNext 7.1.0 will pave the way for the long-awaited upgrade of Red Hat Enterprise Linux to version 9.1. This will require some testing once the StorNext upgrades are done. Downtime for the Red Hat upgrade will be announced later.

By Torben Leifsen
Published Feb. 24, 2023 2:46 PM - Last modified Apr. 11, 2023 5:32 PM