Building rehabilitation and downtime August 5 - 11

The rehabilitation of the roof where our cooling system is located is nearing completion. Our coolers will be reinstalled and the temporary cooling removed.

While this work is being done, we will have limited cooling capacity and will need to take down parts of the computer system again.

STATUS:

12.08.20, 08:00:

All machines are online again. Contact drift@astro.uio.no if you find that any machine is still missing.

12.08.20, 07:15:

All servers and compute nodes are up. Eagle5 has an unresolved issue.

11.08.20, 15:20:

The work on the coolers is not yet finished and is estimated to take two more hours.
The machines will be brought online again after this, probably tomorrow.

07.08.20, 10:00:

The cooling system seems stable, so a few additional nodes have been powered on. According to the plan, the rest will be powered on on Tuesday.

06.08.20, 15:25:

The cooling is now temporarily restored. The machines owl31, owl32, owl33, beehive, beehive45 and beehive46 have been turned back on.

06.08.20, 07:30:

A few of the remaining nodes have been taken down. We are now running only a handful of the most critical nodes.

The temperature in the machine room is slowly trending upward, and we expect the outside temperature to rise during the day today, so we want to be prepared for that.

05.08.20, 14:30:

The load in the machine room is still too much for our backup cooler. Some additional nodes have been powered off. The users running on them have been alerted.

We still have beehive, beehive43-47 and some workstations up (and a few owls/eagles) in addition to all servers and storage.

05.08.20, 08:00:

All machines are shut down except the machines mentioned below.

We will monitor the temperature in the machine room over the following days to check whether we need to take down additional machines.

05.08.20, 01:10:

As it looks now, the following machines will remain up:

Workstations:

acubens, polaris, arion, alphecca, rukbat, castor, barakish, mairac, orin, pixie, sansa, tsugaru and tuscan

Compute:

beehive, beehive43, beehive44, beehive45, beehive46, beehive47, hercules, hercules15, euclid, owl24, owl25, owl26, owl27, owl28, owl29, owl30, owl31, owl32, owl33, owl34, owl35, charybdis1, eagle and eagle5

Servers:

tsih, tsih2 (login.astro.uio.no), sunflower (NFS and samba server), electra2, electra3, sdc-fs, sdc-db, acuxdb, thubandb, alruba, alruba2 and mintaka

In addition, the storage system and a number of system and infrastructure servers will remain up.

We may have to shut down some of the compute nodes if we exceed the available cooling capacity.

PLAN:

Wednesday August 5:

Shut down all non-critical computers.

Our main cooler will be reinstalled on the roof.

Thursday August 6 - Friday August 7:

The temporary cooler in the backyard will be disconnected.

Restart and testing of main cooling system.

Monday August 10:

Our backup cooler will be disconnected, moved back to the roof and reconnected.

Tuesday August 11:

Restart of backup cooler.

Power up of all computer systems.

We will keep the most important systems operational during this process, but we will have limited cooling capacity and will have to shut down large parts of the system.

STORAGE AND SERVERS:

The storage system and main servers (including login.astro.uio.no, PRITS servers etc.) will be up.

COMPUTE CAPACITY:

We will keep some compute capacity operational.

WORKSTATIONS:

We will shut down all workstations that are not needed. If you need to have your workstation operational, let us know and we will keep it up. If not, we can save some power that can be used for cluster nodes.

If you have specific needs during this period, let us know and we will try to accommodate them within the capacity we have available.

Status, more details and updates will be posted here.

By Torben Leifsen

Published July 20, 2020 4:50 PM - Last modified Aug. 12, 2020 7:59 AM