Archive for June, 2008

Summer 2008 Downtime Schedule

Friday, June 20th, 2008

Here is the maintenance schedule for the summer:

Tue, Jun 24, 2008, 4:00am-8:00am
Tue, Jul 8, 2008, 4:00am-8:00am
Tue, Jul 22, 2008, 4:00am-8:00am
Tue, Aug 5, 2008, 4:00am-8:00am
Tue, Aug 19, 2008, 4:00am-8:00am
Tue, Sep 2, 2008, 4:00am-8:00am

During these times we will be performing a variety of update, installation, and maintenance tasks.
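
The dates above fall on every second Tuesday, starting June 24. As a rough illustration only, the list can be regenerated with a few lines of Python. The function name `maintenance_windows` and the 14-day spacing parameter are assumptions made for this sketch; they are not part of any tool we use.

```python
from datetime import date, timedelta

def maintenance_windows(first, count, every_days=14):
    """Return `count` dates starting at `first`, spaced `every_days` apart.

    Hypothetical helper: the 14-day spacing is inferred from the posted list.
    """
    return [first + timedelta(days=every_days * i) for i in range(count)]

# First window and number of windows taken from the schedule above.
windows = maintenance_windows(date(2008, 6, 24), 6)

for d in windows:
    # Day-of-month is printed without zero padding to match the posted format.
    print(f"{d:%a, %b} {d.day}, {d.year}, 4:00am-8:00am")
```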

Downtime: Tuesday, June 10, 2008

Friday, June 6th, 2008

On Tuesday, June 10, 2008, we will have a scheduled downtime of the entire CS computing and networking infrastructure during normal business hours beginning at 6:00am. We don’t have an exact completion time but anticipate that everything will be back up before 2:00pm.

Who is affected:

  • This downtime will impact all users of the departmental infrastructure. We will power down all equipment in room 218 including the network (wired and wireless), web servers, mail servers, compute servers, and file servers.

What is happening:

  • We are upgrading our battery-backup power system for room 218. This includes replacing the existing UPS (40kVA, 208V, 3φ) with a larger unit (80kVA, 480V, 3φ). Because we are reconfiguring the system to operate at 480V and installing a new bypass switch, we must power down the entire machine room to perform the work.
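
To put the two UPS ratings in perspective, here is a small sketch of the full-load line current implied by each. It uses the standard balanced three-phase relation S = √3 · V · I, and it assumes the quoted voltages are line-to-line values and that the load is balanced; neither assumption is stated in the announcement, and the function name is made up for this example.

```python
import math

def line_current_amps(apparent_power_va, line_to_line_volts):
    """Full-load line current for a balanced three-phase system.

    Assumes the voltage is line-to-line (an assumption, not stated above).
    """
    return apparent_power_va / (math.sqrt(3) * line_to_line_volts)

old_amps = line_current_amps(40_000, 208)  # existing 40 kVA, 208 V unit
new_amps = line_current_amps(80_000, 480)  # replacement 80 kVA, 480 V unit

print(f"old UPS: {old_amps:.1f} A per line")  # about 111.0 A
print(f"new UPS: {new_amps:.1f} A per line")  # about 96.2 A
```

Under these assumptions the new unit delivers twice the apparent power at a somewhat lower line current, which is one consequence of moving to 480V.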

Why it is happening:

  • Due to continued growth of the department’s infrastructure, we have reached the capacity of our current power configuration. The new configuration will allow us to install additional infrastructure equipment to meet the department’s needs.

Update 2:10pm: While the work is moving along smoothly, it is taking longer than anticipated. Our new estimate to be online is 5:00pm.

Update 5:05pm: A circuit breaker has failed in the new UPS. We are working with the vendor to get a replacement unit ASAP. We are hopeful that this will be first thing in the morning. We’ll know later tonight about the specific ETA. We do not anticipate the systems coming back online tonight.

Update 6:25pm: The field engineer was able to track down a replacement circuit breaker in Virginia. It will be shipped overnight and is due to be in the building by 8:30am. The earliest we anticipate being back online is 11:00am. However, there are still several unknowns, so this is only a lower bound.

Update Wednesday, 12:25pm: The field engineer installed the replacement circuit breaker. Unfortunately, it exhibited the same problem, and he is now debugging the system. We put in a call to the vendor, and the engineer’s supervisor is on his way (with additional spare parts, if needed) to assist with the troubleshooting. We are now simultaneously working to get the new UPS online and weighing our options in the event that things continue to drag out. One option is to bring things back online without any protection from power hits; this is risky because, without protection, a power event can bring down the room and damage equipment. The last time this occurred, it degraded our systems and resulted in a series of failures over a period of weeks; some of the failures led to the permanent loss of user data. In addition, we would need to bring the room back down to complete the UPS installation in any event. We understand that we must get the systems back online ASAP.

Update Wednesday, 5:25pm: Our new UPS is up and running. We have begun our normal start-up procedures. We anticipate being online at approximately 6:00pm.

Emergency Downtime: Thursday, June 5, 2008

Thursday, June 5th, 2008

Due to a failure in our main UPS, we need to perform emergency maintenance today beginning at 12:00pm (noon). This work will require the shutdown of our main server room and is expected to last approximately 90 minutes. We realize that many people are up against conference deadlines this week and we have not made this decision lightly. At this time our systems are not protected by backup power and any power event could cause a disruption that could last for substantially more than our expected downtime.

Here’s a close-up of the inside of the UPS unit showing charring around one of the main power cables.

Update: At 3:30pm, we are back up. The work was only a partial success. We now have battery backup for one power event at a time. After each event we must manually reset the system to be ready for the next event. We are all crossing our fingers that the commercial power stays clean through Tuesday, when we will have our scheduled downtime to connect our new, bigger UPS.