March 2013 – Princeton CS Status

Downtime: Tue, March 19, 2013

All / scott

On Tuesday, March 19, 2013, we will have a scheduled downtime from 6:00am to 8:00am EDT.

Some of this work (the first item in each list below) was originally scheduled for March 5, 2013.

Who is affected:

Users of the public login machines (soak, wash, rinse, spin, tux, opus), certain CS websites (jobs, kiosk, msdnaa, search, wiki), database servers, runscript, CAS, lpdrelay (printing), labpc machines (Friend Center fishbowl lab) and the ionic cluster.
Dynamic websites for projects or groups with hostnames of the form project-or-group-name.cs.princeton.edu that are accessing executables (e.g., python, perl, java) on the shared /usr/local filesystem.

What is happening:

These hosts will receive a critical security update and be rebooted.
Websites of the form project-or-group-name.cs.princeton.edu are moving to a new server that no longer mounts the shared /usr/local filesystem.

Why is it happening:

There is a critical kernel security patch available for our Springdale hosts. It addresses a specific security vulnerability. As this is a kernel update, all machines must be rebooted. We expect each host to be down for only a minute or two while it reboots.
The website migration is the next step in decommissioning the shared /usr/local filesystem. This work has already been done for the cycles and penguin machines; it will reduce load on our central file server and simplify management of our web server infrastructure. Additional notes about the website migration:
- In the week leading up to the downtime, we will reach out to the owners of as many affected websites as we can identify with instructions on what they need to do for the migration. We anticipate that very few sites will need modification and that those modifications will be minor.
- After the downtime, all projects and groups should double check that their sites are operating as expected. If not, please notify CS Staff immediately and we can assist and/or temporarily move the site back to the old server.
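Site owners who want a rough early indication of whether their site depends on the shared /usr/local filesystem could scan their scripts for interpreter lines that point there. The following is a minimal sketch, not the instructions CS Staff will send out: the function name and the example path are hypothetical, and it only inspects the first line of each file, so it will not catch scripts that call /usr/local executables in other ways.

```python
import os


def find_usr_local_shebangs(root):
    """Return a sorted list of (path, shebang) pairs for files under root
    whose first line is a '#!' interpreter line that mentions /usr/local/."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # Read only the start of the first line; binary mode avoids
                # decoding errors on non-text files.
                with open(path, "rb") as f:
                    first = f.readline(256)
            except OSError:
                # Unreadable files are skipped rather than treated as errors.
                continue
            if first.startswith(b"#!") and b"/usr/local/" in first:
                hits.append((path, first.decode("utf-8", "replace").strip()))
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical document root; substitute the directory that holds your site.
    for path, shebang in find_usr_local_shebangs("/path/to/your/site"):
        print(f"{path}: {shebang}")
```

Any file the scan lists is a candidate for modification before the move; an empty result does not guarantee that the site is unaffected.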


Downtime: Sat, March 16, 2013

All / scott

On Saturday, March 16, 2013, we will have a scheduled downtime from 6:00am to 8:00am EDT.

Who is affected:

All users of the CS wired network (including PlanetLab at 221 Nassau, CS hosts in CITP in Sherrerd, and the CS section of the data center at 151 Forrestal) as well as users of the OIT wireless network in the CS Building.

What is happening:

We will be updating the firmware on the department's gateway router and firewall service module. During this time, there will be no OIT wireless connectivity in the CS Building (as the OIT access points use the CS wired infrastructure) and there will be no Internet connectivity between the CS network and the outside world.

Why is it happening:

This update will address some ongoing network issues affecting a limited number of services.


OIT Wireless Down: March 6, 2013

All / scott

OIT is reporting that the puwireless network is not available to some users. For more details, see their outage message.


Downtime: Tue, March 5, 2013

All / scott

On Tuesday, March 5, 2013, we will have a scheduled downtime from 6:00am to 7:00am EST.

UPDATE: This downtime did not happen and the work has been postponed to March 19, 2013.

Who is affected:

Users of the public login machines (soak, wash, rinse, spin, tux, opus), certain CS websites (jobs, kiosk, msdnaa, search, wiki), database servers, runscript, CAS, lpdrelay (printing), labpc machines (Friend Center fishbowl lab) and the ionic cluster.

What is happening:

These hosts will receive a critical security update and be rebooted.

Why is it happening:

There is a critical kernel security patch available for our Springdale hosts. It addresses a specific security vulnerability. As this is a kernel update, all machines must be rebooted. We expect each host to be down for only a minute or two while it reboots.


Emergency Downtime: March 4, 2013

All / scott

The temperature in the central computing facility has risen very quickly to critical levels this morning. OIT and facilities are aware of the problem. We have already started to see equipment failures. To protect our infrastructure, we are very likely to begin shutting down equipment.

UPDATE: At 10:43am, OIT reports, "All air-handlers at the HPCRC are currently stopped."

UPDATE 10:55am: The penguin machines (tux, opus) and cycles (wash, rinse, spin, soak) are being shut down. Our primary DNS server shut down automatically due to the high temperature.

UPDATE: At 10:56am, OIT reports, "The air handlers are running again and temperatures are returning to normal." CS Staff is monitoring temperature sensors in our area.

UPDATE 11:13am: Primary DNS server has been restarted. We expect to restart penguins and cycles within 15 minutes. The switch for the ionic cluster failed. We have a spare in the CS building. It will be a while (no ETA yet; could be a full day) before it is configured, installed, and operational.

UPDATE 2:00pm: We may not be out of the woods yet. We are seeing elevated temperatures again and we are keeping a close eye on our systems. From OIT, "HPCRC control system is having problems again. Facilities staff are still at HPCRC and are responding."

UPDATE 2:55pm: We have seen temperatures peak and then decline again. OIT now reports, "HPCRC cooling is now operating normally."

UPDATE 4:20pm: Facilities believes that they have identified and corrected the root cause. As a precaution, facilities staff will stay at the HPCRC all night to be able to respond quickly in the event of another problem.
