Emergency Downtime: March 4, 2013 – Princeton CS Status

The temperature in the central computing facility has risen very quickly to critical levels this morning. OIT and facilities are aware of the problem. We have already started to see equipment failures. To protect our infrastructure, we are very likely to begin shutting down equipment.

UPDATE: At 10:43am, OIT reports, "All air-handlers at the HPCRC are currently stopped."

UPDATE 10:55am: The penguin machines (tux, opus) and cycles (wash, rinse, spin, soak) are being shut down. Our primary DNS server shut down automatically due to the high temperature.

UPDATE: At 10:56am, OIT reports, "The air handlers are running again and temperatures are returning to normal." CS Staff is monitoring temperature sensors in our area.

UPDATE 11:13am: Primary DNS server has been restarted. We expect to restart penguins and cycles within 15 minutes. The switch for the ionic cluster failed. We have a spare in the CS building. It will be a while (no ETA yet; could be a full day) before it is configured, installed, and operational.

UPDATE 2:00pm: We may not be out of the woods yet. We are seeing elevated temperatures again and we are keeping a close eye on our systems. From OIT, "HPCRC control system is having problems again. Facilities staff are still at HPCRC and are responding."

UPDATE 2:55pm: We have seen temperatures peak and then decline again. OIT now reports, "HPCRC cooling is now operating normally."

UPDATE 4:20pm: Facilities believes it has identified and corrected the root cause. As a precaution, facilities staff will stay at the HPCRC overnight so they can respond quickly in the event of another problem.