Archive for November, 2008

Emergency Downtime: Tue, Nov 25, 2008

Tuesday, November 25th, 2008

TODAY, Tuesday, November 25, 2008, we will have an emergency downtime from 1:00pm to 2:00pm EST (note time).

During this time we will be shutting down our infrastructure so that we can (1) revert to a previous version of the software on our faulty file system, and (2) move some of our service/infrastructure file systems to a temporary volume on an alternate file server.

We are taking these actions to help alleviate some of the file system problems we are experiencing. These changes should make the department web sites, FC 010 lab, moodle, and CVS more stable. Access to the project file space should be no worse than it is now; we expect it to be better, but still below par.

Update 1:42pm: Systems are back online. With additional information and upon further consideration, we opted to perform only part (2) above at this time. The department web sites, FC 010 lab, moodle, and CVS should be more stable and more responsive. We postponed reverting the file system software to allow our vendor additional time to debug the problem. We are likely to revert to a previous state early tomorrow morning.

CVS Downtime / Migration

Tuesday, November 25th, 2008

We are bringing down the department’s CVS server and moving the content from our problematic file server to a temporary location on another file server. We expect to have it back online by noon today. We will post a follow-up when it is complete.

Update 9:40am: Because we did an initial rsync of the CVS data yesterday, the final rsync this morning completed quickly and the CVS server is now back online. There is still a read-only dependency between the CVS server and the problematic file system; however, we expect CVS performance to be much closer to normal. As usual, please report problems to csstaff@cs.
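The two-pass approach mentioned above (a bulk copy while the service is still running, then a short final pass during the downtime) can be illustrated with a minimal Python sketch. This is not the rsync command we ran; it only mimics rsync's default quick check of file size and modification time, and it does not handle deletions. The names sync_tree and demo, and the directory names, are made up for the illustration.

```python
import os
import shutil
import tempfile


def sync_tree(src: str, dst: str) -> int:
    """Copy files from src to dst that are missing or differ in size/mtime.

    Returns the number of files actually copied (hypothetical helper).
    """
    copied = 0
    for root, _dirs, files in os.walk(src):
        rel = os.path.relpath(root, src)
        target_dir = os.path.join(dst, rel)
        os.makedirs(target_dir, exist_ok=True)
        for name in files:
            s = os.path.join(root, name)
            d = os.path.join(target_dir, name)
            s_stat = os.stat(s)
            if os.path.exists(d):
                d_stat = os.stat(d)
                unchanged = (
                    s_stat.st_size == d_stat.st_size
                    and int(s_stat.st_mtime) == int(d_stat.st_mtime)
                )
                if unchanged:
                    continue  # already transferred by an earlier pass
            shutil.copy2(s, d)  # copy2 preserves the modification time
            copied += 1
    return copied


def demo() -> tuple[int, int]:
    """Run a bulk pass, change one file, then run the final pass."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "old_server")
        dst = os.path.join(tmp, "new_server")
        os.makedirs(os.path.join(src, "module"))
        for i in range(5):
            with open(os.path.join(src, "module", f"file{i},v"), "w") as f:
                f.write("revision 1\n")
        first = sync_tree(src, dst)   # bulk pass, service still up
        with open(os.path.join(src, "module", "file0,v"), "a") as f:
            f.write("revision 2\n")   # a late change before the downtime
        second = sync_tree(src, dst)  # final pass, service down
        return first, second


if __name__ == "__main__":
    print(demo())
```

The final pass only has to move what changed since the first one, which is why the window in which the service must be offline can be kept short.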

Short Notice Downtimes and File System Issues

Monday, November 24th, 2008

As many of you have noticed and reported, we continue to have serious performance problems with our file server. (This is the system that serves everything except home directories.) The issue has been escalated to the highest level and is getting 24/7 attention from our vendor.

To get the problem resolved as quickly as possible, we expect to have a few short downtimes (< 1 hour) with short notice (~30 minutes) this week. This notice is to make sure that you pay close attention to messages on the downtime list and on this blog.

If you are a researcher and have conference/journal/proposal deadlines this week, please let us know who is working on them and when they are due. Also, if you are an instructor and have assignments due this week (that require use of the CS infrastructure), please let us know the course and the time the assignment is due.

At this time, the read performance from the file server is acceptable; however, the write performance is pathologically slow. Note that to get the read performance where it is now, we had to temporarily disable updating the atime (access time) value. Also, snapshots on the project space have been temporarily disabled.

Update 12:37pm: We will have a brief downtime at 1:00pm today, when we reboot the file server. In an attempt to minimize disruption, other systems will be left up during the file server reboot, so you will notice a long pause while the reboot occurs. Access should resume once the server comes back up. The downtime should be less than 30 minutes if it goes as expected.
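For anyone unfamiliar with the atime value mentioned above: it is a per-file timestamp that is normally updated whenever a file is read, so each read can also cause a small metadata write. With atime updates disabled, it simply stops advancing on reads. A minimal Python sketch for inspecting it on a file you own (the function name file_times is made up for this example):

```python
import os
import tempfile
from datetime import datetime, timezone


def file_times(path: str) -> dict[str, datetime]:
    """Return the access and modification timestamps recorded for path."""
    st = os.stat(path)
    return {
        "atime": datetime.fromtimestamp(st.st_atime, tz=timezone.utc),
        "mtime": datetime.fromtimestamp(st.st_mtime, tz=timezone.utc),
    }


if __name__ == "__main__":
    # Create a scratch file and show the timestamps the file system reports.
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"hello")
        tmp.flush()
        print(file_times(tmp.name))
```

Tools that rely on atime (for example, scripts that look for files not read recently) may report stale values on the affected file systems while updates are disabled.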

Emergency Downtime: Fri, Nov 21, 2008

Friday, November 21st, 2008

On Friday, November 21, 2008, we will have an emergency downtime from 12:00pm to 2:00pm EST (note time).

During this time we will be shutting down our infrastructure so that we can apply a software patch to the file server.

Our file server vendor has high confidence that they have identified the bug that is the root cause of our performance and connectivity problems. They have created a custom “patch” for us that removes the bug. Given the severity of the problem for our infrastructure and the confidence from the vendor that this will correct the problem, we have opted to apply the patch during business hours TODAY.

Update 12:50pm: We have applied the patch and our systems are back online. Performance seems to be better; time will tell if the improvement is permanent. As always, please continue to report problems to csstaff.

Emergency Downtime: Tue, Nov 18, 2008

Monday, November 17th, 2008

On Tuesday, November 18, 2008, we will have an emergency downtime from 4:00am to 8:00am EST.

During this time we will be applying vendor-recommended patches to our file server to address its recent performance and connectivity issues.

Fileserver Problems

Sunday, November 16th, 2008

Our main department file server is having problems providing NFS (and probably SMB) service to the systems in the department. Impacted services include the main department web server (www.cs.princeton.edu), virtual web servers, the CS public Unix machines (portal), and the c2 cluster. So far, e-mail appears to be working normally.

We have collected debugging information and sent it to the vendor for analysis. We are monitoring the situation and doing what we can to keep services up. Please send e-mail to csstaff if you have any new issues to report. We will update this space as we know more.

Emergency Downtime: Thu, Nov 6, 2008

Tuesday, November 4th, 2008

On Thursday, November 6, 2008, we will have an emergency downtime from 4:00am to 8:00am EST.

During this time we will be working on fixing the performance and connectivity problems with our file server.

Beginning tomorrow (November 5, 2008), our vendor will be providing an on-site engineer to help diagnose the problem and to help us with our plan of attack for Thursday’s early-morning downtime.

Emergency Downtime: Mon, Nov 3, 2008

Sunday, November 2nd, 2008

On Monday, November 3, 2008, we will have an emergency downtime from 3:00am to 9:00am EST (note time).

During this time we will be shutting down our infrastructure so that we can re-install the operating system on the file server.

We continue to have performance/connectivity problems with the file server. The vendor (who has been working on the issue over the weekend) is suggesting that we do a fresh install of the OS so that we have a clean slate. While we don’t know if this will correct the problems, our hope is that, at the very least, we will regain critical tracing capability that will then help us find and correct the problem.

Update 9:00am: Our systems are still down so that we can continue to debug the problem with our filesystem. We will bring our systems back online by 11:00am. Our hope is that we will have resolved the file server issue by then. However, in the event that we are unsuccessful, we will revert to our previous configuration/state that, while problematic, works more than 0% of the time.

Update 11:10am: We were not able to correct the problem with the file server and had to roll back to our previous state. We will announce another downtime soon.