Downtime: Week of January 28, 2013

During the week of January 28, 2013, we will be moving our centralized computing infrastructure and our cluster ("ionic") to a new location.

For all infrastructure services (e-mail, CS web sites, file services), the downtime window is:

  • START: Tuesday, January 29, 2013, at 6:00 AM EST
  • END: Wednesday, January 30, 2013, at 10:00 AM EST

If you use the ionic cluster, its downtime window is longer:

  • START: Monday, January 28, 2013, 4:00 PM EST
  • END: Thursday, January 31, 2013, 10:00 AM EST

Additional Details

"Infrastructure" is everything except the ionic cluster and includes e-mail, web pages, file system, printing, and general purpose computing (i.e., penguins: tux and opus; cycles: soak, wash, rinse, and spin).

We are prioritizing the infrastructure over the ionic cluster. We will bring up the ionic cluster within 1 business day of bringing up the infrastructure.

The wired and wireless networks in both the CS Building and the CS section of the data center (e.g., PlanetLab, VICCI, SNS, Memex) will continue to work during the downtime. Users will still be able to access University systems and the Internet from their desktops/laptops.

Because the CS e-mail server will be down for longer than 4 hours, people sending e-mail to CS accounts will likely receive warning ("bounce") messages. (These messages are generated by the sending server and are sent back to the sending account.) Properly configured sending servers will keep retrying for 5 days, so incoming messages will be delivered once the infrastructure is back online. Senders can expect to see messages of the form "warning: message not delivered after 4 hours; will re-try for 5 days." (The exact wording, timeout, and retry period are specific to the server sending the message.)
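
To make the timing concrete, here is a minimal sketch (ours, for illustration only) of how the retry math works out, assuming a typical 4-hour delay warning and a 5-day retry window; both values are assumptions and vary from server to server.

    from datetime import datetime, timedelta

    # Assumed sending-server defaults; these vary by server and are not under our control.
    DELAY_WARNING = timedelta(hours=4)   # when the sending server emits a "still trying" warning
    RETRY_WINDOW = timedelta(days=5)     # how long the sending server keeps retrying

    # Infrastructure downtime window from this announcement (EST).
    DOWNTIME_START = datetime(2013, 1, 29, 6, 0)
    DOWNTIME_END = datetime(2013, 1, 30, 10, 0)

    def message_outcome(sent_at: datetime) -> str:
        """Describe what a sender would likely see for mail sent to a CS address."""
        if not (DOWNTIME_START <= sent_at < DOWNTIME_END):
            return "delivered normally (our servers are up when the message is sent)"
        delivered_at = DOWNTIME_END                        # first successful retry after we return
        warned = delivered_at - sent_at > DELAY_WARNING    # queued long enough to trigger a warning
        if sent_at + RETRY_WINDOW >= DOWNTIME_END:
            return ("delivered after the downtime"
                    + (", with a delay warning to the sender" if warned else ""))
        return "bounced (retry window expired before our servers returned)"

    # Example: mail sent mid-morning on the first day of the downtime.
    print(message_outcome(datetime(2013, 1, 29, 9, 0)))
    # -> delivered after the downtime, with a delay warning to the sender

Because the 28-hour infrastructure window is far shorter than a typical 5-day retry period, the bounce branch is never reached for mail sent during the downtime; senders should see at most a delay warning.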

Due to the magnitude of the move, support services from CS Staff will be limited.

While any change to the infrastructure (including this move) carries inherent risk, CS Staff has taken significant steps to reduce that risk and to stay within the 28-hour window for the infrastructure and the additional 1-business-day window for the ionic cluster.

UPDATE 1/28/2013 at 4:08pm: The ionic cluster has been shut down in preparation for its move.

UPDATE 1/29/2013 at 7:06am: All servers have been powered down.

UPDATE 1/29/2013 at 8:47am: All infrastructure servers have been removed from their racks. Packing has begun.

UPDATE 1/29/2013 at 10:25am: Truck with the infrastructure servers has arrived at the new data center; unloading has begun. In the CS Building, the ionic cluster has been removed from its racks; packing has begun.

UPDATE 1/29/2013 at 11:15am: Truck with systems for the ionic cluster has left the CS Building.

UPDATE 1/29/2013 at 11:30am: Installers dropped one of our disk arrays. Assessing damage. Other work continues. In anticipation of this kind of eventuality, we had engaged our storage system vendor and already had an on-site engineer in place to help.

UPDATE 1/29/2013 at 12:42pm: We are working with the vendor to attempt to get a replacement disk array chassis today. Racking of other infrastructure systems continues. Truck unloading of ionic cluster continues.

UPDATE 1/29/2013 at 3:00pm: All systems are in the machine room. Approximately half are in racks. Some are starting to be wired up. Still waiting for arrival of replacement disk array chassis.

UPDATE 1/29/2013 at 4:20pm: All systems are mounted in their racks. Infrastructure systems have been cabled. Awaiting arrival of replacement disk array chassis.

UPDATE 1/29/2013 at 6:10pm: Replacement disk array chassis due to arrive by 8:15pm. We have tested several systems successfully. The disk array chassis is 1 of 7 chassis in our file server system. Even with this setback, we still believe we will meet our deadlines to be back online.

UPDATE 1/29/2013 at 8:25pm: Replacement disk array arrived. Work continues.

UPDATE 1/29/2013 at 9:50pm: Components have been moved from damaged chassis to new chassis. We are working with vendor to bring replacement chassis online.

UPDATE 1/29/2013 at 11:20pm: Chassis replacement complete and system operational. Work continues.

UPDATE 1/30/2013 at 12:10am: Infrastructure services (e-mail, CS web sites, file services) are starting to come back online.

UPDATE 1/30/2013 at 12:40am: All infrastructure services (e-mail, CS web sites, file services) are now online. The ionic cluster is the only service that is still down. We will be bringing the ionic cluster back online sometime after 1:00pm today.

UPDATE 1/30/2013 at 8:15am: While not all nodes are yet online, the ionic cluster is operational and available for use. The remaining nodes that are down will come up this morning when an additional power strip is installed in one of the cluster racks.

UPDATE 1/30/2013 at 10:55am: The ionic cluster is fully online. At this point, all systems should be operating normally.


CS Web Sites are down

The main CS website, virtual websites, wiki, and CS Guide are currently down. We are working to correct the problem.

UPDATE 10:06am: we believe we have isolated the issue and continue to correct the problem.

UPDATE 11:03am: all systems are back up and working as expected.


Internet connectivity from OIT provided networks

From http://helpdesk.princeton.edu/outages/view.plx?ID=3906

"OIT network administrators have informed the Help Desk that the OIT provided network is experiencing problems accessing resources off of the Princeton campus. This includes all internet websites as well as mail resources. Connections to intranet resources (file shares, OIT mail servers, Princeton websites) will all still work from on campus connections. OIT administrators are working to resolve this issue and more information will be provided here as it becomes available."


CS and HPCRC Network Outage

As of 4:39pm today, both the CS and HPCRC external fiber connections are down. We are working with OIT to restore the connections and will post updates as we know more. Because the fiber connections are down, traffic to destinations outside the CS network will not work.

UPDATE 8:02pm: we now have the external connection restored to the main CS router and have connectivity to the outside from the CS department. We still have no connections to HPCRC.

UPDATE 8:36pm: we now have the HPCRC fiber connection restored.

UPDATE 8:40pm: All network connections have been restored and connectivity to the outside should be working as expected.

EPILOG 9/26/2012 11:21am: Root cause may have been related to renovation at another site where our fiber is routed.

– CS Staff


Unplanned Outage – Database/Web – April 19, 2012

One of our database servers is having trouble. This is affecting some of our backend/administrative websites. We are investigating.

UPDATE 3:40pm: problem seems to be a failed (local) hard disk that is part of a mirrored pair.

UPDATE 4/20, 9:00am: due to the hardware failure, the database has become unusable. We are going to re-deploy a replacement server using the last good nightly back-up. During this restore, the following systems will be down:

  • The problem-ticket system – for today (4/20), send problem tickets to scott@cs rather than csstaff@cs
  • Dropbox – previously submitted assignments are not stored in the database and are not affected
  • CAS – signin for websites using CS authentication
  • CS Guide
  • Adm Site – the site where the admin staff updates people information
  • Jobs
  • Database driven portions of the CS website
  • A few other less commonly used pages

Once the databases are back up, we will audit them to see what updates may have been made in the interim and take appropriate action.

UPDATE 4/20, 12:00pm: work continues. We will be restoring the databases to the original hardware (with replaced hard drives). This will get things back to the Wednesday night snapshot. We will start with the restore of a test database. When that is successful, we will be able to start providing estimates for the remainder of the databases and systems.

UPDATE 4/20, 3:15pm: multiple attempts to load the data and yield a stable system have not been successful. We have the data in plain-text form so there's no concern about loss of data. In addition, the data loads into the database engine. However, at this time there are lingering issues at the system level (hardware / operating systems). We will provide an additional update by 4pm.

UPDATE 4/20, 4:15pm: attempts to restore with the old hardware, the old operating system, and the old version of MySQL (5.0.x) were not fruitful. We are now installing on new hardware, with a new (puppet-ized) operating system, and a newer version of MySQL (5.1.x). There are slight changes to the meta-data between MySQL versions that require a bit of hand-editing of a couple of files. We are first focusing on the databases needed for dropbox.

UPDATE 4/20, 4:25pm: While our ticket system is still offline, you may use csstaff@cs again to report (other) issues. The e-mail will reach members of CS Staff.

UPDATE 4/20, 5:30pm: Dropbox is back online. Work continues on the databases for other systems.

UPDATE 4/20, 6:00pm: Many systems are back online including the main web site. Work continues on remaining systems.

UPDATE 4/20, 6:20pm: All systems should be operating normally now.


Downtime Rescheduled: Thu, March 22, 2012

Originally scheduled for Tuesday, March 20, 2012, this downtime has been rescheduled.

On Thursday, March 22, 2012, we will have a scheduled downtime from 4:00am to 8:00am EDT.

This downtime affects all users of the department's computing and networking infrastructure.

Scheduled work includes:

  • Deploy new hardware for cycles (soak, wash, rinse, spin)
  • Transfer our virtual machines to new infrastructure
  • Upgrade the operating systems on penguins (tux and opus) and cycles to version 6.2
  • Deploy a new cluster, named ionic, using hardware from the current c2 cluster (and some of the equipment retired by the items above).

SPECIAL NOTE: Since we will be re-installing the operating systems on the public cycle servers (penguins: tux, opus; cycles: soak, wash, rinse, spin), crontab entries will be lost. If you have cron jobs that need to persist, please save them before the downtime and restore them afterwards.
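
For reference, here is a minimal sketch of one way to save and restore a personal crontab around the downtime. It assumes the standard crontab -l / crontab <file> command-line behavior and that the backup file is written somewhere that survives the reinstall (e.g., a network-mounted home directory); adjust the path for your own setup.

    import subprocess
    from pathlib import Path

    # Assumption: your home directory is preserved across the reinstall; if not,
    # point BACKUP at some other location that will survive.
    BACKUP = Path.home() / "crontab.backup"

    def save_crontab() -> None:
        """Run BEFORE the downtime: dump the current crontab to the backup file."""
        result = subprocess.run(["crontab", "-l"], capture_output=True, text=True)
        if result.returncode != 0:
            print("No crontab found (or 'crontab -l' failed); nothing to save.")
            return
        BACKUP.write_text(result.stdout)
        print(f"Saved crontab to {BACKUP}")

    def restore_crontab() -> None:
        """Run AFTER the downtime: reinstall the crontab from the backup file."""
        subprocess.run(["crontab", str(BACKUP)], check=True)
        print(f"Restored crontab from {BACKUP}")

    if __name__ == "__main__":
        save_crontab()        # before the downtime
        # restore_crontab()   # after the downtime, once the servers are back

The equivalent shell one-liners are crontab -l > ~/crontab.backup (before the downtime) and crontab ~/crontab.backup (afterwards).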

After the downtime, c2 will have approximately 50 servers and ionic will have approximately 20 servers. Over the coming weeks, we will reassign approximately 10 servers per week from c2 to ionic until c2 is empty and then retired. More details will be posted on the CS beowulf list.

This work improves the speed and capacity of our general purpose cycle servers as well as our virtual machine infrastructure. This also synchronizes the operating system version of all our public machines (cycles, penguins, and ionic). The new machines (physical or virtual) will be under the configuration control of our recently deployed Puppet infrastructure. This will allow us to more easily keep the operating environments of the servers both up-to-date and in sync.

For parts of the 4am-8am window, most services (e.g., e-mail, web, cycle servers) will be unavailable. E-mail destined for the department will be queued and then delivered at the end of the maintenance window.


Unplanned Outage – Email – January 28, 2012

We have a report of e-mail trouble. We are investigating.

UPDATE 10:05am: E-mail is "up" but running extremely slowly, leading to time-out problems. We are continuing to work the issue.

UPDATE 11:08am: We are repairing the e-mail system's database. The system will continue to be slow for a while. The problem appears to be a latent fault introduced when our previous storage system failed on January 12, 2012. That hardware has since been replaced, but a flaw in the data had migrated to the new system.

UPDATE 11:48am: E-mail service is actually down at the moment, while data is being repaired. We are working to restore service ASAP.

UPDATE 2:00pm: E-mail service is running again, but still very slow. It may be turned off again at some point, as troubleshooting is still underway.

UPDATE 4:00pm: E-mail service has been restored. We have located a source of delays in our storage and eliminated it for the time being. Our apologies for the extended inconvenience.


Unplanned Outage – January 16, 2012

We are still having trouble with the storage array that hosts our virtual infrastructure. We had a short outage on the CS websites but those sites are all back up again as of 12:35pm. Here is a list of things that are currently down:

1) The submission server that runs the check submit script on dropbox.cs
2) opus – the public server

We will post updates to this page as more information becomes available.

UPDATE 1:32pm: The submission server that runs the check submit script on dropbox.cs is now online again.

UPDATE 1:33pm: opus is now online again.

