Internet connectivity from OIT provided networks

From http://helpdesk.princeton.edu/outages/view.plx?ID=3906

"OIT network administrators have informed the Help Desk that the OIT provided network is experiencing problems accessing resources off of the Princeton campus. This includes all internet websites as well as mail resources. Connections to intranet resources (file shares, OIT mail servers, Princeton websites) will all still work from on campus connections. OIT administrators are working to resolve this issue and more information will be provided here as it becomes available."

CS and HPCRC Network Outage

As of 4:39pm today, both the CS and HPCRC external fiber connections are down. We are working with OIT to restore the connections and will post updates as we know more. Until then, traffic between the CS network and outside networks will not work.

UPDATE 8:02pm: we now have the external connection restored to the main CS router and have connectivity to the outside from the CS department. We still have no connections to HPCRC.

UPDATE 8:36pm: we now have the HPCRC fiber connection restored.

UPDATE 8:40pm: All network connections have been restored and connectivity to the outside should be working as expected.

EPILOG 9/26/2012 11:21am: Root cause may have been related to renovation at another site where our fiber is routed.

– CS Staff

Unplanned Outage – Database/Web – April 19, 2012

One of our database servers is having trouble. This is affecting some of our backend/administrative websites. We are investigating.

UPDATE 3:40pm: problem seems to be a failed (local) hard disk that is part of a mirrored pair.

UPDATE 4/20, 9:00am: due to the hardware failure, the database has become unusable. We are going to re-deploy a replacement server using the last good nightly back-up. During this restore, the following systems will be down:

  • The problem system – for today (4/20), send problem tickets to scott@cs rather than csstaff@cs
  • Dropbox – previously submitted assignments are not stored in the database and are not affected
  • CAS – signin for websites using CS authentication
  • CS Guide
  • Adm Site – the site where the admin staff updates people information
  • Jobs
  • Database driven portions of the CS website
  • a few other less commonly used pages

Once the databases are back up, we will be auditing them to see what updates may have been made in the interim and taking appropriate action.

UPDATE 4/20, 12:00pm: work continues. We will be restoring the databases to the original hardware (with replaced hard drives). This will get things back to the Wednesday night snapshot. We will start with the restore of a test database. When that is successful, we will be able to start providing estimates for the remainder of the databases and systems.

UPDATE 4/20, 3:15pm: multiple attempts to load the data and yield a stable system have not been successful. We have the data in plain-text form so there's no concern about loss of data. In addition, the data loads into the database engine. However, at this time there are lingering issues at the system level (hardware / operating systems). We will provide an additional update by 4pm.

UPDATE 4/20, 4:15pm: attempts to restore with the old hardware, the old operating system, and the old version of MySQL (5.0.x) were not fruitful. We are now installing on new hardware, with a new (puppet-ized) operating system, and a newer version of MySQL (5.1.x). There are slight changes to the meta-data between MySQL versions that require a bit of hand-editing of a couple of files. We are first focusing on the databases needed for dropbox.

UPDATE 4/20, 4:25pm: While our ticket system is still offline, you may use csstaff@cs again to report (other) issues. The e-mail will reach members of CS Staff.

UPDATE 4/20, 5:30pm: Dropbox is back online. Work continues on the databases for other systems.

UPDATE 4/20, 6:00pm: Many systems are back online including the main web site. Work continues on remaining systems.

UPDATE 4/20, 6:20pm: All systems should be operating normally now.

Downtime Rescheduled: Thu, March 22, 2012

The downtime originally scheduled for Tuesday, March 20, 2012, has been rescheduled.

On Thursday, March 22, 2012, we will have a scheduled downtime from 4:00am to 8:00am EDT.

This downtime affects all users of the department's computing and networking infrastructure.

Scheduled work includes:

  • Deploy new hardware for cycles (soak, wash, rinse, spin)
  • Transfer our virtual machines to new infrastructure
  • Upgrade the operating systems on penguins (tux and opus) and cycles to version 6.2
  • Deploy a new cluster, named ionic, using hardware from the current c2 cluster (and some of the retired equipment noted above).

SPECIAL NOTE: Since we will be re-installing the operating systems on the public cycle servers (penguins: tux, opus; cycles: soak, wash, rinse, spin), crontab entries will be lost. If you have cron jobs that need to persist, please save them before the downtime and restore them afterwards.
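One way to save and restore cron jobs is sketched below. This is a minimal illustration, not a department-provided tool: the function names and the `run` parameter are hypothetical, and it assumes the backup file is written somewhere that survives the re-install (the announcement does not say which locations do). It relies only on the standard `crontab -l` (print the current crontab) and `crontab FILE` (install a crontab from a file) commands.

```python
import subprocess
from pathlib import Path


def save_crontab(backup_path, run=subprocess.run):
    """Save the current user's crontab to backup_path.

    Returns True if entries were saved, False if `crontab -l` failed
    (it typically exits non-zero when the user has no crontab).
    """
    result = run(["crontab", "-l"], capture_output=True, text=True)
    if result.returncode != 0:
        return False
    Path(backup_path).write_text(result.stdout)
    return True


def restore_crontab(backup_path, run=subprocess.run):
    """Install the crontab stored in backup_path, replacing any current one."""
    run(["crontab", str(backup_path)], check=True)
```

Before the downtime, call `save_crontab("crontab.bak")` on each server where you have cron jobs; afterwards, call `restore_crontab("crontab.bak")` on the same server. The `run` parameter exists only so the functions can be exercised without touching a real crontab.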

After the downtime, c2 will have approximately 50 servers and ionic will have approximately 20 servers. Over the coming weeks, we will reassign approximately 10 servers per week from c2 to ionic until c2 is empty and then retired. More details will be posted on the CS beowulf list.

This work improves the speed and capacity of our general purpose cycle servers as well as our virtual machine infrastructure. This also synchronizes the operating system version of all our public machines (cycles, penguins, and ionic). The new machines (physical or virtual) will be under the configuration control of our recently deployed Puppet infrastructure. This will allow us to more easily keep the operating environments of the servers both up-to-date and in sync.

During some of the 4am-8am window, most of the services (e.g., e-mail, web, cycle servers) will be unavailable. E-mail destined to the department will be queued and then delivered at the end of the maintenance window.

Unplanned Outage – Email – January 28, 2012

We have a report of e-mail trouble. We are investigating.

UPDATE 10:05am: E-mail is "up" but running extremely slowly, leading to time-out problems. We are continuing to work the issue.

UPDATE 11:08am: We are repairing the e-mail system's database. The system will continue to be slow for a while. The problem appears to be a latent fault introduced when our previous storage system failed on January 12, 2012. That hardware has since been replaced, but a flaw in the data had migrated to the new system.

UPDATE 11:48am: E-mail service is actually down at the moment, while data is being repaired. We are working to restore service ASAP.

UPDATE 2:00pm: E-mail service is running again, but still very slow. It may be turned off again at some point, as troubleshooting is still underway.

UPDATE 4:00pm: E-mail service has been restored. We have located a source of delays in our storage and eliminated it for the time being. Our apologies for the extended inconvenience.

Unplanned Outage – January 16, 2012

We are still having trouble with the storage array that hosts our virtual infrastructure. We had a short outage on the CS websites but those sites are all back up again as of 12:35pm. Here is a list of things that are currently down:

1) The submission server that runs the check submit script on dropbox.cs
2) opus – the public server

We will post updates to this page as more information becomes available.

UPDATE 1:32pm: The submission server that runs the check submit script on dropbox.cs is now online again.

UPDATE 1:33pm: opus is now online again.

E-mail / Web – Unplanned Outage – January 12, 2012

As of 7:00am, E-mail and the main web pages are down. We are investigating.

UPDATE 8:05am: Problem is isolated to an old storage server that is used by our virtual machine servers.

UPDATE 8:35am: Storage server is coming back online. After that, all the virtual machines will need a clean restart.

UPDATE 8:45am: Storage server is still having trouble. No ETA yet.

UPDATE 9:35am: We are waiting for a return call from the storage server vendor.

UPDATE 10:05am: Systems are beginning to come back online. Simultaneously, the storage system is rebuilding and remirroring the data. During this time, we expect that systems will be slower. We are also working to stabilize the systems.

UPDATE 11:20am: Systems are basically online. Storage system rebuilding (and associated slowness) continues. E-mail was delayed but none lost.

Downtime: Tue, December 20, 2011

On Tuesday, December 20, 2011, we will have a scheduled downtime from 4:00am to 8:00am EST.

This downtime affects all users of the department's computing and networking infrastructure.

Scheduled work includes:

  • Moving several infrastructure file systems to our new storage system. This requires a configuration change to most of our servers that can only be done when the systems are idle.
