Unplanned Outage – Database/Web – April 19, 2012

One of our database servers is having trouble. This is affecting some of our backend/administrative websites. We are investigating.

UPDATE 3:40pm: problem seems to be a failed (local) hard disk that is part of a mirrored pair.

UPDATE 4/20, 9:00am: due to the hardware failure, the database has become unusable. We are going to re-deploy a replacement server using the last good nightly backup. During this restore, the following systems will be down:

  • The problem system – for today (4/20), send problem tickets to scott@cs rather than csstaff@cs
  • Dropbox – previously submitted assignments are not stored in the database and are not affected
  • CAS – sign-in for websites using CS authentication
  • CS Guide
  • Adm Site – the site where the admin staff updates people information
  • Jobs
  • Database-driven portions of the CS website
  • a few other less commonly used pages

Once the databases are back up, we will be auditing them to see what updates may have been made in the interim and taking appropriate action.

UPDATE 4/20, 12:00pm: work continues. We will be restoring the databases to the original hardware (with replaced hard drives). This will get things back to the Wednesday night snapshot. We will start with the restore of a test database. When that is successful, we will be able to start providing estimates for the remainder of the databases and systems.

UPDATE 4/20, 3:15pm: multiple attempts to load the data and yield a stable system have not been successful. We have the data in plain-text form so there's no concern about loss of data. In addition, the data loads into the database engine. However, at this time there are lingering issues at the system level (hardware / operating systems). We will provide an additional update by 4pm.

UPDATE 4/20, 4:15pm: attempts to restore with the old hardware, the old operating system, and the old version of MySQL (5.0.x) were not fruitful. We are now installing on new hardware, with a new (puppet-ized) operating system, and a newer version of MySQL (5.1.x). There are slight changes to the metadata between MySQL versions that require a bit of hand-editing of a couple of files. We are first focusing on the databases needed for Dropbox.

UPDATE 4/20, 4:25pm: While our ticket system is still offline, you may use csstaff@cs again to report (other) issues. The e-mail will reach members of CS Staff.

UPDATE 4/20, 5:30pm: Dropbox is back online. Work continues on the databases for other systems.

UPDATE 4/20, 6:00pm: Many systems are back online including the main web site. Work continues on remaining systems.

UPDATE 4/20, 6:20pm: All systems should be operating normally now.
