April 27, 1997
The downtime scheduled for April 27 ran into several problems,
as most users are probably already aware. What follows is a
summary of the problems and what is being done to prevent them in
the future:
- The program "lynx" has stopped working because
of several security patches applied to the IBM machines.
We are evaluating the problem and trying to find a
workaround.
- The system incoming mail spool filled up, causing problems
with the delivery of mail. The problem started sometime
Saturday morning and was caused by an error in a system
script, which has since been corrected. Some mail sent
between Saturday morning and Sunday morning may have been
lost or returned to the sender as a result.
- HP delayed the startup of their tasks (see related system
news article "970427_Down" for a list of the
tasks) until 2:00 PM on Sunday. That should have allowed
enough time if everything had gone as planned;
unfortunately, it did not. At least one of the scripts
that needed to be applied failed and caused problems in
bringing the machines back up. The vendor is being asked
to allocate more time for their tasks, allowing for any
unforeseen problems, and to keep them within the
originally scheduled times.
- The systems were finally brought up by 10:30 PM, but
sendmail either did not come up correctly or failed soon
after our systems staff went home for the night. It was
brought back up at about 7:00 AM on April 28, 1997.
Procedures to shorten any future extended outages are being
developed.
- Some of the problems encountered are being attributed to
a lack of proper communications between the various
groups involved. This is being addressed before any further
downtimes are scheduled.
The procedures used to plan and execute these downtimes are
under review to find ways of improving them and lessening the
impact on the users.
We are already finding that redundant hardware is of little
use when the primary system undergoing maintenance is the
one item that is not redundant (the RAID disk controller, for
example).
Revised by: George Westlund (gwestlun@calpoly.edu)
Revised: May 5, 1997