As explained previously, whenever there is a major incident, Updata provide a report. Here is a summary of the incident from the week before half term – any queries, please get in touch with me.
On the 6th of February at approximately 21:07, engineers at the Capita Reigate 24/7 Network Operations Centre (NOC) were proactively alerted by their network monitoring tools to the loss of numerous devices in the LD4 and LD5 datacentres. The incident was raised as a Priority 1 and assigned to the on-call Technical Escalations Team engineer for initial investigation. All customer traffic traversing the affected links was down, and Herts Corporate and Herts Education sites suffered network connectivity issues of varying degrees, for the duration of the incident.
Resolution of this incident comprised the activities described below:
The initial LD4 6513 hardware (HICS core router) failure was resolved by the replacement of the failed supervisor card and reboot of the device.
The downstream routing issues were resolved by the technical team troubleshooting and bouncing MPLS tunnels and local PE (Provider Edge) devices.
The high CPU usage on the LD5 6513 was attributed to a routing table that had become corrupted as a result of the initial unscheduled reboots of that device. This was corrected, under the direction of the Cisco Technical Assistance Centre (TAC), by a clean restart of the device, which rebuilt the corrupted routing table.
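For illustration only: the kind of check described above (spotting a 6513 running at sustained high CPU) can be scripted. The sketch below is a minimal example that assumes the open-source Netmiko library together with placeholder hostnames and credentials; it is not part of Updata's or Capita's actual monitoring tooling.

    # Minimal sketch only - assumes the open-source Netmiko library and
    # placeholder hostnames/credentials; not Updata's or Capita's actual tooling.
    from netmiko import ConnectHandler

    DEVICES = ["ld4-6513.example.net", "ld5-6513.example.net"]  # hypothetical names
    CPU_ALERT_THRESHOLD = 80  # percent, five-minute average

    def five_minute_cpu(host):
        """Return the five-minute CPU average reported by a Cisco IOS device."""
        conn = ConnectHandler(device_type="cisco_ios", host=host,
                              username="readonly", password="********")
        # Typical matching line:
        # "CPU utilization for five seconds: 9%/2%; one minute: 10%; five minutes: 12%"
        output = conn.send_command("show processes cpu | include five minutes")
        conn.disconnect()
        return int(output.rsplit("five minutes:", 1)[1].strip().rstrip("%"))

    for device in DEVICES:
        cpu = five_minute_cpu(device)
        status = "ALERT - investigate routing table" if cpu > CPU_ALERT_THRESHOLD else "OK"
        print(f"{device}: five-minute CPU {cpu}% ({status})")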
The root cause of the incident has been attributed to a catastrophic network failure caused by the failed supervisor card in the LD4 6513 device; all subsequent impacts are directly related to this failure. It was noted that a scheduled change was being implemented just prior to the hardware failure, during which the internet connection was brought back up and a large amount of routing convergence took place. It is possible that this sudden flood of routing updates caused the failure of the supervisor card.
The Capita Reigate 24/7 NOC was alerted to a hardware failure in the LD4 and LD5 datacentres relating to the Herts network. The NOC engaged the Major Incident (MI) Manager, who assembled the MI team, and a field engineer was dispatched to the LD4 datacentre. On arrival the engineer found that the onsite 6513 device had failed and would not reboot; it was subsequently identified that the supervisor card had failed.
A Cisco TAC case was raised and a spare part ordered; Cisco advised that the service agreement for this device was for next-business-day delivery. At this point a failover to LD5 was attempted, but due to configuration issues this was not possible.
In parallel, the MI Manager contacted the Capita Tannochside support team and agreed the sourcing of a spare supervisor card. This was expedited and couriered to the datacentre, arriving at approximately 09:40 on 07/02. The card was configured and fitted, bringing the device back up. Some network availability was restored, although a number of routing issues persisted; the support teams worked on these and restored network connectivity for the majority of sites by 19:00 on 07/02.
A downstream impact was later identified whereby some schools were experiencing slow internet connectivity. The issue was investigated and, after engaging Cisco TAC, was found to be caused by a routing table on the LD5 6513 that was driving the device to extremely high CPU usage. An emergency change was implemented to reboot the LD5 6513, and normal internet connection speeds were restored to Herts schools.
Contracted support for the LD4 and LD5 6513 devices is next business day; for critical devices this is not adequate.
When devices are brought into service, the appropriate level of support should be considered based upon the criticality of the device. Additionally, when amendments are made to the service, the support requirements should be reconsidered.
When the device failed, a TAC case was raised with Cisco and a replacement arranged. Going forward, this activity should be coordinated via Tannochside.
Support documentation is to be updated to ensure that the Tannochside Service Desk is engaged immediately when it is identified that there may be a requirement to procure spares.
The 6513 devices are old and various components are very close to end of life.
The Service Architect team has already made the customer aware of the exposures associated with the current end-of-life hardware and is in consultation to plan the potential decommissioning of the aged 6513 devices. This will involve a re-engineering of the network.
Spare parts were not readily available and had to be ordered from Cisco.
The stocklists in Updata's systems are to be checked to ensure that the information is current and that the location of any spare parts is recorded correctly.
The attempt to fail services over to LD5 was not successful due to a reliance on the LD4 datacentre.
It has been identified that, over time and through numerous projects that have changed the network topology, the resilience that was once available has been diminished. As part of the ongoing discussions with the customer, the Service Architect team will review network resilience and the failover process as an integral part of the proposed network redesign.
During the course of the major incident it was difficult to gain a definitive view of the customer impact.
In order to manage incidents appropriately, a definitive view of the customer impact is vital. The process for gathering impact information needs to be reviewed with the customer to ensure that it is recorded accurately. Additionally, all issues experienced need to be recorded in the Incident Management tool, both for completeness and to ensure that they are taken into account when assessing the impact.