Major Incident report 6th February to 9th February

As explained previously, whenever there’s a major incident, Updata provide a report. Here is a summary from the week prior to half term – any queries, please get in touch with me.

Description of incident and customer impact

On the 6th of February at approximately 21:07, Capita Reigate 24/7 Network Operations Centre (NOC) Engineers were proactively alerted by their network monitoring tools to the loss of numerous devices in the LD4 and LD5 datacentres. The incident was raised as a priority 1 and assigned to the on-call Technical Escalations Team engineer to conduct initial investigations. All customer traffic traversing these links was down for the duration of the incident. Herts Corporate and Herts Education sites suffered network connectivity issues of varying degrees for the duration of the incident.

Resolution details

Resolution to this incident comprised numerous activities described below:

The initial LD4 6513 hardware (HICS core router) failure was resolved by the replacement of the failed supervisor card and reboot of the device.

The downstream routing issues were resolved by the Technical team troubleshooting and bouncing MPLS tunnels and local PE devices.

The high CPU usage on the LD5 6513 was attributed to a corrupted table that came about as the result of the initial unscheduled reboots of that device. This was corrected under the direction of Cisco Technical Assistance Centre (TAC) by a clean restart of the device which rebuilt the corrupt routing table.

Root cause analysis

The root cause of the incident has been attributed to catastrophic Network failure caused by the failure of the supervisor card in the LD4 6513 device. All subsequent impacts are directly related to this failure. It was noted that a scheduled change was being implemented just prior to the hardware failure. When the Internet was brought back up there was a lot of routing convergence happening. It is possible that this sudden flood of routing requests could have caused a failure on the supervisor card.

Summary of incident

The Capita Reigate 24/7 NOC were alerted to a hardware failure in the LD4 & LD5 datacentres relating to the Herts Network. They engaged the Major Incident (MI) Manager who assembled the MI team. A field engineer was tasked to go to the LD4 datacentre. Upon arrival he identified that the onsite 6513 device had failed and would not reboot. It was subsequently identified that the supervisor card had failed.

A Cisco TAC case was raised and a spare part ordered. Cisco advised that the service agreement for this device was for next business day delivery. At this point a failover was attempted; however, due to configuration issues this was not possible.

In parallel, the MI Manager contacted the Capita Tannochside support team and was able to agree the sourcing of a spare supervisor card. This was expedited and couriered to the datacentre, arriving at approximately 09:40 on 07/02. The card was configured and fitted, bringing the device back up. Some network availability was restored; however, a number of routing issues persisted. The support teams worked on these issues, providing network connectivity for the majority by 19:00 on 06/07.

A downstream impact was later identified whereby some schools were experiencing slow internet connectivity. The issue was investigated and after engaging Cisco TAC was found to be an issue with a routing table on the LD5 6513 which was causing the device to run at extremely high CPU usage. An emergency change was implemented to reboot the LD5 6513 and normal internet connection speed was restored to Herts schools.

Updata’s Observations and Corrective Actions

Contracted support for the LD4 & LD5 6513 devices is next business day. For critical devices this is not adequate.

When devices are brought into service the appropriate level of support should be considered based upon criticality of the device. Additionally when amendments are made to service the support requirements should be reconsidered.

When the device failed, a TAC case was raised with Cisco and a replacement arranged. Going forward this activity should be coordinated via Tannochside.

Support documentation to be updated to ensure that the Tannochside Service Desk are engaged immediately when it is identified that there may be a requirement to procure spares.

The 6513 devices are old and various components are very close to end of life  

The Service Architect team have already made the customer aware that there are exposures with the current end of life hardware. They are in consultation to plan the potential decommissioning of the aged 6513 devices. This will involve a re-engineering of the Network.

Spare parts were not readily available and had to be ordered from Cisco

Updata system’s stocklists to be checked to ensure that the information is current and that the location of any spare parts is recorded correctly.

The attempt to fail services to LD5 was not successful due to a reliance on the LD4 datacentre

It has been identified that, over the course of time and with numerous projects that have changed the network topology, the resilience that was once available has been diminished. As part of the ongoing discussions with the customer, the Service Architect team will be reviewing network resilience and the failover process as an integral part of the proposed network redesign.

During the course of the major incident it was difficult to gain a definitive customer impact

In order to manage incidents appropriately a definitive customer impact is vital. The process for gaining impact needs to be reviewed with the customer to ensure that it is recorded accurately. Additionally all issues experienced need to be recorded in the Incident Management tool for completeness and also to ensure that these issues are taken into account when assessing the impact.

              

Posted in Internet access, Service, Uncategorized

Prevent alerts now available on HICS

We are pleased to inform you that HICS now has an added layer of protection for its users. Schools can now opt in to receive emailed alerts notifying them when local sessions have accessed websites with extremist-related content, in line with the Government’s Prevent Strategy. The alert will be sent within a few minutes to an email address of your choice: the event is captured in real time, then queued for screenshot generation, and finally the email itself may take a minute or two to be delivered.

The list of words that generate these alerts is a closely guarded secret, because making it readily available would defeat the object. The list is ever changing and research is ongoing, so new words are added as new information comes to light. The words are based on detailed research into terrorist groups and their propaganda. The list includes the names of people, concepts, ideas, places (such as routes used to travel to Syria) and also the media arms and publications that terrorist groups use.

The content on the webpage is scanned in real time by the HICS filtering platform. It should be noted that only specific content types are scanned – basically anything text-based, which includes HTML, JavaScript and stylesheets – so images (JPEG and PNG files etc.) are not scanned, nor are executables. When a match is found, the proxy queues up a screenshot request to the screen-capture service (which runs on the same server as the proxy). The screen-capture service will then log the event and send an email out.
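To make the flow described above more concrete, here is a minimal Python sketch of the same idea: skip non-text content, check text content against a word list, and queue an alert when something matches. It is an illustration only and not the HICS implementation. The names `SCANNABLE_TYPES`, `WATCH_TERMS`, `Alert`, `scan_response` and `screenshot_queue` are all hypothetical, the list of content types is an assumption, and the terms shown are placeholders because the real list is not published.

```python
import queue
from dataclasses import dataclass

# Assumed, non-exhaustive set of text-based content types that get scanned.
SCANNABLE_TYPES = {
    "text/html",
    "text/css",
    "text/javascript",
    "application/javascript",
    "text/plain",
}

# Placeholder terms only; the real watch list is kept secret.
WATCH_TERMS = {"example-term-one", "example-term-two"}


@dataclass
class Alert:
    client_ip: str        # internal IP address the session came from
    url: str              # page that triggered the match
    matched_terms: list   # which watch-list terms were found


# Hypothetical stand-in for the queue feeding the screen-capture service.
screenshot_queue = queue.Queue()


def is_scannable(content_type: str) -> bool:
    """Return True for text-based content; images and executables are skipped."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in SCANNABLE_TYPES


def scan_response(client_ip: str, url: str, content_type: str, body: str):
    """Scan one response body and queue an alert if any watch term appears."""
    if not is_scannable(content_type):
        return None
    text = body.lower()
    matched = sorted(term for term in WATCH_TERMS if term in text)
    if not matched:
        return None
    alert = Alert(client_ip=client_ip, url=url, matched_terms=matched)
    screenshot_queue.put(alert)  # a separate service would screenshot, log and email
    return alert
```

In this sketch, calling `scan_response("10.0.0.5", "http://example.test/", "image/png", "...")` returns `None` without scanning, whereas an HTML page containing one of the placeholder terms would produce an `Alert` and place it on the queue for the later screenshot and email steps.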

In order for the scanning engine to be able to detect the words, it must have access to the data stream – this means that SSL inspection must be set up. For now, activity will only be logged against internal IP addresses and not against users from Active Directory. Hopefully, as the alerts are generated in real time, this will not be a huge issue, but rest assured that we are investigating ways to sync this with Active Directory.

To get this set up, all you need to do is contact the HICS Service Desk and request that it be deployed.

Posted in Web Filtering

Procurement update

Lots of you are understandably very keen to hear about the next HICS provision (October 2017 onwards). The procurement has taken a lot longer than we had hoped, but we are nearly in a position to answer your questions. Trust me, I am very keen to open a dialogue.

Just to keep you all in the loop, a very high level overview:

  • The HfL HICS marketing campaign will be starting next week – more information will then become apparent.
  • Lots of service improvements are on the horizon – improved filtering platform, more proactive filtering alerts, app-based filtering, etc. Schools will be able to manage their DNS and firewall requirements through HICS, rather than logging this with us (although we can still do this if you’d prefer). Lots of emphasis on resiliency, backup circuits and scheduled failover testing. Traffic shaping/prioritisation. Plus lots more.
  • Prices will be in line with what they currently are – possibly with savings in years 2 and 3, dependent on the take-up. These are currently being worked on to make sure they are competitive.
  • Change of broadband provider – Updata will no longer be the network supplier (can’t name who the new supplier is just yet).
  • I will be hosting an event for a Q&A session with the new provider in the coming weeks, which secondary schools are welcome to attend. Date TBC.
  • If there’s demand, I will also invite companies in who support primary schools for another session.

Further details next week…. I really appreciate your patience.

 

Posted in Service, Service Improvements

Incident reports

When we have a major incident, I am sent a report on what happened, what went wrong, what lessons can be learnt, etc. I will add a post to summarise these reports in due course… but they do sometimes take a few weeks.

Please do not think that when something gets fixed, that is the end of it. None of us want to revisit these periods of downtime, so it is essential that Updata learn from their mistakes.

Posted in Service

Communication from the HfL Senior Management Team

Please note that this has been sent to all Head Teachers from the HfL Senior Management Team:

I am mailing to apologise for the unexpected internet service disruption yesterday morning (22nd Feb 2017).

As you will be aware, the HICS service has provided reliable and safe internet services to most schools and academies and we have previously been proud of both our network security and reliability. Yesterday, there was an unpredictable network failure that affected schools, HCC offices, libraries, some NHS staff and other Hertfordshire services. We worked hard with our supplier to identify the issue and restore service to users as soon as possible. Unfortunately, it transpired that a fault had developed in a high speed line in the core of the network. We were able to reroute traffic around the failure and service was resumed whilst the initial failure point is being addressed. We are aware that these kinds of issues with connectivity cause significant disruption to both the business of running a school and curriculum coverage, and we would like to apologise for any inconvenience caused. Please be assured that we are working with our supplier to rectify the failure, review system resilience and negotiate improvements in provision.

Given the importance of ensuring that schools have a secure, reliable and safe educational connectivity service, we have been reviewing our provision for the next three years to ensure that schools are able to maximise improvements in technology.  We will mail you with more information about this as soon as possible.

Posted in Service