
Failure to maintain ageing States IT contributed to catastrophe

Thursday 22 June 2023


Serious mismanagement led to a severe and sustained incident of disruption to public IT services, an independent report has now revealed. Additionally, confusion over whether responsibility lay with the States or its IT partner Agilisys, coupled with outdated or non-existent procedures, hampered the emergency response.

During a major incident last year, multiple failsafe systems did not function as expected, critical equipment in use was not resilient enough, other equipment that should have been in place was missing or broken, and conditions in the data rooms located in government buildings were ineffectively monitored.

Issues with States systems first arose on 25 November 2022 when an air-conditioning unit failed in the main server room at Frossard House. The incident led to widespread disruption across many States-delivered services and triggered an independent investigation into the reasons behind the failure.

PwC – the independent scrutineers of the outages – found that “mission-critical components of the infrastructure equipment supporting the [in-house IT systems] were not maintained in accordance with manufacturer recommendations and generally accepted industry best practice”.

Maintenance contracts intended to ensure stable power supplies for key servers, which were handled by Agilisys, expired more than a year before the major incidents at the end of last year and were not replaced.

The States also waived Agilisys’ responsibility for “providing support to maintain, manage and develop the IT Service Continuity Plan” in November 2021 as such a plan hadn’t been provided to the IT firm by government.

A major incident affecting access to patients’ medical records at the Princess Elizabeth Hospital occurred in May 2022, and one air conditioning unit cooling servers at Frossard House broke down the following month, but no action was taken following either event despite the alarm being raised.

Government officials were reduced to using WhatsApp to coordinate the incident response at the end of 2022 and beginning of 2023 while internal communication channels were inaccessible.

The States said at the time that no public data was lost and that blue light services remained operational throughout four recorded incidents, which PwC has confirmed in its investigations.

Recommendations have been provided seeking to address major gaps in the multi-million-pound contract with Agilisys, with the States already committed to massive investment to improve IT resilience.

Pictured: Some of the key recommendations provided to the States by PwC - work on some of these has already been completed or is underway.

The findings

Investment in IT was notably reduced in 2016 after the States resolved in principle to outsource services that year. While the States had committed to replacing its outdated in-house IT infrastructure by outsourcing to Agilisys, no plan for ongoing maintenance was made “beyond planning its long-term replacement”.

PwC reported that no decision was taken to “clearly allocate the overall responsibility and accountability for ongoing maintenance” of the outdated infrastructure when Agilisys officially took over in 2019, so senior leadership decided to consider existing IT systems as either at or nearing end-of-life.

Since those systems were not fully managed or maintained, but still carried essential services such as those that underpin education, the government was inadvertently carrying significantly more risk than expected, PwC said.

Confusion over the division of responsibilities meant formal procedures weren’t drawn up for emergencies. The States and Agilisys were at loggerheads over the management of existing IT infrastructure after the contract was signed.

At the time of the major IT blackout, States leaders were unaware of how many “mission-critical” services were based on the old servers or how important they were, as no impact analysis of outages had been documented.

There was no disaster recovery plan for those systems either, so those responding were in the dark with regard to “pre-defined and tested response procedures”.

Senior civil servants also didn't convene a Tactical Coordinating Group at any point during the crisis, as it was felt the issues wouldn’t be as prolonged and repetitive as they were. This may have restricted incident response guidance and coordination.

Pictured: Widespread disruption was caused to public IT services when a key server room overheated due to the failure of air conditioning units.

Root causes

The failings were not limited to the last quarter of 2022, however. 

Senior leadership were advised as early as November 2018 that the two air conditioning units at Frossard House had a “9 out of 10 likelihood to fail, with their condition reported as 4 out of 10”. 

Periodic maintenance works on the units also identified hardware failings and recommendations were made for replacements.

The States chose to do nothing.

More than three years later, in June 2022, one of the two units failed, but the States again chose not to fix it despite repeated notices it wasn’t fit for purpose.

It wasn’t until the second air conditioning unit failed in November 2022 – wiping out large swathes of public IT – that the necessary replacements were actioned.

Those outages resulted in prolonged disruption.

The third day of a States debate was ongoing when deputies and parliamentary officials were suddenly left in the dark, unable to access government papers through the States’ website.

It derailed payment systems for public sector workers and those in receipt of benefits too. Customers couldn’t even buy a drink from the café at Beau Sejour as the tills were linked to the aged IT systems.

Pictured: A breakdown of events leading up to the blackout.

What happened?

At 06:07 on Friday 25 November the one functioning air conditioning unit inside the government’s main data room, located at Frossard House, failed. 

The temperature inside the room quickly increased, and a string of warning alerts were sent out to a third party which managed one of two alert systems.

But a separate warning system had already sent a total of 62 warnings between 21:23 on 24 November and 06:00 the following day, alerting that conditions were reaching dangerous levels.

These alerts were sent to email addresses that either no longer existed or were no longer being monitored, as staff had transferred from government addresses to Agilisys ones. 

PwC said it wasn’t provided with “any evidence” that these initial alerts were looked at or responded to. 

The third party that received temperature reports the following morning quickly sought out a government representative from a “call-tree” to advise of the situation but was unsuccessful in reaching two out of three individuals as the information provided was “significantly out of date”.

The States notified Agilisys and other government departments by 07:30, at which point representatives from each had arrived at Frossard House.

Air conditioning engineers were requested after services were confirmed to be down, with a major incident declared by Agilisys and a "P1" incident declared by the States just after 08:30.

Pictured: The States didn't replace air conditioning units despite warnings they were inadequate and were already failing.

By the time specialist engineers arrived the temperature had topped 48 degrees, with doors and windows opened to cool the room. Fans and cooling units were later brought in to beat the heat. The temperature was falling by 09:30 and the failed air con unit was operational again by 13:30.

Agilisys began restarting networking devices in the morning once temperatures dropped but found systems to be in a “partly failed state”. 

A safety system which shut down the servers started automatic processes to protect the data on file. At that point an emergency switchover to secondary servers, located at Edward T Wheadon House, was supposed to occur, but it also failed.

Alerts had been sent out days earlier indicating that the switchover mechanism might not be operational, and PwC noted it might have failed regardless of other events.

The day ended with engineers unable to restore the networks and so no systems were brought back online.

Working late into the evenings throughout the weekend, local and external experts restored parts of the system piecemeal while States departments discussed potential disruption to services for the coming week.

By 14:50 on 28 November various services reported being operational again, with Wi-Fi restored in all schools two days later. 

Pictured: Further events leading up to and including the IT outages.

But the day after, a power failure in the Frossard House server room brought IT functions to a halt again in the morning. This was caused by an ineffective uninterruptible power supply (UPS).

Most services were restored by that afternoon after Agilisys followed the same recovery procedure as on 25 November, but issues fluctuated over the following two weeks with several services intermittently affected.

A second power loss occurred at Edward T Wheadon House on 13 December while a power engineer was working to restore the uninterruptible power supply there. The States and Agilisys agreed to keep the servers offline until this work was complete, but this caused issues for services dependent on those servers. Engineers transferred those services to Frossard House that evening.

All services were restored the following day and work continued throughout the week to ensure automatic switchovers between server sites, in case of recurrent problems, were operational.

A power cut that hit St. Peter Port on 3 January 2023 affected the States’ two data sites. Agilisys declared a major incident at 04:00 after discovering services linked to old systems were down. Data was restored from damaged disks that evening.

Services were gradually restored throughout that week, with senior States leadership calling an end to the major incident period on 5 January.

Response to the report

The States of Guernsey have since said: "In response to the outages, improvements such as equipment upgrades and enhanced maintenance contracts are underway. This includes air conditioning units being repaired with additional temperature sensors now in place, the installation of new generators, improved automated reporting mechanisms and more."

Express has interviewed the President of Policy & Resources, Deputy Peter Ferbrache, and the Head of the Public Service, Mark de Garis. More coverage of the report and political response will come later this morning.
