Signalling chaos: Inside the Elizabeth line’s two-day breakdown
Last month, the Elizabeth line suffered two days of problems when the signalling system broke down, and now a clearer timeline of what happened is emerging.
The first inklings that there might be a problem came early, at around 5am, when the Elizabeth line logged intermittent problems with the communications system, though these weren't expected to affect passenger services.
26th November 2024
5:15am – The first train departs Plumstead depot heading towards Paddington, and difficulties are reported communicating with the signalling system, with the connection dropping out and restoring repeatedly over the next 25 minutes.
5:44am – Staff reset the system, restoring the communications links, only for them to drop out again 10 minutes later.
6:20am – The decision is taken to close the Elizabeth line’s central core tunnels between Abbey Wood and Paddington. The justification is passenger safety should trains unexpectedly break down in the tunnels, especially if there’s overcrowding and a delay in evacuating them. The rest of the network east and west of central London runs under “Contingency COS 0”, with services running from the Paddington and Liverpool Street mainline stations rather than the tunnel platforms.
7am – Trains are still completing journeys and running empty to depot when they lose connection with the overhead line monitors and the ventilation systems, and then the entire core signalling system.
7:50am – An internal significant incident report says that trains are having problems communicating within the core signalling area (CBTC), while train managers cannot see the headcodes on their network maps (LWOD). Train managers are also having to control trains manually due to problems with the Automatic Route Setting (ARS).
8:50am – Drivers also report problems with the GSM-R radio links, so a full reset of the Elizabeth line takes place.
10:30am – TfL issues a statement confirming that they are working on fixing the signalling fault.
10:45am – An internal memo confirms that Siemens has been unable to diagnose the root cause of the signalling problem. A team of specialists hopes to have a resolution by 11:30am. By now, there’s little confidence that they can recover the service by that evening, and plans are being put in place to see if they can run additional services on the Network Rail sides of the line.
Lunchtime – The engineers have recovered some of the system but are still having problems getting it fully operational.
1:38pm – An internal memo update reports that they can restore the service, but it drops out again after about 5 minutes. The software engineers are also having problems with computer firewalls repeatedly locking them out of the network. They now suspect that the firewall itself might be the problem.
Early afternoon – Two more reboots take place, but still can’t fully recover the communications network.
3:30pm – A conference call between Siemens and Elizabeth line engineers suggests they have identified the problem. If the fix works, they might be able to run test trains on the line by 7pm.
5:47pm – An internal memo reports that Siemens is amending some 30 systems to fix the problem. There’s also the possibility that physical equipment at Canary Wharf has failed and would need to be replaced, delaying the reopening of the line.
7pm – The network is running again, and test trains are run through the core tunnels to ensure the signalling network is stable. Testing starts with one train in each direction and is stepped up to a 6 trains per hour service at around 7:30pm.
8:30pm – A conference call is held to decide whether the service can reopen to passengers and whether it can open the following morning as well.
9:45pm – The Elizabeth line reopens to passengers with six trains per hour for the rest of the day.
A bad day, but at least the problems were fixed.
Or so they thought.
27th November 2024
3:30am – The signalling system fails again.
4:45am – A decision is taken not to reopen the Elizabeth line that morning.
8am – TfL apologises for the problems and confirms they’re working with Siemens on a fix.
8:30am – The problem is identified and fixed, and they start running empty test trains through the tunnels to check the service is reliable.
9:12am – An internal memo says that test trains are running reliably and they aim to reopen the line at 11am, but initially the core stations will be exit only until they reach at least 6 trains per hour through the core tunnels.
The decision to open the central stations as exit only was taken to manage crowds, as there are concerns that passengers might be taken ill (a PIOT, or person ill on train, in the jargon) if the first few trains are overcrowded. While that is obviously bad for the passenger affected, it also slows down the service recovery, as trains linger longer at stations waiting for first aid to arrive.
11am – The service is open to passengers, but with fewer trains per hour and delays while trains and drivers are resynced with the timetable.
2pm – Full service restored.
And just as staff started to think they might be able to calm down after a frantic couple of days, Network Rail found a cracked rail crossing between Manor Park and Forest Gate, requiring trains to slow down until the fault was fixed.
“The problem is identified and fixed”
What was the problem or the fix?
The Elizabeth line has been horrendous for the last two months. It has let me down time and time again.
You literally cannot rely on it.
I use it almost every day – hardly ever have problems.
I worked on the ERTMS project on the Cambrian. GSM-R is a key component of the system as it provides communications as well as transmitting data via a second link.
We had problems one night when a train lost connection with the system. Though not pertinent to the Elizabeth line problem, by a process of elimination we discovered it was due to the balise reader on the unit being out of alignment. It worked fine when the unit was running at low speed but failed once it accelerated to line speed.
Without knowing details, it looks like a failure of the GSM-R data stream.
Perhaps someone pulled out the plug in order to plug in the vacuum cleaner. Sort of happened to us one night when GSM-R was shut down. They (Rugby) didn’t understand that we still had a train running and we needed the GSM-R data channel to get it home.
The problem has always been (and will continue to be) designing brand-new trains and signalling systems to be compatible with existing 20-year-old systems.
Easy with hindsight but was it really thought through in the Basic Design Stage?
No hindsight needed, as NR signalling systems were a given from project initiation. To me it was simply part of the requirements; that it became a “problem” just says poor project management, or a client simply not being knowledgeable enough about what they are buying.
No doubt we can expect further fun & games when ETCS comes along.
See also: Underground 4 Lines Modernisation and its curtailment.
Wasn’t this issue said to be triggered by “overnight maintenance” on Monday 25th?
With firewalls being mentioned, one has to wonder if there’s a loose connection to the TfL “cyber incident” and the healthy paranoia that must have followed. Speculatively, it sounds like security-related configuration changes may have broken the GSM-R comms; in my experience, IP network traffic between different system components can be complex and not entirely as expected/documented. Punitive firewall restrictions, whilst in theory great for preventing malicious traffic, can also cause these kinds of intermittent connectivity issues, because the complexity of the genuine traffic appears to the firewall to be malicious.
If this was the case, a conscious decision must have been made NOT to simply roll back the change to get the system up and running – possibly because healthy security paranoia meant that they had to understand the issue, and rolling back would have denied that opportunity.
From an IT perspective you might call this kind of thing “Self-inflicted Denial of Service” or SDOS!
Thanks for all the detail Ian!
We still have no clearer indication of what the actual cause was.
Read it all and still have no idea what the fix was. Saw comments like “it will be fixed… then broke again after xx minutes”.
For obvious security reasons, they’re not going to outline what the bug was lest it give hints to outside actors how to break the network.
Good luck to Tokyo Metro, who take over soon.
Yet TfL refused me any form of refund, and have failed to acknowledge any subsequent requests for their complaints procedure. Any help from anyone would be great.
TfL stonewalls all delay claims for the Elizabeth line, regardless of how obvious and clear the case. And if you call up to query the rejection, they just resubmit the same claim and reject it again.
In the end I spent an hour or two documenting all the runarounds to London Travelwatch. Although they were unable to help, they did send the cases back to TfL, who, without any further explanation or apology, processed the refunds. Still no systemic fix for the problem, though.