Air traffic failure caused by two locations 3600nm apart sharing 3-letter code
FYI: nm = nautical miles, not nanometers.
Good news: the system successfully detected an error and didn't send bad data to air traffic controllers.
Bad news: the system can't recover from an error in an individual flight plan, bringing the whole system down with it (along with the backup system, since it was running the same code).
You know there's a software engineer somewhere that saw this as a potential problem, brought up a solution, and had that solution rejected because handling it would add 40 hours of work to a project.
There's been some prior discussion on this over the past year; here are a few threads I found (selected based on comment count, haven't re-read the discussions yet):
From the day of:
https://news.ycombinator.com/item?id=37292406 - 33 points by woodylondon on Aug 28, 2023 (23 comments)
Discussions after:
https://news.ycombinator.com/item?id=37401864 - 22 points by bigjump on Sept 6, 2023 (19 comments)
https://news.ycombinator.com/item?id=37402766 - 24 points by orobinson on Sept 6, 2023 (20 comments)
https://news.ycombinator.com/item?id=37430384 - 34 points by simonjgreen on Sept 8, 2023 (68 comments)
So, essentially the system has a serious denial of service flaw. I wonder how many variations of flight plans can cause different but similar errors that also force a disconnect of primary and secondary systems.
Seems "reject individual flight plan" might be a better system response than "down hard to prevent corruption"
Bad assumption that a failure to interpret a plan is a serious coding error seems to be the root cause, but hard to say for sure.
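To make the per-plan rejection idea concrete, here's a minimal Python sketch. Everything in it (the waypoint table, plan format, and function names) is invented for illustration; it's not how FPRSA-R actually works:

    import logging

    # Toy waypoint table: the same 3-letter code can name two distant points.
    WAYPOINTS = {
        "DVL": [("Deauville", 49.6, 0.07), ("Devils Lake", 48.1, -98.9)],
        "ORY": [("Paris Orly", 48.7, 2.4)],
    }

    class AmbiguousWaypoint(Exception):
        pass

    def parse_plan(raw):
        """Resolve each code; raise if a code is unknown or matches more than one place."""
        resolved = []
        for code in raw.split():
            matches = WAYPOINTS.get(code, [])
            if len(matches) != 1:
                raise AmbiguousWaypoint(f"{code}: {len(matches)} matches")
            resolved.append(matches[0])
        return resolved

    def handle_plans(raw_plans):
        accepted, quarantined = [], []
        for raw in raw_plans:
            try:
                accepted.append(parse_plan(raw))
            except AmbiguousWaypoint:
                # Fail only this plan: log it for a human, keep the system running.
                logging.exception("Quarantining flight plan %r", raw)
                quarantined.append(raw)
        return accepted, quarantined

    # The ambiguous plan is set aside; processing of the remaining plan continues.
    print(handle_plans(["ORY DVL", "ORY"]))

Whether something like this is acceptable obviously depends on what downstream consumers assume about the completeness of the feed; the point is just that the failure domain can be one plan rather than the whole system.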
I guarantee that piece of code has a comment like
/* This should never happen */
if (waypoints.matchcount > 2) {
    /* ...so shut the whole system down */
}
Funny airport call letters story: I once headed to Salt Lake City, UT (SLC) for a conference. My luggage was processed by a dyslexic baggage handler, who sent it to... SCL (Santiago, Chile).
I was three days in my jeans at business meetings. My bag came back through Lima, Peru and Houston. My bag was having more fun than me.
Original (2023) thread with 446 comments:
https://news.ycombinator.com/item?id=37461695 ("UK air traffic control meltdown (jameshaydon.github.io)")
This is old news, but what's new news is that last week, the UK Civil Aviation Authority openly published its Independent Review of NATS (En Route) Plc's Flight Planning System Failure on 28 August 2023 https://www.caa.co.uk/publication/download/23337 (PDF)
Let's look at point 2.28: "Several factors made the identification and rectification of the failure more protracted than it might otherwise have been. These include:
• The Level 2 engineer was rostered on-call and therefore was not available on site at the time of the failure. Having exhausted remote intervention options, it took 1.5 hours for the individual to arrive on-site to perform the necessary full system re-start which was not possible remotely.
• The engineer team followed escalation protocols which resulted in the assistance of the Level 3 engineer not being sought for more than 3 hours after the initial event.
• The Level 3 engineer was unfamiliar with the specific fault message recorded in the FPRSA-R fault log and required the assistance of Frequentis Comsoft to interpret it.
• The assistance of Frequentis Comsoft, which had a unique level of knowledge of the AMS-UK and FPRSA-R interface, was not sought for more than 4 hours after the initial event.
• The joint decision-making model used by NERL for incident management meant there was no single post-holder with accountability for overall management of the incident, such as a senior Incident Manager.
• The status of the data within the AMS-UK during the period of the incident was not clearly understood.
• There was a lack of clear documentation identifying system connectivity.
• The password login details of the Level 2 engineer could not be readily verified due to the architecture of the system."
WHAT DOES "PASSWORD LOGIN DETAILS ... COULD NOT BE READILY VERIFIED" MEAN?
EDIT: Per NATS Major Incident Investigation Final Report - Flight Plan Reception Suite Automated (FPRSA-R) Sub-system Incident 28th August 2023 https://www.caa.co.uk/publication/download/23340 (PDF) ... "There was a 26-minute delay between the AMS-UK system being ready for use and FPRSA-R being enabled. This was in part caused by a password login issue for the Level 2 Engineer. At this point, the system was brought back up on one server, which did not contain the password database. When the engineer entered the correct password, it could not be verified by the server. "
I've posted this here before, but they really need globally unique codes for all the airports, waypoints, etc.; it's crazy that there are collisions. People always balk at this for some reason, but look at the edge cases that can occur. It's crazy.
If you want to, you can read the final report from the UK Civil Aviation Authority here: https://www.caa.co.uk/publication/download/23340
It's pretty readable and quite interesting.
For the people skimming the comments and are confused: 3600nm here is nautical miles, not nanometers.
My first thought was that this was some parasitic capacitance bug in a board design causing a failure in an aircraft.
Is nm the official abbreviation for nautical miles? I assume it means nautical miles here. For me, nm is nanometers.
What brought me to read this article was confusion: how can two locations related to air traffic be 3600 nanometers apart? Was it two points within some chip, or something?
Only partway into the article did it dawn on me that "nm" could stand for something else, and I guessed it was "nautical miles". Live and learn...
Still, it turned out to be an interesting read)
So, exactly the same airline (French Bee) and exactly the same route (LAX-ORY) and exactly the same waypoint (DVL) as last September, resulting in exactly the same failure mode:
https://chaos.social/@russss/111048524540643971
Time to tick that "repeat incident?" box in the incident management system, guys.
Unique IDs that are not really unique are the beginning of all evil, and there is a special place in hell for those that "recycle" GUIDs instead of generating new ones.
Having ambiguous names can likewise lead to disaster, as seen here, even if this incident had only mild consequences. (Having worked on place name ambiguity academically, I met people who flew to the wrong country due to city name ambiguity and more.)
At least artificial technical names/labels should be globally unambiguous.
Hmm, is this the same incident which happened last year? Or is this a new incident?
From Sept 2023 (flightglobal.com):
- Comments: https://news.ycombinator.com/item?id=37430384
Also some more detailed analysis:
- https://jameshaydon.github.io/nats-fail/
- Comments: https://news.ycombinator.com/item?id=37461695
The DVL really is in the details.
When there's no global clearing house for those identifiers, maybe namespaces would help?
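Something like this, maybe; a toy Python illustration where the region prefixes and the dataclass are made up, not an actual aviation scheme. The short code stays for display, but every lookup is keyed by namespace plus code:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class WaypointId:
        region: str   # e.g. an FIR or national namespace (invented here)
        code: str     # the familiar short code

        def __str__(self):
            return f"{self.region}/{self.code}"

    REGISTRY = {
        WaypointId("FR", "DVL"): ("Deauville", 49.6, 0.07),
        WaypointId("US", "DVL"): ("Devils Lake", 48.1, -98.9),
    }

    # Unambiguous even though the bare codes collide.
    print(REGISTRY[WaypointId("FR", "DVL")])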
Related: The editorialized HN title uses nanometers (nm) when they possibly mean nautical miles (nmi). What would a flight control system make of that?
Sounds like the kind of thing fuzzing would find easily, if it were applied. Getting a spare system to try it on might be hard, though.
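Roughly this shape, as a sketch in Python. The toy parser and exception type are stand-ins, since the real FPRSA-R interface isn't public; the property being checked is that bad input always produces a controlled rejection, never an unhandled crash:

    import random
    import string

    class PlanRejected(Exception):
        """The only exception a well-behaved parser is allowed to raise."""

    def toy_parser(plan):
        codes = plan.split()
        if len(codes) != len(set(codes)):      # stand-in for "ambiguous waypoint"
            raise PlanRejected("duplicate or ambiguous waypoint code")
        return codes

    def fuzz(parser, iterations=10_000):
        for _ in range(iterations):
            plan = " ".join("".join(random.choices(string.ascii_uppercase, k=3))
                            for _ in range(random.randint(1, 30)))
            try:
                parser(plan)
            except PlanRejected:
                pass                           # controlled rejection is acceptable
            # anything else propagates and fails the fuzz run

    fuzz(toy_parser)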
As an aside, that site's cookie policy sucks. You can opt out of some, but others, like "combine and link data from other sources", "identify devices based on information transmitted automatically", "link different devices" and others can't be disabled. I feel bad for people that don't have the technical sophistication to protect themselves against that kind of prying.
"and it generated a critical exception error. This caused the FPRSA-R primary system to disconnect, as designed,"
"As designed" here sounds like a PR move to hide the fact that they let an uncaught exception crash the entire system...
How about: don't trust your inputs, guys?
There's little to no authentication on filing flight plans, which makes this a potentially bigger problem. I'm sure it's fixed, but the mechanism that caused the failure is an assertion that fails by disconnecting the critical systems entirely for "safety". And the backup failed the same way. Bet there are similar bugs.
> Just 20s elapsed between the receipt of the flightplan and the shutdown of both FPRSA-R systems, causing all automatic processing of flightplan data to cease and forcing reversion to manual procedures.
That's quite a DoS vulnerability...
I would've thought that the flight industry got "business key" uniqueness right ages ago. If a key is multi-part, then each check should check all the parts, not just one. Alternatively, force all airport codes to be globally unique.
I'm curious what part of the code rejected the validity of the flight plan. I'm also curious what keys are actually used for lookups when they aren't unique??
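No idea what the real matching logic is, but one plausible approach is to treat the bare code as only part of the key and disambiguate by geography, e.g. distance from the previous fix on the route. Purely illustrative Python, with rough coordinates:

    import math

    WAYPOINTS = {
        "DVL": [("Deauville (FR)", 49.6, 0.07), ("Devils Lake (US)", 48.1, -98.9)],
    }

    def dist(a, b):
        # crude flat-earth distance; good enough to tell continents apart
        return math.hypot(a[0] - b[0], (a[1] - b[1]) * math.cos(math.radians(a[0])))

    def resolve(code, previous_fix):
        """Pick the candidate nearest the previous point on the route."""
        return min(WAYPOINTS[code], key=lambda c: dist((c[1], c[2]), previous_fix))

    print(resolve("DVL", (48.7, 2.4)))    # previous fix near Paris -> Deauville
    print(resolve("DVL", (45.0, -93.0)))  # previous fix near Minneapolis -> Devils Lake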
"What are these? Airports for ants?" I would HN dudes expect to fix the headline regarding SI / nautical units. Sloppy copy.
Could you front-end the software with a proxy which bounces code-collision requests, and limit the damage to the specific route rather than the entire system's integrity?
This is hack-on-hack stuff, but I am wondering if there is a low-cost fix for a design behaviour which can't be altered without every airline, and every other airline system worldwide, accommodating the changes to remove 3-letter code collisions.
Gate the problem. Require routing for TLA collisions to be done by hand, or fix them in post into two paths which avoid the collision (introduce an intermediate waypoint).
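As a sketch of that gating idea (Python, invented interfaces; the collision list would have to be maintained from the waypoint database): divert only plans that touch a known-colliding code to manual handling, and let everything else through untouched.

    KNOWN_COLLISIONS = {"DVL"}   # codes that exist in more than one place (example only)

    def gate(plan):
        codes = set(plan.split())
        if codes & KNOWN_COLLISIONS:
            return ("manual", plan)   # a human resolves the ambiguity
        return ("auto", plan)         # normal automated processing

    for p in ["LAX DVL ORY", "LHR JFK"]:
        print(gate(p))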
What's crazy is that this hasn't happened before; waypoints that share a name aren't uncommon.
Headline still hasn't been fixed? (Correct abbreviation is NM).
Initially read this as 3600 nanometres... :-)
Well, 3600 billionths of a meter IS kinda close...just sayin'
The title sounds like an AMD cpu issue.
Oh, nautical miles!
Not nanometres, as you might assume from being used to normal units.
Title should be nmi
Is it just me or was it basically impossible to decipher what those three letter codes were?
In other news, goat carts are still getting 100 furlong–firkin–fortnight on dandelions.
=3
3600 nanometers? That's cool.
To the people posting on this forum saying "ah well, software's failure case isn't as bad":
> This forced controllers to revert to manual processing, leading to more than 1,500 flight cancellations and delaying hundreds of services which did operate.
It's like déjà vu all over again, Yogi.
Aug 2023: “UK air traffic woes caused by 'invalid flight plan data'”
https://www.theregister.com/2023/08/30/uk_air_traffic_woes_i... --
(-11 down votes and counting)
I don't know how long that failure mode has been in place, or if this is relevant, but it makes me think of analogous situations I've encountered:
When an automated system is first put in place for something high-risk, "just shut down if you see something that may be an error" is a totally reasonable plan. After all, literally yesterday everyone was functioning without the automated system; if it doesn't seem to be working right, better to switch back to the manual process everyone was using yesterday than to risk a catastrophe.
In that situation, switching back to yesterday's workflow is something that won't interrupt much.
A couple of decades -- or honestly even just a couple of years -- later, that same fault handling, left in place without much consideration because it is rarely triggered, is itself catastrophic: switching back to a rarely used and much less efficient manual process is extremely disruptive, and it even raises the risk of catastrophic mistakes.
The general engineering challenge is how we deal with little-used, little-seen functionality (I'm mainly thinking of fault handling, but there may be other cases) that was totally reasonable when put in place but has not aged well, where nobody has noticed or realized it; even if they did, it might be hard to convince anyone it's a priority to improve, and the longer you wait the more expensive it gets.