Air traffic failure caused by two locations 3600nm apart sharing 3-letter code

basilesimon | 265 points

I don't know how long that failure mode has been in place or if this is relevant, but it makes me think of analogous situations I've encountered:

When automated systems are first put in place for something high risk, "just shut down if you see something that may be an error" is a totally reasonable plan. After all, literally yesterday everyone was functioning without the automated system; if it doesn't seem to be working right, better to switch back to the manual process we were all using yesterday than risk a catastrophe.

In that situation, switching back to yesterday's workflow is something that won't interrupt much.

A couple of decades later (or honestly even just a couple of years), that same fault handling, left in place without much consideration because it is rarely triggered, is itself catastrophic: switching back to a rarely used and much less efficient manual process is extremely disruptive, and itself raises the risk of catastrophic mistakes.

The general engineering challenge is how we deal with little-used, little-seen functionality (definitely thinking of fault handling, but there may be other cases) that was totally reasonable when put in place but has not aged well. Nobody has noticed or realized it, and even if they did, it might be hard to convince anyone it's a priority to improve. And the longer you wait, the more expensive the fix gets.

jrochkind1 | 3 days ago

FYI: nm = nautical miles, not nanometers.
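
For scale, a quick sketch of the two readings of "3600nm". The 1852 m figure is the international definition of the nautical mile; the function names are just illustrative.

```python
METERS_PER_NAUTICAL_MILE = 1852  # international nautical mile, exact by definition


def nautical_miles_to_km(nmi):
    """Convert nautical miles to kilometres."""
    return nmi * METERS_PER_NAUTICAL_MILE / 1000


def nanometers_to_m(nm):
    """Convert nanometres to metres."""
    return nm * 1e-9


print(nautical_miles_to_km(3600))  # 6667.2
print(nanometers_to_m(3600))       # a few millionths of a metre
```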

jp57 | 3 days ago

Good news: the system successfully detected an error and didn't send bad data to air traffic controllers.

Bad news: the system can't recover from an error in an individual flight plan, so one bad plan brings the whole system down with it (along with the backup system, since it was running the same code).

FateOfNations | 7 days ago

You know there's a software engineer somewhere that saw this as a potential problem, brought up a solution, and had that solution rejected because handling it would add 40 hours of work to a project.

steeeeeve | 3 days ago

There's been some prior discussion on this over the past year. Here are a few I found (selected based on comment count; I haven't re-read the discussions yet):

From the day of:

https://news.ycombinator.com/item?id=37292406 - 33 points by woodylondon on Aug 28, 2023 (23 comments)

Discussions after:

https://news.ycombinator.com/item?id=37401864 - 22 points by bigjump on Sept 6, 2023 (19 comments)

https://news.ycombinator.com/item?id=37402766 - 24 points by orobinson on Sept 6, 2023 (20 comments)

https://news.ycombinator.com/item?id=37430384 - 34 points by simonjgreen on Sept 8, 2023 (68 comments)

Jtsummers | 3 days ago

So, essentially the system has a serious denial of service flaw. I wonder how many variations of flight plans can cause different but similar errors that also force a disconnect of primary and secondary systems.

Seems "reject individual flight plan" might be a better system response than "down hard to prevent corruption".

Bad assumption that a failure to interpret a plan is a serious coding error seems to be the root cause, but hard to say for sure.
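
A minimal sketch of what "reject individual flight plan" could look like. Everything here (`FlightPlan`, `resolve_route`, the waypoint table and its location names) is hypothetical and not taken from FPRSA-R; it only shows a processing loop that sets aside a plan it cannot interpret instead of halting.

```python
from dataclasses import dataclass


class UnresolvableWaypointError(Exception):
    """Raised when a code does not map to exactly one location."""


@dataclass
class FlightPlan:
    flight_id: str
    waypoints: list


# Hypothetical lookup table: code -> every known location using that code.
WAYPOINTS = {
    "DVL": ["Devils Lake, US", "Deauville, FR"],
    "LAX": ["Los Angeles, US"],
    "ORY": ["Paris Orly, FR"],
}


def resolve_route(plan):
    """Resolve every code to exactly one location, or raise."""
    route = []
    for code in plan.waypoints:
        matches = WAYPOINTS.get(code, [])
        if len(matches) != 1:
            raise UnresolvableWaypointError(f"{code}: {len(matches)} matches")
        route.append(matches[0])
    return route


def process_all(plans):
    """Process each plan independently; one bad plan never stops the rest."""
    accepted, rejected = {}, {}
    for plan in plans:
        try:
            accepted[plan.flight_id] = resolve_route(plan)
        except UnresolvableWaypointError as exc:
            rejected[plan.flight_id] = str(exc)  # set aside for manual handling
    return accepted, rejected
```

The point of the design is only that the failure domain is one plan rather than the whole process. Whether that is safe for a real system depends on things this sketch does not model.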

jmvoodoo | 3 days ago

I guarantee that piece of code has a comment like

  /* This should never happen */
  if (waypoints.matchcount > 2) {
      /* ... */
  }

convivialdingo | 3 days ago

Funny airport call letters story: I once headed to Salt Lake City, UT (SLC) for a conference. My luggage was processed by a dyslexic baggage handler, who sent it to... SCL (Santiago, Chile).

I was three days in my jeans at business meetings. My bag came back through Lima, Peru and Houston. My bag was having more fun than me.

GnarfGnarf | 3 days ago

Original (2023) thread with 446 comments:

https://news.ycombinator.com/item?id=37461695 ("UK air traffic control meltdown (jameshaydon.github.io)")

perihelions | 3 days ago

This is old news, but what's new news is that last week, the UK Civil Aviation Authority openly published its Independent Review of NATS (En Route) Plc's Flight Planning System Failure on 28 August 2023 https://www.caa.co.uk/publication/download/23337 (PDF)

Let's look at point 2.28: "Several factors made the identification and rectification of the failure more protracted than it might otherwise have been. These include:

• The Level 2 engineer was rostered on-call and therefore was not available on site at the time of the failure. Having exhausted remote intervention options, it took 1.5 hours for the individual to arrive on-site to perform the necessary full system re-start which was not possible remotely.

• The engineer team followed escalation protocols which resulted in the assistance of the Level 3 engineer not being sought for more than 3 hours after the initial event.

• The Level 3 engineer was unfamiliar with the specific fault message recorded in the FPRSA-R fault log and required the assistance of Frequentis Comsoft to interpret it.

• The assistance of Frequentis Comsoft, which had a unique level of knowledge of the AMS-UK and FPRSA-R interface, was not sought for more than 4 hours after the initial event.

• The joint decision-making model used by NERL for incident management meant there was no single post-holder with accountability for overall management of the incident, such as a senior Incident Manager.

• The status of the data within the AMS-UK during the period of the incident was not clearly understood.

• There was a lack of clear documentation identifying system connectivity.

• The password login details of the Level 2 engineer could not be readily verified due to the architecture of the system."

WHAT DOES "PASSWORD LOGIN DETAILS ... COULD NOT BE READILY VERIFIED" MEAN?

EDIT: Per NATS Major Incident Investigation Final Report - Flight Plan Reception Suite Automated (FPRSA-R) Sub-system Incident 28th August 2023 https://www.caa.co.uk/publication/download/23340 (PDF) ... "There was a 26-minute delay between the AMS-UK system being ready for use and FPRSA-R being enabled. This was in part caused by a password login issue for the Level 2 Engineer. At this point, the system was brought back up on one server, which did not contain the password database. When the engineer entered the correct password, it could not be verified by the server."

amiga386 | 3 days ago

I've posted this here before, but they really need globally unique codes for all the airports, waypoints, etc. It's crazy there are collisions. People always balk at this for some reason, but look at the edge cases that can occur. It's crazy, CRAZY.

sam0x17 | 3 days ago

If you want to, you can read the final report from the UK Civil Aviation Authority here: https://www.caa.co.uk/publication/download/23340

It's pretty readable and quite interesting.

gadders | 3 days ago

For people skimming the comments who are confused: 3600nm here is nautical miles, not nanometers.

My first thought was that this was some parasitic capacitance bug in a board design causing a failure in an aircraft.

junon | 3 days ago

Is nm the official abbreviation for nautical miles? I assume it is nautical miles here. For me it is nanometers.

fyt2024 | 3 days ago

What brought me to read this article was a confusion: how can two locations related to air traffic be 3600 nanometers apart? Was it two points within some chip, or something?

Only partway into the article did it dawn on me that "nm" could stand for something else, and I guessed it was "nautical miles". Live and learn...

Still, it turned out to be an interesting read)

IlliOnato | 3 days ago

So, exactly the same airline (French Bee) and exactly the same route (LAX-ORY) and exactly the same waypoint (DVL) as last September, resulting in exactly the same failure mode:

https://chaos.social/@russss/111048524540643971

Time to tick that "repeat incident?" box in the incident management system, guys.

NovemberWhiskey | 3 days ago

Unique IDs that are not really unique are the beginning of all evil, and there is a special place in hell for those that "recycle" GUIDs instead of generating new ones.

Having ambiguous names can likewise lead to disaster, as seen here, even if this incident had only mild consequences. (Having worked on place name ambiguity academically, I met people who flew to the wrong country due to city name ambiguity and more.)

At least artificial technical names/labels should be globally unambiguous.

jll29 | 2 days ago

Hmm, is this the same incident which happened last year? Or is this a new incident?

From Sept 2023 (flightglobal.com):

- https://archive.is/uiDvy

- Comments: https://news.ycombinator.com/item?id=37430384

Also some more detailed analysis:

- https://jameshaydon.github.io/nats-fail/

- Comments: https://news.ycombinator.com/item?id=37461695

cbhl | 3 days ago

The DVL really is in the details.

_pete_ | 3 days ago

When there's no global clearing house for those identifiers, maybe namespaces would help?

Related: The editorialized HN title uses nanometers (nm) when they possibly mean nautical miles (nmi). What would a flight control system make of that?

tempodox | 3 days ago

Sounds like the kind of thing fuzzing would find easily, if it was applied. Getting a spare system to try it on might be hard though.
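
A toy illustration of the idea, not a claim about how NATS tests anything. `parse_route` is a made-up parser with a deliberately planted "this should never happen" check on repeated codes; `fuzz` throws random routes at it and records any failure that is not a clean rejection.

```python
import random


class RejectedPlan(Exception):
    """The only failure a well-behaved parser is allowed to raise."""


def parse_route(text):
    """Toy parser (hypothetical): space-separated 3-letter codes."""
    codes = text.split()
    if not codes:
        raise RejectedPlan("empty route")
    for code in codes:
        if len(code) != 3 or not code.isalpha():
            raise RejectedPlan(f"bad code {code!r}")
    # Planted bug for the demo: a repeated code is treated as impossible.
    if len(set(codes)) != len(codes):
        raise RuntimeError("this should never happen")
    return codes


def fuzz(parser, rounds=2000, seed=0):
    """Feed random routes to parser; collect inputs that crash it."""
    rng = random.Random(seed)
    alphabet = "ABC"  # tiny alphabet so repeated codes are frequent
    crashes = []
    for _ in range(rounds):
        count = rng.randint(0, 5)
        codes = ["".join(rng.choice(alphabet) for _ in range(3)) for _ in range(count)]
        text = " ".join(codes)
        try:
            parser(text)
        except RejectedPlan:
            pass  # a clean rejection is acceptable behaviour
        except Exception as exc:  # anything else is a finding
            crashes.append((text, repr(exc)))
    return crashes
```

Calling `fuzz(parse_route)` returns the routes that tripped the planted check. A real harness would generate inputs from the actual flight plan grammar and waypoint data rather than random letters.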

mkj | 2 days ago

As an aside, that site's cookie policy sucks. You can opt out of some, but others, like "combine and link data from other sources", "identify devices based on information transmitted automatically", "link different devices" and others can't be disabled. I feel bad for people that don't have the technical sophistication to protect themselves against that kind of prying.

chefandy | 3 days ago

"and it generated a critical exception error. This caused the FPRSA-R primary system to disconnect, as designed,"

"as designed" here sounds like a big PR move to hide the fact they let an uncaught exception crash the entire software...

How about: don't trust your inputs, guys?

mirages | 2 days ago

There’s little to no authentication on filing flight plans which makes this a potentially bigger problem. I’m sure it’s fixed but the mechanism that caused the failure is an assertion that fails by disconnecting the critical systems entirely for “safety”. And the backup failed the same way. Bet there are similar bugs.

mmaunder | 3 days ago

> Just 20s elapsed between the receipt of the flightplan and the shutdown of both FPRSA-R systems, causing all automatic processing of flightplan data to cease and forcing reversion to manual procedures.

That's quite a DoS vulnerability...

cryptonector | 2 days ago

I would’ve thought that the flight industry got “business key” uniqueness right ages ago. If a key is multi-part, then each check should check all parts, not just one. Alternatively, force all airport codes to be globally unique.
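
A rough sketch of the multi-part key idea. The `(namespace, code)` layout and the namespace values are invented for illustration; real aviation data is organised differently.

```python
# Hypothetical table keyed by (namespace, code).
WAYPOINTS = {
    ("US", "DVL"): "Devils Lake",
    ("FR", "DVL"): "Deauville",
    ("US", "LAX"): "Los Angeles",
}


def lookup(table, code, namespace=None):
    """Resolve a code, using every part of the key that was supplied."""
    if namespace is not None:
        try:
            return table[(namespace, code)]
        except KeyError:
            raise LookupError(f"no waypoint {namespace}:{code}") from None
    # No namespace given: only succeed if the bare code is unambiguous.
    candidates = [ns for (ns, c) in table if c == code]
    if len(candidates) != 1:
        raise LookupError(
            f"{code} matches {len(candidates)} namespaces: {sorted(candidates)}"
        )
    return table[(candidates[0], code)]
```

With this table, `lookup(WAYPOINTS, "DVL", "FR")` resolves, while `lookup(WAYPOINTS, "DVL")` raises instead of silently picking one of the two.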

polskibus | 3 days ago

I’m curious what part of the code rejected the validity of the flight plan. I’m also curious what keys are actually used for lookups when they aren’t unique??

klysm | 2 days ago

"What are these? Airports for ants?" I would expect HN dudes to fix the headline regarding SI / nautical units. Sloppy copy.

whiteandmale | 2 days ago

Could you front-end the software with a proxy which bounces code-collision requests, limiting the damage to the specific route and not the entire system's integrity?

This is hack-on-hack stuff, but I am wondering if there is a low-cost fix for a design behaviour which can't be altered without every airline, and every other airline system worldwide, accommodating the changes to remove 3-letter code collisions.

Gate the problem. Require routing for TLA collisions to be done by hand, or be fixed in post into two paths which avoid the collision (insert an intermediate waypoint).
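
Roughly what I mean by gating, as a sketch. The record format and the function names `colliding_codes` and `gate` are made up here; no real flight-planning interface is being described.

```python
from collections import Counter


def colliding_codes(waypoint_records):
    """Return the codes that appear more than once in (code, location) records."""
    counts = Counter(code for code, _location in waypoint_records)
    return {code for code, n in counts.items() if n > 1}


def gate(plan_codes, collisions):
    """Decide whether a plan goes downstream or to a human.

    Returns ("manual", hits) if the plan uses any colliding code,
    otherwise ("forward", []).
    """
    hits = [code for code in plan_codes if code in collisions]
    if hits:
        return ("manual", hits)
    return ("forward", [])
```

The collision set would be recomputed whenever the waypoint data changes, so the proxy never needs to understand routes, only to spot codes known to be ambiguous.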

ggm | 3 days ago

What's crazy is that this hasn't happened before; waypoints that share a name aren't uncommon.

aeroevan | 3 days ago

Headline still hasn't been fixed? (Correct abbreviation is NM).

dboreham | 3 days ago

Initially read this as 3600 nanometres... :-)

entropyie | 3 days ago

Well, 3600 billionths of a meter IS kinda close...just sayin'

Optimal_Persona | 7 days ago

The title sounds like an AMD cpu issue.

mjan22640 | 3 days ago

oh nautical miles !

not nanometres as you might assume from being used to normal units

craigds | 3 days ago

Title should be nmi

ipunchghosts | 3 days ago

Is it just me or was it basically impossible to decipher what those three letter codes were?

jojohohanon | 3 days ago

In other news, goat carts are still getting 100 furlong–firkin–fortnight on dandelions.

=3

Joel_Mckay | 3 days ago


3600 nanometers? That's cool.

muffwiggler | 2 days ago

People posting on this forum saying "ah well software's failure case isn't as bad"

> This forced controllers to revert to manual processing, leading to more than 1,500 flight cancellations and delaying hundreds of services which did operate.

hobs | 3 days ago

It's like déjà vu all over again, Yogi.

Aug 2023: “UK air traffic woes caused by 'invalid flight plan data'”

https://www.theregister.com/2023/08/30/uk_air_traffic_woes_i... --

(-11 down votes and counting)

J05ephu5M13r | 3 days ago