Reproducing the AWS Outage Race Condition with a Model Checker

simplegeek | 141 points

I'm a fan of more formal methods in progam analysis, but this particular excercise is very hindsight-is-20/20

> In this case, we can set up an invariant stating that the DNS should never be deleted once a newer plan has been applied

If that invariant had been expressed in the original code — as I'm sure it now is — it wouldn't have broken in the first place. The invariant is obvious in hindsight, but it's hardly axiomatic.

muglug | 2 days ago

Presumably that one guy at AWS who promotes TLA+ is furiously modeling all this himself in more detail for internal analysis.

jldugger | 2 days ago

Real world systems often have to deviate from the "pure" version used to run formal methods on. This could be how long you keep transaction logs for, or how long rows are tombstoned for, etc. The longer the time period, the costlier it usually is, in total storage cost and sometimes performance too. So you have to compromise with where you set the time period for.

Let's imagine that the process usually takes 1 minute and the tombstones are kept for 1 day. It would take something ridiculous to make the thing that usually takes 1 minute take longer than a day - not worth even considering. But sometimes there are a confluence of events that make such a thing possible... For example, maybe the top of rack switch died. The server stays running, it just can't succeed any upstream calls. Maybe it is continuously retrying while the network is down (or just slowly timing out on individual requests and skipping to the next one to try it). When the network comes back up, those calls start succeeding but now it's so much staler than you ever even thought was possible or planned for. That's just one scenario, probably not exactly what happened to AWS.

grogers | 2 days ago

If I were to summarize how we attacked this when I was on AWS (different team)

formal methods. Some of this started a long time ago so not sure if it was TLA, TLA+, or something else. (I am a useless manager type)

fake clients / servers to make testing possible

strict invariants

A simulator to fuzz/fault the entire system. We didn't get this until later in the life of the service but flushed out race condition bugs that would have taken years to do.

We never got to replaying customer traffic patterns which was a pet idea of mine but probably the juice wasn't worth the squeeze.

harshaw | 2 days ago

One thing I've observed in my career is that if you take the extra time / are good enough / have the experience to ship something solid enough that nobody ever thinks about it, you never get nearly as much credit as the folks who quickly ship broken things or the folks that parachute in to fix those things.

I've changed my behavior to reflect the incentives but remain sad about it.

pkilgore | 2 days ago

imho, model checker suits for the problem with many different states and complex state transformation. But in this case, it's a simple toctou problem.. Using model checker sounds weird for me

mmiao | 2 days ago

Wish the author had an introduction to model checker article.

I have yet to learn about this and will not be throwing some time into researching this topic.

tonetegeatinst | 2 days ago

I don't really understand the purpose of this. It's not like they have anything other than the RCA (e.g. the code)

philipwhiuk | 2 days ago