On rebooting: the unreasonable effectiveness of turning computers off and on

todsacerdoti | 250 points

I remember attending a talk by Walter Bright maybe three decades ago, back when he was known for his C++ compiler for Windows, and before that his compilers for DOS. He talked about his very paranoid approach to assertions: in addition to a huge number of asserts for preconditions, he would start out with asserts at essentially every basic block, and he'd take them out as he produced tests that reached that point. These were the DOS days and he didn't have a coverage tool, so the presence of an unconditional assert meant that the line hadn't been reached yet in testing.

He said he developed that style because the lack of protection in DOS meant that any error (like a buffer overrun) could trash everything, right down to damaging the machine.

He said that early on those asserts were enabled in the shipping code, making it appear that his compiler was less reliable than competitors when he felt it was the opposite, but I think he wound up having to modify his approach.
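
To make the style concrete, here is a rough sketch (hypothetical Python, not Bright's actual C code) of "an assert at every basic block, removed only once a test reaches it":

    import string

    def classify_token(text: str) -> str:
        """Toy helper written in the 'assert until a test reaches it' style."""
        assert text, "precondition: caller must not pass an empty token"

        if text.isdigit():
            # A test has already reached this branch, so its "not reached" assert was removed.
            return "number"
        elif text[0] in string.ascii_letters:
            assert False, "not reached by any test yet"  # delete once a test covers identifiers
            return "identifier"
        else:
            assert False, "not reached by any test yet"  # delete once a test covers the remaining cases
            return "other"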

not2b | 2 years ago

It’s been known for more than 35 years that most bugs found in production are “Heisenbugs”:

https://www.cs.columbia.edu/~junfeng/08fa-e6998/sched/readin...

Gray’s paper provides a logical reason why: if a bug can be reproduced reliably, it’s much more likely to be found during testing and fixed, so appearing seemingly at random is a kind of Darwinian survival adaptation.

twoodfin | 2 years ago

Also, designing something "crash-first", even if you don't call it that, leads to many different approaches and possible improvements.

For example, imagine some embedded device in an ISP network; it is not very accessible, so high reliability is required. You can overengineer it to be a super-reliable Voyager-class computer, but a) that will cost much more money than it should, and b) you will still fail to achieve the target.

Or you can take the crash-first approach. Many things can be simplified then: for example, no need for stateful config, no need to write config management that saves it, checks it, etc. You just rely on receiving a fresh config every boot and writing it correctly to the controllers. Less complexity.

Then, being crash-first, you expect to reboot more often. So you optimize boot time, which would otherwise be a much lower-priority task. And suddenly you have several times lower downtime per device.

You can also save effort on some hard stuff - e.g. any and all third-party controllers with third-party code blobs and weird APIs. Instead of writing a lot of health checks for each and every failure mode of things you can't really influence, you write a bare minimum, rely on a watchdog to reboot the whole unit, and hope it recovers (see the sketch at the end of this comment). This works very well in practice.

The list goes on. Instead of a very complicated all-in-one device, you have lightweight code with a good, predictable recovery mechanism. It is cheaper, and eventually even more reliable than the overengineered alternative. Another example: network failure. An overengineered device will do a lot of smart things trying to recover the network - re-initialize things, retry with different waits (and there will be a lot of retries) - and may eventually get stuck without access. A lightweight device does a short, simple wait with a simple retry, or several, and then reboots. Statistically this is better than running some super-complicated recovery code, if the device is engineered to reboot from the start.
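
A minimal sketch of that "bare-minimum health check plus reboot" pattern (hypothetical Python; the controller address, intervals, and reboot command are all invented):

    import subprocess
    import time

    CHECK_INTERVAL_S = 10   # how often to poke the third-party controller
    MAX_FAILURES = 3        # consecutive failures before giving up and rebooting

    def controller_alive() -> bool:
        """Bare-minimum health check: can we reach the (hypothetical) controller at all?
        We deliberately don't enumerate its individual failure modes."""
        try:
            subprocess.run(["ping", "-c", "1", "-W", "1", "192.0.2.10"],
                           check=True, capture_output=True, timeout=5)
            return True
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
            return False

    def reboot_whole_unit() -> None:
        """Crash-first recovery: don't try to repair anything in place; reboot and
        re-initialize everything from a freshly received config."""
        subprocess.run(["reboot"])

    def main() -> None:
        failures = 0
        while True:
            failures = 0 if controller_alive() else failures + 1
            if failures >= MAX_FAILURES:
                reboot_whole_unit()
            time.sleep(CHECK_INTERVAL_S)

    if __name__ == "__main__":
        main()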

Yizahi | 2 years ago

I am utterly in awe every time there's a breakthrough in durable random access memory, where someone (often the author) is so very excited about how this will mean we will never have to power off our computers again. Flip a switch and it's instantly on!

Have you lost your ever-loving mind? Of course you're going to reboot. Bit rot improvements have only ever been incremental. To a first order approximation we've expanded from a day or two to a couple of weeks/months for general purpose computers (I argue that servers are not GP, and so their spectacular uptimes don't translate).

In thirty years, that's barely more than an order of magnitude improvement. You're gonna need a couple more orders than that before taking away the reboot option sounds anything like sanity. If you can keep a machine from shitting the bed for 2-3 times longer than the expected life of the hardware, then we can talk, but I'm not making any promises. Until then I'm gonna reboot this thing sometimes even if it's just placebo.

hinkley | 2 years ago

"There seem to be certain analogies here between computing systems and biological ones. Your body is composed of trillions of compartmentalized cells, most of which are programmed to die after a while, partly because this prevents their DNA from accumulating enough mutations to start misbehaving in serious ways. Our body even sends its own agents to destroy misbehaving cells that have neglected to destroy themselves; sometimes you just gotta kill dash nine."

I think that's also reminiscent of Alan Kay's philosophy behind OO in its most original form, and probably most closely realized in Erlang:

"I thought of objects being like biological cells and/or individual computers on a network, only able to communicate with messages (so messaging came at the very beginning -- it took a while to see how to do messaging in a programming language efficiently enough to be useful). - I wanted to get rid of data. The B5000 almost did this via its almost unbelievable HW architecture. I realized that the cell/whole-computer metaphor would get rid of data[..]"

I wonder why so many of the most popular programming languages went in the opposite, very state- and lock-based direction, given the strong theoretical foundation computing had for systems that looked much more robust.

http://userpage.fu-berlin.de/~ram/pub/pub_jf47ht81Ht/doc_kay...
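
For illustration only, a tiny Python sketch (not Kay's or Erlang's actual model) of "objects as cells that communicate only via messages", with no data reachable from the outside:

    import queue
    import threading

    class Cell:
        """A toy 'object as a cell': private state, reachable only through messages."""
        def __init__(self):
            self._inbox: queue.Queue = queue.Queue()
            self._count = 0                      # private state; no outside access
            threading.Thread(target=self._run, daemon=True).start()

        def send(self, message) -> None:
            self._inbox.put(message)             # the only way in

        def _run(self) -> None:
            while True:
                message = self._inbox.get()
                if message == "increment":
                    self._count += 1
                elif isinstance(message, tuple) and message[0] == "report":
                    reply_to = message[1]
                    reply_to.put(self._count)    # even replies travel as messages

    # Usage: no shared data, only messages.
    cell = Cell()
    cell.send("increment")
    cell.send("increment")
    reply: queue.Queue = queue.Queue()
    cell.send(("report", reply))
    print(reply.get())                           # -> 2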

Barrin92 | 2 years ago

Back in the day, when I was cutting my teeth on embedded systems, I read an Intel Application Note - probably for the 8051. The section about managing watchdog timers stated that it is often good practice to deliberately let the watchdog time out periodically, at a convenient time. Then the system would be reset to a predictable state regularly, and failure states leading to unplanned watchdog timeouts would occur less frequently.

To this day, all my systems which are intended for prolonged unattended operation reset themselves at least once a day.
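
A sketch of that advice on a Linux-style system (assuming the standard /dev/watchdog interface; the intervals and maintenance window are invented): keep petting the hardware watchdog, except once a day at a convenient time, so the timer expires and resets the unit to a predictable state.

    import time
    from datetime import datetime

    PET_INTERVAL_S = 10          # must be well under the hardware timeout
    MAINTENANCE_HOUR = 3         # 03:00 local time: let the watchdog bite on purpose

    def in_maintenance_window() -> bool:
        now = datetime.now()
        return now.hour == MAINTENANCE_HOUR and now.minute == 0

    def main() -> None:
        # On Linux, writing any byte to /dev/watchdog arms/pets the hardware timer;
        # if no write arrives before the timeout, the board resets itself.
        with open("/dev/watchdog", "wb", buffering=0) as wdt:
            while True:
                if not in_maintenance_window():
                    wdt.write(b"\0")   # pet the dog
                # else: deliberately skip the pet; the hardware resets us
                time.sleep(PET_INTERVAL_S)

    if __name__ == "__main__":
        main()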

deepspace | 2 years ago

Opposite of the robustness principle:

https://en.wikipedia.org/wiki/Robustness_principle

It's funny, I spent some time developing tools for CPU architects. Neither the concept in the OP's anecdote nor the above principle really applies, because logic doesn't break the same way source code breaks. You don't run a program in HDL; you synthesize a logic flow. One could conceivably test all possible combinations of that logic for errors, but that is 2^N combinations, where N is the number of inputs plus the number of state elements. Since the space is far too huge to test exhaustively (excluding hierarchical designs and emulation), you generate targeted test patterns (and many, many mutexes) to pare down the space, and perhaps randomize some of the targeted tests to verify you don't execute out of bounds. And even "out of bounds" is defined by however smart the microarchitect was when they wrote the spec, and that can be wrong too.

The only way to find and fix these bugs is to run trillions of test vectors (called "coverage") and hope you've passed all the known bugs, and not found any new ones.

There are four decades of papers written on hardware validation, so I'm barely scratching the surface, but I think it's a very different perspective compared to how programmers approach the world. I think a lot of the bugs that OP is talking about fall into the hardware logic domain. There isn't really a fallback "throw" or "return status" that you can even check for. Just fault handlers (for the most part).
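
As a toy illustration of the targeted/constrained-random flavor of this (pure Python, everything invented, with a trivial function standing in for real logic): draw vectors from a biased distribution, check them against a golden model, and track how much of the 2^N space you have actually covered.

    import random

    N_INPUTS = 16                      # the full space would be 2**16 combinations

    def reference_model(vector: int) -> int:
        """Golden model from the spec (here: a stand-in parity function)."""
        return bin(vector).count("1") % 2

    def device_under_test(vector: int) -> int:
        """In reality this would be the synthesized logic running in a simulator."""
        return bin(vector).count("1") % 2

    def constrained_random_vector() -> int:
        # Constraint: bias toward corner cases the microarchitect worried about.
        corners = [0, 1, (1 << N_INPUTS) - 1]
        return random.choice(corners) if random.random() < 0.1 else random.getrandbits(N_INPUTS)

    covered: set[int] = set()
    for _ in range(100_000):           # "coverage" run: many vectors, but far from all of them
        v = constrained_random_vector()
        assert device_under_test(v) == reference_model(v), f"mismatch on {v:#06x}"
        covered.add(v)

    print(f"covered {len(covered)} of {2**N_INPUTS} input combinations")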

sbf501 | 2 years ago

> A novice was trying to fix a broken Lisp machine by turning the power off and on.

> Knight, seeing what the student was doing, spoke sternly: “You cannot fix a machine by just power-cycling it with no understanding of what is going wrong.”

> Knight turned the machine off and on.

> The machine worked.

from http://www.catb.org/jargon/html/koans.html#id3141171

sodiumjoe | 2 years ago

I also found it interesting that the Recovery-Oriented Computing project at UC Berkeley/Stanford lists "design for restartability" as a research area. http://roc.cs.berkeley.edu/roc_overview.html

surfmike | 2 years ago

Considering every animal sleeps on a daily basis, I would say “turning it on and off again” pertains to maintaining the stability of complex state-based systems.

nulluint | 2 years ago

On a practical note, testing a system's ability to start up is usually much faster and easier than testing its ability to work properly for long periods of time. Real-world long-term behavior is unpredictable and somewhat chaotic, but there are only so many things you can do straight out of a reboot.

AdamH12113 | 2 years ago

Will we never be done with this “unreasonable effectiveness” cliche - or am I missing something?

germinalphrase | 2 years ago

The winning strategy of Mike’s parable is also known as offensive programming: https://en.m.wikipedia.org/wiki/Offensive_programming

layer8 | 2 years ago

The missing link is re-initialization or re-validation of the expected state.

Offhand, I remember some discussion of how old dialup-friendly multiplayer games would transfer state. Differential state would be transferred. There might or might not be a checksum. There would be global state refreshes (either periodically or as bandwidth allowed).

The global state refreshes are a different form of re-initialization: the current state is discarded in favor of a canonically approved state.
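
A toy sketch of that pattern (names and data invented): apply small diffs as they arrive, and let a periodic authoritative snapshot replace whatever the client has accumulated, wiping out any drift.

    from typing import Any

    def apply_diff(state: dict[str, Any], diff: dict[str, Any]) -> None:
        """Differential update: only changed keys are transmitted and applied."""
        state.update(diff)

    def apply_full_refresh(state: dict[str, Any], snapshot: dict[str, Any]) -> None:
        """Periodic re-initialization: discard local state in favor of the
        canonical snapshot, silently correcting any accumulated drift."""
        state.clear()
        state.update(snapshot)

    # Example: the client's state has drifted (lost packet, logic bug, ...).
    client = {"x": 10, "y": 20, "hp": 95}                       # server thinks hp == 100
    apply_diff(client, {"x": 11})                               # cheap incremental update
    apply_full_refresh(client, {"x": 11, "y": 20, "hp": 100})   # global refresh fixes hp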

mjevans | 2 years ago

Well yeah, all modern computers and software are nondeterministic state machines. The order and set of operations is nondeterministic, and the inputs, outputs and algorithms constantly mutate. As there is no function within the computer to automatically rewind its tree of operations to before it encountered a conflict, an unbounded number of state changes inevitably leads to entropy, and thus all running systems [with state changes] crash.

Since the computer is a state machine, and all initial state is read from disk, restarting the computer reinitializes the state machine from initial state. Rebooting resets the state machine to before state mutation entropy led to conflicts. Rebooting is state machine time travel.

We could prevent this situation by designing systems whose state and algorithms don't mutate. But that would require changing software development culture, and that's a much harder problem than rebooting.

0xbadcafebee | 2 years ago

Anyone old enough to remember The IT Crowd? Here's my favorite quote from it: https://www.cipher-it.co.uk/wp-content/uploads/2017/11/ITCro...

DeathArrow | 2 years ago

I once deleted a 15TB database by temporarily shutting off an AWS i3en instance.

Nobody told me that i3en instances have "ephemeral storage" that gets wiped on shutdown.

xthrowawayxx | 2 years ago

I once heard someone use the term “therapeutic reboot” for when computers get into a bad operating state. It really stuck with me.

logbiscuitswave | 2 years ago

State management!

Basically, how can you know all the variables at one time? Start from initialization.

IMO the fact that everything "works" after reboots is simply because the state is 1. well defined and 2. available to all developers.

omgJustTest | 2 years ago

Is there a citable source for "The parable of Mike (Michael Burrows) and the login shell"?

I do not think that this whole discussion about returning to a "known good initial state" has much merit, because computers are Turing machines: they modify their own state and then act upon that state. Which means that unless your boot drive is read-only, nobody can predict what your computer will do on the next reboot. (Unless someone solved The Halting Problem and I missed it.)

But the discussion about assertions and initially fragile software that quickly becomes robust did strike a chord with me, because that's exactly what I have been doing for decades now. (Also, I see other programmers use assertions so rarely that I feel kind of lonely in this.) I have a long post about assertions (https://blog.michael.gr/2014/09/assertions-and-testing.html) to which I would like to add a paragraph or two about this aspect, and preferably cite Burrows.

mikenakis | 2 years ago

A possible analogue of this for build/packaging systems is the Dockerfile used to build an infrastructure container. No wonder: it acts as a reliable state-initialization procedure.

vivegi | 2 years ago

So what's everyone's take on restarting servers every now and then?

I think it might be something you do occasionally (say, once a month or so) for each of your boxes, regardless of whether it's an HA system or not.

Though perhaps I can only say that because the HA systems/components (like API nodes or nodes serving files) won't really see much downtime due to clustering, whereas the non-HA systems/components (monoliths, databases, load balancers) that I've worked with have been small enough for it to be okay to do restarts in the middle of the night (or just weekends or whatever) with very little impact on anyone, even without clustering or switchovers at the DB level.

I mean, surely you do need to update the servers themselves, set aside a bit of time to check that everything will be okay, and avoid trying to patch a running kernel most of the time, right? I'd personally also want to restart even hypervisor or container host nodes occasionally to avoid weirdness (I've had Docker networks stop working after lots of uptime) and to make sure that everything actually starts back up successfully after a restart.

Then again, most of this probably doesn't apply to the PaaS folk out there, or the serverless crowd.

KronisLV | 2 years ago

The first law of tech support. The second law is to check (unplug and re-plug) all the cables.

tingletech | 2 years ago

> Is this the best that anyone can do?

No. It's the best that can sometimes be done quickly.

Additionally, this doesn't mention the value of a good postmortem. Or the horrors of cloud computing, where restarting things is deemed a good enough mitigation because in the cloud these things happen, and nobody pushes for a good postmortem and repair items.

RajT88 | 2 years ago

One of my first jobs was working on DAB digital radios. Our workhorse low-end model was quite flaky, but it had a watchdog timer that would reboot it if it hung; and it was so simple that it would get back to acquiring a signal in a fraction of a second.

So the upshot was, a system crash just meant the audio would cut out for maybe 6 seconds, then start playing again. For the user, completely indistinguishable from poor radio reception, which is always going to happen from time to time anyway. Crashing wasn’t a problem at all as long as it was rare enough.

(I like making robust software and hate crashes, so I don’t like that approach, but I use that experience to remind myself that sometimes worse really is better, or at least good enough.)

iainmerrick | 2 years ago

At the application level, this means restarting the application.

Even when no unexpected exception occurs, periodically restarting the application can be a valid solution - for example, when there are potential memory leaks (maybe in your code, but maybe also in external library code) and the application is supposed to run indefinitely.

Btw, a simple method to restart an application: https://stackoverflow.com/questions/72335904/simple-way-to-r...
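
One common way to do this in Python (a sketch, not necessarily the method in the linked answer; the memory threshold is invented) is to re-exec the interpreter on top of the current process so it starts over from its initial state:

    import os
    import resource   # POSIX-only; ru_maxrss is reported in kilobytes on Linux
    import sys

    def restart_program() -> None:
        """Replace the current process with a fresh copy of itself; leaked memory
        and any other accumulated state are discarded."""
        os.execv(sys.executable, [sys.executable] + sys.argv)

    def maybe_restart(max_rss_kb: int = 2_000_000) -> None:
        """Restart once resident memory crosses a (made-up) threshold."""
        if resource.getrusage(resource.RUSAGE_SELF).ru_maxrss > max_rss_kb:
            restart_program()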

albertzeyer | 2 years ago

As it is now common to shut down subsystems for reasons of power optimization, this sort of approach may be becoming an accidentally autonomous architecture.

contingencies | 2 years ago

There are 2 hard problems in IT:

    0. Cache invalidation
    1. Naming things
    2. Off-by-one errors
Index 0 is why reboots are so effective.

usrbinbash | 2 years ago

I find even leaving them off pleasing. Don't run computers 24/7; your brain doesn't either.

The smaller the scale (the number of people affected), the more sense business hours make - or times of non-operation. Who cares if a personal/small-business webserver sleeps for a few hours at night?

mro_name | 2 years ago

Back in the 90s I competed with my friends over who had the longest uptime. Most guys I met reset their computers multiple times a day to improve performance. I never turn off my Mac; it takes a software update or something like a general crash.

meerita | 2 years ago

I don't think it's that unreasonable for a complex system with complex states to reboot. I imagine (I'm not a medical professional) that it works in roughly the same way as when human beings black out; reboot.

fennecfoxy | 2 years ago

"Before a computing system starts running, it’s in a fixed initial state. At startup, it executes its initialization sequence, which transitions the system from the initial state to a useful working state:"

LOL etc

gerdesj | 2 years ago

Before HN, I didn’t know so many things are unreasonably effective.

nesarkvechnep | 2 years ago

Works great for bad OS’s like Windows

TedShiller | 2 years ago