Backpropagation is a leaky abstraction (2016)

swatson741 | 349 points

Karpathy's contribution to teaching around deep learning is just immense. He's got a mountain of fantastic material, from short articles like this to longer writing like https://karpathy.github.io/2015/05/21/rnn-effectiveness/ (on recurrent neural networks) and all of his stuff on YouTube.

Plus his GitHub. The recently released nanochat https://github.com/karpathy/nanochat is fantastic. Having minimal, understandable and complete examples like that is invaluable for anyone who really wants to understand this stuff.

gchadwick | 3 days ago

It's a nitpick, but backpropagation is getting a bad rap here. These examples are about gradients + gradient descent variants being a leaky abstraction for optimization [1].

Backpropagation is a specific algorithm for computing gradients of composite functions, but even the failures that do come from composition (multiple sequential sigmoids causing exponential gradient decay) are not backpropagation-specific: that's just how the gradients behave for that function, whatever algorithm you use. The remedy of having people calculate their own backward pass is useful because they are _calculating their own derivatives_ for the functions and get a chance to notice the exponents creeping in. Ask me how I know ;)
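
To make that concrete, here is a small NumPy sketch (my own illustration, not from the article) of the chain rule through stacked sigmoids; each local derivative is at most 0.25, so the overall gradient decays exponentially with depth no matter how it is computed:

  import numpy as np

  def sigmoid(x):
      return 1.0 / (1.0 + np.exp(-x))

  # Chain n sigmoids and accumulate dy/dx by the chain rule.
  # Each factor sigmoid'(z) = s * (1 - s) is at most 0.25,
  # so the product shrinks exponentially with depth.
  z, grad = 0.5, 1.0
  for depth in range(1, 11):
      s = sigmoid(z)
      grad *= s * (1.0 - s)   # local derivative of this sigmoid
      z = s
      print(f"depth {depth:2d}: dy/dx ~ {grad:.3e}")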

[1] Gradients being zero would not be a problem with a global optimization algorithm (which we don't use because they are impractical in high dimensions). Gradients getting very small might be dealt with by tools like line search (if they are small in all directions) or approximate Newton methods (if small in some directions but not others). Not saying those are better solutions in this context, just that optimization (+ modeling) is the actually hard part, not the way gradients are calculated.
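
For anyone curious, a backtracking line search is only a few lines; this is a generic textbook sketch of my own (constants are made up), not something from the article:

  import numpy as np

  def backtracking_line_search(f, x, g, alpha=1.0, beta=0.5, c=1e-4):
      # Shrink the step until the Armijo sufficient-decrease condition
      # holds; g is the gradient at x and -g the search direction.
      fx = f(x)
      while f(x - alpha * g) > fx - c * alpha * np.dot(g, g):
          alpha *= beta
      return alpha

  # Toy quadratic: even with a tiny gradient, the search picks a sensible
  # step length instead of relying on a fixed learning rate.
  f = lambda w: 0.5 * np.dot(w, w)
  x = np.array([1e-3, -2e-3])
  print(backtracking_line_search(f, x, x))   # gradient of f at x is x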

nirinor | 3 days ago

I took a course in my Master's (URV.cat) where we had to do exactly this: implementing backpropagation (forward and backward passes) from a paper explaining it, using just basic math operations in a language of our choice.

I told everyone this was the best single exercise of the whole year for me. It aligns with the kind of activity that I benefit from immensely but won't do by myself, so this push was just perfect.

If you are teaching, please consider this kind of assignment.

P.S. Just checked now and it's still in the syllabus :)

joaquincabezas | 3 days ago

I have a naive question about backprop and optimizers.

I understand how SGD is just taking a step proportional to the gradient and how backprop computes the partial derivative of the loss function with respect to each model weight.

But with more advanced optimizers the gradient is not really used directly: it gets per-weight normalization, is fudged with momentum, clipped, etc.

So really, how important is computing the exact gradient using calculus, vs just knowing the general direction to step? Would that be cheaper to calculate than full derivatives?
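
To illustrate the question, here is a rough Adam-style update (my own sketch, not any framework's exact implementation): the raw gradient is smoothed and normalized per weight, so mostly its sign and relative scale survive in the step.

  import numpy as np

  def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
      # The raw gradient g is smoothed (momentum) and normalized per
      # weight, so mostly its direction/sign survives in the update.
      m = b1 * m + (1 - b1) * g           # running mean of gradients
      v = b2 * v + (1 - b2) * g * g       # running mean of squared gradients
      m_hat = m / (1 - b1 ** t)           # bias correction
      v_hat = v / (1 - b2 ** t)
      w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
      return w, m, v

  # Two gradient components differing by 1000x produce near-identical
  # step sizes, which is why "only the general direction" feels true.
  w, m, v = np.zeros(2), np.zeros(2), np.zeros(2)
  w, m, v = adam_step(w, np.array([1e-3, 1.0]), m, v, t=1)
  print(w)   # both coordinates move by roughly -lr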

drivebyhooting | 3 days ago

It seems to me that in 2016 people did (or had to) play a lot more tricks with backpropagation than today. Back then it was common to meddle with the gradients in the middle of gradient propagation.

For example, Alex Graves's (great! with attention) 2013 paper "Generating Sequences With Recurrent Neural Networks" has this line:

> One difficulty when training LSTM with the full gradient is that the derivatives sometimes become excessively large, leading to numerical problems. To prevent this, all the experiments in this paper clipped the derivative of the loss with respect to the network inputs to the LSTM layers (before the sigmoid and tanh functions are applied) to lie within a predefined range.

with this footnote:

> In fact this technique was used in all my previous papers on LSTM, and in my publicly available LSTM code, but I forgot to mention it anywhere—mea culpa.
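
In code, the trick Graves describes amounts to something like this (a hedged sketch of my own; the ±10 range is illustrative, not necessarily the paper's value):

  import numpy as np

  def clip_derivative(g, lo=-10.0, hi=10.0):
      # Element-wise clipping of a derivative to a predefined range,
      # applied mid-backprop to the relevant intermediate gradients.
      return np.clip(g, lo, hi)

  print(clip_derivative(np.array([0.3, -250.0, 42.0])))   # [  0.3 -10.   10. ]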

That said, backpropagation seems important enough to me that I once did a specialized video course just about PyTorch (1.x) autograd.

t-vi | 2 days ago

The original title is "Yes you should understand backprop" - which is good and descriptive.

stared | 3 days ago

This comment:

> “Why do we have to write the backward pass when frameworks in the real world, such as TensorFlow, compute them for you automatically?”

worries me because it is structured with the same reasoning as "why do we have to demonstrate we understand addition if in the real world we have calculators"

sebastianconcpt | 3 days ago

I have to be contrarian here. The students were right. You didn't need to learn to implement backprop in NumPy. Any leakiness in backprop is addressed by researchers who introduce new optimizers. As a developer, you just pick the best one and find good hparams for it.

jamesblonde | 3 days ago

I wonder if, in the long term, with compute being cheap and parameter volumes being the constraint, it will make sense to train models to be robust to different activation functions that look like ReLU (i.e. Swish, GELU, etc.).

You might even be able to do an ugly version of this (akin to dropout) where you swap activation functions (with adjusted scaling factors so they mostly yield similar output shapes to ReLU for most inputs) randomly during training. The point is we mostly know what a ReLU-like activation function is supposed to do, so why should we care about the edge cases of the analytical limits of any specific one?

The advantage would be that you’d probably get useful gradients out of one of them (for training), and could swap to the computationally cheapest one during inference.
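
A very rough sketch of what that could look like (purely speculative; the activation set and the pin-to-ReLU-at-inference choice are my own assumptions):

  import numpy as np

  rng = np.random.default_rng(0)

  def relu(x):  return np.maximum(0.0, x)
  def swish(x): return x / (1.0 + np.exp(-x))   # x * sigmoid(x)
  def gelu(x):  return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

  ACTIVATIONS = [relu, swish, gelu]

  def hidden_layer(x, W, training=True):
      # During training a ReLU-like activation is drawn at random,
      # akin to dropout; at inference, pin the cheapest one (ReLU).
      act = ACTIVATIONS[rng.integers(len(ACTIVATIONS))] if training else relu
      return act(x @ W)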

rao-v | 2 days ago

Karpathy is butchering the metaphor. There is no abstraction here. Backprop is an algorithm. Automatic differentiation is a technique. Neither promises to hide anything.

I agree that understanding them is useful, but they are not abstractions, much less leaky abstractions.

xpe | 3 days ago

When I first started learning deep learning, I only had a vague idea of how backprop worked. It wasn't until I forced myself to implement it from scratch that I realized it was not magic after all. The process was painful, but it gave me much more confidence when debugging models or trying to figure out where the loss was getting stuck. I would really recommend everyone in deep learning try writing it out by hand at least once.
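
For anyone on the fence, the whole exercise is smaller than it sounds; here is a minimal sketch (my own, not the commenter's code) of a two-layer net with a hand-written backward pass:

  import numpy as np

  # Forward: h = tanh(x @ W1), y = h @ W2, loss = mean((y - t)^2).
  rng = np.random.default_rng(0)
  x, t = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))
  W1, W2 = 0.1 * rng.normal(size=(3, 5)), 0.1 * rng.normal(size=(5, 1))

  for step in range(200):
      h = np.tanh(x @ W1)                # forward pass
      y = h @ W2
      loss = np.mean((y - t) ** 2)
      dy = 2 * (y - t) / y.size          # backward pass, chain rule by hand
      dW2 = h.T @ dy
      dh = dy @ W2.T
      dW1 = x.T @ (dh * (1 - h ** 2))    # tanh'(z) = 1 - tanh(z)^2
      W1 -= 0.1 * dW1                    # plain SGD update
      W2 -= 0.1 * dW2

  print(f"final loss: {loss:.4f}")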

Huxley1 | 2 days ago

More generally, it's often worth learning and understanding things one step deeper. Having a more fundamental understanding of things explains more of the "why" behind why some things are the way they are, or why we do some things a certain way. There's probably a cutoff point for balancing how much you actually need to know though. You could potentially take things a step further by writing the backwards pass without using matrix multiplication, or spend some time understanding what the numerical value of a gradient means.
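
On the last point, a numerical gradient check is a nice way to see what the value means (a generic sketch, not tied to any framework): each entry just answers how much the loss moves if you nudge that one parameter a little.

  import numpy as np

  def numerical_grad(f, x, eps=1e-5):
      # Central finite differences, one parameter at a time.
      g = np.zeros_like(x)
      for i in range(x.size):
          x_plus, x_minus = x.copy(), x.copy()
          x_plus.flat[i] += eps
          x_minus.flat[i] -= eps
          g.flat[i] = (f(x_plus) - f(x_minus)) / (2 * eps)
      return g

  f = lambda w: np.sum(w ** 2)           # analytic gradient is 2w
  w = np.array([1.0, -3.0, 0.5])
  print(numerical_grad(f, w), 2 * w)     # the two should agree closely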

alyxya | 3 days ago

Karpathy suggests the following error function:

  def clipped_error(x):
    return tf.select(tf.abs(x) < 1.0,
                     0.5 * tf.square(x),
                     tf.abs(x) - 0.5)  # condition, true, false
Following the same principles that he outlines in this post, the "- 0.5" part is unnecessary: since the gradient of the constant 0.5 is 0, the "- 0.5" doesn't change the backpropagated gradient. In addition, a nicer formula that achieves the same goal as the above is √(x² + 1).
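
To make the gradient argument concrete, here is a quick check in plain NumPy (my own sketch, rather than the old tf.select API): the constant never shows up in the derivative, and √(x² + 1) has essentially the same slope profile.

  import numpy as np

  x = np.linspace(-3, 3, 7)

  # Derivative of the clipped error: x inside |x| < 1, sign(x) outside;
  # the "- 0.5" constant never appears here, which is the point above.
  d_clipped = np.where(np.abs(x) < 1.0, x, np.sign(x))

  # Derivative of sqrt(x^2 + 1): x / sqrt(x^2 + 1), which also behaves
  # like x near 0 and like sign(x) for large |x|.
  d_smooth = x / np.sqrt(x ** 2 + 1.0)

  print(np.round(d_clipped, 2))
  print(np.round(d_smooth, 2))
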
WithinReason | 3 days ago

Karpathy's work on large datasets for deep neural flow is conceiving of the "backward pass" as the preparation for initializing the mechanics for weight ranges, either as derivatives in -10/+10 statistic deviations.

away74etcie | 3 days ago

... (2016)

9 years ago, 365 points, 101 comments

https://news.ycombinator.com/item?id=13215590

emil-lp | 3 days ago

I feel like my learning curve for AI is:

1) Learn backprop, etc, basic math

2) Learn more advanced things, CNNs, LMM, NMF, PCA, etc

3) Publish a paper or poster

4) Forget basics

5) Relearn that backprop is a thing

repeat.

Some day I need to get my education together.

mirawelner | 2 days ago

Do LLMs still use backprop?

brcmthrowaway | 3 days ago

Are dead ReLUs still a problem today? Why not?

raindear | 2 days ago

Off-topic, but does anybody know what's going on with EurekaLabs? It's been a while since the announcement.

joaquincabezas | 3 days ago

Given that we're now in the year 2025 and AI has become ubiquitous, I'd be curious to estimate what percentage of developers now actually understand backprop.

It's a bit snarky of me, but whenever I see some web developer or product person with a strong opinion about AI and its future, I like to ask "but can you at least tell me how gradient descent works?"
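
For reference, the core idea fits in a few lines (a toy sketch of my own, not any particular library's implementation):

  # Gradient descent on f(w) = (w - 3)^2: repeatedly step against the slope.
  w, lr = 0.0, 0.1
  for _ in range(50):
      grad = 2 * (w - 3)   # f'(w)
      w -= lr * grad
  print(w)                 # converges toward 3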

I'd like to see a future where more developers have a basic understanding of ML even if they never go on to do much of it. I think we would all benefit from being a bit more ML-literate.

joshdavham | 3 days ago

I was happy to see Karpathy writing a new blog post instead of just Twitter threads, but when I opened the link I was disappointed to realize it's from 9 years ago…

I really hate what Twitter did to blogging…

littlestymaar | 3 days ago

Side note: why are people still using Medium?

phplovesong | 3 days ago