For those who are not super confident with multivariate derivatives, two great articles:
Calculus on Computational Graphs: Backpropagation
Backpropagation for a Linear Layer, by Justin Johnson
Each gate has two methods: forward and backward (in the circuit diagrams: green → numerical result of the forward pass, red → local gradient).
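A minimal sketch of the gate idea (my own illustration, not from any particular framework): a multiply gate caches its inputs during forward and uses them as the local gradients in backward. It needs no knowledge of the rest of the circuit.

```python
class MultiplyGate:
    """z = x * y. Knows nothing about the full circuit it is embedded in."""

    def forward(self, x, y):
        # Cache the inputs: each one is the local gradient of the other.
        self.x, self.y = x, y
        return x * y

    def backward(self, dz):
        # Chain rule: scale the upstream gradient dz by the local gradients.
        dx = self.y * dz   # dz/dx = y
        dy = self.x * dz   # dz/dy = x
        return dx, dy

gate = MultiplyGate()
z = gate.forward(3.0, -4.0)    # z = -12.0
dx, dy = gate.backward(2.0)    # dx = -8.0, dy = 6.0
```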
Notice that the gates can do this completely independently without being aware of any of the details of the full circuit that they are embedded in.
It is very important to stress that if you were to launch into performing the differentiation with respect to either x or y, you would end up with very large and complex expressions. However, it turns out that doing so is completely unnecessary because we don’t need to have an explicit function written down that evaluates the gradient. We only have to know how to compute it.
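To make that concrete, here is a small staged example of my own: the gradient of f(x, y) = sigmoid(x·y + x) is computed by chaining local gradients through intermediate variables, without ever writing the full derivative expression down, then sanity-checked against a numerical (centered-difference) estimate.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

# Forward pass, in stages:
x, y = 2.0, -3.0
a = x * y          # stage 1
b = a + x          # stage 2
f = sigmoid(b)     # stage 3

# Backward pass: chain local gradients stage by stage.
df = 1.0
db = f * (1.0 - f) * df   # d sigmoid(b)/db = f * (1 - f)
da = 1.0 * db             # b = a + x, so db/da = 1
dx = y * da + 1.0 * db    # x feeds two stages, so its gradients add
dy = x * da

# Numerical sanity check for dx:
h = 1e-5
num_dx = (sigmoid((x + h) * y + (x + h)) -
          sigmoid((x - h) * y + (x - h))) / (2 * h)
```

The point is that `dx` falls out of a handful of one-line local rules; no one ever derived d/dx sigmoid(x·y + x) symbolically.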
Cache forward-pass variables so they can be reused in the backward pass.
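For example (a sketch, not from the source), a sigmoid gate only needs to cache its own output: the local gradient σ'(t) = σ(t)·(1 − σ(t)) is expressible entirely in terms of the cached forward value.

```python
import math

class SigmoidGate:
    def forward(self, t):
        self.out = 1.0 / (1.0 + math.exp(-t))  # cached for the backward pass
        return self.out

    def backward(self, dout):
        # Local gradient reuses the cached output: sigma' = sigma * (1 - sigma).
        return self.out * (1.0 - self.out) * dout

g = SigmoidGate()
s = g.forward(0.0)      # s = 0.5
ds = g.backward(1.0)    # 0.5 * 0.5 = 0.25
```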
Reflection, “Isn’t This Trivial?”: backpropagation is just a formalized way to compute partial derivatives, and basically the only analytic way to do so.
When I first understood what backpropagation was, my reaction was: “Oh, that’s just the chain rule! How did it take us so long to figure out?” I’m not the only one who’s had that reaction. It’s true that if you ask “is there a smart way to calculate derivatives in feedforward neural networks?” the answer isn’t that difficult.
But I think it was much more difficult than it might seem. You see, at the time backpropagation was invented, people weren’t very focused on the feedforward neural networks that we study. It also wasn’t obvious that derivatives were the right way to train them. Those are only obvious once you realize you can quickly calculate derivatives. There was a circular dependency.
Worse, it would be very easy to write off any piece of the circular dependency as impossible on casual thought. Training neural networks with derivatives? Surely you’d just get stuck in local minima. And obviously it would be expensive to compute all those derivatives. It’s only because we know this approach works that we don’t immediately start listing reasons it’s likely not to.
That’s the benefit of hindsight. Once you’ve framed the question, the hardest work is already done.