For those who are not super confident with multivariate derivatives, two great articles:
Calculus on Computational Graphs: Backpropagation
Backpropagation for a Linear Layer, by Justin Johnson
Each gate has two methods: forward and backward (in the circuit diagrams: green → numerical result of the forward pass, red → local gradient).
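A minimal sketch of the gate idea (my own illustration, not from any particular framework): a multiply gate caches its inputs during forward and uses them as the local gradients in backward. It needs no knowledge of the rest of the circuit.

```python
class MultiplyGate:
    """z = x * y. Knows nothing about the full circuit it is embedded in."""

    def forward(self, x, y):
        # Cache the inputs: each one is the local gradient of the other.
        self.x, self.y = x, y
        return x * y

    def backward(self, dz):
        # Chain rule: scale the upstream gradient dz by the local gradients.
        dx = self.y * dz   # dz/dx = y
        dy = self.x * dz   # dz/dy = x
        return dx, dy

gate = MultiplyGate()
z = gate.forward(3.0, -4.0)    # z = -12.0
dx, dy = gate.backward(2.0)    # dx = -8.0, dy = 6.0
```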
Notice that the gates can do this completely independently without being aware of any of the details of the full circuit that they are embedded in.
It is very important to stress that if you were to launch into performing the differentiation with respect to either x or y, you would end up with very large and complex expressions. However, it turns out that doing so is completely unnecessary because we don’t need to have an explicit function written down that evaluates the gradient. We only have to know how to compute it.
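To make that concrete, here is a small staged example of my own: the gradient of f(x, y) = sigmoid(x·y + x) is computed by chaining local gradients through intermediate variables, without ever writing the full derivative expression down, then sanity-checked against a numerical (centered-difference) estimate.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

# Forward pass, in stages:
x, y = 2.0, -3.0
a = x * y          # stage 1
b = a + x          # stage 2
f = sigmoid(b)     # stage 3

# Backward pass: chain local gradients stage by stage.
df = 1.0
db = f * (1.0 - f) * df   # d sigmoid(b)/db = f * (1 - f)
da = 1.0 * db             # b = a + x, so db/da = 1
dx = y * da + 1.0 * db    # x feeds two stages, so its gradients add
dy = x * da

# Numerical sanity check for dx:
h = 1e-5
num_dx = (sigmoid((x + h) * y + (x + h)) -
          sigmoid((x - h) * y + (x - h))) / (2 * h)
```

The point is that `dx` falls out of a handful of one-line local rules; no one ever derived d/dx sigmoid(x·y + x) symbolically.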
Cache forward-pass variables so they can be reused in the backward pass.
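For example (a sketch, not from the source), a sigmoid gate only needs to cache its own output: the local gradient σ'(t) = σ(t)·(1 − σ(t)) is expressible entirely in terms of the cached forward value.

```python
import math

class SigmoidGate:
    def forward(self, t):
        self.out = 1.0 / (1.0 + math.exp(-t))  # cached for the backward pass
        return self.out

    def backward(self, dout):
        # Local gradient reuses the cached output: sigma' = sigma * (1 - sigma).
        return self.out * (1.0 - self.out) * dout

g = SigmoidGate()
s = g.forward(0.0)      # s = 0.5
ds = g.backward(1.0)    # 0.5 * 0.5 = 0.25
```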
Reflection, “Isn’t This Trivial?”: backpropagation is just a formalized way to compute partial derivatives, and basically the only analytic way to do so.
When I first understood what backpropagation was, my reaction was: “Oh, that’s just the chain rule! How did it take us so long to figure out?” I’m not the only one who’s had that reaction. It’s true that if you ask “is there a smart way to calculate derivatives in feedforward neural networks?” the answer isn’t that difficult.
But I think it was much more difficult than it might seem. You see, at the time backpropagation was invented, people weren’t very focused on the feedforward neural networks that we study. It also wasn’t obvious that derivatives were the right way to train them. Those are only obvious once you realize you can quickly calculate derivatives. There was a circular dependency.
Worse, it would be very easy to write off any piece of the circular dependency as impossible on casual thought. Training neural networks with derivatives? Surely you’d just get stuck in local minima. And obviously it would be expensive to compute all those derivatives. It’s only because we know this approach works that we don’t immediately start listing reasons it’s likely not to.
That’s the benefit of hindsight. Once you’ve framed the question, the hardest work is already done.