Recall that the output of a neuron, an affine function followed by a ReLU activation (there are many other activation functions, but the key concept is the same), is given by:

$$ activation(\vec{x})=max(0, \vec{w}\cdot \vec{x} + b) $$
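As a quick numeric sanity check, here is a minimal sketch of this formula with made-up values for $\vec{w}$, $\vec{x}$, and $b$ (the numbers are arbitrary, not from the text):

```python
import numpy as np

# Hypothetical example values
w = np.array([1.0, -2.0, 3.0])
x = np.array([0.5, 1.0, -1.0])
b = 0.25

z = w @ x + b             # affine part: w . x + b
activation = max(0.0, z)  # ReLU clips negative values to zero

print(z, activation)      # → -4.25 0.0
```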

To find the gradients of this function w.r.t. $\vec{w}$ and $b$, we will divide the task into two parts:

  1. Find the derivative of $\vec{w}\cdot \vec{x} + b$ w.r.t. $\vec{w}$ and $b$

  2. Deal with the element-wise operation $max(0, z)$

Step 1: The derivative of the affine function

We don’t know what the derivative of $\vec{w}\cdot \vec{x}$ is, but we do know that

$$ \vec{w}\cdot \vec{x} = \sum_i w_ix_i = sum(\vec{w} \otimes \vec{x}) $$
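This identity, that the dot product equals the sum of the element-wise product, is easy to confirm numerically (with arbitrary example vectors):

```python
import numpy as np

w = np.array([1.0, -2.0, 3.0])  # arbitrary example vectors
x = np.array([0.5, 1.0, -1.0])

dot = w @ x                      # w . x
elementwise_sum = np.sum(w * x)  # sum(w ⊗ x)

assert np.isclose(dot, elementwise_sum)
```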

Therefore, we need the vector chain rule here. Let’s define an intermediate variable

$$ \vec{u} = \vec{w} \otimes \vec{x} $$

$$ \therefore y = sum(\vec{u}) $$

From the previous chapter we know:

$$ \frac{\partial \vec{u}}{\partial \vec{w}} = \frac{\partial}{\partial \vec{w}} (\vec{w} \otimes \vec{x}) = diag(\vec{x}) $$

$$ \frac{\partial y}{\partial \vec{u}} = \frac{\partial}{\partial \vec{u}} sum(\vec{u}) = \vec{1}^T $$
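Both intermediate Jacobians can be checked against finite differences; here is a sketch using the same symbols (the numeric values are arbitrary):

```python
import numpy as np

x = np.array([0.5, 1.0, -1.0])  # arbitrary example values
w = np.array([1.0, -2.0, 3.0])
eps = 1e-6

# Jacobian of u = w ⊗ x with respect to w, column by column
J = np.zeros((3, 3))
for j in range(3):
    dw = np.zeros(3)
    dw[j] = eps
    J[:, j] = ((w + dw) * x - w * x) / eps

# It matches diag(x)
assert np.allclose(J, np.diag(x), atol=1e-4)

# Gradient of sum(u) w.r.t. u is a row of ones
u = w * x
g = np.array([(np.sum(u + eps * e) - np.sum(u)) / eps for e in np.eye(3)])
assert np.allclose(g, np.ones(3), atol=1e-4)
```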

By the chain rule,

$$ \frac{\partial y}{\partial \vec{w}} = \frac{\partial y}{\partial \vec{u}}\frac{\partial \vec{u}}{\partial \vec{w}} = \vec{1}^T diag(\vec{x}) = \vec{x}^T $$
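A finite-difference check of this result, $\partial y / \partial \vec{w} = \vec{x}^T$ (example values are arbitrary):

```python
import numpy as np

w = np.array([1.0, -2.0, 3.0])  # arbitrary example values
x = np.array([0.5, 1.0, -1.0])
eps = 1e-6

y = lambda w_: np.sum(w_ * x)   # y = sum(w ⊗ x)

# Perturb each component of w in turn
grad = np.array([(y(w + eps * e) - y(w)) / eps for e in np.eye(3)])

# The chain-rule result says the gradient is x itself
assert np.allclose(grad, x, atol=1e-4)
```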

Also, for the gradient w.r.t. $b$ we must differentiate the full affine output $\vec{w}\cdot \vec{x} + b$, not just $sum(\vec{u})$. The $sum(\vec{w} \otimes \vec{x})$ term does not depend on $b$, so only the bias term contributes:

$$ \frac{\partial y}{\partial b} = \frac{\partial}{\partial b} \left( sum(\vec{w} \otimes \vec{x}) + b \right) = 0 + 1 = 1 $$
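A finite-difference picture of the bias case, again with arbitrary example values: $sum(\vec{w} \otimes \vec{x})$ does not depend on $b$, while the full affine output $\vec{w}\cdot \vec{x} + b$ changes at rate 1 with $b$.

```python
import numpy as np

w = np.array([1.0, -2.0, 3.0])  # arbitrary example values
x = np.array([0.5, 1.0, -1.0])
b, eps = 0.25, 1e-6

# sum(w ⊗ x) ignores its argument b, so its derivative w.r.t. b is 0
s = lambda b_: np.sum(w * x)
assert np.isclose((s(b + eps) - s(b)) / eps, 0.0)

# the full affine output w . x + b changes at rate 1 with b
f = lambda b_: w @ x + b_
assert np.isclose((f(b + eps) - f(b)) / eps, 1.0, atol=1e-4)
```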