Recall that the output of a neuron, an affine function followed by a ReLU activation (there are many other activation functions, but the key concept is the same), is given by:
$$ activation(\vec{x})=max(0, \vec{w}\cdot \vec{x} + b) $$
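As a concrete illustration, here is a minimal NumPy sketch of this computation; the values of `w`, `x`, and `b` are made up for the example:

```python
import numpy as np

# A single neuron: affine function followed by ReLU (illustrative values).
w = np.array([0.5, -1.0, 2.0])   # weights
x = np.array([1.0, 2.0, 3.0])    # inputs
b = 0.1                          # bias

z = np.dot(w, x) + b             # affine part: w . x + b
activation = np.maximum(0.0, z)  # ReLU: max(0, z)
print(activation)                # 4.6
```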
To find the gradients of this function w.r.t. $\vec{w}$ and $b$ we will divide the task into two parts:

1. Find the derivative of $\vec{w}\cdot \vec{x} + b$ w.r.t. $\vec{w}$ and $b$.
2. Deal with the element-wise $max$ operation.
We don’t know the derivative of $\vec{w}\cdot \vec{x}$ directly, but we do know that
$$ \vec{w}\cdot \vec{x} = \sum_i w_ix_i = sum(\vec{w} \otimes \vec{x}) $$
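A quick numerical sanity check of this identity (same illustrative values as before), assuming NumPy:

```python
import numpy as np

# The dot product equals the sum of the element-wise product.
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 2.0, 3.0])

print(np.dot(w, x))   # 4.5
print(np.sum(w * x))  # 4.5 -- same value
```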
Therefore, we need the vector chain rule here. Let’s define an intermediate variable
$$ \vec{u} = \vec{w} \otimes \vec{x} $$
$$ \therefore y = sum(\vec{u}) $$
From the previous chapter we know:
$$ \frac{\partial \vec{u}}{\partial \vec{w}} = \frac{\partial}{\partial \vec{w}} (\vec{w} \otimes \vec{x}) = diag(\vec{x}) $$
$$ \frac{\partial y}{\partial \vec{u}} = \frac{\partial}{\partial \vec{u}} sum(\vec{u}) = \vec{1}^T $$
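If you want to see these two Jacobians emerge numerically, PyTorch’s `torch.autograd.functional.jacobian` can compute them for the illustrative vectors used above (a sketch, not part of the derivation itself):

```python
import torch

# Intermediate Jacobians for u = w (element-wise) x and y = sum(u), illustrative values.
w = torch.tensor([0.5, -1.0, 2.0])
x = torch.tensor([1.0, 2.0, 3.0])

# du/dw should be diag(x)
J_u_w = torch.autograd.functional.jacobian(lambda w_: w_ * x, w)
print(J_u_w)   # [[1., 0., 0.], [0., 2., 0.], [0., 0., 3.]]

# dy/du should be a row of ones (1^T)
u = w * x
J_y_u = torch.autograd.functional.jacobian(lambda u_: torch.sum(u_), u)
print(J_y_u)   # [1., 1., 1.]
```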
By the chain rule:
$$ \frac{\partial y}{\partial \vec{w}} = \frac{\partial y}{\partial \vec{u}}\frac{\partial \vec{u}}{\partial \vec{w}} = \vec{1}^T diag(\vec{x}) = \vec{x}^T $$
Also, since $y = sum(\vec{w} \otimes \vec{x})$ does not depend on $b$,
$$ \frac{\partial y}{\partial b} = \frac{\partial}{\partial b} sum(\vec{w} \otimes \vec{x}) = 0 $$
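We can check both results with autograd. Below, `y` is written as a function of `w` and `b`; since it never uses `b`, `jacobian` (with its default `strict=False`) reports a zero gradient for it, matching the math. The values are again only illustrative:

```python
import torch

# Verifying dy/dw = x^T and dy/db = 0 for y = sum(w (element-wise) x).
w = torch.tensor([0.5, -1.0, 2.0])
x = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor(0.1)

grad_w, grad_b = torch.autograd.functional.jacobian(
    lambda w_, b_: torch.sum(w_ * x), (w, b)
)
print(grad_w)  # tensor([1., 2., 3.]) -- equals x, i.e. dy/dw = x^T
print(grad_b)  # tensor(0.)           -- dy/db = 0
```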