The multilayer perceptron consists of several layers of operations. The output of each neuron in a layer is a linear combination of the input tensor followed by an activation function. For instance, in $Layer^1$, the output of the first neuron, $\mathbf{a}^1_1$, is:
$\mathbf{a}^1_1 = f^1(\mathbf{X}^0\mathbf{w}^1_1+b^1_1)$
The outputs of all the neurons in $Layer^1$ are then stacked into a new tensor:
$\mathbf{A}^1=\begin{bmatrix} \mathbf{a}^1_1 & \mathbf{a}^1_2 & \cdots & \mathbf{a}^1_{n_1}\end{bmatrix}=\begin{bmatrix} a_{11}^1 & a_{21}^1 & \cdots & a_{n_1 1}^1\\ a_{12}^1 & a_{22}^1 & \cdots & a_{n_1 2}^1\\ \vdots & \vdots & \ddots & \vdots\\ a_{1m}^1 & a_{2m}^1 & \cdots & a_{n_1 m}^1\\\end{bmatrix}$
where $n_1$ is the number of neurons in $Layer^1$ and $m$ is the number of samples.

We denote the output of each neuron by $a_{ij}^l$, where $l$ is the index of the layer, $i$ is the index of the neuron in $Layer^l$, and $j$ is the index of the sample. The output of $Layer^{l}$ is then fed to the next layer $Layer^{l+1}$ as its input. This process continues until the last layer $Layer^{k}$ of the network, which generates the prediction $\mathbf{\hat{Y}}=\mathbf{A}^k$. Note that the number of neurons in $Layer^k$ has to match the shape of $\mathbf{Y}$ (the number of classes in classification, or the number of targets in regression). Also note that we apply a different initialization to the weights of each neuron, so neurons in the same layer generate slightly different outputs (discussed further in Weight Initialization).
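To make the forward pass concrete, here is a minimal NumPy sketch. The sigmoid activation, the layer sizes, and names such as `forward`, `weights`, and `biases` are illustrative assumptions rather than anything prescribed above:

```python
import numpy as np

def sigmoid(x):
    # Example activation f^l; any differentiable activation works.
    return 1.0 / (1.0 + np.exp(-x))

def forward(X, weights, biases):
    """Forward pass through an MLP.

    X       : (m, n_0) input tensor, m samples with n_0 features.
    weights : list of (n_{l-1}, n_l) weight matrices W^l.
    biases  : list of (n_l,) bias vectors b^l.
    Returns the activations A^l of every layer; A^k is the prediction Y_hat.
    """
    activations = [X]                        # A^0 = X^0
    for W, b in zip(weights, biases):
        x_lin = activations[-1] @ W + b      # linear combination x^l
        activations.append(sigmoid(x_lin))   # A^l = f^l(x^l)
    return activations

# Toy example: 4 input features, one hidden layer of 3 neurons, 2 outputs.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(3, 2))]
biases  = [np.zeros(3), np.zeros(2)]
X0 = rng.normal(size=(5, 4))                 # 5 samples
Y_hat = forward(X0, weights, biases)[-1]     # A^k
```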
The goal of training this multilayer perceptron network is to find the optimal weights $\mathbf{W}^*$ in every layer that optimize the objective function between the output of the network $\mathbf{\hat{Y}}$ and the ground-truth target $\mathbf{Y}$. More specifically, we first need to compute the gradient of the objective function with respect to the output of the network $\mathbf{\hat{Y}}$:
$\nabla_\mathbf{\hat{Y}}L(\mathbf{Y}, \mathbf{\hat{Y}})$
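For example, if the objective is the mean squared error (an illustrative choice; any differentiable loss works the same way), this gradient is:

$L(\mathbf{Y}, \mathbf{\hat{Y}})=\frac{1}{m}\lVert\mathbf{\hat{Y}}-\mathbf{Y}\rVert^2 \quad\Rightarrow\quad \nabla_\mathbf{\hat{Y}}L(\mathbf{Y}, \mathbf{\hat{Y}})=\frac{2}{m}(\mathbf{\hat{Y}}-\mathbf{Y})$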
Next, we need the Jacobian matrix of the output of each neuron in $Layer^k$ with respect to its linear combination (pre-activation) $\mathbf{x}_i^k$:
$J_{f^k}(\mathbf{x}_i^k)=\frac{\partial\mathbf{a}^k_i}{\partial\mathbf{x}_i^k}$
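Because the activation is applied elementwise, this Jacobian is diagonal. For the sigmoid activation (again only an illustrative choice, with $\odot$ denoting the elementwise product), it is:

$J_{f^k}(\mathbf{x}_i^k)=\mathrm{diag}\left(\mathbf{a}_i^k\odot(1-\mathbf{a}_i^k)\right)$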
And the Jacobian matrix of the linear combination $\mathbf{x}_i^k=\mathbf{A}^{k-1}\mathbf{w}_i^k+b_i^k$ with respect to the weights $\mathbf{w}_i^k$:
$J_{\mathbf{A}^{k-1}\mathbf{w}^k_i}(\mathbf{w}_i^{k})=\frac{\partial\mathbf{x}_i^k}{\partial\mathbf{w}_i^k}=\mathbf{A}^{k-1}$
where $k$ denotes the last layer of the multilayer perceptron network. By the chain rule, the gradient of the objective function with respect to the weights $\mathbf{w}_i^k$ of neuron $i$ in the last layer is:
$\nabla_{\mathbf{w}_i^k}L(\mathbf{Y}, \mathbf{\hat{Y}})=\left(\frac{\partial\mathbf{x}_i^{k}}{\partial\mathbf{w}_i^k}\right)^T\left(\frac{\partial\mathbf{a}_i^k}{\partial\mathbf{x}_i^k}\right)^T\nabla_\mathbf{\hat{Y}}L(\mathbf{Y}, \mathbf{\hat{Y}})$
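Continuing the NumPy sketch above (sigmoid activation and MSE loss remain illustrative assumptions), the last-layer gradient for a single neuron $i$ could be computed as follows. Since only the $i$-th column of $\mathbf{\hat{Y}}$ depends on neuron $i$, the sketch indexes that column of $\nabla_\mathbf{\hat{Y}}L$ directly:

```python
def last_layer_weight_grad(activations, Y, i):
    """Gradient of the MSE loss w.r.t. the weights w_i^k of neuron i in the last layer."""
    A_prev, A_k = activations[-2], activations[-1]   # A^{k-1} and A^k = Y_hat
    m = Y.shape[0]
    dL_dYhat = 2.0 / m * (A_k - Y)                   # gradient w.r.t. Y_hat (MSE)
    da_dx = A_k[:, i] * (1.0 - A_k[:, i])            # diagonal of the sigmoid Jacobian
    # (dx_i/dw_i)^T (da_i/dx_i)^T dL/dY_hat_i, with A^{k-1} playing the role of dx_i/dw_i
    return A_prev.T @ (da_dx * dL_dYhat[:, i])
```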
The process continues from the output layer back to the input layer, which is why it is called backpropagation. The gradient of the objective function with respect to the weights $\mathbf{w}_i^l$ (the weights contributing to neuron $i$ in $Layer^l$) can be written as:
$\nabla_{\mathbf{w}_i^l}L(\mathbf{Y}, \mathbf{\hat{Y}})=\left(\frac{\partial\mathbf{x}_i^l}{\partial\mathbf{w}_i^l}\right)^T\left(\frac{\partial\mathbf{a}_i^l}{\partial\mathbf{x}_i^l}\right)^T\sum\limits_{j=1}^{n_{l+1}}\nabla_{\mathbf{w}_j^{l+1}}L(\mathbf{Y}, \mathbf{\hat{Y}})$
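A compact way to compute all of these gradients at once is the conventional vectorized formulation, which propagates the gradient with respect to each layer's linear combinations backwards through the network instead of accumulating per-neuron terms explicitly. The sketch below continues the earlier NumPy example; sigmoid activations on every layer and the MSE loss are still illustrative assumptions, and bias gradients are omitted for brevity:

```python
def backward(activations, Y, weights):
    """Backpropagation: returns the gradient of the MSE loss w.r.t. every W^l.

    delta is the gradient w.r.t. the linear combinations of the current layer,
    propagated from the output layer back towards the input layer.
    """
    m = Y.shape[0]
    grads = []
    A_k = activations[-1]
    # Output layer: dL/dY_hat times the (diagonal) sigmoid Jacobian.
    delta = (2.0 / m) * (A_k - Y) * A_k * (1.0 - A_k)
    for l in range(len(weights) - 1, -1, -1):
        A_prev = activations[l]
        grads.insert(0, A_prev.T @ delta)    # (dx^l/dW^l)^T delta^l
        if l > 0:
            # Propagate through W^l and the previous layer's activation Jacobian.
            delta = (delta @ weights[l].T) * activations[l] * (1.0 - activations[l])
    return grads
```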
After computing the gradients, we can apply gradient descent, in combination with optimizers, to update the weights $\mathbf{W}^l$ in each layer.
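For example, a single step of plain gradient descent might look like the following (the function name and learning rate are arbitrary choices for illustration):

```python
def gradient_descent_step(weights, grads, lr=0.01):
    # Vanilla gradient descent: move every W^l a small step against its gradient.
    return [W - lr * dW for W, dW in zip(weights, grads)]
```

In practice this plain update is usually replaced by an optimizer such as SGD with momentum or Adam, which reuse the same gradients but adapt the step.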
