Overfitting can be a serious problem in large, deep neural networks. One way to reduce it is to exploit the "wisdom of crowds": average the predictions of many separately trained models. However, training many neural networks with different architectures is computationally expensive, and evaluating them all at prediction time is often too slow in practice. To address this, Nitish Srivastava et al. proposed dropout, which prevents overfitting and provides a way of approximately combining many different neural networks efficiently.
Consider layer $l$ of a neural network:
$\mathbf{X}^l=\mathbf{A}^{l-1}\mathbf{W}^l+\mathbf{b}^l$
$\mathbf{A}^l = f^l(\mathbf{X}^l)$
Dropout applies a Bernoulli mask to the previous layer's activations before the weight multiplication:
$\mathbf{R}^l\sim\text{Bernoulli}(\rho)$
$\mathbf{X}^l=(\mathbf{R}^l\odot\mathbf{A}^{l-1})\mathbf{W}^l+\mathbf{b}^l$
$\mathbf{A}^l = f^l(\mathbf{X}^l)$
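
As a concrete illustration, here is a minimal NumPy sketch of the masked forward pass above; the function name `dropout_forward`, the generic activation `f`, and the `training` flag are illustrative assumptions rather than part of the original formulation.

```python
import numpy as np

def dropout_forward(A_prev, W, b, rho, f, training=True):
    """Forward pass for one layer with dropout applied to its inputs.

    A_prev: activations from the previous layer, shape (batch, n_in)
    rho:    probability of keeping a unit (the Bernoulli parameter above)
    f:      the layer's activation function
    """
    if training:
        # Sample the Bernoulli mask R^l and zero out the dropped features.
        R = np.random.binomial(1, rho, size=A_prev.shape)
        A_prev = R * A_prev
    X = A_prev @ W + b
    return f(X)
```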
In other words, dropout randomly removes some of the features of each training sample in the mini-batch during training. The gradient of a dropped feature is $0$, and the weight update uses the average gradient over all samples in the mini-batch. Note that the authors suggest that combining dropout with max-norm regularization, large decaying learning rates, and high momentum provides a significant boost over using dropout alone; a sketch of this recipe is shown below.
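
The following Keras sketch illustrates that training recipe; the specific hyperparameter values (max-norm of 3, initial learning rate, decay schedule, momentum of 0.95, and the layer sizes) are illustrative placeholders rather than values taken from the paper.

```python
import tensorflow as tf

# Illustrative recipe: dropout combined with a max-norm constraint on the
# incoming weights of each unit, a large decaying learning rate, and high momentum.
max_norm = tf.keras.constraints.MaxNorm(max_value=3.0)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(1024, activation='relu', kernel_constraint=max_norm),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation='softmax'),
])

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.1, decay_steps=1000, decay_rate=0.96)
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule, momentum=0.95)
model.compile(optimizer=optimizer, loss='categorical_crossentropy')
```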


At test time, no units are dropped; instead, the weights are scaled by the retention probability $\rho$ so that the expected input to each unit matches what it saw during training:
$\mathbf{W}^l_{test}=\rho\mathbf{W}^l$
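
Continuing the NumPy sketch from above (same assumed names), the test-time forward pass simply drops the mask and rescales the weights:

```python
def dropout_forward_test(A_prev, W, b, rho, f):
    # No mask at test time: every unit is kept, and the weights are scaled
    # by the retention probability rho so that the expected pre-activation
    # matches what the unit saw during training.
    X = A_prev @ (rho * W) + b
    return f(X)
```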

In Keras, dropout is applied as its own layer; note that the `rate` argument is the probability of dropping a unit, i.e. $1-\rho$ in the notation above.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.5),  # drop 50% of the units, during training only
    tf.keras.layers.Dense(64, activation='relu'),
])
```
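
Note that `tf.keras.layers.Dropout` implements the "inverted dropout" variant: during training, the kept activations are scaled up by $1/(1-\text{rate})$, so no rescaling of the weights is needed at test time; at inference the layer simply passes its input through unchanged.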