
The Grand Tour is a classic visualization technique for high-dimensional point clouds that projects a high-dimensional dataset into two dimensions. Over time, the Grand Tour smoothly animates its projection so that every possible view of the dataset is (eventually) presented to the viewer. Unlike modern nonlinear projection methods such as t-SNE and UMAP, the Grand Tour is fundamentally a linear method. In this article, we show how to leverage the linearity of the Grand Tour to enable a number of capabilities that are uniquely useful for visualizing the behavior of neural networks. Concretely, we present three use cases of interest: visualizing the training process as the network weights change, visualizing the layer-to-layer behavior as the data goes through the network, and visualizing both how adversarial examples are crafted and how they fool a neural network.

Introduction

Deep neural networks often achieve best-in-class performance in supervised learning contests such as the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). Unfortunately, their decision process is notoriously hard to interpret, and their training process is often hard to debug. In this article, we present a method for visualizing the responses of a neural network, one that leverages properties of deep neural networks and properties of the Grand Tour. Notably, our method enables us to more directly reason about the relationship between changes in the data and changes in the resulting visualization. As we will show, this data-visual correspondence is central to the method we present, especially when compared to other non-linear projection methods like UMAP and t-SNE.

To understand a neural network, we often try to observe its action on input examples (both real and synthesized). These kinds of visualizations are useful for elucidating the activation patterns of a neural network for a single example, but they might offer less insight into the relationship between different examples, different states of the network as it’s being trained, or how an example’s data flows through the different layers of a single network. Therefore, we instead aim to enable visualizations of the context around our objects of interest: what is the difference between the present training epoch and the next one? How does the classification of a network converge (or diverge) as the image is fed through the network? Linear methods are attractive because they are particularly easy to reason about. The Grand Tour works by generating a random, smoothly changing rotation of the dataset, and then projecting the data to the two-dimensional screen: both are linear processes. Although deep neural networks are clearly not linear processes, they often confine their nonlinearity to a small set of operations, enabling us to still reason about their behavior. Our proposed method better preserves context by providing more consistency: it should be possible to know how the visualization would change if the data had been different in a particular way.
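To make this linearity concrete, here is a minimal Python sketch of a simplified, torus-style variant of the Grand Tour: at each frame, compose a small rotation in every coordinate plane, then project onto the first two axes. The function names and the particular rotation schedule are our own illustrative choices, not necessarily the implementation used for the figures in this article.

```python
import numpy as np

def givens(n, i, j, theta):
    """An n x n rotation by angle theta in the (i, j) coordinate plane."""
    g = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    g[i, i] = g[j, j] = c
    g[i, j], g[j, i] = -s, s
    return g

def grand_tour_frames(points, n_frames=300, step=0.01, seed=0):
    """Yield 2D views of `points` under a smoothly changing rotation."""
    n = points.shape[1]
    rng = np.random.default_rng(seed)
    # a fixed random angular speed for each of the n(n-1)/2 coordinate planes
    speeds = step * rng.uniform(0.5, 1.5, size=n * (n - 1) // 2)
    rotation = np.eye(n)
    for _ in range(n_frames):
        k = 0
        for i in range(n):
            for j in range(i + 1, n):
                rotation = rotation @ givens(n, i, j, speeds[k])
                k += 1
        # both steps are linear: rotate, then keep the first two coordinates
        yield (points @ rotation)[:, :2]
```

Successive frames differ by a small rotation, so points move smoothly on screen; and because every step is linear, a small change to an input point produces a correspondingly small, predictable change in the view.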

Working Examples

To illustrate the technique we will present, we trained deep neural network (DNN) models on three common image classification datasets: MNIST, Fashion-MNIST, and CIFAR-10. While our architecture is simpler and smaller than current DNNs, it is still indicative of modern networks, and it is complex enough to demonstrate both our proposed techniques and the shortcomings of typical approaches.

The following figure presents a simple functional diagram of the neural network we will use throughout the article. The neural network is a sequence of linear (both convolutional and fully-connected), max-pooling, and ReLU layers, culminating in a softmax layer.
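For concreteness, here is one way such a network could be written in PyTorch. The layer sizes and filter counts below are illustrative assumptions for 28×28 grayscale inputs (as in MNIST), not necessarily the exact configuration we trained.

```python
import torch.nn as nn

# Illustrative only: a small convolutional network of the kind described
# above; input shape is (batch, 1, 28, 28), and all sizes are assumptions.
model = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                             # 28x28 -> 14x14
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                             # 14x14 -> 7x7
    nn.Flatten(),
    nn.Linear(64 * 7 * 7, 128), nn.ReLU(),
    nn.Linear(128, 10),
    nn.Softmax(dim=1),  # 10 class probabilities that sum to 1
)
```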

Even though neural networks are capable of incredible feats of classification, deep down they really are just pipelines of relatively simple functions. For images, the input is a 2D array of scalar values for grayscale images or RGB triples for color images. When needed, one can always flatten the 2D array into an equivalent (w⋅h⋅c)-dimensional vector, where w, h, and c are the image’s width, height, and number of channels. Similarly, the intermediate values after any one of the functions in the composition, that is, the activations of the neurons after a layer, can also be seen as vectors in ℝⁿ, where n is the number of neurons in the layer. The softmax output, for example, can be seen as a 10-vector whose entries are positive real numbers that sum to 1. This vector view of the data in a neural network not only lets us represent complex data in a mathematically compact form, but also hints at how to visualize the data better.
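As a quick illustration of this vector view, the following Python snippet flattens a batch of hypothetical, randomly generated grayscale images into vectors and checks the defining property of a softmax output; the array shapes assume MNIST-sized inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((4, 28, 28, 1))          # 4 grayscale images (h, w, c)

# flatten each image into an equivalent (w*h*c)-dimensional vector
vectors = images.reshape(len(images), -1)    # shape (4, 784)

# a softmax output is likewise a 10-vector of positive values summing to 1
logits = rng.random((4, 10))
softmax = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
assert np.allclose(softmax.sum(axis=1), 1.0)
```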

Most of the simple functions fall into two categories: they are either linear transformations of their inputs (like fully-connected layers or convolutional layers), or relatively simple non-linear functions that work component-wise (like sigmoid activations or ReLU activations). Some operations, notably max-pooling and softmax, fall into neither category. We will come back to this later.
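The distinction is easy to see in code. In the Python sketch below (with random stand-in data), a fully-connected layer is a matrix-vector product, ReLU touches each coordinate independently, and softmax does neither: every one of its outputs depends on every input through the normalizing sum.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(784)

# linear: a fully-connected layer is a matrix-vector product plus a bias
W, b = rng.random((100, 784)), rng.random(100)
h = W @ x + b

# component-wise non-linear: ReLU acts on each coordinate independently
relu = np.maximum(h, 0.0)

# softmax is *not* component-wise: the denominator couples all outputs
z = rng.random(10)
softmax = np.exp(z) / np.exp(z).sum()
```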

The functional diagram above helps us look at a single image at a time; however, it does not provide much context for understanding the relationships between layers, between different examples, or between different class labels. For that, researchers often turn to more sophisticated visualizations.

Using Visualization to Understand DNNs

Let’s start by considering the problem of visualizing the training process of a DNN. When training neural networks, we optimize the parameters of the function to minimize a scalar-valued loss function, typically through some form of gradient descent. We want the loss to keep decreasing, so we monitor the whole history of training and testing losses over rounds of training (or “epochs”) to make sure that the loss decreases over time.
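In code, this monitoring amounts to recording one loss value per epoch and plotting the sequence. The Python sketch below uses a synthetic stand-in for the real per-epoch training step, so the exact numbers are illustrative only.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

def train_one_epoch(epoch):
    # stand-in for real training code: a decaying loss plus a little noise
    return float(np.exp(-0.15 * epoch) + 0.02 * rng.random())

losses = [train_one_epoch(e) for e in range(30)]

plt.plot(losses)
plt.xlabel("epoch")
plt.ylabel("training loss")
plt.show()
```

The following figure shows the actual line plot of the training loss for the MNIST classifier.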

Although its general trend meets our expectation, with the loss steadily decreasing, we see something strange around epochs 14 and 21: the curve goes almost flat before starting to drop again. What happened? What caused that?