Introduction
Can artificial intelligence turn your photo into a Van Gogh painting? The intersection of technology and art is a fascinating topic, particularly the ability to render images in famous artistic styles. In their 2015 paper, A Neural Algorithm of Artistic Style, Leon A. Gatys et al. use Convolutional Neural Networks (CNNs) to separate and recombine an image's content and style, producing striking artistic interpretations. I implemented the algorithm in Python with PyTorch last week, and this post dives into the mathematics behind it, walking through the major figures from the paper to explain how it works (along with some cool visualizations).

Background on Convolutional Neural Networks (CNNs)
- Convolutional Neural Networks (CNNs) underpin modern image processing and are loosely inspired by human visual perception. A CNN passes an image through a series of learned filters, each layer extracting features hierarchically: early layers detect low-level features such as edges and textures, while deeper layers capture high-level structures such as objects and their arrangements. This layered design makes CNNs effective at tasks like object recognition and, as the paper demonstrates, style transfer.
- In A Neural Algorithm of Artistic Style, the authors use the 16 convolutional and 5 pooling layers of the ImageNet-pretrained VGG-19 network as their CNN backbone. VGG-19 excels at object recognition thanks to its rich hierarchical feature representations. The authors replace VGG's max-pooling with average-pooling, which improves gradient flow during image synthesis and makes optimization smoother when generating stylized images (Methods, page 9).
- Why VGG-19? Its deep, sequential architecture yields clean, layer-by-layer feature maps that are well-suited to isolating content (from deeper layers) and style (from correlations across layers). Pretrained on ImageNet, VGG-19 has learned a broad set of visual features, making it ideal for extracting both the fine textures and the semantic content that style transfer requires.
- To make this concrete, Figure 1 from the paper (page 3) illustrates the CNN architecture and feature reconstructions. Content reconstructions (a-e) show how lower layers ('conv1_1' to 'conv3_1') reproduce near-identical pixel detail, whereas higher layers ('conv4_1', 'conv5_1') capture objects and their layout at the cost of finer detail. Style reconstructions (a-e) reveal textures and patterns while discarding spatial arrangement. This figure shows how VGG-19's hierarchy captures progressively more abstract information, which is central to the algorithm's success.

Separating Content and Style Representations
- The work introduces a technique for disentangling an image's content and style based on the neural representations of a Convolutional Neural Network (CNN), such that they can be recombined to form new images.
- Content Representation: The paper defines content as the feature maps from the higher layers of the CNN, e.g., 'conv4_1' or 'conv5_1'. These layers respond to objects and their spatial relationships, capturing the semantic meaning of the image rather than exact pixel values. In a photo, for example, higher layers emphasize structures such as buildings or people over raw pixel intensities. This is evident in the paper's content reconstructions, where higher-layer outputs retain the overall structure but lose finer detail (Figure 1, content reconstructions d-e, page 3).
- Style Representation: Style is encoded in Gram matrices, which compute correlations between filter responses across multiple CNN layers (e.g., 'conv1_1' to 'conv5_1'). These matrices capture texture, color, and local patterns while discarding global spatial layout. By aggregating correlations across layers, the style representation becomes a multi-scale description: lower layers capture fine textures and higher layers capture broader patterns, such as brushstrokes or color distributions.
- Key Insight: The fact that content and style can be disentangled in the CNN means that they can be manipulated separately. Content features of a photo can be mixed with style features of a painting to generate an image that has the structure of the photo but the aesthetic of the painting.
- Visualization: Figure 1 of the paper (page 3) compares these representations. Content reconstructions (a-e) illustrate a gradual shift from pixel-perfect reproductions in lower layers to abstract, object-centric outputs in upper layers. Style reconstructions (a-e), based on Gram matrices, create textured patterns that describe the appearance of the input—e.g., colors and local structures—while ignoring scene layout, producing outputs akin to abstract, texture-rich versions of the original.
Gram Matrices: Capturing Style
For a layer $l$ with feature maps $F^l \in \mathbb{R}^{N_l \times M_l}$, the Gram matrix is $G^l_{ij} = \sum_k F^l_{ik} F^l_{jk}$, where $N_l$ is the number of filters and $M_l$ is the spatial size of each feature map (height × width).
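In code, this amounts to flattening each feature map into an $N_l \times M_l$ matrix and taking the inner products between filter rows. A minimal sketch:

```python
import torch

def gram_matrix(feature_map):
    """Compute G^l_ij = sum_k F^l_ik F^l_jk for a (1, N_l, H, W) feature map."""
    _, n, h, w = feature_map.shape
    f = feature_map.view(n, h * w)  # F^l in R^{N_l x M_l}, with M_l = H * W
    return f @ f.t()                # (N_l, N_l) matrix of filter correlations

g = gram_matrix(torch.randn(1, 64, 32, 32))  # e.g. a conv1-sized feature map
```

Because each entry is an inner product of two filter responses, the Gram matrix is symmetric and throws away all spatial information, which is exactly why it captures "style" but not layout.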