Beauty in the Time of Algorithms

Still in the Eye of the Beholder

by Sam Winslow — for Cultural History of the Screen (J. LaRiviere, NYU)

Can a computer vision system comprehend beauty? What heuristics are available at the cutting edge of machine learning, how are they currently being used to answer this question, and what primary research might I be able to do without being an expert in vector calculus? What are the cultural and relational characteristics of beauty that make it so difficult to come up with a single set of parameters to describe it?

This paper might be read as a kind of update to John Berger's "Ways of Seeing" for the contemporary age of rapid image processing. I will start my inquiry from my own immediate frame of reference. The left photo is the view from my New York City apartment looking out into the atrium in the center of the building. The right photo is the view from my family's house on Lake Winnipesaukee, New Hampshire, where I have spent most of the semester.

Which is the more beautiful photo? For now, let's say it's the nature scene. Let's start with its purely formal qualities and some napkin-sketch math.

At first glance, one might say the nature photo is more vibrant. But the most common color in each photo measures the same 45% saturation on the HSL (hue, saturation, lightness) scale, and even the rainbow's colors reach a meager 9%.
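For readers who want to try this kind of napkin math themselves, here is a minimal sketch of how a "most common color" saturation figure could be computed. It is not necessarily how the numbers above were obtained. The function name `dominant_saturation` and the `bucket` quantization step are my own illustrative assumptions, and the pixels are assumed to arrive as plain (r, g, b) tuples.

```python
import colorsys
from collections import Counter

def dominant_saturation(pixels, bucket=32):
    """Return the HSL saturation (0-100) of the most common color.

    pixels: iterable of (r, g, b) tuples with 0-255 channels.
    bucket: hypothetical quantization step, so that near-identical
            shades are counted as the same color.
    """
    counts = Counter(
        (r // bucket, g // bucket, b // bucket) for r, g, b in pixels
    )
    (qr, qg, qb), _ = counts.most_common(1)[0]
    # Use the center of the winning bucket as its representative color.
    r, g, b = (q * bucket + bucket // 2 for q in (qr, qg, qb))
    # colorsys returns (hue, lightness, saturation), each in 0..1.
    _, _, s = colorsys.rgb_to_hls(r / 255, g / 255, b / 255)
    return round(s * 100)
```

A photo that is mostly neutral gray scores 0 no matter how bright it is, which is one reason a scene can feel vibrant while its dominant color measures as fairly dull.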

What about symmetry? The image at left probably ranks higher, although I haven't quantified it here. Processes to quantify image symmetry do exist, however, and, fascinatingly, they can be used to detect chest infections. Symmetry is desirable in the human body: this has been studied for centuries, and it was a criterion in the first AI-judged beauty contest. Symmetry seems, then, to be an essential component of beauty. But how it should be weighed against other criteria depends on the intended use of the output.
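To make "quantifying symmetry" concrete, here is one very simple approach: compare an image with its own mirror image, pixel by pixel. This is only a sketch under my own assumptions (a grayscale image stored as a list of rows, and a hypothetical function name `mirror_symmetry`). It is not the method used in the medical or beauty-contest work mentioned above.

```python
def mirror_symmetry(gray):
    """Score left-right symmetry of a grayscale grid from 0.0 to 1.0.

    gray: list of equal-length rows, each a list of 0-255 brightness values.
    1.0 means the image is identical to its mirror image.
    """
    total = 0
    count = 0
    for row in gray:
        # Pair each pixel with its counterpart across the vertical axis.
        for left, right in zip(row, reversed(row)):
            total += abs(left - right)
            count += 1
    if count == 0:
        return 1.0
    # Normalize by the largest possible difference per pixel (255).
    return 1.0 - total / (255 * count)
```

A score like this only captures reflection across the vertical center line, so a photo taken slightly off-axis would score lower than it "feels" to a human viewer.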

How about complexity: a proxy for visual interest, if not quite beauty? File size in JPEG format is one very rough way to quantify this. (Fractal dimension is another, and I went down a rabbit hole here which I'll come back to later.) At the same pixel dimensions, the building image is 11% larger than the nature scene. But the majority of this data probably comes from the rough texture of the brick wall and the repeating bannisters on each floor, a pattern so rich that it fooled Google's Vision API.
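The file-size idea can be sketched in a few lines. I should be clear that this example swaps in zlib, a lossless compressor from Python's standard library, in place of JPEG, so the numbers will not match JPEG file sizes. The underlying intuition is the same, though: noisy texture compresses poorly. The function names and the synthetic `flat` and `noisy` byte strings are hypothetical stand-ins for real image data.

```python
import random
import zlib

def compression_ratio(pixel_bytes):
    """Compressed size divided by raw size; higher suggests more 'complexity'."""
    if not pixel_bytes:
        return 0.0
    return len(zlib.compress(pixel_bytes, 9)) / len(pixel_bytes)

def percent_larger(size_a, size_b):
    """How much larger size_a is than size_b, as a percentage."""
    return (size_a - size_b) * 100 / size_b

# Synthetic stand-ins: a flat gray wall versus rough, noisy texture.
rng = random.Random(0)
flat = bytes([128]) * 10_000
noisy = bytes(rng.randrange(256) for _ in range(10_000))
```

Here `compression_ratio(noisy)` comes out far higher than `compression_ratio(flat)`, even though neither "image" contains anything a person would call interesting. That gap is exactly the weakness of file size as a measure.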

Google's computer-vision API demo tried to "read the writing on the wall."

These areas are noisy: they're dense with visual data but sparse in semantic meaning. A general-purpose algorithm will get stuck on areas such as these, as you can see in the humorous example above. Most people would take a mental shortcut and not look for writing in this context; in other words, they would ignore stimuli that might resemble letterforms. The decision process to ignore the wall texture might be:

<aside> 🧠 familiar size of door (7 feet) → inferred scale of entire wall (4 stories) → observer is not close to wall AND there are no walkways adjacent to the wall → IF there were writing, it would be large enough to read from a long distance → ignore any stimuli under a certain perceptual size which might resemble letterforms.

</aside>
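The chain of inferences above can be written out as a toy program. This is a sketch of the reasoning and not a claim about how human vision actually works. Every name and threshold here (`CLOSE_FT`, `LEGIBLE_RATIO`, `Scene`, `should_ignore`) is a hypothetical value I chose for illustration.

```python
from dataclasses import dataclass

CLOSE_FT = 10.0          # hypothetical: nearer than this counts as "close to wall"
LEGIBLE_RATIO = 1 / 200  # hypothetical: readable letter height per foot of distance

@dataclass
class Scene:
    observer_distance_ft: float  # how far the viewer stands from the wall
    walkway_adjacent: bool       # could anyone stand close enough to read small text?

def infer_wall_height_ft(door_px, wall_px, door_ft=7.0):
    """Scale the wall from a familiar object: a door is about 7 feet tall."""
    return wall_px / door_px * door_ft

def should_ignore(mark_height_ft, scene):
    """Decide whether a letter-like mark is too small to be intended writing."""
    far_away = scene.observer_distance_ft > CLOSE_FT
    if not far_away or scene.walkway_adjacent:
        return False  # small writing is plausible here, so keep looking
    # Writing meant for a distant viewer would have to be at least this tall.
    min_height_ft = scene.observer_distance_ft * LEGIBLE_RATIO
    return mark_height_ft < min_height_ft
```

For a viewer 60 feet from the wall with no adjacent walkway, a brick-sized mark about 0.1 feet tall is dismissed, while the same mark seen from 5 feet away is not. The shortcut depends entirely on where the observer stands.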

This is not necessarily a more accurate way to see, but it is more useful and more generalizable. By the above logic, a human would probably not notice small writing on one of the bricks that reads "ACME BRICK CO." The Google algorithm would not overlook this detail, provided the image was of high enough resolution. But a human typically visits an apartment hallway to find a path to a destination, and in this scenario, the small text would be irrelevant. The human would ignore it even if he or she had never visited that particular building. In contrast, for a poorly trained delivery robot, accidentally observing the number 28 on a wall adjacent to what is actually the 6th floor would be quite confusing.

Note one condition in the reasoning above: "observer is not close to wall." The condition of the observer is a crucial dimension for the interpretation of meaning in an image. In Ways of Seeing, Berger describes exactly this:

"It is seeing which establishes our place in the surrounding world; we explain that world with words, but words can never undo the fact that we are surrounded by it. . . . We never look at just one thing; we are always looking at the relation between things and ourselves."*

The memories, thoughts, and aspirations I attach to the above photos are central to my value judgments placed on them. You might see a modern apartment building; I see an atrium resembling Bobst Library, a place that has served as banal insulation against the excitement of New York. You might see a lake and sky; I am washed over with memories of my family, the smell of grilled corn, the sound of a crackling campfire. I observe my own experience, and you observe your own.

In Écrits, Jacques Lacan describes the uniquely human capability to observe oneself, which develops in infancy when a baby becomes joyfully aware of his or her own body in a mirror. This event constitutes the discovery of the basic set of relations in which the infant self exists, or as Lacan puts it:

"the symbolic matrix in which the I is precipitated in a primordial form, prior to being objectified in the dialectic of identification with the other, and before language restores to it, in the universal, its function as subject."**