As we described in our last post about color, we need to determine the colors of our clothes from their images. This helps us in many ways, including deciding which styles to buy and which clients to send them to. That post described our hybrid human-computer approach, but only went into depth on the human part: translating our images into a hierarchy of colors. In this post, we'll go into depth on the computational part: our current computer vision algorithm, some of the process that led us to it, and ideas for what we'll do next.

How will we know if it’s working?

Before even developing an algorithm, we have to think about how to evaluate it. If we make an algorithm and it says “this image is these colors”, is it right? What does “right” mean?

For this task, we decided on two important dimensions: correct labeling of the main color, and correct number of colors. We operationalize these as the CIEDE 2000 distance (as implemented in scikit-image) between the main color our algorithm predicts and our ground-truth main color, and the mean absolute error on the number of colors, respectively. We settled on these measures for a few reasons.
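For concreteness, here's a minimal sketch of those two per-item error measures using scikit-image's deltaE_ciede2000; the specific LAB values and color counts below are made up for illustration.

```python
import numpy as np
from skimage.color import deltaE_ciede2000

# Predicted vs. ground-truth main color for a single item, as LAB triples
# (illustrative values only).
predicted_main = np.array([[52.0, 8.0, -32.0]])   # a muted blue
truth_main = np.array([[55.0, 5.0, -30.0]])

main_color_error = deltaE_ciede2000(truth_main, predicted_main)[0]

# Predicted vs. ground-truth number of distinct colors for the same item.
predicted_count, truth_count = 3, 2
count_error = abs(predicted_count - truth_count)

print(main_color_error, count_error)
```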

What about ground truth data? We have labels provided by our merchandising team, but our tools only allow them to select a broad color like "grey" or "blue", not an exact value. Each broad color encompasses a lot of items with very different actual shades, so we can't use these labels as ground truth. We have to build our own data set.

Some of your minds might be jumping to Mechanical Turk or other labeling services. But we don't need very many images, so describing the task might be harder than just doing it ourselves. Plus, building the data set helps us understand our data better. Using a quick and dirty bit of HTML/JavaScript, we randomly selected 1000 images, picked out a pixel representing the main color of each, and labeled how many distinct colors we saw. Now it's easy to get two numbers (mean CIEDE distance of the main color, and MAE on the number of colors) that tell us how well an algorithm does.
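As a sketch of that scoring, assume the hand-labeled items are available as simple records, and that extract_colors stands in for whichever algorithm we're testing (both the record layout and the function name are hypothetical here); the two summary numbers then fall out of a short loop:

```python
import numpy as np
from skimage.color import deltaE_ciede2000

def evaluate(labeled_items, extract_colors):
    """Score a color-extraction algorithm against hand-labeled ground truth.

    `labeled_items` is assumed to be a list of dicts with keys 'image'
    (an RGB array), 'main_color' (a LAB triple), and 'n_colors' (an int).
    `extract_colors` is assumed to return a list of LAB colors, most
    prominent first. Both are stand-ins for our real code.
    """
    color_errors, count_errors = [], []
    for item in labeled_items:
        palette = extract_colors(item['image'])
        predicted_main = np.asarray(palette[0], dtype=float)[None, :]
        truth_main = np.asarray(item['main_color'], dtype=float)[None, :]
        color_errors.append(deltaE_ciede2000(truth_main, predicted_main)[0])
        count_errors.append(abs(len(palette) - item['n_colors']))
    # The two summary numbers: mean CIEDE 2000 distance on the main color,
    # and mean absolute error on the number of colors.
    return np.mean(color_errors), np.mean(count_errors)
```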

At times we also did some manual verification by running two algorithms on the same image and displaying the item with both lists of colors. We then manually rated about 200 of these pairs, selecting which list of colors was "better." This kind of engagement with the data is important, not only to get a result ("algorithm B is better than A 70% of the time") but to understand what's happening in each algorithm ("algorithm B tends to select too many clusters, but algorithm A misses some very light colors").

A sweater (center), and two different algorithms' color extraction results (left and right)

Our color extraction algorithm

Before processing our images, we convert them to the CIELAB (or LAB) color space instead of the more common RGB. Instead of three numbers representing the amounts of red, green, and blue in each pixel, LAB points (technically L*a*b*, though we will call it LAB throughout this post for simplicity) describe a color along three different axes. L represents lightness, from black (0) to white (100). A and B represent the color in two parts: A ranges from green (-128) to red (127), and B ranges from blue (-128) to yellow (127). The main benefit of this color space is its approximate perceptual uniformity: two pairs of LAB points separated by the same Euclidean distance will appear approximately equally distinct, no matter where in the space they live.
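To make those axes concrete, here's a small sketch using scikit-image's rgb2lab to convert a few arbitrary sRGB colors and compare their Euclidean distances in LAB:

```python
import numpy as np
from skimage.color import rgb2lab

# A 1x3 "image" of three sRGB pixels: pure red, a darker red, and pure blue.
rgb = np.array([[[1.0, 0.0, 0.0],
                 [0.8, 0.0, 0.0],
                 [0.0, 0.0, 1.0]]])

lab = rgb2lab(rgb)[0]   # shape (3, 3): each row is (L, A, B)
print(lab)              # L in [0, 100]; A and B roughly in [-128, 127]

# Euclidean distance in LAB approximates perceived difference.
print(np.linalg.norm(lab[0] - lab[1]))   # red vs. darker red: relatively small
print(np.linalg.norm(lab[0] - lab[2]))   # red vs. blue: much larger
```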

Of course, LAB introduces other difficulties: notably, our images are always viewed on computer screens, which use device-dependent RGB color spaces. Also, the LAB gamut is wider than the RGB gamut, meaning that LAB can express colors that RGB cannot. As a result, LAB-RGB conversion is not lossless: if you convert points from LAB to RGB and back, you probably will not get exactly the same points. We acknowledge these theoretical shortcomings in the rest of this work, but empirically our approach works well despite them.
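A quick way to see the round-trip issue, assuming an sRGB target: take a very saturated LAB point that (we believe) lies outside the sRGB gamut and run it through the conversion and back with scikit-image.

```python
import numpy as np
from skimage.color import lab2rgb, rgb2lab

# A very saturated LAB point that (we believe) lies outside the sRGB gamut.
lab_point = np.array([[[50.0, 100.0, -100.0]]])   # shape (1, 1, 3)

rgb_point = lab2rgb(lab_point)   # out-of-gamut values end up clipped to [0, 1]
roundtrip = rgb2lab(rgb_point)

print(lab_point[0, 0])   # [  50.  100. -100.]
print(roundtrip[0, 0])   # noticeably different after the trip through RGB
```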

After we convert to LAB, we have a series of pixels, which can be seen as (L, A, B, X, Y) points. The rest of our algorithm is two-stage clustering on these points, where the first stage clusters based on all 5 dimensions, and the second stage omits the X and Y dimensions.
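Here's a sketch of that representation (the relative scaling between the color and spatial axes is a detail we're glossing over): each pixel of the LAB image becomes one row of a points matrix.

```python
import numpy as np

def image_to_points(lab_image):
    """Turn an (H, W, 3) LAB image into an (H*W, 5) array of
    (L, A, B, X, Y) points, one row per pixel. A sketch only; in
    practice the spatial and color axes may need relative scaling."""
    height, width = lab_image.shape[:2]
    ys, xs = np.mgrid[0:height, 0:width]    # per-pixel row and column indices
    points = np.concatenate(
        [lab_image.reshape(-1, 3),          # L, A, B
         xs.reshape(-1, 1),                 # X (column)
         ys.reshape(-1, 1)],                # Y (row)
        axis=1,
    )
    return points
```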

Clustering using space

We start with an image of a flat item that has been color corrected as described in the previous post, shrunk to 300x200, and converted to LAB.
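As a rough sketch of that starting point, and assuming "300x200" means 300 pixels wide by 200 tall, the preprocessing amounts to a resize plus the color-space conversion (the color correction itself happens upstream, as described in the previous post; 'item.jpg' is a placeholder path):

```python
from skimage.io import imread
from skimage.transform import resize
from skimage.color import rgb2lab

# Placeholder path; the image is assumed to already be color corrected.
rgb = imread('item.jpg')

# Shrink to 300x200 (here interpreted as 200 rows by 300 columns) to keep
# the downstream clustering fast.
small = resize(rgb, (200, 300), anti_aliasing=True)

# resize returns floats in [0, 1], which is what rgb2lab expects.
lab_image = rgb2lab(small)
```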