Quick Overview on GANs

GANs are traditionally explained as a game played between two neural networks (an art forger vs. an art detective). See the article below for a very high-level overview:

The GANfather: The man who's given machines the gift of imagination

Training Hardware

In practice, both the detective and the forger are implemented as parts of the same "neural network graph," which we load and train on a large cluster of GPUs. Each model uses 8 GPUs, and we often test several different models in parallel. At peak, it essentially looks like a cryptocurrency operation: 64 GPUs crunching the distribution of "good looking feet." Unlike a cryptocurrency operation, we don't use power-efficient GPUs or networking, so at peak we're burning 20 kW of power.

Models take anywhere from 3 to 7 days to train, and we spent pretty much the entirety of November running different models and permutations of data.
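For a rough sense of scale, the figures above can be turned into back-of-the-envelope numbers. This is only a sketch: it assumes the whole cluster sits at its 20 kW peak for an entire run, which is an upper bound rather than something we measured, and the function names are made up for illustration.

```python
# Back-of-the-envelope numbers from the figures quoted above.
PEAK_POWER_KW = 20.0   # peak draw of the whole cluster
TOTAL_GPUS = 64
GPUS_PER_MODEL = 8


def models_in_parallel(total_gpus: int = TOTAL_GPUS,
                       gpus_per_model: int = GPUS_PER_MODEL) -> int:
    """How many models fit on the cluster at once."""
    return total_gpus // gpus_per_model


def watts_per_gpu(power_kw: float = PEAK_POWER_KW,
                  total_gpus: int = TOTAL_GPUS) -> float:
    """Peak power per GPU, with host and networking overhead lumped in."""
    return power_kw * 1000 / total_gpus


def energy_kwh(days: float, power_kw: float = PEAK_POWER_KW) -> float:
    """Energy used if the cluster ran at peak for the given number of days."""
    return power_kw * 24 * days


if __name__ == "__main__":
    print("models in parallel:", models_in_parallel())
    print("watts per GPU:", watts_per_gpu())
    for days in (3, 7):
        print(f"{days}-day run at peak: {energy_kwh(days):.0f} kWh")
```

Note that kW is a rate (power) and kWh is the total (energy), which is why the run length matters for the second number.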

Training Data

Our rule of thumb is that 10K examples is about the lower bound for achieving a high-quality generator (necessary, but not sufficient). 100K is a much safer number but is often a hard target to hit.

We ran a number of preliminary tests with the initial 1K images scraped by Claire to see how feasible things were. The fact that anything showed up at all was promising.

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/d479eae0-ccea-4333-bd17-e3ce54d9ce5d/Untitled.png

Initial training results

The original request, POV feet, actually covers a rather broad distribution that is difficult to capture. The perspectives are often inconsistent (since the whole leg is in the way and you can't do a straight top-down image / crop) and people do lots of weird shit (lotion, rose petals, feet in cake, etc.).

In contrast, the scraped sole images are significantly more consistent (mostly people lifting their feet and taking a picture straight at the bottom of the sole), so the generator has an easier time.

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/23cd0a0b-5081-4ec9-a4fd-5d85ff6bc843/Untitled.png

Counterexample: A sole image that someone still managed to do some weird shit to (and that caused us to spit out our coffee when we saw it)

So midway through November we decided to just split POV images from sole images, since they really represent two separate distributions. Training two models helped improve the quality (most noticeably with the soles initially).

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/2ba699c9-dd2a-44ae-be51-ab41d06d3006/Untitled.png

Generation results from a preliminary sole model.

Claire's team also started applying tags, which meant we could augment more effectively. We used the tags both to align the images and to augment them (e.g. generating additional "real" examples through mirroring, rotations, etc.). The new, stricter cropping guidelines also helped significantly.
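To make the mirroring and rotation idea concrete, here is a minimal sketch. It is not our actual pipeline: it assumes an image is just a list of pixel rows, sticks to 90-degree rotations, and uses hypothetical function names. A real setup would use an image library and would likely pick the allowed transforms per tag.

```python
from typing import List

# Hypothetical stand-in for an image: rows of grayscale pixel values.
Image = List[List[int]]


def mirror(img: Image) -> Image:
    """Flip the image left-to-right."""
    return [row[::-1] for row in img]


def rotate90(img: Image) -> Image:
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def augment(img: Image) -> List[Image]:
    """Return the original plus its mirrored and rotated variants (8 total)."""
    variants = []
    for base in (img, mirror(img)):
        current = base
        for _ in range(4):  # 0, 90, 180, 270 degrees
            variants.append(current)
            current = rotate90(current)
    return variants


if __name__ == "__main__":
    tiny = [[1, 2],
            [3, 4]]
    for variant in augment(tiny):
        print(variant)
```

Each real image becomes up to eight training examples this way, which helps when the raw count is well below the 10K rule of thumb.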

Model Understanding

A fast and easy way to understand if a model has "generalized" to the distribution of feet is to check whether it's possible to "travel" through the space of valid feet. Left is a collapsed model that has only learned to memorize certain images of feet in pixel-space. Right is a model that has generalized much better.

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/e13d1fc5-4241-47e0-baa4-270bda51fcda/vidse1__1.mp4

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/78506839-8211-45a7-9013-55e31679ff95/soles.mp4

Tangent: in the collapsed model, this fucking guy shows up so bloody often. That face is a thing of nightmares.
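The "travel" check described above boils down to walking between two points in the generator's latent space and rendering a frame at each step. The sketch below only builds the path. It assumes plain linear interpolation between latent vectors and a hypothetical `generator(z)` call that is not shown; the names and the 4-dimensional latent size are illustrative, not what our models use.

```python
import random
from typing import List

Vector = List[float]


def lerp(z0: Vector, z1: Vector, t: float) -> Vector:
    """Linearly interpolate between two latent vectors (t in [0, 1])."""
    return [(1 - t) * a + t * b for a, b in zip(z0, z1)]


def latent_walk(z_start: Vector, z_end: Vector, steps: int) -> List[Vector]:
    """Evenly spaced latent vectors from z_start to z_end, endpoints included."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [lerp(z_start, z_end, i / (steps - 1)) for i in range(steps)]


if __name__ == "__main__":
    rng = random.Random(0)
    dim = 4  # illustrative latent size
    z_a = [rng.gauss(0, 1) for _ in range(dim)]
    z_b = [rng.gauss(0, 1) for _ in range(dim)]
    for z in latent_walk(z_a, z_b, steps=5):
        # In a real check, each z would be passed to the generator
        # (hypothetical: frame = generator(z)) and the frames stitched
        # into a video like the ones above.
        print([round(v, 3) for v in z])
```

If the frames along the path morph smoothly from one plausible foot to another, the model has probably generalized. If they snap between a handful of fixed images, it has likely collapsed or memorized.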