Estimated Time to Complete: 2 Hours
General guidelines for completing the assignment:
Feel free to use Stack Overflow, Google, and any other resources to research questions or issues you have. You are also welcome to use any programming language. The only thing we ask is that you complete this work independently, without help from family or friends. If the assignment requires coding, please include the documentation needed for others to run your code.
When you have completed the assignment, please name the file in accordance with this naming pattern: [YOUR FULL NAME]-Assignment for [NAME OF THE ROLE THE ASSIGNMENT IS FOR] and upload it to **this dropbox.**
Please choose ONE of the following two questions, 2a or 2b, to answer.
Question 2a: CPU-level parallelism and GPU (2 hours)
- As a CV engineer, you should have experience with image manipulation using multi-threading on the CPU or using GPU kernels. We want to detect edges in a grayscale image using the Sobel operator (a 3x3 kernel applied as a valid convolution, i.e. no padding). The input is a WxH image with values in the range [0, 1], and the output should be a (W-2)x(H-2) binary map in which all edge pixels are marked as ones. Write a function that uses CPU thread-level parallelism to utilize all available CPU cores to accomplish this.
- Write the kernel that performs the same function in OpenCL, CUDA, or Metal (pseudo-code is okay). Include both the instantiation code and the kernel code, and remember to specify what your thread group ID and local thread ID are. Assume a maximum of 256 threads per thread group (or per block, depending on your terminology) and no limit on the number of thread groups (or blocks).
- Explain what speedup you expect from each implementation compared to a single-threaded sequential implementation. You do not need to be precise. Assume the following:
  - The input image is 1080p (1920x1080) with 1 channel (grayscale)
  - The GPU has a maximum of 256 local cores and 8GB of local memory
  - The CPU has 8 cores, runs at 2GHz, and has 8GB of RAM
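To make the expected input, output, and work decomposition concrete, here is a minimal sketch in Python, not a reference solution. It splits the output rows into bands and hands each band to a worker thread, and it also computes how many 16x16 thread groups (256 threads each) would cover the output map. The names `sobel_rows`, `sobel_edges`, and `launch_grid`, the gradient-magnitude test, the default threshold of 0.5, and the 16x16 block shape are all assumptions made for illustration; the assignment does not specify them. Note that in CPython the global interpreter lock prevents these threads from running this pure-Python loop truly in parallel, so the sketch only illustrates the decomposition you would use in a language with native threads.

```python
import math
import os
from concurrent.futures import ThreadPoolExecutor

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
GX = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
GY = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def sobel_rows(img, row_start, row_end, threshold):
    """Binary edge map for output rows [row_start, row_end) (valid convolution)."""
    out_w = len(img[0]) - 2
    rows = []
    for y in range(row_start, row_end):
        row = []
        for x in range(out_w):
            gx = gy = 0.0
            for ky in range(3):
                for kx in range(3):
                    p = img[y + ky][x + kx]
                    gx += GX[ky][kx] * p
                    gy += GY[ky][kx] * p
            # Assumption: a pixel is an edge if the gradient magnitude
            # exceeds a hypothetical threshold.
            row.append(1 if math.hypot(gx, gy) > threshold else 0)
        rows.append(row)
    return rows


def sobel_edges(img, threshold=0.5, workers=None):
    """Split the output rows into one band per worker and process them in threads."""
    workers = workers or os.cpu_count() or 1
    out_h = len(img) - 2
    chunk = max(1, math.ceil(out_h / workers))
    bands = [(s, min(s + chunk, out_h)) for s in range(0, out_h, chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(
            lambda b: sobel_rows(img, b[0], b[1], threshold), bands))
    # Bands come back in submission order, so concatenation restores row order.
    return [row for part in parts for row in part]


def launch_grid(width, height, block=(16, 16)):
    """Thread groups needed to cover the (W-2)x(H-2) output with the given block."""
    out_w, out_h = width - 2, height - 2
    return (math.ceil(out_w / block[0]), math.ceil(out_h / block[1]))


if __name__ == "__main__":
    # A 5x5 image with a vertical step edge between columns 1 and 2.
    img = [[0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]
    print(sobel_edges(img, workers=2))   # [[1, 1, 0], [1, 1, 0], [1, 1, 0]]
    print(launch_grid(1920, 1080))       # (120, 68)
```

Under these assumptions a 1080p input yields a 1918x1078 output, which a GPU launch would cover with 120 x 68 groups of 16x16 threads, with threads on the right and bottom borders checking bounds before writing.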
Question 2b: Knowledge Distillation (2 hours)
- We have trained a large neural network for an image classification task and now want to deploy it in our mobile app. Unfortunately, the model is too large to run on mobile devices. We want to apply knowledge distillation to obtain a much smaller model that still achieves decent accuracy. If you're not familiar with knowledge distillation, have a look at this paper: Distilling the Knowledge in a Neural Network. We will work with the CIFAR-10 dataset. Download a pretrained ResNet model (e.g. for PyTorch or TensorFlow) and report its accuracy and inference time.
- Set up a significantly smaller model (e.g. up to 10 hidden layers), with any architecture of your choice, that can train quickly on your local machine. Train the small network on a combination of a distillation loss on the logits of the large network and a standard loss between the small network's output and the ground truth labels. Use a higher temperature for the softmax layer in the distillation loss. Train the smaller network for a while and report both the accuracy it achieves and its inference time. (Don't worry too much about finding perfect hyperparameters for high accuracy or training for many epochs. We are more interested in seeing whether you have developed a good training process for this problem, one that could be used to find specific hyperparameters later on.)
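The combined loss described above can be sketched for a single example in plain Python. This is an illustrative sketch under stated assumptions, not a prescribed formulation: the function names, the weighting factor `alpha`, the temperature of 4.0, and the use of KL divergence for the soft term are all choices made here for illustration. Scaling the soft term by the squared temperature follows the suggestion in the distillation paper referenced above.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by the temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - lse for s in scaled]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """alpha * soft-target loss + (1 - alpha) * hard-label cross-entropy.

    temperature and alpha are hypothetical defaults, not tuned values.
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    # Soft term: KL(teacher || student) on the temperature-softened distributions.
    kl = sum(math.exp(lt) * (lt - ls)
             for lt, ls in zip(log_p_teacher, log_p_student))
    # T^2 keeps the soft-term gradient scale comparable across temperatures.
    soft = kl * temperature ** 2
    # Hard term: standard cross-entropy against the ground truth (temperature 1).
    hard = -log_softmax(student_logits)[label]
    return alpha * soft + (1.0 - alpha) * hard


if __name__ == "__main__":
    teacher = [2.0, 0.0, -1.0]
    student = [0.5, 0.2, -0.3]
    print(distillation_loss(student, teacher, label=0))
```

In a real training loop the same computation would be applied per batch using the framework's tensor operations, with the teacher's logits computed without gradient tracking.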