A few papers have looked at how Transformers can be quantized to low precision [1], but more exploration is needed into how various forms of low-bitwidth quantization can be applied to Transformers. Our team has looked at how to emulate the effect of quantization on Transformer models in PyTorch. This summer internship would aim to explore further and extend our system to more models, number systems, datasets and learning tasks.
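
As an illustration of what is meant by emulating quantization, the following is a minimal sketch, not our existing framework: it applies simulated ("fake") low-bitwidth weight quantization to a toy PyTorch Transformer. The 4-bit setting, the symmetric per-tensor scaling and the toy model are assumptions made for the example.

```python
import torch
import torch.nn as nn

def fake_quantize(x: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Round x onto a symmetric fixed-point grid, then map back to float."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

# Toy Transformer encoder whose linear-layer weights are quantized in place.
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2,
)

with torch.no_grad():
    for module in model.modules():
        if isinstance(module, nn.Linear):
            module.weight.copy_(fake_quantize(module.weight, bits=4))

out = model(torch.randn(2, 10, 64))  # forward pass with emulated 4-bit weights
```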

The following quantization methods would be implemented in this project:

The student would also have to consider integrating the quantization with CUDA functions for run-time performance improvements [2].
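
One possible route, sketched below under the assumption that a CUDA toolchain is available, is to bind a small custom CUDA kernel to PyTorch with torch.utils.cpp_extension.load_inline. The kernel name fake_quant_cuda and the symmetric rounding scheme are illustrative choices for the example, not part of the existing system.

```python
import torch
from torch.utils.cpp_extension import load_inline

cuda_source = r"""
__global__ void fake_quant_kernel(const float* in, float* out, float scale,
                                  float qmin, float qmax, int64_t n) {
    int64_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float q = roundf(in[i] / scale);
        q = fminf(fmaxf(q, qmin), qmax);
        out[i] = q * scale;
    }
}

torch::Tensor fake_quant_cuda(torch::Tensor input, double scale, int64_t bits) {
    auto x = input.contiguous();
    auto out = torch::empty_like(x);
    const int64_t n = x.numel();
    const float qmax = static_cast<float>((1LL << (bits - 1)) - 1);
    const float qmin = -qmax - 1.0f;
    const int threads = 256;
    const int blocks = static_cast<int>((n + threads - 1) / threads);
    fake_quant_kernel<<<blocks, threads>>>(
        x.data_ptr<float>(), out.data_ptr<float>(),
        static_cast<float>(scale), qmin, qmax, n);
    return out;
}
"""

cpp_source = "torch::Tensor fake_quant_cuda(torch::Tensor input, double scale, int64_t bits);"

ext = load_inline(name="fake_quant_ext",
                  cpp_sources=cpp_source,
                  cuda_sources=cuda_source,
                  functions=["fake_quant_cuda"],
                  verbose=False)

x = torch.randn(1024, device="cuda")
y = ext.fake_quant_cuda(x, 0.05, 8)  # 8-bit symmetric fake quantization on GPU
```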

Skill requirements