Almost 6 months ago to the day, EleutherAI released GPT-J 6B, an open-source alternative to OpenAIs GPT-3. GPT-J 6B is the 6 billion parameter successor to EleutherAIs GPT-NEO family, a family of transformer-based language models based on the GPT architecture for text generation.

EleutherAI's primary goal is to train a model that is equivalent in size to GPT⁠-⁠3 and make it available to the public under an open license.

Over the last 6 months, GPT-J gained a lot of interest from Researchers, Data Scientists, and even Software Developers, but it remained very challenging to deploy GPT-J into production for real-world use cases and products.

There are some hosted solutions to use GPT-J for production workloads, like the Hugging Face Inference API, or for experimenting using EleutherAIs 6b playground, but fewer examples on how to easily deploy it into your own environment.

In this blog post, you will learn how to easily deploy GPT-J using Amazon SageMaker and the Hugging Face Inference Toolkit with a few lines of code for scalable, reliable, and secure real-time inference using a regular size GPU instance with NVIDIA T4 (~500$/m).

But before we get into it, I want to explain why deploying GPT-J into production is challenging.


The weights of the 6 billion parameter model represent a ~24GB memory footprint. To load it in float32, one would need at least 2x model size CPU RAM: 1x for initial weights and another 1x to load the checkpoint. So for GPT-J it would require at least 48GB of CPU RAM to just load the model.

To make the model more accessible, EleutherAI also provides float16 weights, and transformers has new options to reduce the memory footprint when loading large language models. Combining all this it should take roughly 12.1GB of CPU RAM to load the model.

from transformers import GPTJForCausalLM
import torch

model = GPTJForCausalLM.from_pretrained(

The caveat of this example is that it takes a very long time until the model is loaded into memory and ready for use. In my experiments, it took 3 minutes and 32 seconds to load the model with the code snippet above on a P3.2xlarge AWS EC2 instance (the model was not stored on disk). This duration can be reduced by storing the model already on disk, which reduces the load time to 1 minute and 23 seconds, which is still very long for production workloads where you need to consider scaling and reliability.

For example, Amazon SageMaker has a 60s limit for requests to respond, meaning the model needs to be loaded and the predictions to run within 60s, which in my opinion makes a lot of sense to keep the model/endpoint scalable and reliable for your workload. If you have longer predictions, you could use batch-transform.

In Transformers the models loaded with the from_pretrained method are following PyTorch's recommended practice, which takes around 1.97 seconds for BERT [REF]. PyTorch offers an additional alternative way of saving and loading models using, PATH) and torch.load(PATH).

“Saving a model in this way will save the entire module using Python’s pickle module. The disadvantage of this approach is that the serialized data is bound to the specific classes and the exact directory structure used when the model is saved.”

This means that when we save a model with transformers==4.13.2 it could be potentially incompatible when trying to load with transformers==4.15.0. However, loading models this way reduces the loading time by ~12x, down to 0.166s for BERT.

Applying this to GPT-J means that we can reduce the loading time from 1 minute and 23 seconds down to 7.7 seconds, which is ~10.5x faster.