Date: April 15, 2025 | Author: Rachit Jani, Umang Goyal

Training of LLMs and the Need for Efficient Fine-Tuning

Training large language models (LLMs) involves two major steps:

  1. Pretraining on large-scale general domain data.
  2. Fine-tuning for specific tasks or domains.

Initially, techniques like full fine-tuning were used. However, this approach is computationally expensive, as it updates every model parameter. If the model is initialized with pretrained weights $\Phi_0$, fine-tuning updates them to $\Phi_0 + \Delta\Phi$ to maximize the conditional language modeling objective:

$$ \max_{\Phi} \sum_{(x,y)\in \mathcal{Z}} \sum_{t=1}^{|y|} \log \left( P_{\Phi}(y_t \mid x, y_{<t}) \right) $$

Here, $\Delta\Phi$ has the same dimensionality as $\Phi_0$, making it parameter-inefficient and computationally heavy.
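For concreteness, below is a minimal sketch of what full fine-tuning looks like in practice, assuming a PyTorch-style causal LM that maps token IDs to per-token logits; `model`, `dataloader`, and the batch layout are hypothetical placeholders, not a specific library's API. The point is that the optimizer holds state for, and updates, every entry of $\Phi_0$.

```python
import torch

def full_finetune(model, dataloader, lr=1e-5, steps=1000):
    """Illustrative full fine-tuning loop: every parameter is trainable,
    so Delta-Phi (and the optimizer state) is as large as the model itself."""
    model.train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)  # all parameters updated
    loss_fn = torch.nn.CrossEntropyLoss(ignore_index=-100)    # -100 masks non-target tokens

    for step, batch in enumerate(dataloader):
        if step >= steps:
            break
        input_ids = batch["input_ids"]   # (batch, seq_len): x followed by y
        labels = batch["labels"]         # (batch, seq_len): y tokens, -100 elsewhere

        logits = model(input_ids)        # (batch, seq_len, vocab)
        # Shift by one position so the model at step t predicts y_t given x, y_<t.
        loss = loss_fn(
            logits[:, :-1, :].reshape(-1, logits.size(-1)),
            labels[:, 1:].reshape(-1),
        )

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```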

To address this challenge, several techniques have been proposed over time.

Techniques before LoRA

Prior to the introduction of LoRA (Low-Rank Adaptation), several alternative fine-tuning strategies were developed. Two notable approaches include:

1. Adapter Layers

Adapter layers are a lightweight fine-tuning method that introduces small, trainable modules into each transformer block while keeping the original model weights frozen. This significantly reduces the number of trainable parameters. While adapter-based methods are parameter-efficient, their major drawback is the increase in inference latency. Large language models rely heavily on hardware parallelism for speed, but since adapter layers are applied sequentially, they can become a bottleneck during inference.
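As a rough illustration of the idea (a sketch, not any particular library's implementation), the code below shows a bottleneck-style adapter and how it might be attached to a frozen backbone. The `model.blocks` attribute and the placement after each transformer block are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter sketch: down-project, non-linearity, up-project,
    plus a residual connection around the whole module."""

    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.GELU()
        # Near-identity initialization so the frozen model's behavior is
        # preserved at the start of fine-tuning.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return hidden + self.up(self.act(self.down(hidden)))

def add_adapters(model, d_model: int):
    # Freeze every pretrained parameter...
    for p in model.parameters():
        p.requires_grad = False
    # ...then attach a small trainable adapter to each block
    # (`model.blocks` is a hypothetical attribute used for illustration).
    for block in model.blocks:
        block.adapter = Adapter(d_model)
```

Because each adapter's matmuls must run one after the other inside every block, they add sequential work that cannot be hidden by parallelism, which is where the extra inference latency comes from.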

2. Optimizing Input Layer Activations (Prefix Tuning)

Another approach is Prefix Tuning, where a fixed set of trainable tokens (called prefixes) is prepended to the input sequence. These prefix tokens act as soft prompts that condition the model's behavior without modifying its internal parameters. A minimal sketch of the idea follows this paragraph.
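The sketch assumes a frozen backbone that consumes embeddings directly; `backbone`, `embed`, and the prefix length are illustrative placeholders rather than a specific library's API.

```python
import torch
import torch.nn as nn

class PrefixTuning(nn.Module):
    """Prefix-tuning sketch: a small matrix of trainable 'soft prompt'
    vectors is prepended to the token embeddings while the backbone
    and embedding table stay frozen."""

    def __init__(self, backbone: nn.Module, embed: nn.Embedding, prefix_len: int = 20):
        super().__init__()
        self.backbone = backbone
        self.embed = embed
        d_model = embed.embedding_dim
        # Only these prefix vectors receive gradients.
        self.prefix = nn.Parameter(torch.randn(prefix_len, d_model) * 0.02)
        for p in self.backbone.parameters():
            p.requires_grad = False
        for p in self.embed.parameters():
            p.requires_grad = False

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        tok = self.embed(input_ids)                              # (batch, seq, d_model)
        prefix = self.prefix.unsqueeze(0).expand(tok.size(0), -1, -1)
        # The prefix is prepended to every sequence, so it consumes part of
        # the usable context window.
        return self.backbone(torch.cat([prefix, tok], dim=1))
```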

However, this method has its downsides:

  1. The prefix occupies part of the model's context window, reducing the sequence length available for the actual task.
  2. The prefix parameters can be difficult to optimize, and performance does not improve monotonically as more trainable parameters are added.

These limitations led to the need for a more efficient and scalable method of fine-tuning—one that preserves inference speed, minimizes parameter count, and remains effective across diverse tasks.