How to fine-tune your own LLM

Learn how to fine-tune transformer-based language models using Hugging Face to create specialized AI for your specific use cases and domains.

This article is written by Andrei Danila, a Machine Learning Engineer.

 


 

Below is a step-by-step guide to fine-tuning a language model (in this case, GPT-2 or its variants) on a text dataset using Hugging Face’s transformers library and the datasets library.

We’ve also written a companion Colab notebook which can be accessed here.

Note: If you’d like a deeper explanation of how Transformers work, please refer to our article on ChatGPT and Transformers. We’ll keep this guide relatively high-level, focusing on the main steps you need to fine-tune a model.

 

Why Fine-Tune a Transformer Model?

Pretrained transformer models like GPT-2, BERT, and others come with a wealth of linguistic knowledge acquired from massive amounts of text data. However, these models are often general-purpose. If you want a model to focus on a specific style, topic, or domain, you can fine-tune it on a smaller dataset of text that is more relevant to your use case.

Fine-tuning:

  • Saves Training Time: You don’t have to train a model from scratch on huge corpora.
  • Requires Less Data: You only need enough domain-specific data to adapt the model’s generative or understanding capabilities.
  • Improves Performance on Niche Tasks: For specialized tasks (e.g., Shakespearean text generation or specific domain dialogue), fine-tuning can drastically improve the model’s output.

Note: If you’re on Colab, click on Runtime at the top of the page, select Change runtime type, select T4 GPU, then press Save. This will significantly speed up training.

 

 

1. Install Required Libraries

First, install the transformers and datasets libraries. These provide tools to load, process, and train state-of-the-art language models. Both libraries are created by Hugging Face, a leading provider of open-source ML infrastructure.


!pip install -qqq transformers datasets

 

2. Import Libraries

Next, import the libraries you’ll need in your Python environment or Colab notebook. Importing them “brings” the relevant classes and functions into our notebook.

import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
 
  • torch: The PyTorch library, which is the underlying framework for the transformers library.
  • datasets: A library that gives easy access to a wide range of popular NLP datasets.
  • transformers: The Hugging Face library for loading and using pretrained transformer models.

 

3. Set Up the Device

Modern deep learning libraries can leverage GPUs for faster training. We can automatically detect whether a GPU is available and use it, falling back to the CPU otherwise.

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)
 
  • If you’re running in Google Colab (with GPU enabled), you should see cuda.
  • If no GPU is found, it will print cpu.

 

4. Choose a Model and a Dataset

Here are two important choices:

  1. Model: The transformer model you want to start with. (distilgpt2 is a smaller variant of GPT-2.)
  2. Dataset: The text data you’d like to train/fine-tune on. (Here, we use the imdb movie review dataset as an example.)
model_name = "distilgpt2"  # Try: "gpt2", "gpt2-medium", etc.
dataset_name = "imdb"      # Could be: "yelp_polarity", "wiki40b", etc.
 
  • DistilGPT-2 is a lightweight version of GPT-2 (one of the first OpenAI models, ChatGPT’s ancestor), making it faster to train.
  • IMDB is a popular dataset for movie reviews, often used for sentiment analysis or text classification.

 

5. Load the Dataset

The load_dataset function from datasets makes it straightforward to load common datasets.

This is what our dataset looks like (you can browse it here on Hugging Face):

dataset = load_dataset(dataset_name)
print(dataset["train"][0]["text"][:250])

 

  • This will give you a sense of what the text data looks like.
  • The dataset is a dictionary-like object with keys like "train" and "test".

 

6. Tokenize the Data

Transformers deal with tokens: numeric representations of pieces of text. That is to say, they can’t process raw text directly, so a tokenizer maps each piece of text in its vocabulary (usually a word or part of a word) to a number (e.g. “dog” becomes 5). We use a Tokenizer to achieve this.

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Some GPT-style models do not come with a pad token by default.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    tokens = tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=128
    )
    # For causal language modeling, the labels are the same as the input IDs.
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["label"])

 

Key points:

  • padding='max_length' and truncation=True: Ensures all sequences are the same length (here, 128 tokens).
  • Labels for Causal LM: In a language modeling task, we predict the next token, so the labels are the same as the inputs.
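To make the padding and label-copying concrete, here is a pure-Python illustration of what happens to one short review. The token IDs are made up for illustration (real IDs come from the tokenizer’s vocabulary), though 50256 really is GPT-2’s end-of-text token, which doubles as its pad token here:

```python
# Illustrative token IDs for a short review; real values come from the tokenizer.
input_ids = [464, 3807, 373, 7818]   # e.g. "The movie was awful"
max_length = 8
pad_id = 50256                       # GPT-2 reuses its EOS token (ID 50256) for padding

# padding="max_length" pads every sequence to the same fixed length...
padded = input_ids + [pad_id] * (max_length - len(input_ids))

# ...and for causal language modeling the labels are simply a copy of the inputs.
# Internally, the model shifts them by one position: token t is used to predict token t+1.
labels = padded.copy()
pairs = list(zip(padded[:-1], labels[1:]))  # (context token, next token to predict)

print(padded)    # [464, 3807, 373, 7818, 50256, 50256, 50256, 50256]
print(pairs[0])  # (464, 3807): from "The", predict "movie"
```

This is why no separate label column is needed: the text itself provides both the inputs and the targets.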

 

7. Load the Model

We use a model class that’s designed for causal language modeling: AutoModelForCausalLM. The actual model loaded will be the one we specified above (e.g. distilgpt2). This is the model page for distilgpt2, from Hugging Face:

model = AutoModelForCausalLM.from_pretrained(model_name)
model.to(device)
 

Moving to device ensures the model is on the GPU (if available).

 

8. Quick Pre-Fine-Tuning Test

It’s often useful to see what the model outputs before fine-tuning. Let’s provide a simple prompt to gauge the model’s initial response.

prompt = "The movie was absolutely awful because"
inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_length=50, num_beams=5, no_repeat_ngram_size=2)
print(tokenizer.decode(outputs[0]))
 
  • max_length=50: Generate up to 50 tokens.
  • num_beams=5: Beam search with 5 beams (helps generate better text).
  • no_repeat_ngram_size=2: Avoid repeating the same phrase.
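To build intuition for no_repeat_ngram_size=2, here is a toy pure-Python sketch of the constraint it enforces: once a bigram (pair of consecutive tokens) has appeared, the decoder may not produce it again. This illustrates the rule only, not the actual beam-search implementation inside generate:

```python
def banned_next_tokens(generated, ngram_size=2):
    """Return tokens that would repeat an already-seen n-gram of the given size."""
    seen = set()
    for i in range(len(generated) - ngram_size + 1):
        seen.add(tuple(generated[i:i + ngram_size]))
    prefix = tuple(generated[-(ngram_size - 1):])  # last token(s) of the sequence
    return {ngram[-1] for ngram in seen if ngram[:-1] == prefix}

# "the" was already followed by "cat", so generating "cat" again
# would repeat the bigram ("the", "cat") and is therefore banned.
tokens = ["the", "cat", "sat", "on", "the"]
print(banned_next_tokens(tokens))  # {'cat'}
```

Larger values of no_repeat_ngram_size loosen the constraint (only longer phrases are blocked); 2 is aggressive but effective against the repetitive loops small models are prone to.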

This is a sample output:

The movie was absolutely awful because of the way it was made, and I don’t know if I’ll ever get to see it again, but I can’t wait for it to come out.

Not very good, as you can see.

 

9. Set Up Training Arguments

We specify how we want to train. This includes batch size, number of epochs, and where to save checkpoints.

training_args = TrainingArguments(
    output_dir="./distilgpt2-finetuned-imdb",
    evaluation_strategy="epoch",  # Evaluate once every epoch
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=1,
    weight_decay=0.01,
    report_to="none"              # Turn off logging to external services
)
 
  • output_dir: Where trained model files will be saved.
  • evaluation_strategy="epoch": Evaluate after each epoch.
  • learning_rate=2e-5: A lower LR helps avoid catastrophic forgetting but can also slow training. Feel free to experiment.
  • num_train_epochs=1: Training for just 1 epoch is often too little for a real project, but it’s good for a quick demo.
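A quick back-of-the-envelope calculation helps gauge how long an epoch will take. IMDB’s train split has 25,000 reviews, so with the per-device batch size of 2 above on a single GPU (as in the Colab setup assumed here):

```python
train_examples = 25_000            # size of the IMDB train split
per_device_train_batch_size = 2
num_gpus = 1                       # a single T4 on Colab

effective_batch_size = per_device_train_batch_size * num_gpus
steps_per_epoch = train_examples // effective_batch_size
print(steps_per_epoch)  # 12500 optimizer steps per epoch
```

Raising the batch size (if GPU memory allows) or adding gradient accumulation reduces the number of optimizer steps per epoch, which is the main lever on wall-clock training time here.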

 

10. Create a Trainer

The Trainer class wraps the model, your training arguments, and the datasets together.

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"]
)
 
  • train_dataset is the tokenized training split.
  • eval_dataset is the tokenized test split, used for evaluation.

 

11. Train the Model

Run the actual fine-tuning process. Depending on the size of your dataset and the power of your GPU, this could take a while.

trainer.train()
 

You’ll see a training loop with metrics like loss being printed out.

 

12. Evaluate

After training finishes, you can evaluate the model on the test set to see how it’s performing.

trainer.evaluate()
 
  • This yields metrics like eval_loss, which you can use to see if your model is improving.
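The eval_loss is the average cross-entropy per token, which is easier to interpret when converted to perplexity: simply the exponential of the loss. The loss value below is a made-up example, not an actual result from this run; in practice you would plug in the dictionary returned by trainer.evaluate():

```python
import math

metrics = {"eval_loss": 3.2}  # hypothetical value; use trainer.evaluate()'s output in practice
perplexity = math.exp(metrics["eval_loss"])
print(f"Perplexity: {perplexity:.1f}")  # e^3.2, roughly 24.5
```

Lower perplexity means the model is, on average, less "surprised" by the test text, so a drop after fine-tuning is a good sign.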

 

13. Test the Fine-Tuned Model

Finally, let’s see how the model’s output might have changed after fine-tuning. We’ll use the same prompt as before:

prompt = "The movie was absolutely awful because"
inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_length=50, num_beams=5, no_repeat_ngram_size=2)
print(tokenizer.decode(outputs[0]))
 

Compare this output to the pre-trained model’s response. Ideally, the fine-tuned model now has more relevant knowledge or style that you introduced during training (in this case, likely more knowledge of IMDB reviews).

Sample output:

The movie was absolutely awful because it was so bad. The acting was terrible, the plot was horrible, and the acting wasn’t as good as you’d expect from a movie like this. It was a waste of time and money to make this movie.

Much better than before we fine-tuned it!

 

Key Takeaways

  1. Pretrained Models: Transformers come loaded with general language understanding capabilities.
  2. Fine-Tuning: By providing your domain-specific dataset, you help the model adapt to a specialized style or topic.
  3. Efficiency: You don’t need enormous datasets to see improvements; often, a smaller domain-specific dataset is enough to significantly alter the model’s outputs.
  4. Experimentation: Adjust hyperparameters (learning rate, batch size, epochs) to strike a balance between training speed and performance.

 

That’s it! You’ve successfully fine-tuned a GPT-style language model with Hugging Face Transformers. Remember, this guide is flexible—feel free to swap out models, datasets, and parameters to fit your specific needs. If you want to dive deeper into how Transformers work, make sure to check out my article on Transformers. Happy fine-tuning!
