In this blog, we will first discuss the basics of text summarization, a high-level natural language processing (NLP) task. We will then explore how abstractive summarization works and how to implement it with Hugging Face transformers. Finally, we will look at how to evaluate summarization models. We will start with the basics and data processing for summarization before moving on to training and evaluation.
In today's digitally connected world, textual data is growing enormously, and it is becoming hard to keep up with it. For example, an online product that we want to buy may have thousands of reviews, far too many to read. A tool that sums up all those reviews and provides a concise summary would make our lives much easier. Similarly, consider an investigative journalist who wants to collect information on a specific event from various sources. What if a tool could generate a timeline summary of that event from previous news and other sources? This is where the NLP task of text summarization comes in. Text summarization can be divided into different categories depending on the nature of the task, as shown in the illustration below. Some important questions to ask when defining the task are:
What kind of input is given: a single document or multiple documents, and a query-focused or generic summary.
What type of summarizer we want: extractive summarization, which selects sentences from the original input, vs. abstractive summarization, which generates a human-like summary that conveys the salient information coherently.
What kind of summary is required: extreme summarization, which produces a title or one-line summary, vs. an abstract-like, multi-sentence summary.
In what language the summary is required: monolingual summarization, where the summary is in the source language, vs. cross-lingual summarization, where it is in another target language.
A summarization problem can be a mix and match of these categories. So, what is the formal definition of text summarization, and what are the properties a good summary should have? The former part of the question is easy to answer, and the latter is trickier.
By definition, text summarization is a high-level NLP task that takes a text as an input and produces its summary as an output. A summary should contain salient information about the given text. In terms of properties, a good summary should be at least fluent, well-structured, and coherent. Depending on the nature of the task, there can be some additional properties. However, measuring these properties, such as coherence and fluency, is not a straightforward task and requires human effort.
Let’s check a black-box example of summarization with transformers, where we provide an input text to a summarizer and it generates the output summary. This example intentionally covers a simplified version of summarization where we only provide the input and get the output.
Let’s understand how a summarization model can be trained, tested, and evaluated on a given dataset.
We need some building blocks for training an abstractive summarization model. Let’s check the flowchart below, and then we’ll discuss these building blocks.
Firstly, we need a summarization dataset where each instance consists of a text-summary pair. We feed these instances, together with a sequence-to-sequence (S2S) model, to the training loop, which is responsible for training the model. The model is trained by showing it examples of input and expected output (reference summary). This kind of training is called supervised learning. The data given to the training loop is called the training set (for now, let's ignore the dev set). Now, the trained model can produce outputs for given texts. Suppose we saved a chunk of the dataset that wasn't used during training, called the test set. We provide the trained model and the input texts to the testing loop, which generates the output summaries for all the given inputs. This step is also called inference.
Here, a question arises: how do we know that the generated summaries are accurate? To confirm this, we need a metric that can assess the outputs. This is called the automatic evaluation of a model. We provide the output summaries and their reference summaries (which we did not use during testing) to the evaluation loop, which assesses the summaries by comparing them. The outcome of the evaluation is a set of evaluation scores. This way, we can measure the quality of the model output. We can also give the outputs to human annotators to assess their quality, which is called human evaluation. The outcome of any kind of evaluation is a set of scores that indicate how well or poorly a model has been trained.
Enough theoretical discussion! Let's move to the implementation of text summarization. Summarization is a hot topic, and almost all big tech companies have developed libraries and tools for it. However, this blog will focus on the Hugging Face Transformers implementation of abstractive summarization.
Here comes a question: What is Hugging Face (HF)? Let's make it easy. Think of HF as an umbrella for the AI community, providing a platform with several open-source datasets, models, pipelines, and evaluation metrics, along with community discussions. An AI developer can find almost everything required on Hugging Face.
For abstractive summarization with HF, we need a dataset, a pipeline or a pretrained model for training and inference, and evaluation metrics for summarization. Luckily, all the pieces required (data, models, evaluation metrics) are already provided by HF. Let's understand how these pieces work one by one, starting with the data processing for summarization.
First, we need a summarization dataset in which each instance consists of a text and a reference summary (or summaries). We split the data into train, development (dev), and test sets. We train a summarization model with the train and dev sets, while the test set remains unseen by the trained model so we can measure its performance. The split ratio can vary according to the domain and problem, but the most common ratios are 80/10/10 or 90/5/5 for train/dev/test, respectively.
HF has a variety of summarization datasets, ranging from news articles to long scientific papers, and from monolingual datasets in many languages to cross-lingual datasets. The code snippet below shows how to load an existing HF dataset.
from datasets import load_dataset

dataset = load_dataset("grammarly/pseudonymization-data")
Other examples of summarization datasets are CNN/Daily Mail, XSum, Multi-News, Amazon Reviews Multi, and arXiv. Either the dataset already provides separate files for the splits, or the splitting can be done in code, as sketched below. We can also use custom datasets with HF models.
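If we do need to create splits in code, a minimal sketch using the datasets library's train_test_split is shown below. The CNN/DailyMail dataset and the 90/5/5 ratio are illustrative choices, not requirements.

```python
from datasets import load_dataset

# Load only the train portion and carve our own dev/test splits from it
raw = load_dataset("cnn_dailymail", "3.0.0", split="train")

# 90/5/5 split: first hold out 10% of the data, then split that half and half
split = raw.train_test_split(test_size=0.1, seed=42)
dev_test = split["test"].train_test_split(test_size=0.5, seed=42)

train_set, dev_set, test_set = split["train"], dev_test["train"], dev_test["test"]
```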
Now that we have our dataset, the next step is to process it before forwarding it to the model for training. For this processing, we need a tokenizer and a data collator. The tokenizer is responsible for tokenizing the data and maintaining a vocabulary. These days, byte pair encoding (BPE) or sub-word tokenization techniques are popular as they reduce the vocabulary size effectively.
There are plenty of different tokenizers available on HF. However, it is important to use the same tokenizer as the model. For example, if we want to use a pretrained model such as BART, the data tokenization must be performed by the BART tokenizer. In many cases, we want to make our code flexible so we can reuse it with various models. The good news is that HF provides this flexibility with Auto Classes. In the code snippet below, we use AutoTokenizer, which loads the appropriate tokenizer from_pretrained for the model of our choice.
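The snippet in question is sketched below, modeled on Hugging Face's summarization examples; the exact attribute names on model_args are assumptions that mirror the parameters discussed next.

```python
from transformers import AutoTokenizer

# Load the tokenizer that matches the chosen model; model_args holds the
# command-line parameters described below.
tokenizer = AutoTokenizer.from_pretrained(
    model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
    cache_dir=model_args.cache_dir,
    use_fast=model_args.use_fast_tokenizer,
    revision=model_args.model_revision,
    use_auth_token=True if model_args.use_auth_token else None,  # newer releases use token= instead (see the update notes later)
)
```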
Don’t worry about model_args; these are parameters given with the execution command for the code, maintained with HF's helper class AutoConfig, which ensures the correct parameter mappings for data, models, and metrics. The tokenizer_name parameter selects the model whose tokenizer we want, cache_dir specifies the folder to use if we want to change the HF cache location, and use_fast_tokenizer selects a speedy tokenizer implementation based on the Rust library. The model_revision parameter selects a specific version of a model, and the use_auth_token parameter supplies a bearer token for accessing remote files on the datasets hub.
Remember: Tokenizers are responsible for tokenization, truncation, padding of data, and adding special tokens. Tokenizers are also responsible for encoding (text-to-vector) and decoding (vector-to-text).
If we are working with multilingual or cross-lingual data, we have to set source and target languages. We also have to set forced_bos_token_id for the decoding.
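As a hedged illustration, here is how the source and target languages and forced_bos_token_id could be set for an mBART-50 checkpoint; the language codes are example choices.

```python
from transformers import AutoTokenizer

# Cross-lingual setup sketch using mBART-50 as an illustrative model
tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer.src_lang = "en_XX"  # language of the input documents
tokenizer.tgt_lang = "de_DE"  # language of the summaries

# Force the decoder to start with the target-language token; pass this value
# as forced_bos_token_id when calling model.generate(...)
forced_bos_token_id = tokenizer.lang_code_to_id["de_DE"]
```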
Now, we have initialized our tokenizer, but we haven't applied it to our data. For this, we create a function that applies the tokenizer to each set (train, dev, and test), covering both the text (input) and the reference summary (target).
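Below is a minimal sketch of such a preprocessing function, modeled on Hugging Face's run_summarization examples. Names like text_column, summary_column, prefix, padding, and max_target_length are assumed to be defined earlier, and the line numbers in the commentary that follows refer to the original snippet, so they may not align exactly with this sketch.

```python
def preprocess_function(examples):
    inputs = examples[text_column]
    targets = examples[summary_column]

    # Some pretrained models (e.g., T5) expect a task prefix such as "summarize: "
    inputs = [prefix + inp for inp in inputs]
    model_inputs = tokenizer(
        inputs, max_length=data_args.max_source_length, padding=padding, truncation=True
    )

    # Tokenize the reference summaries on the target (decoder) side
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            targets, max_length=max_target_length, padding=padding, truncation=True
        )

    # Replace padding token ids in the labels with -100 so they are ignored in the loss
    if padding == "max_length" and data_args.ignore_pad_token_for_loss:
        labels["input_ids"] = [
            [(tok if tok != tokenizer.pad_token_id else -100) for tok in label]
            for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```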
In line 8, we add a prefix at the start of each input text. For some pretrained models, adding the task name as a prefix is required because these models are trained for multiple NLP tasks. Then, in line 9, we apply the tokenizer to our inputs, where max_length=data_args.max_source_length sets the maximum accepted length of the given input (2048 tokens at most), padding=padding determines whether to pad texts shorter than max_length, and truncation=True truncates any text longer than the maximum length. In lines 12–13, with tokenizer.as_target_tokenizer() switches the tokenizer to the decoding (target) side, and then we tokenize our summaries, where max_length=max_target_length sets the maximum length of the target. We don't want to include the padding token in the loss calculations for model optimization, so lines 17–18 ensure that the padding token is ignored.
Now, we have our tokenizer all set; however, we can't process all the data simultaneously due to resource limitations. We need batch processing to handle chunks of data over multiple iterations. This is where the DataCollator helps: it loads data into memory as batches and also shuffles the instances if enabled. The code snippet below shows a DataCollator for summarization.
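A minimal sketch of that collator setup might look like the following; it assumes the tokenizer loaded earlier, the model we will fine-tune, and a data_args.ignore_pad_token_for_loss flag like the one used above.

```python
from transformers import DataCollatorForSeq2Seq

# -100 is the conventional "ignore" index for the loss
label_pad_token_id = -100 if data_args.ignore_pad_token_for_loss else tokenizer.pad_token_id

data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8,  # pad each batch to a multiple of 8 tokens
)
```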
The DataCollatorForSeq2Seq constructor (for the time being, just ignore Seq2Seq) accepts the selected tokenizer and model along with label_pad_token_id to ignore it during loss calculation. It also takes pad_to_multiple_of=8 for padding to a multiple of the given value.
Batch size depends on many factors—the length of input and output text, size of the loaded model and tokenizer, and specifications of hardware resources (GPU memory).
The Transformers library has evolved since the original version of this tutorial. Some arguments and methods have been deprecated, and adopting the new APIs will make your code more future-proof.
Replace use_auth_token with the new token argument when loading models and tokenizers.
Use max_new_tokens instead of max_length when generating text, or configure generation settings through GenerationConfig.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn", token="YOUR_HF_TOKEN")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

config = GenerationConfig(
    max_new_tokens=128,
    num_beams=4,
    no_repeat_ngram_size=3,
)

inputs = tokenizer("Your input text here", return_tensors="pt")
summary = model.generate(**inputs, generation_config=config)
print(tokenizer.decode(summary[0], skip_special_tokens=True))
The blog currently mentions a 2048-token maximum input length, but that’s not always true anymore.
Many modern summarization models now support much longer contexts — up to 16k tokens or more.
If you’re summarizing long documents (e.g., research papers, legal texts, reports), consider models designed for extended input lengths:
LED (Longformer Encoder-Decoder): Great for multi-page documents.
BigBird-Pegasus: Scales to long contexts, optimized for scientific and medical texts.
LongT5: Extends T5 with efficient attention for long inputs; instruction-tuned, long-context LLM summarizers (e.g., MPT-based models) are another option.
Alternatively, use a chunking strategy — split text into smaller sections, summarize individually, then combine results.
Summarization models can be large and resource-intensive.
In 2025, it’s common practice to make them more efficient using quantization and parameter-efficient fine-tuning (PEFT) techniques.
Quantization: Reduces memory usage by storing model weights in 8-bit or 4-bit precision.
PEFT / LoRA / QLoRA: Fine-tune large models on modest hardware by updating only a small fraction of parameters.
Example — 4-bit Quantization:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained("google/pegasus-xsum")
model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/pegasus-xsum",
    quantization_config=bnb_config,
    device_map="auto",
)
This approach significantly reduces GPU memory requirements while maintaining accuracy.
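Example (sketch): LoRA fine-tuning with PEFT. The snippet below shows how a LoRA adapter could be attached with the peft library; the rank, alpha, and target_modules values are illustrative assumptions for a BART-style model, not prescribed settings.

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

# Attach a small set of trainable LoRA adapters to the attention projections
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trainable
```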
While ROUGE remains popular for summarization evaluation, it doesn’t capture everything.
Modern projects now include semantic similarity and factual accuracy metrics for more comprehensive assessment.
Key metrics:
BERTScore: Measures semantic similarity between the generated summary and reference text.
QuestEval: Evaluates factual consistency via Q&A-style verification.
SummaC: Checks whether summaries are faithful and avoid hallucinations.
Example:
import evaluate

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

rouge_results = rouge.compute(predictions=preds, references=refs)
bertscore_results = bertscore.compute(predictions=preds, references=refs, lang="en")
Adding these metrics helps you understand model performance beyond surface-level overlap.
pipeline("summarization") is great for quick experiments, but direct model usage offers more control in production systems.
This allows you to:
Fine-tune generation parameters (num_beams, top_p, temperature)
Optimize performance (quantization, device mapping)
Experiment with advanced decoding for factual summaries
Tip: Start with the pipeline for prototypes; switch to custom inference for scalable deployments.
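For reference, a quick prototype with the high-level pipeline API can be as short as the sketch below; the model name is one common choice, not a requirement.

```python
from transformers import pipeline

# Quick prototype: one call from raw text to summary
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
text = "Your long input text here ..."
print(summarizer(text, max_length=130, min_length=30, do_sample=False)[0]["summary_text"])
```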
For texts exceeding even long-context model limits, use a map-reduce summarization approach:
Split the text into manageable chunks.
Summarize each chunk individually.
Combine those summaries into a cohesive final output.
This technique scales to hundreds of pages and works well for books, legal documents, and academic papers.
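A rough sketch of this map-reduce idea is shown below. Chunking by character count is a simplifying assumption; token-aware chunking is more precise in practice.

```python
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def summarize_long_text(text, chunk_chars=3000):
    # Split the text into manageable chunks
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    # Map step: summarize each chunk individually
    partial = [summarizer(c, max_length=130, min_length=30)[0]["summary_text"] for c in chunks]
    # Reduce step: summarize the concatenated chunk summaries into one output
    return summarizer(" ".join(partial), max_length=200, min_length=60)[0]["summary_text"]
```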
The tutorial currently uses grammarly/pseudonymization-data, which isn’t a summarization dataset.
To make examples more realistic, switch to one of these:
CNN/DailyMail: News summarization.
XSum: Extreme summarization (one-sentence outputs).
GovReport / PubMed: Long-document summarization tasks.
Using relevant datasets makes examples more meaningful and easier to follow.
By now, we have discussed that when using HF, we can either use a pipeline or a pretrained model for text summarization. In this part, we will discuss fine-tuning a pretrained abstractive summarization model. Interestingly, HF provides two options: with or without the Trainer class. The Trainer class provides an efficient API for feature-complete training across various tasks; we only need to pass all the hyperparameters, our dataset, and the model of our choice. However, if we opt out of using the Trainer class (which is our case), we must write the training loop ourselves. Let's discuss the key pieces of the training loop.
HF has a variety of summarization models, such as BERT for extractive summarization, BART, T5, Pegasus, ProphetNet, BigBird, and so on. Some are trained for multiple tasks and on various and/or multilingual datasets. Depending on the parameters and model size, different variations of models are available on HF. Some examples of BART are mentioned below:
facebook/bart-base
facebook/bart-large-cnn
shahm/bart-german
facebook/mbart-large-50-many-to-many-mmt
eugenesiow/bart-paraphrase
We can set our model either in the code or it can be provided via model_args. The code snippet below shows how to load a pretrained model for fine-tuning.
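A hedged sketch of what that snippet usually looks like with Auto Classes is shown below; model_args.model_name_or_path, cache_dir, and model_revision mirror the arguments discussed earlier.

```python
from transformers import AutoConfig, AutoModelForSeq2SeqLM

config = AutoConfig.from_pretrained(
    model_args.model_name_or_path,
    cache_dir=model_args.cache_dir,
)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_args.model_name_or_path,
    config=config,
    cache_dir=model_args.cache_dir,
    revision=model_args.model_revision,
)
model.resize_token_embeddings(len(tokenizer))  # keep embeddings in sync with the tokenizer
```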
The data must be processed by the selected model type. For example, if we select BART as our summarization model, then we need to use the BART tokenizer.
Earlier, we saw how to process the data for the summarization task. Now, let's discuss how it can be loaded for training.
DataLoader is a PyTorch class used for optimized and efficient loading of data into GPU memory. At this stage, data instances are converted into tensors. For the training set, we usually enable shuffling so that instances are presented in a different order on each run, maintaining the randomness of experiments. Now, our train and dev sets are ready to be processed in training.
DataLoader works together with DataCollator: the DataLoader class is responsible for transforming instances into tensors, while those instances have already been shaped by the DataCollator (padding, batch indexing, and so on). The DataLoader class also helps with parallel processing of data instances.
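A minimal sketch of the loaders described above is given below; it assumes the tokenized train_dataset and eval_dataset, the data_collator from the earlier processing step, and typical batch-size hyperparameters on args.

```python
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    train_dataset,
    shuffle=True,                 # reshuffle training instances every epoch
    collate_fn=data_collator,     # pads and batches instances into tensors
    batch_size=args.per_device_train_batch_size,
)
eval_dataloader = DataLoader(
    eval_dataset,
    collate_fn=data_collator,
    batch_size=args.per_device_eval_batch_size,
)
```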
For training neural networks, we require an optimizer to adjust the trainable parameters during training so as to minimize the loss. Optimization algorithms such as Gradient Descent, Stochastic Gradient Descent, Adam, and Adafactor enable the model to learn from data by iteratively updating its weights and biases; the update rule, learning rate, and momentum depend on the chosen algorithm.
It is important to note that weights and biases are learnable parameters of the model, while the learning rate is a hyperparameter we provide initially. The learning rate used by the optimizer is adjusted with the help of the learning rate scheduler (lr_scheduler), which makes the learning rate adaptive to improve performance and reduce training time. We set our optimizer (AdamW) and lr_scheduler in the code snippet below. We use the length of train_dataloader and the args.gradient_accumulation_steps hyperparameter to compute num_update_steps_per_epoch.
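The snippet below is a condensed sketch of this setup in the style of Hugging Face's run_summarization_no_trainer example; the argument names on args are assumptions, and the line numbers in the review that follows refer to the original, fuller snippet rather than this sketch.

```python
import math

from torch.optim import AdamW
from transformers import get_scheduler

optimizer = AdamW(model.parameters(), lr=args.learning_rate, weight_decay=args.weight_decay)

# Number of optimizer updates per epoch, accounting for gradient accumulation
num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)

if args.max_train_steps is None:
    args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch
else:
    args.num_train_epochs = math.ceil(args.max_train_steps / num_update_steps_per_epoch)

lr_scheduler = get_scheduler(
    name=args.lr_scheduler_type,
    optimizer=optimizer,
    num_warmup_steps=args.num_warmup_steps,
    num_training_steps=args.max_train_steps,
)
```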
Now let’s review the above code:
In lines 12–16, args.max_train_steps is used to specify the maximum number of training steps. It represents the total number of optimization steps we want to perform during training.
If args.max_train_steps is not provided (None), it is calculated as the product of args.num_train_epochs and num_update_steps_per_epoch. In other words, it determines the maximum training steps based on the number of epochs and updates per epoch.
If args.max_train_steps is already set, it calculates the number of training epochs required to reach this maximum number of steps.
In lines 18–21, we initialize a learning rate scheduler of our choice—in this case, args.lr_scheduler_type. We also set num_warmup_steps and num_training_steps to configure the learning rate warm-up and total training steps for the scheduler.
Remember earlier when we discussed the optimization of our model during training? The optimization target is to minimize the loss. There are different loss functions (cross-entropy, mean squared error, mean absolute error, KL divergence, and so on) that can be used for optimization; the most common one for summarization is cross-entropy. However, it is also convenient to add a summarization metric to get insights during training. We'll use ROUGE, a standard evaluation metric for the summarization task, to observe the behavior of our model at each epoch.
ROUGE (R) is an n-gram-based metric that evaluates n-gram overlaps between the system output and the reference summary. R-1 (unigram), R-2 (bigram), and R-L (longest common subsequence) are the most commonly reported variants.
Now, we are ready to set up our training logs and checkpoints. We need to log some information to check if everything is working.
Let’s review the code snippet below:
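The snippet in question is a condensed sketch of the logging and checkpoint-resume logic, modeled on Hugging Face's run_summarization_no_trainer example; it assumes args, logger, accelerator, train_dataset, and train_dataloader from earlier, and the line numbers in the review bullets refer to the original, fuller snippet.

```python
import os

total_batch_size = (
    args.per_device_train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps
)

logger.info("***** Running training *****")
logger.info(f"  Num examples = {len(train_dataset)}")
logger.info(f"  Num epochs = {args.num_train_epochs}")
logger.info(f"  Batch size per device = {args.per_device_train_batch_size}")
logger.info(f"  Total train batch size (incl. accumulation) = {total_batch_size}")
logger.info(f"  Gradient accumulation steps = {args.gradient_accumulation_steps}")
logger.info(f"  Total optimization steps = {args.max_train_steps}")

completed_steps = 0
starting_epoch = 0

# Resume from a saved checkpoint if one was provided
if args.resume_from_checkpoint:
    accelerator.load_state(args.resume_from_checkpoint)
    path = os.path.basename(args.resume_from_checkpoint)
    training_difference = os.path.splitext(path)[0]

    # Checkpoints may be saved per epoch ("epoch_X") or per step ("step_X")
    if "epoch" in training_difference:
        starting_epoch = int(training_difference.replace("epoch_", "")) + 1
        resume_step = None
    else:
        resume_step = int(training_difference.replace("step_", ""))
        starting_epoch = resume_step // len(train_dataloader)
        resume_step -= starting_epoch * len(train_dataloader)
```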
In line 2, total_batch_size is computed.
In lines 3–9, we log important information about our training. The logger object will create a txt file with all this information during training, which we can check during training to ensure that everything is working correctly.
Checkpoints are used to resume training in case it is interrupted (power failure, GPU timeout, CUDA out of memory, and so on).
In lines 12–23, the resume_from_checkpoint variable decides whether to resume the training or not. If it resumes, it also determines which checkpoint to load.
In lines 25–31, we check whether the given checkpoint is an epoch or a step (we can set a hyperparameter for how we want to save our checkpoints).
Now, we will set up our training loop with the total number of training epochs.
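The loop under discussion is sketched below in the style of Hugging Face's accelerate-based run_summarization_no_trainer example; it assumes the objects prepared above (model, optimizer, lr_scheduler, train_dataloader, accelerator, args, checkpointing_steps, resume_step), and the line numbers in the review that follows refer to the original, fuller snippet.

```python
for epoch in range(starting_epoch, args.num_train_epochs):
    model.train()
    if args.with_tracking:
        total_loss = 0

    for step, batch in enumerate(train_dataloader):
        # Skip already-seen steps when resuming mid-epoch from a checkpoint
        if args.resume_from_checkpoint and epoch == starting_epoch:
            if resume_step is not None and step < resume_step:
                completed_steps += 1
                continue

        # Forward pass and loss for the current batch
        outputs = model(**batch)
        loss = outputs.loss
        if args.with_tracking:
            total_loss += loss.detach().float()

        # Scale the loss for gradient accumulation and backpropagate
        loss = loss / args.gradient_accumulation_steps
        accelerator.backward(loss)

        # Take an optimization step every gradient_accumulation_steps batches
        if step % args.gradient_accumulation_steps == 0 or step == len(train_dataloader) - 1:
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            completed_steps += 1

        # Save a checkpoint every `checkpointing_steps` optimization steps
        if isinstance(checkpointing_steps, int) and completed_steps % checkpointing_steps == 0:
            accelerator.save_state(f"step_{completed_steps}")  # in practice, prefix with args.output_dir

        # Stop once the maximum number of training steps is reached
        if completed_steps >= args.max_train_steps:
            break
```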
Let’s review the code snippet above:
In line 2, we set the model to training mode, which is necessary to enable features like dropout and batch normalization that behave differently during training and evaluation.
In lines 3–4, if the args.with_tracking flag is enabled, we initialize total_loss variable to zero to keep track of the total loss during the current epoch.
From lines 6–34, we have an inner loop for batch processing. This loop iterates over train_dataloader to process a batch of data at a time.
In lines 7–8, if the training process is resuming from a checkpoint—as specified with args.resume_from_checkpoint—it skips steps until it reaches the step where training was paused. This is to avoid reprocessing data already processed before the interruption.
In lines 10–11, we compute a forward pass of the model with the current batch of data and calculate the loss. The loss is typically a measure of how well the model’s predictions match the actual target values.
In lines 14–15, if tracking is enabled, the loss from the current batch is added to the total_loss. The .detach().float() part ensures that the loss is treated as a float and detached from the computation graph.
In lines 16–17, we compute the scaled loss by dividing the loss by args.gradient_accumulation_steps. This helps mimic training with bigger batch sizes, especially when we quickly run out of CUDA memory. We backpropagate the scaled loss with accelerator.backward() to accumulate gradients. For example, assume we want to train our model with a batch size of 32, but our input and output sizes prevent us from using it. We can instead run 32 iterations with a batch size of 1, accumulate the gradients (each loss scaled by 1/32), and then take a single optimizer step, which is roughly equivalent to training with a batch size of 32.
In lines 19–24, an optimization step is taken—weights are updated—if either the current step is a multiple of args.gradient_accumulation_steps or it’s the last step in the epoch. After each optimization step, the learning rate is adjusted using the learning rate scheduler (lr_scheduler). Gradients are zeroed with optimizer.zero_grad() to prepare for the next batch.
In lines 26–31, based on checkpointing_steps, we check if the current step is a multiple of checkpointing_steps. If it is, the model checkpoint is saved. This is a common practice to save model progress during training.
Lines 32–34 ensure that the number of completed training steps does not exceed the maximum allowed training steps (args.max_train_steps). Otherwise, the training loop is terminated.
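The evaluation loop discussed next is sketched below in the same accelerate-based style; it assumes model, tokenizer, metric, eval_dataloader, accelerator, and args from the earlier steps, and the line numbers in the review refer to the original, fuller snippet.

```python
import numpy as np
import torch

model.eval()
if args.val_max_target_length is None:
    args.val_max_target_length = args.max_target_length

gen_kwargs = {"max_length": args.val_max_target_length, "num_beams": args.num_beams}
samples_seen = 0

for step, batch in enumerate(eval_dataloader):
    with torch.no_grad():
        # Generate summaries for the current batch
        generated_tokens = accelerator.unwrap_model(model).generate(
            batch["input_ids"],
            attention_mask=batch["attention_mask"],
            **gen_kwargs,
        )
        # Pad/gather across processes, move to CPU, and decode to text
        generated_tokens = accelerator.pad_across_processes(
            generated_tokens, dim=1, pad_index=tokenizer.pad_token_id
        )
        generated_tokens = accelerator.gather(generated_tokens).cpu().numpy()
        labels = accelerator.gather(batch["labels"]).cpu().numpy()

        # Replace the -100 ignore index with the pad token so decoding works
        labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
        decoded_preds = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
        decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

        # Accumulate predictions and references for the ROUGE computation
        metric.add_batch(predictions=decoded_preds, references=decoded_labels)

# Compute ROUGE once all batches have been processed (outside the inner loop)
result = metric.compute(use_stemmer=True)
```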
Let’s review the code snippet above:
In line 3, we set the model to evaluation mode, which disables training-specific behavior such as dropout and switches layers like batch normalization to inference statistics. This ensures that the model's evaluation is consistent and does not include randomness introduced by these operations.
In lines 4–5, we set the maximum target sequence length for generation during evaluation. If args.val_max_target_length is not specified, it is set to args.max_target_length.
In lines 7–8, we set a dictionary—gen_kwargs—that contains various generation settings. It specifies parameters for generating target sequences, such as the maximum length and the number of beams to use during generation. Next, we have samples_seen = 0, which we use to keep track of the number of samples processed during evaluation.
From lines 11–49, we have an inner loop for batch processing. This loop iterates over eval_dataloader to process a single batch of data at a time. Each batch contains input data and target sequences.
In line 12, we use with torch.no_grad() to ensure that the following operations are not tracked for gradient computation. We don’t need to compute gradients during evaluation because we’re not training the model.
In lines 13–38, we generate text sequences from the model by calling its generate method and providing input IDs and other parameters specified in gen_kwargs. The generated tokens represent the model’s predictions for the target sequences. Then, we post-process the generated tokens and the reference (target) labels. This includes padding, converting tokens to CPU and NumPy arrays, handling special tokens, and decoding the token sequences into human-readable text.
In line 49, the decoded predictions and reference labels are used to compute evaluation metrics. Remember that we set ROUGE as our metric.
In lines 51–56, we save the results of our metric. The use_stemmer=True argument indicates that stemming should be used when comparing the generated text to reference text. Stemming reduces words to their root forms, which can help match different inflections or forms of the same word. Please note that these lines are outside of the inner loop (eval_dataloader).
Well-formatted versions of all the code snippets are available in the accompanying GitHub repository.