In this blog, we will first discuss the basics of text summarization, a high-level natural language processing (NLP) task. We will then explore how abstractive summarization works and how to implement it with Hugging Face transformers. Finally, we will look at how to evaluate summarization models. We will start with the basics and data processing for summarization before moving on to training and evaluation.
In today's digitally connected world, textual data is growing enormously, and it is becoming hard to keep up with it. For example, an online product that we want to buy may have thousands of reviews, far too many to read. A tool that sums up all those reviews and provides a concise summary would make our lives much easier. Similarly, consider an investigative journalist who wants to collect information on a specific event from various sources. What if a tool could generate a timeline summary of that event from previous news and other sources? This is where the NLP task of text summarization comes in. Text summarization can be divided into different categories depending on the nature of the task, as shown in the illustration below. Some important questions to ask when defining the task are:
What kind of input is given: a single document or multiple documents, and a query-focused or generic summary.
What type of summarizer we want: extractive summarization, which selects sentences from the original input, vs. abstractive summarization, which generates a human-like summary that conveys the salient information coherently.
What kind of summary is required: extreme summarization, which produces a title or one-line summary, vs. an abstract-like, multi-sentence summary.
In what language the summary is required: monolingual summarization, where the summary is in the source language, vs. cross-lingual summarization, where it is in another target language.
A summarization problem can be a mix and match of these categories. So, what is the formal definition of text summarization, and what are the properties a good summary should have? The former part of the question is easy to answer, and the latter is trickier.
By definition, text summarization is a high-level NLP task that takes a text as an input and produces its summary as an output. A summary should contain salient information about the given text. In terms of properties, a good summary should be at least fluent, well-structured, and coherent. Depending on the nature of the task, there can be some additional properties. However, measuring these properties, such as coherence and fluency, is not a straightforward task and requires human effort.
Let’s check a black-box example of summarization with transformers, where we provide an input text to a summarizer and it generates the output summary. This example intentionally covers a simplified version of summarization where we only provide the input and get the output.
Let’s understand how a summarization model can be trained, tested, and evaluated on a given dataset.
We need some building blocks for training an abstractive summarization model. Let’s check the flowchart below, and then we’ll discuss these building blocks.
Firstly, we need a summarization dataset where each instance consists of a text-summary pair. We feed these instances, together with a sequence-to-sequence (S2S) model, to the training loop, which is responsible for training the model. The model is trained by showing it examples of input and expected output (reference summary). This kind of training is called supervised learning. The data given to the training loop is called the training set (for now, let's ignore the dev set). Now, the trained model can produce outputs for given texts. Suppose we saved a chunk of the dataset that wasn't used during training, called the test set. We provide the trained model and the input texts to the testing loop, which generates the output summaries for all the given inputs. This step is also called inference.
Here, a question arises: how do we know that the generated summaries are accurate? To confirm this, we need a metric that can assess the outputs. This is called the automatic evaluation of a model. We provide the output summaries and their reference summaries (which we did not use during testing) to the evaluation loop, which assesses the summaries by comparing them. The outcome of the evaluation is a set of evaluation scores. This way, we can measure the quality of the model output. We can also give the outputs to human annotators to assess their quality, which is called human evaluation. The outcome of any kind of evaluation is a set of scores that indicate how well or poorly a model has been trained.
Enough theoretical discussion! Let's move to the implementation of text summarization. Summarization is a hot topic, and almost all big tech companies have developed libraries and tools for it. However, this blog will focus on the Hugging Face Transformers implementation of abstractive summarization.
Here comes a question: What is Hugging Face (HF)? Let's make it easy. Think of HF as an umbrella for the AI community, providing a platform with several open-source datasets, models, pipelines, and evaluation metrics, along with community discussions. An AI developer can find almost everything required on Hugging Face.
For abstractive summarization with HF, we need a dataset, a pipeline or a pretrained model for training and inference, and evaluation metrics for summarization. Luckily, all the pieces required (data, models, evaluation metrics) are already provided by HF. Let's understand how these pieces work one by one, starting with the data processing for summarization.
First, we need a summarization dataset in which each instance consists of a text and a reference summary (or summaries). We split the data into train, development (dev), and test sets. We train a summarization model with the train and dev sets, while the test set remains unseen by the trained model so we can measure its performance. The split ratio can vary according to the domain and problem, but the most common ratios are 80/10/10 or 90/5/5 for train/dev/test, respectively.
HF has a variety of summarization datasets, ranging from news articles to long scientific papers, and from monolingual datasets in many languages to cross-lingual datasets. The code snippet below shows how to load an existing HF dataset.
from datasets import load_dataset

dataset = load_dataset("grammarly/pseudonymization-data")
Other examples of summarization datasets are CNN/Daily Mail, XSum, Multi-News, Amazon Reviews Multi, and arXiv. Either the dataset already provides separate files for the splits, or the splitting can be done in code, as sketched below. We can also use custom datasets with HF models.
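If we do need to create splits in code, a minimal sketch using the datasets library's train_test_split is shown below. The CNN/DailyMail dataset and the 90/5/5 ratio are illustrative choices, not requirements.

```python
from datasets import load_dataset

# Load only the train portion and carve our own dev/test splits from it
raw = load_dataset("cnn_dailymail", "3.0.0", split="train")

# 90/5/5 split: first hold out 10% of the data, then split that half and half
split = raw.train_test_split(test_size=0.1, seed=42)
dev_test = split["test"].train_test_split(test_size=0.5, seed=42)

train_set, dev_set, test_set = split["train"], dev_test["train"], dev_test["test"]
```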
Now that we have our dataset, the next step is to process it before forwarding it to the model for training. For this processing, we need a tokenizer and a data collator. The tokenizer is responsible for tokenizing the data and maintaining a vocabulary. These days, byte pair encoding (BPE) or sub-word tokenization techniques are popular as they reduce the vocabulary size effectively.
There are plenty of different tokenizers available on HF. However, it is important to use the same tokenizer as the model. For example, if we want to use a pretrained model such as BART, the data tokenization must be performed by the BART tokenizer. In many cases, we want to make our code flexible so we can reuse it with various models. The good news is that HF provides this flexibility with Auto Classes. In the code snippet below, we use AutoTokenizer, which loads the appropriate tokenizer from_pretrained for the model of our choice.
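The snippet in question is sketched below, modeled on Hugging Face's summarization examples; the exact attribute names on model_args are assumptions that mirror the parameters discussed next.

```python
from transformers import AutoTokenizer

# Load the tokenizer that matches the chosen model; model_args holds the
# command-line parameters described below.
tokenizer = AutoTokenizer.from_pretrained(
    model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
    cache_dir=model_args.cache_dir,
    use_fast=model_args.use_fast_tokenizer,
    revision=model_args.model_revision,
    use_auth_token=True if model_args.use_auth_token else None,  # newer releases use token= instead (see the update notes later)
)
```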
Don’t worry about model_args; these are parameters given with the execution command for the code, maintained with HF's helper class AutoConfig, which ensures the correct parameter mappings for data, models, and metrics. The tokenizer_name parameter selects the model whose tokenizer we want, cache_dir specifies the folder to use if we want to change the HF cache location, and use_fast_tokenizer selects a speedy tokenizer implementation based on the Rust library. The model_revision parameter selects a specific version of a model, and the use_auth_token parameter supplies a bearer token for accessing remote files on the datasets hub.
Remember: Tokenizers are responsible for tokenization, truncation, padding of data, and adding special tokens. Tokenizers are also responsible for encoding (text-to-vector) and decoding (vector-to-text).
If we are working with multilingual or cross-lingual data, we have to set source and target languages. We also have to set forced_bos_token_id for the decoding.
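As a hedged illustration, here is how the source and target languages and forced_bos_token_id could be set for an mBART-50 checkpoint; the language codes are example choices.

```python
from transformers import AutoTokenizer

# Cross-lingual setup sketch using mBART-50 as an illustrative model
tokenizer = AutoTokenizer.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer.src_lang = "en_XX"  # language of the input documents
tokenizer.tgt_lang = "de_DE"  # language of the summaries

# Force the decoder to start with the target-language token; pass this value
# as forced_bos_token_id when calling model.generate(...)
forced_bos_token_id = tokenizer.lang_code_to_id["de_DE"]
```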
Now, we have initialized our tokenizer, but we haven't applied it to our data. For this, we create a function that applies the tokenizer to each set (train, dev, and test), covering both the text (input) and the reference summary (target).
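Below is a minimal sketch of such a preprocessing function, modeled on Hugging Face's run_summarization examples. Names like text_column, summary_column, prefix, padding, and max_target_length are assumed to be defined earlier, and the line numbers in the commentary that follows refer to the original snippet, so they may not align exactly with this sketch.

```python
def preprocess_function(examples):
    inputs = examples[text_column]
    targets = examples[summary_column]

    # Some pretrained models (e.g., T5) expect a task prefix such as "summarize: "
    inputs = [prefix + inp for inp in inputs]
    model_inputs = tokenizer(
        inputs, max_length=data_args.max_source_length, padding=padding, truncation=True
    )

    # Tokenize the reference summaries on the target (decoder) side
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            targets, max_length=max_target_length, padding=padding, truncation=True
        )

    # Replace padding token ids in the labels with -100 so they are ignored in the loss
    if padding == "max_length" and data_args.ignore_pad_token_for_loss:
        labels["input_ids"] = [
            [(tok if tok != tokenizer.pad_token_id else -100) for tok in label]
            for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```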
In line 8, we add a prefix at the start of each input text. For some pretrained models, adding the task name as a prefix is required because these models are trained for multiple NLP tasks. Then, in line 9, we apply the tokenizer to our inputs, where max_length=data_args.max_source_length sets the maximum accepted length of the given input (2048 tokens at most), padding=padding determines whether to pad texts shorter than max_length, and truncation=True truncates any text longer than the maximum length. In lines 12–13, with tokenizer.as_target_tokenizer() switches the tokenizer to the decoding (target) side, and then we tokenize our summaries, where max_length=max_target_length sets the maximum length of the target. We don't want to include the padding token in the loss calculations for model optimization, so lines 17–18 ensure that the padding token is ignored.
Now, we have our tokenizer all set; however, we can't process all the data simultaneously due to resource limitations. We need batch processing to handle chunks of data over multiple iterations. This is where the DataCollator helps: it loads data into memory as batches and also shuffles the instances if enabled. The code snippet below shows a DataCollator for summarization.
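A minimal sketch of that collator setup might look like the following; it assumes the tokenizer loaded earlier, the model we will fine-tune, and a data_args.ignore_pad_token_for_loss flag like the one used above.

```python
from transformers import DataCollatorForSeq2Seq

# -100 is the conventional "ignore" index for the loss
label_pad_token_id = -100 if data_args.ignore_pad_token_for_loss else tokenizer.pad_token_id

data_collator = DataCollatorForSeq2Seq(
    tokenizer,
    model=model,
    label_pad_token_id=label_pad_token_id,
    pad_to_multiple_of=8,  # pad each batch to a multiple of 8 tokens
)
```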
The DataCollatorForSeq2Seq constructor (for the time being, just ignore Seq2Seq) accepts the selected tokenizer and model along with label_pad_token_id to ignore it during loss calculation. It also takes pad_to_multiple_of=8 for padding to a multiple of the given value.
Batch size depends on many factors—the length of input and output text, size of the loaded model and tokenizer, and specifications of hardware resources (GPU memory).
The Transformers library has evolved since the original version of this tutorial. Some arguments and methods have been deprecated, and adopting the new APIs will make your code more future-proof.
Replace use_auth_token with the new token argument when loading models and tokenizers.
Use max_new_tokens instead of max_length when generating text, or configure generation settings through GenerationConfig.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, GenerationConfig

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn", token="YOUR_HF_TOKEN")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

config = GenerationConfig(
    max_new_tokens=128,
    num_beams=4,
    no_repeat_ngram_size=3,
)

inputs = tokenizer("Your input text here", return_tensors="pt")
summary = model.generate(**inputs, generation_config=config)
print(tokenizer.decode(summary[0], skip_special_tokens=True))
The blog currently mentions a 2048-token maximum input length, but that’s not always true anymore.
Many modern summarization models now support much longer contexts — up to 16k tokens or more.
If you’re summarizing long documents (e.g., research papers, legal texts, reports), consider models designed for extended input lengths:
LED (Longformer Encoder-Decoder): Great for multi-page documents.
BigBird-Pegasus: Scales to long contexts, optimized for scientific and medical texts.
LongT5: Extends T5 with efficient attention for long inputs; instruction-tuned, long-context LLM summarizers (e.g., MPT-based models) are another option.
Alternatively, use a chunking strategy — split text into smaller sections, summarize individually, then combine results.
Summarization models can be large and resource-intensive.
In 2025, it’s common practice to make them more efficient using quantization and parameter-efficient fine-tuning (PEFT) techniques.
Quantization: Reduces memory usage by storing model weights in 8-bit or 4-bit precision.
PEFT / LoRA / QLoRA: Fine-tune large models on modest hardware by updating only a small fraction of parameters.
Example — 4-bit Quantization:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained("google/pegasus-xsum")
model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/pegasus-xsum",
    quantization_config=bnb_config,
    device_map="auto",
)
This approach significantly reduces GPU memory requirements while maintaining accuracy.
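Example (sketch): LoRA fine-tuning with PEFT. The snippet below shows how a LoRA adapter could be attached with the peft library; the rank, alpha, and target_modules values are illustrative assumptions for a BART-style model, not prescribed settings.

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

# Attach a small set of trainable LoRA adapters to the attention projections
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trainable
```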
While ROUGE remains popular for summarization evaluation, it doesn’t capture everything.
Modern projects now include semantic similarity and factual accuracy metrics for more comprehensive assessment.
Key metrics:
BERTScore: Measures semantic similarity between the generated summary and reference text.
QuestEval: Evaluates factual consistency via Q&A-style verification.
SummaC: Checks whether summaries are faithful and avoid hallucinations.
Example:
import evaluate

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

rouge_results = rouge.compute(predictions=preds, references=refs)
bertscore_results = bertscore.compute(predictions=preds, references=refs, lang="en")
Adding these metrics helps you understand model performance beyond surface-level overlap.
pipeline("summarization") is great for quick experiments, but direct model usage offers more control in production systems.
This allows you to:
Fine-tune generation parameters (num_beams, top_p, temperature)
Optimize performance (quantization, device mapping)
Experiment with advanced decoding for factual summaries
Tip: Start with the pipeline for prototypes; switch to custom inference for scalable deployments.
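For reference, a quick prototype with the high-level pipeline API can be as short as the sketch below; the model name is one common choice, not a requirement.

```python
from transformers import pipeline

# Quick prototype: one call from raw text to summary
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
text = "Your long input text here ..."
print(summarizer(text, max_length=130, min_length=30, do_sample=False)[0]["summary_text"])
```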
For texts exceeding even long-context model limits, use a map-reduce summarization approach:
Split the text into manageable chunks.
Summarize each chunk individually.
Combine those summaries into a cohesive final output.
This technique scales to hundreds of pages and works well for books, legal documents, and academic papers.
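A rough sketch of this map-reduce idea is shown below. Chunking by character count is a simplifying assumption; token-aware chunking is more precise in practice.

```python
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def summarize_long_text(text, chunk_chars=3000):
    # Split the text into manageable chunks
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    # Map step: summarize each chunk individually
    partial = [summarizer(c, max_length=130, min_length=30)[0]["summary_text"] for c in chunks]
    # Reduce step: summarize the concatenated chunk summaries into one output
    return summarizer(" ".join(partial), max_length=200, min_length=60)[0]["summary_text"]
```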
The tutorial currently uses grammarly/pseudonymization-data, which isn’t a summarization dataset.
To make examples more realistic, switch to one of these:
CNN/DailyMail: News summarization.
XSum: Extreme summarization (one-sentence outputs).
GovReport / PubMed: Long-document summarization tasks.
Using relevant datasets makes examples more meaningful and easier to follow.
By now, we have discussed that when using HF, we can either use a pipeline or a pretrained model for text summarization. In this part, we will discuss fine-tuning a pretrained abstractive summarization model. Interestingly, HF provides two options: with or without the Trainer class. The Trainer class provides an efficient API for feature-complete training across various tasks; we only need to pass all the hyperparameters, our dataset, and the model of our choice. However, if we opt out of using the Trainer class (which is our case), we must write the training loop ourselves. Let's discuss the key pieces of the training loop.
HF has a variety of summarization models, such as BERT for extractive summarization, BART, T5, Pegasus, ProphetNet, BigBird, and so on. Some are trained for multiple tasks and on various and/or multilingual datasets. Depending on the parameters and model size, different variations of models are available on HF. Some examples of BART are mentioned below:
facebook/bart-base
facebook/bart-large-cnn
shahm/bart-german
facebook/mbart-large-50-many-to-many-mmt
eugenesiow/bart-paraphrase
We can set our model either in the code or it can be provided via model_args. The code snippet below shows how to load a pretrained model for fine-tuning.
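A hedged sketch of what that snippet usually looks like with Auto Classes is shown below; model_args.model_name_or_path, cache_dir, and model_revision mirror the arguments discussed earlier.

```python
from transformers import AutoConfig, AutoModelForSeq2SeqLM

config = AutoConfig.from_pretrained(
    model_args.model_name_or_path,
    cache_dir=model_args.cache_dir,
)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_args.model_name_or_path,
    config=config,
    cache_dir=model_args.cache_dir,
    revision=model_args.model_revision,
)
model.resize_token_embeddings(len(tokenizer))  # keep embeddings in sync with the tokenizer
```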
The data must be processed by the selected model type. For example, if we select BART as our summarization model, then we need to use the BART tokenizer.
Earlier, we saw how to process the data for the summarization task. Now, let's discuss how it can be loaded for training.
DataLoader is a PyTorch class used for optimized and efficient loading of data into GPU memory. At this stage, data instances are converted into tensors. For the training set, we usually enable shuffling so that instances are presented in a different order on each run, maintaining the randomness of experiments. Now, our train and dev sets are ready to be processed in training.
DataLoader works together with DataCollator: the DataLoader class is responsible for transforming instances into tensors, while those instances have already been shaped by the DataCollator (padding, batch indexing, and so on). The DataLoader class also helps with parallel processing of data instances.
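A minimal sketch of the loaders described above is given below; it assumes the tokenized train_dataset and eval_dataset, the data_collator from the earlier processing step, and typical batch-size hyperparameters on args.

```python
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    train_dataset,
    shuffle=True,                 # reshuffle training instances every epoch
    collate_fn=data_collator,     # pads and batches instances into tensors
    batch_size=args.per_device_train_batch_size,
)
eval_dataloader = DataLoader(
    eval_dataset,
    collate_fn=data_collator,
    batch_size=args.per_device_eval_batch_size,
)
```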
For training neural networks, we require an optimizer to adjust the trainable parameters during training so as to minimize the loss. Optimization algorithms such as Gradient Descent, Stochastic Gradient Descent, Adam, and Adafactor enable the model to learn from data by iteratively updating its weights and biases; the update rule, learning rate, and momentum depend on the chosen algorithm.
It is important to note that weights and biases are learnable parameters of the model, while the learning rate is a hyperparameter we provide initially. The learning rate used by the optimizer is adjusted with the help of the learning rate scheduler (lr_scheduler), which makes the learning rate adaptive to improve performance and reduce training time. We set our optimizer (AdamW) and lr_scheduler in the code snippet below. We use the length of train_dataloader and the args.gradient_accumulation_steps hyperparameter to compute num_update_steps_per_epoch.
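The snippet below is a condensed sketch of this setup in the style of Hugging Face's run_summarization_no_trainer example; the argument names on args are assumptions, and the line numbers in the review that follows refer to the original, fuller snippet rather than this sketch.

```python
import math

from torch.optim import AdamW
from transformers import get_scheduler

optimizer = AdamW(model.parameters(), lr=args.learning_rate, weight_decay=args.weight_decay)

# Number of optimizer updates per epoch, accounting for gradient accumulation
num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)

if args.max_train_steps is None:
    args.max_train_steps = args.num_train_epochs * num_update_steps_per_epoch
else:
    args.num_train_epochs = math.ceil(args.max_train_steps / num_update_steps_per_epoch)

lr_scheduler = get_scheduler(
    name=args.lr_scheduler_type,
    optimizer=optimizer,
    num_warmup_steps=args.num_warmup_steps,
    num_training_steps=args.max_train_steps,
)
```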
Now let’s review the above code:
In lines 12–16, args.max_train_steps is used to specify the maximum number of training steps. It represents the total number of optimization steps we want to perform during training.
If args.max_train_steps is not provided (None), it is calculated as the product of args.num_train_epochs and num_update_steps_per_epoch. In other words, it determines the maximum training steps based on the number of epochs and updates per epoch.
If args.max_train_steps is already set, it calculates the number of training epochs required to reach this maximum number of steps.
In lines 18–21, we initialize a learning rate scheduler of our choice—in this case, args.lr_scheduler_type. We also set num_warmup_steps and num_training_steps to configure the learning rate warm-up and total training steps for the scheduler.
Remember earlier when we discussed the optimization of our model during training? The optimization target is to minimize the loss. There are different loss functions (cross-entropy, mean squared error, mean absolute error, KL divergence, and so on) that can be used for optimization; the most common one for summarization is cross-entropy. However, it is also convenient to add a summarization metric to get insights during training. We'll use ROUGE, a standard evaluation metric for the summarization task, to observe the behavior of our model at each epoch.
ROUGE (R) is an n-gram-based metric that evaluates n-gram overlaps between the system output and the reference summary. R-1 (unigram), R-2 (bigram), and R-L (longest common subsequence) are the most commonly reported variants.
Now, we are ready to set up our training logs and checkpoints. We need to log some information to check if everything is working.
Let’s review the code snippet below:
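The snippet in question is a condensed sketch of the logging and checkpoint-resume logic, modeled on Hugging Face's run_summarization_no_trainer example; it assumes args, logger, accelerator, train_dataset, and train_dataloader from earlier, and the line numbers in the review bullets refer to the original, fuller snippet.

```python
import os

total_batch_size = (
    args.per_device_train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps
)

logger.info("***** Running training *****")
logger.info(f"  Num examples = {len(train_dataset)}")
logger.info(f"  Num epochs = {args.num_train_epochs}")
logger.info(f"  Batch size per device = {args.per_device_train_batch_size}")
logger.info(f"  Total train batch size (incl. accumulation) = {total_batch_size}")
logger.info(f"  Gradient accumulation steps = {args.gradient_accumulation_steps}")
logger.info(f"  Total optimization steps = {args.max_train_steps}")

completed_steps = 0
starting_epoch = 0

# Resume from a saved checkpoint if one was provided
if args.resume_from_checkpoint:
    accelerator.load_state(args.resume_from_checkpoint)
    path = os.path.basename(args.resume_from_checkpoint)
    training_difference = os.path.splitext(path)[0]

    # Checkpoints may be saved per epoch ("epoch_X") or per step ("step_X")
    if "epoch" in training_difference:
        starting_epoch = int(training_difference.replace("epoch_", "")) + 1
        resume_step = None
    else:
        resume_step = int(training_difference.replace("step_", ""))
        starting_epoch = resume_step // len(train_dataloader)
        resume_step -= starting_epoch * len(train_dataloader)
```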
In line 2, total_batch_size is computed.
In lines 3–9, we log important information about our training. The logger object will create a txt file with all this information during training, which we can check during training to ensure that everything is working correctly.
Checkpoints are used to resume training in case it is interrupted (power failure, GPU timeout, CUDA out of memory, and so on).
In lines 12–23, the resume_from_checkpoint variable decides whether to resume the training or not. If it resumes, it also determines which checkpoint to load.
In lines 25–31, we check whether the given checkpoint is an epoch or a step (we can set a hyperparameter for how we want to save our checkpoints).
Now, we will set up our training loop with the total number of training epochs.
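The loop under discussion is sketched below in the style of Hugging Face's accelerate-based run_summarization_no_trainer example; it assumes the objects prepared above (model, optimizer, lr_scheduler, train_dataloader, accelerator, args, checkpointing_steps, resume_step), and the line numbers in the review that follows refer to the original, fuller snippet.

```python
for epoch in range(starting_epoch, args.num_train_epochs):
    model.train()
    if args.with_tracking:
        total_loss = 0

    for step, batch in enumerate(train_dataloader):
        # Skip already-seen steps when resuming mid-epoch from a checkpoint
        if args.resume_from_checkpoint and epoch == starting_epoch:
            if resume_step is not None and step < resume_step:
                completed_steps += 1
                continue

        # Forward pass and loss for the current batch
        outputs = model(**batch)
        loss = outputs.loss
        if args.with_tracking:
            total_loss += loss.detach().float()

        # Scale the loss for gradient accumulation and backpropagate
        loss = loss / args.gradient_accumulation_steps
        accelerator.backward(loss)

        # Take an optimization step every gradient_accumulation_steps batches
        if step % args.gradient_accumulation_steps == 0 or step == len(train_dataloader) - 1:
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            completed_steps += 1

        # Save a checkpoint every `checkpointing_steps` optimization steps
        if isinstance(checkpointing_steps, int) and completed_steps % checkpointing_steps == 0:
            accelerator.save_state(f"step_{completed_steps}")  # in practice, prefix with args.output_dir

        # Stop once the maximum number of training steps is reached
        if completed_steps >= args.max_train_steps:
            break
```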
Let’s review the code snippet above:
In line 2, we set the model to training mode, which is necessary to enable features like dropout and batch normalization that behave differently during training and evaluation.
In lines 3–4, if the args.with_tracking flag is enabled, we initialize total_loss variable to zero to keep track of the total loss during the current epoch.
From lines 6–34, we have an inner loop for batch processing. This loop iterates over train_dataloader to process a batch of data at a time.
In lines 7–8, if the training process is resuming from a checkpoint—as specified with args.resume_from_checkpoint—it skips steps until it reaches the step where training was paused. This is to avoid reprocessing data already processed before the interruption.
In lines 10–11, we compute a forward pass of the model with the current batch of data and calculate the loss. The loss is typically a measure of how well the model’s predictions match the actual target values.
In lines 14–15, if tracking is enabled, the loss from the current batch is added to the total_loss. The .detach().float() part ensures that the loss is treated as a float and detached from the computation graph.
In lines 16–17, we compute the scaled loss by dividing the loss by args.gradient_accumulation_steps. This helps mimic training with bigger batch sizes, especially when we quickly run out of CUDA memory. We backpropagate the scaled loss with accelerator.backward() to accumulate gradients. For example, assume we want to train our model with a batch size of 32, but our input and output sizes prevent us from using it. We can instead run 32 iterations with a batch size of 1, accumulate the gradients (each loss scaled by 1/32), and then take a single optimizer step, which is roughly equivalent to training with a batch size of 32.
In lines 19–24, an optimization step is taken—weights are updated—if either the current step is a multiple of args.gradient_accumulation_steps or it’s the last step in the epoch. After each optimization step, the learning rate is adjusted using the learning rate scheduler (lr_scheduler). Gradients are zeroed with optimizer.zero_grad() to prepare for the next batch.
In lines 26–31, based on checkpointing_steps, we check if the current step is a multiple of checkpointing_steps. If it is, the model checkpoint is saved. This is a common practice to save model progress during training.
Lines 32–34 ensure that the number of completed training steps does not exceed the maximum allowed training steps (args.max_train_steps). Otherwise, the training loop is terminated.
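The evaluation loop discussed next is sketched below in the same accelerate-based style; it assumes model, tokenizer, metric, eval_dataloader, accelerator, and args from the earlier steps, and the line numbers in the review refer to the original, fuller snippet.

```python
import numpy as np
import torch

model.eval()
if args.val_max_target_length is None:
    args.val_max_target_length = args.max_target_length

gen_kwargs = {"max_length": args.val_max_target_length, "num_beams": args.num_beams}
samples_seen = 0

for step, batch in enumerate(eval_dataloader):
    with torch.no_grad():
        # Generate summaries for the current batch
        generated_tokens = accelerator.unwrap_model(model).generate(
            batch["input_ids"],
            attention_mask=batch["attention_mask"],
            **gen_kwargs,
        )
        # Pad/gather across processes, move to CPU, and decode to text
        generated_tokens = accelerator.pad_across_processes(
            generated_tokens, dim=1, pad_index=tokenizer.pad_token_id
        )
        generated_tokens = accelerator.gather(generated_tokens).cpu().numpy()
        labels = accelerator.gather(batch["labels"]).cpu().numpy()

        # Replace the -100 ignore index with the pad token so decoding works
        labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
        decoded_preds = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
        decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

        # Accumulate predictions and references for the ROUGE computation
        metric.add_batch(predictions=decoded_preds, references=decoded_labels)

# Compute ROUGE once all batches have been processed (outside the inner loop)
result = metric.compute(use_stemmer=True)
```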
Let’s review the code snippet above:
In line 3, we set the model to evaluation mode, which disables training-specific behavior such as dropout and switches layers like batch normalization to inference statistics. This ensures that the model's evaluation is consistent and does not include randomness introduced by these operations.
In lines 4–5, we set the maximum target sequence length for generation during evaluation. If args.val_max_target_length is not specified, it is set to args.max_target_length.
In lines 7–8, we set a dictionary—gen_kwargs—that contains various generation settings. It specifies parameters for generating target sequences, such as the maximum length and the number of beams to use during generation. Next, we have samples_seen = 0, which we use to keep track of the number of samples processed during evaluation.
From lines 11–49, we have an inner loop for batch processing. This loop iterates over eval_dataloader to process a single batch of data at a time. Each batch contains input data and target sequences.
In line 12, we use with torch.no_grad() to ensure that the following operations are not tracked for gradient computation. We don’t need to compute gradients during evaluation because we’re not training the model.
In lines 13–38, we generate text sequences from the model by calling its generate method and providing input IDs and other parameters specified in gen_kwargs. The generated tokens represent the model’s predictions for the target sequences. Then, we post-process the generated tokens and the reference (target) labels. This includes padding, converting tokens to CPU and NumPy arrays, handling special tokens, and decoding the token sequences into human-readable text.
In line 49, the decoded predictions and reference labels are used to compute evaluation metrics. Remember that we set ROUGE as our metric.
In lines 51–56, we save the results of our metric. The use_stemmer=True argument indicates that stemming should be used when comparing the generated text to reference text. Stemming reduces words to their root forms, which can help match different inflections or forms of the same word. Please note that these lines are outside of the inner loop (eval_dataloader).
Well-formatted versions of all the code snippets are available in the accompanying GitHub repository.