What is the BLEU evaluation metric?

Evaluation metrics are quantitative measures of a machine learning model's performance. They are essential for understanding how well our model performs on a specific task.

What is BLEU?

BLEU (Bilingual Evaluation Understudy) is a metric commonly used in NLP to evaluate the quality of generated text. The BLEU metric compares the generated text to one or more references and assigns a score based on the word overlap between the two texts. The more words in common, the higher the BLEU score.

Here, we'll be calculating the BLEU score for machine-generated text summarization, where the generated text is referred to as the candidate summary.

How to calculate the BLEU score

The following are the steps to calculate the BLEU score:

  1. Calculate the precision for each n-gram.

  2. Compute the geometric mean of the precision scores.

  3. Apply the Brevity Penalty (BP).

  4. Calculate the BLEU score.

Calculate the precision for each n-gram

We calculate the precision for each n-gram to measure how well the candidate summary matches the reference summary. Common values for n include 1 (unigrams), 2 (bigrams), 3 (trigrams), and sometimes 4.

The formula to calculate precision for an n-gram is:

precision_n = (clipped count of candidate n-grams found in the reference) / (total number of n-grams in the candidate)

Each matching n-gram's count is clipped at its count in the reference, so a word repeated many times in the candidate cannot inflate the score.
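As a concrete illustration, the clipped (modified) precision can be sketched in plain Python. The helper names ngrams and ngram_precision below are ours, not from any library:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision: each match is capped at its count in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(count, ref_counts[g]) for g, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

reference = ['Machine', 'learning', 'is', 'a', 'subset', 'of', 'artificial', 'intelligence']
candidate = ['Machine', 'learning', 'is', 'seen', 'as', 'a', 'subset', 'of', 'artificial', 'intelligence']

# 8 of the 10 candidate unigrams appear in the reference
print(ngram_precision(candidate, reference, 1))  # 0.8
```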

Compute the geometric mean of the precision scores

After computing the n-gram precisions, we compute their weighted geometric mean:

geometric mean = exp( Σ w_n · log(precision_n) ), for n = 1 to N

Usually, we use N = 4 and uniform weights w_n = 1/4.
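A minimal sketch of the weighted geometric mean (the function name is ours; the four precision values plugged in are the unigram-through-4-gram precisions of the running candidate/reference pair):

```python
import math

def geometric_mean(precisions, weights=None):
    """Weighted geometric mean of the n-gram precisions (uniform weights by default)."""
    if weights is None:
        weights = [1 / len(precisions)] * len(precisions)
    if any(p == 0 for p in precisions):
        return 0.0  # a single zero precision drives the whole product to zero
    return math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))

# Precisions for n = 1..4 from the running example
print(geometric_mean([0.8, 6/9, 0.5, 2/7]))  # ≈ 0.525
```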

Apply the Brevity Penalty

A Brevity Penalty (BP) adjusts the BLEU score if the candidate summary is shorter than the reference summary. It is calculated as follows:

BP = 1, if cand_length > ref_length
BP = exp(1 − ref_length / cand_length), otherwise

Here:

  • cand_length: Length of candidate summary.

  • ref_length: Length of reference summary.
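The penalty can be sketched directly from the piecewise definition above (the function name is ours):

```python
import math

def brevity_penalty(cand_length, ref_length):
    """Penalize candidates that are shorter than the reference; no penalty otherwise."""
    if cand_length > ref_length:
        return 1.0
    return math.exp(1 - ref_length / cand_length)

print(brevity_penalty(10, 8))  # 1.0 (candidate longer than reference: no penalty)
print(brevity_penalty(6, 8))   # ≈ 0.717
```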

Calculate the BLEU score

The final BLEU score is calculated by multiplying the Brevity Penalty by the geometric mean of the precisions. The formula to calculate the BLEU score is:

BLEU = BP × exp( Σ w_n · log(precision_n) ), for n = 1 to N

The BLEU score ranges from 0 to 1, with higher values indicating a higher quality of the summary.
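Putting the steps together, the whole computation can be sketched from scratch. The function names below are ours; this is a bare-bones illustration of the formula, not a replacement for a library implementation:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision against a single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

def bleu(candidate, reference, max_n=4):
    """BLEU = brevity penalty x geometric mean of n-gram precisions, n = 1..max_n."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; the score collapses to zero
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * geo_mean

reference = ['Machine', 'learning', 'is', 'a', 'subset', 'of', 'artificial', 'intelligence']
candidate = ['Machine', 'learning', 'is', 'seen', 'as', 'a', 'subset', 'of', 'artificial', 'intelligence']
print(bleu(candidate, reference))  # ≈ 0.525
```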

Code

Now, let's see how to calculate the BLEU score using Python.

import nltk

reference_summary = ['Machine', 'learning', 'is', 'a', 'subset', 'of', 'artificial', 'intelligence']
candidate_summary = ['Machine', 'learning', 'is', 'seen', 'as', 'a', 'subset', 'of', 'artificial', 'intelligence']

BLEUscore = nltk.translate.bleu_score.sentence_bleu([reference_summary], candidate_summary)
print(BLEUscore)

Code explanation

  • Line 1: We import the nltk library, which is used widely in the field of NLP.

  • Line 3: We define a reference_summary variable and set its value to “Machine learning is a subset of artificial intelligence”.

  • Line 4: We define a candidate_summary variable and set its value to “Machine learning is seen as a subset of artificial intelligence”.

  • Line 6: We calculate the BLEU score using the sentence_bleu() function from the nltk.translate.bleu_score module. Note that sentence_bleu() expects a list of reference summaries, which is why reference_summary is wrapped in a list.

  • Line 7: We print the BLEU score for the provided candidate summary (approximately 0.525 for this example).

Copyright ©2024 Educative, Inc. All rights reserved