Evaluation metrics are quantitative measures used to evaluate the performance of machine learning models. They are essential for understanding how well or poorly our model performs on a specific task.
BLEU (Bilingual Evaluation Understudy) is an evaluation metric commonly used in NLP to evaluate the quality of generated text. The BLEU metric compares the generated text to one or more references and assigns a score based on the n-gram overlap between the two texts. The more n-grams they have in common, the higher the BLEU score.
Here, we'll calculate the BLEU score for a machine-generated text summary, referred to as the candidate summary.
The following are the steps to calculate the BLEU score:
1. Calculate the precision for each n-gram.
2. Compute the geometric mean of the precision scores.
3. Apply the Brevity Penalty (BP).
4. Calculate the BLEU score.
We calculate the precision for each n-gram order to measure how well the candidate summary matches the reference summary. Common values for n range from 1 (unigrams) to 4 (4-grams). The formula to calculate precision for an n-gram is:

precision_n = (number of n-grams in the candidate summary that also appear in the reference summary) / (total number of n-grams in the candidate summary)

In practice, each candidate n-gram's count is clipped by its count in the reference, so a repeated word in the candidate cannot be credited more times than it appears in the reference.
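The clipped precision described above can be sketched in a few lines of Python. The function name `ngram_precision` is ours, not part of any library:

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Modified n-gram precision: each candidate n-gram's count is
    clipped by its count in the reference."""
    cand_ngrams = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref_ngrams = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    # Count candidate n-grams, clipped by their frequency in the reference
    overlap = sum(min(count, ref_ngrams[ng]) for ng, count in cand_ngrams.items())
    total = sum(cand_ngrams.values())
    return overlap / total if total else 0.0

reference = ['Machine', 'learning', 'is', 'a', 'subset', 'of', 'artificial', 'intelligence']
candidate = ['Machine', 'learning', 'is', 'seen', 'as', 'a', 'subset', 'of', 'artificial', 'intelligence']
print(ngram_precision(candidate, reference, 1))  # 8 of 10 unigrams match -> 0.8
```

For this pair, 8 of the candidate's 10 unigrams appear in the reference (all but "seen" and "as"), giving a unigram precision of 0.8.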
After computing the n-gram precisions, we compute their geometric mean. Usually, we use uniform weights of 1/N over the N n-gram orders:

geometric_mean = exp(Σ w_n × log(precision_n)), where w_n = 1/N
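This weighted geometric mean is a one-liner with `math.exp` and `math.log`; the sketch below assumes uniform weights, with a guard for zero precisions (any zero precision makes the geometric mean zero):

```python
import math

def geometric_mean(precisions):
    """Geometric mean of n-gram precisions with uniform weights 1/N."""
    if min(precisions) == 0:
        # log(0) is undefined; a zero precision zeroes the whole mean
        return 0.0
    weight = 1.0 / len(precisions)
    return math.exp(sum(weight * math.log(p) for p in precisions))

# Unigram and bigram precisions from our running example
print(geometric_mean([0.8, 6 / 9]))  # sqrt(0.8 * 0.6667) ≈ 0.7303
```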
A Brevity Penalty (BP) adjusts the BLEU score if the candidate summary is shorter than the reference summary. It is calculated as follows:

BP = 1, if cand_length > ref_length
BP = exp(1 − ref_length / cand_length), if cand_length ≤ ref_length

Here:

cand_length: Length of the candidate summary.
ref_length: Length of the reference summary.
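The two-case penalty above translates directly into code (the function name `brevity_penalty` is our own):

```python
import math

def brevity_penalty(cand_length, ref_length):
    """BP = 1 when the candidate is longer than the reference,
    exp(1 - ref_length / cand_length) otherwise."""
    if cand_length > ref_length:
        return 1.0
    return math.exp(1 - ref_length / cand_length)

print(brevity_penalty(10, 8))  # candidate longer than reference -> 1.0
print(brevity_penalty(6, 8))   # shorter candidate is penalized (< 1)
```

Note that the penalty only ever reduces the score: a candidate longer than the reference is not rewarded, and a shorter one is scaled down exponentially.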
The final BLEU score is calculated by multiplying the Brevity Penalty by the geometric mean of the n-gram precisions. The formula to calculate the BLEU score is:

BLEU = BP × exp(Σ w_n × log(precision_n))

The BLEU score ranges from 0 to 1, with higher values indicating a closer match between the candidate and the reference summary.
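Putting the four steps together, here is a minimal, self-contained sketch of the whole pipeline, assuming uniform weights over 1- to 4-grams (the defaults used by most BLEU implementations):

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Sketch of BLEU: clipped n-gram precisions, their geometric
    mean with uniform weights, and the brevity penalty."""
    # Step 1: clipped precision for each n-gram order 1..max_n
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        total = sum(cand.values())
        precisions.append(overlap / total if total else 0.0)
    # Step 2: geometric mean (zero if any precision is zero)
    if min(precisions) == 0:
        geo_mean = 0.0
    else:
        geo_mean = math.exp(sum(math.log(p) / max_n for p in precisions))
    # Step 3: brevity penalty
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    # Step 4: final score
    return bp * geo_mean

reference = ['Machine', 'learning', 'is', 'a', 'subset', 'of', 'artificial', 'intelligence']
candidate = ['Machine', 'learning', 'is', 'seen', 'as', 'a', 'subset', 'of', 'artificial', 'intelligence']
print(bleu(candidate, reference))  # ≈ 0.5254
```

For this pair the candidate is longer than the reference, so BP = 1 and the score is just the geometric mean of the four clipped precisions (0.8, 6/9, 4/8, and 2/7).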
Now, let's see how to calculate the BLEU score using Python.
```python
import nltk

reference_summary = ['Machine', 'learning', 'is', 'a', 'subset', 'of', 'artificial', 'intelligence']
candidate_summary = ['Machine', 'learning', 'is', 'seen', 'as', 'a', 'subset', 'of', 'artificial', 'intelligence']

BLEUscore = nltk.translate.bleu_score.sentence_bleu([reference_summary], candidate_summary)
print(BLEUscore)
```
Line 1: We import the nltk library, which is widely used in the field of NLP.

Line 3: We define a reference_summary variable and set its value to the tokenized sentence "Machine learning is a subset of artificial intelligence".

Line 4: We define a candidate_summary variable and set its value to the tokenized sentence "Machine learning is seen as a subset of artificial intelligence".

Line 6: We calculate the BLEU score using the sentence_bleu() function from the nltk.translate.bleu_score module. Note that the reference is passed inside a list, because sentence_bleu() accepts multiple references.

Line 7: We print the BLEU score for the provided candidate summary.
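By default, sentence_bleu() uses uniform weights over 1- to 4-grams. The nltk API also lets us pass custom weights (e.g. BLEU-2, using only unigrams and bigrams) and a smoothing function, which prevents the score from collapsing to zero when some higher-order n-gram has no match, as often happens with short texts:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference_summary = ['Machine', 'learning', 'is', 'a', 'subset', 'of', 'artificial', 'intelligence']
candidate_summary = ['Machine', 'learning', 'is', 'seen', 'as', 'a', 'subset', 'of', 'artificial', 'intelligence']

# BLEU-2: only unigram and bigram precisions, weighted equally
bleu_2 = sentence_bleu([reference_summary], candidate_summary, weights=(0.5, 0.5))

# Smoothing avoids a zero score when a higher-order n-gram has no match
smooth = SmoothingFunction().method1
bleu_smooth = sentence_bleu([reference_summary], candidate_summary,
                            smoothing_function=smooth)

print(bleu_2)       # geometric mean of unigram and bigram precision
print(bleu_smooth)  # default 4-gram BLEU with smoothing
```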