Evaluating the Results Quantitatively

Learn about the evaluation metrics for image caption generation.

There are many techniques for evaluating the quality and relevance of generated captions. We’ll briefly discuss four such metrics: BLEU, ROUGE, METEOR, and CIDEr.

All these measures share a key objective: to measure the text’s adequacy (the meaning of the generated text) and fluency (the grammatical correctness of the text). To calculate each of these measures, we’ll use a candidate sentence and a reference sentence. The candidate sentence is the sentence or phrase predicted by our algorithm, and the reference sentence is the true sentence or phrase we want to compare it with.


Bilingual Evaluation Understudy (BLEU) was proposed by Papineni and others in "BLEU: A Method for Automatic Evaluation of Machine Translation," Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July (2002): 311-318. It measures the n-gram similarity between reference and candidate phrases in a position-independent manner. This means that if a given n-gram from the candidate is present anywhere in the reference sentence, it is considered to be a match. BLEU calculates the n-gram similarity in terms of precision, that is, the fraction of candidate n-grams that also appear in the reference. BLEU comes in several variations (BLEU-1, BLEU-2, BLEU-3, and so on), where the number denotes the value of n in the n-gram.
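To make the position-independent n-gram precision concrete, here is a minimal Python sketch. It is an illustration, not the full BLEU score: it computes only the clipped n-gram precision for a single n against a single reference, and leaves out the brevity penalty and the combination of several n-gram orders. The function names `ngrams` and `modified_precision`, the whitespace tokenization, and the example sentences are all assumptions made for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams in a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against one reference.

    Assumption: sentences are tokenized by splitting on whitespace.
    A candidate n-gram matches if it occurs anywhere in the reference
    (position-independent), but it is counted at most as many times
    as it occurs in the reference (clipping).
    """
    cand_counts = Counter(ngrams(candidate.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))

    total = sum(cand_counts.values())
    if total == 0:
        # The candidate is shorter than n tokens, so it has no n-grams.
        return 0.0

    matched = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return matched / total


# Hypothetical example sentences.
candidate = "a cat sits on the mat"
reference = "the cat is on the mat"

p1 = modified_precision(candidate, reference, 1)  # unigram precision
p2 = modified_precision(candidate, reference, 2)  # bigram precision

print(f"1-gram precision: {p1:.3f}")  # 0.667 (4 of 6 unigrams match)
print(f"2-gram precision: {p2:.3f}")  # 0.400 (2 of 5 bigrams match)
```

Four of the six candidate unigrams ("cat", "on", "the", "mat") appear in the reference, but only two of the five bigrams ("on the", "the mat") do. This shows how larger values of n reward candidates that get the word order right, not just the vocabulary.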
