Challenge: Compare the Performance of Two Different LLMs

Evaluate text generation by using multiple LLMs and determine the best performer.


In this challenge, we’ll explore the capabilities of two LLMs, google/flan-t5-small and bigscience/mt0-small. We’ll apply both models to the same text-generation task and evaluate their performance using ROUGE metrics.


Translate the German proverb “Anfangen ist leicht, beharren eine Kunst” into English using both LLMs with the Transformers pipeline. Then, evaluate each model’s performance using ROUGE metrics and determine which one performs better.

Using the Transformers pipeline

Note: Google’s FLAN-T5-Small is a refined version of the T5 model, developed for a diverse range of tasks without the need for additional fine-tuning. Released alongside the “Scaling Instruction-Finetuned Language Models” research paper, this open-source, sequence-to-sequence large language model has been fine-tuned on multiple tasks across multiple languages.

  • For google/flan-t5-small:
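A minimal sketch of the pipeline call for this model might look like the following. Since FLAN-T5 is a sequence-to-sequence, instruction-tuned model, one common approach (assumed here) is the `text2text-generation` pipeline task with a translation instruction as the prompt:

```python
from transformers import pipeline

# FLAN-T5 follows natural-language instructions, so the translation request
# is phrased as a prompt (prompt wording is an assumption for illustration).
prompt = "Translate German to English: Anfangen ist leicht, beharren eine Kunst"

# Load google/flan-t5-small via the text2text-generation pipeline task.
translator = pipeline("text2text-generation", model="google/flan-t5-small")

# The pipeline returns a list of dicts with a "generated_text" field.
result = translator(prompt)
print(result[0]["generated_text"])
```

The same pattern applies to bigscience/mt0-small, which is also an instruction-tuned sequence-to-sequence model; only the `model` argument changes.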
