Benchmark Tasks and Datasets

Learn about benchmark tasks and datasets used to evaluate the performance of transformers.


Three things are required to show that transformers have reached state-of-the-art performance levels:

  • A model

  • A dataset-driven task

  • A metric
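To make the three ingredients concrete, here is a minimal sketch of how a model, a dataset-driven task, and a metric fit together in an evaluation loop. Everything in it is hypothetical and chosen only for illustration: the four-sentence dataset, the keyword-based stand-in for a model, and the function names `keyword_model`, `accuracy`, and `evaluate` are assumptions, not part of any benchmark or library.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical toy dataset for a sentiment task:
# (sentence, label) pairs, where 1 = positive and 0 = negative.
DATASET: List[Tuple[str, int]] = [
    ("a wonderful film", 1),
    ("a dull and tedious plot", 0),
    ("wonderful acting", 1),
    ("not wonderful at all", 0),
]


def keyword_model(sentence: str) -> int:
    """Stand-in for a real model: predicts 1 if 'wonderful' appears."""
    return 1 if "wonderful" in sentence.split() else 0


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Metric: fraction of predictions that match the labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def evaluate(
    model: Callable[[str], int],
    dataset: Sequence[Tuple[str, int]],
    metric: Callable[[Sequence[int], Sequence[int]], float],
) -> float:
    """Run the model on every example and score the result with the metric."""
    predictions = [model(text) for text, _ in dataset]
    labels = [label for _, label in dataset]
    return metric(predictions, labels)


score = evaluate(keyword_model, DATASET, accuracy)
print(f"accuracy = {score:.2f}")  # prints "accuracy = 0.75"
```

The stand-in model misclassifies "not wonderful at all", so it scores 3 out of 4. Swapping in a different model, dataset, or metric changes only the arguments passed to `evaluate`, which is why a benchmark can fix the dataset and metric and compare many models on equal terms.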

We will begin by exploring the SuperGLUE benchmark to illustrate the evaluation process of a transformer model.

From GLUE to SuperGLUE

The SuperGLUE benchmark was designed and made public by Wang et al. (2019), who had first designed the General Language Understanding Evaluation (GLUE) benchmark.

The motivation behind the GLUE benchmark was to show that, to be useful, natural language understanding (NLU) has to be applicable to a wide range of tasks. The relatively small GLUE datasets were designed to encourage an NLU model to solve a set of tasks rather than a single one.

However, the performance of NLU models, boosted by the arrival of transformers, began to exceed the level of the average human, as the GLUE leaderboard (December 2021) shows. The leaderboard is a remarkable display of NLU talent, retaining some of the earlier RNN/CNN ideas while focusing mainly on ground-breaking transformer models.

The following excerpt of the leaderboard shows the top leaders and the position of GLUE’s Human Baselines:
