I wanted to get a feel for how well Masked Language Models perform at filling in masked tokens. This is a fairly trivial exercise, but I wanted to do it myself to build a better intuition for their relative performance. Maybe it’ll be helpful to you as well; the code is linked at the bottom.
The task: answer these fill-in-the-blank questions:
| Question | Expected Answer |
|---|---|
| The cat chased the [MASK]. | mouse |
| The capital of [MASK] is Paris. | France |
| Bread is typically baked in an [MASK]. | oven |
| When it’s raining, you use an [MASK] to stay dry. | umbrella |
| You brush your [MASK] to keep them clean. | teeth |
| Money doesn’t grow on [MASK]. | trees |
| A doctor often works in a [MASK]. | hospital |
| You don’t put metal in a [MASK]. | microwave |
| When you’re sleepy, you go to [MASK]. | bed |
| The [MASK] is the center of our solar system. | sun |
| People breathe [MASK] to live. | oxygen |
| Plants need [MASK] to grow. | water |
| The [MASK] is a natural satellite of the Earth. | moon |
| Cars usually run on [MASK]. | gasoline |
| A fridge is used to keep food [MASK]. | cold |
| To see stars, you look at the [MASK]. | sky |
| Humans have [MASK] fingers on each hand. | five |
| Fish live in [MASK]. | water |
| Fire needs [MASK] to burn. | oxygen |
| Birds use [MASK] to fly. | wings |
| The [MASK] rises in the East. | sun |
Models tested include BERT, RoBERTa, ELECTRA, DeBERTa, XLM-RoBERTa (base), BERT-large (cased), Legal-BERT, InfoXLM (large), and ALBERT (base, v1). The results are displayed in one table per question, comparing the top-5 predicted tokens and their scores for each model.
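Since the notebook itself isn’t inlined here, below is a minimal sketch of the kind of loop this comparison boils down to, assuming the Hugging Face `transformers` fill-mask pipeline. The model identifiers and the shortened prompt list are illustrative, not a copy of the linked Colab.

```python
from transformers import pipeline

# Prompts use a generic {mask} placeholder because different model families
# expect different mask tokens ([MASK] for BERT, <mask> for RoBERTa, etc.).
prompts = [
    ("The cat chased the {mask}.", "mouse"),
    ("The capital of {mask} is Paris.", "France"),
]

# Illustrative model IDs; swap in the full list of models under comparison.
model_names = ["bert-base-uncased", "roberta-base"]

for model_name in model_names:
    fill = pipeline("fill-mask", model=model_name)
    print(f"\n=== {model_name} ===")
    for template, expected in prompts:
        # Substitute the model's own mask token into the prompt.
        text = template.format(mask=fill.tokenizer.mask_token)
        # top_k=5 returns the five highest-scoring candidate tokens.
        for pred in fill(text, top_k=5):
            token = pred["token_str"].strip()
            print(f"{text} -> {token} ({pred['score']:.3f})  [expected: {expected}]")
```

From these per-prompt top-5 lists you can assemble the per-question comparison tables described above.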
CODE:
https://colab.research.google.com/drive/1FRyFGc9R70xd7nLeu9pmNBrWc2G5wcuh?usp=sharing
Evaluation:
I’m underwhelmed by the models’ performance. I’m sure I could have tweaked things further to get better outcomes, but I still expected more out of the box. RoBERTa did reasonably well, but even its results weren’t very exciting.