BERT – Robin on Linux

My summary for the paper “Unified Language Model Pre-training for Natural Language Understanding and Generation”

For NLU (Natural Language Understanding), we use a bidirectional language model (like BERT), but for NLG (Natural Language Generation), a left-to-right unidirectional language model (like GPT) is the only choice.

Could we accomplish these two tasks by using one unified language model?

In this paper, the authors use a self-attention mask matrix to train one shared model on different language-modeling tasks.

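The pivotal equation for this method is the masked self-attention (written here in the paper's notation):

$$A_l = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}} + M\right) V_l, \qquad M_{ij} = \begin{cases} 0, & \text{allow to attend} \\ -\infty, & \text{prevent from attending} \end{cases}$$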

“M is the mask matrix and determines whether a pair of tokens can be attended to each other.”

“Unidirectional LM is done by using a triangular matrix for the self-attention mask M (as in the above equation), where the upper triangular part of the self-attention mask is set to −∞, and the other elements to 0”
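To make the role of M concrete, here is a minimal sketch in plain Python of how the three kinds of masks could be built and added to the attention scores. The function names (`bidirectional_mask`, `left_to_right_mask`, `seq2seq_mask`, `masked_softmax`) are my own and are not taken from the paper's code; the seq2seq layout assumes the source segment comes first and the target segment second.

```python
import math

NEG_INF = float("-inf")


def bidirectional_mask(n):
    """All zeros: every token may attend to every other token."""
    return [[0.0] * n for _ in range(n)]


def left_to_right_mask(n):
    """Upper triangle (j > i) is -inf: token i only sees positions <= i."""
    return [[NEG_INF if j > i else 0.0 for j in range(n)] for i in range(n)]


def seq2seq_mask(src_len, tgt_len):
    """Source tokens see the whole source; target tokens see the source
    plus the target tokens up to and including themselves."""
    n = src_len + tgt_len
    mask = [[NEG_INF] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j < src_len:
                mask[i][j] = 0.0  # every token may attend to the source
            elif i >= src_len and j <= i:
                mask[i][j] = 0.0  # target attends to itself and earlier target
    return mask


def masked_softmax(scores, mask):
    """Row-wise softmax(scores + mask); masked positions get weight 0."""
    out = []
    for s_row, m_row in zip(scores, mask):
        shifted = [s + m for s, m in zip(s_row, m_row)]
        top = max(shifted)  # finite, because the diagonal is never masked
        exps = [math.exp(x - top) for x in shifted]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


# With equal scores, the first token attends only to itself,
# the second token splits its attention over the first two positions, etc.
uniform_scores = [[0.0] * 3 for _ in range(3)]
weights = masked_softmax(uniform_scores, left_to_right_mask(3))
print(weights[0])
```

Adding −∞ before the softmax is what turns a masked position into an attention weight of exactly 0, so the same network can behave as a bidirectional, unidirectional, or seq2seq model just by swapping M.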

“Within one training batch, 1/3 of the time we use the bidirectional LM objective, 1/3 of the time we employ the sequence-to-sequence LM objective, and both left-to-right and right-to-left LM objectives are sampled with the rate of 1/6”

Note that what is switched during training is the objective (bidirectional, unidirectional, or seq2seq), not the samples.
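The mixing rates quoted above can be sketched as a weighted random choice. This is only an illustration: the objective labels are my own, and drawing one objective per training step is my assumption, since the quote does not spell out the exact sampling granularity.

```python
import random

# Labels are hypothetical; the rates are the ones quoted from the paper.
OBJECTIVES = ["bidirectional", "seq2seq", "left_to_right", "right_to_left"]
RATES = [1 / 3, 1 / 3, 1 / 6, 1 / 6]


def sample_objective(rng):
    """Pick the LM objective (and therefore the mask M) for one step."""
    return rng.choices(OBJECTIVES, weights=RATES, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
counts = {name: 0 for name in OBJECTIVES}
for _ in range(6000):
    counts[sample_objective(rng)] += 1
print(counts)
```

Over many steps the counts should come out at roughly 2 : 2 : 1 : 1, while the training samples themselves are drawn the same way regardless of which objective is picked.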

Understanding Transformer

The Transformer neural network was first introduced in 2017, in the paper Attention Is All You Need. One year later, BERT appeared. Last year I gave a simple presentation about the Transformer and BERT at my previous company, as shown below:

Transformer and BERT from Hao(Robin) Dong

A couple of days ago I started to review the Transformer paper and found that I need to recommend the article The Illustrated Transformer again. This article really helped me understand a lot of the details in the Transformer.

But one question still jumped out at me: what is the decoder in the Transformer for? How does information flow from the encoder to the decoder? After thinking for quite a while, I figured it out: the Transformer was used for the machine translation task in the first place. The encoder is used to “transform” a sentence in the source language into a set of Keys and Values; the decoder “transforms” a word in the target language into a Query. By combining a Query with that set of Keys and Values, it gets a vector, which is effectively a representation of the next word in the target language.

Here is a diagram I drew. I hope it clears up the confusion I had.

“Ich bin ein guter Kerl” in German means “I am a good guy”. By encoding all the German words into a set of Keys and Values, and decoding “good” into a Query, the Transformer can finally output a vector representing “guy”.
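The Query/Keys/Values flow described above can be sketched as a single scaled dot-product attention step. This is a toy under stated assumptions: the vectors are random stand-ins rather than outputs of a trained encoder or decoder, the learned projection matrices are left out, and `attention` is my own helper name.

```python
import math
import random


def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d_k = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
        for key in keys
    ]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    # Weighted sum of the Values, one weight per source word
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context


rng = random.Random(0)
d_model = 4
source = ["Ich", "bin", "ein", "guter", "Kerl"]

# Random stand-ins for the Keys and Values produced by the encoder
keys = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in source]
values = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in source]
# Random stand-in for the Query the decoder produces for "good"
query = [rng.gauss(0, 1) for _ in range(d_model)]

weights, context = attention(query, keys, values)
for word, weight in zip(source, weights):
    print(f"{word}: {weight:.3f}")
```

In the real Transformer this context vector still passes through a feed-forward layer, and a final linear layer plus softmax over the vocabulary picks the next word, so the vector is best read as an intermediate representation from which “guy” is predicted.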