The Transformer neural network was introduced for the first time in 2017, in the paper Attention Is All You Need. One year later, BERT appeared. Last year I gave a short presentation about the Transformer and BERT at my previous company, as shown below:
A couple of days ago I started to review the Transformer paper, and I found myself recommending the article The Illustrated Transformer once again. That article really helped me understand a lot of the details in the Transformer.
But one question still jumped out at me: what is the decoder in the Transformer actually for? How does information flow through the decoder? After thinking about it for quite a while, I figured it out: the Transformer was originally built for the machine translation task. The encoder is used to "transform" a sentence in the source language into a set of Keys and Values, while the decoder "transforms" a word in the target language into a Query. By attending with that Query over the Keys and Values, the model gets a vector, which is in effect the embedding of the next word in the target language.
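The Query-over-Keys-and-Values step described above can be sketched as plain scaled dot-product attention. This is a minimal single-head NumPy sketch, not the full multi-head mechanism from the paper, and all the numbers are random placeholders rather than real embeddings:

```python
import numpy as np

def cross_attention(query, keys, values):
    """One decoder Query attends over the encoder's Keys and Values
    (scaled dot-product attention, single head, illustrative only)."""
    d_k = keys.shape[-1]
    scores = query @ keys.T / np.sqrt(d_k)   # similarity of the Query to each source token
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()        # softmax over the source tokens
    return weights @ values                  # weighted mix of the Values

# Toy setup: 5 source tokens (think "Ich bin ein guter Kerl"),
# embedding size 4; the vectors here are random, not trained.
rng = np.random.default_rng(0)
keys = rng.normal(size=(5, 4))
values = rng.normal(size=(5, 4))
query = rng.normal(size=(1, 4))  # e.g. the decoder's representation of "good"

out = cross_attention(query, keys, values)
print(out.shape)  # (1, 4) — a vector in the same space as the Values
```

The output vector lives in the same space as the Values, which is what lets the model read it as the embedding of the next target word.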
Here is a diagram I drew. I hope it clears up my own confusion.
"Ich bin ein guter Kerl" is German for "I am a good guy". By encoding all the German words into a set of Keys and Values, and decoding "good" into a Query, the Transformer can finally output the embedding vector of "guy".