Transformer

Kun Ouyang
Apr 30, 2019 · 5 min read


Credit: The codebase I refer to is https://github.com/jadore801120/attention-is-all-you-need-pytorch

Two very good surveys:

Attention? Attention! This is a vivid presentation of many attention mechanisms.

An Introductory Survey on Attention Mechanisms in NLP Problems. This paper thoroughly discusses attention mechanisms in NLP, including multi-dimensional attention (akin to multi-head), hierarchical attention, etc.

Self-Attention

Attention is All You Need is the paper best known for the self-attention mechanism. Of course, it is not the only one. An earlier paper, A Structured Self-attentive Sentence Embedding (let's call it SSE), also used self-attention, just not in as systematic (or fancy) a way as the Transformer.

The above figure shows the attention used by SSE. Attention is achieved by $W_{s_1}$ and $W_{s_2}$. Here, no explicit Query-Key-Value triplet is specified: $W_{s_1}$ plays the role of a linear projection (and compression), and then $W_{s_2}$ plays the role of constructing $r$ heads of attention. In the QKV paradigm, the matrix after the tanh can be interpreted as a query matrix, while $W_{s_2}$ serves as a projection that generates keys. When constructing attention, this model does not consider the interaction between different keys (i.e., all linear transformations are local transformations that concern only the current hidden state). I think this attention layer could therefore be weaker than the Transformer's.
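To make the shapes concrete, here is a minimal PyTorch sketch of SSE-style attention (my own simplification, not the authors' code); `d_a` and `r` follow the paper's notation for the attention hidden size and the number of attention hops:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StructuredSelfAttention(nn.Module):
    """SSE-style self-attention: A = softmax(W_s2 tanh(W_s1 H^T)), M = A H."""
    def __init__(self, hidden_dim, d_a, r):
        super().__init__()
        self.W_s1 = nn.Linear(hidden_dim, d_a, bias=False)  # linear projection / compression
        self.W_s2 = nn.Linear(d_a, r, bias=False)            # constructs r attention hops

    def forward(self, H):
        # H: (batch, seq_len, hidden_dim), e.g., BiLSTM outputs
        A = F.softmax(self.W_s2(torch.tanh(self.W_s1(H))), dim=1)  # (batch, seq_len, r)
        A = A.transpose(1, 2)               # (batch, r, seq_len) attention weights
        M = torch.bmm(A, H)                 # (batch, r, hidden_dim) sentence embedding
        return M, A
```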

In the Transformer, attention is computed by the following equation:
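$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

This is the scaled dot-product attention from the paper, with $d_k$ the key dimension.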

This equation asks: given $l_q$ queries (e.g., a German (GE) sentence of length $l_q$, where each token is a query) and $l_k$ (key, value) pairs (e.g., an English (EN) sentence of length $l_k$, where each token is a key/value), how does each query react to (i.e., attend over) the key-value pairs?

To be more specific, assume the word embedding dimension is $d_e$. Then the query matrix is of shape $(l_q, d_e)$ and the key matrix is of shape $(l_k, d_e)$, so $QK^T$ gives a query-key matrix of shape $(l_q, l_k)$. After softmax normalization, each entry $(i, j)$ of this Q-K matrix (call it $M$) represents, for query $q_i$, how likely it is to extract the value $v_j$ associated with key $k_j$. The subsequent multiplication $MV$ represents the extraction process that takes values according to these respective likelihoods.
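A minimal PyTorch sketch of this computation (a simplified illustration, not the exact code from the referenced repository):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q: (l_q, d_e), K: (l_k, d_e), V: (l_k, d_v)
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5      # query-key matrix of shape (l_q, l_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    M = F.softmax(scores, dim=-1)                       # row i: how query q_i weighs each key k_j
    return M @ V, M                                     # M @ V reads values according to those weights
```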

There are two kinds of attention in the original Transformer paper:

  1. Decoder-encoder attention, where the queries come from the GE sentence decoded up to position $i$ (which is of length $i$), and the keys and values are both the full source EN sentence.
  2. Self-attention, where the queries, keys and values all come from the same sentence (the source sentence, or the decoded sentence up to position $i$); see the sketch below.

Note that the above figure is misleading: the Q, K, V order differs between the two sub-figures, and the order in the right one is the one consistent with the model architecture.
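In terms of the sketch above, the two cases differ only in what is passed as Q, K, and V (the variable names below are hypothetical):

```python
# Self-attention in the encoder: the source EN sentence attends to itself.
enc_out, _ = scaled_dot_product_attention(src_repr, src_repr, src_repr)

# Self-attention in the decoder: the GE prefix up to position i attends to itself
# (a causal mask keeps each position from looking at later positions).
dec_self, _ = scaled_dot_product_attention(tgt_repr, tgt_repr, tgt_repr, mask=causal_mask)

# Decoder-encoder attention: the GE prefix provides the queries,
# the encoder outputs provide the keys and values.
dec_out, _ = scaled_dot_product_attention(dec_self, enc_out, enc_out)
```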

Another Attention

I don’t think the SSE method can be naturally cast as QKV, but another commonly used attention mechanism can.

Another commonly used attention mechanism can be seen in the Seq2Seq implementation from PyTorch, where the attn module is implemented as a linear layer that maps a vector $q$ = [input_emb; hidden] to a vector whose length equals that of the source sentence. We can regard the vector $q$ as a query (saying "given the previous state and the current input"), the weight matrix of attn as the Key matrix, and the encoder outputs as the Value matrix. (When attn is a deeper network, we can regard its last layer as the Key matrix and the output of the penultimate layer as the query.)
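Here is a minimal sketch in that style, loosely following the structure of the PyTorch seq2seq tutorial (names such as max_length, embed_size, and hidden_size are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TutorialStyleAttention(nn.Module):
    """Attention in the style of the PyTorch seq2seq tutorial (simplified sketch).
    The linear layer `attn` maps q = [input_emb; hidden] to one score per source
    position; the encoder outputs then play the role of the Value matrix."""
    def __init__(self, embed_size, hidden_size, max_length):
        super().__init__()
        self.attn = nn.Linear(embed_size + hidden_size, max_length)

    def forward(self, input_emb, hidden, encoder_outputs):
        # input_emb: (batch, embed_size), hidden: (batch, hidden_size)
        # encoder_outputs: (batch, max_length, hidden_size)
        q = torch.cat((input_emb, hidden), dim=1)                     # the "query"
        weights = F.softmax(self.attn(q), dim=1)                      # (batch, max_length)
        context = torch.bmm(weights.unsqueeze(1), encoder_outputs)    # (batch, 1, hidden_size)
        return context.squeeze(1), weights
```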

The attention operation used in GeoMAN is similar to this one; its attn is a two-layer network.
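GeoMAN's exact formulation is not reproduced here; a generic two-layer scoring network of this flavor, written as an additive-attention-style sketch with illustrative names, might look like:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerAttention(nn.Module):
    """Additive-style attention with a two-layer scoring MLP (illustrative sketch only)."""
    def __init__(self, query_dim, key_dim, attn_dim):
        super().__init__()
        self.layer1 = nn.Linear(query_dim + key_dim, attn_dim)
        self.layer2 = nn.Linear(attn_dim, 1)   # produces a scalar score per source position

    def forward(self, query, keys, values):
        # query: (batch, query_dim); keys, values: (batch, seq_len, key_dim)
        q = query.unsqueeze(1).expand(-1, keys.size(1), -1)
        scores = self.layer2(torch.tanh(self.layer1(torch.cat((q, keys), dim=-1))))
        weights = F.softmax(scores.squeeze(-1), dim=1)                 # (batch, seq_len)
        context = torch.bmm(weights.unsqueeze(1), values).squeeze(1)   # weighted sum of values
        return context, weights
```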

Why does self-attention (the Transformer) work?

The key is the semantics.

Take the Transformer as an example. It requires a set of pretrained embeddings. Why? In each layer of the Transformer (for simplicity, the local transformations for Q, K, V are not considered), all it does is compute the pairwise correlations between Q and K for addressing, and then read V according to those correlations. For example, the representations of "taste" and "salty" will presumably be correlated, so the representation of "taste" will take "salty" into account when constructing its representation for the next layer. Another example is sentiment analysis: "pasta is tasty". We need to tag "pasta" as positive and "tasty" as "O" (other), since "tasty" is not a meaningful noun by itself. In this case, the embedding of "pasta" definitely needs to consider the semantics of "tasty" for better sentiment analysis. Of course other methods like RNNs also work, but attention is one structural way to achieve this.

Here, the presumption is that the correlations between word representations faithfully reflect the true relations between the words (i.e., their semantics). This is why the Transformer needs to employ pretrained word embeddings: otherwise, the activations are just random combinations of noise.

This is also true for images. For example, in this image captioning paper, the attention mechanism is deployed at the feature-map level instead of the original pixel level, for both "soft" and "hard" attention.

Why use Attention?

The initial motivation for introducing attention into RNNs was to resolve the memory fading problem, which exists for very long sequences even when LSTMs or GRUs are used.

This paper suggests, through empirical analysis, that the Transformer's advantage lies not in capturing long-term dependencies but in extracting semantics. However, the claims are quite weak, as the experiments are conducted only on very short sentences.

Notes

Some Jargon:

  • The notions of "Query, Key and Value" were first introduced in the Neural Turing Machine paper. In that paper, "memory" is used to denote the "Values", and the process of weighted averaging is called "addressing".
  • In many cases, Keys and Values are identical. However, in some cases the Key can be designed separately, such as a window instead of a specific location.
  • The query is the current state that needs to be considered.
