The Transformer is a model that uses attention to speed up training. It outperforms the Google Neural Machine Translation model on specific tasks. Its biggest benefit, however, comes from how well it lends itself to parallelization. Google Cloud in fact recommends the Transformer as a reference model for its Cloud TPU offering.
A Top-Level Look
Let's start by looking at the model as a single black box. A black box is a system whose internal algorithms and mechanisms we ignore for now: we give it an input and it predicts an output for us.
In a machine translation application, the input is a sentence in French and the output is its translation in English. We pass the French sentence to the Transformer and it converts it to English for us.
If we look inside the Transformer, we find two components: an encoding component (the encoders) and a decoding component (the decoders).
Looking further inside, the encoding component is a stack of encoders and the decoding component is a stack of the same number of decoders. The paper stacks six of each; there is nothing magical about the number six, and one can definitely experiment with other arrangements. The number of stacked layers is a hyperparameter that you can tune by experiment; six is used here because it gave good results in the paper.
The encoders are all identical in structure (yet they do not share weights). Each one is broken down into two sub-layers (see the diagram).
The encoder’s inputs first flow through a self-attention layer – a layer that helps the encoder look at other words in the input sentence as it encodes a specific word. We’ll look closer at self-attention later in the post.
The outputs of the self-attention layer are fed to a feed-forward neural network. The exact same feed-forward network is independently applied to each position.
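As a rough illustration of what "the same network applied independently to each position" means, here is a minimal NumPy sketch. The two-layer ReLU form and the inner size of 2048 follow the paper; the random weights, the 0.02 scale, and the names `feed_forward`, `W1`, `b1`, `W2` and `b2` are placeholders chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048  # sizes used in the paper

# Randomly initialised placeholder weights; a trained model would have learned these.
W1 = rng.normal(scale=0.02, size=(d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.normal(scale=0.02, size=(d_ff, d_model))
b2 = np.zeros(d_model)

def feed_forward(x):
    """Two linear layers with a ReLU in between, applied along the last axis."""
    hidden = np.maximum(0.0, x @ W1 + b1)
    return hidden @ W2 + b2

# Three positions (words), each a 512-dimensional vector.
X = rng.normal(size=(3, d_model))

all_at_once = feed_forward(X)                            # whole sentence in one call
one_by_one = np.stack([feed_forward(row) for row in X])  # each position separately

print(all_at_once.shape)                     # (3, 512)
print(np.allclose(all_at_once, one_by_one))  # same result either way
```

Because no position looks at any other position inside this layer, all the rows can be processed in parallel.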
The decoder has both of those layers (self-attention and feed-forward), but between them is an encoder-decoder attention layer that helps the decoder focus on relevant parts of the input sentence.
So far we have seen the main components of the Transformer and their sub-layers. Now we will see how words are converted to vectors and how these vectors flow through the layers of the Transformer to produce an output.
As in NLP applications in general, we begin by turning each input word into a vector using an embedding algorithm (word2vec is a well-known example; the Transformer learns its own embeddings during training).
The embedding only happens in the bottom-most encoder. The abstraction that is common to all the encoders is that they receive a list of vectors, each of size 512. In the bottom encoder that would be the word embeddings, but in other encoders it would be the output of the encoder that’s directly below. The size of this list is a hyperparameter we can set; basically it would be the length of the longest sentence in our training dataset.
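To make the "list of vectors of size 512" concrete, here is a small sketch of an embedding lookup. The toy vocabulary, the random embedding matrix, and the names `vocab`, `embedding_matrix` and `token_ids` are assumptions made up for this example; a real model learns the matrix during training.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 512

# Hypothetical toy vocabulary; a real model has tens of thousands of entries.
vocab = {"<pad>": 0, "thinking": 1, "machines": 2}

# One 512-dimensional row per vocabulary entry (random here, learned in practice).
embedding_matrix = rng.normal(size=(len(vocab), d_model))

sentence = ["thinking", "machines"]
token_ids = [vocab[word] for word in sentence]

# Embedding is just a row lookup: one vector per input word.
X = embedding_matrix[token_ids]
print(X.shape)  # (2, 512)
```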
After embedding the words in our input sequence, each of them flows through each of the two layers of the encoder.
NOTE: In an RNN or LSTM, the vectors are processed in sequence, one after another. In the Transformer, all the vectors flow through the encoder at the same time (in parallel). This is the beauty of the Transformer.
Next, we’ll switch up the example to a shorter sentence and we’ll look at what happens in each sub-layer of the encoder.
Now We’re Encoding!
As we have already seen, an encoder receives a list of vectors as input. It passes these vectors through a self-attention layer, then through a feed-forward neural network, and then sends the output upwards to the next encoder.
Self-Attention at a High Level
Let's look at how it works. Say the following sentence is an input sentence we want to translate:
“The animal didn't cross the street because it was too tired”
What does “it” in this sentence refer to? Is it referring to the street or to the animal? It’s a simple question to a human, but not as simple to an algorithm.
When the model is processing the word “it”, self-attention allows it to associate “it” with “animal”.
As the model processes each word (each position in the input sequence), self attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word.
Self-attention is the method the Transformer uses to bake the “understanding” of other relevant words into the one we’re currently processing.
As we encode the word "it" in encoder #5 (the top encoder in the stack), part of the attention mechanism focuses on "The Animal" and bakes a part of its representation into the encoding of "it".
Self-Attention in Detail
The first step is to create three vectors from each of the encoder's input vectors: a query vector, a key vector, and a value vector. To do this we use three weight matrices, WQ, WK, and WV, which are randomly initialized and then learned during training. The input words have already been converted into 512-dimensional vectors.
Look at the diagram: multiplying X1 by WQ gives q1, and similarly multiplying X2 by WQ gives q2. The query vectors (q1, q2) have 64 dimensions, smaller than the 512-dimensional input vectors.
Next we calculate the keys by the same process: multiplying X1 by WK and X2 by WK gives k1 and k2 respectively.
Similarly, we calculate the values: multiplying X1 by WV and X2 by WV gives v1 and v2 respectively.
This is how we calculate the queries, keys, and values with the help of the weight matrices.
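The projections above can be sketched in a few lines of NumPy. This is only an illustration: the input vectors and the weight matrices are random stand-ins (with an arbitrary 0.02 scale), whereas a trained model would use learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 512, 64

# Stand-ins for the embeddings of two words (X1 and X2 in the text).
x1 = rng.normal(size=d_model)
x2 = rng.normal(size=d_model)

# Randomly initialised projection matrices; training would adjust them.
WQ = rng.normal(scale=0.02, size=(d_model, d_k))
WK = rng.normal(scale=0.02, size=(d_model, d_k))
WV = rng.normal(scale=0.02, size=(d_model, d_k))

# The same three matrices are used for every word.
q1, k1, v1 = x1 @ WQ, x1 @ WK, x1 @ WV
q2, k2, v2 = x2 @ WQ, x2 @ WK, x2 @ WV

print(q1.shape, k1.shape, v1.shape)  # (64,) (64,) (64,)
```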
You may be wondering what queries, keys, and values are, and why we calculate them.
They’re abstractions that are useful for calculating and thinking about attention. Once you proceed with reading how attention is calculated below, you’ll know pretty much all you need to know about the role each of these vectors plays.
The second step in calculating self-attention is to calculate a score. The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.
Here the calculation is shown only for the first word, “Thinking”. The score is calculated by taking the dot product of the query vector with the key vector of the respective word we’re scoring. So if we’re processing the self-attention for the word in position #1, the first score would be the dot product of q1 and k1. The second score would be the dot product of q1 and k2.
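Here is the score calculation on a tiny example. The vectors are made-up 4-dimensional numbers so the arithmetic can be checked by hand; the paper uses 64 dimensions.

```python
import numpy as np

# Made-up 4-dimensional vectors to keep the arithmetic readable.
q1 = np.array([1.0, 2.0, 0.0, 1.0])  # query for the word in position #1
k1 = np.array([2.0, 1.0, 1.0, 0.0])  # key for the word in position #1
k2 = np.array([0.0, 1.0, 3.0, 0.0])  # key for the word in position #2

score_1 = q1 @ k1  # how much position #1 attends to itself
score_2 = q1 @ k2  # how much position #1 attends to position #2

print(score_1, score_2)  # 4.0 2.0
```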
Third and Fourth Steps
The third step is to divide the scores by 8, the square root of the dimension of the key vectors used in the paper (64). This leads to more stable gradients. There could be other possible values here, but this is the default. The fourth step is to pass the result through a softmax operation. Softmax normalizes the scores so they’re all positive and add up to 1.
This softmax score determines how much each word will be expressed at this position. Clearly the word at this position will have the highest softmax score, but sometimes it’s useful to attend to another word that is relevant to the current word.
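A quick sketch of the scaling and softmax, assuming made-up raw scores of 112 and 96 for the two positions. Subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result.

```python
import numpy as np

d_k = 64
scores = np.array([112.0, 96.0])  # made-up raw scores for positions #1 and #2

scaled = scores / np.sqrt(d_k)    # divide by 8 -> [14., 12.]

# Softmax: exponentiate, then normalise so the weights sum to 1.
exp = np.exp(scaled - scaled.max())
weights = exp / exp.sum()

print(np.round(weights, 2))  # [0.88 0.12]
```

With these numbers the word at position #1 gets about 88% of the attention and the word at position #2 about 12%.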
The fifth step is to multiply each value vector by the softmax score (in preparation to sum them up). The intuition here is to keep intact the values of the word(s) we want to focus on, and drown out irrelevant words (by multiplying them by tiny numbers like 0.001, for example).
The sixth step is to sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).
That concludes the self-attention calculation. The resulting vector is one we can send along to the feed-forward neural network.
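The weighting and summing of the value vectors can be sketched like this. The softmax scores (0.88 and 0.12) and the 3-dimensional value vectors are made-up numbers for illustration only.

```python
import numpy as np

weights = np.array([0.88, 0.12])  # softmax scores for positions #1 and #2
v1 = np.array([1.0, 0.0, 2.0])    # made-up value vector for position #1
v2 = np.array([0.0, 4.0, 2.0])    # made-up value vector for position #2

# Scale each value vector by its softmax score.
weighted_v1 = weights[0] * v1
weighted_v2 = weights[1] * v2

# Sum the weighted vectors to get the self-attention output for position #1.
z1 = weighted_v1 + weighted_v2
print(z1)  # approximately [0.88, 0.48, 2.0]
```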
In the actual implementation, this calculation is done in matrix form because it is faster, so let's dive into it.
Matrix Calculation of Self-Attention
Above, we took the vectors one at a time (X1, then X2) and multiplied each by the weight matrix WQ to find the queries, and likewise for the keys and values.
Here we pack all the vectors into a single matrix X, one row per word, and multiply it by the weight matrices (WQ, WK, WV) to get the Q, K, and V matrices.
Finally, since we’re dealing with matrices, we can condense steps two through six in one formula to calculate the outputs of the self-attention layer.
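In plain notation, that formula is Z = softmax(Q K^T / sqrt(d_k)) V. Below is a minimal NumPy sketch of it for a single attention head. The helper names `softmax` and `self_attention` and the random weights are assumptions made for this example, not a reference implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Row-wise softmax with the usual max-subtraction for stability."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, WQ, WK, WV):
    """Scaled dot-product self-attention for one head, in matrix form."""
    Q, K, V = X @ WQ, X @ WK, X @ WV
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (n_words, n_words)
    return weights @ V                         # (n_words, d_k)

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 512))                  # two words, one row each
WQ, WK, WV = (rng.normal(scale=0.02, size=(512, 64)) for _ in range(3))

Z = self_attention(X, WQ, WK, WV)
print(Z.shape)  # (2, 64)
```

Each row of Z is the self-attention output for one word, computed for all words in a single pass.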
The Beast With Many Heads
The paper further refines the self-attention layer by adding a mechanism called “multi-headed” attention. This improves the performance of the attention layer in two ways:
1. It expands the model’s ability to focus on different positions. Yes, in the example above, z1 contains a little bit of every other encoding, but it could be dominated by the actual word itself. If we’re translating a sentence like “The animal didn’t cross the street because it was too tired”, it would be useful to know which word “it” refers to.
2. It gives the attention layer multiple “representation subspaces”. As we’ll see next, with multi-headed attention we have not only one, but multiple sets of Query/Key/Value weight matrices (the Transformer uses eight attention heads, so we end up with eight sets for each encoder/decoder). Each of these sets is randomly initialized. Then, after training, each set is used to project the input embeddings (or vectors from lower encoders/decoders) into a different representation subspace.
If we do the same self-attention calculation we outlined above, just eight different times with different weight matrices, we end up with eight different Z matrices.
The feed-forward layer is not expecting eight matrices – it’s expecting a single matrix (a vector for each word). So we need a way to condense these eight down into a single matrix.
So how do we do that?
We concatenate the matrices and then multiply them by an additional weight matrix, WO.
That gives us the final single matrix that the feed-forward layer needs. Let's summarize all the steps in one diagram for a clear view.
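The whole multi-headed step can be sketched as follows, assuming eight heads of 64 dimensions each as in the paper. All weights are random placeholders, and the variable names (`heads`, `Z_concat`, `WO`) are chosen for this example.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_words, d_model, n_heads = 2, 512, 8
d_k = d_model // n_heads  # 64 dimensions per head

X = rng.normal(size=(n_words, d_model))

heads = []
for _ in range(n_heads):
    # Each head has its own randomly initialised Q/K/V projections.
    WQ, WK, WV = (rng.normal(scale=0.02, size=(d_model, d_k)) for _ in range(3))
    Q, K, V = X @ WQ, X @ WK, X @ WV
    heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)  # one Z matrix per head

Z_concat = np.concatenate(heads, axis=-1)  # (2, 8 * 64) = (2, 512)
WO = rng.normal(scale=0.02, size=(n_heads * d_k, d_model))
Z = Z_concat @ WO                          # single matrix for the feed-forward layer

print(len(heads), Z_concat.shape, Z.shape)  # 8 (2, 512) (2, 512)
```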
If we add all the attention heads to the picture, however, things can be harder to interpret.
One thing that’s missing from the model as we have described it so far is a way to account for the order of the words in the input sequence.
To address this, the Transformer adds a vector to each input embedding. These vectors follow a specific pattern (in the paper, fixed sine and cosine functions of the position) that the model learns to use, which helps it determine the position of each word, or the distance between different words in the sequence. The intuition here is that adding these values to the embeddings provides meaningful distances between the embedding vectors once they’re projected into Q/K/V vectors and during dot-product attention.
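One way to generate such position vectors is the sinusoidal formula from the paper: sine on the even dimensions and cosine on the odd ones, with wavelengths that grow with the dimension index. The sketch below assumes an even `d_model`; the function name `positional_encoding` and the random stand-in embeddings are made up for this example.

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    positions = np.arange(n_positions)[:, None]    # (n_positions, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(3, 512))  # stand-in word embeddings
pe = positional_encoding(3, 512)
X = embeddings + pe                     # what the bottom encoder receives

print(pe[0, :4])  # position 0: [0. 1. 0. 1.]
```

Because the encoding is simply added, the vectors keep their size of 512 and the rest of the encoder does not need to change.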
One detail in the architecture of the encoder that we need to mention before moving on, is that each sub-layer (self-attention, feed-forward layer) in each encoder has a residual connection around it, and is followed by a layer-normalization step.
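The residual connection adds the sub-layer's input to its output before normalization, so each sub-layer computes LayerNorm(x + Sublayer(x)). This gives the input vectors a path that skips the sub-layer: if the self-attention layer contributes little, its input still reaches the layer-normalization step and then the feed-forward layer, and in the same way the feed-forward layer's input can still reach encoder #2 (shown in the figure).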
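The residual connection and layer-normalization step can be written as LayerNorm(x + Sublayer(x)). Here is a simplified sketch: the layer norm below leaves out the learned gain and bias, and the sub-layer is a placeholder that outputs zeros, chosen only to show that the input still passes through the skip path.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalise each row to zero mean and unit variance (no learned gain/bias here)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    """Residual connection followed by layer normalisation."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 512))

# Placeholder sub-layer that contributes nothing; the input still gets through.
out = add_and_norm(X, lambda x: np.zeros_like(x))

print(out.shape)                        # (2, 512)
print(np.allclose(out, layer_norm(X)))  # True
```

In a real encoder, `sublayer` would be the self-attention layer or the feed-forward network.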
If we visualize the vectors and the layer-norm operation associated with self-attention, it would look like this:
This goes for the sub-layers of the decoder as well. If we’re to think of a Transformer of 2 stacked encoders and decoders, it would look something like this:
The main point to notice here is that the output of the encoders is not given to the decoder's self-attention layer; it is given directly to the encoder-decoder attention layer, which is followed by the feed-forward layer. The output of the decoder stack then passes through the linear and softmax layers to produce a word. That word is fed back into the self-attention layer of the bottom decoder at the next step, and the same process repeats, so the output sentence is generated in sequential order, one word at a time. This is the main point to notice about how the decoder works.
We have now looked at all the operations in the encoder, so we basically know how the components of the decoders work as well. But let’s take a look at how they work together.
The encoder starts by processing the input sequence. The output of the top encoder is then transformed into a set of attention vectors K and V. These are to be used by each decoder in its “encoder-decoder attention” layer, which helps the decoder focus on appropriate places in the input sequence.
The following steps repeat the process until a special symbol is reached indicating the transformer decoder has completed its output. The output of each step is fed to the bottom decoder in the next time step, and the decoders bubble up their decoding results just like the encoders did. And just like we did with the encoder inputs, we embed and add positional encoding to those decoder inputs to indicate the position of each word.
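The feedback loop described above can be sketched without a real model. In the toy code below, `toy_next_word` is a hypothetical stand-in for the whole decoder stack and just returns a canned translation one word at a time. Only the loop structure is the point: feed the previous outputs back in, and stop at the special end symbol.

```python
# Hypothetical stand-in for the full decoder stack: it returns a canned
# translation one word at a time. A real model would compute the next word
# from the encoder output and the words generated so far.
CANNED_OUTPUT = ["i", "am", "a", "student", "<eos>"]

def toy_next_word(generated_so_far):
    return CANNED_OUTPUT[len(generated_so_far)]

def decode(next_word_fn, max_steps=10):
    generated = []
    for _ in range(max_steps):
        word = next_word_fn(generated)  # previous outputs are fed back in
        if word == "<eos>":             # special end-of-sentence symbol
            break
        generated.append(word)
    return generated

print(decode(toy_next_word))  # ['i', 'am', 'a', 'student']
```

The `max_steps` limit is a safety net for this sketch, so the loop ends even if the end symbol is never produced.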
Final - Linear and Softmax Layer
The decoder stack outputs a vector of floats. How do we turn that into a word? That’s the job of the final Linear layer which is followed by a Softmax Layer.
The Linear layer is a simple fully connected neural network that projects the vector produced by the stack of decoders, into a much, much larger vector called a logits vector.
Let’s assume that our model knows 10,000 unique English words (our model’s “output vocabulary”) that it’s learned from its training dataset. This would make the logits vector 10,000 cells wide – each cell corresponding to the score of a unique word. That is how we interpret the output of the model followed by the Linear layer.
The softmax layer then turns those scores into probabilities (all positive, all add up to 1.0). The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.
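As a rough sketch of these last two layers, the code below projects a 512-dimensional decoder output to 10,000 logits, applies softmax, and picks the most likely word index. The decoder output and the weights `W` and `b` are random placeholders, so the chosen index is meaningless here; only the shapes and steps are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 512, 10_000

decoder_output = rng.normal(size=d_model)  # stand-in for the decoder stack's vector

# Linear layer: project 512 numbers to one score (logit) per vocabulary word.
W = rng.normal(scale=0.02, size=(d_model, vocab_size))
b = np.zeros(vocab_size)
logits = decoder_output @ W + b            # shape (10000,)

# Softmax: turn the logits into probabilities.
exp = np.exp(logits - logits.max())
probs = exp / exp.sum()

word_id = int(np.argmax(probs))            # index of the most likely word
print(logits.shape, word_id)
```

In a real model, `word_id` would be looked up in the output vocabulary to get the word for this time step.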
References and credit
1. Krish Naik: an amazing teacher for data science. You can visit his YouTube channel to explore all of these concepts.
2. Jay Alammar: he is also a YouTuber, and you can find him on Twitter @JayAlammar. Credit for all images and GIFs goes to Jay Alammar.
3. Research paper: "Attention Is All You Need".