If we stack the encoders of the Transformer, we get BERT.
Encoders in a sequence-to-sequence architecture output word embeddings that capture the context of each word, and the decoding component converts these embeddings into human-readable sequences.
The Transformer contains an encoder and a decoder and was designed for neural machine translation. BERT takes only the encoder part of the Transformer and replicates it to form a stack. A stack of encoders! Basically, BERT is a stack of encoders.
The encoders are identical in structure and are broken into two sub-layers: a self-attention layer and a feed-forward layer. The self-attention layer helps the encoder look at other words in the input sentence as it encodes a specific word. The outputs of the self-attention layer are fed to a feed-forward neural network, and the exact same feed-forward network is applied independently to each position.
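To make "applied independently to each position" concrete, here is a minimal NumPy sketch (toy sizes and random weights, all names assumed, not BERT's actual implementation). Running the feed-forward network over the whole sequence at once gives the same result as running it on each position's vector separately:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 16

X = rng.standard_normal((seq_len, d_model))        # one vector per position
W1, b1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)

def ffn(x):
    """Two-layer feed-forward network: linear -> ReLU -> linear."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

# Applying the FFN to the whole matrix at once...
batched = ffn(X)
# ...matches applying it to each position independently.
per_position = np.stack([ffn(X[i]) for i in range(seq_len)])
print(np.allclose(batched, per_position))  # True
```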
Before the words are passed into the encoder, they are converted to embeddings. The embedding size is constant and is a hyperparameter that can be modified; the number of positions in the input is also a hyperparameter, generally set to the length of the longest sentence in the training set. After the embeddings are generated, they are passed into the self-attention layer. Information flows through the layers for all positions in parallel, unlike in an RNN/LSTM (because of the self-attention layer).
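As a toy illustration (hypothetical five-word vocabulary and an assumed embedding size of 8), converting words to fixed-size embeddings is just a lookup into an embedding table:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"the": 0, "animal": 1, "didn't": 2, "cross": 3, "street": 4}
d_model = 8                                   # embedding size: a fixed hyperparameter
emb_table = rng.standard_normal((len(vocab), d_model))

tokens = ["the", "animal", "cross", "the", "street"]
X = emb_table[[vocab[t] for t in tokens]]     # (seq_len, d_model)
print(X.shape)  # (5, 8)
```

Every row of `X` then enters the self-attention layer at the same time, which is what makes the computation parallel across positions.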
Let’s suppose we have to translate this sentence: “The animal didn't cross the street because it was too tired”.
What does ‘it’ in the sentence refer to? When the model is processing the word ‘it’, self-attention allows it to associate ‘it’ with ‘animal’.
As the model processes each word (each position in the input sequence), self-attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word.
If you’re familiar with RNNs, think of how maintaining a hidden state allows an RNN to incorporate its representation of previous words/vectors it has processed with the current one it’s processing. Self-attention is the method the Transformer uses to bake the “understanding” of other relevant words into the one we’re currently processing.
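The self-attention computation itself can be sketched in a few lines of NumPy. This is plain scaled dot-product attention with random toy weights, a simplification of BERT's actual multi-head version:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Each position queries every position and mixes in their values."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq_len, seq_len) affinities
    weights = softmax(scores)                 # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
X = rng.standard_normal((seq_len, d_model))
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
# Row i of `weights` says how strongly position i attends to every position,
# which is how a word like "it" can draw on the representation of "animal".
```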
Before transferring the embeddings to the self-attention layer, the model needs to understand the order of the words in the input sequence. To address this, the transformer adds a vector to each input embedding. These vectors follow a specific pattern that the model learns, which helps it determine the position of each word or the distance between different words in the sequence.
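One concrete version of such a pattern is the fixed sinusoidal encoding from the original Transformer paper (BERT itself learns its position embeddings instead, but the idea of adding a position vector to each word embedding is the same). A NumPy sketch:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal position vectors from the original Transformer paper."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]      # even embedding dimensions
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # sines on even dimensions
    pe[:, 1::2] = np.cos(angles)               # cosines on odd dimensions
    return pe

pe = positional_encoding(10, 8)
# The position vector is simply added to each word embedding before
# the result enters the first encoder: X = embeddings + pe
```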
One detail in the architecture of the encoder that we need to mention before moving on is that each sub-layer (self-attention, ffnn) in each encoder has a residual connection around it, and is followed by a layer normalization step.
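That add-and-norm pattern can be sketched as follows (a simplified layer norm without the learnable gain and bias, and a toy stand-in sub-layer; names are assumptions for illustration):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    """Residual connection around the sub-layer, then layer normalization."""
    return layer_norm(x + sublayer(x))

# Works with any sub-layer: self-attention or the feed-forward network.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
y = add_and_norm(x, lambda v: v * 2.0)   # toy sub-layer in place of attention
```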
This completes one encoder; a stack of encoders of this type makes up BERT.
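Putting the pieces together, here is a bare-bones NumPy sketch of such a stack (single-head attention, no biases or learned norm parameters, toy sizes; BERT-base stacks 12 of these layers):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def encoder_layer(x, p):
    # Sub-layer 1: self-attention, wrapped in a residual + layer norm.
    q, k, v = x @ p["Wq"], x @ p["Wk"], x @ p["Wv"]
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v
    x = layer_norm(x + attn)
    # Sub-layer 2: position-wise feed-forward, also residual + layer norm.
    ff = np.maximum(0, x @ p["W1"]) @ p["W2"]
    return layer_norm(x + ff)

def encoder_stack(x, layers):
    """BERT is just this encoder layer repeated (12 times in BERT-base)."""
    for p in layers:                      # each layer has its own weights
        x = encoder_layer(x, p)
    return x

rng = np.random.default_rng(0)
d, d_ff, n_layers = 8, 32, 12
make = lambda: {
    "Wq": rng.standard_normal((d, d)), "Wk": rng.standard_normal((d, d)),
    "Wv": rng.standard_normal((d, d)),
    "W1": rng.standard_normal((d, d_ff)) * 0.1,
    "W2": rng.standard_normal((d_ff, d)) * 0.1,
}
layers = [make() for _ in range(n_layers)]
out = encoder_stack(rng.standard_normal((5, d)), layers)
print(out.shape)  # (5, 8)
```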
Medium article: https://arpanghoshal.medium.com/part-2-journey-of-bert-234375920637