Part 1: Journey of BERT

From RNN to BERT

BERT is a language model introduced in 2018 by Jacob Devlin and colleagues at Google. It was state of the art at the time and changed how the contextual meaning of words is computed. In 2019, Google adopted it in its search engine to improve search results.

Contextual embeddings are word vectors that represent the meaning of a word based on the sentence it appears in. Earlier word embeddings such as Word2Vec and GloVe gave each word a single vector, regardless of context. As language models advanced from RNN to BERT, computing contextual embeddings became both more efficient and more accurate.
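To make the difference concrete, here is a minimal sketch in plain Python. The vector values, the `STATIC` table, and the neighbour-averaging rule in `toy_contextual_embedding` are all made up for illustration; real models such as BERT compute context in a far richer way. The point is only that a static lookup returns the same vector for "bank" in every sentence, while a context-aware function does not.

```python
# Toy 2-d static embeddings (values are invented for illustration).
STATIC = {
    "river": [0.0, 1.0],
    "bank": [0.5, 0.5],
    "money": [1.0, 0.0],
}

def static_embedding(sentence, index):
    """Look the word up in a fixed table; the neighbours are ignored."""
    return STATIC[sentence[index]]

def toy_contextual_embedding(sentence, index):
    """Blend the word's own vector with the mean of the other words.

    A hypothetical stand-in for a real contextual model.
    Assumes the sentence has at least two words.
    """
    own = STATIC[sentence[index]]
    others = [STATIC[w] for i, w in enumerate(sentence) if i != index]
    mean_others = [sum(v[d] for v in others) / len(others)
                   for d in range(len(own))]
    return [0.5 * o + 0.5 * m for o, m in zip(own, mean_others)]

s1 = ["river", "bank"]
s2 = ["money", "bank"]

static_1 = static_embedding(s1, 1)          # same vector in both sentences
static_2 = static_embedding(s2, 1)
ctx_1 = toy_contextual_embedding(s1, 1)     # pulled toward "river"
ctx_2 = toy_contextual_embedding(s2, 1)     # pulled toward "money"
```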


An RNN can be viewed as a feed-forward neural network unrolled over time. RNNs deal with sequence data, where the input has a defined ordering. This gives rise to several types of architectures:

  1. Vector-Sequence Models take a fixed-size vector as input and output a sequence of any length (e.g., image captioning).

  2. Sequence-Vector Models take a sequence as input and output a fixed-size vector (e.g., sentiment classification).

  3. Sequence-Sequence Models take a sequence as input and output a sequence of vectors (e.g., language translation). A sequence-to-sequence model consists of an encoder and a decoder. The encoder encodes the words sequentially into embeddings, and the decoder converts those embeddings back into words.
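The "unrolled over time" idea can be sketched with a scalar RNN in plain Python. This is a minimal illustration, not a trainable model: the weights `w_x`, `w_h`, and `b` are arbitrary assumed values, and the function names are hypothetical. The same cell is applied to each input in order, and the choice of which hidden states to keep gives the sequence-to-sequence or sequence-to-vector flavour.

```python
import math

def rnn_step(x, h, w_x, w_h, b):
    """One time step: h_t = tanh(w_x * x_t + w_h * h_{t-1} + b)."""
    return math.tanh(w_x * x + w_h * h + b)

def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Apply the same cell, with the same weights, to every input in order."""
    h = 0.0
    states = []
    for x in inputs:          # strictly sequential: step t needs step t-1
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states

states = run_rnn([1.0, 0.0, -1.0])
sequence_output = states       # sequence-to-sequence style: one output per step
vector_output = states[-1]     # sequence-to-vector style: keep only the last state
```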

However, RNNs have some disadvantages:

  1. They are very slow to train, so slow that we use a truncated version of backpropagation through time to train them.

  2. They can't deal with long sequences very well; gradients vanish or explode when the unrolled network is too long.
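The second disadvantage can be seen with a little arithmetic. The sketch below makes a simplifying assumption: it uses a linear scalar recurrence h_t = w_h * h_{t-1} and ignores the activation function, so the gradient of the last state with respect to the first is just w_h multiplied by itself once per step. The function name `gradient_scale` is hypothetical.

```python
def gradient_scale(w_h, steps):
    """Magnitude of d h_T / d h_0 for the linear recurrence h_t = w_h * h_{t-1}."""
    return abs(w_h) ** steps

vanishing = gradient_scale(0.5, 50)   # recurrent weight below 1: shrinks toward zero
exploding = gradient_scale(1.5, 50)   # recurrent weight above 1: grows very large
stable = gradient_scale(1.0, 50)      # exactly 1: neither shrinks nor grows
```

With 50 steps, a weight of 0.5 leaves almost no gradient for the earliest inputs, while a weight of 1.5 produces a gradient in the hundreds of millions.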


Then came LSTM (Long Short-Term Memory), published in 1997, which replaced the neurons in an RNN with LSTM cells. Each cell has a branch that lets past information skip much of the current cell's processing and move on to the next step. This allows memory to be retained over longer sequences.
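A scalar LSTM step in plain Python shows where that branch sits. This is a sketch of the common forget/input/output gate formulation; the parameter dictionary `params` and its values are assumptions chosen so the forget gate stays near 1 and the input gate near 0, which lets the cell state survive many steps almost unchanged.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p maps gate name -> (w_x, w_h, b)."""
    def gate(name, squash):
        w_x, w_h, b = p[name]
        return squash(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    i = gate("input", sigmoid)         # how much new information to write
    o = gate("output", sigmoid)        # how much of the cell state to expose
    g = gate("candidate", math.tanh)   # proposed new content

    c = f * c_prev + i * g             # the branch: old state passes through when f is near 1
    h = o * math.tanh(c)
    return h, c

# Assumed parameters: forget gate biased open, input gate biased shut.
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 1.0, 0.0),
}

h, c = 0.0, 1.0
for _ in range(100):
    h, c = lstm_step(0.0, h, c, params)
# After 100 steps the cell state c is still close to its starting value of 1.0.
```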

Some disadvantages of LSTM are:

  1. It does not capture the true meaning of words, even when run bidirectionally, because technically it just concatenates the context learned left-to-right with the context learned right-to-left.

  2. RNNs are slow to train, and LSTMs are even slower.

  3. RNNs and LSTMs must process the input sequentially during training, which makes poor use of the parallel computation that today's GPUs offer.


In 2017, the Transformer neural network architecture was introduced. It employs an encoder-decoder architecture much like the RNN sequence-to-sequence model; the difference is that the input sequence can be passed in parallel instead of sequentially. This lets Transformers train faster, because words are processed concurrently, and learn context better, because each word can draw on context from both directions at once.
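The mechanism the Transformer uses to look in both directions at once is self-attention. Below is a minimal plain-Python sketch of scaled dot-product self-attention. It makes a simplifying assumption: queries, keys, and values are the raw token vectors themselves, whereas a real Transformer applies learned projections and multiple heads first. The token vectors are invented for illustration.

```python
import math

def softmax(scores):
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = vectors."""
    d = len(vectors[0])
    outputs, all_weights = [], []
    for q in vectors:                     # each row is independent, so rows can run in parallel
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)         # how much this token attends to every token
        out = [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(d)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
```

Every row of `weights` has a nonzero entry for every position, so the middle token receives information from the token on its left and the token on its right in a single step, with no loop over time.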

The Transformer consists of two key components, an encoder and a decoder, like the RNN sequence-to-sequence model. The encoder takes in all the words of an input sentence (say, in English) and generates their embeddings simultaneously. These embeddings are vectors that encapsulate the meaning of each word; similar words have vectors that lie closer together. The decoder takes these embeddings and the previously generated words as input, and then outputs the next word.


Transformers are conceptually more appealing than LSTM cells because we can see a clear separation of tasks. The encoder learns the context of the input sequence, and the decoder learns how to map the input sequence to the output sequence. Each of these has some understanding of language on its own, and because of this, we can split the architecture apart and build systems that understand language from either half.

If we stack the decoders of the Transformer, we get the GPT architecture; conversely, if we stack the encoders, we get BERT (Bidirectional Encoder Representations from Transformers).
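One practical difference between the two stacks is which positions each token may attend to. The sketch below builds the two attention masks in plain Python; the function name `attention_mask` is hypothetical. An encoder stack such as BERT lets every position see every other position, while a decoder stack such as GPT uses a causal mask so a position sees only itself and earlier positions.

```python
def attention_mask(length, causal):
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [[(j <= i) if causal else True for j in range(length)]
            for i in range(length)]

encoder_mask = attention_mask(3, causal=False)  # BERT-style: bidirectional
decoder_mask = attention_mask(3, causal=True)   # GPT-style: left-to-right only

for row in decoder_mask:
    print(row)
# [True, False, False]
# [True, True, False]
# [True, True, True]
```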

The original Transformer had one disadvantage:

  1. It was built to solve only Neural Machine Translation (such as translating English to French).

BERT, in contrast, can be applied to problems such as:

  1. Neural Machine Translation

  2. Question Answering

  3. Sentiment Analysis

  4. Text Summarization

  5. And many more!

Part 2: If we stack the encoders of Transformers, we get BERT

Part 3: How to make use of the BERT model?
