Part 3: Journey of BERT

How do we make use of the BERT model?

Pre-training and Fine-tuning

Pass 1

The goal of pre-training is to learn language and context. BERT learns language by training on two unsupervised tasks simultaneously:

  1. Masked Language Model (MLM)

  2. Next Sentence Prediction (NSP)

For MLM, BERT takes in a sentence in which some words are randomly replaced with a [MASK] token, and its goal is to predict the original words at those positions. This is like filling in the blanks, and it helps BERT learn bidirectional context within a sentence.
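To make this concrete, here's a minimal sketch of the masking step. The function and variable names are my own; the 15% masking rate and the 80/10/10 replacement split are from the BERT paper:

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Illustrative MLM masking following the paper's 80/10/10 rule.

    Returns the masked sequence and the prediction targets
    (the original token at masked positions, None elsewhere)."""
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            targets.append(tok)  # model must predict the original token here
            r = rng.random()
            if r < 0.8:
                masked.append("[MASK]")           # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                masked.append(tok)                # 10%: keep the word unchanged
        else:
            targets.append(None)  # not a prediction target
            masked.append(tok)
    return masked, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(tokens, vocab=tokens)
print(masked)
print(targets)
```

Keeping 10% of the selected words unchanged (rather than always masking) is what forces the model to build a representation for every input token, not just the [MASK] positions.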

In NSP, BERT takes in two sentences and determines whether the second sentence actually follows the first. This is essentially a binary classification problem.

This helps BERT understand context across sentences. Trained on both tasks together, BERT develops a good general understanding of language.
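A rough sketch of how NSP training pairs could be constructed; the corpus and function name are illustrative, but the 50/50 split between true and random pairs follows the paper:

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Build NSP training pairs: roughly half are true next-sentence
    pairs (label 1, IsNext) and half pair a sentence with a randomly
    chosen one (label 0, NotNext)."""
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], 1))       # IsNext
        else:
            pairs.append((sentences[i], rng.choice(sentences), 0))  # NotNext
    return pairs

corpus = ["I went to the store.", "I bought some milk.",
          "It started to rain.", "I took the bus home."]
pairs = make_nsp_pairs(corpus)
for a, b, label in pairs:
    print(label, "|", a, "->", b)
```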

The goal of fine-tuning is to further train BERT on a specific NLP task. For example, in question answering we just replace the fully connected output layers with a fresh set of output layers that produce the answer to a question from a given passage. We can then perform supervised training on a question-answering dataset. This doesn't take long, since only the new output parameters are learned from scratch; the rest of the model's parameters are adjusted only slightly, so training is fast.
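As a toy illustration of what "replacing the output layers" means, here is a NumPy sketch that puts a fresh, randomly initialised classification head on top of pre-trained features. Random vectors stand in for BERT's output, and all names and sizes are illustrative, not actual fine-tuning code:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 768     # BERT-base hidden size
num_labels = 2   # toy binary classification task

# Stand-in for the [CLS] output of a pre-trained BERT for a batch of 4 examples
# (in real fine-tuning these features come from the pre-trained network).
cls_output = rng.normal(size=(4, hidden))

# Fine-tuning adds a fresh output layer; only these weights start from scratch.
W = rng.normal(scale=0.02, size=(hidden, num_labels))
b = np.zeros(num_labels)

logits = cls_output @ W + b
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(probs.shape)  # (4, 2)
```

During training, gradients flow through both the new head and the pre-trained body, but the body's weights move only slightly from their pre-trained values.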

Pass 2

In practice both MLM and NSP are trained simultaneously.

The input is a pair of sentences with some of the words masked. Each token is a word, and we convert each word into an embedding vector (how these embeddings are built is described in Pass 3).

On the output side, C is a binary output indicating whether sentence B follows sentence A in context. Each of the T's is a word vector corresponding to an output of the masked-language-model task, so the number of word vectors we output is the same as the number of word vectors we input.

Usually, for classification tasks, we fine-tune by feeding the token embeddings of the sentences to BERT, which outputs a class label at C.
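The two-sentence input described above is packed into a single sequence with special tokens. Here's a simplified sketch (real implementations also truncate, pad, and build attention masks):

```python
def pack_pair(tokens_a, tokens_b):
    """Pack two tokenised sentences into BERT's input format:
    [CLS] A [SEP] B [SEP], with segment id 0 for sentence A
    (including [CLS] and the first [SEP]) and 1 for sentence B."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

tokens, segments = pack_pair("my dog is cute".split(), "he likes playing".split())
print(tokens)
print(segments)
```

The C output in the diagram corresponds to the vector BERT produces at the [CLS] position.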

Pass 3

How are the embeddings generated from the tokens? Each input embedding is constructed from three vectors:

  1. The token embeddings

  2. Segment embeddings

  3. Position embeddings

Token embeddings are the pre-trained WordPiece vectors (the paper uses a vocabulary of about 30k tokens). Segment embeddings encode the sentence number (sentence A or B) as a vector. Position embeddings encode the position of the word within the sequence. Adding these three vectors gives the input encoding for BERT; the segment and position embeddings preserve the order and sentence membership of words.
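A minimal NumPy sketch of summing the three embeddings. The sizes are toy values (the paper uses roughly a 30k vocabulary, 512 positions, and 768 dimensions), and the tables are random here, whereas in BERT they are learned:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_len, d = 100, 16, 8  # toy sizes for illustration

# Three lookup tables (learned in the real model, random here).
token_emb = rng.normal(size=(vocab_size, d))
segment_emb = rng.normal(size=(2, d))         # sentence A (0) vs sentence B (1)
position_emb = rng.normal(size=(max_len, d))  # one vector per position

token_ids = np.array([5, 23, 7, 99])  # hypothetical token ids
segment_ids = np.array([0, 0, 1, 1])  # first two tokens in A, rest in B
positions = np.arange(len(token_ids))

# The input to BERT is the element-wise sum of the three embeddings.
x = token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]
print(x.shape)  # (4, 8)
```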

While training, we need to minimize the loss. Two key points are:

  1. All of the vectors in output have the same size.

  2. All of the vectors in output are generated simultaneously.

We take each word vector and pass it into a fully connected output layer with the number of neurons equal to the number of tokens in the vocabulary (about 30k in the paper). We then apply a softmax, which converts each word vector into a probability distribution over the vocabulary, while the actual label is a one-hot encoding of the true word. We compare these distributions and train the network using cross-entropy loss.

The output contains predictions for all the words, even those that weren't masked. The loss, however, only considers the predictions at the masked positions and ignores the network's output for every other word. This ensures that the training signal focuses on predicting the masked values.
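The projection, softmax, and masked-only cross-entropy described above can be sketched in NumPy as follows (toy sizes and random values; the masked positions and labels are illustrative):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d, vocab = 6, 8, 50  # toy sizes; the paper uses d=768, vocab ~30k

T = rng.normal(size=(seq_len, d))   # output word vectors from BERT (stand-in)
W = rng.normal(size=(d, vocab))     # fully connected layer to vocabulary size
probs = softmax(T @ W)              # one distribution over the vocab per position

labels = rng.integers(0, vocab, size=seq_len)  # true token ids (one-hot targets)
masked_positions = [1, 4]                      # only these tokens were masked

# Cross-entropy is averaged over the masked positions only;
# every other position's output is ignored by the loss.
loss = -np.mean([np.log(probs[i, labels[i]]) for i in masked_positions])
print(round(loss, 4))
```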

That concludes how we can pre-train and fine-tune BERT in order to use it.

Stacking 12 Transformer encoder layers gives BERT-base; stacking 24 gives BERT-large.
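For reference, the published configurations from the paper (layer count, hidden size, attention heads, total parameters), expressed as a small Python dict:

```python
# Configurations of the two released model sizes, as stated in the BERT paper.
bert_configs = {
    "BERT-base":  {"layers": 12, "hidden": 768,  "heads": 12, "params": "110M"},
    "BERT-large": {"layers": 24, "hidden": 1024, "heads": 16, "params": "340M"},
}
for name, cfg in bert_configs.items():
    print(name, cfg)
```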

Now it’s time to use BERT in real life. (Part 4?)
