Deep Learning: Recurrent Neural Networks

Madhu Ramiah
Jul 4, 2019 · 4 min read


Many applications, like chatbots, video processing, part-of-speech (POS) tagging, machine translation, speech recognition, and text summarization, use recurrent neural networks. We know that CNNs are used for image classification and recognition. But when it comes to making predictions from sequential data, we use recurrent neural networks (RNNs).

  1. If you consider a video, it is a sequence of images over time.
  2. If you consider time series analysis, it is also a sequence of data points over a period of time.
  3. If you consider part-of-speech (POS) tagging, that is also a sequence of words, in which a word's POS depends on the POS of all the previous words.

Recurrent Neural Network Architecture:

RNNs are similar to CNNs, except that RNNs also need to handle the time dimension. Consider the architecture image below to understand this better.

Recurrent Neural Network Architecture

Let us consider the input divided into a sequence x₁, x₂, …, xₙ, where each element feeds an input neuron. Each input neuron is connected to a neural network with multiple hidden layers (1 to k) and gives a respective output. In an ordinary feed-forward neural network, weights run from one layer to the next only in the forward direction. In an RNN, in addition to the forward weights (w𝒻), we also have weights from one time step to the next within the same layer (wᵣ), called recurrent weights. The outputs from the time steps are y₁, y₂, …, yₙ. The number of outputs need not equal the number of inputs; there can be just one output, or fewer or more outputs than input neurons. We will also assume that each neuron in the hidden layers has a bias value (b) associated with it.
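To make this unrolled computation concrete, here is a minimal NumPy sketch of a single RNN layer processing a sequence x₁…xₙ. The sizes and names (n_steps, n_in, n_hidden, n_out) are illustrative assumptions, and I add a tanh nonlinearity, which the formulas in the next section omit for simplicity.

```python
import numpy as np

rng = np.random.default_rng(42)
n_steps, n_in, n_hidden, n_out = 5, 3, 4, 2   # illustrative sizes

Wf = rng.normal(scale=0.1, size=(n_in, n_hidden))      # forward weights: input -> hidden
Wr = rng.normal(scale=0.1, size=(n_hidden, n_hidden))  # recurrent weights: same layer, t-1 -> t
b  = np.zeros(n_hidden)                                # hidden bias
Wy = rng.normal(scale=0.1, size=(n_hidden, n_out))     # hidden -> output weights

x = rng.normal(size=(n_steps, n_in))   # the input sequence x1..xn
h = np.zeros(n_hidden)                 # hidden state before the first time step
outputs = []
for t in range(n_steps):
    h = np.tanh(x[t] @ Wf + h @ Wr + b)   # combine current input with the previous state
    outputs.append(h @ Wy)                # y_t: one output per time step

print(np.array(outputs).shape)  # (5, 2) -> y1..yn
```

This is the same loop that deep learning frameworks unroll internally when they run an RNN layer over a sequence.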

Mathematical Formula:

We discussed the mathematical formula for neural networks, where the output of each layer (l) is defined as

hˡ = wˡ·hˡ⁻¹ + bˡ

Similarly, for RNNs, the output (a) of each layer (l) at time step (t) is defined as

aₜˡ = w𝒻ˡ·aₜˡ⁻¹ + wᵣˡ·aₜ₋₁ˡ + bˡ

As you can see, the output at layer l and time t depends on the activation of the previous layer at the same time step (through the forward weights w𝒻) and on the activation of the same layer at the previous time step (through the recurrent weights wᵣ). For faster computation we generally process the inputs in batches, as in the sketch below.
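As a rough sketch (my own illustration, not code from the post), the per-layer update above can be written as a single function that also carries a batch dimension; all shapes here are made-up example values.

```python
import numpy as np

def rnn_layer_step(a_below_t, a_same_tm1, Wf, Wr, b):
    """One time step of one RNN layer: a_t^l = Wf·a_t^(l-1) + Wr·a_(t-1)^l + b.

    a_below_t:  (batch, n_below)  activation of layer l-1 at time t
    a_same_tm1: (batch, n_units)  activation of layer l at time t-1
    """
    return a_below_t @ Wf + a_same_tm1 @ Wr + b

# Illustrative shapes: batch of 32 sequences, 64 units below, 128 units in this layer.
rng = np.random.default_rng(0)
Wf = rng.normal(scale=0.1, size=(64, 128))
Wr = rng.normal(scale=0.1, size=(128, 128))
b = np.zeros(128)

a_t = rnn_layer_step(rng.normal(size=(32, 64)), np.zeros((32, 128)), Wf, Wr, b)
print(a_t.shape)  # (32, 128): one new activation vector per sequence in the batch
```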

Types of RNNs:

  1. Many-to-one RNN: This type of RNN takes many inputs but produces only one output. For example, if we need to find whether a sentence is grammatically correct or not, we would use this type of RNN: the sentence has many input words but only one output, correct or not.
  2. Many-to-many RNN: In this case the number of inputs equals the number of outputs, and both sequences have the same length. For example, part-of-speech tagging requires this type of RNN, because for each input word we output one tag (noun, pronoun, verb, adjective, and so on). See the sketch after this list for the first two variants.
  3. Encoder-decoder RNN: If the number of inputs is not equal to the number of outputs, we use this type of RNN. For example, a chatbot where you ask questions and the chatbot replies with answers: questions and answers have different lengths, and we cannot pre-define them.
  4. One-to-many RNN: If there is just one input but many outputs, we use this type of RNN. For example, in image captioning you take one input image but output a sequence of words.
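As an illustration of the first two variants, here is a hedged Keras sketch (assuming TensorFlow is available; the vocabulary size, number of units, and tag count are made-up values). A many-to-one model keeps only the final hidden state, while a many-to-many model returns one state per time step.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Many-to-one: read a whole sentence, emit one "grammatical or not" score.
many_to_one = models.Sequential([
    layers.Embedding(input_dim=10_000, output_dim=64),
    layers.SimpleRNN(128),                    # return_sequences=False: only the final state
    layers.Dense(1, activation="sigmoid"),    # correct / not correct
])

# Many-to-many (same length): one POS tag per input word.
many_to_many = models.Sequential([
    layers.Embedding(input_dim=10_000, output_dim=64),
    layers.SimpleRNN(128, return_sequences=True),  # one hidden state per time step
    layers.Dense(17, activation="softmax"),        # e.g. 17 POS tag classes
])
```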

Problems with RNNs:

  1. RNNs are extremely difficult to train, especially as the sequences get longer. This is because at each time step the model has to carry the cumulative knowledge of the whole sequence it has seen so far.
  2. Backpropagation in RNNs happens not only backwards through the layers but also backwards through time, from later time steps to earlier ones; this is called backpropagation through time (BPTT). So when you modify a weight at layer 1 and time step 2, the weights in subsequent layers and later time steps all change, which changes the final outputs. Backpropagation multiplies gradients along these paths. If the per-step gradients are small, like 0.1 or 0.2, the product of all of them becomes so small as to be negligible; this is the vanishing gradient problem. If the gradients are large, like 10, 100, or 1000, their product becomes enormous; this is the exploding gradient problem. The toy example below illustrates both.
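A toy NumPy example makes the effect concrete: during BPTT, the gradient reaching an early time step is (roughly) a product of one factor per later time step, so products of consistently small or consistently large factors vanish or explode. The factor values 0.5 and 1.5 are arbitrary illustrations.

```python
import numpy as np

# 50 time steps, one illustrative gradient factor per step.
small_factors = np.full(50, 0.5)   # per-step gradients consistently below 1
large_factors = np.full(50, 1.5)   # per-step gradients consistently above 1

print(np.prod(small_factors))  # ~8.9e-16 -> vanishing gradient
print(np.prod(large_factors))  # ~6.4e+08 -> exploding gradient
```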

Solutions:

  1. To handle the exploding gradient problem, we can cap the gradient at an upper limit, a technique called gradient clipping (see the sketch after this list).
  2. The vanishing gradient problem is much harder to handle. We can address it using more advanced recurrent units like the LSTM (Long Short-Term Memory) and the GRU (Gated Recurrent Unit). I will look at these in detail in my next blog.
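Here is a minimal sketch of gradient clipping by global norm. The function name and the cap of 5.0 are my own illustrative choices; frameworks ship this built in, for example tf.clip_by_global_norm or torch.nn.utils.clip_grad_norm_.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads

exploded = [np.full((3, 3), 100.0), np.full(3, 1000.0)]   # unrealistically large gradients
clipped = clip_by_global_norm(exploded)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))      # 5.0
```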

Conclusion:

RNNs are extremely effective in a wide variety of applications, including NLP and video. But they are much harder to train and much more complex than CNNs, and we need to train the model carefully to make use of it effectively.

Hope you enjoyed reading my blog! Leave your comments below or contact me via LinkedIn.
