Introduction

Training neural nets is all about computing gradients. If you are new to this idea, refer to this awesome post by Andrej Karpathy. Briefly, deep down every ML problem is an optimization problem: we want to "learn" (find) the weights which result in the least average loss. The way we do it is to start with arbitrary weights and keep adjusting them in small quantities until we get them right, i.e. until we arrive at a set of values for which the loss function has the least value. Gradients tell us by how much we should adjust each of the weights. Still not clear? Check this video by Andrew Ng and this blog by Sanjeev Arora.
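To make the idea concrete, here is a minimal sketch of gradient descent for a single scalar weight. It is only an illustration, not one of the networks derived later in this post: the function name `train`, the squared-error loss, the learning rate, and the toy data are all assumptions made for this example.

```python
import numpy as np

# Minimal gradient-descent loop for one scalar weight (illustrative sketch).
def train(x, y, lr=0.1, steps=100):
    w = np.random.randn()                    # start with an arbitrary weight
    for _ in range(steps):
        y_hat = w * x                        # prediction
        grad = np.mean(2 * (y_hat - y) * x)  # d(average squared loss)/dw
        w -= lr * grad                       # adjust w by a small quantity
    return w

# Example: data generated with y = 3x, so train() should return w close to 3.
x = np.array([1.0, 2.0, 3.0, 4.0])
print(train(x, 3 * x))
```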

In this post we will focus on the maths that goes into computing these gradients - we will derive them systematically. The complexity of the calculations depends on 3 things:

  1. Depth of the network
  2. Number of training examples (1 or more)
  3. Number of components in input (1=scalar, >1=vector)

Throughout this post we assume:

  1. There is no bias term.
  2. "." denotes matrix multiplication, "*" denotes the element-wise product, and "X" denotes ordinary multiplication.
  3. All activations are sigmoid, a.k.a. logistic, defined as \( \sigma(u) = \frac{1}{1+e^{-u}} \). Plotted, it looks like this:
(Figure: plot of the sigmoid function)

It is easy to see that it is smooth and differentiable, and that it is bounded between 0 and 1: since \( e^{-u} > 0 \) for every \( u \), the denominator \( 1 + e^{-u} \) is always greater than 1, so \( 0 < \sigma(u) < 1 \).
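For concreteness, here is a one-line NumPy version of the sigmoid (a sketch; the name `sigmoid` is just for illustration):

```python
import numpy as np

def sigmoid(u):
    # Logistic function: 1 / (1 + e^(-u))
    return 1.0 / (1.0 + np.exp(-u))

# Values stay strictly between 0 and 1, approaching the bounds for large |u|.
print(sigmoid(np.array([-10.0, -1.0, 0.0, 1.0, 10.0])))
# ~ [4.54e-05, 0.269, 0.5, 0.731, 0.99995]
```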

Derivative

The derivative of the logistic function (\(\sigma\)) is simply:

\[ \sigma'(u) = \sigma(u)\,\bigl(1 - \sigma(u)\bigr) \]

Where does this come from? Read on:

\[ \frac{d\sigma(u)}{du} = \frac{d}{du}\left(\frac{1}{1+e^{-u}}\right) = \frac{e^{-u}}{(1+e^{-u})^{2}} = \frac{1}{1+e^{-u}} \cdot \frac{e^{-u}}{1+e^{-u}} \]

likewise,

\[ \frac{e^{-u}}{1+e^{-u}} = \frac{(1+e^{-u}) - 1}{1+e^{-u}} = 1 - \sigma(u), \quad\text{so}\quad \frac{d\sigma(u)}{du} = \sigma(u)\,\bigl(1 - \sigma(u)\bigr) \]
We will be using the above result a lot, so make sure you understand it. If it is not clear, have a look at this post.
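If you want to convince yourself numerically, here is a small sketch that checks \( \sigma'(u) = \sigma(u)(1-\sigma(u)) \) against a central finite difference (the point \( u = 0.7 \) is arbitrary):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

u, eps = 0.7, 1e-6
analytic = sigmoid(u) * (1 - sigmoid(u))                     # sigma(u) * (1 - sigma(u))
numeric = (sigmoid(u + eps) - sigmoid(u - eps)) / (2 * eps)  # finite-difference slope
print(analytic, numeric)  # both ~ 0.2217
```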

To compute the gradients, we will start with the simplest case and increase the complexity gradually. To keep things simple, we will cover it in 7 parts:

  1. 1 layer network, 1 training example (scalar)
  2. 1 layer network, 1 training example (vector)
  3. 1 layer network, batch training (>1 training examples where each is a vector)
  4. 2 layer network with 1 node hidden layer, 1 training example (vector)
  5. 2 layer network with 2 node hidden layer, 1 training example (vector)
  6. 2 layer network, batch training (>1 training examples where each is a vector)
  7. Generalization and take home

Since we will be dealing with matrices, a key step in every equation is to check if all matrix dimensions are consistent.
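As an illustration of that kind of check (the sizes below are made up, and we assume examples are stored as rows): for a batch of m examples with n input features and k output nodes, X is (m, n) and W is (n, k), so the activation sigmoid(X . W) must come out as (m, k).

```python
import numpy as np

m, n, k = 4, 3, 2                     # illustrative sizes: examples, inputs, outputs
X = np.random.randn(m, n)             # batch of inputs, one example per row
W = np.random.randn(n, k)             # weight matrix (no bias term, as assumed above)
a = 1.0 / (1.0 + np.exp(-(X @ W)))    # sigmoid(X . W)
assert a.shape == (m, k), a.shape     # dimension-consistency check
print(a.shape)                        # (4, 2)
```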
