Introduction to Deep Learning

What is deep learning?

Deep learning is a branch of machine learning. Its central application these days is solving perception problems. For example:

  • Understand what people are saying
  • Help robots interact with the world
  • Recognize images
  • Speech Recognition
  • Computer Vision


Deep learning is also increasingly used for medical image analysis, natural language processing, and understanding and translating documents.


Deep Learning and Neural Networks

Neural networks form the backbone of deep learning. Most of the important work on neural networks happened in the 1980s and 1990s, but due to a lack of fast computers not much could be done with them. Neural networks had all but disappeared by the early 2000s. In 2009 and 2012, however, they made a big comeback in speech recognition and computer vision. This was mainly due to the availability of cheap and fast GPUs.

Introduction to Neural Networks

The idea of neural networks is borrowed from the workings of the human brain. It is believed that the workings of the brain can be simulated in silicon. The nodes of a neural network imitate biological neurons, and the links that connect them play the role of axons. Each link in the neural network is associated with a weight. The output of each node is called its activation or node value.

A typical neural network is shown below –

Neural Network

A typical neural network has a set of input nodes, one or more hidden layers, and an output layer made up of output nodes.

The Mathematics Behind Neural Networks

Consider the neural network shown below.

Neural Network

The output y = f(h), where h is the input to the output unit. We can think of f as the activation function and h as the linear combination of weights, inputs and bias.

Hence, $h = \sum_i w_i x_i + b$

A typical activation function is the sigmoid, where

$\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}$

Thus the predicted output of the neural network is

$y = f(h) = \mathrm{sigmoid}\left(\sum_i w_i x_i + b\right)$

Hence the predicted output of output unit j of any neural network, for each data point μ (assume b = 0), is

${y'}_j^{\mu} = f\left(\sum_{i=1}^{n} w_{ij} x_i^{\mu}\right)$

And the error of the neural network is

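$E = \frac{1}{2} \sum_{\mu} \sum_{j} \left[ y_j^{\mu} - {y'}_j^{\mu} \right]^2$

where $y_j^{\mu}$ is the actual output and ${y'}_j^{\mu}$ is the predicted output.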
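As a rough illustration of the prediction and error formulas above, the sketch below computes both for a small made-up dataset. The array values and the names X, Y, W, predict and squared_error are assumptions chosen for the example, and the bias is taken as 0 as in the text.

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


def predict(X, W):
    """Predicted output for every record: f(sum_i w_ij * x_i)."""
    return sigmoid(np.dot(X, W))


def squared_error(Y, Y_pred):
    """E = 1/2 * sum over records and output units of (actual - predicted)^2."""
    return 0.5 * np.sum((Y - Y_pred) ** 2)


# Hypothetical data: 3 records, 2 input features, 1 output unit
X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])
Y = np.array([[0.5], [1.0], [0.0]])
W = np.zeros((2, 1))  # all-zero weights give sigmoid(0) = 0.5 for every record

Y_pred = predict(X, W)
print(squared_error(Y, Y_pred))  # 0.25
```

With all weights at zero every prediction is 0.5, so the per-record errors are 0, 0.5 and -0.5, which gives E = 0.25.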

We aim to minimize the error. To do this we use a technique called gradient descent.

Gradient Descent

We use gradient descent to minimize the error.

Think of the error as a mountain. Suppose we are at the top and want to reach the bottom by taking many small steps. The fastest way down is in the steepest direction, so at each step we move in the direction that reduces the error the most. We find that direction by taking the gradient of the error and stepping against it.

Thus at each step we calculate the error and its gradient and use these to decide how much to change each weight. Repeating this process finds weights close to a minimum of the error function. Depending on how the weights are initialized, gradient descent may end up in a local minimum rather than the lowest one. Techniques such as momentum are used to reduce the chance of this happening.

The following Python code shows how to implement one step of gradient descent, using an example.

import numpy as np


def sigmoid(x):
    """Calculate sigmoid."""
    return 1 / (1 + np.exp(-x))


def sigmoid_prime(x):
    """Derivative of the sigmoid function."""
    return sigmoid(x) * (1 - sigmoid(x))


learnrate = 0.5
x = np.array([1, 2, 3, 4])
y = np.array(0.5)

# Initial weights
w = np.array([0.5, -0.5, 0.3, 0.1])

# Calculate one gradient descent step for each weight

# Calculate the node's linear combination of inputs and weights
h = np.dot(x, w)

# Calculate output of neural network
nn_output = sigmoid(h)

# Calculate error of neural network
error = y - nn_output

# Calculate the error term: the error scaled by the output gradient
error_term = error * sigmoid_prime(h)
# Equivalent form that reuses the output:
# error_term = error * nn_output * (1 - nn_output)

# Calculate change in weights
delta_w = learnrate * error_term * x

print('Neural Network output:')
print(nn_output)
print('Amount of Error:')
print(error)
print('Change in Weights:')
print(delta_w)
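The example above takes a single step. As a minimal sketch of what "repeating this process" looks like, the loop below applies the same update several times with the same inputs, target and starting weights. The number of steps (20) and the errors list are assumptions added for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


learnrate = 0.5
x = np.array([1, 2, 3, 4])
y = np.array(0.5)
w = np.array([0.5, -0.5, 0.3, 0.1])

errors = []
for step in range(20):  # 20 steps is an arbitrary choice
    nn_output = sigmoid(np.dot(x, w))
    error = y - nn_output
    errors.append(abs(float(error)))
    # sigmoid'(h) can be written as output * (1 - output)
    error_term = error * nn_output * (1 - nn_output)
    # Apply the weight step instead of only printing it
    w = w + learnrate * error_term * x

print('First error:', errors[0])
print('Last error: ', errors[-1])
```

Each pass moves the weights a little further downhill, so the absolute error shrinks from one step to the next in this example.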


Stochastic Gradient Descent

There is a problem with scaling gradient descent. If computing the error (or loss) takes n floating point operations, computing its gradient takes about two to three times as many. The loss function is usually huge, since it depends on every single element in the training set. We usually train on a lot of data and go over it tens to hundreds of times, so this can take a very long time.

So instead we randomly pick a small sample from the training data, compute the loss and its derivative for that sample, and treat that derivative as an estimate of the gradient to use for the descent step. We repeat this many times. This is called stochastic gradient descent (SGD). Because each step is cheap, it scales to large datasets.
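A minimal NumPy sketch of this idea follows, assuming a single sigmoid unit and a synthetic dataset generated from known weights. The data, the sample size of 32, the learning rate and the step count are all made-up values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


# Hypothetical training set: 1000 records, 4 features, targets from known weights
X = rng.normal(size=(1000, 4))
true_w = np.array([0.5, -1.0, 0.3, 0.8])
Y = sigmoid(X @ true_w)


def loss(w):
    """Loss over the full training set, used here only to report progress."""
    return 0.5 * np.mean((Y - sigmoid(X @ w)) ** 2)


w = np.zeros(4)
learnrate = 0.5
sample_size = 32
initial_loss = loss(w)

for step in range(2000):
    # Pick a small random sample instead of using the whole training set
    idx = rng.choice(len(X), size=sample_size, replace=False)
    x_s, y_s = X[idx], Y[idx]
    out = sigmoid(x_s @ w)
    error_term = (y_s - out) * out * (1 - out)
    # Gradient estimate from the sample, used as if it were the full gradient
    w += learnrate * (x_s.T @ error_term) / sample_size

final_loss = loss(w)
print(initial_loss, final_loss)
```

Each update touches only 32 records, so its cost does not grow with the size of the training set.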

Momentum and Learning Rate

At each step in SGD we take a very small step to reduce the loss. The aggregate of all these steps takes us towards the minimum loss. We can use information from the previous steps to choose a better direction for the next step. This is called the momentum technique, and it often gives better convergence.

When we use SGD instead of gradient descent, we take small (though noisier) steps towards the objective of minimizing the loss (or error). The size of each step is controlled by the learning rate. As we train, we may choose to keep lowering the learning rate over time for better convergence.
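The sketch below shows one common way to combine the two ideas, under stated assumptions: a velocity term that carries over part of the previous step (momentum), and a learning rate that shrinks as training progresses. The toy loss, the function decayed_rate and the constants beta, initial_rate and decay are hypothetical choices for illustration, not values from this article.

```python
import numpy as np


def grad(w):
    """Gradient of a toy loss 0.5 * (w0^2 + 10 * w1^2), used only for illustration."""
    return np.array([1.0, 10.0]) * w


def decayed_rate(initial_rate, decay, step):
    """One possible schedule: lower the learning rate as the step count grows."""
    return initial_rate / (1.0 + decay * step)


w = np.array([5.0, 5.0])
velocity = np.zeros_like(w)
beta = 0.9           # fraction of the previous step to keep (assumed value)
initial_rate = 0.05  # assumed starting learning rate
decay = 0.01         # assumed decay constant

for step in range(300):
    learnrate = decayed_rate(initial_rate, decay, step)
    # Blend the previous direction with the new gradient, then step along it
    velocity = beta * velocity - learnrate * grad(w)
    w = w + velocity

print(w)
```

The minimum of the toy loss is at the origin, so the printed weights should be very close to zero.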

Mini Batch

Mini-batching is a technique for training on subsets of the dataset instead of all the data at one time. This provides the ability to train a model, even if a computer lacks the memory to store the entire dataset.

It’s also quite useful when combined with SGD. The idea is to randomly shuffle the data at the start of each epoch, then create the mini-batches. For each mini-batch, you train the network weights with gradient descent. Since these batches are random, you’re performing SGD with each batch.


Each epoch is a single forward and backward pass over the entire dataset during training.
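As a sketch of the shuffle-then-split procedure described above, the generator below yields mini-batches that together cover the dataset once per epoch. The function name minibatches, the batch size of 4 and the toy arrays are assumptions for illustration.

```python
import numpy as np


def minibatches(features, targets, batch_size, rng):
    """Yield shuffled (features, targets) mini-batches covering the data once."""
    order = rng.permutation(len(features))
    for start in range(0, len(features), batch_size):
        idx = order[start:start + batch_size]
        yield features[idx], targets[idx]


rng = np.random.default_rng(42)
features = np.arange(20).reshape(10, 2)  # 10 hypothetical records, 2 features
targets = np.arange(10)

for epoch in range(2):
    # A fresh shuffle happens at the start of every epoch
    sizes = [len(batch_x) for batch_x, batch_y in minibatches(features, targets, 4, rng)]
    print('epoch', epoch, 'batch sizes', sizes)
```

With 10 records and a batch size of 4, each epoch produces batches of 4, 4 and 2 records. The last batch is smaller because the dataset does not divide evenly.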

Multilayer Neural Networks

So far, with a single output unit, we have been able to write the weights as an array, indexed as $w_i$.

With a hidden layer, the weights need to be stored in a matrix, indexed as $w_{ij}$. Each row in the matrix corresponds to the weights leading out of a single input unit, and each column corresponds to the weights leading in to a single hidden unit. For our three input units and two hidden units, the weights matrix looks like this:

Multi layer Neural Network
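As a small sketch of that layout, the NumPy snippet below builds a 3 by 2 weights matrix and computes the input to each hidden unit. The weight and input values are made up for illustration.

```python
import numpy as np

n_inputs, n_hidden = 3, 2

# Row i holds the weights leaving input unit i;
# column j holds the weights entering hidden unit j.
weights_input_to_hidden = np.array([[0.1, 0.4],
                                    [0.2, 0.5],
                                    [0.3, 0.6]])  # made-up values
assert weights_input_to_hidden.shape == (n_inputs, n_hidden)

x = np.array([1.0, 2.0, 3.0])  # made-up input values

# The input to hidden unit j is sum_i x_i * w_ij, a vector-matrix product
hidden_inputs = np.dot(x, weights_input_to_hidden)
print(hidden_inputs)
```

The result has one entry per hidden unit, here approximately 1.4 and 3.2.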


The backpropagation algorithm is an extension of gradient descent. It uses the chain rule to find the gradient of the error with respect to the weights connecting the input layer to the hidden layer.

To update the weights to hidden layers using gradient descent, we have to know how much error each of the hidden units contributed to the final output. Since the output of a layer is determined by the weights between layers, the error resulting from units is scaled by the weights going forward through the network. Since we know the error at the output, we can use the weights to work backwards to hidden layers.

For example, in the output layer, you have errors $\delta^o_k$ attributed to each output unit k. Then, the error attributed to hidden unit j is the output errors, scaled by the weights between the output and hidden layers (and the gradient):

$\delta^h_j = \sum_k w_{jk} \delta^o_k f'(h_j)$

Then, the gradient descent step is the same as before, just with the new errors:

$\Delta w_{ij} = \eta \delta^h_j x_i$

where $w_{ij}$ are the weights between the inputs and hidden layer and $x_i$ are the input unit values. This form holds for however many layers there are. The weight steps are equal to the step size times the output error of the layer times the values of the inputs to that layer:

$\Delta w_{ij} = \eta \delta_{output} V_{in}$

Here, you get the output error, $\delta_{output}$, by propagating the errors backwards from higher layers. The input values, $V_{in}$, are the inputs to the layer, for example the hidden layer activations feeding the output unit.
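The sketch below works through one backpropagation step for a network with three inputs, two hidden units and one output unit, all using sigmoid activations. The inputs, target and weight values are made up for illustration, and the variable names are assumptions.

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


x = np.array([0.5, 0.1, -0.2])  # made-up inputs
target = 0.6                    # made-up target
learnrate = 0.5
w_input_hidden = np.array([[0.5, -0.6],
                           [0.1, -0.2],
                           [0.1, 0.7]])
w_hidden_output = np.array([0.1, -0.3])

# Forward pass
hidden_output = sigmoid(np.dot(x, w_input_hidden))
output = sigmoid(np.dot(hidden_output, w_hidden_output))

# Backward pass
error = target - output
# Output error term: error times the sigmoid gradient at the output
output_error_term = error * output * (1 - output)
# Hidden error term: output error scaled by the weights and the hidden gradient
hidden_error_term = (w_hidden_output * output_error_term
                     * hidden_output * (1 - hidden_output))

# Weight steps: learning rate * error term * inputs to that layer
delta_w_hidden_output = learnrate * output_error_term * hidden_output
delta_w_input_hidden = learnrate * hidden_error_term * x[:, None]

print(delta_w_hidden_output)
print(delta_w_input_hidden)
```

The step for the input-to-hidden weights has the same 3 by 2 shape as the weights matrix itself, which is why the inputs are reshaped into a column with x[:, None].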

General Algorithm for updating weights using backpropagation

  1. Set the weight steps for each weight to 0.
    1. Input to hidden weights: $\Delta w_{ij} = 0$
    2. Hidden to output weights: $\Delta w_j = 0$
  2. For each record in the training data,
    1. Make a forward pass and calculate y', the predicted output
    2. Calculate the error gradient in the output unit
    3. Propagate the errors through the hidden layer(s)
    4. Update the weight steps
  3. Update the weights, where η is the learning rate and m is the number of records:
    1. $w_j = w_j + \eta \Delta w_j / m$
    2. $w_{ij} = w_{ij} + \eta \Delta w_{ij} / m$
  4. Repeat for e epochs.
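The steps above can be sketched in NumPy as follows. This is one possible reading of the algorithm for a network with a single hidden layer and one output unit; the synthetic dataset, the number of hidden units, the epoch count and the variable names are assumptions made for the example.

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


rng = np.random.default_rng(1)

# Hypothetical data: 50 records, 3 features, binary targets
features = rng.normal(size=(50, 3))
targets = (features[:, 0] + features[:, 1] > 0).astype(float)

n_hidden, epochs, learnrate = 2, 200, 0.5
m = len(features)
w_ih = rng.normal(scale=0.5, size=(3, n_hidden))  # input to hidden weights
w_ho = rng.normal(scale=0.5, size=n_hidden)       # hidden to output weights


def mean_squared_error():
    hidden = sigmoid(features @ w_ih)
    return np.mean((targets - sigmoid(hidden @ w_ho)) ** 2)


start_mse = mean_squared_error()
for e in range(epochs):
    # Step 1: set the weight steps to zero
    del_w_ih = np.zeros_like(w_ih)
    del_w_ho = np.zeros_like(w_ho)
    for x, y in zip(features, targets):
        # Step 2.1: forward pass
        hidden = sigmoid(x @ w_ih)
        output = sigmoid(hidden @ w_ho)
        # Step 2.2: error term at the output unit
        out_term = (y - output) * output * (1 - output)
        # Step 2.3: propagate the error to the hidden layer
        hid_term = w_ho * out_term * hidden * (1 - hidden)
        # Step 2.4: accumulate the weight steps
        del_w_ho += out_term * hidden
        del_w_ih += hid_term * x[:, None]
    # Step 3: update the weights with the averaged steps
    w_ho += learnrate * del_w_ho / m
    w_ih += learnrate * del_w_ih / m

end_mse = mean_squared_error()
print(start_mse, end_mse)
```

The weight steps are accumulated over all records and applied once per epoch, so each epoch performs one averaged gradient descent update.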

Brief Introduction to Tensorflow

Refer to the installation instructions on the TensorFlow website.

Once TensorFlow is installed, start Python in a terminal window, then type “import tensorflow as tf” (without quotes). If you do not get an error, TensorFlow has been installed correctly.

In TensorFlow, data values are encapsulated in an object called a tensor. For example, in hello = tf.constant('Hello World!'), hello is a 0-dimensional string tensor. Tensors come in a variety of sizes, as shown below:

# A is a 0-dimensional int32 tensor
A = tf.constant(500)

# B is a 1-dimensional int32 tensor
B = tf.constant([345, 999, 888, 7777])

# C is a 2-dimensional int32 tensor
C = tf.constant([[1, 2, 3], [444, 555, 666], [77, 88, 99]])


tf.constant() is one of many TensorFlow operations. The tensor it returns is called a constant tensor, because its value never changes.

A Tensorflow Session

A “TensorFlow Session” is an environment for running a TensorFlow graph (program). The session is in charge of allocating the operations to GPU(s) and/or CPU(s), including remote machines. The session API used here belongs to TensorFlow 1.x; in TensorFlow 2.x it is available under tf.compat.v1.

with tf.Session() as sess:
    output = sess.run(hello)

The code above evaluates the tensor hello in a session. It creates a session instance, sess, using tf.Session. The sess.run() function then evaluates the tensor and returns the results.

Tensorflow Input

Sadly you can't just set a variable to your dataset and put it in TensorFlow, because over time you'll want your TensorFlow model to take in different datasets with different parameters. tf.placeholder() returns a tensor that gets its value from data passed to the sess.run() function, allowing you to set the input right before the session runs.

Session’s feed_dict

x = tf.placeholder(tf.string)

with tf.Session() as sess:
    output = sess.run(x, feed_dict={x: 'Hello World'})

Use the feed_dict parameter of sess.run() to set the placeholder tensor.

Tensorflow Math

We can use basic math functions with tensors. Example –

tf.subtract(tf.cast(tf.constant(2.0), tf.int32), tf.constant(1))   # 1

Weights and Bias in Tensorflow

The goal of training a neural network is to modify weights and biases to best predict the labels. In order to use weights and bias, we need a Tensor that can be modified. This is where we use tf.Variable.


x = tf.Variable(5)

The tf.Variable class creates a tensor with an initial value that can be modified, much like a normal Python variable. We can use the tf.global_variables_initializer() function to initialize the state of all the Variable tensors.


init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)


The tf.global_variables_initializer() call returns an operation that will initialize all TensorFlow variables from the graph.

Initializing the weights with random numbers from a normal distribution is good practice. Randomizing the weights helps keep the model from becoming stuck in the same place every time you train it.

Similarly, choosing weights from a normal distribution prevents any one weight from overwhelming other weights. You’ll use the tf.truncated_normal() function to generate random numbers from a normal distribution.


n_features = 120
n_labels = 5
weights = tf.Variable(tf.truncated_normal((n_features, n_labels)))

The tf.truncated_normal() function returns a tensor with random values from a normal distribution whose magnitude is no more than 2 standard deviations from the mean.
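To make the "no more than 2 standard deviations" property concrete, here is a NumPy sketch that mimics the behaviour by re-drawing any value that falls outside the range. It is an illustration only, not TensorFlow's implementation, and the helper truncated_normal below is a hypothetical function written for this example.

```python
import numpy as np


def truncated_normal(shape, mean=0.0, stddev=1.0, rng=None):
    """Sample normal values, re-drawing any that fall more than 2 stddev from the mean."""
    if rng is None:
        rng = np.random.default_rng()
    values = rng.normal(mean, stddev, size=shape)
    outside = np.abs(values - mean) > 2 * stddev
    while outside.any():
        # Replace only the out-of-range entries with fresh samples
        values[outside] = rng.normal(mean, stddev, size=outside.sum())
        outside = np.abs(values - mean) > 2 * stddev
    return values


n_features, n_labels = 120, 5
weights = truncated_normal((n_features, n_labels), rng=np.random.default_rng(0))
print(weights.shape, np.abs(weights).max() <= 2.0)
```

Every sampled weight ends up within 2 standard deviations of the mean, so no single initial weight is far larger than the rest.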

Since the weights are already helping prevent the model from getting stuck, you don’t need to randomize the bias. Let’s use the simplest solution, setting the bias to 0.


num_labels = 3

bias = tf.Variable(tf.zeros(num_labels))

The tf.zeros() function returns a tensor with all zeros.



ReLU and Softmax Activation Functions

Instead of sigmoids, most recent deep learning networks use rectified linear units (ReLUs) for the hidden layers. A rectified linear unit outputs 0 if its input is less than 0, and the input itself otherwise, that is, f(x) = max(0, x).

Often you’ll find you want to predict if some input belongs to one of many classes. This is a classification problem, but a sigmoid is no longer the best choice. Instead, we use the softmax function. The softmax function squashes the outputs of each unit to be between 0 and 1. It also divides each output such that the total sum of the outputs is equal to 1.
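Both functions are short enough to sketch directly in NumPy. The example values in logits are made up, and subtracting the maximum inside softmax is a common numerical safeguard added here rather than something the text requires.

```python
import numpy as np


def relu(x):
    """Output 0 for negative inputs and the input itself otherwise."""
    return np.maximum(0, x)


def softmax(x):
    """Squash values into the range 0 to 1 so that they sum to 1."""
    # Subtracting the max does not change the result but avoids overflow in exp
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)


logits = np.array([2.0, 1.0, -1.0])  # made-up unit outputs
print(relu(logits))
print(softmax(logits))
```

The softmax outputs can be read as class probabilities: the largest input gets the largest share, and the shares always add up to 1.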


The above can be used for one of the most important applications of deep learning: classification of images using Convolutional Neural Networks (CNNs).



Ashish Lal
Ashish Lal is a freelancer. His interests are Deep Learning, Machine Learning, Networking, VoIP, embedded software and Java.
