A Deep-Dive into CNNs

Teerka Baskaran
9 min read · Dec 15, 2021


Convolutional Neural Networks (CNNs for short) are a type of neural network used in machine learning for image recognition, and they are at times more accurate than the human eye. Take the image below, for example.

Knee Cartilage MRI sample from diseased and healthy patients.

The knee cartilage MRI scan on the left comes from a patient who will be diagnosed with Osteoarthritis, while the one on the right comes from a healthy patient. The differences are subtle, right? This is why Osteoarthritis tends to be diagnosed 3 years too late, after most of the damage is done.

However, a machine learning algorithm trained on MRI scans of diseased and healthy knee cartilage was able to tell the two apart by detecting a pattern that was evident across millions of scans but undetectable by the human eye. CNNs will revolutionize image detection.

So… What is Machine Learning?

Machine Learning is a subset of Artificial Intelligence that focuses on the use of data and algorithms to allow the machine to learn automatically and improve from experiences without being explicitly coded to do so — mimicking the human learning process.

Types of Machine Learning

There are three ways a machine can learn: supervised learning, unsupervised learning, and reinforcement learning. Let’s link this back to humans, specifically 6-year-old Anika who is just learning how to add.

1. Supervised Learning (training on labeled data)

Supervised Learning is when we teach a machine with labeled data. For example, we would give Anika 30 solved problem sets and tell her which ones were added correctly and which ones were added incorrectly. With those labels, she can find the pattern more easily.

It is like an answer key with the arithmetic symbols removed.

Supervised Learning primarily solves Regression and Classification problems.

2. Unsupervised Learning (training on unlabelled data)

In Unsupervised Learning, we remove the teacher and the correct labels and just train the machine on the unlabelled data. This way Anika looks at the problems and draws connections herself. The unique part of Unsupervised Learning is that it can discover patterns we did not ask it to find or cannot see ourselves, which makes it especially popular in Data Mining.

Unsupervised Learning primarily solves Association and Clustering problems.

3. Reinforcement Learning

Reinforcement Learning is when there is no pre-collected input data. Instead, we put the machine in an environment where performing certain actions results in rewards for those actions. For example, in the classroom, Anika will learn that doing her classwork results in positive responses from her teacher.

Deep Learning

Now you are ready to understand Deep Learning, a subset of machine learning that involves training Artificial Neural Networks and Convolutional Neural Networks. Deep Learning can use both supervised and unsupervised learning to train the AI.

The whole purpose of Deep Learning is to emulate how our brain works, hence why we call the AI a “neural network”. In humans, we have neurons that receive an electric signal and transmit it to the next neuron. Similarly in AI, we have neurons represented by the circles. We also have synapses that transmit signals in the brain represented by the lines connecting the neurons.

These neurons are grouped into 3 different layers:

  1. Input Layer
  2. Hidden Layer(s)
  3. Output Layer

The input layer receives input values. In the human brain, this is like your 5 senses. Input values should be standardized, which means rescaling them so that they have a mean of 0 and a variance of 1. This puts all inputs on a comparable scale, which makes them easier for the neural network to process. Inputs are then sent to the first hidden layer.

Learn more about standardization here.
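
As a rough illustration of the standardization step, here is a minimal sketch in plain Python. The function name `standardize` and the sample values are made up for this example, and it assumes the population standard deviation is the one used for scaling.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Rescale values so they have a mean of 0 and a variance of 1."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation (an assumption)
    if std == 0:
        # All values are identical, so there is nothing to scale.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

# Hypothetical raw inputs
raw = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scaled = standardize(raw)
print(scaled)
```

After this step, the mean of `scaled` is 0 and its variance is 1, no matter how large or small the raw values were.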

The hidden layers perform mathematical functions on the inputs. In a CNN, this includes convolutional layers, pooling layers, and fully connected layers (more on this later).

Having more than one hidden layer is what makes deep learning “deep”. Hidden Layers give additional flexibility and power. We will see later why this is necessary for Convolutional Neural Networks.

The output layer returns the output. Output values can be:

  • continuous (time)
  • binary (yes/no)
  • categorical (multiple output values representing categories)

So how do we develop the output? This is where Deep Learning kicks in.

Each synapse (the connection between neurons) is assigned a weight. A weight is a number that determines how much the input value or neuron matters to the network. Weights can be positive or negative and are not limited to the range 0 to 1. Initially, the weights are set at random, and they are adjusted as the model is trained.

The neuron takes the weighted sum of all the input values and applies an activation function. One of the roles of the activation function is to map that sum into a predictable output range. The final value gets passed on to the next layer.
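
To make the weighted sum concrete, here is a small sketch of a single neuron. The names `neuron_output` and `sigmoid`, the sample numbers, and the `bias` term are assumptions added for illustration (the article does not mention a bias, but most networks include one).

```python
import math

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs, followed by an activation function."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(weighted_sum)

# Hypothetical values for one neuron with three incoming synapses
inputs = [0.5, -1.0, 2.0]
weights = [0.4, 0.3, -0.2]
out = neuron_output(inputs, weights, bias=0.1, activation=sigmoid)
print(out)
```

Here the weighted sum works out to roughly -0.4, and the sigmoid turns it into a value between 0 and 1 that is passed to the next layer.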

Let’s take a look at 4 activation functions.

In the Threshold Function:

  • if the value is more than or equal to 0, it passes on a 1
  • if the value is less than 0, it passes a 0

The Sigmoid Function uses the formula f(x) = 1 / (1 + e^(-x)), where x is the weighted sum. Its output always falls between 0 and 1, which makes it ideal for predicting a probability in the last layer.

The Rectifier Function outputs the input directly if it is positive; otherwise, it outputs zero. It is the most commonly used activation function in the hidden layers of Artificial Neural Networks.

The Hyperbolic Tangent Function is very similar to the Sigmoid Function; however, its output ranges between -1 and 1.
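
The four activation functions above can be sketched in a few lines of Python. The function names are chosen for this example only, and the sample inputs are arbitrary.

```python
import math

def threshold(x):
    """Pass on 1 if the value is at least 0, otherwise 0."""
    return 1.0 if x >= 0 else 0.0

def sigmoid(x):
    """Smooth curve with outputs between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def rectifier(x):
    """Output the input if it is positive, otherwise zero."""
    return max(0.0, x)

def hyperbolic_tangent(x):
    """Similar shape to sigmoid, but outputs lie between -1 and 1."""
    return math.tanh(x)

# Compare the four functions on a few arbitrary weighted sums
for value in [-2.0, 0.0, 2.0]:
    print(value, threshold(value), sigmoid(value),
          rectifier(value), hyperbolic_tangent(value))
```

Running the loop shows how the same weighted sum is treated very differently depending on the activation function chosen.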

To summarize, values are standardized in the input layer and passed to the hidden layers, where each neuron takes the weighted sum of its inputs, applies an activation function, and passes the result on toward the output layer.

Let’s dive into CNNs and you’ll see the magic of Deep Learning.

Convolutional Neural Networks

Convolutional Neural Networks are a type of deep neural network used for image detection. Before we start, let’s take a look at the image below. You might notice two images — one with the man looking right at us and one with the man sideways.

Our brain recognizes this image using the features we see, and it struggles to settle on one interpretation because the features support both options. In other words, our brain works off of features to classify an image. CNNs do the same.

Let’s say we had a black and white 2 x 2 image. It can be digitally represented as an array, with each pixel having a value between 0 and 255: 0 is black, 255 is white, and everything in between falls along the greyscale.

For a colored image, instead of just one value, each pixel has 3 values, Red, Green, and Blue, each between 0 and 255.

However, instead of using values from 0 to 255, let’s simplify things by using 0 for white and 1 for black, which creates the input layer seen below.
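
Here is a small sketch of how such images could be stored as arrays. It assumes the usual convention that pixel values run from 0 (black) to 255 (white). The `binarize` helper and its `cutoff` of 128 are hypothetical choices made only to produce the simplified 0 and 1 image.

```python
# A 2 x 2 greyscale image: one value per pixel, 0 = black, 255 = white
grey = [
    [0, 255],
    [128, 64],
]

# A 2 x 2 colored image: each pixel holds (Red, Green, Blue) values
rgb = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]

def binarize(image, cutoff=128):
    """Simplify a greyscale image to 0 = white and 1 = black.

    The cutoff is an arbitrary assumption for this example.
    """
    return [[0 if pixel >= cutoff else 1 for pixel in row] for row in image]

simple = binarize(grey)
print(simple)  # [[1, 0], [0, 1]]
```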

Step 1: Convolution

Convolution is the first layer of the neural network, and its job is to extract features from the image. It is a mathematical function, seen below, that takes two inputs: the image matrix and the filter matrix.

Convolution Function

It does this using a filter (usually a 3 x 3 matrix), which is passed along 3 x 3 segments of the input image.

The function multiplies each pixel of the input image by the overlapping value of the filter matrix. For each 3 x 3 segment, the products are summed, and the sum is written to the corresponding cell of the feature map (also called the convolved feature).

Examples of filters, also known as kernels or feature detectors, are seen below.

Stride is the number of pixels by which the filter shifts over the input matrix. Below we see convolution with a stride of 2, where the filter moves 2 pixels over each time.
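
The multiply-and-sum step can be sketched as follows. This is a minimal version that assumes a stride of 1, no padding, and a square filter. Like most deep learning libraries, it does not flip the filter, so strictly speaking it computes a cross-correlation. The function name `convolve2d` and the sample image are made up for this example.

```python
def convolve2d(image, kernel):
    """Slide a square kernel over the image and build the feature map."""
    k = len(kernel)
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    feature_map = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            # Multiply each overlapping pair of values and sum the products
            total = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
            row.append(total)
        feature_map.append(row)
    return feature_map

# Hypothetical 5 x 5 binary image containing an "X" shape
image = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
]

# A 3 x 3 filter that looks for the same "X" shape
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = convolve2d(image, kernel)
print(feature_map)  # [[2, 0, 2], [0, 5, 0], [2, 0, 2]]
```

The largest value, 5, sits in the center of the feature map, which is exactly where the filter lines up with the "X" in the image.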
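
To see what stride does to the size of the output, here is a short sketch. It assumes no padding is added around the image, and the helper names `output_size` and `filter_positions` are invented for this example.

```python
def output_size(input_size, filter_size, stride):
    """Number of positions the filter can take along one axis (no padding)."""
    return (input_size - filter_size) // stride + 1

def filter_positions(input_size, filter_size, stride):
    """Starting offsets the filter visits along one axis."""
    return list(range(0, input_size - filter_size + 1, stride))

# A 7-pixel-wide input with a 3-pixel-wide filter
print(output_size(7, 3, stride=1), filter_positions(7, 3, stride=1))
print(output_size(7, 3, stride=2), filter_positions(7, 3, stride=2))
```

With a stride of 1 the filter visits 5 positions per row, while a stride of 2 visits only 3, so a larger stride produces a smaller feature map.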

ReLU Layer

ReLU stands for Rectified Linear Unit, a non-linear activation function. Its output is f(x) = max(0, x). The role of ReLU is to increase non-linearity in our network, because the images themselves are highly non-linear. ReLU is often chosen over sigmoid and other activation functions because it is cheap to compute and tends to train faster in deep networks.

ReLU Operation
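
Applied to a feature map, ReLU simply replaces every negative value with zero. Here is a minimal sketch with made-up values. The name `relu_map` is chosen for this example.

```python
def relu(x):
    """f(x) = max(0, x)"""
    return max(0, x)

def relu_map(feature_map):
    """Apply ReLU to every value in a feature map."""
    return [[relu(value) for value in row] for row in feature_map]

# Hypothetical feature map with some negative values
feature_map = [
    [3, -1, 0],
    [-4, 2, -2],
    [1, -5, 6],
]

rectified = relu_map(feature_map)
print(rectified)  # [[3, 0, 0], [0, 2, 0], [1, 0, 6]]
```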

Step 2: Pooling

The neural network needs a property called spatial invariance, which allows it to detect a feature even if it is a bit tilted, further apart, etc. This is done through pooling, which also shrinks the feature maps and so reduces the number of values to process when images are large. There are 3 types of spatial pooling:

  • Max Pooling
  • Average Pooling
  • Sum Pooling

Max pooling takes the largest element from each region of a feature map. Usually, a 2 x 2 window with a stride of 2 is passed over the rectified feature map, and the maximum of each window is recorded in the pooled feature map.

Max Pooling
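
Max pooling with a 2 x 2 window and a stride of 2 can be sketched like this. The function name `max_pool` and the sample feature map are made up for illustration.

```python
def max_pool(feature_map, size=2, stride=2):
    """Keep only the largest value from each window of the feature map."""
    rows = (len(feature_map) - size) // stride + 1
    cols = (len(feature_map[0]) - size) // stride + 1
    pooled = []
    for r in range(rows):
        row = []
        for c in range(cols):
            # Collect the values covered by the current window
            window = [
                feature_map[r * stride + i][c * stride + j]
                for i in range(size)
                for j in range(size)
            ]
            row.append(max(window))
        pooled.append(row)
    return pooled

# Hypothetical 4 x 4 rectified feature map
feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 2],
    [2, 0, 3, 4],
]

pooled = max_pool(feature_map)
print(pooled)  # [[4, 2], [2, 5]]
```

The 16 values of the input shrink to 4 values in the output, while the strongest response in each region is kept.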

Benefits of pooling include:

  • reduces the size of the feature map by 75% (with a 2 x 2 window and a stride of 2)
  • preserves features
  • introduces spatial invariance
  • reduces the number of parameters in the following layers, which helps prevent overfitting

Step 3: Fully Connected Layer

The pooled feature map is then laid out as one long vector, with each value following the other, which becomes the input to the fully connected layer, just like in an Artificial Neural Network. This process is known as flattening. Here we will have an activation function that decides the output value.
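
Flattening and a fully connected layer might look roughly like this. Everything here is illustrative: the weights and biases are made-up numbers rather than trained values, and softmax is only one possible choice for the final activation function.

```python
import math

def flatten(feature_maps):
    """Unroll a list of 2D pooled feature maps into one long vector."""
    return [value for fmap in feature_maps for row in fmap for value in row]

def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum per output neuron."""
    return [
        sum(x * w for x, w in zip(inputs, neuron_weights)) + b
        for neuron_weights, b in zip(weights, biases)
    ]

def softmax(scores):
    """Turn raw scores into probabilities that add up to 1."""
    highest = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# One hypothetical pooled feature map
pooled_maps = [[[4, 2], [2, 5]]]
flat = flatten(pooled_maps)  # [4, 2, 2, 5]

# Made-up weights and biases for two output classes
weights = [
    [0.1, 0.0, 0.0, 0.1],
    [0.0, 0.1, 0.1, 0.0],
]
biases = [0.0, 0.0]

probabilities = softmax(dense(flat, weights, biases))
print(flat, probabilities)
```

Each output neuron sees every value of the flattened vector, which is why this layer is called fully connected.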

Training the Neural Network

Now comes the hard part of training the neural network. This is difficult because we need a large set of data and a large amount of computational power.

At first, weights are chosen at random. Then, we calculate the difference between the actual value and our output value to know how “off” the weights were. This is known as calculating the cost function.

Cost Function
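
As one possible example of a cost function, here is a sketch of the mean squared error. This is an assumption for illustration, since many other cost functions exist, and the predictions and targets are made-up numbers.

```python
def mean_squared_error(predictions, targets):
    """Average of the squared differences between outputs and actual values."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# Hypothetical network outputs and the actual values
predictions = [0.9, 0.2, 0.4]
targets = [1.0, 0.0, 1.0]

cost = mean_squared_error(predictions, targets)
print(cost)
```

The further the outputs are from the actual values, the larger the cost, and a perfect set of predictions gives a cost of exactly 0.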

After we know the cost, we feed it back into the neural network and the weights are adjusted until the cost function is as close to 0 as possible.

Well… how are the weights adjusted to minimize the cost function?

One way we could do this is through brute force, where we try out 1000 candidate values for a weight and get a chart like this… The problem is that with many weights, the number of combinations grows exponentially, and not even the fastest computer is able to test them all. It would take Sunway TaihuLight, formerly the world’s fastest supercomputer, 3.42 x 10⁵⁰ years to do this.

Sunway TaihuLight

We are able to do this through gradient descent. Gradient Descent is a technique that allows us to find the minimum of our cost function.

It works by computing the derivative of the cost function at the current weight, which tells us in which direction the minimum lies. Then, the weight is adjusted by a small increment to move it closer to the minimum. This is repeated over many iterations, which requires a large amount of computational power.

All the weights are adjusted simultaneously as the errors are backpropagated through the network, from the output layer back toward the input layer.
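
Here is a minimal sketch of gradient descent on the simplest possible case: a model with a single weight, `output = w * x`, and a mean squared error cost. The data, the `learning_rate` of 0.05, and the 200 iterations are arbitrary assumptions. A real network repeats the same idea for every weight, using backpropagation to obtain the derivatives.

```python
def cost(w, xs, ys):
    """Mean squared error of the one-weight model output = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient(w, xs, ys):
    """Derivative of the cost with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

# Made-up training data where the ideal weight is 2
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w = 0.0               # starting weight (would normally be random)
learning_rate = 0.05  # size of each small adjustment

for step in range(200):
    # Move the weight a small step against the slope of the cost
    w -= learning_rate * gradient(w, xs, ys)

print(w, cost(w, xs, ys))
```

Each pass nudges the weight toward the value that makes the cost smallest, so after enough iterations `w` ends up very close to 2 and the cost is very close to 0.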

Now, our Convolutional Neural Network is ready for use!

Learn More!

If you would like to learn more about Convolutional Neural Networks, I recommend you check out the Machine Learning A-Z: Hands-On Python & R In Data Science https://www.udemy.com/course/machinelearning/.

If you enjoyed reading this article, please give it a few claps so it can reach more people. Thank you!

Follow me on LinkedIn

Follow me on Twitter
