CNN Building Blocks
Neural networks accept an input image/feature vector (one input node for each entry) and transform it through a series of hidden layers, commonly using nonlinear activation functions. Each hidden layer is also made up of a set of neurons, where each neuron is fully connected to all neurons in the previous layer. The last layer of a neural network (i.e., the “output layer”) is also fully connected and represents the final output classifications of the network.
However, neural networks operating directly on raw pixel intensities:
- Do not scale well as the image size increases.
- Leave much accuracy to be desired (for example, a standard feedforward neural network on CIFAR-10 obtained only 52% accuracy).
To demonstrate how standard neural networks do not scale well as image size increases, let’s again consider the CIFAR-10 dataset. Each image in CIFAR-10 is 32×32 with a Red, Green, and Blue channel, yielding 32×32×3 = 3,072 inputs to our network.
A total of 3,072 inputs does not seem like much, but consider 250×250 pixel images instead: the number of inputs jumps to 250×250×3 = 187,500, and every neuron in the first hidden layer needs one weight per input. That figure covers the input layer alone. We would surely want multiple hidden layers with a varying number of nodes per layer, so the parameter count adds up quickly, and given the poor performance of standard neural networks on raw pixel intensities, this bloat is hardly worth it.
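The growth described above can be checked with a few lines of arithmetic. This is a minimal sketch: the helper `dense_weights` and the hidden layer width of 1,024 nodes are assumptions chosen only for illustration and do not come from the text.

```python
def dense_weights(n_inputs, n_hidden):
    """Learnable values in one fully connected layer: weights plus biases."""
    return n_inputs * n_hidden + n_hidden


cifar_inputs = 32 * 32 * 3      # 3,072 inputs for a CIFAR-10 image
large_inputs = 250 * 250 * 3    # 187,500 inputs for a 250x250 RGB image

# Hypothetical width of the first hidden layer (not specified in the text).
hidden_nodes = 1024

print(dense_weights(cifar_inputs, hidden_nodes))   # 3146752
print(dense_weights(large_inputs, hidden_nodes))   # 192001024
```

Under these assumptions, a single hidden layer on the larger image already needs roughly 192 million learnable values.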
Instead, we can use Convolutional Neural Networks (CNNs), which take advantage of the input image structure and define the network architecture in a more sensible way. Unlike in a standard neural network, the layers of a CNN are arranged as a 3D volume with three dimensions: width, height, and depth (where depth refers to the third dimension of the volume, such as the number of channels in an image or the number of filters in a layer).
To make this example more concrete, again consider the CIFAR-10 dataset: the input volume will have dimensions 32×32×3 (width, height, and depth, respectively). Neurons in subsequent layers will only be connected to a small region of the layer before them, rather than to every neuron as in the fully connected structure of a standard neural network. We call this local connectivity, and it saves a huge number of parameters in our network. Finally, the output layer will be a 1×1×N volume, which represents the image distilled into a single vector of class scores. In the case of CIFAR-10, given ten classes, N = 10, yielding a 1×1×10 volume.
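To get a feel for how much local connectivity saves, the sketch below counts parameters for three ways of mapping a 32×32×3 input to a 32×32×32 output volume. The 3×3 window, the 32 output channels, and the function names are assumptions made for illustration. The third variant also shares one set of weights across all positions, which is what convolutional layers do in practice on top of local connectivity.

```python
def fully_connected_params(n_inputs, n_outputs):
    # Every output neuron sees every input, plus one bias per output neuron.
    return (n_inputs + 1) * n_outputs


def locally_connected_params(window, in_depth, n_outputs):
    # Every output neuron sees only a window x window x in_depth region,
    # but still owns its own weights and bias.
    return (window * window * in_depth + 1) * n_outputs


def conv_params(window, in_depth, n_filters):
    # Local connectivity plus weight sharing: one small filter (and bias)
    # per output channel, reused at every spatial position.
    return (window * window * in_depth + 1) * n_filters


n_inputs = 32 * 32 * 3       # CIFAR-10 input volume
n_outputs = 32 * 32 * 32     # hypothetical output volume, 32 channels

print(fully_connected_params(n_inputs, n_outputs))   # 100696064
print(locally_connected_params(3, 3, n_outputs))     # 917504
print(conv_params(3, 3, 32))                         # 896
```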
There are many types of layers used to build Convolutional Neural Networks, but the ones you are most likely to encounter include:
- Convolutional (CONV)
- Activation (ACT, or the name of the actual activation function, such as RELU)
- Pooling (POOL)
- Fully connected (FC)
- Batch normalization (BN)
- Dropout (DO)
Stacking a series of these layers in a specific manner yields a CNN. We often use simple text diagrams to describe a CNN:
INPUT => CONV => RELU => FC => SOFTMAX
Here, we define a simple CNN that accepts an input, applies a convolution layer, then an activation layer, then a fully connected layer, and, finally, a softmax classifier to obtain the output classification probabilities. The SOFTMAX activation layer is often omitted from the network diagram as it is assumed it directly follows the final FC layer.
Of these layer types, CONV and FC (and to a lesser extent, BN) are the only layers that contain parameters that are learned during the training process. Activation and dropout layers are not considered true “layers” themselves but are often included in network diagrams to make the architecture explicitly clear. Pooling layers (POOL), of equal importance as CONV and FC, are also included in network diagrams as they have a substantial impact on the spatial dimensions of an image as it moves through a CNN.
CONV, POOL, RELU, and FC are the most important when defining your actual network architecture. That’s not to say that the other layers are not critical, but they take a backseat to this critical set of four, which defines the actual architecture itself.
Remark: When defining CNN architectures we often omit the activation layers from a table/diagram to save space; however, the activation layers are implicitly assumed to be part of the architecture.
In this tutorial, we’ll review each of these layer types in detail and discuss the parameters associated with each layer (and how to set them). In a future tutorial, I’ll discuss in more detail how to stack these layers properly to build your own CNN architectures.
5 Layers of a Convolutional Neural Network
1. Convolutional Layer: This layer performs the convolution operation on the input data, which extracts various features from the data.
Convolutional layers are one of the most vital components of a CNN architecture. These layers are responsible for extracting features from the input data and forming the basis for further processing and learning. A convolutional layer consists of a set of filters (also known as kernels) applied to the input data in a sliding window fashion. Each filter extracts a specific set of features from the input data based on the weights associated with it.
The number of filters used in a convolutional layer is one of the key hyperparameters of the architecture. It is chosen based on the type of data being processed and the desired accuracy of the model. Generally, more filters extract more features from the input data, which lets the network represent the data in more detail at the cost of more parameters and computation.
The convolution operation multiplies each filter element-wise with the data inside the sliding window and sums the results into a single value. Repeating this at every window position yields one feature map per filter, so a single convolutional layer produces multiple feature maps. These feature maps are then used as input for the following layers, allowing the network to learn more complex features from the data.
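The sliding-window operation described above can be sketched in plain Python. This is an illustrative, unoptimized version that assumes a single-channel input, stride 1, and no padding; the function name `conv2d` and the two example filters are made up for the example. Like most deep learning libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (both 2D lists), stride 1, no padding."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for y in range(ih - kh + 1):
        row = []
        for x in range(iw - kw + 1):
            # Multiply the window element-wise with the filter and sum up.
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[y + i][x + j] * kernel[i][j]
            row.append(total)
        out.append(row)
    return out


image = [
    [1, 2, 3, 0],
    [0, 1, 2, 3],
    [3, 0, 1, 2],
    [2, 3, 0, 1],
]

# Two hand-written example filters; in a CNN these weights are learned.
filters = [
    [[1, 0, -1], [1, 0, -1], [1, 0, -1]],    # responds to vertical edges
    [[1, 1, 1], [0, 0, 0], [-1, -1, -1]],    # responds to horizontal edges
]

# One feature map per filter.
feature_maps = [conv2d(image, f) for f in filters]
print(feature_maps[0])  # [[-2, -2], [2, -2]]
print(feature_maps[1])  # [[2, 2], [-2, 2]]
```

A 4×4 input and a 3×3 filter give a 2×2 feature map here because no padding is used.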
Convolutional layers are the foundation of many deep learning architectures and are used in various applications, such as image recognition, natural language processing, and speech recognition. By extracting the most critical features from the input data, convolutional layers enable the network to learn more complex patterns and make better predictions.
2. Pooling Layer: This layer performs a downsampling operation on the feature maps, which reduces the amount of computation required and also helps to reduce overfitting.
The pooling layer is a vital component of a CNN architecture. It is typically used to reduce the input volume size while extracting meaningful information from the data.
Pooling layers are usually placed between groups of convolutional layers, so that later stages of the CNN work on smaller volumes and can focus on more abstract features of an image or other type of input. The pooling layer operates by sliding a window over the input volume and computing a summary statistic for the values within the window. Common statistics include the maximum, average, or sum of the values within the window. This reduces the input volume’s size while preserving important information about the data. The pooling layer also introduces a degree of spatial invariance, meaning that small shifts in the location of a feature within the image change the output very little. This allows the network to learn more general features of the image rather than simply memorizing exact locations.
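As a rough illustration of the window-and-summarize idea, here is a small pooling sketch in plain Python. The 2×2 window, the stride of 2, and the names `pool2d` and `mean` are assumptions picked for the example; the summary statistic is passed in as a function so that max and average pooling share the same loop.

```python
def mean(values):
    return sum(values) / len(values)


def pool2d(feature_map, size=2, stride=2, reduce=max):
    """Summarize each size x size window of a 2D list with `reduce`."""
    h, w = len(feature_map), len(feature_map[0])
    out = []
    for y in range(0, h - size + 1, stride):
        row = []
        for x in range(0, w - size + 1, stride):
            window = [feature_map[y + i][x + j]
                      for i in range(size) for j in range(size)]
            row.append(reduce(window))
        out.append(row)
    return out


feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 8, 3, 4],
]

print(pool2d(feature_map))               # [[6, 4], [8, 9]]
print(pool2d(feature_map, reduce=mean))  # [[3.75, 2.25], [4.5, 4.0]]
```

With these settings the 4×4 feature map shrinks to 2×2, halving the width and the height.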
3. Activation Layer: This layer adds non-linearity to the model by applying a non-linear activation function such as ReLU or tanh.
An activation layer in a CNN applies a non-linear transformation to the output of the convolutional layer. It is a primary component of the network, allowing it to learn complex relationships between the input and output data. The activation layer can be thought of as a function that takes the output of the convolutional layer and maps it to a different set of values. This enables the network to learn more complex patterns in the data and generalize better. Common activation functions used in CNNs include ReLU (Rectified Linear Unit), sigmoid, and tanh. Each serves a different purpose and suits different scenarios. ReLU is the most commonly used activation function in convolutional networks. It outputs 0 for all negative values and the input itself for all positive values. Sigmoid outputs values between 0 and 1 for any given input; it also lets the network model complex relationships but is more computationally expensive than ReLU. Tanh outputs values between -1 and 1 for any given input and is less commonly used than ReLU in convolutional networks.
The activation layer is an essential component of the CNN: it introduces the non-linearity without which a stack of layers would collapse into a single linear transformation. Selecting a suitable activation function can lead to better performance of the CNN.
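The three activation functions mentioned above are short enough to write out directly. The sketch below uses only Python's `math` module and applies each function to a handful of sample values; the sample values themselves are arbitrary.

```python
import math


def relu(x):
    # 0 for negative inputs, the input itself otherwise.
    return max(0.0, x)


def sigmoid(x):
    # Squashes any input into the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    # Squashes any input into the open interval (-1, 1).
    return math.tanh(x)


samples = [-2.0, -0.5, 0.0, 0.5, 2.0]
for name, fn in [("relu", relu), ("sigmoid", sigmoid), ("tanh", tanh)]:
    print(name, [round(fn(x), 3) for x in samples])
```

In an activation layer, the chosen function is applied independently to every value in the incoming volume, so the volume's shape does not change.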
4. Fully Connected Layer: This layer connects each neuron in one layer to every neuron in the next layer, resulting in a fully-connected network.
A fully connected layer in a CNN is a layer in which each neuron is connected to every neuron in the previous layer. This is in contrast to convolutional layers, where neurons are only connected to a small local region of the previous layer. Because every neuron sees the whole previous layer, a fully connected layer can combine information from all spatial positions and feature maps at once. Fully connected layers are typically used towards the end of a CNN architecture, after the convolutional and pooling layers, where they combine the extracted features into patterns and correlations that the convolutional layers alone may not have captured. Together with their activation functions, they also produce the non-linear decision boundary used for classification. In short, fully connected layers are an integral part of most CNNs and a powerful tool for identifying patterns and correlations in the data.
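A minimal sketch of that last step: the feature maps are flattened into one vector, and each output neuron computes a weighted sum over the whole vector plus a bias. The names `flatten` and `fully_connected`, and all the numbers, are invented for illustration; real weights would be learned during training.

```python
def flatten(feature_maps):
    """Turn a list of 2D feature maps into one flat list of values."""
    return [v for fmap in feature_maps for row in fmap for v in row]


def fully_connected(inputs, weights, biases):
    """One output per weight row: dot(row, inputs) + bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


# Two tiny 2x2 feature maps, standing in for the output of earlier layers.
feature_maps = [
    [[1, 2], [3, 4]],
    [[0, 1], [1, 0]],
]
inputs = flatten(feature_maps)   # 8 values

# Two output neurons, each connected to all 8 inputs (made-up weights).
weights = [
    [1, 0, 0, 0, 0, 0, 0, 1],
    [0.5] * 8,
]
biases = [0, 1]

scores = fully_connected(inputs, weights, biases)
print(scores)  # [1, 7.0]
```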
5. Output Layer: This is the final layer of the network, which produces the output labels or values.
The output layer of a CNN is the final layer in the network and is responsible for producing the output. It takes the features extracted by the previous layers and combines them to produce the desired result. A single neuron is typically used when the output is a single value, such as a regression target or a binary decision. A fully connected layer with one neuron per class is generally used when the output is a vector, such as a set of class scores. A softmax activation function is applied when the output should be a probability distribution over classes. The output layer thus performs the final linear or non-linear transformation of its inputs needed to obtain that output. Regularization techniques such as dropout or batch normalization are applied to the layers before the output layer, not to the output layer itself, to improve the network’s performance.
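To show how raw class scores become a probability distribution, here is a small softmax sketch using only the standard library. The function name and the example scores are illustrative; subtracting the largest score before exponentiating is a common trick to avoid overflow and does not change the result.

```python
import math


def softmax(scores):
    """Map a list of raw class scores to probabilities that sum to 1."""
    shift = max(scores)                        # for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class_scores = [2.0, 1.0, 0.1]   # made-up scores for three classes
probs = softmax(class_scores)

print([round(p, 3) for p in probs])
print(probs.index(max(probs)))   # index of the predicted class
```

The class with the highest raw score always receives the highest probability, so taking the largest entry gives the network's predicted label.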