Basics of Image Classification with PyTorch

Many deep learning frameworks have been released over the past few years. Among them, PyTorch from Facebook AI Research stands out and has gained widespread adoption because of its elegance, flexibility, speed, and simplicity.

Most deep learning frameworks have either been too specific to application development without sufficient support for research, or too specific for research without sufficient support for application development.

However, PyTorch blurs the line between the two by providing an API that’s very friendly to application developers while at the same time providing functionalities to easily define custom layers and fully control the training process, including gradient propagation.

This makes it a great fit for both developers and researchers.

Chief among PyTorch’s features is its define-by-run approach, which makes it possible to change the structure of neural networks on the fly, unlike deep learning libraries that rely on inflexible static graphs.

In this post, you’ll learn from scratch how to build a complete image classification pipeline with PyTorch. Get ready for an exciting ride!

Installing PyTorch

Installing PyTorch is a breeze thanks to pre-built binaries that work well across all systems.

INSTALL ON WINDOWS

CPU Only:

pip3 install http://download.Pytorch.org/whl/cpu/torch-0.4.0-cp35-cp35m-win_amd64.whl

pip3 install torchvision

With GPU Support:

pip3 install http://download.Pytorch.org/whl/cu80/torch-0.4.0-cp35-cp35m-win_amd64.whl
pip3 install torchvision

INSTALL ON LINUX

CPU Only:

pip3 install http://download.Pytorch.org/whl/cpu/torch-0.4.0-cp35-cp35m-linux_x86_64.whl

pip3 install torchvision

With GPU Support:

pip3 install torch torchvision

INSTALL ON OSX

CPU Only:

pip3 install torch torchvision

With GPU Support:

Visit Pytorch.org for instructions on installing with GPU support on OSX.

Note: To run experiments in this post, you should have a CUDA-capable GPU. If you don’t, no problem! Visit colab.research.google.com to get a free cloud-based, GPU-accelerated VM.

Brief Introduction to Convolutional Neural Networks

The models we’ll be using in this post belong to a class of neural networks called Convolutional Neural Networks (CNNs). A CNN is primarily a stack of convolution layers, often interleaved with normalization and activation layers. The components of a convolutional neural network are summarized below.

CNN — A stack of convolution layers

Convolution Layer — A layer to detect certain features. Has a specific number of channels.

Channels — Detects a specific feature in the image.

Kernel/Filter — The feature to be detected in each channel. It has a fixed size, usually 3 x 3.

To briefly explain, a convolution layer is simply a feature detection layer. Every convolution layer has a specific number of channels; each channel detects a specific feature in the image. Each feature to detect is often called a kernel or a filter. The kernel has a fixed size; kernels of size 3 x 3 are the most common.

For example, a convolution layer with 64 channels and kernel size of 3 x 3 would detect 64 distinct features, each of size 3 x 3.
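To make the 64-channel example concrete, the short sketch below counts the learnable values in such a layer. It assumes a 3-channel RGB input (the example above does not state the input depth) and one bias per output channel; conv_param_count is a made-up helper name, not a PyTorch function.

```python
def conv_param_count(in_channels, out_channels, kernel_size):
    """Count the learnable values in a 2D convolution layer with bias."""
    # Each output channel owns one kernel slice per input channel
    weights_per_filter = in_channels * kernel_size * kernel_size
    # Add one bias value per output channel
    return out_channels * weights_per_filter + out_channels


# Assumed setup: RGB input, 64 output channels, 3 x 3 kernels
print(conv_param_count(in_channels=3, out_channels=64, kernel_size=3))  # 1792
```

Each of the 64 detectors spans all input channels, which is why the count grows with the input depth as well as with the number of output channels.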

Defining the Model Structure

Models are defined in PyTorch by custom classes that extend the Module class. All the components of the models can be found in the torch.nn package.

Hence, we’ll simply import this package. Here we’ll build a simple CNN model for the purpose of classifying RGB images from the CIFAR10 dataset. The CIFAR10 dataset consists of 50,000 training images and 10,000 test images of size 32 x 32.

# Import needed packages
import torch
import torch.nn as nn


class SimpleNet(nn.Module):
    def __init__(self, num_classes=10):
        super(SimpleNet, self).__init__()

        self.conv1 = nn.Conv2d(in_channels=3, out_channels=12, kernel_size=3, stride=1, padding=1)
        self.relu1 = nn.ReLU()

        self.conv2 = nn.Conv2d(in_channels=12, out_channels=12, kernel_size=3, stride=1, padding=1)
        self.relu2 = nn.ReLU()

        self.pool = nn.MaxPool2d(kernel_size=2)

        self.conv3 = nn.Conv2d(in_channels=12, out_channels=24, kernel_size=3, stride=1, padding=1)
        self.relu3 = nn.ReLU()

        self.conv4 = nn.Conv2d(in_channels=24, out_channels=24, kernel_size=3, stride=1, padding=1)
        self.relu4 = nn.ReLU()

        self.fc = nn.Linear(in_features=16 * 16 * 24, out_features=num_classes)

    def forward(self, input):
        output = self.conv1(input)
        output = self.relu1(output)

        output = self.conv2(output)
        output = self.relu2(output)

        output = self.pool(output)

        output = self.conv3(output)
        output = self.relu3(output)

        output = self.conv4(output)
        output = self.relu4(output)

        output = output.view(-1, 16 * 16 * 24)

        output = self.fc(output)

        return output

In the code above, we first define a new class named SimpleNet, which extends the nn.Module class. In the constructor of this class, we specify all the layers in our network. Our network is structured as convolution — relu — convolution — relu — pool — convolution — relu — convolution — relu — linear.

To clarify what is happening in each layer, let’s go over them one by one.

The Convolution Layer

Given that our input would be RGB images which have 3 channels (RED-GREEN-BLUE), we specify the number of in_channels as 3. Next we want to apply 12 feature detectors to the images, so we specify the number of out_channels to be 12.

Here we use the standard 3 x 3 kernel size (defined simply as 3). The stride is set to 1, and should always be so, unless you plan to reduce the dimension of the images. By setting the stride to 1, the convolution would move 1 pixel at a time.

Lastly, we set the padding to be 1: this ensures our images are padded with zeros to keep the input and output size the same.
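To see why a padding of 1 keeps the size unchanged for a 3 x 3 kernel, here is a small sketch of the usual output-size formula for a convolution without dilation: floor((size + 2 * padding - kernel) / stride) + 1. The function name conv_output_size is my own and is only for illustration.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Width (or height) of a convolution output, assuming no dilation."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


# The settings used in SimpleNet: 3 x 3 kernel, stride 1, padding 1
print(conv_output_size(32, kernel_size=3, stride=1, padding=1))  # 32

# Without padding the image would shrink by 2 pixels per layer
print(conv_output_size(32, kernel_size=3, stride=1, padding=0))  # 30

# A stride of 2 roughly halves the size
print(conv_output_size(32, kernel_size=3, stride=2, padding=1))  # 16
```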

Basically, you need not worry much about the stride and padding at present. Keep your focus on the in_channels and out_channels.

Note that the out_channels of one layer serves as the in_channels of the next layer, as conv1 and conv2 in the code above show.

ReLU

This is the standard ReLU activation function. It thresholds all incoming features to be 0 or greater. In simple English, when you apply relu to the incoming features, any number less than 0 is changed to zero, while others are kept the same.
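The thresholding is easy to reproduce by hand. The sketch below applies it to a plain Python list rather than a tensor, so it only illustrates the rule and is not how nn.ReLU is implemented.

```python
def relu(values):
    """Replace every negative number with zero and keep the rest."""
    return [max(0.0, v) for v in values]


print(relu([-2.0, -0.5, 0.0, 1.5, 3.0]))  # [0.0, 0.0, 0.0, 1.5, 3.0]
```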

MaxPool2d

This layer reduces the dimension of the image by setting the kernel_size to be 2, reducing our image width and height by a factor of 2. What it essentially does is take the maximum of the pixels in a 2 x 2 region of the image and use that to represent the entire region; hence 4 pixels become just one.
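Here is a minimal pure-Python sketch of 2 x 2 max pooling on a single-channel image stored as nested lists. It assumes the width and height are even; max_pool_2x2 is a hypothetical helper, while the real layer works on batched tensors.

```python
def max_pool_2x2(image):
    """Keep the largest value of every non-overlapping 2 x 2 region."""
    pooled = []
    for r in range(0, len(image) - 1, 2):
        row = []
        for c in range(0, len(image[r]) - 1, 2):
            region = (image[r][c], image[r][c + 1],
                      image[r + 1][c], image[r + 1][c + 1])
            row.append(max(region))
        pooled.append(row)
    return pooled


image = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
print(max_pool_2x2(image))  # [[4, 2], [2, 8]]
```

The 4 x 4 input becomes 2 x 2, which is the same halving that turns our 32 x 32 images into 16 x 16.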

Linear

The final layer of our network would almost always be the linear layer. It’s a standard, fully connected layer that computes the scores for each of our classes — in this case ten classes.

Note that we have to flatten the entire feature map from the last conv-relu layer before we pass it into the linear layer. The last layer has 24 output channels, and due to 2 x 2 max pooling, at this point our image has become 16 x 16 (32/2 = 16). Our flattened image would have 16 x 16 x 24 = 6144 values. We do this in the forward function with output.view(-1, 16 * 16 * 24).

In our linear layer, we have to specify in_features as 16 x 16 x 24 as well, and out_features should correspond to the number of classes we desire.

Note the simple rule of defining models in PyTorch. Define layers in the constructor and pass in all inputs in the forward function.

That hopefully gives you a basic understanding of constructing models in PyTorch.

Modularity

The code above works, but it would become cumbersome for very deep networks. The key to cleaner code is modularity. In the above example, we could put convolution and relu in a single separate module and stack many copies of this module in our SimpleNet.

To do that, we first define a new module as below:

class Unit(nn.Module):
    def __init__(self, in_channels, out_channels):
        super(Unit, self).__init__()

        self.conv = nn.Conv2d(in_channels=in_channels, kernel_size=3, out_channels=out_channels, stride=1, padding=1)
        self.bn = nn.BatchNorm2d(num_features=out_channels)
        self.relu = nn.ReLU()

    def forward(self, input):
        output = self.conv(input)
        output = self.bn(output)
        output = self.relu(output)

        return output

Consider the above as a mini-network meant to form a part of our larger SimpleNet.

As you can see above, this Unit consists of convolution-batchnormalization-relu.

Unlike in the first example, here I included BatchNorm2d before ReLU. Batch normalization normalizes the activations of each channel over the batch to have zero mean and unit variance. It usually speeds up training and often improves the accuracy of CNN models.
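The normalization step itself is simple arithmetic. The sketch below applies it to a short list of activations, as if they all belonged to one channel. It leaves out the learnable scale and shift that BatchNorm2d also applies, and batch_normalize is an illustrative name of my own.

```python
def batch_normalize(values, eps=1e-5):
    """Shift and scale values to roughly zero mean and unit variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    # eps guards against division by zero when all values are equal
    return [(v - mean) / (variance + eps) ** 0.5 for v in values]


activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_normalize(activations)
print([round(v, 2) for v in normalized])  # [-1.34, -0.45, 0.45, 1.34]
```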

Having defined the unit above, we can now stack many of them together.

class SimpleNet(nn.Module):
    def __init__(self,num_classes=10):
        super(SimpleNet,self).__init__()
        
        #Create 14 layers of the unit with max pooling in between
        self.unit1 = Unit(in_channels=3,out_channels=32)
        self.unit2 = Unit(in_channels=32, out_channels=32)
        self.unit3 = Unit(in_channels=32, out_channels=32)

        self.pool1 = nn.MaxPool2d(kernel_size=2)

        self.unit4 = Unit(in_channels=32, out_channels=64)
        self.unit5 = Unit(in_channels=64, out_channels=64)
        self.unit6 = Unit(in_channels=64, out_channels=64)
        self.unit7 = Unit(in_channels=64, out_channels=64)

        self.pool2 = nn.MaxPool2d(kernel_size=2)

        self.unit8 = Unit(in_channels=64, out_channels=128)
        self.unit9 = Unit(in_channels=128, out_channels=128)
        self.unit10 = Unit(in_channels=128, out_channels=128)
        self.unit11 = Unit(in_channels=128, out_channels=128)

        self.pool3 = nn.MaxPool2d(kernel_size=2)

        self.unit12 = Unit(in_channels=128, out_channels=128)
        self.unit13 = Unit(in_channels=128, out_channels=128)
        self.unit14 = Unit(in_channels=128, out_channels=128)

        self.avgpool = nn.AvgPool2d(kernel_size=4)
        
        #Add all the units into the Sequential layer in exact order
        self.net = nn.Sequential(self.unit1, self.unit2, self.unit3, self.pool1, self.unit4, self.unit5, self.unit6,
                                 self.unit7, self.pool2, self.unit8, self.unit9, self.unit10, self.unit11, self.pool3,
                                 self.unit12, self.unit13, self.unit14, self.avgpool)

        self.fc = nn.Linear(in_features=128,out_features=num_classes)

    def forward(self, input):
        output = self.net(input)
        output = output.view(-1,128)
        output = self.fc(output)
        return output

That’s a whole 15-layer network if we count only the layers with weights (14 convolution layers and 1 linear layer). In total it is made up of 14 convolution layers, 14 ReLU layers, 14 batch normalization layers, 4 pooling layers, and 1 linear layer, totalling 47 layers! This was made possible through the use of sub-modules and the Sequential class.

The above code is made up of a stack of the unit and the pooling layers in between.

Notice how I made the code more compact by putting all layers except the fully connected layer into an nn.Sequential container. This further simplifies the code in the forward function.

Also, the AvgPool2d layer after the last unit computes the average of all activations in each channel. The output of the unit has 128 channels, and after pooling 3 times, our 32 x 32 images have become 4 x 4. We apply AvgPool2d with a kernel size of 4, turning our feature map into 1 x 1 x 128.

Consequently, the linear layer would have 1 x 1 x 128 = 128 input features.

We also flatten the output of the network to have 128 features.
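The size bookkeeping can be checked with plain arithmetic. The sketch below is not PyTorch code; trace_spatial_size is a hypothetical helper that assumes each pooling layer divides the width and height by its kernel size, which holds here because 32 divides evenly at every step.

```python
def trace_spatial_size(image_size, pooling_kernels):
    """Return the feature-map width/height after each pooling layer."""
    sizes = []
    size = image_size
    for kernel in pooling_kernels:
        size //= kernel  # a pooling layer with kernel k shrinks each side by k
        sizes.append(size)
    return sizes


# Three 2 x 2 max pools followed by the 4 x 4 average pool
sizes = trace_spatial_size(32, [2, 2, 2, 4])
print(sizes)  # [16, 8, 4, 1]

# Features handed to the linear layer: width x height x channels
channels = 128
print(sizes[-1] * sizes[-1] * channels)  # 128
```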

Loading and Augmenting Data

Data loading is very easy in PyTorch thanks to the torchvision package. To demonstrate this, I’ll be loading the CIFAR10 dataset that we’ll make use of in this tutorial.

First we need three additional import statements:

from torchvision.datasets import CIFAR10
from torchvision.transforms import transforms
from torch.utils.data import DataLoader

To load the dataset we do the following:

  • Define transformations to be applied on the image
  • Load the dataset using torchvision
  • Create an instance of the DataLoader to hold the images

We do this for the training set as below:

#Define transformations for the training set, flip the images randomly, crop out and apply mean and std normalization
train_transformations = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32,padding=4),
    transforms.ToTensor(),
    transforms.Normalize((0.5,0.5,0.5), (0.5,0.5,0.5))
])

#Load the training set
train_set = CIFAR10(root="./data",train=True,transform=train_transformations,download=True)

#Create a loader for the training set
train_loader = DataLoader(train_set,batch_size=32,shuffle=True,num_workers=4)

First we pass a list of transformations to transforms.Compose. RandomHorizontalFlip randomly flips the images horizontally. RandomCrop pads each image by 4 pixels on every side and then crops a random 32 x 32 region.

Lastly, the two most important: ToTensor converts the images into a format usable by PyTorch, with pixel values scaled to the range 0 to 1. Normalize with the values given above then maps all our pixels to the range -1 to +1.
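The effect of Normalize is easy to verify by hand, since it computes (pixel - mean) / std for every channel. The sketch below applies that formula to single values; it assumes the pixels were already scaled to the range 0 to 1, and normalize is just an illustrative function name.

```python
def normalize(pixel, mean=0.5, std=0.5):
    """Apply the per-channel formula (pixel - mean) / std to one value."""
    return (pixel - mean) / std


# Darkest, middle and brightest pixel values
for pixel in (0.0, 0.5, 1.0):
    print(pixel, "->", normalize(pixel))
# 0.0 -> -1.0
# 0.5 -> 0.0
# 1.0 -> 1.0
```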

Note that when stating the transformations, ToTensor and Normalize must be last in the exact order as defined above.

The primary reason for this is that the other transformations are applied to the input, which is a PIL image. The image must be converted to a PyTorch tensor before normalization can be applied.

Data augmentation helps the model classify images correctly regardless of how the objects in them are positioned or oriented.
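To give a feel for what the two augmentations do to the pixels, here is a rough pure-Python sketch on a tiny single-channel image. It assumes zero padding and uses Python's random module for the crop position. The helper names are my own, and torchvision's implementations work on PIL images instead.

```python
import random


def horizontal_flip(image):
    """Mirror every row of the image."""
    return [row[::-1] for row in image]


def random_crop(image, size, padding, rng=random):
    """Zero-pad the image on every side, then cut out a random size x size region."""
    padded_width = len(image[0]) + 2 * padding
    padded = [[0] * padded_width for _ in range(padding)]
    padded += [[0] * padding + list(row) + [0] * padding for row in image]
    padded += [[0] * padded_width for _ in range(padding)]
    top = rng.randint(0, len(padded) - size)
    left = rng.randint(0, padded_width - size)
    return [row[left:left + size] for row in padded[top:top + size]]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(horizontal_flip(image))  # [[3, 2, 1], [6, 5, 4], [9, 8, 7]]

crop = random_crop(image, size=3, padding=1)
print(len(crop), len(crop[0]))  # 3 3
```

The crop keeps the original size but shifts the content by up to one pixel in each direction, so the model sees a slightly different picture each time.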

Next, we load the training set using the CIFAR10 class, and finally we create a loader for the training set, specifying a batch size of 32 images.

This is repeated for the test set as below, except that the transformations only include ToTensor and Normalize. We do not apply other types of transformations on the test set.

# Define transformations for the test set
test_transformations = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))

])

# Load the test set, note that train is set to False
test_set = CIFAR10(root="./data", train=False, transform=test_transformations, download=True)

# Create a loader for the test set, note that shuffle is set to False for the test loader
test_loader = DataLoader(test_set, batch_size=32, shuffle=False, num_workers=4)

The first time you run this code, the dataset of about 170 MB will be downloaded to your system.

Training the Model

Training neural networks with PyTorch is a very explicit process that gives you full control over what happens during training. Let’s go over the process step by step.

You should import the Adam optimizer as:

from torch.optim import Adam

Step 1: Instantiate the Model, create the optimizer and Loss function


# Check if gpu support is available
cuda_avail = torch.cuda.is_available()

# Create model, optimizer and loss function
model = SimpleNet(num_classes=10)

#if cuda is available, move the model to the GPU
if cuda_avail:
    model.cuda()

#Define the optimizer and loss function
optimizer = Adam(model.parameters(), lr=0.001, weight_decay=0.0001)
loss_fn = nn.CrossEntropyLoss()

Step 2: Write a function to adjust learning rates

# Create a learning rate adjustment function that divides the learning rate by 10 every 30 epochs
def adjust_learning_rate(epoch):
    lr = 0.001

    if epoch > 180:
        lr = lr / 1000000
    elif epoch > 150:
        lr = lr / 100000
    elif epoch > 120:
        lr = lr / 10000
    elif epoch > 90:
        lr = lr / 1000
    elif epoch > 60:
        lr = lr / 100
    elif epoch > 30:
        lr = lr / 10

    for param_group in optimizer.param_groups:
        param_group["lr"] = lr

This function essentially divides the learning rate by a factor of 10 after every 30 epochs.
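The chain of elif branches can also be written as a single formula. The sketch below is an equivalent, more compact way to compute the same value, shown only for illustration; learning_rate_for_epoch and its parameter names are hypothetical, and it returns the rate instead of writing it into the optimizer.

```python
def learning_rate_for_epoch(epoch, base_lr=0.001, step=30, max_drops=6):
    """Divide base_lr by 10 once for every `step` epochs passed, at most max_drops times."""
    # epoch 31 is the first epoch past 30, epoch 61 the first past 60, and so on
    drops = min(max((epoch - 1) // step, 0), max_drops)
    return base_lr / (10 ** drops)


for epoch in (0, 30, 31, 61, 181):
    print(epoch, learning_rate_for_epoch(epoch))
```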

Step 3: Write functions to save and evaluate the model.


def save_models(epoch):
    torch.save(model.state_dict(), "cifar10model_{}.model".format(epoch))
    print("Checkpoint saved")

def test():
    model.eval()
    test_acc = 0.0
    for i, (images, labels) in enumerate(test_loader):

        if cuda_avail:
            images = Variable(images.cuda())
            labels = Variable(labels.cuda())

        # Predict classes using images from the test set
        outputs = model(images)
        _, prediction = torch.max(outputs.data, 1)

        # .item() turns the one-element tensor into a plain Python number
        test_acc += torch.sum(prediction == labels.data).item()

    # Compute the average accuracy over all 10000 test images
    test_acc = test_acc / 10000

    return test_acc

To evaluate the accuracy of the model on the test set, we iterate over the test loader. At each step, we move the images and labels to the GPU, if available, and wrap them in a Variable. (Since PyTorch 0.4, tensors and Variables are merged, so the wrapper is optional.)

The images are passed into the model to obtain predictions. The maximum prediction is picked and then compared to the actual class to obtain the accuracy. Finally we return the average accuracy.
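The accuracy bookkeeping boils down to an argmax per image followed by a comparison with the label. Here is a minimal sketch with made-up scores for three images and three classes. It uses plain lists instead of tensors, and batch_correct is an illustrative name.

```python
def batch_correct(scores, labels):
    """Count how many rows have their highest score at the labelled class."""
    predictions = [row.index(max(row)) for row in scores]
    return sum(int(p == y) for p, y in zip(predictions, labels))


# Made-up class scores for 3 images and their true labels
scores = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 0.9]]
labels = [1, 0, 1]

correct = batch_correct(scores, labels)
print(correct, "of", len(labels), "correct")  # 2 of 3 correct
```

Summing these counts over all batches and dividing by the number of test images gives the average accuracy.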

Step 4: Write the training function


def train(num_epochs):
    best_acc = 0.0

    for epoch in range(num_epochs):
        model.train()
        train_acc = 0.0
        train_loss = 0.0
        for i, (images, labels) in enumerate(train_loader):
            # Move images and labels to gpu if available
            if cuda_avail:
                images = Variable(images.cuda())
                labels = Variable(labels.cuda())

            # Clear all accumulated gradients
            optimizer.zero_grad()
            # Predict classes using images from the training set
            outputs = model(images)
            # Compute the loss based on the predictions and actual labels
            loss = loss_fn(outputs, labels)
            # Backpropagate the loss
            loss.backward()

            # Adjust parameters according to the computed gradients
            optimizer.step()

            # .item() reads the scalar loss as a plain Python number
            train_loss += loss.item() * images.size(0)
            _, prediction = torch.max(outputs.data, 1)

            train_acc += torch.sum(prediction == labels.data).item()

        # Call the learning rate adjustment function
        adjust_learning_rate(epoch)

        # Compute the average acc and loss over all 50000 training images
        train_acc = train_acc / 50000
        train_loss = train_loss / 50000

        # Evaluate on the test set
        test_acc = test()

        # Save the model if the test acc is greater than our current best
        if test_acc > best_acc:
            save_models(epoch)
            best_acc = test_acc

        # Print the metrics
        print("Epoch {}, Train Accuracy: {} , TrainLoss: {} , Test Accuracy: {}".format(epoch, train_acc, train_loss, test_acc))

The training function above is highly annotated; however, you might still be confused by a few things. It’s important to review exactly what happens above in detail.

First, we loop over the loader for the training set.

Next, if GPU support is available, we move both the images and labels to the GPU.

The next line clears all currently accumulated gradients.

This is important because PyTorch adds newly computed gradients to those already stored every time backward() is called. The weights should be adjusted using only the gradients of the current batch, so the gradients must be reset to zero for each new batch; otherwise, images from a previous batch would contribute gradients to the new one.

In the next steps, we pass our images into the model. It returns the predictions, and then we pass both the predictions and actual labels into the loss function.

We call loss.backward() to propagate the gradients, and then we call optimizer.step() to modify our model parameters in accordance with the propagated gradients.

These are the main steps in the training.
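The same four steps (reset gradients, forward pass, backward pass, parameter update) can be seen in miniature without PyTorch. The sketch below fits a toy model with a single weight to made-up data and computes the gradient by hand. It mirrors only the order of operations; train_toy_model and its defaults are assumptions for illustration.

```python
def train_toy_model(data, lr=0.05, num_epochs=100):
    """Fit y = w * x by gradient descent on the mean squared error."""
    w = 0.0
    for _ in range(num_epochs):
        grad = 0.0  # reset, like optimizer.zero_grad()
        for x, y in data:
            prediction = w * x  # forward pass
            # backward pass: derivative of the mean squared error w.r.t. w
            grad += 2 * (prediction - y) * x / len(data)
        w -= lr * grad  # parameter update, like optimizer.step()
    return w


# Toy data generated from y = 3 * x
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
print(round(train_toy_model(data), 3))  # 3.0
```

If the reset line were moved outside the loop, every update would also include the gradients of all earlier passes, which is the problem that clearing the gradients avoids.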

The rest of the code computes the metrics.

Here we retrieve the actual loss and then obtain the maximum predicted class. Finally, we sum up the number of correct predictions in the batch and add it to the total train_acc.

After each epoch, we call the learning rate adjustment function, compute the average of the training loss and training accuracy, find the test accuracy, and log the results.

More importantly, we keep track of the best accuracy, and if the current test accuracy is greater than our current best, we call the save_models function.

The complete code on GitHub is here:

#Import needed packages
import torch
import torch.nn as nn
from torchvision.datasets import CIFAR10
from torchvision.transforms import transforms
from torch.utils.data import DataLoader
from torch.optim import Adam
from torch.autograd import Variable
import numpy as np


class Unit(nn.Module):
    def __init__(self,in_channels,out_channels):
        super(Unit,self).__init__()
        

        self.conv = nn.Conv2d(in_channels=in_channels,kernel_size=3,out_channels=out_channels,stride=1,padding=1)
        self.bn = nn.BatchNorm2d(num_features=out_channels)
        self.relu = nn.ReLU()

    def forward(self,input):
        output = self.conv(input)
        output = self.bn(output)
        output = self.relu(output)

        return output

class SimpleNet(nn.Module):
    def __init__(self,num_classes=10):
        super(SimpleNet,self).__init__()

        #Create 14 layers of the unit with max pooling in between
        self.unit1 = Unit(in_channels=3,out_channels=32)
        self.unit2 = Unit(in_channels=32, out_channels=32)
        self.unit3 = Unit(in_channels=32, out_channels=32)

        self.pool1 = nn.MaxPool2d(kernel_size=2)

        self.unit4 = Unit(in_channels=32, out_channels=64)
        self.unit5 = Unit(in_channels=64, out_channels=64)
        self.unit6 = Unit(in_channels=64, out_channels=64)
        self.unit7 = Unit(in_channels=64, out_channels=64)

        self.pool2 = nn.MaxPool2d(kernel_size=2)

        self.unit8 = Unit(in_channels=64, out_channels=128)
        self.unit9 = Unit(in_channels=128, out_channels=128)
        self.unit10 = Unit(in_channels=128, out_channels=128)
        self.unit11 = Unit(in_channels=128, out_channels=128)

        self.pool3 = nn.MaxPool2d(kernel_size=2)

        self.unit12 = Unit(in_channels=128, out_channels=128)
        self.unit13 = Unit(in_channels=128, out_channels=128)
        self.unit14 = Unit(in_channels=128, out_channels=128)

        self.avgpool = nn.AvgPool2d(kernel_size=4)
        
        #Add all the units into the Sequential layer in exact order
        self.net = nn.Sequential(self.unit1, self.unit2, self.unit3, self.pool1, self.unit4, self.unit5, self.unit6,
                                 self.unit7, self.pool2, self.unit8, self.unit9, self.unit10, self.unit11, self.pool3,
                                 self.unit12, self.unit13, self.unit14, self.avgpool)

        self.fc = nn.Linear(in_features=128,out_features=num_classes)

    def forward(self, input):
        output = self.net(input)
        output = output.view(-1,128)
        output = self.fc(output)
        return output

#Define transformations for the training set, flip the images randomly, crop out and apply mean and std normalization
train_transformations = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32,padding=4),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

batch_size = 32

#Load the training set
train_set = CIFAR10(root="./data",train=True,transform=train_transformations,download=True)

#Create a loader for the training set
train_loader = DataLoader(train_set,batch_size=batch_size,shuffle=True,num_workers=4)


#Define transformations for the test set
test_transformations = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,0.5,0.5), (0.5,0.5,0.5))
])

#Load the test set, note that train is set to False
test_set = CIFAR10(root="./data",train=False,transform=test_transformations,download=True)

#Create a loader for the test set, note that shuffle is set to False for the test loader
test_loader = DataLoader(test_set,batch_size=batch_size,shuffle=False,num_workers=4)

#Check if gpu support is available
cuda_avail = torch.cuda.is_available()

#Create model, optimizer and loss function
model = SimpleNet(num_classes=10)

if cuda_avail:
    model.cuda()

optimizer = Adam(model.parameters(), lr=0.001,weight_decay=0.0001)
loss_fn = nn.CrossEntropyLoss()

#Create a learning rate adjustment function that divides the learning rate by 10 every 30 epochs
def adjust_learning_rate(epoch):

    lr = 0.001

    if epoch > 180:
        lr = lr / 1000000
    elif epoch > 150:
        lr = lr / 100000
    elif epoch > 120:
        lr = lr / 10000
    elif epoch > 90:
        lr = lr / 1000
    elif epoch > 60:
        lr = lr / 100
    elif epoch > 30:
        lr = lr / 10

    for param_group in optimizer.param_groups:
        param_group["lr"] = lr




def save_models(epoch):
    torch.save(model.state_dict(), "cifar10model_{}.model".format(epoch))
    print("Checkpoint saved")

def test():
    model.eval()
    test_acc = 0.0
    for i, (images, labels) in enumerate(test_loader):
      
        if cuda_avail:
            images = Variable(images.cuda())
            labels = Variable(labels.cuda())

        #Predict classes using images from the test set
        outputs = model(images)
        _,prediction = torch.max(outputs.data, 1)
        #Compare the tensors directly; .item() turns the count into a plain Python number
        test_acc += torch.sum(prediction == labels.data).item()
        


    #Compute the average accuracy over all 10000 test images
    test_acc = test_acc / 10000

    return test_acc

def train(num_epochs):
    best_acc = 0.0

    for epoch in range(num_epochs):
        model.train()
        train_acc = 0.0
        train_loss = 0.0
        for i, (images, labels) in enumerate(train_loader):
            #Move images and labels to gpu if available
            if cuda_avail:
                images = Variable(images.cuda())
                labels = Variable(labels.cuda())

            #Clear all accumulated gradients
            optimizer.zero_grad()
            #Predict classes using images from the training set
            outputs = model(images)
            #Compute the loss based on the predictions and actual labels
            loss = loss_fn(outputs,labels)
            #Backpropagate the loss
            loss.backward()

            #Adjust parameters according to the computed gradients
            optimizer.step()

            #.item() reads the scalar loss as a plain Python number
            train_loss += loss.item() * images.size(0)
            _, prediction = torch.max(outputs.data, 1)

            train_acc += torch.sum(prediction == labels.data).item()

        #Call the learning rate adjustment function
        adjust_learning_rate(epoch)

        #Compute the average acc and loss over all 50000 training images
        train_acc = train_acc / 50000
        train_loss = train_loss / 50000

        #Evaluate on the test set
        test_acc = test()

        # Save the model if the test acc is greater than our current best
        if test_acc > best_acc:
            save_models(epoch)
            best_acc = test_acc


        # Print the metrics
        print("Epoch {}, Train Accuracy: {} , TrainLoss: {} , Test Accuracy: {}".format(epoch, train_acc, train_loss,test_acc))


if __name__ == "__main__":
    train(200)







Run this code — you should have over 90% test accuracy after about 35 epochs.

Inference with Saved Models

After models are trained, they can be used to perform inference on new images.

To perform inference, you need to go through the following steps:

  • Define and instantiate the same model you constructed during training
  • Load the saved checkpoint into the model
  • Pick an image from the file system
  • Run the image through the model and retrieve the highest prediction
  • Convert the predicted class number into a class name.

To illustrate this, we’ll use the SqueezeNet model with pre-trained ImageNet weights. This allows us to take nearly any image and get a prediction for it. Since the ImageNet model has 1000 classes, a lot of different kinds of objects are supported.

Torchvision provides predefined models, covering a wide range of popular architectures.

First, import all needed packages and classes and create an instance of the SqueezeNet model.

# Import needed packages
import torch
import torch.nn as nn
from torchvision.transforms import transforms
from torch.autograd import Variable
from torchvision.models import squeezenet1_1
import requests
import shutil
from io import open
import os
from PIL import Image
import json


model = squeezenet1_1(pretrained=True)
model.eval()

Note that, in the above code, setting pretrained to True causes the SqueezeNet weights to be downloaded the first time you run this function. The size of the model is just 4.7 MB.

Next, create a prediction function as below:


def predict_image(image_path):
    print("Prediction in progress")
    image = Image.open(image_path)

    # Define transformations for the image (note that ImageNet models are trained with image size 224)
    transformation = transforms.Compose([
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        # Mean and std that torchvision's pre-trained ImageNet models were trained with
        transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))
    ])

    # Preprocess the image
    image_tensor = transformation(image).float()

    # Add an extra batch dimension since pytorch treats all images as batches
    image_tensor = image_tensor.unsqueeze_(0)

    # The model stays on the CPU in this example, so the input tensor is left on the CPU as well

    # Turn the input into a Variable
    input = Variable(image_tensor)

    # Predict the class of the image
    output = model(input)

    index = output.data.numpy().argmax()

    return index

The code above contains the same components we used during training and evaluation. See the comments in the above code for clarity.

Finally, in the main function that runs our prediction, we download an image from the web and store it on disk. We also download the class map that maps all class indexes to actual class names. This is needed because our model returns only the index of the predicted class; the actual name is then looked up in the index-class map.

Afterwards, we run the predict function using the saved image, and we use the saved class map to obtain the exact class name.
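The lookup from index to name is a plain dictionary access. The sketch below uses a two-entry excerpt in the same layout as the ImageNet class index file, where each key is the class index as a string and each value is a pair of WordNet id and readable name. The excerpt and the class_name helper are for illustration only.

```python
import json

# Two-entry excerpt in the layout of the class index file
sample = '{"0": ["n01440764", "tench"], "1": ["n01443537", "goldfish"]}'
class_map = json.loads(sample)


def class_name(index, class_map):
    """Look up the readable name for a predicted class index."""
    # JSON object keys are strings, so the integer index must be converted
    return class_map[str(index)][1]


print(class_name(1, class_map))  # goldfish
```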

if __name__ == "__main__":

    imagefile = "image.png"

    imagepath = os.path.join(os.getcwd(), imagefile)
    # Download image if it doesn't exist
    if not os.path.exists(imagepath):
        data = requests.get(
            "https://github.com/OlafenwaMoses/ImageAI/raw/master/images/3.jpg", stream=True)

        with open(imagepath, "wb") as file:
            shutil.copyfileobj(data.raw, file)

        del data

    index_file = "class_index_map.json"

    indexpath = os.path.join(os.getcwd(), index_file)
    # Download class index if it doesn't exist
    if not os.path.exists(indexpath):
        data = requests.get('https://github.com/OlafenwaMoses/ImageAI/raw/master/imagenet_class_index.json')

        with open(indexpath, "w", encoding="utf-8") as file:
            file.write(data.text)

    class_map = json.load(open(indexpath))

    # Run prediction function and obtain predicted class index
    index = predict_image(imagepath)

    prediction = class_map[str(index)][1]

    print("Predicted Class ", prediction)
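
The index-to-name lookup at the end of the script above can be tried in isolation. The sketch below assumes the class map follows the usual ImageNet index layout, where each string index maps to a pair of WordNet ID and label. The two-entry excerpt and the helper name lookup_class_name are only for illustration and are not part of the original script.

```python
import json

# Illustrative two-entry excerpt in the same layout as imagenet_class_index.json:
# each key is the class index as a string, each value is [wordnet_id, label].
SAMPLE_CLASS_MAP_JSON = '{"0": ["n01440764", "tench"], "1": ["n01443537", "goldfish"]}'


def lookup_class_name(class_map, index):
    """Return the human-readable label for a predicted class index."""
    # JSON object keys are always strings, so the integer index must be converted.
    entry = class_map[str(index)]
    # entry[0] is the WordNet ID, entry[1] is the class name.
    return entry[1]


class_map = json.loads(SAMPLE_CLASS_MAP_JSON)
print(lookup_class_name(class_map, 1))  # prints "goldfish"
```

The str(index) conversion is what makes the lookup work with the integer returned by argmax, since a parsed JSON object never has integer keys.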

Here is the complete code for inference:

# Import needed packages
import torch
import torch.nn as nn
from torchvision.transforms import transforms
import matplotlib.pyplot as plt
import numpy as np
from torch.autograd import Variable
from torchvision.models import squeezenet1_1
import torch.nn.functional as F
import requests
import shutil
from io import open
import os
from PIL import Image
import json

""" Instantiate the model; this downloads the 4.7 MB SqueezeNet weights the first time it is called.
To use your own model, re-define your trained network and load its weights as below

checkpoint = torch.load("pathtosavemodel")
model = SimpleNet(num_classes=10)


model.load_state_dict(checkpoint)
model.eval()
"""


model = squeezenet1_1(pretrained=True)
model.eval()


def predict_image(image_path):
    print("Prediction in progress")
    image = Image.open(image_path)

    # Define transformations for the image (note that ImageNet models are trained with image size 224)
    transformation = transforms.Compose([
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))

    ])

    # Preprocess the image
    image_tensor = transformation(image).float()

    # Add an extra batch dimension since pytorch treats all images as batches
    image_tensor = image_tensor.unsqueeze_(0)

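    # Move the input to the device that holds the model; calling .cuda() without
    # reassigning returns a copy and leaves image_tensor itself on the CPU
    image_tensor = image_tensor.to(next(model.parameters()).device)

    # Turn the input into a Variable
    input = Variable(image_tensor)

    # Predict the class of the image
    output = model(input)

    # Copy the scores to the CPU before converting them to NumPy
    index = output.data.cpu().numpy().argmax()

    return index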


if __name__ == "__main__":

    imagefile = "image.png"

    imagepath = os.path.join(os.getcwd(), imagefile)
    # Download the image if it doesn't exist
    if not os.path.exists(imagepath):
        data = requests.get(
            "https://github.com/OlafenwaMoses/ImageAI/raw/master/images/3.jpg", stream=True)

        with open(imagepath, "wb") as file:
            shutil.copyfileobj(data.raw, file)

        del data

    index_file = "class_index_map.json"

    indexpath = os.path.join(os.getcwd(), index_file)
    # Download the class index if it doesn't exist
    if not os.path.exists(indexpath):
        data = requests.get('https://github.com/OlafenwaMoses/ImageAI/raw/master/imagenet_class_index.json')

        with open(indexpath, "w", encoding="utf-8") as file:
            file.write(data.text)

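    # Load the class map, closing the file once it has been read
    with open(indexpath) as file:
        class_map = json.load(file)

    # Run the prediction function and obtain the predicted class index
    index = predict_image(imagepath)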
    index = predict_image(imagepath)

    prediction = class_map[str(index)][1]

    print("Predicted Class ", prediction)

The image in the example above is the picture of this bird:

The image was taken from the ImageAI repository. If you want to use your own custom network, i.e. the SimpleNet you just created, to perform inference, all you need to do is replace the model loading section with the following:


checkpoint = torch.load("pathtosavemodel")
model = SimpleNet(num_classes=10)


model.load_state_dict(checkpoint)
model.eval()

Note that if your model was trained on ImageNet, then your num_classes must be 1000 instead of 10.
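
Why num_classes has to match the checkpoint can be seen with a much smaller network. This is only a sketch: TinyNet is a hypothetical stand-in for SimpleNet, and the weights are saved to an in-memory buffer rather than to the "pathtosavemodel" file used above.

```python
import io

import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical stand-in for SimpleNet: one linear layer over 8 features."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.fc = nn.Linear(8, num_classes)

    def forward(self, x):
        return self.fc(x)


# Save only the weights (the state dict), here to an in-memory buffer instead of a file.
buffer = io.BytesIO()
torch.save(TinyNet(num_classes=10).state_dict(), buffer)
buffer.seek(0)
checkpoint = torch.load(buffer)

# Re-create the same architecture, then load the weights and switch to eval mode.
model = TinyNet(num_classes=10)
model.load_state_dict(checkpoint)
model.eval()

# A network built with a different num_classes cannot accept this checkpoint.
try:
    TinyNet(num_classes=1000).load_state_dict(checkpoint)
    mismatch_detected = False
except RuntimeError:
    mismatch_detected = True

print(mismatch_detected)  # prints True
```

The state dict stores tensors by name and shape, so the final layer built for 1000 classes has no matching weights in a 10-class checkpoint and loading fails with a size mismatch.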

All other aspects of the code remain the same, with one difference: if we’re running prediction with a model trained on CIFAR-10, change transforms.CenterCrop(224) to transforms.Resize(32) in the transforms.

However, if your model was trained on ImageNet, this change should not be done.

Conclusion

Hope you have had a nice ride with PyTorch! This post is the first in a series I’ll be writing on PyTorch. Stay connected for more and give a clap!

You can always reach me on Twitter: @johnolafenwa

