Is it a human or a dog?

Mahbubul Wasek
12 min read · Apr 9, 2021

Dog Breed Classification using CNNs

This post is about my Udacity Data Science Nanodegree capstone project. One of the project choices was the Dog Breed Classifier.

Definition

Project Overview

The dog breed classification project using Convolutional Neural Networks (CNNs) is able to process real-world, user-supplied images and distinguish between human face images and dog images.

Beyond recognizing that an image contains a dog, the model also predicts the dog’s breed from the image. The final model reaches a test accuracy of almost 80%.

This project uses computer vision and machine learning techniques to predict dog breeds from images. Using CNNs, we distinguish a dog’s facial key points from human faces in each image.

In this project, CNNs were trained using transfer learning in order to reduce training time without sacrificing accuracy.

Model detecting a dog and then predicting its breed

This dog breed classification project was challenging, and its solution can be reused for other fine-grained classification problems. Any set of classes with relatively small variation between its members can be treated as a fine-grained classification problem.

The approach, the methodology and the results are documented here.

Problem Statement

In this project, users can supply any image as input. If a dog is detected in the image, the algorithm provides an estimate of the dog’s breed. If a human is detected, it provides an estimate of the dog breed that the person most resembles.
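The decision logic described above can be sketched roughly as follows. This is only an illustration: the three callables passed in (dog_detector, face_detector, predict_breed) are hypothetical stand-ins for the functions built later in the project, and the message for the "neither" case is my own assumption, since the problem statement only covers dogs and humans.

```python
def classify_image(img_path, dog_detector, face_detector, predict_breed):
    """Return a message describing what was found in the image.

    The three callables are assumed to take an image path; the first two
    return True/False and the last returns a breed name.
    """
    if dog_detector(img_path):
        return f"Dog detected. Predicted breed: {predict_breed(img_path)}"
    if face_detector(img_path):
        return f"Human detected. Most resembling breed: {predict_breed(img_path)}"
    # Assumption: the source does not say what happens in this case.
    return "Neither a dog nor a human face was detected."


# Usage with trivial stand-in functions instead of real models.
message = classify_image(
    "some_dog.jpg",
    dog_detector=lambda path: "dog" in path,
    face_detector=lambda path: "human" in path,
    predict_breed=lambda path: "Beagle",
)
print(message)
```

Checking for a dog first means a dog image that also triggers the face detector is still treated as a dog.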

This is a fine-grained classification problem, and the solution for this project can be applied to other similar classification problems. Breed prediction may help veterinarians identify stray or unidentified dogs that need medical care. Convolutional Neural Networks (CNNs) are used here to assist with keypoint detection in dogs, namely identifying eyes, nose and ears.

Metrics

This project uses a few pre-trained models, together with the following performance measures:

· OpenCV’s implementation of Haar feature-based cascade classifiers is used to detect human faces in images. OpenCV provides many pre-trained face detectors, stored as XML files on GitHub. For the first 100 dog images in the dog test dataset and the first 100 images of the human face dataset, it detects human faces in 11% and 100% of them respectively.

· Using the Keras API, ResNet-50 was imported and used to detect dogs for this project. This pre-trained model was able to detect 100% of the first 100 dogs as dogs and 0% of the first 100 human faces as dogs.

· To reduce training time without sacrificing accuracy, CNNs were trained using transfer learning, with the pre-trained VGG-16 and InceptionV3 models used as fixed feature extractors.

· Validation performance: accuracy is used to measure the performance of the classifiers, based on the confusion matrix. The score range is [0, 1], with 0 the worst and 1 the best.

· Training loss: categorical cross-entropy is used as the loss function for computing gradients for network optimization.
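To make the last two items concrete, here is a small NumPy sketch of how accuracy and categorical cross-entropy can be computed from one-hot labels and predicted class probabilities. The function names and the toy numbers are mine, chosen only for illustration; in the project itself Keras computes both quantities.

```python
import numpy as np


def accuracy(y_true, y_prob):
    """Fraction of samples whose most probable class matches the label."""
    predicted = np.argmax(y_prob, axis=1)
    actual = np.argmax(y_true, axis=1)
    return float(np.mean(predicted == actual))


def categorical_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean over samples of -sum(y_true * log(y_prob))."""
    y_prob = np.clip(y_prob, eps, 1.0)  # avoid log(0)
    return float(-np.mean(np.sum(y_true * np.log(y_prob), axis=1)))


# Toy example: 3 samples, 3 classes.
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
y_prob = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.5, 0.3, 0.2]])

print(accuracy(y_true, y_prob))                   # 2 of 3 correct
print(categorical_cross_entropy(y_true, y_prob))  # about 0.73
```

The third sample is misclassified with low probability on the true class, which is what drives the loss up.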

Analysis

Data Exploration

The project uses two sets of data. The first dataset contains images of different dog breeds and the second dataset contains images of human faces.

The dog dataset contains a total of 8351 dog images across 133 dog categories (breeds). The training, validation and testing sets contain 6680, 835 and 836 dog images respectively.

The human face dataset, on the other hand, contains 13233 human face images in total. These images had to be converted to grayscale before face detection, because the detector takes a grayscale image as a parameter.

Algorithms and Techniques

The algorithms used in this project include a few pre-trained models: OpenCV’s Haar cascade classifier to detect human faces and ResNet-50 to detect dogs.

Detect Humans

I used OpenCV’s implementation of Haar feature-based cascade classifiers to detect human faces in images. OpenCV provides many pre-trained face detectors. The project code instantiates the Haar cascade classifier from OpenCV and then uses it in the face detector function to determine whether a supplied image contains a face.

Before using any of the face detectors, it is standard procedure to convert the images to grayscale. The detectMultiScale function executes the classifier stored in face_cascade and takes the grayscale image as a parameter.

Ideally, 100% of human images would have a detected face and 0% of dog images would. I extracted the file paths for the first 100 images from each of the datasets and stored them in the numpy arrays human_files_short and dog_files_short.

However, my algorithm fell short of that ideal: it detected a human face in 100.0% of the first 100 human images, but also in 11.0% of the first 100 dog images.
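A face detector along these lines can be sketched as below. Treat it as an approximation rather than the exact notebook code: I am assuming the frontal-face cascade file bundled with the opencv-python package, and make_face_detector and detection_rate are helper names I made up for this example. OpenCV is imported inside the function only so that the rate helper can be used without it.

```python
def make_face_detector():
    """Build a function that reports whether an image contains a face."""
    import cv2  # imported lazily; requires the opencv-python package

    # Assumption: use the frontal-face cascade shipped with opencv-python.
    cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    face_cascade = cv2.CascadeClassifier(cascade_path)

    def face_detector(img_path):
        img = cv2.imread(img_path)  # loaded in BGR order
        if img is None:
            raise FileNotFoundError(img_path)
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(gray)
        return len(faces) > 0

    return face_detector


def detection_rate(detector, paths):
    """Percentage of image paths for which the detector returns True."""
    hits = sum(bool(detector(path)) for path in paths)
    return 100.0 * hits / len(paths)


# Usage (needs OpenCV and the image files):
# face_detector = make_face_detector()
# print(detection_rate(face_detector, human_files_short))
# print(detection_rate(face_detector, dog_files_short))
```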

Detect Dogs

I used the pre-trained ResNet-50 model to detect dogs in images. The code first downloads the ResNet-50 model, along with weights that have been trained on ImageNet, a very large, very popular dataset used for image classification and other vision tasks.

The images had to be processed into the correct tensor size before using them with the model. Preprocessing the images is often one of the most challenging parts of the image classification process.

For Keras, the images need to be in four dimensions, formatted as (number of images, row size of image in pixels, column size of image in pixels, number of color channels).

The path_to_tensor function takes a string-valued file path to a color image as input and returns a 4D tensor suitable for supplying to a Keras CNN. The function first loads the image and resizes it to a square image that is 224×224 pixels. Next, the image is converted to an array, which is then expanded to a 4D tensor. In this case, since we are working with color images, each image has three channels.

The path_to_tensor function processes a single image, so the output shape is (1,224,224,3): 1 image, 224 pixels high, 224 pixels wide, and 3 colour channels (red, green and blue).

Another preprocessing step uses a built-in Keras function called preprocess_input. It takes the output of the image processing steps above and reorders the color channels from RGB to BGR, the order the pre-trained model expects, and then normalizes the pixels based on the standards used for pre-trained ImageNet models.
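For intuition, the two operations that preprocess_input performs for ResNet-50 can be imitated in plain NumPy. This is a sketch of the idea and not the Keras source: the function name is mine, and the per-channel values are the ImageNet means in BGR order that this style of preprocessing subtracts.

```python
import numpy as np

# ImageNet per-channel means in BGR order.
IMAGENET_BGR_MEAN = np.array([103.939, 116.779, 123.68], dtype=np.float32)


def caffe_style_preprocess(rgb_batch):
    """Reorder channels RGB -> BGR, then subtract the ImageNet channel means."""
    bgr = rgb_batch[..., ::-1].astype(np.float32)  # astype makes a copy
    return bgr - IMAGENET_BGR_MEAN


# A single pure-red 224x224 image as a 4D tensor.
batch = np.zeros((1, 224, 224, 3), dtype=np.float32)
batch[..., 0] = 255.0

out = caffe_style_preprocess(batch)
print(out.shape)     # (1, 224, 224, 3)
print(out[0, 0, 0])  # blue, green, red values after mean subtraction
```

Note that the pixels are centred around zero here and not scaled to [0, 1].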

In the ImageNet class dictionary, the categories corresponding to dogs appear in an uninterrupted sequence, at dictionary keys 151–268 inclusive, covering all categories from 'Chihuahua' to 'Mexican hairless'.

Thus, to check whether the pre-trained ResNet-50 model predicts that an image contains a dog, I only needed to check whether the ResNet50_predict_labels function returns a value between 151 and 268 (inclusive).
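The range check itself is tiny. Here is a sketch that works on a vector of 1000 ImageNet class probabilities; in the project that vector would come from ResNet-50, but below a hand-made vector stands in for it, and both function names are illustrative.

```python
import numpy as np

DOG_CLASS_MIN, DOG_CLASS_MAX = 151, 268  # 'Chihuahua' .. 'Mexican hairless'


def is_dog_label(label):
    """True if an ImageNet class index falls in the dog range (inclusive)."""
    return DOG_CLASS_MIN <= label <= DOG_CLASS_MAX


def dog_detector_from_probs(probs):
    """True if the most probable ImageNet class is a dog category."""
    return is_dog_label(int(np.argmax(probs)))


# Stand-in for a model prediction: most of the mass on class 200.
probs = np.full(1000, 0.0001)
probs[200] = 0.9
print(dog_detector_from_probs(probs))  # True
```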

The dog detection function detected a dog in 100.0% of the first 100 dog images and in 0.0% of the first 100 human images.

Apart from this, the first half of the project uses a CNN without transfer learning. When training the model, we had to be careful about the number of trainable layers, as more parameters meant longer training and a greater need for GPUs to accelerate the training process.

Then, for the final model, a CNN with transfer learning is used to reduce the training time without sacrificing accuracy. The pre-trained VGG-16 and InceptionV3 models were used as fixed feature extractors.

Methodology

Data Preprocessing

When using TensorFlow as backend, Keras CNNs require a 4D array (which is also referred to as a 4D tensor) as input, with shape

(nb_samples,rows,columns,channels).

Here, nb_samples corresponds to the total number of images (or samples), and rows, columns, and channels correspond to the number of rows, columns, and channels for each image, respectively.

The path_to_tensor function in the datautils.py and the ipython notebook takes a string-valued file path to a color image as input and returns a 4D tensor suitable for supplying to a Keras CNN. The function first loads the image and resizes it to a square image that is 224×224 pixels.

Next, the image is converted to an array, which is then expanded to a 4D tensor. In this case, since we are working with color images, each image has three channels. Likewise, since we are processing a single image (or sample), the returned tensor will always have shape (1,224,224,3).

For batch processing, the paths_to_tensor function takes a numpy array of string-valued image paths as input and returns a 4D tensor with shape

(nb_samples,224,224,3)

Here, nb_samples is the number of samples, or number of images, in the supplied array of image paths. It is best to think of nb_samples as the number of 3D tensors (where each 3D tensor corresponds to a different image) in the dataset.

We rescaled the images by dividing every pixel in every image by 255.
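The shape handling described in this section can be reproduced with NumPy alone. In the sketch below, random arrays stand in for images that have already been loaded and resized to 224×224, and the two function names are illustrative; they are not the path_to_tensor and paths_to_tensor functions from the notebook, which also handle the file loading.

```python
import numpy as np


def image_to_tensor(image_array):
    """Add a leading sample axis: (224, 224, 3) -> (1, 224, 224, 3)."""
    return np.expand_dims(image_array, axis=0)


def images_to_tensor(image_arrays):
    """Stack single-image tensors into (nb_samples, 224, 224, 3)."""
    return np.vstack([image_to_tensor(arr) for arr in image_arrays])


# Four fake RGB images with pixel values in 0..255.
rng = np.random.default_rng(0)
fake_images = [
    rng.integers(0, 256, size=(224, 224, 3)).astype("float32") for _ in range(4)
]

batch = images_to_tensor(fake_images)
scaled = batch / 255  # rescale every pixel to the range [0, 1]

print(image_to_tensor(fake_images[0]).shape)  # (1, 224, 224, 3)
print(batch.shape)                            # (4, 224, 224, 3)
```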

Implementation

1. CNN from scratch

The model summary for the CNN built from scratch, along with the full code for this part of the project, can be found in the notebook. The code is easy to follow and can be adapted for personal use.

I used 3 convolutional layers with 8, 16 and 32 filters respectively to discover features in the images, with a max pooling layer between the convolutional layers. To prevent overfitting, I used dropout layers. At the end I used a dense layer with a ReLU activation function, and for the output layer I used a softmax activation function to get the results as class probabilities.

The convolutional layers extract high-level semantic features by building on lower-level features. Dropout layers randomly drop part of the network during training, so that the network cannot rely too heavily on any single unit and each part gets the chance to be tuned on its own.

The ReLU activation function provides non-linear transformations while avoiding gradient saturation for positive inputs. The dense layer is the most commonly used neural network layer; it is fully connected, meaning each neuron in the dense layer receives input from all neurons of the previous layer.

For the hyper-parameters, I used categorical cross-entropy for the loss, ‘rmsprop’ for the optimizer, and accuracy as the metric. The number of epochs was 50 and the batch size was 20. The test accuracy of the model was 10.28%.
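To show what ReLU and dropout actually do to a layer’s activations, here is a NumPy sketch. It uses the common "inverted dropout" formulation, where the surviving units are scaled up during training so nothing needs to change at prediction time; the 0.5 rate is an arbitrary choice for the example and not the rate used in the project.

```python
import numpy as np


def relu(x):
    """Keep positive values, replace negative values with zero."""
    return np.maximum(x, 0.0)


def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero a random fraction of units, rescale the rest."""
    if not training or rate == 0.0:
        return x
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)


rng = np.random.default_rng(0)
activations = relu(rng.normal(size=(4, 10)))        # non-negative after ReLU
dropped = dropout(activations, rate=0.5, rng=rng)   # about half become zero

print(activations.min() >= 0.0)  # True
print(dropped.shape)             # (4, 10)
```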
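An architecture of this kind could look roughly like the Keras sketch below. Only the filter counts (8, 16, 32), the layer types, the optimizer, the loss and the metric come from the description above. The 3×3 kernels, "same" padding, dropout rate of 0.3 and the 64-unit dense layer are assumptions of mine, so this will not match the notebook’s model summary exactly. TensorFlow is imported inside the function so that it is only needed when the model is built.

```python
def build_scratch_cnn(input_shape=(224, 224, 3), num_breeds=133):
    """Small CNN: 3 conv blocks, dropout, one dense layer, softmax output."""
    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(8, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(16, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(2),
        layers.Dropout(0.3),                    # assumed rate
        layers.Flatten(),
        layers.Dense(64, activation="relu"),    # assumed hidden size
        layers.Dropout(0.3),                    # assumed rate
        layers.Dense(num_breeds, activation="softmax"),
    ])
    model.compile(
        optimizer="rmsprop",
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


# Usage (needs TensorFlow):
# model = build_scratch_cnn()
# model.summary()
```

Training would then be a call to model.fit with epochs=50 and batch_size=20, passing the training tensors and their one-hot breed labels.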

2. Use a CNN to classify Dog Breeds (using Transfer Learning)

For this part of the project, the Jupyter notebook shows that a few pre-trained networks are available for use with Keras.

Transfer learning is a faster approach because the pre-trained model already learned features useful for classifying dogs when it was trained for object classification. Hence we don’t have to train the CNN feature extractors; we reuse their outputs (the bottleneck features).

The model here uses the pre-trained VGG-16 model as a fixed feature extractor, where the last convolutional output of VGG-16 is fed as input to our model.

We only add a global average pooling layer and a fully connected layer, where the latter contains one node for each dog category and is equipped with a softmax.
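What this small head computes can be written out in NumPy. For 224×224 inputs, the last convolutional output of VGG-16 has shape (7, 7, 512) per image, so the sketch below uses random arrays of that shape as stand-ins for real bottleneck features, and random weights in place of trained ones. The function names are illustrative.

```python
import numpy as np


def global_average_pool(features):
    """Average over the spatial axes: (n, 7, 7, 512) -> (n, 512)."""
    return features.mean(axis=(1, 2))


def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def breed_head(features, weights, bias):
    """Global average pooling followed by a fully connected softmax layer."""
    return softmax(global_average_pool(features) @ weights + bias)


rng = np.random.default_rng(0)
features = rng.random((2, 7, 7, 512))             # fake bottleneck features
weights = rng.normal(scale=0.01, size=(512, 133))  # untrained weights
bias = np.zeros(133)

probs = breed_head(features, weights, bias)
print(probs.shape)        # (2, 133): one probability per breed
print(probs.sum(axis=1))  # each row sums to 1
```

With 512 pooled features and 133 breeds, a head like this has 512 × 133 + 133 = 68,229 trainable parameters, which is why it trains so quickly.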

The training time was greatly reduced without sacrificing accuracy when the CNN model was trained using transfer learning. The number of epochs was 20 and the batch size was 20.

With transfer learning, the test accuracy increased greatly, to 43.4%.

Refinement

The transfer learning model was then refined. The refined model used bottleneck features from another pre-trained network, InceptionV3, as the feature extractor.

Inception-v3 is a convolutional neural network architecture from the Inception family that makes several improvements, including label smoothing, factorized 7 x 7 convolutions, and an auxiliary classifier to propagate label information lower down the network (along with batch normalization for layers in the side head).

This makes InceptionV3 an excellent image classification network and the lower layers do not have to be retrained.

The bottleneck features for each of the pre-trained networks on the dog images were already provided, as can be seen in the Jupyter notebook.

The model architecture used a global average pooling layer, similar to the previous transfer learning model. A dropout layer was added so that each part of the network has the chance to be trained separately, which helps the network generalize better. I also added two dense layers with ReLU activation to this model.

The total number of trainable parameters (weights) for this model is 5.7 million. I then compiled the model, i.e. selected the optimizer and the loss function, rmsprop and categorical_crossentropy respectively. Accuracy was used as the performance metric.

The model was then trained with the fit function, with backpropagation updating the parameters and the validation set used to check performance. The number of epochs was set to 20 and the batch size to 20.

After training, the model was loaded and then tested for accuracy. The model returned a test accuracy of almost 80%.

Results

Test accuracy improved drastically when the CNN was implemented with transfer learning. The test accuracy of the CNN model from scratch was around 10%, which improved to around 43% with the transfer learning model.

The transfer learning model not only improved test accuracy but also greatly improved the speed with which the model was trained.

Overall, the results achieved with this model are considered a success given the dataset. We were able to achieve a test accuracy of almost 80%. The lessons from this project can be applied to other fine-grained identification tasks, such as species detection.

However, there are some issues that can be improved in future work. The model correctly identified all dog images, but in one case the predicted breed was incorrect. This might be due to the position of the dog in the image. Increasing the size of the dataset could also improve the accuracy of the predictions.

The final model also failed to identify a human face in one of the images. This might be because the human face in the image was not properly positioned.

To improve the accuracy on such images, data augmentation can be introduced in the future. Augmentation will allow us to train the model to identify human faces in different positions and postures. As a result, more accurate predictions can be achieved.
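As a rough idea of what augmentation means in practice, the NumPy sketch below randomly mirrors an image and shifts it by a few pixels. This was not part of the project; the function name and the 20-pixel limit are made up for the example, and the shift wraps pixels around the edges, which is cruder than what image libraries usually do.

```python
import numpy as np


def random_flip_and_shift(image, rng, max_shift=20):
    """Randomly mirror an (H, W, C) image and shift it with wrap-around."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]  # horizontal flip
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    return np.roll(image, shift=(dy, dx), axis=(0, 1))


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))  # stand-in for a real photo

augmented = [random_flip_and_shift(image, rng) for _ in range(5)]
print(len(augmented), augmented[0].shape)  # 5 (224, 224, 3)
```

Each training epoch would then see slightly different versions of the same photos, which is the effect the paragraph above is after.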

Image detection of a dog and its breed
Image detection of a human face and predicting similar dog breed

Conclusion

In this article we created a dog breed classifier using CNNs. It uses computer vision and machine learning techniques to predict dog breeds from images. The algorithm of this project can be used to solve other fine-grained classification problems.

  1. I used pre-trained models to create human face detection and dog detection functions for user-supplied images. The human face detection function detected a human face in 100.0% of the first 100 human images and in 11.0% of the first 100 dog images. The dog detection function detected a dog in 0.0% of the first 100 human images and in 100.0% of the first 100 dog images.
  2. Then I created a CNN model from scratch, with a test accuracy of around 10.3%. I used 3 convolutional layers with 8, 16 and 32 filters respectively to discover features in the images, added a max pooling layer between the convolutional layers, and used dropout layers to prevent overfitting.
  3. Finally, I created a CNN model with transfer learning. Transfer learning increased training speed without compromising accuracy. The test accuracy of the final model was almost 80%.

The results of this project can be further improved by using larger datasets and introducing augmentation to detect human faces placed in different positions in the image.

For more about this analysis, see the link to my GitHub available here.
