Every machine learning engineer is interested in working with images: they offer a wider perspective for understanding new concepts in machine automation and are a very interesting field to explore. Unlike numeric and categorical data, image data is higher dimensional (2D and 3D), so we cannot work with it the same way as regular data. In this article we will cover the basic concepts involved in image classification, the machine learning models widely used for it, and how these models learn from images. Before we jump into the concepts, let us first understand what an image is.
Image
In general, an image is a visual representation of a thing (a person, a fruit, etc). As humans, we can look at an image and understand it. Further, our brain can recall any situation associated with the image or any similar incident, but how does a machine understand an image?
Technically, an image is composed of an array of pixels, where each pixel represents a small point of visual information.
Inside the computer’s memory, each image is stored as these pixel values: whenever we save an image, the computer simply records the pixel values, which together form the image. Now, what does each pixel represent?
Each pixel in an image represents a data value. To understand this, we first need to know the types of images. There are two types of images:
- Grayscale image: This is in black and white.
- RGB image: Where the image uses the colors in the RGB spectrum,
R — Red
G — Green
B — Blue
RGB image
RGB images are the typical colored images that we see regularly. Each pixel is represented by an array of three numbers, one for each of the three colors (R, G, B), each ranging between 0 and 255.
0 — no intensity (black)
255 — full intensity (white)
These three colors together decide the color of the pixel. For instance, consider the below values,
R, G, B = [0,0,0] ==> this represents that the pixel is black.
R, G, B = [255,255,255] ==> this represents that the pixel is white.
R, G, B = [51,255,51] ==> this represents that the pixel is light green.
These three colors are referred to as channels.
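To make pixels and channels concrete, here is a minimal sketch, assuming NumPy, of how a tiny hypothetical 2x2 RGB image could be stored as an array of pixel values:

```python
# A minimal sketch, assuming NumPy, of a tiny (hypothetical) 2x2 RGB image
# stored as an array of pixel values.
import numpy as np

tiny_image = np.array([
    [[0, 0, 0],     [255, 255, 255]],  # black pixel, white pixel
    [[51, 255, 51], [255, 0, 0]],      # light green pixel, red pixel
], dtype=np.uint8)

print(tiny_image.shape)  # (2, 2, 3) -> height, width, channels
print(tiny_image[0, 1])  # [255 255 255] -> the white pixel
```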
Grayscale image
Grayscale images are black-and-white images. Unlike colored images, each pixel in a grayscale image holds only one number, ranging between 0 and 255, which represents the intensity of light rather than a color.
So, when we say an image is of size 10x10, we mean that the height of the image is 10 pixels and the width is 10 pixels. Now, what about the color?
In general, grayscale image dimensions are represented as AxB, where A is the height and B is the width. RGB image dimensions are represented as AxBxC, where A is the height, B is the width, and C is the number of color channels (R, G, B).
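As a rough illustration, this is how those dimensions show up when loading an image with OpenCV (the file name car.jpg is just a placeholder):

```python
# A minimal sketch of checking image dimensions with OpenCV; "car.jpg" is a
# hypothetical file name.
import cv2

color_image = cv2.imread("car.jpg")                       # height x width x 3 array (BGR order)
gray_image = cv2.imread("car.jpg", cv2.IMREAD_GRAYSCALE)  # height x width array

print(color_image.shape)  # e.g. (480, 640, 3) -> AxBxC
print(gray_image.shape)   # e.g. (480, 640)    -> AxB
```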
We can convert an RGB (colored) image into grayscale. As discussed above, each pixel in a grayscale image represents the intensity of light, so to find that intensity we need to know the minimum and maximum values among the pixel’s three colors.
R,G,B = [100,10,120]==> minimum([100,10,120])==> 10
R,G,B = [100,10,120]==> maximum([100,10,120])==> 120
The above two equations cover only one pixel of an RGB image, so we need to compute the min and max of each pixel in the image and take their average:
(min(R, G, B) + max(R, G, B)) / 2
This is one method of converting an RGB image to grayscale, called the Lightness method. Another method, called the Average method, involves taking the average of the three channel values of each pixel:
(R + G + B) / 3
The most common and widely used method for converting an RGB image to grayscale is called the Luminosity method. Here, we take a weighted average of the colors in each pixel; the proposed formula is:
0.3*R + 0.59*G + 0.11*B
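For illustration only, here is a minimal NumPy sketch of the three methods applied to a dummy RGB array (the array and its size are made up):

```python
# A minimal sketch of the Lightness, Average, and Luminosity methods using
# NumPy; `rgb` is a dummy 4x4 RGB image, not real data.
import numpy as np

rgb = np.random.randint(0, 256, size=(4, 4, 3)).astype(np.float32)
r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]

lightness  = (rgb.min(axis=-1) + rgb.max(axis=-1)) / 2  # (min + max) / 2
average    = (r + g + b) / 3                            # plain average
luminosity = 0.3 * r + 0.59 * g + 0.11 * b              # weighted average

print(lightness.shape, average.shape, luminosity.shape)  # each (4, 4): one value per pixel
```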
In general, we don’t manually convert an RGB image to grayscale; we make use of libraries such as OpenCV.
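For example, a minimal OpenCV sketch might look like this (the file name is hypothetical; note that OpenCV loads color images in BGR order rather than RGB):

```python
# A minimal sketch of converting a color image to grayscale with OpenCV;
# "car.jpg" is a hypothetical file name.
import cv2

bgr_image = cv2.imread("car.jpg")                         # OpenCV loads color images in BGR order
gray_image = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)  # weighted (luminosity-style) conversion
cv2.imwrite("car_gray.jpg", gray_image)
```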
Now we have briefly understood what an image is, what the color channels are, and the difference between an RGB image and a grayscale image. So far so good, but how does a machine read an image and understand what it is representing?
Consider a picture of a car: by looking at it, we can say it is a car because we see the hood, headlights, bumper, wheels, hub caps, and doors. So, we recognize a car through its external characteristics, which are called features. A machine learning model also recognizes a car by identifying features, but it cannot understand directly from the raw image that it is a car; we first need to extract the features and provide them to the model. This process is called feature extraction.
Feature extraction is a method of extracting the important aspects of an image and providing them to the models, so that they can learn and understand. But how do we extract them?
We have built-in algorithms called feature descriptors, which read through an image and extract the features from it. A few of the most widely used feature descriptors are:
- Histogram of Oriented Gradients (HOG)
- Local Binary Pattern (LBP)
- Scale-Invariant Feature Transform (SIFT)
- Speeded-Up Robust Features (SURF)
We will go through each of these feature descriptors in Part 2 of this image classification series.
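As a small preview, here is a minimal sketch of extracting HOG features with scikit-image; the parameter values are just common choices, not a recommendation:

```python
# A minimal sketch of HOG feature extraction, assuming scikit-image is installed.
from skimage import data
from skimage.feature import hog

image = data.camera()            # a built-in sample grayscale image
features = hog(image,
               orientations=9,
               pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))

print(features.shape)            # a 1-D feature vector that a model can learn from
```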
Can we use all the machine learning models for image classification?
Well, the answer is ‘yes’, but it depends on the problem and the data we have. The most popular machine learning models used for image classification are:
- Support Vector Machine (SVM)
- Random Forest (RF)
- Decision Trees (DT)
Other models such as KNN and logistic regression are generally used only during trial and testing phases, because they have not proven to be reliably accurate for image classification.
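As an illustration of how such a model is used, here is a minimal sketch of training an SVM on scikit-learn’s small digits dataset; for simplicity it uses the raw flattened pixels, whereas in practice you would usually feed it extracted features such as HOG:

```python
# A minimal sketch of image classification with an SVM, assuming scikit-learn.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()                              # 8x8 grayscale digit images
X = digits.images.reshape(len(digits.images), -1)   # flatten each image into a vector
y = digits.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = SVC()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))                  # accuracy on held-out images
```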
Apart from machine learning models, we have Neural Networks which are widely used for Image classification,
- Convolutional Neural Networks (CNN)
- Pre-trained CNNs, such as AlexNet and GoogLeNet (trained on the ImageNet dataset)
- Artificial Neural Networks (ANN)
CNNs are designed specifically for image data. For the machine learning models and ANNs, we need to extract features through feature descriptors and provide them as input, but in the case of CNNs we don’t need to do that: they extract the features from an image themselves, through a process called convolution.
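To make that concrete, here is a minimal sketch of a small CNN in Keras; the input size, layer sizes, and number of classes are illustrative assumptions, not a recommended architecture:

```python
# A minimal sketch of a CNN, assuming TensorFlow/Keras; the convolutional
# layers learn features directly from the raw pixels, so no separate
# feature-extraction step is needed.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, (3, 3), activation="relu", input_shape=(64, 64, 3)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),  # e.g. 10 output classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```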
This article is only to provide an overview of working with images, in the upcoming parts we will discuss each process and model used for image classification in detail.
About me
I’m currently working as a data science intern to get some practical knowledge of what I have learnt in my Master’s. I prefer to share my knowledge through my blogs, which I hope will benefit new aspiring data science students. If you like my work, please feel free to contact me through LinkedIn (Gopi Krishna Duvvada).