Let us step ahead to some real-world problem solving. Some history: a few days ago, while working on gesture recognition, one of the problems I faced was related to segmentation. I was trying to recognize hand gestures. We started off with classification of simple gestures - 1 finger, 2 fingers ... 5 fingers - and created our own dataset of images. This went very smoothly, and an accuracy of 85% plus was achieved. Then came the time to implement this in real time, and I was stumped by the problem of segmentation. In a real-life scenario, with varying lighting conditions, segmenting a human hand from the background is quite a challenging problem. There are various methods to do this. Our problem statement for this blog post is going to be one of those methods - segmenting skin color from a given image.
What if we classify each image pixel as “skin_color” vs “Non_Skin_color”? This would make the life of an image processing programmer a little easier. Then, in a constrained environment, we would be able to segment out the hand (well, rather the skin color, which includes forearms, face, etc. as well).
What is an image, digitally speaking? An image is a rectangular grid of cells. Each cell holds a set of 3 values, one each for blue, green and red, and each such cell is known as a pixel. So in order to represent an image of, say, 32x32 pixels, 32x32x3 integer values are used.
There is a dataset in the UCI repository for this - the Skin Segmentation data set.
“Dataset is collected by randomly sampling B,G and R values from face images of various age groups (young, middle, and old), race groups (white, black, and Asian), and genders obtained from FERET database and PAL database. Total learning sample size is 245057; out of which 50859 are the skin samples and 194198 are non-skin samples.”
So for our dataset, the number of attributes, aka features, is going to be 3:
- blue component value of a pixel
- green component value of a pixel
- red component value of a pixel
The outcome is going to be the answer to the question “does this combination of r, g, b values represent a skin color?” Yes or No.
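Here is a rough sketch of how such rows could be read into features and labels. It assumes each line of the data file holds four whitespace-separated integers in the order B, G, R, label, with label 1 meaning skin and 2 meaning non-skin. That layout and the helper name `parse_skin_rows` are assumptions for illustration, so check them against the file you actually download.

```python
import io


def parse_skin_rows(lines):
    """Parse 'B G R label' rows into (features, labels).

    Assumption: label 1 = skin, label 2 = non-skin. Adjust if your file differs.
    """
    features, labels = [], []
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        b, g, r, label = (int(p) for p in parts)
        features.append((b, g, r))
        labels.append(label == 1)  # True for skin, False for non-skin
    return features, labels


# Two made-up rows in the assumed layout, not real dataset rows.
sample = io.StringIO("74 85 123 1\n200 198 30 2\n")
X, y = parse_skin_rows(sample)
print(X, y)
```

Any iterable of lines works here, so an open file handle can be passed in place of the `StringIO` object.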
In Machine Learning terminology, this is Supervised Learning. Given the labeled examples, we design a rule. This rule can be simple or complex depending on the dataset and methodology. This is similar to solving other classification problems - from the “Spam or Not-Spam” email classification problem to complex object recognition problems in Image Processing.
So, just like last time, let us march ahead and do some Data Visualization.
Now, for this problem, even though the number of samples is very high, the number of features is quite small. So we can visualize each feature individually.
Now, we have to come up with a model that separates skin colors from non-skin colors. From the graph, one can evidently say that if the red component is less than 100, it has to be a non-skin color. Let us first prove a point.
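A minimal sketch of such a single-threshold rule follows. The threshold of 100 on the red component comes from the observation above; the function names and the four pixels are made up for illustration and are not taken from the dataset.

```python
def threshold_classifier(pixel, red_threshold=100):
    """Predict skin (True) when the red component reaches the threshold."""
    b, g, r = pixel
    return r >= red_threshold


def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)


# Made-up (B, G, R) pixels; the third one is bright but not skin.
pixels = [(74, 85, 123), (60, 70, 90), (200, 198, 230), (20, 30, 40)]
labels = [True, False, False, False]

predictions = [threshold_classifier(p) for p in pixels]
print(predictions, accuracy(predictions, labels))
```

The rule gets dark non-skin pixels right, but any non-skin pixel with a high red value slips through as skin.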
Though this can be termed one of the possible models, we can agree, from the visualization of our features, that coming up with a single threshold that separates the classes is not possible. This is because a threshold value forms a line parallel to one of the axes, and no horizontal or vertical line can separate the two classes.
In the last post we met the notion of training data and testing data: we want to estimate the ability of our model to perform a task T on a new set of instances. Last time, this task T was prediction; in this particular problem it is classification of skin color versus non-skin color.
Training and Testing Split
There are various methods for splitting the data into training and testing sets. You might have heard terms such as “training accuracy” and “testing accuracy”. Training accuracy is simply the accuracy measured on the training data, and testing accuracy is the accuracy achieved on the testing data. It is not at all unusual for the testing score to be lower than the training score. This fact will become more and more important as we proceed to use more complex data and classifiers. In fact, every machine learning programmer has at least once seen 100% accuracy on the training data and random-guess accuracy on the testing data.
One important note: we should always report the testing accuracy, that is, the score on the collection of exemplars that were not used for training.
It is usual practice to randomly shuffle the data and split it into training and testing sets as 80%-20%, 2/3rd-1/3rd, etc. One natural question is, obviously, what is the ideal split? If we use a more equal split, we miss out on a lot of data in the training phase, and if we use more for training, we test and verify our model on very little data. Ideally, we would want to use all the data for training as well as testing. Something close to that can be achieved by a technique called Cross Validation.
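The shuffle-then-split practice can be sketched in a few lines of plain Python. The helper name `shuffle_split`, the fixed seed and the toy samples are assumptions made for illustration.

```python
import random


def shuffle_split(samples, labels, test_fraction=0.2, seed=0):
    """Shuffle once, then hold out `test_fraction` of the data for testing."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split repeatable
    n_test = round(len(samples) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = ([samples[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([samples[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test


# Ten made-up samples; the labels simply alternate.
samples = [(i, i, i) for i in range(10)]
labels = [i % 2 == 0 for i in range(10)]

(train_X, train_y), (test_X, test_y) = shuffle_split(samples, labels)
print(len(train_X), len(test_X))
```

Shuffling indices rather than the samples themselves keeps each sample paired with its label.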
Let us start from the extreme. Imagine a hypothetical case in which we have 100 samples. For the sake of using all the data for training and testing, we can do the following: train the model using the first 99 samples and test it on the last one; then train the model using samples 1-98 and 100, and test it on the 99th sample; and so on. This way, we use all the data for training and testing. Obviously enough, the cost is 100 times higher in this case, in that I have to train my model n times, n being the size of the data. BTW, this is also known as leave-one-out splitting, and can be useful sometimes.
Most of the benefits of leave-one-out can be retained at a fraction of the cost using x-fold cross validation, x being a small number, say five. We break the data into five groups, i.e. five folds. Then we learn five models, leaving one fold out each time. It is similar to leave-one-out, but now we leave out 20% of the data each time. We test each model on its left-out fold and average the results.
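As a sketch of the bookkeeping involved, the generator below yields the training and testing indices for each fold. The name `k_fold_indices` is made up for illustration, and the sketch assumes the data has already been shuffled. Setting k equal to the number of samples gives the leave-one-out case.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, k=5))
for train_idx, test_idx in folds:
    print(test_idx, "held out, training on", len(train_idx), "samples")

# Leave-one-out is the special case where k equals the number of samples.
leave_one_out = list(k_fold_indices(4, k=4))
```

Every sample lands in exactly one test fold, so across the k rounds all the data is used for both training and testing.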
Of course, there comes an in-built complexity of balance. Each fold has to have an (almost) equal share of all the classes. E.g. if all the examples in one of the folds belong to the same class, the result on that fold will not be truly representative.
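One way to keep the folds balanced, sketched below, is to shuffle the samples of each class separately and deal them out to the folds one by one. The helper name `stratified_fold_assignment` and the 10-versus-40 toy labels are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_fold_assignment(labels, k=5, seed=0):
    """Return a list giving the fold number (0..k-1) of every sample."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    folds = [None] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the samples of this class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[index] = position % k
    return folds


# Made-up, deliberately uneven labels: 10 skin samples and 40 non-skin samples.
labels = [True] * 10 + [False] * 40
folds = stratified_fold_assignment(labels, k=5)

for fold in range(5):
    members = [labels[i] for i in range(len(labels)) if folds[i] == fold]
    print(fold, members.count(True), members.count(False))
```

With these toy labels every fold ends up with the same class ratio as the full set.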
Coming back to our problem of classifying skin and non-skin color: having agreed that we will not be able to come up with a threshold that separates the two classes, we need something else. Let us get our hands dirty with a very simple classifier - the Nearest Neighbor Classifier.
The idea is very simple. We represent each sample as a point in N-dimensional space, N being the number of features, 3 in our case. We can then compute the distance between samples.
There are various ways to compute distances. Let us adopt a commonly used variant - Euclidean Distance, or to be specific, Squared Euclidean Distance. For two pixels p = (b1, g1, r1) and q = (b2, g2, r2), it is d(p, q) = (b1 - b2)^2 + (g1 - g2)^2 + (r1 - r2)^2. Skipping the square root does not change which point is closest, so we can save ourselves that step.
Now, at the time of classification, given a new example, we find the closest sample (the Nearest Neighbor) to it and use its label.
Well, with this particular algorithm, we do not really have separate training and testing phases. For every testing pattern, we pass through all the training data and simply assign a label.
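A bare-bones sketch of this classifier in plain Python follows. It is the slow, one-query-at-a-time form, written for readability rather than speed; the function names and the four training pixels are made up for illustration.

```python
def squared_euclidean(p, q):
    """Sum of squared differences between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_neighbor_predict(train_X, train_y, query):
    """Return the label of the training sample closest to `query`."""
    best_index = min(
        range(len(train_X)),
        key=lambda i: squared_euclidean(train_X[i], query),
    )
    return train_y[best_index]


# Made-up (B, G, R) training pixels with skin (True) / non-skin (False) labels.
train_X = [(74, 85, 123), (95, 110, 160), (200, 198, 30), (20, 30, 40)]
train_y = [True, True, False, False]

print(nearest_neighbor_predict(train_X, train_y, (80, 90, 130)))
print(nearest_neighbor_predict(train_X, train_y, (25, 25, 35)))
```

There is no fitting step at all: the "model" is simply the stored training data, and all the work happens at query time.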
And hence, it is quite slow as well. The code is an easy flow, and I am sure you glanced over the slow, little-faster and faster ways of doing the same thing. This, in particular, is an important task that usually goes unstated. By no means is this the fastest implementation; it is just practical. We have around 0.2 million data points, and computing the distances of 50K points against them gets unreasonable in terms of the time it takes. Hence the many versions; I hope you get a chance to play with them.
The results are quite satisfactory; 99% plus does not usually come along this easily.
This classifier can be further improved by considering the k nearest points and letting them elect the best class. This is known as the k-Nearest-Neighbor classifier.
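The voting idea can be sketched as below. The function name `knn_predict` and the toy pixels are assumptions for illustration; the first training pixel is given a deliberately odd label to show how a vote among three neighbors can overrule a single misleading one.

```python
from collections import Counter


def squared_euclidean(p, q):
    """Sum of squared differences between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def knn_predict(train_X, train_y, query, k=3):
    """Let the k closest training samples vote on the label of `query`."""
    order = sorted(
        range(len(train_X)),
        key=lambda i: squared_euclidean(train_X[i], query),
    )
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Made-up pixels; the first one carries a deliberately odd label.
train_X = [(80, 90, 130), (74, 85, 123), (78, 92, 128), (20, 30, 40)]
train_y = [False, True, True, False]
query = (80, 90, 131)

print(knn_predict(train_X, train_y, query, k=1))
print(knn_predict(train_X, train_y, query, k=3))
```

With two classes, an odd k avoids tied votes.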
Let us Fold It!
But wait! We discussed a lot about cross validation. Why not apply it?
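Here is a rough sketch of five-fold cross validation wrapped around a nearest neighbor classifier. All names are made up for illustration, and the data is synthetic: two well-separated clusters of random pixels, so the score it prints says nothing about the real dataset.

```python
import random


def squared_euclidean(p, q):
    """Sum of squared differences between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_neighbor_predict(train_X, train_y, query):
    """Return the label of the training sample closest to `query`."""
    best = min(range(len(train_X)), key=lambda i: squared_euclidean(train_X[i], query))
    return train_y[best]


def cross_validate(X, y, k=5, seed=0):
    """Return (mean accuracy, per-fold accuracies) over k folds."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    scores = []
    for fold in range(k):
        test_idx = indices[fold::k]  # every k-th shuffled index forms one fold
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct = sum(
            nearest_neighbor_predict(train_X, train_y, X[i]) == y[i] for i in test_idx
        )
        scores.append(correct / len(test_idx))
    return sum(scores) / k, scores


# Synthetic (B, G, R) pixels: a reddish cluster and a dark cluster.
rng = random.Random(1)
skin_like = [(rng.randint(60, 100), rng.randint(80, 120), rng.randint(150, 200)) for _ in range(20)]
dark = [(rng.randint(0, 40), rng.randint(0, 40), rng.randint(0, 40)) for _ in range(20)]
X = skin_like + dark
y = [True] * 20 + [False] * 20

mean_score, fold_scores = cross_validate(X, y)
print(mean_score, fold_scores)
```

Looking at the per-fold scores, and not only their mean, shows whether the classifier behaves consistently across the folds.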
Testing on Real Data
Well, we can definitely say now that our Nearest Neighbor Classifier is quite successful; it gives 99% plus accuracy in all the cases. And hence, for a new incoming point, we can now use ALL of our data.
Well, this is fine; testing something on a dataset is a way of creating a model. But if we do not test it on real data, it is as good as useless.
Following are the input and output images:
As you can see, as we move towards more complex images, the accuracy seems to drop. In the last image, the pixels of the table, the print on the T-shirt, etc. get misclassified. Hence this method, though it achieved 99% plus accuracy on the dataset, is right only around 60% of the time in real-life situations.
We are now in full swing, digging into classification problems. And be assured, this is just a start.
As a matter of fact, the problem that we solved was too simple. We had only three features, and all of them had the same scale. Hence, normalization was not needed. Our data had an uneven split between the classes, but the classifier we chose was unaffected by this. Yes, not all classifiers are this considerate.
In the upcoming posts, these concerns will be addressed, of course one at a time, and of course, with a real-life example.