# Machine Learning 3: Prediction and its Traits

By Sagar Gandhi

Let us begin. I am pretty excited about this post; from here onward we start to learn how to actually do Machine Learning.
In most posts I have followed, people take the approach of first explaining the types of things one can do with Machine Learning, namely classification, clustering, and so on.

Let us take a different path: let us head straight into solving a problem and get our hands dirty. We will, of course, come to a point where we talk about the types of problems that Machine Learning solves, but let us first get introduced to problem solving itself.

## Environment
We will be using Python; the examples are tested with Python 3.5.
We need the numpy, scipy and matplotlib packages. How to install these is not covered here, as that would make us drift. You can install Anaconda, or follow an OS-specific post for installation.

## Problem Statement
Let us look at a simple example. This will help us in getting started. Let us say we wish to predict the usage of a particular word over the next 50 to 100 years. What are our options? We need data on the usage of that word up to now.

Google’s Ngram Viewer comes to our rescue. As we are working on a small branch of the Artificial Intelligence tree, let us pay a small tribute to the community by predicting the future of the word “artificial”. Below is a snapshot of the usage of the word, and you can play with it here. With the help of a friend, the data has been downloaded for the years 1500 to 1999. For further simplification, as this is our first dive-in, I have renumbered the years from 0 to 499. You can have a look at the data in the file ML_curve_fitting_example.dat: the left column indicates the year (counted from 1500), and the right column indicates the usage score of the word.

Let us further pin-point our task: let us say we have to predict when this word will become obsolete, i.e. when its usage will reach zero, based on the provided data.

## Visualizing and cleansing the data
Let us have a look at our data and its characteristics. We begin by reading the file. We display the last ten values and the min and max of Y, i.e. the usage of the word. From these we see that the values are very small, hence we multiply each value by 100,000. In this particular case we have one-dimensional data, i.e. only the score, so scaling is a cakewalk. Be warned that it will not always be this easy.
We also check that our data does not contain any missing values. In our particular case there were none, but it is always better to check the data beforehand, to give ourselves an opportunity to handle any intricacies from the data gathering. Then we simply draw a scatter plot, as in the sketch below.
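The original code is linked at the end of the post; as a minimal sketch of this step, something along the following lines reads the file, scales the scores, checks for missing values and draws the scatter plot (the file name and the scaling factor come from the post, the variable names are my own):

```python
import numpy as np
import matplotlib.pyplot as plt

# Read the data: left column is the renumbered year (0..499),
# right column is the raw usage score of the word "artificial".
data = np.genfromtxt("ML_curve_fitting_example.dat")
x = data[:, 0]
y = data[:, 1] * 100000          # scale the tiny scores up, as described above

print("Last ten values:", y[-10:])
print("Min:", y.min(), "Max:", y.max())
print("Missing values:", np.sum(np.isnan(y)))   # expected to be 0 for this file

plt.scatter(x, y, s=10)
plt.xlabel("Year (0 = 1500)")
plt.ylabel("Scaled usage score")
plt.title('Usage of the word "artificial"')
plt.show()
```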

## Modelling
Now that we have taken a look at our data, the question is: what will the usage score be for the upcoming years?
To answer this question, we need to fit a model to our data and then extrapolate that model to predict the future.

Let us assume that a straight line fits our data. NumPy's polyfit() function (also exposed through SciPy's namespace in older versions) will help in finding the equation of that line.
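A minimal sketch of the straight-line fit, reusing the x and y arrays from the previous snippet; the post does not spell out its exact error definition, so here it is taken as the sum of squared residuals, which matches the Res value reported below:

```python
import numpy as np

# x, y come from the data-loading snippet above.
fp1, res1, rank, sv, rcond = np.polyfit(x, y, 1, full=True)
print("Res:", res1)
print("Model parameters:", fp1)

f1 = np.poly1d(fp1)              # turn the coefficients into a callable model


def error(f, x, y):
    # Sum of squared residuals (assumed definition; it matches Res above).
    return np.sum((f(x) - y) ** 2)


print("Error O_1:", error(f1, x, y))
```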

Output:
```
Res: [ 208.94057049]
Model parameters: [ 0.00623047 -0.11288484]
Error O_1: 208.940570486
```
A straight line is not all that fancy. If curve fitting is this easy to code, why not try higher-order polynomials?
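A sketch of trying higher-degree polynomials, reusing x, y and error() from the snippets above; the degrees 1, 2, 3, 10 and 100 are taken from the errors reported below:

```python
import numpy as np

# Fit one polynomial per degree and report its training error.
# NumPy will warn that the degree-100 fit is poorly conditioned; that is
# expected and part of the point being made here.
models = {}
for degree in [1, 2, 3, 10, 100]:
    coeffs = np.polyfit(x, y, degree)
    models[degree] = np.poly1d(coeffs)
    print("Error O_%d: %s" % (degree, error(models[degree], x, y)))
```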

Output:
```
Error O_1: 208.940570486
Error O_2: 196.453545329
Error O_3: 163.573700658
Error O_10: 158.810240344
Error O_100: 149.630548353
```
So, which of the models should be used? It might seem that the higher the degree of the polynomial, the better. Right?
Absolutely wrong! Higher-degree polynomials do not only model the data, they also model the noise. Ideally we would want to eliminate all the noise from our model, but eliminating it completely is not possible. So, how can we confirm this? Well, visually, as in the sketch below …
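One way to check this visually is to overlay each fitted polynomial on the scatter plot; this is a sketch reusing the models dictionary from the previous snippet, not the post's original plotting code. The high-degree curves wiggle through the noise instead of following the trend:

```python
import numpy as np
import matplotlib.pyplot as plt

plt.scatter(x, y, s=10, label="data")
grid = np.linspace(x.min(), x.max(), 1000)
for degree, model in sorted(models.items()):
    plt.plot(grid, model(grid), label="degree %d" % degree)
plt.ylim(y.min() - 5, y.max() + 5)   # keep high-degree wiggles from dominating the axis
plt.xlabel("Year (0 = 1500)")
plt.ylabel("Scaled usage score")
plt.legend(loc="upper left")
plt.show()
```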

If we have another look at the data, it can be seen that over the last 100 years the usage has been going down. So let us split the data into two sets: the last 100 years separate from the rest.
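A sketch of this two-model split, assuming the change point sits at renumbered year 400 (i.e. the last 100 of the 500 years); x, y and error() again come from the earlier snippets:

```python
import numpy as np

inflection = 400                      # assumption: "last 100 years" starts here
xa, ya = x[x < inflection], y[x < inflection]
xb, yb = x[x >= inflection], y[x >= inflection]

# Fit one straight line to each partition.
fa = np.poly1d(np.polyfit(xa, ya, 1))
fb = np.poly1d(np.polyfit(xb, yb, 1))

partition_error = error(fa, xa, ya) + error(fb, xb, yb)
print("Error using partition: %f" % partition_error)
```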

Output:
```
Error O_1: 208.940570486
Error O_2: 196.453545329
Error O_3: 163.573700658
Error O_10: 158.810240344
Error O_100: 149.630548353
Error using partition: 167.911429
```
As can be concluded from the figure, fitting two low-degree models reduces the error compared to the low-degree single models we tried, and the very high-degree models beat it only by modelling the noise. Hope this makes it clear that higher-degree polynomials do NOT mean a better model.

So which model should be used in practice? Well, using multiple models is not a very bright idea. If we did that, we would need some conditioning: if the year is greater than such and such, use this model, otherwise use the other one. And for the future, we would effectively rely only on the second model (the one closest to the future), which ignores all the data from the past.

## Training, Testing and Conclusion
Till now, we have used all the data for our experiments. It would be really great if we could look into the future and somehow grasp the actual values, but that is not possible. So what we do is take a chunk of our data and test the system on that chunk. Needless to say, this data should NOT be seen while training the system.

In our case, if we take the data from the last 100 years, it is going to be difficult to fit a model, since the data clearly has a “shift” in its behaviour. So let us consider only the last 50 years for testing, and let us evaluate all the models on this “Test Dataset”. Note that all models need to be retrained on the remaining data, as sketched below. This gives a more realistic picture of how our models perform.
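A sketch of this train/test setup; the post's exact split is not shown (the output below suggests 449 training points), so here we simply hold out the last 50 rows, retrain each model on the rest, and finally solve for the year where the chosen model's prediction drops to zero (the fsolve starting guess of 500 and the degree list are my assumptions, guided by the output below):

```python
import numpy as np
from scipy.optimize import fsolve

# x, y and error() come from the earlier snippets.
split = len(x) - 50
x_train, y_train = x[:split], y[:split]
x_test, y_test = x[split:], y[split:]
print("Training Set:", len(x_train))
print("Test Set:", len(x_test))

# Retrain each candidate model on the training years only.
test_models = {}
for degree in [1, 2, 3, 10, 50]:
    f = np.poly1d(np.polyfit(x_train, y_train, degree))
    test_models[degree] = f
    print("Error O_%d: %s" % (degree, error(f, x_train, y_train)))

# Evaluate every retrained model on the held-out test years.
for degree, f in sorted(test_models.items()):
    print("test: order_%d = %f" % (degree, error(f, x_test, y_test)))

# Extrapolate the best model (degree 3, per the test errors) to find the
# year where usage hits zero; 500 is just a starting guess for the solver.
best = test_models[3]
zero_year = fsolve(best, x0=500)[0]
print('Word "artificial" will become ancient in:', zero_year + 1500)
```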

Output:
```
Training Set: 449
Test Set: 50
Error O_1: 168.926371288
Error O_2: 168.565578906
Error O_3: 160.721239448
Error O_10: 157.935350721
Error O_50: 146.882568126
test: order_1 = 59.359704
test: order_2 = 69.137475
test: order_3 = 13.847517
test: order_10 = 346.005838
test: order_50 = 868412738064986.500000
Word “artificial” will become ancient in: 2146.28104703
```
So, from the Test Dataset it is clear that the polynomial of degree three is the best of what we have tried, and as per the data and our prediction model, the word “artificial” would see its darkest year in 2146.

Now, there are other parameters as well. Given that we are both studying Artificial Intelligence, well, a part of it, the word seems to have a bright future.

Also, data from 2000 to 2008 was intentionally left out. This is to shed some light on the fact that even though Machine Learning is a great tool and works really well when data is available, trusting it too much is not a good idea after all. This happens simply because we did not have any “features” as such; we merely gathered the past usage and tried to fit a model to it. External parameters are so important that they can actually help us model the problem rather than just the data.

In the upcoming posts, we will dive into other details: we will learn more about features, classification, data quality, and so on.

Hope you enjoyed this long post. All of the code is available here.

Note: I am thankful to Vivek for providing valuable feedback and suggestions on this post.