Cross validation implementation in python using scikit-learn models

Say you want to learn about machine learning. As you learn more and more, you realize it is a vast subject and fear you might get lost in all this new information. For instance, just to classify some data, you learn that there are tons of classifiers, and you panic. This is how I reacted when I read the documentation of scikit-learn classifiers: Ridge classification, SVM classification, decision tree classification, k-nearest neighbors classification, and many more...

But fear not! Each model has its own strengths and weaknesses. For the same dataset, one method might work better than another. Although this might seem like an abstract concept (especially to me, as I am a newbie in the subject), the next logical step is to try to quantify how well a model performs compared to another one on the same dataset.

Enter cross-validation (check out this awesome video). Cross-validation is a method to compare and quantify the effectiveness of different models on the same dataset.
Here are the steps for cross-validation (a quick comparison with scikit-learn's built-in helper follows the list):

  1. Separate the data into different "blocks".
  2. For each method, each block is used once as the testing data while the remaining blocks are used as the training data. The model's score is calculated for each choice of testing block, and the highest of these scores is the one associated with the model.
  3. Finally, once every model has been assigned a score, the model with the highest score is the winner.
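For reference, scikit-learn already ships a helper that performs these splits and scores automatically. Here is a minimal sketch using cross_val_score; note that it returns one score per fold, and the usual convention is to average them rather than keep the best one, which is the choice I make below:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# 4 folds, one accuracy score per fold
X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=4)
print(scores.mean())   # conventional summary: the average fold score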

In theory, this doesn't sound like such a hard concept to implement, so let's check out how I went about doing it! The dataset used here is the iris dataset from scikit-learn.

The first step is to read in the data:


import random

class Cross_validation:
    def __init__(self, X, y, methods):
        """
        X : features
        y : target
        methods : list of methods to compare
        methods will be imported from sklearn, fit() and score() will be used
        """
        assert len(X) == len(y), "target and instance mismatch"
        self.nblocks = 4
        self.methods = methods

        self.data = list(zip(X, y))   # pair each instance with its label
        random.shuffle(self.data)     # shuffle so blocks aren't ordered by class

I chose to create a class for cross-validation. This class takes three arguments: X and y are the data (features and target respectively), and methods is a list of all the methods to compare. An example is shown at the end for simplicity's sake.
assert len(X) == len(y) checks that the number of instances matches the number of targets.
self.nblocks = 4 means the data will be subdivided into 4 blocks; 4 is arbitrary here.
Finally, we shuffle the data for good measure (the iris dataset is sorted by class, so without shuffling each block would contain mostly one class).
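To make the last two lines concrete, here is a tiny standalone illustration (toy data, not iris) of what __init__ stores:

import random

X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
data = list(zip(X, y))   # pairs each instance with its label
random.shuffle(data)     # in-place shuffle, so the order varies per run
print(data)              # e.g. [([3], 1), ([0], 0), ([2], 1), ([1], 0)]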

The next step is to divide the data into smaller blocks:


    def train_test(self):
        """
        splits the data into self.nblocks blocks
        """
        assert len(self.data) % self.nblocks == 0, "number of instances must be a multiple of the number of total blocks"

        elements = len(self.data) // self.nblocks
        blocks = []
        for i in range(self.nblocks):
            blocks = blocks + [self.data[i*elements:(i+1)*elements]]
        return blocks
 

This train_test function returns a list made of nblocks blocks of data. The code is pretty self-explanatory; the only subtlety is that I make sure the number of instances is a multiple of the total number of blocks. This is not very important, I just wanted all the blocks to have the same number of elements.
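Here is the same splitting logic as a standalone sketch, with 8 toy instances and nblocks = 4, so each block holds 8 // 4 = 2 elements:

data = list(range(8))
nblocks = 4
elements = len(data) // nblocks
blocks = [data[i * elements:(i + 1) * elements] for i in range(nblocks)]
print(blocks)   # [[0, 1], [2, 3], [4, 5], [6, 7]]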

Next up, we actually want to be able to measure the performance of a model:


    def score_method(self, method):
        """
        trains and tests the model on different blocks of the same dataset and keeps the best score
        """
        blocks = self.train_test()
        best_score = 0
        for i in range(len(blocks)):
            X_train, y_train = [], []
            for j in range(len(blocks)):
                if i != j:
                    X_train = X_train + [blocks[j][x][0] for x in range(len(blocks[j]))]
                    y_train = y_train + [blocks[j][x][1] for x in range(len(blocks[j]))]
            X_test, y_test = [blocks[i][j][0] for j in range(len(blocks[i]))], [blocks[i][j][1] for j in range(len(blocks[i]))]
            
            method.fit(X_train, y_train)
            score = method.score(X_test, y_test)
            if score >= best_score:
                best_score = score
        return best_score

Let's dissect this code.
The idea here is to start off by taking the data given to the class as arguments (X and y, made into self.data) and using the function train_test() written above to create the blocks of data.
Then, each block is used as the test block once; the rest are used as training blocks.

X_train = X_train + [blocks[j][x][0] for x in range(len(blocks[j]))]
y_train = y_train + [blocks[j][x][1] for x in range(len(blocks[j]))]

Those two lines create the training data from the blocks.


X_test, y_test = [blocks[i][j][0] for j in range(len(blocks[i]))], [blocks[i][j][1] for j in range(len(blocks[i]))]

Same with this line, except it creates the testing data.
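As an aside, zip(*...) can unzip a block of (instance, label) pairs into the same two lists in one step, equivalent to the comprehensions above (the pairs here are made up for illustration):

block = [([5.1, 3.5], 0), ([4.9, 3.0], 0)]
X_test, y_test = map(list, zip(*block))
print(X_test)   # [[5.1, 3.5], [4.9, 3.0]]
print(y_test)   # [0, 0]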
We then recognize the method.fit(X_train, y_train) and method.score(X_test, y_test) calls from the scikit-learn library: the model is trained on the training data, then its score is calculated on the testing data. This is done with every block taking a turn as the testing data, and the best score is returned.
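To see what fit() and score() do in isolation: for scikit-learn classifiers, score() returns the mean accuracy on the given test data. A tiny toy example:

from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=1)
clf.fit([[0], [1], [10], [11]], [0, 0, 1, 1])   # two obvious clusters
print(clf.score([[2], [9]], [0, 1]))            # 1.0 on this toy data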


We finally have a way of measuring the score of a model on some dataset. The next step is to compare multiple models:


    def best_method(self):
        best_method = ''
        best_score = 0
        for method in self.methods:
            score = self.score_method(method)
            if score >= best_score:
                best_method = method
                best_score = score
        return (best_method, best_score)

Note that this function doesn't take any arguments, and that is because the methods were already passed as arguments to the class.
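One note on the scoring choice: textbook k-fold cross-validation usually reports the average of the fold scores rather than the best one. If you prefer that convention, here is a sketch of an alternative method (my own variation, not part of the class below):

    def mean_score_method(self, method):
        """
        textbook variant: average the fold scores instead of keeping the best
        """
        blocks = self.train_test()
        scores = []
        for i in range(len(blocks)):
            # everything except block i is training data
            X_train = [pair[0] for j in range(len(blocks)) if j != i for pair in blocks[j]]
            y_train = [pair[1] for j in range(len(blocks)) if j != i for pair in blocks[j]]
            X_test = [pair[0] for pair in blocks[i]]
            y_test = [pair[1] for pair in blocks[i]]
            method.fit(X_train, y_train)
            scores.append(method.score(X_test, y_test))
        return sum(scores) / len(scores)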


This is how the completed class looks:


import random

class Cross_validation:
    def __init__(self, X, y, methods):
        """
        X : features
        y : target
        methods : list of methods to compare
        methods will be imported from sklearn, fit() and score() will be used
        """
        assert len(X) == len(y), "target and instance mismatch"
        self.nblocks = 4
        self.methods = methods

        self.data = list(zip(X, y))   # pair each instance with its label
        random.shuffle(self.data)     # shuffle so blocks aren't ordered by class

    def train_test(self):
        """
        splits the data into self.nblocks blocks
        """
        assert len(self.data) % self.nblocks == 0, "number of instances must be a multiple of the number of total blocks"
        elements = len(self.data) // self.nblocks
        blocks = []
        for i in range(self.nblocks):
            blocks = blocks + [self.data[i*elements:(i+1)*elements]]
        return blocks

    def score_method(self, method):
        """
        trains and tests the model on different blocks of the same dataset and keeps the best score
        """
        blocks = self.train_test()
        best_score = 0
        for i in range(len(blocks)):
            X_train, y_train = [], []
            for j in range(len(blocks)):
                if i != j:
                    X_train = X_train + [blocks[j][x][0] for x in range(len(blocks[j]))]
                    y_train = y_train + [blocks[j][x][1] for x in range(len(blocks[j]))]
            X_test, y_test = [blocks[i][j][0] for j in range(len(blocks[i]))], [blocks[i][j][1] for j in range(len(blocks[i]))]
            
            method.fit(X_train, y_train)
            score = method.score(X_test, y_test)
            if score >= best_score:
                best_score = score
        return best_score

    def best_method(self):
        best_method = ''
        best_score = 0
        for method in self.methods:
            score = self.score_method(method)
            if score >= best_score:
                best_method = method
                best_score = score
        return (best_method, best_score)

Alright! Finally done. But it might still be confusing, so let's check out an example. First of all, let's prepare our data and models:


from sklearn.datasets import load_iris
from sklearn.linear_model import RidgeClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
m1 = KNeighborsClassifier()
m2 = DecisionTreeClassifier()
m3 = RidgeClassifier()
data = load_iris()
X, y = data.data[:148], data.target[:148]

Note that we only keep the first 148 elements of the dataset because we want the total number of elements to be a multiple of 4 (this dataset initially has 150 instances).
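If you want to avoid hard-coding 148, you can compute the cutoff from the block count instead (reusing the data object loaded above):

nblocks = 4
n = (len(data.data) // nblocks) * nblocks   # 150 -> 148 for iris
X, y = data.data[:n], data.target[:n]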

Now, to evaluate the best method for this specific dataset:


cv = Cross_validation(X, y, [m1, m2, m3])
cv.best_method()

Output: (DecisionTreeClassifier(), 1.0)

We can also evaluate each method independently:


for method in [m1, m2, m3]:
    print(cv.score_method(method))

Output: 1.0
        1.0
        0.8918918918918919

Note that because the data is shuffled (without a fixed seed) when the class is created, the exact scores can vary from run to run.