Calculating Gini impurity in python
7 min read

Calculating Gini impurity in python

Gini impurity is used for creating decision trees, it is a method among others to calculate impurity. If you want to learn more about it check out This video. This article is about the implementation of calculating Gini impurity in python. I'll probably want to write an article in the future about making decision trees from scratch so I thought I'd start by this article as it is probably going to be already pretty long in itself.

The first issue comes up when we take a look at the data : we'll ask ourselves, what kind of data are we working with ? Let's take an example of a possible dataset (that I just made up) :

Sex Favorite color Age Has his/her own house
F Blue 13 no
F Green 28 yes
M Blue 45 yes
F Red 30 no
M Red 15 no
M Blue 23 no
F Green 60 yes
M Red 39 yes
M Green 3 no
F Red 18 no

You can download this data as a csv file below
In this dataset, we'll try to predict if a person owns a house based on his/her sex, favorite color and age. We'll notice that the data type for each of the different columns is different : the sex can be considered as a boolean (Q: Is a male ? A: True or False), the favorite color is a choice (Q: What is your favorite color ? A: Red, Blue or Green) and finally the age is a numerical value. To calculate the Gini impurity, we'll have to work our way around the last 2 types of data and try to consider them as booleans. But before anything else let's try to calculate Gini impurity for our boolean data, which is the most straightforward out of all 3 data types.

Beautiful decision tree drawn in paint 

If we want to calculate the Gini impurity of this characteristic, this are the calculations we'd go about :
Gini impurity in left leaf = 1 - (2/5)^2 - (3/5)^2 = 0.48
Gini impurity in right leaf = 1 - (2/5)^2 - (3/5)^2 = 0.48
Total Gini impurity = 0.48*(5/10) + 0.48*(5/10) = 0.48

This is the python code that takes data as argument (X and y as lists, for example X represents the sex and y represents the ownership of a home) and returns the calculated Gini impurity :


def gini_impurity_bool(X,y):
    """
    Calculates gini impurity for a column X of data, y is the target value
    this function can only compute gini impurity for boolean data (example : person is a male (yes or no)) 
    """
    leaf_1, leaf_2 = list(), list() 
    compare_value = X[0]
    compare_value_y = y[0]

    for i, val in enumerate(X):
        if val == compare_value:
            leaf_1.append(y[i])
        else:
            leaf_2.append(y[i])

    #This is used to calculate the impurity later on
    leaf_1_0 = len([x for x in leaf_1 if x==compare_value_y])
    leaf_1_1 = len(leaf_1) - leaf_1_0

    #same here
    leaf_2_0 = len([x for x in leaf_2 if x==compare_value_y])
    leaf_2_1 = len (leaf_2) - leaf_2_0


    #now to calculate impurity
    leaf_1_impurity = 1 - (leaf_1_0/len(leaf_1))**2 - (leaf_1_1/len(leaf_1))**2
    leaf_2_impurity = 1 - (leaf_2_0/len(leaf_2))**2 - (leaf_2_1/len(leaf_2))**2

    total_impurity = leaf_1_impurity*(len(leaf_1)/len(y)) + leaf_2_impurity*(len(leaf_2)/len(y))

    return total_impurity

Now if we want to test this function :


import pandas as pd
data = pd.read_csv('test_data.csv')
print(gini_impurity_bool(data.sex, data.house))
Output : 0.48

We can see that the Gini impurity is the same as the one calculated before, so the function seems like it's working !

Alright, now let's move on to the second type of data, the multiple choice (here Red, Green or Blue). My idea to view this data as a boolean is to calculate the Gini impurity for each of the possible choice and keep the lowest impurity. For example we'll start with Red, and calculate the Gini impurity related to the question, is the person's favorite color red ? (we can now answer with True/False) And then do the same for Green and Blue.
Here's what it would look like with red :

another beautifully drawn image on paint

And if we wanted to calculate Gini impurity this are the calculations we would do :
Gini impurity in left leaf = 1 - (1/4)^2 - (3/4)^2 = 0.375
Gini impurity in right leaf = 1 - (3/6)^2 - (3/6)^2 = 0.5
Total Gini impurity = 0.375*(4/10) + 0.5*(6/10) = 0.45

Now without going in the details of the calculation for Blue and Green (you can check yourself !), these are the impurities we find :
Blue Gini impurity = 0.4762
Green Gini impurity = 0.4191
So here we'd keep Green because it has the lowest Gini impurity (i.e. in a decision tree, we'd find the question : "is this person's favorite color green ?").
Now the idea is to make a function which would take arguments (same as before) as lists and return the gini impurity of the lowest choice and to which choice it is linked.
This is the code I came up with :


def gini_impurity_choice(X,y):
    """
    Calculates gini impurity for a column X of data, y is the target value
    this function can only compute gini impurity for multi-choice data (example : favorite color is Green, Blue, Red, Yellow, etc...)
    """
    choices = set(X)
    best_impurity = (0.5, None)

    for choice in choices:
        leaf_1, leaf_2 = list(), list()
        compare_value_y = y[0]

        for i, val in enumerate(X):
    	    if val == choice:
    	        leaf_1.append(y[i])
    	    else:
    	        leaf_2.append(y[i])
    			
    	#This is used to calculate the impurity later on
        leaf_1_0 = len([x for x in leaf_1 if x==compare_value_y])
        leaf_1_1 = len(leaf_1) - leaf_1_0

        #same here
        leaf_2_0 = len([x for x in leaf_2 if x==compare_value_y])
        leaf_2_1 = len (leaf_2) - leaf_2_0
        
        #now to calculate impurity
        leaf_1_impurity = 1 - (leaf_1_0/len(leaf_1))**2 - (leaf_1_1/len(leaf_1))**2
        leaf_2_impurity = 1 - (leaf_2_0/len(leaf_2))**2 - (leaf_2_1/len(leaf_2))**2

        total_impurity = leaf_1_impurity*(len(leaf_1)/len(y)) + leaf_2_impurity*(len(leaf_2)/len(y))

        if total_impurity <= best_impurity[0]:
            best_impurity = (total_impurity, choice)
   		
    return best_impurity

and now to test it:


import pandas as pd
data = pd.read_csv('test_data.csv')
gini_impurity_choice(data.color, data.house)
Output : (0.419047619047619, 'Green')

Which is coherent !
Notice we can also use this function with the boolean data :


gini_impurity_choice(data.sex, data.house)
Output : (0.48, 'M')

Finally, we have to take care of the numerical data. The way to go about doing that is to sort all the numerical values in ascending order, then to calculate the average for each 2 adjacent values, then calculate the gini impurity for each value (for example we'll have at some point in our data, a 28yo and a 30yo female, so we'll calculate the average : 29 years old and we'll use a the question : is the person younger than 29 ? and calculate the impurity accordingly, and then we'll do the same for all the other values).
To jump straight into it here is the code :


def gini_impurity_numerical(X,y):
    """
    Calculates gini impurity for a column X of data, y is the target value
    this function can only compute gini impurity for numerical data (example : age stored as a numerical value)
    """
    sorted_data = sorted(X)
    compare_value_y = y[0]
    best_impurity = (0.5, None)
    
    for i in range(len(X)-1):
        average = (sorted_data[i]+sorted_data[i+1])/2
        leaf_1, leaf_2 = list(), list()

        for j, val in enumerate(X):
            if val <= average:
                leaf_1.append(y[j])
            else:
                leaf_2.append(y[j])
 
        #This is used to calculate the impurity later on
        leaf_1_0 = len([x for x in leaf_1 if x==compare_value_y])
        leaf_1_1 = len(leaf_1) - leaf_1_0

        #same here
        leaf_2_0 = len([x for x in leaf_2 if x==compare_value_y])
        leaf_2_1 = len (leaf_2) - leaf_2_0
        
        #now to calculate impurity
        leaf_1_impurity = 1 - (leaf_1_0/len(leaf_1))**2 - (leaf_1_1/len(leaf_1))**2
        leaf_2_impurity = 1 - (leaf_2_0/len(leaf_2))**2 - (leaf_2_1/len(leaf_2))**2

        total_impurity = leaf_1_impurity*(len(leaf_1)/len(y)) + leaf_2_impurity*(len(leaf_2)/len(y))
        
        if total_impurity <= best_impurity[0]:
            best_impurity = (total_impurity, average)
            
    return best_impurity

And to test this function !


import pandas as pd
data = pd.read_csv('test_data.csv')
print(gini_impurity_numerical(data.age, data.house))
Output : (0.15999999999999992, 25.5)

Now, just to be sure let's calculate the Gini impurity related to the question : is the person younger than 25.5 ? and compare to what the computer gave us

am I good at mspaint or what

And now the math :
Gini impurity in left leaf = 1 - (0/5)^2 - (5/5)^2 = 0
Gini impurity in right leaf = 1 - (4/5)^2 - (1/5)^2 = 0.3199
Total Gini impurity = 0.0*(5/10) + 0.3199*(5/10) = 0.1599

Which is coherent with what was given to us by the computer, so everything seems to work !


The last thing left to do is to create a function which calculates the Gini impurity of a parameter no matter its data type. I quickly went over the fact that the function gini_impurity_choice can be used for any kind of non-numerical data, so the idea would be to create a new function which could calculate separately the gini impurity depending on it being numerical or not using the above functions:


def gini_impurity(X,y):
    if type(X[0])==str:
        return gini_impurity_choice(X,y)
    else:
        return gini_impurity_numerical(X,y)

And now let's test this function :


import pandas as pd
data = pd.read_csv('test_data.csv')

gini_impurity(data.sex, data.house)
Output : (0.48, 'M')

gini_impurity(data.color, data.house)
Output : (0.419047619047619, 'Green')

gini_impurity(data.age, data.house)
Output : (0.15999999999999992, 25.5)

We finally have our function ! We can now go on and start to make decision trees ! (another time)
If anyone reached the end of this article, I'm infinitely gratefull and please subscribe to this page, it's free and gives me more motivation to keep writing !