What is overfitting?
3 min read


Intro

This is going to be a very short article explaining what overfitting is and why it is important.

When we first start learning about machine learning models, we all assume we can improve a model's accuracy simply by training it more. In reality, after the model goes through a certain number of training epochs, its accuracy doesn't just plateau at its maximum value: it actually goes back down. The culprit behind this phenomenon is overfitting.

Basic explanation

Overfitting is when a machine learning model's parameters become too closely tuned to the training dataset: the model gets very good at predicting results for the data it was trained on, at the expense of its ability to predict results for new data. This is not what we want from a machine learning model, since the whole point is usually to process new data.

Why does overfitting happen, and how can we avoid it?

As explained previously, we might think that training our model for more epochs will make its predictions more accurate, but past a certain point the model becomes too accustomed to the training dataset and tunes its parameters specifically to maximize its accuracy on that dataset.

Another reason overfitting might happen is a model that isn't well suited for generalization (for example, a decision tree with too many branches).

This is why we split our dataset into two parts: a training dataset and a testing dataset. We train our model on the training dataset, then measure its accuracy on the testing dataset, which contains data the model has never seen.
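Here is a minimal sketch of that workflow with scikit-learn (the logistic regression is just a placeholder model; any estimator would work the same way):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data; in practice X and y come from your own dataset
X, y = make_classification(n_samples=1000, n_features=10)

# Hold out half of the samples for testing, as in the examples below
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)  # the model only ever sees the training half

print("training accuracy:", model.score(X_train, y_train))
print("testing accuracy:", model.score(X_test, y_test))  # accuracy on unseen data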

Example

In the following code, we train a neural network multiple times, each time with a different number of training epochs. For each run, we measure the model's accuracy on both the training and the testing data, and finally we plot those results as a function of the number of training epochs.

from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Synthetic binary classification dataset
X, y = make_classification(
    n_samples=10000,
    n_features=10,
    n_informative=6,
    n_classes=2
)

# Average each measurement over several runs to smooth out noise
avg_out = 10
EPOCHS = [i * 2 for i in range(1, 100)]  # epoch counts to try: 2, 4, ..., 198
TRAINING = list()  # mean accuracy on the training data, per epoch count
TESTING = list()   # mean accuracy on the testing data, per epoch count

for epoch in EPOCHS:
    total_TRAIN = 0
    total_TEST = 0
    for _ in range(avg_out):
        # tol=0 effectively disables early stopping, so the network trains
        # for the requested number of epochs (expect convergence warnings)
        model = MLPClassifier(hidden_layer_sizes=(100, 100), tol=0, max_iter=epoch)
        # fresh split on every run so the averages aren't tied to one split
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

        model.fit(X_train, y_train)

        total_TRAIN += model.score(X_train, y_train)
        total_TEST += model.score(X_test, y_test)

    TRAINING.append(total_TRAIN / avg_out)
    TESTING.append(total_TEST / avg_out)
    
plt.xlabel('training epochs')
plt.ylabel('model accuracy')
plt.scatter(EPOCHS, TRAINING, marker="+", label="training data")
plt.scatter(EPOCHS, TESTING, marker="o", label="testing data")

plt.legend()
plt.show()

We can see that from about 50 training epochs onwards, the model's accuracy on the testing data slowly decreases, while its accuracy on the training data keeps climbing towards 1.
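One way to counter this in scikit-learn is MLPClassifier's early_stopping option: it sets aside a fraction of the training data as an internal validation set and stops training once the validation score stops improving, rather than running a fixed number of epochs. A minimal sketch:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=10000, n_features=10, n_informative=6, n_classes=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

# Stop when the validation score hasn't improved for n_iter_no_change epochs
model = MLPClassifier(
    hidden_layer_sizes=(100, 100),
    early_stopping=True,       # hold out part of the training data internally
    validation_fraction=0.1,   # 10% of the training data used for validation
    n_iter_no_change=10,
    max_iter=500,
)
model.fit(X_train, y_train)

print("stopped after", model.n_iter_, "epochs")
print("testing accuracy:", model.score(X_test, y_test))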

Here is another way overfitting can happen (inspired by machinelearningmastery.com's post on the same topic; check them out!).

By gradually increasing the depth of a decision tree, the model becomes prone to over-optimizing for the training data. Here is an example:

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Synthetic binary classification dataset
X, y = make_classification(
    n_samples=10000,
    n_features=10,
    n_informative=6,
    n_classes=2
)

# Average each measurement over several runs to smooth out noise
avg_out = 10
DEPTH = list(range(1, 30))  # tree depths to try: 1 through 29
TRAINING = list()  # mean accuracy on the training data, per depth
TESTING = list()   # mean accuracy on the testing data, per depth

for deep in DEPTH:
    total_TRAIN = 0
    total_TEST = 0
    for _ in range(avg_out):
        # the deeper the tree, the more closely it can fit the training data
        model = DecisionTreeClassifier(max_depth=deep)
        # fresh split on every run so the averages aren't tied to one split
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

        model.fit(X_train, y_train)

        total_TRAIN += model.score(X_train, y_train)
        total_TEST += model.score(X_test, y_test)

    TRAINING.append(total_TRAIN / avg_out)
    TESTING.append(total_TEST / avg_out)
    
plt.xlabel('tree depth')
plt.ylabel('model accuracy')
plt.scatter(DEPTH, TRAINING, marker="+", label="training data")
plt.scatter(DEPTH, TESTING, marker="o", label="testing data")

plt.legend()
plt.show()
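Rather than eyeballing the plot, we can also let cross-validation pick the depth: for each candidate depth, measure the average held-out accuracy and keep the best one. A minimal sketch (the depth range here is just an assumption matching the plot above):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=10000, n_features=10, n_informative=6, n_classes=2)

# Score each candidate depth on held-out folds and keep the best one
best_depth, best_score = None, 0.0
for depth in range(1, 30):
    score = cross_val_score(DecisionTreeClassifier(max_depth=depth), X, y, cv=5).mean()
    if score > best_score:
        best_depth, best_score = depth, score

print("best depth:", best_depth, "with accuracy:", best_score)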


Thanks to our readers!