Detecting breast cancer using AI
5 min read


Intro

If you take a walk in your city today or look at the software on your electronic devices, you may not realize it, but AI has made its way into almost every aspect of our lives: from the intelligent systems built into cars to the infamous YouTube algorithm that seems to know exactly which video you want to watch next. Another field where it is used, and the one we'll talk about today, is medicine. The pharmaceutical industry, for example, uses supercomputers to estimate the physical properties of drug candidates (melting point, pKa, solubility, etc.).
In this article, we'll look at how we can predict whether a screened tumor is benign or malignant.

Disclaimer: This article does not focus on the medical side of things (I'm not a doctor!) but on the machine-learning side: choosing an appropriate ML model, splitting the data into training and testing sets in an optimal way, and so on.

Context

After feeling a lump in your breast and being concerned, you decide to seek your doctor's advice. After some tests, he tells you that it is indeed a tumor, but he can't yet say whether it is malignant (i.e. cancerous) or benign (i.e. non-cancerous). He'll need to analyze it further to give you an answer.
For that analysis, some measurements of the tumor will be taken: its radius, texture, area, and so on.

Let's now play the scientists in charge of analyzing the tumor: our job is to take the data the doctor gives us and answer a single question: is the tumor benign or malignant?

Dataset

Our goal is to build a machine-learning model that can predict the nature of a tumor with sufficient accuracy (what counts as sufficient is hard to pin down, so let's simply aim as high as possible).

The dataset we'll use is the Breast Cancer Wisconsin (Diagnostic) dataset, which ships with scikit-learn.
It has 569 instances, each with 30 features: 10 base measurements (radius, texture, area, and so on), each reported as a mean, a standard error, and a 'worst' value. Each instance is labeled as either 'malignant' or 'benign'.
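
Before doing any modeling, here is a minimal sketch of how to load the dataset and check its shape, feature names, and classes, using scikit-learn's load_breast_cancer and the attributes of the Bunch object it returns:

from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

print(data.data.shape)     # (569, 30): 569 instances, 30 features each
print(data.feature_names)  # 'mean radius', 'radius error', 'worst radius', ...
print(data.target_names)   # ['malignant' 'benign']
print(data.target[:5])     # labels are encoded as 0 = malignant, 1 = benign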

What accuracy can we reach?

The idea is to find a model and train it so that we get the highest accuracy possible: we want to be confident in our answer about the nature of the tumor.
The second step will be finding a training/testing split that avoids overfitting the model.

To have some variety, these are the models we'll compare:

  • Ridge classification (linear model)
  • Naive Bayes classifier (probabilistic model)
  • Multi-layered perceptron classifier (non-linear model)

Let's go for a simple test run:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import RidgeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import MultinomialNB

data = load_breast_cancer()

# hold out 25% of the dataset for testing
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.25)

model_linear = RidgeClassifier()        # linear model
model_nonlinear = MLPClassifier()       # non-linear model
model_probabilistic = MultinomialNB()   # probabilistic model

models = [model_linear, model_nonlinear, model_probabilistic]

# train each model and print its accuracy on the test set
for model in models:
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    
    print(model, score)

By running this code multiple times, we'll see that we get different results each time:

RidgeClassifier() 0.916083916083916
MLPClassifier() 0.8251748251748252
MultinomialNB() 0.8531468531468531
RidgeClassifier() 0.9790209790209791
MLPClassifier() 0.9440559440559441
MultinomialNB() 0.9300699300699301
RidgeClassifier() 0.965034965034965
MLPClassifier() 0.951048951048951
MultinomialNB() 0.9020979020979021
RidgeClassifier() 0.9440559440559441
MLPClassifier() 0.8951048951048951
MultinomialNB() 0.8601398601398601
RidgeClassifier() 0.958041958041958
MLPClassifier() 0.9300699300699301
MultinomialNB() 0.9090909090909091

This happens because of the train_test_split() function. It splits the dataset into two parts, training data and testing data, but before splitting it shuffles the data, and that shuffling is where the randomness in the results comes from.
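
If we wanted reproducible splits instead, we could pass a fixed random_state to train_test_split (the seed value 42 below is arbitrary):

# fixing the seed makes the shuffle, and therefore the split, reproducible
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.25, random_state=42)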

All things considered, we can still tell the best-performing model on this dataset from the worst one: the Ridge classifier consistently reaches an accuracy of about 95%, while the multi-layer perceptron classifier and the multinomial naive Bayes model both average around 90%.

To be more confident in our choice, we could use cross-validation (see the scikit-learn documentation for details). But for simplicity's sake, we'll just continue with the ridge classifier.
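
As a rough sketch of what that would look like with scikit-learn's cross_val_score (5 folds here is simply the default, not a tuned choice):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import RidgeClassifier

data = load_breast_cancer()

# evaluate the ridge classifier on 5 different train/test folds instead of a single split
scores = cross_val_score(RidgeClassifier(), data.data, data.target, cv=5)
print(scores)         # one accuracy per fold
print(scores.mean())  # average accuracy across folds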

The next step is finding the optimal training/testing ratio. To smooth out the random fluctuations caused by drawing random splits from our dataset, we'll create multiple randomly selected training/testing splits for each ratio and average the resulting accuracies. To make this concrete: if we want the accuracy of a model trained on 75% of the dataset (and therefore tested on the remaining 25%), this is how we would do it:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import RidgeClassifier

n_tests = 10
training_size = 0.75

data = load_breast_cancer()

datasets = [train_test_split(data.data, data.target, train_size=training_size) for _ in range(n_tests)]

total = 0
for d in datasets:
    model = RidgeClassifier()
    model.fit(d[0], d[2]) #d[0] = X_train, d[2] = y_train
    score = model.score(d[1], d[3]) #d[1] = X_test, d[3] = y_test
    print(score)
    
    total += score
    
print("avg: ", total/n_tests) #this is the value we'd keep

Output:

0.951048951048951
0.9440559440559441
0.951048951048951
0.958041958041958
0.951048951048951
0.972027972027972
0.965034965034965
0.9440559440559441
0.958041958041958
0.951048951048951
avg:  0.9545454545454545

The last value (the average) is the one we'd keep and plot as a function of training size.

Now, let's see what that looks like across a range of training sizes:

Note that we increased the number of tests to 1000 per training size to get very consistent results.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import RidgeClassifier
import matplotlib.pyplot as plt

n_tests = 1000
training_size = [round(0.05*n, 2) for n in range(1, 20)]  # 0.05, 0.10, ..., 0.95

data = load_breast_cancer()

RESULTS = list()

for size in training_size:
    datasets = [train_test_split(data.data, data.target, train_size=size) for _ in range(n_tests)]
    total = 0
    for d in datasets:
        model = RidgeClassifier()
        model.fit(d[0], d[2]) #d[0] = X_train, d[2] = y_train
        score = model.score(d[1], d[3]) #d[1] = X_test, d[3] = y_test
        total += score
    
    RESULTS.append(total/n_tests)
    print(size, "done")
    
plt.scatter(training_size, RESULTS, marker="+")
plt.xlabel('training size')
plt.ylabel('model accuracy')
plt.show()

Result:

Notice that at the beginning of the graph, the accuracy increases roughly logarithmically up to about a 70% training size, dips slightly, and then reaches its maximum at an 80% training size. After that, it decreases again.

And now we know that if we want to give the most accurate answer possible, we can use the Ridge model with 80% of this specific dataset used for training.
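
Putting it all together, a minimal sketch of that final setup would look like this (the 0.8 training size comes from the graph above; the exact accuracy will still vary slightly from run to run):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import RidgeClassifier

data = load_breast_cancer()

# 80% of the data for training, the remaining 20% for testing
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, train_size=0.8)

model = RidgeClassifier()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the held-out 20%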

Conclusion

Although we were lucky this time and didn't have to tweak much, since we already had a very decent accuracy, it's always good practice to try to optimize each parameter.
Next time, we'll show how poor choices of model and parameters can hurt a model's accuracy. To stay updated, subscribe to our newsletter.