Gradient descent explained

Intro

"Gradient descent" are two words we seem to hear everywhere in the context of machine learning, but what is it actually?

A very big part of ML is finding optimal values: in linear regression, we try to find the best values for the slope and the y-intercept; in neural nets, we try to optimize the value of each weight and bias between each layer; and so on.

Gradient descent is a method that can be used for optimizing such values.

Cost functions

Before explaining exactly what gradient descent is, I'll have to quickly go over cost functions, as they are what makes gradient descent possible.

Simply put, cost functions measure the accuracy of a model. They're not to be confused with loss functions, which measure the distance between a real value and a predicted value (usually noted ŷ) for a single data point: cost functions have to deal with the fact that multiple data points usually come into play (you can try to find the line of best fit for a single point, but I'm not sure you'll go very far).

A cost function is usually a multivariable function with a single numerical output. The higher the output, the worse the result. Therefore, the goal is usually to minimize the cost as much as possible. What makes this function interesting, however, is the variables on which it depends: these variables are the values we'll want to optimize to make our model learn, and the way to optimize them is to modify them in a way that brings the value of the cost function down towards zero (to be clear, we'll never actually reach zero, but we want to get as low as possible).

Some examples of popular cost functions:

mean squared error
mean absolute error
root mean squared error
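
For reference, the first of these, the mean squared error (MSE), is the one used in the example below; its standard form, with $y_i$ the true values, $\hat{y}_i$ the predictions, and $n$ the number of data points, is:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2$$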

Simple example

Using the gradient descent method, we're able to optimize a line to fit a set of points; here is how the value of the cost evolves as the model gets more accurate:

The cost function used here is the Mean Squared Error (MSE) function; see above for the formula.

Theory

Okay, so far I've talked about cost functions but not about gradient descent, so let's actually get into it:
Gradient descent is a method for minimizing a function by modifying the variables on which it depends so that we move in the direction of a local minimum. The further we are from the minimum, the bigger the step we take; the closer we are, the smaller the step.

Approaching the local minimum

The first step is to understand how modifying the variables (and remember, the variables are also the values we want to optimize) will impact the cost function. To do that, we'll require some calculus. For each variable we want to optimize, we'll have to calculate the partial derivative of the cost function with respect to that variable:
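
$$\frac{\partial C}{\partial \theta_j}$$

(writing the cost as $C$ and a generic variable to optimize as $\theta_j$; this notation is introduced here just to keep things general).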

For instance, if we choose the MSE cost function to optimize a linear model ŷ=ax+b, to find the optimal values for a and b, we could rewrite the cost function as:
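
$$\mathrm{MSE}(a, b) = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - (a x_i + b)\bigr)^2$$

where $(x_i, y_i)$ are the $n$ data points (this is just the standard MSE with $\hat{y}_i = a x_i + b$ substituted in).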

Then, we calculate both partial derivatives.
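
Working from the MSE expression above, they come out to:

$$\frac{\partial\,\mathrm{MSE}}{\partial a} = -\frac{2}{n}\sum_{i=1}^{n} x_i\bigl(y_i - (a x_i + b)\bigr), \qquad \frac{\partial\,\mathrm{MSE}}{\partial b} = -\frac{2}{n}\sum_{i=1}^{n}\bigl(y_i - (a x_i + b)\bigr)$$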

Once that's done, we'll choose a learning rate. This is done so we don't take steps that are too big (we usually choose LR = 0.1, 0.01, 0.001, etc., but the value can be manually tweaked later).

Once this value is chosen, we can finally apply the method of gradient descent, which is to modify the variables over a number of epochs following this formula:
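
$$\theta \leftarrow \theta - LR \cdot \frac{\partial C}{\partial \theta}$$

Here $\theta$ stands for any variable being optimized (so $a$ or $b$ in the linear example) and $LR$ is the learning rate; this is the standard gradient descent update written in generic notation.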

Over the epochs, as the variables get modified, the value of the cost function will decrease and approach zero as much as it can.

To recap the steps of gradient descent (a short code sketch follows the list):

  1. Find the variables which need to be optimized
  2. Choose a cost function and integrate all the variables that need to be optimized into this cost function
  3. Find the partial derivatives of this cost function with respect to all those variables
  4. Choose a learning rate
  5. Apply the gradient descent formula over a chosen number of epochs
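
To make the recipe concrete, here is a minimal sketch of those five steps in Python with NumPy (not from the original article; the toy data, learning rate, and epoch count are made-up values for illustration), fitting the line ŷ = ax + b with the MSE cost:

```python
import numpy as np

# Toy data roughly following y = 2x + 1 (made up for this example)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2 * x + 1 + rng.normal(scale=0.1, size=100)

# Steps 1-2: the variables to optimize (a, b) and the MSE cost function
def mse(a, b):
    return np.mean((y - (a * x + b)) ** 2)

# Step 3: partial derivatives of the MSE with respect to a and b
def gradients(a, b):
    error = y - (a * x + b)
    d_a = -2 * np.mean(x * error)
    d_b = -2 * np.mean(error)
    return d_a, d_b

# Step 4: pick a learning rate
lr = 0.1

# Step 5: apply the gradient descent update over a number of epochs
a, b = 0.0, 0.0
for epoch in range(200):
    d_a, d_b = gradients(a, b)
    a -= lr * d_a
    b -= lr * d_b

print(f"a = {a:.3f}, b = {b:.3f}, cost = {mse(a, b):.5f}")
```

After the loop, a and b should land close to the values used to generate the toy data (2 and 1 here), with the cost settling near the noise level.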
Other uses

The previous example showed the gradient descent method in a linear regression application (which is probably the most popular example used), but it can be used in pretty much any situation as long as there's a local minimum to find. It doesn't even have to be one-dimensional: it can be used for multiple linear regression to tweak each of the coefficients. It can be used in polynomial regression, or even in other ML models, such as neural nets.

Final words

Thank you for reading this until the end. If you want to stay up to date with the weekly articles about ML, subscribe to the newsletter.