1. Linear Regression and the Supervised Learning Process
To understand supervised learning, it helps to start with its simplest and most widely used model: **linear regression**.
Simple Linear Regression
At a high level, linear regression means learning a straight-line relationship between variables in data. Despite its simplicity, it remains one of the most important and commonly applied learning algorithms in machine learning today.
What makes linear regression especially valuable is that the ideas behind it — learning from labeled data, defining a model, measuring error, and improving performance — reappear in many more advanced algorithms that follow.
Labeled Data
A Practical Example: Predicting House Prices
Consider a classic and intuitive problem: predicting the price of a house based on its size.
Suppose we are given a dataset containing information about houses sold in Ismayıllı, a city in Azerbaijan. For each house, we know:
- its size, measured in square feet
- the price it was sold for, measured in thousands of dollars
If we visualize this dataset:
- the horizontal axis represents house size
- the vertical axis represents house price
Each point on the graph corresponds to a single house sale. These individual points allow us to observe how prices tend to change as house size increases.
From Data to Decision-Making
Now imagine you are a real estate agent helping a client sell her home. She asks a natural and important question:
How much do you think this house can sell for?
You measure the house and find that it is 1,100 square feet. However, the house has not yet been sold, so there is no known price.
This is precisely where historical data becomes useful. By learning from past house sales, we can estimate a reasonable price for a new house of similar size.
A linear regression model attempts to summarize the dataset by fitting a straight line through the data points. This line represents the model’s best understanding of how size and price are related.
Once the line is fitted:
- locate the input value (1,100 sq ft) on the horizontal axis
- move vertically until you reach the fitted line
- project horizontally to the vertical axis
The resulting value is the model’s predicted price. For example, the model might estimate a price of around $215,000.
This estimate is not guaranteed to be correct — the house has not been sold yet — but it is an informed prediction grounded in observed data.
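To make the read-off concrete, here is a tiny Python sketch. The slope and intercept values below are made up purely for illustration (they are not fitted to any real dataset); they are simply chosen so the line gives roughly the $215,000 estimate mentioned above.
# Hypothetical fitted line: price (in $1000s) = k * size_in_sqft + b
# k and b are illustrative placeholders, not learned from real data
k, b = 0.17, 28.0

size_in_sqft = 1100
predicted_price = k * size_in_sqft + b   # "move up to the line and read off the value"
print(f"Predicted price: about ${predicted_price:.0f} thousand")   # ~ $215 thousand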
Why This Is Called Supervised Learning
This approach is known as supervised learning because the model is trained on examples that include the correct answers.
For every house in the dataset:
- the input (house size) is known
- the output (house price) is also known
These outputs act as guidance, allowing the algorithm to learn how inputs and outputs relate. Once training is complete, the model can make predictions for new inputs where the true output is unknown.
Besides visualizing the data as a scatter plot, it is often helpful to think of it as a table.
Each row in the table represents one training example, and each column corresponds to an input feature or an output target.
For example:
- one column for house size
- one column for house price
If the table contains 50 rows, then the dataset consists of 50 training examples — each represented as a point on the graph.
Standard Notation in Machine Learning
Now, let's look at some standard notation for describing the data. This notation will be very useful throughout your journey in Machine Learning.
- x — This is the standard notation to denote an input in Machine Learning. We call this the input variable. It is also called an input feature. In this example, x is the size of the house.
- y — This is the standard notation to denote the output variable which you’re trying to predict, which is also sometimes called the target variable. Here, y is the price of the house.
- m — The dataset has one row for each house and in this training set, there are 50 rows with each row representing a different training example. We’re going to use lowercase m to represent the total number of training examples.
- (x,y) — This notation is used to indicate a single training example.
- (x^(i),y^(i)) — Now, we have 50 different training examples. To refer to a specific training example, we use this notation. The superscript tells us that this is the i-th training example, where i refers to a specific row in the table. It is important to note that this superscript in parentheses is not exponentiation; it simply indexes a particular training example.
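To connect this notation to code, here is a minimal sketch (the house sizes and prices below are placeholder numbers) showing how a training set can be stored as NumPy arrays, with (x^(i), y^(i)) corresponding to the i-th row. Note that the notation counts from 1 while Python indexes from 0.
import numpy as np

# Placeholder training set: sizes in square feet, prices in $1000s
x = np.array([2104, 1416, 1534, 852])   # input features x
y = np.array([400, 232, 315, 178])      # output targets y

m = len(x)                      # m: number of training examples
i = 2                           # the i-th training example in the 1-based notation
x_i, y_i = x[i - 1], y[i - 1]   # (x^(i), y^(i)) -> 0-based indexing in Python
print(m, x_i, y_i)              # 4 1416 232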
A supervised learning algorithm starts with a training set. This dataset always contains two parts:
- Input features — for example, the size of a house
- Output targets — for example, the price of the house
The output targets are the correct answers the model learns from. These labeled examples guide the learning process.
When we train a supervised learning model, we pass both inputs and targets to the algorithm. The algorithm’s job is to discover a relationship between them. As a result of training, the algorithm produces a function, usually denoted as f.
The function f is the core of the learning process — it is the model.
- x → input feature
- f(x) → model function
- ŷ (y-hat) → predicted output
In machine learning, the symbol ŷ represents the model's estimate of the true target y. The actual value y is known only for training data. For new, unseen inputs, the model can only predict.
For example, if a house has not yet been sold, its true price is unknown. The model estimates the price based on patterns learned from previously sold houses.
Representing the Model: Linear Regression
A key design decision in machine learning is how to represent the function f mathematically.
In linear regression, we assume the relationship between input and output can be approximated by a straight line:
f(x) = kx + b
Where:
- k is the slope (how strongly x affects y)
- b is the intercept (the baseline output when x = 0)
Changing the values of k and b changes the position and orientation of the line, and therefore the predictions.
This model is known as linear regression with one variable, also called univariate linear regression.
Our objective is to find values of k and b such that the line:
- passes through or close to the training data points
- captures the underlying trend
- generalizes to new data
However, these parameters are initially unknown. So the key question becomes:
How do we measure whether a particular line is good or bad?
This leads us to the Cost Function.
2. The Cost Function: Measuring Model Quality
To implement linear regression effectively, the first essential component is the cost function. The purpose of the cost function is to quantify how well a given model fits the training data. Without such a measure, there would be no systematic way to improve the model’s performance.
We will introduce the cost function step by step, beginning with its mathematical formulation and then discussing its intuition.
Mathematical Formulation
Recall that our training set consists of:
- input features x
- corresponding output targets y
The hypothesis used to model the relationship between inputs and outputs in linear regression is the linear function:
f(x) = kx + b
The quantities k and b are known as the parameters of the model. In machine learning, parameters are variables whose values are adjusted during training in order to reduce model error. They are also commonly referred to as weights or coefficients.
Role of the Parameters k and b
Different choices of k and b lead to different functions f(x), and therefore to different lines on a graph.
Consider the following cases:
- If k=0 and b=1.5, then the function outputs a constant value for all inputs. The resulting line is horizontal and predicts the same output regardless of x.
- If k=0.5 and b=0, the function produces values proportional to the input. For example:
- f(0)=0
- f(1)=0.5
- f(2)=1
- In this case, the slope of the line is determined entirely by k, demonstrating that k controls the rate at which predictions change with respect to the input.
- If k=0.5 and b=1, the slope remains the same, but the line is shifted upward. The intercept b determines where the line crosses the vertical axis, while k still defines its steepness.
These examples illustrate that:
- k controls the slope
- b controls the vertical offset
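The short sketch below simply evaluates f(x) = kx + b for the three parameter choices discussed above, reproducing the same behavior numerically.
import numpy as np

def f(x, k, b):
    """Univariate linear model: f(x) = k*x + b."""
    return k * x + b

x = np.array([0.0, 1.0, 2.0])

print(f(x, k=0.0, b=1.5))   # [1.5 1.5 1.5] -> horizontal line, constant output
print(f(x, k=0.5, b=0.0))   # [0.  0.5 1. ] -> slope 0.5, passes through the origin
print(f(x, k=0.5, b=1.0))   # [1.  1.5 2. ] -> same slope, shifted up by 1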
Our training set can be represented as shown in the image below.
In linear regression, the objective is to select values of k and b such that the resulting straight line fits the training data as closely as possible. A good fit means that the line passes through — or at least remains close to — the observed training examples, relative to alternative parameter choices that produce poorer approximations.
This leads to a fundamental question:
How can we determine which values of k and b produce predictions that are close to the true targets?
To answer this, we require a quantitative measure of model error.
Defining the Cost Function
The cost function measures the discrepancy between the model’s predictions and the actual target values.
For a given training example i, the prediction error is defined as the difference between:
- the predicted value ŷ^(i)
- the true target y^(i)
To ensure that positive and negative errors do not cancel each other out, this difference is squared. The squared error is then computed for every training example in the dataset.
To measure overall model performance, the squared errors across all training examples are summed.
Averaging the Error
If the number of training examples is denoted by m, then summing squared errors alone would cause the cost to increase automatically as m increases. To prevent this, the average squared error is used instead of the total error.
By convention, the cost function for linear regression is defined as:
J(k, b) = (1 / (2m)) · Σ_{i=1}^{m} ( ŷ^(i) − y^(i) )²
where ŷ^(i) = k·x^(i) + b is the model's prediction for the i-th training example.
The division by 2m serves two purposes:
- it normalizes the cost with respect to the number of training examples
- it simplifies derivative expressions in subsequent optimization steps
Importantly, the factor 1/2 does not affect the location of the minimum of J(k,b) but simplifies the mathematics involved in gradient-based optimization.
This expression is the Cost Function.
This is also called the Squared Error Cost Function because we are taking the square of these error terms. In Machine Learning, different cost functions are used for different applications, but the Squared Error Cost Function is by far the most commonly used for linear regression, and for regression problems in general, because it gives good results in many applications.
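As a minimal sketch (written directly from the formula above, not taken from any library), the squared error cost function can be implemented in a few lines of NumPy:
import numpy as np

def compute_cost(x, y, k, b):
    """Squared error cost: J(k, b) = (1/(2m)) * sum((k*x + b - y)^2)."""
    m = len(x)
    y_hat = k * x + b                        # predictions for all training examples
    return np.sum((y_hat - y) ** 2) / (2 * m)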
Building the Intuition Behind the Cost Function
We have seen how the cost function can be defined mathematically. The next step is to understand what the cost function represents intuitively and how it behaves as model parameters change.
In linear regression, the objective is to find values for the parameters k and b such that the cost function J(k,b) is as small as possible. In other words, training a linear regression model is equivalent to solving an optimization problem:
minimize J(k, b) over all choices of k and b
Simplifying the Model for Visualization
To better visualize how the cost function behaves, it is helpful to temporarily simplify the linear regression model.
Consider the reduced form:
f(x) = kx
which is equivalent to setting the intercept parameter b = 0 in the original model. With this simplification, the model has only one parameter, namely k.
Under this assumption, the cost function becomes a function of a single variable:
J(k) = (1 / (2m)) · Σ_{i=1}^{m} ( k·x^(i) − y^(i) )²
The learning problem now reduces to finding the value of k that minimizes J(k).
Now, using this simplified model, let's see how the cost function changes as you choose different values for the parameter k.
In particular, let's look at graphs of the model f(x) and the cost function J(k). We are going to plot these side by side so that we can closely observe how the two are related.
To understand this relationship, it is useful to examine the model function and the cost function side by side.
Suppose our training set consists of three data points:
(1,1),(2,2),(3,3)
On one plot, we visualize the model function f(x) = kx. On a separate plot, we visualize the cost function J(k) as k varies.
Example: k=1
With k=1, the fitted line passes exactly through all three training points, so every prediction matches its target and the cost is J(1)=0.
Example: k=0.5
Now consider a different choice, k=0.5.
With this value, the model predicts values that are consistently lower than the true targets. The vertical distance between each predicted value and its corresponding true value represents the prediction error.
Computing the squared errors and averaging them across the dataset results in a nonzero cost value. In this case:
J(0.5)≈0.58
This value appears higher on the cost function graph, reflecting the poorer fit of the model.
Example: k=0
If we choose k=0, the model predicts zero for all inputs. This produces even larger errors relative to the true target values.
The resulting cost increases further:
J(0)≈2.3
This reflects the fact that the model’s predictions deviate significantly from the training data.
Since k can take any real value, it is also possible for it to be negative.
For example, if k=−0.5, the model defines a downward-sloping line. In this case, not only are the predictions inaccurate, but they also move in the opposite direction of the data trend.
As expected, the cost function for such a parameter choice is even larger, with values on the order of:
J(−0.5)≈5.25
By computing J(k) for a range of values of k and plotting the results, a clear pattern emerges.
The cost function forms a smooth, convex curve with a single global minimum. In this example, the minimum occurs at:
k = 1, J(k) = 0
This observation is crucial. It tells us that minimizing the cost function corresponds directly to selecting the parameter value that best fits the data.
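These values can be reproduced with a few lines of NumPy applied to the three-point dataset and the simplified model f(x) = kx (intercept fixed at 0):
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

def J(k):
    """Cost for the simplified model f(x) = k*x."""
    m = len(x)
    return np.sum((k * x - y) ** 2) / (2 * m)

for k in [1.0, 0.5, 0.0, -0.5]:
    print(f"J({k}) = {J(k):.2f}")   # 0.00, 0.58, 2.33, 5.25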
Since the objective of linear regression is to locate the minimum of the cost function, we now require a systematic method for finding this minimum efficiently.
This leads naturally to Gradient Descent, an optimization algorithm designed to iteratively move parameters in the direction that reduces the cost function.
In the next section, we will examine how Gradient Descent operates and how it is applied to linear regression.
3. Applying Gradient Descent
To build intuition behind Gradient Descent, imagine standing at the top of a hill on a very foggy day. Your destination is the lowest point of the valley below, but because visibility is limited, you cannot see where that point is located. What you can sense, however, is the direction in which the ground slopes downward beneath your feet.
Naturally, you would move downhill by taking larger steps when the slope is steep and smaller steps as the terrain flattens, gradually approaching the lowest point. This mental model closely mirrors how Gradient Descent operates when optimizing a machine learning model.
What Gradient Descent Does
Gradient Descent is one of the most widely used optimization techniques in machine learning. Its primary purpose is to minimize the cost function and, in the context of linear regression, to determine optimal values for the parameters k and b.
The process begins with initial guesses for these parameters. In linear regression, the exact starting values are usually not critical, so it is common to initialize both k and b to zero.
Gradient Descent then proceeds iteratively. At each iteration, the parameters are adjusted slightly in a direction that reduces the value of the cost function J(k,b). This continues until the algorithm converges, meaning the cost no longer decreases significantly and the parameters stabilize near a minimum.
It is important to note that not all cost functions necessarily have a simple, bowl-shaped form. In more complex models, the cost surface may contain multiple local minima. However, for standard linear regression, the cost function is convex, which guarantees a single global minimum.
To understand why Gradient Descent works, we examine its mathematical foundation.
At its core, Gradient Descent relies on derivatives. The derivative of a function describes the slope of the tangent line at a specific point. For a cost function J, the gradient indicates both the direction and rate at which the cost changes.
For a single parameter k, the gradient is expressed as the derivative dJ/dk.
When working with both parameters k and b, each parameter is updated independently but simultaneously using the following update rules:
k := k − α · ∂J/∂k
b := b − α · ∂J/∂b
Here, α (alpha) is known as the learning rate.
The Role of the Learning Rate α
The learning rate determines how large a step the algorithm takes during each update.
- If α is too large, Gradient Descent may overshoot the minimum and even diverge.
- If α is too small, progress toward the minimum becomes slow, potentially requiring many iterations.
In practice, α is chosen as a small positive value, often between 0 and 1, to balance stability and convergence speed.
Developing the Intuition
To deepen our understanding of Gradient Descent, it is helpful to examine a simplified scenario in which the cost function depends on only one parameter. Let us assume that the cost function J is a function of a single variable k.
Under this assumption, the Gradient Descent update rule reduces to:
k := k − α · dJ/dk
In this formulation, the objective is to minimize the cost by iteratively adjusting the parameter k.
Visualizing Gradient Descent on J(k)
Consider a graph where:
- the horizontal axis represents the parameter k
- the vertical axis represents the cost J(k)
We begin by initializing Gradient Descent with some starting value of k. At each iteration, the parameter is updated using the rule:
k := k − α · dJ/dk
To understand what this update is doing, we must interpret the derivative term dJ/dk.
Meaning of the Derivative
At a particular value of k, the derivative dJ/dk represents the slope of the tangent line to the cost function at that point. One way to visualize this slope is by drawing a small triangle along the tangent line and computing its rise over run.
- When the tangent line slopes upward to the right, the derivative is positive.
- When the tangent line slopes downward to the right, the derivative is negative.
Case 1: Positive Slope
Suppose the derivative at the current value of k is positive. The update rule then becomes:
k := k − α · (positive number)
Since the learning rate α is always positive, subtracting a positive quantity causes k to decrease. On the graph, this corresponds to moving leftward along the horizontal axis.
This movement is beneficial because shifting left reduces the cost value J(k), guiding the algorithm closer to the minimum.
Case 2: Negative Slope
Now consider a different starting point where the slope of the cost function is negative. In this situation, the derivative dJ/dk is negative.
The update rule becomes:
k := k − α · (negative number)
Subtracting a negative value is equivalent to adding a positive one, causing k to increase. Graphically, this moves the parameter to the right, once again leading toward lower cost values.
These two cases illustrate why Gradient Descent consistently guides parameters toward the minimum of the cost function. Regardless of whether the slope is positive or negative, the update rule always adjusts the parameter in a direction that reduces the cost.
The derivative provides local directional information, and the learning rate controls how aggressively the parameter is updated. Over successive iterations, these controlled adjustments bring the parameter closer to the optimal value.
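To make the two cases concrete, the sketch below performs a single update step on the toy cost J(k) from the previous section (three points, intercept fixed at 0), starting once to the right of the minimum and once to the left. The starting points and the learning rate are illustrative choices.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
alpha = 0.1   # learning rate (illustrative value)

def dJ_dk(k):
    """Derivative of J(k) = (1/(2m)) * sum((k*x - y)^2) with respect to k."""
    m = len(x)
    return np.sum((k * x - y) * x) / m

for k in [2.0, 0.0]:             # right of the minimum (k = 1), then left of it
    slope = dJ_dk(k)             # positive at k = 2, negative at k = 0
    k_new = k - alpha * slope    # the gradient descent update
    print(f"k = {k}: dJ/dk = {slope:+.2f}, updated k = {k_new:.2f}")
In both cases the update moves k toward the minimum at k = 1: from 2.00 down to about 1.53, and from 0.00 up to about 0.47.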
An essential component of Gradient Descent is the learning rate α.
- If α is too small, the algorithm converges very slowly.
- If α is too large, Gradient Descent may overshoot the minimum or fail to converge altogether.
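A quick way to see both failure modes is to run the same one-parameter update for a few iterations with different values of α on the toy dataset; the specific values below are chosen only for illustration.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

def dJ_dk(k):
    # Derivative of the toy cost J(k) with respect to k
    m = len(x)
    return np.sum((k * x - y) * x) / m

for alpha in [0.001, 0.1, 0.5]:      # too small, reasonable, too large (for this data)
    k = 2.0                          # start away from the minimum at k = 1
    for _ in range(20):
        k -= alpha * dJ_dk(k)
    print(f"alpha = {alpha}: k after 20 steps = {k:.3f}")
With α = 0.001 the parameter barely moves, with α = 0.1 it converges to roughly 1, and with α = 0.5 the updates overshoot and grow, illustrating divergence.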
Selecting an appropriate learning rate is therefore critical for efficient and stable training.
4. Linear Regression in Python
In this section, we’ll build a simple univariate linear regression model from the ground up using only NumPy and Matplotlib.
Our goal is to learn the parameters k (slope) and b (bias/intercept) of the model
ŷ = k·x + b
by minimizing the mean squared error (MSE) using gradient descent.
We’ll wrap the whole logic into a small class called SimpleLinearRegressor, which will have two main methods:
- .fit(X, y) – learns parameters k and b using gradient descent
- .predict(X) – uses the learned parameters to make predictions for new inputs
We’ll also allow the user to specify:
- learning_rate (step size in gradient descent)
- n_iters (number of iterations/epochs)
Code
import numpy as np
import matplotlib.pyplot as plt


class SimpleLinearRegressor:
    """
    Univariate Linear Regression implemented from scratch using gradient descent.

    Model: y_hat = k * x + b
    """

    def __init__(self, learning_rate=0.01, n_iters=1000):
        # Hyperparameters
        self.learning_rate = learning_rate
        self.n_iters = n_iters

        # Parameters (weights) - initialized later in fit()
        self.k_ = None  # slope
        self.b_ = None  # intercept

    def fit(self, X, y):
        """
        Train the model using gradient descent.

        Parameters
        ----------
        X : array-like, shape (m,)
            Input feature values.
        y : array-like, shape (m,)
            Target values.
        """
        # Convert inputs to NumPy arrays and flatten to 1D
        X = np.array(X).reshape(-1)
        y = np.array(y).reshape(-1)
        m = len(X)  # number of training examples

        # Initialize parameters to zero
        self.k_ = 0.0
        self.b_ = 0.0

        # Gradient descent loop
        for _ in range(self.n_iters):
            # 1. Compute current predictions: y_hat = kx + b
            y_pred = self.k_ * X + self.b_

            # 2. Compute the error term (residuals)
            error = y_pred - y  # shape (m,)

            # 3. Compute gradients of the cost function w.r.t k and b
            #    Cost J(k, b) = (1/(2m)) * sum((y_hat - y)^2)
            #    dJ/dk = (1/m) * sum((y_hat - y) * x)
            #    dJ/db = (1/m) * sum(y_hat - y)
            grad_k = (1 / m) * np.dot(error, X)
            grad_b = (1 / m) * np.sum(error)

            # 4. Update parameters in the opposite direction of the gradient
            self.k_ -= self.learning_rate * grad_k
            self.b_ -= self.learning_rate * grad_b

    def predict(self, X):
        """
        Predict target values for given inputs.

        Parameters
        ----------
        X : array-like, shape (m,) or (n_samples,)

        Returns
        -------
        y_pred : np.ndarray
            Predicted values with shape (m,)
        """
        X = np.array(X).reshape(-1)
        return self.k_ * X + self.b_
Class Definition
class SimpleLinearRegressor: ...
We define a simple class that encapsulates:
- The parameters of the model (k_, b_)
- The training procedure (fit)
- The prediction procedure (predict)
This makes the API feel similar to scikit-learn (fit, predict style).
__init__ – Hyperparameters
def __init__(self, learning_rate=0.01, n_iters=1000):
    self.learning_rate = learning_rate
    self.n_iters = n_iters
    self.k_ = None
    self.b_ = None
- learning_rate (α): controls how large each gradient descent update is.
- n_iters: total number of gradient descent iterations (epochs).
- We set k_ and b_ to None initially. They will be learned in .fit().
.fit(X, y) – Training with Gradient Descent
X = np.array(X).reshape(-1)
y = np.array(y).reshape(-1)
m = len(X)
self.k_ = 0.0
self.b_ = 0.0
- Convert X and y into NumPy arrays to support vectorized math.
- reshape(-1) ensures they become 1D arrays of length m.
- Initialize parameters k and b to 0. For linear regression, this is a standard and safe starting point.
Gradient Descent Loop
for _ in range(self.n_iters):
    y_pred = self.k_ * X + self.b_
    error = y_pred - y
- y_pred is the vector of predictions ŷ^(i)
- error is the vector of residuals ŷ^(i) − y^(i) for each training example
Computing the Gradients
Recall the cost function:
J(k, b) = (1 / (2m)) · Σ_{i=1}^{m} ( ŷ^(i) − y^(i) )²
The partial derivatives are:
dJ/dk = (1/m) · Σ_{i=1}^{m} ( ŷ^(i) − y^(i) ) · x^(i)
dJ/db = (1/m) · Σ_{i=1}^{m} ( ŷ^(i) − y^(i) )
In code:
grad_k = (1 / m) * np.dot(error, X)
grad_b = (1 / m) * np.sum(error)
- np.dot(error, X) computes Σ ( ŷ^(i) − y^(i) ) · x^(i)
- np.sum(error) computes Σ ( ŷ^(i) − y^(i) )
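As a tiny sanity check (with made-up residual values), the vectorized expressions agree with their explicit-loop equivalents:
import numpy as np

X = np.array([1.0, 2.0, 3.0])
error = np.array([0.5, -1.0, 2.0])   # made-up residuals, just for this check
m = len(X)

grad_k = (1 / m) * np.dot(error, X)                        # vectorized
grad_k_loop = sum(e * xi for e, xi in zip(error, X)) / m   # explicit sum

grad_b = (1 / m) * np.sum(error)
grad_b_loop = sum(error) / m

print(np.isclose(grad_k, grad_k_loop), np.isclose(grad_b, grad_b_loop))   # True True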
Parameter Update
self.k_ -= self.learning_rate * grad_k
self.b_ -= self.learning_rate * grad_b
This is the gradient descent update:
k := k − α · dJ/dk
b := b − α · dJ/db
Each iteration nudges the parameters in the direction that reduces the cost.
.predict(X) – Using the Learned Parameters
X = np.array(X).reshape(-1)
return self.k_ * X + self.b_
Given new inputs X, we apply the learned line:
ŷ = k·X + b
The result is a NumPy array of predictions.
Training and Visualizing on Synthetic Data
Now let’s generate a simple synthetic dataset and fit our model.
# 1. Create synthetic data
rng = np.random.RandomState(0)      # for reproducibility
X = 10 * rng.rand(50)               # 50 points in [0, 10)
Y = 2.0 * X + 10.0 + rng.randn(50)  # true slope=2, intercept=10 + noise

# 2. Visualize raw data
plt.figure(figsize=(8, 5))
plt.scatter(X, Y, color='green', alpha=0.7)
plt.xlabel('x', fontsize=14)
plt.ylabel('y', fontsize=14)
plt.title('Synthetic Data for Linear Regression', fontsize=16)
plt.show()

# 3. Create and train the model
regressor = SimpleLinearRegressor(learning_rate=0.01, n_iters=1000)
regressor.fit(X, Y)

print("Learned parameters:")
print(f"k (slope): {regressor.k_:.4f}")
print(f"b (intercept): {regressor.b_:.4f}")

# 4. Make predictions on the training data
Y_pred = regressor.predict(X)

# 5. Plot data and fitted line
plt.figure(figsize=(8, 5))
plt.scatter(X, Y, color='green', alpha=0.7, label='Data')
plt.plot(X, Y_pred, color='black', linewidth=2.5, label='Fitted line')
plt.xlabel('x', fontsize=14)
plt.ylabel('y', fontsize=14)
plt.title('Linear Regression Fit', fontsize=16)
plt.legend()
plt.show()
Explanation of the Training Script
1. Data generation
rng = np.random.RandomState(0)
X = 10 * rng.rand(50)
Y = 2.0 * X + 10.0 + rng.randn(50)
- rng.rand(50) generates 50 random values in [0, 1).
- Multiplying by 10 scales them to [0, 10).
- Y is generated from a true underlying line y = 2x + 10 plus some Gaussian noise (rng.randn(50)), simulating real-world data.
2. Initial scatter plot
- We visualize (X, Y) to see roughly linear structure with some noise.
- This is what the model will try to approximate with a straight line.
3. Model creation and training
regressor = SimpleLinearRegressor(learning_rate=0.01, n_iters=1000)
regressor.fit(X, Y)
- We instantiate our model with a learning rate of 0.01 and 1000 iterations.
- .fit(X, Y) runs gradient descent and learns k_ and b_.
4. Inspecting learned parameters
print(f"k (slope): {regressor.k_:.4f}")print(f"b (intercept): {regressor.b_:.4f}")
- For this dataset, you should see values close to the true parameters: slope ≈ 2, intercept ≈ 10.
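As an optional cross-check (not part of the original script), you can compare the gradient descent result against NumPy's closed-form least-squares fit; the two should roughly agree, with small differences because 1000 iterations at this learning rate only approximate the exact solution.
# np.polyfit returns coefficients from highest degree to lowest: [slope, intercept]
slope_ls, intercept_ls = np.polyfit(X, Y, deg=1)
print(f"least squares    k: {slope_ls:.4f}, b: {intercept_ls:.4f}")
print(f"gradient descent k: {regressor.k_:.4f}, b: {regressor.b_:.4f}")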
5. Predictions and fitted line
Y_pred = regressor.predict(X)
- We compute predictions for each X in the training set.
- The second plot overlays the original points and the fitted regression line.
- Visually, you should see that the line passes through the “middle” of the data cloud.
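Finally, continuing from the script above, the trained model can also be used on inputs it has never seen; the input values here are arbitrary examples.
# Predict for new, unseen inputs (arbitrary example values)
X_new = [2.5, 7.0]
print(regressor.predict(X_new))   # roughly 2*x + 10 for each input, e.g. ~15 and ~24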
Linear Regression is often the first algorithm we encounter in machine learning, but it is far from trivial. Concepts like cost functions and gradient descent appear again and again in more advanced models.
Mastering these basics now will make your future journey into machine learning significantly smoother. If this article helped even a little — then it did its job.