Linear Regression from Scratch with Python

Among the variety of models available in Machine Learning, Linear Regression is arguably the most basic and simple one. However, it incorporates almost all of the basic concepts required to understand Machine Learning modelling.
In this example, I will show how relatively simple it is to implement a univariate (one input, one output) linear regression model.
Coming back to the theory, linear regression is a statistical hypothesis stating that the relation between two (or more) variables is linear, i.e. when one variable changes by some amount, the other changes by a proportional amount, directly or inversely. Finding an accurate linear regression validates such a hypothesis for a given dataset.
The basic equation structure is:

y = theta0 + theta1 * x

where y is the output (dependent variable), x is the input (independent variable), theta0 is the intercept and theta1 is the slope.
To better understand, consider that one wants to predict housing prices in a certain neighborhood using the house size as input. This makes sense, since one can logically expect that bigger houses (larger area) will have higher prices. The hypothesis stated by the linear regression, however, is that this relation between the variables is linear.
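To make the idea concrete, here is a minimal sketch of such a price model. The parameter values and the area are purely illustrative assumptions, not fitted to any data:

```python
# hypothetical linear price model: price = theta0 + theta1 * area
theta0 = 50000   # assumed base price (illustrative)
theta1 = 1200    # assumed price increase per unit of area (illustrative)

area = 100
price = theta0 + theta1 * area
print(price)  # -> 170000
```

Of course, in practice theta0 and theta1 are unknown; estimating them from data is exactly what the rest of this post is about.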
To evaluate that, let's work with a "manually" created dataset, which will be easier since we will know from the beginning that the hypothesis is valid for it. And just to give a sense of the real world, we will add some noise to the expected output.
For visualization we will use the matplotlib.pyplot library, and we will use the random module to generate the white noise on the output variable y.
import matplotlib.pyplot as plt
import random
Now let's create the variables x and y, using the relation y = 2x - 1:
x = list(range(-10,12))
y = [2*xval-1 for xval in x]
The first step of pre-processing the data is normalization. This transforms each variable from its original range to values between 0 and 1, using the following formula:

v_norm = (v - min(v)) / (max(v) - min(v))

The following code performs the normalization and also adds the white noise to the output variable:
random.seed(999)
# normalize the values
minx = min(x)
maxx = max(x)
miny = min(y)
maxy = max(y)
x = [(xval - minx)/(maxx-minx) for xval in x]
y = [(yval - miny)/(maxy-miny) + random.random()/5 for yval in y]
print(x)
print(y)
plt.plot(x,y,'o')
Assume an initial guess for the parameters of the linear regression model. From this starting point, we will iterate until the optimum values are found. Here we will simply start from theta0 = 0 and theta1 = 0.
It is possible to adjust the parameters of the linear regression model analytically, i.e. with no iteration, using a closed-form equation. This analytical method is known as the Least Squares Method or the Normal Equation method. However, since this technique is almost only applicable to linear regression, I chose the iterative approach, because it is more general and gives a better sense of how machine learning models are usually trained.
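For comparison, here is a minimal sketch of that analytical route for the univariate case, using the standard closed-form least-squares formulas (this is not part of the iterative example that follows; the function name and the sanity-check data are my own):

```python
# closed-form least-squares fit for y = theta0 + theta1 * x (univariate case)
def least_squares_fit(xs, ys):
    n = len(xs)
    xmean = sum(xs) / n
    ymean = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    theta1 = (sum((xv - xmean) * (yv - ymean) for xv, yv in zip(xs, ys))
              / sum((xv - xmean) ** 2 for xv in xs))
    theta0 = ymean - theta1 * xmean  # intercept
    return theta0, theta1

# sanity check on noiseless data generated by y = 2x - 1
xs = list(range(-10, 12))
ys = [2 * xv - 1 for xv in xs]
print(least_squares_fit(xs, ys))  # -> (-1.0, 2.0)
```

On noiseless data it recovers the generating parameters exactly, which is a useful baseline to compare the iterative result against.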
The algorithm that we will use is Gradient Descent. A good explanation of it can be found on Wikipedia, for instance, so I won't rewrite the basic concepts here. Maybe I will leave that to another post.
Applying the Gradient Descent method means iteratively updating the parameters of the linear regression model. At each iteration until convergence:

theta0 := theta0 - alpha * dJ/dtheta0
theta1 := theta1 - alpha * dJ/dtheta1

where alpha is the learning rate, J is the cost function (the sum of squared errors between the predictions ypred and the true values y)

J = sum((ypred - y)**2)

and the gradients, up to a constant factor absorbed into the learning rate, are

dJ/dtheta0 = sum(ypred - y)
dJ/dtheta1 = sum((ypred - y) * x)
We will do 100 iterations using a learning rate of 0.05. Also, we will collect the parameter history in theta_history and the cost function values in J.
epochs = 100 # number of iterations
learning_rate = 0.05
theta0, theta1 = 0, 0 # initial guess for the parameters
theta_history = [[theta0,theta1]]
J = list()
Then, for each iteration, we compute the predictions, evaluate the cost function, and update the parameters using the gradients:
for epoch in range(epochs):
    # predictions with the current parameters
    ypred = [theta0 + theta1*xval for xval in x]
    # cost function: sum of squared errors
    J.append(sum([(ypredval-yval)**2 for ypredval,yval in zip(ypred,y)]))
    print('J = ',J[-1])
    # gradients of the cost function
    dJd0 = sum([ypredval - yval for ypredval,yval in zip(ypred,y)])
    dJd1 = sum([(ypredval - yval)*xval for ypredval,yval,xval in zip(ypred,y,x)])
    # parameter update
    theta0 = theta0 - learning_rate*dJd0
    theta1 = theta1 - learning_rate*dJd1
    theta_history.append([theta0,theta1])

# predictions with the final parameters
ypred = [theta0 + theta1*xval for xval in x]
plt.plot(J)
Notice how the cost function J decreases at each iteration and flattens out, indicating that the parameters have converged.
Now notice the model accuracy, as plotted below.
plt.plot(x,y,'o')
plt.plot(x,ypred,'+')
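One detail worth noting: since the model was trained on normalized data, its predictions are also in the 0-to-1 range. If one wants them back in the original units, the min-max normalization can be inverted. A small sketch, using as bounds the values that y = 2x - 1 takes for x in [-10, 11] (the example predictions are made up):

```python
# invert the min-max normalization to recover original-scale values
miny, maxy = -21, 21           # bounds of y = 2x - 1 for x in [-10, 11]
ypred_norm = [0.0, 0.5, 1.0]   # example normalized predictions (illustrative)
ypred_orig = [yn * (maxy - miny) + miny for yn in ypred_norm]
print(ypred_orig)  # -> [-21.0, 0.0, 21.0]
```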
With this example, you have seen that it is possible, and not so complicated, to build a univariate linear regression with Python. Notice that we only used libraries for plotting and for generating pseudo-random numbers; not even NumPy or SciPy was used.
📥 Download the Jupyter notebook here
📥 Download the Python code here
Thanks for reading this post and see you soon!