Linear Regression in easy and manageable terms

In this article, I would like to present Linear Regression in a simple, easy way so
that even a newbie can understand the mechanism rather than memorizing it as an
algorithmic recipe.

INTRODUCTION

Regression algorithms: we use these when the target variable (or dependent variable: the variable that we predict) is a continuous variable (e.g., any value from 100 to 2000). For example, consider predicting the price of a house given its square area. The house price will increase as the square area increases, and likewise with the number of bedrooms or a better location. We can conclude that the square area influences the price of the house; hence there is a strong relationship between the square area of the house and the house price. In mathematical terms this is "regression". The price depends on the area of the house, so it is called the "dependent variable", whereas the square area is called the "independent variable". There can be multiple factors influencing the house price, such as location, number of bedrooms, etc. All such factors have a relationship with the price, which we can write statistically as an equation.


Traditional Definition of Linear Regression

Linear Regression is a statistical model used to estimate the relationship between independent and dependent variables, denoted by x and y respectively.

The linear regression equation is based on the formula for a simple linear equation, the equation of a line: y = mx + c, where m is the slope and c is the intercept.

Detailed analysis of the equation

For example, the equation can be
y = 0.32 x1 + 0.5 x2 + 0.66 x3
where 0.32, 0.5, and 0.66 are the weights (coefficients) of the equation. These weights are learned by studying the relationship between the independent and dependent variables.
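As a quick illustration, the prediction is just a weighted sum of the inputs. A minimal sketch with numpy, using the made-up weights above and hypothetical input values:

import numpy as np

weights = np.array([0.32, 0.5, 0.66])  # the illustrative weights from the equation above
x = np.array([1.0, 2.0, 3.0])          # hypothetical values for x1, x2, x3
y = np.dot(weights, x)                 # 0.32*1.0 + 0.5*2.0 + 0.66*3.0
print(y)                               # 3.3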

Important Assumptions of simple linear regression


Simple linear regression is a parametric test, meaning that it makes certain assumptions about the data. Consider these assumptions before doing regression analysis:

  1. Linearity: the line of best fit through the data points is a straight line (rather than a curve or some sort of grouping factor). Linear regression is the simplest non-trivial relationship between the independent and dependent variables.
  2. No endogeneity of regressors: when there is a correlation between an independent variable (x) and the error term in the model, we refer to this problem as omitted variable bias (OVB). It occurs when you forget to include a relevant variable.
  3. Homogeneity of variance (homoscedasticity): the error terms have equal variance.
  4. No multicollinearity: there are no hidden relationships among the independent variables/observations.
  5. Normality: we assume that the error term is normally distributed.
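Some of these assumptions can be checked with a few lines of code before modelling. A minimal sketch on made-up data, using a correlation matrix to spot multicollinearity among the inputs:

import numpy as np
import pandas as pd

# made-up data: x2 is deliberately almost a copy of x1
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + rng.normal(scale=0.1, size=200)
df = pd.DataFrame({'x1': x1, 'x2': x2})

print(df.corr())  # an x1-x2 correlation near 1 flags multicollinearity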

Idea behind Linear Regression:

  1. The core idea is to obtain a line that best fits the data. The best-fit line is the one for which the total prediction error (over all data points) is as small as possible. The error is the distance between a data point and the best-fit line (the regression line).
  2. Choose the values of a (intercept) and b (slope) so that they minimize this error.
  3. We can train this algorithm in multiple ways: we can use basic statistics to calculate a and b, the ordinary least squares (OLS) method, or gradient descent, as sketched below.
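A minimal sketch of the "use statistics" route: for a single input, the least-squares slope is b = cov(x, y) / var(x) and the intercept is a = mean(y) - b * mean(x). On made-up data:

import numpy as np

# made-up data roughly following y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # slope
a = y.mean() - b * x.mean()                         # intercept
print(a, b)  # the line a + b*x minimizes the total squared error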

How to perform a simple linear regression

Simple linear regression formula

The formula for a simple linear regression is y = B0 + B1x + e, where:

  • y is the predicted value / dependent variable.
  • B0 is the intercept: the predicted value of y when x is 0.
  • B1 is the regression coefficient: how much we expect y to change as x increases.
  • x is the independent variable (the variable we expect is influencing y).
  • e is the error of the estimate: how much variation there is in our estimate of the regression coefficient.

The smaller the error, the more accurate the prediction will be, and the better the model fits the data.
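As a tiny worked example with hypothetical numbers (B0 = 2, B1 = 0.5, and an observed y of 5.7):

B0, B1 = 2.0, 0.5    # hypothetical intercept and regression coefficient
x = 6                # independent variable
y_hat = B0 + B1 * x  # predicted value: 2 + 0.5*6 = 5.0
e = 5.7 - y_hat      # error: observed y minus prediction = 0.7
print(y_hat, e)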

Simple linear regression in Python

from sklearn.datasets import load_boston
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

We will use the Boston housing dataset from the sklearn library; along with that we'll also import the pandas library with the alias pd and numpy as np, and matplotlib will be used to plot the linear regression graph. (Note: load_boston was removed in scikit-learn 1.2, so this walkthrough assumes an older version of the library.)

bs = load_boston()

Now, we'll load the Boston data and store it in the bs variable.

print(bs.DESCR)

(Optional) If you want to know more about the Boston dataset, such as the variables and their meanings, you can use the .DESCR (description) attribute to get more details.

boston = pd.DataFrame(bs.data, columns = bs.feature_names)
boston['MEDV'] = bs.target  # the target lives separately in bs.target, so attach it as a column

Convert the Boston dataset into a pandas DataFrame for further computations (note that the target, MEDV, is stored separately in bs.target, so we attach it as a column; the later steps reference it). If you check the type of the bs variable, it will give you the output below:

type(bs)
> sklearn.utils.Bunch

Check the boston data now
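For example, .head() shows the first five rows, one column per feature plus the MEDV column we attached:

boston.head()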

Now, we'll split our Boston dataset into a training set and a test set using the sklearn library as below:

from sklearn.model_selection import train_test_split as split
train, test = split(boston, test_size = 0.20, random_state = 12)

The data is split into an 80% training set and a 20% test set. We pass random_state = 12 (you can give any value to random_state) so that every execution produces the same split; otherwise the data is shuffled differently on every run.
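You can verify the split sizes; assuming the standard 506-row Boston dataset with 13 features plus MEDV, this should print (404, 14) and (102, 14):

print(train.shape, test.shape)  # roughly 80% / 20% of the 506 rows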

After splitting the data, we move on to model development:
from sklearn.linear_model import LinearRegression

Import LinearRegression from the sklearn.linear_model module.

#create a model object
lm = LinearRegression()

lm is the model object for our simple linear regression model

#Fit the model
X = train[['RM']]  # X always needs to be a DataFrame (2-D)
y = train.MEDV     # y is always a Series (1-D)
lm.fit(X, y)       # learn the intercept and slope from the training data

RM is the independent variable (description of RM: average number of rooms per dwelling).
MEDV is the dependent variable that we will predict (description of MEDV: median value of owner-occupied homes in $1000's).
The lm.fit call above trains our simple linear regression model on the independent and dependent variables.
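After fitting, the learned intercept (B0) and the RM coefficient (B1) are available on the model object:

print(lm.intercept_)  # B0: the predicted MEDV when RM is 0
print(lm.coef_)       # B1: the expected change in MEDV for one extra room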

Plotting the linear regression graph

y_cap = lm.predict(train[['RM']])         # predicted MEDV for each training point
plt.figure(figsize = (10,5))
plt.scatter(x=train.RM, y=train.MEDV)     # the actual data points
plt.plot(train.RM, y_cap, color = 'red')  # the fitted regression line
plt.show()
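To check how well the fitted line generalizes, we can also score the model on the held-out test set; score returns the R-squared of the predictions:

print(lm.score(test[['RM']], test.MEDV))  # R-squared on unseen data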

Interpreting the results

from statsmodels.formula.api import ols       # ordinary least squares
mod = ols(formula='MEDV ~ RM', data = train)  # model MEDV as a linear function of RM
lm_fit = mod.fit()                            # fit the model via OLS
lm_fit.summary()                              # summary table of the fitted model

This function takes the most important parameters from the linear model and puts them into a summary table. The key entries to read are:

R-squared: the percentage of variation in the dependent variable that is explained by the independent variable. In the table above, 48.6% of the variation in y (MEDV) is explained by the x variable (RM).

Adj. R-squared: a version of R-squared adjusted for the number of variables in the regression. It increases only when an additional variable adds explanatory power to the regression.

Prob(F-statistic): this tells the overall significance of the regression, assessing the significance of all the variables together. It is the probability of observing an F-statistic at least this large if the null hypothesis (all coefficients are zero) were true.

The t value column displays the test statistic. Unless you specify otherwise, the test statistic used in linear regression is the t-value from a two-sided t-test. The larger the test statistic, the less likely it is that our results occurred by chance.

The P>|t| column shows the p-value. This number tells us how likely we would be to see the estimated effect of the average number of rooms on the median value if the null hypothesis of no effect were true.

Because the p-value is so low (p < 0.001), we can reject the null hypothesis and conclude that the average number of rooms has a statistically significant effect on median value. The last few lines of the model summary are statistics about the model as a whole. The most important thing to notice here is the model's p-value: it is significant (p < 0.001), which means the model fits the observed data significantly better than a model with no predictors.
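Rather than reading these numbers off the table, you can also pull them from the fitted result object directly:

print(lm_fit.rsquared)  # R-squared
print(lm_fit.params)    # intercept (B0) and RM coefficient (B1)
print(lm_fit.pvalues)   # p-value for each coefficient
print(lm_fit.f_pvalue)  # Prob(F-statistic)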

Explanation of the linear regression graph

Pic credit: http://www.sthda.com/english/sthda-upload/images/machine-learning-essentials/linear-regression.png

The figure illustrates the linear regression model, where:
  • Black points are the dataset points
  • The best-fit regression line is shown in blue
  • The intercept (b0) and the slope (b1) are shown in green
  • Error terms (e) are represented by vertical red lines


Why use Linear Regression

Simple Implementation:

  1. When we know that the independent and dependent variables have a linear relationship, this algorithm is the best to use because it is the least complex compared to other algorithms that also try to find the relationship between the variables.
  2. We can use it to find the nature of the relationship between the variables.
  3. It is the most basic and widely used technique for predicting the value of an attribute.
  4. It is easy to use, as the model does not require a lot of tuning.
  5. It runs very fast, which makes it time-efficient.

Disadvantages of Linear Regression

  1. Linear regression assumes that the inputs are independent.
    a. Very often the inputs aren't independent of each other, so any multicollinearity must be removed before applying linear regression.
  2. Sensitive to outliers.
    a. Outliers are anomalies or extreme values that deviate from the other data points of the distribution. Outliers can damage the performance of a model drastically and often lead to models with low accuracy.
  3. Prone to underfitting.
    a. Underfitting: a situation that arises when a model fails to capture the data properly.
    b. Linear models are often not that good in terms of predictive performance, because the relationships they can learn are so restricted that they usually oversimplify complex reality.
    c. The interpretation of a weight can be unintuitive because it depends on all the other features. A feature that is highly positively correlated with the outcome y might get a negative weight in the linear model because, given another correlated feature, it is negatively correlated with y in the high-dimensional space.

Conclusion

This article gave you an idea of linear regression, explained it with the help of an example using the sklearn library, and discussed a few advantages as well as disadvantages of linear regression.
