Linear regression is one of the most common data analysis techniques to help you make sense of Big Data and enable more informed decision-making. This quote from renowned data scientist Tom Redman gives us the best possible explanation of linear regression.
“Suppose you’re a sales manager trying to predict next month’s numbers. You know that dozens, perhaps even hundreds of factors from the weather to a competitor’s promotion to the rumor of a new and improved model can impact the number. Perhaps people in your organization even have a theory about what will have the biggest effect on sales.
“Trust me. The more rain we have, the more we sell.”
“Six weeks after the competitor’s promotion, sales jump.”
Regression analysis is a way of mathematically sorting out which of those variables does indeed have an impact.”
What is a Linear Regression?
Linear regression is one of the simplest and most commonly used data analysis and predictive modelling techniques. It aims to find an equation that expresses a continuous response variable, known as Y, as a function of one or more variables (X).
Linear regression can, therefore, predict the value of Y when only X is known. The prediction uses only the variables included in the equation and no other factors.
Y is known as the criterion variable while X is known as the predictor variable. The aim of linear regression is to find the best-fitting line, called the regression line, through the points. This is what the mathematical linear regression formula/equation looks like:
hθ(x) = θ0 + θ1x
In the above equation,
hθ(x) is the predicted value of the criterion variable Y
x is the predictor variable
θ0 is a constant (the intercept), and
θ1 is the regression coefficient (the slope)
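To make "best-fitting" concrete, here is a minimal Python sketch. It assumes the usual least-squares criterion, which the text above does not name explicitly: the best line hθ(x) = θ0 + θ1x is the one with the smallest sum of squared differences between the predictions and the observed Y values. The function names and the toy points are made up for illustration.

```python
def h(x, theta0, theta1):
    """Predicted Y for a given x on the line theta0 + theta1 * x."""
    return theta0 + theta1 * x


def sum_squared_errors(xs, ys, theta0, theta1):
    """Total squared gap between the line's predictions and the observed values."""
    return sum((h(x, theta0, theta1) - y) ** 2 for x, y in zip(xs, ys))


# Toy points that lie exactly on y = 1 + 2x (illustrative, not real data)
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]

# The line that passes through every point has zero error ...
perfect_fit = sum_squared_errors(xs, ys, theta0=1, theta1=2)
# ... while a line with the wrong intercept misses every point by 1.
worse_fit = sum_squared_errors(xs, ys, theta0=0, theta1=2)

print(perfect_fit, worse_fit)
```

Under this criterion, the line with the lower error is the better fit, so fitting a regression amounts to searching for the θ0 and θ1 that minimise this quantity.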
Example Of A Linear Regression Problem
Let’s say we have sample data that records a population over a number of years. The data looks something like this:
| No. | Year | Population |
| --- | --- | --- |
| 1 | 2000 | 1,014,004,000 |
| 2 | 2001 | 1,029,991,000 |
| 3 | 2002 | 1,045,845,000 |
| 4 | 2003 | 1,049,700,000 |
| 5 | 2004 | 1,065,071,000 |
| 6 | 2005 | 1,080,264,000 |
| 7 | 2006 | 1,095,352,000 |
| 8 | 2007 | 1,129,866,000 |
| 9 | 2008 | 1,147,996,000 |
| 10 | 2009 | 1,166,079,000 |
| 11 | 2010 | 1,173,108,000 |
| 12 | 2011 | 1,189,173,000 |
| 13 | 2012 | 1,205,074,000 |
Now say, for instance, that our goal is to figure out what the population will be in 2014, or in what year the population will reach 2,205,074,000.
To begin with, we use Python to create a graph to represent this data. Here’s what the code and the graph look like.
The Code to Build The Graph
# Required packages
# Note: in plotly 4 and later, the plotly.plotly module lives in the separate
# chart_studio package (import chart_studio.plotly as py).
import plotly.plotly as py
from plotly.graph_objs import Scatter
from datetime import datetime

# Sign in with your own Plotly username and API key
py.sign_in("username", "API_authentication_code")

# One date per year, 2000 through 2012
x = [datetime(year=y, month=1, day=1) for y in range(2000, 2013)]

# Population recorded for each year, in the same order as x
population = [1014004000, 1029991000, 1045845000, 1049700000, 1065071000,
              1080264000, 1095352000, 1129866000, 1147996000, 1166079000,
              1173108000, 1189173000, 1205074000]

# A single scatter trace of population against year
data = [Scatter(x=x, y=population)]
plot_url = py.plot(data, filename="DataAspirant")
In the next step, we use the equation we discussed above to find the most suitable values for our constant θ0 and our coefficient θ1. In this scenario, x is the year and hθ(x) is the value we want to predict, that is, Y. First, we find θ0 and θ1 from the training data. We then use these values to predict Y for our test data.
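As a rough illustration of this step, the sketch below estimates θ0 and θ1 for the population table using the closed-form least-squares formulas for a single predictor. It makes two assumptions that the article does not state: least squares is the fitting criterion, and all 13 rows are used as training data (there is no separate test split). The helper name `fit_line` is hypothetical.

```python
years = list(range(2000, 2013))
population = [1014004000, 1029991000, 1045845000, 1049700000, 1065071000,
              1080264000, 1095352000, 1129866000, 1147996000, 1166079000,
              1173108000, 1189173000, 1205074000]


def fit_line(xs, ys):
    """Return (theta0, theta1) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    theta1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means
    theta0 = mean_y - theta1 * mean_x
    return theta0, theta1


theta0, theta1 = fit_line(years, population)

# Question 1: predicted population in 2014
pred_2014 = theta0 + theta1 * 2014

# Question 2: solve theta0 + theta1 * year = target for the year
target = 2205074000
year_for_target = (target - theta0) / theta1

print(f"theta1 (growth per year): {theta1:,.0f}")
print(f"Predicted population in 2014: {pred_2014:,.0f}")
print(f"Year the line reaches {target:,}: {year_for_target:.1f}")
```

Both answers extrapolate beyond the observed years, the second one by several decades, so they are only as reliable as the assumption that the linear trend continues.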
An easy way of performing regression calculations is to use the Linear Regression Calculator, an online tool that fits a linear equation to a data set and thereby calculates the relationship between two variables. All you have to do is enter the data points, and the calculator performs the linear regression calculations.
What is A Simple Linear Regression & Multiple Linear Regression?
In simple linear regression, just one independent variable X is used to predict the value of the criterion variable Y. In multiple linear regression, more than one independent variable is used to predict Y.
Of course, in both cases, there is just one variable Y. The only difference is in the number of independent variables.
For example, if we predict the rent of an apartment based on just the square footage, it is a simple linear regression. On the other hand, if we predict rent based on a number of factors, such as square footage, the location of the property, and the age of the building, then it becomes an example of multiple linear regression.
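The rent example can be sketched in Python with NumPy's least-squares solver. Everything about the data here is an assumption made for illustration: the six apartments are invented, location is reduced to a hypothetical 0/1 "city centre" flag, and the rents were generated from a known formula (200 + 1.5 × square footage + 300 × city centre − 10 × age) so that the recovered coefficients can be checked by eye.

```python
import numpy as np

# Hypothetical apartments: square footage, city-centre flag (0/1), building age
features = np.array([
    [500, 0, 10],
    [750, 1, 5],
    [1000, 0, 20],
    [1200, 1, 15],
    [900, 1, 30],
    [650, 0, 2],
], dtype=float)

# Rents generated from the made-up formula in the lead-in (no noise)
rent = np.array([850, 1575, 1500, 2150, 1550, 1155], dtype=float)

# Add a column of ones so the model can learn the constant term
design = np.column_stack([np.ones(len(features)), features])

# Least-squares solution: one constant plus one coefficient per variable
coef, _, _, _ = np.linalg.lstsq(design, rent, rcond=None)
print("constant, sqft, city centre, age:", np.round(coef, 3))

# Predict rent for a new 800 sq ft, city-centre, 10-year-old apartment
new_apartment = np.array([1.0, 800.0, 1.0, 10.0])
predicted_rent = float(new_apartment @ coef)
print("predicted rent:", round(predicted_rent, 2))
```

Dropping the last two feature columns turns the same code into a simple linear regression on square footage alone, which is the only difference between the two cases.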
How is Linear Regression Different From Logistic Regression?
As you know, the dependent variable Y is always a continuous variable in linear regression. What happens when the Y variable is categorical and not continuous?
Categorical variables have distinct groups or categories. They may or may not have a logical order. Examples include gender, payment method, age bracket and so on.
Continuous variables are numeric values. They have an infinite number of values between any two given values. Examples include the length of a video or the time a payment is received or the population of a city.
If the dependent variable Y is categorical and not continuous, you can’t use linear regression. In cases where Y is a categorical variable and has 2 classes, you can use logistic regression to solve the problem. These kinds of problems are also known as binary classification problems.
It’s also important to remember that standard logistic regression works only when the dependent variable Y has 2 classes. If Y has more than 2 classes, then the problem is no longer a binary classification problem but a multi-class classification problem, and standard logistic regression won’t apply.
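The difference from linear regression can be seen in a small Python sketch. It assumes the standard formulation of binary logistic regression, in which the linear expression θ0 + θ1x is passed through the sigmoid function to give a probability between 0 and 1, and the class is chosen by a threshold. The coefficients below are hypothetical and have not been fitted to any data.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(x, theta0, theta1):
    """Probability that Y belongs to class 1 for a given x."""
    return sigmoid(theta0 + theta1 * x)


def predict_class(x, theta0, theta1, threshold=0.5):
    """Turn the probability into one of the two classes, 0 or 1."""
    return 1 if predict_probability(x, theta0, theta1) >= threshold else 0


# Hypothetical coefficients, chosen only to illustrate the behaviour
theta0, theta1 = -4.0, 0.1

for x in (10, 60, 90):
    p = predict_probability(x, theta0, theta1)
    print(f"x={x}: probability={p:.3f}, class={predict_class(x, theta0, theta1)}")
```

The output is a class label rather than a number on a continuous scale, which is why this model suits a categorical Y with two classes.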
How Do Companies Use Linear Regression?
As Tom Redman says, “Regression analysis is the go-to method in analytics. Like managers, we want to figure out how we can impact sales or employee retention or recruiting the best people. It helps us figure out what we can do.”
In other words, linear regression is used to make business decisions in all kinds of use cases. Companies use regression analysis in three main ways:
(i) To explain something they are having trouble understanding. For instance, why customer service emails have fallen in the previous quarter.
(ii) To make predictions about important business trends. For instance, what will demand for their product look like over the next year?
(iii) To choose between different alternatives. For instance, should we go for a PPC (pay-per-click) campaign or a content marketing campaign?
What Are The Best Resources To Learn Linear Regression?
There are a number of online courses that teach regression analysis from scratch. You can also go through some seminal textbooks like Regression Analysis by Example (Chatterjee, S., Hadi, A.S.) and Regression: Models, Methods and Applications (Fahrmeir, L., Kneib, Th., Lang, S., Marx, B.).
However, if you’re serious about a career in data analytics and data science, then a live instructor-led program might be your best bet. It will combine the advantages of a cutting-edge curriculum with two-way interaction, live sessions, assignments, and placement assistance.
Common Mistakes While Using Linear Regression
Here are some of the most common mistakes that need to be avoided while doing regression analysis.
1. Having A Vague Problem Definition
Avoid a problem statement as broad as “Find out why sales are going down”. If your problem statement is nothing short of a fishing expedition, you won’t be able to find the right answers. Try to get business managers to define the independent variables as precisely as possible.
2. Analyses Are Sensitive To Poor Data
Of course, data will not always be perfect. But you need to get it to a point where it is as clean and actionable as possible. This is especially important when the decisions made as a result of such analysis will have a significant impact on the bottom line of the business.
Conclusion
Regression analysis is one of the most fundamental techniques used in data science. It’s also the starting point for most data analysis and predictive modelling techniques.
Once you have learned linear regression, the next techniques to study are usually logistic and polynomial regression. To have a career in data analytics, it’s best to learn regression analysis as thoroughly as you can so that you are able to grasp the different nuances as well as avoid common mistakes. Knowing all the assumptions of linear regression is an added advantage.
Luckily, there are a number of good courses that can help you get there.