Linear regression is a statistical method used for analyzing the relationship between a dependent variable and one or more independent variables. It is widely used in various fields, such as finance, economics, and engineering, to model the relationship between variables and make predictions. In this article, we will learn how to create a linear regression model using the scikit-learn library in Python.
Scikit-learn (also known as sklearn) is a popular Python library for machine learning that provides simple and efficient tools for data mining and data analysis. It provides a wide range of algorithms and models, including linear regression. In this article, we will use the sklearn library to create a linear regression model to predict the relationship between two variables.
Before we dive into the code, let’s first understand the basic concepts of linear regression.
Understanding Linear Regression
Linear regression is a supervised learning technique that models the relationship between a dependent variable (also known as the response variable or target variable) and one or more independent variables (also known as predictor variables or features). The goal is to find the line of best fit: the line (or, with several features, the plane) that minimizes the sum of squared differences between the observed values of the dependent variable and the values predicted from the independent variables.
In a simple linear regression, the relationship between the dependent variable and the independent variable is represented by the equation:
y = b0 + b1x
where y is the dependent variable, x is the independent variable, b0 is the intercept, and b1 is the slope.
The intercept b0 is the value of y when x is equal to zero, and the slope b1 represents the change in y for every unit change in x.
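As a minimal sketch of the equation above, the slope and intercept can be estimated by ordinary least squares, assuming that is the fitting criterion. The values are the five sample rows from the height and weight table used later in this article; the variable names are illustrative.

```python
import numpy as np

# Five sample rows (height in inches, weight in pounds) from this article's table
x = np.array([65, 70, 72, 68, 71], dtype=float)
y = np.array([150, 170, 175, 160, 170], dtype=float)

# Least-squares estimates for simple linear regression:
#   b1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
#   b0 = mean(y) - b1 * mean(x)
x_dev = x - x.mean()
y_dev = y - y.mean()
b1 = (x_dev * y_dev).sum() / (x_dev ** 2).sum()
b0 = y.mean() - b1 * x.mean()

print(f"slope b1 = {b1:.4f}")
print(f"intercept b0 = {b0:.4f}")
```

Here b1 is the estimated change in weight (pounds) per additional inch of height, and b0 is the value the line takes at a height of zero, which has no physical meaning for this data but is still needed to position the line.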
In multiple linear regression, the relationship between the dependent variable and multiple independent variables is represented by the equation:
y = b0 + b1x1 + b2x2 + ... + bnxn
where y is the dependent variable, x1, x2, …, xn are the independent variables, b0 is the intercept, and b1, b2, …, bn are the slopes.
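The multiple-regression equation can be illustrated with a small sketch. The data here is made up so that it follows y = 1 + 2*x1 + 3*x2 exactly; with noise-free data, the fitted intercept and slopes should recover those numbers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that follows y = 1 + 2*x1 + 3*x2 exactly (illustration only)
X_multi = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 4.0],
    [4.0, 3.0],
    [5.0, 6.0],
])
y_multi = 1 + 2 * X_multi[:, 0] + 3 * X_multi[:, 1]

multi_reg = LinearRegression().fit(X_multi, y_multi)

print("intercept b0:", multi_reg.intercept_)
print("slopes b1, b2:", multi_reg.coef_)
```

Each entry of coef_ corresponds to one column of X_multi, in the same order.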
Creating a Linear Regression Model in Python
Now that we have a basic understanding of linear regression, let’s dive into the code to create a linear regression model using the sklearn library in Python.
The first step is to import the necessary libraries and load the data. We will use the pandas library to load the data and the scikit-learn library to create the linear regression model.
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
Next, we will load the data into a pandas DataFrame. In this example, we will use a simple dataset that contains the height and weight of a group of individuals. The data consists of two columns: height in inches and weight in pounds. The goal is to fit a linear regression model to this data to find the relationship between height and weight. The data can be represented as a 2-dimensional array, where each row represents a sample (an individual) and each column represents a variable (height or weight). The X data is the height of the individuals and the y data is their corresponding weight.
height (inches) | weight (pounds) |
---|---|
65 | 150 |
70 | 170 |
72 | 175 |
68 | 160 |
71 | 170 |
# Load the data
df = pd.read_excel('data.xlsx')
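If data.xlsx is not at hand, an equivalent DataFrame can be built directly in code. This is a minimal sketch, assuming the columns are named height and weight (the names the later snippets index by) and using the five rows from the table above.

```python
import pandas as pd

# Same five rows as the table; column names are an assumption chosen to
# match the df['height'] and df['weight'] lookups used later on
df = pd.DataFrame({
    "height": [65, 70, 72, 68, 71],       # inches
    "weight": [150, 170, 175, 160, 170],  # pounds
})

print(df.head())
print(df.dtypes)
```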
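Next, we will split the data into two arrays: X and y. X contains the independent variable (height) and y contains the dependent variable (weight). The code assumes the columns in the DataFrame are named height and weight; if your file uses different headers, adjust the names accordingly.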
# Split the data into X (independent variable) and y (dependent variable)
X = df['height'].values.reshape(-1, 1)
y = df['weight'].values
It’s always a good idea to check the shape of the data to ensure that it has been loaded correctly. We can use the shape attribute to check the shape of the arrays X and y.
# Check the shape of the data
print(X.shape)
print(y.shape)
The output should show that X has n rows and 1 column and y has n rows, where n is the number of samples in the dataset.
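The reshape(-1, 1) call matters because scikit-learn estimators expect X as a 2-dimensional array of shape (n_samples, n_features). A short sketch with the five sample heights from the table, showing the shapes before and after reshaping (the names heights and X_demo are illustrative):

```python
import numpy as np

heights = np.array([65, 70, 72, 68, 71])

# A plain column of values is 1-dimensional
print(heights.shape)          # (5,)

# -1 lets NumPy infer the number of rows; 1 is the number of feature columns
X_demo = heights.reshape(-1, 1)
print(X_demo.shape)           # (5, 1)
```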
Perform simple cross validation
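A simple way to validate the model is to split the data into training and testing sets using the train_test_split function from the model_selection module of scikit-learn. Strictly speaking, this is a single hold-out split rather than k-fold cross-validation, but it serves the same purpose: estimating how the model performs on data it has not seen during training.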
In this example, the data is first split into the X data, which is the height of individuals, and the y data, which is their corresponding weight. Then, the train_test_split function is used to split the data into training and testing sets. The test_size argument specifies the proportion of the data to use for testing, and the random_state argument sets the seed for the random number generator used to split the data.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
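To make the effect of test_size concrete, and to contrast a hold-out split with k-fold cross-validation, here is a hedged sketch on the five table rows. The variable names are illustrative, and the choice of mean absolute error as the scoring metric is an assumption made because R^2 is not defined on a test fold that holds a single sample.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

X_demo = np.array([65, 70, 72, 68, 71], dtype=float).reshape(-1, 1)
y_demo = np.array([150, 170, 175, 160, 170], dtype=float)

# Hold-out split: 20% of 5 samples is 1 test sample, leaving 4 for training
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(X_tr.shape, X_te.shape)

# 5-fold cross-validation: each sample is used for testing exactly once.
# scikit-learn reports errors as negative scores, so the sign is flipped.
scores = cross_val_score(
    LinearRegression(), X_demo, y_demo,
    cv=5, scoring="neg_mean_absolute_error",
)
print("MAE per fold:", -scores)
```

With a dataset this small the fold scores vary a lot; on real data, averaging them gives a steadier estimate than a single split.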
Train the linear regression model
Now that we have split the data into training and testing sets, we can create a linear regression model using the LinearRegression class from the scikit-learn library. The same package also provides the logistic regression model used for classification.
# Create a linear regression model
reg = LinearRegression()
Next, we will fit the linear regression model to the data using the fit method.
# Fit the model to the data
reg.fit(X_train, y_train)
After fitting the model, we can access the intercept and coefficients using the intercept_ and coef_ attributes, respectively.
# Print the intercept and coefficients
print(reg.intercept_)
print(reg.coef_)
The intercept and coefficients represent the parameters b0 and b1 in the equation y = b0 + b1x, respectively.
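A small sketch of how these attributes map onto the equation, fitted here on all five table rows for illustration (so the numbers will differ from a model fitted only on the training split). Note that intercept_ is a single number while coef_ is an array with one entry per feature.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X_demo = np.array([65, 70, 72, 68, 71], dtype=float).reshape(-1, 1)
y_demo = np.array([150, 170, 175, 160, 170], dtype=float)

demo_reg = LinearRegression().fit(X_demo, y_demo)

b0 = demo_reg.intercept_   # scalar intercept
b1 = demo_reg.coef_[0]     # coef_ holds one slope per feature; there is one here

print(f"weight = {b0:.2f} + {b1:.2f} * height")
```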
Finally, we can use the predict method to make predictions for new data.
# Make predictions for new data
new_data = np.array([[65]])  # Height of 65 inches
prediction = reg.predict(new_data)
print(prediction)
This will output the predicted weight for a person with a height of 65 inches.
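To show what predict does under the hood, the sketch below compares its output with the equation evaluated by hand. It assumes a model fitted on all five table rows, and the new heights are hypothetical inputs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X_demo = np.array([65, 70, 72, 68, 71], dtype=float).reshape(-1, 1)
y_demo = np.array([150, 170, 175, 160, 170], dtype=float)
demo_reg = LinearRegression().fit(X_demo, y_demo)

# Hypothetical new inputs; predict() expects the same 2-D layout as training data
new_heights = np.array([[65.0], [69.0], [74.0]])
predicted = demo_reg.predict(new_heights)

# The same numbers by hand: y = b0 + b1 * x
manual = demo_reg.intercept_ + demo_reg.coef_[0] * new_heights[:, 0]

print(predicted)
print(np.allclose(predicted, manual))
```

Passing a 1-dimensional array such as np.array([65]) would raise an error, which is why each new height is wrapped in its own row.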
HINT: You can also use Seaborn to draw a line over a scatter plot of two variables. The snippet below draws a reference line with a fixed slope of 0.2 through the point (10, 2); to plot a fitted regression line instead, Seaborn provides the regplot and lmplot functions.
import seaborn as sns
tips = sns.load_dataset("tips")
g = sns.relplot(data=tips, x="total_bill", y="tip")
g.ax.axline(xy1=(10, 2), slope=.2, color="b", dashes=(5, 2))
Cost functions for linear regression models
There are several cost functions that can be used to evaluate the linear regression model. Here are a few common ones:
- Mean Squared Error (MSE): MSE is the average of the squared differences between the predicted values and the actual values. The lower the MSE, the better the fit of the model. MSE is expressed as:
MSE = (1/n) * Σ(y_i - y_i_pred)^2
where n is the number of samples, y_i is the actual value, and y_i_pred is the predicted value.
- Root Mean Squared Error (RMSE): RMSE is the square root of MSE. It is expressed as:
RMSE = √(1/n * Σ(y_i - y_i_pred)^2)
- Mean Absolute Error (MAE): MAE is the average of the absolute differences between the predicted values and the actual values. The lower the MAE, the better the fit of the model. MAE is expressed as:
MAE = 1/n * Σ|y_i - y_i_pred|
- R-Squared (R^2), a.k.a. the coefficient of determination: R^2 is a measure of the goodness of fit of the linear regression model. It is the proportion of the variance in the dependent variable that is predictable from the independent variables. A value of 1 indicates a perfect fit, and a value of 0 means the model does no better than always predicting the mean of y. On the training data R^2 lies between 0 and 1, but on held-out data it can be negative when the model predicts worse than the mean.
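The formulas above can be checked by hand with NumPy. The actual and predicted values below are hypothetical, and the R^2 line uses the standard definition 1 - SS_res / SS_tot, which is not spelled out in the list above.

```python
import numpy as np

# Hypothetical actual and predicted weights, for illustration only
y_true = np.array([150.0, 170.0, 175.0, 160.0])
y_hat = np.array([152.0, 168.0, 174.0, 163.0])

errors = y_true - y_hat

mse = np.mean(errors ** 2)        # average squared error
rmse = np.sqrt(mse)               # same units as y
mae = np.mean(np.abs(errors))     # average absolute error

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MSE={mse:.3f} RMSE={rmse:.3f} MAE={mae:.3f} R^2={r2:.3f}")
```

Because MSE squares each error, the single 3-pound miss contributes more to it than the two 2-pound misses do individually, whereas MAE weighs all errors in proportion to their size.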
In scikit-learn, these cost functions can be easily computed using the mean_squared_error, mean_absolute_error, and r2_score functions from the metrics module. For example:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Predict on the held-out test set with the fitted model
y_pred = reg.predict(X_test)

# Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

# Root Mean Squared Error (square root of MSE; works across scikit-learn versions)
rmse = mse ** 0.5
print("Root Mean Squared Error:", rmse)

# Mean Absolute Error
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error:", mae)

# R-Squared
r2 = r2_score(y_test, y_pred)
print("R-Squared:", r2)
These cost functions provide different perspectives on the performance of the linear regression model and can be used to choose the best model for a given problem.
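The five-row table is too small to evaluate meaningfully: a 20% test set would hold a single sample, and R^2 needs at least two. The sketch below therefore runs the whole workflow on synthetic data; the generating rule (weight ≈ -80 + 3.5 × height plus noise) and all variable names are assumptions chosen purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): weight ~ -80 + 3.5 * height + noise
rng = np.random.default_rng(0)
heights = rng.uniform(60, 76, size=100)
weights = -80 + 3.5 * heights + rng.normal(0, 5, size=100)

X_syn = heights.reshape(-1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, weights, test_size=0.2, random_state=0
)

# Fit on the training split only, then score on the unseen test split
syn_reg = LinearRegression().fit(X_tr, y_tr)
y_te_pred = syn_reg.predict(X_te)

mse = mean_squared_error(y_te, y_te_pred)
r2 = r2_score(y_te, y_te_pred)
print("MSE: ", mse)
print("RMSE:", np.sqrt(mse))
print("MAE: ", mean_absolute_error(y_te, y_te_pred))
print("R^2: ", r2)
```

The fitted slope should land near the 3.5 used to generate the data, and the RMSE should be in the neighborhood of the noise level.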
Conclusion
In this article, we learned how to create a linear regression model using the scikit-learn library in Python. We first split the data into X and y, created a linear regression model, fit the model to the data, and finally made predictions for new data.
Linear regression is a simple and powerful method for analyzing the relationship between variables. By using the scikit-learn library in Python, we can easily create and fit linear regression models to our data and make predictions.
Frequently Asked Questions about Linear Regression with Sklearn in Python
Which Python library is best for linear regression?
scikit-learn (sklearn) is one of the most widely used Python libraries for machine learning, and it is well suited to training models and making predictions. It offers several options for numerical calculations and statistical modelling. LinearRegression is the class in its linear_model module that performs linear regression.
What is linear regression used for?
Linear regression analysis is used to predict the value of a target variable based on the value of one or more independent variables. The variable you want to predict / explain is called the dependent or target variable. The variable you are using to predict the dependent variable's value is called the independent or feature variable.
What are the 2 most common models of regression analysis?
The most common models are simple linear and multiple linear. Nonlinear regression analysis is commonly used for more complicated data sets in which the dependent and independent variables show a nonlinear relationship. Regression analysis offers numerous applications in various disciplines.
What are the advantages of linear regression?
The biggest advantage of linear regression models is linearity: It makes the estimation procedure simple and, most importantly, these linear equations have an easy to understand interpretation on a modular level (i.e. the weights).
What is the difference between correlation and linear regression?
Correlation quantifies the strength of the linear relationship between a pair of variables, whereas regression expresses the relationship in the form of an equation.
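A brief sketch of the distinction, using the five table rows: the correlation is a single unitless number, while the regression yields an equation. In simple linear regression the two are linked by slope = r × std(y) / std(x), which the last line checks numerically.

```python
import numpy as np

x = np.array([65, 70, 72, 68, 71], dtype=float)
y = np.array([150, 170, 175, 160, 170], dtype=float)

# Pearson correlation: a unitless number between -1 and 1
r = np.corrcoef(x, y)[0, 1]

# Least-squares line: np.polyfit with degree 1 returns (slope, intercept)
slope, intercept = np.polyfit(x, y, 1)

print(f"correlation r = {r:.3f}")
print(f"regression: y = {intercept:.2f} + {slope:.2f} * x")

# Link between the two in simple linear regression
print(np.isclose(slope, r * y.std() / x.std()))
```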
What is LinearRegression in Sklearn?
LinearRegression fits a linear model with coefficients w = (w1, …, wp) to minimize the residual sum of squares between the observed targets in the dataset, and the targets predicted by the linear approximation.
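The minimization can be seen directly in a small sketch: compute the residual sum of squares (RSS) at the fitted parameters, then nudge each parameter and watch the RSS grow. The helper rss and the nudge sizes are illustrative choices, and the data is the five table rows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X_demo = np.array([65, 70, 72, 68, 71], dtype=float).reshape(-1, 1)
y_demo = np.array([150, 170, 175, 160, 170], dtype=float)
demo_reg = LinearRegression().fit(X_demo, y_demo)


def rss(intercept, slope):
    """Residual sum of squares for the line y = intercept + slope * x."""
    residuals = y_demo - (intercept + slope * X_demo[:, 0])
    return float(np.sum(residuals ** 2))


fitted_rss = rss(demo_reg.intercept_, demo_reg.coef_[0])
nudged_slope_rss = rss(demo_reg.intercept_, demo_reg.coef_[0] + 0.1)
nudged_intercept_rss = rss(demo_reg.intercept_ + 1.0, demo_reg.coef_[0])

# Moving either parameter away from the fitted value increases the RSS
print("fitted:        ", fitted_rss)
print("slope + 0.1:   ", nudged_slope_rss)
print("intercept + 1: ", nudged_intercept_rss)
```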
What is the full form of sklearn?
scikit-learn (also known as sklearn) is a free software machine learning library for the Python programming language.
What is the syntax for linear regression model in Python?
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X,y)
lr.score(X, y)
lr.predict(new_data)