
Linear Regression in Python with scikit-learn: Easy Examples


Linear regression is a statistical method used for analyzing the relationship between a dependent variable and one or more independent variables. It is widely used in various fields, such as finance, economics, and engineering, to model the relationship between variables and make predictions. In this article, we will learn how to create a linear regression model using the scikit-learn library in Python.

Scikit-learn (also known as sklearn) is a popular Python library for machine learning that provides simple and efficient tools for data mining and data analysis. It provides a wide range of algorithms and models, including linear regression. In this article, we will use the sklearn library to create a linear regression model to predict the relationship between two variables.

Before we dive into the code, let’s first understand the basic concepts of linear regression.

Understanding Linear Regression

Linear regression is a supervised learning technique that models the relationship between a dependent variable (also known as the response variable or target variable) and one or more independent variables (also known as predictor variables or features). The goal of linear regression is to find the line of best fit that best predicts the dependent variable based on the independent variables.

In a simple linear regression, the relationship between the dependent variable and the independent variable is represented by the equation:

y = b0 + b1x

where y is the dependent variable, x is the independent variable, b0 is the intercept, and b1 is the slope.

The intercept b0 is the value of y when x is equal to zero, and the slope b1 represents the change in y for every unit change in x.
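This interpretation is easy to verify numerically. The sketch below uses made-up values b0 = 50 and b1 = 1.5 purely for illustration:

```python
# A tiny numeric illustration of intercept and slope (made-up numbers):
# suppose b0 = 50 and b1 = 1.5, so y = 50 + 1.5 * x
b0, b1 = 50.0, 1.5

y_at_0 = b0 + b1 * 0    # intercept: the value of y when x = 0
y_at_10 = b0 + b1 * 10
y_at_11 = b0 + b1 * 11  # one unit change in x adds b1 = 1.5 to y

print(y_at_0, y_at_11 - y_at_10)
```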

In multiple linear regression, the relationship between the dependent variable and multiple independent variables is represented by the equation:

y = b0 + b1x1 + b2x2 + ... + bnxn

where y is the dependent variable, x1, x2, …, xn are the independent variables, b0 is the intercept, and b1, b2, …, bn are the slopes.
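With multiple features, the prediction is just a dot product of the slopes with the feature vector plus the intercept. A minimal sketch, again with made-up coefficients:

```python
import numpy as np

# Hypothetical model y = b0 + b1*x1 + b2*x2 (numbers invented for illustration)
b0 = 1.0                    # intercept
b = np.array([2.0, 0.5])    # slopes b1, b2

x = np.array([3.0, 4.0])    # one sample with two features
y = b0 + np.dot(b, x)       # y = 1.0 + 2.0*3.0 + 0.5*4.0

print(y)
```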

Creating a Linear Regression Model in Python

Now that we have a basic understanding of linear regression, let’s dive into the code to create a linear regression model using the sklearn library in Python.

The first step is to import the necessary libraries and load the data. We will use the pandas library to load the data and the scikit-learn library to create the linear regression model.


import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

Next, we will load the data into a pandas DataFrame. In this example, we will use a simple dataset that contains the height and weight of a group of individuals. The data consists of two columns, the height in inches and the weight in pounds. The goal is to fit a linear regression model to this data to find the relationship between the height and weight of individuals. The data can be represented in a 2-dimensional array, where each row represents a sample (an individual), and each column represents a feature (height and weight). The X data is the height of individuals and the y data is their corresponding weight.

height (inches)    weight (pounds)
65                 150
70                 170
72                 175
68                 160
71                 170

Heights and Weights of Individuals for a Linear Regression Model Exercise
# Load the data
df = pd.read_excel('data.xlsx')
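If you don't have a data.xlsx file handy, the same five samples from the table above can be built directly in code:

```python
import pandas as pd

# Recreate the height/weight table from the article in a DataFrame
df = pd.DataFrame({
    "height": [65, 70, 72, 68, 71],       # inches
    "weight": [150, 170, 175, 160, 170],  # pounds
})
print(df.shape)
```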

Next, we will split the data into two arrays: X and y. X contains the independent variable (height) and y contains the dependent variable (weight).

# Split the data into X (independent variable) and y (dependent variable)
X = df['height'].values.reshape(-1, 1)
y = df['weight'].values

It’s always a good idea to check the shape of the data to ensure that it has been loaded correctly. We can use the shape attribute to check the shape of the arrays X and y.

# Check the shape of the data
print(X.shape)
print(y.shape)

The output should show that X has n rows and 1 column and y has n rows, where n is the number of samples in the dataset.

Perform simple cross validation

One common method for performing cross-validation on the data is to split the data into training and testing sets using the train_test_split function from the model_selection module of scikit-learn.

In this example, the data is first split into the X data, which is the height of individuals, and the y data, which is their corresponding weight. Then, the train_test_split function is used to split the data into training and testing sets. The test_size argument specifies the proportion of the data to use for testing, and the random_state argument sets the seed for the random number generator used to split the data.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Train the linear regression model

Now that we have split the data, we can create a linear regression model using the LinearRegression class from the scikit-learn library. The same sklearn.linear_model module also provides the LogisticRegression class for classification tasks.

# Create a linear regression model
reg = LinearRegression()

Next, we will fit the linear regression model to the data using the fit method.

# Fit the model to the data
reg.fit(X_train, y_train)

After fitting the model, we can access the intercept and coefficients using the intercept_ and coef_ attributes, respectively.

# Print the intercept and coefficients
print(reg.intercept_)
print(reg.coef_)

The intercept and coefficients represent the parameters b0 and b1 in the equation y = b0 + b1x, respectively.
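As a sanity check, a prediction can be recomputed by hand from these parameters. The sketch below fits the model on the five height/weight samples from the table above and confirms that b0 + b1*x matches the output of predict (up to floating-point error):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Fit on the five height/weight samples from the table above
X = np.array([65, 70, 72, 68, 71]).reshape(-1, 1)
y = np.array([150, 170, 175, 160, 170])
reg = LinearRegression().fit(X, y)

# A prediction is just b0 + b1 * x
manual = reg.intercept_ + reg.coef_[0] * 65
via_predict = reg.predict(np.array([[65]]))[0]
print(manual, via_predict)  # the two values agree
```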

Finally, we can use the predict method to make predictions for new data.

# Make predictions for new data
new_data = np.array([[65]]) # Height of 65 inches
prediction = reg.predict(new_data)
print(prediction)

This will output the predicted weight for a person with a height of 65 inches.

HINT: You can also use Seaborn to plot a linear regression line between two variables, as shown in the chart below.

import seaborn as sns

# Load Seaborn's built-in tips dataset and plot tip against total bill,
# with a manually drawn reference line
tips = sns.load_dataset("tips")
g = sns.relplot(data=tips, x="total_bill", y="tip")
g.ax.axline(xy1=(10, 2), slope=.2, color="b", dashes=(5, 2))

Scatter plot showing the relationship between two variables, viz. total bill amount and tip paid.

Cost functions for linear regression models

There are several cost functions that can be used to evaluate the linear regression model. Here are a few common ones:

  1. Mean Squared Error (MSE): MSE is the average of the squared differences between the predicted values and the actual values. The lower the MSE, the better the fit of the model. MSE is expressed as:
MSE = 1/n * Σ(y_i - y_i_pred)^2

where n is the number of samples, y_i is the actual value, and y_i_pred is the predicted value.

  2. Root Mean Squared Error (RMSE): RMSE is the square root of MSE. It is expressed as:
RMSE = √(1/n * Σ(y_i - y_i_pred)^2)
  3. Mean Absolute Error (MAE): MAE is the average of the absolute differences between the predicted values and the actual values. The lower the MAE, the better the fit of the model. MAE is expressed as:
MAE = 1/n * Σ|y_i - y_i_pred|
  4. R-Squared (R^2), a.k.a. the coefficient of determination: R^2 is a measure of the goodness of fit of the linear regression model. It is the proportion of the variance in the dependent variable that is predictable from the independent variable. The R^2 value typically ranges from 0 to 1, where a value of 1 indicates a perfect fit and a value of 0 indicates a poor fit.
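The formulas above can also be computed directly with NumPy, which is a useful way to check your understanding against scikit-learn's implementations. The actual and predicted values below are made up purely for illustration:

```python
import numpy as np

# Hypothetical actual and predicted values, just to illustrate the formulas
y_true = np.array([150., 170., 175., 160., 170.])
y_hat = np.array([152., 168., 174., 161., 171.])

mse = np.mean((y_true - y_hat) ** 2)          # 1/n * Σ(y_i - y_i_pred)^2
rmse = np.sqrt(mse)                           # square root of MSE
mae = np.mean(np.abs(y_true - y_hat))         # 1/n * Σ|y_i - y_i_pred|
r2 = 1 - np.sum((y_true - y_hat) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mse, rmse, mae, r2)
```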

In scikit-learn, these cost functions can be easily computed using the mean_squared_error, mean_absolute_error, and r2_score functions from the metrics module. For example:

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_pred = reg.predict(X_test)

# Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

# Root Mean Squared Error
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error:", rmse)

# Mean Absolute Error
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error:", mae)

# R-Squared
r2 = r2_score(y_test, y_pred)
print("R-Squared:", r2)

These cost functions provide different perspectives on the performance of the linear regression model and can be used to choose the best model for a given problem.

Conclusion

In this article, we learned how to create a linear regression model using the scikit-learn library in Python. We first split the data into X and y, created a linear regression model, fit the model to the data, and finally made predictions for new data.

Linear regression is a simple and powerful method for analyzing the relationship between variables. By using the scikit-learn library in Python, we can easily create and fit linear regression models to our data and make predictions.

Frequently Asked Questions about Linear Regression with Sklearn in Python

  1. Which Python library is best for linear regression?

    scikit-learn (sklearn) is one of the best Python libraries for statistical analysis and machine learning, and it is well suited for training models and making predictions. It offers several options for numerical calculations and statistical modelling. The LinearRegression class in its linear_model module is the key tool for linear regression modelling.

  2. What is linear regression used for?

    Linear regression analysis is used to predict the value of a target variable based on the value of one or more independent variables. The variable you want to predict / explain is called the dependent or target variable. The variable you are using to predict the dependent variable's value is called the independent or feature variable.

  3. What are the 2 most common models of regression analysis?

    The most common models are simple linear and multiple linear. Nonlinear regression analysis is commonly used for more complicated data sets in which the dependent and independent variables show a nonlinear relationship. Regression analysis offers numerous applications in various disciplines.

  4. What are the advantages of linear regression?

    The biggest advantage of linear regression models is linearity: It makes the estimation procedure simple and, most importantly, these linear equations have an easy to understand interpretation on a modular level (i.e. the weights).

  5. What is the difference between correlation and linear regression?

    Correlation quantifies the strength of the linear relationship between a pair of variables, whereas regression expresses the relationship in the form of an equation.

  6. What is LinearRegression in Sklearn?

    LinearRegression fits a linear model with coefficients w = (w1, …, wp) to minimize the residual sum of squares between the observed targets in the dataset, and the targets predicted by the linear approximation.

  7. What is the full form of sklearn?

    scikit-learn (also known as sklearn) is a free software machine learning library for the Python programming language.

  8. What is the syntax for linear regression model in Python?

    from sklearn.linear_model import LinearRegression
    lr = LinearRegression()
    lr.fit(X,y)
    lr.score(X, y)
    lr.predict(new_data)
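As a footnote to the correlation question above: for simple linear regression, the square of the Pearson correlation coefficient equals R^2. A minimal sketch using the height/weight samples from earlier in the article:

```python
import numpy as np

# Height/weight samples from the table earlier in the article
x = np.array([65., 70., 72., 68., 71.])
y = np.array([150., 170., 175., 160., 170.])

r = np.corrcoef(x, y)[0, 1]    # correlation: strength of the linear relationship

# Regression: fit y = b0 + b1*x by least squares
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print(r ** 2, r2)  # for simple regression these agree
```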
