Linear regression is a statistical method used for analyzing the relationship between a dependent variable and one or more independent variables. It is widely used in various fields, such as finance, economics, and engineering, to model the relationship between variables and make predictions. In this article, we will learn how to create a linear regression model using the scikit-learn library in Python.

Scikit-learn (also known as sklearn) is a popular Python library for machine learning that provides simple and efficient tools for data mining and data analysis. It provides a wide range of algorithms and models, including linear regression. In this article, we will use the sklearn library to create a linear regression model to predict the relationship between two variables.

Before we dive into the code, let’s first understand the basic concepts of linear regression.

## Understanding Linear Regression

Linear regression is a supervised learning technique that models the relationship between a dependent variable (also known as the response variable or target variable) and one or more independent variables (also known as predictor variables or features). The goal of linear regression is to find the line of best fit, that is, the line that most accurately predicts the dependent variable from the independent variables.

In a simple linear regression, the relationship between the dependent variable and the independent variable is represented by the equation:

`y = b0 + b1x`

where `y` is the dependent variable, `x` is the independent variable, `b0` is the intercept, and `b1` is the slope.

The intercept `b0` is the value of `y` when `x` is equal to zero, and the slope `b1` represents the change in `y` for every unit change in `x`.
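As a quick illustration of the equation, the sketch below evaluates `y = b0 + b1x` for a few inputs. The values of `b0` and `b1`, and the helper name `predict_simple`, are made up for this example and are not estimated from any data.

```python
# Hypothetical parameters, chosen only for illustration
b0 = 2.0   # intercept: value of y when x is 0
b1 = 0.5   # slope: change in y for every unit change in x


def predict_simple(x):
    """Evaluate the simple linear regression equation y = b0 + b1*x."""
    return b0 + b1 * x


for x in [0, 1, 2]:
    print(x, predict_simple(x))
```

At `x = 0` the function returns the intercept, and each step of 1 in `x` adds exactly `b1` to the result.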

In multiple linear regression, the relationship between the dependent variable and multiple independent variables is represented by the equation:

`y = b0 + b1x1 + b2x2 + ... + bnxn`

where `y` is the dependent variable, `x1`, `x2`, …, `xn` are the independent variables, `b0` is the intercept, and `b1`, `b2`, …, `bn` are the slopes.
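The multiple regression equation is a dot product plus an intercept, which the following sketch shows with NumPy. The parameter and feature values are hypothetical and serve only to illustrate the arithmetic.

```python
import numpy as np

# Hypothetical parameters for a model with three features
b0 = 1.0
b = np.array([0.5, -2.0, 3.0])   # b1, b2, b3

# One sample with feature values x1, x2, x3
x = np.array([4.0, 1.0, 2.0])

# y = b0 + b1*x1 + b2*x2 + b3*x3, written as a dot product
y = b0 + np.dot(b, x)
print(y)
```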

### Creating a Linear Regression Model in Python

Now that we have a basic understanding of linear regression, let’s dive into the code to create a linear regression model using the sklearn library in Python.

The first step is to import the necessary libraries and load the data. We will use the `pandas` library to load the data and the `scikit-learn` library to create the linear regression model.


```
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
```

Next, we will load the data into a pandas DataFrame. In this example, we will use a simple dataset that contains the height and weight of a group of individuals. The data consists of two columns: the height in inches and the weight in pounds. The goal is to fit a linear regression model to this data to find the relationship between height and weight. The data can be represented as a 2-dimensional array, where each row represents a sample (an individual) and each column represents a measured variable (height or weight). The `X` data is the height of the individuals and the `y` data is their corresponding weight.

| height (inches) | weight (pounds) |
|---|---|
| 65 | 150 |
| 70 | 170 |
| 72 | 175 |
| 68 | 160 |
| 71 | 170 |

```
# Load the data
df = pd.read_excel('data.xlsx')
```
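If you do not have a `data.xlsx` file at hand, the same five rows can be built directly in code. This sketch assumes the columns are named `height` and `weight`, matching the column names used in the rest of the article; a real spreadsheet may use different headers. Note also that reading `.xlsx` files with pandas typically requires the `openpyxl` package to be installed.

```python
import pandas as pd

# The five rows from the table, with assumed column names
df = pd.DataFrame({
    "height": [65, 70, 72, 68, 71],        # inches
    "weight": [150, 170, 175, 160, 170],   # pounds
})
print(df.shape)
```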

Next, we will split the data into two arrays: `X` and `y`. `X` contains the independent variable (height) and `y` contains the dependent variable (weight).

```
# Split the data into X (independent variable) and y (dependent variable)
X = df['height'].values.reshape(-1, 1)
y = df['weight'].values
```

It’s always a good idea to check the shape of the data to ensure that it has been loaded correctly. We can use the `shape` attribute to check the shape of the arrays `X` and `y`.

```
# Check the shape of the data
print(X.shape)
print(y.shape)
```

The output should show that `X` has `n` rows and 1 column and `y` has `n` rows, where `n` is the number of samples in the dataset.

## Perform simple cross validation

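A simple way to validate a model is to hold out part of the data: split it into training and testing sets using the `train_test_split` function from the `model_selection` module of `scikit-learn`. Strictly speaking, a single split like this is hold-out validation; k-fold cross-validation repeats the split several times and averages the results.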

In this example, the data is first split into the `X` data, which is the height of individuals, and the `y` data, which is their corresponding weight. Then, the `train_test_split` function is used to split the data into training and testing sets. The `test_size` argument specifies the proportion of the data to use for testing, and the `random_state` argument sets the seed for the random number generator used to split the data.

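```
from sklearn.model_selection import train_test_split

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```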
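To see what the split produces, the following self-contained sketch applies `train_test_split` to the five samples from the table and prints the resulting shapes. The arrays are typed in by hand here rather than loaded from a file.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# The five samples from the table: height as X, weight as y
X = np.array([65, 70, 72, 68, 71]).reshape(-1, 1)
y = np.array([150, 170, 175, 160, 170])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)
```

With only five rows, the test set holds a single sample, which is too small for a meaningful evaluation. In practice you would work with a larger dataset.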

## Train the linear regression model

Now that we have split the data into training and testing sets, we can create a linear regression model using the `LinearRegression` class from the `scikit-learn` library. The same `linear_model` module also provides the logistic regression model used for classification.

```
# Create a linear regression model
reg = LinearRegression()
```

Next, we will fit the linear regression model to the training data using the `fit` method.

```
# Fit the model to the data
reg.fit(X_train, y_train)
```

After fitting the model, we can access the intercept and coefficients using the `intercept_` and `coef_` attributes, respectively.

```
# Print the intercept and coefficients
print(reg.intercept_)
print(reg.coef_)
```

The intercept and coefficients represent the parameters `b0` and `b1` in the equation `y = b0 + b1x`, respectively.
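One way to convince yourself of this correspondence is to compute a prediction by hand from `intercept_` and `coef_` and compare it with the result of `predict`. The sketch below does this on the five samples from the table, fitting on all of them for simplicity; the variable names `manual` and `from_model` are chosen for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([65, 70, 72, 68, 71]).reshape(-1, 1)
y = np.array([150, 170, 175, 160, 170])

reg = LinearRegression().fit(X, y)

b0 = reg.intercept_   # scalar intercept
b1 = reg.coef_[0]     # coef_ is an array with one entry per feature

height = 65
manual = b0 + b1 * height                          # y = b0 + b1*x by hand
from_model = reg.predict(np.array([[height]]))[0]  # same value from the model
print(manual, from_model)
```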

Finally, we can use the `predict` method to make predictions for new data.

```
# Make predictions for new data
new_data = np.array([[65]])  # Height of 65 inches
prediction = reg.predict(new_data)
print(prediction)
```

This will output the predicted weight for a person with a height of 65 inches.

HINT: You can also use Seaborn to draw a line over a scatter plot of two variables. The example below draws a line with a hand-picked slope rather than a fitted one; Seaborn's `regplot` and `lmplot` functions fit and draw a regression line for you.

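```
import seaborn as sns

# Scatter plot of tip against total bill, using Seaborn's example dataset
tips = sns.load_dataset("tips")
g = sns.relplot(data=tips, x="total_bill", y="tip")
# Dashed line through (10, 2) with a slope of 0.2
g.ax.axline(xy1=(10, 2), slope=.2, color="b", dashes=(5, 2))
```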

## Cost functions for linear regression models

There are several cost functions that can be used to evaluate the linear regression model. Here are a few common ones:

- Mean Squared Error (MSE): MSE is the average of the squared differences between the predicted values and the actual values. The lower the MSE, the better the fit of the model. MSE is expressed as:

MSE = 1/n * Σ(y_i - y_i_pred)^2

where `n` is the number of samples, `y_i` is the actual value, and `y_i_pred` is the predicted value.

- Root Mean Squared Error (RMSE): RMSE is the square root of MSE. It is expressed as:

RMSE = √(1/n * Σ(y_i - y_i_pred)^2)

- Mean Absolute Error (MAE): MAE is the average of the absolute differences between the predicted values and the actual values. The lower the MAE, the better the fit of the model. MAE is expressed as:

MAE = 1/n * Σ|y_i - y_i_pred|

- R-Squared (R^2), a.k.a. the coefficient of determination: R^2 is a measure of the goodness of fit of the linear regression model. It is the proportion of the variance in the dependent variable that is predictable from the independent variables. On the training data of a least-squares fit with an intercept, R^2 ranges from 0 to 1, where 1 indicates a perfect fit and 0 indicates that the model does no better than always predicting the mean. On held-out data it can be negative when the model does worse than that baseline.
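The definitions above can be checked by computing each metric by hand with NumPy. The sketch below uses made-up actual and predicted values, and computes R^2 with the standard formula of one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

# Made-up actual and predicted values, for illustration only
y_true = np.array([150.0, 170.0, 175.0, 160.0])
y_pred = np.array([152.0, 168.0, 171.0, 160.0])

errors = y_true - y_pred

mse = np.mean(errors ** 2)       # average squared error
rmse = np.sqrt(mse)              # square root of the MSE
mae = np.mean(np.abs(errors))    # average absolute error

ss_res = np.sum(errors ** 2)                      # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2 = 1 - ss_res / ss_tot

print(mse, rmse, mae, r2)
```

Because RMSE and MAE are expressed in the same units as the target (pounds in the height and weight example), they are often easier to interpret than MSE.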

In `scikit-learn`, these cost functions can be easily computed using the `mean_squared_error`, `mean_absolute_error`, and `r2_score` functions from the `metrics` module. For example:

```
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Predict on the held-out test set with the fitted model
y_pred = reg.predict(X_test)

# Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

# Root Mean Squared Error (square root of the MSE)
rmse = np.sqrt(mse)
print("Root Mean Squared Error:", rmse)

# Mean Absolute Error
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error:", mae)

# R-Squared
r2 = r2_score(y_test, y_pred)
print("R-Squared:", r2)
```

These cost functions provide different perspectives on the performance of the linear regression model and can be used to choose the best model for a given problem.

### Conclusion

In this article, we learned how to create a linear regression model using the scikit-learn library in Python. We first split the data into `X` and `y`, held out a test set, created a linear regression model, fit the model to the training data, made predictions for new data, and evaluated the model with common metrics.

Linear regression is a simple and powerful method for analyzing the relationship between variables. By using the scikit-learn library in Python, we can easily create and fit linear regression models to our data and make predictions.

## Frequently Asked Questions about Linear Regression with Sklearn in Python

##### Which Python library is best for linear regression?

scikit-learn (sklearn) is one of the most widely used Python libraries for machine learning, and it is well suited to training models and making predictions. It offers several options for numerical calculations and statistical modelling. `LinearRegression` is the class in its `linear_model` module that performs linear regression modelling.

##### What is linear regression used for?

Linear regression analysis is used to predict the value of a target variable based on the value of one or more independent variables. The variable you want to predict / explain is called the dependent or target variable. The variable you are using to predict the dependent variable's value is called the independent or feature variable.

##### What are the 2 most common models of regression analysis?

The most common models are simple linear and multiple linear. Nonlinear regression analysis is commonly used for more complicated data sets in which the dependent and independent variables show a nonlinear relationship. Regression analysis offers numerous applications in various disciplines.

##### What are the advantages of linear regression?

The biggest advantage of linear regression models is linearity: It makes the estimation procedure simple and, most importantly, these linear equations have an easy to understand interpretation on a modular level (i.e. the weights).

##### What is the difference between correlation and linear regression?

Correlation quantifies the strength of the linear relationship between a pair of variables, whereas regression expresses the relationship in the form of an equation.
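The two are closely linked in the simple (one-variable) case, as the following sketch shows on the height and weight values from the article. It uses NumPy's `corrcoef` for the correlation and `polyfit` with degree 1 as a stand-in least-squares fit; the choice of these two functions is ours, not something the article prescribes.

```python
import numpy as np

x = np.array([65, 70, 72, 68, 71], dtype=float)       # height
y = np.array([150, 170, 175, 160, 170], dtype=float)  # weight

# Correlation: a single number between -1 and 1
r = np.corrcoef(x, y)[0, 1]

# Regression: an equation y = b0 + b1*x (polyfit returns the slope first)
b1, b0 = np.polyfit(x, y, 1)

# In simple linear regression, the slope equals r * (std of y / std of x)
slope_from_r = r * y.std() / x.std()
print(r, b0, b1, slope_from_r)
```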

##### What is LinearRegression in Sklearn?

LinearRegression fits a linear model with coefficients w = (w1, …, wp) to minimize the residual sum of squares between the observed targets in the dataset, and the targets predicted by the linear approximation.
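To illustrate what minimizing the residual sum of squares means in practice, the sketch below solves the same least-squares problem with NumPy's `np.linalg.lstsq` and compares the result with `LinearRegression`. The column of ones added to the design matrix is our way of including an intercept in the NumPy solution; it says nothing about how scikit-learn is implemented internally.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([65, 70, 72, 68, 71], dtype=float).reshape(-1, 1)
y = np.array([150, 170, 175, 160, 170], dtype=float)

# Add a column of ones so the least-squares solution includes an intercept
A = np.hstack([np.ones((X.shape[0], 1)), X])
solution, *_ = np.linalg.lstsq(A, y, rcond=None)
b0_lstsq, b1_lstsq = solution

reg = LinearRegression().fit(X, y)
print(b0_lstsq, reg.intercept_)
print(b1_lstsq, reg.coef_[0])
```

Both approaches should agree up to floating-point rounding, since they minimize the same quantity.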

##### What is the full form of sklearn?

sklearn is the short name used to import scikit-learn, a free software machine learning library for the Python programming language.

##### What is the syntax for linear regression model in Python?

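```
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X, y)           # estimate the intercept and coefficients
lr.score(X, y)         # R^2 of the fit on the given data
lr.predict(new_data)   # predictions for new samples
```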