Logistic Regression is a widely used machine learning algorithm for solving binary classification problems like medical diagnosis, churn or fraud detection, intent classification and more. In this article, we’ll be covering how to implement a logistic regression model in Python using the scikit-learn (sklearn) library. In this article you will get started with logistic regression and familiarize yourself with the sklearn library.
Before diving into the implementation, let’s quickly understand what logistic regression is and what it’s used for.
What is Logistic Regression?
Logistic Regression is a statistical method for analyzing a dataset in which there are one or more independent variables that determine an outcome. The outcome is measured with a dichotomous variable (in which there are only two possible outcomes). It is used to predict a binary outcome (1/0, Yes/No, True/False) given a set of independent variables.
Applications of logistic regression for classification (binary)
Logistic Regression is a widely used machine learning algorithm for binary classification. It is used in many applications where the goal is to predict a binary outcome, such as:
- Medical Diagnosis: Logistic Regression can be used to diagnose a medical condition based on patient symptoms and other relevant factors.
- Customer Churn Prediction: Logistic Regression can be used to predict whether a customer is likely to leave a company based on their past behavior and other factors.
- Fraud Detection: Logistic Regression can be used to detect fraudulent transactions by identifying unusual patterns in transaction data.
- Credit Approval: Logistic Regression can be used to approve or reject loan applications based on a customer’s credit score, income, and other financial information.
- Marketing Campaigns: Logistic Regression can be used to predict the response to a marketing campaign based on customer demographics, past behavior, and other relevant factors.
- Image Classification: Logistic Regression can be used to classify images into different categories, such as animals, people, or objects.
- Natural Language Processing (NLP): Logistic Regression can be used for sentiment analysis in NLP, where the goal is to classify a text as positive, negative, or neutral.
These are some of the common applications of Logistic Regression for binary classification. The algorithm is simple to implement and can provide good results in many cases, making it a popular choice for binary classification problems.
Before getting started, make sure you have the following libraries installed in your environment:
You can install them by running the following command in your terminal/command prompt:
pip install numpy pandas scikit-learn
Importing the Libraries
The first step is to import the necessary libraries that we’ll be using in our implementation.
import numpy as np import pandas as pd from sklearn.linear_model import LogisticRegression from sklearn.model_selection import train_test_split from sklearn.preprocessing import StandardScaler
Become a Data Analyst with Work Experience
Loading the Dataset
Next, we’ll load the dataset using pandas. We’ll be using the
load_breast_cancer dataset from the
sklearn.datasets library. This dataset contains information about the cancer diagnosis of patients. The dataset includes features such as the mean radius, mean texture, mean perimeter, mean area, mean smoothness, mean compactness, mean concavity, mean concave points, mean symmetry, mean fractal dimension, radius error, texture error, perimeter error, area error, smoothness error, compactness error, concavity error, concave points error, symmetry error, and fractal dimension error. The target variable is a binary variable indicating whether the patient has a malignant tumor (represented by 0) or a benign tumor (represented by 1).
from sklearn.datasets import load_breast_cancer data = load_breast_cancer()
We’ll create a dataframe from the dataset and have a look at the first 5 rows to get a feel for the data.
df = pd.DataFrame(data.data, columns=data.feature_names) df.head()
Preprocessing the Data
Before we start building the model, we need to preprocess the data. We’ll be splitting the data into two parts: training data and testing data. The training data will be used to train the model and the testing data will be used to evaluate the performance of the model. We’ll use the
train_test_split function from the
sklearn.model_selection library to split the data.
X = df y = data.target X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
Next, we’ll normalize the data. Normalization is a crucial step in preprocessing the data as it ensures that all the features have the same scale, which is important for logistic regression. We’ll use the
StandardScaler function from the
sklearn.preprocessing library to normalize the data.
scaler = StandardScaler() X_train = scaler.fit_transform(X_train) X_test = scaler.transform(X_test)
Why do we need to scale data?
Scaling the data is important in many machine learning algorithms, including logistic regression, because the algorithms can be sensitive to the scale of the features. If one feature has a much larger scale than the other features, it can dominate the model and negatively affect its performance.
Scaling the data ensures that all the features are on a similar scale, which can help the model to better capture the relationship between the features and the target variable. By scaling the data, we can avoid issues such as domination of one feature over others, and reduce the computational cost and training time for the model.
In the example, we used the
StandardScaler class from the
sklearn.preprocessing library to scale the data. This class scales the data by subtracting the mean and dividing by the standard deviation, which ensures that the data has a mean of 0 and a standard deviation of 1. This is a commonly used method for scaling data in machine learning.
NOTE: In the interest of preventing information about the distribution of the test set leaking into your model, you should fit the scaler on your training data only, then standardize both training and test sets with that scaler. By fitting the scaler on the full dataset prior to splitting, information about the test set is used to transform the training set, which in turn is passed downstream. As an example, knowing the distribution of the whole dataset might influence how you detect and process outliers, as well as how you parameterize your model. Although the data itself is not exposed, information about the distribution of the data is. As a result, your test set performance is not a true estimate of performance on unseen data.
Building the Logistic Regression Model
Now that the data is preprocessed, we can build the logistic regression model. We’ll use the
LogisticRegression function from the
sklearn.linear_model library to build the model. The same package is also used to import and train the linear regression model. Know more here.
model = LogisticRegression() model.fit(X_train, y_train)
Evaluating the Model
We’ll evaluate the performance of the model by calculating its accuracy. Accuracy is defined as the ratio of correctly predicted observations to the total observations. We’ll use the
score method from the model to calculate the accuracy.
accuracy = model.score(X_test, y_test) print("Accuracy:", accuracy)
Now that the model is trained and evaluated, we can use it to make predictions on data that the model has not been trained on. We’ll use the
predict method from the model to make predictions.
y_pred = model.predict(X_test)
In this article, we covered how to build a logistic regression model using the sklearn library in Python. We preprocessed the data, built the model, evaluated its performance, and made predictions on new data. This should serve as a good starting point for anyone looking to get started with logistic regression and the sklearn library.
Frequently asked questions (FAQ) about logistic regression
What is logistic regression in simple terms?
Logistic regression is a statistical analysis method to predict a binary outcome, such as yes or no, based on prior observations of a data set. A logistic regression model predicts a dependent data variable by analyzing the relationship between one or more existing independent variables.
What is logistic regression vs linear regression?
Linear regression is utilized for regression tasks, while logistic regression helps accomplish classification tasks. Supervised machine learning is a widely used machine learning technique that predicts future outcomes or events. It uses labeled datasets i.e. datasets with a dependent variable, to learn and generate accurate predictions.
Which type of problem does logistic regression solve?
Logistic regression is the most widely used machine learning algorithm for classification problems. In its original form, it is used for binary classification problem which has only two classes to predict.
Why is logistic regression used in machine learning?
Logistic regression is applied to predict binary categorical dependent variable. In other words, it's used when the prediction is categorical, for example, yes or no, true or false, 0 or 1. The predicted probability or output of logistic regression can be either one of them.
How to evaluate the performance of a logistic regression model?
Logistic regression like classification models can be evaluated on several metrics including accuracy score, precision, recall, F1 score, and the ROC AUC.
What kind of model is logistic regression?
Logistic regression, despite its name, is a classification model. Logistic regression is a simple method for binary classification problems.
What type of variables is used in logistic regression?
There must be one or more independent variables, for a logistic regression, and one dependent variable. The independent variables can be continuous or categorical (ordinal/nominal) while the dependent variable must be categorical.