Predicting Credit Risk by using PySpark ML and Docker Part-1

Analyzing a dataset about Credit risk

Credit risk is the possibility of a loss resulting from a borrower’s failure to repay a loan or meet contractual obligations. In essence, it is the risk that a lender may not receive the principal and interest owed. Higher risk implies a higher cost of borrowing, which makes this topic important for many people. In this article, we will analyze a dataset about the loan status of applicants and make predictions for new applications with different machine learning algorithms.

We will use Spark to analyze the dataset locally. We choose Spark because it scales well and integrates easily with other tools. Rather than processing the data on a single machine, Spark enables data practitioners to work on their machine learning problems interactively and at scale.

Machine Learning is a method of automating analytical model building by learning from data. In this article, we will use a binary classification algorithm with PySpark to make predictions. The algorithm is first trained on existing data, and this trained model then serves as the reference for new predictions.

Apache Spark

Spark is one of the most important Big Data technologies, providing a distributed data processing engine for a wide range of use cases. Spark includes libraries for stream processing, SQL, graph computation, and machine learning. It lets you write applications in Java, Scala, Python, and R. Apache Spark delivers high performance on both batch and streaming data, and it is one of the key actors for handling Machine Learning on Big Data. In Spark, you can split your data into partitions and then process those partitions in parallel across all the nodes in the cluster.


This article will first analyze a dataset locally and then implement binary classification algorithms on it. To implement our algorithms, we will use Jupyter Notebook via Docker.

What is Machine Learning?

Machine Learning is an application of artificial intelligence. It can be defined as the discipline that makes systems learn automatically and improve from experience. It is the science of teaching machines to learn by themselves. The learning process begins with training on data, searching for patterns so that the model can make better decisions in the future based on the examples we provided.


As we mentioned, Spark is an important actor in the Big Data field and Spark MLlib provides a fast machine learning capability.

In order to access Jupyter Notebook, we will use the following docker-compose.yml file:
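The compose file itself is embedded in the source article; a minimal sketch consistent with the image used below (jupyter/all-spark-notebook:latest on Jupyter’s default port 8888; the service name and volume mapping are assumptions) could look like this:

# a minimal sketch; only the image is taken from the article
version: '3'
services:
  notebook:
    image: jupyter/all-spark-notebook:latest
    ports:
      - "8888:8888"            # Jupyter's default port
    volumes:
      - ./:/home/jovyan/work   # mount the current folder into the container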

This .yml file will enable us to access Jupyter Notebook. First, open the Terminal and run $ docker-compose up in the same folder as the docker-compose.yml file (you can install docker-compose via https://docs.docker.com/compose/install/).

You can check your running Docker containers in the Terminal via $ docker ps, which shows the list of running containers. In this tutorial, we use the container image jupyter/all-spark-notebook:latest, and with its CONTAINER ID we type $ docker logs CONTAINER_ID in the Terminal:


In the logs you can find the token for accessing Jupyter Notebook via a browser. From that point on, you can access Jupyter Notebook and create a Python project.

For the binary classification model, we will use a dataset about credit risk. It includes several input variables, and the output will be a prediction of whether an applicant is eligible for the loan. To import the data, we will use PySpark.

First, we start the SparkSession:
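A minimal sketch of this step (the application name is an arbitrary choice, not taken from the article):

from pyspark.sql import SparkSession

# create (or reuse) a local SparkSession; the app name is arbitrary
spark = SparkSession.builder \
    .appName("CreditRiskAnalysis") \
    .getOrCreate()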

After we create the SparkSession, we are ready to read the local data:

# reading data via Spark
df = spark.read.csv("./DataFolder/*.csv", inferSchema=True, header=True, sep=",")

In this tutorial, we put two separate .csv files into one folder and read them together (the files must have the same schema and column types). Spark detects this automatically and treats them as a single dataset. If you analyze a single file, giving the path of that file is enough. To display the schema in Jupyter Notebook, we type:

df.printSchema()

Here is the schema of our dataset:

root
|-- Loan_ID: string (nullable = true)
|-- Gender: string (nullable = true)
|-- Married: string (nullable = true)
|-- Dependents: string (nullable = true)
|-- Education: string (nullable = true)
|-- Self_Employed: string (nullable = true)
|-- ApplicantIncome: integer (nullable = true)
|-- CoapplicantIncome: double (nullable = true)
|-- LoanAmount: integer (nullable = true)
|-- Loan_Amount_Term: integer (nullable = true)
|-- Credit_History: integer (nullable = true)
|-- Property_Area: string (nullable = true)
|-- Loan_Status: string (nullable = true)

At this point, the input variables are Loan_ID, Gender, Married, Dependents, Education, Self_Employed, ApplicantIncome, CoapplicantIncome, LoanAmount, Loan_Amount_Term, Credit_History, and Property_Area.

The output variable is Loan_Status, which represents what we want to predict.

In our dataset, some variables take only two or three distinct values. For example, let’s consider Property_Area. To show its distinct values:

df.select("Property_Area").distinct().show()

and the output:

+-------------+
|Property_Area|
+-------------+
| Urban|
| Semiurban|
| Rural|
+-------------+

Keeping these values as strings would be costly for our machine learning model. To improve prediction performance, we instead derive three integer-typed columns (Urban, Semiurban, Rural) from Property_Area, as sketched below.
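The exact transformation is not shown in the text; a minimal sketch that produces the three dummy columns visible in the schema below:

from pyspark.sql.functions import col, when

# one 0/1 column per Property_Area value
df = df.withColumn('Urban', when(col('Property_Area') == 'Urban', 1).otherwise(0))
df = df.withColumn('Semiurban', when(col('Property_Area') == 'Semiurban', 1).otherwise(0))
df = df.withColumn('Rural', when(col('Property_Area') == 'Rural', 1).otherwise(0))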

The other properties that have binary values are modified in the same way:

df = df.withColumn('Married', when(col('Married') == 'No', 0).otherwise(1))
df = df.withColumn('Education', when(col('Education') == 'Not Graduate', 0).otherwise(1))
df = df.withColumn('Self_Employed', when(col('Self_Employed') == 'No', 0).otherwise(1))

After these modifications, we have a new DataFrame with a new schema:

root
|-- Loan_ID: string (nullable = true)
|-- Gender: string (nullable = true)
|-- Married: integer (nullable = false)
|-- Dependents: string (nullable = true)
|-- Education: integer (nullable = false)
|-- Self_Employed: integer (nullable = false)
|-- ApplicantIncome: integer (nullable = true)
|-- CoapplicantIncome: double (nullable = true)
|-- LoanAmount: integer (nullable = true)
|-- Loan_Amount_Term: integer (nullable = true)
|-- Credit_History: integer (nullable = true)
|-- Property_Area: string (nullable = true)
|-- Loan_Status: string (nullable = true)
|-- Urban: integer (nullable = false)
|-- Semiurban: integer (nullable = false)
|-- Rural: integer (nullable = false)

Since Property_Area has been replaced by the three new columns, it has to be removed. Loan_ID is unique for every customer and plays no role in the prediction, so for performance it should also be removed:

drop_list = ['Loan_ID', 'Property_Area']
df = df.select([column for column in df.columns if column not in drop_list])

Another modification worth making before we implement the prediction is casting Dependents to an integer type. We first check the distinct values of Dependents with df.select("Dependents").distinct().show():

+----------+
|Dependents|
+----------+
| 1|
| 3|
| 2|
| 0|
+----------+

As you can observe, although it holds integer values, its type is string. For a better prediction, we cast it to integer and create a new DataFrame df, as sketched below:
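A minimal sketch of the cast (the original snippet is not shown in the text):

from pyspark.sql.functions import col

# cast the string column Dependents to an integer column
df = df.withColumn('Dependents', col('Dependents').cast('integer'))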

Now we have proper data for implementing machine learning algorithms. To get a statistical overview of the data, we gather the numerical features and describe them:

import pandas as pd

# gather numerical features
numerical_features = [t[0] for t in df.dtypes if t[1] == 'int' or t[1] == 'double']
df_numeric = df.select(numerical_features).describe().toPandas().transpose()

As another statistical step, we can produce a correlation matrix, which helps us understand the relationships among the features:
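The plotting code is not shown in the text; a minimal sketch that computes the matrix with pandas, reusing the numerical_features list from above:

# compute pairwise correlations of the numerical features via pandas
corr_matrix = df.select(numerical_features).toPandas().corr()
print(corr_matrix)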


Processing the Data

In Spark ML, all the features used for prediction must be numeric or must be converted into numeric form. Right now, our DataFrame has both categorical features stored as strings and numerical features stored as integers. The categorical features will first be encoded with StringIndexer and then, via one-hot encoding, made usable for algorithms that expect continuous features, such as Logistic Regression. Before this transformation, let's identify the categorical and numerical features:

# detect categorical columns (keep only the first one, Gender; Loan_Status is the label)
categorical_cols = [item[0] for item in df.dtypes if item[1].startswith('string')][:1]

# detect numerical columns
numerical_cols = [item[0] for item in df.dtypes if item[1].startswith('int') or item[1].startswith('double')]

We are still not ready to implement our machine learning algorithm. Detecting missing values is another important task in machine learning projects. Missing values can be regarded as the messiness of real life; the usual suspects include human errors during data entry, incorrect sensor readings, software bugs in the data processing pipeline, and so on.

The check below shows which features have missing values, and how many:
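A minimal sketch of such a check, counting nulls per column and keeping only the columns that have any (the original script is not shown in the text):

from pyspark.sql.functions import col

# count null values in each column; keep only columns with missing entries
missing = [(c, df.filter(col(c).isNull()).count()) for c in df.columns]
missing = [t for t in missing if t[1] > 0]
missing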

[('Gender', 24),
('Dependents', 76),
('LoanAmount', 27),
('Loan_Amount_Term', 20),
('Credit_History', 79)]

In this article, we replace the missing values with the mode (for categorical features) and the mean (for numerical features).

The script below splits the columns with missing values into categorical and numerical ones:
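A minimal sketch, reusing the missing list computed above:

# split the columns that have missing values by their data type
dtype_of = dict(df.dtypes)
categorical_miss = [c for c, n in missing if dtype_of[c] == 'string']
numerical_miss = [c for c, n in missing if dtype_of[c] in ('int', 'double')]
print('categorical columns_miss:', categorical_miss)
print('numerical columns_miss:', numerical_miss)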

categorical columns_miss: ['Gender']
numerical columns_miss: ['Dependents', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History']

Let’s fill in these missing values with the mode and mean value of each column:
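A minimal sketch of the imputation (the original snippet is embedded in the source article): numerical columns are filled with their rounded mean, and Gender with its most frequent value:

from pyspark.sql.functions import mean

# fill each numerical column with its (rounded) mean
means = df.select([mean(c).alias(c) for c in numerical_miss]).first().asDict()
df = df.fillna({c: int(means[c]) for c in numerical_miss})

# fill Gender with its mode, i.e. the most frequent non-null value
mode_gender = (df.na.drop(subset=['Gender'])
                 .groupBy('Gender').count()
                 .orderBy('count', ascending=False)
                 .first()['Gender'])
df = df.fillna({'Gender': mode_gender})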

Make the data usable for Machine Learning

At this point, we have no missing values in our dataset. As mentioned before, we will process the categorical columns with StringIndexer and OneHotEncoderEstimator. While StringIndexer assigns an index to each categorical value, OneHotEncoderEstimator converts the indexed columns into one-hot encoded vectors. After that, we put all the feature columns into a single vector via VectorAssembler. This process can be described as building the stages of the machine learning pipeline:
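A minimal sketch of these stages, reusing categorical_cols and numerical_cols from earlier (note that OneHotEncoderEstimator was renamed OneHotEncoder in Spark 3.x):

from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler

stages = []
for c in categorical_cols:
    # index the string column, then one-hot encode the index
    indexer = StringIndexer(inputCol=c, outputCol=c + '_index')
    encoder = OneHotEncoderEstimator(inputCols=[c + '_index'], outputCols=[c + '_vec'])
    stages += [indexer, encoder]

# index the label column Loan_Status as the numeric 'label'
label_indexer = StringIndexer(inputCol='Loan_Status', outputCol='label')
stages += [label_indexer]

# put all feature columns into a single vector column
assembler = VectorAssembler(
    inputCols=[c + '_vec' for c in categorical_cols] + numerical_cols,
    outputCol='features')
stages += [assembler]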

A similar script is also available on the Databricks website.

Conclusion

In this first part of the article, we transformed the credit-risk dataset into a form usable by machine learning algorithms and categorized the features. These preprocessing steps were visualized to give a better overview. We are now in the middle of the project. In the next part, we will implement an end-to-end classification model in PySpark.
