Comparison of Random Forest, Logistic Regression, and Multilayer Perceptron Methods on Classification of Bank Customer Account Closure

A bank is a business entity that deals with money: it accepts deposits from customers, provides funds for each withdrawal, collects checks on customers' orders, grants credit, and invests excess deposits until they are required for repayment. The purpose of this research is to determine the influence of age, gender, country, credit score, number of bank products used, and active membership status on a customer's decision to retain or close a bank account. The data comprise 10,000 respondents from France, Spain, and Germany. The method used is data mining, beginning with preprocessing to clean the data of outliers and missing values, followed by feature selection to select the important attributes. Classification is then performed using three methods: Random Forest, Logistic Regression, and Multilayer Perceptron. The results show that the Multilayer Perceptron model with 10-fold cross-validation is the best model, with 85.5373% accuracy.


Introduction
The population of every country grows and changes over time, and the countries of continental Europe such as Germany, Spain, and France are no exception. A growing population brings growing needs that each person must meet. Financial management is an important factor in fulfilling those needs, which makes it a potential target for the rapidly growing financial industry. Banking is one of the industries that has made significant advances around the world. A bank is a business entity that deals with money: it accepts deposits from customers, provides funds for each withdrawal, collects checks on customers' orders, grants credit, and invests excess deposits until they are required for repayment [1]. Over time, competition between banks has become unavoidable. Each bank strives to offer a variety of products and services to attract people to become its customers. Customer loyalty, meaning that customers keep using a bank without closing their accounts, is the goal of every bank. The quality of the products and services offered plays a big role in increasing customer loyalty. The most frequently used bank product is credit, and easy credit application conditions influence a customer's choice of bank. Customers who trust a bank will use many of its products and become active members. In addition, the cultural factor of a country is [...]
[...] consists of a set of decision trees, where the set of trees is used to assign the data to a class [4]. Random Forest is a bagging method that generates a number of trees from the sample data; the construction of one tree during training does not depend on the previous tree, and the final decision is taken by majority vote [5]. In the bagging process, bootstrap resampling is used to generate multiple versions of the classification tree, which are then combined to obtain the final prediction.
In the Random Forest method, by contrast, randomization is applied both to the sample data and to the selection of independent variables, so that the classification trees generated have different sizes and shapes [6].
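The two sources of randomness described here, bootstrap resampling of the rows and random selection of the variables, together with majority voting, can be illustrated with a minimal sketch. It is not the implementation used in this research: one-split trees (stumps) stand in for full decision trees to keep the example short, and all function names and the parameters `n_trees` and `n_features` are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (bootstrap resampling)."""
    return [rng.choice(rows) for _ in rows]


def fit_stump(rows, feature_indices):
    """Fit a one-split tree using only the given features.

    Each row is (features, label). Returns
    (feature, threshold, left_label, right_label).
    """
    best = None
    for f in feature_indices:
        for threshold in sorted({x[f] for x, _ in rows}):
            left = [y for x, y in rows if x[f] <= threshold]
            right = [y for x, y in rows if x[f] > threshold]
            if not left or not right:
                continue
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    if best is None:
        # No valid split exists: predict the majority label everywhere.
        majority = Counter(y for _, y in rows).most_common(1)[0][0]
        return (feature_indices[0], float("inf"), majority, majority)
    return best[1:]


def predict_stump(stump, x):
    f, threshold, left_label, right_label = stump
    return left_label if x[f] <= threshold else right_label


def fit_forest(rows, n_trees, n_features, seed=0):
    """Each tree sees its own bootstrap sample and its own feature subset."""
    rng = random.Random(seed)
    all_features = list(range(len(rows[0][0])))
    forest = []
    for _ in range(n_trees):
        sample = bootstrap_sample(rows, rng)          # randomize the rows
        subset = rng.sample(all_features, n_features)  # randomize the variables
        forest.append(fit_stump(sample, subset))
    return forest


def predict_forest(forest, x):
    """Final decision: the class that receives the most votes."""
    votes = Counter(predict_stump(stump, x) for stump in forest)
    return votes.most_common(1)[0][0]
```

Because every tree is fitted independently of the others, the loop in `fit_forest` could run in any order or in parallel, which is the property that distinguishes bagging from sequential ensemble methods.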

2.3. Logistic regression.
Logistic regression is a regression model used when the response variable is qualitative [7]. Logistic regression is one of the most frequently used [...], which means that logistic regression describes a probability. By transforming $\pi(x)$ in equation (1) with the transformation of [...]
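As an illustration of why a logistic regression output can be read as a probability, the sketch below uses the textbook logistic function and its inverse, the logit (log-odds) transformation. This is the standard form and is assumed here; it is not copied from this paper's equation (1). The function names, coefficients, and inputs are hypothetical.

```python
import math


def logistic(z):
    """Map a linear predictor z to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Log-odds of a probability p; the inverse of logistic()."""
    return math.log(p / (1.0 - p))


def predict_probability(coefficients, intercept, x):
    """Probability of the positive class for one observation x."""
    z = intercept + sum(b * xi for b, xi in zip(coefficients, x))
    return logistic(z)
```

Applying `logit` to the output of `predict_probability` recovers the linear predictor, which is why the model is linear on the log-odds scale even though it is nonlinear on the probability scale.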

2.4. Multilayer Perceptron.
A multilayer perceptron (MLP), or multilayer artificial neural network (ANN), is a network composed of multiple layers of perceptrons. The algorithm for the multilayer perceptron is as follows:
1. The inputs are pushed forward through the MLP by taking the dot product of the input with the weights between the input layer and the hidden layer.
2. The result is passed through the activation function that the MLP applies at each calculated layer.
3. Once the calculated output at the hidden layer has been pushed through the activation function, it is pushed to the next layer in the MLP by taking the dot product with the corresponding weights.
4. Steps two and three are repeated until the output layer is reached.
5. At the output layer, the calculations are either used for a backpropagation algorithm that corresponds to the activation function selected for the MLP (in the case of training), or a decision is made based on the output (in the case of testing).
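The forward pass described above can be sketched as follows. This is a minimal illustration and not the network used in this research: the sigmoid activation, the bias terms, and all names are assumptions made for the example, and training by backpropagation is omitted.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases, activation):
    """Compute one layer: dot product with the weights, then the activation.

    weights[j] holds the weights feeding unit j of this layer.
    The bias terms are an assumption of this sketch.
    """
    return [
        activation(sum(w * x for w, x in zip(unit_weights, inputs)) + b)
        for unit_weights, b in zip(weights, biases)
    ]


def mlp_forward(inputs, layers, activation=sigmoid):
    """Push the inputs through every layer until the output layer is reached.

    layers is a list of (weights, biases) pairs, one pair per layer.
    """
    values = inputs
    for weights, biases in layers:
        values = layer_forward(values, weights, biases, activation)
    return values
```

At test time the returned output values would be turned into a decision, for example by thresholding a single output unit; at training time they would feed the backpropagation step.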

Results and Discussion
This section describes the data preprocessing and the comparative classification using the random forest, logistic regression, and multilayer perceptron methods.

3.1. Methods.
The dataset used in this research is data related to the direct marketing campaign of Portuguese banking, published by Sonu Jha on Kaggle [10]. Ten attributes are used as predictor variables of whether customers closed or retained their bank accounts: credit score, country of origin, gender, age, tenure, balance, number of bank products used, whether the customer holds a credit card, whether the customer is an active member of the bank, and estimated salary in dollars. The first step is data preprocessing, in which missing values and outliers (univariate and multivariate) are checked and treated if found. The next step is to select the attributes that are important for prediction and to build models that classify the response variable using three methods, namely random forest, logistic regression, and multilayer perceptron. The last stage is to compare the results of the three classification methods and identify the best model.

3.2. Data preprocessing.
No missing values were found in the data, so the analysis proceeded to outlier detection. Outlier detection was performed using the standard deviation, which assumes that the data follow a normal, symmetric distribution. Univariate outliers were found in credit score, customer age, and number of bank products used, with 8, 133, and 60 instances respectively. Before the outliers are treated, a statistical summary of these three attributes is examined. Looking at the summary statistics in Table 1 [...] [11]. Six important attributes were obtained for classification, namely credit score, country of customer origin, gender, age of the customer, number of bank products used, and whether the customer is an active member of the bank.
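A standard-deviation rule for univariate outliers can be sketched as below. The cutoff of three standard deviations is an assumption for illustration, since the threshold used in this research is not stated, and the function name and `n_std` parameter are hypothetical.

```python
import statistics


def find_outliers(values, n_std=3.0):
    """Return the indices of values farther than n_std standard
    deviations from the mean. n_std=3.0 is an assumed cutoff."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values are identical, so nothing can be flagged
    return [i for i, v in enumerate(values) if abs(v - mean) > n_std * std]
```

The rule is only meaningful when the attribute is roughly normal and symmetric; for strongly skewed attributes the mean and standard deviation are themselves distorted by the extreme values.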

3.3. Comparison of classification methods.
Once the predictor variables to be used were obtained, models were built to classify the response variable, namely bank customer account closure. The models were built with the Random Forest, logistic regression, and Multilayer Perceptron methods. The three models are run with 10-fold cross-validation (CV), so that predictions are made with different training folds. CV is a resampling procedure used to evaluate machine learning models on a limited data sample. The procedure has a single parameter, k, which is the number of groups into which the data sample is split. In this research, k = 10 was used for each machine learning model. Previously, data were divided into training [...] in training data and 82.0927% in testing data.
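The k-fold procedure can be sketched as follows: the data are split into k groups, each group serves once as the test set while the remaining groups form the training set, and the k accuracies are averaged. This is a minimal illustration with hypothetical names; `fit` and `predict` are placeholders for any of the three classifiers, and the sketch assumes there are at least k samples.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal groups
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(fit, predict, X, y, k=10, seed=0):
    """Average test accuracy over k folds.

    fit(X_train, y_train) returns a model; predict(model, x) returns a label.
    Assumes len(X) >= k so that no fold is empty.
    """
    accuracies = []
    for train, test in k_fold_indices(len(X), k, seed):
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k
```

Every observation is used for testing exactly once, so the averaged accuracy uses all of the data while still being measured only on observations the model did not see during fitting.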

3.4. Prediction.
The best model obtained is the multilayer perceptron, which is then applied to the overall data to produce the confusion matrix presented in Table 3. Based on the confusion matrix in Table 3, if the bank considers it worse for customers with a closed account to be predicted as retained than for customers with a retained account to be predicted as closed, then recall is the performance measure to use.
However, if the bank requires the opposite, then precision is the performance measure to use.
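The two measures can be computed from the confusion matrix counts as sketched below, treating the closed account as the positive class. The label strings and function names are hypothetical, and no values from Table 3 are reproduced here.

```python
def confusion_counts(actual, predicted, positive="closed"):
    """Count true/false positives and negatives for the positive class."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1  # closed, predicted closed
        elif a == positive:
            fn += 1  # closed, predicted retained
        elif p == positive:
            fp += 1  # retained, predicted closed
        else:
            tn += 1  # retained, predicted retained
    return tp, fp, fn, tn


def recall(tp, fn):
    """Share of actually closed accounts that the model identified."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    """Share of predicted-closed accounts that were actually closed."""
    return tp / (tp + fp) if (tp + fp) else 0.0
```

Recall penalizes closed accounts that are predicted as retained, while precision penalizes retained accounts that are predicted as closed, which is why the choice between them depends on which error the bank considers more costly.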

Conclusion
Ten attributes are available as predictor variables. However, based on the feature selection, only six are selected to classify bank customer account closure, namely credit score, customer's country of origin, gender, age of customer, number of bank products used, and whether the customer is an active member of the bank. The best method for classifying bank customer account closure is the Multilayer Perceptron, with a prediction accuracy of 85.5373%.