Kaggle Loan data binary classification

1. Introduction & Problem setting

혼자서 Kaggle의 다양한 Open data들 중 하나인 Loan를 갖고 Binary classification 해본 과정들을 담았습니다.

링크는 가장 하단 Reference에도 있고, 여기를 클릭하셔도 됩니다.

처음 Raw data를 그대로 사용해서 할 수 있는 분석은 10개의 입력변수(독립변수, X)들을 이용해서 3개의 범주를 가진 1개의 종속변수를 추정(예측, 분류)하는 것입니다.

여기서 종속변수(타겟변수, Y)는 loan_status로 3개 범주는 각각 다음과 같습니다.

PAIDOFF : 기한 내에 대출금 모두 상환

COLLECTION : Data 수집 당시까지 미납(연체)

COLLECTION_PAIDOFF : 기한은 지났지만 대출금 모두 상환

원래는 위 3가지 범주들을 분류하는 Multi classification problem인데, 저는 약간의 변형을 통해서 기한 내에 상환을 성공할 고객인지(Success), 실패할 고객인지(Failure)로 분류하는 이항분류(Binary classification)를 할 것 입니다.

데이터를 구성하는 총 11개 변수들에 대한 설명은 아래 표를 참고하시기 바랍니다.

변수명	해석(의미)
Loan_ID	대출한 고객의 고유한 ID
loan_status	이번 분석의 타겟 변수, 상환 여부를 나타냄
Principal	고객이 대출받은 금액
terms	대출금 지급까지 걸린 기간
effective_date	실제 계약 효과가 발휘하기 시작한 날짜
due_date	대출금 납부 기한 날짜
paid_off_time	고객이 은행에 모두 상환한 날짜, 시간
past_due_days	고객이 은행에 대출금을 모두 상환하는데 걸린 기간
age	고객의 나이
education	고객의 교육 수준
Gender	고객의 성별

2. 준비 작업

dplyr를 비롯한 분석에 사용할 여러 R package들을 장착하고 데이터를 불러오는 과정입니다.

2.1 Packages load, multiplot() function generation

# Data import 
library(readr)         # Data input with readr::read_csv()

# EDA : 탐색적 데이터 분석, 데이터 확인
library(VIM)           # Missing values with VIM::aggr()
library(descr)         # descr::CrossTable() - Factor data의 범주별 빈도수, 비율 확인 
library(DT)            # DT::datatable() - All data assesment with web chart 
library(corrplot)      # Correlation coefficient 확인 

# Visuallization : 시각화 
library(GGally)        # 모든 변수에 대한 다양한 시각화
library(ggplot2)       # Visuallization 
library(RColorBrewer)  # plot의 color 설정 
library(scales)        # plot setting - x, y 축 설정

# Feature engineering, Pre-processing : 데이터 전처리 
# library(tidyverse)   # ggplot2, dplyr, purrr, etc ...
library(dplyr)         # Used for almost all data handling 
library(lubridate)     # Time series data Pre-processing 

# Machine learning modeling : 기계학습 모델 생성
library(e1071)         # Support Vector Machine
library(rpart)         # Decision Tree
library(rpart.plot)    # Decision Tree plotting 
library(randomForest)  # Random Forest
library(glmnet)        # LASSO, Ridge 

# Model validation : 모델 검증, 성능 확인 
library(caret)         # confusionMatrix : 혼동행렬 
library(ROCR)          # ROC curve

multiplot <- function(..., plotlist = NULL, file, cols = 1, layout = NULL) {
  library(grid)
  
  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)
  
  numPlots = length(plots)
  
  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                     ncol = cols, nrow = ceiling(numPlots/cols))
  }
  
  if (numPlots==1) {
    print(plots[[1]])
    
  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
    
    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
      
      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}

2.2 Working directory setting, Raw data import

setwd("C:/Users/LG/Documents/Kaggle Competition/Loan")
getwd()

## [1] "C:/Users/LG/Documents/Kaggle Competition/Loan"

loan <- readr::read_csv("./kaggle_loan.csv")

2.3 Raw data’s Variable description table

summary(loan)

##    Loan_ID          loan_status          Principal          terms      
##  Length:500         Length:500         Min.   : 300.0   Min.   : 7.00  
##  Class :character   Class :character   1st Qu.:1000.0   1st Qu.:15.00  
##  Mode  :character   Mode  :character   Median :1000.0   Median :30.00  
##                                        Mean   : 943.2   Mean   :22.82  
##                                        3rd Qu.:1000.0   3rd Qu.:30.00  
##                                        Max.   :1000.0   Max.   :30.00  
##                                                                        
##  effective_date       due_date         paid_off_time      past_due_days  
##  Length:500         Length:500         Length:500         Min.   : 1.00  
##  Class :character   Class :character   Class :character   1st Qu.: 3.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :37.00  
##                                                           Mean   :36.01  
##                                                           3rd Qu.:60.00  
##                                                           Max.   :76.00  
##                                                           NA's   :300    
##       age         education            Gender         
##  Min.   :18.00   Length:500         Length:500        
##  1st Qu.:27.00   Class :character   Class :character  
##  Median :30.00   Mode  :character   Mode  :character  
##  Mean   :31.12                                        
##  3rd Qu.:35.00                                        
##  Max.   :51.00                                        
##

위의 summary() 결과를 보시면 우리가 예상한 것과는 다르게 나옵니다.

이유는 데이터를 불러올 때 readr::read_csv()로 불러올 때 범주형 변수와 문장형 변수 모두들 구별없이 Chracter속성으로 저장하기 때문입니다.

따라서 간단하게 변수들의 속성을 변환한 후 다시 summary()를 해보도록 하겠습니다.

loan <- loan %>% 
  mutate(Loan_ID        = factor(Loan_ID),
         loan_status    = factor(loan_status),
         effective_date = factor(effective_date),
         due_date       = factor(due_date),
         paid_off_time  = factor(paid_off_time),
         education      = factor(education),
         Gender         = factor(Gender))

summary(loan)

##         Loan_ID                loan_status    Principal     
##  xqd12160159:  1   COLLECTION        :100   Min.   : 300.0  
##  xqd20110409:  1   COLLECTION_PAIDOFF:100   1st Qu.:1000.0  
##  xqd20125284:  1   PAIDOFF           :300   Median :1000.0  
##  xqd20151038:  1                            Mean   : 943.2  
##  xqd20160003:  1                            3rd Qu.:1000.0  
##  xqd20160004:  1                            Max.   :1000.0  
##  (Other)    :494                                            
##      terms         effective_date       due_date            paid_off_time
##  Min.   : 7.00   9/10/2016: 46    10/10/2016:123   9/25/2016 9:00  :  9  
##  1st Qu.:15.00   9/11/2016:231    9/25/2016 : 87   9/26/2016 9:00  :  9  
##  Median :30.00   9/12/2016:148    10/11/2016: 69   10/10/2016 9:00 :  8  
##  Mean   :22.82   9/13/2016: 23    9/26/2016 : 60   10/13/2016 9:00 :  6  
##  3rd Qu.:30.00   9/14/2016: 33    10/9/2016 : 28   10/11/2016 16:00:  5  
##  Max.   :30.00   9/8/2016 :  4    10/13/2016: 21   (Other)         :363  
##                  9/9/2016 : 15    (Other)   :112   NA's            :100  
##  past_due_days        age                       education      Gender   
##  Min.   : 1.00   Min.   :18.00   Bechalor            : 67   female: 77  
##  1st Qu.: 3.00   1st Qu.:27.00   college             :220   male  :423  
##  Median :37.00   Median :30.00   High School or Below:209               
##  Mean   :36.01   Mean   :31.12   Master or Above     :  4               
##  3rd Qu.:60.00   3rd Qu.:35.00                                          
##  Max.   :76.00   Max.   :51.00                                          
##  NA's   :300

Time series인 effective_date, due_date 2개를 제외하고는 summary()의 결과가 올바르게 나옴을 알 수 있습니다. :)

위 summary()결과를 바탕으로 기록한 Loan data의 feature들에 대한 간단한 설명과 해설은 다음 표와 같습니다.

변수명	해석 (의미)	Type
Loan_ID	대출한 고객의 고유한 ID `xpdNNNNNNNN` 형태로 되어있으며 의미 해석이 어려워 보인다. 이번 분석에서는 사용하지 않는다.	Factor
loan_status	이번 분석의 타겟 변수이며 각 고객들의 상환 여부를 나타낸다. `PAIDOFF` : 기한 내에 대출금 모두 상환 `COLLECTION` : Data 수집 당시까지 미납(연체) `COLLECTION_PAIDOFF` : 기한은 지났지만 대출금 모두 상환 여기서 `PAIDOFF`를 성공(`Success`)으로 바꾸고 나머지 2개는 실패(`Failure`)로 바꿔준다.	Factor
Principal	고객이 대출받은 금액 300부터 1000까지 있다. 거의 모든 고객이 고액 대출(1000$)을 받았음을 알 수 있다.	Dbl
terms	대출금 지급까지 걸린 기간 7일(일주일)부터 30일(한 달)까지 있다. 히스토그램을 활용한 추가 확인이 필요해 보인다.	Dbl
effective_date	실제 계약 효과가 발휘하기 시작한 날짜 Month/Date/Year 형태 2016년 9월 8일부터 14일까지 7개 범주가 있다.	Date
due_date	대출금 납부 기한 날짜를 의미한다. Month/Date/Year 형태 `Date` 속성으로 변환한 후에 전체 기간을 확인해봐야 한다.	Date
paid_off_time	고객이 은행에 모두 상환한 날짜, 시간 Month/Date/Year Hour:Minute 형태로 되어있다. 이번 분석 주제와는 맞지 않으므로 사용하지 않는다.	X
past_due_days	고객이 은행에 대출금을 모두 상환하는데 걸린 기간 이번 분석 주제와는 맞지 않으므로 사용하지 않는다.	X
age	고객의 나이 18세 부터 51세 까지 있다.	Dbl
education	고객의 최종 학력 `High School or Below` : 고졸 `college` : 대졸 `Bechalor` : 석사 `Master or Above` : 박사 이상	Factor
Gender	고객의 성별 `male` : 남성 `female` : 여성	Factor

2.4 Simple operation

Chapter 2.3 Description of feature의 표 내용을 바탕으로 이번 Binary classification에 사용할 변수들만 선택, 추출한 후에 몇 가지 변수들의 속성을 다시 변환 하겠습니다.

loan <- loan %>% 
  # 이번 classification에 사용할 변수들만 선택, 추출
  select("loan_status", "Principal", "terms", "effective_date",
         "due_date", "age", "education", "Gender") %>% 
  # change features type
  # target feature인 loan_status를 2개 범주로 변환 
  mutate(loan_status = factor(ifelse(loan_status == "PAIDOFF", "Success", "Failure")),
         # Date 속성으로 변환할 변수들 - effective_date, due_date
         effective_date = lubridate::mdy(effective_date),
         due_date       = lubridate::mdy(due_date),
         # 학력과 성별도 범주형 변수로 변환
         education      = factor(education),
         Gender         = factor(Gender))

3. 탐색적 데이터 분석(EDA, Exploratory data analysis)

Feature engineering과 Data pre-processing 이전에 분석할 데이터에 대한 탐색(구조 파악, 결측치 유무 확인 등등)을 하는 과정입니다.

처음에는 간단한 function들을 활용해서 확인 할 것이고 후에는 여러가지 시각화(Visualization)를 활용할 것입니다.

3.1 summary()

summary(loan)

##   loan_status    Principal          terms       effective_date      
##  Failure:200   Min.   : 300.0   Min.   : 7.00   Min.   :2016-09-08  
##  Success:300   1st Qu.:1000.0   1st Qu.:15.00   1st Qu.:2016-09-11  
##                Median :1000.0   Median :30.00   Median :2016-09-11  
##                Mean   : 943.2   Mean   :22.82   Mean   :2016-09-11  
##                3rd Qu.:1000.0   3rd Qu.:30.00   3rd Qu.:2016-09-12  
##                Max.   :1000.0   Max.   :30.00   Max.   :2016-09-14  
##     due_date               age                       education  
##  Min.   :2016-09-15   Min.   :18.00   Bechalor            : 67  
##  1st Qu.:2016-09-25   1st Qu.:27.00   college             :220  
##  Median :2016-10-10   Median :30.00   High School or Below:209  
##  Mean   :2016-10-05   Mean   :31.12   Master or Above     :  4  
##  3rd Qu.:2016-10-11   3rd Qu.:35.00                             
##  Max.   :2016-11-12   Max.   :51.00                             
##     Gender   
##  female: 77  
##  male  :423  
##              
##              
##              
##

3.2 str()

str(loan)

## Classes 'tbl_df', 'tbl' and 'data.frame':    500 obs. of  8 variables:
##  $ loan_status   : Factor w/ 2 levels "Failure","Success": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Principal     : num  1000 1000 1000 1000 1000 300 1000 1000 1000 800 ...
##  $ terms         : num  30 30 30 15 30 7 30 30 30 15 ...
##  $ effective_date: Date, format: "2016-09-08" "2016-09-08" ...
##  $ due_date      : Date, format: "2016-10-07" "2016-10-07" ...
##  $ age           : num  45 50 33 27 28 35 29 36 28 26 ...
##  $ education     : Factor w/ 4 levels "Bechalor","college",..: 3 1 1 2 2 4 2 2 2 2 ...
##  $ Gender        : Factor w/ 2 levels "female","male": 2 1 1 2 1 2 2 2 2 2 ...

3.3 Missing values

Feature engineering 이전에 raw data에 결측치(Missing values, NA)가 존재하는지, 존재한다면 얼마나 존재하는지를 확인하는 과정입니다.

이 때 유용한 패키지가 VIM입니다. 아래와 같이 aggr()함수를 사용하면 Console과 Plots 창을 통해서 데이터의 변수별 결측치 빈도수를 확인할 수 있습니다.

낯설은 패키지와 함수일 것이라 판단해서 Package Name::Function Name() 형식으로 서술했으며 aggr()함수의 각 argument들에 대한 간단한 주석을 달아놨습니다.

이번 Loan 데이터에는 결측치가 없으므로, 결측치가 존재하는 데이터를 넣고 TRUE와 FALSE를 하나씩 바꿔가면서 사용해보시면 더욱 이해가 쉬우실 것이라 생각합니다.

VIM::aggr(loan,
          # 빈도수를 출력할 것인지 비율을 출력할 것인지 선택 / FALSE이므로 빈도수 출력
          prop = FALSE,
          # Missing value 있는 plot과 없는 plot 2개를 하나로 결합할지 안 할지 선택 
          combined = TRUE,
          # Plot에 정확한 Count number 출력 여부
          numbers = TRUE,
          # Colsole window에 Variable 별로 missing value count 출력할지 안 할지 설정
          sortVars = TRUE,
          # Combine(결합)한 plot의 output에서 NA갯수를 우선해서 출력할 것인지 설정
          sortCombs = TRUE)

## 
##  Variables sorted by number of missings: 
##        Variable Count
##     loan_status     0
##       Principal     0
##           terms     0
##  effective_date     0
##        due_date     0
##             age     0
##       education     0
##          Gender     0

3.4 Visualization

각 변수들의 빈도수, 비율, 분포 등을 확인하는 과정입니다.

3.4.1 loan_status

loan %>% 
  ggplot(aes(loan_status)) +
  geom_bar() + 
  labs(title = "Bar plot",
       subtitle = "대출금 상환에 성공한 고객과 실패한 고객은 각각 몇 명인가요?",
       caption = "Source: Kaggle Loan data")

정해진 기한 내에 대출금을 모두 상환하는데 성공한 고객이 300명, 실패한 고객이 200명임을 알 수 있습니다.

3.4.2 Principal

summary()에서 Principal이 300부터 1000까지 있음을 알아냈습니다. 그렇다면 고객들의 대출금이 각각 얼마인지 그 분포를 히스토그램으로 확인하도록 하겠습니다.

loan %>% 
  ggplot(aes(Principal)) + 
  geom_histogram(breaks = seq(from = 300, to = 1000, by = 10), # 간격 설정 
                 col    = "yellow",                 # 막대 경계선 색깔 
                 fill   = "Blue",                   # 막대 내부 색깔 
                 alpha  = .5) +                     # 막대 투명도 = 50% 
  labs(title = "First histogram of principal",
       subtitle = "고객들이 빌린 대출금은 얼마나 되나요?",
       caption = "Source: Kaggle Loan data")

300부터 1000까지 10을 간격으로 히스토그램을 그렸는데 최소 100씩 차이가 나는 것을 알 수 있습니다.

그러므로 dplyr패키지를 활용해서 Principal을 factor로 바꾼 뒤에 대출금별로 그룹화하고 빈도수를 구하는 것과 막대그래프를 그려보도록 하겠습니다.

loan %>% 
  # Double 속성인 Principal을 Factor로 변환 
  mutate(Principal = factor(Principal)) %>% 
  # Principal 별로 그룹화(Grouping, Clustering)
  group_by(Principal) %>% 
  # 빈도수 구한 후 Console에 Count로 출력 
  summarize(Count = n()) %>% 
  # 대출금 빈도수를 기준으로 내림차순 
  arrange(desc(Count))

## # A tibble: 6 x 2
##   Principal Count
##   <fct>     <int>
## 1 1000        377
## 2 800         111
## 3 300           6
## 4 500           3
## 5 900           2
## 6 700           1

loan %>% 
  mutate(Principal = factor(Principal)) %>% 
  group_by(Principal) %>% 
  summarize(Count = n()) %>% 
  ggplot(aes(Principal, Count)) +
  geom_col() +
  geom_text(aes(label = Count),        # Plot의 y에 해당하는 N(빈도수)를 매핑
            size = 5,                  # 글씨 크기 
            hjust = 0.5,               # 가로 (Horizontal, 수평) 위치 설정
            vjust = -0.3) +            # 세로 (Vertical, 수직) 위치 설정 
  labs(x = "Principal", y = "Count",
       title = "Bar plot by Principal",
       subtitle = "고객들이 빌린 대출금은 얼마나 되나요?",
       caption = "Source: Kaggle Loan data")

310, 270처럼 연속적인 분포를 갖고 있지 않으며 800과 1000을 대출받은 고객들이 가장 많음을 알 수 있습니다.

다음은 위에서 생성한 multiplot() function을 활용하여 대출금별로 상환에 성공한 고객들과 실패한 고객들의 빈도수와 비율을 한 화면에 같이 plotting 해보도록 하겠습니다.

principal.p1 <- loan %>% 
  # Dbl 속성인 Principal feature를 Factor 속성으로 변환 
  mutate(Principal = factor(Principal)) %>% 
  # ggplot aesthetic setting - x = Principal, y = loan_status
  ggplot(aes(Principal, fill = loan_status)) +
  # loan_status 별로 따로 빈도수 막대 그래프 그리게 설정 
  geom_bar(position = "dodge") + 
  # Bar 내부의 색상 설정 
  scale_fill_brewer(palette = "Set1") + 
  # Plots x, y axis, main title and sub title setting 
  labs(x = "Principal", y = "Count",
       title = "Frequency bar plot", subtitle = "대출금, 상환 여부 빈도수 막대 그래프")

principal.p2 <- loan %>% 
  mutate(Principal = factor(Principal)) %>% 
  ggplot(aes(Principal, fill = loan_status)) +
  geom_bar(position = "fill") + 
  scale_fill_brewer(palette = "Set1") +
  # Y axis output을 %로 나오게 설정 
  scale_y_continuous(labels = percent) +
  labs(x = "Principal", y = "Rate",
       title = "Ratio bar plot", subtitle = "대출금, 상환 여부 비율 막대 그래프",
       caption = "Source : Kaggle Loan data")

# Multiplots layout setting with matrix(), rep()
multi.lay = matrix(rep(c(1, 2), each = 2, times = 2), 2, 4, byrow = T)

# Plotting 
multiplot(principal.p1, principal.p2, layout = multi.lay)

상대적으로 고액인 800 이상을 대출 받은 고객들이 대출금 상환에 실패할 가능성이 더 높은 것을 알 수 있습니다.

800 이상을 대출받은 고객들을 High, 그 외는 Low로 범주화 시키면 Information loss가 발생할수도 있으므로 여기서는 그대로 사용합니다.

3.4.3 terms

terms는 고객이 은행과 계약한 이후에 대출금을 지급받기까지 걸린 시간을 나타내는 연속형 변수(feature) 입니다.

우선은 위 Principal과 유사하게 간격을 1씩 쪼개서 히스토그램을 그려보도록 하겠습니다.

loan %>% 
  ggplot(aes(terms)) +
  geom_histogram(breaks = seq(from = 7, to = 30, by = 1),
                 col = "yellow", fill = "blue", alpha = .8) + 
  labs(caption = "Source : Kaggle Loan data")

대출금을 지급받기까지 7일, 14일, 30일이 걸린 고객들만 존재함을 알 수 있습니다.

따라서 Principal과 유사하게 terms의 속성을 factor로 변환 해준 뒤 빈도수 막대 그래프와 비율 막대그래프를 그려보도록 하겠습니다.

단, 여기서는 그때 그때 속성을 바꿔서 plotting 하도록 하고 factor로 변환해서 다시 loan에 저장하는 것은 Chapter 4. Data Pre-processing에서 하도록 하겠습니다.

# Console window에 terms 범주별 빈도수 출력 
loan %>% 
  mutate(terms = factor(terms)) %>% 
  group_by(terms) %>% 
  summarise(Count = n()) %>% 
  arrange(desc(Count))

## # A tibble: 3 x 2
##   terms Count
##   <fct> <int>
## 1 30      272
## 2 15      207
## 3 7        21

# terms 1st visualization - 빈도수 막대 그래프 
terms.p1 <- loan %>% 
  mutate(terms = factor(terms)) %>% 
  ggplot(aes(terms, fill = loan_status)) + 
  geom_bar(position = "dodge") + 
  scale_fill_brewer(palette = "Set1") + 
  labs(x = "Terms", y = "Count",
       title = "Frequency bar plot by terms", subtitle = "범주별 성공/실패 빈도수 막대 그래프")

# terms 2nd visualization - 비율 막대 그래프 
terms.p2 <- loan %>% 
  mutate(terms = factor(terms)) %>% 
  ggplot(aes(terms, fill = loan_status)) +
  geom_bar(position = "fill") + 
  scale_fill_brewer(palette = "Set1") +
  scale_y_continuous(labels = percent) +
  labs(x = "Terms", y = "Rate",
       title = "Ratio bar plot by terms", subtitle = "범주별 성공/실패 비율 막대 그래프",
       caption = "Source : Kaggle Loan data")

multiplot(terms.p1, terms.p2, layout = multi.lay)

14일과 30일 걸린 고객들이 가장 많고 대출금 지급이 늦어질수록 기한 내에 상환할 가능성이 낮아짐을 알 수 있습니다.

3.4.4 effective_date

고객이 은행으로부터 대출금을 지급받아서 계약 효력이 발생한 날짜를 의미하며 2016년 9월 8일 목요일부터 2016년 9월 14일 수요일까지 7개 범주를 가진 변수입니다.

(plot이 잘리는 현상 때문에 이번에는 multiplot()을 사용하지 않았습니다.)

# 빈도수 막대 그래프 
loan %>% 
  ggplot(aes(effective_date, fill = loan_status)) + 
  geom_bar(position = "dodge") +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.6)) +
  scale_fill_brewer(palette = "Set1") +
  labs(x = "Effective date", y = "Count",
       title = "Frequency bar plot by effective date",
       subtitle = "2016-09-08 Thursday ~ 2016-09-14 Wednesday",
       caption = "Source : Kaggle Loan data")

# 비율 막대 그래프 
loan %>% 
  ggplot(aes(effective_date, fill = loan_status)) + 
  geom_bar(position = "fill") +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.6)) +
  scale_fill_brewer(palette = "Set1") +
  scale_y_continuous(labels = percent) + 
  labs(x = "Effective date", y = "Rate",
       title = "Ratio bar plot by effective date",
       subtitle = "2016-09-08 Thursday ~ 2016-09-14 Wednesday",
       caption = "Source : Kaggle Loan data")

일요일과 월요일에 대출금을 지급받은 고객이 가장 많고 주말에 가까울수록 (금, 토, 일, 월) 대출금 상환에 실패한 고객들이 많음을 알 수 있습니다.

3.4.5 due_date

대출금을 모두 상환하기로 계약한 날짜를 의미하는 변수입니다.

우선 빈도수 막대그래프로 500명의 고객들이 언제까지 상환하기로 했는지 확인해보겠습니다.

loan %>% 
  mutate(due_date = factor(due_date)) %>% 
  ggplot(aes(due_date, fill = loan_status)) +
  geom_bar(position = "dodge") +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.6)) +
  scale_fill_brewer(palette = "Set1") + 
  labs(title = "Frequency bar plot by due date", 
       subtitle = "2016.09.15.Thursday ~ 2016.11.12.Saturday", 
       caption = "Source: Kaggle Loan data")

별다른 규칙성(주기)는 보이지 않으며 빈도수도 일정하지 않습니다.

그렇다고 due_date를 버리기보다 9월, 10월, 11월 중에서 몇 월까지 갚아야 하는지를 나타내는 due_month와 무슨 요일까지 갚아야 하는지 나타내는 due_day 파생변수 2개를 만들어서 확인해보는 것이 좋아보입니다.

파생변수를 만드는 과정은 모두 Chapter 4. Data Pre-processing에 있으며, 파생변수들과 종속변수의 연관성을 확인하는 것은 Chapter 5. Second EDA에서 할 것 입니다.

3.4.6 age

age.p1 <- loan %>% 
  ggplot(aes(age, fill = loan_status)) +
  geom_bar() +
  scale_fill_brewer(palette = "Set1") + 
  labs(title = "Bar plot by Age 18 to 51")

age.p2 <- loan %>% 
  ggplot(aes(age, fill = loan_status)) +
  geom_density(alpha = .7) +
  scale_fill_brewer(palette = "Set1") + 
  labs(title = "Density plot by customer age",
       caption = "Source: Kaggle Loan data")

# age features multiplot layout setting 
multi.lay2 = matrix(rep(c(1, 2), each = 2), 2, 2, byrow = T)

# multiplot
multiplot(age.p1, age.p2, layout = multi.lay2)

3.4.7 education

4개 범주를 가진 범주형 변수입니다.

하지만 처음에 데이터를 받아서 확인했을 때 4개 범주 중에서 Bechalor의 뜻을 몰랐습니다.

그래서 다음과 같이 age, education 변수를 이용한 density plot을 그려서 그 의미를 파악했습니다.

loan %>% 
  ggplot(aes(age, fill = education)) +
  geom_density(alpha = .5) +
  labs(title = "Density plot of education by age",
       subtitle = "연령별 최종 학력은 어떤가?",
       caption = "Source : Kaggle Loan data")

위 density plot을 통해서 파악한 4개 학력은 각각 다음과 같습니다.

High School or Below : 고졸

college : 대졸

Bechalor : 석사

Master or Above : 박사 이상

하지만 여기서 education feature에서 학력 수준의 levels가 맞지 않습니다.

따라서 학력 순서에 맞게 다시 levels를 설정해준뒤에 학력별 상환율을 확인하겠습니다.

loan <- loan %>% 
  mutate(education = factor(education,
                            levels = c("High School or Below",  # 고졸 또는 그 이하 
                                       "college",               # 학부 졸업 
                                       "Bechalor",              # 석사 졸업 
                                       "Master or Above")))     # 박사 또는 그 이상

# 빈도수 막대그래프 - edu.p1 으로 저장 
edu.p1 <- loan %>% 
  ggplot(aes(education, fill = loan_status)) +
  geom_bar() +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.6)) +
  scale_fill_brewer(palette = "Set1") +
  labs(x = "Education", y = "Count",
       title = "Frequency bar plot by education")

# 비율 막대그래프 - edu.p2 로 저장 
edu.p2 <- loan %>% 
  ggplot(aes(education, fill = loan_status)) +
  geom_bar(position = "fill") + 
  theme(axis.text.x = element_text(angle = 45, vjust = 0.6)) +
  scale_fill_brewer(palette = "Set1") +
  scale_y_continuous(labels = percent) +
  labs(x = "Education", y = "Rate",
       title = "Ratio bar plot by education",
       caption = "Source: Kaggle Loan data")

multiplot(edu.p1, edu.p2, layout = multi.lay)

석, 박사 과정을 이수한 고객들의 빈도수가 현저히 적으며 학력이 높을수록 상환에 성공할 가능성도 높음을 알 수 있습니다.

하지만 범주명이 너무 길고, 학부 졸업과 석사 졸업의 상환율이 비슷함도 알 수 있는데 이를 이용해서 학부와 석사를 하나의 범주로 묶어서 4개 범주를 3개로 줄일 것입니다.

3.4.8 Gender

gender.p1 <- loan %>% 
  ggplot(aes(Gender, fill = loan_status)) +
  geom_bar() +
  scale_fill_brewer(palette = "Set1") +
  labs(x = "Gender", y = "Count",
       title = "Frequency bar plot by gender")

gender.p2 <- loan %>% 
  ggplot(aes(Gender, fill = loan_status)) +
  geom_bar(position = "fill") +
  scale_fill_brewer(palette = "Set1") +
  scale_y_continuous(labels = percent) +
  labs(x = "Gender", y = "Rate",
       title = "Ratio bar plot by gender",
       caption = "Source: Kaggle Loan data")

multiplot(gender.p1, gender.p2, layout = multi.lay)

남성 고객이 여성 고객보다 많지만 상환율은 여성 고객이 더 좋음을 알 수 있습니다.

하지만 남성, 여성 고객들의 무슨 차이점 때문인지 확인이 필요합니다.

만약에 남성, 여성 고객간에 큰 차이가 없다면 성별은 큰 의미가 없습니다.

4. 데이터 전처리 (Data Pre-processing)

4.1 terms -> Factor feature “Term”

terms는 Dbl 속성이지만 모든 고객들이 일주일, 격주, 한 달이라는 값들만 가지고 있습니다.

따라서 이것을 factor로 변환해서 Term이라는 파생변수를 만들 것 입니다.

바로 factor로 바꿔서 다시 terms에 저장하지 않는 이유는 연속형 변수 terms를 이용해서 상관계수를 확인할 것이기 때문입니다.

loan <- loan %>% 
  mutate(Term = factor(terms))

4.2 “YYYY-MM-DD” effective_date -> “MMDD.Day” effective_day

YYYY-MM-DD형태의 effective_date를 보면 연도(2016년)와 월(9월)은 같으면서 요일이 생략되어 있으므로 MMDD.Day 형태의 effective_day 파생변수를 만들어 사용합니다.

loan <- loan %>%
  # `effective_date` feature의 날짜를 월일.요일 형식으로 수정한 `effective_day` 파생변수 생성
  mutate(effective_day = case_when(effective_date == "2016-09-08" ~ "0908.Thu",
                                   effective_date == "2016-09-09" ~ "0909.Fri",
                                   effective_date == "2016-09-10" ~ "0910.Sat",
                                   effective_date == "2016-09-11" ~ "0911.Sun",
                                   effective_date == "2016-09-12" ~ "0912.Mon",
                                   effective_date == "2016-09-13" ~ "0913.Tue",
                                   effective_date == "2016-09-14" ~ "0914.Wed"),
         # 그대로 두면 chr 속성이기 때문에 factor로 변환
         effective_day = factor(effective_day))

4.3 due_date -> due_month & due_day

상환 기한을 나타내는 due_date를 시각화해서 확인했을때는 별다른 주기성(반복, 규칙)을 확인하지 못 했습니다.

그리고 고객들의 기한이 일정 날짜에만 몰려있고 고른 분포를 보이지도 않기 때문에 lubridate패키지를 이용하여 due_date에서 상환을 약속한 월(Month)과 요일(Day)을 추출한 후 각각 due_month와 due_day라는 파생변수를 만들도록 하겠습니다.

loan <- loan %>% 
  # 9월, 10월, 11월 이라는 범주형 변수로 사용할 것이기 때문에 추출한 후에 factor()로 변환
  mutate(due_month = factor(lubridate::month(due_date)),
         # 무슨 요일인지 추출하는 wday function에서 label = T로 주면 알아서 labeling을 해줍니다.
         due_day = lubridate::wday(due_date, label = T))

4.4 due_date - effective_date -> length.due

대출금을 지급받은 날짜(effetive_date)부터 상환하기로 한 날짜(due_date)까지의 기간을 나타내는 파생변수 length.due 를 만드는 과정입니다.

기한 내에 모든 대출금을 상환하기 어려울수록 지급일부터 상환 계약일 까지의 기간이 길지 않을까 하는 생각에 만들게 되었습니다.

due_date에서 effetive_date를 뺀 이유는 상환까지 얼마나 남았는지는 양수(Positive Value)인데 두 날짜 중에서 대출금을 상환하는 것이 더 미래의 일이기 때문입니다.

예를 들어 군대 전역까지 몇 일 남았는지 계산 할 때 전역일에서 지금 날짜를 빼는 것과 같습니다.

loan <- loan %>% 
  # `effective_date` 부터 `due_date`까지 기간을 추출해서 `length.due` 생성
  mutate(length.due = lubridate::days(.$due_date - .$effective_date)@day)

4.5 education -> 3 labels feature education

4개 범주였던 education feature를 3개 범주로 줄이면서 각 범주의 이름도 변환하는 과정입니다.

여기서는 dplyr 패키지의 case_when() 함수를 사용했습니다.

loan <- loan %>%
  mutate(education = dplyr::case_when(education == "High School or Below" ~ "Low",
                                      education %in% c("college", "Bechalor") ~ "Normal",
                                      education == "Master or Above" ~ "High"),
         # Chr 속성인 education을 factor로 변환하면서 순서에 맞게 levels 수정 
         education = factor(education, levels = c("Low", "Normal", "High")))

4.6 Variable description table after data pre-processing

원본 데이터에서 사용한 변수들과 전처리를 통해 만들어진 파생변수들에 대한 설명을 담은 표 입니다.

바로 아래 summary() 결과와 같이 보시면 이해하기시 더 쉬울 것 같습니다.

summary(loan)

##   loan_status    Principal          terms       effective_date      
##  Failure:200   Min.   : 300.0   Min.   : 7.00   Min.   :2016-09-08  
##  Success:300   1st Qu.:1000.0   1st Qu.:15.00   1st Qu.:2016-09-11  
##                Median :1000.0   Median :30.00   Median :2016-09-11  
##                Mean   : 943.2   Mean   :22.82   Mean   :2016-09-11  
##                3rd Qu.:1000.0   3rd Qu.:30.00   3rd Qu.:2016-09-12  
##                Max.   :1000.0   Max.   :30.00   Max.   :2016-09-14  
##                                                                     
##     due_date               age         education      Gender    Term    
##  Min.   :2016-09-15   Min.   :18.00   Low   :209   female: 77   7 : 21  
##  1st Qu.:2016-09-25   1st Qu.:27.00   Normal:287   male  :423   15:207  
##  Median :2016-10-10   Median :30.00   High  :  4                30:272  
##  Mean   :2016-10-05   Mean   :31.12                                     
##  3rd Qu.:2016-10-11   3rd Qu.:35.00                                     
##  Max.   :2016-11-12   Max.   :51.00                                     
##                                                                         
##   effective_day due_month due_day    length.due   
##  0908.Thu:  4   9 :211    일:119   Min.   : 6.00  
##  0909.Fri: 15   10:267    월:187   1st Qu.:14.00  
##  0910.Sat: 46   11: 22    화: 85   Median :29.00  
##  0911.Sun:231             수: 31   Mean   :23.94  
##  0912.Mon:148             목: 33   3rd Qu.:29.00  
##  0913.Tue: 23             금: 12   Max.   :59.00  
##  0914.Wed: 33             토: 33

변수명	Type	설명
loan_status	Factor	이번 분석의 타겟 변수 각 고객들의 상환 여부를 의미함. `Success` : due_date까지 대출금을 모두 상환하는데 성공 `Failure` : due_date까지 대출금을 모두 상환하는데 실패
Principal	Dbl	고객이 대출받은 금액 300부터 1000까지 있으며 대다수의 고객들이 1000을 대출받았다. 또한 대출금은 311, 740 같이 1단위 혹은 10단위가 아니라 100단위로 존재한다.
terms	Dbl	대출금 지급까지 걸린 기간 모든 고객들이 1주(7), 2주(14), 한 달(30)이라는 세 가지 값 중에서 하나만 갖고 있다. 세 개 범주를 갖고 있는 파생 변수 `Terms`를 만드는데 사용함.
effective_date	Date	실제 계약 효과가 발생하기 시작한 날짜 Year-Month-Date 형태이며 2016년 9월 8일부터 14일까지 7개 값이 있다. `lubridate` 패키지를 사용해서 `Date`속성으로 변환했고 월일.요일형태의 범주형 파생 변수 effective_day를 만드는데 사용함.
due_date	Date	대출금 상환 날짜를 의미함. Year-Month-Date 형태이며 2016년 9월 15일부터 11월 12일까지 존재함. `lubridate` 패키지를 사용해서 `Date`속성으로 변환했고 length.due라는 파생변수를 만드는 데 사용함.
age	Dbl	고객의 나이 18세 부터 51세 까지 존재함.
education	Factor	고객의 최종 학력을 나타내며 아래 3개 범주를 가짐. `Low` : 고졸 혹은 그 미만 `Normal` : 학사, 석사 졸업 `High` : 박사 혹은 그 이상
Gender	Factor	고객의 성별 `male` : 남성 `female` : 여성
Term	Factor	대출금 지급까지 걸린 기간을 나타내는 terms를 범주형으로 변환시킨 파생변수이면서 실제로 모델 생성에 사용한 입력 변수
effective_day	Factor	계약 효과가 발생하기 시작한 날짜 effective_date에서 월, 일, 요일만 추출한 뒤 만든 파생변수 7개 범주를 갖고 있다.
due_month	Factor	due_date를 시각화 했을 때 별다른 `insight`를 찾지 못해서 만든 파생변수 고객이 대출금을 갚기로 한 월(Month)이 언제인지를 나타내며 9월, 10월, 11월 세 개 범주를 갖고 있다.
due_day	Factor	due_date와 `lubridate` 패키지를 사용해서 만든 파생변수 고객이 대출금을 갚기로 한 요일(Day)이 언제인지를 나타내며 7개 범주를 갖고 있다.
length.due	Dbl	effective_date부터 due_date까지의 기간(간격)을 나타냄. `lubridate` 패키지를 사용해서 만든 파생변수

5. Second EDA, Feature selection

전처리를 통해서 만든 파생변수와 변환된 변수들이 종속변수와 어떤 연관이 있는지 확인하는 과정입니다.

5.1 due_date, due_month, due_day

loan %>% 
  mutate(due_date = factor(due_date)) %>% 
  ggplot(aes(due_date, fill = loan_status)) +
  geom_bar(position = "dodge") +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.6)) +
  scale_fill_brewer(palette = "Set1") + 
  labs(title = "Frequency bar plot by due date", 
       subtitle = "2016.09.15.Thursday ~ 2016.11.12.Saturday",
       caption = "Source: Kaggle Loan data")

due.p1 <- loan %>% 
  ggplot(aes(due_month, fill = loan_status)) +
  geom_bar(position = "dodge") +
  scale_fill_brewer(palette = "Set1") + 
  labs(title = "Frequency bar plot by due month")

due.p2 <- loan %>% 
  ggplot(aes(due_day, fill = loan_status)) +
  geom_bar(position = "dodge") +
  scale_fill_brewer(palette = "Set1") + 
  labs(title = "Frequency bar plot by due day")

due.p3 <- loan %>% 
  ggplot(aes(due_month, fill = loan_status)) +
  geom_bar(position = "fill") +
  scale_fill_brewer(palette = "Set1") + 
  scale_y_continuous(labels = percent) +
  labs(y = "Rate", title = "Ratio bar plot by due month")

due.p4 <- loan %>% 
  ggplot(aes(due_day, fill = loan_status)) +
  geom_bar(position = "fill") +
  scale_fill_brewer(palette = "Set1") +
  scale_y_continuous(labels = percent) +
  labs(y = "Rate", title = "Ratio bar plot by due day")

# age features multiplot layout setting 
due.lay = matrix(1:4, 2, 2, byrow = T)

# multiplot
multiplot(due.p1, due.p2, due.p3, due.p4, layout = due.lay)

5.2 education

기존에 4개 범주이던걸 3개 범주로 바꾼 뒤에 종속변수와 얼마나 연관이 있는지 확인하는 과정입니다.

edu.p1 <- loan %>% 
  ggplot(aes(education, fill = loan_status)) +
  geom_bar(position = "dodge") +
  scale_fill_brewer(palette = "Set1") +
  labs(x = "Education", y = "Count",
       title = "Frequency bar plot by education")

# 비율 막대그래프 - edu.p2 로 저장 
edu.p2 <- loan %>% 
  ggplot(aes(education, fill = loan_status)) +
  geom_bar(position = "fill") + 
  scale_fill_brewer(palette = "Set1") +
  scale_y_continuous(labels = percent) +
  labs(x = "Education", y = "Rate",
       title = "Ratio bar plot by education",
       caption = "Source: Kaggle Loan data")

multiplot(edu.p1, edu.p2, layout = multi.lay)

5.3 length.due

ldue.p1 <- loan %>% 
  mutate(length.due = factor(length.due)) %>% 
  ggplot(aes(length.due, fill = loan_status)) +
  geom_bar(position = "dodge") +
  scale_fill_brewer(palette = "Set1") +
  labs(x = "effective_date ~ due_date", y = "Count",
       title = "Frequency bar plot by length.due")

# 비율 막대그래프 - edu.p2 로 저장 
ldue.p2 <- loan %>% 
  mutate(length.due = factor(length.due)) %>% 
  ggplot(aes(length.due, fill = loan_status)) +
  geom_bar(position = "fill") + 
  scale_fill_brewer(palette = "Set1") +
  scale_y_continuous(labels = percent) +
  labs(x = "effective_date ~ due_date", y = "Rate",
       title = "Ratio bar plot by length.due",
       caption = "Source: Kaggle Loan data")

multiplot(ldue.p1, ldue.p2, layout = multi.lay2)

5.4 Feature selection

지금까지의 과정들을 바탕으로 Classification Model을 생성하는데 사용할 변수들을 선택, 추출하도록 하겠습니다.

주의점은 모델 생성 후에 입력변수들을 바꿀 수도 있으니(변수 선택을 다르게 한다는 의미) 다른 객체명으로 저장해야 한다는 것입니다.

real.loan <- loan %>% 
  # 성공/실패 순으로 Positive, Negative 순서가 지정되게 해야 Confusion matrix 보기가 편합니다.
  mutate(loan_status = factor(loan_status, levels = c("Success", "Failure"))) %>% 
  select("loan_status",   # Dependent variable
         # 9th Independent variables
         "Principal", "Term", "effective_day", "due_month", "due_day",
         "age", "education", "Gender", "length.due")

6. Model generation and validation

6.1 Data set split

바로 모델을 생성하지 않고 가장 먼저, 학습 데이터(Training)와 검증 데이터(Validation)를 각각 7:3 비율로 Random sampling 합니다.

Random sampling과 모델을 생성할 때는 재현성(Reproducibility)을 위해서 Seed number를 지정해줘야 합니다.

set.seed(1901)   # YYMM - 2번째로 다시 분석 시작한 년, 월 

# 전체 관측치 갯수 추출 
n <- nrow(real.loan)
ind <- 1:n

# 전체 데이터에서 70%의 index number 먼저 추출 
training.ind <- sample(ind, n*.70)
# Training data index를 제외 - 비복원 추출하기 위해서 
ind <- setdiff(ind, training.ind)
# 남은 index에서 validation data index 추출 
validation.ind <- sample(ind, n*.30)

# 위 2개의 index들을 이용해서 training : validation = 7 : 3 비율로 split 
training <- real.loan[training.ind, ]
validation <- real.loan[validation.ind, ]

6.2 Support Vector Machine

SVM은 e1071패키지의 svm() 함수를 이용해서 만들 수 있으며 kernel이라는 argument에 따라 다시 linear, polynomial, sigmoid로 나뉩니다.

그러므로 3개 SVM 모델을 모두 만든 후에 정확도(Accuracy), 특이도, 민감도를 비교해서 검증하는 과정을 보여드리겠습니다.

6.2.1 Model generation

# Linear SVM
set.seed(1901)

svm.linear <- e1071::svm(loan_status ~., data = training, kernel = "linear")

# Sigmoid SVM
set.seed(1901)

svm.sigmoid <- e1071::svm(loan_status ~., data = training, kernel = "sigmoid")

# Poly SVM 
set.seed(1901)

svm.poly <- e1071::svm(loan_status ~., data = training, kernel = "poly")

6.2.2 Validation

# Linear SVM
pred.svm.linear <- predict(object = svm.linear, newdata = validation, type = 'class')

confusionMatrix(data = pred.svm.linear, reference = validation$loan_status,
                positive = "Success")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Success Failure
##    Success      53      21
##    Failure      40      36
##                                           
##                Accuracy : 0.5933          
##                  95% CI : (0.5102, 0.6727)
##     No Information Rate : 0.62            
##     P-Value [Acc > NIR] : 0.77628         
##                                           
##                   Kappa : 0.1893          
##  Mcnemar's Test P-Value : 0.02119         
##                                           
##             Sensitivity : 0.5699          
##             Specificity : 0.6316          
##          Pos Pred Value : 0.7162          
##          Neg Pred Value : 0.4737          
##              Prevalence : 0.6200          
##          Detection Rate : 0.3533          
##    Detection Prevalence : 0.4933          
##       Balanced Accuracy : 0.6007          
##                                           
##        'Positive' Class : Success         
##

# Sigmoid SVM
pred.svm.sigmoid <- predict(object = svm.sigmoid, newdata = validation, type = 'class')

confusionMatrix(data = pred.svm.sigmoid, reference = validation$loan_status,
                positive = "Success")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Success Failure
##    Success      91      53
##    Failure       2       4
##                                           
##                Accuracy : 0.6333          
##                  95% CI : (0.5508, 0.7104)
##     No Information Rate : 0.62            
##     P-Value [Acc > NIR] : 0.4029          
##                                           
##                   Kappa : 0.0589          
##  Mcnemar's Test P-Value : 1.562e-11       
##                                           
##             Sensitivity : 0.97849         
##             Specificity : 0.07018         
##          Pos Pred Value : 0.63194         
##          Neg Pred Value : 0.66667         
##              Prevalence : 0.62000         
##          Detection Rate : 0.60667         
##    Detection Prevalence : 0.96000         
##       Balanced Accuracy : 0.52434         
##                                           
##        'Positive' Class : Success         
##

# Poly SVM
pred.svm.poly <- predict(object = svm.poly, newdata = validation, type = 'class')

confusionMatrix(data = pred.svm.poly, reference = validation$loan_status,
                positive = "Success")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Success Failure
##    Success      93      57
##    Failure       0       0
##                                           
##                Accuracy : 0.62            
##                  95% CI : (0.5372, 0.6979)
##     No Information Rate : 0.62            
##     P-Value [Acc > NIR] : 0.5362          
##                                           
##                   Kappa : 0               
##  Mcnemar's Test P-Value : 1.195e-13       
##                                           
##             Sensitivity : 1.00            
##             Specificity : 0.00            
##          Pos Pred Value : 0.62            
##          Neg Pred Value :  NaN            
##              Prevalence : 0.62            
##          Detection Rate : 0.62            
##    Detection Prevalence : 1.00            
##       Balanced Accuracy : 0.50            
##                                           
##        'Positive' Class : Success         
##

6.3 Decision Tree

6.3.1 Model generation

set.seed(1901)

loan.tree <- rpart::rpart(formula = loan_status ~ ., data = training,
                          method = 'class', parms = list(split = 'gini'),
                          control = rpart.control(minsplit = 20, cp = 0.01, maxdepth = 10))

6.3.2 Validation

pred.tr <- predict(object = loan.tree,
                   newdata = validation, type = 'class')

# 혼동행렬을 사용하여 모형의 분류 성능을 파악합니다. 
confusionMatrix(data = pred.tr, reference = validation$loan_status, positive = 'Success')

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Success Failure
##    Success      65      33
##    Failure      28      24
##                                           
##                Accuracy : 0.5933          
##                  95% CI : (0.5102, 0.6727)
##     No Information Rate : 0.62            
##     P-Value [Acc > NIR] : 0.7763          
##                                           
##                   Kappa : 0.122           
##  Mcnemar's Test P-Value : 0.6085          
##                                           
##             Sensitivity : 0.6989          
##             Specificity : 0.4211          
##          Pos Pred Value : 0.6633          
##          Neg Pred Value : 0.4615          
##              Prevalence : 0.6200          
##          Detection Rate : 0.4333          
##    Detection Prevalence : 0.6533          
##       Balanced Accuracy : 0.5600          
##                                           
##        'Positive' Class : Success         
##

6.3.3 Visualization

# Plotting 
print(x = loan.tree)

## n= 350 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 350 143 Success (0.59142857 0.40857143)  
##     2) effective_day=0908.Thu,0913.Tue,0914.Wed 43   0 Success (1.00000000 0.00000000) *
##     3) effective_day=0909.Fri,0910.Sat,0911.Sun,0912.Mon 307 143 Success (0.53420195 0.46579805)  
##       6) effective_day=0912.Mon 95  33 Success (0.65263158 0.34736842)  
##        12) due_month=9,10 87  28 Success (0.67816092 0.32183908) *
##        13) due_month=11 8   3 Failure (0.37500000 0.62500000) *
##       7) effective_day=0909.Fri,0910.Sat,0911.Sun 212 102 Failure (0.48113208 0.51886792)  
##        14) Term=7 11   1 Success (0.90909091 0.09090909) *
##        15) Term=15,30 201  92 Failure (0.45771144 0.54228856)  
##          30) age>=29.5 106  52 Success (0.50943396 0.49056604)  
##            60) Gender=female 14   4 Success (0.71428571 0.28571429) *
##            61) Gender=male 92  44 Failure (0.47826087 0.52173913)  
##             122) age< 31.5 17   6 Success (0.64705882 0.35294118) *
##             123) age>=31.5 75  33 Failure (0.44000000 0.56000000)  
##               246) effective_day=0910.Sat 10   4 Success (0.60000000 0.40000000) *
##               247) effective_day=0909.Fri,0911.Sun 65  27 Failure (0.41538462 0.58461538)  
##                 494) education=Normal 36  18 Success (0.50000000 0.50000000)  
##                   988) age< 37.5 24   9 Success (0.62500000 0.37500000) *
##                   989) age>=37.5 12   3 Failure (0.25000000 0.75000000) *
##                 495) education=Low 29   9 Failure (0.31034483 0.68965517) *
##          31) age< 29.5 95  38 Failure (0.40000000 0.60000000) *

rpart.plot::rpart.plot(x = loan.tree, main = 'Decision Tree before pruning')

6.3.4 Pruning

printcp(x = loan.tree)

## 
## Classification tree:
## rpart::rpart(formula = loan_status ~ ., data = training, method = "class", 
##     parms = list(split = "gini"), control = rpart.control(minsplit = 20, 
##         cp = 0.01, maxdepth = 10))
## 
## Variables actually used in tree construction:
## [1] age           due_month     education     effective_day Gender       
## [6] Term         
## 
## Root node error: 143/350 = 0.40857
## 
## n= 350 
## 
##         CP nsplit rel error xerror     xstd
## 1 0.039627      0   1.00000 1.0000 0.064311
## 2 0.025641      3   0.88112 1.0280 0.064571
## 3 0.018648      6   0.80420 1.0140 0.064445
## 4 0.013986      9   0.74825 1.0210 0.064509
## 5 0.010000     10   0.73427 1.0699 0.064895

plotcp(x = loan.tree)
abline(h = min(loan.tree$cptable[, 'xerror']), col = 'red', lty = 2)

# xerror이 가장 낮을 때의 비용복잡도 파라미터를 구합니다. 
bestCP <- loan.tree$cptable[which.min(loan.tree$cptable[ , 'xerror']), 'CP']

# 본격적인 가지치기(pruning) 진행 
loan.tree.prn <- prune.rpart(tree = loan.tree, cp = bestCP)

6.3.5 Visualization and Validation after pruning

print(x = loan.tree.prn)

## n= 350 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 350 143 Success (0.5914286 0.4085714) *

rpart.plot(x = loan.tree.prn, main = 'Decision Tree after pruning')

# 가지치기 후 Decision Tree validation 
pred.tr.prn <- predict(object = loan.tree.prn,
                       newdata = validation, type = 'class')

# 혼동행렬을 사용하여 모형의 분류 성능을 파악합니다. 
confusionMatrix(data = pred.tr.prn, reference = validation$loan_status,
                positive = 'Success')

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Success Failure
##    Success      93      57
##    Failure       0       0
##                                           
##                Accuracy : 0.62            
##                  95% CI : (0.5372, 0.6979)
##     No Information Rate : 0.62            
##     P-Value [Acc > NIR] : 0.5362          
##                                           
##                   Kappa : 0               
##  Mcnemar's Test P-Value : 1.195e-13       
##                                           
##             Sensitivity : 1.00            
##             Specificity : 0.00            
##          Pos Pred Value : 0.62            
##          Neg Pred Value :  NaN            
##              Prevalence : 0.62            
##          Detection Rate : 0.62            
##    Detection Prevalence : 1.00            
##       Balanced Accuracy : 0.50            
##                                           
##        'Positive' Class : Success         
##

과적합(Overfitting)이 일어난 것인지 처음에 들어간 Data에서의 성공/실패 외의 모든 가지가 날아갔습니다. 때문에 Decision Tree model은 사용하지 못 합니다.

6.4 Random Forest

6.4.1 Model generation

set.seed(1901)

loan.rf <- randomForest(loan_status ~ ., data = training,
                        importance = T, ntree = 2000)

pred.rf <- predict(object = loan.rf,
                   newdata = validation, type = 'class')

# 혼동행렬을 사용하여 모형의 분류 성능을 파악합니다.
confusionMatrix(data = pred.rf,
                reference = validation$loan_status, positive = "Success")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Success Failure
##    Success      74      28
##    Failure      19      29
##                                           
##                Accuracy : 0.6867          
##                  95% CI : (0.6059, 0.7598)
##     No Information Rate : 0.62            
##     P-Value [Acc > NIR] : 0.05366         
##                                           
##                   Kappa : 0.3141          
##  Mcnemar's Test P-Value : 0.24324         
##                                           
##             Sensitivity : 0.7957          
##             Specificity : 0.5088          
##          Pos Pred Value : 0.7255          
##          Neg Pred Value : 0.6042          
##              Prevalence : 0.6200          
##          Detection Rate : 0.4933          
##    Detection Prevalence : 0.6800          
##       Balanced Accuracy : 0.6522          
##                                           
##        'Positive' Class : Success         
##

6.4.2 Validation

6.4.3 Feature importance

Loan데이터를 이용해서 만든 Random Forest 모델에서 변수의 중요도를 확인 해봅니다.

varImpPlot(loan.rf)

6.4.4 2nd Random Forest model generation

training2 <- training[, -2]   # Principal 제외 
validation2 <- validation[, -2]

set.seed(1901)

loan.rf.2 <- randomForest(loan_status ~ .,
                          data = training2,
                          importance = T, ntree = 2000)

pred.rf.2 <- predict(object = loan.rf.2,
                     newdata = validation2, type = 'class')

# 혼동행렬을 사용하여 모형의 분류 성능을 파악합니다.
confusionMatrix(data = pred.rf.2,
                reference = validation2$loan_status, positive = "Success")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Success Failure
##    Success      75      34
##    Failure      18      23
##                                           
##                Accuracy : 0.6533          
##                  95% CI : (0.5714, 0.7291)
##     No Information Rate : 0.62            
##     P-Value [Acc > NIR] : 0.22545         
##                                           
##                   Kappa : 0.222           
##  Mcnemar's Test P-Value : 0.03751         
##                                           
##             Sensitivity : 0.8065          
##             Specificity : 0.4035          
##          Pos Pred Value : 0.6881          
##          Neg Pred Value : 0.5610          
##              Prevalence : 0.6200          
##          Detection Rate : 0.5000          
##    Detection Prevalence : 0.7267          
##       Balanced Accuracy : 0.6050          
##                                           
##        'Positive' Class : Success         
##

feature를 변환해도 기대했떤 만큼의 예측력(정확도)는 나오지 않았음을 알 수 있습니다.

6.5 LASSO

6.5.1 Model generation

# 모형 행렬 생성 : 절편항은 필요없으므로 모형식에 -1 을 설정
loan.model.matrix <- model.matrix(loan_status ~ .-1 , real.loan)

# 입력변수
x <- loan.model.matrix[training.ind, ]

# 반응변수
y <- ifelse(training$loan_status == "Success", 1, 0)

# 실제값 - Validation 에서 사용 
real <- ifelse(validation$loan_status == "Success", 1, 0)

# Seed number setting 
set.seed(1901)

# Model generation 
loan.glmnet <- glmnet(x, y)

# loan.glmnet model Cross Validation 
loan.cvfit <- cv.glmnet(x, y, family = "binomial")

# 예측력이 가장 좋은 lambda를 찾습니다. - 여기서 기준은 이항편차(Binomial Deviance) 입니다.
plot(loan.cvfit)

6.5.2 Validation

# Prediction with LASSO model - output은 Matrix 
pred.glmnet <- predict(loan.cvfit,        # Cross Validation을 거친 LASSO model 
                       s = "lambda.1se",  # Binomial Deviance가 가장 낮을 때의 lambda 
                       newx = loan.model.matrix[validation.ind, ],
                       type = "response")

pred.glmnet <- pred.glmnet[, 1] # Vector로 변환 

pred.glmnet <- pred.glmnet %>%
  # pred.glmnet라는 Vector를 data frame 으로 변환 
  dplyr::tbl_df() %>% 
  # 변수명 변경 
  rename(pred = 1) %>% 
  # 확률값으로 나왔던 pred 값을 0.6 기준으로 성공/실패로 분류 
  mutate(pred = factor(ifelse(pred > 0.6, "Success", "Failure"),
                       levels = c("Success", "Failure")))

# confusionMatrix()
confusionMatrix(data = pred.glmnet$pred,
                reference = validation$loan_status,
                positive = "Success")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Success Failure
##    Success      36       3
##    Failure      57      54
##                                          
##                Accuracy : 0.6            
##                  95% CI : (0.5169, 0.679)
##     No Information Rate : 0.62           
##     P-Value [Acc > NIR] : 0.7234         
##                                          
##                   Kappa : 0.2826         
##  Mcnemar's Test P-Value : 7.795e-12      
##                                          
##             Sensitivity : 0.3871         
##             Specificity : 0.9474         
##          Pos Pred Value : 0.9231         
##          Neg Pred Value : 0.4865         
##              Prevalence : 0.6200         
##          Detection Rate : 0.2400         
##    Detection Prevalence : 0.2600         
##       Balanced Accuracy : 0.6672         
##                                          
##        'Positive' Class : Success        
##

6.6 Accuracy comparison table

Model Name	Accuracy (정확도)	Sensitivity (민감도)	Specificity (특이도)
Linear Support Vector Machine	0.5933	0.5699	0.6316
Radial(Sigmoid) Support Vector Machine	0.6333	0.97849	0.07018
Polynomial Support Vector Machine	0.62	1.00	0.00
Decision Tree (after pruning)	0.62	1.00	0.00
Random Forest	`0.6867`	0.7957	0.5088
LASSO	0.6	0.3871	0.9474

위 표의 값들을 통해서 이번 Loan data에는 Random Forest 모델이 가장 좋은 예측 정확도를 보였음을 알 수 있습니다.

7. 마치며 (Outro)

2018년 2월에 처음으로 Kaggle의 open data set 중 하나인 Loan data를 분석해봤고 2019년 1월에 다시 2번째로 분석하면서 느낀 점들 입니다.

변수(Variable, feature)가 다양하지 않은 것과 관측치(레코드, record)가 많지 않은게 아쉽다.

상환을 성공하는데는 고객의 학력(education)보다 직업(연봉)도 영향력이 클 것 같은데 이 data에서는 고객들의 학력만 나와있었다.

상관관계와 인과관계를 증명하지 못하는 것처럼 학력이 좋다고 해서 재정적인 능력이 좋고, 대출금을 기한내에 상환할 것이라는 보장은 없다.

또한, 데이터의 출처도 그냥 은행이라고만 되어있고 어느 지역인지도 안 나와 있는 것이 아쉽다.

지역이라도 나와있다면 그 지역에 대한 도메인 지식을 공부하고 접목 해볼텐데 하는 아쉬움이 남는다.

Data Science에서의 사골 데이터인 Iris 처럼 표본이 150개에 변수가 4개 밖에 되지 않아도 정확도가 높게 모델이 만들어지는 경우는 실제로 잘 없다.

그리고 데이터가 수집될 때 편향이 있다면 편향이 있는 데이터로 만든 모델 또한 일반화가 잘 되었다고 할 수 없기 때문이다.

파라미터 튜닝(Parameter tuning), 스태킹(Stacking), 앙상블(Ensemble)에 대해서 공부한 후에 접목시켜서 더 좋은 모델을 만들고 싶은 욕심이 생겼다.

처음 주어진 데이터로 만든 모델을 쭉 변함없이 사용하지않고 지속적인 모니터링과 관리가 필요하다.

올해와 내년은 다른 것 처럼 데이터는 얼마든지 바뀔 수 있고 거기에 맞는 모델과 변수도 얼마든지 달라질 수 있기 때문이다.

8. Reference : 참고문헌

Kaggle dataset - open data “Loan” link
R을 활용한 데이터 과학(인사이트), Garrett Grolemund, Hadley Wickham 지음, 김설기, 최혜민 옮김, ISBN : 978-89-6626-235-9
머신러닝 탐구생활(비제이퍼블릭), 정권우 지음, ISBN : 979-11-86697-69-6
R에서 lubridate package 에 대해
Software Carpentry - 날짜/시간 데이터(lubridate)

'데이터 분석 > Kaggle' 카테고리의 다른 글

캐글 코리아 주관 - "2019 1st ML month with KaKR" kernel (0)	2019.02.06
캐글 버섯 데이터 분류분석(Kaggle - Mushrooms Data Classification) (2)	2018.01.23

캐글 대출고객 분류분석 (Kaggle - Loan Data Classification)