Kaggle_Mushrooms_Classification

1. Introduction
2. Loading Packages
3. Loading the Data
4. Variable Descriptions
5 - 1. Inspecting the Data
5 - 2. Data Pre-processing
6. Exploratory Data Analysis (EDA)
- 6 - 1. Bar Plot
- 6 - 2. Mosaicplot
7. Building Models
8. Classification on the Test Set
9. Closing Remarks
11. References

1. Introduction

This report applies classification, a branch of supervised machine learning, to the `Mushrooms` data set on `Kaggle`.

Goal: classify each mushroom as `poisonous` or `edible`.

Start date: Wednesday, January 24, 2018

End date: Sunday, January 28, 2018 (5 days in total)

Kaggle link: Kaggle - Public Data - Mushrooms Data Acquisition

2. Loading Packages

library(dplyr)
library(descr)
library(DT)
library(ggplot2)
library(ISLR)
library(MASS)
library(glmnet)
library(randomForest)
library(rpart)
library(ROCR)

3. Loading the Data

rm(list=ls())

setwd("C:/Users/LG/Documents/Study-R/Data")
getwd()

## [1] "C:/Users/LG/Documents/Study-R/Data"

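# stringsAsFactors = TRUE reproduces the factor columns shown below on R >= 4.0,
# where read.csv() no longer converts strings to factors by default.
mushrooms <- read.csv("mushrooms.csv",
                      header = TRUE,
                      stringsAsFactors = TRUE)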

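4. Variable Descriptions

The data set has 23 variables in total.
The dependent (response) variable is class; the other 22 are input variables (also called explanatory, predictor, or independent variables).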

Variable	Description
class	edible = e, poisonous = p
cap-shape	bell = b, conical = c, convex = x, flat = f, knobbed = k, sunken = s
cap-surface	fibrous = f, grooves = g, scaly = y, smooth = s
cap-color	brown = n, buff = b, cinnamon = c, gray = g, green = r, pink = p, purple = u, red = e, white = w, yellow = y
bruises	bruises = t, no = f
odor	almond = a, anise = l, creosote = c, fishy = y, foul = f, musty = m, none = n, pungent = p, spicy = s
gill-attachment	attached = a, descending = d, free = f, notched = n
gill-spacing	close = c, crowded = w, distant = d
gill-size	broad = b, narrow = n
gill-color	black = k, brown = n, buff = b, chocolate = h, gray = g, green = r, orange = o, pink = p, purple = u, red = e, white = w, yellow = y
stalk-shape	enlarging = e, tapering = t
stalk-root	bulbous = b, club = c, cup = u, equal = e, rhizomorphs = z, rooted = r, missing = ?
stalk-surface-above-ring	fibrous = f, scaly = y, silky = k, smooth = s
stalk-surface-below-ring	fibrous = f, scaly = y, silky = k, smooth = s
stalk-color-above-ring	brown = n, buff = b, cinnamon = c, gray = g, orange = o, pink = p, red = e, white = w, yellow = y
stalk-color-below-ring	brown = n, buff = b, cinnamon = c, gray = g, orange = o, pink = p, red = e, white = w, yellow = y
veil-type	partial = p, universal = u
veil-color	brown = n, orange = o, white = w, yellow = y
ring-number	none = n, one = o, two = t
ring-type	cobwebby = c, evanescent = e, flaring = f, large = l, none = n, pendant = p, sheathing = s, zone = z
spore-print-color	black = k, brown = n, buff = b, chocolate = h, green = r, orange = o, purple = u, white = w, yellow = y
population	abundant = a, clustered = c, numerous = n, scattered = s, several = v, solitary = y
habitat	grasses = g, leaves = l, meadows = m, paths = p, urban = u, waste = w, woods = d

5 - 1. Inspecting the Data

head(mushrooms)

##   class cap.shape cap.surface cap.color bruises odor gill.attachment
## 1     p         x           s         n       t    p               f
## 2     e         x           s         y       t    a               f
## 3     e         b           s         w       t    l               f
## 4     p         x           y         w       t    p               f
## 5     e         x           s         g       f    n               f
## 6     e         x           y         y       t    a               f
##   gill.spacing gill.size gill.color stalk.shape stalk.root
## 1            c         n          k           e          e
## 2            c         b          k           e          c
## 3            c         b          n           e          c
## 4            c         n          n           e          e
## 5            w         b          k           t          e
## 6            c         b          n           e          c
##   stalk.surface.above.ring stalk.surface.below.ring stalk.color.above.ring
## 1                        s                        s                      w
## 2                        s                        s                      w
## 3                        s                        s                      w
## 4                        s                        s                      w
## 5                        s                        s                      w
## 6                        s                        s                      w
##   stalk.color.below.ring veil.type veil.color ring.number ring.type
## 1                      w         p          w           o         p
## 2                      w         p          w           o         p
## 3                      w         p          w           o         p
## 4                      w         p          w           o         p
## 5                      w         p          w           o         e
## 6                      w         p          w           o         p
##   spore.print.color population habitat
## 1                 k          s       u
## 2                 n          n       g
## 3                 n          n       m
## 4                 k          s       u
## 5                 n          a       g
## 6                 k          n       g

DT::datatable(mushrooms)

str(mushrooms)

## 'data.frame':    8124 obs. of  23 variables:
##  $ class                   : Factor w/ 2 levels "e","p": 2 1 1 2 1 1 1 1 2 1 ...
##  $ cap.shape               : Factor w/ 6 levels "b","c","f","k",..: 6 6 1 6 6 6 1 1 6 1 ...
##  $ cap.surface             : Factor w/ 4 levels "f","g","s","y": 3 3 3 4 3 4 3 4 4 3 ...
##  $ cap.color               : Factor w/ 10 levels "b","c","e","g",..: 5 10 9 9 4 10 9 9 9 10 ...
##  $ bruises                 : Factor w/ 2 levels "f","t": 2 2 2 2 1 2 2 2 2 2 ...
##  $ odor                    : Factor w/ 9 levels "a","c","f","l",..: 7 1 4 7 6 1 1 4 7 1 ...
##  $ gill.attachment         : Factor w/ 2 levels "a","f": 2 2 2 2 2 2 2 2 2 2 ...
##  $ gill.spacing            : Factor w/ 2 levels "c","w": 1 1 1 1 2 1 1 1 1 1 ...
##  $ gill.size               : Factor w/ 2 levels "b","n": 2 1 1 2 1 1 1 1 2 1 ...
##  $ gill.color              : Factor w/ 12 levels "b","e","g","h",..: 5 5 6 6 5 6 3 6 8 3 ...
##  $ stalk.shape             : Factor w/ 2 levels "e","t": 1 1 1 1 2 1 1 1 1 1 ...
##  $ stalk.root              : Factor w/ 5 levels "?","b","c","e",..: 4 3 3 4 4 3 3 3 4 3 ...
##  $ stalk.surface.above.ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ stalk.surface.below.ring: Factor w/ 4 levels "f","k","s","y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ stalk.color.above.ring  : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ stalk.color.below.ring  : Factor w/ 9 levels "b","c","e","g",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ veil.type               : Factor w/ 1 level "p": 1 1 1 1 1 1 1 1 1 1 ...
##  $ veil.color              : Factor w/ 4 levels "n","o","w","y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ ring.number             : Factor w/ 3 levels "n","o","t": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ring.type               : Factor w/ 5 levels "e","f","l","n",..: 5 5 5 5 1 5 5 5 5 5 ...
##  $ spore.print.color       : Factor w/ 9 levels "b","h","k","n",..: 3 4 4 3 4 3 3 4 3 3 ...
##  $ population              : Factor w/ 6 levels "a","c","n","s",..: 4 3 3 4 1 3 3 4 5 4 ...
##  $ habitat                 : Factor w/ 7 levels "d","g","l","m",..: 6 2 4 6 2 2 4 4 2 4 ...

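Confirmed: the object is a data frame with 8124 observations and 23 variables.

Confirmed: every variable is a factor with its levels set.

Check the response variable, for example whether the two classes have a similar number of observations.

# Frequency, proportion, and levels of the class variable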
CrossTable(mushrooms$class)

##    Cell Contents 
## |-------------------------|
## |                       N | 
## |           N / Row Total | 
## |-------------------------|
## 
## |        e |        p |
## |----------|----------|
## |     4208 |     3916 |
## |    0.518 |    0.482 |
## |----------|----------|

mushrooms %>%
  ggplot(aes(class)) +
  geom_bar()

## Warning in plyr::split_indices(scale_id, n): '.Random.seed' is not an
## integer vector but of type 'NULL', so ignored

levels(mushrooms$class)

## [1] "e" "p"

# stalk.root is the only variable with missing values, so tabulate its frequencies and proportions.
descr::CrossTable(mushrooms$stalk.root)

##    Cell Contents 
## |-------------------------|
## |                       N | 
## |           N / Row Total | 
## |-------------------------|
## 
## |        ? |        b |        c |        e |        r |
## |----------|----------|----------|----------|----------|
## |     2480 |     3776 |      556 |     1120 |      192 |
## |    0.305 |    0.465 |    0.068 |    0.138 |    0.024 |
## |----------|----------|----------|----------|----------|

5 - 2. Data Pre-processing

`veil.type` has only one level, so it carries no information and is removed.

In logistic regression (`glm`) with a binary response (`class`), the category at the first factor level is treated as `Failure` and all other levels as `Success`.

As loaded, `e` (edible) would therefore be the failure and `p` (poisonous) the success. That is not the coding we want, so the `levels` of `class` are reordered.

# Drop veil.type (column 17), which has a single level
mushrooms <- mushrooms[, -17]

mushrooms$class <- factor(mushrooms$class,
                          levels = c("p", "e"))
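The effect of the level order can be checked on a small toy factor. This is only an illustrative sketch: the vector `cls` below is made up and is not taken from the mushrooms data.

```r
# Toy response vector; the values are illustrative only.
cls <- factor(c("p", "e", "e", "p", "e"))
levels(cls)                      # default alphabetical order: "e" "p"

# Put "p" first so that glm() would treat "e" (edible) as the success class.
cls2 <- factor(cls, levels = c("p", "e"))
levels(cls2)

# The labels themselves are unchanged; only the level order differs.
same_labels <- all(as.character(cls) == as.character(cls2))
success_is_edible <- levels(cls2)[2] == "e"
```

Reordering levels this way only changes which category counts as the reference; no observation changes its label.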

Check the final data used for the analysis after pre-processing

summary(mushrooms)

##  class    cap.shape cap.surface   cap.color    bruises       odor     
##  p:3916   b: 452    f:2320      n      :2284   f:4748   n      :3528  
##  e:4208   c:   4    g:   4      g      :1840   t:3376   f      :2160  
##           f:3152    s:2556      e      :1500            s      : 576  
##           k: 828    y:3244      y      :1072            y      : 576  
##           s:  32                w      :1040            a      : 400  
##           x:3656                b      : 168            l      : 400  
##                                 (Other): 220            (Other): 484  
##  gill.attachment gill.spacing gill.size   gill.color   stalk.shape
##  a: 210          c:6812       b:5612    b      :1728   e:3516     
##  f:7914          w:1312       n:2512    p      :1492   t:4608     
##                                         w      :1202              
##                                         n      :1048              
##                                         g      : 752              
##                                         h      : 732              
##                                         (Other):1170              
##  stalk.root stalk.surface.above.ring stalk.surface.below.ring
##  ?:2480     f: 552                   f: 600                  
##  b:3776     k:2372                   k:2304                  
##  c: 556     s:5176                   s:4936                  
##  e:1120     y:  24                   y: 284                  
##  r: 192                                                      
##                                                              
##                                                              
##  stalk.color.above.ring stalk.color.below.ring veil.color ring.number
##  w      :4464           w      :4384           n:  96     n:  36     
##  p      :1872           p      :1872           o:  96     o:7488     
##  g      : 576           g      : 576           w:7924     t: 600     
##  n      : 448           n      : 512           y:   8                
##  b      : 432           b      : 432                                 
##  o      : 192           o      : 192                                 
##  (Other): 140           (Other): 156                                 
##  ring.type spore.print.color population habitat 
##  e:2776    w      :2388      a: 384     d:3148  
##  f:  48    n      :1968      c: 340     g:2148  
##  l:1296    k      :1872      n: 400     l: 832  
##  n:  36    h      :1632      s:1248     m: 292  
##  p:3968    r      :  72      v:4040     p:1144  
##            b      :  48      y:1712     u: 368  
##            (Other): 144                 w: 192

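Both the new level order of `class` and the removal of `veil.type` are confirmed.

Open question: whether the `?` (missing) values in `stalk.root` should be dropped as missing data.

For now the analysis proceeds with `missing` kept as its own level.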
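The two options for `?` can be sketched on a toy column. The vector `stalk_root` below is hypothetical; on the real data, recoding to `NA` and then dropping incomplete rows would discard the 2480 observations (about 30.5%) counted in the table above.

```r
# Toy stalk.root column; "?" marks a missing value as in the Kaggle file.
stalk_root <- factor(c("b", "?", "e", "?", "c"))

# Option A: keep "?" as its own level (what this report does).
keep_as_level <- stalk_root

# Option B: recode "?" to NA and drop the now-unused level.
as_na <- stalk_root
as_na[as_na == "?"] <- NA
as_na <- droplevels(as_na)

n_missing <- sum(is.na(as_na))
remaining_levels <- levels(as_na)
```

Keeping `?` as a level lets the models use "root not recorded" as a signal in its own right, which is one reason to defer the decision.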

Measuring the size of the problem

# Compute n and p for the mushrooms data to gauge the size of the problem.
A <- model.matrix( ~ . -class, mushrooms)

dim(A)

## [1] 8124   96

n = 8124 and p = 96. The 96 columns consist of the intercept plus 95 dummy variables.
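How factors expand into dummy columns can be seen on a tiny example. The data frame `toy` is made up for illustration; only `model.matrix()` and the formula style come from the report.

```r
# Toy data frame with one 3-level and one 2-level factor (illustrative values).
toy <- data.frame(
  class   = factor(c("p", "e", "e", "p")),
  odor    = factor(c("n", "f", "a", "n")),   # levels: a, f, n
  bruises = factor(c("t", "f", "t", "f"))    # levels: f, t
)

# Same formula style as above: all variables except the response.
A_toy <- model.matrix(~ . - class, toy)

colnames(A_toy)   # intercept plus (levels - 1) dummies per factor
dim(A_toy)
```

Each factor contributes one column fewer than its number of levels, and the intercept adds one more, which is how 21 factors turn into 96 columns.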

6. Exploratory Data Analysis (EDA)

All variables are qualitative (factor) data.

So bar plots and mosaic plots are used for visualization.

6 - 1. Bar Plot

# Based on the chapter on analysis methods by data type in 따라하며 배우는 데이터 과학

# Start with bar plots of the variables with the most levels, such as gill.color and cap.color
# cap.color
mushrooms %>%
  group_by(class) %>%
  ggplot(aes(cap.color, fill = class)) +
  geom_bar(position = "dodge")

# gill.color
mushrooms %>%
  group_by(class) %>%
  ggplot(aes(gill.color, fill = class)) +
  geom_bar(position = "dodge")

# odor
mushrooms %>%
  group_by(class) %>%
  ggplot(aes(odor, fill = class)) +
  geom_bar(position = "dodge")

# spore.print.color
mushrooms %>%
  group_by(class) %>%
  ggplot(aes(spore.print.color, fill = class)) +
  geom_bar(position = "dodge")

6 - 2. Mosaicplot

# cap.color
mosaicplot( ~ cap.color + class,
            data = mushrooms,
            color=T,
            cex=1.2)

# gill.color
mosaicplot( ~ gill.color + class,
            data = mushrooms,
            color=T,
            cex=1.2)

# odor
mosaicplot( ~ odor + class,
            data = mushrooms,
            color=T,
            cex=1.2)

# spore.print.color
mosaicplot( ~ spore.print.color + class,
            data = mushrooms,
            color=T,
            cex=1.2)

7. Building Models

Data Set Split

The data are split into Training : Validation : Test = 60 : 20 : 20.

For reproducible (`Reproducible`) research, a `seed` is set before each model is fitted.

The following models are used in the analysis.

(Logistic regression produced many errors and boosting took too long, so both were left out.)

Random Forest
LASSO (Least Absolute Shrinkage and Selection Operator)

set.seed(0124)
# Seed for reproducibility: the analysis start date, January 24

n <- nrow(mushrooms)
idx <- 1:n                    # indices of all observations

training.idx <- sample(idx, n * .60)   # randomly sample 60% of the data
idx <- setdiff(idx, training.idx)
# keep only the indices that are not in training.idx

validation.idx <- sample(idx, n * .20)
test.idx <- setdiff(idx, validation.idx)

# Check the size of each sample
length(training.idx)

## [1] 4874

length(validation.idx)

## [1] 1624

length(test.idx)

## [1] 1626

# Training, validation, and test data, in that order
training <- mushrooms[training.idx, ]
validation <- mushrooms[validation.idx, ]
test <- mushrooms[test.idx, ]
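A quick sanity check on a split like this is that the three index sets are disjoint and together cover every row. The sketch below repeats the splitting logic on bare indices, assuming n = 8124 as reported above; `rest`, `no_overlap`, `covers_all`, and `sizes` are names introduced here for illustration.

```r
set.seed(124)
n <- 8124                       # number of rows reported by str(mushrooms)
idx <- seq_len(n)

training.idx   <- sample(idx, floor(n * 0.60))
rest           <- setdiff(idx, training.idx)
validation.idx <- sample(rest, floor(n * 0.20))
test.idx       <- setdiff(rest, validation.idx)

# No observation may appear in more than one set.
no_overlap <- length(intersect(training.idx, validation.idx)) == 0 &&
  length(intersect(training.idx, test.idx)) == 0 &&
  length(intersect(validation.idx, test.idx)) == 0

# Every observation must appear in exactly one set.
covers_all <- setequal(c(training.idx, validation.idx, test.idx), idx)

sizes <- c(length(training.idx), length(validation.idx), length(test.idx))
```

Because `sample()` truncates a non-integer size, the test set ends up with the two leftover rows, which matches the 4874 / 1624 / 1626 counts above.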

7 - 1. RandomForest

Random forest is a representative ensemble algorithm.

# seed setting
set.seed(0124)

# modeling: fit the model
mushrooms_rf <- randomForest(class ~ . , training)
mushrooms_rf

## 
## Call:
##  randomForest(formula = class ~ ., data = training) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 0%
## Confusion matrix:
##      p    e class.error
## p 2326    0           0
## e    0 2548           0

7 - 1 - 1. Finding the most informative explanatory variables

Compute the variable importance.

When a random forest is used for classification, the mean decrease in the Gini index is used.
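To make the Gini measure concrete, the decrease in Gini impurity for a single split can be computed by hand. This is a simplified sketch on made-up labels, not the internal code of `randomForest`, which accumulates such decreases over all splits on a variable and averages them over the trees.

```r
# Gini impurity of a set of class labels: 1 - sum of squared class proportions.
gini <- function(labels) {
  p <- prop.table(table(labels))
  1 - sum(p^2)
}

# Toy node with 4 poisonous and 4 edible mushrooms (illustrative only),
# split perfectly by a hypothetical binary feature.
labels <- c("p", "p", "p", "p", "e", "e", "e", "e")
left   <- labels[1:4]
right  <- labels[5:8]

parent_gini <- gini(labels)
# Impurity after the split, weighted by child node size.
child_gini <- (length(left) * gini(left) + length(right) * gini(right)) /
  length(labels)

gini_decrease <- parent_gini - child_gini
```

A variable that often produces such clean splits accumulates a large total decrease, which is what a high `MeanDecreaseGini` reflects.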

importance(mushrooms_rf)

##                          MeanDecreaseGini
## cap.shape                       7.0379634
## cap.surface                    15.6482154
## cap.color                      32.5913368
## bruises                        57.1305346
## odor                          801.9086798
## gill.attachment                 0.8401865
## gill.spacing                   38.2667401
## gill.size                     149.5306826
## gill.color                    178.6411650
## stalk.shape                    28.7215300
## stalk.root                     78.1767869
## stalk.surface.above.ring      139.4804517
## stalk.surface.below.ring      117.7573378
## stalk.color.above.ring         31.6408053
## stalk.color.below.ring         30.7346657
## veil.color                      1.6569072
## ring.number                    31.1118309
## ring.type                     112.4231000
## spore.print.color             388.1765763
## population                    111.7841933
## habitat                        73.4109361

varImpPlot(mushrooms_rf)

odor and spore.print.color stand out as the most important variables.

7 - 1 - 2. Prediction on the `validation set`

Print the predictions for only the first 10 observations of the validation set.

predict(mushrooms_rf,
        newdata = validation[1:10, ])

## 8090 1203 1542  182 4199 7227 3432 3375 1779 7810 
##    p    e    e    e    p    e    e    e    p    e 
## Levels: p e

# To see predicted probabilities instead
predict(mushrooms_rf,
        newdata = validation[1:10, ], type = "prob")

##          p     e
## 8090 1.000 0.000
## 1203 0.000 1.000
## 1542 0.000 1.000
## 182  0.006 0.994
## 4199 1.000 0.000
## 7227 0.002 0.998
## 3432 0.000 1.000
## 3375 0.000 1.000
## 1779 1.000 0.000
## 7810 0.002 0.998
## attr(,"class")
## [1] "matrix" "votes"

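7 - 1 - 3. Evaluating the `Random Forest` model

Model evaluation uses the binomial deviance, the ROC Curve, and the AUC Value.

All three are computed on the `validation set`, so the first step is to extract the observed response and the predicted probabilities.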

y_obs <- ifelse(validation$class == "e", 1, 0)
yhat_rf <- predict(mushrooms_rf,
                   newdata = validation,
                   type = "prob")[, 'e']

Binomial deviance
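`binomial_deviance` is not provided by any of the packages loaded in this report, so a helper with that name has to be defined first. The version below is a minimal sketch under that assumption: it computes minus twice the Bernoulli log-likelihood, and the clipping constant `epsilon` is a choice made here, not something taken from the report.

```r
# Binomial deviance of predicted probabilities yhat against 0/1 outcomes y_obs.
binomial_deviance <- function(y_obs, yhat, epsilon = 1e-4) {
  # Clip probabilities away from 0 and 1 so that log() stays finite.
  yhat <- pmin(pmax(yhat, epsilon), 1 - epsilon)
  # -2 * log-likelihood of a Bernoulli model.
  -2 * sum(y_obs * log(yhat) + (1 - y_obs) * log(1 - yhat))
}

# Small check: two observations predicted at 0.5 give 4 * log(2).
dev_coin <- binomial_deviance(c(1, 0), c(0.5, 0.5))
# Near-perfect predictions give a deviance close to 0.
dev_good <- binomial_deviance(c(1, 0), c(0.9999, 0.0001))
```

Lower values are better; a model that assigns probabilities close to 0 or 1 to the correct class approaches a deviance of 0.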

binomial_deviance(y_obs, yhat_rf)

ROC Curve

pred_rf <- prediction(yhat_rf, y_obs)

perf_rf <- performance(pred_rf, measure = "tpr",
                       x.measure = "fpr")

plot(perf_rf,
     col = "red",
     main = "ROC Curve")

abline(0,1)

AUC Value

performance(pred_rf, "auc")@y.values[[1]]

## [1] 1

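Surprisingly, the AUC Value came out as exactly 1.

In other words, on the validation set the model separates the two classes perfectly: every edible mushroom receives a higher predicted probability than every poisonous one.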
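What an AUC of 1 means can be checked with a rank-based calculation of the AUC. The function `auc_by_rank` is a helper written for this illustration (it is not part of ROCR), and the scores are made up.

```r
# AUC as the probability that a random positive is scored above a random
# negative, computed from ranks (Mann-Whitney form).
auc_by_rank <- function(y_obs, score) {
  r     <- rank(score)               # average ranks handle ties
  n_pos <- sum(y_obs == 1)
  n_neg <- sum(y_obs == 0)
  (sum(r[y_obs == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Perfect separation: every positive scores above every negative.
auc_perfect <- auc_by_rank(c(0, 0, 1, 1), c(0.1, 0.2, 0.8, 0.9))

# One negative scores above one positive, so the AUC drops below 1.
auc_mixed <- auc_by_rank(c(0, 1, 0, 1), c(0.1, 0.2, 0.8, 0.9))
```

An AUC of exactly 1 therefore requires that not a single poisonous mushroom in the validation set is ranked above an edible one.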

7 - 2. LASSO

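LASSO is a more modern method than plain logistic regression.

Unlike `glm`, however, it requires the model matrix to be built by hand when the inputs include categorical variables.

The glmnet package is used.

glmnet can fit three kinds of models: LASSO, ridge regression (`ridge regression`), and elastic net (`ElasticNet`).

Unlike the random forest above, LASSO has two settings to choose: alpha and lambda.

7 - 2 - 1. Building the model matrix: the intercept column is not needed, so -1 is added to the formula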

xx <- model.matrix(class ~ .-1 , mushrooms)

# Input variables
x <- xx[training.idx,]

# Response variable
y <- ifelse(training$class == "e", 1, 0)

dim(x)

## [1] 4874   96

Size of the training data including dummy variables: 4874 observations and 96 columns

7 - 2 - 2. Fitting the model

alpha = 1 is the default (`default`), so the call fits a `LASSO` model.

With alpha = 0 the model becomes ridge regression (`Ridge Regression`),

and with alpha strictly between 0 and 1 it becomes an elastic net (`ElasticNet`) with that alpha.
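The role of alpha can be written out as the penalty term that glmnet adds to the loss, lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta))). The function `enet_penalty` and the coefficient vector below are illustrative names and values, not part of the glmnet API.

```r
# Elastic-net penalty in glmnet's parameterization.
enet_penalty <- function(beta, lambda, alpha) {
  lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
}

beta <- c(0.5, -2, 0)   # made-up coefficients

lasso_pen <- enet_penalty(beta, lambda = 0.1, alpha = 1)    # pure L1 penalty
ridge_pen <- enet_penalty(beta, lambda = 0.1, alpha = 0)    # pure L2 penalty
enet_pen  <- enet_penalty(beta, lambda = 0.1, alpha = 0.5)  # mixture of both
```

Only the L1 part can push coefficients exactly to zero, which is why alpha = 1 performs variable selection and alpha = 0 does not.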

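Fit the LASSO model using the inputs and response created above. Note that `family` is left at its default (`gaussian`) in this call, so this first fit is a penalized linear regression on the 0/1 response; the cross-validated fit further below uses `family = "binomial"`.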

mushrooms_glmnet_fit <- glmnet(x,y)

Next is the coefficient profile plot, that is, the graph of the coefficient paths.

plot(mushrooms_glmnet_fit)

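LASSO uses the L1-norm as its measure of model complexity.

Bottom axis: the L1-norm of the whole coefficient vector as lambda changes.

Top axis: the number of nonzero coefficients at the given L1-norm, that is, the degrees of freedom of the model.

The plot shows how many of the 96 columns (with 4874 observations) are selected.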

mushrooms_glmnet_fit

## 
## Call:  glmnet(x = x, y = y) 
## 
##       Df   %Dev    Lambda
##  [1,]  0 0.0000 0.3907000
##  [2,]  1 0.1039 0.3560000
##  [3,]  1 0.1901 0.3243000
##  [4,]  1 0.2617 0.2955000
##  [5,]  1 0.3211 0.2693000
##  [6,]  1 0.3705 0.2454000
##  [7,]  2 0.4126 0.2236000
##  [8,]  3 0.4575 0.2037000
##  [9,]  4 0.5011 0.1856000
## [10,]  5 0.5526 0.1691000
## [11,]  5 0.5965 0.1541000
## [12,]  5 0.6330 0.1404000
## [13,]  5 0.6633 0.1279000
## [14,]  7 0.6960 0.1166000
## [15,]  8 0.7251 0.1062000
## [16,]  9 0.7495 0.0967700
## [17,]  9 0.7698 0.0881800
## [18,] 10 0.7873 0.0803400
## [19,] 12 0.8080 0.0732100
## [20,] 12 0.8284 0.0667000
## [21,] 12 0.8455 0.0607800
## [22,] 12 0.8596 0.0553800
## [23,] 13 0.8722 0.0504600
## [24,] 14 0.8844 0.0459800
## [25,] 16 0.8960 0.0418900
## [26,] 16 0.9067 0.0381700
## [27,] 20 0.9173 0.0347800
## [28,] 20 0.9262 0.0316900
## [29,] 20 0.9337 0.0288700
## [30,] 21 0.9400 0.0263100
## [31,] 21 0.9452 0.0239700
## [32,] 27 0.9503 0.0218400
## [33,] 27 0.9552 0.0199000
## [34,] 27 0.9594 0.0181300
## [35,] 27 0.9628 0.0165200
## [36,] 27 0.9657 0.0150500
## [37,] 27 0.9682 0.0137200
## [38,] 31 0.9707 0.0125000
## [39,] 30 0.9743 0.0113900
## [40,] 30 0.9774 0.0103800
## [41,] 30 0.9800 0.0094550
## [42,] 28 0.9821 0.0086150
## [43,] 26 0.9839 0.0078500
## [44,] 27 0.9853 0.0071520
## [45,] 29 0.9867 0.0065170
## [46,] 29 0.9883 0.0059380
## [47,] 29 0.9895 0.0054100
## [48,] 30 0.9906 0.0049300
## [49,] 31 0.9917 0.0044920
## [50,] 31 0.9929 0.0040930
## [51,] 31 0.9939 0.0037290
## [52,] 32 0.9947 0.0033980
## [53,] 32 0.9954 0.0030960
## [54,] 32 0.9959 0.0028210
## [55,] 33 0.9964 0.0025700
## [56,] 33 0.9968 0.0023420
## [57,] 33 0.9971 0.0021340
## [58,] 33 0.9974 0.0019440
## [59,] 33 0.9976 0.0017720
## [60,] 33 0.9978 0.0016140
## [61,] 33 0.9980 0.0014710
## [62,] 33 0.9981 0.0013400
## [63,] 33 0.9982 0.0012210
## [64,] 34 0.9983 0.0011130
## [65,] 33 0.9984 0.0010140
## [66,] 34 0.9984 0.0009237
## [67,] 35 0.9985 0.0008417
## [68,] 39 0.9986 0.0007669
## [69,] 37 0.9986 0.0006988
## [70,] 37 0.9986 0.0006367
## [71,] 37 0.9987 0.0005801
## [72,] 38 0.9987 0.0005286
## [73,] 38 0.9987 0.0004816
## [74,] 38 0.9987 0.0004389
## [75,] 40 0.9988 0.0003999
## [76,] 39 0.9988 0.0003643
## [77,] 40 0.9988 0.0003320
## [78,] 42 0.9988 0.0003025
## [79,] 44 0.9988 0.0002756

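The output lists `Df` and `%Dev` along the sequence of `lambda` (complexity penalty) values.

`Df`: degrees of freedom, that is, the number of nonzero coefficients

`%Dev`: the fraction of the deviance explained by the current model

To see the coefficient estimates at a particular lambda (or number of degrees of freedom):

For example, with lambda = 0.2236 (in other words, 2 degrees of freedom)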

coef(mushrooms_glmnet_fit, s = .2236)

## 97 x 1 sparse Matrix of class "dgCMatrix"
##                                      1
## (Intercept)                0.377051938
## cap.shapeb                 .          
## cap.shapec                 .          
## cap.shapef                 .          
## cap.shapek                 .          
## cap.shapes                 .          
## cap.shapex                 .          
## cap.surfaceg               .          
## cap.surfaces               .          
## cap.surfacey               .          
## cap.colorc                 .          
## cap.colore                 .          
## cap.colorg                 .          
## cap.colorn                 .          
## cap.colorp                 .          
## cap.colorr                 .          
## cap.coloru                 .          
## cap.colorw                 .          
## cap.colory                 .          
## bruisest                   .          
## odorc                      .          
## odorf                     -0.003160735
## odorl                      .          
## odorm                      .          
## odorn                      0.335355511
## odorp                      .          
## odors                      .          
## odory                      .          
## gill.attachmentf           .          
## gill.spacingw              .          
## gill.sizen                 .          
## gill.colore                .          
## gill.colorg                .          
## gill.colorh                .          
## gill.colork                .          
## gill.colorn                .          
## gill.coloro                .          
## gill.colorp                .          
## gill.colorr                .          
## gill.coloru                .          
## gill.colorw                .          
## gill.colory                .          
## stalk.shapet               .          
## stalk.rootb                .          
## stalk.rootc                .          
## stalk.roote                .          
## stalk.rootr                .          
## stalk.surface.above.ringk  .          
## stalk.surface.above.rings  .          
## stalk.surface.above.ringy  .          
## stalk.surface.below.ringk  .          
## stalk.surface.below.rings  .          
## stalk.surface.below.ringy  .          
## stalk.color.above.ringc    .          
## stalk.color.above.ringe    .          
## stalk.color.above.ringg    .          
## stalk.color.above.ringn    .          
## stalk.color.above.ringo    .          
## stalk.color.above.ringp    .          
## stalk.color.above.ringw    .          
## stalk.color.above.ringy    .          
## stalk.color.below.ringc    .          
## stalk.color.below.ringe    .          
## stalk.color.below.ringg    .          
## stalk.color.below.ringn    .          
## stalk.color.below.ringo    .          
## stalk.color.below.ringp    .          
## stalk.color.below.ringw    .          
## stalk.color.below.ringy    .          
## veil.coloro                .          
## veil.colorw                .          
## veil.colory                .          
## ring.numbero               .          
## ring.numbert               .          
## ring.typef                 .          
## ring.typel                 .          
## ring.typen                 .          
## ring.typep                 .          
## spore.print.colorh         .          
## spore.print.colork         .          
## spore.print.colorn         .          
## spore.print.coloro         .          
## spore.print.colorr         .          
## spore.print.coloru         .          
## spore.print.colorw         .          
## spore.print.colory         .          
## populationc                .          
## populationn                .          
## populations                .          
## populationv                .          
## populationy                .          
## habitatg                   .          
## habitatl                   .          
## habitatm                   .          
## habitatp                   .          
## habitatu                   .          
## habitatw                   .

The model equation is as follows.

eta = (Intercept) + coef1 * Var1 + coef2 * Var2 + ...

eta = 0.377051938 - 0.003160735 * odorf + 0.335355511 * odorn
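As a worked example, the equation can be evaluated for the three possible odor cases. The coefficients are copied from the output above; the function `eta` and the case names are introduced here for illustration. Because this fit used the default gaussian family, eta is read directly as the fitted score for the 0/1 edible indicator.

```r
# Coefficients at lambda = 0.2236, copied from the coef() output.
b0      <- 0.377051938
b_odorf <- -0.003160735
b_odorn <- 0.335355511

# Linear predictor for given values of the two dummy variables.
eta <- function(odorf, odorn) b0 + b_odorf * odorf + b_odorn * odorn

eta_none  <- eta(odorf = 0, odorn = 1)   # odor = none
eta_foul  <- eta(odorf = 1, odorn = 0)   # odor = foul
eta_other <- eta(odorf = 0, odorn = 0)   # any other odor
```

Mushrooms with no odor get the highest score (about 0.71), so at this heavy penalty the model is essentially a rule on `odor` alone.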

Next, a logistic `glmnet` model is fitted to the `mushrooms` training data with cross-validation (`Cross Validation`).

mushrooms_cvfit <- cv.glmnet(x, y,
                             family = "binomial")
plot(mushrooms_cvfit)

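x-axis: the log of lambda. Models toward the left are more complex (more degrees of freedom); models toward the right are simpler.

y-axis: the k-fold cross-validation error at each lambda, with its error bar.

Red dots: the mean of the k cross-validation errors at each lambda.

# Best predictive model: lambda.min is the lambda with the lowest mean cross-validation error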
log(mushrooms_cvfit$lambda.min)

## [1] -8.568645

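# log(lambda) of the more interpretable model: lambda.1se is the largest lambda whose error is within one standard error of the minimum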
log(mushrooms_cvfit$lambda.1se)

## [1] -7.359206
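The relation between `lambda.min` and `lambda.1se` can be reproduced on a made-up cross-validation curve. The vectors `lambda`, `cvm`, and `cvsd` below are toy values named after the corresponding `cv.glmnet` components; this sketch only illustrates the one-standard-error rule and is not the package's code.

```r
# Toy cross-validation results (illustrative values only).
lambda <- c(0.1, 0.05, 0.01, 0.005, 0.001)   # decreasing lambda sequence
cvm    <- c(0.90, 0.40, 0.12, 0.10, 0.11)    # mean CV error per lambda
cvsd   <- c(0.05, 0.04, 0.03, 0.03, 0.03)    # standard error per lambda

# lambda.min: the lambda with the lowest mean CV error.
i_min      <- which.min(cvm)
lambda_min <- lambda[i_min]

# lambda.1se: the largest lambda whose error is within one standard
# error of that minimum.
threshold  <- cvm[i_min] + cvsd[i_min]
lambda_1se <- max(lambda[cvm <= threshold])
```

`lambda_1se` is never smaller than `lambda_min`, which agrees with the logs above (-7.36 versus -8.57) and yields the sparser of the two models.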

Coefficients of each variable

# Coefficients at lambda.1se
coef(mushrooms_cvfit,
     s = mushrooms_cvfit$lambda.1se)

## 97 x 1 sparse Matrix of class "dgCMatrix"
##                                     1
## (Intercept)                 2.7630487
## cap.shapeb                  .        
## cap.shapec                 -3.8250637
## cap.shapef                  .        
## cap.shapek                  .        
## cap.shapes                  .        
## cap.shapex                  .        
## cap.surfaceg               -4.5463987
## cap.surfaces                .        
## cap.surfacey                .        
## cap.colorc                  0.4302336
## cap.colore                  .        
## cap.colorg                  .        
## cap.colorn                  .        
## cap.colorp                  .        
## cap.colorr                  .        
## cap.coloru                  .        
## cap.colorw                  .        
## cap.colory                  .        
## bruisest                    .        
## odorc                      -7.6593575
## odorf                      -8.0577656
## odorl                       1.0983781
## odorm                      -1.1978868
## odorn                       4.7537149
## odorp                      -5.6390608
## odors                      -0.8527686
## odory                      -0.8084669
## gill.attachmentf            .        
## gill.spacingw               2.8417298
## gill.sizen                 -3.0053350
## gill.colore                 .        
## gill.colorg                 .        
## gill.colorh                 .        
## gill.colork                 .        
## gill.colorn                 .        
## gill.coloro                 .        
## gill.colorp                 .        
## gill.colorr                 .        
## gill.coloru                 .        
## gill.colorw                 .        
## gill.colory                 .        
## stalk.shapet                .        
## stalk.rootb                 .        
## stalk.rootc                 2.3223490
## stalk.roote                 .        
## stalk.rootr                 2.7178828
## stalk.surface.above.ringk  -2.0805594
## stalk.surface.above.rings   .        
## stalk.surface.above.ringy   .        
## stalk.surface.below.ringk   .        
## stalk.surface.below.rings   .        
## stalk.surface.below.ringy  -0.5559326
## stalk.color.above.ringc     .        
## stalk.color.above.ringe     .        
## stalk.color.above.ringg     .        
## stalk.color.above.ringn     .        
## stalk.color.above.ringo     .        
## stalk.color.above.ringp     .        
## stalk.color.above.ringw     .        
## stalk.color.above.ringy    -3.9589789
## stalk.color.below.ringc     .        
## stalk.color.below.ringe     .        
## stalk.color.below.ringg     .        
## stalk.color.below.ringn     .        
## stalk.color.below.ringo     .        
## stalk.color.below.ringp     .        
## stalk.color.below.ringw     .        
## stalk.color.below.ringy    -2.3629862
## veil.coloro                 .        
## veil.colorw                 .        
## veil.colory                 .        
## ring.numbero                .        
## ring.numbert                1.6934564
## ring.typef                  0.3781774
## ring.typel                  .        
## ring.typen                  .        
## ring.typep                  .        
## spore.print.colorh          .        
## spore.print.colork          .        
## spore.print.colorn          0.3857454
## spore.print.coloro          .        
## spore.print.colorr        -14.1632331
## spore.print.coloru          1.7687872
## spore.print.colorw         -4.3885794
## spore.print.colory          .        
## populationc                -1.6490988
## populationn                 0.0394289
## populations                 .        
## populationv                 .        
## populationy                 .        
## habitatg                    .        
## habitatl                    .        
## habitatm                    .        
## habitatp                    .        
## habitatu                    .        
## habitatw                    2.3300583

# At lambda.min
# coef(mushrooms_cvfit, s = mushrooms_cvfit$lambda.min)

Number of coefficients judged to be nonzero (the count includes the intercept, so 29 means 28 dummy variables)

length(which(coef(mushrooms_cvfit,
                  s="lambda.1se")!=0))

## [1] 29

# At lambda.min
# length(which(coef(mushrooms_cvfit, s="lambda.min")!=0))

7 - 2 - 3. Prediction with the `LASSO` model

With lambda = `lambda.1se`, look at the predicted probabilities for the first five training observations.

predict(mushrooms_cvfit,
        s = "lambda.1se",
        newx = x[1:5, ],
        type = "response")

##                 1
## 675  0.9983562374
## 3321 0.9996303222
## 4186 0.0006261612
## 3224 0.9994564057
## 1809 0.9994564057
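With `type = "response"` a binomial fit returns probabilities, which are the inverse logit of the linear predictor eta. The conversion can be sketched as follows; the eta values are arbitrary and `inv_logit` is a helper defined here for illustration.

```r
# Inverse logit: maps a linear predictor to a probability in (0, 1).
inv_logit <- function(eta) 1 / (1 + exp(-eta))

eta <- c(-5, 0, 5)      # arbitrary linear predictor values
p   <- inv_logit(eta)

# Base R offers the same transformation as plogis().
matches_plogis <- isTRUE(all.equal(p, plogis(eta)))
```

Probabilities as extreme as 0.9996 or 0.0006 in the output above correspond to linear predictors far from 0, roughly +8 and -7.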

7 - 2 - 4. Evaluating the `LASSO` model

Compute the binomial deviance, the ROC Curve, and the AUC Value on the validation set.

y_obs <- ifelse(validation$class == "e", 1, 0)
yhat_glmnet <- predict(mushrooms_cvfit, s = "lambda.1se",
                       newx = xx[validation.idx, ],
                       type = "response")

yhat_glmnet <- yhat_glmnet[, 1]

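Binomial deviance

# binomial_deviance is not provided by any of the loaded packages, so this call fails unless a helper function with that name has been defined; it was not run here.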
binomial_deviance(y_obs, yhat_glmnet)

ROC Curve

pred_glmnet <- prediction(yhat_glmnet, y_obs)
perf_glmnet <- performance(pred_glmnet,
                           measure = "tpr",
                           x.measure = "fpr")   # fixed typo: was x.mesure

plot(perf_rf,
     col = "red",
     main = "ROC Curve")

plot(perf_glmnet,
     add = T,
     col = "blue")

abline(0, 1)

legend("bottomright",
       inset = .1,
       legend = c("Random Forest", "LASSO"),
       col = c("red", "blue"),
       lty = 1, lwd = 2)

AUC Value

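# Random Forest, repeated for comparison
performance(pred_rf, "auc")@y.values[[1]]

## [1] 1

# LASSO (the original report repeated the Random Forest call here, so no LASSO output was recorded)
performance(pred_glmnet, "auc")@y.values[[1]]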

8. Classification on the `Test Set`

# RF
pre1 <- predict(mushrooms_rf,
        newdata = test,
        type = "prob")[, 'e']

# LASSO
pre2 <- predict(mushrooms_cvfit,
        s = "lambda.1se",
        newx = xx[test.idx, ],
        type = "response")

head(pre1, 10)

##     5    22    30    32    49    56    69    70    81    84 
## 1.000 0.000 0.994 0.000 1.000 1.000 1.000 0.980 1.000 1.000

head(pre2, 10)

##              1
## 5  0.999978431
## 22 0.004088005
## 30 0.951900503
## 32 0.004088005
## 49 0.998356237
## 56 0.993851610
## 69 0.994087912
## 70 0.992587695
## 81 0.999978431
## 84 0.999978431
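The report stops at the predicted probabilities. One possible next step, sketched here on made-up values, is to turn probabilities into class predictions at a cutoff and tabulate them against the observed classes. The helper `confusion_at`, the cutoff of 0.5, and the toy vectors are all assumptions of this sketch.

```r
# Confusion table of observed 0/1 outcomes against thresholded probabilities.
confusion_at <- function(y_obs, prob, cutoff = 0.5) {
  pred <- ifelse(prob >= cutoff, 1, 0)
  table(observed  = factor(y_obs, levels = c(0, 1)),
        predicted = factor(pred,  levels = c(0, 1)))
}

# Toy data: 1 = edible, 0 = poisonous (illustrative only).
y_toy <- c(1, 0, 1, 0, 1)
p_toy <- c(0.99, 0.01, 0.95, 0.60, 0.40)

cm       <- confusion_at(y_toy, p_toy)
accuracy <- sum(diag(cm)) / sum(cm)
```

On the real objects this would be called as `confusion_at(ifelse(test$class == "e", 1, 0), pre1)`. For this problem the cell where a poisonous mushroom is predicted edible matters most, so a cutoff above 0.5 may be worth considering.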

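9. Closing Remarks

It is a pity that I did not get to ensemble the models built above.

Until now I had simply reused the data and methods from books; this was the first time I collected the data and carried out the analysis myself, which was rewarding.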

11. References

따라하며 배우는 데이터 과학 (author: 권재명; publisher: 제이펍)
초보자를 위한 RStudio 마스터 (authors: 줄리안 힐레브란트, 막시밀리안 니어호프; translator: 고석범; publisher: 에이콘출판사)
딥러닝 첫걸음 (author: 김성필; publisher: 한빛미디어)
Glmnet Vignette

MinSoon Lim

