Logistic Regression

  • Logistic regression is used when the response variable is categorical (two or more categories are possible).
  • Explanatory variables may be either categorical or continuous.
  • $E(y) = {\exp(\beta_0 + \beta_1 x) \over 1+\exp(\beta_0+\beta_1 x)}$
  • $E(y) = p$, i.e., a probability. If we set $y = 1$ for success and $y = 0$ for failure, then $p$ is the probability of success. Since $p$ lies between 0 and 1, the analyst chooses a cutoff value: observations with $p$ below the cutoff are assigned $y = 0$ and those above it $y = 1$ (see the numeric sketch after this list).

  • $p' = \ln \left( {p \over 1-p} \right) = \ln \left( {E(y) \over 1-E(y)} \right) = \ln \left( { {\exp(\beta_0+\beta_1 x) \over 1+\exp(\beta_0+\beta_1 x)} \over {1 \over 1+\exp(\beta_0+\beta_1 x)} } \right) = \beta_0+\beta_1 x$

  • This transformation is called the logit (logistic) transformation.
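
A minimal numeric sketch of the model and the cutoff rule above, with assumed coefficients $\beta_0 = -1$ and $\beta_1 = 2$ (hypothetical values, for illustration only):

import numpy as np

beta0, beta1 = -1.0, 2.0                  # assumed coefficients (illustration only)
x = np.array([-2.0, 0.0, 0.5, 2.0])
p = np.exp(beta0 + beta1 * x) / (1 + np.exp(beta0 + beta1 * x))  # E(y) = p
y_hat = (p >= 0.5).astype(int)            # cutoff value of 0.5
print(p.round(3))                         # [0.007 0.269 0.5   0.953]
print(y_hat)                              # [0 0 1 1]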

Coefficient Estimation

  • The coefficients are estimated by maximum likelihood; writing $W_{\beta}(x)$ for the success probability, the second factor in the likelihood is $1- W_{\beta}(x)$, the failure probability (see the equations below).
  • (image: 로지스틱1)
  • (image: 로지스틱2)
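
For reference, a standard statement of the likelihood being maximized, with $W_{\beta}(x_i)$ the modeled success probability of observation $i$ (the images above presumably show the same derivation):

$$L(\beta) = \prod_{i=1}^{n} W_{\beta}(x_i)^{y_i} \left(1 - W_{\beta}(x_i)\right)^{1-y_i}, \qquad \ln L(\beta) = \sum_{i=1}^{n} \left[ y_i \ln W_{\beta}(x_i) + (1-y_i) \ln\left(1 - W_{\beta}(x_i)\right) \right]$$

There is no closed-form maximizer, so the estimates are found iteratively (e.g., by Newton-Raphson, as with method="newton" below).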

Odds

  • When the probability of the event is $p = {\exp(\beta_0 + \beta_1 x) \over 1+\exp(\beta_0+\beta_1 x)}$, the odds are $odds = {p \over 1-p}$.
  • Odds = (probability the event occurs) / (probability the event does not occur).
  • The larger the odds, the higher the probability of the event.
  • A one-unit increase in the explanatory variable $x_1$ multiplies the odds of the event by $e^{\beta_1}$ (the new odds are $e^{\beta_1} \times 100\%$ of the old ones), as derived below.
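
The derivation is a one-liner: since $odds(x) = e^{\beta_0 + \beta_1 x}$,

$${odds(x+1) \over odds(x)} = {e^{\beta_0 + \beta_1 (x+1)} \over e^{\beta_0 + \beta_1 x}} = e^{\beta_1}$$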

Titanic Practice

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
import warnings

warnings.filterwarnings('ignore')
train= pd.read_csv("C:/Users/landg/Downloads/titanic/train.csv")
test= pd.read_csv("C:/Users/landg/Downloads/titanic/test.csv")

train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
def reduce_mem_usage(df):
    """ 
    iterate through all the columns of a dataframe and 
    modify the data type to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print(f'Memory usage of dataframe is {start_mem:.2f}MB')
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            elif str(col_type)[:5] == 'float':
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
            else:
                pass
        else:
            df[col] = df[col].astype('category')
    end_mem = df.memory_usage().sum() / 1024**2
    print(f'Memory usage after optimization is: {end_mem:.2f}MB')
    print(f'Decreased by {100*((start_mem - end_mem)/start_mem):.1f}%')
    
    return df
train = reduce_mem_usage(train)
Memory usage of dataframe is 0.08MB
Memory usage after optimization is: 0.09MB
Decreased by -12.9%
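
Note the negative "decrease": converting high-cardinality object columns such as Name and Ticket to category adds per-column category tables, which on a dataframe this small can outweigh the savings from the numeric downcasts.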
train['Sex_clean'] = train['Sex'].astype('category').cat.codes
test['Sex_clean'] = test['Sex'].astype('category').cat.codes
train['Sex_clean']
0      1
1      0
2      0
3      0
4      1
      ..
886    1
887    0
888    0
889    1
890    1
Name: Sex_clean, Length: 891, dtype: int8
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   PassengerId  891 non-null    int16   
 1   Survived     891 non-null    int8    
 2   Pclass       891 non-null    int8    
 3   Name         891 non-null    category
 4   Sex          891 non-null    category
 5   Age          714 non-null    float16 
 6   SibSp        891 non-null    int8    
 7   Parch        891 non-null    int8    
 8   Ticket       891 non-null    category
 9   Fare         891 non-null    float16 
 10  Cabin        204 non-null    category
 11  Embarked     889 non-null    category
 12  Sex_clean    891 non-null    int8    
dtypes: category(5), float16(2), int16(1), int8(5)
memory usage: 95.3 KB
train['Embarked'].isnull().sum()
2
test['Embarked'].isnull().sum()
0
train['Embarked'].value_counts()
S    644
C    168
Q     77
Name: Embarked, dtype: int64
train['Embarked'].fillna('S', inplace = True)
train['Embarked_clean'] = train['Embarked'].astype('category').cat.codes
test['Embarked_clean'] = test['Embarked'].astype('category').cat.codes
train.head()
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket       Fare Cabin Embarked  Sex_clean  Embarked_clean
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.250000   NaN        S          1               2
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.312500   C85        C          0               0
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.925781   NaN        S          0               2
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.093750  C123        S          0               2
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.046875   NaN        S          1               2
train['Family'] = 1 + train['SibSp'] + train['Parch']
test['Family'] = 1 + test['SibSp'] + test['Parch']

train['Solo'] = (train['Family']==1)
test['Solo'] = (test['Family']==1)
train['FareBin_4'] = pd.qcut(train['Fare'],4)
test['FareBin_4'] = pd.qcut(test['Fare'],4)
train['FareBin_4'].value_counts()
(7.91, 14.453]    224
(-0.001, 7.91]    223
(31.0, 512.5]     222
(14.453, 31.0]    222
Name: FareBin_4, dtype: int64
train['Title'] = train['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
test['Title'] = test['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
train['Title'] = train['Title'].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Other')  # group rare titles, matching the test-set mapping below
train['Title'] = train['Title'].replace('Mlle', 'Miss')
train['Title'] = train['Title'].replace('Ms', 'Miss')
train['Title'] = train['Title'].replace('Mme', 'Mrs')
test['Title'] = test['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Other')
test['Title'] = test['Title'].replace('Mlle', 'Miss')
test['Title'] = test['Title'].replace('Ms', 'Miss')
test['Title'] = test['Title'].replace('Mme', 'Mrs')
train['Title_clean'] = train['Title'].astype('category').cat.codes
test['Title_clean'] = test['Title'].astype('category').cat.codes
train.groupby("Title")["Age"].transform("median")
0      30.0
1      35.0
2      21.0
3      35.0
4      30.0
       ... 
886    48.0
887    21.0
888    21.0
889    30.0
890    30.0
Name: Age, Length: 891, dtype: float16
train['Age'].plot.hist(bins=range(10,101,10),figsize=[15,8])
(figure: histogram of Age)

train['Age'].fillna(train.groupby("Title")["Age"].transform("median"),inplace = True)
test['Age'].fillna(test.groupby("Title")["Age"].transform("median"),inplace = True)
train.loc[train['Age'] <= 16, 'Age_clean'] = 0
train.loc[(train['Age'] > 16) & (train['Age'] <=26), 'Age_clean'] = 1
train.loc[(train['Age'] > 26) & (train['Age'] <=36), 'Age_clean'] = 2
train.loc[(train['Age'] > 36) & (train['Age'] <=62), 'Age_clean'] = 3
train.loc[(train['Age'] > 62), 'Age_clean'] = 4
test.loc[test['Age'] <= 16, 'Age_clean'] = 0
test.loc[(test['Age'] > 16) & (test['Age'] <=26), 'Age_clean'] = 1
test.loc[(test['Age'] > 26) & (test['Age'] <=36), 'Age_clean'] = 2
test.loc[(test['Age'] > 36) & (test['Age'] <=62), 'Age_clean'] = 3
test.loc[(test['Age'] > 62), 'Age_clean'] = 4
train['Fare'].fillna(train.groupby("Pclass")["Fare"].transform("median"), inplace = True)
test['Fare'].fillna(test.groupby("Pclass")["Fare"].transform("median"), inplace = True)

pd.qcut(train['Fare'], 4)
0      (-0.001, 7.91]
1       (31.0, 512.5]
2      (7.91, 14.453]
3       (31.0, 512.5]
4      (7.91, 14.453]
            ...      
886    (7.91, 14.453]
887    (14.453, 31.0]
888    (14.453, 31.0]
889    (14.453, 31.0]
890    (-0.001, 7.91]
Name: Fare, Length: 891, dtype: category
Categories (4, interval[float64]): [(-0.001, 7.91] < (7.91, 14.453] < (14.453, 31.0] < (31.0, 512.5]]
train.loc[train['Fare'] <= 17, 'Fare_clean'] = 0
train.loc[(train['Fare'] > 17) & (train['Fare'] <=30), 'Fare_clean'] = 1
train.loc[(train['Fare'] > 30) & (train['Fare'] <=100), 'Fare_clean'] = 2
train.loc[(train['Fare'] > 100), 'Fare_clean'] = 3

train['Fare_clean'] = train['Fare_clean'].astype(int)
test.loc[test['Fare'] <= 17, 'Fare_clean'] = 0
test.loc[(test['Fare'] > 17) & (test['Fare'] <=30), 'Fare_clean'] = 1
test.loc[(test['Fare'] > 30) & (test['Fare'] <=100), 'Fare_clean'] = 2
test.loc[(test['Fare'] > 100), 'Fare_clean'] = 3
test['Fare_clean'] = test['Fare_clean'].astype(int)
train['Cabin'].str[:1].value_counts()
C    59
B    47
D    33
E    32
A    15
F    13
G     4
T     1
Name: Cabin, dtype: int64
mapping = {
    'A': 0,
    'B': 0.4,
    'C': 0.8,
    'D': 1.2,
    'E': 1.6,
    'F': 2.0,
    'G': 2.4,
    'T': 2.8
}
train['Cabin_clean'] = train['Cabin'].str[:1]
train['Cabin_clean'] = train['Cabin_clean'].map(mapping)
train[['Pclass','Cabin_clean']].head(10)
   Pclass  Cabin_clean
0       3          NaN
1       1          0.8
2       3          NaN
3       1          0.8
4       3          NaN
5       3          NaN
6       1          1.6
7       3          NaN
8       3          NaN
9       2          NaN
train.groupby('Pclass')['Cabin_clean'].median()
Pclass
1    0.8
2    1.8
3    2.0
Name: Cabin_clean, dtype: float64
train['Cabin_clean'].fillna(train.groupby('Pclass')['Cabin_clean'].transform('median'),inplace = True)
train['Cabin_clean']
0      2.0
1      0.8
2      2.0
3      0.8
4      2.0
      ... 
886    1.8
887    0.4
888    2.0
889    0.8
890    2.0
Name: Cabin_clean, Length: 891, dtype: float64
test['Cabin_clean'] = test['Cabin'].str[:1]
test['Cabin_clean'] = test['Cabin_clean'].map(mapping)
test['Cabin_clean'].fillna(test.groupby('Pclass')['Cabin_clean'].transform('median'), inplace=True)
feature = [
    'Pclass',
    'SibSp',
    'Parch',
    'Sex_clean',
    'Embarked_clean',
    'Family',
    'Solo',
    'Title_clean',
    'Age_clean',
    'Fare_clean',
    'Cabin_clean'
]
label = [
    'Survived'
]
data = train[feature]
target = train[label]
k_fold = KFold(n_splits=10, shuffle = True, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(data, target, random_state=0)
x_train.shape, x_test.shape, y_train.shape, y_test.shape 
((668, 11), (223, 11), (668, 1), (223, 1))
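
The k_fold object and the LogisticRegression import above are never used in what follows; a minimal sketch of the 10-fold cross-validation they set up (a hypothetical follow-up, not part of the original run):

clf = LogisticRegression(max_iter=1000)  # higher max_iter to avoid convergence warnings
scores = cross_val_score(clf, data, target.values.ravel(), cv=k_fold, scoring='accuracy')
print(scores.mean())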

Interpreting the Logistic Regression

import statsmodels.api as sm

# note: no constant is added here (sm.add_constant is not used),
# so the model is fit without an intercept term
model = sm.Logit(y_train, x_train.astype(float))
results = model.fit(method="newton")  # Newton-Raphson optimization

Optimization terminated successfully.
         Current function value: 0.439386
         Iterations 7
x_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 668 entries, 105 to 684
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Pclass          668 non-null    int8   
 1   SibSp           668 non-null    int8   
 2   Parch           668 non-null    int8   
 3   Sex_clean       668 non-null    int8   
 4   Embarked_clean  668 non-null    int8   
 5   Family          668 non-null    int8   
 6   Solo            668 non-null    bool   
 7   Title_clean     668 non-null    int8   
 8   Age_clean       668 non-null    float64
 9   Fare_clean      668 non-null    int32  
 10  Cabin_clean     668 non-null    float64
dtypes: bool(1), float64(2), int32(1), int8(7)
memory usage: 23.5 KB
x_train.head()
     Pclass  SibSp  Parch  Sex_clean  Embarked_clean  Family   Solo  Title_clean  Age_clean  Fare_clean  Cabin_clean
105       3      0      0          1               2       1   True            3        2.0           0          2.0
68        3      4      2          0               2       7  False            2        1.0           0          2.0
253       3      1      0          1               2       2  False            3        2.0           0          2.0
320       3      0      0          1               2       1   True            3        1.0           0          2.0
706       2      0      0          0               2       1   True            4        3.0           0          1.8
results.summary()
Logit Regression Results
Dep. Variable:         Survived    No. Observations:       668
Model:                 Logit       Df Residuals:           657
Method:                MLE         Df Model:                10
Date:          Sun, 28 Mar 2021    Pseudo R-squ.:       0.3413
Time:                  18:52:55    Log-Likelihood:     -293.51
converged:             True        LL-Null:            -445.58
Covariance Type:       nonrobust   LLR p-value:      2.079e-59

                    coef    std err          z      P>|z|      [0.025      0.975]
Pclass           -1.1096      0.263     -4.214      0.000      -1.626      -0.593
SibSp            -6.3773      0.909     -7.012      0.000      -8.160      -4.595
Parch            -5.9752      0.888     -6.728      0.000      -7.716      -4.235
Sex_clean        -2.6116      0.229    -11.428      0.000      -3.059      -2.164
Embarked_clean   -0.1870      0.140     -1.338      0.181      -0.461       0.087
Family            5.7611      0.876      6.573      0.000       4.043       7.479
Solo             -0.8793      0.328     -2.679      0.007      -1.523      -0.236
Title_clean      -0.2110      0.152     -1.392      0.164      -0.508       0.086
Age_clean        -0.4346      0.138     -3.143      0.002      -0.706      -0.164
Fare_clean        0.0126      0.198      0.063      0.949      -0.376       0.401
Cabin_clean       0.2644      0.350      0.756      0.450      -0.421       0.950
# coefficients
results.params
Pclass           -1.109590
SibSp            -6.377258
Parch            -5.975151
Sex_clean        -2.611591
Embarked_clean   -0.187041
Family            5.761088
Solo             -0.879342
Title_clean      -0.210982
Age_clean        -0.434641
Fare_clean        0.012587
Cabin_clean       0.264359
dtype: float64
## odds ratios of the coefficients
np.exp(results.params)
Pclass              0.329694
SibSp               0.001700
Parch               0.002541
Sex_clean           0.073418
Embarked_clean      0.829410
Family            317.693867
Solo                0.415056
Title_clean         0.809789
Age_clean           0.647497
Fare_clean          1.012667
Cabin_clean         1.302596
dtype: float64

Interpreting the Odds Ratios

  • Each one-unit increase in Pclass multiplies the odds of survival by about 0.33.
  • A solo passenger's odds of survival are about 0.42 times those of a non-solo passenger, as the quick check below confirms.
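
These factors are just the exponentiated coefficients from the table above:

np.exp(-1.1096)  # ≈ 0.33, odds ratio for Pclass
np.exp(-0.8793)  # ≈ 0.42, odds ratio for Solo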

Assessing Model Fit

  • The model containing all variables is called the saturated model; denote its likelihood by $L_1$.
  • Denote by $L_2$ the likelihood of the model refit after removing uninformative explanatory variables.
  • The likelihood ratio statistic (LRS) is then $-2\ln(L_2/L_1)$.
  • The closer the LRS is to 0, the less likelihood the reduced model loses, so the better its fit.
  • LRS = deviance = $-2(\ln L_2 - \ln L_1)$ (a sketch of reading these quantities off the fitted results follows this list).
  • The deviance follows a chi-squared distribution, with degrees of freedom equal to the difference between the two models' degrees of freedom.
  • Cox & Snell, Nagelkerke, and other pseudo-R-squared measures: higher values indicate a better fit.
  • Hosmer-Lemeshow goodness-of-fit test: tests whether the model is adequate; it requires a large sample, and its null hypothesis is that the model fits.
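
A minimal sketch of pulling these quantities from the statsmodels results object fitted above (note that statsmodels compares against the intercept-only null model rather than a saturated model):

lrs = -2 * (results.llnull - results.llf)  # likelihood ratio statistic vs. the null model
print(lrs)                                 # ≈ 304.1, consistent with the LLR p-value above
print(results.prsquared)                   # McFadden pseudo R-squared, 0.3413 in the summary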

Testing the Regression Coefficients

  • Uses the Wald statistic:
  • $W_j = \left( {\hat{\beta}_j \over \sqrt{\widehat{Var}(\hat{\beta}_j)}} \right)^2$
  • Under the null hypothesis $\beta_j = 0$, $W_j$ follows a chi-squared distribution with one degree of freedom (see the sketch after this list).
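
A minimal sketch, assuming the results object fitted above: the Wald statistic is simply the square of the z value reported in the summary table.

from scipy import stats

wald = (results.params / results.bse) ** 2  # squared z values
p_values = stats.chi2.sf(wald, df=1)        # matches the P>|z| column of the summary
print(wald['Pclass'], p_values[0])          # ≈ 17.76 for Pclass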
