Logistic Regression

  • Logistic regression is used when the response variable is categorical (two or more categories are possible).
  • Explanatory variables may be either categorical or continuous.
  • $E(y) = {\exp(\beta_0 + \beta_1 x) \over 1+\exp(\beta_0+\beta_1 x)}$
  • $E(y) = p$, i.e., a probability. If we set $y = 1$ for success and $y = 0$ for failure, then $p$ is the probability of success. Since $p$ lies between 0 and 1, the analyst chooses a cutoff value: observations with $p$ below the cutoff are assigned $y = 0$ and those above it $y = 1$ (see the numeric sketch after this list).

  • $p' = \ln \left( {p \over 1-p} \right) = \ln \left( {E(y) \over 1-E(y)} \right) = \ln \left( { {\exp(\beta_0+\beta_1 x) \over 1+\exp(\beta_0+\beta_1 x)} \over {1 \over 1+\exp(\beta_0+\beta_1 x)} } \right) = \beta_0+\beta_1 x$

  • This transformation is called the logit (logistic) transformation.
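
A minimal numeric sketch of the model and the cutoff rule above, with assumed coefficients $\beta_0 = -1$ and $\beta_1 = 2$ (hypothetical values, for illustration only):

import numpy as np

beta0, beta1 = -1.0, 2.0                  # assumed coefficients (illustration only)
x = np.array([-2.0, 0.0, 0.5, 2.0])
p = np.exp(beta0 + beta1 * x) / (1 + np.exp(beta0 + beta1 * x))  # E(y) = p
y_hat = (p >= 0.5).astype(int)            # cutoff value of 0.5
print(p.round(3))                         # [0.007 0.269 0.5   0.953]
print(y_hat)                              # [0 0 1 1]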

Coefficient Estimation

  • The coefficients are estimated by maximum likelihood; writing $W_{\beta}(x)$ for the success probability, the second factor in the likelihood is $1- W_{\beta}(x)$, the failure probability (see the equations below).
  • (image: 로지스틱1)
  • (image: 로지스틱2)
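
For reference, a standard statement of the likelihood being maximized, with $W_{\beta}(x_i)$ the modeled success probability of observation $i$ (the images above presumably show the same derivation):

$$L(\beta) = \prod_{i=1}^{n} W_{\beta}(x_i)^{y_i} \left(1 - W_{\beta}(x_i)\right)^{1-y_i}, \qquad \ln L(\beta) = \sum_{i=1}^{n} \left[ y_i \ln W_{\beta}(x_i) + (1-y_i) \ln\left(1 - W_{\beta}(x_i)\right) \right]$$

There is no closed-form maximizer, so the estimates are found iteratively (e.g., by Newton-Raphson, as with method="newton" below).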

Odds

  • When the probability of the event is $p = {\exp(\beta_0 + \beta_1 x) \over 1+\exp(\beta_0+\beta_1 x)}$, the odds are $odds = {p \over 1-p}$.
  • Odds = (probability the event occurs) / (probability the event does not occur).
  • The larger the odds, the higher the probability of the event.
  • A one-unit increase in the explanatory variable $x_1$ multiplies the odds of the event by $e^{\beta_1}$ (the new odds are $e^{\beta_1} \times 100\%$ of the old ones), as derived below.
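
The derivation is a one-liner: since $odds(x) = e^{\beta_0 + \beta_1 x}$,

$${odds(x+1) \over odds(x)} = {e^{\beta_0 + \beta_1 (x+1)} \over e^{\beta_0 + \beta_1 x}} = e^{\beta_1}$$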

Titanic Practice

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
import warnings

warnings.filterwarnings('ignore')
train= pd.read_csv("C:/Users/landg/Downloads/titanic/train.csv")
test= pd.read_csv("C:/Users/landg/Downloads/titanic/test.csv")

train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
def reduce_mem_usage(df):
    """ 
    iterate through all the columns of a dataframe and 
    modify the data type to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print(f'Memory usage of dataframe is {start_mem:.2f}MB')
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            elif str(col_type)[:5] == 'float':
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
            else:
                pass
        else:
            df[col] = df[col].astype('category')
    end_mem = df.memory_usage().sum() / 1024**2
    print(f'Memory usage after optimization is: {end_mem:.2f}MB')
    print(f'Decreased by {100*((start_mem - end_mem)/start_mem):.1f}%')
    
    return df
train = reduce_mem_usage(train)
Memory usage of dataframe is 0.08MB
Memory usage after optimization is: 0.09MB
Decreased by -12.9%
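
Note the negative "decrease": converting high-cardinality object columns such as Name and Ticket to category adds per-column category tables, which on a dataframe this small can outweigh the savings from the numeric downcasts.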
train['Sex_clean'] = train['Sex'].astype('category').cat.codes
test['Sex_clean'] = test['Sex'].astype('category').cat.codes
train['Sex_clean']
0      1
1      0
2      0
3      0
4      1
      ..
886    1
887    0
888    0
889    1
890    1
Name: Sex_clean, Length: 891, dtype: int8
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   PassengerId  891 non-null    int16   
 1   Survived     891 non-null    int8    
 2   Pclass       891 non-null    int8    
 3   Name         891 non-null    category
 4   Sex          891 non-null    category
 5   Age          714 non-null    float16 
 6   SibSp        891 non-null    int8    
 7   Parch        891 non-null    int8    
 8   Ticket       891 non-null    category
 9   Fare         891 non-null    float16 
 10  Cabin        204 non-null    category
 11  Embarked     889 non-null    category
 12  Sex_clean    891 non-null    int8    
dtypes: category(5), float16(2), int16(1), int8(5)
memory usage: 95.3 KB
train['Embarked'].isnull().sum()
2
test['Embarked'].isnull().sum()
0
train['Embarked'].value_counts()
S    644
C    168
Q     77
Name: Embarked, dtype: int64
train['Embarked'].fillna('S', inplace = True)
train['Embarked_clean'] = train['Embarked'].astype('category').cat.codes
test['Embarked_clean'] = test['Embarked'].astype('category').cat.codes
train.head()
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket       Fare Cabin Embarked  Sex_clean  Embarked_clean
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.250000   NaN        S          1               2
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.312500   C85        C          0               0
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.925781   NaN        S          0               2
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.093750  C123        S          0               2
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.046875   NaN        S          1               2
train['Family'] = 1 + train['SibSp'] + train['Parch']
test['Family'] = 1 + test['SibSp'] + test['Parch']

train['Solo'] = (train['Family']==1)
test['Solo'] = (test['Family']==1)
train['FareBin_4'] = pd.qcut(train['Fare'],4)
test['FareBin_4'] = pd.qcut(test['Fare'],4)
train['FareBin_4'].value_counts()
(7.91, 14.453]    224
(-0.001, 7.91]    223
(31.0, 512.5]     222
(14.453, 31.0]    222
Name: FareBin_4, dtype: int64
train['Title'] = train['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
test['Title'] = test['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
train['Title'] = train['Title'].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Other')  # group rare titles, matching the test-set mapping below
train['Title'] = train['Title'].replace('Mlle', 'Miss')
train['Title'] = train['Title'].replace('Ms', 'Miss')
train['Title'] = train['Title'].replace('Mme', 'Mrs')
test['Title'] = test['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Other')
test['Title'] = test['Title'].replace('Mlle', 'Miss')
test['Title'] = test['Title'].replace('Ms', 'Miss')
test['Title'] = test['Title'].replace('Mme', 'Mrs')
train['Title_clean'] = train['Title'].astype('category').cat.codes
test['Title_clean'] = test['Title'].astype('category').cat.codes
train.groupby("Title")["Age"].transform("median")
0      30.0
1      35.0
2      21.0
3      35.0
4      30.0
       ... 
886    48.0
887    21.0
888    21.0
889    30.0
890    30.0
Name: Age, Length: 891, dtype: float16
train['Age'].plot.hist(bins=range(10,101,10),figsize=[15,8])
(figure: histogram of Age)

train['Age'].fillna(train.groupby("Title")["Age"].transform("median"),inplace = True)
test['Age'].fillna(test.groupby("Title")["Age"].transform("median"),inplace = True)
train.loc[train['Age'] <= 16, 'Age_clean'] = 0
train.loc[(train['Age'] > 16) & (train['Age'] <=26), 'Age_clean'] = 1
train.loc[(train['Age'] > 26) & (train['Age'] <=36), 'Age_clean'] = 2
train.loc[(train['Age'] > 36) & (train['Age'] <=62), 'Age_clean'] = 3
train.loc[(train['Age'] > 62), 'Age_clean'] = 4
test.loc[test['Age'] <= 16, 'Age_clean'] = 0
test.loc[(test['Age'] > 16) & (test['Age'] <=26), 'Age_clean'] = 1
test.loc[(test['Age'] > 26) & (test['Age'] <=36), 'Age_clean'] = 2
test.loc[(test['Age'] > 36) & (test['Age'] <=62), 'Age_clean'] = 3
test.loc[(test['Age'] > 62), 'Age_clean'] = 4
train['Fare'].fillna(train.groupby("Pclass")["Fare"].transform("median"), inplace = True)
test['Fare'].fillna(test.groupby("Pclass")["Fare"].transform("median"), inplace = True)

pd.qcut(train['Fare'], 4)
0      (-0.001, 7.91]
1       (31.0, 512.5]
2      (7.91, 14.453]
3       (31.0, 512.5]
4      (7.91, 14.453]
            ...      
886    (7.91, 14.453]
887    (14.453, 31.0]
888    (14.453, 31.0]
889    (14.453, 31.0]
890    (-0.001, 7.91]
Name: Fare, Length: 891, dtype: category
Categories (4, interval[float64]): [(-0.001, 7.91] < (7.91, 14.453] < (14.453, 31.0] < (31.0, 512.5]]
train.loc[train['Fare'] <= 17, 'Fare_clean'] = 0
train.loc[(train['Fare'] > 17) & (train['Fare'] <=30), 'Fare_clean'] = 1
train.loc[(train['Fare'] > 30) & (train['Fare'] <=100), 'Fare_clean'] = 2
train.loc[(train['Fare'] > 100), 'Fare_clean'] = 3

train['Fare_clean'] = train['Fare_clean'].astype(int)
test.loc[test['Fare'] <= 17, 'Fare_clean'] = 0
test.loc[(test['Fare'] > 17) & (test['Fare'] <=30), 'Fare_clean'] = 1
test.loc[(test['Fare'] > 30) & (test['Fare'] <=100), 'Fare_clean'] = 2
test.loc[(test['Fare'] > 100), 'Fare_clean'] = 3
test['Fare_clean'] = test['Fare_clean'].astype(int)
train['Cabin'].str[:1].value_counts()
C    59
B    47
D    33
E    32
A    15
F    13
G     4
T     1
Name: Cabin, dtype: int64
mapping = {
    'A': 0,
    'B': 0.4,
    'C': 0.8,
    'D': 1.2,
    'E': 1.6,
    'F': 2.0,
    'G': 2.4,
    'T': 2.8
}
train['Cabin_clean'] = train['Cabin'].str[:1]
train['Cabin_clean'] = train['Cabin_clean'].map(mapping)
train[['Pclass','Cabin_clean']].head(10)
   Pclass  Cabin_clean
0       3          NaN
1       1          0.8
2       3          NaN
3       1          0.8
4       3          NaN
5       3          NaN
6       1          1.6
7       3          NaN
8       3          NaN
9       2          NaN
train.groupby('Pclass')['Cabin_clean'].median()
Pclass
1    0.8
2    1.8
3    2.0
Name: Cabin_clean, dtype: float64
train['Cabin_clean'].fillna(train.groupby('Pclass')['Cabin_clean'].transform('median'),inplace = True)
train['Cabin_clean']
0      2.0
1      0.8
2      2.0
3      0.8
4      2.0
      ... 
886    1.8
887    0.4
888    2.0
889    0.8
890    2.0
Name: Cabin_clean, Length: 891, dtype: float64
test['Cabin_clean'] = test['Cabin'].str[:1]
test['Cabin_clean'] = test['Cabin_clean'].map(mapping)
test['Cabin_clean'].fillna(test.groupby('Pclass')['Cabin_clean'].transform('median'), inplace=True)
feature = [
    'Pclass',
    'SibSp',
    'Parch',
    'Sex_clean',
    'Embarked_clean',
    'Family',
    'Solo',
    'Title_clean',
    'Age_clean',
    'Fare_clean',
    'Cabin_clean'
]
label = [
    'Survived'
]
data = train[feature]
target = train[label]
k_fold = KFold(n_splits=10, shuffle = True, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(data, target, random_state=0)
x_train.shape, x_test.shape, y_train.shape, y_test.shape 
((668, 11), (223, 11), (668, 1), (223, 1))
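
The k_fold object and the LogisticRegression import above are never used in what follows; a minimal sketch of the 10-fold cross-validation they set up (a hypothetical follow-up, not part of the original run):

clf = LogisticRegression(max_iter=1000)  # higher max_iter to avoid convergence warnings
scores = cross_val_score(clf, data, target.values.ravel(), cv=k_fold, scoring='accuracy')
print(scores.mean())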

Interpreting the Logistic Regression

import statsmodels.api as sm

# note: no constant is added here (sm.add_constant is not used),
# so the model is fit without an intercept term
model = sm.Logit(y_train, x_train.astype(float))
results = model.fit(method="newton")  # Newton-Raphson optimization

Optimization terminated successfully.
         Current function value: 0.439386
         Iterations 7
x_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 668 entries, 105 to 684
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Pclass          668 non-null    int8   
 1   SibSp           668 non-null    int8   
 2   Parch           668 non-null    int8   
 3   Sex_clean       668 non-null    int8   
 4   Embarked_clean  668 non-null    int8   
 5   Family          668 non-null    int8   
 6   Solo            668 non-null    bool   
 7   Title_clean     668 non-null    int8   
 8   Age_clean       668 non-null    float64
 9   Fare_clean      668 non-null    int32  
 10  Cabin_clean     668 non-null    float64
dtypes: bool(1), float64(2), int32(1), int8(7)
memory usage: 23.5 KB
x_train.head()
     Pclass  SibSp  Parch  Sex_clean  Embarked_clean  Family   Solo  Title_clean  Age_clean  Fare_clean  Cabin_clean
105       3      0      0          1               2       1   True            3        2.0           0          2.0
68        3      4      2          0               2       7  False            2        1.0           0          2.0
253       3      1      0          1               2       2  False            3        2.0           0          2.0
320       3      0      0          1               2       1   True            3        1.0           0          2.0
706       2      0      0          0               2       1   True            4        3.0           0          1.8
results.summary()
Logit Regression Results
Dep. Variable:         Survived    No. Observations:       668
Model:                 Logit       Df Residuals:           657
Method:                MLE         Df Model:                10
Date:          Sun, 28 Mar 2021    Pseudo R-squ.:       0.3413
Time:                  18:52:55    Log-Likelihood:     -293.51
converged:             True        LL-Null:            -445.58
Covariance Type:       nonrobust   LLR p-value:      2.079e-59

                    coef    std err          z      P>|z|      [0.025      0.975]
Pclass           -1.1096      0.263     -4.214      0.000      -1.626      -0.593
SibSp            -6.3773      0.909     -7.012      0.000      -8.160      -4.595
Parch            -5.9752      0.888     -6.728      0.000      -7.716      -4.235
Sex_clean        -2.6116      0.229    -11.428      0.000      -3.059      -2.164
Embarked_clean   -0.1870      0.140     -1.338      0.181      -0.461       0.087
Family            5.7611      0.876      6.573      0.000       4.043       7.479
Solo             -0.8793      0.328     -2.679      0.007      -1.523      -0.236
Title_clean      -0.2110      0.152     -1.392      0.164      -0.508       0.086
Age_clean        -0.4346      0.138     -3.143      0.002      -0.706      -0.164
Fare_clean        0.0126      0.198      0.063      0.949      -0.376       0.401
Cabin_clean       0.2644      0.350      0.756      0.450      -0.421       0.950
# coefficients
results.params
Pclass           -1.109590
SibSp            -6.377258
Parch            -5.975151
Sex_clean        -2.611591
Embarked_clean   -0.187041
Family            5.761088
Solo             -0.879342
Title_clean      -0.210982
Age_clean        -0.434641
Fare_clean        0.012587
Cabin_clean       0.264359
dtype: float64
## odds ratios of the coefficients
np.exp(results.params)
Pclass              0.329694
SibSp               0.001700
Parch               0.002541
Sex_clean           0.073418
Embarked_clean      0.829410
Family            317.693867
Solo                0.415056
Title_clean         0.809789
Age_clean           0.647497
Fare_clean          1.012667
Cabin_clean         1.302596
dtype: float64

Interpreting the Odds Ratios

  • Each one-unit increase in Pclass multiplies the odds of survival by about 0.33.
  • A solo passenger's odds of survival are about 0.42 times those of a non-solo passenger, as the quick check below confirms.
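
These factors are just the exponentiated coefficients from the table above:

np.exp(-1.1096)  # ≈ 0.33, odds ratio for Pclass
np.exp(-0.8793)  # ≈ 0.42, odds ratio for Solo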

Assessing Model Fit

  • The model containing all variables is called the saturated model; denote its likelihood by $L_1$.
  • Denote by $L_2$ the likelihood of the model refit after removing uninformative explanatory variables.
  • The likelihood ratio statistic (LRS) is then $-2\ln(L_2/L_1)$.
  • The closer the LRS is to 0, the less likelihood the reduced model loses, so the better its fit.
  • LRS = deviance = $-2(\ln L_2 - \ln L_1)$ (a sketch of reading these quantities off the fitted results follows this list).
  • The deviance follows a chi-squared distribution, with degrees of freedom equal to the difference between the two models' degrees of freedom.
  • Cox & Snell, Nagelkerke, and other pseudo-R-squared measures: higher values indicate a better fit.
  • Hosmer-Lemeshow goodness-of-fit test: tests whether the model is adequate; it requires a large sample, and its null hypothesis is that the model fits.
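
A minimal sketch of pulling these quantities from the statsmodels results object fitted above (note that statsmodels compares against the intercept-only null model rather than a saturated model):

lrs = -2 * (results.llnull - results.llf)  # likelihood ratio statistic vs. the null model
print(lrs)                                 # ≈ 304.1, consistent with the LLR p-value above
print(results.prsquared)                   # McFadden pseudo R-squared, 0.3413 in the summary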

Testing the Regression Coefficients

  • Uses the Wald statistic:
  • $W_j = \left( {\hat{\beta}_j \over \sqrt{\widehat{Var}(\hat{\beta}_j)}} \right)^2$
  • Under the null hypothesis $\beta_j = 0$, $W_j$ follows a chi-squared distribution with one degree of freedom (see the sketch after this list).
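
A minimal sketch, assuming the results object fitted above: the Wald statistic is simply the square of the z value reported in the summary table.

from scipy import stats

wald = (results.params / results.bse) ** 2  # squared z values
p_values = stats.chi2.sf(wald, df=1)        # matches the P>|z| column of the summary
print(wald['Pclass'], p_values[0])          # ≈ 17.76 for Pclass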
