Logistic Regression
- Logistic regression can be used when the response variable is categorical (two or more categories are possible).
- The explanatory variables can be categorical or continuous.
- $E(y) = {\exp(\beta_0 + \beta_1 x) \over 1+\exp(\beta_0+\beta_1 x)}$
- $E(y) = p$ is a probability. If we set y = 1 for a success and y = 0 for a failure, p is the probability of success. p lies between 0 and 1; the analyst chooses a cutoff value and assigns y = 0 when p is below the cutoff and y = 1 when it is above.
- $p' = \ln\left({p \over 1-p}\right) = \ln\left({E(y) \over 1-E(y)}\right) = \ln\left( {\exp(\beta_0 + \beta_1 x) \over 1+\exp(\beta_0+\beta_1 x)} \Big/ {1 \over 1+\exp(\beta_0+\beta_1 x)} \right) = \beta_0+\beta_1 x$
- This transformation is called the logit (logistic) transformation.
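As a quick check of the formulas above, here is a minimal sketch using only the standard library. The coefficients `beta0`, `beta1`, the input `x`, and the cutoff of 0.5 are made-up values for illustration, not estimates from any data.

```python
import math

def logistic(z):
    """Map a linear predictor z to a probability: exp(z) / (1 + exp(z))."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log-odds ln(p / (1 - p)) of a probability p."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients and input, chosen only for illustration.
beta0, beta1 = -1.0, 0.5
x = 3.0

z = beta0 + beta1 * x           # linear predictor beta0 + beta1 * x
p = logistic(z)                 # E(y) = p, the probability of success
cutoff = 0.5                    # analyst-chosen cutoff value (assumption)
y_hat = 1 if p > cutoff else 0  # classify by comparing p with the cutoff

print(p)         # probability of success
print(logit(p))  # the logit transform recovers beta0 + beta1 * x
print(y_hat)     # predicted class
```

Applying `logit` to `p` returns the linear predictor, which is why the model is linear on the log-odds scale even though it is nonlinear in p.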
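Coefficient estimation
- The second expression is $1 - W_{\beta}(x)$.
Odds
- When the probability of the event is $p = {\exp(\beta_0 + \beta_1 x) \over 1+\exp(\beta_0+\beta_1 x)}$, the odds are $\text{odds} = {p \over 1-p}$.
- That is, (probability that the event occurs) / (probability that the event does not occur).
- The larger the odds, the larger the probability that the event occurs.
- When the explanatory variable $x_1$ increases by one unit, the odds of the event become $e^{\beta_1} \times 100\%$ of their previous value, i.e. they are multiplied by $e^{\beta_1}$.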
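The last point can be verified numerically. The sketch below uses hypothetical coefficients (`beta0`, `beta1` are not fitted values) and compares the odds at x = 2 and x = 3.

```python
import math

def prob(beta0, beta1, x):
    """Probability of the event under the logistic model."""
    z = beta0 + beta1 * x
    return math.exp(z) / (1.0 + math.exp(z))

def odds(p):
    """Odds = P(event) / P(no event)."""
    return p / (1.0 - p)

# Hypothetical coefficients, for illustration only.
beta0, beta1 = -1.0, 0.5

odds_at_2 = odds(prob(beta0, beta1, 2.0))
odds_at_3 = odds(prob(beta0, beta1, 3.0))
ratio = odds_at_3 / odds_at_2   # effect of a one-unit increase in x

print(ratio)            # odds ratio computed from the two probabilities
print(math.exp(beta1))  # e^{beta_1}; the two numbers agree
```

The ratio does not depend on where the one-unit step is taken, which is what makes $e^{\beta_1}$ a convenient summary of the effect of x.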
Titanic practice
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
import warnings
warnings.filterwarnings('ignore')
train= pd.read_csv("C:/Users/landg/Downloads/titanic/train.csv")
test= pd.read_csv("C:/Users/landg/Downloads/titanic/test.csv")
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 418 non-null int64
1 Pclass 418 non-null int64
2 Name 418 non-null object
3 Sex 418 non-null object
4 Age 332 non-null float64
5 SibSp 418 non-null int64
6 Parch 418 non-null int64
7 Ticket 418 non-null object
8 Fare 417 non-null float64
9 Cabin 91 non-null object
10 Embarked 418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
def reduce_mem_usage(df):
    """
    iterate through all the columns of a dataframe and
    modify the data type to reduce memory usage.
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print(f'Memory usage of dataframe is {start_mem:.2f}MB')
    for col in df.columns:
        col_type = df[col].dtype
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max <\
                        np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max <\
                        np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max <\
                        np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max <\
                        np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            elif str(col_type)[:5] == 'float':
                if c_min > np.finfo(np.float16).min and c_max <\
                        np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max <\
                        np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
            else:
                pass
        else:
            df[col] = df[col].astype('category')
    end_mem = df.memory_usage().sum() / 1024**2
    print(f'Memory usage after optimization is: {end_mem:.2f}MB')
    print(f'Decreased by {100*((start_mem - end_mem)/start_mem):.1f}%')
    return df
train = reduce_mem_usage(train)
Memory usage of dataframe is 0.08MB
Memory usage after optimization is: 0.09MB
Decreased by -12.9%
train['Sex_clean'] = train['Sex'].astype('category').cat.codes
test['Sex_clean'] = test['Sex'].astype('category').cat.codes
train['Sex_clean']
0 1
1 0
2 0
3 0
4 1
..
886 1
887 0
888 0
889 1
890 1
Name: Sex_clean, Length: 891, dtype: int8
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int16
1 Survived 891 non-null int8
2 Pclass 891 non-null int8
3 Name 891 non-null category
4 Sex 891 non-null category
5 Age 714 non-null float16
6 SibSp 891 non-null int8
7 Parch 891 non-null int8
8 Ticket 891 non-null category
9 Fare 891 non-null float16
10 Cabin 204 non-null category
11 Embarked 889 non-null category
12 Sex_clean 891 non-null int8
dtypes: category(5), float16(2), int16(1), int8(5)
memory usage: 95.3 KB
train['Embarked'].isnull().sum()
2
test['Embarked'].isnull().sum()
0
train['Embarked'].value_counts()
S 644
C 168
Q 77
Name: Embarked, dtype: int64
train['Embarked'] = train['Embarked'].fillna('S')
train['Embarked_clean'] = train['Embarked'].astype('category').cat.codes
test['Embarked_clean'] = test['Embarked'].astype('category').cat.codes
train.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Sex_clean | Embarked_clean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.250000 | NaN | S | 1 | 2 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.312500 | C85 | C | 0 | 0 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925781 | NaN | S | 0 | 2 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.093750 | C123 | S | 0 | 2 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.046875 | NaN | S | 1 | 2 |
train['Family'] = 1 + train['SibSp'] + train['Parch']
test['Family'] = 1 + test['SibSp'] + test['Parch']
train['Solo'] = (train['Family']==1)
test['Solo'] = (test['Family']==1)
train['FareBin_4'] = pd.qcut(train['Fare'],4)
test['FareBin_4'] = pd.qcut(test['Fare'],4)
train['FareBin_4'].value_counts()
(7.91, 14.453] 224
(-0.001, 7.91] 223
(31.0, 512.5] 222
(14.453, 31.0] 222
Name: FareBin_4, dtype: int64
train['Title'] = train['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
test['Title'] = test['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
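# 'Capt' was misspelled 'Capy' in the original run, so the outputs shown later come from that run
train['Title'] = train['Title'].replace(['Lady', 'Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Other')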
train['Title'] = train['Title'].replace('Mlle', 'Miss')
train['Title'] = train['Title'].replace('Ms', 'Miss')
train['Title'] = train['Title'].replace('Mme', 'Mrs')
test['Title'] = test['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Other')
test['Title'] = test['Title'].replace('Mlle', 'Miss')
test['Title'] = test['Title'].replace('Ms', 'Miss')
test['Title'] = test['Title'].replace('Mme', 'Mrs')
train['Title_clean'] = train['Title'].astype('category').cat.codes
test['Title_clean'] = test['Title'].astype('category').cat.codes
train.groupby("Title")["Age"].transform("median")
0 30.0
1 35.0
2 21.0
3 35.0
4 30.0
...
886 48.0
887 21.0
888 21.0
889 30.0
890 30.0
Name: Age, Length: 891, dtype: float16
train['Age'].plot.hist(bins=range(10,101,10),figsize=[15,8])
<matplotlib.axes._subplots.AxesSubplot at 0x1d5a9776e48>
train['Age'] = train['Age'].fillna(train.groupby("Title")["Age"].transform("median"))
test['Age'] = test['Age'].fillna(test.groupby("Title")["Age"].transform("median"))
train.loc[train['Age'] <= 16, 'Age_clean'] = 0
train.loc[(train['Age'] > 16) & (train['Age'] <=26), 'Age_clean'] = 1
train.loc[(train['Age'] > 26) & (train['Age'] <=36), 'Age_clean'] = 2
train.loc[(train['Age'] > 36) & (train['Age'] <=62), 'Age_clean'] = 3
train.loc[(train['Age'] > 62), 'Age_clean'] = 4
test.loc[test['Age'] <= 16, 'Age_clean'] = 0
test.loc[(test['Age'] > 16) & (test['Age'] <=26), 'Age_clean'] = 1
test.loc[(test['Age'] > 26) & (test['Age'] <=36), 'Age_clean'] = 2
test.loc[(test['Age'] > 36) & (test['Age'] <=62), 'Age_clean'] = 3
test.loc[(test['Age'] > 62), 'Age_clean'] = 4
train['Fare'] = train['Fare'].fillna(train.groupby("Pclass")["Fare"].transform("median"))
test['Fare'] = test['Fare'].fillna(test.groupby("Pclass")["Fare"].transform("median"))
pd.qcut(train['Fare'], 4)
0 (-0.001, 7.91]
1 (31.0, 512.5]
2 (7.91, 14.453]
3 (31.0, 512.5]
4 (7.91, 14.453]
...
886 (7.91, 14.453]
887 (14.453, 31.0]
888 (14.453, 31.0]
889 (14.453, 31.0]
890 (-0.001, 7.91]
Name: Fare, Length: 891, dtype: category
Categories (4, interval[float64]): [(-0.001, 7.91] < (7.91, 14.453] < (14.453, 31.0] < (31.0, 512.5]]
train.loc[train['Fare'] <= 17, 'Fare_clean'] = 0
train.loc[(train['Fare'] > 17) & (train['Fare'] <=30), 'Fare_clean'] = 1
train.loc[(train['Fare'] > 30) & (train['Fare'] <=100), 'Fare_clean'] = 2
train.loc[(train['Fare'] > 100), 'Fare_clean'] = 3
train['Fare_clean'] = train['Fare_clean'].astype(int)
test.loc[test['Fare'] <= 17, 'Fare_clean'] = 0
test.loc[(test['Fare'] > 17) & (test['Fare'] <=30), 'Fare_clean'] = 1
test.loc[(test['Fare'] > 30) & (test['Fare'] <=100), 'Fare_clean'] = 2
test.loc[(test['Fare'] > 100), 'Fare_clean'] = 3
test['Fare_clean'] = test['Fare_clean'].astype(int)
train['Cabin'].str[:1].value_counts()
C 59
B 47
D 33
E 32
A 15
F 13
G 4
T 1
Name: Cabin, dtype: int64
mapping = {
    'A': 0,
    'B': 0.4,
    'C': 0.8,
    'D': 1.2,
    'E': 1.6,
    'F': 2.0,
    'G': 2.4,
    'T': 2.8
}
train['Cabin_clean'] = train['Cabin'].str[:1]
train['Cabin_clean'] = train['Cabin_clean'].map(mapping)
train[['Pclass','Cabin_clean']].head(10)
| | Pclass | Cabin_clean |
|---|---|---|
| 0 | 3 | NaN |
| 1 | 1 | 0.8 |
| 2 | 3 | NaN |
| 3 | 1 | 0.8 |
| 4 | 3 | NaN |
| 5 | 3 | NaN |
| 6 | 1 | 1.6 |
| 7 | 3 | NaN |
| 8 | 3 | NaN |
| 9 | 2 | NaN |
train.groupby('Pclass')['Cabin_clean'].median()
Pclass
1 0.8
2 1.8
3 2.0
Name: Cabin_clean, dtype: float64
train['Cabin_clean'] = train['Cabin_clean'].fillna(train.groupby('Pclass')['Cabin_clean'].transform('median'))
train['Cabin_clean']
0 2.0
1 0.8
2 2.0
3 0.8
4 2.0
...
886 1.8
887 0.4
888 2.0
889 0.8
890 2.0
Name: Cabin_clean, Length: 891, dtype: float64
test['Cabin_clean'] = test['Cabin'].str[:1]
test['Cabin_clean'] = test['Cabin_clean'].map(mapping)
test['Cabin_clean'] = test['Cabin_clean'].fillna(test.groupby('Pclass')['Cabin_clean'].transform('median'))
feature = [
    'Pclass',
    'SibSp',
    'Parch',
    'Sex_clean',
    'Embarked_clean',
    'Family',
    'Solo',
    'Title_clean',
    'Age_clean',
    'Fare_clean',
    'Cabin_clean'
]
label = [
    'Survived'
]
data = train[feature]
target = train[label]
k_fold = KFold(n_splits=10, shuffle = True, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(data, target, random_state=0)
x_train.shape, x_test.shape, y_train.shape, y_test.shape
((668, 11), (223, 11), (668, 1), (223, 1))
Interpreting the logistic regression
import statsmodels.api as sm
model = sm.Logit(y_train, x_train.astype(float))
results = model.fit(method = "newton")
Optimization terminated successfully.
Current function value: 0.439386
Iterations 7
x_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 668 entries, 105 to 684
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pclass 668 non-null int8
1 SibSp 668 non-null int8
2 Parch 668 non-null int8
3 Sex_clean 668 non-null int8
4 Embarked_clean 668 non-null int8
5 Family 668 non-null int8
6 Solo 668 non-null bool
7 Title_clean 668 non-null int8
8 Age_clean 668 non-null float64
9 Fare_clean 668 non-null int32
10 Cabin_clean 668 non-null float64
dtypes: bool(1), float64(2), int32(1), int8(7)
memory usage: 23.5 KB
x_train.head()
| | Pclass | SibSp | Parch | Sex_clean | Embarked_clean | Family | Solo | Title_clean | Age_clean | Fare_clean | Cabin_clean |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 105 | 3 | 0 | 0 | 1 | 2 | 1 | True | 3 | 2.0 | 0 | 2.0 |
| 68 | 3 | 4 | 2 | 0 | 2 | 7 | False | 2 | 1.0 | 0 | 2.0 |
| 253 | 3 | 1 | 0 | 1 | 2 | 2 | False | 3 | 2.0 | 0 | 2.0 |
| 320 | 3 | 0 | 0 | 1 | 2 | 1 | True | 3 | 1.0 | 0 | 2.0 |
| 706 | 2 | 0 | 0 | 0 | 2 | 1 | True | 4 | 3.0 | 0 | 1.8 |
results.summary()
Dep. Variable: | Survived | No. Observations: | 668 |
---|---|---|---|
Model: | Logit | Df Residuals: | 657 |
Method: | MLE | Df Model: | 10 |
Date: | Sun, 28 Mar 2021 | Pseudo R-squ.: | 0.3413 |
Time: | 18:52:55 | Log-Likelihood: | -293.51 |
converged: | True | LL-Null: | -445.58 |
Covariance Type: | nonrobust | LLR p-value: | 2.079e-59 |
| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Pclass | -1.1096 | 0.263 | -4.214 | 0.000 | -1.626 | -0.593 |
| SibSp | -6.3773 | 0.909 | -7.012 | 0.000 | -8.160 | -4.595 |
| Parch | -5.9752 | 0.888 | -6.728 | 0.000 | -7.716 | -4.235 |
| Sex_clean | -2.6116 | 0.229 | -11.428 | 0.000 | -3.059 | -2.164 |
| Embarked_clean | -0.1870 | 0.140 | -1.338 | 0.181 | -0.461 | 0.087 |
| Family | 5.7611 | 0.876 | 6.573 | 0.000 | 4.043 | 7.479 |
| Solo | -0.8793 | 0.328 | -2.679 | 0.007 | -1.523 | -0.236 |
| Title_clean | -0.2110 | 0.152 | -1.392 | 0.164 | -0.508 | 0.086 |
| Age_clean | -0.4346 | 0.138 | -3.143 | 0.002 | -0.706 | -0.164 |
| Fare_clean | 0.0126 | 0.198 | 0.063 | 0.949 | -0.376 | 0.401 |
| Cabin_clean | 0.2644 | 0.350 | 0.756 | 0.450 | -0.421 | 0.950 |
# coefficients
results.params
Pclass -1.109590
SibSp -6.377258
Parch -5.975151
Sex_clean -2.611591
Embarked_clean -0.187041
Family 5.761088
Solo -0.879342
Title_clean -0.210982
Age_clean -0.434641
Fare_clean 0.012587
Cabin_clean 0.264359
dtype: float64
# odds ratios of the coefficients
np.exp(results.params)
Pclass 0.329694
SibSp 0.001700
Parch 0.002541
Sex_clean 0.073418
Embarked_clean 0.829410
Family 317.693867
Solo 0.415056
Title_clean 0.809789
Age_clean 0.647497
Fare_clean 1.012667
Cabin_clean 1.302596
dtype: float64
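Interpreting the odds ratios
- Each one-unit increase in Pclass multiplies the odds of survival by about 0.33, holding the other variables fixed.
- A passenger traveling solo has about 0.42 times the odds of survival of a passenger who is not solo.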
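An odds ratio is often easier to read as a percentage change in the odds. The sketch below uses the two coefficients copied from the `results.params` output above; the two-class jump at the end is an extra illustration and assumes the Pclass effect is linear on the log-odds scale, as the fitted model does.

```python
import math

# Coefficients copied from the results.params output above.
params = {"Pclass": -1.109590, "Solo": -0.879342}

odds_ratios = {}
for name, beta in params.items():
    odds_ratios[name] = math.exp(beta)               # multiplicative effect on the odds
    pct_change = (odds_ratios[name] - 1.0) * 100.0   # same effect as a percentage
    print(f"{name}: odds ratio {odds_ratios[name]:.3f}, change in odds {pct_change:.1f}%")

# A two-unit increase (e.g. Pclass 1 -> 3) multiplies the odds by exp(2 * beta).
two_class_ratio = math.exp(2 * params["Pclass"])
print(f"Pclass +2: odds ratio {two_class_ratio:.3f}")
```

A ratio below 1 means the odds fall as the variable increases, and a ratio above 1 means they rise.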
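Assessing model fit
- The model that includes all of the explanatory variables is taken as the reference model (called the saturated model here). Let $L_1$ be its likelihood.
- Let $L_2$ be the likelihood of the model fitted after removing the uninformative explanatory variables.
- The likelihood ratio statistic (LRS) is then $-2\log(L_2/L_1) = 2(\log L_1 - \log L_2)$.
- The closer LRS is to 0, the less is lost by dropping those variables, so the reduced model fits about as well as the full one.
- Writing the deviance of a model as $-2\log L$, LRS is the difference in deviance between the two models: $(-2\log L_2) - (-2\log L_1)$.
- This difference approximately follows a chi-square distribution whose degrees of freedom equal the difference in the number of parameters between the two models.
- For Cox & Snell, Nagelkerke, and other pseudo-R-squared measures, higher values indicate a better fit.
- Hosmer-Lemeshow's goodness-of-fit test checks whether the model fits adequately. It needs a large sample, and its null hypothesis is that the model fits.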
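The statsmodels summary above already reports one instance of this comparison: the fitted model against the intercept-only null model. As a rough sketch, the statistic can be recomputed by hand from the `Log-Likelihood`, `LL-Null`, and `Df Model` values printed there. The helper `chi2_sf_even_df` is a hypothetical name; it uses a closed form that holds only for an even number of degrees of freedom, which avoids pulling in extra dependencies.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable (closed form, even df only)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i          # builds (x/2)^i / i!
        total += term
    return math.exp(-half) * total

# Values copied from the results.summary() output above.
ll_full = -293.51   # Log-Likelihood of the fitted model
ll_null = -445.58   # LL-Null (intercept-only model)
df_diff = 10        # Df Model

lrs = -2.0 * (ll_null - ll_full)          # likelihood ratio statistic
p_value = chi2_sf_even_df(lrs, df_diff)   # compare with "LLR p-value"
mcfadden_r2 = 1.0 - ll_full / ll_null     # compare with "Pseudo R-squ."

print(lrs, p_value, mcfadden_r2)
```

Here a large LRS with a tiny p-value means the explanatory variables as a group improve the fit over the null model. When the same statistic is used to justify dropping variables, a small LRS is what you hope to see.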
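Testing the regression coefficients
- Uses the Wald statistic.
- $W_j = \left( {\hat\beta_j \over \sqrt{\widehat{Var}(\hat\beta_j)}} \right)^2$
- Under the null hypothesis $\beta_j = 0$, it follows a chi-square distribution with 1 degree of freedom.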
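The `z` column in the summary table is the square root of this statistic, so the reported p-values can be reproduced by hand. A minimal sketch, using the rounded `coef` and `std err` values from the table (so the results match the table only up to rounding); `wald_test` is a hypothetical helper name.

```python
import math

def wald_test(beta_hat, std_err):
    """Return (z, W, p-value) for H0: beta = 0."""
    z = beta_hat / std_err
    w = z * z                                # Wald statistic, chi-square with 1 df
    p_value = math.erfc(math.sqrt(w / 2.0))  # upper tail of chi-square(1)
    return z, w, p_value

# coef and std err copied (rounded) from the summary table above.
rows = {"Pclass": (-1.1096, 0.263), "Embarked_clean": (-0.1870, 0.140)}

wald = {name: wald_test(b, se) for name, (b, se) in rows.items()}
for name, (z, w, p) in wald.items():
    print(f"{name}: z = {z:.3f}, W = {w:.3f}, p = {p:.4f}")
```

The chi-square(1) tail probability of $W_j$ equals the two-sided normal p-value of z, which is why statsmodels can report z and `P>|z|` instead of $W_j$.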
References
- https://www.slideshare.net/JeonghunYoon/04-logistic-regression
- https://m.blog.naver.com/libido1014/120122772781
- https://todayisbetterthanyesterday.tistory.com/11
- Fast Campus (패스트캠퍼스)
- Regression Analysis (회귀분석), Park Sung-hyun (박성현)