Handling Categorical Variables

Label Encoding

  • Converts categorical values into integers.
  • Label encoding a, b, c, d yields 0, 1, 2, 3.
  • Be careful: if a, b, c, d carry no inherent meaning, encoding them as 0, 1, 2, 3 lets a model interpret d as "larger" than a.
  • Works well when the categories have a meaningful order.
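The core idea can be sketched in a few lines with pandas: `pd.factorize` assigns 0-based codes in order of appearance (the sample values here are made up for illustration):

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'd', 'b'])
uniques = pd.factorize(s)[1]                            # Index(['a', 'b', 'c', 'd'])
labels = pd.Series(range(len(uniques)), index=uniques)  # a→0, b→1, c→2, d→3
print(s.map(labels).tolist())                           # [0, 1, 2, 3, 1]
```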

  • Method 1

import pandas as pd

def label_encode(train_data, test_data, columns):
    '''Returns train/test DataFrames with label-encoded columns.
    Assumes train_data and test_data have non-overlapping indices.'''
    encoded_cols = []
    for col in columns:
        # Build the label map from the training categories only
        uniques = pd.factorize(train_data[col])[1]
        labels = pd.Series(range(len(uniques)), index=uniques)
        encoded_col_train = train_data[col].map(labels)
        encoded_col_test = test_data[col].map(labels)
        encoded_col = pd.concat([encoded_col_train, encoded_col_test], axis=0)
        # Categories unseen in train map to NaN; mark them as -1
        encoded_col[encoded_col.isnull()] = -1
        encoded_cols.append(pd.DataFrame({'label_'+col: encoded_col}))
    all_encoded = pd.concat(encoded_cols, axis=1)
    return (all_encoded.loc[train_data.index, :],
            all_encoded.loc[test_data.index, :])

  • Method 2
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['Sex']=le.fit_transform(df['Sex'])
le.classes_

Example

df = pd.read_csv("C:/Users/landg/Downloads/titanic/train.csv")
df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['Sex']=le.fit_transform(df['Sex'])
le.classes_
array(['female', 'male'], dtype=object)
df['Sex']
0      1
1      0
2      0
3      0
4      1
      ..
886    1
887    0
888    0
889    1
890    1
Name: Sex, Length: 891, dtype: int32

One-Hot-Encoding

  • Represents each category as a binary vector.
  • For the categories Lee Sin, Ezreal, and Darius: Lee Sin becomes (1,0,0), Ezreal (0,1,0), and Darius (0,0,1).
  • The most basic way to handle categorical variables.
  • In linear or logistic regression, keeping every dummy column causes multicollinearity, so one column should be dropped.
  • If Darius is dropped, Lee Sin becomes (1,0), Ezreal (0,1), and Darius (0,0).
  • Works well when the categories have no meaningful order.
  • Also an option when the feature contains a lot of noise.
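The column-dropping trick above maps directly to `drop_first=True` in `pd.get_dummies`. A minimal sketch using the same three champions (columns come out in sorted order, so the first sorted category is the one dropped and becomes the all-zeros row):

```python
import pandas as pd

champs = pd.Series(['리신', '이즈리얼', '다리우스'])
full = pd.get_dummies(champs, dtype=int)                      # one column per champion
dropped = pd.get_dummies(champs, drop_first=True, dtype=int)  # first sorted column removed
print(full)
print(dropped)  # the dropped champion is encoded as all zeros
```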

  • Method 1
import pandas as pd
df_dummy = pd.get_dummies(df['Sex']) 
df_dummy.head()
female male
0 0 1
1 1 0
2 1 0
3 1 0
4 0 1
df = pd.concat([df,df_dummy],axis=1)
df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked female male
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 0 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 1 0
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 1 0
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 1 0
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S 0 1
df = df.drop('Sex',axis=1)
df.head()
PassengerId Survived Pclass Name Age SibSp Parch Ticket Fare Cabin Embarked female male
0 1 0 3 Braund, Mr. Owen Harris 22.0 1 0 A/5 21171 7.2500 NaN S 0 1
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 38.0 1 0 PC 17599 71.2833 C85 C 1 0
2 3 1 3 Heikkinen, Miss. Laina 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 1 0
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 35.0 1 0 113803 53.1000 C123 S 1 0
4 5 0 3 Allen, Mr. William Henry 35.0 0 0 373450 8.0500 NaN S 0 1
  • Method 2: sklearn's OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
df = pd.DataFrame({'롤 챔프' : ['리신','이즈리얼','다리우스']})
ohc = OneHotEncoder()
ohc_encoded = ohc.fit_transform(df)
print(ohc_encoded)
  (0, 1)	1.0
  (1, 2)	1.0
  (2, 0)	1.0
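OneHotEncoder returns a sparse matrix by default, which is why the output above is printed as (row, column) coordinates of the ones. Calling `.toarray()` on the result gives the familiar dense matrix; `categories_` shows the (sorted) category order that defines the columns:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'롤 챔프': ['리신', '이즈리얼', '다리우스']})
ohc = OneHotEncoder()
dense = ohc.fit_transform(df).toarray()   # densify the sparse result
print(ohc.categories_)                    # sorted category order per column
print(dense)
```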
  • Method 3
def one_hot_encode(train_data, test_data, columns):
    '''Returns train/test DataFrames with one-hot-encoded columns.'''
    conc = pd.concat([train_data, test_data], axis=0)
    # drop_first=True drops one dummy per column to avoid multicollinearity
    encoded = pd.get_dummies(conc.loc[:, columns], drop_first=True,
                             sparse=True)
    return (encoded.iloc[:train_data.shape[0], :],
            encoded.iloc[train_data.shape[0]:, :])

Mean_Encoding, Target_Encoding

A : 14, B : 10, C : 10, A : 10, A : 12, B : 20, C : 5

**Applying Mean_Encoding**

A : 12, B : 15, C : 7.5
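The means above can be checked with a simple groupby (a minimal sketch; the column names `cat` and `target` are made up):

```python
import pandas as pd

df = pd.DataFrame({'cat':    ['A', 'B', 'C', 'A', 'A', 'B', 'C'],
                   'target': [ 14,  10,  10,  10,  12,  20,   5]})
means = df.groupby('cat')['target'].mean()
print(means.to_dict())               # {'A': 12.0, 'B': 15.0, 'C': 7.5}
df['cat_me'] = df['cat'].map(means)  # replace each category with its target mean
```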

  • Advantages of Mean_Encoding
    • Produces few new features, so training is fast.
    • Low bias.
  • Disadvantage of Mean_Encoding
    • Overfitting
      • The encoding is computed from the target, so target information leaks into the features (target leakage), which causes overfitting.
  • Remedies
    • Smoothing
      • Choose an appropriate $\alpha$ and blend the category mean with the global mean.
      • $\text{Label}_c = \frac{p_c \cdot n_c + p_{\text{global}} \cdot \alpha}{n_c + \alpha}$, where $p_c$ is the target mean of category $c$, $n_c$ its count, and $p_{\text{global}}$ the global target mean.
    • CV loop
      • Within the training set, use cross-validation to produce several mean-encoding values per category out-of-fold.
      • With 10 folds, each label gets 10 mean-encoding values.
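The CV loop can be sketched as follows. This is only a sketch on made-up toy data (`cat`, `target` are hypothetical columns): each row's encoding is computed from the other nine folds, so a row never sees its own target value.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

rng = np.random.RandomState(0)
df = pd.DataFrame({'cat': list('ABC') * 20,
                   'target': rng.randint(0, 2, 60)})

df['cat_me'] = np.nan
for tr_idx, val_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(df):
    # Means are computed on the other 9 folds only
    fold_means = df.iloc[tr_idx].groupby('cat')['target'].mean()
    df.iloc[val_idx, df.columns.get_loc('cat_me')] = \
        df['cat'].iloc[val_idx].map(fold_means).values
# Categories absent from a fold's training part fall back to the global mean
df['cat_me'] = df['cat_me'].fillna(df['target'].mean())
```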

import numpy as np
import pandas as pd

def add_noise(series, noise_level):
    return series * (1 + noise_level * np.random.randn(len(series)))

def target_encode(trn_series=None, 
                  tst_series=None, 
                  target=None, 
                  min_samples_leaf=1, 
                  smoothing=1,
                  noise_level=0):
    """
    Smoothing is computed like in the following paper by Daniele Micci-Barreca
    https://kaggle2.blob.core.windows.net/forum-message-attachments/225952/7441/high%20cardinality%20categoricals.pdf
    trn_series : training categorical feature as a pd.Series
    tst_series : test categorical feature as a pd.Series
    target : target data as a pd.Series
    min_samples_leaf (int) : minimum samples to take category average into account
    smoothing (int) : smoothing effect to balance categorical average vs prior
    """
    assert len(trn_series) == len(target)  # feature and target must align
    assert trn_series.name == tst_series.name
    temp = pd.concat([trn_series, target], axis=1)  # put feature and target side by side
    # Compute target mean and count per category
    averages = temp.groupby(by=trn_series.name)[target.name].agg(["mean", "count"])
    # Compute smoothing
    smoothing = 1 / (1 + np.exp(-(averages["count"] - min_samples_leaf) / smoothing))
    # Apply average function to all target data
    prior = target.mean()
    # The bigger the count the less full_avg is taken into account
    averages[target.name] = prior * (1 - smoothing) + averages["mean"] * smoothing
    averages.drop(["mean", "count"], axis=1, inplace=True)
    # Apply averages to trn and tst series
    ft_trn_series = pd.merge(
        trn_series.to_frame(trn_series.name),
        averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
        on=trn_series.name,
        # how='left' keeps every row; unseen categories become NaN and fall back to the prior
        how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
    # pd.merge does not keep the index so restore it
    ft_trn_series.index = trn_series.index
    ft_tst_series = pd.merge(
        tst_series.to_frame(tst_series.name),
        averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
        on=tst_series.name,
        how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
    # pd.merge does not keep the index so restore it
    ft_tst_series.index = tst_series.index
    return add_noise(ft_trn_series, noise_level), add_noise(ft_tst_series, noise_level)
train_encoded, test_encoded = target_encode(X_train["blueFirstBlood"], 
                             X_test["blueFirstBlood"], 
                             target=y_train, 
                             min_samples_leaf=100,
                             smoothing=10,
                             noise_level=0.01)
    
X_train['blueFirstBlood_te'] = train_encoded
X_train.drop('blueFirstBlood', axis=1, inplace=True)

X_test['blueFirstBlood_te'] = test_encoded
X_test.drop('blueFirstBlood', axis=1, inplace=True)
5775    0.507663
8906    0.499963
559     0.503175
2909    0.486721
2781    0.493782
          ...   
2895    0.499784
7813    0.497651
905     0.503718
5192    0.490238
235     0.495995
Name: blueFirstBlood_mean, Length: 6915, dtype: float64

A lightweight mean encoding

# 'Sex' has already been label encoded to 0/1 here.
sex_mean = df.groupby('Sex')['Survived'].mean()
df['Sex_me'] = df['Sex'].map(sex_mean)
df['Sex_me']
0      0.188908
1      0.742038
2      0.742038
3      0.742038
4      0.188908
         ...   
886    0.188908
887    0.742038
888    0.742038
889    0.188908
890    0.188908
Name: Sex_me, Length: 891, dtype: float64

Frequency_Encoding

A, B, C, A, A, B, C

  • Applying Frequency_Encoding

A : 3, B : 2, C : 2

Categories are replaced using their frequencies; the example above shows raw counts, while the function below divides by the number of training samples to get relative frequencies.

def freq_encode(train_data, test_data, columns):
    '''Returns train/test DataFrames with frequency-encoded columns.
    Assumes train_data and test_data have non-overlapping indices.'''
    encoded_cols = []
    nsamples = train_data.shape[0]
    for col in columns:
        # Relative frequency of each category in the training data
        freqs_cat = train_data.groupby(col)[col].count()/nsamples
        encoded_col_train = train_data[col].map(freqs_cat)
        encoded_col_test = test_data[col].map(freqs_cat)
        encoded_col = pd.concat([encoded_col_train, encoded_col_test], axis=0)
        # Categories unseen in train get frequency 0
        encoded_col[encoded_col.isnull()] = 0
        encoded_cols.append(pd.DataFrame({'freq_'+col:encoded_col}))
    all_encoded = pd.concat(encoded_cols, axis=1)
    return (all_encoded.loc[train_data.index,:], 
            all_encoded.loc[test_data.index,:])
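The same idea fits in two lines with `value_counts` (a minimal sketch; the series values follow the A, B, C example above):

```python
import pandas as pd

s = pd.Series(['A', 'B', 'C', 'A', 'A', 'B', 'C'])
counts = s.value_counts()               # raw counts: A→3, B→2, C→2
freqs = s.value_counts(normalize=True)  # relative frequencies: A→3/7, B→2/7, C→2/7
print(s.map(freqs).round(3).tolist())
```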

Use the categorical-variable encoding method that fits your situation.


