Principal Component Analysis [adp, python]

Principal Component Analysis

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

df = pd.read_csv("C:/adp/ISLR-python/Notebooks/Data/USArrests.csv",index_col='Unnamed: 0' )
display(df.describe())
display(df.info())
display(df.head())

          Murder     Assault   UrbanPop       Rape
count   50.00000   50.000000  50.000000  50.000000
mean     7.78800  170.760000  65.540000  21.232000
std      4.35551   83.337661  14.474763   9.366385
min      0.80000   45.000000  32.000000   7.300000
25%      4.07500  109.000000  54.500000  15.075000
50%      7.25000  159.000000  66.000000  20.100000
75%     11.25000  249.000000  77.750000  26.175000
max     17.40000  337.000000  91.000000  46.000000
<class 'pandas.core.frame.DataFrame'>
Index: 50 entries, Alabama to Wyoming
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Murder    50 non-null     float64
 1   Assault   50 non-null     int64  
 2   UrbanPop  50 non-null     int64  
 3   Rape      50 non-null     float64
dtypes: float64(2), int64(2)
memory usage: 2.0+ KB



None
            Murder  Assault  UrbanPop  Rape
Alabama       13.2      236        58  21.2
Alaska        10.0      263        48  44.5
Arizona        8.1      294        80  31.0
Arkansas       8.8      190        50  19.5
California     9.0      276        91  40.6
df.columns
Index(['Murder', 'Assault', 'UrbanPop', 'Rape'], dtype='object')
  • Data standardization
X = df
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # standardize each column to mean 0 and unit variance
X = pd.DataFrame(data=X_scaled, index=X.index, columns=X.columns)

X.head()
              Murder   Assault  UrbanPop      Rape
Alabama     1.255179  0.790787 -0.526195 -0.003451
Alaska      0.513019  1.118060 -1.224067  2.509424
Arizona     0.072361  1.493817  1.009122  1.053466
Arkansas    0.234708  0.233212 -1.084492 -0.186794
California  0.281093  1.275635  1.776781  2.088814
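As a check on what standardization does here, the sketch below reproduces Alabama's standardized Murder value by hand from the summary statistics printed earlier. It relies on the fact that `StandardScaler` divides by the population standard deviation (ddof=0), while `describe()` reports the sample standard deviation (ddof=1). The variable names are illustrative only.

```python
import math

n = 50                 # number of states (rows) in the data
mean_murder = 7.788    # mean of Murder from describe()
sample_std = 4.35551   # std of Murder from describe(), computed with ddof=1

# Convert the sample std (ddof=1) to the population std (ddof=0) used by StandardScaler
pop_std = sample_std * math.sqrt((n - 1) / n)

alabama_murder = 13.2
z = (alabama_murder - mean_murder) / pop_std
print(round(z, 4))  # close to the 1.255179 shown in X.head()
```

Dividing by the `describe()` value directly would give a slightly smaller number, which is why the two do not match exactly.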
  • Principal component analysis
pca = PCA(n_components=4)  # number of principal components
pca.fit(X)
pc_score = pca.transform(X)
pc_score = pd.DataFrame(data=pc_score, columns=['PC1', 'PC2', 'PC3', 'PC4'])  # DataFrame of principal component scores
loadings = pd.DataFrame(pca.components_.T, columns=['PC1', 'PC2', 'PC3', 'PC4'], index=df.columns)  # principal component coefficients (loadings)


pc_score.head()  # principal component scores
        PC1       PC2       PC3       PC4
0  0.985566  1.133392 -0.444269  0.156267
1  1.950138  1.073213  2.040003 -0.438583
2  1.763164 -0.745957  0.054781 -0.834653
3 -0.141420  1.119797  0.114574 -0.182811
4  2.523980 -1.542934  0.598557 -0.341996
  • Principal component coefficients (loadings)
loadings 
               PC1       PC2       PC3       PC4
Murder    0.535899  0.418181 -0.341233  0.649228
Assault   0.583184  0.187986 -0.268148 -0.743407
UrbanPop  0.278191 -0.872806 -0.378016  0.133878
Rape      0.543432 -0.167319  0.817778  0.089024
  • PC1 = 0.536·Murder + 0.583·Assault + 0.278·UrbanPop + 0.543·Rape, where each variable is its standardized value.
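To see that a PC1 score is just this weighted sum, the sketch below takes Alabama's standardized values and the PC1 column of the loadings, both copied from the outputs above, and computes their dot product. The variable names are illustrative only.

```python
import numpy as np

# Alabama's standardized values: Murder, Assault, UrbanPop, Rape (from X.head())
alabama_scaled = np.array([1.255179, 0.790787, -0.526195, -0.003451])

# PC1 coefficients for the same four variables (from the loadings table)
pc1_loadings = np.array([0.535899, 0.583184, 0.278191, 0.543432])

# The PC1 score is the dot product of the standardized row and the loading vector
pc1_alabama = float(alabama_scaled @ pc1_loadings)
print(round(pc1_alabama, 4))  # close to the 0.985566 in the first row of pc_score
```

This works without an extra centering step because the standardized columns already have mean zero.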

pca.explained_variance_ratio_  
array([0.62006039, 0.24744129, 0.0891408 , 0.04335752])
  • The first two principal components alone explain about 87% of the total variance (0.620 + 0.247 ≈ 0.868).
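The cumulative share of variance can be read directly off the ratios printed above. The sketch below hard-codes that array and counts how many components are needed to reach a threshold. The 0.85 threshold is an assumption chosen for illustration, not a rule from the analysis.

```python
import numpy as np

# Values copied from pca.explained_variance_ratio_ above
ratios = np.array([0.62006039, 0.24744129, 0.0891408, 0.04335752])

# Running total of explained variance: PC1, PC1+PC2, ...
cumulative = np.cumsum(ratios)
print(cumulative)  # cumulative[1] is about 0.8675

threshold = 0.85  # hypothetical cut-off for illustration
# Index of the first cumulative value >= threshold, converted to a component count
n_needed = int(np.searchsorted(cumulative, threshold) + 1)
print(n_needed)
```

With these ratios, two components are enough to pass the assumed 0.85 threshold.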
pc_score.corr()
              PC1           PC2           PC3           PC4
PC1  1.000000e+00  2.444648e-17  7.732791e-17  1.071531e-15
PC2  2.444648e-17  1.000000e+00 -2.878039e-16  2.180585e-16
PC3  7.732791e-17 -2.878039e-16  1.000000e+00  3.147523e-16
PC4  1.071531e-15  2.180585e-16  3.147523e-16  1.000000e+00
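The off-diagonal entries above are on the order of 1e-16 or smaller, which is zero up to floating-point error: principal component scores are uncorrelated by construction. The sketch below illustrates this with plain NumPy on synthetic data. It uses an eigendecomposition of the covariance matrix as a stand-in for `sklearn`'s `PCA`, and the data, seed, and variable names are all assumptions for illustration rather than the USArrests data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic correlated data (hypothetical): 200 rows, 4 columns
mixing = rng.normal(size=(4, 4))
data = rng.normal(size=(200, 4)) @ mixing

# Center the columns, then eigendecompose the covariance matrix
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# Sort components by decreasing variance and project the data onto them
order = np.argsort(eigvals)[::-1]
scores = centered @ eigvecs[:, order]

# The correlation matrix of the scores should be the identity up to rounding
corr = np.corrcoef(scores, rowvar=False)
max_off_diag = np.abs(corr - np.eye(4)).max()
print(max_off_diag)
```

Because the eigenvectors diagonalize the covariance matrix, the projected columns have zero covariance with each other, which is what `pc_score.corr()` shows.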
