Principal Component Analysis [adp, python]
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
df = pd.read_csv("C:/adp/ISLR-python/Notebooks/Data/USArrests.csv",index_col='Unnamed: 0' )
display(df.describe())
display(df.info())  # info() prints its report and returns None, so display() also shows "None"
display(df.head())
Statistic | Murder | Assault | UrbanPop | Rape |
---|---|---|---|---|
count | 50.00000 | 50.000000 | 50.000000 | 50.000000 |
mean | 7.78800 | 170.760000 | 65.540000 | 21.232000 |
std | 4.35551 | 83.337661 | 14.474763 | 9.366385 |
min | 0.80000 | 45.000000 | 32.000000 | 7.300000 |
25% | 4.07500 | 109.000000 | 54.500000 | 15.075000 |
50% | 7.25000 | 159.000000 | 66.000000 | 20.100000 |
75% | 11.25000 | 249.000000 | 77.750000 | 26.175000 |
max | 17.40000 | 337.000000 | 91.000000 | 46.000000 |
<class 'pandas.core.frame.DataFrame'>
Index: 50 entries, Alabama to Wyoming
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Murder 50 non-null float64
1 Assault 50 non-null int64
2 UrbanPop 50 non-null int64
3 Rape 50 non-null float64
dtypes: float64(2), int64(2)
memory usage: 2.0+ KB
None
State | Murder | Assault | UrbanPop | Rape |
---|---|---|---|---|
Alabama | 13.2 | 236 | 58 | 21.2 |
Alaska | 10.0 | 263 | 48 | 44.5 |
Arizona | 8.1 | 294 | 80 | 31.0 |
Arkansas | 8.8 | 190 | 50 | 19.5 |
California | 9.0 | 276 | 91 | 40.6 |
df.columns
Index(['Murder', 'Assault', 'UrbanPop', 'Rape'], dtype='object')
- Standardizing the data
X = df
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # rescale each column to zero mean and unit variance
X = pd.DataFrame(data=X_scaled, index=X.index, columns=X.columns)
X.head()
State | Murder | Assault | UrbanPop | Rape |
---|---|---|---|---|
Alabama | 1.255179 | 0.790787 | -0.526195 | -0.003451 |
Alaska | 0.513019 | 1.118060 | -1.224067 | 2.509424 |
Arizona | 0.072361 | 1.493817 | 1.009122 | 1.053466 |
Arkansas | 0.234708 | 0.233212 | -1.084492 | -0.186794 |
California | 0.281093 | 1.275635 | 1.776781 | 2.088814 |
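To see where these standardized values come from, the sketch below rebuilds Alabama's row from the summary statistics printed by `df.describe()`. It assumes the `std` row is the sample standard deviation (ddof=1), whereas `StandardScaler` divides by the population standard deviation (ddof=0), so the two differ by a factor of sqrt((n-1)/n). The helper name `standardize` is hypothetical and used only for illustration.

```python
import math

n = 50  # number of rows (states) reported by df.info()

# For each variable: (Alabama's raw value, mean, sample std from df.describe())
stats = {
    "Murder": (13.2, 7.788, 4.35551),
    "Assault": (236, 170.76, 83.337661),
    "UrbanPop": (58, 65.54, 14.474763),
    "Rape": (21.2, 21.232, 9.366385),
}


def standardize(x, mean, sample_std, n):
    """z-score using the population std (ddof=0), as StandardScaler does."""
    population_std = sample_std * math.sqrt((n - 1) / n)
    return (x - mean) / population_std


alabama_z = {name: standardize(x, m, s, n) for name, (x, m, s) in stats.items()}
for name, z in alabama_z.items():
    print(f"{name}: {z:.6f}")
```

The printed values should agree with the Alabama row of `X.head()` up to rounding of the summary statistics.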
- Fitting the principal component analysis
pca = PCA(n_components=4)  # number of principal components to keep
pca.fit(X)
pc_score = pca.transform(X)
pc_score = pd.DataFrame(data=pc_score, columns=['PC1', 'PC2', 'PC3', 'PC4'])  # DataFrame of principal component scores
loadings = pd.DataFrame(pca.components_.T, columns=['PC1', 'PC2', 'PC3', 'PC4'], index=df.columns)  # principal component coefficients (loadings)
pc_score.head()  # principal component scores
Row | PC1 | PC2 | PC3 | PC4 |
---|---|---|---|---|
0 | 0.985566 | 1.133392 | -0.444269 | 0.156267 |
1 | 1.950138 | 1.073213 | 2.040003 | -0.438583 |
2 | 1.763164 | -0.745957 | 0.054781 | -0.834653 |
3 | -0.141420 | 1.119797 | 0.114574 | -0.182811 |
4 | 2.523980 | -1.542934 | 0.598557 | -0.341996 |
- Principal component coefficients (loadings)
loadings
Variable | PC1 | PC2 | PC3 | PC4 |
---|---|---|---|---|
Murder | 0.535899 | 0.418181 | -0.341233 | 0.649228 |
Assault | 0.583184 | 0.187986 | -0.268148 | -0.743407 |
UrbanPop | 0.278191 | -0.872806 | -0.378016 | 0.133878 |
Rape | 0.543432 | -0.167319 | 0.817778 | 0.089024 |
- PC1 = 0.536 × Murder + 0.583 × Assault + 0.278 × UrbanPop + 0.543 × Rape, where each variable is the standardized value.
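As a quick check of this linear combination, the sketch below recomputes Alabama's PC1 score by hand from the numbers printed in the tables. It relies on the standardized data already having zero mean, so the centering step that `PCA` applies changes nothing here. The variable names are illustrative only.

```python
# Standardized values for Alabama, in column order: Murder, Assault, UrbanPop, Rape
alabama_z = [1.255179, 0.790787, -0.526195, -0.003451]

# PC1 column of the loadings table, in the same variable order
pc1_loadings = [0.535899, 0.583184, 0.278191, 0.543432]

# A principal component score is the dot product of the standardized row and the loading vector
pc1_score = sum(z * w for z, w in zip(alabama_z, pc1_loadings))
print(f"PC1 score for Alabama: {pc1_score:.6f}")
```

The result should match the first PC1 entry of `pc_score.head()` up to rounding of the printed inputs.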
pca.explained_variance_ratio_
array([0.62006039, 0.24744129, 0.0891408 , 0.04335752])
- The first two principal components alone explain about 87% of the total variance (0.620 + 0.247 ≈ 0.868).
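The cumulative share of variance can be read off directly from the ratios printed above. The sketch below accumulates them and picks the smallest number of components that reaches a target; the 0.85 threshold is a hypothetical choice for illustration, not a value from the analysis.

```python
from itertools import accumulate

# Values copied from pca.explained_variance_ratio_
explained_variance_ratio = [0.62006039, 0.24744129, 0.0891408, 0.04335752]

# Running total of explained variance as components are added
cumulative = list(accumulate(explained_variance_ratio))
for k, total in enumerate(cumulative, start=1):
    print(f"first {k} component(s): {total:.4f}")

threshold = 0.85  # hypothetical target share of variance
n_components = next(
    k for k, total in enumerate(cumulative, start=1) if total >= threshold
)
print(f"components needed for {threshold:.0%}: {n_components}")
```

With the fitted `pca` object available, `pca.explained_variance_ratio_.cumsum()` gives the same running totals.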
pc_score.corr()
Component | PC1 | PC2 | PC3 | PC4 |
---|---|---|---|---|
PC1 | 1.000000e+00 | 2.444648e-17 | 7.732791e-17 | 1.071531e-15 |
PC2 | 2.444648e-17 | 1.000000e+00 | -2.878039e-16 | 2.180585e-16 |
PC3 | 7.732791e-17 | -2.878039e-16 | 1.000000e+00 | 3.147523e-16 |
PC4 | 1.071531e-15 | 2.180585e-16 | 3.147523e-16 | 1.000000e+00 |
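The off-diagonal entries on the order of 1e-15 to 1e-17 are floating-point zeros: the principal component scores are mutually uncorrelated. A related property that can be checked from the printed numbers alone is that the loading vectors are orthonormal. The sketch below is a minimal check using the rounded loadings from the table above, so it compares against a small tolerance rather than exact equality; the helper `dot` is hypothetical.

```python
# Loading vectors copied from the loadings table (order: Murder, Assault, UrbanPop, Rape)
loadings = {
    "PC1": [0.535899, 0.583184, 0.278191, 0.543432],
    "PC2": [0.418181, 0.187986, -0.872806, -0.167319],
    "PC3": [-0.341233, -0.268148, -0.378016, 0.817778],
    "PC4": [0.649228, -0.743407, 0.133878, 0.089024],
}


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


# Gram matrix of the loading vectors: close to the identity if they are orthonormal
names = list(loadings)
gram = {(a, b): dot(loadings[a], loadings[b]) for a in names for b in names}

tolerance = 1e-4  # allows for the six-decimal rounding of the printed loadings
is_orthonormal = all(
    abs(value - (1.0 if a == b else 0.0)) < tolerance
    for (a, b), value in gram.items()
)
print("loadings orthonormal within tolerance:", is_orthonormal)
```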