sklearn-09 classification 4 random forest

Machine Learning/sklearn 2023. 5. 10. 18:05

1. random forest

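Random forest is an algorithm designed to overcome the decision tree's tendency to overfit. It builds many decision trees from the same dataset, each trained on a random bootstrap sample of the rows, and combines their predictions. This approach is called an ensemble.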
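The ensemble idea described above can be sketched by hand: train several decision trees on bootstrap samples and combine their predictions. This is only an illustration under stated assumptions. The synthetic data, `n_trees`, and the plain majority vote are choices made here, not part of the original post, and scikit-learn's own `RandomForestClassifier` averages predicted class probabilities rather than counting votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (an assumption, for illustration only).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # hypothetical ensemble size
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw row indices with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    # max_features="sqrt" makes each split consider a random subset of features.
    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(1_000_000))
    )
    trees.append(tree.fit(X[idx], y[idx]))

# Majority vote across the trees (labels are 0/1, so the mean is the vote share for 1).
votes = np.stack([t.predict(X) for t in trees])  # shape: (n_trees, n_samples)
ensemble_pred = (votes.mean(axis=0) >= 0.5).astype(int)
```

Each individual tree can overfit its own bootstrap sample, but the errors of different trees tend to differ, so the combined prediction is usually more stable than any single tree.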

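Random forests can be used for both classification and regression, and they work well on large datasets. Training cost grows with the number of trees and the size of the data, so fitting can take a long time. Because a forest contains many trees that cannot all be inspected one by one, it is also less interpretable than a single decision tree.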
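As a small sketch of the points above, scikit-learn provides both `RandomForestClassifier` and `RandomForestRegressor`, and the fitted `feature_importances_` attribute offers a partial view into what the forest learned even though the individual trees are hard to read. The synthetic datasets and `n_estimators=50` below are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Synthetic data for each task (assumptions, not the abalone data).
Xc, yc = make_classification(n_samples=200, n_features=6, random_state=0)
Xr, yr = make_regression(n_samples=200, n_features=6, noise=0.1, random_state=0)

# The number of trees is set by n_estimators, independent of the data size.
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(Xc, yc)
reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(Xr, yr)

print(len(clf.estimators_))      # number of fitted trees in the forest
print(clf.feature_importances_)  # one importance value per feature
print(reg.predict(Xr[:3]))       # regression output is continuous
```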

2. python

from sklearn.ensemble import RandomForestClassifier

Import the RandomForestClassifier class.

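from os.path import join

# Paths to the abalone data and to the file listing its column names.
abalone_path = join('.', 'abalone.txt')
column_path = join('.', 'abalone_attributes.txt')

# Read one column name per line; the file is closed automatically.
with open(column_path) as f:
    abalone_columns = [line.strip() for line in f]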

Set the paths to the abalone data and read the column names.

abalone_columns
['Sex',
 'Length',
 'Diameter',
 'Height',
 'Whole weight',
 'Shucked weight',
 'Viscera weight',
 'Shell weight',
 'Rings']

Check the column names that were loaded.

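import pandas as pd

# The data file has no header row, so supply the column names explicitly.
data = pd.read_csv(abalone_path, header=None, names=abalone_columns)

# Drop infants (Sex == 'I') so that only M and F remain.
data = data[data['Sex'] != 'I']
# Binary label: 0 for M, 1 for F.
label = data['Sex'].map(lambda x: 0 if x == 'M' else 1)
# Remove the Sex column so it is not used as a feature.
del data['Sex']
data.head()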


   Length  Diameter  Height  Whole weight  Shucked weight  Viscera weight  Shell weight  Rings
0   0.455     0.365   0.095        0.5140          0.2245          0.1010         0.150     15
1   0.350     0.265   0.090        0.2255          0.0995          0.0485         0.070      7
2   0.530     0.420   0.135        0.6770          0.2565          0.1415         0.210      9
3   0.440     0.365   0.125        0.5160          0.2155          0.1140         0.155     10
6   0.530     0.415   0.150        0.7775          0.2370          0.1415         0.330     20

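Rows whose Sex value is I are removed, and the remaining values are encoded as a binary label: 0 for M and 1 for F. This is a simple label encoding, not one-hot encoding.

The Sex column is then deleted from the feature data.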
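The filtering and encoding step can be checked on a tiny table. The rows below are made up for illustration and are not taken from the abalone file. The names `toy`, `toy_label`, and `toy_features` are hypothetical, and a dictionary is passed to `map` as an alternative to the lambda.

```python
import pandas as pd

# Toy rows standing in for the abalone data (values are made up).
toy = pd.DataFrame({"Sex": ["M", "F", "I", "M", "F"],
                    "Rings": [15, 7, 9, 10, 20]})

# Drop the 'I' rows; the original index values of the kept rows are preserved.
toy = toy[toy["Sex"] != "I"]

# Binary label: 0 for M, 1 for F.
toy_label = toy["Sex"].map({"M": 0, "F": 1})

# Features without the Sex column.
toy_features = toy.drop(columns="Sex")

print(toy_label.tolist())  # [0, 1, 0, 1]
print(list(toy.index))     # [0, 1, 3, 4]
```

The gap in the index (no row 2) mirrors the jump from row 3 to row 6 in the `data.head()` output, where the infant rows were removed.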

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    data, label, test_size=0.2, random_state=2023)

Split the data into a training set (80%) and a test set (20%).
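To see what the split produces, here is a minimal sketch on made-up arrays. The `stratify` argument is an addition that the original post does not use. It keeps the class ratio the same in both parts, which can help when classes are imbalanced. The names `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 50 made-up samples with 2 features and balanced 0/1 labels.
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.array([0] * 25 + [1] * 25)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2023, stratify=y_demo)

print(X_tr.shape, X_te.shape)  # (40, 2) (10, 2)
print(y_te.mean())             # 0.5, since stratify keeps the classes balanced
```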

rf = RandomForestClassifier(max_depth=5)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

Train the model on the training set and predict on the test set.

from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

Import the functions used to compute the evaluation metrics.

print('Accuracy:{:}'.format(accuracy_score(y_test,y_pred)))
print('precision:{:}'.format(precision_score(y_test,y_pred)))
print('recall:{:}'.format(recall_score(y_test,y_pred)))
print('auc:{:}'.format(roc_auc_score(y_test,y_pred)))

Accuracy:0.5220458553791887
precision:0.5121951219512195
recall:0.22992700729927007
auc:0.5125744251513416

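Check the accuracy, precision, recall, and AUC values. With accuracy and AUC both close to 0.5, the model is barely better than chance at separating M from F. Note that the AUC here is computed from the hard class predictions rather than from predicted probabilities.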
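`roc_auc_score` is normally given a continuous score such as the predicted probability of the positive class, because the ROC curve is built by sweeping a threshold over that score. The sketch below compares the two ways of calling it. It uses synthetic data, and the names `X_s`, `y_s`, and `model` are assumptions, so the numbers it prints say nothing about the abalone results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary data (an assumption, not the abalone data).
X_s, y_s = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_s, y_s, test_size=0.2, random_state=0)

model = RandomForestClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)

# AUC from hard 0/1 predictions: only one threshold is evaluated.
auc_from_labels = roc_auc_score(y_te, model.predict(X_te))

# AUC from the probability of class 1: the full ROC curve is used.
proba = model.predict_proba(X_te)
auc_from_scores = roc_auc_score(y_te, proba[:, 1])

print(auc_from_labels, auc_from_scores)
```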

best_model_depth = 0
best_model_accuracy = 0

# Try max_depth values from 2 to 10 and remember the most accurate one.
for i in range(2, 11):
    rf = RandomForestClassifier(max_depth=i)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_test)

    acc = accuracy_score(y_test, y_pred)

    print('Accuracy:i={} {:}'.format(i, acc*100))

    if best_model_accuracy < acc:
        best_model_depth = i
        best_model_accuracy = acc

print('-----------')
print('best_model_depth = {0}, best_model_accuracy={1}'.format(
    best_model_depth, best_model_accuracy
))

Accuracy:i=2 52.733686067019406
Accuracy:i=3 51.85185185185185
Accuracy:i=4 53.43915343915344
Accuracy:i=5 53.086419753086425
Accuracy:i=6 53.96825396825397
Accuracy:i=7 55.026455026455025
Accuracy:i=8 53.96825396825397
Accuracy:i=9 52.02821869488537
Accuracy:i=10 54.32098765432099
-----------
best_model_depth = 7, best_model_accuracy=0.5502645502645502

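Print the accuracy while changing the maximum depth of the random forest from 2 to 10.

A conditional inside the loop keeps track of the depth with the highest accuracy. Because the forest is created without random_state, the numbers change from run to run, and the gaps between depths (roughly 52% to 55%) are small enough that a different depth may win on another run.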
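The depth search picks the depth using test-set accuracy, which tends to make the reported best score optimistic. One common alternative, sketched here rather than taken from the original post, is to choose the depth by cross-validation on the training set and touch the test set only once at the end. The synthetic data, `n_estimators=30`, and `cv=5` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary data (an assumption, not the abalone data).
X_s, y_s = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_s, y_s, test_size=0.2, random_state=2023)

# Evaluate max_depth 2..10 with 5-fold cross-validation on the training set.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=30, random_state=2023),
    param_grid={"max_depth": list(range(2, 11))},
    scoring="accuracy",
    cv=5,
)
search.fit(X_tr, y_tr)

print(search.best_params_)         # depth chosen by cross-validation
print(search.best_score_)          # mean cross-validated accuracy for that depth
test_accuracy = search.score(X_te, y_te)
print(test_accuracy)               # accuracy of the refit model on held-out data
```

Fixing `random_state` on the classifier makes the comparison between depths repeatable, so differences reflect the depth rather than random variation between runs.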
