Supervised Learning: Classification Models


Supervised Learning: Classification Models - KNN

최 수빈 2025. 3. 14. 02:16

KNN (K-Nearest Neighbors)

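Classifies a new data point by comparing it with the K nearest existing data points (its neighbors)

Computes the distance between data points to find the nearest neighbors, then decides the class by majority vote

A non-parametric method that can be used for both classification and regression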

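Distance measurement

A key element of the KNN algorithm, since the choice of distance metric determines which points count as neighbors

Euclidean distance is the most commonly used metric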
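
As a minimal sketch of the distance step, the Euclidean distance between two points p and q is the square root of the sum of squared per-feature differences. The helper name `euclidean_distance` is illustrative and not part of any library; the standard library's `math.dist` computes the same quantity.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two equal-length feature vectors."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of features")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

a = (1.0, 2.0)
b = (4.0, 6.0)
print(euclidean_distance(a, b))  # 5.0 (differences of 3 and 4 give a 3-4-5 triangle)
print(math.dist(a, b))           # 5.0, the same value from the standard library
```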

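Choosing K

The smaller K is, the more sensitive the model becomes to noise in the data (risk of overfitting)
The larger K is, the smoother the decision boundary becomes, which helps prevent overfitting; if K is too large, however, the model can underfit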
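
The effect of K can be seen on a toy example. The sketch below uses a made-up one-dimensional training set with one deliberately mislabeled point; the data and the helper `knn_predict` are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical 1-D training set: class "A" sits around 1.0-2.5, class "B" around 6.0-7.5.
# The point (2.2, "B") is a deliberately mislabeled noise sample inside the "A" region.
train = [(1.0, "A"), (1.5, "A"), (2.0, "A"), (2.5, "A"),
         (2.2, "B"),
         (6.0, "B"), (6.5, "B"), (7.0, "B"), (7.5, "B")]

def knn_predict(x, k):
    """Return the majority label among the k training points closest to x."""
    nearest = sorted(train, key=lambda item: abs(item[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn_predict(2.3, k=1))  # B: the single nearest neighbor is the noise point
print(knn_predict(2.3, k=5))  # A: four real "A" neighbors outvote the noise point
print(knn_predict(1.2, k=9))  # B: with K = all 9 points, the global majority always wins
```

With K=1 the prediction follows the noise point, K=5 recovers the surrounding class, and K equal to the full training set size ignores locality altogether.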

Majority vote

The class that appears most frequently among the K nearest neighbors becomes the prediction
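
The voting step on its own can be sketched as follows, assuming the labels of the K nearest neighbors have already been collected. `majority_vote` is a hypothetical helper name; it also returns the share of votes the winning class received.

```python
from collections import Counter

def majority_vote(neighbor_labels):
    """Return (winning label, fraction of neighbors that voted for it)."""
    counts = Counter(neighbor_labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(neighbor_labels)

# Labels of the K=5 nearest neighbors of some query point (made-up values).
labels = [1, 0, 1, 1, 0]
print(majority_vote(labels))  # (1, 0.6): class 1 wins with 3 of 5 votes
```

In binary classification an odd K avoids ties. In this sketch a tie would go to whichever tied label appears first in the list, which is an arbitrary choice.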

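Purpose of KNN

Predict the class of a new data point based on the training data

→ Mainly used for classification problems, with applications in areas such as medical diagnosis, image recognition, and recommender systems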

KNN in practice

Breast cancer dataset

Data loading and preprocessing

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the data
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target

# Split the data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features: fit on the training set only, then apply the same transform to the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
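
Scaling matters for KNN because features with large numeric ranges otherwise dominate the distance. The sketch below shows this on made-up (age, income) rows using only the standard library. The data and helper names are assumptions, and the standardization mirrors what `StandardScaler` does (subtract the mean, divide by the population standard deviation).

```python
import math
import statistics

# Hypothetical rows: (age in years, income in dollars).
train = [(25, 40000), (35, 60000), (45, 50000), (60, 50500), (31, 52000)]
query = (30, 50000)

# Per-feature mean and population standard deviation, learned from the training rows only.
columns = list(zip(*train))
means = [statistics.fmean(col) for col in columns]
stds = [statistics.pstdev(col) for col in columns]

def standardize(row):
    """Rescale one row to zero mean and unit variance per feature."""
    return tuple((value - m) / s for value, m, s in zip(row, means, stds))

def nearest_index(rows, target):
    """Index of the row with the smallest Euclidean distance to target."""
    return min(range(len(rows)), key=lambda i: math.dist(rows[i], target))

# Raw features: income differences swamp age differences.
print(train[nearest_index(train, query)])  # (45, 50000)

# Standardized features: both columns contribute on a comparable scale.
scaled_train = [standardize(row) for row in train]
print(train[nearest_index(scaled_train, standardize(query))])  # (31, 52000)
```

Without scaling, the nearest neighbor is the person with identical income but a 15-year age gap; after scaling, it is the person who is close in both age and income.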

Model training and evaluation

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Create and train the KNN model (K=5)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Predict on the test set
y_pred = knn.predict(X_test)

# Evaluate the model
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(f"Classification Report:\n{classification_report(y_test, y_pred)}")
print(f"Confusion Matrix:\n{confusion_matrix(y_test, y_pred)}")

"""
Accuracy: 0.9473684210526315
Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.93      0.93        43
           1       0.96      0.96      0.96        71

    accuracy                           0.95       114
   macro avg       0.94      0.94      0.94       114
weighted avg       0.95      0.95      0.95       114

Confusion Matrix:
[[40  3]
 [ 3 68]]
 """

Titanic dataset

Data loading and preprocessing

import seaborn as sns

titanic = sns.load_dataset('titanic')

# Select the needed columns and drop rows with missing values
titanic = titanic[['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']].dropna()

# Encode the categorical columns as integers
titanic['sex'] = titanic['sex'].map({'male': 0, 'female': 1})
titanic['embarked'] = titanic['embarked'].map({'C': 0, 'Q': 1, 'S': 2})

# Separate features and target
X = titanic.drop('survived', axis=1)
y = titanic['survived']

# Split the data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features: fit on the training set only, then apply the same transform to the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
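
One caveat about the `embarked` encoding: mapping C, Q, S to 0, 1, 2 tells a distance-based model that S is twice as far from C as Q is, although the ports have no natural order. A possible alternative, which is not what this post uses, is one-hot encoding with `pd.get_dummies`. The three-row DataFrame below is a made-up stand-in for the real data.

```python
import pandas as pd

# Hypothetical sample standing in for the Titanic 'embarked' and 'fare' columns.
sample = pd.DataFrame({"embarked": ["C", "Q", "S"], "fare": [71.3, 7.8, 8.1]})

# One indicator column per port; dtype=int yields 0/1 instead of booleans.
encoded = pd.get_dummies(sample, columns=["embarked"], dtype=int)

print(encoded.columns.tolist())
# ['fare', 'embarked_C', 'embarked_Q', 'embarked_S']
```

With indicator columns, every pair of ports is the same distance apart, so no ordering is implied. Whether this changes the accuracy on the real dataset would need to be checked by rerunning the evaluation.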

Model training and evaluation

# Create and train the KNN model (K=5)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Predict on the test set
y_pred = knn.predict(X_test)

# Evaluate the model
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(f"Classification Report:\n{classification_report(y_test, y_pred)}")
print(f"Confusion Matrix:\n{confusion_matrix(y_test, y_pred)}")

"""
Accuracy: 0.7832167832167832
Classification Report:
              precision    recall  f1-score   support

           0       0.76      0.89      0.82        80
           1       0.82      0.65      0.73        63

    accuracy                           0.78       143
   macro avg       0.79      0.77      0.77       143
weighted avg       0.79      0.78      0.78       143

Confusion Matrix:
[[71  9]
 [22 41]]
"""

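KNN is a simple yet effective classification algorithm that can be applied in practice to a wide range of datasets; in the examples above it reached about 95% accuracy on the breast cancer data and about 78% on the Titanic data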

License: Attribution-NonCommercial-NoDerivatives (CC BY-NC-ND)
