Supervised Learning: Classification Models - Naive Bayes

최 수빈 2025. 3. 14. 08:01

Naive Bayes

A probabilistic classification model based on Bayes' theorem

Why it is called "naive"

→ It assumes that the features are independent of one another given the class
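
Written out, the independence assumption lets the class posterior factor into one term per feature. The notation here (class $y$, features $x_1, \dots, x_n$) is introduced only for illustration and does not come from the original post:

```latex
P(y \mid x_1, \dots, x_n)
  = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}
  \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y),
\qquad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
```

The denominator is the same for every class, so it can be dropped when only the most probable class is needed.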

It often performs well in areas such as text classification and spam filtering

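Bayes' Theorem

Related post: Bayesian Statistics (s2bibiprincess.tistory.com)
Bayes' theorem gives a way to compute a posterior probability by updating an existing prior probability in light of new evidence. It is used in many fields, including statistical inference, machine learning, and medical diagnosis.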

Purpose of Naive Bayes

To classify a given data point by computing the probability that it belongs to each class
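
This calculation can be traced by hand on a toy example. The sketch below is a minimal illustration, not a full implementation: the priors, the word likelihoods, and the helper name `posterior` are all made-up assumptions, and only the words that are present contribute to the score.

```python
import math

# Hypothetical toy numbers, chosen only for illustration.
priors = {"spam": 0.4, "ham": 0.6}
# P(word appears | class), with words treated as independent given the class.
likelihoods = {
    "spam": {"free": 0.8, "meeting": 0.1},
    "ham": {"free": 0.2, "meeting": 0.5},
}


def posterior(words):
    """Return P(class | words) under the naive independence assumption."""
    # Sum logs instead of multiplying so many small probabilities do not underflow.
    log_joint = {
        c: math.log(priors[c]) + sum(math.log(likelihoods[c][w]) for w in words)
        for c in priors
    }
    # Subtract the maximum before exponentiating for numerical stability, then normalize.
    m = max(log_joint.values())
    unnorm = {c: math.exp(v - m) for c, v in log_joint.items()}
    total = sum(unnorm.values())
    return {c: v / total for c, v in unnorm.items()}


result = posterior(["free"])
predicted = max(result, key=result.get)
print(result, predicted)
```

For the single word "free", the unnormalized scores are 0.4 × 0.8 = 0.32 for spam and 0.6 × 0.2 = 0.12 for ham, so the spam posterior is 0.32 / 0.44 ≈ 0.73.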

Fast to compute and structurally simple, so it stays efficient even on large datasets
Performs well in NLP tasks such as text classification (spam filtering, sentiment analysis)
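
A text classifier of this kind can be sketched with scikit-learn's `CountVectorizer` and `MultinomialNB`. The six-message corpus and its labels below are invented for illustration; a real spam filter would need far more data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, made up for illustration only.
texts = [
    "win a free prize now",
    "free money claim your prize",
    "limited offer win cash now",
    "meeting moved to monday morning",
    "please review the project report",
    "lunch with the team on monday",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# CountVectorizer turns each message into word counts (bag-of-words);
# MultinomialNB then models those counts per class.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(texts, labels)

pred = clf.predict(["claim your free cash prize", "project meeting on monday"])
print(pred)
```

Training amounts to counting words per class, which is why the model stays cheap as the corpus grows.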

Naive Bayes variants by data type

Gaussian Naive Bayes	For continuous features; assumes each feature follows a normal (Gaussian) distribution within each class
Bernoulli Naive Bayes	For binary (0 or 1) features
Multinomial Naive Bayes	For count features that follow a multinomial distribution, such as word counts (bag-of-words) in text data
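
The sketch below matches each variant to the kind of feature it expects. The data is synthetic (random draws with a fixed seed) and serves only to show which class fits which input type; the printed training accuracies say nothing about real datasets.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)

# Continuous features: two Gaussian clusters (synthetic, for illustration).
X_cont = np.vstack([rng.normal(0.0, 1.0, (50, 3)), rng.normal(3.0, 1.0, (50, 3))])

# Count features, e.g. word counts: Poisson draws with class-dependent rates.
X_count = np.vstack([rng.poisson([5, 1, 1], (50, 3)), rng.poisson([1, 1, 5], (50, 3))])

# Binary features: "count exceeds 2" indicators derived from the counts.
X_bin = (X_count > 2).astype(int)

scores = {
    "GaussianNB": GaussianNB().fit(X_cont, y).score(X_cont, y),
    "MultinomialNB": MultinomialNB().fit(X_count, y).score(X_count, y),
    "BernoulliNB": BernoulliNB().fit(X_bin, y).score(X_bin, y),
}
for name, acc in scores.items():
    print(f"{name}: training accuracy = {acc:.2f}")
```

`MultinomialNB` requires non-negative inputs, so standardized features (which can be negative) are suitable for `GaussianNB` but not for the multinomial variant.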

Naive Bayes in Practice

Breast cancer dataset

Loading and preprocessing the data

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the data
data = load_breast_cancer()
X = data.data
y = data.target

# Split the data (80% training set, 20% test set)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features (standardization); fit on the training set only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Training and evaluating the model

Train and evaluate a Gaussian Naive Bayes (GaussianNB) model

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Create and train the model
model = GaussianNB()
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(f"Classification Report:\n{classification_report(y_test, y_pred)}")
print(f"Confusion Matrix:\n{confusion_matrix(y_test, y_pred)}")

"""
Accuracy: 0.9649122807017544
Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.93      0.95        43
           1       0.96      0.99      0.97        71

    accuracy                           0.96       114
   macro avg       0.97      0.96      0.96       114
weighted avg       0.97      0.96      0.96       114

Confusion Matrix:
[[40  3]
 [ 1 70]]
 """

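The figures in the classification report follow directly from the confusion matrix. As a check, the sketch below recomputes them with NumPy from the matrix printed above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix reported above: rows are true classes, columns are predictions.
cm = np.array([[40, 3],
               [1, 70]])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
precision = np.diag(cm) / cm.sum(axis=0)  # per class: correct / predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)     # per class: correct / actually that class
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy : {accuracy:.4f}")
print(f"precision: {np.round(precision, 2)}")
print(f"recall   : {np.round(recall, 2)}")
print(f"f1-score : {np.round(f1, 2)}")
```

In this dataset class 0 is malignant and class 1 is benign, so the 3 in the top-right cell counts malignant cases predicted as benign, which is what lowers the class 0 recall to 40/43 ≈ 0.93.
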
Titanic dataset

Loading and preprocessing the data

import seaborn as sns

# Load the data
titanic = sns.load_dataset('titanic')

# Select the needed columns and drop rows with missing values
# (.copy() avoids pandas chained-assignment warnings on the column edits below)
titanic = titanic[['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']].dropna().copy()

# Encode sex and port of embarkation as numbers
titanic['sex'] = titanic['sex'].map({'male': 0, 'female': 1})
titanic['embarked'] = titanic['embarked'].map({'C': 0, 'Q': 1, 'S': 2})

# Separate the features and the target
X = titanic.drop('survived', axis=1)
y = titanic['survived']

# Split the data (80% training set, 20% test set)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features; fit on the training set only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Training and evaluating the model

Apply, train, and evaluate a Gaussian Naive Bayes model

# Create and train the model
model = GaussianNB()
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(f"Classification Report:\n{classification_report(y_test, y_pred)}")
print(f"Confusion Matrix:\n{confusion_matrix(y_test, y_pred)}")
