NLP Models Explored with Hugging Face and the Transformers Library

최 수빈 2025. 3. 24. 08:51

Hands-on practice with NLP models using Hugging Face's Transformers library

Sentiment analysis
Text generation
Translation
Sentence embeddings

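*The Transformers library

An open-source library developed by Hugging Face

Provides a wide range of pretrained NLP models

Supports tasks such as text generation, translation, sentiment analysis, and sentence classification

pipeline() wraps tokenization, model inference, and post-processing in a single call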
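
A pipeline bundles three stages: tokenization, the model's forward pass, and post-processing. The last stage can be sketched without downloading a model. The snippet below is only an illustration, not the library's implementation: `toy_logits` and `toy_id2label` are made-up stand-ins for what a classification model and its config would supply.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(logits, id2label):
    """Pick the most probable class and format it like a pipeline result."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return [{"label": id2label[best], "score": probs[best]}]


# Hypothetical logits standing in for a model's output on one sentence
toy_logits = [-1.2, 2.3]
toy_id2label = {0: "NEGATIVE", 1: "POSITIVE"}

toy_result = postprocess(toy_logits, toy_id2label)
print(toy_result)
```

The result has the same shape as the pipeline outputs in this post: a list with one dict per input, holding a label and a score.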

Text generation: GPT-2

Developed by OpenAI
A model specialized in predicting the next token and generating text that continues a prompt

from transformers import pipeline

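# Load a GPT-2 based text generation pipeline
generator = pipeline("text-generation", model="gpt2")

# Generate text; max_length counts the prompt tokens too,
# and pad_token_id=50256 reuses GPT-2's end-of-text token for padding
result = generator(
    "I have a",
    max_length=50,
    num_return_sequences=1,
    pad_token_id=50256,
    truncation=False,
)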

print(result)

"""
[{'generated_text': "I have a problem with that, and I can go into this in so much detail if you're interested in it.\n\n\nBut the other problem...\n\nI'm starting to wonder what's happening to the other three people who have never worked"}]
"""
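
The generation loop itself is simple: predict a next token, append it, and repeat. Below is a minimal sketch under loud assumptions: `toy_next_token_scores` is a hand-written table, whereas GPT-2 computes such scores with a neural network over a large vocabulary. The sketch always takes the top-scoring token (greedy decoding); the pipeline run above may sample instead, so its text can differ from run to run.

```python
# Hypothetical next-token table: for each token, candidate next tokens with scores.
toy_next_token_scores = {
    "I": {"have": 0.6, "am": 0.4},
    "have": {"a": 0.7, "no": 0.3},
    "a": {"problem": 0.5, "plan": 0.3, "dog": 0.2},
    "problem": {"<eos>": 1.0},
}


def generate_greedy(prompt_tokens, max_length):
    """Append the highest-scoring next token until <eos> or max_length tokens in total."""
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:
        candidates = toy_next_token_scores.get(tokens[-1])
        if not candidates:
            break  # the toy table has no continuation for this token
        next_token = max(candidates, key=candidates.get)
        if next_token == "<eos>":
            break  # end-of-sequence token: stop generating
        tokens.append(next_token)
    return tokens


toy_generated = generate_greedy(["I", "have", "a"], max_length=50)
print(" ".join(toy_generated))
```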

Sentiment analysis: default model & RoBERTa

Using the default model

from transformers import pipeline

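# Load the sentiment analysis pipeline (no model given, so the task's default checkpoint is used)
sentiment_analysis = pipeline("sentiment-analysis")
# "해줘" is Korean for "do it for me"
print(sentiment_analysis("해줘"))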

"""
[{'label': 'POSITIVE', 'score': 0.5771076679229736}]
"""

Using a RoBERTa model

A model that improves on BERT's pretraining procedure

Performs well on classification tasks once it has been fine-tuned

from transformers import pipeline

# Load a sentiment analysis pipeline on top of the pretrained roberta-base checkpoint
classifier = pipeline("sentiment-analysis", model="roberta-base")

# Run sentiment analysis
print(classifier("해줘"))

"""
[{'label': 'LABEL_0', 'score': 0.5240933895111084}]
"""

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

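→ roberta-base is only pretrained; its classification head is newly initialized, so it has to be fine-tuned on the classification task before its predictions mean anything

pipeline("sentiment-analysis") without a model argument falls back to a checkpoint that is already fine-tuned for sentiment classification (a DistilBERT model fine-tuned on SST-2 at the time of writing)

To use BERT or RoBERTa directly, either fine-tune it for sentiment classification or pick a checkpoint from the Hugging Face Hub that has already been fine-tuned for it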
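
Why an untrained head answers with a generic label and a score close to 0.5 can be sketched with plain Python. This is a simplified stand-in, not the actual RoBERTa head: it assumes a single linear layer with small random weights (standard deviation 0.02), a hypothetical 768-dimensional sentence feature `toy_features`, and two labels.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

hidden_size = 768  # assumed feature size, matching base-sized BERT/RoBERTa models
num_labels = 2

# Hypothetical pooled sentence feature with values in [-1, 1]
toy_features = [rng.uniform(-1.0, 1.0) for _ in range(hidden_size)]

# Untrained classification head: small random weights, no bias (assumed initialization)
toy_head = [
    [rng.gauss(0.0, 0.02) for _ in range(hidden_size)] for _ in range(num_labels)
]

# One logit per label: dot product of the head's weights with the feature
toy_head_logits = [sum(w * x for w, x in zip(row, toy_features)) for row in toy_head]

# Softmax over the two logits
peak = max(toy_head_logits)
exps = [math.exp(v - peak) for v in toy_head_logits]
toy_head_probs = [e / sum(exps) for e in exps]

# Labels are generic because nothing has defined what class 0 or class 1 means yet
print({f"LABEL_{i}": round(p, 4) for i, p in enumerate(toy_head_probs)})
```

With weights this small the two logits stay close together, so the probabilities hover near an even split regardless of the input. Fine-tuning is what pulls them apart in a meaningful direction.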

Sentence embeddings: BERT

BERT reads the context of a sentence in both directions and is used for many downstream tasks

# Import a BERT-based embedding model
from transformers import BertTokenizer, BertModel
import torch
from scipy.spatial.distance import cosine
import warnings

warnings.filterwarnings("ignore", category=FutureWarning)

# Specify the model name
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

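# Sentences to compare (Korean, with English glosses)
sentences = [
    "나는 오늘 낮잠 잤음",  # "I took a nap today"
    "아직 저녁을 못먹음",  # "Haven't had dinner yet"
    "갑자기 배고파",  # "Suddenly hungry"
    "오빠야가 뭐 먹고 싶냐고 물어본다.",  # "He asks what I want to eat."
]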

# Tokenize each sentence separately
input1 = tokenizer(sentences[0], return_tensors="pt")
input2 = tokenizer(sentences[1], return_tensors="pt")
input3 = tokenizer(sentences[2], return_tensors="pt")
input4 = tokenizer(sentences[3], return_tensors="pt")

# Run the model on each sentence (no gradients needed for inference)
with torch.no_grad():
    output1 = model(**input1)
    output2 = model(**input2)
    output3 = model(**input3)
    output4 = model(**input4)

# Build one sentence vector per sentence by averaging its token vectors (mean pooling)
embedding1 = output1.last_hidden_state.mean(dim=1).squeeze().cpu().numpy()
embedding2 = output2.last_hidden_state.mean(dim=1).squeeze().cpu().numpy()
embedding3 = output3.last_hidden_state.mean(dim=1).squeeze().cpu().numpy()
embedding4 = output4.last_hidden_state.mean(dim=1).squeeze().cpu().numpy()

# Compute cosine similarity (scipy's cosine() returns a distance, hence 1 - distance)
similarity = 1 - cosine(embedding1, embedding2)
print(f"1&2: {similarity:.4f}")
similarity = 1 - cosine(embedding1, embedding3)
print(f"1&3: {similarity:.4f}")
similarity = 1 - cosine(embedding1, embedding4)
print(f"1&4: {similarity:.4f}")
similarity = 1 - cosine(embedding2, embedding3)
print(f"2&3: {similarity:.4f}")
similarity = 1 - cosine(embedding2, embedding4)
print(f"2&4: {similarity:.4f}")
similarity = 1 - cosine(embedding3, embedding4)
print(f"3&4: {similarity:.4f}")

"""
1&2: 0.9041
1&3: 0.8227
1&4: 0.8687
2&3: 0.8845
2&4: 0.9091
3&4: 0.8758
"""
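
The two steps used above, mean pooling and cosine similarity, can be written out in plain Python, and the six repeated comparisons collapse into one loop with itertools.combinations. This is a sketch with made-up data: `toy_token_vectors` holds hypothetical 3-dimensional token vectors, whereas the BERT run uses 768-dimensional vectors from last_hidden_state.

```python
import math
from itertools import combinations


def mean_pool(token_vectors):
    """Average token vectors position by position into one sentence vector."""
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical token vectors for three short "sentences"
toy_token_vectors = {
    1: [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    2: [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0], [1.0, 1.0, 0.0]],
    3: [[0.0, 0.0, 2.0]],
}

toy_embeddings = {key: mean_pool(vecs) for key, vecs in toy_token_vectors.items()}

# Compare every pair once instead of writing each comparison by hand
toy_similarities = {}
for (i, a), (j, b) in combinations(toy_embeddings.items(), 2):
    toy_similarities[(i, j)] = cosine_similarity(a, b)
    print(f"{i}&{j}: {toy_similarities[(i, j)]:.4f}")
```

Sentences 1 and 2 point in the same direction after pooling, so their similarity is at the maximum even though their lengths differ. Sentence 3 is orthogonal to both.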

Translation models: M2M100 & NLLB-200

Both models support multilingual translation

(1) M2M100

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
import warnings

warnings.filterwarnings("ignore", category=FutureWarning)

# Load the M2M100 model and tokenizer
model_name = "facebook/m2m100_418M"
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

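# Sentence to translate ("I took a nap today")
sentence = "나는 오늘 낮잠 잤음"

# Set the source language before tokenizing so the input is tagged as Korean
tokenizer.src_lang = "ko"

# Tokenize the input sentence
encoded_sentence = tokenizer(sentence, return_tensors="pt")

# Translate (Korean -> English); M2M100 selects the target language by forcing
# the first generated token to be the English language token
generated_tokens = model.generate(
    **encoded_sentence, forced_bos_token_id=tokenizer.get_lang_id("en")
)

# Decode the result, dropping special tokens such as </s> and __en__
translated_text = tokenizer.decode(generated_tokens[0], skip_special_tokens=True)

print(f"Translated sentence: {translated_text}")

"""
Translated sentence: </s> __en__ I am sleeping today</s>
"""

→ This recorded output comes from a run that did not strip special tokens and tokenized with the default source language, which is why </s> and __en__ appear; with the code above they are removed, and the wording of the translation may differ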
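
The `</s>` and `__en__` markers in the output above are special tokens, and `__en__` is the forced target-language token. What the skip_special_tokens option of decode() does can be sketched with a toy function. This is an illustration only: `TOY_SPECIAL_TOKENS` and `toy_decode` are hypothetical, and the real tokenizer works on token ids and subword pieces rather than whole words.

```python
# Hypothetical special tokens, modeled on the ones visible in the output above
TOY_SPECIAL_TOKENS = {"</s>", "<pad>", "__en__", "__ko__"}


def toy_decode(tokens, skip_special_tokens=False):
    """Join tokens into text, optionally dropping special tokens first."""
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in TOY_SPECIAL_TOKENS]
    return " ".join(tokens)


toy_generated_tokens = ["</s>", "__en__", "I", "am", "sleeping", "today", "</s>"]

toy_raw = toy_decode(toy_generated_tokens)
toy_clean = toy_decode(toy_generated_tokens, skip_special_tokens=True)

print(toy_raw)
print(toy_clean)
```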

(2) NLLB-200

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import warnings

warnings.filterwarnings("ignore", category=FutureWarning)

# Load the NLLB-200 model and tokenizer
model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

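# Sentence to translate ("I took a nap today")
sentence = "나는 오늘 낮잠 잤음"

# Tokenize the input. NLLB-200 uses script-tagged language codes:
# Korean (Hangul) is kor_Hang, English (Latin) is eng_Latn.
# src_lang is not set here, so the tokenizer's default source language applies;
# pass src_lang="kor_Hang" to from_pretrained() to tag the input as Korean
inputs = tokenizer(sentence, return_tensors="pt")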

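# Translate (Korean -> English) by forcing the first generated token to be the
# English language token; convert_tokens_to_ids() looks up its id and also works
# on library versions where tokenizer.lang_code_to_id is no longer available
generated_tokens = model.generate(
    **inputs, forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn")
)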

# Decode the translation result
translated_text = tokenizer.decode(generated_tokens[0], skip_special_tokens=True)

print(f"Translated sentence: {translated_text}")

"""
Translated sentence: I slept this afternoon.
"""

Word2Vec vs. BERT embeddings

Word2Vec

Word2Vec provides one fixed embedding per word, while BERT provides embeddings that depend on the context of the whole sentence

from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
from scipy.spatial.distance import cosine

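sentences = [
    "나는 오늘 낮잠 잤음",  # "I took a nap today"
    "아직 저녁을 못먹음",  # "Haven't had dinner yet"
    "갑자기 배고파",  # "Suddenly hungry"
    "오빠야가 뭐 먹고 싶냐고 물어본다.",  # "He asks what I want to eat."
]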

# Split each sentence into words (simple_preprocess drops tokens shorter than 2 characters)
processed = [simple_preprocess(sentence) for sentence in sentences]
print(processed)

# Train a small CBOW model (sg=0) on the tokenized sentences
model = Word2Vec(sentences=processed, vector_size=5, window=5, min_count=1, sg=0)

# Look up the embedding of each word ("낮잠" = nap, "저녁을" = dinner, in object form)
낮잠 = model.wv["낮잠"]
저녁을 = model.wv["저녁을"]

# Check the cosine similarity between the two vectors
sim = 1 - cosine(낮잠, 저녁을)
print(sim)

"""
[['나는', '오늘', '낮잠', '잤음'], ['아직', '저녁을', '못먹음'], ['갑자기', '배고파'], ['오빠야가', '먹고', '싶냐고', '물어본다']]
0.2424482216355497
"""
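
The difference between a fixed word vector and a context-dependent one can be shown with a toy example. Everything here is hypothetical: `toy_word_vectors` is a hand-written table, and `toy_contextual_lookup` simply blends a word's vector with the sentence average. BERT uses self-attention rather than averaging, so the sketch only demonstrates that the result depends on the surrounding words.

```python
# Hypothetical static word vectors (a trained Word2Vec model would supply these via model.wv)
toy_word_vectors = {
    "river": [0.9, 0.1],
    "bank": [0.5, 0.5],
    "money": [0.1, 0.9],
}


def static_lookup(sentence_tokens, word):
    """A static embedding ignores the sentence: the vector depends only on the word."""
    return toy_word_vectors[word]


def toy_contextual_lookup(sentence_tokens, word):
    """Crude stand-in for a contextual model: blend the word vector with the sentence average."""
    vectors = [toy_word_vectors[token] for token in sentence_tokens]
    average = [sum(column) / len(vectors) for column in zip(*vectors)]
    own = toy_word_vectors[word]
    return [(o + a) / 2 for o, a in zip(own, average)]


sentence_a = ["river", "bank"]
sentence_b = ["money", "bank"]

# Same word, two different sentences
static_same = static_lookup(sentence_a, "bank") == static_lookup(sentence_b, "bank")
contextual_same = toy_contextual_lookup(sentence_a, "bank") == toy_contextual_lookup(
    sentence_b, "bank"
)

print(f"static vectors identical: {static_same}")
print(f"contextual vectors identical: {contextual_same}")
```

The static lookup returns the same vector for "bank" in both sentences, while the context-dependent version returns two different vectors.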
