Attention Mechanism

최 수빈 2025. 3. 20. 18:56

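Attention Mechanism

A technique that processes sequence data efficiently by assigning larger weights to the more relevant parts of the sequence

Used mainly in natural language processing (NLP) and time-series modeling, with applications such as machine translation, summarization, and question answering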

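How It Works

Basic idea

Compute an importance score for each element of the input sequence and weight the elements accordingly
Downweight irrelevant information and emphasize what matters

Main components: Query (Q), Key (K), Value (V)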

1. Computing attention scores

Measure the similarity between the Query and each Key to determine importance

The dot product is the usual similarity measure

$$\text{score}(Q, K) = Q K^\top$$
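
As a minimal sketch of this step, the snippet below scores one query against three keys with a dot product. The tensor values and the names `q`, `keys`, and `scores` are made up for illustration.

```python
import torch

# One query and three keys, each of dimension 2 (illustrative values)
q = torch.tensor([[1.0, 0.0]])             # shape: (1, 2)
keys = torch.tensor([[1.0, 0.0],
                     [0.0, 1.0],
                     [2.0, 0.0]])           # shape: (3, 2)

# score(Q, K) = Q K^T: one score per key
scores = q @ keys.T                         # shape: (1, 3)
print(scores)  # tensor([[1., 0., 2.]])
```

The key that points in the same direction as the query and has the largest magnitude gets the highest score; the orthogonal key scores 0.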

2. Computing weights with softmax

Normalize the attention scores with the softmax function to turn them into weights

$$\alpha_i = \frac{\exp(\text{score}(Q, K_i))}{\sum_j \exp(\text{score}(Q, K_j))}$$

→ The weights sum to 1
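
A small sketch of the normalization step, assuming three made-up scores. `F.softmax` with `dim=-1` normalizes across the key positions.

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([[1.0, 0.0, 2.0]])    # illustrative scores for three keys
weights = F.softmax(scores, dim=-1)

print(weights)         # the largest score receives the largest weight
print(weights.sum())   # 1, up to floating-point error
```

Because softmax exponentiates the scores, every weight is strictly positive: low-scoring positions are downweighted rather than removed outright.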

3. Applying the weights to get the output

Multiply each Value by its softmax weight and sum the results to get the final attention output

$$\text{Attention}(Q, K, V) = \sum_i \alpha_i V_i$$
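
The weighted sum can be sketched as a single matrix product. The weights and value vectors below are illustrative numbers chosen so the arithmetic is easy to check by hand.

```python
import torch

# Attention weights for three positions (they sum to 1)
weights = torch.tensor([[0.25, 0.25, 0.5]])
# One value vector per position
values = torch.tensor([[1.0, 0.0],
                       [0.0, 1.0],
                       [2.0, 2.0]])

# Attention output = sum_i alpha_i * V_i
output = weights @ values
print(output)  # tensor([[1.2500, 1.2500]])
```

By hand: 0.25·[1, 0] + 0.25·[0, 1] + 0.5·[2, 2] = [1.25, 1.25].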

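Self-Attention and Multi-Head Attention

Self-Attention

Each element of the input sequence attends to every element of the same sequence to learn how they relate (Query, Key, and Value are all derived from the same input sequence)
Lets the words in a sentence draw on one another for context
Models dependencies between words effectively in tasks such as translation and summarization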
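
A rough sketch of self-attention, assuming a single sequence of random token vectors. The projection layers `w_q`, `w_k`, and `w_v` and the sizes are hypothetical; the point is only that Q, K, and V all come from the same input `x`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

seq_len, d_model = 4, 8                       # illustrative sizes
x = torch.randn(1, seq_len, d_model)          # one sequence of 4 token vectors

# Hypothetical projection layers: Q, K, and V are all derived from x
w_q = nn.Linear(d_model, d_model, bias=False)
w_k = nn.Linear(d_model, d_model, bias=False)
w_v = nn.Linear(d_model, d_model, bias=False)
Q, K, V = w_q(x), w_k(x), w_v(x)

# Every token is scored against every token of the same sequence
scores = Q @ K.transpose(-2, -1) / d_model ** 0.5   # (1, 4, 4)
weights = F.softmax(scores, dim=-1)                 # each row sums to 1
output = weights @ V                                # (1, 4, 8)

print(weights.shape, output.shape)
```

Row `i` of `weights` describes how much token `i` draws on each token in the sequence, including itself.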

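Multi-Head Attention

Runs several self-attention operations in parallel
Each head can learn a different pattern, which gives a richer representation
Looking at the data from several perspectives generally improves performance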
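
One common way to get parallel heads is to split the embedding axis into `heads` chunks of size `head_dim`. The sketch below, with made-up sizes, shows that split and its inverse; it assumes the usual `[batch_size, seq_len, embed_size]` layout.

```python
import torch

batch_size, seq_len, embed_size, heads = 2, 4, 8, 2   # illustrative sizes
head_dim = embed_size // heads

x = torch.arange(batch_size * seq_len * embed_size, dtype=torch.float32)
x = x.view(batch_size, seq_len, embed_size)

# Split the embedding axis into (heads, head_dim), then move heads ahead of seq_len
split = x.view(batch_size, seq_len, heads, head_dim).transpose(1, 2)
print(split.shape)  # torch.Size([2, 2, 4, 4])

# Head 0 of token 0 holds the first head_dim features of that token
print(torch.equal(split[0, 0, 0], x[0, 0, :head_dim]))  # True

# Merging the heads reverses the split
merged = split.transpose(1, 2).reshape(batch_size, seq_len, embed_size)
print(torch.equal(merged, x))  # True
```

The order matters: viewing directly as `[batch_size, heads, seq_len, head_dim]` without the transpose would mix features from different tokens into the same head.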

Implementing the Attention Mechanism

Implementing Scaled Dot-Product Attention

import torch
import torch.nn.functional as F  # PyTorch functional API (used here for softmax)

def scaled_dot_product_attention(Q, K, V):
    """
    Scaled dot-product attention.

    Args:
        Q: Query tensor (shape: [batch_size, num_heads, seq_len, d_k])
        K: Key tensor (shape: [batch_size, num_heads, seq_len, d_k])
        V: Value tensor (shape: [batch_size, num_heads, seq_len, d_k])

    Returns:
        output: result of applying the attention weights to V
        attn_weights: attention weights after softmax
    """
    d_k = Q.size(-1)  # per-head dimension of the Query/Key vectors (d_k)
    # Dot product of Query and Key, divided by sqrt(d_k) so the scores do not grow with d_k
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    attn_weights = F.softmax(scores, dim=-1)  # normalize over the key positions
    output = torch.matmul(attn_weights, V)  # weighted sum of the Values
    return output, attn_weights
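
The function above has no way to hide padding tokens. A possible masked variant is sketched below; the function name `masked_scaled_dot_product_attention` and the mask convention (1 where attention is allowed, 0 where it is blocked) are assumptions made for this example, not part of the original code.

```python
import torch
import torch.nn.functional as F

def masked_scaled_dot_product_attention(Q, K, V, mask=None):
    """Hypothetical masked variant; mask is 1 where attention is allowed, 0 where blocked."""
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    if mask is not None:
        # Blocked positions get -inf so softmax assigns them zero weight
        scores = scores.masked_fill(mask == 0, float("-inf"))
    attn_weights = F.softmax(scores, dim=-1)
    return torch.matmul(attn_weights, V), attn_weights

torch.manual_seed(0)
Q = K = V = torch.randn(1, 1, 4, 8)             # [batch_size, num_heads, seq_len, d_k]
# Pretend the last token is padding: no query may attend to key position 3
mask = torch.tensor([1, 1, 1, 0]).view(1, 1, 1, 4)
_, w = masked_scaled_dot_product_attention(Q, K, V, mask)
print(w[0, 0, :, 3])  # zero weight on the padded key for every query
```

One caveat with this approach: if every key in a row is masked, softmax sees only `-inf` and returns NaN, so a real implementation needs to guard against fully masked rows.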

Implementing Multi-Head Attention

import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, embed_size, heads):
        """
        Multi-Head Attention layer

        Args:
            embed_size (int): input embedding dimension
            heads (int): number of attention heads
        """
        super().__init__()
        self.embed_size = embed_size  # full embedding size
        self.heads = heads  # number of attention heads
        self.head_dim = embed_size // heads  # dimension per head

        # embed_size must be divisible by the number of heads
        assert (
            self.head_dim * heads == embed_size
        ), "Embedding size needs to be divisible by heads"

        # Linear layers that produce Query, Key, and Value
        self.values = nn.Linear(embed_size, embed_size, bias=False)
        self.keys = nn.Linear(embed_size, embed_size, bias=False)
        self.queries = nn.Linear(embed_size, embed_size, bias=False)

        # Output linear layer (maps the concatenated heads back to embed_size)
        self.fc_out = nn.Linear(heads * self.head_dim, embed_size)

    def forward(self, values, keys, query, mask=None):
        """
        Multi-Head Attention forward pass

        Args:
            values (tensor): Value tensor (shape: [batch_size, seq_len, embed_size])
            keys (tensor): Key tensor (shape: [batch_size, seq_len, embed_size])
            query (tensor): Query tensor (shape: [batch_size, seq_len, embed_size])
            mask (tensor, optional): padding mask; accepted but not applied in this simplified implementation (default: None)

        Returns:
            tuple: (output after attention, attention weights)
        """
        N = query.shape[0]  # batch size
        value_len, key_len, query_len = values.shape[1], keys.shape[1], query.shape[1]

        # Project, split the embedding axis into heads, then move the head axis forward:
        # [N, seq_len, embed_size] -> [N, seq_len, heads, head_dim] -> [N, heads, seq_len, head_dim]
        # (N: batch size, heads: number of heads, seq_len: number of tokens, head_dim: embed_size // heads)
        values = self.values(values).view(N, value_len, self.heads, self.head_dim).transpose(1, 2)
        keys = self.keys(keys).view(N, key_len, self.heads, self.head_dim).transpose(1, 2)
        queries = self.queries(query).view(N, query_len, self.heads, self.head_dim).transpose(1, 2)

        # Scaled dot-product attention
        out, attn_weights = scaled_dot_product_attention(queries, keys, values)

        # Merge the heads back: [N, heads, query_len, head_dim] -> [N, query_len, embed_size]
        out = out.transpose(1, 2).reshape(N, query_len, self.heads * self.head_dim)

        # Pass through the output linear layer to produce the final output
        out = self.fc_out(out)

        return out, attn_weights
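
For comparison, PyTorch ships a built-in `nn.MultiheadAttention` layer. The sketch below runs it on a random input with the same sizes used in this post. It is not parameter-for-parameter equivalent to the class above (the built-in layer uses biases by default, for example), so treat it only as a shape check.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_size, heads = 128, 8
x = torch.randn(1, 4, embed_size)         # [batch_size, seq_len, embed_size]

# batch_first=True makes the layer accept [batch, seq, embed] inputs
mha = nn.MultiheadAttention(embed_size, heads, batch_first=True)

# average_attn_weights=False returns one weight matrix per head
out, weights = mha(x, x, x, need_weights=True, average_attn_weights=False)
print(out.shape)      # torch.Size([1, 4, 128])
print(weights.shape)  # torch.Size([1, 8, 4, 4])
```

With the default `average_attn_weights=True`, the built-in layer returns weights already averaged over heads, with shape `[batch_size, seq_len, seq_len]`.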

Running Multi-Head Attention and Visualizing the Attention Weights

import matplotlib.pyplot as plt
import seaborn as sns

# Prepare the input sentence
sentence = ["I", "love", "deep", "learning"]
vocab = {word: idx for idx, word in enumerate(sentence)}

# Map each word to its vocabulary index
tokens = [vocab[word] for word in sentence]
tokens = torch.tensor(tokens).unsqueeze(0)  # (1, 4)

# Embedding
embed_size = 128
embedding_layer = nn.Embedding(len(vocab), embed_size)
embedded_tokens = embedding_layer(tokens)  # (1, 4, 128)

# Run Multi-Head Attention
# (the layers are randomly initialized and untrained, so the weights are not yet meaningful)
heads = 8
multi_head_attn = MultiHeadAttention(embed_size, heads)
output, attn_weights = multi_head_attn(embedded_tokens, embedded_tokens, embedded_tokens)

# Check the shape of the attention weights
print("Attention weights shape:", attn_weights.shape)

# Visualize the attention weights
def plot_attention_weights(attention, sentence):
    num_heads = attention.shape[1]
    # Drop the batch axis and average over heads -> (seq_len, seq_len)
    attention = attention.squeeze(0).mean(dim=0).detach().numpy()
    plt.figure(figsize=(6, 6))
    sns.heatmap(attention, annot=True, cmap="Blues", xticklabels=sentence, yticklabels=sentence)
    plt.xlabel("Key Tokens")
    plt.ylabel("Query Tokens")
    plt.title(f"Attention Weights (Avg of {num_heads} Heads)")
    plt.savefig("_".join(sentence) + ".png")
    plt.show()

plot_attention_weights(attn_weights, sentence)

Attention weights shape: torch.Size([1, 8, 4, 4])

To go deeper, a good next step is the Transformer architecture and the models built on self-attention, such as BERT and GPT
