Attention is all you need: Transformer Summary
Attention is just computed with dot products!
In a seq2seq model, which form of the encoder states h1, h2, h3 would reduce the loss L4 at decoder step 4 the most? When the decoder state s4 meets h2, which words should the learned linear transformation place close together (i.e., give a large dot product)?
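A minimal sketch of that idea, assuming plain dot-product scores between the decoder state s4 and the encoder states h_i (the score may also go through a learned projection; the symbols e, α, c below are just illustrative names):

$$e_{4,i} = s_4^\top h_i,\qquad \alpha_{4,i} = \frac{\exp(e_{4,i})}{\sum_j \exp(e_{4,j})},\qquad c_4 = \sum_i \alpha_{4,i}\, h_i$$

Training pushes the representations so that the h_i relevant to step 4 receive large weights α_{4,i}, which is exactly the "place the right word close" question above.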
The distance problem is solved! Let's also apply attention to a sequence with itself (self-attention).
But in the decoder, attending to future words would be cheating, so it is excluded (masked out; see the masked-attention sketch after the self-attention code below).
Take the dot product of the query with each key, then softmax the scores to get attention weights.
Use those weights for a weighted sum over the values: the attention output is a weighted sum of the values, and it also indirectly reflects the query, because the weights themselves were computed from the query and the keys.
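In matrix form, these two steps are the scaled dot-product attention from the paper, which both code examples below implement:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$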
Transformer operation/training + Masking
Self-Attention
import torch
import torch.nn.functional as F
# Sample text
sentence = "The cat sat on the mat"
# Tokenize the sentence (naive tokenization for simplicity)
tokens = sentence.lower().split()
# Simulate an embedding layer: map each token to a random embedding
# In practice, use pre-trained embeddings or an embedding layer of a model
token_to_embedding = {token: torch.randn(1, 4) for token in set(tokens)} # 4-dimensional embeddings
# Convert tokens to embeddings
embeddings = torch.cat([token_to_embedding[token] for token in tokens], dim=0)
# Learned weight matrices (randomly initialized for this example)
W_q = torch.randn(4, 4) # (embedding_size, output_size)
W_k = torch.randn(4, 4) # (embedding_size, output_size)
W_v = torch.randn(4, 4) # (embedding_size, output_size)
# Compute query (Q), key (K), and value (V) matrices
Q = torch.matmul(embeddings, W_q)
K = torch.matmul(embeddings, W_k)
V = torch.matmul(embeddings, W_v)
# Compute scaled dot-product attention
d_k = Q.size(-1) # dimension of the keys
# The line below computes QK^T / sqrt(d_k)
# Q = [batch size, length of query, dimension of each query vector]
# Q = [N, L_q, D]
# K = [N, L_k, D] --> transposed to [N, D, L_k]
# QK^T = [N, L_q, L_k]: represents the similarity between each query and each key
scores = torch.matmul(Q, K.transpose(-2, -1)) / torch.sqrt(torch.tensor(d_k, dtype=torch.float32))
attention_weights = F.softmax(scores, dim=-1)
# The attention output is the softmax weights multiplied by V
self_attention_output = torch.matmul(attention_weights, V)
print("Self-Attention Output:\\n", self_attention_output)
Cross-Attention
import torch
import torch.nn.functional as F
# Sample sentences
sentence_a = "The cat sat on the mat" # Source
sentence_b = "Le chat était sur le tapis" # Target
# Tokenize sentences (naive tokenization for simplicity)
tokens_a = sentence_a.lower().split()
tokens_b = sentence_b.lower().split()
# Simulate an embedding layer: map each token to a random embedding
# Mapping for sentence A (source)
token_to_embedding_a = {token: torch.randn(1, 4) for token in set(tokens_a)}
embeddings_a = torch.cat([token_to_embedding_a[token] for token in tokens_a], dim=0)
# Mapping for sentence B (target)
token_to_embedding_b = {token: torch.randn(1, 4) for token in set(tokens_b)}
embeddings_b = torch.cat([token_to_embedding_b[token] for token in tokens_b], dim=0)
# Learned weight matrices (randomly initialized for this example)
# The only difference from self-attention: the query comes from a different sequence, while the key and value come from the same (source) sequence
# For sentence A (source)
W_k_a = torch.randn(4, 4)
W_v_a = torch.randn(4, 4)
# For sentence B (target)
W_q_b = torch.randn(4, 4)
# Compute query (Q) from B, key (K) and value (V) from A
Q_b = torch.matmul(embeddings_b, W_q_b)
K_a = torch.matmul(embeddings_a, W_k_a)
V_a = torch.matmul(embeddings_a, W_v_a)
# Compute scaled dot-product cross-attention
d_k = Q_b.size(-1) # dimension of the keys
scores = torch.matmul(Q_b, K_a.transpose(-2, -1)) / torch.sqrt(torch.tensor(d_k, dtype=torch.float32))
attention_weights = F.softmax(scores, dim=-1)
cross_attention_output = torch.matmul(attention_weights, V_a)
print("Cross-Attention Output:\\n", cross_attention_output)