Chapter 18. Handling Sequence Data with Recurrent Neural Networks (RNN)

1. Classifying Reuters News Categories with an LSTM

In [ ]:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Embedding
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.datasets import reuters       # Reuters newswire dataset
from tensorflow.keras.callbacks import EarlyStopping

import numpy as np
import matplotlib.pyplot as plt

# Load the data, split into a training set and a test set.
(X_train, y_train), (X_test, y_test) = reuters.load_data(num_words=1000, test_split=0.2)

# Inspect the data.
category = np.max(y_train) + 1
print(category, 'categories')
print(len(X_train), 'training news articles')
print(len(X_test), 'test news articles')
print(X_train[0])

46 categories
8982 training news articles
2246 test news articles
[1, 2, 2, 8, 43, 10, 447, 5, 25, 207, 270, 5, 2, 111, 16, 369, 186, 90, 67, 7, 89, 5, 19, 102, 6, 19, 124, 15, 90, 67, 84, 22, 482, 26, 7, 48, 4, 49, 8, 864, 39, 209, 154, 6, 151, 6, 83, 11, 15, 22, 155, 11, 15, 7, 48, 9, 2, 2, 504, 6, 258, 6, 272, 11, 15, 22, 134, 44, 11, 15, 16, 8, 197, 2, 90, 67, 52, 29, 209, 30, 32, 132, 6, 109, 15, 17, 12]
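The article printed above contains several 2s. With `num_words=1000`, Keras keeps only word indices below 1000 and replaces every other word with the out-of-vocabulary index, which defaults to 2 (index 1 marks the start of a sequence). The sketch below imitates that replacement in plain Python. The helper `cap_vocabulary` and the uncapped sample indices are hypothetical and used only for illustration; they are not part of Keras or of the dataset.

```python
def cap_vocabulary(sequence, num_words, oov_index=2):
    """Replace every word index >= num_words with the out-of-vocabulary index."""
    return [w if w < num_words else oov_index for w in sequence]


# Hypothetical uncapped indices (made up for this example).
raw = [1, 4579, 1005, 8, 43, 10, 447]

capped = cap_vocabulary(raw, num_words=1000)
print(capped)
```

A smaller `num_words` shrinks the embedding table but turns more of each article into the same placeholder index, so the value is a trade-off between model size and lost vocabulary.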

In [ ]:

# Pad or truncate every article to the same length (100 words).
X_train = sequence.pad_sequences(X_train, maxlen=100)
X_test = sequence.pad_sequences(X_test, maxlen=100)

# One-hot encode the labels.
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)

# Define the model architecture.
model = Sequential()
model.add(Embedding(1000, 100))
model.add(LSTM(100, activation='tanh'))
model.add(Dense(46, activation='softmax'))

# Set the model's compile options.
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Set up early stopping.
early_stopping_callback = EarlyStopping(monitor='val_loss', patience=5)

# Train the model.
history = model.fit(X_train, y_train, batch_size=20, epochs=200, validation_data=(X_test, y_test), callbacks=[early_stopping_callback])

# Print the test accuracy.
print("\n Test Accuracy: %.4f" % (model.evaluate(X_test, y_test)[1]))

Epoch 1/200
450/450 [==============================] - 8s 11ms/step - loss: 2.2100 - accuracy: 0.4390 - val_loss: 1.9456 - val_accuracy: 0.5116
Epoch 2/200
450/450 [==============================] - 5s 11ms/step - loss: 1.8228 - accuracy: 0.5322 - val_loss: 1.7361 - val_accuracy: 0.5606
Epoch 3/200
450/450 [==============================] - 5s 11ms/step - loss: 1.6624 - accuracy: 0.5748 - val_loss: 1.6674 - val_accuracy: 0.5868
Epoch 4/200
450/450 [==============================] - 5s 11ms/step - loss: 1.5424 - accuracy: 0.6049 - val_loss: 1.4972 - val_accuracy: 0.6287
Epoch 5/200
450/450 [==============================] - 5s 11ms/step - loss: 1.3349 - accuracy: 0.6620 - val_loss: 1.3598 - val_accuracy: 0.6647
Epoch 6/200
450/450 [==============================] - 5s 11ms/step - loss: 1.2020 - accuracy: 0.6929 - val_loss: 1.2815 - val_accuracy: 0.6652
Epoch 7/200
450/450 [==============================] - 5s 11ms/step - loss: 1.0926 - accuracy: 0.7233 - val_loss: 1.2093 - val_accuracy: 0.6963
Epoch 8/200
450/450 [==============================] - 5s 11ms/step - loss: 0.9945 - accuracy: 0.7492 - val_loss: 1.1618 - val_accuracy: 0.7084
Epoch 9/200
450/450 [==============================] - 5s 11ms/step - loss: 0.9207 - accuracy: 0.7659 - val_loss: 1.1567 - val_accuracy: 0.7053
Epoch 10/200
450/450 [==============================] - 5s 11ms/step - loss: 0.8467 - accuracy: 0.7866 - val_loss: 1.1250 - val_accuracy: 0.7204
Epoch 11/200
450/450 [==============================] - 5s 11ms/step - loss: 0.7866 - accuracy: 0.8025 - val_loss: 1.1005 - val_accuracy: 0.7248
Epoch 12/200
450/450 [==============================] - 5s 11ms/step - loss: 0.7157 - accuracy: 0.8206 - val_loss: 1.1453 - val_accuracy: 0.7262
Epoch 13/200
450/450 [==============================] - 5s 11ms/step - loss: 0.6583 - accuracy: 0.8367 - val_loss: 1.1542 - val_accuracy: 0.7337
Epoch 14/200
450/450 [==============================] - 5s 11ms/step - loss: 0.6136 - accuracy: 0.8475 - val_loss: 1.1600 - val_accuracy: 0.7302
Epoch 15/200
450/450 [==============================] - 5s 11ms/step - loss: 0.5546 - accuracy: 0.8618 - val_loss: 1.1965 - val_accuracy: 0.7235
Epoch 16/200
450/450 [==============================] - 5s 11ms/step - loss: 0.5121 - accuracy: 0.8691 - val_loss: 1.2316 - val_accuracy: 0.7177
71/71 [==============================] - 0s 4ms/step - loss: 1.2316 - accuracy: 0.7177

 Test Accuracy: 0.7177
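Training stopped at epoch 16 even though `epochs=200` was requested. The sketch below replays the `val_loss` values from the log above through a simplified version of the patience rule, to show why. It is a plain-Python approximation written for this note, not the Keras implementation; the function name `early_stopping_epoch` is hypothetical, and it assumes the default `min_delta` of 0.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training stops, or None if it never does."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss      # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1        # no improvement this epoch
            if wait >= patience:
                return epoch
    return None


# val_loss values copied from the training log above.
val_losses = [1.9456, 1.7361, 1.6674, 1.4972, 1.3598, 1.2815, 1.2093, 1.1618,
              1.1567, 1.1250, 1.1005, 1.1453, 1.1542, 1.1600, 1.1965, 1.2316]

stop_epoch = early_stopping_epoch(val_losses, patience=5)
best_epoch = val_losses.index(min(val_losses)) + 1
print(stop_epoch, best_epoch)
```

With `patience=5` the sketch stops at epoch 16, five epochs after the lowest `val_loss` at epoch 11, which matches the log. Because `restore_best_weights` is left at its default of `False`, the model that is evaluated keeps the weights from the last epoch, which is why the test accuracy equals the final `val_accuracy` of 0.7177.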

In [ ]:

# Store the test-set loss and the training-set loss.
y_vloss = history.history['val_loss']
y_loss = history.history['loss']

# Plot both curves.
x_len = np.arange(len(y_loss))
plt.plot(x_len, y_vloss, marker='.', c="red", label='Testset_loss')
plt.plot(x_len, y_loss, marker='.', c="blue", label='Trainset_loss')

# Add a grid and axis labels to the plot.
plt.legend(loc='upper right')
plt.grid()
plt.xlabel('epoch')
plt.ylabel('loss')
plt.show()

[Figure: Testset_loss (red) and Trainset_loss (blue) plotted against epoch]

2. Classifying Movie Reviews with a Combination of LSTM and CNN

In [ ]:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation, Embedding, LSTM, Conv1D, MaxPooling1D
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.callbacks import EarlyStopping

import numpy as np
import matplotlib.pyplot as plt

# Load the data, split into a training set and a test set.
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=5000)

# Pad or truncate every review to the same length (500 words).
X_train = sequence.pad_sequences(X_train, maxlen=500)
X_test = sequence.pad_sequences(X_test, maxlen=500)

# Define the model architecture.
model = Sequential()
model.add(Embedding(5000, 100))
model.add(Dropout(0.5))
model.add(Conv1D(64, 5, padding='valid', activation='relu', strides=1))
model.add(MaxPooling1D(pool_size=4))
model.add(LSTM(55))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, None, 100)         500000    
_________________________________________________________________
dropout (Dropout)            (None, None, 100)         0         
_________________________________________________________________
conv1d (Conv1D)              (None, None, 64)          32064     
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, None, 64)          0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 55)                26400     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 56        
_________________________________________________________________
activation (Activation)      (None, 1)                 0         
=================================================================
Total params: 558,520
Trainable params: 558,520
Non-trainable params: 0
_________________________________________________________________
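The parameter counts in the summary can be reproduced by hand, which is a useful check that each layer is wired as intended. The sketch below uses the standard counting formulas for each layer type; the helper function names are hypothetical and exist only for this example. It also works out the sequence length after the convolution and pooling layers for a 500-word review, which the summary shows as `None` because the `Embedding` layer was created without a fixed input length.

```python
def embedding_params(vocab_size, embed_dim):
    return vocab_size * embed_dim


def conv1d_params(kernel_size, in_channels, filters):
    return (kernel_size * in_channels + 1) * filters   # +1 is the bias of each filter


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    return (input_dim + 1) * units


counts = {
    "embedding": embedding_params(5000, 100),
    "conv1d": conv1d_params(5, 100, 64),
    "lstm": lstm_params(64, 55),
    "dense": dense_params(55, 1),
}
total = sum(counts.values())
print(counts, total)

# Sequence length after each layer for a 500-word review.
conv_len = 500 - 5 + 1       # 'valid' padding, stride 1
pooled_len = conv_len // 4   # pool_size=4, stride equal to the pool size
print(conv_len, pooled_len)
```

The total matches the 558,520 parameters reported by `model.summary()`. The pooling step also means the LSTM processes 124 time steps per review instead of 500, which is one practical reason for placing the convolution in front of it.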

In [ ]:

# Set the model's compile options.
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Set up early stopping.
early_stopping_callback = EarlyStopping(monitor='val_loss', patience=3)

# Train the model, holding out 25% of the training data for validation.
history = model.fit(X_train, y_train, batch_size=40, epochs=100, validation_split=0.25, callbacks=[early_stopping_callback])

# Print the test accuracy.
print("\n Test Accuracy: %.4f" % (model.evaluate(X_test, y_test)[1]))

Epoch 1/100
469/469 [==============================] - 18s 17ms/step - loss: 0.4083 - accuracy: 0.7973 - val_loss: 0.2848 - val_accuracy: 0.8818
Epoch 2/100
469/469 [==============================] - 7s 16ms/step - loss: 0.2360 - accuracy: 0.9113 - val_loss: 0.2785 - val_accuracy: 0.8829
Epoch 3/100
469/469 [==============================] - 7s 16ms/step - loss: 0.1920 - accuracy: 0.9279 - val_loss: 0.3171 - val_accuracy: 0.8624
Epoch 4/100
469/469 [==============================] - 7s 16ms/step - loss: 0.1509 - accuracy: 0.9442 - val_loss: 0.2977 - val_accuracy: 0.8813
Epoch 5/100
469/469 [==============================] - 7s 16ms/step - loss: 0.1197 - accuracy: 0.9581 - val_loss: 0.3106 - val_accuracy: 0.8896
782/782 [==============================] - 4s 6ms/step - loss: 0.3367 - accuracy: 0.8796

 Test Accuracy: 0.8796
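The step counts in the log (469 per training epoch, 782 for evaluation) follow from the dataset size, the validation split, and the batch sizes. The sketch below recomputes them as a sanity check. It assumes the standard IMDB split of 25,000 training and 25,000 test reviews and the default `batch_size=32` of `evaluate()`; the helper `steps_per_epoch` is a hypothetical name used only here.

```python
import math


def steps_per_epoch(num_samples, batch_size):
    """Number of batches needed to cover all samples (the last batch may be smaller)."""
    return math.ceil(num_samples / batch_size)


total_train = 25000                     # IMDB training reviews
num_fit = int(total_train * 0.75)       # samples kept for fitting
num_val = total_train - num_fit         # samples held out by validation_split=0.25

fit_steps = steps_per_epoch(num_fit, 40)
eval_steps = steps_per_epoch(25000, 32)  # evaluate() uses batch_size=32 by default
print(num_fit, num_val, fit_steps, eval_steps)
```

Keras takes the validation samples from the end of the arrays before any shuffling, so the 6,250 held-out reviews are the last quarter of `X_train`. Unlike the first example, the test set here plays no part in early stopping, which is why the final test loss (0.3367) differs from the last `val_loss` (0.3106).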

In [ ]:

# Store the validation loss and the training loss.
# (This model was fit with validation_split, so val_loss comes from held-out training data, not the test set.)
y_vloss = history.history['val_loss']
y_loss = history.history['loss']

# Plot both curves.
x_len = np.arange(len(y_loss))
plt.plot(x_len, y_vloss, marker='.', c="red", label='Validation_loss')
plt.plot(x_len, y_loss, marker='.', c="blue", label='Trainset_loss')

# Add a grid and axis labels to the plot.
plt.legend(loc='upper right')
plt.grid()
plt.xlabel('epoch')
plt.ylabel('loss')
plt.show()

3. A Neural Network That Uses Attention

In [1]:

!pip install keras-self-attention

In [ ]:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation, Embedding, LSTM, Conv1D, MaxPooling1D, Flatten
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.utils import plot_model
from keras_self_attention import SeqSelfAttention

import numpy as np
import matplotlib.pyplot as plt

# Load the data, split into a training set and a test set.
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=5000)

# Pad or truncate every review to the same length (500 words).
X_train = sequence.pad_sequences(X_train, maxlen=500)
X_test = sequence.pad_sequences(X_test, maxlen=500)

# Define the model architecture.
model = Sequential()
model.add(Embedding(5000, 500))
model.add(Dropout(0.5))
model.add(LSTM(64, return_sequences=True))   # keep every time step so attention can weight them
model.add(SeqSelfAttention(attention_activation='sigmoid'))
model.add(Dropout(0.5))
model.add(Flatten())                         # Flatten was used without being imported in the original
model.add(Dense(1))
model.add(Activation('sigmoid'))

# Set the model's compile options.
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Set up early stopping.
early_stopping_callback = EarlyStopping(monitor='val_loss', patience=3)

# Train the model.
history = model.fit(X_train, y_train, batch_size=40, epochs=100,  validation_data=(X_test, y_test), callbacks=[early_stopping_callback])

# Print the test accuracy.
print("\n Test Accuracy: %.4f" % (model.evaluate(X_test, y_test)[1]))

Epoch 1/100
625/625 [==============================] - 32s 50ms/step - loss: 0.3872 - accuracy: 0.8211 - val_loss: 0.2915 - val_accuracy: 0.8784
Epoch 2/100
625/625 [==============================] - 31s 49ms/step - loss: 0.2312 - accuracy: 0.9070 - val_loss: 0.2688 - val_accuracy: 0.8873
Epoch 3/100
625/625 [==============================] - 30s 48ms/step - loss: 0.1700 - accuracy: 0.9363 - val_loss: 0.3014 - val_accuracy: 0.8866
Epoch 4/100
625/625 [==============================] - 30s 48ms/step - loss: 0.1210 - accuracy: 0.9534 - val_loss: 0.3148 - val_accuracy: 0.8840
Epoch 5/100
625/625 [==============================] - 31s 49ms/step - loss: 0.0872 - accuracy: 0.9676 - val_loss: 0.3980 - val_accuracy: 0.8808
782/782 [==============================] - 12s 14ms/step - loss: 0.3980 - accuracy: 0.8808

 Test Accuracy: 0.8808

In [ ]:

# Store the test-set loss and the training-set loss.
y_vloss = history.history['val_loss']
y_loss = history.history['loss']

# Plot both curves.
x_len = np.arange(len(y_loss))
plt.plot(x_len, y_vloss, marker='.', c="red", label='Testset_loss')
plt.plot(x_len, y_loss, marker='.', c="blue", label='Trainset_loss')

# Add a grid and axis labels to the plot.
plt.legend(loc='upper right')
plt.grid()
plt.xlabel('epoch')
plt.ylabel('loss')
plt.show()

How Attention Is Learned

🎯 One-sentence summary

Attention is a mechanism that lets the model focus on the part of the input it needs right now, out of all the information the input contains.

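🧠 1️⃣ Why attention is needed

Let's start with the RNN (recurrent neural network). An RNN reads a sentence from left to right, one word at a time, summarizing the context as it goes.

For example:
Input: 나는 오늘 학교에 갔다
Output: I went to school today

The RNN has to read the whole sentence first and then translate using only the last hidden state. That causes a problem.

👉 As the sentence gets longer, the information from early words (for example “나는”, “I”) fades toward the end. This is called the long-term dependency problem.

Attention was introduced to address exactly this.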

“How are the attention weights (α) learned?” In other words, the key question is where the error that adjusts those weights comes from.

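🧩 Key points

The attention weights have no ‘ground truth’ of their own. Instead, they are learned indirectly, through backpropagation from the output error of the whole model (for example, the difference between the translated sentence and the reference sentence).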

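🔬 An intuitive reading of the attention error
• If the model attends too little to a word, that word's information is missing, translation quality drops, and the error rises; backpropagation then pushes that word's α up.
• If the model attends too much to a word, irrelevant information is mixed in and the error rises; backpropagation then pushes that word's α down.

In the end, the error gradually makes the attention distribution more accurate.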
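To make the weights α concrete, here is a minimal sketch of how one decoder step could turn encoder states into attention weights and a context vector. It assumes dot-product scoring followed by a softmax, which is only one of several scoring functions and is not stated in the text above; the vectors and the function names `softmax` and `attend` are hypothetical and chosen by hand for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Return (weights, context) for one decoder query over the encoder states."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)            # these are the alpha values
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context


# Hypothetical 2-dimensional encoder states for three source words.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
decoder_query = [0.0, 2.0]               # hypothetical decoder state

weights, context = attend(decoder_query, encoder_states, encoder_states)
print([round(w, 3) for w in weights])
print([round(c, 3) for c in context])
```

Every operation here (dot product, softmax, weighted sum) is differentiable, so the gradient of the final loss can flow back through the context vector into the scores. That is the path by which the output error adjusts α without any separate label for the weights.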

“What data is used to train attention?”

🧠 1️⃣ First: what kind of attention model is it?

Attention is used in many models, but the training data typically falls into the following three types:

Model type	Data used	Example
🔤 Seq2Seq + Attention (translation)	Sentence pairs (source sentence → translated sentence)	(“나는 학교에 간다”, “I go to school”)
📜 Text Summarization	Long text → summarized text	(“The weather today is cloudy and rain …”, “Rainy weather”)
💬 Q&A / ChatGPT-style (Self-Attention)	A single sentence, or context + answer	(“Who wrote Hamlet?”, “Shakespeare”)

🎯 4️⃣ Example of the data format actually used for training (tensor form)

# Input (Korean sentences), as token ids; 0 is padding
X = [
  [1, 2, 3, 0],     # "나는 학교에 간다"
  [8, 9, 10, 0],    # "그는 밥을 먹는다" (example)
]

# Output (English sentences), as token ids; 0 is padding
Y = [
  [4, 5, 6, 7, 0],    # "I go to school"
  [11, 12, 13, 0, 0], # "He eats rice"
]

The model is trained to predict Y as the target. That is, the loss is computed so that the decoder's output gets as close as possible to “I go to school”.
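The zeros at the end of each row are padding that brings every sentence to the same length. The sketch below shows one way such rows could be produced from unpadded token ids, along with a mask that marks which positions are real tokens. The helper `pad_post`, the fixed length of 5, and the unpadded ids are assumptions made for this example; note that it pads at the end of each sequence, whereas `pad_sequences` in the earlier cells pads at the front by default.

```python
def pad_post(sequences, max_len, pad_id=0):
    """Truncate or right-pad every sequence to exactly max_len tokens."""
    return [seq[:max_len] + [pad_id] * (max_len - len(seq)) for seq in sequences]


# Hypothetical unpadded token ids for the two target sentences.
targets = [[4, 5, 6, 7], [11, 12, 13]]

padded = pad_post(targets, max_len=5)
mask = [[int(token != 0) for token in seq] for seq in padded]

print(padded)
print(mask)
```

A mask like this is commonly used so that padded positions contribute nothing to the loss or to the attention weights.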

🔁 5️⃣ Example of the loss calculation

Suppose the decoder predicts the following:

Step	Prediction	Target	Correct?
1	“I”	“I”	✅
2	“eat”	“go”	❌
3	“to”	“to”	✅
4	“class”	“school”	❌

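The loss is computed as the cross-entropy at each time step, and that error is backpropagated all the way to the attention weights, so the model learns to attend more to “간다” (“go”) when it predicts “go”.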
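The per-step cross-entropy can be illustrated with made-up numbers. In the sketch below, each step has a small probability distribution whose most likely word matches the prediction column of the table; the probabilities and the tiny vocabularies are entirely hypothetical, and a real decoder would output a distribution over its full vocabulary.

```python
import math


def cross_entropy(probabilities, target):
    """Negative log-probability the model assigned to the correct word."""
    return -math.log(probabilities[target])


# Hypothetical decoder output distributions for the four steps in the table.
steps = [
    ({"I": 0.90, "he": 0.10}, "I"),
    ({"eat": 0.60, "go": 0.30, "to": 0.10}, "go"),
    ({"to": 0.80, "at": 0.20}, "to"),
    ({"class": 0.55, "school": 0.40, "home": 0.05}, "school"),
]

predictions = [max(probs, key=probs.get) for probs, _ in steps]
losses = [cross_entropy(probs, target) for probs, target in steps]
mean_loss = sum(losses) / len(losses)

print(predictions)
print([round(loss, 3) for loss in losses], round(mean_loss, 3))
```

The two wrong steps produce the largest losses, so they contribute the largest gradients. Those gradients are what eventually shift the attention weights at steps 2 and 4.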

🧩 7️⃣ How is Self-Attention (GPT, BERT, and similar models) different?

Self-Attention takes a single sentence as input rather than a sentence pair.

Input: "The cat sat on the mat"

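The model learns how the words within the sentence relate to one another.

For example:
• “cat” → strongly related to “sat”
• “mat” → strongly related to “on”

By learning these relationships, the model picks up sentence structure (grammatical dependencies) on its own.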
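As a rough illustration, the sketch below runs scaled dot-product self-attention over the example sentence, with each word acting as query, key, and value at once. It is a simplification under stated assumptions: the 2-dimensional embeddings are invented by hand, and the learned query, key, and value projections that real models apply are left out, so the resulting weights show the mechanics only and say nothing about what a trained model would learn.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(embeddings):
    """Scaled dot-product self-attention with queries = keys = values."""
    dim = len(embeddings[0])
    weights = []
    for query in embeddings:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in embeddings]
        weights.append(softmax(scores))   # one row of weights per word
    outputs = [[sum(w * v[i] for w, v in zip(row, embeddings)) for i in range(dim)]
               for row in weights]
    return weights, outputs


tokens = ["The", "cat", "sat", "on", "the", "mat"]
# Hypothetical embeddings, chosen by hand for illustration only.
embeddings = [[0.1, 0.0], [1.0, 0.2], [0.9, 0.4], [0.0, 1.0], [0.1, 0.0], [0.2, 0.9]]

weights, outputs = self_attention(embeddings)
for token, row in zip(tokens, weights):
    print(f"{token:>4}", [round(w, 2) for w in row])
```

Each row is a distribution over all six words, including the word itself, so every output vector is a context-dependent mixture of the whole sentence.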

✅ Summary table

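Model	Input data	Output data	Role of attention
Seq2Seq translator	Sentence in one language	Sentence in another language	Learns how words correspond in meaning
Summarization	Long text	Short summary	Focuses on the important parts of the text
Self-Attention	One sentence	(itself)	Learns contextual relationships
ChatGPT-style	Long conversation	Next sentence	Learns to focus on the relevant earlier context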