๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ

NLP

[Python, KoBERT] Implementing a Multi-Class Emotion Classification Model (works after the migration to Hugging Face)

 

1. What are BERT and KoBERT?

 ๊ตฌ๊ธ€์—์„œ 2018๋…„์— ๊ณต๊ฐœํ•œ BERT๋Š” ๋“ฑ์žฅ๊ณผ ๋™์‹œ์— ์ˆ˜๋งŽ์€ NLP ํƒœ์Šคํฌ์—์„œ ์ตœ๊ณ  ์„ฑ๋Šฅ์„ ๋ณด์—ฌ์ฃผ๋ฉด์„œ NLP์˜ ํ•œ ํš์„ ๊ทธ์€ ๋ชจ๋ธ๋กœ ํ‰๊ฐ€๋ฐ›๊ณ  ์žˆ๋‹ค. ์–‘๋ฐฉํ–ฅ์„ฑ์„ ์ง€ํ–ฅํ•˜๊ณ  ์žˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.(B: bidirection) BERT ๋ชจ๋ธ์€ ๋ฌธ๋งฅ ํŠน์„ฑ์„ ํ™œ์šฉํ•˜๊ณ  ์žˆ๊ณ , ๋Œ€์šฉ๋Ÿ‰ ๋ง๋ญ‰์น˜๋กœ ์‚ฌ์ „ ํ•™์Šต์ด ์ด๋ฏธ ์ง„ํ–‰๋˜์–ด ์–ธ์–ด์— ๋Œ€ํ•œ ์ดํ•ด๋„๋„ ๋†’๋‹ค.  ํ•˜์ง€๋งŒ BERT๋Š” ํ•œ๊ตญ์–ด์— ๋Œ€ํ•ด์„œ ์˜์–ด๋ณด๋‹ค ์ •ํ™•๋„๊ฐ€ ๋–จ์–ด์ง„๋‹ค๊ณ  ํ•œ๋‹ค. 

 

 ์˜ค๋Š˜ ๊ธฐ์ˆ ํ•ด๋ณผ KoBERT ๋ชจ๋ธ์€ SKTBrain์—์„œ ๊ณต๊ฐœํ–ˆ๋Š”๋ฐ, ํ•œ๊ตญ์–ด ์œ„ํ‚ค 5๋ฐฑ๋งŒ ๋ฌธ์žฅ๊ณผ ํ•œ๊ตญ์–ด ๋‰ด์Šค 2์ฒœ๋งŒ ๋ฌธ์žฅ์„ ํ•™์Šตํ•œ ๋ชจ๋ธ์ด๋‹ค. ์ž์‹ ์˜ ์‚ฌ์šฉ ๋ชฉ์ ์— ๋”ฐ๋ผ ํŒŒ์ธํŠœ๋‹์ด ๊ฐ€๋Šฅํ•˜๊ธฐ ๋•Œ๋ฌธ์— output layer๋งŒ์„ ์ถ”๊ฐ€๋กœ ๋‹ฌ์•„์ฃผ๋ฉด ์›ํ•˜๋Š” ๊ฒฐ๊ณผ๋ฅผ ์ถœ๋ ฅํ•ด๋‚ผ ์ˆ˜ ์žˆ๋‹ค. ๋งŽ์€ BERT ๋ชจ๋ธ ์ค‘์—์„œ๋„ KoBERT๋ฅผ ์‚ฌ์šฉํ•œ ์ด์œ ๋Š” "ํ•œ๊ตญ์–ด"์— ๋Œ€ํ•ด ๋งŽ์€ ์‚ฌ์ „ ํ•™์Šต์ด ์ด๋ฃจ์–ด์ ธ ์žˆ๊ณ , ๊ฐ์ •์„ ๋ถ„์„ํ•  ๋•Œ, ๊ธ์ •๊ณผ ๋ถ€์ •๋งŒ์œผ๋กœ ๋ถ„๋ฅ˜ํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹Œ ๋‹ค์ค‘ ๋ถ„๋ฅ˜๊ฐ€ ๊ฐ€๋Šฅํ•œ ๊ฒƒ์ด ๊ฐ•์ ์ด๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.

 

 ์ด ํฌ์ŠคํŒ…์—์„œ ํ•˜๋Š” ์ž‘์—…์€ ๋ฐ”๋กœ  ํŒŒ์ธ ํŠœ๋‹(Fine-tuning)์— ํ•ด๋‹นํ•œ๋‹ค. ๋‹ค๋ฅธ ์ž‘์—…์— ๋Œ€ํ•ด์„œ ํŒŒ๋ผ๋ฏธํ„ฐ ์žฌ์กฐ์ •์„ ์œ„ํ•œ ์ถ”๊ฐ€ ํ›ˆ๋ จ ๊ณผ์ •์„ ๊ฑฐ์น˜๋Š” ๊ฒƒ์„ ๋งํ•œ๋‹ค. ์˜ˆ์‹œ๋ฅผ ๋“ค์–ด๋ณด๋ฉด, ์šฐ๋ฆฌ๊ฐ€ ํ•˜๊ณ  ์‹ถ์€ ์ž‘์—…์ด ์šฐ์šธ์ฆ ๊ฒฝํ–ฅ ๋ฌธํ—Œ ๋ถ„๋ฅ˜๋ผ๊ณ  ํ•˜์˜€์„ ๋•Œ, ์ด๋ฏธ ์œ„ํ‚คํ”ผ๋””์•„ ๋“ฑ์œผ๋กœ ์‚ฌ์ „ ํ•™์Šต๋œ BERT ์œ„์— ๋ถ„๋ฅ˜๋ฅผ ์œ„ํ•œ ์‹ ๊ฒฝ๋ง์„ ํ•œ ์ธต ์ถ”๊ฐ€ํ•˜๋Š” ๊ฒƒ์ด๋‹ค. ์ด๋ฏธ BERT๊ฐ€ ์–ธ์–ด ๋ชจ๋ธ ์‚ฌ์ „ ํ•™์Šต ๊ณผ์ •์—์„œ ์–ป์€ ์ง€์‹์„ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ์œผ๋ฏ€๋กœ ์šฐ์šธ์ฆ ๊ฒฝํ–ฅ ๋ฌธํ—Œ ๋ถ„๋ฅ˜์—์„œ ๋ณด๋‹ค ๋” ์ข‹์€ ์„ฑ๋Šฅ์„ ์–ป์„ ์ˆ˜ ์žˆ๋‹ค.

2. ์–ด๋ ค์› ๋˜ ์ 

KoBERT ๋ชจ๋ธ์„ ์ œ๊ณตํ•˜๋˜ ๊ธฐ์กด ์„œ๋ฒ„๊ฐ€ ์™„์ „ํžˆ ๋‹ซํžˆ๊ณ  hugingface hub๋กœ ์ด์ „๋˜์—ˆ๋‹ค.
ํ•˜์ง€๋งŒ ์—ฌ์ „ํžˆ github์—๋Š” ๋‘๊ฐ€์ง€์˜ method๋ฅผ ์ œ๊ณตํ•˜๊ณ  ์žˆ๊ณ 
๋ฐ”๋€ ์„œ๋ฒ„๋กœ์˜ ์˜ˆ์ œ ์ฝ”๋“œ๋ฅผ ์ œ๊ณตํ•ด์ฃผ๊ณ  ์žˆ์ง€ ์•Š์•„
์ฝ”๋“œ๋ฅผ ์ดํ•ดํ•˜๊ณ  ์ž‘์„ฑํ•˜๋Š”๋ฐ ์˜ค๋žœ ์‹œ๊ฐ„์ด ๊ฑธ๋ ธ๋‹ค.

๊ธฐ์กด์— ์ œ๊ณตํ•˜๋˜ ์„œ๋ฒ„, KoBERT ๊ณต์‹ ์ฝ”๋“œ ๋ฐœ์ทŒ
๊ฐ€ ๋‹ซํ˜€์žˆ์Œ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

ํ•ด๋‹น ๋ฌธ์ œ๋ฅผ ๋ฐœ๊ฒฌํ•ด ๋ฐœํ–‰๋œ issue์— ๋ฌด๋ ค 36๋ถ„ ์ „์— ๊ฐœ๋ฐœ์ž๋ถ„๊ป˜์„œ comment๋ฅผ ๋‚จ๊ฒจ์ฃผ์…จ์—ˆ๋‹ค. ๋‚ด๋‚ด it ๊ธฐ์ˆ ์„ ๋‹ค๋ฃจ์–ด๋„ ๊ฝค๋‚˜ ์˜ค๋žซ๋™์•ˆ ๊ฒ€์ฆ๋ฐ›๊ณ  ์„œ๋น„์Šค๊ฐ€ ๋งŽ์ด ๋˜๊ณ  ์žˆ๋Š” ์•ˆ์ •์ ์ธ ์–ธ์–ด, ํ”„๋ ˆ์ž„์›Œํฌ, ๋ชจ๋ธ๋“ค๋งŒ ์‚ฌ์šฉํ•˜๋‹ค๊ฐ€ ๊ฐ‘์ž๊ธฐ ์‹ ๊ธฐ์ˆ ์˜ ์ •์ ์— ์„œ์žˆ๋Š” ๊ฒƒ ๊ฐ™์•„ ์กฐ๊ธˆ ์‹ ๊ธฐํ–ˆ๋‹ค.

 

3. ๋ฐ์ดํ„ฐ์…‹ ์„ค๋ช…, ์ฝ”๋”ฉ ํ™˜๊ฒฝ

๋„ค์ด๋ฒ„ ๋ฆฌ๋ทฐ 2์ค‘ ๋ถ„๋ฅ˜ ์˜ˆ์ œ ์ฝ”๋“œ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ์ž‘์„ฑํ•˜์˜€๊ณ  ๋ฐ์ดํ„ฐ๋Š” AiHub์˜ ๊ฐ์ • ๋ถ„๋ฅ˜๋ฅผ ์œ„ํ•œ ๋Œ€ํ™” ์Œ์„ฑ ๋ฐ์ดํ„ฐ์…‹์„ ์ด์šฉํ–ˆ๋‹ค. ํ•ด๋‹น ๋ฐ์ดํ„ฐ์…‹์—์„œ๋Š” ๊ฐ์ •์€ 7๊ฐ€์ง€ ๊ฐ์ •(happiness-ํ–‰๋ณต, angry-๋ถ„๋…ธ, disgust-ํ˜์˜ค, fear-๊ณตํฌ, neutral-์ค‘๋ฆฝ, sadness-์Šฌํ””, surprise-๋†€๋žŒ) 7๊ฐ€์ง€๋กœ ๋ถ„๋ฅ˜ํ•ด์ฃผ๊ณ  ์žˆ๋‹ค. colab ํ™˜๊ฒฝ์—์„œ ์ฝ”๋”ฉํ•˜๊ณ  ํ…Œ์ŠคํŠธํ•ด๋ณด์•˜๋‹ค.

4. ์‹ค์Šต ์„ค๋ช…, ์ฝ”๋“œ

1. Setting up the Colab environment

!pip install gluonnlp pandas tqdm   
!pip install mxnet
!pip install sentencepiece==0.1.91
!pip install transformers==4.8.2
!pip install torch

ํ˜„์žฌ ์š”๊ตฌํ•˜๋Š” ์‚ฌ์–‘์€ ๊ทธ๋ฆผ๊ณผ ๊ฐ™๋‹ค.

KoBERT ๊ฐ€ ์š”๊ตฌํ•˜๋Š” ์ตœ์‹  ๋ฒ„์ „ ์ •๋ณด๋Š” ์ด ๊ณต์‹ ์˜คํ”ˆ์†Œ์Šค ๋งํฌ#์—์„œ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

2. github์—์„œ KoBERT ํŒŒ์ผ์„ ๋กœ๋“œ ๋ฐ KoBERT๋ชจ๋ธ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ

!pip install 'git+https://github.com/SKTBrain/KoBERT.git#egg=kobert_tokenizer&subdirectory=kobert_hf'

This command installs the kobert_tokenizer folder from https://github.com/SKTBrain/KoBERT/tree/master/kobert_hf.

!pip install git+https://git@github.com/SKTBrain/KoBERT.git@master

https://github.com/SKTBrain/KoBERT ์˜ ํŒŒ์ผ๋“ค์„ ๋‹ค์šด๋ฐ›๋Š” ์ฝ”๋“œ์ด๋‹ค.

from kobert.pytorch_kobert import get_kobert_model
from kobert_tokenizer import KoBERTTokenizer
tokenizer = KoBERTTokenizer.from_pretrained('skt/kobert-base-v1')
bertmodel, vocab = get_kobert_model('skt/kobert-base-v1',tokenizer.vocab_file)

BERT๋Š” ์ด๋ฏธ ๋ˆ„๊ตฐ๊ฐ€๊ฐ€ ํ•™์Šตํ•ด๋‘” ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•œ๋‹ค(pre-trained model)๋Š” ๊ฒƒ์„ ๋œปํ•œ๋‹ค. ๋”ฐ๋ผ์„œ ์‚ฌ์šฉํ•˜๋Š” model๊ณผ tokenizer๋Š” ํ•ญ์ƒ mapping ๊ด€๊ณ„์—ฌ์•ผ ํ•œ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด์„œ U ํŒ€์ด ๊ฐœ๋ฐœํ•œ BERT๋ฅผ ์‚ฌ์šฉํ•˜๋Š”๋ฐ, VํŒ€์ด ๊ฐœ๋ฐœํ•œ BERT์˜ tokenizer๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด model์€ ํ…์ŠคํŠธ๋ฅผ ์ดํ•ดํ•  ์ˆ˜ ์—†๋‹ค. UํŒ€์˜ BERT์˜ ํ† ํฌ๋‚˜์ด์ €๋Š” '์šฐ๋ฆฌ'๋ผ๋Š” ๋‹จ์–ด๋ฅผ 23๋ฒˆ์œผ๋กœ int encodingํ•˜๋Š” ๋ฐ˜๋ฉด์—, V๋ผ๋Š” BERT์˜ tokenizer๋Š” '์šฐ๋ฆฌ'๋ผ๋Š” ๋‹จ์–ด๋ฅผ 103๋ฒˆ์œผ๋กœ int encodingํ•ด ๋‹จ์–ด์™€ mapping ๋˜๋Š” ์ •๋ณด ์ž์ฒด๊ฐ€ ๋‹ฌ๋ผ์ง€๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ์ด ๋ถ€๋ถ„์€ ๋’ค์—์„œ ๊ฐ„๋‹จํžˆ ์ง„ํ–‰ํ•ด๋ณธ ์‹ค์Šต์—์„œ ๋” ์ž์„ธํžˆ ๋‹ค๋ค„๋ณผ ๊ฒƒ์ด๋‹ค.

 

 

ํ•œ ์ค„ ํ•œ ์ค„ ์ž์„ธํžˆ ์„ค๋ช…ํ•ด๋ณด๋ฉด,

1: This imports the get_kobert_model function from the pytorch_kobert.py file in the kobert folder of https://github.com/SKTBrain/KoBERT.

get_kobert_model method

2: This imports the KoBERTTokenizer class from the kobert_tokenizer.py file in the kobert_tokenizer folder of https://github.com/SKTBrain/KoBERT/tree/master/kobert_hf.

KoBERTTokenizer class

3, 4: ๋ถˆ๋Ÿฌ์˜จ ๋ฉ”์„œ๋“œ๋ฅผ ํ˜ธ์ถœํ•˜๊ณ  ํด๋ž˜์Šค๋ฅผ ๊ฐ€์ ธ์™€ ๊ฐ๊ฐ tokenizer์™€, model, vocabulary๋ฅผ ๋ถˆ๋Ÿฌ์™”๋‹ค.

์ด๋ ‡๊ฒŒ ํ•ด๋‹น ์‹ค์Šต์— ํ•„์š”ํ•œ KoBERT๋ชจ๋ธ์„ ๋ถˆ๋Ÿฌ์˜ค๋Š” ์ฝ”๋“œ ์„ค๋ช…์€ ๋๋‚ฌ๋‹ค.

 


 

์•„๋ž˜์˜ ๋ฐฉ๋ฒ•์€ ๋˜ ๋‹ค๋ฅธ ๋ฐฉ๋ฒ•์œผ๋กœ KoBERT ๋ชจ๋ธ์„ ๋ถˆ๋Ÿฌ์˜ค๋Š” ์ฝ”๋“œ์ด๋‹ค.

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("skt/kobert-base-v1")
model = AutoModel.from_pretrained("skt/kobert-base-v1")

github์„ ํ†ตํ•œ ๋‹ค์šด๋กœ๋“œ ์—†์ด ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค. ํ•˜์ง€๋งŒ ํ•ด๋‹น ์ฝ”๋“œ๋กœ๋Š” vocab์— ์ ‘๊ทผํ•˜๊ธฐ ์–ด๋ ต๋‹ค. ๊ฐ์ž์˜ ์ƒํ™ฉ์— ๋งž๊ฒŒ ์“ฐ๋ฉด ๋  ๊ฒƒ ๊ฐ™๋‹ค.

huggingface ์‚ฌ์ดํŠธ์— skt/kobert-base-v1์œผ๋กœ ์˜ฌ๋ผ์™€์žˆ๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

3. ํ•„์š”ํ•œ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ

import torch
from torch import nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import gluonnlp as nlp
import numpy as np
from tqdm import tqdm, tqdm_notebook
import pandas as pd

#transformers
from transformers import AdamW
from transformers.optimization import get_cosine_schedule_with_warmup
from transformers import BertModel

#GPU ์‚ฌ์šฉ ์‹œ
device = torch.device("cuda:0")

์‚ฌ์ „ ํ•™์Šต๋œ BERT๋ฅผ ์‚ฌ์šฉํ•  ๋•Œ๋Š” transformers๋ผ๋Š” ํŒจํ‚ค์ง€๋ฅผ ์ž์ฃผ ์‚ฌ์šฉํ•œ๋‹ค.

๋˜ํ•œ, ํ•™์Šต์‹œ๊ฐ„์„ ์ค„์ด๊ธฐ ์œ„ํ•ด, GPU๋ฅผ ์‚ฌ์šฉํ–ˆ๋‹ค.

4. ๋ฐ์ดํ„ฐ์…‹ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ

์™ผ์ชฝ ์ƒ๋‹จ์— ํด๋” ๋ฒ„ํŠผ์„ ๋ˆ„๋ฅด๊ณ  ์—…๋กœ๋“œ ๋ฒ„ํŠผ์„ ๋ˆŒ๋Ÿฌ์„œ ๋กœ์ปฌ์— ์žˆ๋Š” ํŒŒ์ผ์„ ์—…๋กœ๋“œํ•  ์ˆ˜ ์žˆ๋‹ค. ์˜ค๋Š˜ ์‹ค์Šต์— ํ•„์š”ํ•œ '๊ฐ์ •๋ถ„๋ฅ˜๋ฐ์ดํ„ฐ์…‹.csv' ํŒŒ์ผ์„ ์—…๋กœ๋“œํ•˜๋ฉด ์œ„ ๊ทธ๋ฆผ๊ณผ ๊ฐ™์ด ํŒŒ์ผ์ด ์ƒ๊ธด๋‹ค.

data = pd.read_csv('๊ฐ์ •๋ถ„๋ฅ˜๋ฐ์ดํ„ฐ์…‹.csv', encoding='cp949')

ํŒŒ์ผ ์ด๋ฆ„์„ ์ง€์ •ํ•ด pandas library๋กœ ์—ด์–ด์ฃผ์—ˆ๋‹ค.

ํ•ด๋‹น ๋ฐ์ดํ„ฐ์…‹์€ ์œ„์™€ ๊ฐ™์ด column ๋ช…์ด ์ง€์ •๋˜์–ด ์žˆ๋‹ค.

data.loc[(data['상황'] == "fear"), '상황'] = 0       # fear => 0
data.loc[(data['상황'] == "surprise"), '상황'] = 1   # surprise => 1
data.loc[(data['상황'] == "angry"), '상황'] = 2      # anger => 2
data.loc[(data['상황'] == "sadness"), '상황'] = 3    # sadness => 3
data.loc[(data['상황'] == "neutral"), '상황'] = 4    # neutral => 4
data.loc[(data['상황'] == "happiness"), '상황'] = 5  # happiness => 5
data.loc[(data['상황'] == "disgust"), '상황'] = 6    # disgust => 6


data_list = []
for ques, label in zip(data['발화문'], data['상황']):
    # store each example as a [sentence, label] pair
    # (the inner list is no longer named `data`, to avoid shadowing the DataFrame)
    data_list.append([ques, str(label)])

7๊ฐœ์˜ ๊ฐ์ • class๋ฅผ 0~6 ์ˆซ์ž์— ๋Œ€์‘์‹œ์ผœ data_list์— ๋‹ด์•„์ค€๋‹ค.

5. ์ž…๋ ฅ ๋ฐ์ดํ„ฐ์…‹์„ ํ† ํฐํ™”ํ•˜๊ธฐ

class BERTDataset(Dataset):
    def __init__(self, dataset, sent_idx, label_idx, bert_tokenizer,vocab, max_len,
                 pad, pair):
   
        transform = nlp.data.BERTSentenceTransform(
            bert_tokenizer, max_seq_length=max_len,vocab=vocab, pad=pad, pair=pair)
        
        self.sentences = [transform([i[sent_idx]]) for i in dataset]
        self.labels = [np.int32(i[label_idx]) for i in dataset]

    def __getitem__(self, i):
        return (self.sentences[i] + (self.labels[i], ))
         

    def __len__(self):
        return (len(self.labels))

๊ฐ ๋ฐ์ดํ„ฐ๊ฐ€ BERT ๋ชจ๋ธ์˜ ์ž…๋ ฅ์œผ๋กœ ๋“ค์–ด๊ฐˆ ์ˆ˜ ์žˆ๋„๋ก tokenization, int encoding, padding ๋“ฑ์„ ํ•ด์ฃผ๋Š” ์ฝ”๋“œ์ด๋‹ค.

 


<tokenization ๋ถ€๊ฐ€ ์„ค๋ช…>

"tokenization" ์ด ๋‚ด์šฉ์— ๋Œ€ํ•ด์„œ ์กฐ๊ธˆ ์ € ์ž์„ธํžˆ ์ ์–ด๋ณด๋ ค๊ณ  ํ•œ๋‹ค.

์ฐธ๊ณ ๋กœ ์ด ๋ถ€๋ถ„์€ ๋‹ค์ค‘ ๋ถ„๋ฅ˜๋ฅผ ์œ„ํ•œ ์ฝ”๋“œ์™€ ์—ฐ๊ฒฐ๋˜์ง€ ์•Š๋Š”๋‹ค. ๊ทธ์ € ์ดํ•ด๋ฅผ ๋•๊ธฐ ์œ„ํ•ด ๋ถ€๊ฐ€ ์„ค๋ช…๋œ ๋ถ€๋ถ„์ด๋‹ค !!

์ฐธ๊ณ ๋กœ, ํ˜„์žฌ ์‚ฌ์šฉํ•œ ๋ชจ๋ธ์€ KoBERT ์ด์ง€๋งŒ, KcBERT์—์„œ vocab file์„ ๋ณด๊ธฐ ์‰ฝ๊ฒŒ text ํŒŒ์ผ๋กœ ๋ณด์—ฌ์ฃผ๊ณ  ์žˆ์–ด, ๋จผ์ € KcBERT๋กœ ํ† ํฐํ™”, ์ •์ˆ˜ ์ธ์ฝ”๋”ฉ, padding์„ ๋ณด๊ณ  KoBERT์—์„œ๋„ ๋‹ค์‹œ ํ•œ๋ฒˆ ์‹คํ—˜ํ•ด๋ณด๊ณ ์ž ํ•œ๋‹ค.

KcBERT ๋ชจ๋ธ์—์„œ์˜ ํ† ํฐํ™”์™€ ์ •์ˆ˜ ์ธ์ฝ”๋”ฉ

kcbert์˜ vocabulary

KcBERT ๋ชจ๋ธ์„ ๋งŒ๋“ค๊ธฐ ์œ„ํ•œ ๊ธฐ์กด ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ๋กœ๋ถ€ํ„ฐ ๋งŒ๋“ค์–ด์ง„ vocabulary๋Š” ์œ„์™€ ๊ฐ™์€ ํ˜•ํƒœ๋ฅผ ๋ˆ๋‹ค.

์‚ฌ์ง„์„ ๋ˆ„๋ฅด๋ฉด KcBERT ๊ณต์‹ repo๋กœ ์—ฐ๊ฒฐ๋œ๋‹ค.

์ด vocab๊ฐ€ ์–ด๋–ป๊ฒŒ ๋Œ€์‘๋˜๋Š”์ง€ ์‚ดํŽด๋ณด๊ธฐ ์œ„ํ•ด ์•„์ฃผ ๊ฐ„๋‹จํ•œ ์ฝ”๋“œ๋ฅผ ์ง์ ‘ ์ž‘์„ฑํ•ด๋ณด์•˜๋‹ค.

!pip install --no-cache-dir transformers sentencepiece 

from transformers import AutoTokenizer, AutoModelForMaskedLM

# kcbert์˜ tokenizer์™€ ๋ชจ๋ธ์„ ๋ถˆ๋Ÿฌ์˜ด.
kcbert_tokenizer = AutoTokenizer.from_pretrained("beomi/kcbert-base")
kcbert = AutoModelForMaskedLM.from_pretrained("beomi/kcbert-base")

result = kcbert_tokenizer.tokenize("๋„ˆ๋Š” ๋‚ด๋…„ ๋Œ€์„  ๋•Œ ํˆฌํ‘œํ•  ์ˆ˜ ์žˆ์–ด?")
print(result)
print(kcbert_tokenizer.vocab['๋Œ€์„ '])
print([kcbert_tokenizer.encode(token) for token in result])

kcbert output

์ƒˆ๋กœ์šด colab ํŒŒ์ผ์„ ์—ด์–ด ๋‹จ 4์ค„์˜ ์ฝ”๋“œ๋กœ model๊ณผ tokenizer๋ฅผ ๋ถˆ๋Ÿฌ์˜ฌ ์ˆ˜ ์žˆ๋‹ค.

์ƒˆ๋กœ์šด ๋ฌธ์žฅ์„ ๋งŒ๋“ค์–ด ๋จผ์ € tokenize ํ•ด๋ณด์•˜๋”๋‹ˆ, ๋Œ€์„ ์ด๋ผ๋Š” ๋‹จ์–ด๋Š” ๋ชจ๋ธ์—์„œ '๋Œ€์„ '์œผ๋กœ ๋Œ€์‘๋˜๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

๋ชจ๋ธ์—์„œ '๋Œ€์„ '์œผ๋กœ ์ •์˜๋˜์–ด ์žˆ๋Š” ๊ฒƒ์„ ํ™•์ธํ•œ ํ›„ ํ•ด๋‹น ๋‹จ์–ด๋ฅผ vocabulary ํŒŒ์ผ์—์„œ  9311๋กœ ๋Œ€์‘๋œ๋‹ค.

vocab์˜ indexing์€ 0๋ถ€ํ„ฐ ์‹œ์ž‘ํ•˜๋Š”๋ฐ github ํ™˜๊ฒฝ์—์„œ line of code๋Š” 1๋ถ€ํ„ฐ ์‹œ์ž‘ํ•˜๋Š” ๊ฒƒ ๋•Œ๋ฌธ์— 9312์—์„œ 1์ด ๋บ€ ๊ฐ’์ด ๋‚˜์˜ค๋Š” ๊ฒƒ์ด๋‹ค.

์ฆ‰, ์•ž์—์„œ vocab ํŒŒ์ผ๊ณผ ์ •ํ™•ํžˆ ๋Œ€์‘๋˜๋Š” ๊ฒƒ์„ ์ง์ ‘ ๋ˆˆ์œผ๋กœ ํ™•์ธํ•ด๋ณผ ์ˆ˜ ์žˆ์—ˆ๋‹ค. 

๋” ๋‚˜์•„๊ฐ€ ํ† ํฐํ™”๋œ ๋ฌธ์žฅ์„ ์ •์ˆ˜ ์ธ์ฝ”๋”ฉํ•ด ํ™•์ธํ•ด๋ณด๋‹ˆ ์•ž๊ณผ ๋’ค์˜ 2, 3์€ padding์ด๋‚˜ margin์œผ๋กœ ํŒ๋‹จํ•  ์ˆ˜ ์žˆ๊ณ , ์—ญ์‹œ๋‚˜ 9311๋กœ ์ธ์ฝ”๋”ฉ๋œ ๊ฐ’์ด ์ž˜ ์“ฐ์ด๊ณ  ์žˆ๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค.

 

KoBERT ๋ชจ๋ธ์—์„œ์˜ ํ† ํฐํ™”์™€ ์ •์ˆ˜ ์ธ์ฝ”๋”ฉ

๊ณผ์—ฐ ๊ทธ๋ ‡๋‹ค๋ฉด ์˜ค๋Š˜ ๋‹ค๋ฃจ๊ณ  ์žˆ๋Š” KoBERT ๋ชจ๋ธ์—์„œ๋Š” ์–ด๋–ป๊ฒŒ ํ† ํฐ์œผ๋กœ ๋‚˜๋ˆ„์–ด์ค„๊นŒ?

!pip install --no-cache-dir transformers sentencepiece

from transformers import AutoModel, AutoTokenizer

kobert_tokenizer = AutoTokenizer.from_pretrained("skt/kobert-base-v1", use_fast=False)
kobert = AutoModel.from_pretrained("skt/kobert-base-v1")

result = kobert_tokenizer.tokenize("๋„ˆ๋Š” ๋‚ด๋…„ ๋Œ€์„  ๋•Œ ํˆฌํ‘œํ•  ์ˆ˜ ์žˆ์–ด?")
print(result)
kobert_vocab = kobert_tokenizer.get_vocab()
print(kobert_vocab.get('โ–๋Œ€์„ '))
print([kobert_tokenizer.encode(token) for token in result])

kobert output

kobert ์—์„œ๋Š” ์˜ˆ์ƒ๋Œ€๋กœ ๊ฐ™์€ ๋ฌธ์žฅ์ด ๋‹ค๋ฅธ ๋ฐฉ์‹์œผ๋กœ ํ† ํฐํ™”๋˜์—ˆ๋‹ค.

๋Œ€์„ ์ด๋ผ๋Š” ๋‹จ์–ด๊ฐ€ 'โ–๋Œ€์„ ' ์ด๋ž€ ํ† ํฐ์œผ๋กœ ๋Œ€์‘๋˜์—ˆ๊ณ  1654 ์ •์ˆ˜ ๊ฐ’์„ ๊ฐ€์ง„๋‹ค.

๋ฌธ์žฅ์„ ๋ชจ๋‘ ์ธ์ฝ”๋”ฉํ–ˆ์„ ๋•Œ์—๋„ ๊ฐ™์€ ๊ฐ’์œผ๋กœ ์ž˜ ์ธ์ฝ”๋”ฉ๋˜๊ณ  ์žˆ์Œ์„ ํ™•์ธํ•ด๋ณด์•˜๋‹ค.

 

์ด ์‹คํ—˜์œผ๋กœ bert ๋ชจ๋ธ์ด ๋ฌธ์žฅ์„ ์–ด๋–ป๊ฒŒ ๋‚˜๋ˆ„์–ด ํ•™์Šตํ•  ์ˆ˜ ์žˆ๋Š”์ง€ ์ดํ•ดํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค.

๋˜ํ•œ, ์•ž์„œ์„œ ์„ค๋ช…ํ–ˆ๋˜ ๋ถ€๋ถ„ ์ค‘ ์‚ฌ์šฉํ•˜๋Š” model๊ณผ tokenizer๋Š” ํ•ญ์ƒ mapping ๊ด€๊ณ„์—ฌ์•ผ ํ•œ๋‹ค๋Š” ์‚ฌ์‹ค์ด ์ด์ œ๋Š” ๋„ˆ๋ฌด๋‚˜ ๋‹น์—ฐํ•˜๊ฒŒ ๋Š๊ปด์ง„๋‹ค.

๋งˆ์ง€๋ง‰์œผ๋กœ KoBERT๊ฐ€ ์•„๋‹Œ ๋˜ ๋‹ค๋ฅธ ๋ชจ๋ธ์ธ KcBERT๋ฅผ ๋ถˆ๋Ÿฌ์™€ ์ง์ ‘ ์ฝ”๋”ฉํ•ด๋ณด๋Š” ๊ณผ์ •์—์„œ ๋‹ค๋ฅธ ๋ชจ๋ธ์„ ์ ์šฉํ•ด๋ณด๋Š” ์‹œํ–‰์ฐฉ์˜ค๋„ ๋ฏธ๋ฆฌ ๊ฒฝํ—˜ํ•ด๋ณด์•˜๋‹ค.

 

์ž! ๋‹ค์‹œ ๋‹ค์ค‘ ๋ถ„๋ฅ˜ ๋ชจ๋ธ์„ ๊ตฌํ˜„ํ•˜๊ธฐ ์œ„ํ•œ ์ฝ”๋“œ๋กœ ๋Œ์•„๊ฐ€๋ณด์ž!

 


 

# Setting parameters
max_len = 64
batch_size = 64
warmup_ratio = 0.1
num_epochs = 5  
max_grad_norm = 1
log_interval = 200
learning_rate =  5e-5

parameter์˜ ๊ฒฝ์šฐ, ์˜ˆ์‹œ ์ฝ”๋“œ์— ์žˆ๋Š” ๊ฐ’๋“ค์„ ๋™์ผํ•˜๊ฒŒ ์„ค์ •ํ•ด์ฃผ์—ˆ๋‹ค.

# split into train & test data
from sklearn.model_selection import train_test_split

dataset_train, dataset_test = train_test_split(data_list, test_size=0.2, shuffle=True, random_state=34)

์‚ฌ์ดํ‚ท๋Ÿฐ์—์„œ ์ œ๊ณตํ•ด์ฃผ๋Š” train_test_split ๋ฉ”์„œ๋“œ๋ฅผ ํ™œ์šฉํ•ด ๊ธฐ์กด data_list๋ฅผ train ๋ฐ์ดํ„ฐ์…‹๊ณผ test ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ๋‚˜๋ˆˆ๋‹ค. 5:1 ๋น„์œจ๋กœ ๋‚˜๋ˆ„์—ˆ๋‹ค.

tok=tokenizer.tokenize
data_train = BERTDataset(dataset_train, 0, 1, tok, vocab, max_len, True, False)
data_test = BERTDataset(dataset_test,0, 1, tok, vocab,  max_len, True, False)

์œ„์—์„œ ๊ตฌํ˜„ํ•œ BERTDataset ํด๋ž˜์Šค๋ฅผ ํ™œ์šฉํ•ด tokenization, int encoding, padding ์„ ์ง„ํ–‰ํ•˜์˜€๋‹ค.

train_dataloader = torch.utils.data.DataLoader(data_train, batch_size=batch_size, num_workers=5)
test_dataloader = torch.utils.data.DataLoader(data_test, batch_size=batch_size, num_workers=5)

torch ํ˜•์‹์˜ dataset์„ ๋งŒ๋“ค์–ด์ฃผ๋ฉด์„œ, ์ž…๋ ฅ ๋ฐ์ดํ„ฐ์…‹์˜ ์ฒ˜๋ฆฌ๊ฐ€ ๋ชจ๋‘ ๋๋‚ฌ๋‹ค.

6. Implementing the KoBERT model

class BERTClassifier(nn.Module):
    def __init__(self,
                 bert,
                 hidden_size = 768,
                 num_classes=7,   ## adjust the number of classes here ##
                 dr_rate=None,
                 params=None):
        super(BERTClassifier, self).__init__()
        self.bert = bert
        self.dr_rate = dr_rate
                 
        self.classifier = nn.Linear(hidden_size , num_classes)
        if dr_rate:
            self.dropout = nn.Dropout(p=dr_rate)
    
    def gen_attention_mask(self, token_ids, valid_length):
        attention_mask = torch.zeros_like(token_ids)
        for i, v in enumerate(valid_length):
            attention_mask[i][:v] = 1
        return attention_mask.float()

    def forward(self, token_ids, valid_length, segment_ids):
        attention_mask = self.gen_attention_mask(token_ids, valid_length)
        
        _, pooler = self.bert(input_ids = token_ids, token_type_ids = segment_ids.long(), attention_mask = attention_mask.float().to(token_ids.device),return_dict=False)
        # apply dropout only when a rate was given; otherwise pass the pooled output straight through
        if self.dr_rate:
            out = self.dropout(pooler)
        else:
            out = pooler
        return self.classifier(out)
#BERT ๋ชจ๋ธ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ
model = BERTClassifier(bertmodel,  dr_rate=0.5).to(device)
 
#optimizer์™€ schedule ์„ค์ •
no_decay = ['bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]

optimizer = AdamW(optimizer_grouped_parameters, lr=learning_rate)
loss_fn = nn.CrossEntropyLoss()  # the standard loss function for multi-class classification

t_total = len(train_dataloader) * num_epochs
warmup_step = int(t_total * warmup_ratio)

scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=warmup_step, num_training_steps=t_total)

#์ •ํ™•๋„ ์ธก์ •์„ ์œ„ํ•œ ํ•จ์ˆ˜ ์ •์˜
def calc_accuracy(X,Y):
    max_vals, max_indices = torch.max(X, 1)
    train_acc = (max_indices == Y).sum().data.cpu().numpy()/max_indices.size()[0]
    return train_acc
    

์˜ˆ์ œ ์ฝ”๋“œ์™€ ๋™์ผํ•˜๊ฒŒ ์‚ฌ์šฉํ•˜์˜€๋‹ค.

7. train

train_history=[]
test_history=[]
loss_history=[]
for e in range(num_epochs):
    train_acc = 0.0
    test_acc = 0.0
    model.train()
    for batch_id, (token_ids, valid_length, segment_ids, label) in enumerate(tqdm_notebook(train_dataloader)):
        optimizer.zero_grad()
        token_ids = token_ids.long().to(device)
        segment_ids = segment_ids.long().to(device)
        valid_length= valid_length
        label = label.long().to(device)
        out = model(token_ids, valid_length, segment_ids)
         
        #print(label.shape,out.shape)
        loss = loss_fn(out, label)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        scheduler.step()  # Update learning rate schedule
        train_acc += calc_accuracy(out, label)
        if batch_id % log_interval == 0:
            print("epoch {} batch id {} loss {} train acc {}".format(e+1, batch_id+1, loss.data.cpu().numpy(), train_acc / (batch_id+1)))
            train_history.append(train_acc / (batch_id+1))
            loss_history.append(loss.data.cpu().numpy())
    print("epoch {} train acc {}".format(e+1, train_acc / (batch_id+1)))
    #train_history.append(train_acc / (batch_id+1))
    
    model.eval()
    for batch_id, (token_ids, valid_length, segment_ids, label) in enumerate(tqdm_notebook(test_dataloader)):
        token_ids = token_ids.long().to(device)
        segment_ids = segment_ids.long().to(device)
        valid_length= valid_length
        label = label.long().to(device)
        out = model(token_ids, valid_length, segment_ids)
        test_acc += calc_accuracy(out, label)
    print("epoch {} test acc {}".format(e+1, test_acc / (batch_id+1)))
    test_history.append(test_acc / (batch_id+1))

KoBERT ๋ชจ๋ธ์„ ํ•™์Šต์‹œํ‚ค๋Š” ์ฝ”๋“œ์ด๋‹ค. epoch์˜ ์ˆ˜์— ๋”ฐ๋ผ ํ•™์Šต์‹œ๊ฐ„์ด ๋งŽ์ด ๋‹ฌ๋ผ์ง€๋Š”๋ฐ epoch๋ฅผ 5๋กœ ์ง€์ •ํ•˜๋ฉด 30๋ถ„ ์ •๋„ ๊ฑธ๋ฆฐ๋‹ค. ์•ž์„œ ์ด ๋ชจ๋ธ์—์„œ ํ•™์Šต์‹œํ‚ฌ ์ˆ˜ ์žˆ๋„๋ก ์ž…๋ ฅ ๋ฐ์ดํ„ฐ์…‹์„ ์ฒ˜๋ฆฌํ•˜๊ณ , ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๋ชจ๋‘ ์ง€์ •ํ•˜์˜€์œผ๋ฏ€๋กœ ์˜ˆ์‹œ ์ฝ”๋“œ์™€ ๋™์ผํ•˜๊ฒŒ ์ง„ํ–‰ํ•˜์˜€๋‹ค.

train dataset์— ๋Œ€ํ•ด์„œ๋Š” 0.979, test dataset์— ๋Œ€ํ•ด์„œ๋Š” 0.918์˜ ์ •ํ™•๋„๋ฅผ ๊ธฐ๋กํ–ˆ๋‹ค.

 

์ด๋ ‡๊ฒŒ ๋†’์€ ์ •ํ™•๋„๋ฅผ ๊ธฐ๋กํ•˜๋Š” ์ด์œ ๋Š” ๋ฐ”๋กœ ๋ฐ์ดํ„ฐ์…‹์— ์žˆ๋‹ค.

์ง„ํ–‰ํ•˜๊ณ  ์žˆ๋Š” ํ”„๋กœ์ ํŠธ์—์„œ ํ•ด๋‹น ํ”„๋กœ์ ํŠธ์˜ ์„ฑ๊ฒฉ์— ๋งž๊ฒŒ 3๋งŒ 7์ฒœ ์—ฌ๊ฐœ ์ •๋„์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์ง์ ‘ ์žฌ๋ถ„๋ฅ˜ํ•˜๊ณ  ์žˆ๋Š”๋ฐ, ์‚ฌ๋žŒ์ด ์ƒ๊ฐํ•˜๊ธฐ์—๋Š” ๋ถ„๋ช…ํžˆ ๊ฐ™์€ ๋‚ด์šฉ์œผ๋กœ ์ธ์ง€๋˜๋Š” ๋ฌธ์žฅ์ด์ง€๋งŒ ์‚ฌ์†Œํ•œ ๋‹จ์–ด๋ฅผ ํ•œ ๋‘๊ฐœ ์—†์• ์„œ, ์–ด๋–จ ๋•Œ๋Š” ์ค‘์š”ํ•œ ๋‹จ์–ด๋ฅผ ์ œ๊ฑฐํ•ด์„œ, ์งง๊ฒŒ ์ž˜๋ผ์„œ, ๊ธธ๊ฒŒ ๋Š˜์—ฌ๋œจ๋ ค์„œ, ๋ฌธ์žฅ ๋ถ€ํ˜ธ๋ฅผ ๋‹ค๋ฅด๊ฒŒ, ๊ฐํƒ„์‚ฌ๋ฅผ ์ถ”๊ฐ€ํ•ด์„œ ๋“ฑ๋“ฑ ์ค‘๋ณต์ธ ๋“ฏ ์ค‘๋ณต ์•„๋‹Œ ๋ฐ์ดํ„ฐ๋กœ ํ•™์Šต์‹œ์ผฐ๊ธฐ ๋•Œ๋ฌธ์— ์ด๋ ‡๊ฒŒ ๋†’์€ ์ •ํ™•๋„๊ฐ€ ๋‚˜์˜จ ๊ฒƒ์ด๋ผ ํŒ๋‹จ๋œ๋‹ค.

8. ์ง์ ‘ ๋งŒ๋“  ์ƒˆ๋กœ์šด ๋ฌธ์žฅ์œผ๋กœ ํ…Œ์ŠคํŠธ

์ด์ œ ์ง์ ‘ ๋ฌธ์žฅ์„ ๋งŒ๋“ค์–ด ํ•™์Šต๋œ ๋ชจ๋ธ์ด ๋‹ค์ค‘ ๋ถ„๋ฅ˜๋ฅผ ์ž˜ ํ•ด๋‚ด๋Š”์ง€ ์•Œ์•„๋ณด๋ ค๊ณ  ํ•œ๋‹ค.

def predict(predict_sentence):

    data = [predict_sentence, '0']
    dataset_another = [data]

    another_test = BERTDataset(dataset_another, 0, 1, tok, vocab, max_len, True, False)
    test_dataloader = torch.utils.data.DataLoader(another_test, batch_size=batch_size, num_workers=5)
    
    model.eval()

    for batch_id, (token_ids, valid_length, segment_ids, label) in enumerate(test_dataloader):
        token_ids = token_ids.long().to(device)
        segment_ids = segment_ids.long().to(device)

        valid_length= valid_length
        label = label.long().to(device)

        out = model(token_ids, valid_length, segment_ids)


        test_eval=[]
        for i in out:
            logits=i
            logits = logits.detach().cpu().numpy()

            if np.argmax(logits) == 0:
                test_eval.append("๊ณตํฌ๊ฐ€")
            elif np.argmax(logits) == 1:
                test_eval.append("๋†€๋žŒ์ด")
            elif np.argmax(logits) == 2:
                test_eval.append("๋ถ„๋…ธ๊ฐ€")
            elif np.argmax(logits) == 3:
                test_eval.append("์Šฌํ””์ด")
            elif np.argmax(logits) == 4:
                test_eval.append("์ค‘๋ฆฝ์ด")
            elif np.argmax(logits) == 5:
                test_eval.append("ํ–‰๋ณต์ด")
            elif np.argmax(logits) == 6:
                test_eval.append("ํ˜์˜ค๊ฐ€")

        print(">> ์ž…๋ ฅํ•˜์‹  ๋‚ด์šฉ์—์„œ " + test_eval[0] + " ๋Š๊ปด์ง‘๋‹ˆ๋‹ค.")

ํ•™์Šต๋œ ๋ชจ๋ธ์„ ํ™œ์šฉํ•˜์—ฌ ๋‹ค์ค‘ ๋ถ„๋ฅ˜๋œ ํด๋ž˜์Šค๋ฅผ ์ถœ๋ ฅํ•ด์ฃผ๋Š” predict ํ•จ์ˆ˜๋ฅผ ๊ตฌํ˜„ํ•œ ๊ฒƒ์ด๋‹ค.

#์งˆ๋ฌธ ๋ฌดํ•œ๋ฐ˜๋ณตํ•˜๊ธฐ! 0 ์ž…๋ ฅ์‹œ ์ข…๋ฃŒ
end = 1
while end == 1 :
    sentence = input("ํ•˜๊ณ ์‹ถ์€ ๋ง์„ ์ž…๋ ฅํ•ด์ฃผ์„ธ์š” : ")
    if sentence == "0" :
        break
    predict(sentence)
    print("\n")

5. ์ด ๋ชจ๋ธ์ด ์–ด๋–ป๊ฒŒ ํ™œ์šฉ๋ ๊นŒ?

์ด ๊ธฐ์ˆ ์€ ํ˜„์žฌ ๊ธ€์“ด์ด๊ฐ€ ์ง„ํ–‰ํ•˜๊ณ  ์žˆ๋Š” ์บก์Šคํ†ค ํ”„๋กœ์ ํŠธ์—์„œ ๊ฐ€์žฅ ์ฃผ์ถ•์ด ๋˜๋Š” ๊ธฐ์ˆ ์ด๋‹ค.

์กฐ๊ธˆ ๋” ์ž์„ธํžˆ ๋งํ•˜๋ฉด, ์‚ฌ์šฉ์ž๊ฐ€ ์ž‘์„ฑํ•œ ์ผ๊ธฐ๋ฅผ ๋ถ„์„ํ•˜์—ฌ 

โ‘  ์–ด๋– ํ•œ ๊ฐ์ •์„ ๋งŽ์ด ํ‘œ์ถœํ–ˆ๋Š”์ง€ ์•Œ๋ ค์ฃผ๊ณ , 

โ‘ก ๋ˆ„์ ๋œ ์ผ๊ธฐ๋“ค๋กœ ํ–‰๋ณตํ•œ ์ •๋„์™€ ์šฐ์šธํ•œ ์ •๋„๋ฅผ ๋‹จ๊ณ„๋กœ ๋ถ„๋ฅ˜ํ•˜์—ฌ ์ •๋Ÿ‰์ ์œผ๋กœ ๋ณด์—ฌ์ฃผ๋ ค๊ณ  ํ•œ๋‹ค.

โ‘  ์˜ค๋Š˜์˜ ๊ฐ์ • top 3

ํ•˜๋ฃจ๋™์•ˆ ์ž‘์„ฑ๋œ ์ผ๊ธฐ์—์„œ ์–ด๋– ํ•œ ๊ฐ์ •์„ ๋งŽ์ด ํ‘œ์ถœํ–ˆ๋Š”์ง€๋Š” ์•„๋ž˜์™€ ๊ฐ™์ด ๊ณ„์‚ฐํ•  ๊ฒƒ์ด๋‹ค.

์ด ๋ชจ๋ธ์„ ํ™œ์šฉํ•˜์—ฌ ๊ฐ ๋ฌธ์žฅ, ์ฆ‰ ๋‹จ๋ฐœ์ ์œผ๋กœ ๋ณด์ด๋Š” ๊ฐ์ •๋“ค์„ ์ถ”์ถœํ•  ๊ฒƒ์ด๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์ค‘๋ฆฝ์„ ์ œ์™ธํ•œ ๊ฐ์ •์ด ๋ณด์ด๋Š” ๋ฌธ์žฅ๋“ค์˜ ํผ์„ผํŠธ๋ฅผ ๋‚˜๋ˆ„์–ด top3์˜ ๊ฐ์ •์„ ๋‚˜ํƒ€๋‚ผ ๊ฒƒ์ด๋‹ค.

 

์‰ฝ๊ฒŒ ์˜ˆ์‹œ๋ฅผ ๋“ค์–ด๋ณด์•˜๋‹ค.

ํ•œ ์‚ฌ์šฉ์ž๋Š” 20๋ฌธ์žฅ์„ ์ ์—ˆ๋Š”๋ฐ, 10 ๋ฌธ์žฅ์ด ์ค‘๋ฆฝ์œผ๋กœ ํŒ๋‹จ๋˜๊ณ  ๋‚˜๋จธ์ง€ 10๋ฌธ์žฅ์ด ์ค‘๋ฆฝ ์ด์™ธ์˜ ๊ฐ์ •์„ ๋ณด์˜€๋‹ค. 10 ๋ฌธ์žฅ ์ค‘ ํ–‰๋ณต์œผ๋กœ ํŒ๋‹จ๋œ ๋ฌธ์žฅ์€ 5๊ฐœ, ๋ถ„๋…ธ๋กœ ํŒ๋‹จ๋œ ๋ฌธ์žฅ์€ 4๋ฌธ์žฅ, ์Šฌํ””์œผ๋กœ ํŒ๋‹จ๋œ ๋ฌธ์žฅ์„ 1๋ฌธ์žฅ์ด์—ˆ๋‹ค.

→ ์˜ค๋Š˜์˜ ๋Œ€ํ‘œ ๊ฐ์ •์œผ๋กœ ํ–‰๋ณต์„ ์ถ”์ฒœํ•ด์ฃผ๊ณ , ๊ฐ์ •์€ ํ–‰๋ณต 50%, ๋ถ„๋…ธ 40%, ์Šฌํ”” 10%์œผ๋กœ ๊ธฐ๋ก๋  ๊ฒƒ์ด๋‹ค.

โ‘ก ์ข…ํ•ฉ ํ–‰๋ณต ์ง€์ˆ˜, ์ข…ํ•ฉ ์šฐ์šธ ์ง€์ˆ˜

๋ˆ„์ ๋œ ์ผ๊ธฐ๋“ค๋กœ๋ถ€ํ„ฐ ํ–‰๋ณต ์ง€์ˆ˜์™€ ์šฐ์šธ ์ง€์ˆ˜๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ๋ฐฉ๋ฒ•์€ ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

ํ–‰๋ณต์˜ ๊ฒฝ์šฐ, ๊ธ์ •์ ์œผ๋กœ ํŒ๋‹จ๋˜๋Š” ๊ฐ์ • ๋ถ„๋ฅ˜ ํด๋ž˜์Šค๊ฐ€ ํ•˜๋‚˜ ๋ฐ–์— ์—†์œผ๋ฏ€๋กœ, ์ง€๋‚œ 30์ผ๋™์•ˆ ์“ด ์ผ๊ธฐ์˜ ๊ฐ ๋‚ ์งœ ํ–‰๋ณต ํผ์„ผํŠธ๋ฅผ ์‚ฐ์ˆ  ํ‰๊ท  ๋‚ธ ๊ฐ’์„ ํ™œ์šฉํ•  ๊ฒƒ์ด๋‹ค.

์šฐ์šธ๋„์˜ ๊ฒฝ์šฐ, ๋จผ์ € ๋ถ€์ •์ ์œผ๋กœ ํŒ๋‹จ๋œ ๊ฐ์ • ํผ์„ผํŠธ๋ฅผ ํ•ฉ์น˜๊ณ  ์ถ”๊ฐ€์ ์œผ๋กœ ์šฐ์šธํ•จ์ด ๋ฌป์–ด๋‚˜๋Š” ๋ฌธ์žฅ์— ๋Œ€ํ•ด์„œ๋Š” ๊ฐ€์ค‘์น˜๋ฅผ ๋”ํ•ด ์‚ฐ์ˆ  ํ‰๊ท ์„ ๋‚ธ ๊ฐ’์„ ํ™œ์šฉํ•  ๊ฒƒ์ด๋‹ค.

 

์ด๋„ ์‰ฝ๊ฒŒ ์˜ˆ์‹œ๋ฅผ ๋“ค์–ด๋ณด์•˜๋‹ค.

์ง€๋‚œ 30์ผ ๋™์•ˆ 5๋ฒˆ์˜ ์ผ๊ธฐ๋ฅผ ์“ด ์‚ฌ์šฉ์ž๊ฐ€ ์žˆ๋‹ค.

ํ–‰๋ณต์ด ๊ฐ๊ฐ์˜ ์ผ๊ธฐ์—์„œ 90%, 50%, 10%, 60%, 20% ์ˆ˜์น˜๋กœ ๋‚˜ํƒ€๋‚ฌ์„ ๋•Œ, 

the result is (0.9 + 0.5 + 0.1 + 0.6 + 0.2) / 5 = 0.46.

์ด๋ ‡๊ฒŒ ๊ณ„์‚ฐ๋œ ์ˆ˜์น˜๋ฅผ ๊ธ‰๊ฐ„์œผ๋กœ ๋‚˜๋ˆ„์–ด ํ–‰๋ณต๋„๋กœ ๋ณด์—ฌ์ค„ ๊ฒƒ์ด๋‹ค.

๋ถ€์ •์  ๊ฐ์ •์€ ๊ฐ๊ฐ์˜ ์ผ๊ธฐ์—์„œ ๋ฐ˜๋Œ€๋กœ 10%, 50%, 90%, 40%, 80% ์ˆ˜์น˜๋กœ ๋‚˜ํƒ€๋‚ฌ๊ณ , ์ถ”๊ฐ€์ ์œผ๋กœ 2๋ฒˆ์งธ์˜ ์ผ๊ธฐ์—์„œ "๋„ˆ๋ฌด ์šฐ์šธํ•˜๋‹ค"์™€ ๊ฐ™์ด ์ง๊ด€์ ์œผ๋กœ "์šฐ์šธ"์ด๋ผ ํŒ๋‹จ๋œ ๋ฌธ์žฅ์ด ์žˆ์—ˆ๋‹ค๋ฉด ๊ฐ ๋ฌธ์žฅ ๋‹น ํ•ด๋‹น ๋‚ ์˜ ๋ถ€์ •์  ํผ์„ผํŠธ์—์„œ 5%~10%์˜ ๊ฐ€์ค‘์น˜๋ฅผ ๋”ํ•ด์ค„ ๊ฒƒ์ด๋‹ค. ์šฐ์šธ๋กœ ํŒ๋‹จ๋œ ๋ฌธ์žฅ์ด ๋งŽ์•„๋„ ๊ฐ ๋‚ ์งœ์˜ ๋ถ€์ •์  ์ˆ˜์น˜๋Š” ์ตœ๋Œ€ 100%๋กœ ์ œํ•œ์„ ๋‘”๋‹ค.

๊ณ„์‚ฐ ํ•ด๋ณด๋ฉด, 2๋ฒˆ์งธ ์ผ๊ธฐ์—์„œ ์šฐ์šธ๊ณผ ์ง์ ‘์ ์œผ๋กœ ์—ฐ๊ด€๋œ ๋ฌธ์žฅ์ด 3๋ฒˆ ๋‚˜ํƒ€๋‚ฌ์„ ๋•Œ

(0.1+(0.5+0.1*3)+0.9+0.4+0.8)/5 = 0.6์œผ๋กœ ๊ณ„์‚ฐ์ด ๋œ๋‹ค. 

์ด๋ ‡๊ฒŒ ๊ณ„์‚ฐ๋œ ์ˆ˜์น˜๋ฅผ ๊ธ‰๊ฐ„์œผ๋กœ ๋‚˜๋ˆ„์–ด ์šฐ์šธ๋„๋กœ ๋ณด์—ฌ์ค„ ๊ฒƒ์ด๋‹ค.

6. ์•ž์œผ๋กœ์˜ ๊ธฐ๋Œ€

์šฐ๋ฆฌ ํŒ€์ด ๊ตฌ์ƒํ•˜๊ณ  ์žˆ๋Š” ์„œ๋น„์Šค๋Š” ์ผ๊ธฐ ์„œ๋น„์Šค์ด๋ฏ€๋กœ ์ผ๊ธฐ์— ์•Œ๋งž๋Š” ๊ฐ์ •์„ ๋ถ„๋ฅ˜ํ•˜๊ณ ์ž ํ•œ๋‹ค.

"๊ธฐ์จ/ํ‰์˜จ/์Šฌํ””/๋ถ„๋…ธ/๋ถˆ์•ˆ/ํ”ผ๊ณค" 6๊ฐ€์ง€๋กœ ๊ฐ์ •์„ ์žฌ๋ถ„๋ฅ˜ํ•˜์—ฌ ํ˜„์žฌ ์ด ํฌ์ŠคํŒ…์—์„œ ๋Œ๋ ค๋ณธ ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ๊ฐ์ •์„ ์ง์ ‘ ์žฌ๋ถ„๋ฅ˜ํ•˜์—ฌ fine tuning ํ•˜๊ณ  ์žˆ๋‹ค.

๊ฒฐ๊ตญ ์ฃผ์–ด์ง„ ๋ฐ์ดํ„ฐ์…‹์ด ๋ชจ๋ธ์˜ ์ •ํ™•๋„๋ฅผ ํŒ๊ฐ€๋ฆ„ํ•œ๋‹ค. ์ง์ ‘ ๋ฐ์ดํ„ฐ์…‹์„ ์žฌ๋ถ„๋ฅ˜ํ•˜๊ณ  ์žˆ์œผ๋ฏ€๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ์กฐ๊ธˆ์”ฉ ์ •์ œํ•˜์—ฌ ์ •ํ™•๋„๊ฐ€ ๋†’์•„์ง„ ๋ชจ์Šต์„ ๋ณด์—ฌ๋“œ๋ฆฌ๊ณ ์ž ํ•œ๋‹ค.

๋˜ํ•œ, KcBERT ๋“ฑ ํŒŒ์ƒ๋œ ๋ชจ๋ธ๋“ค์ด ๊ณ„์† ์ƒ๊ฒจ๋‚˜๊ณ  ์žˆ๋Š”๋ฐ, ๋” ๋‹ค์–‘ํ•œ ๋ชจ๋ธ์— ๋Œ€ํ•ด์„œ ์ •ํ™•๋„ ๊ฒ€์ฆ์„ ํ•ด๋ณผ ์˜ˆ์ •์ด๋‹ค.