[Bit Education Center][AI] Day 8: Image Analysis with PyTorch, Natural Language Processing, EDA, Tokenizing, and Korean Modeling

달의요정루나 2023. 8. 11. 09:03

1. Image analysis with PyTorch

1. Importing modules

import torch
import numpy as np
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from torchvision.transforms import transforms, ToTensor
from torchvision.datasets import ImageFolder
from tqdm.notebook import tqdm

2. Selecting CPU or GPU

device = 'cuda' if torch.cuda.is_available() else 'cpu'  # use the GPU when one is available

3. Defining image transforms and building the training and test sets

batch_size = 20

# Transforms applied to every training image
train_transforms = transforms.Compose(
    [
        transforms.CenterCrop(224),  # crop the central 224 x 224 region
        transforms.ToTensor(),  # convert the image to a tensor
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ]
)

test_transforms = transforms.Compose(
    [
        transforms.Resize(255),  # resize so the shorter side is 255 pixels
        transforms.CenterCrop(224),  # crop the central 224 x 224 region
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ]
)

# Load the training and test sets
trainset = ImageFolder(root='./data/dogs_vs_cats/train', transform=train_transforms)
trainloader = DataLoader(trainset, batch_size=batch_size, shuffle=True,
                         drop_last=True)

# The test set is not shuffled
testset = ImageFolder(root='./data/dogs_vs_cats/test', transform=test_transforms)
testloader = DataLoader(testset, batch_size=batch_size, drop_last=True)

print(len(trainset))
print(len(testset))

--> Output

20000
5000

4. Printing the class names and class indices

print(trainset.classes)
print(trainset.class_to_idx)
print(testset.classes)
print(testset.class_to_idx)

--> Output

['cats', 'dogs']
{'cats': 0, 'dogs': 1}
['cats', 'dogs']
{'cats': 0, 'dogs': 1}

5. Designing the deep learning model

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()

        # Input image: 224 x 224 x 3

        # 224 x 224 x 3 --> 112 x 112 x 32 after max pooling
        # Arguments: in_channels, out_channels, kernel_size, padding.
        # With a 3 x 3 kernel and stride 1, padding=1 keeps the output the same size as the input.
        self.conv1 = nn.Conv2d(3, 32, 3, padding=1)
        # 112 x 112 x 32 --> 56 x 56 x 64 after max pooling
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)
        # 56 x 56 x 64 --> 28 x 28 x 128 after max pooling
        self.conv3 = nn.Conv2d(64, 128, 3, padding=1)
        # 28 x 28 x 128 --> 14 x 14 x 128 after max pooling
        self.conv4 = nn.Conv2d(128, 128, 3, padding=1)

        # 2 x 2 max pooling
        self.pool = nn.MaxPool2d(2, 2)

        # Flatten 14 x 14 x 128 into a vector and map it to 512 units
        self.fc1 = nn.Linear(128 * 14 * 14, 512)
        # Two output classes
        self.fc2 = nn.Linear(512, 2)

        # Dropout (try 0.25 as well as 0.5 and compare the saved results)
        self.dropout = nn.Dropout(0.5)

    def forward(self, x):
        # conv1, then ReLU, then max pooling: 112 x 112 x 32
        x = self.pool(torch.relu(self.conv1(x)))
        # conv2, then ReLU, then max pooling: 56 x 56 x 64
        x = self.pool(torch.relu(self.conv2(x)))
        # conv3, then ReLU, then max pooling: 28 x 28 x 128
        x = self.pool(torch.relu(self.conv3(x)))
        # conv4, then ReLU, then max pooling: 14 x 14 x 128
        x = self.pool(torch.relu(self.conv4(x)))

        # Flatten the feature maps
        x = x.view(-1, 128 * 14 * 14)
        # Apply dropout
        x = self.dropout(x)
        # Fully connected layer followed by ReLU
        x = torch.relu(self.fc1(x))
        # Apply dropout
        x = self.dropout(x)

        x = self.fc2(x)
        return x
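
The shape comments in the model can be checked with the standard convolution output-size formula, out = floor((n + 2p - k) / s) + 1. The following is a minimal sketch in plain Python. The helper names conv_out and pool_out are made up for this illustration, and a convolution stride of 1 (the nn.Conv2d default) is assumed.

```python
def conv_out(n, kernel=3, padding=1, stride=1):
    """Output width/height of a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1


def pool_out(n, kernel=2, stride=2):
    """Output width/height of max pooling without padding."""
    return (n - kernel) // stride + 1


size = 224
sizes = []
for out_channels in (32, 64, 128, 128):  # conv1 .. conv4
    size = pool_out(conv_out(size))      # the conv keeps the size, pooling halves it
    sizes.append((size, out_channels))

# Number of input features of fc1: height * width * channels after conv4 + pool
flat_features = sizes[-1][0] ** 2 * sizes[-1][1]

print(sizes)          # spatial size and channel count after each conv + pool stage
print(flat_features)  # should equal 128 * 14 * 14
```

Running the four stages takes the 224-pixel input down to 14 pixels, which is why fc1 expects 128 * 14 * 14 inputs.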

6. Printing the number of parameters

model = Net().to(device)

def count_parameters(model):
    total_param = 0
    for name, param in model.named_parameters():
        if param.requires_grad:
            num_param = np.prod(param.size())
            if param.dim() > 1:
                print(name, ':', ' x '.join(str(x) for x in list(param.size())[::-1]), '=', num_param)
            else:
                print(name, ':', num_param)
                print('-' * 40)
            total_param += num_param
    print('total:', total_param)

count_parameters(model)

--> Output

conv1.weight : 3 x 3 x 3 x 32 = 864
conv1.bias : 32
----------------------------------------
conv2.weight : 3 x 3 x 32 x 64 = 18432
conv2.bias : 64
----------------------------------------
conv3.weight : 3 x 3 x 64 x 128 = 73728
conv3.bias : 128
----------------------------------------
conv4.weight : 3 x 3 x 128 x 128 = 147456
conv4.bias : 128
----------------------------------------
fc1.weight : 25088 x 512 = 12845056
fc1.bias : 512
----------------------------------------
fc2.weight : 512 x 2 = 1024
fc2.bias : 2
----------------------------------------
total: 13087426
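
Each count above follows directly from the layer shapes: a convolution has kernel x kernel x in_channels x out_channels weights plus one bias per output channel, and a linear layer has in_features x out_features weights plus one bias per output. A small sketch that reproduces the total by hand (conv_params and linear_params are hypothetical helper names, not PyTorch functions):

```python
def conv_params(in_channels, out_channels, kernel=3):
    """Weights (k * k * in * out) plus one bias per output channel."""
    return kernel * kernel * in_channels * out_channels + out_channels


def linear_params(in_features, out_features):
    """Weights (in * out) plus one bias per output unit."""
    return in_features * out_features + out_features


layer_params = {
    'conv1': conv_params(3, 32),
    'conv2': conv_params(32, 64),
    'conv3': conv_params(64, 128),
    'conv4': conv_params(128, 128),
    'fc1': linear_params(128 * 14 * 14, 512),
    'fc2': linear_params(512, 2),
}
total = sum(layer_params.values())

for name, count in layer_params.items():
    print(name, count)
print('total:', total)
```

Almost all of the parameters sit in fc1, which is typical when a large feature map is flattened straight into a dense layer.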

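7. Training the model

optimizer = optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss().to(device)

n_epochs = 20
total_batch = len(trainloader)

model.train()
for epoch in tqdm(range(n_epochs)):
    train_loss = 0.0
    for data, label in trainloader:
        optimizer.zero_grad()
        out = model(data.to(device))
        loss = loss_fn(out, label.to(device))
        loss.backward()
        optimizer.step()

        # .item() extracts a Python float so the computation graph is not kept alive
        train_loss += loss.item() / total_batch

    # Note: train_loss is already the mean batch loss at this point. Dividing again by
    # the number of samples (20000) is why the printed values are so small;
    # remove this line to report the mean batch loss itself.
    train_loss = train_loss / len(trainloader.sampler)
    # Print the training loss for this epoch
    print('Epoch: {} \tTraining Loss: {:.6f}'.format(epoch+1, float(train_loss)))

--> Output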

100%

20/20 [5:37:57<00:00, 977.72s/it]

Epoch: 1 	Training Loss: 0.000034
Epoch: 2 	Training Loss: 0.000032
Epoch: 3 	Training Loss: 0.000030
Epoch: 4 	Training Loss: 0.000029
Epoch: 5 	Training Loss: 0.000027
Epoch: 6 	Training Loss: 0.000026
Epoch: 7 	Training Loss: 0.000024
Epoch: 8 	Training Loss: 0.000023
Epoch: 9 	Training Loss: 0.000022
Epoch: 10 	Training Loss: 0.000021
Epoch: 11 	Training Loss: 0.000020
Epoch: 12 	Training Loss: 0.000019
Epoch: 13 	Training Loss: 0.000018
Epoch: 14 	Training Loss: 0.000016
Epoch: 15 	Training Loss: 0.000015
Epoch: 16 	Training Loss: 0.000014
Epoch: 17 	Training Loss: 0.000013
Epoch: 18 	Training Loss: 0.000012
Epoch: 19 	Training Loss: 0.000011
Epoch: 20 	Training Loss: 0.000010
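
The printed losses look tiny because the loop divides the summed batch losses by the number of batches and then once more by the number of samples. As a rough check, assuming the 20000 training images reported earlier, multiplying the printed values back by the sample count recovers the mean batch loss. For epoch 1 this lands close to ln 2, the cross-entropy of a two-class model that is guessing at random.

```python
import math

n_samples = 20000           # len(trainset) printed earlier
printed_first = 0.000034    # epoch 1 value from the log
printed_last = 0.000010     # epoch 20 value from the log

# Undo the second division to get back the mean loss per batch
mean_loss_first = printed_first * n_samples
mean_loss_last = printed_last * n_samples

# Cross-entropy of a uniform guess over two classes
chance_level = math.log(2)

print(round(mean_loss_first, 2), round(mean_loss_last, 2), round(chance_level, 4))
```

Because the log only shows six decimal places, the recovered values are approximate.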

8. Evaluating and printing accuracy on the test data

[1] Model evaluation

classes = ['cat', 'dog']
class_correct = [0.0, 0.0]
class_total = [0.0, 0.0]

model.eval()  # evaluation mode: dropout is disabled
with torch.no_grad():
    for data, label in tqdm(testloader):
        output = model(data.to(device))
        # torch.max returns (values, indices); `_` discards the values
        _, pred_index = torch.max(output, 1)
        # Compare the predictions with the labels
        correct_tensor = pred_index.eq(label.to(device).view_as(pred_index))
        # Move the result to the CPU (a no-op if it is already there) and convert it to NumPy
        correct = np.squeeze(correct_tensor.cpu().numpy())
        # Count correct predictions per class
        for i in range(batch_size):  # every batch is full because drop_last=True
            target = label[i]  # tensor(0) or tensor(1)
            class_correct[target.item()] += correct[i]
            class_total[target.item()] += 1


for i in range(2):
    # Print the accuracy of each class
    if class_total[i] > 0:
        print('Test Accuracy of %5s: %2d%% (%2d/%2d)' % (
            classes[i], 100 * class_correct[i] / class_total[i],
            np.sum(class_correct[i]), np.sum(class_total[i])))
    else:
        print('Test Accuracy of %5s: N/A (no test examples)' % (classes[i]))

# Print the overall accuracy
print('\nTest Accuracy (Overall): %2d%% (%2d/%2d)' % (
    100. * np.sum(class_correct) / np.sum(class_total),
    np.sum(class_correct), np.sum(class_total)))

--> Output

100%

250/250 [02:27<00:00, 1.74it/s]

Test Accuracy of   cat: 72% (1802/2500)
Test Accuracy of   dog: 93% (2336/2500)

Test Accuracy (Overall): 82% (4138/5000)
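
The percentages above are formatted with %2d, which truncates the fractional part instead of rounding. A small sketch that recomputes the accuracies from the reported counts shows the unrounded values:

```python
# (correct, total) per class, taken from the output above
counts = {'cat': (1802, 2500), 'dog': (2336, 2500)}

per_class = {name: 100 * c / t for name, (c, t) in counts.items()}

total_correct = sum(c for c, _ in counts.values())
total_seen = sum(t for _, t in counts.values())
overall = 100 * total_correct / total_seen

for name, acc in per_class.items():
    # %.2f keeps two decimals, %2d drops the fractional part
    print('%5s: %.2f%% (shown as %2d%%)' % (name, acc, acc))
print('overall: %.2f%% (shown as %2d%%)' % (overall, overall))
```

The gap between the two classes suggests the model leans toward predicting "dog", which the overall figure alone would hide.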

[2] Inspecting the data

classes = ['cat', 'dog']
class_correct = [0.0, 0.0]
class_total = [0.0, 0.0]
bo = True  # print the intermediate values for the first batch only

model.eval()  # evaluation mode: dropout is disabled
with torch.no_grad():  # no gradients are needed for evaluation (no backpropagation)
    for data, label in tqdm(testloader):
        # tqdm displays a progress bar; testloader yields (data, label) batches
        output = model(data.to(device))
        # Move the batch to the CPU or GPU and run it through the model

        _, pred_index = torch.max(output, 1)  # dim=1: maximum over the class dimension
        # `_` discards the maximum values;
        # pred_index holds the index of the largest output for each sample

        correct_tensor = pred_index.eq(label.to(device).view_as(pred_index))
        # Compare pred_index with the labels element-wise

        correct = np.squeeze(correct_tensor.cpu().numpy())
        # Convert to NumPy on the CPU
        # squeeze removes any extra dimension of size 1, leaving a 1-D array

        for i in range(batch_size):  # process the predictions of one batch
            target = label[i]  # tensor(0): cat, tensor(1): dog
            class_correct[target.item()] += correct[i]
            class_total[target.item()] += 1
            # Count correct predictions per class

        if bo:
            pred_va, pred_index = torch.max(output, 1)
            print("class_correct", class_correct)  # correct predictions per class so far
            print("class_total", class_total)  # samples seen per class so far
            print("pred_index", pred_index)  # 0 is cat, 1 is dog
            print("pred_va", pred_va)  # largest logit per sample (not a probability)
            print("output,", output)  # raw logits, one column per class
            print("correct", correct)  # whether each prediction was right
            print(label.to(device).view_as(pred_index))  # target labels
            print(pred_index.eq(label.to(device).view_as(pred_index)))
            print(correct_tensor)  # same comparison as the line above
            bo = False
            print()
for i in range(2):  # number of classes (cat and dog)
    # Print the accuracy of each class
    if class_total[i] > 0:
        # Compute and print the accuracy only if the class has samples
        print('Test Accuracy of %5s: %2d%% (%2d/%2d)' % (
            classes[i],  # class name, cat or dog
            100 * class_correct[i] / class_total[i],  # accuracy of this class
            np.sum(class_correct[i]), np.sum(class_total[i])))
            # number of correct predictions and total samples of this class
    else:
        print('Test Accuracy of %5s: N/A (no test examples)' % (classes[i]))
        # printed when the class has no samples

print('\nTest Accuracy (Overall): %2d%% (%2d/%2d)' % (
    100. * np.sum(class_correct) / np.sum(class_total),
    np.sum(class_correct), np.sum(class_total)))
# Print the overall accuracy

--> Output

class_correct [15.0, 0.0] # correct predictions per class after the first batch
class_total [20.0, 0.0] # samples seen per class (the first batch holds 20 cat images)
pred_index tensor([0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1])
# 0 is cat, 1 is dog
pred_va tensor([0.6614, 1.3291, 1.7660, 1.3794, 2.4336, 0.5707, 4.5740, 1.8294, 0.6008,
        3.5921, 0.4357, 1.4350, 4.6368, 2.7956, 0.7386, 1.7921, 3.0626, 1.8448,
        0.9866, 0.1330])
# largest logit per sample (a raw score, not a probability)
output, tensor([[ 0.6614, -0.5788],
        [ 1.3291, -1.2220],
        [ 1.7660, -1.6747],
        [-1.1525,  1.3794],
        [ 2.4336, -2.2640],
        [ 0.5707, -0.4844],
        [ 4.5740, -4.5051],
        [ 1.8294, -1.7253],
        [ 0.6008, -0.5912],
        [ 3.5921, -3.4313],
        [-0.3558,  0.4357],
        [-1.4362,  1.4350],
        [ 4.6368, -4.4851],
        [-2.7062,  2.7956],
        [ 0.7386, -0.6513],
        [ 1.7921, -1.7660],
        [ 3.0626, -2.9450],
        [ 1.8448, -1.8645],
        [ 0.9866, -1.0891],
        [-0.1589,  0.1330]])
# raw logits for each sample: column 0 is cat, column 1 is dog; torch.max(output, 1) takes the larger value in each row
correct [ True  True  True False  True  True  True  True  True  True False False
  True False  True  True  True  True  True False]
# whether each prediction was correct
tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
tensor([ True,  True,  True, False,  True,  True,  True,  True,  True,  True,
        False, False,  True, False,  True,  True,  True,  True,  True, False])
tensor([ True,  True,  True, False,  True,  True,  True,  True,  True,  True,
        False, False,  True, False,  True,  True,  True,  True,  True, False])
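# first tensor: the target labels (all 0 = cat); the next two: element-wise comparison of predictions and targets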
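
The model outputs shown above are raw scores (logits), not probabilities. If probabilities are wanted, softmax can be applied to each row; in PyTorch that would be torch.softmax(output, dim=1). The sketch below does the same thing in plain Python for the first row of the output, purely as an illustration.

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


first_row = [0.6614, -0.5788]   # logits for (cat, dog) from the first sample above
probs = softmax(first_row)
print(probs)                    # probability of cat, probability of dog
```

The predicted class is the same either way, because softmax preserves the ordering of the logits.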

2. Natural language processing

1. Importing modules

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras import layers, models
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.text import text_to_word_sequence
from numpy import array

2. Tokenizing

text = '해보지 않으면 해낼 수 없다'  # "You cannot achieve it unless you try"

result = text_to_word_sequence(text)
print("\nOriginal text:\n", text)
print("\nTokens:\n", result)

--> Output

Original text:
 해보지 않으면 해낼 수 없다

Tokens:
 ['해보지', '않으면', '해낼', '수', '없다']

3. Counting

[1] Counting words

docs = ['먼저 텍스트의 각 단어를 나누어 토큰화 합니다.',
       '텍스트의 단어로 토큰화해야 딥러닝에서 인식됩니다.',
       '토큰화한 결과는 딥러닝에서 사용할 수 있습니다.',
       ]

token = Tokenizer()
token.fit_on_texts(docs)

print("\nWord counts:\n", token.word_counts)  # how many times each word occurs

--> Output

Word counts:
 OrderedDict([('먼저', 1), ('텍스트의', 2), ('각', 1), ('단어를', 1), ('나누어', 1), ('토큰화', 1), ('합니다', 1), ('단어로', 1), ('토큰화해야', 1), ('딥러닝에서', 2), ('인식됩니다', 1), ('토큰화한', 1), ('결과는', 1), ('사용할', 1), ('수', 1), ('있습니다', 1)])

[2] Counting sentences

print("\nSentence count: ", token.document_count)

--> Output

Sentence count:  3

[3] Counting how many sentences contain each word

print("\nNumber of sentences containing each word:\n", token.word_docs)

--> Output

Number of sentences containing each word:
 defaultdict(<class 'int'>, {'먼저': 1, '합니다': 1, '토큰화': 1, '텍스트의': 2, '나누어': 1, '각': 1, '단어를': 1, '딥러닝에서': 2, '인식됩니다': 1, '토큰화해야': 1, '단어로': 1, '사용할': 1, '있습니다': 1, '토큰화한': 1, '결과는': 1, '수': 1})

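[4] Printing the word indices

print("\nIndex assigned to each word:\n",  token.word_index)
# Indices start at 1 because 0 is reserved.
# Index 0 is used for padding; words not seen during fitting are dropped
# unless an oov_token is set on the Tokenizer.

--> Output

Index assigned to each word: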
 {'텍스트의': 1, '딥러닝에서': 2, '먼저': 3, '각': 4, '단어를': 5, '나누어': 6, '토큰화': 7, '합니다': 8, '단어로': 9, '토큰화해야': 10, '인식됩니다': 11, '토큰화한': 12, '결과는': 13, '사용할': 14, '수': 15, '있습니다': 16}
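
The ordering of these indices reflects word frequency: more frequent words get smaller indices, and ties keep the order in which the words first appeared. The following is a simplified pure-Python sketch of that idea, not the Keras implementation. It assumes that stripping periods and splitting on spaces is enough for these three sentences, and simple_tokens is a hypothetical helper.

```python
from collections import Counter

docs = ['먼저 텍스트의 각 단어를 나누어 토큰화 합니다.',
        '텍스트의 단어로 토큰화해야 딥러닝에서 인식됩니다.',
        '토큰화한 결과는 딥러닝에서 사용할 수 있습니다.']


def simple_tokens(sentence):
    # Assumption: removing '.' and splitting on whitespace is sufficient here
    return sentence.replace('.', '').split()


counts = Counter()
for doc in docs:
    counts.update(simple_tokens(doc))

# sorted() is stable, so words with equal counts keep their first-seen order
ranked = sorted(counts.items(), key=lambda item: -item[1])

# Start at 1 because index 0 is reserved for padding
word_index = {word: i + 1 for i, (word, _) in enumerate(ranked)}
print(word_index)
```

For these sentences the sketch yields the same mapping as the Tokenizer output above, with the two words that occur twice at indices 1 and 2.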

4. One-hot encoding of words

[1] Printing the index of each word

text="오랫동안 꿈꾸는 이는 그 꿈을 닮아간다"
token = Tokenizer()
token.fit_on_texts([text])
print(token.word_index)
# Tokenize the text into words and print the index assigned to each word.

--> Output

{'오랫동안': 1, '꿈꾸는': 2, '이는': 3, '그': 4, '꿈을': 5, '닮아간다': 6}

[2] A new array filled only with token indices

x=token.texts_to_sequences([text])
print(x)

--> Output

[[1, 2, 3, 4, 5, 6]]

[3] Building the one-hot array

# Word indices start at 1 and index 0 is reserved, so each one-hot vector
# needs len(word_index) + 1 positions.
word_size = len(token.word_index) + 1
x = to_categorical(x, num_classes=word_size)
print(x)

--> Output

[[[0. 1. 0. 0. 0. 0. 0.]
  [0. 0. 1. 0. 0. 0. 0.]
  [0. 0. 0. 1. 0. 0. 0.]
  [0. 0. 0. 0. 1. 0. 0.]
  [0. 0. 0. 0. 0. 1. 0.]
  [0. 0. 0. 0. 0. 0. 1.]]]
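
What to_categorical does here can be written out by hand: each index becomes a vector of zeros with a single 1 at that index. A minimal pure-Python sketch (one_hot is a hypothetical helper, not a Keras function):

```python
def one_hot(indices, num_classes):
    """Return one row per index, with a 1.0 at that index and 0.0 elsewhere."""
    rows = []
    for idx in indices:
        row = [0.0] * num_classes
        row[idx] = 1.0
        rows.append(row)
    return rows


sequence = [1, 2, 3, 4, 5, 6]      # token indices of the sentence
word_size = len(sequence) + 1      # 6 words plus the reserved index 0
encoded = one_hot(sequence, word_size)

for row in encoded:
    print(row)
```

Column 0 stays zero in every row because no word is ever assigned index 0.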

5. Reading text and predicting positive or negative sentiment

[1] Defining the text

# Sample movie reviews (five positive, then five negative).
docs = ["너무 재밌네요","최고예요","참 잘 만든 영화예요","추천하고 싶은 영화입니다","한번 더 보고싶네요","글쎄요","별로예요","생각보다 지루하네요","연기가 어색해요","재미없어요"]

# Class labels: 1 for a positive review, 0 for a negative review.
classes = array([1,1,1,1,1,0,0,0,0,0])

# Tokenize
token = Tokenizer()
token.fit_on_texts(docs)
print(token.word_index)

--> Output

{'너무': 1, '재밌네요': 2, '최고예요': 3, '참': 4, '잘': 5, '만든': 6, '영화예요': 7, '추천하고': 8, '싶은': 9, '영화입니다': 10, '한번': 11, '더': 12, '보고싶네요': 13, '글쎄요': 14, '별로예요': 15, '생각보다': 16, '지루하네요': 17, '연기가': 18, '어색해요': 19, '재미없어요': 20}

[2] Tokenization result

x = token.texts_to_sequences(docs)
print("\nReview texts as index sequences:\n",  x)

--> Output

Review texts as index sequences:
 [[1, 2], [3], [4, 5, 6, 7], [8, 9, 10], [11, 12, 13], [14], [15], [16, 17], [18, 19], [20]]

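[3] Padding

padded_x = pad_sequences(x, maxlen=4)
print("\nPadding result:\n", padded_x)
# Padding makes every sequence the same length; by default zeros are added at the front.

--> Output

Padding result: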
 [[ 0  0  1  2]
 [ 0  0  0  3]
 [ 4  5  6  7]
 [ 0  8  9 10]
 [ 0 11 12 13]
 [ 0  0  0 14]
 [ 0  0  0 15]
 [ 0  0 16 17]
 [ 0  0 18 19]
 [ 0  0  0 20]]
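
The padded array above uses the pad_sequences defaults: zeros are added at the front, and sequences longer than maxlen would lose their leading elements. A pure-Python sketch of that behavior, for illustration only (pad_pre is a hypothetical helper):

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value`; keep only the last `maxlen` items if too long."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]                                  # drop leading items if too long
        padded.append([value] * (maxlen - len(seq)) + seq)   # fill the front with zeros
    return padded


x = [[1, 2], [3], [4, 5, 6, 7], [8, 9, 10], [11, 12, 13],
     [14], [15], [16, 17], [18, 19], [20]]
padded_x = pad_pre(x, maxlen=4)

for row in padded_x:
    print(row)
```

Padding at the front keeps the real tokens next to the end of the sequence, which is the Keras default; padding='post' would place the zeros at the end instead.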

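[4] Building the deep learning model

# Number of words passed to the embedding layer (vocabulary size + 1 for index 0).
word_size = len(token.word_index) +1
print(len(token.word_index))
# Build a model that includes a word embedding and print its summary.
# The Embedding layer is a trainable lookup table learned together with the rest
# of the model; it is not a pretrained word2vec model.
model = models.Sequential()
model.add(layers.Embedding(word_size, 8, input_length=4))
model.add(layers.Flatten())
model.add(layers.Dense(1, activation='sigmoid'))
model.summary()
# Each of the 21 possible indices (20 words + index 0) is mapped to an 8-dimensional vector.
# Embedding parameters: (20 + 1) * 8 = 168

--> Output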

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding (Embedding)       (None, 4, 8)              168       
                                                                 
 flatten (Flatten)           (None, 32)                0         
                                                                 
 dense (Dense)               (None, 1)                 33        
                                                                 
=================================================================
Total params: 201
Trainable params: 201
Non-trainable params: 0
_________________________________________________________________
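
The parameter counts in the summary can be reproduced by hand. A short arithmetic sketch, using the vocabulary size, embedding dimension, and input length from the model definition:

```python
vocab_size = 20                  # number of distinct words in the reviews
word_size = vocab_size + 1       # plus the reserved index 0
embedding_dim = 8
input_length = 4

# Embedding: one 8-dimensional vector per possible index, no bias
embedding_params = word_size * embedding_dim

# Flatten: 4 positions * 8 dimensions, no parameters
flatten_units = input_length * embedding_dim

# Dense(1): one weight per flattened unit plus one bias
dense_params = flatten_units * 1 + 1

total_params = embedding_params + dense_params
print(embedding_params, flatten_units, dense_params, total_params)
```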

[5] Compiling and training the model

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(padded_x, classes, epochs=20)
print("\n Accuracy: %.4f" % (model.evaluate(padded_x, classes)[1]))

--> Output

Epoch 1/20
1/1 [==============================] - 1s 678ms/step - loss: 0.6870 - accuracy: 0.6000
Epoch 2/20
1/1 [==============================] - 0s 8ms/step - loss: 0.6851 - accuracy: 0.6000
Epoch 3/20
1/1 [==============================] - 0s 5ms/step - loss: 0.6832 - accuracy: 0.6000
Epoch 4/20
1/1 [==============================] - 0s 6ms/step - loss: 0.6814 - accuracy: 0.6000
Epoch 5/20
1/1 [==============================] - 0s 5ms/step - loss: 0.6795 - accuracy: 0.6000
Epoch 6/20
1/1 [==============================] - 0s 6ms/step - loss: 0.6776 - accuracy: 0.6000
Epoch 7/20
1/1 [==============================] - 0s 5ms/step - loss: 0.6758 - accuracy: 0.6000
Epoch 8/20
1/1 [==============================] - 0s 5ms/step - loss: 0.6739 - accuracy: 0.6000
Epoch 9/20
1/1 [==============================] - 0s 6ms/step - loss: 0.6720 - accuracy: 0.7000
Epoch 10/20
1/1 [==============================] - 0s 4ms/step - loss: 0.6701 - accuracy: 0.8000
Epoch 11/20
1/1 [==============================] - 0s 6ms/step - loss: 0.6683 - accuracy: 0.8000
Epoch 12/20
1/1 [==============================] - 0s 4ms/step - loss: 0.6664 - accuracy: 0.8000
Epoch 13/20
1/1 [==============================] - 0s 4ms/step - loss: 0.6645 - accuracy: 0.8000
Epoch 14/20
1/1 [==============================] - 0s 3ms/step - loss: 0.6626 - accuracy: 0.9000
Epoch 15/20
1/1 [==============================] - 0s 6ms/step - loss: 0.6607 - accuracy: 1.0000
Epoch 16/20
1/1 [==============================] - 0s 3ms/step - loss: 0.6588 - accuracy: 1.0000
Epoch 17/20
1/1 [==============================] - 0s 4ms/step - loss: 0.6569 - accuracy: 1.0000
Epoch 18/20
1/1 [==============================] - 0s 5ms/step - loss: 0.6549 - accuracy: 1.0000
Epoch 19/20
1/1 [==============================] - 0s 5ms/step - loss: 0.6530 - accuracy: 1.0000
Epoch 20/20
1/1 [==============================] - 0s 4ms/step - loss: 0.6510 - accuracy: 1.0000
1/1 [==============================] - 0s 187ms/step - loss: 0.6491 - accuracy: 1.0000

 Accuracy: 1.0000

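3. Tokenizing

- Splitting the text to be processed (the corpus) into units of a chosen granularity
- Tokenizing can be done at the word, morpheme, or sentence level

## English tokenizing
- NLTK (Natural Language Toolkit) and spaCy are the two most widely used libraries
## 1. NLTK
- A Python library widely used for preprocessing English text
- Ships with more than 50 corpus resources for analyzing English text
- Its intuitive functions make text preprocessing quick

1. Downloading the NLTK data

import nltk

nltk.download('all-corpora')
nltk.download('punkt')

--> Output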

[nltk_data] Downloading collection 'all-corpora'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\BIT\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\abc.zip.
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     C:\Users\BIT\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\alpino.zip.
[nltk_data]    | Downloading package bcp47 to
[nltk_data]    |     C:\Users\BIT\AppData\Roaming\nltk_data...
[nltk_data]    | Downloading package biocreative_ppi to

---

[nltk_data]    |     C:\Users\BIT\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\words.zip.
[nltk_data]    | Downloading package ycoe to
[nltk_data]    |     C:\Users\BIT\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\ycoe.zip.
[nltk_data]    | 
[nltk_data]  Done downloading collection all-corpora
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\BIT\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.

True

2. Importing modules and tokenizing

[1] Word tokenization

from nltk.tokenize import word_tokenize

sentence = "Natural language processing (NLP) is a subfield of computer science, \
information engineering, and artificial intelligence concerned with the interactions \
between computers and human (natural) languages, in particular how to program computers \
to process and analyze large amounts of natural language data."

print(word_tokenize(sentence))

--> Output

['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'subfield', 'of', 'computer', 'science', ',', 'information', 'engineering', ',', 'and', 'artificial', 'intelligence', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', '(', 'natural', ')', 'languages', ',', 'in', 'particular', 'how', 'to', 'program', 'computers', 'to', 'process', 'and', 'analyze', 'large', 'amounts', 'of', 'natural', 'language', 'data', '.']

[2] Sentence tokenization

from nltk.tokenize import sent_tokenize

paragraph = "Natural language processing (NLP) is a subfield of computer science, \
information engineering, and artificial intelligence concerned with the interactions \
between computers and human (natural) languages, in particular how to program computers \
to process and analyze large amounts of natural language data. Challenges in natural \
language processing frequently involve speech recognition, natural language \
understanding, and natural language generation."

print(sent_tokenize(paragraph))

--> Output

['Natural language processing (NLP) is a subfield of computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.', 'Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.']

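3. Stop words

[1] stopwords

from nltk.corpus import stopwords
# Stop words: words that are needed for grammar or habit of speech
# but contribute little to the meaning of a sentence
# e.g. the definite article "the"
stopwords.words('english')[:10]

--> Output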

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

[2] Number of stop words

print(len(stopwords.words('english')))

--> Output

[3] Removing stop words

from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize 

stop_words = set(stopwords.words('english')) 

word_tokens = word_tokenize(sentence)

result = []
for w in word_tokens: 
    if w not in stop_words: 
        result.append(w) 

print(word_tokens) 
print(result)  # the original sentence with stop words removed

--> Output

['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'subfield', 'of', 'computer', 'science', ',', 'information', 'engineering', ',', 'and', 'artificial', 'intelligence', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', '(', 'natural', ')', 'languages', ',', 'in', 'particular', 'how', 'to', 'program', 'computers', 'to', 'process', 'and', 'analyze', 'large', 'amounts', 'of', 'natural', 'language', 'data', '.']
['Natural', 'language', 'processing', '(', 'NLP', ')', 'subfield', 'computer', 'science', ',', 'information', 'engineering', ',', 'artificial', 'intelligence', 'concerned', 'interactions', 'computers', 'human', '(', 'natural', ')', 'languages', ',', 'particular', 'program', 'computers', 'process', 'analyze', 'large', 'amounts', 'natural', 'language', 'data', '.']
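
NLTK's English stop word list is all lowercase, so the membership test above is case-sensitive: a capitalized "The" at the start of a sentence would not be removed, and punctuation tokens are kept as well. The sketch below shows a case-insensitive filter that also drops punctuation. The stop word set and the token list are small hand-written examples for illustration, not the NLTK list or the sentence above.

```python
import string

# Illustrative subset only; the real list comes from stopwords.words('english')
stop_words = {'is', 'a', 'of', 'and', 'with', 'the', 'between', 'in', 'how', 'to'}
punctuation = set(string.punctuation)

# Hypothetical token list, as word_tokenize might produce it
tokens = ['The', 'interactions', 'between', 'computers', 'and', 'human',
          '(', 'natural', ')', 'languages', '.']

filtered = [
    t for t in tokens
    if t.lower() not in stop_words   # compare in lowercase so "The" is caught
    and t not in punctuation         # drop single punctuation tokens
]
print(filtered)
```

Whether punctuation should be removed depends on the task; it is dropped here only to show how the two filters combine.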

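4. spaCy

An open-source library designed for production (commercial) use
Provides natural language preprocessing modules for several languages including English (8 according to the course material)
Easy to install and fast at preprocessing

If loading en_core_web_sm raises an error, the model is most likely not installed; it can be downloaded by running python -m spacy download en_core_web_sm.

[1] Word tokenization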

import spacy

nlp = spacy.load('en_core_web_sm')

sentence = "Natural language processing (NLP) is a subfield of computer science, \
information engineering, and artificial intelligence concerned with the interactions \
between computers and human (natural) languages, in particular how to program computers \
to process and analyze large amounts of natural language data."

doc = nlp(sentence)

word_tokenized_sentence = [token.text for token in doc]
print(word_tokenized_sentence)

--> Output

['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'subfield', 'of', 'computer', 'science', ',', 'information', 'engineering', ',', 'and', 'artificial', 'intelligence', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', '(', 'natural', ')', 'languages', ',', 'in', 'particular', 'how', 'to', 'program', 'computers', 'to', 'process', 'and', 'analyze', 'large', 'amounts', 'of', 'natural', 'language', 'data', '.']

[2] Sentence tokenization

sentence_tokenized_list = [sent.text for sent in doc.sents]
print(sentence_tokenized_list)

--> Output

['Natural language processing (NLP) is a subfield of computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.']

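5. KoNLPy

An open-source library built for Korean natural language processing
It gives access to several morphological analyzers that were already developed and in use in Korea
Because it relies on analyzers written in Java, installing KoNLPy on Windows requires Java (1.7 or later)

Check the system variables under the environment variables to confirm that a JDK such as Amazon Corretto is installed.

[1] Korean tokenizing

Because of the characteristics of the Korean language, NLTK and spaCy are not well suited to it
Tasks that do not arise in English, such as morphological analysis and splitting syllables into phonemes, are hard to handle with them
This section looks at KoNLPy, which is widely used for Korean natural language processing

[2] Importing the library (install it first with pip install konlpy)

import konlpy

[3] Morpheme-level tokenizing

KoNLPy provides several morphological analyzers
Each analyzer is a class; create an instance and call its methods to tokenize

[4] Morphological analysis and part-of-speech tagging

A morpheme is the smallest unit that carries meaning
Morphological analyzers that KoNLPy exposes as classes:
a. Hannanum
b. Kkma
c. Komoran
d. Mecab
e. Okt(Twitter)
All of these classes provide the same morphological analysis functions
Their performance differs slightly
Mecab cannot be run on Windows through KoNLPy

[5] Stemming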

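from konlpy.tag import Hannanum
from konlpy.tag import Kkma
from konlpy.tag import Komoran
from konlpy.tag import Okt

okt = Okt()
# Okt also provides normalize(), which normalizes informal text such as sentences containing typos.
text = "한글 자연어 처리는 재밌다 이제부터 열심히 해야지ㅎㅎㅎ"  # "Korean NLP is fun, I will work hard from now on, haha"
print(okt.morphs(text))  # split into morphemes
print(okt.morphs(text, stem=True))  # split into morphemes, then reduce each to its stem

--> Output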

['한글', '자연어', '처리', '는', '재밌다', '이제', '부터', '열심히', '해야지', 'ㅎㅎㅎ']
['한글', '자연어', '처리', '는', '재밌다', '이제', '부터', '열심히', '하다', 'ㅎㅎㅎ']

[6] Extracting nouns

print(okt.nouns(text))  # extract nouns only
print(okt.phrases(text))  # extract phrases

--> Output

['한글', '자연어', '처리', '이제']
['한글', '한글 자연어', '한글 자연어 처리', '이제', '자연어', '처리']

[7] Part-of-speech tagging as a list

print(okt.pos(text))  # list of (morpheme, tag) tuples
print(okt.pos(text, join=True))  # list of "morpheme/tag" strings

--> Output

[('한글', 'Noun'), ('자연어', 'Noun'), ('처리', 'Noun'), ('는', 'Josa'), ('재밌다', 'Adjective'), ('이제', 'Noun'), ('부터', 'Josa'), ('열심히', 'Adverb'), ('해야지', 'Verb'), ('ㅎㅎㅎ', 'KoreanParticle')]
['한글/Noun', '자연어/Noun', '처리/Noun', '는/Josa', '재밌다/Adjective', '이제/Noun', '부터/Josa', '열심히/Adverb', '해야지/Verb', 'ㅎㅎㅎ/KoreanParticle']

[8] Kkma

kkma = Kkma()
# Kkma also offers sentences(), which splits a text into sentences;
# only morphs, nouns, and pos are used here.
print(kkma.morphs(text))
print(kkma.nouns(text))
print(kkma.pos(text))

--> Output

['한글', '자연어', '처리', '는', '재밌', '다', '이제', '부터', '열심히', '하', '어야지', 'ㅎㅎㅎ']
['한글', '자연어', '처리', '이제']
[('한글', 'NNG'), ('자연어', 'NNG'), ('처리', 'NNG'), ('는', 'JX'), ('재밌', 'VA'), ('다', 'ECS'), ('이제', 'NNG'), ('부터', 'JX'), ('열심히', 'MAG'), ('하', 'VV'), ('어야지', 'EFN'), ('ㅎㅎㅎ', 'EMO')]

[9] Komoran

komoran = Komoran()
# Komoran is a Korean morphological analyzer written in Java.
print(komoran.morphs(text))
print(komoran.nouns(text))
print(komoran.pos(text))

--> Output

['한글', '자연어', '처리', '는', '재밌', '다', '이제', '부터', '열심히', '해야지ㅎㅎㅎ']
['한글', '자연어', '처리', '이제']
[('한글', 'NNP'), ('자연어', 'NNP'), ('처리', 'NNG'), ('는', 'JX'), ('재밌', 'VA'), ('다', 'EC'), ('이제', 'NNG'), ('부터', 'JX'), ('열심히', 'MAG'), ('해야지ㅎㅎㅎ', 'NA')]

[10] Hannanum

hannanum = Hannanum()
print(hannanum.morphs(text))
print(hannanum.nouns(text))
print(hannanum.pos(text))

--> Output

['한글', '자연어', '처리', '는', '재밌다', '이제', '부터', '열심히', '해야짛ㅎㅎ']
['한글', '자연어', '처리', '재밌다', '해야짛ㅎㅎ']
[('한글', 'N'), ('자연어', 'N'), ('처리', 'N'), ('는', 'J'), ('재밌다', 'N'), ('이제', 'M'), ('부터', 'J'), ('열심히', 'M'), ('해야짛ㅎㅎ', 'N')]

4. EDA

- Exploratory Data Analysis (EDA): the whole process of observing and understanding data from multiple angles

- This section analyzes Naver movie review ratings.

1. Installing the library

https://anaconda.org/conda-forge/wordcloud

2. Importing modules

import numpy as np 
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

3. Downloading the training and test sets

https://github.com/e9t/nsmc

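[3] Move the downloaded files into the data-in folder under the working directory.

4. Loading the txt file

DATA_IN_PATH = './data-in/'  # folder holding the Naver movie review data

train_data = pd.read_csv(DATA_IN_PATH + 'ratings_train.txt', delimiter = '\t', quoting = 3)
# quoting=3 is csv.QUOTE_NONE: quote characters are treated as ordinary text
train_data.head()

--> Output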

5. Printing the number of training samples

print('Total number of training samples: {}'.format(len(train_data)))

--> Output

Total number of training samples: 150000

6. Printing the length of each training review

train_length = train_data['document'].astype(str).apply(len)
train_length.head()

--> Output

0    19
1    33
2    17
3    29
4    61
Name: document, dtype: int64

7. Plotting a histogram

plt.figure(figsize=(12, 5))
# Histogram parameters
# bins: number of buckets
# range: range of x values
# alpha: opacity of the bars
# color: bar color
# label: label for the plotted data
plt.hist(train_length, bins=200, alpha=0.5, color= 'r')
plt.yscale('log', nonpositive='clip')  # the counts span a wide range, so use a log scale on the y axis
# nonpositive='clip': non-positive y values are clipped to a very small positive number
# Plot title
plt.title('Log-Histogram of length of review')
# x-axis label
plt.xlabel('Length of review')
# y-axis label
plt.ylabel('Number of review')
plt.show()

--> Output

Review lengths range from very short up to about 140 characters
Most reviews are 20 characters or shorter; the count falls as the length grows, then jumps near 140 characters
This is because the reviews were subject to a 140-character limit (counted in Korean characters)

8. Printing length statistics

print('Maximum review length: {}'.format(np.max(train_length)))
print('Minimum review length: {}'.format(np.min(train_length)))
print('Mean review length: {:.2f}'.format(np.mean(train_length)))
print('Standard deviation of review length: {:.2f}'.format(np.std(train_length)))
print('Median review length: {}'.format(np.median(train_length)))
# np.percentile takes the percentile on a 0-100 scale
print('First quartile of review length: {}'.format(np.percentile(train_length, 25)))
print('Third quartile of review length: {}'.format(np.percentile(train_length, 75)))

--> Output

Maximum review length: 158
Minimum review length: 1
Mean review length: 35.24
Standard deviation of review length: 29.58
Median review length: 27.0
First quartile of review length: 16.0
Third quartile of review length: 42.0

9. Drawing a box plot

plt.figure(figsize=(12, 5))
# Box plot parameters
# first argument: the data to plot (a list of datasets for several distributions)
# labels: label for each dataset
# showmeans: mark the mean value

plt.boxplot(train_length, labels=['counts'], showmeans=True)
plt.show()

--> Output

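10. Drawing a word cloud

[1] Extracting the review data

train_review = [review for review in train_data['document'] if type(review) is str]
# Keep only the reviews that are strings

The box plot shows that quite a few reviews are long
The median and the mean sit toward the lower end
A word cloud is used to see which words appear most often
As a preliminary step, every entry in the data that is not a string is removed

[2] Downloading a font

WordCloud's default font only covers English text
To display Korean, a Korean font has to be specified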

https://corp.gmarket.com/fonts/

1) Download the otf file from the Gmarket site.

[3] Drawing the word cloud

wordcloud = WordCloud(font_path = DATA_IN_PATH + 'GmarketSansMedium.otf').generate(' '.join(train_review))
plt.figure(figsize=(15, 10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

--> Output

11. Plotting the label counts

fig, axe = plt.subplots(ncols=1)
fig.set_size_inches(6, 3)
sns.countplot(x=train_data['label'])
plt.show()

--> Output

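12. Printing the number of reviews per label

print("Number of positive reviews: {}".format(train_data['label'].value_counts()[1]))
print("Number of negative reviews: {}".format(train_data['label'].value_counts()[0]))

--> Output

Number of positive reviews: 74827
Number of negative reviews: 75173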

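13. Plotting the number of words per review

[1] Counting words

train_word_counts = train_data['document'].astype(str).apply(lambda x:len(x.split(' ')))
# Split each review on spaces and count the resulting words

Check the number of words in each review
Each review is split on spaces, the counts are stored in one variable, and a histogram is drawn

[2] Printing the first five rows

train_word_counts.head()

--> Output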

0     5
1     4
2     1
3     6
4    11
Name: document, dtype: int64

[3] Printing the type

print(type(train_word_counts))

--> Output

<class 'pandas.core.series.Series'>

[4] Plotting the histogram

plt.figure(figsize=(15, 10))
plt.hist(train_word_counts, bins=50, facecolor='r',label='train')
plt.title('Log-Histogram of word count in review', fontsize=15)
plt.yscale('log', nonpositive='clip')
plt.legend()
plt.xlabel('Number of words', fontsize=15)
plt.ylabel('Number of reviews', fontsize=15)
plt.show()

--> Output

Most reviews contain around 5 words
Beyond 30 words the number of reviews drops sharply

[5] 단어 개수 출력

print('Max words per review: {}'.format(np.max(train_word_counts)))
print('Min words per review: {}'.format(np.min(train_word_counts)))
print('Mean words per review: {:.2f}'.format(np.mean(train_word_counts)))
print('Std of words per review: {:.2f}'.format(np.std(train_word_counts)))
print('Median words per review: {}'.format(np.median(train_word_counts)))
# np.percentile expects the percentile on a 0-100 scale
print('1st quartile of words per review: {}'.format(np.percentile(train_word_counts, 25)))
print('3rd quartile of words per review: {}'.format(np.percentile(train_word_counts, 75)))

--> Result

Max words per review: 41
Min words per review: 1
Mean words per review: 7.58
Std of words per review: 6.51
Median words per review: 6.0
1st quartile of words per review: 3.0
3rd quartile of words per review: 9.0

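Reviews average about 7 to 8 words, and the median is about 6 words
Because of the character limit on reviews, they are shorter than typical English review data
So setting the maximum number of words fed to the model to 7 rather than 6 should cause no real problem (use 6 to 7 when padding)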
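
One way to sanity-check a candidate maximum length is to ask what fraction of reviews fit within it without truncation. The sketch below is a minimal illustration of that idea. The `coverage` helper and the sample counts are hypothetical and are not taken from the NSMC data.

```python
def coverage(word_counts, max_len):
    """Return the fraction of reviews whose word count is at most max_len."""
    fits = sum(1 for count in word_counts if count <= max_len)
    return fits / len(word_counts)


# Hypothetical word counts for ten reviews (not the real dataset).
sample_counts = [1, 2, 3, 3, 5, 6, 6, 8, 12, 30]

for max_len in (6, 8, 12):
    print(max_len, coverage(sample_counts, max_len))
# 6 0.7
# 8 0.8
# 12 0.9
```

On real data, the same function could be applied to `train_word_counts` to compare candidate lengths before padding.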

[6] Share of reviews containing a question mark or a period

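qmarks = np.mean(train_data['document'].astype(str).apply(lambda x: '?' in x)) # share of reviews containing a question mark
fullstop = np.mean(train_data['document'].astype(str).apply(lambda x: '.' in x)) # share of reviews containing a period
# True counts as 1 and False as 0, so the mean is the proportion

print('Reviews with a question mark: {:.2f}%'.format(qmarks * 100))
print('Reviews with a period: {:.2f}%'.format(fullstop * 100))

--> Result

Reviews with a question mark: 8.25%
Reviews with a period: 51.76%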
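
The percentages above rely on the fact that the mean of a list of booleans is the share of True values. A minimal sketch of the same calculation in plain Python, using made-up sample reviews (an assumption for illustration only):

```python
# Hypothetical sample reviews.
sample_reviews = ["이게 영화냐?", "재미있어요.", "최고", "왜 만들었지? 모르겠다."]

has_qmark = ["?" in review for review in sample_reviews]
has_period = ["." in review for review in sample_reviews]

# sum() treats True as 1 and False as 0, so sum / len is the proportion.
qmark_ratio = sum(has_qmark) / len(has_qmark)
period_ratio = sum(has_period) / len(has_period)

print("{:.2f}%".format(qmark_ratio * 100))   # 50.00%
print("{:.2f}%".format(period_ratio * 100))  # 50.00%
```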

14. Importing modules

import re
import json
from konlpy.tag import Okt
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

train_data.head()

--> Result

15. Processing sentences

[1] Cleaning a sentence

review_text = re.sub("[^가-힣ㄱ-ㅎㅏ-ㅣ\\s]", "", train_data['document'][0])
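# \\s (the regex \s) matches whitespace
# The pattern is a regular expression
# Any character not listed inside [] is replaced with "" (deleted); listed characters pass through
# Everything except Hangul syllables, standalone consonants and vowels, and whitespace is deleted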
print(review_text)

--> Result

아 더빙 진짜 짜증나네요 목소리
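
To see the character class at work on other input, here is a small sketch using the same pattern on a made-up string. The `keep_korean` helper and the sample text are assumptions for illustration. Note that removing characters can leave leading or doubled spaces behind, so a whitespace-normalizing step may be useful afterwards.

```python
import re

# Same character class as above, compiled once and written as a raw string.
NON_KOREAN = re.compile(r"[^가-힣ㄱ-ㅎㅏ-ㅣ\s]")


def keep_korean(text):
    """Delete every character that is not Hangul or whitespace."""
    return NON_KOREAN.sub("", text)


sample = "Wow!! 이 영화 10점 ㅋㅋ"
cleaned = keep_korean(sample)
print(repr(cleaned))  # ' 이 영화 점 ㅋㅋ'

# Optional: collapse the leftover whitespace.
normalized = " ".join(cleaned.split())
print(repr(normalized))  # '이 영화 점 ㅋㅋ'
```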

[2] Splitting the sentence into tokens

okt=Okt()
review_text = okt.morphs(review_text, stem=True)
print(review_text)

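To remove stopwords from the cleaned sentence, it first has to be split into words
The Okt object from the KoNLPy library is used for this
With stem=True, the morphological analyzer returns each token reduced to its stem

--> Result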

['아', '더빙', '진짜', '짜증나다', '목소리']

16. Stopword dictionary

[1] Building a stopword set and filtering the tokens

stop_words = set(['은', '는', '이', '가', '하', '아', '것', '들','의', '있', '되', '수', '보', '주', '등', '한'])
clean_review = [token for token in review_text if token not in stop_words]
print(clean_review)

A stopword set is created and applied to remove stopwords

--> Result

['더빙', '진짜', '짜증나다', '목소리']

[2] Removing stopwords

def preprocessing(review, okt, remove_stopwords=False, stop_words=()):
    # Arguments:
    # review : the text to preprocess
    # okt : an Okt object, created once up front and passed in rather than recreated on every call
    # remove_stopwords : whether to remove stopwords (default: False)
    # stop_words : the stopword collection, supplied by the caller (default: empty)

    # 1. Remove every character that is not Hangul or whitespace.
    review_text = re.sub("[^가-힣ㄱ-ㅎㅏ-ㅣ\\s]", "", review)

    # 2. Split the text into morphemes with the Okt object.
    word_review = okt.morphs(review_text, stem=True)

    if remove_stopwords:
        # Remove stopwords (optional)
        word_review = [token for token in word_review if token not in stop_words]

    return word_review

clean_train_review = []

for review in train_data['document']:
    # Only process strings so that missing values do not raise an error
    if isinstance(review, str):
        clean_train_review.append(preprocessing(review, okt, remove_stopwords=True, stop_words=stop_words))
    else:
        clean_train_review.append([])  # append an empty list for non-string entries

print(clean_train_review[:4])

--> Result

[['더빙', '진짜', '짜증나다', '목소리'], ['흠', '포스터', '보고', '초딩', '영화', '줄', '오버', '연기', '조차', '가볍다', '않다'], ['너', '무재', '밓었', '다그', '래서', '보다', '추천', '다'], ['교도소', '이야기', '구먼', '솔직하다', '재미', '없다', '평점', '조정']]
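
The pipeline above (clean, tokenize, filter) can be tried without installing KoNLPy by swapping in a stand-in tokenizer. The sketch below is only an illustration: `WhitespaceTokenizer` is a hypothetical class that splits on whitespace and does no stemming, so its tokens differ from what Okt returns (for example, '짜증나네요' is not reduced to '짜증나다'). The sample sentence and the stopword set are also assumptions.

```python
import re


class WhitespaceTokenizer:
    """Hypothetical stand-in for Okt: splits on whitespace, no stemming."""

    def morphs(self, text, stem=False):
        return text.split()


def preprocess(review, tokenizer, stop_words=frozenset()):
    # 1. Keep only Hangul and whitespace.
    text = re.sub(r"[^가-힣ㄱ-ㅎㅏ-ㅣ\s]", "", review)
    # 2. Tokenize with whatever object provides a morphs() method.
    tokens = tokenizer.morphs(text, stem=True)
    # 3. Drop stopwords.
    return [token for token in tokens if token not in stop_words]


stop_words = {"아", "이", "것"}
tokens = preprocess("아 더빙.. 진짜 짜증나네요 목소리", WhitespaceTokenizer(), stop_words)
print(tokens)  # ['더빙', '진짜', '짜증나네요', '목소리']
```

Passing the tokenizer in as an argument, as the original function does with `okt`, is what makes this kind of substitution possible.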

[3] Loading the test txt file and removing stopwords

test_data = pd.read_csv(DATA_IN_PATH + 'ratings_test.txt', delimiter='\t', quoting=3 )

clean_test_review = []

for review in test_data['document']:
    # Only process strings so that missing values do not raise an error
    if isinstance(review, str):
        clean_test_review.append(preprocessing(review, okt, remove_stopwords=True, stop_words=stop_words))
    else:
        clean_test_review.append([])  # append an empty list for non-string entries

print(clean_test_review[:4])

--> Result

[['굳다', 'ㅋ'], [], ['뭐', '야', '평점', '나쁘다', '않다', '점', '짜다', '리', '더', '더욱', '아니다'], ['지루하다', '않다', '완전', '막장', '임', '돈', '주다', '보기', '에는']]

17. Using the tokenizer

[1] Converting words to integer indices

tokenizer = Tokenizer()
tokenizer.fit_on_texts(clean_train_review)
train_sequences = tokenizer.texts_to_sequences(clean_train_review)
test_sequences = tokenizer.texts_to_sequences(clean_test_review)

word_vocab = tokenizer.word_index

train_sequences[:3]  # the tokens shown above, converted to integer indices

--> Result

[[463, 20, 265, 664],
 [923, 465, 46, 604, 1, 219, 1459, 30, 969, 680, 24],
 [393, 2456, 25028, 2323, 5680, 2, 226, 13]]
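
To make the mapping less of a black box, the sketch below imitates in plain Python what the Keras `Tokenizer` does with its default settings: words are ranked by descending frequency, indices start at 1 (0 is left free for padding), and words unseen during fitting are dropped when converting to sequences. The helper names and the toy data are assumptions, and this is a simplified imitation rather than the library's implementation.

```python
from collections import Counter


def build_word_index(tokenized_texts):
    """Rank words by descending frequency; indices start at 1."""
    counts = Counter(token for tokens in tokenized_texts for token in tokens)
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequences(tokenized_texts, word_index):
    """Replace each known token with its index; unknown tokens are skipped."""
    return [[word_index[t] for t in tokens if t in word_index]
            for tokens in tokenized_texts]


# Hypothetical toy data.
train_tokens = [["영화", "재미"], ["영화", "추천"]]
test_tokens = [["영화", "모름"]]

word_index = build_word_index(train_tokens)
print(word_index)                               # {'영화': 1, '재미': 2, '추천': 3}
print(to_sequences(train_tokens, word_index))   # [[1, 2], [1, 3]]
print(to_sequences(test_tokens, word_index))    # [[1]]
```

The last line shows why the vocabulary is fitted on the training reviews only: test-set words that never appeared in training simply disappear from the sequence.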

[2] Printing the index of the word '더빙'

print(word_vocab['더빙'])

--> Result

463

18. Creating the npy files

[1] Saving the data

MAX_SEQUENCE_LENGTH = 8 # maximum sequence length, chosen because the average review has about 8 words

train_inputs = pad_sequences(train_sequences, maxlen=MAX_SEQUENCE_LENGTH, padding='post') # turn the training sequences into fixed-length vectors
train_labels = np.array(train_data['label']) # training labels

test_inputs = pad_sequences(test_sequences, maxlen=MAX_SEQUENCE_LENGTH, padding='post') # turn the test sequences into fixed-length vectors
test_labels = np.array(test_data['label']) # test labels
# With padding='post', zeros are appended at the end of each sequence.

DATA_IN_PATH = './data-in/'
TRAIN_INPUT_DATA = 'nsmc_train_input.npy'
TRAIN_LABEL_DATA = 'nsmc_train_label.npy'
TEST_INPUT_DATA = 'nsmc_test_input.npy'
TEST_LABEL_DATA = 'nsmc_test_label.npy'
DATA_CONFIGS = 'nsmc_data_configs.json'

data_configs = {}

data_configs['vocab'] = word_vocab
data_configs['vocab_size'] = len(word_vocab) # add the vocabulary size

import os
# Create the output directory if it does not exist
os.makedirs(DATA_IN_PATH, exist_ok=True)

# Save the preprocessed training data as NumPy files
np.save(DATA_IN_PATH + TRAIN_INPUT_DATA, train_inputs)
np.save(DATA_IN_PATH + TRAIN_LABEL_DATA, train_labels)
# Save the preprocessed test data as NumPy files
np.save(DATA_IN_PATH + TEST_INPUT_DATA, test_inputs)
np.save(DATA_IN_PATH + TEST_LABEL_DATA, test_labels)

# Save the vocabulary and its size as JSON
with open(DATA_IN_PATH + DATA_CONFIGS, 'w') as f:
    json.dump(data_configs, f, ensure_ascii=False)

# The preprocessed data is saved so it can be fed to the neural network later
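
The `pad_sequences` calls in this step use `padding='post'` but leave `truncating` at its default of `'pre'`, so a sequence longer than the maximum length loses tokens from the front, not the back. The plain-Python sketch below imitates that behavior on the first two training sequences shown earlier. `pad_post` is a hypothetical helper written for illustration, not the Keras function.

```python
def pad_post(sequences, maxlen, value=0):
    """Imitate pad_sequences(..., padding='post', truncating='pre')."""
    padded = []
    for seq in sequences:
        kept = list(seq[-maxlen:])  # 'pre' truncation keeps the last maxlen items
        padded.append(kept + [value] * (maxlen - len(kept)))  # zeros go at the end
    return padded


sequences = [
    [463, 20, 265, 664],
    [923, 465, 46, 604, 1, 219, 1459, 30, 969, 680, 24],
]

for row in pad_post(sequences, maxlen=8):
    print(row)
# [463, 20, 265, 664, 0, 0, 0, 0]
# [604, 1, 219, 1459, 30, 969, 680, 24]
```

If keeping the beginning of long reviews matters more, `truncating='post'` can be passed to `pad_sequences` instead.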

--> Result

5. Korean modeling

- Exploratory Data Analysis (EDA): the whole process of observing and understanding data from various angles

1. Model structure

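- Ignore the batch dimension when reading the diagram.

- The sentence "I like this movie very much!" is the input.

- With d=5, each word becomes a 5-dimensional vector. The kernel sizes are (2, 3, 4), with 2 filters per kernel size.

- Each kernel slides over the sequence of 5-dimensional vectors; the outputs are max-pooled and then concatenated.

↓ Summary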
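
The shapes in the structure described above can be worked out with simple arithmetic. The sketch below assumes the sentence is split into 7 tokens with "!" as its own token, and that the convolutions use stride 1 with no padding. Both are assumptions made for illustration, and `conv_output_length` is a hypothetical helper.

```python
def conv_output_length(seq_len, kernel_size):
    """Output length of a stride-1, unpadded 1D convolution."""
    return seq_len - kernel_size + 1


tokens = "I like this movie very much !".split()
seq_len = len(tokens)       # 7 tokens
embedding_dim = 5           # d = 5, so the input is a 7 x 5 matrix
kernel_sizes = (2, 3, 4)
filters_per_kernel = 2

for k in kernel_sizes:
    # Each filter covers k consecutive tokens across all 5 dimensions.
    print(k, (conv_output_length(seq_len, k), filters_per_kernel))
# 2 (6, 2)
# 3 (5, 2)
# 4 (4, 2)

# Max pooling keeps one value per filter; concatenating gives the final feature vector.
pooled_size = len(kernel_sizes) * filters_per_kernel
print(pooled_size)  # 6
```

Whatever the sentence length, the pooled vector always has the same size, which is what lets a fixed-size dense layer follow it.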

2. Importing modules

import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras import layers

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import os
import json
os.environ['KMP_DUPLICATE_LIB_OK']="TRUE"

3. Plotting helper function

def plot_graphs(history, string):
    plt.plot(history.history[string])
    plt.plot(history.history['val_'+string], '')
    plt.xlabel("Epochs")
    plt.ylabel(string)
    plt.legend([string, 'val_'+string])
    plt.show()

4. File input/output paths

DATA_IN_PATH = './data-in/'
DATA_OUT_PATH = './data-out/'
INPUT_TRAIN_DATA = 'nsmc_train_input.npy'
LABEL_TRAIN_DATA = 'nsmc_train_label.npy'
DATA_CONFIGS = 'nsmc_data_configs.json'

5. Fixing the random seed

SEED_NUM = 1234
tf.random.set_seed(SEED_NUM)  # fix the seed at 1234 so random number generation is reproducible

6. Loading the files

train_input = np.load(open(DATA_IN_PATH + INPUT_TRAIN_DATA, 'rb'))
train_label = np.load(open(DATA_IN_PATH + LABEL_TRAIN_DATA, 'rb'))
prepro_configs = json.load(open(DATA_IN_PATH + DATA_CONFIGS, 'r'))

7. Setting the model hyperparameters

model_name = 'cnn_classifier_kr'
BATCH_SIZE = 512
NUM_EPOCHS = 10
VALID_SPLIT = 0.1
MAX_LEN = train_input.shape[1]

kargs = {'model_name': model_name,
        'vocab_size': prepro_configs['vocab_size'],
        'embedding_size': 128,
        'num_filters': 100,
        'dropout_rate': 0.5,
        'hidden_dimension': 250,
        'output_dimension':1}

8. CNN

https://arxiv.org/abs/1408.5882

Convolutional Neural Networks for Sentence Classification

class CNNClassifier(tf.keras.Model):
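    # CNNs are known for image pattern analysis, but they also work for sentence classification (see the paper above)
    def __init__(self, **kargs):
        # Constructor: sets up the model's layers and parameters.
        # **kargs passes the arguments needed to build the model as a dictionary.
        
        super(CNNClassifier, self).__init__(name=kargs['model_name'])
        # The model name is set when calling the parent constructor.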
        
        self.embedding = layers.Embedding(input_dim=kargs['vocab_size']+1,
                                     output_dim=kargs['embedding_size'])
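        # Embedding: maps high-dimensional sparse vectors to low-dimensional dense vectors.
        # Each input token index is converted to a dense vector.
        # input_dim is the vocabulary size plus 1 (index 0 is used for padding).
        # output_dim=128 gives 128-dimensional dense vectors.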
        
        self.conv_list = [layers.Conv1D(filters=kargs['num_filters'],
                                   kernel_size=kernel_size,
                                   activation='relu',
                                   kernel_constraint=tf.keras.constraints.MaxNorm(max_value=3.))
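                                      # MaxNorm(max_value=3.) constrains the norm of the kernel weights to at most 3
                         for kernel_size in [3,4,5]]
        # One Conv1D layer is created for each kernel size: 3, 4, and 5.
        # filters is the number of output channels.
        # The activation function is ReLU.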
        
        self.pooling = layers.GlobalMaxPooling1D()
        # Apply global max pooling.
        # Global max pooling: for each filter, take the maximum over the whole time axis.
        
        self.dropout = layers.Dropout(kargs['dropout_rate'])
        self.fc1 = layers.Dense(units=kargs['hidden_dimension'],
                           activation='relu',
                           kernel_constraint=tf.keras.constraints.MaxNorm(max_value=3.))
        self.fc2 = layers.Dense(units=kargs['output_dimension'],
                           activation='sigmoid',
                           kernel_constraint=tf.keras.constraints.MaxNorm(max_value=3.))
    
    def call(self, x):
        x = self.embedding(x)
        x = self.dropout(x)
        x = tf.concat([self.pooling(conv(x)) for conv in self.conv_list], axis=-1)
        # Concatenate the max-pooled outputs, then pass them through fc1 and fc2.
        x = self.fc1(x)
        x = self.fc2(x)
        
        return x
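
The pooling-and-concatenation step in `call` can be pictured with a small plain-Python sketch. `GlobalMaxPooling1D` keeps, for each filter, the single largest value over all time steps, so conv outputs of different lengths all shrink to one value per filter before being joined. The feature-map values and the `global_max_pool` helper are made up for illustration.

```python
def global_max_pool(feature_map):
    """feature_map: list of time steps, each a list with one value per filter."""
    return [max(column) for column in zip(*feature_map)]


# Hypothetical conv outputs with 2 filters each and different numbers of time steps.
conv_a = [[0.1, 0.7], [0.9, 0.2], [0.4, 0.3]]  # 3 time steps
conv_b = [[0.5, 0.0], [0.6, 0.8]]              # 2 time steps

# Pool each feature map, then concatenate along the feature axis.
features = global_max_pool(conv_a) + global_max_pool(conv_b)
print(features)  # [0.9, 0.7, 0.6, 0.8]
```

With the settings in this post (three kernel sizes, 100 filters each), the concatenated vector would have 300 values per review.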

9. Compiling the model

model = CNNClassifier(**kargs)

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

10. Setting up the callbacks and checkpoint path

# Add early stopping to prevent overfitting
earlystop_callback = EarlyStopping(monitor='val_accuracy', min_delta=0.0001,patience=4)
# min_delta: the minimum change that counts as an improvement (accuracy must improve by at least 0.0001)
# patience: number of epochs without improvement to tolerate (with patience=4, training stops after 4 epochs with no improvement)

checkpoint_path = DATA_OUT_PATH + model_name + '/weights.h5'
checkpoint_dir = os.path.dirname(checkpoint_path)

# Create the directory if it does not exist
if os.path.exists(checkpoint_dir):
    print("{} -- Folder already exists \n".format(checkpoint_dir))
else:
    os.makedirs(checkpoint_dir, exist_ok=True)
    print("{} -- Folder create complete \n".format(checkpoint_dir))
    

cp_callback = ModelCheckpoint(
    checkpoint_path, monitor='val_accuracy', verbose=1, save_best_only=True, save_weights_only=True)
# save_weights_only=True: save only the weights, not the full model

--> Result

./data-out/cnn_classifier_kr -- Folder already exists
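
How `min_delta` and `patience` interact in the early-stopping callback above is easier to see on a concrete sequence of validation accuracies. The function below is a simplified sketch of that rule for a metric that should increase. It is not the Keras implementation, and the accuracy values are invented for the example.

```python
def early_stop_epoch(val_accuracies, min_delta=0.0001, patience=4):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("-inf")
    wait = 0
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc - min_delta > best:
            # Improvement larger than min_delta: remember it and reset the counter.
            best = acc
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical validation accuracies per epoch.
history = [0.80, 0.82, 0.83, 0.83, 0.829, 0.83, 0.8299, 0.84]
print(early_stop_epoch(history))  # 7
```

In this example the best value is reached at epoch 3, epochs 4 to 7 bring no improvement beyond `min_delta`, and training stops at epoch 7 before the later jump to 0.84 is ever seen. That is the trade-off a small `patience` makes.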
