Transfer learning with word2vec in Keras to build a classification model


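Goal

Train word2vec on unlabeled data and apply it as an Embedding layer to a text classification problem.

Background

word2vec

word2vec turns words into vectors. Here I use the Python library gensim for the implementation. The library is based on the following papers.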

· Efficient Estimation of Word Representations in Vector Space
· Distributed Representations of Words and Phrases and their Compositionality
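
To make "turning words into vectors" concrete, here is a minimal sketch in plain Python. The three vectors are made-up toy values, not output from gensim or from a real word2vec model; only the cosine similarity formula itself is standard.

```python
import math

# Toy 3-dimensional "word vectors" (made-up values for illustration only).
toy_vectors = {
    "dog": [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.2, 0.1],
    "guitar": [0.0, 0.1, 0.9],
}

def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Related words are expected to point in similar directions.
sim_close = cosine_similarity(toy_vectors["dog"], toy_vectors["puppy"])
sim_far = cosine_similarity(toy_vectors["dog"], toy_vectors["guitar"])
```

With these toy values, sim_close comes out larger than sim_far, which is the kind of relationship a trained word2vec model is meant to capture.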

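Transfer learning

Transfer learning is a method of building a new model by reusing an already trained model.
In this article the roles are as follows.

Trained model → gensim's word2vec model
New model → document classification model

A paper on document classification with transfer learning

Convolutional Neural Networks for Sentence Classification

Articles on transfer learning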

https://qiita.com/icoxfog417/items/48cbf087d22f1f8c6f4
http://www.yasuhisay.info/entry/2016/12/05/000000

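There is also pre-training, which resembles transfer learning. Pre-training methods include autoencoders and restricted Boltzmann machines. I do not yet understand the detailed difference between transfer learning and pre-training.

Implementation

Creating the dataset

The task here is to classify song lyrics by artist.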

docs = [text2doc(song['lyric']) for song in songs]
artists = [song['artist_name'] for song in songs]

・The text2doc function splits the text into words, keeps only the content words, and returns them as a list.

Example

docs = [['わたし', '犬', 'なる'], ...]
artists = ['星野源', '福山雅治', ...]
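
The article does not show text2doc itself, so the following is only a sketch of what such a function might look like. It assumes a tokenizer that returns (base form, part of speech) pairs; toy_tokenize and CONTENT_POS are hypothetical stand-ins for a real morphological analyzer and its tag set.

```python
# Parts of speech treated as content words (an assumption for this sketch).
CONTENT_POS = {"noun", "verb", "adjective"}

def toy_tokenize(text):
    """Hypothetical stand-in for a morphological analyzer.

    Returns (base_form, part_of_speech) pairs from a tiny hard-coded table.
    """
    table = {
        "わたしは犬になる": [
            ("わたし", "noun"), ("は", "particle"),
            ("犬", "noun"), ("に", "particle"), ("なる", "verb"),
        ],
    }
    return table.get(text, [])

def text2doc(text, tokenize=toy_tokenize):
    """Tokenize text and keep only the content words, as a list."""
    return [word for word, pos in tokenize(text) if pos in CONTENT_POS]

doc = text2doc("わたしは犬になる")
```

Passing the tokenizer as an argument keeps the filtering logic separate from whichever analyzer is actually used.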

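Loading the trained word2vec model

from gensim.models import KeyedVectors

# Load pretrained vectors stored in the word2vec binary format.
model_path = 'path/to/entity_vector.model.bin'
wv_model = KeyedVectors.load_word2vec_format(model_path, binary=True)
# index2word is the gensim 3.x attribute name; read it from KeyedVectors directly.
index2word = wv_model.index2word
word2index = {word: i for i, word in enumerate(index2word)}
vocab = word2index.keys()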

・KeyedVectors loads the word2vec model.
・index2word and word2index are needed later to convert the data above into ids, so they are built here.
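
As a small illustration of how these two lookups are used, the sketch below converts a document to ids and back. The vocabulary is a made-up toy list standing in for the model's index2word, and doc_to_ids is a hypothetical helper name.

```python
# Toy vocabulary standing in for wv_model's index2word (made-up values).
index2word = ["犬", "猫", "なる", "わたし"]
word2index = {word: i for i, word in enumerate(index2word)}

def doc_to_ids(doc, word2index):
    """Map each known word to its id; out-of-vocabulary words are dropped."""
    return [word2index[word] for word in doc if word in word2index]

ids = doc_to_ids(["わたし", "犬", "なる", "未知語"], word2index)
# Round trip back to words via index2word.
words = [index2word[i] for i in ids]
```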

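Creating the training data

The data above still contains text, so it is converted into a data type that Keras can train on. The text is converted to numbers by mapping words to ids with the dictionary built above.

Creating x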

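import numpy as np

# One row per document, one column per vocabulary word.
x_bag_of_words = np.zeros((len(docs), len(vocab)))
for i, doc in enumerate(docs):
    for word in doc:
        # Words missing from the word2vec vocabulary are skipped.
        if word in vocab:
            x_bag_of_words[i][word2index[word]] += 1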

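Creating y

from sklearn.preprocessing import LabelEncoder

# Map each artist name to an integer class id.
le = LabelEncoder()
label_ids = le.fit_transform(artists)
# One-hot encode the class ids: one column per artist.
y = np.zeros((len(docs), len(le.classes_)))
for i, label_id in enumerate(label_ids):
    y[i][label_id] = 1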

・Both x and y are simply bag-of-words style vectors: x holds word counts and y is a one-hot vector.
・LabelEncoder is convenient.
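
What the two loops above produce can be checked by hand on a toy example. This sketch uses plain Python without NumPy or scikit-learn, and the vocabulary, documents, and artist labels are made up. Sorting the labels mirrors how LabelEncoder orders its classes_.

```python
toy_word2index = {"犬": 0, "猫": 1, "なる": 2}
toy_docs = [["犬", "犬", "なる"], ["猫", "空"]]  # "空" is out of vocabulary
toy_artists = ["B", "A"]

# x: one count vector per document.
x = [[0] * len(toy_word2index) for _ in toy_docs]
for i, doc in enumerate(toy_docs):
    for word in doc:
        if word in toy_word2index:
            x[i][toy_word2index[word]] += 1

# y: one-hot vectors over the sorted set of labels.
classes = sorted(set(toy_artists))
y = [[1 if label == c else 0 for c in classes] for label in toy_artists]
```

Note that the count vectors discard word order: the first document would get the same row for any ordering of its three words.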

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_bag_of_words, y, test_size=0.2)

Finally, the data is split into training and test sets, and the dataset is complete.
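
A Keras Embedding layer interprets each input value as a word index, so an alternative to count vectors is to feed it fixed-length sequences of word ids. The following is a minimal sketch of that swapped-in representation, not the method used in this article. pad_ids, max_len, and the pad id 0 are hypothetical choices, and reusing 0 for padding collides with the word whose id is 0 unless the ids are shifted.

```python
def pad_ids(ids, max_len, pad_id=0):
    """Truncate ids to max_len, then right-pad with pad_id."""
    clipped = ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))

# Toy id sequences of different lengths (made-up values).
id_docs = [[3, 1, 2], [5], [4, 4, 2, 1, 7]]
x_ids = [pad_ids(ids, max_len=4) for ids in id_docs]
```

Every row of x_ids then has the same length, which is what a fixed Input shape requires.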

Modeling

I use the Keras functional API. I did not use the Sequential model because I had read that the functional API allows models to be written more flexibly.

https://qiita.com/Mco7777/items/1339d01bc6ef028e7b44

from keras.layers import Dense, Activation, Input, LSTM
from keras.models import Model

# Input width matches the number of columns in x_train.
inputs = Input(shape=(x_train.shape[1],))
# gensim 3.x helper that wraps the word2vec weights in a Keras Embedding layer;
# train_embeddings=True leaves the weights trainable.
embed = wv_model.get_keras_embedding(train_embeddings=True)(inputs)
embed = LSTM(100)(embed)
hiddened = Dense(50)(embed)
hiddened = Activation('relu')(hiddened)
# One output unit per artist; softmax pairs with categorical_crossentropy.
pred = Dense(len(le.classes_))(hiddened)
pred = Activation('softmax')(pred)
model = Model(inputs=inputs, outputs=pred)

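・The only part worth special mention is wv_model.get_keras_embedding(). This is a function implemented on gensim's word2vec vectors, and it returns a Keras Embedding layer in this way.
・Next, the Embedding layer takes a 2-D input and produces a 3-D output. The LSTM absorbs the extra dimension nicely. A Conv layer could probably absorb it as well, but I do not do that here. This part seems worth revisiting after reading several papers.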
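
The shape change described above can be reproduced with a plain lookup table. In this sketch the embedding matrix holds made-up values and nested lists stand in for tensors; a (batch, sequence length) input of ids becomes a (batch, sequence length, embedding dimension) output.

```python
# Toy embedding matrix: 4 words, 2 dimensions each (made-up values).
embedding_matrix = [
    [0.0, 0.0],
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
]

def embed(batch_of_ids, matrix):
    """Replace every id with its row of the embedding matrix."""
    return [[matrix[i] for i in ids] for ids in batch_of_ids]

batch = [[1, 2, 0], [3, 3, 1]]             # shape (2, 3)
embedded = embed(batch, embedding_matrix)  # shape (2, 3, 2)
shape = (len(embedded), len(embedded[0]), len(embedded[0][0]))
```

An LSTM without return_sequences then consumes the sequence axis and returns one vector per document, which is why the Dense layers that follow see a 2-D tensor again.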

Training

model.compile(optimizer='sgd', loss='categorical_crossentropy')
model.fit(x_train, y_train, epochs=1, batch_size=32, verbose=1)
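
For reference, the categorical cross-entropy named in compile can be written out by hand. This is a plain-Python sketch for a single sample with made-up scores, paired here with a softmax over the outputs. It illustrates the formula and is not Keras's internal implementation.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]  # for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(y_true, y_prob):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_prob) if t > 0)

probs = softmax([2.0, 1.0, 0.1])  # made-up scores for 3 artists
loss_right = categorical_crossentropy([1, 0, 0], probs)
loss_wrong = categorical_crossentropy([0, 0, 1], probs)
```

The loss is small when the highest probability falls on the true artist and large otherwise, which is the signal the optimizer minimizes during fit.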

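Evaluation

Training is still in progress.

Results

Training is still in progress.

Questions and open issues

· The difference between transfer learning and pre-training
· Read "Convolutional Neural Networks for Sentence Classification"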

Reference

Original article (Transfer learning with word2vec in Keras to build a classification model): https://qiita.com/hiroto0227/items/680b772079cf2eba59d9
