PyTorch Exercise: Augmenting the LSTM part-of-speech tagger with character-level features
The question posted on the tutorial page is:
In the example above, each word had an embedding, which served as the inputs to our sequence model. Let’s augment the word embeddings with a representation derived from the characters of the word. We expect that this should help significantly, since character-level information like affixes have a large bearing on part-of-speech. For example, words with the affix -ly are almost always tagged as adverbs in English.
To do this, let c_w be the character-level representation of word w. Let x_w be the word embedding as before. Then the input to our sequence model is the concatenation of x_w and c_w. So if x_w has dimension 5, and c_w dimension 3, then our LSTM should accept an input of dimension 8.
To get the character level representation, do an LSTM over the characters of a word, and let c_w be the final hidden state of this LSTM. Hints:
There are going to be two LSTM’s in your new model. The original one that outputs POS tag scores, and the new one that outputs a character-level representation of each word.
To do a sequence model over characters, you will have to embed characters. The character embeddings will be the input to the character LSTM.
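Before getting to the program, the dimension arithmetic from the exercise text can be checked with a tiny sketch of my own (the names x_w, c_w, and lstm_input are purely illustrative): concatenating a 5-dimensional word embedding with a 3-dimensional character representation along the feature dimension yields the 8-dimensional input the word-level LSTM expects.

import torch

x_w = torch.zeros(1, 5)                    # word embedding for a single word, dimension 5
c_w = torch.zeros(1, 3)                    # character-level representation, dimension 3
lstm_input = torch.cat((x_w, c_w), dim=1)  # concatenate along the feature dimension
print(lstm_input.size())                   # torch.Size([1, 8])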
Searching the internet for an answer turned up nothing, so I wrote the program myself. The program below uses character embeddings: each word is padded to a fixed maximum length MAX_WORD_LEN, an LSTM is run over the characters of the word, and the final output of that LSTM is taken as the word's character-level representation (its char emb).
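To make the padding scheme concrete, here is a small sketch (the helper name pad_word is mine, not part of the program below; its logic mirrors the sent_chars loop used later): words shorter than MAX_WORD_LEN are left-padded with spaces, and longer words are truncated, so every word contributes exactly MAX_WORD_LEN character indices to the char LSTM.

MAX_WORD_LEN = 8

def pad_word(w, max_len=MAX_WORD_LEN):
    # left-pad short words with spaces; truncate words longer than max_len
    return list(' ' * (max_len - len(w)) + w) if len(w) < max_len else list(w[:max_len])

print(pad_word('dog'))        # [' ', ' ', ' ', ' ', ' ', 'd', 'o', 'g']
print(pad_word('Everybody'))  # ['E', 'v', 'e', 'r', 'y', 'b', 'o', 'd'] (9 chars truncated to 8)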
# -*- coding:utf8 -*-
import torch
import torch.nn as nn
from torch.autograd import Variable
import torch.nn.functional as F
import torch.optim as optim
torch.manual_seed(1)
def prepare_sequence(seq, to_ix):
    idxs = [to_ix[w] for w in seq]
    tensor = torch.LongTensor(idxs)
    return Variable(tensor)
training_data = [
    ("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book".split(), ["NN", "V", "DET", "NN"])
]
word_to_ix = {}
for sent, tags in training_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)
char_to_ix = {}
char_to_ix[' '] = len(char_to_ix)  # index 0 is the space used for padding words
for sent, _ in training_data:
    for word in sent:
        for char in word:
            if char not in char_to_ix:
                char_to_ix[char] = len(char_to_ix)
# print(char_to_ix)
# print('len(char_to_ix):',len(char_to_ix))
# print(word_to_ix)
tag_to_ix = {"DET": 0, "NN": 1, "V": 2}
class LSTMTagger(nn.Module):
    def __init__(self, word_emb_dim, char_emb_dim, hidden_dim, vocab_size, tagset_size, char_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim
        self.char_emb_dim = char_emb_dim
        self.word_embedding = nn.Embedding(vocab_size, word_emb_dim)
        self.char_embedding = nn.Embedding(char_size, char_emb_dim)
        # character-level LSTM: its last output is used as the word's character representation
        self.char_lstm = nn.LSTM(char_emb_dim, char_emb_dim)
        # word-level LSTM takes the concatenation of word embedding and character representation
        self.lstm = nn.LSTM(word_emb_dim + char_emb_dim, hidden_dim)
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence_word, sentence_char, MAX_WORD_LEN):
        # char emb
        sentence_size = sentence_word.size()[0]
        char_emb = self.char_embedding(sentence_char)  # [sentence_size * MAX_WORD_LEN, char_emb_dim]
        # reshape so the char LSTM sees one padded word per batch element:
        # [MAX_WORD_LEN, sentence_size, char_emb_dim]
        char_emb = char_emb.view(sentence_size, MAX_WORD_LEN, -1).permute(1, 0, 2)
        self.hidden_char = self.initHidden_char(sentence_size)
        char_lstm_out, self.hidden_char = self.char_lstm(char_emb, self.hidden_char)
        # the final time step of the char LSTM is the character-level representation of each word
        char_embeded = char_lstm_out[-1, :, :].view(sentence_size, -1)
        # word emb
        word_embeded = self.word_embedding(sentence_word)
        embeded = torch.cat((word_embeded, char_embeded), dim=1)
        # print('embeded size:', embeded.size())
        self.hidden = self.initHidden()
        lstm_out, self.hidden = self.lstm(embeded.view(sentence_size, 1, -1), self.hidden)
        tag_space = self.hidden2tag(lstm_out.view(sentence_size, -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

    def initHidden(self):
        result = (Variable(torch.zeros(1, 1, self.hidden_dim)),
                  Variable(torch.zeros(1, 1, self.hidden_dim)))
        return result

    def initHidden_char(self, sentence_size):
        result = (Variable(torch.zeros(1, sentence_size, self.char_emb_dim)),
                  Variable(torch.zeros(1, sentence_size, self.char_emb_dim)))
        return result
# These will usually be more like 32 or 64 dimensional.
# We will keep them small, so we can see how the weights change as we train.
WORD_EMB_DIM = 6
CHAR_EMB_DIM = 3
HIDDEN_DIM = 6
MAX_WORD_LEN = 8
model = LSTMTagger(WORD_EMB_DIM, CHAR_EMB_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix), len(char_to_ix))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)
# before training
print('before training')
sentence_word = prepare_sequence(training_data[0][0], word_to_ix)
sent_chars = []
for w in training_data[0][0]:
    sps = ' ' * (MAX_WORD_LEN - len(w))
    sent_chars.extend(list(sps + w) if len(w) < MAX_WORD_LEN else list(w[:MAX_WORD_LEN]))
sentence_char = prepare_sequence(sent_chars, char_to_ix)
tag_scores = model(sentence_word, sentence_char, MAX_WORD_LEN)
targets = prepare_sequence(training_data[0][1], tag_to_ix)
print(tag_scores)
print('targets:', targets)
for epoch in range(300):
    for sentence, tags in training_data:
        model.zero_grad()
        model.hidden = model.initHidden()
        sentence_word = prepare_sequence(sentence, word_to_ix)
        sent_chars = []
        for w in sentence:
            sps = ' ' * (MAX_WORD_LEN - len(w))
            sent_chars.extend(list(sps + w) if len(w) < MAX_WORD_LEN else list(w[:MAX_WORD_LEN]))
        sentence_char = prepare_sequence(sent_chars, char_to_ix)
        targets = prepare_sequence(tags, tag_to_ix)
        tag_scores = model(sentence_word, sentence_char, MAX_WORD_LEN)
        loss = loss_function(tag_scores, targets)
        loss.backward()
        optimizer.step()
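For completeness, the predictions can be checked again once training finishes, mirroring the "before training" block above (this is just a sketch reusing the variables already defined; the exact scores depend on the run):

# after training: run the first training sentence through the model again
print('after training')
sentence_word = prepare_sequence(training_data[0][0], word_to_ix)
sent_chars = []
for w in training_data[0][0]:
    sps = ' ' * (MAX_WORD_LEN - len(w))
    sent_chars.extend(list(sps + w) if len(w) < MAX_WORD_LEN else list(w[:MAX_WORD_LEN]))
sentence_char = prepare_sequence(sent_chars, char_to_ix)
tag_scores = model(sentence_word, sentence_char, MAX_WORD_LEN)
print(tag_scores)   # if training converged, the argmax of each row should match DET, NN, V, DET, NN
print('targets:', prepare_sequence(training_data[0][1], tag_to_ix))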