Use PyTorch's BERT model to obtain sentence vectors in preparation for downstream NLP tasks.

1. Install pytorch-pretrained-bert
pip install pytorch-pretrained-bert

My Python version is 3.6.
2. Download the model and vocabulary
Model and vocabulary location: https://s3.amazonaws.com/models.huggingface.co
For example, download bert-base-cased.tar.gz:
https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased.tar.gz
https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt
Put both files into a folder such as bert-base-cased_file. Make sure the folder name is not the same as the model name (otherwise from_pretrained resolves the name to the remote model instead of your local directory). Rename bert-base-cased-vocab.txt to vocab.txt and extract bert-base-cased.tar.gz inside that folder.
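For reference, this setup step can also be scripted. Below is a minimal sketch (not part of the original steps), assuming the target folder is named bert-base-cased_file and using only the Python standard library:

# Sketch of the download/rename/extract step described above.
# Assumption: the folder is called bert-base-cased_file (any name that
# differs from the model name works).
import os
import tarfile
import urllib.request

BASE = "https://s3.amazonaws.com/models.huggingface.co/bert"
DEST = "bert-base-cased_file"
os.makedirs(DEST, exist_ok=True)

# Download the model archive and the vocabulary file (renamed to vocab.txt).
urllib.request.urlretrieve(BASE + "/bert-base-cased.tar.gz",
                           os.path.join(DEST, "bert-base-cased.tar.gz"))
urllib.request.urlretrieve(BASE + "/bert-base-cased-vocab.txt",
                           os.path.join(DEST, "vocab.txt"))

# Unpack the archive so the folder contains the config and model weights.
with tarfile.open(os.path.join(DEST, "bert-base-cased.tar.gz")) as tar:
    tar.extractall(DEST)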
3. Obtain the hidden-layer vectors and the pooled output vector passed to the next stage (the example below uses the uncased model, set up the same way)
(Reference: BERT Word Embeddings Tutorial, https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/)

import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM

# OPTIONAL: if you want to have more information on what's happening,
# activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)

# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased_file')

# Tokenized input
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = tokenizer.tokenize(text)

# Mask a token that we will try to predict back with `BertForMaskedLM`
masked_index = 8
tokenized_text[masked_index] = '[MASK]'
assert tokenized_text == ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']

# Convert tokens to vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
# Define sentence A and B indices associated to 1st and 2nd sentences (see paper)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])

# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased_file')
model.eval()

# If you have a GPU, put everything on cuda
tokens_tensor = tokens_tensor.to('cuda')
segments_tensors = segments_tensors.to('cuda')
model.to('cuda')

# Predict hidden states features for each layer
with torch.no_grad():
    encoded_layers, pooled_output = model(tokens_tensor, segments_tensors)
    print(pooled_output)

# We have hidden states for each of the 12 layers in model bert-base-uncased
assert len(encoded_layers) == 12
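For orientation, the two return values can be inspected as follows; this small sketch continues the script above and assumes bert-base-uncased (12 layers, hidden size 768) and the 14-token example sentence.

# encoded_layers is a list with one tensor per Transformer layer;
# pooled_output is the [CLS] vector passed through the pooling head.
print(len(encoded_layers))        # 12 layers for bert-base-uncased
print(encoded_layers[-1].shape)   # torch.Size([1, 14, 768]) -> [batch, tokens, hidden]
print(pooled_output.shape)        # torch.Size([1, 768])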
 
Q: Where do you get the fixed representation? Did you do pooling or something?
A: I take the second-to-last hidden layer of all of the tokens in the sentence and do average pooling.
Q: Why not use the hidden state of the first token, i.e. the [CLS] token?
A: Because a pre-trained model is not fine-tuned on any downstream tasks yet. In this case, the hidden state of [CLS] is not a good sentence representation. If you later fine-tune the model, you may use get_pooled_output() to get the fixed-length representation as well.
Q: Why not the last hidden layer? Why second-to-last?
A: The last layer is too close to the target functions (i.e. masked language model and next sentence prediction) during pre-training, and therefore may be biased toward those targets.
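To turn this advice into code, a minimal sketch that continues the script from section 3: average-pool the second-to-last hidden layer over the token dimension to obtain a fixed-length sentence vector.

# Sentence vector via average pooling of the second-to-last hidden layer,
# as suggested in the Q&A above.
second_to_last = encoded_layers[-2]           # [1, 14, 768]: token vectors from layer 11
sentence_vector = second_to_last.mean(dim=1)  # average over tokens -> [1, 768]
print(sentence_vector.shape)
# Note: for real use you may want to exclude [CLS]/[SEP]/padding positions
# before averaging.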
Reference: https://pythonawesome.com/mapping-a-variable-length-sentence-to-a-fixed-length-vector-using-pretrained-bert-model/
