Count-based Representation

NLP

Using machine learning to understand the meaning and structure of text

Natural Language
Language that people use in everyday life, as distinguished from artificially constructed languages
Natural Language Understanding (NLU)
- Classification
  - Sentiment Analysis
- Natural Language Inference (NLI)
- Machine Reading Comprehension (MRC), Question Answering (QA)
- Part-of-speech tagging (POS tagging), Named Entity Recognition (NER)
Natural Language Generation (NLG)
- Text generation (generating text for a specific domain)
NLU & NLG
- Machine Translation
- Summarization
- Chatbot
Other
- TTS(Text to Speech)
- STT(Speech to Text)
- Image Captioning
Vectorization
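The process of converting text into numerical form so that a computer can compute with it
Count-based Representation
Vectorizes a document based on how many times each word appears in that document (or sentence)
- Bag of Words
  Represents a sentence as numbers according to word frequency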
```
from sklearn.feature_extraction.text import CountVectorizer
```
  - Limitation
    - Sparsity
    - Frequent words carry disproportionate weight
    - Word order is ignored
    - Out of vocabulary
- TF-IDF (Term Frequency-Inverse Document Frequency)
  Converts each word into a numeric weight that reflects how relevant it is to a document
  Gives higher weight to words that appear only in particular documents
```
from sklearn.feature_extraction.text import TfidfVectorizer
```
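
The two vectorizers above can be compared on a small corpus. This is a minimal sketch, assuming scikit-learn is installed; the corpus and variable names are made up for illustration. It shows raw counts, the word-order and out-of-vocabulary limitations of Bag of Words, and how TF-IDF favors a word that appears in only one document.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, made up for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Bag of Words: one column per vocabulary word, values are raw counts.
count_vec = CountVectorizer()
bow = count_vec.fit_transform(corpus).toarray()
the_count = bow[0, count_vec.vocabulary_["the"]]  # count of "the" in document 0

# Word order is ignored: two different orderings give the same vector.
a = count_vec.transform(["cat sat on the mat"]).toarray()
b = count_vec.transform(["mat sat on the cat"]).toarray()
same_vector = bool((a == b).all())

# Out of vocabulary: a word unseen during fit is silently dropped.
oov_total = int(count_vec.transform(["zebra"]).toarray().sum())

# TF-IDF: "cat" occurs only in document 0, "sat" occurs in documents 0 and 1.
# Both occur once in document 0, so any difference comes from the IDF term.
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(corpus).toarray()
cat_weight = tfidf[0, tfidf_vec.vocabulary_["cat"]]
sat_weight = tfidf[0, tfidf_vec.vocabulary_["sat"]]

print(the_count, same_vector, oov_total, cat_weight > sat_weight)
```

Looking words up through `vocabulary_` keeps the sketch independent of column order, which depends on the fitted vocabulary.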

Distributed Representation

Encoding
Convert text to vector
- One Hot Encoding
  - Limitation
    It does not capture similarity
    >> Every pair of distinct words is the same distance apart
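
The limitation can be checked numerically. A minimal sketch, assuming NumPy and a hypothetical three-word vocabulary: every pair of distinct one-hot vectors has cosine similarity 0 and the same Euclidean distance, so "king" is no closer to "queen" than to "apple".

```python
import numpy as np

vocab = ["king", "queen", "apple"]   # hypothetical vocabulary
one_hot = np.eye(len(vocab))         # row i is the one-hot vector of vocab[i]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return float(np.linalg.norm(u - v))

king, queen, apple = one_hot

# Similarities and distances are identical for every pair of distinct words.
print(cosine(king, queen), cosine(king, apple))
print(euclidean(king, queen), euclidean(king, apple))
```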
Embedding
Dense vectors that capture similarity
Word2Vec
A word embedding method
Similarity is learned from neighboring words
Word2Vec encodes that similarity in the vector representation
```
# Notebook cell: the leading "!" runs pip as a shell command
!pip install gensim --upgrade
import gensim
```
- Skip-gram
- Word2Vec Training
  After training, the hidden-layer weights serve as the word vectors
- Limitation
  - OOV (Out of Vocabulary)
    The skip-gram model ignores the morphology (internal structure) of words, so a word not seen during training gets no vector
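
To make the skip-gram idea concrete, here is a rough sketch of how (center, context) training pairs can be built from a sentence, and why the hidden layer acts as the embedding table. The function name `skipgram_pairs`, the window size, and the 3-dimensional untrained weight matrix `W_in` are illustrative assumptions, not gensim's implementation.

```python
import numpy as np

def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the quick brown fox".split()
pairs = skipgram_pairs(tokens, window=1)
print(pairs)

# Multiplying a one-hot input by the input weight matrix selects one row,
# so each row of the (here untrained, random) matrix is a word's vector.
vocab = sorted(set(tokens))
rng = np.random.default_rng(0)
W_in = rng.normal(size=(len(vocab), 3))      # hypothetical embedding size 3
fox_id = vocab.index("fox")
hidden = np.eye(len(vocab))[fox_id] @ W_in   # hidden-layer activation for "fox"
print(np.allclose(hidden, W_in[fox_id]))
```

A word that is not in `vocab` has no row in `W_in`, which is the OOV limitation noted above.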
FastText
- Subword Training
- Handles OOV words
- Handles rare words
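
Subword training can be illustrated by extracting character n-grams. This is a simplified sketch: the helper `char_ngrams` is hypothetical, and it only mimics the idea of wrapping a word in `<` and `>` and slicing it into n-grams (the 3 to 6 range is an assumed default). An unseen word such as "runner" still shares n-grams with a seen word such as "running", which is what lets a subword model build a vector for it.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of `word` wrapped in boundary markers < and >."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

# Trigrams only, to keep the output short.
where_grams = char_ngrams("where", 3, 3)
print(where_grams)

# n-grams shared between a seen word and an unseen (OOV) word.
shared = set(char_ngrams("running", 3, 3)) & set(char_ngrams("runner", 3, 3))
print(sorted(shared))
```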

Language Modeling with RNN

RNN

A neural network model for processing sequential data

Sequential Data
Data in which the meaning of each unit changes depending on the order in which it appears
Limitation
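During backpropagation (Back Propagation Through Time, BPTT), the derivative of tanh, the RNN's activation function, is multiplied in at every time step together with the recurrent weight. As the sequence gets longer, the gradient either barely propagates or propagates excessively.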
- Gradient Vanishing
- Gradient Exploding
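
Both failure modes can be seen in a toy scalar RNN. This is a deliberately simplified sketch, assuming a one-dimensional state `h_t = tanh(w * h_{t-1} + x)` with zero input and zero initial state, so that the tanh derivative stays at 1 and the gradient through 50 steps reduces to `w ** 50`. The function name and the weight values are illustrative.

```python
import numpy as np

def grad_through_time(w, steps, h0=0.0, x=0.0):
    """d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x)."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = np.tanh(w * h + x)
        grad *= w * (1.0 - h ** 2)   # chain rule: tanh'(z) = 1 - tanh(z)^2
    return grad

vanishing = grad_through_time(w=0.5, steps=50)   # shrinks toward 0
exploding = grad_through_time(w=1.5, steps=50)   # grows very large
print(vanishing, exploding)
```

With a nonzero state the tanh derivative drops below 1, which pushes the product further toward vanishing.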

LSTM

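An RNN with added gates that regulate the magnitude of gradient information (Solution for long sequence issue)

forget gate
How much of the past information should be kept?
input gate
How much of the newly input information should be used?
output gate
How much of the output computed from those two pieces of information should be passed on?
cell-state
The cell state is carried forward without passing through an activation function, so gradient information is largely preserved during backpropagation.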
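
The gates above can be sketched as a single LSTM time step in NumPy. This follows the commonly used LSTM formulation and is not taken from any particular library; the weight layout (one stacked matrix `W` of shape `(4*H, H+D)`), the sizes, and the random inputs are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step. W has shape (4*H, H+D), b has shape (4*H,)."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b
    f = sigmoid(z[0:H])            # forget gate: how much of c_prev to keep
    i = sigmoid(z[H:2 * H])        # input gate: how much new information to use
    o = sigmoid(z[2 * H:3 * H])    # output gate: how much to pass on
    g = np.tanh(z[3 * H:4 * H])    # candidate values for the cell state
    c = f * c_prev + i * g         # additive cell-state update
    h = o * np.tanh(c)             # hidden state exposed to the next layer
    return h, c

rng = np.random.default_rng(0)
D, H = 3, 4                                    # hypothetical input and hidden sizes
W = rng.normal(scale=0.1, size=(4 * H, H + D))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):              # a random 5-step sequence
    h, c = lstm_step(x, h, c, W, b)
print(h.shape, c.shape)
```

Note that `c` is updated by elementwise scaling and addition only; the tanh is applied when producing `h`, not on the path from `c_prev` to `c`.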

Author And Source

For more on this topic (NLP(Natural Language Processing)), see the original post: https://velog.io/@shenanigans/NLPNatural-Language-Processing
