Introduction to NLP (Wk.2)
Ch. 2 Text Preprocessing
2-5) Regular Expression
Introduction to RegEx
Python officially supports the 're' module
It helps refine text data that follows a certain pattern
Grammar of RegEx
Symbol | Explanation |
---|---|
. | any single character except \n |
? | the preceding character may appear 0 or 1 time, i.e. {0,1} |
* | the preceding character may appear 0 or more times, i.e. {0,} |
+ | the preceding character appears 1 or more times, i.e. {1,} |
^ | the string must start with the pattern that follows |
$ | the string must end with the pattern that precedes |
{n} | repeat exactly n times |
{n1,n2} | repeat at least n1 and at most n2 times |
{n,} | repeat at least n times |
[characters] | match any one of the characters inside [] |
[range] | match any one character within the range |
[^characters] | match any character except those inside [] |
a\|b | match a or b |
\\ | a literal backslash |
\d | any digit, same as [0-9] |
\D | anything except a digit, same as [^0-9] |
\s | any whitespace character, same as [ \t\n\r\f\v] |
\S | anything except whitespace, same as [^ \t\n\r\f\v] |
\w | any letter, digit, or underscore, same as [a-zA-Z0-9_] |
\W | anything except letters, digits, or underscore, same as [^a-zA-Z0-9_] |
RegEx Module Definition
Function | Explanation |
---|---|
re.compile() | compile a RegEx pattern so it can be reused |
re.search() | scan the whole string for a match with the RegEx; return a Match object if one exists, else None |
re.match() | check whether the beginning of the string matches the RegEx |
re.split() | split the string by the RegEx and return a list |
re.findall() | find every substring that matches the RegEx and return them as a list; return an empty list if there is none |
re.finditer() | find every match of the RegEx in the string and return an iterator of Match objects |
re.sub() | replace the substrings that match the RegEx with a different string |
Example of RegEx in Python
import re

# code A and code B are equivalent
# code A
r = re.compile('ab+c')
r.search('abc')
# code B
re.search('ab+c', 'abc')

# general usage patterns (placeholders)
re.match('what_to_find', 'from_where')
re.split(r'\s', 'from_where')
re.findall('what_to_find', 'from_where')
re.finditer('what_to_find', 'from_where')
re.sub('from_sth', 'to_sth', 'from_where')
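A quick concrete illustration of these functions, continuing with the re module imported above (the sample string is a made-up example):
text = "Regular expression : 010 - 1234 - 5678 phone"
print(re.findall(r'\d+', text))         # every run of digits -> ['010', '1234', '5678']
print(re.split(r'\s+', text))           # split on whitespace
print(re.sub(r'[^a-zA-Z ]', '', text))  # keep only letters and spaces
print(re.search(r'ab+c', 'abc'))        # a Match object, since 'abc' matches 'ab+c'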
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"[\w]+")  # the pattern describes the tokens to keep
print(tokenizer.tokenize("Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry shop"))

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"\s+", gaps=True)  # with gaps=True the pattern describes the separators instead
print(tokenizer.tokenize("Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry shop"))
2-6) Integer Encoding
Introduction to Integer Encoding
Computers process integers better than strings
Sometimes we map each word to a certain integer (index); this mapping is what integer encoding does
Usually, we assign the indexes after sorting the words by frequency
Integer Encoding Using Python Dictionary
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
raw_text = "A barber is a person. a barber is good person. a barber is huge person. he Knew A Secret! The Secret He Kept is huge secret. Huge secret. His barber kept his word. a barber kept his word. His barber kept his secret. But keeping and keeping such a huge secret to himself was driving the barber crazy. the barber went up a huge mountain."
# sentence tokenization
sentences = sent_tokenize(raw_text)
vocab = {}
preprocessed_sentences = []
stop_words = set(stopwords.words('english'))
for sentence in sentences:
    # word tokenization
    tokenized_sentence = word_tokenize(sentence)
    result = []
    for word in tokenized_sentence:
        word = word.lower()  # lowercase, so the same word is not counted twice
        if word not in stop_words:  # remove stop words
            if len(word) > 2:  # remove words of length 2 or less
                result.append(word)
                if word not in vocab:
                    vocab[word] = 0
                vocab[word] += 1
    preprocessed_sentences.append(result)
# sort by frequency
vocab_sorted = sorted(vocab.items(), key = lambda x:x[1], reverse = True)
word_to_index = {}
i = 0
for (word, frequency) in vocab_sorted:
    if frequency > 1:  # remove words with low frequency
        i = i + 1
        word_to_index[word] = i
vocab_size = 5
words_frequency = [word for word, index in word_to_index.items() if index >= vocab_size + 1] # collect words whose index exceeds 5
for w in words_frequency:
    del word_to_index[w]  # remove index information
word_to_index['OOV'] = len(word_to_index) + 1
encoded_sentences = []
for sentence in preprocessed_sentences:
    encoded_sentence = []
    for word in sentence:
        try:
            encoded_sentence.append(word_to_index[word])
        except KeyError:
            encoded_sentence.append(word_to_index['OOV'])
    encoded_sentences.append(encoded_sentence)
Changing text into numbers marks the point where actual 'processing' starts.
Therefore, we have to finish all the preprocessing that is only possible in text form before this step.
A lower index means a higher frequency.
The reason we remove words with low frequency is that they are often meaningless in NLP.
Because of this, some words are not in the word_to_index dictionary; we call them OOV (Out-Of-Vocabulary).
We add 'OOV' as the last index.
Then, we encode every word in the sentences with its mapped integer.
In practice, we often use Counter, FreqDist, enumerate, or the Keras Tokenizer rather than a plain Python dictionary.
In the code above,
vocab = (dictionary) {unique word: its frequency}
vocab_sorted = (list) [(unique word, its frequency)], sorted by frequency in descending order
word_to_index = (dictionary) {unique word: its index}, sorted by index in ascending order
Integer Encoding Using Counter
from collections import Counter
all_words_list = sum(preprocessed_sentences, [])
# or you can use 'all_words_list = np.hstack(preprocessed_sentences)' instead
# count word frequency using 'Counter' module in Python
vocab = Counter(all_words_list)
vocab_size = 5
vocab = vocab.most_common(vocab_size) # keep only the top 5 most frequent words
word_to_index = {}
i = 0
for (word, frequency) in vocab:
    i = i + 1
    word_to_index[word] = i
In the code above, 'preprocessed_sentences' is already tokenized into words.
Counter() : removes duplicate words and counts each word's frequency
most_common(n) : returns the top n words with the highest frequency
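A small standalone illustration of these two (the toy list is a made-up example, not the barber corpus):
from collections import Counter

toy = ['barber', 'secret', 'barber', 'huge', 'barber']
counts = Counter(toy)            # Counter({'barber': 3, 'secret': 1, 'huge': 1})
print(counts.most_common(2))     # [('barber', 3), ('secret', 1)]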
Integer Encoding Using NLTK's FreqDist
from nltk import FreqDist
import numpy as np
# flatten the tokenized sentences into a single array with np.hstack
vocab = FreqDist(np.hstack(preprocessed_sentences))
vocab_size = 5
vocab = vocab.most_common(vocab_size) # store only top 5 words with high frequency
word_to_index = {word[0] : index + 1 for index, word in enumerate(vocab)}
enumerate() is useful when assigning indexes
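A minimal sketch of what enumerate() does (the word list below is a made-up example):
# enumerate() yields (index, item) pairs, counting from 0
for index, word in enumerate(['barber', 'secret', 'huge']):
    print(index + 1, word)   # prints: 1 barber / 2 secret / 3 huge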
Integer Encoding Using Keras
from tensorflow.keras.preprocessing.text import Tokenizer
preprocessed_sentences = [['barber', 'person'], ['barber', 'good', 'person'], ['barber', 'huge', 'person'], ['knew', 'secret'], ['secret', 'kept', 'huge', 'secret'], ['huge', 'secret'], ['barber', 'kept', 'word'], ['barber', 'kept', 'word'], ['barber', 'kept', 'secret'], ['keeping', 'keeping', 'huge', 'secret', 'driving', 'barber', 'crazy'], ['barber', 'went', 'huge', 'mountain']]
tokenizer = Tokenizer()
# passing the corpus to fit_on_texts() builds the vocabulary based on word frequency
tokenizer.fit_on_texts(preprocessed_sentences)
# show how indexes are assigned to words
tokenizer.word_index
# show unique words and their frequencies
tokenizer.word_counts
# convert each word in the corpus to its assigned index
tokenizer.texts_to_sequences(preprocessed_sentences)
### If we want to use only the top 5 most frequent words for texts_to_sequences ###
vocab_size = 5
tokenizer = Tokenizer(num_words = vocab_size + 1) # see description below for the reason
tokenizer.fit_on_texts(preprocessed_sentences)
### If we want to use only the top 5 most frequent words for word_index & word_counts as well ###
tokenizer = Tokenizer()
tokenizer.fit_on_texts(preprocessed_sentences)
vocab_size = 5
words_frequency = [word for word, index in tokenizer.word_index.items() if index >= vocab_size + 1] # collect words whose index exceeds 5
for word in words_frequency:
    del tokenizer.word_index[word]  # delete index information
    del tokenizer.word_counts[word] # delete count information
### If we want to keep words that are not in the word list as 'OOV' ###
vocab_size = 5
tokenizer = Tokenizer(num_words = vocab_size + 2, oov_token = 'OOV')
tokenizer.fit_on_texts(preprocessed_sentences)
The reason why we add +1 to the num_words value:
num_words counts from 0, so if we pass 5, it keeps only the range 0~4, meaning only the words with index 1 to 4 would remain.
Therefore, if we want to keep the words with index 1 to 5, we need to pass 5 + 1 rather than just 5.
(Likewise, when oov_token is used, Keras reserves index 1 for 'OOV' by default, which is why vocab_size + 2 is passed above.)
The reason why the Keras Tokenizer counts index 0 even though no word is actually assigned to it is a process called 'padding'.
This will be explained in the next chapter.
2-7) Padding
Introduction to Padding
When processing natural language, each sentence (or document) may have a different length.
If the documents all have the same length, the computer can treat them as a single matrix and process them all together.
Therefore, we sometimes need to adjust the documents to the same length.
Padding with NumPy
# What we did in the last chapter: encoding words to their assigned integers
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
preprocessed_sentences = [['barber', 'person'], ['barber', 'good', 'person'], ['barber', 'huge', 'person'], ['knew', 'secret'], ['secret', 'kept', 'huge', 'secret'], ['huge', 'secret'], ['barber', 'kept', 'word'], ['barber', 'kept', 'word'], ['barber', 'kept', 'secret'], ['keeping', 'keeping', 'huge', 'secret', 'driving', 'barber', 'crazy'], ['barber', 'went', 'huge', 'mountain']]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(preprocessed_sentences)
encoded = tokenizer.texts_to_sequences(preprocessed_sentences)
Then, we apply padding to this encoded data
max_len = max(len(item) for item in encoded) # the longest sentence's length
# We then pad the other sentences to the same length as the longest one
# We suppose there is an imaginary word 'PAD' whose index is 0
for sentence in encoded:
    while len(sentence) < max_len:
        sentence.append(0)
padded_np = np.array(encoded)
For sentences shorter than 7 (the longest length here), 0 has been appended at the end to make their length 7.
Now the computer can treat them as one matrix and process them in parallel.
The 0th word is meaningless, so we will ignore it when processing natural language.
Adjusting the data's shape (or size) by filling it with a certain value is called 'padding'.
If we use the number 0, as above, it is called 'zero padding'.
Padding with Keras Preprocessing Tools
from tensorflow.keras.preprocessing.sequence import pad_sequences
padded = pad_sequences(encoded)
# If we want to fill 0 at the end (as in the NumPy example) rather than at the front
padded = pad_sequences(encoded, padding = 'post')
# It is not necessary to pad to the longest sentence; we can set the maximum length ourselves
padded = pad_sequences(encoded, padding = 'post', maxlen = 5) # in this case, if a sentence is longer than 5, data is lost (truncated)
# Usually we pad with 0, but we can use another number as well. The code below pads with a number one larger than the size of the word set
last_value = len(tokenizer.word_index) + 1
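For instance, pad_sequences accepts a 'value' argument, so padding with last_value instead of 0 could look like the line below (a sketch; the notes above stop at computing last_value):
padded = pad_sequences(encoded, padding = 'post', value = last_value)  # pad with last_value instead of 0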
2-8) One-Hot Encoding
Introduction to One-Hot Encoding
Computers handle numbers better than text.
Therefore, in NLP there are many techniques for converting text into numbers.
'One-hot encoding' is the most basic technique for representing words.
Before moving on to one-hot encoding, we first build a *vocabulary.
Then, we do integer encoding.
If there are 5,000 different words in the text, the size of the vocabulary is 5,000.
And each of those 5,000 words is assigned its own index.
- *vocabulary: the set of distinct words.
Note that 'book' and 'books' also count as different words.
We set the dimension of the vector to the size of the vocabulary.
Then, we put 1 at the position of the word we want to express, and 0 everywhere else.
This vector is called a one-hot vector.
One-Hot Encoding Function in Python
def one_hot_encoding(word, word2index):
    one_hot_vector = [0] * (len(word2index))
    index = word2index[word]
    one_hot_vector[index] = 1
    return one_hot_vector
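Example usage with a small hypothetical word2index (the word set and its 0-based indexes are assumptions for illustration):
word2index = {'barber': 0, 'secret': 1, 'huge': 2, 'kept': 3, 'person': 4}
print(one_hot_encoding('secret', word2index))  # [0, 1, 0, 0, 0]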
2-9) Splitting Data
Introduction to Supervised Learning
Data for supervised learning consists of 'question' data and 'answer' data (also known as labels)
We split the data as below:
- train data
  - X_train : questions for training
  - y_train : answers for training
- test data
  - X_test : questions for testing
  - y_test : answers for testing
The computer trains on the train data and makes predictions on X_test.
Then we compare its predictions with y_test and measure its accuracy.
A minimal example of splitting data is sketched below.
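A minimal sketch of such a split using scikit-learn's train_test_split (the library choice, the toy data, and the 8:2 ratio are assumptions, not part of these notes):
from sklearn.model_selection import train_test_split

# hypothetical toy data: X holds the 'questions', y holds the 'answers' (labels)
X = [[i] for i in range(10)]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
# hold out 20% of the data for testing; random_state fixes the shuffle for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)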
2-10) Text Preprocessing Tools for Korean Text
PyKoSpacing
It converts sentences without spacing into sentences with proper spacing.
Py-Hanspell
It is based on Naver Hangul Spell Checker.
It corrects spacing as well.
SOYNLP
It is a word tokenizer that supports POS tagging and word tokenization.
It is based on unsupervised learning and analyzes frequently occurring words in the data.
Internally, it operates with a word score table.
The scores are based on 'cohesion probability' and 'branching entropy'.
It can handle newly coined words, such as the name of a newly debuted idol group; a minimal usage sketch follows.
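A sketch assuming SOYNLP's WordExtractor API (train on a corpus of Korean sentences, then read scores from the extracted table); the tiny corpus and the exact attribute names are assumptions:
from soynlp.word import WordExtractor

corpus = ["예시 문장 입니다", "예시 문장 하나 더"]   # tiny toy corpus; real use needs much more text
word_extractor = WordExtractor()
word_extractor.train(corpus)                 # learn word statistics from the corpus
word_score_table = word_extractor.extract()  # the word score table described above
# each entry exposes scores such as cohesion_forward and right_branching_entropy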
Ch. 3 Language Model
3-1) Introduction to Language Model
Introduction to Language Model
A language model is a model that assigns a probability to a word sequence (sentence).
Nowadays, we usually use neural-network-based models rather than statistics-based models.
Emerging technologies such as GPT and BERT are also based on neural-network language models.
The most common case is predicting the next word from the given previous words. This is called 'language modeling'.
Alternatively, a model may predict a word in between given words.
Assigning Probabilities to Word Sequences
This can be applied in fields like
- Machine Translation
- Spell Correction
- Speech Recognition
3-2) Statistical Language Model, SLM
Probability of a Sentence
The probability of a sentence is the product of the conditional probabilities of each word given the words before it.
For example, the probability of 'An adorable little boy is spreading smiles' is:
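Written out with the chain rule, this becomes:
P(An adorable little boy is spreading smiles) = P(An) × P(adorable | An) × P(little | An adorable) × P(boy | An adorable little) × P(is | An adorable little boy) × P(spreading | An adorable little boy is) × P(smiles | An adorable little boy is spreading)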
3-3) N-gram Language Model
Introduction to N-gram Language Model
It uses a count-based statistical approach.
Therefore, it is a form of SLM as well.
However, instead of conditioning on every previous word, it considers only some of them.
We decide how many words to consider, and that number is the 'n' in 'n-gram'.
Reducing the Cases Where a Sequence Cannot Be Counted in the Corpus
The longer the sentence whose probability we want to compute, the more likely it is that it cannot be counted in the corpus,
simply because the full sequence does not exist there.
This is the limitation of SLM: the target sentence may not appear in the training corpus.
Reducing the number of words we condition on increases the chance that the (shorter) sequence can be counted; a small n-gram example is sketched below.
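A minimal sketch of extracting n-grams with NLTK's ngrams helper (the example sentence is reused from the SLM section above):
from nltk import word_tokenize
from nltk.util import ngrams

tokens = word_tokenize("An adorable little boy is spreading smiles")
print(list(ngrams(tokens, 2)))  # bigrams: each word is predicted from only 1 previous word
print(list(ngrams(tokens, 3)))  # trigrams: each word is predicted from 2 previous words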
3-4) Language Model for Korean Sentences
Korean sentences are harder to build language models for, for the following reasons:
- word order is not strict, so many orderings are possible
- Korean is an agglutinative language
- spacing is often inconsistent
3-5) Perplexity
Extrinsic Evaluation
To compare the performance of different models, we can plug each of them into a task such as spell checking, machine translation, or speech recognition,
and see which model performs better.
However, this takes too much time when comparing more than two models.
Therefore, we also use 'intrinsic evaluation', which can be less accurate than extrinsic evaluation but is faster.
It quantifies the model's performance using a metric computed from the model itself.
Perplexity, PPL
Perplexity is an intrinsic evaluation metric for language models.
It is often shortened to PPL.
The lower the PPL, the better the language model's performance.
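For reference, the standard definition (not spelled out in the notes above): for a test sequence W = w_1, w_2, ..., w_N,
PPL(W) = P(w_1, w_2, ..., w_N)^(-1/N)
i.e. the inverse probability of the test data, normalized by the number of words N.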
3-6) Conditional Probability
Nothing special to note here.
Author And Source
This article, 'Introduction to NLP (Wk.2)', was originally published at https://velog.io/@jongbeen_song/Introduction-to-NLP-Wk.2. The original author's information is available at that URL, and the copyright belongs to the original author. (Collected and shared under the CC protocol.)