NLP Basics
In [1]:
from sklearn.feature_extraction.text import CountVectorizer
In [2]:
vect = CountVectorizer()
vect
Out[2]:
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None, stop_words=None,
strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
tokenizer=None, vocabulary=None)
Let’s use it to tokenize and count the word occurrences of a minimalistic corpus of text documents:
In [3]:
corpus = ['Hi my name is kanav.', 'I love reading.', 'Kanav loves reading scripts.']
X = vect.fit_transform(corpus)
X  # note the dimensions of X (3x9): 3 rows and 9 columns
Out[3]:
<3x9 sparse matrix of type '<class 'numpy.int64'>'
with 11 stored elements in Compressed Sparse Row format>
The dimensions of X (3x9) correspond to the three documents and the 9 unique words in the corpus. Note that the single-letter word 'I' is missing: the default token_pattern, (?u)\b\w\w+\b, only matches tokens of two or more characters. The extracted vocabulary can be listed directly:
In [4]:
vect.get_feature_names()
Out[4]:
['hi', 'is', 'kanav', 'love', 'loves', 'my', 'name', 'reading', 'scripts']
This is the vocabulary extracted from the given documents.
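If the one-letter token 'I' should be kept, a looser token_pattern can be supplied. A minimal sketch (vect_letters is an illustrative name, not part of the original notebook):
# keep one-letter tokens by loosening the default token_pattern
vect_letters = CountVectorizer(token_pattern=r'(?u)\b\w+\b')
vect_letters.fit(corpus)
vect_letters.get_feature_names()
# expected: ['hi', 'i', 'is', 'kanav', 'love', 'loves', 'my', 'name', 'reading', 'scripts']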
Each term found by the analyzer during the fit is assigned a unique integer index corresponding to a column in the resulting matrix. This interpretation of the columns can be retrieved as follows:
In [5]:
X.toarray()
Out[5]:
array([[1, 1, 1, 0, 0, 1, 1, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 1, 0],
[0, 0, 1, 0, 1, 0, 0, 1, 1]], dtype=int64)
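The term-to-column mapping itself is stored in the fitted vectorizer's vocabulary_ attribute, a plain Python dict. A quick way to inspect it:
vect.vocabulary_               # e.g. {'hi': 0, 'is': 1, 'kanav': 2, ...}
vect.vocabulary_.get('kanav')  # column index of 'kanav' -> 2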
Hence words that were not seen in the training corpus will be completely ignored in future calls to the transform method:
In [6]:
vect.transform(['hi,whats your name?.']).toarray()
Out[6]:
array([[1, 0, 0, 0, 0, 0, 1, 0, 0]])
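Only 'hi' and 'name' were recognized; 'whats' and 'your' are out of vocabulary and silently dropped. The recognized terms of a transformed document can be recovered with inverse_transform (a quick sketch using the same query):
vect.inverse_transform(vect.transform(['hi,whats your name?.']))
# roughly: [array(['hi', 'name'], ...)]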
Normalization and stemming
Since words like 'love' and 'loves' have the same meaning, why not treat them as the same feature?
In [7]:
import nltk
porter = nltk.PorterStemmer()
[porter.stem(t) for t in vect.get_feature_names()]
Out[7]:
['hi', 'is', 'kanav', 'love', 'love', 'my', 'name', 'read', 'script']
Note that 'loves' has now become 'love' (and likewise 'reading' has become 'read', 'scripts' has become 'script').
Taking the set of stems leaves a total of 8 unique features:
In [8]:
list(set([porter.stem(t) for t in vect.get_feature_names()]))
Out[8]:
['kanav', 'hi', 'name', 'is', 'love', 'my', 'script', 'read']
In [9]:
WNlemma = nltk.WordNetLemmatizer()
[WNlemma.lemmatize(t) for t in list(set([porter.stem(t) for t in vect.get_feature_names()]))]
Out[9]:
['kanav', 'hi', 'name', 'is', 'love', 'my', 'script', 'read']
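To feed these normalized tokens back into the bag-of-words model, a common pattern is to give CountVectorizer a custom tokenizer that stems as it tokenizes. A minimal sketch, assuming nltk.word_tokenize and its punkt data are available (stem_tokenize and stem_vect are illustrative names, not part of the original notebook):
porter = nltk.PorterStemmer()

def stem_tokenize(doc):
    # tokenize the document, then reduce each token to its Porter stem;
    # punctuation tokens like '.' will also show up unless filtered out
    return [porter.stem(t) for t in nltk.word_tokenize(doc)]

stem_vect = CountVectorizer(tokenizer=stem_tokenize)
stem_vect.fit_transform(corpus)
stem_vect.get_feature_names()
# 'loves' and 'reading' now collapse into the stems 'love' and 'read'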
Lemmatization
A very similar operation to stemming is lemmatizing. The major difference is that, as you saw above, stemming can create non-existent words, whereas lemmas are actual words. So the root stem, meaning the word you end up with, is not necessarily something you can look up in a dictionary, but a lemma is. Sometimes you will wind up with a very similar word, and sometimes with a completely different one. Let's see some examples.
In [10]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()  # requires the WordNet data: nltk.download('wordnet')
print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))
print(lemmatizer.lemmatize("python"))
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("best", pos="a"))
print(lemmatizer.lemmatize("run"))
print(lemmatizer.lemmatize("run",'v'))
cat
cactus
goose
rock
python
good
best
run
run
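The last two calls show that lemmatize() treats its input as a noun unless told otherwise; a word like 'running' only reduces to 'run' when pos='v' is passed. In practice the part of speech can be inferred with nltk.pos_tag. A minimal sketch, assuming the punkt, wordnet, and averaged_perceptron_tagger data have been downloaded (wordnet_pos is an illustrative helper, not an NLTK function):
from nltk.corpus import wordnet

def wordnet_pos(treebank_tag):
    # map Penn Treebank tags to the four WordNet POS classes
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN  # default, matching lemmatize()'s own assumption

tokens = nltk.word_tokenize("He was running faster than the geese")
[lemmatizer.lemmatize(t, wordnet_pos(tag)) for t, tag in nltk.pos_tag(tokens)]
# e.g. 'running' -> 'run', 'geese' -> 'goose'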