NLP Basics

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
In [2]:
vect = CountVectorizer()
vect
Out[2]:
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
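
All of the parameters shown in the repr above are defaults and can be overridden when the vectorizer is constructed. As a small sketch (not used in the rest of this walkthrough, and bigram_vect is just an illustrative name), a vectorizer that also counts bigrams and drops common English stop words would be built like this:

# count unigrams and bigrams, ignoring common English stop words
bigram_vect = CountVectorizer(ngram_range=(1, 2), stop_words='english')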


Let’s use it to tokenize and count the word occurrences of a minimalistic corpus of text documents:

In [3]:
corpus = ['Hi my name is kanav.', 'I love reading.', 'Kanav loves reading scripts.']
X = vect.fit_transform(corpus)
X  # a 3x9 sparse matrix: 3 documents, 9 unique terms
Out[3]:
<3x9 sparse matrix of type '<class 'numpy.int64'>'
	with 11 stored elements in Compressed Sparse Row format>

Note the dimensions of X: 3x9 means 3 rows and 9 columns, since there are three documents and 9 unique words. Let's list those words:

In [4]:
vect.get_feature_names()
Out[4]:
['hi', 'is', 'kanav', 'love', 'loves', 'my', 'name', 'reading', 'scripts']
These are the unique terms (the vocabulary) extracted from the given documents; each one corresponds to a column of the matrix.

Each term found by the analyzer during the fit is assigned a unique integer index corresponding to a column in the resulting matrix. The counts themselves can be viewed by converting X to a dense array:

In [5]:
X.toarray()
Out[5]:
array([[1, 1, 1, 0, 0, 1, 1, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 1, 0],
       [0, 0, 1, 0, 1, 0, 0, 1, 1]], dtype=int64)
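
The mapping from each term to its column index (that is, the interpretation of the columns) can be read off the fitted vocabulary_ attribute. A minimal check, reusing the vect fitted above; the indices simply follow the alphabetical order of the feature names:

# vocabulary_ maps each term to its column index in X
vect.vocabulary_
# e.g. {'hi': 0, 'is': 1, 'kanav': 2, 'love': 3, 'loves': 4,
#       'my': 5, 'name': 6, 'reading': 7, 'scripts': 8}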
Hence words that were not seen in the training corpus will be completely ignored in future calls to the transform method:

In [6]:
vect.transform(['hi,whats your name?.']).toarray()
Out[6]:
array([[1, 0, 0, 0, 0, 0, 1, 0, 0]])
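
To double-check which of the new document's words were actually recognized, inverse_transform returns, for each row, the vocabulary terms with nonzero counts. A quick sketch, again reusing the fitted vect (here only 'hi' and 'name' are in the vocabulary, while 'whats' and 'your' are not):

# inverse_transform lists the known terms that occur in each document
vect.inverse_transform(vect.transform(['hi,whats your name?.']))
# expected something like: [array(['hi', 'name'], dtype='<U7')]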
Normalization and stemming

Since words like love and loves have the same meaning, why not treat them as the same token?

In [7]:
import nltk
porter = nltk.PorterStemmer()
[porter.stem(t) for t in vect.get_feature_names()]
Out[7]:
['hi', 'is', 'kanav', 'love', 'love', 'my', 'name', 'read', 'script']
Notice that loves has now become love.

Now we have a total of 8 unique features:

In [8]:
list(set([porter.stem(t) for t in vect.get_feature_names()]))
Out[8]:
['kanav', 'hi', 'name', 'is', 'love', 'my', 'script', 'read']
In [9]:
WNlemma = nltk.WordNetLemmatizer()
[WNlemma.lemmatize(t) for t in list(set([porter.stem(t) for t in vect.get_feature_names()]))]
Out[9]:
['kanav', 'hi', 'name', 'is', 'love', 'my', 'script', 'read']
Lemmatization
A very similar operation to stemming is lemmatizing. The major difference between the two, as you saw earlier, is that stemming can often create non-existent words, whereas lemmas are actual words.

So a root stem, the word you end up with after stemming, is not necessarily something you can look up in a dictionary, whereas a lemma always is.

Sometimes you will wind up with a very similar word, and sometimes with a completely different word. Let's see some examples.

In [10]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))
print(lemmatizer.lemmatize("python"))
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("best", pos="a"))
print(lemmatizer.lemmatize("run"))
print(lemmatizer.lemmatize("run",'v'))
cat
cactus
goose
rock
python
good
best
run
run
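
Note that WordNetLemmatizer.lemmatize defaults to pos='n' (noun), which is why run comes back unchanged until pos='v' is passed. Finally, this kind of normalization can be folded into the vectorization step itself by giving CountVectorizer a custom tokenizer. A minimal sketch, reusing the corpus from above; stem_tokenizer is just an illustrative name, and the expected feature list is what the earlier stemming output suggests:

import nltk
from sklearn.feature_extraction.text import CountVectorizer

porter = nltk.PorterStemmer()
base_tokenize = CountVectorizer().build_tokenizer()  # default regex tokenizer: (?u)\b\w\w+\b

def stem_tokenizer(doc):
    # CountVectorizer lowercases the document before tokenizing (lowercase=True),
    # so we only need to split and stem here
    return [porter.stem(t) for t in base_tokenize(doc)]

vect_stem = CountVectorizer(tokenizer=stem_tokenizer)
X_stem = vect_stem.fit_transform(corpus)
vect_stem.get_feature_names()  # newer scikit-learn versions use get_feature_names_out()
# expected: ['hi', 'is', 'kanav', 'love', 'my', 'name', 'read', 'script']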
