NLP Basics

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
In [2]:
vect = CountVectorizer()
vect
Out[2]:
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
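
All of the parameters shown in the repr above are defaults and can be overridden when the vectorizer is constructed. As a small sketch (not used in the rest of this walkthrough, and bigram_vect is just an illustrative name), a vectorizer that also counts bigrams and drops common English stop words would be built like this:

# count unigrams and bigrams, ignoring common English stop words
bigram_vect = CountVectorizer(ngram_range=(1, 2), stop_words='english')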


Let’s use it to tokenize and count the word occurrences of a minimalistic corpus of text documents:

In [3]:
corpus = ['Hi my name is kanav.', 'I love reading.', 'Kanav loves reading scripts.']
X = vect.fit_transform(corpus)
X  # a 3x9 sparse matrix: 3 documents, 9 unique terms
Out[3]:
<3x9 sparse matrix of type '<class 'numpy.int64'>'
	with 11 stored elements in Compressed Sparse Row format>

Note the dimensions of X: 3x9 means 3 rows and 9 columns, since there are three documents and 9 unique words. Let's list those words:

In [4]:
vect.get_feature_names()
Out[4]:
['hi', 'is', 'kanav', 'love', 'loves', 'my', 'name', 'reading', 'scripts']
These are the unique terms (the vocabulary) extracted from the given documents; each one corresponds to a column of the matrix.

Each term found by the analyzer during the fit is assigned a unique integer index corresponding to a column in the resulting matrix. The counts themselves can be viewed by converting X to a dense array:

In [5]:
X.toarray()
Out[5]:
array([[1, 1, 1, 0, 0, 1, 1, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 1, 0],
       [0, 0, 1, 0, 1, 0, 0, 1, 1]], dtype=int64)
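
The mapping from each term to its column index (that is, the interpretation of the columns) can be read off the fitted vocabulary_ attribute. A minimal check, reusing the vect fitted above; the indices simply follow the alphabetical order of the feature names:

# vocabulary_ maps each term to its column index in X
vect.vocabulary_
# e.g. {'hi': 0, 'is': 1, 'kanav': 2, 'love': 3, 'loves': 4,
#       'my': 5, 'name': 6, 'reading': 7, 'scripts': 8}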
Hence words that were not seen in the training corpus will be completely ignored in future calls to the transform method:

In [6]:
vect.transform(['hi,whats your name?.']).toarray()
Out[6]:
array([[1, 0, 0, 0, 0, 0, 1, 0, 0]])
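
To double-check which of the new document's words were actually recognized, inverse_transform returns, for each row, the vocabulary terms with nonzero counts. A quick sketch, again reusing the fitted vect (here only 'hi' and 'name' are in the vocabulary, while 'whats' and 'your' are not):

# inverse_transform lists the known terms that occur in each document
vect.inverse_transform(vect.transform(['hi,whats your name?.']))
# expected something like: [array(['hi', 'name'], dtype='<U7')]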
Normalization and stemming

Since words like love and loves have the same meaning, why not treat them as the same token?

In [7]:
import nltk
porter = nltk.PorterStemmer()
[porter.stem(t) for t in vect.get_feature_names()]
Out[7]:
['hi', 'is', 'kanav', 'love', 'love', 'my', 'name', 'read', 'script']
Notice that loves has now become love.

Now we have a total of 8 unique features:

In [8]:
list(set([porter.stem(t) for t in vect.get_feature_names()]))
Out[8]:
['kanav', 'hi', 'name', 'is', 'love', 'my', 'script', 'read']
In [9]:
WNlemma = nltk.WordNetLemmatizer()
[WNlemma.lemmatize(t) for t in list(set([porter.stem(t) for t in vect.get_feature_names()]))]
Out[9]:
['kanav', 'hi', 'name', 'is', 'love', 'my', 'script', 'read']
Lemmatization
A very similar operation to stemming is lemmatizing. The major difference between the two, as you saw earlier, is that stemming can often create non-existent words, whereas lemmas are actual words.

So a root stem, the word you end up with after stemming, is not necessarily something you can look up in a dictionary, whereas a lemma always is.

Sometimes you will wind up with a very similar word, and sometimes with a completely different word. Let's see some examples.

In [10]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))
print(lemmatizer.lemmatize("python"))
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("best", pos="a"))
print(lemmatizer.lemmatize("run"))
print(lemmatizer.lemmatize("run",'v'))
cat
cactus
goose
rock
python
good
best
run
run
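
Note that WordNetLemmatizer.lemmatize defaults to pos='n' (noun), which is why run comes back unchanged until pos='v' is passed. Finally, this kind of normalization can be folded into the vectorization step itself by giving CountVectorizer a custom tokenizer. A minimal sketch, reusing the corpus from above; stem_tokenizer is just an illustrative name, and the expected feature list is what the earlier stemming output suggests:

import nltk
from sklearn.feature_extraction.text import CountVectorizer

porter = nltk.PorterStemmer()
base_tokenize = CountVectorizer().build_tokenizer()  # default regex tokenizer: (?u)\b\w\w+\b

def stem_tokenizer(doc):
    # CountVectorizer lowercases the document before tokenizing (lowercase=True),
    # so we only need to split and stem here
    return [porter.stem(t) for t in base_tokenize(doc)]

vect_stem = CountVectorizer(tokenizer=stem_tokenizer)
X_stem = vect_stem.fit_transform(corpus)
vect_stem.get_feature_names()  # newer scikit-learn versions use get_feature_names_out()
# expected: ['hi', 'is', 'kanav', 'love', 'my', 'name', 'read', 'script']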
