[Machine Learning] LDA topic classification with scikit-learn

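About LDA topic classification

LDA = latent Dirichlet allocation

LDA assumes that each word in a document belongs to a hidden topic (subject, category) and that the document was generated from those topics according to some probability distribution. Under that assumption, it infers the topic each word belongs to.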

Paper: http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf

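α: parameter of the Dirichlet prior from which each document's topic distribution is drawn

β: parameter that gives the word distribution of each topic

θ: parameter of the multinomial distribution over topics (per document)

w: word

z: topic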
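The generative assumption described above can be sketched in a few lines of plain Python. This is only an illustration: the vocabulary, the values of `alpha` and `beta`, and the helper names `sample_dirichlet` and `generate_document` are made up for this sketch and do not come from the paper or from scikit-learn.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(alpha, beta, vocab, n_words, rng):
    """Generate one document following the LDA generative story.

    alpha: Dirichlet parameter for the per-document topic mixture theta
    beta:  per-topic word probabilities, beta[k][v] = P(word v | topic k)
    """
    theta = sample_dirichlet(alpha, rng)  # topic mixture of this document
    topics = list(range(len(alpha)))
    assignments, words = [], []
    for _ in range(n_words):
        z = rng.choices(topics, weights=theta)[0]   # hidden topic of this word
        w = rng.choices(vocab, weights=beta[z])[0]  # word drawn from that topic
        assignments.append(z)
        words.append(w)
    return theta, assignments, words


# Toy values (assumptions for illustration only)
vocab = ["nhl", "league", "mac", "apple"]
beta = [
    [0.45, 0.45, 0.05, 0.05],  # topic 0: sports-like words
    [0.05, 0.05, 0.45, 0.45],  # topic 1: computer-like words
]
alpha = [0.5, 0.5]

rng = random.Random(0)
theta, assignments, words = generate_document(alpha, beta, vocab, 8, rng)
print(theta)
print(list(zip(assignments, words)))
```

Fitting LDA runs this story in reverse: only the words w are observed, and θ, z, and the topic-word distributions are estimated from them.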

Here, LDA is used to check whether documents can be classified by topic.

Dataset

The 20 Newsgroups dataset is used for the experiment.

It contains about 20,000 documents in 20 categories.

The 20 categories are:

comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
talk.politics.misc
talk.politics.guns
talk.politics.mideast
talk.religion.misc
alt.atheism
misc.forsale
soc.religion.christian

This time, the following four categories are used:

'rec.sport.baseball': baseball

'rec.sport.hockey': hockey

'comp.sys.mac.hardware': Mac hardware

'comp.windows.x': X Window System (not Microsoft Windows)

Training

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import mglearn
import numpy as np

# Data: load the four selected categories
categories = ['rec.sport.baseball', 'rec.sport.hockey',
              'comp.sys.mac.hardware', 'comp.windows.x']
twenty_train = fetch_20newsgroups(subset='train', categories=categories,
                                  shuffle=True, random_state=42)
twenty_test = fetch_20newsgroups(subset='test', categories=categories,
                                 shuffle=True, random_state=42)
# Drop English stop words, terms that appear in more than 10% of the
# documents, and terms that appear in fewer than 5 documents
tfidf_vec = TfidfVectorizer(lowercase=True, stop_words='english',
                            max_df=0.1, min_df=5).fit(twenty_train.data)
X_train = tfidf_vec.transform(twenty_train.data)
X_test = tfidf_vec.transform(twenty_test.data)

# get_feature_names() was removed in scikit-learn 1.2;
# on versions before 1.0, call get_feature_names() instead
feature_names = tfidf_vec.get_feature_names_out()
#print(feature_names[1000:1050])
#print()

# Train: fit LDA with one topic per category
topic_num = 4
lda = LatentDirichletAllocation(n_components=topic_num, max_iter=50,
                                learning_method='batch', random_state=0, n_jobs=-1)
lda.fit(X_train)
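The `max_df` and `min_df` settings decide which terms enter the vocabulary at all. The sketch below mimics that filtering in plain Python, on the understanding that a float `max_df` is a proportion of documents and an integer `min_df` is an absolute document count. The function `filter_vocabulary`, the toy documents, and the thresholds are assumptions for illustration and are not part of scikit-learn.

```python
from collections import Counter


def filter_vocabulary(tokenized_docs, min_df, max_df):
    """Keep terms whose document frequency lies within [min_df, max_df * n_docs]."""
    n_docs = len(tokenized_docs)
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))  # count each term at most once per document
    max_count = max_df * n_docs
    return sorted(t for t, c in df.items() if min_df <= c <= max_count)


# Toy corpus (hypothetical); "the" appears everywhere, several terms only once
docs = [
    ["the", "nhl", "game"],
    ["the", "mac", "drive"],
    ["the", "nhl", "league"],
    ["the", "mac", "apple"],
]
kept = filter_vocabulary(docs, min_df=2, max_df=0.5)
print(kept)  # ['mac', 'nhl']
```

With the settings in the training code (`max_df=0.1`, `min_df=5`), very common terms and very rare terms are both excluded before LDA sees the data.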

The training result is checked below.

sorting = np.argsort(lda.components_, axis=1)[:, ::-1]
mglearn.tools.print_topics(topics=range(topic_num),
                           feature_names=np.array(feature_names),
                           topics_per_chunk=topic_num,
                           sorting=sorting,n_words=10)
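The `np.argsort(...)[:, ::-1]` step orders the vocabulary indices of each topic by descending weight, and the helper then prints the first few words. A dependency-free sketch of the same idea follows. The function `top_words` and the small weight matrix are hypothetical and only stand in for `lda.components_`.

```python
def top_words(components, feature_names, n_words):
    """Return the n_words highest-weighted words for each topic row."""
    result = []
    for weights in components:
        # Vocabulary indices sorted by descending weight within this topic
        order = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
        result.append([feature_names[j] for j in order[:n_words]])
    return result


# Toy topic-word weights (hypothetical values)
feature_names = ["apple", "league", "mac", "nhl", "window"]
components = [
    [0.1, 2.0, 0.2, 3.5, 0.1],
    [4.0, 0.1, 3.0, 0.2, 0.5],
]
for k, words in enumerate(top_words(components, feature_names, 2)):
    print("topic", k, words)
```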

topic 0       topic 1       topic 2       topic 3       
--------      --------      --------      --------      
nhl           window        mac           wpi           
toronto       mit           apple         nada          
teams         motif         drive         kth           
league        uk            monitor       hcf           
player        server        quadra        jhunix        
roger         windows       se            jhu           
pittsburgh    program       scsi          unm           
cmu           widget        card          admiral       
runs          ac            simms         liu           
fan           file          centris       carina

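topic 1: X Window System (comp.windows.x)

topic 2: Mac hardware

topic 0: baseball or hockey; the two sports were not separated as expected

topic 3: computer-related? Not classified as expected

Topics 1 and 2 appear to have been cleanly separated at the training stage.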
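One way to check more directly how topics line up with the newsgroup labels is to count, for each document, its highest-probability topic against its true category. The sketch below uses made-up numbers. In the article's setting the rows would presumably come from `lda.transform(X_train)` and the labels from `twenty_train.target`, but that pairing was not run here, and `topic_label_table` is a hypothetical helper.

```python
from collections import Counter


def topic_label_table(doc_topic, labels):
    """Count (dominant topic, label) pairs over all documents."""
    table = Counter()
    for dist, label in zip(doc_topic, labels):
        dominant = max(range(len(dist)), key=lambda k: dist[k])
        table[(dominant, label)] += 1
    return table


# Toy document-topic distributions and labels (hypothetical)
doc_topic = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.1, 0.7],
]
labels = ["baseball", "hockey", "windows.x", "mac"]

table = topic_label_table(doc_topic, labels)
for (topic, label), count in sorted(table.items()):
    print(topic, label, count)
```

A topic that collects documents from two labels, as topic 0 does in this toy data, shows up as two rows with the same topic index.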

Inference

The text for inference was borrowed from the English Wikipedia article on Apple. Parts of the article are assigned to text11 and text12.

text11="an American multinational technology company headquartered in Cupertino, "+ \
        "California, that designs, develops, and sells consumer electronics,"+ \
        "computer software, and online services."
text12="The company's hardware products include the iPhone smartphone,"+ \
        "the iPad tablet computer, the Mac personal computer,"+ \
        "the iPod portable media player, the Apple Watch smartwatch,"+ \
        "the Apple TV digital media player, and the HomePod smart speaker."

Inference is run below.

# Predict: infer the topic distribution of each new sentence
test1 = [text11, text12]
X_test1 = tfidf_vec.transform(test1)
lda_test1 = lda.transform(X_test1)
# Use a separate loop variable so the fitted model `lda` is not overwritten
for i, doc_topics in enumerate(lda_test1):
    print("### ", i)
    # Indices of the topic(s) with the highest probability
    topicid = [k for k, p in enumerate(doc_topics) if p == max(doc_topics)]
    print(test1[i])
    print(doc_topics, " >>> topic", topicid)
    print("")

Result

###  0
an American multinational technology company headquartered in Cupertino, California, that designs, develops, and sells consumer electronics,computer software, and online services.
[0.06391161 0.06149079 0.81545564 0.05914196]  >>> topic [2]

###  1
The company's hardware products include the iPhone smartphone,the iPad tablet computer, the Mac personal computer,the iPod portable media player, the Apple Watch smartwatch,the Apple TV digital media player, and the HomePod smart speaker.
[0.34345051 0.05899806 0.54454404 0.05300738]  >>> topic [2]

Both Apple-related sentences were inferred as most likely belonging to topic 2 (Mac hardware), so they can be said to have been classified correctly.
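The two distributions reported in the result can also be compared by how decisive they are. The short sketch below copies the reported values by hand and picks the dominant topic of each. The helper `dominant_topic` is a hypothetical name introduced only for this illustration.

```python
def dominant_topic(distribution):
    """Return the index or indices of the largest probability and that probability."""
    best = max(distribution)
    return [k for k, p in enumerate(distribution) if p == best], best


# Topic distributions copied from the result above
reported = [
    [0.06391161, 0.06149079, 0.81545564, 0.05914196],
    [0.34345051, 0.05899806, 0.54454404, 0.05300738],
]
for i, dist in enumerate(reported):
    topics, prob = dominant_topic(dist)
    print(i, topics, round(prob, 3), "sum =", round(sum(dist), 3))
```

Both sentences land on topic 2, but the second one also puts about 0.34 on topic 0, so its assignment is noticeably less decisive than the first.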

Reference

Original article: 【기계 학습】scikit-learn을 이용한 LDA 주제 분류, https://qiita.com/asakbiz/items/3dfcece09592585581fd
