Implementing BIRCH clustering with sklearn

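In general, the BIRCH algorithm is suited to large sample sizes. In this respect it resembles Mini Batch K-Means, but BIRCH fits cases where the number of clusters is large, while Mini Batch K-Means is used when the number of clusters is moderate or small. Besides clustering, BIRCH can also be used for outlier detection and as a preprocessing step that reduces the data by cluster. However, if the feature dimension is very high, for example greater than 20, BIRCH is not a good fit, and Mini Batch K-Means tends to perform better.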
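To make the comparison above concrete, here is a minimal sketch that runs both estimators on the same data. The synthetic blobs, the parameter values, and the variable names are assumptions chosen for illustration; they are not taken from the article's dataset.

```python
import numpy as np
from sklearn.cluster import Birch, MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic, low-dimensional data (an assumption for illustration only).
X, _ = make_blobs(n_samples=2000, centers=8, n_features=2,
                  cluster_std=0.5, random_state=0)

# BIRCH: threshold bounds the radius of each CF-tree subcluster;
# n_clusters sets the number of final clusters built from those subclusters.
birch = Birch(threshold=0.5, n_clusters=8)
birch_labels = birch.fit_predict(X)

# Mini Batch K-Means: updates centroids from small random batches.
mbk = MiniBatchKMeans(n_clusters=8, batch_size=256, n_init=3, random_state=0)
mbk_labels = mbk.fit_predict(X)

print("BIRCH subclusters:", len(birch.subcluster_centers_))
print("BIRCH clusters:", len(np.unique(birch_labels)))
print("MiniBatchKMeans clusters:", len(np.unique(mbk_labels)))
```

Both estimators expose the same fit_predict interface, so swapping one for the other in a pipeline is mostly a matter of changing the constructor.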
Data format: test.dat (UTF-8 text, one title per line)

Program code:

# coding=utf-8
import jieba
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import Birch


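class Cluster():
    def init_data(self):
        corpus = []
        # Maps line index -> original title
        self.title_dict = {}
        with open('test.dat', 'r', encoding="utf-8") as f:
            index = 0
            for line in f:
                title = line.strip()
                self.title_dict[index] = title
                # Segment the title into words (jieba accurate mode)
                seglist = jieba.cut(title, cut_all=False)
                # Join the tokens with spaces so CountVectorizer can split them
                output = ' '.join(seglist)
                index += 1
                corpus.append(output.strip())

        # Convert the corpus to a term-count matrix: a[i][j] is the count of word j in document i
        vectorizer = CountVectorizer()
        # Computes the tf-idf weight of each word
        transformer = TfidfTransformer()
        # The inner fit_transform builds the count matrix; the outer one converts it to tf-idf
        tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))
        # All words in the vocabulary (get_feature_names() was removed in scikit-learn 1.2)
        word = vectorizer.get_feature_names_out()
        # Dense tf-idf matrix: w[i][j] is the tf-idf weight of word j in document i
        self.weight = tfidf.toarray()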

    def birch_cluster(self):
        print('start cluster Birch -------------------')
        # threshold: maximum radius of a CF-tree subcluster; n_clusters: number of final clusters
        self.cluster = Birch(threshold=0.8, n_clusters=5)
        self.cluster.fit_predict(self.weight)

        
    def get_title(self):
        # self.cluster.labels_ holds the cluster label (an int) of each corpus row, in index order;
        # rows with the same value belong to the same cluster
        cluster_dict = {}
        # cluster_dict: key is the Birch cluster label, value is the list of title indexes in it
        for index, value in enumerate(self.cluster.labels_):
            if value not in cluster_dict:
                cluster_dict[value] = [index]
            else:
                cluster_dict[value].append(index)
        print(cluster_dict)

        print("-----before cluster Birch count title:", len(self.title_dict))
        # result_dict: key is the title closest to the center of each cluster, value is the cluster size

        result_dict = {}
        for indexs in cluster_dict.values():
            latest_index = indexs[0]
            similar_num = len(indexs)
            if len(indexs) >= 2:
                # labels_ are final cluster ids, not row numbers of subcluster_centers_,
                # so the mean of the cluster's members is used as its center.
                center = self.weight[indexs].mean(axis=0)
                min_s = np.sqrt(np.sum(np.square(self.weight[indexs[0]] - center)))
                for index in indexs:
                    s = np.sqrt(np.sum(np.square(self.weight[index] - center)))
                    if s < min_s:
                        min_s = s
                        latest_index = index

            title = self.title_dict[latest_index]

            result_dict[title] = similar_num
        print("-----after cluster Birch count title:", len(result_dict))
        for title in result_dict:
            print(title, result_dict[title])
        return result_dict
    
    def run(self):
        self.init_data()
        self.birch_cluster()
        self.get_title()

if __name__=='__main__':
    cluster = Cluster()
    cluster.run()
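
A note on the attributes used in get_title: in scikit-learn's Birch, labels_ holds the final cluster id of each sample, while subcluster_centers_ has one row per CF-tree leaf subcluster, and subcluster_labels_ maps each subcluster to its final cluster. The two therefore index different things. The sketch below uses made-up 2-D points in place of the TF-IDF rows, and the helper name representatives is hypothetical; it shows one possible way to pick the member closest to each cluster's mean.

```python
import numpy as np
from sklearn.cluster import Birch

# Toy 2-D points standing in for TF-IDF rows (illustrative data only).
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

model = Birch(threshold=0.5, n_clusters=3).fit(X)

# One row per leaf subcluster, and one final-cluster label per subcluster.
print(model.subcluster_centers_.shape, model.subcluster_labels_.shape)


def representatives(X, labels):
    """Return {cluster label: index of the member nearest the cluster mean}."""
    result = {}
    for label in np.unique(labels):
        members = np.flatnonzero(labels == label)
        center = X[members].mean(axis=0)
        dists = np.linalg.norm(X[members] - center, axis=1)
        result[int(label)] = int(members[np.argmin(dists)])
    return result


reps = representatives(X, model.labels_)
print(reps)
```

Using the member mean keeps the distance computation tied to the final clusters regardless of how many subclusters the CF tree produced.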
