Summarizing Data with the N-Gram Model (Explained in Python)


Environment: Python 2.7, IDE PyCharm 5.0.3


Starting from the beginning


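What is the N-Gram model?
In natural language processing there is a model called the n-gram, which denotes a sequence of n consecutive words in text or speech. When analyzing natural language, using n-grams or looking for commonly used phrases makes it easy to break a sentence into several text fragments. (Quoted from Web Scraping with Python, by Ryan Mitchell.)
Put simply, the goal is to find the core topic words. So what are the core topic words? Generally, the words repeated most often, that is, mentioned the greatest number of times, are the ones the text most wants to express, and those are the core words. The examples below start from this idea.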
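The definition above can be sketched as a sliding window over a list of words. This is only a minimal illustration: the helper name `ngrams` and the sample sentence are assumptions made for this example, not part of the book's code.

```python
def ngrams(words, n):
    """Return every run of n consecutive words, joined by single spaces."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# Hypothetical sample text, chosen only for illustration.
words = "we the people of the united states".split()

# 7 words give 6 two-word windows and 5 three-word windows.
print(ngrams(words, 2))
print(ngrams(words, 3))
```

A list shorter than n simply yields no n-grams, because the range is empty.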
A quick supplement
The following functions show up in the examples, so they are pulled out here to be tried on their own first.
1. string.punctuation holds all punctuation characters; use it together with strip.

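import string

tokens = ['a,', 'b!', 'cj!/n']  # not named "list", which would shadow the built-in
item = []
for token in tokens:
    # strip() removes the given characters only from the two ends of the string
    item.append(token.strip(string.punctuation))
print(item)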

['a', 'b', 'cj!/n']
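The output above shows that strip only trims the two ends, which is why 'cj!/n' comes back unchanged. If punctuation inside a token also has to go, one possible approach (an assumption for this sketch, not something the original code does) is a regular expression built from string.punctuation. The helper names are hypothetical.

```python
import re
import string

# Character class matching any single ASCII punctuation character.
PUNCT_RE = re.compile("[" + re.escape(string.punctuation) + "]")

def strip_edges(token):
    """Trim punctuation from both ends only, as the article does."""
    return token.strip(string.punctuation)

def remove_all_punctuation(token):
    """Remove punctuation anywhere in the token (not the article's behavior)."""
    return PUNCT_RE.sub("", token)

print(strip_edges("cj!/n"))             # cj!/n
print(remove_all_punctuation("cj!/n"))  # cjn
```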

2. operator.itemgetter(): the itemgetter function provided by the operator module fetches the data at a given position of an object. Its arguments are indices, that is, the positions within the object of the data to fetch.
Example:

import operator

dict_ = {'name1': '2',
         'name2': '1'}

# dict_.items() returns (key, value) pairs; itemgetter(0) sorts them by key
print(sorted(dict_.items(), key=operator.itemgetter(0), reverse=True))

[('name2', '1'), ('name1', '2')]

Of course, you can also write the key function directly as a lambda:

dict_ = {'name1': '2',
         'name2': '1'}
# Sort by value (index 1) this time; iteritems() exists only in Python 2
print(sorted(dict_.iteritems(), key=lambda x: x[1], reverse=True))
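To make the relationship between the two forms concrete, here is a small sketch showing that itemgetter(1) and the lambda produce the same ordering, and that itemgetter accepts several indices at once. The counts are made-up values for illustration only.

```python
import operator

# Hypothetical n-gram counts, used only for illustration.
counts = {"of the": 3, "in the": 2, "united states": 3}

# Both key functions pick element 1 (the count) of each (ngram, count) pair.
by_getter = sorted(counts.items(), key=operator.itemgetter(1), reverse=True)
by_lambda = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
print(by_getter == by_lambda)  # True

# With several indices, itemgetter returns a tuple of the selected elements.
print(operator.itemgetter(1, 0)(("of the", 3)))  # (3, 'of the')

# Highest count first, ties broken alphabetically, so the order is deterministic.
stable = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
print(stable)
```

The last form is worth considering when equal counts should always come out in the same order, regardless of how the dictionary happens to be ordered.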

2-gram
Let's take two-word phrases as the case to discuss. The example comes first, followed by the explanation.

import urllib2
import re
import string
import operator

def cleanText(text):
    text = re.sub(r'\n+', " ", text).lower()  # replace newlines with spaces and lowercase everything
    text = re.sub(r'\[[0-9]*\]', "", text)  # drop citation markers such as [1]
    text = re.sub(' +', " ", text)  # collapse runs of spaces into a single space
    text = bytes(text)  # in Python 2, bytes is an alias of str, so this keeps a byte string
    # text = text.decode("ascii", "ignore")
    return text

def cleanInput(text):
    text = cleanText(text)
    cleaned = []
    words = text.split(' ')  # split on spaces to get a list of words

    for item in words:
        item = item.strip(string.punctuation)  # strip punctuation from both ends of the word

        # keep words longer than one character, plus the one-letter words "a" and "i"
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
            cleaned.append(item)
    return cleaned

def getNgrams(text, n):
    words = cleanInput(text)

    output = {}  # maps each n-gram to its count
    for i in range(len(words) - n + 1):
        ngramTemp = " ".join(words[i:i + n])
        if ngramTemp not in output:  # first time this n-gram is seen
            output[ngramTemp] = 0  # start its count at zero
        output[ngramTemp] += 1
    return output

# Option 1: read the text from the web
content = urllib2.urlopen(urllib2.Request("http://pythonscraping.com/files/inaugurationSpeech.txt")).read()
# Option 2: read a local file instead (uncomment to use)
# content = open("1.txt").read()
ngrams = getNgrams(content, 2)
sortedNGrams = sorted(ngrams.items(), key=operator.itemgetter(1), reverse=True)  # reverse=True sorts in descending order
print(sortedNGrams)

[('of the', 213), ('in the', 65), ('to the', 61), ('by the', 41), ('the constitution', 34), ...
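The count-then-sort step can also be written with collections.Counter from the standard library. This is an alternative to the dictionary plus sorted approach used in the article, not what the article's code does; the function name `count_ngrams` and the sample words are assumptions for this sketch, and the input is taken to be an already cleaned list of words.

```python
from collections import Counter

def count_ngrams(words, n):
    """Count each n-gram in an already cleaned list of words."""
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(grams)

# Hypothetical sample text, chosen only for illustration.
words = "of the people by the people for the people".split()

# most_common(k) returns the k most frequent n-grams, highest count first.
top = count_ngrams(words, 2).most_common(1)
print(top)  # [('the people', 3)]
```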

The example above collects every 2-gram with its frequency and sorts them by count. But this is not what we want: what use is it to know that "of the" appears more than 200 times? So we need to filter out these conjunctions and prepositions.
Deeper

# -*- coding: utf-8 -*-
import urllib2

import re
import string
import operator

# Return True if the word is one of the common words to ignore
def isCommon(ngram):
    commonWords = ["the", "be", "and", "of", "a", "in", "to", "have",
                   "it", "i", "that", "for", "you", "he", "with", "on", "do", "say",
                   "this", "they", "is", "an", "at", "but","we", "his", "from", "that",
                   "not", "by", "she", "or", "as", "what", "go", "their","can", "who",
                   "get", "if", "would", "her", "all", "my", "make", "about", "know",
                   "will","as", "up", "one", "time", "has", "been", "there", "year", "so",
                   "think", "when", "which", "them", "some", "me", "people", "take", "out",
                   "into", "just", "see", "him", "your", "come", "could", "now", "than",
                   "like", "other", "how", "then", "its", "our", "two", "more", "these",
                   "want", "way", "look", "first", "also", "new", "because", "day", "more",
                   "use", "no", "man", "find", "here", "thing", "give", "many", "well"]

    return ngram in commonWords

def cleanText(text):
    text = re.sub(r'\n+', " ", text).lower()  # replace newlines with spaces and lowercase everything
    text = re.sub(r'\[[0-9]*\]', "", text)  # drop citation markers such as [1]
    text = re.sub(' +', " ", text)  # collapse runs of spaces into a single space
    text = bytes(text)  # in Python 2, bytes is an alias of str, so this keeps a byte string
    # text = text.decode("ascii", "ignore")
    return text

def cleanInput(text):
    text = cleanText(text)
    cleaned = []
    words = text.split(' ')  # split on spaces to get a list of words

    for item in words:
        item = item.strip(string.punctuation)  # strip punctuation from both ends of the word

        # keep words longer than one character, plus the one-letter words "a" and "i"
        if len(item) > 1 or (item.lower() == 'a' or item.lower() == 'i'):
            cleaned.append(item)
    return cleaned

def getNgrams(text, n):
    words = cleanInput(text)

    output = {}  # maps each n-gram to its count
    for i in range(len(words) - n + 1):
        ngramTemp = " ".join(words[i:i + n])

        # skip any n-gram that contains a common word (works for any n, not only 2)
        if any(isCommon(word) for word in ngramTemp.split()):
            continue
        if ngramTemp not in output:  # first time this n-gram is seen
            output[ngramTemp] = 0  # start its count at zero
        output[ngramTemp] += 1
    return output

# Return the first sentence that contains the given n-gram
def getFirstSentenceContaining(ngram, content):
    sentences = content.split(".")
    for sentence in sentences:
        if ngram in sentence:
            return sentence
    return ""

# Option 1: read the text from the web
content = urllib2.urlopen(urllib2.Request("http://pythonscraping.com/files/inaugurationSpeech.txt")).read()
# Option 2: read a local file instead (uncomment to use)
# content = open("1.txt").read()
ngrams = getNgrams(content, 2)
sortedNGrams = sorted(ngrams.items(), key=operator.itemgetter(1), reverse=True)  # reverse=True sorts in descending order
print(sortedNGrams)
# Print the first sentence containing each of the three most frequent 2-grams
for ngram, count in sortedNGrams[:3]:
    print("###" + getFirstSentenceContaining(ngram, content.lower()) + "###")

[('united states', 10), ('general government', 4), ('executive department', 4), ('legisltive bojefferson', 3), ('same causes', 3), ('called upon', 3), ('chief magistrate', 3), ('whole country', 3), ('government should', 3), ...

### the constitution of the united states is the instrument containing this grant of power to the several departments composing the government###
### the general government has seized upon none of the reserved rights of the states###
### such a one was afforded by the executive department constituted by the constitution###

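As the example above shows, we dropped the useless words and connecting words, then pulled out the sentences that contain the core phrases. Here I only took the top three sentences; for a text of some 200 sentences, being able to sum it up in three or four sentences feels rather magical.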
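One detail of this approach is that two top n-grams can point to the same first sentence, in which case that sentence would be printed twice. A minimal sketch of a variant that skips such repeats follows; the function name `summarize`, its parameters, and the sample text are assumptions made for illustration, not part of the article's code.

```python
def summarize(content, ranked_ngrams, max_sentences=3):
    """Collect the first sentence containing each top n-gram, skipping repeats."""
    sentences = [s.strip() for s in content.lower().split(".") if s.strip()]
    summary = []
    for ngram, _count in ranked_ngrams:
        for sentence in sentences:
            if ngram in sentence:
                # Only the first matching sentence counts; ignore it if already used.
                if sentence not in summary:
                    summary.append(sentence)
                break
        if len(summary) == max_sentences:
            break
    return summary

# Hypothetical text and ranking, chosen only for illustration.
text = ("The United States has a general government. "
        "The general government is limited. The people decide.")
ranked = [("united states", 2), ("general government", 2), ("the people", 1)]

for sentence in summarize(text, ranked):
    print("### " + sentence + " ###")
```

Here the first sentence matches both "united states" and "general government", so it is emitted once and the summary has two sentences rather than three.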
BUT
The method above only works for texts with a very clear purpose, such as speeches at a meeting. For novels it is truly terrible. I tried it on several English novels, and honestly, what on earth did it summarize?
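Finally
The material comes from chapter 8 of Web Scraping with Python, but the book's code targets Python 3.x and some of the code examples would not run for me, so I tidied up and modified the code snippets to reproduce the book's results.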
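Since the difference between the book's Python 3.x code and the Python 2.7 code here is mostly in the download step, one possible way to keep a single script running on both is a guarded import. This is a sketch under that assumption; the helper name `fetch_text` is hypothetical, and decoding as UTF-8 is also an assumption about the downloaded file.

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2.7

def fetch_text(url):
    """Download url and return its body as text (assumes UTF-8, ignores bad bytes)."""
    raw = urlopen(url).read()
    return raw.decode("utf-8", "ignore")
```

With this in place, the text could be loaded with fetch_text("http://pythonscraping.com/files/inaugurationSpeech.txt") on either version.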
Acknowledgements
Web Scraping with Python (Chinese edition) [by Ryan Mitchell] [Posts & Telecom Press]
Introduction to the Python strip() function
Python's sorted function and the operator.itemgetter function


You may freely share or copy this text, but please keep the URL of this document as a reference URL.

Licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
