Programming Collective Intelligence, Chapter 6

1. P126: to define the thresholds, the book's code modifies the initialization method and adds a new instance variable to classifier.

def __init__(self, getfeatures):
    classifier.__init__(self, getfeatures)
    self.thresholds = {}

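When making this change, the variable should be defined directly in the class classifier. Put only the last line (self.thresholds = {}) into its __init__(); the two lines before it are then unnecessary. The modified __init__():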

class classifier:
    def __init__(self, getfeatures, filename = None):
        # counts of feature/category combinations
        self.fc = {}
        # counts of documents in each category
        self.cc = {}
        self.getfeatures = getfeatures
        # classifier.__init__(self, getfeatures) is no longer needed here
        self.thresholds = {}
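
As a rough illustration of why moving the dictionary into the base class is enough, the sketch below shows a subclass that uses self.thresholds without defining its own __init__(). The setthreshold/getthreshold names and the default of 1.0 are assumptions made for this example (modeled on the chapter's naming), and the code is written for Python 3 rather than the Python 2 used in the book.

```python
class classifier:
    def __init__(self, getfeatures, filename=None):
        self.fc = {}            # counts of feature/category combinations
        self.cc = {}            # counts of documents in each category
        self.getfeatures = getfeatures
        self.thresholds = {}    # defined once, in the base class


class naivebayes(classifier):
    # No __init__ override here: the inherited one already creates
    # self.thresholds. Method names and the 1.0 default are assumptions.
    def setthreshold(self, cat, t):
        self.thresholds[cat] = t

    def getthreshold(self, cat):
        return self.thresholds.get(cat, 1.0)


nb = naivebayes(str.split)
default_threshold = nb.getthreshold('bad')   # falls back to the default
nb.setthreshold('bad', 3.0)
custom_threshold = nb.getthreshold('bad')
```

Because every subclass inherits the same __init__(), each classifier instance starts with its own empty thresholds dictionary.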

2. P131: when entering the code that verifies the classifier,

>>> reload(docclass)
<module 'docclass' from 'docclass.py'>
>>> docclass.sampletrain(c1)
>>> c1.classify('quick rabbit')

an error occurs:

>>> reload(docclass)
<module 'docclass' from 'docclass.py'>
>>> docclass.sampletrain(c1)
>>> c1.classify('quick rabbit')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "docclass.py", line 94, in classify
    probs[cat] = self.prob(item, cat)
AttributeError: fisherclassifier instance has no attribute 'prob'

The correct approach is to redefine c1 after reloading the file; the error then no longer occurs, as the rest of this session shows:

>>> reload(docclass)
<module 'docclass' from 'docclass.py'>
>>> docclass.sampletrain(c1)
>>> c1.classify('quick rabbit')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "docclass.py", line 94, in classify
    probs[cat] = self.prob(item, cat)
AttributeError: fisherclassifier instance has no attribute 'prob'
>>> c1 = docclass.fisherclassifier(docclass.getwords)
>>> docclass.sampletrain(c1)
>>> c1.classify('quick rabbit')
'good'
>>> c1.classify('quick money')
'bad'
>>> c1.setminimum('bad', 0.8)
>>> c1.classify('quick money')
'good'
>>> c1.setminimum('good', 0.4)
>>> c1.classify('quick money')
'good'
>>>
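
The behavior in the session above can be reproduced in isolation. The sketch below is only an approximation written for Python 3: it uses exec() on two hypothetical versions of a module's source as a stand-in for reload(docclass), since reloading re-executes the module code in the same namespace. The class bodies are invented for the example and are not the book's code.

```python
OLD_SOURCE = """
class fisherclassifier:
    def classify(self, item):
        return 'unknown'
"""

NEW_SOURCE = """
class fisherclassifier:
    def __init__(self):
        self.minimums = {}

    def setminimum(self, cat, minimum):
        self.minimums[cat] = minimum

    def classify(self, item):
        return 'good'
"""

namespace = {}                        # plays the role of the docclass module
exec(OLD_SOURCE, namespace)
c1 = namespace['fisherclassifier']()  # instance created before the "reload"

exec(NEW_SOURCE, namespace)           # stand-in for reload(docclass)

# c1 still points at the old class object, so it lacks the new method.
stale_has_method = hasattr(c1, 'setminimum')
stale_is_new_class = isinstance(c1, namespace['fisherclassifier'])

c1 = namespace['fisherclassifier']()  # redefine c1 against the new class
fresh_has_method = hasattr(c1, 'setminimum')
```

Rebinding the class name in the module does not touch instances that already exist, which is why c1 has to be created again after each reload.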

3. P128: for the normalization step, the formula given in the text is cprob = clf/(clf+nclf), but the program uses

p = clf / (freqsum)

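I believe nclf already includes clf when it is computed, so clf does not need to be added again for the normalization to work, and the formula in the text should read cprob = clf/nclf. Of course, whether or not clf is added only changes the numerical value of the probability; it does not change the final ranking of the categories.

4. P129: the statement that a document containing the word 'casino' has a 0.9 probability of being spam is wrong. Working through the calculation, a document containing the word 'casino' has a probability of 1.0 of being spam.

5. P137, the code in the book: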

def entryfeatures(entry):
    splitter = re.compile('\\W*')
    f = {}

    # get the words in the title and mark them as title words
    titlewords = [s.lower() for s in splitter.split(entry['title']) if len(s) > 2 and len(s) < 20]
    for w in titlewords: f['Title: ' + w] = 1

    # get the words in the summary
    summarywords = [s.lower() for s in splitter.split(entry['summary']) if len(s) > 2 and len(s) < 20]

    # count uppercase words
    uc = 0
    for i in range(len(summarywords)):
        w = summarywords[i]
        f[w] = 1
        if w.isupper(): uc += 1

        # use pairs of words from the summary as features
        if i < len(summarywords) - 1:
            twowords = ' '.join(summarywords[i : i + 1])
            f[twowords] = 1

    # keep the creator and publisher names whole
    f['Publisher: ' + entry['publisher']] = 1

    # UPPERCASE is a virtual word that flags a summary with too many uppercase words
    if float(uc) / len(summarywords) > 0.3: f['UPPERCASE'] = 1

    return f

When counting uppercase words, the code uses the summarywords variable extracted earlier, but when summarywords is built,

summarywords = [s.lower() for s in splitter.split(entry['summary']) if len(s) > 2 and len(s) < 20]

the lower() call converts every word in summarywords to lowercase, so the later count of uppercase words is meaningless: w.isupper() can never be true. I therefore think the line should be changed to:

summarywords = [s for s in splitter.split(entry['summary']) if len(s) > 2 and len(s) < 20]
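
To see the effect of that change end to end, here is a hedged Python 3 sketch of the whole function with the fix applied. Several details are my own assumptions rather than the book's code: the splitter uses '\W+' (on Python 3.7+ a pattern that can match the empty string, such as '\W*', splits between every character), the word pairs use a two-word slice, summary features are lowercased at the point they are stored, and an empty summary is guarded against.

```python
import re


def entryfeatures(entry):
    # Assumption: '\W+' instead of '\W*', see the note above.
    splitter = re.compile(r'\W+')
    features = {}

    # Title words are lowercased and tagged with their origin.
    titlewords = [s.lower() for s in splitter.split(entry['title'])
                  if 2 < len(s) < 20]
    for w in titlewords:
        features['Title: ' + w] = 1

    # Summary words keep their original case so isupper() can work.
    summarywords = [s for s in splitter.split(entry['summary'])
                    if 2 < len(s) < 20]

    uc = 0
    for i, w in enumerate(summarywords):
        features[w.lower()] = 1      # assumption: store features lowercased
        if w.isupper():
            uc += 1
        # Assumption: a two-word slice, so the feature really is a pair.
        if i < len(summarywords) - 1:
            pair = ' '.join(s.lower() for s in summarywords[i:i + 2])
            features[pair] = 1

    features['Publisher: ' + entry['publisher']] = 1

    # Flag summaries where more than 30% of the words are uppercase.
    if summarywords and float(uc) / len(summarywords) > 0.3:
        features['UPPERCASE'] = 1

    return features


shouting = entryfeatures({'title': 'Big News',
                          'summary': 'BUY CHEAP PILLS now',
                          'publisher': 'Example Feed'})
quiet = entryfeatures({'title': 'Big News',
                       'summary': 'quiet words only here',
                       'publisher': 'Example Feed'})
```

With the original lower() call in place, neither sample entry would ever receive the UPPERCASE feature; with the change, only the first one does.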

Please excuse my poor English.
