Damai.cn crawler: a summary

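The json library
The first time I used the json library, what I passed in was either a dict or a list. Dumping a list gave output cluttered with '[' and ']', which was not the JSON object layout I wanted, while dumping a dict scrambled the key order. That is because a dict (in the Python 2 this program is written for) has no internal order. Foolishly, I spent a while trying various ways to change the order inside a dict before concluding that it cannot be done: how could the order in memory be changed? Every method that appears to reorder a dict only changes the order in which it is output; the dict itself stays the same.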
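To illustrate the point that key order is decided at output time, here is a minimal sketch using only the standard library. It is written for Python 3 (the attached program is Python 2), and the record values are hypothetical placeholders, not data from the site.

```python
import json
from collections import OrderedDict

# Hypothetical record; the values are placeholders.
record = {"title": "Some show", "mcid": "Concert", "place": "Some hall"}

# sort_keys=True fixes the order when serializing: keys come out alphabetically.
sorted_text = json.dumps(record, sort_keys=True)
print(sorted_text)

# An OrderedDict remembers insertion order, and json.dumps writes keys in that order.
ordered = OrderedDict([("mcid", "Concert"), ("title", "Some show"), ("place", "Some hall")])
ordered_text = json.dumps(ordered)
print(ordered_text)
```

On Python 3.7 and later a plain dict also keeps insertion order, so the OrderedDict is mainly useful on older interpreters.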
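In this experiment, I shaped the output into JSON as follows: each key-value pair becomes its own single-entry dict, these dicts are collected into a list, the list is passed to json.dumps(), and a series of partial string replacements is then applied to the resulting string.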
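The approach described here can be sketched roughly as follows. This is a Python 3 reconstruction, not the exact code from the attached program: the function name `dump_pairs_in_order` and the sample pairs are hypothetical, and `separators` is passed explicitly so that the text between items does not depend on the interpreter version.

```python
import json

def dump_pairs_in_order(pairs):
    """Serialize (key, value) pairs as one JSON object, keeping the given order."""
    # One single-key dict per pair; the surrounding list fixes the order.
    data = [{key: value} for key, value in pairs]
    text = json.dumps(data, ensure_ascii=False, indent=1, separators=(',', ': '))
    # Strip the outer list brackets and merge neighbouring objects into one.
    return (text.replace('[\n {\n', '{\n')
                .replace('\n }\n]', '\n}')
                .replace('\n },\n {\n', ',\n'))

# Hypothetical sample data.
pairs = [("mcid", "Concert"), ("title", "Some show"), ("time", ["day 1", "day 2"])]
merged = dump_pairs_in_order(pairs)
print(merged)
```

The replacements rely on the exact whitespace that `indent=1` produces, so the result is fragile if the indent changes. Passing an `OrderedDict` to `json.dumps` gives the same ordering without string surgery.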
Exception handling
Every individual fetch should go inside a try block, something I had not thought of before. That way, when a problem occurs, the program does not terminate abnormally.
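A minimal sketch of this idea, assuming a bounded number of retries like the attached program uses. The helper `fetch_with_retry` and the stand-in `flaky_fetch` are hypothetical names introduced for illustration; the real network call is replaced by a fake so the example runs offline.

```python
def fetch_with_retry(fetch, url, max_tries=5):
    """Call fetch(url) up to max_tries times; return '' if every attempt fails."""
    for attempt in range(1, max_tries + 1):
        try:
            return fetch(url)
        except Exception as exc:  # unlike a bare except, Ctrl-C still interrupts
            print('attempt %d failed: %s' % (attempt, exc))
    return ''

# Stand-in for a real network call: fails twice, then succeeds.
calls = {'count': 0}

def flaky_fetch(url):
    calls['count'] += 1
    if calls['count'] < 3:
        raise IOError('timed out')
    return '<html>ok</html>'

page = fetch_with_retry(flaky_fetch, 'http://example.invalid/list')
print(page)
```

Returning an empty string on failure mirrors what `get_source` does in the attached program, so the caller has to be prepared for an empty page.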
TIPS:

If a function allocates a local list, release it in place once you are done with it. I do not think it is strictly necessary, but it is a habit worth building when writing programs: del time[:]

The approach of simulating a browser failed, but I learned many operations I had not known before, such as Selenium's find_elements methods.

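Because some Google resources fail to load properly, selenium often runs into baffling problems. For example, when an earlier click operation goes wrong, it is hard to dodge the failure from inside the program.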
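For the first tip, the following sketch shows what `del some_list[:]` actually does compared with rebinding the name. The variable names and values are hypothetical. Note that a local list is reclaimed automatically once nothing refers to it, so clearing it by hand mainly matters when another name still points at the same list.

```python
times = ['19:30', '20:00']
alias = times            # a second name for the same list object

del times[:]             # empties the list in place
after_clear = list(alias)  # snapshot: alias sees the cleared list too

times = ['21:00']        # rebinding: times now names a brand-new list
after_rebind = list(alias)  # alias still refers to the old (empty) list

print(after_clear, after_rebind, times)
```

In the attached program the list is called `time`, which also shadows the imported `time` module inside the function; a different name such as `times` avoids that.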

Attached program:

#encoding:utf-8
#-----------------------------------------
#-----------------------------------------
import sys
import urllib2
import urllib
import cookielib
from bs4 import BeautifulSoup
import StringIO
from PIL import Image
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver import ActionChains
from bidict import bidict
import json
import time

category={}
mcid={'   ':1,'   ':2,'    ':3,'    ':4,'    ':5,'    ':6,'    ':7}

ccid={'  ':9,'  ':10,'  ':11,'   ':12,'     ':13,
      '   ':14, '  ':15,'      ':16, '     ':17, '     ':18,
      '   ':19,'   ':20,'    ':21,'    ':22,'    ':23,
      '   ':24,'   ':25,'   ':26,
      '   ':27,'   ':28,'   ':29,'   ':30,'   ':31,'       ':32,
      '    ':33,'    ':34,'    ':35,
      '    ':36, '   ':37, '  ':38, '    ':39, '  ':40, '  ':41, '    ':42, '   ':43, '   ':44, '    ':45
      }
mcidDict=~bidict(mcid)
ccidDict=~bidict(ccid)

cj = cookielib.CookieJar()

opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
headers = { 'User-Agent': ' Chrome/35.0.1916.114 Safari/537.36' }

def get_source(url):
    global cj,opener,headers
    keepRequest=1
    tryTimes=0
    page=''

    while keepRequest==1 :
        tryTimes+=1
        if tryTimes>5:
            break
        try:
            req=urllib2.Request(url,headers=headers)
            page=urllib2.urlopen(req,timeout=2).read()
            # page=urllib2.urlopen(url,timeout=1).read()
        except:
            print 'request again'
        else:
            keepRequest=0
    return page

def Spider():
    global mcidDict,ccidDict

    global cj,opener,headers
    outcomePathHead='C:/Users/Gentlyguitar/Desktop/MyWork/damai/outcome/'
    # get_source('http://www.damai.cn/projectlist.do?mcid=1&ccid=9')
    # print get_source('http://item.damai.cn/66780.html')
    ccidThresh={1:13,2:18,3:23,4:26,5:32,6:35,7:45}
    startMcid=3
    startCcid=21
    mcid=startMcid
    ccid=startCcid

    while mcid<=7:
        while ccid<=ccidThresh[mcid]: #
            print '        ：'
            path=outcomePathHead+mcidDict[mcid]+'/'+ccidDict[ccid]+'.txt'
            # path=outcomePathHead+'test.txt'
            print path
            uipath = unicode(path , "utf8")
            fileOut=open(uipath,'w')
            pageIndex=1
            while 1: # index of page keep changing until there is no perform list in the page
                try:
                    performListPage='http://www.damai.cn/projectlist.do?mcid='+str(mcid)+'&ccid='+str(ccid)+'&pageIndex='+str(pageIndex)
                    print '      ：',performListPage
                    listPage=get_source(performListPage)
                    soup=BeautifulSoup(listPage)
                    performList=soup.find(attrs={'id':'performList'})
                    titleList=performList.find_all('h2')
                    linkList=[]
                    for each in titleList:
                        a=each.find('a')
                        linkList.append(a['href'])

                    if len(titleList)== 0: # indicate the index of page has come to an end, ccid therefore needs to change
                        print 'this is an empty page'
                        break

                    for eachshow in linkList:
                        time=[]
                        price=[]
                        print eachshow
                        showpage=get_source(eachshow)
                        # showpage=get_source('http://item.damai.cn/70686.html')
                        soup=BeautifulSoup(showpage,"html.parser")

                        try:
                            title=soup.find(attrs={'class':'title'}).get_text().strip() # get the title
                        except:
                            title='  '
                        try:
                            location=soup.find(attrs={'itemprop':'location'}).get_text().strip() # get the location
                        except:
                            location='  '

    # try:
                        try:
                            timeList=soup.find(attrs={'id':'perform'}).find_all('a') # get the time, which is a list
                            for index,eachtime in enumerate(timeList):
                                time.append(eachtime.get_text().encode('utf-8'))
                            pidList=[]
                            for index,eachtime in enumerate(timeList): # get the price for each time
                                pid=eachtime['pid']
                                # print eachtime['class'],type(eachtime['class'])
                                if eachtime['class']==[u'grey']:
                                    price.append('  ')
                                    continue
                                if index>0:
                                    data={'type':'33',
                                          'performID':pid,
                                          'business':'1',
                                          'IsBuyFlow':'False',
                                          'sitestaus':'3'}
                                    post_data=urllib.urlencode(data)
                                    url='http://item.damai.cn/ajax.aspx'
                                    keepRequest=1
                                    tryTimes=0
                                    while keepRequest==1: # a time limit is needed
                                        tryTimes+=1
                                        if tryTimes>5:
                                            break
                                        try:
                                            req=urllib2.Request(url,post_data,headers)
                                            newpage=urllib2.urlopen(req).read()
                                        except:
                                            print 'click problem'
                                        else:
                                            keepRequest=0
                                    soup=BeautifulSoup(newpage,"html.parser")
                                    priceLinkList=soup.find_all('a',attrs={'class':True,'price':True})

                                else:
                                    priceLinkList=soup.find(attrs={'id':'price'}).find_all('a')
                                priceList=[]
                                for eachlink in priceLinkList:
                                    norlizedPrice=eachlink.get_text()
                                    norlizedPrice=norlizedPrice.replace(u'    ，      ~',u' (    )').replace(u'        ',u' (     )')
                                    priceList.append(norlizedPrice.encode('utf-8'))
                                price.append(priceList)

                        except:
                            time.append('  ')
                            price.append('  ')

                        mcidName=mcidDict[mcid]
                        ccidName=ccidDict[ccid]
                        titleName=title.encode('utf-8')
                        placeName=location.encode('utf-8')
                        data=[{"mcid": mcidName},
                              {"ccid": ccidName},
                              {"title": titleName},
                              {"place": placeName},
                              {"time": time},
                              {"price": price}]

                        normalizedData= json.dumps(data,ensure_ascii=False,sort_keys=True,indent=1)
                        normalizedData=normalizedData.replace('[\n {\n','{\n').replace('\n }\n]','\n}').replace('\n }, \n {\n',' ,\n')
                        #print normalizedData
                        fileOut.write(normalizedData+'\n\n\n')
                        fileOut.flush()

                        del time[:]
                        del price[:]
                except:
                    print 'something wrong'
                pageIndex+=1
            ccid+=1
        mcid+=1






if __name__ == "__main__":
    Spider()
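The category walk and the list-page URL construction in the program above can be sketched separately from the scraping. This is a Python 3 sketch under stated assumptions: the helper names `category_pairs` and `list_page_url` are hypothetical, while the endpoint, the thresholds and the starting ids are copied from the program.

```python
from urllib.parse import urlencode

LIST_URL = 'http://www.damai.cn/projectlist.do'  # endpoint taken from the program above

def category_pairs(ccid_thresh, start_mcid, start_ccid):
    """Yield (mcid, ccid) in crawl order, resuming from the given start point."""
    ccid = start_ccid
    for mcid in sorted(ccid_thresh):
        if mcid < start_mcid:
            continue
        # ccid keeps counting across main categories, as in the original loops.
        while ccid <= ccid_thresh[mcid]:
            yield mcid, ccid
            ccid += 1

def list_page_url(mcid, ccid, page_index):
    """Build the list-page URL; urlencode takes care of escaping."""
    query = urlencode([('mcid', mcid), ('ccid', ccid), ('pageIndex', page_index)])
    return LIST_URL + '?' + query

ccid_thresh = {1: 13, 2: 18, 3: 23, 4: 26, 5: 32, 6: 35, 7: 45}
pairs = list(category_pairs(ccid_thresh, start_mcid=3, start_ccid=21))
first_url = list_page_url(*pairs[0], page_index=1)
print(len(pairs), first_url)
```

Keeping the traversal in a generator makes it easy to resume a crawl from any (mcid, ccid) pair and to test the ordering without touching the network.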


You are free to share or copy this text, but please keep this document's URL as the reference URL.

Licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
