Notes on in-site search

Tags: search, full-text search, whoosh, Python, in-site search

Preface

Modularization

Login module

Blog scanning module

Blog details module

Search module

Demo

Case 1

Case 2

Summary

Preface
I learned a little about full-text search a while ago; back then I used Java with Lucene and the compass framework. If you're interested, see the column linked here: http://blog.csdn.net/column/details/lucene-compass.html
Now that I work in Python, I need a replacement. A quick web search turned up plenty of options, pylucene among them, but on comparison whoosh looked like the better fit. So whoosh it is for today.
Installation is simple:

pip install whoosh

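That's all it takes.
Goal: build an "in-site search" over my own blog, to make up for some of the shortcomings of CSDN's own site search.
Modularization
Lately I've come to like splitting a task into modules. Single-purpose pieces are easier to manage, integration testing is more convenient, and adding features or refactoring is easier too.
For the goal above I designed a few small modules, which I'll go through one at a time.
Login module
The login module is required: fetching a post's details needs a logged-in session, and without one the data can't be retrieved.
I wrote a simulated CSDN login example before. At the time it could do the following:

Simulated login

Upvote and downvote articles

Post comments

Fetch blogger details

To keep people from misusing it, I won't paste that code. If you want to talk about the technique, feel free to send me a private message or leave a comment below the post.
Here is just the simulated-login code: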

import requests
from bs4 import BeautifulSoup


class Login(object):
    """ Log in to CSDN and return a session that later requests can reuse. Requires the username and password of your own account. """

    def __init__(self):
        # the common headers for this login operation.
        self.headers = {
            'Host': 'passport.csdn.net',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
        }

    def login(self, username, password):
        if username and password:
            self.username = username
            self.password = password
        else:
            raise Exception('Need Your username and password!')

        loginurl = 'https://passport.csdn.net/account/login'
        # get the 'token' for webflow
        self.session = requests.Session()
        response = self.session.get(url=loginurl, headers=self.headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        # Assemble the data for posting operation used in logining.
        self.token = soup.find('input', {'name': 'lt'})['value']

        payload = {
            'username': self.username,
            'password': self.password,
            'lt': self.token,
            'execution': soup.find('input', {'name': 'execution'})['value'],
            '_eventId': 'submit'
        }
        response = self.session.post(url=loginurl, data=payload, headers=self.headers)

        # get the session
        return self.session if response.status_code == 200 else None
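
The login flow above depends on hidden form fields (`lt`, `execution`) that the server embeds in the login page. As a rough offline illustration, the sketch below collects hidden inputs from an HTML string and merges them with the credentials. It uses the standard library's `html.parser` as a stand-in for BeautifulSoup so it runs without extra dependencies; the sample form and every value in it are made up for the example and are not copied from CSDN.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'input':
            return
        attributes = dict(attrs)
        if attributes.get('type') == 'hidden' and 'name' in attributes:
            self.fields[attributes['name']] = attributes.get('value', '')


def build_payload(html, username, password):
    """Merge the credentials with the hidden fields found in the form."""
    collector = HiddenInputCollector()
    collector.feed(html)
    payload = {'username': username, 'password': password}
    payload.update(collector.fields)
    return payload


# Hypothetical login form; a real page will differ.
SAMPLE_FORM = '''
<form>
  <input type="text" name="username">
  <input type="hidden" name="lt" value="LT-123">
  <input type="hidden" name="execution" value="e1s1">
  <input type="hidden" name="_eventId" value="submit">
</form>
'''

payload = build_payload(SAMPLE_FORM, 'alice', 'secret')
print(payload)
```

Collecting every hidden field, rather than naming each one, is a design choice in this sketch: it keeps working if the form gains another hidden input.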

Blog scanning module
The blog scanning module does not need a logged-in session. It scans the blogger's total number of posts and the URL of each one, because those URLs are what the next step uses to fetch each post's details.

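import re

import requests
from bs4 import BeautifulSoup


class BlogScanner(object):
    """ Scan a blogger's article-list pages and collect the URL of every post. """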

    def __init__(self, domain):
        self.username = domain
        self.rooturl = 'http://blog.csdn.net'
        self.bloglinks = []
        self.headers = {
            'Host': 'blog.csdn.net',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
        }

    def scan(self):
        # get the page count
        response = requests.get(url=self.rooturl + "/" + self.username, headers=self.headers)
        soup = BeautifulSoup(response.text, 'html.parser')

        pagecontainer = soup.find('div', {'class': 'pagelist'})
        pages = re.findall(r'\d+', pagecontainer.find('span').get_text())[-1]

        # construct the blog list, e.g. http://blog.csdn.net/Marksinoberg/article/list/2
        for index in range(1, int(pages) + 1):
            # get the blog link of each list page
            listurl = 'http://blog.csdn.net/{}/article/list/{}'.format(self.username, str(index))
            response = requests.get(url=listurl, headers=self.headers)
            soup = BeautifulSoup(response.text, 'html.parser')
            try:
                alinks = soup.find_all('span', {'class': 'link_title'})
                # print(alinks)
                for alink in alinks:
                    link = alink.find('a').attrs['href']
                    link = self.rooturl + link
                    self.bloglinks.append(link)
            except Exception as e:
                # report the failure and move on to the next list page
                print('Failed to parse list page {}: {}'.format(listurl, e))
                continue

        return self.bloglinks
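
The scanner's two parsing steps, reading the page count off the pager text and building one list URL per page, can be tried offline. This is a minimal sketch: the pager text is a hypothetical example (the real wording on the page may differ), and `parse_page_count` and `list_page_urls` are illustrative helper names, not part of the original code.

```python
import re


def parse_page_count(pager_text):
    """Return the last integer in the pager text, or 1 if none is found."""
    numbers = re.findall(r'\d+', pager_text)
    return int(numbers[-1]) if numbers else 1


def list_page_urls(username, pages):
    """Build the URL of every article-list page for a blogger."""
    template = 'http://blog.csdn.net/{}/article/list/{}'
    return [template.format(username, index) for index in range(1, pages + 1)]


# Hypothetical pager text: the last number is taken as the page count.
pages = parse_page_count('25 articles in total, 2 pages')
urls = list_page_urls('Marksinoberg', pages)
print(urls)
```

Falling back to a single page when no number is found keeps the scan from crashing on a blog short enough to have no pager.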

Blog details module
For post details, I think CSDN did a really good job: the data even comes back as JSON. Without further ado, let's look at the post details that are available once you are logged in.
The title, URL, tags, summary, and body text are clearly all there. The code is as follows:

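import json


class BlogDetails(object):
    """ Fetch a post's details from the Markdown editor endpoint.

    getSource() returns a dict with these keys:
    'url': post URL
    'title': post title
    'tags': post tags
    'description': summary
    'content': Markdown body
    """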

    def __init__(self, session, blogurl):
        self.headers = {
            'Referer': 'http://write.blog.csdn.net/mdeditor',
            'Host': 'write.blog.csdn.net',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
        }
        # build the editor URL from the article id and the username, e.g.
        # http://blog.csdn.net/marksinoberg/article/details/70432419
        parts = blogurl.split('/')
        username, article_id = parts[3], parts[-1]
        self.blogurl = 'http://write.blog.csdn.net/mdeditor/getArticle?id={}&username={}'.format(article_id, username)
        self.session = session

    def getSource(self):
        # get title and content for the assigned url.
        try:
            # copy the headers so the instance defaults are not mutated
            tempheaders = dict(self.headers)
            tempheaders['Referer'] = 'http://write.blog.csdn.net/mdeditor'
            tempheaders['Host'] = 'write.blog.csdn.net'
            tempheaders['X-Requested-With'] = 'XMLHttpRequest'
            response = self.session.get(url=self.blogurl, headers=tempheaders)
            data = json.loads(response.text)['data']
            return {
                'url': data['url'],
                'title': data['title'],
                'tags': data['tags'],
                'description': data['description'],
                'content': data['markdowncontent'],
            }
        except Exception as e:
            print("Failed to fetch blog details: {}".format(e))
            return None
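
The URL handling in the constructor is easy to check in isolation. The sketch below splits a post URL into username and article id and rebuilds the editor endpoint used above; `editor_api_url` is an illustrative helper name, and the input is the example URL from the code comment.

```python
def editor_api_url(blogurl):
    """Turn a post URL into the Markdown editor endpoint for that post."""
    # 'http://blog.csdn.net/<username>/article/details/<id>' splits into
    # ['http:', '', 'blog.csdn.net', '<username>', 'article', 'details', '<id>']
    parts = blogurl.rstrip('/').split('/')
    username, article_id = parts[3], parts[-1]
    return ('http://write.blog.csdn.net/mdeditor/getArticle'
            '?id={}&username={}'.format(article_id, username))


api_url = editor_api_url('http://blog.csdn.net/marksinoberg/article/details/70432419')
print(api_url)
```

Stripping a trailing slash first is an addition in this sketch; without it, a URL ending in `/` would yield an empty article id.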

Search module
The search module is the core of today's post. The library is whoosh, which I really like; its documentation is detailed, simple, and easy to follow. If my clumsy English can get through it, yours certainly can.
Here are the docs: http://whoosh.readthedocs.io/en/latest/
The default text analyzer is built for English, so handling Chinese properly requires Chinese word segmentation. I copied a tokenizer from the web, but it doesn't work very well.

import jieba
from whoosh.analysis import Token, Tokenizer
from whoosh.compat import text_type


class ChineseTokenizer(Tokenizer):
    def __call__(self, value, positions=False, chars=False, keeporiginal=False, removestops=True, start_pos=0, start_char=0, mode='', **kwargs):
        assert isinstance(value, text_type), "%r is not unicode" % value
        t = Token(positions=positions, chars=chars, removestops=removestops, mode=mode, **kwargs)
        # segment with jieba in accurate mode (cut_all=False)
        seglist = jieba.cut(value, cut_all=False)
        offset = 0  # running character offset, so repeated words get distinct offsets
        for index, w in enumerate(seglist):
            t.original = t.text = w
            t.boost = 1.0
            if positions:
                # whoosh positions count tokens, not characters
                t.pos = start_pos + index
            if chars:
                t.startchar = start_char + offset
                t.endchar = start_char + offset + len(w)
            offset += len(w)
            yield t

def ChineseAnalyzer():
    return ChineseTokenizer()
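
A whoosh tokenizer attaches a token position and character offsets to each token it yields. The sketch below shows that bookkeeping on its own, using a hand-written segment list in place of jieba's output so it runs without dependencies. It assumes the segments concatenate back to the original text; `token_offsets` is an illustrative helper, not a whoosh or jieba function.

```python
def token_offsets(segments, start_pos=0, start_char=0):
    """Return (text, pos, startchar, endchar) for consecutive segments."""
    tokens = []
    cursor = start_char  # character offset of the next segment
    for index, word in enumerate(segments):
        tokens.append((word, start_pos + index, cursor, cursor + len(word)))
        cursor += len(word)
    return tokens


# Hypothetical segmentation of the text '我喜欢Python喜欢'.
text = '我喜欢Python喜欢'
segments = ['我', '喜欢', 'Python', '喜欢']
tokens = token_offsets(segments)
for token in tokens:
    print(token)
```

Tracking a running cursor, rather than searching the text for each word, gives the second `喜欢` its own offsets instead of those of the first occurrence.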


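import os

from whoosh.fields import ID, KEYWORD, TEXT, Schema
from whoosh.index import create_in
from whoosh.qparser import QueryParser


class Searcher(object):
    """ Index blog posts and search them.

    First: define a schema suited to this system (hard-coded here):
    'url': post URL
    'title': post title
    'tags': post tags
    'description': summary
    'content': Markdown body
    Second: add documents (blog posts).
    Third: search for the user's query string and return the best-scoring posts.
    """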
    def __init__(self):
        # define a suitable schema
        self.schema = Schema(url=ID(stored=True),
                             title=TEXT(stored=True),
                             tags=KEYWORD(commas=True),
                             description=TEXT(stored=True),
                             content=TEXT(analyzer=ChineseAnalyzer()))
        # create the directory that stores the index files
        if not os.path.exists("indexdir"):
            os.mkdir("indexdir")
        self.indexdir = "indexdir"
        self.indexer = create_in(self.indexdir, schema=self.schema)


    def addblog(self, blog):
        writer = self.indexer.writer()
        # write the blog details into indexes
        writer.add_document(url=blog['url'],
                            title=blog['title'],
                            tags=blog['tags'],
                            description=blog['description'],
                            content=blog['content'])
        writer.commit()

    def search(self, querystring):
        # parse the query string against the 'content' field
        query = QueryParser('content', self.schema).parse(querystring)
        with self.indexer.searcher() as searcher:
            results = searcher.search(query)
            # copy the stored fields out before the searcher is closed;
            # the Results object cannot be read once the 'with' block exits
            return [hit.fields() for hit in results]

Demo
OK, that's about it. Now let's see how it runs.
Case 1
First, a search for the keyword DBHelper. With too many posts the indexing gets slow, so loading just the first few posts is enough.

# coding: utf8

# @Author:
# @File: TestAll.py
# @Time: 2017/5/12
# @Contact: 1064319632@qq.com
# @blog: http://blog.csdn.net/marksinoberg
# @Description:

from whooshlearn.csdn import Login, BlogScanner, BlogDetails, Searcher

login = Login()
session = login.login(username="Username", password="password")
print(session)

scanner = BlogScanner(domain="Marksinoberg")
blogs = scanner.scan()
print(blogs[0:3])

blogdetails = BlogDetails(session=session, blogurl=blogs[0])
blog = blogdetails.getSource()
print(blog['url'])
print(blog['description'])
print(blog['tags'])

# test whoosh for searcher
searcher = Searcher()
counter = 1
for item in blogs[0:7]:
    print("Processing article {}".format(counter))
    counter += 1
    details = BlogDetails(session=session, blogurl=item).getSource()
    searcher.addblog(details)
# searcher.addblog(blog)
searcher.search('DbHelper')
# searcher.search('Python')

Here is the result of running the code.
Only the first two posts on my blog are about DBHelper, so it is easy to see that those two documents were hit. Looks fine.
Case 2
Next, let's try a different keyword, such as Python.

# coding: utf8

# @Author:
# @File: TestAll.py
# @Time: 2017/5/12
# @Contact: 1064319632@qq.com
# @blog: http://blog.csdn.net/marksinoberg
# @Description:

from whooshlearn.csdn import Login, BlogScanner, BlogDetails, Searcher

login = Login()
session = login.login(username="username", password="password")
print(session)

scanner = BlogScanner(domain="Marksinoberg")
blogs = scanner.scan()
print(blogs[0:3])

blogdetails = BlogDetails(session=session, blogurl=blogs[0])
blog = blogdetails.getSource()
print(blog['url'])
print(blog['description'])
print(blog['tags'])

# test whoosh for searcher
searcher = Searcher()
counter = 1
for item in blogs[0:10]:
    print("Processing article {}".format(counter))
    counter += 1
    details = BlogDetails(session=session, blogurl=item).getSource()
    searcher.addblog(details)
# searcher.addblog(blog)
# searcher.search('DbHelper')
searcher.search('Python')

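Now let's look at the result.
Four records were hit, so the hit rate is reasonably good.
Summary
To wrap up: for in-site search with whoosh, the text results still need to be matched more precisely, and in practice there are many places left to optimize. QueryParser also has a lot left to explore.
Highlighting search results is convenient too; the official documentation covers it in detail.
The last issue is Chinese text: for now I don't have a good way to improve the word segmentation and the hit rate.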
