Setting up the Nutch search engine: the crawl module


1. Preparation
jdk: http://www.oracle.com/technetwork/java/javase/downloads/index.html
tomcat: http://tomcat.apache.org/
eclipse: http://www.eclipse.org/downloads/
Cygwin: download setup.exe from http://www.cygwin.com/ (it emulates a Linux environment; Nutch is designed for the Linux platform)
nutch0.9: http://www.kuaipan.cn/file/id_164107264406329.htm (a Java-based search engine)
Missing jars to download:
rtf-parser.jar: http://www.kuaipan.cn/file/id_16410726440632361.htm
jid3lib-0.5.1.jar: http://www.kuaipan.cn/file/id_16410726440632360.htm
2. Installation
Install everything downloaded above (except the missing jars).
jdk: run the downloaded installer. Remember to set the environment variables; the details are not covered here.
tomcat: after installing, run startup.bat and open http://localhost:8080/ in a browser. If the welcome page appears, the installation succeeded. If it fails, run startup.bat from cmd to see the error message. If the window just flashes by and closes, edit startup.bat by hand and change this line:
call "%EXECUTABLE%" start %CMD_LINE_ARGS%
to:
call "%EXECUTABLE%" run %CMD_LINE_ARGS%
Then run startup from cmd again and deal with whatever specific error it reports.
eclipse: nothing special to say here.
cygwin: follow the installer prompts step by step.
nutch: extract the archive into your project root directory.
Missing jars: put rtf-parser.jar into nutch-0.9\src\plugin\parse-rtf\lib and jid3lib-0.5.1.jar into nutch-0.9\src\plugin\parse-mp3\lib.
3. Configuring and importing Nutch
First open Cygwin. A command-line window similar to cmd appears, and the command format is not very different from DOS. If you are unsure about a command, type help to see detailed descriptions.
Then change into your nutch-0.9 directory.
Then run the nutch command:
bin/nutch
If the usage summary for the nutch command is printed, the Nutch runtime environment has been set up successfully.
Next, import nutch-0.9 into Eclipse and configure it.
Import:
First, bring nutch-0.9 into Eclipse as a project.
Open Eclipse, create a new Java project, and use the nutch-0.9 directory as the project source.
Do not click Finish yet:
On the Source tab, set the default output folder to mybuild.
On the Libraries tab, add the conf folder from the nutch directory.
On the Order and Export tab, move conf to the top.

Refresh the project. Nutch is now essentially fully imported into Eclipse.
Configuration:
1. Edit nutch-default.xml
In Eclipse, open the project you just imported and find nutch-default.xml under Referenced Libraries -> conf.

Find the plugin.folders property and change the value on the line below it to ./src/plugin
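A quick way to confirm the edit took effect is to read the property back out of the file. The following is only a sketch: get_property and the SAMPLE excerpt are made up for illustration, and SAMPLE stands in for the real conf/nutch-default.xml, which contains many more properties.

```python
import xml.etree.ElementTree as ET


def get_property(xml_text, name):
    """Return the <value> text of the named <property>, or None if it is absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == name:
            return (prop.findtext("value") or "").strip()
    return None


# Hypothetical excerpt standing in for conf/nutch-default.xml after the edit.
SAMPLE = """<configuration>
<property>
  <name>plugin.folders</name>
  <value>./src/plugin</value>
</property>
</configuration>"""

print(get_property(SAMPLE, "plugin.folders"))
```

To check the real file, read its contents (for example with open(..., encoding="utf-8")) and pass the text to get_property instead of SAMPLE.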
2. Configure the Nutch crawl policy
In the nutch-0.9 project root, create weburl.txt to hold the addresses the search engine should crawl, and enter the address to crawl (for example: http://www.buct.edu.cn/).
Then find crawl-urlfilter.txt in the conf folder.
Find:
# accept hosts in MY.DOMAIN.NAME
# +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
Change it to:
# accept hosts in MY.DOMAIN.NAME
# +^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/
+^http://www.buct.edu.cn/
This restricts the crawl to the content of this site only.
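To see why a single + line is enough to restrict the crawl, it helps to model how the filter file is read: each non-comment line is a + (accept) or - (reject) sign followed by a regular expression, the first rule whose pattern is found in the URL decides, and a URL that matches no rule is dropped. The sketch below imitates that behaviour in Python under those assumptions. It is not Nutch's own code, Python's re is close to but not identical to java.util.regex, and the RULES text is a shortened, hypothetical filter file.

```python
import re

# Hypothetical, shortened filter file; the real crawl-urlfilter.txt has more rules.
RULES = """
# accept hosts in MY.DOMAIN.NAME
+^http://www.buct.edu.cn/
# skip everything else
-.
"""


def parse_rules(text):
    """Turn filter lines into (accept, compiled_pattern) pairs, skipping comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule wins; a URL that matches no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


rules = parse_rules(RULES)
print(accepts(rules, "http://www.buct.edu.cn/kxyj/index.htm"))
print(accepts(rules, "http://example.com/"))
```

Note that the dots in www.buct.edu.cn are unescaped, so they match any character. That is harmless here, but writing them as \. makes the rule stricter.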
Finally, configure nutch-site.xml in conf.
Replace the entire contents of nutch-site.xml with the following:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->




<configuration>
<property>
<name>http.agent.name</name>
<value>ahathinkingSpider</value>
<description>My Search Engine</description>
</property>

<property>
<name>http.agent.description</name>
<value>My web</value>
<description>Further description of our bot- this text is used in
User-Agent header. It appears in parenthesis after the agent name.</description>
</property>

<property>
<name>http.agent.url</name>
<value>ahathinking.com</value>
<description>A URL to advertise in the User-Agent header. This will
appear in parenthesis after the agent name. Custom dictates that this
should be a URL of a page explaining the purpose and behavior of this crawler.
</description>
</property>

<property>
<name>http.agent.email</name>
<value>[email protected]</value>
<description>An email address to advertise in the HTTP 'From' request
header and User-Agent header. A good practice is to mangle this
address (e.g. 'info at example dot com') to avoid spamming.
</description>
</property>

</configuration>
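Since a crawl can fail when the agent properties are left empty (http.agent.name in particular, as far as I know), a small check like the one below can catch that before running. It is a sketch only: empty_agent_properties and the SAMPLE text are invented for illustration, and the list of checked names simply mirrors the four properties set above.

```python
import xml.etree.ElementTree as ET

# The four properties configured in nutch-site.xml above.
CHECKED = (
    "http.agent.name",
    "http.agent.description",
    "http.agent.url",
    "http.agent.email",
)


def empty_agent_properties(xml_text):
    """Return the checked property names that are missing or have an empty value."""
    root = ET.fromstring(xml_text)
    values = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        values[name] = (prop.findtext("value") or "").strip()
    return [key for key in CHECKED if not values.get(key)]


# Hypothetical nutch-site.xml with two of the four properties filled in.
SAMPLE = """<configuration>
<property>
  <name>http.agent.name</name>
  <value>ahathinkingSpider</value>
</property>
<property>
  <name>http.agent.url</name>
  <value>ahathinking.com</value>
</property>
</configuration>"""

print(empty_agent_properties(SAMPLE))
```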

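3. Set the run parameters:
Select the nutch project and choose Run Configurations from the Run menu.
Change the settings as follows (the screenshot from the original is missing; judging from the stack traces and log output in this article, the main class is org.apache.nutch.crawl.Crawl and the program arguments correspond to weburl.txt -dir Local -threads 5 -depth 5 -topN 100):
After making the changes, click Apply.
Finally, click Run.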

crawl started in: Local
rootUrlDir = weburl.txt
threads = 5
depth = 5
topN = 100
Injector: starting
Injector: crawlDb: Local/crawldb
Injector: urlDir: weburl.txt
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: Local/segments/20121022110809
Generator: filtering: false
Generator: topN: 100
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: Local/segments/20121022110809
Fetcher: threads: 5
fetching http://www.buct.edu.cn/
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: Local/crawldb
CrawlDb update: segments: [Local/segments/20121022110809]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: Local/segments/20121022110815
Generator: filtering: false
Generator: topN: 100
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls by host, for politeness.
Generator: done.
Fetcher: starting
Fetcher: segment: Local/segments/20121022110815
Fetcher: threads: 5
fetching http://www.buct.edu.cn/kxyj/index.htm
fetching http://www.buct.edu.cn/zjbh/ldjt/index.htm
fetching http://www.buct.edu.cn/bhxy/index.htm
fetching http://www.buct.edu.cn/zjbh/xycf/index.htm
fetching http://www.buct.edu.cn/zjbh/lrld/index.htm
fetching http://www.buct.edu.cn/xywh/index.htm
fetching http://www.buct.edu.cn/tjtp/25314.htm
fetching http://www.buct.edu.cn/xysz/index.htm
fetching http://www.buct.edu.cn/zsjy/index.htm
fetching http://www.buct.edu.cn/kslj/23583.htm
fetching http://www.buct.edu.cn/js/jquery.flexslider-min.js
fetching http://www.buct.edu.cn/rcpy/index.htm
............

If output like the above appears, the Nutch crawl module has been set up successfully.
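When the log gets long, it can be easier to judge the crawl from a summary of which URLs were fetched in each segment. Here is a minimal sketch that groups the "fetching" lines under the preceding "Fetcher: segment:" line. The function name fetched_by_segment is made up, and LOG is a shortened copy of the output shown above.

```python
import re

# Shortened excerpt of the console output shown above.
LOG = """Fetcher: starting
Fetcher: segment: Local/segments/20121022110809
Fetcher: threads: 5
fetching http://www.buct.edu.cn/
Fetcher: done
Fetcher: starting
Fetcher: segment: Local/segments/20121022110815
Fetcher: threads: 5
fetching http://www.buct.edu.cn/kxyj/index.htm
fetching http://www.buct.edu.cn/bhxy/index.htm
"""


def fetched_by_segment(log_text):
    """Map each segment path to the list of URLs fetched while it was active."""
    segments = {}
    current = None
    for line in log_text.splitlines():
        line = line.strip()
        match = re.match(r"Fetcher: segment: (\S+)", line)
        if match:
            current = match.group(1)
            segments.setdefault(current, [])
        elif line.startswith("fetching ") and current is not None:
            segments[current].append(line[len("fetching "):])
    return segments


for segment, urls in fetched_by_segment(LOG).items():
    print(segment, len(urls))
```

A first segment with a single URL followed by larger segments, as in the log above, is what a crawl starting from one seed address should look like.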
Notes:
Common errors when setting up Nutch:
1.1.1   Hadoop error when running crawl
When running the nutch crawl command in Cygwin:
[Fatal Error] hadoop-site.xml:15:7: The content of elements must consist of well
-formed character data or markup.
Exception in thread "main" java.lang.RuntimeException: org.xml.sax.SAXParseExcep
tion: The content of elements must consist of well-formed character data or mark
up.
Solution:
hadoop-site.xml is not well-formed: one of its tags has an extra angle bracket in front of it.
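This kind of mistake can be located without starting a crawl by running the file through any XML parser. The sketch below uses Python's standard library. xml_error is a made-up helper, and BROKEN is a hypothetical file body with a doubled angle bracket, not the actual hadoop-site.xml from the error above.

```python
import xml.etree.ElementTree as ET


def xml_error(xml_text):
    """Return (line, column, message) for the first parse error, or None if well-formed."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position
        return (line, column, str(err))
    return None


# Hypothetical broken file: line 2 has an extra '<' in front of the tag.
BROKEN = "<configuration>\n<<property>\n</property>\n</configuration>"
GOOD = "<configuration>\n<property>\n</property>\n</configuration>"

print(xml_error(BROKEN))
print(xml_error(GOOD))
```

The reported line number plays the same role as the 15:7 in the Nutch error message: it points at the place in the file to inspect.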
1.1.2   crawl fails with Job failed
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java
:439)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
Solution:
In most cases this means MY.DOMAIN.NAME in crawl-urlfilter.txt was not edited correctly.
1.1.3   Another Job failed
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java
:439)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
Solution:
The MY.DOMAIN.NAME entry in crawl-urlfilter.txt was edited incorrectly.
1.1.4   Job failed when running Nutch in Eclipse
Exception in thread "main" java.io.IOException: Job failed!
       at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
       at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
       at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)
Solution:
This is a Java version setting problem in Eclipse. To fix it:
If the project is set to Java 1.4, change it to 1.6.
Project -> Properties -> Java Compiler
On the right, under JDK Compliance,
set Compiler compliance level to 6.0
1.1.5   Specifying the index library when editing the configuration file (nutch-site.xml)

<configuration>
<property>
<name>http.agent.name</name>
<value>HD nutch agent</value>
</property>

<property>
<name>http.agent.version</name>
<value>1.0</value>
</property>

<property>
<name>urlfilter.order</name>
<value>org.apache.nutch.urlfilter.regex.RegexURLFilter</value>
<description>The order by which url filters are applied.
If empty, all available url filters (as dictated by properties
plugin-includes and plugin-excludes above) are loaded and applied in system
defined order. If not empty, only named filters are loaded and applied
in given order. For example, if this property has value:
org.apache.nutch.urlfilter.regex.RegexURLFilter org.apache.nutch.urlfilter.prefix.PrefixURLFilter
then RegexURLFilter is applied first, and PrefixURLFilter second.
Since all filters are AND'ed, filter ordering does not have impact
on end result, but it may have performance implication, depending
on relative expensiveness of filters.
</description>
</property>
</configuration>

If the following error appears after copying the configuration above, the pasted text contains stray whitespace or characters in the wrong encoding; retyping it by hand fixes it:
java.io.UTFDataFormatException: Invalid byte 1 of 1-byte UTF-8 sequence
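Before retyping everything, it may be worth finding the exact byte that is not valid UTF-8. The sketch below is one way to do that. first_invalid_utf8 is a made-up helper, and the sample bytes assume a typical cause: a non-breaking space (0xA0) pasted from a web page and saved in a single-byte encoding.

```python
def first_invalid_utf8(data):
    """Return the offset of the first byte that breaks UTF-8 decoding, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


# Hypothetical samples: a clean value, and one with a stray 0xA0 byte.
CLEAN = "<value>HD nutch agent</value>".encode("utf-8")
DIRTY = b"<value>\xa0HD nutch agent</value>"

print(first_invalid_utf8(CLEAN))
print(first_invalid_utf8(DIRTY))
```

For a real file, read it in binary mode (open(path, "rb")) and pass the bytes to the function; the returned offset shows where the offending character sits.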
Overall, most of the common errors come down to incorrect configuration files.
References:
http://www.ahathinking.com/archives/140.html
http://hi.baidu.com/orminknckdbkntq/item/020a8bc6e4d275000bd93a2d
