Nutch Source Code Study: Data Structures for Web Fetching

Today we look at several data structures that Nutch uses when fetching web pages.
The main classes involved include FetchListEntry and Page.
Let's start with the FetchListEntry class.
public final class FetchListEntry implements Writable, Cloneable
It implements the Writable and Cloneable interfaces, as many Nutch classes do.
Making each class responsible for its own read and write operations is a very reasonable design; splitting that logic out into separate classes would feel rather fragmented.
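The self-serializing pattern described above can be sketched roughly as follows. This is a minimal illustration, not Nutch code: SimpleWritable and LinkRecord are hypothetical names, and java.io's writeUTF/readUTF stand in for Nutch's own string encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical stand-in for a Writable-style interface (not the Nutch type).
interface SimpleWritable {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}

// Hypothetical record that knows how to serialize and deserialize itself.
final class LinkRecord implements SimpleWritable {
  String url = "";
  int depth;

  public void write(DataOutput out) throws IOException {
    out.writeUTF(url);       // fields are written in a fixed order...
    out.writeInt(depth);
  }

  public void readFields(DataInput in) throws IOException {
    url = in.readUTF();      // ...and read back in exactly the same order
    depth = in.readInt();
  }

  // Serializes any SimpleWritable into a byte array.
  static byte[] toBytes(SimpleWritable w) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    w.write(new DataOutputStream(buffer));
    return buffer.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    LinkRecord original = new LinkRecord();
    original.url = "http://example.com/";
    original.depth = 2;

    LinkRecord copy = new LinkRecord();
    copy.readFields(new DataInputStream(new ByteArrayInputStream(toBytes(original))));
    System.out.println(copy.url + " " + copy.depth);
  }
}
```

Because the field layout lives in one class, the writer and the reader cannot drift apart without the change being visible in the same file.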
Here are the member variables of FetchListEntry:

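public static final String DIR_NAME = "fetchlist"; // directory name used for fetch lists
private final static byte CUR_VERSION = 2;          // current serialization version
private boolean fetch;                              // whether the page should be fetched
private Page page;                                  // the page this entry refers to
private String[] anchors;                           // anchor texts of links pointing to the page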

Let's see how the fields are read, that is, the following function:
public final void readFields(DataInput in) throws IOException
It first reads the version field and checks whether it is greater than the current version; if so, it throws a VersionMismatchException.
It then reads the fetch and page fields.
If the version is greater than 1, anchors were stored, so it reads them; otherwise it assigns an empty string array.
The code is as follows:

    byte version = in.readByte();                 // read version
    if (version > CUR_VERSION)                    // check version
      throw new VersionMismatchException(CUR_VERSION, version);

    fetch = in.readByte() != 0;                   // read fetch flag

    page = Page.read(in);                         // read page

    if (version > 1) {                            // anchors added in version 2
      anchors = new String[in.readInt()];         // read anchors
      for (int i = 0; i < anchors.length; i++) {
        anchors[i] = UTF8.readString(in);
      }
    } else {
      anchors = new String[0];
    }

A static function is also provided that reads the fields and returns a FetchListEntry object.

public static FetchListEntry read(DataInput in) throws IOException {
    FetchListEntry result = new FetchListEntry();
    result.readFields(in);
    return result;
}

The write code is easier to follow: it simply writes each field in turn.

public final void write(DataOutput out) throws IOException {
    out.writeByte(CUR_VERSION);                   // store current version
    out.writeByte((byte)(fetch ? 1 : 0));         // write fetch flag
    page.write(out);                              // write page
    out.writeInt(anchors.length);                 // write anchors
    for (int i = 0; i < anchors.length; i++) {
      UTF8.writeString(out, anchors[i]);
    }
  }
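To see why the version byte matters, here is a small sketch of the same version-gated read. It is an assumption-laden illustration rather than Nutch code: VersionedEntry is a hypothetical class, the Page field is left out, writeUTF/readUTF replace Nutch's UTF8 helper, and a plain IOException stands in for VersionMismatchException.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical, trimmed-down analogue of FetchListEntry (no Page field).
final class VersionedEntry {
  static final byte CUR_VERSION = 2;
  boolean fetch;
  String[] anchors = new String[0];

  void write(DataOutput out) throws IOException {
    out.writeByte(CUR_VERSION);              // always write the newest layout
    out.writeByte(fetch ? 1 : 0);
    out.writeInt(anchors.length);
    for (String anchor : anchors) {
      out.writeUTF(anchor);
    }
  }

  void readFields(DataInput in) throws IOException {
    byte version = in.readByte();
    if (version > CUR_VERSION) {             // data written by newer code
      throw new IOException("unsupported version " + version);
    }
    fetch = in.readByte() != 0;
    if (version > 1) {                       // anchors exist only from version 2 on
      anchors = new String[in.readInt()];
      for (int i = 0; i < anchors.length; i++) {
        anchors[i] = in.readUTF();
      }
    } else {
      anchors = new String[0];               // older record: default to empty
    }
  }

  byte[] toBytes() throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    write(new DataOutputStream(buffer));
    return buffer.toByteArray();
  }

  static VersionedEntry fromBytes(byte[] data) throws IOException {
    VersionedEntry entry = new VersionedEntry();
    entry.readFields(new DataInputStream(new ByteArrayInputStream(data)));
    return entry;
  }

  public static void main(String[] args) throws IOException {
    byte[] legacy = {1, 1};                  // version 1 record: fetch flag only
    VersionedEntry old = fromBytes(legacy);
    System.out.println(old.fetch + " " + old.anchors.length);
  }
}
```

The writer always emits the current layout, while the reader accepts every older layout and fills in defaults for fields that did not exist yet.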

The remaining clone and equals functions are also straightforward.
Next, let's look at the code of the Page class.
public class Page implements WritableComparable, Cloneable
Like FetchListEntry, it is serializable and cloneable, although it implements WritableComparable rather than plain Writable. The comment in the Nutch source makes the meaning of each field easy to understand.

/*********************************************
 * A row in the Page Database.
 * <pre>
 *   type   name    description
 * ---------------------------------------------------------------
 *   byte   VERSION  - A byte indicating the version of this entry.
 *   String URL      - The url of a page.  This is the primary key.
 *   128bit ID       - The MD5 hash of the contents of the page.
 *   64bit  DATE     - The date this page should be refetched.
 *   byte   RETRIES  - The number of times we've failed to fetch this page.
 *   byte   INTERVAL - Frequency, in days, this page should be refreshed.
 *   float  SCORE   - Multiplied into the score for hits on this page.
 *   float  NEXTSCORE   - Multiplied into the score for hits on this page.
 * </pre>
 *
 * @author Mike Cafarella
 * @author Doug Cutting
 *********************************************/
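The ID column above is a 128-bit MD5 hash of the page contents. As a rough illustration of what such an ID looks like, the sketch below computes one with the JDK's MessageDigest. ContentDigest is a hypothetical helper, not Nutch's MD5Hash class, and how Nutch itself derives the hash is not shown in this article.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Nutch.
final class ContentDigest {
  // Returns the 16-byte (128-bit) MD5 digest of the given content.
  static byte[] md5(byte[] content) throws NoSuchAlgorithmException {
    return MessageDigest.getInstance("MD5").digest(content);
  }

  // Renders a digest as lowercase hexadecimal.
  static String toHex(byte[] digest) {
    StringBuilder hex = new StringBuilder(digest.length * 2);
    for (byte b : digest) {
      hex.append(String.format("%02x", b & 0xff));
    }
    return hex.toString();
  }

  public static void main(String[] args) throws NoSuchAlgorithmException {
    byte[] page = "abc".getBytes(StandardCharsets.UTF_8);
    byte[] id = md5(page);
    System.out.println(id.length * 8 + " bits: " + toHex(id));
  }
}
```

Two pages with identical bytes get the same ID, which is what makes a content hash useful for spotting duplicate pages.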

The fields of Page:

private final static byte CUR_VERSION = 4;
  private static final byte DEFAULT_INTERVAL =
    (byte)NutchConf.get().getInt("db.default.fetch.interval", 30);

  private UTF8 url;
  private MD5Hash md5;
  private long nextFetch = System.currentTimeMillis();
  private byte retries;
  private byte fetchInterval = DEFAULT_INTERVAL;
  private int numOutlinks;
  private float score = 1.0f;
  private float nextScore = 1.0f;

Likewise, let's see how it reads its own fields. With the comments already present in the source, the code is easy to follow, so I won't explain it in detail.

public void readFields(DataInput in) throws IOException {
    byte version = in.readByte();                 // read version
    if (version > CUR_VERSION)                    // check version
      throw new VersionMismatchException(CUR_VERSION, version);

    url.readFields(in);
    md5.readFields(in);
    nextFetch = in.readLong();
    retries = in.readByte();
    fetchInterval = in.readByte();
    numOutlinks = (version > 2) ? in.readInt() : 0; // added in Version 3
    score = (version>1) ? in.readFloat() : 1.0f;  // score added in version 2
    nextScore = (version>3) ? in.readFloat() : 1.0f;  // 2nd score added in V4
  }

Writing the fields is just as direct.

public void write(DataOutput out) throws IOException {
    out.writeByte(CUR_VERSION);                   // store current version
    url.write(out);
    md5.write(out);
    out.writeLong(nextFetch);
    out.write(retries);
    out.write(fetchInterval);
    out.writeInt(numOutlinks);
    out.writeFloat(score);
    out.writeFloat(nextScore);
  }
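One detail in the write method above: retries and fetchInterval are emitted with out.write(...) rather than out.writeByte(...), yet readFields reads them with readByte(). This works because DataOutput.write(int) writes only the low-order eight bits of its argument. The sketch below (ByteWriteDemo is a hypothetical name) checks that both calls produce a single byte that readByte() recovers.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical demo class, not part of Nutch.
final class ByteWriteDemo {
  // Writes the first value with write(int) and the second with writeByte(int).
  static byte[] encode(byte first, byte second) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(buffer);
    out.write(first);        // low-order 8 bits only
    out.writeByte(second);   // same on-the-wire result
    out.flush();
    return buffer.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    byte[] raw = encode((byte) 3, (byte) -5);
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(raw));
    System.out.println(raw.length);     // prints 2
    System.out.println(in.readByte());  // prints 3
    System.out.println(in.readByte());  // prints -5
  }
}
```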

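In passing, Nutch also provides the FetcherOutput class to make it convenient to read and write what the Fetcher has retrieved. This class delegates to the read and write methods of the two classes introduced above, and so provides read and write support for the structures the Fetcher produces at each level of granularity.
Its code is fairly direct, so I won't go into detail here.
Next comes the Content class: public final class Content extends VersionedWritable. As we can see, it extends VersionedWritable, which implements reading and writing of the version field. Let's look at the member variables first.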

  public static final String DIR_NAME = "content";
  private final static byte VERSION = 1;
  private String url;
  private String base;
  private byte[] content;
  private String contentType;
  private Properties metadata;

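DIR_NAME is the directory in which Content is stored.
VERSION is the version constant.
url is the URL of the page this Content belongs to.
base is the base URL of the page this Content belongs to.
content is the raw binary content that was fetched.
contentType is the content type of the page this Content belongs to.
metadata is the meta information of the page this Content belongs to.
Now let's see how Content reads and writes its own fields.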
public final void readFields(DataInput in) throws IOException
This method reads the object's own fields.

super.readFields(in);                         // check version

    url = UTF8.readString(in);                    // read url
    base = UTF8.readString(in);                   // read base

    content = WritableUtils.readCompressedByteArray(in);

    contentType = UTF8.readString(in);            // read contentType

    int propertyCount = in.readInt();             // read metadata
    metadata = new Properties();
    for (int i = 0; i < propertyCount; i++) {
      metadata.put(UTF8.readString(in), UTF8.readString(in));
    }

With the comments in the code, it is mostly self-explanatory. Note this statement:
super.readFields(in);
It calls the parent class VersionedWritable to read and verify the version number.
The write code is also fairly simple.

public final void write(DataOutput out) throws IOException {
    super.write(out);                             // write version

    UTF8.writeString(out, url);                   // write url
    UTF8.writeString(out, base);                  // write base

    WritableUtils.writeCompressedByteArray(out, content); // write content

    UTF8.writeString(out, contentType);           // write contentType
    
    out.writeInt(metadata.size());                // write metadata
    Iterator i = metadata.entrySet().iterator();
    while (i.hasNext()) {
      Map.Entry e = (Map.Entry)i.next();
      UTF8.writeString(out, (String)e.getKey());
      UTF8.writeString(out, (String)e.getValue());
    }
  }
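The division of labor between a version-checking base class and a subclass that handles its own fields can be sketched as below. Everything here is an illustrative assumption: VersionedRecord and PageContent are hypothetical names, gzip from java.util.zip stands in for WritableUtils' compressed byte arrays, and writeUTF/readUTF stand in for Nutch's UTF8 helper.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical base class: it alone reads, writes and verifies the version byte.
abstract class VersionedRecord {
  abstract byte getVersion();

  void write(DataOutput out) throws IOException {
    out.writeByte(getVersion());
  }

  void readFields(DataInput in) throws IOException {
    byte found = in.readByte();
    if (found != getVersion()) {
      throw new IOException("version mismatch: expected " + getVersion() + ", found " + found);
    }
  }
}

// Hypothetical, trimmed-down analogue of Content.
final class PageContent extends VersionedRecord {
  String url = "";
  byte[] content = new byte[0];
  Properties metadata = new Properties();

  @Override
  byte getVersion() { return 1; }

  @Override
  void write(DataOutput out) throws IOException {
    super.write(out);                        // version byte comes first
    out.writeUTF(url);
    byte[] packed = gzip(content);           // length-prefixed compressed body
    out.writeInt(packed.length);
    out.write(packed);
    out.writeInt(metadata.size());           // entry count, then key/value pairs
    for (String key : metadata.stringPropertyNames()) {
      out.writeUTF(key);
      out.writeUTF(metadata.getProperty(key));
    }
  }

  @Override
  void readFields(DataInput in) throws IOException {
    super.readFields(in);                    // read and verify the version
    url = in.readUTF();
    byte[] packed = new byte[in.readInt()];
    in.readFully(packed);
    content = gunzip(packed);
    int count = in.readInt();
    metadata = new Properties();
    for (int i = 0; i < count; i++) {
      String key = in.readUTF();
      metadata.setProperty(key, in.readUTF());
    }
  }

  static byte[] gzip(byte[] raw) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
      gz.write(raw);
    }
    return buffer.toByteArray();
  }

  static byte[] gunzip(byte[] packed) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(packed))) {
      byte[] chunk = new byte[4096];
      for (int n = gz.read(chunk); n != -1; n = gz.read(chunk)) {
        buffer.write(chunk, 0, n);
      }
    }
    return buffer.toByteArray();
  }

  byte[] toBytes() throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    write(new DataOutputStream(buffer));
    return buffer.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    PageContent original = new PageContent();
    original.url = "http://example.com/";
    original.content = "<html>hello</html>".getBytes(StandardCharsets.UTF_8);
    original.metadata.setProperty("Content-Type", "text/html");
    PageContent copy = new PageContent();
    copy.readFields(new DataInputStream(new ByteArrayInputStream(original.toBytes())));
    System.out.println(copy.url + " " + new String(copy.content, StandardCharsets.UTF_8));
  }
}
```

Writing the entry count before the key/value pairs is what lets the reader know how many pairs to consume, mirroring the metadata loop in the Content code above.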

In the end, what matters most in these classes is their fields and how the domain model is divided among them.
Next time, we will look at the parse-html plugin to see how Nutch extracts content from HTML pages.

You are free to share or copy this text, but please keep this document's URL as the reference URL.

Licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
