Heritrix 3.1.0 Source Code Analysis (13)
The finished(CrawlURI curi) method is defined in AbstractFrontier, a superclass of BdbFrontier:
org.archive.crawler.frontier.BdbFrontier
org.archive.crawler.frontier.AbstractFrontier
/**
 * Note that the previously emitted CrawlURI has completed
 * its processing (for now).
 *
 * The CrawlURI may be scheduled to retry, if appropriate,
 * and other related URIs may become eligible for release
 * via the next next() call, as a result of finished().
 *
 * (non-Javadoc)
 * @see org.archive.crawler.framework.Frontier#finished(org.archive.modules.CrawlURI)
 */
public void finished(CrawlURI curi) {
    try {
        KeyedProperties.loadOverridesFrom(curi);
        processFinish(curi);
    } finally {
        KeyedProperties.clearOverridesFrom(curi);
    }
}
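The try/finally in finished() scopes the URI-specific setting overrides to the duration of processFinish(). The following is a minimal sketch of that pattern only, not Heritrix's actual KeyedProperties implementation: the class OverrideScopeSketch, its methods, and the setting name balanceReplenishAmount as a map key are all hypothetical stand-ins chosen for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a per-thread stack of setting overrides. */
class OverrideScopeSketch {
    // Each thread keeps its own stack of override maps.
    private static final ThreadLocal<Deque<Map<String, Object>>> OVERRIDES =
            ThreadLocal.withInitial(ArrayDeque::new);

    static void loadOverrides(Map<String, Object> overlay) {
        OVERRIDES.get().push(overlay);
    }

    static void clearOverrides() {
        OVERRIDES.get().clear();
    }

    /** Returns the innermost override for key, or the default if none is active. */
    static Object lookup(String key, Object defaultValue) {
        for (Map<String, Object> overlay : OVERRIDES.get()) {
            if (overlay.containsKey(key)) {
                return overlay.get(key);
            }
        }
        return defaultValue;
    }

    /** Mirrors the shape of finished(): load, work, always clear. */
    static Object finished(Map<String, Object> overlay) {
        try {
            loadOverrides(overlay);
            // Stand-in for processFinish(): reads a setting while overrides are active.
            return lookup("balanceReplenishAmount", 3000);
        } finally {
            clearOverrides();
        }
    }

    public static void main(String[] args) {
        Map<String, Object> overlay = new HashMap<>();
        overlay.put("balanceReplenishAmount", 500);
        System.out.println(finished(overlay));                       // override is visible
        System.out.println(lookup("balanceReplenishAmount", 3000));  // default again
    }
}
```

Because the clearing step sits in finally, the overrides are removed even if the work in between throws, so one URI's settings cannot leak into the processing of the next URI on the same thread.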
This in turn calls void processFinish(CrawlURI curi), which BdbFrontier inherits from its parent class WorkQueueFrontier:
org.archive.crawler.frontier.BdbFrontier
org.archive.crawler.frontier.WorkQueueFrontier
/**
 * Note that the previously emitted CrawlURI has completed
 * its processing (for now).
 *
 * The CrawlURI may be scheduled to retry, if appropriate,
 * and other related URIs may become eligible for release
 * via the next next() call, as a result of finished().
 *
 * TODO: make as many decisions about what happens to the CrawlURI
 * (success, failure, retry) and queue (retire, snooze, ready) as
 * possible elsewhere, such as in DispositionProcessor. Then, break
 * this into simple branches or focused methods for each case.
 *
 * @see org.archive.crawler.framework.Frontier#finished(org.archive.modules.CrawlURI)
 */
protected void processFinish(CrawlURI curi) {
    // assert Thread.currentThread() == managerThread;
    long now = System.currentTimeMillis();
    curi.incrementFetchAttempts();
    logNonfatalErrors(curi);
    WorkQueue wq = (WorkQueue) curi.getHolder();
    // always refresh budgeting values from current curi
    // (whose overlay settings should be active here)
    wq.setSessionBudget(getBalanceReplenishAmount());
    wq.setTotalBudget(getQueueTotalBudget());
    assert (wq.peek(this) == curi) : "unexpected peek " + wq;
    int holderCost = curi.getHolderCost();
    if (needsReenqueuing(curi)) {
        // codes/errors which don't consume the URI, leaving it atop queue
        if (curi.getFetchStatus() != S_DEFERRED) {
            wq.expend(holderCost); // all retries but DEFERRED cost
        }
        long delay_ms = retryDelayFor(curi) * 1000;
        curi.processingCleanup(); // lose state that shouldn't burden retry
        wq.unpeek(curi);
        wq.update(this, curi); // rewrite any changes
        handleQueue(wq, curi.includesRetireDirective(), now, delay_ms);
        appCtx.publishEvent(new CrawlURIDispositionEvent(this, curi, DEFERRED_FOR_RETRY));
        doJournalReenqueued(curi);
        wq.makeDirty();
        return; // no further dequeueing, logging, rescheduling to occur
    }
    // Curi will definitely be disposed of without retry, so remove from queue
    wq.dequeue(this, curi);
    decrementQueuedCount(1);
    largestQueues.update(wq.getClassKey(), wq.getCount());
    log(curi);
    if (curi.isSuccess()) {
        // codes deemed 'success'
        incrementSucceededFetchCount();
        totalProcessedBytes.addAndGet(curi.getRecordedSize());
        appCtx.publishEvent(new CrawlURIDispositionEvent(this, curi, SUCCEEDED));
        doJournalFinishedSuccess(curi);
    } else if (isDisregarded(curi)) {
        // codes meaning 'undo' (even though URI was enqueued,
        // we now want to disregard it from normal success/failure tallies)
        // (eg robots-excluded, operator-changed-scope, etc)
        incrementDisregardedUriCount();
        appCtx.publishEvent(new CrawlURIDispositionEvent(this, curi, DISREGARDED));
        holderCost = 0; // no charge for disregarded URIs
        // TODO: consider reinstating forget-URI capability, so URI could be
        // re-enqueued if discovered again
        doJournalDisregarded(curi);
    } else {
        // codes meaning 'failure'
        incrementFailedFetchCount();
        appCtx.publishEvent(new CrawlURIDispositionEvent(this, curi, FAILED));
        // if exception, also send to crawlErrors
        if (curi.getFetchStatus() == S_RUNTIME_EXCEPTION) {
            Object[] array = { curi };
            loggerModule.getRuntimeErrors().log(Level.WARNING, curi.getUURI()
                    .toString(), array);
        }
        // charge queue any extra error penalty
        wq.noteError(getErrorPenaltyAmount());
        doJournalFinishedFailure(curi);
    }
    wq.expend(holderCost); // successes & failures charge cost to queue
    long delay_ms = curi.getPolitenessDelay();
    handleQueue(wq, curi.includesRetireDirective(), now, delay_ms);
    wq.makeDirty();
    if (curi.getRescheduleTime() > 0) {
        // marked up for forced-revisit at a set time
        curi.processingCleanup();
        curi.resetForRescheduling();
        futureUris.put(curi.getRescheduleTime(), curi);
        futureUriCount.incrementAndGet();
    } else {
        curi.stripToMinimal();
        curi.processingCleanup();
    }
}
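The branching in processFinish can be condensed into a small model: a retried URI stays in its queue (and is charged unless it was merely deferred), while a finished URI is removed and charged according to its outcome. The sketch below is a simplified illustration under stated assumptions, not the Heritrix code. DispositionSketch, its Queue class, and the Outcome enum are hypothetical names, and it assumes that the error penalty is simply added to the queue's spending, which the quoted source does not show.

```java
import java.util.TreeMap;

/** Hypothetical, heavily simplified model of the branches in processFinish. */
class DispositionSketch {
    enum Outcome { RETRY, DEFERRED, SUCCEEDED, DISREGARDED, FAILED }

    /** Illustrative queue bookkeeping; not the real WorkQueue. */
    static class Queue {
        long spent;   // total cost charged to this queue
        int count;    // URIs still held by this queue
        Queue(int count) { this.count = count; }
    }

    /** URIs waiting for a forced revisit, ordered by reschedule time. */
    static final TreeMap<Long, String> futureUris = new TreeMap<>();

    static void finish(Queue q, String uri, Outcome outcome,
                       int holderCost, int errorPenalty, long rescheduleTime) {
        if (outcome == Outcome.RETRY || outcome == Outcome.DEFERRED) {
            // The URI stays at the head of its queue; only real retries cost.
            if (outcome == Outcome.RETRY) {
                q.spent += holderCost;
            }
            return;
        }
        // The URI is finished for good: remove it from the queue.
        q.count--;
        switch (outcome) {
            case DISREGARDED:
                holderCost = 0;           // no charge for disregarded URIs
                break;
            case FAILED:
                q.spent += errorPenalty;  // assumption: penalty is added to spending
                break;
            default:
                break;                    // SUCCEEDED: nothing extra
        }
        q.spent += holderCost;            // successes and failures pay the holder cost
        if (rescheduleTime > 0) {
            futureUris.put(rescheduleTime, uri);  // marked for a forced revisit
        }
    }

    public static void main(String[] args) {
        Queue q = new Queue(3);
        finish(q, "http://example.com/a", Outcome.SUCCEEDED, 1, 100, 0);
        finish(q, "http://example.com/b", Outcome.DISREGARDED, 1, 100, 0);
        finish(q, "http://example.com/c", Outcome.FAILED, 1, 100, 0);
        System.out.println("spent=" + q.spent + " count=" + q.count);  // spent=102 count=0
    }
}
```

The sketch leaves out everything that does not affect the accounting: event publication, journaling, logging, and the handleQueue call that decides whether the queue is snoozed, retired, or made ready again.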
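processFinish first obtains the holder of the CrawlURI curi. The holder is the BdbWorkQueue that corresponds to the URI's classKey; this ties into how Heritrix 3.1.0 schedules its work queues, which will be analyzed in a later post.
If the URI does not need to be re-enqueued, processFinish then calls the BdbWorkQueue object's synchronized void dequeue(final WorkQueueFrontier frontier, CrawlURI expected) method, which is defined in the parent class WorkQueue: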
org.archive.crawler.frontier.BdbWorkQueue
org.archive.crawler.frontier.WorkQueue
/**
 * Remove the peekItem from the queue and adjusts the count.
 *
 * @param frontier Work queues manager.
 */
protected synchronized void dequeue(final WorkQueueFrontier frontier, CrawlURI expected) {
    try {
        deleteItem(frontier, peekItem);
    } catch (IOException e) {
        //FIXME better exception handling
        e.printStackTrace();
        throw new RuntimeException(e);
    }
    unpeek(expected);
    count--;
    lastDequeueTime = System.currentTimeMillis();
}
org.archive.crawler.frontier.BdbWorkQueue
protected void deleteItem(final WorkQueueFrontier frontier,
        final CrawlURI peekItem) throws IOException {
    try {
        final BdbMultipleWorkQueues queues = ((BdbFrontier) frontier)
                .getWorkQueues();
        queues.delete(peekItem);
    } catch (DatabaseException e) {
        throw new IOException(e);
    }
}
Finally, deleteItem calls the void delete(CrawlURI item) method of the BdbMultipleWorkQueues object. That method was covered in an earlier post, so it is not repeated here.
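The peek/unpeek/dequeue contract used above can be illustrated with an in-memory queue. This is a rough sketch that assumes an ArrayDeque in place of the Berkeley DB storage; PeekQueueSketch and its methods are hypothetical names, and removeFirst() merely stands in for the deleteItem call.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical in-memory model of a work queue with a handed-out head item. */
class PeekQueueSketch<T> {
    private final Deque<T> items = new ArrayDeque<>();
    private T peekItem;            // item currently handed out, if any
    private long count;            // number of items still in the queue
    private long lastDequeueTime;  // wall-clock time of the last removal

    synchronized void enqueue(T item) {
        items.addLast(item);
        count++;
    }

    /** Hands out the head item without removing it. */
    synchronized T peek() {
        if (peekItem == null) {
            peekItem = items.peekFirst();
        }
        return peekItem;
    }

    /** Forgets the handed-out item; it stays in the queue (used for retries). */
    synchronized void unpeek(T expected) {
        if (peekItem != expected) {
            throw new IllegalStateException("unexpected peekItem");
        }
        peekItem = null;
    }

    /** Removes the handed-out item for good and adjusts the count. */
    synchronized void dequeue(T expected) {
        if (peekItem != expected) {
            throw new IllegalStateException("unexpected peekItem");
        }
        items.removeFirst();   // stands in for deleteItem(frontier, peekItem)
        unpeek(expected);
        count--;
        lastDequeueTime = System.currentTimeMillis();
    }

    synchronized long getCount() {
        return count;
    }

    public static void main(String[] args) {
        PeekQueueSketch<String> q = new PeekQueueSketch<>();
        q.enqueue("a");
        q.enqueue("b");
        String head = q.peek();
        q.unpeek(head);                     // retry path: "a" stays at the head
        q.dequeue(q.peek());                // finished path: "a" is removed
        System.out.println(q.peek() + " " + q.getCount());  // b 1
    }
}
```

Keeping the handed-out item in the queue until dequeue is called is what lets the retry path in processFinish get away with a plain unpeek: the URI never left storage, so nothing needs to be re-inserted.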
---------------------------------------------------------------------------
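This Heritrix 3.1.0 source code analysis series is the author's original work.
If you repost it, please credit the source: cnblogs, "The Hedgehog's Gentleness" (고슴도치의 온순함).
Article link: http://www.cnblogs.com/chenying99/archive/2013/04/17/3025419.html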