Heritrix 3.1.0 Source Code Analysis (13)

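Next we analyze the void finished(CrawlURI curi) method of the BdbFrontier class, which performs the wrap-up work for a CrawlURI object once it has been processed.
The method is defined in AbstractFrontier, an ancestor class of BdbFrontier: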
org.archive.crawler.frontier.BdbFrontier
      org.archive.crawler.frontier.AbstractFrontier
    /**
     * Note that the previously emitted CrawlURI has completed
     * its processing (for now).
     *
     * The CrawlURI may be scheduled to retry, if appropriate,
     * and other related URIs may become eligible for release
     * via the next next() call, as a result of finished().
     *
     *  (non-Javadoc)
     * @see org.archive.crawler.framework.Frontier#finished(org.archive.modules.CrawlURI)
     */
    public void finished(CrawlURI curi) {
        try {
            KeyedProperties.loadOverridesFrom(curi);
            processFinish(curi);
        } finally {
            KeyedProperties.clearOverridesFrom(curi);
        }
    }
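
The try/finally shape of finished() can be illustrated with a minimal sketch. It assumes the per-URI overrides are held in per-thread state that must be cleared even when the wrapped work throws. OverrideScopeSketch, loadOverrides, clearOverrides, setting and finishWith are hypothetical names invented for illustration; they are not the Heritrix KeyedProperties API.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class OverrideScopeSketch {
    // Hypothetical per-thread stack of override maps (an assumption, not Heritrix code).
    private static final ThreadLocal<Deque<Map<String, String>>> OVERRIDES =
            ThreadLocal.withInitial(ArrayDeque::new);

    static void loadOverrides(Map<String, String> perUriSettings) {
        OVERRIDES.get().push(perUriSettings);
    }

    static void clearOverrides() {
        OVERRIDES.get().pop();
    }

    // Look up a setting, preferring the most recently loaded override.
    static String setting(String key, String defaultValue) {
        for (Map<String, String> overrides : OVERRIDES.get()) {
            String value = overrides.get(key);
            if (value != null) {
                return value;
            }
        }
        return defaultValue;
    }

    // Same shape as finished(): load, do the work, always clear.
    static String finishWith(Map<String, String> perUriSettings, boolean fail) {
        try {
            loadOverrides(perUriSettings);
            if (fail) {
                throw new IllegalStateException("simulated processing failure");
            }
            return setting("retryDelaySeconds", "900");
        } finally {
            clearOverrides(); // runs on both normal return and exception
        }
    }

    public static void main(String[] args) {
        Map<String, String> perUri = new HashMap<>();
        perUri.put("retryDelaySeconds", "30");
        System.out.println("inside scope:  " + finishWith(perUri, false));
        System.out.println("outside scope: " + setting("retryDelaySeconds", "900"));
    }
}
```

Clearing in the finally block means the override never leaks into the processing of the next URI handled by the same thread, whether the work succeeds or throws.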

finished() then calls void processFinish(CrawlURI curi), which is defined in WorkQueueFrontier, the direct parent class of BdbFrontier:
org.archive.crawler.frontier.BdbFrontier
                org.archive.crawler.frontier.WorkQueueFrontier
    /**
     * Note that the previously emitted CrawlURI has completed
     * its processing (for now).
     *
     * The CrawlURI may be scheduled to retry, if appropriate,
     * and other related URIs may become eligible for release
     * via the next next() call, as a result of finished().
     *
     * TODO: make as many decisions about what happens to the CrawlURI
     * (success, failure, retry) and queue (retire, snooze, ready) as
     * possible elsewhere, such as in DispositionProcessor. Then, break
     * this into simple branches or focused methods for each case.
     *
     * @see org.archive.crawler.framework.Frontier#finished(org.archive.modules.CrawlURI)
     */
    protected void processFinish(CrawlURI curi) {
//        assert Thread.currentThread() == managerThread;

        long now = System.currentTimeMillis();

        curi.incrementFetchAttempts();
        logNonfatalErrors(curi);

        WorkQueue wq = (WorkQueue) curi.getHolder();
        // always refresh budgeting values from current curi
        // (whose overlay settings should be active here)
        wq.setSessionBudget(getBalanceReplenishAmount());
        wq.setTotalBudget(getQueueTotalBudget());

        assert (wq.peek(this) == curi) : "unexpected peek " + wq;

        int holderCost = curi.getHolderCost();

        if (needsReenqueuing(curi)) {
            // codes/errors which don't consume the URI, leaving it atop queue
            if(curi.getFetchStatus()!=S_DEFERRED) {
                wq.expend(holderCost); // all retries but DEFERRED cost
            }
            long delay_ms = retryDelayFor(curi) * 1000;
            curi.processingCleanup(); // lose state that shouldn't burden retry
            wq.unpeek(curi);
            wq.update(this, curi); // rewrite any changes
            handleQueue(wq,curi.includesRetireDirective(),now,delay_ms);
            appCtx.publishEvent(new CrawlURIDispositionEvent(this,curi,DEFERRED_FOR_RETRY));
            doJournalReenqueued(curi);
            wq.makeDirty();
            return; // no further dequeueing, logging, rescheduling to occur
        }

        // Curi will definitely be disposed of without retry, so remove from queue
        wq.dequeue(this,curi);
        decrementQueuedCount(1);
        largestQueues.update(wq.getClassKey(), wq.getCount());
        log(curi);

        if (curi.isSuccess()) {
            // codes deemed 'success'
            incrementSucceededFetchCount();
            totalProcessedBytes.addAndGet(curi.getRecordedSize());
            appCtx.publishEvent(new CrawlURIDispositionEvent(this,curi,SUCCEEDED));
            doJournalFinishedSuccess(curi);
        } else if (isDisregarded(curi)) {
            // codes meaning 'undo' (even though URI was enqueued,
            // we now want to disregard it from normal success/failure tallies)
            // (eg robots-excluded, operator-changed-scope, etc)
            incrementDisregardedUriCount();
            appCtx.publishEvent(new CrawlURIDispositionEvent(this,curi,DISREGARDED));
            holderCost = 0; // no charge for disregarded URIs
            // TODO: consider reinstating forget-URI capability, so URI could be
            // re-enqueued if discovered again
            doJournalDisregarded(curi);
        } else {
            // codes meaning 'failure'
            incrementFailedFetchCount();
            appCtx.publishEvent(new CrawlURIDispositionEvent(this,curi,FAILED));
            // if exception, also send to crawlErrors
            if (curi.getFetchStatus() == S_RUNTIME_EXCEPTION) {
                Object[] array = { curi };
                loggerModule.getRuntimeErrors().log(Level.WARNING, curi.getUURI()
                        .toString(), array);
            }
            // charge queue any extra error penalty
            wq.noteError(getErrorPenaltyAmount());
            doJournalFinishedFailure(curi);
        }

        wq.expend(holderCost); // successes & failures charge cost to queue

        long delay_ms = curi.getPolitenessDelay();
        handleQueue(wq,curi.includesRetireDirective(),now,delay_ms);
        wq.makeDirty();

        if(curi.getRescheduleTime()>0) {
            // marked up for forced-revisit at a set time
            curi.processingCleanup();
            curi.resetForRescheduling();
            futureUris.put(curi.getRescheduleTime(),curi);
            futureUriCount.incrementAndGet();
        } else {
            curi.stripToMinimal();
            curi.processingCleanup();
        }
    }
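
The branching in processFinish can be condensed into a small sketch of which outcome is reported and what is charged to the queue. It is a simplified model, not Heritrix code: DispositionSketch, Outcome and QueueBudget are hypothetical types, and QueueBudget only tallies numbers rather than mirroring the real WorkQueue bookkeeping. The four outcome names match the event constants used in the listing.

```java
public class DispositionSketch {
    enum Disposition { DEFERRED_FOR_RETRY, SUCCEEDED, DISREGARDED, FAILED }

    /** Hypothetical summary of one finished fetch; not a Heritrix class. */
    static final class Outcome {
        final boolean needsRetry;   // stands in for needsReenqueuing(curi)
        final boolean deferred;     // stands in for fetch status == S_DEFERRED
        final boolean success;      // stands in for curi.isSuccess()
        final boolean disregarded;  // stands in for isDisregarded(curi)
        final int holderCost;

        Outcome(boolean needsRetry, boolean deferred, boolean success,
                boolean disregarded, int holderCost) {
            this.needsRetry = needsRetry;
            this.deferred = deferred;
            this.success = success;
            this.disregarded = disregarded;
            this.holderCost = holderCost;
        }
    }

    /** Hypothetical tally of what a queue has been charged. */
    static final class QueueBudget {
        long expended;
        long errorPenalties;

        void expend(int cost) { expended += cost; }
        void noteError(int penalty) { errorPenalties += penalty; }
    }

    static Disposition finish(Outcome o, QueueBudget q, int errorPenalty) {
        if (o.needsRetry) {
            // The URI stays at the head of its queue; every retry except a
            // deferral is charged.
            if (!o.deferred) {
                q.expend(o.holderCost);
            }
            return Disposition.DEFERRED_FOR_RETRY;
        }
        int cost = o.holderCost;
        Disposition result;
        if (o.success) {
            result = Disposition.SUCCEEDED;
        } else if (o.disregarded) {
            cost = 0; // no charge for disregarded URIs
            result = Disposition.DISREGARDED;
        } else {
            q.noteError(errorPenalty); // extra penalty on failure
            result = Disposition.FAILED;
        }
        q.expend(cost); // successes and failures charge their cost
        return result;
    }

    public static void main(String[] args) {
        QueueBudget q = new QueueBudget();
        Outcome failed = new Outcome(false, false, false, false, 1);
        System.out.println(finish(failed, q, 5)
                + " expended=" + q.expended
                + " penalties=" + q.errorPenalties);
    }
}
```

The sketch leaves out dequeueing, journaling, event publication and rescheduling, which the real method performs alongside these decisions.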

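processFinish first obtains the holder attribute of the CrawlURI curi. (Through its classKey, each CrawlURI corresponds to a BdbWorkQueue object. This is part of the work-queue scheduling in Heritrix 3.1.0, which will be analyzed in a later post.)
If the URI does not need to be retried, it then calls the BdbWorkQueue object's synchronized void dequeue(final WorkQueueFrontier frontier, CrawlURI expected) method, which BdbWorkQueue inherits from WorkQueue: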
org.archive.crawler.frontier.BdbWorkQueue
      org.archive.crawler.frontier.WorkQueue
    /**
     * Remove the peekItem from the queue and adjusts the count.
     *
     * @param frontier  Work queues manager.
     */
    protected synchronized void dequeue(final WorkQueueFrontier frontier, CrawlURI expected) {
        try {
            deleteItem(frontier, peekItem);
        } catch (IOException e) {
            //FIXME better exception handling
            e.printStackTrace();
            throw new RuntimeException(e);
        }
        unpeek(expected);
        count--;
        lastDequeueTime = System.currentTimeMillis();
    }
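
The relationship between peek, unpeek and dequeue can be sketched with an in-memory queue. This is only an illustration under stated assumptions: PeekQueueSketch is a hypothetical class, items are plain strings held in an ArrayDeque rather than CrawlURIs stored in a BDB database, and the check inside unpeek is an assumption about how a mismatched item might be rejected.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class PeekQueueSketch {
    private final Deque<String> items = new ArrayDeque<>();
    private String peekItem; // the item currently handed out for processing
    private int count;

    synchronized void enqueue(String uri) {
        items.addLast(uri);
        count++;
    }

    // Return the head item without removing it; it stays "held" until
    // unpeek (retry later) or dequeue (finished for good).
    synchronized String peek() {
        if (peekItem == null) {
            peekItem = items.peekFirst();
        }
        return peekItem;
    }

    // Release the held item but leave it at the head of the queue.
    synchronized void unpeek(String expected) {
        if (!expected.equals(peekItem)) {
            throw new IllegalStateException("unexpected peek item");
        }
        peekItem = null;
    }

    // Same order as the listing: delete the stored item, release the
    // held reference, then adjust the count.
    synchronized void dequeue(String expected) {
        items.remove(peekItem);
        unpeek(expected);
        count--;
    }

    synchronized int count() {
        return count;
    }

    public static void main(String[] args) {
        PeekQueueSketch q = new PeekQueueSketch();
        q.enqueue("http://example.com/a");
        q.enqueue("http://example.com/b");
        String head = q.peek();
        q.unpeek(head);   // retry path: the item stays atop the queue
        head = q.peek();
        q.dequeue(head);  // finished path: the item is removed
        System.out.println("remaining=" + q.count() + " next=" + q.peek());
    }
}
```

This mirrors the two paths in processFinish: the retry branch only calls unpeek, so the URI is offered again later, while the finished branch calls dequeue, which removes it and decrements the count.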

org.archive.crawler.frontier.BdbWorkQueue
    protected void deleteItem(final WorkQueueFrontier frontier,
            final CrawlURI peekItem) throws IOException {
        try {
            final BdbMultipleWorkQueues queues = ((BdbFrontier) frontier)
                .getWorkQueues();
             queues.delete(peekItem);
        } catch (DatabaseException e) {
            throw new IOException(e);
        }
    }

Finally, deleteItem calls the void delete(CrawlURI item) method of the BdbMultipleWorkQueues object. That method was covered in an earlier post, so it is not repeated here.
---------------------------------------------------------------------------
This Heritrix 3.1.0 source code analysis series is my own original work.
When reposting, please credit the source: the cnblogs blog "The Hedgehog's Gentleness".
Original link: http://www.cnblogs.com/chenying99/archive/2013/04/17/3025419.html
