Heritrix 3.1.0 Source Code Analysis (11)

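The previous post analyzed how the Heritrix 3.1.0 system adds CrawlURI curi objects. How, then, does the system load the CrawlURI curi seeds when it initializes?
When we run the launch command for a crawl job, what actually gets called is the void requestCrawlStart() method of the CrawlController object.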
    /**
     * Operator requested crawl begin
     */
    public void requestCrawlStart() {
        hasStarted = true;
        sendCrawlStateChangeEvent(State.PREPARING, CrawlStatus.PREPARING);

        if(recoveryCheckpoint==null) {
            // only announce (trigger scheduling of) seeds
            // when doing a cold (non-recovery) start
            getSeeds().announceSeeds();
        }

        setupToePool();

        // A proper exit will change this value.
        this.sExit = CrawlStatus.FINISHED_ABNORMAL;

        if (getPauseAtStart()) {
            // frontier is already paused unless started, so just
            // 'complete'/ack pause
            completePause();
        } else {
            getFrontier().run();
        }
    }

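On a cold (non-recovery) start, requestCrawlStart() goes on to call getSeeds().announceSeeds(). The object actually returned by getSeeds() is a TextSeedModule (injected automatically by Spring), so it is that class's void announceSeeds() method that runs.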
    /**
     * Announce all seeds from configured source to SeedListeners
     * (including nonseed lines mixed in).
     * @see org.archive.modules.seeds.SeedModule#announceSeeds()
     */
    public void announceSeeds() {
        if(getBlockAwaitingSeedLines()>-1) {
            final CountDownLatch latch = new CountDownLatch(getBlockAwaitingSeedLines());
            new Thread(){
                @Override
                public void run() {
                    announceSeeds(latch);
                    while(latch.getCount()>0) {
                        latch.countDown();
                    }
                }
            }.start();
            try {
                latch.await();
            } catch (InterruptedException e) {
                // do nothing
            }
        } else {
            announceSeeds(null);
        }
    }
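
The latch hand-off in announceSeeds() can be tried in isolation. The following is a minimal sketch, not Heritrix code: LatchedAnnouncer, start() and awaitCompletion() are hypothetical names, and plain strings stand in for seed lines. It shows a caller that blocks only until a configured number of lines has been processed on a background thread, and a worker that drains the latch when the input is shorter than that number.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;

/** Hypothetical demo class; none of these names come from Heritrix. */
class LatchedAnnouncer {

    /** Lines processed so far; thread-safe because the worker thread appends to it. */
    final List<String> announced = new CopyOnWriteArrayList<>();
    private Thread worker;

    /**
     * Processes lines on a background thread and blocks the caller until
     * blockAwaitingLines of them are done or the input is exhausted.
     * Returns how many lines had been processed when the caller resumed.
     */
    int start(List<String> lines, int blockAwaitingLines) {
        final CountDownLatch latch = new CountDownLatch(blockAwaitingLines);
        worker = new Thread(() -> {
            for (String line : lines) {
                announced.add(line);
                latch.countDown();          // one count per processed line
            }
            while (latch.getCount() > 0) {  // input shorter than the latch count:
                latch.countDown();          // drain it so the caller is released
            }
        });
        worker.start();
        try {
            latch.await();                  // returns once the count reaches zero
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return announced.size();
    }

    /** Waits for the background thread to finish the remaining lines. */
    void awaitCompletion() {
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        LatchedAnnouncer announcer = new LatchedAnnouncer();
        int atResume = announcer.start(List.of("a.example", "b.example", "c.example"), 2);
        System.out.println("lines processed when the caller resumed: " + atResume);
        announcer.awaitCompletion();
        System.out.println("total processed: " + announcer.announced.size());
    }
}
```

One deliberate difference from the listing above: the sketch restores the interrupt flag instead of ignoring InterruptedException, which is a common convention and not something the Heritrix method does.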

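In announceSeeds(), the if branch creates a CountDownLatch latch that counts processed seed lines, so the calling thread waits only until that many lines have been handled by the new thread; the else branch passes null instead. Both branches then call void announceSeeds(CountDownLatch latchOrNull):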
    protected void announceSeeds(CountDownLatch latchOrNull) {
        BufferedReader reader = new BufferedReader(textSource.obtainReader());
        try {
            announceSeedsFromReader(reader,latchOrNull);
        } finally {
            IOUtils.closeQuietly(reader);
        }
    }

This method first obtains the Reader (a StringReader) from ReadSource textSource (an org.archive.spring.ConfigString), then calls void announceSeedsFromReader(BufferedReader reader, CountDownLatch latchOrNull).
    /**
     * Announce all seeds (and nonseed possible-directive lines) from
     * the given Reader
     * @param reader source of seed/directive lines
     * @param latchOrNull if non-null, sent countDown after each line, allowing
     * another thread to proceed after a configurable number of lines processed
     */
    protected void announceSeedsFromReader(BufferedReader reader, CountDownLatch latchOrNull) {
        String s;
        Iterator<String> iter =
            new RegexLineIterator(
                    new LineReadingIterator(reader),
                    RegexLineIterator.COMMENT_LINE,
                    RegexLineIterator.NONWHITESPACE_ENTRY_TRAILING_COMMENT,
                    RegexLineIterator.ENTRY);

        int count = 0;
        while (iter.hasNext()) {
            s = (String) iter.next();
            if(Character.isLetterOrDigit(s.charAt(0))) {
                // consider a likely URI
                seedLine(s);
                count++;
                if(count%20000==0) {
                    System.runFinalization();
                }
            } else {
                // report just in case it's a useful directive
                nonseedLine(s);
            }
            if(latchOrNull!=null) {
                latchOrNull.countDown();
            }
        }
        publishConcludedSeedBatch();
    }
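
The classification step in announceSeedsFromReader() can be sketched without Heritrix on the classpath. In the sketch below, SeedLineClassifierSketch and both regular expressions are assumptions made for illustration; they are simplified stand-ins and are not copied from RegexLineIterator. Only the Character.isLetterOrDigit test on the first character is taken from the listing above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical demo class; the patterns are simplified assumptions. */
class SeedLineClassifierSketch {

    // Assumed: a blank line, or a line holding only a '#' comment, is ignored.
    private static final Pattern COMMENT_OR_BLANK = Pattern.compile("\\s*(#.*)?");
    // Assumed: an entry is one whitespace-free token, optionally followed by a comment.
    private static final Pattern ENTRY_WITH_TRAILING_COMMENT =
            Pattern.compile("\\s*(\\S+)\\s*(#.*)?");

    final List<String> seeds = new ArrayList<>();
    final List<String> nonSeeds = new ArrayList<>();

    /** Splits text into lines and sorts each entry into seeds or nonSeeds. */
    void readAll(String text) {
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (COMMENT_OR_BLANK.matcher(line).matches()) {
                    continue;                       // nothing to announce
                }
                Matcher m = ENTRY_WITH_TRAILING_COMMENT.matcher(line);
                if (!m.matches()) {
                    continue;                       // this sketch simply skips such lines
                }
                String entry = m.group(1);
                if (Character.isLetterOrDigit(entry.charAt(0))) {
                    seeds.add(entry);               // likely URI, as in seedLine(s)
                } else {
                    nonSeeds.add(entry);            // possible directive, as in nonseedLine(s)
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        SeedLineClassifierSketch c = new SeedLineClassifierSketch();
        c.readAll("# seeds\nexample.com  # main site\n+http://a.example/\n");
        System.out.println("seeds=" + c.seeds + " nonSeeds=" + c.nonSeeds);
    }
}
```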

The loop iterates over the URL strings and calls void seedLine(String uri) for each line that looks like a URI.
    /**
     * Handle a read line that is probably a seed.
     *
     * @param uri String seed-containing line
     */
    protected void seedLine(String uri) {
        if (!uri.matches("[a-zA-Z][\\w+\\-]+:.*")) { // Rfc2396 s3.1 scheme,
                                                     // minus '.'
            // Does not begin with scheme, so try http://
            uri = "http://" + uri;
        }
        try {
            UURI uuri = UURIFactory.getInstance(uri);
            CrawlURI curi = new CrawlURI(uuri);
            curi.setSeed(true);
            curi.setSchedulingDirective(SchedulingConstants.MEDIUM);
            if (getSourceTagSeeds()) {
                curi.setSourceTag(curi.toString());
            }
            publishAddedSeed(curi);
        } catch (URIException e) {
            // try as nonseed line as fallback
            nonseedLine(uri);
        }
    }
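
The scheme check at the top of seedLine() is easy to experiment with on its own. This small sketch reuses the regular expression from the listing; the class SeedSchemeSketch and the method withDefaultScheme() are hypothetical names, and the UURIFactory step that follows in the real method is not reproduced here.

```java
/** Hypothetical demo class; only the regular expression comes from seedLine(). */
class SeedSchemeSketch {

    // A letter, then one or more of [A-Za-z0-9_+-], then ':' and anything after it.
    static final String HAS_SCHEME = "[a-zA-Z][\\w+\\-]+:.*";

    /** Returns the line unchanged if it appears to start with a scheme, else prefixes http://. */
    static String withDefaultScheme(String line) {
        return line.matches(HAS_SCHEME) ? line : "http://" + line;
    }

    public static void main(String[] args) {
        String[] samples = {"example.com/a", "https://example.com/", "localhost:8080/x"};
        for (String s : samples) {
            System.out.println(s + " -> " + withDefaultScheme(s));
        }
    }
}
```

Because '.' is excluded from the character class, a bare host such as example.com/a is not mistaken for a scheme and gets the http:// prefix. A line such as localhost:8080/x, however, does match the pattern, so this step leaves it unprefixed; what happens to it afterwards depends on UURIFactory, which the sketch does not model.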

Finally, seedLine() calls the void publishAddedSeed(CrawlURI curi) method inherited from the parent class SeedModule, which notifies every registered listener (the observer pattern).
    protected void publishAddedSeed(CrawlURI curi) {
        for (SeedListener l: seedListeners) {
            l.addedSeed(curi);
        }
    }

The BdbFrontier class implements the SeedListener interface indirectly: its abstract superclass AbstractFrontier provides the void addedSeed(CrawlURI puri) method, which simply schedules the seed.
    /**
     * When notified of a seed via the SeedListener interface,
     * schedule it.
     *
     * @see org.archive.modules.seeds.SeedListener#addedSeed(org.archive.modules.CrawlURI)
     */
    public void addedSeed(CrawlURI puri) {
        schedule(puri);
    }
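
The publisher/listener relationship between the seed module and the frontier can be summarized in a few lines of plain Java. This is a sketch of the observer pattern only: SeedListenerSketch, SeedPublisherSketch and FrontierSketch are hypothetical stand-ins, strings replace CrawlURI, and listeners are registered by an explicit addListener() call because this post does not show how Heritrix wires them.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for SeedListener: receives each published seed. */
interface SeedListenerSketch {
    void addedSeed(String uri);
}

/** Hypothetical stand-in for the seed module: fans each seed out to all listeners. */
class SeedPublisherSketch {
    private final List<SeedListenerSketch> listeners = new ArrayList<>();

    void addListener(SeedListenerSketch listener) {
        listeners.add(listener);
    }

    void publishAddedSeed(String uri) {
        for (SeedListenerSketch l : listeners) {
            l.addedSeed(uri);
        }
    }
}

/** Hypothetical stand-in for the frontier: schedules whatever seed it is told about. */
class FrontierSketch implements SeedListenerSketch {
    final List<String> scheduled = new ArrayList<>();

    @Override
    public void addedSeed(String uri) {
        schedule(uri);
    }

    void schedule(String uri) {
        scheduled.add(uri);   // a real frontier would enqueue the URI for crawling
    }
}

class ObserverDemo {
    public static void main(String[] args) {
        SeedPublisherSketch seeds = new SeedPublisherSketch();
        FrontierSketch frontier = new FrontierSketch();
        seeds.addListener(frontier);
        seeds.publishAddedSeed("http://example.com/");
        System.out.println("scheduled: " + frontier.scheduled);
    }
}
```

The publisher never needs to know that its listener is a frontier, which is why the seed-reading code above contains no reference to BdbFrontier.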

---------------------------------------------------------------------------
This Heritrix 3.1.0 source code analysis series is my own original work.
When reposting, please credit the source: the cnblogs blog "The Gentleness of the Hedgehog".
Original link: http://www.cnblogs.com/chenying99/archive/2013/04/20/3031924.html
