Elasticsearch 2.2.0 Analysis: Chinese Word Segmentation

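Elasticsearch ships with many built-in analyzers, but the default analyzer handles Chinese poorly, so a separate plugin has to be installed. The two most commonly used are smartcn, which is based on ICTCLAS from the Chinese Academy of Sciences, and IKAnalyzer. At the time of writing, IKAnalyzer does not support the latest Elasticsearch release, 2.2.0. The smartcn Chinese analyzer, by contrast, is officially supported: it provides an analyzer for Chinese or mixed Chinese-English text and works with 2.2.0. However, smartcn does not support custom dictionaries, so it is best suited to initial testing. The second half of this article shows how to make IKAnalyzer work with the latest version.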

smartcn

Install: plugin install analysis-smartcn
Uninstall: plugin remove analysis-smartcn
Test:
Request: POST http://127.0.0.1:9200/_analyze/

{
  "analyzer": "smartcn",
  "text": " "
}

Response:

{
    "tokens": [
        {
            "token": " ", 
            "start_offset": 0, 
            "end_offset": 2, 
            "type": "word", 
            "position": 0
        }, 
        {
            "token": " ", 
            "start_offset": 2, 
            "end_offset": 3, 
            "type": "word", 
            "position": 1
        }, 
        {
            "token": " ", 
            "start_offset": 3, 
            "end_offset": 5, 
            "type": "word", 
            "position": 2
        }, 
        {
            "token": " ", 
            "start_offset": 5, 
            "end_offset": 6, 
            "type": "word", 
            "position": 3
        }, 
        {
            "token": " ", 
            "start_offset": 6, 
            "end_offset": 7, 
            "type": "word", 
            "position": 4
        }, 
        {
            "token": " ", 
            "start_offset": 7, 
            "end_offset": 8, 
            "type": "word", 
            "position": 5
        }, 
        {
            "token": " ", 
            "start_offset": 8, 
            "end_offset": 11, 
            "type": "word", 
            "position": 6
        }, 
        {
            "token": " ", 
            "start_offset": 11, 
            "end_offset": 13, 
            "type": "word", 
            "position": 7
        }
    ]
}
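
The request above can also be sent from a script. The following is a minimal sketch using only the Python standard library, assuming the node listens on 127.0.0.1:9200 as in this article. The function names build_analyze_request and analyze are hypothetical helpers, not part of any Elasticsearch client.

```python
import json
import urllib.request


def build_analyze_request(analyzer, text, base_url="http://127.0.0.1:9200"):
    """Build (without sending) a POST request for the _analyze endpoint."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=base_url + "/_analyze/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(analyzer, text, base_url="http://127.0.0.1:9200"):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_analyze_request(analyzer, text, base_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling analyze("smartcn", some_text) against a running node should return a dictionary with a "tokens" list shaped like the response shown above. Keeping the request construction separate from the network call makes it possible to inspect the request without a live cluster.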

For comparison, look at what the standard analyzer produces: replace smartcn with standard in the request above.
The result is as follows:

{
    "tokens": [
        {
            "token": " ", 
            "start_offset": 0, 
            "end_offset": 1, 
            "type": "<IDEOGRAPHIC>", 
            "position": 0
        }, 
        {
            "token": " ", 
            "start_offset": 1, 
            "end_offset": 2, 
            "type": "<IDEOGRAPHIC>", 
            "position": 1
        }, 
        {
            "token": " ", 
            "start_offset": 2, 
            "end_offset": 3, 
            "type": "<IDEOGRAPHIC>", 
            "position": 2
        }, 
        {
            "token": " ", 
            "start_offset": 3, 
            "end_offset": 4, 
            "type": "<IDEOGRAPHIC>", 
            "position": 3
        }, 
        {
            "token": " ", 
            "start_offset": 4, 
            "end_offset": 5, 
            "type": "<IDEOGRAPHIC>", 
            "position": 4
        }, 
        {
            "token": " ", 
            "start_offset": 5, 
            "end_offset": 6, 
            "type": "<IDEOGRAPHIC>", 
            "position": 5
        }, 
        {
            "token": " ", 
            "start_offset": 6, 
            "end_offset": 7, 
            "type": "<IDEOGRAPHIC>", 
            "position": 6
        }, 
        {
            "token": " ", 
            "start_offset": 7, 
            "end_offset": 8, 
            "type": "<IDEOGRAPHIC>", 
            "position": 7
        }, 
        {
            "token": " ", 
            "start_offset": 8, 
            "end_offset": 9, 
            "type": "<IDEOGRAPHIC>", 
            "position": 8
        }, 
        {
            "token": " ", 
            "start_offset": 9, 
            "end_offset": 10, 
            "type": "<IDEOGRAPHIC>", 
            "position": 9
        }, 
        {
            "token": " ", 
            "start_offset": 10, 
            "end_offset": 11, 
            "type": "<IDEOGRAPHIC>", 
            "position": 10
        }, 
        {
            "token": " ", 
            "start_offset": 11, 
            "end_offset": 12, 
            "type": "<IDEOGRAPHIC>", 
            "position": 11
        }, 
        {
            "token": " ", 
            "start_offset": 12, 
            "end_offset": 13, 
            "type": "<IDEOGRAPHIC>", 
            "position": 12
        }
    ]
}

As this shows, the standard analyzer is essentially unusable for Chinese: every single character becomes its own token.
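
The difference can be checked mechanically from the offsets alone. The sketch below is an illustration rather than part of any Elasticsearch API: token_lengths and is_single_character_split are hypothetical helpers, and the sample data reuses only the start and end offsets from the two responses above.

```python
def token_lengths(response):
    """Return the character length of each token, computed from its offsets."""
    return [t["end_offset"] - t["start_offset"] for t in response["tokens"]]


def is_single_character_split(response):
    """True when the response has tokens and every token spans one character."""
    lengths = token_lengths(response)
    return bool(lengths) and all(n == 1 for n in lengths)


def from_spans(spans):
    """Build a minimal _analyze-style response from (start, end) pairs."""
    return {"tokens": [{"start_offset": s, "end_offset": e} for s, e in spans]}


# Offsets copied from the smartcn and standard responses in this article.
smartcn = from_spans([(0, 2), (2, 3), (3, 5), (5, 6), (6, 7), (7, 8), (8, 11), (11, 13)])
standard = from_spans([(i, i + 1) for i in range(13)])

print(token_lengths(smartcn))               # [2, 1, 2, 1, 1, 1, 3, 2]
print(is_single_character_split(smartcn))   # False
print(is_single_character_split(standard))  # True
```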
This article was written by secisland. Please credit the author and the source when reposting.

Making IKAnalyzer support 2.2.0

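The latest version on GitHub currently supports only Elasticsearch 2.1.1. The repository is at https://github.com/medcl/elasticsearch-analysis-ik. Because the latest Elasticsearch release is now 2.2.0, a few changes are needed before the plugin will work.
1. Download the source code, extract it to any directory, and edit the pom.xml file in the elasticsearch-analysis-ik-master directory: find the line that sets the Elasticsearch version and change the version number on it to 2.2.0.
2. Build the code with mvn package.
3. When the build finishes, elasticsearch-analysis-ik-1.7.0.zip is generated under target\releases.
4. Extract that file into the Elasticsearch/plugins directory.
5. Edit the configuration file and add one line: index.analysis.analyzer.ik.type : "ik"
6. Restart Elasticsearch.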
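
Step 1 can be scripted if the build needs to be repeated. The sketch below rewrites the text of a single XML element in pom.xml. The article does not name the element, so the tag is passed in as a parameter, and the name elasticsearch.version used in the example is an assumption that should be checked against the actual pom.xml. set_property_version is a hypothetical helper.

```python
import re


def set_property_version(pom_text, tag, new_version):
    """Replace the text of the first <tag>...</tag> element with new_version."""
    pattern = re.compile(r"(<{0}>)([^<]*)(</{0}>)".format(re.escape(tag)))
    updated, count = pattern.subn(
        lambda m: m.group(1) + new_version + m.group(3), pom_text, count=1
    )
    if count == 0:
        raise ValueError("element <%s> not found" % tag)
    return updated


# Example with an inline snippet; the tag name here is an assumption.
sample = "<properties><elasticsearch.version>2.1.1</elasticsearch.version></properties>"
print(set_property_version(sample, "elasticsearch.version", "2.2.0"))
```

A plain-text substitution is used instead of an XML parser so that comments and formatting in pom.xml are left untouched.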
Test: send the same request as above, replacing the analyzer with ik.
Response:

{
    "tokens": [
        {
            "token": " ", 
            "start_offset": 0, 
            "end_offset": 2, 
            "type": "CN_WORD", 
            "position": 0
        }, 
        {
            "token": " ", 
            "start_offset": 3, 
            "end_offset": 5, 
            "type": "CN_WORD", 
            "position": 1
        }, 
        {
            "token": " ", 
            "start_offset": 5, 
            "end_offset": 7, 
            "type": "CN_WORD", 
            "position": 2
        }, 
        {
            "token": " ", 
            "start_offset": 8, 
            "end_offset": 11, 
            "type": "CN_WORD", 
            "position": 3
        }, 
        {
            "token": " ", 
            "start_offset": 8, 
            "end_offset": 10, 
            "type": "CN_WORD", 
            "position": 4
        }, 
        {
            "token": " ", 
            "start_offset": 8, 
            "end_offset": 9, 
            "type": "CN_WORD", 
            "position": 5
        }, 
        {
            "token": " ", 
            "start_offset": 9, 
            "end_offset": 10, 
            "type": "CN_CHAR", 
            "position": 6
        }, 
        {
            "token": " ", 
            "start_offset": 10, 
            "end_offset": 12, 
            "type": "CN_WORD", 
            "position": 7
        }, 
        {
            "token": " ", 
            "start_offset": 11, 
            "end_offset": 13, 
            "type": "CN_WORD", 
            "position": 8
        }
    ]
}

As this shows, the two analyzers still segment the text differently.
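
To see exactly where the two results diverge, the token spans can be compared as sets. This is a small illustrative sketch; spans and compare_spans are hypothetical helpers, and the data is limited to the offsets that appear in the smartcn and ik responses above.

```python
def spans(response):
    """Collect the (start_offset, end_offset) pair of every token."""
    return {(t["start_offset"], t["end_offset"]) for t in response["tokens"]}


def compare_spans(first, second):
    """Report spans produced by both analyzers and by only one of them."""
    a, b = spans(first), spans(second)
    return {
        "shared": sorted(a & b),
        "only_first": sorted(a - b),
        "only_second": sorted(b - a),
    }


def from_spans(pairs):
    """Build a minimal _analyze-style response from (start, end) pairs."""
    return {"tokens": [{"start_offset": s, "end_offset": e} for s, e in pairs]}


# Offsets copied from the smartcn and ik responses in this article.
smartcn = from_spans([(0, 2), (2, 3), (3, 5), (5, 6), (6, 7), (7, 8), (8, 11), (11, 13)])
ik = from_spans([(0, 2), (3, 5), (5, 7), (8, 11), (8, 10), (8, 9), (9, 10), (10, 12), (11, 13)])

result = compare_spans(smartcn, ik)
print(result["shared"])       # [(0, 2), (3, 5), (8, 11), (11, 13)]
print(result["only_second"])  # [(5, 7), (8, 9), (8, 10), (9, 10), (10, 12)]
```

The spans unique to ik overlap one another, which reflects that ik emits several candidate words for the same stretch of text, while smartcn emits one non-overlapping sequence.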
To extend the dictionary, add the words you need to mydict.dic under config\ik\custom and then restart Elasticsearch. Note that the file must be encoded as UTF-8 without a BOM.
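
Some editors silently add a BOM when saving as UTF-8, so it can help to write and check the dictionary file from a script. The sketch below assumes the dictionary holds one word per line. has_utf8_bom and append_words are hypothetical helpers, and the path in the usage note is only an example.

```python
import codecs


def has_utf8_bom(path):
    """Return True if the file starts with the three-byte UTF-8 BOM."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8


def append_words(path, words):
    """Append one word per line, encoded as UTF-8 without a BOM."""
    # The "utf-8" codec never writes a BOM ("utf-8-sig" is the one that does).
    with open(path, "a", encoding="utf-8", newline="\n") as f:
        for word in words:
            f.write(word.strip() + "\n")
```

For example, append_words("config/ik/custom/mydict.dic", [new_word]) followed by has_utf8_bom on the same path confirms that the file is still BOM-free. If the existing file does not end with a newline, add one first so the new word is not joined to the last entry.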
For example, add the Chinese word for secisland to the dictionary, then run the query again.
Request: POST http://127.0.0.1:9200/_analyze/
Parameters:

{
  "analyzer": "ik",
  "text": " "
}

Response:

{
    "tokens": [
        {
            "token": " ", 
            "start_offset": 0, 
            "end_offset": 4, 
            "type": "CN_WORD", 
            "position": 0
        }, 
        {
            "token": " ", 
            "start_offset": 1, 
            "end_offset": 2, 
            "type": "CN_WORD", 
            "position": 1
        }, 
        {
            "token": " ", 
            "start_offset": 2, 
            "end_offset": 3, 
            "type": "CN_WORD", 
            "position": 2
        }, 
        {
            "token": " ", 
            "start_offset": 3, 
            "end_offset": 4, 
            "type": "CN_CHAR", 
            "position": 3
        }, 
        {
            "token": " ", 
            "start_offset": 5, 
            "end_offset": 7, 
            "type": "CN_WORD", 
            "position": 4
        }, 
        {
            "token": " ", 
            "start_offset": 5, 
            "end_offset": 6, 
            "type": "TYPE_CNUM", 
            "position": 5
        }, 
        {
            "token": " ", 
            "start_offset": 6, 
            "end_offset": 7, 
            "type": "COUNT", 
            "position": 6
        }, 
        {
            "token": " ", 
            "start_offset": 7, 
            "end_offset": 9, 
            "type": "CN_WORD", 
            "position": 7
        }, 
        {
            "token": " ", 
            "start_offset": 9, 
            "end_offset": 11, 
            "type": "CN_WORD", 
            "position": 8
        }, 
        {
            "token": " ", 
            "start_offset": 11, 
            "end_offset": 13, 
            "type": "CN_WORD", 
            "position": 9
        }
    ]
}

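As the result above shows, the secisland word is now recognized as a single token.
secisland will continue to analyze the features of the latest Elasticsearch release step by step, so stay tuned. You are also welcome to follow the secisland official account.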

You may freely share or copy this text, but please keep the URL of this document as the reference URL.

Licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
