Elasticsearch: Finding Associated Words and Shingles

Shingle Token Filter

A token filter of type shingle that constructs shingles (token n-grams) from a token stream. In other words, it creates combinations of tokens as a single token. For example, the sentence "please divide this sentence into shingles" might be tokenized into the shingles "please divide", "divide this", "this sentence", "sentence into", and "into shingles".
This filter handles position increments > 1 by inserting filler tokens (tokens with termtext "_"). It does not handle a position increment of 0.
The following settings can be set for a shingle token filter:

Setting: Description
max_shingle_size: The maximum shingle size. Defaults to 2.
min_shingle_size: The minimum shingle size. Defaults to 2.
output_unigrams: If true, the output will contain the input tokens (unigrams) as well as the shingles. Defaults to true.
output_unigrams_if_no_shingles: If output_unigrams is false, the output will contain the input tokens (unigrams) if no shingles are available. If output_unigrams is set to true, this setting has no effect. Defaults to false.
token_separator: The string to use when joining adjacent tokens to form a shingle. Defaults to " ".
filler_token: The string to use as a replacement for each position at which there is no actual token in the stream. For instance, this string is used if the position increment is greater than one when a stop filter is used together with the shingle filter. Defaults to "_".

The index-level setting index.max_shingle_diff controls the maximum allowed difference between max_shingle_size and min_shingle_size.
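To make the interaction of these settings concrete, here is a minimal Python sketch that imitates them on a plain list of tokens. It is an illustration only, not the filter's actual implementation: the function name shingle is hypothetical, filler tokens and position increments are not modeled, and the order of the emitted tokens is a simplification.

```python
from typing import List


def shingle(tokens: List[str],
            min_shingle_size: int = 2,
            max_shingle_size: int = 2,
            output_unigrams: bool = True,
            output_unigrams_if_no_shingles: bool = False,
            token_separator: str = " ") -> List[str]:
    """Imitate the shingle filter settings on a token list (no filler tokens)."""
    # This sketch only accepts sizes that describe real multi-token shingles.
    if min_shingle_size < 2 or max_shingle_size < min_shingle_size:
        raise ValueError("need 2 <= min_shingle_size <= max_shingle_size")

    out = []
    made_shingle = False
    for start in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[start])
        # Emit every shingle of an allowed size that starts at this token.
        for size in range(min_shingle_size, max_shingle_size + 1):
            if start + size <= len(tokens):
                out.append(token_separator.join(tokens[start:start + size]))
                made_shingle = True

    # Fall back to unigrams only when nothing else would be emitted.
    if not made_shingle and not output_unigrams and output_unigrams_if_no_shingles:
        return list(tokens)
    return out


tokens = "please divide this sentence into shingles".split()
print(shingle(tokens, output_unigrams=False))
```

With output_unigrams=False the call prints the five two-word shingles from the example sentence; raising max_shingle_size to 3 adds the three-word shingles as well.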
Reference: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-shingle-tokenfilter.html

Finding Associated Words

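As useful as phrase and proximity queries are, they still have a downside: they are too strict. Every term in a phrase query must appear in the document for it to match, even when slop is used.
The flexibility in word order that slop gives you also comes at a price, because the association between words is lost. You can identify documents in which sue, alligator, and ate occur close together, but you cannot tell whether Sue ate or the alligator ate.
Words used together express more meaning than the same words used in isolation. "I'm not happy I'm working" and "I'm happy I'm not working" contain the same words in similar proximity, yet their meanings are quite different.
If we index pairs of words instead of indexing each word independently, we retain more of the context in which the words were used.
For the sentence "Sue ate the alligator", we would not only index each word (or unigram) as a term:
["sue", "ate", "the", "alligator"]
We would also index each word together with its neighbor as a single term:
["sue ate", "ate the", "the alligator"]
These word pairs (or bigrams) are known as shingles.
TIP
Shingles are not restricted to pairs of words. You can also index word triplets (trigrams) as single terms:
["sue ate the", "ate the alligator"]
Trigrams give you higher precision, but they greatly increase the number of unique terms in the index. In most cases bigrams are sufficient.
Of course, shingles help only if the user enters the query in the same order as the original document. A query for sue alligator matches the individual words but none of the shingles.
Fortunately, users tend to express their searches using constructs similar to those in the data they are searching. But one point is important: indexing bigrams alone is not enough. We still need unigrams, and we can use matching bigrams as a signal to increase the relevance score.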
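The effect of word order can be checked with a short Python sketch. The helper ngrams is hypothetical and splits on whitespace only, which is a simplification of real analysis; it is enough to show that a query for sue alligator finds both of its words in the sentence but none of the sentence's bigrams.

```python
def ngrams(text, n):
    """Return the word n-grams of a lowercased, whitespace-split text."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


doc = "Sue ate the alligator"
query = "sue alligator"

print(ngrams(doc, 2))  # bigram shingles of the document
print(ngrams(doc, 3))  # trigram shingles of the document

# Every query word appears in the document as a unigram ...
unigram_hits = set(ngrams(query, 1)) & set(ngrams(doc, 1))
# ... but the query's only bigram does not appear among the document's bigrams.
bigram_hits = set(ngrams(query, 2)) & set(ngrams(doc, 2))

print(sorted(unigram_hits))
print(sorted(bigram_hits))
```

This is why the unigram field has to stay in the query: the bigram field alone would return nothing for this search.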

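Producing Shingles

Shingles need to be created at index time as part of the analysis process. We could index both unigrams and bigrams into a single field, but it is cleaner to keep them in separate fields that can be queried independently. The unigram field forms the basis of our search, and the bigram field is used to boost relevance.
First, we need to create an analyzer that uses the shingle token filter: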

DELETE /my_index

PUT /my_index
{
    "settings": {
        "number_of_shards": 1,  
        "analysis": {
            "filter": {
                "my_shingle_filter": {
                    "type":             "shingle",
                    "min_shingle_size": 2, 
                    "max_shingle_size": 2, 
                    "output_unigrams":  false   
                }
            },
            "analyzer": {
                "my_shingle_analyzer": {
                    "type":             "custom",
                    "tokenizer":        "standard",
                    "filter": [
                        "lowercase",
                        "my_shingle_filter" 
                    ]
                }
            }
        }
    }
}

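The default min/max shingle size is 2, so these two values do not strictly need to be set. output_unigrams is set to false because the shingle token filter outputs unigrams by default, and we want to keep unigrams and bigrams in separate fields.
Test the analyzer with the Analyze API: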

GET /my_index/_analyze?analyzer=my_shingle_analyzer
Sue ate the alligator

As expected, we get three terms:

sue ate

ate the

the alligator
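The Analyze request above passes the analyzer as a query-string parameter and the text as a raw body, which is the form used by older Elasticsearch releases. Newer releases expect both values in a JSON request body instead. The following Python sketch only builds such a request and sends nothing; the helper name analyze_request is hypothetical, and the exact form your cluster accepts should be checked against the documentation for your version.

```python
import json


def analyze_request(index, analyzer, text):
    """Return (method, path, body) for an Analyze API call with a JSON body."""
    body = {"analyzer": analyzer, "text": text}
    return "GET", f"/{index}/_analyze", json.dumps(body)


method, path, body = analyze_request(
    "my_index", "my_shingle_analyzer", "Sue ate the alligator"
)
print(method, path)
print(body)
```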

Now we can set up a field that uses the new analyzer.

Multifields

Because it is cleaner to index unigrams and bigrams separately, we create the title field as a multifield (see String Sorting and Multifields):

PUT /my_index/_mapping/my_type
{
    "my_type": {
        "properties": {
            "title": {
                "type": "string",
                "fields": {
                    "shingles": {
                        "type":     "string",
                        "analyzer": "my_shingle_analyzer"
                    }
                }
            }
        }
    }
}
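The mapping above uses the string field type and a mapping type (my_type), both of which belong to older Elasticsearch releases: string was replaced by text and keyword in 5.0, and mapping types were removed in later versions. As a hedged sketch, the equivalent body for a recent release might be built as follows. The helper name shingle_multifield_mapping is hypothetical, and the assumption is that the body is sent to PUT /my_index/_mapping.

```python
import json


def shingle_multifield_mapping(field="title", analyzer="my_shingle_analyzer"):
    """Build a typeless mapping body with a 'shingles' sub-field."""
    return {
        "properties": {
            field: {
                "type": "text",  # unigrams, analyzed with the default analyzer
                "fields": {
                    "shingles": {
                        "type": "text",
                        "analyzer": analyzer,  # bigrams via the shingle analyzer
                    }
                },
            }
        }
    }


mapping = shingle_multifield_mapping()
print(json.dumps(mapping, indent=2))
```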

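With this mapping, the title value of a JSON document is indexed both as unigrams (the title field) and as bigrams (the title.shingles field), so the two fields can be queried independently.
Finally, we can index the example documents: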

POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "title": "Sue ate the alligator" }
{ "index": { "_id": 2 }}
{ "title": "The alligator ate Sue" }
{ "index": { "_id": 3 }}
{ "title": "Sue never goes anywhere without her alligator skin purse" }
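The bulk body is newline-delimited JSON: one action line followed by one source line per document, and the body must end with a newline. A minimal Python sketch of building that body from a list of documents follows; the helper name bulk_body is hypothetical, and nothing is sent to a cluster.

```python
import json


def bulk_body(docs):
    """Build a newline-delimited bulk body from (id, source) pairs."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_id": doc_id}}))  # action line
        lines.append(json.dumps(source))                      # document line
    # The Bulk API requires the final line to be terminated by a newline.
    return "\n".join(lines) + "\n"


docs = [
    (1, {"title": "Sue ate the alligator"}),
    (2, {"title": "The alligator ate Sue"}),
    (3, {"title": "Sue never goes anywhere without her alligator skin purse"}),
]
body = bulk_body(docs)
print(body, end="")
```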

Searching for Shingles

To understand the benefit that the shingles field adds, let's first look at the results of a simple match query for "The hungry alligator ate Sue":

GET /my_index/my_type/_search
{
   "query": {
        "match": {
           "title": "the hungry alligator ate sue"
        }
   }
}

This query returns all three documents, but note that documents 1 and 2 have the same relevance score because they contain the same words:

{
  "hits": [
     {
        "_id": "1",
        "_score": 0.44273707, 
        "_source": {
           "title": "Sue ate the alligator"
        }
     },
     {
        "_id": "2",
        "_score": 0.44273707, 
        "_source": {
           "title": "The alligator ate Sue"
        }
     },
     {
        "_id": "3", 
        "_score": 0.046571054,
        "_source": {
           "title": "Sue never goes anywhere without her alligator skin purse"
        }
     }
  ]
}

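Now let's add the shingles field to the query. Matches on the shingles field should act only as a signal that increases the relevance score, so the query on the main title field must still be included: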

GET /my_index/my_type/_search
{
   "query": {
      "bool": {
         "must": {
            "match": {
               "title": "the hungry alligator ate sue"
            }
         },
         "should": {
            "match": {
               "title.shingles": "the hungry alligator ate sue"
            }
         }
      }
   }
}

We still match all three documents, but document 2 has now moved into first place because it matches the shingled terms "alligator ate" and "ate sue":

{
  "hits": [
     {
        "_id": "2",
        "_score": 0.4883322,
        "_source": {
           "title": "The alligator ate Sue"
        }
     },
     {
        "_id": "1",
        "_score": 0.13422975,
        "_source": {
           "title": "Sue ate the alligator"
        }
     },
     {
        "_id": "3",
        "_score": 0.014119488,
        "_source": {
           "title": "Sue never goes anywhere without her alligator skin purse"
        }
     }
  ]
}

Even though the query includes the word hungry, which does not appear in any document, we can still use word proximity to return the most relevant document first.
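Why only document 2 benefits from the shingles field can be checked offline with a small Python sketch. It approximates the analyzer with a regular-expression word split plus lowercasing (an assumption, not the real standard tokenizer), and the helper name bigram_shingles is hypothetical. It lists which query shingles each title contains.

```python
import re


def bigram_shingles(text):
    """Lowercase the text, split it into words, and return its two-word shingles."""
    words = re.findall(r"\w+", text.lower())
    return {f"{a} {b}" for a, b in zip(words, words[1:])}


docs = {
    1: "Sue ate the alligator",
    2: "The alligator ate Sue",
    3: "Sue never goes anywhere without her alligator skin purse",
}
query_shingles = bigram_shingles("the hungry alligator ate sue")

# For each document, collect the query shingles that also occur in its title.
matches = {
    doc_id: sorted(query_shingles & bigram_shingles(title))
    for doc_id, title in docs.items()
}
for doc_id, shared in matches.items():
    print(doc_id, shared)
```

Documents 1 and 3 share no shingles with the query, so the should clause adds nothing to their scores. Document 2 shares two shingles, which is what lifts it above document 1.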

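Performance

Shingles are not only more flexible than phrase queries, they also perform better. Instead of paying the price of a phrase query on every search, a query on shingles is as efficient as a simple match query. A small price is paid at index time: more terms have to be indexed, so fields with shingles also use more disk space. However, most applications write once and read many times, so it makes sense to pay a little at index time in exchange for faster queries.
This is a theme you will encounter often in Elasticsearch: it lets you achieve a lot at search time without any up-front setup. Once you understand your requirements better, you can get better results and better performance by modeling your data correctly at index time.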

