Mapping WordPress Posts to Elasticsearch

I thought I’d share the Elasticsearch type mapping I am using for WordPress posts. We’ve refined it over a number of iterations and it combines dynamic templates and multi_field mappings along with a number of more standard mappings. So this is probably a good general example of how to index real data from a traditional SQL database into Elasticsearch.
If you aren’t familiar with the WordPress database schema, the short version is that posts and their terms live in separate tables.
These Elasticsearch mappings focus on the wp_posts, wp_term_relationships, wp_term_taxonomy, and wp_terms tables.
To simplify things I’ll just index using an English analyzer and leave discussing multi-lingual analyzers to a different post.

"analysis": {
    "filter": {
        "stop_filter": {
            "type": "stop",
            "stopwords": ["_english_"]
        },
        "stemmer_filter": {
            "type": "stemmer",
            "name": "minimal_english"
        }
    },
    "analyzer": {
        "wp_analyzer": {
            "type": "custom",
            "tokenizer": "uax_url_email",
            "filter": ["lowercase", "stop_filter", "stemmer_filter"],
            "char_filter": ["html_strip"]
        },
        "wp_raw_lowercase_analyzer": {
            "type": "custom",
            "tokenizer": "keyword",
            "filter": ["lowercase"]
        }
    }
}

A few notes on the analyzers:

The minimal_english stemmer only removes plurals rather than potentially butchering the difference between words like “computer”, “computes”, and “computing”.

The lowercase keyword analyzer (wp_raw_lowercase_analyzer) indexes the whole value as a single lowercased token, which makes case-insensitive exact matching possible.
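
To make the difference between the two analyzers concrete, here is a rough Python approximation of the tokens each one produces. This is only a sketch: the real analysis happens inside Elasticsearch, the function names are made up for illustration, the stop list is a small subset of the English one, and the stemming step is left out entirely.

```python
import re

# Small illustrative subset of English stop words (assumption: the real
# "_english_" list is longer).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in"}


def approx_wp_analyzer(text):
    """Roughly mimic wp_analyzer: strip tags, tokenize, lowercase, drop stop words.

    The minimal_english stemming step is intentionally not reproduced here.
    """
    no_html = re.sub(r"<[^>]+>", " ", text)        # crude stand-in for html_strip
    tokens = re.findall(r"\w+", no_html.lower())   # crude stand-in for the tokenizer
    return [t for t in tokens if t not in STOPWORDS]


def approx_raw_lowercase(text):
    """Mimic wp_raw_lowercase_analyzer: the keyword tokenizer keeps the whole
    input as one token, and the lowercase filter then lowercases it."""
    return [text.lower()]


print(approx_wp_analyzer("<p>The Art of Elasticsearch</p>"))
print(approx_raw_lowercase("Greg Ichneumon Brown"))
```

The first call yields separate searchable terms, while the second keeps the full value together, so it can only be matched as a whole.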

Let’s take a look at the post mapping:

"post": {
    "dynamic_templates": [
        {
            "tax_template_name": {
                "path_match": "taxonomy.*.name",
                "mapping": {
                    "type": "multi_field",
                    "fields": {
                        "name": {
                            "type": "string",
                            "index": "analyzed",
                            "analyzer": "wp_analyzer"
                        },
                        "raw": {
                            "type": "string",
                            "index": "not_analyzed"
                        },
                        "raw_lc": {
                            "type": "string",
                            "index": "analyzed",
                            "analyzer": "wp_raw_lowercase_analyzer"
                        }
                    }
                }
            }
        }, {
            "tax_template_slug": {
                "path_match": "taxonomy.*.slug",
                "mapping": {
                    "type": "string",
                    "index": "not_analyzed"
                }
            }
        }, {
            "tax_template_term_id": {
                "path_match": "taxonomy.*.term_id",
                "mapping": {
                    "type": "long"
                }
            }
        }
    ],
    "_all": {
        "enabled": false
    },
    "properties": {
        "post_id": {
            "type": "long"
        },
        "blog_id": {
            "type": "long"
        },
        "site_id": {
            "type": "long"
        },
        "post_type": {
            "type": "string",
            "index": "not_analyzed"
        },
        "lang": {
            "type": "string",
            "index": "not_analyzed"
        },
        "url": {
            "type": "string",
            "index": "not_analyzed"
        },
        "location": {
            "type": "geo_point",
            "lat_lon": true
        },
        "date": {
            "type": "date",
            "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd"
        },
        "date_gmt": {
            "type": "date",
            "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd"
        },
        "author": {
            "type": "multi_field",
            "fields": {
                "author": {
                    "type": "string",
                    "index": "analyzed",
                    "analyzer": "wp_analyzer"
                },
                "raw": {
                    "type": "string",
                    "index": "not_analyzed"
                }
            }
        },
        "author_login": {
            "type": "string",
            "index": "not_analyzed"
        },
        "title": {
            "type": "string",
            "index": "analyzed",
            "analyzer": "wp_analyzer"
        },
        "content": {
            "type": "string",
            "index": "analyzed",
            "analyzer": "wp_analyzer"
        },
        "tag": {
            "type": "object",
            "properties": {
                "name": {
                    "type": "multi_field",
                    "path": "just_name",
                    "fields": {
                        "name": {
                            "type": "string",
                            "index": "analyzed",
                            "analyzer": "wp_analyzer",
                            "index_name": "tag"
                        },
                        "raw": {
                            "type": "string",
                            "index": "not_analyzed",
                            "index_name": "tag.raw"
                        },
                        "raw_lc": {
                            "type": "string",
                            "index": "analyzed",
                            "analyzer": "wp_raw_lowercase_analyzer",
                            "index_name": "tag.raw_lc"
                        }
                    }
                },
                "slug": {
                    "type": "string",
                    "index": "not_analyzed"
                },
                "term_id": {
                    "type": "long"
                }
            }
        },
        "category": {
            "type": "object",
            "properties": {
                "name": {
                    "type": "multi_field",
                    "path": "just_name",
                    "fields": {
                        "name": {
                            "type": "string",
                            "index": "analyzed",
                            "analyzer": "wp_analyzer",
                            "index_name": "category"
                        },
                        "raw": {
                            "type": "string",
                            "index": "not_analyzed",
                            "index_name": "category.raw"
                        },
                        "raw_lc": {
                            "type": "string",
                            "index": "analyzed",
                            "analyzer": "wp_raw_lowercase_analyzer",
                            "index_name": "category.raw_lc"
                        }
                    }
                },
                "slug": {
                    "type": "string",
                    "index": "not_analyzed"
                },
                "term_id": {
                    "type": "long"
                }
            }
        }
    }
}
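
As an illustration of the document shape this mapping expects, the following Python sketch assembles one post document from a wp_posts row and its terms. The function names, the input dictionaries, and the choice to drop an unparseable date are all assumptions made for the example, not the indexing code we actually run. Placing only custom taxonomies under taxonomy.* is likewise an assumption.

```python
from datetime import datetime

# The two formats the mapping's date fields accept, written as strptime patterns.
DATE_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d")

# WordPress's built-in taxonomies and the mapping fields they feed.
BUILT_IN = {"post_tag": "tag", "category": "category"}


def clean_date(value):
    """Return the value in an accepted format, or None if it is not a real date."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime(fmt)
        except ValueError:
            continue
    return None


def term_to_obj(term):
    """Keep only the three sub-fields the mapping defines for a term."""
    return {"name": term["name"], "slug": term["slug"], "term_id": int(term["term_id"])}


def build_post_doc(post, terms_by_taxonomy):
    """Assemble one document; both arguments are hypothetical plain dicts."""
    doc = {
        "post_id": int(post["ID"]),
        "post_type": post["post_type"],
        "title": post["post_title"],
        "content": post["post_content"],
    }
    for field, column in (("date", "post_date"), ("date_gmt", "post_date_gmt")):
        cleaned = clean_date(post[column])
        if cleaned is not None:  # one possible policy: omit the field instead of failing
            doc[field] = cleaned
    custom = {}
    for tax_name, terms in terms_by_taxonomy.items():
        objs = [term_to_obj(t) for t in terms]
        if tax_name in BUILT_IN:
            doc[BUILT_IN[tax_name]] = objs
        else:
            custom[tax_name] = objs  # matched by the taxonomy.* dynamic templates
    if custom:
        doc["taxonomy"] = custom
    return doc
```

A term list under a key such as "company" ends up at taxonomy.company, where the dynamic templates pick up its name, slug, and term_id sub-fields.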

Most of the fields are pretty self-explanatory, so I’ll just outline the more complex ones:

date and date_gmt: We define the allowed formats because we are taking the dates out of MySQL. We also check the dates before indexing, since MySQL will allow some values in a DATETIME field that ES will balk at, causing the indexing operation to fail. For instance, MySQL accepts February 29 in non-leap years.

content: Content gets stripped of HTML and shortcodes, then converted to UTF-8 in cases where it isn’t already.

author and author.raw: The author field corresponds to the user’s display_name. Clearly we need to analyze the field so “Greg Ichneumon Brown” can be matched on a search for “Greg”, but what about when we facet on the field? If we use the analyzed field, the results would have the terms “greg”, “ichneumon”, and “brown”. Instead, by using ES’s multi_field mapping feature to auto-generate author.raw, the faceted results on that field will give us “Greg Ichneumon Brown”.

tag and category: Tags and categories similarly need raw versions for faceting so we preserve the original term. Additionally, there are a number of ways users can filter the content. WordPress builds a slug from each category or tag to identify it uniquely in a human-readable way, and there is a unique integer (term_id) associated with each term. The tag.raw_lc field is used for exact matching on a term without worrying about case. This may seem like a lot of duplication, but the overriding goal here is to avoid using MySQL for search, so we index everything. Extracting data into multiple fields ensures that we will have flexibility when filtering the data in the future.

taxonomy.*: WordPress allows custom taxonomies (categories and tags are the two built-in ones), so we need a way to create a custom path in each document that allows access to each taxonomy. This is where Elasticsearch’s dynamic templates shine. For a custom taxonomy such as “company” the paths will become taxonomy.company.name, taxonomy.company.name.raw, taxonomy.company.name.raw_lc, taxonomy.company.slug, and taxonomy.company.term_id.
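
To show how these sub-fields might be used at query time, here is a hedged Python sketch that builds two request bodies: a terms facet on author.raw and a content search filtered to an exact, case-insensitive tag. It assumes an Elasticsearch version that still supports multi_field, facets, and the filtered query, as this mapping does. The function names and the facet name "authors" are invented for the example.

```python
def author_facet_body(size=10):
    """Terms facet on the not_analyzed sub-field, so whole display names come back."""
    return {
        "query": {"match_all": {}},
        "facets": {"authors": {"terms": {"field": "author.raw", "size": size}}},
        "size": 0,  # only the facet counts are wanted, not the hits
    }


def exact_tag_body(text, tag_name):
    """Full-text match on content, restricted to posts carrying exactly this tag."""
    return {
        "query": {
            "filtered": {
                "query": {"match": {"content": text}},
                # A term filter is not analyzed, so the input is lowercased here
                # to line up with what wp_raw_lowercase_analyzer indexed.
                "filter": {"term": {"tag.raw_lc": tag_name.lower()}},
            }
        }
    }


print(author_facet_body(5))
print(exact_tag_body("mapping", "ElasticSearch"))
```

Faceting on author instead of author.raw would count the individual tokens, which is exactly the problem the raw sub-field avoids.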
  
The ES documentation is very complete, but it’s not always easy to see how to build complex mappings that fit the individual pieces together. I hope this helps in your own ES development efforts.

You are free to share or copy this text, but please keep this document’s URL as the reference URL.

Licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
