Optimizing Postgres full text search in Django

Postgres provides excellent search functionality out of the box. For most Django apps there's no need to run and maintain an ElasticSearch cluster unless you need the advanced features ElasticSearch offers. Django integrates well with Postgres search through its built-in Postgres module.

The default configuration works well for small datasets, but as your data grows the default search becomes very slow, and you'll need to enable specific optimizations to keep queries fast.

This page walks through setting up Django and Postgres, indexing sample data, and performing and optimizing full text search.

The examples use a Django + Postgres setup, but the advice is generally applicable to any programming language or framework as long as it uses Postgres.

If you're already a Django veteran you can skip the first steps and jump straight to the optimization.

Table of contents


  • Project setup
  • Creating models and indexing sample data
  • Optimizing search
  • Specialized search column and gin indexes
  • Postgres triggers
  • Measuring the performance improvement
  • Drawbacks
  • Conclusion

    Project setup

    Create the directories and set up the Django project.

    mkdir django_postgres
    cd django_postgres
    python -m venv venv
    source venv/bin/activate
    pip install django
    django-admin startproject full_text_search
    cd full_text_search
    ./manage.py startapp web
    

    Now we'll need to install 3 dependencies:

    • psycopg2: the Postgres client library for Python
    • wikipedia: a client library to retrieve Wikipedia articles
    • django-extensions: to simplify debugging of SQL queries

    pip install psycopg2 wikipedia django-extensions
    

    We also need to run Postgres locally. I'll use a dockerized version of Postgres here since it's easier to set up, but feel free to install a Postgres binary if you'd like.

    Open full_text_search/docker-compose.yml

    ---
    version: '2.4'
    services:
      postgres:
        image: postgres:11-alpine
        ports:
          - '5432:5432'
        environment:
          # Set the Postgres environment variables for bootstrapping the default
          # database and user.
          POSTGRES_DB: "my_db"
          POSTGRES_USER: "me"
          POSTGRES_PASSWORD: "password"
    

    The project structure should now look like the output below. We'll ignore the venv directory since it's packed with files that are irrelevant for now.

    $ tree -I venv
    .
    └── full_text_search
        ├── docker-compose.yml
        ├── full_text_search
        │   ├── __init__.py
        │   ├── __pycache__
        │   │   ├── __init__.cpython-37.pyc
        │   │   └── settings.cpython-37.pyc
        │   ├── settings.py
        │   ├── urls.py
        │   └── wsgi.py
        ├── manage.py
        └── web
            ├── admin.py
            ├── apps.py
            ├── __init__.py
            ├── migrations
            │   └── __init__.py
            ├── models.py
            ├── tests.py
            └── views.py
    
    5 directories, 15 files
    

    We will modify the default database settings to use Postgres instead of SQLite. In settings.py change the DATABASES attribute:

    DATABASES = {
        "default": {
            "ENGINE": "django.db.backends.postgresql",
            "NAME": "my_db",
            "USER": "me",
            "PASSWORD": "password",
            "HOST": "localhost",
            "PORT": "5432",
            "OPTIONS": {"connect_timeout": 2},
        }
    }
    

    We will also modify our INSTALLED_APPS to include a few apps:

    • django.contrib.postgres: the Postgres module for Django, required for full text search
    • django_extensions: to print SQL logs when executing queries in Python
    • web: our app

    Open full_text_search/settings.py and modify:

    INSTALLED_APPS = [
        'django.contrib.admin',
        'django.contrib.auth',
        'django.contrib.contenttypes',
        'django.contrib.sessions',
        'django.contrib.messages',
        'django.contrib.staticfiles',
        # Added apps below
        'django.contrib.postgres',
        'django_extensions',
        'web',
    ]
    

    Start Postgres and Django.

    cd full_text_search
    docker-compose up -d
    ./manage.py runserver
    
    If we open the browser and go to http://localhost:8000 we should see a successful installation.



    Creating models and indexing sample data

    Suppose we have a model which represents a Wikipedia page. For simplicity we'll use two fields: title and content.

    Open full_text_search/web/models.py

    from django.db import models
    
    class Page(models.Model):
        title = models.CharField(max_length=100, unique=True)
        content = models.TextField()
    

    Now run the migrations to create the model.

    cd full_text_search
    ./manage.py makemigrations && ./manage.py migrate
    No changes detected
    Operations to perform:
      Apply all migrations: admin, auth, contenttypes, sessions
    Running migrations:
      Applying contenttypes.0001_initial... OK
      # truncated the other output for brevity
    

    We'll use a script to index random Wikipedia articles and save the contents to Postgres.

    Edit web/index_wikipedia.py

    import logging
    import wikipedia
    from .models import Page
    
    logger = logging.getLogger("django")
    
    def index_wikipedia(num_pages):
        for _ in range(0, num_pages):
            p = wikipedia.random()
            try:
                wiki_page = wikipedia.page(p)
                Page.objects.update_or_create(title=wiki_page.title, defaults={
                    "content": wiki_page.content
                })
            except Exception:
                logger.exception("Failed to index %s", p)
    

    Now let's run our script to index Wikipedia. There will be errors when running the script, but don't worry about those as long as we manage to store a few hundred articles. The script will take a while to run so grab a cup of coffee and return in a few minutes.

    ./manage.py shell_plus
    
    >>> from web.index_wikipedia import index_wikipedia
    >>> index_wikipedia(200)
    
    #
    ## A bunch of errors will be shown here; ignore them.
    #
    
    >>> Page.objects.count()
    183
    

    Optimizing search

    Now suppose we want to allow users to perform a full text search on the content. We'll interactively query our dataset to test the full text search. Open a Django shell session:

    $ ./manage.py shell_plus --print-sql
    
    >>> Page.objects.filter(content__search='football')
    SELECT t.oid,
           typarray
      FROM pg_type t
      JOIN pg_namespace ns
        ON typnamespace = ns.oid
     WHERE typname = 'hstore'
    
    Execution time: 0.001440s [Database: default]
    
    SELECT typarray
      FROM pg_type
     WHERE typname = 'citext'
    
    Execution time: 0.000260s [Database: default]
    
    SELECT "web_page"."id",
           "web_page"."title",
           "web_page"."content"
      FROM "web_page"
     WHERE to_tsvector(COALESCE("web_page"."content", '')) @@ (plainto_tsquery('football')) = true
     LIMIT 21
    
    Execution time: 0.222619s [Database: default]
    
    <QuerySet [<Page: Page object (2)>, <Page: Page object (7)>...]>
    

    Django performs two preparatory queries and finally executes our search query. Looking at the last query, we can see at first glance that the query execution and serialization alone took ~220ms. That's far too slow if we want to keep our page load times in the double-digit milliseconds.
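
    If you'd rather measure from Python than read the SQL logs, a rough client-side timing works too. Below is a minimal sketch to run in ./manage.py shell_plus; the exact numbers will vary with your machine and dataset.

    import time

    from web.models import Page

    start = time.perf_counter()
    # Convert the queryset to a list to force query execution and serialization.
    results = list(Page.objects.filter(content__search='football'))
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{len(results)} results in {elapsed_ms:.1f} ms")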

    Let's take a closer look at why this query is performing so slowly. Open a second terminal where we'll use the excellent Postgres query analyzer, EXPLAIN ANALYZE. Copy the query from above and run it.

    $ ./manage.py dbshell
    psql (10.8 (Ubuntu 10.8-0ubuntu0.18.10.1), server 11.2)
    
    Type "help" for help.
    
    my_db=# explain analyze SELECT "web_page"."id",
    my_db-#        "web_page"."title",
    my_db-#        "web_page"."content"
    my_db-#   FROM "web_page"
    my_db-#  WHERE to_tsvector(COALESCE("web_page"."content", '')) @@ (plainto_tsquery('football')) = true
    my_db-#  LIMIT 21
    my_db-# ;
                                                      QUERY PLAN
    ---------------------------------------------------------------------------------------------------------------
     Limit  (cost=0.00..106.71 rows=1 width=643) (actual time=5.001..220.212 rows=18 loops=1)
       ->  Seq Scan on web_page  (cost=0.00..106.71 rows=1 width=643) (actual time=4.999..220.206 rows=18 loops=1)
             Filter: (to_tsvector(COALESCE(content, ''::text)) @@ plainto_tsquery('football'::text))
             Rows Removed by Filter: 165
     Planning Time: 3.336 ms
     Execution Time: 220.292 ms
    (6 rows)
    

    The planning time is quite fast (~3ms), but the execution time is very slow at ~220ms.

    ->  Seq Scan on web_page  (cost=0.00..106.71 rows=1 width=643) (actual time=4.999..220.206 rows=18 loops=1)
    

    We can see that the query performs a sequential scan over the entire table to find matching records. We can optimize this with an index.

    Filter: (to_tsvector(COALESCE(content, ''::text)) @@ plainto_tsquery('football'::text))
    

    The query also normalizes the content column from text to a tsvector using to_tsvector in order to perform the full text search.

    WHERE to_tsvector(COALESCE("web_page"."content", '')) @@ (plainto_tsquery('football')) = true
    
    The tsvector type is a tokenized version of the text which normalizes the search column (read more about tokenization here). Postgres has to perform this normalization for every row, and each row contains an entire Wikipedia page. This is a CPU-intensive and slow operation.
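
    To get a feel for what this normalization produces, we can run to_tsvector by hand. Below is a minimal sketch using a raw database cursor from ./manage.py shell_plus; the sample sentence is arbitrary.

    from django.db import connection

    with connection.cursor() as cursor:
        cursor.execute(
            "SELECT to_tsvector('english', 'The quick brown foxes jumped over the lazy dogs')"
        )
        # Stop words are dropped and words are stemmed, so the output should
        # look something like: 'brown':3 'dog':9 'fox':4 'jump':5 'lazi':8 'quick':2
        print(cursor.fetchone()[0])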

    Specialized search column and gin indexes

    In order to avoid on-the-fly casting of text to tsvectors, we'll create a specialized column which is used only for search. The column is populated on inserts and updates, so queries avoid the performance penalty of casting types.

    Since the new column has the tsvector type, we can also add a gin index to speed up the query. The gin index ensures that searches use an index scan instead of a sequential scan over all records.

    Open our web/models.py file and make modifications to the Page model.

    from django.db import models
    from django.contrib.postgres.search import SearchVectorField
    from django.contrib.postgres.indexes import GinIndex
    
    class Page(models.Model):
        title = models.CharField(max_length=100, unique=True)
        content = models.TextField()
    
        # New modifications. A field and an index
        content_search = SearchVectorField(null=True)
    
        class Meta:
            indexes = [GinIndex(fields=["content_search"])]
    

    Run the migrations

    $ ./manage.py makemigrations && ./manage.py migrate
    Migrations for 'web':
      web/migrations/0002_auto_20190525_0647.py
        - Add field content_search to page
        - Create index web_page_content_505071_gin on field(s) content_search of model page
    Operations to perform:
      Apply all migrations: admin, auth, contenttypes, sessions, web
    Running migrations:
      Applying web.0002_auto_20190525_0647... OK
    

    Postgres triggers

    Theoretically our problem is now solved: we have a gin-indexed column which should perform well in searches. But by doing so we've introduced another problem: the optimized content_search column needs to be manually kept in sync whenever the content column changes.

    Luckily for us Postgres provides an additional feature which solves this problem: triggers. A trigger is a Postgres function that runs whenever a specific operation is performed on a row. We'll create a trigger that populates content_search whenever a row is created or updated. That way Postgres keeps the two columns in sync without us having to write any Python code.

    To add the trigger we need to create a manual Django migration. The migration adds the trigger function, then updates all Page rows so that the trigger fires and content_search is populated for existing records. If you have a very large dataset you may not want to do this in production.

    Add a new migration at web/migrations/0003_create_text_search_trigger.py. You'll need to modify the previous migration in dependencies, since your earlier auto-generated migration may be named differently.

    from django.db import migrations
    
    class Migration(migrations.Migration):
    
        dependencies = [
            # NOTE: The previous migration probably looks different for you, so
            # modify this.
            ('web', '0002_auto_20190524_0957'),
        ]
    
        migration = '''
            CREATE TRIGGER content_search_update BEFORE INSERT OR UPDATE
            ON web_page FOR EACH ROW EXECUTE FUNCTION
            tsvector_update_trigger(content_search, 'pg_catalog.english', content);
    
        -- Force the trigger to run and populate the content_search column.
        UPDATE web_page SET id = id;
        '''
    
        reverse_migration = '''
        DROP TRIGGER content_search_update ON web_page;
        '''
    
        operations = [
            migrations.RunSQL(migration, reverse_migration)
        ]
    

    Run the migrations.

    $ ./manage.py makemigrations && ./manage.py migrate
    No changes detected
    Operations to perform:
      Apply all migrations: admin, auth, contenttypes, sessions, web
    Running migrations:
      Applying web.0003_create_text_search_trigger... OK
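
    With the trigger in place we can do a quick sanity check that it fires on insert. This is an optional sketch to run in ./manage.py shell_plus; the page created below is a throwaway example.

    from web.models import Page

    # The trigger should populate content_search without any Python involvement.
    page = Page.objects.create(title="Trigger smoke test", content="Rugby is a contact sport")
    page.refresh_from_db()
    # Expect a stemmed tsvector along the lines of: 'contact':4 'rugbi':1 'sport':5
    print(page.content_search)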
    

    Measuring the performance improvement



    Finally, back to the fun part: let's verify that the query performs better than before. Open the Django shell again, but this time filter rows using the indexed content_search column instead of the plain content column.

    ./manage.py shell_plus --print-sql
    
    >>> from django.contrib.postgres.search import SearchQuery
    >>> Page.objects.filter(content_search=SearchQuery('football', config='english'))
    SELECT t.oid,
           typarray
      FROM pg_type t
      JOIN pg_namespace ns
        ON typnamespace = ns.oid
     WHERE typname = 'hstore'
    
    Execution time: 0.000829s [Database: default]
    
    SELECT typarray
      FROM pg_type
     WHERE typname = 'citext'
    
    Execution time: 0.000310s [Database: default]
    
    SELECT "web_page"."id",
           "web_page"."title",
           "web_page"."content",
           "web_page"."content_search"
      FROM "web_page"
     WHERE "web_page"."content_search" @@ (plainto_tsquery('english'::regconfig, 'football')) = true
     LIMIT 21
    
    Execution time: 0.001359s [Database: default]
    
    <QuerySet []>
    



    Query execution time went from 0.220s down to 0.001s!

    Let's analyze the query again to see how Postgres executes it. Copy the query from above and run it through EXPLAIN ANALYZE.

    ./manage.py dbshell
    
    my_db=# explain analyze SELECT "web_page"."id",
    my_db-#        "web_page"."title",
    my_db-#        "web_page"."content",
    my_db-#        "web_page"."content_search"
    my_db-#   FROM "web_page"
    my_db-#  WHERE "web_page"."content_search" @@ (plainto_tsquery('english'::regconfig, 'football')) = true
    my_db-#  LIMIT 21
    my_db-# ;
                                                                    QUERY PLAN
    ------------------------------------------------------------------------------------------------------------------------------------------
     Limit  (cost=8.01..12.02 rows=1 width=675) (actual time=0.022..0.022 rows=0 loops=1)
       ->  Bitmap Heap Scan on web_page  (cost=8.01..12.02 rows=1 width=675) (actual time=0.020..0.020 rows=0 loops=1)
             Recheck Cond: (content_search @@ '''football'''::tsquery)
             ->  Bitmap Index Scan on web_page_content_505071_gin  (cost=0.00..8.01 rows=1 width=0) (actual time=0.017..0.017 rows=0 loops=1)
                   Index Cond: (content_search @@ '''football'''::tsquery)
     Planning Time: 3.061 ms
     Execution Time: 0.165 ms
    (7 rows)
    

    The interesting parts are the following.

    ->  Bitmap Index Scan on web_page_content_505071_gin  (cost=0.00..8.01 rows=1 width=0) (actual time=0.017..0.017 rows=0 loops=1)
    

    Instead of a sequential scan, Postgres uses the index on the content_search column.

    Index Cond: (content_search @@ '''football'''::tsquery)
    

    We also no longer run the costly to_tsvector function on every row; instead the content_search column is used as is.

    Drawbacks

    Unfortunately there are tradeoffs when using this optimization technique.

    • Because we're maintaining a second, search-only copy of our text, the table takes up significantly more space. Additionally, the gin index on the content_search column takes up space of its own.

    • Since the search column is updated on every INSERT or UPDATE, it also slows down writes to the database.

    If you're constrained by memory and disk, or you need fast writes, this technique may not suit your use case. However, I suspect that the majority of CRUD apps out there are happy to sacrifice some disk space and write speed for lightning fast search.
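
    To see what this overhead looks like in practice, we can ask Postgres for the relation sizes. Below is a small sketch run from ./manage.py shell_plus; the gin index name is taken from the migration output above and may differ in your project.

    from django.db import connection

    with connection.cursor() as cursor:
        # Table size including TOAST data and all indexes.
        cursor.execute("SELECT pg_size_pretty(pg_total_relation_size('web_page'))")
        print("table + indexes:", cursor.fetchone()[0])
        # Size of the gin index on its own.
        cursor.execute("SELECT pg_size_pretty(pg_relation_size('web_page_content_505071_gin'))")
        print("gin index:", cursor.fetchone()[0])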

    Conclusion

    Postgres offers excellent full text search capability, but it's a little slow out of the box. In order to speed up text searches we add a secondary column of type tsvector which is a search-optimized version of our text.

    We add a Gin index on the search column to ensure Postgres performs an index scan rather than a sequential scan. This reduces the query execution time by an order of magnitude.

    In order to keep the text column and the search column in sync we use a Postgres trigger which populates the search column on any modifications to our text column.

    The full code example can be found on GitHub.
