Conclusion

copy is overwhelmingly the fastest. It is even faster than writing the CSV file itself.

Method                         Time
pandas.to_sql                  22 min 2 s
embulk (insert_direct)         6 min 3 s
copy (postgres)                0 min 12 s
(Reference) CSV file output    0 min 50 s
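
To put the gap in numbers, the short sketch below converts the reported times to seconds and derives how many times slower each method is than copy, along with rows per second. The timings and the 10 million row count come from this article; the function names are only illustrative.

```python
# Reported timings converted to seconds (minutes * 60 + seconds).
timings_s = {
    "pandas.to_sql": 22 * 60 + 2,
    "embulk (insert_direct)": 6 * 60 + 3,
    "copy (postgres)": 0 * 60 + 12,
    "CSV file output": 0 * 60 + 50,
}

ROWS = 10_000_000  # size of the dummy data set used in this article


def slowdown_vs_copy(timings):
    """Return how many times slower each method is than copy."""
    base = timings["copy (postgres)"]
    return {name: seconds / base for name, seconds in timings.items()}


def rows_per_second(timings, rows=ROWS):
    """Return the throughput of each method in rows per second."""
    return {name: rows / seconds for name, seconds in timings.items()}


if __name__ == "__main__":
    factors = slowdown_vs_copy(timings_s)
    throughput = rows_per_second(timings_s)
    for name in timings_s:
        print(f"{name}: {factors[name]:.1f}x copy, {throughput[name]:,.0f} rows/s")
```

By this calculation pandas.to_sql takes roughly 110 times as long as copy, and embulk in insert_direct mode roughly 30 times as long.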

Environment

CPU: Ryzen 7 1700
Memory: 32 GB
postgres: version 10
OS: Windows 10
Storage: HDD

Discussion

The postgres settings were left at their defaults, so tuning them may change the results somewhat.

Using pg_bulkload would probably speed up embulk as well, but that looks difficult on Windows 10.

Generating the data to transfer

Create dummy data with Python: 10 million rows.

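import pandas as pd
import numpy as np
from sqlalchemy import create_engine

N_ROWS = 10_000_000  # 10 million rows

# Four integer columns with random values in [0, 100)
df = pd.DataFrame({
    'A': np.random.randint(0, 100, N_ROWS),
    'B': np.random.randint(0, 100, N_ROWS),
    'C': np.random.randint(0, 100, N_ROWS),
    'D': np.random.randint(0, 100, N_ROWS),
})
# Write the CSV file used as the transfer source (no index column)
df.to_csv('embulk_test.csv', index=False)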

Transfer with pandas.to_sql

engine = create_engine('postgresql://postgres:password@localhost:5432/test')
df.to_sql('pandas_to_sql', engine, if_exists='replace')
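
The article does not say how the elapsed times were measured. One possible way, shown here purely as an assumption, is a small context manager built on `time.perf_counter`, with a formatter that reports minutes and seconds. The names `timed` and `format_elapsed` are hypothetical helpers, and `time.sleep` stands in for the real transfer call.

```python
import time
from contextlib import contextmanager


def format_elapsed(seconds):
    """Format a duration as minutes and seconds, e.g. '22 min 2 s'."""
    minutes, secs = divmod(int(round(seconds)), 60)
    return f"{minutes} min {secs} s"


@contextmanager
def timed(label, results):
    """Store the wall-clock time of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the block raises, so failed runs are still timed.
        results[label] = time.perf_counter() - start


results = {}
with timed("demo", results):
    time.sleep(0.01)  # stand-in for df.to_sql('pandas_to_sql', engine, ...)

print("demo:", format_elapsed(results["demo"]))
```

Wrapping each transfer method in the same helper keeps the measurements comparable, since every run is timed from the same starting point in the script.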

Transfer with embulk

in:
  type: file
  path_prefix: embulk_test.csv
  parser:
    charset: UTF-8
    newline: CRLF
    type: csv
    delimiter: ','
    quote: '"'
    escape: '"'
    trim_if_not_quoted: false
    skip_header_lines: 1
    allow_extra_columns: false
    allow_optional_columns: false
    columns:
    - {name: 'A', type: long}
    - {name: 'B', type: long}
    - {name: 'C', type: long}
    - {name: 'D', type: long}

out:
  type: postgresql
  mode: insert_direct
  default_timezone: "Asia/Tokyo"
  host: localhost
  port: 5432
  user: postgres
  password: "password"
  database: test
  table: embulk_table
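
Transfer with copy (sketch)

The article does not show the command behind the copy measurement. The following is a minimal sketch of one way to load the same CSV with COPY from Python, assuming the psycopg2 driver is installed and that a target table already exists. The table name `copy_table`, its quoted upper-case column names, and the helper functions are assumptions for illustration, not the author's method.

```python
def quote_ident(name):
    """Double-quote an SQL identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_copy_sql(table, columns):
    """Build a COPY ... FROM STDIN statement for a CSV file with a header row."""
    cols = ", ".join(quote_ident(c) for c in columns)
    return (
        f"COPY {quote_ident(table)} ({cols}) "
        "FROM STDIN WITH (FORMAT csv, HEADER true)"
    )


def copy_csv(dsn, csv_path, table, columns):
    """Stream csv_path into an existing table with psycopg2's copy_expert."""
    import psycopg2  # third-party driver; an assumption, not named in the article

    sql = build_copy_sql(table, columns)
    conn = psycopg2.connect(dsn)
    try:
        with conn:  # commits on success, rolls back on error
            with conn.cursor() as cur, open(csv_path, "r", encoding="utf-8", newline="") as f:
                cur.copy_expert(sql, f)
    finally:
        conn.close()


# Example call (needs a running postgres and a table created beforehand, e.g.
# CREATE TABLE copy_table ("A" bigint, "B" bigint, "C" bigint, "D" bigint)):
# copy_csv('postgresql://postgres:password@localhost:5432/test',
#          'embulk_test.csv', 'copy_table', ['A', 'B', 'C', 'D'])
```

COPY sends the whole file in a single command instead of issuing row-by-row INSERT statements, which is consistent with the large gap in the results.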

Reference

Original article (Comparison of CSV file transfer times to postgres: pandas.to_sql, embulk, copy): https://qiita.com/manmatti/items/c06f8b47b56ab4ab18ae
