1000x Faster Code: Examples from Python Pandas and Julia DataFrames

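People who use Python Pandas probably know this already, but I happened to find an article introducing efficient ways to loop over rows, so I tried the same methods and compared them to find the fastest one in Python Pandas.
The original article is here. It reports results for Python Pandas only.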

What the article shows is that the time needed to add two columns (src_bytes, dst_bytes) can vary enormously depending on the method:

Iterrows

For loop with .loc or .iloc

Apply

Itertuples

List comprehensions

Pandas vectorization

NumPy vectorization

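Many readers probably know these already. In my environment (Mac mini M1 8GB, Python 3.9.7, Pandas 1.4.1) I tried all 7 methods as described in the article. The explanations are in the original article, so I skip the details and show only the code and the results.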

Python Pandas

Data used this time

import pandas as pd
import numpy as np

df = pd.read_csv('https://raw.githubusercontent.com/mlabonne/how-to-data-science/main/data/nslkdd_test.txt')

Iterrows

%%timeit -n 10
total = []
for index, row in df.iterrows():
    total.append(row['src_bytes'] + row['dst_bytes'])

For loop with .loc

%%timeit -n 10
total = []
for index in range(len(df)):
    total.append(df['src_bytes'].loc[index] + df['dst_bytes'].loc[index])
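The method list mentions .iloc as well, but only the .loc version is shown. The following is a minimal sketch of an .iloc variant, assuming a tiny synthetic frame in place of the NSL-KDD data (the values are made up for illustration). It looks up the integer column positions first, because .iloc accepts positions rather than labels.

```python
import pandas as pd

# Tiny synthetic stand-in for the real dataset (hypothetical values).
df = pd.DataFrame({"src_bytes": [10, 20, 30], "dst_bytes": [1, 2, 3]})

# .iloc is purely positional, so resolve the column positions once up front.
src_pos = df.columns.get_loc("src_bytes")
dst_pos = df.columns.get_loc("dst_bytes")

total = []
for i in range(len(df)):
    # Row i, column position src_pos / dst_pos.
    total.append(df.iloc[i, src_pos] + df.iloc[i, dst_pos])

print(total)
```

Like the .loc loop, this still does one scalar lookup per cell, so it should not be expected to approach the vectorized versions.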

Apply

%%timeit -n 10
df.apply(lambda row: row['src_bytes'] + row['dst_bytes'], axis=1).to_list()

Itertuples

%%timeit -n 10
total = []
for row in df.itertuples():
    total.append(row[5] + row[6])
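The positional indices in the itertuples loop are easy to get wrong, because position 0 of each tuple is the DataFrame index. The sketch below, which assumes a small synthetic frame with the same leading column order as the dataset (the values are invented), shows that positional access, attribute access, and index=False access all give the same sums.

```python
import pandas as pd

# Small synthetic frame mimicking the first six columns (hypothetical values).
df = pd.DataFrame({
    "duration": [0, 1, 2],
    "protocol_type": ["tcp", "udp", "tcp"],
    "service": ["http", "dns", "ftp"],
    "flag": ["SF", "SF", "REJ"],
    "src_bytes": [10, 20, 30],
    "dst_bytes": [1, 2, 3],
})

# Positional access: field 0 is the index, so the columns start at position 1.
by_position = [row[5] + row[6] for row in df.itertuples()]

# Attribute access: works because both column names are valid Python identifiers.
by_name = [row.src_bytes + row.dst_bytes for row in df.itertuples()]

# index=False drops the leading index field, shifting every position down by one.
no_index = [row[4] + row[5] for row in df.itertuples(index=False)]

print(by_position, by_name, no_index)
```

Attribute access is usually the safer choice, since it keeps working if columns are reordered.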

List comprehension

%%timeit -n 100
[src + dst for src, dst in zip(df['src_bytes'], df['dst_bytes'])]

Pandas vectorization

%%timeit -n 1000
(df['src_bytes'] + df['dst_bytes']).to_list()

NumPy vectorization

%%timeit -n 1000
(df['src_bytes'].to_numpy() + df['dst_bytes'].to_numpy()).tolist()

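As the original article reports, in my environment too the time dropped from 442 ms to 433 μs, about 1000 times shorter. The conclusion: if you have to loop, use a list comprehension, and use vectorization whenever possible.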
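The %%timeit cells above only run inside IPython or Jupyter. As a rough sketch for a plain Python script, the comparison can be reproduced with the standard timeit module. The data here is synthetic (2,000 random rows, an assumption made so the script needs no download), so the absolute numbers will not match the ones in this post.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data (assumption): 2,000 rows of random byte counts.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "src_bytes": rng.integers(0, 1_000, size=2_000),
    "dst_bytes": rng.integers(0, 1_000, size=2_000),
})

def with_iterrows():
    return [row["src_bytes"] + row["dst_bytes"] for _, row in df.iterrows()]

def with_list_comprehension():
    return [s + d for s, d in zip(df["src_bytes"], df["dst_bytes"])]

def with_numpy():
    return (df["src_bytes"].to_numpy() + df["dst_bytes"].to_numpy()).tolist()

# The approaches must agree before their timings are worth comparing.
assert with_iterrows() == with_list_comprehension() == with_numpy()

for fn in (with_iterrows, with_list_comprehension, with_numpy):
    # timeit.timeit returns the total seconds for `number` calls.
    seconds = timeit.timeit(fn, number=3) / 3
    print(f"{fn.__name__}: {seconds * 1e3:.3f} ms per call")
```

Checking that every variant returns the same list first guards against benchmarking code that is fast only because it computes something different.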

Julia DataFrames

I suddenly wondered what would happen if I did the same thing in Julia, so I tried it. The comparison uses the same data in the same environment (Mac mini M1 8GB, Julia 1.6.5, DataFrames 1.3.2).

using DataFrames
using CSV
using HTTP
using BenchmarkTools

url = "https://raw.githubusercontent.com/mlabonne/how-to-data-science/main/data/nslkdd_test.txt"
df = CSV.File(HTTP.get(url).body) |> DataFrame

Eachrow (= iterrows)

total=[]
@benchmark for row in eachrow(df)
    push!(total, row["src_bytes"].+row["dst_bytes"])
end

Taking the median, this is 22 ms.

List comprehension

@benchmark [(src+dst) for (src, dst) in zip(df[!,"src_bytes"],df[!,"dst_bytes"])]

Julia way

@benchmark [df[!,"src_bytes"]+df[!,"dst_bytes"]]

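In the end the time went from 22 ms down to 8.9 μs, a reduction of more than 1000 times.
Whether you use Julia DataFrames or Python Pandas, the choice of method can make a 1000x difference, so it would be a shame not to know this in advance.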

Summary

| Language | Method | Time |
|---|---|---|
| Python | Iterrows | 442 ms |
| Python | For loop with .loc or .iloc | 217 ms |
| Python | Apply | 163 ms |
| Python | Itertuples | 64.4 ms |
| Python | List comprehension | 3.72 ms |
| Python | Pandas vectorization | 466 μs |
| Python | NumPy vectorization | 433 μs |
| Julia | Eachrow | 22.6 ms |
| Julia | List comprehension | 13.1 μs |
| Julia | Julia DataFrame | 8.9 μs |
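As a quick check of the "1000x" claims, the speedups can be worked out from the slowest and fastest timings reported in the summary. This is plain arithmetic on the figures reported in this post, not a new measurement.

```python
# Timings copied from the summary, converted to seconds.
MS, US = 1e-3, 1e-6

python_slowest = 442 * MS   # Python, Iterrows
python_fastest = 433 * US   # Python, NumPy vectorization
julia_slowest = 22.6 * MS   # Julia, Eachrow
julia_fastest = 8.9 * US    # Julia, whole-column addition

python_speedup = python_slowest / python_fastest
julia_speedup = julia_slowest / julia_fastest

print(f"Python: {python_speedup:.0f}x")  # Python: 1021x
print(f"Julia:  {julia_speedup:.0f}x")   # Julia:  2539x
```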

Reference

On this topic (1000x Faster Code: Examples from Python Pandas and Julia DataFrames), more material can be found at https://zenn.dev/otwn/articles/970d03c3ab5228

