Detailed Pandas Notes


Pandas overview
Pandas is a third-party Python library that provides high-performance data structures and data analysis tools.
Basic Pandas operations
Importing Pandas

import pandas as pd

Reading a CSV file and related operations
Reading the file

values = pd.read_csv("file/test.csv")

Show the first few rows. The default is 5 rows; pass an integer to change the number.

values.head()

Basic information about the data, including each column's data type and its number of non-null values

values.info()
    ---------    ---------
    
    RangeIndex: 891 entries, 0 to 890
    Data columns (total 12 columns):
    PassengerId    891 non-null int64
    Age            714 non-null float64
    Cabin          204 non-null object
    dtypes: float64(1), int64(1), object(1)
    memory usage: 83.6+ KB

Getting all column names

values.keys()
    ---------    ---------
    Index(['PassengerId', 'Age', 'Cabin'], dtype='object')

Getting the data type of every column

values.dtypes
    ---------    ---------
    PassengerId      int64
    Age            float64
    Cabin           object
    dtype: object

Getting all the data as an array. Column names are not included.

values.values

Getting the index of the data

values.index
    ---------    ---------
    RangeIndex(start=0, stop=891, step=1)
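
The inspection calls above can be tried without a file on disk. This is a minimal sketch that assumes a tiny in-memory CSV (the `csv_text` contents are made up for illustration and only borrow the column names shown in the `info()` output).

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for "file/test.csv"
csv_text = """PassengerId,Age,Cabin
1,22.0,
2,38.0,C85
3,,E46
"""

values = pd.read_csv(io.StringIO(csv_text))

first_rows = values.head(2)             # first 2 rows instead of the default 5
column_names = list(values.columns)     # the same labels that values.keys() returns
column_types = values.dtypes            # dtypes is an attribute, not a method
non_null_ages = values["Age"].count()   # count() skips missing entries
```

Empty fields in the CSV are read as missing values, which is why `Age` has fewer non-null entries than the frame has rows.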

Creating a DataFrame in pandas and basic operations
Creating data

data = {
    "country": [
        "aaa", "bbb", "ccc"
    ],
    "population": [
        10, 20, 30
    ]
}
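data = pd.DataFrame(data)

Using an existing column as the index

indexed = data.set_index("country")  # returns a new DataFrame; data itself is unchanged

Getting a single column

data["country"]

Slicing a column works much like slicing a list

data["country"][:2]

Simple numeric calculations

indexed.loc["aaa", "population"] + indexed.loc["bbb", "population"]  # add two values selected by label
indexed["population"].mean()  # mean of the column
indexed.mean(axis=1)  # mean of each row; axis=1 only applies to a DataFrame
indexed["population"].max()  # maximum of the column
indexed.max(axis=1)  # maximum of each row
indexed["population"].min()  # minimum of the column
indexed.min(axis=1)  # minimum of each row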

Getting basic summary statistics of the data

values.describe()

Indexing in Pandas
Getting several columns

df = pd.read_csv("./file/titanic.csv")
df[["Age", "Fare"]].head()

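Two ways to select data: loc and iloc

loc selects data by label

iloc selects data by integer position

Selecting a row with loc

df.loc["Bob"]  # "Bob" is a placeholder; it must exist as a label in the index

Selecting a block of data with iloc

# rows at positions 0-4 and columns at positions 1-24 (the end of each slice is excluded)
df.iloc[0: 5, 1: 25]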
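
The difference between label-based and position-based selection is easier to see on a small frame. The following is a sketch using a hypothetical `people` frame (the names and numbers are invented for illustration).

```python
import pandas as pd

# Hypothetical frame whose index holds names
people = pd.DataFrame(
    {"Age": [34, 41, 29], "Fare": [7.25, 71.28, 8.05]},
    index=["Alice", "Bob", "Carol"],
)

bob_by_label = people.loc["Bob"]          # row whose index label is "Bob"
bob_by_position = people.iloc[1]          # second row, counting from 0
label_slice = people.loc["Alice":"Bob"]   # label slices include the end label
position_slice = people.iloc[0:1]         # position slices exclude the end
```

Note that the two slices return different numbers of rows even though both appear to stop at "the second row": `loc` is inclusive, `iloc` is not.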

Filtering rows by a condition

# first 5 rows where Age is greater than 40
df[df["Age"] > 40].head()

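Checking whether values are present

import numpy as np
s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype="int64")
s[s.isin([1, 2, 3])]  # isin returns a boolean mask; indexing with it keeps the matching rows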
    ---------    ---------
    3    1
    2    2
    1    3
    dtype: int64

Two-level index (MultiIndex)

s2 = pd.Series(np.arange(6), index=pd.MultiIndex.from_product([[0, 1], ["a", "b", "c"]]))
    ---------    ---------
    0  a    0
       b    1
       c    2
    1  a    3
       b    4
       c    5
    dtype: int64

Getting values by their multi-index labels

s2.iloc[s2.index.isin([(1, "a"), (0, "b")])]
    ---------    ---------
    0  b    1
    1  a    3
    dtype: int64
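
Besides `isin`, a two-level index can also be addressed with `loc`. This is a small sketch that rebuilds the same `s2` series and shows three ways of selecting from it; the variable names other than `s2` are illustrative.

```python
import numpy as np
import pandas as pd

s2 = pd.Series(
    np.arange(6),
    index=pd.MultiIndex.from_product([[0, 1], ["a", "b", "c"]]),
)

outer_block = s2.loc[1]        # every row under outer label 1, indexed by a, b, c
single = s2.loc[(1, "a")]      # one value addressed by the full (outer, inner) pair
picked = s2[s2.index.isin([(1, "a"), (0, "b")])]  # several pairs via a boolean mask
```

The mask-based selection returns rows in their original order, so `(0, "b")` comes before `(1, "a")` regardless of the order in the list.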

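Filtering data

df.where(df > 0, -df)
# Keeps each value where the condition holds (greater than 0); where it fails, the value is taken from the second argument (-df), or becomes NaN if no second argument is given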
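
To make the two forms of `where` concrete, here is a minimal sketch on a hypothetical `numbers` frame containing a mix of positive and negative integers.

```python
import pandas as pd

# Hypothetical data with positive and negative values
numbers = pd.DataFrame({"a": [1, -2, 3], "b": [-4, 5, -6]})

masked = numbers.where(numbers > 0)              # cells failing the condition become NaN
flipped = numbers.where(numbers > 0, -numbers)   # cells failing the condition take the value from -numbers
```

With `-numbers` as the replacement, the result is the same as taking the absolute value of every cell.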

Comparing columns against each other

df.query("(a < b) & (b < c)")
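
The `query` call assumes a frame with columns named `a`, `b`, and `c`, which the notes do not define. The sketch below uses a hypothetical `abc` frame and shows that the query string is equivalent to an explicit boolean mask.

```python
import pandas as pd

# Hypothetical frame with the three columns the query refers to
abc = pd.DataFrame({"a": [1, 5, 2], "b": [2, 4, 3], "c": [3, 6, 1]})

with_query = abc.query("(a < b) & (b < c)")
with_mask = abc[(abc["a"] < abc["b"]) & (abc["b"] < abc["c"])]
```

Only the first row satisfies both comparisons, so both selections contain that single row.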

groupby
Creating practice data

import pandas as pd
df = pd.DataFrame({
    "key": [
        "A", "B", "C", "A", "B", "C", "A", "B", "C"
    ],
    "data": [
        1, 2, 3, 4, 5, 6, 7, 8, 9
    ]
})

Computing the sum of the data for each value of the key column

df.groupby("key").sum()

Computing the total age of men and women in the Titanic data

df.groupby("Sex")["Age"].sum()  # here df is the Titanic DataFrame read from titanic.csv
    ---------    ---------
    Sex
    female     7286.00
    male      13919.17
    Name: Age, dtype: float64
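
Selecting the column before aggregating also makes it easy to compute several statistics at once. This sketch rebuilds the practice frame under the assumed name `practice` and applies `agg` with a list of function names.

```python
import pandas as pd

practice = pd.DataFrame({
    "key": ["A", "B", "C", "A", "B", "C", "A", "B", "C"],
    "data": [1, 2, 3, 4, 5, 6, 7, 8, 9],
})

totals = practice.groupby("key")["data"].sum()                          # one value per key
summary = practice.groupby("key")["data"].agg(["sum", "mean", "max"])   # one column per statistic
```

`summary` is a DataFrame indexed by key, so a single statistic can be read with `summary.loc["B", "mean"]`.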

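Bivariate statistics
Covariance between columns

df.cov(numeric_only=True)  # numeric_only skips text columns; pandas 2.x requires it when the frame has non-numeric columns

Correlation coefficients between columns, which are used more often

df.corr(numeric_only=True)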

Counting how many times each distinct value occurs. The default order is descending; ascending=True sorts in ascending order.

df["Sex"].value_counts(ascending=True)
    ---------    ---------
    female    314
    male      577
    Name: Sex, dtype: int64

bins=5 groups numeric data into 5 equal-width intervals before counting

df["Age"].value_counts(ascending=True, bins=5)
    ---------    ---------
    (64.084, 80.0]       11
    (48.168, 64.084]     69
    (0.339, 16.336]     100
    (32.252, 48.168]    188
    (16.336, 32.252]    346
    Name: Age, dtype: int64

Getting the number of non-null values

df["Age"].count()

Working with Pandas data objects
Pandas has two main data types: Series and DataFrame.
Working with Series
Creating practice data

import pandas as pd
data = [10, 20, 30]
index = ["a", "b", "c"]
s = pd.Series(data=data, index=index)
    ---------    ---------
    a    10
    b    20
    c    30
    dtype: int64

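Getting data

s[0]  # by position (deprecated for label-indexed Series since pandas 2.1; prefer s.iloc[0])

mask = [True, False, True]
s[mask]  # by boolean mask

s.loc["b"]  # by label

s.iloc[1]   # by integer position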

Making an independent copy of existing data

s1 = s.copy()

Replacing values

s1.replace(to_replace=10, value=100, inplace=True)
# to_replace is the value to look for, value is its replacement, and inplace=True modifies s1 directly

Renaming the index labels (modifies the Series in place)

s1.index = ["a", "b", "d"]

Renaming a single index label

s1.rename(index={"a": "A"}, inplace=True)

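Combining two Series

pd.concat([s1, s], ignore_index=False)  # appends s after s1 and returns a new Series (Series.append was removed in pandas 2.0)
# ignore_index=True renumbers the result from 0; False keeps the original labels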

Deleting the value for a given key

del s["a"]

Deleting several elements

s1.drop(["b", "d"], inplace=True)

Working with DataFrame
Creating practice data

data = [[1, 2, 3], [4, 5, 6]]
index = ["a", "b"]
columns = ["A", "B", "C"]
df = pd.DataFrame(data=data, index=index, columns=columns)

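Modifying the value in a given cell

df.loc["a", "A"] = 100  # a single indexing call; chained df.loc["a"]["A"] = ... may assign to a temporary copy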

Renaming the index labels

df.index = ["f", "g"]

Adding a row

df.loc["c"] = [1, 2, 3]

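Combining two DataFrames

df2 = pd.DataFrame([[7, 8, 9], [10, 11, 12]], index=["j", "k"], columns=["D", "E", "F"])  # example second frame; the original notes do not define df2
df3 = pd.concat([df, df2], axis=0)   # stack rows; columns missing from one frame are filled with NaN
df3 = pd.concat([df, df2], axis=1)   # place side by side, aligning on the index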

Adding a column

df2["Lan"] = [10, 11]

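Deleting data (each call is an independent example)

df2.drop(["j"], axis=0, inplace=True)  # delete one row
df2.drop(["E"], axis=1, inplace=True)  # delete one column
df2.drop(["j", "k"], axis=0, inplace=True)  # delete several rows
df2.drop(["E", "F"], axis=1, inplace=True)  # delete several columns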

merge
Creating practice data

import pandas as pd
left = pd.DataFrame({
    "key": ["K0", "K1", "K2", "K3"],
    "A": ["A0", "A1", "A2", "A3"],
    "B": ["B0", "B1", "B2", "B3"] 
})

right = pd.DataFrame({
    "key": ["K0", "K1", "K2", "K3"],
    "C": ["C0", "C1", "C2", "C3"],
    "D": ["D0", "D1", "D2", "D3"]
})

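Merging the data

pd.merge(left, right, on="key", how="outer", indicator=True)
# Merges the two DataFrames: on="key" joins on the key column, how="outer" keeps keys from both sides, and indicator=True adds a _merge column showing which side each row came from
# how: left, right, outer...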
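
Because the practice frames share exactly the same keys, every `how` setting gives the same rows. The sketch below uses two hypothetical frames whose keys only partly overlap, so the effect of `how` and `indicator` becomes visible.

```python
import pandas as pd

# Hypothetical frames: K0 exists only on the left, K3 only on the right
left = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
right = pd.DataFrame({"key": ["K1", "K2", "K3"], "C": ["C1", "C2", "C3"]})

inner = pd.merge(left, right, on="key", how="inner")                  # only keys present on both sides
outer = pd.merge(left, right, on="key", how="outer", indicator=True)  # all keys, with a _merge column

# Map each key to where it came from: left_only, right_only, or both
origin = outer.set_index("key")["_merge"].astype(str).to_dict()
```

In the outer result, cells that have no counterpart on the other side are filled with NaN.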

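Display settings for Pandas output

import pandas as pd

Getting the maximum number of rows displayed

pd.get_option("display.max_rows")  # default 60

Setting the maximum number of rows displayed

pd.set_option("display.max_rows", 6)
# frames with more than 6 rows are truncated when displayed

Getting the maximum number of columns displayed

pd.get_option("display.max_columns") # default 20

Setting the maximum number of columns displayed

pd.set_option("display.max_columns", 10)
# frames with more than 10 columns are truncated when displayed

Getting the maximum width of a cell value

pd.get_option("display.max_colwidth")   # default 50

Setting the maximum width of a cell value

pd.set_option("display.max_colwidth", 10)
# cell values longer than 10 characters are truncated when displayed

Getting the display precision of cell values

pd.get_option("display.precision")   # default 6

Setting the display precision of cell values

pd.set_option("display.precision", 2)
# floats are displayed with 2 decimal places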
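
`set_option` changes the setting for the rest of the session. When a setting is only needed for one block of code, `pd.option_context` can apply it temporarily. This is a minimal sketch; the variable names are illustrative.

```python
import pandas as pd

default_rows = pd.get_option("display.max_rows")

# Options are given as alternating name, value pairs and apply only inside the block
with pd.option_context("display.max_rows", 6, "display.precision", 2):
    inside_rows = pd.get_option("display.max_rows")
    inside_precision = pd.get_option("display.precision")

restored_rows = pd.get_option("display.max_rows")  # back to the previous value
```

An option that was changed with `set_option` can also be returned to its default with `pd.reset_option("display.max_rows")`.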

Pivot tables
Creating practice data

import pandas as pd
example = pd.DataFrame({
    "Month": [
        "January", "January", "January", "January",
        "February", "February", "February", "February",
        "March", "March", "March", "March"
    ],
    "Caregory": [
        "Transportation", "Grocery", "Household", "Entertainment",
        "Transportation", "Grocery", "Household", "Entertainment",
        "Transportation", "Grocery", "Household", "Entertainment"
    ],
    "Amount": [
        74., 235., 175., 100., 115., 240., 225., 125., 90., 260., 200., 120.
    ]
})

Reshaping the DataFrame into a pivot table

example_pivot = example.pivot(index="Caregory", columns="Month", values="Amount")

Computing totals

# total of each column (per month)
example_pivot.sum(axis=0)
    ---------    ---------
    Month
    February    705.0
    January     584.0
    March       670.0
    dtype: float64

# total of each row (per category)
example_pivot.sum(axis=1)
    ---------    ---------
    Caregory
    Entertainment     345.0
    Grocery           735.0
    Household         600.0
    Transportation    279.0
    dtype: float64

Mean fare of men and women in passenger classes 1, 2, and 3 in the Titanic data

# aggfunc defaults to "mean"
df.pivot_table(index="Sex", columns="Pclass", values="Fare")

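Counting men and women in passenger classes 1, 2, and 3 in the Titanic data

df.pivot_table(index="Sex", columns="Pclass", values="Fare", aggfunc="count")  # the default is aggfunc="mean"
# gives the same counts as long as Fare has no missing values
pd.crosstab(index=df["Sex"], columns=df["Pclass"])   # crosstab counts occurrences by default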
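
The Titanic file is not included with these notes, so the sketch below uses a small hypothetical `passengers` frame with the same three column names to compare `pivot_table` and `crosstab`. The fares are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data; no male passenger is in class 2
passengers = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 13.00],
})

mean_fare = passengers.pivot_table(index="Sex", columns="Pclass", values="Fare")  # mean by default
counts = passengers.pivot_table(index="Sex", columns="Pclass", values="Fare", aggfunc="count")
table = pd.crosstab(index=passengers["Sex"], columns=passengers["Pclass"])
```

One difference worth knowing: for a combination with no rows, `pivot_table` leaves the cell as NaN, while `crosstab` reports a count of 0.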

Working with time
Creating time data

import pandas as pd
ts = pd.Timestamp("2018-12-13")
    ---------    ---------
    Timestamp('2018-12-13 00:00:00')

pd.to_datetime("2018-12-13")
    ---------    ---------
    Timestamp('2018-12-13 00:00:00')

Time arithmetic

# add 5 days
ts + pd.Timedelta("5 days")
    ---------    ---------
    Timestamp('2018-12-18 00:00:00')

# subtract 1 day
ts - pd.Timedelta("1 days")
    ---------    ---------
    Timestamp('2018-12-12 00:00:00')

Time data in a Series

sd = pd.Series(["2017-12-13 00:00:00", "2017-12-14 00:00:00", "2017-12-15 00:00:00"])
    ---------    ---------
    0    2017-12-13 00:00:00
    1    2017-12-14 00:00:00
    2    2017-12-15 00:00:00
    dtype: object

# convert the strings to datetime values
ts = pd.to_datetime(sd)
    ---------    ---------
    0   2017-12-13
    1   2017-12-14
    2   2017-12-15
    dtype: datetime64[ns]

Getting the hour

ts.dt.hour

Getting the year

ts.dt.year

Generating a sequence of timestamps

data = pd.Series(pd.date_range("2018-12-13", periods=3, freq="12H"))
# 3 timestamps starting at 2018-12-13, spaced 12 hours apart
    ---------    ---------
    0   2018-12-13 00:00:00
    1   2018-12-13 12:00:00
    2   2018-12-14 00:00:00
    dtype: datetime64[ns]

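Selecting a range of data with a slice

by_time = pd.Series(range(3), index=pd.DatetimeIndex(data))  # time-based slicing needs a DatetimeIndex, so the timestamps are moved into the index
by_time[pd.Timestamp("2018-12-13 00:00:00"): pd.Timestamp("2018-12-14 00:00:00")]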
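
The `.dt` accessor and time-based slicing can be combined on one small example. This sketch uses hypothetical timestamps and readings; it assumes the timestamps are placed in the index before slicing, since label slicing by date works on a `DatetimeIndex`.

```python
import pandas as pd

# Hypothetical timestamps with different times of day
stamps = pd.Series(pd.to_datetime([
    "2017-12-13 08:30:00",
    "2017-12-14 21:00:00",
    "2017-12-15 00:00:00",
]))

hours = stamps.dt.hour.tolist()   # hour component of each timestamp
years = stamps.dt.year.tolist()   # year component of each timestamp

# Put the timestamps in the index so that date strings can be used as slice bounds
readings = pd.Series([1, 2, 3], index=pd.DatetimeIndex(stamps))
window = readings.loc["2017-12-13":"2017-12-14"]  # a date-only end bound covers that whole day
```

Because the end bound is given as a date without a time, the reading taken at 21:00 on 2017-12-14 is still inside the window.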

Common Pandas operations
Creating practice data

import pandas as pd
data = pd.DataFrame({
    "group": [
        "A", "B", "C", "A", "B", "C", "A", "B", "C"
    ],
    "data": [
        1, 2, 3, 4, 5, 6, 7, 8, 9
    ]
})

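Sorting

data.sort_values(by=["group", "data"], ascending=[False, True], inplace=True)
# by=["group", "data"] sorts by group first, then by data within each group
# ascending=[False, True] sets the direction for each key: True is ascending, False is descending
# inplace=True modifies data directly instead of returning a new DataFrame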

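Removing duplicates

# drop rows that are identical in every column, keeping the first occurrence
data.drop_duplicates()

# compare only the given column, keeping the first row for each distinct value
data.drop_duplicates(subset="group")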

Computing a new column from two columns

import numpy as np
df = pd.DataFrame({"data1": np.random.randn(5),
                   "data2": np.random.randn(5)})
df2 = df.assign(ratio=df["data1"] / df["data2"])   # returns a copy of df with the new ratio column

Deleting a column

df2.drop("ratio", axis="columns", inplace=True)

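Binning data

ages = [14, 15, 14, 79, 24, 57, 24, 100]
bins = [10, 40, 80]
bins_res = pd.cut(ages, bins)   # assign each age to one of the intervals defined by bins
    ---------    ---------
    # values outside every interval (here 100) become NaN
    [(10, 40], (10, 40], (10, 40], (40, 80], (10, 40], (40, 80], (10, 40], NaN]
    Categories (2, interval[int64]): [(10, 40] < (40, 80]]

# count how many values fall into each interval
pd.value_counts(bins_res)    # NaN is not counted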
    ---------    ---------
    (10, 40]    5
    (40, 80]    2
    dtype: int64

# give each interval a readable label
group_names = ["Youth", "Middle", "Old"]
pd.value_counts(pd.cut(ages, [10, 20, 50, 80], labels=group_names))
    ---------    ---------
    Youth     3
    Old       2
    Middle    2
    dtype: int64
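
The same binning can be done on a Series, which gives access to the `value_counts` and `isna` methods directly. This is a sketch with hypothetical labels `young` and `older`; the ages are the ones used above.

```python
import pandas as pd

ages = pd.Series([14, 15, 14, 79, 24, 57, 24, 100])

# Intervals are (10, 40] and (40, 80]; the labels are illustrative names
groups = pd.cut(ages, bins=[10, 40, 80], labels=["young", "older"])

counts = groups.value_counts()        # number of ages per label
outside = int(groups.isna().sum())    # ages that fall outside every interval
```

Checking `outside` is a quick way to notice when the chosen bin edges do not cover all of the data.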

Checking whether a DataFrame contains missing values. True marks a missing value and False marks a valid one.

df = pd.DataFrame([range(3), [0, np.nan, 0], [0, 0, np.nan], range(3)])
df.isnull()

# whether each column contains any missing value
df.isnull().any()
    ---------    ---------  
    0    False
    1     True
    2     True
    dtype: bool

# whether each row contains any missing value
df.isnull().any(axis=1)
    ---------    ---------  
    0    False
    1     True
    2     True
    3    False
    dtype: bool

Filling in missing values

df.fillna(5)
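
A constant is only one way to handle missing values. The following sketch rebuilds the same 4x3 frame under the assumed name `frame` and compares filling with a constant, filling with each column's mean, and dropping incomplete rows.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[0, 1, 2], [0, np.nan, 0], [0, 0, np.nan], [0, 1, 2]])

missing_per_column = frame.isnull().sum()   # number of missing values in each column
filled_constant = frame.fillna(5)           # every missing cell becomes 5
filled_mean = frame.fillna(frame.mean())    # each missing cell takes its own column's mean
complete_rows = frame.dropna()              # keep only rows without missing values
```

None of these calls modifies `frame`; each returns a new DataFrame.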

String operations
Creating practice data

import pandas as pd
import numpy as np
s = pd.Series(["A", "B", "b", "gaer", "AGER", np.nan])

Converting the strings in a Series to lowercase

s.str.lower()

Converting the strings in a Series to uppercase

s.str.upper()

Computing the length of each string in a Series

s.str.len()

Removing leading and trailing whitespace from each string

index = pd.Index(["   l  an", "   yu", "   lei"])
index.str.strip()

Replacing characters in column names

df = pd.DataFrame(np.random.randn(3, 2), columns=["A a", "B b"], index=range(3))
df.columns = df.columns.str.replace(" ", "_")

Splitting strings

s = pd.Series(["a_b_C", "c_d_e", "f_g_h"])
s.str.split("_")    # split each string on the underscore
    ---------    ---------  
    0    [a, b, C]
    1    [c, d, e]
    2    [f, g, h]
    dtype: object

Splitting strings into separate columns. expand=True returns a table, and n sets the maximum number of splits (n=6 allows up to 6).

s.str.split("_", expand=True, n=6)
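
With n=6 the limit is never reached for these strings, so the effect of `n` is not visible. The sketch below compares an unlimited split with `n=1` on the same data.

```python
import pandas as pd

s = pd.Series(["a_b_C", "c_d_e", "f_g_h"])

full = s.str.split("_", expand=True)          # one column per piece
limited = s.str.split("_", expand=True, n=1)  # split only at the first underscore
```

With `n=1` the remainder of each string stays together in the last column, for example `"b_C"` in the first row.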

Checking whether each string in s contains "A". The result is True where it does and False where it does not.

s = pd.Series(["Axzfc", "Aefa", "Ahstr", "Aga", "Aaf"])
s.str.contains("A")
    ---------    ---------  
    0    True
    1    True
    2    True
    3    True
    4    True
    dtype: bool

Plotting with Pandas
The most basic line plot

%matplotlib inline
import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(10), index=np.arange(0, 100, 10))
s.plot()

A slightly more complex line plot

df = pd.DataFrame(np.random.randn(10, 4).cumsum(0), index=np.arange(0, 100, 10), columns=list("ABCD"))
df.plot()

Bar chart

data = pd.Series(np.random.rand(16), index=list("abcdefghijklmnop"))
# create a figure with two stacked axes
from matplotlib import pyplot as plt
fig, axes = plt.subplots(2, 1)
data.plot(ax=axes[0], kind="bar")   # vertical bars
data.plot(ax=axes[1], kind="barh")   # horizontal bars

Bar chart with several columns of data

df = pd.DataFrame(np.random.rand(6, 4), index=["one", "two", "three", "four", "five", "six"], 
              columns=pd.Index(["A", "B", "C", "D"], name="Genus"))
df.plot(kind="bar")

Histogram

df.A.plot(kind="hist", bins=50)

Scatter plot

df.plot.scatter("A", "B")

Scatter plot matrix for several columns

pd.plotting.scatter_matrix(df, color="k", alpha=0.3)  # the top-level pd.scatter_matrix was removed from pandas


You may freely share or copy this text, but please keep this document's URL as the reference URL.

Licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
