Elice - Data Analysis (4)

Pandas


Installing the module

----console-----
pip install pandas

import pandas as pd
import numpy as np

DataFrame and Series

In general, a DataFrame holds the whole table of data (rows and named columns), while a Series is a single column with an index and no columns axis.

a=pd.DataFrame({'Bob': ['I liked it.', 'It was awful.'], 
              'Sue': ['Pretty good.', 'Bland.']},
             index=['Product A', 'Product B'])
b=pd.Series([1, 2, 3, 4, 5])
print(a)
print(b)

                     Bob           Sue
Product A    I liked it.  Pretty good.
Product B  It was awful.        Bland.
0    1
1    2
2    3
3    4
4    5
dtype: int64

Reading a file

df = pd.read_csv("test.csv")
print(df)

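DataFrame information

df.head(n) prints only the first n rows (5 when n is omitted). df.shape gives (rows, columns), and df.info() summarizes each column's dtype and non-null count. df.info() prints its summary itself and returns None, which is why None appears at the end of the output below.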

df = pd.read_csv("test.csv")
print(df.head())
print(df.shape)
print(df.info())

    country                                        description  ...         variety               winery
0     Italy  Aromas include tropical fruit, broom, brimston...  ...     White Blend              Nicosia
1  Portugal  This is ripe and fruity, a wine that is smooth...  ...  Portuguese Red  Quinta dos Avidagos
2        US  Tart and snappy, the flavors of lime flesh and...  ...      Pinot Gris            Rainstorm
3        US  Pineapple rind, lemon pith and orange blossom ...  ...        Riesling           St. Julian
4        US  Much like the regular bottling from 2012, this...  ...      Pinot Noir         Sweet Cheeks

[5 rows x 13 columns]
(10000, 13)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 0 to 9999
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   country                9995 non-null   object
 1   description            10000 non-null  object
 2   designation            7092 non-null   object
 3   points                 10000 non-null  int64
 4   price                  9315 non-null   float64
 5   province               9995 non-null   object
 6   region_1               8307 non-null   object
 7   region_2               3899 non-null   object
 8   taster_name            8018 non-null   object
 9   taster_twitter_handle  7663 non-null   object
 10  title                  10000 non-null  object
 11  variety                10000 non-null  object
 12  winery                 10000 non-null  object
dtypes: float64(1), int64(1), object(11)
memory usage: 1.1+ MB
None

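Indexing

df.iloc selects by integer position, df.loc selects by label (row labels and column names), and plain df[ ] indexing selects by column name. Selecting a single column returns a Series; selecting several columns returns a DataFrame. Note that an iloc slice excludes its end position while a loc slice includes its end label, so :4 yields 4 rows with iloc and 5 rows with loc in the output below.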

df = pd.read_csv("test.csv")
print(df['country'][:4])   
print(df.iloc[:4,1:4])    
print(df.loc[:4,'country']) 

0       Italy
1    Portugal
2          US
3          US
Name: country, dtype: object
                                         description           designation  points
0  Aromas include tropical fruit, broom, brimston...          Vulkà Bianco      87
1  This is ripe and fruity, a wine that is smooth...              Avidagos      87
2  Tart and snappy, the flavors of lime flesh and...                   NaN      87
3  Pineapple rind, lemon pith and orange blossom ...  Reserve Late Harvest      87
0       Italy
1    Portugal
2          US
3          US
4          US
Name: country, dtype: object
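
The difference between the two slices can be checked without the CSV file. The following is a minimal sketch on a made-up five-row frame; the name `toy` and its values are assumptions for illustration, not the wine data.

```python
import pandas as pd

# Hypothetical toy data standing in for test.csv
toy = pd.DataFrame({
    'country': ['Italy', 'Portugal', 'US', 'US', 'US'],
    'points': [87, 87, 87, 87, 87],
})

by_position = toy.iloc[:4, 0]        # positions 0..3, end position excluded
by_label = toy.loc[:4, 'country']    # labels 0..4, end label included

print(len(by_position), len(by_label))
```

Because the default index labels are 0, 1, 2, ..., the two accessors look alike here; they diverge as soon as the index is reordered or holds non-integer labels.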

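Indexing that returns a DataFrame

Wrapping the column argument of loc in an extra pair of brackets returns a DataFrame instead of a Series. Selecting several columns requires the list form [column1, column2], so that result is a DataFrame as well.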

df = pd.read_csv("test.csv")
print(df.loc[:4,['country']])

    country
0     Italy
1  Portugal
2        US
3        US
4        US
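
To see the return types side by side, here is a small sketch on hypothetical data (the frame `toy` is an assumption, not part of the original file):

```python
import pandas as pd

# Hypothetical two-row frame for illustration
toy = pd.DataFrame({'country': ['Italy', 'US'], 'points': [87, 90]})

as_series = toy.loc[:, 'country']                 # bare label -> Series
as_frame = toy.loc[:, ['country']]                # list with one label -> DataFrame
two_columns = toy.loc[:, ['country', 'points']]   # list with two labels -> DataFrame

print(type(as_series).__name__, type(as_frame).__name__, two_columns.shape)
```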

Changing columns

Assigning a list to data.columns renames the columns in place. The list must have exactly one name per existing column.

data=pd.read_csv('testfile.csv')
print(data.head())
print(data.columns)
data.columns=[1,2,3,4,5]
print(data.head())
print(data.columns)

  name  class  math  english  korean
0    A      1    96       90      95
1    B      1    95       66      71
2    C      2    91       89      92
3    D      1    92       83      87
4    E      2    93       84      95
Index(['name', 'class', 'math', 'english', 'korean'], dtype='object')
   1  2   3   4   5
0  A  1  96  90  95
1  B  1  95  66  71
2  C  2  91  89  92
3  D  1  92  83  87
4  E  2  93  84  95
Int64Index([1, 2, 3, 4, 5], dtype='int64')

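Changing columns with rename

With rename, the inplace option chooses whether to return a new DataFrame (inplace=False, the default) or to modify the existing one (inplace=True). Only the columns named in the mapping are renamed.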

test_data = pd.read_csv('testfile.csv')
print(test_data.head())
print(test_data.columns)
test_data.rename(columns={'name':'이름','class':'학급명','math':'수학','english':'국어'},inplace=True)
print(test_data.head())

  name  class  math  english  korean
0    A      1    96       90      95
1    B      1    95       66      71
2    C      2    91       89      92
3    D      1    92       83      87
4    E      2    93       84      95
Index(['name', 'class', 'math', 'english', 'korean'], dtype='object')
  이름  학급명  수학  국어  korean
0  A    1  96  90      95
1  B    1  95  66      71
2  C    2  91  89      92
3  D    1  92  83      87
4  E    2  93  84      95
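
The two modes of rename can be compared on a tiny frame. This is a sketch under assumed data; `scores` and the new column names are hypothetical.

```python
import pandas as pd

# Hypothetical frame for illustration
scores = pd.DataFrame({'name': ['A', 'B'], 'math': [96, 95]})

# Default (inplace=False): a new DataFrame comes back, the original is untouched
renamed = scores.rename(columns={'name': 'student'})
print(list(scores.columns))
print(list(renamed.columns))

# inplace=True: the original is modified and the call returns None
result = scores.rename(columns={'math': 'math_score'}, inplace=True)
print(result)
print(list(scores.columns))
```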

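Conditional expressions

Boolean masks can be built with comparison expressions and the isin method.
Several conditions can be combined with the element-wise operators & (and), | (or), and ~ (not), for example (df.country=='Italy') & (df.points>=90).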

df = pd.read_csv("test.csv")
print(df.country=='Italy')
print(df['points'].isin([89,91]))
print(df.loc[df.country=='Italy',['country']].head(5))
print(df[df['points'].isin([89,91])].head(5))

0        True
1       False
2       False
3       False
4       False
        ...
9995    False
9996    False
9997    False
9998    False
9999    False
Name: country, Length: 10000, dtype: bool
0       False
1       False
2       False
3       False
4       False
        ...
9995     True
9996     True
9997     True
9998     True
9999     True
Name: points, Length: 10000, dtype: bool
   country
0    Italy
6    Italy
13   Italy
22   Italy
24   Italy
          country                                        description  ...                   variety        winery
125  South Africa  Etienne Le Riche is a total Cabernet specialis...  ...        Cabernet Sauvignon      Le Riche   
126        France  Mid-gold color. Pronounced and enticing aromas...  ...            Gewürztraminer  Pierre Sparr   
127        France  Attractive mid-gold color with intense aromas ...  ...               White Blend  Pierre Sparr   
128        France  Compelling minerality on the nose, Refined and...  ...               Pinot Blanc    Kuentz-Bas   
129  South Africa  A big, black bruiser of a wine that has black ...  ...  Bordeaux-style Red Blend     Camberley   

[5 rows x 13 columns]
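
The combined condition mentioned in the text is not exercised by the code above, so here is a minimal sketch of it. The frame `wines` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical data for illustration
wines = pd.DataFrame({
    'country': ['Italy', 'Italy', 'US', 'France'],
    'points': [92, 87, 95, 90],
})

# Each comparison needs its own parentheses because & binds tighter than == and >=
italian_and_high = wines[(wines.country == 'Italy') & (wines.points >= 90)]
italy_or_france = wines[wines['country'].isin(['Italy', 'France'])]
not_us = wines[~(wines.country == 'US')]

print(italian_and_high)
print(len(italy_or_france), len(not_us))
```

Python's `and` / `or` keywords do not work on Series here; the element-wise operators are the ones to use.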

df[column].value_counts()

Returns a Series that counts how many times each value occurs, sorted from most to least frequent.

df = pd.read_csv("test.csv")
print(df['country'].value_counts())
print(type(df['country'].value_counts()))

US                4181
France            1588
Italy             1546
Spain              493
Portugal           468
Chile              399
Argentina          316
Austria            243
Australia          192
Germany            157
South Africa       132
New Zealand        111
Greece              32
Israel              29
Canada              20
Romania             17
Hungary             12
Bulgaria             8
Turkey               7
Uruguay              5
Mexico               5
Czech Republic       4
Croatia              4
Lebanon              4
Slovenia             4
Moldova              4
England              3
Georgia              2
Brazil               2
India                1
Cyprus               1
Serbia               1
Peru                 1
Morocco              1
Luxembourg           1
Armenia              1
Name: country, dtype: int64
<class 'pandas.core.series.Series'>
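
Missing values are left out of the counts by default. The sketch below, on an assumed four-element Series, shows the `dropna` and `normalize` parameters of `value_counts`:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value
countries = pd.Series(['US', 'US', 'Italy', np.nan])

counts = countries.value_counts()                     # NaN is skipped by default
with_missing = countries.value_counts(dropna=False)   # NaN gets its own row
shares = countries.value_counts(normalize=True)       # fractions of the non-null values

print(counts['US'], with_missing.sum(), shares['US'])
```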

df[column].unique()

Returns the distinct values as an array, each listed once (NaN included).

df = pd.read_csv("test.csv")
print(df['country'].unique())

['Italy' 'Portugal' 'US' 'Spain' 'France' 'Germany' 'Argentina' 'Chile'
 'Australia' 'Austria' 'South Africa' 'New Zealand' 'Israel' 'Hungary'
 'Greece' 'Romania' 'Mexico' 'Canada' nan 'Turkey' 'Czech Republic'
 'Slovenia' 'Luxembourg' 'Croatia' 'Georgia' 'Uruguay' 'England' 'Lebanon'
 'Serbia' 'Brazil' 'Moldova' 'Morocco' 'Peru' 'India' 'Bulgaria' 'Cyprus'
 'Armenia']
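
When only the number of distinct values is needed, `nunique` is a related call. A small sketch on assumed data:

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration
countries = pd.Series(['Italy', 'US', 'Italy', np.nan, 'US'])

distinct = countries.unique()      # NumPy array in first-seen order, NaN included
n_distinct = countries.nunique()   # number of distinct values, NaN excluded by default

print(distinct, n_distinct)
```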

df[column].mean()

Returns the mean of the given column.

df = pd.read_csv("test.csv")
print(df['points'].mean())

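Series operations

With the plain + operator, any index label that exists in only one of the two Series produces NaN.
The add method accepts fill_value, which substitutes a value for the missing side before adding.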

df = pd.read_csv("test.csv")
A=pd.Series([2,4,6],index=[0,1,2])
B=pd.Series([1,3,5],index=[1,2,3])
print(A+B)
print(A.add(B,fill_value=0))

0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64
0    2.0
1    5.0
2    9.0
3    5.0
dtype: float64
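
The same `fill_value` idea applies to the other arithmetic methods such as `sub` and `mul`. A brief sketch reusing the two Series from above:

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

diff = A.sub(B, fill_value=0)   # label 0 uses 2 - 0, label 3 uses 0 - 5
prod = A.mul(B, fill_value=1)   # 1 is the neutral fill for multiplication

print(diff.tolist())
print(prod.tolist())
```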

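DataFrame operations

DataFrames are aligned on both row labels and column names, and the add method with fill_value handles the resulting NaN cases in the same way.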

df = pd.read_csv("test.csv")
A=pd.DataFrame(np.random.randint(0,10,(2,2)),columns=list('AB'))
B=pd.DataFrame(np.random.randint(0,10,(3,3)),columns=list('BAC'))
print(A+B)
print(A.add(B,fill_value=0))

     A    B   C
0  2.0  1.0 NaN
1  9.0  6.0 NaN
2  NaN  NaN NaN
     A    B    C
0  2.0  1.0  3.0
1  9.0  6.0  2.0
2  4.0  8.0  4.0
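
Since the example above uses random numbers, its output changes on every run. The following sketch uses fixed, made-up values so the alignment can be followed by hand:

```python
import pandas as pd

# Fixed hypothetical values instead of np.random.randint
A = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
B = pd.DataFrame([[10, 20, 30], [40, 50, 60], [70, 80, 90]], columns=list('BAC'))

plain = A + B                      # cells missing on either side become NaN
filled = A.add(B, fill_value=0)    # a missing side is treated as 0

print(plain)
print(filled)
```

Columns are matched by name, not by position, so `plain.loc[0, 'A']` is 1 + 20. A cell stays NaN under `fill_value` only when it is missing from both frames.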

Aggregate functions

data={
    'A':[i+5 for i in range(3)],
    'B':[i**2 for i in range(3)]
    }
df=pd.DataFrame(data)
print(df['A'].sum())
print(df.sum())
print(df.mean())

18
A    18
B     5
dtype: int64
A    6.000000
B    1.666667
dtype: float64

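df.dropna and df.fillna

dropna drops the rows that contain NaN, and fillna fills NaN with a chosen value. Both return a new object by default (inplace=False), so assign the result back or pass inplace=True.

# Option 1: drop every row that contains NaN (assign the result back)
df = df.dropna()
# Option 2: keep the rows and fill NaN in the '전화번호' (phone number) column with '전화번호없음' ("no phone number")
df['전화번호']=df['전화번호'].fillna('전화번호없음')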
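
A quick way to check whether these calls modify the original frame is to run them on a small example. The frame `contacts` and the fill text below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical contact list with one missing phone number
contacts = pd.DataFrame({
    'name': ['Kim', 'Lee', 'Park'],
    'phone': ['010-1111', np.nan, '010-3333'],
})

dropped = contacts.dropna()                     # new frame without the NaN row
filled = contacts['phone'].fillna('no phone')   # new Series with NaN replaced

print(len(contacts), len(dropped))              # the original still has 3 rows
print(filled.tolist())
```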

df[column].map and df.apply

df = pd.read_csv("test.csv")
print(df['points'].map(lambda x:x-df['points'].mean()))
print(df.apply(lambda x:x['points']-df['points'].mean(),axis=1))

0      -1.3957
1      -1.3957
2      -1.3957
3      -1.3957
4      -1.3957
         ...
9995    0.6043
9996    0.6043
9997    0.6043
9998    2.6043
9999    2.6043
Name: points, Length: 10000, dtype: float64
0      -1.3957
1      -1.3957
2      -1.3957
3      -1.3957
4      -1.3957
         ...
9995    0.6043
9996    0.6043
9997    0.6043
9998    2.6043
9999    2.6043
Length: 10000, dtype: float64
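
Both calls above recompute `df['points'].mean()` for every element. A possible alternative, sketched here on assumed data, is to compute the mean once and, for simple arithmetic like this, subtract it from the whole column directly:

```python
import pandas as pd

# Hypothetical data for illustration
wines = pd.DataFrame({'points': [87, 90, 93]})
mean_points = wines['points'].mean()   # computed once, outside the lambda

via_map = wines['points'].map(lambda x: x - mean_points)                   # element by element
via_apply = wines.apply(lambda row: row['points'] - mean_points, axis=1)   # row by row
vectorized = wines['points'] - mean_points                                 # whole column at once

print(vectorized.tolist())
```

All three give the same values; map and apply remain useful when the per-element logic cannot be written as column arithmetic.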

groupby

Groups the rows by the values of a given column so that each group can be aggregated separately.

df = pd.read_csv("test.csv")
print(df.groupby('country').points.count())
print(df.groupby('country').points.value_counts())

country
Argentina          316
Armenia              1
Australia          192
Austria            243
Brazil               2
Bulgaria             8
Canada              20
Chile              399
Croatia              4
Cyprus               1
Czech Republic       4
England              3
France            1588
Georgia              2
Germany            157
Greece              32
Hungary             12
India                1
Israel              29
Italy             1546
Lebanon              4
Luxembourg           1
Mexico               5
Moldova              4
Morocco              1
New Zealand        111
Peru                 1
Portugal           468
Romania             17
Serbia               1
Slovenia             4
South Africa       132
Spain              493
Turkey               7
US                4181
Uruguay              5
Name: points, dtype: int64
country    points
Argentina  87        46
           88        42
           85        37
           83        27
           84        27
                     ..
US         98         2
           99         2
Uruguay    86         2
           90         2
           88         1
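
Several statistics per group can be collected in one call with `agg`. This is a minimal sketch on made-up data; the frame `wines` is an assumption.

```python
import pandas as pd

# Hypothetical data for illustration
wines = pd.DataFrame({
    'country': ['US', 'US', 'Italy', 'Italy', 'France'],
    'points': [90, 86, 88, 92, 91],
})

# One row per country, one column per statistic
summary = wines.groupby('country')['points'].agg(['count', 'mean', 'max'])
print(summary)
```

The group keys become the index of the result and are sorted by default.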

Sorting

ascending=True sorts in ascending order and ascending=False in descending order. Passing a list to by sorts on several keys, in the order given.

df = pd.read_csv("test.csv")
print(df.sort_values(by='country', ascending = True).head())
print(df.sort_values(by=['country','points'], ascending = False).loc[:,['country','points']].head())

        country                                        description        designation  points  ...  taster_twitter_handle                                              title             variety          winery
5786  Argentina  Leafy, spicy, dry berry aromas lead to a jacke...            Reserva      83  ...            @wineschach           Fat Gaucho 2013 Reserva Malbec (Mendoza)              Malbec      Fat Gaucho
1482  Argentina  Rather sweet and medicinal; the wine comes int...            Altosur      85  ...            @wineschach  Finca Sophenia 2006 Altosur Cabernet Sauvignon...  Cabernet Sauvignon  Finca Sophenia
1991  Argentina  Initial plum and berry aromas fall off with ai...                NaN      84  ...            @wineschach       Ricominciare 2010 Malbec-Tannat (Uco Valley)       Malbec-Tannat    Ricominciare
4367  Argentina  A smooth operator with sweet aromas of cotton ...  Cincuenta y Cinco      91  ...            @wineschach  Bodega Chacra 2009 Cincuenta y Cinco Pinot Noi...          Pinot Noir   Bodega Chacra
4369  Argentina  Clean, cedary and dynamic, with fine black-fru...            Reserva      91  ...            @wineschach            Ca' de Calle 2008 Reserva Red (Mendoza)           Red Blend    Ca' de Calle

[5 rows x 13 columns]
      country  points
6005  Uruguay      90
9133  Uruguay      90
4051  Uruguay      88
4104  Uruguay      86
6969  Uruguay      86
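
`ascending` also accepts a list with one flag per key, which allows mixed directions. A short sketch on assumed data:

```python
import pandas as pd

# Hypothetical data for illustration
wines = pd.DataFrame({
    'country': ['US', 'Italy', 'US', 'Italy'],
    'points': [86, 92, 90, 88],
})

# Country A-Z, and within each country the highest points first
ordered = wines.sort_values(by=['country', 'points'], ascending=[True, False])
print(ordered)
```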

replace

Replaces given values with new ones. An inplace option is also available; with inplace=False a new object is returned.

df = pd.read_csv("test.csv")
print(df['country'].replace('US','USA',inplace=False))

0          Italy
1       Portugal
2            USA
3            USA
4            USA
          ...
9995         USA
9996         USA
9997      France
9998         USA
9999      France
Name: country, Length: 10000, dtype: object
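
Several substitutions can be passed at once as a dict. The sketch below uses made-up data; the mapping itself is only an example.

```python
import pandas as pd

# Hypothetical data for illustration
wines = pd.DataFrame({'country': ['US', 'Italy', 'England', 'US']})

# Keys are the old values, dict values are their replacements
mapped = wines['country'].replace({'US': 'USA', 'England': 'UK'})

print(mapped.tolist())
print(wines['country'].tolist())   # the original column is untouched
```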
