pandas's get_dummies() is unusable. There was a time when I thought so, too


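pandas's get_dummies() is unusable.

When turning the categorical variables among your features into dummy variables, pandas's get_dummies() is very easy and convenient, as the following examples show.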

import pandas as pd
df1 = pd.DataFrame([[1,'yukino',3,'f',4],[2,'teio',4,'m',3],[3,'nature',4]], columns=['id','name','age','sex','d'])
df1

   id    name  age   sex    d
0   1  yukino    3     f  4.0
1   2    teio    4     m  3.0
2   3  nature    4  None  NaN

pd.get_dummies(df1) # dummy-encode every object-dtype column

   id  age    d  name_nature  name_teio  name_yukino  sex_f  sex_m
0   1    3  4.0            0          0            1      1      0
1   2    4  3.0            0          1            0      0      1
2   3    4  NaN            1          0            0      0      0

pd.get_dummies(df1, columns=['sex']) # specify which columns to dummy-encode

   id    name  age    d  sex_f  sex_m
0   1  yukino    3  4.0      1      0
1   2    teio    4  3.0      0      1
2   3  nature    4  NaN      0      0

pd.get_dummies(df1, columns=['sex'], drop_first=True) # "first" is not fixed, so which category gets dropped varies with the data

   id    name  age    d  sex_m
0   1  yukino    3  4.0      0
1   2    teio    4  3.0      1
2   3  nature    4  NaN      0
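To make the comment on drop_first concrete, here is a minimal sketch. The frames `part_a` and `part_b` are hypothetical splits invented for illustration and are not part of the original data. For string columns, get_dummies orders the levels it finds in each frame, so the "first" level, and therefore the dropped column, depends on which categories happen to be present.

```python
import pandas as pd

# Two hypothetical splits of the same 'sex' feature (illustrative data only).
part_a = pd.DataFrame({'sex': ['f', 'm']})
part_b = pd.DataFrame({'sex': ['m', 's']})

# drop_first removes the first level found in each frame separately.
cols_a = list(pd.get_dummies(part_a, columns=['sex'], drop_first=True).columns)
cols_b = list(pd.get_dummies(part_b, columns=['sex'], drop_first=True).columns)

print(cols_a)  # ['sex_m']  ('f' was dropped)
print(cols_b)  # ['sex_s']  ('m' was dropped)
```

In the first split 'm' is encoded and 'f' is the baseline, while in the second split 'm' silently becomes the baseline.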

pd.get_dummies(df1, columns=['sex'], dummy_na=True) # treat NaN as a category of its own and dummy-encode it

   id    name  age    d  sex_f  sex_m  sex_nan
0   1  yukino    3  4.0      1      0        0
1   2    teio    4  3.0      0      1        0
2   3  nature    4  NaN      0      0        1

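In machine learning, data is usually split, or already comes split, into training, test, CV, hold-out, and evaluation sets. Depending on when the split happens relative to the dummy encoding, the following problems arise.

1. Categories that do not appear in a given split get no dummy column.

2. The category dropped by drop_first=True changes depending on which categories appear in the data.

The second problem can still be avoided somehow by not using drop_first (and dropping one column by hand afterwards to remove the linear dependence). The real problem is the first one.
For example, suppose there is another dataset, df2, separate from df1: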

df2 = pd.DataFrame([[3,'palmer',5,'m',1],[4,'legacy',4,'s',3]], columns=df1.columns)
df2

   id    name  age sex  d
0   3  palmer    5   m  1
1   4  legacy    4   s  3

pd.get_dummies(df2, columns=['sex'])

   id    name  age  d  sex_m  sex_s
0   3  palmer    5  1      1      0
1   4  legacy    4  3      0      1

df1 and df2 end up with different sets of features, which causes all sorts of trouble. Or rather, it is fatal.

Unusable.
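The mismatch can be checked directly by comparing the column sets of the two encoded frames. This is only a sketch: `df1_small` and `df2_small` are cut-down, hypothetical stand-ins for df1 and df2 that keep just the id and sex columns.

```python
import pandas as pd

# Cut-down stand-ins for df1 and df2 (names and contents are illustrative).
df1_small = pd.DataFrame({'id': [1, 2, 3], 'sex': ['f', 'm', None]})
df2_small = pd.DataFrame({'id': [3, 4], 'sex': ['m', 's']})

cols1 = set(pd.get_dummies(df1_small, columns=['sex']).columns)
cols2 = set(pd.get_dummies(df2_small, columns=['sex']).columns)

# Dummy columns that exist in one frame but not in the other.
only_in_df1 = sorted(cols1 - cols2)
only_in_df2 = sorted(cols2 - cols1)

print(only_in_df1)  # ['sex_f']
print(only_in_df2)  # ['sex_s']
```

A model fitted on the columns of one frame would therefore see a different feature layout when given the other.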

There was a time when I thought so, too

Even so, is there really no way to make it usable? I could not find any other good method either.

So, following an article I found:

possible_categories = ['m','f','s']
df1['sex'] = df1['sex'].astype('category', categories=possible_categories)
pd.get_dummies(df1, columns=['sex'])

That should do it.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-159-a2d4da627942> in <module>
      1 possible_categories = ['m','f','s']
----> 2 df1['sex'] = df1['sex'].astype('category', categories=possible_categories)
      3 pd.get_dummies(df1, columns=['sex'])

~\Anaconda3\lib\site-packages\pandas\core\generic.py in astype(self, dtype, copy, errors, **kwargs)
   5880             # else, only a single dtype is given
   5881             new_data = self._data.astype(
-> 5882                 dtype=dtype, copy=copy, errors=errors, **kwargs
   5883             )
   5884             return self._constructor(new_data).__finalize__(self)

~\Anaconda3\lib\site-packages\pandas\core\internals\managers.py in astype(self, dtype, **kwargs)
    579 
    580     def astype(self, dtype, **kwargs):
--> 581         return self.apply("astype", dtype=dtype, **kwargs)
    582 
    583     def convert(self, **kwargs):

~\Anaconda3\lib\site-packages\pandas\core\internals\managers.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
    436                     kwargs[k] = obj.reindex(b_items, axis=axis, copy=align_copy)
    437 
--> 438             applied = getattr(b, f)(**kwargs)
    439             result_blocks = _extend_blocks(applied, result_blocks)
    440 

~\Anaconda3\lib\site-packages\pandas\core\internals\blocks.py in astype(self, dtype, copy, errors, values, **kwargs)
    557 
    558     def astype(self, dtype, copy=False, errors="raise", values=None, **kwargs):
--> 559         return self._astype(dtype, copy=copy, errors=errors, values=values, **kwargs)
    560 
    561     def _astype(self, dtype, copy=False, errors="raise", values=None, **kwargs):

~\Anaconda3\lib\site-packages\pandas\core\internals\blocks.py in _astype(self, dtype, copy, errors, values, **kwargs)
    598                 if deprecated_arg in kwargs:
    599                     raise ValueError(
--> 600                         "Got an unexpected argument: {}".format(deprecated_arg)
    601                     )
    602 

ValueError: Got an unexpected argument: categories

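!?
An error. Apparently astype() has no categories argument.

Note: on reading the Stack Overflow article carefully, I found it had been revised to match recent versions of pandas. The article on the copy site that I saw first appears to have been outdated.

So, following the revised answer: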

from pandas.api.types import CategoricalDtype
cat_type = CategoricalDtype(categories=list('mfs'))
df1['sex'] = df1['sex'].astype(cat_type)
df1

   id    name  age  sex    d
0   1  yukino    3    f  4.0
1   2    teio    4    m  3.0
2   3  nature    4  NaN  NaN

pd.get_dummies(df1, columns=['sex'])

   id    name  age    d  sex_m  sex_f  sex_s
0   1  yukino    3  4.0      0      1      0
1   2    teio    4  3.0      1      0      0
2   3  nature    4  NaN      0      0      0

Applause.

(I was unsure what would happen if a category that was not declared showed up. Answer: it becomes NaN at the point where astype() is applied.)
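As a quick check of both points, the sketch below applies one shared CategoricalDtype to two frames. `frame_a` and `frame_b` are hypothetical examples, and the value 'x' is made up to stand for an undeclared category.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# One dtype, declared once, shared by every split.
cat_type = CategoricalDtype(categories=list('mfs'))

# Illustrative frames; 'x' is deliberately not among the declared categories.
frame_a = pd.DataFrame({'sex': ['f', 'm', None]})
frame_b = pd.DataFrame({'sex': ['m', 's', 'x']})

frame_a['sex'] = frame_a['sex'].astype(cat_type)
frame_b['sex'] = frame_b['sex'].astype(cat_type)

d1 = pd.get_dummies(frame_a, columns=['sex'])
d2 = pd.get_dummies(frame_b, columns=['sex'])

print(list(d1.columns))  # ['sex_m', 'sex_f', 'sex_s']
print(list(d2.columns))  # ['sex_m', 'sex_f', 'sex_s']

# The undeclared value 'x' turned into NaN when astype() was applied ...
print(frame_b['sex'].isna().tolist())  # [False, False, True]
# ... so its row has no dummy column set.
print(int(d2.iloc[2].astype(int).sum()))  # 0
```

Both frames now produce the same dummy columns in the declared order, regardless of which categories each one actually contains. If unseen values should stay distinguishable from missing ones, they would need to be handled before the astype() call.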

Reference

Original article: https://qiita.com/floatnflow/items/17d00a1f7800bcb5f99e
