Python for Data Analysis, Chapter 6

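This chapter introduces how pandas handles data input and output. Input and output typically fall into a few categories: reading text files and other more efficient on-disk formats, loading data from databases, and interacting with network sources such as web APIs.
1. Reading and Writing Data in Text Format
The most commonly used functions:
read_csv: Read delimited data from a file, URL, or file-like object; the default delimiter is a comma.
read_table: Read delimited data from a file, URL, or file-like object; the default delimiter is a tab ('\t').
read_excel: Read tabular data from an Excel XLS or XLSX file.
read_html: Read all tables found in an HTML document.
read_json: Read data from a JSON string representation.
read_sql: Read the results of a SQL query (using SQLAlchemy) as a pandas DataFrame.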

! type pydata-book-2nd-edition\examples\ex1.csv  # type prints a file's contents in the Windows shell
a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

import pandas as pd

df = pd.read_csv(r'pydata-book-2nd-edition\examples\ex1.csv')  # raw string keeps the Windows backslashes literal
df
	a	b	c	d	message
0	1	2	3	4	hello
1	5	6	7	8	world
2	9	10	11	12	foo
pd.read_table(r"pydata-book-2nd-edition\examples\ex1.csv", sep=',')  # read_table gives the same result once the delimiter is specified
	a	b	c	d	message
0	1	2	3	4	hello
1	5	6	7	8	world
2	9	10	11	12	foo

1.1 Parsing Options
a. Some files have no header row. You can let pandas assign default column names, or specify the names yourself.

! type pydata-book-2nd-edition\examples\ex2.csv
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
pd.read_csv(r'pydata-book-2nd-edition\examples\ex2.csv')  # by default the first data row is taken as the header
	1	2	3	4	hello
0	5	6	7	8	world
1	9	10	11	12	foo
pd.read_csv(r"pydata-book-2nd-edition\examples\ex2.csv", header=None)  # let pandas assign default column names
	0	1	2	3	4
0	1	2	3	4	hello
1	5	6	7	8	world
2	9	10	11	12	foo
pd.read_csv(r'pydata-book-2nd-edition\examples\ex2.csv', names=['a', 'b', 'c', 'd', 'message'])
	a	b	c	d	message
0	1	2	3	4	hello
1	5	6	7	8	world
2	9	10	11	12	foo
pd.read_csv(r'pydata-book-2nd-edition\examples\ex2.csv', names=['a', 'b', 'c', 'd', 'message'],
            index_col='message')
		a	b	c	d
message				
hello	1	2	3	4
world	5	6	7	8
foo	9	10	11	12

b. You can build a hierarchical index from multiple columns.

! type pydata-book-2nd-edition\examples\csv_mindex.csv
key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16

pd.read_csv(r'pydata-book-2nd-edition\examples\csv_mindex.csv', index_col=['key1', 'key2'])
				value1	value2
key1	key2		
one		a		1		2
		b		3		4
		c		5		6
		d		7		8
two		a		9		10
		b		11		12
		c		13		14
		d		15		16
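Once the two key columns form a hierarchical index, rows can be selected by the outer level alone or by a full (outer, inner) pair. The following is a minimal sketch, assuming an in-memory copy of the first rows of csv_mindex.csv instead of the file on disk; the variable names are illustrative only.

```python
from io import StringIO

import pandas as pd

# A few rows copied from csv_mindex.csv, held in memory for the sketch.
raw = (
    "key1,key2,value1,value2\n"
    "one,a,1,2\n"
    "one,b,3,4\n"
    "two,a,9,10\n"
    "two,b,11,12\n"
)

parsed = pd.read_csv(StringIO(raw), index_col=["key1", "key2"])

# Selecting on the outer level returns the rows under it, indexed by key2.
outer = parsed.loc["one"]

# A full (key1, key2) tuple plus a column name picks a single value.
cell = parsed.loc[("two", "b"), "value2"]

# reset_index turns the index levels back into ordinary columns.
flat = parsed.reset_index()

print(outer)
print(cell)
print(flat.columns.tolist())
```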

c. When fields are separated by a variable amount of whitespace, you can pass a regular expression as the separator.

list(open(r'pydata-book-2nd-edition\examples\ex3.txt'))
['            A         B         C\n',
 'aaa -0.264438 -1.026059 -0.619500\n',
 'bbb  0.927272  0.302904 -0.032399\n',
 'ccc -0.264273 -0.386314 -0.217601\n',
 'ddd -0.871858 -0.348382  1.100491\n']
pd.read_table(r'pydata-book-2nd-edition\examples\ex3.txt', sep=r'\s+')  # one fewer column name than data columns, so read_table uses the first column as the index
		A			B			C
aaa	-0.264438	-1.026059	-0.619500
bbb	 0.927272	 0.302904	-0.032399
ccc	-0.264273	-0.386314	-0.217601
ddd	-0.871858	-0.348382	 1.100491
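The same behavior can be reproduced without the example file. This is a small sketch, assuming made-up whitespace-aligned text held in a string; it shows that a regular-expression separator splits on runs of spaces and that the first column becomes the index when the header has one fewer name than the data rows have fields.

```python
from io import StringIO

import pandas as pd

# Made-up sample: fields separated by a varying number of spaces.
raw = (
    "      A     B     C\n"
    "aaa   1.0   2.0   3.0\n"
    "bbb   4.0   5.0   6.0\n"
)

# A raw string keeps the backslash in the regular expression intact.
frame = pd.read_csv(StringIO(raw), sep=r"\s+")

print(frame.index.tolist())    # row labels taken from the first field
print(frame.columns.tolist())  # names taken from the header line
```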

d. Parser parameters help you handle the irregular file formats you will run into. Commonly used ones: path/sep/header/index_col/names/skiprows/na_values

! type pydata-book-2nd-edition\examples\ex4.csv
# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

pd.read_csv(r'pydata-book-2nd-edition\examples\ex4.csv', skiprows=[0, 2, 3])  # skiprows skips the first, third, and fourth lines
	a	b	c	d	message
0	1	2	3	4	hello
1	5	6	7	8	world
2	9	10	11	12	foo

The na_values option accepts a single string or a list of strings to treat as missing values. Passing a dict specifies different NA sentinels for each column.

pd.read_csv(r'pydata-book-2nd-edition\examples\ex5.csv')
	something	a	b	c		d	message
0	one			1	2	3.0		4	NaN
1	two			5	6	NaN		8	world
2	three		9	10	11.0	12	foo
sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
pd.read_csv(r'pydata-book-2nd-edition\examples\ex5.csv', na_values=sentinels)
	something	a	b	c		d	message
0	one			1	2	3.0		4	NaN
1	NaN			5	6	NaN		8	world
2	three		9	10	11.0	12	NaN
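To see what the per-column sentinels add on top of the default missing-value handling, the two reads can be compared side by side. This is a sketch under the assumption that ex5.csv contains the text below (reconstructed from the outputs above, not copied from the file itself).

```python
from io import StringIO

import pandas as pd

# Assumed file contents, reconstructed from the displayed frames.
raw = (
    "something,a,b,c,d,message\n"
    "one,1,2,3,4,NA\n"
    "two,5,6,,8,world\n"
    "three,9,10,11,12,foo\n"
)

# Default parsing: the empty field and the literal 'NA' become NaN.
default = pd.read_csv(StringIO(raw))

# Per-column sentinels are applied in addition to the defaults.
sentinels = {"message": ["foo", "NA"], "something": ["two"]}
per_column = pd.read_csv(StringIO(raw), na_values=sentinels)

print(default.isna().sum())
print(per_column.isna().sum())
```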

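1.2 Reading Text Files in Pieces
When processing very large files, or when working out the right set of parameters to parse a large file correctly, you may want to read only a small piece of the file or iterate through it in smaller chunks.
a. nrows reads only a small number of rows instead of the whole file.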

pd.options.display.max_rows = 10  # show at most 10 rows when pandas displays a DataFrame
pd.read_csv(r'pydata-book-2nd-edition\examples\ex6.csv')
		one			two			three		four	key
0	0.467976	-0.038649	-0.295344	-1.824726	L
1	-0.358893	1.404453	0.704965	-0.200638	B
2	-0.501840	0.659254	-0.421691	-0.057688	G
3	0.204886	1.074134	1.388361	-0.982404	R
4	0.354628	-0.133116	0.283763	-0.837063	Q
...	...	...	...	...	...
9995	2.311896	-0.417070	-1.409599	-0.515821	L
9996	-0.479893	-0.650419	0.745152	-0.646038	E
9997	0.523331	0.787112	0.486066	1.093156	K
9998	-0.362559	0.598894	-1.843201	0.887292	G
9999	-0.096376	-1.012999	-0.657431	-0.573315	0
10000 rows × 5 columns
pd.read_csv(r'pydata-book-2nd-edition\examples\ex6.csv', nrows=5)  # read only the first 5 rows
		one			two			three		four	key
0	0.467976	-0.038649	-0.295344	-1.824726	L
1	-0.358893	1.404453	0.704965	-0.200638	B
2	-0.501840	0.659254	-0.421691	-0.057688	G
3	0.204886	1.074134	1.388361	-0.982404	R
4	0.354628	-0.133116	0.283763	-0.837063	Q

b. chunksize reads the file in chunks of the given number of rows and lets you iterate over them.

chunker = pd.read_csv(r'pydata-book-2nd-edition\examples\ex6.csv', chunksize=1000)
type(chunker)
pandas.io.parsers.TextFileReader

tot = pd.Series([], dtype='float64')  # explicit dtype: an empty Series has no values to infer one from
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)
tot = tot.sort_values(ascending=False)
tot[:10]
E    368.0
X    364.0
L    346.0
O    343.0
Q    340.0
M    338.0
J    337.0
F    335.0
K    334.0
H    330.0
dtype: float64

The TextFileReader object also has a get_chunk method, which lets you read pieces of an arbitrary size.
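A short sketch of get_chunk follows, assuming a small made-up ten-row table in memory rather than ex6.csv. Passing iterator=True returns the reader object without fixing a chunk size, so each call can ask for a different number of rows.

```python
from io import StringIO

import pandas as pd

# Made-up data: ten rows with a counter and a two-valued key.
raw = "n,key\n" + "".join(f"{i},k{i % 2}\n" for i in range(10))

# iterator=True returns a TextFileReader instead of a DataFrame.
reader = pd.read_csv(StringIO(raw), iterator=True)

first = reader.get_chunk(3)   # the first three rows
second = reader.get_chunk(4)  # the next four rows
rest = reader.read()          # whatever remains
reader.close()

print(len(first), len(second), len(rest))
print(second["n"].tolist())
```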
1.3 Writing Data to Text Format

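data = pd.read_csv(r'pydata-book-2nd-edition\examples\ex5.csv')  # the frame read in section 1.1
data.to_csv('out.csv')  # write the data to a comma-separated file
! type out.csv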
,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo

Other delimiters work too.

import sys
data.to_csv(sys.stdout, sep='|')
|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo

Missing values appear as empty strings in the output. You can set a different sentinel value with na_rep.

data.to_csv(sys.stdout, na_rep='NULL')
,something,a,b,c,d,message
0,one,1,2,3.0,4,NULL
1,two,5,6,NULL,8,world
2,three,9,10,11.0,12,foo
data.to_csv(sys.stdout, index=False, header=False)  # write neither the row labels nor the column labels
one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo
data.to_csv(sys.stdout, index=False, columns=['a', 'b', 'c'])  # write only a subset of the columns, in the given order
a,b,c
1,2,3.0
5,6,
9,10,11.0
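The writing options can be checked with a round trip through a string. In this sketch, which uses a small made-up frame, to_csv is called without a path so that it returns the CSV text, and read_csv then parses that text back.

```python
from io import StringIO

import pandas as pd

# Made-up frame with one missing value in each of two columns.
frame = pd.DataFrame(
    {"a": [1, 5], "c": [3.0, None], "message": [None, "world"]}
)

# With no path argument, to_csv returns the text instead of writing a file.
text = frame.to_csv(index=False, na_rep="NULL")
print(text)

# 'NULL' is among the strings read_csv treats as missing by default.
restored = pd.read_csv(StringIO(text))
print(restored.isna().sum())
```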

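1.4 Working with Delimited Formats
For any file with a single-character delimiter, you can pass an open file or file-like object to csv.reader. Iterating over the reader yields each row as a list of values with the quote characters removed, and you can then wrangle the rows into whatever form you need.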

import csv
with open('pydata-book-2nd-edition/examples/ex7.csv') as f:
    lines = list(csv.reader(f))
header, values = lines[0], lines[1:]
data_dict = {h: v for h, v in zip(header, zip(*values))}
data_dict
{'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}
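The csv module can also write delimited files, and a different delimiter or quoting convention can be described by subclassing csv.Dialect. The following is a minimal sketch; the PipeDialect name and the sample rows are made up for illustration.

```python
import csv
from io import StringIO


class PipeDialect(csv.Dialect):
    """Hypothetical dialect: pipe-delimited, quoting only when needed."""

    delimiter = "|"
    quotechar = '"'
    doublequote = True
    skipinitialspace = False
    lineterminator = "\n"
    quoting = csv.QUOTE_MINIMAL


buffer = StringIO()
writer = csv.writer(buffer, dialect=PipeDialect)
writer.writerow(("a", "b", "c"))
writer.writerow(("1", "2", "3"))
writer.writerow(("x|y", "5", "6"))  # contains the delimiter, so it gets quoted

print(buffer.getvalue())

# Reading with the same dialect recovers the original field values.
buffer.seek(0)
rows = list(csv.reader(buffer, dialect=PipeDialect))
print(rows)
```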

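1.5 JSON Data
JSON has become one of the standard formats for sending data over HTTP between web browsers and other applications, and it is a much more free-form data format than tabular text formats such as CSV. All keys in a JSON object must be strings.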

obj = """
{"name": "Wes",
 "places_lived": ["United States", "Spain", "Germany"],
 "pet": null,
 "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]},
              {"name": "Katie", "age": 38,
               "pets": ["Sixes", "Stache", "Cisco"]}]
}
"""
import json

result = json.loads(obj)  # convert the JSON string to Python objects
result
{'name': 'Wes',
 'places_lived': ['United States', 'Spain', 'Germany'],
 'pet': None,
 'siblings': [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
  {'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Stache', 'Cisco']}]}
siblings = pd.DataFrame(result["siblings"], columns=['name', 'age', 'pets'])
siblings
	name	age	pets
0	Scott	30	[Zeus, Zuko]
1	Katie	38	[Sixes, Stache, Cisco]
asjson = json.dumps(result)  # convert the Python object back to a JSON string
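Because JSON object keys must be strings, a round trip through json.dumps and json.loads does not always return the object you started with. A small sketch with a made-up dict shows the standard library converting numeric keys to strings on the way out.

```python
import json

# Made-up record with an int key and a float key.
record = {1: "one", "nested": {2.5: [True, None]}}

text = json.dumps(record)      # numeric keys are written as strings
restored = json.loads(text)    # and they stay strings when read back

print(text)
print(restored == record)
```

If the key types matter, they need to be converted back explicitly after loading.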

read_json can automatically convert JSON datasets in specific arrangements into a Series or DataFrame.

! type example.json
[{"a": 1, "b": 2, "c": 3},
 {"a": 4, "b": 5, "c": 6},
 {"a": 7, "b": 8, "c": 9}]
data = pd.read_json('example.json')  # by default, each object in the JSON array becomes a row
	a	b	c
0	1	2	3
1	4	5	6
2	7	8	9

The to_json method exports pandas data to JSON.

print(data.to_json())
{"a":{"0":1,"1":4,"2":7},"b":{"0":2,"1":5,"2":8},"c":{"0":3,"1":6,"2":9}}
print(data.to_json(orient='records'))  # orient='records' writes one JSON object per row
[{"a":1,"b":2,"c":3},{"a":4,"b":5,"c":6},{"a":7,"b":8,"c":9}]
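The two layouts can be compared, and the records layout read back, without touching the file system. This is a sketch that rebuilds the same three-row frame in memory; wrapping the JSON text in StringIO is a choice made here so that read_json receives a file-like object.

```python
import json
from io import StringIO

import pandas as pd

frame = pd.DataFrame({"a": [1, 4, 7], "b": [2, 5, 8], "c": [3, 6, 9]})

as_columns = frame.to_json()                  # column -> {row label -> value}
as_records = frame.to_json(orient="records")  # one object per row

print(json.loads(as_columns)["a"])
print(json.loads(as_records)[1])

# The records layout can be parsed back into an equivalent frame.
restored = pd.read_json(StringIO(as_records), orient="records")
print(restored)
```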

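1.6 XML and HTML: Web Scraping
Python has many libraries for reading and writing HTML and XML, for example lxml, Beautiful Soup, and html5lib. lxml is comparatively fast, while the other libraries cope better with malformed HTML or XML files.
a. The read_html function automatically parses the tables in an HTML document into DataFrame objects.
b. lxml.objectify parses XML.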
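The idea behind parsing XML into tabular records can be sketched with the standard library alone. Note that this uses xml.etree.ElementTree as a stand-in, not lxml.objectify, and that the XML snippet and its tag names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up XML: each <station> element holds one record.
xml_text = """<stations>
  <station><name>North</name><riders>120</riders></station>
  <station><name>South</name><riders>95</riders></station>
</stations>"""

root = ET.fromstring(xml_text)

# Turn each record element into a dict of child tag -> text.
records = [
    {child.tag: child.text for child in station}
    for station in root.findall("station")
]

# Element text is always a string, so numeric fields need converting.
for record in records:
    record["riders"] = int(record["riders"])

print(records)
```

A list of dicts like this can be passed straight to pd.DataFrame to get one row per record.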
2. Binary Data Formats
Python's built-in pickle serialization module works with binary data, and it is one of the easiest and most efficient ways to store (serialize) data.

frame = pd.read_csv('pydata-book-2nd-edition/examples/ex1.csv')
frame
	a	b	c	d	message
0	1	2	3	4	hello
1	5	6	7	8	world
2	9	10	11	12	foo
frame.to_pickle('pydata-book-2nd-edition/examples/frame_pickle')
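A pickled frame is read back with pd.read_pickle. The sketch below uses a small made-up frame and a temporary directory so that it leaves no files behind.

```python
import os
import tempfile

import pandas as pd

# Made-up frame with one numeric and one text column.
frame = pd.DataFrame({"a": [1, 5, 9], "message": ["hello", "world", "foo"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame_pickle")
    frame.to_pickle(path)             # serialize to disk
    restored = pd.read_pickle(path)   # load it back

print(restored.equals(frame))
```

Pickle is best kept for short-term storage: the format is not guaranteed to stay readable across library versions, and unpickling data from an untrusted source can execute arbitrary code.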

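pandas has built-in support for two more binary formats, HDF5 and MessagePack. (MessagePack support was deprecated in pandas 0.25 and removed in 1.0.)
a. HDF5 format. HDF5 is intended for storing large quantities of scientific array data; HDF stands for hierarchical data format. Each HDF5 file can store multiple datasets along with supporting metadata. It suits very large datasets that do not fit in memory, because you can efficiently read a small section of a much larger array. HDF5 is not a database; it is best suited to write-once, read-many datasets.
b. Reading Excel files. pandas supports reading tabular data from Excel files through either the ExcelFile class or the read_excel function. These tools need the add-on packages xlrd and openpyxl installed.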

xlsx = pd.ExcelFile('pydata-book-2nd-edition/examples/ex1.xlsx')  # create an ExcelFile instance from the path
pd.read_excel(xlsx, "Sheet1")
   Unnamed: 0	a	b	c	d	message
0			0	1	2	3	4	hello
1			1	5	6	7	8	world
2			2	9	10	11	12	foo
frame = pd.read_excel('pydata-book-2nd-edition/examples/ex1.xlsx', "Sheet1")
frame
   Unnamed: 0	a	b	c	d	message
0			0	1	2	3	4	hello
1			1	5	6	7	8	world
2			2	9	10	11	12	foo

To write pandas data to an Excel file, create an ExcelWriter and then use the to_excel method:

writer = pd.ExcelWriter('pydata-book-2nd-edition/examples/ex2.xlsx')
frame.to_excel(writer, sheet_name="Sheet1")
writer.close()  # writes the file to disk; writer.save() was removed in pandas 2.0

You can also pass a file path to to_excel and avoid creating the ExcelWriter yourself.

frame.to_excel('pydata-book-2nd-edition/examples/ex2.xlsx')

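3. Interacting with Web APIs
Many websites expose public APIs that serve data as JSON or in some other format. Sending an HTTP GET request with the requests package returns a response object, and its json method returns the JSON parsed into native Python objects.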

import requests
url = 'https://api.github.com/repos/pandas-dev/pandas/issues'
resp = requests.get(url)  # send an HTTP GET request
data = resp.json()  # parse the response body as JSON into Python objects
pd.DataFrame(data, columns=['number', 'title', 'labels', 'state'])
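The parsed response is a list of dicts, so it can be reshaped before building the frame. The sketch below does not call the network: the sample payload is hypothetical, the issues_to_frame helper is a name introduced here, and it assumes each label object carries a "name" field.

```python
import pandas as pd


def issues_to_frame(payload):
    """Keep four fields per issue and reduce each label dict to its name."""
    rows = [
        {
            "number": item["number"],
            "title": item["title"],
            "labels": [label["name"] for label in item.get("labels", [])],
            "state": item["state"],
        }
        for item in payload
    ]
    return pd.DataFrame(rows, columns=["number", "title", "labels", "state"])


# Hypothetical payload shaped like a parsed JSON list of issues.
sample = [
    {"number": 1, "title": "First", "labels": [{"name": "bug"}], "state": "open"},
    {"number": 2, "title": "Second", "labels": [], "state": "closed"},
]

issues = issues_to_frame(sample)
print(issues)
```

With a live response, calling resp.raise_for_status() before resp.json() is a reasonable way to stop early on an HTTP error.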

4. Interacting with Databases
When selecting data from a table, most Python SQL drivers return a list of tuples. The pandas read_sql function makes it easy to read data from a general SQLAlchemy connection.

import sqlite3
query = """
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
 c REAL,        d INTEGER
);"""
con = sqlite3.connect('mydata.sqlite')
con.execute(query)
con.commit()

data = [('Atlanta', 'Georgia', 1.25, 6),
        ('Tallahassee', 'Florida', 2.6, 3),
        ('Sacramento', 'California', 1.7, 5)]
stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"
con.executemany(stmt, data)
con.commit()
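With the rows inserted, the table can be read back either by hand or in one call. This sketch uses an in-memory SQLite database instead of mydata.sqlite and passes the sqlite3 connection straight to pd.read_sql, which pandas accepts for SQLite without SQLAlchemy.

```python
import sqlite3

import pandas as pd

con = sqlite3.connect(":memory:")  # in-memory database for the sketch
con.execute(
    "CREATE TABLE test (a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER)"
)
rows_in = [
    ("Atlanta", "Georgia", 1.25, 6),
    ("Tallahassee", "Florida", 2.6, 3),
    ("Sacramento", "California", 1.7, 5),
]
con.executemany("INSERT INTO test VALUES (?, ?, ?, ?)", rows_in)
con.commit()

# By hand: the driver returns a list of tuples, and the column names
# come from the cursor's description attribute.
cursor = con.execute("SELECT * FROM test")
rows = cursor.fetchall()
columns = [desc[0] for desc in cursor.description]
by_hand = pd.DataFrame(rows, columns=columns)

# In one call: read_sql runs the query and builds the frame.
via_pandas = pd.read_sql("SELECT * FROM test", con)
con.close()

print(by_hand)
print(via_pandas)
```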
