Programming Hive (11) [Other File Formats and Compression]
# hive -e "set io.compression.codecs"
io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.DefaultCodec,
org.apache.hadoop.io.compress.BZip2Codec,
org.apache.hadoop.io.compress.SnappyCodec
11.2 Choosing a Compression Codec
11.3 Enabling Intermediate Compression
To enable intermediate compression, set hive.exec.compress.intermediate to true; the default is false. The Hadoop property that controls compression of intermediate map output is mapred.compress.map.output.
<property>
<name>hive.exec.compress.intermediate</name>
<value>true</value>
<description>This controls whether intermediate files produced by Hive between
multiple map-reduce jobs are compressed. The compression codec and other options
are determined from hadoop config variables mapred.output.compress*</description>
</property>
Hadoop's default compression codec is DefaultCodec; you can change it with mapred.map.output.compression.codec, which can be configured in the $HADOOP_HOME/conf/mapred-site.xml or $HIVE_HOME/conf/hive-site.xml file:
<property>
<name>mapred.map.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.SnappyCodec</value>
<description>This controls whether intermediate files produced by Hive
between multiple map-reduce jobs are compressed. The compression codec
and other options are determined from hadoop config variables
mapred.output.compress*</description>
</property>
11.4 Compressing Final Output
The property hive.exec.compress.output controls whether a query's final output is compressed; the default is false.
<property>
<name>hive.exec.compress.output</name>
<value>false</value>
<description>This controls whether the final outputs of a query
(to a local/hdfs file or a Hive table) is compressed. The compression
codec and other options are determined from hadoop config variables
mapred.output.compress*</description>
</property>
In Hadoop this is controlled by the property mapred.output.compress. When hive.exec.compress.output is true, a codec should also be specified:
<property>
<name>mapred.output.compression.codec</name>
<value>org.apache.hadoop.io.compress.GzipCodec</value>
<description>If the job outputs are compressed, how should they be compressed?</description>
</property>
11.5 Sequence Files
To use sequence files from Hive, add STORED AS SEQUENCEFILE to your CREATE TABLE statement:
CREATE TABLE a_sequence_file_table STORED AS SEQUENCEFILE;
Sequence files offer three compression options: NONE, RECORD, and BLOCK. The default is RECORD, but BLOCK-level compression usually performs best. The option can be specified in Hadoop's mapred-site.xml file or Hive's hive-site.xml file:
<property>
<name>mapred.output.compression.type</name>
<value>BLOCK</value>
<description>If the job outputs are to compressed as SequenceFiles,
how should they be compressed? Should be one of NONE, RECORD or BLOCK.</description>
</property>
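The reason BLOCK usually beats RECORD is easy to see outside Hadoop: compressing each record separately gives the codec almost no context to exploit, while compressing many records together exposes the redundancy between them. A minimal sketch of that effect with Python's zlib (standing in for a deflate-style codec; this is not real SequenceFile I/O):

```python
import zlib

# 1,000 short, similar records, like rows of a table
records = [f"user{i}\tclick\t2012-01-15".encode() for i in range(1000)]

# RECORD-style: compress each record on its own
record_style = sum(len(zlib.compress(r)) for r in records)

# BLOCK-style: compress many records as one buffer
block_style = len(zlib.compress(b"\n".join(records)))

print("record-style bytes:", record_style)
print("block-style bytes:", block_style)
```

The block-style total comes out far smaller, because cross-record redundancy is visible to the codec; per-record compression can even exceed the raw size due to per-stream overhead.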
11.6 Compression in Action
The source data:
hive> SELECT * FROM a;
4 5
3 2
hive> DESCRIBE a;
a int
b int
Enable intermediate data compression:
hive> set hive.exec.compress.intermediate=true;
hive> CREATE TABLE intermediate_comp_on
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> AS SELECT * FROM a;
Moving data to: file:/user/hive/warehouse/intermediate_comp_on
Table default.intermediate_comp_on stats: [num_partitions: 0, num_files: 1,
num_rows: 2, total_size: 8, raw_data_size: 6]
As expected, intermediate compression does not affect the final output, which remains uncompressed:
hive> dfs -ls /user/hive/warehouse/intermediate_comp_on;
Found 1 items
/user/hive/warehouse/intermediate_comp_on/000000_0
hive> dfs -cat /user/hive/warehouse/intermediate_comp_on/000000_0;
4 5
3 2
Next, configure a specific codec for intermediate compression rather than using the default. Here we choose GZip:
hive> set mapred.map.output.compression.codec
=org.apache.hadoop.io.compress.GzipCodec;
hive> set hive.exec.compress.intermediate=true;
hive> CREATE TABLE intermediate_comp_on_gz
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> AS SELECT * FROM a;
Moving data to: file:/user/hive/warehouse/intermediate_comp_on_gz
Table default.intermediate_comp_on_gz stats:
[num_partitions: 0, num_files: 1, num_rows: 2, total_size: 8, raw_data_size: 6]
hive> dfs -cat /user/hive/warehouse/intermediate_comp_on_gz/000000_0;
4 5
3 2
Enable compression of the final output:
hive> set hive.exec.compress.output=true;
hive> CREATE TABLE final_comp_on
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> AS SELECT * FROM a;
Moving data to: file:/tmp/hive-edward/hive_2012-01-15_11-11-01_884_.../-ext-10001
Moving data to: file:/user/hive/warehouse/final_comp_on
Table default.final_comp_on stats:
[num_partitions: 0, num_files: 1, num_rows: 2, total_size: 16, raw_data_size: 6]
hive> dfs -ls /user/hive/warehouse/final_comp_on;
Found 1 items
/user/hive/warehouse/final_comp_on/000000_0.deflate
Note that the output file now carries the .deflate suffix.
hive> dfs -cat /user/hive/warehouse/final_comp_on/000000_0.deflate;
... UGLYBINARYHERE ...
hive> SELECT * FROM final_comp_on;
4 5
3 2
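The .deflate suffix comes from DefaultCodec, which writes a bare zlib/deflate stream with no friendly file header, which is why cat shows only binary garbage while SELECT still works. A small sketch of that round trip using Python's zlib (the file name is invented for illustration; this only mimics the codec's output format):

```python
import os
import tempfile
import zlib

# Write a file the way DefaultCodec would: a bare zlib/deflate stream
data = b"4\t5\n3\t2\n"
path = os.path.join(tempfile.mkdtemp(), "000000_0.deflate")
with open(path, "wb") as f:
    f.write(zlib.compress(data))

# `dfs -cat` would show garbage; decompressing recovers the rows,
# which is what Hive does transparently on SELECT
with open(path, "rb") as f:
    restored = zlib.decompress(f.read())
print(restored.decode())
```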
Change the codec used for output compression:
hive> set hive.exec.compress.output=true;
hive> set mapred.output.compression.codec
=org.apache.hadoop.io.compress.GzipCodec;
hive> CREATE TABLE final_comp_on_gz
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> AS SELECT * FROM a;
Moving data to: file:/user/hive/warehouse/final_comp_on_gz
Table default.final_comp_on_gz stats:
[num_partitions: 0, num_files: 1, num_rows: 2, total_size: 28, raw_data_size: 6]
hive> dfs -ls /user/hive/warehouse/final_comp_on_gz;
Found 1 items
/user/hive/warehouse/final_comp_on_gz/000000_0.gz
This time a .gz file is produced, which can be viewed with the zcat command:
hive> ! /bin/zcat /user/hive/warehouse/final_comp_on_gz/000000_0.gz;
4 5
3 2
hive> SELECT * FROM final_comp_on_gz;
OK
4 5
3 2
Time taken: 0.159 seconds
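Unlike the headerless .deflate output, GzipCodec writes standard gzip files, which is why an ordinary tool like zcat can read them directly. The same round trip can be sketched with Python's gzip module (the file name is invented for illustration):

```python
import gzip
import os
import tempfile

rows = b"4\t5\n3\t2\n"
path = os.path.join(tempfile.mkdtemp(), "000000_0.gz")

# Write a standard gzip file, as GzipCodec would
with gzip.open(path, "wb") as f:
    f.write(rows)

# The equivalent of `zcat 000000_0.gz`
with gzip.open(path, "rb") as f:
    print(f.read().decode(), end="")
```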
Using the sequence file format:
hive> set mapred.output.compression.type=BLOCK;
hive> set hive.exec.compress.output=true;
hive> set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
hive> CREATE TABLE final_comp_on_gz_seq
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> STORED AS SEQUENCEFILE
> AS SELECT * FROM a;
Moving data to: file:/user/hive/warehouse/final_comp_on_gz_seq
Table default.final_comp_on_gz_seq stats:
[num_partitions: 0, num_files: 1, num_rows: 2, total_size: 199, raw_data_size: 6]
hive> dfs -ls /user/hive/warehouse/final_comp_on_gz_seq;
Found 1 items
/user/hive/warehouse/final_comp_on_gz_seq/000000_0
Sequence files are a binary format. Inspecting the file header:
hive> dfs -cat /user/hive/warehouse/final_comp_on_gz_seq/000000_0;
SEQ[]org.apache.hadoop.io.BytesWritable[]org.apache.hadoop.io.BytesWritable[]
org.apache.hadoop.io.compress.GzipCodec[]
The dfs -text command strips off the sequence file header and decompresses the contents:
hive> dfs -text /user/hive/warehouse/final_comp_on_gz_seq/000000_0;
4 5
3 2
hive> select * from final_comp_on_gz_seq;
OK
4 5
3 2
Finally, use intermediate compression and final output compression together:
hive> set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive> set hive.exec.compress.intermediate=true;
hive> set mapred.output.compression.type=BLOCK;
hive> set hive.exec.compress.output=true;
hive> set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
hive> CREATE TABLE final_comp_on_gz_int_compress_snappy_seq
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> STORED AS SEQUENCEFILE AS SELECT * FROM a;
11.7 Archiving Partitions
Hadoop has a storage format called HAR, short for Hadoop Archive. A HAR file is analogous to a TAR file inside the HDFS filesystem. HAR files are not compressed and are not efficient to query; they save no storage space, but they do reduce pressure on the NameNode by lowering the number of files it must track.
hive> CREATE TABLE hive_text (line STRING) PARTITIONED BY (folder STRING);
hive> ! ls $HIVE_HOME;
LICENSE README.txt RELEASE_NOTES.txt
hive> ALTER TABLE hive_text ADD PARTITION (folder='docs');
hive> LOAD DATA INPATH '${env:HIVE_HOME}/README.txt'
    > INTO TABLE hive_text PARTITION (folder='docs');
Loading data to table default.hive_text partition (folder=docs)
hive> LOAD DATA INPATH '${env:HIVE_HOME}/RELEASE_NOTES.txt'
    > INTO TABLE hive_text PARTITION (folder='docs');
Loading data to table default.hive_text partition (folder=docs)
hive> SELECT * FROM hive_text WHERE line LIKE '%hive%' LIMIT 2;
http://hive.apache.org/ docs
- Hive 0.8.0 ignores the hive-default.xml file, though we continue docs
The ALTER TABLE ... ARCHIVE PARTITION statement converts the partition's files into a HAR archive. For example:
hive> SET hive.archive.enabled=true;
hive> ALTER TABLE hive_text ARCHIVE PARTITION (folder='docs');
intermediate.archived is
file:/user/hive/warehouse/hive_text/folder=docs_INTERMEDIATE_ARCHIVED
intermediate.original is
file:/user/hive/warehouse/hive_text/folder=docs_INTERMEDIATE_ORIGINAL
Creating data.har for file:/user/hive/warehouse/hive_text/folder=docs
in file:/tmp/hive-edward/hive_..._3862901820512961909/-ext-10000/partlevel
Please wait... (this may take a while)
Moving file:/tmp/hive-edward/hive_..._3862901820512961909/-ext-10000/partlevel
to file:/user/hive/warehouse/hive_text/folder=docs_INTERMEDIATE_ARCHIVED
Moving file:/user/hive/warehouse/hive_text/folder=docs
to file:/user/hive/warehouse/hive_text/folder=docs_INTERMEDIATE_ORIGINAL
Moving file:/user/hive/warehouse/hive_text/folder=docs_INTERMEDIATE_ARCHIVED
to file:/user/hive/warehouse/hive_text/folder=docs
hive> dfs -ls /user/hive/warehouse/hive_text/folder=docs;
Found 1 items
/user/hive/warehouse/hive_text/folder=docs/data.har
ALTER TABLE ... UNARCHIVE PARTITION extracts the files from the HAR back into HDFS:
ALTER TABLE hive_text UNARCHIVE PARTITION (folder='docs');
11.8 Compression