Programming Hive (11) [Other File Formats and Compression Methods]

11.1 Determining Installed Codecs
# hive -e "set io.compression.codecs" 
io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.DefaultCodec,
org.apache.hadoop.io.compress.BZip2Codec,
org.apache.hadoop.io.compress.SnappyCodec
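The printed value is a comma-separated list of codec class names. A quick Python sketch (illustrative only; the list below is copied from the output above) that extracts each codec's short name:

```python
# The io.compression.codecs value, as printed by "set io.compression.codecs".
value = ("org.apache.hadoop.io.compress.GzipCodec,"
         "org.apache.hadoop.io.compress.DefaultCodec,"
         "org.apache.hadoop.io.compress.BZip2Codec,"
         "org.apache.hadoop.io.compress.SnappyCodec")

# Split on commas, then keep only the class name after the last dot.
codecs = [cls.rsplit(".", 1)[-1] for cls in value.split(",")]
print(codecs)  # ['GzipCodec', 'DefaultCodec', 'BZip2Codec', 'SnappyCodec']
```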

11.2 Choosing a Compression Codec
11.3 Enabling Intermediate Compression
To enable intermediate compression, set hive.exec.compress.intermediate to true; the default is false. In Hadoop, the property that controls compression of intermediate (map-output) data is mapred.compress.map.output.
 
<property>
  <name>hive.exec.compress.intermediate</name>
  <value>true</value>
  <description>This controls whether intermediate files produced by Hive between
  multiple map-reduce jobs are compressed. The compression codec and other options
  are determined from hadoop config variables mapred.output.compress*</description>
</property>



Hadoop's default compression codec is DefaultCodec. It can be changed by setting mapred.map.output.compression.codec, either in the $HADOOP_HOME/conf/mapred-site.xml file or in the $HIVE_HOME/conf/hive-site.xml file.
 
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  <description>This controls whether intermediate files produced by Hive
  between multiple map-reduce jobs are compressed. The compression codec
  and other options are determined from hadoop config variables
  mapred.output.compress*</description>
</property>



11.4 Compressing Final Output
The property hive.exec.compress.output controls compression of a query's final output. The default is false.
 
<property>
  <name>hive.exec.compress.output</name>
  <value>false</value>
  <description>This controls whether the final outputs of a query
  (to a local/hdfs file or a Hive table) is compressed. The compression
  codec and other options are determined from hadoop config variables
  mapred.output.compress*</description>
</property>



In Hadoop, this is controlled by the property mapred.output.compress. When hive.exec.compress.output is true, a codec must also be specified:
 
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
  <description>If the job outputs are compressed, how should they be compressed?</description>
</property>



11.5 Sequence Files
To use sequence files from Hive, add STORED AS SEQUENCEFILE to the CREATE TABLE statement:
CREATE TABLE a_sequence_file_table STORED AS SEQUENCEFILE;
Sequence files offer three compression options: NONE, RECORD, and BLOCK. The default is RECORD, but BLOCK-level compression usually performs best. The option can be specified in Hadoop's mapred-site.xml file or Hive's hive-site.xml file.
 
<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
  <description>If the job outputs are to compressed as SequenceFiles,
  how should they be compressed? Should be one of NONE, RECORD or BLOCK.</description>
</property>



11.6 Compression in Action
The source data:
hive> SELECT * FROM a; 
4 5
3 2
hive> DESCRIBE a; 
a int
b int

Enable intermediate data compression:
hive> set hive.exec.compress.intermediate=true; 
hive> CREATE TABLE intermediate_comp_on
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> AS SELECT * FROM a;
Moving data to: file:/user/hive/warehouse/intermediate_comp_on
Table default.intermediate_comp_on stats: [num_partitions: 0, num_files: 1,
num_rows: 2, total_size: 8, raw_data_size: 6]

As expected, intermediate compression does not affect the final output, which remains uncompressed:
hive> dfs -ls /user/hive/warehouse/intermediate_comp_on; 
Found 1 items
/user/hive/warehouse/intermediate_comp_on/000000_0

hive> dfs -cat /user/hive/warehouse/intermediate_comp_on/000000_0;
4 5
3 2

Configure a codec for intermediate compression instead of using the default, choosing GZip:


hive> set mapred.map.output.compression.codec
    =org.apache.hadoop.io.compress.GzipCodec;
hive> set hive.exec.compress.intermediate=true;

hive> CREATE TABLE intermediate_comp_on_gz
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> AS SELECT * FROM a;
Moving data to: file:/user/hive/warehouse/intermediate_comp_on_gz
Table default.intermediate_comp_on_gz stats:
[num_partitions: 0, num_files: 1, num_rows: 2, total_size: 8, raw_data_size: 6]

hive> dfs -cat /user/hive/warehouse/intermediate_comp_on_gz/000000_0;
4 5
3 2

Enable output compression:


hive> set hive.exec.compress.output=true;

hive> CREATE TABLE final_comp_on
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> AS SELECT * FROM a;
Moving data to: file:/tmp/hive-edward/hive_2012-01-15_11-11-01_884_.../-ext-10001
Moving data to: file:/user/hive/warehouse/final_comp_on
Table default.final_comp_on stats:
[num_partitions: 0, num_files: 1, num_rows: 2, total_size: 16, raw_data_size: 6]

hive> dfs -ls /user/hive/warehouse/final_comp_on;
Found 1 items
/user/hive/warehouse/final_comp_on/000000_0.deflate

Note that the output file now has a .deflate suffix:


hive> dfs -cat /user/hive/warehouse/final_comp_on/000000_0.deflate; 
... UGLYBINARYHERE ...

hive> SELECT * FROM final_comp_on;
4 5
3 2
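The .deflate suffix indicates raw zlib-format data, which is what Hadoop's DefaultCodec emits; unlike gzip there is no file header or magic number, so generic tools such as zcat cannot read it. A minimal Python sketch of that format (an illustration of the zlib round trip, not a reader for actual Hadoop output):

```python
import zlib

# Sample rows as Hive would write them (tab-delimited, newline-terminated).
data = b"4\t5\n3\t2\n"

# Illustration: DefaultCodec produces zlib-format ("deflate") streams,
# hence the .deflate suffix on the output file above.
compressed = zlib.compress(data)

# The stream has no gzip header, so zcat/gunzip cannot read it;
# zlib.decompress can.
restored = zlib.decompress(compressed)
assert restored == data
```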

Change the codec used to compress the output:


hive> set hive.exec.compress.output=true;

hive> set mapred.output.compression.codec
=org.apache.hadoop.io.compress.GzipCodec;
hive> CREATE TABLE final_comp_on_gz
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> AS SELECT * FROM a;
Moving data to: file:/user/hive/warehouse/final_comp_on_gz
Table default.final_comp_on_gz stats:
[num_partitions: 0, num_files: 1, num_rows: 2, total_size: 28, raw_data_size: 6]

hive> dfs -ls /user/hive/warehouse/final_comp_on_gz;
Found 1 items
/user/hive/warehouse/final_comp_on_gz/000000_0.gz

The output is now a .gz file, which can be viewed with the zcat command:


hive> ! /bin/zcat /user/hive/warehouse/final_comp_on_gz/000000_0.gz; 
4 5
3 2

hive> SELECT * FROM final_comp_on_gz;
OK
4 5
3 2
Time taken: 0.159 seconds
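Unlike the .deflate output earlier, gzip files carry a magic number and checksum in their framing, which is why standard tools such as zcat can read the .gz files Hive produces. A small Python sketch (illustrative only) of that framing:

```python
import gzip
import io

data = b"4\t5\n3\t2\n"

# Write gzip-framed data into an in-memory buffer.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(data)
gz_bytes = buf.getvalue()

# gzip files start with the magic number 0x1f 0x8b, which is what
# tools like zcat check for before decompressing.
assert gz_bytes[:2] == b"\x1f\x8b"

# Round-trip: reading the framed bytes back yields the original rows.
with gzip.GzipFile(fileobj=io.BytesIO(gz_bytes), mode="rb") as f:
    assert f.read() == data
```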

Using the sequence file format:


hive> set mapred.output.compression.type=BLOCK; 
hive> set hive.exec.compress.output=true;
hive> set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

hive> CREATE TABLE final_comp_on_gz_seq
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> STORED AS SEQUENCEFILE
> AS SELECT * FROM a;
Moving data to: file:/user/hive/warehouse/final_comp_on_gz_seq
Table default.final_comp_on_gz_seq stats:
[num_partitions: 0, num_files: 1, num_rows: 2, total_size: 199, raw_data_size: 6]
hive> dfs -ls /user/hive/warehouse/final_comp_on_gz_seq;
Found 1 items
/user/hive/warehouse/final_comp_on_gz_seq/000000_0

Sequence files are a binary format. Examining the file header:


hive> dfs -cat /user/hive/warehouse/final_comp_on_gz_seq/000000_0; 
SEQ[]org.apache.hadoop.io.BytesWritable[]org.apache.hadoop.io.BytesWritable[]
org.apache.hadoop.io.compress.GzipCodec[]

Use the dfs -text command to strip the sequence file header and decompress the contents:
hive> dfs -text /user/hive/warehouse/final_comp_on_gz_seq/000000_0; 
4 5
3 2

hive> select * from final_comp_on_gz_seq;
OK
4 5
3 2
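As the dfs -cat output above shows, a sequence file's header begins with the 3-byte magic "SEQ" followed by a version byte and the key/value class names. A hypothetical Python sketch of a magic-number check (the constructed header below is simplified for illustration, not a byte-exact Hadoop header):

```python
def looks_like_sequencefile(header: bytes) -> bool:
    """Check the 3-byte "SEQ" magic plus a version byte."""
    return len(header) >= 4 and header[:3] == b"SEQ"

# Simplified stand-in for a real header: magic, version byte, then
# (in a real file, length-prefixed) class names follow.
fake_header = b"SEQ\x06org.apache.hadoop.io.BytesWritable"
assert looks_like_sequencefile(fake_header)
assert not looks_like_sequencefile(b"plain text file")
```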

Combine intermediate data compression with compressed final output:


hive> set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec; 
hive> set hive.exec.compress.intermediate=true;
hive> set mapred.output.compression.type=BLOCK;
hive> set hive.exec.compress.output=true;
hive> set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

hive> CREATE TABLE final_comp_on_gz_int_compress_snappy_seq
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> STORED AS SEQUENCEFILE AS SELECT * FROM a;

11.7 Archiving Partitions

Hadoop has a storage format called HAR, the Hadoop Archive. A HAR file is similar to a TAR file stored in the HDFS filesystem. HAR files are not efficient to query and are not compressed, so they save no storage space; they only reduce pressure on the NameNode by reducing the number of files it must track.

hive> CREATE TABLE hive_text (line STRING) PARTITIONED BY (folder STRING);

hive> ! ls $HIVE_HOME;
LICENSE README.txt RELEASE_NOTES.txt
hive> ALTER TABLE hive_text ADD PARTITION (folder='docs');
hive> LOAD DATA INPATH '${env:HIVE_HOME}/README.txt'
    > INTO TABLE hive_text PARTITION (folder='docs');
Loading data to table default.hive_text partition (folder=docs)
hive> LOAD DATA INPATH '${env:HIVE_HOME}/RELEASE_NOTES.txt'
    > INTO TABLE hive_text PARTITION (folder='docs');
Loading data to table default.hive_text partition (folder=docs)
hive> SELECT * FROM hive_text WHERE line LIKE '%hive%' LIMIT 2;
http://hive.apache.org/ docs
- Hive 0.8.0 ignores the hive-default.xml file, though we continue docs

The ALTER TABLE ... ARCHIVE PARTITION statement converts the partition into an archive file. For example:
hive> SET hive.archive.enabled=true; 
hive> ALTER TABLE hive_text ARCHIVE PARTITION (folder='docs');
intermediate.archived is
file:/user/hive/warehouse/hive_text/folder=docs_INTERMEDIATE_ARCHIVED
intermediate.original is
file:/user/hive/warehouse/hive_text/folder=docs_INTERMEDIATE_ORIGINAL
Creating data.har for file:/user/hive/warehouse/hive_text/folder=docs
in file:/tmp/hive-edward/hive_..._3862901820512961909/-ext-10000/partlevel
Please wait... (this may take a while)
Moving file:/tmp/hive-edward/hive_..._3862901820512961909/-ext-10000/partlevel
to file:/user/hive/warehouse/hive_text/folder=docs_INTERMEDIATE_ARCHIVED
Moving file:/user/hive/warehouse/hive_text/folder=docs
to file:/user/hive/warehouse/hive_text/folder=docs_INTERMEDIATE_ORIGINAL
Moving file:/user/hive/warehouse/hive_text/folder=docs_INTERMEDIATE_ARCHIVED
to file:/user/hive/warehouse/hive_text/folder=docs
hive> dfs -ls /user/hive/warehouse/hive_text/folder=docs; 
Found 1 items
/user/hive/warehouse/hive_text/folder=docs/data.har
The ALTER TABLE ... UNARCHIVE PARTITION statement extracts the HAR file back into HDFS as separate files:
ALTER TABLE hive_text UNARCHIVE PARTITION (folder='docs');

11.8 Compression: Summary
