Hadoop I/O

Data Integrity

HDFS: % hadoop fs -cat hdfs://namenode/data/a.txt
LocalFS: % hadoop fs -cat file:///tmp/a.txt
Generate a CRC checksum file:
% hadoop fs -copyToLocal -crc /data/a.txt file:///data/a.txt
Checksum file: .a.txt.crc is a hidden file.
Ref: CRC-32 is a cyclic redundancy check, an error-detecting algorithm.
io.bytes.per.checksum is deprecated; use dfs.bytes-per-checksum instead. The default is 512, and it must not be larger than dfs.stream-buffer-size, the size of the buffer used to stream files. That buffer size should probably be a multiple of the hardware page size (4096 on Intel x86), and it determines how much data is buffered during read and write operations.
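The idea of one checksum per fixed-size chunk can be sketched with the JDK alone. This is an illustration, not Hadoop's implementation: the class name ChunkedChecksum and its methods are hypothetical, and only the chunk size of 512 comes from the default mentioned above.

```java
import java.util.zip.CRC32;

final class ChunkedChecksum {
    /** Returns one CRC-32 value per chunk of bytesPerChecksum bytes (the last chunk may be shorter). */
    static long[] checksums(byte[] data, int bytesPerChecksum) {
        int chunks = (data.length + bytesPerChecksum - 1) / bytesPerChecksum;
        long[] sums = new long[chunks];
        CRC32 crc = new CRC32();
        for (int i = 0; i < chunks; i++) {
            int offset = i * bytesPerChecksum;
            int length = Math.min(bytesPerChecksum, data.length - offset);
            crc.reset();
            crc.update(data, offset, length);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    /** Returns the index of the first chunk whose checksum does not match, or -1 if all match. */
    static int firstCorruptChunk(byte[] data, long[] expected, int bytesPerChecksum) {
        long[] actual = checksums(data, bytesPerChecksum);
        for (int i = 0; i < actual.length; i++) {
            if (i >= expected.length || actual[i] != expected[i]) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] data = new byte[1300];          // three chunks at 512 bytes per checksum
        long[] sums = checksums(data, 512);
        data[600] ^= 1;                        // flip one bit inside the second chunk
        System.out.println("first corrupt chunk: " + firstCorruptChunk(data, sums, 512));
    }
}
```

Because each chunk has its own checksum, a reader can tell which part of the file is damaged instead of only learning that the file as a whole is bad.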

Data Compression

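Common algorithms
According to the book, Hadoop supports four compression algorithms. The space/time trade-off is tuned with options -1 through -9, ranging from best speed to best space. The codecs live in org.apache.hadoop.io.compress.*.

DEFLATE (.deflate): the algorithm behind the commonly used gzip; package ...DefaultCodec

Gzip (.gz): adds a file header and trailer to the DEFLATE format. Compression speed (moderate), decompression speed (moderate), compression ratio (moderate); package ...GzipCodec; both Java and native implementations

bzip2 (.bz2): compression speed (worst), decompression speed (worst), compression ratio (best); supports splitting (splittable), which makes it very friendly to MapReduce; package ...BZip2Codec; Java only

LZO (.lzo): compression speed (best), decompression speed (best), compression ratio (worst); package com.hadoop.compression.lzo.LzopCodec; native only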
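The -1 to -9 trade-off can be tried out with the JDK's own DEFLATE implementation. The following is a minimal sketch that uses java.util.zip rather than the Hadoop codecs; the class DeflateLevels and the helper sample() are hypothetical names introduced for the illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class DeflateLevels {
    /** Compresses input at the given level: 1 favors speed, 9 favors space. */
    static byte[] compress(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Restores the original bytes from the output of compress(). */
    static byte[] decompress(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported input");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt input", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Builds a highly repetitive sample so that compression has something to work with. */
    static byte[] sample() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("hadoop io ");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] input = sample();
        System.out.println("original: " + input.length);
        System.out.println("level 1:  " + compress(input, Deflater.BEST_SPEED).length);
        System.out.println("level 9:  " + compress(input, Deflater.BEST_COMPRESSION).length);
    }
}
```

The sizes printed depend on the input, so measure with representative data before picking a level.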

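To avoid the native libraries, set hadoop.native.lib to false.
When using the native libraries, creating codec objects can be expensive, so use CodecPool to reuse them.
For very large data files, consider the following storage options:

Use bzip2, which supports splitting.

Split the file manually so that each compressed part is close to the block size.

Use a SequenceFile, which supports compression and splitting.

Use an Avro data file, which supports compression and splitting and can also be read and written from many programming languages.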

To compress the MapReduce job output automatically:

conf.setBoolean("mapred.output.compress",true);
conf.setClass("mapred.output.compression.codec",GzipCodec.class,CompressionCodec.class);

To compress the intermediate map output automatically:

//or conf.setCompressMapOutput(true);
conf.setBoolean("mapred.compress.map.output",true);

//or conf.setMapOutputCompressorClass(GzipCodec.class)
conf.setClass("mapred.map.output.compression.codec",GzipCodec.class,CompressionCodec.class);

Serialization (Serialization/Deserialization)

Writable and WritableComparable

// core interface for Hadoop serialization
public interface Writable{
       void write(DataOutput out) throws IOException;
       void readFields(DataInput in) throws IOException;
}

public interface Comparable<T>{
       int compareTo(T o);
}

//core class for map-reduce shuffle
public interface WritableComparable<T> extends Writable, Comparable<T> {
}

// Sample
public class MyWritableComparable implements WritableComparable<MyWritableComparable> {
       // Some data
       private int counter;
       private long timestamp;
       
       public void write(DataOutput out) throws IOException {
         out.writeInt(counter);
         out.writeLong(timestamp);
       }
       
       public void readFields(DataInput in) throws IOException {
         counter = in.readInt();
         timestamp = in.readLong();
       }
       
       public int compareTo(MyWritableComparable o) {
         // Order instances by counter
         int thisValue = this.counter;
         int thatValue = o.counter;
         return (thisValue < thatValue ? -1 : (thisValue==thatValue ? 0 : 1));
       }

       public int hashCode() {
         final int prime = 31;
         int result = 1;
         result = prime * result + counter;
         result = prime * result + (int) (timestamp ^ (timestamp >>> 32));
         return result;
       }
}
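The write/readFields contract can be exercised without a cluster. The sketch below uses only the JDK: SimpleWritable is a hypothetical stand-in that mirrors the two methods of Writable shown above, and Event mirrors the sample's counter and timestamp fields.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class WritableRoundTrip {
    /** Hypothetical stand-in for Hadoop's Writable, so the example has no external dependency. */
    interface SimpleWritable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    static final class Event implements SimpleWritable {
        int counter;
        long timestamp;

        Event() { }

        Event(int counter, long timestamp) {
            this.counter = counter;
            this.timestamp = timestamp;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(counter);      // 4 bytes
            out.writeLong(timestamp);   // 8 bytes
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            // Fields must be read in exactly the order they were written.
            counter = in.readInt();
            timestamp = in.readLong();
        }
    }

    static byte[] serialize(SimpleWritable w) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            w.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static void deserialize(SimpleWritable target, byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            target.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = serialize(new Event(7, 1234567890123L));
        Event copy = new Event();
        deserialize(copy, bytes);
        System.out.println(bytes.length + " bytes, counter=" + copy.counter + ", timestamp=" + copy.timestamp);
    }
}
```

Note that deserialization fills an existing instance rather than creating a new one, which is what allows Hadoop to reuse Writable objects.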

//optimized for comparing records directly in the byte stream
public interface RawComparator<T> extends Comparator<T>{
      // s1 start position, l1, length of bytes
      public int compare(byte[] b1, int s1,int l1,byte[] b2,int s2,int l2);
}

public class WritableComparator implements RawComparator{
}

Hierarchy: Comparator <- RawComparator <- WritableComparator

WritableComparator provides a default implementation of the raw compare() that deserializes the objects before comparing them, so it performs relatively poorly. It also serves as a factory for RawComparator instances:
RawComparator comparator = WritableComparator.get(IntWritable.class);
//Register an optimized comparator for a WritableComparable implementation.
static void define(Class c, WritableComparator comparator);

//Get a comparator for a WritableComparable implementation.
static WritableComparator get(Class c);

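public class MyWritableComparator extends WritableComparator{
    static{
        define(MyWritableComparable.class, new MyWritableComparator());
    }
    public MyWritableComparator() {
        super(MyWritableComparable.class);
    }

    @Override
    public int compare(byte[] b1, int s1,int l1,byte[] b2,int s2,int l2){
        // counter is the first serialized field, so compare it directly on the raw bytes
        int thisValue = readInt(b1, s1);
        int thatValue = readInt(b2, s2);
        return (thisValue < thatValue ? -1 : (thisValue==thatValue ? 0 : 1));
    }
}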
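To show what comparing without deserializing means, here is a JDK-only sketch that orders two serialized ints by reading them straight out of their byte arrays. The class RawIntCompare is hypothetical; its compare method only copies the parameter layout of RawComparator shown above.

```java
import java.nio.ByteBuffer;

final class RawIntCompare {
    /** Reads a big-endian int (the layout DataOutput.writeInt produces) starting at offset start. */
    static int readInt(byte[] b, int start) {
        return ((b[start] & 0xff) << 24)
             | ((b[start + 1] & 0xff) << 16)
             | ((b[start + 2] & 0xff) << 8)
             |  (b[start + 3] & 0xff);
    }

    /** Compares two serialized ints in place; s1/s2 are start offsets, l1/l2 the record lengths. */
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        return Integer.compare(readInt(b1, s1), readInt(b2, s2));
    }

    /** Serializes an int as four big-endian bytes. */
    static byte[] toBytes(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    public static void main(String[] args) {
        byte[] a = toBytes(3);
        byte[] b = toBytes(7);
        System.out.println(compare(a, 0, a.length, b, 0, b.length) < 0);
    }
}
```

No object is created per comparison, which matters during the sort phase where keys are compared many times.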

Note: the static initializer runs only when an instance of the class is created or one of its static methods or members is accessed. Alternatively, use the following code directly:
Class.forName("package.yourclass"); forces the static initializer to run.
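The behavior described in the note can be observed with a small JDK-only sketch. StaticInitDemo, Registered and forceInit are hypothetical names; the registry list merely stands in for a comparator registration such as define().

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

final class StaticInitDemo {
    /** Stand-in for a registry that static initializers write to. */
    static final List<String> REGISTRY = new CopyOnWriteArrayList<>();

    static final class Registered {
        static {
            // Runs once, when the class is initialized.
            REGISTRY.add("Registered");
        }
    }

    /** Forces initialization of the given class, which runs its static initializer if it has not run yet. */
    static void forceInit(Class<?> type) {
        try {
            Class.forName(type.getName(), true, type.getClassLoader());
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Referring to Registered.class by itself does not initialize the class.
        System.out.println("before: " + REGISTRY.contains("Registered"));
        forceInit(Registered.class);
        System.out.println("after:  " + REGISTRY.contains("Registered"));
    }
}
```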

Java Primitive Data Type wrapped by Writable

Extends from WritableComparable

BooleanWritable, 1

ByteWritable, 1

BytesWritable

IntWritable, 4

VIntWritable, 1~5

FloatWritable, 4

LongWritable, 8

VLongWritable, 1~9

DoubleWritable, 8

NullWritable, immutable singleton

Text, 4~

MD5Hash

ObjectWritable

GenericWritable

Extends from Writable only

ArrayWritable

TwoDArrayWritable

AbstractMapWritable

MapWritable

SortedMapWritable

[Text]

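Text in particular is serialized with zero-compressed encoding. From the material I have read, this is simply an encoding scheme whose purpose is to omit the space taken by high-order zeros: small numbers save space, while large numbers take up extra space. Compared with compression, it can be faster. It is in fact similar to the encoding used by VIntWritable and VLongWritable.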
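A sketch may make the idea of dropping high-order zeros concrete. The code below is written in the spirit of that encoding using only the JDK; the exact byte layout Hadoop uses is not taken from its source here and should be treated as an assumption, and the class ZeroCompressedLong is a hypothetical name.

```java
import java.io.ByteArrayOutputStream;

final class ZeroCompressedLong {
    /** Encodes v in 1 to 9 bytes: small values fit in one byte, larger ones get a header byte. */
    static byte[] encode(long v) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        if (v >= -112 && v <= 127) {
            out.write((int) v);                 // the value itself is the single byte
            return out.toByteArray();
        }
        boolean negative = v < 0;
        long magnitude = negative ? ~v : v;     // one's complement turns leading ones into zeros
        int n = 8 - Long.numberOfLeadingZeros(magnitude) / 8;   // significant bytes, 1..8
        // Header byte: -113..-120 for positive values, -121..-128 for negative values.
        out.write(negative ? -120 - n : -112 - n);
        for (int i = n - 1; i >= 0; i--) {
            out.write((int) (magnitude >>> (8 * i)) & 0xff);    // big-endian payload
        }
        return out.toByteArray();
    }

    /** Decodes a value produced by encode(). */
    static long decode(byte[] b) {
        byte first = b[0];
        if (first >= -112) {
            return first;                       // single-byte form
        }
        boolean negative = first < -120;
        int n = negative ? -(first + 120) : -(first + 112);
        long magnitude = 0;
        for (int i = 1; i <= n; i++) {
            magnitude = (magnitude << 8) | (b[i] & 0xff);
        }
        return negative ? ~magnitude : magnitude;
    }

    public static void main(String[] args) {
        long[] samples = {100L, 128L, Integer.MAX_VALUE, Long.MAX_VALUE};
        for (long v : samples) {
            System.out.println(v + " -> " + encode(v).length + " byte(s)");
        }
    }
}
```

A value that needs all eight bytes costs nine bytes in this scheme, one more than a fixed-length long, which is the extra space for large numbers mentioned above.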
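- How to choose between variable-length and fixed-length values?
1. Fixed-length suits values that are distributed very uniformly (for example, hashes); variable-length suits values that are distributed very unevenly.
2. Variable-length saves space, and you can switch between VIntWritable and VLongWritable.
- Differences between Text and String
1. String is a sequence of char, whereas Text is a sequence of UTF-8 bytes.
The UTF8 class cannot handle strings whose UTF-8 encoding exceeds 32767 bytes.
Indexing: for ASCII, Text and String behave the same; for Unicode they differ. The length of a String is the number of char code units, whereas the length of a Text is the number of UTF-8 bytes. codePointAt returns a true Unicode character, which may occupy 2 chars or 4 bytes.
Iteration: convert the Text to a ByteBuffer, then call the static bytesToCodePoint() method repeatedly to obtain each complete Unicode code point.
Mutable: like other Writables and StringBuffer, Text can be changed with set(). getLength() returns the length of the valid data, whereas getBytes().length returns the size of the backing array.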
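The difference between char units, UTF-8 bytes and code points can be checked with plain JDK strings. This sketch does not use the Text class; TextVsString and its method names are hypothetical, and the sample string holds one character each of 1, 2, 3 and 4 UTF-8 bytes.

```java
import java.nio.charset.StandardCharsets;

final class TextVsString {
    /** U+0041, U+00DF, U+6771 and U+10400 (the last one is a surrogate pair in Java). */
    static final String SAMPLE = "\u0041\u00DF\u6771\uD801\uDC00";

    /** Length as String sees it: the number of char code units. */
    static int charLength(String s) {
        return s.length();
    }

    /** Length as a UTF-8 byte sequence, which is the unit Text counts in. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** The number of actual Unicode characters (code points). */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println("chars:       " + charLength(SAMPLE));
        System.out.println("UTF-8 bytes: " + utf8Length(SAMPLE));
        System.out.println("code points: " + codePoints(SAMPLE));
    }
}
```

For this sample the three counts are 5, 10 and 4, so an index computed on a String cannot be reused as a byte offset into the UTF-8 form.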

[BytesWritable]

This is a wrapper for a byte array, similar to BSTR on Windows: a leading integer gives the number of bytes, followed by the bytes themselves.
It is also mutable, so getLength() != getBytes().length.
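Both points, the length prefix and the gap between length and capacity, can be sketched with the JDK. LengthPrefixedBytes is a hypothetical class; its grow-only buffer policy is an assumption made for the illustration and is not claimed to match how BytesWritable resizes.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class LengthPrefixedBytes {
    private byte[] buffer = new byte[0];
    private int length;

    /** Copies src into the reusable buffer; the buffer grows when needed but never shrinks. */
    void set(byte[] src) {
        if (src.length > buffer.length) {
            buffer = new byte[src.length];
        }
        System.arraycopy(src, 0, buffer, 0, src.length);
        length = src.length;
    }

    /** Number of valid bytes. */
    int getLength() {
        return length;
    }

    /** The backing array, which may be longer than getLength(). */
    byte[] getBytes() {
        return buffer;
    }

    /** Writes a 4-byte big-endian length followed by only the valid bytes. */
    byte[] serialize() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(length);
            out.write(buffer, 0, length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        LengthPrefixedBytes b = new LengthPrefixedBytes();
        b.set(new byte[] {1, 2, 3, 4});
        b.set(new byte[] {3, 5});
        System.out.println("length=" + b.getLength() + ", capacity=" + b.getBytes().length);
    }
}
```

Code that reads getBytes() should therefore always stop at getLength(), because the bytes beyond it are leftovers from an earlier value.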

[NullWritable]

NullWritable is a special kind of Writable. Its serialized length is zero; it is only a placeholder that is neither read nor written and exists only inside the program.
It is an immutable singleton.

[ObjectWritable]

ObjectWritable is a general-purpose wrapper for Java arrays, Strings and primitive types (note: Integer is not included). Its serialization uses Java type serialization and writes type information, so it takes up more space.
It has two constructors:
public ObjectWritable(Object instance);
public ObjectWritable(Class declaredClass,Object instance);
For example:
ObjectWritable objectw = new ObjectWritable(int.class,5);

[GenericWritable]

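First, this is an abstract class, so it must be subclassed before it can be used.
The following example shows a proxy Writable, expressed as a union, that solves the problem of declaring the value parameter type of the reduce function.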

public class MyGenericWritable extends GenericWritable {

    private static Class<? extends Writable>[] CLASSES = null;

    static {
        CLASSES = (Class<? extends Writable>[]) new Class[] {
            IntWritable.class,
            Text.class
             //add as many different Writable class as you want
        };
    }


    @Override
    protected Class<? extends Writable>[] getTypes() {
        return CLASSES;
    }

    @Override
    public String toString() {
        return "MyGenericWritable [getTypes()=" + Arrays.toString(getTypes()) + "]";
    }

    // override hashcode();
}

public class Reduce extends Reducer<Text, MyGenericWritable, Text, Text> {
    public void reduce(Text key, Iterable<MyGenericWritable> values, Context context) throws IOException, InterruptedException {
        // value.get() returns the wrapped Writable (an IntWritable or a Text here)
    }
}
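The union idea can be sketched without Hadoop: write a small tag that says which registered type follows, then the payload. The class TaggedUnion below is hypothetical and JDK-only; the one-byte tag and the use of Integer and String as the registered types are assumptions for the illustration, not a description of GenericWritable's wire format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

final class TaggedUnion {
    /** Registered types; the index in this array is the tag written before the payload. */
    static final Class<?>[] TYPES = {Integer.class, String.class};

    static byte[] write(Object value) {
        int tag = Arrays.asList(TYPES).indexOf(value.getClass());
        if (tag < 0) {
            throw new IllegalArgumentException("unregistered type: " + value.getClass());
        }
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeByte(tag);
            if (tag == 0) {
                out.writeInt((Integer) value);
            } else {
                out.writeUTF((String) value);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static Object read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int tag = in.readByte();
            if (tag == 0) {
                return Integer.valueOf(in.readInt());
            }
            return in.readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(read(write(42)));
        System.out.println(read(write("hadoop")));
    }
}
```

A small tag is far cheaper than writing the class name with every record, which is the cost a fully general wrapper pays.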

[ArrayWritable /TwoDArrayWritable]

ArrayWritable aw = new ArrayWritable(Text.class);

[MapWritable / SortedMapWritable]

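These implement java.util.Map and SortedMap...
The map is written first, and then each class type is replaced with an id to save space. This work is done in the parent class AbstractMapWritable.
Collection summary:
1. For a list of a single type, ArrayWritable is sufficient.
2. To store different Writable types in one list:
-- Use GenericWritable to wrap the elements so that they all have the same type.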

public class MyGenericWritable extends GenericWritable {

    private static Class<? extends Writable>[] CLASSES = null;

    static {
        CLASSES = (Class<? extends Writable>[]) new Class[] {
            ArrayWritable.class,
             //add as many different Writable class as you want
        };
    }


    @Override
    protected Class<? extends Writable>[] getTypes() {
        return CLASSES;
    }
}

-- Alternatively, write a ListWritable modeled on MapWritable.
//Take care to implement hashCode, equals, toString and compareTo (if possible)
//hashCode is especially important: HashPartitioner usually picks the reduce partition from hashCode, so the class needs a good hashCode.
public class ListWritable extends ArrayList implements Writable {
}

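import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

/**
 * @author cloudera
 *
 */
public class ListWritable extends ArrayList<Writable> implements Writable {

	// The list itself is the storage, so set() simply appends one element.
	public void set(Writable writable){
		add(writable);
	}

	@Override
	public void readFields(DataInput in) throws IOException {
		// Writable instances are reused, so drop any previous contents first.
		clear();
		int nsize = in.readInt();
		Configuration conf = new Configuration();
		Text className = new Text();
		while(nsize-->0){
			// Each element is preceded by its class name.
			className.readFields(in);
			Class<?> theClass;
			try {
				theClass = Class.forName(className.toString());
			} catch (ClassNotFoundException e) {
				throw new IOException("Unknown element class: " + className, e);
			}

			Writable w = (Writable)ReflectionUtils.newInstance(theClass,conf);
			w.readFields(in);
			add(w);
		}
	}

	@Override
	public void write(DataOutput out) throws IOException {
		// Element count first, then class name and payload for each element.
		out.writeInt(size());
		for(int i = 0;i<size();i++){
			Writable w = get(i);
			new Text(w.getClass().getName()).write(out);
			w.write(out);
		}
	}

}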
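The earlier remark that hashCode decides the reduce partition can be made concrete with a JDK-only sketch of the usual hash-partitioning formula. PartitionSketch and partitionFor are hypothetical names; masking with Integer.MAX_VALUE clears the sign bit so that negative hash codes still map to a valid partition index.

```java
final class PartitionSketch {
    /** Maps a key to a partition in the range [0, numReduceTasks). */
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] keys = {"alpha", "beta", "gamma", "delta"};
        for (String key : keys) {
            System.out.println(key + " -> reducer " + partitionFor(key, 4));
        }
    }
}
```

A class whose hashCode returns the same value for every instance would send all of its keys to a single reducer, so a custom Writable used as a key needs a hashCode that spreads values well and agrees with equals.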


This text may be freely shared or copied, but please keep this document's URL as the reference URL.

Licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.

Hadoop Outline - Part 2 (I/O - Writable)