Deep Learning 13. Transformer Code Walkthrough: Decoder


To begin: start from the basics, keep learning, stay persistent, and keep going. From a Wang from Mars who loves life and loves technology.
BERT series:

BERT corpus generation

BERT loss explained

BERT transformer walkthrough: encoder

BERT transformer walkthrough: decoder

Without further ado, let's get straight to today's topic.

    def decode(self, targets, encoder_outputs, attention_bias):
        """
        :param targets:  [batch_size, target_length]
        :param encoder_outputs: [batch_size, input_length, hidden_size]
        :param attention_bias:  [batch_size, 1, 1, input_length]
        :return: [batch_size, target_length, vocab_size]
        """
        with tf.name_scope('decode'):
            #   [batch_size, target_length, hidden_size]
            decoder_inputs = self.embedding_layer(targets)
            with tf.name_scope('shift_targets'):
                #   pad embedding value 0 at the head of sequence and remove eos_id
                decoder_inputs = tf.pad(decoder_inputs, [[0, 0], [1, 0], [0, 0]])[:, :-1, :]
            with tf.name_scope('add_pos_embedding'):
                length = tf.shape(decoder_inputs)[1]
                position_decode = model_utils.get_position_encoding(length, self.params.get('hidden_size'))
                decoder_inputs = tf.add(decoder_inputs, position_decode)

            if self.train:
                decoder_inputs = tf.nn.dropout(decoder_inputs, 1. - self.params.get('encoder_decoder_dropout'))

            decoder_self_attention_bias = model_utils.get_decoder_self_attention_bias(length)

            outputs = self.decoder_stack(
                decoder_inputs,
                encoder_outputs,
                decoder_self_attention_bias,
                attention_bias
            )

            #   [batch_size, target_length, vocab_size]
            logits = self.embedding_layer.linear(outputs)

            return logits

The shapes of the input arguments are documented in the code's docstring. OK, let's go through the code step by step.
1. embedding_layer
This is the same as in the encoder, so it is not explained again here; if anything is unclear, see the encoder post in this series. The resulting shape is [batch_size, sequence_length, hidden_size].
2. pad

decoder_inputs = tf.pad(decoder_inputs, [[0, 0], [1, 0], [0, 0]])[:, :-1, :]

This call is fairly easy to understand. The input has shape [batch_size, sequence_length, hidden_size], so its rank is 3. The first dimension is not padded, the last dimension is not padded either, and the middle dimension is padded with one element at the front and nothing at the back. After padding, the shape is [batch_size, sequence_length+1, hidden_size]; the slice then drops the last value along the second dimension (the position that stands for the [EOS] marker). The shape is therefore [batch_size, sequence_length, hidden_size] again.
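The shift described above can be reproduced on a tiny array. The following is a minimal sketch that uses NumPy as a stand-in for tf.pad; the helper name shift_right and the toy values are assumptions made for illustration only.

```python
import numpy as np


def shift_right(embedded):
    """Prepend one zero vector along the time axis, then drop the last step.

    embedded: [batch_size, target_length, hidden_size]
    """
    # Pad only the middle (time) axis: one step in front, nothing behind.
    padded = np.pad(embedded, [(0, 0), (1, 0), (0, 0)], mode="constant")
    # Drop the last time step so the length is unchanged.
    return padded[:, :-1, :]


# Toy batch: 1 sequence, 3 time steps, hidden size 2.
x = np.arange(1, 7, dtype=np.float32).reshape(1, 3, 2)
shifted = shift_right(x)
print(shifted)
```

Each position now holds the embedding of the previous target token, so during training the decoder predicts token t from tokens before t.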
3. get_position_encoding
This step is the same as in the previous post, so it is not covered in detail again. The returned shape is [sequence_length, hidden_size], and it is simply added to the embedding output with add (broadcast over the batch dimension). The result has shape [batch_size, sequence_length, hidden_size]. A dropout layer is then applied during training.
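To make the broadcast concrete, here is a small NumPy sketch. The body of model_utils.get_position_encoding is not shown in this post, so the sinusoidal formula below (sine and cosine halves concatenated, timescales from 1 to 10000) is an assumption based on the commonly used formulation, and position_encoding is a hypothetical name.

```python
import math

import numpy as np


def position_encoding(length, hidden_size, min_timescale=1.0, max_timescale=1.0e4):
    """Sinusoidal encoding of shape [length, hidden_size] (assumed formulation)."""
    position = np.arange(length, dtype=np.float32)
    num_timescales = hidden_size // 2
    log_increment = math.log(max_timescale / min_timescale) / max(num_timescales - 1, 1)
    inv_timescales = min_timescale * np.exp(
        np.arange(num_timescales, dtype=np.float32) * -log_increment
    )
    scaled_time = position[:, None] * inv_timescales[None, :]
    # First half of the channels is sin, second half is cos.
    return np.concatenate([np.sin(scaled_time), np.cos(scaled_time)], axis=1)


batch_size, length, hidden_size = 2, 4, 6
embedded = np.zeros((batch_size, length, hidden_size), dtype=np.float32)
pos = position_encoding(length, hidden_size)

# [batch_size, length, hidden_size] + [length, hidden_size] broadcasts over the batch.
decoder_inputs = embedded + pos
print(pos.shape, decoder_inputs.shape)
```

The same [length, hidden_size] table is added to every sequence in the batch, which is why the function does not need to know batch_size.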
4. get_decoder_self_attention_bias

def get_decoder_self_attention_bias(length):
    with tf.name_scope("decoder_self_attention_bias"):
        valid_locs = tf.matrix_band_part(tf.ones([length, length]), -1, 0)
        valid_locs = tf.reshape(valid_locs, [1, 1, length, length])
        decoder_bias = _NEG_INF * (1.0 - valid_locs)
    return decoder_bias

tf.matrix_band_part(..., -1, 0) keeps the lower triangular part, so valid_locs looks like this for length 5:

[[1. 0. 0. 0. 0.]
 [1. 1. 0. 0. 0.]
 [1. 1. 1. 0. 0.]
 [1. 1. 1. 1. 0.]
 [1. 1. 1. 1. 1.]]

In the final output, the strictly upper triangular part (the future positions) is set to -1e9 and everything else is 0:

tf.Tensor(
[[[[-0.e+00 -1.e+09 -1.e+09 -1.e+09 -1.e+09]
   [-0.e+00 -0.e+00 -1.e+09 -1.e+09 -1.e+09]
   [-0.e+00 -0.e+00 -0.e+00 -1.e+09 -1.e+09]
   [-0.e+00 -0.e+00 -0.e+00 -0.e+00 -1.e+09]
   [-0.e+00 -0.e+00 -0.e+00 -0.e+00 -0.e+00]]]], shape=(1, 1, 5, 5), dtype=float32)
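The effect of this bias is easiest to see after a softmax. Below is a minimal NumPy sketch of the same computation, with np.tril standing in for tf.matrix_band_part; the softmax helper and the all-zero logits are assumptions chosen so the result is easy to read.

```python
import numpy as np

NEG_INF = -1e9


def decoder_self_attention_bias(length):
    """Return a [1, 1, length, length] bias: 0 on/below the diagonal, -1e9 above."""
    valid_locs = np.tril(np.ones((length, length), dtype=np.float32))
    valid_locs = valid_locs.reshape(1, 1, length, length)
    return NEG_INF * (1.0 - valid_locs)


def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


bias = decoder_self_attention_bias(4)
# With all-zero logits, each row attends uniformly to the allowed positions.
logits = np.zeros((1, 1, 4, 4), dtype=np.float32)
weights = softmax(logits + bias)
print(weights[0, 0])
```

Row i of the weights spreads its mass over positions 0..i only, so a target position can never attend to tokens that come after it.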

5. decoder_stack
OK, next is the decoder stack.

class DecoderStack(tf.layers.Layer):
    def __init__(self, params, train):
        super(DecoderStack, self).__init__()
        self.params = params
        self.train = train
        self.layers = list()
        for _ in range(self.params.get('num_blocks')):
            self_attention_layer = SelfAttention(
                hidden_size=self.params.get('hidden_size'),
                num_heads=self.params.get('num_heads'),
                attention_dropout=self.params.get('attention_dropout'),
                train=self.train
            )

            vanilla_attention_layer = AttentionLayer(
                hidden_size=self.params.get('hidden_size'),
                num_heads=self.params.get('num_heads'),
                attention_dropout=self.params.get('attention_dropout'),
                train=self.train
            )

            ffn_layer = FFNLayer(
                hidden_size=self.params.get('hidden_size'),
                filter_size=self.params.get('filter_size'),
                relu_dropout=self.params.get('relu_dropout'),
                train=self.train,
                allow_pad=self.params.get('allow_ffn_pad')
            )

            self.layers.append(
                [
                    PrePostProcessingWrapper(self_attention_layer, self.params, self.train),
                    PrePostProcessingWrapper(vanilla_attention_layer, self.params, self.train),
                    PrePostProcessingWrapper(ffn_layer, self.params, self.train)
                ]
            )

        self.output_norm = LayerNormalization(self.params.get('hidden_size'))
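The post shows only the constructor of DecoderStack, not its call method. As a rough sketch of how the three wrapped sublayers are typically chained, the plain-Python function below uses stand-in callables; the function name run_decoder_stack, the argument order of each sublayer, and the fake layers are all assumptions, not the author's code.

```python
def run_decoder_stack(layers, output_norm, decoder_inputs, encoder_outputs,
                      decoder_self_attention_bias, attention_bias):
    """Apply each block's three sublayers in order, then the final norm (assumed order)."""
    x = decoder_inputs
    for self_attention, enc_dec_attention, ffn in layers:
        # 1) masked self-attention over the decoder inputs
        x = self_attention(x, decoder_self_attention_bias)
        # 2) attention from the decoder to the encoder outputs
        x = enc_dec_attention(x, encoder_outputs, attention_bias)
        # 3) position-wise feed-forward network
        x = ffn(x)
    return output_norm(x)


# Stand-in sublayers that only record the call order and pass x through.
trace = []


def fake_self_attention(x, bias):
    trace.append("self_attention")
    return x


def fake_enc_dec_attention(x, encoder_outputs, bias):
    trace.append("enc_dec_attention")
    return x


def fake_ffn(x):
    trace.append("ffn")
    return x


def fake_norm(x):
    trace.append("output_norm")
    return x


layers = [(fake_self_attention, fake_enc_dec_attention, fake_ffn)] * 2
result = run_decoder_stack(layers, fake_norm, "dec", "enc", None, None)
print(trace)
```

With num_blocks set to 2, the trace shows the three sublayers repeated twice, followed by one final normalization.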

5.1 self_attention
This part is exactly the same as the encoder's self_attention: Q, K, V = decoder_inputs, and the computation is identical. The only difference is that the bias is the one produced by get_decoder_self_attention_bias.
5.2 vanilla_attention
What sets this attention apart is that it is the decoder attending to the encoder. It is arguably the most important attention in the model, because it aligns the decoder with the encoder.
Q = decoder_inputs; K, V = encoder_outputs

Q has shape [B, T_d, D], and K and V have shape [B, T_e, D].

Q, K, and V each go through split_heads. Q then has shape [B, H, T_d, D//H], and K and V have shape [B, H, T_e, D//H], where H is num_heads.

Q = scale(Q)

logits = tf.matmul(Q, K, transpose_b=True), which returns shape [B, H, T_d, T_e].

logits = tf.add(logits, bias). This bias is the attention_bias computed in the first step of the first section (the one passed into decode), with shape [B, 1, 1, T_e]. Broadcasting gives a result of shape [B, H, T_d, T_e].

weights = tf.nn.softmax(logits)

dropout(weights)

attention_output = tf.matmul(weights, V). weights has shape [B, H, T_d, T_e] and V has shape [B, H, T_e, D//H], so the result has shape [B, H, T_d, D//H].

out = combine(heads), which returns shape [B, T_d, D].

dense(out, D), which returns shape [B, T_d, D].
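The shape bookkeeping above can be checked with a short NumPy sketch. It follows the listed steps literally, so it skips any learned Q/K/V projections (a real attention layer usually applies them first) and skips dropout; the matrix Wo standing in for the final dense layer, the helper names, and the random inputs are assumptions.

```python
import numpy as np


def split_heads(x, num_heads):
    """[B, T, D] -> [B, H, T, D//H]"""
    b, t, d = x.shape
    return x.reshape(b, t, num_heads, d // num_heads).transpose(0, 2, 1, 3)


def combine_heads(x):
    """[B, H, T, D//H] -> [B, T, D]"""
    b, h, t, depth = x.shape
    return x.transpose(0, 2, 1, 3).reshape(b, t, h * depth)


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


B, Td, Te, D, H = 2, 3, 5, 8, 2
rng = np.random.default_rng(0)
decoder_inputs = rng.standard_normal((B, Td, D))   # source of Q
encoder_outputs = rng.standard_normal((B, Te, D))  # source of K and V
Wo = rng.standard_normal((D, D))                   # stand-in for dense(out, D)

# Padding bias [B, 1, 1, Te]: pretend the last two source tokens of batch 1 are padding.
attention_bias = np.zeros((B, 1, 1, Te))
attention_bias[1, 0, 0, -2:] = -1e9

q = split_heads(decoder_inputs, H) * (D // H) ** -0.5  # [B, H, Td, D//H], scaled
k = split_heads(encoder_outputs, H)                    # [B, H, Te, D//H]
v = split_heads(encoder_outputs, H)                    # [B, H, Te, D//H]

logits = q @ k.transpose(0, 1, 3, 2)                   # [B, H, Td, Te]
logits = logits + attention_bias                       # broadcast over H and Td
weights = softmax(logits)                              # [B, H, Td, Te]
heads = weights @ v                                    # [B, H, Td, D//H]
out = combine_heads(heads) @ Wo                        # [B, Td, D]
print(logits.shape, heads.shape, out.shape)
```

Because the bias has singleton dimensions for heads and target positions, every head and every decoder position ignores the same padded encoder positions.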

5.3 feed_forward
Same as in the encoder.
5.4 norm
Same as in the encoder.
5.5 linear

    def linear(self, inputs):
        """
        :param inputs:  a tensor with shape [batch_size, length, hidden_size]
        :return: float32 tensor with shape [batch_size, length, vocab_size]
        """

        with tf.name_scope('pre_softmax_linear'):
            batch_size = tf.shape(inputs)[0]
            length = tf.shape(inputs)[1]

            inputs = tf.reshape(inputs, [-1, self.hidden_size])
            """
                inputs              [batch_size, length, hidden_size]
                shared_weights      [vocab_size, hidden_size]
                transpose           [hidden_size, vocab_size]
                logits              [batch_size, length, vocab_size]
            """
            logits = tf.matmul(inputs, self.shared_weights, transpose_b=True)

            return tf.reshape(logits, [batch_size, length, self.vocab_size])

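This one needs little explanation. One thing to note: shared_weights is the same matrix that was initialized for the embedding. The final output holds, for every position, the logits over the vocab, which become a probability distribution once softmax is applied.
Finally, the piece that has not been mentioned so far: PrePostProcessingWrapper.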
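Sharing the embedding matrix with the output projection can be sketched in a few lines of NumPy. The embed and linear helpers below mirror the idea only; the random weights, the toy sizes, and the explicit softmax step are assumptions for illustration.

```python
import numpy as np

vocab_size, hidden_size = 7, 4
rng = np.random.default_rng(0)
# One matrix serves both as embedding table and as output projection.
shared_weights = rng.standard_normal((vocab_size, hidden_size))


def embed(token_ids):
    """[batch, length] -> [batch, length, hidden_size]"""
    return shared_weights[token_ids]


def linear(inputs):
    """[batch, length, hidden_size] -> [batch, length, vocab_size]"""
    batch_size, length, _ = inputs.shape
    flat = inputs.reshape(-1, hidden_size)
    logits = flat @ shared_weights.T  # multiply by the transposed embedding table
    return logits.reshape(batch_size, length, vocab_size)


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


token_ids = np.array([[1, 4, 2]])
hidden = embed(token_ids)   # stand-in for the decoder stack output
logits = linear(hidden)
probs = softmax(logits)
print(logits.shape, probs.sum(axis=-1))
```

Each logit is the dot product between a hidden vector and one row of the embedding table, so tokens whose embeddings point in a similar direction to the hidden state score higher.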

class PrePostProcessingWrapper(object):
    """Wrapper class that applies layer pre-processing and post-processing."""

    def __init__(self, layer, params, train):
        self.layer = layer
        self.postprocess_dropout = params["layer_postprocess_dropout"]
        self.train = train

        # Create normalization layer
        self.layer_norm = LayerNormalization(params["hidden_size"])

    def __call__(self, x, *args, **kwargs):
        # Preprocessing: apply layer normalization
        y = self.layer_norm(x)

        # Get layer output
        y = self.layer(y, *args, **kwargs)

        # Postprocessing: apply dropout and residual connection
        if self.train:
            y = tf.nn.dropout(y, 1 - self.postprocess_dropout)
        return x + y

This wrapper does the following:

First, it applies norm to the input; this is the same norm mentioned earlier.

Then it computes the output of the wrapped layer.

During training, it applies dropout to that output.

Finally, it adds the result to the original input, forming a residual connection.

This processing wraps the input and output of every sublayer.
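The four steps can be traced numerically with a small NumPy sketch. The layer_norm below omits the learned scale and bias that a LayerNormalization layer would normally carry, and the class name PrePostWrapper, the inverted-dropout implementation, and the toy layer are assumptions for illustration.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Normalize the last axis to zero mean and unit variance (no learned scale/bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


class PrePostWrapper:
    """norm -> layer -> dropout (training only) -> residual add."""

    def __init__(self, layer, dropout_rate, train, seed=0):
        self.layer = layer
        self.dropout_rate = dropout_rate
        self.train = train
        self.rng = np.random.default_rng(seed)

    def __call__(self, x, *args, **kwargs):
        y = layer_norm(x)                    # pre-processing
        y = self.layer(y, *args, **kwargs)   # wrapped sublayer
        if self.train and self.dropout_rate > 0.0:
            keep = 1.0 - self.dropout_rate
            mask = self.rng.random(y.shape) < keep
            y = y * mask / keep              # inverted dropout
        return x + y                         # residual connection


x = np.array([[[1.0, 2.0, 3.0, 4.0]]])
wrapped = PrePostWrapper(layer=lambda y: 2.0 * y, dropout_rate=0.1, train=False)
out = wrapped(x)
print(out)
```

Because the residual adds the untouched input x, the sublayer only has to learn a correction on top of it, and normalizing before the sublayer keeps its input on a stable scale.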
Thank you!
For more code, visit my personal GitHub, which is updated from time to time. You are welcome to stop by.


You may freely share or copy this text, but please keep this document's URL as the reference URL.

Licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
