Optimizing a Program Through SVE2 Auto-Vectorization

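Today we'll revisit the volume-scaling algorithms we benchmarked earlier, this time with the goal of adding SVE2 optimization to improve the runtime further. Because we're using SVE2, the change has to go into vol4.c or vol5.c, the AArch64-specific versions of the algorithm, which use inline assembly and intrinsics respectively.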

To keep things simple, I'll use the first candidate, vol4.c, which uses inline assembly. Its main() function looks like this:

int main() {

#ifndef __aarch64__
        printf("Wrong architecture - written for aarch64 only.\n");
#else


        // these variables will also be accessed by our assembler code
        int16_t*        in_cursor;              // input cursor
        int16_t*        out_cursor;             // output cursor
        int16_t         vol_int;                // volume as int16_t

        int16_t*        limit;                  // end of input array

        int             x;                      // array iterator
        int             ttl=0 ;                 // array total

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]


        // set vol_int to fixed-point representation of the volume factor
        // Q: should we use 32767 or 32768 in next line? why?
        vol_int = (int16_t)(VOLUME/100.0 * 32767.0);

        // Q: what is the purpose of these next two lines?
        in_cursor = in;
        out_cursor = out;
        limit = in + SAMPLES;

        // Q: what does it mean to "duplicate" values in the next line?
        __asm__ ("dup v1.8h,%w0"::"r"(vol_int)); // duplicate vol_int into v1.8h

        while ( in_cursor < limit ) {
                __asm__ (
                        "ldr q0, [%[in_cursor]], #16    \n\t"
                        // load eight samples into q0 (same as v0.8h)
                        // from [in_cursor]
                        // post-increment in_cursor by 16 bytes
                        // and store back into the pointer register


                        "sqrdmulh v0.8h, v0.8h, v1.8h   \n\t"
                        // multiply each lane of v0 by the matching lane of v1,
                        // double the 32-bit signed intermediate result,
                        // round, and saturate
                        // store the upper 16 bits of each result into
                        // the corresponding lane in v0

                        "str q0, [%[out_cursor]],#16            \n\t"
                        // store eight samples to [out_cursor]
                        // post-increment out_cursor by 16 bytes
                        // and store back into the pointer register

                        // Q: What do these next three lines do?
                        : [in_cursor]"+r"(in_cursor), [out_cursor]"+r"(out_cursor)
                        : "r"(in_cursor),"r"(out_cursor)
                        : "memory"
                        );
        }

// --------------------------------------------------------------------

        for (x = 0; x < SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

        // Q: are the results usable? are they correct?
        printf("Result: %d\n", ttl);

        return 0;

#endif
}
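
The comments around vol_int and sqrdmulh describe Q15 fixed-point arithmetic. As a rough scalar model of what happens to a single 16-bit lane, the sketch below can be compiled on any machine. It is an illustration only, not code from vol4.c, and the helper names to_q15 and sqrdmulh_lane are made up here.

```c
#include <stdint.h>

/* Hypothetical helper: convert a volume percentage (0..100) to Q15 the
   same way vol4.c computes vol_int. 32767 is used because 32768 does
   not fit in an int16_t. */
int16_t to_q15(int percent)
{
    return (int16_t)(percent / 100.0 * 32767.0);
}

/* Hypothetical helper: scalar model of one 16-bit lane of SQRDMULH
   (signed saturating rounding doubling multiply, high half). */
int16_t sqrdmulh_lane(int16_t a, int16_t b)
{
    /* Doubling multiply plus the rounding constant, in a wide type. */
    int64_t r = 2 * (int64_t)a * (int64_t)b + (1 << 15);

    /* Keep the high half: floor(r / 2^16), written without shifting a
       negative value so the result is well defined in standard C. */
    r = r >= 0 ? r / 65536 : -((-r + 65535) / 65536);

    /* Saturate; only INT16_MIN * INT16_MIN can exceed the range. */
    if (r > INT16_MAX)
        r = INT16_MAX;
    return (int16_t)r;
}
```

With a 50% volume, to_q15 gives 16383, and a sample of 1000 is scaled to 500 by sqrdmulh_lane, which is the halving we would expect.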

To get started, add an include for the relevant header, arm_sve.h, after the existing includes:

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include "vol.h"
#include <time.h>
#include <arm_sve.h>

#ifndef __aarch64__
        printf("Wrong architecture- written for aarch64 only.\n");

Next, in line with SVE2, I changed the duplicate instruction to target a z register, and made the same change to the sqrdmulh operands:

__asm__ ("dup z1.h,%w0"::"r"(vol_int)); //duplicate vol_int into z1.h
...
"sqrdmulh z0.h, z0.h, z1.h      \n\t"

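Note that the loop still loads and stores through q0, so each iteration continues to move 128 bits (eight samples) even when the z registers are wider. A vector-length-agnostic alternative, closer in spirit to vol5.c, could use the ACLE intrinsics from arm_sve.h with a predicate to handle the tail. The sketch below is an assumption about how that might look, not the author's code; scale_samples is a hypothetical name, and the scalar branch exists only so the file also compiles on machines without SVE2.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_SVE2)
#include <arm_sve.h>
#endif

/* Hypothetical function: out[i] = in[i] scaled by the Q15 factor vol. */
void scale_samples(const int16_t *in, int16_t *out, size_t n, int16_t vol)
{
#if defined(__ARM_FEATURE_SVE2)
    svint16_t vvol = svdup_n_s16(vol);          /* broadcast, like dup z1.h */
    for (size_t i = 0; i < n; i += svcnth()) {  /* svcnth(): 16-bit lanes per vector */
        svbool_t pg = svwhilelt_b16_u64(i, n);  /* lanes with i + lane < n are active */
        svint16_t v = svld1_s16(pg, in + i);    /* inactive lanes load as zero */
        v = svqrdmulh_s16(v, vvol);             /* same operation as sqrdmulh z0.h */
        svst1_s16(pg, out + i, v);              /* only active lanes are stored */
    }
#else
    /* Portable fallback with the same per-sample arithmetic. */
    for (size_t i = 0; i < n; i++) {
        int64_t r = 2 * (int64_t)in[i] * vol + (1 << 15);
        r = r >= 0 ? r / 65536 : -((-r + 65535) / 65536);  /* floor(r / 2^16) */
        if (r > INT16_MAX)
            r = INT16_MAX;                                  /* saturate */
        out[i] = (int16_t)r;
    }
#endif
}
```

Because the predicate masks off lanes past the end of the array, this version does not require SAMPLES to be a multiple of the vector length, unlike the fixed 16-byte stride of the asm loop.
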
Next, the makefile used to build the program has to be changed so that the compiler enables SVE2:

vol4:    vol4.c vol_createsample.o vol.h
         gcc ${CCOPTS} vol4.c -march=armv8-a+sve2 vol_createsample.o -o vol4
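
The -march=armv8-a+sve2 flag only permits the compiler to emit SVE2 instructions; in vol4.c the vector instructions are already spelled out in the asm block. Auto-vectorization in the usual sense applies to plain C loops, which GCC may turn into SVE2 code when optimization is enabled (for example with -O3). A minimal sketch of such a loop, assuming the same Q15 volume factor and using the hypothetical name scale_plain:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical function: a plain C scaling loop with no intrinsics or asm.
   restrict promises that in[] and out[] do not overlap, which makes the
   loop easier for the compiler to vectorize. */
void scale_plain(const int16_t *restrict in, int16_t *restrict out,
                 size_t n, int16_t vol_q15)
{
    for (size_t i = 0; i < n; i++) {
        /* Widen to 32 bits, multiply, then divide out the 15 fractional
           bits (division truncates toward zero, unlike sqrdmulh's rounding). */
        out[i] = (int16_t)(((int32_t)in[i] * vol_q15) / 32768);
    }
}
```

Building with something like gcc -O3 -march=armv8-a+sve2 -fopt-info-vec should report whether this loop was vectorized; whether it actually is depends on the compiler version and options.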

Finally, the program has to be run under qemu-aarch64 so that hardware capable of executing SVE2 is emulated, since the real thing isn't available to us yet. I ran it with the following command to confirm that it works as intended:

qemu-aarch64 ./vol4
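
Since the binary only works where SVE2 is present, whether emulated or real, it can help to check for it at startup. The sketch below assumes a Linux system, where the kernel reports CPU features through getauxval; cpu_has_sve2 is a hypothetical helper name, and on other platforms the function simply returns 0.

```c
#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>     /* getauxval, AT_HWCAP2 */
#include <asm/hwcap.h>    /* HWCAP2_SVE2 on recent kernel headers */
#endif

/* Hypothetical helper: returns 1 if the kernel reports SVE2, else 0. */
int cpu_has_sve2(void)
{
#if defined(__linux__) && defined(__aarch64__) && defined(HWCAP2_SVE2)
    return (getauxval(AT_HWCAP2) & HWCAP2_SVE2) != 0;
#else
    return 0;   /* not AArch64 Linux, or headers too old to know about SVE2 */
#endif
}
```

Calling this at the top of main() and exiting with a message would give a clearer failure than an illegal-instruction crash when the program is started without SVE2 support.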

This was a quick exploration of using auto-vectorization to bring SVE2 into a program. Enjoy!

Reference

Original post: https://dev.to/gusmccallum/optimizing-a-program-through-sve2-auto-vectorization-49o9
