Algorithm Selection with Inline Assembly (part 2)

This is the second part of Algorithm Selection with Inline Assembly. We are going to change the code to use SVE2 instructions.

Quick note: the Armv9 Scalable Vector Extensions version 2 (SVE2) provide a variable-width SIMD capability for AArch64 systems.
For those who don't remember, we will be working on the AArch64 machine (the israel server).
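
Because "variable-width" means the vector length is not fixed at compile time, here is a minimal sketch of how a program could ask for it at run time. It assumes the ACLE intrinsic svcntb() from <arm_sve.h> and reports 0 when the build has no SVE. The helper names are hypothetical and not part of the lab code.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* Hypothetical helper: SVE vector length in bits, or 0 when built without SVE. */
static unsigned sve_vector_bits(void) {
#if defined(__ARM_FEATURE_SVE)
    unsigned bits = (unsigned)svcntb() * 8u;  /* svcntb() = bytes per vector */
#else
    unsigned bits = 0u;                       /* this build has no SVE */
#endif
    assert(bits % 128u == 0u);  /* SVE vector lengths are multiples of 128 bits */
    return bits;
}

/* Hypothetical helper: how many int16_t samples fit in one vector. */
static unsigned sve_int16_lanes(void) {
    return sve_vector_bits() / 16u;
}
```

On a 128-bit implementation this would report 8 lanes, the same as the v0.8h layout used by the existing code, while wider implementations would process more samples per instruction.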

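The goals of this lab are:

Create a new version of the volume scaling code from the Algorithm Selection Lab that uses SVE2 instructions.

Then analyze the disassembly of the relevant part of the binary to prove that the code uses SVE2 instructions.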

Let's start:

I started by changing the Makefile so that the files are compiled with a different command.

gcc -march=armv8-a+sve2 ...

With GCC version 11, you need -O3 to invoke the autovectorizer.

gcc -O3 -march=armv8-a+sve2 ...
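
To give the autovectorizer something to work with, here is a rough sketch of the kind of plain C loop it can turn into vector code under these flags. The function name scale_samples and the Q15 shift are my own illustration, not the lab's vol code, and whether SVE2 instructions actually come out depends on the compiler version.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical plain-C scaler. "restrict" promises the compiler that
 * in[] and out[] do not overlap, which makes vectorizing easier. */
void scale_samples(const int16_t *restrict in, int16_t *restrict out,
                   size_t n, int16_t vol_int) {
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        /* Q15 fixed-point multiply: widen to 32 bits, then drop 15 bits.
         * Assumes >> on a negative value is an arithmetic shift (true for gcc). */
        out[i] = (int16_t)(((int32_t)in[i] * vol_int) >> 15);
    }
}
```

Calling scale_samples(in, out, SAMPLES, vol_int) and then checking the disassembly for z registers is one way to see whether the compiler vectorized the loop.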

The next step is to add SVE instructions to the C code and check how long the new function takes to run.

Here is the current code for vol4.

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include "vol.h"
#include <arm_sve.h>

int main() {

#ifndef __aarch64__
    printf("Wrong architecture - written for aarch64 only.\n");
#else


    // these variables will also be accessed by our assembler code
    int16_t*    in_cursor;       // input cursor
    int16_t*    out_cursor;       // output cursor
    int16_t     vol_int;        // volume as int16_t

    int16_t*    limit;         // end of input array

    int       x;           // array iterator
    int       ttl=0 ;         // array total

// ---- Create in[] and out[] arrays
    int16_t*    in;
    int16_t*    out;
    in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
    out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
    vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]


    // set vol_int to fixed-point representation of the volume factor
    // Q: should we use 32767 or 32768 in next line? why?
    vol_int = (int16_t)(VOLUME/100.0 * 32767.0);

    // Q: what is the purpose of these next two lines?
    in_cursor = in;
    out_cursor = out;
    limit = in + SAMPLES;

    // Q: what does it mean to "duplicate" values in the next line?
    __asm__ ("dup v1.8h,%w0"::"r"(vol_int)); // duplicate vol_int into v1.8h

    while ( in_cursor < limit ) {
        __asm__ (
            "ldr q0, [%[in_cursor]], #16  \n\t"
            // load eight samples into q0 (same as v0.8h)
            // from [in_cursor]
            // post-increment in_cursor by 16 bytes
            // and store back into the pointer register

            "sqrdmulh v0.8h, v0.8h, v1.8h  \n\t"
            // with 32 signed integer output,
            // multiply each lane in v0 * v1 * 2
            // saturate results
            // store upper 16 bits of results into
            // the corresponding lane in v0

            "str q0, [%[out_cursor]],#16      \n\t"
            // store eight samples to [out_cursor]
            // post-increment out_cursor by 16 bytes
            // and store back into the pointer register

            // Q: What do these next three lines do?
            : [in_cursor]"+r"(in_cursor), [out_cursor]"+r"(out_cursor)
            : "r"(in_cursor),"r"(out_cursor)
            : "memory"
            );
    }

// --------------------------------------------------------------------

    for (x = 0; x < SAMPLES; x++) {
        ttl=(ttl+out[x])%1000;
    }

    // Q: are the results usable? are they correct?
    printf("Result: %d\n", ttl);

    return 0;

#endif
}

Note: I am testing the code with 5M samples.
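
The comments on sqrdmulh in the listing are quite compact, so here is a scalar sketch of what I understand one lane to compute: double the product, round, keep the upper 16 bits, and saturate. The name sqrdmulh_lane is hypothetical. This is a model for checking results, not code from the lab.

```c
#include <assert.h>
#include <stdint.h>

/* The model relies on >> being an arithmetic shift for negative values. */
static_assert((-1 >> 1) == -1, "arithmetic right shift assumed");

/* Hypothetical scalar model of one 16-bit SQRDMULH lane. */
static int16_t sqrdmulh_lane(int16_t a, int16_t b) {
    int64_t doubled = 2 * (int64_t)a * (int64_t)b; /* a * b * 2 */
    int64_t rounded = doubled + 0x8000;            /* add half of 2^16 to round */
    int64_t high    = rounded >> 16;               /* keep the upper 16 bits */
    if (high > INT16_MAX) high = INT16_MAX;        /* saturate, e.g. -32768 * -32768 */
    if (high < INT16_MIN) high = INT16_MIN;
    return (int16_t)high;
}
```

With vol_int at 16384 (half volume in this fixed-point format), a sample of 1000 comes out as 500, and the one overflowing case, -32768 times -32768, is clamped to 32767.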

How do we add SVE2 instructions?

We need to know more about SVE2. The lab instructions give the following links to the Arm developer documentation:
Arm Armv9-A A64 Instruction Set Architecture - https://developer.arm.com/documentation/ddi0602/2021-12/
Introduction to SVE2 - https://developer.arm.com/documentation/102340/0001/?lang=en
Intrinsics - Arm C Language Extensions for SVE (ACLE) - https://developer.arm.com/documentation/100987/latest
SVE Coding Considerations with Arm Compiler - this document is specific to Arm's compiler, but most of it applies to other compilers, including gcc. - https://developer.arm.com/documentation/100748/0616/SVE-Coding-Considerations-with-Arm-Compiler

After reading these, I started implementing the SVE2 instructions.
I decided to work with vol4.

Let's include the header #include <arm_sve.h>. Our Makefile already has the new instructions for vol4.
I have:

 gcc -O3 -march=armv8-a+sve2 ${CCOPTS} vol4.c vol_createsample.o -o vol4
For all the others I decided to experiment:
gcc ${CCOPTS} vol1/2/3/5.c -march=armv8-a+sve2 vol_createsample.o -Ofast -o vol1/2/3/5
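
With <arm_sve.h> included and the sve2 flags in place, the scaling loop could also be written with ACLE intrinsics instead of assembly. The following is a minimal sketch under that assumption, not the code I submitted: the function name is hypothetical, and a scalar fallback is compiled in when SVE2 is not enabled so the file still builds elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_FEATURE_SVE2)
#include <arm_sve.h>
#else
/* Fallback only: scalar model of one SQRDMULH lane. */
static int16_t qrdmulh_scalar(int16_t a, int16_t b) {
    int64_t r = (2 * (int64_t)a * b + 0x8000) >> 16; /* arithmetic shift assumed */
    return (int16_t)(r > INT16_MAX ? INT16_MAX : r);  /* only +overflow can occur */
}
#endif

/* Hypothetical intrinsics version of the volume scaling loop. */
void scale_sve2_intrinsics(const int16_t *in, int16_t *out,
                           size_t n, int16_t vol_int) {
    assert(in != NULL && out != NULL);
#if defined(__ARM_FEATURE_SVE2)
    /* svcnth() = number of 16-bit lanes in one vector on this machine */
    for (uint64_t i = 0; i < n; i += svcnth()) {
        /* predicate is true for lanes i..n-1, so the tail needs no extra loop */
        svbool_t pg = svwhilelt_b16_u64(i, (uint64_t)n);
        svint16_t v = svld1_s16(pg, in + i);    /* load active lanes */
        v = svqrdmulh_n_s16(v, vol_int);        /* SQRDMULH by a scalar */
        svst1_s16(pg, out + i, v);              /* store active lanes only */
    }
#else
    for (size_t i = 0; i < n; i++) {
        out[i] = qrdmulh_scalar(in[i], vol_int);
    }
#endif
}
```

The predicate from svwhilelt_b16_u64 is what lets the same loop handle sample counts that are not a multiple of the vector length.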

I also decided to change the number of samples.

To run any of the vol programs built with SVE2 instructions, you need to run them through qemu-aarch64.

I also updated the registers as the SVE instructions require.
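
To illustrate what updating the registers can look like, here is a hedged sketch of the vol4 loop rewritten with SVE2 inline assembly: z registers replace v0/v1, and a predicate register p0 controls which lanes are active. This is my own illustration under those assumptions, not the exact code from the lab. The function name is hypothetical, and a scalar fallback is used when the build does not enable SVE2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fallback only: scalar model of one SQRDMULH lane. */
static int16_t qrdmulh_ref(int16_t a, int16_t b) {
    int64_t r = (2 * (int64_t)a * b + 0x8000) >> 16; /* arithmetic shift assumed */
    return (int16_t)(r > INT16_MAX ? INT16_MAX : r);
}

/* Hypothetical SVE2 inline-assembly version of the scaling loop. */
void scale_sve2_asm(const int16_t *in, int16_t *out, size_t n, int16_t vol_int) {
    assert(in != NULL && out != NULL);
#if defined(__aarch64__) && defined(__ARM_FEATURE_SVE2)
    uint64_t i = 0;   /* index of the first sample in the current vector */
    (void)qrdmulh_ref; /* not needed on this path */
    __asm__ volatile (
        "dup      z1.h, %w[vol]                      \n\t" /* vol_int in every lane */
        "whilelo  p0.h, %[i], %[n]                   \n\t" /* lanes where i+lane < n */
        "1:                                          \n\t"
        "ld1h     {z0.h}, p0/z, [%[in], %[i], lsl #1]\n\t" /* load active samples */
        "sqrdmulh z0.h, z0.h, z1.h                   \n\t" /* scale every lane */
        "st1h     {z0.h}, p0, [%[out], %[i], lsl #1] \n\t" /* store active samples */
        "inch     %[i]                               \n\t" /* i += lanes per vector */
        "whilelo  p0.h, %[i], %[n]                   \n\t"
        "b.mi     1b                                 \n\t" /* loop while first lane active */
        : [i] "+r"(i)
        : [in] "r"(in), [out] "r"(out), [n] "r"((uint64_t)n), [vol] "r"(vol_int)
        : "z0", "z1", "p0", "cc", "memory");
#else
    for (size_t i = 0; i < n; i++) {
        out[i] = qrdmulh_ref(in[i], vol_int);
    }
#endif
}
```

Unlike the fixed ldr q0 / str q0 pair, which always moves eight samples, this loop never hard-codes the vector width: inch advances by however many 16-bit lanes the hardware (or qemu-aarch64) provides.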

While working on this lab I ran into one error, which turned out to be caused by the number of samples.

Conclusion

⚠️ Computer Architecture blog post: link


P.S. This post was written for my Software Portability and Optimization class, Lab 6.

Reference

Original post: Algorithm Selection with Inline Assembly (part 2), https://dev.to/serputov/algorithm-selection-with-inline-assemblypart2-4ncc
