ARM Instruction Optimization - Coding for NEON - Part 3: Matrix Multiplication
Matrices
In this post, we will look at how to efficiently multiply four-by-four matrices together, an operation frequently used in the world of 3D graphics. We will assume that the matrices are stored in memory in column-major order, which is the format used by OpenGL ES.
Algorithm
We start by examining the matrix multiply operation in detail, expanding the calculation and identifying sub-operations that can be implemented using NEON instructions.
Notice that in the diagram, we multiply each column of the first matrix (in red) by a corresponding single value in the second matrix (blue) then add together the results for each element to give a column of results. This operation is repeated for each of the four columns in the result matrix.
If each column is a vector in a NEON register, we can use the vector-by-scalar multiplication instruction to calculate each result column efficiently. The sub-operation highlighted in the diagram can be implemented using this instruction. We must also add the results together for each element of the column, which we do using the accumulating version of the same instruction.
As we are operating on the columns of the first matrix, and producing a column of results, reading and writing elements to and from memory is a linear operation, and requires no interleaving load or store instructions.
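The column-by-scalar scheme described above can be sketched in plain C as a scalar reference. This is only an illustrative model, not the NEON implementation: the function name mat4_mul_colmajor and the index convention (element at row r, column c stored at index c*4 + r) are assumptions made for this sketch.

```c
#include <stddef.h>

/* Scalar reference for result = m0 * m1 on 4x4 column-major matrices.
 * Assumed layout: element (row r, column c) is stored at index c*4 + r.
 * result must not overlap m0 or m1. */
void mat4_mul_colmajor(float result[16], const float m0[16], const float m1[16])
{
    for (size_t c = 0; c < 4; ++c) {              /* one result column per pass */
        float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};

        for (size_t k = 0; k < 4; ++k) {
            float scalar = m1[c * 4 + k];         /* single value from matrix 1 */

            for (size_t r = 0; r < 4; ++r) {      /* whole column k of matrix 0 */
                acc[r] += m0[k * 4 + r] * scalar;
            }
        }

        for (size_t r = 0; r < 4; ++r) {
            result[c * 4 + r] = acc[r];           /* linear store of the column */
        }
    }
}
```

The innermost loop corresponds to one vector-by-scalar multiply, and the loop over k corresponds to the accumulation of four such products into one result column.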
Code
Floating Point
First, we will look at an implementation that multiplies single precision floating point matrices.
Begin by loading the matrices from memory into NEON registers. The matrices we are multiplying use column-major order, so columns of the matrix are stored linearly in memory. A column can be loaded into NEON registers using VLD1, and written back to memory using VST1.
vld1.32 {d16-d19}, [r1]! @ load first eight elements of matrix 0
vld1.32 {d20-d23}, [r1]! @ load second eight elements of matrix 0
vld1.32 {d0-d3}, [r2]! @ load first eight elements of matrix 1
vld1.32 {d4-d7}, [r2]! @ load second eight elements of matrix 1
As NEON has 32 64-bit registers, we can load all of the elements from both input matrices into registers, and still have registers left over for use as accumulators. Here, d16 to d23 hold 16 elements from the first matrix, and d0 to d7 hold 16 elements from the second.
An Aside: D and Q registers
Most NEON instructions can use the register bank in two ways:
As 32 Double-word registers, 64-bits in size, named d0 to d31.
As 16 Quad-word registers, 128-bits in size, named q0 to q15.
These registers are aliased so that the data in a Q register is the same as that in its two corresponding D registers. For example, q0 is aliased to d0 and d1, and the same data is accessible through either register type. In C terms, this is very similar to a union.
For the floating point matrix multiplication example, we will use Q registers frequently, as we are handling columns of four 32-bit floating point numbers, which fit into a single 128-bit Q register.
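To make the union analogy concrete, here is a minimal C sketch of one Q register viewed either as four floats or as two D halves of two floats each. The type name neon_qreg_f32 and the helper qreg_lane_via_d are hypothetical and exist only for this illustration; they are not part of any NEON API.

```c
/* Hypothetical model of one 128-bit Q register holding four floats.
 * Both members occupy the same storage, like q0 and its halves d0/d1. */
typedef union {
    float q[4];      /* view as one Q register: lanes 0..3          */
    float d[2][2];   /* view as two D registers: d[0] and d[1],     */
                     /* each holding two 32-bit lanes               */
} neon_qreg_f32;

/* Read lane 0..3 of the Q view through the D view.
 * Lane 3 of the Q view is lane 1 of the upper D half. */
float qreg_lane_via_d(const neon_qreg_f32 *reg, int lane)
{
    return reg->d[lane / 2][lane % 2];
}
```

In this model, lane 3 of the Q view and lane 1 of the second D half name the same float, which is the relationship between q0[3] and d1[1].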
Back to the Code
We can calculate a column of results using just four NEON multiply instructions:
vmul.f32 q12, q8, d0[0] @ multiply col element 0 by matrix col 0
vmla.f32 q12, q9, d0[1] @ multiply-acc col element 1 by matrix col 1
vmla.f32 q12, q10, d1[0] @ multiply-acc col element 2 by matrix col 2
vmla.f32 q12, q11, d1[1] @ multiply-acc col element 3 by matrix col 3
Here, the first instruction implements the operation highlighted in the diagram – x0, x1, x2 and x3 (in register q8) are each multiplied by y0 (element 0 in d0), and stored in q12.
Subsequent instructions operate on the other columns of the first matrix, multiplying by corresponding elements of the first column of the second matrix. Results are accumulated into q12 to give the first column of values for the result matrix.
Notice that the scalar used in the multiply instructions refers to D registers; although q0[3] should be the same value as d1[1], and using it would perhaps make more sense here, the GNU assembler I'm using does not accept that format. I have to specify a scalar from a D register. Your assembler may be better.
If we only needed to calculate a matrix-by-vector multiplication (another common operation in 3D graphics), the operation would now be complete, and the result vector could be stored to memory. However, to complete the matrix-by-matrix multiplication, we must execute three more iterations, using values y4 to yF in registers q1 to q3.
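As a scalar cross-check for the matrix-by-vector case, the four instructions can be modelled in C roughly as follows. The function name mat4_mul_vec is an assumption for this sketch, and the matrix is again taken to be column-major with element (r, c) at index c*4 + r.

```c
/* Scalar model of one result column: out = m * v.
 * m is a 4x4 column-major matrix, v a 4-element vector.
 * out must not overlap m or v. */
void mat4_mul_vec(float out[4], const float m[16], const float v[4])
{
    /* Models vmul.f32: column 0 of m times element 0 of v. */
    for (int r = 0; r < 4; ++r) {
        out[r] = m[r] * v[0];
    }

    /* Models the three vmla.f32 instructions: columns 1..3 of m,
     * each scaled by the matching element of v and accumulated. */
    for (int k = 1; k < 4; ++k) {
        for (int r = 0; r < 4; ++r) {
            out[r] += m[k * 4 + r] * v[k];
        }
    }
}
```

Calling this model four times, once per column of the second matrix, gives the full matrix-by-matrix product computed by the assembly.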
If we create a macro for the instructions above, we can simplify our code significantly.
.macro mul_col_f32 res_q, col0_d, col1_d
vmul.f32 \res_q, q8, \col0_d[0] @ multiply col element 0 by matrix col 0
vmla.f32 \res_q, q9, \col0_d[1] @ multiply-acc col element 1 by matrix col 1
vmla.f32 \res_q, q10, \col1_d[0] @ multiply-acc col element 2 by matrix col 2
vmla.f32 \res_q, q11, \col1_d[1] @ multiply-acc col element 3 by matrix col 3
.endm
The implementation of a four-by-four floating point matrix multiply now looks like this.
vld1.32 {d16-d19}, [r1]! @ load first eight elements of matrix 0
vld1.32 {d20-d23}, [r1]! @ load second eight elements of matrix 0
vld1.32 {d0-d3}, [r2]! @ load first eight elements of matrix 1
vld1.32 {d4-d7}, [r2]! @ load second eight elements of matrix 1
mul_col_f32 q12, d0, d1 @ matrix 0 * matrix 1 col 0
mul_col_f32 q13, d2, d3 @ matrix 0 * matrix 1 col 1
mul_col_f32 q14, d4, d5 @ matrix 0 * matrix 1 col 2
mul_col_f32 q15, d6, d7 @ matrix 0 * matrix 1 col 3
vst1.32 {d24-d27}, [r0]! @ store first eight elements of result
vst1.32 {d28-d31}, [r0]! @ store second eight elements of result
Fixed Point
Using fixed point arithmetic for calculations is often faster than floating point: it requires less memory bandwidth to read and write values that use fewer bits, and multiplication of integer values is generally quicker than the same operation applied to floating point numbers.
However, when using fixed point arithmetic, you must choose the representation carefully to avoid overflow or saturation, whilst preserving the degree of precision in the results that your application requires.
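As a rough illustration of that trade-off, the sketch below converts between float and a signed 16-bit format with 14 fractional bits (Q1.14, the format used in the example that follows). The helper names float_to_q14 and q14_to_float and the round-to-nearest, saturating conversion policy are assumptions of this sketch, not something the original code prescribes.

```c
#include <stdint.h>

#define Q14_ONE 16384.0f   /* 1.0 in Q1.14 is 2^14 */

/* Convert a float to Q1.14, rounding to nearest and saturating.
 * Representable range is [-2.0, 2.0 - 1/16384] in steps of 1/16384. */
int16_t float_to_q14(float x)
{
    float scaled = x * Q14_ONE;

    /* Add half a step away from zero so the truncating cast rounds. */
    scaled += (scaled >= 0.0f) ? 0.5f : -0.5f;

    if (scaled >= 32768.0f) {
        return INT16_MAX;      /* too large: saturate instead of wrapping */
    }
    if (scaled <= -32769.0f) {
        return INT16_MIN;      /* too small: saturate instead of wrapping */
    }
    return (int16_t)scaled;    /* cast truncates toward zero */
}

/* Convert a Q1.14 value back to float (always exact). */
float q14_to_float(int16_t q)
{
    return (float)q / Q14_ONE;
}
```

With this format, 2.0 cannot be represented and saturates to the largest value just below it, which is the kind of limit to check against the range of values an application actually produces.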
Implementing a matrix multiply using fixed point values is very similar to floating point. In this example, we will use Q1.14 fixed-point format, but the operations required are similar for other formats, and may only require a change to the final shift applied to the accumulator. Here is the macro:
.macro mul_col_s16 res_d, col_d
vmull.s16 q12, d16, \col_d[0] @ multiply col element 0 by matrix col 0
vmlal.s16 q12, d17, \col_d[1] @ multiply-acc col element 1 by matrix col 1
vmlal.s16 q12, d18, \col_d[2] @ multiply-acc col element 2 by matrix col 2
vmlal.s16 q12, d19, \col_d[3] @ multiply-acc col element 3 by matrix col 3
vqrshrn.s32 \res_d, q12, #14 @ shift right and narrow accumulator into
@ Q1.14 fixed point format, with saturation
.endm
Comparing it to the macro used in the floating point version, you will see that the major differences are:
Values are now 16-bit rather than 32-bit, so we can use D registers to hold four inputs.
The result of multiplying two 16-bit numbers is a 32-bit number. We use VMULL and VMLAL, because they store their results in Q registers, preserving all of the bits of the result using double-size elements.
The final result must be 16 bits, but the accumulators are 32-bit. We obtain a 16-bit result using VQRSHRN, a vector, saturating, rounding, narrowing shift right. This adds the correct rounding value to each element, shifts it right, and saturates the result to the new, narrower element size.
The reduction from 32 bits to 16 bits per element also has an effect on the memory accesses; the data can be loaded and stored using fewer instructions.
The code for a fixed point matrix multiply looks like this:
vld1.16 {d16-d19}, [r1] @ load sixteen elements of matrix 0
vld1.16 {d0-d3}, [r2] @ load sixteen elements of matrix 1
mul_col_s16 d4, d0 @ matrix 0 * matrix 1 col 0
mul_col_s16 d5, d1 @ matrix 0 * matrix 1 col 1
mul_col_s16 d6, d2 @ matrix 0 * matrix 1 col 2
mul_col_s16 d7, d3 @ matrix 0 * matrix 1 col 3
vst1.16 {d4-d7}, [r0] @ store sixteen elements of result
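The widening multiply-accumulate followed by a rounding, saturating, narrowing shift can be modelled per element in C as below. This is a sketch under stated assumptions: the names q14_round_narrow and mat4_mul_col_q14 are hypothetical, the accumulator is kept in 64 bits so the C code has no signed overflow (the NEON accumulator is 32 bits wide), and right shift of a negative value is assumed to be arithmetic, as it is on common compilers.

```c
#include <stdint.h>

/* Model of the vqrshrn.s32 #14 step on one accumulator element:
 * add the rounding constant 2^13, shift right by 14, saturate to int16.
 * Assumes >> on a negative value is an arithmetic shift. */
int16_t q14_round_narrow(int64_t acc)
{
    int64_t rounded = (acc + (1 << 13)) >> 14;

    if (rounded > INT16_MAX) {
        return INT16_MAX;
    }
    if (rounded < INT16_MIN) {
        return INT16_MIN;
    }
    return (int16_t)rounded;
}

/* Model of one result column: out = m0 * col in Q1.14.
 * m0 is a 4x4 column-major matrix, col is one column of matrix 1. */
void mat4_mul_col_q14(int16_t out[4], const int16_t m0[16], const int16_t col[4])
{
    for (int r = 0; r < 4; ++r) {
        int64_t acc = 0;

        /* Models vmull.s16 / vmlal.s16: 16x16 -> 32-bit products.
         * The sum is assumed to fit the 32-bit NEON accumulator. */
        for (int k = 0; k < 4; ++k) {
            acc += (int32_t)m0[k * 4 + r] * (int32_t)col[k];
        }

        out[r] = q14_round_narrow(acc);
    }
}
```

For a format with a different number of fractional bits, only the shift amount and the matching rounding constant in q14_round_narrow would change.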
Scheduling
We will deal with the details of scheduling in a future post, but for now, it is worth seeing the effect of improved instruction scheduling on this code.
In the macro, adjacent multiply instructions write to the same register, so the NEON pipeline must wait for each multiply to complete before it can start the next instruction.
If we take the instructions out of the macro and rearrange them, we can separate those that are dependent using other instructions that are not dependent. These instructions can be issued whilst the others complete in the background. In this case, we rearrange the code to space out accesses to the accumulator registers.
vmul.f32 q12, q8, d0[0] @ rslt col0 = (mat0 col0) * (mat1 col0 elt0)
vmul.f32 q13, q8, d2[0] @ rslt col1 = (mat0 col0) * (mat1 col1 elt0)
vmul.f32 q14, q8, d4[0] @ rslt col2 = (mat0 col0) * (mat1 col2 elt0)
vmul.f32 q15, q8, d6[0] @ rslt col3 = (mat0 col0) * (mat1 col3 elt0)
vmla.f32 q12, q9, d0[1] @ rslt col0 += (mat0 col1) * (mat1 col0 elt1)
vmla.f32 q13, q9, d2[1] @ rslt col1 += (mat0 col1) * (mat1 col1 elt1)
...
...
Using this version, matrix multiply performance almost doubles on a Cortex-A8 based system.
You can find details for instruction timings and latencies from the Technical Reference Manual for your Cortex core. With potential performance improvements like those described above, it is well worth spending some time familiarizing yourself with it.
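The reordering idea can be pictured in C as a loop interchange: work through the four result columns for each step before moving to the next step, so that consecutive operations touch different accumulators. This only illustrates the order of operations. A C compiler is free to reschedule the loops, so it is not a way to control NEON scheduling, and the function name mat4_mul_interleaved is an assumption of this sketch.

```c
/* Scalar model of the interleaved ordering for result = m0 * m1
 * (4x4, column-major, element (r, c) at index c*4 + r).
 * result must not overlap m0 or m1. */
void mat4_mul_interleaved(float result[16], const float m0[16], const float m1[16])
{
    /* Step 0: four independent multiplies, one per result column
     * (models the group of four vmul.f32 instructions). */
    for (int c = 0; c < 4; ++c) {
        for (int r = 0; r < 4; ++r) {
            result[c * 4 + r] = m0[r] * m1[c * 4];
        }
    }

    /* Steps 1..3: each step updates all four result columns before
     * any column is touched again (models the vmla.f32 groups). */
    for (int k = 1; k < 4; ++k) {
        for (int c = 0; c < 4; ++c) {
            for (int r = 0; r < 4; ++r) {
                result[c * 4 + r] += m0[k * 4 + r] * m1[c * 4 + k];
            }
        }
    }
}
```

Each result element still receives its four products in the same order as before, so the numerical result is unchanged; only the distance between dependent operations grows.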
Source
The code for the two functions described above can be found here: matrix_asm_sched.s
From: http://blogs.arm.com/software-enablement/241-coding-for-neon-part-3-matrix-multiplication/
You may freely share or copy this text, but please keep this document's URL as a reference URL.
Licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.