At the VideoLAN Dev Days 2024, I gave a presentation on hijacking AI CPU extensions for multimedia. The illustrative example was SpacemiT's Integrated Matrix Extension, a proprietary extension for the RISC-V ISA implemented in some of the SpacemiT X60 processor cores. In keeping with RISC-V conventions, I refer to that extension as XSTIME, where X is the prefix for vendor-specific extensions.
To spin it nicely, it was a negative-result publication. But though XSTIME did not help me accelerate the inverse Discrete Cosine Transform of the Advanced Video Coding compression algorithm (better known as H.264), it might still be interesting to other people for other use cases.
SpacemiT did publish a PDF detailing how the extension is supposed to work, though it is a bit confusing and misses a few key details. I still recommend that you check it out (if it is still available), as this article only covers the simpler half of the instruction set extension.
Open-source multimedia RISC-V development, code review and maintenance in the FFmpeg multimedia framework and the VLC media player are currently almost at a standstill, as funding has dried up. I cannot realistically keep up entirely in my free time.
To put it bluntly, I need a RISC-V Vector expert to help with FFmpeg code reviews and/or a new sponsor to resume FFmpeg RISC-V work.
The RISC-V ecosystem follows a common terminology for so-called "AI" extensions. In reality, these are extensions for finite linear algebra: scalar products or matrix multiplications that just so happen to be intended primarily for Deep Learning workloads.
If you are curious, SiFive posted a YouTube video summarising the terminology. But in short, an Integrated Matrix Extension (IME) multiplies two matrices stored in vector registers and returns the result in a third vector register.
Note that the rest of this article assumes that the reader is familiar with the RISC-V Vector (RVV) extension, which XSTIME is based on.
XSTIME defines 4 sets of instructions. As it is the only set that I experimented with, and for the sake of brevity and simplicity, let's see how the first set works.
The vmadot family of instructions
performs a widening integer matrix multiply-and-accumulate, or in short:

M(vd) += wide(M(vs1)) × wide(M(vs2))ᵀ
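To make the semantics concrete, here is a minimal Python reference model of the formula above. This is my own sketch, not code from SpacemiT's documentation; all names are mine, and signed 8-bit sources (plain vmadot behaviour) are assumed:

```python
def vmadot_ref(acc, a, b):
    """Reference model of vmadot: acc += wide(a) x wide(b)^T.

    acc: 4x4 list of 32-bit accumulators; a, b: 4xK lists of signed
    8-bit source elements, where K is 4 or 8 depending on vl.
    """
    cols = len(a[0])
    for i in range(len(a)):
        for j in range(len(b)):
            # Dot product of row i of a with row j of b (i.e. column j
            # of b transposed), accumulated into a 32-bit element.
            s = sum(a[i][k] * b[j][k] for k in range(cols))
            acc[i][j] = (acc[i][j] + s) & 0xFFFFFFFF  # wrap modulo 2**32
    return acc
```

With vl = 16 both sources are 4×4; with vl = 32 they are 4×8, and the accumulator is a 4×4 matrix either way.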
The destination/accumulator matrix is always made of
32-bit integers,
while the source matrices can theoretically use
4-bit, 8-bit or 16-bit integers as scalar elements,
as specified by the RVV selected element width (SEW)
in the current vector type system register, vtype.
In practice, the means to select 4-bit elements was not specified
by RVV at the time that the X60 came out.
It is presumably vtype.vsew = 0b111
but I did not test it.
16-bit elements yield systematic, reproducible arithmetic errors that render them essentially unusable. I can only guess that the vendor never actually tested them. (That is why my experiments with H.264 failed miserably.)
Selecting any other element width in vtype is undefined,
so for the rest of the article we assume that 8-bit elements are used.
The dimensions of the matrices are determined by the current
RISC-V vector length, vl,
which can be set with the usual RVV vsetvli or
vsetivli instructions.
The RVV vector register group multiplier (LMUL) must equal 1,
and vl must be a power of 2 no smaller than 16
(or actually 128 divided by SEW).
Since the X60 processor has 256-bit vectors, we have 2 possible vector lengths: either 16 or 32 8-bit elements. Those vector lengths correspond to source matrices of 4 rows and either 4 or 8 columns respectively. Accordingly, the destination/accumulator matrix is a 4×4 square matrix, and since that adds up to 4×4×32 = 512 bits, the destination operand has an effective group multiplier (EMUL) of 2.
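The arithmetic above can be double-checked with a short sketch (VLEN, SEW and the variable names are mine; VLEN = 256 is the X60's vector register width):

```python
VLEN = 256  # X60 vector register width in bits
SEW = 8     # element width assumed throughout this article

for vl in (16, 32):                 # the two legal vector lengths
    # vl must fit in one register (LMUL=1) and be at least 128/SEW.
    assert vl * SEW <= VLEN and vl >= 128 // SEW
    rows = 4                        # source matrices always have 4 rows
    cols = vl // rows               # hence 4 or 8 columns

# Destination is always 4x4 with 32-bit elements: 512 bits, EMUL=2.
dest_bits = 4 * 4 * 32
assert dest_bits == 512
assert dest_bits // VLEN == 2       # EMUL
```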
All matrices are notionally stored in row-major order
in their respective vector operands.
(Though given that matrix transposition reverses products,
i.e. (A × B)ᵀ = Bᵀ × Aᵀ,
we can also interpret the operands as column-major matrices and define
vmadot as transposing the first rather
than the second source operand.)
Concretely, the matrices are laid out in vectors as follows
(subscripts represent byte offsets):
| Source 4x4 matrix layout (8-bit elements) | |||
|---|---|---|---|
| vs0 | vs1 | vs2 | vs3 |
| vs4 | vs5 | vs6 | vs7 |
| vs8 | vs9 | vs10 | vs11 |
| vs12 | vs13 | vs14 | vs15 |
| Source 4x8 matrix layout (8-bit elements) | |||||||
|---|---|---|---|---|---|---|---|
| vs0 | vs1 | vs2 | vs3 | vs4 | vs5 | vs6 | vs7 |
| vs8 | vs9 | vs10 | vs11 | vs12 | vs13 | vs14 | vs15 |
| vs16 | vs17 | vs18 | vs19 | vs20 | vs21 | vs22 | vs23 |
| vs24 | vs25 | vs26 | vs27 | vs28 | vs29 | vs30 | vs31 |
| Destination 4x4 matrix layout (32-bit elements) | |||
|---|---|---|---|
| vd0-3 | vd4-7 | vd8-11 | vd12-15 |
| vd16-19 | vd20-23 | vd24-27 | vd28-31 |
| v(d+1)0-3 | v(d+1)4-7 | v(d+1)8-11 | v(d+1)12-15 |
| v(d+1)16-19 | v(d+1)20-23 | v(d+1)24-27 | v(d+1)28-31 |
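The tables above amount to plain row-major indexing, which can be captured in a small sketch (the helper names are mine) mapping element coordinates to byte offsets:

```python
def src_offset(i, j, cols):
    """Byte offset of 8-bit source element (i, j) within its vector."""
    return i * cols + j

def dst_offset(i, j):
    """(register index within the EMUL=2 group, byte offset) of 32-bit
    destination element (i, j), with 256-bit (32-byte) registers."""
    byte = (i * 4 + j) * 4          # row-major, 4 bytes per element
    return byte // 32, byte % 32

# Element (2, 1) of a 4x8 source sits at byte 17 in its vector.
assert src_offset(2, 1, 8) == 17
# Destination element (2, 0) lives in the second register of the group.
assert dst_offset(2, 0) == (1, 0)
```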
All XSTIME instructions are allocated under the RISC-V CUSTOM_1 opcode.
For convenience, this assembler file defines
4 macros to easily assemble the vmadot instruction
and its 3 variants:

- vmadot sign-extends vs1 and vs2.
- vmadotu zero-extends vs1 and vs2.
- vmadotsu sign-extends vs1 and zero-extends vs2.
- vmadotus zero-extends vs1 and sign-extends vs2.

Before using any of these instructions, configure the vector unit, e.g.:
```
vsetivli zero, 16, e8, m1, ta, ma
# ...
vmadot v2, v7, v9
```
This will sign-extend all 8-bit elements in v7
and v9 to 32 bits,
then calculate the matrix product of
v7 by the transposition of v9,
and destructively add the result to v2.
Mind that the destination vector number must be even (due to EMUL=2).
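Putting it together, here is a hedged Python model of what the example above computes, and of how the zero-extending variants differ. All function names here are mine, not SpacemiT's:

```python
def sext8(x):
    """Sign-extend a raw byte to a Python int."""
    return x - 256 if x >= 128 else x

def zext8(x):
    """Zero-extend a raw byte."""
    return x & 0xFF

def vmadot_model(acc, a, b, ext1=sext8, ext2=sext8, cols=4):
    """acc += ext1(A) x ext2(B)^T, with A and B stored row-major as
    flat byte lists (vl = 16, i.e. 4x4 matrices, as set by vsetivli)."""
    for i in range(4):
        for j in range(4):
            s = sum(ext1(a[i * cols + k]) * ext2(b[j * cols + k])
                    for k in range(cols))
            acc[i][j] = (acc[i][j] + s) & 0xFFFFFFFF  # 32-bit wrap
    return acc

# 0xFF reads as -1 when sign-extended, 255 when zero-extended.
a = [0xFF] + [0] * 15                # A[0][0] is the only non-zero element
b = [0x01] + [0] * 15
acc = [[0] * 4 for _ in range(4)]
vmadot_model(acc, a, b)              # both operands sign-extended
assert acc[0][0] == 0xFFFFFFFF       # -1, wrapped modulo 2**32
acc = [[0] * 4 for _ in range(4)]
vmadot_model(acc, a, b, ext1=zext8)  # zero-extend the first operand
assert acc[0][0] == 255
```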