SpacemiT Integrated Matrix Extension

At the VideoLAN Dev Days 2024, I gave a presentation on hijacking AI CPU extensions for multimedia. The illustrative example was SpacemiT's Integrated Matrix Extension, a proprietary extension for the RISC-V ISA implemented in some of the SpacemiT X60 processor cores. In keeping with RISC-V conventions, I refer to that extension as XSTIME, where X is the prefix for vendor-specific extensions.

It was, to spin it nicely, a negative result publication. But even though XSTIME did not help me accelerate the inverse Discrete Cosine Transform of the Advanced Video Coding compression algorithm (better known as H.264), it might still be interesting to other people for other use cases.

SpacemiT did publish a PDF detailing how the extension is supposed to work, though it is a bit confusing and misses a few key details. I still recommend that you check it out (if it is still available), as this article only covers the simpler half of the instruction set extension.


Not really but sort of advertising

Open-source RISC-V multimedia development, code review and maintenance in the FFmpeg multimedia framework and the VLC media player are currently almost at a standstill, as funding has dried up. I cannot realistically keep up entirely in my free time.

To put it bluntly, I need a RISC-V Vector expert to help with FFmpeg code reviews and/or a new sponsor to resume FFmpeg RISC-V work.


Integrated Matrix Extensions

The RISC-V ecosystem follows a common terminology for so-called "AI" extensions. In reality, those are extensions for finite linear algebra, scalar products or matrix multiplications, which just so happen to be intended primarily for Deep Learning workloads.

If you are curious, SiFive posted a YouTube video summarising the terminology. But in short an Integrated Matrix Extension (IME) multiplies two matrices stored in vector registers and returns the result in a third vector register.

Advisory note

Note that the rest of this article assumes that the reader is familiar with the RISC-V Vector (RVV) extension, which XSTIME is based on.

XSTIME

XSTIME defines 4 sets of instructions:

vmadot
integer matrix multiplication
vfmadot
floating-point matrix multiplication
vmadotn
integer sliding-window matrix multiplication
vfmadotn
floating-point sliding-window matrix multiplication

For the sake of brevity and simplicity, let's see how the first set works; it is also the only one that I experimented with.

vmadot

The vmadot family of instructions performs a widening integer matrix multiplication and accumulation, or in short:

MVD += wide(MVS1) * transpose(wide(MVS2))
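The formula can be modelled in portable C. This is a sketch of my understanding of the semantics, not vendor reference code; the function name and parameters are my own:

```c
#include <stdint.h>

/* Scalar model of vmadot (signed variant): vd, a 4x4 matrix of 32-bit
 * accumulators, gains the product of vs1 (a 4xK matrix of 8-bit
 * elements, row-major) with the transpose of vs2 (same shape),
 * where K = vl / 4. */
static void vmadot_model(int32_t vd[4][4], const int8_t *vs1,
                         const int8_t *vs2, int k)
{
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            for (int l = 0; l < k; l++)
                vd[i][j] += (int32_t)vs1[i * k + l]
                          * (int32_t)vs2[j * k + l];
}
```

Note that because vs2 is transposed, the inner loop walks both sources along their rows, which is what makes the instruction a batch of scalar (dot) products.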

Scalar types

The destination/accumulator matrix is always made of 32-bit integers, while the source matrices can theoretically use 4-bit, 8-bit or 16-bit integers as scalar elements, as specified by the RVV selected element bit-width (SEW) from the current vector type system register, vtype.

In practice, the means to select 4-bit elements was not specified by RVV at the time that the X60 came out. It is presumably vtype.vsew = 0b111 but I did not test it.

16-bit elements yield systematic, reproducible arithmetic errors that render them essentially unusable. I can only guess that they were never actually tested by the vendor. (That is why my experiments with H.264 failed miserably.)

Selecting any other element width in vtype is undefined, so for the rest of the article we assume that 8-bit elements are used.

Matrix dimensions

The dimensions of the matrices are determined by the current RISC-V vector length vl, which can be set with the usual RVV vsetvli or vsetivli instructions. The RVV register group multiplier (LMUL) must equal 1, and vl must be a power of two no smaller than 16 (or more precisely, 128 divided by SEW).

Since the X60 processor has 256-bit vectors, we have 2 possible vector lengths: either 16 or 32 8-bit elements. Those vector lengths correspond to source matrix dimensions of 4 rows and either 4 or 8 columns respectively. Accordingly, the destination/accumulator matrix is a 4x4 square matrix, and since that adds up to 4x4x32=512 bits, the destination operand has an effective group multiplier (EMUL) of 2.
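The constraints above can be condensed into a small validation helper. This is my own illustration, not part of any vendor API:

```c
#include <stdbool.h>

/* Derive the source matrix dimensions from vl, assuming 8-bit
 * elements and LMUL = 1: vl must be a power of two no smaller
 * than 16. Rows are always 4, so columns = vl / 4. */
static bool vmadot_dims(unsigned vl, unsigned *rows, unsigned *cols)
{
    if (vl < 16 || (vl & (vl - 1)) != 0)
        return false;
    *rows = 4;
    *cols = vl / 4;
    return true;
}
```

On the X60, only vl = 16 and vl = 32 are actually reachable, since the vector registers are 256 bits.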

Matrix layout

All matrices are notionally stored in row-major order in their respective vector operands. (Though given that transposition reverses the order of a matrix product, we can also interpret the operands as column-major matrices and define vmadot as transposing the first rather than the second source operand.) Concretely, the matrices are laid out in vectors as follows (bracketed numbers are byte offsets):

Source 4x4 matrix layout (8-bit elements):
  vs[0]  vs[1]  vs[2]  vs[3]
  vs[4]  vs[5]  vs[6]  vs[7]
  vs[8]  vs[9]  vs[10] vs[11]
  vs[12] vs[13] vs[14] vs[15]

Source 4x8 matrix layout (8-bit elements):
  vs[0]  vs[1]  vs[2]  vs[3]  vs[4]  vs[5]  vs[6]  vs[7]
  vs[8]  vs[9]  vs[10] vs[11] vs[12] vs[13] vs[14] vs[15]
  vs[16] vs[17] vs[18] vs[19] vs[20] vs[21] vs[22] vs[23]
  vs[24] vs[25] vs[26] vs[27] vs[28] vs[29] vs[30] vs[31]

Destination 4x4 matrix layout (32-bit elements, spanning vd and vd+1):
  vd[0-3]     vd[4-7]     vd[8-11]    vd[12-15]
  vd[16-19]   vd[20-23]   vd[24-27]   vd[28-31]
  vd+1[0-3]   vd+1[4-7]   vd+1[8-11]  vd+1[12-15]
  vd+1[16-19] vd+1[20-23] vd+1[24-27] vd+1[28-31]
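The destination layout boils down to a simple index computation, assuming 256-bit registers as on the X60. The helper below is my own illustration of that mapping:

```c
/* Locate destination element (i, j) within the EMUL=2 register group:
 * element (i, j) is the n-th 32-bit word of the group, n = 4*i + j.
 * Returns the register offset (0 for vd, 1 for vd+1) and writes the
 * byte offset of the element within that 256-bit register. */
static int vmadot_dest_locate(int i, int j, int *byte_off)
{
    int n = 4 * i + j;       /* linear 32-bit element index, row-major */
    *byte_off = (n % 8) * 4; /* 8 words of 4 bytes per 256-bit register */
    return n / 8;            /* 0 => vd, 1 => vd+1 */
}
```

In other words, rows 0 and 1 of the accumulator live in vd, and rows 2 and 3 in vd+1.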

Assembly

All XSTIME instructions are allocated under the RISC-V CUSTOM_1 opcode. For convenience, this assembler file defines 4 macros to easily assemble the vmadot instruction and its 3 variants:

vmadot vd, vs1, vs2
Sign-extends vs1 and vs2.
vmadotu vd, vs1, vs2
Zero-extends vs1 and vs2.
vmadotsu vd, vs1, vs2
Sign-extends vs1 and zero-extends vs2.
vmadotus vd, vs1, vs2
Zero-extends vs1 and sign-extends vs2.
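The four variants differ only in how each source operand's 8-bit elements are widened to 32 bits before multiplication. A scalar model of one product term (the function names and signedness flags are mine):

```c
#include <stdint.h>

/* Widen one 8-bit element to 32 bits, signed or unsigned. */
static int32_t widen8(uint8_t raw, int is_signed)
{
    return is_signed ? (int32_t)(int8_t)raw : (int32_t)raw;
}

/* One product term as each variant would compute it: s1 and s2 select
 * sign-extension (1) or zero-extension (0) of the respective operand,
 * mirroring the vmadot/vmadotu/vmadotsu/vmadotus naming. */
static int32_t madot_term(uint8_t a, uint8_t b, int s1, int s2)
{
    return widen8(a, s1) * widen8(b, s2);
}
```

For bytes below 0x80 all four variants agree; the choice only matters once an element's top bit is set (e.g. 0x80 reads as -128 signed but 128 unsigned).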

Before using any of the instructions, configure the vector unit, e.g.:

vsetivli zero, 16, e8, m1, ta, ma
# ...
vmadot v2, v7, v9

This will sign-extend all 8-bit elements in v7 and v9 to 32 bits, transpose v9, calculate the matrix product of v7 by (the transposed) v9, and destructively add the result to v2. Mind that the destination vector register number must be even (due to EMUL=2).