At the VideoLAN Dev Days 2024, I gave a presentation on hijacking AI CPU extensions for multimedia. The illustrative example was SpacemiT's Integrated Matrix Extension, a proprietary extension for the RISC-V ISA implemented in some of the SpacemiT X60 processor cores. In keeping with RISC-V conventions, I refer to that extension as XSTIME, where X is the prefix for vendor-specific extensions.
To spin it nicely, it was a negative-result publication. But though XSTIME did not help me accelerate the inverse Discrete Cosine Transform of the Advanced Video Coding compression algorithm (better known as H.264), it might still be interesting to other people for other use cases.
SpacemiT did publish a PDF detailing how the extension is supposed to work, though it is a bit confusing and misses a few key details. I still recommend that you check it out (if it is still available), as this article only covers the simpler half of the instruction set extension.
Open-source multimedia RISC-V development, code review and maintenance in the FFmpeg multimedia framework and the VLC media player are currently almost at a standstill, as funding has dried up. I cannot realistically keep up entirely in my free time.
To put it bluntly, I need a RISC-V Vector expert to help with FFmpeg code reviews and/or a new sponsor to resume FFmpeg RISC-V work.
The RISC-V ecosystem follows a common terminology for so-called "AI" extensions. In reality, these are extensions for finite linear algebra: scalar products or matrix multiplications that just so happen to be intended primarily for Deep Learning workloads.
If you are curious, SiFive posted a YouTube video summarising the terminology. But in short, an Integrated Matrix Extension (IME) multiplies two matrices stored in vector registers and returns the result in a third vector register.
Note that the rest of this article assumes that the reader is familiar with the RISC-V Vector (RVV) extension, which XSTIME is based on.
XSTIME defines 4 sets of instructions. As it is the only set that I experimented with, and for the sake of brevity and simplicity, let's see how the first set works.
The vmadot family of instructions
performs a widening integer matrix multiply-and-accumulate, or in short:

M(vd) += wide(M(vs1)) × wide(M(vs2))ᵀ
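To make the semantics concrete, here is a minimal Python reference model of the formula above. This is my own sketch, not code from SpacemiT's documentation; all names are mine, and signed 8-bit sources (plain vmadot behaviour) are assumed:

```python
def vmadot_ref(acc, a, b):
    """Reference model of vmadot: acc += wide(a) x wide(b)^T.

    acc: 4x4 list of 32-bit accumulators; a, b: 4xK lists of signed
    8-bit source elements, where K is 4 or 8 depending on vl.
    """
    cols = len(a[0])
    for i in range(len(a)):
        for j in range(len(b)):
            # Dot product of row i of a with row j of b (i.e. column j
            # of b transposed), accumulated into a 32-bit element.
            s = sum(a[i][k] * b[j][k] for k in range(cols))
            acc[i][j] = (acc[i][j] + s) & 0xFFFFFFFF  # wrap modulo 2**32
    return acc
```

With vl = 16 both sources are 4×4; with vl = 32 they are 4×8, and the accumulator is a 4×4 matrix either way.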
The destination/accumulator matrix is always made of
32-bit integers,
while the source matrices can theoretically use
4-bit, 8-bit or 16-bit integers as scalar elements,
as specified by the RVV selected element width (SEW)
in the current vector type system register, vtype.
In practice, the means to select 4-bit elements was not specified
by RVV at the time that the X60 came out.
It is presumably vtype.vsew = 0b111
but I did not test it.
16-bit elements yield systematic, reproducible arithmetic errors that render them essentially unusable. I can only guess that the vendor never actually tested them. (That is why my experiments with H.264 failed miserably.)
Selecting any other element width in vtype is undefined,
so for the rest of the article we assume that 8-bit elements are used.
The dimensions of the matrices are determined by the current
RISC-V vector length, vl,
which can be set with the usual RVV vsetvli or
vsetivli instructions.
The RVV vector register group multiplier (LMUL) must equal 1,
and vl must be a power of 2 no smaller than 16
(or actually 128 divided by SEW).
Since the X60 processor has 256-bit vectors, we have 2 possible vector lengths: either 16 or 32 8-bit elements. Those vector lengths correspond to source matrices of 4 rows and either 4 or 8 columns respectively. Accordingly, the destination/accumulator matrix is a 4×4 square matrix, and since that adds up to 4×4×32 = 512 bits, the destination operand has an effective group multiplier (EMUL) of 2.
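The arithmetic above can be double-checked with a short sketch (VLEN, SEW and the variable names are mine; VLEN = 256 is the X60's vector register width):

```python
VLEN = 256  # X60 vector register width in bits
SEW = 8     # element width assumed throughout this article

for vl in (16, 32):                 # the two legal vector lengths
    # vl must fit in one register (LMUL=1) and be at least 128/SEW.
    assert vl * SEW <= VLEN and vl >= 128 // SEW
    rows = 4                        # source matrices always have 4 rows
    cols = vl // rows               # hence 4 or 8 columns

# Destination is always 4x4 with 32-bit elements: 512 bits, EMUL=2.
dest_bits = 4 * 4 * 32
assert dest_bits == 512
assert dest_bits // VLEN == 2       # EMUL
```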
All matrices are notionally stored in row-major order
in their respective vector operands.
(Though given that matrix transposition reverses products,
i.e. (A × B)ᵀ = Bᵀ × Aᵀ,
we can also interpret the operands as column-major matrices and define
vmadot as transposing the first rather
than the second source operand.)
Concretely, the matrices are laid out in vectors as follows
(subscripts represent byte offsets):
| Source 4x4 matrix layout (8-bit elements) | |||
|---|---|---|---|
| vs0 | vs1 | vs2 | vs3 |
| vs4 | vs5 | vs6 | vs7 |
| vs8 | vs9 | vs10 | vs11 |
| vs12 | vs13 | vs14 | vs15 |
| Source 4x8 matrix layout (8-bit elements) | |||||||
|---|---|---|---|---|---|---|---|
| vs0 | vs1 | vs2 | vs3 | vs4 | vs5 | vs6 | vs7 |
| vs8 | vs9 | vs10 | vs11 | vs12 | vs13 | vs14 | vs15 |
| vs16 | vs17 | vs18 | vs19 | vs20 | vs21 | vs22 | vs23 |
| vs24 | vs25 | vs26 | vs27 | vs28 | vs29 | vs30 | vs31 |
| Destination 4x4 matrix layout (32-bit elements) | |||
|---|---|---|---|
| vd0-3 | vd4-7 | vd8-11 | vd12-15 |
| vd16-19 | vd20-23 | vd24-27 | vd28-31 |
| v(d+1)0-3 | v(d+1)4-7 | v(d+1)8-11 | v(d+1)12-15 |
| v(d+1)16-19 | v(d+1)20-23 | v(d+1)24-27 | v(d+1)28-31 |
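The tables above amount to plain row-major indexing, which can be captured in a small sketch (the helper names are mine) mapping element coordinates to byte offsets:

```python
def src_offset(i, j, cols):
    """Byte offset of 8-bit source element (i, j) within its vector."""
    return i * cols + j

def dst_offset(i, j):
    """(register index within the EMUL=2 group, byte offset) of 32-bit
    destination element (i, j), with 256-bit (32-byte) registers."""
    byte = (i * 4 + j) * 4          # row-major, 4 bytes per element
    return byte // 32, byte % 32

# Element (2, 1) of a 4x8 source sits at byte 17 in its vector.
assert src_offset(2, 1, 8) == 17
# Destination element (2, 0) lives in the second register of the group.
assert dst_offset(2, 0) == (1, 0)
```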
All XSTIME instructions are allocated under the RISC-V CUSTOM_1 opcode.
For convenience, this assembler file defines
4 macros to easily assemble the vmadot instruction
and its 3 variants:

- vmadot sign-extends vs1 and vs2.
- vmadotu zero-extends vs1 and vs2.
- vmadotsu sign-extends vs1 and zero-extends vs2.
- vmadotus zero-extends vs1 and sign-extends vs2.

Before using any of these instructions, configure the vector unit, e.g.:
```
vsetivli zero, 16, e8, m1, ta, ma
# ...
vmadot v2, v7, v9
```
This will sign-extend all 8-bit elements in v7
and v9 to 32 bits,
then calculate the matrix product of
v7 by the transposition of v9,
and destructively add the result to v2.
Mind that the destination vector number must be even (due to EMUL=2).
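Putting it together, here is a hedged Python model of what the example above computes, and of how the zero-extending variants differ. All function names here are mine, not SpacemiT's:

```python
def sext8(x):
    """Sign-extend a raw byte to a Python int."""
    return x - 256 if x >= 128 else x

def zext8(x):
    """Zero-extend a raw byte."""
    return x & 0xFF

def vmadot_model(acc, a, b, ext1=sext8, ext2=sext8, cols=4):
    """acc += ext1(A) x ext2(B)^T, with A and B stored row-major as
    flat byte lists (vl = 16, i.e. 4x4 matrices, as set by vsetivli)."""
    for i in range(4):
        for j in range(4):
            s = sum(ext1(a[i * cols + k]) * ext2(b[j * cols + k])
                    for k in range(cols))
            acc[i][j] = (acc[i][j] + s) & 0xFFFFFFFF  # 32-bit wrap
    return acc

# 0xFF reads as -1 when sign-extended, 255 when zero-extended.
a = [0xFF] + [0] * 15                # A[0][0] is the only non-zero element
b = [0x01] + [0] * 15
acc = [[0] * 4 for _ in range(4)]
vmadot_model(acc, a, b)              # both operands sign-extended
assert acc[0][0] == 0xFFFFFFFF       # -1, wrapped modulo 2**32
acc = [[0] * 4 for _ in range(4)]
vmadot_model(acc, a, b, ext1=zext8)  # zero-extend the first operand
assert acc[0][0] == 255
```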