RISC-V Vector extension draft

My ~~new~~last year's ~~work~~hobby of writing optimised FFmpeg (and VLC) functions for the RISC-V Vector extension has been hampered by two factors. The first factor is, well, that it is just a hobby, and one of many at that. The other factor is the lack of available hardware, with which to run tests and benchmarks.

Shameless plug

My RISC-V activities are suboptimally conducted outside business hours, mostly in my free time. If you are looking for a senior system software engineer with RISC-V experience, see my LinkedIn profile.

Background

In all due fairness, the RISC-V V specification was ratified less than 2 years prior to my writing this piece, and just about 6 months prior to starting the aforementioned RISC-V porting activities. Releasing hardware is much slower than software, and nobody really expected any functioning commercial hardware less than 18 months after specification, at the very best. For comparison, more than 5 years passed from the publication of the final specification for ARM's Scalable Vector Extension to general availability of ARMv9-A processors in early 2022 (in the form of very expensive high-end mobile phones).

Nevertheless, Alibaba's subsidiary T-Head did release a processor design within months of the ratification of RISC-V V, and has followed it with several vastly improved designs since. But there was and, at the time of writing, still is a big catch: the processors implement a pre-ratification draft, version 0.7.1 of the vector extension, which is binary incompatible with the ratified version 1.0.

This got several people wondering exactly how incompatible they are. However, I have not been able to locate a summary of the incompatibilities. So here is a report from my modest attempt at figuring those out.

Disclaimer

But first, a big disclaimer is necessary: I do not currently have any hardware with Vector support, whether per a draft of the specification, or per the ratified specification. This document is in no way official or authoritative, and comes with no warranty whatsoever. The rest of this article is based exclusively on reading and comparing different versions of the specification, and I did not experimentally confirm or refute any of the information herein.

Moreover, there are over 700 tracked changes between versions 0.7.1 and 1.0 of the specification. I did not review every single one of them in precise detail, and also will not do so in my free time. (If you want/need such a task performed, hire an engineer.)

Note for the future

This document was prepared and edited in spring 2023. Already by then, more than a year had passed since the beginning of the so-called RISC-V V draft implementation controversy. Eventually, hardware conforming to the ratified specification should become broadly available.

If you are reading this in 2024, this might not be so relevant any longer.
If you are reading this in 2025 or later, this has hopefully become completely irrelevant already.

Changed computational instructions

With that exceedingly long foreword out of the way, let's start with the longest set of changes, which is to say changes to instruction opcodes.

Added instructions

These opcodes were missing in the draft and were added in the final standard:

Integer (sign/zero) extension: vsext.vf2 vsext.vf4 vsext.vf8 vzext.vf2 vzext.vf4 vzext.vf8
Integer averaging add/subtract: vaaddu.vv vaaddu.vx vasubu.vv vasubu.vx
Whole-register move: vmv1r.v vmv2r.v vmv4r.v vmv8r.v
Float reciprocal estimate: vfrsqrt7.v vfrec7.v
Type conversion: vfcvt.rtz.xu.f.v vfcvt.rtz.x.f.v vfncvt.rod.f.f.w vfncvt.rtz.xu.f.w vfncvt.rtz.x.f.w vfwcvt.rtz.xu.f.v vfwcvt.rtz.x.f.v
Vector permutation: vfslide1down.vf vfslide1up.vf vrgatherei16.vv

In addition to those vector computational instructions, a third vector configuration instruction, vsetivli, was also added separately.

Those instructions would obviously not work properly on affected processors. In most if not all cases, simple substitution sequences exist. But making those substitutions would invalidate any benchmark for code that benefits from any of the instructions listed here.

Modified instructions

The following instructions have changed encoding:

Integer multiply-add: vwmaccus.vx vwmaccsu.vv vwmaccsu.vx
Integer average add/subtract: vaadd.vv vaadd.vx vasub.vv vasub.vx
Vector-scalar move: vfmv.s.f vfmv.f.s vmv.s.x vmv.x.s
Vector mask: vfirst.m vcpop.m

These instructions exist in version 0.7.1, but they cannot be assembled for it, at least not with a conforming assembler, which emits the ratified encodings. Trickery such as assembler macros may be required.
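As an illustration of the kind of trickery involved, here is a minimal sketch that packs the fields of a vector arithmetic instruction into a raw 32-bit word, which a macro could then emit through a `.word` directive. The helper names (`encode_op_v`, `as_word_directive`) are hypothetical. The example uses vadd.vv only because it is familiar; the draft funct6 values for the instructions listed above would have to be looked up in the 0.7.1 specification and are deliberately not reproduced here.

```python
# Field layout of the vector arithmetic (OP-V) instruction format:
# funct6[31:26] vm[25] vs2[24:20] vs1[19:15] funct3[14:12] vd[11:7] opcode[6:0]
OP_V = 0x57  # major opcode shared by vector arithmetic instructions


def encode_op_v(funct6, vm, vs2, vs1, funct3, vd):
    """Pack OP-V instruction fields into a 32-bit instruction word."""
    fields = (("funct6", funct6, 6), ("vm", vm, 1), ("vs2", vs2, 5),
              ("vs1", vs1, 5), ("funct3", funct3, 3), ("vd", vd, 5))
    for name, value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
    return (funct6 << 26 | vm << 25 | vs2 << 20 | vs1 << 15
            | funct3 << 12 | vd << 7 | OP_V)


def as_word_directive(word):
    """Format an instruction word as an assembler .word directive."""
    return f".word 0x{word:08x}"


# vadd.vv v1, v2, v3, unmasked: funct6 = 0b000000, funct3 = 0b000 (OPIVV).
example = encode_op_v(funct6=0b000000, vm=1, vs2=2, vs1=3, funct3=0b000, vd=1)
print(as_word_directive(example))
```

For one of the modified instructions, only the funct6 (and possibly the vs1 or vs2 selector) passed to the helper would differ between the draft and the ratified encoding.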

Unaries

Furthermore, three encoding groups were modified, changing all instructions inside each of those groups: VFUNARY0, VFUNARY1 and VMUNARY0. This affects the following instructions:

Float square root: vfsqrt.v
Float classify: vfclass.v
Type conversion: vfcvt.xu.f.v vfcvt.x.f.v vfcvt.f.xu.v vfcvt.f.x.v vfncvt.xu.f.w vfncvt.x.f.w vfncvt.f.xu.w vfncvt.f.x.w vfncvt.f.f.w vfwcvt.xu.f.v vfwcvt.x.f.v vfwcvt.f.xu.v vfwcvt.f.x.v vfwcvt.f.f.v
Mask: vmsbf.m vmsif.m vmsof.m viota.m vid.v

Removed instructions

In my opinion, nobody should really care about removed instructions that did not make the final cut in the ratified standard. If you used those instructions, your code would not be forward-compatible, and thus end up mostly useless sooner rather than later. But for the sake of completeness, here they are:

vaadd.vi vasub.vi vdot.vv vdotu.vv vext.x.v vfdot.vv vmford.vf vmford.vv vwsmaccu.vv vwsmaccu.vx vwsmacc.vv vwsmacc.vx vwsmaccsu.vv vwsmaccsu.vx vwsmaccus.vx

Not to mention

Any instruction renamed without actual change to the encoding is excluded for simplicity, since that would be mostly irrelevant. Also, two sets of instructions were added after 0.7.1 and promptly removed again before 1.0: vwsll and vqmacc (and friends).

Loads and stores

This is really as simple as it is terrible, if you need to support the draft version: literally all vector load and all vector store instructions from the draft were removed, and a completely new set of them was added. Due to the extensive list of affected instruction mnemonics, they are not listed here. On the bright side, this is not quite as bad as it sounds, as you can still match load and store instructions in many cases.

On one hand, the draft provided load instructions which optionally either zero-extend or sign-extend elements from, and store instructions which narrow elements to, a specified bit width. This was similar to integer scalar loads and stores.

On the other hand, the ratified standard defines load and store instructions which preserve the size of elements. If the configured element size differs from the load/store size, then the effective group multiplier for the instruction is scaled proportionally, so that the active vector length stays the same.
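To make the scaling concrete: as I read the ratified specification, the effective multiplier is EMUL = (EEW / SEW) × LMUL, where EEW is the element width encoded in the load/store instruction, SEW the configured element width, and LMUL the configured group multiplier. The following is a minimal sketch of that arithmetic; the function name `effective_lmul` is hypothetical.

```python
from fractions import Fraction


def effective_lmul(eew, sew, lmul):
    """Return EMUL = (EEW / SEW) * LMUL as an exact fraction.

    eew:  element width of the load/store instruction, in bits
    sew:  configured element width, in bits
    lmul: configured group multiplier (int or Fraction)
    """
    emul = Fraction(eew, sew) * Fraction(lmul)
    # Results outside of 1/8..8 correspond to reserved encodings.
    if not Fraction(1, 8) <= emul <= 8:
        raise ValueError(f"EMUL={emul} is out of range")
    return emul


# 8-bit load while configured for 32-bit elements with LMUL=1:
print(effective_lmul(8, 32, 1))
# 64-bit load while configured for 16-bit elements with LMUL=2:
print(effective_lmul(64, 16, 2))
```

When the two widths are equal, EMUL is simply LMUL, which is the case considered in the rest of this section.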

Added instructions

Strictly speaking, all load and store instructions differ. Nevertheless, if the configured element size matches the load/store size, then widening/narrowing does not apply, and we can notionally consider that the following instructions were added:

Whole-register transfer: vl1r.v (a.k.a. vl1re8.v) vl1re16.v vl1re32.v vl1re64.v vs1r.v (a.k.a. vs1re8.v) vs1re16.v vs1re32.v vs1re64.v
Mask load/store: vlm.v vsm.v
Indexed unordered load: vluxei8.v vluxei16.v vluxei32.v vluxei64.v vluxseg2ei8.v vluxseg2ei16.v vluxseg2ei32.v vluxseg2ei64.v vluxseg3ei8.v vluxseg3ei16.v vluxseg3ei32.v vluxseg3ei64.v vluxseg4ei8.v vluxseg4ei16.v vluxseg4ei32.v vluxseg4ei64.v vluxseg5ei8.v vluxseg5ei16.v vluxseg5ei32.v vluxseg5ei64.v vluxseg6ei8.v vluxseg6ei16.v vluxseg6ei32.v vluxseg6ei64.v vluxseg7ei8.v vluxseg7ei16.v vluxseg7ei32.v vluxseg7ei64.v vluxseg8ei8.v vluxseg8ei16.v vluxseg8ei32.v vluxseg8ei64.v

Modified instructions

Under the same operational condition, the indexed unordered store instructions changed encoding. The motivation underlying that change was to reserve encoding space for future extra-large element sizes.

Indexed unordered store: vsuxei8.v vsuxei16.v vsuxei32.v vsuxei64.v vsuxseg2ei8.v vsuxseg2ei16.v vsuxseg2ei32.v vsuxseg2ei64.v vsuxseg3ei8.v vsuxseg3ei16.v vsuxseg3ei32.v vsuxseg3ei64.v vsuxseg4ei8.v vsuxseg4ei16.v vsuxseg4ei32.v vsuxseg4ei64.v vsuxseg5ei8.v vsuxseg5ei16.v vsuxseg5ei32.v vsuxseg5ei64.v vsuxseg6ei8.v vsuxseg6ei16.v vsuxseg6ei32.v vsuxseg6ei64.v vsuxseg7ei8.v vsuxseg7ei16.v vsuxseg7ei32.v vsuxseg7ei64.v vsuxseg8ei8.v vsuxseg8ei16.v vsuxseg8ei32.v vsuxseg8ei64.v

Removed instructions

Lastly, all sign-extending load instructions were removed, also to make space for larger element sizes. That being noted, if element sizes match, then sign-extending and zero-extending loads are functionally identical.
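The equivalence is easy to check with a small model of a single element. This is only a sketch with hypothetical helper names, and it assumes the load width never exceeds the element width.

```python
def zero_extend(raw, load_bits, elem_bits):
    """Model a zero-extending load of load_bits into an elem_bits element."""
    assert load_bits <= elem_bits
    return raw & ((1 << load_bits) - 1)


def sign_extend(raw, load_bits, elem_bits):
    """Model a sign-extending load of load_bits into an elem_bits element."""
    assert load_bits <= elem_bits
    value = raw & ((1 << load_bits) - 1)
    if value >> (load_bits - 1):  # the top bit of the loaded value is set
        value -= 1 << load_bits
    return value & ((1 << elem_bits) - 1)  # element as raw register bits


# Widening an 8-bit load into a 16-bit element: the results differ.
print(hex(zero_extend(0x80, 8, 16)), hex(sign_extend(0x80, 8, 16)))
# Same width on both sides: the results are identical.
print(hex(zero_extend(0x80, 8, 8)), hex(sign_extend(0x80, 8, 8)))
```

In other words, when the sizes match, a draft sign-extending load can be matched with the same ratified instruction as its zero-extending counterpart.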

Vector configuration

While the listings above may seem daunting, the affected computational instructions only add up to a small chunk of the entire Vector extension. Most optimisations would not need them. And those optimisations that do need them can typically find decent substitutions.

The much more severe binary compatibility problem lies with changes in vector type (vtype) encoding.

Most dramatically, the lowest field, vlmul, grew a sign bit to accommodate the fractional group multipliers added in version 0.9.
Accordingly, the offset of the following vsew field was shifted by one bit, wrecking the decoding of any non-zero value.
The mask agnostic mode was added, with one corresponding flag bit.
Ditto the tail agnostic mode and flag bit.

Concretely:

Only integral group multipliers can be used: m1, m2, m4 and m8.
All element sizes can be used, but 8-bit is the only one that is interoperable between versions. Other sizes assemble differently.
Mask and tail modes must both be undisturbed. While this works properly, most algorithms are potentially faster with agnostic modes, which impose fewer restrictions on the vector implementation.
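The constraints above can be derived mechanically by encoding the same configuration under both layouts and comparing the results. The sketch below assumes the following field positions, based on my reading of the two specification versions rather than on any hardware test: in 0.7.1, vlmul occupies bits 1:0 and vsew bits 4:2 (with the draft-only vediv field left at zero); in 1.0, vlmul occupies bits 2:0, vsew bits 5:3, the tail agnostic flag bit 6 and the mask agnostic flag bit 7. The function names are hypothetical.

```python
SEW_CODE = {8: 0, 16: 1, 32: 2, 64: 3}   # element width in bits -> vsew
LMUL_CODE = {1: 0, 2: 1, 4: 2, 8: 3}     # integral multipliers only -> vlmul


def vtype_draft(sew, lmul):
    """vtype value under the assumed 0.7.1 layout (vediv = 0)."""
    return SEW_CODE[sew] << 2 | LMUL_CODE[lmul]


def vtype_ratified(sew, lmul, ta=False, ma=False):
    """vtype value under the assumed 1.0 layout."""
    return int(ma) << 7 | int(ta) << 6 | SEW_CODE[sew] << 3 | LMUL_CODE[lmul]


# Configurations whose vtype value is identical in both versions,
# with tail and mask modes both undisturbed.
interoperable = [(sew, lmul)
                 for sew in SEW_CODE
                 for lmul in LMUL_CODE
                 if vtype_draft(sew, lmul) == vtype_ratified(sew, lmul)]
print(interoperable)  # [(8, 1), (8, 2), (8, 4), (8, 8)]
```

Only the 8-bit configurations survive, because vsew is zero for them and shifting a zero field by one bit changes nothing.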

Misc (warning)

Beware that there are other more subtle changes. For example, some immediate values have changed signedness. There are probably other issues that I don't even know of.

Final words

With tedious macros, it should be possible to recompile some, but not all, standard-targeting vector code for the draft version. For what it is worth, I would not spend my money on draft-implementing hardware only for testing and benchmarking purposes, and only for a year (give or take) until conformant hardware can probably be procured from open markets at reasonable price points. Then again, this is just my opinion as a time-constrained hobbyist; your mileage may vary, and I do not actually know of upcoming hardware release dates.

And I sure would not mind if I got my hands on a (draft or not) vector-capable Linux RISC-V chip without paying for it. Cough cough.

As for writing run-time interoperable code that would run regardless of the implemented version of the vector extension, it would be rather difficult for anything but simple byte-wise algorithms such as memcpy() and memset().
