RISC-V Vector extension draft

My hobby, as of last year, of writing optimised FFmpeg (and VLC) functions for the RISC-V Vector extension has been hampered by two factors. The first factor is, well, that it is just a hobby, and one of many at that. The other factor is the lack of available hardware with which to run tests and benchmarks.

Background

In all fairness, the RISC-V V specification was ratified less than 2 years prior to my writing this piece, and just about 6 months prior to starting the aforementioned RISC-V porting activities. Releasing hardware is much slower than releasing software, and nobody really expected any functioning commercial hardware less than 18 months after specification, at the very best. For comparison, more than 5 years passed from the publication of the final specification for ARM's Scalable Vector Extension to general availability of ARMv9-A processors in early 2022 (in the form of very expensive high-end mobile phones).

Nevertheless, Alibaba's subsidiary T-Head did release a processor design within months of the ratification of RISC-V V, which has since been followed by several vastly improved designs. But there was and, at the time of writing, still is a big catch: the processors implement a pre-ratification draft, version 0.7.1, of the vector extension, which is binary incompatible with ratified version 1.0.

This got several people wondering exactly how incompatible they are. However, I have not been able to locate a summary of the incompatibilities. So here is a report from my modest attempt at figuring those out.

Disclaimer

But first, a big disclaimer is necessary: I do not currently have any hardware with Vector support, whether per a draft of the specification, or per the ratified specification. This document is in no way official or authoritative, and comes with no warranty whatsoever. The rest of this article is based exclusively on reading and comparing different versions of the specification, and I did not experimentally confirm or refute any information therein.

Moreover, there are over 700 tracked changes between versions 0.7.1 and 1.0 of the specification. I did not review every single one of them in precise detail, and I will not do so in my free time either. (If you want/need such a task performed, hire an engineer.)

Note for the future

This document was prepared and edited in spring 2023. Already by then, more than a year had passed since the beginning of the so-called RISC-V V draft implementation controversy. Eventually, hardware conforming to the ratified specification should become broadly available.

Changed computational instructions

But with that exceedingly long foreword, let's start with the longest set of changes, which is to say changes to instruction opcodes.

Added instructions

The following opcodes, which were missing in the draft, were added in the final standard:

Integer (sign/zero) extension
vsext.vf2 vsext.vf4 vsext.vf8
vzext.vf2 vzext.vf4 vzext.vf8
Integer averaging add/subtract
vaaddu.vv vaaddu.vx
vasubu.vv vasubu.vx
Vector move
vmv1r.v vmv2r.v vmv4r.v vmv8r.v
Float reciprocal estimate
vfrsqrt7.v vfrec7.v
Type conversion
vfcvt.rtz.xu.f.v vfcvt.rtz.x.f.v
vfncvt.rod.f.f.w vfncvt.rtz.xu.f.w vfncvt.rtz.x.f.w
vfwcvt.rtz.xu.f.v vfwcvt.rtz.x.f.v
Vector permutation
vmv1r.v vmv2r.v vmv4r.v vmv8r.v
vfslide1down.vf vfslide1up.vf
vrgatherei16.vv

In addition to those vector computational instructions, a third vector configuration instruction, vsetivli, was also added separately.

Those instructions would obviously not work properly on affected processors. In most if not all cases, simple substitution sequences exist. But making those substitutions would invalidate any benchmark for code that benefits from any of the instructions listed here.

Modified instructions

The following instructions have changed encoding:

Integer multiply-add
vwmaccus.vx vwmaccsu.vv vwmaccsu.vx
Integer average add/subtract
vaadd.vv vaadd.vx
vasub.vv vasub.vx
Vector-scalar move
vfmv.s.f vfmv.f.s
vmv.s.x vmv.x.s
Vector mask
vfirst.m vcpop.m

These instructions can be used with version 0.7.1, but they cannot be assembled, at least not with a conforming assembler. Trickery such as assembler macros may be required.

Unaries

Furthermore, three encoding groups were modified, changing all instructions inside each of those groups: VFUNARY0, VFUNARY1 and VMUNARY0. This affects the following instructions:

Float square root
vfsqrt.v
Float classify
vfclass.v
Type conversion
vfcvt.xu.f.v vfcvt.x.f.v vfcvt.f.xu.v vfcvt.f.x.v
vfncvt.xu.f.w vfncvt.x.f.w vfncvt.f.xu.w vfncvt.f.x.w vfncvt.f.f.w
vfwcvt.xu.f.v vfwcvt.x.f.v vfwcvt.f.xu.v vfwcvt.f.x.v vfwcvt.f.f.v
Mask
vmsbf.m vmsif.m vmsof.m viota.m vid.v

Removed instructions

In my opinion, nobody should really care about removed instructions that did not make the final cut in the ratified standard. If you used those instructions, your code would not be forward-compatible, and thus end up mostly useless sooner rather than later. But for the sake of completeness, here they are:

vaadd.vi vasub.vi
vdot.vv vdotu.vv
vext.x.v
vfdot.vv
vmford.vf vmford.vv
vwsmaccu.vv vwsmaccu.vx vwsmacc.vv vwsmacc.vx
vwsmaccsu.vv vwsmaccsu.vx vwsmaccus.vx

Not to mention

Any instruction renamed without actual change to the encoding is excluded for simplicity, since that would be mostly irrelevant. Also, two sets of instructions were added and promptly removed after 0.7.1 and before 1.0: vwsll and vqmacc (and friends).

Loads and stores

This is really as simple as it is terrible, if you need to support the draft version: literally all vector load and all vector store instructions from the draft were removed, and a completely new set of them was added. Due to the extensive list of affected instruction mnemonics, they are not listed here. On the bright side, this is not quite as bad as it sounds, as you can still match load and store instructions in many cases.

On one hand, the draft provided load instructions which optionally either zero-extend or sign-extend elements from, and store instructions which narrow elements to, a specified bit width. This was similar to integer scalar loads and stores.
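To make the draft behaviour more tangible, here is a minimal scalar sketch in C of what byte-sized loads and stores do when 16-bit elements are configured. The function names are hypothetical and only model the semantics described above, in the spirit of the draft's vlb.v, vlbu.v and vsb.v mnemonics. This is not how any hardware or library implements them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: sign-extending byte load into 16-bit elements. */
static void model_load_bytes_signed(int16_t *dst, const uint8_t *src,
                                    size_t vl)
{
    assert(vl == 0 || (dst != NULL && src != NULL));
    for (size_t i = 0; i < vl; i++) {
        /* Portable sign extension of an 8-bit value */
        int v = src[i];
        dst[i] = (int16_t)(v >= 0x80 ? v - 0x100 : v);
    }
}

/* Hypothetical model: zero-extending byte load into 16-bit elements. */
static void model_load_bytes_unsigned(uint16_t *dst, const uint8_t *src,
                                      size_t vl)
{
    assert(vl == 0 || (dst != NULL && src != NULL));
    for (size_t i = 0; i < vl; i++)
        dst[i] = src[i];
}

/* Hypothetical model: narrowing store, keeping the low 8 bits of each
 * 16-bit element. */
static void model_store_bytes(uint8_t *dst, const uint16_t *src, size_t vl)
{
    assert(vl == 0 || (dst != NULL && src != NULL));
    for (size_t i = 0; i < vl; i++)
        dst[i] = (uint8_t)(src[i] & 0xff);
}
```

With the ratified instructions, the extension or narrowing step would have to be a separate instruction (for instance one of the vsext/vzext instructions listed earlier), since the load or store itself leaves the element size alone.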

On the other hand, the ratified standard defines load and store instructions which preserve the size of elements. If the configured element size differs from the load/store size, then the effective multiplier for the instruction is scaled proportionally, so that the number of elements processed stays the same.

Added instructions

Strictly speaking, all load and store instructions differ. Nevertheless, if the configured element size matches the load/store size, then widening/narrowing does not apply, and we can notionally consider that the following instructions were added:

Whole-register transfer
vl1r.v (a.k.a. vl1re8.v) vl1re16.v vl1re32.v vl1re64.v
vs1r.v (a.k.a. vs1re8.v) vs1re16.v vs1re32.v vs1re64.v
Mask load/store
vlm.v vsm.v
Indexed unordered load
vluxei8.v vluxei16.v vluxei32.v vluxei64.v
vluxseg2ei8.v vluxseg2ei16.v vluxseg2ei32.v vluxseg2ei64.v
vluxseg3ei8.v vluxseg3ei16.v vluxseg3ei32.v vluxseg3ei64.v
vluxseg4ei8.v vluxseg4ei16.v vluxseg4ei32.v vluxseg4ei64.v
vluxseg5ei8.v vluxseg5ei16.v vluxseg5ei32.v vluxseg5ei64.v
vluxseg6ei8.v vluxseg6ei16.v vluxseg6ei32.v vluxseg6ei64.v
vluxseg7ei8.v vluxseg7ei16.v vluxseg7ei32.v vluxseg7ei64.v
vluxseg8ei8.v vluxseg8ei16.v vluxseg8ei32.v vluxseg8ei64.v

Modified instructions

Under the same operational condition, the indexed unordered store instructions changed encoding. The motivation underlying that change was to reserve space for future extra-large element sizes.

Indexed unordered store
vsuxei8.v vsuxei16.v vsuxei32.v vsuxei64.v
vsuxseg2ei8.v vsuxseg2ei16.v vsuxseg2ei32.v vsuxseg2ei64.v
vsuxseg3ei8.v vsuxseg3ei16.v vsuxseg3ei32.v vsuxseg3ei64.v
vsuxseg4ei8.v vsuxseg4ei16.v vsuxseg4ei32.v vsuxseg4ei64.v
vsuxseg5ei8.v vsuxseg5ei16.v vsuxseg5ei32.v vsuxseg5ei64.v
vsuxseg6ei8.v vsuxseg6ei16.v vsuxseg6ei32.v vsuxseg6ei64.v
vsuxseg7ei8.v vsuxseg7ei16.v vsuxseg7ei32.v vsuxseg7ei64.v
vsuxseg8ei8.v vsuxseg8ei16.v vsuxseg8ei32.v vsuxseg8ei64.v

Removed instructions

Lastly, all sign-extending load instructions were removed, also to make space for larger element sizes. That being noted, if element sizes match, then sign-extending and zero-extending loads are functionally identical.

Vector configuration

While the listings above may seem daunting, the affected computational instructions only add up to a small chunk of the entire Vector extension. Most optimisations would not need them. And those optimisations that do need them can typically find decent substitutions.

The much more severe binary compatibility problem lies with changes in vector type (vtype) encoding.

Concretely:
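A minimal sketch of the difference follows. It assumes, from a reading of both specification versions that should be checked against the actual documents, that the draft places vlmul in bits 1:0 and vsew in bits 4:2 (with vediv, left at zero here, in bits 6:5), whereas the ratified version places vlmul in bits 2:0, vsew in bits 5:3, and adds the tail-agnostic and mask-agnostic flags in bits 6 and 7. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Encodes the element width: 8 -> 0, 16 -> 1, 32 -> 2, 64 -> 3. */
static unsigned vsew_field(unsigned sew_bits)
{
    unsigned field = 0;

    /* Must be a power of two no smaller than 8 */
    assert(sew_bits >= 8 && (sew_bits & (sew_bits - 1)) == 0);
    while ((8u << field) < sew_bits)
        field++;
    return field;
}

/* Assumed draft 0.7.1 layout; lmul_log2 is 0..3 for LMUL = 1, 2, 4, 8. */
static uint32_t vtype_draft(unsigned sew_bits, unsigned lmul_log2)
{
    assert(lmul_log2 <= 3);
    return (vsew_field(sew_bits) << 2) | lmul_log2;
}

/* Assumed ratified 1.0 layout; lmul_log2 is -3..3, negative values
 * standing for the fractional multipliers 1/8, 1/4 and 1/2. */
static uint32_t vtype_v1(unsigned sew_bits, int lmul_log2, int ta, int ma)
{
    assert(lmul_log2 >= -3 && lmul_log2 <= 3);
    return ((uint32_t)(ma != 0) << 7) | ((uint32_t)(ta != 0) << 6)
         | (vsew_field(sew_bits) << 3) | ((unsigned)lmul_log2 & 7u);
}
```

Under those assumptions, 32-bit elements with LMUL = 2 encode as 9 in the draft layout but as 17 in the ratified layout (before any agnostic flags), so the same vsetvli immediate means different things. Only 8-bit elements with an integer multiplier and both flags cleared yield the same numeric value in both layouts, and an identical value does not by itself imply identical behaviour.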

Misc (warning)

Beware that there are other more subtle changes. For example, some immediate values have changed signedness. There are probably other issues that I don't even know of.

Final words

With tedious macros, it should be possible to recompile some, but not all, standard-targeting vector code for the draft version. For what it is worth, I would not spend my money on draft-implementing hardware only for testing and benchmarking purposes, and only for a year (give or take) until conformant hardware can probably be procured from open markets at reasonable price points. Then again, this is just my opinion as a time-constrained hobbyist; your mileage may vary, and I do not actually know of upcoming hardware release dates.

And I sure would not mind if I got my hands on a (draft or not) vector-capable Linux RISC-V chip without paying for it. Cough cough.

As for writing run-time interoperable code that would run regardless of the implemented version of the vector extension, it would be rather difficult for anything but simple byte-wise algorithms such as memcpy() and memset().