My hobby, since last year, of writing optimised FFmpeg (and VLC) functions for
the RISC-V Vector extension has been hampered by two factors.
The first factor is, well, that it is just a hobby, and one of many at that.
The other factor is the lack of available hardware,
with which to run tests and benchmarks.
My RISC-V activities are suboptimally conducted outside business hours, mostly in my free time. If you are looking for a senior system software engineer with RISC-V experience, see my LinkedIn profile.
In all due fairness, the RISC-V V specification was ratified less than 2 years prior to my writing this piece, and just about 6 months prior to starting the aforementioned RISC-V porting activities. Releasing hardware is much slower than software, and nobody really expected any functioning commercial hardware less than 18 months after specification, at the very best. For comparison, more than 5 years passed from the publication of the final specification for ARM's Scalable Vector Extension to general availability of ARMv9-A processors in early 2022 (in the form of very expensive high-end mobile phones).
Nevertheless, Alibaba's subsidiary T-Head did release a processor design
within months of the ratification of RISC-V V, which would be followed
by several vastly improved designs since then.
But there was and, at the time of writing, still is a big catch:
The processors implement a pre-ratification draft version 0.7.1
of the vector extension, which is binary incompatible
with ratified version 1.0.
This got several people wondering exactly how incompatible they are. However, I have not been able to locate a summary of the incompatibilities. So here is a report from my modest attempt at figuring those out.
But first, a big disclaimer is necessary: I do not currently have any hardware with Vector support, whether per a draft of the specification, or per the ratified specification. This document is in no way official or authoritative, and comes with no warranty whatsoever. The rest of this article is based exclusively on reading and comparing different versions of the specification, and I did not experimentally confirm or refute any information therein.
Moreover, there are over 700 tracked changes
between versions 0.7.1 and 1.0 of the specification.
I did not review every single one of them in precise detail,
and also will not do so in my free time.
(If you want/need such a task performed, hire an engineer.)
This document was prepared and edited in spring 2023. Already by then, more than a year had passed since the beginning of the so-called RISC-V V draft implementation controversy. Eventually, hardware conforming to the ratified specification should become broadly available.
But with that exceedingly long foreword out of the way, let's start with the longest set of changes, which is to say changes to instruction opcodes.
The following opcodes were added in the final standard and are missing from the draft:
vsext.vf2 vsext.vf4 vsext.vf8
vzext.vf2 vzext.vf4 vzext.vf8
vaaddu.vv vaaddu.vx
vasubu.vv vasubu.vx
vm1r.v vm2r.v vm4r.v vm8r.v
vfrsqrt7.v vfrec7.v
vfcvt.rtz.xu.f.v vfcvt.rtz.x.f.v
vfncvt.rod.f.f.w vfncvt.rtz.xu.f.w vfncvt.rtz.x.f.w
vfwcvt.rtz.xu.f.v vfwcvt.rtz.x.f.v
vmv1r.v vmv2r.v vmv4r.v vmv8r.v
vfslide1down.vf vfslide1up.vf
vrgatherei16.vv
In addition to those vector computational instructions, a third vector configuration instruction, vsetivli, was also added separately.
Those instructions would obviously not work properly on affected processors. In most if not all cases, simple substitution sequences exist. But making those substitutions would invalidate any benchmark of code that takes advantage of any of the instructions listed here.
The following instructions have changed encoding:
vwmaccus.vx vwmaccsu.vv vwmaccsu.vx
vaadd.vv vaadd.vx
vasub.vv vasub.vx
vfmv.s.f vfmv.f.s
vmv.s.x vmv.x.s
vfirst.m vcpop.m
These instructions can be used with version 0.7.1,
but they cannot be assembled,
at least not with a conforming assembler.
Trickery such as assembler macros may be required.
Furthermore, three encoding groups were modified,
changing all instructions inside each of those groups:
VFUNARY0, VFUNARY1 and VMUNARY0.
This affects the following instructions:
vfsqrt.v
vfclass.v
vfcvt.xu.f.v vfcvt.x.f.v vfcvt.f.xu.v vfcvt.f.x.v
vfncvt.xu.f.w vfncvt.x.f.w vfncvt.f.xu.w vfncvt.f.x.w
vfncvt.f.f.w
vfwcvt.xu.f.v vfwcvt.x.f.v vfwcvt.f.xu.v vfwcvt.f.x.v
vfwcvt.f.f.v
vmsbf.m vmsif.m vmsof.m viota.m vid.v
In my opinion, nobody should really care about removed instructions that did not make the final cut in the ratified standard. If you used those instructions, your code would not be forward-compatible, and thus end up mostly useless sooner rather than later. But for the sake of completeness, here they are:
vaadd.vi vasub.vi
vdot.vv vdotu.vv
vext.x.v
vfdot.vv
vmford.vf vmford.vv
vwsmaccu.vv vwsmaccu.vx vwsmacc.vv vwsmacc.vx
vwsmaccsu.vv vwsmaccsu.vx vwsmaccus.vx
Any instruction renamed without actual change to the encoding is
excluded for simplicity, since that would be mostly irrelevant.
Also, two sets of instructions were added and promptly removed
after 0.7.1 and before 1.0:
vwsll and vqmacc (and friends).
This is really as simple as it is terrible, if you need to support the draft version: literally all vector load and all vector store instructions from the draft were removed, and a completely new set of them was added. Due to the extensive list of affected instruction mnemonics, they are not listed here. On the bright side, this is not quite as bad as it sounds, as you can still match load and store instructions in many cases.
On one hand, the draft provided load instructions which optionally either zero-extend or sign-extend elements from, and store instructions which narrow elements to, a specified bit width. This was similar to integer scalar loads and stores.
On the other hand, the ratified standard defines load and store instructions which preserve the size of elements. If the configured element size differs from the load/store size, then the effective group multiplier for the instruction is scaled proportionally, so that the active vector length is unchanged.
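As a rough illustration of that scaling, the ratified specification derives the effective group multiplier (EMUL) of a load or store from its effective element width (EEW) as EMUL = (EEW / SEW) × LMUL. The sketch below is only a model of that arithmetic; the function name `effective_lmul` is made up for this example and does not come from any toolchain.

```python
from fractions import Fraction

def effective_lmul(eew: int, sew: int, lmul: Fraction) -> Fraction:
    """Return EMUL = (EEW / SEW) * LMUL for a version 1.0 load or store.

    eew:  element width encoded in the load/store instruction, in bits
    sew:  element width currently configured in vtype, in bits
    lmul: group multiplier currently configured in vtype
    """
    emul = Fraction(eew, sew) * lmul
    # Version 1.0 only allows group multipliers from 1/8 to 8.
    if not Fraction(1, 8) <= emul <= 8:
        raise ValueError(f"EMUL {emul} is outside the 1/8..8 range")
    return emul

# A 16-bit load executed with SEW=32 and LMUL=2 moves the same number of
# elements, so it only needs a register group half as large.
print(effective_lmul(16, 32, Fraction(2)))  # 1
```

When EEW equals SEW, the ratio is 1 and EMUL is simply LMUL, which is the matching case discussed next.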
Strictly speaking, all load and store instructions differ. Nevertheless, if the configured element size matches the load/store size, then widening/narrowing does not apply, and we can notionally consider that the following instructions were added:
vl1r.v (a.k.a. vl1re8.v) vl1re16.v vl1re32.v vl1re64.v
vs1r.v (a.k.a. vs1re8.v) vs1re16.v vs1re32.v vs1re64.v
vlm.v vsm.v
vluxei8.v vluxei16.v vluxei32.v vluxei64.v
vluxseg2ei8.v vluxseg2ei16.v vluxseg2ei32.v vluxseg2ei64.v
vluxseg3ei8.v vluxseg3ei16.v vluxseg3ei32.v vluxseg3ei64.v
vluxseg4ei8.v vluxseg4ei16.v vluxseg4ei32.v vluxseg4ei64.v
vluxseg5ei8.v vluxseg5ei16.v vluxseg5ei32.v vluxseg5ei64.v
vluxseg6ei8.v vluxseg6ei16.v vluxseg6ei32.v vluxseg6ei64.v
vluxseg7ei8.v vluxseg7ei16.v vluxseg7ei32.v vluxseg7ei64.v
vluxseg8ei8.v vluxseg8ei16.v vluxseg8ei32.v vluxseg8ei64.v
Under the same operational condition, the indexed unordered store instructions changed encoding. The motivation underlying that change was to reserve encoding space for future extra-large element sizes.
vsuxei8.v vsuxei16.v vsuxei32.v vsuxei64.v
vsuxseg2ei8.v vsuxseg2ei16.v vsuxseg2ei32.v vsuxseg2ei64.v
vsuxseg3ei8.v vsuxseg3ei16.v vsuxseg3ei32.v vsuxseg3ei64.v
vsuxseg4ei8.v vsuxseg4ei16.v vsuxseg4ei32.v vsuxseg4ei64.v
vsuxseg5ei8.v vsuxseg5ei16.v vsuxseg5ei32.v vsuxseg5ei64.v
vsuxseg6ei8.v vsuxseg6ei16.v vsuxseg6ei32.v vsuxseg6ei64.v
vsuxseg7ei8.v vsuxseg7ei16.v vsuxseg7ei32.v vsuxseg7ei64.v
vsuxseg8ei8.v vsuxseg8ei16.v vsuxseg8ei32.v vsuxseg8ei64.v
Lastly, all sign-extending load instructions were removed, also to make space for larger element sizes. That being noted, if element sizes match, then sign-extending and zero-extending loads are functionally identical.
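To see why the two flavours coincide, here is a small model of a draft-style load, which reads a value of one width from memory and extends it to the element width. This is purely an illustrative sketch: `load_element` is a hypothetical helper, not an instruction or an API.

```python
def load_element(raw: int, mem_bits: int, elem_bits: int, signed: bool) -> int:
    """Model a draft-style load: read mem_bits from memory, extend to elem_bits.

    The result is returned as an unsigned bit pattern of elem_bits bits.
    """
    raw &= (1 << mem_bits) - 1
    if signed and raw >> (mem_bits - 1):
        raw -= 1 << mem_bits  # reinterpret the value as a negative number
    return raw & ((1 << elem_bits) - 1)

# Widening an 8-bit value to a 16-bit element: the two flavours differ.
print(hex(load_element(0x80, 8, 16, signed=True)))   # 0xff80
print(hex(load_element(0x80, 8, 16, signed=False)))  # 0x80

# Same width on both sides: there is nothing to extend, so they agree.
print(load_element(0x80, 8, 8, True) == load_element(0x80, 8, 8, False))  # True
```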
While the listings above may seem daunting, the affected computational instructions only add up to a small chunk of the entire Vector extension. Most optimisations would not need them. And those optimisations that do need them can typically find decent substitutions.
The much more severe binary compatibility problem lies
with changes in the vector type (vtype) encoding.
The vlmul field grew a sign bit
to accommodate the fractional group multipliers
added in version 0.9.
The vsew field was shifted by one bit,
wrecking the decoding of any non-zero value.
Concretely, only 8-bit elements (a zero vsew) combined with the integer group multipliers
m1, m2, m4 and m8 retain the same encoding.
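The effect of the vtype layout change can be sketched with a few lines of Python. The bit positions below follow my reading of the two specification versions (vlmul in bits 1:0 and vsew in bits 4:2 for 0.7.1; vlmul in bits 2:0, vsew in bits 5:3, vta in bit 6 and vma in bit 7 for 1.0). The function names are made up for this example, and the vill and vediv fields are deliberately ignored.

```python
SEW_CODES = {8: 0, 16: 1, 32: 2, 64: 3}            # vsew field values
LMUL_V10 = {"m1": 0, "m2": 1, "m4": 2, "m8": 3,
            "mf8": 5, "mf4": 6, "mf2": 7}           # 3-bit vlmul, version 1.0
LMUL_V071 = {"m1": 0, "m2": 1, "m4": 2, "m8": 3}    # 2-bit vlmul, version 0.7.1

def vtype_v10(sew, lmul, ta=False, ma=False):
    """Version 1.0 layout; the vta and vma bits did not exist in the draft."""
    return LMUL_V10[lmul] | (SEW_CODES[sew] << 3) | (int(ta) << 6) | (int(ma) << 7)

def vtype_v071(sew, lmul):
    """Version 0.7.1 layout, with vediv left at zero."""
    return LMUL_V071[lmul] | (SEW_CODES[sew] << 2)

def decode_v071(vtype):
    """Read the vsew and vlmul fields the way the 0.7.1 layout defines them."""
    sew = 8 << ((vtype >> 2) & 7)
    lmul = "m%d" % (1 << (vtype & 3))
    return sew, lmul

# Which configurations encode to the same vtype value in both versions?
compatible = [(sew, lmul) for sew in SEW_CODES for lmul in LMUL_V071
              if vtype_v10(sew, lmul) == vtype_v071(sew, lmul)]
print(compatible)  # [(8, 'm1'), (8, 'm2'), (8, 'm4'), (8, 'm8')]

# A version 1.0 "e16, m1" value read through the 0.7.1 layout:
print(decode_v071(vtype_v10(16, "m1")))  # (32, 'm1')
```

Note that this comparison leaves the tail-agnostic and mask-agnostic bits cleared; setting either one in a version 1.0 vtype value touches bits that the draft layout uses differently.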
Beware that there are other more subtle changes. For example, some immediate values have changed signedness. There are probably other issues that I don't even know of.
With tedious macros, it should be possible to recompile some, but not all, standard-targeting vector code for the draft version. For what it is worth, I would not spend my money on draft-implementing hardware only for testing and benchmarking purposes, and only for a year (give or take) until conformant hardware can probably be procured from open markets at reasonable price points. Then again, this is just my opinion as a time-constrained hobbyist; your mileage may vary, and I do not actually know of upcoming hardware release dates.
And I sure would not mind getting my hands on a (draft or not) vector-capable Linux RISC-V chip without paying for it. Cough cough.
As for writing run-time interoperable code that would run regardless
of the implemented version of the vector extension,
it would be rather difficult for anything but simple byte-wise algorithms
such as memcpy() and memset().