RISC-V support in FFmpeg

Earlier this year, I was porting VLC to RISC-V. Now the work is pretty much done. Of course, there are still plenty of missing bits and pieces for the VisionFive version 1 single board computer that I have been working with, but this relates to the peripherals found on that specific board rather than the instruction set. In fact, VLC even already has RISC-V assembler optimisations for the Vector extension, which does not even exist yet in actual commercial hardware.

So obviously, the time was ripe to work on VLC's dependencies, most particularly everybody's favorite open-source multimedia codec library! That is of course to say FFmpeg.

To give credit where due, strictly speaking, FFmpeg could already be built for RISC-V, since OpenBSD developer Brad Smith provided a simple patch to recognise riscv as a target architecture over a year ago. But there are quite a few other aspects to consider to properly support FFmpeg on a new ISA.

Shameless plug

My RISC-V activities are suboptimally conducted outside business hours and mostly in my free time. If you are looking for a senior system software engineer with RISC-V experience, see my LinkedIn profile.

Automatic testing

Not only is FFmpeg a bit of a beast of its own among open-source projects, it also comes with its very own test automation: FATE. So the preferable first step to adding a new architecture, or even a new platform (riscv64-linux-gnu in this case), consists of setting up one FATE instance... and asking nicely for permission to connect it to the official FATE server to report the results.

In fact there are now two instances, one for GCC and one for Clang/LLVM, both running on the same VisionFive board. By the way, I would like to (again) thank StarFive Tech who provided the device (and myself for providing the electricity). I did consider testing more compiler versions, but that does not seem particularly useful; in fact, it would seem that testing more GNU/binutils versions would be more valuable, as that is the package providing the RISC-V assembler, GNU/as.

(NOTE: at the moment, LLVM's built-in assembler is mostly unusable on RISC-V because it lacks support for the .option arch directive.)

Architecture features

For a mix of historical reasons and shortcomings of the C language, FFmpeg still needs to be taught a number of architecture-specific aspects manually.

First, it wants to know whether the architecture supports unaligned accesses, and what the highest sensible alignment for memory allocations is. In the case of RISC-V, unaligned accesses are not generally supported, and even when they are, they tend to be extremely slow. (I suspect that the VisionFive board traps and works around unaligned accesses in machine mode, which would explain why it works but so terribly slowly.) As for memory alignment, FFmpeg's 16-byte default is just fine, so that part is easy.
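To illustrate what "not relying on unaligned accesses" means in C, here is a minimal sketch. The helper name read_u32_unaligned is hypothetical and this is not FFmpeg's own code; it only shows the usual portable idiom of going through memcpy().

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not an FFmpeg function): read a 32-bit value in
 * native byte order from an address that may not be 4-byte aligned. */
static inline uint32_t read_u32_unaligned(const void *p)
{
    uint32_t v;

    /* memcpy() makes no alignment assumption, so the compiler is free to
     * emit byte loads on targets without fast unaligned word loads. */
    memcpy(&v, p, sizeof (v));
    return v;
}
```

On a target where unaligned loads are cheap, compilers typically reduce this to a single load, so the idiom should cost nothing there.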

Then we have a bunch of basic bit operations: counting leading zeroes, trailing zeroes, and set bits (i.e., Hamming weight), and byte-wise order swapping. All of these are supported by the recently ratified RISC-V "Zbb" Bit manipulation extension's Basic subset.
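For reference, these four operations can be written in C roughly as follows. This is a sketch that assumes GCC or Clang builtins and a 32-bit unsigned int; the wrapper names are hypothetical and not FFmpeg's. When Zbb is enabled for the target, the compiler can be expected to map them to the clz, ctz, cpop and rev8 families of instructions.

```c
#include <stdint.h>

/* Hypothetical wrappers around GCC/Clang builtins (32-bit unsigned int
 * assumed). Like the builtins themselves, the two zero-counting
 * functions are undefined for x == 0. */
static inline int count_leading_zeros32(uint32_t x)
{
    return __builtin_clz(x);
}

static inline int count_trailing_zeros32(uint32_t x)
{
    return __builtin_ctz(x);
}

/* Number of set bits, i.e. the Hamming weight. */
static inline int hamming_weight32(uint32_t x)
{
    return __builtin_popcount(x);
}

/* Reverse the order of the four bytes. */
static inline uint32_t byte_swap32(uint32_t x)
{
    return __builtin_bswap32(x);
}
```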

FFmpeg also wants several clipping or saturating arithmetic functions. None of those are implemented natively by RISC-V, which prides itself on being a Reduced Instruction Set Computer architecture (it is in the name after all). Nevertheless, the default FFmpeg implementations are optimised for an instruction set that works with zero-extension. For the most part, RISC-V uses sign extension instead. Accounting for that particularity, a few arithmetic functions can be micro-optimised slightly differently.
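For readers unfamiliar with the terms, here is what clipping and saturating arithmetic boil down to, as a plain C sketch. The names clip_int16 and sat_add32 are hypothetical; this is neither FFmpeg's implementation nor the RISC-V-tuned variant, just the reference behaviour that any optimised version has to reproduce.

```c
#include <stdint.h>

/* Hypothetical helper: clamp a 32-bit signed value to the int16_t range. */
static inline int16_t clip_int16(int32_t x)
{
    if (x < INT16_MIN)
        return INT16_MIN;
    if (x > INT16_MAX)
        return INT16_MAX;
    return (int16_t)x;
}

/* Hypothetical helper: signed 32-bit addition that saturates at the
 * range limits instead of wrapping around. */
static inline int32_t sat_add32(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + b; /* cannot overflow in 64 bits */

    if (sum < INT32_MIN)
        return INT32_MIN;
    if (sum > INT32_MAX)
        return INT32_MAX;
    return (int32_t)sum;
}
```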

CPU feature detection

The penultimate piece of work concerns the detection of processor capabilities, which becomes essential for the last step.

Compilation-time detection

One piece of good news here is that RISC-V has completely normalised feature detection at build-time. Unlike on other instruction sets, you do not have to cross fingers that a given feature can be detected at all, and then guess how to do so. On RISC-V, if an extension, say Zfoobar, is supported by the compiler target, then the predefined __riscv_zfoobar constant has a non-zero value. That value represents the extension version. Otherwise, the constant is not defined.
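Applied to a real extension, the convention looks like the sketch below, which uses Zbb as the example. The function name riscv_zbb_build_version is hypothetical; only the __riscv_zbb macro comes from the convention described above.

```c
/* Hypothetical helper: version of the Zbb extension that the compiler
 * targets, or 0 if Zbb is not enabled (or the target is not RISC-V). */
static inline long riscv_zbb_build_version(void)
{
#ifdef __riscv_zbb
    return __riscv_zbb;
#else
    return 0;
#endif
}
```

Since the macro is simply absent when the extension is not targeted, the same source builds unchanged for any architecture.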

Boot-time detection

At boot time, the capabilities of each processor core are listed in the flattened device tree where the operating system can find them. FFmpeg does not run on bare metal, so that is not directly useful for our topical purposes. But the curious can find that data in /proc/cpuinfo, under the isa heading, e.g.:

processor       : 0
hart            : 0
isa             : rv64imafdc
mmu             : sv39
uarch           : sifive,u74-mc
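For the curious, an isa value like the one above could be checked for a single-letter extension with something like the following sketch. The helper isa_has_letter is hypothetical, and the parsing rules are a simplifying assumption stated in the comment rather than a full implementation of the RISC-V naming conventions.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: does an "isa" string such as "rv64imafdc" list the
 * given single-letter extension (in lower case)? Simplifying assumption:
 * the single-letter part starts right after "rv32" or "rv64" and ends at
 * the first '_' or at the first multi-letter prefix ('s', 'x' or 'z'). */
static bool isa_has_letter(const char *isa, char ext)
{
    if (strncmp(isa, "rv32", 4) != 0 && strncmp(isa, "rv64", 4) != 0)
        return false;

    for (const char *p = isa + 4; *p != '\0'; p++) {
        if (*p == '_' || *p == 's' || *p == 'x' || *p == 'z')
            break; /* multi-letter extensions start here */
        if (*p == ext)
            return true;
    }
    return false;
}
```

With the sample output above, this reports 'f' as present and 'v' as absent.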

Run-time detection

For run-time detection of processor capabilities that are not expressly selected at the time of compilation, we are however still left with whatever bespoke mechanism the operating system defines, if any. On Linux, the convention for RISC-V is two-tiered:

In the case of an extension with a single letter designation, the nth bit (starting from zero, with A as bit 0) of the AT_HWCAP entry from the auxv auxiliary vector is set if it is supported.

Otherwise, the semi-official plan is that the AT_HWCAP2 entry will convey a pointer to a structure within the vDSO area. That structure should describe processor capabilities. Or such was the plan at the time of Linux Plumbers 2022 anyway. At the time of writing, that pointer is always NULL.

As an example, to detect support for single precision floating point, a.k.a. the F extension, you can use this C code:

#include <stdbool.h>
#include <sys/auxv.h>

bool riscv_f_supported(void)
{
        unsigned long hwcap = getauxval(AT_HWCAP);

        return (hwcap >> ('F' - 'A')) & 1;
}
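The same test generalises to any single-letter extension. The sketch below uses a hypothetical helper, riscv_hwcap_has, which takes the AT_HWCAP value as a parameter so that the bit arithmetic can be exercised on any machine.

```c
#include <stdbool.h>

/* Hypothetical helper: test the bit of a single-letter extension
 * ('A' to 'Z') in a value previously obtained from getauxval(AT_HWCAP). */
static inline bool riscv_hwcap_has(unsigned long hwcap, char ext)
{
    if (ext < 'A' || ext > 'Z')
        return false; /* not a single-letter extension name */
    return (hwcap >> (ext - 'A')) & 1;
}
```

On a RISC-V Linux system, riscv_hwcap_has(getauxval(AT_HWCAP), 'V') would then tell whether the Vector extension is reported.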

DSP functions

Last but not least, no FFmpeg port would be complete without optimised DSP functions. FFmpeg defines a lot of DSP functions, ranging from generic calculation functions such as the scalar product, to specific tools of specific codecs such as the MPEG Video motion compensation.

This is arguably the most relevant part of an FFmpeg port, and definitely the most difficult and time-consuming, dwarfing everything else above. While FFmpeg contains generic C implementations of every single one of its DSP functions, those are much slower than the optimised versions written in assembler using SIMD instruction set extensions.
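To give an idea of what a generic C implementation looks like, here is a sketch of a scalar product. The function is hypothetical and written for this post's purposes; it is not copied from FFmpeg.

```c
#include <stddef.h>

/* Hypothetical reference routine, not FFmpeg's own code:
 * scalar (dot) product of two float vectors of n elements each. */
static float scalar_product(const float *a, const float *b, size_t n)
{
    float sum = 0.f;

    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

A SIMD version computes several of these products per instruction, which is where the speed-up comes from.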

On RISC-V, there are two completely different SIMD extensions. Or rather there are:

The ratified standard Vector extension, known as RISC-V V. It uses large scalable vector registers from 128 to 1024 bits each.
The draft Packed SIMD extension, tentatively known as RISC-V P, which is conceptually similar to the old ARMv6 SIMD. It does not seem to be making much progress on the path to actual ratification.

Unfortunately, neither of those extensions is really available in hardware as of yet.

RISC-V V Vector extension

To be fair, there exists a cheap but relatively slow implementation of a draft version of RISC-V V from T-Head. But the implemented version is binary-incompatible with the ratified standard that compilers and assemblers support today, and that future commercial processors are expected to provide.

This means that benchmarking is plainly impossible, and we are left to test with QEMU (or the official SPIKE simulator). On the bright side, the Vector extension is much easier to program with than most SIMD extensions on other instruction sets, such as x86's SSE and AVX:

It does not require over-aligned vector base addresses in memory.
It can handle different hardware vector lengths (like ARM SVE), with a single piece of code.
It can deal with short vectors and does not require special handling of edge cases (also like ARM SVE).
It can trivially be unrolled if doing so would improve bandwidth, by simply setting a "group multiplier" in the vector unit settings at run-time.
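The points above all follow from the way vector loops are written: each iteration asks the hardware how many elements it will process, then advances by that amount. The sketch below models that pattern in plain C rather than in actual vector code. The names model_setvl and model_vector_add are hypothetical, and the vlmax parameter stands in for the hardware vector length, which in this model would also grow with the group multiplier.

```c
#include <stddef.h>

/* Model of the vector length request: grant as many elements as asked
 * for, up to vlmax. (Simplification: real hardware has some latitude
 * when fewer than 2 * vlmax elements remain.) */
static size_t model_setvl(size_t avl, size_t vlmax)
{
    return avl < vlmax ? avl : vlmax;
}

/* Hypothetical strip-mined element-wise addition, dst[i] = a[i] + b[i].
 * The inner loop stands in for one vector load, add and store sequence.
 * vlmax must be non-zero. */
static void model_vector_add(float *dst, const float *a, const float *b,
                             size_t n, size_t vlmax)
{
    while (n > 0) {
        size_t vl = model_setvl(n, vlmax);

        for (size_t i = 0; i < vl; i++)
            dst[i] = a[i] + b[i];

        /* Advance by however many elements were actually processed. */
        dst += vl;
        a += vl;
        b += vl;
        n -= vl;
    }
}
```

The same loop gives the same result whatever vlmax is, and the last, shorter chunk needs no separate tail code.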

At this point, I have already submitted RISC-V V support for most of libavutil's generic DSP functions, with the notable, important but very difficult exception of the transform functions. I have also added a few relatively simple libavcodec ad-hoc functions. Though in all honesty, that was just the proverbial tip of the iceberg. The work is difficult, the amount of it enormous, and yours truly is but a hobbyist working in their free time.

Also, the inability to run benchmarks is a major bottleneck. So help would be welcome both in labour and in hardware... at least in the hopefully near future, when vector-capable hardware should become available.
