Other uses of futex

In the previous article, we looked at (re)implementing condition variables using futex. There are a number of other useful primitives that futex can provide, and we consider the main ones this time.

But first, let's revisit the condition variable implementation.

Condition variable (2)

The last (and only functional) implementation in the previous article required two atomic variables (value and previous), one of which was the futex (value). This is not actually necessary.

If we look carefully, the two variables always move in lock-step. At any given time, their respective values are either equal, or differ by exactly one. Thus the entropy of the second variable is only one bit. Furthermore, the futex value only needs enough entropy to count all threads that can use the condition variable concurrently, which does not require the entire range of an integer.

For instance, on Linux, in the absolute worst case, the condition variable is located in shared memory and used by all threads in all processes. Threads share the same identifier space as processes (PID), and the maximum setting for the maximum number of processes (pid_max) is 2^22, or circa four million.

The futex implementation itself is theoretically limited to 2^29 (~500,000,000) processes. Either way, we can set one bit of the futex value aside for the lock-step, and use the other bits for counting.

In the following example, we use the low-order bit for lock-step, and the rest for counting:

typedef struct
{
        atomic_uint value;
} cnd_t;

/* Our static initializer */
#define CND_INITIALIZER_NP { ATOMIC_VAR_INIT(0) }

int cnd_init(cnd_t *cv)
{
        atomic_init(&cv->value, 0);
        return thrd_success;
}

int cnd_wait(cnd_t *cv, mtx_t *mtx)
{
        unsigned value = atomic_load_explicit(&cv->value,
                                              memory_order_relaxed);

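        /* If the lock-step (low-order) bit is set, acknowledge it:
         * clear the bit and advance the counter held in the upper bits. */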
        while (value & 1)
        {
                if (atomic_compare_exchange_weak_explicit(&cv->value,
                                &value, value + 1,
                                memory_order_relaxed, memory_order_relaxed))
                        value++;
        }

        mtx_unlock(mtx);
        futex_wait(&cv->value, value, NULL);
        mtx_lock(mtx);
        return thrd_success;
}

int cnd_signal(cnd_t *cv)
{
        atomic_fetch_or_explicit(&cv->value, 1, memory_order_relaxed);
        futex_signal(&cv->value);
        return thrd_success;
}

Unfortunately, there is still a very subtle race condition (at least in theory) found by Zbigniew Chlebicki: if a first thread wakes a second thread billions of times in a row until the counter wraps, a third thread could end up missing a wake-up. We will see how to solve that for good in a later article.

Semaphore

The semaphore is probably the most straightforward use of a futex; it was presumably one of the main design criteria when futexes were invented (do not quote me on this; I cannot verify whether it is true). The semaphore futex value can be the semaphore count; no other data is necessary.

typedef struct
{
        atomic_uint value;
} sem_t;

int sem_init(sem_t *sem, unsigned initval)
{
        atomic_init(&sem->value, initval);
        return thrd_success;
}

int sem_wait(sem_t *sem)
{
        unsigned value = 1;

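        /* Try to decrement a non-zero count; whenever the count is zero,
         * sleep until the semaphore is posted. */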
        while (!atomic_compare_exchange_weak_explicit(&sem->value,
                                                      &value, value - 1,
                                                      memory_order_acquire,
                                                      memory_order_relaxed))
        {
                if (value == 0)
                {
                        futex_wait(&sem->value, 0, NULL);
                        value = 1;
                }
        }

        return thrd_success;
}

int sem_post(sem_t *sem)
{
        atomic_fetch_add_explicit(&sem->value, 1, memory_order_release);
        futex_signal(&sem->value);
        return thrd_success;
}

N.B.: Semaphores are not included in C11, hence custom prototypes.

There is one bug and one performance problem in the implementation above. Fixing them is left as another exercise to the reader.

Auto-reset event (binary semaphore)

The event is a primitive found on Windows and in some embedded systems. It is effectively a semaphore whose maximum value is one. The implementation is even simpler than that of a normal semaphore, since the post operation need not handle overflow, and the only possible values are 0 and 1. A compare-exchange loop is still required in the wait operation, however.
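
For illustration, a minimal sketch along those lines follows. It reuses the futex_wait() and futex_signal() wrappers and the thrd_success convention from the earlier code; the evt_t, evt_wait() and evt_set() names are made up, as C11 defines no event API:

typedef struct
{
        atomic_uint value; /* 0: not signaled, 1: signaled */
} evt_t;

int evt_wait(evt_t *evt)
{
        unsigned value = 1;

        /* Consume the event (1 -> 0) if it is signaled; otherwise sleep. */
        while (!atomic_compare_exchange_weak_explicit(&evt->value,
                                                      &value, 0,
                                                      memory_order_acquire,
                                                      memory_order_relaxed))
        {
                if (value == 0)
                {
                        futex_wait(&evt->value, 0, NULL);
                        value = 1;
                }
        }

        return thrd_success;
}

int evt_set(evt_t *evt)
{
        /* No overflow to handle: the value saturates at one. */
        atomic_store_explicit(&evt->value, 1, memory_order_release);
        futex_signal(&evt->value);
        return thrd_success;
}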

Manually reset event

The manual-reset event, also found on Windows, is trivial to implement with a futex. It has three operations: set, clear and wait(-for-set). They can be implemented with a futex and simple atomic load/store operations.
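
Again as a rough sketch with made-up names, using the same wrappers (futex_broadcast() wakes all waiting threads):

typedef struct
{
        atomic_uint value; /* 0: cleared, 1: set */
} mevt_t;

void mevt_set(mevt_t *evt)
{
        atomic_store_explicit(&evt->value, 1, memory_order_release);
        futex_broadcast(&evt->value); /* release every waiting thread */
}

void mevt_clear(mevt_t *evt)
{
        atomic_store_explicit(&evt->value, 0, memory_order_relaxed);
}

void mevt_wait(mevt_t *evt)
{
        /* Sleep for as long as the event remains cleared. */
        while (atomic_load_explicit(&evt->value, memory_order_acquire) == 0)
                futex_wait(&evt->value, 0, NULL);
}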

Pulse event

The pulse event is yet another variant of the Windows event. It is broken by design because it is intrinsically subject to a race condition. Solving that race is essentially why condition variables are always paired with a lock, and why futex_wait() needs a comparand value parameter.

Regardless, a pulse event could be implemented with a futex; it is inadvisable, but absolutely trivial.

One-time initializer

The one-time initializer is a tri-state construct and can be implemented with a futex. One particularity is the need for a strong compare-exchange:

typedef struct
{
        atomic_int value;
} once_t;

#define ONCE_INITIALIZER { ATOMIC_VAR_INIT(0) }

void once(once_t *once, void (*callback)(void))
{
        int value = 0;

        /* already initialized (2) state - (optional) fast path */
        if (atomic_load_explicit(&once->value, memory_order_acquire) == 2)
                return;

        /* move from uninitialized (0) to pending (1) state */
        if (atomic_compare_exchange_strong_explicit(&once->value,
                                                    &value, 1,
                                                    memory_order_acquire,
                                                    memory_order_acquire))
        {
                callback();
                /* move from pending (1) to initialized (2) state */
                atomic_store_explicit(&once->value, 2,
                                      memory_order_release);
                futex_broadcast(&once->value);
                return;
        }

        /* wait for initialized (2) state - slow path */
        while (value == 1)
        {
                futex_wait(&once->value, 1, NULL);
                value = atomic_load_explicit(&once->value, memory_order_acquire);
        }
}

Mutual exclusion lock (mutex)

The mutual exclusion lock is probably the best known thread synchronization primitive. But even if try-lock and timed-lock operations are left aside (as they are not relevant to futex usage), the mutex comes in several different kinds...

Fast mutex

The simplest and most common mutex kind is the fast mutex; a thread simply blocks forever if it attempts to lock it recursively. In principle, it could be implemented by storing a single bit as the futex value, indicating whether the mutex is locked (owned) or not.

However, for performance reasons, distinguishing the uncontended and contended locked states is preferable. Indeed, that saves the unlock operation from issuing a futex_signal() system call in the fast (uncontended) path.

An example implementation can be found here.
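
For reference, here is a minimal sketch of the classic three-state scheme (popularized by Ulrich Drepper's paper Futexes Are Tricky), where 0 means unlocked, 1 locked without waiters and 2 locked with possible waiters. It is not necessarily identical to the linked example, and the names are made up:

typedef struct
{
        atomic_uint value; /* 0: unlocked, 1: locked, 2: locked, contended */
} fast_mtx_t;

int fast_mtx_lock(fast_mtx_t *mtx)
{
        unsigned value = 0;

        /* Fast path: 0 (unlocked) -> 1 (locked, no waiters). */
        if (atomic_compare_exchange_strong_explicit(&mtx->value, &value, 1,
                                                    memory_order_acquire,
                                                    memory_order_relaxed))
                return thrd_success;

        /* Slow path: force the contended state while acquiring. */
        if (value != 2)
                value = atomic_exchange_explicit(&mtx->value, 2,
                                                 memory_order_acquire);

        while (value != 0)
        {
                futex_wait(&mtx->value, 2, NULL);
                value = atomic_exchange_explicit(&mtx->value, 2,
                                                 memory_order_acquire);
        }

        return thrd_success;
}

int fast_mtx_unlock(fast_mtx_t *mtx)
{
        /* Only issue the system call if there may be waiters. */
        if (atomic_exchange_explicit(&mtx->value, 0,
                                     memory_order_release) == 2)
                futex_signal(&mtx->value);

        return thrd_success;
}

The slow path pessimistically sets the contended state even when there turn out to be no other waiters; the cost is at most one spurious futex_signal() at unlock time.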

Error-checking mutex

The error-checking mutex returns an error when a thread attempts to lock the mutex recursively. (By the way, error checking does not affect unlocking. Formally, unlocking a mutex not locked by the calling thread is undefined behaviour even if the mutex is error-checking.)

The error-checking mutex needs to keep track of its owning thread. On Linux, the gettid system call can be used for that purpose: it returns a strictly positive integer uniquely identifying the calling thread across the whole system.

As with the fast mutex, the contended and uncontended states should be distinguished. Assuming that the thread identifier is always strictly positive (true on Linux, at least), the sign bit can be used as the contention flag, the magnitude bits as the owner thread identifier, and the special value zero for the unlocked state. Thus the same implementation strategies as for the fast mutex can be used.
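
As a sketch of that encoding (again with made-up names), assuming the gettid() wrapper available since glibc 2.30 (otherwise syscall(SYS_gettid) can be used) and the same futex wrappers as before:

typedef struct
{
        atomic_uint value;
} ec_mtx_t;

/* Futex word encoding:
 *   0                     unlocked
 *   tid                   locked by thread tid, no waiters
 *   tid | MTX_CONTENDED   locked by thread tid, possibly contended
 */
#define MTX_CONTENDED 0x80000000u
#define MTX_OWNER(v)  ((v) & ~MTX_CONTENDED)

int ec_mtx_lock(ec_mtx_t *mtx)
{
        unsigned tid = gettid();
        unsigned value = 0;

        /* Fast path: 0 (unlocked) -> our identifier (locked, no waiters). */
        if (atomic_compare_exchange_strong_explicit(&mtx->value, &value, tid,
                                                    memory_order_acquire,
                                                    memory_order_relaxed))
                return thrd_success;

        for (;;)
        {
                if (MTX_OWNER(value) == tid)
                        return thrd_error; /* recursive locking attempt */

                if (value == 0)
                {
                        /* Keep the contention flag: other waiters may still
                         * be sleeping and must be woken at unlock. */
                        if (atomic_compare_exchange_weak_explicit(&mtx->value,
                                        &value, tid | MTX_CONTENDED,
                                        memory_order_acquire,
                                        memory_order_relaxed))
                                return thrd_success;
                        continue;
                }

                /* Flag the contention, then sleep. */
                if ((value & MTX_CONTENDED)
                 || atomic_compare_exchange_weak_explicit(&mtx->value,
                                &value, value | MTX_CONTENDED,
                                memory_order_relaxed, memory_order_relaxed))
                        futex_wait(&mtx->value, value | MTX_CONTENDED, NULL);

                value = atomic_load_explicit(&mtx->value,
                                             memory_order_relaxed);
        }
}

int ec_mtx_unlock(ec_mtx_t *mtx)
{
        /* Error checking does not affect unlocking (see above). */
        if (atomic_exchange_explicit(&mtx->value, 0, memory_order_release)
                        & MTX_CONTENDED)
                futex_signal(&mtx->value);

        return thrd_success;
}

As with the fast mutex, keeping the contention flag when acquiring in the slow path errs on the side of spurious wake-ups rather than lost ones.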

Recursive mutex

The recursive mutex needs to keep track of the recursion count and the owner thread. An efficient implementation of the recursive mutex cannot be achieved with only a futex. Specifically, when the owning thread increments the recursion count, or decrements it to a value larger than zero, the mutex remains locked. Any thread waiting on the mutex should therefore keep sleeping. Thus the futex value should remain constant, which implies that the recursion count must be kept separately.
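
To make that concrete, a hypothetical layout could look as follows, with the futex word using the same encoding as the error-checking mutex and the recursion count held outside of it:

typedef struct
{
        atomic_uint value;  /* futex: owner and contention flag, as above */
        unsigned recursion; /* extra recursion depth; only ever accessed by
                               the owning thread, so no atomicity is needed */
} rec_mtx_t;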

Read/write lock

For similar reasons, the read/write lock cannot be implemented efficiently with only a futex. The implementation must keep track of the number of active readers, so that read-lock and unlock operations can be paired. Regardless of how the implementation prioritizes readers and writers (for instance, to avoid writer starvation), that counter changes at times when no thread should be woken up.
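
Hypothetically again, the state might be split along these lines, keeping the reader count outside the futex word:

typedef struct
{
        atomic_uint value;   /* futex: write lock and wake-up related state */
        atomic_uint readers; /* active readers; updates to this counter must
                                not wake any waiting thread */
} rwlock_t;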

Barrier

Lastly, the barrier is a lesser-known construct, but it is defined by POSIX, and even became mandatory in the 2008 release of the standard.

The barrier is essentially just a counter, like the semaphore. However, the relation between the counter and thread wake-ups is the opposite: all threads are woken up when the counter reaches zero, starting from a predefined value.

Like the recursive mutex and the read/write lock, the barrier counter should be stored separately from any futex value. The futex value can, however, be used to store one or both of the following pieces of the barrier's state: