PIC versus pic

Have you ever wondered what the difference is between the -fpic and -fPIC compiler command line flags? No? Then you can go back to doing normal things. But if you have, then read on.

For a start, the GCC documentation does provide a hint (emphasis added):

'-fpic': Generate position-independent code (PIC) suitable for use in a shared library, if supported for the target machine. Such code accesses all constant addresses through a global offset table (GOT). The dynamic loader resolves the GOT entries when the program starts (the dynamic loader is not part of GCC; it is part of the operating system). If the GOT size for the linked executable exceeds a machine-specific maximum size, you get an error message from the linker indicating that '-fpic' does not work; in that case, recompile with '-fPIC' instead. (These maximums are 8k on the SPARC, 28k on AArch64 and 32k on the m68k and RS/6000. The x86 has no such limit.) Position-independent code requires special support, and therefore works only on certain machines. For the x86, GCC supports PIC for System V but not for the Sun 386i. Code generated for the IBM RS/6000 is always position-independent. When this flag is set, the macros '__pic__' and '__PIC__' are defined to 1.
'-fPIC': If supported for the target machine, emit position-independent code, suitable for dynamic linking and avoiding any limit on the size of the global offset table. This option makes a difference on AArch64, m68k, PowerPC and SPARC. Position-independent code requires special support, and therefore works only on certain machines. When this flag is set, the macros '__pic__' and '__PIC__' are defined to 2.

So it has something to do with size limits on the Global Offset Table (GOT) and, apart from changing the value of two preprocessor macros, it only matters on some architectures.
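As a quick way to see which mode a given translation unit was compiled in, the macro values quoted above can be inspected from C. The following is only a minimal sketch: the function names pic_level, pic_level_name and print_pic_level are made up for this illustration, and toolchains that build position-independent executables by default may well report a non-zero level even when neither flag is given.

```c
#include <stdio.h>

/* Returns 0 when __PIC__ is undefined, otherwise the value of __PIC__.
 * Per the GCC documentation quoted above, that is 1 for -fpic and 2 for
 * -fPIC. The function name is hypothetical. */
int pic_level(void)
{
#ifdef __PIC__
        return __PIC__;
#else
        return 0;
#endif
}

/* Hypothetical helper mapping the level to the names used in this article. */
const char *pic_level_name(int level)
{
        switch (level) {
        case 0:
                return "non-PIC";
        case 1:
                return "small pic";
        case 2:
                return "large PIC";
        default:
                return "unknown";
        }
}

/* Prints the level this file was compiled with. */
void print_pic_level(void)
{
        int level = pic_level();

        printf("__PIC__ level: %d (%s)\n", level, pic_level_name(level));
}
```

Compiling the same file three times, with -fno-pic, -fpic and -fPIC, and calling print_pic_level() from each build should show the three cases.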

For the sake of this short piece, let's not dive into the fine details of the GOT and the inner workings of dynamic linking. Briefly, the GOT is a table of pointers that gets filled in by the run-time linker to manage cross-references between dynamic shared objects (i.e. libraries).
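To make "a table of pointers" a little more concrete, here is a toy model in plain C. It is not how a real dynamic linker works: toy_got, toy_loader_fill_got and foobar_via_got are invented names, and the two integers are given arbitrary values purely so that the result can be checked.

```c
/* Toy model only: a real GOT is laid out by the linker and filled in by
 * the run-time linker, not by application code. */
static int foo = 40;
static int bar = 2;

/* One slot per external object referenced by the code. */
enum { GOT_FOO, GOT_BAR, GOT_ENTRIES };

static void *toy_got[GOT_ENTRIES];

/* Stand-in for the run-time linker: store the final address of each
 * object in its slot. */
void toy_loader_fill_got(void)
{
        toy_got[GOT_FOO] = &foo;
        toy_got[GOT_BAR] = &bar;
}

/* Position-independent access: first load the pointer from the table,
 * then load the value through that pointer. */
int foobar_via_got(void)
{
        const int *pfoo = toy_got[GOT_FOO];
        const int *pbar = toy_got[GOT_BAR];

        return *pfoo + *pbar;
}
```

The point of the indirection is that the code never embeds the address of foo or bar, only the location of the table slot.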

Then why is it that small pic has a size limitation and large PIC does not?

It boils down to the architecture-specific machine code emitted by the compiler to read pointers from the GOT. Specifically, small pic requires fewer and more optimized instructions than large PIC, at the obvious cost of the size limitations.

AArch64 example

Here is a simple example on the most popular of the affected platforms. First consider the following simple C module, with a function foobar referencing two external objects, foo and bar:

/* foobar.c */
extern int foo;
extern int bar;

int foobar(void)
{
        return foo + bar;
}

Then let's compile it:

aarch64-linux-gnu-gcc -Og -fno-pic foobar.c -c -o foobar.o
aarch64-linux-gnu-gcc -Og -fpic foobar.c -c -o foobar-pic.o
aarch64-linux-gnu-gcc -Og -fPIC foobar.c -c -o foobar-PIC.o

And disassemble the results:

Non-PIC

% aarch64-linux-gnu-objdump -dr foobar.o

foobar.o:     file format elf64-littleaarch64


Disassembly of section .text:

0000000000000000 <foobar>:
   0:   90000000        adrp    x0, 0 <foo>
                        0: R_AARCH64_ADR_PREL_PG_HI21   foo
   4:   b9400001        ldr     w1, [x0]
                        4: R_AARCH64_LDST32_ABS_LO12_NC foo
   8:   90000000        adrp    x0, 0 <bar>
                        8: R_AARCH64_ADR_PREL_PG_HI21   bar
   c:   b9400000        ldr     w0, [x0]
                        c: R_AARCH64_LDST32_ABS_LO12_NC bar
  10:   0b000020        add     w0, w1, w0
  14:   d65f03c0        ret

The compiler does not know where the two objects will be located, so it emits static relocations. In normal usage, the linker would resolve and strip the relocations from the final executable.

When generating plain dumb position-dependent code, each object is referenced with a pair of ADRP and LDR instructions, as per the small memory model:

The ADRP instruction loads the 4 KiB-aligned address of the page containing the object into a 64-bit register (x0 here).
The LDR instruction then loads the value of the 32-bit integer object, at a 12-bit offset from the page address, into a 32-bit register (here w0 and w1).

(AArch64 also features tiny and large memory models, but that is another topic). After both values are loaded, they are added (ADD) and the function returns (RET).
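The page-plus-offset split described above can be sketched in C. This is only an illustration of the arithmetic, not the linker's actual code: the function names are hypothetical, and the sketch ignores instruction encoding details such as the scaling of the LDR immediate by the access size.

```c
#include <stdint.h>

/* 4 KiB pages, as used by ADRP. */
#define PAGE_4K UINT64_C(4096)

/* Address of the 4 KiB page containing addr. */
uint64_t page_of(uint64_t addr)
{
        return addr & ~(PAGE_4K - 1);
}

/* Signed number of pages from the page of the ADRP instruction (pc) to
 * the page of the target: the part of the address carried by ADRP. */
int64_t adrp_page_delta(uint64_t pc, uint64_t target)
{
        return (int64_t)(page_of(target) / PAGE_4K)
             - (int64_t)(page_of(pc) / PAGE_4K);
}

/* Low 12 bits of the target address: the part carried by the LDR offset. */
uint64_t lo12(uint64_t target)
{
        return target & (PAGE_4K - 1);
}

/* Address that an ADRP + LDR pair ends up accessing. */
uint64_t adrp_ldr_address(uint64_t pc, int64_t page_delta, uint64_t offset)
{
        return page_of(pc) + (uint64_t)page_delta * PAGE_4K + offset;
}
```

For instance, with the ADRP instruction at 0x400123 and the object at 0x412345, the page delta is 0x12 and the low 12 bits are 0x345; putting the two back together yields 0x412345 again.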

Small pic

With small pic, the example is simple enough that the result looks very similar:

% aarch64-linux-gnu-objdump -dr foobar-pic.o

foobar-pic.o:     file format elf64-littleaarch64


Disassembly of section .text:

0000000000000000 <foobar>:
   0:   90000000        adrp    x0, 0 <_GLOBAL_OFFSET_TABLE_>
                        0: R_AARCH64_ADR_PREL_PG_HI21   _GLOBAL_OFFSET_TABLE_
   4:   f9400001        ldr     x1, [x0]
                        4: R_AARCH64_LD64_GOTPAGE_LO15  foo
   8:   f9400000        ldr     x0, [x0]
                        8: R_AARCH64_LD64_GOTPAGE_LO15  bar
   c:   b9400021        ldr     w1, [x1]
  10:   b9400000        ldr     w0, [x0]
  14:   0b000020        add     w0, w1, w0
  18:   d65f03c0        ret

The compiler still emits an ADRP instruction with a position-relative page high 21-bit address relocation (R_AARCH64_ADR_PREL_PG_HI21). However, there is only one such instruction where there were previously two, and it refers to a third symbol, _GLOBAL_OFFSET_TABLE_, instead of foo and bar.

All accesses to the GOT are done through the same base register. That register contains the address of the 4 KiB page where the GOT starts.

This saves one instruction per access to the GOT, after the first one. But it limits the GOT size to the range of the 64-bit load GOT page low 15-bit relocation type (R_AARCH64_LD64_GOTPAGE_LO15). Such a relocation can generate a byte offset between 0 and 32760, in multiples of 8. Considering that the GOT might start at any offset between 0 and 4088 (4096 minus 8 bytes) within its page, the size of the GOT cannot safely exceed 32760 minus 4088 bytes, which equals 28672 bytes or 28 KiB (i.e. 3584 entries).
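The arithmetic behind the 28 KiB figure can be written down as a short C sketch. The constants come straight from the paragraph above; the identifiers are invented for this illustration and do not correspond to anything in GCC or the linker.

```c
#include <stdbool.h>
#include <stdint.h>

enum {
        GOT_ENTRY_SIZE   = 8,     /* one 64-bit pointer per GOT entry */
        LO15_MAX_OFFSET  = 32760, /* largest offset the relocation encodes */
        WORST_GOT_START  = 4096 - GOT_ENTRY_SIZE, /* 4088: last slot in a page */
};

/* GOT size in bytes that stays reachable wherever the GOT starts
 * within its 4 KiB page. */
uint64_t small_pic_safe_got_bytes(void)
{
        return LO15_MAX_OFFSET - WORST_GOT_START;
}

/* Same limit expressed as a number of 8-byte entries. */
uint64_t small_pic_safe_got_entries(void)
{
        return small_pic_safe_got_bytes() / GOT_ENTRY_SIZE;
}

/* Whether the entry at the given index is reachable from the page base,
 * for a GOT starting got_start bytes into its page. */
bool small_pic_can_reach(uint64_t got_start, uint64_t index)
{
        return got_start + index * GOT_ENTRY_SIZE <= LO15_MAX_OFFSET;
}
```

With a GOT that happens to start right at a page boundary, entries up to offset 32760 remain reachable, which is why the limit is stated for the worst case rather than as a flat 32 KiB.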

Large PIC

With large PIC, the generated machine code is structured like the non-PIC code, with one ADRP and one LDR instruction for each object, plus one extra LDR to dereference the pointer read from the GOT:

% aarch64-linux-gnu-objdump -dr foobar-PIC.o

foobar-PIC.o:     file format elf64-littleaarch64


Disassembly of section .text:

0000000000000000 <foobar>:
   0:   90000000        adrp    x0, 0 <foo>
                        0: R_AARCH64_ADR_GOT_PAGE       foo
   4:   f9400000        ldr     x0, [x0]
                        4: R_AARCH64_LD64_GOT_LO12_NC   foo
   8:   b9400002        ldr     w2, [x0]
   c:   90000001        adrp    x1, 0 <bar>
                        c: R_AARCH64_ADR_GOT_PAGE       bar
  10:   f9400021        ldr     x1, [x1]
                        10: R_AARCH64_LD64_GOT_LO12_NC  bar
  14:   b9400020        ldr     w0, [x1]
  18:   0b000040        add     w0, w2, w0
  1c:   d65f03c0        ret

The relocation types are different, however: they are GOT-specific. As the relocated page address can now vary from one object to the next, the GOT size is no longer constrained by the range of the LDR relocation, and the GOT can be as large as the memory model allows.

And there is the difference between pic and PIC.
