
assembly - :lower16, :upper16 for aarch64; absolute address into register

I need to put a 32-bit absolute address into a register on AArch64. (e.g. an MMIO address, not PC-relative).

On ARM32 it was possible to use :lower16: and :upper16: to load an address into a register:

movw    r0, #:lower16:my_addr
movt    r0, #:upper16:my_addr

Is there a way to do similar thing on AArch64 by using movk?

If the code is relocated, I still want the same absolute address, so adr is not suitable.

ldr from a nearby literal pool would work, but I'd rather avoid that.
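
For reference, the literal-pool version I'd rather avoid would look something like this (using the ldr reg, =symbol pseudo-instruction):

    ldr     w0, =my_addr      // assembler places my_addr in a nearby literal
                              // pool and emits a PC-relative load from it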



1 Answer


If your address is an assemble-time constant, not link-time, this is super easy. It's just an integer, and you can split it up manually.

I asked gcc and clang to compile unsigned abs_addr() { return 0x12345678; } (Godbolt)

// gcc8.2 -O3
abs_addr():
    mov     w0, 0x5678               // low half
    movk    w0, 0x1234, lsl 16       // high half
    ret

(Writing w0 implicitly zero-extends into the 64-bit x0, the same as on x86-64.)
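
Hand-written, the same split looks something like this. (MMIO_BASE is a hypothetical assemble-time constant; GAS evaluates the mask and shift expressions at assembly time.)

.equ MMIO_BASE, 0x12345678                       // hypothetical device address

    movz    w0, #(MMIO_BASE & 0xffff)                    // low half
    movk    w0, #((MMIO_BASE >> 16) & 0xffff), lsl #16   // high half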


Or if your constant is only a link-time constant and you need to generate relocations in the .o for the linker to fill in, the GAS manual documents what you can do, in the AArch64 machine-specific section:

Relocations for ‘MOVZ’ and ‘MOVK’ instructions can be generated by prefixing the label with #:abs_g2: etc. For example to load the 48-bit absolute address of foo into x0:

    movz x0, #:abs_g2:foo     // bits 32-47, overflow check
    movk x0, #:abs_g1_nc:foo  // bits 16-31, no overflow check
    movk x0, #:abs_g0_nc:foo  // bits  0-15, no overflow check

The GAS manual's example is sub-optimal; going low to high is more efficient on at least some AArch64 CPUs (see below). For a 32-bit constant, follow the same pattern that gcc used for a numeric literal.

 movz x0, #:abs_g0_nc:foo           // bits  0-15, no overflow check
 movk x0, #:abs_g1:foo              // bits 16-31, overflow check

#:abs_g1:foo is known to have its possibly-set bits in the 16-31 range, so the assembler knows to use a lsl 16 when encoding movk. You should not use an explicit lsl 16 here.

I chose x0 instead of w0 because that's what gcc does for unsigned long long. Probably performance is identical on all CPUs, and code size is identical.

.text
func:
   // efficient
     movz x0, #:abs_g0_nc:foo           // bits  0-15, no overflow check
     movk x0, #:abs_g1:foo              // bits 16-31, overflow check

   // inefficient but does assemble + link
   //  movz x1, #:abs_g1:foo              // bits 16-31, overflow check
   //  movk x1, #:abs_g0_nc:foo           // bits  0-15, no overflow check

.data
foo: .word 123       // .data will be in a different page than .text

With GCC: aarch64-linux-gnu-gcc -nostdlib aarch-reloc.s to build and link (just to prove we can; this would just crash if you actually ran it), and then aarch64-linux-gnu-objdump -drwC a.out:

a.out:     file format elf64-littleaarch64


Disassembly of section .text:

000000000040010c <func>:
  40010c:       d2802280        mov     x0, #0x114                      // #276
  400110:       f2a00820        movk    x0, #0x41, lsl #16

Clang appears to have a bug here, making it unusable: it only assembles #:abs_g1_nc:foo (no check for the high half) and #:abs_g0:foo (overflow check for the low half). This is backwards, and results in a linker error (g0 overflow) when foo has a 32-bit address. I'm using clang version 7.0.1 on x86-64 Arch Linux.

$ clang -target aarch64 -c aarch-reloc.s
aarch-reloc.s:5:15: error: immediate must be an integer in range [0, 65535].
     movz x0, #:abs_g0_nc:foo
              ^

As a workaround, g1_nc instead of g1 is fine: you can live without overflow checks. But you need g0_nc, unless you have a linker where checking can be disabled. (Or maybe some clang installs come with a linker that's bug-compatible with the relocations clang emits?) I was testing with GNU ld (GNU Binutils) 2.31.1 and GNU gold (GNU Binutils 2.31.1) 1.16.

$ aarch64-linux-gnu-ld.bfd aarch-reloc.o 
aarch64-linux-gnu-ld.bfd: warning: cannot find entry symbol _start; defaulting to 00000000004000b0
aarch64-linux-gnu-ld.bfd: aarch-reloc.o: in function `func':
(.text+0x0): relocation truncated to fit: R_AARCH64_MOVW_UABS_G0 against `.data'

$ aarch64-linux-gnu-ld.gold aarch-reloc.o 
aarch-reloc.o(.text+0x0): error: relocation overflow in R_AARCH64_MOVW_UABS_G0
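
To spell out the "backwards" part: the only pair clang 7 will assemble puts the overflow check on the low half instead of the high half, which is exactly what triggers the errors above:

    // what clang 7 accepts: the check lands on the wrong half
    movz x0, #:abs_g0:foo      // overflow check on bits 0-15: fails to link
    movk x0, #:abs_g1_nc:foo   // high half unchecked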

MOVZ vs. MOVK vs. MOVN

movz = move-zero puts a 16-bit immediate into a register with a left-shift of 0, 16, 32 or 48 (and clears the rest of the bits). You always want to start a sequence like this with a movz, and then movk the rest of the bits. (movk = move-keep. Move 16-bit immediate into register, keeping other bits unchanged.)
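
For example, building a full 64-bit constant takes one movz plus three movk, low chunk first (the value is just illustrative):

    movz    x0, #0x7788                  // x0 = 0x0000000000007788
    movk    x0, #0x5566, lsl #16         // x0 = 0x0000000055667788
    movk    x0, #0x3344, lsl #32         // x0 = 0x0000334455667788
    movk    x0, #0x1122, lsl #48         // x0 = 0x1122334455667788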

mov is sort of a pseudo-instruction that can pick movz, but I just tested with GNU binutils and clang, and you need an explicit movz (not mov) with an immediate like #:abs_g0:foo. Apparently the assembler won't infer that it needs movz there, unlike with a numeric literal.
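
That is:

    mov  x0, #0x5678            // OK: the assembler picks MOVZ by itself
    // mov  x0, #:abs_g0:foo    // rejected: mov won't infer movz here
    movz x0, #:abs_g0:foo       // OK: explicit movz with a relocation prefix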

For a narrow immediate whose non-zero bits span two aligned 16-bit chunks of the value, e.g. 0xFF000 (bits 12-19), mov w0, #0xFF000 would pick the bitmask-immediate form of mov, which is actually an alias for ORR-immediate with the zero register. AArch64 bitmask-immediates use a powerful encoding scheme for repeated patterns of bit-ranges. (So e.g. and x0, x1, 0x5555555555555555 (keep only the even bits) can be encoded in a single 32-bit-wide instruction, great for bit-hacks.)
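
A quick sketch of those forms:

    mov  w0, #0xFF000                    // alias of orr w0, wzr, #0xFF000
    and  x0, x1, #0x5555555555555555     // repeating 01 pattern: keep even bits
    orr  x2, xzr, #0x5555555555555555    // what mov x2, #0x5555... expands to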

There's also movn (move-not) which flips the bits. This is useful for negative values, where all the upper bits need to be set to 1. There's even a set of relocations for it, listed with the other AArch64 relocation prefixes in the GAS manual.
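
For example:

    movn    x0, #0                  // x0 = ~0          = 0xFFFFFFFFFFFFFFFF (-1)
    movn    x0, #0x1234             // x0 = ~0x1234     = 0xFFFFFFFFFFFFEDCB
    movn    x0, #0x1234, lsl #16    // x0 = ~0x12340000 = 0xFFFFFFFFEDCBFFFF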


Performance: movz low16; movk high16 in that order

The Cortex-A57 optimization manual says:

4.14 Fast literal generation

Cortex-A57 r1p0 and later revisions support optimized literal generation for 32- and 64-bit code

    MOV wX, #bottom_16_bits
    MOVK wX, #top_16_bits, lsl #16

[and other examples]

... If any of these sequences appear sequentially and in the described order in program code, the two instructions can be executed at lower latency and higher bandwidth than if they do not appear sequentially in the program code, enabling 32-bit literals to be generated in a single cycle and 64-bit literals to be generated in two cycles.

The sequences include movz low16 + movk high16 into x or w registers, in that order. (And also back-to-back movk to set the high 32, again in low, high order.) According to the manual, both instructions have to use w, or both have to use x registers.

Without special support, the movk would have to wait for the movz result to be ready as an input for an ALU operation to replace that 16-bit chunk. Presumably at some point in the pipeline, the 2 instructions merge into a single 32-bit immediate movz or movk, removing the dependency chain.

