Expand A64 SVE scatter/gather memory access instructions

Arm has a rich set of SVE instructions [1]. For clients that need to instrument memory references (e.g. drcachesim), we need to expand them into scalar loads and stores, as we did for x86 scatter/gather in #2985. These instructions have many variants; I'm summarising my understanding below. [1] and the Scalable Vector Extension material in the Arm manual [2] have a more detailed discussion.

There are multiple ways in which the memory address can be specified. These are all predicated loads/stores, meaning that each element may be active or inactive, depending on a predicate register. They have either the LD1* or ST1* prefix.

Scalar + immediate: For contiguous access. The memory address is generated from a 64-bit scalar base and an immediate index.
Scalar + scalar: For contiguous access. The memory address is generated from a 64-bit scalar base and a scalar index, which is added to the base address.
Scalar + vector: For possibly non-contiguous access, also known as gather load/scatter store. The memory addresses are generated from a 64-bit scalar base plus a vector index.
Vector + immediate: For possibly non-contiguous access, also known as gather load/scatter store. The memory addresses are generated from a vector base plus an immediate index.
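To make the four schemes concrete, here is a rough sketch of how I understand the per-element address to be formed in each case. The helper names and parameters are hypothetical (they are not DynamoRIO or Arm APIs), and the scaling of the immediate and scalar index is my reading of the architecture, so treat it as an assumption to check against the manual.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only; not DynamoRIO API.
 * mem_bytes is the size of one element in memory, e is the element number. */

/* Scalar + immediate: assumes the immediate counts whole vectors (MUL VL),
 * so it is scaled by num_elems * mem_bytes before the element offset. */
static inline uint64_t
sketch_addr_scalar_imm(uint64_t base, int64_t imm, size_t num_elems,
                       size_t mem_bytes, size_t e)
{
    assert(e < num_elems);
    return base + (uint64_t)(imm * (int64_t)(num_elems * mem_bytes)) +
        e * mem_bytes;
}

/* Scalar + scalar: assumes the scalar index is in units of elements. */
static inline uint64_t
sketch_addr_scalar_scalar(uint64_t base, uint64_t index, size_t mem_bytes,
                          size_t e)
{
    return base + (index + e) * mem_bytes;
}

/* Scalar + vector: each element supplies its own offset, optionally shifted. */
static inline uint64_t
sketch_addr_scalar_vector(uint64_t base, const uint64_t *offsets,
                          unsigned shift, size_t e)
{
    return base + (offsets[e] << shift);
}

/* Vector + immediate: each element supplies its own base address. */
static inline uint64_t
sketch_addr_vector_imm(const uint64_t *bases, uint64_t imm, size_t mem_bytes,
                       size_t e)
{
    return bases[e] + imm * mem_bytes;
}
```

In the first two schemes only the element number varies, which is why the addresses are contiguous; in the last two, each element reads its own value from a vector register, so the addresses can land anywhere.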

There are variants with different element sizes (unsigned double-word; signed and unsigned byte, halfword, word).

Faults for inactive elements are always suppressed. The load instructions come in different variants depending on how faults for active elements are treated: besides the usual variant, each of the above has a “first-fault” variant (which faults only for the first active element) and a “non-fault” variant.
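A small sketch of how I understand the three fault behaviours, since it affects which element accesses an expansion may emit. Everything here is hypothetical: the enum, the function name and the `bad` array (which stands in for "this element's address would fault") are made up for illustration. The assumption that a suppressed fault stops the load at that element is my reading, not something confirmed here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for illustration only; not DynamoRIO API. */
typedef enum {
    SKETCH_FAULT_NORMAL, /* Any active faulting element raises a fault. */
    SKETCH_FAULT_FIRST,  /* Only the first active element may raise a fault. */
    SKETCH_FAULT_NONE    /* No element raises a fault. */
} sketch_fault_mode_t;

/* Returns the element whose fault is delivered, or -1 if none is delivered.
 * *stop_at receives the element at which the load stops (n if it completes). */
static inline int
sketch_delivered_fault(sketch_fault_mode_t mode, const bool *active,
                       const bool *bad, size_t n, size_t *stop_at)
{
    bool seen_active = false;
    assert(stop_at != NULL);
    for (size_t e = 0; e < n; e++) {
        if (!active[e])
            continue; /* Faults for inactive elements are always suppressed. */
        if (bad[e]) {
            bool deliver = mode == SKETCH_FAULT_NORMAL ||
                (mode == SKETCH_FAULT_FIRST && !seen_active);
            *stop_at = e;
            return deliver ? (int)e : -1;
        }
        seen_active = true;
    }
    *stop_at = n;
    return -1;
}
```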

For “scalar + scalar” and “scalar + immediate” instructions, there are variants that read or write 2/3/4 contiguous elements, each to or from the same element number in 2/3/4 vector registers. These have the LDN* or STN* prefix, where N=2/3/4.
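For the LDN*/STN* variants, the layout described above implies that memory holds N-element groups back to back. A minimal sketch, assuming a plain base address with no extra offset; the function name is hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for illustration only; not DynamoRIO API.
 * Address of the value that goes to element number e of register r
 * (0 <= r < nregs) for an LDN/STN-style access. */
static inline uint64_t
sketch_struct_elem_addr(uint64_t base, size_t nregs, size_t mem_bytes,
                        size_t e, size_t r)
{
    assert(nregs >= 2 && nregs <= 4 && r < nregs);
    /* Group e starts at e * nregs elements; r selects within the group. */
    return base + (e * nregs + r) * mem_bytes;
}
```

For example, with N=3 and 4-byte elements, element 2 of the second register (r=1) would come from byte offset (2*3 + 1)*4 = 28.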

There are also some un-predicated instructions (LDR and STR) that use the "scalar + immediate" scheme to load/store vectors or predicate registers.

The x86 scatter/gather that we handled in #2985 is the "scalar + vector" variant with regular faulting behaviour. More work will be required to adapt drx_expand_scatter_gather to the other variants.

We could model the contiguous access variants as a single memory access with a larger size. But this is not a correct model: each element can be active or inactive based on the predicate register, so the memory addresses that end up being accessed can be non-contiguous. It would be correct to model them as scatter/gather, using multiple element-sized accesses.
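A sketch of what that element-wise model could look like for a contiguous access. The struct and function are hypothetical and only illustrate the idea; this is not how drx_expand_scatter_gather is implemented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one scalar access; not a DynamoRIO type. */
typedef struct {
    uint64_t addr;
    size_t size;
} sketch_memref_t;

/* Emits one element-sized access per active element and returns how many
 * were written to out, which must have room for n entries. */
static inline size_t
sketch_expand_contiguous(uint64_t base, size_t elem_bytes, const bool *active,
                         size_t n, sketch_memref_t *out)
{
    size_t count = 0;
    assert(out != NULL);
    for (size_t e = 0; e < n; e++) {
        if (!active[e])
            continue; /* Inactive elements do not touch memory. */
        out[count].addr = base + e * elem_bytes;
        out[count].size = elem_bytes;
        count++;
    }
    return count;
}
```

With four 4-byte elements and only elements 0 and 3 active, this yields two accesses at base and base+12, whereas a single 16-byte access would also report the 8 bytes in between that are never touched.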