Use new ARM atomic opcodes where available, for DR and even for the app
These are ideas from @algr:
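ARMv8.1 (the Large System Extensions, LSE) adds single-instruction atomic add opcodes with optional acquire and/or release semantics: LDADDA (acquire), LDADDL (release), LDADDAL (acquire-release), the corresponding STADD* forms, etc. These are simpler to use than the load-exclusive/store-exclusive (exclusive monitor) retry loops in DR's own code, and preferable to them. DR would need to detect at runtime whether the underlying processor supports them.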
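A minimal C sketch of what runtime selection could look like, assuming a Linux/AArch64 host where the kernel reports LSE support through the `HWCAP_ATOMICS` bit of `getauxval(AT_HWCAP)`. It contrasts the single `ldaddal` instruction with the equivalent `ldaxr`/`stlxr` exclusive-monitor loop. The names `lse_atomics_init`, `atomic_add_lse`, `atomic_add_exclusive`, and `atomic_add_dispatch` are hypothetical and are not part of DR. On non-AArch64 hosts the sketch falls back to the GCC/Clang `__atomic_fetch_add` builtin so that it still builds.

```c
#include <stdbool.h>
#include <stdint.h>
#if defined(__aarch64__) && defined(__linux__)
#  include <sys/auxv.h>
#  ifndef HWCAP_ATOMICS
#    define HWCAP_ATOMICS (1UL << 8) /* Linux arm64 uapi bit for LSE atomics. */
#  endif
#endif

/* Cached once at init: does the CPU implement the ARMv8.1 LSE atomics? */
static bool use_lse_atomics;

/* Hypothetical init hook: query the kernel-reported CPU features once. */
static void
lse_atomics_init(void)
{
#if defined(__aarch64__) && defined(__linux__)
    use_lse_atomics = (getauxval(AT_HWCAP) & HWCAP_ATOMICS) != 0;
#else
    use_lse_atomics = false;
#endif
}

#if defined(__aarch64__)
/* ARMv8.1 path: one LDADDAL (acquire + release) returns the old value. */
static int64_t
atomic_add_lse(volatile int64_t *addr, int64_t val)
{
    int64_t old;
    __asm__ __volatile__(".arch_extension lse\n\t"
                         "ldaddal %x[val], %x[old], %[mem]"
                         : [old] "=r"(old), [mem] "+Q"(*addr)
                         : [val] "r"(val)
                         : "memory");
    return old;
}

/* ARMv8.0 path: LDAXR/STLXR loop, retried until the store-exclusive succeeds. */
static int64_t
atomic_add_exclusive(volatile int64_t *addr, int64_t val)
{
    int64_t old, sum;
    uint32_t failed;
    __asm__ __volatile__("1: ldaxr %x[old], %[mem]\n\t"
                         "add %x[sum], %x[old], %x[val]\n\t"
                         "stlxr %w[failed], %x[sum], %[mem]\n\t"
                         "cbnz %w[failed], 1b"
                         : [old] "=&r"(old), [sum] "=&r"(sum),
                           [failed] "=&r"(failed), [mem] "+Q"(*addr)
                         : [val] "r"(val)
                         : "memory");
    return old;
}
#endif

/* Atomically adds val to *addr and returns the value *addr held before. */
static int64_t
atomic_add_dispatch(volatile int64_t *addr, int64_t val)
{
#if defined(__aarch64__)
    return use_lse_atomics ? atomic_add_lse(addr, val)
                           : atomic_add_exclusive(addr, val);
#else
    /* Non-AArch64 build hosts: let the compiler choose the sequence. */
    return __atomic_fetch_add(addr, val, __ATOMIC_ACQ_REL);
#endif
}
```

The feature check runs once at init and is cached, since the answer cannot change while the process runs; each call then pays only a predictable branch.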
Furthermore, we could potentially translate some exclusive monitor loops in the application to use these new atomic opcodes instead, avoiding the exclusive monitor instrumentation issue #1698.
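Translating application loops could start with a peephole match on the most common shape of such a loop. Below is a minimal C sketch over a toy instruction record; `toy_insn_t`, `toy_fetch_add_t`, `match_exclusive_add_loop`, and `format_replacement` are all hypothetical and are not DR's `instr_t` or any DR API. It recognizes an `ldaxr`/`add`/`stlxr`/`cbnz` loop that retries back to the `ldaxr`, and produces an equivalent `ldaddal` sequence that also recomputes the sum register and zeroes the status register, because the application may read them after the loop. It conservatively rejects loops whose registers overlap.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Toy decoded instruction (NOT DR's instr_t); registers are numbers 0..30. */
typedef struct {
    const char *opcode; /* "ldaxr", "add", "stlxr", "cbnz", ... */
    int dst;            /* ldaxr/add: destination; stlxr: status register */
    int src1;           /* add: first source; stlxr: data; cbnz: tested reg */
    int src2;           /* add: second source */
    int base;           /* ldaxr/stlxr: address base register */
    int target;         /* cbnz: index of the branch target in the block */
} toy_insn_t;

/* Operands of the equivalent single-instruction fetch-and-add. */
typedef struct {
    int old_reg; /* Xt: receives the previous memory value */
    int addend;  /* Xm: value added */
    int sum_reg; /* Xs: holds old + addend after the loop */
    int status;  /* Ws: zero after the loop exits */
    int base;    /* Xn: address register */
} toy_fetch_add_t;

static bool
match_exclusive_add_loop(const toy_insn_t *code, int start, int len,
                         toy_fetch_add_t *out)
{
    if (start < 0 || start + 4 > len)
        return false;
    const toy_insn_t *ld = &code[start], *add = &code[start + 1];
    const toy_insn_t *st = &code[start + 2], *br = &code[start + 3];
    if (strcmp(ld->opcode, "ldaxr") != 0 || strcmp(add->opcode, "add") != 0 ||
        strcmp(st->opcode, "stlxr") != 0 || strcmp(br->opcode, "cbnz") != 0)
        return false;
    int t = ld->dst, n = ld->base, s = add->dst, m = add->src2, w = st->dst;
    /* Data flow must be Xs = Xt + Xm, Xs stored back to the same [Xn],
     * and a retry to the LDAXR while the status Ws is nonzero. */
    if (add->src1 != t || st->src1 != s || st->base != n || br->src1 != w ||
        br->target != start)
        return false;
    /* Conservatively require five distinct registers. */
    int regs[5] = { t, n, s, m, w };
    for (int i = 0; i < 5; i++) {
        for (int j = i + 1; j < 5; j++) {
            if (regs[i] == regs[j])
                return false;
        }
    }
    out->old_reg = t;
    out->addend = m;
    out->sum_reg = s;
    out->status = w;
    out->base = n;
    return true;
}

/* Renders the replacement sequence, preserving the loop's register results. */
static void
format_replacement(const toy_fetch_add_t *fa, char *buf, size_t size)
{
    snprintf(buf, size, "ldaddal x%d, x%d, [x%d]; add x%d, x%d, x%d; mov w%d, #0",
             fa->addend, fa->old_reg, fa->base, fa->sum_reg, fa->old_reg,
             fa->addend, fa->status);
}
```

For the loop `ldaxr x1, [x0]; add x2, x1, x3; stlxr w4, x2, [x0]; cbnz w4, <loop>`, the sketch would emit `ldaddal x3, x1, [x0]; add x2, x1, x3; mov w4, #0`. A real translation would additionally have to confirm that the processor supports LSE, check that nothing branches into the middle of the loop, and handle the other exclusive variants (LDXR/STXR without ordering, 32-bit forms).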