drmemtrace non-inlined trace-delay instruction counting is racy
drmemtrace's -trace_after_instrs feature does not inline the counter increment for 32-bit x86 and 32-bit ARM. Instead it uses a clean call that does a simple increment of a global variable that is not marked as std::atomic or otherwise synchronized. There is no comment about this. I could imagine deliberately living with the races on x86, perhaps, but on weaker-memory-model ARM it feels possible to be way off for small values with just a few threads.
We could make the global a std::atomic, or take the effort to inline the increment for these 32-bit arches. If we're going to lose accuracy anyway, we may as well do the per-thread counting from #5026 (closed).
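
A rough sketch of the std::atomic option, not the actual drmemtrace code: `instr_count`, `count_instrs`, and `run_counting_threads` are hypothetical names, and the 64-bit counter width is an assumption. A relaxed `fetch_add` should be enough here since only the total matters, not ordering against other memory accesses.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for the global delay counter (name and width assumed).
static std::atomic<uint64_t> instr_count{0};

// Hypothetical stand-in for the clean call target. The atomic read-modify-write
// means concurrent callers cannot lose each other's increments.
static void
count_instrs(uint64_t num_instrs)
{
    instr_count.fetch_add(num_instrs, std::memory_order_relaxed);
}

// Small harness: several threads hammer the counter, then we read the total.
static uint64_t
run_counting_threads(unsigned num_threads, uint64_t calls_per_thread)
{
    instr_count.store(0, std::memory_order_relaxed);
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < num_threads; ++i) {
        threads.emplace_back([calls_per_thread] {
            for (uint64_t j = 0; j < calls_per_thread; ++j)
                count_instrs(1);
        });
    }
    for (auto &t : threads)
        t.join();
    return instr_count.load(std::memory_order_relaxed);
}
```

With a plain `uint64_t` and `instr_count += num_instrs` instead, the same harness can come up short because two threads may read the same old value and both write back old+1. The atomic version keeps the count exact at the cost of contention on one cache line, which is the tradeoff against per-thread counting.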