rare interruptions between instrumentation commit point and app instr execution cause tool inaccuracies
Most tools execute their added instrumentation prior to the execution of the application instruction being observed. If an interruption arrives in between the instrumentation and the app instruction, the tool can think the app instruction did execute, and when control resumes at the same instruction it can double-count the instruction or otherwise corrupt whatever state its instrumentation is tracking. This is a general issue with the instrumentation and its app instruction not being one atomic unit.
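
To make the double-counting concrete, here is a minimal standalone model of it. It is not DR code: `sim_state_t` and `run_instrumented_instr` are hypothetical names, and the "interruption" is simulated by simply re-running the instrumentation before the app instruction gets to execute.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model, not DR code. */
typedef struct {
    unsigned long tool_count; /* What the tool believes executed. */
    unsigned long app_count;  /* What the app actually executed. */
} sim_state_t;

/* Runs one instrumented app instruction.  'interruptions' is the number of
 * times an interruption arrives after the instrumentation but before the app
 * instruction; each time, control resumes at the start of the same
 * instrumentation + app instruction pair. */
static void
run_instrumented_instr(sim_state_t *s, int interruptions)
{
    assert(s != NULL && interruptions >= 0);
    for (;;) {
        s->tool_count++; /* Instrumentation runs first and commits. */
        if (interruptions > 0) {
            interruptions--; /* Interrupted: the app instr did not run. */
            continue;        /* Resume at the same instruction. */
        }
        s->app_count++; /* The app instruction finally executes. */
        break;
    }
}
```

With one interruption in the window, the tool's count advances by two while the app executes the instruction only once.
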
For always-asynchronous signals, DR always delays delivery until after a block finishes executing. And for synchronous signals, the interrupted instruction is re-executed, so this problem never shows up for "normal" signals. It could happen for a not-normally-asynchronous signal sent asynchronously: e.g., a SIGSEGV sent via SYS_kill. (Some parts of DR's signal code check is_sys_kill(), but the decision of whether to deliver now does not, today.)
This can also happen when DR relocates a thread, if its SIGUSR2 arrives in between the instru and the app instruction. Here, the client can refuse to relocate.
The same interruptions could happen at any point during a multi-instruction instrumentation sequence and not just right in between the end of the instrumentation and the app instr. For some clients this is fine since they don't "commit" their observation until the final couple of instructions: e.g., a tracing tool not updating its buffer pointer until the end.
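
As an illustration of the commit-at-the-end pattern, here is a minimal standalone sketch of a trace buffer whose pointer update is the last step. All names (`trace_buf_t`, `trace_record`, `stop_after_steps`) are hypothetical and the interruption is simulated by returning early; this is not taken from any DR tool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TRACE_CAPACITY 16

/* Hypothetical trace buffer: entries[0..next) are committed. */
typedef struct {
    uintptr_t entries[TRACE_CAPACITY];
    size_t next;
} trace_buf_t;

/* Models a two-step instrumentation sequence.  'stop_after_steps' is how many
 * steps run before a simulated interruption: 0 or 1 means interrupted
 * mid-sequence, 2 means the sequence completed. */
static void
trace_record(trace_buf_t *buf, uintptr_t pc, int stop_after_steps)
{
    assert(buf != NULL && buf->next < TRACE_CAPACITY);
    if (stop_after_steps < 1)
        return;
    buf->entries[buf->next] = pc; /* Step 1: fill the slot; not yet visible. */
    if (stop_after_steps < 2)
        return;
    buf->next++; /* Step 2: the commit point. */
}
```

An interruption before step 2 leaves `next` unchanged, so re-executing the sequence just overwrites the same slot and nothing is recorded twice. The window between step 2 and the app instruction itself is still exposed.
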
For the signal case: DR still needs to call the client's restore_state event. The client could conceivably roll back its instru actions, if it can tell whether it already executed them. It could decode the raw cache instructions to figure this out, though this gets hacky and is not always foolproof; or, if we implement i#3801, it could look at its own IR metadata.
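
A rough sketch of what such a rollback decision could look like, assuming the client somehow knows where its commit point and the app instruction sit within the sequence. Everything here is hypothetical (`instru_layout_t`, `commit_needs_rollback`, and `on_restore_state` are not DR APIs), and how the client obtains the offsets, whether by decoding the cache or via metadata, is exactly the open question above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical layout of one instrumentation sequence plus its app
 * instruction, as byte offsets from the start of the sequence. */
typedef struct {
    size_t commit_end_offset; /* First offset after the committing instr. */
    size_t app_instr_offset;  /* Offset of the app instruction itself. */
} instru_layout_t;

/* 'interrupted_offset' is the offset of the next instruction that would have
 * executed.  A rollback is needed only if the commit already ran but the app
 * instruction did not. */
static bool
commit_needs_rollback(const instru_layout_t *layout, size_t interrupted_offset)
{
    assert(layout != NULL);
    assert(layout->commit_end_offset <= layout->app_instr_offset);
    return interrupted_offset >= layout->commit_end_offset &&
        interrupted_offset <= layout->app_instr_offset;
}

/* Hypothetical stand-in for a client's restore-state handler for an
 * instruction-counting tool: undo the one premature increment. */
static void
on_restore_state(const instru_layout_t *layout, size_t interrupted_offset,
                 unsigned long *instr_count)
{
    if (commit_needs_rollback(layout, interrupted_offset) && *instr_count > 0)
        (*instr_count)--;
}
```

For a simple counter the undo is a decrement; tools with richer state would need a matching inverse for each committed action, which is where the added complexity comes from.
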
The action items here are to document this more clearly, and perhaps implement handling in our provided tool restore_state events? This adds complexity though. We could look into whether we could disallow this from ever happening: add the is_sys_kill() check (which for modern Linux kernels should be solid) to record_pending_signal(), and for relocation we can have DR detect this and re-try -- though that may lead to a lot of re-tries for heavy-instrumentation clients; most signals are likely to arrive in the middle of instrumentation.