Usage of Private Code Cache and Clean Calls confuses DR when main cache is thread private.

Problem: Undefined behaviour occurs when a client places a clean call inside its own private code cache that is shared among threads, while DR is configured to use thread-private caches (via -thread_private) for the main cache. At a high level, DynamoRIO treats the insertion of the clean call as an insertion into a thread-private cache. However, the client's code cache is NOT thread-private: it is shared by all threads.

The problem mainly occurs on 32-bit platforms.

Test Case: The following code is a modified version of the memtrace sample. I could have used the original memtrace code to show the bug, but this version strips out unnecessary overhead for quick testing.

#include <stdio.h>
#include <string.h> /* for memset */
#include <stddef.h> /* for offsetof */
#include "dr_api.h"
#include "drmgr.h"
#include "drreg.h"
#include "drutil.h"
#include "utils.h"

reg_id_t tls_raw_reg;
uint tls_raw_base;
static size_t page_size;
static app_pc code_cache;

static void event_exit(void);
static dr_emit_flags_t
event_bb_insert(void *drcontext, void *tag, instrlist_t *bb, instr_t *instr,
        bool for_trace, bool translating, void *user_data);

static void
code_cache_init(void);
static void
code_cache_exit(void);

DR_EXPORT void dr_client_main(client_id_t id, int argc, const char *argv[]) {

    page_size = dr_page_size();
    drmgr_init();
    drutil_init();

    dr_register_exit_event(event_exit);
    if (!drmgr_register_bb_instrumentation_event(NULL, event_bb_insert, NULL)) {
        /* something is wrong: can't continue */
        DR_ASSERT(false);
        return;
    }

    dr_raw_tls_calloc(&(tls_raw_reg), &(tls_raw_base), 1, 0);
    code_cache_init();
}

static void event_exit() {

    dr_raw_tls_cfree(tls_raw_base, 1);

    code_cache_exit();

    if (!drmgr_unregister_bb_insertion_event(event_bb_insert))
        DR_ASSERT(false);

    drutil_exit();
    drmgr_exit();
}

int test_counter = 0;

static void clean_call(void) {
    test_counter++;
}

static void code_cache_init(void) {
    void *drcontext;
    instrlist_t *ilist;
    instr_t *where;
    byte *end;

    drcontext = dr_get_current_drcontext();
    code_cache = dr_nonheap_alloc(page_size,
    DR_MEMPROT_READ | DR_MEMPROT_WRITE | DR_MEMPROT_EXEC);
    ilist = instrlist_create(drcontext);
    
    opnd_t opnd1 = opnd_create_far_base_disp_ex(tls_raw_reg,
    REG_NULL, REG_NULL, 1, tls_raw_base + (0 * (sizeof(void *))), OPSZ_4, false,
    true, false);

    where = INSTR_CREATE_jmp_ind(drcontext, opnd1);
    instrlist_meta_append(ilist, where);
    /* clean call */
    dr_insert_clean_call(drcontext, ilist, where, (void *) clean_call, false,
            0);

    end = instrlist_encode(drcontext, ilist, code_cache, false);
    DR_ASSERT((size_t )(end - code_cache) < page_size);
    instrlist_clear_and_destroy(drcontext, ilist);
    dr_memory_protect(code_cache, page_size, DR_MEMPROT_READ | DR_MEMPROT_EXEC);
}

static void code_cache_exit(void) {
    dr_nonheap_free(code_cache, page_size);
}

static dr_emit_flags_t event_bb_insert(void *drcontext, void *tag,
        instrlist_t *bb, instr_t *where, bool for_trace, bool translating,
        void *user_data) {

    instr_t *instr, *restore;
    opnd_t opnd1, opnd2;
    restore = INSTR_CREATE_label(drcontext);

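    /* store the address of the restore label in the raw TLS slot, so that
     * the indirect jump at the end of the code cache returns here */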
    opnd1 = opnd_create_far_base_disp_ex(tls_raw_reg,
    REG_NULL, REG_NULL, 1, tls_raw_base + (0 * (sizeof(void *))), OPSZ_4, false,
    true, false);
    opnd2 = opnd_create_instr(restore);
    instr = INSTR_CREATE_mov_imm(drcontext, opnd1, opnd2);
    instrlist_meta_preinsert(bb, where, instr);

    opnd1 = opnd_create_pc(code_cache);
    instr = INSTR_CREATE_jmp(drcontext, opnd1);
    instrlist_meta_preinsert(bb, where, instr);

    instrlist_meta_preinsert(bb, where, restore);

    return DR_EMIT_DEFAULT;
}

The code is fairly simple. For every instruction, the client jumps into a private code cache that was created via dr_nonheap_alloc and is shared by all threads. The cache contains a clean call.

The application under test needs to be multi-threaded. The bug is triggered by applications such as Apache. However, I recommend pigz for testing, as it is a fairly lightweight application and the bug is triggered within seconds.

I run the client as follows:

/home/john/dynamorio/install/bin32/drrun -thread_private -disable_traces -opt_cleancall 2 -c libmemtrace_test.so -- pigz -k -d /home/john/generated.zip

Back-Trace: The back-trace does not give many hints about the root cause of the bug. It indicates that the crash occurs in monitor_cache_enter, but that function is unrelated. In some cases the top function changed from one run to the next.

(gdb) bt
#0  0x7113268f in monitor_cache_enter (dcontext=0x0, f=0x458ee9f0)
    at /home/john/dynamorio/core/monitor.c:1891
#1  0x7108f181 in d_r_dispatch (dcontext=0x0)
    at /home/john/dynamorio/core/dispatch.c:197
#2  0x00000246 in ?? ()
(gdb) info registers
eax            0x0	0
ecx            0x4578a704	1165534980
edx            0xc	12
ebx            0x71365000	1899384832
esp            0x457ece90	0x457ece90
ebp            0x457ecf7c	0x457ecf7c
esi            0x71365000	1899384832
edi            0x4578a704	1165534980
eip            0x7113268f	0x7113268f <monitor_cache_enter+26>
eflags         0x10216	[ PF AF IF RF ]
cs             0x73	115
ss             0x7b	123
ds             0x7b	123
es             0x7b	123
fs             0x43	67
gs             0x33	51

I also analysed the debug logs but did not find anything suspicious at the crash point; the master signal handler is simply invoked to handle the fault.

Root Cause: Eventually, I made some progress by inspecting the code that inserts clean calls, particularly prepare_for_call_ex and cleanup_after_call_ex. These functions access the drcontext either statically (through an absolute address) or dynamically (loaded at run time from TLS), depending on whether caches are shared and whether the architecture is 64-bit. This check is done via SCRATCH_ALWAYS_TLS:

# define SCRATCH_ALWAYS_TLS() (DYNAMO_OPTION(private_ib_in_tls))

The macro simply checks whether the private_ib_in_tls option is set. Note that this option is enabled by default for shared caches and for 64-bit, but not when a 32-bit application is run with thread-private caches.
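
The default described above can be summarised in a small truth-table sketch. This is only a model of my understanding, not DR's option code: the function names, the enum and the two boolean inputs are hypothetical and exist purely for illustration.

```c
#include <stdbool.h>

/* Hypothetical model of the default described above; NOT DR's option code. */
static bool default_private_ib_in_tls(bool is_64bit, bool thread_private)
{
    /* On by default for 64-bit and for shared caches; off only for
     * 32-bit combined with -thread_private. */
    return is_64bit || !thread_private;
}

/* How the clean-call code would reach the drcontext in this model. */
typedef enum { DCONTEXT_ABSOLUTE, DCONTEXT_VIA_TLS } dcontext_access_t;

static dcontext_access_t clean_call_dcontext_access(bool is_64bit,
                                                    bool thread_private)
{
    return default_private_ib_in_tls(is_64bit, thread_private)
               ? DCONTEXT_VIA_TLS
               : DCONTEXT_ABSOLUTE;
}
```

Only the 32-bit, -thread_private combination yields DCONTEXT_ABSOLUTE in this model, which matches the configuration in which I see the crash.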

Regarding the bug, when the clean call is inserted into the client's private code cache, an absolute address pointing to the inserting thread's drcontext is used. However, the code cache is shared, so when another thread enters it, that thread does not dynamically load its own drcontext; instead, it modifies the drcontext whose address was baked in, which belongs to a different thread.
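
To make the aliasing concrete, here is a minimal single-threaded model in plain C. It does not use any DR API: thread_ctx_t, tls_slot, shared_stub_t and the helper functions are all hypothetical stand-ins for the drcontext, the TLS slot and the code emitted into the shared cache.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-thread context; NOT DR's dcontext_t. */
typedef struct {
    int id;
    int saved_state; /* models the slots a clean call spills state into */
} thread_ctx_t;

/* Models the TLS slot: it points at the context of the running thread. */
static thread_ctx_t *tls_slot;

/* Models code emitted once into a cache that every thread executes. */
typedef struct {
    thread_ctx_t *baked; /* absolute address fixed at emit time, or NULL */
} shared_stub_t;

/* Emit the stub on behalf of 'emitter'. With always_tls == 0 the
 * emitter's context address is baked into the stub. */
static shared_stub_t emit_stub(thread_ctx_t *emitter, int always_tls)
{
    shared_stub_t stub;
    stub.baked = always_tls ? NULL : emitter;
    return stub;
}

/* Run the stub as thread 'self'; returns the id of the context written. */
static int run_as(thread_ctx_t *self, const shared_stub_t *stub, int value)
{
    thread_ctx_t *ctx;
    tls_slot = self; /* "context switch" to self */
    ctx = (stub->baked != NULL) ? stub->baked : tls_slot;
    assert(ctx != NULL);
    ctx->saved_state = value;
    return ctx->id;
}
```

In this model, if thread A emits the stub with always_tls == 0 and thread B then runs it, run_as reports A's id and A's saved_state is overwritten. With always_tls == 1, the write lands in B's own context.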

The absolute address is set in the dcontext_opnd_common function, where the absolute parameter is passed as true.

dcontext_opnd_common(dcontext_t *dcontext, bool absolute, reg_id_t basereg, int offs,
                     opnd_size_t size)

Solutions: One quick solution is to document this limitation and tell users to pass the private_ib_in_tls runtime option when clean calls are placed in shared private code caches; the client does not crash with this option. The second solution, perhaps more involved in terms of code changes, is to always use TLS slots and load the drcontext dynamically, thus removing the private_ib_in_tls option altogether.