DynamoRIO / dynamorio · Issue #4481
Closed
Issue created Oct 08, 2020 by Administrator (@root)

ASSERT: Race/crash within DR dispatch when performing unlink_flush during app thread creation

Created by: nextsilicon-itay-bookstein

Describe the bug

I wrote a piece of minimally adaptive instrumentation for indirect branches. To adapt to dynamically discovered indirect branch targets, the implementation calls dr_unlink_flush_region (I also tried delay_flush) from a clean call, flushing the fragment it is about to return to so that my instrumentation code is dynamically reconstructed with the newly discovered target.
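To make the approach concrete, here is a minimal sketch of a DynamoRIO client that does this kind of flush-from-a-clean-call on indirect branches. This is a hypothetical fragment I wrote for illustration, not the reporter's actual client; it requires the DynamoRIO SDK (dr_api.h) and is built as a client library, so it cannot run standalone. The throttling constant and counter are assumptions.

```c
/* Hypothetical sketch, not the reporter's client: flush the current
 * fragment from a clean call inserted at indirect branches.
 * Requires the DynamoRIO SDK; build as a DR client library. */
#include "dr_api.h"

static void
clean_call(app_pc fragment_pc)
{
    static int calls; /* unsynchronized counter; fine for a sketch */
    /* Throttle: flush only every 8th invocation (assumed ratio). */
    if (++calls % 8 == 0)
        dr_unlink_flush_region(fragment_pc, 1);
}

static dr_emit_flags_t
event_bb(void *drcontext, void *tag, instrlist_t *bb, bool for_trace,
         bool translating)
{
    instr_t *instr;
    for (instr = instrlist_first_app(bb); instr != NULL;
         instr = instr_get_next_app(instr)) {
        /* instr_is_mbr covers indirect calls and jumps (and returns,
         * which we skip here). */
        if (instr_is_mbr(instr) && !instr_is_return(instr)) {
            dr_insert_clean_call(drcontext, bb, instr, (void *)clean_call,
                                 false /* no fpstate */, 1,
                                 OPND_CREATE_INTPTR(dr_fragment_app_pc(tag)));
        }
    }
    return DR_EMIT_DEFAULT;
}

DR_EXPORT void
dr_client_main(client_id_t id, int argc, const char *argv[])
{
    dr_register_bb_event(event_bb);
}
```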

A series of spurious crashes in the fragment-unlink flow, application register corruption, and other fun things led me to debug the problem and narrow the repro down to something more minimal. I traced it to the use of dr_unlink_flush_region while the application is creating many threads. Because the failure was flaky/racy/non-deterministic, I tried to force/stress it by unlinking much more aggressively (I originally had a high threshold before deciding to unlink).

In addition, I tried a Debug DR build from the most recent master. The assert I encountered is this:

Internal Error: DynamoRIO debug check failure: ../core/dispatch.c:757 wherewasi == DR_WHERE_FCACHE || wherewasi == DR_WHERE_TRAMPOLINE || wherewasi == DR_WHERE_APP || (dcontext->go_native && wherewasi == DR_WHERE_DISPATCH)

Adding a print revealed that the relevant values for this assert were as follows:

wherewasi = 2 (DR_WHERE_DISPATCH), dcontext->go_native = 0

When I tried delay_flush instead of unlink_flush, I encountered this assert:

Internal Error: DynamoRIO debug check failure: ../core/vmareas.c:9502 false && "stale multi-init entry on frags list"

Because both the application and the DR client are reasonably complex, I narrowed the repro down to a tiny DR client and a tiny application. The tiny application simply creates many threads, each calling printf 100 times. The client calls dr_unlink_flush_region from a clean call out of every indirect call and indirect jump, once every few times the clean call is hit. I had to toy with the thread count and the unlink_flush call ratio to get a good deterministic repro. I've attached the code for the repro.

The attached tar.gz file contains build.sh, CMakeLists.txt, src/repro_client.c and src/repro_app.c. Note that build.sh nukes <script_dir>/build and re-creates it by invoking cmake. unlink_flush_repro.tar.gz

A plain drrun with the provided client and the provided app should trigger the issue. I haven't tested it on multiple machines to ensure that there's no dependence on core count or anything like that.

I can potentially try to debug this further, but at this point I thought asking here would be a good idea :)

Expected behavior

The application should run to completion (albeit slowly).

Versions

  • What version of DR are you using? Top of the master branch as of the day of posting
  • What operating system version are you running on? Debian GNU/Linux 10 (buster)
  • Is your application 32-bit or 64-bit? 64-bit