online drcachesim is too slow (8x slower than cachegrind): regression?
This issue is about improving the overall drcachesim online analysis performance. Xref #1738 on optimizing the cache simulation code. Xref #2001 on optimizing the tracer.
Running SPEC2006 mcf on the test input, online drcachesim is currently very slow: about 8x slower than cachegrind in wall-clock time (3:21 vs. 0:24) (!!). I believe this is a big regression, since I recall running this same performance test years ago and seeing times in the 20-second range. Even the basic_counts tool is slow (1:45), which shows this problem is separate from #1738. I have not yet profiled this; that would be the first step.
This is a 64-bit release build:
$ /usr/bin/time ./mcf ../../data/test/input/inp.in
MCF SPEC CPU2006 version 1.10
Copyright (c) 1998-2000 Zuse Institut Berlin (ZIB)
Copyright (c) 2000-2002 Andreas Loebel & ZIB
Copyright (c) 2003-2005 Andreas Loebel
nodes : 5985
active arcs : 102404
simplex iterations : 63475
objective value : 3180065918
new implicit arcs : 2393292
active arcs : 2495696
simplex iterations : 118645
objective value : 2060055866
erased arcs : 2387557
checksum : 2997477
done
1.58user 0.04system 0:01.70elapsed 95%CPU (0avgtext+0avgdata 159752maxresident)k
$ /usr/bin/time /work/dr/git/build_x64_rel_tests/bin64/drrun -- ./mcf ../../data/test/input/inp.in
<...>
1.66user 0.03system 0:01.71elapsed 98%CPU (0avgtext+0avgdata 162812maxresident)k
$ /usr/bin/time /work/dr/git/build_x64_rel_tests/bin64/drrun -t drcachesim -- ./mcf ../../data/test/input/inp.in
<...>
---- <application exited with code 0> ----
Cache simulation results:
Core #0 (1 thread(s))
L1I stats:
Hits: 3,367,529,621
Misses: 1,830
Invalidations: 0
Miss rate: 0.00%
L1D stats:
Hits: 1,002,703,388
Misses: 369,946,321
Invalidations: 0
Prefetch hits: 47,349,279
Prefetch misses: 322,597,042
Miss rate: 26.95%
Core #1 (0 thread(s))
Core #2 (0 thread(s))
Core #3 (0 thread(s))
LL stats:
Hits: 356,111,139
Misses: 13,837,012
Invalidations: 0
Prefetch hits: 254,939,620
Prefetch misses: 67,657,422
Local miss rate: 3.74%
Child hits: 4,417,582,288
Total miss rate: 0.29%
224.86user 16.13system 3:21.27elapsed 119%CPU (0avgtext+0avgdata 167692maxresident)k
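As a sanity check on the simulator output above, the reported miss rates can be recomputed from the raw counters. This is a minimal POSIX shell sketch; `pct` is a hypothetical helper, and the formulas (in particular that "Child hits" is the sum of L1I hits, L1D hits, and L1D prefetch hits) are inferred from the numbers in this run rather than taken from the drcachesim source.

```shell
#!/bin/sh
# Recompute the drcachesim miss rates shown above from the raw counters.
# Assumption: these formulas are inferred from this run's numbers, not
# copied from the drcachesim implementation.
pct() {
  # pct NUMERATOR DENOMINATOR -> percentage with two decimals
  awk -v n="$1" -v d="$2" 'BEGIN { printf "%.2f\n", 100 * n / d }'
}

l1i_hits=3367529621
l1d_hits=1002703388
l1d_misses=369946321
l1d_pf_hits=47349279
ll_hits=356111139
ll_misses=13837012

# L1D miss rate: demand misses over demand accesses (hits + misses).
l1d_rate=$(pct "$l1d_misses" $((l1d_hits + l1d_misses)))
# LL local miss rate: misses over the accesses that reached the LL.
ll_local=$(pct "$ll_misses" $((ll_hits + ll_misses)))
# "Child hits" appears to be L1I hits + L1D hits + L1D prefetch hits.
child_hits=$((l1i_hits + l1d_hits + l1d_pf_hits))
# Total miss rate: LL misses over all accesses to the whole hierarchy.
ll_total=$(pct "$ll_misses" $((child_hits + ll_hits + ll_misses)))

echo "L1D miss rate:      ${l1d_rate}%"
echo "Child hits:         ${child_hits}"
echo "LL local miss rate: ${ll_local}%"
echo "LL total miss rate: ${ll_total}%"
```

With these inputs it prints 26.95%, 4417582288, 3.74%, and 0.29%, matching the report, so the simulator's accounting looks self-consistent and the problem is purely one of speed.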
$ /usr/bin/time /work/dr/git/build_x64_rel_tests/bin64/drrun -t drcachesim -simulator_type basic_counts -- ./mcf ../../data/test/input/inp.in
<...>
---- <application exited with code 0> ----
Basic counts tool results:
Total counts:
3223195651 total (fetched) instructions
433 total non-fetched instructions
0 total prefetches
1178112041 total data loads
194412943 total data stores
1 total threads
28369030 total scheduling markers
0 total transfer markers
0 total function id markers
0 total function return address markers
0 total function argument markers
0 total function return value markers
0 total other markers
Thread 18131 counts:
3223195651 (fetched) instructions
433 non-fetched instructions
0 prefetches
1178112041 data loads
194412943 data stores
28369030 scheduling markers
0 transfer markers
0 function id markers
0 function return address markers
0 function argument markers
0 function return value markers
0 other markers
124.72user 13.88system 1:44.85elapsed 132%CPU (0avgtext+0avgdata 167700maxresident)k
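To put the basic_counts time in per-record terms, this rough sketch divides the number of instructions, loads, and stores by the elapsed wall-clock time. Counting exactly those three kinds of entries is an assumption (markers and non-fetched instructions are ignored), and the tracer and analyzer overlap (note the 132% CPU), so this is only an end-to-end figure, not a per-component cost.

```shell
#!/bin/sh
# Rough end-to-end throughput of the online basic_counts run above.
# Assumption: one trace record per fetched instruction, data load, and
# data store; markers and non-fetched instructions are ignored.
instrs=3223195651
loads=1178112041
stores=194412943
elapsed=104.85   # 1:44.85 from /usr/bin/time, converted to seconds

records=$((instrs + loads + stores))
# Millions of records per second, one decimal place.
mrps=$(awk -v r="$records" -v t="$elapsed" 'BEGIN { printf "%.1f\n", r / t / 1e6 }')
# Wall-clock nanoseconds per record.
ns_per=$(awk -v r="$records" -v t="$elapsed" 'BEGIN { printf "%.1f\n", t * 1e9 / r }')

echo "records:         $records"
echo "throughput:      ${mrps}M records/s"
echo "time per record: ${ns_per} ns"
```

This comes out to roughly 43.8M records/s, or about 22.8 ns per record, for a tool that does little more than increment counters.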
$ /usr/bin/time valgrind --tool=cachegrind -- ./mcf ../../data/test/input/inp.in
==18032== Cachegrind, a cache and branch-prediction profiler
==18032== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==18032== Using Valgrind-3.14.0 and LibVEX; rerun with -h for copyright info
==18032== Command: ./mcf ../../data/test/input/inp.in
==18032==
--18032-- warning: L3 cache found, using its data for the LL simulation.
<...>
==18032==
==18032== I refs: 3,223,256,389
==18032== I1 misses: 1,764
==18032== LLi misses: 1,741
==18032== I1 miss rate: 0.00%
==18032== LLi miss rate: 0.00%
==18032==
==18032== D refs: 1,346,785,609 (1,178,206,897 rd + 168,578,712 wr)
==18032== D1 misses: 437,406,928 ( 423,864,952 rd + 13,541,976 wr)
==18032== LLd misses: 81,227,588 ( 73,769,939 rd + 7,457,649 wr)
==18032== D1 miss rate: 32.5% ( 36.0% + 8.0% )
==18032== LLd miss rate: 6.0% ( 6.3% + 4.4% )
==18032==
==18032== LL refs: 437,408,692 ( 423,866,716 rd + 13,541,976 wr)
==18032== LL misses: 81,229,329 ( 73,771,680 rd + 7,457,649 wr)
==18032== LL miss rate: 1.8% ( 1.7% + 4.4% )
24.04user 0.05system 0:24.20elapsed 99%CPU (0avgtext+0avgdata 180484maxresident)k
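The elapsed times reported by /usr/bin/time in the runs above can be turned into slowdown ratios with a small shell sketch. `to_secs` and `ratio` are hypothetical helpers, and the times are copied by hand from this issue.

```shell
#!/bin/sh
# Convert GNU time's [h:]m:ss.ss "elapsed" field into seconds.
to_secs() {
  echo "$1" | awk -F: '{ s = 0; for (i = 1; i <= NF; i++) s = s * 60 + $i; printf "%.2f\n", s }'
}
# Ratio of two durations, one decimal place.
ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

native=$(to_secs 0:01.70)
dr_only=$(to_secs 0:01.71)
cachesim=$(to_secs 3:21.27)
counts=$(to_secs 1:44.85)
cachegrind=$(to_secs 0:24.20)

echo "drrun (no tool) vs native:  $(ratio "$dr_only" "$native")x"
echo "drcachesim vs native:       $(ratio "$cachesim" "$native")x"
echo "basic_counts vs native:     $(ratio "$counts" "$native")x"
echo "cachegrind vs native:       $(ratio "$cachegrind" "$native")x"
echo "drcachesim vs cachegrind:   $(ratio "$cachesim" "$cachegrind")x"
echo "basic_counts vs cachegrind: $(ratio "$counts" "$cachegrind")x"
```

This gives about 1.0x for plain DR, 118.4x for drcachesim, 61.7x for basic_counts, and 14.2x for cachegrind relative to native, i.e. drcachesim is 8.3x and basic_counts 4.3x slower than cachegrind. Since plain DR adds essentially nothing, the overhead is in tracing and online analysis.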