I've been trying to analyze certain applications with performance counters on an Opteron 6172 running Red Hat Enterprise Linux Workstation release 6.2 (Santiago).
I'm using PAPI v184.108.40.206, which uses the AMD native event CPU_CLK_UNHALTED for counting total cycles and DATA_CACHE_ACCESSES for counting L1 data cache accesses. The two events are described as follows.
CPU_CLK_UNHALTED: The number of clocks that the CPU is not in a halted state (due to STPCLK or a HLT instruction). Note: this event allows system idle time to be automatically factored out from IPC (or CPI) measurements, providing the OS halts the CPU when going idle. If the OS goes into an idle loop rather than halting, such calculations are influenced by the IPC of the idle loop.
DATA_CACHE_ACCESSES: The number of accesses to the data cache for load and store references. This may include certain microcode scratchpad accesses, although these are generally rare. Each increment represents an eight-byte access, although the instruction may only be accessing a portion of that. This event is a speculative event.
The problem I've been experiencing is that, in some cases, the number of L1 data cache accesses is higher than the total number of cycles. To my understanding, a cache access does not halt the CPU, so the time spent on accesses should fit within the total cycles. Also, when I divide the total cycles by the clock frequency of the Opteron 6172, I get a pretty accurate estimate of the runtime, which makes me think the total cycle count is OK and the problem has to be with the counting of the data cache accesses.
I understand that a core can issue two cache loads/stores per cycle, but the cost of even half the accesses would be too great to fit within the total cycles.
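To make the two checks above concrete, here is a minimal Python sketch of the arithmetic. The counter values are made up for illustration, the 2.1 GHz figure is the nominal clock of the Opteron 6172 and assumes no frequency scaling, and the function names are hypothetical helpers, not part of PAPI.

```python
# Sanity checks on two counter readings. All numbers below are
# hypothetical; substitute the values reported by PAPI.

NOMINAL_HZ = 2.1e9          # Opteron 6172 nominal clock (assumes no frequency scaling)
MAX_ACCESSES_PER_CYCLE = 2  # per-core load/store issue limit mentioned in the post


def estimated_runtime_s(cycles, hz=NOMINAL_HZ):
    """Runtime implied by an unhalted-cycle count at a fixed clock."""
    return cycles / hz


def accesses_per_cycle(accesses, cycles):
    """Average data cache accesses per unhalted cycle."""
    return accesses / cycles


def exceeds_issue_limit(accesses, cycles, limit=MAX_ACCESSES_PER_CYCLE):
    """True if the average access rate is above the assumed issue limit."""
    return accesses_per_cycle(accesses, cycles) > limit


if __name__ == "__main__":
    cycles = 4_200_000_000    # made-up CPU_CLK_UNHALTED reading
    accesses = 5_000_000_000  # made-up DATA_CACHE_ACCESSES reading
    print(f"estimated runtime:   {estimated_runtime_s(cycles):.2f} s")
    print(f"accesses per cycle:  {accesses_per_cycle(accesses, cycles):.2f}")
    print(f"exceeds issue limit: {exceeds_issue_limit(accesses, cycles)}")
```

With readings like these, the access count is larger than the cycle count, yet the average rate is still below two accesses per cycle. This only checks the issue rate; it says nothing about the latency cost of each access.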
Any help, or an explanation of why this can occur, is greatly appreciated. Thanks in advance!