AMD Developer Central
AMD Developer Forums
Topic Title: memory accesses question
Created On: 10/30/2009 12:53 PM
 10/30/2009 12:53 PM
jyost

Posts: 3
Joined: 10/29/2009

I'm kind of new to CodeAnalyst and I find that I frequently get results that I don't expect or have difficulty interpreting.

Here's an example.  Consider the following little test program:

#include <stdlib.h>
#include <string.h>

int main()
{
    size_t size = 1000000;
    int iters = 1000;

    unsigned char *buf = (unsigned char *)malloc (size);

    register unsigned char sum = 0;

    for (register int i = 0; i < iters; i++)
    {
        for (register int j = 0; j < size; j++)
            sum += buf[j];
    }

    return sum;
}

Compiled as follows:

g++ -o simple simple.cpp

(So - no optimization.)

I would expect this to do only reads, and no (or very few) writes to main memory.  In fact, if I look at events 0x6C (reads) and 0x6D (writes), it seems to do about as many reads as writes, if I'm interpreting the results correctly.  Hmmm ... Maybe "sum" isn't being put in a register, in spite of the "register" keyword.  That's the only theory I have.  But I'm not sure that I believe that.
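As a rough baseline for that expectation, the read traffic implied by the loop bounds can be worked out directly. This is only a sketch: it assumes 64-byte cache lines, and the helper names are made up for illustration. How much of this traffic actually reaches DRAM depends on the cache sizes of the machine.

```cpp
#include <cstdint>

// Assumption: 64-byte cache lines.
constexpr std::uint64_t kCacheLineBytes = 64;

// Total bytes the nested loops read from buf: one byte per inner iteration.
constexpr std::uint64_t total_bytes_read(std::uint64_t size, std::uint64_t iters) {
    return size * iters;
}

// Number of cache lines covered by one pass over the buffer (rounded up).
constexpr std::uint64_t lines_per_pass(std::uint64_t size) {
    return (size + kCacheLineBytes - 1) / kCacheLineBytes;
}
```

With size = 1000000 and iters = 1000, that is 1,000,000,000 bytes read in total and 15,625 cache lines per pass.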

The actual results I got from one run were 7566 for reads, 31897 for writes and 3128 for DRAM accesses - all with a sample period of 10,000.  And ... hmmm ... maybe that sample period should be 500,000.  But, still ...
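For reading numbers like these, a minimal sketch of the usual conversion, assuming the reported values are raw sample counts and that each sample stands for roughly one sample period's worth of events. The function name is hypothetical and the result is an estimate, not an exact count.

```cpp
#include <cstdint>

// Approximate event count from an event-based profile:
// each sample represents roughly `period` occurrences of the event.
constexpr std::uint64_t estimated_events(std::uint64_t samples, std::uint64_t period) {
    return samples * period;
}
```

Under that assumption, 7566 samples at a period of 10,000 correspond to about 75,660,000 events, 31897 samples to about 318,970,000, and 3128 samples to about 31,280,000.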

Another question: Why is event 0xE0 (DRAM accesses) not equal to the sum of event 0x6C (reads) and 0x6D (writes)?

What I'm ultimately trying to determine is if a real program (not the above test) is bumping up against memory bandwidth limits, but I'm not sure which event or events I should look at.  BTW - I have looked at Paul Drongowski's "Basic Performance Measurements ..." document, which is certainly very helpful, but still leaves me with some questions.  (Maybe I'm just thick!)

 10/30/2009 06:11 PM
leiy

Posts: 70
Joined: 06/26/2007

CodeAnalyst (CA) 2.9.5.2-cg launches the application before it starts the profile. Because starting the profile involves a system call that adds a delay, the profile duration varies from run to run.

We will fix this issue.

Thanks for reporting this.



-------------------------
This response is provided for informational purposes only, is provided “AS IS” and does not obligate AMD to provide any of the services, technology, or programs described.
 11/02/2009 10:13 AM
pdrongowski

Posts: 23
Joined: 12/21/2007

Hi --

You're probably not getting the assembler language code that you're
expecting from the compiler. Here are some quick results using G++
version 4.1.2 on SLES.

The assembler language output was generated using the -S option.
The first example was generated with the command:
    g++ -S -o simple simple.cpp
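The -S option asks the compiler to stop after generating the assembler
language code instead of assembling and linking it. With -o simple the
output is written to the file simple; without -o it would be simple.s.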

-- pj

P.S. I'll be sending two examples in the next replies. I'm just trying to keep each reply short.

 



-------------------------
The information presented in this document is for
informational purposes only and may contain technical
inaccuracies, omissions and typographical errors. Links
to third party sites are for convenience only, and no
endorsement is implied.
 11/02/2009 10:22 AM
pdrongowski

Posts: 23
Joined: 12/21/2007

As you mentioned, the default optimization level is -O0, no optimization.
The keyword "register" is really a hint to the compiler that the variable
will be frequently used. The compiler is free to use or ignore the hint.
Since optimization is turned off, the compiler ignores the hint and
allocates the variables into stack (memory) locations.

I've annotated the assembler program with the variable to stack location
bindings.

 

 

Code:
***************************
-O0 optimization
***************************

-8(%rbp)  == Base address of the array (buf)
-12(%rbp) == Outer loop bound 1000 (iters)
-24(%rbp) == Inner loop bound 1000000 (size)
-36(%rbp) == Inner loop counter (j)
-40(%rbp) == Outer loop counter (i)
-41(%rbp) == Sum of bytes (sum)
        
.L3:
        movl        $0, -36(%rbp) 
        jmp         .L4                        
.L5:                                       
        movslq      -36(%rbp),%rax  
        addq        -8(%rbp), %rax         
        movzbl      (%rax), %eax           
        addb        %al, -41(%rbp)            
        addl        $1, -36(%rbp)           
.L4:                                        
        movslq      -36(%rbp),%rax         
        cmpq        -24(%rbp), %rax         
        jb          .L5  
        addl        $1, -40(%rbp)                 
.L2:                                            
        movl        -40(%rbp), %eax                
        cmpl        -12(%rbp), %eax           
        jl          .L3  



Edited: 11/02/2009 at 10:28 AM by pdrongowski
 11/02/2009 10:27 AM
pdrongowski

Posts: 23
Joined: 12/21/2007

In the following case, optimization was turned on. The generated code
is probably more in line with your expectations. Here the variables
are bound to registers.

 

Code:
***************************
-O2 optimization
***************************

rdi == Base address of the array (buf)
r8d == Outer loop counter (i)
rcx,rdx == Inner loop counter (j)
rsi == Index of the byte being added
al  == Sum of bytes (sum)

.L2:
        xorl        %esi, %esi
        movl        $1, %edx  
        jmp        .L4        
.L3:                          
        movq        %rdx, %rsi
        movq        %rcx, %rdx
.L4:                          
        leaq        1(%rdx), %rcx
        addb        (%rdi,%rsi), %al
        cmpq        $1000001, %rcx   
        jne         .L3  
        addl        $1, %r8d
        cmpl        $1000, %r8d
        jne         .L2 
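To repeat the comparison on another compiler version, a sketch of the same byte-summing loop wrapped in a function may help. The function name sum_bytes and its signature are illustrative only and are not part of the original test program. The buffer is passed in by the caller, so it can be initialized before it is read.

```cpp
#include <cstddef>

// Same byte-summing loop as the test program, wrapped in a function.
// sum is an unsigned char, so the total wraps modulo 256.
unsigned char sum_bytes(const unsigned char *buf, std::size_t size, int iters) {
    unsigned char sum = 0;
    for (int i = 0; i < iters; i++) {
        for (std::size_t j = 0; j < size; j++) {
            sum += buf[j];
        }
    }
    return sum;
}
```

Compiling the file once with g++ -S -O0 and once with g++ -S -O2 should show whether sum and the loop counters live on the stack or in registers, although the exact instructions will differ from the listings above depending on the compiler version.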


 11/02/2009 05:40 PM
leiy

Posts: 70
Joined: 06/26/2007

According to the BKDG (BIOS and Kernel Developer's Guide), EventSelect 06Dh is Octwords Written to System: the number of octword (16-byte) data transfers from the processor to the system. These may be part of a 64-byte cache line writeback or a 64-byte dirty probe hit response.

The counts for event 06Dh in the simple program are probably due to dirty probe hit responses.

You can also look at the issue from an Instruction-Based Sampling (IBS) point of view: set up IBS Op Sampling in "dispatch count" mode. You will find that there are no stores in the simple program.
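Since event 06Dh counts 16-byte octwords rather than whole cache lines, a small conversion helps when comparing it with other counts. This is a sketch using only the 16-byte and 64-byte sizes quoted above. The function names are hypothetical.

```cpp
#include <cstdint>

constexpr std::uint64_t kOctwordBytes = 16;    // one octword transfer
constexpr std::uint64_t kCacheLineBytes = 64;  // one cache line

// Bytes moved by a given number of octword transfers.
constexpr std::uint64_t octwords_to_bytes(std::uint64_t octwords) {
    return octwords * kOctwordBytes;
}

// Equivalent number of full 64-byte cache lines (four octwords each).
constexpr std::uint64_t octwords_to_cache_lines(std::uint64_t octwords) {
    return octwords * kOctwordBytes / kCacheLineBytes;
}
```

For example, if the 31897 samples reported for event 06Dh at a sample period of 10,000 are taken to mean roughly 318,970,000 octwords, that would be about 5,103,520,000 bytes, or 79,742,500 cache lines. That treats the sample count as an estimate of the event count, which is an assumption.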

 



 11/11/2009 03:01 PM
jyost

Posts: 3
Joined: 10/29/2009

Hi Paul -

First off - thanks for your responses.

But your responses only address the question of whether things are getting put into registers, no?   I believe that the code is still generating write events - according to CodeAnalyst - even when it's optimized.  So what I'm wondering is where those write events are coming from.  Lei Yu suggested that they're actually "dirty probe hit responses", rather than the result of writes that my code is doing.  I need to learn what dirty probe hit responses are, where they're coming from, and whether they should be considered when investigating memory bandwidth issues (I assume the answer to that would be "yes".)

 
