AMD Developer Blogs - Hard-Core Software Optimization
June 2, 2009
  Adventures in Dual Booting OpenSolaris

This year I entered into a new role as a performance engineer for AMD, tasked with tackling any and all Sun compiler performance engineering issues for AMD's Sun alliance.

This blog entry focuses on how I got multi-boot working on a system with both SuSE Linux® Enterprise Server 10 SP2 and OpenSolaris™ 2008.11, even though OpenSolaris is installed on the second partition (most of the blogs and articles I found online recommended that OpenSolaris be installed on the primary partition).

Back in the day, it wasn't called "multi-boot," it was just "dual-boot" (I suppose because having two operating systems installed on one disk was almost a freak of nature). Multi-booting operating systems is somewhat of a black art, mainly involving choosing, installing, and configuring the boot loader.

As a software developer in the past, I have performed such ad-hoc system setups frequently, mostly focused on bootstrapping a project. It is no different as a performance engineer.  So I recently found myself engaged in setting up a shiny new system with a couple of AMD "Istanbul" processors hot off the fab. The activity generated a surprising amount of excitement ...

... A crowd of engineers gathers around the latest machine. A few twists of a knob here, a button there, and a fiery glow lights their faces. They hunger for performance numbers! Overnight SPEC® CPU2006 runs are almost too much to endure. Can we speed up the install? Should we add more memory? We want those results!

Back to the real world - I need SLES 10 SP2 for our initial studies, so that goes on first. Anticipating the need to multi-boot, I divide the disk into 3 partitions while installing SLES to the first one, namely (hd0,0). I set up and configure the SPEC benchmarks and get those started.

... The benchmark results finally come in. The engineers ooh and ahh over the towering new SPEC numbers. Abruptly they disperse, returning to their cubicles to digest. I finally have the machine to myself (moo hoo ha ha).

Now comes the OpenSolaris install on the second partition (hd0,1). It goes smoothly, except it installs a new copy of GRUB (GRand Unified Bootloader) which doesn't seem to know anything about the original SLES partition. When I reboot, I can't get back to the original install!

... I have broken the shiny new machine! The light glows but it is a strange color, not the fiery glow the engineers will need any day now. Both hands inside the box, I am certain if I stop to scratch my nose I will lose control and it will fly around the cube and out the window.

I have two paths to try: 1) configure the OpenSolaris GRUB to see SLES 10, or 2) configure the SLES GRUB to see OpenSolaris.

First I try configuring the OpenSolaris GRUB by editing GRUB's menu.lst file. Booting OpenSolaris, I look for /boot/grub/menu.lst but eventually I discover that OpenSolaris' GRUB menu file is in /rpool/boot/grub/menu.lst. I cook up an entry like this:


title SLES 10 SP2, kernel 2.6.16.60-0.21-smp
root (hd0,0)
kernel /boot/vmlinuz-2.6.16-60-0.21-smp \
root=/dev/dsk/by-id/scsi-SATA_ST3250410AS_6RYC836A-part1 \
vga=normal showopts ide=nodma apm=off acpi=off noresume edd=off 3
initrd /boot/initrd-2.6.16.60-0.21-smp

 

but after tweaking several times (where GRUB complains about not finding a valid OS) I can't get the recipe exactly right. I move on to option #2, getting the SLES GRUB booting again.

At first I try OpenSolaris' fdisk command but I don't find an easy way to determine the device name of the SLES disk partition (because I am unfamiliar with the OpenSolaris way of device naming). So I decide to do it from SLES - if I can boot the SLES partition, or mount it somehow from a rescue disk I could modify its GRUB configuration (by editing the /boot/grub/menu.lst file). After some Googling, I create and boot a SLES 10 SP2 install CD, boot in rescue mode, mount the partition, and then add this entry:


title OpenSolaris 2008.11
root (hd0,1)
chainloader +1

 

While booted, I use the SLES fdisk to mark the SLES partition as bootable. When I reboot, the machine comes up and boots SLES 10 SP2 without intervention! Whew!

And now I can choose the OpenSolaris 2008.11 partition at boot time, which then displays the OpenSolaris GRUB menu, which knows how to boot the OpenSolaris ZFS partition. If I had to I could use fdisk again to make the machine boot to OpenSolaris every time, but for now it will reboot to SLES 10 SP2 each time.
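The chainloading entry that made this work is regular enough to generate. Here is a minimal sketch in Python that renders the same three-line GRUB legacy stanza for a given disk and partition. The function name `chainload_entry` is made up for illustration; only the stanza itself comes from the entry shown earlier.

```python
def chainload_entry(title: str, disk: int, partition: int) -> str:
    """Render a GRUB legacy menu.lst stanza that hands control to the
    boot loader installed in another partition's boot sector."""
    lines = [
        f"title {title}",
        f"root (hd{disk},{partition})",  # GRUB legacy counts disks and partitions from 0
        "chainloader +1",                # load the first sector of that partition
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Second partition of the first disk, as in this post.
    print(chainload_entry("OpenSolaris 2008.11", 0, 1))
```

The generated text would still have to be appended to /boot/grub/menu.lst by hand (or by a script run as root); the sketch only builds the string.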

... The light is restored and the machine is ready. When the engineers return to take the machine they will be able to use it as they did before, but I have left a door to my little workshop where I can return when their interests move on to the next shiny problem.

 

 



-------------------------

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.



Edited: 06/03/2009 at 10:49 AM by qneill


    Posted By: Quentin Neill @ 06/02/2009 12:14 PM     Hard-Core Software Optimization     Comments (2)  

May 4, 2009
  Beta CodeAnalyst v2.9 Released for Windows

Hello all, again --

The next version of the AMD CodeAnalyst Performance Analyzer is available for you now.  I encourage you to download it in another window and read the rest of the blog while it downloads.

We've added some widely requested enhancements, deprecated one feature, and fixed *cough* a few bugs.  While this release is in the Beta period, please send us feedback about anything you would like to suggest for the actual release or any issues you encounter.  You're welcome to send that to us any time, but during the Beta period, we're devoted to working on issues based on your feedback.    I invite you all to visit our forums for feedback, questions, and answers.

Some of the enhancements added are:

  • Multiple simultaneous symbol servers.
  • Process filters: You can limit the reported data to certain processes.
  • An API: You are no longer limited to interacting with AMD CodeAnalyst through our command line applications or our GUI; you can now programmatically control profiling, and you can fold, spindle, and mutilate the data before displaying it.
  • Notes: You can add a customized note to each profile session. This feature should help you remember essential details about a session and reduce the length of session names.
  • Call stack data for a running process: You can now capture call stack information about a process using the command line tool without launching the process from CodeAnalyst.

I am sorry to report that our simulation feature is now deprecated.  It was useful for many reasons, but it was still a simulation of pipeline behavior.  Now we have instruction-based sampling (IBS) information available.  IBS can measure actual instruction execution, so I recommend that to you instead!

If you really must know about bug fixes and open issues, you can check out the release notes shipped with each version of the AMD CodeAnalyst tool. 

Most of the time since the last release has been spent writing and testing the API.  I've been working hard to make the API convenient and well documented (in doxygen format) for y'all.  We added the API so that you can build your own custom tools.  We are including some new sample code showing how to use the API, and I would love to hear (or read) what you end up doing with it.  Please post your projects and requests for further enhancements or clarifications on the forums.

Thanks!

-=Frank






    Posted By: AMD DeveloperCentral @ 05/04/2009 05:28 PM     Hard-Core Software Optimization     Comments (0)  

April 28, 2009
  AMD CodeAnalyst Workshop Summary - Part 5 of 5

Profiling JIT compiled code

 

Managed code is a popular approach to software development and deployment, as the developers do not have to compile it separately for different environments. Managed code executes on virtual machines which provide a secure and portable execution environment.  Java and .NET are examples of managed code systems. Managed code systems use just-in-time compilation to translate from a portable, intermediate representation of a program to native (machine) code.  The generated code is compiled "just-in-time" for the execution.  It's possible to profile the generated code when it executes, but in order to interpret the profile data and optimize the code, the AMD CodeAnalyst tool must instrument it a little.  That way we capture the generated native code and how the native code relates back to your source code.

 

AMD CodeAnalyst provides several profile agents, which gather the necessary information at code generation time during a profile and save the information for profile data analysis later.  Since the agents behave differently for the 32-bit and 64-bit runtimes, we provide both 32- and 64-bit agents.  Also, there are two profile agent interfaces for Java.  The JVMPI has mostly been deprecated in favor of the newer JVMTI, but we try not to make assumptions about what Java runtime you are using and provide both.  JVMPI uses the command line parameter -XrunCAJVMPIA32 in the Java application launch command to integrate the agent.  JVMTI, on the other hand, uses the option -agentlib:CAJVMTIA32.  If you launch the Java application through the AMD CodeAnalyst standalone GUI, we automatically add the command line option for you.
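If you launch Java yourself rather than through the GUI, the agent option has to be placed on the command line by hand. The following is a small sketch in Python of how a launch script might build that command. The helper name `java_profile_command` and the example jar name are hypothetical; only the two 32-bit option strings come from the text above, and the 64-bit agent names are not given here, so they are deliberately left out.

```python
# Agent options quoted in the post (32-bit agents only).
AGENT_OPTIONS = {
    "JVMPI": "-XrunCAJVMPIA32",       # older interface
    "JVMTI": "-agentlib:CAJVMTIA32",  # newer interface
}


def java_profile_command(java_args, interface="JVMTI"):
    """Return an argv list that starts Java with the chosen profile agent."""
    try:
        option = AGENT_OPTIONS[interface]
    except KeyError:
        raise ValueError(f"unknown agent interface: {interface!r}") from None
    # JVM options must come before -jar or the main class name.
    return ["java", option, *java_args]


if __name__ == "__main__":
    print(" ".join(java_profile_command(["-jar", "myapp.jar"])))
```

The resulting list can be passed to `subprocess.run` while a profile is being collected.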

 

On Linux®, the source code for both agents is provided and you can compile and use the agent of your choice.  However, Java source code cannot be shown with the profile data.  AMD CodeAnalyst shows the generated native code in the assembly tab.

 

.NET has a different method of attaching the profile agent.  It uses environment variables and GUIDs.  Before running an application or module with managed code, you have to set Cor_Enable_Profiling=0x1.  Once you've enabled the profiling, you must tell the .NET runtime environment which profiling agent to use.  If you're using the 32-bit runtime, you will set COR_PROFILER={D007F1AC-DA06-4d68-BF47-E94790DD379F}.  If you're using a 64-bit system, you should test whether the runtime is 64-bit or 32-bit.  The 64-bit profile agent environment setting is COR_PROFILER={891D5491-7E37-4b23-BE66-1C837FED378B}.  If you launch the managed application through the standalone AMD CodeAnalyst GUI, the environment variables are automatically set for you.
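When the managed application is started from a script instead of the GUI, the two variables can be set just for that child process. Below is a minimal sketch in Python under that assumption. The function name `dotnet_profiling_env` is made up; the variable names and GUIDs are copied from the paragraph above.

```python
import os

# Profile agent GUIDs quoted in the post, keyed by runtime bitness.
PROFILER_GUIDS = {
    32: "{D007F1AC-DA06-4d68-BF47-E94790DD379F}",
    64: "{891D5491-7E37-4b23-BE66-1C837FED378B}",
}


def dotnet_profiling_env(runtime_bits, base_env=None):
    """Return a copy of the environment with .NET profiling enabled."""
    if runtime_bits not in PROFILER_GUIDS:
        raise ValueError("runtime_bits must be 32 or 64")
    env = dict(os.environ if base_env is None else base_env)
    env["Cor_Enable_Profiling"] = "0x1"
    env["COR_PROFILER"] = PROFILER_GUIDS[runtime_bits]
    return env


if __name__ == "__main__":
    env = dotnet_profiling_env(64)
    print(env["Cor_Enable_Profiling"], env["COR_PROFILER"])
    # A launcher could then call: subprocess.run(["MyApp.exe"], env=env)
```

Returning a copy keeps the settings away from the rest of the session, so only the launched process is affected.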

 

We don't currently have a profiling solution for interpreted languages like Perl, Python, or Basic.  We can profile the applications, of course, but the samples are associated with the language interpreter and AMD CodeAnalyst cannot tie the profile data back to your source code for analysis.  If you have a great idea for features or enhancements, please send mail to CodeAnalyst.support@amd.com.

 

Windows® and Linux® differences

 

The final topic of this blog is about the differences between AMD CodeAnalyst on Windows® and on Linux.  We try to maintain feature parity on both platforms.  However, there are times when a new feature may be available for one operating system platform, but not the other.  Other times, like with a major hardware release, the same feature is introduced on both platforms simultaneously.  If there is a feature on one version that isn't available on the other and you need it, please let us know.

 

There are advantages and disadvantages on both platforms, mainly due to the platform-specific method used to collect profile data.  AMD CodeAnalyst uses Oprofile for data collection on Linux and uses its own proprietary driver to collect profile data on Windows. Oprofile aggregates profile data on-the-fly into summaries as it is captured while the Windows driver writes profile data to a file so that it can be aggregated during post-processing.  On-the-fly aggregation allows longer sampling sessions, since the resulting profile files are relatively compact.  However, on-the-fly aggregation loses timestamp information.  Without timestamp information, AMD CodeAnalyst cannot generate thread profiles.
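The trade-off between the two collection styles can be shown with a toy model. This Python sketch is not how Oprofile or the Windows driver store data; the sample tuple layout and function names are assumptions chosen for illustration. It shows that a running summary stays small but can no longer answer "which core ran this thread, and when?", while a raw log can.

```python
from collections import Counter

# Hypothetical raw samples: (timestamp_ms, thread_id, core, address)
SAMPLES = [
    (0, 1, 0, 0x401000),
    (1, 1, 0, 0x401000),
    (2, 1, 1, 0x401010),
    (3, 2, 1, 0x402000),
]


def aggregate_on_the_fly(stream):
    """Keep only a per-address sample count; timestamps are discarded."""
    counts = Counter()
    for _timestamp, _thread, _core, address in stream:
        counts[address] += 1
    return counts


def thread_timeline(log, thread_id):
    """Rebuild (timestamp, core) pairs for one thread from a raw log."""
    return [(ts, core) for ts, tid, core, _addr in log if tid == thread_id]


if __name__ == "__main__":
    summary = aggregate_on_the_fly(SAMPLES)
    print({hex(addr): n for addr, n in summary.items()})
    print(thread_timeline(SAMPLES, 1))  # only possible with the raw log
```

The summary grows with the number of distinct addresses rather than the length of the session, which is why it suits long runs.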

 

The Windows version of AMD CodeAnalyst uses the APIC timer for time-based profiling.  Oprofile does not use the APIC timer, so the CPU Clocks not Halted event is substituted as a time measurement.  And speaking of events, the Windows version of AMD CodeAnalyst uses time-based event multiplexing to switch between events.  Version 2.7 of AMD CodeAnalyst on Linux instead re-runs the application for each event group.  We hope to add time-based event multiplexing to a future Linux version and to contribute our changes to the open source Oprofile code base.

 

On Windows, AMD CodeAnalyst is integrated with two major integrated development environments (IDEs): Microsoft Visual Studio (2005 and 2008) and Eclipse.  With Visual Studio, you can use the profile controls, profile session lists, and data windows without leaving the Visual Studio environment.  The AMD CodeAnalyst plug-in will be installed by default if Visual Studio 2005 or 2008 is installed before AMD CodeAnalyst is.  The plug-in for Eclipse is called "CodeSleuth".  CodeSleuth uses AMD CodeAnalyst to collect and analyze compiled Java code from within the Eclipse IDE.  For more information, you can go to http://developer.amd.com/cpu/CodeAnalyst/codeanalystwindows/codesleuth.

 

While we provide example programs in both versions, on Linux, we make the entire source code base available via GPL version 2.  If you do patch something, please send it back to us, so we can incorporate it into the code base.

 

In conclusion, through these topics, I've tried to provide useful information to you about performance optimization and specifically how to use AMD CodeAnalyst Performance Analyzer to improve your software.  If you have read through the articles and clicked a couple of the links, you should have a firm grounding in it all, so it's a good time to get started.  Just in case you haven't downloaded the software yet, you can find AMD CodeAnalyst available for download at http://developer.amd.com/cpu/CodeAnalyst.  The AMD CodeAnalyst tool is available at no charge, so please don't hesitate to try it out and let us know what you think. We appreciate your reading these and welcome all of your feedback, bug reports, enhancement requests, and comments.

 

Thank you for your time.

-=Frank Swehosky






    Posted By: AMD DeveloperCentral @ 04/28/2009 02:43 PM     Hard-Core Software Optimization     Comments (0)  

April 22, 2009
  AMD CodeAnalyst Workshop Summary - Part 4 of 5

In this blog, I cover the last three profiling configurations that the AMD CodeAnalyst tool offers: event-based sampling, instruction-based sampling, and thread profiling.

 

Event-based sampling

The next topic in our profiling jaunt is about how to use the event-based sampling feature of AMD CodeAnalyst Performance Analyzer.  Event-based sampling profiles give you a better understanding of what hardware events are occurring during your profile.  This should give you a clearer view of how your software is affecting the system and how it can be changed for great performance!

 

Background info

Performance monitoring events are specific to each manufacturer and to each version of a processor.  The number of performance monitoring counters is also version-specific. There are two ways of using the event counters.  You can use them directly to count events or you can trigger sampling interrupts when a certain count is reached.  AMD CodeAnalyst uses the interrupt method to take samples.

You need a privileged ring 0 access driver in order to configure the events to be counted.  That is why you need to be an administrator (or root) to install the AMD CodeAnalyst tool. 

Third generation AMD Opteron™ processors and AMD Phenom™ processors support four event counters.  To sample more than four events in one profiling session, we use event multiplexing to swap the event configurations out every few milliseconds.  Each core has its own set of counters.  AMD CodeAnalyst currently configures all cores identically. When an event counter is configured, AMD CodeAnalyst currently also sets an event count based on the configuration given for the profile.  When the event count, or "sampling period", is reached, an interrupt is generated in order to take a sample.  More interrupts and more data are generated for a profile when the event count is smaller.  If the event count is too low, an interrupt will occur while processing the previous interrupt and the system will grind to a halt.  Since the frequency of events may vary by system and the applications that are running, we allow you to go to improbable limits in your search for data.  If your system does lock up, for the next profile simply change the profile configuration to increase the event count to a safer and larger number.
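To make the multiplexing idea concrete, here is a small Python sketch that splits a list of events into groups of four (one group per set of counters) and rotates through the groups, one per time slice. The round-robin order and the function names are assumptions for illustration; the driver's actual scheduling is not described here.

```python
from itertools import cycle, islice


def event_groups(events, counters=4):
    """Split the requested events into groups that fit the hardware counters."""
    return [events[i:i + counters] for i in range(0, len(events), counters)]


def multiplex_schedule(events, time_slices, counters=4):
    """Return the event group that would be active in each time slice,
    assuming a simple round-robin rotation."""
    groups = event_groups(events, counters)
    return list(islice(cycle(groups), time_slices))


if __name__ == "__main__":
    # Six event select values, more than the four available counters.
    requested = [0x040, 0x041, 0x076, 0x0C0, 0x0CB, 0x0E9]
    for slice_no, group in enumerate(multiplex_schedule(requested, 4)):
        print(slice_no, [hex(e) for e in group])
```

Each event is only counted during the slices in which its group is active, which is worth keeping in mind when comparing counts across groups.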

 

Using event-based sampling

Different processor families and implementations support different performance events.  The AMD CodeAnalyst tool takes this into account and only shows you the events available on your system.  AMD CodeAnalyst provides predefined profile configurations that group appropriate events together in order to gather data on specific subjects:

  • Assess performance
  • Investigate L2 cache access
  • Investigate branching
  • Investigate data access
  • Investigate instruction access

You can also customize event configurations as shown in the figure below, to find out what matters and is useful to you.  AMD CodeAnalyst provides a template "Current event-based profile" that you can modify to your heart's content.  If you want to share your profile configurations, you can export them and other team members can import them.  This short blog entry doesn't have enough space to provide a detailed description of each available event.  You can read the specific events for your system in the BIOS and Kernel Developer's Guides (BKDG) at http://developer.amd.com/documentation/guides.  You will need to read the BKDG that is specific to the processor within your platform.  Here's a brief list of some of the events we've found useful:

  •  0x040: Data Cache Accesses
  •  0x041: Data Cache Misses
  •  0x076: CPU Clocks Not Halted
  •  0x0c0: Retired Instructions
  •  0x0cb: Retired MMX™ / FP Instructions
  •  0x0e9: CPU/IO Requests to Memory/IO

The CPU/IO Requests to Memory/IO event is especially useful for measuring memory requests on NUMA platforms, where memory latency can severely impact performance.

Figure 1 - Event-based sampling profile configuration

In the figure above, you'll note that the CPU/IO Requests to Memory/IO event has been selected, and that the unit masks have been set to count an event when data is requested from a non-local node.  This configuration is only useful in a NUMA setup, and I've written more about why it's useful in the thread profiling section below.

Depending on the type of the event data that was collected, you may be able to use different views to display different aspects of the data.  Unlike the timer-based sample profiling view, for event-based sampling profiles there are a lot more views available.  Whether a particular view is present for a profile depends on the event data that was collected.  If the appropriate event data is available, you can get instruction per cycle (IPC) assessments, branch assessments, and so on.  You can always use the "All Data" view to see the raw sampled data.  For derived measurements in the views, the data is normalized, so the calculations and comparisons are valid and make sense.
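As an illustration of why normalization matters for a derived measurement like IPC, here is a short Python sketch. It assumes that each sample stands for one sampling period's worth of events, so sample counts are scaled by their periods before being divided. The function names are hypothetical and the exact normalization AMD CodeAnalyst applies is not spelled out here.

```python
def estimated_events(samples, sampling_period):
    """Each sample represents `sampling_period` occurrences of the event."""
    return samples * sampling_period


def instructions_per_cycle(instr_samples, instr_period, clk_samples, clk_period):
    """IPC from Retired Instructions (0x0c0) and CPU Clocks Not Halted (0x076)."""
    cycles = estimated_events(clk_samples, clk_period)
    if cycles == 0:
        raise ValueError("no clock samples collected")
    return estimated_events(instr_samples, instr_period) / cycles


if __name__ == "__main__":
    # Same raw sample counts, different instruction sampling periods:
    print(instructions_per_cycle(300, 250_000, 500, 250_000))
    print(instructions_per_cycle(300, 500_000, 500, 250_000))
```

Dividing the raw sample counts directly would give the same answer for both calls, even though the second profile represents twice as many retired instructions.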

Just like in timer-based sampling, you can 'drill down' or further investigate particular modules and functions. Depending upon the resulting profiles, you may be able to determine if you need to change an algorithm completely, or just modify some data structures and code to implement better data access patterns.  If you go down to the assembly level, however, there is some inaccuracy with the instruction addresses associated with a sample, due to sampling skid and out-of-order execution.  For more information on out-of-order execution, please go to http://en.wikipedia.org/wiki/Out-of-order_execution.  The problem is that the event may have been triggered by an instruction, but the interrupt handler captures the address of the next instruction or some other instruction within the neighborhood of the culprit.  Instruction-based sampling is designed to eliminate this inaccuracy.

 

Instruction-based sampling

Closely related to event-based sampling is instruction-based sampling.  It is only available on systems with AMD Family 10h processors.  The mechanism is different from event-based sampling in that after the count is triggered, a fetch or execution operation is tagged, and the events that occur throughout its execution are tracked.  After the operation retires, the driver retrieves event information through the raised interrupt.

There are two types of data that can be profiled simultaneously: fetch and op.  The fetch count is the number of completed fetched operations between tagged fetches.  For the op sampling, there are two methods of tagging an op.  The cycle method will wait for the specified number of processor cycles and then tag an op that will be dispatched in the next cycle, if a valid op is available.  The cycle method has a small, but unavoidable timing bias that will cause certain ops to be tagged more often than the actual execution frequency.  The dispatched op method counts ops as they are dispatched and tags the next available op when the op interval (period) expires. The dispatched op method is designed to reduce sampling bias.

Figure 2 - Instruction-based sampling configuration

Since instruction-based sampling provides the address of the tagged operation as well as the events caused by it, we have the exact address at which events occurred, which resolves the concerns raised earlier about the accuracy of event-based profiling data.  We also can collect data on a large number of events without multiplexing or specifying complex configurations.  In addition to simple sample counts, more information is collected, like latency counts and the memory addresses used for load and store events.  Instruction-based sampling is a new technique and has the potential for other kinds of analysis.  If you have a good idea about how to display or use the effective addresses to help you optimize your application, please let us know!

 

Thread profiling

And now for something completely different -- thread profiling.  This feature was requested by a user for investigating a ccNUMA performance issue.  Thread profiling is only available on Windows, because Oprofile strips out the timestamp data during a profile and a timeline of thread-oriented events cannot be reconstructed during post-processing.

Here's the theory first.  The system hardware architecture can have a significant impact on memory latency, depending on memory access patterns.  If you aren't familiar with NUMA or SMP architecture, here's a good starter link: http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access.  In symmetric multi-processor systems, the RAM is equally available to all processors.  In NUMA systems, each processor has access to its own RAM.  Also, memory allocated in one processor's RAM doesn't get transferred directly to another's.  Therefore if a thread allocates memory on one processor and then moves to a different processor, the memory accesses will involve a greater latency because they need to go through the first processor.  For more theory and a whole lot more depth, I refer you again to AMD's Software Optimization Guide (http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf). Some good heuristics for optimizing processor usage with multiple threads on a NUMA system are:

  • Maintain balanced system loads through core-scheduled threads
  • For threads with independent data, try to schedule the threads on different idle nodes. When there are no more available idle nodes, try to schedule the threads on the idle cores across the nodes.
  • For threads with shared data, try to schedule the threads on different idle cores of the same node.
  • Try to avoid switching a thread to a different node than the one on which it was created.
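The second and third heuristics can be sketched as two tiny placement functions. This is a toy model in Python, not a real scheduler: the `nodes` mapping (node id to a list of idle core ids) and both function names are assumptions, and actually pinning a thread would need an operating system affinity call that is not shown.

```python
def place_independent(num_threads, nodes):
    """Spread threads with independent data: one per node first,
    then fill the remaining idle cores across the nodes."""
    free = {node: list(cores) for node, cores in nodes.items()}
    placement = []
    while len(placement) < num_threads:
        progressed = False
        for node in sorted(free):
            if free[node] and len(placement) < num_threads:
                placement.append((node, free[node].pop(0)))
                progressed = True
        if not progressed:
            raise ValueError("not enough idle cores")
    return placement


def place_shared(num_threads, nodes):
    """Keep threads with shared data on different idle cores of one node."""
    for node in sorted(nodes):
        if len(nodes[node]) >= num_threads:
            return [(node, core) for core in nodes[node][:num_threads]]
    raise ValueError("no single node has enough idle cores")


if __name__ == "__main__":
    idle = {0: [0, 1], 1: [2, 3]}  # two nodes with two idle cores each
    print(place_independent(3, idle))
    print(place_shared(2, idle))
```

Each result is a list of (node, core) pairs, one per thread, in the order the threads would be assigned.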

If you want to examine how and on which cores your threads are executing, AMD CodeAnalyst has a thread profile configuration for you.  The thread profile data is shown differently from other profile data. 

Figure 3 - Thread profile data

A thread profile is shown as a time chart, with all the available cores shown for each thread.  Thread activity is divided into time-slices. The color of a time-slice indicates user activity (green) or kernel activity (yellow).  You can change the time-slice period.  The location of a colored time-slice indicates the core which executed the thread during that time period.  You can also set a threshold value to see if there were more samples than the threshold in a time-slice. 

The other main feature of the thread profile is the ability to track non-local memory accesses.  A non-local memory access is a memory access across nodes on a ccNUMA system.  In the AMD CodeAnalyst tool, the occurrence of a non-local memory access during a time-slice is represented as a red bar.  If there are samples available, you can open the "Non-Local Memory Accesses" tab and see a list of the modules that had the accesses.  You can expand each module and see the list of functions in which the accesses occurred.  Expand again, and you can see the address list.  Double-click an address item and you will open a source code tab to show you in the code where the trouble is happening.

If you would like to suggest thread profiling features, we are currently collecting requirements for a thread analysis tool.  Please send any ideas and requests to CodeAnalyst.support@amd.com.

 

-Frank Swehosky






Edited: 04/28/2009 at 03:02 PM by devcentral


    Posted By: AMD DeveloperCentral @ 04/22/2009 11:55 AM     Hard-Core Software Optimization     Comments (0)  

April 21, 2009
  AMD CodeAnalyst Workshop Summary - Part 3 of 5

This blog covers the first two profiling configurations: timer-based sampling and call stack sampling.

 

Timer-based sampling

Here I discuss why and how to use the timer-based sampling feature of AMD CodeAnalyst Performance Analyzer.  If you're already convinced about using timer-based sampling, you can skip ahead to the section on how to use it now.

Why should you use timer-based sampling?

Traditionally, the way to calculate performance gains (or losses) is by measuring the time a program, function, or loop takes to execute.  This could be done for each optimized version of the program by reading the time before and after execution, and then calculating the difference (elapsed time).  The rdtsc ("read timestamp counter") instruction can be used to read the current time as measured in processor cycles.  The problem with this method arises with multi-core and multi-node systems.  Each core in the system maintains a separate timestamp counter.  The counters are not guaranteed to be synchronized.  While you can start to see the trouble, it gets worse if your system uses power management (clock throttling), running some cores at different clock rates, as they're needed.  If a thread switches cores during execution, the end timestamp may have no correlation to the beginning timestamp.  Your performance calculations are now suspect.
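For reference, the traditional before-and-after pattern looks like the following Python sketch. Python cannot issue rdtsc directly, so the standard library's `time.perf_counter_ns` is used here as a stand-in clock; the helper name `elapsed_ns` is made up for illustration.

```python
import time


def elapsed_ns(func, *args):
    """Read a clock before and after the call and return (result, difference)."""
    start = time.perf_counter_ns()   # stand-in for reading the timestamp counter
    result = func(*args)
    end = time.perf_counter_ns()
    return result, end - start


if __name__ == "__main__":
    total, nanoseconds = elapsed_ns(sum, range(1_000_000))
    print(total, nanoseconds)
```

The number printed varies from run to run, and the pattern only measures the one region you wrapped, which is part of the motivation for the sampling approach described next.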

AMD CodeAnalyst instead takes a statistical sampling approach.  On Windows®, an APIC timer signals when samples should be taken on each core.  On Linux, AMD CodeAnalyst uses the open source profiler Oprofile to collect performance data. Oprofile is unable to use the APIC as a timer.  Instead we use the CPU Clocks not Halted event (CPU_CLK_UNHALTED, with event select value 0x76) to measure time, with calculations based on the system clock speed to convert from cycles to milliseconds.  On both Windows and Linux®, each sample can be interpreted as a period of time of execution.  As the sampling period decreases, resolution increases and the data becomes less approximate, but the number of samples taken increases as well.  Thus, with the AMD CodeAnalyst tool, the approximate time spent in an algorithm (code region) is reflected by the number of samples taken in your program.  This works over all cores, and can show you how much time was spent on specific processors.  This gives you a good estimate of performance.  After several rounds of optimization, you should see the time spent in the optimized algorithms decrease.  Since this is a statistical method, the performance estimate may be inaccurate if insufficient samples are taken.
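The cycles-to-milliseconds conversion mentioned above is simple arithmetic. Here is a sketch in Python, assuming a fixed clock speed and one sample per sampling period of unhalted cycles; the function names and the numbers in the example are illustrative, not values taken from the tool.

```python
def sample_time_ms(samples, period_cycles, clock_hz):
    """Estimated execution time represented by a number of samples."""
    return samples * period_cycles * 1000.0 / clock_hz


def share_of_profile(region_samples, total_samples):
    """Fraction of all samples that landed in one code region."""
    if total_samples == 0:
        raise ValueError("profile contains no samples")
    return region_samples / total_samples


if __name__ == "__main__":
    # Assumed numbers: 2.4 GHz clock, one sample every 2,400,000 cycles.
    print(sample_time_ms(1200, 2_400_000, 2_400_000_000), "ms in this region")
    print(share_of_profile(1200, 4800), "of all samples")
```

With these assumed numbers each sample stands for about one millisecond, so halving the period would double the sample count for the same run.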

 

How do you use timer-based sampling?

This article isn't supposed to substitute for the tutorial so I won't go into too much detail here.  First, open AMD CodeAnalyst and create a new project.  Next open the configuration manager dialog (Configuration Manager), select "Current time-based profile" and click the Edit button.


Figure 1 - Timer-based sampling configuration

You can set the timer interval to your preference.  With a smaller interval your data is more accurate, but more profile data is collected and overhead is higher.  Before you run your profile, make sure your module is running during the profile, or the data won't be that interesting. Click OK to accept the timer configuration. Then click the green Start button in the toolbar to collect profile data.

The newly created profile should open automatically after data collection.  The view called "time-based profile" is shown, which by default shows the samples as a percentage of the samples taken.  If you want to get the raw sample counts, you can click the "Manage" button and uncheck "Show Percentage".  You can go to the graph tab to see the same data in chart form. 


Figure 2 - System Graph Tab

To get more information on your module of interest, just double-click on the corresponding item.  You can see where time is spent in different functions when you look at the module data tab.  Samples are aggregated at the address level, so if you have symbolic information, you can go to the source tab and see the sample distribution across source-level statements, or you could just open up the assembly tab that shows the sample distribution across native (assembler) code.

With information about where the time is spent during execution, the next step is to figure out why time is spent in hot code regions.  You can collect various types of event-based profiles to accomplish this task.  If you want a better understanding of what call paths resulted in the time being spent in the sampled functions, you may want to use call stack sampling.

Call stack sampling

You can use call stack sampling to understand how different call paths affect the time spent in functions.  Currently, call stack sampling is only available on Windows when you launch a process from within the AMD CodeAnalyst tool.  While call stack sampling has more overhead than the regular timer-based profiling, it is far less intrusive than the instrumentation required to collect a complete call graph.  However, call stack sampling is still based on sampling and its results are subject to statistical variation. The call graph is usually incomplete because infrequently executed functions may not be sampled at all.

Call stack sampling requires a lot more profiling overhead.  You can change the amount of effort spent on call-stack unwinding with the Session Settings dialog, as seen in the following image.

Figure 3 - Session Settings Dialog

The call stack unwind level controls how deeply the call stack is explored during the profile, that is, how many call-return addresses are traced when a sample is taken.  More processing time is used in order to explore longer call paths.  The value of 10 in the figure above means that the profiler will attempt to trace up to ten call-return addresses for each call stack sample.  The call stack interval controls how often during the profile a call stack sample is taken.  In the figure above, with the value of 10, there will be one call stack sample collected for every ten timer-based profiling samples.
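To get a feel for how the two settings interact, here is a minimal back-of-the-envelope sketch. It is not part of AMD CodeAnalyst; the function name `css_sample_budget` and the example session (10,000 timer samples) are assumptions made up for illustration.

```python
def css_sample_budget(timer_samples, css_interval, unwind_level):
    """Estimate how much call stack data a profile session collects.

    timer_samples: number of timer-based samples taken (hypothetical input)
    css_interval:  one call stack sample per this many timer samples
    unwind_level:  maximum call-return addresses traced per call stack sample
    """
    css_samples = timer_samples // css_interval
    max_return_addresses = css_samples * unwind_level
    return css_samples, max_return_addresses


# Assumed example: a session that took 10,000 timer samples,
# with both settings left at the value of 10 shown in the figure.
samples, addresses = css_sample_budget(10_000, css_interval=10, unwind_level=10)
print(samples, addresses)  # 1000 10000
```

Raising the unwind level or lowering the interval increases the second number, which is a rough proxy for the extra unwinding work the profiler has to do.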

The call stack sampling (CSS) data is made available through the Processes tab.  You just select the launched process, and use either the "Call-stack data" button or the context menu item.  The call stack tab has two parts.

The top part has the call tree data.  You can expand the tree and see what methods had call stack samples.  The call tree shows you which other functions called the sampled functions.  The depth of the tree is limited by the unwind level used in the profile configuration.  'Self' samples are the samples taken in the function itself.  'Children' samples are samples for which the function was an ancestor of the function in which the samples were taken.

The bottom part of the call stack tab is called the butterfly view.  Its contents depend on which function item is selected in the top part.  It gives detailed information on the ancestors (to the left) and the children (to the right).  In the ancestors section, you can see all the functions that called the selected function.  If you expand the items in the ancestor section, you can see all the call sites within the function from which the selected function was called.  You can also see the call frequency, to help determine if there is a particular path that needs improvement.  In the children section, you can expand the items to see the addresses at which samples were taken.

 

-Frank Swehosky



-------------------------

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.



Edited: 04/28/2009 at 04:03 PM by devcentral

    Posted By: AMD DeveloperCentral @ 04/21/2009 06:23 PM     Hard-Core Software Optimization     Comments (0)  

April 13, 2009
  AMD CodeAnalyst Workshop Summary - Part 2 of 5

AMD CodeAnalyst Tools and Usage Model

This blog is about the tools we provide as part of the AMD CodeAnalyst Performance Analyzer. It should give you an idea of how to use the tools and how to best fit profiling into your software life cycle. 

Customer requests drive the evolution of the AMD CodeAnalyst tool and its features. If there is a feature that you would really like to see in our tool, please contact us at CodeAnalyst.support@amd.com.

The main usage model for AMD CodeAnalyst divides its operation into two parts:

  1. Profile data collection

  2. Post-processing

Profile data is communicated to post-processing through files. AMD CodeAnalyst provides both GUI and command line tools. The install package includes profile agents to capture Just-In-Time (JIT) generated code. Predefined profile configurations and basic view analyses make it easy to collect, post-process and view profile data.

The profile driver on Windows® is the base for lightweight profile data collection and it is included in the AMD CodeAnalyst installation. Lightweight in this instance means that the overhead of profiling doesn't significantly change the characteristics of the system being profiled. Profile data consists of samples, which the driver collects. When it is time for a sample to be taken, either because of an event count trigger, a timer trigger, or an instruction trigger, an interrupt occurs. The driver handles the interrupt and collects data such as which processor core took the interrupt, the timestamp when the interrupt happened, the instruction pointer address, the process ID, the thread ID, and even register values. The total amount of driver processing required (sampling overhead) depends on the sampling frequency and whether call stack sampling is enabled. The more data that the driver has to collect when profiling, the more time it takes to just save the data. Ideally, there would be no overhead, but since we do most of our processing after the profile data has been collected, we consider the current driver to be lightweight. On Linux, we use Oprofile as our profiling subsystem.

The AMD CodeAnalyst GUI post-processes the collected data and displays profiles. It uses a project and profile session usage model. A project corresponds to an application or module that you are optimizing or investigating. Each profile session is a 'run' or profile instance that can be named, reviewed, saved, or deleted. The sessions are saved in separate directories under the project. The appropriate JIT data files generated by the profile agents are also saved in the session directories, so you can review the JIT data later or compare the code output by different runtime engines. Since AMD CodeAnalyst is file-based, it is very easy to share data between co-workers, either by copying the whole project or by just importing the particular sessions in which you're interested.

AMD CodeAnalyst is a system-wide profiler, so you can monitor services, drivers, and any server applications that start up automatically. You also have the opportunity to launch an application or batch file when you start to collect a profile. You can choose many settings for each profile session, but you must choose a 'profile configuration' when collecting a profile using the AMD CodeAnalyst GUI. We have included many pre-defined configurations, based on our experience with what people have traditionally used to profile: execution time, an overall assessment of performance, data accesses, branching behavior, etc. If you want, you can customize the configurations, changing the sampling frequency and the type of data collected. We do not make assumptions about the limits of your system, so we give you the ability to shoot yourself in the foot by choosing a high sampling frequency, thereby asking for more information than the profiler can handle. If you request too much data in one profile session by setting the sampling frequency too high, the system will lock up.

After your profiling session is finished, AMD CodeAnalyst analyzes and aggregates the data. When you display data from the profiling session, you can choose from multiple analysis views. Which views are available depends on what data is available from a profile. Some views provide basic ratios, like Instructions per cycle (IPC) or the DTLB request rate. For ratios, the data is normalized or weighted, so statistical comparisons make sense.
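As an illustration of why ratios need normalized data, here is a minimal sketch of an IPC calculation from event-based samples. It is not the AMD CodeAnalyst implementation; the function names and the sample counts are hypothetical. The only assumption is that each sample stands for one sampling interval's worth of events, so two events sampled at different intervals must be scaled before they are divided.

```python
def estimated_events(sample_count, sampling_interval):
    """Each sample represents `sampling_interval` occurrences of the event."""
    return sample_count * sampling_interval


def instructions_per_cycle(instr_samples, instr_interval,
                           clock_samples, clock_interval):
    """IPC from two sampled events that may use different sampling intervals."""
    instructions = estimated_events(instr_samples, instr_interval)
    cycles = estimated_events(clock_samples, clock_interval)
    return instructions / cycles


# Hypothetical module: 1200 retired-instruction samples at one sample per
# 250,000 events, and 800 clock samples at one sample per 500,000 events.
print(1200 / 800)                                            # 1.5 (raw ratio, misleading)
print(instructions_per_cycle(1200, 250_000, 800, 500_000))   # 0.75
```

The raw sample ratio suggests 1.5 instructions per cycle, while the normalized figure is half that, because the clock event was sampled half as often.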

While I'll be writing about the explicit differences between the versions available on Windows and Linux® later, there is one additional tool on Windows that should be mentioned here: the Microsoft® Visual Studio integration package. It is integrated directly into Visual Studio and allows you to use most of the functionality of the stand-alone AMD CodeAnalyst GUI for each solution. Another integration tool, called CodeSleuth, is available as an Eclipse plug-in for Java performance measurement. More information can be found on our main CodeSleuth page.

There are separate versions of the AMD CodeAnalyst tool for Windows and Linux.

AMD CodeAnalyst for Windows includes:

  • Stand-alone AMD CodeAnalyst GUI to collect and analyze profiles

  • Online help including descriptions of performance events

  • Visual Studio integrated package

  • CaProfile.exe command line utility to collect profile data

  • CaDataAnalyze.exe command line utility to analyze profile data files

  • CaReport.exe command line utility to display analysis results

  • Java profiling agents (32-bit/64-bit for both jvmpi and jvmti)

  • Pause and resume profiling control API

  • AMD CodeAnalyst for Windows profiling driver

The AMD CodeAnalyst tool for Linux includes:

  • Stand-alone AMD CodeAnalyst GUI

  • Online help including descriptions of performance events

  • Source code (GPL v2) for AMD CodeAnalyst and the Oprofile open source profiler

  • Java profiling agents (32-bit/64-bit for both jvmpi and jvmti)

  • Kernel modules to support data collection on the latest AMD processors

-Frank Swehosky





Edited: 04/28/2009 at 04:03 PM by devcentral

    Posted By: AMD DeveloperCentral @ 04/13/2009 06:47 PM     Hard-Core Software Optimization     Comments (0)  

April 2, 2009
  AMD CodeAnalyst Workshop Summary - Part 1 of 5

While in India last summer, I had a chance to give a short series of one-day workshops on AMD CodeAnalyst Performance Analyzer. Since many other software developers may find this information to be helpful, I've attempted to capture most of the presented material here in this blog. If you have any questions or ideas for enhancements to our application, please send them to CodeAnalyst.support@amd.com or feel free to comment on one of my blog posts. I encourage you to download AMD CodeAnalyst free of charge and check out the Intro to AMD CodeAnalyst video for a quick overview of some of the most useful features.

 

The workshop sessions were approximately:

60 minutes on Performance and optimization

40 minutes on AMD CodeAnalyst tools and model

30 minutes on Timer-based sampling

30 minutes on Call stack sampling

30 minutes on Event-based sampling

15 minutes on Thread profiles

30 minutes on Windows® and Linux® differences

15 minutes on JIT profiles

30 minutes on Instruction-based sampling

30 minutes on Simulation profiles, Visual Studio integration, and Oprofile

 

I'll write a synopsis of each session.

 

Performance and optimization

 

Performance tuning and optimization is complicated. You should have a better idea of how to use the information our tool provides after understanding this background information.

 

There are several good reasons to optimize software:


  • More efficient processing leads to power and/or hardware resource savings

  • Improved benchmark scores

  • Increased scalability

  • More processing time for other features 


There are some potential pitfalls to optimized code that must be kept in mind:


  • Optimizations can hinder comprehension and maintainability of the code, especially if the optimizations aren't sufficiently commented

  • Optimizations too early in the development cycle can easily introduce bugs 

  • Code churning with optimizations can get quite messy when more than one person is touching the code


In order to get the best results in your optimizations, you should make sure to:


  • Keep performance in mind throughout the entire software life cycle, not as an afterthought

  • Choose the best algorithms during the design phase because an efficient algorithm can have the most impact on performance

  • Think carefully about the data sets that your application will use, because the actual performance of your algorithm could change depending on the size and properties of your data set


This is where, in the presentation, I spent a small amount of time reviewing big O notation. You probably vaguely recall from an algorithms class that there is a way to compare the relative worst-case efficiency of algorithms as the data set size approaches infinity. However, since most of the time we deal with finite amounts of data, an algorithm with a theoretically worse worst case may actually be faster in your situation. So, for optimization work, you have to know and understand algorithms and the circumstances that your code is encountering. An excellent tutorial for actually using the knowledge is at PerlMonks. Good design is a big part of where we earn our salaries.
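A small cost model shows how this can happen. The sketch below is only an illustration: the two cost functions and their constant factors (2 and 50) are invented, standing in for a simple quadratic algorithm with little overhead and an n log n algorithm with more bookkeeping per step.

```python
import math


def quadratic_cost(n):
    """Hypothetical simple algorithm: O(n^2) with a small constant factor."""
    return 2 * n * n


def nlogn_cost(n):
    """Hypothetical sophisticated algorithm: O(n log n) with a larger constant."""
    return 50 * n * math.log2(n)


def crossover(limit=1_000_000):
    """Smallest n (from 2 up) at which the quadratic algorithm becomes more expensive."""
    for n in range(2, limit):
        if quadratic_cost(n) > nlogn_cost(n):
            return n
    return None


print(quadratic_cost(16) < nlogn_cost(16))      # True: the "worse" algorithm wins
print(quadratic_cost(4096) > nlogn_cost(4096))  # True: big O wins for large n
print(crossover())                              # data set size where the ranking flips
```

With these made-up constants the ranking flips somewhere between 128 and 256 elements. If your real data sets stay below the crossover for your own code, the algorithm with the worse big O is the faster choice.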

 

Optimization is an iterative process with diminishing returns. It can be difficult to figure out when to stop. There are two heuristics that seem to work consistently: the Pareto principle and the 90/10 law. The Pareto principle is a rule of thumb that says improving only 20% of the code will yield 80% of your results. Thus, spending time improving more than 20% of your source code will have a diminished effect. The 90/10 law is that typically 90% of the execution time is spent in 10% of the code. Thus to efficiently optimize, you need to identify the parts of your code where most of the time is spent such that optimizations there will probably return most of the improvement.

 

Timer-based statistical profiling (Ahh-hah, that's where AMD CodeAnalyst software comes into it) will give you information on where your algorithms are spending time. The intuitive approach, or guessing, can sometimes work, but is often wrong. A small sub-function that is called from 70% of your other code would not be an intuitive guess, but if you can reduce the time spent by 50%, then your application may have a 35% speed boost, rather than the 5% boost from all of your effort on a separate big part. Not bad, eh? Once you locate the areas of interest, you need to understand why the code is spending time there. (Areas of interest are often called "hot spots".) There are many, many possible causes. It could just be that the data set is larger than expected and that you need to change algorithms. Most of your time could be in the algorithm overhead, when your algorithm is unnecessarily complex. Unfortunately, scalability in algorithms sometimes requires complexity in order to remain correct. You could also have data cache, bus, synchronization, or conversion problems. Fortunately, AMD CodeAnalyst software offers event-based and instruction-based sampling to give you insight as to why your code is behaving the way it is on the platform hardware.
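The arithmetic behind that 35% figure is easy to check. The sketch below assumes the sub-function accounts for 70% of the total execution time (a profile would tell you the real number) and that the "separate big part" accounts for 10%; both function names are made up for this example.

```python
def time_saved(fraction_of_runtime, local_speedup):
    """Fraction of total run time removed by speeding up one part of the code."""
    return fraction_of_runtime * (1.0 - 1.0 / local_speedup)


def overall_speedup(fraction_of_runtime, local_speedup):
    """Whole-program speedup when only one part of the code gets faster."""
    return 1.0 / (1.0 - time_saved(fraction_of_runtime, local_speedup))


# Halving the time (local speedup of 2) in code that takes 70% of the run time:
print(round(time_saved(0.70, 2.0), 4))       # 0.35
print(round(overall_speedup(0.70, 2.0), 3))  # 1.538

# The same effort spent on a part that takes only 10% of the run time:
print(round(time_saved(0.10, 2.0), 4))       # 0.05
```

The same 2x local improvement removes 35% of the run time in one case and 5% in the other, which is why the profile, not intuition, should pick the target.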

 

These days, memory management has the most impact on how quickly your code runs (after your software algorithm). Poor memory access patterns can drag your system to sloth-like execution. To operate efficiently and at speed, processors must have data available when it is needed. The latency of data access increases as it gets farther from the use in the CPU, starting from the CPU registers, to L1 cache, to L2 cache, to L3 cache, to local RAM, to remote RAM, and finally to paged-out on disk. The amount of time spent waiting for data can increase by as much as an order of magnitude or more from one level to the next, depending on the level in the memory hierarchy and hardware. The fastest instructions use data in the limited registers of the CPU. After that, data must be loaded from memory.  To reduce the latency of data access, CPUs have up to three levels of cache within the chip. Unfortunately, the fastest cache memory has the smallest capacity. L2 cache memory is a lot larger, but is slower and L3 cache memory is the largest of the on-chip cache and the slowest. 
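To see how quickly misses in the fast levels add up, here is a minimal sketch of an average access time calculation. The latencies and hit rates are invented round numbers chosen for easy arithmetic, not measurements of any AMD processor, and the model (every access that reaches a level pays that level's latency) is a simplification.

```python
def average_access_cycles(levels):
    """Average cycles per access for a simple memory hierarchy model.

    levels: list of (latency_cycles, hit_rate) ordered from fastest to slowest.
    The last level should have a hit rate of 1.0 so every access is satisfied.
    """
    total = 0.0
    reach = 1.0                      # fraction of accesses that get this far
    for latency, hit_rate in levels:
        total += reach * latency     # everyone who reaches this level waits for it
        reach *= (1.0 - hit_rate)    # only the misses continue downward
    return total


# Hypothetical numbers: (cycles, hit rate) for L1, L2, and RAM.
good_locality = [(3, 0.95), (15, 0.80), (200, 1.0)]
poor_locality = [(3, 0.80), (15, 0.80), (200, 1.0)]

print(round(average_access_cycles(good_locality), 2))  # 5.75
print(round(average_access_cycles(poor_locality), 2))  # 14.0
```

In this toy model, dropping the L1 hit rate from 95% to 80% more than doubles the average cost of every memory access.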

 

Beyond the on-chip cache memories, the CPU has to talk to the motherboard memory bus and retrieve data from the installed RAM. If you have a multi-core chip, all the cores on the chip use the same bus. If your AMD system has a NUMA architecture, then each processor node has its own bus to its own RAM. If one processor has to get data from another processor's RAM, the access latency is obviously higher than for local access. 

 

Beyond RAM, the operating system can "page-out" RAM memory contents to disk. It can take some time to locate where memory contents are stored on the drive.  So, any code that causes page faults (requiring data to be paged in from the disk) will run slower, and just hang around waiting for data that is resident on the drive. The worst situation is data stored on a remote system across the network. In that case, the data has to be encoded and transported across a relatively slow network!

 

Once you've had a moment to think about data access and your own experience, stretch, get a preferred beverage, and then sit down and read through AMD's Software Optimization Guide. There is an enormous amount of technical information and useful content in the guide, or "SWOG". If you don't deal with assembly language, you can just skim those parts, but the concepts are important. You can also get a much quicker grasp on some of the most important points of the SWOG by viewing the Software Optimization Guide Video Series. In the workshop, I just covered the "Key Optimizations" listed on page 6, table 2 of the pdf:

Table 2. Optimizations by Rank 

1.     Load-Execute Instructions for Floating-Point or Integer Operands (See section 4.2 on page 53.)

2.     Write-Combining (See section 5.6 on page 89.)

3.     Branches That Depend on Random Data (See section 6.3 on page 101.)

4.     Loop Unrolling (See section 7.2 on page 110.)

5.     Pointer Arithmetic in Loops (See section 7.6 on page 116.)

6.     Explicit Load Instructions (See section 9.2 on page 146.)

7.     Reuse of Dead Registers (See section 9.15 on page 165.)

8.     ccNUMA Optimizations (See section 11.1 on page 179.)

9.     Multithreading (See section 11.3 on page 190.)

10. Prefetch and Streaming Instructions (See section 5.5 on page 81.)

11. Memory and String Routines (See section 5.9 on page 92.)

12. Loop Iteration Boundaries (See section 4.3 on page 56.)

13. Floating-Point Scalar Conversions (See section 9.16 on page 166.)

 

Several of these suggestions apply only if you can control the generated assembly code from your source code, but some of them affect very high-level programming. The only item I'm going to comment on at this time is the Memory and String Routines. I'll get to the threading tips later. You should know what libraries have already been optimized and how to use them to your benefit. Many brilliant engineers have sweated over the most commonly used string functions, optimizing them so you don't have to reinvent and then polish the wheel. AMD has provided open source libraries like Framewave for just this purpose. Your code goes faster and you don't have to do anything but change the link stage of your build process! 

 

It's also useful to understand how the language you're using lays out arrays and structures in memory. That will help you to know how to efficiently access information in an order that will take advantage of the memory cache lines and avoid page faults. There are two principles of locality that are exploited when the CPU caches memory data: temporal and spatial. They're both intuitive. Temporal locality means that recently accessed items are likely to be accessed in the near future, so the older items which haven't been accessed recently are more likely to be flushed from the cache. Spatial locality means that items whose addresses are near one another tend to be referenced close together in time; that is why, when a data item is loaded into a cache line, the bytes around it are loaded as well. This would be a good time to look at the AMD CodeAnalyst tutorial from the included "Help" section with the classic matrix multiplication in C++ to illustrate the impact of using memory efficiently.
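The effect of traversal order on spatial locality can be sketched without any hardware at all. The model below is a simplification: it assumes a row-major array (as in C and C++), 8-byte elements, and 64-byte cache lines, and it only counts how often consecutive accesses land in a different cache line. The function name is made up for this example.

```python
def cache_line_switches(rows, cols, order, elem_size=8, line_size=64):
    """Count how often a traversal moves to a different cache line.

    The matrix is assumed to be stored row-major, so element (r, c)
    lives at byte offset (r * cols + c) * elem_size.
    """
    if order == "row":
        indices = ((r, c) for r in range(rows) for c in range(cols))
    elif order == "col":
        indices = ((r, c) for c in range(cols) for r in range(rows))
    else:
        raise ValueError("order must be 'row' or 'col'")

    switches = 0
    last_line = None
    for r, c in indices:
        line = ((r * cols + c) * elem_size) // line_size
        if line != last_line:
            switches += 1
            last_line = line
    return switches


print(cache_line_switches(64, 64, "row"))  # 512: one new line per 8 elements
print(cache_line_switches(64, 64, "col"))  # 4096: every access hits a new line
```

Walking the same 64 x 64 matrix column by column touches a new cache line on every single access, eight times as often as the row-by-row walk. Whether those lines are still cached when you come back to them depends on the cache size, which this sketch does not model.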

 

I hope this gave you a good intro to software optimization issues. I'll be posting 4 additional blogs to cover the rest of the agenda items that I outlined above. Let me know if you find this interesting and if you have any questions. 

 

-Frank Swehosky, Sr. Software Development Engineer





Edited: 04/02/2009 at 07:03 PM by devcentral

    Posted By: AMD DeveloperCentral @ 04/02/2009 06:45 PM     Hard-Core Software Optimization     Comments (0)  

July 9, 2008
  Mandelbrot and 16-bit fixed point multiplies ( Part II )
The original SSE2 implementation of the Mandelbrot leveraged the PMADDWD instruction to do the work inside of the inner loop. Unfortunately, in order to get this to work, lots of code had to be inserted to shuffle data around, and this instruction leaves the packed results in a 32-bit format. This requires PACKSSDW instructions to get the data back to 16 bits before it can be used for further computation, which adds significant overhead to the inner loop calculation. The advantages that PMULHRSW provides are that there are no dependencies on how the data is ordered within the register and that it produces its results in a packed 16-bit format. After gaining an understanding of the code differences between the SSE2 and SSSE3 implementations, I believed that it was possible to gain this same advantage using SSE2, but I needed to substitute PMULHW for PMULHRSW. PMULHW writes the upper 16 bits of the 32-bit temporary result to the destination, as illustrated in Figure 3 below.


Figure 3: Bit selection of PMULHW

Using 4.12 fixed point arithmetic, PMULHW produces an 8.8 fixed point number as a result, as illustrated in Figure 4.


Figure 4: 4.12 PMULHW multiply; W=Whole, F=Fractional bit
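The bit selection can be modeled in a few lines of scalar code. This is only a sketch of one 16-bit lane, not the benchmark's SIMD code; the helper names `to_fixed` and `pmulhw_lane` are made up for this example.

```python
def to_fixed(value, frac_bits):
    """Convert a real number to a signed fixed-point integer."""
    return int(round(value * (1 << frac_bits)))


def pmulhw_lane(a, b):
    """Model one lane of PMULHW: high 16 bits of the signed 32-bit product."""
    assert -32768 <= a <= 32767 and -32768 <= b <= 32767
    # Python's >> rounds toward minus infinity on negative ints, which is
    # exactly the signed high word of the two's-complement product.
    return (a * b) >> 16


# 4.12 inputs have 12 fractional bits each, so the product has 24;
# dropping the low 16 bits leaves 8 fractional bits (an 8.8 number).
a = to_fixed(1.5, 12)      # 6144
b = to_fixed(1.25, 12)     # 5120
hi = pmulhw_lane(a, b)
print(hi, hi / 256.0)      # 480 1.875

# A factor of 2^-9 is representable in 4.12 but vanishes from the 8.8 result.
print(pmulhw_lane(to_fixed(1.0, 12), to_fixed(2.0 ** -9, 12)))  # 0
```

The second result shows the cost of this format: anything smaller than 2^-8 in the product is simply dropped.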

This leaves the least significant bit of the result to represent 2^-8. Unfortunately, after modifying the SSE2 code to use this technique, the fidelity of the Mandelbrot picture started to degrade. Inside the edges of the Mandelbrot pattern, long strings of black pixels began to show, and the rest of the intricate pattern began to look noisy and dirty, with random pixel popping. It became evident that the Mandelbrot calculation needed to retain the 2^-9 bit. I needed to add more fractional bits (precision) to the upper 16 bits of the multiply to get this technique to work.
The only way to add more fractional bits is to take away whole bits. Stepping back a bit, I took a hard look at the data. In this particular Mandelbrot benchmark, the left and right edges of the window are represented by -2.25 and .75 respectively, and -1.25 and 1.25 for the bottom and top edges respectively. If I took away a whole bit from the fixed point data, changing the 4.12 input data to 3.13, I still have enough range to represent the default Mandelbrot zoom. For signed data, 3 whole bits can represent a range from -4 to +3. If you factor in the value of fractional bits, the upper range is actually very close to positive 4. Figure 5 below illustrates how PMULHW treats 3.13 source data.


Figure 5: 3.13 PMULHW multiply; W=Whole, F=Fractional bit

As can be seen, the precision of PMULHW now goes to 2^-10. Because a whole bit was removed from each 16-bit source, the multiplied result loses two whole bits. This reduces the range of the signed result to 6 bits (-32 to +31), but this still goes beyond our modest needs. In addition, with 10 fractional bits, this exceeds the precision that PMULHRSW gave with 4.12 fixed point data. A testing pass verified that the fidelity of the rendered fractal with this new data format and algorithm was indeed tight and well formed. Overall, with the reduction of pack and data swizzling instructions, this resulted in about a 2.7x speedup over the original SSE2 implementation that used PMADDWD. In my own internal measurements, this is actually slightly faster than the PMULHRSW optimized versions as well.
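A scalar model makes the gain in precision concrete. This is an illustrative sketch of a single lane, not the benchmark code; `fixed_mul_high` and `fits_int16` are hypothetical helper names.

```python
def fits_int16(value, frac_bits):
    """True if value is representable as a signed 16-bit fixed-point number."""
    return -32768 <= int(round(value * (1 << frac_bits))) <= 32767


def fixed_mul_high(x, y, frac_bits):
    """Multiply two reals the way PMULHW would on fixed-point inputs.

    Both inputs use `frac_bits` fractional bits; keeping the high 16 bits
    of the product leaves 2 * frac_bits - 16 fractional bits in the result.
    """
    a = int(round(x * (1 << frac_bits)))
    b = int(round(y * (1 << frac_bits)))
    assert -32768 <= a <= 32767 and -32768 <= b <= 32767
    high = (a * b) >> 16
    return high / (1 << (2 * frac_bits - 16))


# The window edges of the benchmark still fit after giving up one whole bit.
print(all(fits_int16(v, 13) for v in (-2.25, 0.75, -1.25, 1.25)))  # True

# 1.0 * 2^-9: lost with 4.12 inputs, preserved with 3.13 inputs.
print(fixed_mul_high(1.0, 2.0 ** -9, 12))  # 0.0
print(fixed_mul_high(1.0, 2.0 ** -9, 13))  # 0.001953125
```

With 3.13 inputs the result keeps 10 fractional bits, so the 2^-9 contribution that disappeared in the 4.12 version survives the multiply.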
I think the biggest point that I want to get across with my writing is, "Think about your data". How can you optimize its use? Not only in terms of how much you have, but as illustrated in this fixed point example, the data format as well. While it's true that the PMULHW instruction enables us to do a fast 16-bit fixed point multiply, we had to change the format of the data to make the optimization possible. If you have control over your data (in this benchmark I had, but this is not always the case), the time spent optimizing your data up front can pay back huge dividends later on with simpler, shorter, and faster code.

Kent Knox
Member of AMD Technical Staff

Kent Knox is a Member of Technical Staff in Solutions Enablement Engineering at AMD. His postings are his own opinions and may not represent AMD's positions, strategies or opinions. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied.

-------------------------
This response is provided for informational purposes only, is provided "AS IS" and does not obligate AMD to provide any of the services, technology, or programs described.

Edited: 07/09/2008 at 02:54 PM by kknox

    Posted By: Kent Knox @ 07/09/2008 01:59 PM     Hard-Core Software Optimization     Comments (0)  

July 1, 2008
  Mandelbrot and 16-bit fixed point multiplies ( Part I )
I recently had the opportunity to work on and help optimize a benchmark that uses fixed-point math to carry out an iterative calculation in a loop. The function of the benchmark is to calculate a Mandelbrot fractal in memory and report the time it takes to 'draw' the fractal as a rate. There are several different codepaths inside this benchmark, each implemented to take maximum advantage of different SIMD instruction sets, such as SSE2, SSSE3 and SSE4.1. The SSSE3 and SSE4.1 versions of this routine were approximately 2.7x faster than the SSE2 version. AMD processors support SSE2, SSE3 and SSE4a, but not SSSE3 or SSE4.1, so I wanted to investigate what could be done to optimize the SSE2 version of the function.
After I had a chance to visually inspect the various codepaths, it became obvious that the reason the SSSE3 and SSE4.1 routines had such a significant performance lead was the PMULHRSW instruction. There is not much literature available on this SSSE3 opcode, but it is defined as 'Packed Multiply High with Round and Scale' and is an instruction designed for fixed-point math. It operates on packed integer data, multiplying two packed 16-bit source words and producing a packed 16-bit destination word. The 16 bits that this instruction chooses to place in the destination register are a little unusual, as illustrated by Figure 1 below. When two 16-bit values are multiplied together, the result is a 32-bit value. However, in order to get 32 bits to fit in a 16-bit result, some bits have got to go, and the bits that PMULHRSW chooses to keep are significantly different from those kept by PMULHW or PMULLW. The red squares in the figure below represent bits PMULHRSW truncates, and the green bits are written as the result of the multiply.


Figure 1: Bit selection of PMULHRSW

Bit 31 is a redundant sign bit, so it gets truncated; this is an effect of the two 16-bit sources being signed inputs. Bits 30-15 are the next 16 most significant consecutive bits, and the remaining least significant bits are truncated. For good measure, the result is rounded by adding a one at bit 14, the most significant of the truncated low bits, before truncation; this is where the 'round' comes from in the definition of the instruction and makes sense only for fixed point numbers, as this increases the accuracy of fractional bits. Since the most significant sign bit is truncated, the answer written to the destination register is logically left shifted by 1 (bit 30 is now the most significant bit of the 16-bit result), which in effect is multiplying the result by two; this is where the 'scale' comes from in the definition of the instruction.
This particular Mandelbrot benchmark was originally written to operate on data in a 4.12 fixed point format. For those who feel a little rusty, this Book of Hook page provides a simple review of fixed point math. The zoom of the Mandelbrot includes the real x-axis number range from -2.25 to +.75, and the imaginary y-axis number range from -1.25 to +1.25, which with signed 4.12 numbers leaves plenty of slack. The inner loop of the Mandelbrot algorithm is a sequence of mul's and add's of complex numbers. A Mandelbrot white paper describing how to calculate the Mandelbrot algorithm can be found following the link. Also, Mike Wall has an article on performance optimization in Windows in which he uses a Mandelbrot sample for his explanation; full source code available. For the SSSE3 and SSE4.1 implementation, PMULHRSW was used to multiply these 4.12 fixed point numbers; two 4.12 numbers multiplied together create an 8.24 32-bit number, and using the bit selection technique of PMULHRSW as illustrated in Figure 2, a rounded 7.9 fixed point number is written out as the packed result. The least significant fractional bit represents 2^-9, which provides enough precision to render a faithful representation of the Mandelbrot set. Eventually, this product has to be left shifted by 3 bits to get back to the original 4.12 to continue the iterations of packed mul's and add's.


Figure 2: 4.12 PMULHRSW multiply; W=Whole, F=Fractional bit
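The round-and-scale behavior can be modeled for a single 16-bit lane as follows. This is a scalar sketch based on the instruction's documented operation (shift the product right by 14, add one, shift right by one more), not code from the benchmark; the helper names are made up for this example.

```python
def to_int16(x):
    """Wrap an integer to a signed 16-bit value, as the destination lane would."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def pmulhrsw_lane(a, b):
    """Model one lane of PMULHRSW on two signed 16-bit inputs."""
    assert -32768 <= a <= 32767 and -32768 <= b <= 32767
    # Keep bits 30..14 of the product, add one at bit 14 to round,
    # then drop that bit so bits 30..15 remain.
    return to_int16((((a * b) >> 14) + 1) >> 1)


def to_fixed(value, frac_bits):
    """Convert a real number to a signed fixed-point integer."""
    return int(round(value * (1 << frac_bits)))


# Two 4.12 inputs give a 7.9 result (24 - 15 = 9 fractional bits).
product = pmulhrsw_lane(to_fixed(1.5, 12), to_fixed(1.25, 12))
print(product, product / 512.0)          # 960 1.875

# Shifting left by 3 returns the product to 4.12 for the next iteration.
print(product << 3 == to_fixed(1.875, 12))  # True

# Unlike the 4.12 PMULHW case, a factor of 2^-9 survives.
print(pmulhrsw_lane(to_fixed(1.0, 12), to_fixed(2.0 ** -9, 12)))  # 1
```

The last line is the key difference for the Mandelbrot rendering: the least significant result bit is worth 2^-9, one bit finer than the 8.8 result of PMULHW on the same inputs.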

This post gave the background of the optimization problem and described the operation of the PMULHRSW opcode. In Part II of my discussion, I will describe the technique I used to optimize the Mandelbrot fixed-point multiply for SSE2.

Kent Knox
Member of AMD Technical Staff

Edited: 07/09/2008 at 02:54 PM by kknox

    Posted By: Kent Knox @ 07/01/2008 06:52 PM     Hard-Core Software Optimization     Comments (0)  

February 21, 2008
  Optimizing Inter-Core Data Transfer on AMD Phenom™ processors
The AMD Phenom™ family of microprocessors (Family 0x10) is AMD's first generation to incorporate 4 distinct cores on a single die, and the first to have a cache that all the cores share. There is a small subset of compute problems that can be categorized as belonging in a Producer and Consumer paradigm; a thread of a program running on a single core produces data, which is meant to be consumed by a thread that is running on a separate core. With such programs, it is desirable to get the two distinct cores to communicate through the shared cache, to avoid round trips to/from main memory.

With a naïve implementation of a producer/consumer program on the AMD Phenom processor, measured bandwidth results will appear to be throttled by main memory speeds. Main memory speeds can vary, but with DDR2 533 memory (average grade), this is around 4 GB/s. Why is this?

There are several architectural details on the AMD Phenom processor that can limit inter-core bandwidth if not properly understood. The type and size of the cache on the AMD Phenom core has a direct effect on bandwidth; it includes a "mostly exclusive victim" cache. The MOESI protocol that the AMD Phenom cache uses for cache coherency can also limit bandwidth; it is important to keep a cache line in the 'M' state for optimal producer/consumer performance. A detailed explanation of the AMD Phenom cache architecture and how this relates to producer/consumer performance can be found in the Software Optimization Guide for AMD Family 10h Processors ( section 11.5 ).

Assuming a single buffer has been defined for the producer and consumer threads to walk and communicate, the following bulleted list is a checklist of the constraints to follow to achieve maximum bandwidth:

  • The consumer thread needs to 'lag' the producer thread by at least the combined size of the L1 and L2 caches (modulo arithmetic)

  • The producer thread needs to 'lag' the consumer thread by at least the combined size of the L1 and L2 caches (modulo arithmetic)

  • The buffer should be at least 2*(L1 + L2) in size

  • The producer thread should not get so far ahead of the consumer that it floods the L3, if a large buffer is used

  • Use prefetchw on the consumer side, even if the consumer does not want to modify the data

  • Add a small fudge factor to the calculated sizes to give the threads some 'slack' when communicating through the caches
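The sizing rules in this checklist can be sketched in a few lines of C. This is only an illustration, not code from the optimization guide: the cache sizes (64 KB L1 data, 512 KB L2 per core), the 4 KB slack value, and all function names are assumptions chosen for the example, so confirm the sizes for the actual part before relying on them.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed per-core cache sizes, for illustration only. */
#define L1D_BYTES   (64u * 1024u)
#define L2_BYTES    (512u * 1024u)
#define SLACK_BYTES (4u * 1024u)   /* hypothetical fudge factor */

/* Minimum distance each thread keeps behind the other: L1 + L2 + slack. */
size_t min_lag_bytes(void)
{
    return (size_t)L1D_BYTES + L2_BYTES + SLACK_BYTES;
}

/* Smallest buffer in which both lag constraints can hold at once. */
size_t min_buffer_bytes(void)
{
    return 2u * min_lag_bytes();
}

/* Bytes by which `behind` trails `ahead` in a circular buffer (modulo arithmetic). */
size_t ring_distance(size_t ahead, size_t behind, size_t buf_bytes)
{
    assert(buf_bytes > 0 && ahead < buf_bytes && behind < buf_bytes);
    return (ahead + buf_bytes - behind) % buf_bytes;
}

/* Nonzero when the lag constraint is satisfied in both directions. */
int lag_ok(size_t producer, size_t consumer, size_t buf_bytes)
{
    size_t lag = min_lag_bytes();
    return ring_distance(producer, consumer, buf_bytes) >= lag &&
           ring_distance(consumer, producer, buf_bytes) >= lag;
}
```

With these assumed sizes the minimum lag is 593,920 bytes and the minimum buffer is 1,187,840 bytes. A producer or consumer loop could call lag_ok() before advancing its offset; the L3 flooding limit and the prefetchw hint from the checklist are not modeled here.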


In general, the AMD Phenom cache is optimized for widely shared data, i.e. one core produces data that many other cores may be interested in. In the producer/consumer program, however, it is known ahead of time that the data the producer creates is only interesting to the matching consumer thread, and not to any other thread. Following the constraints listed above, it is possible to achieve an aggregate bandwidth of ~12 GB/s for two producer/consumer pairs (to use all 4 cores) on the AMD Phenom processor.

Kent Knox
Member of AMD Technical Staff


Edited: 07/01/2008 at 06:04 PM by kknox


    Posted By: Kent Knox @ 02/21/2008 05:13 PM     Hard-Core Software Optimization     Comments (1)  

August 31, 2007
  Intrinsics and Casting

One of the nice but dangerous things about assembly language is its virtual
lack of any kind of type checking. If you are optimizing C code by writing in-line
assembler, it is fairly easy to reference C variables and constructs. Of course
you need to know what you are doing, and the compiler will assume you do. When
I began porting some existing in-line assembler to intrinsics I was surprised
and, okay, pretty frustrated at how often the compiler would not accept my references
in an intrinsic even though it worked in in-line assembler code. Since intrinsics
are compiled as "C" code, this is the compiler doing as it should
and enforcing type checking. Those of you that are "casting gurus"
will have no problem with this issue, but for those of you that spend most of
your time at the assembler level, I hope that these comments will help.


Simple

So let's start with a simple example. Suppose you have four single precision
floats you want to load into an xmm register. In assembly it's really easy:
    typedef struct tFOUR_FLOATS
    {
        struct
        {
            float f0;
            float f1;
            float f2;
            float f3;
        };
    } FOUR_FLOATS;

    FOUR_FLOATS var0 = {1.0f, 2.0f, 3.0f, 4.0f};

    _asm
    {
        lea esi, var0
        movups xmm0, xmmword ptr [esi]
    }


To accomplish the same thing using an intrinsic, it's still not bad:
    {
        __m128 r0ps;
        FOUR_FLOATS var0 = {1.0f, 2.0f, 3.0f, 4.0f};

        r0ps = _mm_loadu_ps((float *) &var0);
    }


If you're new to using intrinsics, you'll see that we had to create a variable
r0ps for the result of the load. For debug builds the compiler will actually
create a 128-bit variable r0ps, which you can use for debugging. This is relatively
slow, as the memory location will be accessed on each reference to r0ps. But
not to worry: for release (optimized) builds, the compiler will load the value
directly into an xmm register. There could be some exceptions to this, depending
on the complexity of your code, where the compiler is forced to spill some values
into memory.
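For readers who want to see the load used end to end, here is a minimal sketch that loads two sets of four floats, adds them, and stores the result back. The function name add_four_floats and the flattened struct layout are assumptions made for this example; the intrinsics themselves (_mm_loadu_ps, _mm_add_ps, _mm_storeu_ps) are the standard SSE ones from xmmintrin.h.

```c
#include <assert.h>
#include <xmmintrin.h>   /* SSE: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */

typedef struct tFOUR_FLOATS { float f0, f1, f2, f3; } FOUR_FLOATS;

/* Adds two FOUR_FLOATS values lane by lane using unaligned SSE loads and stores. */
FOUR_FLOATS add_four_floats(const FOUR_FLOATS *a, const FOUR_FLOATS *b)
{
    FOUR_FLOATS out;
    __m128 ra, rb;

    assert(a && b);
    ra = _mm_loadu_ps((const float *)a);   /* same cast as in the post */
    rb = _mm_loadu_ps((const float *)b);
    _mm_storeu_ps((float *)&out, _mm_add_ps(ra, rb));
    return out;
}
```

The unaligned variants are used because nothing here guarantees 16-byte alignment of the structs; if the data is known to be aligned, _mm_load_ps and _mm_store_ps could be used instead.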


Inline MMX to Intrinsic SSE

I'm still talking about casting here but I suspect there's a lot of inline
MMX code out there that developers would like to keep because it's optimized,
but it needs to be converted to intrinsics to work in 64-bit builds. Frankly,
this should be converted to SSE anyway to take advantage of the xmm registers,
even if it's for 32-bit builds.

In the MMX code fragment below, you might not create pSrc as a pointer to an
unsigned char since that's not what it's pointing to, but you have undoubtedly
dealt with a lot of code that does just this. The point is that the inline assembler
merrily deals with it.
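    {
        // TWO_FLOATS is not defined in this post; it is assumed to be a
        // struct of two floats, analogous to FOUR_FLOATS above.
        TWO_FLOATS var1 = {5.0f, 6.0f};

        unsigned char * pSrc = (unsigned char*) &var1;

        _asm
        {
            mov esi, dword ptr [pSrc]
            movd mm0, dword ptr [esi]
        }
    }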

If you just scan the available intrinsics you'll notice that there is no intrinsic
with a name like movd. But the functionality is there. For similar functionality
but converting to XMM registers I chose _mm_cvtsi32_si128. Since this intrinsic
requires an int as the input parameter, you'll need to do this cast for this
case:


__m128i r2i;

r2i = _mm_cvtsi32_si128 (*(int *)pSrc);
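If the pointer cast makes you uneasy (pSrc may not really point at an int), one alternative is to copy the four bytes into an int with memcpy first. This is a swapped-in variation rather than the approach used above, and the helper names load_low32 and low32_of are hypothetical; the sketch also uses _mm_cvtsi128_si32 to read the low lane back so the result can be checked.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2: __m128i, _mm_cvtsi32_si128, _mm_cvtsi128_si32 */
#include <string.h>      /* memcpy */

/* Loads the first 4 bytes at pSrc into the low 32-bit lane of an XMM value.
   The upper lanes are zeroed by _mm_cvtsi32_si128. */
__m128i load_low32(const unsigned char *pSrc)
{
    int bits;

    assert(pSrc);
    memcpy(&bits, pSrc, sizeof bits);   /* replaces the *(int *) cast */
    return _mm_cvtsi32_si128(bits);
}

/* Reads the low 32 bits of an XMM value back out as an int. */
int low32_of(__m128i v)
{
    return _mm_cvtsi128_si32(v);
}
```

Optimizing compilers generally turn a fixed-size 4-byte memcpy like this into a single load, so the extra call is not expected to cost anything in release builds.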


Arrays and Intrinsics


Another casting issue you'll come across is indexing into arrays. In the code
fragment below, we want to be able to index into an array and load an integer
value from the array into an XMM register. In assembler, the compiler will allow
you to easily do the pointer arithmetic and then use the result as an address
to read.


int IntArray[10] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
int *pInt = &IntArray[0];
int Someindex = 3;
__m128i r2i;

_asm // casting and indexing into arrays
{
mov esi, pInt
mov ebx, dword ptr [Someindex] // just for example
lea esi, [esi + ebx * 4] // 4 bytes per int
movd xmm0, dword ptr [esi]
}

If you try to do this with intrinsics you might try:

	r2i = _mm_cvtsi32_si128 (pInt + Someindex);


But the compiler will generate an error:

	error C2664: '_mm_cvtsi32_si128' : cannot convert parameter 1 from 'int *' to 'int'


One way to get this to work is:

	r2i = _mm_cvtsi32_si128 (*((int *)pInt + (unsigned int) Someindex));
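Since pInt is already an int pointer, plain array indexing should produce the same value without any casts. The sketch below wraps that in a small helper (load_int_at is a hypothetical name chosen for this example) so it can be compiled and checked on its own.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2: __m128i, _mm_cvtsi32_si128 */

/* Loads array[index] into the low 32-bit lane of an XMM value. */
__m128i load_int_at(const int *array, int index)
{
    assert(array && index >= 0);
    /* array[index] is *(array + index): the dereference is what turns
       the 'int *' the compiler complained about into an 'int'. */
    return _mm_cvtsi32_si128(array[index]);
}
```

The compiler scales the index by sizeof(int) itself, which is the job the lea with "ebx * 4" did in the assembler version.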




    Posted By: Randy VanderHeyden @ 08/31/2007 07:18 PM     Hard-Core Software Optimization     Comments (0)  

June 29, 2007
  Fun with Intrinsics
When porting 32-bit code to 64-bit code, one of the issues that must be dealt with is the absence of inline assembly language support in the Microsoft Visual Studio 2005 64-bit compiler. One of the common places where developers have used inline assembly is to determine specific processor features via the cpuid instruction. This can be really easy to fix using intrinsics.

While this is not intended to be a primer on intrinsics, I'll write a few introductory words for those who are unfamiliar. Compiler intrinsics are generally built into a compiler as opposed to being in a library. They also usually generate inline code, although sometimes there is an option to generate a function call. Although an intrinsic may offer some portability, for example the __cpuid intrinsic will work fine for either 32-bit x86 or 64-bit AMD64 targets with the Visual Studio compiler, this may not be so across compilers. GCC has intrinsics, but I believe they are referred to as built-ins, and they tend to have different mnemonics even though many overlap in terms of the native instructions they use.

In assembly language, the code might look like the crude code in example 1 below. This code just executes one cpuid function call and displays the results. Your real code would be more substantial. The intrinsic version is shown in example 2 below. Note you'll need to include the header intrin.h to use the __cpuid intrinsic.

I originally decided to write this note because I was unable to locate a cpuid sample using intrinsics. Of course, since that time I have found a fairly thorough one on MSDN. It's a little difficult to find online by searching, so here's a link, or failing that, drill down in the MSDN Library to Development Tools and Languages\Visual Studio 2005\Visual Studio\Visual C++\Reference\C/C++ Languages\Compiler Intrinsics\Alphabetical Listing...\__cpuid.

http://msdn2.microsoft.com/en-...y/hskdteyh(VS.80).aspx

Code:
Example 1:

// cpuid.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include <stdio.h>

int _tmain(int argc, _TCHAR* argv[])
{

   int      largestfunction;
   char   idstring[13] = {0};

   __asm
   {
      mov   eax, 0
      cpuid
      mov largestfunction, eax

      mov dword ptr idstring[0], ebx
      mov dword ptr idstring[4], edx
      mov dword ptr idstring[8], ecx
   }
   printf("%s Largest function supported is %x", idstring, largestfunction);
   return 0;
}


Example 2:

// cpuid_intrinsic.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include <intrin.h>
#include <stdio.h>
#include <string.h>

int _tmain(int argc, _TCHAR* argv[])
{
   int      CPUInfo[4] = {0};
   char   idstring[13] = {0};

   __cpuid(CPUInfo, 0);
   *(int*)idstring = CPUInfo[1];
   *(int*)(idstring + 4) = CPUInfo[3];
   *(int*)(idstring + 8) = CPUInfo[2];
   idstring[12] = 0;
   printf("%s Largest function supported is %#x", &idstring[0], CPUInfo[0]);
   return 0;
}
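For comparison, here is a rough GCC/Clang counterpart. It is a sketch under stated assumptions, not part of the original samples: it relies on the __get_cpuid helper from the compiler-provided cpuid.h header (not available with Visual Studio), the function name cpu_vendor is hypothetical, and memcpy is swapped in for the pointer casts used in Example 2.

```c
#include <assert.h>
#include <cpuid.h>    /* GCC/Clang header providing __get_cpuid */
#include <string.h>   /* memcpy */

/* Fills vendor[13] with the CPUID function-0 vendor string.
   Returns the largest standard function number, or -1 if the
   query is not supported. */
int cpu_vendor(char vendor[13])
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    assert(vendor);
    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return -1;

    memcpy(vendor + 0, &ebx, 4);   /* register order is EBX, EDX, ECX */
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);
    vendor[12] = '\0';
    return (int)eax;
}
```

A caller would declare char vendor[13], call cpu_vendor(vendor), and print the string together with the returned function number, much as Example 2 does with printf.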



    Posted By: Dev PerformanceTeam @ 06/29/2007 11:32 AM     Hard-Core Software Optimization     Comments (0)  

June 27, 2007
  Welcome to Hard-Core Software Optimization!
Welcome to our blog. This blog is from the Developer Performance Team at AMD. We will mostly concentrate on software optimization and performance topics, though other areas related to software development and AMD products in general may be touched on. We may also dive into hardware performance, especially as it relates to I/O and how to extract the best performance from it from a software perspective.

Please do give us feedback as to the topics you'd like to see discussed, and your overall opinions of the blog.


    Posted By: Dev PerformanceTeam @ 06/27/2007 10:51 AM     Hard-Core Software Optimization     Comments (2)  
