AMD Developer Blogs
June 30, 2009
ACML 4.3.0 Performance Data
Now that the ACML 4.3.0 release is completed and posted live on AMD Developer Central, I’ve been spending time collecting all the performance data needed to document the improvements in the 4.3.0 release. There are several new features that should show up nicely in performance graphs. Improvements include a new SGEMM kernel for AMD Family 10h, new DGEMM and SGEMM for Woodcrest, Penryn, and Nehalem Intel processors, improved level 1 BLAS kernels, 3D FFT work, and new scalar acml_mv functions. It’s a really long list!
You can easily demonstrate these new performance features by using the examples in the performance directory of the ACML installation. There are examples for a few different routines, and these can be easily modified to demonstrate other routines as well.
A couple of trends are jumping out from the data collected so far. First, the 4.3.0 Level 3 BLAS routines run much better than previous versions on Intel machines. They are very competitive with MKL on Intel processors!
Second, the Intel Nehalem is a very impressive processor. However, Istanbul’s 6 cores can crank out a bunch of raw DGEMM flops. This graph tells the story:

More information on ACML 4.3.0 is available on the ACML home page. If you have feedback on how the new release improves performance for your application, we'd love to hear about it.
-------------------------
This response is provided for informational purposes only, is provided “AS IS” and does not obligate AMD to provide any of the services, technology, or programs described.
June 29, 2009
Removing C wrapper functions from the AMD Core Math Library (ACML) to resolve linking issues.
ACML is a significant library of (mostly) FORTRAN subroutines, provided in binary form and available for download at http://developer.amd.com/acml. Each version of the library has been compiled with a particular FORTRAN compiler, and is compatible with application programs written and compiled with the same compiler.
Although FORTRAN programming has hardly disappeared, if you're reading this blog, the odds are far more likely that you're developing in C/C++ or C#.
Calling FORTRAN subroutines from C/C++/C# is doable, but there are a lot of potential problems and pitfalls. The C and FORTRAN languages have completely different subroutine naming and argument-passing conventions. For example, where C/C++ passes parameters by value (except for arrays), FORTRAN passes them by reference. When you have a multi-dimensional array, FORTRAN stores the data in column-major order; C/C++ uses row-major order. Different FORTRAN compilers have different conventions for passing strings, for the name of the subroutine entry point, etc.
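The storage-order difference is easy to get wrong, so here is a minimal sketch of the index arithmetic. It is written in Java purely for illustration; the class and method names (ColumnMajor, index, toColumnMajor) are hypothetical and are not part of ACML. It assumes a matrix written row by row, as a C programmer would lay it out, and copies it into the flat column-major buffer a FORTRAN routine would expect.

```java
import java.util.Arrays;

// Illustrative only: these names are hypothetical and not part of ACML.
final class ColumnMajor {
    /** Offset of element (row, col) in a column-major buffer with leading dimension lda. */
    static int index(int row, int col, int lda) {
        return row + col * lda;
    }

    /** Copies a matrix given row by row into a flat column-major buffer. */
    static double[] toColumnMajor(double[][] a) {
        int rows = a.length;
        int cols = a[0].length;
        double[] flat = new double[rows * cols];
        for (int c = 0; c < cols; c++) {
            for (int r = 0; r < rows; r++) {
                flat[index(r, c, rows)] = a[r][c];
            }
        }
        return flat;
    }

    public static void main(String[] args) {
        double[][] a = {
            {1, 2, 3},
            {4, 5, 6}
        };
        // Columns become contiguous: 1, 4, then 2, 5, then 3, 6.
        System.out.println(Arrays.toString(toColumnMajor(a)));
    }
}
```

The leading dimension (here simply the row count) is the stride between consecutive columns, which is why BLAS-style routines take it as a separate argument.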
To help make ACML useful to C/C++/C# programmers, some versions of the library come with support for C compilers, including an "acml.h" header and "C wrapper" functions. These alternate entry points take care of most of the hassle for you (although it's up to the user to watch out for the row-major versus column-major array problem).
For example, suppose you consulted the section "Determining the best ACML version for your system" in the ACML manual (online here: http://developer.amd.com/cpu/Libraries/acml/onlinehelp/Documents/BestLibrary.html#BestLibrary), and chose to download the Linux IFort64 version for your project. You would be able to code your project with either Intel (R) FORTRAN or a compatible C/C++ compiler. Your choice.
So how does this work? If a FORTRAN module containing CALL DNRM2 (...) or SUBROUTINE DNRM2 (...) is compiled with the 64-bit ifort compiler, the linkage name passed to the linker is "dnrm2_" (note the lower-case symbol name with a trailing underscore). Both the caller and the callee assume that all parameters are passed by reference.
If a C program module containing "#include <acml.h>" and a call to dnrm2 (...) is compiled with the 64-bit GNU gcc compiler, the linkage name passed to the linker is "dnrm2" (lower-case symbol name without the trailing underscore). The caller passes array parameters by reference, but all other parameters are passed by value.
You can use the "objdump" or "nm" utilities from the GNU binutils package to confirm the external linkage symbols in an object or library file.
So, we can provide a single library with both FORTRAN-callable and C-callable versions of the same routine, because the linkage names used for subroutines are different for the two languages. The ACML library contains two object modules for each routine defined in "acml.h". The FORTRAN version exports the symbol with the trailing underscore as the entry point with the FORTRAN calling convention. A separate "C wrapper" module exports the symbol without the underscore as the entry point for a short routine that resolves the differences in calling conventions and then calls the FORTRAN-compatible version.
So all is well as long as your project is built with the specific FORTRAN compiler or a compatible C compiler or some combination of those. But you can run into trouble if yet another compiler is thrown into the mix, or another 3rd-party library which was built with another compiler is used.
One of our users recently ran into exactly this situation. They wanted to link together their program code, which was compiled with Intel (R) FORTRAN, plus ACML, plus yet another linear algebra library (which I won't name; let's call it library X). Library X was linked with object code from a different FORTRAN compiler which did not append a trailing underscore to the linkage name. The calling routine would push references (addresses) of the scalar parameters (such as the array sizes) onto the stack and then call the symbol "dnrm2" (without the underscore). The linker would match that name with the "C wrapper" for dnrm2, which would expect those parameters to have been passed by value. It would then execute the dnrm2 algorithm using the address of the array size variable N in place of N itself. This would probably just crash with a segmentation violation. If by some miracle it did not crash, it certainly would not compute the correct results.
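To make the failure mode concrete, here is a small simulation. It is only a sketch: Java has no raw addresses, so the "address" is a made-up number, and the class and method names are hypothetical stand-ins rather than ACML code. The nrm2 method plays the role of the C wrapper, which trusts that n is the element count itself.

```java
// Illustrative simulation only; nothing here is ACML code.
final class CallingConventionMismatch {
    /** Stand-in for a C wrapper around DNRM2: n is expected BY VALUE. */
    static double nrm2(long n, double[] x) {
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            // Java bounds-checks this access; native code would read stray memory.
            sum += x[i] * x[i];
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] x = {3.0, 4.0};
        long n = x.length;
        long addressOfN = 0x7ffc1000L; // hypothetical address where the caller stores n

        System.out.println("by value:     " + nrm2(n, x));
        try {
            // A by-reference caller hands over the address, which the wrapper reads as n.
            System.out.println("by reference: " + nrm2(addressOfN, x));
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("by reference: out-of-bounds access (a crash in native code)");
        }
    }
}
```

The exception here is the managed-language analogue of the segmentation violation described above: the wrapper walks far past the end of the array because it was handed an address where it expected a count.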
In some cases the ACML user can make local customizations to the ACML library to work around these problems. Of course, it is strictly the user's responsibility to ensure that these customizations are appropriate and generate correct linkages. In this case, the work-around was to remove all of the C wrappers from libacml.a.
The script below shows how this can be done. The technique used is a quick-and-dirty hack, and not the most efficient or elegant way of accomplishing the same effect.
#! /bin/sh

# Make a local copy of the ifort64 ACML static library
cp /opt/acml4.1.0/ifort64/lib/libacml.a ./libacml.a

# Create a list of all of the C-wrapper modules
ar -t libacml.a | egrep '_cw\.o$' > wrapperlist

# Create a script to delete all of the C-wrapper modules
# and execute it.
sed "s/.*/ar -dv libacml.a &/" wrapperlist | bash

# Clean up
rm ./wrapperlist
One undocumented piece of information makes it easier to remove the "C wrapper" functions from this version of libacml.a: All of those object modules have names with the suffix "_cw.o". There is no guarantee that this will be true in other versions of the library or in future releases.
With this knowledge, the "ar -t" and "sed ... | bash" lines of the script are all that is needed to remove these modules. Of course, this will remove them one at a time, which is remarkably slow and inefficient. On the other hand, you only need to do this once. You should expect this script to take a good fraction of an hour to execute, and plan accordingly; start it when you're ready to leave for lunch or a meeting.
Let us know if this makes ACML more useful for you; we'd like to hear what you're doing with the library.
-------------------------
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.
Edited: 06/30/2009 at 11:42 AM by jim.conyngham@amd.com
June 15, 2009
Just released: Advanced Synchronization Facility (ASF) specification
Recently AMD released an experimental specification for a proposed AMD64 architecture feature that may be of interest to all programmers of highly concurrent programs, libraries, runtimes, and operating systems: Advanced Synchronization Facility, or ASF for short. This is the first of three blog articles describing why AMD's Operating System Research Center (OSRC) became involved in the development of ASF, how we are evaluating ASF, and how this and other activities fit into the EU-funded VELOX project aiming at improving the state of the art for software-transactional-memory systems.
In this posting I will give you a quick overview of what ASF is and how it works, along with some example code. I'll also describe how I became involved in developing ASF and why we are releasing this spec proposal.
About ASF
In a nutshell, ASF is intended to make it easier to write efficient, highly concurrent programs.
When AMD introduced multicore CPUs to the x86 world, we acknowledged that individual CPU cores weren't getting much faster with each silicon-technology generation. Instead, we decided to provide multiple CPU cores in one processor. This put the burden on the software community of making programs run faster on newer processors (i.e., programs have to be changed to take advantage of the parallelism.)
Writing efficient, concurrent programs or parallelizing an existing sequential program is a hard endeavor. The trickiest part is making sure that all program threads have a consistent view of all shared data. ASF is intended to address this very problem, known as synchronization.
How does ASF work?
ASF provides a mechanism to update multiple shared memory locations atomically without having to rely on locks for mutual exclusion. It's quite flexible, as the semantics of the update are not fixed, but can be provided using standard x86 instructions.
Here's an example. This code snippet implements a two-word compare-and-swap (DCAS) primitive using the new SPECULATE, LOCK MOV, and COMMIT instructions:
; DCAS Operation:
; IF ((mem1 = RAX) && (mem2 = RBX))
; {
; mem1 = RDI
; mem2 = RSI
; RCX = 0
; }
; ELSE
; {
; RAX = mem1
; RBX = mem2
; RCX = 1
; }
; (R8, R9 modified)
;
DCAS:
MOV R8, RAX
MOV R9, RBX
retry:
SPECULATE ; Speculative region begins
JNZ retry ; Page fault, interrupt, or contention
MOV RCX, 1 ; Default result, overwritten on success
LOCK MOV RAX, [mem1] ; Specification begins
LOCK MOV RBX, [mem2]
CMP R8, RAX ; DCAS semantics
JNZ out
CMP R9, RBX
JNZ out
LOCK MOV [mem1], RDI ; Update protected memory
LOCK MOV [mem2], RSI
XOR RCX, RCX ; Success indication
out:
COMMIT ; End of speculative region
The SPECULATE-COMMIT pair wraps a speculative region, which speculatively reads from and writes to protected memory locations using the LOCK MOV instructions. The speculative memory updates will become visible to other CPUs only when the speculative region completes successfully.
Here's what the speculative region does in this example: The initial LOCK MOV instructions signify the memory locations that need to be monitored for external modifications and also read the memory operands into the RAX and RBX registers. The code then compares these operands with the original register operands (saved to R8 and R9 at the outset of the routine). The DCAS operation may fail because of a miscomparison at that point, bypassing the memory update. The RCX register returns a pass-fail indication.
A speculative region may also be aborted, for example when a contending program thread accesses a protected memory location or when an interrupt occurs. In this case, all speculative memory updates are discarded, and the program flow (instruction and stack pointer) is rolled back to just after SPECULATE, where software can inspect the reason for the abort in the rAX and rFLAGS registers. The code in this example examines RFLAGS immediately after SPECULATE using a JNZ instruction that branches to the abort handler, which in this case just attempts a retry. A real implementation might have a more elaborate recovery strategy, for example, exponential backoff if the abort was due to contention.
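For readers who don't read x86 assembly, the intended DCAS semantics can be modeled in a few lines of Java. This is only a reference model of the result, not of ASF itself: it uses a lock (synchronized), which is exactly what ASF aims to avoid, and the class name DcasModel and its methods are hypothetical. The expected array plays the role of RAX and RBX, and the return value mirrors RCX (0 on success, 1 on failure).

```java
// Reference model of the DCAS result only; it uses a lock, which ASF is meant to avoid.
final class DcasModel {
    private long mem1;
    private long mem2;

    DcasModel(long mem1, long mem2) {
        this.mem1 = mem1;
        this.mem2 = mem2;
    }

    /**
     * expected[0] and expected[1] stand in for RAX and RBX.
     * Returns 0 on success, or 1 on failure after refreshing expected[]
     * with the current memory contents (mirroring RCX in the listing).
     */
    synchronized int dcas(long[] expected, long new1, long new2) {
        if (mem1 == expected[0] && mem2 == expected[1]) {
            mem1 = new1;
            mem2 = new2;
            return 0;
        }
        expected[0] = mem1;
        expected[1] = mem2;
        return 1;
    }

    /** Returns a consistent copy of both words. */
    synchronized long[] snapshot() {
        return new long[] {mem1, mem2};
    }

    public static void main(String[] args) {
        DcasModel cell = new DcasModel(1, 2);
        long[] expected = {1, 2};
        System.out.println("first attempt:  " + cell.dcas(expected, 10, 20));
        long[] stale = {1, 2};
        System.out.println("second attempt: " + cell.dcas(stale, 5, 6));
        System.out.println("stale now holds " + stale[0] + ", " + stale[1]);
    }
}
```

In the assembly version there is no lock: the hardware monitors the two protected locations and aborts the speculative region on contention, which is why that code needs the retry branch while this model does not.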
How we are developing ASF
ASF really is a team effort, with team members looking at various software applications, hardware implementation, and the specification itself.
When I joined AMD's OSRC at the end of 2006, I quickly discovered ASF as it existed at that time: a mechanism for improving the efficiency of highly parallel, lock-free synchronization code. In previous work I had used lock-free data structures for building a real-time microkernel operating system, and I had often craved a feature for multi-word atomic updates such as ASF. This might explain why I was so enthralled by ASF.
In the meantime, I have become the editor of the ASF specification proposal. I'm working with the ASF team to evaluate the feature in various application scenarios, and to further develop ASF based on our findings. We have expanded its focus to include software transactional memory (STM) as well; more on that in a later blog post.
We are also actively discussing ASF with both academic and industrial partners to learn about interesting application areas and to derive requirements for an eventual implementation in future products.
The ASF specification
ASF is an experimental architecture extension currently in proposal stage. AMD has not yet committed to including this feature into any future CPU product. Instead, we are soliciting input from developers and researchers that would help us refine the ASF specification to better meet software development requirements.
ASF is not the first feature we have proposed in this way. A year and a half ago, AMD decided to be more open in developing extensions to the AMD64 architecture to help ensure we meet the needs of the software development community and to encourage cross-vendor compatibility. At that time, we proposed the Lightweight Profiling (LWP) and SSE5 features in a similar spirit, and we received extremely valuable input from the programming community that helped us improve our future products - to your benefit. SSE5 has just recently evolved into the AVX-compatible XOP, which we described in a previous blog entry.
Please download the ASF specification proposal and send your comments to ASF_Feedback@amd.com.
--- Michael Hohmuth, MTS AMD Operating System Research Center, Dresden
-------------------------
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.
June 11, 2009
JavaOne 2009
I was lucky enough to go to JavaOne last week and thought I'd share some comments, highlights, a few quibbles, and a way to make some serious money if you are in the beanbag industry. I felt that this year's JavaOne was a little subdued -- attendance seemed lower (we can probably all guess that the economy was a factor here) and generally there were fewer 'cool!' exclamations from the audiences.
Monday
This was 'CommunityOne' day. I attended a couple of sessions (Hadoop and Cloud related) but really spent most of the day bumping into people and catching up. CommunityOne looked a little sparsely attended at times. I did attend a session where the presenters and attendees discussed how to get the most out of their JUGs (Java User Groups). This was a really good session.
I did enjoy hanging out in the AMD sponsored 'Hang Space,' and I had my first 'patent pending idea' here watching all of the laptop users sitting on the floor next to the walls (where the 110 VAC was served) and not on the comfy beanbags! So beanbag builders of the world, we need beanbags which incorporate 110 V sockets. These could be sold in strings which connect together and will allow those slacking off at conferences to actually partake in the beanbag offerings rather than sit on the floor. Of course, one might ask why the beanbags were not dragged to the walls, and the answer would be, you wouldn't be able to watch the episodes of 'The Office - US version' that were being served up on the big screen, obviously. I, of course, could happily sit in a beanbag, pretend to work and watch Dwight, Jim, and Pam wrestle with their plight because I have an AMD powered HP dv2 - whose battery lasted way longer than Season 1 of "The Office."
Tuesday
It was good to see Scott McNealy hand over (the keynote, not Sun just yet) to Larry Ellison.
Larry's remarks regarding the importance of Java to Oracle must have made a few folks sleep easier on Tuesday evening and I suspect that the JavaFX team will be particularly pleased with Larry calling out JavaFX by name and pushing a possible OpenOffice/JavaFX integration down the line. That should be good for JavaFX and hopefully good for OpenOffice.
So where is JavaFX in 2009? I count this as the third JavaOne where Sun has pushed JavaFX. 2007 was kind of a preview, and I enjoyed the demos but that was really all it was. It dominated in 2008, but was still really not cooked and I walked out of the lab session when I was asked to sign an NDA -- an NDA for a lab session at a conference that I paid to attend seemed a bit weird. Now in 2009 I really do think it might start gaining some traction. The addition of charting was smart (and pretty obvious really) and I was pleased that even Eclipse users got something in the form of a fairly cool Eclipse plugin. Now it really feels that JavaFX is not just for NetBeans anymore. The demos were slicker and the downloaded Eclipse plugin worked like a charm.
Having worked on a large Flex application a few years back, and having seen some extremely cool Flex apps, I have always seen JavaFX as too little too late. Flash and Flex have pretty much carved up the R part of RIA (although AJAX is not dead yet!). Now I am a little more hopeful for JavaFX to at least find an audience. The more natural Java integration and the impressive binding support will appeal to those who really took to MXML+ActionScript, and I can see the story developing. The effort that has gone into JNLP/applet deployment (on JRE 1.6.0_10 and later) has helped enormously and once we can find a way to get JavaFX to launch faster (Flash still seems to launch way faster than even trivial JavaFX apps) I think that JavaFX will come into its own. I look forward to kicking the tyres some more.
Joshua Bloch (Google, Inc) and Neal Gafter's (Microsoft) "Return of the Puzzlers: Schlock and Awe" session was as well attended as ever. These guys do a great job presenting these infuriating corner cases. I liked the fact that they acknowledged making some of the mistakes presented; it makes us all feel a little less incompetent. I think I got more answers right this year, although my success rate is still not impressive.
The "Small Language Changes in JDK(tm) Release 7" session by Joseph Darcy, Sun Microsystems, Inc. was interesting. I really like the 'Elvis operator' ?: and also look forward to using some of the suggestions for less verbose 'Generic' declaration/initializations.
The "Asynchronous I/O Tricks and Tips" session by Jean-François Arcand and Alan Bateman from Sun Microsystems, Inc. was an informative session. I really am guilty of not tracking nio enough (when will the 'n' in 'nio' seem really inappropriate?), and I look forward to using some of these tricks, especially using a 'Future' to access the response from an asynchronous read.
One of my favourite sessions was "Toward a Renaissance VM" by Brian Goetz and John Rose from Sun Microsystems. Sometimes I feel my head is way too small to understand this JSR 292 stuff, but I actually felt that I have a grasp of how this will help dynamic languages and also how it might apply to frameworks which currently rely on bytecode engines/injection and reflection to do their work. I still need to track down more information on this but the fog is lifting for me.
I wish I had caught the "The Feel of Scala" session by Bill Venners of Artima, Inc. Only as the week progressed did I realize that I need to track Scala. I look forward to the slides and video of this presentation.
Wednesday
I attended a great session called 'State: You're Doing It Wrong -- Alternative Concurrency Paradigms on the JVM(tm) Machine' in the morning from Jonas Bonér of Scalable Solutions.
This session proposed State, Actor message passing and Data Flow mechanisms to improve concurrency. For me the Actor-based demos (based on Scala) not only prompted me to look at this approach in my Java apps, but also were a great example of how Scala can be scaled out. As I mentioned earlier I really need to dig into Scala some more.
I regret missing "The Modular Java(tm) Platform and Project Jigsaw" by Mark Reinhold of Sun Microsystems, Inc. From what I have read elsewhere this modular approach is really going to help deployment and packaging.
Joshua Bloch's (from Google) "'Effective Java': Still Effective After All These Years" was another opportunity to see the 'Billy Mays' of Java (I really mean no disrespect - Josh is a pitch-perfect pitch man) do what he does flawlessly. His 'Effective Java' book is like the movie 'Brazil'; you need to reread/review every year to catch what you missed previously.
I enjoyed "The Ghost in the Virtual Machine: A Reference to References" session from Bob Lee, Google Inc., which went into depth regarding GC, references, and finalization issues. I look forward to walking through the slide deck on this one. I learned a lot and also know a bunch slipped on past me.
I watched a cool demo which redefined classes in a running JVM using a Java agent and some classloader tricks. This BOF session "Runtime Update of Java(tm) Technology-Based Applications, Using Dynamic Class Redefinition" by Allan Gregersen from University of Southern Denmark was fun and educational. The presenter built a Swing-based game incrementally by adding fields and methods, changing class hierarchies, etc., all without ever restarting the JVM. Although in practice I feel this javaagent-based chaining approach may not scale particularly well, if this can be pushed down into the JVM (as the presenter suggested) then this whole area has some great potential.
I must apologise to my fellow AMDer, Richard West, and David Gilbert from Object Refinery Limited for missing their "JFreeChart: Surviving and Thriving" BOF. I look forward to picking Richard's brain about this great toolkit.
Thursday
Occasionally I like to see what is going on in the Swing world. I don't really get to write much in Swing but there are some really great toolkits out there. I particularly enjoyed "Swing Rocks: A Tribute to Filthy-Rich Clients" by Martin Gunnarsson and Pär Sikö from Epsilon Information Technology. Swing really can look compelling.
The "Matchmaking in the Cloud: Hadoop and EC2 at eHarmony" session from Steve Kuo and Joshua Tuberville of eHarmony, Inc. was a good presentation (and from a show of hands there were two attendees that actually got married through eHarmony so there was a cool validation of eHarmony's matching algorithm!). It walked through the technical and economic considerations around using these technologies.
"Garbage Collection Tuning in the Java HotSpot(tm) Virtual Machine" from Charlie Hunt and Antonios Printezis of Sun Microsystems, Inc was a good, informative session that walked through a number of great slides highlighting what to do and what not to do. I still feel that GC tuning should be less of a 'dark art.' I worry how many JVMs are sitting out there thrashing when a few command line options would smooth the way. I do wish for a -XX:+GCAdvise option which (possibly at the end of each GC) would suggest what command lines would be optimal with a specific workload. I know that I am supposed to use the GC printing options, and/or use VisualVM to show me the graphs that I should use to determine what flags will be optimal, but this seems way too hard. Surely after running for a while the GC engine/subsystem would have enough data to generate an 'I suggest running with these flags ... because ....' style report, instead of 'here are a bunch of graphs and text dumps, now go away and work out what you did wrong and come back.' Sometimes I don't want to learn to fish; sometimes I would just like to eat some fish.
Cliff Click (from Azul Systems) and Brian Goetz's (Sun Microsystems) session, "This Is Not Your Father's Von Neumann Machine; How Modern Architecture Impacts Your Java(tm) Apps" was another one of the highlights of the conference. It was a great presentation and allowed folk without a deep understanding of microprocessor architecture to walk away with some understanding of what happens under the hood. The slide deck in the middle which walked through the issues relating to how multi-core architectures executing speculatively have to handshake over the cache was very, very slick. I am looking forward to Cliff and Brian's Boxed Set being released.
There were some great sessions on "Actor-Based Concurrency in Scala" from Philipp Haller of EPFL and Frank Sommers of Artima which really rammed home how effectively Scala and this Actor-based communication mechanism can simplify some concurrency problems. As I mentioned before, this was brought up in an earlier session, and I enjoyed digging deeper in this dedicated session.
I stayed late to enjoy the "Java(tm) Programming Language Tools in JDK(tm) Release 7" BOF on Thursday night hosted by Maurizio Cimadamore and Jonathan Gibbons from Sun Microsystems, Inc. I applaud the upcoming refactoring of javap and also enjoyed the discussion on how we might get better error reporting out of javac. I also vote for the option of getting compilation rendered to XML to help tool chaining.
Friday
Gosling's "Toy Show" (Friday morning) did have some cool stuff; the JavaFX studio tool for composing JavaFX without coding does look very, very good.
Also the image analysis toolkit which generated analytical 'hashes' for images and then allowed image related searching/matching was very impressive. My favourite was the Printer/Copier based Java app for creating arbitrary multiple choice exam papers or surveys on plain paper: you print off a bunch of the question papers, then feed a special page with the answers, along with the response papers, into the scanner, and the copier/printer grades the papers. Very smart.
The "Under the Hood: Inside a High-Performance JVM(tm) Machine" session from Trent Gray-Donald of IBM was excellent. This provided some more insight into what happens when your code is executed by a modern JVM.
Sadly I missed afternoon sessions because I had to get to the airport to get home to watch season two of 'The Office.' There certainly is enough to dig into to keep me busy until next year.
-------------------------
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied
How Complex Is Your JRE Command-line?
Guess how many command-line flags there are for the server JRE in the OpenJDK? I'm hearing 42. Kudos to all of you fans of the late Douglas Adams, but you're slightly short of the real answer. It's 477 (give or take a flag or two). To confirm, just go into src\share\vm\runtime\globals.hpp and src\share\vm\opto\c2_globals.hpp, which define them. The flags control all sorts of things, some of which you are probably very familiar with, like the heap (-Xms -Xmx), and some of which you may not know about, such as the memory footprint settings (-XX:ReservedCodeCacheSize and -XX:InitialCodeCacheSize).
I'm not asking you this because I want to know if you have intimate knowledge of the JRE (although if you can keep bits of trivia like this in your head, I am truly impressed). My question really comes out of the world of performance analysis of Java runtimes. Suffice it to say that as the Java Labs works to improve JRE performance, sometimes our analysis leads to improvements that can be realized by tuning these existing command-line flags. But here's my theory... I bet most of you use few, if any, of these flags in production. You probably have very good reasons for doing this. You may not have access to the command line, or you may have different applications, some of which may benefit from certain flags, while others won't. If true, the result is the same... when we look to improve JRE performance, we really need to do it in a way that is engineered to help potentially any application in a flexible way that does not require changes to the command line.
So answer these two questions:
- Do you set any command line flags in production?
- If yes, what are they?
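If you are not sure which flags a given production JVM was started with, you can ask the JVM itself. The sketch below uses the standard java.lang.management API (RuntimeMXBean.getInputArguments); the class name ShowJvmFlags is just a placeholder. It lists only the arguments passed to the JVM, not the arguments passed to the application's main method.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Placeholder class name; the management API calls are standard.
final class ShowJvmFlags {
    /** Returns the arguments the running JVM was started with (e.g. -Xmx..., -XX:...). */
    static List<String> jvmArguments() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments();
    }

    public static void main(String[] args) {
        List<String> flags = jvmArguments();
        if (flags.isEmpty()) {
            System.out.println("No JVM flags were set on the command line.");
        }
        for (String flag : flags) {
            System.out.println(flag);
        }
    }
}
```

Running it with, for example, -Xmx512m on the command line should list that flag; running it with no JVM options typically prints the "no flags" message.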
-------------------------
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.
June 5, 2009
A Java Generics Performance Puzzler
In this entry, we’ll go down that well-worn path of looking at some microbenchmark results and trying to explain them.
This microbenchmark created an ArrayList such that if one went thru the ArrayList in order, the entries were randomly distributed in memory. We also had enough elements in the list that it would take some time to go thru the list. We then wanted to go thru the list in order and “split” it so that we created two new ArrayLists, one for all the even elements and one for all the odd elements.
There are a number of ways to code the splitting but let’s start with an approach that doesn’t use Iterators, but just uses an integer index to the get method for the source and then adds (appends) to the destination ArrayList. The body of the loop might then look like the following:
Version 1
ArrayList<MyClass> aListSrc, aListDest1, aListDest2;
...
while (idxSrc < NUMOBJS) {
    aListDest1.add(aListSrc.get(idxSrc));
    aListDest2.add(aListSrc.get(idxSrc+1));
    idxSrc+=2;
}
After measuring version 1, you decide those add method lines are a bit wordy so you break them into two statements, using a local variable to hold the intermediate result. Or perhaps you wanted to print some debug information for each element as you are copying it, and you needed a local variable to hold the element reference (and you then removed the debug statements). So you end up with something like:
Version 2
ArrayList<MyClass> aListSrc, aListDest1, aListDest2;
...
while (idxSrc < NUMOBJS) {
    MyClass myc = aListSrc.get(idxSrc);
    aListDest1.add(myc);
    myc = aListSrc.get(idxSrc+1);
    aListDest2.add(myc);
    idxSrc+=2;
}
But when you measure version 2, you find that it is much slower than version 1 (about 1/3 the speed in my measurements). Before reading on, you might try to figure out why. Is the JVM perhaps not able to optimize away the store to the local variable? And if so, is the store to the local variable really that expensive? I will add that in both cases, the get and add methods got inlined nicely into the timed loop.
Answer
You may recall that generics in Java are implemented with type checking at compile time but with type erasure at run time. How does that impact us? Well for one it means that at runtime the call to
aListSrc.get(idxSrc);
really returns an Object, even though aListSrc is an ArrayList<MyClass>. Therefore the statement from version 2:
MyClass myc = aListSrc.get(idxSrc);
requires a runtime castcheck that the Object returned by aListSrc really is a MyClass. If you look at the byte codes generated for such a statement, you will see a checkcast bytecode.
To check whether the object returned by aListSrc.get really is of type MyClass (or a child of MyClass) the JVM must read the header of the object. In our list splitting operation, however, we never had any other reason to look at any of the fields of the MyClass objects as we went thru the list. We just copied each MyClass reference from the source list into one of the destination lists. So by having to look at the header as part of the castcheck, we must now wait until the object is read from memory into the processor’s cache. And with lots of objects in the list, it is less likely that an object is already in the cache when we need it.
How did we avoid the castcheck in Version 1? In version 1, the javac compiler used the list’s type declaration ArrayList<MyClass> to guarantee at compile time that the returned object was of type MyClass. And at runtime the types from the generics were erased, so basically we have a get method returning an Object which is passed to an add method which takes an Object. So no checkcast is necessary.
Note that we can try to get around the checkcast by just declaring the local variable to be an Object rather than a MyClass, but now the javac compiler will rightly complain when we try to do an add of an Object into an ArrayList<MyClass>.
Version 3 (will not compile)
ArrayList<MyClass> aListSrc, aListDest1, aListDest2;
...
while (idxSrc < NUMOBJS) {
    Object myc = aListSrc.get(idxSrc);
    aListDest1.add(myc); // error
    ...
}
I should note here that if our original algorithm had looked at fields of the MyClass objects to make some decision on how to split the list, then the object would already have had to be read from memory for the other field accesses, and the extra time to do the header check for the castcheck would have been insignificant.
Even though the above is explainable by type erasure, I’m not sure it follows the principle of least surprise. After all, I declared aListSrc to be ArrayList<MyClass> and all I did was assign the .get output to a MyClass object. If the javac compiler knew enough to eliminate the castcheck between the output of the get and the input of the add, why couldn’t it eliminate it between the output of the get and the assignment to the local variable?
Looking at this from another angle, one might ask whether the JVM can optimize away the castcheck at runtime. A check with the Hotspot folks indicated that the bytecodes are saying "throw an exception if aListSrc.get ever returns a non-MyClass object". And the JVM cannot elide bytecodes that could cause an exception like this.
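One way to see that the checkcast really can fire is to pollute a list through a raw reference and then copy from it in both styles. The sketch below is my own illustration, not code from the original benchmark; the class and method names are made up for the purpose.

```java
import java.util.ArrayList;

class ErasureDemo {
    // Hypothetical stand-in for the MyClass used in the post.
    static class MyClass { }

    // Smuggle a String into a list declared as ArrayList<MyClass> by going
    // through a raw reference. javac warns about this but allows it.
    @SuppressWarnings({"unchecked", "rawtypes"})
    static ArrayList<MyClass> pollutedList() {
        ArrayList<MyClass> list = new ArrayList<MyClass>();
        ArrayList raw = list;
        raw.add("not a MyClass");
        return list;
    }

    // Version 1 style: the result of get() goes straight into add().
    // After erasure this is Object in, Object out, so nothing is checked.
    static boolean copyDirect(ArrayList<MyClass> src) {
        ArrayList<MyClass> dest = new ArrayList<MyClass>();
        dest.add(src.get(0));
        return dest.size() == 1;
    }

    // Version 2 style: assigning to a MyClass local compiles to a checkcast,
    // which throws ClassCastException for the smuggled String.
    static boolean copyViaLocal(ArrayList<MyClass> src) {
        ArrayList<MyClass> dest = new ArrayList<MyClass>();
        try {
            MyClass myc = src.get(0);
            dest.add(myc);
            return true;
        } catch (ClassCastException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        ArrayList<MyClass> src = pollutedList();
        System.out.println("direct copy succeeded: " + copyDirect(src));      // true
        System.out.println("copy via local succeeded: " + copyViaLocal(src)); // false
    }
}
```

Because the second style can observably throw, the check is part of the program's behavior rather than bookkeeping the runtime is free to drop.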
So the message is don't cast your return from the Collections classes like this if you don't need to.
-------------------------
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.
Edited: 06/05/2009 at 03:53 PM by tdeneau
June 2, 2009
| |
Adventures in Dual Booting OpenSolaris
This year I entered into a new role as a performance engineer for AMD, assigned with tackling any and all Sun compiler performance engineering issues for AMD's Sun alliance.
This blog entry focuses on how I got multi-boot working on a system with both SuSE Linux® Enterprise Server 10 SP2 and OpenSolaris™ 2008.11, even though OpenSolaris is installed on the second partition (most of the blogs and articles I found online recommended that OpenSolaris be installed on the primary partition).
Back in the day, it wasn't called "multi-boot," it was just "dual-boot" (I suppose because having two operating systems installed on one disk was almost a freak of nature). Multi-booting operating systems is somewhat of a black art, mainly involving choosing, installing, and configuring the boot loader.
As a software developer in the past, I have performed such ad-hoc system setups frequently, mostly focused on bootstrapping a project. It is no different as a performance engineer. So I recently found myself engaged in setting up a shiny new system with a couple of AMD "Istanbul" processors hot off the fab. The activity generated a surprising amount of excitement ...
... A crowd of engineers gather around the latest machine. A few twists of a knob here, a button there, and a fiery glow lights their faces. They hunger for performance numbers! Overnight SPEC® CPU2006 runs are almost too much to endure. Can we speed up the install? Should we add more memory? We want those results!
Back to the real world - I need SLES 10 SP2 for our initial studies, so that goes on first. Anticipating the need to multi-boot, I divide the disk into 3 partitions and install SLES on the first one, namely (hd0,0). I set up and configure the SPEC benchmarks and get those started.
... The benchmark results finally come in. The engineers ooh and ahh over the towering new SPEC numbers. Abruptly they disperse, returning to their cubicles to digest. I finally have the machine to myself (moo hoo ha ha).
Now comes the OpenSolaris install on the second partition, (hd0,1). It goes smoothly, except it installs a new copy of GRUB (GRand Unified Bootloader) which doesn't seem to know anything about the original SLES partition. When I reboot, I can't get back to the original install!
... I have broken the shiny new machine! The light glows but it is a strange color, not the fiery glow the engineers will need any day now. Both hands inside the box, I am certain if I stop to scratch my nose I will lose control and it will fly around the cube and out the window.
I have two paths to try: 1) configure the OpenSolaris GRUB to see SLES 10, or 2) configure the SLES GRUB to see OpenSolaris.
First I try configuring the OpenSolaris GRUB by editing GRUB's menu.lst file. Booting OpenSolaris, I look for /boot/grub/menu.lst but eventually I discover that OpenSolaris' GRUB menu file is in /rpool/boot/grub/menu.lst. I cook up an entry like this:
title SLES 10 SP2, kernel 2.16.16.60-0.21-smp
    root (hd0,0)
    kernel /boot/vmlinuz-2.6.16-60-0.21-smp \
        root=/dev/dsk/by-id/scsi-SATA_ST3250410AS_6RYC836A-part1 \
        vga=normal showopts ide=nodma apm=off acpi=off noresume edd=off 3
    initrd /boot/initrd-2.6.16.60-0.21-smp
but after tweaking several times (where GRUB complains about not finding a valid OS) I can't get the recipe exactly right. I move on to option #2, getting the SLES GRUB booting again.
At first I try OpenSolaris' fdisk command but I don't find an easy way to determine the device name of the SLES disk partition (because I am unfamiliar with the OpenSolaris way of device naming). So I decide to do it from SLES - if I can boot the SLES partition, or mount it somehow from a rescue disk I could modify its GRUB configuration (by editing the /boot/grub/menu.lst file). After some Googling, I create and boot a SLES 10 SP2 install CD, boot in rescue mode, mount the partition, and then add this entry:
title OpenSolaris 2008.11
    root (hd0,1)
    chainloader +1
While booted, I use the SLES fdisk to mark the SLES partition as bootable. When I reboot, the machine comes up and boots SLES 10 SP2 without intervention! Whew!
And now I can choose the OpenSolaris 2008.11 partition at boot time, which then displays the OpenSolaris GRUB menu, which knows how to boot the OpenSolaris ZFS partition. If I had to I could use fdisk again to make the machine boot to OpenSolaris every time, but for now it will reboot to SLES 10 SP2 each time.
... The light is restored and the machine is ready. When the engineers return to take the machine they will be able to use it as they did before, but I have left a door to my little workshop where I can return when their interests move on to the next shiny problem.
-------------------------
Edited: 06/03/2009 at 10:49 AM by qneill
June 1, 2009
| |
"Istanbul" overview
Today, AMD is launching the "Istanbul" processor. Since our first dual-core processor, those of us in AMD's CPU ISV team have been evangelizing that more cores are coming. This processor contains 6 cores on one die. Just to be clear, these are 6 distinct physical cores, just as the Shanghai processors contained 4 distinct physical cores. Each core has 512KB of L2 cache and 128KB of L1 cache. The L3 is a 6MB cache shared by the six cores. The Istanbul processors are MP capable, supporting up to 8 processors (48 cores). There have been numerous refinements made to this processor. One notable change is the addition of a Probe Filter, which you may see referred to as HyperTransport™ technology HT Assist. Simply put, this filter can greatly reduce HT traffic between multiple sockets, which in turn can improve memory bandwidth, especially on 4 socket platforms. For those with silicon interest, the "Istanbul" processors are fabricated with the 45nm SOI process. And did I mention that these processors use AMD's existing Socket F (1207) infrastructure? That means that on many platforms all that is needed is a simple BIOS upgrade. Some of the other features are HT3 capability and numerous power saving features. More blogs to come on the cool new features of Istanbul - otherwise known as the new Six-core AMD Opteron™ processors.
Six-core AMD Opteron™ processor
-------------------------
Edited: 06/01/2009 at 12:32 PM by devcentral
| |
"Shanghai" blog category is now "Istanbul" blog category
With the launch of the new Six-Core AMD Opteron™ processors (codenamed "Istanbul"), the powerful follow-up to the "Shanghai" processors, we're updating the title of this blog category to reflect the information you will now find here. Don't worry, the previous content isn't going away - it's still very valid, since the "Istanbul" processors build on foundations that were laid by the "Barcelona" and "Shanghai" processors, and add advancements in many features. Check back often for new write-ups on these features, and visit our "Istanbul" Zone for a round-up of everything you need to know about this enhanced generation.
We'd appreciate hearing what you think about the new "Istanbul" processors, so leave us a comment!
-------------------------
| |
"Shanghai" Zone is now "Istanbul" Zone
Looking for our "Shanghai" Zone? All the content and resources you expected to find are still there, but we've added some new information about AMD's follow-up to the Quad-Core AMD Opteron™ processors (codenamed "Barcelona" and "Shanghai") and have renamed the content section to "Istanbul" Zone. The new Six-Core AMD Opteron processors (codenamed "Istanbul") retain all the features of the "Barcelona" and "Shanghai" processors and add further advancements in the technologies for even better performance. Find out what's new with this six-core processor in the "Istanbul" Zone.
We'd appreciate hearing what you think about the new "Istanbul" processors, so leave us a comment!
-------------------------
May 15, 2009
| |
New Virtualization Article
Check out this new article in the Java Zone: Optimizing Java Performance in a Virtualized Environment. It's based on a JavaOne 2008 Tech Session of the same name by Shrinivas and Azeem, which provided a good overview of how to navigate the intersecting worlds of Java and Virtualization.
Let us know what you think.
-------------------------
May 6, 2009
| |
Striking a Balance
This week, AMD is making a couple of very important announcements for developers: support of Intel's Advanced Vector Extensions (AVX) instruction set in future AMD processors, and the adaptation to the AVX framework of AMD's previous SSE5 instruction set proposal. The latter step has resulted in three new extensions: XOP (for eXtended Operations), CVT16 (half-precision floating point converts), and FMA4 (four-operand Fused Multiply/Add). In this posting I'll give an overview of the capabilities that these extensions provide, and also some insight into why we're taking this step.
First, the why. When we proposed the SSE5 extensions back in mid-2007, it brought some important innovations to the SIMD side of the x86 architecture:
- a non-destructive three-operand capability, and a four-operand capability to support some very powerful new operations;
- a set of powerful permute and conditional move instructions for data movement, plus Fused Multiply/Add (FMA) instructions for high-performance floating point;
- a variety of other new operations to address various holes in the SSE instruction set: shift/rotate, integer compares, integer multiply/accumulate, and half-precision floating point support.
In April of 2008, Intel published its AVX/FMA proposal, which incorporated several of SSE5's innovations - in particular the three- and four-operand capabilities, the Fused Multiply/Add instructions, and some of the permute instructions - except in a somewhat different form. This proposal also added some new capabilities with a new instruction format: doubling the width of SIMD FP operations, applying the non-destructive three-operand capability to most legacy SSE instructions, and greatly expanding the potential opcode space for future extensions.
With this duplication of functionality between SSE5 and AVX/FMA, and AVX's additional features, we felt the right thing to do was to support AVX. In our minds, a more unified instruction set is clearly what's best for developers and the x86 software industry. With our acceptance of AVX, a key aspect of this instruction set unification is the stability of the specification. Since we don't control the definition of AVX, all we can say for sure is that we expect our initial products to be compatible with version 5 of the specification (the most recent one, as of this writing, published in January of 2009), except for the FMA instructions, which we expect will be compatible with version 3 (published in August of 2008).
Why the FMA difference? This was not something we did lightly. In December of 2008, Intel made significant changes to the FMA definition, which we found we could not accommodate without unacceptable risk to our product schedules. Yet we did not want to deprive customers of the significant performance benefits of FMA. So we decided to stick with the earlier definition, renaming it FMA4 (for four-operand FMA - Intel's newer definition uses what we believe to be a less capable three-operand, destructive-destination format). It will have a different CPUID feature flag from Intel's FMA extension. At some future point, we will likely adopt Intel's newer FMA definition as well, coexisting with FMA4. But as you might imagine, we may wait until we're sure the specification is stable.
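To make the "fused" part and the operand-count difference a little more concrete, here is a toy model in Java. It is only a sketch under my own assumptions: Math.fma (available from Java 9) stands in for the hardware operation, and the array-based "register file" with its fma4 and fma3 helpers is a hypothetical illustration of non-destructive versus destructive destinations, not a description of either vendor's encoding.

```java
class FmaDemo {
    // Separate multiply and add: the product is rounded to double first.
    static double mulThenAdd(double a, double b, double c) {
        return a * b + c;
    }

    // Fused multiply/add: a*b+c is rounded only once, at the end.
    static double fused(double a, double b, double c) {
        return Math.fma(a, b, c);
    }

    // Four-operand form: the destination is its own register, so all three
    // sources keep their values.
    static void fma4(double[] regs, int dst, int a, int b, int c) {
        regs[dst] = Math.fma(regs[a], regs[b], regs[c]);
    }

    // Three-operand destructive form: the result overwrites one of the sources
    // (register a in this toy model).
    static void fma3(double[] regs, int a, int b, int c) {
        regs[a] = Math.fma(regs[a], regs[b], regs[c]);
    }

    public static void main(String[] args) {
        double a = 1.0 + 0x1.0p-30;
        double b = 1.0 - 0x1.0p-30;
        // The exact product is 1 - 2^-60, which rounds to 1.0 as a double.
        System.out.println("mul then add: " + mulThenAdd(a, b, -1.0)); // 0.0
        System.out.println("fused:        " + fused(a, b, -1.0));      // -2^-60

        double[] regs = {2.0, 3.0, 4.0, 0.0};
        fma4(regs, 3, 0, 1, 2); // regs[3] = 2*3+4, regs[0] untouched
        System.out.println("after fma4: r0=" + regs[0] + " r3=" + regs[3]);
        fma3(regs, 0, 1, 2);    // regs[0] is overwritten with 2*3+4
        System.out.println("after fma3: r0=" + regs[0]);
    }
}
```

With the destructive form, code that still needs the old value of the overwritten source has to copy it somewhere first, which is the capability difference the paragraph above alludes to.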
The fact remains that AVX does not incorporate all of SSE5's features. Since SSE5 was based on months of discussions with ISVs on what sort of capabilities they felt were needed, and had been positively reviewed by the industry when we first put out the specification, we decided to follow through with development of these additional features. To do so most effectively, we redefined them in the AVX framework, resulting in the XOP extension.
So, what's in XOP?
Well, quite a lot, really. First of all, the instruction formatting was changed to leverage the capabilities that the AVX VEX prefix brings, using a new VEX-like three-byte prefix sequence called (interestingly enough) the XOP prefix. This provides three- and four-operand non-destructive destination encoding, an expansive new opcode space, and extension of SIMD floating point operations to 256 bits. The SSE5 operations that are retained by the XOP extension are:
- Horizontal integer add/subtract: Signed or unsigned add, or signed subtract, of adjacent byte, word, or dword elements in the source vector to word, dword or qword elements of the destination vector. 128-bit.
- Integer multiply/accumulate: Multiplies elements of two input vectors, adding the results to a third input vector. 128-bit.
- Shift/rotate with per-element counts: These use a vector of shift counts, allowing each element of the source vector to be shifted or rotated by a different amount. There is also a rotate instruction with an immediate-byte single count applied to all elements. 128-bit.
- Integer compare: Signed and unsigned comparison of byte, word, dword and qword elements, with predicate (mask) generation as in the various SSE compare instructions. The particular comparison to perform is specified in an immediate byte. 128-bit.
- Byte permute: A powerful operation which copies bytes from two 16-byte input vectors to a 16-byte destination vector, optionally performing a selected transformation on each, under the control of a third input vector. 128-bit.
- Bit-wise conditional move: Selects each bit of the destination vector from either of two input vectors, per a third input vector. 128- and 256-bit.
- Fraction extract: Extract the mantissa from floating point operands. Scalar and 128- or 256-bit vector, single and double precision.
- Half-precision convert: These convert between half-precision and single-precision formats while loading or storing a four- or eight-element vector. They provide dynamic control of rounding and denormalized operand handling. These particular instructions form a separate extension called CVT16, with a distinct CPUID feature flag.
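For readers who think better in code, a few of the operations in the list above can be modeled in scalar Java. This is a rough sketch of the semantics as I read them from the descriptions, not the specification: the method names are mine, the selector polarity in the conditional move is an assumption, the rotate handles only left rotation with counts taken modulo 8, and the multiply/accumulate ignores the saturation and widening variants.

```java
class XopModel {
    // Bit-wise conditional move: each result bit comes from src1 where the
    // selector bit is 1 and from src2 where it is 0 (assumed polarity).
    static long bitSelect(long src1, long src2, long selector) {
        return (src1 & selector) | (src2 & ~selector);
    }

    // Rotate with per-element counts: every byte gets its own rotate amount.
    static byte[] rotateBytesLeft(byte[] src, int[] counts) {
        byte[] dst = new byte[src.length];
        for (int i = 0; i < src.length; i++) {
            int v = src[i] & 0xFF; // treat the byte as unsigned
            int n = counts[i] & 7; // simplification: count modulo 8
            dst[i] = (byte) ((v << n) | (v >>> (8 - n)));
        }
        return dst;
    }

    // Integer multiply/accumulate: element-wise a*b added to a third vector.
    static int[] multiplyAccumulate(int[] a, int[] b, int[] acc) {
        int[] dst = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            dst[i] = a[i] * b[i] + acc[i];
        }
        return dst;
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(
            bitSelect(0xFFFF0000L, 0x0000FFFFL, 0xFF00FF00L)));
        byte[] r = rotateBytesLeft(new byte[] {(byte) 0x81, 0x01}, new int[] {1, 4});
        System.out.println(Integer.toHexString(r[0] & 0xFF) + " "
            + Integer.toHexString(r[1] & 0xFF));
        int[] m = multiplyAccumulate(new int[] {1, 2}, new int[] {3, 4}, new int[] {10, 20});
        System.out.println(m[0] + " " + m[1]);
    }
}
```

The real instructions do this across a whole 128-bit (or 256-bit) vector at once; the loops here only spell out what happens to each element.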
Along with the FMA4 instructions, these support a wide variety of numeric-intensive, multimedia, and cryptographic applications, and allow some new cases of automatic vectorization by compilers. Speaking of compilers, plans are afoot to support these in intrinsic form in various compilers, and they may be used automatically in code generation in some cases.
A version of the AMD64 SimNow! simulator with support for these extensions is planned for availability in very short order.
I hope I've given you a good taste of these new features. For all the details on the XOP and FMA4 extensions, you can find the specification here. And, I encourage you to read the blog of our CMO, Nigel Dessau, for an executive perspective on driving innovation into the x86 instruction set. We believe we've struck the right balance between innovation and standardization. Feel free to comment or ask questions - we're always happy to hear from you. As you can see below, we've already heard from ten of our technology partners on the subject.
Dave Christie is a Fellow and senior architect at AMD. His postings are his own opinions and may not represent AMD's positions, strategies or opinions. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied.
Partner Support Quotes
Absoft
"The addition of AVX support by AMD is a great move as it enables superior performance potential across AMD's x86 family of processors," said Wood Lotz, Absoft CEO. "AMD's use of AVX can also simplify development of high performance compilers and tools for companies like Absoft, and enable customers across a wide variety of industries to build faster applications."
Acumem
"Acumem fully supports AMD's adoption and enhancement of the AVX instructions and will follow this standard as it becomes available in the market. As an ISV for performance tools we clearly see potential for performance improvements with these new additions" said Mats Nilsson, VP Software Engineering at Acumem.
Axceleon
"Axceleon applauds AMDs efforts to support both specifications, AVX and SSE5, in their XOP specification proposal. The further enhancements in FMA4 which accelerate floating point algorithms are very important to Axceleon's HPC customers and will be welcomed across the board" said Mike Duffy, CEO of Axceleon.
Bibble Labs
"We at Bibble Labs are constantly looking for performance improvements, and as such we are investigating AVX because of the possible performance advantage it might bring. We also appreciate that AMD is taking an active role to ensure the instruction sets converge and not create separate, conflicting instruction sets," said Jeff Stephens, Vice President of Product Development, Bibble Labs.
Cakewalk
"We commend AMD for taking an active role in open standards, by unifying the x86 instruction set and merging SSE5 into the AVX specification. This can help improve compatibility and simplify the work for developers implementing this. We look forward to investigating AVX for potential advantages it may bring to our real-time applications and plug-ins," said Noel Borthwick, Chief Technology Officer, Cakewalk.
Nero
"We are pleased that AMD has decided to adopt the AVX instruction set extension instead of offering a variant," said Simone Hoefer, General Manager, Technology at Nero AG. "This will help reduce implementation complexity and multiple code-paths. We are confident that the SIMD (SSE/SSE2) optimizations already implemented will scale nicely to 256-bit/AVX, allowing us to truly embrace this new development."
Smith Micro Software
"Having to choose acceleration solutions that work well on both AMD and Intel CPU platforms, Smith Micro welcomes convergence of the x86 instruction set. AMD supporting AVX is desirable from Smith Micro's point of view," said Uli Klumpp, director of engineering, Smith Micro Software, Inc. "The AVX instruction set extensions are looking promising for further optimizing our computationally most demanding software, DCC and data compression products such as Poser and StuffIt."
Sonic Solutions
"AMD's adoption of AVX will help Sonic unify some of its engineering efforts and reduce development costs," said Jim Roth, Chief Technical Officer, Sonic Solutions. "We welcome this initiative and the proposed enhancements to the x86 processor architecture, which we will leverage to increase the responsiveness and performance of our digital media applications."
Sony Creative Software
"We are pleased that AMD has decided to adopt the AVX instruction set extension instead of offering a variant," said John Freeborg, Vice President of Engineering for Sony Creative Software. "We also appreciate that AMD is taking an active role to ensure these converge and do not create separate, conflicting instruction sets."
-------------------------
Edited: 05/06/2009 at 10:38 AM by rex8664
May 4, 2009
| |
Beta CodeAnalyst v2.9 Released for Windows
Hello all, again --
The next version of the AMD CodeAnalyst Performance Analyzer is available for you now. I encourage you to download it in another window and read the rest of the blog while it downloads.
We've added some widely requested enhancements, deprecated one feature, and fixed *cough* a few bugs. While this release is in the Beta period, please send us feedback about anything you would like to suggest for the actual release or any issues you encounter. You're welcome to send that to us any time, but during the Beta period, we're devoted to working on issues based on your feedback. I invite you all to visit our forums for feedback, questions, and answers.
Some of the enhancements added are:
- Multiple simultaneous symbol servers.
- Process filters: You can limit the reported data to certain processes.
- An API: No longer are you limited to interacting with AMD CodeAnalyst through our command line applications or our GUI; you can now programmatically control profiling and you can fold, spindle, and mutilate the data before displaying it.
- Notes: You can add a customized note to each profile session. This feature should help you remember essential details about a session and reduce the length of session names.
- Call stack data for a running process: You can now capture call stack information about a process using the command line tool without launching the process from CodeAnalyst.
I am sorry to report that our simulation feature is now deprecated. It was useful for many reasons, but it was still a simulation of pipeline behavior. Now we have instruction-based sampling (IBS) information available. IBS can measure actual instruction execution, so I recommend that to you instead!
If you really must know about bug fixes and open issues, you can check out the release notes shipped with each version of the AMD CodeAnalyst tool. 
Most of the time since the last release has been spent writing and testing the API. I've been working hard to make the API convenient and well documented (in doxygen format) for y'all. We added the API so that you can build your own custom tools. We are including some new sample code showing how to use the API, and I would love to hear (or read) what you end up doing with it. Please post your projects and requests for further enhancements or clarifications on the forums.
Thanks!
-=Frank
-------------------------
May 1, 2009
| |
Putting Enums to work
Most Java developers are probably aware that enums were added to Java 1.5 and we are becoming more familiar with seeing them used like this:
enum LIGHT { RED, AMBER, GREEN };
Here we are defining an enum that we can use to hold the state of a traffic light. The above code allows LIGHT to be used as a new type.
LIGHT light = LIGHT.RED;
and via some magic we can use LIGHT values in switch constructs.
switch(light){
    case RED: System.out.println("Stop"); break;
    case AMBER: System.out.println("Get ready"); break;
    case GREEN: System.out.println("Go"); break;
}
We can iterate over the values of LIGHT using the array returned from the values() accessor..
for (LIGHT light:LIGHT.values()){
    System.out.println(light);
}
and can also perform ordinal comparisons..
if (light.ordinal() < LIGHT.GREEN.ordinal()){
    System.out.println("Not yet!");
}
Because enums are indeed Classes we can customize them by adding fields, constructors and methods.
So if we wanted to be able to query each 'value' for the next in the sequence (including wrapping from GREEN to RED) we can use :-
current.values()[(current.ordinal()+1)%current.values().length]
However, rather than having this logic spill out into the code using the enum, we can provide a method in the enum itself
enum LIGHT {
    RED, AMBER, GREEN;
    LIGHT next(){
        return(values()[(this.ordinal()+1)%values().length]);
    }
}
So now we can query the next value using
light = light.next();
We can also override methods for each value. So an alternative to the above implementation might be
enum LIGHT {
    RED,
    AMBER {
        LIGHT next(){ return(GREEN); }
    },
    GREEN {
        LIGHT next(){ return(RED); }
    };
    // We need a method to override, so let's assume RED is the default
    LIGHT next(){ return(AMBER); }
}
Which is a little more verbose, but in some ways more explicit. Note that the method must be declared in the enum itself (here with a default implementation) and then each 'value' can override it if it chooses.
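A variation on the same idea, offered as a sketch rather than a recommendation, is to declare next() abstract so that no value silently inherits a default. The LightCycle driver class below is my own scaffolding for running it.

```java
class LightCycle {
    public static void main(String[] args) {
        LIGHT light = LIGHT.RED;
        // Walk once around the cycle and back to the start.
        for (int i = 0; i < 4; i++) {
            System.out.println(light); // RED, AMBER, GREEN, RED
            light = light.next();
        }
    }
}

enum LIGHT {
    RED   { LIGHT next() { return AMBER; } },
    AMBER { LIGHT next() { return GREEN; } },
    GREEN { LIGHT next() { return RED; } };

    // No default body: the compiler rejects any value that forgets to
    // say what comes after it.
    abstract LIGHT next();
}
```

The trade-off is a few more lines per value in exchange for a compile-time error if a new value is added without its own next().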
Although we can't extend enums (and probably for good reason) we can implement interfaces.
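Let’s say we had an application which deals with a bounded set of file types (XML, TXT and ANY). We could make a FILE_TYPE enum which implements the java.io.FileFilter interface.
enum FILE_TYPE implements java.io.FileFilter {
    ANY,
    TXT {
        public boolean accept(File _file){
            return(_file.getName().endsWith(".txt") || _file.getName().endsWith(".text"));
        }
        public String getDescription(){
            return("TXT files");
        }
    },
    XML {
        public boolean accept(File _file){
            return(_file.getName().endsWith(".xml"));
        }
        public String getDescription(){
            return("XML files");
        }
    };
    // Defaults, used by ANY
    public boolean accept(File _file){
        return(true);
    }
    public String getDescription(){
        return("Any file");
    }
}
This then allows us to write a method for getting a file from a JFileChooser dialog. JFileChooser expects a javax.swing.filechooser.FileFilter, which is an abstract class rather than an interface, so the method wraps the enum value in a small adapter.
File getFile(final FILE_TYPE _fileType){
    JFileChooser chooser = new JFileChooser();
    // Adapt the enum to the abstract class that JFileChooser expects
    chooser.setFileFilter(new javax.swing.filechooser.FileFilter(){
        public boolean accept(File _file){ return(_fileType.accept(_file)); }
        public String getDescription(){ return(_fileType.getDescription()); }
    });
    // parent is assumed to be a Component field of the enclosing class
    int returnVal = chooser.showOpenDialog(parent);
    if(returnVal == JFileChooser.APPROVE_OPTION) {
        return(chooser.getSelectedFile());
    }
    return(null);
}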
So we can ask for an XML file using..
File file = getFile(FILE_TYPE.XML);
Obviously we need to be careful and not 'misuse' them, but I believe that enums can offer options beyond the traditional static list of values.
-------------------------
April 28, 2009
| |
AMD CodeAnalyst Workshop Summary - Part 5 of 5
Profiling JIT compiled code
Managed code is a popular approach to software development and deployment, as the developers do not have to compile it separately for different environments. Managed code executes on virtual machines which provide a secure and portable execution environment. Java and .NET are examples of managed code systems. Managed code systems use just-in-time compilation to translate from a portable, intermediate representation of a program to native (machine) code. The generated code is compiled "just-in-time" for the execution. It's possible to profile the generated code when it executes, but in order to interpret the profile data and optimize the code, the AMD CodeAnalyst tool must instrument it a little. That way we capture the generated native code and how the native code relates back to your source code.
AMD CodeAnalyst provides several profile agents, which gather the necessary information at code generation time during a profile and save the information for profile data analysis later. Since the agents behave differently for the 32-bit and 64-bit runtimes, we provide both 32- and 64-bit agents. Also, there are two profile agent interfaces for Java. The JVMPI has mostly been deprecated in favor of the newer JVMTI, but we try not to make assumptions about what Java runtime you are using and provide both. JVMPI uses the command line parameter -XrunCAJVMPIA32 in the Java application launch command to integrate the agent. JVMTI, on the other hand, uses the option -agentlib:CAJVMTIA32. If you launch the Java application through the AMD CodeAnalyst standalone GUI, we automatically add the command line option for you.
On Linux®, the source code for both agents is provided and you can compile and use the agent of your choice. However, Java source code cannot be shown with the profile data. AMD CodeAnalyst shows the generated native code in the assembly tab.
.NET has a different method of attaching the profile agent. It uses environment variables and GUIDs. Before running an application or module with managed code, you have to set Cor_Enable_Profiling=0x1. Once you've enabled the profiling, you must tell the .NET runtime environment which profiling agent to use. If you're using the 32-bit runtime, you will set COR_PROFILER={D007F1AC-DA06-4d68-BF47-E94790DD379F}. If you're using a 64-bit system, you should test whether the runtime is 64-bit or 32-bit. The 64-bit profile agent environment setting is COR_PROFILER={891D5491-7E37-4b23-BE66-1C837FED378B}. If you launch the managed application through the standalone AMD CodeAnalyst GUI, the environment variables are automatically set for you.
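As a rough sketch of the manual route, the two settings can be prepared in a launcher script. The variable names and GUIDs are the ones quoted above; the helper name `profiling_environment` is hypothetical, and the caller still has to work out whether the runtime is 32-bit or 64-bit.

```python
import os

# Profile agent GUIDs as quoted in the post, keyed by runtime bitness.
PROFILER_GUIDS = {
    32: "{D007F1AC-DA06-4d68-BF47-E94790DD379F}",
    64: "{891D5491-7E37-4b23-BE66-1C837FED378B}",
}


def profiling_environment(runtime_bits, base_env=None):
    """Return a copy of the environment with .NET profiling enabled."""
    if runtime_bits not in PROFILER_GUIDS:
        raise ValueError("runtime_bits must be 32 or 64")
    env = dict(os.environ if base_env is None else base_env)
    env["Cor_Enable_Profiling"] = "0x1"
    env["COR_PROFILER"] = PROFILER_GUIDS[runtime_bits]
    return env
```

The returned dictionary can be handed to `subprocess.Popen(..., env=env)` so that only the launched application sees the profiling settings.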
We don't currently have a profiling solution for interpreted languages like Perl, Python, or Basic. We can profile the applications, of course, but the samples are associated with the language interpreter and AMD CodeAnalyst cannot tie the profile data back to your source code for analysis. If you have a great idea for features or enhancements, please send mail to CodeAnalyst.support@amd.com.
Windows® and Linux® differences
The final topic of this blog is about the differences between AMD CodeAnalyst on Windows® and on Linux. We try to maintain feature parity on both platforms. However, there are times when a new feature may be available for one operating system platform, but not the other. Other times, like with a major hardware release, the same feature is introduced on both platforms simultaneously. If there is a feature on one version that isn't available on the other and you need it, please let us know.
There are advantages and disadvantages on both platforms, mainly due to the platform-specific method used to collect profile data. AMD CodeAnalyst uses Oprofile for data collection on Linux and uses its own proprietary driver to collect profile data on Windows. Oprofile aggregates profile data on-the-fly into summaries as it is captured, while the Windows driver writes profile data to a file so that it can be aggregated during post-processing. On-the-fly aggregation allows longer sampling sessions, since the resulting profile files are relatively compact. However, on-the-fly aggregation loses timestamp information. Without timestamp information, AMD CodeAnalyst cannot generate thread profiles.
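To make the trade-off concrete, here is a toy illustration (not Oprofile's or the Windows driver's actual data format). The sample tuples are invented. Aggregating as samples arrive leaves only counts per address, while keeping the raw records lets a per-thread timeline be rebuilt afterwards.

```python
from collections import Counter

# Hypothetical samples: (timestamp_ms, thread_id, core, instruction_address)
samples = [
    (0, 1, 0, 0x401000),
    (1, 1, 0, 0x401000),
    (2, 1, 1, 0x401010),
    (3, 2, 0, 0x402000),
]


def aggregate_on_the_fly(stream):
    """Keep one counter per address; timestamps are dropped as samples arrive."""
    counts = Counter()
    for _timestamp, _thread, _core, address in stream:
        counts[address] += 1
    return counts


def log_for_post_processing(stream):
    """Keep every record so that time-ordered views can be built later."""
    return list(stream)


summary = aggregate_on_the_fly(samples)   # compact, but no notion of time
raw = log_for_post_processing(samples)    # larger, but the timeline survives
thread1_timeline = [(t, core) for t, thread, core, _ in raw if thread == 1]
```

The summary grows with the number of distinct addresses, while the raw log grows with the length of the session, which is why the aggregated form suits long runs.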
The Windows version of AMD CodeAnalyst uses the APIC timer for time-based profiling. Oprofile does not use the APIC timer, so the CPU Clocks not Halted event is substituted as a time measurement. And speaking of events, the Windows version of AMD CodeAnalyst uses time-based event multiplexing to switch between events. Event multiplexing in version 2.7 of AMD CodeAnalyst on Linux instead re-runs the application for each event group. We hope to add time-based event multiplexing to a future Linux version and to contribute our changes to the open source Oprofile code base.
On Windows, AMD CodeAnalyst is integrated with two major integrated development environments (IDEs): Microsoft Visual Studio (2005 and 2008) and Eclipse. With Visual Studio, you can use the profile controls, profile session lists, and data windows without leaving the Visual Studio environment. The AMD CodeAnalyst plug-in will be installed by default if Visual Studio 2005 or 2008 is installed before AMD CodeAnalyst is. The plug-in for Eclipse is called "CodeSleuth". CodeSleuth uses AMD CodeAnalyst to collect and analyze compiled Java code from within the Eclipse IDE. For more information, you can go to http://developer.amd.com/cpu/CodeAnalyst/codeanalystwindows/codesleuth.
While we provide example programs in both versions, on Linux, we make the entire source code base available via GPL version 2. If you do patch something, please send it back to us, so we can incorporate it into the code base.
In conclusion, through these topics, I've tried to provide useful information to you about performance optimization and specifically how to use AMD CodeAnalyst Performance Analyzer to improve your software. If you have read through the articles and clicked a couple of the links, you should have a firm grounding in it all, so it's a good time to get started. Just in case you haven't downloaded the software yet, you can find AMD CodeAnalyst available for download at http://developer.amd.com/cpu/CodeAnalyst. The AMD CodeAnalyst tool is available at no charge, so please don't hesitate to try it out and let us know what you think. We appreciate your reading these and welcome all of your feedback, bug reports, enhancement requests, and comments.
Thank you for your time.
-Frank Swehosky
April 22, 2009
AMD CodeAnalyst Workshop Summary - Part 4 of 5
In this blog, I cover the last three profiling configurations that the AMD CodeAnalyst tool offers: event-based sampling, instruction-based sampling, and thread profiling.
Event-based sampling
The next topic in our profiling jaunt is how to use the event-based sampling feature of AMD CodeAnalyst Performance Analyzer. Event-based sampling profiles give you a better understanding of what hardware events are occurring during your profile. This should give you a clearer view of how your software is affecting the system and how it can be changed for better performance!
Background info
Performance monitoring events are specific to the manufacturer and to each version of a processor. The number of performance monitoring counters is also version-specific. There are two ways of using the event counters. You can use them directly to count events or you can trigger sampling interrupts when a certain count is reached. AMD CodeAnalyst uses the interrupt method to take samples.
You need a privileged ring 0 access driver in order to configure the events to be counted. That is why you need to be an administrator (or root) to install the AMD CodeAnalyst tool.
Third generation AMD Opteron™ processors and AMD Phenom™ processors support four event counters. To sample more than four events in one profiling session, we use event multiplexing to swap the event configurations out every few milliseconds. Each core has its own set of counters. AMD CodeAnalyst currently configures all cores identically. When an event counter is configured, AMD CodeAnalyst currently also sets an event count based on the configuration given for the profile. When the event count, or "sampling period", is reached, an interrupt is generated in order to take a sample. More interrupts and more data are generated for a profile when the event count is smaller. If the event count is too low, an interrupt will occur while processing the previous interrupt and the system will grind to a halt. Since the frequency of events may vary by system and the applications that are running, we allow you to go to improbable limits in your search for data. If your system does lock up, for the next profile simply change the profile configuration to increase the event count to a safer and larger number.
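The arithmetic behind both points is simple enough to sketch. This is an illustration only, not how the driver is implemented. The event select values are the ones from the list later in this post, and the workload figure of 2 billion events per second is an assumption chosen for round numbers.

```python
# Event select values taken from the list later in this post.
EVENTS = [0x040, 0x041, 0x076, 0x0C0, 0x0CB, 0x0E9]


def multiplex_groups(events, counters=4):
    """Split the events into groups that fit the available hardware counters."""
    return [events[i:i + counters] for i in range(0, len(events), counters)]


def interrupts_per_second(events_per_second, event_count):
    """One sampling interrupt is raised each time `event_count` events occur."""
    if event_count <= 0:
        raise ValueError("event count must be positive")
    return events_per_second / event_count


groups = multiplex_groups(EVENTS)  # two groups: four events, then two

# Hypothetical workload: 2 billion CPU clock events per second.
safe_rate = interrupts_per_second(2_000_000_000, 250_000)   # 8,000 per second
risky_rate = interrupts_per_second(2_000_000_000, 1_000)    # 2,000,000 per second
```

Six events need two groups on a four-counter core, so the groups take turns on the hardware. The second rate shows why a very small event count can swamp the system with interrupts.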
Using event-based sampling
Different processor families and implementations support different performance events. The AMD CodeAnalyst tool takes this into account and only shows you the events available on your system. AMD CodeAnalyst provides predefined profile configurations that group appropriate events together in order to gather data on specific subjects:
- Assess performance
- Investigate L2 cache access
- Investigate branching
- Investigate data access
- Investigate instruction access
You can also customize event configurations as shown in the figure below, to find out what matters and is useful to you. AMD CodeAnalyst provides a template "Current event-based profile" that you can modify to your heart's content. If you want to share your profile configurations, you can export them and other team members can import them. This short blog entry doesn't have enough space to provide a detailed description of each available event. You can read the specific events for your system in the BIOS and Kernel Developer's Guides (BKDG) at http://developer.amd.com/documentation/guides. You will need to read the BKDG that is specific to the processor within your platform. Here's a brief list of some of the events we've found useful:
- 0x040: Data Cache Accesses
- 0x041: Data Cache Misses
- 0x076: CPU Clocks Not Halted
- 0x0c0: Retired Instructions
- 0x0cb: Retired MMX™ / FP Instructions
- 0x0e9: CPU/IO Requests to Memory/IO
The CPU/IO Requests to Memory/IO event is especially useful for measuring memory requests on NUMA platforms, where memory latency can severely impact performance.

Figure 1 - Event-based sampling profile configuration
In the figure above, you'll note that the CPU/IO Requests to Memory/IO event has been selected, and that the unit masks have been set to count an event when data is requested from a non-local node. This configuration is only useful in a NUMA setup, and I've written more about why it's useful in the thread profiling section below.
Depending on the type of event data that was collected, you may be able to use different views to display different aspects of the data. Event-based sampling profiles offer many more views than timer-based sampling profiles. Whether a particular view is present for a profile depends on the event data that was collected. If the appropriate event data is available, you can get instructions per cycle (IPC) assessments, branch assessments, and so on. You can always use the "All Data" view to see the raw sampled data. For derived measurements in the views, the data is normalized, so the calculations and comparisons are valid.
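Here is one way such a derived measurement could be computed. It is a sketch under the assumption that normalizing means scaling each sample count by its event count (sampling period) before taking the ratio. I'm not describing the tool's exact internal formula, and the sample counts are invented.

```python
def estimated_events(sample_count, sampling_period):
    """Each sample stands for roughly `sampling_period` occurrences of the event."""
    return sample_count * sampling_period


def derived_ratio(num_samples, num_period, den_samples, den_period):
    """Ratio of two events after scaling each by its own sampling period."""
    denominator = estimated_events(den_samples, den_period)
    if denominator == 0:
        return None  # no data for the denominator event
    return estimated_events(num_samples, num_period) / denominator


# Hypothetical profile: Retired Instructions (0x0c0) vs CPU Clocks Not Halted (0x076).
ipc_same_period = derived_ratio(1200, 250_000, 2000, 250_000)
# Same workload, but instructions sampled half as often (period doubled).
ipc_mixed_period = derived_ratio(600, 500_000, 2000, 250_000)
```

Both calls describe the same workload and give the same IPC of about 0.6. Comparing the raw sample counts (600 against 2000) in the second case would be misleading.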
Just like in timer-based sampling, you can drill down to investigate particular modules and functions further. Depending upon the resulting profiles, you may be able to determine if you need to change an algorithm completely, or just modify some data structures and code to implement better data access patterns. If you go down to the assembly level, however, there is some inaccuracy with the instruction addresses associated with a sample, due to sampling skid and out-of-order execution. For more information on out-of-order execution, please go to http://en.wikipedia.org/wiki/Out-of-order_execution. The problem is that the event may have been triggered by one instruction, but the interrupt handler captures the address of the next instruction or some other instruction within the neighborhood of the culprit. Instruction-based sampling is designed to eliminate this inaccuracy.
Instruction-based sampling
Closely related to event-based sampling is instruction-based sampling. It is only available on systems with AMD Family 10h processors. The mechanism is different from event-based sampling in that after the count is triggered, a fetch or execution operation is tagged, and the events that occur throughout its execution are tracked. After the operation retires, the driver retrieves event information through the raised interrupt.
There are two types of data that can be profiled simultaneously: fetch and op. The fetch count is the number of completed fetched operations between tagged fetches. For the op sampling, there are two methods of tagging an op. The cycle method will wait for the specified number of processor cycles and then tag an op that will be dispatched in the next cycle, if a valid op is available. The cycle method has a small, but unavoidable timing bias that will cause certain ops to be tagged more often than the actual execution frequency. The dispatched op method counts ops as they are dispatched and tags the next available op when the op interval (period) expires. The dispatched op method is designed to reduce sampling bias.

Figure 2 - Instruction-based sampling configuration
Since instruction-based sampling provides the address of the tagged operation as well as the events caused by it, we have the exact address at which events occurred, which resolves the concerns raised earlier about the accuracy of event-based profiling data. We also can collect data on a large number of events without multiplexing or specifying complex configurations. In addition to simple sample counts, more information is collected, like latency counts and the memory addresses used for load and store events. Instruction-based sampling is a new technique and has the potential for other kinds of analysis. If you have a good idea about how to display or use the effective addresses to help you optimize your application, please let us know!
Thread profiling
And now for something completely different -- thread profiling. This feature was requested by a user for investigating a ccNUMA performance issue. Thread profiling is only available on Windows, because Oprofile strips out the timestamp data during a profile and a timeline of thread-oriented events cannot be reconstructed during post-processing.
Here's the theory first. The system hardware architecture can have a significant impact on memory latency, depending on memory access patterns. If you aren't familiar with NUMA or SMP architecture, here's a good starter link: http://en.wikipedia.org/wiki/Non-Uniform_Memory_Access. In symmetric multi-processor systems, the RAM is equally available to all processors. In NUMA systems, each processor has access to its own RAM. Also, memory allocated in one set of memory doesn't get transferred directly to another set. Therefore, if a thread allocates memory on one processor and then moves to a different processor, its memory accesses will have greater latency because they need to go through the first processor. For more theory and a whole lot more depth, I refer you again to AMD's Software Optimization Guide (http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/40546.pdf). Some good heuristics for optimizing processor usage with multiple threads on NUMA systems are:
- Maintain balanced system loads through core-scheduled threads
- For threads with independent data, try to schedule the threads on different idle nodes. When there are no more available idle nodes, try to schedule the threads on the idle cores across the nodes.
- For threads with shared data, try to schedule the threads on different idle cores of the same node.
- Try to avoid switching a thread to a different node than the one on which it was created.
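The second and third heuristics can be sketched as a small placement planner. This is purely illustrative. The function names and the node and core numbering are made up, and actually pinning threads requires operating system affinity calls that are not shown here.

```python
def spread_across_nodes(threads, idle_cores):
    """Independent data: one thread per idle node first, then the remaining
    idle cores across the nodes. `idle_cores` maps node id -> idle core ids."""
    remaining = {node: list(cores) for node, cores in idle_cores.items()}
    pending = list(threads)
    placement = {}
    while pending and any(remaining.values()):
        for node in sorted(remaining):
            if not pending:
                break
            if remaining[node]:
                placement[pending.pop(0)] = (node, remaining[node].pop(0))
    return placement


def pack_on_one_node(threads, idle_cores):
    """Shared data: use different idle cores of a single node, if one fits."""
    for node in sorted(idle_cores):
        cores = idle_cores[node]
        if len(cores) >= len(threads):
            return {thread: (node, core) for thread, core in zip(threads, cores)}
    return None  # no single node has enough idle cores for the whole group


# Hypothetical two-node system with two idle cores per node.
idle = {0: [0, 1], 1: [2, 3]}
independent = spread_across_nodes(["a", "b", "c"], idle)
shared = pack_on_one_node(["x", "y"], idle)
```

Threads "a" and "b" land on different nodes and "c" takes the next idle core, while "x" and "y" stay together on node 0 so their shared data remains local.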
If you want to examine how and on which cores your threads are executing, AMD CodeAnalyst has a thread profile configuration for you. The thread profile data is shown differently from other profile data.

Figure 3 - Thread profile data
A thread profile is shown as a time chart, with all the available cores shown for each thread. Thread activity is divided into time-slices. The color of a time-slice indicates user activity (green) or kernel activity (yellow). You can change the time-slice period. The location of a colored time-slice indicates the core which executed the thread during that time period. You can also set a threshold value to see if there were more samples than the threshold in a time-slice.
The other main feature of the thread profile is the ability to track non-local memory accesses. A non-local memory access is a memory access across nodes on a ccNUMA system. In the AMD CodeAnalyst tool, the occurrence of a non-local memory access during a time-slice is represented as a red bar. If there are samples available, you can open the "Non-Local Memory Accesses" tab and see a list of the modules that had the accesses. You can expand each module and see the list of functions in which the accesses occurred. Expand again, and you can see the address list. Double-click an address item and you will open a source code tab to show you in the code where the trouble is happening.
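For readers who want to picture the data behind the chart, here is a rough sketch of how timestamped samples could be grouped into time-slices. The sample tuple layout and the 10 ms slice length are assumptions for illustration, not the tool's file format.

```python
from collections import defaultdict


def build_timeline(samples, slice_ms):
    """Group samples into per-thread time-slices.

    Each sample is (timestamp_ms, thread_id, core, mode, non_local), where
    mode is "user" or "kernel" and non_local flags a cross-node memory access.
    """
    timeline = defaultdict(dict)
    for timestamp, thread, core, mode, non_local in samples:
        index = timestamp // slice_ms
        cell = timeline[thread].setdefault(
            index, {"cores": set(), "user": 0, "kernel": 0, "non_local": False}
        )
        cell["cores"].add(core)          # where the thread ran in this slice
        cell[mode] += 1                  # user (green) or kernel (yellow) activity
        cell["non_local"] = cell["non_local"] or non_local  # red bar
    return {thread: dict(slices) for thread, slices in timeline.items()}


# Hypothetical samples.
samples = [
    (0, 1, 0, "user", False),
    (4, 1, 0, "kernel", False),
    (12, 1, 1, "user", True),
    (13, 2, 0, "user", False),
]
timeline = build_timeline(samples, slice_ms=10)
```

In this toy data, thread 1 moves from core 0 to core 1 between the first and second slice and picks up a non-local access there, which is exactly the pattern the red bars are meant to expose.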
If you would like to suggest thread profiling features, we are currently collecting requirements for a thread analysis tool. Please send any ideas and requests to CodeAnalyst.support@amd.com.
-Frank Swehosky
Edited: 04/28/2009 at 03:02 PM by devcentral
April 21, 2009
AMD CodeAnalyst Workshop Summary - Part 3 of 5
This blog covers the first type of profiling: timer-based sampling and call stack sampling.
Timer-based sampling
Here I discuss why and how to use the timer-based sampling feature of AMD CodeAnalyst Performance Analyzer. If you're already convinced about using timer-based sampling, you can skip ahead to the section on how to use it now.
Why should you use timer-based sampling?
Traditionally, the way to calculate performance gains (or losses) is by measuring the time a program, function, or loop takes to execute. This could be done for each optimized version of the program by reading the time before and after execution, and then calculating the difference (elapsed time). The rdtsc ("read timestamp counter") instruction can be used to read the current time as measured in processor cycles. The problem with this method arises with multi-core and multi-node systems. Each core in the system maintains a separate timestamp counter. The counters are not guaranteed to be synchronized. While you can start to see the trouble, it gets worse if your system uses power management (clock throttling), running some cores at different clock rates as they're needed. If a thread switches cores during execution, the end timestamp may have no correlation to the beginning timestamp. Your performance calculations are now suspect.
AMD CodeAnalyst instead takes a statistical sampling approach. On Windows®, an APIC timer signals when samples should be taken on each core. On Linux, AMD CodeAnalyst uses the open source profiler Oprofile to collect performance data. Oprofile is unable to use the APIC as a timer. Instead we use the CPU Clocks not Halted event (CPU_CLK_UNHALTED, with event select value 0x76) to measure time, with calculations based on the system clock speed to convert from cycles to milliseconds. On both Windows and Linux®, each sample can be interpreted as a period of time of execution. As the sampling period decreases, resolution increases and the data becomes less approximate, but the number of samples taken also increases. Thus, with the AMD CodeAnalyst tool, the approximate time spent in an algorithm (code region) is reflected by the number of samples taken in your program. This works over all cores, and can show you how much time was spent on specific processors. This gives you a good estimate of performance. After several rounds of optimization, you should see the time spent in the optimized algorithms decrease. Since this is a statistical method, the performance estimate may be inaccurate if insufficient samples are taken.
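The two conversions mentioned above are easy to sketch. The function names, the 2 GHz clock, and the sample counts are assumptions for illustration. The point is only that cycles become milliseconds through the clock speed, and samples become time through the sampling interval.

```python
def cycles_to_ms(cycles, clock_hz):
    """Convert a cycle count to milliseconds at a given clock speed."""
    return cycles * 1000.0 / clock_hz


def estimated_time_ms(sample_counts, interval_ms):
    """Each sample is read as `interval_ms` of execution in that function."""
    return {name: count * interval_ms for name, count in sample_counts.items()}


def time_share(sample_counts):
    """Fraction of all samples that landed in each function."""
    total = sum(sample_counts.values())
    return {name: count / total for name, count in sample_counts.items()}


# Hypothetical: a sampling period of 2,000,000 cycles on a 2 GHz core.
interval = cycles_to_ms(2_000_000, 2_000_000_000)

# Hypothetical sample counts for two functions.
counts = {"parse": 300, "render": 700}
times = estimated_time_ms(counts, interval)
shares = time_share(counts)
```

With a 1 ms interval, 700 samples in `render` suggest roughly 700 ms spent there, or 70 percent of the sampled time. Fewer samples would make that estimate shakier.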
How do you use timer-based sampling?
This article isn't supposed to substitute for the tutorial, so I won't go into too much detail here. First, open AMD CodeAnalyst and create a new project. Next, open the configuration manager dialog, select "Current time-base profile" and click the Edit button.

Figure 1 - Timer-based sampling configuration
You can set the timer interval to your preference. With a smaller interval, your data is more accurate, but more profile data is collected and overhead is higher. Before you start, make sure your module is running during the profile, or the data won't be that interesting. Click OK to accept the timer configuration. Then click the green Start button in the toolbar to collect profile data.
The newly created profile should open automatically after data collection. The view called "time-based profile" is shown, which by default shows the samples as a percentage of the samples taken. If you want to get the raw sample counts, you can click the "Manage" button and uncheck "Show Percentage". You can go to the graph tab to see the same data in chart form.

Figure 2 - System Graph Tab
To get more information on your module of interest, just double-click on the corresponding item. You can see where time is spent in different functions when you look at the module data tab. Samples are aggregated at the address level, so if you have symbolic information, you can go to the source tab and see the sample distribution across source-level statements, or you could just open up the assembly tab that shows the sample distribution across native (assembler) code.
With information about where the time is spent during execution, the next step is to figure out why time is spent in hot code regions. You can collect event-based profiles to accomplish this task. If you want a better understanding of what call paths resulted in the time being spent in the sampled functions, you may want to use call stack sampling.
Call stack sampling
You can use call stack sampling to understand how different call paths affect the time spent in functions. Currently, call stack sampling is only available on Windows when you launch a process from within the AMD CodeAnalyst tool. While call stack sampling has more overhead than regular timer-based profiling, it is far less intrusive than the instrumentation required to collect a complete call graph. However, call stack sampling is still based on sampling and its results are subject to statistical variation. The call graph is usually incomplete because infrequently executed functions may not be sampled at all.
Call stack sampling requires a lot more profiling overhead. You can change the amount of effort spent on call-stack unwinding with the Session Settings dialog, as seen in the following image.
Figure 3 - Session Settings Dialog
The call stack unwind level controls how deeply the call stack is explored during the profile, that is, how many call-return addresses are traced when a sample is taken. More processing time is used in order to explore longer call paths. The value of 10 in the figure above means that the profiler will attempt to trace up to ten call-return addresses for each call stack sample. The call stack interval controls how often during the profile a call stack sample is taken. In the figure above, with the value of 10, there will be one call stack sample collected for every ten timer-based profiling samples.
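A small simulation shows how the two settings interact. This is a sketch, not the profiler's code. Which of every ten timer samples carries the call stack is an assumption here (the tenth), and the stacks are invented.

```python
def collect_call_stacks(timer_samples, unwind_level=10, css_interval=10):
    """Return the call stacks that would be recorded for a run.

    `timer_samples` holds one full call stack per timer sample, innermost
    frame first. Every `css_interval`-th sample records a stack, truncated
    to at most `unwind_level` frames.
    """
    recorded = []
    for index, stack in enumerate(timer_samples, start=1):
        if index % css_interval == 0:
            recorded.append(stack[:unwind_level])
    return recorded


# Hypothetical run: 25 timer samples, each with a 15-frame call stack.
deep_stack = [f"frame{depth}" for depth in range(15)]
stacks = collect_call_stacks([deep_stack] * 25)
```

Only two of the 25 samples record a call stack, and each keeps just the ten innermost frames. Raising either setting's cost is therefore easy to reason about: a higher unwind level lengthens each stack, and a lower interval records stacks more often.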
The call stack sampling (CSS) data is made available through the Processes tab. Select the launched process, and use either the "Call-stack data" button or the context menu item. The call stack tab has two parts. The top part has the call tree data. You can expand the tree and see what methods had call stack samples. The call tree shows you which other functions called the sampled functions. The depth of the tree is limited by the unwind level used in the profile configuration. 'Self' samples are the samples taken in the function itself. 'Children' samples are samples taken in other functions while this function was one of their ancestors on the call stack. The bottom part of the call stack tab is called the butterfly view. Its contents depend on which function item is selected in the top part. It gives detailed information on the ancestors (to the left) and the children (to the right). In the ancestors section, you can see all the functions that called the selected function. If you expand the items in the ancestor section, you can see all the call sites within the function from which the selected function was called. You can also see the call frequency, to help determine if there is a particular path that needs improvement. In the children section, you can expand the items to see the addresses at which samples were taken.
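The 'Self' and 'Children' counts can be illustrated with a few lines of code. This sketch assumes each recorded stack lists the sampled function first, followed by its callers. The function names are invented, and this is not the tool's own aggregation code.

```python
from collections import Counter


def self_and_children(stacks):
    """Count 'Self' and 'Children' samples from recorded call stacks.

    Each stack lists the sampled (innermost) function first, then its callers.
    """
    self_counts = Counter()
    children_counts = Counter()
    for stack in stacks:
        if not stack:
            continue
        self_counts[stack[0]] += 1
        # set() so a recursive caller is counted once per sample.
        for ancestor in set(stack[1:]):
            children_counts[ancestor] += 1
    return self_counts, children_counts


# Hypothetical call stack samples.
stacks = [
    ["leaf", "mid", "main"],
    ["leaf", "main"],
    ["mid", "main"],
]
self_counts, children_counts = self_and_children(stacks)
```

Here `main` has no 'Self' samples but three 'Children' samples, which tells you the time is being spent further down its call paths rather than in `main` itself.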
-Frank Swehosky
Edited: 04/28/2009 at 04:03 PM by devcentral
April 15, 2009
Java Posse interview
A few weeks ago, the Java Posse interviewed Azeem, Gary, and me. The podcast has been posted! A lot of great topics were covered, including JVM performance, multi-core programming, developer tools and more. Have a listen, then comment here.
Many thanks to the Java Posse for the opportunity.
Ben
Edited: 04/15/2009 at 12:18 PM by bpollan
April 13, 2009
AMD CodeAnalyst Workshop Summary - Part 2 of 5
AMD CodeAnalyst Tools and Usage Model
This blog is about the tools we provide as part of the AMD CodeAnalyst Performance Analyzer. It should give you an idea of how to use the tools and how to best fit profiling into your software life cycle.
Customer requests drive the evolution of the AMD CodeAnalyst tool and its features. If there is a feature that you would really like to see in our tool, please contact us at CodeAnalyst.support@amd.com.
The main usage model for AMD CodeAnalyst divides its operation into two parts:
- Profile data collection
- Post-processing
Profile data is communicated to post-processing through files. AMD CodeAnalyst provides both GUI and command line tools. The install package includes profile agents to capture Just-In-Time (JIT) generated code. Predefined profile configurations and basic view analyses make it easy to collect, post-process and view profile data.
The profile driver on Windows® is the base for lightweight profile data collection and it is included in the AMD CodeAnalyst installation. Lightweight in this instance means that the overhead of profiling doesn't significantly change the characteristics of the system being profiled. Profile data consists of samples, which the driver collects. When it is time for a sample to be taken, either because of an event count trigger, a timer trigger, or an instruction trigger, an interrupt occurs. The driver handles the interrupt and collects data like which processor core took the interrupt, the timestamp when the interrupt happened, the execution instruction pointer address, the process ID, the thread ID, and even register values. The total amount of driver processing required (sampling overhead) depends on the sampling frequency and whether call stack sampling is enabled. The more data that the driver has to collect when profiling, the more time it takes to just save the data. Ideally, there would be no overhead, but since we do most of our processing after the profile, once the data has been collected, we consider the current driver to be lightweight. On Linux, we use Oprofile as our profiling subsystem.
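To picture what one sample carries, here is a hypothetical record type with the fields named above. The class name, field names, and types are my own illustration and do not reflect the driver's actual file layout.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Sample:
    """One profile sample, with the fields listed in the text (layout hypothetical)."""
    core: int                     # processor core that took the interrupt
    timestamp: int                # when the interrupt happened
    instruction_pointer: int      # address that was executing
    process_id: int
    thread_id: int
    registers: Dict[str, int] = field(default_factory=dict)  # optional register values


# A made-up sample for illustration.
example = Sample(core=2, timestamp=123456, instruction_pointer=0x401000,
                 process_id=4242, thread_id=7)
```

Every extra field, such as register values or a call stack, adds to what has to be saved per interrupt, which is where the sampling overhead comes from.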
The AMD CodeAnalyst GUI post-processes the collected data and displays profiles. It uses a project and profile session usage model. A project corresponds to an application or module that you are optimizing or investigating. Each profile session is a 'run' or profile instance that can be named, reviewed, saved, or deleted. The sessions are saved in separate directories under the project. The appropriate JIT data files generated by the profile agents are also saved in the session directories, so you can review the JIT data later or compare the code output by different runtime engines. Since AMD CodeAnalyst is file-based, it is very easy to share data between co-workers, either by copying the whole project or by just importing the particular sessions in which you're interested.
AMD CodeAnalyst is a system-wide profiler, so you can monitor services, drivers, and any server applications that start up automatically. You also have the opportunity to launch an application or batch file when you start to collect a profile. You can choose many settings for each profile session, but you must choose a 'profile configuration' when collecting a profile using the AMD CodeAnalyst GUI. We have included many pre-defined configurations, based on our experience on what people have traditionally used to profile: execution time, an overall assessment of performance, data accesses, branching behavior etc. If you want, you can customize the configurations, changing the sampling frequency and the type of data collected. We do not make assumptions about the limits of your system, so we give you the ability to shoot yourself in the foot by choosing a high sampling frequency, thereby asking for more information than the profiler can handle. If you request too much data in one profile session by setting the sampling frequency too high, the system will lock up.
After your profiling session is finished, AMD CodeAnalyst analyzes and aggregates the data. When you display data from the profiling session, you can choose from multiple analysis views. Which views are available depends on what data is available from a profile. Some views provide basic ratios, like Instructions per cycle (IPC) or the DTLB request rate. For ratios, the data is normalized or weighted, so statistical comparisons make sense.
While I'll be writing about the explicit differences between the versions available on Windows and Linux® later, there is one additional tool on Windows that should be mentioned here: the Microsoft® Visual Studio integration package. It is integrated directly into Visual Studio and allows you to use most of the functionality of the stand-alone AMD CodeAnalyst GUI for each solution. Another integration tool, called CodeSleuth, is available as an Eclipse plug-in for Java performance measurement. More information can be found on our main CodeSleuth page.
There are separate versions of the AMD CodeAnalyst tool for Windows and Linux.
AMD CodeAnalyst for Windows includes:
- Stand-alone AMD CodeAnalyst GUI to collect and analyze profiles
- Online help including descriptions of performance events
- Visual Studio integrated package
- CaProfile.exe command line utility to collect profile data
- CaDataAnalyze.exe command line utility to analyze profile data files
- CaReport.exe command line utility to display analysis results
- Java profiling agents (32-bit/64-bit for both JVMPI and JVMTI)
- Pause and resume profiling control API
- AMD CodeAnalyst for Windows profiling driver
The AMD CodeAnalyst tool for Linux includes:
- Stand-alone AMD CodeAnalyst GUI
- Online help including descriptions of performance events
- Source code (GPL v2) for AMD CodeAnalyst and the Oprofile open source profiler
- Java profiling agents (32-bit/64-bit for both JVMPI and JVMTI)
- Kernel modules to support data collection on the latest AMD processors
-Frank Swehosky
Edited: 04/28/2009 at 04:03 PM by devcentral
April 6, 2009
Looking for Information on Overclocking?
Here inside AMD Developer Central, we continuously work to improve your experience on our site. As part of these efforts, we periodically check the search logs to learn what our users are interested in and to ensure that we have useful content around those topics. One search term that seems to come up often is "overclocking."
Overclocking means running computer components (such as the CPU or GPU) at a higher speed than the manufacturer specified and designed them for. Users typically overclock to increase performance and to save money. However, running the system at a higher speed increases power consumption, which leads to more heat, more noise, and potential stability issues.
In a professional setting, overclocking can be risky. (That's why we issue warnings, like the one below.) A minor error could seriously affect system performance and delay project schedules. But if you are a hobbyist, tweaker, or gamer who is willing to take the risk and wants to extract every last drop from the CPU or GPU, then overclocking is probably a topic you are very interested in. For overclocking on AMD systems, we recommend AMD OverDrive™, a state-of-the-art system management tool that includes overclocking capabilities.
Patrick Moorhead, AMD's VP of Advanced Marketing, has also written several blogs on this topic:
http://blogs.amd.com/patmoorhead/tag/overclock/
Also watch these videos:
The Proving Grounds on YouTube in HD: http://links.amd.com/PROVINGGROUNDS
The Proving Grounds on YouTube in Standard Definition: http://links.amd.com/SDPROVINGGROUNDS
We hope this information helps. So which pill would you take: the red or the blue? Do you still intend to overclock your system and, if so, is it your work system or your home system?
***WARNING*** AMD and ATI processors are intended to be operated only within their associated specifications and factory settings. Operating your AMD or ATI processor outside of specification or in excess of factory settings including, but not limited to, overclocking may damage your processor and/or lead to other problems including, but not limited to, damage to your system components (including your motherboard and components thereon (e.g. memory)), system instabilities (e.g. data loss and corrupted images), shortened processor, system component and/or system life and in extreme cases, total system failure. AMD does not provide support or service for issues or damages related to use of an AMD or ATI processor outside of processor specifications or in excess of factory settings. You also may not receive support or service from your system manufacturer. DAMAGES CAUSED BY USE OF YOUR AMD OR ATI PROCESSOR OUTSIDE OF SPECIFICATION OR IN EXCESS OF FACTORY SETTINGS ARE NOT COVERED UNDER YOUR AMD PRODUCT WARRANTY AND MAY NOT BE COVERED BY YOUR SYSTEM MANUFACTURER'S WARRANTY.
-------------------------
Velu, Jayaprakash
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.
Edited: 05/11/2009 at 08:04 AM by jkvelu