AMD Developer Central
AMD Developer Blogs
October 13, 2009
  Dealing With Reality | The Interview | ATI Stream and OpenCL | Part 2

In Part 1, on the AMD At Home blog, Simon Solotko gave an overview of open, parallel computing with ATI Stream and OpenCL. Here, in Part 2, Simon Solotko and Ben Sander discuss the power of ATI Stream technology and the elegant, standards-based interface now available with OpenCL for GPU.

Ben, what have we created with OpenCL and what does it do?

Ben: Sure. With OpenCL we created a C-based interface for programming a range of parallel processors. Developers write OpenCL Kernels, the sub-routines they want to accelerate or offload, and embed these in their applications. OpenCL includes a runtime component which allows these OpenCL Kernels to be compiled at runtime for either a CPU or a GPU. AMD has contributed to the development of the OpenCL specification and written the implementation for x86 processors and GPUs: a runtime environment which compiles the code near runtime, then schedules and executes the code at runtime.

What are the benefits of being able to compile an application for a CPU or a GPU?

Ben: Developers can write one piece of code and easily support a variety of compute devices in the platform: CPUs and GPUs, from multiple vendors. Code can be load-balanced between CPU and GPU depending on the capabilities of the final platform. For example, we expect that some applications or parts of applications will run faster on the CPU than on the GPU, while others will perform better on the GPU. Finally, the OpenCL CPU implementation leverages the CPU hardware debug features to provide excellent debug capabilities, using familiar debug environments, at full CPU speed.

When exactly during runtime is the Kernel compiled?

Ben: There are specific commands which you call within the body of your application to compile the Kernel and direct it to be compiled for the CPU or GPU. At that point, the Kernel code is translated into a binary. The binary later executes natively when the Kernel is called. The code is not interpreted in the hot spot of the loop; it's not like Java in that regard.

So the code within a Kernel looks like C but can be compiled to execute on the GPU?

Ben: Exactly. Because a GPU looks and functions differently than a CPU, however, you have to think differently when you write the Kernel for GPU, because at that point, you are executing your code directly on the GPU. There are constraints imposed on Kernel code to accommodate the specialized functionality of the GPU. Kernels are based on C99 with extensions provided by OpenCL-C for vectors and address spaces.

Give me some examples of the special ways in which the C code within a Kernel is different from the standard code in the body of the application?

Ben: To understand writing a Kernel it is important to understand that the code is actually executing on a GPU, despite the fact that the functions you are performing are syntactically the same as other C code. A GPU has a small fast cache (local memory) and larger main GPU memory (global memory). You move data in blocks, and complete as much of the task on that block as possible before moving the block out and moving the next block in. With a GPU we have a lot of compute bandwidth relative to memory bandwidth making it advantageous to do as much as you can to data within the cache. With OpenCL the blocking process does not necessarily get easier, but you can control it from C code.
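The blocking pattern Ben describes can be illustrated without a GPU. The following is a plain Java analogy, not OpenCL code: the `global` array stands in for global memory and the small `local` buffer stands in for local memory. The class name, the `BLOCK` size, and the arithmetic performed on each element are all assumptions made for illustration.

```java
import java.util.Arrays;

// Plain-Java analogy of GPU blocking; this is not OpenCL code.
// "global" stands in for the large global memory and "local" for the
// small, fast local memory. All names here are hypothetical.
class BlockingSketch {
    static final int BLOCK = 4; // illustrative block size

    // Applies three arithmetic passes to every element, one block at a time.
    static float[] processInBlocks(float[] global) {
        float[] result = new float[global.length];
        float[] local = new float[BLOCK];
        for (int start = 0; start < global.length; start += BLOCK) {
            int n = Math.min(BLOCK, global.length - start);
            // Move one block in.
            System.arraycopy(global, start, local, 0, n);
            // Do as much work as possible while the block is "local".
            for (int pass = 0; pass < 3; pass++) {
                for (int i = 0; i < n; i++) {
                    local[i] = local[i] * 2.0f + 1.0f;
                }
            }
            // Move the finished block out before fetching the next one.
            System.arraycopy(local, 0, result, start, n);
        }
        return result;
    }

    public static void main(String[] args) {
        float[] data = {0f, 1f, 2f, 3f, 4f, 5f};
        System.out.println(Arrays.toString(processInBlocks(data)));
    }
}
```

Each element is read from and written to the large array exactly once, while all three passes run against the small buffer. That ratio of arithmetic to memory traffic is what the blocking approach aims to maximize.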

How do we move data from main memory to the GPU memory for use by a Kernel function?

Ben: A Kernel cannot move data from main memory; that is done in your application code. So there are standard functions to copy memory into GPU memory from the application, and pointers to this memory can then be passed to a Kernel function. The Kernel function can then copy memory into the fast cache or "local" memory.

This sounds a bit complicated, but I have to remind myself, this is all standard C code, and we are discussing the optimization that makes something run fast on the GPU, and the memory management tools that are available, now within standard C through the OpenCL library, to do that.

Ben: That's right. The magic is that a Kernel is C code which is amazingly compiled by the runtime component of OpenCL to run on a GPU or CPU, with some extra tools to ensure it can take full advantage of the extremely high compute-to-memory bandwidth capability of the fast, parallel math engine of the GPU.

So as time goes on, we anticipate that people will write and optimize many useful Kernels which will simplify the development of complex applications?

Ben: Yes. It is relatively straightforward to port applications written for other GPGPU languages like Brook+ and CUDA to OpenCL. This is a huge step forward from proprietary GPU code: you now have a standard way to get at GPU code and memory from C in a platform-independent way.

With ATI Stream technology and the standardization of the programming model with OpenCL for GPU, almost any aspiring GPGPU developer can download the tools necessary to get started and develop platform-independent software fueled by the power of the evolved GPU. I have collected resources below to get you started. Enjoy blazing the trail of a new frontier in computing!

For more information, watch as AMD's Mike Houston discusses OpenCL and what the future has in store for software applications that use it.

If you are ready to get started with OpenCL, you can begin with AMD's OpenCL resource page here.  

Simon has regular posts on the AMD At Home blog and you can check out The Digital Nexus series here.



-------------------------

Simon Solotko is a Senior Advanced Marketing Manager at AMD. His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied.



    Posted By: Simon Solotko @ 10/13/2009 02:44 PM     ATI Stream     Comments (1)  

September 15, 2009
  AMD Developer Inside Track, Episode 2: OpenCL Introduction

AMD has always been an advocate of open standards that build on and extend proven technologies (example: x86-64).  As such, it is a natural fit for AMD to embrace OpenCL as part of its ATI Stream offering.  But, just what is OpenCL? 

In this month's episode of the AMD Developer Inside Track I interview Mike Houston, GPG System Architect.  He talks about what OpenCL is, what the transition to this new language will be like and he gets into what applications could benefit from OpenCL, as well as what the future has in store for software applications that use it.   

One of the advantages of OpenCL is its advanced queuing system which is great for game development. It is also designed to work very well with various graphics APIs such as OpenGL, DirectX 9 and DirectX 10. 

Game developers aren't the only ones who can take advantage of OpenCL though.  According to Michael, it is going to be very useful for applications such as media encoding, virus scanning, and physics to name a few.  It makes a lot of sense for AMD to move to a ubiquitous computing language that runs on platforms everywhere.  The next few years will be an interesting time for GPGPU technology as several hardware and software vendors get on board. 

ATI Stream technology is gaining significant momentum.  Some cool and unexpected examples of ATI Stream technology in action are:

An example of gaming technology and OpenCL:

Watch the AMD Developer Inside Track, Episode 2 for the full story.

 -Sharon Troia, AMD Developer Outreach



-------------------------

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.



Edited: 09/16/2009 at 03:21 PM by AMD Developer Blogs Moderator


    Posted By: AMD DeveloperCentral @ 09/15/2009 03:58 PM     ATI Stream     Comments (4)  

September 7, 2009
  Framewave Multipass Build System

Developing libraries can be difficult, fun and interesting; an equally difficult task is testing the library and distributing it, so that other developers can use the library in their projects.  The big advantage of using libraries to accomplish certain functionalities is that libraries are already tested and optimized for various platforms.  For the libraries optimized for particular platforms, there needs to be a dispatch mechanism to select the best optimized path depending on the processor.  I have found that the build system from the Framewave library provides a good solution to accomplish this.

 Derived from the AMD Performance Library, Framewave is a free-of-charge, open-source collection of popular image and signal processing routines designed to accelerate application development, debugging, multi-threading and optimization on x86-class processor platforms. This library has three paths of optimized code: a reference code (C code) path, an SSE2 code path, and an SSE3 and F10H code path. One reason I found it interesting is that it is open-source; I can go through the code, understand it, and modify it as per my requirements. It also has a single source bundle for four operating systems (Linux®, Mac, Windows®, and Solaris operating systems).

 Framewave has a different implementation for each of the paths, and the Framewave build system takes care of combining them together and exposing a single signature. To achieve this, Framewave has a custom build system based on the SCons build tool (http://www.scons.org). The advantage of using SCons is that it uses the Python scripting language for its configuration files.

 Framewave has a single source bundle that is termed platform independent and is compiled using a single build system across all the platforms. The tool sets supported are GCC, MSVC, and Sun CC. This build system allows me to build 32/64-bit shared/static libraries with the ability to build either a debug or release version.

 This build system picks up the file and compiles it n times, n being the number of optimized paths, producing n object files. These n object files are linked together to the stub function which is exported as the actual function. To understand the build system more, refer to the architecture description here: http://framewave.sourceforge.net/DesignDoc/FramewaveBuildSystem-Architecture.htm

 Producing one DLL file and having only one signature exported for each function is a better option than having multiple DLL files for each of the optimized code paths and then loading the particular DLL depending on the processor. The advantage of having one single large DLL file for the library is that I end up adding only one file to the n files present in my project.
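 The idea of one exported signature that dispatches to the best available path can be sketched as follows. This is a Java analogy only, not Framewave's actual C++ code or build output: the class, the `CpuFeature` enum, and the three path objects are hypothetical names, and here the paths differ only by label, whereas a real library would use different instructions in each.

```java
import java.util.EnumSet;
import java.util.Set;

// Illustrative sketch of a single entry point that dispatches to the best
// available code path. All names are hypothetical, not Framewave's API.
class DispatchSketch {
    enum CpuFeature { SSE2, SSE3 }

    // Every optimized path shares the same signature.
    interface SumPath {
        String name();
        long sum(int[] data);
    }

    // Stand-in for a separately compiled path; only the label differs here.
    static SumPath path(final String name) {
        return new SumPath() {
            public String name() { return name; }
            public long sum(int[] data) {
                long total = 0;
                for (int v : data) total += v;
                return total;
            }
        };
    }

    static final SumPath REFERENCE_PATH = path("reference");
    static final SumPath SSE2_PATH = path("sse2");
    static final SumPath SSE3_PATH = path("sse3");

    // Picks the most capable path that the processor supports.
    static SumPath select(Set<CpuFeature> features) {
        if (features.contains(CpuFeature.SSE3)) return SSE3_PATH;
        if (features.contains(CpuFeature.SSE2)) return SSE2_PATH;
        return REFERENCE_PATH;
    }

    // The "stub": the only function a caller needs to know about.
    static long sum(int[] data, Set<CpuFeature> features) {
        return select(features).sum(data);
    }

    public static void main(String[] args) {
        Set<CpuFeature> detected = EnumSet.of(CpuFeature.SSE2);
        System.out.println(select(detected).name() + " -> "
                + sum(new int[]{1, 2, 3}, detected));
    }
}
```

 The caller only ever sees `sum`; which path runs is decided inside the library, which is the property the single-DLL approach preserves.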

 Overall this build system offers a unique way to bundle software that has different implementations for each processor.

 I'd like to hear what you think.  Is this build system useful in your own work?  What do you like about it, what do you dislike about it?

 Watch out for my next post on Using SCons for building the build system.



-------------------------




Edited: 09/10/2009 at 11:36 PM by jrameshbe


    Posted By: Ramesh J @ 09/07/2009 05:31 AM     AMD Libraries     Comments (0)  

September 4, 2009
  Evaluation of the Advanced Synchronization Facility (ASF)

In a previous entry on the Advanced Synchronization Facility (ASF), my colleague Michael pointed you to the current ASF specification proposal and showed some nifty use-cases for the feature. In this blog entry I'll try to make this a little more practical and show you how you can get some more hands-on experience with ASF.

 

Running ASF

ASF is an experimental feature which means that we do not yet have access to a "toy implementation" in silicon to play with. As with all other cases where the real thing is not available for testing (such as with early crash tests for cars) we resort to simulation to analyse important properties of ASF. Simulation also allows us to get a feeling for how ASF can be used by applications and operating system kernels, and might be integrated into compilers and language runtimes.

The approach of simulation is nothing new inside AMD and we have a rich set of simulation tools available for all kinds of purposes. Several aspects of ASF, however, made us use another external open-source simulator called PTLsim for our analysis. On the one hand, we want to have detailed AMD64 simulation capabilities to provide some performance predictions, get fine-grained thread interleaving right, and support simulation of operating system kernels. Furthermore, we would like to have an understanding of how ASF interacts with other features employed in today's processor cores. On the other hand, all of this should not have prohibitive overheads in terms of simulation speed and prototyping effort.

In addition to the technical requirements, we appreciate PTLsim's open-source license, which makes it easier to share our prototypical ASF simulator implementation with the public and in related projects (such as the EU-funded VELOX project, which Martin will cover in the next post in this series).

Although PTLsim certainly has an impressive list of features, several of these features come at the price of a somewhat large infrastructural requirement. To allow simulation of the entire operating system, PTLsim relies on Xen to provide the first-order hardware abstraction. Xen in turn, however, may demand an elaborate test machine setup.

Besides "just" adding the ASF functionality to PTLsim, I've spent a fair amount of effort adding supportive features, such as a true multi-core simulation model that improves on the previously existing SMT (simultaneous multi-threading) model. With the new multi-core model, each logical thread has its own set of resources (functional units and caches) and cores can modify the contents of other caches (for example by invalidating data in other caches by local updates). These interactions were not captured by the SMT model, as threads there shared functional units and caches. Other modifications to the upstream version of PTLsim mostly fix bugs in several subsystems of PTLsim. I regularly hang out on the ptlsim-devel mailing list :-).

Evaluating ASF

Our initial evaluation of ASF started with an (internal) predecessor of the currently available version; let's just call it ASF1. Although ASF1 is a more restricted form of the current ASF specification, its implementation and analysis have been published already. You can take a look at our EPHAM 2008 paper (or at my much more detailed thesis at the same location, if you're adventurous) to get an overview of how things behaved back in 2008. ASF1 basically has a more static phase layout; there is a strict separation between a 'declaration phase' and an 'atomic phase', in which you can add elements to your speculative working sets in the declaration phase only, and then modify them inside the subsequent atomic phase.

The static phase layout makes ASF1 unsuitable for applications that want to interleave modifications and working-set discovery within a single atomic region, unnecessarily restricting programmers' flexibility. Nevertheless we did find ASF1 extremely powerful and we showed an 80% performance improvement over a conventional lock-free implementation of a linked list, and 20% for accelerating a software transactional memory (STM) run-time (you can find more details in the documents referenced above).

ASF1 gives you the flexibility you need to make a lock-free linked-list implementation practical, actually even fairly straightforward. If you have some experience with lock-free linked lists, you'll know that the traditional CAS (compare-and-swap) is not easily usable for element removal from the list. In order to safely remove the element you have to change the preceding element's next-pointer (make it point to the deleted element's successor) and at the same time ensure that nobody concurrently adds an element just after the deleted element. With just CAS it is difficult to ensure that two memory locations do atomically change / keep their value. It is almost trivial to do this with ASF, even ASF1. Just have a look at Michael's DCAS example in the previous blog post.
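To make the CAS problem concrete, here is a small Java sketch that replays one unlucky interleaving of a remover and an inserter on a single thread, so the outcome is deterministic. It only illustrates the problem; it does not use or emulate ASF, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Replays the classic race between a CAS-based removal and a CAS-based
// insertion in a singly linked list. Names are hypothetical.
class CasRemovalSketch {
    static final class Node {
        final String value;
        final AtomicReference<Node> next;
        Node(String value, Node next) {
            this.value = value;
            this.next = new AtomicReference<>(next);
        }
    }

    // Collects the values reachable from head, in list order.
    static List<String> values(Node head) {
        List<String> out = new ArrayList<>();
        for (Node n = head; n != null; n = n.next.get()) out.add(n.value);
        return out;
    }

    // Starts with A -> B -> C, removes B and inserts X after B "concurrently".
    static List<String> replayRace() {
        Node c = new Node("C", null);
        Node b = new Node("B", c);
        Node a = new Node("A", b);

        // Remover, step 1: read B's successor, intending to unlink B.
        Node successorSeenByRemover = b.next.get();

        // Inserter: link X after B with a single, successful CAS.
        Node x = new Node("X", b.next.get());
        boolean inserted = b.next.compareAndSet(c, x);

        // Remover, step 2: this CAS only checks A.next, so it also succeeds.
        boolean removed = a.next.compareAndSet(b, successorSeenByRemover);

        if (!inserted || !removed) {
            throw new IllegalStateException("both CAS operations should succeed");
        }
        return values(a);
    }

    public static void main(String[] args) {
        System.out.println(replayRace()); // prints [A, C]: X has been lost
    }
}
```

Both CAS operations report success, yet X is unreachable afterwards, because the remover had no way to require that B's next-pointer stayed unchanged while it updated A's. Watching two locations atomically is exactly what a single CAS cannot do.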

Besides making the currently specified ASF implementation available for you to play with (see below), we are currently testing and extending the implementation thoroughly. For example, we are porting the TMunit testing application and looking at other larger applications. We also analyse various ways of implementing ASF, see how we can make use of the increased flexibility (over ASF1) for accelerating STMs better than with ASF1, and look at new lock-free use cases for ASF.

Finally, we constantly strive to improve ASF to fit the needs of programmers wanting to use it -- so again, if you have any comments on the current ASF specification proposal, leave us a comment or send email to ASF_Feedback@amd.com!

Hands on

In our downloads section you can find all the ingredients needed to brew your own magic ASF1 potion: the tweaked simulator implementing ASF1; the benchmarks in which we have used ASF1 to accelerate (and simplify!) a lock-free linked list implementation and an STM; and various explanatory documents, such as our EPHAM 2008 paper and my Diploma thesis. I'm currently cleaning up the implementation of the current ASF specification in PTLsim and it will become available there shortly, too.

I'm aware that setting up the toolchain might be daunting, largely due to the Xen requirement, and sometimes less than 100% stable thanks to the research nature of the upstream project. If you have any specific questions regarding simulator setup and usage, please leave me a comment.

About me

I joined AMD's OSRC group in Dresden in May 2007 as a student intern and started implementing the original ASF proposal (ASF1 above) in PTLsim. This implementation work laid the foundation for my Master's thesis (mostly in English, ignore the German front pages) which I wrote to finish my studies of Computer Science at TU Dresden and the paper mentioned above. I graduated in February 2008 and have continued my work on ASF as a full employee at the OSRC since then.

I'm interested in most computer science and engineering topics, but I'm currently focusing on:

  • Microarchitecture: Cores, caches and interconnects

  • Memory model semantics

  • Simulation

  • Parallel programming: Transactional memory, lock-free programming

  • Computer graphics

I'd like to hear what your thoughts are on ASF, and what uses you have for it.

--

Stephan Diestelhorst, Software Engineer 1
AMD Operating System Research Center, Dresden



-------------------------




Edited: 09/04/2009 at 05:35 AM by stephan.diestelhorst


    Posted By: Stephan Diestelhorst @ 09/04/2009 05:14 AM     AMD Operating System Research Center (OSRC)     Comments (0)  

August 17, 2009
  AMD Developer Inside Track - Taking Advantage of Multi-Core

I was fortunate to have the opportunity to host a panel discussion on application development and multi-core at CommunityOne West this year. It was a fantastic opportunity to meet and work with software experts who are in the trenches every day, working on parallel programming solutions. The basic question here was: "How do I get started in taking advantage of multi-core processors?" To answer this question, everybody involved brought unique experiences and perspectives to the table. In the above link, you can see a view of AMD's roadmap - from our perspective, you should take away that from the desktop to the server, multi-core will be king.  

Check out the AMD Developer Inside Track video for a snapshot of three of our partners from this panel and myself answering the question of how to start taking advantage of multi-core processors.

After these events I often get asked the same how-to-get-started question, but with more detail. Someone will say, "Okay, but let me tell you about this..." - so we talk it over. The questions I ask usually include at least some of the following:

  • Who do you work for?
  • What field are you in?
  • What are you trying to do?
  • Where is your code spending the most time now?
  • What are your primary bottlenecks (CPU, I/O, Memory)?
  • Do you need to scale up, or scale out?
  • Are you trying to reduce response time?
  • Are you trying to increase throughput?
  • Where and how big is your data?
  • What are your data dependencies?
  • Are you using a managed runtime environment?
  • What tools are you using?
  • Are you open to using other tools?
  • Will you be able to rewrite code?
  • Who have you talked to in researching your problem?
  • Do you have an n-tier infrastructure?
  • What hardware are you using right now?
  • What are your hardware upgrade plans?

These questions help decompose the problem and also provide a high-level view.  I find these discussions often touch on a mix of abstract principles combined with some specific practical advice. Below, I have some basic getting-started suggestions which I've mapped to the above questions, along with my perspectives on how they bear on the problem. For simplicity's sake, I've decided to map a question once to a single suggestion, though it may really have multiple applications.

Suggestion: Identify your problem domain.
Relevant questions: Who do you work for? What field are you in?
Perspectives: Telecommunications, financial services, manufacturing, scientific programming & HPC, web services, database, ERP/CRM, BI: for these and many other segments there is typically an ecosystem of software tools for building products and solutions, in many cases with significant experience in parallelism.

Suggestion: Don't be afraid to ask for advice -- talk to your community of experts.
Relevant questions: Who have you talked to in researching your problem?
Perspectives: Your community of experts can be found at conferences, in online forums, and at your tools vendors.

Suggestion: Clearly define your performance problem and the associated metrics.
Relevant questions: What are you trying to do? Do you need to scale up, or scale out? Are you trying to reduce response time? Are you trying to increase throughput?
Perspectives: This is critical in explaining the problem to yourself and others. This should be an easy to understand and simple statement that includes a baseline.

Suggestion: Analyze and identify primary bottlenecks.
Relevant questions: Where is your code spending the most time now? What are your primary bottlenecks (CPU, I/O, Memory)? Where and how big is your data? What are your data dependencies? Do you have an n-tier infrastructure?
Perspectives: If you don't know the answers to these questions then you need to do some analysis.  Diagram your infrastructure.  Use performance analysis tools found in your OS and from your tools vendors.  There are usually a few places in your code where most of the time is spent. Like any optimization effort, you'll analyze first, re-measure, and re-analyze throughout your parallelization effort.

Suggestion: Review alternate algorithms.
Relevant questions: Will you be able to rewrite code?
Perspectives: After some initial analysis you should take a high-level look at your overall algorithm. It may not be the best choice. It also may place constraints on how easily you can parallelize.

Suggestion: Review current tools and look for acceptable alternates.
Relevant questions: Are you using a managed runtime environment? What tools are you using? Are you open to using other tools? Will you be able to rewrite code?
Perspectives: This is often closely related to the problem domain and associated business requirements.  Maybe you can take a new Fortran compiler that supports parallelization with OpenMP, or maybe you need to focus on a new math library.

Suggestion: Review current hardware and evaluate new hardware.
Relevant questions: What hardware are you using right now? What are your hardware upgrade plans?
Perspectives: Along with looking at the architectural and tools aspects of your software, think about how much you could improve your basic situation with new hardware, be it one of more RAM, more or faster CPUs, or bigger or faster disks.

 In conclusion, I want to emphasize that, after carefully stating your problem and doing some initial analysis, you should try new implementations with caution.  Measure with appropriate precision, and make sure your measurements are repeatable.  Only then can you be sure that your work is worthwhile. Finally, take a look at AMD Developer Central for parallelization articles, our CPU analysis tool CodeAnalyst, and our performance libraries.
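 As one way to check that a measurement is repeatable, the sketch below times a task several times after a few unmeasured warm-up runs and reports the median together with the spread between the fastest and slowest run. It is a minimal illustration under assumed names (`RepeatableTiming`, `measure`, `median`, `spread`) with an arbitrary workload. It is not a replacement for a profiler such as CodeAnalyst or for a proper benchmarking harness.

```java
import java.util.Arrays;

// Minimal repeatability check for a timing measurement.
// All names and the workload are illustrative assumptions.
class RepeatableTiming {
    // Median of the samples; less sensitive to outliers than the mean.
    static long median(long[] samples) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return (sorted.length % 2 == 1)
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2;
    }

    // Difference between the slowest and the fastest sample.
    static long spread(long[] samples) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (long s : samples) {
            min = Math.min(min, s);
            max = Math.max(max, s);
        }
        return max - min;
    }

    // Runs the task unmeasured "warmups" times, then records "runs" timings in ns.
    static long[] measure(Runnable task, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) task.run();
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = System.nanoTime() - start;
        }
        return samples;
    }

    public static void main(String[] args) {
        final long[] sink = new long[1]; // keeps the result observable
        Runnable task = () -> {
            long total = 0;
            for (int i = 0; i < 1_000_000; i++) total += i % 7;
            sink[0] = total;
        };
        long[] samples = measure(task, 5, 11);
        System.out.println("median ns: " + median(samples)
                + ", spread ns: " + spread(samples));
    }
}
```

 A spread that is large relative to the median suggests the measurement is not yet repeatable, so any speedup smaller than that spread should be treated with suspicion.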

Be sure to check out the first AMD Developer Inside Track video featuring three of AMD's software tools partners giving their perspectives on taking advantage of multi-cores.

-Tracy Carver, Software Developer and Evangelist, AMD



-------------------------




Edited: 08/18/2009 at 01:05 PM by devcentral


    Posted By: AMD DeveloperCentral @ 08/17/2009 02:17 PM     Inside Dev Central     Comments (0)  

  Introducing the AMD Developer Inside Track - a New Monthly Video Series

I'm a member of AMD's software division (and yes, you read it correctly - I said software).  It turns out that a lot of people are surprised to hear that AMD has a software division.  I can't count the number of times that we've been at tradeshows showing off the AMD CodeAnalyst Performance Analyzer or our Performance Libraries and people have wondered why the heck AMD was at a software developer conference.  The answer is simple; you can't run the hardware without software.  We have a significant investment in software within AMD and with our software partners.  I've vowed to do my part to get you behind-the-scenes, one-on-one time with AMD software developers and our software partners to get the scoop on what AMD is doing that would matter to software developers. 

The first installment of the AMD Developer Inside Track is available now.  This one features a panel of our software developer tools partners from Allinea, Pervasive and Rogue Wave talking about taking advantage of multi-core processing.  I was able to pull them aside after the CommunityOne West 2009 Multicore Panel sponsored by AMD.  Check out the video, Tracy Carver's blog, and the slides that were presented. 

Next month we will be talking with Michael Houston about OpenCL.  And we have a multitude of topics planned for the rest of the year.  If you have a topic in mind, let us know by making a comment on this blog post, or on our forums.

 

-Sharon Troia, Sr. Developer Relations Engineer

 

P.S.  If you experience any viewing problems, please let me know.  We will be adding some different formats, lower resolution versions to download, as well as the transcripts over the next two weeks.



-------------------------




Edited: 08/17/2009 at 06:14 PM by devcentral


    Posted By: AMD DeveloperCentral @ 08/17/2009 01:58 PM     Inside Dev Central     Comments (2)  

August 14, 2009
  Java Generics Performance Puzzler Part 2
 
In a previous blog, we looked at a microbenchmark where we were pulling an item from a collections class like an ArrayList and eventually putting it in another collection.  And we saw that there could be a significant performance difference between the following two versions:
 (Note: In the following examples, we show only the parts where we access the ArrayLists and leave out any subsidiary logic.)
ArrayList<MyClass> aListSrc, aListDest1, aListDest2;

Version 1
while (idxSrc < NUMOBJS) {
    aListDest1.add(idxDest, aListSrc.get(idxSrc++));
}
and
Version 2
while (idxSrc < NUMOBJS) {
    MyClass myc = aListSrc.get(idxSrc++);
    aListDest1.add(idxDest, myc);
}
with version 2 being slower because it requires a castcheck to check that the Object returned by aListSrc.get could be cast to a MyClass. The performance impact was because the castcheck required touching an object that did not need to be touched in version 1.
In the microbenchmark code above, we navigated thru the ArrayList by incrementing an integer index to the ArrayList.get method.  What if we had used an explicit iterator or used the implied iterator in Java’s  for-each statement?
First let’s look at the least cluttered implementation, which uses the for-each loop:
Version 3
for (MyClass myc : aListSrc) {
   aListDest1.add(myc);
   // ...
}
and remembering that the for-each loop is syntactic sugar for the following:
Version 3b
for (Iterator<MyClass> iter = aListSrc.iterator(); iter.hasNext(); ) {
    MyClass myc = iter.next();
    //body of loop
    aListDest1.add(myc);
}
we can see that, unfortunately, this suffers from the same castcheck as Version 2. And, once again, we cannot get around the castcheck by making the for-each variable an Object, because the compiler wisely will not let you add an Object to an ArrayList<MyClass>:
Version 4 (will not compile)
for (Object myc : aListSrc) {
   aListDest1.add(myc);        // <-- error here
}
Looking at the expanded code for the for-each loop, we see that we can still both use an explicit iterator and avoid the castcheck by getting rid of the temporary variable from Version 3b and ending up with something like the following:
Version 5
for (Iterator<MyClass> iter = aListSrc.iterator(); iter.hasNext(); ) {
    aListDest1.add(iter.next());
}
Like Version 1, this passes all the compile-time checks. And at run time, because of type erasure, iter.next() returns an Object and aListDest1.add consumes an Object.
But ideally we would want to be able to use the less cluttered for-each notation and still get rid of the castcheck.  Can that be done?  Brian Goetz's excellent article Going Wild with Generics talks about using generic methods to force the compiler to use type inference to solve a problem with wildcards in generics.  To quote his article "The Java compiler doesn't perform type inference in very many places, but one place it does is in inferring the type parameter for generic methods".  I wanted to see if the type inference from generic methods would solve our problem here and sure enough it does.
If we code up version 6 as a generic helper method
Version 6
private <V> void splitHelper(ArrayList<V> src, ArrayList<V> dest1, ArrayList<V> dest2) {
    for (V elem : src) {
       dest1.add(elem);
       // ...
    }
}
and we can then call the helper with something like
    splitHelper(aListSrc, aListDest1, aListDest2);
If we run version 6 thru javac and look at the generated bytecodes, we see that the checkcast bytecode that we saw in version 3 is not there, leading to better performance.
So we have found a for-each based solution that has gotten rid of the castcheck, but do others find this behavior surprising?  The difference between Versions 3 and 6 seems very minor and it seems that if the compiler could eliminate the castcheck in Version 6, it could also do so in Version 3.
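One way to observe where the compiler places the cast, without reading bytecode, is to deliberately pollute a list through a raw type and see which loop styles throw a ClassCastException. The sketch below does that for the styles of Versions 3, 5 and 6. The class and method names are hypothetical, and the raw-type trick is used only as a probe; it is not something to do in real code.

```java
import java.util.ArrayList;
import java.util.Iterator;

// Probe showing which loop styles contain a cast to MyClass.
// A String is smuggled into an ArrayList<MyClass> via a raw type;
// only code that actually casts will notice. Names are hypothetical.
class CheckcastProbe {
    static class MyClass { }

    @SuppressWarnings({"unchecked", "rawtypes"})
    static ArrayList<MyClass> pollutedList() {
        ArrayList raw = new ArrayList();
        raw.add("not a MyClass");
        return raw;
    }

    // Version 3 style: the typed loop variable forces a cast per element.
    static boolean typedForEachThrows() {
        ArrayList<MyClass> src = pollutedList();
        ArrayList<MyClass> dest = new ArrayList<>();
        try {
            for (MyClass myc : src) {
                dest.add(myc);
            }
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    // Version 5 style: the element goes straight from next() into add().
    static boolean directIteratorThrows() {
        ArrayList<MyClass> src = pollutedList();
        ArrayList<MyClass> dest = new ArrayList<>();
        try {
            for (Iterator<MyClass> iter = src.iterator(); iter.hasNext(); ) {
                dest.add(iter.next());
            }
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    // Version 6 style: for-each inside a generic helper method.
    static <V> void copy(ArrayList<V> src, ArrayList<V> dest) {
        for (V elem : src) {
            dest.add(elem);
        }
    }

    static boolean genericHelperThrows() {
        ArrayList<MyClass> src = pollutedList();
        ArrayList<MyClass> dest = new ArrayList<>();
        try {
            copy(src, dest);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("typed for-each throws:  " + typedForEachThrows());
        System.out.println("direct iterator throws: " + directIteratorThrows());
        System.out.println("generic helper throws:  " + genericHelperThrows());
    }
}
```

Only the typed for-each loop throws, which is consistent with the cast being present in Version 3 and absent in Versions 5 and 6. Inside the generic helper, V erases to Object, so there is no narrower type for the compiler to cast to.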
 


-------------------------

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.



Edited: 08/19/2009 at 07:11 PM by tdeneau


    Posted By: Tom Deneau @ 08/14/2009 03:32 PM     AMD Java Labs     Comments (4)  

August 5, 2009
  Windows® 7 and AMD - Technical Collaboration

AMD is a close collaborator with Microsoft.  We work together to help ensure the operating system runs smoothly and efficiently on AMD platforms.  Here are some of the key technology collaborations for Windows® 7:

  • Power Management: AMD worked closely with Microsoft to support a new AMD product-specific power management driver in Windows 7. This in-box driver supports older processors as well as the latest generation AMD Opteron™ processor and AMD Phenom™ II processor. In addition to the power management driver, AMD collaborated with Microsoft to fine tune default power policy parameters that control power state transitions to help optimize for power and performance. And since this driver is "in-box", there's no need to download.
  • Virtualization: AMD provided code to Microsoft and worked with the Hypervisor teams to help ensure that Hyper-V R2 and Windows Virtual PC in Windows 7 utilize Rapid Virtualization Indexing (aka nested paging tables) for improved performance of VM guests. All of the third-generation AMD Opteron processors, AMD Phenom processors, and AMD Phenom II processors support Rapid Virtualization Indexing. In addition, most of AMD's shipping processors (other than AMD Sempron™ processors) include AMD-V™ technology and thus support Windows XP Mode for Windows 7.
  • Stability & Performance: Current and upcoming reference platforms containing multi-core processors from AMD were loaned to Microsoft's labs to vet out potential incompatibilities with Windows 7 and Windows Server 2008 R2.
  • Graphics: AMD has been working hard to support DirectX® 11, so there are plans to make native DirectX 11 hardware from AMD in its ATI Radeon™ GPUs available when Windows 7 is released.
  • GPU Compute: DirectX 11 Compute Shader (CS) is a new API in Windows 7 that helps enable rich applications through the use of compute on the GPU (General Purpose GPU or GPGPU). Rich experiences such as drag-and-drop media transcoding, physics, and AI are a few areas that DirectX 11 CS can help enable.

 

For more information on AMD and Microsoft technical collaboration visit the Windows Zone on developer.amd.com.  For more information on what AMD is doing overall with Microsoft for end users, check out the Microsoft & AMD corporate site, or see the AMD video on Microsoft's Ready. Set. 7 site.

 

-Robin Maffeo, Microsoft Alliance Manager



-------------------------




Edited: 08/06/2009 at 08:09 PM by devcentral


    Posted By: AMD DeveloperCentral @ 08/05/2009 08:05 PM     Inside Dev Central     Comments (1)  

  ATI Stream SDK and OpenCL(TM)

It's been a while since we've had an update on the ATI Stream Developer Blog... Over the past year since the last blog posting, a lot has happened. ATI Stream SDK v1.x saw two releases (v1.3-beta at the end of last year and v1.4-beta at the beginning of this year). With each of those releases of the SDK, and of Brook+ in particular, we focused on stability and on adding more exciting features.

We've even launched an ATI Stream Developer Showcase site where quite a few of your fellow developers have submitted their ATI Stream applications to show the developer community (you) the exciting things they have done with the ATI Stream SDK. ATI Stream Power Toys came into existence, and we are planning to continue to grow it as we come up with fun and useful tools for you that just can't wait for the next ATI Stream SDK release. And ACML-GPU finally made it out of alpha/beta testing and is now released on AMD Developer Central. All truly exciting stuff!

But what has been even more anticipated since the middle of last year is OpenCL(TM). If you don't know much about OpenCL and how it meshes with the rest of GPGPU history, take a look here. It was a tremendous amount of work that kept our engineering team up late for many nights... but, finally, we were able to release a beta version of our ATI Stream SDK v2.0 with OpenCL x86 CPU support today. It's part of our complete OpenCL development platform and is designed to help accelerate your applications with OpenCL today on multi-core CPUs, and to help you take advantage of the added speed of GPUs later on this year. If you are interested in giving it a try, visit our ATI Stream SDK v2.0 Beta Program page to download the beta release.

Benedict Gaster, our OpenCL compiler architect here at AMD, has written an introductory tutorial for OpenCL to help developers get started and get comfortable programming in OpenCL. You can find his OpenCL tutorial article here.

Also take a look at Patricia Harrell's blog, OpenCL Changes the Game. Patricia is the Director of Stream Computing here at AMD.

Stay tuned for even more information about ATI Stream SDK developments.



-------------------------




Edited: 08/06/2009 at 08:19 PM by michael.chu@amd.com


    Posted By: Michael Chu @ 08/05/2009 01:58 AM     ATI Stream     Comments (3)  

July 24, 2009
  Performance Profiling Without the Overhead

Here at AMD, we know that in order to improve program performance, you have to be able to measure it. AMD's Lightweight Profiling feature (LWP) is designed to make performance measurement even easier and with negligible overhead. In this post, I'll give you an overview of LWP and tell you why we think it's an exciting next step in the area of performance tuning.

First, a little history. Late in 2007, AMD announced Lightweight Profiling as a proposed extension to the AMD64 architecture that would allow an application to gather performance statistics about itself with low overhead. We posted the preliminary specification and asked for feedback from the developer community. Much to our delight, many of you responded with comments, criticisms, and suggestions on the proposal. We've read all of your feedback, and last week we posted the current version of the LWP specification. The announcement and the link to the spec are here. Thanks to all of you who helped us out.

What came before...

It's important to be able to measure the details of a program's performance in order to find ways to speed it up. Until now, there have been just two ways to do this. The first is via instrumentation, i.e., adding code to the program to watch the clock, or the cycle counter, or just to count the number of times an instruction or loop is executed. Instrumentation can be added by the programmer or by a compiler. Unfortunately, it seriously perturbs the application, and the instrumented code usually doesn't have the same characteristics as the original code, especially when dealing with the data and instruction caches. Also, instrumentation can't observe the hardware caches, so it can't gather data about cache behavior.

The second traditional method of monitoring performance is to use the hardware performance counters. These count hardware events and generate an interrupt after a programmed number of events have happened. The counters can report on events that are too hard to instrument (like counting each x86 instruction) or are not visible to software (like cache misses). These counters are used by the AMD CodeAnalyst Performance Analyzer and provide deep insight into application and system performance. However, each time a data sample is gathered, the processor must take an interrupt to a kernel-mode driver, and that takes hundreds or thousands of cycles. The driver, by simply executing, changes the contents of the data cache and the instruction cache and may perturb the application's performance. The counters can only be configured, started, and stopped from kernel mode, so an application must call a driver or the operating system to control them. Finally, some systems do not context-switch the performance counters when changing threads or processes, and on those systems, performance monitoring can only be done globally by a single user at a time.

Introducing LWP

After reading about current technology, you might think that an ideal performance monitor should:

  • Operate entirely in user mode
  • Cause little or no perturbation of the application
  • Be controlled separately for each thread
  • Have low overhead to allow for higher sampling rates

And that describes LWP!

Lightweight Profiling adds a set of user-controlled counters to the AMD64 architecture. They can monitor multiple events simultaneously. An application thread starts profiling by providing the address of an LWP control block (LWPCB) as the operand to the new LLWPCB instruction. The contents of the LWPCB specify which events to count and how often to count them. The LWPCB also points to a ring buffer in the application's memory into which the hardware will store event records. That's it.

Once started, LWP counts the specified events. When an event counter underflows, it stores an event record at the head of the ring buffer and resets the counter. (If requested, LWP randomizes the bottom bits of the new counter value to prevent "beating" against constant length loops.) LWP stores the record without interrupting the flow of the program, so the only perturbation to the program's performance is writing the record (usually affecting only a single data cache line) and a few cycles to perform the write. The record contains the event type, the address of the instruction that caused the underflow, and other information about the event. All event types share one ring buffer and can be sorted out by the event type field in the record.

Of course, eventually the buffer will fill up. What then? Well, a program has two options for emptying the ring buffer. First, it can simply poll the buffer and remove event records from the tail of the ring. When software rewrites the tail pointer, the LWP hardware knows it can reuse the newly emptied region of the ring buffer. Since the buffer is in user memory, the program can even share the memory with another process, and that second process can be responsible for draining the buffer. Second, the application can specify that it wants LWP to generate an interrupt when the ring buffer is filled past a certain threshold. For instance, it can configure a buffer to hold 10,000 event records and tell LWP to interrupt whenever there are more than 9,000 records in the buffer. The interrupt does indeed perturb the program, but it does so 1/9000th as often as the traditional performance counters would. Better still, since the buffer is in user memory, the application can catch the interrupt and do whatever it wants with the data. It can store it to disk for later analysis, or it can process it immediately and even try to fix performance problems as they are happening.
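To make the counter, ring buffer, and threshold mechanics concrete, here is a small software model written in Java. It is only a sketch under stated assumptions: it does not use LWP or any real instruction, every name in it (RingBufferModel, onEvent, thresholdExceeded, drain) is hypothetical, a record is reduced to a single number standing in for an instruction address, and dropping records when the buffer is full is an assumption of the model, not a statement about the hardware.

```java
final class RingBufferModel {
    private final long[] records;  // event records; here just the sampled "address"
    private final int threshold;   // signal once more than this many records are pending
    private final long interval;   // number of events between samples
    private long counter;          // counts down once per event
    private int head;              // next slot the modeled hardware writes
    private int tail;              // next slot software reads
    private int pending;           // records written but not yet drained

    RingBufferModel(int capacity, int threshold, long interval) {
        this.records = new long[capacity];
        this.threshold = threshold;
        this.interval = interval;
        this.counter = interval - 1;
    }

    // Called once per simulated event (for example, one retired instruction).
    void onEvent(long address) {
        counter--;
        if (counter >= 0) {
            return;                        // no underflow yet, nothing is stored
        }
        counter = interval - 1;            // reset the counter after the underflow
        if (pending == records.length) {
            return;                        // model assumption: a full buffer drops the record
        }
        records[head] = address;           // store the record at the head of the ring
        head = (head + 1) % records.length;
        pending++;
    }

    // Stands in for the threshold interrupt described in the text.
    boolean thresholdExceeded() {
        return pending > threshold;
    }

    // Software side: remove every pending record and advance the tail.
    long[] drain() {
        long[] out = new long[pending];
        for (int i = 0; i < out.length; i++) {
            out[i] = records[tail];
            tail = (tail + 1) % records.length;
        }
        pending = 0;
        return out;
    }

    public static void main(String[] args) {
        RingBufferModel lwp = new RingBufferModel(8, 6, 100);
        for (long event = 1; event <= 700; event++) {
            lwp.onEvent(event);            // the event number doubles as a fake address
        }
        if (lwp.thresholdExceeded()) {
            System.out.println(java.util.Arrays.toString(lwp.drain()));
        }
    }
}
```

With a sampling interval of 100 and 700 events, the model stores 7 records, which is past the threshold of 6, so the program drains the buffer once instead of being notified 7 times.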

In addition, LWP is a per-thread feature. Each thread on the system can be monitoring different events at different rates without interference. If a thread is not using LWP, there is no impact on its performance even if other threads have LWP active.

Some LWP Details

The LWP events are a small subset of the events available in the traditional performance counters. They include Instructions Retired, Branches Retired, and DCache Misses. The Branches Retired event can be filtered by whether the branch is direct or indirect, conditional or unconditional, or other criteria. It captures the target address of the branch, a useful value when looking at indirect branches. The DCache miss event can be filtered by cache level to capture only "expensive" cache misses.

One exciting feature of LWP is the ability to insert events into the ring buffer under program control. There are two new instructions to do this:

  • LWPINS inserts a record into the ring buffer containing data taken from the arguments to the instruction. A program can use LWPINS to insert a marker to indicate an important event, such as loading or unloading a shared library, that influences the way addresses should be interpreted in subsequent event records.
  • LWPVAL uses an event counter and decrements the counter each time it is executed, much the way the hardware event counters work. When the counter underflows, it inserts a record into the ring buffer containing data from its arguments. A program uses LWPVAL to implement a technique called value profiling. For instance, it can profile the divisor of a commonly executed DIV instruction and if the data show that the divisor is frequently the same number, it can rewrite the instruction to test for that value and execute an optimized code sequence. Similarly, it can profile the target of a hot indirect branch and generate better code if one way of the branch is dominant.
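The value-profiling idea in the LWPVAL bullet can be sketched in software. The Java below is an analogy only: the samples would come from LWPVAL records in a real system, whereas here they are a plain array, and the class name, the method names, the 75% dominance cut-off, and the choice of 8 as the hot divisor are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

final class ValueProfileSketch {
    // Returns the most frequent sampled value if it makes up at least
    // minShare of all samples, otherwise null (no dominant value).
    static Long dominantValue(long[] samples, double minShare) {
        Map<Long, Integer> histogram = new HashMap<Long, Integer>();
        Long best = null;
        int bestCount = 0;
        for (long s : samples) {
            Integer old = histogram.get(s);
            int n = (old == null) ? 1 : old + 1;
            histogram.put(s, n);
            if (n > bestCount) {
                bestCount = n;
                best = s;
            }
        }
        if (samples.length == 0 || bestCount < minShare * samples.length) {
            return null;
        }
        return best;
    }

    // What a "rewritten" division might look like once 8 is known to be hot:
    // test for the common divisor first and fall back to the general case.
    static long divideGuarded(long dividend, long divisor) {
        if (divisor == 8 && dividend >= 0) {
            return dividend >> 3;          // cheap path for the dominant value
        }
        return dividend / divisor;         // general path
    }

    public static void main(String[] args) {
        long[] sampledDivisors = {8, 8, 8, 3, 8};
        System.out.println(dominantValue(sampledDivisors, 0.75));
        System.out.println(divideGuarded(100, 8));
    }
}
```

The guard keeps the general division as a fallback, so the specialised path never changes results; it only changes how the common case is computed.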

Who will use LWP?

LWP can be used in many different application environments. These include:

  • Managed Runtime Environment: Managed Runtimes (MRTEs) are programming environments such as Java and the Microsoft® .NET Framework. These environments have the ability to generate AMD x86 or x64 code for routines coded in a high level managed language (such as Java or C#), and they can do that on the fly as a program is running. The MRTE can enable LWP and periodically look for performance problems. If (when!) it finds them, it can generate better code for the hot spots and improve the program's overall performance. LWP is lightweight enough that it can run continuously.
  • Dynamic Optimizer: A Dynamic Optimizer is a program that monitors an application and attempts to improve its performance by modifying it as it runs. In this case, the target application is compiled to native code from a traditional language like C or C++. The Dynamic Optimizer can gather performance data without affecting the flow of control in the application.
  • Compiler Feedback: Most modern compilers have an option to build an instrumented program which the developer runs to gather information on the program's performance. Unfortunately, the added instrumentation (and the fact that optimization levels are often cranked down in a feedback compilation) perturbs the program so much that what's being measured is substantially different from the "real" program. With LWP, the compiler can gather statistics on the program execution without changes, and it can insert LWPVAL instructions to profile interesting areas without adding a large block of instrumentation code and without clobbering any registers. If the application runs without turning on LWP, the LWPVAL instructions act as NOPs and only take a few cycles.

Conclusion

We're very excited about Lightweight Profiling, and I hope this note has piqued your interest. You can read the full specification at the LWP page on Developer Central. There's also an email link you can use to send us your comments and suggestions.

P.S.

My colleagues suggested that I make this more "bloggy" by adding references to "traditional performance values" and "herbal performance enhancers". This postscript is dedicated to them.

Anton Chernoff is a Senior Fellow and architect at AMD. His postings are his own opinions and may not represent AMD's positions, strategies or opinions.



-------------------------




Edited: 07/29/2009 at 09:12 AM by anton.chernoff


    Posted By: Anton Chernoff @ 07/24/2009 11:04 AM     Inside Dev Central     Comments (1)  

July 23, 2009
  Final Words

I have always been a little unhappy with the decision to overload the use of the 'final' keyword to enable local variables to be made available to methods in inner classes.
 
Let's recap. Here is a method which launches a thread which prints integers 0 thru 9.
 
public void launch(){
   new Thread(new Runnable(){
     public void run(){
        for (int i=0; i<10; i++){
           System.out.println("i="+i);
        }
     }
   }).start();
}
 
We decide to refactor this method to take two arguments (launch(int min, int max)) so that we can control the start and end values of the count.  We might be tempted to try
 
// will not compile
public void launch(int min, int max){
   new Thread(new Runnable(){
     public void run(){
        for (int i=min; i<max; i++){
           System.out.println("i="+i);
        }
     }
   }).start();
}
 
But this will fail to compile.
 
The problem is that the parameters min and max are not in the scope of the run() method in the anonymous inner class implementation of Runnable(). In fact, because the run() method is being executed in another thread, it is likely that the original call to launch() has returned before the run() method has even started, so the variables that were on the stack when we created our Runnable() are long gone. To solve this problem Java needs a way to signal that a variable should be captured into the scope of any anonymous inner class that wants to use it. If Annotations had been around, I suspect that an Annotation would have worked well for this. Unfortunately, this 'requirement' predated Annotations, and it was decided to 'overload' the use of the final keyword to convey this intent.
 
public void launch(final int min, final int max){
   new Thread(new Runnable(){
     public void run(){
        for (int i=min; i<max; i++){
           System.out.println("i="+i);
        }
     }
   }).start();
}
 
The above method will now compile and will function as suggested.
 
But 'final' seems wrong here.  I understand that there is a reluctance to add new key/reserved words to a language (just look at all the trouble that enum and assert created!), but final seems to be a weird choice.  I think it breaks the law of 'least astonishment'.
 
Let's refactor our method one more time.  This time we will launch 10 threads per count value and we will print the 'number' of each thread. Here is our first attempt
 
// Won't compile
public void launch(final int min, final int max){
   for (int c=0; c<10; c++){
     new Thread(new Runnable(){
       public void run(){
          for (int i=min; i<max; i++){
            System.out.println("Thread "+c+" i="+i);
          }
       }
     }).start();
   }
}
 
Again our compilation issue is that the 'c' variable is not available in the run method of the anonymous inner class.
We need c to be a final variable.  Let's make it final
 
// Won't compile for a different reason ;)
public void launch(final int min, final int max){
   for (final int c=0; c<10; c++){
     new Thread(new Runnable(){
       public void run(){
          for (int i=min; i<max; i++){
            System.out.println("Thread "+c+" i="+i);
          }
       }
     }).start();
   }
}
 
Doh! Of course c can't be final; it is a loop variable. If we mark it as 'final' we are applying the traditional (you can't mutate this) meaning of final, yet we need to mark it as final for the variable to be made available to the inner class. We are forced to do 'weird things' to get around this, like create a local final value for the purpose of capturing the value for the inner class.
 
public void launch(final int min, final int max){
   for (int c=0; c<10; c++){
     final int fc = c;   // fc is only used to expose a final value to the innerclass
     new Thread(new Runnable(){
       public void run(){
          for (int i=min; i<max; i++){
            System.out.println("Thread "+fc+" i="+i);
          }
       }
     }).start();
   }
}
 
Yuck!
 
However you might be even more surprised by this solution ;)
 
public void launch(final int min, final int max){
   for (final int c: new int[]{0,1,2,3,4,5,6,7,8,9}){
     new Thread(new Runnable(){
       public void run(){
          for (int i=min; i<max; i++){
            System.out.println("Thread "+c+" i="+i);
          }
       }
     }).start();
   }
}
 
What?
 
So it looks like we can declare a loop variable to be final provided we are using the new for-each form. The code appears happy to mutate it (so it's not really final, is it?) and also to make it available to appropriate inner classes.

How bizarre.
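A possible explanation, offered as a sketch and not as a definitive account: the Java Language Specification defines the enhanced for statement over an array in terms of an ordinary for loop that declares the loop variable inside the loop body. Each iteration therefore gets a brand new variable that is assigned exactly once, so nothing is mutated, and the for-each version is essentially the compiler writing the 'fc' workaround for us. The hand-expanded form below illustrates this. The class and method names are hypothetical, and the tasks are run one after another instead of on separate threads so that the output order is deterministic.

```java
import java.util.ArrayList;
import java.util.List;

final class ForEachFinalDemo {
    // Hand-written equivalent of:  for (final int c : values) { ... }
    static List<Runnable> capture(int[] values, final List<String> out) {
        List<Runnable> tasks = new ArrayList<Runnable>();
        for (int idx = 0; idx < values.length; idx++) {
            final int c = values[idx];     // a fresh final variable on every iteration
            tasks.add(new Runnable() {
                public void run() {
                    out.add("Thread " + c);  // the inner class sees this iteration's c
                }
            });
        }
        return tasks;
    }

    public static void main(String[] args) {
        List<String> out = new ArrayList<String>();
        for (Runnable task : capture(new int[]{0, 1, 2}, out)) {
            task.run();                    // run sequentially to keep the order fixed
        }
        System.out.println(out);
    }
}
```

Seen this way, the only thing that changes from one iteration to the next is the hidden index (idx here), which is never exposed to the inner class.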
 
Next time we will look at how these final variables actually get captured/cloned into the inner classes. One might be surprised by what is happening at the bytecode level to allow these 'final' values to be made available to inner classes.



-------------------------




Edited: 07/23/2009 at 06:32 PM by gfrostamd


    Posted By: Gary Frost @ 07/23/2009 12:02 PM     AMD Java Labs     Comments (6)  

July 21, 2009
  HT Assist - what is it?

Scalable Performance with HyperTransport™ Technology HT Assist:

With the release of the Six-Core AMD Opteron™ processor, formerly code-named "Istanbul", an important new hardware feature called HT Assist has been included that helps increase performance on 4-socket and 8-socket AMD Opteron™ 8400 Series processor-based systems.

As you scale the number of sockets and, thus, processors in a system, maintaining data coherency becomes a more complex and important issue.  On a single-socket system with a multi-core processor your single processor just has to maintain cache coherency between the processor cores; there are no other sockets or processors to maintain coherency or communication with.

In a multi-socket system, each processor has to communicate with each other processor to make sure it is working on the latest data, or cache line, to maintain coherency (and thus program correctness).  This communication is done over HyperTransport™ technology links between the processor sockets in the case of systems based on HyperTransport technology.  With a broadcast coherence protocol, the latency of a memory access is always the longer of 2 paths: the time it takes to return data from DRAM and the time it takes to probe all the caches in the system.  Only when the processor has received the data and all probe responses can it actually process the required transaction.  With a 4-socket or 8-socket system (24 or 48 total processor cores with Six-Core AMD Opteron processor-based systems) the HyperTransport technology links between processors can increasingly be loaded with a significant amount of latency-sensitive cache probe requests checking for data coherency.

In a 4-socket system, one cache line coherency check can generate 10 or more messages over the 4 HyperTransport links connecting the 4 processors together.  These transactions include all the probe requests, probe responses, data requests, and data responses. With HT Assist, though, this same check may generate only 2-3 messages.  This significantly reduces the latency of the coherency check and the number of transactions over the HyperTransport links.

HT Assist, or the Probe Filter as it is sometimes called, works by using part of the processor's L3 cache as a directory cache.  This directory cache tracks all cache lines cached in the system.  Instead of generating numerous cache probes when checking a cache line the processor does a Probe Filter Lookup.  This helps lower latency for accesses to local DRAM because there is no need to wait for probe responses when accessing local data.  This also means there is less queuing delay due to the lower HyperTransport technology traffic.  With significantly reduced probe traffic it effectively also increases system bandwidth performance.  It also should be noted that the directory cache uses 1MB of the 6MB L3 cache in the case of the Six-Core AMD Opteron processor.  As well, HT Assist is only enabled on 4-socket and 8-socket systems, where the performance benefits largely outweigh the small decrease in available L3 data cache.  On the other hand, HT Assist is not enabled on 2-socket systems where there is much less cache probe traffic and the full L3 cache is utilized.
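The latency argument can be written down as a toy model. The Java below is a deliberately simplified sketch, not a description of the actual hardware: the class and method names are hypothetical, the nanosecond figures are made up for illustration, and the model assumes that a line the directory reports as uncached elsewhere needs no probes at all, while a line cached elsewhere needs one directed probe.

```java
final class CoherenceLatencyModel {
    // Broadcast protocol: the access completes only when the DRAM data and
    // every probe response have arrived, so latency is the longer path.
    static int broadcastLatency(int dramNs, int probeAllNs) {
        return Math.max(dramNs, probeAllNs);
    }

    // Probe-filter protocol (model assumption): if the directory says no other
    // cache holds the line, only the DRAM path matters; otherwise wait for
    // DRAM and for a single directed probe, whichever takes longer.
    static int filteredLatency(int dramNs, int directedProbeNs, boolean cachedElsewhere) {
        return cachedElsewhere ? Math.max(dramNs, directedProbeNs) : dramNs;
    }

    public static void main(String[] args) {
        int dramNs = 60;            // hypothetical local DRAM access time
        int probeAllNs = 110;       // hypothetical time to probe every cache
        int directedProbeNs = 80;   // hypothetical time to probe one owner
        System.out.println("broadcast:          " + broadcastLatency(dramNs, probeAllNs));
        System.out.println("filtered, uncached: " + filteredLatency(dramNs, directedProbeNs, false));
        System.out.println("filtered, cached:   " + filteredLatency(dramNs, directedProbeNs, true));
    }
}
```

In this model the benefit shows up whenever probing all caches takes longer than the DRAM access, which is the situation the text describes for larger socket counts.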

We've measured the difference HT Assist makes on Six-Core AMD Opteron processors, and the results are nothing short of stunning.  On the same 4-socket system, we measured 42GB/s of memory bandwidth with the STREAM benchmark with HT Assist enabled, while only getting 25.5GB/s when HT Assist is disabled.* For 4-socket and 8-socket Six-Core AMD Opteron processor-based systems, this can translate into a significant performance uplift for applications that depend on cache performance, memory bandwidth, and system scalability.

Applications that naturally will get a benefit from HT Assist include Database, Virtualization, and High Performance Computing (HPC).  And there is no need for software developers to change their code, just enjoy the extra performance from AMD!

-Justin Boggs

ISV Developer Relations

 

* 42GB/s using 4 x Six-Core AMD Opteron™ processors ("Istanbul") Model 8435 in Tyan Thunder n4250QE (S4985-E) motherboard, 32GB (16x2GB DDR2-800) memory, SuSE Linux® Enterprise Server 10 SP1 64-bit with HT Assist enabled vs. 25.5GB/s with HT Assist disabled.

 



-------------------------




    Posted By: AMD DeveloperCentral @ 07/21/2009 05:01 PM     AMD “Istanbul” (Family 10h) Processor Software Visible Features     Comments (2)  

July 17, 2009
  Just released: Sun Studio 12 Update 1, featuring optimizations for AMD Opteron™ processors

Featuring guest blogger from Sun Microsystems, Darryl Gove.

The release of a new version of Sun Studio is always an exciting moment for Sun Studio enthusiasts. Sun Studio 12 came out pretty much two years ago, and a lot has changed in that time.

One particular trend has been that multicore processors have become mainstream. One way of illustrating that is to look at the number of threads per chip for all the submitted SPEC® CPU2006 integer speed results*. The following chart shows the cumulative number of submitted results since the benchmark came out in 2006 until the middle of June 2009 broken down by the number of threads that the chip could support.

Cumulative number of CPU2006 Integer Speed results submitted

 

Two years ago, when Sun Studio 12 came out, chips that could support two threads were starting to become common. Now we're looking at that being a minimal thread count, and we're starting to see the ramp up of chips that can support more than 4 threads - the latest AMD processors support six threads per chip. In tandem with the growth in thread count, we're seeing much more interest in developing applications that can use this core count. Sun is fortunate that with Solaris and Sun Studio, we have a very comprehensive, and long standing, investment in multiprocessor technology: from virtualisation, through Zones, to scalability to huge core counts.

Sun Studio has always been on the leading edge of developing parallel applications. There are two ways of leveraging multiple cores, either through libraries provided with the compiler or through the parallelisation of your application. For those people using the Performance Library, this is now optimised to take advantage of the latest AMD Quad-core and Six-core processors.

The easiest way of producing parallel code is using automatic parallelisation. Sun was the first company to submit automatically parallelised results for SPEC® CPU2000. Automatic parallelisation is a great technology. It takes some of the work of making parallel codes away from the developer, and places it firmly into the category of "just another compiler flag".

However, the compiler can't do this for all codes, which is why Sun was also one of the first companies to support the OpenMP 3.0 specification.

The OpenMP 3.0 specification is a very important step in making parallel programming easier. The 2.5 specification that was supported by Sun Studio 12 allows developers to identify loops that can be performed in parallel, and different sections of code that can be run simultaneously. The big improvement in the 3.0 specification is the support for Tasks. A task is a unit of work that one thread can request another thread to do. The developer defines the tasks in the source code, but the executed tasks and their order are dynamically determined at runtime. This massively increases the range of applications that can be parallelised using OpenMP.

Of course, writing parallel applications becomes much harder without the tools to support this. Sun Studio 12 Update 1 includes these tools: the Debugger for diagnosing bugs in parallel applications, the Performance Analyzer for determining the activity of all the threads in an application, and the Thread Analyzer for identifying data races in parallel applications. The Performance Analyzer has been enhanced to support hardware counters in the latest AMD processors. The hardware performance counters are an optimal way of determining exactly what the processor is doing during the run of your application.

Performance is often one of the motivating factors for any compiler upgrade. In a compiler suite performance comes from two sources: enabling the developer to identify opportunities to improve performance, and the ability of the compiler to produce good code for the processor. The performance analyzer is able to profile all kinds of parallel applications including those parallelised with OpenMP directives as well as distributed MPI applications. This enables you to quickly determine where, at a source code level, the application is spending its time, and to drill down into that source to understand the performance at the level of hardware events.

Sun Studio screenshot

The goal for the Sun Studio compiler has always been to produce code that runs as fast as possible on all SPARC and x86 processors. Sun has worked closely with AMD to ensure that the compiler is aware of the best practices for producing code for the latest AMD processors. Sun Studio 12 Update 1 includes this support and continues the long track record of delivering superior performance on AMD processors.

As well as providing support for all processors, Sun Studio is also supported on a number of platforms: Solaris, OpenSolaris, and Linux (for x86). Perhaps most importantly Sun Studio 12 Update 1 is free of charge to download and use.

* SPEC and the benchmark names SPECfp and SPECint are registered trademarks of the Standard Performance Evaluation Corporation.  Benchmark results stated above reflect results posted on www.spec.org as of 15 June 2009.

Darryl Gove is a Senior Staff Engineer in the compiler team at Sun Microsystems. He works on the optimisation and tuning of applications and benchmarks. He is the author of the books "Solaris Application Programming" and "The Developers Edge," and maintains a blog at http://blogs.sun.com/d/. His postings are his own opinions and may not represent AMD's positions, strategies or opinions. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied.



-------------------------

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.



    Posted By: AMD DeveloperCentral @ 07/17/2009 12:55 PM     Inside Dev Central     Comments (0)  

July 14, 2009
  The scoop on the x86 Open64 Compiler Suite

You may have seen the recent blog post from our CMO Nigel Dessau about the release of the x86 Open64 Compiler Suite. Nigel makes some great points about why AMD feels this open source project is important, so I won’t go into that here.  Instead, I’ll provide an overview of the latest release and what the features can mean for your development work.

 

Like other compilers, Open64 optimizes applications aggressively in many dimensions. What sets it apart is that it employs innovative techniques that stem from an understanding of the underlying hardware architecture, such as laying out data structures in space- and cache-efficient ways and applying aggressive loop-nest optimizations to promote locality. The biggest benefit is in multi-core scalability (the throughput achieved when running multiple applications simultaneously on multiple cores), where the memory subsystem is often stressed.

While the Open64 compiler suite was created to optimize software for all x86-based architectures, it includes many features that take particular advantage of AMD’s technology. One example is support for 2 MB huge pages in programs built with Open64, which helps reduce TLB misses. Another is enhanced code generation and instruction scheduling that take advantage of core pipeline hardware features. In addition, software data prefetching is better tuned to work with the hardware prefetcher and DRAM prefetcher to hide memory latencies effectively. This latest release also offers preview features of OpenMP and automatic parallelization to map program parallelism to multiple cores.

 

Here’s the full list of new features in x86 Open64 4.2.2 that AMD added (also detailed in the release notes):

· Support for 2 MB huge pages.
· Improved loop fusion and loop unrolling.
· Improved head/tail duplication, if-merging, scalar replacement and constant folding optimizations.
· Improved interprocedural alias analysis.
· Improved partial inlining and inlining of virtual functions.
· More aggressive re-layout optimization for structure members.
· Improved instruction selection and instruction scheduling.
· Improved tuning of library functions.
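To make one item on that list concrete, here is a hand-written sketch of what loop fusion does. It is only an illustration of the transformation, not output from Open64, and the function names are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unfused: two loops, so the intermediate array `sum` is written out by the
// first loop and read back in by the second.
std::vector<double> scaled_sum_unfused(const std::vector<double>& a,
                                       const std::vector<double>& b,
                                       double k) {
    assert(a.size() == b.size());
    std::vector<double> sum(a.size()), out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) sum[i] = a[i] + b[i];
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = k * sum[i];
    return out;
}

// Fused: one loop over the same range, and the intermediate value never
// needs to be stored in an array.
std::vector<double> scaled_sum_fused(const std::vector<double>& a,
                                     const std::vector<double>& b,
                                     double k) {
    assert(a.size() == b.size());
    std::vector<double> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double s = a[i] + b[i];
        out[i] = k * s;
    }
    return out;
}
```

Both functions return the same values for the same inputs; the fused form simply touches memory less, which is the locality benefit described earlier in this post.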

 

What this compiler suite really enables is highly optimized performance when running multiple applications at the same time, which is pretty much the norm for real-world workloads.  In the spirit of open source projects, we’d like your feedback on how to improve this compiler suite.  If you would like to suggest features for future releases, leave us a comment.  While we can’t promise that the features will be added, we certainly take your feedback under serious consideration.

 

 

Roy Ju

AMD Fellow



-------------------------


    Posted By: AMD DeveloperCentral @ 07/14/2009 04:30 PM     Inside Dev Central     Comments (3)  

July 6, 2009
  IEEE floating point exception handling in Windows® OS

In this blog, we present an example of how IEEE floating-point (FP) exceptions can be caught when programming in C++ for Microsoft® Windows® using Microsoft Visual Studio (VS). We employ the __try/__except extension available in the VS C++ compiler and the _fpieee_flt filter function to handle exceptions. We specifically talk about IEEE exceptions raised by SSE FP instructions, how the MXCSR register behaves, and some behind-the-scenes details.

FP arithmetic in the x86 world has traditionally been done by x87 instructions. But after the advent of the x86-64 (AMD64) architecture, FP math is increasingly done using the SSE FP instructions. Like their x87 counterparts, SSE instructions also raise IEEE exceptions during certain FP arithmetic operations. These exceptions are hardware exceptions raised by the processor to signal abnormal cases and conditions. By default, FP exceptions are masked, which means that they are recorded in a status register but prevented from actually getting raised. On the other hand, if they are unmasked, they will be raised and can alter the program flow. The MXCSR register controls the masking of FP exceptions for the SSE FP instructions. It also acts as the status register that records FP exceptions when those exceptions do occur.
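The masked default can be observed without any Windows-specific calls. The following is a minimal sketch using the standard <cfenv> interface rather than _controlfp_s; the function name is hypothetical, and it assumes the compiler is not run with options (such as fast-math) that disable FP status tracking.

```cpp
#include <cassert>
#include <cfenv>
#include <cmath>

// Divides 1.0 by zero with exceptions left masked (the default) and reports
// whether the zero-divide status flag was recorded afterwards.
bool masked_zero_divide_sets_flag() {
    std::feclearexcept(FE_ALL_EXCEPT);
    volatile double zero = 0.0;           // volatile keeps the divide at run time
    volatile double result = 1.0 / zero;  // masked: no trap, result is +infinity
    assert(std::isinf(result));
    return std::fetestexcept(FE_DIVBYZERO) != 0;
}
```

With the exception masked, the program flow is not altered; the only trace of the event is the status flag, which is what the function returns.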

The IEEE FP exceptions are hardware exceptions and hence need support from the OS to get control back to user code when these exceptions occur. The structured exception handling (SEH) mechanism of Windows makes this possible. (Refer to http://msdn.microsoft.com/en-us/library/ms680657(VS.85).aspx). The _fpieee_flt function acts as the bridge in SEH to the user defined handler function. (Refer to http://msdn.microsoft.com/en-us/library/te2k2f2t(VS.80).aspx). The handler is registered using this function, and when the exceptions get filtered by SEH, control is transferred to the handler with all the relevant information about the exception.

Here is an example program to illustrate:

#include <iostream>
#include <float.h>
#include <math.h>
#include <fpieee.h>
#include <windows.h>

extern "C" int handler(_FPIEEE_RECORD *p)
{
    std::cout << "In the handler invoked by _fpieee_flt" << std::endl;
    if(p->Operation == _FpCodeLog)
        return EXCEPTION_CONTINUE_EXECUTION;
    else
        return EXCEPTION_EXECUTE_HANDLER;
}

int main()
{
    unsigned int cw;

    // Get control word
    _controlfp_s(&cw, 0, 0); // Line A

    // Enable zero-divide exception
    _controlfp_s(0, ~_EM_ZERODIVIDE, _MCW_EM); // Line B

    for(int i=0; i<2; i++)
    {
        __try
        {
            double b, a = 0.0;

            if(i==0)
                b = log(a); // Line C
            else
                b = 1/a; // Line D

            std::cout << "b: " << b << std::endl;
        }
        __except(_fpieee_flt(GetExceptionCode(),
            GetExceptionInformation(), handler))
        {
            std::cout << "In the __except block" << std::endl;
        }
    }

    // Restore control word
    _controlfp_s(0, cw, _MCW_EM); // Line E

    return 0;
}

This code was built and run with VS 2008, targeting the x64 platform. Since it is a 64-bit target, the code generated will contain SSE FP instructions to perform the FP arithmetic operations.

The _controlfp_s function is the interface to access and modify the MXCSR register. In line A, we store the control word for restoring it later. If the MXCSR register (not the variable cw) is examined we see it is set to 1f80h. This shows that all FP exceptions are masked (Refer to AMD64 architecture programmer's manual volume 1). At Line B, we enable the zero-divide FP exception. Now the MXCSR register changes to 1d80h to unmask that particular exception.

Next, we try two scenarios in which the zero-divide exception can occur. The first is taking the logarithm of zero. According to the IEEE 754 standard's recommendation, this operation should raise an FP zero-divide exception, and the log function does that. The second scenario is a simple divide operation that will raise this exception.

The FP exception handler function checks whether the exception was raised by a log operation. If it was, it returns a code asking for execution to continue in the __try block. If not, the return code directs the program to execute the __except block. Refer to http://msdn.microsoft.com/en-us/library/s58ftw19(VS.80).aspx to learn more about __try/__except blocks and exception-handling constants.

In the first iteration, when line C is executed, control is transferred to the handler, which then asks that control be given back to the __try block where the exception occurred, and hence back to the log function. The log function continues and an output of negative infinity is produced. Examining the MXCSR register at various points shows that all FP exceptions are temporarily masked when control is in the handler (1f80h) and restored when control gets back to the __try block (1d80h).

In the second iteration when line D is executed, control goes to the handler and then to the __except block. In this case the MXCSR register changes to 1d84h after line D and stays that way until the exception masks are restored at line E. If you disassemble the program, you will see that line D is compiled as a divsd instruction. During execution this SSE instruction sets the zero-divide status bit in MXCSR (the 4 in 1d84h), and since the zero-divide mask bit is cleared it causes a hardware FP exception. This exception is trapped by the OS and the control is transferred back to user code through SEH.
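The three MXCSR values quoted in this post (1f80h, 1d80h and 1d84h) can be decoded by hand. Here is a small sketch that does so, assuming the standard SSE layout of status flags in bits 0-5 and the corresponding mask bits in bits 7-12; the enum and function names are our own, chosen for illustration.

```cpp
#include <cassert>
#include <cstdint>

// The six SSE exception kinds, numbered by their status-flag bit in MXCSR.
enum FpException : unsigned {
    kInvalid = 0, kDenormal = 1, kZeroDivide = 2,
    kOverflow = 3, kUnderflow = 4, kPrecision = 5
};

// The status flag for exception e lives in bit e (bits 0-5).
bool flag_set(std::uint32_t mxcsr, FpException e) {
    assert(e <= kPrecision);
    return ((mxcsr >> e) & 1u) != 0;
}

// The mask bit for exception e lives in bit e + 7 (bits 7-12); 1 means masked.
bool is_masked(std::uint32_t mxcsr, FpException e) {
    assert(e <= kPrecision);
    return ((mxcsr >> (e + 7)) & 1u) != 0;
}
```

For example, is_masked(0x1F80, kZeroDivide) is true, is_masked(0x1D80, kZeroDivide) is false, and flag_set(0x1D84, kZeroDivide) is true, matching the three states described above.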

In the first case with the log operation, it is not hard to see that the temporary masking of the exceptions was done by the log function and not by the SEH mechanism of the OS. In this case, the IEEE FP exception was simulated by software (similar to a call to the RaiseException function) and not raised by a single hardware instruction as in the second scenario.

We hope you find this discussion and example useful. If you have any questions or comments, please post them. In the future, we will discuss similar techniques for Linux®.

Visit AMD's Windows zone (http://developer.amd.com/zones/windows/Pages/default.aspx) for general Windows related information.

 



-------------------------


    Posted By: Bragadeesh Natarajan @ 07/06/2009 07:11 PM     AMD Libraries     Comments (5)  

June 30, 2009
  ACML 4.3.0 Performance Data

Now that the ACML 4.3.0 release is completed and posted live on AMD Developer Central, I’ve been spending time collecting all the performance data needed to document the improvements in the 4.3.0 release.   There are several new features that should show up nicely in performance graphs.  Improvements include a new SGEMM kernel for AMD Family 10h, new DGEMM and SGEMM for Woodcrest, Penryn, and Nehalem Intel processors, improved level 1 BLAS kernels, 3D FFT work, and new scalar acml_mv functions.  It’s a really long list!

You can easily demonstrate these new performance features by using the examples in the performance directory of the ACML installation.  There are examples for a few different routines, and these can be easily modified to demonstrate other routines as well.

A couple of trends are jumping out from the data collected so far.  First, the 4.3.0 Level 3 BLAS routines run much better than previous versions on Intel machines.  They are very competitive with MKL on Intel processors!

Second, the Intel Nehalem is a very impressive processor.  However, Istanbul’s 6 cores can crank out a bunch of raw DGEMM flops.  This graph tells the story:

 

More information on ACML 4.3.0 is available on the ACML home page.  If you have feedback on how the new release improves performance for your application, we'd love to hear about it.



-------------------------
This response is provided for informational purposes only, is provided “AS IS” and does not obligate AMD to provide any of the services, technology, or programs described.


    Posted By: Chip Freitag @ 06/30/2009 11:46 AM     AMD Libraries     Comments (1)  

June 29, 2009
  Removing C wrapper functions from the AMD Core Math Library (ACML) to resolve linking issues.

ACML is a significant library of (mostly) FORTRAN subroutines, provided in binary form and available for download at http://developer.amd.com/acml.  Each version of the library has been compiled with a particular FORTRAN compiler, and is compatible with application programs written and compiled with the same compiler.

Although FORTRAN programming has hardly disappeared, if you're reading this blog, the odds are far more likely that you're developing in C/C++ or C#.

Calling FORTRAN subroutines from C/C++/C# is doable, but there are a lot of potential problems and pitfalls.  The C and FORTRAN languages have completely different subroutine naming and argument-passing conventions.  For example, where C/C++ passes parameters by value (except for arrays), FORTRAN passes them by reference.  When you have a multi-dimensional array, FORTRAN stores the data in column-major order; C/C++ uses row-major order.  Different FORTRAN compilers have different conventions for passing strings, for the name of the subroutine entry point, etc.

To help make ACML useful to C/C++/C# programmers, some versions of the library come with support for C compilers, including an "acml.h" header and "C wrapper" functions.  These alternate entry points take care of most of the hassle for you (although it's up to the user to watch out for the row-major versus column-major array problem).
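The row-major versus column-major caveat is easy to get wrong, so here is a minimal sketch of the index arithmetic involved. The helper names are hypothetical and are not part of ACML or "acml.h".

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Offset of element (row, col) in a column-major matrix with leading
// dimension lda (the FORTRAN layout): each column is contiguous.
std::size_t col_major_index(std::size_t row, std::size_t col, std::size_t lda) {
    assert(row < lda);
    return row + col * lda;
}

// Copies a row-major rows x cols matrix (the C/C++ layout) into a new
// column-major buffer of the kind a FORTRAN-convention routine expects.
std::vector<double> to_col_major(const std::vector<double>& a,
                                 std::size_t rows, std::size_t cols) {
    assert(a.size() == rows * cols);
    std::vector<double> out(rows * cols);
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            out[col_major_index(r, c, rows)] = a[r * cols + c];
    return out;
}
```

A 2 x 3 matrix stored row-major as 1 2 3 4 5 6 comes out as 1 4 2 5 3 6. In practice you can often avoid the copy by telling the routine the matrix is transposed, where the routine has such an argument.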

For example, suppose you consulted the section "Determining the best ACML version for your system" in the ACML manual (online here: http://developer.amd.com/cpu/Libraries/acml/onlinehelp/Documents/BestLibrary.html#BestLibrary), and chose to download the Linux IFort64 version for your project.   You would be able to code your project with either Intel (R) FORTRAN  or a compatible C/C++ compiler.  Your choice.

So how does this work?  If a FORTRAN module containing:
           CALL DNRM2 (...)
or
           SUBROUTINE DNRM2 (...)
is compiled with the 64-bit ifort compiler, the linkage name passed to the linker is "dnrm2_" (note the lower-case symbol name with a trailing underscore).  Both the caller and the callee assume that all parameters are passed by reference.

If a C program module containing:
           #include <acml.h>
           dnrm2 (...)
is compiled with the 64-bit GNU gcc compiler, the linkage name passed to the linker is "dnrm2" (lower-case symbol name without the trailing underscore).  The caller passes array parameters by reference, but all other parameters are passed by value.

You can use the "objdump" or "nm" utilities from the GNU binutils package to confirm the external linkage symbols in an object or library file.

So, we can provide a single library with both FORTRAN-callable and C-callable versions of the same routine, because the linkage names used for subroutines are different for the two languages.  The ACML library contains two object modules for each routine defined in "acml.h".  The FORTRAN version exports the symbol with the trailing underscore as the entry point with the FORTRAN calling convention.  A separate "C wrapper" module exports the symbol without the underscore as the entry point for a short routine that resolves the differences in calling conventions and then calls the FORTRAN-compatible version.
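The two-entry-point arrangement can be sketched in a few lines. Everything below is a hypothetical stand-in written for this post: the names demo_nrm2_ and demo_nrm2 are invented, the "FORTRAN" routine is written in C++ purely to show the calling convention, and the norm is computed naively rather than with the scaling a production DNRM2 would use.

```cpp
#include <cassert>
#include <cmath>

// Stand-in for the FORTRAN-compiled routine: lower-case name with a trailing
// underscore, and every argument (scalars included) passed by reference.
extern "C" double demo_nrm2_(const int* n, const double* x, const int* incx) {
    double sum = 0.0;
    for (int i = 0; i < *n; ++i) {
        const double v = x[i * (*incx)];
        sum += v * v;
    }
    return std::sqrt(sum);
}

// Stand-in for the "C wrapper": the same name without the underscore. Scalars
// arrive by value; the wrapper takes their addresses and forwards the call.
extern "C" double demo_nrm2(int n, const double* x, int incx) {
    assert(n >= 0 && incx > 0);
    return demo_nrm2_(&n, x, &incx);
}
```

A C caller writes demo_nrm2(2, x, 1), while a FORTRAN-convention caller reaches demo_nrm2_ with addresses. The two symbols differ only by the underscore, which is why the name alone decides which convention the callee assumes.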

So all is well as long as your project is built with the specific FORTRAN compiler or a compatible C compiler or some combination of those.  But you can run into trouble if yet another compiler is thrown into the mix, or another 3rd-party library which was built with another compiler is used.

One of our users recently ran into exactly this situation.  They wanted to link together their program code, which was compiled with Intel (R) FORTRAN, plus ACML, plus yet another linear algebra library (which I won't name - let's call it library X).  Library X was linked with object code from a different FORTRAN compiler which did not append a trailing underscore to the linkage name.  The calling routine would push references (addresses) of the scalar parameters (such as the array sizes) onto the stack and then call the symbol "dnrm2" (without the underscore).  The linker would match that name with the "C wrapper" for dnrm2, which would expect those parameters to have been passed by value.  It would then execute the dnrm2 algorithm using the address of the array size variable N in place of N itself.  This would probably just crash with a segmentation violation.  If by some miracle it did not crash, it certainly would not compute the correct results.

In some cases the ACML user can make local customizations to the ACML library to work around these problems.  Of course, it is strictly the user's responsibility to ensure that these customizations are appropriate and generate correct linkages.   In this case, the work-around was to remove all of the C wrappers from libacml.a.

The script below shows how this can be done.   The technique used is a quick-and-dirty hack, and not the most efficient or elegant way of accomplishing the same effect. 

#! /bin/sh
#   Make a local copy of the ifort64 ACML static library
cp /opt/acml4.1.0/ifort64/lib/libacml.a ./libacml.a
#   Create a list of all of the C-wrapper modules
#   (match only names that end in the literal suffix "_cw.o")
ar -t libacml.a | grep '_cw\.o$' > wrapperlist
#   Create a script to delete all of the C-wrapper modules
#   and execute it.
sed "s/.*/ar -dv libacml.a &/" wrapperlist | bash
#   Clean up
rm ./wrapperlist

One undocumented piece of information makes it easier to remove the "C wrapper" functions from this version of libacml.a:  All of those object modules have names with the suffix "_cw.o".  There is no guarantee that this will be true in other versions of the library or in future releases.

With this knowledge, the "ar -t" and "sed ... | bash" lines of the script are all that is needed to remove these modules.  Of course, this will remove them one at a time, which is remarkably slow and inefficient.  On the other hand, you only need to do this once.  You should expect this script to take a good fraction of an hour to execute, and plan accordingly;  start it when you're ready to leave for lunch or a meeting.

Let us know if this makes ACML more useful for you; we'd like to hear what you're doing with the library. 





-------------------------





Edited: 06/30/2009 at 11:42 AM by jim.conyngham@amd.com


    Posted By: Jim Conyngham @ 06/29/2009 03:39 PM     AMD Libraries     Comments (1)  

June 15, 2009
  Just released: Advanced Synchronization Facility (ASF) specification

Recently AMD released an experimental specification for a proposed AMD64 architecture feature that may be of interest to all programmers of highly concurrent programs, libraries, runtimes, and operating systems: Advanced Synchronization Facility, or ASF for short. This is the first of three blog articles describing why AMD's Operating System Research Center (OSRC) became involved in the development of ASF, how we are evaluating ASF, and how this and other activities fit into the EU-funded VELOX project aiming at improving the state of the art for software-transactional-memory systems.

In this posting I will give you a quick overview of what ASF is and how it works, along with some example code. I'll also describe how I became involved in developing ASF and why we are releasing this spec proposal.

About ASF
In a nutshell, ASF is intended to make it easier to write efficient, highly concurrent programs.

When AMD introduced multicore CPUs to the x86 world, we acknowledged that individual CPU cores weren't getting much faster with each silicon-technology generation. Instead, we decided to provide multiple CPU cores in one processor. This put the burden on the software community of making programs run faster on newer processors (i.e., programs have to be changed to take advantage of the parallelism.)

Writing efficient, concurrent programs or parallelizing an existing sequential program is a hard endeavor. The trickiest part is making sure that all program threads have a consistent view of all shared data. ASF is intended to address this very problem, known as synchronization.

How does ASF work?
ASF provides a mechanism to update multiple shared memory locations atomically without having to rely on locks for mutual exclusion. It's quite flexible as the semantics of the update are not fixed, but can be provided using standard x86 instructions.

Here's an example. This code snippet implements a two-word compare-and-swap (DCAS) primitive. The new ASF instructions are SPECULATE, COMMIT, and the LOCK-prefixed MOV (highlighted in red in the original post):

; DCAS Operation:
; IF ((mem1 = RAX) && (mem2 = RBX))
; {
;   mem1 = RDI
;   mem2 = RSI
;   RCX = 0
; }
; ELSE
; {
;   RAX = mem1
;   RBX = mem2
;   RCX = 1
; }
; (R8, R9 modified)
;
DCAS:
 MOV      R8, RAX
 MOV      R9, RBX
retry:
 SPECULATE                    ; Speculative region begins
 JNZ      retry               ; Page fault, interrupt, or contention
 MOV      RCX, 1              ; Default result, overwritten on success
 LOCK MOV RAX, [mem1]         ; Specification begins
 LOCK MOV RBX, [mem2]
 CMP      R8, RAX             ; DCAS semantics
 JNZ      out
 CMP      R9, RBX
 JNZ      out
 LOCK MOV [mem1], RDI         ; Update protected memory
 LOCK MOV [mem2], RSI
 XOR      RCX, RCX            ; Success indication
out:
 COMMIT                       ; End of speculative region

The SPECULATE-COMMIT pair wraps a speculative region, which speculatively reads from and writes to protected memory locations using the LOCK MOV instructions. The speculative memory updates will become visible to other CPUs only when the speculative region completes successfully.

Here's what the speculative region does in this example: The initial LOCK MOV instructions signify the memory locations that need to be monitored for external modifications and also read the memory operands into the RAX and RBX registers. The code then compares these operands with the original register operands (saved to R8 and R9 at the outset of the routine). The DCAS operation may fail because of a miscomparison at that point, bypassing the memory update. The RCX register returns a pass-fail indication.
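For readers who prefer C++ to assembly, the DCAS semantics of the listing can be sketched as follows. This is an emulation only: a single global mutex stands in for the atomicity that the speculative region would provide, so it is atomic only with respect to other callers of this same function, and it says nothing about how ASF itself is implemented. The function and variable names are ours.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

// Emulation only: one global lock replaces the SPECULATE/COMMIT region.
std::mutex g_dcas_mutex;

// Returns true on success (RCX = 0 in the listing). On failure, expected1 and
// expected2 are overwritten with the values found in memory, mirroring how
// RAX and RBX hold mem1 and mem2 when the listing takes the failure path.
bool dcas(std::uint64_t& mem1, std::uint64_t& mem2,
          std::uint64_t& expected1, std::uint64_t& expected2,
          std::uint64_t desired1, std::uint64_t desired2) {
    assert(&mem1 != &mem2);  // assumption: two distinct locations
    std::lock_guard<std::mutex> guard(g_dcas_mutex);
    if (mem1 == expected1 && mem2 == expected2) {
        mem1 = desired1;
        mem2 = desired2;
        return true;
    }
    expected1 = mem1;
    expected2 = mem2;
    return false;
}
```

The point of ASF is to get this behavior without the lock; the sketch is only meant to pin down what "succeeds" and "fails" mean for the two-word update.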

A speculative region may also be aborted, for example when a contending program thread accesses a protected memory location or when an interrupt occurs. In this case, all speculative memory updates are discarded, and the program flow (instruction and stack pointer) is rolled back to just after SPECULATE, where software can inspect the reason for the abort in the rAX and rFLAGS registers. The code in this example examines RFLAGS immediately after SPECULATE using a JNZ instruction that branches to the abort handler, which in this case just attempts a retry. A real implementation might have a more elaborate recovery strategy, for example, exponential backoff if the abort was due to contention.
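As a rough idea of what such a recovery strategy could look like in portable code, here is a sketch of a retry loop with capped exponential backoff. It is not taken from the ASF proposal; the helper names, the 1-microsecond base delay and the 64-microsecond cap are arbitrary assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
std::chrono::microseconds backoff_delay(unsigned attempt,
                                        std::chrono::microseconds base,
                                        std::chrono::microseconds cap) {
    assert(base.count() > 0 && cap >= base);
    std::chrono::microseconds delay = base;
    for (unsigned i = 0; i < attempt && delay < cap; ++i)
        delay *= 2;
    return std::min(delay, cap);
}

// Calls try_once until it reports success or max_attempts is reached,
// sleeping with exponential backoff after each failed attempt.
// Returns the number of attempts made, or 0 if every attempt failed.
unsigned retry_with_backoff(const std::function<bool()>& try_once,
                            unsigned max_attempts) {
    using std::chrono::microseconds;
    for (unsigned attempt = 0; attempt < max_attempts; ++attempt) {
        if (try_once())
            return attempt + 1;
        std::this_thread::sleep_for(
            backoff_delay(attempt, microseconds(1), microseconds(64)));
    }
    return 0;
}
```

Here try_once plays the role of one pass through the speculative region: it returns false when the attempt was aborted and true when it committed.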

How we are developing ASF
ASF really is a team effort, with team members looking at various software applications, hardware implementation, and the specification itself.

When I joined AMD's OSRC at the end of 2006, I quickly discovered ASF as it existed at that time: a mechanism for improving the efficiency of highly parallel, lock-free synchronization code. In previous work I had used lock-free data structures for building a real-time microkernel operating system, and I had often craved a feature for multi-word atomic updates such as ASF. This might explain why I was so enthralled by ASF.

In the meantime, I have become the editor of the ASF specification proposal. I'm working with the ASF team to evaluate the feature in various application scenarios, and to further develop ASF based on our findings. We have expanded its focus to include software transactional memory (STM) as well; more on that in a later blog post.

We are also actively discussing ASF with both academic and industrial partners to learn about interesting application areas and to derive requirements for an eventual implementation in future products.

The ASF specification
ASF is an experimental architecture extension currently in proposal stage. AMD has not yet committed to including this feature into any future CPU product. Instead, we are soliciting input from developers and researchers that would help us refine the ASF specification to better meet software development requirements.

ASF is not the first feature we have proposed in this way. A year and a half ago, AMD decided to be more open in developing extensions to the AMD64 architecture to help ensure we meet the needs of the software development community and to encourage cross-vendor compatibility. At that time, we proposed the Lightweight Profiling (LWP) and SSE5 features in a similar spirit, and we received extremely valuable input from the programming community that helped us improve our future products - to your benefit. SSE5 has just recently evolved into the AVX-compatible XOP, which we described in a previous blog entry.

Please download the ASF specification proposal and send your comments to ASF_Feedback@amd.com.

---
Michael Hohmuth, MTS
AMD Operating System Research Center, Dresden



-------------------------


    Posted By: AMD DeveloperCentral @ 06/15/2009 01:57 PM     AMD Operating System Research Center (OSRC)     Comments (3)  

June 11, 2009
  JavaOne 2009

I was lucky enough to go to JavaOne last week and thought I'd share some comments, highlights, a few quibbles, and a way to make some serious money if you are in the beanbag industry.
 
I felt that this year's JavaOne was a little subdued -- attendance seemed lower (we can probably all guess that the economy was a factor here) and generally there were fewer 'cool!' exclamations from the audiences.
 
Monday
 
This was 'CommunityOne' day. I attended a couple of sessions (Hadoop and Cloud related) but really spent most of the day bumping into people and catching up. CommunityOne looked a little sparsely attended at times.
 
I did attend a session where the presenters and attendees discussed how to get the most out of their JUGs (Java User Groups).  This was a really good session.
 
I did enjoy hanging out in the AMD sponsored 'Hang Space,' and I had my first 'patent pending idea' here watching all of the laptop users sitting on the floor next to the walls (where the 110vac was served) and not on the comfy beanbags! So beanbag builders of the world, we need beanbags which incorporate 110v sockets.  These could be sold in strings which connect together and will allow those slacking off at conferences to actually partake in the bean-bag offerings rather than sit on the floor.  Of course, one might ask why the beanbags were not dragged to the walls, and the answer would be, you wouldn't be able to watch the episodes of 'The Office - US version' that were being served up on the big screen, obviously.    I, of course, could happily sit in a beanbag, pretend to work and watch Dwight, Jim, and Pam wrestle with their plight because I have an AMD powered HP dv2 - whose battery lasted way longer than Season 1 of "The Office."
  
 
Tuesday
 
It was good to see Scott McNealy hand over (the keynote, not Sun just yet) to Larry Ellison. Larry's remarks regarding the importance of Java to Oracle must have made a few folks sleep easier on Tuesday evening and I suspect that the JavaFX team will be particularly pleased with Larry calling out JavaFX by name and pushing a possible OpenOffice/JavaFX integration down the line. That should be good for JavaFX and hopefully good for OpenOffice.
 
So where is JavaFX in 2009? I count this as the third JavaOne where Sun has pushed JavaFX. 2007 was kind of a preview, and I enjoyed the demos but that was really all it was. It dominated in 2008, but was still really not cooked and I walked out of the lab session when I was asked to sign an NDA -- an NDA for a lab session at a conference that I paid to attend seemed a bit weird. Now in 2009 I really do think it might start gaining some traction.  The addition of charting was smart (and pretty obvious really) and I was pleased that even Eclipse users got something in the form of a fairly cool Eclipse plugin.  Now it really feels that JavaFX is not just for NetBeans anymore.  The demos were slicker and the downloaded Eclipse plugin worked like a charm.
 
Having worked on a large Flex application a few years back, and having seen some extremely cool Flex apps, I have always seen JavaFX as too little too late. Flash and Flex have pretty much carved up the R part of RIA (although AJAX is not dead yet!). Now I am a little more hopeful for JavaFX to at least find an audience. The more natural Java integration and the impressive binding support will appeal to those who really took to MXML+ActionScript, and I can see the story developing.  The effort that has gone into JNLP/applet deployment (on JRE 1.6.0_10+) has helped enormously and once we can find a way to get JavaFX to launch faster (Flash still seems to launch way faster than even trivial JavaFX apps) I think that JavaFX will come into its own.  I look forward to kicking the tyres some more.
 
Joshua Bloch (Google, Inc) and Neal Gafter's (Microsoft) "Return of the Puzzlers: Schlock and Awe" session was as well attended as ever. These guys do a great job presenting these infuriating corner cases. I liked the fact that they  acknowledged making some of the mistakes presented; it makes us all feel a little less incompetent. I think I got more answers right this year, although my success rate is still not impressive. 
 
The "Small Language Changes in JDK(tm) Release 7" session by Joseph Darcy, Sun Microsystems, Inc. was interesting.  I really like the 'Elvis operator' ?: and also look forward to using some of the suggestions for less verbose 'Generic' declarations/initializations.
 
The "Asynchronous I/O Tricks and Tips" session by Jean-François Arcand and Alan Bateman from Sun Microsystems, Inc. was an informative session. I really am guilty of not tracking nio enough (when will the 'n' in 'nio' seem really inappropriate?), and I look forward to using some of these tricks, especially using a 'Future' to access the response from an asynchronous read.
 
One of my favourite sessions was "Toward a Renaissance VM" by Brian Goetz and John Rose from Sun Microsystems. Sometimes I feel my head is way too small to understand this JSR 292 stuff, but I actually felt that I have a grasp of how this will help dynamic languages and also how it might apply to frameworks which currently rely on bytecode engines/injection and reflection to do their work.  I still need to track down more information on this but the fog is lifting for me.
 
I wish I had caught the "The Feel of Scala" session by Bill Venners of Artima, Inc.  Only as the week progressed did I realize that I need to track Scala. I look forward to the slides and video of this presentation.
 
Wednesday
 
I attended a great session called 'State: You're Doing It Wrong -- Alternative Concurrency Paradigms on the JVM(tm) Machine' in the morning from Jonas Bonér of Scalable Solutions.  This session proposed State, Actor message passing and Data Flow mechanisms to improve concurrency.  For me the Actor-based demos (based on Scala) not only prompted me to look at this approach in my Java apps, but also was a great example of how Scala can be scaled out.  As I mentioned earlier I really need to dig into Scala some more.
 
I regret missing "The Modular Java(tm) Platform and Project Jigsaw" by Mark Reinhold of Sun Microsystems, Inc. From what I have read elsewhere this modular approach is really going to help deployment and packaging.
 
Joshua Bloch's (from Google) "'Effective Java': Still Effective After All These Years" was another opportunity to see the 'Billy Mays' of Java (I really mean no disrespect - Josh is a pitch-perfect pitch man) do what he does flawlessly.  His 'Effective Java' book is like the movie 'Brazil': you need to reread/review it every year to catch what you missed previously.
 
I enjoyed "The Ghost in the Virtual Machine: A Reference to References" session from Bob Lee, Google Inc., which went into depth regarding GC, references, and finalization issues.  I look forward to walking through the slide deck on this one.  I learned a lot and also know a bunch slipped on past me.
 
I watched a cool demo which redefined classes in a running JVM using a Java agent and some classloader tricks.  This BOF session "Runtime Update of Java(tm) Technology-Based Applications, Using Dynamic Class Redefinition" by Allan Gregersen from University of Southern Denmark was fun and educational. The presenter built a Swing-based game incrementally by adding fields and methods, changing class hierarchies, etc., all without ever restarting the JVM.  Although in practice I feel this javaagent-based chaining approach may not scale particularly well, if this can be pushed down into the JVM (as the presenter suggested) then this whole area has some great potential.
 
I must apologise to my fellow AMDer, Richard West, and David Gilbert from Object Refinery Limited for missing their "JFreeChart: Surviving and Thriving" BOF.  I look forward to picking Richard's brain about this great toolkit.
 
Thursday
 
Occasionally I like to see what is going on in the Swing world.  I don't really get to write much in Swing but there are some really great toolkits out there. I particularly enjoyed "Swing Rocks: A Tribute to Filthy-Rich Clients" by Martin Gunnarsson and Pär Sikö from Epsilon Information Technology. Swing really can look compelling.
 
The "Matchmaking in the Cloud: Hadoop and EC2 at eHarmony" session from Steve Kuo and Joshua Tuberville of eHarmony, Inc. was a good presentation (and from a show of hands there were two attendees who actually got married through eHarmony so there was a cool validation of eHarmony's matching algorithm!). It walked through the technical and economic considerations around using these technologies.

"Garbage Collection Tuning in the Java HotSpot(tm) Virtual Machine" from Charlie Hunt and Antonios Printezis of Sun Microsystems, Inc. was a good, informative session that walked through a number of great slides highlighting what to do and what not to do.  I still feel that GC tuning should be less of a 'dark art.'  I worry how many JVMs are sitting out there thrashing when a few command-line options would smooth the way.  I do wish for a -XX:+GCAdvise option which (possibly at the end of each GC) would suggest what command-line flags would be optimal for a specific workload. I know that I am supposed to use the GC printing options (such as -verbose:gc or -XX:+PrintGCDetails) and/or use VisualVM to show me the graphs that I should use to determine what flags will be optimal, but this seems way too hard.  Surely after running for a while the GC engine/subsystem would have enough data to generate an 'I suggest running with these flags ... because ....' style report, instead of 'here are a bunch of graphs and text dumps, now go away and work out what you did wrong and come back.'   Sometimes I don't want to learn to fish; sometimes I would just like to eat some fish.
 
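Short of an advisor flag, a program can at least ask the running JVM for its own GC counters. The sketch below is not the wished-for advice report, only the raw numbers, pulled through the standard java.lang.management API; the class and method names are invented for this example.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical class name, used only for this sketch.
class GcSummarySketch {

    // One line per collector: its name, how many collections it has run,
    // and the accumulated collection time reported by the JVM.
    static List<String> summarize() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add(gc.getName() + ": "
                    + gc.getCollectionCount() + " collections, "
                    + gc.getCollectionTime() + " ms");
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : summarize()) {
            System.out.println(line);
        }
    }
}
```

Collector names and counts depend on the JVM and the flags it was started with, so the output differs from machine to machine.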
 
Cliff Click (from Azul Systems) and Brian Goetz's (Sun Microsystems)  session,  "This Is Not Your Father's Von Neumann Machine; How Modern Architecture Impacts Your Java(tm) Apps" was another one of the highlights of the conference.  It was a great presentation and allowed folk without a deep understanding of microprocessor architecture to walk away with some understanding of what happens under the hood. The slide deck in the middle which walked through the issues relating to how multi-core architectures executing speculatively have to handshake over the cache was very, very slick. I am looking forward to Cliff and Brian's Boxed Set being released.
 
 
There were some great sessions on "Actor-Based Concurrency in Scala" from Philipp Haller of EPFL and Frank Sommers of Artima which really rammed home how effectively Scala and this Actor-based communication mechanism can simplify some concurrency problems.  As I mentioned before this was brought up in an earlier session, and I enjoyed digging deeper in this dedicated session.
 
I stayed late to enjoy the "Java(tm) Programming Language Tools in JDK(tm) Release 7" BOF on Thursday night hosted by Maurizio Cimadamore and Jonathan Gibbons from Sun Microsystems, Inc.  I applaud the upcoming refactoring of javap and also enjoyed the discussion on how we might get better error reporting out of javac. I also vote for the option of getting compilation rendered to XML to help tool chaining.
 
Friday
 
Gosling's "Toy Show" (Friday morning) did have some cool stuff; the JavaFX studio tool for composing JavaFX without coding does look very, very good. Also the image analysis toolkit which generated analytical 'hashes' for images and then allowed image related searching/matching was very impressive. My favourite was the Printer/Copier based Java app for creating arbitrary multiple choice exam papers or surveys on plain paper. You print a bunch of the question papers off, then feed a special page with the answers plus the response papers into the scanner, and the copier/printer grades the papers.   Very smart.
 
The "Under the Hood: Inside a High-Performance JVM(tm) Machine" session from Trent Gray-Donald of IBM was excellent.  This provided some more insight into what happens when your code is executed by a modern JVM.
 
Sadly I missed afternoon sessions because I had to get to the airport to get home to watch season two of 'The Office.'
 
There certainly is enough to dig into to keep me busy until next year.
 

 



-------------------------

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.



    Posted By: Gary Frost @ 06/11/2009 06:30 PM     AMD Java Labs     Comments (0)  

  How Complex Is Your JRE Command-line?

Guess how many command-line flags there are for the server JRE in the OpenJDK?  I'm hearing 42.  Kudos to all of you fans of the late Douglas Adams, but you're slightly short of the real answer.  It's 477 (give or take a flag or two).  To confirm, just go into src\share\vm\runtime\globals.hpp and src\share\vm\opto\c2_globals.hpp, which define them.  The flags control all sorts of things, some of which you are probably very familiar with like the heap (-Xms -Xmx), and some which you may not know about, such as the memory footprint settings (-XX:ReservedCodeCacheSize and -XX:InitialCodeCacheSize).

I'm not asking you this because I want to know if you have intimate knowledge of the JRE (although if you can keep bits of trivia like this in your head, I am truly impressed).  My question really comes out of the world of performance analysis of Java runtimes.  Suffice it to say that as the Java Labs works to improve JRE performance, sometimes our analysis leads to improvements that can be realized by tuning these existing command-line flags.  But here's my theory...I bet most of you use few, if any, of these flags in production.  You probably have very good reasons for doing this.  You may not have access to the command line, or you may have different applications, some of which may benefit from certain flags, while others won't.  If true, the result is the same...when we look to improve JRE performance, we really need to do it in a way that is engineered to help potentially any application in a flexible way that does not require changes to the command line.

So answer these two questions:

  • Do you set any command line flags in production?
  • If yes, what are they?
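
If you are not sure what your production JVM was actually started with, one way to find out is to ask it. The sketch below uses the standard RuntimeMXBean from java.lang.management; the class name and the small -XX: counting helper are invented for this example.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical class name, used only for this sketch.
class JvmFlagsSketch {

    // The arguments passed to the JVM itself (not to main), for example
    // -Xmx settings and -XX: flags.
    static List<String> inputArguments() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments();
    }

    // Counts how many of the given arguments are -XX: style flags.
    static int countXXFlags(List<String> arguments) {
        int count = 0;
        for (String argument : arguments) {
            if (argument.startsWith("-XX:")) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> arguments = inputArguments();
        for (String argument : arguments) {
            System.out.println(argument);
        }
        System.out.println("-XX: flags in use: " + countXXFlags(arguments));
    }
}
```

Run it with the same launcher script as the application to see the flags that really reach the JVM.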



    Posted By: Ben Pollan @ 06/11/2009 12:04 PM     AMD Java Labs     Comments (1)  
