AMD Developer Blogs - AMD Libraries
November 19, 2009
  SuperComputing 2009 - Day 2

Of the Wednesday sessions, one of the most interesting was a talk on MATLAB.  MATLAB has language constructs such as parfor that enable rapid migration to multicore, and the distributed keyword marks an array as suitable for parallel processing.  MATLAB figures out the rest.  In today's world of multicore CPUs, tools like this will be indispensable for getting the most out of your CPU dollars.

Later in the day I attended an OpenMP Birds-of-a-Feather session.  They discussed the roadmap for OpenMP versions.  The 3.1 version is imminent.  Among the features discussed was better CPU affinity.  This got my attention.  One of the keys to repeatably good performance for large multithreaded ACML tasks is ensuring good task CPU affinity.  ACML uses OpenMP to provide parallel operation for many of the BLAS, LAPACK, and FFT routines.  When ACML is called with a large enough problem, it will run in an OpenMP parallel region to divide the problem among available CPU cores.  Thread affinity is needed to keep each thread running on the same core, which maximizes cache reuse and minimizes remote memory accesses.  The numactl API and command are not sufficient.  Numactl will restrict a group of threads to a set of processors, but does not prevent migration of tasks between the specified nodes and cores.

Fortunately, all of the OpenMP-capable compilers used by ACML have implementation-specific ways to lock tasks to cores.  The downside is that each compiler has a different set of environment variables to control this.  It sounds like OpenMP 3.1 will standardize this.  It may be a while before the compilers catch up and implement this new feature, so in the meantime, check the documentation for each compiler to determine the best way to enforce affinity.



-------------------------
This response is provided for informational purposes only, is provided “AS IS” and does not obligate AMD to provide any of the services, technology, or programs described.


    Posted By: Chip Freitag @ 11/19/2009 11:13 AM     AMD Libraries     Comments (0)  

November 18, 2009
  SuperComputing 2009

Online at last. I can't believe some hotels still charge for wireless internet.  Here at the SuperComputing 2009 show in sunny Portland, Oregon, wireless is free, fast, and reliable.

For the first time at a supercomputing show, I have no booth demo, so aside from the usual meetings I'm free to actually attend technical sessions. There are way too many sessions for one person to attend, so difficult decisions must be made. My priorities are learning more about how real applications are using sparse solvers and what kind of performance is being reported, finding FFT applications that we can use for benchmarking and rounding out our feature set, exploring how MATLAB users are calling ACML, and looking at how GPUs are being adopted in the HPC market.

The opening keynote by Justin Rattner was kind of interesting. His talk was about 3D internet and ways to catalyze exponential growth for high performance computing. He talked about OpenSim, and one of his demos was a realtime talk with a University of Utah researcher who was represented locally by a sim'd avatar inhabiting a world populated by sim'd ferns which were the subject of his research, all rendered in realtime. The fern simulations were interacting with many aspects of the environment, which was also simulated. It's easy to see how you could create complex simulation worlds that require huge amounts of computing power. Oh, and there's a bunch of software needed also.

One of the technical sessions looked like just what I was looking for on sparse solvers. The big takeaway for me is that the primary task most people are using as a metric is as simple as solving Ax=b, where A is large and sparse. There are many ways to solve the problem, and the techniques used are very much dependent on the nature of the sparse system. The example problems used in the presentations are drawn from a variety of problem areas, and have matrix sizes on the order of 1 million by 1 million, with about a hundred million non-zero elements.

The set of data found at the Matrix Market (http://math.nist.gov/MatrixMarket/) features much smaller matrices, so some searching will be necessary to find appropriate large problem examples suitable for today's faster computers.

It's a bit humbling that I'm stumbling into an area of computing that's been well known for the last 20 years. Many of the people who have worked on sparse solver applications over the years are here at the show.

 

 



-------------------------

    Posted By: Chip Freitag @ 11/18/2009 12:45 PM     AMD Libraries     Comments (0)  

September 7, 2009
  Framewave Multipass Build System

Developing libraries can be difficult, fun, and interesting; an equally difficult task is testing a library and distributing it so that other developers can use it in their projects.  The big advantage of using a library for a given piece of functionality is that it has already been tested and optimized for various platforms.  A library optimized for particular platforms needs a dispatch mechanism to select the best optimized path for the processor it is running on.  I have found that the build system from the Framewave library provides a good solution to this.

 Derived from the AMD Performance Library, Framewave is a free-of-charge, open-source collection of popular image and signal processing routines designed to accelerate application development, debugging, multi-threading and optimization on x86-class processor platforms. This library has three paths of optimized code:  a reference code (C code) path, an SSE2 code path, and an SSE3 and F10H code path. One reason I found it interesting is that it is open-source; I can go through the code, understand it, and modify it to suit my requirements. It also has a single source bundle for four operating systems (Linux®, Mac, Windows®, and Solaris operating systems).

 Framewave has a different implementation for each of the paths, and the Framewave build system takes care of combining them together and exposing a single signature. To achieve this, Framewave has a custom build system based on the SCons build tool (http://www.scons.org). The advantage of using SCons is that it uses the Python scripting language for its configuration files.

 Framewave has a single source bundle that is termed platform independent and is compiled using a single build system across all the platforms. The tool sets supported are GCC, MSVC, and Sun CC. This build system allows me to build 32/64-bit shared/static libraries with the ability to build either a debug or release version.

 This build system picks up each source file and compiles it n times, n being the number of optimized paths, producing n object files. These n object files are linked together with the stub function, which is exported as the actual function. To understand the build system more, refer to the architecture description here: http://framewave.sourceforge.net/DesignDoc/FramewaveBuildSystem-Architecture.htm
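
 The stub-and-dispatch idea can be sketched in plain C. This is not Framewave's actual code: the names add_ref, add_sse2, add_sse3_f10h, select_add_path, demo_add and the cpu_features flags are all hypothetical, and the three "paths" share one trivial body so that the sketch stays portable. In a real multipass build each path would come from compiling the same source with different settings, and the feature flags would be filled in from CPUID.

```c
#include <stddef.h>

/* Hypothetical feature flags; a real library would derive these from CPUID. */
enum cpu_features {
    CPU_HAS_SSE2 = 1u << 0,
    CPU_HAS_SSE3 = 1u << 1,
    CPU_IS_F10H  = 1u << 2
};

typedef void (*add_fn)(const float *a, const float *b, float *dst, size_t n);

/* Reference (plain C) path. */
static void add_ref(const float *a, const float *b, float *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

/* Stand-ins for the optimized paths. They reuse the reference body here;
   in a multipass build these would be separately compiled object files. */
static void add_sse2(const float *a, const float *b, float *dst, size_t n)
{
    add_ref(a, b, dst, n);
}

static void add_sse3_f10h(const float *a, const float *b, float *dst, size_t n)
{
    add_ref(a, b, dst, n);
}

/* Pick the best available path for the given feature set. */
add_fn select_add_path(unsigned features)
{
    if (features & (CPU_HAS_SSE3 | CPU_IS_F10H))
        return add_sse3_f10h;
    if (features & CPU_HAS_SSE2)
        return add_sse2;
    return add_ref;
}

/* Feature set used by the stub; 0 means "reference path only". */
static unsigned demo_features = 0;

void demo_set_features(unsigned features)
{
    demo_features = features;
}

/* The stub: the one signature that callers see. */
void demo_add(const float *a, const float *b, float *dst, size_t n)
{
    select_add_path(demo_features)(a, b, dst, n);
}
```

 A caller only ever links against demo_add; which of the three bodies runs is decided inside the library. A production version would probably resolve the function pointer once at load time rather than on every call.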

 Producing one DLL file and having only one signature exported for each function is a better option than having multiple DLL files for each of the optimized code paths and then loading the particular DLL depending on the processor. The advantage of having one single large DLL file for the library is that I end up adding only one file to the n files present in my project.

 Overall this build system offers a unique way to bundle software that has different implementations for each processor.

 I'd like to hear what you think.  Is this build system useful in your own work?  What do you like about it, what do you dislike about it?

 Watch out for my next post on Using SCons for building the build system.



-------------------------

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.



Edited: 09/10/2009 at 11:36 PM by jrameshbe


    Posted By: Ramesh J @ 09/07/2009 05:31 AM     AMD Libraries     Comments (0)  

July 6, 2009
  IEEE floating point exception handling in Windows® OS

In this blog, we present an example of how IEEE floating-point (FP) exceptions can be caught when programming in C++ for Microsoft® Windows® using Microsoft Visual Studio (VS). We employ the __try/__except extension available in the VS C++ compiler and the _fpieee_flt filter function to handle exceptions. We specifically talk about IEEE exceptions raised by SSE FP instructions, how the MXCSR register behaves, and some behind-the-scene details.

FP arithmetic in the x86 world has traditionally been done by x87 instructions. But after the advent of the x86-64 (AMD64) architecture, FP math is increasingly done using the SSE FP instructions. Like their x87 counterparts, SSE instructions also raise IEEE exceptions during certain FP arithmetic operations. These exceptions are hardware exceptions raised by the processor to signal abnormal cases and conditions. By default, FP exceptions are masked, which means that they are recorded in a status register but prevented from actually getting raised. On the other hand, if they are unmasked, they will be raised and can alter the program flow. The MXCSR register controls the masking of FP exceptions for the SSE FP instructions. It also acts as the status register that records FP exceptions when those exceptions do occur.

The IEEE FP exceptions are hardware exceptions and hence need support from the OS to get control back to user code when these exceptions occur. The structured exception handling (SEH) mechanism of Windows makes this possible. (Refer to http://msdn.microsoft.com/en-us/library/ms680657(VS.85).aspx). The _fpieee_flt function acts as the bridge in SEH to the user defined handler function. (Refer to http://msdn.microsoft.com/en-us/library/te2k2f2t(VS.80).aspx). The handler is registered using this function, and when the exceptions get filtered by SEH, control is transferred to the handler with all the relevant information about the exception.

Here is an example program to illustrate:

#include <iostream>
#include <float.h>
#include <math.h>
#include <fpieee.h>
#include <windows.h>

extern "C" int handler(_FPIEEE_RECORD *p)
{
    std::cout << "In the handler invoked by _fpieee_flt" << std::endl;
    if(p->Operation  == _FpCodeLog)
        return EXCEPTION_CONTINUE_EXECUTION;
    else
        return EXCEPTION_EXECUTE_HANDLER;
}

int main()
{
    unsigned int cw;

    // Get control word
    _controlfp_s(&cw, 0, 0); // Line A

    // Enable zero-divide exception
    _controlfp_s(0, ~_EM_ZERODIVIDE, _MCW_EM); // Line B

    for(int i=0; i<2; i++)
    {
        __try
        {
            double b, a = 0.0;

            if(i==0)
                b = log(a); // Line C
            else
                b = 1/a; // Line D

            std::cout << "b: " << b << std::endl;
        }
        __except(_fpieee_flt(GetExceptionCode(),
            GetExceptionInformation(), handler))
        {
            std::cout << "In the __except block" << std::endl;
        }
    }

    // Restore control word
    _controlfp_s(0, cw, _MCW_EM); // Line E

    return 0;
}

This code was run on VS 2008 targeting the x64 platform. Since it is a 64-bit target, the code generated will contain SSE FP instructions to perform the FP arithmetic operations.

The _controlfp_s function is the interface to access and modify the MXCSR register. In line A, we store the control word for restoring it later. If the MXCSR register (not the variable cw) is examined we see it is set to 1f80h. This shows that all FP exceptions are masked (Refer to AMD64 architecture programmer's manual volume 1). At Line B, we enable the zero-divide FP exception. Now the MXCSR register changes to 1d80h to unmask that particular exception.

Next, we try two scenarios in which the zero-divide exception can occur. The first is taking the logarithm of zero. According to the IEEE 754 standard's recommendation, this operation should raise an FP zero-divide exception, and the log function does that. The second scenario is a simple divide operation that will raise this exception.

The FP exception handler function checks if the exception was thrown by a log operation. If it is, it returns a code asking for the execution to continue in the __try block. If not, the return code notifies the program to execute the __except block. Refer to http://msdn.microsoft.com/en-us/library/s58ftw19(VS.80).aspx to learn more about __try/__except blocks and exception-handling constants.

In the first iteration when line C is executed, control is transferred to the handler, which then asks control be given back to the __try block where the exception occurred and hence back to the log function. The log function continues and an output of negative infinity is produced. Examining the MXCSR register at various points shows that all FP exceptions are temporarily masked when the control is in the handler (1f80h) and restored when control gets back to the __try block (1d80h).

In the second iteration when line D is executed, control goes to the handler and then to the __except block. In this case the MXCSR register changes to 1d84h after line D and stays that way until the exception masks are restored at line E. If you disassemble the program, you will see that line D is compiled as a divsd instruction. During execution this SSE instruction sets the zero-divide status bit in MXCSR (the 4 in 1d84h), and since the zero-divide mask bit is cleared it causes a hardware FP exception. This exception is trapped by the OS and the control is transferred back to user code through SEH.
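
The three MXCSR values quoted in this post (1f80h, 1d80h and 1d84h) can be decoded with a few lines of bit arithmetic. The sketch below is plain C and only manipulates integers; it does not read or write the real register. The bit positions follow the MXCSR layout in the AMD64 manual (status flags in bits 0-5, mask bits in bits 7-12), while the macro and function names are made up for this illustration.

```c
#include <stdint.h>

/* MXCSR bit positions for the zero-divide exception. */
#define MXCSR_ZE        (1u << 2)     /* zero-divide status flag          */
#define MXCSR_ZM        (1u << 9)     /* zero-divide mask bit             */
#define MXCSR_ALL_MASKS (0x3Fu << 7)  /* IM, DM, ZM, OM, UM, PM (7..12)   */

/* Nonzero when the zero-divide exception is masked (will not trap). */
int zero_divide_is_masked(uint32_t mxcsr)
{
    return (mxcsr & MXCSR_ZM) != 0;
}

/* Nonzero when the zero-divide status flag has been recorded. */
int zero_divide_occurred(uint32_t mxcsr)
{
    return (mxcsr & MXCSR_ZE) != 0;
}

/* Value of MXCSR after clearing only the zero-divide mask bit. */
uint32_t unmask_zero_divide(uint32_t mxcsr)
{
    return mxcsr & ~MXCSR_ZM;
}
```

With these helpers, unmask_zero_divide(0x1f80) gives 0x1d80, which is the change seen at line B, and zero_divide_occurred(0x1d84) is true because the extra 4 is the status flag in bit 2.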

In the first case with the log operation, it is not hard to see that the temporary masking of the exceptions was done by the log function and not by the SEH mechanism of the OS. In this case, the IEEE FP exception was simulated by software (similar to a call to the RaiseException function) and not by a single hardware instruction as in the second scenario.

We hope you find this discussion and example useful. If you have any questions or comments, please post them. In the future, we will discuss similar techniques for Linux®.

Visit AMD's Windows zone (http://developer.amd.com/zones/windows/Pages/default.aspx) for general Windows related information.

 



-------------------------


    Posted By: Bragadeesh Natarajan @ 07/06/2009 07:11 PM     AMD Libraries     Comments (5)  

June 30, 2009
  ACML 4.3.0 Performance Data

Now that the ACML 4.3.0 release is completed and posted live on AMD Developer Central, I’ve been spending time collecting all the performance data needed to document the improvements in the 4.3.0 release.   There are several new features that should show up nicely in performance graphs.  Improvements include a new SGEMM kernel for AMD Family 10h, new DGEMM and SGEMM for Woodcrest, Penryn, and Nehalem Intel processors, improved level 1 BLAS kernels, 3D FFT work, and new scalar acml_mv functions.  It’s a really long list!

You can easily demonstrate these new performance features by using the examples in the performance directory of the ACML installation.  There are examples for a few different routines, and these can be easily modified to demonstrate other routines as well.

A couple of trends are jumping out from the data collected so far.  First, the 4.3.0 Level 3 BLAS routines run much better than previous versions on Intel machines.  They are very competitive with MKL on Intel processors!

Second, the Intel Nehalem is a very impressive processor.  However, Istanbul’s 6 cores can crank out a bunch of raw DGEMM flops.  This graph tells the story:

[Figure: DGEMM performance graph]

More information on ACML 4.3.0 is available on the ACML home page.  If you have feedback on how the new release improves performance for your application, we'd love to hear about it.



-------------------------

    Posted By: Chip Freitag @ 06/30/2009 11:46 AM     AMD Libraries     Comments (1)  

June 29, 2009
  Removing C wrapper functions from the AMD Core Math Library (ACML) to resolve linking issues.

ACML is a significant library of (mostly) FORTRAN subroutines, provided in binary form and available for download at http://developer.amd.com/acml.  Each version of the library has been compiled with a particular FORTRAN compiler, and is compatible with application programs written and compiled with the same compiler.

Although FORTRAN programming has hardly disappeared, if you're reading this blog, the odds are far more likely that you're developing in C/C++ or C#.

Calling FORTRAN subroutines from C/C++/C# is doable, but there are a lot of potential problems and pitfalls.  The C and FORTRAN languages have completely different subroutine naming and argument-passing conventions.  For example, where C/C++ passes parameters by value (except for arrays), FORTRAN passes them by reference.  When you have a multi-dimensional array, FORTRAN stores the data in column-major order; C/C++ uses row-major order.  Different FORTRAN compilers have different conventions for passing strings, for the name of the subroutine entry point, etc.

To help make ACML useful to C/C++/C# programmers, some versions of the library come with support for C compilers, including an "acml.h" header and "C wrapper" functions.  These alternate entry points take care of most of the hassle for you (although it's up to the user to watch out for the row-major versus column-major array problem).
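
The row-major versus column-major issue is easiest to see with index arithmetic. The sketch below is a small illustration in C; the helper names colmajor_get and rowmajor_to_colmajor are hypothetical and are not part of acml.h. It assumes zero-based indices and a leading dimension equal to the number of rows.

```c
#include <stddef.h>

/* Element (i, j) of a column-major matrix with leading dimension lda.
   Indices are zero-based; lda must be at least the number of rows. */
double colmajor_get(const double *a, size_t lda, size_t i, size_t j)
{
    return a[i + j * lda];
}

/* Copy a rows x cols matrix stored the C way (row-major) into the
   layout a FORTRAN routine expects (column-major). */
void rowmajor_to_colmajor(const double *src, size_t rows, size_t cols,
                          double *dst)
{
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            dst[i + j * rows] = src[i * cols + j];
}
```

For a 2 x 3 C array {{1,2,3},{4,5,6}}, the column-major copy holds 1, 4, 2, 5, 3, 6. Passing the original C array straight to a FORTRAN routine would make it see the transpose, which is sometimes exploited deliberately but is an easy source of wrong answers.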

For example, suppose you consulted the section "Determining the best ACML version for your system" in the ACML manual (online here: http://developer.amd.com/cpu/Libraries/acml/onlinehelp/Documents/BestLibrary.html#BestLibrary), and chose to download the Linux IFort64 version for your project.   You would be able to code your project with either Intel (R) FORTRAN  or a compatible C/C++ compiler.  Your choice.

So how does this work?  If a FORTRAN module containing :
           CALL DNRM2 (...)
or
           SUBROUTINE DNRM2 (...)
is compiled with the 64-bit ifort compiler, the linkage name passed to the linker is "dnrm2_" (note the lower-case symbol name with a trailing underscore).  Both the caller and the callee assume that all parameters are passed by reference.

If a C program module containing: 
           #include  <acml.h>
           dnrm2 (...)
is compiled with the 64-bit GNU gcc compiler, the linkage name passed to the linker is "dnrm2"  (lower-case symbol name without the trailing underscore).  The caller passes array parameters by reference, but all other parameters are passed by value.

You can use the "objdump" or "nm" utilities from the GNU binutils package to confirm the external linkage symbols in an object or library file.

So, we can provide a single library with both FORTRAN-callable and C-callable versions of the same routine, because the linkage names used for subroutines are different for the two languages.  The ACML library contains two object modules for each routine defined in "acml.h".  The FORTRAN version exports the symbol with the trailing underscore as the entry point with the FORTRAN calling convention.  A separate "C wrapper" module exports the symbol without the underscore as the entry point for a short routine that resolves the differences in calling conventions and then calls the FORTRAN-compatible version.
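
Here is a minimal sketch of what such a wrapper pair looks like, written entirely in C so it can be compiled on its own. It is an illustration only and not the ACML source: demo_dasum_ and demo_dasum are hypothetical names, and the routine computes a sum of absolute values (the same argument list as DNRM2, chosen because it needs no math library). The sketch assumes a positive increment.

```c
#include <stddef.h>

/* FORTRAN-style entry point: trailing underscore in the name and every
   argument passed by reference. In a real library this would be compiled
   FORTRAN; here it is a stand-in written in C. Assumes *incx > 0. */
double demo_dasum_(const int *n, const double *x, const int *incx)
{
    double sum = 0.0;
    for (int i = 0; i < *n; i++) {
        double v = x[(size_t)i * (size_t)*incx];
        sum += (v < 0.0) ? -v : v;
    }
    return sum;
}

/* C wrapper: same name without the underscore, scalars passed by value.
   It takes the addresses of its own by-value copies and forwards them. */
double demo_dasum(int n, const double *x, int incx)
{
    return demo_dasum_(&n, x, &incx);
}
```

The two entry points differ only in how the scalars arrive. If a caller that passes addresses ends up bound to the by-value wrapper, the wrapper will treat an address as the element count, which is exactly the failure described further down.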

So all is well as long as your project is built with the specific FORTRAN compiler or a compatible C compiler or some combination of those.  But you can run into trouble if yet another compiler is thrown into the mix, or another 3rd-party library which was built with another compiler is used.

One of our users recently ran into exactly this situation.  They wanted to link together their program code, which was compiled with Intel (R) FORTRAN, plus ACML, plus yet another linear algebra library (which I won't name - let's call it library X).  Library X was linked with object code from a different FORTRAN compiler which did not append a trailing underscore to the linkage name.  The calling routine would push references (addresses) of the scalar parameters (such as the array sizes) onto the stack and then call the symbol "dnrm2" (without the underscore).  The linker would match that name with the "C wrapper" for dnrm2, which would expect those parameters to have been passed by value.  It would then execute the dnrm2 algorithm using the address of the array size variable N in place of N itself.  This would probably just crash with a segmentation violation.  If by some miracle it did not crash, it certainly would not compute the correct results.

In some cases the ACML user can make local customizations to the ACML library to work around these problems.  Of course, it is strictly the user's responsibility to ensure that these customizations are appropriate and generate correct linkages.   In this case, the work-around was to remove all of the C wrappers from libacml.a.

The script below shows how this can be done.   The technique used is a quick-and-dirty hack, and not the most efficient or elegant way of accomplishing the same effect. 

#! /bin/sh
#   Make a local copy of the ifort64 ACML static library
cp /opt/acml4.1.0/ifort64/lib/libacml.a ./libacml.a
#   Create a list of all of the C-wrapper modules
ar -t libacml.a | egrep  _cw.o > wrapperlist
#    Create a script to delete all of the C-wrapper modules
#    and execute it.
sed "s/.*/ar -dv libacml.a &/" wrapperlist | bash
#    Clean up
rm ./wrapperlist

One undocumented piece of information makes it easier to remove the "C wrapper" functions from this version of libacml.a:  All of those object modules have names with the suffix "_cw.o".  There is no guarantee that this will be true in other versions of the library or in future releases.

With this knowledge, the "ar -t" and "sed ... | bash" lines of the script are all that is needed to remove these modules.  Of course, this will remove them one at a time, which is remarkably slow and inefficient.  On the other hand, you only need to do this once.  You should expect this script to take a good fraction of an hour to execute, and plan accordingly;  start it when you're ready to leave for lunch or a meeting.

Let us know if this makes ACML more useful for you; we'd like to hear what you're doing with the library. 





-------------------------






 





Edited: 06/30/2009 at 11:42 AM by jim.conyngham@amd.com


    Posted By: Jim Conyngham @ 06/29/2009 03:39 PM     AMD Libraries     Comments (1)  

April 2, 2009
  Faster string operations

libsst.so: super-fast string scanning functions

In the broad scheme of improving application (and system) compute performance, there are a number of optimizations which can be applied at various points in software development and deployment, the most common being the enabling of the optimizer when compiling code.  But when system libraries need to run on a wide variety of hardware platforms, the code in those libraries is written using algorithms appropriate to, and then compiled for, a generic (lowest common denominator) target.  A typical example of this is libc.

As a consequence, there may be a significant amount of performance 'headroom' available, depending upon the instruction set features available to a system's CPU, above and beyond the "generic" code model.  That is, by using different algorithms, it may be possible to significantly increase performance of key routines, and by doing so, increase the overall performance of the entire system.

Taking a look at specifics - one routine which can be significantly sped up is strlen().

For simplicity of code fragments, assume 64-bit Linux®, that all strings fed into strlen() start on a 0mod32 boundary, and that the string to be scanned happens to be 42 bytes long. The implementation can be as simple as a byte-by-byte scan for a zero:

#include <stddef.h>

size_t strlen(const char *src) {
    size_t length = 0;
    while (*src != 0) {    /* stop at the terminating zero byte */
        src++; length++;
    }
    return length;
}

This is a completely valid implementation, but it means src and length get changed 42 times, and 42 memory reads are required.  A more sophisticated approach is to use general purpose integer registers, read 8 bytes at a time, and then check the entire register for a zero byte.  This cuts both the number of instructions executed and the number of memory reads by roughly a factor of 8 (there are some fiddly bits that need to happen on both the leading and trailing part of the string, as there is no requirement that the input to strlen() is conveniently aligned).  The way this is usually done is to construct a mask of 0xfefefefefefefeff, do some simple xor/add arithmetic, and use the carry bit to indicate when there is a zero byte in the register (running objdump -d /lib64/libc.so.6 will reveal the details).
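
To give a feel for the 8-bytes-at-a-time idea, here is a small C sketch. It uses a closely related and widely known bit trick (subtract 0x01 from every byte and look for a borrow into the top bit) and is not the exact sequence used in libc. The function names has_zero_byte and strnlen_words are made up for this example, and the scan is written as a length-bounded search so that it never reads outside the caller's buffer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero if any of the 8 bytes in v is zero. A byte that is zero borrows
   when 0x01 is subtracted, which sets its top bit; the "& ~v" term removes
   bytes whose top bit was already set in the input. */
int has_zero_byte(uint64_t v)
{
    return ((v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL) != 0;
}

/* Length of the string in buf, reading at most cap bytes.
   Returns cap if no zero byte is found. */
size_t strnlen_words(const char *buf, size_t cap)
{
    size_t i = 0;

    /* Scan 8 bytes per iteration while a full word is still in bounds. */
    while (cap - i >= 8) {
        uint64_t w;
        memcpy(&w, buf + i, sizeof w);   /* alignment-safe 8-byte load */
        if (has_zero_byte(w))
            break;                       /* the zero is in this word */
        i += 8;
    }

    /* Finish byte by byte: at most 8 steps inside the flagged word,
       or the tail of a buffer that is not a multiple of 8. */
    while (i < cap && buf[i] != '\0')
        i++;

    return i;
}
```

For the 42-byte example string in a 64-byte buffer, the word loop runs five times without a hit, flags the sixth word (bytes 40 to 47), and the byte loop takes two more steps to land on 42.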

But it is possible to do even better by switching to an algorithm that does not use any bitmasks or arithmetic to detect the NUL byte.  Enter the AMD Family 10h processors, with support for SSE2 instructions and the POPCNT (population count) instruction.  There are SSE2 instructions which perform operations on 16 bytes at a time in XMM registers.  The trick is to find the zero byte in an XMM register and then convert that into a value that can be used by control flow instructions, as well as to increment the value of length appropriately.  Here's the pseudo code:

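PXOR %xmm1, %xmm1                 // build a mask of all zeroes
do {
    MOVDQA     %xmm2, srcp;       // load the next 16 bytes (srcp is 16-byte aligned)
    PCMPEQB    %xmm2, %xmm1;      // compare data vs. zeroes
    PMOVMSKB   %rax, %xmm2;       // put result mask into integer register
    srcp += 16;                   // advance past the 16 bytes just scanned
    length += 16;
} while (%rax == 0);              // on exit, the zero byte is in the last 16 bytes counted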

So how does this work?

The PCMPEQB instruction compares the bytes from the source against the corresponding bytes in %xmm1 (which are all zero). Each byte in the result register is set to 0xff if it matches (i.e., is a zero byte), and 0x00 if not.  Thus, if any byte in the source is zero, the result will have a 0xff byte; if no source bytes were zero, the result will be zero. We need to convert that result into something in a general purpose register that we can use for control flow (to exit the 16-byte-at-a-time scan loop) and also use to find the rightmost zero byte in the source register.  We use the PMOVMSKB instruction, which copies the most significant bit of each byte from an XMM register into the corresponding bit in a general purpose register.  If the result is non-zero, it means one of the source bytes was zero.  This solves the problem of control flow for the loop.  Now, how do we figure out which byte was the first zero?  Well, this is equivalent to computing a domain-bound case of the ffs() function, which finds the first (rightmost) bit that is set in its argument.  (ffs() returns 0 if its input value is 0, but we already know for a fact that at least one bit is set.)
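
For readers who prefer C to assembly, the compare-and-extract step can be written with the SSE2 intrinsics from emmintrin.h. This is a sketch and not the libsst source: the function name zero_byte_mask16 is made up, an unaligned load is used so that any 16 readable bytes will do, and a scalar fallback is included so the code still builds where the compiler does not define __SSE2__.

```c
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Returns a 16-bit mask in which bit i is set when p[i] == 0.
   p must point at 16 readable bytes. */
unsigned zero_byte_mask16(const unsigned char *p)
{
#if defined(__SSE2__)
    __m128i zeroes = _mm_setzero_si128();                  /* PXOR      */
    __m128i chunk  = _mm_loadu_si128((const __m128i *)p);  /* 16-byte load */
    __m128i eq     = _mm_cmpeq_epi8(chunk, zeroes);        /* PCMPEQB   */
    return (unsigned)_mm_movemask_epi8(eq);                /* PMOVMSKB  */
#else
    /* Scalar fallback that builds the same mask one byte at a time. */
    unsigned mask = 0;
    for (int i = 0; i < 16; i++)
        if (p[i] == 0)
            mask |= 1u << i;
    return mask;
#endif
}
```

A mask of zero means "keep scanning". Any non-zero mask ends the loop, and its lowest set bit marks the first zero byte within the 16-byte chunk.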

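Here’s the clever bit: this can be computed with the POPCNT instruction:

int bounded_ffs(unsigned int input) {
    /* input must be nonzero; POPCNT stands for the instruction
       (for example, __builtin_popcount with GCC) */
    return POPCNT(input^(~(-input)));
}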

While this looks messy, it compiles down to a three instruction sequence, with no branches: LEA/XOR/POPCNT.
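
It is worth checking why the expression works. In two's complement arithmetic ~(-input) equals input - 1, and subtracting 1 flips the lowest set bit together with every bit below it. The XOR therefore leaves exactly bits 0 through k set, where k is the position of the lowest set bit, and counting them gives k + 1. The sketch below spells that out in portable C; popcount32 and naive_ffs are helper names invented for this example, with a loop standing in for the POPCNT instruction.

```c
#include <assert.h>

/* Portable population count; POPCNT does this in a single instruction. */
static int popcount32(unsigned x)
{
    int n = 0;
    while (x != 0) {
        x &= x - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* 1-based position of the lowest set bit. input must be nonzero. */
int bounded_ffs(unsigned input)
{
    assert(input != 0);
    /* input - 1 is the same value as ~(-input) */
    return popcount32(input ^ (input - 1));
}

/* Straightforward reference version, used only for comparison. */
int naive_ffs(unsigned input)
{
    int pos = 1;
    assert(input != 0);
    while ((input & 1u) == 0) {
        input >>= 1;
        pos++;
    }
    return pos;
}
```

For the 42-byte example string, the zero byte sits at offset 10 of the third 16-byte chunk, so bit 10 is the lowest bit set in the mask. Taking a mask of 0x0400: 0x0400 - 1 = 0x03ff, the XOR gives 0x07ff, and the population count is 11. The string length is then 32 + 11 - 1 = 42.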

Building upon this technique, by observing that many of the string scanning operations are variants of strlen, the same general approach can be applied:

  • strchr() - a search for the first occurrence of an arbitrary byte in a string (same as index())
  • strrchr() - a search for the last occurrence of an arbitrary byte in a string (same as rindex())
  • memchr() - a length bounded search for the first occurrence of an arbitrary byte (as opposed to a zero byte)
  • strnlen() - a length bounded search for a null byte

Implement all of these, and you get libsst.so - a set of vectorized "superstring" routines tuned for use on the AMD Family 10h CPUs.

Rick Gorton



-------------------------

Richard Gorton


Performance Enhancement Technologies





Edited: 04/16/2009 at 11:57 AM by AMD Developer Blogs Moderator


    Posted By: Richard Gorton @ 04/02/2009 01:32 PM     AMD Libraries     Comments (6)  

September 17, 2008
  ACML Example Programs and the Bonus RNG

The ACML examples directory contains a wealth of programs that demonstrate how to use many of the routines found in AMD’s Core Math Library (ACML).  Two of these programs in particular provide a primary source of documentation on how to take advantage of the User Supplied Random Number Generator (RNG) feature found in ACML.  These two programs are drandinitializeuser_c_example.c and drandinitializeuser_example.f.  These are simply C and Fortran versions of the same basic program.

First a little background on the RNG routines.  ACML includes 5 base generators: the NAG Basic Generator (aka MCG59), L’Ecuyer Combined Recursive Generator, Blum-Blum-Shub, Wichmann-Hill (based on the 1982 paper), and Mersenne Twister (MT19937).  There are also 26 distribution generators, each of which starts with a base sequence from any of the 5 base generators or from a user supplied generator.  It should be noted that all of these generators produce pseudo-random numbers.  This means that the sequences are computed deterministically but are designed to behave, statistically, like a true random sequence.  They are, therefore, repeatable.  This has advantages for many programs where results from one run to the next must be directly compared, and having different numbers would cause difficulties in the comparison.  Using pseudo-random numbers, the program acts like it has truly random numbers, but subsequent runs will behave the same for debugging and performance comparisons, as long as the same seeds (initial values) are used.

The interface to the user supplied generators is described in the ACML User Guide.  In addition to DRANDINITIALIZEUSER, two user supplied routines are described.  These user supplied interfaces are documented as UINI and UGEN.  These two subroutine names are provided to the DRANDINITIALIZEUSER routine.  Their names are supplied by you, and can be anything you want them to be. 

The example programs for DRANDINITIALIZEUSER implement an updated version of the Wichmann-Hill generator.  This RNG is described in the 2005 paper by Wichmann and Hill (http://www.eurometros.org/file_download.php?file_key=247).  

Looking at the source code for the example, you will find three parts.  The first is the example code that calls drandinitializeuser.  This is very similar to the other RNG example calls.  The other two routines, WHINI and WHGEN, are the interesting parts.  These are the initializer and generator for the new Wichmann-Hill generator.  You can easily compare the statements in the code to the algorithm defined by the 2005 paper.  Note that this generator has a period of 2^121.  That is, it can generate up to 2.65 x 10^36 pseudo-random numbers before the sequence starts repeating.

State

WHINI contains the initialization code.  It accepts the seed values provided to DRANDINITIALIZEUSER and fills them in as initial values of a STATE array.  This array has 28 elements.  The 4 seed values are filled into STATE (5:8), and correspond to the ix, iy, iz, and it variables from the paper.  STATE (9:24) correspond to the 4 large primes and their dividers as specified in the paper.

Generation

When a sequence of numbers is requested through a call to one of the distributions that uses the state array supplied by DRANDINITIALIZEUSER, the WHGEN routine is called to supply that sequence.  It will perform the Wichmann-Hill algorithm for each value required using the parameters stored in state.  When done it will update STATE (5:8) with the last values of ix, iy, iz, and it.
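To make the generation step concrete, here is a minimal standalone sketch of one Wichmann-Hill 2005 step.  It is not the ACML example's WHGEN and does not use the ACML UGEN interface.  The WHState struct and the function names are hypothetical, and the multipliers and moduli are quoted from my recollection of the 2005 paper, so check them against the paper before relying on them.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical container for the four running values ix, iy, iz, it.
struct WHState {
    std::int64_t ix, iy, iz, it;
};

// One step of the four-component generator: advance each component with its
// own multiplier and prime modulus, then combine the four fractions.
// Constants are assumed from the 2005 paper; verify before use.
inline double wh2005_next(WHState& s) {
    s.ix = (11600 * s.ix) % 2147483579LL;
    s.iy = (47003 * s.iy) % 2147483543LL;
    s.iz = (23000 * s.iz) % 2147483423LL;
    s.it = (33000 * s.it) % 2147483123LL;
    double w = s.ix / 2147483579.0 + s.iy / 2147483543.0 +
               s.iz / 2147483423.0 + s.it / 2147483123.0;
    return w - std::floor(w);  // keep only the fractional part, in [0, 1)
}

// Fill out[0..n-1]; the state is left holding the last ix, iy, iz, it,
// which mirrors how the text describes STATE(5:8) being updated.
inline void wh2005_fill(WHState& s, double* out, std::size_t n) {
    assert(out != nullptr || n == 0);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = wh2005_next(s);
    }
}
```

Because the state is passed in and updated in place, a later call continues the same sequence, and starting again from the same seeds reproduces it exactly.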

 

Multiple Sequences

The WH 2005 paper spells out a method of producing an independent set of initial values for the sequence.  

Starting with the value for ix (or SEED(1), which is stored in STATE(5) by WHINI),  new values for ix are computed by subsequent computations of the equation:

ix := 46340 x ix mod 2147483579. 

New values for iy are computed using:

iy := 22000 x iy mod 2147483543.

 These new ix and iy values are then used – along with the already supplied iz and it – to form a new unique sequence of numbers.

 Using this method, you can generate up to 2.3E18 unique sequences of random numbers, each with 2.6E36 numbers.  That’s a lot of numbers!
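The two recurrences above can be sketched in a few lines.  The WHSeeds struct and the function name are hypothetical, chosen only for illustration; the multipliers and moduli are the ones given in the text.  64-bit arithmetic is used so that the product cannot overflow before the modulus is taken.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical holder for the four seed values ix, iy, iz, it.
struct WHSeeds {
    std::int64_t ix, iy, iz, it;
};

// Derive the seeds for the next independent sequence: advance ix and iy
// with the two recurrences from the text and leave iz and it unchanged.
inline WHSeeds next_sequence_seeds(WHSeeds s) {
    assert(s.ix > 0 && s.ix < 2147483579LL);
    assert(s.iy > 0 && s.iy < 2147483543LL);
    s.ix = (46340 * s.ix) % 2147483579LL;
    s.iy = (22000 * s.iy) % 2147483543LL;
    return s;
}
```

Calling next_sequence_seeds repeatedly, once per additional sequence, gives each sequence its own starting values that can then be passed on as seeds.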

An alternate method of producing more sequences would be to choose 4 new large primes and derive the corresponding multipliers.

Summary

This write-up introduces the ACML user supplied random number generator feature, and the example programs that demonstrate its use.  As an extra bonus, these examples implement an upgrade of the Wichmann-Hill RNG.   




    Posted By: Chip Freitag @ 09/17/2008 06:01 PM     AMD Libraries     Comments (0)  

September 24, 2007
  Put your MOVEMASK on

As we mentioned last week, we started to talk about the misaligned access feature and the options available before Orcas. In this week's post, we explore those options and continue with a discussion of the movemask instruction.


Any SIMD instruction set has one inherent deficiency. Since all the operations are done in parallel, it is not possible to write code that checks individual values and branches to specific code. Simply put, we cannot have 'if' statements.


The movemask instruction is a particularly valuable instruction for this reason. It creates a bit mask from the high (sign) bit of each value in the xmm register. Because an SSE comparison sets every bit of an element when the comparison is true for that element, you can run a comparison on the data in an xmm register and then use movemask to bring the results into a regular x86 register. This can be used for branching and you essentially now have an 'if' statement, even with SSE code.


On K8 processors, this instruction didn't have 'significant' latency, but to make an effective 'if' statement, we needed to use the instruction in every loop iteration. Even though it takes only 6 cycles on a K8, that is six extra cycles per iteration. Additionally, it is a vector path instruction, which means that it blocks other instructions from executing in parallel while it is executing. On a superscalar architecture, this can mean a fairly big performance loss because, in those six cycles, another two instructions could potentially have finished!


On f10h, this instruction has its latency reduced (though this reduction is simply a result of going to the 128-bit FPU) and more importantly, it is no longer a vector path! Hence, this now becomes a very workable instruction, even if used in every loop iteration. The new latency is one cycle more than a simple integer add instruction.


This leads to some additional code changes, particularly in division operations, which are extremely expensive and can be totally avoided if, let's say, all the values in the register holding the divisors are 1. Of course, our changes again went in with the namespaces, to ensure we still keep maximal optimization for both K8 and f10h processors.
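As a rough illustration of the divide case just described, the sketch below uses the standard SSE intrinsics _mm_cmpeq_ps and _mm_movemask_ps to test whether all four divisors are 1 and, if so, skips the division. The function names are hypothetical and this is not the APL code; it only shows the compare, movemask, branch pattern.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics

// True when all four single precision divisors are exactly 1.0f.
// The compare sets each matching element to all ones; movemask then
// gathers the four sign bits into the low four bits of an integer.
inline bool all_ones(__m128 divisors) {
    __m128 eq = _mm_cmpeq_ps(divisors, _mm_set1_ps(1.0f));
    return _mm_movemask_ps(eq) == 0xF;
}

// Divide four values at once, skipping the expensive divide when
// every divisor is 1.
inline __m128 divide_skip_ones(__m128 num, __m128 den) {
    if (all_ones(den)) {
        return num;
    }
    return _mm_div_ps(num, den);
}

// Convenience wrapper for plain float arrays of length 4.
inline void divide4(const float* num, const float* den, float* out) {
    assert(num && den && out);
    _mm_storeu_ps(out, divide_skip_ones(_mm_loadu_ps(num), _mm_loadu_ps(den)));
}
```

Whether the branch pays off depends on how often the divisors really are all 1 and on the cost of movemask on the target processor, which is exactly the K8 versus f10h trade-off discussed above.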


These were the optimizations we could do on our end from within our C++ code. The next step is to see how we can leverage the compilers to give us better performance for f10h. That should be another interesting exercise. In the end, we will make sure that barring toolset and logistical restrictions, APL runs the fastest code possible on whichever processor it is running.

--

One of the new instruction sets introduced in the Third Generation AMD Opteron(TM) processor is Advanced Bit Manipulation (ABM), comprising two instructions that operate on general purpose registers: LZCNT and POPCNT. We'll first explore what POPCNT can do for you.


In almost every interview I have given to date, I have been asked the question, "How would you calculate the number of bits set in a given 32-bit word?" Of course, by the thirtieth time I was asked that question, I was finally able to figure out what answer to give, which wasn't very efficient. If you've tried to calculate this number yourself, or have tried to answer this question for others, I hope this discussion will be helpful because there are many ways to do it in software. One way is using lookup tables, which access memory, but multiple lookups are needed (unless you have a 4 GB table for all 32 bits!). Alternatively, you can use another common algorithm. Subtract one from the number, then perform the AND operation with the original number. Do this until the number is 0. The number of iterations it takes for the number to become zero is the number of bits set. A typical pop count function using this method would look like this:


unsigned int popcount(unsigned int x)
{
    unsigned int count;
    /* Each iteration clears the lowest set bit of x. */
    for (count = 0; x; x = x & (x - 1), count++)
        ;
    return count;
}


This function is generic and can be applied to multiple integer types. If your integer size is limited, there are a few more techniques that are floating around (easily Googled) but none of them are as efficient as one instruction.


Before I describe POPCNT ("pop count" or population count), the first of two advanced bit manipulation instructions that are provided in the new AMD Family 10h processors, you might have the exact same question that I had the first several times I was asked this in an interview:


Why on earth would anyone need this?


As it turns out, counting the number of bits set in a word (a machine word, that is), can be quite useful. I started realizing this when I moved to using bit arrays for computations.


Let me give you a quick scenario. I have implemented an array which stores the results of a network transmit operation. Each element represents a true or a false, depending on whether that particular block transmitted correctly or not. I need to use this data to calculate how much packet loss I have experienced.


Let's say that block numbers 7, 32, and 62 were not transmitted. The values at array index 7, 32, and 62 would be set to 1 and the rest to 0. If I am transmitting megabytes of information, this array could grow very large and it would be using a minimum of 8 bits of storage for each 1 or 0 it needs to store (if I am using the smallest data type provided to me) unless I use a bit array.


If I use a bit array, my array becomes much smaller, which means that I need to do fewer memory accesses to traverse the entire array, less memory is being used, etc. The only problem is with accessing an element in this array. To see if a bit is set in the bit array, I need to read one chunk of the array into a word and then shift bit by bit to see if anything has failed.


Enter, pop count! Pop count would simply tell me how many bits were set in the word I've just accessed, with just one instruction! Let's take a look at the gain I realize by using POPCNT.


For 1MB of data with a 1k block size, I have 1,000 elements. Therefore, the number of instructions taken by each approach would have been:


Original [byte array based]:
Execution: For each element, I need to read the byte value and check if it is 1. If it is, I need to increment my counter.
Cost: 3 instructions [read, compare, increment] x 1000 = 3,000
Results: 3k instructions. Not a very good idea.


Bit array [without pop count]:
Execution: For every 32 elements, I need to read one word, shift the bits out, check if the left-most bit is 1 or not (check the sign of the resultant number), and then increment my counter if the bit is 1.
Cost: (1 read + 32 shifts + 32 compares + 32 increments) x (1000/32) = 3032
Results: The instruction count is about the same as for the byte array (3,032 versus 3,000), but with only one memory read per 32 elements instead of one per element, this approach would still be a lot faster.


Bit array [with pop count]:
Execution: For every 32 elements, I need to read one word, do one pop count, and increment my counter by the return value from the pop count.
Cost: (1 read + 1 POPCNT + 1 add to the counter) x (1000/32) = 94
Results: Using the POPCNT instruction here gives me a whopping 32x reduction of instructions, representing a significant performance gain! This is with using 32-bit words. For 64 bits, there is even greater performance gain.


NOTE: There are other algorithms that could result in fewer instructions without using pop count, but we have chosen this x and x-1 approach because it is easily portable. Other algorithms that could perform this function faster often require a fixed number of bits, and hence are not suitable for all purposes. Even so, pop count is faster than the most optimal approach without pop count.
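The bit array approach with pop count can be sketched as follows. The function names are hypothetical. The sketch assumes GCC or Clang, where the __builtin_popcount builtin is available and may compile to a single POPCNT instruction when the target supports it; on other compilers it falls back to the portable x and x-1 loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of set bits in one 32-bit word.
inline unsigned popcount32(std::uint32_t x) {
#if defined(__GNUC__) || defined(__clang__)
    return static_cast<unsigned>(__builtin_popcount(x));
#else
    unsigned n = 0;
    for (; x; x &= x - 1) {
        ++n;  // clears the lowest set bit each time around
    }
    return n;
#endif
}

// Record that a block failed to transmit: one bit per block.
inline void mark_block_lost(std::uint32_t* bits, std::size_t block) {
    assert(bits != nullptr);
    bits[block / 32] |= std::uint32_t(1) << (block % 32);
}

// Total lost blocks: one read, one pop count, one add per 32 blocks.
inline unsigned count_lost_blocks(const std::uint32_t* bits, std::size_t words) {
    unsigned lost = 0;
    for (std::size_t i = 0; i < words; ++i) {
        lost += popcount32(bits[i]);
    }
    return lost;
}
```

With 32 words covering 1,024 blocks, marking blocks 7, 32, and 62 as lost and then calling count_lost_blocks returns 3.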


In addition to this specific scenario, there are several applications where pop count can substantially increase performance. Pop counts are used in cryptography (in fact, this instruction is also commonly called the 'canonical NSA instruction' because of the fact that the NSA refused to buy processors which didn't support this instruction), encoding/decoding, databases (for quickly assessing information about data), and many others. One application that I find POPCNT most useful for is to quickly calculate Hamming distances. A Hamming distance is essentially a measure of how different one word is from another. Remember, this is not how different the values held by the words are (we could just use a subtract instruction to find that out!) but how the words themselves differ. For machine words, it is defined as the number of bits that are different between the two words.


For example, take the following 8-bit words:


00110001
11010001
   ^^^^^


The lower 5 bits, denoted by the carets, are the same; hence only the upper three bits are different. Therefore, the Hamming distance between these two words is 3.


A POPCNT instruction can give us the Hamming weight of a word, which is the Hamming distance between that word and the all-zeros word of the same size. Because the difference between any particular word and a word with all 0s is the number of bits which are set, that is exactly what POPCNT gives us!


Of course, this doesn't give us the Hamming distance directly, but that's easily fixed. All we have to do is zero out the common bits between the two words and the result holds only the bits that are different. Our friendly neighborhood XOR instruction can do that for us, leaving us with the following sequence of instructions for calculating the Hamming distance between two words:


; RAX and RBX contain the two words


mov rcx, rax
xor rcx, rbx
popcnt rcx, rcx


; RCX now contains the Hamming distance between RAX and RBX
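For readers working in C++ rather than assembly, the same XOR then pop count idea can be sketched as below. The function names are hypothetical, and the __builtin_popcountll builtin is an assumption about the compiler (GCC or Clang); a portable loop is used elsewhere.

```cpp
#include <cassert>
#include <cstdint>

// Number of set bits in a 64-bit word.
inline unsigned popcount64(std::uint64_t x) {
#if defined(__GNUC__) || defined(__clang__)
    return static_cast<unsigned>(__builtin_popcountll(x));
#else
    unsigned n = 0;
    for (; x; x &= x - 1) {
        ++n;
    }
    return n;
#endif
}

// XOR leaves a 1 only where the two words differ; counting those
// bits gives the Hamming distance.
inline unsigned hamming_distance(std::uint64_t a, std::uint64_t b) {
    return popcount64(a ^ b);
}
```

For the two 8-bit words in the example above, 00110001 (0x31) and 11010001 (0xD1), hamming_distance returns 3.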


Hamming distances can be used to calculate things like error in data, as in how much error exists. They can be used as thresholds in encoding or decoding of audio/video. In fact, any place where you need any sort of fuzzy logic, Hamming distances could be useful. There are many other potential applications, but too many to be covered here.


This covers the first of two new advanced bit manipulation instructions that are introduced with the new Family 10h architecture. This leaves us with another interesting instruction, LZCNT, which counts the number of leading zeros in a given word. But, I'll leave that for next time.


-Rahul Chaturvedi



Edited: 11/16/2007 at 12:40 PM by AMD Developer Blogs Moderator

    Posted By: AMD Libraries Team @ 09/24/2007 10:01 AM     AMD Libraries     Comments (0)  

September 17, 2007
  The Conceptual Shuffle

Most people who have done SIMD programming would tell you how invaluable the shuffle instruction is. It is used extensively to get data in the correct slots in the 128-bit XMM register. This is especially true for complex number operations. Unfortunately, this instruction also has a very high cost on the K8 processors. On f10h though, this instruction is now blazing fast. However, at the same time, the instructions we were using in lieu of using shuffles (movhlps, movlhps in particular), are now going to be much slower on f10h!


Our dilemma is how to ensure that we still maintain optimal speed on both the K8 processors and the new f10h processors.


In the end, we decided to abstract everything down to a 'conceptual' shuffle. The data movement within a register achieved by the movhlps and movlhps instructions is also possible by using a shuffle instruction. This is true of almost any data reorganization instruction that works on data sizes greater than 16 bits. We scoured our code, looking for movhlps, movlhps, and any other instructions that we had been using instead of the shuffle instructions. We replaced the intrinsic calls to all of these instructions with inline function calls to our own set of abstracted 'shuffle' functions. We scoped these calls with a preprocessor macro that was defined to point to the namespace that holds the shuffle functions for that particular processor.


The code looks something like:


OPTIMIZATION::Shuffle_a0b0a1b1(xmmreg1, xmmreg2)


The shuffle function call is defined in two different namespaces. One namespace holds the K8 optimized shuffle code and the other holds f10h optimized code.


Now, to compile the source file for K8 optimizations, we'd set the OPTIMIZATION preprocessor macro to the name of the namespace which holds the K8 optimized code. If we want to compile the same source file for f10h optimizations, we set the preprocessor macro to point to the namespace containing the f10h optimized shuffles.
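A minimal sketch of this arrangement is shown below. The namespace names k8 and f10h, the function Shuffle_a0a1b0b1, and the helper low_halves are hypothetical and are not the actual APL code. The sketch relies on the fact that the movlhps intrinsic and a shuffle with selector (1, 0, 1, 0) produce the same result, so the caller sees one 'conceptual' shuffle.

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics

namespace k8 {
// K8 flavour: movlhps builds (a0, a1, b0, b1).
inline __m128 Shuffle_a0a1b0b1(__m128 a, __m128 b) {
    return _mm_movelh_ps(a, b);
}
}  // namespace k8

namespace f10h {
// f10h flavour: the same data movement expressed as a shuffle.
inline __m128 Shuffle_a0a1b0b1(__m128 a, __m128 b) {
    return _mm_shuffle_ps(a, b, _MM_SHUFFLE(1, 0, 1, 0));
}
}  // namespace f10h

// Normally set on the compiler command line, e.g. -DOPTIMIZATION=k8.
#ifndef OPTIMIZATION
#define OPTIMIZATION f10h
#endif

// Calling code is written once and picks up whichever namespace the
// macro names at compile time.
inline void low_halves(const float* a, const float* b, float* out) {
    assert(a && b && out);
    _mm_storeu_ps(out, OPTIMIZATION::Shuffle_a0a1b0b1(_mm_loadu_ps(a),
                                                      _mm_loadu_ps(b)));
}
```

Because the choice is made by the preprocessor, there is no run-time dispatch cost inside the loop; the same source file is simply compiled once per target.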


This solved our shuffle problem. Next, we needed to make sure we were taking advantage of the misaligned access and changed instruction scheduling.


Unfortunately, Microsoft hasn't yet released its new compiler, code-named 'Orcas', which contains support for features in the f10h processors. Until then, the only way to leverage the misaligned access feature is to write code in assembly, which APL does only in cases where we get a significant advantage. Because of portability and maintenance issues, we generally avoid assembly even when it could give us a 10-15% advantage, which would be rather good for an optimization. Since this wasn't a case that would justify going to assembly, we moved on to the last, but not least, optimization on our list for f10h. More on that next time.



-Rahul Chaturvedi



Edited: 11/16/2007 at 12:41 PM by AMD Developer Blogs Moderator

    Posted By: AMD Libraries Team @ 09/17/2007 10:27 AM     AMD Libraries     Comments (0)  

September 10, 2007
  All SIMD All the Time

Working on the AMD Performance Library, our life is SIMD (Single Instruction, Multiple Data). An 8-bit add operation takes one cycle; doing sixteen 8-bit add operations with a single SSE integer add instruction (which works on 128 bits at a time) takes four cycles. The advantage of the SSE add is enormous (sixteen results in four cycles instead of sixteen), so if an operation can be done in parallel, we make sure that it uses an instruction from one of our SIMD instruction sets: SSE/SSE2/SSE3.
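As a small illustration of sixteen 8-bit adds in one instruction, the sketch below uses the SSE2 intrinsic _mm_add_epi8. The function name is hypothetical and, to keep the sketch short, it assumes the length is a multiple of 16.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Adds two byte arrays element by element, sixteen bytes per iteration.
// Results wrap around modulo 256, as with a scalar 8-bit add.
inline void add_bytes_sse2(const std::uint8_t* a, const std::uint8_t* b,
                           std::uint8_t* out, std::size_t n) {
    assert(n % 16 == 0);  // assumption made to avoid a scalar tail loop
    for (std::size_t i = 0; i < n; i += 16) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i),
                         _mm_add_epi8(va, vb));
    }
}
```

Production code would also handle a length that is not a multiple of 16, typically with a short scalar loop for the remaining bytes.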

The family 10h processors (f10h), initially unveiled as "Barcelona", have several new and exciting changes in their FPU, which result in vast improvements for SIMD instructions. In APL, our aim was to leverage these changes and other new features in order to give the best performance possible to our users. Here's how we did it.

The first major improvement on the f10h processors is that the FPU has been widened from 64 bits to 128 bits. Let's look at what this means.

Previously, an SSE 'add' instruction (let's take PADDW) would have taken 4 cycles to execute. According to the Software Optimization Guide for the K8 family of processors, the latency for this operation is only 2 cycles, but because the FPU used to be 64 bits wide, each 128-bit SSE instruction was broken into two, and each part was executed separately. With the 128-bit FPU, the instruction no longer needs to be split. So straight out, most of our code runs close to twice as fast on a Barcelona machine without us touching a thing!

Life can't be that easy though, so we started going through the instruction set with a fine-toothed comb. Now most instructions in the SSE instruction set have similar latencies but several instructions have drastically different latencies (when optimizing, 4 cycles can be significant). There are always instructions that a developer writing SSE optimized code needs to avoid simply because there are sometimes two (or more) other instructions that can do the job in less time. One particular example of this is the shuffle instructions on the K8 processor, which we will discuss in our next post.

--Rahul Chaturvedi



Edited: 11/16/2007 at 04:50 PM by AMD Developer Blogs Moderator

    Posted By: AMD Libraries Team @ 09/10/2007 10:40 AM     AMD Libraries     Comments (0)  

  Some Errata on SSE128: AMD's New Floating-Point Enhancements

Back in June, AMD Developer Central published a feature article detailing, for the first time, AMD's SSE128 extensions (aka 'FP128') for the Barcelona processor platform. With the impending release of Third-Generation AMD Opteron(TM) processors, we bring you some errata and updates.

The paragraph just below Table 1, on page 2 of the article, is slightly incorrect.  AMD Opteron processors, new or old, do not perform 128-bit floating point operations. It is important to understand what the contents of a 128-bit SSE register represent.  For the case of double precision data, a register, say XMM0, holds one or two 64-bit double precision values.  For single precision, XMM0 holds one or four 32-bit single precision values.

This is true for both K8 and for the new Third-Generation Opteron.

The paragraph is correct in saying that K8 operates on 64-bit chunks.  K8 has one multiply unit and one add unit.  Each of these could retire one double precision operation per clock, for a maximum throughput of 2 FLOPS per cycle.  This was the maximum throughput regardless of whether the code was scalar (e.g. MULSD), or vectorized (MULPD).

For single precision on K8, the story is a little different.  The K8 single precision scalar instructions can retire one add operation and one multiply operation (see the optimization guide to determine which execution unit is used for a particular instruction) per cycle providing 2 single precision FLOPS per cycle. But for the vector instructions, the K8 single precision units can process two of the 4 values at a time.  This means that vector single precision ops on K8 can run at 4 FLOPS per cycle.

Think of an SSE register as being in two halves, upper and lower.  K8 works on these two halves independently, and only one can be retired at a time (per execution unit).

The Third-Generation Opteron floating point unit can now operate on both halves at the same time.  The new add and multiply units can retire both 64-bit halves of the 128-bit register each clock.   (This is not the same as doing 128-bit calculations aka quad precision).  The net result is that you can now process 2 DP adds and 2 DP multiplies per clock, or 4 FLOPS per cycle.  And for single precision math, you now can run 8 FLOPS per cycle!
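The vector throughput figures quoted in this post follow from one simple product: the number of execution units (one add unit plus one multiply unit) times the number of elements each unit can retire per clock.  The helper below is a hypothetical sketch of that arithmetic only; it does not model scalar instructions or any real scheduling behavior.

```cpp
#include <cassert>

// Peak vector FLOPS per cycle = units x (datapath width / element width).
// units:         execution units that can each retire an operation per clock
// datapath_bits: bits each unit processes per clock (64 on K8, 128 on the
//                Third-Generation Opteron, per the text)
// element_bits:  64 for double precision, 32 for single precision
inline int peak_flops_per_cycle(int units, int datapath_bits, int element_bits) {
    assert(units > 0 && element_bits > 0);
    assert(datapath_bits % element_bits == 0);
    return units * (datapath_bits / element_bits);
}
```

With 2 units this gives 2 (K8, double precision), 4 (K8, single precision vector), 4 (Third-Generation Opteron, double precision), and 8 (Third-Generation Opteron, single precision), matching the numbers above.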

--Chip F.

 



Edited: 09/10/2007 at 10:36 AM by AMD_Libraries

    Posted By: AMD Libraries Team @ 09/10/2007 10:34 AM     AMD Libraries     Comments (2)  

September 6, 2007
  Welcome to the AMD Libraries blog
Welcome to the AMD Libraries blog, which comes direct to you from the AMD Performance Libraries team. Here, we will discuss emerging trends, best practices and insider tips for optimizing code for stellar performance using AMD's performance and optimization routines and libraries. Follow along as the APL (AMD Performance Library) team overcomes challenges with Streaming SIMD Extensions (SSE) programming. Listen in as ACML (AMD Core Math Library) experts provide guidance on AMD's new 128-bit floating point enhancements, as well as other features that take advantage of enhanced features of the AMD "Barcelona" processor platform. Receive early guidance about new features, functions and development tools - well before they hit the street.

More than anything, we are here to engage in a dialog with our user communities. We look forward to your comments, opinions and interactions.

    Posted By: AMD Libraries Team @ 09/06/2007 02:20 AM     AMD Libraries     Comments (0)  
