AMD Developer Central
AMD Developer Blogs
February 23, 2009
  Huge Pages and NUMA on Windows® Operating Systems
 
Many Java applications, especially those using large heaps, can benefit from what operating systems call large or huge pages.  On the x86 architecture, these are pages larger than the default 4 KB, usually 2 MB.  See the article Supersizing Java: Large Pages on the Opteron Processor for a discussion of these huge pages and how to set up your OS and JVM to use them.
Now suppose you are using huge pages in a multi-processor NUMA environment and intend to run multiple JVMs, each affinitized to a node (that is, with each JVM's threads pinned to one node).  For maximum performance, you want the huge pages allocated for each affinitized JVM's heap to be local to that node.  A previous blog entry addressed this issue for Linux® operating systems, and here we discuss Windows® operating systems.
Huge Pages and NUMA on Windows
On Windows, as described in the Supersizing article, you do not need to (in fact you cannot) reserve the huge pages before an application like a JVM can use them.  You just need to grant the user the "Lock Pages in Memory" right, and the requesting application will acquire the huge pages at runtime.  Note that the allocation policy therefore differs from Linux, where the policy is applied outside of the process context, at the time the pages are reserved.
But let's look at what happens at runtime.  The JVM or other application makes a VirtualAlloc request to allocate a chunk of memory with a flag to map it to huge pages.  The OS really deals in terms of 4 KB pages.  To return a 2 MB page it must find 512 contiguous 4 KB pages, and those will then be locked in memory (they are not candidates for paging out).  By default, it will look for such a 2 MB block in the memory that is local to the requesting processor.  If the OS cannot find a free 2 MB block in local memory, it will search the memory of other nodes in the system.
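The page arithmetic above can be sketched in a few lines of Java. This is only an illustration of the numbers in the text (4 KB small pages, 2 MB huge pages); the class and method names are hypothetical, and the 1 GB heap is an assumed example size, not a figure from this article.

```java
// Sketch: page-size arithmetic for x86 huge pages, using the sizes quoted in the text.
class LargePageMath {
    static final long SMALL_PAGE_BYTES = 4L * 1024;         // default 4 KB page
    static final long LARGE_PAGE_BYTES = 2L * 1024 * 1024;  // 2 MB huge page

    // Number of contiguous 4 KB pages the OS must find to back one huge page.
    static long smallPagesPerLargePage() {
        return LARGE_PAGE_BYTES / SMALL_PAGE_BYTES;
    }

    // Number of huge pages needed to cover a request of the given size (rounds up).
    static long largePagesFor(long bytes) {
        return (bytes + LARGE_PAGE_BYTES - 1) / LARGE_PAGE_BYTES;
    }

    public static void main(String[] args) {
        long heapBytes = 1024L * 1024 * 1024; // assumed example: a 1 GB heap
        System.out.println(smallPagesPerLargePage() + " small pages per huge page");
        System.out.println(largePagesFor(heapBytes) + " huge pages needed for the heap");
    }
}
```

Every one of those huge pages needs its own contiguous, free 2 MB run on the local node, which is why fragmentation matters so much in the discussion that follows.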
Another thing to note is that on Windows, an affinitized process really just has its execution threads bound to a set of processor cores.   There is no way to force memory allocation to be affinitized as well.     Windows will silently allocate memory for an affinitized process from other nodes if it doesn’t have enough memory locally.  (Note: this is true even if the newer VirtualAllocExNuma call is used.)  So the allocation strategy mentioned above holds for both affinitized and non-affinitized processes.
There is one other complication.  The huge page allocation strategy described above (allocate as much as possible from local memory, then allocate from remote memory) reflects what happens on Windows Server 2008.  On Windows Server 2003, however, there is a feature of huge page allocation that makes it less likely to allocate as much as possible from local memory.  On Windows Server 2003, the OS gives priority to contiguous blocks of 2 MB pages whether they are local or not.  So even if there are enough local huge pages in total, if memory is fragmented enough that large contiguous blocks are not available on the requesting node, you are less likely to get local huge page memory on Windows Server 2003.
If you do control the memory allocation code for the application you can work around this Windows Server 2003 feature by allocating the huge pages one page at a time.  Later releases of the standard JVMs implement this Windows Server 2003 workaround.
 
Measuring whether we can get Local Huge Pages
On either Windows Server 2008 or Windows Server 2003, since the application will just silently allocate memory from other nodes if it doesn’t have enough locally, how does one tell if their affinitized process really is getting all their huge pages from local memory?  I’m not aware of any Windows utility that helps answer this question.  In AMD Java Labs we use an internal tool that can tell us how many free huge pages are available on each node.  This tool allocates as many huge pages as possible and then looks at the physical address of the pages to tell what node they are on.    We can then infer that these pages must have been free and could be acquired by the real application of interest. 
I’d be interested in hearing whether others know of a Windows utility that helps show where the huge pages can be allocated.
There is one piece of advice, however.  You can't reserve huge pages on Windows, and the longer a system is up, the more fragmented its free page pool will be.  It follows that an application whose performance depends not only on getting enough huge pages but also on those pages being local to each node should make its request as early as possible after boot, before memory becomes fragmented.
In a future blog, we plan to address a different memory allocation problem:  what if your process is not affinitized but really is running on cores that are distributed across several nodes and it still wants memory usage to be as local as possible?  Stay tuned.
 


-------------------------

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.



    Posted By: Tom Deneau @ 02/23/2009 04:07 PM     AMD Java Labs     Comments (1)  

February 11, 2009
  Heap settings and reading verbose GC output

Last time we talked about setting up GC flags and how to tune your Java application with the proper GC algorithm.  This time we're going to focus on how to properly setup your Java heap to help get the best performance possible.

The easiest way to set up your heap is to set a maximum allowable size and let the JVM handle everything else.  The "-Xmx" flag sets the maximum heap size; for example, "-Xmx1024m" and "-Xmx1g" both set the maximum heap size to 1 gigabyte.  Just realize that you don't want to set the maximum heap larger than the available memory in your system.  Performance will suffer and you'll be worse off than if you had used a smaller heap.
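One way to confirm that the flag took effect is to ask the running JVM for its heap limit. The sketch below uses the standard Runtime.maxMemory() call; the class name is hypothetical, and on some JVMs the reported figure may be somewhat lower than the "-Xmx" value because part of the young generation is not counted.

```java
// Sketch: report the maximum heap the JVM will attempt to use.
class HeapLimit {
    // Maximum heap size in megabytes, as seen by the running JVM.
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        // Run with, for example: java -Xmx1g HeapLimit
        System.out.println("Max heap: " + maxHeapMegabytes() + " MB");
    }
}
```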

While that's a good start, we can clearly do better than that.  The JVM also has this notion of the heap being split in multiple sections.  One section is the nursery or young generation where all allocations happen.  The other section is the tenured space or old generation where long lived objects get promoted from the nursery.  We can tune the size of these spaces for better performance depending on how our application allocates objects.  By adding the flag "-Xmn" you can set the nursery size, and the remaining heap is then allocated to the tenured space. 

This tuning of nursery can be an important source of performance improvements as a nursery collection is significantly faster than a full GC.  And if your application creates many short lived objects then you would be better off with a larger nursery than tenured space.  If only we could somehow calculate how much nursery our application uses we could tune better.  Ah, but we can!  Just use the flag "-verbose:gc" which shows each nursery and full GC and how much was collected and how much of the heap is free.  And in our case we'll add "-XX:+PrintGCDetails" which increases the level of detail.  As an example see:
 
[GC [DefNew: 910K->12K(960K), 0.0003947 secs] 1239K->341K(5056K), 0.0005048 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
[GC [DefNew: 908K->13K(960K), 0.0004003 secs] 1237K->341K(5056K), 0.0005126 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
[GC [DefNew: 909K->14K(960K), 0.0005241 secs] 1237K->342K(5056K), 0.0006367 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
[Full GC [Tenured: 3485K->4095K(4096K), 0.1745373 secs] 61244K->7418K(63104K), [Perm : 10756K->10756K(12288K)], 0.1762129 secs] [Times: user=0.19 sys=0.00, real=0.19 secs]

Those are examples of the output you'll get when using "-verbose:gc" and "-XX:+PrintGCDetails".  What does all of that mean?  The first three lines are young generation collections.  The nursery is 960K in size and held around 909K of data before each collection and around 13K after it.  The second set of figures on each line describes the whole heap, not just the nursery: the total heap was 5056K and held around 1237K of data before the collection and about 341K after it.  The last line shows a full collection.  The tenured space is 4096K, and its occupancy rose from 3485K to 4095K: objects were moved into tenured space but none were collected.  This points to a bigger problem, where the tenured space is too small and should be increased to give better performance.
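If you want to track these figures across a long log, the lines are regular enough to parse. The following is a minimal sketch, assuming the log format shown above; the class name and the layout of the returned array are hypothetical, and other collectors or JVM versions print different formats.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: pull the "before->after(capacity)" figures out of a -XX:+PrintGCDetails line.
class GcLineParser {
    // Matches triples such as "910K->12K(960K)".
    private static final Pattern USAGE =
            Pattern.compile("(\\d+)K->(\\d+)K\\((\\d+)K\\)");

    // Returns {genBefore, genAfter, genCapacity, heapBefore, heapAfter, heapCapacity} in KB.
    // The first triple on a line is the collected generation, the second is the whole heap.
    static long[] parse(String line) {
        Matcher m = USAGE.matcher(line);
        long[] out = new long[6];
        for (int i = 0; i < 2; i++) {
            if (!m.find()) {
                throw new IllegalArgumentException("Unrecognized GC line: " + line);
            }
            out[3 * i] = Long.parseLong(m.group(1));
            out[3 * i + 1] = Long.parseLong(m.group(2));
            out[3 * i + 2] = Long.parseLong(m.group(3));
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "[GC [DefNew: 910K->12K(960K), 0.0003947 secs] "
                + "1239K->341K(5056K), 0.0005048 secs]";
        long[] v = parse(line);
        System.out.println("Nursery freed " + (v[0] - v[1]) + "K of " + v[2] + "K");
        System.out.println("Heap in use after GC: " + v[4] + "K of " + v[5] + "K");
    }
}
```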
 
You might have noticed that with "-XX:+PrintGCDetails" you'll get a bit of output when your application ends that looks like:
 
Heap
 def new generation   total 960K, used 16K [0x22990000, 0x22a90000, 0x22e70000)
  eden space 896K,   0% used [0x22990000, 0x22990810, 0x22a70000)
  from space 64K,  22% used [0x22a70000, 0x22a738b0, 0x22a80000)
  to   space 64K,   0% used [0x22a80000, 0x22a80000, 0x22a90000)
 tenured generation   total 4096K, used 328K [0x22e70000, 0x23270000, 0x26990000)
   the space 4096K,   8% used [0x22e70000, 0x22ec2328, 0x22ec2400, 0x23270000)
 compacting perm gen  total 12288K, used 152K [0x26990000, 0x27590000, 0x2a990000)
   the space 12288K,   1% used [0x26990000, 0x269b6390, 0x269b6400, 0x27590000)
    ro space 8192K,  63% used [0x2a990000, 0x2aea3ae8, 0x2aea3c00, 0x2b190000)
    rw space 12288K,  53% used [0x2b190000, 0x2b7f83f8, 0x2b7f8400, 0x2bd90000)
 
This shows you the sizes of the various parts of the heap and the amount of each that was in use when the application quit.  It's a good snapshot of the application's heap requirements at the end of the run and can help you tune.  As an example, if the amount of new generation used is close to the total size of the new generation, then you should increase the nursery size using the "-Xmn" flag.  Again, garbage collection is expensive, and it causes your application to stop executing while the collection happens.  The fewer collections you have to do, the faster your application runs.  And a full GC is slower than a nursery GC.  If your application is causing many full GCs and your tenured generation is nearly full, then modify the nursery size (-Xmn) so that the nursery is a smaller portion of the total heap, which leaves more room for the tenured generation.
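The same kind of snapshot can be taken from inside a running application with the standard java.lang.management API. This is a rough sketch with a hypothetical class name; the pool names it prints depend on the JVM and the collector in use, so they will not necessarily match the labels in the output above.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

// Sketch: print used and committed sizes for each heap memory pool.
class HeapPools {
    static String describe() {
        StringBuilder sb = new StringBuilder();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip non-heap pools such as the code cache
            }
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            sb.append(String.format("%-30s used %8dK  committed %8dK%n",
                    pool.getName(),
                    usage.getUsed() / 1024,
                    usage.getCommitted() / 1024));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe());
    }
}
```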
 
Stay tuned for part 3 as we discuss how to combine GC flags and heap flags to tune a real application for maximum performance.



-------------------------

    Posted By: Azeem Jiva @ 02/11/2009 03:45 PM     AMD Java Labs     Comments (0)  

February 6, 2009
  Builders instead of Constructors for Immutable Objects.

As Java developers, we are familiar with constructors.  We create them every day to get our objects into a usable state. 

Consider a simple Rectangle class's constructor.

class Rectangle{

private int x,y,w,h; // x, y, width and height

Rectangle(int _x, int _y, int _w, int _h){

x = _x; y = _y; w = _w; h = _h; // initialize x, y, width and height

}

}

As API developers, we are always attempting to balance functionality and convenience. Even this fairly straightforward class can become confusing when we try to provide convenient alternative constructors. 

If, for example, our Rectangle's w and h were usually 100 we may be tempted to create a constructor which defers to the original constructor with default values.

Rectangle(int _x, int _y){

this(_x, _y, 100, 100);

}

One might also allow a default for when h and w are 100 and x and y are 0.

Rectangle(){

this(0, 0, 100, 100);

}

Things get tricky when we then try to create a constructor where we specify the w and h and take the default x and y.

We might be tempted to try to add this constructor

Rectangle(int _w, int _h){

this(0, 0, _w, _h);

}

Sadly, this constructor will not compile: it has exactly the same signature as Rectangle(int _x, int _y).  The compiler does not care what the parameter names are, so there can only be one constructor taking (int, int).

It would be great if Java had a syntax for disambiguating the arguments to a method or constructor.

Rectangle r = new Rectangle(width:100, x:20);

We could specify the arguments in any order and the compiler would create a synthetic constructor which matched the arguments; alas Java cannot do this although we will see later how we can get close to this.

Of course, if we do not need our Rectangle to be immutable, we can abandon having multiple constructors and just use mutators.

Rectangle r = new Rectangle(); // defaults x,y = 0 and width,height=100

r.setWidth(500);

r.setHeight(500);

This is probably the best approach for most classes; it is a little verbose but it is easy to understand.

However, for immutable classes we need another approach.  

Thankfully, we can use a builder pattern.

If we define an inner static class within Rectangle called Builder which we can mutate into an appropriate state (using regular setters), we can then pass this builder into Rectangle's constructor and let the constructor pull its own state from the builder.

It sounds more complicated than it is, so let's see what this looks like:

class Rectangle{

private int x,y,h,w;

public static class Builder{

protected int x,y,h,w;

void setW(int _w){w = _w; }

void setX(int _x){x = _x;}

void setY(int _y){y = _y;}

void setH(int _h){h = _h;}

}

public Rectangle(Builder _builder){

x=_builder.x;

y=_builder.y;

w=_builder.w;

h=_builder.h;

}

}

So now we can create an instance of Rectangle.Builder and mutate it.

Then we can create a Rectangle by passing the Rectangle.Builder to the constructor.

Rectangle.Builder builder = new Rectangle.Builder();

builder.setW(500);

builder.setH(1000);

Rectangle rectangle = new Rectangle(builder);

We can make the code less verbose by adding what I will call 'chainable mutators' to the builder.  Unlike traditional mutators, we avoid the setX() naming style for field x; instead we just name the mutator the same as the field (so we use x() instead of setX()).  We also return 'this' from each mutator so that subsequent mutator calls can be chained together.

Our builder mutators now look like this:

Builder w(int _w){

w = _w;

return(this);

}

And so now we can use:

Rectangle.Builder builder = new Rectangle.Builder();

builder.w(200).h(200).x(5).y(5);

Rectangle rectangle = new Rectangle(builder);

We can construct and chain in one line:

Rectangle.Builder builder = new Rectangle.Builder().w(200).h(200).x(5).y(5);

Rectangle rectangle = new Rectangle(builder);

Or we can avoid declaring the temporary builder object altogether.

Rectangle r = new Rectangle(new Rectangle.Builder().w(200).h(200).x(5).y(5));

Now we have the safety of immutable Rectangles with a mechanism for getting immutable objects into a reasonable state.

At this point we can make two more enhancements.

1.       Add a static build() method to Rectangle to construct the Rectangle.Builder.

2.       Add a commit() method to the Rectangle.Builder to construct the Rectangle.

Our code now looks like this:

class Rectangle{

private int x,y,h,w;

public static class Builder{

protected int x,y,h,w;

Builder w(int _w){w = _w; return(this);}

Builder x(int _x){x = _x; return(this);}

Builder y(int _y){y = _y; return(this);}

Builder h(int _h){h = _h; return(this);}

Rectangle commit(){return new Rectangle(this);}

}

private Rectangle(Builder _builder){

x=_builder.x;

y=_builder.y;

w=_builder.w;

h=_builder.h;

}

static Builder build(){

return(new Builder());

}

}

And we can create a Rectangle using:

Rectangle r = Rectangle.build().w(200).h(200).x(5).y(5).commit();
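For reference, here is one way the finished pattern might look as a single compilable unit. It is a sketch rather than the exact class above: the name ImmutableRectangle, the final fields, the accessor methods, and the choice to put the default values (0, 0, 100, 100, taken from the earlier constructors) in the builder are all assumptions made for illustration.

```java
// Sketch: immutable class built through a chainable builder.
final class ImmutableRectangle {
    private final int x, y, w, h; // final fields: state cannot change after construction

    static final class Builder {
        // Defaults mirror the convenience constructors discussed earlier.
        private int x = 0, y = 0, w = 100, h = 100;

        Builder x(int _x) { x = _x; return this; }
        Builder y(int _y) { y = _y; return this; }
        Builder w(int _w) { w = _w; return this; }
        Builder h(int _h) { h = _h; return this; }

        // Copies the builder's current state into a new immutable instance.
        ImmutableRectangle commit() { return new ImmutableRectangle(this); }
    }

    // Private: instances can only be created through the builder.
    private ImmutableRectangle(Builder b) {
        x = b.x;
        y = b.y;
        w = b.w;
        h = b.h;
    }

    static Builder build() { return new Builder(); }

    int x() { return x; }
    int y() { return y; }
    int w() { return w; }
    int h() { return h; }

    public static void main(String[] args) {
        // Only width and x are given; y and height keep their defaults.
        ImmutableRectangle r = ImmutableRectangle.build().w(200).x(5).commit();
        System.out.println(r.x() + "," + r.y() + " " + r.w() + "x" + r.h());
    }
}
```

Because every argument is named by its mutator, the call site reads much like the named-argument syntax wished for earlier, and the (int, int) signature clash never arises.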

This pattern is certainly not new. It borrows from the traditional builder pattern and from what Martin Fowler refers to as a 'fluent interface': (http://martinfowler.com/bliki/FluentInterface.html).

We have been using this pattern in internal projects and would be interested to hear how folks feel about combining the builder pattern with chainable mutators for constructing immutable objects.



-------------------------


Edited: 02/16/2009 at 06:27 PM by AMD Developer Blogs Moderator


    Posted By: Gary Frost @ 02/06/2009 01:36 PM     AMD Java Labs     Comments (0)  

February 3, 2009
  Which Java GC Collector is right for you?

Tuning GC flags seems like it should be difficult and error prone, but following a few easy steps can help you get improved application performance with only a few minutes of your time. The first step is to decide what sort of application you are trying to tune. Is your application throughput bound and needs to perform as many transactions as possible in the shortest time period? Or is your application latency bound where each transaction must finish in a specified time period?

I'm going to be using the Sun HotSpot JVM flags as my examples but both Oracle's JRockit and IBM's J9 have similar flags. See the appropriate documentation on Oracle's or IBM's site, respectively.

Traditionally most applications are throughput bound and are not sensitive to GC pause times. Throughput bound applications are better suited to a fast collection where the JVM can pause Java execution for GC. There is another class of applications that require a certain guarantee of how long the GC pauses will take. These applications are popular in the financial sector and other related fields. These applications require that the JVM ensures that individual transactions finish within a certain time period, usually less than 10ms.

Let's start with the most common case and target throughput applications. The first flag to enable is "-XX:+UseParallelOldGC", which enables a parallel version of both the young generation collector and the old generation collector. By default the number of GC threads is derived from the number of processors, and the exact formula varies with the version of Java. We should change that to a sensible value based on the number of cores available on your system and, more importantly, the number of cores you want to dedicate to this application. In this example I'm going to assume the system has two processors and eight cores (2p/8c). Since the system is dedicated exclusively to the application I want to run, I'm also going to give all eight cores to garbage collection. This isn't as bad as it sounds: the parallel collector is a "stop the world" collector, so your Java application won't make forward progress until the collection has completed anyway. To set the number of threads use the flag "-XX:ParallelGCThreads=#", where # is the number of threads, in our case 8.
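To check what the JVM actually picked up, you can ask it at runtime. The sketch below uses the standard Runtime and java.lang.management calls; the class name is hypothetical, and the collector names printed vary by JVM vendor and version, so treat them as descriptive labels rather than fixed identifiers.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Sketch: report the core count and the collectors active in this JVM.
class GcReport {
    // Names of the garbage collectors the running JVM has registered.
    static List<String> collectorNames() {
        List<String> names = new ArrayList<String>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        // Run with, for example: java -XX:+UseParallelOldGC -XX:ParallelGCThreads=8 GcReport
        System.out.println("Processors visible to the JVM: "
                + Runtime.getRuntime().availableProcessors());
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + ": " + gc.getCollectionCount() + " collections, "
                    + gc.getCollectionTime() + " ms total");
        }
    }
}
```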

Once you've enabled those flags, test the application and see how much performance you've gained. Ideally your application should now run faster and have shorter garbage collection pause times. What about applications that are latency driven rather than throughput driven? For those applications we need the low pause, or concurrent, collector, which is enabled with the flag "-XX:+UseConcMarkSweepGC". But since we care about latency and don't want GC time to dominate over application CPU time, we also need to enable the incremental mode for CMS by using "-XX:+CMSIncrementalMode". These two flags have their greatest effect when the JVM can profile what the collector is doing and adjust its parameters accordingly. To enable that feature use "-XX:+CMSIncrementalPacing".

Now rerun your application and see if the response time for your application has improved. In most cases, while response time may have improved, your throughput may have dropped; whether that tradeoff is acceptable is a decision only you can make. Stay tuned next time as we explore how to use GC logs to tune heap sizes and control GC performance.
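To make that tradeoff visible, it helps to measure both numbers in the same run. The following is a minimal sketch, not a rigorous benchmark: the class name, the stand-in "transaction" (allocating a short-lived array), and the iteration count are all assumptions for illustration, and a real test should use your own transactions and a warm-up period.

```java
// Sketch: measure total time (throughput) and worst single transaction (latency).
class LatencyProbe {
    static volatile Object sink; // keeps the allocation from being optimized away

    // Returns {totalNanos, worstNanos} for the given number of transactions.
    static long[] measure(Runnable transaction, int iterations) {
        long worst = 0;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            long t0 = System.nanoTime();
            transaction.run();
            long elapsed = System.nanoTime() - t0;
            if (elapsed > worst) {
                worst = elapsed; // a GC pause during a transaction shows up here
            }
        }
        long total = System.nanoTime() - start;
        return new long[] { total, worst };
    }

    public static void main(String[] args) {
        Runnable transaction = new Runnable() {
            public void run() {
                sink = new byte[64 * 1024]; // hypothetical short-lived allocation
            }
        };
        long[] result = measure(transaction, 100000);
        System.out.println("Total: " + result[0] / 1000000 + " ms");
        System.out.println("Worst transaction: " + result[1] / 1000 + " us");
    }
}
```

Running the same harness once with the parallel collector flags and once with the concurrent collector flags gives a rough side-by-side view of total time against worst-case response time.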

-------------------------

Azeem Jiva, MTS Software Engineer



-------------------------
Edited: 02/04/2009 at 01:56 AM by AMD Developer Blogs Moderator


    Posted By: Azeem Jiva @ 02/03/2009 04:43 PM     AMD Java Labs     Comments (0)  
