In a previous blog, we looked at a microbenchmark where we were pulling an item from a collections class like an ArrayList and eventually putting it in another collection. And we saw that there could be a significant performance difference between the following two versions:
(Note: In the following examples, we show only the parts where we access the ArrayLists and leave out any subsidiary logic.)
ArrayList<MyClass> aListSrc, aListDest1, aListDest2;
Version 1
while (idxSrc < NUMOBJS) { aListDest1.add(idxDest, aListSrc.get(idxSrc++)); }
Version 2, which first assigned the result of aListSrc.get to a local MyClass variable before adding it to the destination list, was slower because it requires a castcheck to verify that the Object returned by aListSrc.get can be cast to a MyClass. The performance impact arose because the castcheck required touching an object that did not need to be touched in version 1.
In the microbenchmark code above, we navigated thru the ArrayList by incrementing an integer index to the ArrayList.get method. What if we had used an explicit iterator or used the implied iterator in Java’s for-each statement?
First let’s look at the least cluttered implementation, which uses a for-each loop:
Version 3
for (MyClass myc : aListSrc) {
    aListDest1.add(myc);
    // ...
}
and remembering that the for-each loop is syntactic sugar for the following:
Version 3b
for (Iterator<MyClass> iter = aListSrc.iterator(); iter.hasNext(); ) {
    MyClass myc = iter.next();
    // body of loop
    aListDest1.add(myc);
}
we can see that, unfortunately, this suffers from the same castcheck as Version 2. And, once again, we cannot get around the castcheck by making the for-each variable an Object, because the compiler wisely will not let you add an Object to an ArrayList<MyClass>:
Version 4 (will not compile)
for (Object myc : aListSrc) {
    aListDest1.add(myc); // <-- error here
}
Looking at the expanded code for the for-each loop, we see that we can still both use an explicit iterator and avoid the castcheck by getting rid of the temporary variable from Version 3b and ending up with something like the following:
Version 5
for (Iterator<MyClass> iter = aListSrc.iterator(); iter.hasNext(); ) {
    aListDest1.add(iter.next());
}
Like Version 1, this passes all the compile-time checks. And at run time, because of type erasure, iter.next() returns an Object and aListDest1.add consumes an Object.
But ideally we would want to be able to use the less cluttered for-each notation and still get rid of the castcheck. Can that be done? Brian Goetz's excellent article Going Wild with Generics talks about using generic methods to force the compiler to use type inference to solve a problem with wildcards in generics. To quote his article "The Java compiler doesn't perform type inference in very many places, but one place it does is in inferring the type parameter for generic methods". I wanted to see if the type inference from generic methods would solve our problem here and sure enough it does.
We can code up version 6 as a generic helper method
Version 6
private <V> void splitHelper(ArrayList<V> src, ArrayList<V> dest1, ArrayList<V> dest2) {
    for (V elem : src) {
        dest1.add(elem);
        // ...
    }
}
and we can then call the helper with something like
splitHelper(aListSrc, aListDest1, aListDest2);
If we run version 6 thru javac and look at the generated bytecodes, we see that the checkcast bytecode that we saw in version 3 is not there, leading to better performance.
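To make the helper approach concrete, here is a self-contained sketch. The class name, the stand-in MyClass, and the alternating even/odd split rule are assumptions made for illustration, since the body of the original loop is elided above.

```java
import java.util.ArrayList;

class SplitHelperSketch {
    // Stand-in element type; the real MyClass is not shown in the post.
    static class MyClass {
        final int id;
        MyClass(int id) { this.id = id; }
    }

    // Generic helper: inside this method the element type is the type
    // variable V, which erases to Object, so the loop variable needs no cast.
    static <V> void splitHelper(ArrayList<V> src, ArrayList<V> dest1, ArrayList<V> dest2) {
        boolean even = true; // assumed split rule: alternate between the two lists
        for (V elem : src) {
            if (even) {
                dest1.add(elem);
            } else {
                dest2.add(elem);
            }
            even = !even;
        }
    }

    public static void main(String[] args) {
        ArrayList<MyClass> aListSrc = new ArrayList<MyClass>();
        for (int i = 0; i < 10; i++) {
            aListSrc.add(new MyClass(i));
        }
        ArrayList<MyClass> aListDest1 = new ArrayList<MyClass>();
        ArrayList<MyClass> aListDest2 = new ArrayList<MyClass>();
        splitHelper(aListSrc, aListDest1, aListDest2);
        System.out.println(aListDest1.size() + " / " + aListDest2.size());
    }
}
```

Compiling this and running javap -c on the class is one way to check for yourself whether a checkcast appears inside splitHelper.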
So we have found a for-each based solution that has gotten rid of the castcheck, but do others find this behavior surprising? The difference between Versions 3 and 6 seems very minor and it seems that if the compiler could eliminate the castcheck in Version 6, it could also do so in Version 3.
-------------------------
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.
I have always been a little unhappy with the decision to overload the use of the 'final' keyword to enable local variables to be made available to methods in inner classes.
Let's recap. Here is a method which launches a thread which prints integers 0 thru 9.
public void launch(){
    new Thread(new Runnable(){
        public void run(){
            for (int i=0; i<10; i++){
                System.out.println("i="+i);
            }
        }
    }).start();
}
We decide to refactor this method to take two arguments (launch(int min, int max)) so that we can control the start and end values of the count. We might be tempted to try
// will not compile
public void launch(int min, int max){
    new Thread(new Runnable(){
        public void run(){
            for (int i=min; i<max; i++){
                System.out.println("i="+i);
            }
        }
    }).start();
}
But this will fail to compile.
The problem is that the parameters min and max cannot be used directly by the run() method in the anonymous inner class implementation of Runnable. In fact, because the run() method is being executed in another thread, it is likely that the original call to launch() has returned before the run() method has even started, so the variables that were on the stack when we created our Runnable are long gone. To solve this problem Java needs a way to signal that a variable should be captured into the scope of any anonymous inner class that wants to use it. If annotations had been around, I suspect that an annotation would have worked well for this. Unfortunately this 'requirement' predated annotations, and it was decided to 'overload' the use of the final keyword to convey this intent.
public void launch(final int min, final int max){
    new Thread(new Runnable(){
        public void run(){
            for (int i=min; i<max; i++){
                System.out.println("i="+i);
            }
        }
    }).start();
}
The above method will now compile and will function as suggested.
But 'final' seems wrong here. I understand that there is a reluctance to add new key/reserved words to a language (just look at all the trouble that enum and assert created!), but final seems to be a weird choice. I think it breaks the law of 'least astonishment'.
Let's refactor our method one more time. This time we will launch 10 threads per count value and we will print the 'number' of each thread. Here is our first attempt
// Won't compile
public void launch(final int min, final int max){
    for (int c=0; c<10; c++){
        new Thread(new Runnable(){
            public void run(){
                for (int i=min; i<max; i++){
                    System.out.println("Thread "+c+" i="+i);
                }
            }
        }).start();
    }
}
Again our compilation issue is that the 'c' variable is not available in the run method of the anonymous inner class. We need c to be a final variable. Let's make it final
// Won't compile for a different reason ;)
public void launch(final int min, final int max){
    for (final int c=0; c<10; c++){
        new Thread(new Runnable(){
            public void run(){
                for (int i=min; i<max; i++){
                    System.out.println("Thread "+c+" i="+i);
                }
            }
        }).start();
    }
}
Doh! Of course c can't be final; it is a loop variable. If we mark it as 'final' we are applying the traditional (you can't mutate this) meaning of final, yet we need to mark it as final for the variable to be made available to the inner class. We are forced to do 'weird things' to get around this, like create a local final value for the purpose of capturing the value for the inner class.
public void launch(final int min, final int max){
    for (int c=0; c<10; c++){
        final int fc = c; // fc is only used to expose a final value to the inner class
        new Thread(new Runnable(){
            public void run(){
                for (int i=min; i<max; i++){
                    System.out.println("Thread "+fc+" i="+i);
                }
            }
        }).start();
    }
}
Yuck!
However you might be even more surprised by this solution ;)
public void launch(final int min, final int max){
    for (final int c: new int[]{0,1,2,3,4,5,6,7,8,9}){
        new Thread(new Runnable(){
            public void run(){
                for (int i=min; i<max; i++){
                    System.out.println("Thread "+c+" i="+i);
                }
            }
        }).start();
    }
}
What?
So it looks like we can declare a loop variable to be final providing we are using the new for-each form. The code is happy to mutate it (so it's not really final, is it?) and also make it available to appropriate inner classes.
How bizarre.
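One way to make sense of this is to look at how the language specification expands the for-each loop: the loop variable is declared inside the body of the equivalent basic for loop, so every iteration gets a fresh variable that is assigned exactly once. The sketch below puts the two forms side by side. The class name, method names, and the 'out' list used to record values are hypothetical, added only so the behaviour can be checked.

```java
import java.util.ArrayList;
import java.util.List;

class ForEachFinalSketch {
    // The for-each form: 'c' may be declared final and captured by the inner class.
    static List<Runnable> withForEach(int[] values, final List<Integer> out) {
        List<Runnable> tasks = new ArrayList<Runnable>();
        for (final int c : values) {
            tasks.add(new Runnable() {
                public void run() { out.add(c); }
            });
        }
        return tasks;
    }

    // Roughly the expanded form: a hidden index is what gets mutated,
    // while a new final 'c' is declared on every pass through the loop.
    static List<Runnable> expanded(int[] values, final List<Integer> out) {
        List<Runnable> tasks = new ArrayList<Runnable>();
        for (int i = 0; i < values.length; i++) {
            final int c = values[i];
            tasks.add(new Runnable() {
                public void run() { out.add(c); }
            });
        }
        return tasks;
    }

    public static void main(String[] args) {
        List<Integer> seen = new ArrayList<Integer>();
        for (Runnable r : withForEach(new int[]{0, 1, 2}, seen)) {
            r.run();
        }
        System.out.println(seen);
    }
}
```

Seen this way, the expanded form is the same 'local final copy' trick as the fc variable earlier, just written by the compiler instead of by hand.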
Next time we will look at how these final variables actually get captured/cloned into the inner classes. One might be surprised by what is happening at the bytecode level to allow these 'final' values to be made available to inner classes.
-------------------------
I was lucky enough to go to JavaOne last week and thought I'd share some comments, highlights, a few quibbles, and a way to make some serious money if you are in the beanbag industry.
I felt that this year's JavaOne was a little subdued -- attendance seemed lower (we can probably all guess that the economy was a factor here) and generally there were fewer 'cool!' exclamations from the audiences.
Monday
This was CommunityOne day. I attended a couple of sessions (Hadoop and Cloud related) but really spent most of the day bumping into people and catching up. CommunityOne looked a little sparsely attended at times.
I did attend a session where the presenters and attendees discussed how to get the most out of their JUGs (Java User Groups). This was a really good session.
I did enjoy hanging out in the AMD sponsored 'Hang Space,' and I had my first 'patent pending idea' here watching all of the laptop users sitting on the floor next to the walls (where the 110vac was served) and not on the comfy beanbags! So beanbag builders of the world, we need beanbags which incorporate 110v sockets. These could be sold in strings which connect together and will allow those slacking off at conferences to actually partake in the bean-bag offerings rather than sit on the floor. Of course, one might ask why the beanbags were not dragged to the walls, and the answer would be, you wouldn't be able to watch the episodes of 'The Office - US version' that were being served up on the big screen, obviously. I, of course, could happily sit in a beanbag, pretend to work and watch Dwight, Jim, and Pam wrestle with their plight because I have an AMD powered HP dv2 - whose battery lasted way longer than Season 1 of "The Office."
Tuesday
It was good to see Scott McNealy hand over (the keynote, not Sun just yet) to Larry Ellison. Larry's remarks regarding the importance of Java to Oracle must have made a few folks sleep easier on Tuesday evening and I suspect that the JavaFX team will be particularly pleased with Larry calling out JavaFX by name and pushing a possible OpenOffice/JavaFX integration down the line. That should be good for JavaFX and hopefully good for OpenOffice.
So where is JavaFX in 2009? I count this as the third JavaOne where Sun has pushed JavaFX. 2007 was kind of a preview, and I enjoyed the demos but that was really all it was. It dominated in 2008, but was still really not cooked and I walked out of the lab session when I was asked to sign an NDA -- an NDA for a lab session at a conference that I paid to attend seemed a bit weird. Now in 2009 I really do think it might start gaining some traction. The addition of charting was smart (and pretty obvious really) and I was pleased that even Eclipse users got something in the form of a fairly cool Eclipse plugin. Now it really feels that JavaFX is not just for Netbeans anymore. The demos were slicker and the downloaded Eclipse plugin worked like a charm.
Having worked on a large Flex application a few years back, and having seen some extremely cool Flex apps, I have always seen JavaFX as too little too late. Flash and Flex have pretty much carved up the R part of RIA (although AJAX is not dead yet!). Now I am a little more hopeful for JavaFX to at least find an audience. The more natural Java integration and the impressive binding support will appeal to those who really took to mxml+actionscript, and I can see the story developing. The effort that has gone into jnlp/applet deployment (on jre 1.6_10 +) has helped enormously and once we can find a way to get JavaFX to launch faster (Flash still seems to launch way faster than even trivial JavaFX apps) I think that JavaFX will come into its own. I look forward to kicking the tyres some more.
Joshua Bloch (Google, Inc) and Neal Gafter's (Microsoft) "Return of the Puzzlers: Schlock and Awe" session was as well attended as ever. These guys do a great job presenting these infuriating corner cases. I liked the fact that they acknowledged making some of the mistakes presented; it makes us all feel a little less incompetent. I think I got more answers right this year, although my success rate is still not impressive.
The "Small Language Changes in JDK(tm) Release 7" session by Joseph Darcy, Sun Microsystems, Inc. was interesting. I really like the 'Elvis operator' (?:) and also look forward to using some of the suggestions for less verbose generic declarations/initializations.
The "Asynchronous I/O Tricks and Tips" session by Jean-François Arcand and Alan Bateman from Sun Microsystems, Inc. was an informative session. I really am guilty of not tracking nio (when will the 'n' in 'nio' seem really inappropriate) enough, and I look forward to using some of these tricks, especially using a 'Future' to access the response from an asynchronous read.
One of my favourite sessions was "Toward a Renaissance VM" by Brian Goetz and John Rose from Sun Microsystems. Sometimes I feel my head is way too small to understand this JSR 292 stuff, but I actually felt that I had a grasp of how this will help dynamic languages and also how it might apply to frameworks which currently rely on bytecode engines/injection and reflection to do their work. I still need to track down more information on this but the fog is lifting for me.
I wish I had caught the "The Feel of Scala" session by Bill Venners of Artima, Inc. Only as the week progressed did I realize that I need to track Scala. I look forward to the slides and video of this presentation.
Wednesday
I attended a great session called 'State: You're Doing It Wrong -- Alternative Concurrency Paradigms on the JVM(tm) Machine' in the morning from Jonas Bonér of Scalable Solutions. This session proposed State, Actor message passing and Data Flow mechanisms to improve concurrency. For me the Actor-based demos (based on Scala) not only prompted me to look at this approach in my Java apps, but were also a great example of how Scala can be scaled out. As I mentioned earlier I really need to dig into Scala some more.
I regret missing "The Modular Java(tm) Platform and Project Jigsaw" by Mark Reinhold of Sun Microsystems, Inc. From what I have read elsewhere this modular approach is really going to help deployment and packaging.
Joshua Bloch's (from Google) ""Effective Java": Still Effective After All These Years" was another opportunity to see the 'Billy Mays' of Java (I really mean no disrespect - Josh is a pitch-perfect pitch man) do what he does flawlessly. His 'Effective Java' book is like the movie 'Brazil': you need to reread/review it every year to catch what you missed previously.
I enjoyed "The Ghost in the Virtual Machine: A Reference to References" session from Bob Lee, Google Inc., which went into depth regarding GC, references, and finalization issues. I look forward to walking through the slide deck on this one. I learned a lot and also know a bunch slipped on past me.
I watched a cool demo which redefined classes in a running JVM using a java agent and some classloader tricks. This BOF session "Runtime Update of Java(tm) Technology-Based Applications, Using Dynamic Class Redefinition" by Allan Gregersen from University of Southern Denmark was fun and educational. The presenter built a Swing-based game incrementally by adding fields and methods, changing class hierarchies, etc., all without ever restarting the JVM. Although in practice I feel this java agent based chaining approach may not scale particularly well, if this can be pushed down into the JVM (as the presenter suggested) then this whole area has some great potential.
I must apologise to my fellow AMDer, Richard West, and David Gilbert from Object Refinery Limited for missing their "JFreeChart: Surviving and Thriving" BOF. I look forward to picking Richard's brain about this great toolkit.
Thursday
Occasionally I like to see what is going on in the Swing world. I don't really get to write much in Swing but there are some really great toolkits out there. I particularly enjoyed "Swing Rocks: A Tribute to Filthy-Rich Clients" by Martin Gunnarsson and Pär Sikö from Epsilon Information Technology. Swing really can look compelling.
The "Matchmaking in the Cloud: Hadoop and EC2 at eHarmony" session from Steve Kuo and Joshua Tuberville of eHarmony, Inc. was a good presentation (and from a show of hands there were two attendees that actually got married through eHarmony so there was a cool validation of eHarmony's matching algorithm!). It walked through the technical and economic considerations around using these technologies.
"Garbage Collection Tuning in the Java HotSpot(tm) Virtual Machine" from Charlie Hunt and Antonios Printezis of Sun Microsystems, Inc was a good, informative session that walked through a number of great slides highlighting what to do and what not to do. I still feel that GC tuning should be less of a 'dark art.' I worry how many JVMs are sitting out there thrashing when a few command line options would smooth the way. I do wish for a -XX:+GCAdvise option which (possibly at the end of each GC) would suggest what command line options would be optimal for a specific workload. I know that I am supposed to use the printgc options and/or use visualvm to show me the graphs that I should use to determine what flags will be optimal, but this seems way too hard. Surely after running for a while the GC engine/subsystem would have enough data to generate an 'I suggest running with these flags ... because ....' style report, instead of 'here are a bunch of graphs and text dumps, now go away and work out what you did wrong and come back.' Sometimes I don't want to learn to fish; sometimes I would just like to eat some fish.
Cliff Click (from Azul Systems) and Brian Goetz's (Sun Microsystems) session, "This Is Not Your Father's Von Neumann Machine; How Modern Architecture Impacts Your Java(tm) Apps" was another one of the highlights of the conference. It was a great presentation and allowed folk without a deep understanding of microprocessor architecture to walk away with some understanding of what happens under the hood. The slide deck in the middle which walked through the issues relating to how multi-core architectures executing speculatively have to handshake over the cache was very, very slick. I am looking forward to Cliff and Brian's Boxed Set being released.
There were some great sessions on "Actor-Based Concurrency in Scala" from Philipp Haller of EPFL and Frank Sommers of Artima which really rammed home how effectively Scala and this Actor-based communication mechanism can simplify some concurrency problems. As I mentioned before this was brought up in an earlier session, and I enjoyed digging deeper in this dedicated session.
I stayed late to enjoy the "Java(tm) Programming Language Tools in JDK(tm) Release 7" BOF on Thursday night hosted by Maurizio Cimadamore and Jonathan Gibbons from Sun Microsystems, Inc. I applaud the upcoming refactoring of javap and also enjoyed the discussion on how we might get better error reporting out of javac. I also vote for the option of getting compilation rendered to xml to help tool chaining.
Friday
Gosling's "Toy Show" (Friday morning) did have some cool stuff; the JavaFX studio tool for composing JavaFX without coding does look very, very good. Also the image analysis toolkit which generated analytical 'hashes' for images and then allowed image related searching/matching was very impressive. My favourite was the printer/copier-based Java app for creating arbitrary multiple choice exam papers or surveys on plain paper. You print off a bunch of the question papers, then feed a special page with the answers, followed by the response papers, into the scanner, and the copier/printer grades the papers. Very smart.
The "Under the Hood: Inside a High-Performance JVM(tm) Machine" session from Trent Gray-Donald of IBM was excellent. This provided some more insight into what happens when your code is executed by a modern JVM.
Sadly I missed afternoon sessions because I had to get to the airport to get home to watch season two of 'The Office.'
There certainly is enough to dig into to keep me busy enough until next year.
-------------------------
Guess how many command-line flags there are for the server JRE in the OpenJDK? I'm hearing 42. Kudos to all of you fans of the late Douglas Adams, but you're slightly short of the real answer. It's 477 (give or take a flag or two). To confirm, just go into src\share\vm\runtime\globals.hpp and src\share\vm\opto\c2_globals.hpp, which define them. The flags control all sorts of things, some of which you are probably very familiar with like the heap (-Xms -Xmx), and some which you may not know about, such as the memory footprint settings (-XX:ReservedCodeCacheSize and -XX:InitialCodeCacheSize).
I'm not asking you this because I want to know if you have intimate knowledge of the JRE (although if you can keep bits of trivia like this in your head, I am truly impressed). My question really comes out of the world of performance analysis of Java runtimes. Suffice it to say that as the Java Labs works to improve JRE performance, sometimes our analysis leads to improvements that can be realized by tuning these existing command-line flags. But here's my theory...I bet most of you use few, if any, of these flags in production. You probably have very good reasons for doing this. You may not have access to the command line, or you may have different applications, some of which may benefit from certain flags, while others won't. If true, the result is the same...when we look to improve JRE performance, we really need to do it in a way that is engineered to help potentially any application in a flexible way that does not require changes to the command line.
So answer these two questions:
Do you set any command line flags in production?
If yes, what are they?
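If you are not sure which flags a given JVM was actually started with, one way to check from inside the application is the standard RuntimeMXBean in java.lang.management. A minimal sketch (the class and method names are hypothetical):

```java
import java.lang.management.ManagementFactory;
import java.util.List;

class ShowJvmFlags {
    // Returns the arguments passed to the JVM itself (for example -Xmx or
    // -XX:... options), not the arguments passed to the application's main method.
    static List<String> jvmFlags() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments();
    }

    public static void main(String[] args) {
        List<String> flags = jvmFlags();
        if (flags.isEmpty()) {
            System.out.println("No JVM flags were set on the command line.");
        }
        for (String flag : flags) {
            System.out.println(flag);
        }
    }
}
```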
-------------------------
In this entry, we’ll go down that well-worn path of looking at some microbenchmark results and trying to explain them.
This microbenchmark created an ArrayList such that if one went thru the ArrayList in order, the entries were randomly distributed in memory. We also had enough elements in the list that it would take some time to go thru the list. We then wanted to go thru the list in order and “split” it so that we created two new ArrayLists, one for all the even elements and one for all the odd elements.
There are a number of ways to code the splitting but let’s start with an approach that doesn’t use Iterators, but just uses an integer index to the get method for the source and then adds (appends) to the destination ArrayList. The body of the loop might then look like the following:
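A minimal sketch of such a loop, reusing the variable names that appear later in this entry (aListSrc, aListDest1, aListDest2, idxSrc, NUMOBJS). The wrapper class, the empty MyClass, and the exact way even and odd positions are routed are assumptions for illustration.

```java
import java.util.ArrayList;

class SplitVersion1 {
    // Stand-in element type; its fields are never read by the split.
    static class MyClass { }

    // Version 1 shape: the result of get() is passed straight to add().
    static void split(ArrayList<MyClass> aListSrc,
                      ArrayList<MyClass> aListDest1,
                      ArrayList<MyClass> aListDest2) {
        final int NUMOBJS = aListSrc.size();
        int idxSrc = 0;
        while (idxSrc < NUMOBJS) {
            aListDest1.add(aListSrc.get(idxSrc++));      // even positions
            if (idxSrc < NUMOBJS) {
                aListDest2.add(aListSrc.get(idxSrc++));  // odd positions
            }
        }
    }

    public static void main(String[] args) {
        ArrayList<MyClass> src = new ArrayList<MyClass>();
        for (int i = 0; i < 5; i++) {
            src.add(new MyClass());
        }
        ArrayList<MyClass> evens = new ArrayList<MyClass>();
        ArrayList<MyClass> odds = new ArrayList<MyClass>();
        split(src, evens, odds);
        System.out.println(evens.size() + " even, " + odds.size() + " odd");
    }
}
```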
After measuring version 1, you decide those add method lines are a bit wordy so you break them into two statements, using a local variable to hold the intermediate result. Or perhaps you wanted to print some debug information for each element as you are copying it, and you needed a local variable to hold the element reference (and you then removed the debug statements). So you end up with something like:
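A sketch of that local-variable variant, under the same assumptions as before (hypothetical wrapper class, empty MyClass stand-in, even/odd routing by position). The only intended difference from version 1 is the MyClass local that holds each element before it is added.

```java
import java.util.ArrayList;

class SplitVersion2 {
    // Stand-in element type; its fields are never read by the split.
    static class MyClass { }

    // Version 2 shape: each element is first stored in a MyClass local.
    static void split(ArrayList<MyClass> aListSrc,
                      ArrayList<MyClass> aListDest1,
                      ArrayList<MyClass> aListDest2) {
        final int NUMOBJS = aListSrc.size();
        int idxSrc = 0;
        while (idxSrc < NUMOBJS) {
            MyClass myc = aListSrc.get(idxSrc++);   // even positions
            aListDest1.add(myc);
            if (idxSrc < NUMOBJS) {
                myc = aListSrc.get(idxSrc++);       // odd positions
                aListDest2.add(myc);
            }
        }
    }

    public static void main(String[] args) {
        ArrayList<MyClass> src = new ArrayList<MyClass>();
        for (int i = 0; i < 5; i++) {
            src.add(new MyClass());
        }
        ArrayList<MyClass> evens = new ArrayList<MyClass>();
        ArrayList<MyClass> odds = new ArrayList<MyClass>();
        split(src, evens, odds);
        System.out.println(evens.size() + " even, " + odds.size() + " odd");
    }
}
```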
But when you measure version 2, you find that it is much slower than version 1 (about 1/3 the speed in my measurements). Before reading on, you might try to figure out why. Is the JVM perhaps not able to optimize away the store to the local variable? And if so, is the store to the local variable really that expensive? I will add that in both cases, the get and add methods got inlined nicely into the timed loop.
Answer
You may recall that generics in Java are implemented with type checking at compile time but with type erasure at run time. How does that impact us? Well for one it means that at runtime the call to
aListSrc.get(idxSrc);
really returns an Object, even though aListSrc is an ArrayList<MyClass>. Therefore the statement from version 2:
MyClass myc = aListSrc.get(idxSrc);
requires a runtime castcheck that the Object returned by aListSrc really is a MyClass. If you look at the byte codes generated for such a statement, you will see a checkcast bytecode.
To check whether the object returned by aListSrc.get really is of type MyClass (or a child of MyClass) the JVM must read the header of the object. In our list splitting operation however we never had any other reason to look at any of the fields of the MyClass objects as we went thru the list. We just copied each MyClass reference from the source list into one of the destination lists. So by having to look at the header as part of the castcheck, we must now wait until the object is read from memory into the processor’s cache. And with lots of objects in the list, it makes it less likely that an object is already in the cache when we need it.
How did we avoid the castcheck in Version 1? In version 1, the javac compiler used the list’s type declaration ArrayList<MyClass> to guarantee that the returned object was of type MyClass at compile time. And at runtime the types from the generics were erased so basically we have a get method returning an Object which is passed to an add method which takes an Object. So no checkcast is necessary.
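One way to observe this difference without reading bytecode is to deliberately slip a non-MyClass object into the list through a raw reference. In the sketch below (all class and method names are hypothetical), the version 1 shape copies the bad element without complaint, while the version 2 shape fails with a ClassCastException at the assignment to the local variable, which is where the checkcast sits.

```java
import java.util.ArrayList;
import java.util.List;

class CheckcastDemo {
    static class MyClass { }

    // Deliberately pollutes a List<MyClass> with a String via a raw reference.
    @SuppressWarnings({"unchecked", "rawtypes"})
    static List<MyClass> pollutedList() {
        List raw = new ArrayList();
        raw.add("not a MyClass");
        return raw;
    }

    // Version 1 shape: get() feeds add() directly, Object in and Object out.
    static boolean directCopySucceeds() {
        List<MyClass> src = pollutedList();
        List<MyClass> dest = new ArrayList<MyClass>();
        dest.add(src.get(0));
        return dest.size() == 1;
    }

    // Version 2 shape: the assignment to a MyClass local is cast-checked.
    static boolean localVariableThrows() {
        List<MyClass> src = pollutedList();
        List<MyClass> dest = new ArrayList<MyClass>();
        try {
            MyClass myc = src.get(0);
            dest.add(myc);
            return false;
        } catch (ClassCastException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("direct copy succeeded: " + directCopySucceeds());
        System.out.println("local variable threw:  " + localVariableThrows());
    }
}
```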
Note that we can try to get around the checkcast by just declaring the local variable to be an Object rather than a MyClass, but now the javac compiler will rightly complain when we try to do an add of an Object into an ArrayList<MyClass>.
Version 3 (will not compile)
ArrayList<MyClass> aListSrc, aListDest1, aListDest2;
...
while (idxSrc < NUMOBJS) {
    Object myc = aListSrc.get(idxSrc);
    aListDest1.add(myc); // error
    ...
}
I should note here that if our original algorithm had looked at fields of the MyClass objects to make some decision on how to split the list, then the object would already have had to be read from memory for the other field accesses and the extra time to do the header check for the castcheck would have been insignificant.
Even though the above is explainable by type erasure, I’m not sure it follows the principle of least surprise. After all, I declared aListSrc to be ArrayList<MyClass> and all I did was assign the .get output to a MyClass object. If the javac compiler knew enough to eliminate the castcheck between the output of the get and the input of the add, why couldn’t it eliminate it between the output of the get and the assignment to the local variable?
Looking at this from another angle, one might ask whether the JVM can optimize away the castcheck at runtime. A check with the Hotspot folks indicated that the bytecodes are saying "throw an exception if aListSrc.get ever returns a non-MyClass object". And the JVM cannot elide bytecodes that could cause an exception like this.
So the message is don't cast your return from the Collections classes like this if you don't need to.
-------------------------
Check out this new article in the Java Zone: Optimizing Java Performance in a Virtualized Environment. It's based on a JavaOne 2008 Tech Session of the same name by Shrinivas and Azeem, which provided a good overview of how to navigate the intersecting worlds of Java and Virtualization.
Let us know what you think.
-------------------------
Most Java developers are probably aware that enums were added in Java 1.5, and we are becoming more familiar with seeing them used like this:
enum LIGHT { RED, AMBER, GREEN };
Here we are defining an enum that we can use to hold the state of a traffic light. The above code allows LIGHT to be used as a new type.
LIGHT light = LIGHT.RED;
and via some magic we can use LIGHT values in switch constructs.
switch (light) {
    case RED: System.out.println("Stop"); break;
    case AMBER: System.out.println("Get ready"); break;
    case GREEN: System.out.println("Go"); break;
}
We can iterate over the values of LIGHT using the array returned from the values() accessor
for (LIGHT light : LIGHT.values()) {
    System.out.println(light);
}
and can also perform ordinal comparisons.
if (light.ordinal() < LIGHT.GREEN.ordinal()) {
    System.out.println("Not yet!");
}
Because enums are indeed Classes we can customize them by adding fields, constructors and methods.
So if we wanted to be able to query each 'value' for the next in the sequence (including wrapping from GREEN to RED) we can use :-
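One possible way to write such a method is sketched below. It leans on declaration order and ordinal arithmetic, which is an assumption of this sketch rather than necessarily the approach originally intended here, and the wrapper class name is hypothetical.

```java
class LightNextSketch {
    enum LIGHT {
        RED, AMBER, GREEN;

        // Next value in declaration order, wrapping from the last back to the first.
        LIGHT next() {
            LIGHT[] all = values();
            return all[(ordinal() + 1) % all.length];
        }
    }

    public static void main(String[] args) {
        LIGHT light = LIGHT.RED;
        for (int i = 0; i < 4; i++) {
            System.out.println(light);
            light = light.next();
        }
    }
}
```

A side effect of this design is that inserting a new value into the list changes the sequence without touching next().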
We can also override methods for each value. So an alternative to the above implementation might be
enum LIGHT {
    RED,
    AMBER {
        LIGHT next() { return (GREEN); }
    },
    GREEN {
        LIGHT next() { return (RED); }
    };
    // We need a method to override, so let's assume RED is the default
    LIGHT next() { return (AMBER); }
}
Which is a little more verbose, but in some ways more explicit. Note that we must provide an implementation for the enum and then each 'value' can override this if it chooses.
Although we can't extend enums (and probably for good reason) we can implement interfaces.
Let’s say we had an application which deals with a bound set of file types (XML, TEXT and ANY). We could make a FILE_TYPE enum which supports the FileFilter interface.
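A minimal sketch of what such an enum might look like. The file suffixes, the suffix field, and the wrapper class are assumptions for illustration; only the FILE_TYPE name, its three values, and the FileFilter interface come from the description above.

```java
import java.io.File;
import java.io.FileFilter;
import java.util.Locale;

class FileTypeSketch {
    enum FILE_TYPE implements FileFilter {
        XML(".xml"),   // assumed suffix
        TEXT(".txt"),  // assumed suffix
        ANY("");       // the empty suffix matches every file name

        private final String suffix;

        FILE_TYPE(String suffix) {
            this.suffix = suffix;
        }

        // FileFilter's single method: accept files whose name ends with the suffix.
        public boolean accept(File file) {
            return file.getName().toLowerCase(Locale.ROOT).endsWith(suffix);
        }
    }

    public static void main(String[] args) {
        // Each enum value can be handed directly to File.listFiles(FileFilter).
        File[] xmlFiles = new File(".").listFiles(FILE_TYPE.XML);
        System.out.println(xmlFiles == null ? 0 : xmlFiles.length);
    }
}
```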
Obviously we need to be careful and not 'misuse' them, but I believe that enums can offer options beyond the traditional static list of values.
-------------------------
A few weeks ago, the Java Posse interviewed Azeem, Gary, and me. The podcast has been posted! A lot of great topics were covered, including JVM performance, multi-core programming, developer tools and more. Have a listen, then comment here.
Many thanks to the Java Posse for the opportunity.
Ben
-------------------------
Many of us are familiar with Apache's JMeter tool, an open source tool which can help load test and measure the performance of web applications. JMeter has an excellent GUI mode and this is the mode that is presented if you invoke JMeter with no arguments. During script development, this GUI mode is the way to go. New configuration elements, thread groups and samplers can be added and edited and the results from runs can be viewed with a number of different listeners which helps with debugging.
When the script is complete and debugged, however, you may find it more convenient to run JMeter in non-GUI mode. Because of the reduced overhead, you may also find you can drive more requests per second out of JMeter in non-GUI mode. Running a finished test plan in non-GUI mode is fairly simple: use the -n option for non-GUI mode and the -t option to specify the test plan as follows:
jmeter -n -t MyTestPlan.jmx
However, if you take an existing test plan you developed in GUI mode and run it as above, chances are the only output you'll see will be something like the following:
Created the tree successfully using MyTestPlan.jmx
Starting the test @ Mon Mar 02 14:01:13 CST 2009 (1236024073938)
Tidying up ... @ Mon Mar 02 14:01:15 CST 2009 (1236024075844)
... end of run
with no additional information sent to stdout. Any listeners you happened to have in your test plan were probably GUI-mode-specific and don't send anything to non-GUI mode. So how do we see how we're doing regarding throughput and/or response time?
Using the JMeter Summariser
The Summariser is like a special listener that only applies to non-GUI mode. It is controlled through JMeter properties. JMeter properties can be set either by
editing the default in the bin/jmeter.properties file
creating a new properties file and specifying that on the JMeter command line (-p option)
specifying any number of individual properties on the JMeter command line using the -J or -D options (type jmeter -help to see the command line options).
The following properties affect the summariser:
# Define the following property to automatically start a summariser
# with that name (applies to non-GUI mode only)
summariser.name=summary
#
# interval between summaries (in seconds) default 3 minutes
summariser.interval=180
#
# Write messages to log file
summariser.log=true
#
# Write messages to System.out
summariser.out=true
So with the above properties enabled, we would get a summary sent every 180 seconds to both System.out and the log file. Note the summariser.name allows other summarisers to be developed and plugged in but I have not investigated that yet. The default summariser has been pretty useful. Here's an example of what you get from the default summariser:
The lines with "summary +" are incremental for the latest summariser period; the lines with "summary =" are cumulative. The above was with a summariser period of 20 seconds. You can see that the actual periods can sometimes be longer than the specified period, and the length of the very first period is somewhat random. You get the throughput statistics as well as average, min and max response times, and how many errors were detected (assuming your JMeter test plan has assertions to detect errors).
The JMeter log file
You may have noticed that one of the summariser properties was whether we wanted to send summariser output to the log file. Every JMeter run produces a log file, by default bin/jmeter.log. In either GUI or non-GUI mode, you can specify a different log file directly on the command line using the -j option. And as with most things in JMeter, the actual name of the log file is configurable through a property, in this case the log_file property. One interesting option allows the log_file property to include the date, such as
log_file='jmeter_'yyyyMMddHHmmss'.tmp'
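The quoting in that value looks like a java.text.SimpleDateFormat pattern, where text inside single quotes is copied literally and the rest is treated as date fields. Assuming that is how the property is interpreted, the following sketch shows how such a pattern expands into a file name. The `LogFileNameDemo` class and its `expand` helper are hypothetical and are not part of JMeter.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

class LogFileNameDemo {
    // Expands a quoted date pattern into a concrete file name.
    // Assumption: the pattern follows SimpleDateFormat conventions.
    static String expand(String pattern, Date when, TimeZone zone) {
        SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ROOT);
        format.setTimeZone(zone);
        return format.format(when);
    }

    public static void main(String[] args) {
        String pattern = "'jmeter_'yyyyMMddHHmmss'.tmp'";
        // Prints a name such as jmeter_<timestamp>.tmp for the current time.
        System.out.println(expand(pattern, new Date(), TimeZone.getDefault()));
    }
}
```

A timestamped name like this keeps each run's log separate, which is handy when comparing several non-GUI runs.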
So when you enable summariser.log=true as we did above, the summary lines we saw on stdout will also appear in the log file, with an identifying header such as "INFO - jmeter.reporters.Summariser:" to show their source (they will be mixed with other JMeter log information).
Logging samples in non-GUI mode
Note that the JMeter usage message refers to a different kind of log file which can be used to log the actual samples in an xml file format called JTL:
-l, --logfile <argument>
the file to log samples to
So in non-GUI mode this can be used to log the actual sample information much as the View Results Tree listener does in GUI mode. In fact, you can take the jtl file produced here and load it into the listeners in GUI mode to get a more readable display. Because of the intrusive overhead of logging all samples, you probably won't want to do that in measurement mode but you might do it when debugging a problem.
In a future post, we'll get into a couple of other useful JMeter topics:
Using the scripting configuration elements supported by JMeter to add dynamic run time information to the log file.
Having finer control over what gets saved in the .jtl file
Meanwhile, I'd be interested in learning how others use JMeter in non-GUI mode.
-------------------------
Most modern processors are superscalar, or implement a form of parallelism where the processor can schedule and execute multiple instructions at one time. The latest AMD Opteron™ processors are superscalar, which allows them to run code faster than would otherwise be possible at the same clock rate. This can present a few problems with respect to finding performance bottlenecks. Since the latest AMD Opteron™ processors can execute and track up to 72 instructions in flight at one time, it becomes difficult to properly match up source code to the in-flight instructions.
What does all of the above have to do with profiling your application? If you are investigating performance issues with a timer-based profile, then you are mostly interested in how many cycles a method took to execute, and a few instructions being mapped to the wrong method is not of much consequence. If you are running profiles with performance counters, however, the lack of precision between when an instruction caused the event you are tracking and when that event shows up in the counter could throw off your analysis. Traditional profiling tools will not be able to accurately match the performance counter samples with the generated X86 assembly code. Instructions in flight can retire at any time, depending on memory access, hits in cache, stalls in the pipeline and many other factors, so performance counter events can be attributed to the wrong instruction. This is even more difficult in a managed environment like Java, where the generated code is dynamically created and executed.
All of this adds up to inaccurate mappings between X86 assembly and performance counters which can mislead most performance engineers into fixing performance bottlenecks on the wrong sequence of code.
One way around this is to use better implementations of hardware performance counters. The latest AMD Opteron™ processors have something that can help. Instruction Based Sampling (IBS) provides precise information about the execution of instructions. IBS provides four advantages over conventional performance counters:
1. Hardware events are attributed precisely to the instruction that caused the event.
2. A wide range of events are gathered, and are not limited to four out of many events that must be specified at the beginning of the profile.
3. Virtual and physical addresses of load/store operands are collected. This allows managed environments to associate specific data structures with X86 instructions.
4. Latency is measured for key performance parameters such as data cache miss.
-------------------------
I'm happy to announce that the Java Labs will be interviewed by the Java Posse tomorrow. Their weekly podcast keeps the development community up-to-speed on the world of Java. While we certainly plan to discuss the involvement of the team and AMD in that world, the interview gives us the opportunity to discuss the points that are important to you.
Do you have a pressing topic that you want us to weave into the discussion? Let us know by commenting here.
Thanks,
Ben
-------------------------
Things like code refactorings, insufficient code coverage testing, poor coding standards or mere oversight can often lead to redundant code. Generally, all commercial JVMs apply a certain level of escape analysis and dead code elimination while executing Java code. However, there are certain cases where the JVM is unable to eliminate redundancy; for example, unused instance variables of a class.
IDEs such as Eclipse can warn developers about local variables and instance variables that are never read by the program. It might be easier for JVMs to identify and optimize away redundant local variables. However, a JVM cannot eliminate unused instance variables that can potentially be accessed using mechanisms such as Java Reflection. These unused instance variables will increase object size and, in turn, the memory footprint of the Java application if a large number of such objects are allocated by the program.
You might think that such unused/unread instance variables might not have a big impact on the application performance. Contrary to this belief, such redundant variables can substantially affect application performance. In fact, this could lead to pathological cases where application performance can be hampered due to inefficient processor memory cache usage. For example, a redundant instance variable which gets mapped to the end of object layout in the memory can increase the object size such that the resulting object does not fit into a single cache line. On the other hand, when such a field gets mapped into middle parts of the object layout, it can lead to memory holes and result in redundant garbage collection cost.
The following benchmark demonstrates performance effects of unused instance variables:
As you can see, this application creates approximately 1024M of data. The ObjectTrimming class has 7 fields of long data type and a byte array. Since each long field occupies 64 bits, objects of the ObjectTrimming class will have 56 bytes of space occupied by long fields. We use the byte array to allocate another 968 bytes of data, which brings each ObjectTrimming object to a size of 1KB. We create 104857 such objects in a loop which iterates 10 times. Thus, in the end, the listStore linked list will contain approximately 1024M of data.
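For readers who want to try something similar, here is a minimal sketch reconstructed purely from the description above. It is not the original benchmark listing: the field names, the `fill` helper and the `ObjectTrimmingBenchmark` class are assumptions, and only ObjectTrimming, listStore and the sizes come from the text.

```java
import java.util.LinkedList;
import java.util.List;

// Reconstructed from the prose description; field names are hypothetical.
class ObjectTrimming {
    // Seven long fields that are never read: 7 * 8 = 56 bytes.
    long f1, f2, f3, f4, f5, f6, f7;
    // 968 more bytes, so each instance accounts for roughly 1KB (ignoring headers).
    final byte[] payload = new byte[968];
}

class ObjectTrimmingBenchmark {
    // Allocates iterations * objectsPerIteration objects and keeps them all alive.
    static List<ObjectTrimming> fill(int iterations, int objectsPerIteration) {
        List<ObjectTrimming> listStore = new LinkedList<>();
        for (int i = 0; i < iterations; i++) {
            for (int j = 0; j < objectsPerIteration; j++) {
                listStore.add(new ObjectTrimming());
            }
        }
        return listStore;
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // 10 iterations of 104857 objects, as described in the text.
        List<ObjectTrimming> listStore = fill(10, 104857);
        long elapsed = System.currentTimeMillis() - start;
        System.out.println(listStore.size() + " objects in " + elapsed + " ms");
    }
}
```

To compare the two cases, delete the seven long fields from ObjectTrimming and run again with the same heap settings. Timings will differ from the numbers quoted here depending on hardware and JVM version.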
We executed this benchmark three times on a system with 2 AMD Opteron 8384 processors (4 cores per chip, 8 cores in total) running at 2.7GHz, with 8G of RAM, and it took an average of 9894 milliseconds. None of the long variables are used by the ObjectTrimming class, nor can they be accessed by other classes, so we decided to remove these fields and rerun the modified benchmark. This time the execution time for the three runs averaged 6493 milliseconds. That is a 52% improvement in application performance (9894/6493 ≈ 1.52), or about a 34% reduction in run time. It is unlikely that a class will have as large a ratio of unused fields to used fields as in our example above. However, the example shows that saving 56 bytes of data in a 1K object can have a big impact on the memory footprint of the application.
We used Sun Java SE Runtime Environment (build 1.6.0_06-p-b01) with Java HotSpot Server VM (build 14.0-b09, mixed mode) for the above experiments. The JVM command-line flags used to run the benchmark were: java -server -Xms1175M -Xmx1175M
As noted above, JVMs cannot necessarily eliminate such unused fields, thus it is the Java Developer's responsibility to optimally define classes.
-------------------------
In many texts String is cited as the 'gold standard' of Java's various immutable classes. Any google of 'Immutable Java' will invariably reveal examples using String to demonstrate the benefits and characteristics of a good immutable class.
'In object-oriented and functional programming, an immutable object is an object whose state cannot be modified after it is created.'
I like this definition as it describes what I had assumed to be the main point: that an instance's state should not be allowed to be modified post-construction.
However, I recently found myself looking at the String class and came across its hashCode() method:
public int hashCode() {
    int h = hash;
    if (h == 0) {
        int off = offset;
        char val[] = value;
        int len = count;
        for (int i = 0; i < len; i++) {
            h = 31*h + val[off++];
        }
        hash = h;
    }
    return h;
}
Looking closely we see that this is a classic implementation of the 'lazy evaluation' pattern. Rather than computing the hash value in the constructor, or computing it each time the hashCode() method is called, we compute it once (on the first call to hashCode()) and save the computed value in the hash field.
We can see this if we read the private hash field reflectively.
String helloWorld = "helloWorld";
Field field = String.class.getDeclaredField("hash");
field.setAccessible(true); // because hash field is private
System.out.println("Before first hashcode call "+field.getInt(helloWorld));
helloWorld.hashCode();
System.out.println("After first hashcode call "+field.getInt(helloWorld));
If one runs this snippet of code it will output something like
Before first hashcode call 0
After first hashcode call -1554135584
So for a String instance which we create, but for which we never call hashCode(), the private hash field will remain 0. It is only changed when we call hashCode().
By Wikipedia's definition, I believe String fails the immutability test.
The counterargument might be that we cannot observe a String instance in a different state without resorting to a reflective read of String's hash field, because the only call that retrieves this state is the one that modifies it. If we can't observe it changing, then it didn't change.
This is reminiscent of
"If a tree falls in a forest and nobody hears it, did it make a noise?"
Or my personal version
"If I say something in a room and my wife doesn't hear me, am I still wrong?"
However it still seems like String is not immutable.
So I would like to open up a discussion: Is String immutable?
-------------------------
Many Java applications, especially those using large heaps, can benefit from what the operating systems call large or huge pages. In the x86 architecture, these are pages that are larger than the default 4K byte pages, usually 2MB. See the article Supersizing Java: Large Pages on the Opteron Processor for a discussion of these huge pages and how to set up your OS and JVM to use them.
Now if you are specifically using huge pages in a multi-processor NUMA environment and you intend to run multiple JVMs each affinitized to a node, which means we are pinning that JVM's threads to one node, then for maximum performance you would like to make sure that the huge pages you allocate for each affinitized JVM's heap are local to that node. A previous blog entry addressed this issue for Linux® operating systems, and here we discuss Windows® operating systems.
Huge Pages and NUMA on Windows
On Windows, as described in the Supersizing article, you do not need to (in fact you cannot) reserve the huge pages before an application like a JVM can use them. You just need to enable the user’s rights to “Lock Pages in Memory” and the requesting application will acquire the huge pages at runtime. Note that the allocation policy should thus be different from the Linux allocation policy because the Linux policy happened outside of the process context at page reserve time.
But let’s look at what happens at runtime. The JVM or other application makes a VirtualAlloc request to allocate a chunk of memory with a flag to map it to huge pages. The OS really deals in terms of 4K pages. To return a 2MB page it must find 512 contiguous 4K pages and those will then be locked in memory (they are not candidates for paging out). By default, it will look for such a 2MB block in the memory that is local to the requesting processor. If the OS cannot find a free 2MB page in local memory, it will search in memory from other nodes in the system.
Another thing to note is that on Windows, an affinitized process really just has its execution threads bound to a set of processor cores. There is no way to force memory allocation to be affinitized as well. Windows will silently allocate memory for an affinitized process from other nodes if it doesn’t have enough memory locally. (Note: this is true even if the newer VirtualAllocExNuma call is used.) So the allocation strategy mentioned above holds for both affinitized and non-affinitized processes.
There is one other complication. This huge page allocation strategy (allocating as much as possible from local memory, and then allocating from remote memory) reflects what happens on Windows Server 2008. However, on Windows Server 2003 there is a feature of huge page allocation that makes it less likely to allocate as much as possible from local memory. On Windows Server 2003, the OS gives priority to contiguous blocks of 2M pages whether they are local or not. So even if there are enough local huge pages in total, if memory is fragmented enough that large contiguous blocks are not available on the requesting node, the likelihood of getting local huge page memory will be lower on WS2003.
If you do control the memory allocation code for the application you can work around this Windows Server 2003 feature by allocating the huge pages one page at a time. Later releases of the standard JVMs implement this Windows Server 2003 workaround.
Measuring whether we can get Local Huge Pages
On either Windows Server 2008 or Windows Server 2003, since the application will just silently allocate memory from other nodes if it doesn’t have enough locally, how does one tell if their affinitized process really is getting all their huge pages from local memory? I’m not aware of any Windows utility that helps answer this question. In AMD Java Labs we use an internal tool that can tell us how many free huge pages are available on each node. This tool allocates as many huge pages as possible and then looks at the physical address of the pages to tell what node they are on. We can then infer that these pages must have been free and could be acquired by the real application of interest.
I’d be interested in hearing whether others know of a Windows utility that helps show where the huge pages can be allocated.
There is some advice, however. You can't reserve huge pages on Windows, and the longer a system is up, the more fragmented its free page pool will be. So if your application's performance depends not only on getting enough huge pages but also on making sure those pages are local to each node, the application should make its request as early as possible after boot time, before the memory gets fragmented.
In a future blog, we plan to address a different memory allocation problem: what if your process is not affinitized but really is running on cores that are distributed across several nodes and it still wants memory usage to be as local as possible? Stay tuned.
-------------------------
Last time we talked about setting up GC flags and how to tune your Java application with the proper GC algorithm. This time we're going to focus on how to properly set up your Java heap to help get the best performance possible.
The easiest way to set your heap would be to set a maximum allowable size and let the JVM handle everything else. The "-Xmx" flag sets the maximum heap size. For example, "-Xmx1024m" and "-Xmx1g" both set the maximum heap size to 1 gigabyte. Just realize that you don't want to set the maximum heap larger than the available memory in your system: performance will suffer, and you'll be worse off than had you used a smaller heap.
While that's a good start, we can clearly do better than that. The JVM also has this notion of the heap being split in multiple sections. One section is the nursery or young generation where all allocations happen. The other section is the tenured space or old generation where long lived objects get promoted from the nursery. We can tune the size of these spaces for better performance depending on how our application allocates objects. By adding the flag "-Xmn" you can set the nursery size, and the remaining heap is then allocated to the tenured space.
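One way to check that "-Xmx" and "-Xmn" had the effect you intended is to ask the running JVM how it has laid out its memory. The sketch below uses the standard java.lang.management API to list each memory pool with its current and maximum size. The `HeapLayoutDemo` class and its `describe` helper are names invented for this example, and the pool names printed will vary with the JVM and the collector in use.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

class HeapLayoutDemo {
    // Formats one memory pool as "name: used NK, max NK" (max may be undefined).
    static String describe(MemoryPoolMXBean pool) {
        MemoryUsage usage = pool.getUsage();
        if (usage == null) {
            return pool.getName() + ": not available";
        }
        long max = usage.getMax(); // -1 means the maximum is undefined
        String maxText = (max < 0) ? "undefined" : (max / 1024) + "K";
        return pool.getName() + ": used " + (usage.getUsed() / 1024) + "K, max " + maxText;
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + (Runtime.getRuntime().maxMemory() / 1024) + "K");
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            System.out.println(describe(pool));
        }
    }
}
```

Running this once with your chosen flags and once without them makes it easy to see which pools grew or shrank.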
This tuning of nursery can be an important source of performance improvements as a nursery collection is significantly faster than a full GC. And if your application creates many short lived objects then you would be better off with a larger nursery than tenured space. If only we could somehow calculate how much nursery our application uses we could tune better. Ah, but we can! Just use the flag "-verbose:gc" which shows each nursery and full GC and how much was collected and how much of the heap is free. And in our case we'll add "-XX:+PrintGCDetails" which increases the level of detail. As an example see:
Those are examples of the output you'll get when using "-verbose:gc" and "-XX:+PrintGCDetails". What does all of that mean? The first three lines are young generation collections. The Nursery is 960k large and had around 909k worth of data before the collection and around 13k after the collection. The total size of the young generation was 5056k and had around 1237k worth of data before the collection and about 341k after the collection. The last line shows a full collection. The tenured space is 4096k and in this case objects were moved into tenured space but none were collected. This points to a bigger problem, where the tenured space is too small and should be increased to help give better performance.
You might have noticed that with "-XX:+PrintGCDetails" you'll get a bit of output when your application ends that looks like:
Heap
 def new generation total 960K, used 16K [0x22990000, 0x22a90000, 0x22e70000)
  eden space 896K, 0% used [0x22990000, 0x22990810, 0x22a70000)
  from space 64K, 22% used [0x22a70000, 0x22a738b0, 0x22a80000)
  to space 64K, 0% used [0x22a80000, 0x22a80000, 0x22a90000)
 tenured generation total 4096K, used 328K [0x22e70000, 0x23270000, 0x26990000)
  the space 4096K, 8% used [0x22e70000, 0x22ec2328, 0x22ec2400, 0x23270000)
 compacting perm gen total 12288K, used 152K [0x26990000, 0x27590000, 0x2a990000)
  the space 12288K, 1% used [0x26990000, 0x269b6390, 0x269b6400, 0x27590000)
  ro space 8192K, 63% used [0x2a990000, 0x2aea3ae8, 0x2aea3c00, 0x2b190000)
  rw space 12288K, 53% used [0x2b190000, 0x2b7f83f8, 0x2b7f8400, 0x2bd90000)
This shows you the sizes of the various parts of the heap and the amount used at the time the application quit. It's a good snapshot of the application's heap requirements at the end of the run and can help you tune. As an example, if the new generation used is close to the total size of the new generation, then you should increase the nursery size using the "-Xmn" flag. Again, garbage collection is expensive and causes your application to stop executing while the collection happens. The fewer collections you have to do, the faster your application runs. And a full GC is slower than a nursery GC. If your application is causing many full GCs to happen and your tenured generation is nearly full, then modify the nursery (-Xmn) such that the nursery is a smaller portion of the total heap size.
Stay tuned for part 3 as we discuss how to combine GC flags and heap flags to tune a real application for maximum performance.
-------------------------
As API developers, we are always attempting to balance functionality and convenience. Even this fairly straightforward class can become confusing when we try to provide convenient alternative constructors.
If, for example, our Rectangle's w and h were usually 100 we may be tempted to create a constructor which defers to the original constructor with default values.
Rectangle(int _x, int _y){
this(_x, _y, 100, 100);
}
One might also allow a default for when h and w are 100 and x and y are 0
Rectangle(){
this(0, 0, 100, 100);
}
Things get tricky when we then try to create a constructor where we specify the w and h and take the default x and y.
We might be tempted to try to add this constructor
Rectangle(int _w, int _h){
this(0, 0, _w, _h);
}
Sadly, this will result in a compile failure. The new constructor has the same signature as Rectangle(int _x, int _y): the compiler does not care what the parameter names are, so there can only be one constructor taking (int,int).
It would be great if Java had a syntax for disambiguating the arguments to a method or constructor.
Rectangle r = new Rectangle(width:100, x:20);
We could specify the arguments in any order and the compiler would create a synthetic constructor which matched the arguments; alas Java cannot do this although we will see later how we can get close to this.
Of course, if we did not need our Rectangle to be immutable, we could abandon having multiple constructors and just use mutators.
Rectangle r = new Rectangle(); // defaults x,y = 0 and width,height=100
r.setWidth(500);
r.setHeight(500);
This is probably the best approach for most classes; it is a little verbose but it is easy to understand.
However, for immutable classes we need another approach.
Thankfully, we can use a builder pattern.
If we define an inner static class within Rectangle called Builder which we can mutate into an appropriate state (using regular setters), we can then pass this builder into Rectangle's constructor and let the constructor pull its own state from the builder.
This sounds more complicated than it is, so let's see what it looks like:
class Rectangle{
private int x,y,h,w;
public static class Builder{
protected int x,y,h,w;
void setW(int _w){w = _w; }
void setX(int _x){x = _x;}
void setY(int _y){y = _y;}
void setH(int _h){h = _h;}
}
public Rectangle(Builder _builder){
x=_builder.x;
y=_builder.y;
w=_builder.w;
h=_builder.h;
}
}
So now we can create an instance of Rectangle.Builder and mutate it.
Then we can create a Rectangle by passing the Rectangle.Builder to the constructor.
Rectangle.Builder builder = new Rectangle.Builder();
builder.setW(500);
builder.setH(1000);
Rectangle rectangle = new Rectangle(builder);
We can make the code less verbose by adding what I will call 'chainable mutators' to the builder. Unlike traditional mutators, we avoid the traditional setX() style for field x; instead we just name the mutator the same as the field (so we use x() instead of setX()). We also return 'this' from our mutator so that subsequent mutator calls can be chained together.
Our builder mutators now look like this:
Builder w(int _w){
w = _w;
return(this);
}
And so now we can use:
Rectangle.Builder builder = new Rectangle.Builder();
builder.w(200).h(200).x(5).y(5);
Rectangle rectangle = new Rectangle(builder);
We can construct and chain in one line:
Rectangle.Builder builder = new Rectangle.Builder().w(200).h(200).x(5).y(5);
Rectangle rectangle = new Rectangle(builder);
Or we can avoid declaring the temporary builder object altogether.
Rectangle r = new Rectangle(new Rectangle.Builder().w(200).h(200).x(5).y(5));
Now we have the safety of immutable Rectangles with a mechanism for getting immutable objects into a reasonable state.
At this point we can make two more enhancements.
1. Add a static build() method to Rectangle to construct the Rectangle.Builder.
2. Add a commit() method to the Rectangle.Builder to construct the Rectangle.
Our code now looks like this:
class Rectangle{
private int x,y,h,w;
public static class Builder{
protected int x,y,h,w;
Builder w(int _w){w = _w; return(this);}
Builder x(int _x){x = _x; return(this);}
Builder y(int _y){y = _y; return(this);}
Builder h(int _h){h = _h; return(this);}
Rectangle commit(){return new Rectangle(this);}
}
private Rectangle(Builder _builder){
x=_builder.x;
y=_builder.y;
w=_builder.w;
h=_builder.h;
}
static Builder build(){
return(new Builder());
}
}
And we can create a Rectangle using:
Rectangle r = Rectangle.build().w(200).h(200).x(5).y(5).commit();
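As a variation on the same idea, the builder can also carry the default values that the convenience constructors tried to provide earlier, so callers only name the fields they want to change. The sketch below is one way this might look. The class name `ImmutableRectangle`, the getters, the final fields and the default values in the builder are assumptions added for illustration and are not part of the code above.

```java
// Hypothetical variant of the Rectangle above; the class name, getters and
// builder defaults are assumptions made for this sketch.
final class ImmutableRectangle {
    private final int x, y, w, h;

    static final class Builder {
        // Defaults borrowed from the earlier constructors: origin (0,0), size 100x100.
        private int x = 0, y = 0, w = 100, h = 100;

        Builder x(int _x) { x = _x; return this; }
        Builder y(int _y) { y = _y; return this; }
        Builder w(int _w) { w = _w; return this; }
        Builder h(int _h) { h = _h; return this; }

        ImmutableRectangle commit() { return new ImmutableRectangle(this); }
    }

    // The only constructor copies state out of the builder, once.
    private ImmutableRectangle(Builder builder) {
        x = builder.x;
        y = builder.y;
        w = builder.w;
        h = builder.h;
    }

    static Builder build() { return new Builder(); }

    int getX() { return x; }
    int getY() { return y; }
    int getW() { return w; }
    int getH() { return h; }

    public static void main(String[] args) {
        // Only w and x are named; y and h keep their defaults.
        ImmutableRectangle r = ImmutableRectangle.build().w(200).x(5).commit();
        System.out.println(r.getX() + "," + r.getY() + " " + r.getW() + "x" + r.getH());
        // prints "5,0 200x100"
    }
}
```

Declaring the fields final lets the compiler enforce that nothing changes after construction, and it gets close to the named-argument syntax wished for earlier.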
We have been using this pattern in internal projects and would be interested to hear how folks feel about combining the builder pattern with chainable mutators for constructing immutable objects.
-------------------------
Edited: 02/16/2009 at 06:27 PM by AMD Developer Blogs Moderator
Tuning GC flags seems like it should be difficult and error prone, but following a few easy steps can help you get improved application performance with only a few minutes of your time. The first step is to decide what sort of application you are trying to tune. Is your application throughput bound, needing to perform as many transactions as possible in the shortest time period? Or is your application latency bound, where each transaction must finish in a specified time period?
I'm going to be using the Sun HotSpot JVM flags as my examples, but both Oracle's JRockit and IBM's J9 have similar flags. See the appropriate documentation on Oracle's or IBM's site, respectively.
Traditionally, most applications are throughput bound and are not sensitive to GC pause times. Throughput bound applications are better suited to a fast collection where the JVM can pause Java execution for GC. There is another class of applications that require a certain guarantee of how long the GC pauses will take. These applications are popular in the financial sector and other related fields. They require that the JVM ensures that individual transactions finish within a certain time period, usually less than 10ms.
Let's start with the most common case and target throughput applications. The first flag to enable is "-XX:+UseParallelOldGC", which enables a parallel version of both the young generation collector and the old generation collector. By default, the number of GC threads is derived from the number of processors, and the exact formula changes depending on the version of Java. We should change that to a reasonable value depending on the number of cores that you have available on your system and, more importantly, the number of cores you want to dedicate to this application. In this example I'm going to assume the system has two processors and eight cores (2p/8c). Since the application I want to run has the system to itself, I'm also going to set aside all eight cores for garbage collection. This isn't as bad as it sounds, since the ParallelGC is a "stop the world" collector and your Java application won't make forward progress until the collection has completed. To modify the number of threads use the flag "-XX:ParallelGCThreads=#", where # is the number of threads, in our case 8.
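If you are unsure how many cores the JVM sees, or which collectors ended up active after you set these flags, the running JVM can tell you. The following sketch uses the standard Runtime and java.lang.management APIs. The `GcSettingsDemo` class and its `suggestedThreadsFlag` helper are hypothetical names for this example, and the collector names it prints depend on the JVM version and the flags in effect.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

class GcSettingsDemo {
    // Builds the thread-count flag discussed above for a given number of cores.
    static String suggestedThreadsFlag(int cores) {
        return "-XX:ParallelGCThreads=" + cores;
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("Cores visible to the JVM: " + cores);
        System.out.println("Flag for dedicating all of them to GC: " + suggestedThreadsFlag(cores));

        // One bean per active collector, typically one for each generation.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": " + gc.getCollectionCount()
                    + " collections, " + gc.getCollectionTime() + " ms");
        }
    }
}
```

Calling the same loop at the end of a real workload gives a rough before-and-after comparison of collection counts and total GC time for different flag settings.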
Once you've enabled those flags test the application out and see how much performance you've gained.Ideally your application should now run faster and have shorter garbage collection pause times.What about other applications that are latency driven rather than performance driven?For those applications we need to enable the low pause or concurrent collector.To enable the concurrent collector we use the flag "-XX:+UseConcMarkSweepGC" which enables the concurrent collector.But since we care about latency and don't want GC time to dominate over application CPU use time we need to enable the incremental mode for CMS by using "-XX:+CMSIncrementalMode".These two flags only get maximum impact when the JVM knows what the collector is doing and is able to profile and change parameters.To enable that feature use "-XX:+CMSIncrementalPacing".
Now rerun your application and see if its response time has improved. In most cases, while response time may have improved, your throughput may have dropped; whether that tradeoff is acceptable is a decision only you can make. Stay tuned next time as we explore how to use GC logs to tune heap sizes and control GC performance.
-------------------------
Azeem Jiva, MTS Software Engineer
-------------------------
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.
Edited: 02/04/2009 at 01:56 AM by AMD Developer Blogs Moderator
Well, I may have managed to avoid watching "It's a Wonderful Life" last month, but I can't say the same about avoiding holiday cookies. I've just come off of enough of my seasonal sugar high to be able to construct actual sentences, so that I can tell you about the great trip that I had to NY on 12/17.
First, I had a good chat with David Worthington of the SD Times, who put together an article discussing AMD's Java optimization efforts. It highlights some of the short and long term optimization goals that we have.
That evening, I presented "Run Anywhere: The Hardware Platform Perspective" at the NY JavaSIG. Mainly, it's an introduction to the work that we do in the Java Labs, and an overview of tuning options from a hardware point of view. Many thanks to Google, who hosted the space for the meeting, to the great NY JavaSIG audience, and a special shout out to Eric Bruno, who blogged about it on Dr. Dobb's CodeTalk.
Overall, a successful trip. I even got to see some snow before getting back to warmer climes.
Ben
-------------------------
Many Java applications, especially those using large heaps, can benefit from what operating systems call large or huge pages. In the x86 architecture, these are pages that are larger than the default 4KB pages, usually 2MB. See the article Supersizing Java: Large Pages on the Opteron Processor for a discussion of these huge pages and how to set up your OS and JVM to use them.
But what if you are specifically using huge pages in a NUMA environment and you intend to run multiple JVMs each affinitized to a node? For maximum performance you would like to make sure the huge pages are balanced across the nodes of the system. In the following we discuss some finer points of huge pages and NUMA. The behavior and utilities available are different across Linux® and Windows®-based systems.
Huge Pages and NUMA on Linux®
On Linux, as described in the Supersizing article, you have to first reserve the huge pages (by a command such as echo nnn >/proc/sys/vm/nr_hugepages) before an application like a JVM can use them. Since huge pages need to be locked in memory, the write to nr_hugepages actually allocates the pages, and holds them in a pool. This reserved huge page memory cannot then be used for small pages. By reading back nr_hugepages, you can see whether you got the number of huge pages you asked for. (Note that you won’t see an error message on the write, so you won’t know how many huge pages you got unless you read it back). This is all fine, but what if we want to know not only how many huge pages we got, but also what nodes they were attached to?
This information is available thru the command cat /sys/devices/system/node/node*/meminfo. For each node, you are shown the HugePages_Total and the HugePages_Free. So after reserving thru nr_hugepages, you should see the HugePages_Free on the nodes in the system adding up to the total number of hugepages you asked for. And you would want those HugePages_Free to be distributed evenly across the nodes.
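If you want to check the balance from a script or a launcher rather than by eye, the per-node values can be pulled out of that output with a few lines of code. The sketch below assumes each relevant line looks like "Node 0 HugePages_Free:    512", which is the layout I would expect but which you should verify on your own kernel. The class name HugePagesPerNode is made up, and the sample lines in main are illustrative values, not measurements.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: extract HugePages_Free per NUMA node from node*/meminfo text.
// HugePagesPerNode is a hypothetical name; the line layout is an assumption.
final class HugePagesPerNode {

    private static final Pattern FREE =
        Pattern.compile("^Node\\s+(\\d+)\\s+HugePages_Free:\\s+(\\d+)$");

    /** Maps node number to its HugePages_Free value; other lines are ignored. */
    static Map<Integer, Long> freeByNode(List<String> lines) {
        Map<Integer, Long> result = new TreeMap<Integer, Long>();
        for (String line : lines) {
            Matcher m = FREE.matcher(line.trim());
            if (m.matches()) {
                result.put(Integer.parseInt(m.group(1)), Long.parseLong(m.group(2)));
            }
        }
        return result;
    }

    /** Treats the pool as balanced when node counts differ by at most one page. */
    static boolean isBalanced(Map<Integer, Long> freeByNode) {
        if (freeByNode.isEmpty()) {
            return true;
        }
        long max = Collections.max(freeByNode.values());
        long min = Collections.min(freeByNode.values());
        return max - min <= 1;
    }

    public static void main(String[] args) {
        // Illustrative sample text, not output captured from a real system.
        List<String> sample = Arrays.asList(
            "Node 0 HugePages_Total:   512",
            "Node 0 HugePages_Free:    512",
            "Node 1 HugePages_Total:   512",
            "Node 1 HugePages_Free:    512");
        Map<Integer, Long> free = freeByNode(sample);
        System.out.println(free + " balanced=" + isBalanced(free));
    }
}
```

To run it against a live system you would read the lines from each /sys/devices/system/node/node*/meminfo file instead of using the sample list.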
The default allocation policy (maybe the only one) is to go from node to node in a round robin fashion looking for a free 2MB chunk of memory and placing it in the pool. However, the important point is that if one node has no more free 2MB chunks of contiguous memory available, the reservations will continue on other nodes until the required nr_hugepages are reserved or until no more hugepages can be found on any node. This makes sense because when you write to nr_hugepages, the system doesn’t know whether you’re later going to use those huge pages from affinitized processes or not.
The above logic holds true when you’re increasing the number of huge pages. However, if you’re reducing the number of huge pages, the behavior is less sensible in that the OS will free up as many pages as possible from node 0, then as many as possible from node 1, etc. until you’ve freed up the required number of pages. Thus, if you want to reduce the number of huge pages and still have them balanced across nodes, first reduce them to zero, and then increase them to the number you really want.
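To make the grow and shrink behavior concrete, here is a toy model of the policy as described in the last two paragraphs: round-robin reservation that skips exhausted nodes, and release that drains node 0 first. It is only a simulation of the description in this post, not kernel code, and the class name HugePagePoolModel and the node sizes are made up for illustration.

```java
import java.util.Arrays;

// Toy model of the reservation policy described above (not kernel code).
// HugePagePoolModel is a hypothetical name used only for illustration.
final class HugePagePoolModel {
    private final int[] freeChunks; // contiguous 2MB chunks still available per node
    private final int[] pooled;     // huge pages currently held in the pool per node
    private int next = 0;           // round-robin cursor

    HugePagePoolModel(int[] freeChunksPerNode) {
        this.freeChunks = freeChunksPerNode.clone();
        this.pooled = new int[freeChunksPerNode.length];
    }

    /** Reserves up to n pages round robin, skipping exhausted nodes. */
    int grow(int n) {
        int reserved = 0;
        int misses = 0; // consecutive nodes that had nothing left
        while (reserved < n && misses < freeChunks.length) {
            if (freeChunks[next] > 0) {
                freeChunks[next]--;
                pooled[next]++;
                reserved++;
                misses = 0;
            } else {
                misses++;
            }
            next = (next + 1) % freeChunks.length;
        }
        return reserved;
    }

    /** Releases up to n pages, draining node 0 first, then node 1, and so on. */
    int shrink(int n) {
        int released = 0;
        for (int node = 0; node < pooled.length && released < n; node++) {
            int take = Math.min(pooled[node], n - released);
            pooled[node] -= take;
            freeChunks[node] += take;
            released += take;
        }
        return released;
    }

    int[] pooled() {
        return pooled.clone();
    }

    public static void main(String[] args) {
        HugePagePoolModel pool = new HugePagePoolModel(new int[] {8, 8});
        pool.grow(8);   // 4 pages on each node
        pool.shrink(4); // all 4 come off node 0
        System.out.println(Arrays.toString(pool.pooled())); // prints [0, 4]
        pool.shrink(4); // reduce to zero first...
        pool.grow(4);   // ...then grow to the size actually wanted
        System.out.println(Arrays.toString(pool.pooled())); // prints [2, 2]
    }
}
```

The first printed line shows the imbalance left by a direct reduction. The second shows the balanced result of reducing to zero and then increasing.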
On Linux, when an application that has been affinitized using numactl's --membind option wants to allocate a huge page from the pool, it is only allowed to allocate from the pool local to the node (or nodes) it’s been affinitized to. If the application asks for more pages than are available on that node, it will fail rather than just allocate an unused page from the pool of some other node. In a way this is good: if you went to the trouble of affinitizing, you wanted the pages to be local, and there are advantages to failing early and letting you fix the page allocation problem rather than running with non-local pages at possibly lower performance.
In general huge page reservation should be done as early as possible before the memory gets fragmented which would result in fewer contiguous 2MB chunks being available. But the methods above can let you know whether a later reservation will meet your needs or not.
In our next entry, we’ll look at how Windows-based systems handle huge pages and NUMA. Meanwhile, I’d be interested in cases where people have actually used huge pages and affinity on Linux.
-------------------------
Tracing a performance problem back to source code is a common task in performance analysis. Tools that automate this task can save engineers plenty of manual labor. Though there are many Java tools that perform source mapping, they arguably have limited accuracy because JVMTI, the interface that they rely on, lacks information about inlining decisions. JVMTI does not know whether the code that the JVM generates for a compiled method belongs to the method itself or to another method that was inlined into it.
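To see why the missing inlining information matters, consider a profiler that maps a sampled program counter back to a method. The sketch below is a purely hypothetical model: the Range type, the addresses, and the method names are invented for illustration and are not JVMTI data structures. With only the compiled method's address range, every sample is charged to the outer method. With inlining ranges available, the sample can be charged to the innermost method that actually produced the code.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model of pc-to-method attribution; not a JVMTI data structure.
final class InlineAwareMapper {

    /** A half-open address range [start, end) generated for one method. */
    static final class Range {
        final long start;
        final long end;
        final String method;

        Range(long start, long end, String method) {
            this.start = start;
            this.end = end;
            this.method = method;
        }

        boolean contains(long pc) {
            return pc >= start && pc < end;
        }

        long size() {
            return end - start;
        }
    }

    /** Returns the innermost method covering pc, or null if pc is outside the compiled code. */
    static String attribute(long pc, Range compiled, List<Range> inlined) {
        if (!compiled.contains(pc)) {
            return null;
        }
        Range best = compiled;
        for (Range r : inlined) {
            // The smallest enclosing range is the most deeply inlined method.
            if (r.contains(pc) && r.size() <= best.size()) {
                best = r;
            }
        }
        return best.method;
    }

    public static void main(String[] args) {
        Range outer = new Range(0x1000L, 0x1100L, "Outer.run");
        List<Range> inlined = Arrays.asList(new Range(0x1040L, 0x1080L, "Helper.compute"));
        long sample = 0x1050L;
        // Without inlining information the sample is charged to the outer method.
        System.out.println(attribute(sample, outer, Collections.<Range>emptyList()));
        // With inlining information it is charged to the inlined method.
        System.out.println(attribute(sample, outer, inlined));
    }
}
```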
In a previous blog article ("Inlining Information Hidden in the Hotspot JVM"), I mentioned that we have found a way to expose to tool writers the inlining information that is inside the Hotspot JVM, paving the way for more accurate source mapping. However, I didn't describe how we implemented our approach. The way you solve a software problem can open the door to solving problems other than the one that you had originally targeted.
Initially, we wanted to solve the problem of improved source mapping. We knew that the Hotspot JVM contained information about inlining decisions, and that this information was unavailable to tool writers. We wanted to expose this information to tool writers.
Naturally, the first place to look was the JVMTI interface, since this is the interface that is used to pass information between the JVM and external tools. Through this interface, the JVM emits events (such as CompiledMethodLoad and ClassFileLoadHook) to inform external tools about the different phases that it is in. For example, whenever it compiles a method, it emits a CompiledMethodLoad event, which contains information about the method that was compiled. External tools (e.g., JVMTI agents) can process these events to do things like printing the names of the methods that the JVM compiled, or the names of the classes that were loaded.
The interesting thing about the CompiledMethodLoad callback is that it contains a void pointer that the JVM does not use. This led us to our solution: Whenever a method is compiled, we could simply attach its inlining information to this pointer, thereby allowing this information to be passed outside the JVM via the CompiledMethodLoad event.
What's nice about this mechanism is that it doesn't alter the JVMTI spec, and that it is reusable. Although this mechanism allows us to expose inlining information, we could in the future use it to expose other kinds of information, such as register allocation information or information about optimization decisions. So while our changes benefit source-to-address mapping, they can have more widespread applications.
What other applications do you see benefiting from our approach? And to those of you who are writing Java tools, what additional information do you wish you had, which is currently unavailable to you since it is inside the JVM?
-------------------------
Vasanth Venkatachalam
Edited: 08/18/2009 at 10:39 AM by AMD Developer Blogs Moderator