In a previous entry on the Advanced Synchronization Facility (ASF), my colleague Michael pointed you to the current ASF specification proposal and showed some nifty use-cases for the feature. In this blog entry I'll try to make this a little more practical and show you how you can get some more hands-on experience with ASF.
Running ASF
ASF is an experimental feature which means that we do not yet have access to a "toy implementation" in silicon to play with. As with all other cases where the real thing is not available for testing (such as with early crash tests for cars) we resort to simulation to analyse important properties of ASF. Simulation also allows us to get a feeling for how ASF can be used by applications and operating system kernels, and might be integrated into compilers and language runtimes.
The approach of simulation is nothing new inside AMD and we have a rich set of simulation tools available for all kinds of purposes. Several aspects of ASF, however, made us use another external open-source simulator called PTLsim for our analysis. On the one hand, we want to have detailed AMD64 simulation capabilities to provide some performance predictions, get fine-grained thread interleaving right, and support simulation of operating system kernels. Furthermore, we would like to have an understanding of how ASF interacts with other features employed in today's processor cores. On the other hand, all of this should not have prohibitive overheads in terms of simulation speed and prototyping effort.
In addition to the technical requirements, we appreciate PTLsim's open-source license, which makes it easier to share our prototypical ASF simulator implementation with the public and in related projects (such as the EU-funded VELOX project, which Martin will cover in the next post in this series).
Although PTLsim certainly has an impressive list of features, several of these features come at the price of a somewhat large infrastructural requirement. To allow simulation of the entire operating system, PTLsim relies on Xen to provide the first-order hardware abstraction. Xen in turn, however, may demand an elaborate test machine setup.
Besides "just" adding the ASF functionality to PTLsim, I've spent a fair amount of effort adding supportive features, such as a true multi-core simulation model that improves on the previously existing SMT (symmetric multi-threading) model. With the new multi-core model, each logical thread has its own set of resources (functional units and caches) and cores can modify the contents of other caches (for example by invalidating data in other caches by local updates). These interactions were not captured by the SMT model, as threads there shared functional units and caches. Other modifications to the upstream version of PTLsim mostly fix bugs in several subsystems of PTLsim. I regularly hang out on the ptlsim-devel mailing list :-).
Evaluating ASF
Our initial evaluation of ASF started with an (internal) predecessor of the currently available version; let's just call it ASF1. Although ASF1 is a more restricted form of the current ASF specification, its implementation and analysis have been published already. You can take a look at our EPHAM 2008 paper (or at my much more detailed thesis at the same location, if you're adventurous) to get an overview of how things behaved back in 2008. ASF1 basically has a more static phase layout; there is a strict separation between a 'declaration phase' and an 'atomic phase', in which you can add elements to your speculative working sets in the declaration phase only, and then modify them inside the subsequent atomic phase.
The static phase layout makes ASF1 unsuitable for applications that want to interleave modifications and working-set discovery within a single atomic region, unnecessarily restricting programmers' flexibility. Nevertheless we did find ASF1 extremely powerful and we showed an 80% performance improvement over a conventional lock-free implementation of a linked list, and 20% for accelerating a software transactional memory (STM) run-time (you can find more details in the documents referenced above).
ASF1 gives you the flexibility you need to make a lock-free linked-list implementation practical, actually even fairly straightforward. If you have some experience with lock-free linked lists, you'll know that the traditional CAS (compare-and-swap) is not easily usable for element removal from the list. In order to safely remove the element you have to change the preceding element's next-pointer (make it point to the deleted element's successor) and at the same time ensure that nobody concurrently adds an element just after the deleted element. With just CAS it is difficult to ensure that two memory locations do atomically change / keep their value. It is almost trivial to do this with ASF, even ASF1. Just have a look at Michael's DCAS example in the previous blog post.
Besides making the currently specified ASF implementation available for you to play with (see below), we are currently testing and extending the implementation thoroughly. For example, we are porting the TMunit testing application and looking at other larger applications. We also analyse various ways of implementing ASF, see how we can make use of the increased flexibility (over ASF1) for accelerating STMs better than with ASF1, and look at new look-free use cases for ASF.
Finally, we constantly strive to improve ASF to fit the needs of programmers wanting to use it -- so again, if you have any comments on the current ASF specification proposal, leave us a comment or send email to ASF_Feedback@amd.com!
Hands on
In our downloads section you can find all the ingredients needed to brew your own magic ASF1 potion: the tweaked simulator implementing ASF1; the benchmarks in which we have used ASF1 to accelerate (and simplify!) a lock-free linked list implementation and an STM; and various explanatory documents, such as our EPHAM 2008 paper and my Diploma thesis. I'm currently cleaning up the implementation of the current ASF specification in PTLsim and it will become available there shortly, too.
I'm aware that setting up the toolchain might be daunting, largely due to the Xen requirement, and sometimes less than 100% stable thanks to the research nature of the upstream project. If you have any specific questions regarding simulator setup and usage, please leave me a comment.
About me
I joined AMD's OSRC group in Dresden in May 2007 as a student intern and started implementing the original ASF proposal (ASF1 above) in PTLsim. This implementation work laid the foundation for my Master's thesis (mostly in English, ignore the German front pages) which I wrote to finish my studies of Computer Science at TU Dresden and the paper mentioned above. I graduated in February 2008 and have continued my work on ASF as a full employee at the OSRC since then.
I'm interested in most computer science and engineering topics, but I'm currently focusing on:
-
Microarchitecture: Cores, caches and interconnects
-
Memory model semantics
-
Simulation
-
Parallel programming: Transactional memory, lock-free programming
-
Computer graphics
I'd like to hear what your thoughts are on ASF, and what uses you have for it.
--
Stephan Diestelhorst, Software Engineer 1
AMD Operating System Research Center, Dresden
-------------------------
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.
Edited: 09/04/2009 at 05:35 AM by stephan.diestelhorst