AMD Developer Blogs - AMD Operating System Research Center (OSRC)
November 17, 2009
  The VELOX research project

This is the third of three blog articles describing how AMD's Operating System Research Center (OSRC) became involved in the development of the Advanced Synchronization Facility (ASF), how we are evaluating ASF, and how this and other activities fit into the EU-funded VELOX project aiming at improving the state of the art for software-transactional-memory systems.

AMD's Operating System Research Center plays a central role in the EU-funded VELOX project, which targets an integrated approach to transactional memory (TM) on multi- and many-core computers. I will shed some light on this role in the next section, after a short introduction to transactional memory.

Transactional memory

The transactional-memory programming paradigm is a promising approach to exploiting the trend toward more hardware parallelism more easily. To benefit from future hardware developments, software will have to make use of this growing parallelism; simply relying on increases in single-thread performance will no longer bring large improvements.

There exists a range of approaches to leverage this hardware parallelism, all of which have different restrictions:

  • Traditional lock-based approaches with coarse-grained locking do not scale well to a large number of cores.

  • Lock-based approaches with fine-grained locking are complex and hard to get right.

  • Lock-free approaches may provide good performance for specific cases but are also very tricky to implement correctly.

Transactional memory may be a solution here because its programming model is easier to grasp: atomic blocks are marked in the source code and are mapped to transactional-memory primitives by a compiler. With lock-based approaches, distinct locks have to be designated for specific resources to obtain parallelism. Although TM uses only a single primitive for marking atomic blocks, it can still scale well because only dynamic conflicts have to be handled.

For example, if two threads use a single lock to protect an atomic block of code, only one thread can be inside this block at any time, regardless of what is done inside the block. If this block of code transfers funds from one account to another, no two transfers can run in parallel. With fine-grained locking, one lock per account would be required, which is more complex to use and may lead to a deadlock. With TM, any number of threads can enter the block, and a conflict occurs only if the same memory regions are touched. For the account example, this means that conflicts occur only if two threads try to modify the same accounts at the same time; otherwise, everything can run in parallel.

The following example demonstrates the simple syntax for protecting a block of code with a transaction:

__tm_atomic {
  res = transfer_funds(src_account, dest_account);
  account_update_stat++;
}

Compare this to the complexity of a fine-grained locking example:

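lock(src_account.lock);    // one lock per account; two transfers that take
lock(dest_account.lock);   // the same accounts in opposite order can deadlock
lock(stats_lock);
res = transfer_funds(src_account, dest_account);
account_update_stat++;
unlock(stats_lock);        // release in reverse order of acquisition
unlock(dest_account.lock);
unlock(src_account.lock);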
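
To make the deadlock remark concrete, here is a minimal C sketch of the fine-grained variant using POSIX mutexes. It is not code from VELOX or ASF: the struct account type, the transfer_funds_locked() name, and the choice to order locks by address are all assumptions made for illustration. Ordering the two account locks consistently is one common way to keep transfer(a, b) and transfer(b, a) from waiting on each other.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical account type: one mutex per account (fine-grained locking). */
struct account {
    pthread_mutex_t lock;
    long balance;
};

static pthread_mutex_t stats_lock = PTHREAD_MUTEX_INITIALIZER;
static long account_update_stat = 0;

/* Move `amount` from src to dest.
 * Returns 0 on success, -1 if src lacks funds or src and dest are the same. */
int transfer_funds_locked(struct account *src, struct account *dest, long amount)
{
    assert(amount >= 0);
    if (src == dest)
        return -1;  /* locking the same mutex twice would self-deadlock */

    /* Acquire the account locks in one global order (here: by address), so
     * two transfers over the same pair of accounts cannot deadlock. */
    struct account *first  = ((uintptr_t)src < (uintptr_t)dest) ? src : dest;
    struct account *second = (first == src) ? dest : src;

    pthread_mutex_lock(&first->lock);
    pthread_mutex_lock(&second->lock);
    pthread_mutex_lock(&stats_lock);    /* always taken last */

    int res = -1;
    if (src->balance >= amount) {
        src->balance  -= amount;
        dest->balance += amount;
        res = 0;
    }
    account_update_stat++;              /* counts every attempted transfer */

    pthread_mutex_unlock(&stats_lock);
    pthread_mutex_unlock(&second->lock);
    pthread_mutex_unlock(&first->lock);
    return res;
}
```

Even in this small sketch, every caller still serializes on stats_lock, and the ordering rule has to be respected by all code that ever takes two account locks. That is the kind of global discipline the transactional version does not ask of the programmer.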

Up to now, software transactional memory (STM) has not been ready for widespread deployment. Its overhead is simply too high for many applications because of the additional bookkeeping involved. In our measurements, we saw an order-of-magnitude slowdown for individual microbenchmarks compared to a lock-free implementation. Even with perfect scalability, around ten cores would be required to match the performance of traditional approaches on a single core.

Although a hardware-transactional-memory solution would be very desirable as the costly bookkeeping could be relocated to dedicated silicon, hardware-transactional memory is not available for current industry architectures. Full-blown support is very complex to realize as it would have to interact with many processor parts and would be very resource hungry to implement.

Therefore, special hardware support for speeding up critical paths in STM proposals and support for hybrid software-hardware solutions, as targeted with ASF, are key motivations for AMD to participate in VELOX.

Our Role in VELOX

In VELOX we fulfill two roles: First, we support partners in verifying the feasibility of their implementation proposals from an industry point of view. Second, we verify our own hardware-transactional-memory proposal, AMD's ASF.

We recently released version 2.1 of the ASF Specification. Feel free to comment on it; we are interested in your feedback!

In the previous entry of our blog series on ASF, Stephan described how we use PTLsim for prototyping and evaluating ASF. To this end, we also extend PTLsim's memory module to better correspond to contemporary AMD multicore systems.

PTLsim's Memory Module

Up to now, PTLsim's memory model has been restricted to a single, inclusive cache hierarchy. This limits the simulation accuracy for current AMD multicore systems, which connect individual processors with HyperTransport links.

To overcome this limitation we are working on extending PTLsim's memory model.

  • We therefore first establish an interface between the core model and the memory hierarchy and refactor PTLsim to solely access memory through this interface.
    Up to now, PTLsim's access to memory locations is somewhat scattered throughout the code and the cache classes are tightly integrated with the rest of the code.
    For the interface design itself, we were inspired by the requests in M5's memory model and by the Ruby memory module in GEMS.

  • Once this process is completed, other, more versatile memory modules can be used together with PTLsim's core, such as the module that is currently being developed at VELOX partner Chalmers University of Technology. Alternatively, the refactored memory module in PTLsim can be developed further and extended to model a HyperTransport-connected NUMA system more accurately.
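
To illustrate what such an interface could look like, here is a small C sketch. It is not PTLsim code (PTLsim itself is written in C++), and every name in it (mem_request, mem_module, flat_access, core_load, core_store) is hypothetical. The point is only that the core model talks to a request-based interface, so the implementation behind it can be swapped.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical request packet passed from the core model to the memory module. */
enum mem_op { MEM_READ, MEM_WRITE };

struct mem_request {
    enum mem_op op;
    uint64_t    addr;   /* byte address */
    uint64_t    data;   /* payload for writes, result for reads */
};

/* The interface: the core model only ever sees this struct. */
struct mem_module {
    const char *name;
    /* Returns the access latency in cycles; fills req->data on reads. */
    unsigned (*access)(struct mem_module *self, struct mem_request *req);
    void *state;        /* implementation-private data */
};

/* One possible implementation: flat memory with a fixed latency. */
#define FLAT_WORDS 1024
struct flat_state {
    uint64_t words[FLAT_WORDS];
    unsigned latency;
};

static unsigned flat_access(struct mem_module *self, struct mem_request *req)
{
    struct flat_state *st = self->state;
    uint64_t idx = (req->addr / sizeof(uint64_t)) % FLAT_WORDS;
    if (req->op == MEM_WRITE)
        st->words[idx] = req->data;
    else
        req->data = st->words[idx];
    return st->latency;
}

void flat_module_init(struct mem_module *m, struct flat_state *st, unsigned latency)
{
    memset(st, 0, sizeof *st);
    st->latency = latency;
    m->name   = "flat";
    m->access = flat_access;
    m->state  = st;
}

/* Core-side helpers: they use the interface, never the implementation. */
unsigned core_store(struct mem_module *m, uint64_t addr, uint64_t value)
{
    struct mem_request req = { MEM_WRITE, addr, value };
    return m->access(m, &req);
}

unsigned core_load(struct mem_module *m, uint64_t addr, uint64_t *value)
{
    struct mem_request req = { MEM_READ, addr, 0 };
    unsigned latency = m->access(m, &req);
    *value = req.data;
    return latency;
}
```

A cache hierarchy or a NUMA model would provide its own access function and state, while core_load() and core_store() stay unchanged.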

Figure: Initial state of PTLsim's memory module. Starting point for the memory module structure in original PTLsim. The parts of the cache hierarchy (that will comprise the memory module) are tightly integrated with the rest of PTLsim.

Figure: Encapsulated memory module. Architecture after introducing the memory-module interface and using it to fully encapsulate accesses to the memory module (the cache hierarchy).

Figure: Replace encapsulated module with VELOX memory module. Now the memory module can be replaced with another one, implementing the same interface.

Outlook

I am currently in the process of updating the established interface inside PTLsim to the new packet-oriented version. Once this is done, we plan to release this version to VELOX partners and will help integrate the first external memory module with PTLsim. Further on, PTLsim will be part of an integrated VELOX demonstrator. Our current collaboration with VELOX partners is already giving us valuable feedback regarding the ASF specification and the implementation in PTLsim. By further working on the memory module, we will be able to do better, more in-depth evaluations of ASF and will be able to estimate how ASF could best be used in tomorrow's software.

About me

I joined AMD's OSRC group in Dresden in October 2008 to work on ASF and, in the context of VELOX, to extend PTLsim.

Before that, I was a PhD student at Technische Universität Dresden, with interests in runtime monitoring for real-time systems, microkernel-based systems, and other operating system topics, such as disk scheduling and file systems.

Martin Pohlack, Senior Software Engineer
AMD Operating System Research Center, Dresden



-------------------------

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. Links to third party sites are for convenience only, and no endorsement is implied.



Edited: 11/17/2009 at 03:03 PM by AMD Developer Blogs Moderator


    Posted By: Stephan Diestelhorst @ 11/17/2009 11:57 AM     AMD Operating System Research Center (OSRC)     Comments (0)  

September 4, 2009
  Evaluation of the Advanced Synchronization Facility (ASF)

In a previous entry on the Advanced Synchronization Facility (ASF), my colleague Michael pointed you to the current ASF specification proposal and showed some nifty use-cases for the feature. In this blog entry I'll try to make this a little more practical and show you how you can get some more hands-on experience with ASF.


Running ASF

ASF is an experimental feature which means that we do not yet have access to a "toy implementation" in silicon to play with. As with all other cases where the real thing is not available for testing (such as with early crash tests for cars) we resort to simulation to analyse important properties of ASF. Simulation also allows us to get a feeling for how ASF can be used by applications and operating system kernels, and might be integrated into compilers and language runtimes.

The approach of simulation is nothing new inside AMD and we have a rich set of simulation tools available for all kinds of purposes. Several aspects of ASF, however, made us use another external open-source simulator called PTLsim for our analysis. On the one hand, we want to have detailed AMD64 simulation capabilities to provide some performance predictions, get fine-grained thread interleaving right, and support simulation of operating system kernels. Furthermore, we would like to have an understanding of how ASF interacts with other features employed in today's processor cores. On the other hand, all of this should not have prohibitive overheads in terms of simulation speed and prototyping effort.

In addition to the technical requirements, we appreciate PTLsim's open-source license, which makes it easier to share our prototypical ASF simulator implementation with the public and in related projects (such as the EU-funded VELOX project, which Martin will cover in the next post in this series).

Although PTLsim certainly has an impressive list of features, several of these features come at the price of a somewhat large infrastructural requirement. To allow simulation of the entire operating system, PTLsim relies on Xen to provide the first-order hardware abstraction. Xen in turn, however, may demand an elaborate test machine setup.

Besides "just" adding the ASF functionality to PTLsim, I've spent a fair amount of effort adding supportive features, such as a true multi-core simulation model that improves on the previously existing SMT (simultaneous multithreading) model. With the new multi-core model, each logical thread has its own set of resources (functional units and caches), and cores can modify the contents of other caches (for example, a local update can invalidate data in other caches). These interactions were not captured by the SMT model, as threads there shared functional units and caches. Other modifications to the upstream version of PTLsim mostly fix bugs in several subsystems of PTLsim. I regularly hang out on the ptlsim-devel mailing list :-).

Evaluating ASF

Our initial evaluation of ASF started with an (internal) predecessor of the currently available version; let's just call it ASF1. Although ASF1 is a more restricted form of the current ASF specification, its implementation and analysis have been published already. You can take a look at our EPHAM 2008 paper (or at my much more detailed thesis at the same location, if you're adventurous) to get an overview of how things behaved back in 2008. ASF1 basically has a more static phase layout; there is a strict separation between a 'declaration phase' and an 'atomic phase', in which you can add elements to your speculative working sets in the declaration phase only, and then modify them inside the subsequent atomic phase.

The static phase layout makes ASF1 unsuitable for applications that want to interleave modifications and working-set discovery within a single atomic region, unnecessarily restricting programmers' flexibility. Nevertheless we did find ASF1 extremely powerful and we showed an 80% performance improvement over a conventional lock-free implementation of a linked list, and 20% for accelerating a software transactional memory (STM) run-time (you can find more details in the documents referenced above).

ASF1 gives you the flexibility you need to make a lock-free linked-list implementation practical, and actually fairly straightforward. If you have some experience with lock-free linked lists, you'll know that the traditional CAS (compare-and-swap) is not easily usable for removing an element from the list. To remove the element safely, you have to change the preceding element's next-pointer (make it point to the deleted element's successor) and at the same time ensure that nobody concurrently adds an element just after the deleted element. With just CAS, it is difficult to ensure that two memory locations atomically change or keep their values. It is almost trivial to do this with ASF, even ASF1. Just have a look at Michael's DCAS example in the previous blog post.
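
Here is a small C sketch of the removal step, assuming a two-location compare-and-swap is available. The dcas() function below is only a software stand-in that uses a global mutex, so it is not lock-free and is not how ASF works internally; it merely has the same observable semantics as a DCAS. The node layout, the REMOVED sentinel, and list_remove_after() are names made up for this illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct node {
    int          key;
    struct node *next;
};

/* Sentinel stored in a removed node's next pointer, so that a concurrent
 * "insert after this node" can detect the removal and fail. */
static struct node removed_marker;
#define REMOVED (&removed_marker)

/* Software stand-in for a two-location atomic compare-and-swap. A real
 * implementation would use an ASF speculative region; this emulation takes
 * a global mutex and is therefore NOT lock-free. */
static pthread_mutex_t dcas_mutex = PTHREAD_MUTEX_INITIALIZER;

static int dcas(struct node **m1, struct node **m2,
                struct node *exp1, struct node *exp2,
                struct node *new1, struct node *new2)
{
    int ok = 0;
    pthread_mutex_lock(&dcas_mutex);
    if (*m1 == exp1 && *m2 == exp2) {
        *m1 = new1;
        *m2 = new2;
        ok = 1;
    }
    pthread_mutex_unlock(&dcas_mutex);
    return ok;
}

/* Unlink `victim`, which the caller believes directly follows `prev`.
 * Returns 1 on success, 0 if the list changed and the caller must re-search.
 * Note: in truly concurrent code the plain read of victim->next below would
 * also have to be an atomic load. */
int list_remove_after(struct node *prev, struct node *victim)
{
    struct node *succ = victim->next;
    if (succ == REMOVED)
        return 0;   /* somebody else already removed this node */

    /* Atomically: prev still points to victim AND victim still points to
     * succ; if so, bypass victim and poison its next pointer. */
    return dcas(&prev->next, &victim->next, victim, succ, succ, REMOVED);
}
```

The second comparison is what plain CAS cannot give you: if another thread has linked a new node behind the victim in the meantime, victim->next no longer equals succ and the removal fails cleanly instead of losing the new node.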

Besides making the currently specified ASF implementation available for you to play with (see below), we are currently testing and extending the implementation thoroughly. For example, we are porting the TMunit testing application and looking at other larger applications. We also analyse various ways of implementing ASF, see how we can make use of the increased flexibility (over ASF1) for accelerating STMs better than with ASF1, and look at new lock-free use cases for ASF.

Finally, we constantly strive to improve ASF to fit the needs of programmers wanting to use it -- so again, if you have any comments on the current ASF specification proposal, leave us a comment or send email to ASF_Feedback@amd.com!

Hands on

In our downloads section you can find all the ingredients needed to brew your own magic ASF1 potion: the tweaked simulator implementing ASF1; the benchmarks in which we have used ASF1 to accelerate (and simplify!) a lock-free linked list implementation and an STM; and various explanatory documents, such as our EPHAM 2008 paper and my Diploma thesis. I'm currently cleaning up the implementation of the current ASF specification in PTLsim and it will become available there shortly, too.

I'm aware that setting up the toolchain might be daunting, largely due to the Xen requirement, and sometimes less than 100% stable thanks to the research nature of the upstream project. If you have any specific questions regarding simulator setup and usage, please leave me a comment.

About me

I joined AMD's OSRC group in Dresden in May 2007 as a student intern and started implementing the original ASF proposal (ASF1 above) in PTLsim. This implementation work laid the foundation for my Master's thesis (mostly in English, ignore the German front pages) which I wrote to finish my studies of Computer Science at TU Dresden and the paper mentioned above. I graduated in February 2008 and have continued my work on ASF as a full employee at the OSRC since then.

I'm interested in most computer science and engineering topics, but I'm currently focusing on:

  • Microarchitecture: Cores, caches and interconnects

  • Memory model semantics

  • Simulation

  • Parallel programming: Transactional memory, lock-free programming

  • Computer graphics

I'd like to hear what your thoughts are on ASF, and what uses you have for it.

--

Stephan Diestelhorst, Software Engineer 1
AMD Operating System Research Center, Dresden






Edited: 09/04/2009 at 05:35 AM by stephan.diestelhorst


    Posted By: Stephan Diestelhorst @ 09/04/2009 05:14 AM     AMD Operating System Research Center (OSRC)     Comments (0)  

June 15, 2009
  Just released: Advanced Synchronization Facility (ASF) specification

Recently AMD released an experimental specification for a proposed AMD64 architecture feature that may be of interest to all programmers of highly concurrent programs, libraries, runtimes, and operating systems: Advanced Synchronization Facility, or ASF for short. This is the first of three blog articles describing why AMD's Operating System Research Center (OSRC) became involved in the development of ASF, how we are evaluating ASF, and how this and other activities fit into the EU-funded VELOX project aiming at improving the state of the art for software-transactional-memory systems.

In this posting I will give you a quick overview of what ASF is and how it works, along with some example code. I'll also describe how I became involved in developing ASF and why we are releasing this spec proposal.

About ASF
In a nutshell, ASF is intended to make it easier to write efficient, highly concurrent programs.

When AMD introduced multicore CPUs to the x86 world, we acknowledged that individual CPU cores weren't getting much faster with each silicon-technology generation. Instead, we decided to provide multiple CPU cores in one processor. This put the burden on the software community of making programs run faster on newer processors (i.e., programs have to be changed to take advantage of the parallelism.)

Writing efficient, concurrent programs or parallelizing an existing sequential program is a hard endeavor. The trickiest part is making sure that all program threads have a consistent view of all shared data. ASF is intended to address this very problem, known as synchronization.

How does ASF work?
ASF provides a mechanism to update multiple shared memory locations atomically without having to rely on locks for mutual exclusion. It's quite flexible as the semantics of the update are not fixed, but can be provided using standard x86 instructions.

Here's an example. This code snippet implements a two-word compare-and-swap (DCAS) primitive; the new ASF instructions are SPECULATE, COMMIT, and the LOCK-prefixed MOV:

; DCAS Operation:
; IF ((mem1 = RAX) && (mem2 = RBX))
; {
;   mem1 = RDI
;   mem2 = RSI
;   RCX = 0
; }
; ELSE
; {
;   RAX = mem1
;   RBX = mem2
;   RCX = 1
; }
; (R8, R9 modified)
;
DCAS:
 MOV      R8, RAX
 MOV      R9, RBX
retry:
 SPECULATE                    ; Speculative region begins
 JNZ      retry               ; Page fault, interrupt, or contention
 MOV      RCX, 1              ; Default result, overwritten on success
 LOCK MOV RAX, [mem1]         ; Declare and load protected memory
 LOCK MOV RBX, [mem2]
 CMP      R8, RAX             ; DCAS semantics
 JNZ      out
 CMP      R9, RBX
 JNZ      out
 LOCK MOV [mem1], RDI         ; Update protected memory
 LOCK MOV [mem2], RSI
 XOR      RCX, RCX            ; Success indication
out:
 COMMIT                       ; End of speculative region

The SPECULATE-COMMIT pair wraps a speculative region, which speculatively reads from and writes to protected memory locations using the LOCK MOV instructions. The speculative memory updates will become visible to other CPUs only when the speculative region completes successfully.

Here's what the speculative region does in this example: The initial LOCK MOV instructions signify the memory locations that need to be monitored for external modifications and also read the memory operands into the RAX and RBX registers. The code then compares these operands with the original register operands (saved to R8 and R9 at the outset of the routine). The DCAS operation may fail because of a miscomparison at that point, bypassing the memory update. The RCX register returns a pass-fail indication.

A speculative region may also be aborted, for example when a contending program thread accesses a protected memory location or when an interrupt occurs. In this case, all speculative memory updates are discarded, and the program flow (instruction and stack pointer) is rolled back to just after SPECULATE, where software can inspect the reason for the abort in the rAX and rFLAGS registers. The code in this example examines RFLAGS immediately after SPECULATE using a JNZ instruction that branches to the abort handler, which in this case just attempts a retry. A real implementation might have a more elaborate recovery strategy, for example, exponential backoff if the abort was due to contention.
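
As an illustration of such a recovery strategy, here is a C sketch of a retry loop with exponential backoff. It does not use ASF instructions: the speculative region is modelled as a callback that reports whether it committed, and the region_status values, run_with_backoff(), spin_wait(), and the delay constants are all assumptions made for this example. The actual abort-status encoding is defined in the ASF specification.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical abort reasons; the real encoding is defined by the ASF spec. */
enum region_status { REGION_COMMITTED = 0, ABORT_CONTENTION, ABORT_INTERRUPT };

/* A speculative region is modelled as a callback that either commits or
 * reports why it aborted. */
typedef enum region_status (*region_fn)(void *arg);

/* Busy-wait placeholder; real code might execute PAUSE or yield the CPU. */
static void spin_wait(uint32_t iterations)
{
    for (volatile uint32_t i = 0; i < iterations; i++)
        ;
}

/* Retry `region` until it commits or `max_attempts` is reached.
 * Returns the number of attempts made, or 0 if the region never committed. */
unsigned run_with_backoff(region_fn region, void *arg, unsigned max_attempts)
{
    uint32_t delay = 16;                  /* initial backoff, arbitrary */
    const uint32_t max_delay = 1u << 16;  /* cap so waits stay bounded */

    for (unsigned attempt = 1; attempt <= max_attempts; attempt++) {
        enum region_status st = region(arg);
        if (st == REGION_COMMITTED)
            return attempt;
        if (st == ABORT_CONTENTION) {
            spin_wait(delay);             /* back off only under contention */
            if (delay < max_delay)
                delay *= 2;
        }
        /* Other abort reasons (e.g. an interrupt): retry immediately. */
    }
    return 0;
}

/* Demonstration region: aborts `*remaining` times, then commits. */
enum region_status flaky_region(void *arg)
{
    unsigned *remaining = arg;
    if (*remaining > 0) {
        (*remaining)--;
        return ABORT_CONTENTION;
    }
    return REGION_COMMITTED;
}
```

Bounding the number of attempts leaves room for a fallback path, for instance a conventional lock, when the region keeps aborting.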

How we are developing ASF
ASF really is a team effort, with team members looking at various software applications, hardware implementation, and the specification itself.

When I joined AMD's OSRC at the end of 2006, I quickly discovered ASF as it existed at that time: a mechanism for improving the efficiency of highly parallel, lock-free synchronization code. In previous work I had used lock-free data structures for building a real-time microkernel operating system, and I had often craved a feature for multi-word atomic updates such as ASF. This might explain why I was so enthralled by ASF.

In the meantime, I have become the editor of the ASF specification proposal. I'm working with the ASF team to evaluate the feature in various application scenarios, and to further develop ASF based on our findings. We have expanded its focus to include software transactional memory (STM) as well; more on that in a later blog post.

We are also actively discussing ASF with both academic and industrial partners to learn about interesting application areas and to derive requirements for an eventual implementation in future products.

The ASF specification
ASF is an experimental architecture extension currently in proposal stage. AMD has not yet committed to including this feature into any future CPU product. Instead, we are soliciting input from developers and researchers that would help us refine the ASF specification to better meet software development requirements.

ASF is not the first feature we have proposed in this way. A year and a half ago, AMD decided to be more open in developing extensions to the AMD64 architecture to help ensure we meet the needs of the software development community and to encourage cross-vendor compatibility. At that time, we proposed the Lightweight Profiling (LWP) and SSE5 features in a similar spirit, and we received extremely valuable input from the programming community that helped us improve our future products - to your benefit. SSE5 has just recently evolved into the AVX-compatible XOP, which we described in a previous blog entry.

Please download the ASF specification proposal and send your comments to ASF_Feedback@amd.com.

---
Michael Hohmuth, MTS
AMD Operating System Research Center, Dresden






    Posted By: AMD DeveloperCentral @ 06/15/2009 01:57 PM     AMD Operating System Research Center (OSRC)     Comments (3)  

December 11, 2008
  IOMMU for XEN

The I/O Memory Management Unit (IOMMU) is a chipset function that is designed to translate addresses used in DMA transfers and to protect memory from accesses by I/O devices. For a more detailed description please refer to our main IOMMU article. This article is about IOMMU enablement for XEN.

Supported IOMMU features

  • ACPI table support
  • Event logging triggered by IOMMU interrupt
  • HVM PCI pass-through
  • PV PCI pass-through
  • PCI hotplug
  • MSI interrupt mode for pass-through devices
  • Interrupt remapping for both IO-APIC and MSI

Overview

Guest kernels can be implemented either with or without awareness of the IOMMU, depending on whether an IOMMU driver is integrated into the guest OS kernel or whether the hypervisor decides to expose IOMMU information to the guest domain. The main advantage of an IOMMU-aware guest is that device isolation can be implemented within the guest domain. The disadvantage is that the hypervisor then depends on support from the guest's IOMMU driver. At the time of writing, domain-level device isolation is the case of most interest to the XEN community. As explained above, this means IOMMU support in the hypervisor only, leaving the guest kernel untouched.

The guest domain can be a para-virtualized or a full-virtualized HVM domain. In both cases, the entire guest memory space, or at least the pages mapped with R/W permissions, needs to be mapped into the I/O address space: without IOMMU awareness, the guest cannot cooperate with the hypervisor to tell it which part of memory is actually used for DMA transactions.

IOMMU internals in XEN

Both para-virtualized and full-virtualized guests benefit from an IOMMU. Basically, two components are involved in IOMMU support: the hypervisor and "xentools". Up to now, the XEN development community only requires domain-level device isolation to work with the dom0 kernel left untouched. All low-level hardware operations are performed by the hypervisor and are encapsulated by a limited number of vendor-neutral wrapper functions. Hypercalls are implemented on top of those wrappers and therefore support both Intel VT-d and AMD IOMMU. Hypercalls are exposed to xentools to allow device assignment and deassignment. At the system initialization stage, the hypervisor detects and initializes the IOMMU. Then all PCI devices are assigned to dom0 using identical mapping.

Device assignment

A PCI device can be put under the control of either a para-virtualized or a full-virtualized guest domain via a device assignment operation. Device assignment is implemented as a hypercall and is invoked by QEMU-dm. For systems based on AMD hardware, device assignment means identifying the correct IOMMU that the corresponding device is attached to and updating its device table entry according to the content of the IVRS (I/O Virtualization Reporting Structure, an IOMMU extension to ACPI representing the system's I/O topology) entry of the assigned device. Currently, HyperTransport devices that share the same device table entry cannot be assigned to different guest domains. This is the case for legacy PCI devices behind a PCIe®-PCI bridge.

Para-virtualized driver domain

Whenever a para-virtualized driver domain uses machine-physical addresses to program PCI devices this is called "identical mapping". That means the remapped address is identical to the address which triggers IOMMU I/O page table walking. In this mode PCI devices can be isolated between different para-virtualized domains to ensure no device can perform DMA accesses to pages that belong to a different domain.
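
The following C sketch is a toy model of this mode, not XEN or IOMMU driver code. It assumes a machine with 64 pages of 4 KiB, a flat ownership table, and a per-device table with one "present" flag per page; all names (io_page_table, assign_page, build_identical_mapping, iommu_translate) are hypothetical. A present entry translates an address to itself, and an access to a page of another domain faults.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define NUM_PAGES  64            /* toy machine: 64 pages of 4 KiB */

/* Hypothetical per-device I/O page table: one "present" flag per machine
 * page. With identical mapping, a present entry maps a page to itself. */
struct io_page_table {
    int  owner_domain;
    bool present[NUM_PAGES];
};

/* Toy ownership table: which domain owns each machine page.
 * Zero-initialized, so every page starts out owned by domain 0. */
static int page_owner[NUM_PAGES];

void assign_page(unsigned page, int domain)
{
    assert(page < NUM_PAGES);
    page_owner[page] = domain;
}

/* Build the identical mapping for a device assigned to `domain`:
 * only pages owned by that domain become visible to the device. */
void build_identical_mapping(struct io_page_table *t, int domain)
{
    t->owner_domain = domain;
    for (unsigned p = 0; p < NUM_PAGES; p++)
        t->present[p] = (page_owner[p] == domain);
}

/* Translate a DMA address. Returns true and stores the machine address on
 * success; returns false (an I/O page fault) if the page is not mapped. */
bool iommu_translate(const struct io_page_table *t, uint64_t dma_addr,
                     uint64_t *machine_addr)
{
    uint64_t page = dma_addr >> PAGE_SHIFT;
    if (page >= NUM_PAGES || !t->present[page])
        return false;
    *machine_addr = dma_addr;    /* identical mapping: output == input */
    return true;
}
```

In this model the translation itself does nothing useful; the value of the table lies entirely in the entries that are absent.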

The driver domain can also run back-end drivers over native device drivers to offer I/O multiplexing for other unprivileged guest domains. XEN utilizes a memory sharing technology called "grant mapping" to share pages between front-end domain and back-end domain. Pages mapped through a grant table must also be reflected into IOMMU I/O page tables.

HVM pass-through domain

For an HVM guest, the IOMMU can be programmed to translate guest-physical addresses to machine-physical addresses. The guest's driver can then be allowed to access the physical device, which is called "PCI pass-through" mode. PCI pass-through is a key feature for improving virtualized performance when used with full-virtualized HVM domains. Some special cases are listed below.

Sharing p2m table with IOMMU

I/O page tables in AMD IOMMUs are designed to be compatible with MMU page tables. But currently neither AMD nor Intel's VT-d use a sharing approach to construct their tables.

XEN implements a guest-physical to machine-physical (p2m) table to translate guest addresses to machine addresses. Unfortunately, it is currently impossible to share this p2m table with an AMD IOMMU. The reason is that the XEN memory management uses AVL bits in the p2m table to trace different guest memory types, which are also needed by AMD IOMMUs to encode lower page table levels. Intel VT-d used to share the p2m table with their IOMMU, but now it also turns out that separate IO page tables are a better approach because of some incompatibilities with the 2MB Intel EPT.

Virtualization of PCI resources

Configuration Space

For each PCI pass-through device, a dummy QEMU-dm device is registered on the QEMU-dm PCI bus by replicating the real device's configuration space. As the guest OS only sees and works with the dummy PCI device's configuration space, QEMU-dm intercepts the guest's configuration space write accesses and propagates them to the real hardware.

MMIO and PIO

Each time the guest BIOS, hvmloader or OS changes the PCI BAR, the pci_write_config function in QEMU-dm invokes a hypercall to instruct XEN to construct a p2m mapping. For emulated devices, MMIO and PIO work through normal memory pages, that are shared between the guest domain and QEMU-dm. Read/write accesses to those memory pages are trapped by the hypervisor and forwarded to QEMU-dm. QEMU-dm emulates the behavior of virtual devices by hooking into MMIO and PIO access functions. For directly assigned devices, QEMU-dm invokes hypercalls to map MMIO and PIO into guest memory space and asks the hypervisor to grant the guest OS direct access to these memory regions.

Interrupt handling

HVM PCI device pass-through requires the HVM domain to handle the physical interrupt by itself. When the hypervisor receives a machine IRQ it needs to know which virtual IOAPIC to assert. This requires a new mechanism to be able to bind an IRQ to a guest domain. Interrupt binding is triggered by QEMU-dm right after a PCI device is correctly assigned. This operation is implemented as a hypercall and is exposed through xentools.

According to the current IRQ handling mechanism in XEN, the hypervisor owns the physical IOAPIC, receives the machine interrupt from the device and injects a virtual interrupt event to the guest by explicitly asserting the HVM guest's virtual IOAPIC. The assert/deassert operations update the redirection table entries in the virtual IOAPIC and inject interrupts to the virtual local APIC of each VCPU by calling vlapic_set_irq().

Since PCI devices are directly manipulated by the guest OS and the guest OS will only EOI to the virtual local APIC, we have to find a way to notify the hypervisor about the deassertion of the machine's IRQ line and then to perform an EOI to the physical IOAPIC. A patch currently under discussion helps to solve this problem by inverting the polarity field of the physical IOAPIC redirection table entry before dispatching the interrupt to the OS. The deassertion of the physical IRQ line then causes do_irq() to be re-entered, and the hypervisor gets the chance to explicitly deassert the virtual IOAPIC's interrupt "wire". Intel's approach with VT-d is more straightforward: it deasserts the virtual interrupt wire while intercepting the guest's EOI to the virtual IOAPIC and then directly invokes the end() function of MIRQ to EOI the physical IOAPIC. This is the mechanism that is used by the upstream version of XEN.

Sharing a physical interrupt line between domains is still not reliable. XEN introduced a timeout mechanism for legacy interrupt sharing instead. However, MSI should be a viable solution for that.

Wei Wang & Peter Oruba






    Posted By: AMD DeveloperCentral @ 12/11/2008 01:35 PM     AMD Operating System Research Center (OSRC)     Comments (1)  

September 1, 2008
  IOMMU

This article is about AMD's IOMMU, coming up in future server chipsets, what it does, how it works and why it is important.

 

What it does

IOMMU stands for I/O Memory Management Unit, and it works very similarly to a processor's memory management unit (MMU). The main difference is that it translates memory accesses performed by devices, whereas the MMU translates accesses performed by the processor. The address translation uses a paging-based scheme. As with the MMU, it is designed to allow not only translation but also protection. Another key feature is interrupt remapping.
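
The following minimal sketch illustrates translation and protection on a per-device basis. It is not the hardware's table format: a real IOMMU walks multi-level page tables in memory, whereas this sketch uses a flat dictionary per device, and all names (ToyIommu, IoPageFault, map_page, translate) are made up for illustration.

```python
PAGE_SIZE = 4096


class IoPageFault(Exception):
    """Raised when a device access is not covered by its I/O page table."""


class ToyIommu:
    def __init__(self):
        # device name -> {device page number: (host page number, writable)}
        self.tables = {}

    def map_page(self, device, io_page, host_page, writable=False):
        self.tables.setdefault(device, {})[io_page] = (host_page, writable)

    def translate(self, device, io_addr, write=False):
        page, offset = divmod(io_addr, PAGE_SIZE)
        entry = self.tables.get(device, {}).get(page)
        if entry is None:
            raise IoPageFault(f"{device}: no mapping for {io_addr:#x}")
        host_page, writable = entry
        if write and not writable:
            raise IoPageFault(f"{device}: write to read-only page at {io_addr:#x}")
        return host_page * PAGE_SIZE + offset


iommu = ToyIommu()
iommu.map_page("nic", io_page=0x10, host_page=0x12345, writable=True)
print(hex(iommu.translate("nic", 0x10 * PAGE_SIZE + 0x2A, write=True)))
```

Since every device has its own table, a DMA access to an address that was never mapped for that device ends in a fault instead of reaching memory.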

 

How it works

Device pass-through

This is the ability to directly assign a physical device to a particular guest OS. The required address space translation is handled transparently. Ideally a device's address space is the same as a guest's physical address space; however, in the virtualized case this is hard to achieve without an IOMMU.

If done without an IOMMU, our experience has been that device pass-through is very fragile, slow and works for paravirtualized OSs only. An IOMMU is designed to allow device pass-through to work even with an unmodified OS. Device pass-through is a key feature for increased virtualization performance, with network adapters and GPUs being the devices that benefit most, as they usually have high bandwidth requirements. As a side effect, devices that support only 32-bit addressing can be passed to guests whose memory is physically mapped above 4 GB, so that DMA transfers work for them as well.

Device isolation

An IOMMU is designed to safely map a device to a particular guest without risking the integrity of other guests. A guest should not be able to break out of its address space with rogue DMA traffic. Additionally, an IOMMU is designed to provide increased security in scenarios without virtualization. In particular, the OS should be able to protect itself from buggy device drivers by limiting a device's memory accesses.

Remapping of interrupts

Sharing device interrupts among several guests is usually complicated to handle. An IOMMU provides a basis for separating interrupts that are shared by different devices. It remaps a shared interrupt to an exclusive vector to ease its delivery to a particular guest OS.
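
As a rough illustration of the idea, the sketch below gives each device an exclusive vector even though two devices share one legacy interrupt line. The class InterruptRemapper, the device names and the starting vector 0x40 are assumptions made for the example and do not describe the actual table layout used by the hardware.

```python
class InterruptRemapper:
    """Toy remapping table: one exclusive vector per device."""

    def __init__(self, first_free_vector=0x40):
        self.table = {}                       # device -> (guest, vector)
        self.next_vector = first_free_vector

    def assign(self, device, guest):
        """Give the device its own vector and remember the target guest."""
        vector = self.next_vector
        self.next_vector += 1
        self.table[device] = (guest, vector)
        return vector

    def remap(self, device):
        """Look up where an interrupt raised by this device is delivered."""
        return self.table[device]


# Both devices sit on the same legacy line, yet each gets its own vector.
remapper = InterruptRemapper()
remapper.assign("nic", guest="guest-A")
remapper.assign("hba", guest="guest-B")
print(remapper.remap("nic"), remapper.remap("hba"))
```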

 

Why is it important?

In virtualization, there are lots of tricks done to abstract the underlying hardware, but also to minimize virtualization overhead. Using Rapid Virtualization Indexing(tm) instead of shadow page tables for memory management is only one example. The biggest remaining performance gap in today's virtualization scenario is I/O.  An IOMMU helps to bridge this gap and also improves the situation from a security point of view. Last, but not least, it allows hypervisors to be simpler and more robust. 

 

Jörg Rödel & Peter Oruba



-------------------------
This response is provided for informational purposes only, is provided “AS IS” and does not obligate AMD to provide any of the services, technology, or programs described.


    Posted By: Peter Oruba @ 09/01/2008 03:36 AM     AMD Operating System Research Center (OSRC)     Comments (0)  

March 26, 2008
  Myths and facts about 64-bit Linux(®)

Myths

  • "You don't need 64-bit software with less than 3 GB RAM"
  • "There are less drivers for 64-bit OS"
  • "You will need all new software, all 64-bit"
  • "64-bit software is twice as fast"


AMD's 64-bit architecture extension
AMD64 introduces one new mode to the processor, the Long mode, consisting of the 64-bit mode and the Compatibility mode. The former is the new 64-bit environment, the latter a compatibility implementation to run 32-bit code in that 64-bit environment. The current operating mode is connected to the code segment. By using instructions that change code segments (like syscall) one can switch between both submodes.
Currently the paging algorithm (derived from PAE) limits the physical address to 52 bits, while the virtual address space is 48 bits wide. Even with far less physical RAM, this large address space proves to be helpful for memory mapping.
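
To put these address widths into perspective, the short calculation below converts them into sizes. The helper names address_space_bytes() and human() are made up for this example.

```python
def address_space_bytes(bits):
    """Number of bytes addressable with the given address width."""
    return 1 << bits


def human(n):
    """Format a power-of-two byte count using binary units."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]
    i = 0
    while n >= 1024 and i < len(units) - 1:
        n //= 1024
        i += 1
    return f"{n} {units[i]}"


for name, bits in [("32-bit", 32), ("48-bit virtual", 48), ("52-bit physical", 52)]:
    print(name, human(address_space_bytes(bits)))
```

A 48-bit virtual address space is 65536 times larger than the 4 GiB that a 32-bit address can reach, which is why large files can be memory-mapped even on machines with little RAM.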

Another important part of the new architecture is the extended number of registers. Both the general purpose registers and the SSE registers have been doubled, so you can now use 16 of each in the new 64-bit mode.


Software support
Support for 64-bit in software required a new ABI to be introduced and GCC to be extended with the new architecture. As an essential part of that ABI, up to six function parameters are now passed in registers. Linux kernel support for 64-bit started by splitting off a new architecture tree, x86-64. In 2.6.24 this tree was merged with the i386 tree into one common code base (x86).

In addition a compatibility layer (aka compat layer) provides support for execution of 32-bit binaries on 64-bit processors.


32-bit compatibility
Compatibility with 32-bit code has been a crucial goal in the design. First of all, it allows unmodified 32-bit code to run inside a 64-bit environment. Syscalls are used to switch between both worlds.

The processor zero-extends all 32-bit addresses to 64-bit. Applications use the lower 4 GB of the address space. However, physically that can be mapped anywhere. The kernel manages the compat layer for all applications, meaning it resolves bitness differences, structure layouts and invokes the right library version (/lib{32,64}/ld.so).
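
The effect of zero-extension can be shown with a few lines of arithmetic. The sketch contrasts it with sign-extension, which is included only as a comparison and is not something the article describes; the function names are made up.

```python
def zero_extend_32(addr32):
    """Widen a 32-bit address to 64 bits by filling the upper half with zeros."""
    return addr32 & 0xFFFF_FFFF


def sign_extend_32(addr32):
    """For comparison only: copy bit 31 into the upper 32 bits."""
    addr32 &= 0xFFFF_FFFF
    return addr32 | 0xFFFF_FFFF_0000_0000 if addr32 & 0x8000_0000 else addr32


addr = 0xC000_0000
print(f"{zero_extend_32(addr):#018x}")   # stays within the lower 4 GB
print(f"{sign_extend_32(addr):#018x}")   # would land at the top of the address space
```

With zero-extension every 32-bit pointer keeps its numeric value, so a 32-bit application always ends up in the lower 4 GB of its virtual address space.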

Speaking of libraries - each library has to be present twice, once for 32-bit and once for 64-bit, which also includes all dependent ones down to the lowest.

With the Linux compat layer it is even possible to run an entire and unmodified 32-bit Linux installation with a 64-bit kernel.


Benchmarks
First we picked some real-world benchmarks for our 32-bit vs. 64-bit comparisons: Oggenc, Mencoder and Povray as well as some compilation tests. Furthermore, micro benchmarks were used to show specific performance differences for syscalls and 64-bit arithmetic.

We set up three system configurations - a 32-bit installation, a 64-bit installation and a combination of 32-bit installation with 64-bit kernel to challenge the compat layer. All tests were performed on a dual-core AMD-K8(tm) processor with 1 GB RAM.

The tests showed that the penalty of using the compat layer instead of running your 32-bit application on a native 32-bit kernel is about 1-2 percent. So it is almost negligible.

64-bit took the lead in the media encoding tests. Our Povray and Mencoder benchmarks took about 5% less time in the 64-bit case, Oggenc even 25% less. Only the C compilation tests showed a performance advantage for 32-bit over 64-bit, of 5% to 8%.

Native arithmetic performance (64-bit data types used in 64-bit software vs. 32-bit data types used in 32-bit software) showed a gain of 10% for the 64-bit case. When 64-bit data types were used in both the 32-bit and the 64-bit build of the arithmetic test, the 64-bit build was more than twice as fast as the 32-bit build.
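
One plausible reason for the gap with 64-bit data types, which is an assumption here and not something we measured separately, is that 32-bit code has to split every 64-bit operation into several 32-bit steps. The sketch below shows this for an addition; the function name add64_with_32bit_ops() is made up.

```python
MASK32 = 0xFFFF_FFFF


def add64_with_32bit_ops(a, b):
    """Add two 64-bit values using only 32-bit wide halves, as 32-bit code must."""
    lo = (a & MASK32) + (b & MASK32)          # first add: low halves
    carry = lo >> 32                          # carry out of the low half
    hi = ((a >> 32) & MASK32) + ((b >> 32) & MASK32) + carry  # second add with carry
    return ((hi & MASK32) << 32) | (lo & MASK32)


print(hex(add64_with_32bit_ops(0x0000_0001_FFFF_FFFF, 1)))
```

In 64-bit mode the same addition is a single instruction on one register.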


Downsides of 64-bit
A 64-bit execution environment and 64-bit software surely have their downsides, too. First there is the larger memory footprint. Binaries get larger because of an increased pointer size and 64-bit operands. This leads to higher memory transfer load and therefore increases cache utilization.


Myths revisited

  • "You don't need 64-bit software with less than 3 GB RAM"
    • Performance improvement even on systems with less than 3 GB RAM
  • "There are less drivers for 64-bit OS"
    • Irrelevant to Linux, hail open source 
  • "You will need all new software, all 64 bit"
    • 32-bit compat layer performs very well and is transparent
  • "64-bit software is twice as fast"
    • Rarely the case, software is usually optimized for 32-bit


Conclusion
Use a 64-bit system and stick to the compat layer if you need to run certain 32-bit applications.


Andre Przywara, Andreas Herrmann, Peter Oruba



-------------------------

Edited: 03/26/2008 at 05:07 AM by peteroruba


    Posted By: Peter Oruba @ 03/26/2008 04:54 AM     AMD Operating System Research Center (OSRC)     Comments (0)  

February 20, 2008
  Boosting KVM's performance with nested paging

As opposed to XEN, KVM is an in-kernel hypervisor for Linux(tm) that lets you run unmodified guests like Linux (both 32 bit and 64 bit) as well as Windows(tm) in every kind of flavor. KVM requires hardware support like AMD's SVM (Secure Virtual Machine) to accelerate full virtualization. Practically speaking it is provided as a kernel module that comes in two pieces, a generic one and a second AMD specific one. Our most recent development for KVM is support for the new K10 hardware feature called "nested paging".
Virtualization performance depends heavily on the efficiency of the hypervisor's virtual memory management. This is where nested paging comes into play. The typically time-consuming process of mapping guest physical addresses to host physical addresses no longer has to be done in software, but can be done in hardware instead. Until now, memory management in software was achieved using shadow paging, which turned out to be a major performance bottleneck. Nested paging is a feature that offloads this address mapping to hardware.
Our KVM patch also includes live migration to/from either paging method. As far as we can tell, it will be included in KVM version 61, which will probably come with kernel version 2.6.26. So what does nested paging buy you in terms of numbers? We set up a KVM host system with a Linux guest and ran kernbench, a kernel compilation benchmark that does several compile runs and gives a good indication of overall system performance. Our guest system gains about 30% in performance with nested paging enabled as compared to shadow paging. This brings a key goal of virtualization, namely native performance, much closer: the guest improved from 60% to 90% of native performance when compiling a kernel from memory, which shows KVM to be a reasonable alternative to other virtualization solutions.
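
The two-stage address mapping that nested paging moves into hardware can be sketched as follows. This is a strongly simplified model: real page tables are multi-level, and the walk of the guest's own page tables goes through the nested translation as well. The function names and the flat dictionaries are assumptions made for the example.

```python
PAGE = 4096


def walk(table, addr):
    """Translate an address through a flat {page number: page number} table."""
    page, offset = divmod(addr, PAGE)
    return table[page] * PAGE + offset


def nested_translate(guest_pt, nested_pt, gva):
    """Nested paging: guest virtual -> guest physical -> host physical."""
    gpa = walk(guest_pt, gva)      # table maintained by the guest OS
    return walk(nested_pt, gpa)    # table maintained by the hypervisor


def build_shadow(guest_pt, nested_pt):
    """Shadow paging: the hypervisor precomputes guest virtual -> host physical."""
    return {gv: nested_pt[gp] for gv, gp in guest_pt.items()}


guest_pt = {0x10: 0x2}     # guest virtual page 0x10 -> guest physical page 0x2
nested_pt = {0x2: 0x80}    # guest physical page 0x2 -> host physical page 0x80
gva = 0x10 * PAGE + 5

print(hex(nested_translate(guest_pt, nested_pt, gva)))
print(hex(walk(build_shadow(guest_pt, nested_pt), gva)))
```

Both paths yield the same host address. The difference is that the shadow table has to be rebuilt by the hypervisor whenever the guest changes its own table, whereas with nested paging the hardware combines the two tables on its own.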

Jörg Rödel & Peter Oruba



-------------------------


    Posted By: Peter Oruba @ 02/20/2008 03:04 AM     AMD Operating System Research Center (OSRC)     Comments (0)  

December 10, 2007
  Introducing the AMD Operating System Research Center Blog

Hello from AMD's OSRC. Let me give a brief introduction of who we are and what our job at AMD is. Founded one and a half years ago, we have grown to a team size of roughly two dozen people, spread over two sites, namely Dresden and Austin.

We are AMD's competence center for operating system related topics like OS research, Linux kernel development, virtualization technology and system testing. In a CPU's early design phase we provide feedback to the architecture engineers when new features are discussed. Later we prepare and enable the use of new CPU features by the Linux(r) kernel or hypervisors like XEN.

Most of our Linux related work can be found on the Linux Kernel Mailing List, to which we submit patches for all kinds of AMD specific components.

Furthermore we run amd64.org which also provides a couple of AMD related mailing lists.

In this blog, we'll be sharing updates on our work here in the OSRC to help you stay informed. Stay tuned.



-------------------------

Edited: 12/10/2007 at 05:09 AM by peteroruba


    Posted By: Peter Oruba @ 12/10/2007 05:05 AM     AMD Operating System Research Center (OSRC)     Comments (1)  
