AMD Processors
Topic Title: NUMA topology
Created On: 01/14/2011 03:12 AM
 01/14/2011 03:12 AM

Posts: 1
Joined: 01/14/2011

Hi everybody,

I'm playing with a 4-socket NUMA system. Each NUMA node consists of a 4-core Opteron 8378 processor. Three HyperTransport links are available in each node, so the resulting NUMA topology is a square:

N0 -- N1
|     |
N2 -- N3

I want to know the NUMA distance between nodes. I expect the following distances (in hops):

     N0  N1  N2  N3
N0    0   1   1   2
N1    1   0   2   1
N2    1   2   0   1
N3    2   1   1   0
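
The expected table is simply the hop count between nodes in the square topology. A minimal Python sketch that derives it with a breadth-first search, assuming the four links N0-N1, N0-N2, N1-N3 and N2-N3 from the diagram (the names LINKS and hop_distances are illustrative, not part of any tool):

```python
from collections import deque

# HyperTransport links of the square topology (assumed from the diagram).
LINKS = [(0, 1), (0, 2), (1, 3), (2, 3)]


def hop_distances(num_nodes, links):
    """Return a matrix whose [i][j] entry is the minimum hop count from node i to node j.

    Assumes the graph is connected, which holds for the square topology.
    """
    neighbours = {n: set() for n in range(num_nodes)}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    matrix = []
    for start in range(num_nodes):
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbours[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        matrix.append([dist[n] for n in range(num_nodes)])
    return matrix


if __name__ == "__main__":
    for row in hop_distances(4, LINKS):
        print(" ".join(str(d) for d in row))
```

Adding the two diagonal links to LINKS would turn every off-diagonal entry into 1, which is the fully connected case.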

Under Linux I use numactl --hardware to read distances from hardware. It reports:

node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10
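
A reported table like this one can be checked mechanically. The following Python sketch parses the distance rows of numactl --hardware output and tests whether every remote distance is identical, which is the "flat" pattern seen here. The helper names parse_distances and is_flat are hypothetical, and the parser assumes data rows of the form "N: d d d d".

```python
def parse_distances(text):
    """Parse the distance rows of `numactl --hardware` output into a matrix."""
    matrix = []
    for line in text.splitlines():
        head, sep, rest = line.partition(":")
        # Data rows look like "0: 10 20 20 20"; the "node 0 1 2 3" header has no colon,
        # and other lines such as "node 0 cpus: ..." do not start with a bare number.
        if sep and head.strip().isdigit():
            matrix.append([int(v) for v in rest.split()])
    return matrix


def is_flat(matrix):
    """Return True if every remote (off-diagonal) distance has the same value."""
    remote = {d for i, row in enumerate(matrix) for j, d in enumerate(row) if i != j}
    return len(remote) <= 1


# The table reported on the 4-socket Opteron 8378 system.
REPORTED = """\
node 0 1 2 3
0: 10 20 20 20
1: 20 10 20 20
2: 20 20 10 20
3: 20 20 20 10
"""

if __name__ == "__main__":
    matrix = parse_distances(REPORTED)
    print(matrix)
    print("flat:", is_flat(matrix))
```

A flat table carries no information about which remote nodes are one hop away and which are two, so the topology has to be inferred some other way.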

This misbehavior is a known problem: it seems that the Opteron BIOS does not export the correct distances (see

In order to observe the real topology, I ran some tests with the numademo program, which is distributed with the numactl package. Consider the following test:

pin the current process to node i, then run memset on memory allocated on each of the other nodes
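
The exact command line is not shown in the post. One plausible way to drive such a test is to wrap numademo in numactl --cpunodebind, once per node. The Python sketch below only builds and prints those command lines; the function name numademo_commands is hypothetical, and the 128m size and memset test name are assumptions taken from the description of the runs.

```python
import shlex


def numademo_commands(nodes, size="128m", test="memset"):
    """Build one command line per CPU node: pin to that node, then run numademo."""
    return [
        ["numactl", f"--cpunodebind={node}", "numademo", size, test]
        for node in nodes
    ]


if __name__ == "__main__":
    for cmd in numademo_commands(range(4)):
        # Printed rather than executed, so the sketch is safe on machines without numactl.
        print(shlex.join(cmd))
```

On a NUMA machine each printed line could be passed to subprocess.run, and the per-node memset lines of the output compared across runs.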

I expect better results when the memory being accessed is near the current node. So I started by pinning a process to node 0 and writing to the other nodes (each test wrote 128 MB of data):

alloc on node 1 memset Avg 2920.52 MB/s Max 2938.22 MB/s Min 2808.14 MB/s
alloc on node 2 memset Avg 2965.80 MB/s Max 2967.84 MB/s Min 2962.34 MB/s
alloc on node 3 memset Avg 2116.58 MB/s Max 2122.49 MB/s Min 2081.02 MB/s

This confirms my hypothesis: node 0 is near node 1 and node 2, while node 3 is 2 hops away.

I tried the dual configuration (pinning the process to node 3), and the results are consistent with those of the previous test.

Finally, I tried pinning the process to nodes 1 and 2. Here are the results for node 1:

alloc on node 0 memset Avg 2698.34 MB/s Max 2701.21 MB/s Min 2689.30 MB/s
alloc on node 2 memset Avg 2954.00 MB/s Max 2957.12 MB/s Min 2948.54 MB/s
alloc on node 3 memset Avg 2644.85 MB/s Max 2646.67 MB/s Min 2642.71 MB/s

The nearest node is 2, while 0 and 3 are the farthest!

If I pin the process to node 2, I have:

alloc on node 0 memset Avg 2602.85 MB/s Max 2605.16 MB/s Min 2596.69 MB/s
alloc on node 1 memset Avg 2940.77 MB/s Max 2943.11 MB/s Min 2933.34 MB/s
alloc on node 3 memset Avg 2675.09 MB/s Max 2677.07 MB/s Min 2673.23 MB/s

The nearest node is 1, while 0 and 3 are the farthest!

Only the first two tests are consistent with the hardware topology. Does anyone have any idea why I got these results?
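
The mismatch can be made explicit by comparing, for each pinned node, the slowest memory node against the node expected to be two hops away. This is a rough Python sketch using only the average figures posted above. Node 3 is left out because its numbers were not posted, and the names EXPECTED_HOPS, MEASURED and check are illustrative.

```python
# Expected hop counts for the square topology (from the expected-distance table).
EXPECTED_HOPS = [
    [0, 1, 1, 2],
    [1, 0, 2, 1],
    [1, 2, 0, 1],
    [2, 1, 1, 0],
]

# Average memset bandwidth in MB/s, copied from the posted numademo runs:
# MEASURED[cpu_node][memory_node]. Node 3 is omitted (figures not posted).
MEASURED = {
    0: {1: 2920.52, 2: 2965.80, 3: 2116.58},
    1: {0: 2698.34, 2: 2954.00, 3: 2644.85},
    2: {0: 2602.85, 1: 2940.77, 3: 2675.09},
}


def check(cpu_node):
    """Return (two_hop_node, slowest_node, fastest_node) for one pinned CPU node."""
    bandwidths = MEASURED[cpu_node]
    two_hop = EXPECTED_HOPS[cpu_node].index(2)
    slowest = min(bandwidths, key=bandwidths.get)
    fastest = max(bandwidths, key=bandwidths.get)
    return two_hop, slowest, fastest


if __name__ == "__main__":
    for cpu_node in sorted(MEASURED):
        two_hop, slowest, fastest = check(cpu_node)
        verdict = "consistent" if slowest == two_hop else "INCONSISTENT"
        print(f"pinned to node {cpu_node}: expected farthest {two_hop}, "
              f"slowest {slowest}, fastest {fastest} -> {verdict}")
```

With these numbers only the run pinned to node 0 matches the square topology. For nodes 1 and 2 the node expected to be two hops away is actually the fastest one.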

 02/03/2011 02:47 PM
Dr. Mu

Posts: 1837
Joined: 08/26/2006

I also have a four-die NUMA setup with two Opteron 6128 CPUs, although the setup is different from yours. However, I am also getting some funkiness from numactl and such.

numactl --hardware also does not show the correct topology for my system. The actual hardware is arranged like so:

Note that each die is directly connected via an HT link to every other die. Yet numactl tells me the following:
node distances:
node   0   1   2   3
  0:  10  16  16  22
  1:  16  10  22  16
  2:  16  22  10  16
  3:  22  16  16  10

The latency is supposed to be the same from any one die to the next, since the HT links are all clocked the same and the dies are all directly connected to each other via HT links. Yet this topology map suggests that the diagonal HT link from 0 to 3 or from 1 to 2 is not being used. That is what your system's topology is, yet your map looks like what my system actually is: every die is the same distance away.
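
To see which pairs the reported table treats as farthest apart, here is a small Python sketch over the matrix from this post (the name farthest_pairs is illustrative):

```python
# Distance table reported by numactl on the two-socket Opteron 6128 system.
REPORTED = [
    [10, 16, 16, 22],
    [16, 10, 22, 16],
    [16, 22, 10, 16],
    [22, 16, 16, 10],
]


def farthest_pairs(matrix):
    """Return the node pairs (i < j) whose reported distance is the largest in the matrix."""
    n = len(matrix)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    worst = max(matrix[i][j] for i, j in pairs)
    return [(i, j) for i, j in pairs if matrix[i][j] == worst]


if __name__ == "__main__":
    print(farthest_pairs(REPORTED))
```

The pairs it returns are 0-3 and 1-2, the two diagonals, which is the pattern a square topology without diagonal links would produce.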

I get some seriously funky figures from numademo as well. My bandwidth numbers are almost exactly the same as yours and also fluctuate randomly, despite my system having much more inter-die bandwidth. Your 8378s have 16-bit HT2 links at 1 GHz full-duplex, good for 2 GB/sec in each direction. Roughly equal figures of ~3 GB/sec from any other node make sense in your case. They make no sense in my case, as the links are much higher-clocked than yours, and their bandwidth varies quite a bit between the different links: 0-1 and 2-3 are 24 bits wide (9.6 GB/sec in each direction), 0-2 and 1-3 are 16 bits wide (6.4 GB/sec in each direction), and 0-3 and 1-2 are 8 bits wide (3.2 GB/sec in each direction). You'd expect to see that come out in the program, but it doesn't. Very odd. Also, only one CPU core seemed to be loaded at all throughout this, so I don't know what is going on.

I guess about the only thing I can say is that the numademo program is also not working properly.
