AMD Processors
Topic Title: Correctable ECC errors - anything to worry about?
Created On: 06/24/2007 06:03 PM
Status: Read Only
 06/24/2007 06:03 PM
wellnomics
Junior Member

Posts: 2
Joined: 06/17/2007

Hi there,

We are running Windows Server 2003 Enterprise x64 Edition on a dual AMD Opteron 244 system with 4 GB RAM and a Tyan S2882 motherboard.

Given the spec. of this box, we are expecting 'super duper' performance from it, but since installation we have been unimpressed.

I notice in the Event Log that we are getting several errors reported by WMIxWDM relating to a 'fatal bus or interconnect error', but it then reports that these are 'corrected error events'.

Running MCat on these logs gives results similar to the following:

Event Source 1 - WMIxWDM
Processor Number : 1
Bank Number : 4
Time Stamp (0x): 01C6BF5B 2FEC7996
Error Status (0x): 9428C000 54080A13
Error Address (0x): 00000000 FA8292B0
Single bit errors:
Correctable ECC error
Error address valid in MCi_ADDR
Error reporting enabled
Error valid
Bus Error Code:
Participation processor: Local node responded to the request (RES)
Time-out: Request did not time out
Memory transaction type: Generic read (RD)
I/O: DRAM memory access (MEM)
Cache level: Generic (LG)
North Bridge Error MC4:
Extended Error Code: 0x8 - ChipKill ECC Error
Error Code: 0x0A13
DRAM memory access (MEM) Generic read (RD), on Generic (LG) cache
ChipKill Syndrome: 0x5451
Error address at 4008 MB

---

Event Source 2 - WMIxWDM
Processor Number : 1
Bank Number : 4
Time Stamp (0x): 01C6BF5B 2FEC7996
Error Status (0x): 9428C001 54080813
Error Address (0x): 00000000 FA8291B0
Single bit errors:
Error associated with CPU0 core
Correctable ECC error
Error address valid in MCi_ADDR
Error reporting enabled
Error valid
Bus Error Code:
Participation processor: Local node originated the request (SRC)
Time-out: Request did not time out
Memory transaction type: Generic read (RD)
I/O: DRAM memory access (MEM)
Cache level: Generic (LG)
North Bridge Error MC4:
Extended Error Code: 0x8 - ChipKill ECC Error
Error Code: 0x0813
DRAM memory access (MEM) Generic read (RD), on Generic (LG) cacheassociated with CPU0
ChipKill Syndrome: 0x5451
Error address at 4008 MB

---

Event Source 3 - WMIxWDM
Processor Number : 1
Bank Number : 4
Time Stamp (0x): 01C6BF5B 2FEC7996
Error Status (0x): 9428C000 54080A13
Error Address (0x): 00000000 FA429B80
Single bit errors:
Correctable ECC error
Error address valid in MCi_ADDR
Error reporting enabled
Error valid
Bus Error Code:
Participation processor: Local node responded to the request (RES)
Time-out: Request did not time out
Memory transaction type: Generic read (RD)
I/O: DRAM memory access (MEM)
Cache level: Generic (LG)
North Bridge Error MC4:
Extended Error Code: 0x8 - ChipKill ECC Error
Error Code: 0x0A13
DRAM memory access (MEM) Generic read (RD), on Generic (LG) cache
ChipKill Syndrome: 0x5451
Error address at 4004 MB

---

Event Source 4 - WMIxWDM
Processor Number : 1
Bank Number : 4
Time Stamp (0x): 01C6BF5B 2FEC7996
Error Status (0x): 9428C001 54080813
Error Address (0x): 00000000 FA8291B0
Single bit errors:
Error associated with CPU0 core
Correctable ECC error
Error address valid in MCi_ADDR
Error reporting enabled
Error valid
Bus Error Code:
Participation processor: Local node originated the request (SRC)
Time-out: Request did not time out
Memory transaction type: Generic read (RD)
I/O: DRAM memory access (MEM)
Cache level: Generic (LG)
North Bridge Error MC4:
Extended Error Code: 0x8 - ChipKill ECC Error
Error Code: 0x0813
DRAM memory access (MEM) Generic read (RD), on Generic (LG) cacheassociated with CPU0
ChipKill Syndrome: 0x5451
Error address at 4008 MB
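
For anyone wanting to cross-check the MCat output by hand, here is a minimal Python sketch that unpacks the two Error Status words. The bit positions are an assumption based on the AMD K8 northbridge MC4_STATUS layout, and the function and table names are made up for illustration; this is not MCat's own code.

```python
# Assumed layout: AMD K8 northbridge MC4_STATUS (64 bits = high word, low word).
# All names below are illustrative, not part of MCat or any AMD tool.
PARTICIPATION = {0: "SRC", 1: "RES", 2: "OBS", 3: "GEN"}
TRANSACTION = {0: "GEN", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
               5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
IO_TYPE = {0: "MEM", 2: "IO", 3: "GEN"}
CACHE_LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "LG"}


def decode_status(high, low):
    """Split the 'Error Status (0x)' pair into named fields."""
    status = (high << 32) | low

    def bit(n):
        return bool((status >> n) & 1)

    code = status & 0xFFFF  # bus error code: 0000 1PPT RRRR IILL
    return {
        "valid": bit(63),
        "second_error": bit(62),      # overflow: a further error was logged
        "uncorrected": bit(61),
        "reporting_enabled": bit(60),
        "addr_valid": bit(58),
        "correctable_ecc": bit(46),
        "found_by_scrubber": bit(40),
        "cpu0": bit(32),
        # Syndrome: upper byte from low word bits 31:24, lower byte from bits 54:47.
        "syndrome": (((low >> 24) & 0xFF) << 8) | ((status >> 47) & 0xFF),
        "ext_code": (status >> 16) & 0xF,
        "error_code": code,
        "participation": PARTICIPATION[(code >> 9) & 3],
        "timed_out": bool((code >> 8) & 1),
        "transaction": TRANSACTION.get((code >> 4) & 0xF, "reserved"),
        "io": IO_TYPE.get((code >> 2) & 3, "reserved"),
        "cache_level": CACHE_LEVEL[code & 3],
    }


def address_mb(high, low):
    """Convert the 'Error Address (0x)' pair to whole megabytes (2**20 bytes)."""
    return ((high << 32) | low) >> 20


event1 = decode_status(0x9428C000, 0x54080A13)
print(hex(event1["syndrome"]), event1["participation"], event1["transaction"])
print(address_mb(0x00000000, 0xFA8292B0))
```

Under those assumptions the fields for Event Source 1 come out as syndrome 0x5451, RES, RD and 4008 MB, which agrees with what MCat printed, so the decode itself looks consistent: a valid, corrected (not uncorrected) ChipKill ECC event on a DRAM read.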

I have tried running memory tests and swapping RAM into different locations, but without significant change. I guess my next step is to remove one of the processors - would you agree?

But, before I start doing that - are these 'errors' actually of significance, or am I wasting my time in trying to solve them?

If you need any additional information to help diagnose this issue, I'd be only too happy to provide it. I would appreciate your thoughts.

Thank you in advance for your time and assistance,

Simon
 06/24/2007 08:47 PM
EduardoS
Member

Posts: 133
Joined: 03/01/2006


Processor Number : 1
Bank Number : 4

You can try swapping the memory modules; if one of the two numbers above changes, you probably have a bad memory module.

note: Actually a dual Opteron 244 isn't impressive.

Edited: 06/24/2007 at 08:49 PM by EduardoS
 06/24/2007 09:24 PM
wellnomics
Junior Member

Posts: 2
Joined: 06/17/2007

Hello again,

Thank you for your quick response. I realise that whilst not 'top notch', we should expect 'decent performance' from our system and it shouldn't feel 'sluggish'...

Anyway, I have previously tried swapping the memory modules and the error does not change significantly. The only way I got a discernible variation was by removing two of the chips - in that case, the Address reported in the System Event Log changed from around 4202860976 to 2668372208.

Any more thoughts?

Thanks again,

Simon
 06/25/2007 09:34 PM
EduardoS
Member

Posts: 133
Joined: 03/01/2006

So... swap the processors; if nothing changes, at least the problem isn't the processor.

The address change when removing two chips is interesting. With 2GB of memory, where does the second number point to?
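
As a rough way to look at that question, the two decimal addresses from the Event Log can be converted to hex and megabytes. This is only a sketch: the `describe` helper is hypothetical, and the installed sizes (4096 MB before, 2048 MB after removing two modules) are assumptions taken from the figures mentioned in this thread.

```python
MIB = 1 << 20  # bytes per megabyte (2**20)


def describe(addr, installed_mib):
    """Summarise a physical address against the assumed installed memory size."""
    return {
        "hex": f"0x{addr:08X}",
        "mib": addr // MIB,
        "above_installed": addr >= installed_mib * MIB,
    }


before = describe(4202860976, 4096)  # all modules fitted (assumed 4096 MB)
after = describe(2668372208, 2048)   # two modules removed (assumed 2048 MB)
print(before)
print(after)
```

The first address is 0xFA8291B0 (about 4008 MB), the same address MCat shows for Event Sources 2 and 4. The second is 0x9F0C20F0, about 2544 MB, which is above 2048 MB. One possible reading is that the physical address map is not one contiguous block starting at zero, but nothing in the thread confirms how the board lays memory out.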
 07/13/2007 10:46 AM
X4600_Destroyer
Junior Member

Posts: 2
Joined: 07/13/2007

I'm having the same problem with a Sun X4600M2. I swapped the processors around, and when I ran MCAT again it showed the same error in a different location. Sun came out and replaced the memory DIMMs and I'm still showing the same errors. I fear that I'm going to have to replace the whole CPU card.

-------------------------
Sun X4600M2
8x 8000 Series Opterons=16 Cores
32GB DDR2 ECC RAM

(1 bad proc= $60,000 paper weight)
 07/13/2007 10:49 AM
X4600_Destroyer
Junior Member

Posts: 2
Joined: 07/13/2007

Event Source 231 - WMIxWDM
Processor Number : 8
Bank Number : 4
Time Stamp (0x): 01C7C4A5 F982B850
Error Status (0x): D4014001 00080813
Error Address (0x): 00000004 FD54A010
Single bit errors:
Error associated with CPU0 core
Correctable ECC error
Error address valid in MCi_ADDR
Error reporting enabled
Second error
Error valid
Bus Error Code:
Participation processor: Local node originated the request (SRC)
Time-out: Request did not time out
Memory transaction type: Generic read (RD)
I/O: DRAM memory access (MEM)
Cache level: Generic (LG)
North Bridge Error MC4:
Extended Error Code: 0x8 - ChipKill ECC Error
Error Code: 0x0813
DRAM memory access (MEM) Generic read (RD), on Generic (LG) cacheassociated with CPU0
ChipKill Syndrome: 0x0002
Error address at 20437 MB
Address decode:
Node ID4, Chip Select2, Logical DIMM1
---

Event Source 232 - WMIxWDM
Processor Number : 10
Bank Number : 4
Time Stamp (0x): 01C7C4AB 2F87A43E
Error Status (0x): D4014100 00080A13
Error Address (0x): 00000005 01444140
Single bit errors:
Error found by scrubber
Correctable ECC error
Error address valid in MCi_ADDR
Error reporting enabled
Second error
Error valid
Bus Error Code:
Participation processor: Local node responded to the request (RES)
Time-out: Request did not time out
Memory transaction type: Generic read (RD)
I/O: DRAM memory access (MEM)
Cache level: Generic (LG)
North Bridge Error MC4:
Extended Error Code: 0x8 - ChipKill ECC Error
Error Code: 0x0A13
DRAM memory access (MEM) Generic read (RD), on Generic (LG) cache
ChipKill Syndrome: 0x0002
Error address at 20500 MB
Address decode:
Node ID5, Chip Select2, Logical DIMM1
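
The Time Stamp fields in these dumps can be turned into dates as well. The sketch below assumes the two hex words form a Windows FILETIME (a count of 100 ns ticks since 1601-01-01 UTC); that format is an assumption, since the MCat output does not state it, and `filetime_to_utc` is an illustrative name.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "Time Stamp (0x)" is a Windows FILETIME split into high and low words.
FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def filetime_to_utc(high, low):
    """Convert a high/low FILETIME pair to a timezone-aware UTC datetime."""
    ticks = (high << 32) | low  # 100-nanosecond units
    return FILETIME_EPOCH + timedelta(microseconds=ticks // 10)


print(filetime_to_utc(0x01C7C4A5, 0xF982B850))  # Event Source 231
print(filetime_to_utc(0x01C7C4AB, 0x2F87A43E))  # Event Source 232
```

If the FILETIME assumption holds, both events land in July 2007, shortly before this post, while the four events in the original post share a single timestamp from August 2006.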

 07/31/2007 12:31 AM
Brane
Member

Posts: 126
Joined: 07/12/2006

Early Optys had some kind of bug in the ECC circuitry, so when ECC was active they ran slowly and were in some circumstances prone to crashing.

If you bought a Tyan plus a couple of old unicore Optys to be impressed, then I'm afraid you made a terrible choice.

Old unicore Optys are overpriced crap and old HT 1.0 links are too slow for really efficient workload sharing.

That kind of board might work fine for an optimized server with precisely set affinities for each job, but it isn't so stellar otherwise.

If you got it from someone for peanuts, fine, but if you paid full price for a new one, you got a bad deal.

WRT the error: Try using minimal RAM (just two RAM sticks on CPU0) and see if the error goes away with each pair of RAM sticks.
If not, try exchanging the CPUs and repeat.

-------------------------
On the journey of life I chose the psycho path
 02/11/2008 07:38 AM
mobigital
Junior Member

Posts: 9
Joined: 02/10/2008

Originally posted by: wellnomics

Hi there,

We are running Windows Server 2003 Enterprise x64 Edition on a dual AMD Opteron 244 system with 4 GB RAM and a Tyan S2882 motherboard.

...
...
...

If you need any additional information to help diagnose this issue, I'd be only too happy to provide it. I would appreciate your thoughts.

Thank you in advance for your time and assistance,

Simon



Simon,

Was your system bugchecking from these correctable errors?
My system is bugchecking from similar errors, and I am not sure how to prevent it, aside from addressing the memory issue.
If the error can be corrected, I don't want my system to halt or reboot.

Did you install anything special on the machine to treat these errors as warnings?