Topic Title: How do I verify faulty cores?
Created On: 03/13/2014 10:26 AM
 03/13/2014 10:26 AM
metalbunny
Peon

Posts: 15
Joined: 01/02/2013

Okay, this is a tricky problem I've got, that I can't even reproduce reliably.

I've got an FX-8350 that I bought on release in November 2012 (had to wait 3 weeks for it to get shipped from Dresden). It's always been a bit finicky, but since it's not properly supported by Windows I never thought much of it.

However, occasionally (rarely) I've got Windows crashing to BSOD, error code 0x101, with the message that one of the secondary processors has timed out. I've got all the hotfixes installed, disabled all the throttling features, and 99% of the time the machine runs rock solid and will happily pound 100% load on all 8 cores when rendering video.

Now comes the problem: About a month ago the fan on the CPU cooler failed, but the motherboard cut the power before it got too hot (it cuts the power at 63 C). I've replaced the cooler (with a CoolerMaster Hyper 212 Evo which keeps the CPU so cold I have yet to see it go above 50 C under full load). There is no visible heat damage on either side of the CPU, no evidence on the heatsink either (I've had AMD CPUs overheat in the early Athlon days where they made burn marks on the heatsinks). HOWEVER, the crash to "secondary processor timeout" has happened multiple times since then (like 3-5). Funny thing is it crashes when it's not really doing anything, but if it's running a lot of active programs at the same time it seems to be more stable.

I've tried to run Prime95 to torture the CPU, but I cannot get the system to crash. Even at 100% load (according to Windows Resource Monitor) on all cores the machine acts like it's not doing anything. BUT, Prime95 tells me core #5 is faulty and it will stop the thread on that core in less than a minute, while all the other cores continue at 100%. I do get calculation errors on the second half of the cores, but since I don't understand what the heck Prime is actually doing, I can't tell if that's a real issue or not. AFAIK AMD's FPUs have always been a bit weak.

I'm not sure how Windows indexes the cores, but with 4 cores producing math errors, would that mean that 2 of the FPUs are faulty? I believe Vishera only has 1 FPU per core pair.
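
The arithmetic behind that question can be sketched in a few lines. This is only an illustration, assuming cores are numbered so that each consecutive pair shares one module (and one FPU); how Windows or Prime95 actually index the cores is not confirmed anywhere in this thread, and the function names are made up for the example.

```python
# Sketch: map core indices to modules, where each module has one shared FPU.
# ASSUMPTION: consecutive core pairs (0-1, 2-3, 4-5, 6-7) share a module.
CORES_PER_MODULE = 2


def module_of(core_index: int, one_based: bool = False) -> int:
    """Return the 0-based module number that a core index belongs to."""
    if one_based:
        core_index -= 1
    if core_index < 0:
        raise ValueError("core index out of range")
    return core_index // CORES_PER_MODULE


def modules_affected(core_indices, one_based: bool = False):
    """Return the sorted list of distinct modules covering the given cores."""
    return sorted({module_of(c, one_based) for c in core_indices})


# The "second half" of an 8-core part, using 0-based numbering.
second_half = [4, 5, 6, 7]
print(modules_affected(second_half))
```

Under that numbering assumption, errors on four adjacent cores point at two modules, so two shared FPUs, which matches the guess in the post.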

My real issue is that since Windows doesn't support the SMT design, how the heck do I prove that there's anything wrong with the cores at all, and that it's not just Windows' lack of proper multithreading that causes these problems? For some reason I expected AMD to have a proper testing guide or tool for download, but I can't find anything.

My work depends on this computer functioning, thus I'm not really inclined to ship off the CPU without being able to prove that it is actually faulty (also I have no idea what the warranty actually is on this CPU). I don't have a spare machine atm, which makes being without a working computer very difficult. Thus I'm looking for a way to prove what exactly is going on, since for the most part everything runs just fine, until Windows suddenly decides to crash the system for an error that isn't apparent. I've had to modify the registry to keep it from crashing the display driver and other drivers because the default timeout settings are way too low, but I don't know if it's a similar issue with the CPU.

EDIT: Bought it in 2012 (first release), not 2013



Edited: 03/13/2014 at 02:34 PM by metalbunny
 03/13/2014 11:49 AM
AMDforMe
Overclocker

Posts: 619
Joined: 09/08/2013

P95 just runs math computations. It's not unusual at all for one or more cores to be a little weaker than others and thus stop working during a stress test - when everything isn't 100% correct. There may or may not be anything wrong with the CPU at all. Without being able to duplicate the issue it's a tough one to fix or claim a warranty. If P95 will reproduce the stopped worker every time then you have a chance to resolve or confirm the problem.

The way I would approach this is to run Memtest86+ v4.20 or V5 overnight to be sure there isn't a RAM issue. Unfortunately in recent years I have seen name brand RAM fail in as little as 2 weeks to 2 months.

If the RAM is fine then I'd run P95 a few times to see if you have the same core #5 stop each time. If so then I'd manually set the proper RAM timings, frequency and CPU vcore voltage and test again. If you still have core #5 issues I'd bump the CPU-NB to 1.3v and HT to 1.25v and test again.
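
One way to keep track of the repeated P95 runs described above is to note which workers stopped in each run and then check which ones stop every time. The sketch below is only a bookkeeping aid; the worker numbers in `example_runs` are hypothetical placeholders, not results from this thread, and nothing here reads Prime95's own output files.

```python
from collections import Counter


def failure_counts(runs):
    """Count how many runs each worker stopped in.

    runs: a list with one set of stopped worker numbers per test run.
    """
    return Counter(worker for run in runs for worker in run)


def consistently_failing(runs):
    """Return the workers that stopped in every single run."""
    counts = failure_counts(runs)
    return sorted(w for w, n in counts.items() if n == len(runs))


# HYPOTHETICAL placeholder data: three runs, written down by hand.
example_runs = [{5}, {5, 7}, {5}]
print(consistently_failing(example_runs))
```

A worker that stops in every run is a reproducible symptom worth reporting; one that stops only occasionally says less about which core, if any, is at fault.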

If you still have issues I'd file an AMD tech support ticket and see if they will warranty the CPU - just in case it's an internal issue. There can be many issues that cause a worker to stop under P95 or heavy video use however and Windows error messages are not always accurate. I've seen corrupted/bad driver files cause BSOD, hangs and CPU errors.



-------------------------

Building a reliable PC involves more than just assembling the parts. You need to be able to configure all of the BIOS settings appropriately. This can be quite involved and frustrating as it can require a lot of trial and error with stress testing. It is however often the only means to get a 100% reliable PC.

 03/13/2014 02:54 PM
metalbunny
Peon

Posts: 15
Joined: 01/02/2013

How would RAM cause a processor timeout? I'll have to see about Memtest, not sure if I have a 64 bit version.

EDIT: I understand that with a processor as massively complex as an 8-core there are bound to be some minor faults. I fully accept that, I'd just rather those faults didn't cause a system crash, though I find Windows is really really bad at error handling.

Realised I forgot to list my specs, so I'll stick them in at the bottom here.

I looked over the voltages, and the BIOS said the Vcore was supposed to be 1.285 V, but the actual voltage was flickering all over the place, dropping down to 1.26 at times, so I raised it to 1.3 V which seems to make it not drop quite as much. Since my crashing is so random I have no idea if it has any effect yet - could be months before another crash, or it might not crash at all. I seem to recall raising the voltage when I originally built the machine, but since it's over a year ago the BIOS has been updated a couple times since, and I have no idea what I set it at.

When I changed the cooler I had all the RAM out because I lost half of it in the system (showed as 16 GB with 8 GB usable). Took all the sticks out, mixed them up, and then put them back in, which fixed that.

Main specs:

MSI 990FXA-GD80 motherboard, bios 11.43

AMD FX-8350 4 GHz (seen it hit 4.4 GHz in turbo boost, but it's on auto so the BIOS controls that)

16 GB DDR3-1600 Kingston HyperX Genesis (4x4 GB dual-channel. OC'ed the controller to 1600 to support the RAM).

1x MSI Geforce GTX 660 Ti 2 GB OC (1.3 GHz) Power Edition

Power Supply is CoolerMaster Silent Pro M2 620 W (55 A +12V rails to handle the graphics card, not sure about the other rails)

I got 3 SATA HDD drives in AHCI mode (not RAID), 1 DVD writer, all permanent USB devices except for the keyboard have their own power supply.

OS is Win 7 Home Premium 64 bit

 03/13/2014 03:36 PM
AMDforMe
Overclocker

Posts: 619
Joined: 09/08/2013

If you're running in auto mode the vcore and frequency should adjust all over the place based on load and CPU temp. Typically the FX-8000 series have a minimum default vcore of 1.325v or higher for the P0 state. This would be what you'd set it to in manual mode. Your BIOS may show you the default vcore for your CPU when you switch from "auto" vcore mode to manual mode.

As far as a processor timeout it can be caused by many things including bad RAM/drivers. It means the CPU is waiting for another instruction to complete but there is a reason why it can't. It doesn't mean the CPU has a technical internal issue. P95 and most stress tests use some portion of RAM to conduct their tests.




 03/16/2014 09:56 AM
metalbunny
Peon

Posts: 15
Joined: 01/02/2013

Initial run of Memtest86+ v5 (single and 8 core tests) with default tests came up with nothing. Will have to check the readme and see if it has any tests that aren't in the default set when I have the time to run them (like overnight or something).

Had another BSOD crash with a "page fault in non-paged area" referencing the DirectX memory manager. Never seen that before, but I did recently update the graphics drivers.

It may just be that it takes a while to get the system stable again, like it was when I first built it, but since I adjusted the Vcore I haven't had a processor timeout, though it's too early to tell if it's a permanent fix.

 03/16/2014 10:28 PM
AMDforMe
Overclocker

Posts: 619
Joined: 09/08/2013

I'd set the vcore to 1.325v and use the LLC setting that holds it closest to that. I'd also bump the RAM voltage +.05v if you have not already done so. Did you try increasing the CPU-NB and HT voltages? If not I'd give them a try too if the vcore and RAM voltage increases don't fix the issue.




 03/17/2014 05:16 AM
metalbunny
Peon

Posts: 15
Joined: 01/02/2013

I actually totally forgot to check whether the RAM is running at the right voltage. Also just thought of the fact that I did up the speed to match the RAM, so it'd actually make sense if the HT needs a little more juice.

I'll try to adjust the voltages to your suggestions and see if I can get the system stabilized.

OK. You suggested to up the NB voltage, but looking at the voltage settings I wasn't sure if you meant the CPU side or the NB side, so I only changed the CPU side. I couldn't find an LLC setting. I left everything else on auto. All current voltage settings are as below.

  • CPU   1.325100 V
  • CPU-NB   1.308866 V
  • CPU-PLL   auto
  • CPU DDR-PHY    auto
  • DRAM    1.65100 V
  • DDR Vref    auto
  • DDR VTT    auto
  • NB    auto
  • NB PCI-E    auto
  • HT Link    1.256 V
  • SB    auto

According to Kingston this 4x4 GB set runs 1.65 V, while all their other DDR3-1600 sets run 1.5 V. The BIOS only tells me what the Vcore currently is; it doesn't actually display the others.



Edited: 03/17/2014 at 06:56 AM by metalbunny
 03/17/2014 06:55 AM
metalbunny
Peon

Posts: 15
Joined: 01/02/2013

Well... the machine dropped into a hard reset without warning or error before I even got to the testing, so I set the NB and HT voltages back to auto for now.

According to Hardware Monitor the 1.325 V setting makes the core run at 1.3 V ... but is that enough to run on turbo?
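
For a sense of scale, the gap between the BIOS setting and the monitored value can be put into numbers. A minimal sketch, using only the two voltage pairs reported in this thread (1.285 V set vs. 1.26 V observed earlier, 1.325 V set vs. 1.3 V observed now); the function name is made up, and software voltage readings are themselves only approximate.

```python
def vdroop(set_volts: float, measured_volts: float):
    """Return (drop in volts, drop as a percentage of the set value)."""
    drop = set_volts - measured_volts
    return drop, 100.0 * drop / set_volts


# Values taken from the posts above: (BIOS setting, monitored reading).
readings = [(1.285, 1.26), (1.325, 1.30)]

for set_v, measured_v in readings:
    drop, pct = vdroop(set_v, measured_v)
    print(f"set {set_v:.3f} V, read {measured_v:.3f} V: "
          f"drop {drop * 1000:.0f} mV ({pct:.1f}%)")
```

Both pairs work out to roughly the same 25 mV drop, which suggests the offset stayed about constant when the setting was raised. Whether 1.3 V is enough for turbo is a separate question this arithmetic cannot answer.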

Also... is it possible that it's the AMD AHCI driver that's being buggy? Running the controller in AHCI mode is the only thing that's different after I reinstalled the system.

Says the AMD SATA driver is 1.2.1.296, and 1.2.001.0296 in details. This is from the chipset drivers MSI provides, since it's impossible to find them on the AMD site.



Edited: 03/17/2014 at 07:03 AM by metalbunny
 03/17/2014 02:42 PM
metalbunny
Peon

Posts: 15
Joined: 01/02/2013

Machine passed all tests, but then I ran into another hard reset while playing SWTOR. Decided to check the power connections one more time, and discovered that the RAM was burning hot to the touch (there's no thermal sensor for that, so no clue how hot it really gets).

All the issues I've been having the past month started after I had to replace the CPU cooler, so I guess maybe it doesn't contribute as much to cooling the RAM as the old one did - it's a better heatsink, bigger fan, and it spins much slower. So in an attempt at solving that problem I upped the CPU min speed limit to 75%, and changed the front fan speed to 80% from Auto (which is annoying since it means fan noise, but beats the machine crashing). AFAIK the BIOS adjusts the case fans based on NB temps and since that's in the low 30s it has no reason to run the fans very high.

I believe overheating RAM would explain some of the randomness of my crashes, but without thermal sensors for RAM it's impossible to keep an eye on it.

 03/17/2014 04:59 PM
AMDforMe
Overclocker

Posts: 619
Joined: 09/08/2013

Your RAM should not be "burning hot" as DDR3 DIMMs typically only consume 10-12w. Top mounted heatsinks aren't even required because DDR3 runs so cool. Obviously you need some airflow around the DIMMs but it should not take much to keep the temps in check. Overheated RAM certainly could cause random crashes. The question is why is the RAM so hot?




 03/18/2014 09:32 AM
metalbunny
Peon

Posts: 15
Joined: 01/02/2013

Like I posted earlier, the machine's run solid and trouble free for over a year, which makes it even stranger that it's acting up just because I changed the CPU cooler. Last summer it ran with room temperatures up towards 30 C without issues (it's very hot in this building).

Just for the sake of clarification, this is a (not that great) picture of my system: http://metalbunny.net/stuff/20140318_133429.jpg

As it can be seen in the picture, the CPU fan sits over memory bank #4, but there is a gap underneath (albeit only like half a centimeter). Besides the 140 mm fan in the back that's visible, there's also 2x 140 mm fans in the front. The upper drive cage is sideways to act as an airduct, and even with the fans on low there's actually quite high airflow going through the gap between the CPU cooler and the drive cages.

Since I cannot find any faults using the benchmark tools, I returned all voltage settings and case fan speeds to "auto", and then disabled the CPU fan control - which makes the fan run at full speed at all times. Basically that's how it's always been set while in my R4 case. This way the CPU cooler is audible, but it's still a dampened case so it's far from loud.

Since I cannot find anything wrong using the testing tools, I tried SWTOR one more time and with these settings I was able to play without it crashing. But changing the voltage settings at all made the RAM go super hot and the system very unstable.

One thing I have noticed is that with the RAM set at 1600 it actually runs 1560. Hardly significant though.

As for the temps: CPU idles at 30-35 C, system (which I assume is the NB) says 30 C, and core temps report 10-15 C. While playing SWTOR, CPU reached 47 C, system 45 C, and core temps maxed out around 37 C. RAM got very warm to the touch, but not as insanely hot as before when it crashed.

Or rather, the RAM heatspreaders get very hot. Kingston lists the operating temperature as max. 85 C, which would cause the heatspreaders to get incredibly hot, though they ideally should never reach those temperatures. I wonder if the problem is that the air flows over the RAM, and there's not enough turbulence to actually cool them properly, but then again it ran for a year where it was not an issue.

EDIT: After I changed the CPU cooler last month the machine wouldn't boot and I had to reset the CMOS to get it to post. Once running only half the RAM was usable (8 GB missing), so at first I tried forcing dual-channel manually and switching it to manual timing settings. Then I ended up removing all the RAM, switching around all the sticks, and reinserting them, which made all 16 GB usable again, but I never changed the timing and link settings back to auto until it crashed the 2nd time yesterday. I don't know if switching the timing to manual disables the DDR throttling. Much like 99% of all the other settings in the BIOS, there is no documentation for them (I love MSI's quality, but they have no clue about documentation).



Edited: 03/18/2014 at 09:43 AM by metalbunny
 03/18/2014 11:06 AM
AMDforMe
Overclocker

Posts: 619
Joined: 09/08/2013

If the RAM has a 1.65v default for 1600 MHz. frequency then it's really low binned RAM as most DDR3 RAM will run @ 1600 MHz. @ 1.5v. The over-volting to make it stable @ 1600 MHz. is a bit unusual but typically a crutch with 4x DIMMs and weak RAM.

The fact that it shows to be running at 1560 MHz. is also odd. This might be a mobo issue however. Whatever the case it should not be very hot when in use, definitely not burning hot. It's really difficult to troubleshoot this kind of issue unfortunately.






Edited: 03/18/2014 at 03:29 PM by AMDforMe
 03/18/2014 11:40 PM
mrfla
Nerfed

Posts: 58
Joined: 03/07/2014

Since it happens only at idle, as you said, I would say the power-saving feature on the CPU is not stable. It may be trying to drop the voltage too low at idle to save energy, and then it crashes. Look for the newest BIOS version for the motherboard, or disable the idle state so the CPU always stays at a stable voltage.

 03/20/2014 04:16 PM
mrfla
Nerfed

Posts: 58
Joined: 03/07/2014

Sorry for the double post. As you said it only happens when the CPU is doing nothing, go to Catalyst/Performance/CPU consumption, slide the lowest clock a bit higher, and leave the highest one as it is. That's it; it should stop crashing.

 03/22/2014 07:09 AM
metalbunny
Peon

Posts: 15
Joined: 01/02/2013

It mostly crashed at near idle, lately it was only under 25-30% load, but stable on idle. All the power saving features are disabled, so it shouldn't drop the voltage or the clock at any time.

Since I made my last post there haven't been any problems. No errors, no crashes, no glitches. Only thing that even showed up was the old Windows bug that thinks the graphics driver has timed out (because it can't be bothered drawing the desktop when running full screen programs) and then restarts it even though it's working perfectly - I keep adjusting the timeout setting, but this is a bug MS won't fix. I got rid of all the warnings in device manager (non-existing devices), and have nothing to report from the Event Log.

As for BIOS, MSI recommends only to use the new BIOS versions with Windows 8 (they also don't seem to fix anything except maybe support for FX-9000), which makes the one I have the latest relevant version.

Catalyst CC and AMD Overdrive seem to have compatibility issues with this motherboard (Overdrive even says it does), so I cannot use them for anything practical. MSI's own motherboard utility doesn't list the OC settings (or even the correct current settings), which makes it fairly useless. I use Open Hardware Monitor to keep an eye on voltages, fan speeds, and temperatures - it's the only one I've found that properly lists the HDD temps and lists the individual speeds of each core. PCwizard doesn't like this system and reports impossible voltages (like +12V being 0.2V).

As for the RAM voltage, Kingston wrote 1.65-1.9v for AMD use on the product page, but the spec sheet for this specific RAM kit lists 1.5V. Guess I should have looked at the PDF first instead of trusting the web page. I'm still waiting for Kingston to reply to my mail to them, and hoping they can clarify if there is a compatibility problem between this RAM and motherboard. It's listed as Intel XMP RAM (i7 4-way), but I do believe that should not make any difference.

As for all the other oddities, like the timings: I think it has to be the motherboard. I can handle the RAM being a little underclocked if it keeps it cold (or maybe the MC just doesn't want to go higher than 7.8x?). The CPU idles at 20.5x (4100 MHz) instead of 20x like it's supposed to, and I'm thinking if the core clocks are a bit off, maybe it throws off the RAM clock as well?
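
The numbers in that paragraph can be cross-checked with a little arithmetic. This is a rough sketch, assuming the CPU multiplier and the 7.8x memory figure both apply to the same reference clock, and treating "1600" and "1560" as the rates the tools report; neither assumption is confirmed in this thread.

```python
def reference_clock_mhz(observed_mhz: float, multiplier: float) -> float:
    """Back out the reference clock from an observed clock and its multiplier."""
    return observed_mhz / multiplier


# Values quoted in the post: the CPU idles at 20.5x and 4100 MHz.
ref = reference_clock_mhz(4100, 20.5)

# ASSUMPTION: the memory multipliers use the same reference clock.
ram_at_7_8x = round(7.8 * ref)   # the multiplier guessed at in the post
ram_at_8x = round(8 * ref)       # what a 1600 setting would need

print(f"reference clock: {ref:.1f} MHz")
print(f"7.8x -> {ram_at_7_8x}, 8x -> {ram_at_8x}")
```

If those assumptions hold, the reference clock works out to a round 200 MHz, so the 1560 reading would come from the lower memory multiplier rather than from a drifting base clock. The 20.5x core multiplier on its own would not pull the RAM clock down.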

I still don't have an explanation for why the RAM gets super hot when set to manual timings (without even changing them from default), but barely gets warm on auto. According to Memtest it makes no difference, the RAM is just noticeably warmer to the touch on manual timings.

But for now the system is stable, and that's all I wanted to begin with. I'll post again if anything changes, but - fingers and toes crossed - it looks like it's behaving now. So thanks for the help and advice I've gotten thus far.

 03/22/2014 12:18 PM
AMDforMe
Overclocker

Posts: 619
Joined: 09/08/2013

Hopefully you've corrected enough issues that your PC will be stable for your use. The mobo sounds problematic if the BIOS doesn't show the correct info. and industry std. applets like AMD Overdrive won't work with it. MSI must have done some odd stuff or it's just a bad mobo? The BIOS may be compromised because of the oddity of Win 8?




 03/27/2014 07:15 AM
metalbunny
Peon

Posts: 15
Joined: 01/02/2013

Originally posted by: AMDforMe Hopefully you've corrected enough issues that your PC will be stable for your use. The mobo sounds problematic if the BIOS doesn't show the correct info. and industry std. applets like AMD Overdrive won't work with it. MSI must have done some odd stuff or it's just a bad mobo? The BIOS may be compromised because of the oddity of Win 8?

AMD Overdrive works for reporting status of speeds/temperatures etc, but all the adjustment options don't work. Right now I can't remember the error message it gave me. The BIOS itself reports the current settings (when in the BIOS/CMOS menu during POST) though I can only find the current CPU core voltage and no mention of the others.

Not exactly sure why MSI's own tool only reports the default settings and not the OC settings, but I tried installing all the MSI tools, which didn't change it. It seems a lot like they didn't update the utility when they updated the BIOS.

I believe the issue with the utilities comes from MSI's Click BIOS II. The motherboard manual only shows and mentions basic settings in the traditional text-mode BIOS interface, which I have never seen and have no idea if it is even in the board. When entering setup it loads the Click BIOS. It's visual and mouse driven and very easy to use, but the built-in help is of no use and the mobo manual only explains some very basic settings that are not interesting. The vast majority of options in the BIOS setup are undocumented, at least that I've been able to find. I have not found any option to disable the Click BIOS either.

As far as I can tell the board was built for the original Bulldozer CPUs as they are mentioned briefly somewhere in the manual. But I had to update the BIOS to even get it to work with the FX-8350, otherwise it showed as "unknown CPU" and ran at 2 GHz. Granted, I ordered the system the day after the FX-8350 was made available, but still.

As for why they recommend to only update the BIOS for Windows 8, your guess is as good as mine. The newer BIOS only says "Update CPU AGESA code" which I assume means it's just support for the newer models.

But so far I've not had a single crash. No log errors, no warnings. The whole problem seems to have started with disabling the automatic RAM settings. After I switched them back to auto (I had it set to "unlink") all my trouble went away. I guess it's A. a bug in the BIOS, or B. there's an undocumented setting that gets changed when switching to manual or C. something unlisted that is affected by it.

I don't like running the RAM on manual anyway, because Kingston doesn't disclose the vast majority of the timing settings the BIOS offers (even their support only gives me 9-9-9-27).

I've so far been able to switch the CPU fan back to auto (which cuts 400 rpm off the fan and makes it barely audible) without the system crashing, so I'm guessing that wasn't part of the problems.

 03/27/2014 04:51 PM
metalbunny
Peon

Posts: 15
Joined: 01/02/2013

Of course now Windows decided to crash the computer again... although this time it appears to be caused by the usual problem of Microsoft not having a clue about putting error handlers into their software.

Was playing one Steam game while downloading another, and then it crashed to BSOD. The error was

The computer has rebooted from a bugcheck.  The bugcheck was: 0x1000007e (0xffffffffc0000005, 0xfffff80002efe622, 0xfffff880035e06d8, 0xfffff880035dff30). A dump was saved in: C:\Windows\Minidump\032714-20217-01.dmp. Report Id: 032714-20217-01.

Going by MSDN, it's a bug in the Windows driver kit causing memory access violations. I don't think this can be blamed on the hardware, nothing's even warm so def not the RAM overheating this time.
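
For reference, the bugcheck line quoted above can be picked apart mechanically. The sketch below is a small lookup covering only the stop codes that come up in this thread, not a complete table. For 0x1000007E the first parameter is the unhandled exception code, sign-extended to 64 bits, so it is masked to its low 32 bits before the lookup. The function and table names are made up for the example.

```python
import re

# Only the stop codes mentioned in this thread; not a complete list.
BUGCHECK_NAMES = {
    0x101: "CLOCK_WATCHDOG_TIMEOUT",
    0x50: "PAGE_FAULT_IN_NONPAGED_AREA",
    0x1000007E: "SYSTEM_THREAD_EXCEPTION_NOT_HANDLED_M",
}
NTSTATUS_NAMES = {0xC0000005: "STATUS_ACCESS_VIOLATION"}

LINE_RE = re.compile(r"bugcheck was:\s*(0x[0-9a-fA-F]+)\s*\(([^)]*)\)")


def parse_bugcheck(line: str):
    """Return (code, [params]) from an event-log bugcheck line, or None."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1), 16)
    params = [int(p.strip(), 16) for p in match.group(2).split(",")]
    return code, params


def describe(line: str):
    """Return (bugcheck name, exception name) using the small tables above."""
    parsed = parse_bugcheck(line)
    if parsed is None:
        return None
    code, params = parsed
    name = BUGCHECK_NAMES.get(code, "unknown")
    # Parameter 1 holds the exception code; keep only the low 32 bits.
    status = params[0] & 0xFFFFFFFF if params else None
    return name, NTSTATUS_NAMES.get(status, "unknown")


log_line = ("The bugcheck was: 0x1000007e (0xffffffffc0000005, "
            "0xfffff80002efe622, 0xfffff880035e06d8, 0xfffff880035dff30).")
print(describe(log_line))
```

The name and exception code only say that a system thread hit an access violation nobody handled. They do not identify which driver or piece of hardware was responsible; that would take a look at the saved minidump.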

 03/27/2014 10:03 PM
AMDforMe
Overclocker

Posts: 619
Joined: 09/08/2013

Unfortunately a lot of Windows error messages are totally incorrect so all you can do is try to determine the root cause.




 03/28/2014 03:46 AM
mrfla
Nerfed

Posts: 58
Joined: 03/07/2014

It seems the RAM could be the issue if you are using XMP, which is made for Intel; for AMD it should be an AMP profile. You can now put the voltage back to default since it's not a voltage stability problem. The possibilities are: 1- RAM issue, 2- BIOS issue, 3- Motherboard issue, because the BIOS should show everything as it should be. Be sure you are using a 9xx series chipset.
