AMD Processors
Topic Title: Guide: Lock-ups and Crashes
Created On: 01/10/2005 07:21 AM
Missile Maker
Senior Member

Posts: 1910
Joined: 08/05/2004

This guide was initially written by Jan Peter in the Netherlands, and the original text can be reached by the link in my sig. I've corrected the typos, spelling, and syntax, and updated it with the latest info.

All the usual warnings:

Cheers,
Missile Maker


"My game just went belly up. What's wrong?"

Just to start off with, no program to date that actually does something significant is completely free of errors. The games from Paradox are no exception to the rule, and will crash from time to time. What you need to understand here is that today's PCs are highly complex machines with all sorts of combinations of hardware, OS and driver software. It's just not possible (not even for Microsoft itself) to exhaustively test each and every combination. That can only be done (and even then only to a certain extent) on controlled platforms like an Apple Macintosh, a Sony PlayStation or a Microsoft Xbox.

Now to the failures themselves. Basically, there are three different kinds of failures.

1) At some point in the game, the game time stops advancing and 'pauses' indefinitely.
2) All of a sudden, the game stops abruptly, and you end up in the desktop.
3) All of a sudden, the PC locks up completely and becomes unresponsive to everything.

Now, let’s examine each case in more detail.

1 - The game clock stops.

This is (most likely) caused by an internal routine that enters a loop (to search for something, for example) and never exits again. When this happens in the game engine part of the game (i.e. the part that deals with the game rules, updating the provinces, evaluating the AI, etc.), the current game turn never finishes. Unlike a chess program, where the AI is simply cut off when the time has elapsed, Paradox games extend the game turn until all (AI) processing is complete. If one of the routines during this phase never exits, the current game turn never ends, and that means that the game clock stops advancing. It should be treated as a game bug and, when reproducible, should be reported in the appropriate bug forum. Normal Windows behavior is still possible, like using <alt><tab> to switch to the desktop.
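
As a rough illustration of the kind of loop described above, here is a minimal Python sketch. It is not Paradox's code; the function `find_province`, the `next_of` table and the `max_steps` guard are all invented for this example. A search follows a circular chain of provinces looking for a target. If the target is not in the chain and there is no step limit, the search never returns and the turn never ends.

```python
def find_province(next_of, start, target, max_steps=None):
    """Follow next_of links from start until target is reached.

    next_of is a hypothetical dict mapping each province to the next one.
    With max_steps=None and a circular chain that lacks target, the loop
    never exits, which is the 'game clock stops' case.
    """
    current = start
    steps = 0
    while current != target:
        if max_steps is not None and steps >= max_steps:
            return None  # give up instead of hanging the turn
        current = next_of[current]
        steps += 1
    return current


# A ring of three provinces: A -> B -> C -> A
ring = {"A": "B", "B": "C", "C": "A"}
print(find_province(ring, "A", "C", max_steps=10))  # target is in the ring
print(find_province(ring, "A", "Z", max_steps=10))  # target is missing, guard kicks in
```

Calling `find_province(ring, "A", "Z")` without `max_steps` would spin forever, which is exactly the behavior you see as a stopped game clock.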

2 - The game terminates suddenly

This is the most common type of failure. It is also known as a CtD or Crash to Desktop.

What causes it, you may ask? Well, it can be caused by a lot of things, both under and outside of the game's control. What actually happens is standard Windows behavior when what is known as an exception occurs. Exceptions are (mostly) fatal occurrences that prevent the application (our game in this case) from continuing. Other applications also can have these kinds of fatal interruptions. How an application responds is ultimately a choice for the programmer. When the application does nothing, the exception ultimately ends up in the Windows kernel. The kernel has only two ways of dealing with it. It can either show you a dialog box with a cryptic looking text and an <Ok> button, or it presents you with the dreaded BSOD (or blue screen of death). Either way, the application is dead and terminated.

For a game like the Paradox games, that is not a good solution. Letting the application die by the hands of Windows itself is a very bad idea. When that happens, none of the resources (except memory) that were claimed by the game are released. That means that DirectX remains active, sound buffers are still allocated, and so forth. Thus, the game engine contains a very rudimentary solution. The game engine captures the exception, and gives itself a more or less orderly way out. Since the game cannot continue running (that is, unfortunately, the nature of an exception), the only thing it can do is to release all resources, close down DirectX, and quit. Now, it would have been nice if the game actually produced a (user friendly) message box stating why it closed down, but unless you are an expert or programmer, that would not mean much to you, the user.
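
The "capture the exception, release everything, quit" pattern can be sketched roughly as follows. This is a Python analogy only: the real engine deals with Windows and processor exceptions, and `FakeResource`, `run_game` and `crashing_loop` are hypothetical names made up for this example.

```python
class FakeResource:
    """Stand-in for something like a DirectX device or a sound buffer."""

    def __init__(self, name):
        self.name = name
        self.released = False

    def release(self):
        self.released = True


def run_game(resources, game_loop):
    """Run game_loop; whatever happens, release every resource before leaving."""
    try:
        game_loop()
        return "finished normally"
    except Exception as exc:
        # The game cannot continue, but it can still leave in an orderly way.
        return f"quit after {type(exc).__name__}"
    finally:
        # Release in reverse order of acquisition.
        for res in reversed(resources):
            res.release()


def crashing_loop():
    # Simulated fatal condition in the middle of the game.
    raise MemoryError("simulated failure")
```

For example, `run_game([FakeResource("video"), FakeResource("sound")], crashing_loop)` returns a short status string, and both resources end up with `released` set to `True` even though the loop failed.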

Now, what are these exceptions? Well, most (if not all) are in fact processor exceptions. The processor, during the course of running the program (and all of its supporting stuff like the video or sound driver) encounters a state that is illegal or otherwise undefined. The most common of those are:

Access Violation
The processor was instructed to access a piece of information at a memory location that either does not exist, or that the current process has no access to (for example, because it belongs to a different process). It invariably means that there is a bug somewhere, because normally this should never happen. Accessing memory through an uninitialized pointer can cause this, as can accessing memory through a pointer that was previously released back to Windows. Now, the question that remains is whether this is caused by the game code (and thus is a bug in the game) or by a driver. If the problem is reproducible in the game (as in: load a save game, do the same steps over and over again, and each time it crashes at the same spot), then it's most likely a problem in the game, and should be reported in the appropriate bug forum.

Page error
The processor was instructed to access a piece of memory that is registered as being stored in a swap file, but for some reason the virtual memory manager could not load it (back) into main memory. It can indicate either a corruption of the paging tables (a malfunctioning device driver can cause this, for example), or that the system is low on pageable physical RAM.

No memory left
An attempt to allocate a chunk of memory has failed, most likely because available RAM has been exhausted. Having little physical RAM, together with an almost full hard disk partition holding the swap file, can cause this. It can indicate a memory leak in the game or another application, or simply that too many applications are open at the same time.

Invalid handle
An attempt was made to call a Windows function with a handle that (no longer) exists. Most Windows API functions perform a (limited) sanity check on the parameters they receive from the calling application. When something doesn't add up here, this exception can follow. It usually indicates a failure in the application that called the Windows function. Again, if this is reproducible, it should be reported in the appropriate bug forum.

When your OS is Windows 9x, there may be a second reason for this type of exception. On a Windows 9x system, there is only a limited amount of memory reserved for allocating Windows resources. Those are the things these handles normally refer to. On a Windows 9x system, a fixed amount of two times 64 KB (yes, you read that correctly: it's kilobytes) is set aside system-wide for storing resources like icons, mouse cursors, edit boxes, menu bars and what not. Having lots of applications open will quickly exhaust this limited amount of RAM, and that can cause Windows API functions to fail.

Illegal instruction
An attempt was made to execute an illegal (or non-existing) processor instruction. Normally, this can never happen. When it does, it usually means the program entered a random piece of memory, thinking that program instructions are stored there. It's usually an indication that some time before this point something has gone wrong, like a processor stack corruption. This can be caused by a function that tries to access a local buffer outside of its defined bounds. This is, btw., how viruses misuse buffer overflow vulnerabilities in the various operating systems.

Privileged instruction
Some processor instructions are reserved for the so called supervisor mode. This is a processor mode, reserved for OS kernel routines and key device drivers. Normal applications (including games) run in user mode. In this mode, the privileged CPU instructions may not be executed. If a program attempts this anyway, then this exception follows. It usually indicates that program execution has entered a chunk of code that it wasn't supposed to enter. Again, as with the Illegal instruction, stack corruption is the most likely cause.

Stack overflow
This is a simple one. The memory, reserved for the stack, has been exhausted. Usually this happens when a routine calls itself (directly or indirectly) indefinitely. It indicates a logic error in the program or a driver.
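
A routine that calls itself without ever stopping is easy to sketch. The Python snippet below is only an analogy: Python protects its own stack with a recursion limit and raises `RecursionError`, whereas a native program like the game would hit the processor-level stack overflow described above. The function names are made up for this example.

```python
def count_down_forever(n):
    # Bug: there is no base case, so the function keeps calling itself.
    return count_down_forever(n - 1)


def try_runaway_recursion():
    """Call the buggy routine and report what happened."""
    try:
        count_down_forever(10)
    except RecursionError:
        return "stack limit reached"
    return "finished"


print(try_runaway_recursion())
```

Adding a base case such as `if n <= 0: return 0` at the top of `count_down_forever` is the kind of logic fix that makes this failure go away.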

Floating point failures
This is a collection of related exceptions, all linked to floating point operations. Things like division by zero, taking the square root of a negative value, that sort of thing. Usually indicates an error in the program's logic.
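
The two examples mentioned above can be reproduced in a few lines. This is a small Python sketch (the helper name `classify_float_failure` is invented here); Python reports these conditions as `ZeroDivisionError` and `ValueError` rather than as raw processor exceptions, but the underlying logic errors are the same.

```python
import math


def classify_float_failure(operation):
    """Run a zero-argument callable and name the floating point failure, if any."""
    try:
        operation()
    except ZeroDivisionError:
        return "division by zero"
    except ValueError:
        return "invalid operation"
    return "ok"


print(classify_float_failure(lambda: 1.0 / 0.0))        # dividing by zero
print(classify_float_failure(lambda: math.sqrt(-1.0)))  # square root of a negative value
print(classify_float_failure(lambda: math.sqrt(4.0)))   # a valid operation
```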

3 - The system freezes completely, leaving the PC unusable.

This is a very nasty condition. However, it has very little to do with the game itself, and a lot with the current system configuration. The most common cause of a full system freeze is a condition that has been named 'infinite loop' by Microsoft. This is, in fact, a system failure within the AGP section of your mainboard. Let me explain a bit.

A modern day AGP video card is much more than simply an advanced version of the good old VGA card and its predecessors. Those were simply dumb frame buffer cards, and all of their memory contents were manipulated by the CPU. Nowadays, video chips are even more advanced than the main CPU itself. Together with the support chips on the video board they are, in fact, a separate computer all by themselves. Like the main CPU, the video card runs its own, highly specialized operating system and communicates with the rest of the system via the AGP interface. The communication can be initiated both by the video chip and the main CPU, and the AGP interface in the main board's chipset controls this communication.

When all goes well, you will never notice anything of this. You only see the result, which is a great looking image in your game of choice. However, things can, unfortunately, go horribly wrong. When the video card is not used as a dumb frame buffer card (something that the standard PCI VGA driver does), the main CPU does not manipulate the contents of the frame buffer directly. Instead, it tells the video processor what to do. The video processor then executes those commands. For this to work, the CPU must be able to tell the video chip what to do, and the video chip must be able to accept those commands. The AGP interface is what connects these two subsystems. Now, in order to speed up processing on both sides of the AGP interface, the chipset maintains a command queue, which buffers the various instructions until such time as the video chip is ready to process them. The size of this buffer is actually determined by the chipset that is in use on your motherboard.

So, what happens if the CPU is stuffing commands faster into the AGP pipeline than the video chip can execute them? Well, sooner or later that buffer fills up. When that happens, the CPU will be stalled by the chipset until such time as the video chip has executed its current command, and retrieves the next pending one from the AGP pipeline. That will free up a slot at the other end. The CPU can now finish putting its command into the AGP pipeline. The stall is lifted, and the CPU is released by the chipset and can finally continue executing program instructions. If the video chip is slow at processing commands for any reason, then this stalling of the CPU by the main board's chipset will be perceived by you, the user, as a temporary system freeze.
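
The stalling behavior can be illustrated with a toy simulation. This is a deliberately simplified Python sketch, not a model of any real chipset: the tick-based timing, the function name `simulate_agp_queue` and its parameters are all assumptions made for the example. Each tick the CPU tries to put one command into a fixed-size queue, and the video chip removes one command every few ticks.

```python
from collections import deque


def simulate_agp_queue(ticks, capacity, gpu_ticks_per_command):
    """Return how many ticks the CPU spent stalled.

    Each tick the CPU tries to enqueue one command. The video chip
    removes one command every gpu_ticks_per_command ticks. Pass 0
    to model a video chip that has stopped fetching commands.
    """
    queue = deque()
    stalled = 0
    for tick in range(1, ticks + 1):
        # Video chip side: fetch the next pending command when ready.
        if gpu_ticks_per_command and tick % gpu_ticks_per_command == 0 and queue:
            queue.popleft()
        # CPU side: enqueue if a slot is free, otherwise stall this tick.
        if len(queue) < capacity:
            queue.append(tick)
        else:
            stalled += 1
    return stalled


print(simulate_agp_queue(20, 4, 1))  # video chip keeps up with the CPU
print(simulate_agp_queue(20, 4, 2))  # video chip is half as fast as the CPU
print(simulate_agp_queue(20, 4, 0))  # video chip has stopped fetching
```

With a chip that keeps up, the CPU never stalls. With a slow chip, the queue fills and the CPU stalls part of the time. With a chip that has stopped fetching, the CPU stalls on every tick once the queue is full, which corresponds to a permanent freeze.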

Things can become even worse if, for some reason, the video chip stops retrieving commands from the AGP pipeline. Then the temporary CPU stall becomes a permanent one. Since the CPU isn't allowed to execute new program instructions, it cannot respond to keystrokes, mouse clicks and the like. Even the sound card's interrupts won't be honored. That usually causes a sound card to repeat its most recently loaded sound fragment over and over again.

What can cause such a condition to occur? Well, as said previously, a modern video chip is a highly sophisticated mini computer with its own operating system. Like Windows, this OS can crash. When it crashes, it won't execute its program until it gets rebooted. A video reset could do the trick, but it's not easy to let the main CPU issue a reset command if the CPU itself is stalled because the AGP pipeline is filled up, because of the video chip's crash. So a hard system reset or a power cycle is usually the only viable way out.

The most likely cause of a video card's crash is, believe it or not, insufficient power. Like it or not, modern day PCs are extremely power hungry. What's more, the tolerances for voltage fluctuations are significantly smaller than a couple of years ago. True, the tolerances are still rated as plus or minus 5%, but on today's AGP x8 boards that is 5% of 0.8 volt, and not 5% of 5 volt as it was a mere 5 years ago. This means that today's chips are far less forgiving if you have a power supply that is not completely up to the task. As a rule of thumb, a good power supply used in any Pentium 4 or AMD Athlon system which is paired with a modern AGP video board should be able to deliver at least 300 W. Be advised, this is not 300 W input, but 300 W output. Power supplies, when they operate, incur thermal loss. On a good power supply this is as little as 15%. On a bad one, this can be as high as 50%. As a second rule, the power supply must be able to deliver 21 Amps combined on the 3.3 and 5 volt power rails. This is not the same as simply adding up the separate Amps listings of the 3.3 and 5 volt rails. A good power supply will list the combined Amps as a separate rating.
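
To put numbers on the figures above, here is a small Python sketch of the arithmetic. The function names are invented for this example, and it assumes the quoted loss percentages are a fraction of the power drawn from the wall, which is one reasonable reading of the text.

```python
def tolerance_band(nominal_volts, tolerance=0.05):
    """Return the allowed (low, high) voltage for a +/- tolerance."""
    delta = nominal_volts * tolerance
    return nominal_volts - delta, nominal_volts + delta


def wall_power_needed(output_watts, loss_fraction):
    """Input power needed to deliver output_watts.

    Assumes loss_fraction of the input power is lost as heat.
    """
    return output_watts / (1.0 - loss_fraction)


for volts in (5.0, 0.8):
    low, high = tolerance_band(volts)
    print(f"{volts} V rail: allowed range {low:.2f} V to {high:.2f} V")

for loss in (0.15, 0.50):
    print(f"300 W output at {loss:.0%} loss: {wall_power_needed(300, loss):.0f} W input")
```

The point of the comparison is the absolute margin: plus or minus 5% of 5 volt leaves a quarter of a volt of slack in each direction, while 5% of 0.8 volt leaves only 0.04 volt.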

A second cause of a video board crash is overheating. Modern video processors run hot, even hotter than your main processor. And while the main processor gets a big cooling solution, the video chip usually has nothing more than a large heat spreader and a small fan. What's worse, the mounting location of the AGP card itself in most computer casings is so bad that the tiny little cooling fan cannot suck in (enough) cool air and get rid of the heated air. This causes the temperature to rise, especially if the chip is working hard, like in a game. When it overheats, a few things can happen. If you're lucky, the card has thermal protection and the chip simply stalls until it has cooled off a bit. As in the filled up AGP pipeline case, you will perceive this as a momentary system-wide freeze that lasts a couple of seconds. If you're unlucky, the chip starts behaving erratically or stops altogether. Again, this causes a permanent system freeze until a hard reset or power cycle.

A third cause for a complete system freeze is the AGP driver software itself. Intel has written the specs for the AGP interface, and these specs allow for the main CPU and the video card to both access main RAM. So, it can happen that both want to access the same location at the same time. Normally this would not be a problem, as only one device can access main memory at any given time, and so either the CPU waits for the video processor or vice versa. However, this does interfere with another portion of the specs, specifically dealing with the CPU side of the communication. Intel has specified that all data transfers should happen in 64 bit chunks, or 8 bytes. Intel also specified that these chunks should always start at multiples of 8 bytes. However, there is a provision that allows access on the uneven 4 byte boundaries, in which case the actual data transfer is split into two separate ones. The first one deals with the lower 4 bytes (scaled up to 8 bytes), and the next one deals with the upper 4 bytes (also scaled to 8 bytes).

While a driver is allowed to do this, it is highly discouraged. The reason why is very simple. As stated before, AGP allows the video processor to initiate memory access. What happens if the video processor wants access to the same memory location as the main CPU is dealing with right now, and it does this precisely between the two split up partial data transfers? Well, the mainboard's chipset refuses the video processor access, because part of the memory transfer concerning precisely that location hasn't finished yet. Allowing the video processor to proceed would alter the memory location, and this would corrupt the pending second half of the CPU's data transfer. By the same token, the second half of the CPU's data transfer will be rejected by the video processor. So both data transfers are essentially blocked, and both the video processor and the CPU are stalled, and cannot continue with their respective programs. Again, what we have here is a complete system freeze. This particular variant was the first confirmed case of a system freeze, and because the data transfer requests bounce back and forth between the video chip and the CPU indefinitely, Microsoft called this type of problem 'infinite loop'. To date, mostly VIA is guilty of this type of failure in their AGP driver, which is part of the VIA 4in1 driver package, later dubbed Hyperion drivers. That's why owners of VIA chipsets (especially the aging KT133 and KT266 models) are hit with the freeze more often than owners of other types of chipsets.
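
The alignment rule behind this problem can be sketched in a few lines of Python. This is an interpretation for illustration only, not taken from the AGP specification: the function name `plan_transfers` is invented, and it assumes that an 8 byte access starting on an odd 4 byte boundary is split into one transfer for each of the two aligned 8 byte chunks it touches.

```python
def plan_transfers(address):
    """Return the aligned 8-byte chunk addresses needed for an 8-byte access.

    An access on an 8-byte boundary needs one transfer. An access on an
    odd 4-byte boundary straddles two aligned chunks and needs two.
    """
    if address % 8 == 0:
        return [address]
    if address % 4 == 0:
        base = address - 4  # start of the aligned chunk holding the lower 4 bytes
        return [base, base + 8]
    raise ValueError("address must be a multiple of 4")


print(plan_transfers(16))  # aligned access, single transfer
print(plan_transfers(20))  # misaligned access, split into two transfers
```

The gap between the two transfers of a split access is the window in which the video processor can ask for the same location, which is where the freeze described above begins.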

If you get hit with any of these problems, there are a number of things you can do. Ultimately, all these measures seek to slow down the video card and/or the AGP interface. Less speed means less heat and fewer Amps drawn from the power supply.

1) Lower your AGP bus rating. Usually, the BIOS allows for manual selection between x1, x2, x4 or x8.
2) Disable side banding. Side banding is an AGP pipeline feature that implements a sort of passing lane for AGP commands, in which special AGP commands can bypass pending requests in the regular AGP pipeline buffer. Not all video chips implement this feature as robustly as they should.
3) Disable fast writes. Again, this will make your AGP pipeline slower, thus slowing down the video processor with it.
4) If your system came with power and temperature monitoring software, start it before running the game. While the game is running, check the readouts of the monitoring software to see if the temperature rises to critical levels, or if the voltages (especially on the 3.3 and 5 volt rails) fluctuate dangerously close to the 5% tolerance or even exceed it. If the temperature rises too high, you need better cooling. If the voltages fluctuate too much, you need a better power supply, or the power regulators on your mainboard cannot cope and run too hot. Power regulators are those medium sized vertically mounted chips with a large heat spreader mounted on the back.
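
The voltage check in step 4 can be done by hand, but as a rough sketch it looks like this in Python. The function `check_rails`, the `warn_margin` threshold for "dangerously close" and the sample readings are all assumptions made for this example; real monitoring software will present its readings in its own way.

```python
NOMINAL_RAILS = {"3.3V": 3.3, "5V": 5.0}


def check_rails(readings, tolerance=0.05, warn_margin=0.01):
    """Classify each rail reading as 'ok', 'close' or 'out of spec'.

    A reading is 'close' when it is within warn_margin of the tolerance
    limit; warn_margin is an arbitrary choice for this sketch.
    """
    report = {}
    for rail, measured in readings.items():
        nominal = NOMINAL_RAILS[rail]
        deviation = abs(measured - nominal) / nominal
        if deviation > tolerance:
            report[rail] = "out of spec"
        elif deviation > tolerance - warn_margin:
            report[rail] = "close"
        else:
            report[rail] = "ok"
    return report


# Hypothetical readings taken while the game is running.
print(check_rails({"3.3V": 3.28, "5V": 4.70}))
```

A rail that shows up as "close" or "out of spec" only while the game is running points at the power supply or the mainboard's regulators rather than at the game.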

If you are the owner of a VIA chipset, you may want to consider not installing the AGP driver from the VIA 4in1 package. Instead, rely on the AGP support that comes by default in Windows itself, together with support from the video driver. There are instructions on the VIARENA website on loading the 4in1 drivers without the AGP driver, and instead activating the AGP driver in Windows XP. Both ATI and nVidia drivers will also correctly activate the AGP support for a VIA chipset if the VIA 4in1 AGP drivers are not installed.

-------------------------
3700+ Zalman CNPS9500
DFI LP nF3 250GB
2x512 OCZ DDR400 ELPR2
6800GT AC nV Silencer 5
2X80GB 1x100GB in JBOD
Antec TB 480W
Lian-Li V1100