Originally posted by: ProfessorYaffle
I'm trying to install Ubuntu Gutsy on a newly built Phenom 9600 machine: Gigabyte MA790X-DS4 motherboard, 2 sticks of 1GB PC2-6400 Corsair RAM, 300GB Maxtor HD, XFX GeForce 8400 GS (that's just a stop-gap until the 8800GTs come back into stock) plus peripherals.
I started by trying to install Kubuntu, but that hung partway through the file copying. Plain Ubuntu installed okay and worked fine for half an hour or so, but died partway through downloading and installing updates. After two or three reboots, picking up the updates where they left off each time, the updates were installed but the system still wasn't stable. It was okay to start with, but after a while some processes would crash or hang. Sometimes X would die, or the whole system would freeze and need a hard reset.
Initial error messages indicated disk I/O problems, so I blanked the entire disk and did a destructive scan for bad blocks - none found. Then I ran memtest86 all day and it found nothing wrong. I installed again from scratch and had pretty much the same problems. I tried it with and without the proprietary nvidia driver - no difference.
Eventually I happened to notice that the crashes started happening just after the system broke through the 1GB memory usage (reported by free/top.) So I took the system down and ran memtest again (20 full passes no errors.) Moreover, I could reproduce the general problem state (although not specific crashes) by running wc on all the files in /usr (which slowly filled up the filesystem cache.)
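For anyone who wants to try the same reproduction, here is a rough Python sketch of the "wc on everything in /usr" trick. It just reads every regular file under a directory so that the kernel pulls the data through the filesystem cache. The function name read_tree and the 1MB chunk size are my own choices for illustration, and it only approximates what wc does (it counts bytes, nothing else).

```python
import os

def read_tree(root, chunk_size=1 << 20):
    """Read every regular file under root; return (files_read, bytes_read).

    Hypothetical helper: the point is only to touch file data so the
    kernel caches it, much like running wc over the same tree.
    """
    files_read = 0
    bytes_read = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks, devices, sockets and other non-regular files.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as fh:
                    while True:
                        chunk = fh.read(chunk_size)
                        if not chunk:
                            break
                        bytes_read += len(chunk)
            except OSError:
                # Unreadable file (permissions etc.): ignore and move on.
                continue
            files_read += 1
    return files_read, bytes_read
```

Calling read_tree("/usr") while watching free or top in another terminal should show the cached figure climbing in the same way.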
I took out 1 stick of memory and everything worked okay (can't go over 1GB with only 1GB installed!) Swapped it for just the other stick - still okay. Borrowed 4 sticks of 512MB, and as soon as the usage got over 1GB things started dying again. I've tried running the memory in ganged & unganged mode - no difference.
I flashed the motherboard with the latest BIOS and tried it with and without the TLB bug workaround - still no effect.
In desperation I installed openSUSE 10.3 on a separate partition, and got exactly the same symptoms.
Finally I noticed kernel "oops" messages in /var/log (Doh!). Anybody else had this problem or know if there are any kernel options that can be used to avoid it?
An example kernel error from the log:
Feb 12 22:47:08 lilith kernel: Unable to handle kernel paging request at ffff81003ca03000 RIP:
Feb 12 22:47:08 lilith kernel: [<ffffffff802f8973>] number+0x1ad/0x1de
Feb 12 22:47:08 lilith kernel: PGD 8063 PUD 9063 PMD c23c00003ca001e3 BAD
Feb 12 22:47:08 lilith kernel: Oops: 000b  SMP
Feb 12 22:47:08 lilith kernel: last sysfs file: /devices/pci0000:00/0000:00:0a.0/0000:02:00.0/irq
Feb 12 22:47:08 lilith kernel: CPU 1
Feb 12 22:47:08 lilith kernel: Modules linked in: nls_utf8 iptable_filter ip_tables ip6table_filter ip6_tables x_tables ipv6 cpufreq_conservative cpufreq_userspace cpufreq_powersave powernow_k8 snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device apparmor loop dm_mod snd_hda_intel snd_pcm snd_timer usb_storage snd rtc_cmos parport_pc soundcore nvidia(P) rtc_core parport ide_core ohci1394 snd_page_alloc rtc_lib ieee1394 sr_mod cdrom r8169 i2c_piix4 button i2c_core sg usbhid hid ff_memless ehci_hcd ohci_hcd usbcore sd_mod edd ext3 mbcache jbd fan pata_atiixp ahci libata scsi_mod thermal processor
Feb 12 22:47:08 lilith kernel: Pid: 3712, comm: gkrellm Tainted: P N 188.8.131.52-0.1-default #1
... plus tracebacks etc. I can post more of these if it helps, but there are hundreds of lines of them. Faults seem to vary between CPUs 2 and 3, occasionally 1, but I don't think I've seen a fault on CPU0 (though the sample may not yet be large enough to read too much into this.)
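To put actual numbers on the per-CPU spread rather than eyeballing it, something like the following Python sketch could tally the "CPU n" lines. It assumes every oops in the log has a line in the same format as the excerpt above ("... kernel: CPU 1"); the name tally_fault_cpus and the regular expression are my own, not part of any standard tool.

```python
import re
from collections import Counter

# Assumed format, taken from the excerpt: "Feb 12 22:47:08 lilith kernel: CPU 1"
CPU_LINE = re.compile(r"kernel: CPU (\d+)\s*$")

def tally_fault_cpus(lines):
    """Count how many times each CPU number appears in oops 'CPU n' lines."""
    counts = Counter()
    for line in lines:
        match = CPU_LINE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts
```

Feed it the lines of whichever file under /var/log holds the kernel messages on your system, e.g. tally_fault_cpus(open(path)), where path is a placeholder for that file.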
Also note: the CPU is a black edition one, but it's never been overclocked (although I might try underclocking it tonight.)
I'd first try to run the most recent kernel (184.108.40.206) to see if that changes things. Older kernels sometimes do not play well with newer hardware.