am I reading this right, that my cpu just died?

euphoric.cat@lemmy.blahaj.zone · 8 months ago

am I reading this right, that my cpu just died?

ShortN0te@lemmy.ml · 8 months ago

If the CPU died, the PC would not have booted up so far.

Markaos@lemmy.one · 8 months ago

Does UEFI initialize all the cores? I know the OS always starts with only one core available, but I’m not sure if UEFI just disables the cores after it’s done its thing, or if it doesn’t touch them. Because if it stays on core 0 and never even brings the other ones up, then this issue with core 2 could let it boot this far just fine.

fuckwit_mcbumcrumble@lemmy.world · 8 months ago

POST is supposed to check all CPU cores and all RAM. Its check isn’t perfect, but it does check them all.

euphoric.cat@lemmy.blahaj.zone · edit-2 8 months ago

you might be right, i left it for a couple minutes to write to amd about warranty and it booted up just fine into linux when i wanted to get the motherboard info. wtf. but now that i think about it, i do have weird cpu related blue screens and timeouts more often than what i’d like to think is normal

Fisch@discuss.tchncs.de · 8 months ago

Maybe do a memtest. Idk what the software was called exactly but you just flash it on a USB, boot from that USB and it will test if your RAM is okay.

8 months ago

Memtest86+, bloody great software.

euphoric.cat@lemmy.blahaj.zone · edit-2 8 months ago

ive already done one recently and it came out with no problems, maybe i’ll do another one

Possibly linux@lemmy.zip · edit-2 8 months ago

Try booting from a USB with a stock kernel (no Ubuntu derivatives) and then running a stress test.

I would make sure you set the logs to print to screen. If you have a dedicated graphics card, try removing it, as you might have a bad GPU.

excitingburp@lemmy.world · 8 months ago

RAM could be a cheaper culprit. Try re-seating it.

monsterpiece42@reddthat.com · 8 months ago

Partially dead CPUs can absolutely still POST and boot. I work in a PC repair shop and see it all the time. Everything will work totally “fine” and you’ll get weird errors here and there, similar to failing RAM. You have to run a dedicated CPU test like the ones in OCCT (Windows-based, don’t lynch me) or similar to see if you’re getting WHEA or other errors.

The reason for this is that a lot of CPUs have built in redundancy to get around having imperfect silicon, and sometimes that is enough to make the system still work, but not be quite “right”.

The good news is, if you are producing such errors, you usually have a 3yr warranty on most CPUs and the OEM will RMA them for you.

euphoric.cat@lemmy.blahaj.zone · 8 months ago

OCCT? Tell me more about it. Also I’ve only had my cpu since September so I can definitely RMA it if so

monsterpiece42@reddthat.com · 8 months ago

That’s the name of the program. You can search it and it’ll pop right up. It is now owned by Cooler Master.

Once you download it, you can run either the CPU Stress test or the Linpack test (this is for Intel mostly as it is their proprietary test) and it’ll run while looking for math or WHEA errors.

While you’re doing science, I would also recommend doing a RAM test with memtest86+. You download the .iso and make a bootable drive, and boot into it. Both RAM and CPU can make similar weird failures so checking both is a decent idea.
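
For a rough idea of what "looking for math errors" means, here is a minimal Python sketch. It is not how OCCT works and it is far weaker than a real stress test: it pins the process to one core at a time, repeats a deterministic computation, and reports any core whose result disagrees with a reference value. The function names and the workload are made up for illustration, and `os.sched_setaffinity` is only available on Linux.

```python
import hashlib
import os


def workload(rounds: int = 20000) -> str:
    """Deterministic hash chain: every healthy core must produce the same digest."""
    digest = b"seed"
    for _ in range(rounds):
        digest = hashlib.sha256(digest).digest()
    return digest.hex()


def find_mismatches(results: dict[int, str], reference: str) -> list[int]:
    """Return the cores whose digest differs from the reference digest."""
    return sorted(core for core, digest in results.items() if digest != reference)


def check_all_cores(rounds: int = 20000) -> list[int]:
    """Run the workload once per usable core (Linux-only) and list disagreeing cores."""
    original = os.sched_getaffinity(0)  # cores this process may currently use
    reference = workload(rounds)
    results = {}
    try:
        for core in sorted(original):
            os.sched_setaffinity(0, {core})  # pin this process to a single core
            results[core] = workload(rounds)
    finally:
        os.sched_setaffinity(0, original)  # always restore the original affinity
    return find_mismatches(results, reference)
```

Calling `check_all_cores()` returns an empty list when every core agrees. A clean result proves very little, since a marginal core may only fail under heat or heavy vector load, so treat this as an illustration of the idea rather than a diagnostic.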

connaisseur@feddit.de · 8 months ago

Do you have a Ryzen system? https://wiki.gentoo.org/wiki/Ryzen#Random_reboots_with_mce_events

euphoric.cat@lemmy.blahaj.zone · 8 months ago

wow, yeah that looks exactly like what my error message looked like. i have a ryzen 7 5800x. but the weird thing is this happened after a blue screen on windows, grub then tried to boot into linux since that’s the first option

DefederateLemmyMl@feddit.nl · 8 months ago

You may also want to check if your bios is up-to-date.

My 5900x had some spontaneous crashes and reboots when I just got it, a bios update eventually resolved it. This was around the time zen3 was just out, and there were still quite a few bugs in AMD’s AGESA library, which is included in the motherboard’s bios.

Many motherboards still ship with an ancient bios, or have just been sitting on a shelf somewhere for a very long time with an old bios. So if you have never touched your bios, check that first.

euphoric.cat@lemmy.blahaj.zone · 8 months ago

i updated mine in september or so when switching from a 3600 to 5800x, i’ll see though

DefederateLemmyMl@feddit.nl · edit-2 8 months ago

Yeah I’d say that sounds recent enough, but it’s still possible that there’s some obscure bios bug you’re hitting.

Cargon@lemmy.ml · 8 months ago

I have a Ryzen 3700x that had similar problems. In my case disabling Precision Boost Overdrive and regular Precision Boost eliminated the crashes. PB being just the regular boosting behavior of the CPU. With it turned off the CPU basically only adjusts its frequency between the idle frequency of like 800 MHz to the base clock (3.6 GHz or whatever).

I think basically what happened was the BIOS was running the CPU too hot and eventually it just couldn’t stably boost to the higher frequencies which would cause problems. It’s an easy thing to try and see if it works for you. In my case I was able to salvage the CPU by putting it into a server whose workload doesn’t benefit from moment to moment super high CPU clock speeds.
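
If you try this on Linux, one way to confirm the clocks really stay at or below base is to read the per-core frequency the kernel reports. A minimal sketch, assuming your system exposes the usual cpufreq files under `/sys/devices/system/cpu` (values there are in kHz); the helper names are made up for illustration:

```python
from pathlib import Path

CPU_SYSFS = Path("/sys/devices/system/cpu")


def khz_to_ghz(khz: int) -> float:
    """Convert the kHz value reported by cpufreq into GHz."""
    return khz / 1_000_000


def current_frequencies(base: Path = CPU_SYSFS) -> dict[str, float]:
    """Map each cpuN directory that has cpufreq data to its current frequency in GHz."""
    freqs = {}
    for freq_file in sorted(base.glob("cpu[0-9]*/cpufreq/scaling_cur_freq")):
        core = freq_file.parent.parent.name  # e.g. "cpu0"
        freqs[core] = khz_to_ghz(int(freq_file.read_text().strip()))
    return freqs


if __name__ == "__main__":
    for core, ghz in current_frequencies().items():
        print(f"{core}: {ghz:.2f} GHz")
```

Run it a few times under load; with boosting disabled, no core should report much above the base clock.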

euphoric.cat@lemmy.blahaj.zone · 8 months ago

that would suck to disable since i run some pretty high cpu intensive stuff but i’ll try to get around to it soon

Possibly linux@lemmy.zip · 8 months ago

It sounds like a hardware issue then. Try running a ram test.

https://memtest.org/

If it is a RAM issue I would recommend reinstalling everything

just_another_person@lemmy.world · 8 months ago

My first guess. Might be a core offline. Doesn’t mean the entire CPU is dead though.
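
On a booted Linux system you can check this directly: the kernel lists present and online cores in `/sys/devices/system/cpu/present` and `/sys/devices/system/cpu/online`, in a range format like `0-15`. A minimal sketch that reports cores which are present but not online (the function names are made up for illustration):

```python
from pathlib import Path


def parse_cpu_list(text: str) -> set[int]:
    """Expand a kernel CPU list such as '0-3,5' into {0, 1, 2, 3, 5}."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty file or trailing comma
        if "-" in part:
            start, end = part.split("-")
            cpus.update(range(int(start), int(end) + 1))
        else:
            cpus.add(int(part))
    return cpus


def offline_cores(base: Path = Path("/sys/devices/system/cpu")) -> set[int]:
    """Return cores the kernel knows about but has not brought online."""
    present = parse_cpu_list((base / "present").read_text())
    online = parse_cpu_list((base / "online").read_text())
    return present - online
```

An empty result from `offline_cores()` means every core the kernel detected is currently online.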

SimplyTadpole@lemmy.dbzer0.com · edit-2 8 months ago

For what it’s worth, I’ve had Linux spew similar CLI errors when booting up complaining about a critical CPU problem, when the problem actually was that it was reading data off of a dying hard-drive. (Removing said drive, as well as replacing it with a new, healthier drive, made the issue go away.)

Not saying your problem is actually a dying storage device, but that it’s possible the issue might not actually be your CPU itself.

euphoric.cat@lemmy.blahaj.zone · edit-2 8 months ago

come to think of it, yeah. not sure what though. my parts are all relatively new but the instability carried over

skooma_king@lemm.ee · 8 months ago

You could test that theory by attempting to boot from a live USB if you happen to have one.

BZzzz@jlai.lu · 8 months ago

[Hardware Error]: HumAiN hElP mEeeeee

euphoric.cat@lemmy.blahaj.zone · 8 months ago

real

SayCyberOnceMore@feddit.uk · 8 months ago

Best way: strip the whole thing down to 1 stick of RAM and do a memtest and then work back up.

Don’t rule out a dodgy PSU with a floating power rail, so the first few RAM tests are also testing if the PSU is dying.

simonmicro@programming.dev · edit-2 8 months ago

Most common issue would be something with your system memory. I could imagine that this caused the timeout of your cpu, which waited for the startup code, which never arrived.

In case you want to test that, swap your memory sticks around. Or tell the kernel to ignore that cpu (see command line arguments of the kernel).
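
As a runtime alternative to a boot argument, Linux also lets root take a single core offline by writing `0` to that core's `online` file in sysfs. A minimal sketch, assuming the suspect core is cpu2 and that your kernel allows hot-unplugging it; the helper name is made up for illustration:

```python
from pathlib import Path


def set_core_online(core: int, online: bool,
                    base: Path = Path("/sys/devices/system/cpu")) -> Path:
    """Write '1' or '0' to cpuN/online (needs root on a real system)."""
    control = base / f"cpu{core}" / "online"
    control.write_text("1" if online else "0")
    return control
```

For example, `set_core_online(2, False)` run as root would take cpu2 offline until the next reboot or until you write `1` back. This only helps once the system has booted, so it will not get you past a hang during early boot.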

Jarvis2323@programming.dev · 8 months ago

Did you build this box? Could be cpu thermal paste not properly applied.

twei@discuss.tchncs.de · 8 months ago

naw, if that was the case the cpu would throttle to the minimum clock and then shut off after a while (or just keep running at like 100°C)

euphoric.cat@lemmy.blahaj.zone · 8 months ago

i did, i made sure to put the right amount and temps were on par with what others reported

db2@lemmy.world · 8 months ago

GG Tim

mvirts@lemmy.world · 8 months ago

is it always cpu 2? Can you disable that core in your bios?

lea@feddit.de · 8 months ago

I’ve had this error upon random reboots after upgrading to Linux 6.8 on 5950x. Went back to 6.7.9 and hasn’t happened again since. What version are you on? Would be interesting to know.
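
If you are not sure which kernel you are running, here is a minimal Python sketch that reads the release string and compares it against 6.8. The release format in the comments (for example `6.7.9-arch1-1`) is only an illustrative assumption about how a distro might name its kernels:

```python
import platform
import re


def parse_kernel_release(release: str) -> tuple[int, ...]:
    """Extract the leading numeric part, e.g. '6.7.9-arch1-1' -> (6, 7, 9)."""
    match = re.match(r"(\d+(?:\.\d+)*)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def is_at_least(release: str, minimum: tuple[int, ...]) -> bool:
    """True if the release is the given version or newer."""
    return parse_kernel_release(release) >= minimum


if __name__ == "__main__":
    running = platform.release()  # same string that `uname -r` prints
    print(running, "is 6.8 or newer:", is_at_least(running, (6, 8)))
```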

euphoric.cat@lemmy.blahaj.zone · 8 months ago

idk, im on arch and i update every day. but this isnt a result of linux, read my other comments

crispy_kilt@feddit.de · 8 months ago

sanpo@sopuli.xyz · 8 months ago

Maybe. Or maybe it’s something else and it just looks like CPU error.

Does this always fail the same way after reboot?
If you can still boot, maybe you can try running memtest and see what happens.

See the line starting with “IPID”? Try googling for these codes and see if any results sound familiar to your situation.
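
If the error scrolled off the screen, you may be able to pull those lines back out of the kernel log. A minimal sketch, assuming a systemd distro with a persistent journal (after a hard hang the lines may never have been written to disk). The `[Hardware Error]` marker is the prefix seen in this kind of message; the function names are made up for illustration:

```python
import subprocess

MARKER = "[Hardware Error]"


def hardware_error_lines(log_text: str) -> list[str]:
    """Keep only the kernel log lines that carry the hardware error marker."""
    return [line.strip() for line in log_text.splitlines() if MARKER in line]


def ipid_lines(log_text: str) -> list[str]:
    """Narrow the hardware error lines down to those mentioning IPID."""
    return [line for line in hardware_error_lines(log_text) if "IPID" in line]


def read_kernel_log(boot: int = 0) -> str:
    """Read kernel messages via journalctl; boot=0 is this boot, -1 the previous one."""
    result = subprocess.run(
        ["journalctl", "-k", f"--boot={boot}", "--no-pager"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Something like `ipid_lines(read_kernel_log(-1))` would then give you the exact text to search for from the boot that crashed.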

Otherwise your only option is to try another CPU and see if the error goes away.

lurch (he/him)@sh.itjust.works · 8 months ago

Could be. Are you able to make a kernel without watchdogs? Maybe the watchdog malfunctions.

euphoric.cat@lemmy.blahaj.zone · 8 months ago

I have no idea what that means

lurch (he/him)@sh.itjust.works · edit-2 8 months ago

Your error message has “Watchdog Timeout error” in it.

There is a menu config in the linux kernel source code package. It lets you check and uncheck things you want in your kernel. Some can be on, off and “M” for a module you can add or remove while it’s running.

Watchdog refers to a periodic check that the system is still running as expected; if it isn’t, the system auto-reboots or shuts down.

The config has multiple options about watchdogs and hangcheck in multiple places. You could install your distro’s kernel source package, start with the config from your current kernel and uncheck everything related, then compile a custom kernel that doesn’t have this watchdog and will therefore run further. If it’s a CPU error, it will then just die later. If it’s a bug that just makes the watchdog think the system doesn’t work, it will then run fine.
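
To see which of those options your current kernel was built with before you start unchecking things, here is a minimal Python sketch. It assumes the kernel exposes its config at `/proc/config.gz` (not every distro enables that), and the keyword list is just my guess at which option names are relevant:

```python
import gzip
from pathlib import Path

# Substrings of option names that look watchdog or hang-check related (an assumption).
KEYWORDS = ("WATCHDOG", "LOCKUP_DETECTOR", "HANGCHECK", "HUNG_TASK")


def watchdog_options(config_text: str) -> dict[str, str]:
    """Return the set CONFIG_* options whose name contains one of KEYWORDS."""
    options = {}
    for line in config_text.splitlines():
        line = line.strip()
        if not line.startswith("CONFIG_") or "=" not in line:
            continue  # skips comments such as '# CONFIG_FOO is not set'
        name, value = line.split("=", 1)
        if any(keyword in name for keyword in KEYWORDS):
            options[name] = value
    return options


def read_running_config(path: Path = Path("/proc/config.gz")) -> str:
    """Read the running kernel's config, if the kernel exposes it."""
    with gzip.open(path, "rt") as handle:
        return handle.read()
```

`watchdog_options(read_running_config())` gives a dict of option name to `y` or `m`, which is a reasonable checklist of what to look for in the menu config.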

mvirts@lemmy.world · 8 months ago

na the watchdog just got too hungry (could still be bad???). I think you’d get an mce if your cpu is failing but bootable.