• Markaos@lemmy.one
      8 months ago

      Does UEFI initialize all the cores? I know the OS always starts with only one core available, but I’m not sure if UEFI just disables the cores after it’s done its thing, or if it doesn’t touch them. Because if it stays on core 0 and never even brings the other ones up, then this issue with core 2 could let it boot this far just fine.

    • euphoric.cat@lemmy.blahaj.zoneOP
      8 months ago

      you might be right, i left it for a couple minutes to write to amd about warranty and it booted up just fine into linux when i wanted to get the motherboard info. wtf. but now that i think about it, i do have weird cpu related blue screens and timeouts more often than what i’d like to think is normal

    • monsterpiece42@reddthat.com
      8 months ago

      Partially dead CPUs can absolutely still POST and boot. I work in a PC repair shop and see it all the time. Everything will work totally “fine” and you’ll get weird errors here and there similarly to failing RAM. You have to run a dedicated CPU test like the ones in OCCT (Windows-based, don’t lynch me) or similar to see if you’re getting WHEA or other errors.

      The reason for this is that a lot of CPUs have built in redundancy to get around having imperfect silicon, and sometimes that is enough to make the system still work, but not be quite “right”.

      The good news is, if you are producing such errors, you usually have a 3yr warranty on most CPUs and the OEM will RMA them for you.

        • monsterpiece42@reddthat.com
          8 months ago

          That’s the name of the program. You can search it and it’ll pop right up. It is now owned by Cooler Master.

          Once you download it, you can run either the CPU Stress test or the Linpack test (this is for Intel mostly as it is their proprietary test) and it’ll run while looking for math or WHEA errors.

          While you’re doing science, I would also recommend doing a RAM test with memtest86+. You download the .iso and make a bootable drive, and boot into it. Both RAM and CPU can make similar weird failures so checking both is a decent idea.

    • euphoric.cat@lemmy.blahaj.zoneOP
      8 months ago

      wow, yeah that looks exactly like what my error message looked like. i have a ryzen 7 5800x. but the weird thing is this happened after a blue screen on windows, grub then tried to boot into linux since that’s the first option

      • DefederateLemmyMl@feddit.nl
        8 months ago

        You may also want to check if your bios is up-to-date.

        My 5900x had some spontaneous crashes and reboots when I just got it, a bios update eventually resolved it. This was around the time zen3 was just out, and there were still quite a few bugs in AMD’s AGESA library, which is included in the motherboard’s bios.

        Many motherboards still ship with an ancient bios, or have just been sitting on a shelf somewhere for a very long time with an old bios. So if you have never touched your bios, check that first.

      • Cargon@lemmy.ml
        8 months ago

        I have a Ryzen 3700x that had similar problems. In my case disabling Precision Boost Overdrive and regular Precision Boost eliminated the crashes. PB being just the regular boosting behavior of the CPU. With it turned off the CPU basically only adjusts its frequency between the idle frequency of like 800 MHz to the base clock (3.6 GHz or whatever).

        I think basically what happened was the BIOS was running the CPU too hot and eventually it just couldn’t stably boost to the higher frequencies which would cause problems. It’s an easy thing to try and see if it works for you. In my case I was able to salvage the CPU by putting it into a server whose workload doesn’t benefit from moment to moment super high CPU clock speeds.

  • SimplyTadpole@lemmy.dbzer0.com
    8 months ago

    For what it’s worth, I’ve had Linux spew similar CLI errors when booting up complaining about a critical CPU problem, when the problem actually was that it was reading data off of a dying hard-drive. (Removing said drive, as well as replacing it with a new, healthier drive, made the issue go away.)

    Not saying your problem is actually a dying storage device, but that it’s possible the issue might not actually be your CPU itself.

  • SayCyberOnceMore@feddit.uk
    8 months ago

    Best way: strip the whole thing down to 1 stick of RAM and do a memtest and then work back up.

    Don’t rule out a dodgy PSU with a floating power rail, so the first few RAM tests are also testing if the PSU is dying.

  • simonmicro@programming.dev
    8 months ago

    Most common issue would be something with your system memory. I could imagine that this caused the timeout of your cpu, which waited for the startup code, which never arrived.

    In case you want to test that, swap your memory sticks around. Or tell the kernel to ignore that cpu (see command line arguments of the kernel).
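
    To see which CPUs the kernel actually brought up, you can read the CPU lists under /sys/devices/system/cpu. This is only a rough sketch: the function names are made up for illustration, and it assumes the usual list format such as "0-1,3-15". For the boot-time side, maxcpus=N is one kernel parameter that limits how many CPUs are brought up; check the kernel parameter docs for your version before relying on it.

```python
from pathlib import Path


def parse_cpu_list(text: str) -> list[int]:
    """Expand a kernel CPU list such as "0-1,3-15" into individual CPU numbers."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # an empty file (e.g. no offline CPUs) yields no entries
        if "-" in part:
            start, end = part.split("-")
            cpus.extend(range(int(start), int(end) + 1))
        else:
            cpus.append(int(part))
    return cpus


def read_cpu_list(name: str) -> list[int]:
    """Read one of the sysfs CPU lists; name is e.g. "online" or "offline"."""
    return parse_cpu_list(Path("/sys/devices/system/cpu", name).read_text())
```

    On a Linux box, read_cpu_list("online") and read_cpu_list("offline") show at a glance whether a core is missing after boot.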

  • lea@feddit.de
    8 months ago

    I’ve had this error upon random reboots after upgrading to Linux 6.8 on a 5950X. Went back to 6.7.9 and it hasn’t happened again since. What version are you on? Would be interesting to know.
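
    If you want to compare kernel versions in a script rather than by eye, something like this works as a minimal sketch. The helper names are hypothetical, and it assumes the release string starts with major.minor (distro suffixes like "-arch1-1" are ignored).

```python
import re


def parse_kernel_release(release: str) -> tuple[int, ...]:
    """Turn a release string like "6.7.9-arch1-1" into (6, 7, 9)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    # A missing patch level (e.g. "6.8") is treated as 0.
    return tuple(int(g) for g in match.groups(default="0"))


def is_at_least(release: str, major: int, minor: int) -> bool:
    """True if the release is the given major.minor series or newer."""
    return parse_kernel_release(release)[:2] >= (major, minor)
```

    Passing platform.release() from the standard library as the release string checks the running kernel, e.g. is_at_least(platform.release(), 6, 8).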

  • sanpo@sopuli.xyz
    8 months ago

    Maybe. Or maybe it’s something else and it just looks like CPU error.

    Does this always fail the same way after reboot?
    If you can still boot, maybe you can try running memtest and see what happens.

    See the line starting with “IPID”? Try googling for these codes and see if any results sound similar to your situation.

    Otherwise your only option is to try another CPU and see if the error goes away.
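
    If the error scrolls past too fast to copy by hand, a small helper can pull the IPID codes out of saved log text so you can paste them into a search. This is a sketch under assumptions: the function name is made up, and the regex guesses that the code appears as "IPID" followed by an optional colon and a hex value, which may not match every kernel's output format.

```python
import re

# Accepts both "IPID: 0x..." and "IPID ..." (the "0x" prefix is optional).
_IPID = re.compile(r"\bIPID:?\s*(?:0x)?([0-9a-fA-F]+)")


def extract_ipid(log_text: str) -> list[str]:
    """Return every IPID value found in the text, normalised to 0x-prefixed lowercase."""
    return ["0x" + value.lower() for value in _IPID.findall(log_text)]
```

    Feed it the text of a saved kernel log (or a transcription of the screen photo) and search for the values it returns.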

      • lurch (he/him)@sh.itjust.works
        8 months ago

        Your error message has “Watchdog Timeout error” in it.

        There is a menu-driven config (make menuconfig) in the Linux kernel source package. It lets you check and uncheck things you want in your kernel. Some can be on, off and “M” for a module you can add or remove while it’s running.

        Watchdog refers to a periodic test that checks whether the system is still running as expected, and auto-reboots or shuts it down if not.

        The config has multiple options about watchdogs and hangcheck in multiple places. You could install your distro’s kernel source package, start with the config from your current kernel and uncheck everything related, then compile a custom kernel that doesn’t have this watchdog and will therefore run further. If it’s a CPU error, it will then just die later. If it’s a bug that just makes the watchdog think the system doesn’t work, it will then run fine.
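
        Before rebuilding anything, it can help to list which watchdog-related options your current kernel config has. A rough sketch, assuming the standard .config syntax (CONFIG_X=y, or "# CONFIG_X is not set"); the function names and the keyword list are my own picks, not an exhaustive list of what the kernel offers.

```python
import gzip
from pathlib import Path

# Keywords chosen for illustration; extend as needed.
KEYWORDS = ("WATCHDOG", "HANGCHECK", "LOCKUP_DETECTOR")


def watchdog_options(config_text: str) -> dict[str, str]:
    """Return CONFIG_* options whose name contains one of KEYWORDS."""
    options = {}
    suffix = " is not set"
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            name, value = line[2:-len(suffix)], "n"  # disabled option
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
        else:
            continue  # blank line or ordinary comment
        if any(keyword in name for keyword in KEYWORDS):
            options[name] = value
    return options


def load_running_config() -> str:
    """Read the running kernel's config.

    /proc/config.gz only exists if the kernel was built with CONFIG_IKCONFIG_PROC.
    """
    return gzip.decompress(Path("/proc/config.gz").read_bytes()).decode()
```

        If /proc/config.gz is missing, many distros also ship the config as a plain file under /boot; read that and pass its text to watchdog_options() instead.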

  • mvirts@lemmy.world
    8 months ago

    na the watchdog just got too hungry (could still be bad???). I think you’d get an mce if your cpu is failing but bootable.
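
    To check whether the kernel logged any machine check events, you can filter the kernel log for the usual markers. A minimal sketch: the function name is made up, and the marker strings are an assumption about how such lines are typically tagged, so an empty result does not prove the CPU is healthy.

```python
def hardware_error_lines(log_text: str) -> list[str]:
    """Return log lines that look like machine check / hardware error reports."""
    markers = ("mce:", "[hardware error]", "machine check")
    return [
        line
        for line in log_text.splitlines()
        if any(marker in line.lower() for marker in markers)
    ]
```

    Save the output of dmesg or journalctl -k to a file and pass its contents in; any lines it returns are worth reading in full.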