• 0 Posts
  • 38 Comments
Joined 1 year ago
cake
Cake day: June 15th, 2023

help-circle









  • I think having an A partition and a B partition (I’m assuming that’s how SteamOS works) wouldn’t help in this case. If the A partition downloaded the definition file, crashed and failed to reboot; the bootloader could failover to the B partition - which would then download the definition file, crash and fail to reboot. It would have to keep rolling back to a last known good snapshot until the update got withdrawn.

    You could have an ephemeral set up that wipes /var and /etc and recreates them every boot. I don’t think these EDR tools would like that very much though.



  • It’s a proprietary config file. I think it’s a list of rules to forbid certain behaviours on the system. Presumably it’s downloaded by some userland service, but it has to be parsed by the kernel driver. I think the files get loaded ok but the driver crashes when iterating over an array of pointers. Possibly these are the rules and some have uninitialised pointers but this is speculation based on some kernel dumps on twitter. So the bug probably existed in the kernel driver for quite a while, but they pushed a (somehow) malformed config file that triggered the crash.


  • For this Channel File, yes. I don’t know what the failure rate is - this article mentions 40-70%, but there could well be a lot of variance between different companies’ machines.

    The driver has presumably had this bug for some time, but they’ve never had a channel file trigger it before. I can’t find any good information on how they deploy these channel files other than that they push several changes per day. One would hope these are always run by a diverse set of test machines to validate there’s no impact to functionality but only they know the procedure there. It might vary based on how urgent a mitigation is or how invasive it’ll be - though they could just be winging it. It’d be interesting to find out exactly how this all went down.




  • It’s not that clear cut a problem. There seems to be two elements; the kernel driver had a memory safety bug; and a definitions file was deployed incorrectly, triggering the bug. The kernel driver definitely deserves a lot of scrutiny and static analysis should have told them this bug existed. The live updates are a bit different since this is a real-time response system. If malware starts actively exploiting a software vulnerability, they can’t wait for distribution maintainers to package their mitigation - they have to be deployed ASAP. They certainly should roll-out definitions progressively and monitor for anything anomalous but it has to be quick or the malware could beat them to it.

    This is more a code safety issue than CI/CD strategy. The bug was in the driver all along, but it had never been triggered before so it passed the tests and got rolled out to everyone. Critical code like this ought to be written in memory safe languages like Rust.




  • Is there any reason to keep the existing set-up? If it’s just one drive, you could replace it with another and install Alma or something fresh. Then you could copy over whatever config the old system had to get up and running again. You could swap to the old drive if you needed to revert. If you have a spare machine, you could stand up the fresh setup side-by-side with the old one before swapping over.