
I have a Ryzen 5 1600 with a pair of Crucial ECC UDIMMs. It works perfectly on the ASUS Prime B350M-A/CSM uATX board. Stress tested. Errors are logged and halt behavior is confirmed (Ubuntu 17.04).



Interesting. Could you explain how "halt behavior" indicates that ECC is fully functioning? I've read a lot about ECC being compatible, but nothing about tests explicitly showing that it works.


On my system, if an uncorrectable error occurs, the system goes into a 'machine check' state (a halt). In theory, the BIOS supports 'chip kill', meaning it reboots with a bit set so that the specific failing chip on the DIMM is no longer used. I've never seen that happen on my desktop, but in the data center, when it did happen, a system that should have had 128GB of memory would reboot with the BIOS reporting less than the full amount expected from the installed DIMMs.


Interesting. Depending on the amount of RAM disabled, that seems like overkill for what could just be a cosmic-ray bit flip.

Though perhaps the rarity of cosmic-ray flips makes that acceptable.


Chipkill is IBM's name for the feature that corrects multi-bit errors in DRAM. Are you sure about the reboot part, or are there two similarly named ECC technologies that do different things?


Intel's discussion of Chipkill: (https://www.intel.com/content/dam/www/public/us/en/documents...). The servers in question were Supermicro "Jupiter" chassis with Westmere processors; Supermicro discusses their implementation here: (https://www.supermicro.com/support/faqs/faq.cfm?faq=2642)

When systems rebooted with less memory than the system configuration database said they should have, most of the time there would be a multi-bit error detection, a machine check, and a 'memory update' in the IPMI buffer.


Why kill the DIMM? Cosmic ray bitflips don't harm the hardware, AFAIK.


It's usually not cosmic ray.


The system must be configured to halt on detection of uncorrectable multi-bit errors; the halt then shows that detection of uncorrectable errors works.
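The split between "corrected" and "uncorrectable" errors in this thread follows from how ECC memory encodes data: a SECDED (single-error-correct, double-error-detect) code can repair one flipped bit but can only flag two. The sketch below is a toy extended Hamming(8,4) code illustrating that principle. It is an assumption-laden illustration, not what any particular memory controller implements: real ECC DIMMs use a wider code (commonly 8 check bits over 64 data bits), and Chipkill uses a different, symbol-based scheme. The names `encode` and `decode` are hypothetical.

```python
from __future__ import annotations

# Toy extended Hamming(8,4) SECDED code (illustration only, not a real DIMM's code).
# Index 0 holds overall parity; indices 1..7 use the classic Hamming(7,4) layout:
# parity bits at 1, 2, 4 and data bits at 3, 5, 6, 7.

def encode(nibble: int) -> list[int]:
    """Encode 4 data bits into an 8-bit codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions with bit 0 set
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions with bit 1 set
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions with bit 2 set
    code[0] = sum(code[1:]) % 2            # overall parity over positions 1..7
    return code

def decode(code: list[int]) -> tuple[str, int | None]:
    """Return (status, data); status is 'ok', 'corrected' or 'uncorrectable'."""
    c = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos            # XOR of set positions is 0 for a valid word
    overall = sum(c) % 2               # 1 means an odd number of bits flipped
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Assume a single flip at `syndrome` (0 means the overall parity bit itself).
        c[syndrome] ^= 1
        status = "corrected"
    else:
        # Parity still even but syndrome nonzero: two bits flipped, cannot fix.
        return "uncorrectable", None
    data = c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3)
    return status, data

if __name__ == "__main__":
    word = encode(0b1011)
    one_flip = list(word); one_flip[5] ^= 1
    two_flips = list(word); two_flips[2] ^= 1; two_flips[6] ^= 1
    print(decode(word))       # ('ok', 11)
    print(decode(one_flip))   # ('corrected', 11)
    print(decode(two_flips))  # ('uncorrectable', None)
```

In the "uncorrectable" case a real system has no safe way to continue with the corrupted word, which is why the configured response is a machine check and halt rather than silently carrying on.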


The specs for the board indicate no ECC support. How did you test it?

BTW, I'm hoping that ECC is there.


I'm on mobile at the moment, but the hardware blog Hardware Canucks has a fantastic post covering how "true" ECC support and functionality are on Ryzen. They investigate this on both Windows and Linux.


Thanks, just read it. I get it now: the chipset supports it and Linux turns it on. It looks like the halt functionality is not there.
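One way to check the "Linux turns it on" part on your own machine is the kernel's EDAC sysfs interface: when an EDAC driver (typically `amd64_edac` on AMD systems) registers a memory controller, it appears under `/sys/devices/system/edac/mc/` with corrected (`ce_count`) and uncorrected (`ue_count`) error counters. The sketch below, with a hypothetical `edac_summary` helper, just reads those files. It assumes a Linux system with sysfs mounted; zero counts only show that reporting is active, not that correction has been exercised, and it says nothing about whether an uncorrectable error will halt the machine.

```python
from __future__ import annotations

from pathlib import Path

# Standard location of Linux EDAC memory-controller entries (assumes sysfs is mounted).
EDAC_ROOT = Path("/sys/devices/system/edac/mc")

def read_attr(mc: Path, name: str) -> str | None:
    """Read one sysfs attribute, returning None if it is missing or unreadable."""
    try:
        return (mc / name).read_text().strip()
    except OSError:
        return None

def edac_summary(root: Path = EDAC_ROOT) -> list[dict]:
    """Return one dict per memory controller registered with EDAC (hypothetical helper)."""
    controllers = []
    if not root.is_dir():
        return controllers  # no EDAC driver loaded, or not a Linux system
    for mc in sorted(root.glob("mc[0-9]*")):
        controllers.append({
            "controller": mc.name,
            "driver": read_attr(mc, "mc_name"),
            "corrected": read_attr(mc, "ce_count"),
            "uncorrected": read_attr(mc, "ue_count"),
        })
    return controllers

if __name__ == "__main__":
    mcs = edac_summary()
    if not mcs:
        print("No EDAC memory controllers found; ECC reporting is likely not active.")
    for mc in mcs:
        print(f"{mc['controller']}: driver={mc['driver']} "
              f"CE={mc['corrected']} UE={mc['uncorrected']}")
```

If no controller shows up, the kernel is not reporting ECC events, regardless of what the board or DIMMs claim.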


Interesting. Is there a list of boards and supported modules that work with ECC?



