Non-ECC Memory Corrupted My Hard Drive Image - This Is Why ECC Memory Is So Important

  • Published on 17 Sep 2024

Comments • 20

  • @andrewphi4958 1 year ago +16

    I've been telling people for decades: if it's > 4GB, it must have ECC!

  • @vlad7951 1 year ago +8

    The amount of time-consuming work that went into this video is incredible.
    This reminds me of the time I worked in a computer repair shop when I was ~16. Failing RAM was on our plate at least 4 times a week. One of our tests was to swap dual-kit RAM sticks between customers' PCs and run the test again.
    For some odd reason it usually worked.
    I still don't get it to this day, because once a PC came into the hardware section of the shop we cleaned it first and changed the thermal paste. The cleaning was thorough. We pulled out the RAM, video card, etc. and blew them out with air. We cleaned the slots with a toothbrush, etc. Only after these steps would we do any testing. If we had failing RAM, we would move it to a different slot.

  • @thirtysixnanoseconds1086 1 year ago +11

    Always enjoy your videos, Robert! The depth you go into and the way you tackle these problems: I always learn something and find it very interesting.

  • @noanyobiseniss7462 1 year ago +13

    On old boards with memory errors, I have found the cause is normally the caps on the mobo.

  • @buckc3471 1 year ago +1

    This was incredibly thorough and informative! Look forward to your future videos.
    I came upon this video while having trouble with a new system I was building. It worked initially with no errors, then had random errors and reboots. The machine was supposed to be a new hypervisor, so stability was needed. Looking at the logs, I found numerous errors relating to memory, L3 cache, USB devices, and “Hardware”, but only after putting it under load. Funny thing is, the build uses ECC memory.
    I ran memtest86 for over 100 hours with no errors. A friend suggested that, based on the L3 cache errors, the CPU could be misaligned. I disassembled, cleaned, ensured the CPU was mounted correctly, reassembled, re-installed the OS, and still had errors. Another memory test reported no errors. I found online that ECC memory can correct errors before memtest can recognize them. So I disabled ECC on the motherboard, ran another memtest, and had an error in the first hour. Testing the sticks one by one revealed errors on the first stick within an hour. I think that stick reported ~30 errors in the first 2-3 tests. So far 2 other sticks have been clean. The 4th stick has been going for several hours with no issues. After replacing the faulty stick(s), I plan to put it all back in and test again before turning ECC back on and resuming the build.
    Sometimes I really wonder how nice it would be not to do this to myself. Just have a Mac laptop and do some light internet browsing with an ISP provided router. You know, the simple life.

    • @RobertElderSoftware 1 year ago +3

      Thanks, glad you found this insightful.
      It's kind of annoying that memtest doesn't easily show errors that are corrected by ECC. I would think that should be one of the primary things a *memory test* would show you (even if the errors were corrected). I think there are also multiple versions of memtest, and I don't know the differences all that well. Considering how important memory errors are, it's surprising how difficult it is to get visibility into them. On Linux, you can check some obscure files under '/sys/devices':
      serverfault.com/questions/643542/how-do-i-get-notified-of-ecc-errors-in-linux
      and there are a couple of niche tools for monitoring, but I would think that the number of ECC corrections ought to be an important metric that is readily available somewhere. Perhaps it's a limitation of how this information gets communicated (or not) from the RAM stick to the operating system?
      "Sometimes I really wonder how nice it would be not to do this to myself. Just have a Mac laptop and do some light internet browsing with an ISP provided router. You know, the simple life."
      You said it. So much time wasted debugging things that shouldn't need to be debugged in the first place. I can live with knowing that a piece of hardware is defective and that I need to replace it, but having to suspect that *maybe* some of my hardware *might* be defective and causing errors with no certainty is like being trapped in a fever dream.
      Anyway, if you liked this video, you might like this other one that I did:
      th-cam.com/video/SexoI7kt4jk/w-d-xo.html
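
As a rough illustration of the '/sys/devices' approach mentioned in the reply above, here is a minimal Python sketch that reads the corrected and uncorrected error counters exposed by the Linux EDAC subsystem. It assumes an EDAC driver is loaded for the memory controller and that the counters appear as `ce_count` and `ue_count` under `/sys/devices/system/edac/mc/mc*/`. The function name `read_edac_counts` is made up for this example.

```python
from pathlib import Path

# Assumed location of the EDAC memory-controller entries on Linux.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: {"ce_count": int|None, "ue_count": int|None}}.

    ce_count = corrected errors, ue_count = uncorrected errors.
    An empty dict means no EDAC controllers were found under `root`.
    """
    counts = {}
    root = Path(root)
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                # Counter missing or unreadable on this system.
                entry[name] = None
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    found = read_edac_counts()
    if not found:
        print("No EDAC memory controllers found.")
    for mc, entry in found.items():
        print(f"{mc}: corrected={entry['ce_count']} uncorrected={entry['ue_count']}")
```

On a machine without ECC memory or without an EDAC driver, the directory is typically absent and the sketch simply reports that no controllers were found. A corrected-error count that keeps growing would be the kind of early warning the reply is asking for.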

  • @BrianGarside 1 year ago +4

    Finally a truly informative video on this topic. Thanks!

  • @ДмитрийВалетин-ъ4ы 1 year ago +5

    The reason dd made an exact copy of your HD while ddrescue failed is just bad luck. ddrescue requested a fairly large heap memory block, and the faulty page was inside that block. With dd you requested just a 16k heap memory block, where the probability of getting the faulty page is much lower than with ddrescue. You can easily check this by requesting a block size as large as the entire free memory of your PC. This will raise the odds of the faulty memory page being allocated to the heap, and dd will produce a corrupted image as well.
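
The probability argument in the comment above can be put into rough numbers. The sketch below is a simplified model, not a description of how either tool actually allocates memory: it assumes the buffer's pages are spread uniformly at random over physical memory and ignores the kernel page cache. The function name `hit_probability` and the RAM, page, and buffer sizes in the usage section are hypothetical values chosen for illustration.

```python
def hit_probability(total_pages, buffer_pages, faulty_pages=1):
    """Chance that a buffer of `buffer_pages` pages, placed uniformly at
    random among `total_pages` physical pages, includes at least one of
    `faulty_pages` bad pages (hypergeometric model)."""
    if not 0 <= buffer_pages <= total_pages:
        raise ValueError("buffer_pages must be between 0 and total_pages")
    if not 0 <= faulty_pages <= total_pages:
        raise ValueError("faulty_pages must be between 0 and total_pages")
    miss = 1.0
    for i in range(faulty_pages):
        # Probability that the i-th bad page also falls outside the buffer.
        miss *= max(total_pages - buffer_pages - i, 0) / (total_pages - i)
    return 1.0 - miss


if __name__ == "__main__":
    PAGE = 4 * 1024                 # assumed 4 KiB page size
    total = (4 * 1024**3) // PAGE   # hypothetical 4 GiB of RAM
    small = (16 * 1024) // PAGE     # 16 KiB buffer, as in the dd case
    large = (64 * 1024**2) // PAGE  # hypothetical 64 MiB buffer
    print(f"16 KiB buffer: {hit_probability(total, small):.2e}")
    print(f"64 MiB buffer: {hit_probability(total, large):.2e}")
    print(f"all of RAM:    {hit_probability(total, total):.2e}")
```

Under this model, with a single bad page the chance of hitting it grows in proportion to the buffer size, which is the intuition behind the suggestion to retry dd with a block size close to all free memory.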

  • @saki7804 1 year ago +3

    Great video, thanks for taking the time to make this

  • @DavidSmith-me3qp 1 year ago +3

    Extraordinary video!!! Wow!

  •  1 year ago +2

    Doing a copy on a machine that shows symptoms of problems. Brilliant idea.
    The computer rebooting was your ECC, but you ignored it.

  • @humpheryflaubert8172 7 months ago +3

    26:39 lol

  • @momohLBY 1 year ago +1

    Excellent video

  • @andrewphi4958 1 year ago +6

    Detecting errors on a CPU without ECC capabilities is a pain )

    • @andrewphi4958 1 year ago +1

      Also, using Intel Core is just asking for trouble, which, in this case, is probably a good idea )

    • @CristianCiupitu 1 year ago

      @andrewphi4958 What's wrong with Intel Core?

    • @xrafter 1 year ago

      @CristianCiupitu
      Old, not those fancy CPUs.

  • @DonNadie05 2 months ago

    This video is massively underrated. Is the on-die ECC on DDR5 RAM now the same as it was, or maybe better?

    • @orka16605 21 days ago

      No, DDR5 on-die ECC only works inside the memory chips on the stick. Real ECC covers the entire pathway, from the CPU's request to receiving the data.

  • @wizard-pirate 6 months ago

    I can ship you some DDR3 RAM if you want.