Nice. May I suggest naming the column "VID" instead of "Voltage"? "Voltage" is confusing as one might think this is Vcore, which is the actual voltage that the CPU gets. VID is the proper term for the voltage requested by the CPU.
In April 2024, I bought a new system with an MSI Z790 Pro Wifi P motherboard and a Core i9-14900KF CPU. The machine runs Proxmox, then uses that to run a Linux image running TrueNAS, then a couple of other Linux images for programming and sometimes a Windows 11 image. I've applied MSI's updated BIOS image 7E06vAD3 dated 6/20/2024 to the motherboard but suspect some tool like this might be required to spot any degradation that starts accelerating over time. However, I presume watching stats from within the BIOS isn't very useful, since if you're looking at BIOS screens the machine isn't running a full OS and client loads. I would also assume running a tool like this on one of the CLIENT machine images running on the system won't help either, because they may not see all cores in the processor. FWIW, the motherboard BIOS indicated before and after the patch that it was configured with a baseline CPU voltage of 0.88 volts, so it seems to have come from the factory WITHOUT any config that was trying to overclock it or even push it close to limits.
Nice patch, got the system stable now with no more reboots. Installed i7z as well, nice tool. Is there an AMD alternative? CPU-X is pretty slow, and knowing the voltage for each core would be sweet for working out which cores are the better ones.
I still think the high transient currents are revealing a node or process limit. Sort of how we discovered plasma cutting; the goal was better/deeper penetrating welding. At least I am under the impression that this amount of current draw is incredible and new territory for this particular node/process. Who knows? maybe quantum tunneling results in real tunneling.
Have you actually done any testing and comparisons between the requested voltage and crashing? Can we expect some sort of correlation data in a future video? 👀
a little bit. it seems that when a CPU starts to degrade the instability creeps in at the lower end of the vid ranges, not the high end. so typical tests that fully load a cpu may not show any problems whereas running bursty workloads show problems.
@@Level1Linux Could increasing the voltage reduce crashing then? At the cost of accelerated degradation, I suppose. I was more so wondering if it'd be possible to diagnose a CPU early based on its tendency to request higher or lower voltage for a specific frequency.
@@Level1Linux Could that be one of the reasons this was not detected by QA at Intel? They never designed tests for this specific failure mode and therefore they couldn’t detect it?
I love my 13900 - RTX 4070 mobile rig. Smooth HIGH FPS gaming, lightning-fast workloads, and pretty amazing battery life (10+ or so hours) given the system. But it runs HOT when worked and I worry all the time about the longevity of this system. In comparison, for times when I need something smaller and lighter I have a Ryzen 7730 system. The same gaming, albeit at a solid and smooth 60 FPS, and a workload that might take slightly longer, has the smaller system running cool as a cucumber and without the jet engine sounds. I'm not saying I will ditch Intel next go around, but it's looking likely if nothing changes.
Wendell, you might not be exactly correct. More voltage does not mean a bad CPU. Der 8auer made a video and tested multiple CPUs. There are CPUs with higher voltage but less power dissipation. But I am not sure whether those CPUs nevertheless suffer more electromigration.
Can you do a video on determining if your CPU needs to be RMA'd? It's often difficult to figure out if it's just bad code, or the CPU in this case. I host a few dedicated servers on Debian w/ Docker. I do see crashes from time to time but that seems normal. But now, with a 14700, I will question every crash :| Only recently has the new microcode been available.
can we collect vid data and profile systems? i would warn you against comparing data between different motherboards and motherboard bioses using the intel register without doing something to validate that these are valid comparisons. the null hypothesis is that they are not. i am willing to put in some work on this if i can be of help
Sadly, software tools have a near-zero chance of catching those voltage spikes. You'd need hardware interrupts on the actual register with a direct readout to catch those. The VID changes rather quickly and the spikes are really just short intervals. If memory serves me right it was something like 200 µs, and they are not "common", so catching them is really not easy. And then the resulting voltage spikes have a nasty relation with interrupts and low-thread-count workloads like many loading screens and menus, and there sadly is no way of monitoring that in software at all (oscilloscopes should be fine as we are talking timescales on the order of 1 µs).
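To put rough numbers on why polling tends to miss these spikes, here is a back-of-the-envelope sketch. It assumes a spike of about 200 µs (the figure recalled above) landing at a random moment, and software that takes instantaneous samples at a fixed interval. The model and the function names are illustrative assumptions, not measurements.

```python
def catch_probability(spike_s: float, poll_interval_s: float) -> float:
    """Chance that one randomly timed spike overlaps at least one sample.

    Simplified model (an assumption): samples are instantaneous and evenly
    spaced, so the chance is spike duration / polling interval, capped at 1.
    """
    return min(1.0, spike_s / poll_interval_s)


def catch_any(spike_s: float, poll_interval_s: float, spikes: int) -> float:
    """Chance of catching at least one of `spikes` independent spikes."""
    p = catch_probability(spike_s, poll_interval_s)
    return 1.0 - (1.0 - p) ** spikes


if __name__ == "__main__":
    spike = 200e-6  # 200 microseconds, the figure quoted in the comment
    for interval in (1.0, 0.1, 0.001):  # hypothetical polling intervals
        p = catch_probability(spike, interval)
        print(f"poll every {interval:g} s: {p:.4%} chance per spike")
```

Under this model a once-per-second readout has a tiny chance per spike, and even a hit would only report whatever value the software interface exposes at that instant.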
I have the 13500. I started off with Asus B760 Prime. I had constant freezes and power kernel problems. Every diagnostic checked out. The seller exchanged the motherboard for a Gigabyte Z790 Aorus DDR4 which seems to have fixed the problem. Haven't had an issue since then. Will I be okay? 😢
I've patched mine and my friend's 13900K by setting an internal VR voltage limit of 1.4 V. This will completely stop the voltage spikes! Don't know why Intel doesn't recommend it or use it for their own patch.
@@ducpaii Basically what buildzoid did in his video about using safe limits ? Yeah, that is a bit crazy that's not more enforced. I guess it's because for some CPUs that is not stable.
If they're smart they'll figure out some mutually agreeable reward for him... I believe it will be something more akin to Wendell joining AMD and Nvidia on the Top 3 hitlist in the boardroom at HQ though. They're sewing up their ouija dolls in his form as we speak.
my school buys those EliteDesk 800 minis and i JUST got one with a 14700T and a 3050....I told my folks hey this problem is building, might hit us hard - no one listens - will see!!!
I remember the days of having to put a heatsink directly on to the processor die without cracking it, or stabbing the motherboard with a flathead screwdriver. Ahh traumatic memories.
I remember when AMD didn't have the financial resources that Intel has to commit such a grotesque act on a global scale with its refresh +++++++++++😂
Wait a minute, what's with this good i7 and good i9? Everyone was claiming that all 13th and 14th gen CPUs were experiencing this problem, but now it's only some.
Woooooooo! Linux tools baby! Wendell just finding the bad intel chips in his spare time, telling us normies how easy it was. Intel kinda sucks btw. I thought they were the big dogs....
mOhm is milli not Mega ohm. System programming and ee cpu stuff is what i did my whole life. Those voltages are still high level to alu and instruction decoder etc...
So we still don’t know the cause for certain. I’ll stick to my 2 cents from 4 years ago: efficiency cores in a desktop PC are a dumb idea and always will be.
I don't think you are missing something. Based on what I saw, it might just be the p!ss poor software quality. CPUs are getting more complex and all I can really see "improving" is the gaudy and extremely laggy BIOS/UEFI UI.
I applied the automated all-core turbo even on light loads. That, plus lowering the voltage to 0.85 V at idle, seems to have stabilized my 14900K, which was unstable at stock settings on a gaming motherboard.
You better start the RMA process. If your CPU is already unstable at Intel Default settings with the Extreme profile selected, then it is only going to get worse and worse and worse. And you paid for a CPU that can run the Extreme profile.....
@@andersjjensen yeah, problem is, those who aren't rich enough, like me, tend to buy 2nd hand (but in the end, we pay twice the price of brand new with warranty). And so there's no invoice or proof of purchase that anyone would accept for an RMA.
Just Synch your effing cores and you stop Single Core voltage damage. This is sooo simple. For a Single Core to reach 5.8Ghz it needs more than 1.5v. Then you have damage. NOT a mystery....
As a tech-illiterate person, is there a TL;DR way to fix this? Because it's a lot of jargon, and all I understand is that my new-gen Intel CPU is gonna fry itself at some point.
Voltage is at least a part of it. If the P cores are lightly loaded the E cores can make stupid VID requests (for some reason) which might be juicing the ring. There are other things on the package than cores that might be misbehaving. Like what else is on the same power plane as the memory controller? VDD2 is supposed to be 1.2 (DDR4) or 1.1 (DDR5) but a quick search is telling me XMP profiles are/were pushing it to 1.3-1.5(!)V.
Just lock all cores and limit voltage to where it’s stable. The single-core boost is what is degrading the chips, the same way as when you would push a chip's voltage while overclocking over the last decade or more. Mainly, this algorithmic boost for maximum benchmarks and efficiency is what is causing the issue.
Load line is in milli Ohms. Capital M = Mega. Lowercase m = milli
Thanks bz
OUUPPPSSSS look over me, I'm slow :D I did at least calculate the right voltage, task failed successfully. engagement challenge? :)
@@wendelltron I wonder if NV/AMD will bump into this issue on GPUs eventually...
I mean, boosting GPUs to max. [but damaging] voltage at the loading screen or game menu, to drop it to safe level after game loads.
Saving grace for now is that NV/AMD aren't pushing 1.3+ volts through them :D
Still, ~1.05V limit isn't exactly low given the absolute transistor count monsters, that current gen GPUs became (overheating A LOT on Air cooling and barely "fine" on water because of it).
Maybe base/stock max. voltage = 0.8V for GPUs, and 1.25V on CPUs is the answer ? (with option to go higher noted as overclock, by driver option click and acknowledging a warning popup window about higher power/temps ?)
Side note : I wonder how much electricity is being wasted by simply having millions of CPUs/GPUs pushing for that extra 5-10% more performance, by increasing power usage by +30-50% at the same time...
Heating the globe by being inefficient on the perf/watt metric (since stuff isn't run at the "sweet spot" on the V/F curve by default) - Dumb AI era is indeed here.
PS. Sorry for this post being a BZ-style rant...
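The perf/watt point above can be made concrete with a little arithmetic. This is only a sketch using the commenter's rough figures (5-10% more performance for 30-50% more power); the function name is made up for illustration and the inputs are not measured data.

```python
def relative_perf_per_watt(perf_gain: float, power_increase: float) -> float:
    """Perf/watt relative to the baseline operating point (1.0 = unchanged).

    Both arguments are fractions, e.g. 0.10 for +10%.
    """
    return (1.0 + perf_gain) / (1.0 + power_increase)


if __name__ == "__main__":
    # The two ends of the range quoted in the comment, not measurements.
    for perf, power in ((0.10, 0.30), (0.05, 0.50)):
        ratio = relative_perf_per_watt(perf, power)
        print(f"+{perf:.0%} perf for +{power:.0%} power -> {ratio:.2f}x perf/watt")
```

With those inputs, efficiency lands at roughly 0.85x in the best case and 0.70x in the worst case compared with the lower-power operating point.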
Yeah, thought that Wendell misspoke there as well
Was about to say: That'll be some insane voltage drop if we talkin megaohms.
The dangling Pi at 4:51 is pure gold
For optimal cooling 💀
And it's doing SDR work!
Wendell does what Inteldon't
If Intel release a fix for a bug, doesn’t that mean that Intel CPUs had a bug? Not possible! Oh, the humanity!😂😂😂
@@tringuyen7519 the real problem is that even if they fix the bug, most likely there will be a loss of performance. And the CPUs that are already having issues will not be able to run at higher speeds, if at all.
What Intel also doesn't do is provide a live USB image that anyone can run to test whether their CPU is OK or needs to be replaced. Intel should hire RAD Game Tools and a bunch of independent labs to farm out various tests. It has been noticed that RAD's Oodle experiences algorithmic errors on ailing Intel CPUs.
RAD has an existing relationship with Intel, RAD being developers of Larrabee.
@@steveftoth also, patching already damaged chips won't somehow make them fixed.
@@steveftoth duh, because the issues stem from physical damage to the chips
As a shoutout... Those T-series Intel SKUs (or their closely-related cousins) are also often found in industrial/manufacturing environments in "Edge" or "IoT" machines (sometimes called industrial PCs or IPCs). Interestingly enough... I haven't seen many (any?) 13th/14th gen Panel PCs or other embedded hardware solutions... but that doesn't mean they aren't out there.
There are (or were?) new embedded options coming out that have their E-cores completely disabled; check the 14901KE. Interestingly, they have now disappeared from Intel Ark, but Google still points to the Ark page, as do many news outlets.
Some laptops as well.
some of the "rent a minecraft server for your friends" services use a business-class SFF dell/hp/lenovo with T series cpus, and those Do Not Like that workload, interestingly.
You can also find them in some mini pcs... Looking at u hp!!!
All of these CPUs apparently get very, very high transient spikes that can't be caught with software. You need an oscilloscope to measure these spikes, according to Buildzoid from the Actually Hardcore Overclocking channel. His CPU hits around 1.6-1.7 V fairly often, but it would be such a random transient spike that it wouldn't be measurable with software.
Indeed. And why trust these software readouts in the first place. They can be manipulated like, let's say VW's emission values, apart from being inaccurate and of low resolution.
True, a scope is the only way to see the high speed spikes.
I don't get how Intel could overlook an issue which can be detected with a $200 scope. Bet someone there knew all along, and management decided to ride it out till the warranty runs out. Or middle management hid it so as not to upset upper management, because bad news is unwelcome and they shoot the messenger, like in many companies.
I mean, we already have the proven case on our hands where they decided to hide the 13th gen oxidation defect from their customers and ride it out instead of issuing a notice on a small serial number range.
I don't get how Intel could overlook an issue which can be detected with a cheap 1 GS/s scope. Bet someone there knew all along, and management decided to ride it out till the warranty runs out. Or middle management hid it so as not to upset upper management, because bad news is unwelcome, like in many companies.
I mean, we already have the proven case on our hands where they decided to hide the 13th gen oxidation defect from their customers.
And even if it could catch it every so often, it wouldn't be reflected by what the CPU is requesting.
this is like those warez release groups that patch game bugs alongside their bootlegs. great work !
I think it would be useful if, for each core, each processor, or each SKU, we could produce a graph. On the x-axis, we would want bins of frequencies (say 4000-4100MHz, 4101-4200MHz, etc.) and on the y-axis, the range of voltages requested for those frequency bins (with error bars). Then it would be interesting to compare those processors that are having trouble to those that aren't. Great work on the patch - I'm sure many will find this data useful.
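A minimal sketch of the binning step behind the graph suggested above, assuming you already have (frequency in MHz, requested VID in volts) samples logged from some tool. The function name and the sample values are made up for illustration, and bins here are half-open (4000-4099 MHz, 4100-4199 MHz, and so on) rather than the exact edges given in the comment.

```python
from collections import defaultdict
from statistics import mean


def bin_vid_by_frequency(samples, bin_mhz=100):
    """Group (freq_mhz, vid_volts) samples into fixed-width frequency bins.

    Returns {bin_start_mhz: (min_v, mean_v, max_v)}, ordered by frequency,
    which is enough to draw a range or error-bar plot per bin.
    """
    bins = defaultdict(list)
    for freq_mhz, vid in samples:
        bins[int(freq_mhz // bin_mhz) * bin_mhz].append(vid)
    return {
        start: (min(vids), mean(vids), max(vids))
        for start, vids in sorted(bins.items())
    }


if __name__ == "__main__":
    # Hypothetical samples, not readings from a real CPU.
    samples = [(4050, 1.10), (4080, 1.14), (4150, 1.18), (5710, 1.41), (5790, 1.45)]
    for start, (lo, avg, hi) in bin_vid_by_frequency(samples).items():
        print(f"{start}-{start + 99} MHz: min {lo:.2f} V, mean {avg:.2f} V, max {hi:.2f} V")
```

Running this per core on a healthy and a troubled CPU would give two tables that can be plotted side by side, keeping in mind the caveat raised elsewhere in the comments about comparing across motherboards and BIOS versions.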
5:45 s/megaohm/milliohm/
It bothers me too. Persistently being 9 orders of magnitude off hurts my ears. :)
It's milli-ohms and not mega-ohms for both DC and AC loadline.
@@pravardhanus People should just learn M = MEGA, m = milli and W for WUMBO
@@owlmostdead9492 And then there's micro, which is supposed to be µ, but u and mc are also used. My engineer brain *hates* it. At least u doesn't clash with anything.
doh, I meant that, said the wrong thing, ouppsssss
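For anyone wondering why the milli/mega slip matters, here is a quick worked example of load-line droop using Ohm's law (V = I × R). The 200 A current and the 1.1 value are assumptions picked only to show the scale; they are not taken from the video or from any Intel spec.

```python
def loadline_drop_volts(current_amps: float, loadline_ohms: float) -> float:
    """Voltage droop across the load line: V = I * R."""
    return current_amps * loadline_ohms


if __name__ == "__main__":
    current = 200.0  # amps, assumed for illustration
    milli = 1.1e-3   # 1.1 milliohm (mOhm), assumed example value
    mega = 1.1e6     # 1.1 megaohm (MOhm), the slip of the tongue
    print(f"1.1 mOhm at {current:.0f} A: {loadline_drop_volts(current, milli):.2f} V droop")
    print(f"1.1 MOhm at {current:.0f} A: {loadline_drop_volts(current, mega):.3g} V droop")
```

The milliohm case gives a droop of a fraction of a volt, while the megaohm case gives hundreds of millions of volts, which is the nine orders of magnitude mentioned above.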
When you have to change your software monitoring because Intel f'd consumers so they would not lose to AMD by 3-5%. What a fucked up situation. Hope that CEO gets the boot soon, before the damage he does cannot be reversed. I know he ruined the consumer market, and if they do not do a recall it might not recover.
The guy that pushed to use the tainted substrate that caused the corrosion problems in these chips left all by himself.
Gelsinger seems to sacrifice the cpu business for the foundry business... Let's see how that is going to play out 😮
Intel trying to avoid the most expensive cpu recall since cpu's were created 😂😭
The CPUs that are damaging the brand now were designed before Gelsinger was CEO. These are inherited problems. The handling of the issue is what we can blame on him.
@@zulu02280 Gelsinger is not wrong to try to boost foundry business. Intel's fab capacity could be considered a US national strategic asset with one hell of a basically guaranteed revenue stream if anything were to ever happen to TSMC. Like, say, a military invasion.
It's nice to see someone being scientific about this rather than just wildly speculating. If anyone can investigate and get to the bottom of this issue, it is Wendell.
Thanks for doing this Wendell.
Just learning about turbostat is great! It's like learning ffmpeg is the backbone of many video based programs. I'm guessing this was used in tuning your framework battery management tests? Thanks as always Wendell!
While probably not in the same realm as CPU binning, where bleeding-edge thresholds make or break a CPU model (or scrap it), the quality of motherboards in the same product line or production run probably varies much more than we would like to know and the manufacturers would admit.
You know, you throw shit chips out the door. You see someone that you know, and they ask you how they are, and you just have to say that they're fine, when they're not really fine, but you just can't get into it because they would sue you.
Asus motherboards have been helping Intel save face, but a lot of dumb media blamed Asus. Now we all know it's not Asus's fault. Thanks for this.
Intel provided Asus (and everyone else) with the documentation, firmware and validation tools. If Asus follows that and validates accordingly, then Asus did nothing wrong unless it can be proven that they have a measurably higher failure rate than all the other board vendors.
Asus has a very spotty past, and the Asus auto-overclock option in the BIOS provides a lot of voltage to the chip, so they are not without fault even with the 14th gen.
This is awesome and the L1 team is awesome for doing it. I hope Intel doesn't use this to claim there is a software fix for their issues
always here for a level1linux 🙌
Positively brilliant, again, Wendell!
I'm thankful that my E5-2699 is not affected😁
0:26 "Going to have to be physically replaced."
Intel wants to one-up CrowdStrike.
I thought 1.3-1.4 V was pretty high voltage for a CPU when it’s being slammed. On the few CPUs that I’ve got, it’s typically 1.12-1.25 V when all cores are being slammed. Love your stuff Wendell
The problem is not when all cores are loaded, it's when only one core is loaded. That's when that single core goes to the very high boost speed (6.0 GHz for the 14900K) and requests 1.50 V+. That 1.50 V gets applied to the ring, and the ring gets fried.
@@ivantsipr Mine is a poor sample (SP 95 rating), and the VID table says 1.478 V for 6 GHz. Of course it’s lower than that because I set it lower on my Z790 Dark Hero. I think it’s 1.406 V for 6 GHz now.
It was a binning issue, it was intel chasing $$$, The yield on the silicon quality for the top binning for i9 was too low and they could not produce enough i9's, so they simply pushed the lower quality silicon into the higher bins, so they could meet the demand for the higher profit i9's. half of them should have been binned as i7's or lower.
Oh wow, source?
Even if true, that explains absolutely nothing about why they’re now failing. At best this is a squirrel meant to distract.
Indeed - even if that's true, no CPU should be pulling such high voltages no matter what the binning level was. (Plus the motherboards should also not allow that much voltage into the CPU). This is either a bug in one of the voltage controllers of the chip or something gone wrong with the architecture somewhere. It's also a V&V failure to not catch this during release testing...
@@hard4hardware He seems to be quoting Tom from Moore's Law Is Dead. But take note that MLID stated this as only ONE OF the 3 potential stability issues, & probably not the main one (the others are oxidation of one batch plus the ring-bus voltage issue).
@@peterwstacey Or simply a too aggressive algorithm for the silicon quality to meet its base specs. Call that a bug if you will. The motivating intent and outcome are the same.
i went down the rabbithole of c-states on the package level, monitored by powertop. kaby lake worked fine for this. but not alder lake. for kaby lake i could save 1w per core at the wall.
hopefully intel's patch will help users with these CPUs to prevent early fail
It stands to reason that the damage has already been done and that it is only a matter of a few months or years until it fails. Like with the Intel Atom C2000 bug, but without any way to hotwire the CPU.
So, I guess that Intel is playing for time at this point, effectively trying to stall beyond warranty dates and also covering things up as to not endanger (future) sales.
They can only delay failure not stop it.
Outstanding! Affected people/companies are doing the analysis and patching in public while Intel is still doing their best to gaslight everyone.
They are pushing their CPUs too hard out of the box. They had and still have it coming. Only a years-long spotless track record will convince me to put Intel back in our evaluations.
Wendell did so much more work than Steve GamersNexus ever did and more
The magnitude of this issue is huge, and both of them are doing tons of work in different areas of the issue and working together also. If Wendell was the only one making light of this issue, Intel would figure out some way to silence him. They can't silence everyone. I'd like to see Steve take Wendell's data package to present at a sit down Intel board meeting, LOL.
What even is this comment? Steve posted a 40 minute video on this.
5:45 - don't you mean milliohm, not megaohm?
He meant gigaohm
Wendell is the only techtuber I know making patches to the kernel. GOAT.
He is also the only one with a real education in the field, right?
The others are guys that sold PCs in a store, a guy that likes Apple products and some long haired guy that did basic validation at Dell or something, right?
@@zulu02280 I don’t think he has a formal education in the field though. He’s self taught. I think that’s what he said when he was interviewing TechTechPotato, who does have a PhD.
@@RandyRanderson404 Bachelors of Computer Science. Did some fun stuff grad school adjacent, but no masters or phd :)
For the level1vid tool, it'd be really nice to have an extended mode that ran through all the combinations of physical cores, checking to see what it takes to make the cpu mad.
I swear I had just finished watching Shogun and now I hear Intel committing seppuku
I couldn't control my laughter
Hey, Mr Wendell Wizard. How about calling the column header VID instead of Voltage? Also is this value the raw VID or is it SVID (so DC/LL taken into account, like in HWInfo)?
Btw this SVID readback is also used for power calculation, so by using lower DC/LL motherboards basically can open more power budget while still showing 253 W or whatever in the software readout.
it's open source :) I am using the equation and terminology from intel reference to directly convert the register value in binary to a voltage. it's literally just msr198 in the turbostat.c patch. in my utility I'm doing a little more work.
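For anyone following along, the conversion described above can be sketched roughly like this. It assumes the layout Intel documents for MSR 0x198 (IA32_PERF_STATUS) on recent Core parts, where bits 47:32 hold the voltage in units of 1/8192 V. The function names and the read through /dev/cpu/N/msr (needs root and the msr kernel module) are my own illustration, not the actual turbostat patch or the utility from the video.

```python
import os
import struct

MSR_PERF_STATUS = 0x198  # the register mentioned above


def decode_vid(raw: int) -> float:
    """Convert a raw 64-bit MSR 0x198 value to volts.

    Assumption: bits 47:32 hold the voltage in units of 1/8192 V.
    """
    field = (raw >> 32) & 0xFFFF
    return field / 8192.0


def read_msr(cpu: int, reg: int = MSR_PERF_STATUS) -> int:
    """Read one MSR via the Linux msr driver (root + `modprobe msr`)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        # The msr driver uses the file offset as the register number.
        return struct.unpack("<Q", os.pread(fd, 8, reg))[0]
    finally:
        os.close(fd)


if __name__ == "__main__":
    # 0x2800 in bits 47:32 is 10240 / 8192 = 1.25 V
    print(decode_vid(0x2800 << 32))
```

Note this is the requested value (VID), not what the VRM actually delivers at the socket.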
I was expecting you to mention how things are going with your systems. Coming soon?
Nice. May I suggest to name the column "VID" instead of "Voltage"? "Voltage" is confusing as one might think this is Vcore, which is the actual voltage that the CPU gets. VID is the proper term for the voltage requested by the CPU.
In April 2024, I bought a new system with an MSI Z790 Pro Wifi P motherboard with a Core i9-14900KF cpu. The machine runs Proxmox then uses that to run a Linux image running TrueNAS, then a couple of other Linux images for programming and sometimes a Windows 11 image. I've applied MSI's updated BIOS image 7E06vAD3 dated 6/20/2024 to the motherboard but suspect some tool like this might be required to spot any degradation that starts accelerating over time. However, I presume watching stats from within the BIOS isn't very useful since if you're looking at BIOS screens, the machine isn't running a full OS and client loads. I would also assume running a tool like this on one of the CLIENT machine images running on the system won't help either because they may not see all cores in the processor. FWIW, the motherboard BIOS indicated before and after the patch that it was configured with a baseline CPU voltage of 0.88 volts so it seems to have come from the factory WITHOUT any config that was trying to overclock it or even push it close to limits.
Good showing mate. Intel is shitting the bed realtime
nice patch, got system stable now with no more reboots, installed i7z as well, nice tool. is there an AMD alternative? CPU-X is pretty slow, and knowing voltage for each core would be sweet to know which the better cores are
I still think the high transient currents are revealing a node or process limit. Sort of how we discovered plasma cutting; the goal was better/deeper penetrating welding.
At least I am under the impression that this amount of current draw is incredible and new territory for this particular node/process. Who knows? maybe quantum tunneling results in real tunneling.
Yay!!! Linux content.
Wendell is GodLike!
When L1T office tour and backstory and how you got this amazing building?
Intel's support suggested I set some settings manually. Looks like in my case, the mainboard is part of the problem.
I knew you had skills but golly, ok impressive
HWInfo is able to log sensor measurements to .csv. So, if the HWInfo sensors view the real voltages (not the requested ones), that may be a solution.
We're still running 6 core T series variants in our products because it takes a while to validate and design the boards
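A rough sketch of what digging through such a log could look like. The column name "Vcore [V]" is an assumption on my part; the real header depends on the board and which sensors the log contains, so check the first line of your own file.

```python
import csv
import io


def max_of_column(csv_text: str, column: str) -> float:
    """Return the largest numeric value found in one column of a CSV log."""
    reader = csv.DictReader(io.StringIO(csv_text))
    values = []
    for row in reader:
        cell = (row.get(column) or "").strip()
        try:
            values.append(float(cell))
        except ValueError:
            continue  # skip blank cells and any non-numeric trailer rows
    if not values:
        raise ValueError(f"no numeric data in column {column!r}")
    return max(values)


if __name__ == "__main__":
    # Made-up sample data, only to show the shape of the input.
    sample = (
        "Date,Time,Vcore [V]\n"
        "1.1.2024,10:00:00,1.312\n"
        "1.1.2024,10:00:02,1.476\n"
    )
    print(max_of_column(sample, "Vcore [V]"))
```

Keep in mind a log like this only contains what the sensor reported at each polling tick, so anything that happens between ticks is not in the file.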
Still using Windows 10 LTSC IOT
Have you actually done any testing and comparisons between the requested voltage and crashing? Can we expect some sort of correlation data in a future video? 👀
a little bit. it seems that when a CPU starts to degrade the instability creeps in at the lower end of the vid ranges, not the high end. so typical tests that fully load a cpu may not show any problems whereas running bursty workloads show problems.
@@Level1Linux Could increasing the voltage reduce crashing then? At the cost of accelerated degradation I suppose..
I was moreso wondering if it'd be possible to diagnose a CPU early based on its tendency to request higher or lower voltage for specific frequency
@@Level1Linux Could that be one of the reasons this was not detected by QA at Intel? They never designed tests for this specific failure mode and therefore they couldn’t detect it?
I love my 13900 - RTX 4070 mobile rig. Smooth HIGH FPS gaming, lightning fast workloads, and pretty amazing battery life (10+ or so hours) given the system. But it runs HOT when worked and I worry all the time about the longevity of this system. In comparison, for times when I need something smaller and lighter, I have a Ryzen 7730 system. The same gaming runs at a solid and smooth 60FPS, and while a workload might take slightly longer, the smaller system runs cool as a cucumber and without the jet engine sounds. I'm not saying I'll ditch Intel next go around, but it's looking likely if nothing changes.
Wendell you might not be exactly correct. More voltage does not mean bad CPU. der8auer made a video and tested multiple CPUs. There are CPUs with higher voltage but less power dissipation. But I am not sure if those CPUs nevertheless suffer more electromigration.
This reminds me of how you can't get Ryzen Master on Linux either, knowing the best cores on my 5900x would be nice.
Can you do a video on determining if your CPU needs RMA'd? It's often difficult to figure out if it's just bad code, or the CPU in this case. I host a few dedicated servers on Debian w/ Docker. I do see crashes from time to time but that seems normal. But now, with a 14700, I will question every crash :| Only recently has the new microcode been available.
There must be a way to physically probe what the VRM is sending to the CPU at any given time
Which git commit is your patch against? i tried a bunch over the last month, and always get failed chunks. How old of code were you working on?!
can we collect vid data and profile systems?
i would warn you against comparing data between different motherboards and motherboard bioses using the intel register without doing something to validate that these are valid comparisons. the null hypothesis is that they are not.
i am willing to put in some work on this if i can be of help
it would be reallllly nice to get some more data wrt the T variants suffering this issue.
Sadly software-tools have a near 0 chance of catching those voltage spikes. You'd need hardware interrupts on the actual register with a direct readout to catch those. The VID changes rather quickly and the spikes are really just short intervals - if memory serves me right it was something like 200us and they are not "common" so catching them is really not easy. And then the resulting voltage-spikes have a nasty relation with interrupts and low threadcount workloads like many loading-screens and menus and there sadly is no way of monitoring that in software at all (oscilloscopes should be fine as we are talking timescales on order of 1 us).
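To put a rough number on that, here is a back-of-the-envelope sketch. The 200 us spike length is the figure from the comment above; the spike rate (one per second) and the polling setup (one reading per second for a minute) are made-up assumptions, and the model simply treats every software reading as an independent instantaneous sample.

```python
def catch_probability(spike_s: float, spikes_per_s: float, samples: int) -> float:
    """Chance that at least one instantaneous sample lands inside a spike.

    Assumes spikes don't overlap and samples are independent, so each
    sample hits a spike with probability equal to the spike duty cycle.
    """
    duty = min(spike_s * spikes_per_s, 1.0)
    return 1.0 - (1.0 - duty) ** samples


if __name__ == "__main__":
    # 200 us spikes, hypothetically once per second, 60 readings
    p = catch_probability(200e-6, 1.0, 60)
    print(f"{p:.2%}")
```

With those assumed numbers the chance of ever seeing a spike in a minute of polling comes out at roughly 1%, which is the point being made: a slow poller will almost always miss them.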
I have the 13500. I started off with Asus B760 Prime. I had constant freezes and power kernel problems. Every diagnostic checked out. The seller exchanged the motherboard for a Gigabyte Z790 Aorus DDR4 which seems to have fixed the problem. Haven't had an issue since then. Will I be okay? 😢
This mostly affects 13th & 14th gen i9s and very few i7s. The i5s generally don't request over 1.4v per core.
Probably just Asus again 😢
I've patched my 13900k and my friend's by setting an internal VR voltage limit of 1.4v. This will completely stop the voltage spikes! Don't know why Intel doesn't recommend it or use it for their own patch
And yes it still hits max boost
@@ducpaii Basically what buildzoid did in his video about using safe limits ? Yeah, that is a bit crazy that's not more enforced. I guess it's because for some CPUs that is not stable.
Wendell is doing what Intel should have done.
If they're smart they'll figure out some mutually agreeable reward for him... I believe it will be something more akin to Wendell joining AMD and Nvidia on the Top 3 hitlist in the boardroom at HQ though. They're sewing up their ouija dolls in his form as we speak.
@@asknight i hope he wouldn't join any corporation. we need more independent workers like him
Why not set voltage limits in your bios? or not possible with your board?
my school buys those EliteDesk 800 minis and i JUST got one with a 14700T and a 3050....I told my folks hey this problem is building, might hit us hard - no one listens - will see!!!
Something for Master Ken!! Ken Shirriff
Appearing directly below a foreboding community note from GN.
"Intel is unbelievably slimy
Multi-part report soon."
I fixed my inteL, ordered AMD.
The subtitles say "colonel adjacent hacking". Sooooo...Lieutenant Colonel hacking?
Thanks!
big request for recent snapdragon laptops linux test
vs asahi m1 (which is a good oss project, but still mediocre at best as a main OS)
Should I report if my volt goes over 1.5? Cus it did
Remember the days of AMD space heaters... Well not only space heaters, but broken heaters.
I remember the days of having to put a heatsink directly on to the processor die without cracking it, or stabbing the motherboard with a flathead screwdriver. Ahh traumatic memories.
I remember when AMD didn't have the financial resources that Intel has to commit such a grotesque act on a global scale with its refresh +++++++++++😂
Remember the days the K6 would lock up several times a day? Got real good at saving your work often. 😢😢😢
Disabled Turboboost for my chip and have not had a crash since
RMA the chip. You paid for that performance.
Could you try whether the vf curve feature in intel XTU works? It should print out a vf curve similar to the one in Asus' bios on a K cpu.
Should I be concerned for my 13700k I only got at start of year?
Be like James Bond Damien. Live and let die.
News late last night: "Intel to cut costs by cutting thousands of jobs, report says"
@levelonetechs Could you release your patch?! I'd like to look it over! It would be greatly appreciated!
its on the forum!
red and yellow logo? hotness.
Wait a minute, what's with this good i7 and good i9? Everyone was claiming that all 13 and 14 gen CPUs were experiencing this problem but now it's only some.
"good" ones can be overclocked I suppose, or have lower power at a certain clock. Should have nothing to do with the bug.
Woooooooo! Linux tools baby! Wendell just finding the bad intel chips in his spare time, telling us normies how easy it was. Intel kinda sucks btw. I thought they were the big dogs....
mOhm is milli not Mega ohm. System programming and ee cpu stuff is what i did my whole life. Those voltages are still high level to alu and instruction decoder etc...
talking about intel ... intel has its own linux distro. how do the cpus deal with the failures on the distro you tested?
Intel had to fail, so we could get a Level1Linux video. Hope it fails often.
so we still don’t know the issue for certain? I’ll stick to my 2 cents from 4 years ago: efficiency cores in a desktop PC are a dumb idea and always will be
I have an ASUS that has the V/F Point feature but it just shows 0.000V for all points for my 13900K. So I might be missing something.
I don't think you are missing something. Based on what I saw, it might just be the p!ss poor software quality. CPUs are getting more complex and all I can really see "improving" is the gaudy and extremely laggy BIOS/UEFI UI.
RMAd my intel i9 13900k
I applied the automated all core turbo even on light loads. That, plus lowering the voltage to 0.85v at idle, seems to have stabilized my 14900K, which was unstable at the gaming mobo's stock settings.
You better start the RMA process. If your CPU is already unstable at Intel Default settings with the Extreme profile selected, then it is only going to get worse and worse and worse. And you paid for a CPU that can run the Extreme profile.....
@@andersjjensen yeah, problem is, for those who aren't rich enough, like me, we tend to buy 2nd hand (but in the end, we pay twice the price of brand new with warranty). And so there's no invoice or proof of purchase that anyone would accept for an RMA.
@@Pacho18 OUCH! Sorry for your loss man :-/
@@andersjjensen Yeah, bc of that, and Intel not doing a proper recall for all of those out there affected, i'm done with'em for life.
Is this just for Intel CPUs?
It seems that for now the tool is optimized for Intel CPUs, since they have those pressing issues. But with time one could expand it to all CPUs.
Massive layoff in Intel right now.
He used to be a millionaire but he fixed his life.
Lenovo Legion alderlake so gimped I don't go past 1.191v in performance mode lol, .8v normal use, guess its safe.
grendel glow up - grifting aficionado
sounds like a prog rock band name
Guess I forgot to subscribe
Does turbostat work for AMD CPUs? Or is there a similar app in the kernel?
Yes, turbostat works on AMD too.
8:30 14700k?
Think maybe Intel should be thinking about hiring you to help figure it out cuz they seem pretty clueless
Just Synch your effing cores and you stop Single Core voltage damage. This is sooo simple. For a Single Core to reach 5.8Ghz it needs more than 1.5v. Then you have damage. NOT a mystery....
Monitor the memory controller and associated things voltage, you will see things..
Or maybe none of the "engineers" at Intel has ever taken a course in control theory.
According to lact my laptop can request 150w on amd cpu lol
as a tech illiterate person, is there a TL;DR way to fix this, because it's a lot of jargon and all I understand is that my new gen intel CPU is gonna fry itself at some point
mΩ = milliOhm, MΩ = megaOhm
So might the voltage story from intel be a lie? To buy time or so?
Voltage is at least a part of it. If the P cores are lightly loaded the E cores can make stupid VID requests (for some reason) which might be juicing the ring. There are other things on the package than cores that might be misbehaving. Like what else is on the same power plane as the memory controller? VDD2 is supposed to be 1.2 (DDR4) or 1.1 (DDR5) but a quick search is telling me XMP profiles are/were pushing it to 1.3-1.5(!)V.
@@duckrutt DDR4 was always rated for 1.3V and more. My kit is 1.35V and this has been working for the last decade!
The problem is not voltage !
Flaw has been in processors since the 11th generation!
Look for processors that didn't die, dual channel stopped working!
I am a T series shill.
Just lock all cores and limit voltage to where it’s stable. The single core boost is what is degrading the chips, the same way as when you would push a chip's voltage overclocking it over the last decade or more.
Mainly this algorithmic boost for maximum benchmarks and efficiency is what is causing the issue