why can’t computers have thousands of cores?

  • Published Sep 21, 2024
  • If you're watching this video on any device made in the last 10 years, be it a desktop, a laptop, a tablet or a phone, then there is an extremely high chance that your device is powered by a multi-core processor. Since the release of the first dual-core processor in 2005 by IBM, it has become more and more common for computer processors of all varieties to be multi-core. This is in direct contrast to laptops in the 2000s, like my iBook G4 for example, which was powered by a single-core PowerPC processor at around 800MHz. Nowadays, it is common for any desktop to have at least 4 cores, clocked well into the gigahertz range.
    But what does it mean for a processor to have multiple cores? How does a processor with multiple cores work? Why are more cores better than just one? How many cores are too many? These are all really important questions, and, like you, I was curious to find the answer.
    🏫 COURSES 🏫
    Learn to code in C at lowlevel.academy
    🔥 SOCIALS 🔥
    Low Level Merch!: www.linktr.ee/...
    Follow me on Twitter: / lowlevellearni1
    Follow me on Twitch: / lowlevellearning
    Join me on Discord!: / discord

Comments • 1.3K

  • @utubekullanicisi
    @utubekullanicisi 2 ปีที่แล้ว +1680

    Both Intel and AMD are rumored to release server processors (codenamed Sierra Forest and Turin, respectively) with more than 200 cores in the next few years (as soon as 2024). Servers will continue to scale well and make use of as many cores as you can give them.

    • @kayakMike1000
      @kayakMike1000 2 ปีที่แล้ว +128

      Aren't these intended for data centers where customers lease VMs or some other slice? AMD has encrypted RAM...

    • @cassandrasibley228
      @cassandrasibley228 2 ปีที่แล้ว +173

      This video is about home and personal computers. Obviously industry hardware is gonna be a lot tankier

    • @kayakMike1000
      @kayakMike1000 2 ปีที่แล้ว +60

      @@cassandrasibley228 well, some of those Xeon server processors end up in high-end workstations. I suspect higher-end workstations might take a middle road between core count and individual core performance.

    • @AnarexicSumo
      @AnarexicSumo 2 ปีที่แล้ว +99

      @@kayakMike1000 Can't I just enjoy learning about the extreme limit to the same tech without people going "Uhm ackshually that's not designed for consumers"

    • @littlemeg137
      @littlemeg137 2 ปีที่แล้ว +49

      Sun/Oracle had a 128 core SPARC64 chip over a decade ago. I've still got one of those servers in my basement.

  • @larrydavis3645
    @larrydavis3645 2 ปีที่แล้ว +674

    As a former programmer, I can say that not all functions of a program can be run in parallel. Sometimes a function needs to wait for another process to finish before it can proceed.

    • @pwnmeisterage
      @pwnmeisterage 2 ปีที่แล้ว +88

      You just can't count or calculate the next number before you've finished calculating or counting the one before it.
      There is no logic, no clever math or algorithm or brute force which can speed up some simple processes. The complex things just have to wait until the simplex things get done.
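
A tiny C sketch (illustrative, not from the video or the comments) of that kind of inherently serial computation: each iteration needs the previous result, so no amount of extra cores can split the loop up.

```c
#include <stdio.h>

int main(void) {
    /* Logistic map: x[i+1] depends on x[i], a loop-carried dependency.
     * Iteration 1,000,000 cannot start until iteration 999,999 finishes. */
    double x = 0.5;
    for (long i = 0; i < 1000000L; i++)
        x = 3.9 * x * (1.0 - x);
    printf("final value: %f\n", x);
    return 0;
}
```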

    • @MrZnarffy
      @MrZnarffy 2 ปีที่แล้ว +13

      That's if you iterate... But try using a functional language where even iteration is done by recursion...

    • @larrydavis3645
      @larrydavis3645 2 ปีที่แล้ว +4

      @@MrZnarffy Thank you for the feedback.

    • @duaanekobe2773
      @duaanekobe2773 2 ปีที่แล้ว +1

      It can be endless, 12 is the basic max, 24 = 2x all at very stable more. times infinity Now I seen 36 max and it gets bugs. a controller separating 24 and next 24 = 58, etc... So buffer and code (stable 24 time x), makes best code function. now is the actual program and controller(s), Think A,B,C and the programs, export

    • @larrydavis3645
      @larrydavis3645 2 ปีที่แล้ว +8

      @@duaanekobe2773 Thank you for the feedback. I did most of my programming on mainframe computers, and the programs there were extremely linear in nature. We used subprograms for common functions, meaning the main program was in a wait state waiting for the subprogram to complete.

  • @davidthacher1397
    @davidthacher1397 2 ปีที่แล้ว +379

    Technical Tradeoffs:
    1. Power - Power Consumption / Thermal
    2. Area - Core Size / Cache / Memory Bandwidth / IO interconnect / Yield
    3. Performance - Architecture / Instruction Set / Clock Speed / FPGA / Multicore / CMT / NUMA / SIMD
    Business Tradeoffs:
    1. Algorithm Capability / Application
    2. Market share / Manufacturing Size
    3. Time to Market / Training

    • @leosmi1
      @leosmi1 2 ปีที่แล้ว +3

      Thank you

    • @brolysmash9333
      @brolysmash9333 2 ปีที่แล้ว +2

      Bro, you're the best. Thanks for sharing this. I'm a network engineer and didn't know anything about that.

  • @veleriphon
    @veleriphon 2 ปีที่แล้ว +854

    We already see the cores-to-code limit with 64-core, 128-thread Threadripper units. It's hilariously overpowered for most tasks.

    • @SandTurtle
      @SandTurtle 2 ปีที่แล้ว +155

      I feel bad for people who buy a Threadripper and then realize their favorite games either don't support multithreading, or only support 1 or 2 extra threads for main logic.

    • @giahuy8701
      @giahuy8701 2 ปีที่แล้ว +290

      @@SandTurtle of course, Threadripper is not for gaming

    • @saricubra2867
      @saricubra2867 2 ปีที่แล้ว +134

      @@giahuy8701 No way to blame sh*tty game code optimization on monster CPUs. There are games still struggling on CPUs with 4 cores due to bad optimization and very low CPU use.

    • @SandTurtle
      @SandTurtle 2 ปีที่แล้ว +12

      @@giahuy8701 ye ik but I've heard of people tryna buy them for gaming

    • @ed_iz_ed
      @ed_iz_ed 2 ปีที่แล้ว +13

      @@giahuy8701 games can EASILY make use of multiple threads

  • @badass6300
    @badass6300 2 ปีที่แล้ว +633

    Also, a big factor is that many programs have linear logic. Amdahl's law shows how well a task scales with multiple cores depending on how parallel it is. For a task that is 50% parallel, anything above 4 cores is pointless; for 75% parallel, anything above 16 cores is pointless. You just don't gain performance, and that is baked into the logic of the task. Many cores are great when doing many copies of the same task without caring which one is completed first.
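
For reference, Amdahl's law puts the speedup at 1 / ((1 - p) + p/n) for a parallel fraction p on n cores. A small sketch (numbers purely illustrative) that prints the diminishing returns described above:

```c
#include <stdio.h>

/* Amdahl's law: speedup for parallel fraction p on n cores. */
static double amdahl(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void) {
    int cores[] = {1, 2, 4, 8, 16, 64};
    for (int i = 0; i < 6; i++)
        printf("n=%2d   p=0.50 -> %.2fx   p=0.75 -> %.2fx\n",
               cores[i], amdahl(0.50, cores[i]), amdahl(0.75, cores[i]));
    return 0;
}
```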

    • @mewsermeow8683
      @mewsermeow8683 2 ปีที่แล้ว +94

      That and the fact that once you start getting into problems that are highly parallelizable, you'd just use a gpu anyway.

    • @badass6300
      @badass6300 2 ปีที่แล้ว +26

      @@mewsermeow8683 if the gpu has the instructions for it, but 99% of the time yes.

    • @hjups
      @hjups 2 ปีที่แล้ว +51

      @@badass6300 It's not about if the GPU has the instructions, it's largely about the type of problem too. GPUs for example, don't do well with highly divergent streams, but do well with highly uniform streams. Modern CPUs can often do much better with divergent streams due to their internal out-of-order nature, and throwing more CPUs at the problem has almost perfect scaling with Amdahl's law in such problems (very small sequential part - usually global book-keeping).

    • @badass6300
      @badass6300 2 ปีที่แล้ว +8

      @@hjups True, but GPU architecture is getting closer to CPU architecture with each passing generation. AMD GPUs since RDNA1 have hardware schedulers and might get OoO execution in the future.
      Then again with chiplets they might get a whole CPU to themselves for certain tasks.
      Or vice-versa, integrated GPUs might get good, or both.

    • @hjups
      @hjups 2 ปีที่แล้ว +27

      @@badass6300 Not really. GPUs are fundamentally different from CPUs due to their parallel / vector nature. Some improvements have been made to handle thread divergence, but they are never going to be as robust as a CPU.... otherwise.... they would be CPUs...
      As for OoO, both NVidia and AMD GPUs do OoO internally. It's not something advertised though.

  • @benandrew9852
    @benandrew9852 2 ปีที่แล้ว +14

    I've recently started a job as a technical support engineer / technical writer working on complex digital signal processing applications. Videos like yours are exceptionally valuable to me as a non-programmer. There are limitations to design, implementation and efficiency that are contingent on factors entirely within low-level hardware programming, and having them explained so succinctly makes my job way easier, because I'm being provided with a higher-level understanding that I can pass on to my reports. Props.
    And, on a more personal-craft level, the quality of your videos in terms of rapidly explaining complex topics through efficient use of graphics and constrained use of jargon is very inspiring. Well done.

  • @TrippTech
    @TrippTech ปีที่แล้ว +5

    (electrical engineer here)
    LOVE this, great explanation!!! One thing I would have mentioned, especially when talking about single-core chips, is "out of order" execution, where the chip executes instructions as soon as they're ready, rather than everything waiting in a queue. Probably one of the biggest advances in chip design in history.
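
A small C sketch (an illustrative example, not from the video) of what out-of-order hardware exploits: the two accumulations below are independent dependency chains, so the core can keep both in flight at once even though the source code is written sequentially.

```c
#include <stdio.h>

#define N 1000000

int main(void) {
    static double x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }

    double a = 0.0, b = 0.0;
    for (int i = 0; i < N; i++) {
        a += x[i];   /* chain 1 */
        b += y[i];   /* chain 2: no dependency on chain 1, can overlap */
    }
    printf("%f\n", a + b);
    return 0;
}
```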

  • @desmondbrown5508
    @desmondbrown5508 2 ปีที่แล้ว +285

    I think it would be interesting to go into why GPUs CAN have so many cores, be parallelized more effectively and with better thermal efficiency, but CPUs cannot. I know the answer, but I do think it would be an interesting follow up video.

    • @JustinShaedo
      @JustinShaedo 2 ปีที่แล้ว +14

      Total agreement. I don't know the answer, but I'm certainly curious!

    • @richardg8376
      @richardg8376 2 ปีที่แล้ว +104

      @@JustinShaedo A basic explanation would be that the kind of work a GPU does is easy to break up and spread among hundreds of small cores, and a GPU is designed for parallel processing on tasks that don't depend on each other.
      In a GPU you define a single program called a "shader", essentially a script which defines various inputs and what the GPU should do with those inputs.
      Each core on the GPU then runs in lockstep with each other: they all run the exact same shader script, albeit with different parameters. You cannot have half the cores run one script and the other half run the other. This is great for 3D graphics where the output of each pixel on a screen can be calculated independently, all using the same script.
      This is also why we still need CPUs and not just run everything on GPUs: GPU cores cannot run separate processes simultaneously on each core: only hundreds of copies of the same process with different inputs.
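
A plain-C stand-in (hypothetical, not real shader code) for that "one program, many inputs" model: every pixel runs the same function, only the coordinates differ, which is why the work divides so evenly across cores.

```c
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

#define WIDTH  1920
#define HEIGHT 1080

/* The "shader": identical logic for every pixel, different inputs. */
static uint32_t shade(int x, int y) {
    uint8_t r = (uint8_t)(255 * x / WIDTH);   /* horizontal gradient */
    uint8_t g = (uint8_t)(255 * y / HEIGHT);  /* vertical gradient   */
    return (uint32_t)r << 16 | (uint32_t)g << 8;
}

/* On a GPU these iterations run in lockstep on thousands of cores;
 * here they are just a loop over independent pixels. */
static void render(uint32_t *framebuffer) {
    for (int y = 0; y < HEIGHT; y++)
        for (int x = 0; x < WIDTH; x++)
            framebuffer[y * WIDTH + x] = shade(x, y);
}

int main(void) {
    uint32_t *fb = malloc(sizeof(uint32_t) * WIDTH * HEIGHT);
    if (!fb) return 1;
    render(fb);
    printf("last pixel: 0x%06x\n", (unsigned)fb[WIDTH * HEIGHT - 1]);
    free(fb);
    return 0;
}
```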

    • @Chezrlz009
      @Chezrlz009 2 ปีที่แล้ว +26

      @WJ gpus are designed to do a bunch of math at once. Each core is designed for a very specific task. Hence tensor cores, rt cores, etc. Cpus are supposed to be able to handle anything and everything, but maybe not as efficiently.

    • @Chezrlz009
      @Chezrlz009 2 ปีที่แล้ว +6

      @WJ I don't see how that shows GPUs being able to have more cores in the first place. Utilization is different than physical constraints. Also, they want the OS to work on old or cheaper laptops, the majority of which have really weak processors with few cores. If MS optimized Windows for PCs with 8 cores, PCs with 4 cores would struggle to do anything. I agree though. I wish Linux replaced Windows. Windows is a cash grab, but so is everything in capitalism that isn't truly free.

    • @Chezrlz009
      @Chezrlz009 2 ปีที่แล้ว +4

      @WJ yeah it stinks. Sadly, quantum computing has the same issue as more cores. No one will want to switch, and nothing digital can be translated directly. Qubits aren't binary and work off of quantum superpositions. They are programmed entirely differently. Additionally, quantum entanglement is highly unstable and can't be observed or interacted with in any way or the particles will lose their entanglement and define themselves. That means you need to cool the qubits to near absolute zero, which is very difficult. You are correct about GPUs being useful to streamers though. For certain tasks, computers can use GPUs to perform work such as encoding streams and rendering videos on the go. Chip makers design a specific pipeline for a GPU that will help it perform tasks. GPUs made for rendering game graphics tend to work well with rendering streams, which is handy. Pipelines are basically made up of core clusters that each perform a different task, and instructions are sent through the pipeline to have things like shaders, sharpening, particles, textures, etc. applied.
      Edit: tl;dr: quantum computing and more cores face the same obstacle: the economy and society have trouble changing rapidly, and both, especially quantum computing, would disrupt a lot. :c

  • @RoelBaardman
    @RoelBaardman 2 ปีที่แล้ว +78

    You're more describing the limits of the Von Neumann architecture and our current (mostly sequential) models than anything else imho.
    Have a look at Erlang and the Actor model, and I think you'll agree that processors can scale just fine if we rule out shared memory.
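
A minimal C sketch (processes and a pipe, purely illustrative) of the "communicate instead of sharing memory" idea behind the actor model mentioned above: each side owns its own state and only exchanges messages, so nothing needs locking.

```c
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void) {
    int fd[2];
    if (pipe(fd) != 0) return 1;

    if (fork() == 0) {              /* child "actor": owns its own counter */
        int n, total = 0;
        close(fd[1]);
        while (read(fd[0], &n, sizeof n) == (ssize_t)sizeof n)
            total += n;             /* private state, no locks needed */
        printf("child total: %d\n", total);
        return 0;
    }

    close(fd[0]);                   /* parent "actor": only sends messages */
    for (int i = 1; i <= 5; i++)
        write(fd[1], &i, sizeof i);
    close(fd[1]);                   /* EOF tells the child we're done */
    wait(NULL);
    return 0;
}
```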

    • @kayakMike1000
      @kayakMike1000 2 ปีที่แล้ว +2

      The functional wizard has spoken. (Pay no attention to the man behind the curtain)

    • @Handelsbilanzdefizit
      @Handelsbilanzdefizit 2 ปีที่แล้ว +2

      It's called memory driven computing. Very smart.
      HP tried this for a while.
      I have no idea what happened to this.

    • @RoelBaardman
      @RoelBaardman 2 ปีที่แล้ว

      @@Handelsbilanzdefizit Thanks for sharing!

    • @StCreed
      @StCreed 2 ปีที่แล้ว +1

      Occam on transputers already solved a lot of issues with programming. Too bad it never took off.

    • @RoelBaardman
      @RoelBaardman 2 ปีที่แล้ว

      @@StCreed Interesting, thanks for sharing!

  • @RonJohn63
    @RonJohn63 ปีที่แล้ว +16

    AMD also released their first dual-core CPUs in 2005. (Of course, not everyone instantly bought them...)
    Another issue with huge core counts is cross-core communication: threads usually want to talk to each other, and the wiring between all those cores gets crazy. You effectively get a traffic jam in there...

    • @jessepollard7132
      @jessepollard7132 ปีที่แล้ว

      That is what the crossbar switch is for: communication between the CPUs and the shared cache that mediates access to the memory bus.

    • @RonJohn63
      @RonJohn63 ปีที่แล้ว

      @@jessepollard7132 right. But even crossbars have a bandwidth limit.
      This is also why NUMA was developed.

    • @jessepollard7132
      @jessepollard7132 ปีที่แล้ว

      @@RonJohn63 yes, but it isn't a bandwidth limit so much as a limit on the number of switches in a physical implementation. NUMA provides the same interconnections but with different constraints.

    • @jessepollard7132
      @jessepollard7132 ปีที่แล้ว

      @@expressionsartistic5856 Actually that was built by Sun, not Cray. If I remember right that was supposed to be the CS-64.

  • @dannygjk
    @dannygjk 2 ปีที่แล้ว +64

    Back in the 80's I read a magazine article about the 'Connection Machine' which had 65536 processors but each processor wasn't like a core that we think of these days. Each processor was a tiny simple device which operated in a massively parallel architecture. Such machines had a limited practical value since they were specialized for a narrow range of problems and were also limited by being a 'hard-wired' architecture. Right now I can't think of a better description but I do know I should word it differently. I vaguely remember it had clever solutions to how to break down tasks and how the machine's processors worked together. It makes me think of things like ant colonies.

    • @ivanscottw
      @ivanscottw 2 ปีที่แล้ว +23

      Errr... GPUs ?

    • @dannygjk
      @dannygjk 2 ปีที่แล้ว

      @@ivanscottw I don't remember if the connection machine was analogous to a GPU because I don't remember the details of the architecture.

    • @mateusvmv
      @mateusvmv 2 ปีที่แล้ว +1

      Sounds more like a cluster

    • @dannygjk
      @dannygjk 2 ปีที่แล้ว +7

      @@mateusvmv iirc the whole machine's architecture was roughly analogous to a GPU. It wouldn't be like what people think of as a cluster we have these days. Each processor was a very simple device nowhere near what a processor is in a cluster we think of these days. It was huge tho compared to a GPU which isn't surprising since the first one was built in the 80's.

    • @littlemeg137
      @littlemeg137 2 ปีที่แล้ว +4

      The whole point of the Connection Machine's hypercube topology was to allow programmers to define the optimal architecture for the problem they were trying to solve. Unfortunately, very few HPC programmers of the time could make the cognitive leap to this model from Fortran on vector machines.

  • @dmitrykargin4060
    @dmitrykargin4060 ปีที่แล้ว +3

    Scientific computing guy here. Most often we hit the RAM bandwidth limit. Sometimes we use all the bandwidth from a single core with optimised AVX2 code and a perfect memory layout. Using more cores will just slow everything down until you switch to a platform with more DDR channels.
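
A sketch of the kind of bandwidth-bound kernel being described (a STREAM-style triad with illustrative sizes, not the commenter's actual code): there is almost no arithmetic per byte moved, so once one or two cores saturate the memory channels, extra cores add nothing.

```c
#include <stddef.h>
#include <stdlib.h>
#include <stdio.h>

#define N (1u << 24)   /* ~16M doubles per array, far larger than any cache */

/* Triad: 2 loads + 1 store and one multiply-add per element. */
static void triad(double *a, const double *b, const double *c, double s) {
    for (size_t i = 0; i < N; i++)
        a[i] = b[i] + s * c[i];
}

int main(void) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;
    for (size_t i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }
    triad(a, b, c, 3.0);
    printf("%f\n", a[0]);
    free(a); free(b); free(c);
    return 0;
}
```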

  • @MichaelBristow137
    @MichaelBristow137 2 ปีที่แล้ว +58

    My first computer had 48K (it was an Apple II+ with 16K of extra memory). I remember learning some assembly language. Now I have a phone with multiple gigabytes of memory (255 GB SD, plus 128 GB internal) that takes 8 MB photos... I am so amazed at how far we've come and what the computer is actually doing to even display what I'm typing right now. It's mind-bogglingly amazing...

    • @EdKolis
      @EdKolis ปีที่แล้ว +9

      I remember the animated intro for Megaman X saying back in 1993 that X had 32,768TB of RAM, and I had to look up what a terabyte was and I was like "lol what". Now that actually seems feasible in the not too distant future - will AI advance to the same point as X too?

    • @rodjacksonx
      @rodjacksonx ปีที่แล้ว +3

      My first was an Atari 800, I'm pretty sure it had 64K. It was a fossil even when I got ahold of it. I recently fulfilled a childhood dream by building a new system and just maxing out the RAM for the heck of it. 128GB has never felt so good!

    • @cpK054L
      @cpK054L ปีที่แล้ว +6

      @@EdKolis well... 64-bit operating systems won't go away for a LONG time... as it can address 16 exabytes (still at least 6-folds away)
      Nothing says you can't have 32 exabytes of RAM... they question is.. .why?
      You might as well live alone in a 100,000 sqFt mansion and ask yourself... why?

    • @embrikchloraker8186
      @embrikchloraker8186 ปีที่แล้ว +2

      @@EdKolis What also amuses me is that, even in the future with specs like that, they're apparently still using DOS interfaces.

    • @Bobby-fj8mk
      @Bobby-fj8mk ปีที่แล้ว +2

      I learnt Intel 8085 assembly language back in the early to mid 80s.
      What is actually going on is so simple on an instruction by instruction basis.
      At some point in history, CPUs were able to allocate tasks from a single program to multiple cores all by themselves, without a programmer writing instructions for them to do that.

  • @pwnmeisterage
    @pwnmeisterage 2 ปีที่แล้ว +32

    GPU and SPU cards already pack hundreds or thousands of "cores" onboard. They can only process simplex tasks, not complex tasks, but they can stream their outputs in near-realtime.
    They suck a lot of power and spew a lot of heat while working at full load.

    • @gantz4u
      @gantz4u 2 ปีที่แล้ว

      Which they've been laying the groundwork for since single-core days, with things like liquid cooling and cryogenic cooling. Even my air cooler block is light years above what we had in the 1990s, to where it's on par with, if not out-cools, a 1990s water cooler.

    • @mm2f419
      @mm2f419 2 ปีที่แล้ว

      what are spus?

    • @hjups
      @hjups 2 ปีที่แล้ว

      @@mm2f419 I think it should have been "DPU" not "SPU". So the smart network cards like Nvidia's BlueField.

    • @JorgetePanete
      @JorgetePanete 2 ปีที่แล้ว

      simple*

    • @CocoaEm
      @CocoaEm ปีที่แล้ว

      @@JorgetePanete they do lots and lots of simplistic operations just when its added up its complex.

  • @AlessioSangalli
    @AlessioSangalli 2 ปีที่แล้ว +227

    I've always categorized as "asymmetric" those systems that, while having multiple cores, do not have cache coherency - so it's up to the programmer to synchronize the cores. I once worked on a system that was running Linux on one core and an RTOS on the other, with independent MMUs.

    • @LowLevelLearning
      @LowLevelLearning  2 ปีที่แล้ว +78

      Two OS's on separate cores, very interesting.

    • @llothar68
      @llothar68 2 ปีที่แล้ว +8

      Yes, I think Apple will have to go this way. We can see on the M1 Ultra that they have already hit the limit for chip interconnect. But it could be a nice start toward getting "blade computers" into the world of desktops. We had them in servers for a long time. Multi-socket boards still try to do cache coherency. But unfortunately desktop computers aren't there yet.

    • @MatthijsvanDuin
      @MatthijsvanDuin 2 ปีที่แล้ว +7

      Embedded and mobile SoCs quite commonly can have lots of different cores with little or no coherency. For example TI's TDA4VM has (counting only freely programmable cores):
      - dual-core arm cortex-A72
      - three dual-core arm cortex-R5F subsystems
      - one TI C71x DSP
      - two TI C66x DSPs
      - two real-time subsystems with 6 TI PRU cores each
      with cache coherency only available between the cortex-A72 and the C71x as far as I understand (with snooping of main memory access by other cores or DMA, but no coherency with local caches of e.g. the R5F or C66x subsystems), while many of TI's older SoCs have no cache coherency whatsoever.

    • @kippie80
      @kippie80 2 ปีที่แล้ว +3

      This is already done with Intel and Apple chips, for security. I forget the name on the Intel side, but Apple put its T2 chip into the M-series CPUs.

    • @Mrcrappyfuntastic
      @Mrcrappyfuntastic 2 ปีที่แล้ว +3

      Didn't the Ps3 have a similar issue too?

  • @ccflan
    @ccflan 2 ปีที่แล้ว +66

    One of the best YouTube channels out there; it feels like you should have to pay to see this content, so thank you.

    • @LowLevelLearning
      @LowLevelLearning  2 ปีที่แล้ว +6

      Thanks for the love as always!

    • @ObligedTester
      @ObligedTester 2 ปีที่แล้ว +3

      Totally agree. I hope some of my youtube premium dollars end up on this channel 😅

    • @8lec_R
      @8lec_R 2 ปีที่แล้ว

      There's a patreon, feel free to pay. I can't afford to so I'd rather have content that is free and is viewer supported rather than something locked behind a paywall

  • @LilacMonarch
    @LilacMonarch ปีที่แล้ว +38

    The "number of transistors doubling every 2 years" might already be hitting its end. The problem is in order to add more, they have to be made so small that it's impossible to keep the circuits properly separated. The gaps are so small that electrons easily jump across, causing shorts. Maybe we will see an increase in larger sized CPUs, but that will have its own problems.

    • @NFchegg
      @NFchegg ปีที่แล้ว +1

      Chiplets

    • @AlMcpherson79
      @AlMcpherson79 ปีที่แล้ว +10

      Improve efficiency without improving capability to the point that we can start stacking the processors... resulting in THICC CPUS.

    • @LilacMonarch
      @LilacMonarch ปีที่แล้ว +17

      @@AlMcpherson79 now that sounds like a thermal nightmare

    • @KeinNiemand
      @KeinNiemand ปีที่แล้ว +1

      The number of transistors still increases, so we haven't hit the absolute end yet, but it has probably already slowed down from the doubling every 2 years of Moore's law.

    • @ViktardTRTH
      @ViktardTRTH ปีที่แล้ว

      @@jsmith8147 I've got some bad news if you think quantum computing is the answer, because while it can theoretically perform faster, there is no real-world useful function with a scalable architecture.

  • @69k_gold
    @69k_gold 2 ปีที่แล้ว +13

    "If you're watching this on any device made in the last 10 years.."
    Me watching this on my 2008 Windows XP Professional PC with an Athlon chipset: *You're wrong*

    • @billyswong
      @billyswong 2 ปีที่แล้ว +1

      We are in 2022 now.

  • @renchesandsords
    @renchesandsords 2 ปีที่แล้ว +52

    To be fair, in the science and datacenter space that kind of core density can be effectively leveraged; Threadripper and EPYC proved it, and the development of processors like Genoa and Bergamo only serves to drive that point home further.

    • @drstrangecoin6050
      @drstrangecoin6050 2 ปีที่แล้ว +2

      Yeah exactly. I got clickbaited by the title because I work on a system with over a thousand cores. Promise based task schedulers and MPI make it possible to recruit massive computing power for certain workflows and vectorizing loops over a distributed system is somewhat independent of code at this point. Old Perl script? Throw it into the OpenPBS scheduler with GNU parallel and loop over your entire data set as a matrix.

    • @prashanthb6521
      @prashanthb6521 2 ปีที่แล้ว +9

      @@drstrangecoin6050 I think you are getting it totally wrong. There is no machine with a single CPU of 1000 cores. You are using a cluster where each node has independent memory bandwidth. That doesn't run into any of the hurdles mentioned in this video at all.

  • @xeridea
    @xeridea 2 ปีที่แล้ว +9

    The main issue is that many tasks don't gain much efficiency from being split across many cores, due to having data dependencies on previous instructions. Generally, better applications for multithreading are those with workloads that are easily divided up. Anything to do with graphics tends to be heavily threadable, which is why GPUs these days have upwards of 10,000 tiny cores: you have millions of pixels on a screen, so it is easy to split up the work. Game logic, however, isn't as easy to split up, which is why games don't generally benefit from having more than 6 CPU cores. It would be trivial to have a CPU with 1000 cores, just shrink the cores. With CPUs, though, it is generally better to have a smaller number of cores that are better at executing code fast than to have a crazy number of simple cores.
    It is significantly more energy efficient to have more cores if workload can use them, which is why GPUs are so much more efficient at drawing graphics than CPUs. On the flipside, GPUs are pretty bad at general code, since to effectively use them, code needs to be what is referred to as "embarrassingly parallel". Many non graphics tasks are still able to be effectively programmed on the GPU, so they are still used for non graphics tasks, just not as CPUs.

  • @RunForPeace-hk1cu
    @RunForPeace-hk1cu 2 ปีที่แล้ว +6

    More cores = more memory = more cache = more interconnect speed = more energy = more heat.
    Cache coherency nightmare.

  • @CustomCans
    @CustomCans 2 ปีที่แล้ว +10

    I saw the title of this video and instantly thought of the Cerebras wafer scale processors - I think they definitively prove that computers and CPUs can have thousands of cores ;)

    • @ABaumstumpf
      @ABaumstumpf 2 ปีที่แล้ว

      Can have and be useful for general purpose are very different things.
      You can build a jetpack, you can build a microwave that is powered by hamsters - does not mean you should do it or that it would make any sense to do so.

    • @leovang3425
      @leovang3425 2 ปีที่แล้ว +1

      @@ABaumstumpf More like having a supersonic airliner: sure, it's fast, but it's not economical, nor is it pleasant to be around.

    • @prateekpanwar646
      @prateekpanwar646 ปีที่แล้ว

      @@leovang3425 Concorde

  • @erikshure360
    @erikshure360 2 ปีที่แล้ว +15

    It's pretty much impossible for Moore's Law to persist for another 50 years -- transistors can only be so small. If anything, a different form of computing will take over by then -- like optical computing.

    • @4.0.4
      @4.0.4 2 ปีที่แล้ว +2

      And they'll call it quantum computing for marketing purposes. Which isn't entirely wrong but not what people expect.

    • @officialrights6009
      @officialrights6009 2 ปีที่แล้ว

      Or analog computers

    • @matsv201
      @matsv201 2 ปีที่แล้ว

      Well... yes, but no.
      Effectively, the way it was originally perceived, it already died 10 years ago... really even earlier.
      The nm scale we have today is symbolic, not real.
      Transistor density has been increased by using other tricks, like standing transistors or more layers.

    • @ABaumstumpf
      @ABaumstumpf 2 ปีที่แล้ว +1

      At some point yes, it will stop being correct. And no sane person doubts that.
      But we do not know yet When that will happen.
      Also quantum-computers very likely are not the answer to 99.999% of all problems as far as we are aware - they simply are too slow and inefficient for anything that does not involve sifting through enormous amounts of combinatory possibilities.
      @@matsv201 "Effecticly the way it was originaly precived it alreddy died 10 years ago... well really even more."
      No.

    • @matsv201
      @matsv201 2 ปีที่แล้ว

      @@ABaumstumpf you probably need to motivate your statement a bit

  • @zxuiji
    @zxuiji ปีที่แล้ว

    Well, adding more cores while keeping all buses available to all threads (albeit not at exactly the same time, just close enough) is easy. All that's needed is a dedicated chip whose only purpose is to loop through one boolean bit per hardware thread the CPU supports, checking whether that thread needs an operation done via RAM. The actual operation is then read from the thread that set the bit, and once it completes the bit is cleared to say the operation is done. The thread doesn't need to care which bus was used, only that the chip handled the operation as soon as one was available. Being a chip, it can skip the "if N < THREADS_SUPPORTED" logic by linking up a power-of-two number of threads (2, 4, 8, 16, 32...) and letting the index overflow back to 0 as it increments, reducing power consumption and the time it takes to get back to a waiting thread. As for the RAM side of things, the most you can do is provide enough RAM to hold every app and its virtual memory in memory at once. That's unlikely to happen any time soon, and it would require some understanding on the user's side, like "if I open this app without closing another, then all apps will be slower because we've passed the RAM limit".

  • @overloader7900
    @overloader7900 2 ปีที่แล้ว +5

    GPUs: 11k cores and more on the way

  • @theldraspneumonoultramicro405
    @theldraspneumonoultramicro405 2 ปีที่แล้ว +1

    Fun fact: there is a hard physical limit to how small a transistor can be. Eventually they reach such a small size that electrons will freely flow through them, leaving the transistor permanently locked in an on state. Following Moore's law, we should reach that physical size limit as early as 2023.

    • @christophercuston
      @christophercuston 2 ปีที่แล้ว

      Hence Intel's 12000-series CPUs.
      AMD's Threadripper.

  • @saricubra2867
    @saricubra2867 2 ปีที่แล้ว +13

    I'm still using a 4-core, 8-thread CPU from 2013 for audio. The code is NOT bad; it's the real-time audio processing itself, which runs in series, that is the bottleneck. BUT putting audio tracks in parallel scales way better with more cores.
    Some tasks are the bottleneck by themselves, not their code.

  • @JohnMiller-mmuldoor
    @JohnMiller-mmuldoor 2 ปีที่แล้ว +5

    6:51 I need me one of them intel I69420 processors 😆

    • @GreatMossWater
      @GreatMossWater 3 หลายเดือนก่อน

      Sounds like a nice processor that smokes the competition out.

  • @homeboy6668
    @homeboy6668 2 ปีที่แล้ว +28

    Hey, could you consider making videos on compiler design maybe? it'll be cool to learn too. BTW, awesome video.

    • @raven4k998
      @raven4k998 ปีที่แล้ว

      How many cores will Windows need in the future? More than it needs now 🤣🤣🤣🤣🤣

  • @herrxerex8484
    @herrxerex8484 2 ปีที่แล้ว +25

    This is genuinely one of my favorite channels. You could make a RISC-V series or compile resources for learning it. I would love to learn more RISC-V; I just don't have a structured way to yet.

    • @LowLevelLearning
      @LowLevelLearning  2 ปีที่แล้ว +2

      Working on it! :D

    • @joelsmusic7771
      @joelsmusic7771 2 ปีที่แล้ว +1

      Risc processing is a college course offered at most universities.. I generally enjoyed working with this language.

    • @LowLevelLearning
      @LowLevelLearning  2 ปีที่แล้ว +8

      @@joelsmusic7771 RISC is the general idea of reduced instruction set computers, whereas RISC-V is the open source architecture and spec for those processors. RISC-V is more like saying MIPS or ARM than RISC alone.

    • @mikapeltokorpi7671
      @mikapeltokorpi7671 2 ปีที่แล้ว +1

      I remember drooling over RISC processors in the early '90s with my school mate. They finally seem to be maturing into commercial products (like Raspberry Pi replacements). Both are high priced for the performance at the moment, though. However, depending on the problem, you should get your discrete code running way faster than on a CISC architecture with those.

  • @DJ_Force
    @DJ_Force 2 ปีที่แล้ว +9

    You didn't talk about wafer yield. The more cores you have, the physically bigger the chip. The bigger the chip, the more susceptible it is to random manufacturing defects. Meaning, the bigger the chip, the more likely it is to be defective. This can dramatically raise the price since you get fewer sellable chips per silicon wafer.
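
As a rough illustration of why that matters, a common first-order yield estimate is the Poisson model, yield ≈ e^(-D·A) for defect density D and die area A. The numbers below are made up, just to show how quickly yield falls as dies grow:

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    double defect_density = 0.1;              /* defects per cm^2 (assumed) */
    double areas[] = {1.0, 2.0, 4.0, 8.0};    /* die sizes in cm^2 (assumed) */
    for (int i = 0; i < 4; i++)
        printf("die area %.0f cm^2 -> estimated yield %.1f%%\n",
               areas[i], 100.0 * exp(-defect_density * areas[i]));
    return 0;
}
```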

    • @WaterZer0
      @WaterZer0 2 ปีที่แล้ว

      So it's fair to say there's an ideal ratio in terms of cost to core count? At least from the manufacturer's point of view.

    • @DJ_Force
      @DJ_Force 2 ปีที่แล้ว

      @@WaterZer0 Well, the smaller the chip, the better the odds it doesn't have a defect. But yes, too small and it won't be powerful enough to be competitive.

  • @specialopsdave
    @specialopsdave 2 ปีที่แล้ว +2

    My dual-core desktop has had enough performance for everything until 2 years ago, but I don't play many AAA games anyways, so it still works fine for me.

    • @satrah101
      @satrah101 2 ปีที่แล้ว

      Same here, running Linux on it. Gets the job done,

  • @wbtittle
    @wbtittle 2 ปีที่แล้ว +6

    Once upon a time, I was an entry-level engineer for Bettis Atomic Labs. They gave us a tour of the facilities. As we were wandering the warehouse, our guide pointed to the 32,000-processor computer they were planning on using to design atomic reactors (I made that part up; they were just planning on trying to figure out how to use a 32,000-processor machine).
    They were trying to work out how to program such a machine.
    Then we moved down the warehouse 20 ft. "This is our 128,000 processor machine"
    "Why did you buy a 128,000 processor machine before you figured out how to code the 32,000 processor machine".
    "Because it is bigger and better!"
    The hurdle of making a 32,000-processor machine work is much, much bigger than making a 128,000-processor machine work after you've figured out how to make 32,000 processors work.

  • @CogentConsult
    @CogentConsult ปีที่แล้ว +2

    Want to hear something scary? In 1969, when the Saturn V rocket carried our astronauts to land on the moon, their command module computer had less computing power than today's pocket calculators. In 1981 my first home computer had only 64K of RAM, a 5 MB hard drive that weighed 66 lbs, used an 8-inch floppy disk, ran on the CP/M operating system and had a green monochrome CRT monitor. Its only function was as a very advanced machine code language translator and word processor for the court reporting profession. Its cost was $50,000. I had two; one for me, one for my wife, since we were both court reporters. The first computer I used was in high school. It was a Univac and we fed our code into the machine via computer punch cards. It was the size of a minivan and weighed nearly as much. We wrote our code in the FORTRAN language and later COBOL. Yeah, I was one of the computer pioneers. Over the past 40 years I've owned over 45 computers, five of which I've built myself. It has been incredibly fascinating to watch technology get faster, smaller and more powerful each and every year! I am by no means an expert in computers or software design; I'm just a guy who has used computer technology to have employability on my side. What a ride it has been... and what an exciting future computing has: quantum computers...

  • @nickscurvy8635
    @nickscurvy8635 2 ปีที่แล้ว +4

    Some electrical engineers, when confronted with a problem, say "I know, I will use more cores". Now they are the ceo of amd

  • @DDRWakaLaka
    @DDRWakaLaka 2 ปีที่แล้ว +6

    0:10 I think you've confused two different facts -- IBM's POWER4 is from 2001. You might be thinking of AMD's Athlon 64 X2, which is the first *consumer* level dual-core chip and is from 2005.

    • @CocoaEm
      @CocoaEm ปีที่แล้ว +2

      The POWER4 chip he was on about isn't even multi-core; it's straight up 2 CPUs on the same wafer.

    • @DDRWakaLaka
      @DDRWakaLaka ปีที่แล้ว +3

      @@CocoaEm Yeah, I'm realizing now he's likely referring to the PPC970MP. Which, like you said, was MCM, not two native cores.

  • @occapella8643
    @occapella8643 2 ปีที่แล้ว +3

    At its most basic level, a CPU is just a rock that we trapped lighting inside of and tricked into thinking.

    • @xCwieCHRISx
      @xCwieCHRISx 2 ปีที่แล้ว

      If the apocalypse comes those magical stones are very valuable.

  • @RealCadde
    @RealCadde ปีที่แล้ว

    It would be worth mentioning the difference between parallel and linear programs as well.
    A linear program is one that, in a simple example, takes the output of the previous operation as an input for the next operation.
    a
    a + b
    ab + c
    abc + d
    ...
    That's a linear operation.
    A parallel operation on the other hand does NOT rely on the result of the previous operation in the program as a whole.
    Using the previous example again, but making it parallel...
    Core 0:
    a
    a + b
    ab + c
    abc + d
    Core 1:
    e
    e + f
    ef + g
    efg + h
    Core 2:
    i
    i + j
    ij + k
    ijk + l
    Core 3:
    m
    m + n
    mn + o
    mno + p
    Then as all four cores have ran their code in their slice of the data, they can synchronize and this happens:
    Core 0:
    abcd + core1 + core2 + core3
    or...
    abcd + efgh + ijkl + mnop
    But before that can happen, ALL cores must have completed their slices. In this simple example it's no biggie. Each core runs their slice linearly and in linear time too. So they should all finish at the same time.
    But in reality, not every program is that simple. Some slices take more time than the others to complete as they do more complex operations. In the meantime, all other cores are just sitting around waiting for the most complex operation to finish. Well, they are free to do other things but not for that one program as the program is waiting for the biggest slice to finish.
    Being able to evenly slice up threads of a program such that they all finish at roughly the same time is almost impossible in more complex programs. Especially when you aren't the only program using the cores available as the scheduler might not agree with the program using all cores at that moment in time.
    A somewhat perfect example of parallel tasks that actually do take the same amount of time every time (almost) is what the GPU is doing.
    The GPUs of today have some ten thousand cores. They all work on their own slice of a rendered image.
    Say you have an image that is 1000 x 1000 pixels large, or a megapixel image if you will. Those 10,000 cores will each be working on a region that is 100 pixels large.
    If the task is to fill a gradient horizontally across the screen, then each core simply takes the starting and ending colors and interpolates those going from start to end in their block.
    This operation takes exactly the same amount of time on each core so it just works on GPU's... Because graphics is less complex than programs are in that sense. Graphics don't tend to sit around waiting for user input, network communications and access to memory.
    Each batch on a GPU has exclusive access to memory and all cores. The more data and operations you can cram into a batch the better, otherwise you have to keep telling the GPU what to do.
    In other words, it's better to tell the GPU to draw ten million polygons in one batch than it is to tell the GPU to draw a million here, a million there and another million there...
    When the GPU has ALL the data in one batch, it splits the tasks amongst all cores equally and just barfs pixels back at you in no time.
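
A minimal pthreads sketch of that slice-then-combine pattern (thread count and data are illustrative): each thread sums only its own slice, and nothing can be combined until every slice has finished.

```c
#include <pthread.h>
#include <stdio.h>

#define N        1000000
#define THREADS  4

static double data[N];

struct slice { int begin, end; double sum; };

static void *sum_slice(void *arg) {
    struct slice *s = arg;
    s->sum = 0.0;
    for (int i = s->begin; i < s->end; i++)
        s->sum += data[i];                /* each thread touches only its slice */
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) data[i] = 1.0;

    pthread_t tid[THREADS];
    struct slice s[THREADS];
    for (int t = 0; t < THREADS; t++) {
        s[t].begin = t * (N / THREADS);
        s[t].end   = (t + 1) * (N / THREADS);
        pthread_create(&tid[t], NULL, sum_slice, &s[t]);
    }

    double total = 0.0;
    for (int t = 0; t < THREADS; t++) {   /* wait for every slice, then combine */
        pthread_join(tid[t], NULL);
        total += s[t].sum;
    }
    printf("total = %f\n", total);
    return 0;
}
```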

  • @mully006
    @mully006 2 ปีที่แล้ว +8

    This is a good video, but I think you overlooked an important aspect, and that is HPC. While no single chip has thousands of cores, in high-performance computing it is common to run code on many, many nodes, each with 64 or more cores.
    Additionally, GPUs are really just processors with more limited instructions, and they generally come with thousands of cores on a single die.

    • @dannygjk
      @dannygjk 2 ปีที่แล้ว +1

      The architectures of GPUs and CPUs are different it's not just about instructions.

  • @Ferrari255GTO
    @Ferrari255GTO 2 ปีที่แล้ว +1

    The sweet spot for most consumers is 8 cores, imo. Most games don't need more, and assuming your CPU is fairly modern it will be perfectly capable of doing whatever you require of it without issues. It won't be an oven, but it will still need some decent cooling, and since it's an 8-core it won't be top of the line, making it cheaper than other CPUs while delivering a really good experience. What I mean is: don't just get the biggest thing you can, it might not be as convenient as you think.

  • @diconustra
    @diconustra 2 ปีที่แล้ว +4

    I was a sysadmin on a couple of multi-processor machines which had very similar scaling issues, except across CPUs and CPU boards. One was an IBM X460 with 4 interconnected chassis, each with four Intel CPUs; the other was an E25K with a half-dozen CPU boards, each with four SPARC CPUs. In each case, we ran into scalability issues related to memory bus bandwidth and the latency of memory fetches and bus I/O across chassis (X460) or boards (E25K).
    Operating system and database configuration and tuning helped, but ultimately both platforms faced diminishing returns on performance as boards and chassis were added, with 16 CPUs being the sweet spot.

    • @llothar68
      @llothar68 2 ปีที่แล้ว

      Apple's M1 Ultra already hit it. I'm very curious how they will design their Mac Pro. But I predict we'll go with multiple computers in the same chassis, also known as blades in the 2000s server days.

  • @iancamarillo
    @iancamarillo ปีที่แล้ว

    I have this feeling that we’re gonna go back to a single core in a revolutionary design that handles these executions in a different way

  • @albertsun3393
    @albertsun3393 2 ปีที่แล้ว +25

    Interesting thing about multiple cores is that coherence and even just latency in communication between multiple cores eats a huge chunk out of performance. Arbitrating cache coherency between one, two, maybe four cores isn't too bad, but when your critical path in coherency (or latching for multiple clocks) goes all the way across the chip, suddenly your performance drops like a rock. We've seen the transition from higher frequency to more cores because of the exponential increase in power consumption when increasing core clock, but with too many cores we sometimes struggle to even hit our initial clock due to all the overhead for everything else.
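
One concrete way that coherency cost shows up is false sharing. The sketch below (illustrative, not from the comment) gives each counter its own 64-byte cache line; drop the padding and the line ping-pongs between cores on every write, which is exactly the kind of cross-chip coherency traffic described above.

```c
#include <pthread.h>
#include <stdio.h>

/* Pad each counter to a full 64-byte cache line so the two threads never
 * write to the same line. Remove `pad` to see the false-sharing version. */
struct padded { volatile long value; char pad[64 - sizeof(long)]; };

static struct padded counter[2];

static void *bump(void *arg) {
    struct padded *c = arg;
    for (long i = 0; i < 100000000L; i++)
        c->value++;
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, bump, &counter[0]);
    pthread_create(&b, NULL, bump, &counter[1]);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("%ld %ld\n", counter[0].value, counter[1].value);
    return 0;
}
```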

    • @Rokabur
      @Rokabur 2 ปีที่แล้ว

      From everything I've seen, more cores almost always means lower clock speed (unless you're overclocking). My quad-core i7-4820K runs at 3.7GHz while I've seen Threadrippers running at barely 3GHz.

    • @Demon09-_-
      @Demon09-_- 2 ปีที่แล้ว

      @@Rokabur The Threadripper is not quite a fair comparison; that's two different brands with different IPCs and applications. The lower clocks may be true to some degree, but it would be fairer to compare it against Intel's new i9-12900K, which has 16 total cores and will still run at 5GHz out of the box (with some caveats around P-cores and E-cores). If you want an all-P-core comparison, you can look at the 10900K, which would still do 4.9GHz out of the box on all 10 cores and 20 threads. Comparisons within AMD are similar: their newer 5950X, which has 16 cores, barely loses any clock speed to their lower 5600X. When comparing, you really have to stay inside the same architecture. Intel does lose a little bit of clock speed compared to their lower chips when you compare server stuff, but servers are a bit of a different ball game, where you can get up to 56 CPUs with a lot of memory lanes.

  • @SelfMadeSystem
    @SelfMadeSystem ปีที่แล้ว +2

    GPUs have hundreds of cores, but the code made for GPUs is made specifically for parallelisation.

  • @bwiebertram
    @bwiebertram 2 ปีที่แล้ว +3

    In the future, one super computer will do the work for every person on earth

  • @rickpontificates3406
    @rickpontificates3406 ปีที่แล้ว +1

    DMA comes into play also. Memory allocation is important, but having a CPU's MMU core managing its own memory helps ease the bottleneck

  • @seeibe
    @seeibe 2 ปีที่แล้ว +31

    Thanks to GPGPU, we already effectively have CPUs with thousands of cores, just with some limitations.

    • @matsv201
      @matsv201 2 ปีที่แล้ว +3

      That is very true... but the flip side is that applications that are easy to multithread run on the GPGPU, while the ones that are not run on the CPU... again limiting the use of many cores

    • @null6482
      @null6482 ปีที่แล้ว +1

      Hehe. "GPGPU"

  • @FalcoGer
    @FalcoGer ปีที่แล้ว +2

    Moore's law was never a law to begin with. It was more of a design goal for the engineers. And now things are getting so small that individual transistors are only a few tens of atoms across. Given that silicon works by having a very specific amount of impurities, such as phosphorus, in it, and that quantum effects start to mess with the whole process, it's physically impossible to keep up transistor doubling, let alone on a 2-year timeframe.

  • @zolp
    @zolp 2 ปีที่แล้ว +3

    There are already processors with three-digit core counts in abundance, memory access also continues to improve, and there are many applications that parallelize well. GPUs already have thousands of cores and are put to good use.

    • @romanpul
      @romanpul ปีที่แล้ว

      Yeah, but you can't really compare GPUs to CPUs. To my (admittedly kinda limited) understanding, GPUs resemble vector processors more and are only efficient for use cases where your input data can be vectorized (i.e. cases where you fetch huge chunks of data at once and then crunch it). CPUs, on the other hand, are much better at crunching data which requires frequent, atomic memory access, due to their way more elaborate caching architecture.

  • @syarifairlangga4608
    @syarifairlangga4608 ปีที่แล้ว +1

    As ordinary consumers, we want 4 high-performance cores running at 5GHz on all cores rather than hundreds of cores.

  • @lockdot2
    @lockdot2 2 ปีที่แล้ว +4

    I am one of the few people still using a single-core CPU to watch YouTube. The CPU I use is an AMD LE 1640, with 1 core, 1 thread.

    • @utubekullanicisi
      @utubekullanicisi 2 ปีที่แล้ว

      You're able to stream at 4K no problem, right?

    • @Elinzar
      @Elinzar 2 ปีที่แล้ว

      Man... how? I'm sure even if you don't have much money you can scrape up some AM2 CPU with at least double the cores for next to nothing these days and swap that CPU out. It is a desktop CPU, right?

    • @dannygjk
      @dannygjk 2 ปีที่แล้ว

      @@Elinzar Sounds to me like they have a small laptop or netbook. I have a netbook it is also 1 core 1 thread and only 2 GB RAM. I would add more RAM but I don't think there are RAM modules larger than 2 GB for it and there is only 1 RAM slot. It can barely stream a video at 360p.

    • @saricubra2867
      @saricubra2867 2 ปีที่แล้ว

      9 year old 4 core 8 thread Intel Core i7-4700MQ here at 3.4GHz
      99% or 100% CPU use in one thread for gaming, audio, even for loading and saving stuff to the HDD (now SSD and the CPU is the bottleneck for the SSD, still ridiculously fast).
      That microbe AMD system would 100% freeze in a DAW lmao.

    • @Elinzar
      @Elinzar 2 ปีที่แล้ว +1

      @@dannygjk i looked it up and one page said it was a desktop cpu from the AM2 platform
      other page said it was a 2014 chip...

  • @Four-S
    @Four-S 2 ปีที่แล้ว

    Bruh why don't you have at least 100k subscribers, this is a great video

  • @Cyberfoxxy
    @Cyberfoxxy 2 ปีที่แล้ว +4

    Meanwhile, a common GPU boasts 8000 cores. Though they are much slower and have only a small set of instructions. Also, the instruction set is not standardized; as such, OpenGL/OpenCL is implemented by the vendors themselves.

    • @coleshores
      @coleshores 2 ปีที่แล้ว

      Still Turing complete though. There are highly parallel SQL Databases which run entirely on the GPU, such as Omnisci (formerly MapD) for example.

  • @theapexanomoly5354
    @theapexanomoly5354 ปีที่แล้ว +1

    I know this is an older video, but I’m still commenting for the algorithm.

  • @diablo.the.cheater
    @diablo.the.cheater 2 ปีที่แล้ว +6

    Some tasks can only benefit from parallelization up to a limit, some tasks simply are not parallelizable at all, some have very minor gains that would add unnecessary code complexity, and some tasks you can always throw more cores at to finish faster.
    In the general PC use case, most tasks are sequential, so more cores only benefit you if you are doing a lot of multitasking

    • @rtyzxc
      @rtyzxc 2 ปีที่แล้ว

      This. Game logic, for example. First you tell a character to move x amount. Then you check for collision, and if it hits, correct the position or execute some logic. Then you might check if the character is shooting, which again depends on the character's position. Things have to happen in the correct order; you can't just have multiple cores do each thing simultaneously, or the results would get messed up depending on the order in which the tasks happen to be completed.

    • @techpriest4787
      @techpriest4787 ปีที่แล้ว

      @@rtyzxc That is why OOP is not a thing for games, but data-oriented programming makes more sense. All languages are OOP except for Rust. You can do DO in C++ and C# too, but that is abuse; they are not really made for that.

  • @therosses5
    @therosses5 ปีที่แล้ว +1

    The first computer I touched was the Tandy Radio Shack TRS-80 Model 1. I was surprised you were able to explain cores in a way an old guy can understand. Very well done. I'm astounded that after decades the speed of our apps is still held hostage by sucky HD read/write nonsense.

    • @jessepollard7132
      @jessepollard7132 ปีที่แล้ว +1

      Basically, a core is just a CPU. A multicore processor is just a collection of CPUs wired together to access RAM. That is why each core has an L1/L2 cache (and sometimes a dedicated L3) for its own operation, and then there is a shared cache for all CPUs to use for access to RAM. The shared cache is called either L3 or L4 (usually L4 if the CPUs have dedicated L3).

  • @AlessioSangalli
    @AlessioSangalli 2 ปีที่แล้ว +9

    "Symettric" (5:05) well typos happen 🤣 seriously however the quality of the production is awesome, I wish I were this good with video editing. What program do you use, out of curiosity?

    • @LowLevelLearning
      @LowLevelLearning  2 ปีที่แล้ว +7

      Hahaha crap, there’s always one. I use Davinci Resolve, largely because it’s free XD. Thank you!

    • @vikassm
      @vikassm 2 ปีที่แล้ว

      @@LowLevelLearning Free, yes, also the small matter of it being the most powerful, fully featured A/V production suite in the world 🤣
      If it works for MARVEL, I'm sure us 'lowly' YouTubers can make do with DaVinci Resolve 😂😂

  • @mwbgaming28
    @mwbgaming28 2 ปีที่แล้ว +2

    4-8 cores with a stupidly high clock speed would probably be the best setup for the time being

    • @Demon09-_-
      @Demon09-_- 2 ปีที่แล้ว +1

      Probably about right. 4 cores with hyperthreading is probably fine for the everyday user, and 8 with hyperthreading is what most people who plan to play games should shoot for at this point, as games have already started to move pretty fast to where 4 cores will leave you pretty hard CPU-limited, depending on the rest of your setup.

  • @leftlovers9137
    @leftlovers9137 2 ปีที่แล้ว +4

    I searched this and voilà, I found your video 11 hours after upload lol

  • @godnyx117
    @godnyx117 ปีที่แล้ว

    I don't know why, but it's MIND BLOWING to me that the first dual-core CPU was released only in 2005. I would have expected it to be released in the early-to-mid 90s...

  • @johndoh5182
    @johndoh5182 2 ปีที่แล้ว +5

    At 7:00, this issue is once again what I mentioned for 2:15. It's multi-threading. I blame Intel for programmers having to play catch-up with modern many-core processors, although there were many software engineers who knew Intel was wrong, and this goes back to Intel vs. AMD around 2008 - 2011, when the loads on desktop CPUs started to become large. Intel basically said you don't need to add more cores to desktop computing because they would be able to keep improving IPC and getting clock speeds faster and faster. And they seemed to be right, because their 2c/4t CPUs were better than AMD's 4c/4t CPUs. And when AMD came out with an 8c/8t CPU it didn't fare a lot better. Well, those first-gen 8c/8t CPUs had core pairs that shared a single FPU while each having their own ALU. I know from going through classes that I was taught that FPU math was better. It really wasn't, as far as running it on an x86-64 CPU. It is NOW, but it wasn't so much then, and it certainly wasn't for AMD's 8c/8t, 8-ALU/4-FPU CPUs. If only professors would have remembered WHY it is you learn discrete math.
    Regardless, the point is one of multi-threading and how well an application does it. This is something that takes a lot of work for programmers, and lazy developers in the last 2 decades didn't want to think about it and Intel gave them a reason not to. Writing and testing multi-threaded software is harder. I can write a multi-threaded algorithm that is the same speed as a single threaded algorithm or possibly even slower. If one thread is simply waiting for another thread to finish work, such as I have a main thread that spawns another thread to run some function, but my main thread is waiting, this is slower even though it's multi-threaded. So multi-threading requires experienced programmers or engineers to work with a project to evaluate the software development, and it isn't always so obvious if doing one thing vs. another is more beneficial.
    There was one solid point you brought up other than the failure of programmers over the last 15 years to move towards developing their skills writing multi-threaded applications, and that was memory bandwidth. There is nothing other than that you brought up which is a physical limitation until there are other conditions thrown into the conversation, which then means this conversation needed to come from a person who can describe power efficiency, nodes, how clock frequencies affect power efficiency, etc........
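
A minimal sketch (hypothetical example) of the "multi-threaded but no faster" pattern described above: the main thread spawns a worker and then does nothing except wait for it, so no two threads ever do useful work at the same time.

```c
#include <pthread.h>
#include <stdio.h>

static void *do_work(void *arg) {
    long sum = 0;
    for (long i = 0; i < 100000000L; i++)
        sum += i;
    *(long *)arg = sum;
    return NULL;
}

int main(void) {
    long result = 0;
    pthread_t worker;
    pthread_create(&worker, NULL, do_work, &result);
    pthread_join(worker, NULL);   /* main just waits: no faster than calling
                                     do_work() directly on this thread */
    printf("%ld\n", result);
    return 0;
}
```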

    • @Dhaydon75
      @Dhaydon75 2 ปีที่แล้ว

      Another problem is you can have more cores or higher IPC and Freq but still be slower. But that is more of a time critical system problem.

    • @billyswong
      @billyswong 2 ปีที่แล้ว

      The infrastructure and tools for efficient multi-thread software development are not yet polished enough. In theory a programming language could handle thread pool implicitly, in an OS-neutral way. Meanwhile the OSes would provide part or all of the thread pool implementation such that multiple programs using thread pool at the same time won't overcrowd the CPU and introduce unnecessary task switching.

    • @ABaumstumpf
      @ABaumstumpf 2 ปีที่แล้ว

      " blame Intel for programmers "
      And you'd be wrong. Or rather you are just wrong.
      "So multi-threading requires experienced programmers or engineers to work with a project to evaluate the software development, and it isn't always so obvious if doing one thing vs. another is more beneficial."
      If only that were the only thing. Many problems simply cannot be processed in parallel. The Towers of Hanoi is an often-used example.
      And that's not to mention all the other problems like coherency, scheduling and especially the bugs that creep up.
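      To make that concrete, here is the classic Towers of Hanoi recursion in C (a minimal illustrative sketch): every move depends on the board state left by the previous move, so the 2^n - 1 moves form a strict sequential chain that extra cores cannot share.
      ```c
      /* Each recursive call must finish before the next move makes sense,
       * so the moves cannot be handed out to independent cores. */
      #include <stdio.h>

      static void hanoi(int n, char from, char to, char via) {
          if (n == 0)
              return;
          hanoi(n - 1, from, via, to);   /* must complete first       */
          printf("move disk %d: %c -> %c\n", n, from, to);
          hanoi(n - 1, via, to, from);   /* depends on the move above */
      }

      int main(void) {
          hanoi(3, 'A', 'C', 'B');
          return 0;
      }
      ```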

    • @johndoh5182
      @johndoh5182 2 ปีที่แล้ว

      @@ABaumstumpf I know not every problem can be solved by multi-threading. There have to be real parallel paths in the processing for multi-threading to make any difference. But that parallel path can be as short as a few microseconds and it's still beneficial. It can be two sets of calculations that can happen independently and you'll get a benefit.
      However Intel said EXACTLY what I said they did when AMD was releasing CPUs with more core counts than Intel. So yes, they were part of the problem. And yes, programmers have been lazy in many companies, and yes many programs can be written much better.
      You're about to see this play out in game engines and what happens when you bring near realistic graphics to a game. Part of this of course is the ability of a GPU, but part of this is the game engine. Unity for instance has been notorious for saying that since a main game thread dictates how fast a CPU can process code, having a game engine be multi-thread only adds complexity with no benefit. On the other hand, Epic Games released UE5 and games are going to be coming out on it starting the end of this year. I watched demos of Matrix Awakens and it was pushing a 5950X to around 40% CPU utilization. Simple math says this game with this game engine overwhelms a 4c/8t CPU, it pushes a 6c/12t CPU to 100% so even that is going to be a bottleneck, and an 8c/16t CPU is going to be minimal to run the game without the CPU being a bottleneck due to being overloaded. There's other reasons the CPU can be a bottleneck, but this is going to be the first time as far as I know that for PC gaming, a 6c/12t CPU is going to be a bottleneck simply because it doesn't have enough cores.
      YES, INTEL SAID that gaming would never require more than 4 cores. Now, finding old information with a search engine isn't very easy, so I'm not going to bother digging. Of course by the time they put out 9th gen, AFTER AMD had released very effective 8c/16t CPUs, Intel did a 180 on THAT statement.
      I'd be a millionaire if I had a dime for every time I've heard a game will never need an 8c/16t CPU. Maybe a slight exaggeration, but I think you get the point. What I think is going to happen is if a game company wants to develop a game that looks realistic, they're going to use UE5 and Unity will be relegated to more simple graphics.
      Autodesk, same thing. Their software gets poor CPU utilization and often when people have a powerful system, EVEN WHEN the software is rendering an image on screen, it's painfully slow. You read comment threads on their site for different software packages and users complain about this, and point out other software that does the same type of rendering and it's much faster.
      Adobe, same thing. They've improved SOME of their software.
      At some point people will leave these companies behind when new hardware is still running like a turtle.
      So yes I know some software cannot be optimized more than it is. But I also know that thousands of students have gotten a BS in software engineering and their professors never emphasized multi-threading along with testing multi-threaded applications. And I also know that in many cases, I'm right and we're going to agree to disagree. I was a person BTW who went through most of a BS degree in software engineering (I had already retired from the military and time was catching up to me along with my back breaking down) and saw this first hand. I ended up having back surgery before my senior year, and after that point I only wanted to work part time and didn't feel like putting 100% of myself into another career.

    • @johndoh5182
      @johndoh5182 2 ปีที่แล้ว

      @@billyswong I agree, and I'm sure there are still many universities that don't push software engineers to program this way, and testing is hard.
      Testing effectiveness for multi-threaded applications, when the intent is to speed up the time it takes to run means time testing along with testing that functions work the way they're supposed to. Multi-threading can slow down an application if done improperly. Simply spawning threads to complete a task, if some other thread is simply waiting for that data can slow down performance due to passing data back and forth.
      So yes it does require testing and the testing is going to be very complicated, but in the end it's the right thing to do for applications that require a bit of computation, and not simply a text editor or other simple computing.
      "Meanwhile the OSes would provide part or all of the thread pool implementation such that multiple programs using thread pool at the same time won't overcrowd the CPU and introduce unnecessary task switching."
      When you have something like a 6c/12t CPU even the Windows schedulers do a good enough job at minimizing context switching. That's not really the issue. Sure if you're doing a bit of multi-tasking it can become an issue but that's not really what I was talking about. And even with multi-threaded apps, I would think that between the application and the scheduler, the scheduler isn't randomly switching a core from one thread to another. I would think that since many threads are short lived, they run to completion so data can be passed, before another thread is loaded to that core (where even with a 6c/12t CPU, it's viewed as 12 cores). When you move up to 8c/16t CPUs and even more cores, this should get easier for a scheduler to handle.

  • @jessepollard7132
    @jessepollard7132 ปีที่แล้ว

    50 years ago there were multi-processor systems - which did exactly the same thing as a multi-core unit does. The limit then was about 5 processors as a max (mostly due to the memory contention limits you indicated). Some systems got around the contention by using multiple memory busses - and it was up to the programmer (or the scheduler, or both) to avoid the contention by assigning each processor a different memory map (usually the map was in 64KB units but could be larger), thus allowing each memory bus to operate independently without contention with the other memory busses, so a much higher throughput could be achieved. Some motherboards do have parallel memory busses (which tend to require memory modules to be installed in pairs).

    • @jessepollard7132
      @jessepollard7132 ปีที่แล้ว

      YUP. It was Seymour Cray who figured out how to handle multiple processors optimally by using a crossbar switch in the Cray systems produced by Cray Research.

  • @littlemeg137
    @littlemeg137 2 ปีที่แล้ว +4

    The Paracel GeneMatcher had 6,144 cores. The Connection Machine had 65,536 cores.

  • @philipmcdonagh1094
    @philipmcdonagh1094 2 ปีที่แล้ว +1

    You answered everything when you said there was a Boss core. Take the real world: what do bosses do? Slow overall work performance down. Thank you.

  • @Kevin-jb2pv
    @Kevin-jb2pv ปีที่แล้ว +6

    "Can Intel make a processor with 1,000 or more cores?"
    Yeah. They're called GPUs.
    I know a GPU is different as far as what it's designed for, but fundamentally it's the same concept just optimized for different tasks. I'm pretty sure that if you had the time, skills, and desire, you could take a GPU (the chip, not necessarily the whole card) and design a Turing complete computer around it functioning as the CPU. It would suck and be super limited and totally not worth the effort, but it would technically still be a computer.

  • @that.schamp
    @that.schamp ปีที่แล้ว +1

    Some of the information in this presentation is valid and relevant, but the premise is bunk. Part of the problem is: are we talking cluster, computer, socket, or die?
    Clusters - able to apply large numbers of individual computers to a single task - broke the 1000 core mark in the early to mid 90's.
    For single computers, SGI broke 1000 cores with the SN/MIPS arch in the 90's, and their UV2 used Xeons in 4096 core single system images inside of 16k core shared memory systems.
    For both single die and single socket, UC Davis developed a Kilocore processor in 2016.
    It's not that we can't develop these systems - we've done it. They have limited utility, but there is still room for limited commercial success. You're just not going to find 1000 cores in a general purpose desktop computer anytime soon, except in its graphics card...

  • @endurofurry
    @endurofurry 2 ปีที่แล้ว +8

    I had a 9980XE, which is an 18-core processor, but it only gets up to 4.5GHz, so with the new 12th gens and DDR5 I decided to upgrade to the 12900KS, which is a 16-core (8 efficiency, 8 performance cores) at 5.5GHz. Honestly I think my system ran better with the older Extreme Edition than with the much faster newer processor, so it doesn't seem like speed is everything either. I figured the much faster speeds would make up for the few cores lost, but it really didn't. I use this PC for gaming, and most games don't even use more than 4 cores, so my assumption was that faster single-core performance would be better than more cores, but that seemed to be false.

    • @Demon09-_-
      @Demon09-_- 2 ปีที่แล้ว +1

      Eh, you should have seen better performance in games if you were CPU bound. Games these days can and will easily use over 4 cores, and depending on your GPU, the settings and the game you could see quite high FPS improvements. But if you're running a lot of background or other applications, more total performance may benefit you more than having the higher IPC. Not to mention DDR5 is quite meh at the moment and basically equal to fast DDR4 kits.

  • @310_Latchkey_kid
    @310_Latchkey_kid ปีที่แล้ว

    This is my first time watching one of your videos and honestly all I can say is that your answer to all those questions are very comprehensible and easy to understand! Great work.

  • @mryodak
    @mryodak 2 ปีที่แล้ว +28

    LLL: "Computers Can't Have Thousands of Cores"
    GPUs: Am I a joke to you?

    • @hjups
      @hjups 2 ปีที่แล้ว +1

      GPUs technically don't have thousands of cores either. The Titan V only has 80 (the SM is the equivalent to a CPU core, not a "CUDA Core").

    • @mryodak
      @mryodak 2 ปีที่แล้ว +5

      @@hjups SM(Stream Multiprocessor) are just collections of CUDA cores as far as I know. And Radeon calls their stuff Stream processor and they also have thousands of them.

    • @hjups
      @hjups 2 ปีที่แล้ว +7

      ​@@mryodak That's correct. But they are not "cores", they are ALUs. Put it this way.... you can either claim that the Titan V has 5120 cores and the 5900x has 816, or you can claim that the Titan V has 80 cores and the 5900x has 12.

    • @Conenion
      @Conenion 2 ปีที่แล้ว +1

      GPUs don't have cores. That is simply wrong. They have very small computing units, but many. The entire GPU architecture is targeted towards making a single thing fast, i.e. the graphics pipeline. It can be used for some special number crunching stuff (GPGPU) but that is not what the people who designed GPUs had in focus. When programming for a GPGPU you use a very special style of programming and you have to do a lot of things "by hand".

    • @mryodak
      @mryodak 2 ปีที่แล้ว

      @@Conenion CUDA is C++, OpenGL is C++, Vulkan is C. Other than being parallel and having its own instruction set, what's the difference?

  • @1over137
    @1over137 2 ปีที่แล้ว +1

    I know you are simplifying but multiple parallel executions have been possible in single cores for a lot longer than we have had multi-cores. There are many CPU tasks which take many clock cycles. Some of those tasks can be executed in parallel with other instructions. Instruction pipelining, speculative execution etc, all work in single cores resulting in an IPC (instructions per clock) greater than 1. As to whether a hardware context switch could occur within the pipelining ... my understanding is that, "hyper threading" is a relatively recent thing, but it exists.
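    A rough illustration of that instruction-level parallelism (a hypothetical sketch, not from the video): a superscalar core can overlap the independent additions below, but the dependent chain forces it to wait for each result in turn, so IPC drops even on a single core.
    ```c
    /* The volatile reads keep the compiler from folding everything away. */
    #include <stdio.h>

    int main(void) {
        volatile int a = 1, b = 2, c = 3, d = 4;

        /* Independent: a superscalar core may issue these together. */
        int w = a + 1;
        int x = b + 1;
        int y = c + 1;
        int z = d + 1;

        /* Dependent chain: each addition needs the previous result first. */
        int s = a;
        s = s + b;
        s = s + c;
        s = s + d;

        printf("%d %d %d %d %d\n", w, x, y, z, s);
        return 0;
    }
    ```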

  • @AgentSmith911
    @AgentSmith911 2 ปีที่แล้ว +19

    I just discovered a law that is a lot like Moore's law, but for cores. It says that, in theory, we will eventually reach so many cores that it doesn't matter if we add more cores and threads. It's called Amdahl's law.
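    For reference, the usual statement of Amdahl's law is speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel and n is the number of cores. A tiny C sketch with illustrative numbers shows how quickly the curve flattens:
    ```c
    /* Even with 95% of the work parallelizable, 1024 cores only get you
     * about a 20x speedup - the serial 5% dominates. */
    #include <stdio.h>

    int main(void) {
        double p = 0.95;                         /* assumed parallel fraction */
        int cores[] = { 1, 2, 4, 8, 64, 1024 };

        for (int i = 0; i < 6; i++) {
            double speedup = 1.0 / ((1.0 - p) + p / cores[i]);
            printf("%5d cores -> %.2fx speedup\n", cores[i], speedup);
        }
        return 0;
    }
    ```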

    • @matsv201
      @matsv201 2 ปีที่แล้ว

      That law is often misunderstood; it's about compute latency, not performance.

    • @davidolsen1222
      @davidolsen1222 2 ปีที่แล้ว

      Amdahl's law is about the relationship between different performance-based things within a computer. If you take some section that takes 90% of the time and hyper-optimize the crap out of it, so it takes 10% of its previous time, you've managed the amazing feat of speeding up the system roughly 5X, and now you need to optimize the other stuff that didn't take much time before. You end up speeding up one part, and that's good, but then the other parts become wildly more important and you get diminishing returns on those types of optimizations.

    • @jessepollard7132
      @jessepollard7132 ปีที่แล้ว

      already limited to the bottleneck between CPU and RAM.

  • @OverDriveOnline7921
    @OverDriveOnline7921 ปีที่แล้ว +1

    In the world of x86, there have been multi processor systems for many years, I used to fix them in the mid to late 90’s frequently. However back then, the physical limit was 4 processors before system performance was hit, anything more than 4 were divided into sub groups of up to 4 processors and interlinked together with a separate scheduled data transfer architecture (until transputers came along, but that’s another story).
    This limit was overcome in part by adding complex cache systems, and while 8-processor systems were now possible cheaply, there were two issues looming on the horizon: Moore's law and physical space. The answer to keeping the speed bumps predicted by Moore's law? Bung more than one processor on a chip; this helps with space and, oddly enough, power consumption too.
    Further advancements have helped shove more cores, essentially what we used to call our CPU, onto a single chip, boosting performance as we go.
    However, doubling the cores does not double the performance; there are and always will be bottlenecks, which become greater with the more cores added, plus the thermal envelopes that our systems need to run under. In many systems now we get past this by breaking the chips down into multiple chiplets, essentially smaller chips on a single chip chassis, or by adding multiple chips, meaning we've gone full circle.
    Still, it’s been interesting from my view, watching computing develop over my (nearly) 51 years at the time of posting this, with 3nm chips due to become mainstream, whole RAM modules fit into the space of an entire CPU from 4 decades ago.

  • @johndoh5182
    @johndoh5182 2 ปีที่แล้ว +8

    6:00, Thermal efficiency. This is hard to throw into a conversation about core counts because a CPU can be lower speed or high speed. Then you have constraints of a node being used. These things together mean that Thermal efficiency has little to do with how many cores can go into a CPU, or if we want to be more technical, a die or chiplet. If one says for instance that due to a thermal limit of X, this die can only have 8 cores, that not really a true statement. It's more on the line of, due to the thermal limit of X and running a processor at a speed of Y, on THIS node a core chiplet using AMD's Zen 4 X86-64 core should have no more than Z cores. Every node has different thermal limits, and different characteristic which cause ever faster speeds to cause the die to heat up to the point where thermal limits are the main constraint. You can clock Intel's Intel 7 obviously up to 5.3 - 5.5GHz which is consuming a large amount of power but clearly it's not affecting the efficiency of the core to do it's work. What is happening more is POWER efficiency rather than thermal efficiency. On the other hand, TSMC N7 isn't efficient over 5GHz in any way. Maybe this will change over time.
    So thermal efficiency is really an edge issue, not a main issue. I could have a die with 30 cores if I run them at one speed, and only 8 cores if I run them at another speed when loading all cores to 100%. So, that's not a BIG constraint and not one I would have led off with.
    This is a situation that just because someone has put out some data, you have to be careful on how you use that data. It's a neat chart that was shown but only useful for some use cases. There had to have been a lot more data talked about before that chart was shown, or David Henderson from GA Tech is not very sharp. Without talking about all that other data, this point is like my other comment, painting a wrong picture.

    • @AnarexicSumo
      @AnarexicSumo 2 ปีที่แล้ว

      How pedantic. Firstly, it's an issue. Whether you think it's a fringe or main issue isn't really here or there. Secondly, your comparison to a slower processor with more cores being cooler is intentionally arguing in bad faith. All else equal, a processor with twice the cores will run hotter and require more cooling to run at its best. In fact due to inefficiencies they will run *disproportionately hotter*. As a rule, consumer CPUs with more cores require more cooling.

    • @johndoh5182
      @johndoh5182 2 ปีที่แล้ว

      @@AnarexicSumo So what you're saying then is every time you use a new node, the argument changes.
      "In fact due to inefficiencies they will run *disproportionately hotter*. As a rule, consumer CPUs with more cores require more cooling."
      So far these inefficiencies ARE related to clock speed. Every node that every fab makes has a point to where pushing beyond that requires more energy than it's worth for the return amount of work being done by the CPU. AND, this is INDEPENDENT of core count.
      As a rule more cores requires more cooling when everything else is equal. But that's the point. Everything else is always CHANGING! So there are no HARD rules for core count with regards to THERMAL EFFICIENCY. It depends on everything else. It's a secondary point. NOT primary. THAT is the point. And yes that is arguing in good faith. The points made in the video is arguing in bad faith.
      To quote "In full transparency some processors these days"........................... and then proceeds to talk as if it was magical that there exists 64 core CPU, which he simply called "double digit", which I find laughable.
      So yes, thermal efficiency is ONE point, but I could probably put 50 cores of compute power in an Apple iPhone using TSMC N3. I don't NEED to, but because that die is clocked slowly, those tiny cores would be NOTHING at the speed at which they operate. So in that case, thermal efficiency ISN'T a limiting factor for the number of cores that are in the device. And that's why I made the point I did. There's no such thing as a certain number of cores that creates a thermal inefficiency. It depends on too many other factors.
      Here, points made in good faith for the limit of core count:
      Memory capacity. Each core needs to have a certain amount of memory space. What that amount of space is, is widely variable because it depends on applications being run.
      Bandwidth into and out of the CPU. The bandwidth needs to be capable of handling the input or output of data that each core could require. What this amount of bandwidth is, is widely variable because it depends on the applications being run.
      Capability of the operating system. The OS has to be able to schedule processes (threads) for each core. If there are so many cores that a scheduler cannot direct threads to each core because the scheduler is not fast enough to rotate through all the cores, then this is too many cores for that operating system. But this is widely variable and depends on the applications being run because a thread can be short lived or long lived.
      I'm trying to think of limitations and the MAIN one that comes to mind is space constraints. This is a REAL constraint, because it doesn't depend on other factors. So, space. AMD is going to be able to release server and WS CPU with Zen 5 that can have 192 cores, or even more. Based on current space, that's what AMD will be able to do with TSMC N3 with either a server MB or a WS MB. And if you're wondering how I get that figure, N3 triples the transistor density over N7. But AMD could be moving to big-little for Zen 5, and AMD might be moving to L3 cache being off-die and being stacked, in which case based on current space constraints, they could probably get up to 256 cores on a SINGLE Zen 5 EPYC CPU. But they'd have to make other changes to the CPU architecture and other architecture to pull that off. PCIe gen5 even with all the lanes that EPYC has probably won't move data fast enough so it would probably need to be using PCIe gen6, which means the rest of the hardware will need to be PCIe gen6. And then DDR5 with 8 memory channels wouldn't be good enough even at the fastest rated speeds. And, with DDR6 supposedly using the same data word length as DDR5, I highly doubt memory bandwidth would allow for that many cores, for many SERVER applications. You'd have to rely on many of the cores already having cached the instructions they need to run so you don't have a couple hundred cores trying to hit memory at the same time.
      But would "thermal efficiency" be an issue for a 256 core CPU? For a server application using TSMC N3 which uses about 40% less power than N7, where boost clocks are usually in the low 3GHz? No, each core could run very efficiently. Total package power could be exceeded though, and that's not an issue of "efficiency" There isn't a limit because it's not "EFFICIENT" It's a limit because it's too much for that package. I THINK AMD could release a 192 core EPYC CPU, so WAY more than just triple digit, which makes this guy's "double digit" comment a complete JOKE. I THINK that with TSMC N3 and the lower clock speeds of EPYC, AMD can get up to 192 cores with Zen 5 as long as DDR5 has hit much faster speeds (they're at 6400 now) AND you increase memory channels to 12, AND AMD has move to stacking L3 cache and it uses something on the lines of 192MB - 256MB AND the hardware platform is using PCIe gen 6 AND AMD adds 25% more PCIe lanes to the CPU, although maybe the move to PCIe gen6 is good enough to handle the bandwidth needs of that many cores with the existing lanes they have now for EPYC.
      And I hope that helps to clear up your lack of understanding on this topic. If not we'll agree to disagree.

  • @Ryanisthere
    @Ryanisthere 2 วันที่ผ่านมา

    As with all things in engineering, there are trade-offs to just building more,
    such as cost, space, thermals, or other components not being able to keep up

  • @christopherleadholm6677
    @christopherleadholm6677 2 ปีที่แล้ว +3

    "My mom- my momma says bad code is for the devil!"
    - Adam Sandler as Water Boy

  • @hansbaeker9769
    @hansbaeker9769 ปีที่แล้ว

    Around 1990 or so, I got into a big argument with a computer salesman who was of the belief that a computer with two CPUs would run twice as fast as a computer with one CPU. He believed that every task could be automagically broken up into subtasks that would fully use both processors with no loss of efficiency.
    At the time, I was looking at buying several computers for the company that I worked for and I was quite interested in the computers he was selling, but his lack of understanding of what was involved convinced me that I wasn't going to buy the computers from him.

  • @AlejandroRodolfoMendez
    @AlejandroRodolfoMendez 2 ปีที่แล้ว +5

    So far desktop Windows has a limit on the number of cores that can be used; Linux does not. But it's something to consider for the future.
    Maybe when the core limit is reached they will put more emphasis on the number of instructions per cycle.

    • @clovernacknime6984
      @clovernacknime6984 2 ปีที่แล้ว +3

      They did, long ago. That's what pipelining, superscalar, out-of-order-executing processors are all about. However, there are limits to how much you can auto-parallelize a single thread, so they turned to multi-core - which makes the programmer parallelize explicitly - out of desperation, since all other avenues for improvement were exhausted.
      The future is more cores, because we hit the point of diminishing returns for adding more transistors to a single core long ago.

    • @AlejandroRodolfoMendez
      @AlejandroRodolfoMendez 2 ปีที่แล้ว +1

      @@clovernacknime6984 That was attempted seriously on regular CPUs since the Pentium 4; before that it was more of a server and special-purpose CPU thing. RISC did more of it, but at the expense of the operations. Maybe a return to CISC could work too.

    • @Conenion
      @Conenion 2 ปีที่แล้ว +1

      @@AlejandroRodolfoMendez
      Since Pentium Pro around 1995 all Intel CPUs are RISC-like internally. AMD followed. x86 CPUs are CISC from the outside, but internally they use all of the "tricks" that make RISC CPUs so fast.

    • @Conenion
      @Conenion 2 ปีที่แล้ว

      @@clovernacknime6984
      > out of desperation, since all other avenues for improvement were exhausted.
      Exactly. Well said.

    • @AlejandroRodolfoMendez
      @AlejandroRodolfoMendez 2 ปีที่แล้ว

      @@Conenion they weren't full risc tho. But yes they were doing stuff like that before.

  • @jeffreymelton2200
    @jeffreymelton2200 ปีที่แล้ว

    I selected the video based on the name of the channel alone! Brilliant naming of the channel. Anyways the video was very informative. I actually learned quite a bit from it. I appreciate the style in which you narrate your videos. making the subject matter incredibly comprehensive, and digestible. Thank you for the content!

  • @matsv201
    @matsv201 2 ปีที่แล้ว +5

    Intel did make a 1000-core processor... back in 2010... it really wasn't that large, it was a fork of the 386 meant to run graphics code... so an x86 GPU.... It turned out to not really work well... but the processor worked

    • @zredplayer
      @zredplayer 2 ปีที่แล้ว

      A real 1000-core CPU? Do you have proof that it exists?

    • @ultrapetey
      @ultrapetey 2 ปีที่แล้ว

      @@zredplayer en.wikipedia.org/wiki/Larrabee_(microarchitecture)

  • @gandalfdergraue8444
    @gandalfdergraue8444 ปีที่แล้ว

    A very good explanation for CPUs and their cores...

  • @Sourcer3r
    @Sourcer3r 2 ปีที่แล้ว +4

    Multi-hundred-core chips are already running well,
    just in another way than you might think of first: GPUs, or more specifically GPGPU (general purpose GPU) applications.
    Just think for a moment about Ethereum, AI (self-driving cars), rendering or scientific research (protein folding, space analysis).
    Of course, your standard operating system will not boot with just a GPU because the instruction set on a GPU compute unit is very limited.
    This might change in the future: take a look at the Apple M1 or any ARM (mobile) chip... They can run more efficiently in consumer applications because they carry fewer instructions (therefore fewer transistors and shorter paths (wiring) that generate heat).

    • @youtubeshadowbannedme
      @youtubeshadowbannedme 2 ปีที่แล้ว +3

      Just because they run more efficiently doesn't mean they'll give good raw performance. The M1 chips excel in both performance and efficiency because of the way Apple designed them to compete with Intel and AMD in the computer market. It's like how Intel was able to make x86 chips that were practically a knockoff of ARM back then, under the Atom brand. Only when Intel specifically went out of their way to make an extremely efficient x86 CPU could it happen.

  • @frenchmarty7446
    @frenchmarty7446 ปีที่แล้ว

    For a given die size and transistor count, you have to balance:
    1.) More branch prediction and larger cache, things that every program takes advantage of by default.
    2.) More/faster I/O and memory bandwidth, which also consumes die space.
    3.) More pipelining/superscalar operations. Basically parallelism on a single core that programmers get for free.
    4.) More cores/threads, something that programmers have to intentionally design around, has memory overhead (locks), and has diminishing returns for most programs (Amdahl's law).
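    Point 4 is easy to see in code. A minimal pthread sketch (illustrative only): four threads increment a shared counter behind a mutex, so they spend most of their time queueing on the lock rather than running in parallel.
    ```c
    /* Build with: gcc -O2 -pthread locks.c
     * The result is correct, but the mutex serializes the threads. */
    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *bump(void *arg) {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&lock);    /* all threads contend here */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (int i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, bump, NULL);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);
        return 0;
    }
    ```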

  • @HuntingKingYT
    @HuntingKingYT 2 ปีที่แล้ว +3

    "Any computer in the last 10 years" - My pc, Dual-Core i3-2120, 10y/o

    • @saricubra2867
      @saricubra2867 2 ปีที่แล้ว

      My 4 core 8 thread i7-4700MQ made in 2013 looks like a last gen Threadripper in comparison.

    • @youtubeshadowbannedme
      @youtubeshadowbannedme 2 ปีที่แล้ว

      @@saricubra2867 The i7 4700MQ isn't as fast as you think it is, and it definitely cannot compare to the i7 4790K. Your i7 chip is around the level of the i7 2600K at best, but realistically it's probably closer to the i5 2500K. This is of course assuming you didn't win the silicon lottery by a big margin. You would need at least an i7 7700HQ to match the i7 4790K's performance at base speed.

    • @saricubra2867
      @saricubra2867 2 ปีที่แล้ว

      @@youtubeshadowbannedme My i7 outperforms that i5. And yes, it's somewhere between the 2600K and the i7-3770K.
      I never said that it's equivalent to the 4790K.

    • @saricubra2867
      @saricubra2867 2 ปีที่แล้ว

      @@youtubeshadowbannedme 2500K lacks hyperthreading lmao.

    • @saricubra2867
      @saricubra2867 2 ปีที่แล้ว

      @@youtubeshadowbannedme I tested a family member's laptop with the i7-7700HQ and yes, it's kinda a 4790K at stock.
      On average, laptop CPUs are two years behind equivalent high end i7 from desktop, that changed with 11th gen core generation and 12th too, the gap is smaller. For example, the i7-11800H without throttling outperforms the 10700K that was launched before by one year.

  • @thetooginator153
    @thetooginator153 ปีที่แล้ว

    You explained multi-core processing perfectly (IMHO).

  • @triularity
    @triularity 2 ปีที่แล้ว +5

    It's more likely the number of "cores" will keep increasing, but most of them will be specialized (i.e. not full CPU cores with full system access). Instead, there could be a bunch of cores doing something dedicated (but still programmable), such as encryption or compression, in a way where they mostly keep to themselves except when being sent input or outputting results.

    • @mornnb
      @mornnb ปีที่แล้ว +1

      That has trade-offs - you have a large number of cores that can only be used for specific tasks and will spend a lot of time idle, whereas you could be using the transistors for general purpose tasks that can always be used.

    • @CocoaEm
      @CocoaEm ปีที่แล้ว +1

      This already is a thing; there's a dedicated encryption engine on every modern CPU. Some tasks really do need that extra die space to be faster.

    • @DDRWakaLaka
      @DDRWakaLaka ปีที่แล้ว +1

      Like Cell? Which was trash?

    • @triularity
      @triularity ปีที่แล้ว

      CPUs already having encryption engines is a start. And some CPUs do include embedded GPUs for video - but better having it by default, even if there is no display support. Nowadays, going a step more optimized and including a few tensor cores would be useful with ML being more common.
      Maybe even having multi-precision integer math with common functions used in modular math (not just basic add/multiply operations of SIMD),. So newer (or less mainstream) encryption could still benefit and not just be limited to whatever happens to be in the bundled crypto engine. I personally hate it when crypto libraries don't include low level APIs for some standard algorithm.. so when a variant algorithm is needed to support some protocol, it forces developers to practically reinvent the wheel and roll your own from scratch, rather than re-using the existing implementation for most of it - which is just asking for a broken/insecure implementation. So why should it be all-or-nothing for hardware crypto either?

  • @roax206
    @roax206 ปีที่แล้ว

    Though the main problem when increasing performance is power (and thus cooling). The only real use of power for CPUs is the inefficiency of the transistors changing state (bigger transistors = more power loss) and all used power turns to heat.
    As the frequency increases, the voltage required increases, the current scales with it, and the power usage (and heat) grows much faster than linearly (roughly with the cube of frequency, since dynamic power scales with C·V²·f and V has to rise along with f).
    Adding cores simply means using the same amount of power again on the additional hardware, so the power usage scales linearly rather than super-linearly.
    The problem with this is that the software itself must then be specifically designed to run on multiple cores, and on top of the extra development cost, the relative performance of multi-core workloads is dependent on how well these programs can be described as multiple stand-alone programs, and how often each part has to rely on another part for output before it can continue.
    With GPU bound video games, you are dealing with a lower clocked chip with dozens if not hundreds of cores, each with dozens of copies of simplified and specialized versions of the CPU circuitry (making the core smaller so they can fit more on) each connected to a single clock. Given this it is not so much that such software does not use multi-core processors, but in situations where the program can be easily computed in parallel, there is no real reason to use the CPU which has much fewer cores.
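    A back-of-the-envelope sketch of that trade-off, using the common dynamic power model P ≈ C·V²·f and the simplifying assumption that voltage must scale roughly with frequency (illustrative numbers only):
    ```c
    /* Doubling the cores roughly doubles power (same V and f, twice the
     * hardware); doubling the frequency, if V scales with f, costs ~8x. */
    #include <stdio.h>

    int main(void) {
        double C = 1.0, V = 1.0, f = 1.0;         /* normalized baseline */
        double base      = C * V * V * f;
        double two_cores = 2.0 * base;
        double two_freq  = C * (2.0 * V) * (2.0 * V) * (2.0 * f);

        printf("baseline: %.1f, 2x cores: %.1f, 2x frequency: %.1f\n",
               base, two_cores, two_freq);        /* 1.0, 2.0, 8.0 */
        return 0;
    }
    ```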

  • @lawrencedoliveiro9104
    @lawrencedoliveiro9104 2 ปีที่แล้ว +6

    According to the top500 list, the current fastest supercomputer in the world, RIKEN’s Fugaku, has 7,630,848 cores.
    Of course, they’re not x86 cores, they’re ARM. And it’s not running Windows, it’s Linux. That might help.

    • @mikapeltokorpi7671
      @mikapeltokorpi7671 2 ปีที่แล้ว +2

      Not in single silicon, though.

    • @lawrencedoliveiro9104
      @lawrencedoliveiro9104 2 ปีที่แล้ว +2

      @@mikapeltokorpi7671 Not sure why that’s relevant.

    • @Conenion
      @Conenion 2 ปีที่แล้ว

      > That might help.
      Minor. What /really/ helps is that these HPC machines were built with special purposes in mind. These machines typically run algorithms that scale very well. Like for example solving systems of linear equations. Number crunching stuff.

    • @lawrencedoliveiro9104
      @lawrencedoliveiro9104 2 ปีที่แล้ว

      @@Conenion The problems scale, up to a point. That’s why a supercomputer needs a high-performance interconnect which makes up such a big part of its cost.
      If it wasn't for that, a supercomputer would not be much different from, say, a server farm.

    • @Conenion
      @Conenion 2 ปีที่แล้ว

      @@lawrencedoliveiro9104
      True. They need a high-performance interconnect because Amdahl's law would kick in much earlier without.

  • @adrianalanbennett
    @adrianalanbennett 2 ปีที่แล้ว +2

    One can never have too many cores, too much memory, or too much computing power.

    • @daedliy963
      @daedliy963 ปีที่แล้ว

      That's where you're wrong, though; the limit of how much raw processing power actually gets used is extremely fickle.
      Bottlenecks can come from the rest of the hardware not being able to keep up (like the motherboard), or from software too simple to really utilize that extra firepower.
      You'd only be able to use 100% of it to show off.

  • @kyleeames8229
    @kyleeames8229 2 ปีที่แล้ว +6

    I'm just gonna guess before I see your explanation. Firstly, there are actually relatively few computational problems that can be more efficiently solved with lots of parallelization. Secondly, once core counts go above a certain limit, your chip either has to be really big, or you need an unreasonably large cooling system to keep it from melting a hole in your floor. Ok, I'll see if I'm right!

    • @paklekj4429
      @paklekj4429 2 ปีที่แล้ว +1

      Had to refill the liquid nitrogen every 30min lol

    • @thelazarous
      @thelazarous 2 ปีที่แล้ว

      Well the temperature thing has already been kinda debunked. The original Pentium D is a perfect example; 2 cores, 2x the thermal load. But that's not really a problem with modern dual, quad, or even octuple cores. Today 32 cores requires 250w, in 20 years it'll take 25-50w. 20 years ago 8 full cores on a single package was considered stupid as nothing would ever even use them and if they did they'd melt, now I have 8 full cores in my laptop and they spend plenty of time at 100% usage.

    • @harvey66616
      @harvey66616 2 ปีที่แล้ว

      _"there are actually relatively few computational problems that can be more efficiently solved with lots of parallelization"_ -- uh, what? The class of problems suitable to SIMD architecture is quite large. It's been a significant chunk of research for decades. Modern graphics cards exist, and are in short supply, _because_ there are so many useful applications for that architecture, not just gaming.
      Indeed, the neural network machine learning space alone has myriad applications. And that's just one sub-genre of the larger picture.

  • @jimwhelan9152
    @jimwhelan9152 ปีที่แล้ว

    As a kernel developer and designer I claim that symmetric multiprocessing is just as easy and probably easier to do than asymmetric. I always found the communication required for one processor to "control" the others was much more complex than that required to keep the multiple kernel threads from interfering with each other.

  • @kimobrien.
    @kimobrien. 2 ปีที่แล้ว +3

    You can't have unlimited numbers of transistors because eventually you get down to the atomic level. The same goes for clock speed: eventually the distance traveled across a processor from one side to the other approaches a quarter wavelength of the clock signal. Then the distance the signal travels becomes important. The size of a chip is also limited to roughly the size of a fingernail.
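    A rough sanity check of that distance argument (illustrative numbers, taking the speed of light as an upper bound - real on-chip signals propagate slower):
    ```c
    /* How far a signal could possibly travel in one clock period. */
    #include <stdio.h>

    int main(void) {
        const double c = 3.0e8;                        /* m/s, upper bound */
        double freqs_ghz[] = { 1.0, 3.0, 5.0, 10.0 };

        for (int i = 0; i < 4; i++) {
            double period_s = 1.0 / (freqs_ghz[i] * 1e9);
            double mm = c * period_s * 1000.0;         /* mm per full cycle */
            printf("%4.0f GHz: %.0f mm per cycle, ~%.0f mm per quarter wave\n",
                   freqs_ghz[i], mm, mm / 4.0);
        }
        return 0;
    }
    ```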

    • @vadimuha
      @vadimuha ปีที่แล้ว

      There's subatomic level. It's great at parallel computation

  • @a.j.outlaster1222
    @a.j.outlaster1222 ปีที่แล้ว

    Your thermal explanation makes sense. I have long wondered: instead of a structure like Motherboard->Cores,
    what if we tried something like
    Motherboard->Main Cores (like a secondary motherboard)->Cores?
    And I now know that the thermal issue would most likely still apply to that. Thank you so much for this video, it made things clear! :D

  • @MindCaged
    @MindCaged 2 ปีที่แล้ว +4

    I still remember having those single-core processors for years and the really annoying problem where the computer would freeze because whatever program was running got stuck in an intensive processing loop or even just an infinite loop and was basically hogging the single-core to itself not letting anything else run. It was such a relief even when I got my first dual core, and I was wondering where this had been for so many years. Now I have a quad-core and to be honest I have to have a lot of programs running at once to fully utilize it, or maybe I have to find some program that can actually utilize all the cores at once, which isn't that many. Also, even if I could find one, it'd probably hit a different bottleneck in either memory access or file access speed.

  • @ivanscottw
    @ivanscottw ปีที่แล้ว

    Large numbers of cores have some use in certain specific fields - essentially enterprise servers. HPC has pretty much already been taken over by GPUs, which have specialized compute-only cores (with little or no additional or side functionality), but virtualization (multilayered), cloud computing and compartmentalized computing will very efficiently use systems with large numbers of sockets, large numbers of cores and n-way multithreading, using 4 layers of cache and NUMA. Having large systems running multi-layer virtualization with thousands... tens... hundreds of thousands of instances of walled OSes or applications starts becoming power efficient, especially when you start overcommitting resources.

  • @singular9
    @singular9 2 ปีที่แล้ว +7

    You could say that we already have thousand-plus-core CPUs called GPUs 😎

    • @saricubra2867
      @saricubra2867 2 ปีที่แล้ว +2

      Thousands of dumb cores that can't handle everything, meanwhile CPUs have a very small amount of smart cores and they are like a Swiss Army knife and can handle everything.

    • @singular9
      @singular9 2 ปีที่แล้ว

      @@saricubra2867 go be boring somewhere else nerd

  • @lucasdegreef5455
    @lucasdegreef5455 2 ปีที่แล้ว +2

    Best content on YouTube, can you post more videos on ESP-IDF and FreeRTOS?

  • @coalhater392
    @coalhater392 2 ปีที่แล้ว +8

    We do have thousands of cores; it's called a GPU.

  • @mastershooter64
    @mastershooter64 ปีที่แล้ว +1

    >This Is Why Computers Can't Have Thousands of Cores
    My GPU: _Pathetic_

  • @darkshadowsx5949
    @darkshadowsx5949 ปีที่แล้ว

    Nice, someone who got Moore's law right.
    Most people seem to think it can only be achieved by transistor shrinkage.
    In fact the law makes no mention of transistor size, die size, or die count. We still have other ways to double the transistor count in a CPU.

  • @qm3ster
    @qm3ster 2 ปีที่แล้ว

    There's always background work to do, and having fewer context switches improves responsiveness and low-level cache efficiency.