As you mentioned at the outset, the caching needs to be optimised for the workload. IBM's Z-Series is highly targeted to its workload. The shared caching works when you've got a well-known, homogeneous set of services. A (large) x86 server could be hosting VMs all doing different things, and a workstation/PC could be running a game, browser tabs, a video editor, a dev environment/tools and a whole raft of utilities and services... each of these actively fights for cache, invalidating older cache lines. The more discrete services are running, the easier it is for cache lines to be invalidated... with enough discrete active services, at worst you almost have a FIFO stream of once-used data. Dedicated core caches at least provide some barrier against other cores effectively starving the owner core of its own cache (though I am sure IBM have thought of this), therefore allowing threads on the core (which are more inclined to always run on the same core) to run without interruption from other cores/threads. Since Z-Series will have a number of known discrete services running on it at any one time, I would guess the OS will either auto-tune the caching strategy for each of these services or allow it to be defined, like it does the allocation of cores. x86 doesn't really have this luxury due to its broad, multi-purpose nature.
Thanks for the explanation 👍. Perhaps in a few years this type of caching will come to x86. (I think) if it goes to ARM first, x86 will be at death's door if it doesn't come up with its own solution fast!! (as far as I understand the tech)
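A rough way to see the "FIFO stream of once-used data" effect: the sketch below replays an interleaved access trace through a single shared LRU cache and prints the hit rate as the number of services grows. The cache size, working-set size and the round-robin access pattern are all made-up assumptions for illustration, not measurements of any real chip.

```python
from collections import OrderedDict
import random

def hit_rate(cache_lines, accesses):
    """Replay an access trace through one shared LRU cache; return the hit rate."""
    cache = OrderedDict()  # oldest entry first
    hits = 0
    for addr in accesses:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark as most recently used
        else:
            cache[addr] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return hits / len(accesses)

def interleaved_trace(services, working_set, length, seed=0):
    """Round-robin over `services`, each touching random lines of its own working set."""
    rng = random.Random(seed)
    return [(i % services, rng.randrange(working_set)) for i in range(length)]

CACHE_LINES = 1024  # hypothetical shared cache size, in lines
WORKING_SET = 512   # hypothetical lines touched by each service

for services in (1, 2, 8, 32):
    trace = interleaved_trace(services, WORKING_SET, 200_000)
    print(services, "services -> hit rate", round(hit_rate(CACHE_LINES, trace), 3))
```

Once the combined working sets no longer fit, the hit rate drops sharply, which is the contention the comment is worried about.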
@@orkhepaj Video games are not usually optimized for multicore performance. So if someone with a PC was playing a game but not really using other programs at the same time, the L2 caches of the other cores likely wouldn't be used very much meaning that there wouldn't be a lot of cache misses on the virtual L3 cache, in addition to the fact that the core running the game now has a much larger L2 cache. Things might get more complicated for streamers though who do have some extra compute going on to process and encode their video/audio inputs.
It seems like this would only make sense if the cores are out of sync. By that I mean if they all have similar loads then their L2 would tend to fill in parallel, reducing the virtual L3 available.
The kind of software that IBM mainframes run really doesn't scale to many cores so I think it is a safe assumption. These systems are usually running many heterogeneous workloads.
Yeah, for games might be better to stick with a big L3 cache, bc there would be a lot of redundant shit on L2 from the fact that games are coded to be really good at multithreading.
@@MaxIronsThird maybe not though cause (not sure if he touched on this but I'm not gonna rewatch) it would depend on where you're processing it. The time it takes to kick something out of the core's cache wouldn't really matter while you're not processing it, but if you can just process it in its new adjacent core when you need it, it would be faster than sending it back to the original core, and faster than jumping up and down layers of fixed cache. AMD has shown a little bit of extra latency seems worth it to gain a lot of shared cache, so this system should have the best of both worlds.
@@Freshbott2 So in theory what gets kicked out to VL3 essentially becomes L2 again if the adjacent core is able to process it. I wonder if it would also be possible for something like Intel's big.LITTLE concept to sandwich their smaller cores between each 'pair' of L2...
I loved the "Cache me outside, how about that" :D Imagine a Zen with each core having a 12.5 MB+ L2 cache, now that's a Gaming CPU for us "cacheual gamers" ;)
@@vyor8837 the sole idea of having an l1 cache is to make things a lot faster. then theres l2 cache as a backup for the tiny l1 cache, and then theres the l3 cache which is huge but slow. i didn't even think an l3 cache could be important to have because i thought it was super slow when i realized it existed. i still think a massive l2 is better than having l3, but if a massive l3 is what i have, i won't complain. is still better than just having l2 but again, i'd rather have a massive l2. i think is faster than go checking l3 or l4.
@@white_mage 12.5mb is smaller than what we have now for cache. Amount beats speed, power draw matters more than both. L2 is really power hungry and really large for its storage amount.
@@vyor8837 you're right, i didn't realize it was that big. if most of the IBM chip shown in the video is cache then l2 has to be huge. what about l3? how does it compare to l2?
...that was an awful, naughty pun and you should feel bad. Stop that! STOP GIGGLING, DARN IT! o.@; *IsKidding, that was terrible-in-good-punny-ways pun, which means it was awful but a pun, so a good-awful! Does bet such a naughty punster is giggling, though!*
Yet the implications the good Potato points out are that Intel and AMD may well exploit this IP in their own architecture, which implies the same problems with predictive branching that are at the heart of spectre exploits.
We all probably want to see lots of graphs and diagrams that show all the performance characteristics :D The idea of using virtual L3/L4 caches inside massive L2 caches is really brilliant. I would really like to explore this in my future master thesis that I will start in about two years.
Is it so brilliant for the coder though? This will look like a massive distributed L3 with less predictable access times, rather than a typical L2 because the L2 is now inherently shared.
@@RobBCactive perhaps, but often times you end up working on legacy projects in which the stakeholder wants some refactoring done fast (often out of necessity). At that point, you don’t always have the bandwidth to tailor the code to the platform. Also, often, the code may be in an interpreted language (either moving from or to). That’s code you can't hand tailor as easily for a cpu as C or assembly. My opinion is that a lot more code will passively benefit from this cache redesign than not.
@@kiseitai2 But it's inherently giving less control, the kind of code you're talking about relies on L3 anyway; it's not the kind of high performance code that tunes to L1 & L2. Besides you can query cache size and size accordingly for the chunks you do heavy processing on. It's not as hard as you make out to dynamically adjust. I just think people need to recognise that there's potentially more cache contention and it may not be reasonable to process data in half L2 chunks. I don't really see only advantages of a very complicated virtual cache system over L2 per core backed by a large L3 which follows data locality principles. There's been enough problems with SMT threads being able to violate access control already.
@@RobBCactive I think it is true that the access times will become less predictable because the size and partition of the L3 cache is dynamic. But what if we give the software some control over how to partition the virtual L3 cache? It could make the access times more predictable, but the OS could also move threads faster from one core to another. Also a core could be put to sleep and then the associated cache can be configured as a non-virtual L3 cache. You see there are lots of things that make this interesting and could bring some advantages so that the disadvantage of unpredictable access times can be minuscule.
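For what "query cache size and size accordingly" might look like, here is a minimal sketch. It assumes a Linux/glibc system where Python exposes the `SC_LEVEL2_CACHE_SIZE` sysconf name (not every build or OS does, hence the guard), and both the fallback size and the "half of L2" fraction are assumptions picked purely for illustration.

```python
import os

FALLBACK_L2_BYTES = 256 * 1024  # assumed default when the OS reports nothing useful

def l2_cache_bytes():
    """Best-effort L2 size in bytes: ask sysconf where available, else use the fallback."""
    name = "SC_LEVEL2_CACHE_SIZE"  # glibc extension; may be absent on other platforms
    if hasattr(os, "sysconf") and name in getattr(os, "sysconf_names", {}):
        try:
            size = os.sysconf(name)
        except (OSError, ValueError):
            size = -1
        if size and size > 0:
            return size
    return FALLBACK_L2_BYTES

def chunk_elements(element_bytes, fraction=0.5):
    """Number of elements per work chunk so that a chunk fills `fraction` of L2."""
    budget = int(l2_cache_bytes() * fraction)
    return max(1, budget // element_bytes)

# Example: how many 8-byte values to process per chunk
print(chunk_elements(8))
```

On a design where L2 is also lending space to other cores, the reported size is an upper bound rather than a guarantee, which is the contention point raised above.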
@Fabjan Sukalia : Computational RAM is the future. Compute-in-Memory architectures will compete against and overtake Multicore designs- even those having gigabyte-sized caches.
Virtual L3/L4, amazing. Wonder how coherence is communicated across cores and across chips, and how virtualization of apps affects flushing/invalidation of these virtual caches. Amazing engineering!
L2 / virtual L3 sounds more like something perhaps universally applicable. L4 sounds like something only useful in the context of the Z architecture. Consider that all those reliability guarantees must add quite a bit of overhead on the Z system. Not sure this will still apply to something like a dual socket x86. Still, the tech behind it must be absolutely impressive.
1: I would rather describe the cache layout the other way round. The L1 cache is the one you work with, and anything you don't need is stored in RAM. This way the importance of cache is more pronounced, instead of sounding like just another 2% performance improvement. 2: Branch prediction is used for optimizing small loops (size dependent on µ-op cache size) and decreasing pipeline stalls. What you meant was cache prediction.
This sounds great for mainframe style applications but sounds like a nightmare for any use case where you need to worry about side channel attacks from other processes.
I do wonder if this, ever so slightly, increases attack surface by basically making every private cache line globally accessible. I wonder if there might be even better side channel attacks enabled by this.
The sad reality is that the faster you make the CPU cores, the more of such attacks will be possible. The industry might not be ready to admit it yet but speed is most likely not compatible with security.
IBM could design it so that the private cache is encrypted and the memory address is hashed and salted. This way a side channel attack might extract the private cache line data, but the attacker would still need the keys to unlock it, and those should only be kept in the core.
@@kungfujesus06 The goal is not to make a bulletproof system but to make the exploits extremely hard to implement. If each logical thread has its own unique keys and salts, then exploits need to string together multiple vulnerabilities instead of needing only one. Also, extracting keys from the core will be harder than extracting data from a shared cache. The problem with a secure enclave is that it outsources security to a separate dedicated core instead of each core handling its own security. This reduces the attack surface but also makes security a lot more fragile. You only need to crack that one security core and everything becomes vulnerable.
Super fast L1 with high associativity. Massive L2 cache that can work as L3 at times the way IBM designed it. V-cached L2 or L3 and a chiplet with decent, massive L4. That would be godlike.
Agreed. With the new packaging tech from Intel, the next few CPUs could be some interesting designs, testing out new tech and designs until they find the right optimizations and features that make CPUs 2x the current performance.
I'm old enough to remember when ARM was used in RISC PCs, they went from ARM610 40MHz to StrongARM 203 with 16Kb of cache, the speed up was way way more than what the raw MHz would indicate. The BASIC interpreter now fitted in cache. Programs ran so fast it was insane. Some games had to be written/patched as they were unplayable. All down to the cache.
Two random thoughts: The virtual L4 to me implies that socket to socket access is still faster than socket to RAM. Is that right? I guess that also considers NUMA. How does L4 compare to local RAM? Recently we see Intel go relatively bigger L2/core while not so much in L3, and AMD is going bigger L3/core with relatively small L2. Wonder what the thinking is behind that design choice.
From what I remember, the reason AMD had the big L3 cache when they made Zen was so that no matter how many cores were being used, ALL of the L3 would be utilized to maximize performance.
A lot would depend on the TLB structure and architecture, as well as what your latency penalties are for going through the cache levels. He went into a lot of the decisions that can be made with caches and their size. It's clear that AMD made an early decision in Ryzen's life cycle that L3 would be their low latency bulk storage when compared to DRAM. Considering the design inherently increases DRAM access time (relying on a GMI between cores and IO dies) they needed a way to make up that performance loss. Also, with their announcement of v-cache/stacked silicon, they have doubled down on this concept. My guess is that going to the chiplet design helped force their hand in this direction due to size constraints, where it makes more sense to reduce the cache bottleneck at the die level as opposed to the core level. Also, you have to be very confident of core utilization to sacrifice die area for L2 in place of L3, as that space is wasted if the core isn't being used. This is where IBM's design is very interesting, but it would be interesting to see the latency of the virtual L3 in the worst case (furthest cores apart). Maybe Zen 4 or 5 brings stacking from the get go, allowing the base chiplet to have larger core specific cache (L1 and L2) with the upper die being used like v-cache for L3.
@@winebartender6653 Also fetching data from DRAM costs more power, they were explicit about this when explaining the Infinity Cache for RDNA2. Not only a narrower bus to save die space but the cache improved power efficiency.
I hope that marking l3/l4 works, if not this seems like a meltdown madness. Even if that virtual l3/l4 won't be migrated to our CPU's for security reasons, Massive l2 would be interesting to see.
Thank you for your amazing content Ian. Even as a computer scientist (albeit focused on software only), I still learn new things from most of your videos!
I think it might work very well when not all 8 cores need to use their entire L2 cache at the same time. Games usually have very uneven workloads between different threads, so it will probably work great for games.
This sounds like a more practical way of using space. If you lose some cycles on the L2 but the gain is that the L3 is faster and you save space, a bigger L1 could be done as well. A bigger L1 and a faster virtual L3 might cancel out the latency loss of the slightly slower L2. We will see if Intel and AMD start to adopt this new idea.
This is nice. This both allows L2 and "L3" and "L4", in bigger amounts than before, but also the opportunity for a single threaded workload to have a very nice very big 32MB of L2.5 cache. And still more L3 and L4, if the other cores are barely doing stuff. Huge increase either way.
The video doesn't point it out but the different caches use virtual or physical addresses. Code uses its virtual addresses to run at full tilt, but to share cached memory between processes a virtual to physical address translation using the process memory mapping is required. This is the reason L1 cache can be very low latency, while L2 & L3 are slower. Arguably IBM appear to be using a distributed L3 cache, going by the size. If you're coding high performance code, you really want to fit in the core's L2 for the main work, accessing L3 and RAM in predictable ways so the processor can anticipate and pre-fetch.
Hmm, I don't remember ever seeing a motherboard with that feature! All the ones I recall using had two 256k SRAM chips soldered to the motherboard. And for some reason it was common enough to have the L2 cache go bad that the BIOS had the ability to bypass the L2 cache for when that happened, albeit with greatly reduced performance. And of course you will remember that back then too the memory controller wasn't in the CPU but the Northbridge. So we had a situation that today sounds insane: when the CPU needed data it would check its L1 cache, if not there it would check the (external) L2 cache, and if not there it had to ask the memory controller on the Northbridge to go to RAM to fetch the data and send it to the CPU. It sounds so barbarically crude now it's almost funny. But back then in the stone age process nodes were so huge there just wasn't room to put everything on the CPU. And if you wanted integrated graphics so you didn't have to buy a graphics card, did you buy a CPU with a GPU built-in? Of course not, that was also the job of the Northbridge! LOL! There are a LOT of things I can say I miss about "the good old days" but that stuff is NOT part of it! ;-)
The Tilera chips already used this idea - the private L1 caches could also be used as part of a chip-wide L2 cache. I myself have used this in some of my projects since 2001. Some of the ideas that make this work can be found in en.wikipedia.org/wiki/Directory-based_coherence
I just wanted to say that your videos are fantastic! Really well explained and educational to watch. I never understood what L1, 2 or 3 Cache ACTUALLY did. Until now.
A really nice analogy for caches is the Librarian. He keeps a handful of the most recently requested books on top of his desk, a stack of recently used books underneath his desk, a rack behind his desk with the most popular books in the last few days, and finally the main library where the books are stored more permanently. As the Librarian gets a book request, he puts the book on his desk and moves the oldest book under his desk, then moves the oldest book there back to the shelf behind him, and then puts the oldest book on the shelf back into its place in the library. Here it is easy to see that if the librarian keeps too many books on his desk he's going to have difficulty getting his work done - the books will be in the way. If he's got too many books under his desk it will slow him down as well, but he's got more room to arrange this area. The shelf behind him has even more flexibility in its size and arrangement. That's a CPU and its L1, L2, and L3 caches, plus its main memory.
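The Librarian can be written down almost literally. This is a toy sketch of the analogy only: the three level names and their capacities are invented, each level is a simple oldest-first queue, and a book pushed off the last level just "goes back to the library". Real caches are set-associative and use fancier replacement policies.

```python
from collections import OrderedDict

class Level:
    """One place the librarian keeps books, oldest first."""
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.books = OrderedDict()

class Librarian:
    def __init__(self, capacities=(("desk", 2), ("under desk", 4), ("shelf", 8))):
        # Hypothetical sizes standing in for L1, L2 and L3
        self.levels = [Level(name, cap) for name, cap in capacities]

    def request(self, book):
        """Return where the book was found, then put it on the desk."""
        found = "library"
        for level in self.levels:
            if book in level.books:
                del level.books[book]
                found = level.name
                break
        self._place(0, book)
        return found

    def _place(self, i, book):
        """Put `book` in level i; push that level's oldest book one level down if full."""
        if i == len(self.levels):
            return  # fell off the shelf: back to its place in the main library
        level = self.levels[i]
        level.books[book] = True
        if len(level.books) > level.capacity:
            oldest, _ = level.books.popitem(last=False)
            self._place(i + 1, oldest)

lib = Librarian()
for title in ["a", "a", "b", "c", "a"]:
    print(title, "found in:", lib.request(title))
```

The last request for "a" is the interesting one: it was pushed off the desk by "b" and "c", so it comes back from under the desk rather than from the library.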
Ok as someone who worked in IBM database systems (Db2 for Linux and Windows for nearly 30 years, a couple more in SQL/DS), a mainframe isn't the only way that your banking transactions can be kept safe. This is software, specifically the write ahead logging of your transactions and then adding things like dual logging, log shipping for high availability disaster recovery, fail over for nodes etc. We would commonly have 32 to 256 database partitions on 4 to 64 physical machines presenting to the end user as a single database for instance. On with the show.
Per Jim Keller, ~ "performance can be reduced to branch prediction and data locality" ( And I Paraphrase ). Best most concise rule of thumb for all hardware and all software design IMO.
The L3 cache on my old 7900x has the same transfer rate as my DDR4 main memory. It seems that traditional L3 cache is obsolete with today's faster RAM.
I like the toolbelt analogy for cache. The CPU registers are the tools in your hands, and you work directly with them. L1 is your toolbelt, bigger, but slower. L2 is your toolbox, you need to restock your toolbelt from there but it can hold even more tools. L3 is your truck. Lots of storage but you need to go all the way back to your truck. DRAM is the hardware store, almost all the tools available, but very far away. You lose hours just going over there and back. The Hard Drive is China, absolutely all tools, but takes months to get the one you ordered.
12:32 Well, simply put this complements mainframe setup perfectly. I mean, even older MVS concepts like shared/common virtual memory, or the coupling facility, they all enabled Z to perform like no other infrastructure while still maintaining a high degree of parallelism and ridiculously low IO delay. Ironically, despite the "legacy" label being often attached to mainframes by the press, Z still remains one of the most sophisticated and advanced pieces of technology, to this day improved by each iteration.
I think you forgot about registers in your cache level story. Most modern GPU's have a large register section for the immediate access to data. L1 cache has typically 3-5 clock cycle delays, registers are immediately available. Typically the register section of a GPU is much larger than that of a CPU, so it does come into play more than registers do for CPU's.
Not gunna pretend to know chip design too much but from the stuff I watch, Apple’s M1 is really good because it keeps its cores fed due to its design approach (the whole 8x instruction decoders, large caches etc. ). Is this IBM approach not similar in that it attempts to look at the problem differently than others and looks to utilise its resources better/more efficiently rather than just adding more resources?
That's genius, but the potential problem I see here is: how are they pushing out old data? We won't want cases where a core doesn't have enough space in its private cache because it's full. So it goes to store elsewhere, but its private cache was filled with old data from other cores which would not be used much again. A good amount of planning needs to go in there (which I suppose IBM engineers already did). However, I wonder if the virtual L4 helps at all. Shouldn't it be as slow as memory, since it's outside of the chip anyway?
It's pretty awesome that you used an AMD K6 III+ in your video. It's the first CPU to use level 3 cache. Of course it was off chip level 3 cache. But it was huge compared to what was available in level 2 cache of the time.
In my understanding it would only work if the records are being evicted from the caches not only due to being replaced by the newer data but also due to some sort of time of life eviction and possibly other methods because if you only have eviction by replacement with newer records then all L2 caches will be full after a short period of time and there will be no space in any of them to be allocated as virtual L3 cache. Also this time of life and other kinds of eviction should not evict it to some other core's cache but rather erase it because otherwise you'll just have all cores ping-ponging the same old data.
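To make the argument concrete, here is a tiny sketch of the time-of-life idea. It is purely the commenter's hypothetical policy, not anything IBM has described: each line in a core's L2 has a last-used timestamp, and lines older than a time-to-live are simply dropped, which is what frees slots for neighbours to borrow. The timestamps and capacity are made up.

```python
def free_slots(last_used, capacity, now, ttl=None):
    """Slots a neighbouring core could borrow from an L2 whose lines were last used at `last_used`.

    With ttl=None, lines only leave when replaced, so a full cache lends nothing.
    With a ttl, lines idle for longer than ttl are discarded (not migrated elsewhere).
    """
    if ttl is None:
        live = len(last_used)
    else:
        live = sum(1 for t in last_used if now - t <= ttl)
    return capacity - live

# Hypothetical 8-line L2: three stale lines, five recently used ones
stamps = [0, 1, 2, 90, 95, 97, 98, 99]
print("replacement only:", free_slots(stamps, capacity=8, now=100))
print("with ttl=20:     ", free_slots(stamps, capacity=8, now=100, ttl=20))
```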
Lines are on a queue to be written back to RAM before they leave L1. You can safely discard anything in cache on most designs without impacting correctness. The hard part tends to be picking which lines to discard 🙃
I'm impressed that you seemingly have the money to buy a multi socket system. Us mere mortals might as well wait another decade before these systems become affordable, at which point, I presume, the requirements for common applications will have appropriately expanded to fill that "infinite amount of cache" rather quickly.
Think of it a different way: if AMD can put a large L3 V-Cache on top of the processor die, it can redesign all the on-die L3 cache as L2. You could have a huge L2 cache and a huge L3 cache.
IBM loves to sell chips with say 8 cores, even if you get a 4 core system as you can enable those other cores at a later time after paying IBM more money. Now the question is - will those unlicensed inactive cores also disable the l2 virtual l3 cache upon those inactive cores or will the way IBM locks cores off still leave that L2 cache exposed and enabled for the virtual L3 cache? What I find interesting about IBM's is the whole memory controller approach and was in effect reprogrammable for future memory interfaces, that's pretty epic.
They really need to hurry up and make dies cube shaped with microholes in it for liquid cooling (millions or billions of layers instead of just 17). Cube vs square = both more cache and lower latency.
I had no idea that "mainframes" still exist. I thought modern banking systems simply used stacks and stacks of standard server chips. I had no idea there was a whole different processor design for modern mainframes.
Virtual cache sounds interesting, but I think I'd need to see examples of how cache eviction works in practice for some realistic loading scenarios before I get excited. Particularly for workloads we think it'll help in. For example does this design cope well with games that would benefit more from larger dedicated shared cache? Or does the design mean that the virtual L3 and L4 is squished out to make room for L2 when that wouldn't be optimal for the load in question?
Love the educational part. Really excited for the upcoming technology. But IBM processors really don't end up in consumer products, right? However the technology might
Apple M1 does not have an L3 cache. CPU, GPU and all the RAM are on the same package. Close together. No connectors, pins, sockets and copper traces on a motherboard. Very similar.
You omit one thing: the type of memory cells used for these caches. In 1995, IBM had a UNIX (AIX) server that used static RAM for cache and it could be expanded. In line with your point, throwing more cache into a base model made it faster than the next up model with a more powerful CPU. The nature of DRAM is that it is unstable and slow. That static RAM had less than 1/8th the latency of DRAM. And it was very expensive - probably because of very low production volumes. I always wondered why the DRAM in my PC had not gotten replaced by static RAM in the past decades. Oh, mainframes - next to billions of smartphones and PCs, and a couple Macs and iPhones, there were about 10,000 IBM mainframes in use in 2019.
This really came at the right time, I've been learning about cache lately and I didn't realise how much I didn't know :) I really thought it was as simple as more cache = more better.
Super cool idea on IBM's part. So good to see there are still active and commercially viable processor designers out there besides Intel and AMD. When I went through my Computer Engineering undergraduate work, I learned about the layering process when manufacturing silicon chips. Expose the surface to a gas, let it dry and form a masking layer, shine a light through a film mask and through a lens onto the surface to burn a pattern through the masking layer, expose that to a doped gas to add doping to the exposed silicon, use an acid to remove the mask layer. Rinse and repeat with a different mask and maybe a different doping for a different purpose. At the time, they were doing 7 to 13 layers. I think often there is a grounding layer on the bottom and a power layer on top, with holes poked through the grounding layer for signal leads. I think adding more layers had diminishing utility because of cross talk (flowing current induces a magnetic field around it, which can, in turn, induce current flow in nearby conductors, thereby interfering with any signals going through those nearby conductors) and heat dispersal (current flowing through a conductor that has some resistance produces heat and consumes power). It seems like both of those issues are a matter of materials research. If you can get very high resistance in a material that is achievable through a gas exposure process and that conducts heat very well, you can disperse heat and prevent cross talk at the same time. And if you can get very low resistance, you can get more current through with less voltage and not only generate less heat and save power, but also charge up capacitance faster and thereby increase signal speeds.
If you could get a semiconducting material that was superconducting when/where it is in its conducting state and has very high resistance (enough to basically completely block current flow) in its non-conducting state, that would really win the game, because your power requirements would be incredibly low, your signal speeds incredibly high, and there would be no heat to disperse. I guess that would be the holy grail for this stuff. Anyway, the point is, it seems like a lot could be achieved by materials research to reduce heat and cross talk and then by adding a lot of layers. Imagine having a cube instead of a flat die. Right now dies are 100mm squared, and about 0.5 mm thick. Some are 0.275mm thick. If you had a die that was 100 mm cubed.. that could potentially theoretically multiply the transistor count by up to 200 or so. Of course it would also take many times longer to produce a wafer, so I guess that's an enormous cost problem. Which I guess probably means that nobody is working on anything like that. But when our processor chips are super-semi-conducting and are cubes (or honey-comb shaped?) instead of flat chips, that's when AI robots will speak fluently and be indistinguishable from humans over audio-only communication channels and be capable of taking over the work of world class human doctors, scientists, artists and engineers and doing a better job of those things.
“Possible” that IBM has patents out the wazoo? Hahaha it’s guaranteed that they do. When I partnered with them on a homomorphic encryption system I built, they tried to convince me to patent it, despite being a pretty simple implementation of the Paillier algorithm lol
cache to VL3 means throwing it from one memory controller to another and a cycle taken to do so, which means two memory controllers would have to wait a cycle to do anything else. I can see the benefit, but also possible issues. cross talk might stall other things til it is done and possible attacks to force a leak of memory from one to another to then be read? be interesting to see where they all go with all the new cache things coming around.
And here I was, thinking that future AMD designs might incorporate a L4 cache on the IO die. Now I am not so sure about that anymore as there seem to be more options.
That would seem like the logical next step after 3D stacking (as far as I understand it). But making a 'virtual' cache pool will be a LOT faster than L4, especially if it is like this with L2: increasing size while drastically reducing latency.
AFAIK cache uses a fixed mapping between main memory addresses and cache addresses to maximize search speed. Which means that useful data is never packed densely and there are lots of gaps filled with vacant cache memory all the time. Sounds like an excellent resource to turn into a virtual L3/L4.
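The "fixed mapping" is usually a set index taken straight from the address bits. Here is a minimal sketch, assuming a 64-byte line and 512 sets (both numbers are illustrative assumptions, not any particular CPU), showing how an address splits into tag, set index and offset, and how a strided access pattern can pile into one set while the others sit empty.

```python
LINE_BYTES = 64   # assumed cache line size
NUM_SETS = 512    # assumed number of sets

def split_address(addr):
    """Split an address into (tag, set_index, offset) for a set-indexed cache."""
    offset = addr % LINE_BYTES
    set_index = (addr // LINE_BYTES) % NUM_SETS
    tag = addr // (LINE_BYTES * NUM_SETS)
    return tag, set_index, offset

def sets_touched(addresses):
    """Distinct sets that a list of addresses lands in."""
    return {split_address(a)[1] for a in addresses}

stride = LINE_BYTES * NUM_SETS                # addresses this far apart share a set
strided = [i * stride for i in range(16)]
dense = [i * LINE_BYTES for i in range(16)]   # consecutive lines spread out

print("strided accesses use", len(sets_touched(strided)), "set(s)")
print("dense accesses use  ", len(sets_touched(dense)), "set(s)")
```

So gaps do exist whenever the access pattern is uneven across sets; whether hardware can lend exactly those gaps to another core is a separate question this sketch does not answer.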
Make me think of Intel's use of register renaming, technique that abstracts logical registers from physical registers, plus Intel loves cache on their chips, this already sounds like it's an Intel Technology.
Everyone loves cache, because memory access is relatively much slower compared to CPU cycles. Intel have reduced cache sizes (P4) and love market segmentation with cache being a differentiator.
@@WhenDoesTheVideoActuallyStart There used to be a web site (behind a paywall, I forget its name) that would talk about problems with Intel CPU manufacturing process (before the +++++ stuff) and that site would always talk about Intel always wanting to be a memory company so their half ass processors always had a ton of cache to paper over what a crap processor they made, this being a technology to add a CRAP ton of cache to a CPU thus sounds like an Intel technology to me, to paper over the fact Intel is a crap CPU company that is a monopoly and the only thing keeping it in business is monopoly power and cache memory to make it look like an innovator.
@@DaxVJacobson Every modern processor has shittons of on-die cache, because they're all wide 5GHz data crunching machines, while the basic DRAM cell hasn't changed much for decades (besides the shrinking), and still has the same old access latency, which is monstrous when compared to modern on-die SRAM. Every year that passes DRAM is more and more of a restriction to the performance of CPUs, and pre-fetching as much data as possible into faster SRAM caches is one of the only solutions besides magically eliminating Amdahl's law or having STT-MRAM catch up with the 5nm node.
@@DaxVJacobson Companies have been trying to add a CRAP ton of cache to a core since the Arm days of the 80's. Intel sold many Haswell processors with 256MB of cache to the financial market.
In the multi-core era, Intel processor caches were inclusive, so whatever is in L2 is also in L3. On an L2 miss, it is only necessary to look in the shared L3 without worrying about what's in other cores' (private) L2. With Skylake (SP), L2/L3 became exclusive (AMD did this earlier?). On initial load, memory goes straight to L2. On eviction from L2, it goes to L3. To implement this, each core's L2 tags are visible on the ring. This being the case, L3 is no longer particularly necessary - however, it may be desirable to keep L2 access time reasonably short.
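A toy model of the two policies, in case it helps. This is a single-core sketch with plain LRU levels and invented sizes; it is not how any Intel or AMD part is actually built, it only mirrors the fill and eviction rules described above (inclusive: a miss fills L3 and L2, and an L3 eviction also removes the line from L2; exclusive: a miss fills L2 only, and L2 victims drop into L3).

```python
from collections import OrderedDict

class TwoLevel:
    def __init__(self, l2_lines, l3_lines, exclusive):
        self.l2, self.l3 = OrderedDict(), OrderedDict()  # oldest line first
        self.l2_lines, self.l3_lines = l2_lines, l3_lines
        self.exclusive = exclusive

    @staticmethod
    def _insert(level, capacity, addr):
        """Add addr as most recent; return the evicted line, if any."""
        level[addr] = True
        level.move_to_end(addr)
        if len(level) > capacity:
            return level.popitem(last=False)[0]
        return None

    def access(self, addr):
        """Return where addr was found: 'L2', 'L3' or 'memory'."""
        if addr in self.l2:
            self.l2.move_to_end(addr)
            return "L2"
        if addr in self.l3:
            where = "L3"
            if self.exclusive:
                del self.l3[addr]            # line moves up, no duplicate kept
            else:
                self.l3.move_to_end(addr)
        else:
            where = "memory"
            if not self.exclusive:           # inclusive: fill L3 as well
                gone = self._insert(self.l3, self.l3_lines, addr)
                if gone is not None:
                    self.l2.pop(gone, None)  # back-invalidate to keep inclusion
        victim = self._insert(self.l2, self.l2_lines, addr)
        if victim is not None and self.exclusive:
            self._insert(self.l3, self.l3_lines, victim)  # L2 victim drops to L3
        return where

    def unique_lines(self):
        return len(set(self.l2) | set(self.l3))

for exclusive in (False, True):
    c = TwoLevel(l2_lines=4, l3_lines=8, exclusive=exclusive)
    for a in range(12):
        c.access(a)
    print("exclusive" if exclusive else "inclusive", "holds", c.unique_lines(), "unique lines")
```

The exclusive arrangement holds more distinct lines for the same silicon, which is part of why a big L2 plus a victim-style L3 (or a virtual one) starts to look attractive.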
You did a marvelous job in explaining and simplifying the subject matter, and I think many a layperson who has even a minimal grasp of system/processor architecture would not fail to grasp what you've presented.
One thing to think about is that the virtual caches will never be as big in practice as they are in theory. L2 in use subtracts from available L3, and both L2 and L3 subtract from L4. On multi-core loads with heavy L2 usage, L3 and L4 could effectively become tiny. Depending on the cache management software, the model where cache size drops with cache number could be turned on its head and cause problems. The cache management software's design and performance will be a key player in making or breaking this technology in the field.
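The "in theory vs. in practice" point above can be put into rough numbers. This is a back-of-the-envelope sketch, assuming the eight cores and 32MB of L2 per core mentioned elsewhere in this thread; the function name and the per-core usage figures are hypothetical, and the real replacement policy is ignored.

```python
def virtual_l3_mb(private_use_mb, l2_mb=32.0):
    """Space left over for virtual L3 lines once each core's own L2 demand is met.

    private_use_mb: how much L2 each core wants for its own data (hypothetical input).
    """
    return sum(l2_mb - min(use, l2_mb) for use in private_use_mb)

idle_chip = [0.0] * 8                # theoretical best case: nothing private in use
one_busy_core = [32.0] + [4.0] * 7   # one heavy thread, seven lightly loaded cores
all_busy = [32.0] * 8                # every core fills its own L2

for name, load in [("idle", idle_chip), ("one busy", one_busy_core), ("all busy", all_busy)]:
    print(f"{name:9s} -> {virtual_l3_mb(load):6.1f} MB usable as virtual L3")
```

Under these assumptions the headline 256MB only exists on an idle chip, and it shrinks to nothing when every core is working out of a full L2.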
Cache is ultra important for future CPUs, and the elimination of L3 by IBM that you highlighted is something Apple did a long time ago with their Axx Arm-based CPUs, where they completely dumped L3 and enlarged L2. Having a virtual L3 cache made out of L2 cache is definitely the way to go, but we are still not where we should be with system memory running as fast in clocks as the CPU itself. Future computers and CPUs need to have memory as fast as the CPU itself to achieve the greatest possible efficiency, ideally with that memory also being non-volatile. Having RAM as fast as the fastest cache, with capacity and non-volatility, sounds like sci-fi, but it's what we would need for the future. Thank you for highlighting this new design from IBM, it's really something special and proves again that IBM is a superb CPU design company!
This concept reminds me of a paper I wrote on virtual scratch pads back in 2012... at the time there was also the concept of virtual caches... which allows you to borrow space from on-chip memory across other cores... very interesting to see IBM use something similar here... btw i like how the timelines show up as you speak, i will need to learn to do that for my videos :)
You may need to check with a patent lawyer ... this is obviously a "big bucks" feature, and if you had mentioned "virtual caches" back in 2012, then the idea is public domain such that others can freely use the idea ... or maybe you are entitled to royalties as well (I'm sure IBM has patents issued on this).
@@ear4funk814 Why would they? They obviously don't have the patent, and the only question is whether AMD or Apple etc. gets sued for patent infringement, at which point they'll do a paper search for prior art.
I work in a DC for a bank and they just installed a load of the newest model of Z Mainframes, these things are sick, and some of them even have liquid cooling built in. Amazing machines.
Tightly coupled local data stores are one way of reducing memory pressure. Can be private or coherent shared public depending on the workload and application.
I think 3D Cache or Stacked Cache is going to be the future for a couple of reasons. 1. Space. Like you said in the video. Stacked Cache allows bigger and faster Cache, 4GB is doable. 2. Price. The logic behind Virtual Cache systems is more expensive than 3D Cache. 3. Applications. There are tons of ways you can create very fast chips using 3D design. Imagine a 3D ( Stacked Chip) that's 3 layers. 1st layer for CPU's, 2nd for Massive Cache, 3rd GPU's. That's a very fast and crazy chip having both CPU and GPU using the L3 Cache in the middle. That's some pretty awesome power you get in 1 chip package for the majority of workloads. But the best idea would be a hybrid 3D Virtual Cache. Eventually you'll see CPU's getting GB's Of very fast memory, be it L2, 3 or 4.
I finally understood why and how smaller FFT sizes produce more heat in P95 and larger FFT sizes stress the RAM more. I mean, I knew about the relationship but didn't know how it all played out.
I wonder, with newer packaging technologies, are we going to go back to the Pentium Pro or Pentium designs with off-die cache? With the newer interposers, a larger L3 or L4 can be on package, using newer or other nm processes that can make cache faster or more power efficient.
From a high level this sounds very similar to AMD's GPU cache for RDNA3... except it's a bit backwards, where each set of CUs can always get at least one cache area, but they can also be granted extra cache compartments (or potentially denied if they are streaming data rather than needing to cache). AMD, instead of dedicating cache to cores, is working on figuring out how to increase cache utilization overall based on what the workload is actively doing.
I can see this only working if they are figuring that at least 1/2 the cores on each socket are going to be idle at any given time. Otherwise if this is like a virtual environment where even at 40% load all cores can be active just at different frequencies, I would think that you wouldn't have the spare cache for this to make sense. Even chugging along at say a 10% core load you would want as many instructions stored in cache as possible. That would mean that each CPU would be using all of its 32MB L2 with no room for virtual L3.
Love the choice of a K6-III+ 400mhz ATZ in your graphic at 1:24 -- I'm using one of those in my super socket 7 vintage gaming PC, what a neat processor.
Or more Registers - Itanium had rather great performance, but we had the "but compilers would need to be more complex".... For the tasks they were designed for they still have incredible performance, but the architecture was clearly designed with specific workloads in mind. I have seen an 8Core itanium 2GHz outperform (ever so slightly) a 12Core 2.8 GHz Xeon (and the Xeon was 4 years more recent) for our VM-workloads, but changing the settings to not be "correct" for the CPU/workload-combination and it was several times slower.
The caching strategy seems a lot like SSDs, where it can allocate certain portions of the cells for higher speed... in this case it dynamically allocates L3 and L4 cache, also for higher speed.
Again a great opportunity to explain the trade-offs of bigger caches versus higher latency, taking AMD's Zen 2 and Zen 3 APUs as an example, because I still don't understand why Zen 3 APUs seem to have fallen behind their CPU counterparts compared to the previous generation, even though the opposite should have happened, as they doubled the L3 cache.
It only takes that long because of bus hierarchies, memory caching, prioritization, etc - you could attach DRAM as level 2 cache, and your latency would be reduced, but your memory throughput would be atrocious.
This is super context sensitive. For example, on normal desktop workloads, more L2 would be great... but mainly if there were fewer cores, because there would be more task swaps. But with a higher core count the scheduler remembers the process or thread affinity for this very reason, to keep the specific core's cache hot for that task. This is so complicated. I can't really imagine it being a guaranteed improvement. Very interesting, lots of power vs performance vs workload vs software optimization type considerations.
Just a technical correction on something that didn't sit well with me (1:53): they don't start small because they are really fast, they start small because they are really, really close to the CPU cores, and the circuitry needs to be really, really short in order to be that fast, which limits the available space that fits the two aforementioned criteria.
Talking about getting IBM to explain part of their CPU to you.. In about 1987-88 I took assembly programming on the (presumably then current) IBM Mainframe (lvl)1 & 2 - I suspect the design team all had very small, shaky, tenuous grips on reality.. Same team probably *had* to write the C compiler, too. ;) I totally *loved* the Apple ][+'s 6502 assembly, though.
CPU design is all about tradeoffs. There's no substitute for fast main memory in the general case. That said, I can see this being faster for some things but slower for others. The main benefit I can see is that this should make the cache size(s) dynamic to a certain extent, at least up to the hardware limits, so it/they can better adapt to changing workloads. Whether that winds up being enough to overcome the extra latency is the question. I'd love to see some solid benchmarking that explores this idea. Of course it will also be interesting to see what new side-channel attacks this enables, a la spectre and meltdown.
Prior to a hard core gaming session, serious gamers run a game sim/file that loads all the graphics and sprite files into main memory to speed up game performance. The idea is, the game slows down (lags) the very first time a G/CPU calls for a file that is not in memory. So, to increase game performance, they pre-loaded sprite and graphics files into memory ahead of time, esp. for the piece (char/race) they are playing.
3D stacking is going to allow for even more complex ways to feed the critical components in these chips. Doubling the theoretical number of layers a CPU can have is very powerful. Everything is about latency. For the most part we can only consider large structures on the sides of the cores; we can't go up or down. With stacking you can now build a cache under the logic. It's like a skyscraper vs a house, a bungalow.
There seems to be a right time for each approach and design to work. IBM seems (always) to be right at the tip of the spear, that is, innovation. At earlier times such large pools without some highly intelligent mechanism couldn't make sense. If I am not mistaken, IBM also had a connection with GlobalFoundries and the first Zen iterations; the whole CCX, core and cache structure, I think, was influenced by the trends of the μ-arch at the time, and each time IBM seems to be there. Also, it seems that Intel is starting to make larger L2 structures with the new cores... maybe going towards this approach? I hope your channel keeps growing; it transmits some good vibes to me and has a feline feeling to it.
A large down side with this "giant L2" approach is that it only lends itself to the tasks it was made to do. If enough cores are working on data sets larger than the L2 you'll run out of space. This is not as much of an issue with L3, even less when you bring 3d stacked L3 into the mix.
Software developers would love to have a lot more L2 cache (and L3 if that's not possible). The only reason games and other programs can work with current smallish caches is that the features that would require more cache are just not implemented.
This is an excellent report and overall analysis Ian, well done!👍 IBM is really creating something special with Telum and its approach even quite different with the virtual, shared L2 cache reminds me how Apple ditched the slower L3 cache and rather expanded the L2 cache in their newer AXX chips. Still IBM is coming with something special again which might inspire both team blue and red as well as Arm. Regarding cache itself, it would be awesome to have one day GBs of high speed cache on the SOC package which would completely replace the DDR and even HBM. Its quite clear that future of the memory is to get it run much faster so that RAM is running almost as fast as cache and non volatile memory like Flash runs as fast as RAM. Ideal world would be to have one single, unified memory which runs as fast as the fastest Flash, but it's non volatile😁
Cash is super important to performance as you can't buy anything without it. Cache is also important. You need cash to buy more cache :P
My fortune is mostly write through cash, so not enough money to buy cache.
It comes in from the day job, falls through your pocket, right into your bills
I believe that cache is pronounced caish, not cash.
@@MekazaBitrusty I've heard it pronounced as Keishe as well
@@alfredzanini well, Google just slapped me a new one. Apparently it is pronounced cash! English is so screwed up 🙄
As you mentioned at the outset, the caching needs to be optimised for the workload. IBM's Z-Series is highly targeted to its workload. The shared caching works when you've got a well-known, homogeneous set of services. A (large) x86 server could be hosting VMs all doing different things, and a workstation/PC a game, browser tabs, a video editor, a dev environment/tools and a whole raft of utilities and services... each of these actively fights for cache, invalidating older cache lines.
The more discrete services are running, the easier it is for cache lines to be invalidated... with enough discrete active services, at worst you almost have a FIFO stream of once-used data. Dedicated core caches at least provide some barrier that stops other cores from effectively starving the owner core of its own cache (though I am sure IBM have thought of this), therefore allowing threads on the core (which are more inclined to always run on the same core) to run without interruption from other cores/threads.
Since Z-Series will have a number of known discrete services running on it at any one time, I would guess the OS will either auto-tune the caching strategy for each of these services or allow it to be defined, like it does the allocation of cores. x86 doesn't really have this luxury due to its broad, multi-purpose nature.
Thanks for the explanation 👍. Perhaps in a few years this type of caching will come to x86. (I think) if it goes to Arm first, x86 will be on death's door if it doesn't come up with its own solution fast!! (as far as I understand the tech)
so how is this good for gaming?
@@orkhepaj th-cam.com/video/RQ6k-cQ94Rc/w-d-xo.html
@@orkhepaj Video games are not usually optimized for multicore performance. So if someone with a PC was playing a game but not really using other programs at the same time, the L2 caches of the other cores likely wouldn't be used very much meaning that there wouldn't be a lot of cache misses on the virtual L3 cache, in addition to the fact that the core running the game now has a much larger L2 cache. Things might get more complicated for streamers though who do have some extra compute going on to process and encode their video/audio inputs.
Isolated cache is also more secure, as we saw with the release of zen3
It seems like this would only make sense if the cores are out of sync. By that I mean if they all have similar loads then their L2 would tend to fill in parallel, reducing the virtual L3 available.
The kind of software that IBM mainframes run really doesn't scale to many cores so I think it is a safe assumption. These systems are usually running many heterogeneous workloads.
Yeah, for games might be better to stick with a big L3 cache, bc there would be a lot of redundant shit on L2 from the fact that games are coded to be really good at multithreading.
@@MaxIronsThird maybe not though, cause (not sure if he touched on this but I'm not gonna rewatch) it would depend on where you're processing it. The time it takes to kick something out of the core's cache wouldn't really matter while you're not processing it, but if you can just process it in its new adjacent core when you need it, it would be faster than sending it back to the original core, and faster than jumping up and down layers of fixed cache. AMD has shown that a little bit of extra latency seems worth it to gain a lot of shared cache, so this system should have the best of both worlds.
@@Freshbott2 So in theory what gets kicked out to VL3 essentially becomes L2 again if the adjacent core is able to process it. I wonder if it would also be possible for something like Intel's big.LITTLE concept to sandwich their smaller cores between each 'pair' of L2...
Yeah, depends on some cores being idle for their cache to be available for others to borrow.
I loved the "Cache me outside, how about that" :D
Imagine a Zen with each core having a 12.5 MB+ L2 cache, now that's a Gaming CPU for us "cacheual gamers" ;)
Or 32mb of l3 and 4-8mb of l2
Cache lines matter ;).
@@vyor8837 the sole idea of having an l1 cache is to make things a lot faster. then there's l2 cache as a backup for the tiny l1 cache, and then there's the l3 cache which is huge but slow. i didn't even think an l3 cache could be important to have because i thought it was super slow when i realized it existed. i still think a massive l2 is better than having l3, but if a massive l3 is what i have, i won't complain. it's still better than just having l2 but again, i'd rather have a massive l2. i think it's faster than going to check l3 or l4.
@@white_mage 12.5MB is smaller than what we have now for cache. Amount beats speed, power draw matters more than both. L2 is really power hungry and really large for its storage amount.
@@vyor8837 you're right, i didn't realize it was that big. if most of the IBM chip shown in the video is cache then l2 has to be huge. what about l3? how does it compare to l2?
You’d think you’d want a huge amount of L1 but there’s a cache-22 …
🤦♂️
...that was an awful, naughty pun and you should feel bad. Stop that! STOP GIGGLING, DARN IT! o.@;
*IsKidding, that was terrible-in-good-punny-ways pun, which means it was awful but a pun, so a good-awful! Does bet such a naughty punster is giggling, though!*
This joke has serious cachet.
Can't wait for the new round of spectre exploits based around this.
Exactly what I was thinking.
I don't want to see it happen, but it will.
exploits on Z? this isn't intel
@ Yet the implications the good Potato points out are that Intel and AMD may well exploit this IP in their own architecture, which implies the same problems with predictive branching that are at the heart of spectre exploits.
If they do a bad enough job we might even see something that can be replicated outside of a lab.
We all probably want to see lots of graphs and diagrams that show all the performance characteristics :D
The idea of using virtual L3/L4 caches inside massive L2 caches is really brilliant. I would really like to explore this in my future master thesis that I will start in about two years.
Is it so brilliant for the coder though?
This will look like a massive distributed L3 with less predictable access times, rather than a typical L2 because the L2 is now inherently shared.
@@RobBCactive perhaps, but often times you end up working on legacy projects in which the stakeholder wants some refactoring done fast (often out of necessity). At that point, you don’t always have the bandwidth to tailor the code to the platform. Also, often, the code may be in an interpreted language (either moving from or to). That’s code you can't hand-tailor as easily for a CPU as C or assembly. My opinion is that a lot more code will passively benefit from this cache redesign than not.
@@kiseitai2 But it's inherently giving less control, the kind of code you're talking about relies on L3 anyway; it's not the kind of high performance code that tunes to L1 & L2.
Besides you can query cache size and size accordingly for the chunks you do heavy processing on. It's not as hard as you make out to dynamically adjust.
I just think people need to recognise that there's potentially more cache contention and it may not be reasonable to process data in half L2 chunks.
I don't really see only advantages in a very complicated virtual cache system over L2 per core backed by a large L3 which follows data locality principles.
There have been enough problems with SMT threads being able to violate access control already.
@@RobBCactive I think it is true that the access times will become less predictable because the size and partition of the L3 cache is dynamic. But what if we give the software some control over how to partition the virtual L3 cache? It could make the access times more predictable, but the OS could also move threads faster from one core to another. Also a core could be put to sleep and then the associated cache can be configured as a non-virtual L3 cache. You see there are lots of things that make this interesting and could bring some advantages so that the disadvantage of unpredictable access times can be minuscule.
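As a rough illustration of the "query cache size and size your chunks accordingly" suggestion above, here is a minimal Python sketch. It assumes a Linux system that exposes cache details under `/sys/devices/system/cpu/cpu0/cache/`; the helper names, the 256KB fallback and the half-of-L2 fraction are illustrative choices, not anything IBM, Intel or AMD prescribe.

```python
from pathlib import Path

def parse_size(text):
    """Convert a sysfs size string such as '512K' or '32M' to bytes."""
    text = text.strip().upper()
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    if text and text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)

def l2_cache_bytes(cpu=0, default=256 * 1024):
    """Read the L2 size for one CPU from Linux sysfs, else return `default`."""
    base = Path(f"/sys/devices/system/cpu/cpu{cpu}/cache")
    if not base.is_dir():
        return default                      # not Linux, or sysfs not available
    for index in sorted(base.glob("index*")):
        try:
            if (index / "level").read_text().strip() == "2":
                return parse_size((index / "size").read_text())
        except (OSError, ValueError):
            continue                        # unreadable entry: try the next one
    return default

def chunk_elements(cache_bytes, element_bytes, fraction=0.5):
    """How many elements fit in the chosen fraction of the cache (at least 1)."""
    return max(1, int(cache_bytes * fraction) // element_bytes)

l2 = l2_cache_bytes()
print(f"L2 reported as {l2} bytes; process {chunk_elements(l2, 8)} float64 values per chunk")
```

The catch raised in this thread still applies: on a design where the L2 is shared as virtual L3, the reported size says nothing about how much of it other cores are occupying at that moment, so a fixed fraction like one half is a guess rather than a guarantee.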
@Fabjan Sukalia : Computational RAM is the future. Compute-in-Memory architectures will compete against and overtake Multicore designs- even those having gigabyte-sized caches.
Virtual L3/L4, amazing. I wonder how coherence is communicated across cores and across chips, and how virtualization of apps affects flushing/invalidation of these virtual caches. Amazing engineering!
L2 / virtual L3 sounds more like something perhaps universally applicable. L4 sounds like something only useful in the context of the Z architecture. Consider that all those reliability guarantees must add quite a bit of overhead on the Z system. Not sure this will still apply to something like a dual socket x86. Still, the tech behind it must be absolutely impressive.
1: I would rather describe the cache layout the other way round. The L1 cache is the one you work with, and anything you don't need is stored in RAM. This way the importance of cache is more pronounced, instead of sounding like just another 2% performance improvement. 2: Branch prediction is used for optimizing small loops (size dependent on µ-op cache size) and decreasing pipeline stalls. What you meant was cache prediction.
This sounds great for mainframe style applications but sounds like a nightmare for any use case where you need to worry about side channel attacks from other processes.
I do wonder if this, ever so slightly, increases attack surface by basically making every private cache line globally accessible. I wonder if there might be even better side channel attacks enabled by this.
The sad reality is that the faster you make the CPU cores, the more of such attacks will be possible. The industry might not be ready to admit it yet, but speed is most likely not compatible with security.
IBM could design it so that private cache is encrypted and the memory address is hashed and salted. This way a side channel attack might extract the private cache line data, but the attacker would still need the keys, which should only be kept in the core, to unlock the data.
@@kazedcat I feel like the secure enclave concept tends to eventually be defeated. At least that was the case for Intel.
@@kungfujesus06 The goal is not to make a bulletproof system but to make the exploits extremely hard to implement. If each logical thread has its own unique keys and salts, then exploits need to string together multiple vulnerabilities instead of needing only one. Also, extracting keys from the core will be harder than extracting data from a shared cache. The problem with the secure enclave is that it outsources security to a separate dedicated core instead of each core handling its own security. This reduces the attack surface but also makes security a lot more fragile. You only need to crack that one security core and everything becomes vulnerable.
@@kazedcat Or you can design and patent it if they haven't done so ... seems you have a solution to a real problem.
Super fast L1 with high associativity. Massive L2 cache that can work as L3 at times the way IBM designed it. V-cached L2 or L3 and a chiplet with decent, massive L4.
That would be godlike.
Agreed. With the new packaging tech from Intel, the next few CPUs could be some interesting designs, testing out new tech until they find the right optimizations and features that make CPUs 2x the current performance.
I'm old enough to remember when ARM was used in RISC PCs. They went from the ARM610 at 40MHz to the StrongARM at 203MHz with 16KB of cache, and the speed-up was way, way more than what the raw MHz would indicate. The BASIC interpreter now fitted in cache. Programs ran so fast it was insane. Some games had to be rewritten/patched as they were unplayable. All down to the cache.
Two random thoughts:
The virtual L4 to me implies that socket-to-socket access is still faster than socket to RAM. Is that right? I guess that also considers NUMA. How does L4 compare to local RAM?
Recently we see Intel go relatively bigger L2/core while not so much in L3, and AMD is going bigger L3/core with relatively small L2. Wonder what the thinking is behind that design choice.
From what I remember, the reason AMD had the big L3 cache when they made Zen was so that no matter how many cores were being used, ALL of the L3 would be utilized to maximize performance.
Given their Z platform where they need shuttle data between CPUs all the time, it wouldn't surprise me if CPU to CPU is faster than CPU -> DRAM.
A lot would depend on the TLB structure and architecture, as well as what your latency penalties are for going through the cache levels. He went into a lot of the trade-offs you can make with caches and their size.
It's clear that AMD made an early decision in Ryzen's life cycle that L3 would be their low latency bulk storage when compared to DRAM. Considering the design inherently increases DRAM access time (relying on a GMI between cores and IO dies) they needed a way to make up that performance loss. Also, with their announcement of v-cache/stacked silicon, they have doubled down on this concept.
My guess is that going to the chiplet design helped force their hand in this direction due to size constraints, where it makes more sense to reduce the cache bottleneck at the die level as opposed to the core level.
Also, you have to be very confident of core utilization to sacrifice die area for L2 in place of L3, as that space is wasted if the core isn't being used. This is where IBM's design is very interesting, but it would be interesting to see the latency of the virtual L3 in the worst case (cores furthest apart).
Maybe Zen 4 or 5 brings stacking from the get-go, allowing the base chiplet to have larger core-specific cache (L1 and L2), with the upper die being used like V-Cache for L3.
Yes socket to socket links are proprietary and very fast. DDR4 and DDR5 are currently the biggest bottlenecks in any system.
@@winebartender6653 Also fetching data from DRAM costs more power, they were explicit about this when explaining the Infinity Cache for RDNA2. Not only a narrower bus to save die space but the cache improved power efficiency.
I hope that marking L3/L4 works; if not, this seems like Meltdown madness. Even if that virtual L3/L4 doesn't make it to our CPUs for security reasons, a massive L2 would be interesting to see.
Thank you for your amazing content Ian. Even as a computer scientist (albeit focused on software only), I still learn new things from most of your videos!
I think it might work very well when not all 8 cores need to use their entire L2 cache at the same time. Games usually have very uneven workloads between different threads, so it will probably work great for games.
I worked on the memory coherency unit for these core+cache units. It was fascinating to learn about and I'm glad I could be a part of implementing it.
This sounds like a more practical way of using space. You lose some cycles on the L2, but the gain is that the L3 is faster and you save space, so a bigger L1 could be done as well. A bigger L1 and a faster virtual L3 might cancel out the latency loss of the slightly slower L2. We will see if Intel and AMD start to adopt this new idea.
This is nice. This both allows L2 and "L3" and "L4", in bigger amounts than before, but also the opportunity for a single threaded workload to have a very nice very big 32MB of L2.5 cache. And still more L3 and L4, if the other cores are barely doing stuff. Huge increase either way.
As always Ian. Very fascinating. Keep the awesome content rolling.
The video doesn't point it out, but the different caches use virtual or physical addresses. Code uses its virtual addresses to run at full tilt, but to share cached memory between processes a virtual-to-physical address translation using the process memory mapping is required.
This is the reason L1 cache can be very low latency, while L2 & L3 are slower.
Arguably IBM appear to be using a distributed L3 cache from the size, if you're coding high performance code, you really want to fit in the core's L2 for the main work, accessing L3 via RAM in predictable ways so the processor can anticipate and pre-fetch.
L2 used to reside on the motherboard. I had one with a cache SLOT that fitted an extra 256KB on top of the onboard 256KB. :)
Hmm, I don't remember ever seeing a motherboard with that feature! All the ones I recall using had two 256k SRAM chips soldered to the motherboard. And for some reason it was common enough to have the L2 cache go bad that the BIOS had the ability to bypass the L2 cache for when that happened, albeit with greatly reduced performance. And of course you will remember that back then the memory controller wasn't in the CPU but in the Northbridge. So we had a situation that today sounds insane: when the CPU needed data it would check its L1 cache, if not there it would check the (external) L2 cache, and if not there it had to ask the memory controller on the Northbridge to go to RAM to fetch the data and send it to the CPU. It sounds so barbarically crude now it's almost funny. But back then in the stone age, process nodes were so huge there just wasn't room to put everything on the CPU. And if you wanted integrated graphics so you didn't have to buy a graphics card, did you buy a CPU with a GPU built in? Of course not, that was the job of the Northbridge too! LOL! There are a LOT of things I can say I miss about "the good old days" but that stuff is NOT part of it! ;-)
That was called a COAST module. Cache On A STick!
The Tilera chips already used this idea - the private L1 caches could also be used as part of a chip-wide L2 cache. I myself have used this in some of my projects since 2001. Some of the ideas that make this work can be found in en.wikipedia.org/wiki/Directory-based_coherence
That's 20yrs ago, meaning key Patents may have expired. If so, other companies could start working on their own implementation.
I just wanted to say that your videos are fantastic! Really well explained and educational to watch. I never understood what L1, 2 or 3 Cache ACTUALLY did. Until now.
A really nice analogy for caches is the Librarian. He keeps a handful of the most recently requested books on top of his desk, a stack of recently used books underneath his desk, a rack behind his desk with the most popular books of the last few days, and finally the main library where the books are stored more permanently. As the Librarian gets a book request, he puts the book on his desk and moves the oldest book under his desk, then moves the oldest book there back to the shelf behind him, and then puts the oldest book on the shelf back into its place in the library.
Here it is easy to see that if the librarian keeps too many books on his desk he's going to have difficulty getting his work done - the books will be in the way. If he's got too many books under his desk it will slow him down as well, but he's got more room to arrange this area. The shelf behind him has even more flexibility in its size and arrangement.
That's a CPU and its L1, L2, and L3 caches, plus its main memory.
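The Librarian analogy above maps quite directly onto code. Here is a small Python sketch of it, assuming each level simply hands its oldest book down one level when it overflows (an exclusive, oldest-out scheme); the class name, level names and capacities are invented for illustration and are not how any particular CPU sizes or manages its caches.

```python
class Librarian:
    """Toy model of the desk / under-desk / shelf / library hierarchy."""

    NAMES = ["desk", "under desk", "shelf"]    # stand-ins for L1, L2, L3

    def __init__(self, desk=2, under_desk=4, shelf=8):
        self.capacity = [desk, under_desk, shelf]
        self.levels = [[], [], []]             # oldest book first in each list

    def request(self, book):
        """Return where the book was found, then put it on the desk."""
        found = "library"                      # main memory if no level has it
        for name, level in zip(self.NAMES, self.levels):
            if book in level:
                level.remove(book)
                found = name
                break
        self._place(0, book)
        return found

    def _place(self, i, book):
        """Put a book at level i; push that level's oldest book down if it is full."""
        if i == len(self.levels):
            return                             # fell off the shelf: back to the library
        self.levels[i].append(book)
        if len(self.levels[i]) > self.capacity[i]:
            self._place(i + 1, self.levels[i].pop(0))

lib = Librarian(desk=2, under_desk=2, shelf=2)
for book in ["A", "B", "A", "C", "B"]:
    print(book, "found in", lib.request(book))
```

Requesting the same few books keeps them on the desk, while a long run of different titles pushes older ones steadily back toward the library, which is the same behaviour as a working set that does or does not fit in cache.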
Thanks for the explanation, it is amazing to see these new approaches to latency and allocation :O
Ok as someone who worked in IBM database systems (Db2 for Linux and Windows for nearly 30 years, a couple more in SQL/DS), a mainframe isn't the only way that your banking transactions can be kept safe. This is software, specifically the write ahead logging of your transactions and then adding things like dual logging, log shipping for high availability disaster recovery, fail over for nodes etc. We would commonly have 32 to 256 database partitions on 4 to 64 physical machines presenting to the end user as a single database for instance. On with the show.
I'm so glad I found your Channel. It's the only way I can get info into the deep-dives of the tech industry. 😎
That was very comprehensive. Never understood cache that well. Thank you.
Per Jim Keller, ~ "performance can be reduced to branch prediction and data locality" ( And I Paraphrase ). Best most concise rule of thumb for all hardware and all software design IMO.
Memory is still the limiting factor in core performance. That has brought on this type of innovation and 3D stacking.
@@hammerheadcorvette4 that is what data locality means. How close the data is. E.g cache, memory, disk, network...
The L3 cache on my old 7900x has the same transfer rate as my DDR4 main memory. It seems that traditional L3 cache is obsolete with today's faster RAM.
That Linode ad with the buff American accent, cracks me up every time. :)
Ah the American accent which slips into an English accent every few words.
😆😆😆
I like the toolbelt analogy for cache. The CPU registers are the tools in your hands, and you work directly with them. L1 is your toolbelt, bigger, but slower. L2 is your toolbox, you need to restock your toolbelt from there but it can hold even more tools. L3 is your truck. Lots of storage but you need to go all the way back to your truck. DRAM is the hardware store, almost all the tools available, but very far away. You lose hours just going over there and back. The Hard Drive is China, absolutely all tools, but takes months to get the one you ordered.
12:32 Well, simply put this complements mainframe setup perfectly. I mean, even older MVS concepts like shared/common virtual memory, or the coupling facility, they all enabled Z to perform like no other infrastructure while still maintaining a high degree of parallelism and ridiculously low IO delay. Ironically, despite the "legacy" label being often attached to mainframes by the press, Z still remains one of the most sophisticated and advanced pieces of technology, to this day improved by each iteration.
I think you forgot about registers in your cache level story. Most modern GPU's have a large register section for the immediate access to data. L1 cache has typically 3-5 clock cycle delays, registers are immediately available. Typically the register section of a GPU is much larger than that of a CPU, so it does come into play more than registers do for CPU's.
This kind of feels like an alternative and an extension to AMD's paper 'Analyzing and Leveraging Shared L1 Caches in GPUs' except for CPUs
Not gunna pretend to know chip design too much but from the stuff I watch, Apple’s M1 is really good because it keeps its cores fed due to its design approach (the whole 8x instruction decoders, large caches etc. ). Is this IBM approach not similar in that it attempts to look at the problem differently than others and looks to utilise its resources better/more efficiently rather than just adding more resources?
And... the M1 also does not have an L3 cache. Seems to work extremely well.
That's genius, but the potential problem I see here is: how are they pushing out old data? We don't want cases where a core doesn't have enough space in its private cache because it's full. So it goes to store elsewhere, but its private cache was filled with old data from other cores which would not be used much again.
A good amount of planning needs to go in there (which I suppose IBM engineers already did). However, I wonder if the virtual L4 helps at all. Shouldn't it be as slow as memory, since it's outside of the chip anyway?
There are probably quite fast interconnects between the chips in the system, likely lower latency than RAM still otherwise they wouldn't bother.
probably they can throw out data from other cores when they need the space
If a cache is full, the oldest data gets evicted.
My understanding is that the caches will prioritize in-core usage as L2 over out-core usage as L3.
It's pretty awesome that you used an AMD K6 III+ in your video. It's the first CPU to use level 3 cache. Of course it was off chip level 3 cache. But it was huge compared to what was available in level 2 cache of the time.
i am pretty sure IBM can save GPU market but they have more important things to do
they always solve very very big problems all the time
IBM designs systems while AMD and Intel design products.
That's how I would classify the difference.
Bro, what are you doing here :D
Apple M1 seems to be efficiently addressing that arena. And M1's just the first version.
In my understanding it would only work if the records are being evicted from the caches not only due to being replaced by the newer data but also due to some sort of time of life eviction and possibly other methods because if you only have eviction by replacement with newer records then all L2 caches will be full after a short period of time and there will be no space in any of them to be allocated as virtual L3 cache. Also this time of life and other kinds of eviction should not evict it to some other core's cache but rather erase it because otherwise you'll just have all cores ping-ponging the same old data.
Lines are on a queue to be written back to RAM before they leave L1. You can safely discard anything in cache on most designs without impacting correctness. The hard part tends to be picking which lines to discard 🙃
@@capability-snob well, the first thing that comes to mind is time of life since the last use of the line.
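To make the "time of life since the last use" idea from this thread concrete, here is a minimal sketch of an LRU cache that also throws lines away once they have sat idle too long. The class name, the tick-based clock and the numbers are all hypothetical; nothing here is claimed to be what IBM actually does.

```python
from collections import OrderedDict

class IdleExpiringCache:
    """LRU cache whose lines are also dropped after `max_idle` ticks
    without being touched (a made-up policy, for illustration only)."""
    def __init__(self, capacity, max_idle):
        self.capacity = capacity
        self.max_idle = max_idle
        self.last_use = OrderedDict()  # line -> tick of last use, oldest first

    def expire(self, now):
        # Lines are kept in last-use order, so stale ones sit at the front.
        while self.last_use:
            line, used = next(iter(self.last_use.items()))
            if now - used <= self.max_idle:
                break
            del self.last_use[line]  # discarded, not pushed to another core

    def touch(self, line, now):
        self.expire(now)
        if line in self.last_use:
            self.last_use.move_to_end(line)
        elif len(self.last_use) >= self.capacity:
            self.last_use.popitem(last=False)  # ordinary replacement eviction
        self.last_use[line] = now

    def free_slots(self, now):
        self.expire(now)
        return self.capacity - len(self.last_use)

cache = IdleExpiringCache(capacity=4, max_idle=10)
for tick, line in enumerate(["a", "b", "c", "d"]):
    cache.touch(line, tick)
free_at_3 = cache.free_slots(now=3)    # cache is full, nothing to lend
cache.touch("d", now=12)               # "a" and "b" have gone stale by now
free_at_20 = cache.free_slots(now=20)  # only "d" is still fresh
print(free_at_3, free_at_20)
```

With replacement-only eviction the cache would stay full forever after tick 3; with the idle timeout, three of the four slots are free again by tick 20 and could in principle be lent out as virtual L3.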
This also reminds me of the core 2 duo, where it had big L2 cache for the time.
My take - virtual L3 and L4 is there just for compatibility reasons, most things will only use L1 and L2 or will have to go to main memory anyway
This really is impressive, unless all cores are heavily utilized you have almost infinite amount of cache, crazy, crazy
Considering having consistent 100% utilization is almost impossible, this is very cool
@@TheMakiran Ian can tell you his Pi machine will certainly run very close to full bore on all cores!
I'm impressed that you seemingly have the money to buy a multi socket system. Us mere mortals might as well wait another decade before these systems become affordable, at which point, I presume, the requirements for common applications will have appropriately expanded to fill that "infinite amount of cache" rather quickly.
@@Innosos I'd say the equivalence to sharing the L4 cache between processors would be to share it between chiplets/tiles instead.
At 5.3Ghz, they are hoping to "clear" the L2 as fast as possible. That is reasonable as most of these machines are going into the Finance sector.
16:21 The good doctor is at a loss for words at how huge and cool that is. Love it.
Think of it a different way: if AMD can put a large L3 V-Cache on top of the processor die, it can redesign all on-die L3 cache as L2. You could have a huge L2 cache and a huge L3 cache.
IBM loves to sell chips with say 8 cores, even if you get a 4 core system as you can enable those other cores at a later time after paying IBM more money.
Now the question is - will those unlicensed inactive cores also disable the l2 virtual l3 cache upon those inactive cores or will the way IBM locks cores off still leave that L2 cache exposed and enabled for the virtual L3 cache?
What I find interesting about IBM's design is the whole memory controller approach, which was in effect reprogrammable for future memory interfaces. That's pretty epic.
They really need to hurry up and make dies cube shaped with microholes in it for liquid cooling (millions or billions of layers instead of just 17). Cube vs square = both more cache and lower latency.
I had no idea that "mainframes" still exist. I thought modern banking systems simply used stacks and stacks of standard server chips. I had no idea there was a whole different processor design for modern mainframes.
This is truly an amazing technology, and Ian's enthusiasm for it is palpable.
Virtual cache sounds interesting, but I think I'd need to see examples of how cache eviction works in practice for some realistic loading scenarios before I get excited. Particularly for workloads we think it'll help in. For example does this design cope well with games that would benefit more from larger dedicated shared cache? Or does the design mean that the virtual L3 and L4 is squished out to make room for L2 when that wouldn't be optimal for the load in question?
Love the educational part. Really excited for the upcoming technology. But IBM processors really don't end up in consumer products, right? However the technology might
Apple M1 does not have an L3 cache. CPU, GPU and all the RAM are on the same package. Close together. No connectors, pins, sockets and copper traces on a motherboard. Very similar.
You omit one thing: the type of memory cells used for these caches. In 1995, IBM had a UNIX (AIX) server that used static RAM for cache and it could be expanded. In line with your point, throwing more cache into a base model made it faster than the next model up with a more powerful CPU. The nature of DRAM is that it is unstable and slow. That static RAM had less than 1/8th the latency of DRAM. And it was very expensive - probably because of very low production volumes. I always wondered why the DRAM in my PC had not gotten replaced by static RAM in the past decades.
Oh, mainframes - next to billions of smartphones and PCs, and a couple Macs and iPhones, there were about 10,000 IBM mainframes in use in 2019.
This really came at the right time, I've been learning about cache lately and I didn't realise how much I didn't know :) I really thought it was a simple as more cache = more better.
Super cool idea on IBM's part. So good to see there are still active and commercially viable processor designers out there besides Intel and AMD.
When I went through my Computer Engineering undergraduate work, I learned about the layering process when manufacturing silicon chips. Expose the surface to a gas, let it dry and form a masking layer, shine a light through a film mask and through a lens onto the surface to burn a pattern through the masking layer, expose that to a doped gas to add doping to the exposed silicon, use an acid to remove the mask layer. Rinse and repeat with a different mask and maybe a different doping for a different purpose. At the time, they were doing 7 to 13 layers. I think often there is a grounding layer on the bottom and a power layer on top, with holes poked through the grounding layer for signal leads. I think adding more layers had diminishing utility because of cross talk (flowing current induces a magnetic field around it, which can, in turn, induce current flow in nearby conductors, thereby interfering with any signals going through those nearby conductors) and heat dispersal (current flowing through a conductor that has some resistance produces heat and consumes power). It seems like both of those issues are a matter of materials research. If you can get very high resistance in a material that is achievable through a gas exposure process and that conducts heat very well, you can disperse heat and prevent cross talk at the same time. And if you can get very low resistance, you can get more current through with less voltage and not only generate less heat and save power, but also charge up capacitance faster and thereby increase signal speeds. If you could get a semiconducting material that was superconducting when/where it is in its conducting state and has very high resistance (enough to basically completely block current flow) in its non-conducting state, that would really win the game, because your power requirements would be incredibly low, your signal speeds incredibly high, and there would be no heat to disperse. I guess that would be the holy grail for this stuff.
Anyway, the point is, it seems like a lot could be achieved by materials research to reduce heat and cross talk and then by adding a lot of layers. Imagine having a cube instead of a flat die. Right now dies are 100mm squared, and about 0.5 mm thick. Some are 0.275mm thick. If you had a die that was 100 mm cubed, that could potentially multiply the transistor count by up to 200 or so. Of course it would also take many times longer to produce a wafer, so I guess that's an enormous cost problem. Which I guess probably means that nobody is working on anything like that.
But when our processor chips are super-semi-conducting and are cubes (or honey-comb shaped?) instead of flat chips, that's when AI robots will speak fluently and be indistinguishable from humans over audio-only communication channels and be capable of taking over the work of world class human doctors, scientists, artists and engineers and doing a better job of those things.
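The "up to 200 or so" figure above looks like simple division, so here is the arithmetic spelled out. I am assuming the cube is 100 mm on a side and that it is just today's dies stacked at their current thickness, which ignores bonding layers, wiring and cooling entirely.

```python
CUBE_SIDE_MM = 100.0  # assumed height of the hypothetical cube

def stacked_die_count(die_thickness_mm, cube_side_mm=CUBE_SIDE_MM):
    """How many dies of the given thickness fit in the cube's height."""
    return int(cube_side_mm // die_thickness_mm)

layers_thick = stacked_die_count(0.5)    # the 0.5 mm dies from the comment
layers_thin = stacked_die_count(0.275)   # the thinner 0.275 mm dies
print(layers_thick, layers_thin)
```

So 0.5 mm dies give the 200x multiplier quoted above, and the thinner 0.275 mm dies would push the same back-of-envelope number a fair bit higher.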
“Possible” that IBM has patents out the wazoo? Hahaha it’s guaranteed that they do. When I partnered with them on a homomorphic encryption system I built, they tried to convince me to patent it, despite being a pretty simple implementation of the Paillier algorithm lol
casually flexing he partnered with IBM like bruh
They also flex about how many patents they applied for a year and how many they have in total on presentations.
@@S41t4r4 my achievements this year:
- once ate a whole family size pizza alone
get on my level
@Mahesh If you sold your product before the patent was admitted, the other party either can't patent it or loses the patent pretty easily
@Mahesh depends if the jurisdiction is first to file or first to invent
cache to VL3 means throwing it from one memory controller to another and a cycle taken to do so, which means two memory controllers would have to wait a cycle to do anything else. I can see the benefit, but also possible issues. cross talk might stall other things til it is done and possible attacks to force a leak of memory from one to another to then be read? be interesting to see where they all go with all the new cache things coming around.
And here I was, thinking that future AMD designs might incorporate a L4 cache on the IO die. Now I am not so sure about that anymore as there seem to be more options.
That would seem like the logical next step after 3D stacking (as far as I understand it), but making a 'virtual' cache pool will be a LOT faster than L4, especially if it is like this with L2. Increasing size but drastically reducing latency.
@@TheStin777 It's all workload dependent though. The IBM Z series chips are tailored towards financial computing.
Afaik cache uses a fixed mapping between main memory addresses and cache addresses to maximize search speed. Which means that useful data is never packed densely and there are lots of gaps filled with vacant cache memory all the time. Sounds like an excellent resource to turn into a virtual L3/L4.
Makes me think of Intel's use of register renaming, a technique that abstracts logical registers from physical registers. Plus Intel loves cache on their chips, so this already sounds like it's an Intel Technology.
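For reference, the fixed mapping mentioned above usually looks like the sketch below in a set-associative cache: the address is split into tag, set index and byte offset. The line size, set count and associativity are assumed round numbers, and the toy only shows that some access patterns pile into one set; it says nothing about how much slack real workloads leave.

```python
LINE_BYTES = 64   # assumed cache line size
NUM_SETS = 512    # assumed number of sets
WAYS = 8          # assumed associativity (64 * 512 * 8 = 256 KiB)

def split_address(addr):
    """Split a byte address into (tag, set index, offset within the line)."""
    offset = addr % LINE_BYTES
    index = (addr // LINE_BYTES) % NUM_SETS
    tag = addr // (LINE_BYTES * NUM_SETS)
    return tag, index, offset

def set_occupancy(addresses):
    """Distinct lines that land in each set, capped at WAYS per set."""
    tags_per_set = {}
    for addr in addresses:
        tag, index, _ = split_address(addr)
        tags_per_set.setdefault(index, set()).add(tag)
    return {i: min(len(tags), WAYS) for i, tags in tags_per_set.items()}

# Sixteen addresses spaced LINE_BYTES * NUM_SETS apart all map to set 0.
stride = LINE_BYTES * NUM_SETS
hot = set_occupancy(range(0, 16 * stride, stride))
print(split_address(0x12345), hot)
```

In this example sixteen lines fight over the eight ways of set 0 while the other 511 sets hold nothing, which is the kind of uneven fill the comment is pointing at.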
Everyone loves cache, because memory access is relatively much slower compared to CPU cycles.
Intel have reduced cache sizes (P4) and love market segmentation with cache being a differentiator.
I regret to inform you that no, caches are in fact not "an Intel Technology", they're quite... widely used, to say the least.
@@WhenDoesTheVideoActuallyStart There used to be a web site (behind a paywall, I forget its name) that would talk about problems with Intel CPU manufacturing process (before the +++++ stuff) and that site would always talk about Intel always wanting to be a memory company so their half ass processors always had a ton of cache to paper over what a crap processor they made, this being a technology to add a CRAP ton of cache to a CPU thus sounds like an Intel technology to me, to paper over the fact Intel is a crap CPU company that is a monopoly and the only thing keeping it in business is monopoly power and cache memory to make it look like an innovator.
@@DaxVJacobson Every modern processor has shittons of on-die cache, because they're all wide 5GHz data crunching machines, while the basic DRAM cell hasn't changed much for decades (besides the shrinking), and still has the same old access latency, which is monstrous when compared to modern on-die SRAM.
Every year that passes DRAM is more and more of a restriction to the performance of CPU's, and pre-fetching as much data as possible into faster SRAM caches is one of the only solutions besides magically eliminating Amdahl's law or having STT-MRAM catch up with the 5nm node.
@@DaxVJacobson Companies have been trying to add a CRAP ton of cache to a core since the Arm days of the 80's. Intel sold many Haswell processors with 256MB of cache to the financial market.
In the multi-core era, Intel processor caches were inclusive, so whatever is in L2 is also in L3. On an L2 miss, it is only necessary to look in the shared L3 without worrying about what's in other cores' (private) L2. With Skylake (SP), L2/L3 became exclusive (AMD did this earlier?). On initial load, memory goes straight to L2. On eviction from L2, it goes to L3. To implement this, each core's L2 tags are visible on the ring. This being the case, L3 is no longer particularly necessary - however, it may be desirable to keep L2 access time reasonably short.
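The inclusive versus exclusive difference described above can be shown with a toy single-core model. This is only a sketch under my own assumptions (LRU replacement, tiny made-up capacities, one L2 and one L3); it is not a model of any actual Intel or AMD design.

```python
from collections import OrderedDict

def _insert(cache, capacity, line):
    """Insert `line` as most recent; return the evicted line, or None."""
    cache[line] = True
    cache.move_to_end(line)
    if len(cache) > capacity:
        victim, _ = cache.popitem(last=False)
        return victim
    return None

def access(l2, l3, line, l2_cap, l3_cap, exclusive):
    """Return 'L2', 'L3' or 'MEM' depending on where `line` was found."""
    if line in l2:
        l2.move_to_end(line)
        return "L2"
    where = "L3" if line in l3 else "MEM"
    if exclusive:
        l3.pop(line, None)                   # the line moves up, no copy kept
        victim = _insert(l2, l2_cap, line)
        if victim is not None:
            _insert(l3, l3_cap, victim)      # L2 victim drops into L3
    else:
        dropped = _insert(l3, l3_cap, line)  # L3 keeps a copy of every L2 line
        if dropped is not None:
            l2.pop(dropped, None)            # back-invalidate to stay inclusive
        _insert(l2, l2_cap, line)
    return where

def distinct_lines_held(exclusive, l2_cap=2, l3_cap=4, n=10):
    """Touch n different lines once each, then count what is still cached."""
    l2, l3 = OrderedDict(), OrderedDict()
    for line in range(n):
        access(l2, l3, line, l2_cap, l3_cap, exclusive)
    return len(set(l2) | set(l3))

print(distinct_lines_held(exclusive=True), distinct_lines_held(exclusive=False))
```

With the same hardware budget, the exclusive arrangement holds L2 + L3 worth of distinct lines while the inclusive one holds only L3 worth, which is one reason a big L2 makes an inclusive L3 look wasteful.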
As per usual in the last 12 years, it's IBM doing the major innovations.
You did a marvelous job in explaining and simplifying the subject matter, and I think many a layperson who has even a minimal grasp of system\processor architecture would not fail to grasp what you've presented.
Comment for the algorithm
One thing to think about is that the virtual caches will never be as big in practice as they are in theory. L2 in use subtracts from available L3, and both L2 and L3 subtract from L4. On multi-core loads with heavy L2 usage, L3 and L4 could effectively become tiny. Depending on the cache management software, the usual model where cache size grows with cache level could be turned on its head and cause problems. The cache management software's design and performance will be a key player in making or breaking this technology in the field.
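A quick back-of-envelope version of the point above, as a sketch. It assumes the simplest possible rule (whatever a core is not using privately can be lent to the others) and uses 8 cores with 32 MB of L2 each as example numbers; the real allocation policy is surely more involved.

```python
def virtual_l3_available(l2_mb_per_core, private_use_mb):
    """For each core, the L2 space (in MB) the *other* cores could lend it,
    given how much of its own L2 every core is using privately."""
    free = [l2_mb_per_core - used for used in private_use_mb]
    total_free = sum(free)
    return [total_free - own_free for own_free in free]

L2_MB = 32  # assumed per-core L2 size
light = virtual_l3_available(L2_MB, [4] * 8)             # mostly idle cores
mixed = virtual_l3_available(L2_MB, [32, 32] + [4] * 6)  # two busy cores
heavy = virtual_l3_available(L2_MB, [30] * 8)            # every core busy
print(light[0], mixed[0], heavy[0])
```

Under these assumptions a busy core sees 196 MB of borrowable space when its neighbours are quiet, 168 MB when one other core is also saturated, and only 14 MB when all eight are hammering their own L2, which is smaller than the L2 itself.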
Whoa! This is the same Dr. Ian that got AMD to price the Threadripper 3990X at $3990. Subbed!
Caching is very important if you want to do large fast Walsh Hadamard transforms. Which you might for SwitchNet.
Cache is ultra important for future CPUs, and the elimination of L3 by IBM that you highlighted is something Apple did a long time ago with their Axx ARM-based CPUs, where they completely dumped L3 and enlarged L2. Having a virtual L3 cache made out of L2 cache is definitely the way to go, but we are still not where we should be with system memory, which should run as fast in clocks as the CPU itself. Future computers and CPUs need to have memory as fast as the CPU itself to achieve the greatest possible efficiency, ideally with that memory also being non-volatile. Having RAM as fast as the fastest cache, with capacity and non-volatility, sounds like sci-fi, but it's what we would need for the future. Thank you for highlighting this new design from IBM, it's really something special and proves again that IBM is a superb CPU design company!
This concept reminds me of a paper I wrote on virtual scratch pads back in 2012... at the time there was also the concept of virtual caches... which allows you to borrow space from on-chip memory across other cores... very interesting to see IBM use something similar here... btw i like how the timelines show up as you speak, i will need to learn to do that for my videos :)
You may need to check with a patent lawyer ... this is obviously a "big bucks" feature, and if you had mentioned "virtual caches" back in 2012, then the idea is public domain such that others can freely use the idea ... or maybe you are entitled to royalties as well (I'm sure IBM has patents issued on this).
@@ear4funk814 Why would they? They obviously don't have the patent and the only question is whether AMD or Apple etc gets sued for patent infringement, at which point they'll do a paper search for prior art.
Thanks. Great explanation of cache and its relevance. I like your term 'workload profile' and its application to the flexibility of virtual cache.
I work in a DC for a bank and they just installed a load of the newest model of Z Mainframes, these things are sick, and some of them even have liquid cooling built in. Amazing machines.
Tightly coupled local data stores are one way of reducing memory pressure. Can be private or coherent shared public depending on the workload and application.
I think 3D Cache or Stacked Cache is going to be the future for a couple of reasons.
1. Space. Like you said in the video. Stacked Cache allows bigger and faster Cache, 4GB is doable.
2. Price. The logic behind Virtual Cache systems is more expensive than 3D Cache.
3. Applications. There are tons of ways you can create very fast chips using 3D design.
Imagine a 3D ( Stacked Chip) that's 3 layers. 1st layer for CPU's, 2nd for Massive Cache, 3rd GPU's.
That's a very fast and crazy chip having both CPU and GPU using the L3 Cache in the middle.
That's some pretty awesome power you get in 1 chip package for the majority of workloads.
But the best idea would be a hybrid 3D Virtual Cache. Eventually you'll see CPUs getting GBs of very fast memory, be it L2, L3 or L4.
I finally understood why and how smaller FFT sizes produce more heat in P95 and larger FFT sizes stress the RAM more. I mean I knew about the relationship but didn't know how it all played out.
You explain things really well. Thumbs up.
Diggin that Konami Code brother, also, I came here from a Gamers Nexus video, and I gotta say you've earned a new subscriber.
I wonder, with newer packaging technologies, are we going to go back to the Pentium Pro or Pentium designs with off-die cache? With the newer interposers, a larger L3 or L4 can be on package, using newer or other nm processes that can make cache faster or more power efficient.
AMD 3D V-Cache is apparently twice as dense as the on-die L3, because it uses a library optimized for cache, on the same 7nm process.
From a high level this sounds very similar to AMD's GPU cache for RDNA3... except its a bit backwards, where each set of CUs can always get at least one cache area, but they can also be granted extra cache compartments (or potentially denied if they are streaming data rather than needing to cache). AMD instead of dedicating cache to cores... is working on figuring out how to increase cache utilization overall based on what the workload is doing actively.
I can see this only working if they are figuring that at least 1/2 the cores on each socket are going to be idle at any given time. Otherwise if this is like a virtual environment where even at 40% load all cores can be active just at different frequencies, I would think that you wouldn't have the spare cache for this to make sense. Even chugging along at say a 10% core load you would want as many instructions stored in cache as possible. That would mean that each CPU would be using all of its 32MB L2 with no room for virtual L3.
Love the choice of a K6-III+ 400mhz ATZ in your graphic at 1:24 -- I'm using one of those in my super socket 7 vintage gaming PC, what a neat processor.
Incredible. We need this on consumer platforms RIGHT NOW.
Or more Registers - Itanium had rather great performance, but we had the "but compilers would need to be more complex"....
For the tasks they were designed for they still have incredible performance, but the architecture was clearly designed with specific workloads in mind. I have seen an 8-core 2GHz Itanium outperform (ever so slightly) a 12-core 2.8 GHz Xeon (and the Xeon was 4 years more recent) for our VM workloads, but change the settings so they are not "correct" for the CPU/workload combination and it was several times slower.
The caching strategy seems a lot like SSDs, where it can allocate certain portions of the cells for higher speed... in this case it dynamically allocates L3 and L4 cache, also for higher speed.
Again a great opportunity to explain the trade-offs of bigger caches versus higher latency taking AMDs Zen 2 and Zen 3 APUs as example because I still don't understand why Zen 3 APUs seem to have fallen behind their CPU counterparts compared to the previous generation, even though the opposite should have happened, as they doubled the L3 cache.
It only takes that long because of bus hierarchies, memory caching, prioritization, etc - you could attach DRAM as level 2 cache, and your latency would be reduced, but your memory throughput would be atrocious.
This is super context sensitive. For example on normal desktop workloads, more L2 would be great... but mainly if there were fewer cores. Because there would be more task swaps. But with higher core count the scheduler remembers the process or thread affinity for this very reason, to keep the specific core cache hot for that task. This is so complicated. I can't really imagine it being a guaranteed improvement. Very interesting, lots of power vs performance vs workload vs software optimization type considerations.
Fun fact. The K6-III and K6-III+ at 1:41 did have 256KB of L2 cache on die.
Just a technicality correction about something that didn't sit well with me (1:53): They don't start small because they are really fast, they start small because they are really, really close to the CPU cores and the circuitry needs to be really, really short in order to be that fast, which limits the available space that fits the two aforementioned criteria.
Talking about getting IBM to explain part of their CPU to you.. In about 1987-88 I took assembly programming on the (presumably then current) IBM Mainframe (lvl)1 & 2 - I suspect the design team all had very small, shaky, tenuous grips on reality.. Same team probably *had* to write the C compiler, too. ;) I totally *loved* the Apple ][+'s 6502 assembly, though.
CPU design is all about tradeoffs. There's no substitute for fast main memory in the general case. That said, I can see this being faster for some things but slower for others. The main benefit I can see is that this should make the cache size(s) dynamic to a certain extent, at least up to the hardware limits, so it/they can better adapt to changing workloads. Whether that winds up being enough to overcome the extra latency is the question. I'd love to see some solid benchmarking that explores this idea.
Of course it will also be interesting to see what new side-channel attacks this enables, a la spectre and meltdown.
Prior to a hard core gaming session, serious gamers run a game sim/file that loads all the graphics and sprite files into main memory to speed up game performance. The idea is, the game slows down (lags) the very first time a G/CPU calls for a file that is not in memory. So, to increase game performance, they pre-loaded sprite and graphics files into memory ahead of time, esp. for the piece (char/race) they are playing.
3d stacking is going to allow for even more complex ways to feed the critical components in these chips. doubling the theoretical number of layers a cpu can have is very powerful. everything is about latency. for the most part we can only consider large structures on the sides of the cores, cant go up or down. stacking you can now build a cache under the logic. its like a skyscraper vs house, a bungalow
There seems to be a right time for each approach and design to work.
IBM seems (always) to be right at the tip of the spear, that is, innovation.
At earlier times such large pools without some highly intelligent mechanism couldn't make sense.
If I am not mistaken, IBM also had a connection with Global Foundries and the first Zen iterations; the whole CCX, core, cache structure, I think, was influenced by the trends of the μ-arch at the time, and each time IBM seems to be there.
Also it seems that Intel is starting to make larger L2 structures with the new cores... maybe going towards this approach?
I hope your channel keeps growing , it transcribes to me some good vibes and has a feline feeling to it..
Do you think HBM could be used as a massive 10's of GB of L4?
A large down side with this "giant L2" approach is that it only lends itself to the tasks it was made to do.
If enough cores are working on data sets larger than the L2 you'll run out of space.
This is not as much of an issue with L3, even less when you bring 3d stacked L3 into the mix.
lolz look at the old Starfire from Sun MS nearly 20 years old and still rocks from a tech perspective, amazing architecture
Software developers would love to have a lot more L2 cache (and L3 if that's not possible). The only reason games and other programs can work with current smallish caches is that the features that would require more cache are just not implemented.
I knew someone would come up with this at some point. Props to them
This is an excellent report and overall analysis Ian, well done!👍 IBM is really creating something special with Telum, and its approach, even though quite different with the virtual, shared L2 cache, reminds me of how Apple ditched the slower L3 cache and instead expanded the L2 cache in their newer AXX chips. Still, IBM is coming up with something special again which might inspire both team blue and red as well as Arm.
Regarding cache itself, it would be awesome to one day have GBs of high speed cache on the SoC package which would completely replace DDR and even HBM. It's quite clear that the future of memory is to get it to run much faster, so that RAM runs almost as fast as cache and non-volatile memory like Flash runs as fast as RAM. The ideal world would be to have one single, unified memory which runs as fast as the fastest Flash, but it's non-volatile😁