Did you get to test out running LLMs on these GPUs? I'd be curious how many tokens per second these bad boys can push out, especially since it seems like LLMs are going to be a main point of interest for AI companies for at least the next 1-3 years.
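For anyone who wants to measure this themselves, here's a minimal tokens-per-second sketch using Hugging Face transformers. It's only an illustration: "facebook/opt-125m" is just a small placeholder model so the script runs anywhere, and a ROCm build of PyTorch exposes the card through the usual "cuda" device name.

```python
# Rough tokens/second measurement sketch. Assumes transformers + a ROCm or CUDA
# build of PyTorch; "facebook/opt-125m" is just a small placeholder model.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-125m"
device = "cuda" if torch.cuda.is_available() else "cpu"  # ROCm builds also report "cuda"

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)

inputs = tok("The quick brown fox", return_tensors="pt").to(device)
model.generate(**inputs, max_new_tokens=16)  # warm-up so setup cost isn't timed

start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=128)
if device == "cuda":
    torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"~{new_tokens / elapsed:.1f} tokens/s")
```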
Appreciate the work you've put into this, Wendell. I think AMD needs to support not only frameworks like TF and Torch but also model conversion from one framework/hardware target to another. Basically, the mapping of primitives between systems.
Are the differences in the images really due to different precision levels in the hardware, or is it (also partly) due to limited determinism and reproducibility? After all, you're not guaranteed to get the same image twice, even when using the same seed and hardware.
This just popped up in my stream and I would be interested in an update, as it's now known that the Nvidia 50xx series will not get any VRAM capacity upgrades (sans the 5090), so it may be time to look at AMD.
I always spun that technical difference as "one is a more mature, but less complete, offering." So then it became a question of what is good enough for their needs.
To be honest, you suggested that ROCm went from "can't train shit" to "can't train shit", which is what Nvidia specialises in. There are more inference startups dying each day than MI200s and MI300s combined shipped that day, and every vendor is coming up with their own inference chip. Why would AWS offer the MI200 or MI300 when they can offer their own Inf1 and abstract any software difference under ML frameworks? And if they do, why would anyone use that instead of Inf1, or better yet, build their own?
I’m glad to be an AMD shareholder, although I guess I might grab a few more shares just in case. (My AMD shares have made a killing so far especially off this AI hype bubble.)
AMD keeps dropping old GPUs from ROCm. RDNA has been ignored forever - not even OpenCL worked at launch with the regular drivers. So there will be little uptake when Nvidia still has something that performs OK at the lower end. Gaudi 2 is also looking OK, and Intel seems committed to having the software running on potatoes.
There is a fork of Automatic1111 that supports AMD. It's in the main project readme, or a Google search away. It seems they randomly decided to drop ROCm support for some older cards a few months ago: RX 5xx isn't supported, and I think Vega was also dropped.
Are you sure that the visual differences are because of the different hardware? Is Xformers disabled? I think it should be disabled for a test like this. I think it would explain the visual differences.
This is really interesting. I'd like to know how a 4090 compares to a 7900 XTX for these workflows. I know both are consumer products, but I feel at the top end the line is blurred.
Where is a Microsoft DirectX-style layer that sits on top of the GPUs and makes ML vendor-agnostic (even if it makes it OS-dependent)? If you don't like the OS-specific DirectX API, then swap in the Vulkan API. I've heard of DirectCompute and OpenCL, but they don't seem to have gained traction - why? Also, why is ROCm needed when you have those APIs - what is it that lets CUDA compete against all of the above?
Hi, I would like to get started with ML and currently have two offers for a graphics card: an RX 6800 16 GB and an RTX 4060 8 GB. Do you know if the 6800 would be suitable for getting started, or is it better to go with the 4060? Thank you in advance!
Assuming that wasn't a joke, it's just Level One Techs, but Windows or whatever is cutting off either side of the wallpaper: L]evelOn[e techs. I only recognized it because I've seen the full picture in one of the other vids, and even then it tripped me up a little.
Unless Google sells TPUs for enterprises to host themselves, I don't think there will be any large-scale adoption for use in consumer products. See, OpenAI trained their models on GPUs; it's best to assume that's Nvidia hardware.
Working in OpenGL for a long time, I've come to sum it up as Nvidia playing it fast and loose and AMD being more accurate. And then there's Mesa, which is as close to a reference implementation as you'll get.
Hi Wendell, have you been following the Tinygrad stuff and their troubles with ROCm at all? They look like they have some real work™ they'd like to be able to use AMD for in ML so I think it would be interesting for you to check out.
The demo is for inference, but training is the key advantage for Nvidia. You need to get compute cards out at gamer-card scale in order for that software support to level out; that's why Ponte Vecchio and TPUs are DOA as consumer products. But let's suppose AMD does catch up on the desktop. For mobile, Apple, Google and Samsung own their own stacks. For robotics, Nvidia already has Jetson. The market beyond the desktop would need to be big for AMD to really be able to invest and nail AI.
The A100 being a little off isn't surprising. Saw the same thing when comparing Amazon's CodeWhisperer and the OpenAI-based Copilot: Copilot's code was only 75% correct, Amazon's was 90% correct.
Let's hope this works for the consumer GPUs as well. The MI210 is $20k on eBay, the A100 around $6k. Hmm. AMD should make a rather slow GPU with a lot of VRAM for inference, priced below $1k. Allegedly VRAM prices are currently very low.
The price difference is because the MI210 is relatively new, released early last year. The A100 is three years old now, and it has also been deployed a lot more; every cloud provider offers some A100 server instance. Meanwhile, the only AMD GPUs I can find are Radeon gaming GPUs. The V620 is just a Navi 21 chip with 32 GB of VRAM.
What'll be interesting is the MI300C - which will be all CPU chiplets - and Turin-AI, with Xilinx AI chiplets... MLID has a video about it. A dual-socket version could have more TOPS than an H100.
@@samlebon9884 There's a very reliable rumor channel I've followed for a few years called Moore's Law is Dead, which spoke about the MI300C chip (all CPU chiplets with HBM3) and a separate AMD project called Turin-AI, a mix of Zen 5 chiplets together with Xilinx AI chiplets on a single package, which in a 2P config would be about as powerful as an H100.
Please, please, Wendell, use your mighty powers and shake down the folks upstairs for answers on when ROCm Windows support is coming. It would actually bring more value to AMD's so-far-lackluster RDNA 3!
I hope ROCm gets better so that I can go back to using DaVinci Resolve with hardware acceleration on AMD... Using it on Linux is such a pain, and I am about to just buy a damn Nvidia 12 GB card and sell my 16 GB AMD card just so I can use the Resolve Studio version I paid for. It sucks because I want to drop Windows, and the only thing stopping me is video editing. I either go back to dual-booting Windows, which sucks, or buy a card with 12 GB of VRAM... not ideal.
Just how deterministic are these prompts? How similar are the same prompts using the same seeds on the same hardware? Some of those differences were rather subtle, but wow, some of them had huge differences in results.

As far as precision goes, I believe AMD inputs two matrices of 16-bit values but the actual pipeline throughout the multiply and accumulate steps is 32-bit. On the green side, two 16-bit matrices are inputted but they only accumulate into a 32-bit value at the end. Both have options to output 16-bit results. I always get confused about who supports half precision and who supports 16-bit bfloat. Just the difference between half precision and bfloat should be enough to produce the kinds of differences we've seen here. The only company really doing higher precision for tensor operations is IBM, which supports 32-bit values for the elements of a matrix and 64-bit accumulate. These are all gloriously IEEE 754 (-2008?) compliant, so existing data sets can be imported and executed on quickly without needing to drop precision first. While AI has been pushing the usage of 16-bit floating point, there is certainly a market sector that would like to leverage tensor operations at the higher precisions used in more traditional single- and double-precision workloads. An easy prediction would be that it'll follow various GPU trends in this area, where they either include a smaller amount of 32- or 64-bit precision tensor hardware or modify the existing 16-bit hardware to compute 32- or 64-bit values over multiple cycles.

I'm really optimistic about MI300A and MI300X performance. The MI300A in particular should be interesting to test due to its inclusion of some Zen 4 cores inside the same package. On the same note, it wouldn't surprise me if AMD begins swapping out some of the CCX dies on an EPYC package for some CDNA dies or FPGA (or both!). The flexibility and potential from this strategy is enormous. More importantly, and as pointed out in this video, AMD has been making massive strides on the software side. The gap hasn't closed, but you can now see the horizon where it does.

With regard to TPUv4, has there been any indication that Google has this up and running internally? Are they waiting on 4 nm yields/supply? Can they get enough HBM, which has skyrocketed in price lately?
On the same GPU, with the same settings and the same version of the software, it is 100% identical. Though using xformers or token merging makes it non-deterministic, and using a different GPU can also cause differences. To be sure of getting a similar image on a different GPU, you can use CPU noise generation. At least I think so, since the option is there, but I never tried it.
If you don't use xformers, then with the same seed, same prompts, and same settings it's absolutely deterministic. I had this exact discussion with someone on Discord and I helped him set up his A1111 on his laptop. Even though I had a 3090 and he had a 2070, he was able to create the same image. After all, it is math and an algorithm.
@@nexusyang4832 That is the expectation, but how did you verify? Did you run a pixel-value difference between the two results? Conceptually they should match, but things can get weird when operating near the edges of precision and rounding. Mathematically a*b*c*d*e*f*g*h = h*g*f*e*d*c*b*a, but things can get messy when dealing with floating point numbers and order of operations. Thankfully code is deterministic when executing, right? Well, you do have things like out-of-order execution that can rearrange how code is executed. Still, in a vacuum the result should be identical to other iterations of itself on the same hardware: things continue to execute out of order in a deterministic fashion. However, bringing ideas like SMT into the picture, you can construct a scenario where out-of-order execution changes slightly each iteration based upon the shared resources available for execution. Differences can then emerge in what was originally a deterministic process because of the low-precision nature of the data structures involved (16-bit floating point elements in a matrix for the example in the video). This is a bit of an extreme example of how to induce chaos into what is otherwise a deterministic algorithm.
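A quick way to see both effects on any machine (no GPU needed): how fp16 and bf16 round the same values differently before any math even happens, and how the order of a reduction nudges the result. This is only an illustrative CPU sketch, not what the Stable Diffusion kernels literally do.

```python
# Illustrative CPU-only sketch: the storage precision and the order of
# accumulation both change results at the last-bits level.
import torch

torch.manual_seed(0)
v = torch.randn(1_000_000, dtype=torch.float64)

# fp16 and bf16 spend their 16 bits differently (mantissa vs exponent), so they
# round the very same values differently before any arithmetic happens.
for name, dtype in [("fp32", torch.float32), ("fp16", torch.float16), ("bf16", torch.bfloat16)]:
    err = (v - v.to(dtype).double()).abs().max().item()
    print(f"{name} worst-case rounding error: {err:.3e}")

# Float addition is not associative, so a chunked/parallel-style reduction can
# differ slightly from a straight serial sum of the same fp32 data.
x = v.float()
serial = x.sum()
chunked = torch.stack([c.sum() for c in x.chunk(64)]).sum()
print("serial vs chunked sum difference:", (serial - chunked).item())
```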
Nvidia going the inaccurate-but-faster route has always been a problem for AMD. Because Nvidia is the market leader, most software actually expects the inaccuracies in Nvidia's implementations of the standards, which leads to the software not working on the technically more accurate implementations from AMD (or even Intel). So to the consumer, that means software doesn't work properly on AMD and at the same time runs slower. Before the recent rewrite this was also a problem for OpenGL, some versions of DX, etc.
AMD needs two modes, "Accurate" and "Green Team Inaccuracy"
More like, red dot accurate, noisy green data
Calling it "green mode" and justifying it as "oh it uses less power because it is less accurate" might actually be something they could do
You need two modes... wrapped, and dropped
ROCm seems to have planted itself in the scientific HPC world, let’s hope it can grow from there
With CDNA, yes. With RDNA 1/2/3 they've severely dropped the ball and didn't adequately make it clear that that was the plan all along. On the consumer side, which is where hobbyist compute lives, the 6950 XT was the first card to approach the Radeon VII for a traditional (non-AI/ML) scientific workload. The 7000 series is actually worse, as they cut FP64 performance, and the memory model with Infinity Cache split 5/6 ways (and/or something else) seems to have hurt this specific workload (OpenCL, which is why it can be tested).
George Hotz to the rescue would be awesome.
ROCm 6.0 just dropped today! I'd love for you, Wendell, to do an update on this video to show off all the advancements in 6.0 and whether there are any noticeable performance bumps 🙏
TensorFlow never directly competed with CUDA; it sits on top of CUDA. TensorFlow's primary competitor was (and still is) PyTorch. Both TensorFlow and PyTorch can run on TPUs, but of course TensorFlow has first-class support there. Both TensorFlow and PyTorch have first-class support for CUDA. I suspect the real reason TensorFlow hasn't been as popular lately is two-fold. First, a lot of internal Google development resources have moved on to developing JAX instead of TF, and secondly (and more importantly), PyTorch is simply better than TensorFlow. It's significantly more enjoyable and easier to use. And the reason CUDA has beaten out TPUs is also simple: you can only get TPUs on Google Cloud, whereas every cloud, every enterprise datacenter, and every school has direct access to CUDA-capable devices. Everyone uses and develops for them, whereas TPUs and the XLA compiler are basically only developed by Google.
Also, in deep learning we actually don't mind the reduced accuracy for many problems. In fact, a mix of 32-bit and 16-bit is the *default* data format for deep learning now. Reduced-precision deep learning is extremely important for large-scale neural network development, for three reasons. First, obviously, if you use fewer bits for your model, you can fit a larger model in a single GPU's memory, which makes development easier. Second, the Tensor Cores basically double their FLOPS every time you halve the precision of your data. So if you have 256 TOPS using 32-bit floating point data, then you have 512 using FP16 data, and 1024 TOPS using FP8 data. Even further compression work is being done for INT8 and even INT4. Finally, one of the most important and oft-overlooked issues is that many neural net architectures require very high GPU memory bandwidth - that's why data center GPUs use HBM. When you reduce your data from 32-bit to 16-bit floats, you reduce the memory bandwidth pressure by half.
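For reference, this is roughly what that mix of 32-bit and 16-bit looks like in practice: a minimal training-loop sketch with torch.autocast, assuming a CUDA or ROCm build of PyTorch (ROCm builds also expose the "cuda" device type). The model and data here are throwaway placeholders.

```python
# Minimal mixed-precision training sketch: master weights stay in fp32 while the
# matmuls inside autocast run in 16-bit, halving the bytes moved per operation.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16  # fp16 autocast needs a GPU

model = torch.nn.Linear(4096, 4096).to(device)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))  # loss scaling keeps fp16 grads from underflowing

x = torch.randn(64, 4096, device=device)
y = torch.randn(64, 4096, device=device)

for _ in range(10):
    opt.zero_grad(set_to_none=True)
    with torch.autocast(device_type=device, dtype=amp_dtype):
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(opt)
    scaler.update()

print("final loss:", loss.item())
```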
We won't consider AMD cards until they're competitive at FP16 performance with CUDA, and even then, AMD would REALLY need to convince us that their software stack works as seamlessly as CUDA does - you have to add wasted developer and data scientist time to the total cost of the device to get a proper apples-to-apples comparison. We just started getting our H100 deliveries in, and they are truly beasts. I'm hoping we can get some AMD hardware in for benchmarking at some point soon.
Pin this comment above.
It all sounds viable from the hobbyist / small company standpoint. But come on, if you can afford H100s, you're big and successful enough that you can just invest in AMD as a backup plan. This would basically be the equivalent of Valve saying "All PC gamers are on Windows, so we won't invest in Linux". At a certain point, you're the one who has to make it happen.
Reduced precision is NOT the same as violating the FP standards. Going from FP32 to FP16 is a reduction in precision, but if the hardware implements the standards correctly, an FP16 calculation should have the exact same result no matter what card you run it on. Fudging the calculations probably doesn't make a huge difference for most ML applications, but for companies that need auditability (eg finance) or even big tech companies that want to debug an issue affecting a million users out of their billion users... Standards compliance is important, and Nvidia needs to fix their shit.
We have both an MI210 64GB and an A100 40GB for my FluidX3D OpenCL software. Both cards are fine and the software runs flawlessly, but they are super expensive. Value in terms of VRAM capacity is better for the MI210, yet performance (actual VRAM bandwidth) is better on the A100. Somehow the memory controllers on AMD cards are not up to the task: 1638 GB/s promised, 950-1300 GB/s delivered. The A100 actually does its 1500 GB/s. Compute performance for such HPC workloads is irrelevant; only VRAM capacity and bandwidth count.
What a time we are living in... ~1000 GB/s is not enough 😅
@@mdzaid5925 Crazy, right? Transistor density, and with it compute power (FLOPS), has grown so fast in the last decade that memory bandwidth cannot keep up. Today almost all compute applications are bandwidth-bound, meaning the CPU/GPU is idle most of the time waiting for data. Even at 2 TB/s.
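For anyone who wants to sanity-check their own card, here's a rough effective-bandwidth sketch in PyTorch (works the same on ROCm and CUDA builds). It's only an approximation: a device-to-device copy reads and writes every byte once, hence the factor of two.

```python
# Rough effective VRAM bandwidth check via a big device-to-device copy.
import time
import torch

assert torch.cuda.is_available(), "needs a ROCm or CUDA build of PyTorch"
n_bytes = 2 * 1024**3  # 2 GiB per buffer
x = torch.empty(n_bytes, dtype=torch.uint8, device="cuda")
y = torch.empty_like(x)

for _ in range(3):          # warm-up
    y.copy_(x)
torch.cuda.synchronize()

reps = 20
t0 = time.perf_counter()
for _ in range(reps):
    y.copy_(x)
torch.cuda.synchronize()
dt = time.perf_counter() - t0

# Each copy reads n_bytes and writes n_bytes, so count the traffic twice.
print(f"~{2 * n_bytes * reps / dt / 1e9:.0f} GB/s effective")
```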
@@ProjectPhysX True... not sure about the performance implications, but computing has evolved very, very rapidly. When I think how small each transistor is, and how many and how closely they are packed, it feels impossible. Personally, I feel that eventually analog neural networks will take over and GPU dependency should be reduced to only training / assisting the analog chipsets. Also, I don't have too much faith in the current generation of "AI" 😅.
What kind of setup do you use with this software? Windows? Red Hat?
@@Teluric2 for these servers openSUSE Leap, for others mostly Ubuntu Server minimal installation.
I completely agree; if history has shown us anything, it's that when Lisa Su goes all in on something, that something tends to work and work well. I'm just excited to see the market get more diverse as opposed to "CUDA or gtfo"; closed ecosystems like that are bad for everyone.
Except it was AMD's own fault that CUDA became the standard for GPU compute work, and they still have not learned that adding features to hardware and slapping them on the box is not enough to win. They actually have to provide support and funding to develop third-party software that uses the features of the hardware.
@@psionx1 Give them a break, they are running on less than half the budget, so of course they'd have to pick and choose their fights.
Honestly, I'm hopeful for ROCm on consumer hardware soon, and on Windows. If you're someone who uses any form of creative app like Blender or the Adobe suite, then you know how valuable CUDA is, and this really could be the boost AMD needs. I've been trying my best to recommend AMD, but it's surprising how many people go Nvidia because of how much better Nvidia is in creative apps, even if they don't use them. It's always "well, I might want to use them in the future, so I'll just go Nvidia."
Soon there will be little excuse not to go AMD, and I'm all for it; competition is good. Not that I'm in any way an AMD fanboy - I know for a fact that if AMD somehow dethroned Nvidia as the market leader, they would pull the same shit Nvidia does, but competition is what is meant to stop that.
Funnily enough, they most likely won't use them in the future. I've seen a lot of people using the same argument to go Nvidia, but they don't even install any creative applications after buying their GPUs.
Also, I've been using AMD for about 7 months, and it isn't necessarily horrible for people who just want to do video editing with Premiere Pro, Illustrator, or Photoshop. I use that software pretty regularly and I face no problems with it.
Yup. For about a decade I was scratching my head over why AMD had such a lousy software strategy. It had great hardware, but the drivers and the lack of tools or APIs for programmers just seemed like a huge business mistake. A perfect example was the time and resources spent on AMD's ProRender. Considering the multitude of professional and high-quality open source renderers, ProRender was a pointless exercise; better to spend the manpower and money on driver development, or even on OpenCL while it was viable.
At least with ROCm they now seem to understand that everything that is needed to support the hardware is as important as the hardware itself.
AMD was promising good consumer GPGPU "soon" 15 years ago, lol.
I got a 7900 XT when ROCm 5.5 came out, specifically to use with A1111. It works pretty well. To give an idea, I tried 32 images of Danny DeVito at 768px with 20 samples; it took 2:30. That was with an 8x4 batch - if I do 16x2 it takes 2:40, and 32x1 took 3 minutes. So yeah, the performance is there. I can just imagine how fast the MI300 will be.
I thought ROCm was only supported on very few 6xxx GPUs from AMD and their server-class GPUs.
@@sebastianguerraty6413 ROCm 5.5 fixed that. It added gfx1100 and thus 7xxx support. I've been custom-compiling PyTorch with every new release of ROCm. Can't wait for them to start leveraging the AI accelerator cores in the 7xxx series. Whether that will be CUDA-compatible and exposed via HIP remains to be seen.
@@chrysalis699 When you compile PyTorch for gfx1100, how much of an uplift do you get over stock PyTorch? What benefits do you see from the custom compile in general?
@@sailorbob74133 The stock PyTorch compiled against ROCm 5.4.2 doesn't detect my card at all, so the uplift is infinite 🤣. I doubt there is much difference for RX 6xxx cards, and there is still quite a bit of unlocked potential in the RX 7xxx cards, as I haven't seen any HIP APIs for the AI accelerators. There's actually barely any mention of them on AMD's site, just an obscure reference on the RX 7600. We'll probably have to wait for CDNA 3 for them to release those APIs.
I just noticed that PyTorch nightly is now compiled against ROCm 5.6, so I'll probably just switch to that. 🤞 the next stable release will be built against 5.6.
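For anyone fighting the "card not detected" problem, a quick sanity check of what a ROCm PyTorch build actually sees. Note that ROCm builds still report the device type as "cuda", and the HSA_OVERRIDE_GFX_VERSION trick mentioned in the comment is an unofficial community workaround, not something AMD documents as supported.

```python
# Quick check of whether a ROCm build of PyTorch sees the GPU at all.
import torch

print("hip runtime:", torch.version.hip)        # None on a CUDA-only build
print("gpu visible:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device name:", torch.cuda.get_device_name(0))

# If the card's gfx target isn't in the shipped ROCm libraries, people commonly
# export HSA_OVERRIDE_GFX_VERSION before launching Python (e.g. 10.3.0 for many
# RDNA2 cards). This is an unofficial workaround and may or may not hold up.
```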
I hope to heck it is. It's been Nvidia or nothing until now. Terrific video, Wendell. I like that you have content for the working folks.
We need more high end AI comparisons like this. Hope you get more gear to test!
Well - I hope for some competition. Standards are fine, but one company owning them is very monopolistic. And AMD's disadvantage seemed to be lack of software rather than hardware.
If you didn't know, in A1111, change the RNG source from GPU to CPU and the optimizer to sdp-no-mem. That should make the differences when running on different GPUs as small as possible.
Using xformers on CUDA can be faster (SDP on PyTorch 2 has mostly caught up), but the output isn't deterministic.
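The same "seed the noise on the CPU" idea, sketched outside A1111 with diffusers: generating the initial noise with a CPU torch.Generator takes the per-GPU RNG out of the picture, though kernel-level numerics can still differ between cards. The model ID below is only a placeholder for whichever SD 1.5-style checkpoint you actually use.

```python
# Sketch of CPU-seeded noise for more reproducible generations across GPUs.
import torch
from diffusers import StableDiffusionPipeline

model_id = "runwayml/stable-diffusion-v1-5"  # placeholder: any SD 1.5-style checkpoint
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")  # ROCm builds of PyTorch also use the "cuda" device name

gen = torch.Generator(device="cpu").manual_seed(1234)  # CPU RNG, not the GPU's
image = pipe(
    "photo of danny devito eating soup",
    generator=gen,
    num_inference_steps=20,
).images[0]
image.save("out.png")
```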
Rumor says that ROCm might work for RDNA3 on Windows this fall (per the repo and comments). However, something similar was said earlier about 5.6, and that might not be true anymore?
I really hope the consumer RDNA cards can run ROCm on Windows and act both as an evaluation path for the CDNA platform and as an entry point for AI compute, to democratize AI access.
Having ROCm support on consumer cards on Windows might also build traction with other companies (like Tiny Corp) to embrace the more open solution. Who knows, maybe that will tip the scale in AMD's favor?
Everything points towards that being the case. AMD hasn't said anything officially, but leaked in-development documents suggest a fall time frame.
It will happen eventually, even if not this fall. AMD knows how far behind they are on the consumer side for creative work; they need this.
I'm hanging my GPU choice on this date, because honestly I don't want an RTX 3060 12GB and NGreedier's horrible GeForce Experience 🤮, but I want to get into Stable Diffusion. A 3080 12GB is still waaaay too much! What I really want is an RX 6800 with ROCm for Windows!
Intel says the same thing: their stack works on Windows too.
AMD will see a squirrel by then and abandon yet another project with half implemented "support". Why they would even mess with Windows support at this point is dumbfounding, most systems in this realm run Linux unless they are forced to Windows by some 3rd party need for proprietary crap. Windows may still be king of Ma and Pa Kettle's desktop but that isn't this target market segment.
@@mytech6779 I would like to see a segment breakdown between corporate GPU computing and consumer. I would still think the group of Windows users running Blender, Adobe, or some other graphics program that uses GPU rendering is quite large.
The worst thing about ROCm is the hit-and-miss support for commodity GPUs. Back in the 3.x / 4.x days of ROCm, commodity GPUs were half-heartedly supported, with bugs, and sometimes support was retroactively withdrawn. These days at least they tell you that if you buy anything other than the W-series of GPUs (e.g. the W6800), they don't promise anything.
This, however, will not increase the mind share; all students and budget-strapped researchers just buy off-the-shelf Nvidia GPUs and go to work. If you've picked a commodity GPU and are trying to get ROCm to work, be ready for tons of frustration; really, this use case is unsupported.
Source: my own experience with ROCm 4.x, using rx480/580, vega 56/64 and Radeon VII (the only one that worked reasonably well).
I would add, to the student/budget-research point, that they may not be looking for high performance, but they do need the full feature set to do the primary development work; then, once things are working and somewhat debugged, they'll upgrade to get performance.
Even for big-budget ops it makes no sense to have top-end hardware sitting there depreciating for a year or four while the dev team runs experimental test builds. By the time it comes to a real production run another purchase will be needed anyway.
That core functionality problem has always been AMD's GPU problem: promises that seem good on paper but ultimately don't deliver. "Oh yeah, now that we have your money it turns out you need this specific version of PCIe with that CPU subfamily, on these motherboards with this narrow list of our cards (as we have terrible product-line numbering, so many in the same apparent series don't work) made in these years, with that specific release of this OS..."
Years ago I bought a W7000 (well over $1000, 12 years ago) specifically because I wanted to play with the compute side, and there were claims that it had compatible drivers and such (I use Linux; Nvidia had terrible support back then). Nah, oops, something in the GCN 1.x arch was screwed up and compute was never usable, even after several major driver changes and the supposed open-sourcing. It worked OK for graphics, but my graphics needs are minimal.
Later, I switched to a much newer and cheaper equivalent-performance consumer AMD card that claimed OpenCL support. Nah, again, not really.
Gave me a rather bad taste for AMD. I'm hoping Intel can push some viable non-proprietary alternative to CUDA, I'm due for a new system in the next couple years.
It'd be easier to get students experienced with AMD hardware, and to get open source support for it, if RDNA had more compatibility with CDNA / better performance parity against Nvidia hardware.
Students and hobbyists aren't spending $10k+ on this kind of stuff.
Yeah, the fact that someone can walk into Best Buy, get a prebuilt, download the CUDA SDK and learn says a lot about how easily and affordably someone can get into AI/ML. If AMD can do the same for their consumer/gaming hardware, that would be a big game changer.
@@nexusyang4832 Exactly. There's a lot of hand-wringing about all the various things Nvidia does to needlessly segment their lineup, and that's all well and good, but that's not at all what CUDA is.
CUDA's advantage is that it's the same CUDA whether you have an MX iGPU replacement, the same CUDA that's in the old Nvidia GPU you're replacing (assuming you have an Nvidia GPU, obviously), and it's the very same CUDA that's in last year's laptops, this year's laptops, and is certainly going to be in next year's laptops.
It’s not like AMD makes CDNA laptops, and that’s kinda the point.
@@levygaming3133 You're spitting facts. 👍👍👍👍
Excuse me??? Lol
Hobbyist/student stuff doesn't need performance parity with CDNA.
What it needs is ease of access (available as a standard feature on commonly available consumer-priced cards, without hobbling); similarity of interface across products, for the user and for software portability between consumer stuff and CDNA; and performance that is good enough not to be frustrating.
Reasonable Linux support is also needed. Linux may only make up 2% of total desktops, but Ditzy Sue and Joe Sixpack aren't GPU-compute hobbyists, so total desktops is the wrong stat; in reality Linux is closer to 50% or more of the relevant market segments.
Thanks!
Thank you!!
I am such a geek, "Can't believe it's not CUDA" made me actually laugh out loud.
please do 'tech tubers by Balenciaga' next
I think the subtleties shouldn’t be an issue.
Got SD A1111 to work on an RX 6500 XT and an Arc A770. But I wasn't able to run it on Vega iGPUs. The A770 16GB crushed the 3060 12GB I usually use.
Hey, if you don't mind, could you please make a short test video on the A770? I'm thinking of getting one.
Thank you for producing this content. As always, incredibly interesting
You forgot to mention that George Hotz's discussion started with his frustration with AMD GPUs. The so-called "open source" software isn't so open. Look at the "open" FSR 2 repo: no one is reviewing public pull requests; it's used more as a marketing tool than as support for the OSS community.
They never said that FSR 2 would be an open source project. They said it would be open source, meaning free access to the source code and the ability to modify it for your own needs. They never said they would accept pull requests from the public.
But what I really want to know is, can I use it to TRAIN models too?? Especially on voices and faces. I don't want to upload my family's private data to a cloud service and potentially have them save it forever; I would only trust that locally.
ROCm support is definitely needed on consumer-grade hardware.
- This will give AI students some experience in the AMD ecosystem.
- Also, not all AI models run in the cloud. For local use, companies have to consider the available options, and currently that's only Nvidia.
Is ROCm supported on RDNA3 iGPUs?
By that I mean: if one has a Minisforum UM790 Pro (with a Ryzen 9 7940HS), can that work?
I am really happy that Ryzen paid off. In 2017 I was one of the earliest adopters who pre-ordered two Ryzen 1700 (non-X) systems with X370 boards, and I never pre-order stuff; I didn't before and haven't since. Now AMD is a proper force for innovation and competition in both the CPU and GPU spaces, for consumers and datacenters. Also, Intel Arc seems to become more interesting by the day. Got an Acer A770 16GB as a curiosity at the start of this year and I still haven't reached a final conclusion about it; it seems like every second driver update makes things better.
Yeah, it's honestly good Ryzen happened, because there were reports they were on the road to bankruptcy.
All of this is thanks to Lisa Su; she really saved AMD.
Well done. Therein lies a tale many of us would like to hear. The buying decision in the market of the day? The cost of an 8-core Intel vs AMD back then, for example? Let's not forget what a classic the 1600 proved to be.
According to two senior AMD tech folks, Zen was designed because they had to: Bulldozer and its derivatives were a failure. Originally they aimed for 70% of Intel's performance at 50% of the price, but then TSMC's silicon just kept getting better and Intel stopped innovating. [I had the chance to talk to some senior AMD tech folks when they were recruiting on campus, and they were surprised how great Zen turned out too!]
Where do you find the model used? I can't find it on Hugging Face. The icantbeliveitsnotphotography safetensors, that is.
Maybe this, Google: civitai ICBINP - "I Can't Believe It's Not Photography". I'm downloading it now. Best of luck.
I'm going to Argonne National Lab later this week. Let me know if you want to sneak into the new super computer there ;)
These videos empower my team to express ideas to upper management.
Thanks for this video, this field is moving so quickly it's really hard to keep up to date on the latest advancements, let alone the current status quo
Wendell... this is seriously important work. Making the alternative to what many see as the default choice observably feasible is crucial to easing the hesitancy many people have, and just like in anything else [under the clutches of capitalism], a de facto monopoly can only harm consumers/users.
The promise of ROCm is huge, but better hardware support and better communication about what is and what is intended to be supported is needed. I had to buy a GPU a few years back and really wanted an AMD GPU for the Linux drivers, but I needed TensorFlow capability for university. ROCm existed, but there was barely any documentation about what was supported, nothing on what they intended to support, and no timeline for software development, so I got a 2080.
I remember that at roughly the same time AMD was touting that "you don't need to buy an Instinct to do datacenter compute", but how is "datacenter compute is locked to Tesla" any different from "there is no software support for Radeon" when you want to get real work done *now*?
Better communication, for sure. One of the main issues is that the list they provide is not of the GPUs that work with ROCm but of the GPUs AMD officially supports. It's totally useless for people who want to know whether a given GPU will actually run it or not. As far as I know, just about all AMD GPUs since Vega already work, even if AMD doesn't offer official "support".
......I will never look at Danny DeVito the same again. 😱😳😂
This is such a curious way to create spot the difference images.
I'm pretty sure you're running torch with cu117 or older; the numbers are about 70% lower than what an A100 puts out with these settings on cu118... If you just did a pip install from the default repo, it's cu117.
Pausing the video at 11:37: if AMD is on the left and Nvidia is on the right, AMD has the better algorithm running than Nvidia. The smartphone in DeVito's hand isn't merging with the spoon, and he has one button on his collar instead of two. It might have taken longer, but the image looks more natural, which is kind of nuts.
I run AI workloads on a 7900XTX. It's a bit of a headache sometimes, but it works. But there's so much performance left on the table. I recently played around with AMD's AITemplate fork, and it's really fast on RDNA. But it's also incomplete and unstable. Triton recently got lots of MFMA optimizations, no WMMA though. They're largely the same thing as far as I understand, except MFMA is Instinct, WMMA is Radeon. I think even most AMD engineers don't realize Radeon has 'Tensor Cores' now.
>They're largely the same thing as far as I understand
Absolutely not. MFMA is a 1-clock MMA whatever the matrix size; WMMA is just running wave64 over however many clocks on double the SIMD width.
@@whoruslupercal1891 Maybe, but the instructions are mostly the same, no? And WMMA on RDNA3 is actually accelerated (CDNA2, CDNA3 and RDNA3 are the only three architectures supported by rocWMMA, so I assume previous RDNA chips simply didn't have an equivalent), so AMD should probably use those instructions wherever possible.
@@wsippel >but the instructions are mostly the same, no
no.
>CDNA2, CDNA3 and RDNA3 are the only three architectures supported by rocWMMA
Yea but MFMA is different.
Will ROCm let me use my AMD 7900 XTX to accelerate the personal AI LLM running locally on my PC? At present it runs on my CPU, which makes the LLM's responses sluggish.
Hopefully SYCL will abstract platform specific APIs like ROCm/CUDA etc.
I used to think that but realized I'll grow grey waiting on a decent implementation. SYCL seems to be stuck in some quasi-proprietary limbo with a company that won't or can't make it widely available.
@@mytech6779 The most popular SYCL implementations are #OpenSYCL and #DPC++. Both are open source and work on many different architectures. What do you mean, "stuck in quasi-proprietary limbo with a company"?
Seeing those giant GPU modules gave me Pentium II flashbacks, lol!
Great video Wendell! This is good for the market; let's hope that the prices adjust in the next 12 months.
We've been using a lot of TPU the past few months. It's such a weird platform with interesting self-imposed bottlenecks, and it doesn't help that Google will suddenly reboot or take down our nodes for maintenance at least once (often more) every few days without any warning.
LLMs, please next!
Is there a tutorial for AMD Instinct and Stable Diffusion? Thanks in advance.
I don't think any of these look like Danny DeVito, just Danny DeVito-like.
Is there any way to get the new version of ROCm to work with the MI25?
One big problem for Google was that you only get full TPUs in Google Cloud, otherwise it'd be pretty different.
Would love a better explanation for why the math is different. Could be that floating point math is not associative: (A * B) * C does not necessarily equal A * (B * C). Optimizing compilers sometimes reorder operations in the name of speed; a tiny example of the effect is below.
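A small, self-contained illustration of the reordering effect (ordinary Python floats, nothing to do with the video's actual pipeline):

```python
# IEEE-754 floating point addition is commutative but not associative,
# so regrouping a reduction can change the rounded result.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c    # 0.6000000000000001
right = a + (b + c)   # 0.6

print(left == right)  # False -- same inputs, different grouping, different answer
```

The effect grows as precision drops, which is why reduced-precision tensor math is where these differences show up first.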
developer.nvidia.com/blog/tensor-cores-mixed-precision-scientific-computing/ — mixed precision instead of full-fat FP64. Usually the mantissa doesn't have as many bits, which is why FP64 runs at a different compute rate than the "FP64" used for AI.
@@Level1Techs My first thought was that the AMD card was using FP32 instead of bfloat16, but I googled it and it looks like bfloat16 has been supported since the MI100. Perhaps the port isn't using bfloat16 yet?
Open standards for the ML (read: TF) kernel API would help massively in achieving cross-hardware support.
Kinda hard to build for determinism when your hardware does lossy stochastic compression on compute. Even multiple runs of the same data set wouldn't result in the same output on Nvidia. I suspect if they didn't do that they would be significantly slower.
As nice as it is and as cool as it is, I expect ROCm for windows and Half Life 3 to come out on the same day.....
Did you get to test out running LLMs on these GPUs? I'd be curious how many tokens per second these bad boys can push out, especially since it seems like LLMs are going to be a main point of interest for AI companies for at least the next 1-3 years.
Well, why don't we have it on AWS or GCP? I'm really looking forward to seeing it.
Appreciate the work you've put into this, Wendell. I think AMD needs to support not only frameworks like TF and Torch but also model conversion from one framework/HW to another. Basically, the primitives mapping between systems.
Very interesting!
Do you know why Stable Diffusion seems to use so much more VRAM on the MI210 than on the A100?
Maybe related to the accuracy stuff? I'm not sure tbh
Thanks for the stock tip Wendell! I'm selling all my TSLA and buying up AMD stock.
Aw, was hoping to use ROCm with my 6800 XT.
Try it... I bet it will work. My 6700 XT and 7900 XT work fine with ROCm, so I guess the 6800 XT will work too (a quick check is sketched below).
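For anyone wanting to try: a rough sketch of checking whether a ROCm build of PyTorch sees the card. It assumes a ROCm wheel of PyTorch is installed; the override value is a community workaround for cards AMD doesn't officially list, not something AMD documents as supported.

```python
# Sketch: check whether a ROCm build of PyTorch can see an RDNA2/RDNA3 card.
import os

# Community workaround for officially unsupported cards (e.g. a 6700 XT reports
# gfx1031 and is often run by overriding to gfx1030). Unofficial; use at your own risk.
os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", "10.3.0")

import torch  # ROCm builds of PyTorch expose the GPU through the regular "cuda" API

print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```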
Are the differences in the images really due to different precision levels in the hardware or is it (also partly) due to limited determinism and reproducibility? After all you're not guaranteed to get the same image twice, even when using the same seed and HW.
Could you use those AI parts on Ryzen? I think it's a notebook CPU.
AMD's FP64 cores are great, but they still need more dedicated AI silicon, preferably integrated on the same package.
With the new release of ROCm 6.0 can we revisit this topic?
This just popped up in my feed, and I would be interested in an update, as it's now known that the Nvidia 50xx series will not get any VRAM capacity upgrades (sans the 5090), so it may be time to look at AMD.
thoughts on opensycl?
any benchmarks with AMD Ryzen AI?
You're simply the best!
I always spun that technical difference as "one is a more mature but less complete offering." So then it became a question of what is good enough for their needs.
To be honest, you suggested that ROCm went from "can't train shit" to "can't train shit", and training is what Nvidia specialises in. There are more inference startups dying each day than MI200s and MI300s combined shipped that day, and every vendor is coming up with their own inference chip. Why would AWS offer the MI200 or MI300 when they can offer their own Inf1 and can abstract any software difference under ML frameworks? And if they do, why would anyone use that instead of Inf1, or better yet, build their own?
I’m glad to be an AMD shareholder, although I guess I might grab a few more shares just in case. (My AMD shares have made a killing so far especially off this AI hype bubble.)
AMD keeps dropping old GPUs from ROCm. RDNA was ignored forever; not even OpenCL worked at launch with the regular drivers. So there will be little uptake when Nvidia still has something that performs OK at the lower end.
Gaudi 2 is also looking OK, and Intel seems committed to having the software run on potatoes.
There is a fork of Automatic1111 that supports AMD; it's in the main project README or a Google search away. It seems they (ROCm) randomly decided to drop support for some older cards a few months ago: the RX 5xx isn't supported, and I think Vega was also dropped.
Does ROCm work with Radeon 7900 series cards now?
Yes... I use a 7900 XT + ROCm for generating images with A1111.
Are you sure that the visual differences are because of the different hardware? Is Xformers disabled? I think it should be disabled for a test like this. I think it would explain the visual differences.
I was waiting for this video since the teardown came out
I am in AL and never have access to the newest HW. Damn ...
This is really interesting. I want to know how a 4090 vs a 7900 XTX compares for these workflows. I know both are consumer products, but I feel that at the top end the line is blurred.
The differences in the images stem from you using an ancestral sampler, not from the GPU you're using.
Where is a Microsoft DirectX-style layer that sits on top of the GPUs and makes ML vendor-agnostic (even if it makes it OS-dependent)? If you don't like the OS-specific DirectX API, then swap in the Vulkan API. I've heard of DirectCompute and OpenCL, but they don't seem to have gained traction - why? Also, why is ROCm needed when you have those APIs - what is it that lets CUDA beat all of the above?
Hi,
I would like to get started with ML and currently have two offers for graphics cards:
RX 6800 16 GB and RTX 4060 8 GB
Do you know if the 6800 would be suitable for getting started, or would it be better to go with the 4060?
Thank you in advance!
Is AI "hallucination" happening here?
Will the W7900 with 48 GB compare to the MI210?
EVELON TECHS is going to be a new channel?
Assuming that wasn't a joke: it's just Level One Techs, but Windows or whatever is cutting off either side of the wallpaper: L]evelOn[e techs.
I only recognized it because I've seen the full picture in one of the other vids, and even then it tripped me up a little.
@@levygaming3133 The full picture is on the monitor to the right.
love these talks
Big fan of Evelon Techs
If AMD wants to get popular, they need to support their consumer-grade GPUs in ROCm. And also the used market.
Unless Google sells TPUs for enterprises to host themselves, I don't think there will be any large-scale adoption for use in consumer products. See, OpenAI trained their models on GPUs; it's best to assume that's Nvidia hardware.
Working in OpenGL for a long time, I've come to sum it up as Nvidia playing it fast and loose and AMD being more accurate. And then there's Mesa, which is as close to a reference implementation as you'll get.
training performance...?
Hi Wendell, have you been following the Tinygrad stuff and their troubles with ROCm at all? They look like they have some real work™ they'd like to be able to use AMD for in ML so I think it would be interesting for you to check out.
Someone didn't watch to the end of the video ;)
@@Level1Techs whoops sorry, don't have the time to watch rn and I know the best time to get an answer is in the first couple hours so I did a stupid
Run a self-hosted GPT instance.
We need an update for ROCm 6.0 and RDNA3
The demo is for inference, but training is the key advantage for Nvidia. You need compute cards at gamer-card scale in order for that software support to level out; that's why Ponte Vecchio and TPUs are DOA as consumer products.
But let's suppose AMD does catch up on the desktop. For mobile, Apple, Google, and Samsung own their own stacks. For robotics, Nvidia already has Jetson. The market beyond the desktop would need to be big for AMD to really be able to invest and nail AI.
The A100 being a little off isn't surprising. Saw the same thing when comparing Amazon's CodeWhisperer and OpenAI's Copilot: Copilot's code was only 75% correct, Amazon's was 90% correct.
I enjoyed this, thank you.
Let's hope this works for the consumer GPUs as well. The MI210 is 20k on eBay, the A100 is at 6k. Hmm. AMD should make a rather slow GPU with a lot of VRAM for inference, priced below 1k. Allegedly VRAM prices are currently very low.
The price difference is because the MI210 is relatively new, released early last year. The A100 is 3 years old now, and it has also been deployed a lot more; every cloud provider offers some A100 server instance. Meanwhile, the only AMD GPUs I can find are Radeon gaming GPUs. The V620 is just a Navi 21 chip with 32 GB of VRAM.
@@gatocochino5594 Thanks, but I wouldn't pay more just because it's rarer while it doesn't even seem to use less energy.
What'll be interesting will be the MI300C - all CPU chiplets - and Turin-AI with Xilinx AI chiplets... MLID has a video about it. A dual-socket version could have more TOPs than an H100.
I imagined AMD would develop that kind of chip. I even named it MI300AI.
Could you provide a link to the MI300C?
@@samlebon9884 There's a very reliable rumor channel I've followed for a few years called Moore's Law Is Dead, which spoke about the MI300C chip (all CPU chiplets with HBM3) and a separate AMD project called Turin-AI, a mix of Zen 5 chiplets together with Xilinx AI chiplets on a single package, which in a 2P config would be about as powerful as an H100.
Please, please Wendell, use your mighty powers and shake loose an answer from above on when ROCm Windows support is coming. I mean, it would actually bring more value to AMD's so-far-lackluster RDNA 3!
Open Source > Proprietary Vendor Lock-ins
I hope ROCm gets better so that I can go back to using DaVinci Resolve with hardware acceleration on AMD... Using it on Linux is such a pain, and I am about to just buy a damn 12 GB NVIDIA card and sell my 16 GB AMD card just so I can use the Resolve Studio version I paid for. It sucks because I want to drop Windows, and the only thing stopping me is video editing. I either go back to dual-booting Windows, which sucks, or buy a card with 12 GB of VRAM... not ideal.
Just how deterministic are these prompts? How similar are the same prompts using the same seeds on the same hardware? Some of those differences were rather subtle but wow some of them had huge differences in results.
As far as precision goes, I believe AMD inputs two matrices of 16-bit values, but the actual pipeline through the multiply and accumulate steps is 32-bit. On the green side, two 16-bit matrices are input, but they only accumulate into a 32-bit value at the end. Both have options to output 16-bit results. I always get confused about who supports half precision and who supports 16-bit bfloat. Just the difference between half precision and bfloat should be enough to produce differences like the ones we've seen here.
The only company really doing higher precision for tensor operations is IBM, which supports 32-bit values for the elements of a matrix and a 64-bit accumulate. These are gloriously IEEE 754 (2008?) compliant, so you can quickly import and execute on existing data sets without needing to drop precision first. While AI has been pushing 16-bit floating point work, there is certainly a market sector that would like to leverage tensor operations at the higher precisions used in more traditional single and double precision workloads. An easy prediction is that it'll follow various GPU trends in this area, where they either have a smaller amount of 32- or 64-bit tensor hardware or modify the existing 16-bit hardware to compute 32- or 64-bit values over multiple cycles. (A toy illustration of why accumulator width matters is below.)
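A toy numpy illustration of that accumulator-width point (made-up numbers, not a model of either vendor's hardware): summing many small values in a float16 accumulator eventually stops registering them, while a float32 accumulator keeps going.

```python
# Toy example: the same additions with a 16-bit vs a 32-bit accumulator.
import numpy as np

vals = np.full(20000, 1e-3, dtype=np.float16)  # 20,000 small half-precision values

acc16 = np.float16(0.0)
acc32 = np.float32(0.0)
for v in vals:
    acc16 = np.float16(acc16 + v)  # accumulate in half precision
    acc32 = np.float32(acc32 + v)  # accumulate in single precision

print(acc16)  # stalls well short of 20 once the addend drops below half an ulp
print(acc32)  # roughly 20, as expected
```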
I'm really optimistic about MI300A and MI300X performance. The MI300A in particular should be interesting to test due to its inclusion of some Zen 4 cores inside the same package. On the same note, it wouldn't surprise me if AMD begins swapping out some of the CCX dies on an EPYC package for some CDNA dies or FPGA (or both!). The flexibility and potential of this strategy are enormous. More importantly, as pointed out in this video, AMD has been making massive strides on the software side. The gap hasn't closed, but you can now see the point on the horizon where it does.
With regards to TPU v4, has there been any indication that Google has it up and running internally? Are they waiting on 4 nm yields/supply? Can they get enough HBM, which has skyrocketed in price lately?
On the same GPU with the same settings and the same version of the software, it is 100% identical. Though using xformers or token merging makes it non-deterministic, and using a different GPU can also cause differences. To be sure of getting a similar image on a different GPU, you can use CPU noise generation. At least I think so, since the option is there, but I never tried it.
If you don't use xformers, then with the same seed, same prompts, and same settings, it's absolutely deterministic. I had this exact discussion with someone on Discord and helped him set up his A1111 on his laptop. Even though I had a 3090 and he had a 2070, he was able to create the same image. After all, it's math and an algorithm (the usual knobs are sketched below).
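For reference, a rough sketch of the PyTorch-level switches usually involved in making runs repeatable (exact requirements vary by version and by which ops a given pipeline uses):

```python
# Sketch of the usual reproducibility switches in PyTorch.
import torch

torch.manual_seed(1234)                   # seeds the CPU and all CUDA/ROCm generators
torch.use_deterministic_algorithms(True)  # raise an error if a non-deterministic op is used
torch.backends.cudnn.benchmark = False    # don't autotune kernels (tuning can vary run to run)
```

Even with all of that, bit-identical images across different GPU generations aren't guaranteed; same-hardware, same-software runs are the safe claim.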
@@nexusyang4832 That is the expectation, but how did you verify it? Did you run a pixel-value difference between the two results?
Conceptually they should match, but things can get weird when operating near the edges of precision and rounding. Mathematically a*b*c*d*e*f*g*h = h*g*f*e*d*c*b*a, but it gets messy with floating point numbers once the order of operations changes. Thankfully code is deterministic when executing, right? For a single thread it is: even out-of-order execution respects data dependencies and produces the same architectural result. Where variation creeps in is when the work is split across many parallel threads and the partial results are combined, for example with atomic adds in a reduction; the combine order can change from run to run based on scheduling and resource sharing, and with low-precision data (16-bit floating point matrix elements in the video's example) those different orderings round differently. That is how chaos can sneak into what is otherwise a deterministic algorithm.
@@leucome Curious, have you tried using different precision if available on the same hardware? Say half precision vs. bfloat?
Nvidia going the inaccurate-but-faster route has always been a problem for AMD. Because Nvidia is the market leader, most software ends up expecting the inaccuracies in Nvidia's implementations of the standards, which leads to that software not working on the technically more accurate implementations from AMD (or even Intel). To the consumer, that then means software doesn't work properly on AMD while also running slower.
Before the recent rewrite, this was also a problem for OpenGL, some versions of DX, etc.