Why GPUs Outpace CPUs?

  • Published Sep 11, 2024
  • A Deep Dive into Why GPUs Outpace CPUs - A Hands-On Tutorial
    FLOPS is commonly used to quantify the computational power of processors and other computing devices. It is an important metric for tasks that involve complex mathematical calculations, such as scientific simulations, artificial intelligence and machine learning algorithms.
    FLOPS stands for "Floating Point Operations Per Second" which means the number of floating-point calculations a computer system can perform in one second. The higher the FLOPS value, the faster the computer or processor can perform floating-point calculations, indicating better computational performance.
    In this tutorial, we will use FLOPS as a metric to compare CPU and GPU performance. We will begin with the DAXPY (Double-precision A*X Plus Y) operation, a common kernel in numerical computing: a scalar (A) is multiplied by a vector (X), and the result is added to another vector (Y). We will measure the FLOPS achieved for the DAXPY operation on both the CPU and the GPU.
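    A minimal NumPy sketch of that measurement (illustrative only, not the exact code from the video; the function name is made up here):

```python
import time
import numpy as np

def daxpy_flops(n, a=2.0):
    """Time A*X + Y on length-n vectors and estimate achieved FLOPS."""
    x = np.random.rand(n)
    y = np.random.rand(n)
    t0 = time.perf_counter()
    z = a * x + y                      # one multiply + one add per element
    elapsed = time.perf_counter() - t0
    return 2 * n / elapsed             # 2 floating-point ops per element

print(f"{daxpy_flops(10_000_000):.3e} FLOPS")
```

    The factor of 2 in the numerator counts the multiply and the add performed for each of the n elements.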
    The DAXPY operation is executed using NumPy (A * X + Y). NumPy delegates to optimized implementations, so the actual computation may occur in optimized C or Fortran libraries. A more effective way to compare speeds is therefore to perform matrix multiplications using TensorFlow, which is exactly what the second part of our code does. We will multiply matrices of various sizes and see how the true advantage of GPUs lies in working with large matrices (and large datasets in general).
    In the second part of this tutorial, we will verify the GPU's speed advantage over the CPU for different matrix sizes. The relative efficiency of the GPU compared to the CPU can vary with the computational demands of the specific task.
    To ensure a common baseline for each matrix multiplication task, we will clear the default graph and release the GPU memory. We will also disable eager execution in TensorFlow for the matrix multiplication task. Note that eager execution is a mode in which operations are executed immediately as they are called, instead of being explicitly executed within a session; it is enabled by default in TensorFlow 2.x. With eager execution disabled, operations are added to a computation graph, and the graph is executed within a session.
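    A sketch of that setup (assuming TensorFlow 2.x with the `tf.compat.v1` graph-mode API; the helper name `time_matmul` is illustrative, not from the video):

```python
import time
import tensorflow as tf

tf.compat.v1.disable_eager_execution()   # build a graph, run it in a session

def time_matmul(n, device='/CPU:0'):
    """Time one n x n matrix multiplication on the given device."""
    tf.compat.v1.reset_default_graph()   # fresh baseline for each run
    with tf.device(device):
        a = tf.random.uniform((n, n))
        b = tf.random.uniform((n, n))
        c = tf.linalg.matmul(a, b)
    with tf.compat.v1.Session() as sess:
        sess.run(c)                      # warm-up: graph build, memory alloc
        t0 = time.perf_counter()
        sess.run(c)
        return time.perf_counter() - t0

for n in (256, 1024, 4096):
    print(n, time_matmul(n))             # try '/GPU:0' if one is available
```

    The warm-up run matters: the first `sess.run` includes one-time graph and memory-allocation costs that would otherwise distort the comparison.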
    Finally, forget FLOPS: it's all about memory bandwidth!
    Memory bandwidth is a measure of how quickly data can be transferred between the processor (CPU or GPU) and the memory.
    High memory bandwidth is crucial for tasks that involve frequent access to large datasets (e.g., deep learning training).
    Memory bandwidth becomes particularly important when dealing with large matrices, as transferring data between the processor and memory efficiently can significantly impact overall performance.
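    As a rough illustration, host memory bandwidth can be estimated from a large array copy (an assumption-laden sketch, not a rigorous benchmark; the function name is made up here):

```python
import time
import numpy as np

def copy_bandwidth_gbs(n_mb=400):
    """Estimate memory bandwidth (GB/s) from a large array copy."""
    src = np.ones(n_mb * 1024 * 1024 // 8, dtype=np.float64)
    dst = np.empty_like(src)
    t0 = time.perf_counter()
    np.copyto(dst, src)                  # reads src and writes dst
    elapsed = time.perf_counter() - t0
    bytes_moved = 2 * src.nbytes         # one read + one write per element
    return bytes_moved / elapsed / 1e9

print(f"~{copy_bandwidth_gbs():.1f} GB/s")
```

    Comparing a figure like this against a GPU's advertised memory bandwidth helps explain why large-matrix workloads favor the GPU.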
    Code used in this video is available here: github.com/bns...
    Original title: Why GPUs Outpace CPUs? (tips tricks 56)

Comments • 16

  • @vitor-ce2ql
    @vitor-ce2ql 7 months ago +1

    Hello, I recently discovered your channel. I think it's very good, you're very kind, and you explain things well. Congratulations, I want your channel to grow. If my English is bad, sorry, I'm Brazilian.

  • @aaalexlit
    @aaalexlit 7 months ago

    Awesome as always, thank you! Any chance of a follow-up that includes TPUs?

  • @scrambledeggsandcrispybaco2070
    @scrambledeggsandcrispybaco2070 7 months ago

    Hi DigitalSreeni,
    I have been using your tutorials as a guideline for segmentation using traditional machine learning. Apeer has changed a lot since your videos were made. When I export the file, it gives masks for the different classes separately. What can I do? Thank you for all your knowledge, you are a lifesaver.

  • @msaoc22
    @msaoc22 7 months ago

    Thank you for the amazing video and the time you spend on us =)

  • @hamidgholami2683
    @hamidgholami2683 7 months ago

    Hi sir, hope you're doing well.
    May I ask you to make some videos on instance segmentation? I mean good explanations and also some projects based on that. I will be happy if you respond.

  • @alihajikaram8004
    @alihajikaram8004 7 months ago

    Hi, I found your channel very informative, and thanks for your great educational videos.
    Would you make a video about using conv1d on time series?
    Could we use it for feature extraction?

  • @LiebeZSlade_Ayzal.Y
    @LiebeZSlade_Ayzal.Y 3 months ago

    Sir, what type of laptop should I get to do deep learning? What do you recommend?

  • @khangvutien2538
    @khangvutien2538 6 months ago

    At 7:05, I see 256 Tensor cores in the spec sheet. Are they the same kind of tensor processing as in a TPU? Maybe you can also explain TPUs? Note that I'm just starting to watch, maybe you explain this later in the video?

    • @DigitalSreeni
      @DigitalSreeni 6 months ago +1

      The tensor cores in GPUs and TPUs involve tensor processing but they are different technologies designed for different purposes. GPUs are more general-purpose and versatile, suitable for a range of tasks like gaming, graphics rendering, and parallel computing workloads. TPUs are purpose-built for machine learning and are highly optimized for tensor operations.

  • @zainulabideen_1
    @zainulabideen_1 7 months ago

    Found amazing information, thanks ❤❤❤

    • @DigitalSreeni
      @DigitalSreeni 7 months ago

      Glad it was helpful!

  • @anshagarwal9826
    @anshagarwal9826 6 months ago

    @DigitalSreeni Hi, can you explain why you divide the array size by the time to calculate FLOPS? How does that give the floating-point operations per second? What I understood from your calculation is that you are estimating how long it takes to build the newly computed array and treating that as FLOPS.

    • @DigitalSreeni
      @DigitalSreeni 6 months ago

      The calculation of FLOPS in my code is based on the time taken to perform a specific operation (e.g., DAXPY) on arrays of a given size. The rationale behind this calculation is that it estimates the rate at which floating-point operations are executed per second. If you consider the DAXPY operation (A * X + Y), each element in the arrays X and Y undergoes a multiplication and an addition, which are floating-point operations. So the total number of floating-point operations is proportional to the array size. It provides a rough measure of the performance in terms of floating-point operations per second. In reality, the actual number depends on the type of operations and of course the underlying hardware.

    • @anshagarwal9826
      @anshagarwal9826 6 months ago

      Thanks @DigitalSreeni, much appreciated 👍

  • @vidyasvidhyalaya
    @vidyasvidhyalaya 7 months ago

    Sir, please upload a separate video on converting "195 - Image classification using XGBoost and VGG16 imagenet as feature extractor" into a local web application. Please don't skip my comment, I'm awaiting the video.

  • @tektronix475
    @tektronix475 7 months ago +1

    I got about a 5,000x speedup with the T4 GPU setup for a 10000x10000 matrix, which is disheartening and eye-popping at the same time.