Fun fact: with 128 CUDA cores in a Nano, how many of those cores actually perform the square root operations in the program? Answer: zero. Yep, with the Nano being based on Nvidia's Maxwell architecture, not one of those 128 cores is capable of computing a square root directly. Instead, the Nano's single Maxwell SM (streaming multiprocessor) comes with 32 SFUs (special function units) which are used to compute the square root. Even quirkier, these SFUs only know how to compute the reciprocal square root, as well as the regular reciprocal operation. So to get a square root the SFU actually executes two instructions: a reciprocal square root, followed by a reciprocal. Strange but true! It's actually documented in Nvidia's "CUDA C Programming Guide", in the section "Performance Guidelines: Maximize Instruction Throughput".
Ah yes, the joys of having a day job as a CUDA programmer. You get to be gobsmacked every day by the weird ways you need to go about trying to optimize your programs to scrimp and save on every precious clock cycle :P
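For anyone curious, here is a minimal sketch of that decomposition written out explicitly in CUDA C. It assumes rsqrtf() is lowered to the SFU's MUFU.RSQ (which is what I see on Maxwell); the exact instructions the compiler emits for the reciprocal depend on the architecture and on flags like -use_fast_math, so treat it as illustrative rather than definitive:

// rsqrt_rcp.cu -- build with: nvcc rsqrt_rcp.cu -o rsqrt_rcp
#include <stdio.h>
#include <cuda_runtime.h>

// sqrt(x) computed the way the SFUs do it: a reciprocal square root, then a reciprocal.
__global__ void sqrt_via_sfu(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float r = rsqrtf(in[i]);  // approximate reciprocal square root (MUFU.RSQ on Maxwell)
        out[i] = 1.0f / r;        // reciprocal of that result; sqrt(x) == 1 / rsqrt(x)
    }
}

int main(void) {
    const int n = 4;
    float vals[] = {1.0f, 2.0f, 9.0f, 100.0f};
    float *buf;                                   // first half input, second half output
    cudaMallocManaged(&buf, 2 * n * sizeof(float));
    for (int i = 0; i < n; i++) buf[i] = vals[i];
    sqrt_via_sfu<<<1, 32>>>(buf, buf + n, n);
    cudaDeviceSynchronize();
    for (int i = 0; i < n; i++)
        printf("sqrt(%g) ~ %g\n", buf[i], buf[n + i]);
    cudaFree(buf);
    return 0;
}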
I like your depth of thought -- can you point us to some info so we can learn the important tech to understand why and how you determined what you stated? Thanks for the comment. Background: I bought into Nvidia CUDA many years ago for video post-processing and could never really take advantage of it... but now I want to, for AI/ML solutions for IoT.
@@pluralcloud1756 The info can be found in Nvidia's "CUDA C Programming Guide", here's a direct link to the pertinent section on arithmetic instruction throughput: docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions
Not every program needs to calculate the square root, and you're incorrect in that statement too.
Nvidia's CUDA cores are stream processors; they do 16- or 32-bit flops.
The actual full cores are DP (64-bit) units and calculate a square root very precisely; however, even 32-bit FP has, what, a 20-bit-or-so mantissa? They are good enough to calculate a square root quite accurately!
Anyway, it's way more accurate than your handheld calculator!
@@ProDigit80 The CUDA cores do not calculate the square root directly. This is easy to verify: make a simple kernel which calculates a square root, e.g. "__global__ void sqrtkern(float fi, float *fo) { *fo = sqrt(fi); }". Then use NVCC with the "-cubin" option to generate a CUBIN file. Then use CUOBJDUMP on this CUBIN file with the "-sass" option to generate the SASS file, which contains the actual low-level assembly instructions for the GPU. Check the SASS file and you will see an instruction "MUFU.RSQ", which is a multi-function unit instruction to calculate the reciprocal square root and is issued to the SFUs. So from the assembly you can clearly see that the kernel is using the SFUs to compute reciprocal square roots rather than using the CUDA cores.
If you want to avoid using the SFUs and solely use the CUDA cores, then you have to write your own square root function, meaning do not use the "sqrt()" built-in function in your code.
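For anyone who wants to reproduce that check, here is the whole experiment in one small file, with the build and inspection commands as comments. The -arch value is an assumption (sm_53 is the Jetson Nano's Maxwell GPU); the exact SASS you get will vary with toolkit version and flags:

// sqrtkern.cu -- minimal kernel for inspecting the generated SASS
//
// Build a CUBIN and dump the SASS (adjust -arch for your GPU; sm_53 = Jetson Nano):
//   nvcc -arch=sm_53 -cubin -o sqrtkern.cubin sqrtkern.cu
//   cuobjdump -sass sqrtkern.cubin > sqrtkern.sass
//
// In sqrtkern.sass, look for MUFU.RSQ -- the multi-function-unit (SFU)
// reciprocal-square-root instruction described above.
__global__ void sqrtkern(float fi, float *fo) {
    *fo = sqrtf(fi);   // the single-precision overload of the built-in sqrt
}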
Lol, thanks, that's fun to know.
Greetings from near Albuquerque, New Mexico, USA. Thanks for all you do to bring various computing concepts, hardware, and software to your viewers. I want to leave a few comments about this video on Build Your Own GPU Accelerated Supercomputer.
When you take your square root problem and divide it into smaller and smaller but more numerous parts, that is called 'strong scaling' of a numerical problem. This implies that the problem size on each compute node becomes smaller and smaller. Eventually, if the problem continues to be broken up into smaller and smaller pieces, what happens is the communication time from compute node to compute node imposed by the message passing interface (MPI) becomes dominant over the compute time on each node. When this happens, the efficiency of parallel computing can be really low. My point here is that your video shows that double the compute nodes and you halve the compute time. That scaling will happen at first but cannot be continued ad infinitum.
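To put a rough formula behind that: in the textbook simplification (Amdahl's law, nothing specific to this cluster), if a fraction p of the work parallelizes perfectly and the rest stays serial or becomes communication overhead, the strong-scaling speedup on N nodes is bounded by

\mathrm{speedup}(N) = \frac{1}{(1-p) + p/N} \quad\Longrightarrow\quad \lim_{N\to\infty} \mathrm{speedup}(N) = \frac{1}{1-p}

For example, with p = 0.95 the best possible speedup is 20x no matter how many Nanos you add, and in practice MPI communication pushes the real curve below even that.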
Another approach to parallel computing is to take a small problem of a fixed-size on one compute node, then keep adding the same size problem (but expanding the compute domain) to other compute nodes, all working on the same but now bigger problem. This is called 'weak scaling.' And as one might guess, the performance and efficiency curves for strong and weak scaling are quite different.
As you know, but perhaps some viewers do not, running Nvidia GPUs requires knowing the CUDA programming language, which takes a non-trivial effort. This language is entirely different from programming languages such as Python, Fortran, or C++. This is why Intel chose to use more x86 co-processors in their Core i9 boards instead of GPUs, so that programmers could stay with their familiar programming languages. AMD took the same approach with their Threadripper boards. Software development time is much reduced without having to learn CUDA to program the extra compute nodes. Implementing CUDA on top of typical programming languages can significantly extend the time between the start of a software development effort and when the software actually executes properly on a given platform.
In a nutshell, the plus side of all this is that GPUs are super fast for numerical computing. GPUs are hands-down faster than any X86 processor. Downside is the difficulty in programming a problem to make proper use of the GPUs.
One more comment. For viewers interested in parallel computing, I highly recommend Open MPI as the Message Passing Interface implementation to use, as it is open source, actively developed, and easy to implement.
Great comment. Very well explained. There's one thing I want your opinion on is how do you views OpenACC for parallel computing ? The learning curve vs. the performance gain ?
@@sanmansabane2899 OpenACC is geared for directive-based parallel computing, much like OpenMP. OpenACC utilizes GPUs for accelerated computing, whereas OpenMP uses multiple CPU cores. I have more experience with OpenMP, so my comments will pertain more to OpenMP than OpenACC.

In general, the idea is to take a serial code, add a few beginning and ending [directives in Fortran; pragmas in C or C++], which usually consist of just a few lines of code, and let the directive-based compiler figure out the best way to parallelize the code in between. Do-loops and for-loops are prime candidates for this approach. As a result of letting the compiler do the heavy lifting, the parallel efficiency one gets out of this approach is heavily compiler dependent. Some compilers do much better at parallel efficiency using directives than others. Also, in my experience, if the code between the directives is written "poorly," the execution time can actually increase rather than decrease. Not good.

Note that OpenACC and OpenMP create multiple threads within a single process, but they do not handle communication across separate processes or nodes; Open MPI does that. So the most efficient approach can be to use Open MPI (which requires rewriting a lot of code to get right) for intra- or inter-process communication, and include in the code directives to launch threads on the cores using OpenMP or threads on the GPU using OpenACC.

Note that there has been a successful push to include GPU protocols in OpenMP. This MAY mean OpenACC is falling in popularity. For example, if you wiki 'OpenACC', you'll find that the decrease in OpenACC's popularity is probably why, on April 3, 2019, John Levesque (the director of the Cray Supercomputing Center of Excellence at Cray) announced that Cray is ending support for OpenACC. I've met John before. He is a very well-respected and knowledgeable man. I'm sure his decision to drop OpenACC was made with much forethought and much hindsight. This may be a good reason to go with OpenMP -- it will definitely be around for a while.
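To make the directive idea concrete, here is a minimal sketch of the same loop written both ways. It is illustrative only; real codes need careful data clauses and scheduling, plus the right compiler flags (e.g. -fopenmp for OpenMP, an OpenACC-capable compiler such as nvc for the acc pragma):

#include <stdio.h>
#define N 1000000

static float x[N], y[N];

int main(void) {
    for (int i = 0; i < N; i++) x[i] = (float)i;

    /* OpenMP: split the loop iterations across CPU threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        y[i] = 2.0f * x[i] + 1.0f;

    /* OpenACC: offload the same loop to an accelerator (e.g. a GPU),
       with explicit data movement for x and y. */
    #pragma acc parallel loop copyin(x[0:N]) copyout(y[0:N])
    for (int i = 0; i < N; i++)
        y[i] = 2.0f * x[i] + 1.0f;

    printf("y[10] = %f\n", y[10]);
    return 0;
}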
I took a college class where CUDA was the language (Massively Parallel Computing was the course) and it absolutely requires a 'non-trivial' effort.
"We just take square roots. We're simple folks here."
**builds a supercomputer cluster with GPU acceleration 😎**
😂
I would hardly call this a supercomputer. No offense; although it's nice for testing and experimentation, has low power draw and all that, in reality, given the overall cost, a mid-tier Nvidia GPU will crush it. Still good for having some fun.
@@giornikitop5373 gives me a great idea how to build a unique gaming rig though...just using better GPUs.
@@CircuitReborn
Has anybody solved tag teaming graphics cards?
I remember in the 90’s you could plug six cards into a Mac IIFX to render Photoshop jobs.
Man's greatest achievement was working out how to do math faster than his mind would let him ! ! !
I prefer the quote: "Teaching sand to think was a mistake."
@@dam1917 It has a grain of Truth.
I think the greatest achievement was realizing math is not only about counting numbers together... and not only about finding a faster way to do it either.
@@danwe6297 Also: Geometry, Mathematics and Music are all ways to express the same thing ! ! !
Can it play steam games?
9:53 OK, so if I understand correctly: time will return the number of seconds the program has run, mpiexec is the utility responsible for cluster management, and ./simpleMPI refers to a local binary which is then distributed and run across the cluster? 12:03 Also, by the Xavier GPU being more powerful you mean the number of cores it has, right? Also, I would like to see a video from Professor Gary on Amdahl's law :)
Yes.
@@GaryExplains Another vote for a video about Amdahl's law. This is so sexy :-)
I really would like to build one of these. I followed an HPC course at uni and it fascinated me; being able to build a CUDA cluster for like 250€ is awesome!
You're a very good teacher, because I'm a noob and I understood everything and learned a lot. I went from not knowing what a Jetson Nano was to learning about parallel computing and building supercomputers.
Thank you 👍
Gary, can you make the GPUs and CPUs work together? And by the way, that was awesome.
You are the first to explain it in a way I understand.
Yes, a video on Amdahl's law, please!
Could you make this into a render farm? That is separate from the question as to whether that would be a good idea or even efficient.
This is exactly why I came to this video. Can this be an After Effects or Davinci Resolve render farm?
@@DionV I looked at some blender forum posts... yes it can. But not so efficient at all.
@@audiblevideo can you share the links to those posts please?
Very cool! You forgot to mention it takes about ~18W of power? Gary, can you please explain exactly how the Xavier NX unit can be used for video encoding? I know it runs Linux (Ubuntu), so my question is: can it be booted directly off an SSD and used as a regular desktop PC, running one of the open-source editors such as Kdenlive, which, by the way, supports parallel video rendering?
I wouldn't recommend it. AFAIK it doesn't support booting from SSD, but you can connect one over USB 3. It uses ARM64, and many libraries and software packages are not available for it, and the overall user experience and fluidity of the OS is not the best. For the price of the NX you can build a mini Ryzen PC with much better performance and 4 real x86 cores at high clocks.
@@SoussiAfif So the issue is with the software. I know a 6-core ARM CPU is not enough, but look at the demo with CUDA cores supporting multi-stream video codecs with face-recognition AI... so why can't this API be used to simply read XML instructions and run FFmpeg commands in parallel? Answering your question: today a Ryzen rig makes sense to build, with PCIe 4.0 support and the latest upcoming CPUs with shared L3 cache, which will cost 3x as much to start with... most people will do that...
@@SoussiAfif I believe there is a hack for booting from the SSD already. Also, I am sure it has been said officially that the Xavier NX will be able to boot from SSD as standard in the near future.
I think you could; after all, as you said, it runs Ubuntu, so why not? What I struggle to see is why you would render video via a cluster. Let's say 4 x Xavier or even more in a cluster: well, you can get a rather powerful x86 machine for that kind of money. But for the learning experience and the fun of it, just try it and see what you think. Hey, that could be a video if you ever thought of starting your own channel.
@@tianjohan4633 I heard that too, but none of the reviewers can show it... they all just repeat that demo showcase...
So fascinating. Wow . Thank you all. And the producer.
Great job Gary! So fun to watch
Hey Gary, thanks for this video. Awesome!
My pleasure!
Will you be porting Doom to it?
*GARY!!!*
*GOOD MORNING PROFESSOR!*
*GOOD MORNING FELLOW CLASSMATES!*
Stay safe out there everyone!
MARK!!!
Hi. Can you give me a little information please? My question is basic. I will be using the Mate and three Nanos: one as the master and two worker nodes for now, until I can get my last worker. My question is: should I remove the desktop environment from the worker nodes to free up RAM and processor usage? Using three Nanos, it seems I should just install the main OS on all the SD cards using the carrier boards they come with, then install them into the Mate. Is that right? I have found loads of information on running clusters and neat stuff, but the basic setup of the Mate is what I am missing. I would just think the desktop environment on the worker nodes would be wasteful.
I would love a video on the build from the start to running. Just the basic of the beginning. Setup of the SDs and any really important information needed.
Regards,
Adam
Congrats! Very well done 👍
Is there a program that can calculate the speed of each one, and set up a unique weight automatically?
Practically, we would run as much as we can on one multi-GPU machine; then we move on to multi-node.
So what practical use can I get from it in my lab, besides crunching numbers?
Fast Transform fixed-filter-bank neural nets don't need that much compute. Moving the training data around is the main problem. The total system DRAM bandwidth is the main factor. Clusters of cheap compute boards could be a better deal than an expensive GPU. For training you can use Continuous Gray Code Optimization. Each device has the full neural model and part of the training set. Each device is sent the same short list of sparse mutations and returns the cost for its part of the training data. The costs are summed and the same accept or reject mutations message is sent to each device.
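If it helps to see the shape of that protocol, here is a rough MPI sketch of the broadcast-mutations / sum-costs / accept-or-reject loop. Everything model-related (the weights array, apply(), local_cost()) is a placeholder made up for illustration, not anything from a real library or from the video:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N_WEIGHTS   1024
#define N_MUTATIONS 16
#define N_STEPS     1000

static float weights[N_WEIGHTS];            /* every rank holds the full model */

static void apply(const int *idx, const float *d, int k, float sign) {
    for (int j = 0; j < k; j++) weights[idx[j]] += sign * d[j];
}

/* Cost of the full model evaluated on THIS rank's shard of the training data. */
static float local_cost(void) { return 0.0f; /* placeholder */ }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    float mine = local_cost(), best;
    MPI_Allreduce(&mine, &best, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    int   idx[N_MUTATIONS];
    float d[N_MUTATIONS];
    for (int step = 0; step < N_STEPS; step++) {
        if (rank == 0) {                                  /* propose sparse mutations */
            for (int j = 0; j < N_MUTATIONS; j++) {
                idx[j] = rand() % N_WEIGHTS;
                d[j]   = ((float)rand() / RAND_MAX - 0.5f) * 0.01f;
            }
        }
        MPI_Bcast(idx, N_MUTATIONS, MPI_INT,   0, MPI_COMM_WORLD);  /* same list everywhere */
        MPI_Bcast(d,   N_MUTATIONS, MPI_FLOAT, 0, MPI_COMM_WORLD);

        apply(idx, d, N_MUTATIONS, +1.0f);                /* try the mutations */
        float trial_local = local_cost(), trial_total;
        MPI_Allreduce(&trial_local, &trial_total, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

        /* Every rank sees the same summed cost, so every rank makes the same decision. */
        if (trial_total < best) best = trial_total;                /* accept: keep them  */
        else                    apply(idx, d, N_MUTATIONS, -1.0f); /* reject: undo them  */
    }
    if (rank == 0) printf("final summed cost: %f\n", best);
    MPI_Finalize();
    return 0;
}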
Hi @GaryExplains - fantastic video. Thank you for sharing your knowledge with the community.
I have a quick question. Given that the Jetson Nano used in this video is discontinued, what Jetson module would you recommend instead? Could this work with 4 Jetson Orin Nano modules (and would the Dev Kit be needed or could we just go with the module)? Thanks!
Great video, thanks. It would be great to compare these results with a regular PC.
recommend Terminator (if available) for multi terminal window(s)
Tmux is even better. Just have to learn the weird keybindings.
Just curious, is it possible to mix supercomputer clusters together?
Like using a Raspberry Pi cluster for CPU and an Nvidia Jetson cluster for GPU computation. Is crosstalk available?
Great video. One question, can you elaborate a little bit more on the github about commSize? I just didn't know how to set it as an argument. Thanks again for the video.
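In case it helps: as far as I can tell, in the standard MPI pattern (which the simpleMPI sample appears to follow), commSize isn't something you pass as a program argument at all. It comes from MPI_Comm_size() and is controlled by how many processes mpiexec launches (the -n count and/or the hostfile). A minimal sketch:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int commSize, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &commSize);  /* total number of processes, set by mpiexec */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* this process's id: 0 .. commSize-1        */

    printf("rank %d of %d\n", rank, commSize);
    MPI_Finalize();
    return 0;
}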
If I had $500, would it make more sense to make a cluster for Blender rendering, or get a 3070?
Probably better spent on the 3070, because 100% of the cost is going to the GPU. But if you bought Jetpacks or other single board computers, a lot of that cost would go to parts other than the GPU (CPU, connectors, cables, etc). Just mathematically you'd be spending more money on silicon with a 3070 than with a $500 cluster.
Not even close to worth it for rendering, price-wise. A Jetson Nano has 128 CUDA cores and costs roughly 100 euro. A GeForce 1070 has 1900 CUDA cores and you can pick up a used one for around 100 euro. Do the math.
3070. Gpu are very cost effective.
Can this mine monero in the same fashion you demonstrated?
Can you add a gpu to this setup?
How about setting it up for mining Monero? How to do that?
That was my thinking too, I have 19 1gb cards, can they work together as 1 card and then mine with them as 1?
Hello, kindly provide a detailed video on how to make a cluster supercomputer using 4 Nvidia Jetson NX boards, with detailed wiring and programming commands. Thank you.
Nice one sir
I will watch and study all your videos. I want to do more than just study; there's something I'd like to create, if possible. I'll try to reach out when I'm finished studying all your videos. Would it be possible to ask a few questions just to gain some knowledge? Great video. I don't know much about this, but I understood you. There's a lot to it. I need help with my project.
Thanks for the video. Is the cluster able to run Apache Spark?
Isn't it more cost-efficient to compute on a graphics card than on Jetson hardware? I believe the Jetson is optimized for running with high requirements on low power consumption and on being light and small, not at all on raw computational power. Where the Jetson may be useful is to drive robots that move around, but not for stuff I do, like machine learning. CUDA cores, as far as I know, cannot handle floats, which is a requirement to compute SQRT. You use the tensor and CUDA cores for linear algebra and matrix transformations on ints.
It would be great if you did videos covering all the details of setting up such a cluster, for a Linux-based environment. What software, how to cable it all up, etc. etc. etc.
What areas exactly need expanding? I thought the video, along with the documentation I created, should be sufficient.
@@GaryExplains I'm sorry - I've only watched the videos. I will look into your linked documentation. Thanks for the reply!
Can you show us an Ampere Altra, 2 x 64-core Threadripper, and then clustered Nanos? Price, price per core, total wattage, and then speed? We're all looking forward to it.
Also, why not use a bunch of single-board computers meshed with fibre instead? Perhaps a redone version using Gen-Z or something close? I'd love to see your project results.
Gary, great video!! I've been thinking of playing around with one of the Nano 2GB boards and you have given me some courage to try. Did you have to write your own CUDA code to execute the SQRT function?
Two words 'Pretty Cool'
Do you think is possible to run OpenAI Jukebox using these kind of Jetson Nano Cluster arrangement? Thank you
Nvidia has some kick-ass courses on AI and machine learning too. Some are even free!
I'm too lazy to develop anything. I wait until someone else does all the hard work, then I become the consumer.
Indeed it does. I demo one of the course modules during my review of the Jetson Nano 2GB.
@@GaryExplains indeed. I heard it from you first.
If I didn't have you, I'd be back in the dark ages like the rest of humanity.
@@ufohunter3688 😂
Doesn't having nodes with varying specifications affect the overall performance of the cluster?
What about SOLID code using classes conditionally, or event-driven webpages/forms where the controls dynamically trigger classes for a task or set of tasks, either programmatically or as sequentially activated tasks?
That probably can't easily be run in parallel. Some stuff can be parallelized, some can't. In your example the webserver would be handling many requests in parallel, as would the database, but each user interaction and the relevant objects would be a single thread.
I have a question: should I get the 4GB Nano or the 2GB Nano for the cluster?
Does RAM matter?
I don't plan to buy them, I just want to know.
Also, can you run Android on the board?
I'd love to see a video about that!
You can't run Android on it; the drivers and firmware are not available. For a GPU/CPU cluster (i.e. calculations) the 2GB will do fine. If you're planning to do web hosting or run Docker/Kubernetes, the 4GB of RAM will be a very useful addition.
It would be really helpful if you could do a video on how to write some basic code that utilizes the GPU (Nvidia as well as Radeon) in popular programming languages such as C or Java. I couldn't find proper resources, and even when I do find them, they are hard to understand.
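Not a full tutorial, but as a starting point on the Nvidia side, this is roughly the smallest useful CUDA C program (built with nvcc). For AMD cards the equivalent would be HIP or OpenCL, which look similar in spirit but aren't shown here; the file name and sizes are just placeholders:

// vadd.cu -- build with: nvcc vadd.cu -o vadd
#include <stdio.h>
#include <cuda_runtime.h>

// Each GPU thread adds one pair of elements.
__global__ void vadd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    // Managed (unified) memory keeps the example short: CPU and GPU share the pointers.
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    vadd<<<blocks, threads>>>(a, b, c, n);   // launch the kernel on the GPU
    cudaDeviceSynchronize();                 // wait for it to finish

    printf("c[0] = %f (expected 3.0)\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}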
Couldn't you use this kind of setup for an OpenAI-style deep-learning machine?
That was Great, Thanks
You can get the Nvidia Elroy as well which is even smaller.
Can you use these clusters, since they share GPUs, to make a small-form-factor SBC gaming computer, for AAA titles that one single board couldn't normally run?
No. The latency over the network is much too high.
I was just about to post my brilliant formula for calculating the number of (identical) processors above which the compute time of the cluster starts increasing because of latency (The time taken to communicate results over the inter-connects.) Fortunately, I hesitated to post it when I realized that the overall latency of the cluster might not be additive. In other words, total latency time might not = N * L(1) for N = number of processors and L(1) is the latency of one processor alone. Is there some simple formula that scales latency as a function of the number of processors in a cluster? I suppose that number might vary wildly depending on the topology and hardware of the inter-connects, but I have no idea, really.
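There's no single formula, but a common first-order model (the standard latency-bandwidth model from MPI performance analysis, not anything specific to this build) is to charge each message a fixed latency plus a bandwidth term and let the topology decide how many messages you pay per step:

T_{\mathrm{msg}} = \alpha + \beta m, \qquad T(N) \approx \frac{T_{\mathrm{compute}}(1)}{N} + k(N)\,(\alpha + \beta m)

where α is the per-message latency, β the inverse bandwidth, m the message size, and k(N) the number of sequential message steps per iteration: roughly N-1 for a naive gather, about log2(N) for a tree-based reduction. So total latency is generally not N * L(1); it depends on how the collective is implemented and on the interconnect, which matches the suspicion above that it isn't simply additive.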
As someone who just stumbled over this video... and has no idea what use you have for this, let me ask a question.
By IT standards I have a rather old graphics card in my main system. It is a "Zotac GeForce GTX 970 AMP! Extreme Core Edition".
This card comes with 1664 CUDA cores. Is there anything that these little computers can do that my old graphics card cannot?
I really don't understand why someone would spend 400 bucks on 4 mini computers, which also need a specific configuration and lots of knowledge,
to get less CUDA power than a 6-year-old graphics card that you can buy for half the price.
Am I wrong?
Hey, Gary.
Do you think it would be possible to build a super GPU that could take any generic job from, say, a video game, split the job across a thousand smaller chips, and respond fast enough?
I read about a pioneer in your field who used 55,000 cheap video game chips to do topological maps in an eye blink.
The challenge being a) impersonating a recognized video card that the host computer will accept and b) lightspeed lag.
As I recall, he used fiber optics.
Latency is the issue, hence the researcher's use of fiber optics etc. Remember a typical graphics card uses a PCIe bus to its full extent. Hard to replicate that kind of speed over a distributed network.
@@GaryExplains
I recall the original Power PC RISC 6000 used five chips linked by a 128 bit bus.
And there’s no LAW that you have to use a standard enclosure.
Make a filing cabinet into a PC case, put in a standard motherboard and snap three two foot by 1 foot high PCI cards into each drawer.
I know even the length of the copper traces matters, but you could use fiber optics to distribute the jobs.
That’s 5% faster than lightspeed through copper, right?
Put as many cores on those cards as you can.
Maybe, with a Herculean effort, the timing issues could be resolved and you could run video games on it.
Plus, the idea scales well for solving conventional parallel processing jobs.
I have 19 1gb cards, can they work together as 1 card and then mine with them as 1?
Interesting, you can probably get access to a full size computer cluster/supercomputer (preferably sponsored by Nvidia) and do a video on some serious number crunching (e.g. 3rd root) using mpi+cuda+scheduler. Thank you for the video :)
How is MPI Execution different than Map Reduce Techniques like Hadoop?
MPI also has reduce etc functions.
@@GaryExplains oh so it seems both are similar 👍🏼
So would there be a way to get these GPUs to encode a video using NVENC?
Almost certainly you could encode a video in parallel, if you have the right software to do that.
Hello, I went that path a few years ago with a cluster of Raspberry Pi, the problem I kept running into was out of RAM errors with Apache Spark. In the end, I prefer to use a multithreaded CPU and a good GPU. Recently, I picked up USB Google Coral TPU (4 TOPS) which I wanted to run in parallel, but because I need "regression" models (not image recognition) I am not sure if I will succeed.
I'm curious as to how a cluster of APUs, like a Ryzen would run...🤔
I wonder how this would compare in speed to AI-upscaling video on a regular GPU like a GTX 1080.
It looks cool and may have some educational value, but otherwise it is useless. A single GTX 980 Ti (same old Maxwell architecture, so apples to apples) has 2816 CUDA cores; the Jetson Nano has 128 and is clocked lower, so you need more than 22 Jetsons to match the theoretical computing power of an old GTX 980 Ti, which is worth 200 USD on eBay. And it will only work for special cases where it is not critical for the algorithm that you have a low-bandwidth, high-latency link between nodes.
Could a supercomputer be built from a network of Nano clusters that are remote from one another?
Yes, as long as your program isn't latency sensitive. So if each node is given a task that takes hours to run and then reports its results, that would be OK. But if the tasks are short and there is lots of IO then the performance will fall dramatically. There are also security implications of opening up those nodes for access across the internet.
Hello, have you looked at something like crypto mining with the Nvidia Jetson series of computers? Most likely mining Monero.
Just linking a few computers together - is that really a 'supercomputer'?
From the demo he did, yes. It did decrease the compute time, from 28 sec to 5 sec. Isn't that super?
Gary, hello sir and thanks for the awesome video. I will jump right into it. I followed the GitHub instructions and everything works great. Well, almost. I can see all four Nanos I have as my nodes crank up. The CPUs go up as the data is received, but only one GPU at a time will run the program. When I edit the clusterfile I can remove all of them and just add one at a time, and each will individually run the program, each working fine. When I add all four it does show 16 cores, and like I said, with jtop I can see them all try to go, but only one will go at a time, never all four. The instructions were simple to follow and it went great. Just wondering if you have any ideas, or if there is some data I can share that would assist you with finding out what I am doing wrong.
Thank you for the awesome video!!
I just watched your video again and I see you have in the clusterfile the IP and :1 or :3 in your examples. I did not do that part. I will try that when I get home today and see if that is my problem.
Thank you
I get Xbox vibes from the stacking motherboard.
Also, please upload a video on training YOLOv4 on a cluster of Nvidia Xavier NX boards. The video is very impressive; I was waiting for this type of video.
The code needed to do this is most interesting. Can you make it available?
All the details are in my GitHub repo: github.com/garyexplains/examples/blob/master/how_to_build_nvidia_jetson_gpu_cluster.md
@@GaryExplains Very nice. Thank you.
Imagine technology in 50 years, sooner or later we'll all be on the internet
How much does it cost?
Does anyone know whether this can run the ARM version of the Folding@Home software and whether it will utilise the GPU? (I currently have it running on a Pi 3B+ and a Pi4.)
No Raspberry Pi supports CUDA. The Pi 3 has partial OpenCL (a CUDA alternative) support, but there is none on the Pi 4.
Is higher ram (4gb) better for deep learning?
yes yes yes pls make a video about Amdahl's Law
Not possible to fry a potato whilst it's being peeled and chopped? Is this a challenge? Give me a space suit with knife and peeler attachments and prepare to be amazed :)
I’d like to see it handle a jupyterlab server!
Would four Intel or AMD motherboards work in this manner?
If they had NVIDIA graphics cards then yes. If not, then they would work similarly to the Raspberry Pi supercomputer I show in my other video.
What types of operations are a GPU core and a CPU core optimized for? For instance, in what situations could a 4-core CPU outperform a 128-core GPU of comparable price?
I think it is very interesting; especially if the operations are simple enough (like monitoring IO voltages for temperature sensors and the like), you could reduce downtime by segmenting the process over 128 cores instead of 4.
This is interesting for square roots, but where would the downfall be?
All in all, this makes me want to get into developing a small supercomputer as you have shown, for the experience.
Thanks!
So cool.
That moment when your program is Embarrassingly Parallel, and you like it just the way it is.
APUs?? 🤔
Can I buy this?
You can buy all the individual components or you could buy a Jetson Mate, I have a review of it here: th-cam.com/video/nWzcEUj0OHc/w-d-xo.html
And could it run VMware ESXi as the Raspberry Pi does? I think it's a new deal in computing... It would be great to mount an operating system on ESXi-clustered SBC machines... You'd only need a motherboard to join all this properly and to add a real GPU to it... RAM, an extra GPU and clustered mini computers could do it greater and better than only one beast.
Change bios
You'll need a central controller to distribute the load among thousand if not millions of nodes to mimic a brain in real time.
What do they call this master controller? The "Soul" of the system?
But can it run "Crysis"?
Perhaps a foolish question, but as the Raspberry Pi (for instance) also contains a GPU, could this same thing be done with it via OpenGL?
You probably could with OpenCL or similar
Dude, I want to build a personal supercomputer. I'm giving myself 2 years to complete it and I know nothing about computers. I'm just super interested. My goal is the world's smallest and most inexpensive supercomputer. I want to see Latin America progressing.
Is a cluster always a supercomputer? 🧐
No, because you can have clusters that share IO resources but not compute resources.
Can I build this and mine crypto on this type of computer?
That was my thinking too, I have 19 1gb cards, can they work together as 1 card and then mine with them as 1?
Just get yourself a single RTX 3070 or 3080 in your PC, and you'll be more than good enough for a supercomputer at home (not even going for 2 or 3x 3090 in a system).
Yes, of course, but that isn't the point, is it? You don't use MPI with just one card in your desktop.
@@GaryExplains But something like a 3090 with 10k+ cores surely crunches way more data than even 100 of these little boards.
@@ProDigit80 Absolutely. But again, that isn't the point.
Yo!
👍
Seeed Studio is selling a 4-Xavier-node GPU cluster for a whopping $2200. No thanks, I'll just get a 3090.
What Gary fails to mention is that pricing for Jetson boards is 23x more expensive per CUDA core than RTX 3000-series GPU cards.
Jetson TX2: 256 cores, $500
RTX 3070: 5888 cores, $499
You are missing the point here. They do the same thing with Raspberry Pi clusters. It is not about building the most cost-effective CPU or GPU environment for computing. It is about building a small supercomputer to test your code and learn how to run programs in an environment that resembles supercomputers. So you will never use this system for production or real tasks, but you can use it to learn how MPI works or how your program scales in a cluster environment. When you buy an RTX 3070 you are learning how to run your program in a desktop or laptop environment, which is a totally different environment than this, and it targets different users. I suggest everyone learn how to do scientific computing by building a Pi (CPU) or Jetson (GPU and CPU) cluster.
@ "So you will never use this system for production or doing real tasks"
Please show me where this is mentioned in the video / description.
You are right - you should NEVER EVER build a real supercomputer using Jetson boards.
But that information is sadly missing in this video.
Now I can play Minecraft.
You show the cluster but not how to do it? What a shame.
I thought I covered everything, what is missing?
Its will flail
It's a pity I didn't know about you earlier. Do you have a Facebook account? Publish it to demographics in Africa and Asia; you're supposed to have 10 million subscribers.
A 3080 is cheaper than all those cards
Two things. 1) That isn't the point. The point is to understand how supercomputers work. 2). Really? Four Jetson Nano 2GB boards at $59 each is $236. How much are you paying for an RTX 3080? I would love to know the name of the place you can get one for that price.
lol,
RTX 3080 is $500+ and as Gary said, it's not the same as running these.
lol. You didn't get the point. Imagine someone trying to make a portable AI cluster. Are you supposed to use a 3080?
@@GaryExplains
1. Your mixing up of GPU cores with regular CPU cores and comparing them to a supercomputer is downright wrong. You confuse people even more while not properly understanding how a supercomputer works and what it is used for; furthermore, you push the idea that the Jetson boards are a feasible option as a compute-intensive cluster.
2. You have said it yourself: 2 x 2GB Nano, 1 x 4GB Nano and 1 Xavier.
The Xavier alone costs: www.amazon.co.uk/dp/B0892DZQXK?tag=amz-mkt-chr-uk-21&ascsubtag=1ba00-01000-org00-win10-other-nomod-uk000-pcomp-feature-scomp-wm-5&ref=aa_scomp&th=1
Hmmm... I don't understand your point 1. As for your point 2. That is what I happened to have here. I wasn't going to just buy two more Jetson Nano boards for this video. But you only need the Jetson Nano 2GB to make this work as described.
Terrible! Just buy the higher-tier boards like the Xavier boards... clearly the Jetson sales must be low enough for representatives to sponsor this gentleman... don't waste your time and $!
Hmmm... Terrible comment! First, this video is sponsored by Linode, not NVIDIA. Second, how can you justify the price difference between the Nano and the Xavier? Why is using the Nano so "terrible"? What does a person learning about HPC, MPI and CUDA gain by spending all that extra money (besides perf)?
Which one of us is in the business of explaining Nvidia products? Maybe if your sponsors read this you could make that video :)
eh?