@@tubeincompetence That is super true... it's a shame that not every task and calculation can be processed across different threads, especially across different systems. That's where being creative about efficiency comes in, sometimes even finding ways to split the task across threads when at first glance it doesn't seem worthwhile or even possible.
Didn't try it on AMD or Intel CPUs but on an RTX 3090 using the Numba Compiler to run the python script on all cuda cores. My results were about 8 seconds for 500.000 digits of pi. 100.000 and 200.000 both below 1sec
digits of pi? You mean prime numbers up to 500k? (that's what the script does). Regarding your results: 8 seconds? Sounds like a LOT of Python overhead on the CUDA cores. I don't have the 3090's CUDA core count to hand, but I expected way faster results.
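For anyone curious what a Numba-on-CUDA version might look like, here's a minimal sketch. To be clear, this assumes an NVIDIA GPU with the CUDA toolkit plus the numba and numpy packages, and it's just one way to structure the kernel, not necessarily what the commenter above ran. One GPU thread brute-force tests one candidate:

import numpy as np
from numba import cuda

@cuda.jit
def prime_flags(flags, max_number):
    # Each CUDA thread tests one candidate by brute-force trial division
    n = cuda.grid(1) + 2  # skip 0 and 1
    if n <= max_number:
        prime = True
        for d in range(2, n):
            if n % d == 0:
                prime = False
                break
        flags[n] = 1 if prime else 0

max_number = 500000
flags = cuda.to_device(np.zeros(max_number + 1, dtype=np.uint8))
threads_per_block = 256
blocks = (max_number + threads_per_block - 1) // threads_per_block
prime_flags[blocks, threads_per_block](flags, max_number)
print("Primes found:", int(flags.copy_to_host().sum()))

Even a naive kernel like this can be quick simply because thousands of candidates are tested at once, which would explain sub-second times for the smaller ranges.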
This is a very poor comparison. It would have been great maybe in the 90's, but it's single threaded. Even the PI's have more than one thread each. I actually have the Ryzen 5 2600x, so 11 cores are sitting at absolute idle while one luckless bastard of a core gets pegged while it's working on the larger datasets.
It would definitely have been a more thorough comparison to get the script running across all cores of each device, but I haven't been able to get a multithreaded Python script to run stably on multiple platforms. I'll keep working on it. The Pi cluster is also sitting with 24 unused cores during this test.
@@MichaelKlements That's a video I would like to see, as I have 3 older server systems needing something to test out. Should you get it working I might even go out and buy more Pi 4's just to build and play with. Thanks Michael
That cluster from 2 years ago nearly keeps up with my 12th gen i9 12900H 14 Core Monster. (Laptop)
10000 - 0.23 sec
100000 - 17.86 sec
200000 - 71.04 sec
Great video, thank you for sharing.
Around 2:45 you said there's exponentially more factors to check when increasing the range. Isn't the growth here quadratic instead of exponential? That is: I think runtime should grow more like x^2 instead of to the power of x.
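For reference, here's the arithmetic behind that (my own back-of-the-envelope estimate, not from the video): the brute-force check does up to n - 2 trial divisions for candidate n, so the total work up to N is roughly 2 + 3 + ... + N ≈ N²/2. That's quadratic, as you say, so doubling N should roughly quadruple the runtime, which the posted times bear out (e.g. the MacBook results elsewhere in the thread go from 35.64 s at 100k to 142.12 s at 200k, almost exactly 4x).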
Ryzen 5 3600x 6-core 12-threads 4.0Ghz
Multi threaded times:
*10.000 - 0.27s
*100.000 - 18.92s
*200.000 - 68.07s
These tests were run using Ubuntu 20.04 with the 5.8.0-41-generic kernel using python3 3.8.5.
Posted on the homepage, but ran the multi-cpu edited code posted there on my Ryzen 5900X (Win10, plenty of wasted resources with extra stuff running, running from within PyCharm with Python 3.7.8).
In short: HyperThreading / SMT seems to give better results, but increasing the number of processes per core carries a penalty, presumably the time to spawn processes, so really this needs a median or an average over several runs. Changing num_processes = mp.cpu_count() makes this a bit clearer (tests carried out with 1 run each, so no averages).
Just going *1 (24 processes):
10 000: 0.29s
100 000: 16.01s
200 000: 59.66s
500 000: 344.81s
Default settings in the script (*4, 96 processes):
10 000: 0.48s
100 000: 16.38s
200 000: 59.59s
500 000: 349.91s
Tweaking the script to *24 (makes it angry at low limits, 576 processes):
10 000: 1,89s
100 000: 21.9s
200 000: 63.27s
500 000: 346.81s
And just because fun - 1 000 000: 1300.92s
Tried it out quickly on my 3900X with Linux and Python 3.8.5:
Default settings *4 (96 processes, oddly not spreading out well to cores..)
10 000: 0.25s
100 000: 16.52s
200 000: 59.16s
500 000: 340.22s
Going *1 (24 processes)
100 000: 16.55s
And because it's fun: 1 000 000: 1311.5s
In short, there's a limit to how far one can scale this on a single machine: increasing the number of processes per core carries a penalty, and delegating across many cores seems to be a bit of a bother.
Something else that would be impressive: use 20 Pis for chunked encodes. Have ffmpeg splice a file into chunks of e.g. 10 seconds, let each node render a share of the chunks, and compare it energy-wise with a Ryzen 9 or so. For video encoding (software encoding gets better visual quality than GPU-based) this might be a good option too.
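A rough sketch of the splitting step that idea needs (the filenames and settings here are hypothetical; ffmpeg's segment muxer does the slicing, and the chunks would then be farmed out to the nodes):

import subprocess

# Split input.mp4 into ~10 second chunks without re-encoding (stream copy),
# so the expensive encode can be distributed across the cluster nodes.
# Note: with -c copy the cuts land on keyframes, so chunk lengths are approximate.
subprocess.run([
    "ffmpeg", "-i", "input.mp4",
    "-map", "0", "-c", "copy",
    "-f", "segment", "-segment_time", "10",
    "-reset_timestamps", "1",
    "chunk_%04d.mp4",
], check=True)

Each node would then re-encode its share of the chunks, and the pieces can be stitched back together afterwards with ffmpeg's concat demuxer.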
Awesome hardware build. I'm on the hardware side of things and I think it's pretty cool what you have made. I can't speak for the software side, but it's still awesome to see. Thank you for sharing this program with us.
Super cool cluster! Have you heard of CSSH? Would recommend looking into it if not. Can SSH into all the Pis at once, and send commands to them all at the same time. Absolute lifesaver for clustered setups like this.
Would there be a way to treat a PC with multiple cores similarly to the cluster, where the task is divvied up among the different cores, resulting in potentially faster computation by more fully utilising the power available? It seems great care was taken to fully utilise each Pi, but were the PC or Mac using more than a portion of their power? Also the weight of the OS seems like it could be a hindrance, though I know that's probably outside of the scope.
In this case, you benchmark using integer arithmetic, not floating point. I saw another video that benchmarked a Turing CM 3 cluster. But there too the applications were running not floating point processes but tasks in Drupal and Wordpress. How do your 3 systems compare performing floating point math benchmarks?
What are the practical applications of running the Pi's in such a way for someone like me (not a computer science major or programmer, etc.)? Would you be able to run an OS capable of running Windows or Linux and associated apps, or will this only run specific programs written for the Pi? Thanks, and sorry for my lack of knowledge on this subject!
Pi's themselves are perfectly capable of running a range of Linux based apps and many people use them as standalone desktop computers, typically the Pi 4Bs. Clustering them is purely to improve processing power, mainly for simulations, rendering or bulk computing tasks.
Yes! This had been a fun project to work on and I've learnt a lot along the way as well. I'm looking forward to building a few more clusters in future, this definitely won't be the last.
I've never owned a cluster of any sort. Does ffmpeg have a cluster render option? I know programs like mandelbulber2 have CPU clustering features for fractal rendering... The host needs some RAM. We need a motherboard that will take 8 Pi 4 compute modules.
I like your cluster setup, and hope to get my own running soon enough. I ran your scripts on my Opteron workstation. (circa 2009, except for the cpus) 2x 6328 cpus, 16 cores total @ 3.2ghz
SP:
Find all primes up to: 10000
Time elapsed: 0.91 seconds
Number of primes found 1230
Find all primes up to: 100000
Time elapsed: 57.64 seconds
Number of primes found 9593
Find all primes up to: 200000
Time elapsed: 216.64 seconds
Number of primes found 17985
MP:
Find all primes up to: 10000, using 16 processes. chunk size: 4
Time elapsed: 0.58 seconds
Number of primes found 1229
Find all primes up to: 100000, using 16 processes. chunk size: 48
Time elapsed: 4.42 seconds
Number of primes found 9592
Find all primes up to: 200000, using 16 processes. chunk size: 97
Time elapsed: 15.65 seconds
Number of primes found 17984
Find all primes up to: 500000, using 16 processes. chunk size: 244
Time elapsed: 87.72 seconds
Number of primes found 41538
I had to calculate chunk size to saturate the cpu better, without changing the brute-force prime number algorithm:
chunk_size = max_number // num_processes // 128
I modified the script to run in 6 processes with 5 threads each, which enabled my MacBook Pro to complete 200 000 in 25 seconds. Just FYI. The single process and thread version completes in 168 seconds.
Here is what my Intel i7 1165G7 (4267MHz LPDDR4) got:
Single:
10,000: 0.27
100,000: 22.73
200,000: 87.38
Multi:
10,000: 0.53
100,000: 15.84
200,000: 62.88
Interesting sidenotes:
- The M1 chips are ~40% slower in the singlethreaded test. Not quite sure where the gap comes from, but multiple people in the comments here reported the same. Maybe the Apple chip just isn't *that* good in everyday use when you can have an accelerator.
- The singlethreaded results can take on most desktop CPUs here. The 5900x seems to be around 10% better. Not bad for a laptop i7!
- Intel's performance has not been stagnating as much as I thought; in the comments you can see clear generational improvements.
- Thanks for adding a multithreaded script as well!
I ran the script on my M1 MacBook Air (fanless) using python IDLE 3.9.0.
Single process:
10,000: 0.51s
100,000: 36.96s
200,000: 138.68s
Multi process:
10,000: 1.26s
100,000: 27.5s
200,000: 101.2s
I think the only other thing that stands out to me is that my computer never got hotter than 35°C, or barely warm to the touch.
Tested this on my Ryzen 3800X. 8 Cores, 16 threads, base clock 4GHz. Running Windows 10 Pro.
Single-core version:
10,000: 0.32s
100,000: 24.42s
200,000: 92.35s
Multi-core version (didn't appear to actually use all cores, according to Task Manager):
10,000: 0.58s
100,000: 16.37s
200,000: 61.36s
Sorry to tell you, but on some of the computers you ran the script in the IDLE shell. The IDLE shell is very slow and un-optimized; you should rerun this without using IDLE.
Cool video, I tested the script on my i7 8700. I got the following results:
10000: 0.37 seconds
100000: 36.29 seconds
200000: 123.79 seconds
I have no experience in Python so I ran the script right in PyCharm, which might not be efficient, but who knows lol.
For more comparison data, made with a Ryzen 5 2600:
Single process:
- 10000: 0.47s
- 100000: 38.44s
- 200000: 144.72s
Multi process (48):
- 10000: 0.89s
- 100000: 22.61s
- 200000: 81.07s
I am surprised by the relatively low difference, but it may be perfectly normal.
Adi Sieker has put together a multi-process version of the script - www.the-diy-life.com/can-my-water-cooled-raspberry-pi-cluster-beat-my-macbook/#multi_test_script
This makes use of multiple cores and threads on the CPU it's being run on, so is more representative of the processing power of the whole CPU.
I'll post updates as I run it on each system:
HP Laptop:
10,000: 0.9 s
100,000: 18.27 s
200,000: 66.99 s
500,000: 374.3 s (6 min 15 s)
MacBook:
10,000: 1.42 s
100,000: 35.64 s
200,000: 142.12 s
500,000: 827.53 s (13 min 45 s)
Pi 2.0Ghz:
10,000: 0.9 s
100,000: 109.02 s
200,000: 448.02 s
Pi Cluster 2.0Ghz
10,000: 0.9 s
100,000: 15.88 s
200,000: 57.67 s
500,000: 312.45 s (5 min 10 s)
CPU: i7-10750H 6 cores / 12 threads at 2.6GHz
10K: 0.26 s
100K: 17.59 s
200K: 64.13s
500K: 375.43s
The code on the blog is also buggy, need to change chunks to 100 instead of 1 in last parameter. Ryzen 5900x does 200k in 4.1 seconds with stock amd cooler.
@@esteban- just changing chunks to 100 brought the 500K down to 97.94s on mine.
@@acidspark You can probably halve that with a small fix to the prime-finding code too; it's illogical to test divisors from 2 all the way up to the candidate number, since no proper factor of N can be bigger than N/2, so range(2, candidate_number//2) will cut the tries in half. That would be cheating when comparing with the video's numbers though, whereas using all processes correctly isn't.
With max_number at 200000, mine ran 121 seconds on the single-threaded version vs 78 seconds on the multi-thread version.
Intel i5-8300H (8) @ 4.000GHz, 4 cores/8 threads
Please tell me you called it Octopi.
Octopi already exists
It’s used for 3d printing
Octopis
This comment is underrated.
Nice name🙂
Tested the script on my Ryzen 7 5800x running linux 5.10 and python 3.9.1 with the schedutil governor:
To 10000: 0.27 seconds
To 100000: 21 seconds
To 200000: 73 seconds
So faster than the cluster but costs about the same (I think I paid about 520€) just for the CPU. That said, the script only used a single core, if it was multicore it would have been 8-16 times faster.
So basically one Ryzen core can do the work of a Pi cluster?
@@alouisschafer7212 Essentially
88 seconds for 200000 on a Ryzen 7 3700X, for comparison's sake.
That's really impressive for a single core!
Yeah, if the script were multi-threaded then the Pi's would use their additional 3 cores each as well, so they should also be 3-4 times faster, but still no match for your Ryzen 7 5800x
@@MichaelKlements It can be done easily, just import the multiprocessing library in Python. I tried that and got it running with all 40 cores on my server.
It's kind of pointless to compare single core performance of a laptop CPU to performance of 8 Raspberry Pis.
You could get much more meaningful results by loading every core on laptops (with and w/o hyperthreading) while monitoring for thermal throttling, and also loading each core on your Pi cluster, not just one per node.
Yes, it would definitely have been a better comparison to utilise all cores on each system. I haven't been able to get a multi-threaded Python script to run stably on multiple platforms yet, but I'll keep working on it.
This still gives you a pretty good idea of how they would compare if you know the core count of each device being tested.
@@MichaelKlements Yeah, multithreading in python is virtually non-existent (due to GIL), but it should be not too hard to run one python process per core, the same way you did with MPI.
@@MichaelKlements Using Parallel from joblib makes it pretty trivial to run on multiple cores: drive.google.com/drive/folders/1_VUNGTMIvpuy_7pAXjTvD0MCGfQaf2NM?usp=sharing
With the Python GIL, multi-threading is out of the equation (so on some systems one can gain more performance by leveraging other languages)
Thanks José, I'll look into joblib and try that out on the cluster as well.
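For anyone who doesn't want to dig through the linked folder, a joblib version of the benchmark might look roughly like this (a minimal sketch assuming the same brute-force prime test as the video's script; n_jobs=-1 spawns one worker per core, which sidesteps the GIL by using processes rather than threads):

from joblib import Parallel, delayed

def is_prime(n):
    # Same brute-force trial division as the original script
    for d in range(2, n):
        if n % d == 0:
            return False
    return n >= 2

max_number = 100000
results = Parallel(n_jobs=-1)(delayed(is_prime)(n) for n in range(2, max_number))
print("Primes found:", sum(results))

joblib batches the tiny per-number tasks automatically, so dispatch overhead stays reasonable even with one call per candidate.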
I would have really liked to see this as well. I tested the script on my i9-9900k @ 4.8 ghz and got 0.33, 24, and 84. If the script were multi-threaded then the pi's performance would likely falter to some other laptops/PC's
Ryzen 5 3600XT results:
single process script:
10k: 0.59 seconds
100k: 28.49 seconds
200k: 104.89 seconds
multi-process script:
10k:0.5 seconds
100k:6.82 seconds
200k:23.62 seconds
As someone pointed out in the comment section, the performance improved greatly only when I changed line 41 in the multi process script to :
parts = chunks (range(2, max_number, 1), mp.cpu_count())
I ran it on my pentium 4.
Stay Tuned.👌
*5 hours ago*
10 hours ago xD
16 hours ago
1 day ago
2 days ago ...
THIS MAN SETUP A CLUSTER TO RUN SINGLE THREADED CODE WTF AM I WATCHING
I was thinking that
I'm pretty sure MPI just takes the loop and divides it evenly between the nodes.
Each loop iteration is independent so this works fine, but once you get into loops with dependencies (shared info / mutexes) the cluster is going to slow down significantly.
It's a shame this wasn't highlighted in the video.
Yeah, the cluster's performance is better when the task can be parallelized. But single core still sucks.
this comment needs to be pinned.
also why didn't he test the pc on both single core and multi core performance, just to make it complete
I don't even care if it's faster, that cluster looks so awesome!
Tested the script on my Ryzen 7 3800x running Windows 10 and python 3.8.7 (More of a mid range cpu 8 cores 16 threads, max all core was 4.47ghz)
10000: 0.33 s
100000: 26.95 s
200000: 102.24 s
I have a Ryzen 7 2700 I'll check it when I get home
aw man you got destroyed on the 200k test by a raspberry pi lmao a r7 3800x getting rekt by 8 little tiny computers x(
it only runs on one thread so core count is irrelevant. the important thing is clock speed.
if someone makes an efficient multi thread script then core count would be important along with clock speed.
i don't know python well enough but i can do C++ or C#.
@@darkshadowsx5949 but the RPI 4B also has 4 cores
man that's a weak 3800, my 3600 has 78 seconds on 200k. Are your clocks/temps okay?
Here are the results on an i9-9900K at stock speeds:
Single Core:
10,000 - 0.35s
100,000 - 25.85s
200,000 - 100.74s
Multi Core:
10,000 - 0.29s
100,000 - 3.23s
200,000 - 10.94s
2,000,000 - 2,220.28s
You only ran the test on a single thread on each laptop, so the test isn't very fair when comparing to multiple pi's. With multithreading the laptops would be about twice as fast if they're dual cores, or more with hyperthreading.
For full performance it doesn't show the performance of the laptops but it does represent why a pi cluster is useful over a laptop for this purpose in both price and performance. It also shows that if you are doing the kind of work that would benefit from a cluster even a small performance boost can be worth it while still allowing your main computer to be fine while with a laptop to use multithreading would make the machine slow while the processes are running basically wasting the use of the laptop.
Yes it is only running on a single thread on the laptops, but it's also only running on a single thread on the Pis (they're each running quad core processors). So with a multithreaded script, the laptops would have been around twice as fast, but the Pis would then have been 3-4 times as fast as well.
@@MichaelKlements still makes for an unrepresentative comparison. especially since you told people to try it themselves, especially with ryzen processors which can have a lot more than just 4 cores not to mention hyper threading.
@@MichaelKlements Yeah, but with a cluster of 8 or 7 Pis it's like running the same script on a 7-thread CPU. You should make it multithreaded for both the PC and the RPi so it's comparable. If you run a single-threaded app on a Ryzen and on an Intel, the Intel will win on higher GHz, and it's the same for the RPi's: if you manage to make it multithreaded, an 8-thread 4GHz CPU will be much faster than 8 RPi's at 2GHz.
@@aerispalm6523 intel has hyperthreading, amd has smt (simultaneous multi threading)
To get the Multicore Working you have to change the line 41:
#Original:
parts = chunks(range(2, max_number, 1), 1)
#now:
parts = chunks(range(2, max_number, 1), mp.cpu_count())
with 24 cores:
10000 = 0.08 seconds
100000 = 1.67 seconds
200000 = 6.16 seconds
500000 = 37.01 seconds
Thanks much better.
8 cores 16 threads
10000 = 0.47 seconds
100000 = 5.16 seconds
200000 = 21.03 seconds
500000 = 119.29 seconds
stock clock and undervolted 16 core 3950x:
10k: 0.22 seconds
100k: 1.5 seconds
200k: 4.78 seconds
500k: 26.5 seconds
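For context, the pattern that line-41 fix enables boils down to something like the following (a standalone sketch of the same idea using multiprocessing.Pool rather than the blog's exact chunks() helper, so the names here differ from the real script):

import multiprocessing as mp

def is_prime(n):
    # Brute-force trial division, matching the benchmark's approach
    for d in range(2, n):
        if n % d == 0:
            return False
    return n >= 2

def count_chunk(chunk):
    return sum(1 for n in chunk if is_prime(n))

if __name__ == "__main__":
    max_number = 200000
    numbers = range(2, max_number)
    chunk_size = 100  # many small chunks keep every core busy
    chunks = [numbers[i:i + chunk_size] for i in range(0, len(numbers), chunk_size)]
    with mp.Pool(mp.cpu_count()) as pool:
        print("Primes found:", sum(pool.map(count_chunk, chunks)))

If the original helper was producing a single giant chunk, one worker got all the work, which would explain the several reports in this thread of only one busy core.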
I tried it out on my M1 Mac mini, 5 runs of each script and averaged the times:
10000: 0.45 s
100000: 39.77 s
200000: 155.84 s
and because I had the time...
500000: 985.56 s
Thanks for sharing your results. I've heard good feedback from people using M1 Mac's so far, I'll definitely be looking at getting one when they've expanded the range.
That's weird. I forced my gf to try it and she got 24 seconds on her MacBook Pro on the 100.000 test.
Try with the MP code fixed (chunks last parameter 100+) to get some real numbers for the test. M1 is probably as fast or faster than the cluster
My iPhone XS got nearly identical results, within the margin of error using a-Shell
Hi. Do you use Python running on Rosetta or Python native for the M1?
I love all the people here just participating and sending their test result. apes together strong
We're going to the moon with these results!
Tested the Multi-process script on my Pi400, 2.2 Ghz (overclock) running Raspbian 10 (buster) and Python 3.7.3:
10.000: 0.99 sec
100.000: 108.01 sec
200.000: 448.25 sec
WOW, Was not expecting that performance from a PI. Really powerful little machines....😁
Not very powerful by themselves, but when strung together yes they are very powerful
@@coderdude9417 Apes together strong
Not 'a' pi
'8' pi ;)
@@Chrisknot94 🤣🤣🤣🤣
@@Chrisknot94 or pi³
As others have already posted the 5800X, I thought I'd add the 10700K (at stock) running ubuntu 20.10.
Find all primes up to: 10000
Time elapsed: 0.3 seconds
Find all primes up to: 100000
Time elapsed: 21.63 seconds
Find all primes up to: 200000
Time elapsed: 86.98 seconds
And as others have said, it would be great to have it run multithreaded (and clustered).
Shows how much AMD has improved 😂. Never thought I'd see the day when AMD beats Intel at single core.
Thanks for this Andy. That's quite impressive for a single core!
I haven't been able to get a multithreaded Python script to run stably on different platforms, but I'll keep trying.
@Andy Monks, @@MichaelKlements 10700K @5.2 Ghz, 3466 CL14 32GB RAM, Win 10 Pro, python-3.9.1
Find all primes up to: 10000: 0.4 seconds
Find all primes up to: 100000: 14.16 seconds
Find all primes up to: 200000: 52.64 seconds
Only testing single core performance seems unreasonable when testing something that is essentially designed for parallel workloads.
Impressive! I can only imagine what a cluster of 100 Pis can do!!!!!
How about 1 million arm cores:
en.m.wikipedia.org/wiki/SpiNNaker
The SpiNNaker is not a home DIY project...
But yeah, it is impressive
Awesome Work .. please more ... and thank you
Thank you! Will do!
“Running mathematical computations” 😂 dude be crypto mining.
Or machine learning.
I was also thinking, can it do mining?
I've seen mining on a commodore 64.
This can't mine lol since they don't have any powerful GPUs
@@AudioEuphoria080 u can mine with cpu too
I removed the "* 4" to num_cpus and changed to 100 numbers per chunk on line 41, so that it uses more than 1 cpu, and got:
10,000: 0,16 s
100,000: 6,99 s
200,000: 39,52 s
500,000: 274,26 s
On a virtual machine inside a dual CPU 16 core AMD Epyc 7281. So that's 64 cores with hyperthreading, but the hyperthreading didn't give any mentionable difference. Running the VM with 32 or 64 cores both yielded about the same results in time.
Using these same two mods:
10k: 1.00s
100k: 3.22s
200k: 6.69s
500k: 29.48s
Ryzen 7 3950X, Win10, Py 3.9.2
Impressive, and there are still a bunch of optimizations that can be done within the cluster
6:42 You say the cluster needs to manage communication between the nodes. However, the MPI script splits the task into independent, almost identically sized parts, which means there is minimal overhead in parallelizing on a cluster. That said, your cluster does look really neat and you clearly put some effort into making this video, nice job!
Yeah, this script and software setup in general is not a good example of that, because the script already splits the tasks up. There are much better software packages for clusters which manage the splitting and delegation of tasks.
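To make the "independent, almost identically sized parts" concrete, an MPI split of this workload can be as small as this (a minimal mpi4py sketch of the idea, not necessarily the exact script from the video; run with something like mpiexec -n 32 -hostfile hosts python3 primes_mpi.py):

from mpi4py import MPI

def is_prime(n):
    for d in range(2, n):
        if n % d == 0:
            return False
    return n >= 2

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

max_number = 200000
# Each rank takes every size-th candidate; no communication happens
# until the single reduction at the end.
local_count = sum(1 for n in range(2 + rank, max_number, size) if is_prime(n))
total = comm.reduce(local_count, op=MPI.SUM, root=0)
if rank == 0:
    print("Primes found:", total)

The strided split also balances the load nicely, since the expensive large candidates get spread evenly across ranks.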
5900 X here! Python 3.9.1 Win10, 5 runs each setting
Normal Run:
10000 - 0.23 - 0.26 s
100000 - 18.28 - 20.87 s
200000 - 69.83 - 77.14 s
Multi-Threaded:
10000 - 0.46 - 0.49 s
100000 - 11.68 - 11.89 s
200000 - 43.05 - 48.01 s
Are comments being removed?
how did you enable multi threading????
@@MichaelMantion there is another script for multithreading
Have the same CPU, but am on Linux and every time is extremely similar. Some faster, some slower, most within just a second or two.
What's interesting though is the 10000 multi-thread. Your fastest is 2.3x slower than my slowest, plus your threaded is slower than your unthreaded (which our times are identical for). Not sure what would cause that besides maybe a process already using some of the core it picked. Interesting result.
Are you sure you ran the multithreaded properly? Your single threaded scores are what I would expect compared to my Ryzen 9 4900HS, but your multithreaded scores are 4-10x slower than mine. My laptops multithreaded scores were
10,000: 0.05s
100,000: 2.98s
200,000: 11.57s
There’s no way my little laptop is faster in multithreaded than a 5900X.
@@-Burb I've got the same processor as the person above, turns out the multi-thread script was only using 1 core. With that fixed my times were
10000 - 00.06
100000 - 01.20
200000 - 04.36
500000 - 25.18
Awesome video man, I put a comment on the cluster video asking what the benefits were, quite clear now haha :)) Well done
6:49 Wait a second, you're running a script that contains no logic for splitting up work and makes no MPI calls. What are you actually benchmarking here?
About to ask.. how is the work load being shared between the nodes?
I've been working on some render farm software. Basically a cluster interface. If you would like to test it out, hit me up. It's pretty early stages but it's pretty fun to use. It was originally designed for render farms but has been generalized over the last few months to run most tasks you can think of.
Where can I find it?
@@techbrosita9698 Its in active development but if you want access to some early test builds feel free to join our discord
discord.gg/V5jXaBgGtp
So, I ran this at 200k on a few different pieces of hardware that I have access to:
Westmere (2.27GHz) - 1 core: 340.88s (python 3.5)
Westmere (2.27GHz) - 60 cores (10 CPUs): 14.77s (python 3.5)
Ivy bridge (4.0GHz) - 1 core - python 3.6: 148.49s
Ivy bridge (3.6GHz) - 16 cores (2 CPUs) - python 3.6: 10.36s
Zen2 (4.35GHz) - 1 core - python 3.9: 108.53s
Zen2 (4.25GHz) - 24 cores (1 CPU) - python 3.8: 8.19s
CPU TDPs for these systems?
zen2 - 1x280W
ivy bridge - 2x120W
westmere cluster - 10x60W + networking
I imagine your raspberry pi cluster is quite a lot more power efficient than any of my systems!
P.S. end_number = int(sys.argv[1])?
You've imported the sys module, might as well use it!
Thanks for trying this out and sharing your results!
Yeah this cluster runs at about 80W on full load, so it's quite a bit more power efficient than traditional computers.
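For anyone following along, that P.S. just means reading the limit from the command line instead of editing the script for every run, roughly:

import sys

# e.g. python3 primes.py 200000 (falls back to 100000 if no argument is given)
end_number = int(sys.argv[1]) if len(sys.argv) > 1 else 100000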
Ryzen 5 2600 (on a pretty bloated Windows install) using the multi core edit:
Python 3.7.5:
10,000: 0.6s
100,000: 33.62s
200,000: 120.19s
Python 3.9:
10,000: 0.93s
100,000: 27.87
200,000: 102.48s
Pypy (for fun):
10,000: 4.55s
100,000: 6.27s
200,000: 11.34s
Hope this helps
U SOUND HAPPY THATS THE MAIN THING
he sounds monotone and emotionless. So yeah, agreed, /s
Man, you do good work! Great video!!!
Thanks Timothy!
It makes me anxious how you are manually setting up every pi via ssh instead of using ansible
Hey, what other tools are available like ansible for this task?
Please share more details about doing it this way.
@@stalinsampras terraform, chef, and puppet are the main ones i know of
@@gunstorm05 Hey, thanks for replying.
@@stalinsampras absolutely
Very cool experiment and demonstration of cluster computing. Thanks for showing us one of many applications of cluster computing. I will stay tuned for thermals. Great work Maker Michael!
Thank you Deech!
RaspberryPie's website is hosted on a cluster of 16 pies.
@Sam Why do you say that?
@@jaredschmidt4614 cache
For Ryzen 9 5900x at 4.5Ghz, Windows Version 19042.746
To 10000: 0.27 seconds
To 100000: 21.49 seconds
To 200000: 88.44 seconds
As others have said, this is a single core program and used very little system resources. Running this only increased processor utilization about 2-3%. It would be awesome to see a multiprocessing iteration
Thanks for sharing your results.
Oh, dang... now I kind of want to put a cluster like that together and finally get around to an MPI implementation of my raymarched mandelbox thingie. This cluster is excessively cute
Could you make a video on the OLED display which you then connect to the Raspberry Pi 4
And with instructions
Sure, I'll put together a video for connecting an OLED display to the Pi and running a basic script. The issue is often with the software, it's updated so often that tutorial videos are out of date just a few weeks/months after posting them.
@@MichaelKlements ok thank you
when do you make the video for this OLED display?
i'd be interested in seeing the power draw of each device as well, or at least the cluster's
I know that the two sharing USB ports started struggling at 2.0Ghz full load, so I did some measurements and added an extra supply to these. They use 1.7A at full CPU load, so the full cluster uses about 70W when running at full load.
@@MichaelKlements thank you !
Very interesting stuff! Thank you!
Tested on a Xeon e5 2678 v3 @ 3.3 ghz on all cores
Edit (tested again on singlecore and multicore version):
Singlecore -> Time elapsed: 0.52s, 41.91s, 157.99s
Multicore -> Time elapsed: 0.88s, 26.36s, 96.32s
For me one of the advantages of the Pi is the energy efficiency … I run 10 Pis (40 cores) on BOINC and use around 40 Watts for all of them. Can't beat that with an i7 or Ryzen. I don't win any races but have a steady, energy-efficient computing "cluster".
Yes, this is also a good point you've made. Pi's are actually really energy efficient for the power they've got!
This shows how group projects can be done way faster with the whole group working. I'm talking about you Dave, shut up and do your work.
Hardkernel MC-1 Cluster: 4x Arm Cortex A15 @ 2.0 GHz & 4x Arm Cortex A7 @ 1.4 GHz per node
4 nodes, total 16 threads binding to the large cores (A15)
10000: 0.12 s
100000: 15.04 s
200000: 61.49 s
500000: 431.29 s (7 m 11.29 s)
Cool, so you tested a single core application on multithreaded systems, then made it run parallel on 7 cores (basically). Yeah no, this is flawed.
First half of the video he did single core for all the systems, nothing flawed. All the results are clearly tagged.
Then he showed it running on multicore for the Pi's. Have a blessed day good sir.
Nice setup. To configure all the cluster nodes, I would suggest Ansible playbooks so you don't have to log in to all of your nodes separately.
Thanks for the suggestion, I had a quick search for a more efficient way of executing the shutdown command across multiple nodes, but came up empty. I'll have a look at Ansible Playbooks
@@MichaelKlements for i in 10 11 12 13 14 (etc); do ssh 192.168.1.$i sudo poweroff; done
@@MichaelKlements Yes, definitely suggest ansible for shutdown, install updates, change cfg files, etc etc - it's great and very easy to learn and get into + it can use SSH, so it doesn't need a dedicated agent installed on the nodes.
Well, that settles it. I'm building a cluster computer to handle computations. That's freaking sick!
NVIDIA makes a GPU microprocessor (Jetson Nano) with 120+ cores. I wonder how viable that would be in terms of GPU super cluster, lol.
Results on Pentium 2 T4400 (Dual core 2.2 Ghz):
Single process script:
n=10000, time=2.03 seconds
n=100000, time=221 seconds
Multi-process script:
n=10000, time=1.05 seconds
n=100000, time=92.26 seconds
Second script was optimized as mentioned in other comments.
In short, Pentium 2 T4400 is slower than Raspberry Pi 4B used in the video.
This was cool. If I understood it right, they communicate via Ethernet? So some better hardware connection should speed up even more.
Yes correct
There is an actual picluster board and compute modules that look a bit like PCIe cards.
Test Results on my Lenovo ThinkPad T480 at work for primes up to 10000: 1.24 seconds
Will test on my AMD Ryzen 2700X system when I get home....great video and amazing concept!
Update 1: 1.2 seconds on my 2017 MacBook Pro at work.
> Tests performance
> Uses python
Why not write a C/C++/Rust program?
Here is a version in C:
pastebin.com/fAjwP61f
On my box it runs in 1.79 seconds (54.03 seconds for Python3 version).
Or... use an algorithm that doesn't suck. Implementing a sieve of Eratosthenes on my machine gets 200k down to .03 seconds.
@@InfiniteLemurs I'm sorry but it's not the algorithm's fault; for example, a for loop in Python is going to take ages compared to one in any real programming language, or even scripting languages like PHP/JS.
Go try it yourself.
@@JustKatoh There is significant overhead in Python (though JS is arguably just as bad), but my argument is that inefficiency of the algorithm matters much more than inefficiency of the language. A good prime sieve is orders of magnitude faster than the one shown in this video, a factor of utmost importance here. Writing bogosort in cpp is still going to be worse than quicksort in Python; obviously things would be more performant in cpp/rust, but this isn't even beginning to approach optimization yet. The C code above gets absolutely stomped simply by using a better alg. Python isn't the best choice here, but the algorithm is the truly poor one.
@@InfiniteLemurs It does play a major role; look up Node.js performance compared to Python, it's just as fast as Java, a fully fledged programming language. People trying to defend Python's performance is just funny; Python is literally a prototyping language for writing hacky scripts quick and dirty, and that's the only thing it's good for.
I can't find one thing made in Python that works well, performs, and is preferred over other language alternatives.
I think you're misunderstanding what an algorithm is. Algorithms are usually written in pseudocode and adapted to languages, be they scripting or programming. Take for example Huffman's compression algorithm: sure, there is prewritten assembly and C code, but 99% of the time it's going to be presented in a pseudocode format. You can look at it however you want; it's not Huffman's fault Python is incapable of implementing it in a real-world scenario when even one of the most inefficient scripting languages (PHP; Node.js is so much faster) can still easily run it in a production environment reliably.
I really don't get what you're trying to get at with the "algorithm" argument when algorithms are simple stencils to use independent of the language at hand.
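Since the sieve keeps coming up in this thread, here's roughly what it looks like in Python (a sketch of the textbook Sieve of Eratosthenes; even interpreted, it's dramatically faster than per-number trial division because each composite gets crossed off once instead of being re-tested):

def count_primes(limit):
    # Sieve of Eratosthenes: cross off multiples of each prime
    is_prime = [True] * (limit + 1)
    is_prime[0] = is_prime[1] = False
    p = 2
    while p * p <= limit:
        if is_prime[p]:
            for multiple in range(p * p, limit + 1, p):
                is_prime[multiple] = False
        p += 1
    return sum(is_prime)

print(count_primes(200000))  # 17984 primes up to 200,000

That said, the video's point stands: the deliberately brute-force script is there to compare identical work across machines, not to find primes quickly.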
Excellent experiment, very simple. It really shows just how much of a difference it makes to utilize cluster computing.
Thanks for the great feedback!
@@MichaelKlements Absolutely! I'm a student learning about databases, cluster computing, and MapReduce right now. So it's very cool to see that just how effective this strategy is, even when equipped with Pi's rather than a full sized PC. I feel like my knowledge is rudimentary at the moment.
Are you aware there's a cluster hat available for the Pi4 that controls 4 Pi zeros at once? Might be a cool prospect!
That sounds awesome! I’m also still learning a lot about clusters as I go along (as you can probably tell), it’s been really interesting and fun.
Yes I have seen them. It would be quite interesting to try clustering a cluster of a Pi 4 and zeros, I should have a look at that. Thanks for the tip
On my galaxy tablet A, I got 18.3 seconds for 10k. When I tried 100k, the app would just say, "experiencing server issues, try again later" after a couple of minutes 😐
My AMD Ryzen 9 3900X (12 core) took for:
10000 - 0.39 seconds
100000 - 25.37 seconds
200000 - 92.46 seconds
I didn't do any modifications (No overclocking etc.)
I hope this has been helpful :)
The whole world is behind the Pi Pico now. When are we going to have videos on that?
I've got one on order, so hopefully soon!
This is so cool! Would have been very curious to see some analysis of the power consumption of the different computers, and what your cluster could do without the cooling running. Great Video!
Thanks for the suggestions, I'll have a look at that.
You should try transcoding videos on that stack.
I've run the multi-process version on my Ryzen 5950X, and I've also re-written the benchmark in C++. I'll only post the results for max_number 500000 here, since I find that's more relevant due to the various overheads mattering less:
Find all primes to: 500000
Primes found in all cases: 41538
Python Script (no changes): 257.7s
Python Script (chunks set to 100 to actually use more than one core): 24.67s
Python Script (chunks set to 100 AND only check divisions to candidate/2): 12.56s
C++ single-threaded (to candidate): 13.085s
C++ single-thread (to candidate/2): 6.584s
C++ multi-threaded (to candidate/2): 1.076s
It's fun to see how huge the differences can be, especially considering they scale massively with more and higher numbers to test. The Python script with chunks fixed was using >90% of the CPU reliably, but the C++ multi-threaded version was not - there are definitely improvements to make in my version there, but even as is, it's significantly faster.
Edit: An example of silly scaling - when computing up to 2000000, the same multi-threaded C++ code did indeed use the entire CPU and was done in 7.097s. The Python script with the optimizations (chunks at 100 and checking divisions to candidate/2) did the same 2000000 in 177.61s. A CUDA kernel version of the same prime checking function completes the 2000000 test in 0.309s - and that's a very unoptimized way to do it in CUDA.
*I ran it on my ryzen 7 (laptop):*
10000: *0.58s*
100000: *47.41s*
200000: *172.33s*
Ryzen9 4900HS in an ASUS G14 Zephyrus
10,000: 1.35 s
100,000: 16.66 s
200,000: 59.96 s
500,000: 342.95 s
I ran these over remote desktop and had just shut down the mining software beforehand, so the cores may have been hot and resources may still have been in use.
I have a Ryzen 9 3900XT: 10000: 0.32 sec, 100000: 25.57 sec, 200000: 98.33 sec
Thanks, I was too lazy to test it myself
Awesome build and great explanations, thank you and well done!
An iPad Air 2020 does it in:
To 10000: 0.28 seconds
To 100000: 22 seconds
To 200000: 82 seconds
Quite impressive for an iPad.
I’ve done a lot of benchmarks for distributed computing with the Pi’s, and although they’re great little boards, very often a used desktop will deliver more performance for less money. I’ll only add Pi’s to the cluster when replacing them with faster versions. To keep up in cluster workloads, Pi’s need to get cheaper, or faster at the same price.
Thanks for your thoughts on clustering. I agree, you can definitely get better performance out of some old desktops, but the Pi’s are still quite competitive in terms of power consumption and size.
Ryzen 5 5600X with PBO set to 5GHz, Win10, Python 3.7.8
10000 0.28sec
100000 22.75sec
200000 87.59sec
only 1 thread was in use
For the second run:
10000 0.28sec
100000 22.73sec
200000 85.21sec
Thanks for sharing your results
I know this is not fair, but when actually changing the code so it performs more efficiently, for example by only checking divisors up to the square root of the number, it gets sooooo much faster. My PC laptop was able to do your code for 100,000 in 56 seconds, but when using a new algorithm, plus numba for further optimization, it can do 100,000,000 in only 361 seconds. This was a great video, now I see how they managed to make a supercomputer out of PS3's... Keep it up.
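For reference, a sketch of the square-root cutoff described above, with numba's JIT applied (assumes numba is installed; this is illustrative, not the commenter's exact code):

import math
from numba import njit

@njit
def is_prime(n):
    if n < 2:
        return False
    # Any composite n has a divisor no larger than sqrt(n), so stop there
    for d in range(2, int(math.sqrt(n)) + 1):
        if n % d == 0:
            return False
    return True

print(sum(is_prime(n) for n in range(2, 100_000)))  # 9592 primes below 100,000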
Just sharing th-cam.com/video/D3h62rgewZM/w-d-xo.html since it's what can happen with more efficient code and apparently both channels did finding primes challenges almost at the same time. Though I think here it's even mentioned that it's not for efficiency but to compare same code on different setups.
@@tubeincompetence Hey, thanks for recommending that video to me. Interestingly, I have watched it (it is the YouTube algorithm, after all), and yes, I knew the purpose was just to see what can run the code faster, not to optimize, especially once it starts to get hardware- and language-bound. I was just shocked at how efficient it got with the optimizations (something I knew about but had no sense of scale for). So, in general, thanks for replying to my comment and trying to help me out.
@@PBlague It at least shows very well that you don't have to throw more hardware at every problem to solve them faster. :)
@@tubeincompetence That is super true... Sadly, not every task and calculation can be split across different threads, especially across different systems, and that's where being creative about efficiency comes in - sometimes even finding ways to split the task across threads when at first glance it might not seem like a good idea, or even possible.
When you were testing the Mac, I thought it was going to be one of the new ARM laptops
My pie cluster: I'm afraid I can't let you munchie tonight J! _Fridge Locks_
The frustrating thing here is how inefficient the alg for finding primes is!
Didn't try it on AMD or Intel CPUs but on an RTX 3090 using the Numba Compiler to run the python script on all cuda cores. My results were about 8 seconds for 500.000 digits of pi.
100.000 and 200.000 both below 1sec
Digits of pi? You mean prime numbers up to 500k? (That's what the script does.)
Regarding your results: 8 seconds? Sounds like a LOT of Python overhead on the CUDA cores. I don't have the 3090's CUDA core count to hand, but I'd have expected way faster results.
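For context, a rough sketch of what a numba CUDA version of the prime check might look like (assumes an NVIDIA GPU and numba's CUDA support; array names and launch configuration are illustrative, not the commenter's actual code):

import numpy as np
from numba import cuda

@cuda.jit
def prime_kernel(candidates, flags):
    i = cuda.grid(1)  # one GPU thread per candidate number
    if i < candidates.size:
        n = candidates[i]
        result = n >= 2
        d = 2
        while d * d <= n:
            if n % d == 0:
                result = False
                break
            d += 1
        flags[i] = 1 if result else 0

candidates = np.arange(2, 500_000, dtype=np.int64)
flags = np.zeros(candidates.size, dtype=np.uint8)
threads_per_block = 256
blocks = (candidates.size + threads_per_block - 1) // threads_per_block
# numba copies host arrays to and from the GPU around the kernel launch
prime_kernel[blocks, threads_per_block](candidates, flags)
print(int(flags.sum()))  # 41538 primes below 500,000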
This is a very poor comparison. It would have been great maybe in the 90's, but it's single-threaded. Even the Pis have more than one thread each. I actually have the Ryzen 5 2600X, so 11 threads are sitting at absolute idle while one luckless bastard of a core gets pegged working on the larger datasets.
Double + for the use of 'luckless bastard' - Made my coffee experience much better today.
It would definitely have been a more thorough comparison to get the script running across all cores of each device, but I haven't been able to get a multithreaded Python script to run stably on multiple platforms. I'll keep working on it.
The Pi cluster is also sitting with 24 unused cores during this test.
@@MichaelKlements That's a video I would like to see, as I have 3 older server systems needing something to test out. Should you get it working, I might even go out and buy more Pi 4's just to build and play with. Thanks Michael
That cluster from 2 years ago nearly keeps up with my 12th-gen i9 12900H 14-core monster (laptop).
10000 - 0.23 sec
100000 - 17.86 sec
200000 - 71.04 sec
Great video, thank you for sharing.
And so pretty!
Around 2:45 you said there's exponentially more factors to check when increasing the range. Isn't the growth here quadratic instead of exponential? That is: I think runtime should grow more like x^2 instead of to the power of x.
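For what it's worth, the commenter has a point: trial division up to each candidate does on the order of 2 + 3 + ... + N ≈ N²/2 divisions in total, so doubling the limit should roughly quadruple the runtime. The results posted in these comments (times going up by roughly 3.5-4x from 100,000 to 200,000) match quadratic growth, not exponential.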
This script runs on a single core, so this isn't the way to compare these...
Ryzen 5 3600X, 6 cores/12 threads, 4.0GHz
Multi-threaded times:
10,000: 0.27s
100,000: 18.92s
200,000: 68.07s
These tests were run using Ubuntu 20.04 with the 5.8.0-41-generic kernel using python3 3.8.5.
AMD Ryzen 9 5950X - Win10
RTX 3080
128GB 3600MHz RAM
2TB NVMe
10000 = 0.22
100000 = 18.03
200000 = 66.82
Thanks David, that's really impressive for a single core!
Posted on the homepage, but ran the multi-CPU edited code posted there on my Ryzen 5900X (Win10, plenty of wasted resources with plenty of extra stuff running, running from within PyCharm with Python 3.7.8);
In short - HyperThreading/SMT seems to give better results, and increasing the number of processes per core carries a penalty, presumably the time to spawn the processes (a guess), but in general this seems to be something that needs a median or an average over several runs.
By changing the multiplier on num_processes = mp.cpu_count(), this can be made a bit clearer -
(tests carried out with a single run each, so no averages)
Just going 1 (*1, 24 processes):
10 000: 0.29s
100 000: 16.01s
200 000: 59.66s
500 000: 344.81s
Default settings in the script (*4, 96 processes):
10 000: 0.48s
100 000: 16.38s
200 000: 59.59s
500 000: 349.91s
Tweaking the script, doing *24 (it gets unhappy at low limits; 576 processes):
10 000: 1.89s
100 000: 21.9s
200 000: 63.27s
500 000: 346.81s
And just for fun -
1 000 000: 1300.92s
Tried it out quickly on my 3900X with Linux and Python 3.8.5;
Default settings *4 (96 processes, oddly not spreading out well across cores..)
10 000: 0.25s
100 000: 16.52s
200 000: 59.16s
500 000: 340.22s
Going *1 (24 processes)
100 000: 16.55s
And because it's fun:
1 000 000: 1311.5s
In short, there's a limit to how you can scale this on a single core; increasing the number of processes per core carries a penalty, and delegating to several cores seems to be a bit of a bother.
Well, what would also be impressive is chunked encoding on 20 Pis: have ffmpeg splice a file into chunks of e.g. 10 seconds.
After that, each node renders a number of chunks, and you compare it energy-wise with a Ryzen 9 or so...
So for video encoding (software encoders get better visual quality than GPU-based ones), this might be a good option too.
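As a sketch of the splitting step suggested above (the ffmpeg segment muxer flags are real; the file names and 10-second segment length are illustrative):

import subprocess

# Split input.mp4 into ~10 second chunks without re-encoding (-c copy),
# producing chunk000.mp4, chunk001.mp4, ... for the nodes to encode.
subprocess.run([
    "ffmpeg", "-i", "input.mp4",
    "-map", "0", "-c", "copy",
    "-f", "segment", "-segment_time", "10",
    "chunk%03d.mp4",
], check=True)

Each node would then re-encode its share of chunks, and the results could be stitched back together with ffmpeg's concat demuxer.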
You could really use Ansible to simplify cluster-wide config (one script, run against all IPs in a list/range, reporting results back)
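Not Ansible itself, but the fan-out idea in plain Python might look something like this (the node IPs, user name, and command are placeholders; assumes key-based SSH auth to each node):

import subprocess

nodes = [f"192.168.1.{i}" for i in range(10, 18)]  # hypothetical cluster IPs

for ip in nodes:
    # Run the same command on every node and report the result back
    result = subprocess.run(
        ["ssh", f"pi@{ip}", "uptime"],
        capture_output=True, text=True, timeout=30,
    )
    print(ip, result.stdout.strip() or result.stderr.strip())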
Awesome hardware build. I’m on the hardware side of things and I think it’s pretty cool what you have made. I can’t speak for the software side, but it's still awesome to see. Thank you for sharing this program with us.
Thanks for the great feedback!
Super cool cluster! Have you heard of CSSH? Would recommend looking into it if not. Can SSH into all the Pis at once, and send commands to them all at the same time. Absolute lifesaver for clustered setups like this.
Thanks, I'll have a look at it!
Would there be a way to treat a PC with multiple cores similarly to the cluster where the task is divvied up to the different cores resulting in potentially faster computation by more fully utilising the power available? It seems great care was taken to fully utilise each Pi, but were the PC or Mac using more than a portion of their power?
Also the weight of the OS seems like it could be a hindrance. I know that's probably outside of the scope.
I see this has already been brought up and well discussed in another thread here. Don't mind me.
In this case, you benchmark using integer arithmetic, not floating point. I saw another video that benchmarked a Turing CM3 cluster, but there, too, the applications being run were not floating-point processes but tasks in Drupal and WordPress. How do your 3 systems compare on floating-point math benchmarks?
What are the practical applications running the Pi's in such a way for someone such as me (not a computer science major or programmer, etc)? Would you be able to run an OS capable of running Windows or Linux and associated apps or will this only run specific programs written for the Pi? Thanks and sorry for my lack of knowledge in this subject!
Pi's themselves are perfectly capable of running a range of Linux based apps and many people use them as standalone desktop computers, typically the Pi 4Bs. Clustering them is purely to improve processing power, mainly for simulations, rendering or bulk computing tasks.
A fellow South African?
Thanks for this, so cool to see the pi’s being used in what I think is their intended manner
Yes!
This has been a fun project to work on and I've learnt a lot along the way as well. I'm looking forward to building a few more clusters in future; this definitely won't be the last.
I've never owned a cluster of any sort. Does ffmpeg have a cluster render option? I know programs like Mandelbulber2 have CPU clustering features for fractal rendering... The host needs some RAM. We need a motherboard that will take 8 Pi 4 compute modules.
I like your cluster setup, and hope to get my own running soon enough.
I ran your scripts on my Opteron workstation. (circa 2009, except for the cpus)
2x 6328 cpus, 16 cores total @ 3.2ghz
SP:
Find all primes up to: 10000
Time elapsed: 0.91 seconds
Number of primes found 1230
Find all primes up to: 100000
Time elapsed: 57.64 seconds
Number of primes found 9593
Find all primes up to: 200000
Time elapsed: 216.64 seconds
Number of primes found 17985
MP:
Find all primes up to: 10000, using 16 processes. chunk size: 4
Time elapsed: 0.58 seconds
Number of primes found 1229
Find all primes up to: 100000, using 16 processes. chunk size: 48
Time elapsed: 4.42 seconds
Number of primes found 9592
Find all primes up to: 200000, using 16 processes. chunk size: 97
Time elapsed: 15.65 seconds
Number of primes found 17984
Find all primes up to: 500000, using 16 processes. chunk size: 244
Time elapsed: 87.72 seconds
Number of primes found 41538
I had to calculate chunk size to saturate the cpu better, without changing the brute-force prime number algorithm.
chunk_size = max_number // num_processes // 128
I modified the script to run in 6 processes with 5 threads in each process, which enabled my MacBook Pro to complete 200 000 in 25 seconds. Just FYI. The single process and thread version completes in 168 seconds.
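For anyone wanting to try something similar, here's a minimal process-pool sketch using the same brute-force check as the video (the process count and chunk size are illustrative, and because of the GIL it's the processes, not the threads, that buy the speedup in pure Python):

import multiprocessing as mp

def is_prime(n):
    if n < 2:
        return False
    for d in range(2, n):  # same brute-force check as the original script
        if n % d == 0:
            return False
    return True

if __name__ == "__main__":
    max_number = 200_000
    # Divide the candidate range across 6 worker processes
    with mp.Pool(processes=6) as pool:
        flags = pool.map(is_prime, range(2, max_number), chunksize=1000)
    print(sum(flags), "primes found")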
Here is what my Intel i7 1165G7 (4267MHz LPDDR4) got:
Single:
10,000: 0.27
100,000: 22.73
200,000: 87.38
Multi:
10,000: 0.53
100,000: 15.84
200,000: 62.88
Interesting sidenotes:
- The M1 chips are ~40% slower in the single-threaded test. Not quite sure where the gap comes from, but multiple people in the comments here reported the same. Maybe the Apple chip just isn't *that* good in everyday use when you can have an accelerator.
- The single-threaded results can take on most desktop CPUs here. The 5900X seems to be around 10% better. Not bad for a laptop i7!
- Intel's performance has not been stagnating as much as I thought; in the comments you can see clear generational improvements.
- Thanks for adding a multi-threaded script as well!
next video is crypto mining on a raspberry pi cluster lmao
I ran the script on my M1 Macbook air (fanless) using python IDLE 3.9.0.
Single process:
10,000: 0.51s
100,000: 36.96s
200,000: 138.68s
Multi process:
10,000: 1.26s
100,000: 27.5s
200,000: 101.2s
I think the only other thing that stands out to me is that my computer never got hotter than 35°C, barely warm to the touch.
Tested this on my Ryzen 3800X. 8 Cores, 16 threads, base clock 4GHz. Running Windows 10 Pro.
Single-core version:
10,000: 0.32s
100,000: 24.42s
200,000: 92.35s
Multi-core version (didn't appear to actually use all cores, according to Task Manager):
10,000: 0.58s
100,000: 16.37s
200,000: 61.36s
Sorry to tell you, but when you were testing the computers, on some of them you ran the script in the IDLE shell. The IDLE shell is very slow and unoptimized; you should rerun this without using IDLE.
Could you overclock more with the water cooling? If you rewrote your script to use threading and workers, does the Pi cluster still hold up?
Aight. Imma build a supercomputer of pi.
Cool video, I tested the script on my i7 8700. I got the following results:
10000: 0.37 seconds
100000: 36.29 seconds
200000: 123.79 seconds
I have no experience in Python so I ran the script right in PyCharm, which might not be efficient, but who knows lol.
For more comparison data, from a Ryzen 5 2600:
Single process:
- 10000: 0.47s
- 100000: 38.44s
- 200000: 144.72s
Multi process (48):
- 10000: 0.89s
- 100000: 22.61s
- 200000: 81.07s
I am surprised by the relatively low difference, but it may be perfectly normal.
Awesome video! I would 100% suggest something like ansible to manage your cluster though :D
Thanks Kyle. Yeah this is one of the most basic packages to get running on a cluster. I'm going to try Ansible and Kubernetes next.
Did you make use of multithreading on the computers?