Gary, thanks for the quick follow-up on the RP2350 FPU performance. Single precision is the most common use case for DSP applications. In this case, according to your data, the RP2350 provides almost 1 FLOP/MHz. This is a huge deal and makes DSP applications like audio, digital communications, control, etc. very viable on the RP2350. The FPU is a huge feature upgrade and, for the price, the RP2350 is a bargain. It will enable many exciting signal processing projects. Thanks again for running the analysis.
The Cortex-M33 cores have the standard ARM single-precision FPU. The RP2350 adds a double-precision coprocessor (DCP), inaccessible from the RISC-V cores.
Do you know if the DCP offers more parallelism than using the FPU would? (i.e. can a double-precision instruction happen while the CPU is busy doing something else, in a way that a single-precision instruction cannot?)
@@colinmcconnell827 Short answer: No. Long answer: A double-precision calculation requires a series of DCP instructions, and each instruction takes one cycle, so the ARM core does not have to wait between them.
@@DQSoft Thanks. I have looked at the RP2350 Datasheet, and I suspect there is probably enough information in there to answer my question, if I can manage to interpret it all correctly!
How many cycles does a single precision multiply take in a 2350?
The jump in performance for single-precision calculations is insane! Very excited to get my hands on this processor soon...
I'm surprised it isn't greater. The 2040 floating point is all software emulated whereas the 2350 is hardware implementation. I'd expect at least a 100x speedup.
Sounds pretty tasty. Wonder how it compares to something like, say, esp32-s3. Purely in raw power, just a curiosity of mine.
And nrf52840
About the same as s3 but you can overclock it quite easily to like 200mhz officially
Ah Whetstone... KDF9 was one of the first Algol compilers I used over 50 years ago, I feel almost nostalgic!
I'm sure that RISC-V floating point performance can be improved quite a lot, bringing it close to the performance of the RP2040, which is also a software implementation. Bringing hand-optimized assembler to RISC-V should help. According to the RP2350 datasheet, the Hazard3 implementation has the M extension (multiply and divide) and it has a fast multiplier.
Yes. I haven't looked at their implementation of FP on RISC-V, but I know they used a third-party optimized assembly library for the RP2040, which was further optimized for the RP2040, but this library only has ARM Cortex assembly. They may not have done anything for RISC-V (haven't checked), so possibly it's just the compiler's software FP emulation here.
Be interesting to see the impact of the hardware sha2 on throughput of tls or ssh, even though it's just the sha element and not hardware aes or gcm.
Thanks Gary. Another nice summary video of the new Pico. Frankly, I was surprised that the new machine with a hardware FPU was not that much faster in double precision (and even in single precision). I was expecting well over two orders of magnitude increase in floating point performance. That's the level of speed increase I remember from the 8087 days. Maybe I am misremembering.
The Cortex-M FPU is single-precision only. RP2350 adds a custom coprocessor to accelerate double-precision. The speedup for double-precision comes from that coprocessor, not from the standard FPU. The coprocessor itself is rather fast but there is some overhead in getting data in/out of it.
@@Wren6991 I think what the OP is saying is that single precision on the RP2350 should be at least 100x faster than on the RP2040, since the RP2040 floating point is implemented in pure software emulation. I agree with that, and find it strange.
@@arthurswanson3285 oh, I see now, thanks. The original 8086 was a 16-bitter if I recall correctly, so there is some extra cost there trying to do soft float on a machine like that. You would also have to look at the structure of the benchmark and see how much time is actually spent on floating point vs memory access and control flow.
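To put rough numbers on that last point, here is a minimal Python sketch of Amdahl's law. The fractions and the 100x figure below are hypothetical inputs chosen for illustration, not measurements from the Whetstone run in the video, and `overall_speedup` is just a name made up for this sketch.

```python
def overall_speedup(fp_fraction: float, fp_speedup: float) -> float:
    """Amdahl's law: overall speedup when only the floating point share
    of the runtime gets faster and everything else stays the same."""
    if not 0.0 <= fp_fraction <= 1.0:
        raise ValueError("fp_fraction must be between 0 and 1")
    if fp_speedup <= 0.0:
        raise ValueError("fp_speedup must be positive")
    # Time left over = untouched part + accelerated floating point part.
    remaining_time = (1.0 - fp_fraction) + fp_fraction / fp_speedup
    return 1.0 / remaining_time


if __name__ == "__main__":
    # Hypothetical shares of runtime spent in floating point code.
    for fraction in (0.5, 0.8, 0.9, 0.99):
        gain = overall_speedup(fraction, fp_speedup=100.0)
        print(f"FP share {fraction:.0%}: overall speedup {gain:.1f}x")
```

Under these assumptions, even if the floating point operations themselves became 100x faster, a benchmark that spends 90% of its time in them would only run about 9x faster overall, because loads, stores, and branches still cost the same.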
Would be interesting to see which floating point software implementation you are measuring. RP2040 has highly optimised soft float in ROM, whereas the RISC-V cores are using whatever junk the compiler provides. You can use the compiler soft float support on RP2040 too (there is a CMake flag) and the performance drops off quite a bit when you do.
I couldn't tell from the video which toolchain you are building against. There's a lot of variance in soft float performance depending on which architecture variant the soft float library was built for. If you are using the CORE-V toolchain (e.g. you are on Windows) then I believe you just get an RV32IMAC soft float library, which is missing all of the bit manipulation instructions. Bit manipulation helps out a lot with soft float performance.
I was using whatever the VS Code extension installs on Windows. I didn't know that the toolchains had different functionality depending on the host; that doesn't seem like a good idea 😬 I will retest using Linux as the host.
Very informative. Thank you!
thanks for the great video
I suspect that the main use for the FPU on the RP2350 is going to be TensorFlow. Cheap devices that can run ML models at the edge are going to become increasingly useful. The extra RAM will help with these workloads too.
Stop smoking weed, it is a microcontroller, not an NPU... Do you know of anyone doing machine learning on a Cortex-M33 from ST or NXP, for example? The floating point units in these devices are generally used for DSP or control.
@@JonitoFischer Maybe you should start smoking weed? Because the op said "run ML models" - which has nothing to do with machine learning (training). People run ML models on pico devices currently to do voice recognition.
Yes, there is a build of TensorFlow Lite that runs on the Pico and it was updated at the start of the year to use both cores. Voice commands have been one use, which is a nice low-power way of controlling home automation devices. The FPU would give a big boost to the speed of running those models.
@@the_hetman the main use? Doubtful, seems most of the buzz is in fact for DSP and audio atm, but perhaps that could grow as more open source code becomes available. There's plenty of ppl interested in ML atm, but most of that crowd wouldn't be able to do much without resources. Either way I'm excited for it, both are useful.
For microprocessors, single precision float is the norm. So in the conclusion I would say the 2350 is not 5x but 7.5x faster than the 2040. (125/16.7=7.5)
Thank you Gary, for reminding me that we are talking about a microcontroller here. I seem to confuse it with a "normal" CPU when I read FPU.
I never need to run fpu in mcu
@@ray-charc3131 I do.
Quite a difference in the M33 cores vs the older M0 cores. You said there was no 64 bit hardware in the M33 MCU. Yet double precision is 5x faster using the M33. Interesting. Are you going to be releasing the code that you used for your test?
The source code for the Whetstone test is in my GitHub repo. The SHA256 code is just the example code in the RP2350 documentation.
The 2040 has no FPU at all, so floating point is done purely in software.
@MechanicaMenace The RP2040 doesn't have an FPU, true, but it does have a special integer divider.
@@GaryExplains oh yeah, I know and that's a useful thing. But an integer divider won't help speed up FP enough to compete with an FPU. Even a 16-bit FPU would probably come out almost twice as fast at double precision as software on the M0 cores. And a 64-bit FPU would probably have been around 3 times faster than the M33 cores.
@MechanicaMenace Indeed. But the fact that the RP2040 comes out ahead of the RISC-V CPU in the RP2350 means that there is something extra going on inside the RP2040. My guess is that it is related to the hardware divider/multiplier, but that is really just a random guess. Unfortunately I don't have time to investigate more.
I'm awaiting someone making a video on why they might have chosen this weird ARM-or- RISCV approach. I would have rather expected the 2 ARM cores and 1 RISCV in parallel. That would have had a benefit, but now I guess most usecases will just stick to 2x ARM, no?
I have the Challenger board as well. The three-pin JST SH connector on the bottom of the board: is that the SWD port for the RP2350? Any idea, off the top of your head, if the SHA256 speed will help with HMAC? I just started using HS256 for JWT messages, so being able to do that on the Pico would be helpful, as I can put the key into the OTP memory and, by only booting signed firmware, never have to worry about someone extracting it.
I wonder if a future revision of the RISC-V core will have a way to use the FPU. Apparently using a coprocessor is not precluded, since the SHA256 hardware can be used.
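On the HMAC question: HMAC-SHA256 (the primitive behind HS256) is built from two SHA-256 passes over padded key material, so any speedup in the underlying hash carries over directly. Below is a minimal Python sketch of that construction using the standard `hashlib` module rather than the RP2350 hardware block. The function name `hmac_sha256` and the example key and message are made up for illustration, and real code should use a vetted library.

```python
import hashlib
import hmac

BLOCK_SIZE = 64  # SHA-256 processes input in 64-byte blocks


def hmac_sha256(key: bytes, message: bytes) -> bytes:
    """HMAC-SHA256 written out by hand to show the two SHA-256 passes."""
    # Keys longer than one block are hashed first, then zero-padded.
    if len(key) > BLOCK_SIZE:
        key = hashlib.sha256(key).digest()
    key = key.ljust(BLOCK_SIZE, b"\x00")

    inner_pad = bytes(b ^ 0x36 for b in key)
    outer_pad = bytes(b ^ 0x5C for b in key)

    # Pass 1: hash the inner-padded key followed by the message.
    inner_digest = hashlib.sha256(inner_pad + message).digest()
    # Pass 2: hash the outer-padded key followed by the inner digest.
    return hashlib.sha256(outer_pad + inner_digest).digest()


if __name__ == "__main__":
    key = b"example-key-not-a-real-secret"  # hypothetical key
    message = b"header.payload"  # hypothetical JWT signing input
    tag = hmac_sha256(key, message)
    # Cross-check against the standard library implementation.
    reference = hmac.new(key, message, hashlib.sha256).digest()
    print(tag.hex(), hmac.compare_digest(tag, reference))
```

For short JWT-sized messages the fixed cost of the extra blocks for the padded key and the outer pass dominates, so the benefit from faster hashing depends on how much of the total time is the hash itself versus moving data in and out.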
SHA-256 is just a memory-mapped peripheral, it's not a coprocessor. The Arm single-precision FPU is a standard Arm component inside the Cortex-M33, which can't be modified. Adding access to the DCP from the RISC-V cores would be totally doable though.
@@Wren6991 thanks for the answer, and for the nice work on the RP2350.
Great information, Gary. I'm guessing these benchmarks are single-threaded and only using a single core? If that is the case, the RP2350 with multi-threaded floating point operations would be an even more significant increase over the RP2040.
The RP2040 is also dual core, so I'd expect the perf ratio to remain the same
5:07 ohh, that is interesting indeed !
Very interesting.
2nd! Woop woop! Thanks for another great video!!!
Nice content
As requested a thumbs up from me.👍
thank you!
Really looking forward for a wireless variant of the RP2350. Sadly it'll probably take a year
You mean a wireless version of the Pico 2? There isn't technically a wireless version of the RP2040; the Pico W uses another chip for the wireless. It will be the same with the Pico 2 W, which should be out before the end of the year. There are other wireless boards already, like the Challenger+ RP2350 WiFi6/BLE5.
@@GaryExplains Hi - Is there Python support for the ESP32 - SPI (for WiFi) on the Challenger+RP2350 WiFi/BLE6? Tnx.
SPI-SPI with an rp2040-W for now.
I want to use Pi Pico 2 as a keyboard controller.
Wow! The 2350 is way faster.
why didn't they just allow us to use the 4 cores at the same time? it would be awesome, at least from a benchmark numbers point of view
Many suspect the cores aren't fully segregated, sharing some core functionality that prevents them being used simultaneously.
Didn't expect the Whetstone benchmark to make an appearance. Real blast from the past.
Expect the unexcepted! 😜
Bitcoin mining uses SHA256 btw..
SHA256 performance at low watts you say, hope there isn't a cryptocurrency that will eat stocks up by miners😇
It is still a microcontroller, not a number cruncher.
@@suki4410 yes but the hashes per watt, may make a cluster of these viable.
@@PaulGrayUK Yes, maybe.
But can it hack Nintendo Switch better?
no wifi no fun, is not better enough
I'm surprised that SHA is important enough to justify a co-processor.
It's used for secure boot, so hardware SHA-256 has a significant impact on boot times when secure boot is enabled. It's not a coprocessor, just a normal memory-mapped peripheral.
I approve instant boot time :-)
Are people really encountering CPU bottlenecks for the types of projects a microcontroller like a pico is used in?
Or is “Hey do you REALLY NEED that performance?” a stupid question?
I believe this will enable new types of signal processing and machine learning projects.
Try to run a full graphic HD UI and you'll see these MCUs, although powerful, are not capable. That's why STM32 included a vector graphics GPU in some of their new ones to take the load off, and now you can finally have a 60fps UI, which was hard to do with Blackpills @ 400MHz MCUs, even with FPUs. So yeah, we really need this performance. Say you're doing a digital compass for a sailboat and you have a nice round display, but your 600MHz FPU-laden whatever chip is maxed out at 99% and you're only getting 25fps and your compass looks like crap. Now it won't, thanks to MOAR power. th-cam.com/video/MqBqnPLM-wM/w-d-xo.htmlsi=cmEbC6sIV_J-4f0Q
I design and create custom RF hardware+software using the pico. More CPU = more bandwidth because physics. I use the rp2040 for some devices because it enables me to decrease cost/size/weight. No github or public availability because treason.
first pls