Raspberry Pi RP2350 - Testing its FPU and SHA256 Performance

  • Published on 17 Nov 2024

Comments • 94

  • @ChromaticReflection
    @ChromaticReflection 2 months ago +21

    Gary, thanks for the quick follow-up on the RP2350 FPU performance. Single precision is the most common use case for DSP applications. In this case, according to your data, the RP2350 almost provides 1 FLOP/MHz. This is a huge deal and makes DSP applications like audio, digital communications, control, etc. very viable on the RP2350. The FPU is a huge feature upgrade and, for the price, the RP2350 is a bargain. It will enable many exciting signal processing projects. Thanks again for running the analysis.

  • @DQSoft
    @DQSoft 2 months ago +29

    The Cortex-M33 cores have the standard ARM single-precision FPU. The RP2350 adds a double-precision coprocessor (DCP), inaccessible from the RISC-V cores.

    • @colinmcconnell827
      @colinmcconnell827 2 months ago +5

      Do you know if the DCP offers more parallelism than using the FPU would? (i.e. can a double-precision instruction happen while the CPU is busy doing something else, in a way that a single-precision instruction cannot?)

    • @DQSoft
      @DQSoft 2 months ago +4

      @colinmcconnell827 Short answer: No. Long answer: A double-precision calculation requires a series of DCP instructions, and each instruction takes one cycle, so the ARM core does not have to wait between them.

    • @colinmcconnell827
      @colinmcconnell827 2 months ago +3

      @DQSoft Thanks. I have looked at the RP2350 Datasheet, and I suspect there is probably enough information in there to answer my question, if I can manage to interpret it all correctly!

    • @arthurswanson3285
      @arthurswanson3285 2 months ago

      How many cycles does a single precision multiply take in a 2350?

  • @relic985
    @relic985 2 months ago +3

    The jump in performance for single-precision calculations is insane! Very excited to get my hands on this processor soon...

    • @arthurswanson3285
      @arthurswanson3285 2 months ago

      I'm surprised it isn't greater. The 2040 floating point is all software emulated, whereas the 2350 has a hardware implementation. I'd expect at least a 100x speedup.

  • @Kolyasisan
    @Kolyasisan 2 months ago +19

    Sounds pretty tasty. Wonder how it compares to something like, say, esp32-s3. Purely in raw power, just a curiosity of mine.

    • @mahvaz-u7z
      @mahvaz-u7z 2 months ago +5

      And nrf52840

    • @matgaw123
      @matgaw123 2 months ago +1

      About the same as the S3, but you can overclock it quite easily to around 200 MHz officially.

  • @TheOwlman
    @TheOwlman 2 months ago

    Ah Whetstone... KDF9 was one of the first Algol compilers I used over 50 years ago, I feel almost nostalgic!

  • @lennartbenschop656
    @lennartbenschop656 2 months ago +2

    I'm sure that RISC-V floating point performance can be improved quite a lot, bringing it close to the performance of the RP2040, which is also a software implementation. Bring hand-optimized assembler to RISC-V and that should help. According to the RP2350 data sheet, the Hazard3 implementation has the M extension (multiply and divide) and it has a fast multiplier.

    • @joseoncrack
      @joseoncrack 2 months ago

      Yes. I haven't looked at their implementation of FP on RISC-V, but I know they used a third-party optimized assembly library for the RP2040, which was further optimized for the RP2040, but this library only has ARM Cortex assembly. They may not have done anything for RISC-V (haven't checked), and so possibly it's just the software FP emulation of the compiler here.

  • @Monk_Duck
    @Monk_Duck 2 months ago +4

    Be interesting to see the impact of the hardware SHA-2 on throughput of TLS or SSH, even though it's just the SHA element and not hardware AES or GCM.

  • @ksbs2036
    @ksbs2036 2 months ago +2

    Thanks Gary. Another nice summary video of the new Pico. Frankly, I was surprised that the new machine with a hardware FPU was not that much faster in double precision (and even in single precision). I was expecting well over two orders of magnitude increase in floating point performance. That's the level of speed increase I remember from the 8087 days. Maybe I am misremembering.

    • @Wren6991
      @Wren6991 2 months ago

      The Cortex-M FPU is single-precision only. RP2350 adds a custom coprocessor to accelerate double-precision. The speedup for double-precision comes from that coprocessor, not from the standard FPU. The coprocessor itself is rather fast but there is some overhead in getting data in/out of it.

    • @arthurswanson3285
      @arthurswanson3285 2 months ago

      @Wren6991 I think what the OP is saying is that single precision on the RP2350 should be at least 100x faster than on the RP2040, since the RP2040 floating point is implemented in pure software emulation. I agree with that, and find it strange.

    • @Wren6991
      @Wren6991 2 months ago

      @arthurswanson3285 Oh, I see now, thanks. The original 8086 was a 16-bitter if I recall correctly, so there is some extra cost there trying to do soft float on a machine like that. You would also have to look at the structure of the benchmark and see how much time is actually spent on floating point vs memory access and control flow.
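
The point about benchmark structure can be illustrated with Amdahl's law: only the share of runtime spent in floating point gets faster, so the overall gain is capped by everything else. The sketch below uses hypothetical fractions and a hypothetical 100x floating-point speedup; none of these numbers are measurements from the video.

```python
def overall_speedup(fp_fraction: float, fp_speedup: float) -> float:
    """Amdahl's law: overall speedup when only the FP share of runtime is accelerated."""
    if not 0.0 <= fp_fraction <= 1.0:
        raise ValueError("fp_fraction must be between 0 and 1")
    if fp_speedup <= 0.0:
        raise ValueError("fp_speedup must be positive")
    # The non-FP share (memory access, control flow) runs at the old speed.
    return 1.0 / ((1.0 - fp_fraction) + fp_fraction / fp_speedup)


# Hypothetical workload mixes, assuming the FP operations themselves become 100x faster.
for fraction in (0.5, 0.8, 0.9, 0.99):
    print(f"{fraction:.0%} of time in FP -> {overall_speedup(fraction, 100.0):.1f}x overall")
```

Under these assumptions, even a benchmark that spends 90% of its time in floating point ends up well short of 100x overall, which would be consistent with the more modest gains discussed in this thread.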

  • @Wren6991
    @Wren6991 2 months ago +2

    Would be interesting to see which floating point software implementation you are measuring. RP2040 has highly optimised soft float in ROM, whereas the RISC-V cores are using whatever junk the compiler provides. You can use the compiler soft float support on RP2040 too (there is a CMake flag) and the performance drops off quite a bit when you do.

    • @Wren6991
      @Wren6991 2 months ago

      I couldn't tell from the video which toolchain you are building against. There's a lot of variance in soft float performance depending on which architecture variant the soft float library was built for. If you are using the CORE-V toolchain (e.g. you are on Windows) then I believe you just get an RV32IMAC soft float library, which is missing all of the bit manipulation instructions. Bit manipulation helps out a lot with soft float performance.

    • @GaryExplains
      @GaryExplains 2 months ago +1

      I was using whatever the VS Code extension installs on Windows. I didn't know that the toolchains had different functionality depending on the host, that doesn't seem like a good idea 😬 I will retest using Linux as the host.
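
One way to see why bit manipulation matters for soft float, as raised in this thread: after an add or subtract, the result mantissa has to be normalised, which means counting its leading zeros. The sketch below contrasts a bit-by-bit count with a single-step count. The function names are hypothetical, `int.bit_length` merely stands in for a hardware count-leading-zeros instruction, and whether a given soft float library actually uses such an instruction is an assumption.

```python
MASK32 = 0xFFFFFFFF


def clz32_loop(x: int) -> int:
    """Count leading zeros of a 32-bit value one bit at a time (no clz instruction available)."""
    x &= MASK32
    if x == 0:
        return 32
    count = 0
    while not (x & 0x80000000):
        x = (x << 1) & MASK32
        count += 1
    return count


def clz32_single_step(x: int) -> int:
    """Same result in one step; bit_length stands in for a hardware clz instruction."""
    return 32 - (x & MASK32).bit_length()


def normalise(mantissa: int) -> tuple[int, int]:
    """Shift a 32-bit mantissa left until bit 31 is set; return (mantissa, shift)."""
    shift = clz32_single_step(mantissa)
    return (mantissa << shift) & MASK32, shift


# Both approaches agree; the loop version just needs up to 31 iterations to get there.
for value in (1, 0x00012345, 0x40000000, 0x80000000):
    assert clz32_loop(value) == clz32_single_step(value)
```

A library built for a target without such instructions has to fall back to something like the loop (or a lookup table), which adds cost to every normalisation.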

  • @matpearson9711
    @matpearson9711 2 months ago +2

    Very informative. Thank you!

  • @jorgkorte7334
    @jorgkorte7334 2 months ago +2

    thanks for the great video

  • @the_hetman
    @the_hetman 2 months ago +4

    I suspect that the main use for the FPU on the RP2350 is going to be TensorFlow. Cheap devices that can run ML models at the edge are going to become increasingly useful. The extra RAM will help with these workloads too.

    • @JonitoFischer
      @JonitoFischer 2 months ago +1

      Stop smoking weed, it is a microcontroller, not an NPU... Do you know of anyone doing machine learning on a Cortex-M33 from ST or NXP, for example? The floating point unit in these devices is generally used for DSP or control.

    • @23lkjdfjsdlfj
      @23lkjdfjsdlfj 2 months ago +1

      @JonitoFischer Maybe you should start smoking weed? Because the OP said "run ML models", which has nothing to do with machine learning (training). People currently run ML models on Pico devices to do voice recognition.

    • @the_hetman
      @the_hetman 2 months ago +2

      Yes, there is a build of TensorFlow Lite that runs on the Pico, and it was updated at the start of the year to use both cores. Voice commands have been one use, which is a nice low-power way of controlling home automation devices. The FPU would give a big boost to the speed of running those models.

    • @xxportalxx.
      @xxportalxx. 2 months ago

      @the_hetman The main use? Doubtful. Most of the buzz seems in fact to be for DSP and audio atm, but perhaps that could grow as more open source code becomes available. There are plenty of ppl interested in ML atm, but most of that crowd wouldn't be able to do much without resources. Either way I'm excited for it, both are useful.

  • @var67
    @var67 2 months ago +2

    For microprocessors, single precision float is the norm. So in the conclusion I would say the 2350 is not 5x but 7.5x faster than the 2040. (125/16.7=7.5)
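
The arithmetic in the comment above can be checked with a short calculation. The two figures are the ones quoted in the comment (taken here to be single-precision benchmark scores for the RP2040 and RP2350); the variable names are only illustrative.

```python
# Figures quoted in the comment above; units are whatever the benchmark reported.
rp2040_single_precision = 16.7
rp2350_single_precision = 125.0

speedup = rp2350_single_precision / rp2040_single_precision
print(f"Single-precision speedup: {speedup:.2f}x")  # about 7.49x, i.e. roughly 7.5x
```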

  • @suki4410
    @suki4410 2 months ago +2

    Thank you Gary, for reminding me that we are dealing with a microcontroller here. I seem to confuse it with a "normal" CPU when I read FPU.

    • @ray-charc3131
      @ray-charc3131 2 months ago +2

      I never need to use an FPU in an MCU.

    • @arthurswanson3285
      @arthurswanson3285 2 months ago

      @ray-charc3131 I do.

  • @sgodsellify
    @sgodsellify 2 months ago +5

    Quite a difference in the M33 cores vs the older M0 cores. You said there was no 64 bit hardware in the M33 MCU. Yet double precision is 5x faster using the M33. Interesting. Are you going to be releasing the code that you used for your test?

    • @GaryExplains
      @GaryExplains 2 months ago +4

      The source code for the Whetstone test is in my GitHub repo. The SHA256 code is just the example code in the RP2350 documentation.

    • @MechanicaMenace
      @MechanicaMenace 2 months ago +4

      The 2040 has no FPU at all, so floating point is done purely in software.

    • @GaryExplains
      @GaryExplains 2 months ago +3

      @MechanicaMenace The RP2040 doesn't have an FPU, true, but it does have a special integer divider.

    • @MechanicaMenace
      @MechanicaMenace 2 months ago +3

      @GaryExplains Oh yeah, I know, and that's a useful thing. But an integer divider won't help speed up FP enough to compete with an FPU. Even a 16-bit FPU would probably come out almost twice as fast at double precision as software on the M0 cores. And a 64-bit FPU would have probably been around 3 times faster than the M33 cores.

    • @GaryExplains
      @GaryExplains 2 months ago +2

      @MechanicaMenace Indeed. But the fact that the RP2040 comes out ahead of the RISC-V CPU in the RP2350 means that there is something extra going on inside the RP2040. My guess is that it is related to the hardware divider/multiplier, but that is really just a random guess. Unfortunately I don't have time to investigate more.

  • @MisterkeTube
    @MisterkeTube 2 months ago

    I'm awaiting someone making a video on why they might have chosen this weird ARM-or-RISC-V approach. I would have rather expected the 2 ARM cores and 1 RISC-V core in parallel. That would have had a benefit, but now I guess most use cases will just stick to 2x ARM, no?

  • @Dygear
    @Dygear 2 months ago +2

    I have the Challenger board as well. Is the three-pin JST SH connector on the bottom of the board the SWD port for the RP2350? Any idea if the SHA256 speed will help with HMAC, off the top of your head? I just started using HS256 for JWT messages, so being able to do that on the Pico would be helpful, as I can put the key into the OTP memory and, with only signed firmware booting, never have to worry about someone extracting it.
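
On the HMAC question: HS256 is HMAC with SHA-256, and HMAC itself is two SHA-256 passes over padded key material plus the message, so faster hashing should in principle carry over. The sketch below shows that structure on a desktop Python using only the standard library. The key and message are made-up placeholders, and whether any particular Pico library routes HMAC through the hardware SHA-256 block is an assumption that would need checking.

```python
import hashlib
import hmac

BLOCK_SIZE = 64  # SHA-256 processes input in 64-byte blocks


def hmac_sha256(key: bytes, message: bytes) -> bytes:
    """HMAC-SHA256 written out by hand to show it is two SHA-256 passes."""
    if len(key) > BLOCK_SIZE:
        key = hashlib.sha256(key).digest()  # long keys are hashed down first
    key = key.ljust(BLOCK_SIZE, b"\x00")    # then zero-padded to one block
    inner_pad = bytes(b ^ 0x36 for b in key)
    outer_pad = bytes(b ^ 0x5C for b in key)
    inner = hashlib.sha256(inner_pad + message).digest()  # pass 1: covers the whole message
    return hashlib.sha256(outer_pad + inner).digest()     # pass 2: covers only the inner digest


# Placeholder values for illustration only.
demo_key = b"demo-key-not-a-real-secret"
demo_message = b"header.payload"

# The hand-written version matches the standard library implementation.
reference = hmac.new(demo_key, demo_message, hashlib.sha256).digest()
assert hmac.compare_digest(hmac_sha256(demo_key, demo_message), reference)
```

Since the first pass runs over the whole message, its cost scales with message length, so that is where a faster hash would matter most for larger tokens.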

  • @slimhazard
    @slimhazard 2 months ago +2

    I wonder if a future revision of the RISC-V core will have a way to use the FPU. Apparently using a coprocessor is not precluded, since the SHA256 hardware can be used.

    • @Wren6991
      @Wren6991 2 months ago +1

      SHA-256 is just a memory-mapped peripheral, it's not a coprocessor. The Arm single-precision FPU is a standard Arm component inside the Cortex-M33, which can't be modified. Adding access to the DCP from the RISC-V cores would be totally doable though.

    • @slimhazard
      @slimhazard 2 months ago

      @Wren6991 Thanks for the answer, and for the nice work on the RP2350.

  • @jacquesmillard
    @jacquesmillard 2 months ago +5

    Great information Gary. I'm guessing these benchmarks are single-threaded and only using a single core? If that is the case, the RP2350 with multi-threaded floating point operations would show an even more significant increase over the RP2040.

    • @sawyerbergeron3288
      @sawyerbergeron3288 2 months ago +1

      The RP2040 is also dual core, so I'd expect the perf ratio to remain the same

  • @autohmae
    @autohmae 2 months ago

    5:07 Ohh, that is interesting indeed!

  • @marcusk7855
    @marcusk7855 2 months ago +1

    Very interesting.

  • @gavinskurrie
    @gavinskurrie 2 months ago +2

    2nd! Woop woop! Thanks for another great video!!!

  • @ragesmirk
    @ragesmirk 2 months ago +2

    Nice content

  • @chipcode5538
    @chipcode5538 2 months ago

    As requested a thumbs up from me.👍

  • @doa_form
    @doa_form 2 months ago +3

    Really looking forward to a wireless variant of the RP2350. Sadly it'll probably take a year.

    • @GaryExplains
      @GaryExplains 2 months ago +7

      You mean a wireless version of the Pico 2? There isn't technically a wireless version of the RP2040, the Pico W uses another chip for the wireless. It will be the same with the Pico 2 W, which should be out before the end of the year. There are other wireless boards already, like the Challenger+ RP2350 WiFi6/BLE5.

    • @johnwilson3918
      @johnwilson3918 2 months ago

      @GaryExplains Hi - Is there Python support for the ESP32 - SPI (for WiFi) on the Challenger+RP2350 WiFi/BLE6? Tnx.

    • @23lkjdfjsdlfj
      @23lkjdfjsdlfj 2 months ago

      SPI-SPI with an rp2040-W for now.

  • @savousonee7225
    @savousonee7225 several months ago

    I want to use Pi Pico 2 as a keyboard controller.

  • @bertblankenstein3738
    @bertblankenstein3738 2 months ago +2

    Wow! The 2350 is way faster.

  • @olhoTron
    @olhoTron 2 months ago

    Why didn't they just allow us to use the 4 cores at the same time? It would be awesome, at least from a benchmark numbers point of view.

    • @xxportalxx.
      @xxportalxx. 2 months ago +1

      Many suspect the cores aren't fully segregated, sharing some core functionality that prevents them being used simultaneously.

  • @NoToeLong
    @NoToeLong 2 months ago +2

    Didn't expect the Whetstone benchmark to make an appearance. Real blast from the past.

    • @GaryExplains
      @GaryExplains 2 months ago +6

      Expect the unexpected! 😜

  • @nThanksForAllTheFish
    @nThanksForAllTheFish 2 months ago +1

    Bitcoin mining uses SHA256 btw..

  • @PaulGrayUK
    @PaulGrayUK 2 months ago +1

    SHA256 performance at low watts, you say. Hope there isn't a cryptocurrency that leads miners to eat up the stock 😇

    • @suki4410
      @suki4410 2 months ago

      It is still a microcontroller, not a number cruncher.

    • @PaulGrayUK
      @PaulGrayUK 2 months ago +1

      @suki4410 Yes, but the hashes per watt may make a cluster of these viable.

    • @suki4410
      @suki4410 2 months ago +1

      @PaulGrayUK Yes, maybe.

  • @anonanon5146
    @anonanon5146 2 months ago

    But can it hack Nintendo Switch better?

  • @Luix
    @Luix 2 months ago

    No WiFi, no fun. It's not enough of an improvement.

  • @TomLeg
    @TomLeg 2 months ago +1

    I'm surprised that SHA is important enough to justify a co-processor.

    • @Wren6991
      @Wren6991 2 months ago +3

      It's used for secure boot, so hardware SHA-256 has a significant impact on boot times when secure boot is enabled. It's not a coprocessor, just a normal memory-mapped peripheral.

    • @TomLeg
      @TomLeg 2 months ago +1

      I approve of instant boot time :-)

  • @johnsimon8457
    @johnsimon8457 2 months ago +1

    Are people really encountering CPU bottlenecks for the types of projects a microcontroller like a pico is used in?
    Or is “Hey do you REALLY NEED that performance?” a stupid question?

    • @DQSoft
      @DQSoft 2 months ago +4

      I believe this will enable new types of signal processing and machine learning projects.

    • @mikejones-vd3fg
      @mikejones-vd3fg 2 months ago +2

      Try to run a full-graphics HD UI and you'll see that these MCUs, although powerful, are not capable of it. That is why STM32 included a vector graphics GPU in some of their new parts to take the load off, so you can finally have a 60 fps UI, which was hard to do with Blackpills @ 400 MHz, even with FPUs. So yeah, we really need this performance. Say you're doing a digital compass for a sailboat and you have a nice round display, but your 600 MHz FPU-laden whatever chip is maxed out at 99% and you're only getting 25 fps, so your compass looks like crap. Now it won't, thanks to MOAR power. th-cam.com/video/MqBqnPLM-wM/w-d-xo.htmlsi=cmEbC6sIV_J-4f0Q

    • @23lkjdfjsdlfj
      @23lkjdfjsdlfj 2 months ago +2

      I design and create custom RF hardware+software using the pico. More CPU = more bandwidth because physics. I use the rp2040 for some devices because it enables me to decrease cost/size/weight. No github or public availability because treason.

  • @IAmSinister5
    @IAmSinister5 2 months ago

    first pls