Raspberry Pi RP2350 - Testing its FPU and SHA256 Performance

  • Published: 16 Nov 2024
  • Science

Comments • 94

  • @ChromaticReflection
    @ChromaticReflection 2 months ago +21

    Gary, thanks for the quick follow-up on the RP2350 FPU performance. Single precision is the most common use case for DSP applications. In this case, according to your data, the RP2350 provides almost 1 FLOP/MHz. This is a huge deal and makes DSP applications like audio, digital communications, control, etc. very viable on the RP2350. The FPU is a huge feature upgrade and, for the price, the RP2350 is a bargain. It will enable many exciting signal processing projects. Thanks again for running the analysis.

  • @DQSoft
    @DQSoft 2 months ago +29

    The Cortex-M33 cores have the standard ARM single-precision FPU. The RP2350 adds a double-precision coprocessor (DCP), inaccessible from the RISC-V cores.

    • @colinmcconnell827
      @colinmcconnell827 2 months ago +5

      Do you know if the DCP offers more parallelism than using the FPU would? (i.e. can a double-precision instruction happen while the CPU is busy doing something else, in a way that a single-precision instruction cannot?)

    • @DQSoft
      @DQSoft 2 months ago +4

      @colinmcconnell827 Short answer: No. Long answer: A double-precision calculation requires a series of DCP instructions, and each instruction takes one cycle, so the ARM core does not have to wait between them.

    • @colinmcconnell827
      @colinmcconnell827 2 months ago +3

      @DQSoft Thanks. I have looked at the RP2350 Datasheet, and I suspect there is probably enough information in there to answer my question, if I can manage to interpret it all correctly!

    • @arthurswanson3285
      @arthurswanson3285 2 months ago

      How many cycles does a single precision multiply take in a 2350?

  • @relic985
    @relic985 2 months ago +3

    The jump in performance for single-precision calculations is insane! Very excited to get my hands on this processor soon...

    • @arthurswanson3285
      @arthurswanson3285 2 months ago

      I'm surprised it isn't greater. The 2040 floating point is all software emulated, whereas the 2350 has a hardware implementation. I'd expect at least a 100x speedup.

  • @Kolyasisan
    @Kolyasisan 2 months ago +19

    Sounds pretty tasty. Wonder how it compares to something like, say, esp32-s3. Purely in raw power, just a curiosity of mine.

    • @mahvaz-u7z
      @mahvaz-u7z 2 months ago +5

      And nrf52840

    • @matgaw123
      @matgaw123 2 months ago +1

      About the same as the S3, but you can overclock it quite easily to something like 200 MHz officially.

  • @lennartbenschop656
    @lennartbenschop656 2 months ago +2

    I'm sure that RISC-V floating point performance can be improved quite a lot, bringing it close to the performance of the RP2040, which is also a software implementation. Bringing hand-optimized assembler to RISC-V should help. According to the RP2350 data sheet, the Hazard3 implementation has the M extension (multiply and divide) and a fast multiplier.

    • @joseoncrack
      @joseoncrack 2 months ago

      Yes. I haven't looked at their implementation of FP on RISC-V, but I know they used a third-party optimized assembly library for the RP2040, which was further optimized for the RP2040, but this library only has ARM Cortex assembly. They may not have done anything for RISC-V (haven't checked), so possibly it's just the compiler's software FP emulation here.

  • @TheOwlman
    @TheOwlman 2 months ago

    Ah Whetstone... KDF9 was one of the first Algol compilers I used over 50 years ago, I feel almost nostalgic!

  • @Monk_Duck
    @Monk_Duck 2 months ago +4

    Be interesting to see the impact of the hardware sha2 on throughput of tls or ssh, even though it's just the sha element and not hardware aes or gcm.

  • @ksbs2036
    @ksbs2036 2 months ago +2

    Thanks Gary. Another nice summary video of the new Pico. Frankly, I was surprised that the new machine with a hardware FPU was not that much faster in double precision (and even in single precision). I was expecting well over two orders of magnitude increase in floating point performance. That's the level of speed increase I remember from the 8087 days. Maybe I am misremembering.

    • @Wren6991
      @Wren6991 2 months ago

      The Cortex-M FPU is single-precision only. RP2350 adds a custom coprocessor to accelerate double-precision. The speedup for double-precision comes from that coprocessor, not from the standard FPU. The coprocessor itself is rather fast but there is some overhead in getting data in/out of it.

    • @arthurswanson3285
      @arthurswanson3285 2 months ago

      @Wren6991 I think what the OP is saying is that single precision on the RP2350 should be at least 100x faster than on the RP2040, since the RP2040 floating point is implemented in pure software emulation. I agree with that, and find it strange.

    • @Wren6991
      @Wren6991 2 months ago

      @arthurswanson3285 Oh, I see now, thanks. The original 8086 was a 16-bitter if I recall correctly, so there is some extra cost there trying to do soft float on a machine like that. You would also have to look at the structure of the benchmark and see how much time is actually spent on floating point vs memory access and control flow.
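
The point about benchmark structure can be illustrated with Amdahl's law: if only part of the run time is floating point, a large speedup of that part yields a much smaller overall speedup. The sketch below is a minimal illustration; the fractions and the 100x figure are hypothetical and are not measurements from the video.

```python
def overall_speedup(fp_fraction: float, fp_speedup: float) -> float:
    """Amdahl's law: only the floating-point share of run time gets faster."""
    return 1.0 / ((1.0 - fp_fraction) + fp_fraction / fp_speedup)


# Hypothetical shares of run time spent in floating-point code.
# The remainder (memory access, control flow) is assumed unchanged.
for fp_fraction in (0.5, 0.8, 0.9, 0.99):
    result = overall_speedup(fp_fraction, 100.0)
    print(f"{fp_fraction:.0%} FP time, 100x faster FP -> {result:.1f}x overall")
```

Under these assumed numbers, the overall gain stays in the single digits unless floating point accounts for roughly 90% or more of the original run time.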

  • @Wren6991
    @Wren6991 2 months ago +2

    Would be interesting to see which floating point software implementation you are measuring. RP2040 has highly optimised soft float in ROM, whereas the RISC-V cores are using whatever junk the compiler provides. You can use the compiler soft float support on RP2040 too (there is a CMake flag) and the performance drops off quite a bit when you do.

    • @Wren6991
      @Wren6991 2 months ago

      I couldn't tell from the video which toolchain you are building against. There's a lot of variance in soft float performance depending on which architecture variant the soft float library was built for. If you are using the CORE-V toolchain (e.g. you are on Windows) then I believe you just get an RV32IMAC soft float library, which is missing all of the bit manipulation instructions. Bit manipulation helps out a lot with soft float performance.

    • @GaryExplains
      @GaryExplains 2 months ago +1

      I was using whatever the VS Code extension installs on Windows. I didn't know that the toolchains had different functionality depending on the host; that doesn't seem like a good idea 😬 I will retest using Linux as the host.
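
One plausible reason bit manipulation matters for soft float, offered here as an assumption rather than something stated in the thread, is result normalization: the mantissa has to be shifted until its top bit is set, which needs a count of leading zeros. Without a count-leading-zeros instruction this takes a loop or a multi-step search; with one it is a single step. The function names below are hypothetical and the code runs on a host PC purely for illustration.

```python
def clz32_loop(x: int) -> int:
    """Count leading zeros of a 32-bit value one bit at a time (no clz instruction)."""
    if x == 0:
        return 32
    n = 0
    while (x & 0x80000000) == 0:
        x = (x << 1) & 0xFFFFFFFF
        n += 1
    return n


def clz32_direct(x: int) -> int:
    """Same result in one step, standing in for a single hardware clz instruction."""
    return 32 - x.bit_length()


def normalize(mantissa: int, exponent: int) -> tuple[int, int]:
    """Shift a nonzero 32-bit mantissa so its top bit is set, adjusting the exponent."""
    shift = clz32_direct(mantissa)
    return (mantissa << shift) & 0xFFFFFFFF, exponent - shift


# Both counting methods agree; only the amount of work per call differs.
for value in (1, 0x00010000, 0x80000000, 0xFFFFFFFF):
    assert clz32_loop(value) == clz32_direct(value)
print(normalize(1, 0))
```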

  • @matpearson9711
    @matpearson9711 2 months ago +2

    Very informative. Thank you!

  • @jorgkorte7334
    @jorgkorte7334 2 months ago +2

    thanks for the great video

  • @the_hetman
    @the_hetman 2 months ago +4

    I suspect that the main use for the FPU on the RP2350 is going to be TensorFlow. Cheap devices that can run ML models at the edge are going to become increasingly useful. The extra RAM will help with these workloads too.

    • @JonitoFischer
      @JonitoFischer 2 months ago +1

      Stop smoking weed, it is a microcontroller, not an NPU... Do you know of anyone doing machine learning on a Cortex-M33 from ST or NXP, for example? The floating point units in these devices are generally used for DSP or control.

    • @23lkjdfjsdlfj
      @23lkjdfjsdlfj 2 months ago +1

      @JonitoFischer Maybe you should start smoking weed? Because the OP said "run ML models", which has nothing to do with machine learning (training). People currently run ML models on Pico devices to do voice recognition.

    • @the_hetman
      @the_hetman 2 months ago +2

      Yes, there is a build of TensorFlow Lite that runs on the Pico, and it was updated at the start of the year to use both cores. Voice commands have been one use, which is a nice low-power way of controlling home automation devices. The FPU would give a big boost to the speed of running those models.

    • @xxportalxx.
      @xxportalxx. 2 months ago

      @the_hetman The main use? Doubtful. It seems most of the buzz is in fact for DSP and audio at the moment, but perhaps that could grow as more open source code becomes available. There are plenty of people interested in ML right now, but most of that crowd wouldn't be able to do much without resources. Either way I'm excited for it; both are useful.

  • @var67
    @var67 2 months ago +2

    For microprocessors, single precision float is the norm. So in the conclusion I would say the 2350 is not 5x but 7.5x faster than the 2040. (125/16.7=7.5)
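
The ratio quoted in the comment above can be checked directly. This is a minimal sketch; the two scores are the figures given in the comment, and their units and exact test configuration are whatever the video reported.

```python
# Benchmark scores as quoted in the comment (higher is better).
rp2350_score = 125.0  # RP2350, single precision, per the comment
rp2040_score = 16.7   # RP2040, per the comment

speedup = rp2350_score / rp2040_score
print(f"RP2350 vs RP2040: about {speedup:.1f}x")
```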

  • @suki4410
    @suki4410 2 months ago +2

    Thank you Gary, for reminding me that we are dealing with a microcontroller here. I seem to confuse it with a "normal" CPU when I read FPU.

  • @sgodsellify
    @sgodsellify 2 months ago +5

    Quite a difference in the M33 cores vs the older M0 cores. You said there was no 64 bit hardware in the M33 MCU. Yet double precision is 5x faster using the M33. Interesting. Are you going to be releasing the code that you used for your test?

    • @GaryExplains
      @GaryExplains 2 months ago +4

      The source code for the Whetstone test is in my GitHub repo. The SHA256 code is just the example code in the RP2350 documentation.

    • @MechanicaMenace
      @MechanicaMenace 2 months ago +4

      The 2040 has no FPU at all, so floating point is done purely in software.

    • @GaryExplains
      @GaryExplains 2 months ago +3

      @MechanicaMenace The RP2040 doesn't have an FPU, true, but it does have a special integer divider.

    • @MechanicaMenace
      @MechanicaMenace 2 months ago +3

      @GaryExplains Oh yeah, I know, and that's a useful thing. But an integer divider won't speed up FP enough to compete with an FPU. Even a 16-bit FPU would probably come out almost twice as fast at double precision as software on the M0 cores. And a 64-bit FPU would probably have been around 3 times faster than the M33 cores.

    • @GaryExplains
      @GaryExplains 2 months ago +2

      @MechanicaMenace Indeed. But the fact that the RP2040 comes out ahead of the RISC-V CPU in the RP2350 means that there is something extra going on inside the RP2040. My guess is that it is related to the hardware divider/multiplier, but that is really just a random guess. Unfortunately I don't have time to investigate more.

  • @Dygear
    @Dygear 2 months ago +2

    I have the Challenger board as well. Is the three-pin JST SH connector on the bottom of the board the SWD port for the RP2350? Any idea, off the top of your head, whether the SHA256 speed will help with HMAC? I just started using HS256 for JWT messages, so being able to do that on the Pico would be helpful: I can put the key into the OTP memory and, by only booting signed firmware, never have to worry about someone extracting it.
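
For context on the HMAC question: HS256 is HMAC with SHA-256, and HMAC (RFC 2104) consists of two SHA-256 passes over padded key material, so all of its hashing is SHA-256. Whether a given firmware library actually routes that work through the RP2350's SHA-256 hardware is not confirmed by anything in this thread. The sketch below runs on a host PC, not on the Pico; the key and message are hypothetical placeholders.

```python
import hashlib
import hmac

BLOCK_SIZE = 64  # SHA-256 block size in bytes


def hmac_sha256(key: bytes, message: bytes) -> bytes:
    """HMAC-SHA256 per RFC 2104, written out to show the two SHA-256 passes."""
    if len(key) > BLOCK_SIZE:
        key = hashlib.sha256(key).digest()  # overlong keys are hashed first
    key = key.ljust(BLOCK_SIZE, b"\x00")    # then zero-padded to one block
    ipad = bytes(b ^ 0x36 for b in key)
    opad = bytes(b ^ 0x5C for b in key)
    inner = hashlib.sha256(ipad + message).digest()  # first SHA-256 pass
    return hashlib.sha256(opad + inner).digest()     # second SHA-256 pass


key = b"hypothetical-secret-key"   # placeholder, not a real key
signing_input = b"header.payload"  # placeholder for a JWT signing input
tag = hmac_sha256(key, signing_input)

# Cross-check against the standard library implementation.
assert hmac.compare_digest(tag, hmac.new(key, signing_input, hashlib.sha256).digest())
print(tag.hex())
```

Because the inner pass covers the whole message, any gain from faster SHA-256 would grow with message size; for short JWT payloads the fixed cost of the extra blocks is a larger share of the work.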

  • @MisterkeTube
    @MisterkeTube 2 months ago

    I'm waiting for someone to make a video on why they might have chosen this weird ARM-or-RISC-V approach. I would rather have expected the 2 ARM cores and 1 RISC-V core in parallel. That would have had a benefit, but now I guess most use cases will just stick to 2x ARM, no?

  • @slimhazard
    @slimhazard 2 months ago +2

    I wonder if a future revision of the RISC-V core will have a way to use the FPU. Apparently using a coprocessor is not precluded, since the SHA256 hardware can be used.

    • @Wren6991
      @Wren6991 2 months ago +1

      SHA-256 is just a memory-mapped peripheral, it's not a coprocessor. The Arm single-precision FPU is a standard Arm component inside the Cortex-M33, which can't be modified. Adding access to the DCP from the RISC-V cores would be totally doable though.

    • @slimhazard
      @slimhazard 2 months ago

      @Wren6991 Thanks for the answer, and for the nice work on the RP2350.

  • @jacquesmillard
    @jacquesmillard 2 months ago +5

    Great information Gary. I’m guessing these benchmarks are single threaded and only using a single core? If that is the case, the RP2350 with multi-threaded floating point operations would show an even more significant increase over the RP2040.

    • @sawyerbergeron3288
      @sawyerbergeron3288 2 months ago +1

      The RP2040 is also dual core, so I'd expect the perf ratio to remain the same

  • @autohmae
    @autohmae 2 months ago

    5:07 ohh, that is interesting indeed !

  • @marcusk7855
    @marcusk7855 2 months ago +1

    Very interesting.

  • @gavinskurrie
    @gavinskurrie 2 months ago +2

    2nd! Woop woop! Thanks for another great video!!!

  • @ragesmirk
    @ragesmirk 2 months ago +2

    Nice content

  • @chipcode5538
    @chipcode5538 2 months ago

    As requested a thumbs up from me.👍

  • @doa_form
    @doa_form 2 months ago +3

    Really looking forward to a wireless variant of the RP2350. Sadly it'll probably take a year.

    • @GaryExplains
      @GaryExplains 2 months ago +7

      You mean a wireless version of the Pico 2? There isn't technically a wireless version of the RP2040; the Pico W uses another chip for the wireless. It will be the same with the Pico 2 W, which should be out before the end of the year. There are other wireless boards already, like the Challenger+ RP2350 WiFi6/BLE5.

    • @johnwilson3918
      @johnwilson3918 2 months ago

      @GaryExplains Hi - Is there Python support for the ESP32 - SPI (for WiFi) on the Challenger+RP2350 WiFi/BLE6? Tnx.

    • @23lkjdfjsdlfj
      @23lkjdfjsdlfj 2 months ago

      SPI-SPI with an rp2040-W for now.

  • @savousonee7225
    @savousonee7225 1 month ago

    I want to use Pi Pico 2 as a keyboard controller.

  • @bertblankenstein3738
    @bertblankenstein3738 2 months ago +2

    Wow! The 2350 is way faster.

  • @olhoTron
    @olhoTron 2 months ago

    why didn't they just allow us to use the 4 cores at the same time? it would be awesome, at least from a benchmark numbers point of view

    • @xxportalxx.
      @xxportalxx. 2 months ago +1

      Many suspect the cores aren't fully segregated, sharing some core functionality that prevents them being used simultaneously.

  • @NoToeLong
    @NoToeLong 2 months ago +2

    Didn't expect the Whetstone benchmark to make an appearance. Real blast from the past.

    • @GaryExplains
      @GaryExplains 2 months ago +6

      Expect the unexcepted! 😜

  • @nThanksForAllTheFish
    @nThanksForAllTheFish 2 months ago +1

    Bitcoin mining uses SHA256 btw..

  • @PaulGrayUK
    @PaulGrayUK 2 months ago +1

    SHA256 performance at low watts, you say. Hope there isn't a cryptocurrency that will have miners eating up the stock 😇

    • @suki4410
      @suki4410 2 months ago

      It is still a microcontroller, not a number cruncher.

    • @PaulGrayUK
      @PaulGrayUK 2 months ago +1

      @suki4410 Yes, but the hashes per watt may make a cluster of these viable.

    • @suki4410
      @suki4410 2 months ago +1

      @PaulGrayUK Yes, maybe.

  • @anonanon5146
    @anonanon5146 2 months ago

    But can it hack Nintendo Switch better?

  • @Luix
    @Luix 2 months ago

    no wifi no fun, is not better enough

  • @TomLeg
    @TomLeg 2 months ago +1

    I'm surprised that SHA is important enough to justify a co-processor.

    • @Wren6991
      @Wren6991 2 months ago +3

      It's used for secure boot, so hardware SHA-256 has a significant impact on boot times when secure boot is enabled. It's not a coprocessor, just a normal memory-mapped peripheral.

    • @TomLeg
      @TomLeg 2 months ago +1

      I approve instant boot time :-)

  • @johnsimon8457
    @johnsimon8457 2 months ago +1

    Are people really encountering CPU bottlenecks for the types of projects a microcontroller like a pico is used in?
    Or is “Hey do you REALLY NEED that performance?” a stupid question?

    • @DQSoft
      @DQSoft 2 months ago +4

      I believe this will enable new types of signal processing and machine learning projects.

    • @mikejones-vd3fg
      @mikejones-vd3fg 2 months ago +2

      Try to run a full-graphics HD UI and you'll see these MCUs, although powerful, are not capable. That's why STM32 included a vector graphics GPU in some of their new ones to take the load off, and now you can finally have a 60fps UI, which was hard to do with Blackpills @ 400MHz MCUs, even with FPUs. So yeah, we really need this performance. Say you're doing a digital compass for a sailboat and you have a nice round display, but your 600MHz FPU-laden whatever chip is maxed out at 99%, you're only getting 25fps, and your compass looks like crap. Now it won't, thanks to MOAR power. ruclips.net/video/MqBqnPLM-wM/видео.htmlsi=cmEbC6sIV_J-4f0Q

    • @23lkjdfjsdlfj
      @23lkjdfjsdlfj 2 months ago +2

      I design and create custom RF hardware+software using the pico. More CPU = more bandwidth because physics. I use the rp2040 for some devices because it enables me to decrease cost/size/weight. No github or public availability because treason.

  • @IAmSinister5
    @IAmSinister5 2 months ago

    first pls