Benchmarking string manipulations with base R, stringi, and stringr (CC282)

  • Published: Nov 2, 2024

Comments • 18

  • @YannC-p1q 5 months ago +1

    I just went through all my code and switched to the stringi versions! stringr has str_squish(), which I love; it can be replaced with stri_replace_all_regex(x, "\\s+", " ") followed by stri_trim(), and that is still much faster. Thank god also for the wonderful GPT engines, which really help with this transition once you provide the stringr version of the code!!
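A minimal sketch of the replacement described in this comment, assuming the stringi package is installed. The helper name `squish_stri` and the sample vector are made up for illustration; the two stringi calls are the ones the comment names (with `stri_trim_both()` as the explicit both-sides trim).

```r
library(stringi)

# Hypothetical helper: approximates stringr::str_squish() using stringi only.
squish_stri <- function(x) {
  # Collapse every run of whitespace into a single space...
  collapsed <- stri_replace_all_regex(x, "\\s+", " ")
  # ...then strip leading and trailing whitespace.
  stri_trim_both(collapsed)
}

# Illustrative input, not taken from the video
messy <- c("  base   R  ", "stringi\t\tand \n stringr", NA)
squished <- squish_stri(messy)
```

Missing values pass through as `NA`, which should match how `str_squish()` treats them.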

    • @YannC-p1q 5 months ago +1

      Hey, I found that tidyr's replace_na() is actually better, and also that for this use case stringr was even faster than stringi. Results for 100 evals, in microseconds:
      tidyr median: 1617.70
      stringr median: 7963.00
      stringi median: 8015.65
      base median: 1590.65
      For tidyr, stringr, and stringi I used mutate(x, lang = {package}::{function}(lang, "en"))
      For base I used x$lang[is.na(x$lang)]
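The four approaches compared in this reply can be sketched as follows. This is an illustration under assumptions: the data frame `x` below is a toy stand-in (the commenter's data is not shown), the package functions are my reading of `{package}::{function}`, and the base R assignment of `"en"` is inferred from the other three calls.

```r
# Toy stand-in for the commenter's data; the real `x` is not shown.
x <- data.frame(lang = c("fr", NA, "de", NA), stringsAsFactors = FALSE)

# base R: index the missing entries and assign the default value
base_res <- x$lang
base_res[is.na(base_res)] <- "en"

# The three package functions, called directly on the same column
tidyr_res   <- tidyr::replace_na(x$lang, "en")
stringr_res <- stringr::str_replace_na(x$lang, "en")
stringi_res <- stringi::stri_replace_na(x$lang, "en")
```

All four should return the same character vector, so any timing difference comes from overhead rather than from different results. Timings like those quoted could be collected with something like `microbenchmark::microbenchmark(..., times = 100)`, though the comment does not say which tool was used.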

  • @djangoworldwide7925 5 months ago +2

    Wow. Had no idea stringi is so much more performant. I actually like its syntax for some cases. Will definitely move to using it

    • @Riffomonas 5 months ago

      same here! thanks for watching 🤓

  • @belantaribrahim850 5 months ago +2

    I learn a lot from each of your videos...thank you

    • @Riffomonas 5 months ago

      my pleasure - thanks for watching!

  • @theproblembelief7549 5 months ago

    I have followed a few of your videos and indeed I have learned a lot even though I do not know anything about DNA analysis!

    • @Riffomonas 5 months ago

      phew! 🤓 thanks so much for watching

  • @orgadish 5 months ago

    I think base R is slower because of issues with string encodings. I found, for example, that basename/dirname are much slower on Windows than Mac, and the R dev who fixed this (for an upcoming R version) noted it had to do with encodings.
    I use those functions frequently on data frame columns (e.g., FilePath) where most of the column is repeated. So I created a lightweight package, `deduped`, that speeds up running a function on a vector with lots of duplication. It might help speed up your case, too (though it looks like your inputs are all unique, so maybe not?).
    Thanks for sharing your performance development journey!
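The idea behind the speed-up can be sketched in base R: run the function once per distinct value, then map the results back to the original positions. The helper `apply_deduped` below is hypothetical and is not the API of the `deduped` package; the file paths are invented for illustration.

```r
# Hypothetical helper, not the API of the `deduped` package.
apply_deduped <- function(x, f) {
  ux <- unique(x)        # distinct inputs only
  f(ux)[match(x, ux)]    # call f once per distinct value, then expand back
}

# Illustrative column with heavy duplication (2 distinct values, 2000 rows)
paths <- rep(c("/data/run1/a.csv", "/data/run2/b.csv"), times = 1000)
dirs <- apply_deduped(paths, dirname)
```

This only pays off when `f` is vectorized, gives the same output for the same input, and the vector repeats heavily; with all-unique inputs it just adds the cost of `unique()` and `match()`.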

    • @Riffomonas 5 months ago

      Interesting - thanks for tuning in!

  • @tedhermann3424 5 months ago

    Thanks for this series. It's been very helpful to watch the whole process. I would have been interested to see how much each change contributed to the overall performance boost of building the database. I'm guessing removing the for loop provided 90% of the decrease in time.
    Also, have you looked at the targets package for pipeline management? I got started with Snakemake, but then found targets, which is incredibly feature-rich (e.g., easy parallelization and batching), made specifically for R, and has great documentation and a helpful, active developer. Might be worth a look for yourself or another series.

    • @Riffomonas 5 months ago

      Thanks Ted! If I had to guess it was the substr/substring step. I think this might be what you're referring to. I'll have to check out targets. I like Snakemake because I'm often using numerous non-R tools and I like to have a single system for doing everything

  • @abdullahalmohamad244 5 months ago

    There is always an issue with the sound

    • @Riffomonas 5 months ago

      hrmmm, what's the issue?

    • @JordiRosell 5 months ago

      @Riffomonas I think we can hear the sound of your computer. It happened in multiple episodes.

    • @Riffomonas 5 months ago

      @JordiRosell Thanks. Is it my typing on the keyboard or a fan? Anyway, if you have a timestamp where it happens, I'd be happy to see what I can do.

    • @JordiRosell 5 months ago

      @Riffomonas The fan, from 0:00.