I just went through all my code and switched to the stringi version of it! stringr has str_squish(), which I love; it can be replaced with stri_replace_all_regex(x, "\\s+", " ") and stri_trim(), and it's still much faster. Thank god also for the wonderful GPT engines, which really help make this transition once you provide the stringr version of the code!!
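A minimal sketch of the equivalence described above (the example string is made up):

```r
library(stringr)
library(stringi)

x <- "  too   much\t\twhitespace  "

# stringr version
str_squish(x)
#> [1] "too much whitespace"

# stringi equivalent: collapse runs of whitespace, then trim both ends
stri_trim_both(stri_replace_all_regex(x, "\\s+", " "))
#> [1] "too much whitespace"
```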
Hey, I found that tidyr's replace_na() is actually better, and also that for this use case, stringr was even faster than stringi. Results for 100 evals, in microseconds:
tidyr median: 1617.70
stringr median: 7963.00
stringi median: 8015.65
base median: 1590.65
For tidyr, stringr, and stringi I used mutate(x, lang = {package}::{function}(lang, "en"))
For base I used x$lang[is.na(x$lang)] <- "en"
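A sketch of how that comparison could be set up, assuming a data frame x with a character column lang containing NAs (the data and column name are assumptions, not the commenter's actual setup):

```r
library(dplyr)
library(microbenchmark)

x <- tibble::tibble(lang = sample(c("fr", "de", NA), 1e4, replace = TRUE))

microbenchmark(
  tidyr   = mutate(x, lang = tidyr::replace_na(lang, "en")),
  stringr = mutate(x, lang = stringr::str_replace_na(lang, "en")),
  stringi = mutate(x, lang = stringi::stri_replace_na(lang, "en")),
  base    = {y <- x; y$lang[is.na(y$lang)] <- "en"},
  times   = 100
)
```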
Wow, I had no idea stringi was so much more performant. I actually like its syntax for some cases. Will definitely move to using it.
same here! thanks for watching 🤓
I learn a lot from each of your videos... thank you
my pleasure - thanks for watching!
I have followed a few of your videos and indeed I have learned a lot even though I do not know anything about DNA analysis!
phew! 🤓 thanks so much for watching
I think base R is slower because of issues with string encodings. I found, for example, that basename() and dirname() are much slower on Windows than on Mac, and the R developer who fixed this (for an upcoming R version) noted it had to do with encodings.
I use those functions frequently on data frame columns (e.g., FilePath) where most of the column is repeated, so I created a lightweight package, `deduped`, that speeds up running a function on a vector with lots of duplication (see the sketch below). It might help speed up your case too (though it looks like your inputs are all unique, so maybe not?).
Thanks for sharing your performance development journey!
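The core idea behind that kind of speedup is to apply the expensive function only to the unique values, then map the results back onto the full vector. A minimal sketch of the trick (the underlying idea only, not necessarily the deduped package's actual API):

```r
# run a vectorized function on unique values only, then expand back
dedup_apply <- function(f, x, ...) {
  ux <- unique(x)
  f(ux, ...)[match(x, ux)]
}

paths <- rep(c("/data/run1/sample.fastq", "/data/run2/sample.fastq"), 5e5)
system.time(basename(paths))               # runs on all 1e6 elements
system.time(dedup_apply(basename, paths))  # runs on only 2 unique values
```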
Interesting - thanks for tuning in!
Thanks for this series. It's been very helpful to watch the whole process. I would have been interested to see how much each change contributed to the overall performance boost of building the database. I'm guessing removing the for loop provided 90% of the decrease in time.
Also, have you looked at the targets package for pipeline management? I got started with Snakemake, but then found targets, which is incredibly feature-rich (e.g., easy parallelization and batching), made specifically for R, and has great documentation and a helpful, active developer. Might be worth a look for yourself or another series.
Thanks Ted! If I had to guess, it was the substr/substring step; I think this might be what you're referring to. I'll have to check out targets. I like Snakemake because I'm often using numerous non-R tools and I like to have a single system for doing everything.
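For reference, a minimal sketch of what a targets pipeline looks like; the file name and the clean_data()/summarize_data() helpers are hypothetical placeholders, not from the video:

```r
# _targets.R
library(targets)
tar_source()  # load helper functions from the R/ directory

list(
  tar_target(raw_file, "data/raw.tsv", format = "file"),  # track the input file
  tar_target(raw, read.delim(raw_file)),
  tar_target(cleaned, clean_data(raw)),         # hypothetical helper
  tar_target(summary, summarize_data(cleaned))  # hypothetical helper
)
```

Running targets::tar_make() then rebuilds only the targets whose upstream dependencies have changed.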
There is always an issue with the sound.
hrmmm, what's the issue?
@@Riffomonas I think we can hear the sound of your computer. It has happened in multiple episodes.
@@JordiRosell thanks. Is it my typing on the keyboard or a fan? Anyway, if you have a timestamp where it happens, I'd be happy to see what I can do.
@@Riffomonas The fan, from 0:00 on.