Great video! I'm used to dplyr, so it's very interesting to see other approaches.
Thanks! Glad you enjoyed it 🤓
That was super helpful, thank you! I'll admit that joining was something I hadn't really gotten the hang of, and even though I've gone through the tutorials, I didn't really appreciate what was going on. Only rolling joins left to figure out and then I can say I've mastered data.table!
I would add, though, that animal_legs_dt[animal_sounds_dt[uniq_animals]] for a full join is... pretty ugly! Instead, data.table provides its own version of merge() that looks exactly like base R.
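For example, a minimal sketch of the two equivalent full joins, assuming the animal_legs_dt and animal_sounds_dt tables from the video are keyed on an animal column:

library(data.table)

animal_sounds_dt <- data.table(animal = c("cat", "duck"), sound = c("meow", "quack"), key = "animal")
animal_legs_dt   <- data.table(animal = c("cat", "cow"),  legs  = c(4, 4), key = "animal")
uniq_animals     <- unique(c(animal_sounds_dt$animal, animal_legs_dt$animal))

# full join with the bracketed syntax from the video
animal_legs_dt[animal_sounds_dt[uniq_animals]]

# the same rows with data.table's merge() method, which reads like base R
merge(animal_legs_dt, animal_sounds_dt, by = "animal", all = TRUE)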
Thanks - I hadn't seen data.table::merge. That would simplify things considerably
Have you tried the join from the collapse package? It is very fast in my tests. collapse::join(x, y, how="inner", on=c("a"="b"))
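For anyone who wants to try it, a self-contained version of that call with toy tables (the column names here are made up):

library(collapse)

legs   <- data.frame(a = c("cat", "duck", "cow"), legs = c(4, 2, 4))
sounds <- data.frame(b = c("cat", "duck"), sound = c("meow", "quack"))

# inner join matching legs$a to sounds$b
collapse::join(legs, sounds, how = "inner", on = c("a" = "b"))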
Thanks - I'll have to check that out
May I recommend using `bench::mark()` whenever you benchmark expressions?
Thanks - I've used it in other episodes, but I find that {microbenchmark} is easier to use for some applications
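For reference, the two interfaces look roughly like this for the kind of join comparison in the video (toy data, hypothetical object names):

library(bench)
library(microbenchmark)

x <- data.frame(animal = c("cat", "duck", "cow"), legs = c(4, 2, 4))
y <- data.frame(animal = c("cat", "duck"), sound = c("meow", "quack"))

# bench::mark() verifies the expressions return equal results unless check = FALSE
bench::mark(
  dplyr = dplyr::inner_join(x, y, by = "animal"),
  base  = merge(x, y, by = "animal"),
  check = FALSE
)

# microbenchmark() just times the expressions
microbenchmark::microbenchmark(
  dplyr = dplyr::inner_join(x, y, by = "animal"),
  base  = merge(x, y, by = "animal"),
  times = 100
)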
Just to note, using as.data.table() or setDT() will be considerably faster than data.table(). data.table also comes with its own version of merge() so you don't have to use the funky syntax for a full merge.
Thanks for the feedback - I'm finding that if I use as.data.table() or setDT(), I get similar results to plain data.table() and inner_join()
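For anyone following along, the three conversion routes being compared look roughly like this (toy data):

library(data.table)

df <- data.frame(animal = c("cat", "duck", "cow"), legs = c(4, 2, 4))

dt1 <- data.table(df)     # builds a new data.table from the data.frame (copies)
dt2 <- as.data.table(df)  # converts with a copy, usually cheaper than data.table()
setDT(df)                 # converts df in place, by reference, with no copy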
@@Riffomonas I made synthetic datasets since I don't have your fasta data. In my test, dtA was the fastest, followed by using setDT and as.data.table. data.table() was similar to dplyr with inner_join. Here is my code:
each_num
@@Riffomonas Apologies if this is showing up for a second time, but I replied earlier and now it seems to be gone.
I made some synthetic data because I don't have your fasta data. I consistently find that dtA is fastest, followed by setDT() and as.data.table(). dplyr and data.table() are comparable. Here is my code:
each_num
@@Riffomonas I've tried replying a few times, but YouTube seems to be auto-removing the comment. Maybe something to do with the code snippet... Anyway, I made large synthetic datasets because I don't have the fasta data, and ran everything again. dtA is consistently fastest, followed by setDT and as.data.table(). Here is my code below.
each_num
@@Riffomonas I've tried replying numerous times, but my comment gets removed each time. I think it doesn't like the code snippet I'm trying to share... Anyway, I made synthetic datasets ~50,000 rows long, where each row is a unique group so that it is comparable to your fasta data. dtA is consistently fastest, followed by setDT and as.data.table. One thing I had to control for was using a copy of the dataframe for setDT (e.g., df_copy
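Since the actual snippet keeps getting removed, here is a generic illustration of the copy()-before-setDT() pattern being described (not the commenter's original code; the object names are made up):

library(data.table)
library(microbenchmark)

df <- data.frame(id = 1:50000, value = rnorm(50000))

# setDT() converts in place, so benchmarking it on the same object would only
# measure a real conversion the first time; converting a fresh copy on each
# iteration keeps the timings honest
microbenchmark::microbenchmark(
  setDT = { df_copy <- data.table::copy(df); data.table::setDT(df_copy) },
  as_dt = as.data.table(df),
  times = 100
)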
A few thoughts come to mind.
1) dplyr always outputs tibbles. If you're going to use dplyr, it might be worth using tibbles throughout your package. The consistent formatting is worth the loss in performance, and tibbles are just better.
2) dplyr allows for multiple backends (dtplyr, dbplyr, duckplyr, arrow, etc). Would those affect your code? If I call duckplyr::methods_overwrite(), and a package has a custom function that calls dplyr::inner_join() under the hood, would it now call duckplyr::inner_join() under the hood instead?
3) Similarly, if I pipe a dataframe into dtplyr::lazy_dt() and then into a custom join function that calls dplyr::inner_join() under the hood, would it work and use the data.table method? Or would dtplyr just not know how to translate the code? (A sketch of this scenario is below.)
I know that you're not planning to write a custom join function, but your video still sparked these curiosities in me. Lately I've been looking at these dplyr backends as a way to scale up our work for big data projects without making my team have to learn new syntax, so they've been on my mind a lot. Great video as always!
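One way to probe question 3 empirically might be a small experiment along these lines (the helper function and toy data are hypothetical, and the answer depends on how dtplyr's methods dispatch):

library(dplyr)
library(dtplyr)

# hypothetical package-style helper that calls dplyr::inner_join() internally
join_animals <- function(x, y) {
  dplyr::inner_join(x, y, by = "animal")
}

legs   <- data.frame(animal = c("cat", "duck", "cow"), legs = c(4, 2, 4))
sounds <- data.frame(animal = c("cat", "duck"), sound = c("meow", "quack"))

# pipe a lazy_dt into the helper; inner_join() is generic, so if dtplyr supplies
# a method for lazy_dt objects the data.table translation should kick in, and
# the result stays lazy until as_tibble() collects it
legs |>
  lazy_dt() |>
  join_animals(lazy_dt(sounds)) |>
  as_tibble()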
Great - thanks for the feedback. For now, the input to the phylotypr functions will be data.frames, but they should work fine if people provide tibbles or data.tables. The output will be base R structures like lists and character strings. For example, inner_join() returns a data.frame when the first argument is a data.frame, even if the other table is a tibble:
> class(iris)
[1] "data.frame"
> x <- tibble::as_tibble(iris)
> class(x)
[1] "tbl_df" "tbl" "data.frame"
> iris |> dplyr::inner_join(x) |> class()
[1] "data.frame"
> iris |> dplyr::inner_join(x) |> tibble::as_tibble() |> class()
[1] "tbl_df" "tbl" "data.frame"