Using dplyr's group_by function with and without summarize (CC233)

  • Published: 15 Jul 2024
  • Did you know you can use dplyr's group_by function without summarize? This is a powerful tool that we'll use to normalize monthly average temperatures to the monthly averages between 1951 and 1980. Along the way we'll see some cool date functions like today, year, and month from the lubridate package and we'll plot the data with geom_line using the ggplot2 package. We'll do everything within RStudio.
    You can find my blog post for this episode at www.riffomonas.org/code_club/....
    #ggplot2 #dplyr #R #Rstudio #Rstats
    Want more practice on the concepts covered in Code Club? You can sign up for my weekly newsletter at shop.riffomonas.org/youtube to get practice problems, tips, and insights.
    If you're interested in taking an upcoming 3 day R workshop be sure to check out our schedule at riffomonas.org/workshops/
    You can also find complete tutorials for learning R with the tidyverse using...
    Microbial ecology data: www.riffomonas.org/minimalR/
    General data: www.riffomonas.org/generalR/
    0:00 Introduction
    1:04 Repurposing a script to automate data retrieval and clean up
    4:10 Using group_by with summarize
    7:26 Normalizing temperatures to a range of years
    11:23 Using group_by with summarize for two variables
    15:02 Using group_by without summarize to normalize by month
    18:22 Plotting and interpreting results
    22:45 How does this year compare to 1951-1980?
  • Science
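
A minimal sketch of the idea in the description, using group_by without summarize. The tibble, its column names (`year`, `month`, `tavg`), and all values are invented for illustration and are not the data from the video. Grouping by month and then calling `mutate()` keeps every row and subtracts each month's 1951-1980 mean:

```r
library(dplyr)

# Hypothetical monthly averages; values are invented for illustration
temps <- tibble(
  year  = rep(c(1951, 1980, 2022), each = 2),
  month = rep(c(1, 7), times = 3),
  tavg  = c(-4, 22, -2, 24, 0, 26)
)

normalized <- temps %>%
  group_by(month) %>%
  # baseline is the mean over the reference years only, computed per month
  mutate(
    baseline = mean(tavg[year >= 1951 & year <= 1980]),
    anomaly  = tavg - baseline
  ) %>%
  ungroup()

normalized
```

Because `mutate()` is used instead of `summarize()`, the result still has one row per year and month, so it can go straight into a line plot.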

Comments • 33

  • @jean-claudegolovine5725 4 months ago

    Hello from Scotland. Many thanks for this excellent video!

  • @haraldurkarlsson1147 4 months ago

    Pat, I think I have pointed this out before, but the fpp3 package will do most of the things that you are doing with much simpler code. fpp3 was, after all, designed for time series. I still love your code gymnastics and have watched some of your videos multiple times - each time I learn something new.
    Thanks!

  • @szco9814 1 year ago +1

    Your video is a gem!

  • @PhilippusCesena 2 years ago +1

    Excellent job!

  • @timmytesla9655 1 year ago +1

    Wonderful video as usual. Thumbs up.

  • @djangoworldwide7925 2 years ago +1

    Beautiful analysis work.

    • @Riffomonas 2 years ago

      Thanks! I’m glad people are enjoying it

  • @dasrotrad 1 year ago +2

    Pat, you have tutorials for all levels, which is fabulous. You are so prolific that, unfortunately, I can't keep up with all you produce. You are amazing. Somehow I missed "riffomonas." Where does that word, "Riffomonas", come from?

    • @Riffomonas 1 year ago +2

      Hah! It comes from the idea of riffing in music, but riffing on other people's code. My hope is that people can see how I riff on my own code and do the same for their own purposes. The “omonas” is a common ending for bacteria.

  • @PeperazziTube 1 year ago +1

    Great stuff. You can sometimes make your life easier by using the %in% operator, e.g., normalized_range = year %in% 1951:1980 also gives you a TRUE/FALSE indicator with more concise code. The nice thing about the %in% operator is that it works on many data types (bools, integers, reals, chars) in both lists and vectors.

    • @Riffomonas 1 year ago

      Thanks! It’s all a matter of what I remember when I’m under the spotlight of recording 😂
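
The %in% suggestion above can be sketched in base R as follows. The `year` vector is made up for illustration; the two approaches agree here because every value is a whole number. A fractional value such as 1965.5 would pass the comparison test but not the %in% test.

```r
# Hypothetical vector of years, for illustration only
year <- c(1950, 1951, 1965, 1980, 1981)

# Two explicit comparisons joined with &
by_comparison <- year >= 1951 & year <= 1980

# Membership test against the integer sequence 1951:1980
by_in <- year %in% 1951:1980

identical(by_comparison, by_in)  # TRUE
```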

  • @caseyj1144 1 year ago +1

    I like to save versions of my data after/as I clean it as .RDS files so I can see what I did and reproduce it easily later.
    People asking about organization: Usually I group my projects with /background_info /in_data /out_data /code as separate directories. I don't think there's anything special about those directories except that the layout is organized enough and general enough to be consistent, so I can easily program and reuse paths across projects. I originally organized things this way when reading about reproducible research and data sharing in neuroscience and psychology, so you might want to see if a group has suggested something for your field that you can work within (if you want to share data). If it's a big project I also have a README with dependencies/version info and an RProj with a source.R that auto-opens and runs everything.
    Thanks Dr. Schloss! Learning a lot here :)

    • @Riffomonas 1 year ago +1

      Awesome! My only caution against Rds files is that they limit you to R and they aren’t text files. I prefer to work with csv/tsv files as much as possible.
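
To illustrate the trade-off discussed in this thread, here is a small base R sketch that writes the same data frame both ways. The data frame and file locations (temporary files) are assumptions made for the example:

```r
# Hypothetical cleaned data; names and values are invented
cleaned <- data.frame(year = c(1951L, 1980L), tavg = c(-3.5, -1.25))

rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")

# RDS: binary and R-specific, but it stores the R object as-is
saveRDS(cleaned, rds_path)
from_rds <- readRDS(rds_path)

# CSV: plain text that any tool can open; column types are re-inferred on read
write.csv(cleaned, csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path)

# The CSV can be inspected as ordinary text
readLines(csv_path)
```

For simple numeric columns like these, both round trips give back the same values. The difference shows up with types that plain text cannot carry, such as factors with a specific level order, which have to be rebuilt after reading a CSV.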

  • @haraldurkarlsson1147 1 year ago +1

    I did not see the same trend you see in your data - namely a gradual increase. The curve for my local station (near the southern tip of Lake Michigan) is essentially flat. What we may be looking at is the moderating effect of the lake. But I do see the cold October of 1925 (the relative deviation is -7.3), though I am missing measurements for 1917. Interesting stuff - also a good lesson in how to deal with NAs.
    Please bring more stuff like this with broad appeal and data that is easily and freely obtained.
    Thanks.

    • @Riffomonas 1 year ago

      Cool results and insights! 🤓

  • @bassamabdelnabi3117 1 year ago +1

    Totally awesome… man … you really explain things very well… you hit the spot … thanks so much please keep doing great work and help people

    • @Riffomonas 1 year ago +1

      Thanks for the encouragement 🤓. I'm glad people are finding this thread of videos helpful.

  • @haraldurkarlsson1147 4 months ago

    Pat,
    So when you replace the empty spaces with zero was that a form of imputation? Basically replacing missing values?

  • @shadyamigo 1 year ago +1

    Thank you for another great video. Quick question: what does the ‘group’ argument do in the ggplot aesthetics, given that you also have color set to year? Thank you.

    • @Riffomonas 1 year ago

      The group aesthetic here links all the data from the same year together. You could use color=year, but then every year would be a different color. Instead I used color=is_this_year to get the two-colored figure.

    • @shadyamigo 1 year ago +1

      @Riffomonas thank you

    • @shadyamigo 1 year ago +1

      I must have missed the last few minutes when I posted the earlier question. I meant that before you added the is_this_year column, you already had group = year and color = year. To rephrase my question: at that point in the tutorial, is the group parameter doing anything in addition to the color parameter, given that both are set to year?

    • @Riffomonas 1 year ago +1

      Right - in this case they do the same thing. I tend to use group for line plots even if it’s redundant with color just to be safe
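
A rough sketch of the final plot described in this thread, where `group = year` and `color = is_this_year` are both set. The data frame and its values are invented; only the aesthetic names come from the discussion above. With this combination the group aesthetic is not redundant: without it, ggplot2 would group lines by the two-level `is_this_year` column and join all earlier years into a single line.

```r
library(ggplot2)

# Hypothetical data: three years of monthly values, invented for illustration
temps <- data.frame(
  year  = rep(c(1951, 1980, 2022), each = 3),
  month = rep(1:3, times = 3),
  tavg  = c(-4, -2, 3, -3, -1, 4, -1, 1, 6)
)
temps$is_this_year <- temps$year == 2022

# group = year draws one line per year; color only distinguishes two levels
p <- ggplot(temps, aes(x = month, y = tavg, group = year, color = is_this_year)) +
  geom_line()

# layer_data() shows the computed grouping: one group id per year
unique(layer_data(p)$group)
```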

  • @r.hainez2131 2 years ago +2

    Could you please explain how you manage your .R files (workflow-wise)? And why is setwd() not your favorite?

    • @Riffomonas 2 years ago +1

      Using paths in R and why you shouldn't be using setwd (CC179)
      ruclips.net/video/StqDYjM6ULo/видео.html

    • @sven9r 2 years ago +1

      @R.Hainez look into the here package; it's very useful.

  • @szco9814 1 year ago +1

    Hello Boss! Could you please elaborate on why you drop the groups after you group by and summarise? It was confusing when you said that group by and summarize will remove the rightmost grouping. I did not see any change after you dropped the groups. The tibble size is 1558 x 3, which is exactly the same size as the tibble without dropping groups. Thank you, sir!

    • @Riffomonas 1 year ago +1

      Thanks for watching and for your question! It doesn’t change the size of the tibble, only the grouping or structure of the tibble. I remove the groupings because they can mess with downstream processes. If I did another mutate with the data still grouped, there could be unintended results.
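
The point in this reply can be sketched with a tiny made-up tibble. This assumes the groups are dropped with `.groups = "drop"` (an `ungroup()` call would have the same effect); the data and column names are invented. The two results have identical dimensions, but a later `mutate()` behaves differently on each:

```r
library(dplyr)

# Hypothetical daily readings; values are invented for illustration
daily <- tibble(
  year  = c(2021, 2021, 2021, 2022, 2022, 2022),
  month = c(1, 1, 2, 1, 2, 2),
  tavg  = c(-4, -2, 1, -1, 2, 4)
)

# Default: only the last grouping variable (month) is dropped
still_grouped <- daily %>%
  group_by(year, month) %>%
  summarize(tavg = mean(tavg))

# .groups = "drop" returns an ungrouped tibble with the same rows and columns
ungrouped <- daily %>%
  group_by(year, month) %>%
  summarize(tavg = mean(tavg), .groups = "drop")

group_vars(still_grouped)  # "year"
group_vars(ungrouped)      # character(0)

# The same mutate gives different answers depending on the grouping
with_year_mean    <- still_grouped %>% mutate(overall = mean(tavg))  # one mean per year
with_overall_mean <- ungrouped %>% mutate(overall = mean(tavg))      # one mean for all rows
```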

  • @dmalarekable 2 years ago +1

    I still can't wrap my head around why you normalize the temps to the period between the '50s and '80s. Shouldn't you normalize across all the years?

    • @Riffomonas 2 years ago +3

      Here’s a FAQ describing the idea of the temperature anomaly and why NASA does it this way… data.giss.nasa.gov/gistemp/faq/