Scraping weather data from the internet with R and the tidyverse (CC231)

  • Published: Jul 15, 2024
  • R has powerful but simple tools that allow for easy scraping of the internet. In this episode, Pat will show you how to track down local weather data from the NOAA website and make it accessible within your R session in RStudio using tools from the tidyverse, including dplyr, lubridate, and more.
    You can find my blog post for this episode at www.riffomonas.org/code_club/....
    #tidyverse #R #Rstudio #reproducibility #Rstats
    Want more practice on the concepts covered in Code Club? You can sign up for my weekly newsletter at shop.riffomonas.org/youtube to get practice problems, tips, and insights.
    If you're interested in taking an upcoming 3-day R workshop, be sure to check out our schedule at riffomonas.org/workshops/
    You can also find complete tutorials for learning R with the tidyverse using...
    Microbial ecology data: www.riffomonas.org/minimalR/
    General data: www.riffomonas.org/generalR/
    0:00 Introduction
    1:04 Finding weather station data at NOAA
    8:06 Finding the closest weather station
    17:47 Get and tidy local weather station data
  • Science

Comments • 39

  • @sven9r 2 years ago +6

    For everybody having a hard time with parentheses like Pat has @13:00:
    Tools -> "Global Options" -> "Code" -> open the "Display" tab at the top, then tick "Rainbow parentheses".

    • @Riffomonas 2 years ago

      You don’t like my “see if we get an error message”? 😂

    • @sven9r 2 years ago +1

      Not at all! I'm loving it! But beginners often struggle with this stuff!
      Cheers

    • @Riffomonas 2 years ago

      @sven9r 🤣

    • @yaqinguo8971 1 year ago

      It's a good hint. But, interestingly, I did not have this option.

  • @davidmantilla1899 1 year ago +2

    Your tutorials are great. I have a purely wet bio background and your videos helped me kickstart my computational biology literacy. Thank you for openly sharing your knowledge.

    • @Riffomonas 1 year ago +1

      My pleasure! Thanks for watching David 🤓

  • @eric13hill 2 years ago +1

    This is my favorite video of yours. It is so useful for what I want to do. Thanks!

    • @Riffomonas 2 years ago

      That’s awesome to hear! What part do you find most useful?

  • @NdengoMarcel 1 year ago +1

    This tutorial is very interesting in practice. I managed to run the entire code using my local latitude and longitude, as you suggested. It did work. The variables I was interested in were TMAX and PRCP. In Rwanda we do not have SNOW. Thanks a lot.

    • @Riffomonas 1 year ago

      Wonderful! I'm glad to hear you got it working. Sorry that you all miss out on snow 😂

  • @sven9r 2 years ago +1

    Great episode as always! I just finished a course about German raster data with some students :) !

    • @Riffomonas 2 years ago

      Awesome! As always thanks for watching 🤓

  • @djangoworldwide7925 2 years ago +1

    Looks like a fun assignment: create a Shiny dashboard containing time series plots of this data.

    • @Riffomonas 2 years ago +1

      Yeah, I’ve thought about this, but I’d probably build all the plots in the backend using a cron job or something, then serve them up with minimal JavaScript. I don’t think the overhead of Shiny would really be necessary 🤷‍♂️

  • 2 years ago +1

    Excellent! There's one station in my city!

  • @zjardynliera-hood5609 1 year ago +2

    I love this. Use the rainbow parentheses, btw!!

    • @Riffomonas 1 year ago

      Hah! I try to stick close to the defaults so beginners don't get too freaked out when they see something that looks different from what's on their computer.

  • @haraldurkarlsson1147 2 years ago

    I must add that vroom read it in fast (lazy loading, I suspect), but I am not so sure about the column allocations. It seems to have created new ones with mixed-type data. So be aware.

    • @Riffomonas 2 years ago

      Sometimes the simpler packages are good enough.

  • @djangoworldwide7925 2 years ago +1

    "I might be wrong but mehh, I'm just gonna make this assumption."
    Science in a nutshell 😅
    Great tutorial, sir. I always enjoy your videos since I learn so much more than what I came for. (Could you elaborate on top_n? I couldn't quite grasp that one.)

    • @Riffomonas 2 years ago

      Thanks for the question! top_n returns the n rows (plus ties) that have the highest values of a particular variable. If you give it a negative number you’ll get the rows with the smallest values. There are also slice_min and slice_max, which are a bit similar.
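
A small sketch of the behaviour described in this reply, assuming dplyr is installed. The `stations` table and its `id` and `years` columns are made up for illustration and are not from the video. Current dplyr documentation marks top_n as superseded by slice_max and slice_min, so both spellings are shown.

```r
library(dplyr)

# Hypothetical data: four stations and the length of their records in years
stations <- tibble(
  id    = c("A", "B", "C", "D"),
  years = c(12, 45, 45, 3)
)

# Positive n: rows with the highest values of `years`; ties are kept,
# so asking for 1 row returns both B and C here
highest <- top_n(stations, 1, years)

# Negative n: rows with the smallest values of `years`
lowest <- top_n(stations, -1, years)

# The newer equivalents; with_ties = FALSE keeps exactly n rows
highest_slice <- slice_max(stations, years, n = 1)
lowest_slice  <- slice_min(stations, years, n = 1)
one_only      <- slice_max(stations, years, n = 1, with_ties = FALSE)

print(highest)
print(lowest)
```

Because of the tie, `highest` and `highest_slice` each hold two rows, while `one_only` holds a single row.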

  • @haraldurkarlsson1147 2 years ago +1

    Pat,
    I used vroom to read in the file and it read it fast and detected the columns. The only thing I had to do was to clean the column names.

    • @Riffomonas 2 years ago

      Great - I haven’t tried vroom yet

  • @haraldurkarlsson1147 2 years ago +1

    I had no trouble pulling up data for my best neighborhood station. However, my question is about the temperature: what is the unit? Kelvin?

    • @Riffomonas 2 years ago +1

      I think that was a question that flashed up in the last 5 min or so of the episode. I’ll definitely cover it in tomorrow’s episode.

  • @haraldurkarlsson1147 2 years ago +1

    This is great! I like how you build it up and have a specific goal in mind. This is also a problem any of us can tackle since the data is readily available.
    I typically write my own code for this sort of exercise (since I can at least understand my own code) - that is how I learn best. I came up with a slightly different way of finding my "closest" weather station. I wrote a couple of functions to do this, tested the distance on Houston-Chicago, and got pretty close. Here is how I tackled the problem.
    I set up two functions to run inside the tidyverse, so I used rlang (hence the enquo() and the bang-bang !!).
    The first function converts to radians:
    radians_func %
    distinct(station) %>%
    pull(station)
    My closest station was about 500 m from my current location but has only operated for a couple of years. The filter gave me another station about 4 km away with a more extensive record. I decided to filter for stations with a record of over 100 years (although it is not clear what kind of record that is).
    It seems like the search should be more focused, though. What are we after? Temperature, it seems. And it seems like that is the variable most often measured.
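
For readers who want to try the distance check described in this comment, here is a minimal base R sketch using the haversine great-circle formula. It is not the commenter's functions (those used rlang inside the tidyverse); the function names, the approximate city coordinates, the mean Earth radius of 6371 km, and the three stations are all assumptions made for illustration.

```r
# Convert degrees to radians
deg2rad <- function(deg) deg * pi / 180

# Haversine great-circle distance in km between two points given in degrees.
# radius_km = 6371 is an assumed mean Earth radius.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  phi1 <- deg2rad(lat1)
  phi2 <- deg2rad(lat2)
  dphi <- phi2 - phi1
  dlambda <- deg2rad(lon2 - lon1)
  a <- sin(dphi / 2)^2 + cos(phi1) * cos(phi2) * sin(dlambda / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Sanity check on Houston-Chicago (approximate city-centre coordinates)
houston_chicago_km <- haversine_km(29.76, -95.37, 41.88, -87.63)

# Hypothetical stations; the function is vectorised over the station columns
stations <- data.frame(
  id  = c("STN_A", "STN_B", "STN_C"),
  lat = c(41.90, 42.30, 40.10),
  lon = c(-87.60, -83.70, -88.20)
)
stations$dist_km <- haversine_km(41.88, -87.63, stations$lat, stations$lon)
closest <- stations[which.min(stations$dist_km), ]

print(round(houston_chicago_km))
print(closest)
```

The Houston-Chicago check should come out at roughly 1,500 km. To prefer stations with longer records, as the comment suggests, one could filter the station table on record length before taking the minimum distance.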

  • @lancesnodgrass8016 6 months ago

    I'm having issues finding the same website as shown at 1:45 and beyond. Any info on how the path has changed from a year ago?

    • @Riffomonas 3 months ago

      I just checked it and everything was working. Perhaps the site was down when you tried.

  • @ahmedmostafaahmedkamel8532 1 year ago

    Where is this script, please?

  • @kmbrahm 2 years ago

    TMAX looks very high. Is that from combining rows?

    • @kmbrahm 2 years ago +1

      Answered my own question: TMAX = maximum temperature (tenths of degrees C)

    • @Riffomonas 2 years ago +1

      Good sleuthing! I’ll fix this and the precipitation in the next video 🤓
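
A minimal base R sketch of the unit fix discussed in this thread, not the code from the follow-up video. The `raw` data frame and its values are invented. TMAX in tenths of degrees C comes from the comment above; treating PRCP as tenths of a millimetre is an assumption to confirm against NOAA's documentation for the data set.

```r
# Hypothetical raw values as they might appear in the downloaded file
raw <- data.frame(
  date = as.Date(c("2022-01-01", "2022-01-02")),
  tmax = c(-56, 122),  # tenths of degrees C
  prcp = c(0, 34)      # assumed to be tenths of mm
)

# Divide by 10 to get degrees C and mm
tidy <- transform(raw, tmax_c = tmax / 10, prcp_mm = prcp / 10)

print(tidy)
```

With this scaling, a raw TMAX of 122 reads as 12.2 degrees C, which explains why the unconverted values looked very high.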

  • @haraldurkarlsson1147 2 years ago

    Must be F with errant readings...

  • @haraldurkarlsson1147 2 years ago

    Pat,
    Webscraping has - at least in my mind - a different meaning than what you are doing here. It uses rvest etc. The title might be misleading for those looking for actual webscraping.

    • @Riffomonas 2 years ago

      🤷‍♂️ I’m getting data from a website. It’s a form of webscraping.