- Videos: 16
- Views: 220,265
Tom Henry - data science with R
Australia
Joined 28 May 2020
How to get started with R and RStudio to analyze data.
How to download R on Mac
Download R & RStudio:
* posit.co/download/rstudio-desktop/
Free online "R for Data Science" book:
* r4ds.hadley.nz
Timestamps:
0:00 Links
0:21 R vs. RStudio
0:42 R installation
2:13 RStudio installation
3:18 Book "R for Data Science"
3:50 Welcome to the R community!
Views: 6,549
Videos
't4ncr, or Tips & trips for naming columns in R
1.3K views, 2 years ago
🔔 *Subscribe for weekly R videos:* ruclips.net/channel/UCb5aI-GwJm3ZxlwtCsLu78Q In this episode of R basics, we're talking about how to name variables and columns in R, and how to rename them. There are 3 principles and several tactics for naming variables in R which can help us read code more easily and reduce bugs. Timestamps: 0:00 Why it matters 0:25 Set up 0:50 Interpretability (to humans) ...
Solve the NYT Spelling Bee with R - Regular Expressions
1.6K views, 2 years ago
🔔 *Subscribe for weekly R videos:* ruclips.net/channel/UCb5aI-GwJm3ZxlwtCsLu78Q Link to the New York Times Spelling Bee: www.nytimes.com/puzzles/spelling-bee 🎉 *Enjoyed this video?* Leave a comment below to share what you liked the most!
R Basics: How to Use filter() to Select Rows Based on Column Values
14K views, 2 years ago
🔔 *Subscribe for weekly R videos:* ruclips.net/channel/UCb5aI-GwJm3ZxlwtCsLu78Q In this episode of R basics, we're talking about how to use the filter() function to pick rows (observations) from data based on a condition - for example, "Pick all the rows where the penguin is over 2 feet high." We'll look at the easiest way to do this, as well as the hard way. 🎉 *Enjoyed this video?* Leave a com...
How to download R on Mac
58K views, 4 years ago
Welcome to the R community! In this episode we'll go through how to install R and RStudio on Mac for the very first time, step-by-step. R is the coding language for working with data (like Python), and RStudio is the best editor to write and run code written in R. We'll also look at the best book to learn R, called "R for Data Science" by Garrett Grolemund and Hadley Wickham. It's free and you ...
Tidyverse in R - tips & tricks
28K views, 4 years ago
🔔 *Subscribe for weekly R videos:* ruclips.net/channel/UCb5aI-GwJm3ZxlwtCsLu78Q Here are 18 ways to speed up data cleaning, tidying, and exploration with the tidyverse packages in R. They'll help you to work with data more efficiently, simplify your R code, and surprise your friends!! 🎉 *Enjoyed this video?* Leave a comment below to share what you liked the most! 0:00 Intro 1:04 Create new colu...
Plotting in R - the basics
11K views, 4 years ago
🔔 *Subscribe for weekly R videos:* ruclips.net/channel/UCb5aI-GwJm3ZxlwtCsLu78Q In this tutorial you'll learn how to build plots using R with ggplot and the tidyverse, including how to plot multiple lines on the same plot. We'll work step-by-step through the "Visualisation" chapter in R for Data Science (r4ds.had.co.nz/data-visualisation.html) written by Garrett Grolemund and Hadley Wickham whi...
Effortlessly Simplify Your R Code with These Tips and Tricks
7K views, 4 years ago
🔔 *Subscribe for weekly R videos:* ruclips.net/channel/UCb5aI-GwJm3ZxlwtCsLu78Q Here's how to simplify R code in RStudio and automatically clean up messy R code using a keyboard shortcut in RStudio. This video covers how to use RStudio's default method to clean up R code, then look at how to install the "styler" package, set up keyboard shortcuts, and also how to clean up code in both plain R f...
9 Must-Read Books for Machine Learning with R
4.3K views, 4 years ago
🔔 *Subscribe for weekly R videos:* ruclips.net/channel/UCb5aI-GwJm3ZxlwtCsLu78Q I'll take you through 9 free online books (and an article) that everyone learning machine learning should know to learn statistical techniques, understand how to master different concepts, and work with data in R. 🎉 *Enjoyed this video?* Leave a comment below to share what you liked the most! 0:00 Intro 0:24 Introdu...
How to Plot Counts in R: A Step-by-Step Guide
6K views, 4 years ago
🔔 *Subscribe for weekly R videos:* ruclips.net/channel/UCb5aI-GwJm3ZxlwtCsLu78Q In this video we look at how to count data in R and plot the results. We look at counts of volcano types, and then a more complex plot with counts of volcano types by region. 🎉 *Enjoyed this video?* Leave a comment below to share what you liked the most! Volcano data from the Tidy Tuesdays R project: github.com/rfor...
Plotting Time Series in R: A COVID-19 Example
17K views, 4 years ago
🔔 *Subscribe for weekly R videos:* ruclips.net/channel/UCb5aI-GwJm3ZxlwtCsLu78Q Here I walk through an example showing how I'd plot time-series data in R using the ggplot2 package. In this plot, we look at coronavirus COVID-19 new daily cases in the United States and Australia. 🎉 *Enjoyed this video?* Leave a comment below to share what you liked the most! 0:00 Intro 0:15 Download the data 1:00...
11 YouTube Channels Every Data Scientist Should Know
12K views, 4 years ago
🔔 *Subscribe for weekly R videos:* ruclips.net/channel/UCb5aI-GwJm3ZxlwtCsLu78Q I'll take you through 11 YouTube channels that every data scientist should know. These channels will help you to grow in your data science journey, learn productivity and study techniques, and gain exposure to new techniques to work with data in Python and R. 🎉 *Enjoyed this video?* Leave a comment below to share wh...
Text analysis / mining in R - how to plot word-graphs
30K views, 4 years ago
🔔 *Subscribe for weekly R videos:* ruclips.net/channel/UCb5aI-GwJm3ZxlwtCsLu78Q Here's an easy approach to start using R to generate insights from text data. I'll take you through the process of exploring themes in text data by visualizing the relationships between words, using a real dataset with user reviews of the Animal Crossing game. To do this, we'll use R, the tidytext package, and the g...
Discover 7 Hidden Gems in the R Package Ecosystem
13K views, 4 years ago
🔔 *Subscribe for weekly R videos:* ruclips.net/channel/UCb5aI-GwJm3ZxlwtCsLu78Q In this video I'll take you through 7 R packages that are really useful but less well known. They are game changers if you want to....... 0:00 Intro 1:06 Load csv data rapidly (vroom) 3:08 See what is happening when you make changes to data (tidylog) 4:48 Clean column names (janitor) 6:55 Work with dates (lubridate)...
Clean column names in R with clean_names()
5K views, 4 years ago
🔔 *Subscribe for weekly R videos:* ruclips.net/channel/UCb5aI-GwJm3ZxlwtCsLu78Q Have you ever loaded data into R only to find that the column names contain spaces, parentheses, camelCase, or are otherwise difficult to work with? In this video, I'll show you the clean_names() function from the janitor package, which I run whenever I load data from a file into R. It cleans the column names of the...
How to plot a time series in R with ggplot2 in 2020 (coronavirus example)
6K views, 4 years ago
How to plot a time series in R with ggplot2 in 2020 (coronavirus example)
Hi Tom You are really good at teaching and R More content like this will make you viral and this is a good one
Helpfully
How do I make it so styler does 4 indentation spaces instead of 2?
Open your Rprofile configuration file by running this command in R:

```
file.edit(file.path("~", ".Rprofile"))
```

Once the file is open, add the following line to set the indentation level to 4 spaces (e.g., for the tidyverse style):

```
options(styler.addins_style_transformer = "styler::tidyverse_style(indent_by = 4L)")
```

After saving the file, restart your R session. Keen to hear how it goes; it works for me on Mac, but I have not tested it on other systems!
I can't believe this video only has 545 likes and 13K views...this is awesome!
reminded to tidylog nice one
Thank you so much
Thank you! This was super clear.
Thanks, I work in a bank we migrated from SAS to R. This is so helpful.
This is really informative and helpful. But why is it that @21:06 in the demo dataset, the information on the x-axis (cut) automatically rearranges in alphabetical order instead of maintaining its original format in the table? This is the same issue I am having with my data (using geom_point). Any suggestions?
super helpful!
My output keeps saying "object __ not found"; Error in "select ()" I don't understand, I rewrote everything
This is so niche I’m so glad I found this video! Thank you lol
👏🏽👏🏽👏🏽
Hi Tom, thank you for this. for using tidylog I need to add tidylog() at the end of code chain? somehow if I don't add tidylog() at the end I don't see any transformation steps
Did you run this line near the top of your code?
library(tidylog)
You may also need to use these options where appropriate, but normally putting library(tidylog) at the top of your code is fine:
# turn logging-output on
options("tidylog.display" = NULL)
# turn logging-output off
options("tidylog.display" = list())
(more details on those here: rdrr.io/github/elbersb/tidylog/f/README.Rmd)
One possibility is that another package you are using is overriding tidylog, but that is unlikely.
@@tomhenry-datasciencewithr6047 Thank you, Tom, for the prompt reply. I'll give it a go. Any interesting packages, add-ins and/or tips using R? A video is due :) Appreciate the efforts, they really make a difference.
Very helpful. Thank you for this.
thank you
make_date renders str_c a waste of time in lubridate walk through
How does the code change if the columns have text instead of numbers? I want to select the rows that contain data for specific forest areas, which are marked with text in the data frame.
please how do i filter rows with character data types. For example, rows that have total in them from a dataset
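For the two questions above about text columns, here is a minimal sketch of how filter() can work with character data. The `forests` table and its column names are made up for illustration; swap in your own data frame and column.

```r
library(dplyr)

# Hypothetical data: forest areas identified by a text column
forests <- tibble(
  area = c("North Ridge", "South Valley", "Total", "East Total"),
  hectares = c(120, 340, 460, 80)
)

# Exact match on one text value
only_total <- forests %>% filter(area == "Total")

# Match any of several text values
selected_areas <- forests %>% filter(area %in% c("North Ridge", "South Valley"))

# Partial match: rows whose text contains "total", ignoring case
contains_total <- forests %>% filter(grepl("total", area, ignore.case = TRUE))
```

The comparison is the same `==` used for numbers, just with the value in quotes; `%in%` and `grepl()` cover the "one of several" and "contains" cases.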
Hi, how do you filter if you want, for example, rows where depth is BETWEEN 100-150?
To filter rows in R where depth is between 100 and 150, you can use:
library(tidyverse) # assuming you've loaded this earlier
filtered_data <- your_data %>% filter(depth >= 100 & depth <= 150) # if you want to include 150, too
filtered_data <- your_data %>% filter(depth >= 100 & depth < 150) # if you don't want to include 150
For more information on the filter function and other data manipulation tools in R, check out the book "R for Data Science" by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund, available online at r4ds.hadley.nz/data-transform#filter :)
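As a shorter alternative for the inclusive case, dplyr also has a between() helper. This is only a sketch on made-up data (`your_data` with `site` and `depth` columns is an assumption for illustration):

```r
library(dplyr)

# Hypothetical data: a few depth measurements
your_data <- tibble(
  site = c("a", "b", "c", "d"),
  depth = c(90, 100, 125, 150)
)

# between() is inclusive at both ends, i.e. depth >= 100 & depth <= 150
filtered_data <- your_data %>% filter(between(depth, 100, 150))
```

Because between() always includes both endpoints, the explicit `>=` and `<` comparisons are still the way to go if you want to leave 150 out.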
Thank you very much!
Thank you ☺
The links to download the files don't work. Do you have them in a cloud folder somewhere?
Awesome, extremely handy and Just in time.
Thanks, Tom. Very useful! Is there a package to help to code more efficiently in R like in VScode, like autocomplete, autofill.
This was SO HELPFUL for a project i'm doing for one of my classes. THANK YOU!
Glad it was helpful!
Just found this gem! Thank you so much for this! Very useful!
You're very welcome!
thank you for this Tom. Wonder how you would complete a timeseries data that misses a number of dates . e.g. 2022 data that doesn't have 1-20 Aug days
It depends - are you trying to do time series forecasting or do you just want the plot to show a particular way? I would recommend checking out tsibble.tidyverts.org/reference/fill_gaps.html .
@@tomhenry-datasciencewithr6047 thank you for this. I am trying to do both but starting with making a plot. I recall there is an command like complete() are you on linkedIn? we could connect
@@ahmed007Jaber For time series data, I would recommend tsibble.tidyverts.org/reference/fill_gaps.html
However, if using tsibble is not an option, you could create a table of dates, e.g.
dates_desired <- tibble(date = seq(lubridate::ymd("2022-01-01"), lubridate::ymd("2022-12-31"), by = "1 day"))
and join your dataset onto it:
data_with_all_dates <- dates_desired %>% left_join(your_data_with_missing_dates, by = "date")
PS - I'm not on LinkedIn unfortunately!
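The complete() function recalled in the question is in the tidyr package, and it can do the same gap-filling in one step. A minimal sketch, assuming a made-up table with `date` and `value` columns where 1-20 August is missing:

```r
library(dplyr)
library(tidyr)

# Hypothetical data with a gap: nothing recorded for 1-20 Aug 2022
your_data_with_missing_dates <- tibble(
  date = as.Date(c("2022-07-30", "2022-07-31", "2022-08-21")),
  value = c(10, 12, 9)
)

# Add one row per missing day between the first and last date;
# the other columns are filled with NA for the added rows
data_with_all_dates <- your_data_with_missing_dates %>%
  complete(date = seq(min(date), max(date), by = "day"))
```

The added rows have NA in `value`, so a line plot will show a break over the gap rather than drawing a straight line across it.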
@@tomhenry-datasciencewithr6047 thank you will check this out for sure. I have timeseries data and I am searching how to know do a shiny dashboard to interact with data LinkedIn community is good and I really have learnt a lot through connectionss and posts. what is the best way to reach you out? interested to grow my network and I like your content.
Starting to learn text mining this semester at school! Found this video really useful and interesting!
Glad to hear!!!!
Hey sir, I was wondering if you know how to remove only a specific percentage of rows where we have a specific value. For example, we have 10 rows with a value of 1 and we want to remove half of those rows.
What is your goal? Depending on what you are aiming to do, there may be more specific methods. However, a general approach works like this:
your_data %>% group_by(specific_value) %>% sample_frac(0.5) %>% ungroup()
@@tomhenry-datasciencewithr6047 Hey Tom, thank you for your answer. I tried it but it didn't work. Instead I used the following approach:
tmp_data <- data %>% filter(x == ma_valeur) %>% slice_sample(prop = 0.5)
data %>% anti_join(tmp_data)
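One thing to watch with anti_join() is that it matches on column values, so if the data contains fully duplicated rows it can drop more than half of them. A sketch of a variant that drops rows by position instead, using made-up data (`data`, `x`, `y` and the value 1 are assumptions for illustration):

```r
library(dplyr)

set.seed(42) # make the random choice reproducible

# Hypothetical data: 10 rows where x == 1 and 5 rows where x == 2
data <- tibble(x = c(rep(1, 10), rep(2, 5)), y = 1:15)

# Pick half of the row positions where x == 1
rows_to_drop <- data %>%
  mutate(row_id = row_number()) %>%
  filter(x == 1) %>%
  slice_sample(prop = 0.5) %>%
  pull(row_id)

# Keep every row whose position was not picked
result <- data %>% filter(!row_number() %in% rows_to_drop)
```

Rows with other values of `x` are never candidates for removal, which is the difference from sampling 50% within every group.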
Please keep on doing these kind of videos!
Excellent tutorial. I wish if you make more of these videos.
Thanks for your help. How would you filter with percentages? Like say if I want the top and bottom 1% of people and discard that data how would I be able to do that? Thank you very much
It depends on the exact scenario. Assuming you have one row per person (i.e., not multiple rows per person), you can do it like this:
your_data <- your_data %>% mutate(variable_of_interest.percentile = percent_rank(variable_of_interest))
your_data %>% filter(variable_of_interest.percentile <= 0.01) # lowest 1%
your_data %>% filter(variable_of_interest.percentile > 0.99) # top 1%
If you are interested only in excluding the top and bottom 1%, you could do
your_data %>% filter(variable_of_interest.percentile > 0.01 & variable_of_interest.percentile <= 0.99)
**However** - my guess is that you are trying to filter out outliers - is that right? If so, depending on your situation I would usually recommend two alternatives:
1. Retaining outliers but using a technique that is more resilient to outliers (e.g., quantile regression using the quantreg package - basically quantile regression optimizes for the median, whereas regular linear regression optimizes for the mean), e.g.
# install.packages("quantreg")
library(quantreg)
rq(y ~ height + age, data = ...)
(same kind of interface as linear regression)
2. Figuring out non-arbitrary absolute cut-offs - e.g. if you know that a customer spending more than a certain amount is unrealistic or a patient will never have a lab value less than a particular amount, then just filter out records where they exceed those values.
The reason is that in many datasets, the top 1% (say) may have some invalid records but also a lot of valid records, so it's best to be explicit about your inclusion/exclusion criteria rather than using a % criterion for exclusion, where possible.
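To see what the percentile filter does on actual numbers, here is a small worked sketch. The `people` table with 1,000 rows and a `spend` column is made up for illustration:

```r
library(dplyr)

# Hypothetical data: one row per person, 1,000 people
people <- tibble(person_id = 1:1000, spend = 1:1000)

# percent_rank() rescales ranks to the range 0 to 1
people <- people %>%
  mutate(spend_percentile = percent_rank(spend))

lowest <- people %>% filter(spend_percentile <= 0.01)
highest <- people %>% filter(spend_percentile > 0.99)

# Everything that is in neither tail
trimmed <- people %>%
  filter(spend_percentile > 0.01 & spend_percentile <= 0.99)
```

Writing the middle filter as the exact complement of the two tail filters means every row lands in exactly one of the three groups.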
@@tomhenry-datasciencewithr6047 Wow thank you so much for your help! Amazing!!! Love from the UK
@@tomhenry-datasciencewithr6047 You got it exactly right, I needed to get rid of outliers and I will take these better suggestions from you on board. Thank you so much
@@JC-wt5fw Hope it goes well! :)
Thanks for this video Tom. Do you know how to filter the data with multiple conditions. So if X=1 then filter y>2000 AND if X=2 then filter Y>3000 AND if X=3 then filter Y>4000. Thanks for your help.
You can do this like so:
your_data %>% filter((x == 1 & y > 2000) | (x == 2 & y > 3000) | (x == 3 & y > 4000))
(OR is the `|` symbol, and AND is the `&` symbol.) There are other ways of doing it which depend on your use-cases and which could be better, but this is the general method.
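One of those other ways, sketched here on made-up data, is to look up the threshold for each row with case_when() and then filter against it. The `threshold` column name and the sample values are assumptions for illustration:

```r
library(dplyr)

# Hypothetical data covering each value of x
your_data <- tibble(
  x = c(1, 1, 2, 2, 3, 3),
  y = c(1500, 2500, 2500, 3500, 3500, 4500)
)

filtered <- your_data %>%
  # Look up the cut-off that applies to each row
  mutate(threshold = case_when(
    x == 1 ~ 2000,
    x == 2 ~ 3000,
    x == 3 ~ 4000
  )) %>%
  # Rows with any other x get an NA threshold and are dropped by filter()
  filter(y > threshold)
```

This can be easier to read than a long chain of `|` conditions once there are more than a handful of cases.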
@@tomhenry-datasciencewithr6047 This is SO helpful. Thanks Tom. I look forward to watching more of your videos. They are super clear!
Thank you for inspiring video!
This video proves, how unstable and shyyt R is to be honest. Crashes, errors etc. this is what im facing every single day at work. Good video though!
im getting error saying "n_confirmed obj is not found" what to do?
I didn't realize this was a thing... "R"... Though I thought the whole video was great, at the end, it seems like there could have been some better kind of formulation which would offer a better insight to the reviews. Word-pairing is great, but they were all "out of context" and due to the "chaining", it leaves one to assume that there is actually a connection of 3 or more words, when there may not be. For instance, you saw "bugs" and "fishing"... Bug = a thing you fish with, or programming issue with fishing? (I assume the prior) I see "bombing, review, click"... That could have been "review bombing", or "bombing review"... Were there bombs in the game? Was it "... bombing. Review ..." or "... review. Bombing ..." I am sure that there was no reviews that had "bombing review click" or "click review bombing". I have an issue with the "tainted results". You threw away valuable "review words". I say tainted, or "corrupted", because you removed them, which now "pairs" possibly unrelated words. Also, periods... You don't constrain "pairings" to "sentences". You are getting cross-contamination of thoughts, creating pairings that truly don't exist. Then there is the matter of "word association" and "similarity" and "depluralizing" that should be done. I saw "player" and "players", textually the same content also a pairing of "reviews negative", but no "review negative" only "review bombing" and no "reviews bombing", also island and islands, were oddly isolated. Word association... "Nintendo switch", "Nintendo game", "Nintendo switch game", "Nintendo console game" That contaminates "switch", and "game" and "console" with other relevant pairings, having nothing to do with a "game console" branded "Nintendo", specifically the "Switch" model. (Also the removal of the games title from the review, which contaminates pairs related to "animal" and "crossing" and "horizon/s".) I noticed a lot of foreign words in there too. De, en, el, es, se... 
Perhaps a LOT was missed since those were quite commonly found, but they surely were not reviewing in English, and pairings of foreign words, even if translated, would not always be the same. The dialects orders are often different. Thus, the "word association" needed. Which identifies the subjects and relative words you threw away half of. "Good game", with "good" being one of those common words you surely had in the list, possibly found a hundred times. Good, like/d, love/d, enjoy/ed. Missing critical triplets and notable phrases too, I assume... "well worth the money", "not worth the money", you just saw "worth money" as a pairing. "waste of my time", "no time to waste, get it now", as "waste time" and "time waste". I guess my mind just works different. I feel that you were on the right track in the isolation of good/bad, but the pairing doesn't seem to be a good metric for anything other than "game content confirmation". By the reviews, the text suggests that it is a game that involves fishing, customization/crafting of things, multiplayer, animals, it works on Nintendo switch, there are islands in it. (Compared to the game developers description, it could "confirm" game content.) Perhaps a better metric for good and bad would be the isolation of words NOT found in both. Seeing "not fun" as a pairing in a horrible review is expected. However, if you see "good value" or "worth ... money", then its not so bad.
You are asking exactly the questions that would go into a more detailed analysis! There is a useful function called SnowballC::wordStem() which reduces words down to a common 'stem.' For example, it produces:
"then there is the matter of word associ and similar and deplur that should be done i saw player and player textual the same content also a pair of review negat but no review negat"
For a more rigorous look into this subject, see Julia Silge and David Robinson's excellent 'Text Mining with R: A Tidy Approach', which is free online: www.tidytextmining.com
Text analysis is always imperfect (and will remain so forever, I suspect), but it can yield good insights when applied to a large dataset, provided a human is in the loop!
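A minimal sketch of the stemming step mentioned in the reply above, using a few words picked from this thread as illustration (it assumes the SnowballC package is installed):

```r
# install.packages("SnowballC")
library(SnowballC)

# A few singular/plural pairs from the comment above
words <- c("player", "players", "review", "reviews", "negative")

# wordStem() takes a character vector and returns one stem per word
stems <- wordStem(words)

# Side-by-side view of each word and its stem
words_and_stems <- data.frame(word = words, stem = stems)
```

Counting on the `stem` column instead of the raw word is one way to merge pairs like "player" and "players" before building the word graph.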
@@tomhenry-datasciencewithr6047 Something like this would be PERFECT for what I am doing, but it is honestly above me in complexity, at the moment. I was looking for a formulated way to form a sense of textual hierarchy. One which could be used to help new entries "find a logical category level", where it belongs with other similar associated words. In a basic sense... object => transportation => vehicle => automobile => car => gasoline_engine => hatch_back => ford => mustang => 1985 => cherry_red_paint So a "truck", which, by similar associations, would be at the level of "car". transportation => vehicle, as opposed to skates (accessory vs something you drive) or a ski-lift (not drivable) vehicle => automobile, as opposed to a bicycle or skateboard (non-motorized vehicles) etc... The hierarchy being assumed by known relations, and/or by simple volume of appearance and order. People tend to say, "my 1985 mustang", classed in reverse, "specific => generic". However, by volume, 1985 appears less than mustang and ford appears more than that. Continuing up the chain to automobiles, vehicles and transportation, which has progressively more and more "objects" that they are identified with. While knowing the relation, without needing to know the specifics of any one car... ford can be aligned with lexus, chevy, mazda, etc... Because of the similar preceding and following similar groupings. Why turn the entire language into a form of "word tree of origins"? Partly for use with AI classification of image contents. Knowing that cars have rims, wheels, paint, body styles, manufactures. Partly for extracting and isolating the correlating subject matter and "emotion/opinions" within descriptions. Partly for assisting the extraction and isolation of "finer details", such as the more rare descriptions of the types of tires, ground effects, antenna types, rim types. I could go on, but my primary goal in my knowledge-quest, was for the things just mentioned, as a whole. 
Extending to the final purpose of being used as a guide for helping others "classify images contents", with more valuable information that AI can use for digestion. (All in relation for text2image and image2image AI created art, which is assisted by "human textual prompts". They have a rough system in place, but it is hardly extensive or adequate enough to be used with any form of accuracy or repeatability.) P.S. Looking into this further, because of this video you posted. (Yet another language to consume my brain-cells.)
Great video sir. Thanks for the walk through! Will be applying some of this to a project I am working on
This is mind-blowing! I would have never dreamed to get to a so lovely code.
what if the other category is the biggest? do we use a different factor?
Awesome video, very clear and well explained. You are a great teacher. However, I am bumping into an issue. I have a 700mb dataset with 6MM rows and I want to calculate the top 20% of objects in a text string. Everything works well and I can get the count results, the variable is in my Environment, but it is null in the pane and when I try to plot it I get an error message that the object can't be found. This is after successfully assigning the variable and seeing the count results. Any suggestions? Been searching for over an hour. Thanks for any help, subscribed and looking forward to your other videos.
I also went through your tutorial in this video and had the same error message. When I get to line 30 I receive an error that "object 'volcano_type_counts' not found". For some reason my assigned variables disappear. Any advice?
Hi James! Can you post the code snippets you are using here? (for both getting the counts and plotting the count results)
Thanks! Very explanatory and super simple to understand.
Excellent! tidytext looks very interesting.
Great video! Very dense with information and straight to the point!
Excellent!
Excellent video. I was enthralled the entire duration. You've also given me some ideas for something I'm working on
You are a life saver!!!
Thanks. This was helpful.