R Basics: How to Use filter() to Select Rows Based on Column Values
- Published: 6 Nov 2024
- 🔔 Subscribe for weekly R videos: / @tomhenry-datasciencew...
In this episode of R basics, we're talking about how to use the filter() function to pick rows (observations) from data based on a condition - for example, "Pick all the rows where the penguin is over 2 feet high." We'll look at the easiest way to do this, as well as the hard way.
🎉 Enjoyed this video? Leave a comment below to share what you liked the most!
🎉 Subscribe if you want more videos like this! - ruclips.net/channel/UCb5aI-GwJm3ZxlwtCsLu78Q
😃 Comment below if you have questions about how to use filter()!
How does the code change if the columns have text instead of numbers? I want to select the rows that contain data for specific forest areas, which are marked with text in the data frame.
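A minimal sketch of how the same filter() pattern might look for a text column. The data frame `forest_data` and its columns `area` and `trees` are made up for illustration; the point is only that text values are compared as quoted strings, with `==` for one value and `%in%` for several.

```r
library(dplyr)

# Hypothetical data: one row per plot, with the forest area stored as text
forest_data <- data.frame(
  area  = c("North", "South", "East", "North", "West"),
  trees = c(120, 85, 60, 140, 95)
)

# One specific area: compare against a quoted string with ==
north_only <- forest_data %>% filter(area == "North")

# Several areas at once: use %in% with a character vector
north_or_east <- forest_data %>% filter(area %in% c("North", "East"))
```

Note that the comparison is case-sensitive, so "north" would not match "North".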
Thanks! Very explanatory and super simple to understand.
Thanks for your help. How would you filter with percentages? Like say if I want the top and bottom 1% of people and discard that data how would I be able to do that? Thank you very much
It depends on the exact scenario.
Assuming you have one row per person (i.e., not multiple rows per person), you can do it like this:
your_data <- your_data %>%
  mutate(variable_of_interest.percentile = percent_rank(variable_of_interest))
your_data %>% filter(variable_of_interest.percentile < 0.01) # bottom 1%
your_data %>% filter(variable_of_interest.percentile > 0.99) # top 1%
If you are interested only in excluding the top and bottom 1%, you could do
your_data %>% filter(variable_of_interest.percentile >= 0.01 & variable_of_interest.percentile < 0.99)
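For anyone who wants to run the percentile approach end to end, here is a small sketch on made-up data. The data frame `people` and the column `income` are assumptions for illustration only. Keep in mind that percent_rank() rescales ranks to the range 0 to 1, so the lowest value always gets 0 and the highest always gets 1.

```r
library(dplyr)

# Hypothetical data: 200 people, one row each, with a made-up income column
people <- data.frame(id = 1:200, income = seq(1000, 20900, by = 100))

# Add the percentile rank of each person's income (0 = lowest, 1 = highest)
people <- people %>%
  mutate(income.percentile = percent_rank(income))

# Keep the middle of the distribution, dropping both tails
trimmed <- people %>%
  filter(income.percentile >= 0.01 & income.percentile < 0.99)
```

With ties or very small datasets the number of rows removed may not be exactly 1% at each end, so it is worth checking nrow() before and after.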
**However** - my guess is that you are trying to filter out outliers - is that right?
If so, depending on your situation I would usually recommend two alternatives:
1. retaining outliers but using a technique that is more resilient to outliers (e.g., quantile regression using the quantreg package - by default, quantile regression models the conditional median, whereas ordinary linear regression models the conditional mean)
e.g.
# install.packages("quantreg")
library(quantreg)
rq(y ~ height + age, data = ...)
(same kind of interface as linear regression)
2. figuring out non-arbitrary absolute cut-offs - e.g. if you know that a customer spending more than a certain amount is unrealistic or a patient will never have a lab value less than a particular amount, then just filter out records where they exceed those values.
The reason is that in many datasets, the top 1% (say) may contain some invalid records but also a lot of valid ones, so where possible it's best to be explicit about your inclusion/exclusion criteria rather than using a percentage-based criterion for exclusion.
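As a rough sketch of the second alternative (explicit absolute cut-offs), assuming a made-up `labs` data frame and an assumed plausible range of 0 to 20 for the lab value. The names and the bounds are purely illustrative; real cut-offs would come from domain knowledge.

```r
library(dplyr)

# Hypothetical lab results; patient 3 and patient 5 have implausible values
labs <- data.frame(
  patient = 1:6,
  value   = c(4.1, 5.3, -1.0, 6.2, 250.0, 4.8)
)

min_plausible <- 0   # assumed lower bound, for illustration only
max_plausible <- 20  # assumed upper bound, for illustration only

# Keep only records inside the plausible range
valid_labs <- labs %>%
  filter(value >= min_plausible, value <= max_plausible)

# Keep the excluded records too, so the exclusion can be inspected and reported
excluded_labs <- labs %>% anti_join(valid_labs, by = "patient")
```

Storing the bounds in named variables keeps the exclusion rule visible in one place instead of buried inside the filter() call.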
@tomhenry-datasciencewithr6047 Wow thank you so much for your help! Amazing!!! Love from the UK
@tomhenry-datasciencewithr6047 You got it exactly right, I needed to get rid of outliers and I will take these better suggestions from you on board. Thank you so much
@JC-wt5fw Hope it goes well! :)
Thanks for this video, Tom. Do you know how to filter the data with multiple conditions? So if X = 1 then filter Y > 2000, AND if X = 2 then filter Y > 3000, AND if X = 3 then filter Y > 4000. Thanks for your help.
You can do this like so:
your_data %>%
filter((x == 1 & y > 2000) | (x == 2 & y > 3000) | (x == 3 & y > 4000))
(OR is the `|` symbol, and AND is the `&` symbol).
There are other ways of doing it which depend on your use-cases and which could be better, but this is the general method.
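One possible alternative, sketched here on made-up data, is to keep the thresholds in a small lookup table and join it on. The `thresholds` table and the `min_y` column are names invented for this example. This may be easier to maintain when there are many values of x.

```r
library(dplyr)

# Hypothetical data with a group column x and a value column y
your_data <- data.frame(
  x = c(1, 1, 2, 2, 3, 3),
  y = c(1500, 2500, 2900, 3100, 4000, 4500)
)

# One threshold per value of x, kept in a small lookup table
thresholds <- data.frame(x = c(1, 2, 3), min_y = c(2000, 3000, 4000))

filtered <- your_data %>%
  inner_join(thresholds, by = "x") %>%  # attach the matching threshold to each row
  filter(y > min_y) %>%                 # keep rows above their own threshold
  select(-min_y)                        # drop the helper column again
```

Because inner_join() drops rows whose x has no entry in the lookup table, this should give the same rows as the explicit `|` version for x values of 1, 2 and 3.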
@tomhenry-datasciencewithr6047 This is SO helpful. Thanks Tom. I look forward to watching more of your videos. They are super clear!
Hey sir, I was wondering if you know how to remove only a specific percentage of rows that have a specific value. For example, we have 10 rows with a value of 1 and we want to remove half of those rows.
What is your goal? Depending on what you are aiming to do, there may be more specific methods. However, a general approach works like this:
your_data %>%
group_by(specific_value) %>%
sample_frac(0.5) %>%
ungroup()
@tomhenry-datasciencewithr6047 Hey Tom, thank you for your answer. I tried it but it didn't work. Instead I used the following approach!
tmp_data <- data %>% filter(x == ma_valeur) %>% slice_sample(prop = 0.5)
data %>% anti_join(tmp_data)
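A runnable sketch of this sample-then-anti_join idea on made-up data. The `row_id` column is an assumption added here: joining on a unique id avoids accidentally removing duplicate rows, since anti_join() without `by` matches on all shared columns and would drop every identical row.

```r
library(dplyr)

set.seed(1)  # only so the random sample is reproducible

# Hypothetical data: 10 rows with x == 1 and 5 rows with x == 2
data <- data.frame(row_id = 1:15, x = c(rep(1, 10), rep(2, 5)))
ma_valeur <- 1

# Randomly pick half of the rows that have the target value
to_remove <- data %>%
  filter(x == ma_valeur) %>%
  slice_sample(prop = 0.5)

# Drop exactly those rows, matching on the unique id rather than on every column
result <- data %>% anti_join(to_remove, by = "row_id")
```

Rows with other values of x are left untouched, which is the difference from sampling within every group.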
Hi, how do you filter if you want, for example, rows where depth is between 100 and 150?
To filter rows in R where depth is between 100 and 150, you can use:
library(tidyverse) # assuming you've loaded this earlier
filtered_data <- your_data %>% filter(depth >= 100 & depth <= 150) # if you want to include 150
filtered_data <- your_data %>% filter(depth >= 100 & depth < 150) # if you don't want to include 150
For more information on the filter function and other data manipulation tools in R, check out the book "R for Data Science" by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund, available online at r4ds.hadley.nz/data-transform#filter :)
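As a small aside, dplyr also has a between() helper that can express an inclusive range a bit more compactly. A minimal sketch, assuming a made-up `your_data` data frame with a `depth` column:

```r
library(dplyr)

# Hypothetical measurements with a depth column
your_data <- data.frame(depth = c(90, 100, 125, 150, 160))

# between() is inclusive at both ends, so this keeps 100, 125 and 150
filtered_data <- your_data %>% filter(between(depth, 100, 150))
```

If one end should be excluded, the explicit `>=` and `<` comparisons are the clearer choice.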
Appreciate your videos!
Glad you like them!
Please, how do I filter rows with character data types? For example, rows in a dataset that have "total" in them.
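One way this might be done is with base R's grepl() inside filter(), which checks whether a piece of text appears anywhere in a character column. The `sales` data frame and its `label` and `amount` columns are invented for this sketch.

```r
library(dplyr)

# Hypothetical data: some rows are totals mixed in with regular rows
sales <- data.frame(
  label  = c("Apples", "Pears", "Total fruit", "Carrots", "total vegetables"),
  amount = c(10, 5, 15, 7, 7)
)

# Rows whose label contains "total", ignoring upper/lower case
total_rows <- sales %>% filter(grepl("total", label, ignore.case = TRUE))

# The opposite: everything except the total rows
non_total_rows <- sales %>% filter(!grepl("total", label, ignore.case = TRUE))
```

For an exact match on the whole value rather than a partial match, `label == "Total"` would be the simpler test.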
My output keeps saying "object __ not found"; Error in "select ()" I don't understand, I rewrote everything