R Basics: How to Use filter() to Select Rows Based on Column Values

Tom Henry - data science with R

13K views

Published: 6 Nov 2024
🔔 Subscribe for weekly R videos: / @tomhenry-datasciencew...
In this episode of R basics, we're talking about how to use the filter() function to pick rows (observations) from data based on a condition - for example, "Pick all the rows where the penguin is over 2 feet high." We'll look at the easiest way to do this, as well as the hard way.
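A minimal sketch of the idea in the description, assuming a small made-up data frame (the `penguins` tibble, its `name` and `height_ft` columns, and the values are hypothetical). Base R logical indexing is shown as one plausible "hard way"; the video may demonstrate something different.

```r
library(dplyr)

# Hypothetical data: names and heights in feet are invented for illustration
penguins <- tibble(
  name = c("Pip", "Skipper", "Waddles", "Rico"),
  height_ft = c(1.5, 2.4, 3.1, 2.0)
)

# The easy way: filter() keeps the rows where the condition is TRUE
tall_dplyr <- penguins %>% filter(height_ft > 2)

# A harder way: base R logical indexing on the same condition
tall_base <- penguins[penguins$height_ft > 2, ]
```

Both objects hold the same rows here. One difference worth knowing: `filter()` drops rows where the condition is `NA`, while base R indexing returns a row of `NA`s for them.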
🎉 Enjoyed this video? Leave a comment below to share what you liked the most!

Comments • 21

@tomhenry-datasciencewithr6047 2 years ago
🎉 Subscribe if you want more videos like this! - ruclips.net/channel/UCb5aI-GwJm3ZxlwtCsLu78Q
😃 Comment below if you have questions about how to use filter()!
@lefterisparasyris808 1 year ago ⁺²
How does the code change if the columns have text instead of numbers? I want to select the rows that contain data for specific forest areas, which are marked with text in the data frame.
@bukolaadebayo1891 2 years ago
Thanks! Very explanatory and super simple to understand.
@JC-wt5fw 1 year ago ⁺¹
Thanks for your help. How would you filter with percentages? Say I want to find the top and bottom 1% of people and discard that data; how would I do that? Thank you very much.
@tomhenry-datasciencewithr6047 1 year ago ⁺¹
It depends on the exact scenario.
Assuming you have one row per person (i.e., not multiple rows per person), you can do it like this:
your_data <- your_data %>%
  mutate(variable_of_interest.percentile = percent_rank(variable_of_interest))
your_data %>% filter(variable_of_interest.percentile < 0.01) # bottom 1%
your_data %>% filter(variable_of_interest.percentile > 0.99) # top 1%
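A worked sketch of the `percent_rank()` approach, assuming one row per person. The `person_id` and `spend` columns and their values are hypothetical. `percent_rank()` rescales ranks to the range 0 to 1, so thresholds of 0.01 and 0.99 mark the two tails.

```r
library(dplyr)

# Hypothetical data: one row per person, 1000 distinct spend values
your_data <- tibble(person_id = 1:1000, spend = as.numeric(1:1000))

# percent_rank() is (rank - 1) / (n - 1), so it runs from 0 to 1
ranked <- your_data %>%
  mutate(spend.percentile = percent_rank(spend))

bottom_1pct <- ranked %>% filter(spend.percentile < 0.01)
top_1pct    <- ranked %>% filter(spend.percentile > 0.99)

# Everything that is in neither tail
middle <- ranked %>%
  filter(spend.percentile >= 0.01 & spend.percentile <= 0.99)
```

With 1000 distinct values, each tail holds 10 rows and the middle holds 980. With tied values the tail sizes can differ, because tied rows share the same percentile.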
@tomhenry-datasciencewithr6047 1 year ago ⁺¹
If you are interested only in excluding the top and bottom 1%, you could do
your_data %>% filter(variable_of_interest.percentile >= 0.01 & variable_of_interest.percentile < 0.99)
**However** - my guess is that you are trying to filter out outliers - is that right?
If so, depending on your situation I would usually recommend two alternatives:
1. retaining outliers but using a technique that is more resilient to outliers (e.g., quantile regression using the quantreg package - basically quantile regression optimizes for the median, whereas regular linear regression optimizes for the mean)
e.g.
# install.packages("quantreg")
library(quantreg)
rq(y ~ height + age, data = ...)
(same kind of interface as linear regression)
2. figuring out non-arbitrary absolute cut-offs - e.g. if you know that a customer spending more than a certain amount is unrealistic or a patient will never have a lab value less than a particular amount, then just filter out records where they exceed those values.
The reason is that in many datasets, the top 1% (say) may contain some invalid records but also a lot of valid ones, so where possible it's best to state your inclusion/exclusion criteria explicitly rather than exclude by percentage.
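A minimal sketch of the second alternative (explicit absolute cut-offs), assuming made-up lab data. The `labs` tibble, the `potassium` column, and the limits `valid_min` and `valid_max` are all hypothetical; real limits would come from domain knowledge.

```r
library(dplyr)

# Hypothetical lab values; two of them are physically implausible
labs <- tibble(
  patient_id = 1:6,
  potassium = c(4.1, 3.8, 0.2, 5.0, 41, 4.4)
)

# Hypothetical plausibility limits, chosen for illustration only
valid_min <- 2.0
valid_max <- 8.0

# between() is inclusive on both ends
cleaned  <- labs %>% filter(between(potassium, valid_min, valid_max))
excluded <- labs %>% filter(!between(potassium, valid_min, valid_max))
```

Keeping `excluded` as its own object makes it easy to review and report which records were dropped and why.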
@JC-wt5fw 1 year ago
@@tomhenry-datasciencewithr6047 Wow thank you so much for your help! Amazing!!! Love from the UK
@JC-wt5fw 1 year ago ⁺¹
@@tomhenry-datasciencewithr6047 You got it exactly right, I needed to get rid of outliers and I will take these better suggestions from you on board. Thank you so much
@tomhenry-datasciencewithr6047 1 year ago
@@JC-wt5fw Hope it goes well! :)
@cheriseregier4729 1 year ago ⁺¹
Thanks for this video, Tom. Do you know how to filter the data with multiple conditions? So if X=1 then filter Y>2000, AND if X=2 then filter Y>3000, AND if X=3 then filter Y>4000. Thanks for your help.
@tomhenry-datasciencewithr6047 1 year ago
You can do this like so:
your_data %>%
filter((x == 1 & y > 2000) | (x == 2 & y > 3000) | (x == 3 & y > 4000))
(OR is the `|` symbol, and AND is the `&` symbol).
There are other ways of doing it which depend on your use-cases and which could be better, but this is the general method.
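A worked sketch of the combined condition on made-up data, followed by one possible alternative (a per-row threshold built with `case_when()`). The data values and the `threshold` column are hypothetical, and the alternative is only one of several ways to do this.

```r
library(dplyr)

# Hypothetical data covering each case, plus an x value with no rule
your_data <- tibble(
  x = c(1, 1, 2, 2, 3, 3, 4),
  y = c(1500, 2500, 2500, 3500, 3500, 4500, 9000)
)

# General method: OR together one AND-ed condition per value of x
kept <- your_data %>%
  filter((x == 1 & y > 2000) | (x == 2 & y > 3000) | (x == 3 & y > 4000))

# Alternative: look up a threshold per row, then compare once
kept_alt <- your_data %>%
  mutate(threshold = case_when(
    x == 1 ~ 2000,
    x == 2 ~ 3000,
    x == 3 ~ 4000
  )) %>%
  filter(y > threshold) %>%
  select(-threshold)
```

Rows whose `x` matches no rule (here `x == 4`) are dropped by both versions: in the second one the threshold is `NA`, and `filter()` discards rows where the condition evaluates to `NA`.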
@cheriseregier4729 1 year ago
@@tomhenry-datasciencewithr6047 This is SO helpful. Thanks Tom. I look forward to watching more of your videos. They are super clear!
@henrytadja7327 1 year ago ⁺¹
Hey sir, I was wondering if you know how to remove only a specific percentage of the rows that have a specific value. For example, we have 10 rows with a value of 1 and we want to remove half of those rows.
@tomhenry-datasciencewithr6047 1 year ago
What is your goal? Depending on what you are aiming to do, there may be more specific methods. However, a general approach works like this:
your_data %>%
group_by(specific_value) %>%
sample_frac(0.5) %>%
ungroup()
@henrytadja7327 1 year ago ⁺¹
@@tomhenry-datasciencewithr6047 Hey Tom, thank you for your answer. I tried it but it didn't work. Instead I used the following approach!
tmp_data <- data %>% filter(x == ma_valeur) %>% slice_sample(prop = 0.5)
data %>% anti_join(tmp_data)
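A runnable sketch of this sample-then-`anti_join()` approach, under stated assumptions: the data, the `value` column, and the `row_id` helper column are hypothetical additions. The explicit `row_id` matters because `anti_join()` without `by` matches on all shared columns, so if the rows were exact duplicates it would remove every copy, not just the sampled half. Note also that `group_by()` followed by `sample_frac(0.5)` keeps half of every group, which is a different result from halving only one value.

```r
library(dplyr)

set.seed(42)  # make the random sample reproducible

# Hypothetical data: 10 rows with x == 1 and 5 rows with x == 2
data <- tibble(x = c(rep(1, 10), rep(2, 5)), value = seq_len(15)) %>%
  mutate(row_id = row_number())  # unique key so the join is unambiguous

ma_valeur <- 1

# Randomly pick half of the rows that have the target value...
tmp_data <- data %>%
  filter(x == ma_valeur) %>%
  slice_sample(prop = 0.5)

# ...and remove exactly those rows from the full data
result <- data %>% anti_join(tmp_data, by = "row_id")
```

After this, `result` keeps 5 of the 10 rows where `x == 1` and all 5 rows where `x == 2`.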
@AgneKif 1 year ago ⁺¹
Hi, how do you filter if you want, for example, rows where depth is BETWEEN 100 and 150?
@tomhenry-datasciencewithr6047 1 year ago
To filter rows in R where depth is between 100 and 150, you can use:
library(tidyverse) # assuming you've loaded this earlier
filtered_data <- your_data %>% filter(depth >= 100 & depth <= 150)
filtered_data <- your_data %>% filter(depth >= 100 & depth < 150) # if you don't want to include 150
For more information on the filter function and other data manipulation tools in R, check out the book "R for Data Science" by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund, available online at r4ds.hadley.nz/data-transform#filter :)
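A short sketch of the range filter on made-up data, assuming hypothetical `site` and `depth` columns. It also shows dplyr's `between()` helper, which is a shorthand for the inclusive version of the comparison.

```r
library(dplyr)

# Hypothetical depths, including both boundary values
your_data <- tibble(
  site = c("A", "B", "C", "D", "E"),
  depth = c(90, 100, 125, 150, 175)
)

# between() includes both ends: depth >= 100 & depth <= 150
inclusive <- your_data %>% filter(between(depth, 100, 150))

# Spell out the comparison when one end should be excluded
half_open <- your_data %>% filter(depth >= 100 & depth < 150)
```

Here `inclusive` keeps sites B, C and D, while `half_open` drops D because its depth is exactly 150.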
@hornytholigist 2 years ago ⁺¹
Appreciate your videos!
@tomhenry-datasciencewithr6047 2 years ago
Glad you like them!
@ogclinton4780 1 year ago
Please, how do I filter rows with character data types? For example, rows that have "total" in them from a dataset.
@crazyneon285 7 months ago
My output keeps saying "object __ not found"; Error in "select ()". I don't understand, I rewrote everything.
