Remove Outliers from Data Set in R (Example) | Find, Detect & Delete Outlier Values | boxplot.stats

  • Published: 11 Sep 2024

Comments • 100

  • @t_thyme5845
    @t_thyme5845 3 years ago +5

    Many thanks!!! Step by step, concise, each line of code with what it does ... a perfect example of what R tutorials should look like.

    • @t_thyme5845
      @t_thyme5845 3 years ago +1

      Is it possible to do the same with grouped data?

    • @StatisticsGlobe
      @StatisticsGlobe 3 years ago

      Glad to hear that you liked it, thanks a lot! :)

    • @StatisticsGlobe
      @StatisticsGlobe 3 years ago +1

      Please have a look here for more info on how to perform analyses by group:
      statisticsglobe.com/mean-by-group-in-r
      statisticsglobe.com/summary-statistics-by-group-in-r

    • @t_thyme5845
      @t_thyme5845 3 years ago +1

      @StatisticsGlobe Thanks for the info. The problem I have been having is eliminating outliers by group. I already have the pipe function to split the data into groups and identify the outliers... now I just need to delete them. I tried writing a function to name the outliers and then isolate and remove them, and I also tried making a new Excel sheet in which the outliers were replaced with NA and then removing the NAs, but that just ended up turning my data into character. Any suggestions?

    • @StatisticsGlobe
      @StatisticsGlobe 3 years ago

      Have you tried to do that with a for-loop? You could loop over your groups and within the loop you would remove the outliers of each group. That's probably not the most efficient way, but it should work.
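
A minimal sketch of the for-loop idea from the reply above, assuming a data frame with a grouping column and one numeric column. The names `data`, `group`, `x`, and `data_clean` and the simulated values are hypothetical.

```r
# Hypothetical example data: two groups, each containing one extreme value
set.seed(1)
data <- data.frame(group = rep(c("A", "B"), each = 20),
                   x = c(rnorm(19), 50, rnorm(19, mean = 100), -50))

data_clean <- data.frame()                    # collects the cleaned groups

for (g in unique(data$group)) {
  data_g <- data[data$group == g, ]           # rows of the current group
  out_g <- boxplot.stats(data_g$x)$out        # outliers within this group only
  data_clean <- rbind(data_clean,
                      data_g[!data_g$x %in% out_g, ])
}

nrow(data)        # rows before the removal
nrow(data_clean)  # rows after the removal
```

Because the outliers are computed separately inside each group, a value that is ordinary in group B is not flagged just because it is far from the values in group A.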

  • @annisazulkifili6663
    @annisazulkifili6663 3 years ago +3

    Great help on my last-minute assignment! Thank you so much

    • @StatisticsGlobe
      @StatisticsGlobe 3 years ago

      Glad it helped Annisa, thanks for the nice comment!

  • @agsoutas
    @agsoutas 3 years ago +1

    Thanks for the insight, Joachim. I am definitely going to deepen my understanding of this topic since I will be working on a relevant project.

    • @StatisticsGlobe
      @StatisticsGlobe 3 years ago +1

      Glad it was helpful AG, thanks for the comment!

    • @agsoutas
      @agsoutas 3 years ago +1

      @StatisticsGlobe 😃😃👌

  • @caduguimaraes
    @caduguimaraes 3 years ago +2

    Excellent short tip. Thanks

  • @leandroandradaguio6070
    @leandroandradaguio6070 2 years ago +3

    Thanks

  • @paulabarros4145
    @paulabarros4145 2 years ago +1

    This video contains a perfect explanation!!!!!

    • @StatisticsGlobe
      @StatisticsGlobe 2 years ago

      Thank you very much Paula, glad it was useful!

  • @ZeeNoorTrip
    @ZeeNoorTrip 3 years ago +1

    Thank you, I did it the same way and it works, except that I removed the word stats after boxplot, and it still works perfectly.

    • @StatisticsGlobe
      @StatisticsGlobe 3 years ago

      Ah, nice to know that this works as well, thanks for sharing!
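
A short sketch of why dropping "stats" can give the same result: `boxplot()` also returns an `out` element, and with the default settings it is based on the same 1.5 times IQR rule as `boxplot.stats()`. The vector `x` is hypothetical, and `plot = FALSE` is used here only to suppress the drawing.

```r
set.seed(123)
x <- c(rnorm(100), 10, -10)                 # hypothetical data with two extreme values

out_stats <- boxplot.stats(x)$out           # approach shown in the tutorial
out_plot  <- boxplot(x, plot = FALSE)$out   # approach described in the comment

all.equal(out_stats, out_plot)              # compare the two sets of outliers

x_out_rm <- x[!x %in% out_plot]             # removal works the same way with either
```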

  • @hemantjoshi5034
    @hemantjoshi5034 1 year ago +1

    Thank you for sharing this tutorial !

  • @wildermanuel210
    @wildermanuel210 3 years ago +1

    Nice work dude, you helped me in my exam

  • @hirunisilva5158
    @hirunisilva5158 2 years ago +3

    Thank you for your explanation!
    Can you tell me how to remove the outliers in a multi-column dataset? What I want to know is how to merge those columns after removing the outliers by column.

    • @StatisticsGlobe
      @StatisticsGlobe 2 years ago +1

      Hey Hiruni, in case you want to remove outliers in multiple data frame columns, you would have to decide whether you would like to delete each row with at least one outlier, or to insert NA values wherever an outlier occurs. Which option do you prefer for your data set? Please keep in mind that the removal of outliers has to be done with care, and only if there is good theoretical reasoning for the removal. Regards, Joachim
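
The two options named in the reply above can be sketched as follows, assuming all columns are numeric. The data frame and the names `is_out`, `data_rows_rm`, and `data_na` are hypothetical.

```r
set.seed(42)
data <- data.frame(x1 = c(rnorm(30), 25),     # hypothetical data, outlier in the last row
                   x2 = c(-30, rnorm(30)),    # outlier in the first row
                   x3 = rnorm(31))

# Logical matrix: TRUE where a value is flagged within its own column
is_out <- sapply(data, function(col) col %in% boxplot.stats(col)$out)

# Option 1: delete every row that contains at least one outlier
data_rows_rm <- data[rowSums(is_out) == 0, ]

# Option 2: keep all rows and overwrite the outliers with NA
data_na <- data
data_na[is_out] <- NA
```

Option 2 keeps the columns at equal length, so they stay together in one data frame without any merging step.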

  • @galan8115
    @galan8115 3 years ago +2

    So... what do we do with multivariate data?

    • @StatisticsGlobe
      @StatisticsGlobe 3 years ago

      Hey Galan, I have planned to release a tutorial on multivariate outliers in the future. Until then, you may have a look here: stackoverflow.com/questions/45289225/removing-multivariate-outliers-with-mvoutlier Regards, Joachim

  • @davidsanchezarranz8118
    @davidsanchezarranz8118 2 years ago +1

    Really useful video thanks.

    • @StatisticsGlobe
      @StatisticsGlobe 2 years ago

      Thank you very much David! Glad it was helpful!

  • @idsfilm
    @idsfilm 2 years ago +3

    Thanks for the clear and concise tutorial, I am running into one problem however. When I use the code for removing the outliers, it changes my data (frame) to values, which stops me from making a box plot using qplot.

    • @StatisticsGlobe
      @StatisticsGlobe 2 years ago

      Hey, thanks for the kind feedback, glad you like the tutorial! Does it help to convert your values back to a data.frame using the following code?
      data
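
A minimal sketch of one way to get back to a data frame after the outlier removal. The names `x`, `x_out_rm`, and `data_out_rm` are illustrative assumptions and not necessarily the code from the reply above.

```r
set.seed(5)
x <- c(rnorm(50), 12)                         # hypothetical numeric vector
x_out_rm <- x[!x %in% boxplot.stats(x)$out]   # result is a plain numeric vector

data_out_rm <- data.frame(x = x_out_rm)       # wrap the vector in a data frame again
str(data_out_rm)                              # one column named x
```

Plotting functions that expect a data frame can then be given `data_out_rm` instead of the bare vector.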

    • @idsfilm
      @idsfilm 2 years ago +1

      @StatisticsGlobe Yeah, that helped, but it also made me realize I have made a completely different mistake. Basically I am trying to draw multiple boxplots next to each other, and all the data points are stored in one column while the variables they are linked to are in a different column. So I was trying to remove outliers from a column that has data from different variables (which). Any idea how to fix this, or should I order my data differently? P.S. Great that you are still replying and helping out people on a video a year later! That is amazing!

    • @StatisticsGlobe
      @StatisticsGlobe 2 years ago

      This also depends on what you want to do with the data later. However, for the outlier removal this might be a good idea. Maybe you can simply reshape your data from long to wide format (see here: statisticsglobe.com/reshape-data-frame-from-long-to-wide-format-in-r)? Thanks a lot for the kind words! Actually, I try to respond to every single comment on the channel. This is a lot of work, but also a nice way to interact with the community! :)

  • @hemanthchenga5671
    @hemanthchenga5671 2 years ago +1

    short and simple

  • @jeffreylin235
    @jeffreylin235 2 years ago +1

    This is a very concise and useful video. I have a basic question. Do you believe it is appropriate to remove the outliers? I'm working on a research project and would like to remove the outliers from the boxplot for the purpose of better visualization. But is that considered data manipulation?

    • @StatisticsGlobe
      @StatisticsGlobe 2 years ago

      Hey Jeffrey, thanks for the kind words! I'm sorry for the late response, I've been on vacation for the last couple of days. Are you still looking for an answer to this question? Regards, Joachim

    • @jeffreylin235
      @jeffreylin235 2 years ago +1

      @StatisticsGlobe Yes.

    • @StatisticsGlobe
      @StatisticsGlobe 2 years ago

      This strongly depends on your specific data and on what you want to show in your research paper. Generally speaking, I would be very careful with the removal of outliers. Usually you need strong theoretical reasoning to do so. In case you decide to remove the outliers, you would definitely have to discuss this in your paper.

  • @alessandrorosati969
    @alessandrorosati969 1 year ago +1

    How is it possible to generate outliers uniformly in the p-parallelotope defined by the coordinate-wise maxima and minima of the ‘regular’ observations in R?

    • @cansustatisticsglobe
      @cansustatisticsglobe 1 year ago

      Hello Alessandro,
      Sorry for the late reply. Do you still need help?
      Regards,
      Cansu
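
One possible reading of the question above, sketched under stated assumptions: the region bounded by the coordinate-wise minima and maxima is an axis-aligned box, so points can be drawn uniformly in it by sampling each coordinate independently with `runif()`. The matrix `regular`, the dimension p = 3, and `n_out` are hypothetical, and whether the generated points count as outliers depends on the definition used.

```r
set.seed(10)
regular <- matrix(rnorm(200 * 3), ncol = 3)   # hypothetical 'regular' observations, p = 3
n_out <- 20                                   # number of points to generate

mins <- apply(regular, 2, min)                # coordinate-wise minima
maxs <- apply(regular, 2, max)                # coordinate-wise maxima

# One uniform draw per coordinate; columns correspond to the p variables
generated <- sapply(seq_len(ncol(regular)),
                    function(j) runif(n_out, min = mins[j], max = maxs[j]))

dim(generated)                                # n_out rows, p columns
```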

  • @ifeoluwaakinola7260
    @ifeoluwaakinola7260 3 years ago +1

    You finally saved me

  • @anwar5843
    @anwar5843 3 years ago +1

    Thank you so much

  • @ZeeNoorTrip
    @ZeeNoorTrip 2 years ago +1

    Do you have any idea how I can remove outliers from all columns? For example, if you take the breast-cancer dataset?

    • @StatisticsGlobe
      @StatisticsGlobe 2 years ago

      Hey Zee, you may apply this code to each data frame column separately. Note that this would remove different observations in each column.

  • @dalga6175
    @dalga6175 2 years ago +1

    Concise but very informative video! Thank you for this! I have a quick question if possible. Say I plotted some normalized values, and then I noticed one extreme outlier on the plot. In R, I would like to identify those extreme outliers in my data frame in order to check the value manually. How can I do that? I used some code (attached below) that listed all the outliers; however, I would like to identify only those outliers that are too far from the mean.
    IQR

    • @StatisticsGlobe
      @StatisticsGlobe 2 years ago

      Hey Dalila, thanks for the kind words, glad you find the video helpful! Could you please illustrate what the column QP2_Labanov_norm$duration_ms looks like? Could you share some example values? Regards, Joachim

    • @dalga6175
      @dalga6175 2 years ago +1

      @StatisticsGlobe Hello, thank you so much for your swift reply! Yes, the column represents duration values in milliseconds that are converted from raw values in seconds. Below I pasted two screenshots, one with raw values in seconds ("duration" column) and one with the values converted to ms ("duration_ms" column). When I normalize the data, I normalize the millisecond values.

    • @dalga6175
      @dalga6175 2 years ago +1

      Sorry! Just realized that the screenshots didn't go through. Here is the summary of the duration column in milliseconds (please note that the values here are not normalized yet): Min. : 8.00 1st Qu.: 40.00 Median : 52.00 Mean : 52.53 3rd Qu.: 65.00 Max. :745.00
      I hope this information is helpful. Thank you so much again for your guidance!

    • @StatisticsGlobe
      @StatisticsGlobe 2 years ago

      Thanks for the clarifications. I tried to reproduce your code above, and in my case it runs properly (even though I don't know if this is a proper way to identify outliers). Could you explain your question in some more detail? What exactly is the problem with your code? See below for my example code:
      set.seed(586732)
      QP2_Labanov_norm

    • @dalga6175
      @dalga6175 2 years ago +1

      @StatisticsGlobe Thank you for all the effort to help your audience get through their R statistics difficulties. Let me rephrase my concern: my question was how I can identify outliers in my data frame. The code that I shared worked (it gave me some values), but how can I see the actual outliers and find them in my data frame, so I can check whether they can be corrected manually or whether they are just the way they are? It was hard for me to pick every value (outlier) and then go to my data frame to search for it (it takes lots of time)... so I wanted to know if there are ways to just ask R to identify the outliers in that specific column and then give the rows in which they are found (I don't know if I am asking R too much :) ). Extra information: I used another straightforward piece of code to get the values that might be outliers, which is boxplot.stats(QP1$meanf0)$out
      and it gave me some values:
      [1] 108 113 130 114 104 104 107 105 116 107 111 122 104 129 111 108 118 119 105 107
      [21] 112 118 111 105 116 106 136 109 106 105 105 107 111 103 125 129 111 105 107 119
      [41] 116 105 111 119 104 111 114 116 122 114 116 113 109 104 109 115 108 103 106 119
      [61] 106 112 111 124 133 114 112 114 108 106 103 125 105 105 103 103 123 127 109 103
      [81] 115
      Sorry for this long message.
      Do you offer individual tutorials that might be paid?
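
A sketch of how the rows holding the outliers could be located, using `which()` together with `%in%`. The object and column names `QP1` and `meanf0` come from the comment above, but the `id` column and the simulated values are hypothetical. The second part narrows the result to more extreme values by raising the `coef` argument of `boxplot.stats()` from its default of 1.5 to 3, which is one common convention and an assumption here.

```r
set.seed(2022)
QP1 <- data.frame(id = 1:200,                             # hypothetical data
                  meanf0 = c(rnorm(198, mean = 80, sd = 5), 130, 136))

out_vals <- boxplot.stats(QP1$meanf0)$out                 # values flagged as outliers
out_rows <- which(QP1$meanf0 %in% out_vals)               # row numbers of those values

QP1[out_rows, ]                                           # inspect the affected rows

# Only the more extreme cases: wider fences via coef = 3
extreme_vals <- boxplot.stats(QP1$meanf0, coef = 3)$out
QP1[QP1$meanf0 %in% extreme_vals, ]
```

Since `out_rows` holds row positions, the flagged observations can be checked or corrected directly in the data frame instead of searching for each value by hand.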

  • @user-bz8nm6eb6g
    @user-bz8nm6eb6g 3 years ago +1

    Thanks!

  • @dalga6175
    @dalga6175 2 years ago +1

    Also, is it normal to see more outliers on a graph with normalized data vs another graph with non-normalized data?

    • @StatisticsGlobe
      @StatisticsGlobe 2 years ago

      Hey Dalila, I'm not an expert on this, but I found this video which seems to explain your question: ruclips.net/video/KGIPh_DFb8U/видео.html

    • @dalga6175
      @dalga6175 2 years ago +1

      @StatisticsGlobe Thank you so much for sharing this!

  • @lebzgold7475
    @lebzgold7475 3 years ago +2

    How do you remove outliers in just a normal scatter plot?

    • @StatisticsGlobe
      @StatisticsGlobe 3 years ago

      Hi Lebz, in this tutorial I have explained how to remove outliers from a univariate variable. A scatterplot is usually based on multiple variables. For multivariate data you would have to apply different methods. For example, have a look here: stackoverflow.com/questions/45289225/removing-multivariate-outliers-with-mvoutlier Regards, Joachim

  • @karolinagora2187
    @karolinagora2187 3 years ago +1

    Heyyy,
    Looks like a great tip!! But...
    I have trouble running this code in my console, because an error pops up: "Error in command 'h (simpleError (msg, call)'): error computing argument 'table' when selecting method for function '% in%': undefined columns selected"
    I rewrote your code and only changed the data. Do I need some additional library, or do you have any other idea? Please save me

    • @StatisticsGlobe
      @StatisticsGlobe 3 years ago

      Hey Karolina, thanks for the comment! The "undefined columns selected" message suggests that you may have misspelled your column names, or you may have specified the column names at the wrong position. Could you share your code and explain the structure of your data in some more detail? Regards, Joachim

  • @catarinaesteves3
    @catarinaesteves3 3 years ago +1

    Just a question: this did not work for me. Is it because you are using univariate data and my data is multivariate? Please help, and thanks in advance

    • @StatisticsGlobe
      @StatisticsGlobe 3 years ago

      Hey Catarina, could you explain in some more detail what your data looks like? Regards, Joachim

    • @catarinaesteves3
      @catarinaesteves3 3 years ago +1

      @StatisticsGlobe Thanks for replying. So I have 117 observations and 4 variables. The 1st variable is a factor with 3 levels (and every time I do PCA I do it without the first column, which is this variable). The other 3 variables are numeric and on different scales. But the main concern here is that I have to do PCA, but these 3 variables have outliers. Should I remove them? And if so, how do I remove them?
      Sorry to bother, I would appreciate some help if you can :)

    • @StatisticsGlobe
      @StatisticsGlobe 3 years ago

      Thanks for the clarifications Catarina! Generally speaking: Outliers should only be removed in case you have a very good reason to do so. This strongly depends on your specific data and the way you have collected it. If you decide to remove outliers from your data set, it makes sense to check for outliers based on all your variables simultaneously. I'm not an expert on this topic. However, I found this thread on Stack Overflow, which seems to be helpful: stackoverflow.com/questions/45289225/removing-multivariate-outliers-with-mvoutlier Good luck with your analysis and let me know in case you have further questions! Joachim
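
As a rough illustration of checking all variables simultaneously, the sketch below flags rows by their squared Mahalanobis distance. This is a swapped-in base R technique, not the mvoutlier approach from the linked thread, and the simulated matrix `X`, the planted row, and the 97.5% chi-squared cutoff are all assumptions.

```r
set.seed(2021)
X <- cbind(a = rnorm(117),                    # hypothetical data: 117 observations,
           b = rnorm(117, mean = 10, sd = 2), # 3 numeric variables on different scales
           c = rnorm(117, mean = 100, sd = 15))
X[1, ] <- c(8, 25, 250)                       # plant one multivariate outlier

# Squared Mahalanobis distance of each row from the column means
d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))

cutoff <- qchisq(0.975, df = ncol(X))         # conventional cutoff, an assumption
flagged <- which(d2 > cutoff)                 # row numbers of flagged observations

X_out_rm <- X[-flagged, , drop = FALSE]       # data without the flagged rows
```

The classical mean and covariance used here are themselves influenced by outliers, so robust estimators are often preferred in practice.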

    • @catarinaesteves3
      @catarinaesteves3 3 years ago +1

      @StatisticsGlobe Thank you for your reply Joachim, I will check out that thread. I'm still deciding whether to remove the outliers or not, but I want to try it to see if the linear relationships differ with and without outliers... thanks!!

    • @StatisticsGlobe
      @StatisticsGlobe 3 years ago

      You are very welcome Catarina! :)

  • @larissacury7714
    @larissacury7714 2 years ago +1

    Hi, thank you! Do you know an equivalent function to rstatix::identify_outliers which allows two columns at once?
    Note: I know that this function allows group_by(), but it doesn't solve my problem this time.

    • @StatisticsGlobe
      @StatisticsGlobe 2 years ago

      Hey Larissa, you may simply apply this code multiple times to different variables. Or is there a specific reason why this wouldn't work? Regards, Joachim

  • @tetrapygus
    @tetrapygus 3 years ago +1

    Excellent!!

  • @samirhajiyev6905
    @samirhajiyev6905 2 years ago +1

    How can I remove outliers from different columns?

    • @StatisticsGlobe
      @StatisticsGlobe 2 years ago +1

      Hey Samir, you can apply this code to a column by using the $ operator.

  • @amilachathuranga5541
    @amilachathuranga5541 3 years ago +1

    Isn't an IQR calculation required to remove outliers?

    • @StatisticsGlobe
      @StatisticsGlobe 2 years ago

      Hey Amila, outlier detection is a huge field of research, which is discussed controversially. In this video, I'm showing a relatively basic way to remove outliers. However, depending on your specific situation it might be advisable to use more complex methods. So in my opinion it is not possible to answer your question in a generalized way :)
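
On the IQR question: `boxplot.stats()` already applies an IQR-type rule internally, with fences at 1.5 times the distance between the hinges. The sketch below reproduces that by hand; the vector `x` and the helper names are hypothetical.

```r
set.seed(7)
x <- c(rnorm(100), 8, -9)          # hypothetical data with two extreme values

h <- fivenum(x)                    # minimum, lower hinge, median, upper hinge, maximum
iqr_h <- h[4] - h[2]               # spread between the hinges
lower <- h[2] - 1.5 * iqr_h        # lower fence
upper <- h[4] + 1.5 * iqr_h        # upper fence

out_manual <- x[x < lower | x > upper]

all.equal(out_manual, boxplot.stats(x)$out)   # same values as the built-in rule
```

Fences computed from `quantile()` or `IQR()` can differ slightly, because the hinges from `fivenum()` are not always identical to the default quartiles.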

  • @ZeeNoorTrip
    @ZeeNoorTrip 3 years ago +1

    What can I do if I have non-numeric values too? I mean, I want to remove the outliers of all the data. Can you please let me know?

    • @StatisticsGlobe
      @StatisticsGlobe 3 years ago

      Hey Zee, in this case I would convert your data to numeric first. Have a look here: statisticsglobe.com/convert-data-frame-column-to-numeric-in-r Regards, Joachim
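
A small sketch of the conversion step mentioned above, assuming the non-numeric columns actually hold numbers stored as text or as factor labels. The data frame and its column names are hypothetical.

```r
data <- data.frame(x = c("1.5", "2.1", "1.8", "100"),        # numbers stored as text
                   f = factor(c("10", "20", "10", "30")),    # numbers stored as factor labels
                   stringsAsFactors = FALSE)

data$x <- as.numeric(data$x)                 # character to numeric
data$f <- as.numeric(as.character(data$f))   # factor to numeric via its labels

sapply(data, class)                          # both columns are numeric now
```

Calling `as.numeric()` directly on a factor would return the internal level codes rather than the labels, which is why the factor is converted to character first.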

  • @sanjayverma-dm9ep
    @sanjayverma-dm9ep 2 years ago +1

    What if we have multiple variables and their outlier ranges all differ?

    • @StatisticsGlobe
      @StatisticsGlobe 2 years ago

      Hey Sanjay, you could apply this code to each of these variables. Or you could do a multivariate outlier analysis. That depends on your specific data. Regards, Joachim

  • @academicskillsdrkhurram
    @academicskillsdrkhurram 1 year ago +1

    Hello, this is an excellent video on outlier removal. But I have a question. Your code removes the outliers from the data, but a problem arises when I want to bind this column back to my original data file for my other work. R then gives an error because of the unequal lengths, since the code removed some values. Could you please provide code that just silences the outliers or converts them to NA instead of removing them from the data set? I hope you get my point. Thanks. My error is as follows:
    Error! Assigned data `data_rs$root_nitrogen[!data_rs$root_nitrogen %in% boxplot.stats(data_rs$root_nitrogen)$out]` must be compatible with existing data.
    x Existing data has 24 rows.
    x Assigned data has 22 rows.
    i Only vectors of size 1 are recycled.
    Run `rlang::last_error()` to see where the error occurred.

    • @cansustatisticsglobe
      @cansustatisticsglobe 1 year ago

      Hello,
      Sorry for the late response. I am not sure if you still need help. But it looks like our tutorial on Statistics Globe: Create Data Frame of Unequal Lengths in R would help you.
      Regards,
      Cansu
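
A sketch of the NA replacement requested in the question above, which keeps the column at its original length. The names `data_rs` and `root_nitrogen` come from the error message, but the `plot` column and the simulated values are hypothetical.

```r
set.seed(3)
data_rs <- data.frame(plot = 1:24,            # hypothetical data with 24 rows
                      root_nitrogen = c(rnorm(22, mean = 2, sd = 0.2), 9, 0))

out_vals <- boxplot.stats(data_rs$root_nitrogen)$out

# Overwrite the outliers in place instead of dropping them
data_rs$root_nitrogen[data_rs$root_nitrogen %in% out_vals] <- NA

nrow(data_rs)                                 # row count is unchanged
mean(data_rs$root_nitrogen, na.rm = TRUE)     # later analyses can skip the NA values
```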

  • @therlott8310
    @therlott8310 3 years ago +1

    What if the following error comes up: Error: object 'x' not found
    ???

    • @StatisticsGlobe
      @StatisticsGlobe 3 years ago

      Hi Theresa, this means that the variable x does not exist. Here is a tutorial on that: statisticsglobe.com/error-object-not-found-in-r Best regards, Joachim

  • @jamesleleji9470
    @jamesleleji9470 2 years ago +1

    How do you remove outliers in a specific column?

    • @StatisticsGlobe
      @StatisticsGlobe 2 years ago +1

      Hey James, you may use the same syntax as shown in this tutorial by extracting the data frame column values using the $ operator (i.e. data$x).

    • @jamesleleji9470
      @jamesleleji9470 2 years ago

      @StatisticsGlobe I tried it but it didn't work. My data frame is 'my_data'. The column for which I want to remove outliers is 'income in 2012'. Can you use this to show me the code? Thanks

    • @StatisticsGlobe
      @StatisticsGlobe 2 years ago

      Are you looking for this?
      data
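
A sketch for a column whose name contains spaces, which is one possible reason the `$` syntax did not work as typed. The names `my_data` and `income in 2012` come from the comment above, while the `id` column and the simulated values are hypothetical. Such names need backticks with `$`, or quotes with `[[ ]]`.

```r
set.seed(2012)
my_data <- data.frame(id = 1:21,              # hypothetical data
                      income = c(rnorm(20, mean = 50000, sd = 5000), 1e6))
names(my_data)[2] <- "income in 2012"         # column name with spaces

out_vals <- boxplot.stats(my_data$`income in 2012`)$out       # backticks with $

# Keep only the rows whose income is not flagged as an outlier
my_data_out_rm <- my_data[!my_data[["income in 2012"]] %in% out_vals, ]

nrow(my_data_out_rm)
```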

  • @ammar46
    @ammar46 3 years ago +1

    How to add that column to the main data after removing the outliers??

    • @StatisticsGlobe
      @StatisticsGlobe 3 years ago

      Hey Ammar, are you looking for this?
      x_out_rm

    • @ammar46
      @ammar46 3 years ago +1

      @StatisticsGlobe Thanks dude, I used the quantile method to remove all the rows with outliers.

    • @ammar46
      @ammar46 3 years ago

      outliers_cutoff

    • @StatisticsGlobe
      @StatisticsGlobe 3 years ago +1

      OK nice, glad you found a solution! :)
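
The commenter's quantile code is not preserved above, so the following is only a guess at what a quantile-based row removal might look like, using fences at 1.5 times the IQR around the quartiles. All names (`data`, `q`, `cutoff`, `keep`) and the simulated values are hypothetical.

```r
set.seed(99)
data <- data.frame(id = 1:101,                         # hypothetical data
                   x = c(rnorm(100), 15))

q <- unname(quantile(data$x, probs = c(0.25, 0.75)))   # first and third quartile
cutoff <- 1.5 * (q[2] - q[1])                          # 1.5 times the IQR

keep <- data$x >= q[1] - cutoff & data$x <= q[2] + cutoff

data_out_rm <- data[keep, ]                            # whole rows removed, other columns kept
```

Because entire rows are dropped, the cleaned column stays aligned with the remaining columns of the main data, so nothing has to be added back afterwards.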

  • @NguyenQuyen-wg9iv
    @NguyenQuyen-wg9iv 2 years ago

    Thank you for your helpful video. I
    Sorry if my questions seem silly. I have a data frame whose first column is a tax code in text format, followed by 5 numeric variables. I don't know how to remove all numeric outliers at once in this case. After removing the outliers, how can I show the new data frame in table form and export it to Excel? Could you please help me?

    • @StatisticsGlobe
      @StatisticsGlobe 2 years ago +1

      Hey Nguyen, thank you for the kind words! Regarding your question: You may replace the outliers in each column by NA values. This way, you could keep the structure of your data frame. Please note that outlier detection is a very controversial topic, and it has to be done with care. In your case, it might be a better approach to perform multivariate outlier detection. But this depends very much on your specific data.
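
A sketch of the NA approach from the reply above for a data frame with one text column, followed by a CSV export that Excel can open. Only two numeric columns are simulated for brevity, and the names `tax_code`, `v1`, `v2`, and the file name are hypothetical.

```r
set.seed(11)
data <- data.frame(tax_code = sprintf("T%03d", 1:30),  # hypothetical text column
                   v1 = c(rnorm(29), 40),              # hypothetical numeric columns
                   v2 = c(rnorm(29, mean = 5), -60),
                   stringsAsFactors = FALSE)

num_cols <- sapply(data, is.numeric)                   # TRUE for the numeric columns only

# Replace the outliers of each numeric column by NA, leaving the text column untouched
data[num_cols] <- lapply(data[num_cols], function(col) {
  col[col %in% boxplot.stats(col)$out] <- NA
  col
})

head(data)                                             # show the result as a table

out_file <- file.path(tempdir(), "data_outliers_na.csv")
write.csv(data, out_file, row.names = FALSE)           # CSV file that Excel can open
```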

    • @NguyenQuyen-wg9iv
      @NguyenQuyen-wg9iv 2 years ago

      @StatisticsGlobe Thank you so much for your suggestion. I'll give it a try

    • @StatisticsGlobe
      @StatisticsGlobe 2 years ago

      You are welcome! :)

  • @RevenueRocketeers
    @RevenueRocketeers 3 years ago +1

    Thanks