R Stats: Multiple Regression - Variable Selection

ironfrown

96K views

Published: 22 Oct 2024

Comments • 55

@OniksWalks 4 years ago ⁺²
Thank you, Professor, you are the best! I wish you 100 years of teaching!
@jennhollifield9883 2 years ago
Thank you so much for this video! I am a beginner in statistical modeling and R, and this helped me very much to understand how to analyze a model, which I needed for my homework.
@gabrielanavarro2735 4 years ago
Very useful to help me understand better a stats graduate course I am taking. Great complement and thank you!
@ironfrown 8 years ago ⁺³
Since this video was created, the UCI Machine Learning Repository has moved to a new location, so the web location shown in the script no longer works. However, I have updated the link to the lesson data in the video description.
@pnm3225 2 years ago
May God BLESS YOU!!!!! Thank you so much!
@avi20009 7 years ago ⁺¹
Could you please explain what the function impute does here? Does it replace the missing values with the mean/median value? Also, why did you replace NAs with question marks? How did you define the variable 'Num of doors' to be a categorical variable?
@ironfrown 7 years ago
Dibyajyoti Chowdhury, as you said, impute replaces missing values with either the mean or the median. I do not replace NAs with question marks; quite the opposite: missing values in the data file have been coded with question marks, and so question marks are replaced with NAs while reading the file. As the number of doors is coded in the file as words, such as "four" or "two", the variable automatically becomes a factor on reading the data frame. It would be different if we were to assign values, in which case these vars would become chars.
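A minimal sketch of the reading and imputation steps described in this reply. The three-row extract and its column names are made up for illustration and are not the actual lesson data. Median imputation is done in base R so the snippet runs without extra packages; with Hmisc installed, `impute(auto$horsepower, median)` plays the same role. Note that since R 4.0 text columns are read as character by default, so `stringsAsFactors = TRUE` is passed explicitly here to get the factor behaviour mentioned above.

```r
# Hypothetical three-row extract; the real file has many more rows and columns.
raw <- "num.of.doors,horsepower,price
four,111,13495
?,154,16500
two,?,13950"

# Treat "?" as missing while reading. stringsAsFactors = TRUE turns text
# columns into factors (the default before R 4.0, but not since).
auto <- read.csv(text = raw, na.strings = "?", stringsAsFactors = TRUE)

str(auto)  # num.of.doors is a factor; horsepower is integer with one NA

# Base-R stand-in for median imputation of a numeric column.
hp_median <- median(auto$horsepower, na.rm = TRUE)
auto$horsepower[is.na(auto$horsepower)] <- hp_median
```

The median is only defined for numeric columns, so a factor such as the number of doors needs a different treatment (for example, filling in the most frequent level).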
@子陈-z6s 2 years ago
Very good and funny videos bring a great sense of entertainment!
@yiyuanzhang6335 3 years ago
Thank you very much! What happens if missing data make up over 90% of an independent variable? Should we delete this variable?
@ironfrown 3 years ago
@yiyuan 90% missing values in an independent variable is a lot; virtually every way of dealing with this, apart from dropping the variable, will harm your predictive model. Eliminating rows with missing values will leave you with less than 10% of the data for training, and replacing NAs with the mean will distort the distribution. If you think imputation with decision trees or k-NN would work (that is, predicting the missing values from the remaining independent variables), then you really do not need this variable for predicting the dependent variable, because it is redundant. However, if the missing values are a "feature" of data entry and are not random (say, a missing value was entered instead of zero), then you had better investigate this and fix the way the data was recorded, in which case you could deal with 90% of such intentional missingness.
@yiyuanzhang6335 3 years ago ⁺¹
@@ironfrown Thank you! If I understand you correctly, it is better for me 1) to delete this variable, or 2) if the variable is important to the analysis, to investigate and fix whatever caused such huge missingness. Am I correct?
@yiyuanzhang6335 3 years ago
@@ironfrown Also, I am now running a multiple regression which contains many independent variables with missing values varying from 1% to 99% of the total observations. I will proceed with my analysis in the following way. 1) Delete variables with over 90% missing values (if the missingness is not a feature of data entry). 2) For variables with over 60% but less than 90% missing values, I will run a regression of the dependent variable on each of these independent variables in turn; if it is not significant, I will also delete the variable. 3) For independent variables with less than 60% missing values, I will use the KNN approach to impute the missing values. Do you think my approach works? Thank you very much.
@ironfrown 3 years ago
@@yiyuanzhang6335 Any variable which has over 20% missing values is a good candidate for removal. If your variable has over 50% missing values, I would not touch it. If it is between 20% and 50%, you can try imputing the missing values. In all cases, validate the resulting model on test data (which has no missing values) to see which option gives you the best outcome. Never use imputed variables for testing, as the results will be misleading.
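The 20% and 50% rules of thumb from this reply can be checked per variable along these lines. This is only a sketch: the data frame, its column names and the action labels ("keep", "try imputing", "drop") are hypothetical and chosen for illustration.

```r
# Hypothetical data frame with different amounts of missingness per column.
df <- data.frame(
  y  = c(10, 12, 11, 15, 14, 13, 16, 18, 17, 19),
  x1 = c(1, 2, NA, 4, 5, 6, 7, 8, 9, 10),      # 10% missing
  x2 = c(NA, NA, NA, 1, 2, 3, 4, 5, 6, 7),     # 30% missing
  x3 = c(NA, NA, NA, NA, NA, NA, 1, 2, 3, 4)   # 60% missing
)

# Fraction of missing values in each column.
miss_frac <- colMeans(is.na(df))

# Map each fraction to the rule of thumb: up to 20%, 20-50%, above 50%.
action <- cut(miss_frac,
              breaks = c(-Inf, 0.2, 0.5, Inf),
              labels = c("keep", "try imputing", "drop"))

data.frame(miss_frac, action)
```

Whichever option is chosen, the reply's advice still applies: compare the resulting models on test rows that have no missing values.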
@JoaoVitorBRgomes 4 years ago
When I applied impute() to the number of doors, it became all NAs. Why?
@surajbhagat2672 4 years ago
How do I increase the text size (Horsepower, City.mpg, Peak.rpm, Curb.weight, Num.of.doors, Price)?
@danishmumtaz6655 6 years ago
Thanks for your response. Just one more thing: I applied a log transformation to Sale Price, and now when I want to convert it back to the normal price I am getting infinite values. I did 10^ sale Price. Please advise what the correct way of conversion is.
@ironfrown 6 years ago
Log does not like zeros, so just add a small value before the log (to be deducted later, after the power). That should fix the problem.
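A small sketch of the shift-then-log idea from this reply, using made-up prices and a shift of 1 (both are assumptions for illustration). One more thing worth checking is that the inverse matches the base of the logarithm: R's `log()` is the natural log, so its inverse is `exp()`, while `10^x` only undoes `log10()`.

```r
price <- c(0, 1500, 23000, 98000)  # hypothetical sale prices, including a zero

log(price)[1]                      # -Inf: the plain log breaks on zero

# Shift by 1 before the log, and undo the shift after the inverse.
log_price <- log(price + 1)        # equivalent to log1p(price)
back      <- exp(log_price) - 1    # equivalent to expm1(log_price)

all.equal(back, price)

# If log10() was used instead, the inverse is 10^x rather than exp(x).
back10 <- 10^log10(price + 1) - 1
```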
@121Kathmandu 4 years ago
Very helpful, thank you.
@sanjampatwalia9223 4 years ago
Thank you Professor.
@justAdancer 3 years ago
I am unable to use the pairs.panel() function. How do I fix it?
@ironfrown 3 years ago
At the beginning of the script (lines 17-19), you will find a number of commented-out statements, which load the required libraries. Uncomment them and run them once only. The package 'psych' is the one which provides pairs.panels. Good luck!
@nitz1755 7 years ago
Hey, very nice and clear explanation. Thanks
@danishmumtaz6655 6 years ago
What if train.corr and the adjusted R squared are not nearly equal? What is wrong in that case, and how do I fix it?
@ironfrown 6 years ago
@danish mumtaz, Adj R2 and squared correlation are mathematically very close for simple regression. However, once we start adding more variables, their values will diverge. There is nothing wrong with this. However, cross-validation measures, such as MAE, RMSE or correlation, are better indicators of model quality and performance than Adj R2, which is only correct when all of the regression assumptions / requirements have been met.
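The measures named in this reply can be computed side by side as in the sketch below. It uses R's built-in `mtcars` data instead of the lesson data, and a single arbitrary train/test split as a simple stand-in for full cross-validation; the variable names and the split size are assumptions.

```r
set.seed(1)
idx   <- sample(nrow(mtcars), 22)        # arbitrary split, roughly 70/30
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

fit <- lm(mpg ~ wt, data = train)        # simple regression, one predictor

r2          <- summary(fit)$r.squared
adj_r2      <- summary(fit)$adj.r.squared
train_corr2 <- cor(fitted(fit), train$mpg)^2   # equals r2 for a model with intercept

# Hold-out measures, computed on rows the model has not seen.
pred      <- predict(fit, newdata = test)
rmse      <- sqrt(mean((test$mpg - pred)^2))
mae       <- mean(abs(test$mpg - pred))
test_corr <- cor(pred, test$mpg)

round(c(r2 = r2, adj_r2 = adj_r2, train_corr2 = train_corr2,
        rmse = rmse, mae = mae, test_corr = test_corr), 3)
```

Adjusted R2 is R2 with a penalty for the number of predictors, so the gap between it and the squared training correlation widens as variables are added.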
@davidwollover1517 8 years ago
Very nicely explained!
@danielj5851 3 months ago
And again, the data set is not available. Wouldn't it be better to have it all (data set, Jupyter notebook) in, say, GitHub?
@ironfrown 2 months ago
I am not sure what you mean by the data set not being available. Please follow the link in the video description and you'll find the data there. As the data used in my videos was copyrighted, I resisted simply copying and redistributing it. Any links in the script itself may not be accurate, as the script is quite old by now. At that time, running R in Jupyter was a black art, which not many beginners could master. This seemed the simplest distribution.
@danielj5851 2 months ago
@@ironfrown Well, I mean that the link provided leads to a data (and names) file. Apparently this can be easily used with pandas. I expected a CSV (or Excel) file (currently I'm working with R). I mention the Jupyter notebook because you talked about it.
@ironfrown 2 months ago
@@danielj5851 I'll check this, they must have given up on R
@DungNgoc-xs2tr 3 years ago
The video sound is pretty good, beyond my imagination
@WahranRai 2 years ago ⁺¹
5:45 I don't like your regression: where is the normalization of features (0:1, for example)?
@ironfrown 2 years ago
Linear regression is unaffected by linear rescaling of its variables, so scaling into the 0:1 interval is not needed - the coefficients absorb the scaling automatically. However, normalisation of some skewed variables, e.g. using log / sqrt / sqr, could improve the regression, but this needs to be tested. Such unnormalised variables can be unnecessarily rejected in the process demoed in this video (which aims at a simple case for learning purposes).
@ironfrown 2 years ago ⁺¹
I agree with you that variables ideally should have a similar distribution of variance, and here normalisation could improve the outcome.
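The point about 0:1 scaling can be verified directly. The sketch below uses the built-in `mtcars` data rather than the lesson data; `rescale01` is a hypothetical helper written for this example. The fitted values are the same with and without min-max scaling, and only the coefficients change. A log-transformed predictor is included for comparison because, unlike linear rescaling, it does change the model.

```r
# Hypothetical helper: min-max scaling into the 0:1 interval.
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))

fit_raw <- lm(mpg ~ wt + hp, data = mtcars)

scaled     <- transform(mtcars, wt = rescale01(wt), hp = rescale01(hp))
fit_scaled <- lm(mpg ~ wt + hp, data = scaled)

# Same predictions either way; the coefficients absorb the rescaling.
all.equal(fitted(fit_raw), fitted(fit_scaled))
coef(fit_raw)
coef(fit_scaled)

# A non-linear transformation is a different model and has to be tested.
fit_log <- lm(mpg ~ wt + log(hp), data = mtcars)
c(raw = summary(fit_raw)$adj.r.squared,
  log = summary(fit_log)$adj.r.squared)
```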
@agusruslani259 8 years ago
Can you reupload the data for this lesson? I tried to open that link but Google said the link has been changed.
@ironfrown 8 years ago
Thanks Agus, and sorry for the delay; I have been traveling a bit. Indeed, UCI has moved its repository, so I have modified the link to the data set used in this video. This affects the video description only, not what is shown in the comments of the script in the video. Good luck with this.
@ironfrown 7 years ago ⁺¹
And they moved it again, so I have updated the link again :)
@Ajayk124 7 years ago
Hi, I am getting an error at auto$price
@ironfrown 7 years ago
Ajay Kumar, R is case sensitive: you are using both the variable Price and price, so one of them has no values in it!
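A quick illustration of the case-sensitivity issue, using a made-up data frame (the values are hypothetical): asking for a column under the wrong case silently returns `NULL` instead of raising an error, which then breaks later steps.

```r
auto <- data.frame(Price = c(13495, 16500, 13950))  # hypothetical values

is.null(auto$price)   # TRUE: there is no column called "price"
mean(auto$Price)      # works: the name matches exactly

names(auto)           # shows the exact spelling R is using
```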
@sudheerakawickramarachchi6775 4 years ago
great!
@andrepaim7007 7 years ago
Hey, that was really great!
Thanks a lot for the video!
@muhammadsaleemkhan5761 3 years ago
Many thanks, nice video. Can you please check the link for the R source code? It is not working. Thanks.
@ironfrown 3 years ago
I see what you mean; unfortunately visanalytics.org reached the end of its life and those links expired. Give me a few days and I'll provide a new home for these sources and update the video descriptions. I am sorry about the inconvenience!
@ironfrown 3 years ago
OK, I have fixed the links in this video, and I will progressively correct the links in my other videos soon!
@muhammadsaleemkhan5761 2 years ago
@@ironfrown Many thanks for your reply and for updating the links. I want to ask a question regarding linear regression variable selection. My outcome is continuous lung function and the main exposure variable is NO2 air pollution (quintiles). When I add a deprivation score to my model, the sign of the NO2 coefficient changes (from negative to positive), and I am not sure what the reason could be. I checked for collinearity and there is none. I would really appreciate your help in this regard. Is it possible to have your email or contact number, please? Thanks.
@anthonywongkl7979 7 years ago
How to calculate columns of data?
@ironfrown 7 years ago
Anthony WongKL, I am not sure I understand the question. Let me assume you are asking how to create a new column, such as c, in a data frame df based on a formula, such as +, using values from other columns, such as a and b. If so, you'd execute an R statement: df$c
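The statement in this reply is cut off in the page as captured. A sketch of what a column computed from two others looks like, with a made-up data frame and the names a, b and c taken from the reply (the extra column d and the use of `transform()` are additions for illustration):

```r
df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))  # hypothetical data

# New column c computed row by row from columns a and b.
df$c <- df$a + df$b

# transform() is an alternative that avoids repeating df$.
df <- transform(df, d = a * b)

df
```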
@niccodalpiaz 5 years ago
Can you share your .rmd notebook so I can reverse engineer it?
@ironfrown 5 years ago ⁺¹
Nicco, if you check the description of the video, it has the links to both the data and the .R script. Enjoy!
@SatishRahi 6 years ago
I get "impute function not found" despite having loaded the same libraries. Puts a break
@ironfrown 6 years ago
Satish Rahi, make sure that the packages for all the required libraries have been previously installed. Note that the sample R code has commented-out "install.packages" statements, which need to run once only, so uncomment them for one run only. When the package "Hmisc" is properly installed and the "library" statement is successfully executed, the "impute" function will work.
@LuqmanAbu 6 years ago
Uninstall the package, then install it again. When prompted "Do you want to install from sources the package which needs compilation?", answer no instead of yes. It should work after this.
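One way to see whether a missing package is behind a "function not found" error is to check for the packages before loading them. This is a sketch, not the script from the video: the package list is limited to the two packages named in these comments, Hmisc (for `impute`) and psych (for `pairs.panels`), and nothing is installed automatically.

```r
needed <- c("Hmisc", "psych")

# TRUE for each package found in the library paths (nothing is downloaded).
installed <- vapply(needed,
                    function(p) nzchar(system.file(package = p)),
                    logical(1))

missing_pkgs <- needed[!installed]

if (length(missing_pkgs) > 0) {
  # Installation is a one-off step, so it is only suggested here.
  message("Run once: install.packages(c(",
          paste0('"', missing_pkgs, '"', collapse = ", "), "))")
} else {
  library(Hmisc)   # provides impute()
  library(psych)   # provides pairs.panels()
}
```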
@jababnamgay6366 4 years ago
Seems good, but the view is too small and I cannot read it.
@ironfrown 4 years ago
@Jabab Namgay, indeed the RStudio format is not ideal for YouTube presentations. However, you can get the R script from the links included in the video description. My new R videos utilise Jupyter Notebooks, which allow a much clearer display of R code.
@jababnamgay6366 4 years ago
@@ironfrown thank you sir
