Gini Index and Entropy|Gini Index and Information gain in Decision Tree|Decision tree splitting rule
- Published: 13 Jan 2020
#GiniIndex #Entropy #DecisionTrees #UnfoldDataScience
Hi,
My name is Aman and I am a data scientist.
About this video:
How does a Decision Tree work? A Decision Tree recursively splits the training data into subsets based on the value of a single attribute. Splitting stops when every subset is pure (all elements belong to a single class).
This video explains Gini and Entropy with examples.
The below questions are answered in this video:
1. What is the Gini Index?
2. What is Information Gain?
3. What is Entropy?
4. What is the tree splitting criterion?
5. How is a decision tree split?
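As a companion to these questions, here is a minimal Python sketch of the two impurity measures discussed in the video; it is my own illustration, not code from the video, and the toy labels are made up:

import numpy as np

def gini(labels):
    # Gini index of a node: 1 - sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Entropy of a node: -sum of p * log2(p) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

node = np.array([1, 1, 1, 0, 0, 0])   # a perfectly mixed node: 3 of each class
print(gini(node))                     # 0.5  (worst case for two classes)
print(entropy(node))                  # 1.0  (worst case for two classes)

pure = np.array([1, 1, 1, 1])         # a pure node -> zero impurity
print(gini(pure))                     # 0.0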
About Unfold Data Science: This channel helps people understand the basics of data science through simple examples, in an easy way. Anybody without prior knowledge of computer programming, statistics, machine learning, or artificial intelligence can get a high-level understanding of data science through this channel. The videos uploaded are not very technical in nature, so they can be easily grasped by viewers from different backgrounds as well.
Join Facebook group :
groups/41022...
Follow on medium : / amanrai77
Follow on quora: www.quora.com/profile/Aman-Ku...
Follow on twitter : @unfoldds
Get connected on LinkedIn : / aman-kumar-b4881440
Follow on Instagram : unfolddatascience
Watch Introduction to Data Science full playlist here : • Data Science In 15 Min...
Watch python for data science playlist here:
• Python Basics For Data...
Watch statistics and mathematics playlist here :
• Measures of Central Te...
Watch End to End Implementation of a simple machine learning model in Python here:
• How Does Machine Learn...
Have a question for me? Ask me here: docs.google.com/forms/d/1ccgl...
There is a mistake in your video:
You said to choose the attribute that has less information gain, but actually we have to choose the one that has high information gain...
Yes Naat, thanks for pointing it out. I have pinned the comments related to it on the video for everyone's benefit.
@@UnfoldDataScience Pleasure sir
If you are saying that we have to choose high information gain, then as per the video we should take the impure node. For a pure node the Gini would come out as 0 and hence 0 IG. Isn't something wrong?
At what time was that said and corrected?
@@DK-il7ql At 10:37 he said low information gain by mistake instead of high information gain.
I usually don't like commenting on YouTube videos. But for this one, I felt like I had to show appreciation because truly this video was extremely helpful. University professors spend hours explaining what you just explained in 11 minutes. And you are the winner. Perfect explanation.
Thank you so much!!!!
I appreciate it Ahmed. Your comments motivate me :)
Institutes spend two hours explaining these two concepts and you made them clear in a few minutes. Excellent explanation.
Thanks a lot :)
@@UnfoldDataScience I agree
This has become my favourite channel for ML/Data Science topics,thank you very much for sharing your knowledge
Thanks Jehan, your words are my motivation.
Wow! Not only was your explanation amazing but you also answered every single comment! True dedication. Keep it up!
Thanks a ton Zain.
I have an assignment due tomorrow and this helped a lot!
Excellent, to the point, good examples. Great work!
Best channel to learn ML and Data science concepts. Thank you sir
Thanks Indrajit. Kindly share the video within data science groups if possible.
If I feel any concept is hard to understand, the first thing I do is search for your videos. Very intuitive and easy to understand. Thank you so much!
Your comments are my motivation Akhil. Thanks a lot. Happy learning. Tc
The simplest and best explanation so far.
Glad it was helpful Shyam.
I just discovered this channel what a gem
Thanks a lot. Please share with others in various data science groups as well.
Finally I am getting some clear explanations for various concepts.
Thanks Indra.
You have really good explanation skills, thank you man, I finally understand it.
Most Welcome :)
short simple and sweet, thank you so much
You're welcome Kunal.
Thanks bro...explained in easy manner...
Thanks for the clear and easy explanation.
With your clear explanation, I finally understand what Gini index is. Thank you so much!
You are welcome. happy learning. Stay Safe!!
This is brilliant. Thank you so much!
Thanks Abhijit. Keep Watching. Stay Safe!!
Very good and clear. I'm a French speaker and I understood almost everything.
Thanks Hassan.
Thanks for the video! It was really clear and well executed. It would have been great to detail the entropy calculation though; I find it a bit elusive without an example.
Simple & clear
Thanks a lot.
This is On point, thank you so much.
You are so welcome Bhargav.
Thank you for this video very helpful
Your first video that I came across. Subscribed!
Thanks Vishesh.
I appreciate your concepts for Gini and Entropy
Thanks Awanish.
love it, very clear explanation
Thanks Eider. Happy learning. Tc
Crystal Clear Sir!! Keep Going!!
Thank you Anandram.
Amazing explanation sir
thank you so much, this helps me a lot!!!
I'm so glad!
Very clear explanation and very helpful.
Glad it was helpful Deepika.
Great Explanation !! very helpful . Thank you :)
Glad it was helpful!
Thanks for this clear and well-explained Gini index.... Thanks....
Glad it was helpful!
Sir, your explanation really helps me very much, thank you.
You are welcome.
Awesome video.. Thank You so much!
Thank you.
Your explanation is awesome, thanks.
Thanks a lot for your valuable feedback.
Thank you so much sir. Before watching this video I had watched 4 videos related to impurity, but everyone was mixing up entropy and impurity and it was not really clear what exactly the formula is and how it works. But after watching your video, it is totally clear now. Thank you for this beautiful and clear explanation.
Glad you understood
Amazing explanation
Thanks Ravi.
You are doing a great job sir.
Thanks a lot.
Thank you, well explained
Glad it was helpful!
Thank you, sir!
Very welcome!
thank you so much
Thank you, no one could have done better
Your comments mean a lot to me.
Thanks man
Thanks a lot.
Welcome
It was very informative, Sir. Thank you :)
Most welcome Prerna.
great explanation
Glad it was helpful!
Great video
Thanks a lot.
Thank you
Welcome Jarrell.
Hello Aman,
Hope you are well. I have a question. Hope you can help me here.
If the probability P = 0,
then the Gini impurity becomes 1,
as per the formula. Then why does it always range from 0 to 0.5?
Thank you,
Subhajit
Great video !
Thanks for the visit
Very nice explanation, and comparing their performance at the end was the icing on the cake.
Just to confirm, is Gini/IG only for classification?
For regression trees, would we use loss functions like the sum of squared residuals?
That's a good question. Since it's based on probability, it is applicable to classifiers. For regression, we use something like minimizing SSE or another error measure.
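To connect this to a concrete library, scikit-learn (assuming you use it) exposes the criterion as a parameter: classification trees accept Gini or entropy, while regression trees use error-based criteria such as squared error in recent versions. A minimal sketch:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X, y = load_iris(return_X_y=True)

# Classification: impurity can be measured with "gini" or "entropy"
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X, y)

# Regression: splits minimize an error measure instead (squared error here)
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=3).fit(X, y.astype(float))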
Hi sir, as per my knowledge, "Information Gain" is used when the attributes are categorical in nature, while "Gini Index" is used when attributes are continuous in nature.
Thank you for your wonderful explanation.
Please make a video on PSI and KS index.
Will do soon, Sager. Thanks for the feedback.
Thanks a lot!
You're welcome!
Good teaching
Keep watching
Great content.
Thank you.
Nicely explained....! Subscribed :)
Thanks Lalit. So nice of you :)
I have one question, Aman. At the root node, is the Gini or entropy high or low?
Which one to choose? Like, how can I tell by looking at the data whether we should use Gini or IG?
Can't decide in advance, it's more of a trial-and-error thing (there are some guidelines though).
very well explained
Thanks for watching Subhangi.
Does CART go through all the possible numerical values under loan to find the best condition? If you have a large amount of data, wouldn't it be very slow?
That is a good question, thanks for asking. In general, for a numerical variable, a first split point is chosen randomly and then the point is optimized based on the direction in which the loss function is moving. Please note, the loss in this case is the node purity after the split.
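For intuition, here is a deliberately naive brute-force version of the numeric split search (my own sketch; as the answer above notes, real implementations search more cleverly than this): it tries every midpoint between consecutive sorted values and keeps the threshold with the lowest weighted Gini.

import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_numeric_split(values, labels):
    # Candidate thresholds: midpoints between consecutive sorted unique values
    v = np.unique(values)
    best_t, best_impurity = None, np.inf
    for t in (v[:-1] + v[1:]) / 2:
        left, right = labels[values <= t], labels[values > t]
        w = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if w < best_impurity:
            best_t, best_impurity = t, w
    return best_t, best_impurity

loan  = np.array([100, 150, 220, 260, 310, 400])   # toy loan amounts
label = np.array([0,   0,   0,   1,   1,   1])     # toy class labels
print(best_numeric_split(loan, label))             # best threshold falls between 220 and 260 here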
Thank you ❣️
Welcome.
Very Good!!!
Thank you Chris. happy learning. stay safe. tc
Can you help explain the entropy equation intuitively?
So, the only difference between Gini and Information Gain is the performance speed, right? I assume that with the same decision-making state and data, both Gini and Information Gain will be able to pick the same best attribute, right?
Great video btw!
That is correct. Also the internal mathematical formula is different.
Thank you sir.
Welcome Siva.
Amazing explanation Aman. I have one doubt: suppose there are 5 columns (4 independent and 1 target). For the split I have used columns 1, 2, 4, 3 and another person is using 3, 2, 1, 4. Then on what factors can we decide whether my splits are best or the other guy's splits are best?
It's the algorithm's decision which columns to use.
Awesome work and very intuitive explanation! Thank you. I have an exam in Data Mining and you helped me sir!!
Glad it helped! Happy Learning!
thanks
Welcome.
Hi Great explanation. Thank you so much. Do you have any videos explaining the criteria for Decision Tree regression?
Thanks a lot. For regression, not yet; will upload soon.
So if i am using the C5.0 algorithm? Which Separation technique will be used?
Entropy, for measuring purity.
Great explanation.
I have a question: can the Gini index be negative?
Hi, no, it cannot be.
Just to make it clear, the Gini index ranges from 0 to 0.5 and not 0 to 1. Jump to the video at 7:10.
Yes, this is a common comment from many users. You are right Abhishek.
Sir, how did you choose the loan amount as the root node? Do we have to find the Gini for all columns and then select the root node?
Hi Amna, this is a good question, thanks for asking. Yes, compute the Gini for all columns and select the optimal split.
Do we have to calculate both Gini and entropy to figure out which is best for the dataset??
Only one at a time.
Firstly sir, as far as I know, the higher the information gain, the better the split.
And I want to know: is either of them for continuous variables?
Higher IG is better
Thank you so much sir please do some projects
Thanks Vishal.
Well sir, how does the root node selection work if two candidate splits share the same and lowest Gini index value?
Happens very rarely, Geethanjali.
But if we have datasets with more columns than in this example, then how do we decide which input column should be split on?
Answered.
Hi! I want to make sure about the Gini index. You said that "the criterion of the split will be selected based on the minimum GINI INDEX from all the possible conditions". Is it the "Gini index" or the "weighted Gini index"? Thanks a lot though! Learned a lot from this video!
Thanks Steven. "Gini Index".
Good explanation, but a correction is needed. Gini oscillates between 0 and 0.5. The worst split would be half positive and half negative; the Gini impurity for that wing is 0.5, and the overall weighted Gini would also be 0.5.
It is entropy that oscillates between 0 and 1.
You are right, Anil. This feedback is coming from other viewers as well; maybe I mentioned this part wrongly in the video. I am pinning your comment to the top for everyone's benefit. Thanks again.
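A quick worked check of that correction (my own arithmetic, not from the video): for a two-class node split 50/50, p1 = p2 = 0.5, so
Gini = 1 - 0.5^2 - 0.5^2 = 0.5
Entropy = -0.5*log2(0.5) - 0.5*log2(0.5) = 0.5 + 0.5 = 1
So for binary classification Gini tops out at 0.5 while entropy tops out at 1. (With k classes the Gini maximum is 1 - 1/k, which only approaches 1 as k grows.)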
I'm a bit confused between Gini and entropy. I mean, is it necessary to use both methods while analyzing, or can we go for any one of them?
We have to use only one of them. Which one to choose depends on data.
It depends on the case; both are not to be used.
✌🏻✌🏻
Which model has less bias and high variance: logistic regression, decision tree, or random forest? Can you please help?
Decision tree - high variance, low bias.
Logistic regression - high bias, low variance.
Random forest - tries to reduce the high variance of a decision tree. Bias is low.
@@UnfoldDataScience Thank you very much. Can you also share the reasoning behind this, or a link where I can understand it?
Awesome Explanation, very sharp! I have 2 questions:
1. Since this algorithm calculates Gini index for ALL splits in EACH column, is this process time-consuming?
2. What if the algorithm finds TWO conditions where GINI Index is 0. Then how does it decide which condition to split on?
Thank you in advance!
1. It is compute-intensive, but it does not happen one by one internally for numerical columns; the algorithm tries to figure out smartly in which direction it should move. For categorical columns it happens one by one and is time-consuming.
2. A Gini of 0 means homogeneous sets, hence no further split will happen.
I calculated the Gini index for a (4, 2) split; it came out as 4/9. Shouldn't it come close to 1, since it is the worst-case scenario?
Need to check with the data and calculate; however, it is not always mandatory that it will be close to 1.
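For anyone checking the arithmetic in that question (assuming a (4, 2) node means 4 of one class and 2 of the other):
Gini = 1 - (4/6)^2 - (2/6)^2 = 1 - 16/36 - 4/36 = 16/36 = 4/9 ≈ 0.44
which is just below the two-class maximum of 0.5 reached at a 3/3 split, so it should not approach 1.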
It is a nice tutorial, Sir! But how could that category come out as true, since you made it greater than or equal to 200, and that should be inclusive in the Gini index?
Yes, I have already accepted that mistake 🙂
Can you show one numerical example using entropy? When the formula starts with a negative sign, how can the value be positive? Just curious.
Because log(x) is negative for probabilities between 0 and 1, the leading minus sign makes each term positive.
@10:38 Where the information gain is high, that is where we try to split the node, right??
That is a good question. The formula you see @10:38 is for entropy of a node.
Information gain for a split = entropy of the node - weighted entropy of the child nodes after the split.
The decision tree splits at the place where the information gain is highest. Put another way, the decision tree splits where entropy is reduced to the largest extent.
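A small Python sketch of that calculation (my own toy numbers, not from the video):

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

parent = np.array([0, 0, 0, 1, 1, 1])                    # 3 vs 3 -> entropy 1.0
left, right = np.array([0, 0, 0]), np.array([1, 1, 1])   # a perfect split

children = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
info_gain = entropy(parent) - children
print(info_gain)   # 1.0 here; the split with the highest gain is chosen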
Sir, kindly explain entropy in detail, just like the way you presented the Gini index.
Sure Karthik. Keep watching.
Sir, is the range of the Gini index from 0 to 1 or from 0 to 0.5? I am confused.
see previous comments. we have discussed it.
How can the Gini index range from 0 to 1? For the best case it is 0 and for the worst case it is 0.5, so how is that possible? Please explain.
The coefficient can range from 0 (or 0%) up to 1 (or 100%) in general, but for two classes the maximum is 0.5, as discussed in the pinned comment.
I think we select the split with the highest information gain when using entropy. Please correct me if I'm wrong.
You are right. When an internal node is split, the split is performed in such a way that the information gain is maximized.
Thanks Abdo. Yes, maximum IG is considered for the split. I probably missed including that in the video.
@@UnfoldDataScience You are welcome. I also got some new information from your video.
For the Titanic dataset, what type of criterion do we have to use??
Hi Noman, can't say; we need to try and see which one works better.
Hello, very insightful. You almost explained the best times to use either criterion. Can you shed more light on that: the best kind of criterion to use for the data in a model?
Hi Anthony, it is usually not easy to say beforehand which method (Gini/entropy) works on what kind of data. Usually we try various options, check model performance, and then choose one. Hope this clarifies things. Thank you.
@@UnfoldDataScience Yeah Thank you.
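One practical way to "try and see", as suggested in the reply above, is to cross-validate both criteria and compare; a minimal sketch using a built-in scikit-learn dataset as a stand-in for your own data:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion, random_state=0)
    scores = cross_val_score(tree, X, y, cv=5)   # 5-fold cross-validation accuracy
    print(criterion, scores.mean())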
Can I get your email? I'd like to stay in touch.
Sure, it's there on my YouTube.
Then which splitting criterion should we use: Gini or entropy, i.e. information gain?
Depends on the data.
Wouldn't ID two, with the loan amount condition true, be on the "true" side?
Yes, if it satisfies the tree splitting criteria. I will have to check once; please check the splitting criteria again. Thank you. Happy learning!
Aman, can you please explain entropy too, with an example like you did for the Gini index?
Yes Prasanth, I will try to cover that topic in one of the upcoming videos.
Thank you Aman
When should we use the Gini index and when should we use entropy?
It's a question of a tuning parameter based on your data; one may work better for some inputs, and the other for other inputs.
Sir, I think ID is not a useful attribute for model prediction, so we can get rid of it.
Absolutely, these comments give me motivation that my viewers are connecting with me. Here the ID is just for showcase purposes. ID does not have any meaning in data science models; it should be removed when we fit the model. Thank you. Happy learning. Tc
Sir, why does a decision tree give good accuracy on an imbalanced dataset compared to logistic regression?
Good question Ashish. It is because there is no mathematical equation involved in a decision tree, hence learning happens purely on rules.