Understanding Decision Trees (CART) | Classification | Machine Learning Part – 1



Alright, so in this video we are going to learn about decision trees and how we can use them to perform classification tasks. Just to be clear, there are multiple models in machine learning that we can use to classify data, and decision trees are one of them. You might have heard in classes on data mining, data analysis, or machine learning that a decision tree is just another kind of data structure that generalizes data in order to make decisions. What a decision tree does is look at our data set and build a set of rules from it. I have attached a picture of the tree we will end up with, and we are going to construct that same tree from scratch, using the theoretical and mathematical concepts we need to understand before we move on to the implementation. So this video is all about how decision trees work and the mathematical background behind them; in the next video we will actually implement a decision tree.

There are a few classic decision tree algorithms. The first is ID3, which works only on categorical (nominal) attributes; that is why it is limited: it cannot handle numerical values and it cannot perform regression. C4.5, on the other hand, extends ID3 so that it can also take numerical (continuous) attributes as input. And then there is CART, short for Classification and Regression Trees, which is the one we are going to study in this video. Why am I choosing CART? Because scikit-learn, the library we will use to implement decision trees in the next video, uses the same formulas that are involved in CART, and CART can perform both kinds of task: you can use it for classification and you can also use it for regression.

You can read another definition here: "CART is an alternative decision tree building algorithm. It can handle both classification and regression tasks. The algorithm uses a new metric named Gini index to create decision points for classification tasks." That is really the only formula you need to know for this decision tree: the Gini index. It is simply

    Gini = 1 − Σ (p_i)²

where p_i is the probability (relative frequency) of class i among the rows being considered. Don't worry if you don't understand this formula right now; we are going to take a deep dive into it and use it in a step-by-step calculation for each and every attribute.

So here are the steps I have written down. Step one: calculate the Gini index for each category of each attribute. Step two: calculate the weighted sum of those Gini indexes for the attribute; that is, we compute a probability (the fraction of rows falling into each category), multiply it by the Gini index we calculated for that category, and add them all up, which gives us one value for that feature.
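The steps above can be sketched in a few lines of Python. This is my own illustration, not the implementation we will write in the next video; the `rows` list below is the standard 14-row play-golf data set used throughout this video.

```python
from collections import Counter

# The 14-row "play golf" data set from this video:
# each row is (outlook, temperature, humidity, wind, decision).
rows = [
    ("Sunny", "Hot", "High", "Weak", "No"),      ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),  ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),   ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),  ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]

def gini(labels):
    """Step 1: Gini index = 1 - sum of squared class probabilities."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(rows, attr):
    """Step 2: weighted sum of per-category Gini indexes for one attribute."""
    total = len(rows)
    score = 0.0
    for category in set(r[attr] for r in rows):
        labels = [r[-1] for r in rows if r[attr] == category]
        score += len(labels) / total * gini(labels)
    return score

# Step 3: compute one value per attribute and pick the lowest.
for name, i in {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Wind": 3}.items():
    print(name, round(weighted_gini(rows, i), 3))
# Outlook comes out lowest (~0.343), so it becomes the root node.
```

Running this reproduces the numbers we derive by hand below: Outlook ≈ 0.343, Humidity ≈ 0.367, Wind ≈ 0.429, Temperature ≈ 0.440.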
You will then have one such value for every attribute, and out of all of those values we pick the lowest one; that attribute becomes our first node. We have to recursively repeat steps one, two, and three until we have built a tree that completely generalizes our data set. So let's go ahead and start working. I'm sorry if I'm going a little fast in this video, but there is not a lot you need to know here, and we also need to move on to the implementation part; this is just the theoretical part, so that you understand the mathematical background behind a decision tree.

This is the data set we are going to work on, and it is quite simple to understand: it's a golf decision data set. Based on some input attributes, namely Outlook, Temperature, Humidity, and Wind, a decision is made: whether to play golf or not. The output can only take two values, No or Yes, so this is a binary classification task. We give the model some input attributes and it makes a decision; what our decision tree has to do is build a model that generalizes over this data set.

Moving on, let's start with the Outlook attribute. We could start from any attribute, but I'm picking Outlook as the first one, so let's start with that. How do we calculate the Gini index for the Outlook attribute? First of all we need the probability of each nominal category inside the Outlook column. Outlook has three categories: it can be either Sunny, Overcast, or Rain; if you look at the whole data set, all 14 rows contain only these three. Counting the occurrences: Sunny appears 5 times (1, 2, 3, 4, 5), so we have five decisions for Sunny; Overcast appears 4 times, so four decisions for Overcast; and Rain appears 5 times, so five decisions for Rain.

This is exactly what I have written in the table: for each category of Outlook, the number of instances of Yes and No. Take Sunny: out of the five Sunny rows, only two have the decision Yes, so the Yeses are 2, the Nos are 3, and the total number of instances for Sunny is 5. We count the same things for Overcast and for Rain. With that done, it is time to use the Gini index formula. For Sunny, it is 1 minus the sum of the squared probabilities. So what are the probabilities? For Sunny, the probability of Yes is 2/5 and the probability of No is 3/5. We put these probabilities into the formula; since the formula has a minus sign in front of the sum, we subtract every squared term:

    Gini(Sunny) = 1 − (2/5)² − (3/5)² = 0.48

Let's calculate the same thing for Overcast. There the probabilities are 4/4 for Yes and 0/4 for No, so:

    Gini(Overcast) = 1 − (4/4)² − (0/4)² = 0

And similarly for Rain, which has 3 Yeses and 2 Nos:

    Gini(Rain) = 1 − (3/5)² − (2/5)² = 0.48

So we have these three answers. The second step is to calculate the weighted sum of the Gini indexes for this feature. The weighted sum just uses ordinary probabilities: the total number of rows is 14 (count them: 1, 2, 3, and all the way to 14), and the number of Sunny instances is 5, so the weight for Sunny is 5/14, which we multiply by the Gini index we calculated. We do the same thing for Overcast, whose weight is 4/14, and multiply it by its Gini index, and likewise for Rain with 5/14. Multiplying each Gini index by its weight and summing them all together is exactly what it means to calculate the weighted sum:

    Gini(Outlook) = (5/14)·0.48 + (4/14)·0 + (5/14)·0.48 ≈ 0.343

Now we have to repeat the same steps for the other attributes, Temperature, Humidity, and Wind, using exactly the same method again and again. We got 0.343 for the Outlook attribute; if you calculate for Temperature, Humidity, and Wind, you get roughly 0.440, 0.367, and 0.429. Now it is time to pick which node should come first. We have to choose the lowest value, and the lowest number out of all four is 0.343, so we pick Outlook as our first decision node, or simply the root node (it is the first node, so obviously it is the root).

This is what our decision tree looks like now. We have picked Outlook as the first node, and Outlook has three branches, one for each of its categories: Sunny, Overcast, and Rain. Our data set is now divided into three parts: the first part contains the rows where Outlook is Sunny, the second the rows where it is Overcast, and the third the rows where it is Rain. We need to repeat the same steps for each of these smaller data sets individually. So if you take a look at this Overcast data set right
here, you can see that all of its decisions are Yes. So obviously we can put a Yes leaf there directly, because even if we ran the formulas on this subset we would ultimately get Yes; there is no need to calculate. Whenever the input has Outlook equal to Overcast, we can simply output Yes directly. That is the first decision we have made, for Overcast.

Now look at the Sunny subset: the decisions there are mixed, we have both No and Yes, so for this one we do have to calculate. I will give the results directly, because I have already explained how the calculation works. We have already used Outlook, so we cannot use Outlook again; instead, within the Sunny rows, we calculate for Temperature (Hot, Mild, Cool), Humidity (High, Normal), and Wind (Weak, Strong). We calculate the Gini indexes for all of them, take the weighted sums, and end up with one value per attribute, out of which we have to pick the lowest. Out of all of these values you can clearly see that Humidity has the lowest, which is 0, so we pick Humidity as the next decision node under the Sunny branch. (In the picture I am ignoring the rest of the tree for a moment and only showing this part: we have picked Humidity for the Sunny branch.)

If you take a look at our data set, Humidity has only two categories, High and Normal (unlike Temperature, which has three: Hot, Mild, and Cool). Within the Sunny rows, the decisions under High are all No and the decisions under Normal are all Yes, so there is nothing left to calculate: we can directly attach the leaves, High with No and Normal with Yes. That means we have made two decisions under Humidity: if Humidity is High the output is going to be No, and if Humidity is Normal the output is going to be Yes, meaning we can play golf.

Now we need to repeat the same steps we did for the Sunny branch to calculate for Rain. How do we know that we have to calculate for Rain? Because it has multiple outputs, both Yes and No. If a subset contains only a single class, we don't have to calculate anything; we can immediately write that class as the output. But since we have multiple outputs here, we need to calculate in order to find the next decision node. I have already shown you how to repeat the steps, so you can do the same thing for Rain yourself; I believe in you. Once you have done that, you will end up with a decision tree that looks like this. So let's actually try to understand this decision tree, and there is no better way to understand a decision tree than to test it.
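To make the finished tree concrete, here is one way to write it down and walk it in Python. The nested-dict encoding is just my own illustration of the tree we derived: Outlook at the root, Humidity under Sunny, Wind under Rain, and Overcast as a pure Yes leaf.

```python
# The tree we just derived, written as nested dicts:
# inner nodes map an attribute name to {category: subtree}; leaves are "Yes"/"No".
tree = {"Outlook": {
    "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain":     {"Wind": {"Weak": "Yes", "Strong": "No"}},
}}

def predict(tree, sample):
    """Walk the tree until we reach a Yes/No leaf."""
    while isinstance(tree, dict):
        attr = next(iter(tree))            # the attribute this node tests
        tree = tree[attr][sample[attr]]    # follow the branch for the sample's value
    return tree

sample = {"Outlook": "Sunny", "Temperature": "Hot", "Humidity": "High", "Wind": "Strong"}
print(predict(tree, sample))  # -> No  (Sunny branch, then Humidity = High)
```

Notice that `predict` never looks at Temperature, which matches the observation below that Temperature ended up outside the tree entirely.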
Right now, I have created an input for this decision tree, so let's see how the tree makes a decision on it. Say Outlook is Sunny, Temperature is Hot, Humidity is High, and Wind is Strong. How is the decision tree going to decide? First of all, it checks Outlook. Outlook is Sunny, so it goes down the Sunny branch. Under Sunny it checks whether Humidity is High or Normal; our Humidity here is High, so it automatically decides No, you don't play golf. That is how the decision tree classifies data.

Now let's say I change the Outlook to Rain and the Wind to Weak. What is the decision? If the Outlook is Rain, the tree goes to the Rain branch and checks Wind: is the Wind Weak or Strong? Our Wind is Weak, so it doesn't take long to decide that the output is simply going to be Yes, you can play. And let's say I change the Outlook to Overcast: if you follow the Overcast branch, there is a node that directly says Yes, so we don't even have to check any of the other attributes.

By the way, did you notice one thing? Temperature is not even included in the decision tree. If you want to further pre-process the data, you can take a hint from this: Temperature does not really matter in this data set, because the tree was able to generalize over the other attributes and make its decisions without it. So from this model you can get the idea that you don't need Temperature for anything; you could remove Temperature from the data set if you wanted, because even without it you would still be able to make the same decisions, since the decision tree does not use that feature.

With that being said, in the next part we will be implementing a decision tree on this very same data set that I have just explained to you. We will encode these values, apply the pre-processing, train our decision tree model on this data set, and see that it is able to make the same decisions we made right here. I hope you got the idea of how the tree works. Honestly, this theory is not strictly required for you day to day as a data scientist: in most tasks or competitions you only need to know how to implement this in Python or another language. The reason I am explaining it anyway is that if anyone asks you, whether in an interview or just for the sake of knowledge, you should know what is happening behind the scenes. With that being said, thank you for watching, and I'll see you in the next video.
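As a small preview of that implementation, here is a minimal sketch using scikit-learn's DecisionTreeClassifier with criterion="gini". This is my own illustration, not the code from the next video; it drops Temperature (as discussed above) and uses a simple integer encoding of the categories, so the learned binary splits need not mirror our hand-built tree exactly, even though it classifies the 14 training rows the same way.

```python
# Assumes pandas and scikit-learn are installed.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# The same 14-row golf data set, without the unused Temperature column.
data = pd.DataFrame({
    "Outlook":  ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
                 "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"],
    "Humidity": ["High", "High", "High", "High", "Normal", "Normal", "Normal",
                 "High", "Normal", "Normal", "Normal", "High", "Normal", "High"],
    "Wind":     ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
                 "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"],
    "Decision": ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
                 "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"],
})

# Simple label encoding: sklearn trees need numeric inputs.
X = data.drop(columns="Decision").apply(lambda col: col.astype("category").cat.codes)
y = data["Decision"]

model = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
print(model.score(X, y))  # fits all 14 training rows: 1.0
```

In the next video we will do the encoding and training properly instead of this one-liner label encoding.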
