Advertisement
Not a member of Pastebin yet?
Sign Up,
it unlocks many cool features!
- # example - pszemraj/long-t5-tglobal-base-16384-book-summary
- - summary is first, input text is below
- - summarization model is on huggingface https://huggingface.co/pszemraj/long-t5-tglobal-base-16384-book-summary
- - the "input text" into the summarizer is ASR text on a lecture with vid2cleantxt (link: https://github.com/pszemraj/vid2cleantxt)
- - summarization completed on tesla v100 GPU, chunking input text into 4096 token batches. number of beams used for beam search = 16
- ## output summary
- In this chapter, Dr. Rivers explains how supervised learning can be used to predict future events with trees. He discusses the use of trees as a building block for more advanced techniques such as "the rainforest argument" and "tree boosting" which often have very high prediction. He then introduces a decision tree, a type of tree that predicts future events based on an indeterminate set of values. For example, if a tree were divided into two parts, each part would predict the other's future behavior. If it were split into only one part, then the predicted future behavior would depend on the difference between the two parts. This is called a "regression tree."
- Score: -10.1654
- In this chapter, the narrator explains how to use cost, complexity, and uncertainty measures to build a classification of trees. He shows how each leaf has an alpha that is proportional to the tree's size. For example, if a tree were divided into two parts, the first part would be smaller than the other. If the second part were larger, it would be more complicated. Finally, in order to find the best tree, he divides the training and test samples into two groups. The first group predicts classes, while the second group only predicts probabilities. This way, for every class, we just count the amount of samples that relate to the class. Thus, instead of counting the total number of samples, we simply count the sample that corresponds to "class" or "probability" as the largest majority vote.
- Score: -14.6073
- In this chapter, the UM uses a random forest model to predict air pollution. He shows how it can be used to predict the distribution of air in different parts of the world.
- Score: -4.4919
- Next week's lecture will be the last one, and it will give you a chance to ask questions about the exam.
- Score: -3.0964
- ## input text
- Good afternoon microphones and a good afternoon work compared today. Selection applied to multi various statistics. Today began to continue a topic diet. We start at two weeks ago, which is on classification. So we see Congressional Trees and classification trees into rainforest. Agree it's not just on classification, it's also very personal. So it's a set of modern supervised learning algorithms with usually very high prediction accuracy in particular terms of concerning random forest. So the view in the area of supervised learning. As it said, many techniques have classification and regrets national means. We start with regression services trees and we start with trees. What? Whether are trees about to trees or simple malls that still can account for interactions and non linearatires? So the polls we saw in the lecture number nine it was so logistical Graduate La. these are linear most. Unless the names says this, they do not account for it now. Liabilities: potential nonlinearitis and interactions. Trees are particularly useful when models are still require to be ichurchable so they can account for billionaires and interaction while at the same time being interpreted. Obviously, the sector has a certain drawback usually, so they have slightly lower prediction and compared ferences to rainforest. So having said that, trees are also used as building blocks for more advanced algorithms such as the ranoforestcl argument that we see today. Also, tree boosting which often have very highprediction occur. So what is a treat? We can describe there in various ways. We start with a visual description such a decision diagram that we see it here. So this is now just for one predictive alarm x and I guess it's prettyiself act contrary. If you look at this diagram, what it's called a tree may be a flipped up inverted tree with the leaves at the bottom, but apart from this. So how does this work? So you want to make a production for a variable win based on a predictable x and now if you have a specific value for it that x you just go down to decision there. So if X is for instance, let's say one, let's take one point five. So here in the first place you decide whether it's above or below one, if it's above one, you go here and if it's below to which is the case. From one point five you end of this leaf here and it predicts one point there. So once your terminal leak or predict will be constant. but it depends on how you make your position in feel. Hell is there based on this. But because of this trees are also called decision trees. Now let's have another specific in to begin with for a regression tree which also shows you rages. She's some first intuition on how trees are fitted. So what you see here on the left end side is to date it's drawn as a line. Points are connected because also for at points. Now it goes without saying that the linear model is not the best model for this state as the wad so here. So how does Hit Tree work if we're not doing a split at all and its left at all or we do as we take the global average of world diplomats here. In this case our protection would be one point to know what a tree does. When we build a tree we make split along to predictive roles along the tax robbers we will see in a few minutes how to choose such as splitting location and the splitting variable. Let's assume we do split at one Now the protection if we are below one the average in this area. One point for if we above one it's the average in this area zero point seven. That's not yet a very good model so we continue doing splits. Let's do any one here. its zeropoint there. If we below nonprofit there it's the prediction's going to be monitored to offers between zero point there and one. The prediction is the average in this area which is one point nine if where the next split is made at two. if we are between one and two the protections the average in this area its one point to if we are above to we take the average in this area which is zero point there. in this case life finished here. we could continue maybe let it would be appropriate somewhere between one and two, maybe. but this is just an lilustrative example. We can finish here. So this is how a regression tree works. So you see our predictions in every leaf are: constant war, peace, vice conditions Used an example on a classification. There is a famous data set whether you survives to Titanic accident or not. So we have data for one thousand persons in total and be used to following convention for displaying the data On the left hand side of the always have the non survivors number of non survivors on the right hand side, the survivors. So its eight hundred none services to hundred survivors in total and the prediction is a history vote. For forestry we could predict probabilities which would be just taking the average of it. Yes, the fraction of yes samples in the data hears zero point, two or twenty percent. If we also make a point prediction the prediction of the majority of votes Here it's not because the majority did not survive. Now we make splits along hubris in the stadium and one of the verbuses that is in the data. sex of mainland female here in this tatu. In those there were two hundred women and eight hundred men. One hundred to fifty is the ratio of non services to survive for women and six hundred fifty to hundreds and fifty for men. The most of them have less than fifty percent of survive chance. so are project was to be Not Now we make another split. So this is the album. Who decides to make the split here at age thirty five? Now if you were a woman of age below thirty five, there were twenty on board and seventeen did survive so on. While the probabilities were quite agree for this group and above thirty five, the sobabo poor village was very low. So third were divided by one hundred nightly. the prediction would be known. We make another split for the man group at age twenty seven if men had an age below twenty seven, so the will chances were quite between one hundred and three of two hundred men did survive. And below the twenty seven, it's your wife. And if you remain above twenty seven, you had a very small civil chance. Out of six hundred, only twenty did your wife. That's what we see here. We will see how we make these splits in the following that we can also see from this program: misclassification rate in some So based on the data that was also used for training Dirty so we can count the number of misclassifications. So it's there for this leaf because we predict goes, but there did in fact not survive thirty there. For this leaf, we predict no, but thirty There did survive seven to year twenty. So if we sum this up divided by one thousand which is the total number of samples with a moderate misclassification rate of approximately twelve point. Another approach from Twelve Point Six percent. We can also see this separately for the Yes and the No class and get individual Euro rates for the two classes as well. Okay, this was an introduction and now we're go in more to the details. So mathematically a terminal can be written like this so its say some of cardinality. The summons or cardinality teased a number of leaves also sometimes called the size of the tree So below. So before here in this example the number of sleeves would be for there was the size of the tree so it some observed over these leaves and for every leaf the prediction is constant. So this is either a probability for regression trees is just a value, a real value a real number and then we have an indicator function which decides in which region the point X lifts. So arm is a binary partition of the entire space prop. So because of a predict eats and arm farms leaves they form foreign foreigners of this space and they they form partition of the space. A binary partition will become clear on the next slide by what we mean by this. So what we do is we essentially just go down the tree in mathematical from any years. So we just check in which leaf are we if the indicator from or in which regions So region leaf use is sometimes also cell these are used to interchangeablyand we check in which region we are and then we make a constant prediction for every leaf. The regions are recent handles of the following form here. So how do we learnt such a tree from data and how can we better understand to from us? So this is now under next style. So here below you see the notation the tree diagram decision diagram that used into introduction which is very intuitive. So one of the branches of streets is that they are very intuitive. You can explain this presents it to none. The people is not a strong statistic background, they will understand what you're talking about. So this is the reputation here. Now here this is two dimensional Predictive herbs are two dimensionals. So to predictive erbls x one need two and here you see the partition of the space with the rectangles. Five rectangles here in this example. So as an example here if we take or one for our one x one is below it one and x to is below to two So below the two here and below to one this is our one. Let's take another example partition soy region there are there well that this is X one is above to one and below it there so ex. one is between the one and it there. So this is an example and the interpretation here by the way of the condition here is fulfilled. You go to the left and if the condition is not fulfilled, you go to the right when watching the tree if you're thinking you're driving down the tree into the other medium years below here we have visualize the coefficiente the so as if very track signal. We have the coefficion of protection. It's constantly an every dark tangle. so you see here this peacewise constant function. And one thing that we have to mention here is that when we build such a partition, we restrict ourselves to reclusive binary splits. So we only make splits on one point and the split that has been done can not be reserved. So signs such as these, even though this is also a partition of the space we will never see with. I mean there are various different trees, but Almost is one of the ones that we see. These card trees do not allow for obtaining such partitions and that's also what we mean by when we say partitions. the rectangles regions are rectangles of from here yes you and is tan for lowering exactly So you and all is there reason why we spring son is that's a good observation. It I'll probably be better to reverse them and it will take up this insult just now there is no do not see, do not think this is there is good reason that was probably just yet yet. Thanks a lot for letting me know he so the use and dials would be here for instance to region on these values here and these values here so it two and zero and it one Yes Now how do we find this partition and how do we find the leaf values the projections that comes these coefficients well recreation he went to minimize to replicate some of sparse so something we do in ordinary be spares for feeling regractiong law to reliable sum of scraps. Now finding the best binary partition is automatically inevitable. We can try all possible combinations so we will use to give the approach meaning once we have many of toys we can not revert it anymore. This is still an active area of research finding optimal decision trees. They're making progress there but but nothing of them is widely applied and we restrict ourselves to greedy approaches. The full combinatoriiss solution is not possible. There are approximations that with strong of high probabilities will find the solution but but its challenging. Now before we go in to the details how we find the partition. what we can say is that if the partition is given finding the coefficient is belief values be easy we just take the average and every leaf of the date of So for region number me where in be denotes the number of samples, number of observations in this region of just some or all response were absent, Divided by this number of cents, this sample mean imperial average. Now how do we build such a partition? How do we find the partition? The Green algorithm birds as flu. This is sometimes also called card classification and regressionatry algorithm, so it works relatively so, but denotes the set of contains the partitions at the set of the regions. We start with the tribunal partition comprising the entire space that was the global prediction when we did not make any split at all in the first introductory example. and now what we do is for every variable job. So J goes from one up to p and for every split point as we define two new regions, one and at two or one is the space where x, J is bilori, potent is and or two is the axes in the space for x, A variable job is a boss so larger than this split point's Now for every variable job and's we essentially see what happens if we make a split there. So what We know us that if we do it a split like this for rare, big and variable's then we have these two new regions and then we can easily also get the new prediction because the credit that they deal with the coefitients are just the empirical average. so we know the effectiveness in these new regions. The principle average in these two regions. Now since we know the efficient we can pluck this in into the receivable sum of affairs. This is what we obtain here. So this is the first region in the first region and this is the second region so hear the resettle. Somewhat scarce in just two entries since we at the beginning and having just two coefficients either if you are in region or one or are to know well and now since we can calculate this for all variables and for all split points's and we just take the carbon the splitpoint's for which is minimum. Nowyou might object well trying all veribul set possible in my desert and finite. but what about if j is a continuous verbal trying out all split points? That's not possible. that's true theoretically speaking but in practice we have just in data points so we just take the midpoint between data points the midpoints and if you move between two data points nothing changes because it will have no difference on your relative sum of squares. So we just have try a finance number of of spot points also for continuous service and we can even make this easier. So we have the large sample millions of data points trying out every different a staring up one million different clitpoints is still automatically very demanding. So what you do is you take for instance questions of your X routes ormilion one hundred anybody tires instead of these different Its so once we have found the variable job and the splitpoint which minimizes the receivable sumeof squares, we have obtained the first partition and we have found two rectangles and then be continue in every direct tunnel. We repeat this step for every recent denial. So in each of the resulting regions we continue this. Splitting procedures that is described here in in Point to You I've mentioned here the first one. What do we do with continuous variables when we can search of cross battling marbles? If if it's a discrete carnival we just have to really ouptions of viruses that we can obtain so we could just try all of them out and we stop the splitting procedure when a minimal milestone is obtained. So for instance, five so minimal notesites would mean the number of samples in a note or in a leaf or in the region. So if it's different names for the same thing whites is the convention that is used in the tree leteror note leaf region. So anyway once you have less than certain minimal notes as so for its life you do not continue splitting his third note any more because as you might guess it's going to be difficult to to obtain coefficients that have another superhybiangs. Speaking of this, if we do this and we grow a large free until it every leave there just five samples left, we will usually do what is called overfitding are heard of advertising before raise for hand. that's the majority. So overfinning means you fight your training data very well. resolve actually. but not your new test data. So you see here typical situation Very here in this graph weather the complexity of a model athlete. This is just a qualitative graph of the beginning with very simple models they not good. Usually so here in terms of error, once we increased a model to common complexes for trees would be the depth of the tree to getting better. our training era becomes lower little some upscarse in the training data but also the test error when we get predictions for new data or but at some point so the training era will always go down. But at some point we start to explain noise and not signal any more And once your model starts to explain noise this is bad for future values which a temala has not yet seen less than the kind of distracted of noise and the training air goes down. But the test error is the error for predicting new data increases again. So this is visualized here by these tree different models's same data. A two simple model is not good. A too complex model which interpolates the data here is also usually not good because such a quadratic model here is usually better. You do not want to explain the measurement noise here or some random noise assignment so this is an example of of overfitting danger. So how can we avoid this with trees? Well, we the first build a deep street until in every region we just have five least five samples left and then we apply what is called pruning. So we cut back the tree so we cut back the leaves. The question is obviously as we can always cut back, but when do we stop? how much should we cut back If we cut back alot will be very simple but may be too simple if they cut back not enough. The model might be too complex, but who knows what is too complex. So for this what we do is we apply to can Costs complexity criteria and in a way this is very similar. If who as heard of criterians such as Air d it before, this is very similar to us. Certain it's less household and say statistically motivated. There is less theory underlying these quarters, but they work very similarly. So the cost complexity quarterback is given here by this expression. So what do we have here may be to start with the second summoned here. That's the first. So the first one is the total cost or goodness of fit. So we have to sum or all leaves, some or all leaf and for every leaf we have a impunity measure so called impunity measures. So for regression trees this is simply the means of this partition. The sorry this should not be called partition but reason the partition is comprised of the whole set of regions. Sorry for correcting the sum. The fly I should not do this but since I'm real listing it only when Now let's do it. So this is the Menopause in recent with the multiply belief the number of scams in this region up. So what you have for regressionis essential just as some of we steal compromises. But why are we writing it so to complicated? Well because we also not applied this to classification where we're not using the receituale sum of spares and there we use other impunity measures. So queue up is an important measure for cell origin under ad For research it's just a measure and we wait this by the number of samples inclined me. So this measures the goodness of it Again for regression it's just a rescue sum of squares. the more complex you more the lower will just be. But then we have another term designated the models Complexity Conservative Tea is the size of the tree It that's the number of leaves. So the bigger they tree the lower the first part but the higher the second part. So you see we have this tradofeer between having a complex model and a model at just to data well and the auto parameters. The auto parameters is a tuning parameters. This sees the model goodness of fit on the one side and themol size or complexity on the other side. So who should? how should you wait This straight off and this paramater alpha that's different difference with him. there is no such winning term in to big There The waiting is done by mathematical theory where we do it. Simplicity: So old's chosen using cross validation Yes So and get this wrist and as follows. So we iteratively remove a leaf such that the class complex generated is lowest. Always remove the if that minimizes the cost complexity most and we stop once the criterion starts to increase them pruning the yes. So here is it altogether the summary of what we have done so far we use to Greedy algorithm to grow a large tree. We apply cost complexity criteria for a given alpha for a large tree to obtain the best sub tree. The question the how do we choose Altha When athlete is chosen by trust solidarity so late we divide the data into training and test data. The algorithm never see the test data. We make predictions on the test data, calculate the precautionary agents and by training on the training data we mean growing the deep tree and applying the pruning for the alpha. So the predictions are done with the pruned smaller and then we choose the alpha that has the smallest autosample error. Once we have found the alpha we returned to sub tree. The prune tree with corresponds to the best alpha also here. so Alfis continuous various continuous number remains a real number covert from positive real number. So technically is speaking that can be an intimate a number of of appeals. but it turns out since these trees are instead we only need to consider a a finite number of appeals yes any questions so far on regression trees So we will now extend this to classification. These classification brings a various similar regression there a prediction now in law spoke predicting probabilities in those of classes. classes are usually predicted by a majority of word once we know a partition of the state, production is again easy. Probabilities are simply empirical probabilities in this region. So for every class so just count the number of samples that correspond to the class to divided number of as in relief predictions of classes instead of probabilities with just taking the probability as the largest value majority vote So once we have prescription is easy. how do we build that partition the similarly as for regression trees which some modification has illustrated here. So both in the greedialgoritm for finding the partition and also in the pruning step instead of the release sums of squares that we have used before, we now uses a different total cost function whenever. replace the measurable error by another impunity measure and there are various ones of these of the first one year is the misclassification error gaining exes In one of the worst Attorneys Trust events. it's the third one. Often we use the Gene index for partition building and the his classification air for pruning. Why is this often done? Both the Gene Index Once you look at it in more detail here, you will see that it favors what is called pure participants so it favors performances that never eventually have most of the stands coming from just one glass's good on for interoperability andthis is the main reason it's not necessarily always the best choice. He here you have an illustration in terms of Pea for binary classification One of this function off four different impunity measures. So to understand this insecurity measures better how this works for classification, let's have a look at an example. We want to predict whether we have a side effect or not. After a certain treatment with one person, one hundred persons, fifty website effects, and fifty of outside effects, we have information on age and sex. If we do a split on sex, we have three, seventy men, thirty women. We can calculate the gene index in both regions. We multiply the going by a number of samples. so the total cost is forty seven point five if we do as split on on age. For simplicity, we certainly have just two categories handled young. My knowtis is over simplified, but it's an introductory example. Sixty old persons ten to ten had no side effects and head side effects gene index here. zero point to eight and and for the young ones, you have to calculate the gun index yourself and the toll cost. So this is now a cricket question. So again, the question is will will we do a split on ex or he depending on for which one the toll cost is lower and you have to calculate for this, the Gene ingredient exits the predicted probability of class K takes one machine biking because of red classes, so predicted probabilities are easily obtained here. For instance, for the no Side effect class, it's ten divided by x total number of patients in this leaves. We need to do this for this leaf as well and then you calculate it total cost. And based on this, you know whether we should do a spot on age or sex first when we build a tree. And as I said, this is now a slicker question. A half an hour ago, no one already locked out. That is too fast. That's wrong. There we go. Let's also try to have the definition of the genuine next. Upon this flight Arts Pass of Splinchford's let's lay three more seconds and that's very well answered. It would in fact be done in and I hope you dead the calculation on track and did not have a look at the next slide but I assumed that I did the calculation correctly. So here's the solution. The gene index for this leaf is zero. Predict the probability is either zero or one for the two groups for two classes. so we have two times zero times one which if we snub the sub gifts zero. So the total cost for this there is sixty times deep eight plus zero which is clearly lower than for the other tree where we might split on sex so we choose to split on. Now let's also have a look at a pruning example. Assume we have obtained a deep tree. It's not a deep tree, but it's just for illustration. Assume we have obtained this tree here where we first split on on the barbo and and then we split on the variable height tall or shorts. Just two categories for simplicity you see here we have misclassification errors of zero in both please Presidential leaves. So we have a pure tree here. If we prune the tree and cut back these leaves, we end up with the smaller tree which in this levy have a misclassification era of ten divided by sixty. So the total cost here for the first tree is zero and the motor complexity is if we choose an half of zeropoint five as an example is the three sizes which is there three leaves here so it of the cost complex run is one point, five for the first, three for the second three the total cost is sixty times. This is classification error but the total cost is much higher than model complexity. The termines slightly lower because we have only two leaves instead of street, but the effect of the total cost is is much stronger compared to the mall size so we prefer this deeper there. even though it's slightly more complex slightly more complex model, the additional complexity of the the model is a virus. while at least for this all of their point five so we would clearly do not prune the tree and remain with this large tree here. So let's do a break at this on it would continue in fifteen minutes. We'll come back after the break. We in introduced classification and regression streets in the first hour which are relatively simple models that allow for non listeners and interactions. A big advantages that they are highly interpretable used so how we can visualize them you can present them to non statisticians for instance, non people with not a quantitative background so to speak. We have no background of the methods here but they will still understand them. So now that we have covered the basics of the method tologyat's have look how we can apply to assumed the are part library in the or part function. Our part stands for resources partitioning so crash a term builds a partition off the space. That's where the name comes from. We also call this partition familiar formula notation. We used the air pollution data where we want to predict the pollution variable which is so too to if we use the rapt function. If we obtain this tree here, this is the tree written out as the usinverts. It's better to plot this. It has a nice plodding function here as well where we can add the labels. So pursued a number of relationships in a leaf to predict value to see is and the decisions here. So if the condition is fulfilled, we go to the left and side of the tree. Predictions can be done with the predict function here. prediction for three data points. This was a regression to export are let's also quickly have a look at the classification tree. example. We used Thrift, The first stands of three species. I've not mentioned it here above, but the Rapt function automatically does counting and chooses the alpha based on ten fold trust validation. This is the model here we have here. Yes, we can also avoid doing running. so if we do the alpha parameters maybe let's put it off. Here corresponds to Up. So in the apparent function, there's an argument up. This essentially correspon to our alpha. It's alpha bydivided by the goodness of fit of an empty tree. But it's up to a constant multiplative constant. It's our alpha. So if we tell, use no Cost Complexity pruning. So we set Alpha to zero the minimum number of samples to make a split this one then we obtain a deep tree. This is our deep tree. Here we can also visualize this cost complexity generation for different Alphas here you see it and then also do maternal pruning. For instance, every one to used an Alpha or cup of zeropanntheotrorry yes protections. Again of the predict function, we can define that we want to predict probabilities, type, probability, or classes of the class argument Yes, this was very briefly on the pre functionality and or other software packages have similar software implementations have similar functionalities whenever of questions let me know. of course we now continue with the next topic is tomorrow from it mentioned it in the beginning. One of the reasons why trees are so interesting is because they are used to building wall and some of the most successful because a modern statistical learning or machine learning alguri in an a sense which successful remain prediction accuriousy and around the forest verkesis first. So the forest consists of several trees so the name is quite self explanatory. so life obtain several trees. We will see precisely how we obtain these several trees. Let's see we have three trees's hard to pronounce. Anyway we have these three trees and prediction is then being done but just taken average Overall and for classification this corresponds to a majority vote. So if two of the trees vote disease and one vote's aunt, diseased, healthy and the majority votes diseased. So now let's assume we if these tree trees and we have some new data with these attributes. If these the predictive roles, that's what we assume here. and we're now doing another cricket question to select the statements that are two there. One classifies the new data's heallseet there to classifies the new data's health as the majority rule of the rating. Forest is diseased, took's fat. that's why it served in most scheduled reason that's very well done. First tree gives disease as prediction secondary helsy, third one diseased as well. So two times diseased. majority of votes diseases. that's how we obtain lose results here. So this is a functioning referendum force mall. Once we have trained the mostly for predictions, but there was the prediction steps to know how do we obtain these trees and that's a little bit of matching. I mean they're explanation. Somebody in the random force actually worked so well and there are very interesting explanations. We could discuss them for for hours one at one. Let is for sure another month so we can not go into the details and we will see some of the explanations. So rainforest in the technical bagging which we're not covering here, but bagging works very similar works by first building. Capital B is the number of trees deep trees each having low bias but high warning. So deep tree has low bias at high variants and incidentally distributed. So this means they are based on the same date and the algorithm. Also why distribution The tree is the terministic thing, but we will have some random this year also depending on whether you consider the predict the robes random or not. but because we do some bootstrap. So let's start from the beginning and then we'll hopefully become clear why we stay here incidentally distributed. But what we do is once we have these capital by trees, we take the average order you have done before as well. The bias if you do an average calculation remains the same. expected value is leaner so the bias remains the same as for the individual trees, but you reduced the bands. So here's to random from algorithm. The first ingredient is we get draw a woodtrap sample. Who has heard of woodstrap before? Raise your hands. That's the majority here. So I'm not covering everything but bootstraps. So a bootstrap sample from the training data is described here in this slide as well. So we make a new random data set from the original data set and we draw end times with replacement data from the original data. With replacement means once we have drawn a random set from the data, we go it back to the data so we can grow it again So we will also have a data set of size and but very likely some valueslill occur multiple times where as others might not occur at all. So it's hard of your original data to introduce some randomness which gives you some new data to which found the boot drab sampled by sampling with replacement. That's the first step. So in the sense every tree sees to some slightly different data but the way the data is general is the same mechanism so idea tacit distributed is part here. Now we grow a tree based on this books drab sample by doing a century to recipe that we have seen before in the first hour. The only difference is here highlighted in the sorry toilet when we select this stifling variable. So remember we are doing a trial there for finding splitting other split points when we select a splitting verbal, whenever one can use split we are not considering all variables but only a random subset of the me. So instead of looking at all but meters we only focus on the randomly selected verdict for times or smaller usually than but can also be up but it's usually better to make it more than the people. So draw bodies trap sample and whenever you make a split do not look at all bars but only look at a random selection of a word. So why the heck would we do this? Give you some explanation? It will give you some explanation in the following. So the Accord is a space ensemble for a collection of capital public trees As it said before to make predictions for a new point to just take the empirical average to the average. Overall trees for probabilities is the same for if you win uptime class predictions will just take a majority vote. So this is summarised here once more, in particular with focus on the standard regression tree classification on regression tree as we saw before. So every tree is trained on a boot strap sample of the data. Every tree sees slightly different data and whenever we do a split we consider only with randomly selected Wiralds instead of all. But this is a tuning parameters of usually is chosen by Cost Solidarity. that's one of the big advance random ports. It doesnt have very strong tuning carmeters in the sense that if you choose them wrong things can go very wrong. It's not usually not the case. Also, default values work well and when they changed and the results change a little with but not too much. So it's really a method that can be applied out of the box. but the rule of some here is for a classifica action we use the approximately spirit of p and to aggression the but shirt with different minimum note sizes when you grow the trees. And yes, we do not promote the trees so no pruning. We make deep trees large trees until minimal notes out of one or five. This is technically speaking for ruining parameters but here it's either one or five classification of regression So let's better understand how this work by looking at the bias brain straight it whispered if the bias varying straight off before. It's also a very common team in statistics and it's actually related to overfiding as well in looking at prediction accuracy or in Generals W. So the means square air can be decomposed of the means square air of your predictions but also of the parameters of any of anything that one estimate or predict but put her prediction can be decomposed as the variants plus the bias squad So means your ear as two components, bias and various use and variants of. This is a very virtuosic illustration here. So low bias and low variants means you always correct on average and we have very little variation across different sons so every point here is a new data said. In practice you having only once mayor said I know but let's assume to bring it repeatedly and you fully look how good is for poll. This is still having low basis because in average your very accurate but the variant is larger there is variation. This is the situation where we have low variants. You're all always concisely getting the same result but it's wrong. Unfortunately it's have high paid here and the virus situation is when you're off a wrong on average. But you're also having high bags. So this is highways and I bias. You can sing for yourself in which corner you want to be. Usually if this one warns habits, what's worse of veterans or habits? Difficult to say. We usually prefer lawsuits in slightly higher variants than the other way around. It obviously also depends on the magnitude. Often there is a tradeoff one is to make between having a low bias or being correct on average and the low banks. and this is related to modern complexity. So the more complex you're models, the lower your bias but the higher is our bariance. Or I can also say the simpler your model is the lower isourvarianse and the higher is your bias and the means are error is to some of these two and well you want to find this point here where we have the optimal combination of bias and variance to rational classification. Prices have been produced in the first hour. Account for dhisbyous variant straightoffs by pruning so by pruning we cut back the tree. It becomes simpler with a deep there where some where here low veterans sorry low bias but high vans. If we cut back that tree we increase the bias a little bit. but we released avarianc and it is there while to do this up to a certain point because the biasquared is decreased is increasing less than the decrease by the invarians. Know the random from the magic of the random forsalogrom is that we have deep trees with low bias to remain here with low brains. but we can still decrease the variants so we do not have to go back here. It seems almost like we tricked this. Five lands straight off, but it will become clear in a second why we can do this So I mean why does the bays remained the same will if we do an average the bias is just an expected volunteer straight kids. The true so it's near to the bias of you is ever just the same as the bias of her individtory. But why will reduce the vines And we do take this average well. if we calculate the variants of of this original but production of this time of this average of the different trees, well we have to take into consideration that the different trees are not independent. There is dependency among the trees. One they used the same data. We introduced some independence so to speak by using a boot sharp sample, not seeing exactly the same data and whenever we do a fit beyond lies all the animal varies out of the pop total viralas. So this randomness source introduces some independence or decreasedependence. But they are not completely independent. so the variants of the somewhere of this average can be written as this expression here and I think we have time to briefly it be a look at where this comes from. So we will briefly switch over to this document camera here to and this is flipped I have to be change to variants of one or B. I would have to write it of x but I'm not doing it savoring a literus notation. can you read this in the election? who is to too small smoke so it will change this. Okay so variants of the sum of random Rubles or first of all this term they can take out on or be square. So variants of the sum of random rubles as the Mdablsan over the Conservatives That's basic probability and we can decompose this in two parts. Conferences of when it equals us this are them just the vacancies And then the true conferences is to speak her with some over all eyes and as her eye is not able to go after conference between two trees to and let's assume me right the correlation as raw. So then the conference is can be written as correlation pints, bags the same clean section onthis is correlation road. This is the way from bad handwritten yes so this is one over be square. Since they are immediately distributed it's bearing to signal square times by plus. Here we have some of public Minus one times by summons and they all have the same value. It's real time sickness are and telephone is just juggling around. If the terms this is some this up we obtain here True Times Signal Square plus Stigma square One minus road divided by capital by I Will also put this as a handup on my own many's time switching back to the slides day so we know the bias remains the same fragrance as given by this expression here. Now how can be reduce Democracy While first of all if we make the correlation smaller so if boy's smaller correlation between two trees then the variants will be smaller And this is why whenever we do a split we consider only the randomly selected valves. So this reduces the correlation between different trees because when whenever we grow trees we will have different grounds considered when we lose gilts. So this reduces the coordination to a different and the second term is a trivial term. This decrease if the number of trees increases so we can let this go to zero by just having by large enough. This is also by the number of trees is a very simple tuning proprietary rate for value technically large enough. He will see later on that a certain point does not matter it the sort of fence converged on and the brains here is this second point is essentially zero. A nice feature of Around the Forest algorithm is that it has a what we would call internal cross validation functionality. what is called the Out of Bacgair. This is very similar to cross solidarity but without any additional computational cost. At those this work let's assume this is our data. It's just seven data points. Against this is toy data. Just seven data points. Now when we build a tree, we first draw a boot dress sample. So let's assume our bootstrap sample is this sample here. Also, seven data points. You see some of the data points are selected twice and these three points are never selected at all. That's by coincidence by cancer we brought remain woods. Now reinforce your algrum prices by building a tree based on this booth draft sample. So we get the tree. Let's assume we have this tree. but then we realized that this there has not seen part of data. All the samples that are not selected in the boot shrapstep. This is data that the tree has never seen. This is what is called out of back data. so we can make a prediction for this data. but we also know to choose so we can calculate an airport. and this is now an out of sample narrator. So this is not having the danger of an overfitting issue having here. and so this is a outosample data. It's called the out of back air rate and so we can do this for all trees. So every tree has a slightly different out of back data, but we can calculate this out of back air rate for all trees and they take an average overall out of bad areas of all trees. Yes and this is very similar. Cross validation We do not have to refit anything so no small, just be trained several times. Its just part of the algorithm that's very nice now. Team Force and also other methods Container methods have the disadvantage that even though they are often highly accurate in predictions. A The disadvantage is that they are what is called a black block small so they are difficult to interpret. They well understand how such a mall works well. You can look at every tree and understand the tree, but if you take average of one thousand trees it's almost impossible practically speaking to understand how this works and so interpreting black box mall as a whole field for itself. We could probably do an entire semester course. Still also intact the area of research quite some interesting things and appeared there be only looking very briefly at some options and the simplest one is variable importance. So of variable importance is a summary measure for very variable that tells us as the name says, how important this is variable. Conceptually this is similar to A values or k statistics from leaner models, but only concepsually or there are different ways how we can assign importance to different variables. We will see one that is based on plummeting the out of back data. It's often a preferred way. Another one is relatively simple. Whenever you do a split for a certain variables you keep book of how much the impunity measure is decreased. Do your bookkeeping and just ssomis up for all bonus for all trees and then you say all because of this relevance is dead marching together of the other mill of that much. Then you get the ranking or not just a ranking score for everywhere. Attendant's also something that we can do. Another one is based on what is called chop values but we're not covering it. Disclosures: Let's assume that let's say they all have their advantages and disadvantages. For many data sets, it does not really matter which one you use, but then there are counter examples in particular of their coronation on the features. Financial erabs it with going too much into the details. Here let's have a look at the estimation based one which is a very nice option to do feature. Importance of dust is work on this is your data. Know what you do is for every tree a backdrop is so so. We have public times, different boot strap samples and we also have out of back data. So every book at found or every tree's data weather has not seen technically speak in every data point could be selected just once after the data is exactly the bootthrip example is exactly the data and there is no out of back data starting from a reasonable size sample size. This proibinge is its are so practically this is no the nation So we build a tree based on the boot scrap sample and from this we get the other bad error for every tree. that's what we saw are so far as well before. So we get the out of back air by building the tree on the training data and the boostrup data and then looking how good is it on the out of back data and now what we do to find out whether reversible is important or not and how important it is Well we have here deatof bags are we make predictions. With this we see how why the predictions are on the ideas, the falling stance. So if the various important or not we look at when we replace this war it some alogus sense, not the correct rules, not the correct values that were in the data for this role but something stupid. It should not be too stupid because if you do something to stupid you're doing something really stupid. That's so practically speaking why we're doing permutations. So permutations is a good thing. A good way for doing something that is not correct so we still not it. So the main issue is this: we still not have really stick values for everybody. so if we just take the average for instance, it might be that treasure actually having values that never occurred in the data events. With permutations. there is sometimes critique because of correlation of your taste out of condex shape, but that's a different steroid and forget that. I'm just bad it because we're not covering in the course. so. but anyway, so we're not just taking the average because the average might not be a good stupid say so to speak. So we do perfume the varies in survivors high. So let's say the botched upsamp has several later points and for rarable eye we just shuffle to data plans. so then every diplomat will not have the correct value for deferable eye but some other value from the data. This makes sure that it's a realistic value something that it has occurred, but it does not correct. On to this dateophone so it's not the real true value. So in that sense once we picture the variable were distorting the varab in quotation mark, doing something stupid for this rebel eye and we're making predictions. Once more of these permitted bills and we do calculate this out of bigger again and then we look at the difference between these two. Now when do you expect the this different to be large, when variable eyes important or when it's not important, sink for yourself. When do you expect this difference to be large? When a variable is important or when it's not important. Any suggestions if it's important, exactly if it's important, it contributes to prediction accuracy and when we change, distribute or perfume it with putting wrong values there, then the prediction are it should decrease so we expect to see a difference exactly. Can we calculate the average decrease overall trees and we standardize this because we can also calculate the variants of these pieces also to account for for instance, the fact that the out of back datasets might have different sizes also for so that's why we standardize this. So we divided by the standard deviation and this is now the promotion based variable importance. Yes, the other variable ports are mentioned before by just bookkeeping counting the number of not just the number we can actually use account how often is available use for making a split. but the one thing is mentioned here is summing up the decreases in purity whenever we go split for its old. this has some disadvantages in particular if various continues as slightly upward bias compared to categorical labels and so forth. Again, this is in parentheses. In my experience from many applications, these variable importance measures tend to agree, but they are obviously a cases where they do not agree and if we have to choose between option two and one in prefer option one here which I've just described true in slight zirdithree yes. To conclude and summarize here, compare trees and random Forest senior productions of trees tender of high variants that's a disadvantaged ndlargeso to of means they're not that accurate. Corporate rennofores, reformists have smaller prediction variants and or have more higher prediction accuracy. Another advantage that Trees for the mean linear models statistical models because we know how to know them so far and this no course do not have tuning parameters at all. But once to go into the area of statistical learning machine learning, you will see that there are tuning parameters.' M not sure if you have heard about counterterrorism, deep learning or tree boosting as well things such as excess. They have lots of tuning parameters and there the tuning parameters are really important. If you choose them wrong, it do not work nicely here. it's not that important for both methods. He's very nice, what concerns and the handling of tuning parameters and the big advantage of trees is that's what it's called a white box mall so it yields inside into decision rules. You know what's going on. You can communicate the easily phantom forces more. A black box is difficult to understand even though their techniques for understanding the model's better. but it's never going to happen that you fully understand them. Yes, the remaining five minutes are for some examples. May have questions before its La mine we used a random forest back is here. There are no variant packages as another one called Ranger. It's a faster. It's not the fastest out there so often has a faster implementation in psychic learn. but for this course does not matter for reasonably size. Devastates does not really matter. So we used to random from function here for the air pollution data. The output here is the here no that did not break short for this year so we obtained this output here. Let's have a look at the classification example as well and the data that we use is an interesting data. Its fragments of glass influences work on the glass compositions or the composition of the glass fragments that were found chemical elements in the glasses. We want to predict the type of crime scene, the type of glass sorry so window float, gas, vehicle, window, glass containers, taiblewaar, and so forth. So in total there are seven different types of glasses. and we want to predict these glass tubes based on the chemical decompositions of the glasses. So you see this is the first. This is how the data looks. one to three for five, six, seven, eight, nine variables and this is to predict the response by two thousand and fourteen samples. If we fit the random forest, obtain this model Here you see here the out of bad airport also has a nice plotting function. Its makes larger here. So what we plot here's the air rat versus the number of trees. Let's have a look at the black line. that's the overall air rate. We also have an air rate for every class but it's the mostly interesting, the overall air rate and what you see as it goes down it decreases and at some point it tells out. We can say the there has converged. That's way where the vaccine reduction due to the second term they are in the balance. This composition does not matter any more. Be is large enough so hear how many trees would be used probably around one hundred and then we're fine. If we use more results do not get better but not better. So if we used one hundred a fine let's refit a tree with one hundred. sorry their run of course with one hundred predictions can be obtained with the predict function. As usual am also having a short comparison here to trees. I'm doing out of sample across solidation to leaf one out cross validation so we in times fit a tree and random forced and make a protection for the left out data point. This is possible its in is to quite a small data set so before we had a generalization error of twenty percent this was the other back air to based on the random force algorithm. Now now we do true core validation. Let's see what we obtain now. and so you see the his classification and its twenty percent. So the out of back and was very a very good approximation of what we do to obtainwith the cross relidationeer a tree as a uh, higher airfare of twenty eight percent so theirs a big difference Here We can also compare this to lead for instance lady function as an internally fownout cross radfunction and here its third five percent to will see very large differences in prediction on on these three methods. That's briefly how we can obtain variable importance measures with the car and old function. We get on the left hand side, the penetration based, the right hand side, the decrease based. So here we use the gene index so decrease in node impunity due to splits here. Yes now we can also see what happens if we only use the three more important variables provide actually increases to twenty for percent. If we use the three least important variables then the airline is very large in almost or forty For yes, we conclude here. unless there's some final question from your side, this is not the case. So I'm thanking you for your attention Next week It's to be the last lecture. Having twelve lectures in told next week is going to be the last lecture it will also provide some exams information in case you have questions. This is obviously also the opportunity to ask questions on the worse or in the exam. Yes! So we'll say good bye here, have a good afternoon, and see each other again in a week. Terror? Nobody.
Advertisement
Add Comment
Please, Sign In to add comment
Advertisement