If you feel confused, don't sweat it, StatQuest is here. StatQuest!

Hello, I'm Josh Starmer, and welcome to StatQuest. Today we're going to cover another machine learning fundamental, the confusion matrix, and it's going to be clearly explained.

Imagine that we have this medical data. We've got some clinical measurements, like chest pain, good blood circulation, blocked arteries, and weight, and we want to apply a machine learning method to them to predict whether or not someone will develop heart disease. To do this, we could use logistic regression, or K-nearest neighbors, or a random forest, or some other method. There are tons to choose from. How do we decide which one works best with our data?

We start by dividing the data into training and testing sets. Note: this would be an excellent opportunity to use cross-validation, and if you're not familiar with that, well, check out the StatQuest. Then we train all of the methods we're interested in with the training data, then test each method on the testing set. Now we need to summarize how each method performed on the testing data. One way to do this is by creating a confusion matrix for each method.

The rows in a confusion matrix correspond to what the machine learning algorithm predicted, and the columns correspond to the known truth. Since there are only two categories to choose from, has heart disease or does not have heart disease, the top left-hand corner contains the true positives. These are patients that had heart disease that were correctly identified by the algorithm. The true negatives are in the bottom right-hand corner. These are patients that did not have heart disease that were correctly identified by the algorithm. The bottom left-hand corner contains the false negatives. False negatives are when a patient has heart disease but the algorithm said they didn't. Lastly, the top right-hand corner contains the false positives. False positives are patients that do not have heart disease but the algorithm says they do.

For example, when we applied the random
forest to the testing data, there were 142 true positives, patients with heart disease that were correctly classified, and 110 true negatives, patients without heart disease that were correctly classified. However, the algorithm misclassified 29 patients that did have heart disease by saying they did not. These are false negatives. And the algorithm misclassified 22 patients that did not have heart disease by saying that they did. These are false positives. The numbers along the diagonal, the green boxes, tell us how many times the samples were correctly classified. The numbers not on the diagonal, the red boxes, are samples that the algorithm messed up.

Now we can compare the random forest's confusion matrix to the confusion matrix we get when we use K-nearest neighbors. K-nearest neighbors was worse than the random forest at predicting patients with heart disease (107 versus 142) and worse at predicting patients without heart disease (79 versus 110). So if we had to choose between using the random forest and K-nearest neighbors, we would choose the random forest. BAM!

Lastly, we can apply logistic regression to the testing data set and create a confusion matrix. These two confusion matrices are very similar and make it hard to choose which machine learning method is a better fit for this data. We'll talk about more sophisticated metrics, like sensitivity, specificity, ROC, and AUC, that can help us make a decision in the next StatQuests.

Now that we have the basic confusion matrix figured out, let's look at a more complicated one. Here's a new data set. Now the question is: based on what people think of these movies, Jurassic Park III, Run for Your Wife, Out Kold (spelled with a K), and Howard the Duck, can we use a machine learning method to predict their favorite movie? If the only options for favorite movie were Troll 2, Gore Police, or Cool as Ice, then the confusion matrix would have three rows and three columns. But just like before, the diagonal, the green boxes, are where the machine learning algorithm did
the right thing, and everything else is where the algorithm messed up. In this case, the machine learning algorithm didn't do very well, but can you blame it? These are all terrible movies. BAM!

Ultimately, the size of the confusion matrix is determined by the number of things we want to predict. In the first example, we were only trying to predict two things, if someone had heart disease or if they didn't, and that gave us a confusion matrix with two rows and two columns. In the second example, we had three things to choose from, and that gave us a confusion matrix with three rows and three columns. If we had four things to choose from, we'd get a confusion matrix with four rows and four columns. And if we had 40 things to choose from, we'd get a confusion matrix with 40 rows and 40 columns. Double BAM!

In summary, a confusion matrix tells you what your machine learning algorithm did right and what it did wrong.

Hooray! We've made it to the end of another exciting StatQuest. If you like this StatQuest and want to see more, please subscribe. And if you want to support StatQuest, well, consider buying one or two of my original songs. Alright, until next time, quest on!
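
To make the layout described in this StatQuest concrete, here is a minimal Python sketch that tallies a confusion matrix with rows for what the algorithm predicted and columns for the known truth. The function name `confusion_matrix`, the label strings, and the per-patient lists are assumptions made up for illustration; only the four counts (142, 22, 29, 110) come from the random forest example. Note that some libraries, scikit-learn among them, put the known truth in the rows instead, so check the convention before comparing.

```python
from collections import Counter


def confusion_matrix(predicted, actual, labels):
    """Rows are what the algorithm predicted, columns are the known truth."""
    counts = Counter(zip(predicted, actual))
    return [[counts[(p, a)] for a in labels] for p in labels]


# Hypothetical label names; the lists are rebuilt from the four counts
# quoted for the random forest, not from real patient records.
yes, no = "has heart disease", "does not have heart disease"
labels = [yes, no]
predicted = [yes] * 142 + [no] * 29 + [yes] * 22 + [no] * 110
actual = [yes] * 142 + [yes] * 29 + [no] * 22 + [no] * 110

matrix = confusion_matrix(predicted, actual, labels)
# matrix[0][0] = true positives,  matrix[0][1] = false positives
# matrix[1][0] = false negatives, matrix[1][1] = true negatives

# The diagonal holds the correctly classified samples.
correct = sum(matrix[i][i] for i in range(len(labels)))
total = sum(sum(row) for row in matrix)

for label, row in zip(labels, matrix):
    print(f"predicted {label}: {row}")
print(f"correct: {correct} of {total}")
```

Because the matrix is built from the list of labels, the same function gives three rows and three columns for three movies, or 40 rows and 40 columns for 40 things to predict.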