# Biostatistics & Epidemiology Lecture Series – Part 2: Probability and Types of Distributions

...that’s our denominator, and the probability of selecting a boy is just going to be the number of boys in that age bracket.

There are a couple of different conditions of probability. One is mutual exclusivity, so mutually exclusive events; the other is conditionally dependent events.

Mutually exclusive means that if one happens, the other one is not going to happen. Like a coin flip: it’s either going to be heads or it’s going to be tails, but it can’t be both at the same time. The roll of a single die: it’s going to be one of those numbers, one through six. More germane to health care: the day of the week that a patient arrives. You can’t have more than one mechanism of injury, mostly, although if you watch action movies, of course, the hero can get shot, stabbed, and crash a car. But usually we classify them as either a fall, or a motor vehicle or motorcycle crash, or an assault, or something like that. Gender, again, mostly is only one.

So there’s an additive rule among mutually exclusive events. It says that if we want to know the probability of having one outcome or the other, you can just add those probabilities together. In this case, 4.2 percent of our patients got injured on a motorcycle, and 15.9 percent were injured as a motor vehicle occupant; that is, they were riding in a car or a truck or a van. Of course there’s no overlap in those, so you don’t get any of these “yes and yes” in here, but about eighty percent of our patients were neither of those. So, assuming those are mutually exclusive, the probability of X or Y, motorcyclist or motor vehicle occupant, is just the probability of X plus the probability of Y. If we want to know the proportion of patients that were either motor vehicle occupants or motorcyclists, we just add these up: 4.2 percent plus 15.9 percent, and get 20.1 percent of our patients fell into one of those categories.

Now, conditionally
dependent is where the outcome of one variable depends on the other variable, or vice versa. Like alcohol results and the time of day: if you’re on call between midnight and 4:00 a.m., you’re going to see a lot more alcohol patients than if you’re here between 8 a.m. and noon, for example. Survival and GCS: of course, with a poor GCS you’ve got poor survival. The patient’s gender and the mechanism of injury: there’s a little bit of a relationship there.

So this is looking at the alcohol test results by the time of day that the patient arrives. You can see that from midnight to 3:59 a.m. nearly 60 percent of our patients tested positive for alcohol, going down to 8 to noon, where it’s more like 10 percent, and you can see how that all shakes out there. So there’s a conditional probability going on there.

This is where we have a multiplicative law. If those two variables, alcohol and time of day, are conditionally dependent, then the probability of both of those happening together is the probability of X, the probability of alcohol, times the probability of a time of day given alcohol. I should probably phrase that the other way: the probability of the time of day times the probability of alcohol given that time of day. So if we want to know what proportion of patients arrived between midnight and 3:59 a.m. and had a positive alcohol screen, then we can look at the proportion who actually arrived in that time frame, 18.5 percent, and the proportion of those who arrived in that time frame who had a positive alcohol screen, 59.2 percent. If we want to know what proportion of our total patients arrived in that period of time and had a positive alcohol screen, we just multiply these two together to get 11 percent of our patients were patients who arrived between midnight and 3:59 a.m.
and had a positive alcohol screen. This will come in handy later. It also goes to the concept of independence: testing whether one variable is independent of the other. This is an example of something that is not independent. There is clearly a dependence between time of day and alcohol results, as you can see, ranging from nearly sixty percent all the way down to about 10 percent here. So if ...

... probability that I actually have the disease. So that’s where we have positive and negative predictive value. Positive predictive value says: okay, you tested positive; here’s the probability that you actually do have the disease. That’s going to depend on the sensitivity of the test and the specificity of the test, but also on the prevalence of the disease. If it’s a very rare disease, positive predictive value is going to be lower, just because there are fewer people with the disease out there. If it’s more common, positive predictive value is going to be higher. Using that same example, we had 21 positive results, and 18 of those actually had a positive confirmatory test, so they actually did have the disease, essentially. So their positive predictive value ended up being 85.7 percent, so that works out pretty well. Same deal with negative predictive value: out of those who tested negative, that’s 76, how many of them really were truly negative? That’s 70. So 70 divided by 76 is 92.1 percent.

I swore I was not going to spend a lot of time on equations. So the idea behind Bayes’ theorem here is just that you can calculate the positive predictive value, or the sensitivity, if you have enough other information about the test. If you happen to know the sensitivity of the test, and the prevalence of the disease, and the proportion of tests that are positive, then you can calculate the positive predictive value. The example here is we’ve got a doctor who wants to screen patients for
a disease. The disease is pretty rare: the prevalence is 0.2%. The sensitivity is pretty good, 85%, so 85% of people who actually have the disease have a positive test. And 8% of the screens actually come back positive, so that’s the proportion, that P of B, and 92% are negative. We want to know: what’s the likelihood of having the disease if there’s a positive test? The other word for that is positive predictive value. So we can plug all that information into our equation based on Bayes’ theorem. The prevalence of disease is 0.002. The probability of testing positive given that you actually have the disease, that’s sensitivity. So we’re just going to multiply the sensitivity by the prevalence and divide that by the proportion of tests that are positive, and that gives us a 2.1 percent positive predictive value. So you can see, with more of a rare disease like this, where only 0.2% of patients actually have the disease, the positive predictive value goes way down. Which would make sense: fewer people have it, so it’s less likely that you have it. Kind of sounds like circular logic.

Okay, but I said I was going to talk about distributions, and so I am. The binomial distribution is probably one of the most popular. The binomial distribution is when you have two possible outcomes: heads or tails, success or failure, yes or no. So of course it is a discrete outcome. The other properties are that each of the replications of the process is independent of any of the other replications, and the probability of success is the same each time you do it. There’s some important notation here: n is the number of times the process is replicated, p is the probability of success, and x is the number of successes that you’re interested in. And once again, I swore I was not going to do a whole lot of equations, and this one is very scary-looking. We’ll go into what that means. Essentially,
there are two components to this equation. One is the number of ways that you can get that number of successes, so that’s the number of combinations. The second part is the probability for each of those possible ways.

So, the Lakers just acquired a new center, Dwight Howard. He’s a very good defender and a good post player, but a bad free-throw shooter: he makes about 40 percent of his free throws. Let’s say the Lakers are up by two with ten seconds left in the game, and he gets hacked because the opponent knows that he’s a bad free-throw shooter. It’s really important that he makes at least one of these free throws, because the opponent has a really good 3-point shooter. We’re making a friendly wager towards the end of this game based on how likely it is he’s going to make at least one of these shots and hopefully ice the game, or at least not lose it in regulation. Since he is a 40 percent free-throw shooter, we know the probability of him making each shot is 0.4, 40 percent. But we want to know: okay, they’re in the double bonus, he’s got two chances, what’s the probability he’s going to make at least one of those?

There are four possible outcomes here. He can make both of them; he can make the first one and miss the second one; he can miss the first one and make the second one; or he can miss them both. For each make, that’s a probability of 0.4. So if he makes them both, that probability ends up being 0.4 times 0.4, for a probability of 0.16: a 16% chance that he makes both shots. It’s more likely he’s going to make one and miss the other. If he makes the first one and misses the second one, the probability of making the first one is 0.4, missing the second one is 0.6; multiply those together, and the probability of that happening is 0.24. Same deal with outcome three, where he misses the first one and makes the second one: that probability is also 0.24,
and the most likely single outcome of all is that he misses them both, but fortunately that’s only 36% of the time. So what’s the probability he’s going to make at least one of them? Well, the probability he makes both of them, again, as I said, is 16 percent. The probability he makes one of them, and we don’t care what order he makes them in, is going to be 48 percent. So the probability he’s going to make at least one of those free throws is just adding all those up together: 0.16 plus 0.24 plus 0.24. So we’ve got a 64 percent chance he’s going to make at least one of those, so we’ve got at least a three-point lead, and as long as we don’t foul the three-point shooter, we’re not going to lose the game in regulation.

All right. So I mentioned the number of possible combinations of ways that you can get an outcome; this is actually the equation for that. This is not pronounced “n!”; it’s pronounced “n factorial,” and that just means that you multiply n by every whole number below it. So if instead of n you’re saying 5, that is calculated as 5 times 4 times 3 times 2 times 1. It’s just multiplying sequentially down until you get to 1. Obviously we have computer programs to do all this, and you can calculate the number of combinations on a basic scientific calculator using this “n choose x,” or maybe “n choose r,” button on your calculator. So that’s the first part: finding the number of combinations of ways that that outcome can happen.

The second part is finding the probability of that number of successes in a specified number of trials. That’s just the probability of success, p, multiplied by itself x times, once for each success, times the probability of failure, 1 minus p, multiplied by itself once for each time that failure happens.

So, to use a more health-related example, we’ve got a medication for allergies that’s effective in reducing symptoms 80% of the time, so that’s going to be p: p equals 0.8. If the
medication is given to 10 patients, we want to know the probability it’s effective in exactly 7. So again, for the probability of seven successes, here’s the number of combinations of ways, and that’s the probability of each of those outcomes: the number of combinations with seven successes out of 10 trials, and the probability for each of those combinations. There are a hundred and twenty different ways that you can get seven successes out of ten trials, and this other part is the probability for each of those outcomes. So the probability of seven successes ends up being 0.2013. We can extend that: if we want to know at least seven, then we just need to repeat that for seven and eight and nine and ten, add those all together, and that gives us the probability that it’s going to be successful at least seven times.

Now, if we look at it, the binomial distribution actually has a pretty characteristic shape. It actually looks a lot like the normal distribution when you start to calculate all of the possibilities. Each of these outcomes has a measurable probability of happening, and you can see it rising up to a peak at eight and going back down. Obviously there’s not a very long tail here, because we only have ten trials. If we were to increase the number of trials to 100, we get a much smoother curve here, and it looks a lot more like a normal distribution, and we can use some of the properties of the normal distribution, instead of that unwieldy equation, to measure some of the probabilities of the binomial distribution.

So, other things you need to know about the binomial distribution. We can refer to the mean of the binomial distribution as the number of trials times the probability of success, so n times p. In our example before, if we were repeating the process ten times, that would be n, and the probability of success is 0.8, then the mean, or
the expected number of successes, would be 0.8 times ten, or eight. The variance is maybe not quite as intuitive. It is calculated as n, again the number of trials, times the probability of success, times the probability of failure: so n times p times what we usually refer to as q, which is 1 minus p. And so the standard deviation, again, is just the square root of the variance, the square root of n times p times q.

All right, so that’s the binomial distribution. The normal distribution is the model for a continuous outcome. With many continuous variables, we assume that they are distributed in a normal fashion, and by that we mean this bell-curve shape. It’s also known as the Gaussian distribution, and there are some other names for it. It has a couple of properties here: the mean is right here in the middle, so it happens to be equal to the mode and the median. Also, on our standard normal distribution, we call that mean zero, and then we measure in terms of fractions or units of standard deviations away from the mean. So we have a score that’s called the z-score, and what the z-score represents is the number of standard deviations away from the mean that an observation is. Other properties are that about two thirds of observations are going to be within one standard deviation, and about 95 percent of observations are within two standard deviations of the mean.

This is an example from our registry, looking at systolic blood pressure. You can see it doesn’t look quite as pretty as the smooth curve, but we would say that it approximates a normal distribution, because it’s pretty close. Our mean here is a hundred and thirty-seven millimeters of mercury, and the standard deviation is 24, so it sounds like our patients have kind of high blood pressure. Again, our notation is going to be that mu is the mean and sigma is the standard deviation, and this kind of repeats that.

So let’s say
we know that body mass index for men at a certain age, 60, is normally distributed; the mean is 29 and the standard deviation is 6. We want to know the probability that just some randomly selected guy who’s 60 years of age has a body mass index less than 35. For some reason we’ve got a study where we need to have men who are not more than 35 in their BMI. Since this is normally distributed, and we know 35 is above the mean, what we need to do first is convert our units, the units of body mass index, to a z-score, so it fits with what we saw before, where the mean is 0 and the standard deviation is 1. In order to do that, we take that observation, in this case 35, subtract off the mean of 29, and divide by the standard deviation. That’s going to convert it to a standard normal distribution with mean 0 and standard deviation of 1. That equation is going to give us a result of 1: 35 minus 29 is 6, divided by 6 is 1. So what we want to know is the probability that Z is less than 1.

Okay. In almost any statistics text, you have a table in the back of the book. These are all derived from some equation, and the table shows you the area underneath the normal distribution curve to the left of the value that you’re interested in. It’s sure kind of hard to see all these little numbers, but it gives, again, the area under the curve, which can also be interpreted as the probability of an observation falling below that value. The way to read this is to look at the margins. So 0.00 is going to be the mean, and the value there is 0.5, meaning half of the observations are less than that, which we know. And that goes up. So for the probability of a Z of less than one, here’s 1.00, and that probability is 0.8413, meaning 84.13 percent of observations are going to be less than one
standard deviation above the mean. So this is a way to think about that: we’re talking about this much of the area under the curve. I kind of just went through all that, so basically what we’re going to do is just find that value for a Z of one, and that gives us 84.13 percent.

We can also find the expected percentiles, if we have a distribution that is normally distributed, based on those Z values, just like we had before. Of course the median is the fiftieth percentile, so that would be a Z of zero. The first quartile is the twenty-fifth percentile, the third quartile the 75th percentile. And we can reverse-calculate to find the actual value of that distribution, say body mass index or whatever it is we’re interested in, at a given percentile of that distribution. So here are the Z values for certain percentiles we might be interested in: the 95th percentile, for instance, where the Z value would be 1.645, or the 90th percentile, if you want to know the top ten percent. We know, again, that BMI in men follows that normal distribution with mean 29 and standard deviation 6, and in women it’s a normal distribution with mean 28 and standard deviation 7. So the 90th percentile for men is going to be, again, the mean plus the Z value for the 90th percentile times the standard deviation, so the 90th percentile is going to be 36.69. You can calculate it the same way for women: the mean plus that Z value times the standard deviation. And they’re remarkably close, the two of them.

Okay, the central limit theorem. This is really useful, because not all distributions follow a normal distribution; they don’t all obey that bell curve. Things like length of stay in the hospital or ISS are good examples of that. But if we have a population with a known mean and standard deviation sigma, then we can take, and this is assuming we just take
random samples of a consistent size, and as long as those are big enough, and by big enough we usually mean about 30, the sampling distribution of the means is going to approximately follow a normal distribution. The only different thing about this is that instead of using our sigma for the standard deviation, we’re going to take that sample size into account. So we’re going to calculate what’s called the standard error, and the standard error is calculated as the standard deviation divided by the square root of the sample size. Otherwise it’s exactly the same as what we just did. Again, we’re going to apply this to populations that are not normally distributed, and the distribution of the sample means approximately follows the normal distribution, so you can use that Z to compute probabilities. Where before we were subtracting off the population mean and dividing by the standard deviation, now we’re dividing by the standard error, the standard error being the standard deviation divided by the square root of the sample size, to find the Z value. That’s when we’re dealing with samples.

So, our example here: HDL cholesterol has mean 54 and standard deviation 17 in this age group of patients. A physician has 40 patients, so n is going to be 40, who are over the age of 50, and he wants to know the probability that the mean value for that sample is above 60. So the probability that X-bar is over 60 is, well, again, just like we did before, we need to convert this to a Z value. We subtract off the population mean from the sample mean and divide by that standard error, so the standard deviation, 17, divided by the square root of the sample size, which is 40, or about 2.7. That gives us a Z value of 2.22, so that’s 2.22 standard deviations above the mean that you would expect. And so for the probability of finding that, this is the value from that
table we would get: about 98.7 percent of sample means are going to be less than that. Since we want to know the probability that it’s greater than that, we’re going to take 1 minus that value from the table, and find that only about 1.32 percent of the time we’re going to expect to have a sample that has an HDL cholesterol that high.

All right, so in summary: probability is the likelihood that an event is going to occur. It ranges from zero to one: zero, it’s not going to happen; one, it’s absolutely inevitable, it will happen. If we have mutually exclusive events, that means that if one happens the other one won’t, and we can use the additive rule when we want to know if one or the other is going to happen. For conditionally dependent events, that’s the multiplicative rule: if you want to know if both are going to happen, that’s just the probability of one happening times the conditional probability of the other happening. We went over sensitivity and specificity and positive and negative predictive value, as well as the binomial distribution. That’s just the number of possible combinations of ways something can happen times the probability of each of those combinations happening. And when the number of trials is large enough, that binomial distribution can approximate the normal distribution, so you can use a lot of the properties of the normal distribution to evaluate it, even if you only have two outcomes. The normal distribution, again, has some characteristics all its own: the mean equals the median and the mode. And Z is probably the most important thing to remember here: Z is just the number of standard deviations away from the mean. Remember that ninety-five percent of our observations are going to be within two standard deviations of the mean, so that’s why a lot of the time we’ll use two standard deviations as a rough estimate of significance. And the central limit theorem is useful when evaluating the distribution of sample means,
because the sample means are going to approximate the normal distribution even if the raw distribution is skewed.

All right, so we’re done for this morning. Next time we’re going to talk a bit about hypothesis testing and statistical inference. When you’re looking at peer-reviewed journals, often you will see p-values and possibly confidence intervals, sometimes in the same spot. We’ll talk a little bit about how to interpret those, and how to know what tests are appropriate to use in what situations, depending on what kind of variables you have. Any questions?
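
For anyone who wants to check the arithmetic behind the additive rule, the multiplicative rule, the predictive values, and the Bayes’ theorem screening example, here is a minimal Python sketch. It is not part of the lecture: the variable and function names (such as `bayes_ppv`) are made up for illustration, and only the percentages and counts come from the talk.

```python
# Additive rule for mutually exclusive events: P(X or Y) = P(X) + P(Y)
p_motorcycle = 0.042    # injured on a motorcycle
p_mv_occupant = 0.159   # injured as a motor vehicle occupant
p_either = p_motorcycle + p_mv_occupant

# Multiplicative rule for dependent events: P(X and Y) = P(X) * P(Y | X)
p_arrive_midnight_to_4 = 0.185   # arrived between midnight and 3:59 a.m.
p_alcohol_given_time = 0.592     # positive alcohol screen given that arrival time
p_both = p_arrive_midnight_to_4 * p_alcohol_given_time

# Predictive values straight from the counts quoted in the lecture
ppv = 18 / 21   # true positives / all positive tests
npv = 70 / 76   # true negatives / all negative tests


def bayes_ppv(sensitivity, prevalence, p_positive):
    """P(disease | positive) = P(positive | disease) * P(disease) / P(positive)."""
    return sensitivity * prevalence / p_positive


# Rare-disease screening example: sensitivity 85%, prevalence 0.2%, 8% of tests positive
ppv_rare = bayes_ppv(sensitivity=0.85, prevalence=0.002, p_positive=0.08)

print(f"P(motorcycle or MV occupant) = {p_either:.3f}")
print(f"P(midnight arrival and alcohol) = {p_both:.3f}")
print(f"PPV = {ppv:.3f}, NPV = {npv:.3f}")
print(f"PPV for the rare disease = {ppv_rare:.4f}")
```

The rare-disease result comes out a little over 2 percent, which matches the point that a low prevalence drags the positive predictive value down even when sensitivity is good.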
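
The binomial examples (the two free throws and the allergy medication) can be reproduced the same way. This is a sketch under the lecture’s stated assumptions of independent trials with a constant probability of success; the helper name `binomial_pmf` is hypothetical, and `math.comb` plays the role of the calculator’s “n choose x” button.

```python
from math import comb, sqrt


def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials.

    comb(n, x) counts the ways to arrange the successes;
    p**x * (1 - p)**(n - x) is the probability of each arrangement.
    """
    return comb(n, x) * p**x * (1 - p) ** (n - x)


# Free throws: two shots, 40% shooter, at least one make
p_at_least_one = binomial_pmf(1, 2, 0.4) + binomial_pmf(2, 2, 0.4)

# Allergy medication: effective 80% of the time, 10 patients
ways_7_of_10 = comb(10, 7)
p_exactly_7 = binomial_pmf(7, 10, 0.8)
p_at_least_7 = sum(binomial_pmf(x, 10, 0.8) for x in range(7, 11))

# Mean, variance, and standard deviation of the binomial with n = 10, p = 0.8
n, p = 10, 0.8
mean = n * p
variance = n * p * (1 - p)
std_dev = sqrt(variance)

print(f"P(at least one free throw) = {p_at_least_one:.2f}")
print(f"Ways to get 7 of 10 = {ways_7_of_10}")
print(f"P(exactly 7 of 10) = {p_exactly_7:.4f}")
print(f"P(at least 7 of 10) = {p_at_least_7:.4f}")
print(f"mean = {mean:.1f}, variance = {variance:.1f}, sd = {std_dev:.3f}")
```

Summing the probabilities for 7, 8, 9, and 10 successes is exactly the “repeat and add” step described in the lecture; no normal approximation is used here.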
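
The normal-distribution and central limit theorem examples can also be checked without a printed Z table. The sketch below assumes Python 3.8 or later, where `statistics.NormalDist` is available, and its variable names are illustrative. One thing to note: computed without intermediate rounding, the HDL example gives a Z of about 2.23 rather than 2.22, because the lecture’s figure appears to use a standard error rounded to 2.7, so the tail probability printed here differs slightly from the 1.32 percent read off the table.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

# BMI in 60-year-old men: mean 29, sd 6; probability that BMI < 35
z_bmi = (35 - 29) / 6
p_bmi_below_35 = std_normal.cdf(z_bmi)

# 90th percentile of BMI: mean + z * sd, with z taken from the inverse CDF
z_90 = std_normal.inv_cdf(0.90)
bmi_90_men = 29 + z_90 * 6
bmi_90_women = 28 + z_90 * 7

# Central limit theorem: HDL mean 54, sd 17, sample of n = 40, P(sample mean > 60)
standard_error = 17 / sqrt(40)
z_hdl = (60 - 54) / standard_error
p_mean_above_60 = 1 - std_normal.cdf(z_hdl)

print(f"z = {z_bmi:.2f}, P(BMI < 35) = {p_bmi_below_35:.4f}")
print(f"z for 90th percentile = {z_90:.3f}")
print(f"90th percentile BMI: men {bmi_90_men:.2f}, women {bmi_90_women:.2f}")
print(f"standard error = {standard_error:.3f}, z = {z_hdl:.2f}")
print(f"P(sample mean HDL > 60) = {p_mean_above_60:.4f}")
```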