Popper, Lakatos, and Bayes: How to think scientifically

Okay, this session let’s pull everything together by applying it to a particular paper, so I hope you’ve had a go at this paper here. The first section is about Popper, so I’ll go through that. The first question is: concisely state the theory that the authors present as having been put up to test. In terms of what we’ve been discussing before, this is asking for the substantial theory; it’s asking you to isolate what substantial theory was presented in the paper. The substantial theory is, hopefully, a bold, interesting idea that’s falsifiable, and in fact is something that would be given up if everything about the experiment works well and you get certain results.

So when you think about a substantial theory, one thing to think about is: can you make it as bold and interesting as possible? Take a different example. You might be reading a paper about priming: if you prime people with words like, I don’t know, “achieve”, “win”, words like this, they will work hard on crosswords afterwards. So you might say the substantial theory is “priming with achievement words makes people work hard on crosswords”. Okay, maybe that is a claim that a particular experiment might be putting up to test, one they would give up if they got certain results. But there might be a bolder way of putting it, for example “priming people with the concept of achievement makes them work harder on tasks”. And you see, you would also give up this second theory given the results of the experiment, so that’s an even bolder claim that is being put up to test.

So when you try and identify the substantial theory, try and think: what’s the boldest, most interesting claim that I could state here that makes sense? Because what we’re interested in as scientists is bold and interesting theories, not just any old theory. That goes back to my lecture, where I was saying the goal isn’t to keep as close to the data as possible; the goal, on a Popperian view of science, is to have interesting ideas that we test, and I subscribe to that
personally, myself. So those are just some thoughts in terms of generally tackling this task, and in thinking about your project or any research you’re involved with. But now, for this particular paper, just have a go first of all at saying to the person next to you what you think the substantial theory was that is being put up to test in this paper.

Okay, so this is what I put down here. Actually I’ve taken it pretty much from the paper, so I think in this case it’s not terribly problematic, although maybe you disagree: people who view apologies as socially desirable following interpersonal transgressions will overestimate the value of an apology when receiving it in imagination compared to when they actually receive it. And then for Study 2: hence people believe they’ll be more trusting after receiving an apology than they really would be. So I think those are the ideas being put up to test; those are the substantial theories. Any comments on that? Does that seem reasonable? Yes?

[Audience comment, partly inaudible, apparently suggesting that people tend to overestimate whatever they imagine compared with what they actually experience.]

Yes, yes. So there is an even bolder hypothesis: that in every case in which you, how should we put it, imagine a social situation or

an emotional reaction to a social situation. Yes, all right, good thought. Yes, that would be a bold one.

Okay, so now: what pattern of results, if any, would falsify the theory? Yes? Yes, exactly, what the bolder one could be. So there is an issue here, just to say it for the recording. The bolder hypothesis is that forecasting errors in general will be overestimates, and as a general claim that would be falsified by not being true in this case. But there is an issue: do we want to treat that as falsifiable, or do we want to protect it as a hard core? We’ll come back to that issue when we consider Lakatos, because we might regard the bolder theory as true on many occasions but not necessarily always, and we’d still use it as inspiration in our research. So then we move into Lakatosian territory. Yes, very good point.

Yes, you could condense it. I was covering Studies 1 and 2; you might have concentrated on one of those studies more than the other, so you might have just done half of that, or you might have found a way of putting it more briefly to cover both. That would be absolutely fine.

All right, so now what would falsify this claim, or the claim that you’ve used? You could have used a slightly different claim; as we’ve been talking about, there’s not necessarily a right answer here. It might be that in a given paper the authors don’t clearly state a substantial theory but you read between the lines and get one, or you come up with a different one than the authors have, and so on. So it’s not as if there’s a, you know, “you must say this”; there’s a range of answers that you could give. But once you’ve decided on the substantial theory, that then determines or influences what answers you can give for the following questions, for example what would falsify it, and what are the auxiliary hypotheses; we’ll come to that. So for the substantial theory that you have given, what would falsify it? Just have a chat with the person
next to you. So for your substantial theories, which might have been a bit different: what results from this paper would falsify them?

Okay, so the theory would be falsified if people rated the apology as more valuable, or gave more money, in reality than in imagination. So it went significantly in the wrong direction. The little trap, as it were, or the thing to watch out for, is that a non-significant result in itself would count for nothing. A non-significant result would not count against the theory, not in itself. That’s the little thing to watch out for, because the whole literature behaves otherwise.

But we do have means for interpreting what would be non-significant results in meaningful ways. You could have a Bayes factor less than a third; that means you would have evidence for the null hypothesis and against the theory, so that would count against the theory. Or, if you’re able to come up with a minimal meaningful amount that you’re happy with, and the confidence interval did not exceed that minimal meaningful amount, then that would also count against the theory. We’re going to look at what is involved in (a) and (b) here as we go through the assignment. Yes? Yes, so that in effect is a non

significant result. Yes? Oh, accuracy amounts to no difference between imagination and reality. A non-significant result wouldn’t do, and the reason is that you cannot infer from the non-significant result that imagination and reality were the same; that’s the problem. If you found that they were the same in the population, yes, that falsifies the theory. The problem is, how do we get evidence for that in the population? A non-significant result is not evidence for sameness. So how do we get evidence for sameness? With a Bayes factor; or, if we can have a minimal meaningful amount, so that we say it’s the same as long as it’s only trivially different, then we can use intervals.

All right, so the next question is: what background knowledge inspired this theory that is not being directly tested? We have the substantial theory, and that theory might have been inspired by a number of things, but they’re not actually being put up to test. Not all theories and findings that are mentioned by the authors are actually being tested, but they’re relevant for why they postulated this substantial theory in the first place. So just now have a little go with your partner at saying what you thought about the answer to that question.

Okay, but a point just came up about the previous slide here: I’ve sort of constructed half of a null region. I didn’t do a full null region, and that’s because it doesn’t matter how far the confidence interval extends in this direction; we still falsify the theory, because it’s directional. So that was, I think, an interesting point just raised.

Okay, so now let’s look at some theories or findings that help motivate the research but aren’t being directly tested. As was brought up earlier, people’s poor affective and behavioral forecasting in other domains is discussed. This is why I say it depends what your substantial theory is. In terms of the substantial theory being
about apologies, then whether the same issues apply to how sad you’ll be, how happy you’ll be, and so on: well, maybe it applies to some of those domains and not others, who knows, but it could still apply to apologies. So we don’t need to assume the status of that research is safe. You see, yes, it did inspire the substantial theory, but the experiment has an integrity all of its own that depends mostly on this half of the diagram here. So whether behavioral forecasting applies to sadness or not, well, it doesn’t matter; we can still test, and have integrity in this experiment as a test of, its application to apologies. That’s why it’s good to separate out the roles of these different theories.

Part of this comes down to Popper’s philosophy that we didn’t induce the substantial theory from anything. There is no method by which we can infer this theory and say “therefore it’s right”. What we do is we guess, we conjecture, and then we test it. Whatever we use to inspire ourselves to say it, that’s great, but once we have the theory on the table, that’s what’s important, and then whether we’ve got the right conditions to test it, which is this side of the diagram.

So now we’re going to have a look at that side of the diagram a little bit more closely. What background knowledge must be assumed for the test to be a test of the theory in 1? Here we have auxiliary hypotheses. Just have a go at defining, to the person next to you, what an auxiliary hypothesis is.

Okay, so you could define auxiliary hypotheses as the things you need to assume to go from the substantial theory to specific predictions. So what connects a substantial theory to predictions? The substantial theory will talk about what you’re interested in, like

the extent to which people value an apology, but in the end what you need is a particular dependent variable; you need a measure of value. So maybe you get people to rate how much they value the apology, or you see how much money they give. We could do something else, but what you have is a measure, and there’s always the question: does your dependent variable measure what you say it measures? So there’s an assumption there, maybe an assumption that needs to be tested further, but at least you need to wonder: do your variables measure what they say they measure? When you manipulate something, have you manipulated what you think you’ve manipulated? So there’ll be all sorts of assumptions, because a substantial theory is ideally something fairly general and interesting which you could test in different ways, and you somehow have to go from that generality to the specifics of your particular experiment. We’ll see examples of that.

So what I’m looking at here is that you make a prediction, and then there is some possible falsifying observation. Let’s say you do get the falsifying case. Will this discrepancy here be transmitted back to the substantial theory, so that you say that counts against the substantial theory, or, if you’re feeling particularly robust, that falsifies the substantial theory? You see, instead of a discrepancy in the predictions being transmitted to the substantial theory, you might give up some other assumption. You might say, “Oh, that’s because we just used a rating scale of how much they valued the apology. What does that mean? I don’t think we measured how much they value the apology very well; we need to measure it better.” So then the possible falsification stops here, you see.

So what we need to do is to make sure the assumptions we make in deriving predictions are safe enough that we would give up the theory rather than those assumptions. Does a falsification, or an unfavorable outcome, transmit back to the substantial theory? That’s the crucial question, because
if it does, we’re in business: we have a falsifiable theory, potentially being put up for falsification, and you would give it up if you had certain outcomes.

Now, in terms of it being a severe test: certainly, for a severe test, we do need the predictions to follow clearly and firmly from the theory. So the predictions follow from the theory; the predicted outcome is likely, given the theory, if the theory is true. We need that for a severe test. We need something else for a severe test too. Let’s say there’s another theory out here, and this is where this side can play some role in the scientific nature of what we’re doing. Let’s say some theory out here, like, for example, demand characteristics, as it comes up so often: subjects suss out the experiment and try to give you what you want. Maybe some theory out here would produce the same outcome as your theory. Now, it’s still true that a falsification will transmit back to your theory, so we’re still in the business of science. But there’s another problem. Let’s say you get a supporting result; the prediction is a success. Does the credit transmit back to the theory? Well, if there’s another theory that predicts it, it’s not obvious that this result strongly corroborates this theory here, because something else is predicting it. Or, in terms of the notion of test severity: even if the theory were false, you might still get the same outcome, say because of a demand characteristic. So it’s not a severe test; the theory hasn’t been made to stick its neck out. That’s what we want a test to be. You say, “If it’s going to be falsified in any situation, surely it’s here; I’ve got a feeling it’s going to fail”, and then if it passes, you say that theory is strongly corroborated. So that’s what we’re looking for in a severe test.

So what background knowledge, in other words auxiliary hypotheses, must we assume for the test to be a test of the substantial theory? We’re talking about the auxiliaries here, and other theories that might predict the same outcome. Let’s just
have a go at talking about that with the person next to you, and then

I’ll come back.

All right, so now I’ll go through some. And follow me: it’s not as if there’s a fixed number of these, so that you have to mention the fixed number. You can go on forever with things you must assume. You have to assume the statistical analyses were appropriate, the distributions were distributed in the right way for the analyses done, and so on. That might seem a bit trivial, things like the statistical package actually carrying out the statistics it says it does, but there was a recent case where it was found that, I think it was partial eta squared or something, in SPSS was not actually being calculated correctly. So those are real assumptions, even if we regard them as too trivial. Part of the problem with SPSS is that it’s a private company, so the software as given can’t be publicly checked, and that’s why there’s new software being developed, JASP, which is free, so that anyone can check the code that goes into it, and that puts you on safer ground in terms of it doing what it says it does.

The only other thing I want to say about auxiliaries is just to make sure they’re additional assumptions, so they’re not part of the substantial theory. Something like “subjects will value apologies in imagination”: that’s not an additional assumption, that’s part of what the theory is saying. The auxiliary hypotheses have to be other things you need to assume for the test to be a test of the substantial theory.

I’ll go through some examples here. One example is always that the variables measure and do what we say they’re measuring and doing. So the ratings of how valuable and reconciling the apology is reflect people’s actual feelings on the matter. Now, when you assess, which is the next question, how safe that assumption is: when you do the assignment, I’m not looking for you to be hypercritical and call
everything unsafe and demand more evidence for this, and I’m not looking for you to say everything’s safe. I’m just looking for genuine judgment. In that sense there isn’t a right answer; just be honest with yourself and ask whether you think this is good enough, at least good enough that if you got results opposite to prediction, that would count against the theory in your mind. My personal judgment here is that when they took the ratings initially, they all intercorrelated highly, and they have plausible face validity as far as I’m concerned, so I’m happy to treat that as safe. But that’s entirely your judgment.

As I say, there is the issue that relates to this one of demand characteristics. We need to seriously consider that subjects did not determine the point of the experiment and respond according to demand characteristics. With ratings that’s an important point. Demand characteristics are a common issue to think about with any psychology experiment. It is mitigated to some extent when you have a between-subjects design, because then the subjects don’t know what it is they’re being compared against, and this was between subjects. So did the subjects really know that this was a comparison between reality and imagination? My personal judgment is that it’s unlikely they would have sussed that by being in one of the conditions, and so actually I regard this one as a safe assumption as well.

People imagine the apology as having equivalent objective characteristics to the real one subjects received. We need to make sure, when we’re comparing imagination and reality, that the extent of the authenticity of the apology was the same. Because if subjects imagined, you know, something highly emotive, imagining the face of the person really apologizing and meaning it, that might have a different impact than a message on your computer screen, and then it’s not necessarily to do with imagination versus reality; it is to
do with what you imagined compared to what was there. And what the

authors say is that they got people to imagine the identical apology, so I think I’m probably okay on that one. They were shown the message and told, “Imagine this.” Well, I think that’s what they did; they’re not entirely clear.

Now here we come to two very interesting points which will help us think about experimental design and experiments in general, and which are maybe not as well understood as they should be. You might say: don’t we assume the subjects in the two groups did not differ systematically in mood, in socialization about the value of apologies, and so on, in ways that might affect the outcome of the experiment? Because let’s say people in one of the conditions happened to have more positive mood than in the other condition, or happened to be socialized more about the value of apologies in one condition rather than another. And you could go on listing differences between subjects that might be relevant: did they differ in empathy, in how extroverted they are, did they differ in their age, and so on. So there’s a large number of these participant characteristics that might be relevant to the outcome. Shouldn’t we be worried about them?

Well, Fisher introduced a little piece of magic here that is really regarded as the thing that makes an experiment an experiment, and that’s random assignment of participants to groups. The authors said they randomized in every experiment in that paper, so they randomly assigned participants to groups. Now, when you come to do your project or any research, random assignment really means random assignment. Let’s say you’ve got three groups. What you might do is have three cards, with one, two and three on them. You shuffle them; that’s the order in which you’re going to assign the next three subjects. You shuffle them again; that’s the order in which you’re going to assign the next three subjects, and so on. So then you have a genuinely random process by which subjects are allocated to groups, but in a way that assures even
numbers in each of the groups. So that’s a random process. That means the only way that one of the groups would get more sad people, more empathic people, and so on than the other is because of the vagaries of chance. But that’s exactly what we control for by any statistical method that we use, whether it’s p-values or Bayes. So yes, it might happen by chance, but it will happen such that it affects our outcome only one in twenty times by chance, if we’re doing significance testing, and that’s exactly the risk we’ve said in advance we’re prepared to take. So in that sense we’ve accounted for it. That’s the beauty of random assignment. So 4 is actually safe, because they randomly assigned.

Random assignment wouldn’t be, by the way, “I think this participant is suitable for this group; I think this participant has a good imagination, I’ll put them in the imagination group.” Or let’s say you’re doing a mood induction experiment: “This person’s quite happy, I’ll put them in the happy induction group.” That is no longer random assignment. That introduces systematic confounds into your experiment, and you would no longer know whether it was your intervention or the type of participant that was responsible for any difference between groups, which is why random assignment is so important.

And random assignment doesn’t even mean, if you’ve got two groups, “I’ll put the first one in Group A and the second one in Group B”, and you do that all the way. Maybe they come in pairs, and the more extroverted person is always the one who’s a little bit ahead. How do you know? So just to be safe, for random assignment use a genuinely random process, like card shuffling, random numbers, or flips of coins, to assign your subjects to groups. And it’s so important, as I say, that experimentalists regard it as, in some cases, the defining feature of an experiment.

Well, now let’s move on to another situation in which randomness
plays a role. You might say: are the subjects really representative enough of the population to which the theory applies? You might say, you know, these are just undergraduate students from some university, whereas the theory is meant to apply to human beings, or to human beings who value apologies; surely this

undermines the experiment? Well, here the role of randomness is slightly different. If you’ve identified the population to which the theory applies, you would perform random selection from the population, and then you can say that what you have is representative of the population, because you’ve randomly selected from the population. So you need to clearly define the population.

Now, random selection again really means something. It doesn’t just mean “I sort of selected arbitrarily from the sort of people we’re talking about.” It comes up in voting. If you’re trying to predict voting and you only contact those people who have email to ask how they’re intending to vote, you now have a biased selection from the population, because the people who don’t have email may well vote in systematically different ways from the people who do. So what you have to do is have a list of everyone in the population, assign them a number, use random numbers to determine which ones you’re going to test, and then you’ve got to work bloody hard to test those people: track them down and test them. That is random selection from the population.

Psychologists don’t do that. It’s very important, say, for predicting voting; for psychologists it’s not so important. And the reason why it’s not so important is that, for the test to be a test of the theory, we only need a population to which the theory is meant to apply, because if it fails for that population, the theory fails. So we don’t need a test of the population as such; we only need to make sure we’ve identified a population, even if roughly, to which the theory applies, so that a falsifying observation transmits back to falsifying the theory. So what we do is we don’t do random selection at all; we just select any way we can. Undergraduates sign up for credits, or whatever it is. It’s kind of hard to say exactly what the population is; it’s something like psychology undergraduates at that university, but
in any case we don’t really need to say, as long as we can say the theory applies to these guys. And so the role of randomness is rather different in these two cases. It’s crucially important that we randomly assign subjects to groups, and that’s easy. But it’s not important for the integrity of the test that we randomly select from the population, and that’s really kind of convenient, because that’s difficult.

Just put your hands up if you understand that distinction between random assignment and random selection and the roles they play in theory testing. Have I been clear on that? Raise your hands. Excellent, there’s a good show of hands, but some of you are perhaps a bit uncertain. Can I get you now to explain it to each other, and then we’ll move on.

Okay, so incidentally, I just want to add a sort of footnote to what I said. It can be a very interesting issue when people try to directly replicate an experiment. Say the initial experiment was in America, they directly replicate it in China, and you don’t get the results. Then it is really interesting to think: well, maybe there’s a cross-cultural difference going on here; maybe it applies to some populations and not others. Then you can think about your substantial theory. The fact that it didn’t apply in one of those cases, didn’t apply in China, does falsify your theory: there was a population to which it was meant to apply, and the theory is false. So now we need to come up with a better theory. Yes?

[Audience question, partly inaudible, apparently about whether in psychology you can ever find a theory that explains everyone.]

I’d put it the other way around. We try and invent theories that do apply to everyone, because we know that if it’s falsified for any subset of humanity, the theory is falsified as a generality for

everybody. And that’s precisely why, first, we want to come up with the theory that’s the boldest and most interesting we can come up with, so it applies to everyone. If we find a group of people to which it doesn’t apply, you might just give up on the theory and say this doesn’t work for human beings. But if, as I said, you ran it with a distinct group, maybe culturally different, or different in age or socioeconomic status or something like that, and it seems to work in some cases and not others, now we’re in a position to come up with a more interesting theory, a richer theory, that we hope does apply to everyone. So we might say certain people think more holistically and other people think more analytically, as defined by certain practices and upbringing in different cultures, and that explains the differences. But that’s a new theory that is meant to apply to everyone. So you think, “Aha, what other cultures are like China in training people to think holistically? We’ll try it out there.” So at every step of the way we’re coming up with theories we hope apply to everyone. At least that’s my recommendation, because you want the boldest, most interesting theories.

As you say, there is the danger; the danger you’re pointing to is that psychology in the end will destroy itself, because there are no regularities about human beings; in the end there are just individuals behaving individually. I think what we have from psychology so far has given reasonable confidence that that isn’t where we’re going to wind up, that there are regularities that apply to human beings. But that is an article of faith.

Right, so that’s the Popper section. Now we move on to orthodoxy. Well, I say orthodoxy, but I’m trying to get you to do something within the orthodox framework that’s a bit better than what’s out there, namely to think in terms of null regions and intervals instead of significance testing. So this question was: have the authors determined what minimal difference could be expected if the theory were true? Well, the general answer to that is
no, they won’t have, because it’s extremely rare that you see a paper where they’ve done that. Okay: if not, determine one yourself, state your reasons, and use it to construct a null region. “Why?” Well, I suppose “state your reasons” is a bit of a rhetorical question.

So I’ll deal with Study 2 here, which is about this: someone’s cheated on you, they give you an apology, and then you get a chance to give some money back to them for a second time, and it’s ten euros that’s involved. So if there’s a difference between reality and imagination, what would be a minimal meaningful difference? Well, I don’t know how to point to something objective about this. A dollar seems meaningful to me; if there’s a dollar difference between conditions, I think that sounds meaningful. I don’t know about 50 cents. So I had to reach deep in my soul for this one, take a good hard look, and I came up with 20 cents. But I have no justification for that. But you have to go through that process if you’re an orthodox statistician and you’re going to use power or confidence intervals to accept a null hypothesis. So if you find this process difficult, part of it is me saying to you: yes, and if you’re going to insist on doing orthodox statistics, this is what you’ve got to do, and it’s difficult, isn’t it? It’s subjective, it’s arbitrary. Well, things are not entirely arbitrary; 20 cents I’m sort of happy with. Yes, question?

[Audience question, partly inaudible, about what the null region means.]

What it means is this. What I’ve said is that my null region is from minus 20 cents to plus 20 cents. What I’m saying is that differences between groups in there are minimal, in the sense that they’re so small that, you know, even if God told me this is the real difference in the population, I’d say, well, that’s so trivial, how could I be interested in that? It’s as good as zero; it shouldn’t bear on our theories, or it doesn’t have practical relevance. It’s so small as to be irrelevant. So I put

euros there; that should be cents. [Partly inaudible.]

So I’m just interested in what numbers people came up with. Does anyone want to share their number for a minimal meaningful difference? Just discuss what a minimal meaningful amount would be to you for this experiment, and I’ll come around and listen to some numbers.

Okay, so some numbers I got: 40 cents, 50 cents, 75 cents. There’s no particular reason for these; you’ve just got to ask yourself what would actually be interesting. I mean, if I said to you that we tested everyone in the world to which the theory is meant to apply, so we have the population, and the difference when you average over everyone is tiny, it’s 0.01 of a cent, you would say, “Forget it, that’s not interesting.” 0.1 of a cent? “Forget it, that’s as good as zero as far as I’m concerned.” Well, one cent, then? So where do you stop? When does it just start getting interesting? Yes?

[Audience comment, partly inaudible, apparently about participants giving the whole amount.]

Right, so what your argument is saying is that this is not a suitable case for parametric statistics and t-tests; we don’t have the right sort of scale. So that gets back to auxiliary assumptions: the results are analyzed incorrectly in that case. That’s a good point, and you’re surely right about the end effects there. But just so we can carry on and use the analyses as given, we then need to come up with a minimal meaningful amount if we are to assess the sensitivity of the study, to enable us to say anything about support for a null hypothesis using conventional statistics.

I mean, really it’s a matter of asking yourself when it becomes interesting, because what other basis do you have? Sometimes I’m asked about looking at other papers. Authors often use 1 to 7 rating scales; that’s extremely common. If you know about literatures that use 1 to 7
rating scales, you often find a difference between groups of about, say, a third of a rating scale point being significant, and the authors getting excited by it. Well, what that’s telling you is that a third of a rating point is taken as real and interesting by that community, so probably a minimal meaningful amount is smaller than that. So you can use previous studies to get an idea of the sorts of things that people are taking as real. But I kind of suspect they didn’t pay much attention to the effect size, the raw effect size, and were just looking at the p-values, and they were just going to get excited if p was less than .05, no matter what the effect size. That’s my guess.

So we have a null region, and then the next question is: “Calculate a confidence interval and use it and the null region to argue whether or not the data sensitively distinguish the alternative and null hypotheses.” Well, you have the mean difference; I’m doing study two: 1.89 euros. To get a confidence interval we need the mean difference; we also need the standard error of the difference, and we can get that if we know the F or the t that they used. What we’d like is the t, and the thing you need to know is that an F done in the same situation as

you would do a t is really the same test: you just square the t to get the F, so you can square-root the F to get the t. They happen to give us F, so square-root the F to get the t. That’s just worth remembering: in a situation where you would have done a t-test, the F is just the square of the t. Another thing to remember is that the t is the mean difference divided by the standard error. So if you want to find the standard error, it’s just the mean difference divided by the obtained t: the actual mean difference obtained divided by the actual t obtained. That will give you the right standard error of the difference. Any paper worth its salt should give you the means, so you can get the mean difference, and it should give you the Fs or the ts for the tests, so you can get the standard error.

Now the next thing you need to know to do a confidence interval is the critical value of t. Two-tailed, remember: to get a confidence interval it’s a two-tailed critical value at the 5% level. You need to type in the degrees of freedom, and that will be written after the F or after the t. So if they have an F, it’ll be F(1, 40): the 40 is the degrees of freedom. If they have a t, they’ll have t(40): the 40 is the degrees of freedom. So this website asks you for the degrees of freedom and then gives you the critical value of t. It’s going to be round about 2, so you know you’ve got it right if you get a value that’s close to 2: 2.02 in this case. Then the confidence interval is the mean plus 2 times the standard error on the one side, that’s that, and on the other side the mean minus 2 times the standard error, and that’s that. So the mean plus 2 times the standard error, the mean minus 2 times the standard error. When I say 2, I mean precisely this number. In any case, look through that and make sure you understand it. If you have any questions on this sort of procedure, just email me, or better, use the Study Direct site, so that for everyone it can be
clarified. That’s just a little procedure you go through; you end up with an interval that says it goes from here to here. Why don’t you just define a confidence interval to the person next to you before I carry on.

Okay, so to clarify one thing: because the interval has two ends to it, that’s why we’re using a two-tailed critical value, even though we’re making a directional prediction. You get a confidence interval which goes from one value to another; it has two ends, so it’s a two-tailed critical value. Now, the confidence interval: everything outside the confidence interval is significantly different from the sample mean difference. What that means is you’re allowed to exclude, to reject, as possible population differences everything outside of the interval. And that means everything within the interval remains as a possible population difference. Now, is zero in that interval, from 0.1 to 3.6? Is zero in there? No, zero’s not in there. Another way of saying that is that there was a significant difference from zero, or people just say it was significant. Okay, but now we’re thinking in terms of intervals. We have the null region, and, well, it depends what your null region is, but I came up with 20 cents, 0.2. So this confidence interval spans both the null region, where I have said the values are so small, damned if I’m interested in them, and values outside the null region, which I’ve said are sufficiently big that yes, they do count. So what does that mean? What follows from this? Yes: the data are insensitive. So what I’m saying is my data are consistent with values I think are so small as to be trivial, but they’re also consistent with values that are so big as to be interesting.
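A minimal sketch of the confidence-interval procedure just described, in Python. The mean difference (1.89), the critical t (2.02) and the null region (plus or minus 0.20 euros) come from the lecture; the F value of 4.8 is an illustrative placeholder and is not taken from the paper, and the function names are hypothetical.

```python
import math

def confidence_interval(mean_diff, f_value, t_crit):
    """Interval for a mean difference, recovering the standard error from a reported F.

    Assumes an F with 1 numerator degree of freedom, so that F is the square of t.
    """
    t_obtained = math.sqrt(f_value)          # square-root the F to get the t
    se = abs(mean_diff) / t_obtained         # SE = mean difference / obtained t
    return mean_diff - t_crit * se, mean_diff + t_crit * se

def compare_with_null_region(ci, null_region):
    """Say what the interval allows us to conclude about the null region."""
    lower, upper = ci
    null_lower, null_upper = null_region
    if null_lower <= lower and upper <= null_upper:
        return "accept null region"          # interval lies wholly inside the null region
    if upper < null_lower or lower > null_upper:
        return "reject null region"          # interval lies wholly outside the null region
    return "insensitive"                     # interval spans trivial and interesting values

# 1.89 and 2.02 are from the lecture; 4.8 is a placeholder F, NOT from the paper.
ci = confidence_interval(mean_diff=1.89, f_value=4.8, t_crit=2.02)
verdict = compare_with_null_region(ci, null_region=(-0.20, 0.20))
print(ci, verdict)
```

With this placeholder F the interval runs from roughly 0.1 to 3.6, which excludes zero but still overlaps the null region, so the function returns "insensitive".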

I haven’t distinguished the interesting possibilities; I haven’t distinguished the null-region boring values from the interesting ones. So it’s insensitive. So you see, when you think in terms of intervals, you’re giving yourself a tougher task than you do just in terms of significance testing. But what do we gain by that? What we gain is the ability to assert null hypotheses, because if the confidence interval had been within the null region, I could have asserted the null-region hypothesis, the null hypothesis. If you just do significance testing, you’re not in a position to assert a null hypothesis. So that’s the end of orthodox statistics.

We now move on to Lakatos. “State the hard core of the research program the authors are working in.” So we go back to this diagram here and put a Lakatosian spin on it. The hard core is a subset of the stuff over here, the theories and ideas that motivated the current substantial theory. It’s a very particular part of that: it’s those core ideas that have historically persisted for some time and motivated a number of particular substantial theories; furthermore, ideas that the authors wouldn’t give up on just because some particular motivated theory had been falsified. So the particular theories are in what Lakatos calls the protective belt. So when we were doing Popperian science before, Lakatos would have said that’s all good, but what you’re doing there is working in the protective belt; just remember there’s the whole research program, which includes the hard core, in which you have hypotheses protected from falsification. So that’s how it fits in with the Popperian notions that we had before. So the hard core isn’t something you can just logically deduce. As someone was saying before, you said you read up on De Cremer and you found he works within a behavioral forecasting paradigm. So historically, and you can get that from the introduction, he’s had a certain commitment to a certain set of ideas
that have motivated particular papers. So now just have a go at saying what you think the hard core is with the person next to you. Again, I just want to say there isn’t necessarily one right answer to that, because it’s not as if the paper is going to say “and this is my hard core.” What you want to do is think: is this something I can say which I think has been there for some time, historically, which has not been directly put up to test, but which has been motivating and inspiring particular theories that are put up for test?

Okay, so this is what I came to. It’s basically the behavioral forecasting idea: people systematically underestimate their behavioral and emotional reactions in a broad range of domains. Yes? Yes, as I say, the hard core is a subset of those things. So certainly in the introduction he might have mentioned the hard core, and he might have mentioned other stuff that wasn’t the hard core. Yeah, good point. And the other thing this relates to is the substantial theory, because this is fairly close to a substantial theory that was mentioned previously. This has actually come up as a mistake in previous assignments: sometimes people have given as the hard core the same thing as the substantial theory. But that can’t be right. Why can’t it be right? Exactly: the substantial theory is the thing that you would give up, whereas the hard core is something you’re not going to give up; it’s not going to be falsified. So think about it in relation to the substantial theory. If you had as the substantial theory a claim that was similar to this, that people systematically underestimate their behavioral and emotional reactions in any domain in which they’re imagining versus... in all domains

in which, in all domains, imagination versus reality is going to involve underestimates: if you say “for all domains” and that’s your substantial theory, which is essentially what we had before, then that cannot be your hard core. So just be clear about that. Now, I’ve put here “in a broad range of domains” because that renders it essentially unfalsifiable. So it doesn’t apply to apologies? That’s all right: it does apply to this, this, and this; let’s carry on using it to think about somewhere else where it might apply. You see, you can keep it as your guiding idea even though you’re not directly testing it yourself, and that’s your hard core.

“Does the paper contribute to the research program in a progressive or degenerating way?” Well, remember, theoretically progressive means you made a novel prediction. Did it make a novel prediction? As far as I can tell they do: they predict that this behavioral forecasting phenomenon applies to apologies. Now, I’m no expert, but just based on the paper, it seems it’s novel, so it’s theoretically progressive. They test it and confirm it, so it’s empirically progressive. So the paper contributes to the research program, as we’ve defined the research program, in a progressive way. Yes? Degenerating means they don’t actually make a novel prediction, yeah. So we discussed that in the Lakatos lecture, but it’s a really good point: you think, well, isn’t that a case, then, where it’s degenerating? And it depends. If you’re taking your direct replication to be part of the process by which you’re establishing the claim that defines the result that you’re replicating, then my answer was that you’re still contributing in a progressive way, because what you’re doing is making sure we can accept the result. It’s only when we accept the result, which is itself a sort of low-level regularity or hypothesis, it’s only when we decide to accept
the result that we have confirmation of the prediction or negation of it. But if it was a well-established result that’s been replicated many, many times, and for your research program you put forward as “my prediction of this research program” this result that we already know, then that wouldn’t count as a novel prediction. So Lakatos likes predictions to be novel. We discussed the tweak on that, temporal novelty or... Yes? ... they could not really make any headway ... Yes, and that’s the really interesting case, where one piece of research is contributing progressively to one program but not to the other, because that’s how research programs, despite being unfalsifiable in themselves, can out-compete each other. So that is a really interesting case. Yes?

Oh, back to direct replications, which is of relevance to your assignment. So my answer is: when people say “I have a result,” what they’re actually claiming is a hypothesis. For example, they say “what I found was that if you prime people with achievement words then they try harder on crosswords.” That’s actually a hypothesis, because all you’ve actually observed is Bill did this, Jane did this, Sarah did this, and so on. But you’re not claiming that Bill did this, Sarah did this; those are particular observations. What you’re claiming is that when you prime people with achievement words they try harder on crosswords. So what people call a result is actually some low-level regularity that probably bears on a substantial theory. So do we accept that regularity? Well, the thought is not just with one experiment: we need to directly replicate it. But there comes a time when you’ve directly replicated it enough that you then accept the

result as genuine, and then it can count against the theory or for the theory. So to the extent the replication is still in the business of asking “is there really this result there?” and making it safe or unsafe, then it’s still potentially contributing in a progressive way. Does that make sense? At least that’s my take on the situation. Right, final section. I didn’t give you a break, but I thought we’d get through this and you’d finish early. Yes, was there a question? Well, that was Feyerabend’s criticism of Lakatos: he said, isn’t it all just shades of gray, really, and how much progressiveness is really progressing, and so on. Yeah, I could see that for a pattern of research overall you might say it’s not clearly progressive or degenerating. What Lakatos tried to say about that was that as long as you do make some novel predictions and at least some of them are empirically confirmed, then it’s progressive. So you can have a fair amount of failed predictions in your protective belt, because that’s what protective-belt hypotheses do: they sometimes fail.

All right, the Bayes section. The first two questions are really just asking you to note down again a couple of things you’ve already worked out, because they’re relevant to the question; let’s get them down. So what was the mean difference, and what’s the standard error of the difference? We worked them out before. So the question’s slightly different, but now the issue here is: can we come up with, given our theory, an expected size of difference, or a maximum? So this is what I’m going through in this slide; ignore the exact wording of the question there. Now, I’ll do study 2, because it’s just slightly more complex. For study 1 you could use the pilot and just say the difference in the pilot is my expected difference for study 1. Fine. Study 2 follows two previous studies and uses a different scale, so it’s slightly more complicated. So just go
through that, just to give you some more conceptual tools. The mean difference in the pilot on the rating scale was 0.75. Now if you’re analyzing study 1, you could say, well, that’s as good a guess as any as to the order of magnitude I’d expect. That’s fine: that could be your expected effect size for study 1. Now in study 1 it was 1.75. So if we’re doing study 2, what we could do is find the mean of both studies, 1.25; that’s using all the information. But notice this is on a seven-point scale, and study 2 is 0 to 10 euros. So I’m going to do something which I think is just a simple way of converting between the two scales. This scale goes 0, 1, 2, 3, 4, 5, 6, and on up to 10, so it’s got eleven scale points. This was a 1 to 7 scale, so it had seven scale points. So I’ll simply scale up by that ratio, eleven to seven. You could do it other ways, but this just strikes me as the simplest way of getting between a difference on the two scales. We don’t really know if the scales will behave in the same way, but in the absence of any other information, this would be one way of calibrating, and that leads to a difference of two euros. Now this isn’t the one right way of doing it; there are a number of approaches you could take, so long as what you provide is some sort of informed answer, say based on a previous study. And the good thing about direct replications is that this is really easy, because you have the result from the first study that the authors were excited by, so that’s what you expect for the second study, unless you’ve got some other reason; by default that’s what we expect from the second study.

Now what about a maximum? Well, there is a hard maximum. By hard maximum I mean there couldn’t possibly be a difference between conditions of more than ten euros. That would obtain if in one condition it was 0 euros and in the other condition, in the population, it was 10 euros. Everyone did ten in one case, everyone did
zero in the other: you’d then get a population mean difference of

10. It just can’t be more than that; that is extreme. So we have a hard maximum here. But note this maximum is kind of implausible, in the sense that what you’d have to have is everyone in the population doing ten in one condition and everyone in the population doing zero in the other. That’s kind of implausible, right? It does have the advantage of simplicity, though: we can arrive at it without looking at data. So this one is more informed, and maybe this one is more simple.

So to go back to the general problem: the problem here is how do we model the predictions of H1? That’s what we need to do a Bayes factor. By model I mean represent: how do we represent what the alternative hypothesis is saying, what our theory is saying, what it is actually predicting? Well, it’s the same problem as how you derive predictions from theory. We’ve been talking about going from a substantial theory to particular predictions; we said we need auxiliary assumptions. What do we want of those assumptions? Well, we want them to be informed and simple. And that’s just what it is with a Bayes factor, because doing Bayes factors is doing science. So we want informed and simple auxiliaries that lead us to representing what our theory predicts.

So the next question: “Specify a probability distribution for the difference expected by the theory and justify it.” So if we used the maximum here, we could say 0 to 10 and put a uniform on it. This isn’t informed by anything in particular, but it is simple. Or, as I said, the other rule of thumb you can use is: if you’ve got an order-of-magnitude expected value, you use a half-normal and you scale it by having the standard deviation equal to the effect size you think is of the order of magnitude of what you expect. So what that is saying is that smaller values are more likely than bigger values. I think there’s a good reason for thinking that’s a good way of approaching this. With the
Reproducibility Project they found that, when you take a hundred studies published in 2009 and try to directly replicate them, effect sizes are overestimated by a factor of about two. So whatever you think a good effect size is, probably smaller values are more likely than bigger values, and that’s what this half-normal represents.

“So what is the Bayes factor in favor of the theory over the null hypothesis?” I’ve asked you to put in a screenshot of the Bayes factor calculator site so I can see what you’re doing. So I put it in here for the half-normal. Sorry, I said to justify it: the justification for the maximum, the vague theory, is simplicity, but the justification for this one is that it’s informed and otherwise simple. I personally think this is the best combination of simplicity and being informed. This one I find a bit implausible, frankly. This one is informed by the actual studies in that same paper, by the same authors in the same lab, so this is the one I’d really go for. But I’m just showing you the other case to bring up these issues. You’d only need to do one of them; I’m not asking you to do two. I’m doing two here just so you see how to think about it. You get different answers. So there’s the vague theory, the uniform; I call it vague because the theory is saying “I don’t know what it is; it’s anything from the minimum, zero, to a maximum of 10,” and that’s the vaguest a theory can be. This theory is a bit more precise. And the vague theory gets less support by Bayes than the precise theory: Bayes punishes you for being vague. So if you can use the literature, past studies, to help inform you about what you really predict, then that’s going to be good for you if the predictions come out. And remember, when you do that, you have to state the reasons. The reasons are the calculations we did here: I based it on the previous two studies and then we scaled a little bit, so it was informed. And then a reader of your paper is in a position to see whether they buy those reasons. If not, you
have an argument about it and come up with better ones. So you remember the criteria here. What would follow from this Bayes factor of two? Yeah: we had our convention that round about three means the evidence is worth taking notice of. So this

would be saying the evidence isn’t particularly strong; it’s not really worth taking note of yet for the vague theory. But here this is above three, so the evidence is worth taking note of for the more informed theory. And since, for me, in terms of assessing the substantial theory in the paper, I’m going to go with the informed theory, this does provide evidential support worth taking note of for the substantial theory over the null hypothesis.

There’s a screenshot here of what you do. You say, in terms of the distribution of plausibilities of different population values given the theory, that it is not a uniform. So then these boxes come up, and for the half-normal you always put zero here and you always put one here. Then you’re left with just one box, and what you put in there is your estimated effect size, your order-of-magnitude estimate of what you think it’s going to be. Here you put in the sample mean difference and the standard error. Okay, go, and you get the Bayes factor of 5.42, as I’ve reported.

Final question: “What does the Bayes factor tell you that the t-test does not?” Well, it’s sort of going to depend on what you get in the paper. I’ve put something here fairly general. Of course you can’t copy this; just to be clear, you can’t just put this as your answer or paraphrase it closely. So now that I’ve said it, you’re going to have to think of a different way of putting it, or something else to say. But they both provide support for the theory rather than the null, in that the t-test was significant and the authors took that as support for the theory, and the Bayes factor tells you there’s support as well. What the Bayes factor does, by the way, which you don’t get from orthodoxy, is tell you the amount of evidence. I know authors sometimes say that they got strong evidence or weak evidence; that isn’t a conclusion that follows in orthodoxy, and no orthodox statistic tells you that. The Bayes factor is a continuous measure of degrees of evidence. The t-test just allows a black-and-white
decision to accept or reject a null; in this case, to reject it. So in one sense, in a pragmatic sense, they came to the same conclusion, even though ideologically or conceptually they’re radically different things. In a given situation they might be very different. For example, with a non-significant result, the t-test might not allow any conclusion to follow at all; it’s just non-significant. But the Bayes factor might tell you the degree of evidence you have for the null, and it might tell you that actually there is substantial evidence for the null here, something the t-test can’t tell you; or it might just say the data are insensitive, and that’s something you didn’t know from doing just the t-test.

All right, so that’s everything. I hope this has now helped you to integrate and pull together everything we’ve covered in the course, to see a little bit of the connections and the overview, and has given you a set of conceptual tools for approaching research generally. Now when you come to doing your own assignments, please feel free to use Study Direct to ask questions, for everyone benefits from the discussions we have. Thank you. Yep?
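The Bayes factor comparison from the Bayes section can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the calculator site used in the lecture: the data are summarized by a normal likelihood with the sample mean difference and its standard error, the null is a point at zero, and the theory is represented either by a uniform from 0 to 10 or by a half-normal scaled by the expected difference. The mean difference (1.89), the pilot and study 1 differences (0.75 and 1.75) and the 11-to-7 rescaling are from the lecture; the standard error of 0.86 is an illustrative placeholder, and the function names are hypothetical, so the outputs need not match the figures reported in the lecture.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def bayes_factor(mean_diff, se, prior_pdf, lower, upper, steps=20000):
    """Evidence for the theory over a point null at zero.

    Averages the likelihood of the observed difference over the population
    differences the theory predicts (midpoint-rule integration from lower to
    upper), then divides by the likelihood under a true difference of zero.
    """
    width = (upper - lower) / steps
    marginal_theory = 0.0
    for i in range(steps):
        theta = lower + (i + 0.5) * width
        marginal_theory += normal_pdf(mean_diff, theta, se) * prior_pdf(theta) * width
    return marginal_theory / normal_pdf(mean_diff, 0.0, se)

# Expected difference: mean of the pilot (0.75) and study 1 (1.75),
# rescaled from a 7-point scale to the 11-point euro scale.
expected_diff = (0.75 + 1.75) / 2 * 11 / 7

mean_diff = 1.89   # sample mean difference in study 2 (from the lecture)
se = 0.86          # placeholder standard error, NOT taken from the paper

# Vague theory: any difference from 0 to the hard maximum of 10 is equally plausible.
bf_uniform = bayes_factor(mean_diff, se, lambda theta: 1 / 10, 0.0, 10.0)

# Informed theory: half-normal with SD equal to the expected difference, so smaller
# differences are more plausible than bigger ones. It is integrated only up to 10;
# the mass beyond that is negligible.
bf_half_normal = bayes_factor(
    mean_diff, se, lambda theta: 2 * normal_pdf(theta, 0.0, expected_diff), 0.0, 10.0
)

print(round(expected_diff, 2), round(bf_uniform, 2), round(bf_half_normal, 2))
```

With these placeholder inputs the half-normal theory comes out above the conventional threshold of 3 and the uniform theory below it, which illustrates the point that the vaguer theory gets less support from the same data.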