9.520 – 09/16/2015 – Class 03 – Carlo Ciliberto & Charlie Frogner: Math Camp

So today the class is divided in half: in the first part I will give a refresher on the basic notions you need in order to delve into functional analysis, and later Charlie will discuss the probability background you need in order to face this course. We are putting slides from previous years and notes online, so if anything is not clear after these refreshers you can check them and see what you didn't get, but of course you can ask questions as soon as you have them today.

Regarding the functional analysis part, we are going to look at two objects: Hilbert spaces and linear operators. The main reason we focus on these two objects, and on Hilbert spaces in particular, which will keep coming up during the course, is that, as Tomaso said in the previous classes, one fundamental ingredient of any learning problem is to define a good space of functions in which to look for candidate solutions of your problem. As it turns out, Hilbert spaces are extremely flexible: they allow you to define and describe many interesting families of functions that we are going to study, and at the same time they provide enough structure, and simple enough structure, to do optimization and to use linear algebra tools even on spaces of functions that can be infinite dimensional. Indeed, we are going to review the kinds of operations you can do on a Hilbert space through linear operators, which you can think of as a generalization of the concept of a matrix from the finite dimensional setting.

Let me start by giving you the definition. A space H is a Hilbert space if it is a vector space, equipped with an inner product, which I will denote ⟨·,·⟩, and complete with respect to the norm induced by this inner product. So I have introduced several concepts here: vector space, inner product, norm, completeness. I will go through all of them in order to refresh what they mean. I am assuming that all of you have already seen vector spaces; if you have not, you can check the notes. I will not give the full definition, but basically these are sets where you can do operations like sums and multiplication by a scalar. In this course we will consider vector spaces over the real field, but with small changes everything I tell you today can also be generalized to the complex case. Examples of vector spaces you have probably seen are the set of real numbers R, or its Cartesian product with itself, R^n; then you have infinite dimensional spaces like ℓ², the space of all square summable sequences; spaces of functions, which will be the interesting ones for us when looking for candidate solutions to a problem, like L²([0,1]), the set of square integrable functions on the interval [0,1]; or the space of continuous functions C([0,1]); and you can make many more examples. These are all vector spaces, but not all of them are Hilbert spaces, and we will see which parts of the definition they fail to satisfy. Once we have a vector space, we will want to compare elements in the space: since the solution we are looking for is a vector in this vector space, it will be a function in this vector space, we need something to compare elements, and inner products are an extremely good tool to do that.
So here is the definition of an inner product. Let V be a vector space; a function ⟨·,·⟩ : V × V → R is an inner product if it satisfies three conditions. First, it must be symmetric: ⟨v, w⟩ = ⟨w, v⟩ for each v, w ∈ V. Second, it must be bilinear: ⟨λ₁v₁ + λ₂v₂, w⟩ = λ₁⟨v₁, w⟩ + λ₂⟨v₂, w⟩ for each λ₁, λ₂ ∈ R and each v₁, v₂, w ∈ V; because of the symmetry, the same holds in the other argument, so the function is linear on both sides. Finally, it must be positive definite, which means that the inner product of an element of the vector space with itself is always greater than or equal to zero, ⟨v, v⟩ ≥ 0, and furthermore it is zero if and only if v is the zero element of your vector space. This is called the positive definiteness of the inner product.

Examples of inner products you have probably seen: on R^n you have the dot product, so if x and y are elements of R^n you can define the inner product between x and y as ⟨x, y⟩ = xᵀy, in a sort of matrix notation, which is simply the sum over i from 1 to n of x_i y_i, where the x_i are the coordinates of x with respect to the canonical basis of R^n. You have probably also seen something like this for functions: on L²([0,1]), the square integrable functions on [0,1], we can take two functions f and g and define their inner product as ⟨f, g⟩ = ∫₀¹ f(x) g(x) dx. This is well defined, it satisfies all the properties above, and it is an inner product on L²([0,1]). Here I took the simple case of an interval on the real line, but of course you can take a more general space.

So now we have inner products. Among the many properties that inner products have, they always induce a norm on the space they live on, and norms are similarly useful because they let you measure the size, in some hand-wavy sense, of different vectors. Of course this is more intuitively clear in R^n; it becomes less clear when you start considering norms on infinite dimensional spaces of functions. The definition of a norm, which does not require having an inner product, is the following: take a vector space V; a norm is a function ‖·‖ : V → R that satisfies, first of all, non-negativity, so ‖v‖ ≥ 0 for each v ∈ V, and furthermore the inequality is strict except at zero: ‖v‖ = 0 only if v is the zero element of V. Then the norm must be homogeneous with respect to the field: ‖λv‖ = |λ| ‖v‖ for each v ∈ V and λ ∈ R. And finally we have the triangle inequality, or subadditivity: ‖v + w‖ ≤ ‖v‖ + ‖w‖ for each v, w ∈ V.
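To make the two examples above concrete, here is a small numerical sketch, not from the lecture, that computes the R^n inner product, an approximation of the L²([0,1]) inner product on a grid, and the norms they induce (the particular functions, the grid, and the library calls are my own choices for illustration):

```python
import numpy as np

# Inner product on R^n: <x, y> = x^T y = sum_i x_i y_i, and the induced norm.
x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, -1.0, 2.0])
dot_product = x @ y
norm_x = np.sqrt(x @ x)

# Inner product on L^2([0,1]): <f, g> = integral_0^1 f(t) g(t) dt,
# approximated here by averaging f*g on a fine uniform grid over [0, 1].
t = np.linspace(0.0, 1.0, 100_001)
f = np.sin(2 * np.pi * t)
g = t ** 2
l2_inner = np.mean(f * g)
l2_norm_f = np.sqrt(np.mean(f * f))   # close to 1/sqrt(2)

print(dot_product, norm_x, l2_inner, l2_norm_f)
```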

The point is that you can easily verify that if you define ‖v‖ = √⟨v, v⟩, the square root of the inner product of an element with itself, this is a norm, and it is the norm induced by that inner product. Not all norms are induced by some inner product, and that will make the difference between different kinds of function spaces. The main point of introducing a norm is that norms provide lots of tools to study how sequences of elements in a vector space behave. Of course we could use weaker notions, such as a metric or a topology, but since we are always working with Hilbert spaces we will always have an inner product and the norm induced by it, so let us see what convergence with respect to a norm means.

Here is the definition: take a sequence (v_n) of elements of our vector space; we say that v_n converges to v as n goes to infinity, for a certain v ∈ V, if ‖v_n − v‖, the distance to our target point, converges to zero as n goes to +∞. So we are leveraging the structure of the real field: we know what limits mean in R, and we are using that to define convergence in the vector space; asymptotically the sequence is getting arbitrarily close to that point. Every convergent sequence in a normed vector space has the following property: for each ε > 0 there exists a number N_ε, say a natural number, such that for each n, m > N_ε we have ‖v_n − v_m‖ < ε. What this means is that if you take arbitrarily far elements of the sequence, far in the sense that their indices are larger than a certain number that you do not necessarily need to know, then any two elements you take from that tail of the sequence will be arbitrarily close to each other. Basically this is telling you that the sequence is slowing down so fast that the distance between any two elements far enough along the sequence tends to zero. This property is called the Cauchy property, and a sequence that satisfies it is called a Cauchy sequence.

However, the arrow cannot always be inverted: it is not true in general that a Cauchy sequence converges. This is actually strange if you think about it: it means you can have a sequence in your space that is slowing down very fast, to the point that any two elements with arbitrarily large indices are arbitrarily close to each other, and yet the sequence is not converging to a point of your space. We will not like that kind of space; the spaces we are going to consider will be the so-called complete spaces, and complete means that every Cauchy sequence actually converges to a point of the space. So a normed vector space, a vector space with a norm, is complete if every Cauchy sequence in it converges to an element of V.

So we are not taking spaces that have holes, in some sense. The standard example of a complete space is R, which is actually complete by construction: if you have ever seen how the set of real numbers is built, it is done exactly by taking the set of rational numbers, taking all the Cauchy sequences there, and using each Cauchy sequence that does not converge as the symbol that fills the corresponding hole. For instance, you take a Cauchy sequence of rationals that you know gets close to π, or to √2, or something like that, and you add that symbol to your space; in the end you get R. This process of completion is usually written by saying that R is the completion of Q.

A less standard example of a space that, like Q, is not complete is the set of functions that are continuous on a closed interval, say C([0,1]). The space of continuous functions on the closed interval is contained in L²([0,1]), and therefore it inherits an inner product. Now, I will be a bit hand-wavy in the argument showing that C([0,1]) is not complete with respect to the norm induced by this inner product, but bear with me. I am going to take a sequence of functions f_n built like this: on the interval [0,1] I fix some subinterval [a, b], which is the same for all elements of the sequence; f_n is equal to 1 on [a, b], it is 0 away from it, and it has two diagonal ramps connecting 0 and 1 on either side. The only thing that changes as n increases is the steepness of these ramps, but the ramps never become vertical, so all these functions are continuous and therefore belong to C([0,1]). The only problem is that this sequence, which is of course also a sequence in L², converges in the L² norm to the function f̄ which is the step function equal to 1 on the segment [a, b] and 0 outside. Of course f̄ is not continuous, so f̄ is not in C([0,1]). So the point is that we have a sequence which is convergent in this space with that norm, hence it is a Cauchy sequence, because every convergent sequence is a Cauchy sequence, but this Cauchy sequence is not converging to a point inside C([0,1]). Is that clear? So C([0,1]) with this norm is not complete, and therefore it is not an example of a Hilbert space, unlike L²([0,1]).

The main point is that it is not true in general that C([0,1]) is not complete: you could consider other norms. An example is the sup norm on the space of continuous functions, defined as ‖f‖_∞ = max of |f(x)| for x in the interval. This is a norm, and with respect to this norm C([0,1]) actually is complete. The intuition behind this is that by changing the norm you are changing which sequences are Cauchy and which are not, and indeed this sequence is no longer Cauchy with respect to the sup norm: if you take f_n − f_m, any two elements of the sequence, you can see that the maximum of the difference does not go to zero, since the support of the ramp keeps shrinking at each step. So this is not a Cauchy sequence for the sup norm, and that is the intuition behind why this space can be complete: you are reducing the set of sequences that are Cauchy and that therefore need to converge.
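Here is a small numerical illustration of this example, under an assumed explicit form for the ramp functions (the choice of [a, b] = [0.3, 0.7], ramps of width 1/n, and the grid are mine, just for illustration): the L² distance to the step function goes to zero, while the sup distance between far-apart elements of the sequence stays bounded away from zero.

```python
import numpy as np

# Assumed ramp functions: f_n is 1 on [a, b], 0 outside [a - 1/n, b + 1/n],
# and linear on the two ramps; continuous for every n.
a, b = 0.3, 0.7
x = np.linspace(0.0, 1.0, 200_001)

def f(n):
    left = np.clip((x - (a - 1.0 / n)) * n, 0.0, 1.0)
    right = np.clip(((b + 1.0 / n) - x) * n, 0.0, 1.0)
    return np.minimum(left, right)

step = ((x >= a) & (x <= b)).astype(float)   # the discontinuous limit f_bar

for n in [2, 10, 100, 1000]:
    l2_dist = np.sqrt(np.mean((f(n) - step) ** 2))   # tends to 0 as n grows
    sup_dist = np.max(np.abs(f(n) - f(2 * n)))       # stays bounded away from 0
    print(n, l2_dist, sup_dist)
```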

So now we have the full definition of a Hilbert space and we can go through it again: we have an inner product, we have the norm induced by the inner product, and we ask that the space is complete with respect to this norm. Spaces like L²([0,1]) and ℓ² satisfy all these conditions. Beyond the definition of a Hilbert space, we will actually use one further property: we will ask our Hilbert spaces to be separable, and I will tell you in a moment why. Throughout the course we will always assume, unless otherwise specified, that the Hilbert spaces we use are separable. The definition is as follows: let V be a normed vector space (it may or may not be a Hilbert space, it does not matter here); we say that V is separable if there exists a set S contained in V such that S is countable and the closure of S is all of V. It is exactly the same thing that happens for the real numbers with the rational numbers.

The point of this definition is that we will work with spaces that can have fancy structure and be infinite dimensional, but we want, in some sense, to control the size of the space. Intuitively, separability says there is a countable skeleton inside the space, and every point of the space can be reached by a sequence taken from that countable set; so the actual size of the space, although infinite, is controlled by the countable set sitting inside it. The following theorem says what this actually buys us: if V is a Hilbert space, then V is separable if and only if there exists a sequence (v_n) of elements of V which is an orthonormal basis of V; the key point is that the basis is countable. So by requiring separability we are imposing that our Hilbert space has a countable basis, and this is very important when we study linear operators and when we generalize properties of linear algebra to infinite dimensional spaces.

Orthonormal basis means that the elements of the basis are normalized, so they have norm one, and they are orthogonal, so ⟨v_i, v_j⟩ = 0 for each i ≠ j. What an orthonormal basis gives you is that every element of the space has a unique representation, a unique decomposition, as a linear combination of the elements of the basis: for each x in your space you can write x = Σ_{i=1}^{∞} ⟨x, v_i⟩ v_i, the series of the projections of x onto the v_i times the v_i themselves. This is the unique way of writing an element with respect to this basis, and we like the fact that this expansion is a countable sum.
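As a small sanity check of the expansion formula in finite dimensions, here is a sketch (my own illustration, not from the lecture) that builds an orthonormal basis of R^5 from a QR factorization and verifies the decomposition x = Σᵢ ⟨x, vᵢ⟩ vᵢ:

```python
import numpy as np

rng = np.random.default_rng(0)

# Columns of Q form an orthonormal basis of R^5 (QR factorization of a random matrix).
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
basis = [Q[:, i] for i in range(5)]

x = rng.standard_normal(5)

# Expansion x = sum_i <x, v_i> v_i
coefficients = [x @ v for v in basis]
reconstruction = sum(c * v for c, v in zip(coefficients, basis))

print(np.allclose(reconstruction, x))      # True: the expansion recovers x
print(np.allclose(Q.T @ Q, np.eye(5)))     # True: the basis is orthonormal
```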

OK, so now that we have imposed our conditions on Hilbert spaces, we can start to think about what kind of operations we can do on them; let me know if you have any questions on this first part. Linear operators are the most natural operations you can do on linear spaces, because they maintain the structure of the linear space as they act on it. The definition is quite simple: a linear operator L goes from a space V to a space W; I am assuming throughout that these will always be separable Hilbert spaces. Linear operators do not actually need all these properties, but for simplicity let us assume they are always like this. L is linear if it has the linearity property we have seen before: L(λ₁v₁ + λ₂v₂) = λ₁L(v₁) + λ₂L(v₂).

As I told you, linear operators are a sort of generalization of what matrices are in the finite dimensional case. Indeed, if you take V and W to be, for instance, R^m and R^n for some m and n, then L can be associated with a matrix with n rows and m columns, and conversely any such matrix induces a linear operator between these spaces. Therefore we can generalize properties of matrices that we already know. For instance, for a matrix we have the transpose, which is an element of R^{m×n}. You can define the transpose by saying how you rearrange the entries of the matrix, but there is a more abstract definition that allows us to generalize the concept to the adjoint of a linear operator on general infinite dimensional spaces. From the transpose of a matrix you see that if you take vectors x ∈ R^m and y ∈ R^n, then ⟨Lx, y⟩ = ⟨x, Lᵀy⟩: the inner product of Lx with y equals the inner product of x with Lᵀy.

The definition of the adjoint, which is the generalization we want, is the following: L* : W → V is the adjoint of L if for each v ∈ V and w ∈ W we have ⟨Lv, w⟩ = ⟨v, L*w⟩, where the first inner product is taken in W and the second in V. So if a linear operator satisfies this property it is the adjoint of L, and one can show that the adjoint always exists. Now, if L goes from V to the same space V, you can define what is called a self-adjoint operator, which is an operator equal to its own adjoint; self-adjoint operators are the generalization of symmetric matrices.
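A quick finite dimensional check of the defining property of the adjoint, with a random matrix standing in for L (my own illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

m, n = 4, 3
L = rng.standard_normal((n, m))      # a linear operator R^m -> R^n, as an n x m matrix
x = rng.standard_normal(m)
y = rng.standard_normal(n)

# Defining property of the adjoint (here simply the transpose): <Lx, y> = <x, L^T y>
print(np.allclose((L @ x) @ y, x @ (L.T @ y)))   # True

# A self-adjoint (symmetric) operator on R^m, equal to its own adjoint
A = L.T @ L
print(np.allclose(A, A.T))                        # True
```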

Before going further, let me also introduce, for self-adjoint operators, the concepts of eigenvectors and eigenvalues. Let L be a linear operator which is self-adjoint, for simplicity; v is an eigenvector of L with eigenvalue λ if Lv = λv. Basically, if you apply the linear operator to the vector, you get back the same vector scaled by a certain λ.

Before going to the spectral theorem, which allows us to generalize concepts such as the eigendecomposition and the singular value decomposition to linear operators, we need to introduce a specific class of linear operators: compact operators. You will not really need to know much about compact operators for the class, because we will basically always work with compact operators; this is just the notion that is needed in order to do linear algebra on our function spaces, so I am going to tell you what it is, but do not worry if you do not already know it. The point is that a linear operator is compact if the closure of the image of the unit ball of V under L is compact in W; compact means that if you take any open cover of this set you can always find a finite subcover that still covers the entire set.

This is the property we need in order to get to the spectral theorem, which goes like this. We have a Hilbert space V and an operator L : V → V which is self-adjoint and compact (that is why we introduced compactness). The thesis is that there exists a sequence (v_n) of elements of V and a sequence (λ_n) of elements of R such that (v_n) is an orthonormal basis of V and, at the same time, each v_n is an eigenvector of L with eigenvalue λ_n, so L v_n = λ_n v_n for each n. This implies that L can be written in terms of these objects as the summation over n from 1 to infinity of λ_n v_n v_n*, where v_n v_n* denotes a rank-one operator built from v_n; I will say in a moment exactly what this notation means.
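In finite dimensions the spectral theorem is just the eigendecomposition of a symmetric matrix; here is a short sketch (my own illustration, using a symmetric matrix as a stand-in for a compact self-adjoint operator) verifying both L vₙ = λₙ vₙ and the rank-one expansion L = Σₙ λₙ vₙ vₙ*:

```python
import numpy as np

rng = np.random.default_rng(3)

# A self-adjoint (symmetric) operator on R^4
A = rng.standard_normal((4, 4))
L = (A + A.T) / 2

lam, V = np.linalg.eigh(L)   # eigenvalues lam, orthonormal eigenvectors as columns of V

# Each column v_n satisfies L v_n = lambda_n v_n
print(np.allclose(L @ V, V @ np.diag(lam)))       # True

# L as a sum of rank-one operators lambda_n * v_n v_n^T
reconstruction = sum(l * np.outer(v, v) for l, v in zip(lam, V.T))
print(np.allclose(L, reconstruction))             # True
```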

To be precise, with some abuse of notation, v_n here denotes the operator from R to V that sends a scalar σ to the scaled vector σ v_n, and v_n* is its adjoint, which of course exists because these are linear spaces; composing the two gives an operator from V to V. This composition is what is called a rank-one operator, because its image is a one-dimensional subspace of the target space. So these vectors are not only an orthonormal basis for your space, they are also a way to decompose your linear operator. Furthermore, we have that λ_n → 0 as n → +∞; this is true for compact operators.

This generalizes what you already know for matrices, namely the eigendecomposition, and indeed for a self-adjoint compact operator you can write what I wrote above in a more compact way: L = U Σ U*, the composition of three linear operators, where U goes from V to V and Σ also goes from V to V, defined accordingly. So you can always take a self-adjoint compact operator and expand it like that, and this will be useful because throughout the class you will just use this formulation and work with these objects without thinking too much about whether you are allowed to do so: you know that all the properties you need in order to do it are available.

Just as you have the eigendecomposition, you also have the singular value decomposition for generic compact operators, which simply requires a bit more notation but is exactly analogous. Take L : V → W compact; then there exist sequences (v_n) in V, (w_n) in W and (σ_n) in R, where (v_n) and (w_n) are orthonormal bases, such that L can be written as a combination of rank-one operators, L = Σ_i σ_i w_i v_i*. And you can do exactly the same thing as before: L = P Σ Q*, where, according to the spaces, Q goes from V to V, P goes from W to W, and Σ goes from V to W. Of course P and Q are the generalization of orthonormal matrices: they are what are called unitary linear operators, operators from a space to itself that do not change the norm of the vectors. And I guess this is all you will probably need for the class.
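Again in finite dimensions, the singular value decomposition of a rectangular matrix illustrates exactly this statement; a small sketch (my own, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(2)

# A generic operator L : R^5 -> R^3, as a 3 x 5 matrix
L = rng.standard_normal((3, 5))

# SVD: L = P Sigma Q^T, with orthonormal columns in P and orthonormal rows in Qt
P, sigma, Qt = np.linalg.svd(L, full_matrices=False)

# Reconstruction as a sum of rank-one operators sigma_i * w_i v_i^T
rank_one_sum = sum(s * np.outer(P[:, i], Qt[i, :]) for i, s in enumerate(sigma))
print(np.allclose(L, rank_one_sum))                                    # True
print(np.allclose(P.T @ P, np.eye(3)), np.allclose(Qt @ Qt.T, np.eye(3)))
```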

(A question from the audience about Hermitian operators.) In the setting we are in now, over the real field, the relevant property is positive semi-definiteness: ⟨v, Lv⟩ ≥ 0 for each v ∈ V. Here these operators are additionally required to be self-adjoint, which is what Hermitian means; over the complex field the relationship between the two notions works out a bit differently. The point is that such operators can always be written as M M*, the composition of a linear operator with its own adjoint, a self-adjoint factorization in some sense. And if you go back to the spectral theorem, what characterizes this kind of operator is that the λ's, the sequence of eigenvalues, are not just in R but in R₊. You can see the difference if you think about matrices: positive semi-definite matrices are symmetric and have a spectrum which is always non-negative, while symmetric matrices in general still have a real spectrum, but it can contain negative values. I guess if you do not have any more questions, I will hand over to Charlie.

Hello, okay, so I am doing the second part, on probability theory. Something we want to study in this class is a theoretical framework for analyzing the performance of learning algorithms, and that is going to rely heavily on probability theory. The basic setup is this: a learning algorithm we are going to treat as a map that takes a sample of input-output pairs to a function, and this is the function you are going to use to make predictions. For the sample we use a subscript n to indicate that it is a sample of n points; this is your training set. What we want to study theoretically is the performance of the algorithm, the error it commits on the test set and on the training set, and for that we use two concepts. The first is the empirical error: the empirical error is the average loss on your training set, I_n(f) = (1/n) Σ_{i=1}^n V(f(x_i), y_i), where V is the loss function, which could be, for example, the square loss or the hinge loss, things that were probably mentioned in lecture. We denote the empirical error of a function f by I_n(f). The second notion we are going to use is the expected error, which is going to look very similar, but it is an integral over the entire space of possible (x, y) pairs.

The expected error is taken with respect to a probability measure on the (x, y) pairs. To hopefully make this a little clearer: our sample is coming from a space that I will call Z, the space of input-output pairs, so the integral is over this space Z, and the expected error I(f) is the integral over Z of the loss V(f(x), y) with respect to that probability measure.

There are two major theoretical concepts that encode what we mean by good performance of a learning algorithm. One of them is called consistency, and what consistency says is that the expected error of the function you learn is close to the expected error of the best possible function in your hypothesis class. So f_n is the function we learn, and its expected error converges to that of f*, the function with the lowest expected error in the hypothesis class. Consistency is an asymptotic notion: it holds in the limit as the number of samples goes to infinity. I do not think it is a notion we are going to spend a lot of time on in the course; the one we really focus on is generalization. Generalization is about the relationship between the empirical error and the expected error of the function that you learn, and it looks somewhat similar: again it is asymptotic, and it says that as the number of sample points goes to infinity, the empirical error converges to the expected error. We say that our learning algorithm generalizes if this is true, and it means that the empirical error is a good proxy for the expected error; one way of looking at it is that you can relate the performance you measure on a finite data set to the expected error, which can turn out to be tractable to analyze.

There are also two related notions; I will just talk about one of them. Something you will see is the generalization error bound, and this is probably the closest to what you would intuitively expect a measure of the performance of a learning algorithm to be. It says that for a finite number of examples, the empirical error is close to the expected error of the learned function, and this holds with some probability, which we write as 1 − δ. So it gives you a relationship between an upper bound on the difference between the empirical and the expected error and the number of training points in your training set, and you want that difference to be small.

For the rest of the time I am going to focus on generalization, and the full definition is that we want the empirical risk to converge to the expected risk in probability.
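To fix ideas, here is a small sketch of the two error notions for a fixed predictor, under a hypothetical data model that I am choosing just for illustration (for a fixed f the comparison is simply the law of large numbers discussed below; the subtlety with a learned f_n is exactly the point raised at the end of the lecture):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data model: x ~ Uniform[0, 1], y = sin(2 pi x) + noise.
def sample(n):
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(n)
    return x, y

f = lambda x: 2 * x - 1                 # a fixed predictor (not learned)
loss = lambda pred, y: (pred - y) ** 2  # square loss V

# Empirical error on a training set of n points: I_n(f) = (1/n) sum_i V(f(x_i), y_i)
x_tr, y_tr = sample(50)
empirical_error = np.mean(loss(f(x_tr), y_tr))

# Expected error I(f), approximated by averaging over a very large fresh sample
x_big, y_big = sample(1_000_000)
expected_error = np.mean(loss(f(x_big), y_big))

print(empirical_error, expected_error)
```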

So what I am going to do is unpack exactly what this means, starting from basic definitions: random variables and so on. First of all, there are really several random variables here. The randomness starts with the training set: you have samples, these (x, y) pairs, drawn from some distribution, and from that randomness we get a function of the training set, namely the error at any one of these points; this is a random variable whose probability distribution is derived from the sample. And then of course the mean error is also a random variable whose distribution is derived from the sample.

A random variable, then, we are just going to think of as a variable that takes values in some range, for example the reals, and, most importantly, is associated with a probability distribution, or probability measure. What the probability measure does is assign probabilities to subsets of that range; we call such a subset an event. The assumption we normally make about the measure over the pairs that make up the sample is that they are i.i.d.: you have a collection of n random variables, and what i.i.d. means is that their joint probability distribution is exactly the product of the marginal distributions, and these marginal distributions are all identical. Independence is the fact that the joint factorizes as the product of the marginals, and the identically distributed part is that the marginals of all the random variables are the same. The last concept here is the expectation: the expectation of a random variable is the integral of the variable over its range with respect to its probability measure.
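As a tiny worked example of that last definition (my own, with an exponential random variable chosen only for concreteness), the expectation computed as an integral against the probability measure matches the average of i.i.d. samples:

```python
import numpy as np
from scipy import integrate

rng = np.random.default_rng(5)

# For X ~ Exponential(1): E[X] = integral_0^inf x * e^{-x} dx = 1.
value, _ = integrate.quad(lambda x: x * np.exp(-x), 0, np.inf)
print(value)                                            # 1.0

# The same expectation estimated from i.i.d. samples of X
samples = rng.exponential(scale=1.0, size=1_000_000)
print(samples.mean())                                   # close to 1
```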

If there are no questions about that, I will move on to how we actually go about proving something like that statement of generalization. The fundamental problem is that we want to bound the probability that the empirical risk deviates from the expected risk, but we do not actually know the probability distribution of these random variables. The tool you can use in this case is called a concentration inequality, and what it gives you is this: if you have some very limited information about a random variable, such as its mean or bounds on its values, then you can bound the probability of certain events, usually the event that the value of the random variable exceeds some fixed value.

So I am going to show maybe the simplest concentration inequality, which is called Markov's inequality. The information you have about your random variable in this case is just that it is non-negative, so all of its values are greater than or equal to zero. The statement is that if I take some positive constant c, then I can bound the probability that X exceeds c by its expectation divided by c: P(X ≥ c) ≤ E[X]/c, and this is true for any non-negative random variable. The proof is just a couple of steps. Write out the definition of the expectation as an integral of x from 0 to infinity against the distribution of X. All of these values are non-negative, so clearly if you take out part of the domain of integration, the value of the integral can only decrease: instead of integrating from 0 to infinity we integrate from c to infinity, and the resulting value is a lower bound for the expectation. Moreover, every x in this restricted domain of integration is greater than or equal to c, so if we replace x by the constant c inside the integral, we get a value that is less than or equal to the previous one, and we can factor out the c. By the basic properties of the integral, what remains is just c times the probability that the value of the random variable exceeds c, so c · P(X ≥ c) ≤ E[X], and rearranging gives the inequality. So Markov's inequality is true for any non-negative random variable.
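A quick empirical check of Markov's inequality, with an exponential random variable chosen just for illustration (the distribution and the constants are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)

# Markov: P(X >= c) <= E[X] / c for any non-negative random variable X.
X = rng.exponential(scale=1.0, size=1_000_000)   # X >= 0, E[X] = 1

for c in [0.5, 1.0, 2.0, 5.0]:
    lhs = np.mean(X >= c)      # P(X >= c), estimated from samples
    rhs = X.mean() / c         # the Markov bound
    print(c, lhs, rhs, lhs <= rhs)
```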

Another inequality that we are going to end up using is Chebyshev's. Now the random variable is no longer non-negative, it can have negative values, but what we know about it is that it has finite variance σ². What we want to bound in this case is the deviation between the value of X and the value of its expectation; call that f(X) = |X − E[X]|, so we want to bound the probability that f(X) is greater than or equal to c. It turns out that Markov's inequality can be proven, in almost exactly the same way, for any non-negative function of any random variable, meaning any function that maps the random variable to non-negative values. We are going to use that to derive a bound on this probability in terms of the variance. The trick is that we can apply a monotonic transformation to f and the same transformation to the constant, and the probability does not change: the probability that |X − E[X]| ≥ c is equal to the probability that (X − E[X])² ≥ c². Applying Markov's inequality to this non-negative random variable, you bound the probability by E[(X − E[X])²]/c², and the expectation on top is exactly the variance of X. So the bound is σ²/c².

So those are two pretty common concentration inequalities, and they get us some of the way towards bounding the probability inside the limit. Now the limit itself requires that we define convergence in probability. So I am going to define convergence in probability, and then we will combine it with the concentration inequalities to prove something close to that statement, but not quite. First, regular convergence of a sequence: these are not random variables, this is just any sequence of numbers, or of elements of a set, or whatever. You say that the sequence approaches a limit x if for any ε > 0 there is a point in the sequence beyond which all of the elements are within ε of the limit.
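Written out on one line, the Chebyshev argument just described is the following chain (a transcription of the board computation):

```latex
P\big(|X - \mathbb{E}[X]| \ge c\big)
  = P\big((X - \mathbb{E}[X])^2 \ge c^2\big)
  \le \frac{\mathbb{E}\big[(X - \mathbb{E}[X])^2\big]}{c^2}
  = \frac{\sigma^2}{c^2},
```

where the middle step is Markov's inequality applied to the non-negative random variable (X − E[X])².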

When that is true for every ε, we say that the sequence converges to the limit. Now with random variables there are a few ways to define convergence, because you have to deal with the fact that there is a probability distribution. The definition we are using here is convergence in probability, and we will say that random variables X_i converge in probability to a limiting random variable X if for any positive ε the sequence of probabilities that the random variable deviates from the limit by at least ε converges to zero: P(|X_i − X| ≥ ε) → 0. So this is one way to define convergence of random variables, and it is the definition we are using in generalization: generalization is really saying that the empirical risk converges in probability to the expected risk.

OK, so now we can put that together with the inequalities over there to prove something like this, but not quite: the weak law of large numbers. What the weak law of large numbers says is that you have a sequence of random variables of length n, these are i.i.d., and they have a common mean μ and variance σ². We define their empirical mean, just as over there, as X̄_n, the average of their values. This empirical mean is also a random variable, and what we want to show is that it converges in probability to its expectation, the common mean of the random variables, as the length of the sequence goes to infinity. This is basically one step; it is going to be just an application of Chebyshev's inequality, and the key fact we need is that the variance of this random variable X̄_n is finite, and we want to know its value so we can plug it into Chebyshev. So let us just write out what this variance is.

Using the definition of the empirical mean and the definition of the variance, and the fact that expectations are linear so we can push the expectation inside the sum: the variance of X̄_n is E[X̄_n²] − μ², and E[X̄_n²] is (1/n²) times the sum over i and j of E[X_i X_j]. Now we use the fact that these are independent random variables, and independent random variables are uncorrelated, so every cross term with j ≠ i factorizes as E[X_i]E[X_j] = μ²; that is just what being uncorrelated means. Those cross terms cancel against the μ² we subtract, and what survives is, for each i, a term E[X_i²] − μ² divided by n², and each of those, the expectation of the random variable squared minus its mean squared, is just the definition of the variance of any one of the X's. So what you get is that the variance of the empirical mean is the variance of the individual elements of the sequence divided by n: Var(X̄_n) = σ²/n.

That is going to be our key fact: we know that this random variable has finite variance, and we can just apply Chebyshev's inequality. We say that the probability that X̄_n deviates from its mean by more than ε is bounded by its variance over ε², which is equal to σ²/(n ε²). This holds for every element of the sequence, and the key property is that these bounds converge to zero, which is true because of the n in the denominator: as n goes to infinity the bound goes to zero. So that is the weak law of large numbers, and it looks a lot like our definition of generalization over there: we have the mean of these random variables converging to the expectation. Does anybody see why this does not suffice to prove that a given learning algorithm generalizes?
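Here is a small simulation (my own, with Uniform[0, 1] variables chosen just for concreteness) of the weak law of large numbers together with the Chebyshev bound σ²/(nε²):

```python
import numpy as np

rng = np.random.default_rng(7)

# X_i i.i.d. Uniform[0, 1]: mean mu = 0.5, variance sigma^2 = 1/12.
mu, sigma2, eps = 0.5, 1.0 / 12.0, 0.05
n_trials = 1_000

for n in [10, 100, 1_000, 10_000]:
    samples = rng.uniform(0.0, 1.0, size=(n_trials, n))
    xbar = samples.mean(axis=1)                          # empirical means, one per trial
    deviation_prob = np.mean(np.abs(xbar - mu) >= eps)   # estimate of P(|Xbar_n - mu| >= eps)
    chebyshev_bound = sigma2 / (n * eps ** 2)
    print(n, deviation_prob, min(chebyshev_bound, 1.0))
```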

There is an assumption that does not hold in the setting we are looking at in this class, and it is independence: the random variables we have here, the individual losses, are not independent. The reason is that the learned function depends on all of the points simultaneously, so we no longer have independent random variables and we cannot just apply the weak law of large numbers. Later in the semester we are going to show a way to approach the problem for this particular learning setting. So that should be pretty much it; do you guys have questions? OK.