Variational Methods for Computer Vision – Lecture 2 (Prof. Daniel Cremers)

Okay, welcome to the next class. This is not Multiple View Geometry, this is the Variational Methods class, and we are still in the introduction; I will finish the introduction today. The introduction was mainly designed to motivate the topic of the class a little, to show you some of the recent developments in the field and what kinds of things you can do with variational methods.

One thing that has become extremely popular in computer vision is novel sensors. Very often, once a new sensor comes out, it sparks a lot of research initiatives: people look into what you can do with that sensor, and it often turns out you can do things that were simply not possible before. The sensor that came out a few years ago is the so-called RGB-D camera. It was made popular by Microsoft, and it was actually the fastest-selling piece of consumer electronics ever; I think they sold millions of these cameras in just a few weeks. The technology is not entirely novel. Who has seen this camera before? Okay, almost everyone. It comes with the Microsoft Xbox, and in addition to the camera there are multiple microphones and all sorts of things in there, but what is important for us is mostly the camera. You see there are three openings here: one is a standard color camera, an RGB camera; one is a projector that projects an infrared pattern onto the world, which you don't see with your eyes, but if you have an infrared camera you can actually visualize that pattern; and the third one is a sensor that acts in the infrared, sees the pattern, and based on the distortion of the pattern can infer depth. So you can imagine there is a raster of infrared patterns on the world, and based on its distortion you can estimate the depth of things in a way that is actually fairly accurate. As a result you not only get color images, you also get depth images, a little like what a laser scanner gives you. The difference is that this is a dense two-dimensional depth map and it comes at around 30 frames per second, the same speed as the color camera, so you get simultaneous depth and color images. That is more than the usual laser scanner gives you, because it does not scan the scene point by point, line by line, but generates a depth map 30 times a second.

One of the things you can do is try to recover the geometry of the world around you from that camera. Here is an example of how that could work, with approaches along the lines of those propagated by Steinbrücker, Kerl, and coworkers. The idea is that you have your camera located here, and at the next time instant you have moved it to a new location. What you need most critically in this reconstruction context is to know where your camera is at any given moment, and the way you can determine that is by a variational approach: you want to find six parameters that model the so-called rigid body motion of the camera, three translational degrees of freedom and three rotational degrees of freedom, and you can determine these six parameters by minimizing a cost function, typically locally. This is what the variational technique is about: you set up a cost function and then try to find the parameters that minimize it. What does this cost function look like in this example? There are many cost functions that people have proposed in the last years; what they basically do is evaluate the accuracy of their method and compare it to other cost functions, to other algorithms, and which one is the best is still up to debate today.

With every year we get new and improved algorithms to track the camera, because once you have tracked the camera, you can fuse these depth estimates to get a coherent 3D map of the world, a dense surface of your environment. One thing you can imagine is critical for the accuracy of the overall reconstruction is getting the camera motion accurately: you have one depth estimate from one location, you move the camera and get a new depth estimate, and if there is an error in your camera tracking then the scans don't align; if you try to merge them you get offsets and weird artifacts. So in order to project all these depth maps into one world coordinate system, you have to know exactly where each camera was, and it is critical to get an accurate estimate of these six parameters.

In contrast to a lot of what we talked about in the last lecture, this is not an infinite-dimensional problem, it is a six-dimensional problem, and in that sense it is simpler. In fact, one of the strategies you could think of in six dimensions is a complete search: you could discretize your translation and rotation space and just try all configurations to see which one gives the best cost. You can do it; it is tedious and takes time, but in principle it is possible, and if you implement it efficiently there are more sophisticated so-called branch-and-bound techniques to more quickly prune the space of feasible configurations, and they are actually more or less practical. In fact, there is an algorithm for this type of problem coming out later this year where people use branch-and-bound techniques to efficiently find the six parameters.

Here, the cost function looks as follows. It is a cost function that uses both the depth image and the color image, and the way it works is quite simple: you say that the color at any given pixel, let's call it x, should be the same as the color at the corresponding pixel once I move the camera, because I am looking at the same 3D point. So I say the color, let's call this one I0, is the color at that point x, and the color I1 is the color in the next image. What do we have here? The point x is transferred into 3D: if you have a depth value u, it essentially means that you scale the point by u. That is assuming the point is given in what are called homogeneous coordinates, for those who attended my lecture last semester. It essentially means that there is an x, a y, and a third component that is one: you encode the point with its z-coordinate being one, the image plane sitting at distance one from the origin, so all points on the image plane have third coordinate one. So x is actually a three-component vector (x, y, 1), and once you multiply it by the depth you get (u x, u y, u), and that is this point here. Since we have a Kinect camera we have a depth estimate, the function u, so we can actually project that point from the image plane into the 3D world. Then we can say: once we rotate and translate with the rigid body motion g, we get the same point, but in the coordinates of the new camera. Mind you, we don't know the rigid body motion, but let's assume we have it; then we can transform the point, and π is just a generic projection which divides by the z-coordinate to get back the 2D coordinates, and then we take the color at that point. Now we can say these two colors should be the same, and that should hold for all points x in the image plane Ω.
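Written out, the cost just described would look roughly as follows. The notation here is my reconstruction rather than the exact slide: I0 and I1 are the two color images, u(x) the measured depth at pixel x, x is in homogeneous coordinates, g_ξ is the rigid body motion parametrized by the six parameters ξ, π is the projection that divides by the z-coordinate, and the squared penalty is one common way of expressing "the two colors should be the same".

```latex
% Photometric cost over all pixels x in the image plane \Omega
% (a reconstruction of the cost described in the lecture, not the exact slide)
E(\xi) = \int_\Omega \Big( I_1\big(\pi\big(g_\xi\, u(x)\, x\big)\big) - I_0(x) \Big)^2 \, dx ,
\qquad
\pi\!\left(\begin{pmatrix} X \\ Y \\ Z \end{pmatrix}\right)
= \begin{pmatrix} X/Z \\ Y/Z \end{pmatrix} .
```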

These are called homogeneous coordinates, and what it means is that every point x has the form (x, y, 1); the third component is just one. You might wonder why you would represent 2D points this way. Well, as you can see in these equations, it makes the world a little simpler, because then u times x is (u x, u y, u), and that is exactly the point we see up there in the 3D world: it has z-coordinate u, and x and y are scaled by u as well.

So this would be a cost function, and then we have to minimize it. There are many ways to do that; I will not go into detail. One idea would be gradient descent: start with some initial guess of ξ and descend from there. Another alternative that is commonly used is to linearize the cost in ξ, and I think we will see a little more on that later in class; then you get a cost function that is convex and can be minimized globally. But the assumption behind the linearization is that you are already close, in some sense, to the right estimate, in other words that the camera motion was small. Whether that is a good assumption in practice depends on the application: if you do real-time camera tracking and you get a new camera pose thirty times a second, then typically the motions are fairly small. In fact, a lot of approaches in computer vision make small-motion assumptions, because the difficulty with this cost function is that it is not convex in ξ. If I try to solve for the best six parameters and I don't want to do a complete search, it is very hard to find optimal solutions; in general it is not possible. So you typically make additional assumptions: if you assume ξ is small, you can do a Taylor approximation, you get a convex approximation, and that can be solved. Under the assumption that the motion is small you get good results, and fortunately for us, the faster the cameras get, from 30 frames per second to 60 to 120, the better the small-motion assumption is fulfilled: even if you move the camera at constant speed, with a higher frame rate the motion from one frame to the next is more likely to be small, and so the algorithms tend to work better the higher the speed of the camera.
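To make the bookkeeping concrete, here is a minimal numpy sketch of the warping and the photometric residual just described. It is an illustration only, not the actual tracking algorithm: camera intrinsics are ignored (normalized pixel coordinates are assumed), the lookup in the second image is nearest-neighbor, and the images and depth map are placeholder arrays.

```python
import numpy as np

def back_project(x, y, depth):
    """Pixel (x, y) in homogeneous coordinates (x, y, 1), scaled by its depth u:
    gives the 3D point (u*x, u*y, u)."""
    return depth * np.array([x, y, 1.0])

def project(X):
    """Generic projection pi: divide by the z-coordinate to get 2D coordinates back."""
    return X[:2] / X[2]

def photometric_residual(I0, I1, depth0, R, t):
    """Sum of squared color differences between I0(x) and I1(pi(R * u(x) * x + t)).
    I0, I1: grayscale images as 2D arrays; depth0: depth map for I0.
    Assumes normalized pixel coordinates for simplicity."""
    h, w = I0.shape
    error = 0.0
    for v in range(h):
        for u_px in range(w):
            P = back_project(u_px, v, depth0[v, u_px])   # 3D point seen by camera 0
            P2 = R @ P + t                                # same point in camera 1 coordinates
            x2, y2 = project(P2)                          # pixel position in image 1
            i2, j2 = int(round(y2)), int(round(x2))       # nearest-neighbor lookup
            if 0 <= i2 < h and 0 <= j2 < w:
                error += (I1[i2, j2] - I0[v, u_px]) ** 2
    return error

# Toy check: identity motion on identical images gives zero residual.
I = np.random.rand(8, 8)
D = np.ones((8, 8))
print(photometric_residual(I, I.copy(), D, np.eye(3), np.zeros(3)))
```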
Here are examples of what these data look like. I think many of you have seen the Xbox, but many of you may not have seen the data it actually generates. This is the color image and this is the corresponding depth image; the depth is represented as brightness values here, dark means close by and bright means further away, so you can read off the depth in that way. From that, once you have tracked the camera, you can try to fuse these depth maps in order to get a coherent 3D representation of the scene. You can not only fuse the depth information in a world coordinate frame, you can also do that with the color, so you get a colored 3D model of the world. You see it is okay, you can recognize things, though you might not be able to recognize the person anymore. How to improve the accuracy of these 3D models is a huge effort that people are working on these days. The difficulty is that there are many factors limiting the accuracy. One factor is how accurately you can estimate the camera motion: the better the camera motion estimate, the more accurate, or the more consistent, the fusion. The second challenge is that the sensor itself has limited resolution, both in the depth values and in the x and y coordinates.

One solution there is to wait for the developer of the camera to bring out the second generation and keep your fingers crossed that it actually is more accurate. For the Xbox Kinect camera there is a second generation coming out this fall or winter; whether it really has more accuracy, more resolution in x, y and in depth, I don't know yet, there are debates about that.

As for the texture mapping: there are different ways to do it. Here, the way it is done is that we store the color in the 3D volume as well, and we average the colors in the same way as we average the depth values from the different camera views. An alternative would be to first estimate the depth and then project the colors from the images; for example, you could use the super-resolution approach I mentioned last time to get very accurate coloring, and that might improve not the geometry but at least the visual appearance of what we see. We haven't done it here because, I should be honest, the super-resolution approach is a bit computationally intense: the super-resolution texturing of the rabbit would nowadays take about twenty minutes to compute, whereas what we are aiming for here is a system that is really real-time capable, where you scan and interactively get a 3D model right away. So at least for now the super-resolution approach is not applicable in a real-time scenario, but there are many alternatives, and some of them might give a better compromise between accuracy and speed. In computer vision we are always working in that regime of "what is the optimal solution to a problem" versus "what is the fastest solution", and depending on the requirements of your customer you will have to devise different strategies. My experience is that it is typically a good idea to at least think about what the optimal solution would be, how computationally intense it would be, and what ways there are to make it more efficient. It is often a better strategy to first find the optimal solution to a problem and then think about speed; if you start with speed as the primary concern, you will typically never get to the best possible solution.

So here is a system that actually works in real time. For those who were at the open day on Saturday, you could try it yourself; we demoed it there and scanned tons of people all day long. The way it works is that there is a rotating chair or rotating platform that you stand on; as you can see, here is the camera, and here is the 3D model emerging on the fly, so to speak. While you rotate, a 3D model of the person is generated in real time, so as soon as you finish one loop you have a 3D model that you can turn around on the computer. Here again you see the process: it essentially carves out the 3D geometry using the different depth estimates. There are many challenges and open issues to make it more accurate. The resolution is okay, you can recognize the person if you know the person, but I constantly feel it should be higher resolved and there should be ways to improve the accuracy; as I said, many factors play a role. The system actually makes the models hollow, which is useful because you can nowadays print 3D in color, and the cost of printing depends on how much material is used. So you can just have the model printed: there are services where you send it in, and for 10 or 15 euros you get 3D prints of various sizes. One of the crucial features of this system is that it is very fast and very robust: as I said, on Saturday we scanned hundreds of people and it works pretty much every time. The only challenge is that you have to hold still while you are rotating.

you’re rotating so it does not work on my daughter for example she’s three years old and if I tell her to hold still she says yes daddy and then she goes like that and and the reconstructions do not look so good you end up with multiple noses and things like that you can imagine a little bit what the reconstructions look like but if people do hold still and typically we find it works as I said you know we scan lots of people we found even P children from the age of five or six years on you can scan them because they understand what we mean when we say hold still and so these are a lot of the models that we scan in and you can see you can get fairly high resolution both in the coloring and in the geometry so a lot of these folds that you see here they’re actually in the 3d model once you have it in your hands you can verify and so it’s a system that allows to generate little toy figures of family and friends in fact if you if you like this the software can be downloaded you can test it at home if you have an Xbox camera Kinect camera and if you have a PC with a more modern graphics card you should run no problem and then I don’t know that we haven’t really figured out what you do with the system like that but you know you can put things on the shelf or whatever one of the the same system more or less same approach can be used in different scenarios and this is the last part I wanted to show you about recent developments in in our research lab this is next door the kitchen and this is where we do a lot of experiments what you see here is a quadrocopter of flying a helicopter with four rotors they have in they exists in different sizes all the way down to so-called nano copters of this size and as you see on top there is a Kinect like camera on top it’s not this exact same camera but pretty the same sensor and this quadrocopter flies autonomously using that sensor so it can localize its own location using that camera with essentially the algorithm I showed you and then you can use that location and rotation estimate to autonomously navigate a flying system so there is no user steering the system it actually flies autonomously acquires data and this is what the data looks like if you map the data into a world coordinate system so it’s exactly the depths and color sensor data that we get from that sensor but we remove this the ego motion of the quadrocopter because we can estimate that motion and so we can turn and translate all the measurements back into one coherent world coordinate system and this is what you see here and we can map the estimated camera location into that system so here you see the predefined trajectory and the actually flown trajectory and then at some point it lands again and will end and then you confuse these depths Maps exactly like you saw it with the person scanner and you can scan rooms and so once the quadrocopter lands you have a dense colored model of your room and if you’ve seen how people nowadays measure rooms to install things if you ever had you know workmen in in your room to install whatever system they measure the size of the of the of the room it’s very tedious you I mean you don’t have to have a manual measurement there are lasers nowadays that you can position to determine the distance between two walls but they would never give you a dense 3d model and these models are fairly accurate and they are determined on the fly literally what we’re working on right now is actually to extend systems like that so that they can map the entire building ideally with 
an autonomous quadrocopter the challenge there would be that you have to do obstacle avoidance etc and one challenge that we don’t know how to resolve is if the door is closed and you know the quadrocopter will not open the door so it can fly from one room to the next but the assumption is that the

So, to summarize this first part, I showed you a little bit about applications of variational methods and what is being done these days: dense reconstruction from images; coloring and texturing of 3D models, with super-resolution texturing as in this example; reconstruction of actions over time, where you reconstruct not just one geometry at one given time but, from a calibrated multi-view video stream, generate actions over time, toward what is nowadays often referred to as 3D television; then I talked about, I think, real-time reconstruction of dense geometry from a handheld camera, or I may have skipped that, I don't remember; RGB-D modeling, where we use variational techniques to estimate the motion of a so-called RGB-D camera, where D stands for depth, so cameras that provide color and depth; and in the end I showed applications of these RGB-D cameras on autonomous quadrocopters to reconstruct the world around us from an autonomous system. Okay, that concludes the overview or introductory part.

Now I am moving to the first part of the class, and as I said, these slides will also be online. The first part takes one step back and goes into an introduction to the basics of image analysis, because we are learning about variational methods for image analysis and computer vision. I want to start by making sure that everyone knows the basic ideas about images, how to represent them and how to process them, so that we have the basics as a foundation for the variational techniques.

Some literature: I will bring some books next time; I couldn't bring them today because the stack on my desk is so large, but tomorrow I will bring at least some of these books. There is a lot of literature in the field, and the class, the way I designed it, is not covered by any single book. This is a bit of a problem in computer vision: the field is developing so fast that a lot of things you will not find in one single book, but there is a collection of books that cover at least some aspects of interest. Here are four books on variational methods and partial differential equations. For example, the Aubert and Kornprobst book, which I think already exists in a second edition now: "Mathematical Problems in Image Processing: Partial Differential Equations and the Calculus of Variations". Then Chan and Shen, "Image Processing and Analysis: Variational, PDE, Wavelet, and Stochastic Methods". Morel and Solimini is a slightly older but also quite interesting book, "Variational Methods in Image Segmentation"; it is more focused on segmentation and revolves a lot around the so-called Mumford-Shah variational model, which was very influential in image segmentation. The most recent among these is a book by Bredies and Lorenz; at least this version is a German edition, but I was recently contacted, by Springer I think, I don't remember exactly, about their interest in translating the book to English. Why Bredies and Lorenz wrote the book in German I don't know, but I'm pretty sure an English version is following; if it is not out yet, it should come out someday. The books are slightly different. A lot of these techniques, and similarly a lot of these books, sit on the interface between computer science and applied mathematics, and the authors are typically on that interface as well, often people from the field of mathematics. There are actually more books, which I will bring next time, that are more on the mathematical side and a little more advanced on specific aspects of variational methods.

Okay. One of the key features of variational methods is that they often take a continuous viewpoint: they make the assumption that the world we live in is a continuous world, and similarly, images are often treated as continuous mappings. This is a point of, I should say, endless debate, because as you know, the images we get from a digital camera are discrete, not just discrete in space, as you see here, but also discrete in color: the color values, the brightness values, have a clearly discrete spectrum. The discretization in space is often called sampling; the discretization in the color space, the value space, is called quantization; and you typically have both of these aspects in discrete images the way you acquire them. Nevertheless, you will see throughout the class that there is an advantage in modeling images as continuous mappings.

So there are different levels of discretization: discretization in color and brightness, called quantization; discretization in physical space, called sampling; and in addition, if you are talking about videos, discretization in time, because the images only come one step at a time. Nevertheless you can model videos as continuous in time as well: of course the measurements are discretized, but whatever you see in these images, the action behind it happens in continuous time. This, as I said, is an endless debate; at some point it becomes very philosophical, and whether the world we live in is really continuous or discrete, well, if you go deep down into quantum physics people will tell you there is a discretization in the world as well, but at least for our eyes the physical world can be approximated as a continuum.

When I say images, what I mean is mappings from some subset of R^n to R^d. To give you some examples: n is the dimension of your input domain, so for regular images n would be two. If you have volumetric images, for example some 3D scan done in a hospital, then you have 3D data and your input domain Ω is a subset of R^3. If you have 2D images over time, so video, you can also model that as a mapping from a subset of R^3, where the third component is the time component. And if you have volume images over time, there are nowadays ultrasound scanners that scan in 3D over time, so it is actually quite amazing what can be done today, then n equals four. The drawback, or the difficulty, in that setting is that the data they generate is so huge that you typically cannot store it, so there is quite a need in that domain for online algorithms that process the data as it comes, because there is too much to store. d is the dimension of the output space: if we talk about gray-value or brightness images, d is typically one; if we talk about color images, d is typically three. Once d is larger than one, people talk about multispectral images, and recently, when d is really large, people have talked about hyperspectral images.
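Just to make n and d concrete, here is a minimal sketch of how such mappings typically appear as arrays in the computer; the concrete sizes are arbitrary examples.

```python
import numpy as np

# n = 2, d = 1: a gray-value image, Omega a subset of R^2, values in R
gray = np.zeros((480, 640))             # height x width

# n = 2, d = 3: a color image, values in R^3 (e.g. RGB)
color = np.zeros((480, 640, 3))

# n = 3, d = 1: a volumetric image, e.g. a 3D medical scan
volume = np.zeros((128, 128, 128))

# n = 3, d = 3: a color video, the third input dimension being time
video = np.zeros((300, 480, 640, 3))    # frames x height x width x channels

for name, arr in [("gray", gray), ("color", color), ("volume", volume), ("video", video)]:
    print(name, arr.shape)
```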

I have never met anyone who can specify at which d it goes from multi to hyper, and you might wonder why, if you already have the term "multi", you need to introduce a new term at all. Well, usually new terminology is a selling point: if you want to sell your technology, for example to funding agencies, you want to tell the world you are doing something radically new, and one way of doing that is to introduce new terminology. That is how "hyperspectral" came about.

So we have that continuous mapping; say for images in R^2 we have coordinates x and y denoting the point in space, and once we discretize them we get values at discrete pixels (1,1), (1,2), and so on, an array of values. Typically the function values are also discrete, for example in the range from 0 to 255, or in a more detailed color quantization; that very much depends on the image format you have. There are tons of different formats for images, depending on whether they are brightness or color images, how they are resolved, et cetera.

Before I go into the continuous world, some comments about drawbacks and advantages of discrete and continuous methods. To start with the advantages of discrete representations, because there is a large part of the community working in the discrete setting: they argue that digital images are discrete, and when you process them you ultimately also require a discretization in the computer; you have to process a finite number of values, you cannot process a continuum directly. So once you discretize, you don't need to numerically approximate things, because you can represent them directly in the discrete setting: you don't need the transition from a discrete measurement to a continuous model and then to a discrete implementation in the computer again. In addition, for discretely formalized problems there exist, for some problems, efficient algorithms from discrete optimization, typically graph-theoretic algorithms, that you can apply. We saw a couple of variational approaches that have integrals over the image plane; in the discrete setting you would have sums instead of integrals, sums over the discrete pixels, and then you have to minimize some cost on that discrete grid. Depending on what the cost function looks like, there are polynomial-time algorithms around to solve such problems, and if you have a polynomial-time algorithm, chances are it is very efficient and very fast.

In turn, the advantages of the continuous representation are, firstly, that the world we see through the camera is, as I said, a continuous world, at least as a fairly good approximation, and since ultimately computer vision does not want to make statements about the discrete image but about the world seen in these images, it is often an advantage to have a continuous representation. Moreover, the mathematics of continuous worlds has a much longer tradition; it is very old, partly because we did not have computers for centuries, so mathematicians developed theories in continuous space: there is functional analysis, there is differential geometry for modeling surfaces in a continuous setting, there are partial differential equations, and there is group theory for modeling motion in a continuous space, et cetera. In addition, some properties, for example rotational invariance, are easier to model in a continuous setting: if I write down a continuous formulation there is no underlying grid, so typically I will have rotational invariance by default, whereas if I formalize something in the discrete setting this is often more difficult, because if you rotate a regular grid, the rotated grid is no longer regular, and so on; things are a little more tricky in the discrete setting. In some sense, the continuous models correspond to the limit of an infinitely fine discretization.

And with more advanced sensors we are actually approaching that limit: digital cameras have more and more megapixels, the frame rate of video cameras gets higher and higher, and so the discretization approximates the continuous world better and better.

Here is an image and its discretization. In fact, I was a little puzzled when I started image processing and wanted to see the pixelization: I downscaled the image to a level where it has only 32 by 32 pixels, and then I wanted to see the pixelization, which, surprisingly, is not that easy to do. If you downscale it and look at it, it looks like this and you don't really see much. If you then take your standard viewing program and enlarge it, you don't see the pixelization either. Why? Because typical programs will interpolate: they see immediately that the resolution is too low for, say, a thousand-by-thousand-pixel area on the screen, so they interpolate the values, meaning you do not actually see the original data. Your image display program will typically fool you and show you something else. Nevertheless you can of course do it, for example by imposing that you don't want any interpolation, or only what is called nearest-neighbor interpolation, and then you get this pixelization and you see the discretization artifacts up close.

Here is quantization, the discretization of images not in the spatial domain but in the brightness domain. Here you have a brightness image with 256 levels of gray, with 16 levels, four levels, and two levels, meaning just black and white. One should say the two levels were chosen appropriately here, in the sense that much of the relevant structure is still there in the two-level setting; if you choose a bad quantization you don't actually see the head anymore, you might just see the nose area or something like that. Why is quantization helpful? It is sometimes helpful to quantize images, to boil them down to a black-and-white world, because you may want to process the data later with some algorithm that, for example, determines the area of the head: where is head, where is no head. In the original image it is difficult for a computer to say where the head is; once you have quantized it, that essentially amounts to taking a decision about which parts are still part of the head and which are not. So it is a kind of post-processing that facilitates subsequent tasks, robotic tasks for example. In an industrial application of computer vision, this kind of binarization is important to see where parts are on a conveyor belt: if you want to grab a part with a robot arm, you need a hard decision about where the object is and where its boundaries are, and this is then called segmentation. As you can see, with just quantization you get some kind of foreground-background separation, not the best one of course, but at least it is an algorithm that is very fast and that works.
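Here is a minimal numpy sketch of the two effects just discussed: a nearest-neighbor enlargement that keeps the pixelization visible instead of interpolating it away, and a quantization of the brightness values to a small number of levels, where two levels amounts to thresholding, i.e. binarization. Function names and parameters are mine.

```python
import numpy as np

def enlarge_nearest(img, factor):
    """Enlarge by an integer factor with nearest-neighbor interpolation:
    every pixel is simply repeated, so the pixelization stays visible."""
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)

def quantize(img, levels):
    """Quantize brightness values in [0, 1] to the given number of levels.
    levels=2 amounts to thresholding at 0.5 (binarization)."""
    bins = np.floor(img * levels).clip(0, levels - 1)
    return bins / (levels - 1)

img = np.random.rand(32, 32)              # stand-in for the 32x32 downscaled image
big = enlarge_nearest(img, 8)             # 256x256, deliberately blocky
print(big.shape, np.unique(quantize(img, 2)))
```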
you have no information it’s hard to hallucinate that information but one strategy a popular one is called bilinear interpolation the way it works is that you approximate the brightness at pixels

You approximate the brightness at a point (x, y) by a function that is bilinear in x and y: a x + b y + c x y + d. Essentially you have four pixels in a neighborhood and you want the value in between, so you fit this bilinear function to the four brightness values and then read out the value at any point in between. You have four coefficients a, b, c, d that represent the brightness function in the area between the four pixels, and you can determine these parameters simply by fitting the function to the brightness values at those four pixels: four pixels, four brightness values, fit the function, read out a, b, c, d, and with that you can determine the brightness between the pixels according to the fitted function. This is the most standard interpolation. There is also bicubic interpolation, where you take more parameters and need a larger environment to fit them, and there is the simpler so-called nearest-neighbor interpolation. Nearest-neighbor interpolation essentially gives you the blocky image from before: for any point it just checks which pixel is nearest in some distance and takes that value. So if you have a regular grid, say here are your four pixels, and you want to determine the value here, then this pixel is the nearest one, so all the points here get its color, the points there get the color of that pixel, and so on. It is a very simple, naive interpolation that gives you a larger image. So although we only have values at discrete points, we can of course fill the continuous space by interpolation.
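Here is a minimal sketch of bilinear interpolation as just described: the four coefficients a, b, c, d of a x + b y + c x y + d are fitted to the four surrounding pixel values and the fitted function is evaluated in between. A practical implementation would be vectorized; the pixel spacing is assumed to be one.

```python
import numpy as np

def bilinear(img, x, y):
    """Interpolate the brightness of img at a non-integer position (x, y),
    x along columns and y along rows, by fitting a*x + b*y + c*x*y + d to the
    four surrounding pixel values."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, img.shape[1] - 1), min(y0 + 1, img.shape[0] - 1)
    # Determine the coefficients (a, b, c, d) from the four corner values.
    A = np.array([[x0, y0, x0 * y0, 1.0],
                  [x1, y0, x1 * y0, 1.0],
                  [x0, y1, x0 * y1, 1.0],
                  [x1, y1, x1 * y1, 1.0]])
    vals = np.array([img[y0, x0], img[y0, x1], img[y1, x0], img[y1, x1]])
    a, b, c, d = np.linalg.lstsq(A, vals, rcond=None)[0]
    return a * x + b * y + c * x * y + d

img = np.arange(16, dtype=float).reshape(4, 4)
print(bilinear(img, 1.5, 2.5))   # value halfway between four neighboring pixels
```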
Once you have images, you can filter them. This is an important aspect, because it leads the way to the more advanced so-called diffusion filtering techniques and then to the variational approaches. The idea of filtering is that you take some input image f, process it with some operator T, and get an output image that I call g. Typically the operator T acts on a certain spatial neighborhood: the value of the new, filtered image g at (x, y) will typically depend on the values of f in the vicinity of (x, y). Sometimes it only depends on the value at the same pixel (x, y); this is the simplest form of operator and is then called a local brightness transform. That can be simplified by saying: depending on what the input brightness is, there is some output brightness. So if r is the input brightness and s the output brightness, s = T(r), and you apply that transform to every pixel. It is a very simple, naive transformation, but you can nevertheless do a lot of interesting things with these simple brightness transformations, so I want to look into that a little. This may not be new to people who have already worked on image processing, but for those who haven't, it gives you a little bit of an idea of what you can do with brightness transforms. In most cases when you do brightness transforms, you assume that the transformation is monotonically non-decreasing: if the brightness r1 is smaller than or equal to r2, then after the transformation that ordering of brightness values is preserved. Typically you do not want to transform the brightnesses in such a way that something that is bright gets darker than something that is less bright; you want to preserve the ordering of brightness values. If you have a strict "less than" here, then the transformation is called strictly monotonic.

There is a difference: if a transformation is strictly monotonic, then you can invert it, you can get back from the transformed image to the original image. In any kind of image processing it is useful to at least keep in mind whether the transformation you apply is invertible or not, because invertible essentially means that you are not losing information: you can always get back from the transformed image to the original data, so the original data can be reproduced. Any transformation that is not invertible invariably loses information; you cannot get back to the original data, so some information about it is lost in the transformation. If you want to preserve information, invertible transformations are therefore helpful.

Here are some transformations; at least this one is invertible. This is called contrast stretching. On the horizontal axis you have the input brightness r, dark being low values on the left and bright being high values on the right, and on the vertical axis the output brightness T(r). What this transformation does, and I said it is invertible, which is actually only more or less true, is that all the brightness values in this part of the spectrum are stretched: a fairly dark input brightness here gets mapped to a very dark value, an input brightness there gets mapped to a very bright value. You would apply such a transformation if the structures you are interested in, say in a medical image where the doctor wants to see where the bone is, where the cartilage is, and so on, live in a certain range of brightness values; you may then want to stretch that range so that the available brightness resolution is used to represent the brightnesses you are most interested in. In turn, the other ranges are collapsed: all of those brightness values are essentially mapped to dark, and if the curve is perfectly flat there it is actually no longer invertible, because then I have zero as an output and the input could have been any value in that range. So invertible means the slope should never be exactly zero; it should always be positive. The extreme case of contrast stretching, if I stretch more and more, is this mapping, which is called thresholding. Thresholding is a transformation that essentially binarizes the image: from a gray-value input it produces a black-and-white image where all values below some value m are set to black and all values above are set to white, like the head image I showed you earlier, which was binarized in exactly that fashion. So thresholding, as a limiting case of contrast stretching, provides a binary image, and as I mentioned earlier this can be useful for further post-processing, because in some sense a decision is taken for every pixel: are you in this domain or in that one.

There are other important brightness transformations, the so-called power-law and logarithmic transformations, and here you see some of them. The logarithmic transformation looks like s = c log(1 + r), and the power law is s = c r^γ. For different values of γ you get these different curves, and you can imagine what effect they have: the curves down here tend to darken the image, because the range that was bright before is mapped to essentially dark, while the transformations up here tend to brighten the image.
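A minimal sketch of two of these brightness transforms on values normalized to [0, 1]: a piecewise-linear contrast stretching that expands a chosen input range [lo, hi] to the full range and collapses everything outside it, and the power-law transform s = c r^γ. Both are applied pixel-wise; the parameter names are mine.

```python
import numpy as np

def contrast_stretch(img, lo, hi):
    """Map input brightnesses in [lo, hi] linearly to [0, 1]; values below lo
    go to 0 and values above hi go to 1 (those parts of the range are
    collapsed, so the transform is not invertible there)."""
    return np.clip((img - lo) / (hi - lo), 0.0, 1.0)

def power_law(img, gamma, c=1.0):
    """Power-law transform s = c * r**gamma: gamma < 1 brightens dark regions,
    gamma > 1 darkens the image, gamma = 1 is the identity."""
    return c * img ** gamma

img = np.random.rand(4, 4)
print(contrast_stretch(img, 0.3, 0.7))
print(power_law(img, 0.4))
```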

In fact, the inverse of the power-law transformation is called the gamma correction, and the gamma correction curves look essentially the same: taking the inverse means you flip the curve at the diagonal. Gamma correction is frequently done in monitors, and if you have ever given a talk and shown photographs you may have run into this issue: on your screen you see the picture really well, and then you project it onto the wall and all of a sudden the image is completely overexposed or completely dark and you can no longer see the faces. Often when people give talks you have that issue; they say "oh, here's a picture with my friend in the sunset" and you don't see anything because the faces are too dark, and then you wonder why you see the image differently on one screen than on the other. The reason is that between the two displays some gamma correction is being done, and you can actually undo it by applying the inverse, so you can correct the brightness by doing a gamma correction, which is why it is called that. It is done very frequently for displays to make sure the image is in a presentable brightness range. So if you ever have issues with over- or under-exposure in your slides, gamma correction is the way to fix it, and not surprisingly: you just take the values and increase or decrease the brightness as desired.

Here are examples of so-called contrast enhancement: the power-law transformation can be used to enhance contrast in certain domains. This is the input image, and these are different power-law transformations with different values of γ. γ = 1 is the identity transform, so nothing happens, and the further γ is from one, the stronger the transformation. What you see here is that all of a sudden structures emerge that you did not actually see in the input image, and this may be important, for example in a medical setting: if the doctor is particularly interested in that structure, then the input image is not going to help him, but the power-law transformation does the trick. So with very simple brightness transformations, contrast enhancement, contrast stretching, gamma corrections, you can change the brightness values to bring them into a range where the human eye can see the structures better.

Here are more extreme transformations, two so-called gray-level slicing approaches. What do they do, can anyone tell me? Let's look at this one, for example: what does this transformation do? Yes, it masks out certain gray values, and what does it do with those gray values? It maps them to some constant brightness. So this is also a kind of binarization, a mapping to two brightness values. Basically you say: if you want to highlight some component in a medical image, and you know that component has brightness values in the range from a to b, then with this transformation you can make it light up and the rest will be completely black. This is a very useful processing step that simplifies the task for the user and basically shows them: here is the structure you are interested in. And what does this second one do? Exactly: it also makes the structure you are interested in constant and cranks up its brightness so it lights up, but the rest of the image stays the same. These are interesting transformations, but mind you, they are not invertible: if the output has exactly that gray value, it may well be that it was the same gray value in the input, or it may be that it was much darker.
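A minimal sketch of the two gray-level slicing variants just described, on values in [0, 1], with the slicing interval [a, b] as a free parameter: the first maps the interval to a bright constant and everything else to black, the second brightens the interval but leaves the rest of the image unchanged.

```python
import numpy as np

def slice_binary(img, a, b, bright=1.0):
    """Highlight gray values in [a, b] as a constant bright value, rest black."""
    return np.where((img >= a) & (img <= b), bright, 0.0)

def slice_preserve(img, a, b, bright=1.0):
    """Highlight gray values in [a, b], but keep the rest of the image as it is."""
    return np.where((img >= a) & (img <= b), bright, img)

img = np.random.rand(4, 4)
print(slice_binary(img, 0.4, 0.6))
print(slice_preserve(img, 0.4, 0.6))
```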

So obviously these are not strictly monotonic and you cannot invert them, but sometimes you don't want to invert them; you want to light up structures and display them to the human.

Now, filtering. When I started image analysis, my background is actually physics, filtering was a strange term to me; if you are doing electrical engineering you are more familiar with it, but for me the only filter I knew was the coffee filter, so it took me a little while to get used to the term. It is used frequently in signal processing, and it is actually derived from frequency space: you transform an image to its frequency representation using the Fourier transform, and then you can filter out certain frequencies, say "I only want to preserve this frequency and set all the others to zero". So it is much like a filter that lets some frequencies through and blocks others; that is where the term filtering comes from, and these operations are therefore often called filters. We will see some filters in the following.

One piece of terminology from mathematics I want to mention, which you may be familiar with, is the linear transformation. An operator T is called linear if it fulfills two properties: T(f + g) = T(f) + T(g) for any pair of images f and g, and T(α f) = α T(f) for any image f and scalar α. An operator is linear if and only if it fulfills these two constraints. Why is that useful? Because it basically tells you that it does not matter whether I first add two images and then transform them, or first transform them and then add them; it comes out the same, and that is often a useful property to know about.

There are various linear transformations, and a very popular one is the so-called convolution; in German, you see, I sometimes put the German terms for those interested, it is called Faltung. The convolution looks like this: the output image g at (x, y) is a linear combination of input brightnesses with some weighting function w, so roughly g(x, y) = ∫ w(x', y') f(x - x', y - y') dx' dy'. The way to see it, maybe I'll draw a picture: here is your image; if you want the output brightness at a point x, you take a weighted combination of the input brightnesses, typically in a certain neighborhood (it does not have to be, but frequently it is). So if this is the point x, here is the point x - x', with x' being the offset vector between the two, and a weighting w that depends on that offset. In the discrete setting it looks like a summation over the pixels, typically in a neighborhood, with an offset (i - m, j - n) and a weight that depends on (m, n). Depending on how you set the weights, the convolution will do all sorts of interesting things to an image. These are brightness transformations that are no longer local: the brightness at any pixel depends on the brightnesses of the input image, ideally in some neighborhood, but possibly in the whole image. I will show you some examples in a second, but first some terminology: this weight matrix is often called a mask, the convolution mask. One mask acting in a very local 3-by-3 neighborhood of pixels would be this one, the weighting matrix w with entries w(0,0), w(1,1), w(-1,-1) and so on, where the argument is always the offset vector.
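Here is a minimal sketch of the discrete convolution g(i, j) = Σ_{m,n} w(m, n) f(i - m, j - n) with a small mask, written out explicitly with loops so the offset bookkeeping is visible. There is no boundary handling here; border pixels whose neighborhood leaves the image are simply left at zero.

```python
import numpy as np

def convolve(f, w):
    """Discrete convolution g(i, j) = sum_{m, n} w(m, n) * f(i - m, j - n),
    where the mask indices m, n run over {-k, ..., k} for a (2k+1)x(2k+1) mask.
    Border pixels whose neighborhood leaves the image are left at zero."""
    k = w.shape[0] // 2
    H, W = f.shape
    g = np.zeros_like(f, dtype=float)
    for i in range(k, H - k):
        for j in range(k, W - k):
            s = 0.0
            for m in range(-k, k + 1):
                for n in range(-k, k + 1):
                    s += w[m + k, n + k] * f[i - m, j - n]
            g[i, j] = s
    return g

# Example: a normalized 3x3 box mask (all weights 1/9) computes a local average.
f = np.random.rand(6, 6)
box = np.ones((3, 3)) / 9.0
print(convolve(f, box))
```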

In the continuous setting that weight function, the mask, is called the convolution kernel, and the operation is called convolution; it is often written as w ∗ f, the convolution of w with f, where f is the input image and w is the weight matrix or convolution kernel. What are useful convolution kernels? Here is the most commonly used one, the Gaussian kernel. You can see the coordinates in this figure are actually not quite right: the center should be (0, 0). The central pixel gets the most weight, and the further you move away from the center, the less weight you get. This Gaussian convolution creates a blurring of the input image, because you are saying that the output brightness is a weighted sum of the neighboring brightnesses, where the weight decreases the further away you go. In principle the Gaussian is always positive, it is never exactly zero, but once you implement it you typically go out to maybe one to three sigmas and ignore the rest, which doesn't matter. This is called smoothing or low-pass filtering. These kinds of filters, the Gaussian and various others, are particular types of neighborhood filters, because the output brightness depends on the input brightnesses in some neighborhood. The Gaussian smoothing kernel has a width σ that determines how much blurring you want to create: the larger σ, the more averaging happens over a larger and larger neighborhood and the more blurring you get; in fact if you let σ go to infinity you get a constant output image.

There are alternative, often simpler solutions that are sometimes faster to implement, for example so-called box filters, which can be implemented very efficiently. They are much like a Gaussian filter except that the weights are constant within the mask. Typically you want the weighted averaging to be such that the overall brightness is preserved, and you can assure that by normalizing the masks: for example, if you have a 3-by-3 mask full of ones, there are nine ones, so you divide by nine, you normalize the mask, and that assures the average brightness is preserved. It means that when you filter the image it gets blurred, but it does not get brighter with every filtering; the average brightness is preserved. This is true inside the image, but once you get to the boundary you have to adapt the weighting, because when you take a weighted sum of pixels in the vicinity and you are at the edge of the image, there are no more brightness values beyond it. In a corner, for instance, you are only averaging four pixels, so you should divide by four in that case; you have to adapt the mask appropriately at the boundary. This is actually a critical issue, I should say: in a lot of image processing papers people detail the algorithm but neglect what they actually do at the boundaries; these technical details are often not in the papers. One solution, often simpler than changing the masks, is to just expand the image a tiny bit by copying the boundary values outwards by one pixel; for a 3-by-3 mask that means you still have enough neighbors in the expanded image, so this is often the simplest solution.
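Here is a minimal sketch of Gaussian smoothing as just described: build a truncated, normalized Gaussian mask of width σ, pad the image by replicating the boundary values so that the mask also has neighbors at the border, and take the weighted sum at every pixel. A real implementation would exploit that the Gaussian is separable; this version is deliberately explicit.

```python
import numpy as np

def gaussian_mask(sigma, radius=None):
    """Truncated Gaussian mask, normalized so that the weights sum to one
    (which preserves the average brightness of the image)."""
    if radius is None:
        radius = int(3 * sigma)              # cut off at about 3 sigma
    ax = np.arange(-radius, radius + 1)
    w1d = np.exp(-ax ** 2 / (2.0 * sigma ** 2))
    w = np.outer(w1d, w1d)
    return w / w.sum()

def smooth(img, sigma):
    """Gaussian smoothing with replicate padding at the image boundary."""
    w = gaussian_mask(sigma)
    k = w.shape[0] // 2
    padded = np.pad(img, k, mode='edge')     # copy the boundary values outwards
    out = np.zeros_like(img, dtype=float)
    H, W = img.shape
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(w * padded[i:i + 2 * k + 1, j:j + 2 * k + 1])
    return out

img = np.random.rand(16, 16)
print(smooth(img, sigma=1.0).shape)
```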
Gaussian blurring is interesting, but it is often not what people want in practice: they want to remove noise and the oscillations due to noise in the images, but they do not want to blur structures.

So one alternative is the so-called median filter. It is one particular type of order-statistics filter, and these filters are typically not linear anymore. I showed you what linearity means, and as I mentioned it is a nice feature to have; nevertheless, in a lot of image processing nonlinear approaches are more powerful. With linear techniques you can only get so far, and there are a lot of nonlinear techniques that give much better results; a lot of the variational techniques we will discuss, the more interesting ones, are actually nonlinear approaches. Order-statistics filters are filters where the brightness of the filtered image depends on the ordering of the brightness values in a certain neighborhood. What you do, for example, if you want to process a certain pixel: you look at the brightness values in a certain neighborhood, order them in increasing fashion, and then you can take the central value, the so-called median of these numbers, and put that in as the new brightness value. You can imagine that this also creates a certain denoising: if this pixel was white and all the neighbors were black, then the median would be black as well, and the white pixel would completely disappear in the median filtering. So fine-scale noise, so-called salt-and-pepper noise, is ideally removed with the median filter. I show it here: this noise is called salt-and-pepper noise, sometimes impulse noise, and if you have that kind of noise in images then Gaussian smoothing is not good, but median filtering will typically give very nice results. In general, the median filter induces less blurring: it will also smooth and denoise the image, but it will not blur it as much.

Here is an example; I don't know how well you can see it. This is the noisy input image and this is the Gaussian-smoothed image. They look almost the same; they differ a little, but I think the reason they look the same is that the human eye does a certain kind of denoising itself. If you look at the numbers they are definitely not the same, but to the human eye there is very little difference, and still you can see that edges are getting blurred: here there is a sharp edge, and it is slightly more blurry in the Gaussian-filtered image. If you want to remove more of the noise, you have to increase σ in the Gaussian filter, and that blurs the structures even more. This was a huge issue in the 70s and to some extent the 80s in the image processing community: you want denoising, but you want to preserve structure. As I mentioned last time, there is an endless debate about what is structure and what is noise; if both are superposed we don't know beforehand, but typically noise lives in the high frequencies, because it is typically independent from one point to the next. So you want some smoothing, but you want to preserve edges, and as you can see, the median filter does quite a good job at that. This shows you one example of where nonlinear methods are often more general and more powerful than linear ones.
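A minimal sketch of the median filter: for every pixel, the brightness values in a (2k+1)-by-(2k+1) neighborhood are sorted and the middle value is taken, so an isolated salt-and-pepper pixel disappears completely.

```python
import numpy as np

def median_filter(img, k=1):
    """Median filter with a (2k+1)x(2k+1) neighborhood and replicate padding."""
    padded = np.pad(img, k, mode='edge')
    out = np.zeros_like(img, dtype=float)
    H, W = img.shape
    for i in range(H):
        for j in range(W):
            out[i, j] = np.median(padded[i:i + 2 * k + 1, j:j + 2 * k + 1])
    return out

# Toy example: a single white outlier in a black image disappears completely.
img = np.zeros((5, 5))
img[2, 2] = 1.0            # salt-and-pepper style impulse
print(median_filter(img))  # all zeros again
```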
The next topic is derivative filters. These are filters that do not create smoothing; instead they try to find discontinuities in images, edges, and to highlight them, the idea being that an edge is a transition from dark to bright.

The location of that transition, from foreground to background, from dark to bright, is where the derivative is very high. Here again is the definition of the derivative; you should be familiar with it. In images we of course have two coordinates, so we invariably have partial derivatives: ∂f/∂x is defined as the limit of (f(x + ε, y) - f(x, y)) / ε as ε goes to zero. In a discrete world you cannot let ε go to zero, because you do not have continuous values, but you can easily approximate these derivatives. What I show here are some notations I will use throughout the class to denote derivatives: the common one, ∂f/∂x; sometimes I will write ∂x f, with the round ∂ denoting the partial derivative; and sometimes I will just write fx. And this is one possibility for discretely approximating the derivative at a pixel (x, y): you take the pixel to the right, you take the pixel to the left, and you divide the difference of these brightness values by the distance you walked, to be exact by two times the width of a pixel, because that is the step size ε we talked about. So this is a discrete approximation of the derivative. Once you have it, the next thing you might ask is: why this way and not differently? Indeed you can do it differently, and the short message is that there is not one solution that works best for every application. This one is called the symmetric or central difference, because you step symmetrically forward and backward; this one is called a forward difference, where you just take the pixel on the right and subtract the central pixel; and similarly this one is called a backward difference. In discretizing variational methods and partial differential equations, people use all sorts of discretizations, symmetric differences in some settings, forward and backward differences in others, and I will show you a little in the coming lectures what the advantages and drawbacks are. Ultimately the full picture only emerges slowly; it is an endless story, one could say. You could take five or six classes on the numerics of partial differential equations and get into all the details of the different kinds of discretization, so it is a very long story, and at some point it even becomes somewhat of an art to find the best discretization for a given problem. In practice, if you take any of these discretizations you will typically get good solutions, and often the discretization is not critical. There are, however, some phenomena in the sciences where having a good discretization is crucially important to get meaningful simulation results; for example, if you want to simulate the evolution of shock waves, then having the right discretization is crucial. So there are certain physical phenomena where it is important to discretize properly.
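A minimal sketch of the three discretizations just mentioned, for the x-derivative, with x running along the columns, pixel spacing one, and boundary columns simply left at zero.

```python
import numpy as np

def dx_central(f):
    """Central (symmetric) difference: (f(x+1, y) - f(x-1, y)) / 2."""
    d = np.zeros_like(f, dtype=float)
    d[:, 1:-1] = (f[:, 2:] - f[:, :-2]) / 2.0
    return d

def dx_forward(f):
    """Forward difference: f(x+1, y) - f(x, y)."""
    d = np.zeros_like(f, dtype=float)
    d[:, :-1] = f[:, 1:] - f[:, :-1]
    return d

def dx_backward(f):
    """Backward difference: f(x, y) - f(x-1, y)."""
    d = np.zeros_like(f, dtype=float)
    d[:, 1:] = f[:, 1:] - f[:, :-1]
    return d

# On a linear ramp all three recover the true slope of one (away from the border).
f = np.tile(np.arange(8, dtype=float), (4, 1))
print(dx_central(f)[1], dx_forward(f)[1], dx_backward(f)[1])
```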
And so if you add a little bit of noise, that gets substantially worse very quickly. Here are some examples with noise. This is a pure black and white image; if I take the x derivative I can actually localize the transition from white to black, and so people thought, well, we can use derivatives to localize objects and determine their boundaries. This is how image processing was done in the 70s: you would try to identify parts on a conveyor belt by computing the derivative and checking where it is maximal, and then you would mark the points of maximum derivative and declare that to be the boundary of your object. Once you actually ran it in real-world conditions you found it didn't work, and the reason it doesn't work you can see here: if you add some noise, as a human you can still nicely see this transition from black to white, but if you then compute the x derivative, this is what it looks like. You can still see a darker line, but if you then want to mark the points with strong x derivative, you will get tons of points everywhere and you will no longer get the boundary of your object. And so what we find, and this was a finding that people realized in the 70s and 80s, is that derivatives are very sensitive to noise. If you have a certain amount of noise, this is the derivative along the x axis for a certain y coordinate, and if I ask you where that derivative is maximal, and that is supposed to be the transition, it looks like an almost random function. In principle what we see in this way is that vertical brightness edges, like the ones we saw, can be determined as maxima of the x derivative, and similarly of course I can localize horizontal edges as maxima of the y derivative. With that simple approach we can only determine horizontal and vertical edges, but there are ways to generalize that to arbitrary edges. What we see, however, is that there is a sensitivity to noise. Maybe I will ask around to see how creative you are: how do you think we could resolve that noise sensitivity, how could we use derivatives to determine edges in a way that is more robust to noise, what could you do? Maybe preprocess the image in some way to reduce the noise? Blur, exactly, we can blur it, and as we saw earlier maybe even apply a median filter, which blurs in a way that the edges are preserved, and then compute the derivative on the blurred image. In fact anyone who does image processing knows that and uses that. So if in any paper you read that the authors compute the derivative of the image, typically what they mean is the derivative of the blurred image, because derivatives of the raw input image are so noise sensitive that you may not want to compute them. So whenever a paper computes the derivative of the input image, you have to be careful: typically what it means is that you first blur, maybe Gaussian smooth a little bit, and then apply the derivative. Once you have edges in arbitrary directions that you want to capture with derivatives, you can compute the gradient. The gradient, for those who haven't seen it, is written with the triangle pointing down. Mind you, this is very important, so I put it on the board; for some people it is obvious, but if you are not from a more mathematical background: this symbol is the derivative operator, it stands for the derivative in x and in y direction, whereas the other triangle, the one pointing up, is the Laplace operator, which is the second derivative in x plus the second derivative in y in two dimensions. So these are two very different symbols and you should not confuse them. The first one is called nabla, actually in the literature this is called the nabla operator.
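Here is a small sketch of the blur-then-differentiate recipe just described, using a Gaussian filter from SciPy on a synthetic noisy step edge. The noise level, the sigma value, and the variable names are my own choices for illustration, not values from the lecture.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)

# synthetic black-to-white step edge at column 32, plus additive noise
img = np.zeros((64, 64))
img[:, 32:] = 1.0
noisy = img + 0.3 * rng.normal(size=img.shape)

# x derivative (central differences in the interior, one-sided at the borders)
raw_dx = np.gradient(noisy, axis=1)
smooth_dx = np.gradient(gaussian_filter(noisy, sigma=2.0), axis=1)

row = 32
print("raw |df/dx| max at column:     ", np.argmax(np.abs(raw_dx[row])))
print("smoothed |df/dx| max at column:", np.argmax(np.abs(smooth_dx[row])))
# the smoothed derivative reliably peaks at the true edge (column 31/32),
# while the raw derivative may just as well peak at a noise pixel
```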
I don't actually know where the term comes from. The other one is called the Laplace operator. The nabla operator is the derivative operator, so a derivative in x and a derivative in y direction. If you apply it to a scalar function f, the result is called the gradient of f, and it is a vector containing the x and the y derivative of your function. You can then compute the norm, the Euclidean norm, of that vector; that is this expression here, the sum of the squared partial derivatives under a square root, and this is the gradient norm. Sometimes in the literature this expression itself is called the gradient, so there is a little bit of ambiguous terminology: if you read "gradient", strictly speaking it should be the vector of partial derivatives, but sometimes what the authors mean is the norm of that vector, and they just call it the gradient as well. For example, when you say an edge is a location of strong image gradient, of course you mean the norm, because for a vector to be "strong" doesn't really have a meaning. So these terms are used interchangeably. The gradient norm is a very useful operator, but you should keep in mind that it is a nonlinear operator, which means that if you take two images, sum them, and then compute the gradient norm, or first compute the gradient norms of the two images and sum those afterwards, you do not get the same thing. But it is an operator that allows you to detect edges in arbitrary orientation. The reason is that if you have an edge in an arbitrary orientation, then the brightness changes both in x and in y direction, sometimes a little more in x direction if it is a more vertical edge, sometimes more in y direction if it is a more horizontal edge, but you basically measure the changes in x and in y direction and combine them in this manner, so it measures changes in any direction. The gradient norm, you can check it for yourself, is what is called rotationally covariant. That means that if I first rotate an image and then compute the gradient norm, or first compute the gradient norm and then rotate, it comes out the same. In the literature this is often called rotationally invariant, but strictly speaking that is not correct. Invariant would mean that if I apply the transformation the structure does not change at all, so rotationally invariant would mean that if I rotate the image and then apply the gradient I get the same as if I just apply the gradient, and of course that is not true, you get the rotated gradients. So this is called rotationally covariant; it means the operations are exchangeable, you can either apply the gradient norm and then rotate, or first rotate and then apply the gradient norm. This is very useful, because it means the gradient norm is an operator whose output does not depend on how I hold my camera: I will get the same performance even if I rotate the camera, and this covariance is vital for image analysis. Here is an example of the image gradient. This is the input image, and this is a visualization of the gradient norm, where black means the gradient is zero and white means the gradient is large, and you can see that at the boundaries of objects you typically get large values of the gradient. But you also see something else, even though coins are essentially uniformly colored. This might be an application that would be useful: let's say you are at the cashier's and you throw a lot of coins on the table and you want a computer to figure out quickly how much money is on the table.
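As a quick numerical check of the covariance property just mentioned, here is a small sketch. It assumes np.gradient's central differences as the discrete derivative and uses a 90 degree rotation so that no interpolation is needed; this is my own illustration rather than code from the lecture.

```python
import numpy as np

def gradient_norm(f):
    fy, fx = np.gradient(f)           # partial derivatives in y and x
    return np.sqrt(fx**2 + fy**2)     # |grad f| = sqrt(fx^2 + fy^2)

rng = np.random.default_rng(1)
img = rng.random((32, 32))

a = gradient_norm(np.rot90(img))      # rotate first, then take the gradient norm
b = np.rot90(gradient_norm(img))      # gradient norm first, then rotate
print(np.allclose(a, b))              # True: the gradient norm is rotationally covariant

# the gradient norm is nonlinear: summing images first is not the same as
# summing gradient norms afterwards
g = rng.random((32, 32))
print(np.allclose(gradient_norm(img + g), gradient_norm(img) + gradient_norm(g)))  # False
```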
You know, it would be terribly useful, for example, if you are standing in line at Aldi or so, and you have an older lady in front of you who can't count the money and throws it all on the counter, and it takes forever sometimes, right, and you wish there was some algorithm that did it quickly and said how much money was there. You could use these gradient filters to determine the boundaries of the coins, and from then on you can determine the size of the coins, and using the color information you can determine which coin you are looking at, and so on. But you already see here that although a coin is made of uniform material, since it is metallic it reflects the light, and so you get strong brightness transitions inside the coin as well, and as a consequence it is much harder to localize coins just using the gradient, although this image has no noise in it, right. And in real-world conditions, if your background is not really white either, it may get more and more difficult. But at least you can see that you can do some things with this operator. Here is the flipped version of the nabla operator, the Laplace operator. The Laplace operator is sometimes written as nabla squared: if you take the derivative operator and square it, meaning (∂x, ∂y) transposed times (∂x, ∂y), you get ∂x squared plus ∂y squared, and that is indeed the Laplace operator, so it is an operator of second derivatives. Nabla applied to a vector, by the way, is called the divergence. This is why I wrote nabla here: sometimes it is called the gradient, if you apply it to a scalar, and sometimes it is called the divergence, if you apply it to a vector. I know researchers who don't like the nabla symbol for that reason, and so in their papers they will write grad, short for gradient, and div for divergence, and they do that to really tell the reader which kind of structure they are applying the operator to, whether it is a scalar or a vector. So the Laplace operator is just the sum of the second derivatives in each component: the second derivative in x plus the second derivative in y. By the way, the same operators of course apply in 3D as well, and then you would add the second derivative in the z coordinate, or if you want you can even include time, if that makes sense. The Laplace operator, in contrast to the gradient norm, is a linear operator, which means, as I wrote here again, that if I apply the Laplacian to alpha_1 times f plus alpha_2 times g, then I get alpha_1 times the Laplacian of f plus alpha_2 times the Laplacian of g, for any alpha_1 and alpha_2 and any f and g. So it is linear, and as I said, linearity has some advantages, because I can first sum images and then apply the operator, or first apply the operator and then sum, and the result comes out the same. Here is what the Laplace operator does, and maybe a little intuition behind why it does what it does. As you see, the assumption here is always that the brightness jumps from black to white, say background and foreground, so we live in a simplified black-and-white world. I am already over time, am I not? Yes, you should remind me of that. Let's stop here and continue tomorrow.
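As a small appendix to the linearity point made just before the close, here is a sketch of a discrete Laplace operator together with a numerical check of its linearity. The five-point stencil and the periodic boundary handling via np.roll are my own choices; the lecture does not fix a particular discretization.

```python
import numpy as np

def laplace(f):
    # second derivative in x plus second derivative in y, discretized as
    # f(x+1,y) + f(x-1,y) + f(x,y+1) + f(x,y-1) - 4 f(x,y)
    return (np.roll(f, -1, axis=1) + np.roll(f, 1, axis=1)
            + np.roll(f, -1, axis=0) + np.roll(f, 1, axis=0) - 4.0 * f)

rng = np.random.default_rng(2)
f = rng.random((16, 16))
g = rng.random((16, 16))
a1, a2 = 0.7, -1.3

# linearity: Laplace(a1*f + a2*g) == a1*Laplace(f) + a2*Laplace(g)
print(np.allclose(laplace(a1 * f + a2 * g), a1 * laplace(f) + a2 * laplace(g)))  # True
```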