Why did Netflix use NGINX and FreeBSD to build their own CDN? by Gleb Smirnoff

Hi there, my name is Gleb Smirnoff. I work at NGINX, and we work in tight cooperation with our partner Netflix: together we are building the Netflix content delivery network, and every day we make it faster and more efficient than it was yesterday. The CDN is based on open source technologies, in the first place the FreeBSD operating system and the nginx web server. My talk today is about how we built it, how we run the network, how we benefit from using open source, and how open source benefits from us using it.

I think probably everyone here knows what Netflix is, but just in case you don't: it is a video-on-demand streaming service. I'd like to show you some numbers to give a sense of how big the service actually is. We have 50 million subscribers, and that actually means more than 50 million people, because one subscriber is usually a household. Most of the subscribers are in the United States, but apart from the United States we are already running in more than 40 countries, and these are rapidly growing markets. The Netflix video collection consists of over one petabyte of data, and every byte of this data is available on demand right now. Speaking of the USA, we are the traffic generator number one in North America: we account for more than one third of all downstream traffic. I think this is impressive, so now let's look at how this enormous traffic generator works inside.

Here is the basic layout of the components of Netflix video streaming. We run all the complex stuff in Amazon: this includes the business logic, our data mining, encoding videos, recoding and resizing them, and encryption. The actual video stream is also initiated via Amazon: the client logs in, the website authenticates it, and then the cloud controls its operation. But all the bulk video data, the actual data that makes up that one third of the internet I just showed you, is served by the content delivery network.

When Netflix started its instant video streaming service, we initially outsourced the content delivery to the big CDNs: Level 3, Akamai and Limelight. But as the popularity of online video grew, the amount of traffic served also grew rapidly, faster than the CDNs could build themselves up, and soon it was clear that video delivery over the internet is the principal activity of Netflix. We are no longer a company that ships DVDs; we are first and foremost a video streaming company, and if this is our principal activity, we probably should not outsource it.

There is actually a number of reasons. The first reason I already mentioned: we need to grow faster, and the CDNs can't keep the pace. And of course this amount of traffic, when outsourced, is very expensive, so running our own CDN reduces costs. Besides these two obvious reasons, what are the other reasons to build our own CDN?

The next one is about control of the video streaming. Simplified, streaming has three components: the video player that connects to the web server, the web server itself, and the internet in between them. Ideally we want to control all of it, but at least we want to control the ends of this chain, the client and the server. We already control the client, because all the players are developed at Netflix. And in this chain of client, internet and server there are plenty of things that can go wrong: video streaming is very sensitive to packet loss,

round-trip time, jitter, delays and all kinds of internet anomalies. If the player experiences certain problems, it reports these problems to the control server, and we want to fix them: we want early detection of problems and clients being rerouted to other servers. We also want to be able to log in to these servers and look from the inside at what is wrong. In short, we want to control the server side. Controlling the server side also gives us the possibility to run our own specific TCP congestion control algorithms, to run special HTTP modules, and so on. And of course all of this means that we are building a specialized CDN, not a generic one that can serve anything, but a CDN that can serve our video, and a specialized product works better than a generic one. Finally, we want to spread our CDN around the countries and put the content closer to the clients, to reduce all the internet anomalies I just mentioned.

The answer to all these reasons is Open Connect. So what is Open Connect? We want to spread our caches throughout the internet and put them close to clients, and what we offer ISPs is our caches for free, so that an ISP can install our cache in its own rack, and this cache will be dedicated to serving video content to the Netflix customers connected to this ISP. This is actually a triple-win situation: the ISP reduces the load on its outer links by one third, the clients get a better connection to the video and thus better movie watching quality, and we, Netflix, gain all the things I mentioned on the previous slide. Unfortunately, not all ISPs accepted this offer, so in some places we install caches of our own at large internet exchange points, in an attempt to improve streaming quality for the customers connected to the ISPs that were unwilling to join the Open Connect initiative. Technically these are the same caches; they differ from the ISP ones only in the configuration of routing.

What do these caches look like? On this slide I put a photo of one of the first caches; we call them Open Connect Appliances. The core idea of building the appliance is that we push as many terabytes as possible into one unit of rack space. Originally it was a custom chassis, four units in height, painted in the corporate red color, and we also put quotes from classical movies on them; this one is one of my favorites. Unfortunately we no longer paint them red, because some ISPs said they prefer neutral gray and white colors in their racks, but we may revise that.

Okay, let's turn this thing upside down and look at what's inside. There are a lot of disks. Right now we have two versions of the Open Connect Appliance. The first one is full of spinning disks, the one on this slide, and it also has a few extra solid-state drives for the most popular content. It carries 144 terabytes of data, which means it can carry a large part of the entire Netflix collection. The other version, the SSD box, consists only of SSDs. It is one unit in height, and it can carry ten times less data, but at the same time it can generate more gigabits of traffic per second, so it is used for storing the most popular part of the collection.

These appliance versions, or, as we call them, revisions, are rolled out and revised a couple of times per year, as we pick more modern hardware for them. We usually choose disks of the maximum volume available at the moment, but we do not go for the most expensive CPU or mainboard. There is no reason to do that, because we are usually not CPU bound but disk bound or network bound, and if we were CPU bound, we would be better off investing in improving the code rather than purchasing the most expensive CPU. We actually do count money when we choose hardware: for example, we started to buy 40-gigabit NICs only after these cards became cheaper than a set of four 10-gigabit cards.

And now for the software. What is inside an Open Connect Appliance? The core components, as already mentioned, are the FreeBSD operating system and the nginx web server, plus the BIRD routing daemon that runs BGP. To install the software on the appliance we do not use any regular installation procedure; instead we build NanoBSD images, which we call firmwares. A firmware consists of everything required for Open Connect operation: the operating system kernel, utilities, a package of nginx with all the additional modules for nginx, the BIRD package, scripting languages, all our internal scripts, and finally all the configuration files. We also have a framework for unattended upgrade and rollback of firmwares, so upgrading or rolling back a firmware on a given appliance is a one-click operation.

Why did we choose these software components for Open Connect? When the Open Connect project started, it was clear from the very beginning that we were going to push the limits and squeeze more and more gigabits per second from a single box, and thus we needed to start with open source products that we can actually modify. And since we give away the appliances to ISPs, there is involvement of a third party, and although we give them away for free, we still need the software to be BSD licensed, because GPL is tricky in the legal area when it comes to giving things away to third parties.

Why exactly FreeBSD? We chose it since it is known to be a good platform for building an internet server: even unmodified, it runs fast and stable, and that's a good ground to start with. The second important point is that FreeBSD has a very nice community to work with, willing to cooperate with vendors, and that is exactly what we need if we are going to upstream improvements to the sources.

For the web server, no surprise, the best one is nginx. The same arguments apply: it is fast and stable out of the box, and we wanted to launch Open Connect as fast as possible. And nginx is somewhat unique; I think everyone here knows that it is an open source, BSD-licensed product, while at the same time all its developers are full-time employees of a legal body, and this legal body offers superb commercial support for the product. That is somewhat unique: a combination of the most beneficial features of open source and of commercial software in one piece of software. What is also very important for Netflix video streaming is the flexible framework for custom modules, because we have a couple of modules that are specific to video streaming.

The last row in the scales is how nginx and FreeBSD cooperate when running together. On FreeBSD, nginx is driven by kqueue, the FreeBSD-specific event notification system and one of the best APIs to multiplex I/O. Also, out of the box, without any extra effort, nginx on FreeBSD uses a special trick: combining the sendfile system call with the asynchronous aio_read system call. This trick prevents sendfile from blocking on disk I/O, resulting in outstanding performance.

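To make the trick concrete, here is a minimal sketch of the pattern, assuming FreeBSD-10-era semantics of the SF_NODISKIO sendfile flag and POSIX AIO completion delivered through kqueue. This is my own simplified illustration, not the actual nginx code; the function name, buffer size and bookkeeping are made up for the example.

```c
/*
 * Sketch of the sendfile + aio_read trick.  With SF_NODISKIO,
 * sendfile(2) never sleeps on disk: if the file pages are not
 * resident it fails with EBUSY.  We then schedule an aio_read(2)
 * to page the data in, with completion delivered as an EVFILT_AIO
 * kevent, and retry sendfile when the event fires.
 */
#include <sys/types.h>
#include <sys/event.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <aio.h>
#include <errno.h>
#include <string.h>

static char preload_buf[128 * 1024];	/* read-ahead size: made up for the example */

int
send_chunk(int kq, int filefd, int sock, off_t off, size_t len, struct aiocb *cb)
{
	off_t sent = 0;

	/* SF_NODISKIO: never block on disk; fail with EBUSY instead. */
	if (sendfile(filefd, sock, off, len, NULL, &sent, SF_NODISKIO) == 0)
		return (0);		/* pages were resident; data is on the wire */
	if (errno != EBUSY)
		return (-1);		/* a real error */

	/* Pages not resident: page them in asynchronously. */
	memset(cb, 0, sizeof(*cb));
	cb->aio_fildes = filefd;
	cb->aio_offset = off;
	cb->aio_buf = preload_buf;
	cb->aio_nbytes = len < sizeof(preload_buf) ? len : sizeof(preload_buf);
	/* Deliver completion as an EVFILT_AIO kevent on the worker's kqueue. */
	cb->aio_sigevent.sigev_notify = SIGEV_KEVENT;
	cb->aio_sigevent.sigev_notify_kqueue = kq;
	cb->aio_sigevent.sigev_value.sival_ptr = cb;
	if (aio_read(cb) == -1)
		return (-1);
	return (1);	/* in progress: retry sendfile when the kevent arrives */
}
```

nginx's real implementation is reportedly even more frugal: instead of reading a large chunk, it issues an aio_read of a single byte and lets the kernel's read-ahead pull the surrounding data into the page cache.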
This is just a random day on a random Open Connect appliance, a typical day of traffic this October. You can notice the peak traffic in the evening, around 20 gigabits per second. The plot looks shifted because the horizontal axis is in UTC. At night the traffic goes down and the box switches to fill mode: it stops serving data and starts to renew its own collection, pulling new titles from the origin server. This small negative peak is actually the fill. Then we have a tiny peak at breakfast time, and during the day the traffic goes up again toward the evening. We do the fill in a dedicated window, rather than continuously, because simultaneous writes and reads to the SSDs impose a penalty on I/O time, so we try to either read from the SSDs or write to them, but not both at once.

Here are some nice details on how the streaming goes. We usually serve up to 30 thousand TCP connections per appliance, with the clients requesting data in quite small chunks, usually around 100 kilobytes, which is quite far from optimal for the server. And although we do have some popular content, the vast majority of data requested by clients is not found in the operating system memory cache, so we need to read it from disk straight to the network. Streaming several tens of gigabits of traffic per second with such an unfriendly request pattern is not an easy task.

The next important topic I would like to cover today is how we build on top of open source. On this slide you can see a typical strategy of most large companies for dealing with open source software. What they do is grab a stable version, import it into their own repository, and start to develop their product on top of it. This has been done in many places for many years, and I think many of you are familiar with this approach. At first glance this looks like a good strategy: you start with the stable version, which supposedly has no bugs, and you develop your code without any interaction with the community, without wasting time on those jerks who just do not accept your patches, and so on. This of course saves a lot of time: you don't need to synchronize with upstream, you don't need to advocate for your changes to be included, and so on. But does it work in the long term? At Netflix we think that no, it doesn't. Many Open Connect team members came from different companies which managed their software exactly this way, and we all, independently from each other, learned that this is the wrong way to do things. At Netflix we do the opposite: we pull the bleeding edge version of the software, and we constantly push our changes back.

Before I go on and tell you what's good about our way, let's return to the traditional one and see what is wrong with it. The traditional approach is based on several myths, and I am going to dispel them one by one.

The first myth is, of course, about development versions being buggy. I will not argue: of course they are. But I will claim that they contain approximately the same number of bugs per line of code as the stable versions do. Yes, in development versions there are stupid typos and no-brainers, but these are usually fixed the next day, and the concentration of non-trivial bugs in the stable and development versions is usually about the same. To those who do not believe me, I can suggest the following: take a look into the bug tracker of any large open source project; you choose one, and as for me, I already did this. Bug tracking software usually allows bug submitters to enter the version of the software where they encountered the bug, so what you need to do is query the database and see how many bugs people discover in development versions and how many they discover in stable versions. Most bugs are discovered in stable versions. What does this actually mean? It means that bugs are discovered only when the code is tried out; they are not discovered simply because the code sits in a repository for several months and ages like a wine.

Okay, many people have actually already dispelled the first myth for themselves, and they invented the next one: that if we wait for the stable version, someone else will encounter and fix all our bugs. They believe that if they act like free riders, which is a term from game theory, then they will benefit from others: when the stable version is released, they wait for yet another year, and finally try it after a year or so. I will not go deep into the ethical implications of such an approach, other than noting that if everyone acts like a free rider, that usually ends in the problem named the tragedy of the commons. Instead of these ethical things, I will note that these people get an outdated version, with its support time reduced by exactly the period they waited. And finally, this strategy simply doesn't work. First, there are always bugs that only you will discover. If you are doing something more complex than building a personal webpage, if you are building a commercial product on top of open source, you are probably not doing something trivial, and you are about to exercise code paths and situations not tested by others. The second point is that many open source projects merge into stable versions not only security fixes and critical bug fixes: they also merge performance improvements and new features, and this means merging bugs, too. So waiting for a stable version to stabilize means waiting forever.

Another important point about early bug discovery, compared to late bug discovery, is that when you discover a bug in the development version, the author of the code is still around: he is working right there, the code is hot in his mind, and the bug will be fixed quicker than in the other case. After a couple of years, the author may have quit the company he worked for, switched to another project, or gone on a trip around the world. And even if he is still there, in stable versions there are API and ABI constraints, the overall code completeness and frozenness, and fixing bugs takes more time. Moreover, if you report bugs early, you not only get bugs fixed, you influence the actual development of the code, because you report your real-life scenarios to the developer who is writing the code right now, and the resulting code will probably fit your needs better, because you are the early tester.

The next important myth is about saving time. Of course, following the development version and working with the open source community also consumes time, and it is so tempting to cut these expenses and sit on the version that was initially imported.

Unfortunately, the experience of numerous companies shows that at a certain point in the future you will face a choice: either you do a major upgrade, a big jump over a couple of versions, or your product dies. Why? Simply because when you forked off the open source, you started to improve it in the direction you need; you do not cover the entire project. Meanwhile, new hardware is released, new protocols become common, the internet around you switches to new standards, and your code base doesn't support any of that. At a certain point you will also need to handle security advisories yourself, since old versions of your product are no longer supported by the vendor. Usually, an upgrade after five years of development independent of upstream requires several months of several experienced developers working only on this upgrade, taking them away from their normal development activities, and here you pay back all the time you saved before, and even more.

I think some of you can point me to exceptions, like Apple, which took FreeBSD, made Mac OS X out of it, never did complete merges from open source, and just went forward with their own development. What is different between Apple and you is that developing an operating system is Apple's principal activity; they have developer manpower of a size comparable to or bigger than the open source community. That's why they can run this way. Most companies can't. So here is the rule of thumb: if your developer manpower is smaller than that of the open source community whose product you take, then you need to follow the open source; otherwise you will find yourself in a couple of years in an uncomfortable situation.

And the last important myth is about sharing code: that if you disclose your sources, you are disclosing intellectual property. Okay, let's look at Netflix. What Netflix does is provide customers with movies and series; we are not participating in any gigabits-per-second Olympic competition. So if we share our know-how on improving FreeBSD and nginx to serve more gigabits of data per second, how would that help our competitors in any way? Or, if we keep our bug fixes to open source private, would that in any way prevent someone else from fixing those bugs? So rest assured that sharing code that is generic and not closely tied to Netflix video streaming is absolutely safe for our intellectual property.

Now, I hope I convinced you that sharing code is harmless, but is there any benefit in sharing code? Usually, rhetoric about open source speaks about the fairness of giving back: you take, so you should give back, and people who open their code are depicted as altruistic donors. However, there are definite benefits in giving away your code. If you give code to the community, then you automatically become part of the community, and if you give code on behalf of a company, then not only you as a person but the company also becomes part of the community. And what does that mean? A community member can influence the community: in any kind of discussion, the voice of a community member counts for more than the voice of a non-member who simply takes from open source. So we are going to influence the development of the software we are interested in, in two ways: we inject our code there, and we are heard by the rest of the community. The second important point: once the code goes upstream, we no longer carry the burden of maintaining it.

Any new change in upstream now has to pass build and tests with our code included, and we are not responsible for that; the others are, and this saves us a lot of time in the future. Also, once we open the code, we get more eyes on it: several dozen experienced people around the world will read our code, and if they find something visibly wrong, they will report it to us. And even if some tricky bug has sneaked through our internal review and through the open source review, it can be found only by being encountered, and once the code is running in open source, we have a lot of free testers around the world who will discover the bug for us.

Okay, let's take a look at how the strategy I just described works in practice. What did we achieve relying on it? The Open Connect initiative started in 2011, with just two developers on the team. In June 2012 Open Connect was announced, and the first Open Connect caches started to serve; by that date there were three developers on the team. In the first months of Open Connect operation the appliances were able to serve less than 10 gigabits of traffic per second, but the team set a goal of 30 gigabits per second within the next two years, and we did achieve it. Right now the team has grown to 10 developers, and, believe me or not, our next goal is 80. Maybe this sounds over-optimistic, but two years ago 30 did, too.

What do we do to improve the traffic throughput? First, we are interested in the network stack of FreeBSD, because this is the part we use to push the data to the clients. Second, we read the data from disk, so the storage stack is the second thing we are interested in. And of course, the data flow between the network stack and the storage stack involves the operating system's virtual memory subsystem. So these are the three core subsystems we modify.

Now a little bit technical: what have we actually already done? With modern multiprocessor hardware, the most common performance improvement is usually reducing contention on locks and on cache lines. Here are enumerated some important changes to the network stack that we made to improve its SMP friendliness. I won't go very deep into the technical details; in short, we reduced the time locks are held, we decoupled memory writes to the same cache line by different CPUs, we changed some algorithms from shared memory to per-CPU copies, and so on. Anyone interested in the details can ask me after the talk. Some subsystems needed larger changes, where we did a major overhaul: this includes the kernel routing table and some code in nginx around the sendfile call. We also converted several subsystems in FreeBSD to be multi-threaded; for example, the syncer used to be a single thread, and now there is one thread per mount point. And of course, whenever we encounter a bug, even if it is not in the area of our primary interest, we still fix it and upstream the fix, so there are several parts of the software that we modified not for performance but with various bug fixes.

We also did a couple of complete rewrites of subsystems in the FreeBSD kernel from scratch. In FreeBSD 10 we introduced fast and lockless counters that store data in per-CPU memory, and we achieved that without any locking, even without a kernel critical section. In FreeBSD 11 we have already converted all network-related statistics to this new facility.
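Since per-CPU counters are easier to see than to describe, here is a minimal sketch of how the facility appears to kernel code, assuming the counter(9) API that landed in FreeBSD 10; the counter and the functions around it are invented for the example.

```c
/*
 * Sketch of FreeBSD's counter(9) per-CPU counters.  counter_u64_add()
 * updates only the current CPU's slot, so the hot path takes no lock
 * and bounces no cache line between CPUs; only the rare fetch walks
 * all CPUs and sums the slots.
 */
#include <sys/types.h>
#include <sys/systm.h>
#include <sys/counter.h>
#include <sys/malloc.h>

static counter_u64_t frames_served;	/* example counter */

static void
stats_init(void)
{
	frames_served = counter_u64_alloc(M_WAITOK);
}

/* Hot path: no lock, no atomic read-modify-write on a shared line. */
static void
stats_frame_served(void)
{
	counter_u64_add(frames_served, 1);
}

/* Slow path, e.g. a sysctl handler: sum the per-CPU slots. */
static uint64_t
stats_read(void)
{
	return (counter_u64_fetch(frames_served));
}
```

The point of the design is that the hot-path increment touches only the current CPU's memory; on amd64, for instance, the update can be a single add instruction to a per-CPU address, which is why not even a critical section is needed.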

Right now we are very close to releasing a completely new sendfile implementation for FreeBSD. Its core feature is that it doesn't block on disk I/O; it actually does the disk I/O in the background. It also features configurable read-ahead and configurable VM caching. Actually, both features on this slide each deserve a 30-minute talk of their own, so I will not go into detail right now.

Another example of community involvement is that we not only develop our own code, but we also work in cooperation with other open source community members. Here are a couple of examples. When Isilon and the FreeBSD Foundation sponsored the unmapped I/O work, we were the early testers of it, and we actually ran the unmapped I/O patch in production all over our CDN before it was even committed to FreeBSD. The same went on when EMC developed improvements to the virtual memory page lookup algorithm in the FreeBSD VM.

And now about the things that are not yet released, but that you will probably see within the next year or two. We already have a new TCP congestion control algorithm. It doesn't have a cool name yet, so we call it "Netflix". It already works in production, and I think it will be disclosed. Right now, together with Chelsio, the company that makes our network interface cards, we are working on hardware-assisted TCP pacing: that means that when the operating system sends a TSO chunk to the interface card, it also specifies the pace at which packets should go out, to prevent bursts. We already have a prototype for kernel-side TLS offload. What's that? Right now, in any operating system, the sendfile system call cannot be combined with an SSL connection, and we are about to fix this. What we are going to do is that nginx does its SSL handshake with the peer, and then it uploads the session keys to the socket, and after that we can issue sendfile on this socket, which results in data being read from disk, encrypted, and sent to the socket. We are also working hard on some low-level improvements to the SSD drives, in tight cooperation with vendors, that will allow us to write to SSDs without any penalty on the read speed. And we are brainstorming various improvements to the FreeBSD virtual memory; one of the topics is multithreading the page daemon, and we may also consider investing some developer time into proper NUMA support in FreeBSD. I'm pretty sure this list is not complete.

So what I wanted to show you is that we used a different strategy for building a product on top of open source, and with a small team we actually achieved a lot in a short period, which I think proves that this strategy works. Thank you.

Q: Is the congestion control algorithm run on the server side, or on both the client and the server?

A: No, it is run only on the server side. You know, in FreeBSD you have pluggable congestion control algorithms, and this is what it actually is, a pluggable module. You can set a global default via sysctl, and you can configure congestion control per socket via setsockopt. What we do is set it via setsockopt, so we do it only on the HTTP sockets. And yes, of course it is one-sided, because the client is running on some Windows or Mac or something else.
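As a concrete illustration of per-socket selection, here is a minimal sketch using FreeBSD's TCP_CONGESTION socket option; the helper function is made up, and "cubic" stands in for whichever cc(4) module is loaded, since the in-house algorithm from the talk is not public.

```c
/*
 * Sketch of per-socket congestion control selection on FreeBSD.
 * The global default comes from the sysctl net.inet.tcp.cc.algorithm;
 * here we override it on one socket, the way the talk describes nginx
 * doing on accepted HTTP sockets.
 */
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>

int
set_cc(int sock, const char *algo)
{
	/* algo names a loaded cc(4) module, e.g. "newreno" or "cubic". */
	return (setsockopt(sock, IPPROTO_TCP, TCP_CONGESTION,
	    algo, strlen(algo)));
}

/* e.g., right after accept(): set_cc(client_sock, "cubic"); */
```

This matches the split described in the answer above: everything else runs on the global default, while the server switches each accepted HTTP socket individually.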

Q: Do you use this congestion control for your internal transfers as well?

A: We actually don't have data centers; we have the caches spread all over the internet. And as you have seen on the plot, the amount of traffic flowing inside, like pulling new titles from an origin server to a cache, is really small, and it probably runs through the default FreeBSD congestion control, because, as I said, we set the Netflix congestion control per socket: any time nginx accepts a connection, it does a setsockopt on the client socket.

Q: Who initiates the transfer during the fill?

A: As far as I know, the cache itself pulls: it gets the database of titles, compares it with what it already has, and fetches the difference during the fill. And no, caches do not pull from other caches; they all go to the origin server, and clients never go to the origin server. Oh, actually we do peer to peer? I'm sorry, I'm quite far from operations.

Q: How do you select the ISPs?

A: We actually offer this initiative to all ISPs, but not all of them accepted. We do not select them.

Q: Do you take the full BGP routing table?

A: No, as far as I know we don't take the full routing table; there is no reason to do that. We only propagate our routes: we run BGP not to fetch routes, but to announce ours.

Q: Are there security implications in the kernel TLS work?

A: Well, we hope it will improve performance; we actually don't know yet, we are only prototyping it. We already have some working code, but it's far from being ready. As for the security implications, yes, that stuff should probably go through a security review, of course.

Q: Will you have to add new crypto code to the kernel?

A: Since the handshake is still happening in userland, and this is the most complicated part, and we offload only the block ciphers to the kernel, we will simply use the kernel crypto facilities that are already there in FreeBSD. So we are not adding any new crypto stuff to the FreeBSD kernel; the kernel already has all we need.

Q: [inaudible]

A: Only theoretically, because that would actually mean rebuilding the entire software from scratch, everything.
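To tie the kernel TLS answers together, here is a conceptual sketch of the flow described earlier in the talk: handshake in userland, session keys pushed down to the socket, then a plain sendfile with the kernel doing the bulk encryption. The socket option and the key structure below are hypothetical placeholders, since the prototype was not public at the time; only the flow itself is from the talk.

```c
/*
 * Conceptual sketch of kernel-side TLS send offload.
 * TCP_TXTLS_KEYS and struct tls_tx_keys are HYPOTHETICAL names,
 * invented for this illustration; the flow is the one described in
 * the talk: userland handshake, keys handed to the socket, then
 * sendfile(2) with in-kernel encryption.
 */
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdint.h>

#define	TCP_TXTLS_KEYS	0x7f		/* hypothetical option number */

struct tls_tx_keys {			/* hypothetical layout */
	uint8_t	key[32];		/* bulk cipher key from the TLS handshake */
	uint8_t	iv[16];			/* initial IV */
};

int
serve_encrypted(int sock, int filefd, off_t off, size_t len,
    const struct tls_tx_keys *keys)
{
	/* The TLS handshake already happened in userland (nginx/OpenSSL). */

	/* Push the negotiated session keys down to the socket. */
	if (setsockopt(sock, IPPROTO_TCP, TCP_TXTLS_KEYS,
	    keys, sizeof(*keys)) == -1)
		return (-1);

	/*
	 * From now on a plain sendfile(2) works on this socket: the
	 * kernel reads the file, encrypts it with the crypto facilities
	 * it already has, and sends TLS records, so the bulk data never
	 * takes a round trip through userland.
	 */
	return (sendfile(filefd, sock, off, len, NULL, NULL, 0));
}
```

FreeBSD later shipped a production version of this idea as ktls(4), driven from OpenSSL, but that interface postdates this talk.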