Intel DPDK: Data Plane Development Kit – John Fastabend, Intel

All right, I'm John Fastabend, I work for Intel, and I'm going to talk about the Intel DPDK. Quick agenda: the first few slides are fairly high-level, almost marketing-level slides, but they at least give you some sense of how the software is organized. Then there are a few things about performance, and then the virtualization piece, which might actually be one of the more interesting parts here. You can take or leave what they're doing there, but it at least highlights some of the limitations of the existing virtualization models we have.

So from a really high level, what is the DPDK? It's basically a UIO layer that maps the hardware registers into user space, and then on top of that there's a set of libraries. Fundamentally that's what it is: map the hardware into user space, and then provide a bunch of packet libraries to do classification, batched transmits, and so on. The classification code is optimized for the Intel parts here; it's optimized for Intel SSE, the CPU vector extensions, and they do some memory-management tricks as well, for example using huge pages, which gives them a lot of advantage too.

That's essentially what this slide says: memory manager, buffer manager, queue manager, flow classification, all of it optimized and all of it delivered as libraries. When you actually get this, you don't get a stack or anything you would think of like that; it's really just a bunch of libraries with some sample applications that run on top of them. For example, as part of the published source there's an L2 forwarding application that uses the libraries: you compile the l2fwd app and you can do L2 forwarding. Similarly there's a layer 3 forwarding application that does basic IP-based forwarding, and a few other samples beyond that. These really are sample applications, with the intent that you will modify them for your environment; they're not supported applications that we hand to people to run unchanged in production. I expect you'll go and tweak these and make them your own.

So one question I get is where all the performance comes from, or why this is good for performance. It's a question I had too when I was looking at this, and here are a few of the things they actually do. One, the drivers are all poll mode, so that helps. The other thing they do is aggressively isolate the CPUs: in all the documentation and the sample applications they provide, you see a strict one-to-one mapping of software threads to hardware queues, each on its own CPU. They can do this because the environments they're looking at here are not big multi-processing, multi-threaded environments; these are very targeted applications, to the point where they even isolate the CPUs at boot time that are going to be bound to this application, so the scheduler isn't trying to use them.
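To make that one-thread-per-queue, one-queue-per-core mapping concrete, here is a minimal sketch of the pinning side of it in C. The worker structure, the core and queue numbers, and the poll_queue() placeholder are all illustrative; DPDK's actual core-launch mechanism (its EAL) is its own thing.

    /* Minimal sketch: one worker thread pinned per core, each owning one
     * hardware RX queue. Names and numbers are illustrative only. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    struct worker_arg {
        int core_id;   /* core this thread is pinned to */
        int queue_id;  /* hardware RX queue it busy-polls (hypothetical) */
    };

    static void *worker(void *p)
    {
        struct worker_arg *arg = p;
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(arg->core_id, &set);
        /* Pin this thread to exactly one CPU so the scheduler never
         * migrates it; the DPDK samples additionally boot with isolcpus=
         * so nothing else ever runs on these cores. */
        pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);

        for (;;) {
            /* poll_queue(arg->queue_id);  -- hypothetical busy-poll of the
             * one hardware queue owned by this core, no interrupts */
        }
        return NULL;
    }

    int main(void)
    {
        static struct worker_arg args[] = { { .core_id = 1, .queue_id = 0 },
                                            { .core_id = 2, .queue_id = 1 } };
        pthread_t tid[2];

        for (int i = 0; i < 2; i++)
            pthread_create(&tid[i], NULL, worker, &args[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(tid[i], NULL);
        return 0;
    }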
The next thing is batch packet processing: they handle multiple packets at a time. That's something I think we could probably do in the kernel as well; just because a lot of this is done in user space doesn't mean it couldn't be done in the kernel, it just means we haven't got to it yet, it hasn't been done yet. Batch packet processing might be something we could pick up, and I know it's come up a handful of times here and there. The other one is the huge memory pages I just mentioned; that basically came out of some performance analysis we did with some tools, which showed that the TLB was thrashing, and by using huge pages they reduce the TLB flushes. The next thing is the SSE instructions, the Intel streaming SIMD extensions, basically vectorizing the handling of some of these tuples and so on; GCC can do this as well, so in theory you could compile your kernel with that support. The last one is lockless queues: their queues are compare-and-swap based, they're not really spin-locking on anything, and combine that with the fact that everything is pinned to CPUs and you've made a performance improvement.
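As a rough illustration of what "lockless queues" means here, below is a minimal single-producer/single-consumer ring using C11 atomics. The names (ring, ring_push, ring_pop) and the fixed power-of-two size are mine for illustration; DPDK's own ring library is more elaborate (multi-producer/multi-consumer modes, bulk operations), but the idea of avoiding locks on the fast path is the same.

    /* Minimal lockless single-producer/single-consumer ring (sketch). */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define RING_SZ 1024            /* must be a power of two */

    struct ring {
        void *slots[RING_SZ];
        _Atomic size_t head;        /* written only by the producer */
        _Atomic size_t tail;        /* written only by the consumer */
    };

    static bool ring_push(struct ring *r, void *pkt)
    {
        size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

        if (head - tail == RING_SZ)
            return false;           /* full */
        r->slots[head & (RING_SZ - 1)] = pkt;
        /* release: the slot contents become visible before the new head */
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return true;
    }

    static void *ring_pop(struct ring *r)
    {
        size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        size_t head = atomic_load_explicit(&r->head, memory_order_acquire);

        if (tail == head)
            return NULL;            /* empty */
        void *pkt = r->slots[tail & (RING_SZ - 1)];
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return pkt;
    }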

Here's roughly what an application looks like: receive a packet, reassemble it, run your flow table match, and from there you can trigger a bunch of actions, ACLs, and then KNI, the kernel network interface, which is an actual interface back into the kernel. Of course, as soon as you start pushing packets back into the kernel you lose a lot of the benefit of having them in user space.

There are some quick performance numbers they show, and I put this slide in to state the challenge they're trying to solve: at 40 gigabits you're looking at around 60 million packets per second, and then they show some of the latency numbers, an L3 cache hit is about 40 cycles, a memory read is about 70 nanoseconds. The point is that if you're trying to hit line rate at 64 bytes, at 40 gig now and 100 gig in the future, you're looking at really short packet arrival times, 16.8 nanoseconds at 40 gig, and so on.

This is the test setup, just another high-level picture of what they use. Basically they have two systems; they use an Ixia, which is one of these packet-generating boxes that will flood the link at line rate with 64-byte frames. They're putting in these quad-port 10 gigabit NICs, four 10G ports per card, in PCIe gen 3 slots, and when they do L2 forwarding they take traffic in one port and send it back out the other port, then count the dropped frames, or rather count how many frames make it through, and from that you get your packets per second for the forwarding test. It's a pretty straightforward benchmark, and you can do it without the hardware tester; in fact a lot of the time when I'm doing this in my own setups I don't have the fancy hardware generator, so I just use pktgen on another Linux box and that works pretty well.

These are the kinds of numbers they're seeing right now. There's actually a whole slide deck of different CPUs and different platforms, quite extensive, and in a 20-minute slot we could spend half an hour just trying to interpret it, but I'm going to work with some of the guys and see if we can get it published somewhere, so people can actually sit down and look at the values for all of these configurations. At a high level, though, they're seeing about 47, call it around 50 million packets per second doing L3 forwarding, and that's with a quad-port card using all four ports; on the gen 2 slots, I guess the max they're seeing is about 23. On the right here you have the theoretical values, so 10 gig at 64 bytes, like you said, is about 15 million packets per second, and around 60 million at 40 gig. At the bottom are the per-core numbers they're looking at: about 30 million packets per second with one core at 2.1 gigahertz at 64 bytes doing L2, and about 24 million doing layer 3. Obviously as the packets get larger, the packets per second goes down and the throughput goes up.

Oh, sorry, go ahead, just interrupt me. [Question about the L2 forwarding setup.] So the L2 forwarding in these cases, there's only one port pair; they're mapping ports back to back, so there's not really a table lookup: look at the packet, send it out, very simple.
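For reference, the 16.8 nanosecond figure is just the wire time of a minimum-size frame at 40 Gb/s: 64 bytes of frame plus 20 bytes of preamble and inter-frame gap is 84 bytes, or 672 bits, and 672 / 40e9 is about 16.8 ns. And here is a minimal sketch of what that back-to-back l2fwd-style fast path looks like, assuming DPDK-style rte_eth_rx_burst()/rte_eth_tx_burst() calls; initialization, error handling, and the real sample's port-pairing logic are omitted, and the burst size is illustrative.

    /* Sketch of an l2fwd-style fast path: busy-poll one RX queue and
     * forward each burst out the paired port. Not the actual sample code. */
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST 32

    static void forward_loop(uint16_t rx_port, uint16_t tx_port)
    {
        struct rte_mbuf *pkts[BURST];

        for (;;) {
            /* Poll-mode receive: grab up to BURST frames in one call. */
            uint16_t n = rte_eth_rx_burst(rx_port, 0, pkts, BURST);
            if (n == 0)
                continue;

            /* Back-to-back port mapping: no lookup, just send them out. */
            uint16_t sent = rte_eth_tx_burst(tx_port, 0, pkts, n);

            /* Anything the TX queue would not take has to be dropped. */
            for (uint16_t i = sent; i < n; i++)
                rte_pktmbuf_free(pkts[i]);
        }
    }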
And since it's all done in user space and you have the rings right there, it's just a matter of pushing the packets around.

So, virtualization, which I think is maybe the more interesting piece of this; I want to talk a bit about it. What they found is that they can push this stuff up into user space, into this Intel DPDK block here, the yellow block, and if they use shared memory, actually share memory with the virtualized guest, they can get basically bare-metal performance. Of course this is a horrible idea if you don't trust your guest: there's no security here, you've skipped all of the stack's security mechanisms, there's nothing in the way. At some point it's almost like running on bare metal; you're just pushing the packets straight into the VM.
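The shared-memory path is conceptually just both sides mapping the same buffers. Here is a heavily simplified host-side sketch using POSIX shared memory; the "/dpdk_guest_ring" name and the layout are made up, and the real setup shares hugepage-backed packet buffer pools with the guest rather than a plain shm object, but it shows why there is no copy across the boundary, and also why there is no isolation.

    /* Illustrative only: host maps a region the trusted guest also maps,
     * so packet data moves between them with zero copies. */
    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define SHM_SIZE (16 * 1024 * 1024)

    int main(void)
    {
        int fd = shm_open("/dpdk_guest_ring", O_CREAT | O_RDWR, 0600);
        if (fd < 0)
            return 1;
        if (ftruncate(fd, SHM_SIZE) < 0)
            return 1;

        /* Whatever the producer writes here is visible to the consumer on
         * the other side immediately; no kernel/user or host/guest copy. */
        void *base = mmap(NULL, SHM_SIZE, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
        if (base == MAP_FAILED)
            return 1;

        memcpy(base, "packet ring lives here", 23);

        munmap(base, SHM_SIZE);
        close(fd);
        return 0;
    }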

But the point there, I guess, is to illustrate that it can be done, and if you have an application where you trust the guest, which happens to be the case for a lot of comms applications, then this works. That's the first bullet there.

Bullet two is using what they call KNI, the kernel network interface. Basically they have a device driver that translates their internal buffer, which is actually a BSD-style mbuf rather than an skb, into an sk_buff and injects it into the networking stack (there's a sketch of that step further down). By doing this you can use existing kernel features, and it's one way to plug into virtualization: you get the existing stack and interfaces, but you lose the performance; at that point it's basically equivalent to not doing this at all and just using the normal kernel stack. There's some slight benefit, but not much. Then vhost-net is a vhost extension that understands their mbuf format; there's some performance benefit there too, but what they're seeing is that the bottleneck is actually in virtio and not somewhere else.

[Question: why not pass the PF straight into the VM?] That can be done, but then you've effectively bound one physical function to one VM, so you can't share that PF across more than one VM. Now, you could use something like SR-IOV, which gives you multiple virtual functions, and then do direct assignment of the virtual functions; that works as well, and you get a lot of the performance benefit because the VM is directly handling the packets. The problem with that is if you're a customer with some special software you want to run between the VM and the physical function, you no longer have anywhere to insert that logic, because the device is directly assigned to the VM. So if your Intel DPDK block there is running some library or some set of filters that you've written, then by using a VF or directly assigning a PF you've effectively lost that.

[Question.] Not really, I don't believe so, because what they're seeing now is that virtio becomes the bottleneck: you can process packets as fast as you like in the host, but you're bottlenecked at virtio. I agree, and I think the problem then is that if virtio and the virtualization infrastructure are the bottleneck, this doesn't solve that; you can make the host as efficient as you like, but you still need to remove the bottleneck at the vhost/virtio level. Maybe there are some folks here working in that space with good ideas; I've seen a few things, but I don't have too many details there.

[Question, repeated for the room: how many of these tricks and performance improvements could actually be pushed into the kernel?] I think we talked about this with someone this morning. Huge pages, I'm not sure how well that would work in the kernel, but maybe there's some way to push huge pages in there. The trick is that a lot of this is done at boot time: you isolate the CPUs and you reserve the huge page memory at boot time.
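Going back to the KNI bullet: conceptually, the kernel side of KNI just turns the user-space buffer into an sk_buff and feeds it to the normal receive path. Below is a hedged sketch of that injection step; it is illustrative only, not the actual KNI driver code, and the kni_inject() name is mine.

    /* Illustrative only: copy a frame handed up from user space into an
     * sk_buff and inject it into the regular network stack. */
    #include <linux/etherdevice.h>
    #include <linux/netdevice.h>
    #include <linux/skbuff.h>
    #include <linux/string.h>

    static int kni_inject(struct net_device *dev, const void *data,
                          unsigned int len)
    {
        struct sk_buff *skb = netdev_alloc_skb(dev, len);

        if (!skb)
            return NET_RX_DROP;

        memcpy(skb_put(skb, len), data, len);   /* copy the frame in */
        skb->protocol = eth_type_trans(skb, dev);

        /* From here on it is an ordinary packet as far as the kernel is
         * concerned, so the existing stack features all apply. */
        return netif_rx(skb);
    }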

So whether that would be a good solution for a driver-based solution, or a general-purpose solution, is debatable. On batch TX, I think that's come up a handful of times in different discussions, and I think that could be done. [Question.] Yeah, so that comes back to the VM problem: as soon as we pull this out of the kernel and put it in user mode with the Intel DPDK, we can do the copy into the VM over shared memory without ever having to do a kernel-to-user-space copy. For VMs that's where a lot of the performance comes from; it's how they get this to run at line rate with 64-byte packets into a VM. They're sharing memory, so it's effectively zero copy. [Question.] Right, there are security problems there, if that's what you're alluding to. [Remainder of the exchange inaudible.]

Oh, there are proprietary stacks that exist, and I know people are running them on top of this. I'm not familiar with Onload; is Onload just a software stack? Okay, a user-space TCP/IP stack or something. So there are proprietary TCP/IP stacks that run on top of this, and I know some people are probably doing that; I don't have the direct insight to know whether they're in kernel space. I think a lot of these people, well, that's why the shared memory is even viable, right? If you're a KVM host virtualizing guests that you don't trust, you would never do this; you would never use the shared-memory, zero-copy model. You pretty much have to be an embedded packet-crunching application of some sort to get a lot of the benefit of doing the shared memory and the CPU pinning.

So that's what the numbers are. The KNI piece is the re-injection of packets back into the kernel, and the performance numbers aren't based on that; those are all based on the shared-memory model where you copy directly. KNI is one piece of it, and I don't necessarily want to get hung up on whether it's useful or not; it's an Intel block that comes with the DPDK, it stands for kernel network interface, and the point is that you can take packets from user space and dump them into the kernel network stack. It's open source, but not upstream in the sense of being part of the Linux kernel.

[Several short follow-up questions, mostly inaudible.] Okay, I'll take this last question. Yes, so the other point on this is that it's an all-or-nothing model.

Unlike netmap, or I think even PF_RING, where you can pull certain queues out, there's no notion here of having some packets go to user mode while the main stack driver handles the rest; it's either all in user space or none of it, at least at this time. So Flow Director will work with this, but not in the sense you're thinking: all the queues are mapped into user space at this point, so you can still use Flow Director, but you're only... yeah.