014-01 – Larry Smarr – UC Berkeley Cloud Meetup 014 (May 26, 2020)

– As I was talking to Chris Hoffman earlier, who's doing some work with Research IT, I asked him if he wanted to say anything, and he said, "No, we've got to keep this absolutely short, because Larry is amazing and we really just want to hear from Larry." So I'm going to keep this super short. Larry is a physicist and an astronomer and a world leader in scientific computing. He's the founding director of the California Institute for Telecommunications and Information Technology, Cal(IT)2, a partnership between UC San Diego and UC Irvine, and he's a professor at UC San Diego's Jacobs School. A long time ago, he was the founding director of NCSA, the National Center for Supercomputing Applications at UIUC, which in my mind is always associated with Mosaic. But most importantly for tonight, he's the Principal Investigator on the NSF's Pacific Research Platform. Berkeley is a partner institution, and Camille Crittenden, I believe, is one of the Co-PIs on that, so she is joining us in the audience here tonight. I'm not exactly sure what we're going to hear about, whether it's Nautilus, which is a hypercluster for running containerized big data applications, but Larry is a visionary, a futurist, and an amazing promoter of the things that should happen. So Larry, I'm going to turn it over to you; you can go ahead and share your screen, and we are here.

– Okay, thanks a lot!
This is going to be a lot faster than I can go into details on each slide, but it's roughly 15 minutes and the slides will be available to everybody afterwards. What I'm going to talk about is how academics have effectively built a private cloud that is interconnected with all of the commercial clouds, and then I'll show examples of how we use it to solve problems that otherwise would be impossible to solve.

This all started with ESnet in 2010, coming up with the idea that on campuses you needed a separate network, called a Science DMZ, that would enable you to do high-speed movement of data, not just over the normal commercial internet we're all using. NSF adopted this and, amazingly enough, over the next quite a few years used calls for proposals to give about half a million dollars to each of these campuses to establish a Science DMZ. So obviously the next stage was to interconnect those. The way I like to think about the DMZ is like the freeway system, say, in LA; the interstate highway system is what interconnects the freeway systems of the cities. And that's what we proposed back in 2015. As you can see, Camille, who is on the Zoom with us, is a Co-PI and in fact the leader of the Science Engagement Team. Phil Papadopoulos, who was at the San Diego Supercomputer Center, is now head of research computing at Irvine; Tom DeFanti, who has been a colleague of mine for thirty-some-odd years; and Frank Wuerthwein at UCSD in Physics: these are the Co-PIs. In the proposal, we showed how we would be able not only to link together all these campuses on the map, using the CENIC optical network backplane that already connects them, but also to bring in supercomputers from LBNL, from SDSC, from NERSC, and from NASA Ames and NCAR. This was driven by over 50 different big data research projects that we had write-ups of, and over 32 CIOs and other campus IT people. So it's a huge undertaking, really.

An unexpected example of how to do things came from Berkeley. They worked with Merced and UCSD to establish virtual reality exhibits, interconnected at 10 gigabits a second, so that common datasets can show up in all of them; this was in the CITRIS Tech Museum and then later at other sites on the Berkeley campus. Again, you need these large optical pipes to interconnect. Now, those kiosks are driven by PCs, of course, which can handle 10 gigabits.
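The point about needing large optical pipes can be made concrete with back-of-the-envelope arithmetic. The sketch below is illustrative only: it assumes the link is the sole bottleneck and ignores disk speed, round-trip time, and protocol overhead.

```python
# Ideal transfer time for a dataset over a network link, assuming
# the link is the only bottleneck (no disk, protocol, or TCP effects).

def transfer_time_hours(dataset_bytes: float, link_gbps: float) -> float:
    """Hours to move `dataset_bytes` over a link of `link_gbps` gigabits/second."""
    bits = dataset_bytes * 8              # bytes -> bits
    seconds = bits / (link_gbps * 1e9)    # gigabits/s -> bits/s
    return seconds / 3600

one_terabyte = 1e12  # bytes
for gbps in (1, 10, 100):
    hours = transfer_time_hours(one_terabyte, gbps)
    print(f"1 TB at {gbps:>3} Gb/s: {hours:.3f} hours")
```

Even in this idealized case, a terabyte takes over two hours at a commodity 1 Gb/s; at the 10 and 100 gigabit rates discussed here it drops to minutes, which is why the dedicated pipes matter for moving research-scale datasets.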

And so we made a generalization of these PCs. They are typically rack-mounted, but we wanted to go to 40 or 100 gigabit links; they're just rack-mounted PCs, but they have up to 200 terabytes of rotating storage and multiple terabytes of solid-state disk, to intermediate between the flows coming through these very high-speed pipes and the rotating disk, which is a lot slower than the solid-state disk. And in the back of these you have eight slots where you can put GPU cards, gaming cards. So these things actually become small machine-learning desktop supercomputers, and they're being used very widely, as you'll see. In fact, we got a further grant that built on the PRP to add 256 of these GPUs on 10 different campuses, including Berkeley and Merced and Santa Cruz, to help people who are working on machine learning and AI algorithms. And then a third NSF grant we got extends this across the United States, using the Texas optical networks, the Great Plains Network, and on through to New York and Pennsylvania, and brings in two more supercomputer centers: the Texas Advanced Computing Center (TACC) and PSC. So this was just October of '18.

The thing that was amazing is, of course, we had this all hooked together, but it was the cloud that completely changed everything. That's because Google, which of course has a planetary computer, had to figure out how to run a billion searches a day and so forth without humans in the loop, so they adopted containers, and in 2014 they developed Kubernetes, the open-source software that runs their entire global cloud and is now a Cloud Native Computing Foundation project. Then in 2016 they made that available, and in 2017 all the cloud providers agreed to adopt the open source instead of rolling their own. By now, essentially two-thirds or three-quarters of the Fortune 500 have adopted Kubernetes as a container orchestration system. So effectively, our PRP began to act and feel, from a software point of view, exactly like the commercial clouds. And in fact, for managing the storage, we were able to use Rook under Kubernetes, which basically orchestrates Ceph, the cloud-native storage system that came out of UC Santa Cruz.

So what we decided to do was take a subset of the PRP, all of the FIONAs, all the endpoints that could run Kubernetes, and put those together into a single hypercluster, if you like, all running Kubernetes; that is what we call Nautilus. What that does is allow us to manage petabytes of distributed data across the rotating storage on these PCs, interconnected at enormous speeds, about 1,000 times the normal internet. Now, I don't want to go into any detail; again, this is just so you can have these slides later, but these are the details of all those FIONAs for Merced, San Diego State, UCSF, Stanford, and all the way around, that are part of Nautilus and are all interconnected by the CENIC 100G network. This has been extended now; first of all, the PRP is not just California. It goes up to the Pacific Northwest Gigapop, and over to Chicago to MREN and StarLight; there's probably more hundred-gig capacity coming into StarLight than anywhere else in the academic world. And on top of that, we've added in the other regional networks and Internet2, which has massive data caches in New York and Chicago and Kansas City, and all that is now interconnected by, essentially, the PRP. But it gets better. The Netherlands, through the University of Amsterdam, was a partner in our original proposal; since then
we've added in these other international partners, all interconnected at 10 to 100 gigabits. In fact, we were able to show disk-to-disk transfers at up to five gigabits a second, out of 10 gigabits a second, from KISTI in Korea to San Diego. That was impossible in the old days over long-distance, high-bandwidth, layer-3 internet paths, because TCP would back off and you'd lose the bandwidth; with the PRP it's no problem, and so we have a lot of global collaborations now. And you can see the scale of this: we have over 6,000 CPU cores now, nearly 600 GPUs, and petabytes of storage.

So what does it let you do? Well, for instance, Scott Sellars got his PhD at Irvine at the Center for Hydrometeorology and Remote Sensing, then did a post-doc at Scripps Institution of Oceanography in their Center for Water in the West. He is able to download NASA satellite imagery, do machine learning on it, and then transfer the results to Irvine and then to San Diego. It went from taking 19 days to do a single workflow without the PRP to now 52 minutes, which is over 500 times faster. And of course that changes the science totally.

Then part of the PRP, of course, is to get more people involved as we accumulate examples of how it works. At Berkeley, Camille and Chris and others, particularly Jeff Weekley, have been involved in the science engagement. This is a workshop, for instance, held in 2018 at Santa Cruz, organized by the PRP Science Engagement Team at Berkeley and co-sponsored by Merced and Santa Cruz and CITRIS. And you know, this is one of the few projects in which two of the great institutes, CITRIS and Cal(IT)2, are actually co-PIs on a project of this scale; it's probably the only one, actually, between the four institutes.

But let me give you an example of what we didn't expect to show up: IceCube. This is a South Pole neutrino observatory that NSF runs, and one of the ways you detect the neutrinos, which, you know, will go through like 18 light-years of lead without scattering once, so they're like ghost particles, is that occasionally they scatter and send off a cascade of photons, and there are all these under-ice cameras, counters, that see them. And it turns out this is great for using GPUs. They were using CPUs on the Open Science Grid, which is basically another kind of cloud, but it didn't have GPUs, which are a hundred times more effective than CPUs for this because of the parallel nature of the photon transport. What you're seeing here is the number of GPUs vertically, which goes up to 400, used by the different namespaces; each application has a namespace and is completely secure from the others. And the blue is how, in March of '19, a year ago, IceCube came out of nowhere like a horde of locusts and basically ate almost all of our GPUs, which is fine, because they weren't being used at the moment by other things. So that was a totally unexpected thing; it drove the use of GPUs up 4x in one year on the PRP.

But then it got better. That wasn't enough! For those of you who have not tried to whet the appetite of an NSF billion-dollar observatory, let me tell you, these folks have no limits to their appetite. So they said, "Well, it was great! We ate all your GPUs, three or four hundred of them, but we actually would like tens of thousands of GPUs." And so we decided, since we're using Condor, which is a way to do high-throughput computing, that we could extend off of the PRP into all three of the commercial clouds and grab all their GPUs for a single problem. And this, as far as I know, has not been done by academics before. So we went out to AWS, Azure, and Google Cloud and collected over 50,000 of their GPUs, across eight generations;
we don't care whether they're 32-bit or 64-bit, we don't care whether they're old or new; we could use them just fine. And what you can see here is over 51,000 GPUs used for about two hours, effectively a third of an exaflop-hour. So this was just an example of how, once you have interoperability between something like the PRP and the commercial clouds, you can do things that really haven't been done before. One of the things that Camille and her team have helped with, of course, as part of the outreach, is getting us a really nice website. So for those of you who want to dig in and find more, it was recently redesigned as part of what Berkeley helped UCSD with. So I'm going to stop here, I think 30 seconds short of 15 minutes, and take some questions.

– Hi, this is Jason at Berkeley, in Research IT. That's a tremendous talk. I'm just curious, I can't help but wonder: what are some things in the pipeline for this project?

– Well, we are very fortunate in that our program officer at NSF has just approved, not, I guess, officially officially, but by email, that they're going to fund an extension of one more year past the five years it was approved for originally. That's really important because, of course, the other two proposals extend beyond that five years, so there's a lot of interest. You know, if you think about the Great Plains Network, that's all the states basically north of Texas and south of North Dakota, and there are so many universities there that could be involved. Another thing, just to give an example: many of them use the San Diego Supercomputer Center or TACC or something. Well, these days supercomputers are so fast that you generate a massive amount of data that's just sitting there at the supercomputer. If the supercomputer is lightly used enough that they can let you do your science analysis of that data on their computer, well, that's fine. But in fact, they have to turn down something like half to two-thirds of the projects that want to be on the computer, so they don't have extra time. But take the case of Shaw at Santa Cruz: we had an example where he was using NERSC, helping the astronomers there. They had a 1,000-node cluster on the campus, but they didn't have this kind of connectivity. They're now up to something like 100 gigabits, so they can download three-dimensional supernovas or large astronomy surveys to their cluster and then do highly interactive science analysis. Well, I think there are so many examples like that sitting out there, but nobody seems to think it's their job to find the end users and the supercomputers and knit them together. And with people like Chris, and Shaw at NERSC for instance, we can help our colleagues there; I think there are a lot of examples out of NERSC that we could do, and the same with SDSC, the same with TACC, and so forth. So I think that's one of the use cases. I guess another way of saying it is, we're looking at the individual users to figure out, by abstraction, what the use case is, and then asking, "Well, what other folks out there have the same use case, maybe even a similar workflow, but haven't realized that they can supercharge it by a factor of 10 or 100?" And that's again what Berkeley, with Camille, is leading in the science engagement: trying to find the end users and then help them understand how they can take advantage.

– Larry, first of all, good to see you again. Secondly, so I have a-

– Greg! What are you doing here?

– Well, I-

– This is continuing adult education. (Larry laughs)

– Well, that's the job of my CTO, you know, and students. The job of the CTO and the students is to educate the professors, and as you know,
they always fail. But so, I have two quest- I have a question and a suggestion.

– Sure.

– And actually, I'll go with the suggestion first. I don't know if you've talked to the people of the NANOGrav experiment; I can provide you contact details later.

– Sure.

– It's an extremely interesting project to me, because these are people who, through timing analyses of millisecond pulsars, are looking for very low frequency, nanohertz in fact, gravitational waves.

– Sure.

– And they're convinced that there is gravitational wave data in these timing analyses. I've talked to them several times over the years, and they've told me they have a huge problem getting data from the repos at West Virginia, right?

– [Larry] Yes.

– It's got to sound familiar. And it seems to me this would be a perfect candidate for the Global Research Platform.

– Well, it depends on whether the repository is reachable by these high-speed optical networks or not.

– Cornell and West Virginia.

– Yeah, so it turns out that sometimes a place like West Virginia is a little less endowed with this sort of thing. But I'll tell you, I've got an example in South Dakota, which is part of the Great Plains Network; that's where, in Sioux Falls, the (Greg coughs) repository for the Landsat imagery is, and all of a sudden we're finding there are lots of folks in the UC system, for instance, that need Landsat data. But, you know, the new high-resolution Landsat is going to mean much larger datasets, and they're going to need much more of this kind of thing. So I'm hoping in this coming year we'll be able to hook them up. So if you want, put me in contact and we'll see what it takes to get them hooked up.

– Yeah, I will. But now I actually have a question.

– Okay.

– Is there a use case for last mile, basically, to have lightweight compute right next to relatively small datasets or a single desktop user?
I'm thinking of relatively low-data-rate or low-bandwidth sensors, but in inaccessible or hard-to-reach places. In other words, is there a place you might get to that the optical networks can't reach?

– I'm sorry to jump in, this is Jason at Berkeley. In the interest of time and for the other speakers, I just need to ask that we punt on that question for the moment; if we have time we'll come back around to it, or you guys can take it offline and chat. But-

– We actually have the High Performance Wireless Research and Education Network: all of its servers are on the PRP, and all the data is coming in from the sensors over wireless networks.

– Okay, thank you.

– I had a real quick question that came in about the sustainability of the effort: when you have the one-year extension, what are your thoughts beyond that, and what are ways that universities, or other measures, could help make it more sustainable ongoing?

– Right, well, it's not my job to do anything past the end of the sixth year. It is my job, during the 18 months left, to find folks who will carry out similar activities. But the PRP will vanish at the end of the NSF grant; that's just the way NSF does things. So the two sustainable parties are the CIOs and the heads of research computing on campuses, who I think are going to have more and more demand for this kind of thing. It's not unusual to be able to have 10, 40, 80, even 100 gigabit links across campus; they've got a lot of optical fiber, it's just not necessarily set up this way. And the other is the regional optical networks themselves, such as the Great Plains Network or LEARN in Texas, or for that matter Internet2, and Internet2 is a partner in this. So I think these are services that you can imagine something like Internet2 offering, as well as the regional optical networks, because, after all, every year they have to explain to their campuses why they should keep paying the money to Internet2 or the regional optical networks, including CENIC. And so this would be a new capability, which is all possible within their existing optical network; nobody has to lay any new fiber or anything.

– Thank you. The CIOs in the audience thank you. (Larry laughs)

– Well, without them this wouldn't have happened.