Spotify's Journey to the Cloud (Cloud Next '18)

[LOGO MUSIC PLAYING]

PETER-MARK VERWOERD: Good afternoon, everyone. Welcome to this last session of the day. Thank you for coming here and spending this time with us before you head over to have some fun at AT&T Park. What we're going to do today is talk about Spotify's journey from on-premise data centers to Google Cloud Platform. I'm Peter-Mark Verwoerd. I'm a solutions architect here at Google. Speaking with me today are Max, who's a colleague of mine from Google, and Ramon and Josh, who are both from Spotify. I will let each of those gentlemen introduce themselves when they come up to speak. The four of us were all part of a core team, composed of Googlers and Spotifiers, that made this migration a success. So we're going to cover what that was and how we did it. And to do that, I'll introduce the next speaker, Max.

[APPLAUSE]

MAX CHARAS: That's nice. [INAUDIBLE] So this presentation is about how we executed a multi-datacenter migration to GCP. We worked with over 100 teams in two countries, and we migrated over 1,200 services and 20,000 data processing jobs. This is still a very active collaboration between Google and Spotify. The presentation is split into four parts. We'll start with an introduction, where we cover the Spotify organization and how it works, but we'll also walk through the higher-level goals of the migration itself. After that, we have two main chapters, and they cover the migration: one is for the user-facing services and the other is for data processing. And after that, we have lessons learned during the migration. A note to you all, though: this migration strategy is very tailored to Spotify, both technically and organizationally. So if you wanted to do something like this yourself, it might look very different. So let's kick off the intro, and here is Ramon.

[APPLAUSE]

RAMON VAN ALTEREN: Hi. My name is Ramon van Alteren. I'm the Program Manager for the entire cloud migration at Spotify. Before we start diving into this migration, what I would like to do is take you through the two main reasons why we wanted to move to cloud. The first reason is something that applies at a global company level. If you think about the amount of effort it takes to maintain compute, storage, and network capacity for a global company that serves more than 170 million users, that's a sizable amount of work. Just think about the work it takes simply to replace the servers that you depreciate every three years. That's a lot of work to do. And if I'm really honest, what we really want to do at Spotify is be the best music service in the world, and none of that data center work actually contributes directly to that. We'd much rather spend all that focus, attention, and energy on becoming the best music service in the world, and on building features for our customers and our users and our creators that make it the best music service. So we saw a good opportunity to move some of that work over to Google, and allow our own engineers to focus on Spotify-specific work. That's one thing.

The second thing, and this helped with choosing a cloud, is that if you start analyzing this, one of the things that is really important to us is fast product development. If you start looking at the ecosystem of services that is slowly being built on top of cloud primitives, like instances, networking, and storage capacity, that's an ecosystem of services that allows you to move much faster and deliver a much better product to users, faster. Think about things like BigQuery, or Pub/Sub, or Dataflow: those are building blocks that allow us to innovate a lot faster.

So let's take a look at where we started. As Peter-Mark already mentioned, we had everything in on-premise data centers. We're known for running a large microservice architecture. At the time we started this entire process, we had about 1,200 different microservices that served our users; by now, it's several hundred more. And we had a pretty mature engineering culture around that, with lots of automation around deployments and spinning instances up and down. This was something people could do fairly easily.

We ran our service out of four data center sites in the world: one on the West Coast and one on the East Coast of the US, and two in Europe. Each of those sites is essentially a standalone copy. Most of our backend in those data centers is essentially the control plane of the service. The content delivery, the music that gets delivered to your phone and to your desktop or to your speaker, comes out of CDN networks and our own CDN, so it's not coming out of our data centers; that's coming from a separate place. And we're big on data. We're a data-informed company. We had about 20,000 data processing jobs, 2,000 unique workflows, and about 100 petabytes of data when we started, and all of that had to move to cloud. And lastly, we had about 100 teams. What's probably important to know is that each of those 100 teams owned operational responsibility for the services they developed. That means they're on call for those services, but they also own sort of the entire business problem that they're trying to solve.

Here are the three main goals we had with this migration. First, linked to the first reason, we wanted to minimize disruption to our product development. You can't stop feature development for the duration of a cloud migration, because that's completely unacceptable, both to our users and to our stakeholders. The second thing we felt was really important was speed. Hybrid states, where you're running in both worlds, are both expensive and risky. You force engineering teams to keep two cognitive models, one for on premise and one for cloud, and for everything you deploy, you need to go through the process of: does this work on cloud, does this work on premise? In incident mode, you have similar problems. The last goal we had was cleanup. This is a forcing function for your long tail, so that you don't end up with small things still running in the data center, and you make sure that you actually finish.

What might come as a big surprise to you is that most of this cloud strategy actually came from the infrastructure groups. The people maintaining those data centers, building the network, building the instances, ingesting all the hardware: those people were deeply involved in building out this cloud strategy, and we spent a lot of time prepping and involving them in that. Part of that involvement was talking about what your job looks like once we've moved to cloud. It is, to some degree, a scary thing to move from a completely owned stack to a cloud: what are you going to be doing as an infrastructure engineer? We spent a lot of time talking about how that would evolve. And the net result was that this entire group of engineers, who are some of the most deeply respected engineers inside Spotify, have a lot of contact with feature teams, and often own the larger-scale core services that we have, ended up being advocates for our cloud strategy.

So we did a big kickoff with them. We built a whole set of prototypes, and during those prototypes, we found two key blockers. One blocker was the lack of VPC when we started, which is now there and available for everyone. The other problem was related to VPN. We'll talk about these two blockers in the different migration tracks. The last thing we did was build a minimal use case, and a good one. If you want to move a multimillion-dollar company, you can't show a small prototype; you have to show you can actually viably run your business on top of cloud. So we built a minimal working prototype that played music and was capable of processing the data that comes out of that to pay out royalties.

I would like to spend a few words on dependencies. I think for every migration, but especially for cloud migrations, dependencies are important. If you look at our microservices, servicing a single customer request typically takes about 10 to 15 services. And the same is true for our data processing: a single data processing workflow will often involve multiple intermediate datasets and transformation steps across these. "Stop the world and everybody migrate" is not an option; we talked about that. So that means you need to find a strategy that allows your teams to move independently, without having to migrate all their dependencies in one go as well. We felt this was an important problem to solve first in this migration. And with that, I would like to invite Peter-Mark up here to talk about some of these strategies.

[APPLAUSE]

PETER-MARK VERWOERD: So when we talk about migration to the cloud in the abstract, it generally breaks down into two different strategies. The first strategy is lift and shift. That's when you try to keep your architecture as similar as possible to whatever you have: you minimize your changes to virtual machines, application layer, databases, network, and so on.

The upside of this is that you can focus on speed and keep complexity down while you're migrating. The downside is that you can potentially increase your costs, because your workloads can't really use the elastic design patterns or higher-level services available on Google Cloud. The other strategy is some form of a rewrite. That's where you're taking some part of your application, or potentially all of it, and completely rewriting it using more cloud-native products and architectures. The complexity of this is basically directly proportional to how complex your code base is, how well the development team knows it, and how much change you're trying to make to that application.

So as Max said before, this presentation is divided into two parts. The first part will cover the migration of the user-facing services, what we're calling the services migration. The second chapter will cover the migration of data processing workloads, the data migration. The services migration used a lift-and-shift strategy throughout, and the data migration used a hybrid rewrite and lift-and-shift approach, depending on the kind of workload. And so right now, we're going to dive into the services migration with Ramon. Thank you.

[APPLAUSE]

RAMON VAN ALTEREN: Hey, I'm back. [LAUGHS] So we talked about two blockers. Let's dive into them a little bit. The first one was VPC. There was no VPC when we started this; it's a new feature that we built collaboratively with Google, based on our feedback. What we essentially found was that if you had microservices in one project, the only way for them to talk to a microservice in another project was for the request to essentially go out to the internet and then come back in. And that didn't really work that well for us. We're a pretty latency-sensitive service; that was not acceptable. We briefly tried putting our entire 120-team engineering organization in one project, but that really didn't work either, so we quit that pretty quickly. What Google ended up stepping up to the plate and developing was VPC, which allows you to build something similar to an internal network that connects multiple of these projects, so they can cross-talk easily enough. So we ended up with a decision where we said, OK, every service is going to live in a single project. So not every team has a single project; every service has a single project. This gives teams good control over their own destiny. They get to do the things that they need to do for their service in that project, and if they shoot themselves in the foot, they shoot themselves in the foot, not the entire company. And that's important to us as well.

The second blocker that we found in this was VPN. When we started, and this is 2015-2016, we tried connecting our data centers to Google Cloud to solve the dependency problem, so you can move one service and it can still connect to its dependencies in the data center. With 1,200 microservices, that takes quite a lot of VPN bandwidth and quality. And to be honest, I think the VPN service at that time was more or less scaled for "let's connect an office with 25 developers in it," and then we showed up with four data centers and multiple tens of thousands of servers. That didn't really work that well. Again, this was a collaboration with Google where the quality was improved fairly rapidly, and we managed to get this working solidly. So we built multiple multi-gigabit network pipes between our different data centers and Google Cloud to make that dependency problem disappear.

Now, after that, we spent some time building just enough infrastructure to make it slightly easier to spin up capacity in cloud than it was on premise: slightly faster, slightly easier, same tool set. Another key realization in the services migration is that we could decouple service migration from moving our user traffic. 170-million-plus users generate a lot of traffic towards our service, and if you have to move that in one big bang as part of the migration, that's hard. You can do the work of migrating a service to cloud, and do the development work you need for that, without necessarily moving all that traffic. So we deliberately separated these two, with different roadmaps: one focused on getting all these applications, all these backend services, ready to run in Google Cloud, and a separate roadmap that allowed us to gradually connect more and more users to GCP. That allowed us to control reliability, the user experience, and our migration speed, and it allowed us to control how much traffic was flowing over these VPN links as well. The next thing we did was build a small migration team.
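
Ramon doesn't go into the mechanics of that traffic dial, but the general idea of gradually steering a configurable percentage of users to GCP while the rest stay on premise can be sketched roughly like this. This is purely illustrative, not Spotify's actual routing logic; the object and method names are made up.

```scala
// Illustrative only: send a stable subset of users to GCP based on a rollout percentage.
// Hashing the user ID keeps each user's assignment sticky as the percentage grows,
// which makes it easier to compare user experience between the two environments.
object TrafficDial {
  sealed trait Backend
  case object OnPremise extends Backend
  case object Gcp       extends Backend

  def backendFor(userId: String, gcpPercent: Int): Backend = {
    require(gcpPercent >= 0 && gcpPercent <= 100, "rollout must be 0..100")
    val bucket = ((userId.hashCode % 100) + 100) % 100   // stable value in 0..99 per user
    if (bucket < gcpPercent) Gcp else OnPremise
  }
}

// Example: at a 25% rollout, roughly a quarter of users resolve to GCP access points,
// and the same users keep resolving there as the dial is turned up.
// TrafficDial.backendFor("user-42", 25)
```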

It was a hybrid team, staffed with both Spotify engineers and Googlers. The first thing we did as preparation for running that migration team was build a visualization of the entire migration state, and I'll show that to you in a second. The second thing we did was build a completely standardized migration sprint program: how do you migrate a backend service to cloud? Well, here's a template program for how to do that. We built teaching material for common migration cases, like: how do you migrate your data store? How do you migrate a Cassandra cluster? How do you migrate a Java application? And we created a set of teaching material about Google Cloud: what is changing, and what is important for you to know?

Here's how that visualization looked. Every bubble is a system. The size of the bubble is proportional to the number of machines that the system uses, the sort of capacity that the system has. If it's red, it's in a data center; if it's green, it's been migrated to Google Cloud. This had a number of interesting side effects. For one, it saved me, as a program manager, a lot of time doing status updates about the state of our migration, because people could just go and look. That was nice. But the side effects were interesting, I think. What ended up happening was this: if you look at the visualization, those blueish circles represent teams and departments, and someone heading up one of these departments, or someone in a team, would go and hunt for these red bubbles and go, look, why is that not migrated? And that was pretty useful to us. The other thing it did was create a real sense of accomplishment in the teams that were migrating. From the start to the finish of this migration, they could directly see what impact they were having with their work. And we updated this multiple times a day, so literally, once they migrated a system, they could see the bubble go green. That helped a lot with motivation. This is how it looked by the time we stopped doing the migration sprints with backend services in May 2017, when we ended the sprint program. Someone is probably wondering about the big red bubble in the middle. That's our Hadoop cluster. We did not migrate that one.

When starting a sprint, we did two things. One, and this is a really important one, was focus. What we essentially did was negotiate with these different product departments where there was room in their product roadmap to spend one or two weeks migrating, so they would not get in trouble with their deadlines. The flip side of that was that we asked them to stop all other work: no feature work, no other things, no small things on the side, nothing. All the engineers would focus, for the duration of the sprint, only on cloud migration. The second thing we did was an intake meeting, where we covered a number of things. One was: how many systems do you own? Which ones do you want to migrate? Which ones don't you want to migrate, and why? That was a pretty interesting exercise, because in many cases people said, well, we're planning to deprecate this in three months anyway, so we can actually speed that up and do it now. So in many cases it actually reduced scope instead of increasing it. The other thing we looked at was large data storage systems. If you own multiple terabytes of data, it actually takes time to transfer that, and it's a bit silly to sit with a whole group of engineers waiting until it's transferred to the other side. So for very large systems, we would often start that transfer earlier, so we could hit the ground running when the sprint started. The last thing we did was an operational risk assessment. There is a difference between moving a Tier 3 service that does a minimal feature and degrades gracefully for our users, and a Tier 1 service where, if you fuck it up, the entire service goes down. That makes a difference, so it's important to know about it. Excuse the language.

[LAUGHTER]

So with that, I would like to ask the two people that joined me in the migration team to come and talk about this. We split the work into three big blocks, and about half of the work was actual migration work. The interesting thing was that some of the feedback we got from engineers was that it was boring. It was actually fairly simple: you just redeploy the application in cloud, and it just works. That was really nice to hear, and in many cases it took far less time than we thought it would. What we did was ask people to focus on one region where they were taking production traffic, but also spin up minimal capacity in the other regions we were using. That one region was taking enough traffic to surface any performance or reliability issues.
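
The visualization Ramon showed isn't spelled out in code in the talk, but the data model behind it is simple enough to sketch: one bubble per system, sized by machine count, colored by where it currently runs, grouped by team. The case class and field names below are made up for illustration.

```scala
// Illustrative data model for the migration "bubble" board: one bubble per system,
// radius derived from machine count, green once the system runs in GCP, grouped by team.
case class SystemBubble(name: String, team: String, machines: Int, inGcp: Boolean) {
  def color: String  = if (inGcp) "green" else "red"
  def radius: Double = math.sqrt(machines.toDouble)   // bubble area tracks machine count
}

object MigrationBoard {
  // The headline number, refreshed several times a day: share of capacity already on GCP.
  def percentMigrated(systems: Seq[SystemBubble]): Double = {
    val total = systems.map(_.machines).sum.toDouble
    if (total == 0) 0.0
    else 100.0 * systems.filter(_.inGcp).map(_.machines).sum / total
  }
}
```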

A key thing was that this is the moment when people want to fiddle with their systems and make them better. So we were pretty strict about it: no, we're not cleaning up tech [INAUDIBLE], we're migrating at this time.

PETER-MARK VERWOERD: So once the majority of the systems had been migrated, usually by the end of the first week, we would deliberately, in a controlled, recorded way, induce failures in the newly migrated systems to see how comfortable the teams were, and where they stood, in managing incidents. Basically, we just started breaking things, slowly, until they started noticing, and then we watched and recorded how they would recover. And while it was absolutely fun to break things and watch teams scramble, there were a couple of interesting things that came from that. There were a couple of obvious ones. The first was to make sure that the monitoring systems were properly extended to the new cloud deployment, and make sure that that's all working. If we started breaking something and the team didn't notice, that was obviously a big red flag that there's something they need to monitor. But I think what was really interesting was when something broke, the team noticed and recovered, and we were debriefing and talking about what we did and what the failure mode was. Teams started to look into different ways of monitoring and said, well, you know, we noticed it, but if we had been monitoring this metric instead of that one, not only would we have noticed, we'd have gotten to the root cause much more quickly. And so that kind of iterative process of deliberately breaking things really helped a lot of teams. And then finally, they ended up with a playbook that they could go to and build on, for failure modes in the cloud that they may not have had to deal with or encountered before. So that was really a great experience, to watch that discovery process and build on it as we went throughout the migration.

MAX CHARAS: Great. So we also had four educational packages, and you can see these at the bottom here. I think it's important to state that we did this for more or less every team that we encountered, so they had to go through these lectures and workshops where we discussed different topics. The first was GCP support, and that was really about how to interact with support, both during incidents as well as for more long-term development support. The small things here were really important, like seeing that people actually created support accounts and opened a case to see how it worked, so that when they had an incident in the future, everybody on the team would feel comfortable interacting with support. The second one was networking, and what we covered here was really the different semantics of networking between GCP and on-prem, because the teams were used to operating on-prem in very specific data centers. The third was a product overview, where we went through, most notably, the different database options within GCP. The goal here was to allow the different teams to feel comfortable moving to more managed solutions in the future. Now, we didn't do this during the migration itself; this was something for future purposes. And last but not least, we also covered costs, so that everybody on the team would feel comfortable monitoring and understanding the different semantics of costs on GCP.

RAMON VAN ALTEREN: So as I showed in the visualization, by May we had done the migration sprint with nearly every team in the company. And at that point, even during those migration sprints, we slowly started increasing traffic towards Google Cloud. We can control where users connect, so more and more users started connecting to GCP, and during the migration we gradually increased that. It allowed us to measure and control the user experience. This was a key thing for us, to make sure that it continued to be good. We have a reputation for carrying all the world's music around in your pocket, and if you press play, it starts. So this user quality was important for us, and we wanted to make sure that it did not degrade because of moving to cloud. This allowed us to measure that really effectively. The other thing we did was manage reliability risks with this. And the last thing is that we were able to control the number of users we put on cloud, based on the migration speed that we had: the more services we had in GCP, the more users we managed to direct there. Then, by the time we had all applications, all services, on GCP, we did a number of things while increasing more and more traffic towards 100%. The reliability program we ran with individual teams helped them reason about the reliability of their own system, but there is also an aspect of reliability for the entire service.
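
The failure-injection exercises Peter-Mark describes were run by hand in a controlled, recorded way; the talk doesn't show any tooling for them. As a rough sketch of what one such drill could look like (stop one randomly chosen instance of a service, record the time, and compare it with when the team's alerting fired), something like the following could be run, assuming the gcloud CLI is installed and authorized. The service name and zone are hypothetical.

```scala
// Illustrative failure-injection drill, not Spotify's tooling: stop a random instance
// of a service and log the timestamp, so it can be compared with when alerts fired.
import scala.sys.process._
import scala.util.Random

object FailureDrill {
  def main(args: Array[String]): Unit = {
    val zone = "us-central1-a"   // hypothetical zone where the service runs

    // List the instances of a (hypothetical) service in that zone.
    val instances = Seq("gcloud", "compute", "instances", "list",
        "--zones", zone,
        "--filter", "name~^playlist-service-",
        "--format", "value(name)").!!.trim.split("\n").filter(_.nonEmpty)
    require(instances.nonEmpty, "no instances found to break")

    val victim = instances(Random.nextInt(instances.length))
    println(s"${java.time.Instant.now}: stopping $victim in $zone")

    // The actual fault: stop one instance and watch whether monitoring catches it,
    // how long detection takes, and how the team recovers.
    Seq("gcloud", "compute", "instances", "stop", victim, "--zone", zone).!!
  }
}
```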

So one of the things we did was run company-wide reliability exercises, where we closed an entire data center, or closed an entire cloud region, and made sure that we could still provide service to our users. Just before Christmas, in December of last year, 2017, we hit 100% of users. But at that point, the first data center was already closed: we closed our San Jose data center, here on the West Coast, in October 2017. And that's a pretty strong signal for people with long-tail applications still running in data centers that they need to get a move on. There's a roadmap for the rest. We have already closed our secondary data center in the US, and there are two data centers left in Europe, which have already been scaled down significantly, but they'll close by the end of this year. This concludes our overview of the services migration. I would now like to hand over to my colleague Josh to talk about the data migration.

[APPLAUSE]

JOSH BAER: Thanks, everybody. So my name is Josh Baer. I lead the data migration, and that's what I'm going to talk about right now. Just to recap a little bit about where we were before we moved our stuff over to GCP: we called ourselves the largest Hadoop cluster in Europe, and we were able to call ourselves that because nobody ever corrected us, so it worked. We were certainly one of the largest clusters, for sure. We had about 20,000 jobs that ran every day, from around 2,000 workloads, and as Ramon mentioned at the beginning, these came from as many as 100 different teams distributed throughout Europe and the US.

This is a view of what our dependency graph looks like, all those jobs interacting with each other. And this is actually a vast simplification of our dependency tree, because if we showed the full one, it would be way too wide and way too deep to actually project, and you wouldn't see a lot of it. The crazy thing about this graphic is that each little smudge you might see there is a job. Each job has an input and each job has an output, and those inputs are the dependencies. Those dependencies could come from a totally different team than the one you work on, and could be in a different location than the one you work in. So it's fairly complex to move all those jobs over to GCP without causing downstream failures, and that was really the challenge. Overcoming this challenge through our strategy is what I'm going to talk about right now.

So you can now understand the complexity of what we had to move. One thing we could have done is tell everybody to stop everything they're doing, shut down the Hadoop cluster, copy all the data over to GCP, copy the state, and start things back up again. Unfortunately, we couldn't do this. Because, as you've seen in these graphics, even with the 160-gigabit-per-second network link that we had to Google initially, it would have taken us around two months to copy all the data we had on the cluster. And because we're a business that relies on our data to pay our labels, and we rely on our labels to provide content, and we rely on our content to provide a service for our users, we wouldn't be much of a business if we were down for two months. And since every node in the dependency tree shown earlier could represent an individual team, we had a rough job moving things without disrupting other users. So what could we do?

So our strategy was to copy lots of data, essentially to duplicate data, both on premise and on GCP. As you moved your pipeline, as you moved your job over to GCP, what you would do is copy your dependencies, all the data: you would set up a copy job to copy your dependencies over to GCP, and then you could port your job over. Then, if you had downstream consumers, you might have to copy the output of your job back to our on-premise cluster, so that the downstream consumers weren't broken. And since the bulk of our data migration lasted around 6 to 12 months, we were running a lot of these jobs to fill gaps in our dependency trees. As you can see in the graphics here, at peak we had around 80,000 jobs that copied data from on premise to the cloud, and around 30,000 that copied data back to on premise. So a lot of copies, a lot of duplicate data, and a lot of complexity.

And I should mention that these copy jobs were not exactly easy to get right. Especially early on, when we had the 160-gigabit-per-second network bandwidth, we found that when jobs were broken and had to catch up, they would saturate that network bandwidth, and then our high-priority jobs were getting blocked by jobs that maybe weren't such high priority. So we had to figure out how to get that working; that was one challenge we sorted out over time.
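
Spotify's actual copy tooling isn't shown in the talk. As a rough illustration of the pattern Josh describes (duplicate a dataset on both sides rather than move it, so downstream consumers keep working), a minimal Hadoop-based copy from on-premise HDFS to GCS might look like the sketch below. It assumes the GCS connector is on the Hadoop classpath so the gs:// scheme resolves; the bucket and dataset paths are made up, and at real scale you would use DistCp or dedicated transfer tooling rather than a single-process copy.

```scala
// A minimal one-way "copy job" sketch from on-premise HDFS to GCS (illustrative only).
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

object CopyToGcs {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    val src  = new Path("hdfs:///datasets/plays/2018-07-24")                // hypothetical dataset
    val dst  = new Path("gs://example-migration-bucket/plays/2018-07-24")   // hypothetical bucket

    val srcFs: FileSystem = src.getFileSystem(conf)
    val dstFs: FileSystem = dst.getFileSystem(conf)   // resolved by the GCS connector

    // Copy without deleting the source: the on-premise copy keeps feeding
    // downstream consumers that have not migrated yet.
    FileUtil.copy(srcFs, src, dstFs, dst, /* deleteSource = */ false, conf)

    // A real copy job would also publish a success marker so schedulers on the
    // GCS side know the dataset is complete before depending on it.
    dstFs.create(new Path(dst, "_SUCCESS")).close()
  }
}
```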

And we learned that it's massively important in this transition, when you're in the hybrid state, to over-provision your network bandwidth. Another problem that we ran into, or another lesson that we learned, was to avoid using the VPN whenever possible in these copy jobs. Fortunately, GCS has application-layer security, so you really don't need to use the VPN to copy data back and forth.

So while we still couldn't tell everybody to stop everything they're doing and move everything to the cloud, what we could do is say: stop starting anything new on our on-premise cluster, and start writing all the new things on GCP. The benefit of that is that there's some natural attrition as people rewrite or replace their jobs; they'll just kill off their old ones on the on-premise cluster. But what about the jobs that already existed and weren't being replaced? They had two paths to choose from, as was mentioned earlier, and that's what I'm going to highlight.

The first path was called the forklifting path. Pure lift and shift wouldn't work for us, and there are a few reasons for that. The first is that we really wanted to move away from this setup where we had one single Hadoop cluster that ran all our data jobs, and one team that managed that Hadoop cluster and bore all the operational responsibility for whether it was up or down. We also had to adapt the assumptions we had made on the on-premise cluster to the reality of GCP. For example, our orchestration framework really depended on atomic moves, and GCS doesn't have that same atomicity, so we had to build some tooling to bridge that gap. So at the start of the migration, we actually spent a lot of time building infrastructure and tooling to make this easier for our teams, and I want to highlight a few of those tools that we built along the way. I should mention that all of these are open source, so you can check them out, and maybe they will be useful to you.

The first tool that I want to mention is Styx. In our previous on-premise world, we had the concept of these long-lived edge nodes that launched jobs. We wanted to move to a world where you could just containerize your job in Docker and use Kubernetes to schedule it. But we had to build a scheduler for that, and that's what Styx is. In addition to being more fault tolerant, this allows us to be very flexible across the cloud and very flexible with how we use our resources, and we don't have these really hand-tuned edge nodes anymore.

The second tool that I want to highlight is something called Spydra. In a few words, Spydra is Hadoop cluster as a service. Now, you might be thinking: GCP already offers Hadoop cluster as a service. It's Dataproc, right? That's true. But what we wanted to do was go one level of abstraction higher, so that our users really didn't have to worry about learning a new tool or learning how Dataproc works. They just write their job, and we take care of spinning up and tearing down the clusters ourselves. And we take care of the surrounding tooling, like centralizing the logging, so that if their job fails, they can still figure out what's wrong with it. So this was great, troubleshooting was still easy, and there was no need for that centralized infrastructure team and no need for that centralized big main Hadoop cluster.

So the forklifting path was primarily the path that teams took if they had a lot of other priorities and couldn't take the time to rewrite. But I'm going to talk about the rewriting path now, and that was really advantageous for some other teams. In this path, you would rewrite your job using either BigQuery or a tool called Scio, which I'll talk about in a second, which is essentially Dataflow. The benefit of this is that it was massively exciting for all our engineers, right? One of the real advantages of moving to GCP is that our engineers could go and play with the latest and greatest tech, the stuff we had read about in papers and had always wanted to try out. So this was exciting for them. And the benefit for the teams is that this path is fully managed, too, so they don't even have to worry about spinning up and tearing down these Hadoop clusters; we can just outsource that to Google, which provides the managed infrastructure for us. This was also useful for teams that didn't really feel comfortable just porting their jobs over using the lift and shift, the forklifting path. They might not feel comfortable because maybe they inherited these data jobs somewhere in the past and had never actually looked at them, so they wanted to dig into them a little bit. And while they were digging into them, they might as well rewrite them. So like the forklifting path, we ended up creating some tooling to support our users on this path.
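
Spydra is on GitHub if you want the real thing; the talk only describes the idea. The ephemeral-cluster pattern it wraps (spin up a short-lived Dataproc cluster, run one job, always tear the cluster down) can be sketched roughly as below, here by shelling out to the gcloud CLI. The cluster name, region, bucket, and jar are hypothetical, and this is not Spydra's actual code.

```scala
// Rough sketch of the ephemeral Dataproc pattern: one short-lived cluster per job.
// Assumes the gcloud CLI is installed and authorized for the active project.
import scala.sys.process._

object EphemeralCluster {
  def main(args: Array[String]): Unit = {
    val region  = "europe-west1"
    val cluster = s"ephemeral-${System.currentTimeMillis}"

    // 1. Spin up a small cluster just for this job.
    Seq("gcloud", "dataproc", "clusters", "create", cluster,
        "--region", region, "--num-workers", "2").!!

    try {
      // 2. Submit the team's existing Hadoop job unchanged (the "forklifting" case).
      Seq("gcloud", "dataproc", "jobs", "submit", "hadoop",
          "--cluster", cluster, "--region", region,
          "--jar", "gs://example-bucket/jobs/my-job.jar",
          "--", "gs://example-bucket/input", "gs://example-bucket/output").!!
    } finally {
      // 3. Always tear the cluster down, so nobody operates a long-lived Hadoop cluster.
      Seq("gcloud", "dataproc", "clusters", "delete", cluster,
          "--region", region, "--quiet").!!
    }
  }
}
```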

The main tool that I'd like to highlight, which is also open source (you should go check it out on GitHub), is a tool called Scio. Scio is essentially a library wrapped on top of Beam, or Dataflow, which simplified some of the less familiar parts of Dataflow for our data engineers. It's a bit closer to Scalding, something that we used pretty extensively in our Hadoop world. And the great additional benefit is that it's got really tight integration with BigQuery. So if there are just a few specific columns that you want to process, you can use BigQuery to get those columns and then use Scio to process them. It's been really successful, and I'd highly recommend you check it out.

But this path, too, wasn't without its challenges. The biggest thing is that it required a much larger time investment from teams, right? They're not just porting their jobs over, they're actually rewriting them. And as engineers, what usually happens when you start to rewrite something is you say: hm, that thing I did, what was I thinking two years ago when I wrote that? I want to re-architect it, too. So when teams told me that they would rewrite their jobs in around two weeks, it usually lasted a month. And when teams told me it was a month, it was usually three months. That leads into this quote here, Hofstadter's Law: it always took longer, even when taking that law into account. So towards the middle and end of our migration, we actually had to tell people to avoid the rewrite path altogether: migrate your stuff using the forklifting path, and then, if you really do want to rewrite or re-architect it, it's already all on GCP. That way we could still hit our migration targets.

So that's our strategy. I want to quickly highlight the tactics that we used to move teams over. At this point, we've provided the tooling for the teams to forklift their jobs over, and we've provided libraries that they could use to rewrite their jobs on GCP natively. But for other teams, the reality was still that getting up to speed with these new tools was a challenge. These were new things, and they didn't really have capacity to learn new things, or maybe they had specific edge cases that the tools didn't cover. So to solve that, we decided to create temporary teams. These temporary teams were composed of both infrastructure engineers and people from the feature teams, the people that understood the business logic. The benefit of having our infrastructure engineers is that they're the people that created the tools, so they understand really well how to adapt jobs to get them to run on GCP. The benefit of having the feature teams work on it is that they understood the business logic. They were especially important because they would own the jobs once they were on GCP, and they would have to understand how their logic worked. So this task force lasted around one to two weeks of highly focused sprints, with one main goal: take all your jobs, port them over to GCP, and, oh yeah, don't break your downstream consumers. Use those copy jobs.

So that was the tactics and the strategy. I'm going to quickly highlight what our data stack looks like now on top of GCP. The first tool that I'd like to highlight is BigQuery. Maybe some of you saw it: Nicole, who is in the audience over there, did a talk yesterday on how we use BigQuery at Spotify. When we started seeing BigQuery used, it really underscored a powerful explosion in adoption, by lowering the friction for asking and answering questions. As you can see, we now have around 25% of Spotify's employees using BigQuery. Before the migration, I think we had around 50 monthly active users for Hive who were asking and answering questions that way [INAUDIBLE].

Another tool that I would like to highlight is Pub/Sub. Pub/Sub feeds both our batch and our real-time data processing jobs, and the great thing about Pub/Sub is that it has significantly lowered our operational costs compared to before the migration. Then I'd like to highlight both Dataflow and Dataproc. We're heavy users of both. For Dataflow in particular, I think a few months ago I talked to the PM, and we were the largest Dataflow user; I think we're still at least one of the largest now. We run around 5,000 jobs per day, and almost all of these jobs are using Scio, which our engineers absolutely love to use.
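
Scio itself is open source (spotify/scio on GitHub). The talk doesn't show code, but a minimal Scio pipeline of the kind Josh describes running on Dataflow looks roughly like this. The input and output paths are made up, and the call that finalizes the context has changed name across Scio versions, so treat this as a sketch rather than a copy-paste recipe.

```scala
// Minimal Scio pipeline sketch: count plays per track from tab-separated records in GCS
// and write the counts back to GCS. Run on Dataflow by passing --runner=DataflowRunner
// plus the usual project/region options on the command line.
import com.spotify.scio._

object PlayCounts {
  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, args) = ContextAndArgs(cmdlineArgs)

    sc.textFile(args.getOrElse("input", "gs://example-bucket/plays/2018-07-24/*"))
      .map(_.split("\t"))                           // userId \t trackId \t timestamp
      .collect { case Array(_, trackId, _) => trackId }
      .countByValue                                  // SCollection[(String, Long)]
      .map { case (trackId, n) => s"$trackId\t$n" }
      .saveAsTextFile(args.getOrElse("output", "gs://example-bucket/play-counts"))

    sc.close()   // newer Scio versions use sc.run().waitUntilFinish() instead
  }
}
```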

So you might be thinking: what happened to Europe's largest Hadoop cluster, right? What did we do with that? At the end of Q1 this year, we held a big gathering in our Stockholm office, in the cafeteria on the top floor, and we had one of the first engineers that ever ran a Hadoop job at Spotify press a big red button, shutting down the power to our Hadoop cluster in our London data center. And the crowd sat there and watched a live feed of our data center. The first thing they noticed was that all the whirring and buzzing of fans in the data center immediately went silent. It was crazy. And then, one by one, you'd see all these blinking lights stop blinking, shut down, and go dark. That really signaled the end of our life as an on-premise Hadoop user, and the start of our life as fully native on GCP. And to complete that day's celebration, we had an "Office Space"-style dismantling of one of our name nodes that was used early on. As you can see there, it really just let the steam out and marked that transition. So that concludes the migration. We now want to bring everybody on stage and highlight a few of the most important lessons that we learned throughout the process.

[APPLAUSE]

MAX CHARAS: Great. So the first lesson is: prepare. I think it's important to state here that we prepared for probably two years before actually doing the migration, and each migration took roughly a year. So we spent a lot of time preparing. What we really did was build, as you can see here, a minimal use case that showed the benefits of moving to GCP. But when we say minimal here, it's important to say that it wasn't a small thing. It had to be an important feature for the entire business, to show the true value of moving over. For each of these minimal use cases, we refined the model for moving over, so over time it became better and better. One thing that's also important to note is that your early adopters, if you decide to do this, are super important, because they will be your developer advocates later on in the organization as your migration progresses. Great, that was the first lesson: prepare.

RAMON VAN ALTEREN: I'll take the second lesson. The second lesson is focus. It is truly amazing what you can do with a team of engineers if you focus on a single thing. We literally had sprints that lasted a week where we moved somewhere between 50 and 70 services. It is amazing. Do that; it helps a lot. It will also help your business stakeholders: they'll be a lot happier with a short period of no product development than with an extended one. And if you try to migrate while juggling all the other priorities you have within the business, your migration speed will slow to a crawl, and that will make this a far more difficult process than it has to be.

PETER-MARK VERWOERD: The third lesson is creating a migration team. Both of the two paths, the services and the data migration, used a migration team to help enable the teams actually doing the migrating and to accelerate that process. The way we looked at it was that we would go into teams and act as guardrails for them. The teams that maintain the services know the code and know the services better than we do. But we came in and said: you know what, we're here to make sure that you know what you need to know. We can pass on past experience, past learnings, and make this as efficient as possible. We're here to be the resources that you need, whether you need to talk about the new libraries you're using for the data migration, or maybe talk to somebody at Google because you're curious about how something else works. And what was great about that is that we got to carry the experience from each migration from team to team to team. So you're not doing, say, five teams migrating in isolation, where team one and team four run across the same roadblock and spend the same amount of time solving the same problem. What we could do is take that experience and carry it forward, carry it throughout. So each sequential migration actually improved on the previous one, in terms of what we could carry through. And that was very, very helpful, on both sides of those migrations, in being able to move those teams forward.

JOSH BAER: And the last lesson that we wanted to highlight is to get out of hybrid as fast as you can. In the data migration, as we mentioned, all these copy jobs back and forth between clusters were both expensive, in terms of network bandwidth, and complex, in that people had to figure out where the jobs were actually running and where the dependencies were. It was also a problem in the services migration, as Ramon highlighted early on. When you were in incident mode, you had to figure out: where is the problem? Is it in the on-premise tooling? Is it in the GCP tooling?

Or maybe is it in the glue in between? Yeah. And this also, as Ramon mentioned, adds cognitive overhead for your teams. When you want to focus purely on the cloud world, that's not something you want to have to think about. So if you move to the cloud as fast as possible, you can just focus on being cloud native. So that concludes the presentation. We'll now take questions, I think, using the mics on the side. Thank you all.

[APPLAUSE]

[LOGO MUSIC PLAYING]