Vintage Rundeck: Salesforce's Alan Caudill presenting at the 2013 San Francisco Rundeck Meetup

All right, cool. We introduced ourselves earlier, so let's see. A little bit of what I was going to cover: there was talk about a number of different plug-in points, so I was going to cover a little bit of our solution, how we plug in to some of those various points, and how we use Rundeck.

So I'm going to introduce what Gigantor is and talk about some of the components, Salt and Kingpin being notable ones. We made some investment in security, applying security from our existing infrastructure on top of Rundeck, so I'll cover how we did that. Then I'll go into the known unknowns, future directions, whatever you want to call it, some of which were honestly touched on already. Hopefully for some of these you'll just tell us that they are now known knowns.

So, Gigantor. This is our internal project. As Phil mentioned, we spent a while researching various technologies before we decided where we were. This is really our orchestration solution. We have a ton of in-house tooling; we've been running our business on it for a decade, and it does a lot of things. I think the points you touched on are very much why we selected Rundeck: things like loosely coupled components and having a higher-order language that can call out, so we can actually do the transition with existing tooling, stabilize some of these workflows, make them repeatable, check them into source control, test them, and do a number of other things that are provided.

We have a lot of operators going into the data center to do things, and one of our larger visions that we're hoping to realize is that everything goes through a software development lifecycle. Right now we have some good tooling where people log in and execute things, and we want to move away from that, but we also want people to have an easy way to make changes in production globally, repeatably, auditably, without
just logging in themselves and fixing it on the one box. As mentioned, it sits atop other tooling. So that's the motivation for what we're trying to build. This talk is going to move pretty quickly, so feel free to jump in with questions.

So, components. Rundeck, notably, and I'm not going to talk too much about that because hopefully we all know what it is, but please ask questions as they come up. Then Kingpin and Salt. We'll talk about Salt first. SaltStack is our distributed execution engine. As you mentioned, Rundeck has the default parallel SSH, and it's got a plug-in for SaltStack; I'll talk about that in a minute. The way we were looking at it is, we have tens of thousands of nodes right now, and we are working out how to get to hundreds of thousands of nodes, and how you can execute workflows across that reliably and get them distributed. Salt has a pretty good solution for that. If you want to drill in, Phil can go into some of the history of how that particular decision was made. We've actually got all of this integrated right now. I don't have a demo for you, but I have a couple of screenshots.

Kingpin is our front end. Going back to that little box Alex had with all the plug-in areas, we also did the custom front end. Right now this is our human interface. A lot of what we want Kingpin to be comes from wanting to automate our data center more, so we're definitely thinking ahead with APIs. Right now it's all manual: a user gets in and executes at the command line. We're going to pull that into a workflow where the user logs on to a web UI and does it. Going forward we want to integrate that with, potentially, monitoring, so we can short-circuit the obvious paths: a server's down, I want to restart it, I don't need a person, I just need an audit trail on that. So we're doing a lot of APIs around Kingpin, and Kingpin is definitely wrapped around the
Rundeck API, and it itself is going to expose APIs as well. And a lot of it, and the new Rundeck UI was really nice for this, is that we do want constrained execution. We want to make sure that not only are the people who log in constrained to the authorizations we want them to have, but that it's presented in the language that makes sense to them. We have different people doing different things in the data center. He showed all the groups, the different islands, and the expertise that people have; they want to see things in their language. That's why we opted for a custom front end.

And then Rundeck, of course, sits in the middle. It's the workflow engine, it's the best one we found, which is why we selected it, and it's the system of record. These other things, Salt and Kingpin, sit around it, but what's going on in Rundeck is our system of record. That's where I want to see what happened; that's where we're going to do audits. I'll talk a little bit later about some things that we were hoping to figure out to make that nice and reliable too.

So, Salt. I think it already came up that someone on our team developed a Salt node step plugin. The GitHub link is there, and you can find it pretty easily if you search for Rundeck Salt on GitHub. When you put this in, you get a couple of options, you fill those out, and this will actually execute out through the Salt API, and then you're into the Salt ecosystem. We have some ideas about Salt syndics and distribution going forward, but at the simplest level this bridges the two. One of the interesting things is that when you call out to Salt, the plugin can sit there monitoring Salt; it knows how to poll for which nodes the job was distributed to. There's plenty of documentation on the plugin as well.

So, Kingpin. We talked about how you know what nodes to target and how that works. With Kingpin we've built a front end; the thing on the right is Kingpin, the thing on the left is Rundeck. When you do node targeting, you do that within Kingpin. You have additional concepts besides just nodes. For instance, the top thing there is data center. Because there's a large number of nodes, a list isn't really going to work, so we have them broken down into Salesforce-relevant concepts. We have data centers, of course; pod is another construct that we use. Some workflows can target a pod, but if you want to target an individual node, an individual host, you can still go through data center and pod, get the relevant list of hosts there, and then those are your targets. Some of what we've done here is custom naming that the Kingpin UI picks up. You can't read it here, but that says data
center, and that says, I think, superpod. The idea is that the Kingpin UI knows how to reach out to our inventory system. When it says this workflow requires a data center object, it makes the right call out to the inventory system and puts up the relevant list of data centers based on your selection there. It subsequently says, oh, superpods, so it filters this selection based on that selection. Those are just some of the keywords we've built in so that, again, when our operators go in, they're seeing data that's relevant to what they're trying to do, in a language they're familiar with.

On what I said before about where the resources come from: we actually expose an API endpoint on Kingpin. When Rundeck dynamically asks, what are my available nodes, Kingpin is invoking the Rundeck API, which is checking the nodes against the Kingpin-exposed API. Right now that's just a really convenient way for us to tie into our inventory data, keeping the loop just between Kingpin and Rundeck. It gives us a lot of options going forward. If you wanted to target a hundred thousand nodes, we don't want to be blasting a hundred thousand nodes back to Rundeck to validate. We can know what it's calling back on, filter that list, and do caching and all sorts of other things.

So, security. Of course, we already have a data center full of operators and release engineers, and we already have security models in place that we need to build on top of. Both Rundeck and Salt can use an LDAP plugin for authentication and authorization. Kerberos has been in our data centers; that's how our operators authenticate. So Kingpin has a plugin into Kerberos. The operator logs in to Kingpin and authenticates directly with Kerberos, we identify who he is, and in doing so we create a dynamic, ephemeral set of credentials in LDAP. So we basically
generate a user on the fly. While you're logged in to Kingpin you get a set of credentials, and then we use that set of credentials to auth into Rundeck. Again, that connection is really through the API; the users don't ever see it. But that way, when Rundeck checks whether this is a valid user, it's checking against a set of credentials that were dynamically generated, basically for that session. They're very ephemeral; they disappear. So we have a really nice way to handle Rundeck and SaltStack with dynamic users. We don't have to provision them separately: we provision in Kerberos and it gets provisioned through to Rundeck.

Additionally, Salt has its own set of considerations. The great thing about the Rundeck API is that once you've authed to it, you've got your token for however long you've configured. Of course, there are a lot of things going on in Salt that may need to authorize back, depending on what the workflow is, so those credentials may need to be more long-lived. Kingpin itself monitors Rundeck: when a user logs off, it removes the credentials, and it monitors how long runs are going, to make sure that the credentials are still available to Salt. So if that Salt step at step seven needs to reauthorize, those are still valid credentials at that time. So that's the security model.

So, open issues. There's a whole bunch of them that we've identified, some of which we don't have any thinking around yet, but that we still feel we need to solve.

The first is HA, DR, and recovery. Right now we have a really great answer for Rundeck within the data center; we've got a lot of proofs of concept of things working in that data center. Were it to go away, how do we transparently switch over? The more we pile into Gigantor, the more we're putting on top of Rundeck, and especially the more we automate, we can't just have someone get a text and do a VIP switch; even that is probably too slow. We also have to make sure that the job you just kicked off is immediately transferred over into a store that's available in another data center or another area.

Streaming progress output. The way we have the Salt step working right now, we get streaming output from Rundeck step by step in the workflow, but once we call out to Salt it kind of ends there. We know that Salt is executing, but we're not getting the log output of any operations that it executes.

Dashboards. I think you pointed out in your talk that it's not just
the operators that want to know what's going on. Obviously during releases all the developers want to know, and these may not even be the developers who are writing the workflows themselves; it's really everyone throughout the company, especially since we have a whole release management process. How do we keep the security of the operators being within the data center and secured, but then expose it in a meaningful way, especially across a large number of nodes? That's one thing we're thinking about.

Workflow contribution is another one. We're building a platform, in a sense, so we want to empower everyone throughout the organization to build on top of that platform, but at the same time we don't want to make it insecure or uncertain as they're doing it. We don't want something to get into production without testing and verification and have an operator say, oh great, here's this new workflow, let me hit that right now, when it wasn't vetted. What might have worked in your test environment may not actually be safe for production. It needs to be vetted, and how do we do that across a large organization? Not just across different products: we do acquisitions, and they may not have a full understanding of how things operate within our data center.

Breakpointing came up as one. When we have a workflow that's well defined, but there's an intermediate step that requires going out to another system that may not have an API endpoint, or just some form of human interaction, how do we do that?

Workflow composition, partial failure, and rollback. I'd love to hear more thoughts from you guys on this. If I run a workflow and something fails in the middle, I can rerun the workflow, but that of course requires that the workflow pays attention to the fact that I completed three of my seven steps. Well, you know, do
your workflows end up with a ton of check code? Is that the right way to do it, or is there a better pattern to leverage here?

Reusable workflow components. I think you touched on this as well. We're going to have, I think we're targeting, thousands of workflows. How do we allow people to build workflows of increasing complexity? Maybe it's interfacing with a load balancer or checking the status of a metric, and that might be a very simple workflow that you can execute, and then of course you can compose these into more and more complex ones. So what's the right way to encode those in Rundeck? One of the things we weren't sure about is the right way to bubble up options. If you have a workflow that has two options, and it's composed into a workflow that has its own option, then as it goes up the chain, how do you do that intelligently?
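A minimal sketch of the option-bubbling question, assuming a hypothetical `Job` model (these names are illustrative and are not Rundeck's API). Each parent job declares which of its own options feeds each child option; one helper reports child options that nothing feeds, and another traces the concrete value every job in the tree would receive.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Job:
    """Hypothetical job model: declared options plus child-job references."""
    name: str
    options: set[str]
    # Each child is (child job, mapping of child option -> parent option).
    children: list[tuple[Job, dict[str, str]]] = field(default_factory=list)


def unbound_options(job: Job) -> list[str]:
    """Return child options that no parent option feeds, as 'parent/child.option'."""
    problems = []
    for child, mapping in job.children:
        for opt in sorted(child.options):
            if mapping.get(opt) not in job.options:
                problems.append(f"{job.name}/{child.name}.{opt}")
        problems.extend(unbound_options(child))
    return problems


def resolve(job: Job, values: dict[str, str], path: str = "") -> dict[str, str]:
    """Trace the value each job in the tree receives for each of its options.

    Assumes unbound_options(job) is empty; otherwise a KeyError is raised.
    """
    here = f"{path}/{job.name}" if path else job.name
    resolved = {f"{here}.{opt}": values[opt] for opt in sorted(job.options)}
    for child, mapping in job.children:
        child_values = {opt: values[src] for opt, src in mapping.items()}
        resolved.update(resolve(child, child_values, here))
    return resolved


# Two simple jobs composed into a larger one.
drain = Job("drain_lb", {"pool", "host"})
restart = Job("restart_app", {"host"})
deploy = Job("deploy", {"target", "pool"}, [
    (drain, {"pool": "pool", "host": "target"}),
    (restart, {"host": "target"}),
])
```

Running `unbound_options(deploy)` before execution, and comparing the output of `resolve(deploy, {...})` across the tree, is one way to check that a value supplied at the top is the same value every downstream job sees.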

How do you know that the value that's specified here actually flows through as the same value in all these downstream workflows?

And then asynchronous callbacks for Salt operations. If we really do want to run something across a hundred thousand nodes: right now we're polling the Salt masters for what's going on. Is that the right pattern? How does that scale? Or is there a way we can throw something out onto a queue and then have Rundeck take it back?

Thank you. Again, like I said, for many of these things we have thoughts, and for some of them the answers might be fairly apparent, because this is just where our heads are. I forgot my thank-you slide, but that's pretty much it. Anyone have any questions about these points or any of the earlier stuff?

On the HA/DR recovery work: a very large Wall Street bank has actually set up a Tomcat cluster in four data centers, and they're running it hot everywhere. What they've done is, they've got Oracle RAC on the back end and they wrote some custom Tomcat session sharing for Rundeck. They haven't contributed it back to us yet, but they have it working. They're a global company, though we can't say who they are. So it looks like a viable solution, and we want to basically make an implementation of it. So that's encouraging. They had a sort of performance [unclear], but they had an HA solution. [unclear]

The other thing was, I know what you mean by trickling down the options to the various layers, because I've been building more plugins for various customer cases or just experimental cases, and I've found the same thing. Plugins are so much like jobs that they almost, I don't know yet, from a design
standpoint there's so much overlap there, and there's finding the mapping, so that what I call it up here makes sense in the end user's context. There's an input that this plugin needs that makes sense from the plugin's perspective, but how do you map these things down, especially across steps? I've gotten into that problem.

The other thing: I was noticing in one of your screenshots, the one with the credentials, the previous one, that one there. I noticed it had the Salt API auth and all that stuff. The person that writes a job wouldn't necessarily need to know those things, right? The person that's managing the Salt layer knows, but the guy who writes the job doesn't necessarily need to know.

Yes, you're right, they're not going to know that.

So this is a place where plugins can be pretty useful, because you can hide a lot of that kind of configuration data in the Rundeck configuration. You can push all that configuration stuff into Rundeck, and instead of having a job, this could be a place where a plugin makes sense. The plugin would look in the Rundeck configuration, get that info, and only present whatever options the guy who writes the job needs.

I'm not the guy who wrote this plugin, the Salt step plugin, but we could certainly abstract that out if that's an option in plugin development. We'd need to talk to the developer about this. For sure, the developers actually writing the workflow, the job, there's no way they would even know these things. Number one, as I already pointed out, is dynamic, and the other two you would have no way of knowing as a workflow author, nor should you.

One of the things we've been kicking around is, do we need a DSL for actually composing these jobs? We've worked on some custom user interface to create the jobs as well; we've definitely done a bunch of stuff with the Rundeck UI. But we're kicking around whether there are some higher-level concepts that can effectively compile into the underlying workflow.

Yeah, I feel the same way. It feels like a lot of boilerplate. Which is also nice, because as you add new features you go back and change your compiler, and the DSL still lives.

Any others that you've already solved, that you can tell us about? That was our number one hope in covering these: that you'd say these are solved problems.

We've talked about a library of plugins. I've been getting deeper and deeper into plugins, and I think that's a new avenue for creating reusable tasks. I've been looking at other job automation systems where all these tasks are in a huge library, and they're all single-step items, which is nice for the guy who writes the job because they're just looking for functionality. There are two levels to how you define the functionality. One is a process, and the other is these little point tasks, like bring up this or check that; the granularity of them is whatever you can imagine. So you have these lower-level libraries, which is the plugin system, I think, and then there's the process reuse, which we call a job or a workflow; I think that's a solution type of reuse. That's where our thinking is: job libraries. And talking to other users, it feels like there's a convergence.

I think it was 1.5 or 1.6 where we added the error handler, which is
in just this area, so I've been starting to use the error handler. I was hoping to have a demo of it today for rollback, because I've been using it in a simpler sense: try to do something in a single step in a job, and when it fails, the error handler tries to redo it or bail out. But the demo I actually want to do is a rollback of a botched deployment: you start a deployment, something goes wrong, and then the higher-level rollback rolls everything back to the last good state. So I think there's a primitive there, but I don't think there are good examples, and maybe there's some gap still to do that well.

We were just joking about that today: what happens when your rollback fails, do you roll back the rollback? It's turtles all the way down. I think with some things, step one is remove access, and then restore access.

Well, I think that's where you're going to need a breakpoint too. In some cases you might say, I could possibly do a theoretical rollback, but there's a run-home-to-mama button in there. That's a good one.

I've been playing with the conditional logic in there and it's nice; you can do basic logic with it. But sometimes, aesthetically, it's painful seeing that error message show up in the logs when you're actually handling it. Is it an error state? It's just a decision state.

I think we should fix that. I know exactly what you mean. I use this pattern, and we call it the assert pattern: assert the condition I want, so, assert that it's running, and then I have an error handler so that if it's not running, it tries to start it. It's a pretty cool little pattern, but it's not cool when the assert doesn't pass the condition that it's running and it shows up as an error. No, no, this is okay, this is my logic there.

Exactly. So I think we should just have something like a checkbox that
says this is not an error, right? Like, don't alert on it as an error. Can we make it a notice, or just call it a conditional, and have a button to say handle this as an error?

So I'm curious about visualizing the output, or getting the output from Salt and other tools. We talked to someone recently who was a heavy Chef user, and he was like, yeah, I've got my automation in hand, but having that visualization layer to show people what's going on under the covers was something he wanted out of Rundeck. He didn't know exactly how he wanted it, but that was where he wanted to go. So what are your thoughts, especially related to the Salt work?

Where we're at right now, I think for us my comment here is more around not so much that this step failed, but just, how's my deployment going? So, much higher-level things. There's the other level of abstraction, of course, which is the guy who clicked the button: he wants to see every single step, and when it fails he needs to be able to go investigate what's going on. Right now we're pretty much taking the output straight out of Rundeck and re-presenting it in our own UI. It is too verbose right now. Even in our trivial proof-of-concept demos it's like, all right, this is a little bit too much. Some of the demo you showed, where each step had a title and you could flip back and forth between the execution and the definition, I thought that looked pretty good. It looked very intuitive, and I assume there's a way you can drill down into a single node, so you could just watch that node: it's on step three, it's on step four.

That's only a baby step. We actually got some good sponsored development on this. The users there, their end users, aren't really technical, and they just want to know: did it work or not? They don't even want to know a very low level of
what worked or not, either. They want a very rolled-up view: there are these rough steps, and it's like spinning, check, next step, spinning, check. That's all they want, and they want to be able to drill down from there. So that's really what's going to happen next, and that was just a step in that direction.

Good. Another view that I thought would be useful: once you get to a single node, I think it's really clear. You just want to see green, green, green, red, and then drill down. But if I have a hundred or two hundred nodes, red/green isn't really enough, because if they're all failing on step six, I care about that one thing, but if they're failing in various ways, what's a good presentation for that? Just a thought we've had. We took a couple of passes at it in the mock-ups.

So it's one of those cases of, what's the logical step across all the nodes that they all seem to fail on, or some fail on?

Yeah, that's what I mean by aggregating the results. One way would be to just show the differences: if you've got 500 nodes, just show me the things that were different. Just show me the outliers; don't show me all the ones that were okay.

Maybe it's obvious, but this is where we really want to push in the future, or at least in the next major development effort: how do you handle these output streams? Because this is all about self-service. How do you make the automator look awesome, so they can turn around to their boss, or to many people, and give them that view? That's really what we want to push. So any recommendations or ideas or contributions, whatever, would be perfect.

Yeah, I think that's where we're at: we have a lot of attention during releases. I loved your line that deployment problems are only the canary for everything before it.
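The "just show me the outliers" idea from the discussion above can be sketched as follows. This is only an illustration under stated assumptions: the input shape (node name mapped to the step number it failed on, or `None` on success) and both function names are hypothetical and are not part of Rundeck or Salt.

```python
from __future__ import annotations

from collections import Counter, defaultdict


def summarize_failures(results: dict[str, int | None]) -> dict:
    """Group failed nodes by the step they failed on, largest group first.

    `results` maps node name -> failed step number, or None if the node succeeded.
    """
    by_step: dict[int, list[str]] = defaultdict(list)
    succeeded = 0
    for node, failed_step in results.items():
        if failed_step is None:
            succeeded += 1
        else:
            by_step[failed_step].append(node)
    # Biggest failure group first; ties broken by step number.
    groups = sorted(by_step.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return {
        "succeeded": succeeded,
        "failed": sum(len(nodes) for _, nodes in groups),
        "groups": [(step, sorted(nodes)) for step, nodes in groups],
    }


def outliers(results: dict[str, int | None]) -> list[str]:
    """Return nodes whose outcome differs from the most common outcome."""
    if not results:
        return []
    most_common_outcome, _ = Counter(results.values()).most_common(1)[0]
    return sorted(n for n, outcome in results.items() if outcome != most_common_outcome)
```

With a hundred nodes where nearly all succeed, `outliers` returns only the handful that differ, and `summarize_failures` distinguishes "everything failed on step six" from "nodes failed in various ways".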
I think the place where we win is when we say, don't ask me, go to this dashboard, and then they say, did you see the dashboard?

We did a case study with [unclear], and that was his big thing: they were doing their deployments and something would go wrong, and everybody would have to throw around some link to something and couldn't really see what was happening or where it failed. So it was a major point of contention, and all kinds of bad things would come out of that. His big point was that Rundeck was giving everybody one screen to watch, so they could see exactly what was going on. If something failed, everybody knew what it was. So it was a hundred percent visibility. He's one of the guys who turned our thinking around about this: the automation part is a solved problem, there are a million ways to do it; it's the visibility part that we didn't even have an answer for. We should talk to him too. Go sneak up on Billy.

I can tell you, from a visualization point of view, what we're hearing lately from our release managers. What I would boil it down to is that they would love to see a progress bar with an ETA.

Like the Microsoft one, where it just sits at 99 percent?

I noticed you had a progress bar. But the more you do it, the more interesting it becomes. I would love to know, for the twenty-minute operation, that at 23 minutes or whatever, you're statistically out of spec here.

That's pretty cool. We have this [unclear] where we talk to users, and this comes up. They say, okay, it's great that on average it's hitting it, but they want almost an SPC chart. They want the standard deviation for a job, because people are writing steps or changing the implementation of a step, or who knows what's going on, and they want to know whether the process is still stable and predictable.

I feel like that's hard. When I think about all the dynamic configurations, if you're applying a job against different sets of nodes, or half the nodes, or whatever, how do you even store that data and present it? Your average could be all over the place.

I think they just want to know the signal: it was steady, and now it's not. Just the deviation, just to know that much.

When I first created one general-purpose job, it had a lot of parameters. But then I decided that I really liked seeing the progress bar be vaguely accurate, so I broke it up into
separate jobs, [unclear].

It's the same thing with testing. If I've got a function or whatever that isn't small, and I test it, [unclear], that makes me break the method or function down into something smaller: that's the part that really sucks, or that's the part that's always variable. So I think getting visibility into the variation is partly about breaking it down. But being able to see the trend, whether it's spiking up and down or whether it's steady, that's the key part over time.

One more beer? Well, shall we call it a night? Thanks so much for