How to Use Machine Learning with Spark SQL Data, Tableau Analytics & Simba ODBC Connectivity

Alright, let's go ahead and get started. My name is Toph Whitmore, I'm the Vice President of Marketing here at Simba Technologies in Vancouver, British Columbia, and I'll be the emcee today. Today we're going to talk about machine learning, and I'd like to welcome all of you to our webinar. I'd also like to point out, for everyone who is joining late (which doesn't make a lot of sense to say since we're at the beginning), that we will make a recording of this webinar available at simba.com after the presentation. So welcome to today's webinar: Faster Processing, Faster Insight: How to Use Machine Learning with Spark SQL Data, Tableau Analytics, and Simba ODBC Connectivity. We're going to talk about three things: the new connectivity model, data exploration with Spark 1.5, and data visualization with Tableau analytics. I'll be joined today by Databricks Director of Client Solutions Pat McDonald, who will show you what you can do with Spark 1.5, and then product manager from Tableau Jeff Fang, who will show you some great data visualization, putting together the Spark data and showing what you can do with the performance of Spark and the power of a BI solution like Tableau.

First, I should point out that the marketing guy will speak the least today, perhaps mercifully. Simba: who are we, and what are we doing here? Very quickly, Simba sits in between the BI solution and the data itself. We're the connector; we build the drivers. Choose your metaphor: we're the bridge, the glue, the plumbing, the on-ramp to your big data. You need us if you need to connect your analysts, from their BI solutions, to their data in some form or another. You may need a custom connector; we build those, and we have an SDK environment for creating them, but we also license our driver and connectivity technology to partners like Databricks and Tableau, both of whom are customers of ours. If you're using, for instance, Tableau for the Mac, you may already be familiar with our SQL Server ODBC driver, which is embedded in it, and Databricks licenses our Spark driver for their Databricks Cloud. A couple more fun facts: we know ODBC pretty well, because we co-developed it back in 1992 with Microsoft. No offence to Databricks, but we're not a new company; we've been around for 25-plus years, and before anybody emails me, yes, I know that the old Microsoft logo on the slide looks like it's from 1992. And just because I have the opportunity: we are hiring, and if the projection screen weren't covering my window I'd be looking at this view right now. Come to Canada and work for us; we'd love to talk with you.

So, a context diagram and a little more perspective on where Simba fits into the broader big data ecosystem. We're in the middle there: in this case we provide the ODBC interface and driver implementation connecting the Tableau analytics solution to Databricks. Along the way we have a SQL connector as part of our driver, and we map the SQL queries to Spark SQL and back again. That can be very valuable if you are an enterprise with strong SQL expertise in your analyst workforce: you can leverage Spark by learning Spark SQL, but you can also get at Spark with your existing SQL query knowledge, and we take care of that translation process along the way. In addition, we are agnostic about whether you're on-prem or in the cloud; we can support any model, depending on how you want to do it. OK, I think that's enough marketing for today; let's take this into a more practical realm.
With that, I'd like to hand it over to Pat McDonald, who is Director of Client Solutions at Databricks and can show you a little bit more right now.

Thanks, Toph, thanks for the intro. I'm excited to be on the webinar today. As we mentioned, we want to give you a picture of an end-to-end solution: using Spark to process your big data and build some machine learning models, and then serving them up to Tableau, with Simba of course sitting in the middle. In case anyone's unfamiliar with Databricks, it's pretty simple to sum up our company: we're the folks who created Spark, and we continue to build the majority of features in the Spark project. Our team originally came out of UC Berkeley's AMPLab; the Spark project was one of the key pieces of the stack they were working on there.

Ultimately the team was focused on making big data simple. Spark is a big part of that, but so is what we're doing at Databricks to really solve that equation end to end and make it so data scientists and data engineers can basically get to work, without worrying about a whole bunch of infrastructure, and use tooling they really enjoy to solve sophisticated big data problems.

For anyone unfamiliar with Spark, it has become the de facto big data processing engine and platform for a lot of reasons, and many people very much enjoy using it. The Spark project consists of the core project, which is a set of APIs that users can take advantage of to process data pipelines end to end: reaching into several different types of data sources, performing things like feature engineering and data preparation, and maybe feeding the results into one of the built-in libraries. You can see some of those listed here; that includes Spark Streaming, the machine learning library with several algorithms available out of the box, GraphX for graph processing, and of course full SQL support, both from a language perspective in the API and with an optimized engine under the covers to make the processing of data even faster. (Pardon me, I need to make an adjustment here.) Another key feature of Spark is that it runs on many different infrastructure platforms. Spark is very much compatible with Hadoop, but you don't have to use it with Hadoop or on top of Hadoop: many people use Spark all by itself, many run it on top of something like Mesos, in the cloud, on-premise, and so on. So it's very flexible, and it can become a one-stop shop for anyone looking for a set of tools to work with their big data and solve big data problems.

Beyond that, Spark has grown a very large ecosystem, with support for a number of different technologies you might need to use in whatever data processing you're doing. We mentioned on the previous slide that Spark can run on a number of different platforms; it supports a number of different environments in terms of using their resources, running on those cluster managers, and so on. Spark also has a number of libraries and tools for working with different data sets and accessing data from different types of systems, and that has become a key advantage of Spark: whatever the data is that you need to process or bring into the system, chances are Spark already has some sort of connection to that type of data. There's also a growing set of tools that use Spark as the underlying engine; you can see several of them listed at the top there. Of course, it's also worth noting that Spark out of the box provides a lot of friendly developer APIs, which might mean you don't necessarily need another third-party tool. With all this said, Spark has become very popular because of all this flexibility that's available basically out of the box.
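(As a rough illustration of that unified, end-to-end idea, here is a minimal Scala sketch against the 1.5-era API; the file path and column name are hypothetical, purely for illustration.)

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object QuickTour {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("quick-tour"))
    val sqlContext = new SQLContext(sc)

    // Read a JSON data source straight into a DataFrame (path is hypothetical).
    val events = sqlContext.read.json("/data/events.json")

    // Same engine, two flavors: DataFrame operations...
    events.groupBy("country").count().show()

    // ...or plain SQL against the same data, registered as a temporary table.
    events.registerTempTable("events")
    sqlContext.sql("SELECT country, COUNT(*) AS n FROM events GROUP BY country").show()
  }
}
```

The point is simply that ingestion, DataFrame operations, and SQL all run on the same engine in the same program.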
Over the last couple of years, many organizations and many different software vendors have chosen to put Spark at the core of their big data systems. You can see a number of companies listed on the left, anything from high-tech web companies to traditional enterprises, who have chosen Spark and have spoken at various conferences, for example Spark Summit. I'd encourage you to go check out the website from those conferences; you can find a lot of use-case discussions of how exactly they're using Spark in their enterprise. And then of course there are a number of vendors who have built Spark into their distributions and/or their data platforms. We partner with a number of the organizations you see there, and at Databricks we have our own platform as a service, available in the cloud, for running your Spark applications.

Over the last year we've had a couple of major developments that have taken Spark beyond its early roots of simply being based on RDDs, or Resilient Distributed Datasets, and I'd like to summarize some of those here. First of all, over the last couple of releases we've focused a lot on a new API called DataFrames. We'll talk about that more in a slide in a few minutes, but DataFrames are a very important addition to the Spark platform.

Any data scientist who's been using, say, Python or R is very familiar with what a data frame is: basically a higher-level abstraction that provides a lot of convenient APIs that are very useful when doing data science. It also becomes a great abstraction for us to hook into our SQL optimization engine, so we can optimize the processing of data. One example: compared to the performance you might have seen when using the native Python API prior to DataFrames, by using the Python DataFrame API you can take advantage of several underlying performance enhancements, again provided through our SQL optimization engine, and the end result is that your processing of data through the Python API is just as fast as it is in any other programming language.

Another important addition in the last several releases is first-class support for R. This is very exciting to a lot of the data scientists out there who use R on a day-to-day basis: people who are comfortable with the R programming language can start to use the Spark APIs through R and take advantage of distributed data processing in what's otherwise typically a single-machine experience. Machine learning pipelines are another important abstraction and API that we've added in recent releases. This is similar to something like scikit-learn, where data scientists can define their entire data pipeline: from the point of pulling data from, let's say, a relational system, or using Spark SQL to grab data in a distributed, optimized fashion, through data preparation and feature engineering, maybe holding out test data and training data, running a few different versions of your model depending on what parameters you want to use, and then ultimately validating whether the model is accurate or not. You can define all of those steps through the new machine learning pipelines API, and a lot of folks are very excited about that.

So those are some of the data science updates in the last couple of releases. We also have a lot of what we call data engineers; some people might consider these folks to still be data scientists, but we consider them a somewhat different breed of engineer. These are folks dealing with large-scale data pipelines which need to connect into different systems, maintain curated data sets, or whatever it might be, and maybe ultimately feed into a model that a data scientist has built: a slightly different class of problems to solve. These folks are also very excited about Spark, for some of the reasons I mentioned previously, mainly the fact that Spark has, over the last year, grown a large ecosystem of connectors, if you will, to different types of data sources and data types. They're fully compatible with Spark using the existing APIs, and connecting to those systems might be done with several optimizations, like predicate pushdown or partition pruning, things that make the data access very fast, while to the developer the API feels very native and easy to use. As you can see in this particular example, you might even use a SQL flavor to treat one of these data sources as a quote-unquote table and use your SELECT statements to grab data that way. This has become a very important integration API for many people building their own connectors into the Spark ecosystem, so that developers can connect to their data, whatever it might be.
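(A hedged sketch of that pattern with the built-in JDBC data source; the connection URL, table, and columns are made up for illustration, but the CREATE TEMPORARY TABLE ... USING syntax is the 1.5-era way of mounting an external source as a quote-unquote table.)

```scala
// Register an external JDBC source as a temporary table, then query it with SQL.
// Filters in the query can be pushed down to the source (predicate pushdown).
sqlContext.sql("""
  CREATE TEMPORARY TABLE customers
  USING org.apache.spark.sql.jdbc
  OPTIONS (
    url "jdbc:postgresql://db-host:5432/sales",
    dbtable "public.customers"
  )
""")

val bigSpenders = sqlContext.sql(
  "SELECT name, total_spend FROM customers WHERE total_spend > 10000")
bigSpenders.show()
```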
Another important development is the Spark Packages website and tooling. Basically, this is similar to any other toolkit you're used to, where you can easily grab a library that may not be part of the main distribution. In that same spirit we added Spark Packages to the Spark toolset and created this website, so that any developer who's maybe not contributing to the mainline code can create a library and make it available to other developers fairly easily. This has become a great place for people to add the data sources I was just mentioning. For example, you can see in that screenshot that maybe you need to connect to Avro data: it's fairly easy to just grab the spark-avro Spark package, add it to your application, and from that point forward the developer can use the same familiar DataFrame APIs to get access to that data. Same thing, as you can see there, with Redshift. Other examples of Spark packages would be Spark Streaming-style connectors or machine learning algorithms. This is a pretty key development for Spark, because it allows individual developers to build packages that are interesting to other people; those packages can maintain their own life cycle, without having to adhere to the Spark release life cycle, and can grow on their own. This is one of the reasons why, for example, R has been such a successful project: it has a very large, very healthy package ecosystem.
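(To make the Avro case concrete, a minimal sketch, assuming the application was started with the spark-avro package on the classpath, for example via --packages com.databricks:spark-avro_2.10:2.0.1; the path and column names are hypothetical.)

```scala
// Read Avro files through the spark-avro package; the result is just a DataFrame.
val episodes = sqlContext.read
  .format("com.databricks.spark.avro")
  .load("/data/episodes.avro")

// From here it is the same familiar DataFrame API as any other source.
episodes.printSchema()
episodes.select("title", "air_date").show(10)

// Writing back out in Avro uses the same format string.
episodes.write.format("com.databricks.spark.avro").save("/data/episodes-copy")
```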

One thing to keep in mind is that Spark development moves very fast. I just mentioned several very large features which, for the most part, only came in over the last six or seven months. It moves fast, and that's one of the reasons people are very interested in it: every single release, which comes out every three months, adds new features that people care about. At the same time, of course, we're maintaining backwards compatibility so that we're not breaking applications, as much as we possibly can: we define which APIs are public and stable and mean for people to use those, so that when they move from version to version they're not seeing major issues. And every single release we want to do something interesting, so whether it's for data scientists or for data engineers, we're always adding new features and new algorithms that are quite important to people. This means everyone is always looking to grab the latest version, and perhaps that's not always so easy, because releasing every three months is pretty rapid.

With that said, let's talk about some of the most recent features of Spark 1.5. Actually, I'm going to go through about five features here, some of which are not brand new as of 1.5, but they are of course in this latest release, and hopefully they're exciting enough to encourage you to go check out the project. First of all, DataFrames. I mentioned this a bit already, but anyone who's done some data science work with R, or maybe pandas within Python, should be familiar with what a data frame is. Basically it's an abstraction that's kind of like a table: it has columns with names, and then several operations that help you operate against that quote-unquote table, things like selecting certain fields, filling in null values, transforming specific fields, or doing aggregations. All of these operations are very useful for manipulating data. By bringing these to Spark we've adopted the best practices from the other languages I mentioned, but everything we do in Spark is distributed, so we're taking what's typically a single-machine construct and making it accessible in a distributed data platform. That's very powerful and exciting to a lot of people, especially folks who are already familiar with those APIs. It's also important to mention that these DataFrames, like everything we've done in Spark SQL previously, are completely interoperable with the actual SQL language, so if for whatever reason you're not comfortable using the DataFrame APIs you can just drop back to writing SQL; depending on what's the best fit for the use case you're working on, pick the language that works best for you. This has been a really exciting development, like I mentioned, because the API is very comfortable to a lot of people, but also because having developers use this API gives us the opportunity to do a lot of optimization under the covers, essentially the SQL optimizations we were already doing for the Spark SQL project; now we can bring those to a wider set of use cases, and that is key to performance.
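(A minimal Scala sketch of those operations, assuming a hypothetical orders table with product, category, and revenue columns; the SQL at the end is the equivalent drop-back-to-SQL version.)

```scala
import org.apache.spark.sql.functions._

// Select certain fields, fill in nulls, transform a field, and aggregate:
// the table-like operations described above, executed on a distributed DataFrame.
val orders = sqlContext.table("orders")           // hypothetical table

val cleaned = orders
  .select("product", "category", "revenue")
  .na.fill(0.0, Seq("revenue"))                   // fill null revenue with 0
  .withColumn("revenue_k", col("revenue") / 1000) // transform a field

val byCategory = cleaned.groupBy("category").agg(sum("revenue").alias("total_revenue"))
byCategory.show()

// Or drop back to SQL for the same thing.
cleaned.registerTempTable("cleaned_orders")
sqlContext.sql(
  "SELECT category, SUM(revenue) AS total_revenue FROM cleaned_orders GROUP BY category").show()
```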
This DataFrame abstraction has also become the narrow waist for use in other parts of the project. For example, in the machine learning pipelines I spoke about previously, as you do feature engineering you essentially add additional columns to a DataFrame, and then ultimately feed that DataFrame into your algorithm. So DataFrames are very exciting. One last note about DataFrames: they are also the end result of the data sources I spoke about previously. To get access to any of those external systems, external data sources, or data types, you write a couple of lines of code that know how to reach out to that system in an efficient way, and the end result is again a DataFrame. So this is very much the center of Spark at this point in time, and we're very excited about that development.

Another recent feature that's really important to learn about is what we call Project Tungsten. This is, in a lot of ways, a complete re-architecture of the internals of Spark so that we can manage memory directly, no longer depending on the JVM for garbage collection, so off-heap memory management, while also taking advantage of other important capabilities for faster performance, such as code generation, cache-aware computation, and so on. If you're interested in that type of thing, I definitely encourage you to check out Josh Rosen's talk from Spark Summit on YouTube; you can see it listed here.

Go to YouTube and search for "Tungsten Spark" and you'll find an excellent presentation that makes the pretty complicated, sophisticated underlying architecture very easy to digest, so go check that out if you're interested. For the most part, the important thing to keep in mind is that the end result will be much faster performance, at an even larger scale than what we've been doing previously.

Another new feature just came out in Spark 1.5. By the way, I forgot to mention that Spark 1.5 itself just came out, was it yesterday or the day before, so this is a brand new release we're talking about here. One of these new features is in the Spark Streaming package, where we can now better handle back pressure. Back pressure happens when, let's say, you have some sort of node or network failure in the system: maybe a particular node goes down, or for whatever reason the processes that are ingesting data slow down, or maybe the rate of incoming data picks up. That can be a challenge to handle in a streaming system, and so we've added some new features to adjust to that type of scenario.

Another recent feature, which actually was not part of 1.5 but came out in, I believe, the 1.4 timeframe, is window functions. This is something that's near and dear to the heart of a lot of folks doing advanced analytics. Let's look at this particular example. With this data set we want to ask: what's the difference between the revenue of each product and the revenue of the best-selling product in the same category? For example, if we look at the tablet category, how different was the Mini tablet compared to the Pro tablet, given that the Pro is the highest-revenue product? We can see the answer is a thousand. With a data set like this we want to get the answers shown on the right; we've already calculated manually that the answer for the second row is a thousand, and you can see the same pattern holding true for the rest of the results. This is a hard problem: it's easy to do in an Excel spreadsheet, but much harder when you're dealing with large-scale data, and window functions are very important for expressing that type of question and getting an answer. So how would we calculate it? You need something like window functions, and we've now added those to Spark. Right here is the DataFrame version of how you might write a window function, where we partition by the category, define a range, and so on. However, as I mentioned on previous slides, note that you can always drop into writing actual SQL to do the same thing.
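(Here is a hedged sketch of that revenue-difference question in the 1.5-era Scala API, assuming a product_revenue table with product, category, and revenue columns, followed by the equivalent plain SQL.)

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Note: in Spark 1.4/1.5, window functions require a HiveContext
// (which is what Databricks provides as sqlContext).

// Window covering every row in the same category.
val byCategory = Window
  .partitionBy("category")
  .orderBy(desc("revenue"))
  .rangeBetween(Long.MinValue, Long.MaxValue)

val productRevenue = sqlContext.table("product_revenue")

// Difference between each product and the best seller in its category.
productRevenue.select(
  col("product"), col("category"), col("revenue"),
  (max(col("revenue")).over(byCategory) - col("revenue")).alias("revenue_difference")
).show()

// The same question expressed directly in SQL.
sqlContext.sql("""
  SELECT product, category, revenue,
         MAX(revenue) OVER (PARTITION BY category) - revenue AS revenue_difference
  FROM product_revenue
""").show()
```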
All right, the last feature I want to talk about is machine learning pipelines. I already talked quite a bit about this, but machine learning pipelines are a higher-level abstraction than what we were doing previously with MLlib. They go beyond just providing algorithms to also providing all the APIs required to define the pipeline leading up to the steps of using an algorithm and building a model. You can see some code here where we're doing a lot of feature transformation, and as of this release we're happy to say that we've got feature transformers basically on par with scikit-learn, so all the kinds of things you might need to do to work with your data ahead of building a model should now be easy to do out of the box. These APIs are also abstractions you might use to write your own feature transformer if you need to, and we're continuing to add more of the algorithms from MLlib, as well as new algorithms, into the ML pipelines API. In case the code is a little confusing, here's a visual representation of what it's doing: it's basically defining all the different steps needed in this particular example, up to the point just before you actually train your model. This visualization, I might note, is also something we've made available in recent versions.
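(The slide's exact code isn't reproduced here, but a pipeline in the 1.5-era spark.ml API might look roughly like this sketch, assuming a hypothetical leads DataFrame with a state column and a lead_source label.)

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

val leads = sqlContext.table("leads")            // hypothetical input DataFrame

// Feature transformers: turn string columns into a numeric label and features.
val stateIndexer = new StringIndexer().setInputCol("state").setOutputCol("state_idx")
val labelIndexer = new StringIndexer().setInputCol("lead_source").setOutputCol("label")
val assembler    = new VectorAssembler().setInputCols(Array("state_idx")).setOutputCol("features")

// The estimator at the end of the pipeline.
val tree = new DecisionTreeClassifier().setLabelCol("label").setFeaturesCol("features")

val pipeline = new Pipeline().setStages(Array(stateIndexer, labelIndexer, assembler, tree))

// Hold out test data, train once, and validate.
val Array(train, test) = leads.randomSplit(Array(0.7, 0.3), seed = 42)
val model = pipeline.fit(train)
val predictions = model.transform(test)

val precision = new MulticlassClassificationEvaluator()
  .setLabelCol("label").setMetricName("precision").evaluate(predictions)
println(s"test precision: $precision")
```

The important part is just the shape: transformers and an estimator chained into one Pipeline object that you fit in a single step.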

So that stuff is all very exciting, but as I mentioned previously, one of the challenges some people have with Spark is that it has moved so fast that it can be hard to take advantage of these new features. Let me talk a bit about how you might do that in Databricks, by summarizing first of all what Databricks is. We've talked a lot about Spark up to now, and Spark has several advantages: it's a unified platform; it's very fast (we didn't talk much about performance, but Spark holds the GraySort benchmark record right now, so there are no issues of scale or speed); it has plenty of flexibility, which you might have noted from all the APIs I've discussed and all the different systems you can integrate with; and it has broad adoption, so choosing Spark as your SDK, if you will, for writing applications is a safe choice, because it's supported by a wide variety of vendors and platforms.

That's all great, but at the end of the day it's still a distributed system, and distributed systems are inherently hard to use. What we've done at Databricks, as I mentioned at the top of the webinar, is focus on solving the end-to-end problem. Which is to say: if you really just want to get to work as a data scientist or data engineer, and you want to build a data pipeline, create a model, or do some data exploration, it's great that we've given you the tool set in Spark to write the code, but actually getting to the point where you can write the code might be hard. Case in point: hopefully we've excited you up to this point and you want to try Spark 1.5, so in your head you're asking, how exactly can I do that? The truth is it's not always easy; there are a lot of infrastructure requirements that can get in the way. At Databricks we've created a system that we believe solves a lot of those problems by providing essentially zero-management clusters in the cloud, with push-button provisioning of Spark clusters. We've built an environment focused on notebooks, so you can collaborate in real time with your colleagues, define an entire process, share that notebook with others, and then very easily take that same notebook and turn it into a production job. And while we've solved a lot of problems in Databricks, we don't want to pretend we've solved all of them, so we make the whole system open and extensible, which allows other people to integrate their platforms with Databricks; you'll see in a bit how Tableau does that. The end result is that we've got a lot of customers enjoying quite a bit higher productivity: they can focus on writing the code they want, performing the analytics, and building data pipelines; whatever that work is, whatever that research or experiment might start out as, it's very easy to turn into an actual production job; and the end result is the essence of data democratization, making the data easy to work with but also very accessible to other people in the organization through collaboration and sharing.

With that said, why don't I dive into a quick demo. This is the Databricks platform. As I mentioned, it's a web-based platform hosted in the cloud, and it's all oriented around using Spark for data science and data engineering. The first step, of course, is to work with clusters, and we can see in this environment there are already a few clusters provisioned, so we'll use the demo cluster here; it has six nodes running in the cloud and it's running Spark 1.4. It's also very easy to go ahead and create new clusters using this dialog. And as I mentioned, while it may otherwise be hard to try the latest and greatest version of Spark, we're always providing the latest version in Databricks; in fact, we had an experimental version of Spark 1.5 available as of three weeks ago or so. So if you're ever interested in trying the latest features of Spark, come try them out in the Databricks platform.
I won't create a cluster right at this moment; I'll just use one that already exists, and I'll open up this notebook, which is set up for this particular demo. This notebook is an example of an end-to-end machine learning use case, where we're going to explore some Salesforce data, build a prediction model, and then test it out and make it available to other users. This particular notebook is a Scala notebook; a notebook in this example has markdown documentation at the top, but you'll see that even though it's a Scala notebook I can drop into SQL, which may be familiar to most folks on this call, and start to explore the data I need to work with. We're working from this table, new_leads_sfdc. We can get a little summary of the data itself, and maybe do some simple visualizations, which are built into the notebook, to show a few things about it: for instance, what's the distribution of data across the different states? We can see it's heavily leaning towards California. Then we start to look at the field we care about, the lead source, because we're going to build a predictive model around it; we'll try to predict the lead source based on the state, and we can see there's a pretty wide distribution of lead sources here. So what we want to do is prepare the data. Now we've dropped into code: we start by running a SQL statement and creating a DataFrame, and then with that DataFrame we'll do our subsequent steps of feature engineering, categorization, and so on.
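(The exploration cells might look roughly like the following sketch; this is not the actual notebook code, the table name follows the demo and the column names are assumed.)

```scala
import org.apache.spark.sql.functions.desc

// Quick look at the table and the distributions discussed above.
val leads = sqlContext.sql("SELECT * FROM new_leads_sfdc")
leads.printSchema()

// Distribution across states (heavily weighted toward California in the demo).
leads.groupBy("state").count().orderBy(desc("count")).show()

// Distribution of the field we want to predict.
leads.groupBy("lead_source").count().orderBy(desc("count")).show()
```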

Here we're starting to do some categorization: we're going to group the data into fewer lead sources, because we found that, for whatever reason, the data wasn't as clean as we would like, so we're reducing the number of lead sources. Then we can start to do the feature engineering required to eventually build a model. At this point we split the data into training and test data, we prepare the model, and then here we are at the point where we actually have our model. We've just built the model using MLlib, a decision tree, and here you can inspect the decision tree model; basically, decision trees are like a series of nested if-then statements. The next step is to test this out: for example, I might use my function here to see how well it performs for the state of, say, California. All of these cells are fully executable, so I can run it, and you can see the end result is "paper". In this simple function I'm actually using the model to run my predictions, and that works pretty well; we can even use those functions in SQL queries at a later point in time. But what we're going to do here is build a table that we can host for other systems; in this particular case, we're going to build the table that will be used by Jeff when I hand it over to him in a second, so that he can read it from Tableau and look at what we actually predicted. Here we can see how well we're predicting. It's a simplistic example, so it doesn't make a lot of sense, but in this case the real lead source was digital and it's predicting paper, while in this other case the real lead source is paper and it's actually predicting paper, so that's good. Anyhow, we can do a bit of validation within our notebook to see how good the performance is. Let's take a look at the leads that have a paper lead source. We can use this visualization to say, okay, here's what the actual lead sources look like; compare that to the predicted, and it looks roughly the same, maybe a little more weighted towards California, so for paper leads it's not so bad. However, if we try the digital leads and look at the actuals, this is what they look like, but the prediction is not doing so well, so we've got some work to do here and would of course continue to iterate. At the end of the day, you can see how this notebook construct makes it very easy to share something pretty sophisticated with anyone in your organization, to collaborate on these notebooks together, and to build out models which you can subsequently use in things like Tableau, which Jeff will show you now.
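(A hedged sketch of that publishing step, assuming a fitted model like the pipeline sketch earlier and hypothetical column names; the point is only that the predictions end up as a Spark SQL table that Tableau can then query through the ODBC driver.)

```scala
// Score the held-out data and keep the columns the analyst will care about.
val scored = model.transform(test)
  .select("state", "lead_source", "prediction")

// Persist the result as a Spark SQL table so a BI tool such as Tableau
// can discover it and query it live over ODBC.
scored.write.saveAsTable("predicted_leads")
```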
So that's it for my demo and my presentation. I'm going to hand it over to Jeff to take it from here; you can submit your questions in the questions window in GoToWebinar. Thanks. Jeff, are you there? Great, over to you.

Thanks a lot, Toph, and thanks, Pat. We at Tableau are super excited about Spark, and we're proud to partner with Databricks and Simba in enabling connectivity between Tableau and Spark via Spark SQL; we were also Spark certified by Databricks as of late last year. Spark SQL, as you know, is a powerful tool that enables business analysts, developers, and data scientists to mix and match SQL as well as Spark's APIs, and Tableau plus Spark is a great combination; let me show you why. (It sounds like we lost you for a second there, but I can keep going. Okay, perfect.) Tableau's mission since day one has been to help people see and understand their data. We help empower business users by allowing them to freely explore their data and focus on the questions they have of it. At Tableau we have four primary products. We have Tableau Desktop, which is what you'll see today. Then we have Tableau Server, which is our tool that allows users to collaborate with each other securely, as well as giving administrators the enterprise features they want in terms of provisioning, authentication, privileges, and such.

Then we have Tableau Online, which is essentially the hosted version of Tableau Server; it's our cloud, pay-as-you-go offering that lets organizations get up and going without owning the underlying infrastructure of Tableau Server. And lastly we have Tableau Public, which is a free offering: you can think of it as YouTube for data, and it enables data enthusiasts, journalists, and anybody else who wants to share data visualizations with other people to do so freely.

A quick overview of the user interface for using Tableau, which I'll show you in just a second. When you open Tableau and connect to a data source, it automatically brings in all of the dimensions and measures and categorizes them based on the particular data. From there it's simple drag and drop: dragging dimensions and measures onto the main canvas window, or onto one of the filters, pages, or other shelves that we have. Just a few points about the value proposition of Tableau. First: big data platforms. Besides Spark, we also connect to a number of Hadoop distributions as well as NoSQL databases. Second: visual analytics without coding, so you no longer have to write SQL statements; you can simply connect to your data and go. Third: we have a hybrid data architecture, which means we can connect to a data source live, or create an extract of that data and bring it into our in-memory data engine. We also allow blending of data between data sources, so as an analyst, if you are connected to Spark and you want to mix in data from somewhere like Salesforce or an Oracle database, you can do so within our platform. Next: we've greatly improved our query performance, which means more speed for the user. And the last point: regardless of the data source you're connecting to, whether it's Hadoop, a relational database, or a web application, we provide a consistent interface for visualizing data.

So that's it for the presentation; why don't we get started. What I'm showing here is Tableau Desktop, version 9 of our product, and I've made a connection to the Spark cluster that Databricks created. I'll start by selecting a schema; we can see the default is available. The first data set I'm going to connect to is the one called new_leads_sfdc; give it a second here to queue up. Now, before connecting to the data right away, I'm going to run some initial SQL, and this is just a statement to cache the data in memory in a DataFrame: it's simply CACHE TABLE new_leads_sfdc. When I update the data I can see a preview of the schema as well as samples of the data. If I scroll down and look at what's available in our data set, there's one particular field I think could be interesting, which is the status of the leads, but we see there's a delimiter between the first part and the second part, so I'd like to split it apart. What I'll show you now is a quick data prep function: I split the field by the dash, it runs, and now there are two additional columns. I'm simply going to rename them: I'll name this one Lead Status, and this one let's call Has Been Contacted.
Okay, I'm just going to click to update the preview data, and it looks pretty good. So why don't I now jump over to our worksheet. As I mentioned earlier, Tableau splits the fields into dimensions and measures based on the particular data type.

We could start by showing the number of records in this data set, so I just double-click Number of Records and drag it onto the canvas window, and we can see there are 524 records. Okay, that looks good. Let's say we're interested in seeing how this varies by lead source: I'll simply drag Lead Source onto the column shelf and sort it descending, and in a moment we can see that the biggest source of leads was distribution, followed by advertisements and retail trade shows. I believe this matches the same data Pat was showing earlier in his notebook. One more thing to do here: let's break it down by whether the lead was contacted. I drag that field onto the color shelf, and we see that the majority of the leads have not been contacted yet. Alright, great. Let's call this sheet Leads and open another worksheet; say we're interested in seeing where these leads are coming from. I can grab Number of Records onto the row shelf, and there is a dimension field here called City, so I'll drag it over to columns, click the map view, and also drag Number of Records onto color to give it some more depth, and boost up the size just a little. You can see how easy it was to quickly create a visualization based on some questions you have of your data. I'll show one more thing, which is pulling these two worksheets together: I simply select New Dashboard, drop the map into the view first, and then Leads next. And remember, for all of these visualizations, as I'm going it's generating a live query going directly against the Spark cluster. Now I can just interact: I select a state on one sheet, and the other chart updates to show where those leads are coming from; just a second there. So, real quickly, it's cool to see how you can combine Tableau and Spark together; it allows any data analyst to analyze and get insights from their data without having to write code. I'm not sure what's going on here, so maybe I'll just cancel, but you get the main point of the demo. In closing, I'd just like to say that Tableau plus Spark is a powerful combination that enables analysts to analyze data without having to write code, and we're really excited to partner with both Databricks and Simba in enabling this connectivity. With that, I'll pass it back over to Toph to close it up.

Thanks, I'm going to flash back here to our lovely PowerPoint slideshow. Appreciated hearing from you, Jeff; appreciated hearing from you, Pat. We do have a couple of questions, so I'll share those now. I believe the first one is for you, Pat: can you cover how the data was made available in the Spark notebook demo? Sure thing. There are a number of ways to ingest data with Spark and in Databricks; in this particular case we had essentially an export from Salesforce that was placed into S3, and then we just imported it from there: basically we point at the CSV file in S3 and make it available as a quote-unquote table in Databricks.
That's a feature I didn't show you, but it's worth noting that the table Jeff was accessing is actually coming from that Databricks environment I showed you, and all of those tables are basically Spark SQL tables.
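(The exact setup isn't shown in the webinar, but with the spark-csv package a mapping like that might look roughly like this; the bucket and path are hypothetical.)

```scala
// Point at a CSV export sitting in S3 and read it into a DataFrame.
val rawLeads = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("s3n://my-bucket/exports/new_leads_sfdc.csv")

// Expose it to Spark SQL (and, through the ODBC layer, to Tableau) as a table.
rawLeads.registerTempTable("new_leads_sfdc")
```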

In this particular case, calling that a table is really a kind of misnomer, because it's basically just a virtual mapping of a CSV file in S3 into Spark SQL, so that Spark SQL knows how to process it and make it available as a DataFrame or table for subsequent queries. We pulled it into the notebook as the first step, simply by running queries against that table or calling DataFrame commands.

OK, a question for you, Jeff: how exactly does Tableau connect to Spark? Is it Simba ODBC, or is it through Hive, or both? Jeff, are you there? Well, it looks like we lost him, so maybe I'll just jump on that one; this is Pat. Essentially, in our Spark clusters in Databricks we are running the Spark SQL server, and that gives you access to all of those tables I was talking about, as well as temp tables, and then Tableau gets access to those through the Simba driver, which connects it into the Spark SQL server. That's correct; sorry guys, I had myself on mute earlier. Yes, essentially by using the Simba driver and connecting to the Spark SQL server, that enables the overall connectivity, so it's a beautiful solution with the partners working together. Cool, thanks Jeff, thanks Pat.

Looks like we'll do maybe one more here for you, Pat: a question about Spark SQL. Is that the same SQL as HiveQL, or ANSI SQL? Oh, that's a good question. Spark SQL is its own engine, with its own optimizer, and as of recently we've even created our own functions which all take advantage of code generation and so on, so it's completely its own engine. However, the dialect it understands is, for the most part, HiveQL: basically almost everything that exists within HiveQL, plus a few features and minus a few others. Cool, thanks Pat. I think there's another one for you, Pat: can the Salesforce data be queried using REST and directly persisted as an RDD or DataFrame in Databricks, without taking it to S3? Yes. The example we used is just the common CSV-in-S3 type of integration, as I mentioned, but we do know of some interesting new Spark packages which are built to use the Salesforce APIs to query it directly, so there are Spark packages that can give you that. The important thing to note is that, as far as Spark is concerned, as long as there is a data source that creates a DataFrame, what happens under the covers is essentially abstracted from us and it all looks the same. So in this one case, yes, a CSV in S3 is the way we ingested the data, but we could run this exact same notebook if that table were actually driven by a plug-in from a Spark package. And as I mentioned, there is a Spark package that's been released, or I think will be released next week or so, or maybe it was last week, that does the very thing you're asking and connects directly to Salesforce. Great, thanks.

With that, I think let's go ahead and wrap up today's webinar. I want to be very clear, for those of you on the call who joined late: we will be making a recording available as soon as possible. We weren't able to accommodate everyone who registered for the call today, and we'll make sure that all of them get access to the recording as well. Thank you to Pat, thank you to Jeff; I really appreciate the opportunity to share this information, and to show how you can achieve best practices in machine learning with Databricks, Spark SQL, Tableau analytics, and Simba ODBC connectivity.
With that, I'll say goodbye, and I look forward to having you all join us the next time we have a webinar like this. Thanks everyone, take care, bye-bye.