Apache Spark: Distributed Machine Learning using MLbase



All right, thank you everybody. Thanks to Twitter for having us here, thanks to the Spark meetup organizers for inviting us, and, most importantly, thanks to you all for being here. I'm Ameet Talwalkar, a postdoc in the AMPLab working on machine learning. Evan Sparks is right there; he's a grad student in the lab with a background in systems, and also a background in data science from his previous work. We'll be talking today about MLbase. This work is coming out of the AMPLab, but it's also joint work with collaborators at Brown and at VMware.

Our work is motivated by three trends in computer science, which I'm sure most of you are aware of. The first is the emergence and rise of big data for various applications, including Twitter, and more generally measuring human activity, scientific datasets, sensor networks, and so on. The second trend is the emergence of distributed computing, in part to deal with this big data. And the third is the maturity of machine learning, particularly for certain well-known problems such as classification, regression, collaborative filtering, and so on. We think of MLbase as sitting right at the intersection of these three things: very simply, we want to perform machine learning on big data using distributed computing. In this talk I will first describe the vision of MLbase, then talk about the lowest level of our system, which is MLlib; then Evan will take over and talk about the two higher-level components, and give you some details about the MLbase release plan.

All right, so to start: what are the problems we're dealing with in MLbase? There are two. The first is that scalable implementations of machine learning can be hard for ML developers. Here's an ML developer who, we like to say, is trapped in MATLAB: like many machine learning developers, they like prototyping in MATLAB, which leads to easy prototypes but not really scalable or robust code. There are limitations: not everybody uses MATLAB, and while there are existing distributed libraries for machine learning out there, these libraries aren't necessarily as easy to use as you'd like, and they're not really extensible. So for ML developers, scalable implementations are a problem. On the other hand, we have end users who have access to big data — data scientists, perhaps — and who want to run machine learning at scale, but that can be difficult: there are lots of algorithms to choose between; even if you know which algorithm to use, tuning the knobs of these algorithms — figuring out what parameters to use — can be very hard; it's difficult to debug things when they go wrong; as I mentioned, things aren't always as scalable as you'd want; and although there are a lot of machine learning resources out there, they don't necessarily point you in one clear direction.

So with MLbase our goal is to bring together ML experts — people like me — and systems experts — people like Evan — and build a unified system that simultaneously provides easy, scalable ML development for ML developers and user-friendly machine learning at scale for end users, gaining insights at the intersection of machine learning and systems along the way. Before I talk about each part of our stack, I want to draw an analogy with MATLAB and what it consists of.
You can think of MATLAB as running on a single machine — maybe a multicore machine. On top of that, the lowest level of MATLAB is LAPACK, a low-level Fortran library that is highly optimized for numerical linear algebra. And on top of that is the MATLAB interface we're all more familiar with: it has higher-level abstractions and provides much more functionality than LAPACK alone, but it leverages LAPACK for intensive matrix operations. This isn't unique to MATLAB; there's a similar story for R and Python, where the lower-level primitives are written in C or another low-level language.

With that in mind, let's talk about the MLbase stack. At the lowest level we have a runtime. Eventually we want the higher levels of MLbase to be agnostic to the runtime, but to start we need to pick one — ideally one that is amenable to machine learning — and, not surprisingly given this audience, we chose Spark. It's a cluster computing system, as you all well know; it's designed for iterative computation, which is great for machine learning; and there's a focus on ease of use, both in terms of setting it up and in terms of the programming environment.

On top of Spark sits the lowest level of MLbase, which we call MLlib, and it's analogous to LAPACK: a low-level machine learning library written directly against Spark, as part of the Spark codebase. It's callable from Scala and Java, and ultimately, one day, we'd like a Python API as well. On top of that we have MLI, which is analogous to the MATLAB interface: it's an API and a platform for feature extraction and algorithm development. It includes higher-level functionality than MLlib, and it's also expected to have a faster development cycle. The highest level of MLbase is what we call the ML Optimizer. There's no real counterpart to it in the MATLAB stack, but the idea is that it's a tool for end users — automating machine learning as much as we can — and it solves a search problem over the feature extractors and algorithms included in MLI and MLlib.

So where are we today? This all sounds nice, but we've actually made real progress in recent months. The lower two components — MLlib and MLI, which are geared toward the ML developer — are what we're actively planning to release in the next three weeks. The third part, the ML Optimizer, is still an active area of research; we're hoping to release it later, sometime this winter.

Okay, so hopefully you have some idea of how the different layers of the system fit together, but let me give you a more concrete example of how this plays out. Imagine you have some text file and you want to perform classification on the data in it. How would you do this at each level of our system? Let's start with MLlib. The first thing you need to do is read in the data and process it yourself — featurize the data manually. What's underlined in the code here is basically a user-defined function called processData; MLlib is not helping you much with that step. But once you have the featurized data in an RDD, you can very naturally call the logistic regression function inside MLlib.
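To make that concrete, here is a minimal sketch of what the MLlib-level version looks like in Scala. The class names roughly track what MLlib looked like in early Spark releases and may differ from the exact API at the time of the talk; the processData helper and the HDFS path are made up for illustration.

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.regression.LabeledPoint

    object MLlibLevelExample {
      // Stand-in for the manual featurization step the talk mentions: turn one
      // raw text line into a label and a fixed-length feature vector.
      def processData(line: String): LabeledPoint = {
        val parts = line.split(',')
        LabeledPoint(parts.head.toDouble, parts.tail.map(_.toDouble))
      }

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[4]", "mllib-example")
        // Featurize manually -- MLlib does not help with this step.
        val points = sc.textFile("hdfs://path/to/data.txt").map(processData).cache()
        // Then hand the RDD of labeled points to MLlib's logistic regression.
        val model = LogisticRegressionWithSGD.train(points, numIterations = 50)
        println(model.weights.mkString(", "))
      }
    }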
In contrast, what would you do with MLI, one level up in the system? We're solving the same problem, but first you can use the built-in feature extraction functionality. What I have underlined here: you have raw text data, and you can use built-in functionality to compute n-grams on that data and then to compute tf-idf weights associated with those n-grams. Once you have these features, you can call logistic regression from MLI, and — similar to the MATLAB/LAPACK relationship — since logistic regression is implemented in MLlib, the logistic regression function in MLI leverages MLlib underneath. Beyond that, there are other things you can do with MLI: for instance, you can embed this code in a cross-validation routine to figure out the right hyperparameters for logistic regression, you can use the other feature extractors and algorithms included in MLI, or you can create new ones using the high-level abstractions it provides.
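A rough sketch of that MLI-level flow is below. To be clear, the method names here are illustrative pseudocode, not the released MLI API — the point is just how little glue is left once featurization and training live behind table-style operations; the data path and hyperparameter values are also made up.

    // Hypothetical, MLI-flavored pseudocode -- names are illustrative only.
    val raw   = MLTable.loadText(sc, "hdfs://path/to/docs.txt")   // one document per row
    val feats = raw.nGrams(n = 2, top = 30000).tfIdf()            // featurize via table transforms
    val model = LogisticRegression.train(feats, iterations = 50)  // backed by the MLlib kernel

    // Hyperparameters can then be chosen by wrapping this in cross-validation,
    // e.g. (still pseudocode):
    val best = crossValidate(feats, folds = 5, lambdas = Seq(0.01, 0.1, 1.0)) {
      (train, lambda) => LogisticRegression.train(train, iterations = 50, regParam = lambda)
    }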
Above all of that sits the ML Optimizer. I want to be clear that at this point this is research, which is why it's written in pseudocode, but the idea is that we want to take ease of use a step further: the user just tells us what the raw data or features are and what the label is, declaratively specifies the task — doClassify — and the optimizer searches through the space of feature extractors and machine learning algorithms in MLI and returns a trained model. So that's how the three components fit together. Let me now dive into the lowest-level component, MLlib, and tell you a little more about it.

To set the stage, let me describe a rough landscape of machine learning libraries along two dimensions: ease of use on one hand, and performance or scalability on the other. At the top left you have things like MATLAB and R, which are very easy to set up and use but are single-machine. Next to them is something like Mahout, a distributed library on top of Hadoop; because it's distributed, it scales better. Further along to the right you have software like GraphLab and VW (Vowpal Wabbit), both of which started off as academic research projects — GraphLab at CMU, focused on graph-based algorithms, and VW built around a fast stochastic gradient descent primitive — and both have since expanded to provide broader machine learning functionality. Where do we hope MLlib fits? We hope it will match the scalability of GraphLab and VW while being somewhat easier to use, particularly in terms of setup.

The first release of MLlib, which we're gearing up for in the next few weeks, will include basic algorithms for classification, regression, collaborative filtering, and clustering, and many of these algorithms rely on lower-level optimization primitives. A few things to note about MLlib: it's important that MLlib will continue to be supported by the Spark project and will adapt to changes in the rest of the runtime, and it will ship with the next release of Spark.

Now let me tell you a little about MLlib's performance. We measure performance in a few ways. The first is wall time — that is, the parallel running time. We also run two kinds of scaling experiments. The first is weak scaling, where the idea is to fix the problem size per processor: in other words, as we grow the size of the cluster, we grow the size of the data accordingly, and ideally the wall time remains constant as we scale up data and cluster together. The other is strong scaling, where we fix the total problem size and grow the cluster, and we hope that, for the fixed data size, we get a linear speedup as we add machines. Our experiments run on EC2, on clusters of m2.4xlarge instances.

First, some experiments for logistic regression. These are weak scaling experiments: roughly 200,000 images per machine — remember, for weak scaling the data grows with the cluster — and they're very wide, 160,000 dense features each. On the left are the scaling results. Ideal would be a flat line, and neither system is flat, but comparing MLlib with VW we see roughly similar weak scaling. I should note that this plot is normalized to show relative wall time: you look at how fast things are at the first point and then at how much slower they get as you scale, so it tells you nothing about absolute wall time. On the right you see that we are, of course, much faster than MATLAB, which can't take advantage of the distributed cluster at all, and we're within roughly a factor of two of VW's wall time.

In terms of strong scaling, we used a fixed dataset of 50,000 images, again very wide, with the same features. On the left you see that we actually scale slightly better than VW — we're not quite linear, as we might like, but we scale better — and as the cluster gets bigger we eventually become faster than VW in terms of wall time. And again, it's not really fair to compare against MATLAB, since it can't run in a distributed fashion, but even against MATLAB on a single machine, both we and VW were faster.

Next, ALS — alternating least squares — a common algorithm for collaborative filtering. These are wall-time results; we have older scaling results, but this is a new version of the algorithm for which we don't have extensive scaling results yet, though the initial numbers are quite encouraging. The dataset is a scaled version of the Netflix data: we took the Netflix dataset and replicated it in both rows and columns to get a final dataset that's nine times as big. It's synthetic, but it has the same patterns of observed entries as the original, real data. We ran on a nine-node cluster, and what we see is that MATLAB takes over four hours, Mahout is roughly four times faster, but GraphLab and MLlib are much, much faster still: MLlib is about an order of magnitude faster than Mahout, and within a factor of two of GraphLab. So these results are quite encouraging.
The final thing I want to cover for MLlib is deployment. In all the previous experiments, we assumed that the cluster already has whatever software we need installed and that the data is formatted so it can be read by the different libraries being tested. It turns out that the amount of work to get to that point varies a fair bit by system. For VW and GraphLab there's data preparation that's specific to each system before you can even run the code, and setting them up on a cluster is not trivial — neither provides much tooling to help with that. In contrast, because MLlib runs on Spark, we can read directly from HDFS, and we can launch a cluster, compile, and run with just a few commands using the scripts that ship with Spark. And that's the end of my part — Evan is now going to tell you about MLI.

[Evan Sparks:] Thanks. In the first part of the talk Ameet told you about MLlib, this base kernel layer of machine learning; I'm going to talk about MLI, which is our developer API, and then about the ML Optimizer, which is the end-user-facing component of our system. Coming back to the landscape slide we showed you: MLlib sits out there, a little easier to use than the state of the art in cluster machine learning, and I bet you can guess where this next layer is going to go — toward the easier-to-use corner. MLI is a higher level of abstraction than writing vanilla Spark; the goal is to give users data structures and operations that are familiar when processing data and writing machine learning algorithms.

Before we get to that, let's revisit the current options. Again, you've got systems like MATLAB or R that are single-node and in-memory. How do people prototype and then outgrow these things? You probably go to MATLAB or Python or something like that first, because they're easy to use — nice high-level languages, really good for writing research papers or prototyping — but they're not so good when you're ready to scale up to a full cluster. And then there's this additional problem: you've written a new algorithm in MATLAB, and now some poor engineer has to take it and re-implement it in C++ on top of some distributed runtime, and things can get lost in translation — this is something we've heard again and again from companies that have gone through it in production. The other option is to say: I know I have a big data problem, I want to do machine learning at scale, so I'll start with one of these scalable systems. They're definitely scalable, they're sometimes pretty fast, and these three systems in particular are all freely available, out there, open source — but they're pretty difficult to set up and pretty difficult to extend.

I'd like to offer a couple of anecdotes that back up this story. First, an example of the MATLAB case: Ameet was working on a paper a few years ago on a distributed algorithm for solving the Netflix problem — matrix completion, collaborative filtering. Even though it was a distributed algorithm at heart, they decided to prototype it in MATLAB, and that was totally sufficient for the paper; they didn't even need to run it in a distributed setting to show that it worked.
That was great, until somebody said: okay, can we actually run this on our cluster? So they said, sure. I won't bore you with all the details, but it was a mess: you have to download the MATLAB compiler runtime, you have to hack together a bunch of cluster job-launching scripts, and in the end they wound up with a much more complicated system than they would have liked.

The second example goes back to starting with one of these scalable systems. We were writing a paper describing MLI, and we wanted to make sure we were doing an apples-to-apples comparison in our experiments. There's this thing called early stopping that you can do with the ALS algorithm: in theory it should be about three lines of code where you check whether some convergence criterion is met and break out of the loop if so. We just wanted to check whether Mahout was doing that, and in reality, to find out, we had to sift through seven different files and a thousand lines of code just to trace the code path and see whether it did this one thing. You think about extending a system like that, and it gets kind of scary.

The insight from all this is that we really want good programming abstractions for machine learning at scale: we want to work with the familiar mathematical objects we're used to — tables for working with data, matrices for working with numbers — for the core parts of our algorithms. So MLI provides three common abstractions: MLTable, a distributed relational table that supports the operations you'd expect; a linear algebra library over local partitions of data; and a set of distributed convex optimization primitives. I'll talk about all three next, but to give you an early payoff: in the example of Ameet starting in MATLAB and ending up with that crazy distributed prototype, he could have started with MLI — had it existed two or three years ago — and written his whole algorithm in about 50 lines of Scala. We think that's a pretty big win. Similarly with ALS: that early stopping criterion really is three lines long, and the whole algorithm is roughly 40 lines total, all in one file — no chasing your tail to get there.

Now, lines of code is not the best metric, of course — we're building in Scala, which is a very expressive language — but we think it's as good a proxy as any for how hard a system will be to develop and maintain. The gold standard for developing machine learning algorithms is MATLAB: ten to twenty lines of code for these two algorithms, logistic regression and alternating least squares. Look at the state-of-the-art distributed systems for the same two algorithms and you're writing almost two orders of magnitude more code. In MLI, it's about fifty lines for logistic regression — not quite as tight as MATLAB, but still pretty good — and only about thirty lines for ALS. So we're pretty encouraged by this in terms of ease of use.
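Coming back to the early-stopping point for a moment, here is a minimal, self-contained sketch of the kind of three-line check being described. It is not the actual MLI or Mahout ALS code: the per-iteration work is faked with a toy error sequence, and in a real implementation step() would perform one round of alternating least-squares updates and return the current training error.

    // Self-contained sketch of early stopping; the "algorithm" is simulated.
    object EarlyStoppingSketch {
      def main(args: Array[String]): Unit = {
        val errors = Iterator(1.00, 0.50, 0.30, 0.25, 0.249) // stand-in per-iteration errors
        def step(): Double = errors.next()                   // would be one ALS round for real

        val tolerance = 1e-2
        var prev = Double.MaxValue
        var converged = false
        while (!converged && errors.hasNext) {
          val err = step()
          // The early-stopping check itself -- the "three lines" in question:
          if (prev - err < tolerance) converged = true
          prev = err
        }
        println(s"stopped at training error $prev")
      }
    }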
Let's dig into the details a bit more. One of the motivating examples I like to give is this: when we started, we wanted to build distributed machine learning algorithms — so what's the obvious thing to do? Your algorithms have to take feature vectors as input, and the obvious starting point is: my feature vector is an array of doubles packed into an RDD. That's pretty good. Then you want to do dot products, and, okay, Spark provides a handy Vector class that does dot products for you. But then you want to take advantage of hardware support, so you link in the Breeze library, and now things are less tidy. And then you decide you want sparse matrices that are GPU-accelerated, so you link in yet another library. The problem we saw is that all of these objects exist to do the same thing: to provide mathematical abstractions on top of your data. So we said: all of the algorithms should take a single MLTable object as input, and then we can worry about optimizing the back end separately, going forward. The story is that MLI — the MLTable and the other pieces I'll describe — provides a generic interface for feature extraction and data manipulation, a common interface our optimizer can sit on top of, and the ability to abstract away the underlying implementations, so the system can choose the best implementation of, say, a dot product, or the best representation for sparse matrices.

The first piece of the API is this concept of an MLTable. It's a distributed relational table. It supports loading from a variety of common data formats — CSV, JSON, XML, and others. It supports heterogeneous data across columns: like a regular database table, every column has a single type, but the types can differ across columns. It has support for missing data. Feature extractors are expressed as transformations on these tables — they take a table in and spit a table out, or combine tables, that sort of thing. And it supports both MapReduce and relational operators. This isn't a brand-new idea — it's inspired in large part by data frames in R — but we give it to you distributed, on Spark.

Here's the feature extraction example again — very similar to the code Ameet showed before — but I want to highlight this view of feature extractors as transformations on tables. From this perspective: first we start with a big flat text file on HDFS; then we run n-gram extraction over each document in that corpus and keep the top 30,000 bigrams; then we compute the tf-idf statistic, which is highly useful in information retrieval; and at that point the data is in a form that's ready to feed into a machine learning algorithm — that's the featurized table object.
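To give a sense of what those couple of table transformations are replacing, here is roughly what the bigram-plus-tf-idf step looks like if you hand-roll it with plain Spark RDD operations. This is a sketch, not MLI code: the path and tokenization are made up, and the "keep only the top 30,000 bigrams" step is omitted for brevity.

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._  // pair-RDD operations in older Spark versions

    object BigramTfIdfByHand {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[4]", "tfidf-by-hand")
        // One document per line; give each document an id.
        val docs = sc.textFile("hdfs://path/to/docs.txt").zipWithIndex().cache()

        // Term frequency of each bigram within each document.
        val tf = docs.flatMap { case (text, docId) =>
          val tokens = text.toLowerCase.split("\\W+").filter(_.nonEmpty)
          tokens.sliding(2).map(pair => ((docId, pair.mkString("_")), 1))
        }.reduceByKey(_ + _)

        // Document frequency of each bigram across the corpus.
        val df = tf.map { case ((_, gram), _) => (gram, 1) }.reduceByKey(_ + _)
        val numDocs = docs.count().toDouble

        // tf-idf weight for each (document, bigram) pair.
        val tfidf = tf.map { case ((docId, gram), f) => (gram, (docId, f)) }
          .join(df)
          .map { case (gram, ((docId, f), d)) => ((docId, gram), f * math.log(numDocs / d)) }

        tfidf.take(5).foreach(println)
      }
    }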
The second component of the API is what we're calling MLSubMatrix: linear algebra operations on local partitions of your data. One of the first pieces of feedback we get is: why only local linear algebra — why not a distributed matrix-matrix multiply? We certainly could provide a distributed matrix-matrix multiply, but the reason we don't is that algorithms which rely on inverting a huge matrix, or multiplying two very large matrices, fundamentally don't scale — you're looking at polynomial time in the size of a huge matrix. It turns out there are lots of ways to approximate the computation you actually care about with common convex optimization methods, without letting users fall into these very expensive operations. So, at least for starters, we're encouraging users to do their linear algebra on local partitions of the data and then combine the partial results — very much in the spirit of data parallelism. Out of the gate we plan to support both sparse and dense matrices as first-class objects.

Looking at how you'd actually write code against these data-parallel matrices: there's a lot of code on this slide, so don't try to read too far into it, but at a basic level we have a big distributed data object, we run a map over the individual sub-matrices that make up that object, and once we're operating on those local partitioned matrices we can do the familiar transposes, matrix-matrix multiplies, and solves on the local pieces, with the distribution handled for us.

The third component of the API is what we're calling, for now, MLSolve. This is a suite of convex optimization algorithms. In the first release we plan to support stochastic gradient descent and parallel gradient descent — variants thereof — and we're working on quasi-second-order methods such as L-BFGS, as well as ADMM and AdaGrad, which are somewhat more sophisticated ways of getting to the same answer, possibly faster. The key point is that these convex optimization operators sit at the heart of many machine learning algorithms: whether it's logistic regression or linear regression you're solving, often the best way to solve it in a distributed context is SGD, and we don't want developers to reinvent the wheel every time they need SGD. Instead we give them a high-quality implementation of that convex optimization primitive; all they have to supply is a gradient function, which — along with the data — is what the solver takes as input, and they can use that to train their algorithm. Different machine learning algorithms have different gradient functions, but the solver itself is given to you once.

So, coming back to what's in MLI today: we've got everything provided by MLlib, of course — we fall through to those high-quality kernels inside Spark — and we're adding several new algorithms and optimizers. On the slide, the items in brackets, in red, are the things we're still working on; the items not in brackets are what we plan to include in the first release. Importantly, we're also including several common feature extractors — PCA, the n-gram extractor, normalizers, that sort of thing — as well as the tools you need to actually solve your ML problem in a reasonably foolproof pipeline: cross-validation and other evaluation metrics to judge the quality of your models.
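To make the MLSolve idea concrete — a generic solver that only needs a per-example gradient function — here is a minimal, runnable sketch using plain Spark RDDs rather than the MLI classes themselves. The logistic-loss gradient, toy data, and step size are illustrative; in each iteration the per-example gradients are computed on the distributed partitions and summed, and the driver takes a gradient step.

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    object SolverSketch {
      type Example = (Double, Array[Double])  // (label in {0,1}, features)

      // The piece an algorithm author would supply: the gradient of the
      // logistic loss for one example at the current weights.
      val logisticGradient: (Array[Double], Example) => Array[Double] = (w, ex) => {
        val (y, x) = ex
        val margin = w.zip(x).map { case (wi, xi) => wi * xi }.sum
        val p = 1.0 / (1.0 + math.exp(-margin))
        x.map(_ * (p - y))
      }

      // The generic "solver": batch gradient descent where each iteration sums
      // per-example gradients across partitions and steps on the driver.
      def gradientDescent(data: RDD[Example], dim: Int, iters: Int, step: Double)
                         (grad: (Array[Double], Example) => Array[Double]): Array[Double] = {
        var w = Array.fill(dim)(0.0)
        val n = data.count().toDouble
        for (_ <- 1 to iters) {
          val total = data.map(ex => grad(w, ex))
                          .reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })
          w = w.zip(total).map { case (wi, gi) => wi - step * gi / n }
        }
        w
      }

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[4]", "solver-sketch")
        val data = sc.parallelize(Seq[Example](
          (0.0, Array(1.0, 0.2)), (1.0, Array(1.0, 1.5)), (1.0, Array(1.0, 2.3))
        )).cache()
        val w = gradientDescent(data, dim = 2, iters = 100, step = 1.0)(logisticGradient)
        println(w.mkString(", "))
      }
    }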
Now that we've talked about the lower two levels of the stack — MLlib and MLI — I want to tell you about ongoing work at Berkeley on the optimizer. This is what makes life a lot easier for the end user who actually wants to use all this stuff. An end user comes to us with a machine learning problem: they say, I just want to build a classifier for something; I've got my data; I know I want to predict who's going to click on this website, or something like that. That's what they want to do — but what do they actually have to do to do it today? First you have to learn a bunch of algorithms; you potentially have to learn one of these distributed frameworks, if your data is big enough, or maybe you learn about sampling; you've got to implement three or four of these algorithms; maybe write some search code to search over them and their parameters; and on and on — it's a big mess, and probably the first time you do it you don't know what you're doing and have to ask somebody for help. That's a pretty crummy experience.

So what would you really like to do? We take a page out of the databases textbook. We've got SQL, which is a way of telling the system what you want — issuing a query against the system and getting back a result. You don't tell the system "this is the particular join implementation I want you to use" or "this is the way I want you to rewrite my query"; the system handles all of those details for you. Similarly, we want a declarative way for users to specify machine learning tasks — doClassify, give me a model — and the system returns a model, where the model is an object you can use to make predictions on new data as it comes in. So, coming back to this example of supervised classification, the important thing to note is that it's totally algorithm-independent: the user points at their data and says "doClassify on my data", and they don't have to worry about logistic regression versus some other model family, or all the crazy hyperparameters they'd otherwise have to set to get those things just right — the system does that for them.
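The talk stresses that this interface is still research-stage pseudocode, so the snippet below is only a sketch of the intended shape — the names loadTable, doClassify, and predict are illustrative, not a released API.

    // Illustrative pseudocode only -- not a released MLbase API.
    val data  = loadTable("hdfs://path/to/clicks.csv")  // raw features plus a label column
    val model = doClassify(data, label = "clicked")     // system picks the algorithm, the
                                                        // features, and the hyperparameters
    val yhat  = model.predict(newRecord)                // returned model makes predictions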
So we look at this problem of how to choose among the whole space of possible algorithms and their hyperparameters, and how to get to a decent model reasonably quickly. We think of this as a really big space, and we treat it as a search problem. I won't bore you with too many details, but basically you can think of the system starting to search through this space: maybe it tries boosting algorithms for a little while, then switches to SVMs and decides those look good, and after, say, five minutes it returns the best answer it has come up with in terms of predicted out-of-sample accuracy — that's the model the system gives back to the user.

This also presents opportunities for physical optimization, which from a systems perspective is pretty exciting. Here's an example of something we noticed. Normally, when you're solving machine learning problems, you train models one at a time: you try one, it kind of works, maybe you change it a little, maybe you switch model families — it's naturally a sequential process. But when we're in this regime where we know we're going to search over lots of different models for the same dataset, we notice that we tend to be I/O-bound during model training, so maybe we can use some of the spare CPU that's left over to train multiple models at once. This again comes from databases — the idea of a shared cursor: if one query is running a full table scan and is halfway through, and another query comes in that wants to scan the same table, the second can piggyback on the first and reduce the total number of physical I/Os without changing the number of logical I/Os. Similarly, here we can take a single pass over the data and train many models at once — say, training ten different versions of logistic regression in one scan. We've done this in the context of multinomial logistic regression, and the same trick exists in the k-means example that's in MLlib at the moment.
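Here is a minimal, runnable sketch of that shared-scan idea using plain Spark — not the MLbase optimizer itself. Several candidate logistic regression models (one per step size) are updated together, so each iteration makes a single pass over the data and computes every model's gradient in that one scan; the toy data and step sizes are made up.

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    object SharedScanSketch {
      type Example = (Double, Array[Double])  // (label in {0,1}, features)

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local[4]", "shared-scan-sketch")
        val data: RDD[Example] = sc.parallelize(Seq(
          (0.0, Array(1.0, 0.2)), (1.0, Array(1.0, 1.5)), (1.0, Array(1.0, 2.3))
        )).cache()
        val n = data.count().toDouble

        val steps = Array(0.01, 0.1, 1.0)                      // one candidate model per step size
        var weights = Array.fill(steps.length)(Array(0.0, 0.0))

        for (_ <- 1 to 50) {
          // One scan over the data computes the gradient of EVERY candidate model.
          val grads = data.map { case (y, x) =>
            weights.map { w =>
              val margin = w.zip(x).map { case (wi, xi) => wi * xi }.sum
              val p = 1.0 / (1.0 + math.exp(-margin))
              x.map(_ * (p - y))
            }
          }.reduce { (a, b) =>
            a.zip(b).map { case (ga, gb) => ga.zip(gb).map { case (u, v) => u + v } }
          }
          // Each model then takes its own gradient step on the driver.
          weights = weights.zip(steps).zip(grads).map { case ((w, s), g) =>
            w.zip(g).map { case (wi, gi) => wi - s * gi / n }
          }
        }
        weights.zip(steps).foreach { case (w, s) => println(s"step=$s -> ${w.mkString(", ")}") }
      }
    }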
So what's the relationship between this optimizer and MLI? It sits on top of MLI — it's the top layer of our stack — but what's the actual interface? We have an idea we call contracts: a way for the algorithms in MLI and MLlib to provide a set of metadata to the optimizer that it can use to enumerate the search space. This includes what type of algorithm it is, what problem it solves, what its parameters are, how its runtime relates to the amount of data you throw at it — in terms of the data dimensions — and some information about its input and output: enough for the optimizer to make an informed decision about what to run.

Finally, I want to talk a bit about the release plan. We're working on releasing the first two levels of the stack in just a few weeks, and the optimizer will come later. Before I get to that, I want to point out the contributors — everybody who has worked on this project — and in particular the four students, and former students at this point, who worked really hard on getting this code release ready to go by the end of the summer; we definitely couldn't have gotten here without their help.

In the first release we're including MLlib — the low-level library of machine learning kernels built on top of Spark, part of the Spark standard library, callable from both Scala and Java — and MLI, the API for feature extraction and the development of new ML algorithms. We really hope MLI becomes a platform for scalable machine learning development by the community, and it should also offer a faster dev cycle than getting a full pull request accepted into Spark, which is a pretty high bar at the moment. The second release, coming in the winter, is the optimizer: built on top of MLI, with automatic model selection, and it will include the contracts I mentioned briefly as well as the declarative, query-like language. Additionally, we work pretty closely with a computer vision group at Berkeley, so we want to extend our feature extractors to cover more than just text and numbers and include some vision examples.

Longer term, doing this research and building the system has opened up a lot of doors — new places we could go, and future directions for the system. One question we ask is: what is the basic set of ingredients that goes into a machine learning algorithm — what are the key primitives we need so that we can optimize these algorithms all the way down to the metal? You can think of this as asking what the relational algebra for machine learning is; that's definitely something we're interested in. We're also thinking about unifying the language that end users and developers use, so that a user could code up a new machine learning algorithm and then run it at scale, all from the same environment — we'd certainly like that experience for our users. We'd also like to expose the system to other environments: if you're running R or Python today, we want you to be able to call into your cluster and get back a model in the form your analysis platform expects. Then visualization: how do we enable interactive guidance of the system during learning, particularly for unsupervised problems? How can users tell the system whether a model looks good or bad? Well, we have to be able to show them what it looks like first. And beyond that, we're only scratching the surface of what's available in machine learning today — people have asked us about time series algorithms, graphical models, more advanced optimization techniques — all things we definitely plan to get to, but they're future work at the moment.

That's about it for the talk. I just want to say that contributions are highly encouraged — our URL is down there on the left — and if this stuff is exciting to you, please leave your name and contact info and we'll let you know as soon as our code is pushed out to the wider world. Additionally, AMP Camp 3 is coming up, August 29th and 30th; it's an event put on by the AMPLab and a good opportunity to get your hands dirty with the rest of the Berkeley Data Analytics Stack in addition to MLI and Spark. So thanks for your time — Ameet and I will be taking questions, so feel free.

[Moderator:] That was great. Thanks again to Twitter for providing an awesome space, and thanks to the two awesome speakers. As you can see, we're outgrowing our venue — this Spark meetup is growing and we're looking for future meetup locations, so if you have something in mind, please come talk to me or to Patrick afterwards. All right, now questions.

Q: Why bet on Spark?
A: Let me repeat the question for the recording: the question is, why bet on Spark? There are two answers. The first is that we think Spark is the best thing since sliced bread — truly, for machine learning and data analysis in particular, the distributed-memory paradigm lets you work on bigger datasets than fit on one machine while still allowing interactive response times, and we end up with algorithms that are as fast as the state of the art with very little code. The second answer is that we haven't bet entirely on Spark: for the upper layers of the stack, we'd eventually like to be able to plug into other backends — GraphLab and so on — since those upper levels are layers of abstraction over the underlying system.
Q: What problems have you found this first release useful for solving, and how many other problems are out there?
A: I would say we definitely looked at which problems are common and tried to cover that subset, but thankfully we didn't design this in a vacuum: we talked to a lot of people, and many of them have fairly narrow but straightforward problems that do fit into classification, regression, or collaborative filtering. By no means is this exhaustive — there are many other directions we can go — but we've lined up a number of people, inside and outside Berkeley, who are interested in using it and whose specific problems it fits.

Q: One of the problems you mentioned with MATLAB and Python was the deployment story — getting the libraries in place, that sort of thing. What are your thoughts on deploying this to a cloud platform like Amazon's? What's the tooling story?
A: Sure. To start with, we're piggybacking a lot on Spark and what it already provides, and in our experience it's pretty easy to get things set up on EC2 that way.

Q: MLbase is kind of an umbrella — for your researchers and students, what's the priority: getting the optimizer out the door, or the alternative of growing the libraries?
A: I would say both, but you're right that time is limited, and with the few people working on it right now we need to grow the team — that is, recruit more people to work on it, especially from the open-source community, to help MLlib and MLI grow — while we also want to show that some sort of first-pass optimizer can actually work. I think it's possible to do both, assuming we restrict the scope enough, but getting more people involved is also a way around that constraint.

[Moderator:] We're recording this, so if you don't have a mic your voice won't register in the video — please signal to me if you want to ask a question.

Q: Great talks. I have a question about the ML language itself. We've seen one snippet of it, and it looks like a piece of Scala to me. Given everything people think about SQL and all the feelings that have evolved around it, the key questions about this declarative language are: will it be expressive enough for common cases, will it be specialized enough for power users, and can we see more examples of it? How far along are you in thinking about what this language looks like?
A: Right — I would say we're not far enough along yet to show many examples of it. The work so far has focused less on the language and more on the optimizer itself: how we can search through the space quickly and either get a good answer, or quickly figure out that there is no good answer for the data and problem at hand. So, as I tried to stress, the language is something we have a high-level idea about, but we haven't pinned it down, and we'd love feedback from people about what they'd want to see.
Q: Thanks again for the talk. When it comes to the ML Optimizer, are there examples from non-distributed settings that you can draw on as inspiration, to give us a sense of the upcoming challenges for this kind of black-box "I just want to classify this text" approach? Has it been tried before in simpler computational settings?
A: Yes — the problem of model selection and model search has been studied by the statistics community for a long time. Going from big raw text to automatic featurization and feature selection is definitely a hard problem and a big challenge. What we're shooting for right now is to reproduce the common ML workflow — the thing a data scientist, or somebody like Ameet, does day to day: all right, I know I have to normalize my features; I have to throw out the stuff with too many outliers; then I want to cross-validate everything. It's this big pipeline that everybody seems to approach in a similar way, and we think a lot of it can be automated. That's the tack we're taking at the moment.

Q: My question is about the items you identified as future work. There are a couple of items in there that we're interested in using now. Would you be willing to support us — answering questions we might have — so we can build it ourselves?
A: Yeah, I think that if there's an item on that future-work list that a member of the community is really excited about and wants to contribute back to the project, we'd certainly like to talk with you about your thoughts and your approach, help if you run into issues with the system, and provide some guidance — definitely. Feel free to shoot me, or any of us, an email.
Q: I understand how open source works, but many of us do closed-source development, and if something isn't yet ready, from our perspective, to be given back as open source, we'd still like to understand from your side — any documentation or guidance you could provide about how to go about it — and then later it might be ready to give back.
A: Sure, yeah — I think that makes sense.
