Analyzing Optimization and Generalization in Deep Learning via Trajectories of Gradient Descent



Okay, without further ado: the first talk will make the case that we should think about the trajectories of optimization algorithms, and not only the properties of their stationary points.

Thanks. The talk is "Analyzing Optimization and Generalization in Deep Learning through Trajectories of Gradient Descent". I'll start with a recap of, or a perspective on, what the basic questions are for optimization and generalization in deep learning, then state the main argument I'm trying to make in this talk; all the rest of the time will be devoted to trying to make the case for that argument.

In supervised learning, when we talk about optimization we mean fitting the training data by minimizing some objective, a loss function. Generalization refers to controlling the gap between our performance on the training data and on the test data; this could be achieved, for example, by adding a regularization term to the objective, or a constraint.

In classical machine learning the basic theme starts from the fact that the objectives are convex: we usually choose models and problems such that the objective will be convex. There is usually one global minimum that is efficiently attainable, and the choice of algorithm affects the speed at which you get to that minimum, but no matter what sensible algorithm you use, you will get to the same solution, so the statistical properties of that solution are, generally speaking, irrelevant to the algorithm. In terms of generalization there is the famous bias-variance tradeoff, which, qualitatively speaking, states that if I apply more regularization then the gap between train and test error decreases but the training error goes up; I care about the sum of these two things, the test error, hence there is a tradeoff. On the other hand, if I make my regularization weaker, then the gap opens up but the training error goes down. All in all, I think it's fairly safe to say that we have a reasonable understanding of what goes on here.

In deep learning the situation is very different. We allow the objectives to be non-convex, and there is not just one global minimum, there are a lot of them, and a priori it is not clear that one could even reach them efficiently. But somehow, on typical problems, different variants of gradient descent seem to reach one of these minima, and we don't really understand why. For generalization the situation is even more peculiar: we know, at least empirically, that some global minima generalize well, they have low test error, but others generalize very poorly, and for some miraculous reason, on typical problems with typical data, if you use gradient descent or variants thereof you often get to a solution that generalizes well. Maybe most surprising is the fact that there seems to be no explicit bias-variance tradeoff; Peter talked about this: we can train huge models with a small number of examples, without applying any explicit regularization, and they still generalize. Overall, I think most people here would agree that we don't really have a good understanding of these questions, and that is why we are here.

The argument I will try to make today is that maybe the language we developed for classical learning theory is not sufficient for understanding what goes on in deep learning, and that in that case a potentially fruitful direction is to look closely at the course of learning, and that means the trajectories of gradient descent.
Study what happens throughout the optimization process. I will try to make the case for this argument through deep linear neural networks; that is going to be the basic theme of the talk. So without further ado, let's move to the actual content.

The material I'll present is based on three papers: one from the last ICML, one from this year's ICLR, and another one that is under review. My esteemed collaborators are Sanjeev and Elad, the graduate students Wei Hu and Yuping Luo, and Noah Golowich, an undergrad who visited us last summer.

So what is a linear neural network? A linear neural network, in deep learning jargon, is a fully connected neural network with linear activation. It is simply a mapping from input to output that is linear, but parameterized as a sequence of matrix multiplications. From a representational perspective it only realizes linear mappings, so there is nothing special about it, but in terms of optimization and generalization this simple model is highly non-trivial, and it exhibits a lot of behaviors that are similar to what we see with nonlinear deep networks. For this reason it is often viewed as a kind of theoretical surrogate, and a lot of people are looking into it, including a lot of people here; this is just a sample of papers from the last couple of years.

What we are going to start with is a summary of an analysis of the trajectories of gradient descent, or gradient flow to be exact, over linear nets, and then we are going to use those results for analyzing optimization and generalization. Along the way we will see that the trajectory analysis allows us to circumvent problems one encounters when adopting more traditional viewpoints.

Gradient flow is a continuous version of gradient descent; it is what you get when you take the step size to be very, very small. It is a differential equation, a curve in parameter space: at every point your movement is in the direction opposite to the gradient. Gradient descent takes discrete steps; gradient flow is this continuous curve. It is useful because continuity lets you bring in a lot of mathematical machinery, and we will see how some of the results we derive translate from gradient flow to gradient descent, so gradient flow is a theoretical stepping-stone.

Now let's start talking about the trajectories of gradient flow over linear neural nets. The kind of objectives we are interested in have the following form: L is some loss over linear models, say the ℓ2 loss or cross-entropy, and usually it is convex. This loss immediately induces an overparameterized objective φ(W1, ..., WN) over a linear net: to evaluate it you just multiply the matrices by one another to get the end-to-end mapping and plug that into the original loss. So L is usually convex, while φ is highly non-convex.

Here is a definition; it has profound implications, but for now it comes out of nowhere. We say that the weights W1, ..., WN are balanced if the following equality holds for every j: W_{j+1} transpose times itself equals W_j times W_j transpose. This is just a definition but, as I said, it has profound implications. The reason it is reasonable is that it holds either exactly or approximately under fairly common initialization schemes: if you initialize close to zero it is almost met, and if you initialize to identity, as in linear residual networks, it is met exactly.
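To pin down the objects just described (my notation; this is how I understand the setup from the papers the talk is based on), the induced objective and the balancedness condition are:

```latex
% Overparameterized objective induced by a convex loss L over linear models:
\phi(W_1, \dots, W_N) \;=\; L\big( W_N W_{N-1} \cdots W_1 \big)

% Balancedness of the weights: for every j = 1, \dots, N-1,
W_{j+1}^{\top} W_{j+1} \;=\; W_j W_j^{\top}
```

Near-zero initialization satisfies the second condition approximately (both sides are close to zero), while identity initialization satisfies it exactly.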
What we show is that gradient flow over linear nets preserves balancedness: if the weights are initialized to be balanced, they are going to stay that way throughout the entire optimization trajectory. So gradient flow does not move arbitrarily in space; it follows certain trajectories, and in particular balanced trajectories if you initialize in these common ways.

This result allows us to address the following question. Suppose we optimize the linear neural net with gradient flow; we can ask, at every point in time, what the equivalent linear model is. We are optimizing over W1 through WN, and we ask what happens to the end-to-end matrix, the one that represents the entire mapping: how does that matrix move in space? It does not follow gradient flow over the loss L, because we are not running gradient flow on the loss; we are overparameterizing with a linear neural net and running gradient flow on that. So how does the end-to-end matrix move?

The result we show, using the balancedness property, is the following: if the weights are balanced at initialization, then the end-to-end matrix follows what we call the end-to-end dynamics. If you ignore the blue term, you just get gradient flow: the end-to-end matrix (vectorized, i.e., arranged as a vector) moves according to the gradient with a minus sign. The blue term is what manifests the fact that we used a linear neural net: it is a preconditioner, a positive semidefinite matrix, and it actually has a closed-form expression. I am showing it only to give a sense that it is not that complicated; please don't try to parse it.

Intuitively, what this preconditioner does is promote movement in directions you have already taken. What do I mean by that? The preconditioner depends on the end-to-end matrix, and the singular directions with large singular values are the ones in which the gradient is stretched more significantly. So if I start off near zero and wander somewhere in space, then in the directions where I have traveled a long distance the gradient is going to be amplified more significantly. That is an intuitive way to describe what goes on, and it is possible to formalize it. The bottom line is that if we take a linear model and add redundant linear layers, turning it into a linear network, we are implicitly inducing a preconditioner that promotes movement in directions already taken. All of the results we are going to derive for optimization and generalization rely on this theorem.
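For reference, here is the closed-form expression for the end-to-end dynamics as best I can reconstruct it from the underlying ICML paper, writing W_e = W_N ⋯ W_1 for the end-to-end matrix; the sum of fractional matrix powers plays the role of the preconditioner (the "blue term"), and it reduces to plain gradient flow when N = 1:

```latex
\dot W_e(t) \;=\; -\sum_{j=1}^{N}
  \big[ W_e(t)\, W_e(t)^{\top} \big]^{\frac{j-1}{N}}
  \,\nabla L\big( W_e(t) \big)\,
  \big[ W_e(t)^{\top} W_e(t) \big]^{\frac{N-j}{N}}
```

In vectorized form the bracketed fractional powers combine, via Kronecker products, into a single positive semidefinite matrix acting on the vectorized gradient, which is the preconditioner referred to in the talk.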
Alright, let's now move to optimization. This was going to be about half the talk, but I will try to focus more on generalization because that is the newer material. (In answer to a question: yes, we also have results for the case where the weights are only approximately balanced; I will get to that.)

Before we talk about using our trajectory analysis for optimization, let me say a word about the alternative, classic approach. One of the most prominent approaches for analyzing optimization in learning nowadays is to look at the geometry, specifically to characterize the curvature at critical points. There are different types of critical points: a local minimum that is good, where the objective is close to the global minimum; a local minimum that is bad, where the objective is much worse; a saddle with a negative-curvature direction, so there are directions of significant descent; or a non-strict saddle, which is very flat, so you have to walk a significant distance before the descent starts; formally, the Hessian there has no negative eigenvalues.

A result shown a couple of years ago, by different groups and in different flavors, many of them sitting here, states, if you sweep away a lot of the details, the following: if there are no bad local minima, so no points of the second type, and there are no non-strict saddles, no points of the fourth type, then gradient descent will eventually reach a global minimum; basically it is not going to get stuck at strict saddle points. This is fairly encouraging: you can hope to show convergence to the global optimum just by looking at the geometry of critical points, without considering the interplay between the model and the algorithm and things like that. Indeed, many works have focused on establishing these properties, no bad local minima and no non-strict saddles, and you might hope that it is possible to prove convergence to the global optimum for deep linear nets using this approach.

It turns out, however, that this kind of approach hits a wall right at the start, because if you are interested in deep models, models with three or more layers, then property two is just flat-out false: it is not true that there are no non-strict saddles. For example, when all the weights are zero, all the second derivatives are zero as well, and there you go, that is a non-strict saddle. So you can't really hope to use this kind of approach as is.
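As a quick sanity check of the claim that the origin is a non-strict saddle once the depth is at least three, here is a minimal numerical sketch (my own toy example, not from the talk) for a scalar depth-3 "net" with prediction a*b*c*x and squared loss: both the gradient and the Hessian vanish at zero, even though zero is not a minimum.

```python
import numpy as np

# Toy depth-3 scalar linear net: prediction = a*b*c*x, squared loss against label y.
x, y = 1.0, 1.0
loss = lambda w: 0.5 * (w[0] * w[1] * w[2] * x - y) ** 2

def numerical_grad_and_hessian(f, w, eps=1e-4):
    """Central finite differences for the gradient and Hessian of f at w."""
    n = len(w)
    grad = np.zeros(n)
    hess = np.zeros((n, n))
    for i in range(n):
        e_i = np.eye(n)[i] * eps
        grad[i] = (f(w + e_i) - f(w - e_i)) / (2 * eps)
        for j in range(n):
            e_j = np.eye(n)[j] * eps
            hess[i, j] = (f(w + e_i + e_j) - f(w + e_i - e_j)
                          - f(w - e_i + e_j) + f(w - e_i - e_j)) / (4 * eps ** 2)
    return grad, hess

g, H = numerical_grad_and_hessian(loss, np.zeros(3))
print("gradient at 0:", g)                                   # ~0: the origin is a critical point
print("Hessian eigenvalues at 0:", np.linalg.eigvalsh(H))    # ~0: no negative curvature either
print("loss at 0:", loss(np.zeros(3)), "(global minimum value is 0, e.g. at a=b=c=1)")
```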
What we are going to do instead is leverage our trajectory analysis. Recall the preconditioner and the end-to-end dynamics we derived. I did not show this, but it is pretty easy to see that if the end-to-end matrix is full rank then the preconditioner is positive definite, meaning it is not going to kill the gradient, so the descent continues and the loss keeps decreasing until one of two things happens: either the gradient is zero, in which case obviously there is no movement, or the gradient is not zero but the preconditioner killed it, which is only possible if the end-to-end matrix was singular. Usually the loss L is convex, so reaching zero gradient means you have reached the global minimum. We therefore get the following corollary for gradient flow: if the loss is convex, and I initialize the linear net such that the loss at initialization is better than the loss of any singular matrix, so that by going downhill I am never going to hit a singular matrix, that is condition one, and condition two is that the weights are balanced, then gradient flow will converge to the global minimum. This is a convergence result for gradient flow under two assumptions on initialization; there is no notion of computational efficiency, obviously, because this is a continuous process.

In practice we are interested in computational efficiency, so what we do is translate this result from gradient flow to gradient descent. Take the corollary for gradient flow; I will now wiggle a little with its conditions, writing them differently without changing them. Instead of saying that I want the loss at initialization to be better than the loss of any singular matrix, I will say that I want it to be better than the loss of any matrix with a zero singular value, which is exactly the same thing. And plugging in the definition of balancedness, instead of writing that the two matrices are equal, I will write that the norm of their difference is zero; that does not change anything either. Now we slightly modify these conditions and get a theorem for gradient descent; I will just show you what we prove. First, the gradient descent result is proved for the ℓ2 loss; it is possible to extend it to any strongly convex loss. Second, we define discrete versions of the two conditions: instead of requiring the loss at initialization to be better than the loss of any matrix with a zero singular value, I require a bit more, namely that it be better than the loss of any matrix with a singular value smaller than some c; and instead of requiring perfect balance, the norms being zero, I allow them to be small, not exactly zero, smaller than something that scales like c squared. Under these two conditions, gradient descent with a step size that scales like c to the fourth gives you a linear rate of convergence, that is, exponentially fast convergence: the loss at iteration t scales like e to the minus a constant times t.

So we have a result giving linear-rate convergence of gradient descent over arbitrarily deep models, under two assumptions on initialization. What can we say about these assumptions? First, we know that in some sense both are necessary, meaning you cannot just erase one of them and hope the theorem still holds; it does not, and it is easy to come up with counterexamples: if you erase either one, I can easily construct an example where convergence fails or the iterates even diverge. So you cannot simply remove them; maybe you can weaken them. Second, we show that for output dimension one, if you use a random initialization similar to what people usually use, but not exactly, without getting into the details we call it random balanced initialization, then with constant probability both conditions are met and you get linear-rate convergence to the global minimum. And by constant I don't just mean some positive constant; I mean 0.25, close to 0.5. So you use this initialization, with no further assumptions, and with probability close to one half you get linear-rate convergence to the global optimum.

What we have here, then, is a result showing efficient convergence of gradient descent on a compact model, one you could actually fit in your computer, and the model is allowed to have three or more layers, an arbitrary number of layers; to my knowledge this is the most general guarantee to date of this particular kind. (In answer to questions: what does the condition mean? It is a condition on how good your loss is at initialization; you need to be slightly better than matrices that are almost singular. As to how reasonable that is, the constant-probability result is supposed to be the answer: 0.25 is good, you only need to be better than chance. For higher output dimensions, the probability becomes something like 0.5 times a quantity raised to the minus output dimension, so there is still a positive constant probability, but it depends on the output dimension in an exponentially bad way, which is not that impressive; though even for output dimension one, to my knowledge, there are no comparable results. And no, it is all in the end-to-end space; if the output dimension is one that situation can never happen, but these conditions are not going to be met if you have a bottleneck layer that is narrower than the input.) With this analysis I am trying to circumvent some of those difficulties.
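To make the flavor of the gradient descent guarantee concrete, here is a small illustrative sketch (my own toy setup, not the construction from the paper): a depth-3 linear net with identity initialization, which is exactly balanced, trained by gradient descent on an ℓ2 loss towards a deliberately non-degenerate target, so that a margin-type condition plausibly holds. The printed loss should decay at a roughly geometric (linear) rate; the hyperparameters are hand-picked and may need tuning.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, eta = 4, 3, 0.01
Phi = np.eye(d) + 0.3 * rng.standard_normal((d, d))   # deliberately non-degenerate target
Ws = [np.eye(d) for _ in range(N)]                     # identity init: exactly balanced

def end_to_end(layers):
    W = np.eye(d)
    for Wj in layers:
        W = Wj @ W                                     # W_e = W_N ... W_1
    return W

for t in range(501):
    We = end_to_end(Ws)
    if t % 100 == 0:
        print(f"iter {t:3d}   l2 loss {0.5 * np.sum((We - Phi) ** 2):.3e}")
    G = We - Phi                                       # gradient of the convex l2 loss at W_e
    new_Ws = []
    for j in range(N):
        above = end_to_end(Ws[j + 1:])                 # W_N ... W_{j+1}
        below = end_to_end(Ws[:j])                     # W_{j-1} ... W_1
        new_Ws.append(Ws[j] - eta * above.T @ G @ below.T)   # chain rule for layer j
    Ws = new_Ws
```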
Okay, so this was possible thanks to our trajectory analysis; just looking at the geometry, the classic approach, would not have given us this result. What we are going to see now is something I think is potentially more interesting, and it testifies even more to the importance of trajectories: the effect of depth on optimization. The classical view is that convex optimization is much easier than non-convex, so optimizing a linear model is easier than optimizing a deep net, because the latter gives you a non-convex objective and the former a convex one. Through our trajectory analysis we are going to see that this is not always the case: sometimes it is easier to optimize the deep parameterization.

I won't get into too many details of how this works, because I want to get to the generalization part; I will just state the result. Here are our end-to-end dynamics in the discrete version, so discrete updates, and here is our preconditioner. What we show is the following: for any p greater than two there exist settings where the loss is the ℓp loss, based on the p-norm, raised to the power p, of the difference between the true label and the prediction, which is a convex loss, such that the end-to-end dynamics reach the global minimum much, much faster than gradient descent. What this means is that there exist cases where I could run gradient descent over a linear model, optimizing a convex objective, or do something crazy, namely replace the linear model by a deep linear network and run gradient descent on that, and the latter option gives much faster convergence. This is non-intuitive and goes against the conventional wisdom, but it might not be practical: the fact that such cases exist does not mean we encounter them in practice; the proof cooks up a certain data set that makes this happen. It might be possible to make it more general, but we did not invest too much thought into that; this is what we show.

(In answer to questions: this is for a fixed p, and it is critical that p be greater than two. There is an intuitive explanation, namely that you need the loss to be very flat around the minimum and steeper around the edges, and we take advantage of that; it is not true for p equal to two or less. An audience member noted that the classical explanation of acceleration is via differential equations: gradient flow gives a first-order equation while accelerated gradient methods correspond to a second-order one, and by analogy, if you overparameterize into two layers then under the balancedness condition you can eliminate variables and obtain a second-order equation, so this is related to momentum. There is a relation, and also some differences. As to whether one can construct examples for the reverse situation, or show that this is generic because overparameterization raises the order of the differential equation: I don't think this is generic, and I know it is not true for p equal to two or less; for p greater than two, in every setting we tried experimentally this did happen. Maybe on a random data set it would happen as well, but we don't have that kind of result.)
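Here is a minimal sketch of the kind of depth-1 versus depth-3 comparison just described (my own synthetic data, hand-picked step sizes and initialization; the extra layers are scalars, as in the experiment mentioned next). Whether and by how much the deeper parameterization converges faster depends on the data, the initialization, and the step size, so treat this purely as a template for reproducing the phenomenon rather than as evidence of it.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 10
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)                    # noiseless linear regression targets

def l4_loss(pred):                                # convex l_p loss with p = 4
    return np.mean((pred - y) ** 4)

def run_depth1(eta=1e-3, steps=5000):
    w, losses = np.zeros(d), []
    for t in range(steps):
        r = X @ w - y
        w -= eta * 4 * (r ** 3) @ X / n           # d/dw mean((Xw - y)^4)
        if t % 1000 == 0:
            losses.append(l4_loss(X @ w))
    return losses

def run_depth3(eta=1e-3, steps=5000, init=0.3):
    # Same linear model, overparameterized with two extra scalar "layers":
    # prediction = a2 * a3 * (X @ w1), so the parameter count grows only by 2.
    w1, a2, a3, losses = np.full(d, init), init, init, []
    for t in range(steps):
        z = X @ w1
        r3 = 4 * (a2 * a3 * z - y) ** 3 / n       # dloss/dprediction
        g_w1, g_a2, g_a3 = a2 * a3 * (r3 @ X), a3 * (r3 @ z), a2 * (r3 @ z)
        w1, a2, a3 = w1 - eta * g_w1, a2 - eta * g_a2, a3 - eta * g_a3
        if t % 1000 == 0:
            losses.append(l4_loss(a2 * a3 * (X @ w1)))
    return losses

print("depth 1 l4 loss over training:", np.round(run_depth1(), 4))
print("depth 3 l4 loss over training:", np.round(run_depth3(), 4))
```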
To get a sense of how realistic this is, how common the situation is in practice, we ran a lot of experiments; I will just show you the basic one. This is a regression problem from the UCI machine learning repository; we use the ℓ4 loss because we need p greater than two. Compare a one-layer model, in blue, which is a convex objective, against a two-layer model and a three-layer model, which are non-convex objectives: the loss decreases much faster for the deeper models, and this is log scale. So in this particular case adding depth speeds up gradient descent, even without any gain in expressiveness and despite the fact that it introduces non-convexity. In every single setting we tried with the ℓ4 loss, or an ℓp loss with p greater than two, this happened; it also happened with nonlinear models, and I recently heard that someone took this idea and applied it to more large-scale settings. (In answer to a question about the number of parameters: that is a great question, and it is higher, but in this particular case the output dimension is one and every added layer is just a scalar, so it is almost the same; this is roughly a thousand parameters, versus a thousand and one, versus a thousand and two. So even if you normalize by the number of parameters, the effect remains.)

Okay, let's move to the new stuff: generalization. We are going to look at a setting where generalization carries a very concrete meaning, and that is matrix completion. It is a classic problem: I get a subset of entries from some unknown matrix and I want to complete the entries that I have not seen. It has a lot of practical applications, as we know, and it can be seen as a supervised learning problem: think of every entry I observe as a training example and every entry I did not observe as a test example; generalization then means recovering the unseen entries, that is, being accurate on the test examples. Obviously you need to assume something about the ground truth matrix, otherwise you can't do anything, and the baseline assumption is that the ground truth matrix is low rank.

There is a by-now classic algorithm due to Candès and Recht, and essentially it states the following: if I look at the set of matrices that agree with the observed entries, and from this set I choose the one with minimal nuclear norm (the nuclear norm is the sum of singular values, so this is a convex program), then, under certain technical assumptions that I will put aside, if I have enough observed entries I get perfect recovery. So a convex program gives perfect recovery, provided the number of observed entries is at least the minimal number needed, the number of degrees of freedom, times a logarithmic factor. For this reason the matrix completion problem is often regarded as solved.
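For reference, the convex program in question is nuclear norm minimization; in my notation, with Ω the set of observed entries of the ground truth M*:

```latex
\min_{W} \; \|W\|_{*}
\quad \text{s.t.} \quad
W_{ij} = M^{*}_{ij} \;\; \forall (i,j) \in \Omega ,
\qquad \text{where } \|W\|_{*} = \textstyle\sum_{r} \sigma_r(W).
```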
What we are going to see is that in practice this requirement of sufficiently many observations makes a difference, and there are cases where other algorithms do a much better job than nuclear norm minimization.

What we focus on is the use of linear nets for matrix completion. Let's start with the two-layer case, because that is already a classic algorithm: I parameterize the solution as a product of two matrices, W2 times W1. I can constrain the shared dimension and thereby force low rank, or I can leave it unconstrained; in any case the algorithm, optimizing over this parameterization, is called matrix factorization. There is an empirical phenomenon here that I think is fairly well known; Gunasekar et al. put a spotlight on it: if you run gradient descent over this parameterization with a small step size and an initialization close to zero, then even if you did not constrain the rank, a low-rank matrix is going to be found. Matrix factorization does a good job at recovering low-rank matrices and at matrix completion; there is some implicit regularization at play. The conjecture they made is that in this case, gradient descent with small step size and initialization close to zero over the matrix factorization implicitly realizes the classic convex algorithm: implicitly, it finds the minimal nuclear norm solution. And they actually proved this for a certain specific setting.

What we tried to do is build on this great work and look at what happens when you have N layers, not just two: I parameterize the solution as a product of N matrices, and we call this, unsurprisingly, deep matrix factorization. The first thing we did was run an experiment: we ran matrix completion and compared deep matrix factorizations to shallow ones as the step size gets smaller and the initialization gets closer to zero. Here we are completing a matrix with depth 2, depth 3, and depth 4; the x-axis is the standard deviation of the initialization, so you can see what happens as it gets smaller, and there are actually two curves per depth, almost on top of each other, corresponding to different learning rates. In this case depths three and four significantly outperform depth two: depth enhanced the implicit regularization towards low rank.

We wanted to analyze this phenomenon theoretically, and the first direction we took was to follow in the great footsteps of others. The conjecture made by Gunasekar et al., the group at TTI, is that the implicit regularization at depth two boils down to minimizing the nuclear norm, which is a surrogate of rank. Our initial thought was that, since we see that deeper models give lower-rank solutions, maybe with a deeper network some other norm or quasi-norm is being minimized, one closer to rank than the nuclear norm. The most straightforward candidates are the Schatten-p quasi-norms: the Schatten-p quasi-norm of a matrix is the sum of its singular values raised to the power p. For p equal to one this is exactly the nuclear norm, which according to the conjecture corresponds to depth two, and for p smaller than one it is a quasi-norm that is actually closer to rank than the nuclear norm. So what we initially hoped to show is that adding depth takes p from one to something smaller than one. But things did not pan out as we planned, which often happens. The first result we ended up proving is that in the particular setting where Gunasekar et al. prove their conjecture, the nuclear norm is minimized not just at depth two but at any depth: all depths minimize the nuclear norm. We also show that there are instances of this setting where minimizing the nuclear norm means you are provably not minimizing any Schatten quasi-norm, not even locally. So what we hoped to show, that the implicit regularization of depth is equivalent to Schatten quasi-norm minimization, perhaps even locally, is simply not the case; we know for a fact it is not true. On the other hand, if you adopt the lens of Gunasekar et al., you might end up with a different conjecture, namely that all depths minimize the nuclear norm; but this does not comply with our experiments, which show that depth gives you a closer approximation of rank.
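Here is a minimal sketch of the deep matrix factorization procedure described above (my own toy version: a small synthetic low-rank matrix, a random observation mask, hand-picked initialization scale and step size). It runs gradient descent on a product of N matrices initialized near zero, fitting only the observed entries, and then reports the error on the unobserved entries together with the leading singular values of the recovered matrix; the exact numbers will vary with the hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(2)
d, rank, n_obs = 30, 3, 300                        # 30x30 ground truth of rank 3
M = rng.standard_normal((d, rank)) @ rng.standard_normal((rank, d)) / d
mask = np.zeros((d, d), dtype=bool)
mask.flat[rng.choice(d * d, size=n_obs, replace=False)] = True   # observed entries

def prod(layers):
    out = np.eye(d)
    for W in layers:
        out = out @ W
    return out

def deep_matrix_factorization(N, init_scale=1e-2, eta=0.2, steps=5000):
    Ws = [init_scale * rng.standard_normal((d, d)) for _ in range(N)]
    for _ in range(steps):
        R = np.where(mask, prod(Ws) - M, 0.0)      # residual on observed entries only
        # gradient of 0.5*||residual||^2 w.r.t. layer j: (W_1..W_{j-1})^T R (W_{j+1}..W_N)^T
        Ws = [Wj - eta * prod(Ws[:j]).T @ R @ prod(Ws[j + 1:]).T
              for j, Wj in enumerate(Ws)]
    We = prod(Ws)
    rmse = np.sqrt(np.mean((We - M)[~mask] ** 2))  # error on *unobserved* entries
    return rmse, np.linalg.svd(We, compute_uv=False)[:6]

for N in (2, 3):
    rmse, svs = deep_matrix_factorization(N)
    print(f"depth {N}: test RMSE {rmse:.4f}, leading singular values {np.round(svs, 3)}")
```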
So what is going on here? What we did next was a more careful experimental study of this nuclear norm minimization idea. The setting is completion of a 100-by-100 matrix of rank 5, and we start with the case where you observe a lot of entries: there are about a thousand degrees of freedom, and we use five thousand observed entries. We compare nuclear norm minimization to linear nets of different depths, reporting the reconstruction error, the nuclear norm of the final solution, and the effective rank (Peter mentioned it; just think of it as a continuous approximation of rank). In this case, because there are a lot of observed entries, minimizing the nuclear norm indeed gives an almost perfect reconstruction error, so it finds the ground truth: the nuclear norm is 221 and the effective rank is 5, which exactly matches the rank of the ground truth. The linear nets in this case also give very good reconstruction, not quite as good but almost perfect, so obviously they found essentially the same matrix, and therefore the nuclear norms are the same and so are the ranks. So in this case there is a correspondence: linear nets of all depths indeed minimize the nuclear norm. But here minimizing the nuclear norm finds the ground truth, so we cannot really distinguish between nuclear norm minimization and some other implicit regularization that pushes these models towards low rank: either way you would reach the same matrix, and along the way you would also minimize the nuclear norm, even if that is not the driving force.

To get a fairer comparison, look at a setting with a smaller number of observed entries. Here nuclear norm minimization does not give a good reconstruction: its error is 0.8 while the nets get 0.1. The nuclear norm it finds is 210, lower than before, but the rank is higher: by focusing on minimizing the nuclear norm it missed the low rank and ended up at rank 9. The linear nets, especially depths three and four, give better reconstruction error and a lower rank; here the reconstruction is almost perfect while the nuclear norm is higher, so they seem to give up on minimizing the nuclear norm in favor of lowering the rank. So in this case there is a discrepancy, and this happened in every setting we looked at where nuclear norm minimization does a poor job of reconstructing.

(In answer to questions: all of this is without any noise; with noise the nuclear norm solution is presumably, at least intuitively, more sensitive. The point is that the nets find a lower-rank solution; you can argue whether that is accurate with respect to what you want to recover, but we did not play with those settings in this experiment. What we are saying is that sometimes it is the nuclear norm being minimized and sometimes it is something else; maybe there is a bimodality where below some threshold it is this norm and above it is that one. I know it is not Schatten, as I said, but maybe it is something else.) The conclusion we reached is that maybe it is just not possible to capture everything that is going on with a single norm or quasi-norm, and that led us to go back to our trajectories and see whether we can show something using that kind of dynamical analysis.
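As an aside, the effective rank used in these comparisons is, as far as I know, the entropy-based measure of Roy and Vetterli; a small helper under that assumption:

```python
import numpy as np

def effective_rank(W, eps=1e-12):
    """Entropy-based continuous surrogate of rank: exp of the Shannon entropy
    of the normalized singular value distribution."""
    s = np.linalg.svd(W, compute_uv=False)
    p = s / max(s.sum(), eps)                 # normalize singular values
    entropy = -np.sum(p * np.log(p + eps))    # Shannon entropy
    return float(np.exp(entropy))

# example: an exactly rank-3 matrix has effective rank close to 3
# (how close depends on how even its three singular values are)
A = np.random.default_rng(0).standard_normal((100, 3))
print(effective_rank(A @ A.T))
```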
So here again is our implicit preconditioner, and marked in green is the end-to-end matrix. What we are going to look at is what happens to the singular values of this end-to-end matrix. We would like to show that there is an implicit tendency towards low rank, meaning that gaps open up between the singular values; that is what we were hoping to show, and that is indeed what we found. Let σ_r denote the singular values of the end-to-end matrix and u_r, v_r the corresponding left and right singular vectors. The result we derived for the dynamics of the singular values is the following. Ignore the brown term for a second, and you see that the movement of a singular value is just minus the projection of the gradient onto the corresponding singular component. The only place where the depth N enters this equation is in those factors: if N equals one they reduce to one and disappear, and all you get is the projection; that is what governs the dynamics of the singular values for a linear model. Once I introduce depth, the factors come into play, and while the constants make no difference, since I can absorb them into the learning rate, the coefficients do: when a singular value is small, its coefficient attenuates the movement, making it move much slower, and when a singular value is large, its movement is enhanced and it moves much faster. The larger N is, the greater the power, and the more significant the effect.
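The singular-value dynamics being described, as best I can reconstruct them from the corresponding paper (the depth-dependent factor is the "brown term" mentioned above), read:

```latex
% Dynamics of the r-th singular value of the end-to-end matrix W_e(t)
% under gradient flow on a depth-N linear net; u_r, v_r are the
% corresponding left and right singular vectors.
\dot{\sigma}_r(t) \;=\;
  -\,N \,\big( \sigma_r^2(t) \big)^{1-\frac{1}{N}}\;
  \Big\langle \nabla L\big(W_e(t)\big),\; u_r(t)\, v_r(t)^{\top} \Big\rangle
```

For N = 1 the factor in front equals one and each singular value simply tracks the projection of the gradient; for N ≥ 2 the factor (σ_r²)^(1−1/N) suppresses small singular values and amplifies large ones, which is exactly the attenuation and enhancement just described.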
This is what happens when you optimize a linear net, with any loss. Now suppose you are optimizing some underdetermined loss, as in matrix completion, and you initialize close to zero. What we would expect is that while the singular values are small, right after initialization, they barely move; then, once a singular value reaches a certain threshold, it starts moving rapidly and shoots up. Ultimately that gives a low-rank solution, because most of the singular values stayed close to zero and only a small number of them shot up. And this is what we observed in every experiment we ran. Here is matrix completion with depth one: this is how the singular values move; there is no implicit regularization, you are not learning anything about the unobserved entries, all of the singular values move together, and you do not get a low-rank solution. This is what happens when you add a layer: the smaller singular values move much less, while the leading components, once they become larger than some threshold, start moving much faster, so you get something closer to low rank, and indeed the reconstruction error is much lower. If I add another layer, this becomes even more significant: the smaller singular values barely move at all, you get a very low-rank solution, and the reconstruction is almost perfect. So this is a numerical illustration of the intuition. We also characterized how the singular vectors move, I just have not talked about it; there is a non-trivial dependency there, it is not as if there is a complete decoupling.

This kind of analysis is, in general, not as elegant as saying that some norm is being minimized. In classical learning theory, when we regularize explicitly we usually use a norm, so it is natural to think that maybe deep nets implicitly do the same, and that would obviously be a nicer outcome than a dynamical analysis. But what I am arguing is that there is no reason to believe it is possible: the fact that we explicitly regularize with norms does not mean that the implicit regularization of deep nets does something similar; it could be different, and looking at the trajectories is a way to take the relevant details into account. For a special, super simple case, that of a single observed entry, we can actually give closed-form expressions for the dependencies between different singular values. For depth one the relationship is linear: if one singular value gets larger, the other grows linearly with it. For depth two it is polynomial: if this singular value grows, that one grows like a square root, and so on. With higher depths it becomes asymptotic: this singular value can be huge while that one still never passes some threshold. Overall we see that depth leads to larger gaps between singular values, which means lower rank. (The x-axis in these plots is iteration, the course of learning; and the rank shown is the effective rank, which is why it is not an integer.)

Alright, let's conclude. To recap, the main argument, or position, that I am advancing is that understanding optimization and generalization in deep learning might not be possible through the language we developed for classical learning theory, and in that case it might be beneficial to analyze the trajectories of gradient descent, the entire course of learning, the dynamics of what goes on there. We tried to make this case through deep linear neural nets: we analyzed the trajectories and showed that depth induces a preconditioner that promotes movement in directions already taken. We used this for optimization, deriving a guarantee of efficient convergence to the global minimum that gives fast convergence over an arbitrarily deep model which is compact and fits in your computer; this is the most general guarantee of that kind that I am aware of. More surprisingly, we saw that in some cases depth can actually accelerate convergence, even though it gives you nothing in terms of expressiveness and despite the fact that it introduces non-convexity. And for generalization, our trajectory analysis shows that depth enhances an implicit regularization towards low rank, which for problems such as matrix completion amounts to better generalization.

Just a word about the future. We have talked about linear nets, which are a nice theoretical tool, but at the end of the day these are not the models we usually run in practice; we are interested in nonlinear nets. It is difficult to analyze nonlinear nets with ReLU and the like, because they are not analytic, but there is another type of nonlinear neural net which I believe is much more within reach. We saw that linear neural nets correspond to matrix factorizations, and in the language of decompositions of multi-dimensional arrays this can be drawn as a chain, where every node corresponds to a matrix parameter. This graphical language describes not just matrix factorizations but tensor factorizations as well, factorizations of high-dimensional arrays, with different topologies such as trees and trains. What colleagues and I showed a few years back is that, just as matrix factorizations correspond to linear nets, tensor factorizations correspond to arithmetic nets, which are nonlinear models: a tree factorization corresponds to a convolutional net, and a train factorization corresponds to a recurrent net.
These networks have a structure that is very close to, actually identical to, practical models that people use, but with one difference: they are arithmetic, meaning the nonlinearities are based only on products. These are essentially polynomial networks, much more algebraic in nature, and not exactly the type of models people usually use; nonetheless they do work well in practice, for example reaching over 90% on CIFAR, and they are more amenable to theoretical analysis. So I believe these kinds of models could be the next step beyond linear nets. We have actually started doing this, analyzing the trajectories, and some of the results for linear nets transfer to these models as well, like the notion of balancedness. So I hope that in the near future we will have results on the trajectories of these models and will be able to say more about optimization and generalization for a nonlinear model; it is not exactly what most people use, but it does work well in practice.

[Applause]

(Audience question: regarding your main result, it seems you are associating something to every point; could one derive the same results from a global structure, a global Riemannian structure, instead?) Great question; it relates to mirror descent and related work that is going on, and we can talk about it offline. There is something to it: what did I do? I took a model, a linear net, and replaced it by an equivalent algorithm; I said that gradient descent over the linear net can be thought of as some other algorithm over a linear model. It turns out you can say even more: it is equivalent to gradient descent over a linear model, but with a different underlying metric.

[Applause]

Our next speaker is Misha, who will come talk to us.
