Mad Max: Affine Spline Insights into Deep Learning

Next, Richard Baraniuk will talk about "Mad Max: Affine Spline Insights into Deep Learning."

It's great to be here. Can everybody hear me at the back? OK. Yeah, it's great to be here; I've really been looking forward to this meeting, which is part of a program on foundations of deep learning. What I'd like to do is talk a little bit about some of the progress we've been making trying to find a language we can use to describe what we're learning as we scratch away at these black-box deep learning systems that have been promised to catapult us into the future, right? And what I'm going to argue is that splines provide a very natural framework both for describing what we've learned and for providing us avenues for extending both the design and the analysis of a whole host of different deep learning systems. So I'm going to talk a little bit about a particular kind of spline today, and then I'm going to give you a whole bunch of examples of how we've been using it to describe and extend. And of course, we're not the first people to think about the relationship between deep nets, or neural nets, and splines; this goes way back to before the previous deep net winter. But I think that we've identified a collection of particularly useful splines for modern deep nets.

OK, so let's jump in and talk about the basic setup. We all know that deep nets solve a function approximation problem: we're trying to use training data to approximate the prediction function from data to some prediction. It might be a regression problem, it might be a classification problem, and we do this in a hierarchical way through a number of layers in a network. What I'm going to argue is that deep nets do this in a very particular kind of way, using a spline approximation. Show of hands: how many people here know about splines? OK. So there are two key parts to a spline approximation. The first is a partition of the input space, or the domain. So
if x is just a one-dimensional input variable, then we have a partition Omega; in this case we're splitting the domain up into four different regions. The second important thing is that there's a local mapping that we use to approximate, in this case, a function. So we have a blue function we want to approximate here, and we approximate it (we're going to be interested in piecewise affine, or piecewise linear, splines) by, in this case, four piecewise affine mappings. OK, makes sense to everybody? It's really this yin-yang relationship between the partition and the mapping that works the magic in splines.

There are two big classes of splines. There are the really powerful splines, for example free-knot splines. This is where you let the partition be arbitrary, and then you jointly optimize both the partition and the local mapping. These allow you to have the highest-quality approximation, but it's important to note that they're computationally intractable in 1D. In fact, in higher dimensions, even two and above, it's not even clear how to define what a free-knot spline is. So these are something we'd really like to be able to do, but they're very, very difficult. Typically what people do is fall back to some kind of gridding-type technique, and if you think of what wavelets are, they're really just a dyadic-grid type of spline approximation.

So what we're going to focus on today is a particular family of splines called max-affine splines (we didn't name them; the term was coined by Stephen Boyd a number of years ago), and these were developed just for the idea of approximating a convex function with a continuous piecewise affine approximation. OK, so let's continue with this really simple example to define a max-affine spline. We're interested in approximating this convex function over capital-R regions. So we assume that we have R affine functions; these are parameterized by a set of slopes and a set of biases, and we're going to have R sets of those.
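A minimal sketch of this setup, assuming (as the name suggests) that a max-affine spline evaluates to the pointwise maximum of its R affine pieces. The target function x**2, the tangent points, and the helper name `max_affine_spline` are illustrative assumptions, not taken from the talk.

```python
import numpy as np

def max_affine_spline(x, slopes, biases):
    """Evaluate max_r (slopes[r] * x + biases[r]) at each point of x.

    Returns the spline values and, for each point, the index of the
    affine piece that attains the maximum.
    """
    x = np.asarray(x, dtype=float)
    # One row per affine function, one column per input point.
    values = np.outer(slopes, x) + np.asarray(biases, dtype=float)[:, None]
    return values.max(axis=0), values.argmax(axis=0)

# Hypothetical example: R = 4 tangent lines to the convex function f(x) = x**2.
knots = np.array([-1.5, -0.5, 0.5, 1.5])  # tangent points (assumed)
slopes = 2.0 * knots                       # f'(t) = 2t
biases = -knots ** 2                       # tangent at t is 2t*x - t**2

x = np.linspace(-2.0, 2.0, 9)
approx, region = max_affine_spline(x, slopes, biases)
```

Because tangent lines to a convex function never rise above it, `approx` is a piecewise affine lower bound on x**2 that touches it at the tangent points, and `region` records which of the four pieces is on top at each input.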
Here's an example for R = 4: we have four of these distinct affine functions. And the key thing, the reason why we call these max-affine splines, is that, very conveniently, if we want to approximate a convex function by these splines, all we have to do is take the maximum, the vertically highest affine function in this case. OK? So if you think of these four affine functions that we've thrown down here, and we think of approximating this blue curve, all we're going to be using is simply the top piece, right, the piece that actually sits on top. And the really important thing here is that just by fixing these four sets of slopes and biases, this automatically generates an implicit partition of the input space, right? You switch from one partition region to the next whenever these affine functions cross, and that's going to be really important for later. Does this make sense to everybody? Very, very simple, right? Of course, this also gives a continuous approximation.

So let's think a little bit about pointing towards deep nets, without going into a lot of details. We're still in one dimension, and we take our input x, we scale it by a, add a bias b, and then pass it through a ReLU, right? Well, it's pretty easy to show that this operation is a max-affine spline approximation with R = 2 affine functions: the first being (0, 0), the flat zero function, and the second being basically ax + b. Think of this as the ReLU, but now shifted over and with the slope changed by the parameter a. OK, so this should get you thinking about other deep net operations and whether they can be related to max-affine splines.

We're going to define a max-affine spline operator simply by concatenating K of these max-affine splines. So you can think of an input vector x (we're no longer in 1D; x is in D dimensions), and then we have K different splines, and the output of each of those splines will just be
one entry of this output vector z. We're going to call that a MASO, a max-affine spline operator.

So what's the key realization? Well, let's start by just talking about deep nets. If you think of the lion's share of the deep nets that are used today, basically any architecture you can think of uses piecewise linear, or affine, operators: fully connected operators, convolution operators, leaky ReLU or ReLU, absolute value, any of these types of pooling. The state-of-the-art methods are all built out of these kinds of architectures and these kinds of operators, and it's actually pretty easy to show that all of these operators that comprise the layers of essentially all of today's state-of-the-art deep nets are max-affine spline operators. You can think then that each layer of a deep net is just a max-affine spline operator, so we have a convex approximation going on at each of these layers, and therefore a deep net is just a composition of max-affine spline operators. It's no longer a convex operator, because the composition of two convex operators isn't necessarily convex. So we're just going to call this an affine spline operator: it remains continuous, but it doesn't have this max-affine spline property anymore. Just as an aside, if you wanted the overall net to be convex, it's pretty easy to show that all you need to do is ensure that all of the weights in the second layer and onward are positive numbers, right? That guarantees that the overall mapping will be convex. Any questions about this very simple baseline stuff?

OK, so recall that the nice thing about these particular splines is that as soon as you fix the parameters, the slopes and the offsets, wherever those hyperplanes, these affine varieties, cross, that defines a partition. And that's really where things get interesting, right: to think about the partitioning that goes on
in these max-affine spline operators, because that allows us to think a lot about the geometry of what's going on in a deep network. So again, just reiterating what I just said: if you think about a set of parameters of a deep network layer, they're going to automatically induce a partition of the input space of that particular layer into convex regions, and then if you compose several of these layers, we're going to form a non-convex partition of the input space. And this provides really interesting, non-trivial links to classical ideas out of signal processing, information theory, and computational geometry, namely ideas like vector quantization, k-means, and Voronoi tiles, and we'll get into these as we go. So one of the key ideas is linking these modern deep net ideas back to more classical signal processing ideas.

So let's just do a toy example so that you can visualize what goes on in one of these vector quantization partitions of the input space. Let's consider a three-layer net. We go from an input space that's two-dimensional (2D just so we can visualize; we can't really visualize anything beyond 3D), we go to a layer of 45 units, then three units, and then we're going to do a four-class classification problem, so we have four units on the output. OK, makes sense to everybody? So this is what goes on in the input space: we have four different classes, with these four colors, and our goal is to build a classifier to tell these apart. This is the first axis of the input space, and this is the second axis. And this is what happens after we go through the first layer, the MASO layer that maps the input to the 45 outputs of this first layer: this is the vector quantization partition, or the spline partition, that you obtain. Importantly, we're going through a single layer, so the tiling is convex, right? These are convex regions. Makes sense? OK. Moreover, let's
remember that these are splines after all, so we can ask: what is the mapping from the input of the first layer to the output of the first layer? Well, it's just a very simple affine map, because it's an affine spline after all. Once you know that a signal, a particular x, lands in this particular tile, that gives you a particular matrix A and an offset vector b that are different for every VQ tile, but then the mapping from the input to the output of that layer is just simply this affine map. So you can think of the mapping from the input to the output of one deep net layer as just a VQ-dependent affine transformation.

So this is one layer. Now if we go through two layers and we think of the partition induced on the input space, we now see that we start having non-convex regions, because of the non-convex operator. However, we still have the same concept: if a signal falls in this particular tile, this particular partition region, the mapping from the input to the output of the second layer remains simply an affine map, where the A and the b are indexed by this particular tile. And just to be super-duper clear about it one more time: every signal that falls in this tile in the input space has the exact same affine mapping.

And this is what happens when you learn. If you initialize with random weights and zero biases, you just get a set of cones, and as we go through learning epochs, you see that we end up with these cones pulling away from the origin, and then cones being cut up by other cones, and we end up, at least for layers one and two, with this particular partition at convergence. I think that thinking of this geometrical picture is really very useful for thinking about the inner workings of what's going on in a deep network. In particular, a deep net is a VQ machine, right? It's computing
a vector quantization. So, was there a question? Yeah. [Audience question, inaudible.] Let me just think if I've got this right: we set all the biases to zero in the whole network. Yeah, so there's no offset, so it's just going to remain cones. That makes sense? OK, good.

So let's talk a little bit about some of the geometrical properties. You can actually delve deeper into the structure of these VQ tiles and show that the partition of a single layer's input space, in terms of the output, is actually not a Voronoi diagram; it's something called a power diagram. Question: anybody here heard of power diagrams? OK, fantastic. So it's a generalization of a Voronoi diagram: now, instead of just having a centroid, each cell has a centroid and a radius. So it's a mild generalization of a Voronoi tiling. Basically you just compute a Voronoi tiling, but with something called the Laguerre distance instead of the standard Euclidean distance. The tiles remain convex polytopes in high-dimensional space. Moreover, given these affine maps, given the entries in these A matrices and these b bias vectors, there are closed-form formulas for the centroids and the radii that determine all of these polytopes. So you can study the geometry of these tiles, the eccentricity, the size, etc., in closed form thanks to these formulas.

Moreover, it should be pretty clear that since you're piling layers upon layers, the power diagram formed from, let's say, two MASO layers applied in composition is going to be formed by a subdivision process, because the cuts from the input of the second layer to its output will basically cut the VQ tiling from the first layer, right? And so this is just an example of an input space tiling: the first layer will just be a set of
straight-line cuts. The second layer is going to be a subdivision of those cuts (we colored those gray here). But now the important thing is that the cuts are going to be bent, right? They're going to be bent at the gray boundaries, which are the boundaries defined by the first-layer cuts. For these bends you can actually compute bounds, for example, on the dihedral angles, and these bends are there precisely to maintain continuity of the mapping from the input to the output of this operator. If you didn't have this bending, then the spline could become discontinuous. But again, these bends are very important, and they have a lot to do with weight sharing in deep networks.

So one of the conclusions you can take away from this, partway through, is that deep networks are really a very practical, very efficient way of doing something that is extremely difficult, which is free-knot spline approximation in high dimensions. That's really what deep networks are doing. You could carry this all the way to the last layer in a classification problem, say, and study the decision boundary of the deep net: it is again just going to be basically one of these cuts. And you can understand, for example, the smoothness of the boundary by the fact that you can only have so much bending between the cuts when you cut through the power diagram partition that you obtained from the previous layers. There are lots of things that can be done to understand, for example, the smoothness of the decision boundaries in different kinds of deep nets.

This is one direction that we've been exploring. The other is looking in particular at these affine mappings. So again, when you're in a VQ tile, you know that for all signals that live in that tile there's just an affine map that goes from the input to the output. What properties can we glean from these? OK, so in particular, let's
just study the simple case of the mapping from the input to the output of the entire network, which we'll call z^(L). You can ignore the softmax; that doesn't really enter into any of the discussion I'm going to bring up. We're interested in the mapping through all the layers of the network. This affine mapping formula applies no matter where you are in the network (you'll just have different A's and b's), but we're interested in the one from the input to the very output. Well, you can develop closed-form formulas for this map, particularly for a convnet. This is what the A matrix looks like; this is what the offset b looks like. We know all of these matrices here in closed form, so you can do different kinds of analyses, for example look at the Lipschitz stability at different points in the network based on different inputs.

But the thing I'm most interested in talking about here is what the rows of this A matrix look like. Because if you think about it, what is the output of the deep net, everything up until the softmax? Well, it's basically just this matrix A (just ignore this typo) multiplied by my signal x, plus a bias. This is just a vector. How big is this vector? There's one output for every class, right? And how do I determine which class the input is? It's whichever of these outputs is largest. OK, so let's think: we have a matrix A that we're multiplying by our x, and each entry in the output is just an inner product of a row of A with x. So what is that? If we think about this matrix A, the c-th row corresponds to class c: we're just going to take the inner product of the c-th row with x in order to get the c-th output of z, and we want to find the biggest. What do we call this in signal processing nomenclature? We call this a matched filter bank, right? Because basically what we're doing is applying to our
signal a set of filters by inner products. Cauchy-Schwarz tells us that the more the filter looks like the input, the larger the output is going to be, and the optimal filter is the one where the row is exactly the input, right? Standard stuff that's done in radar signal processing, sonar, communication systems, etc. And you can actually visualize this in a real network. So this is just CIFAR-10. Here's an input of an airplane, and this is the row of the corresponding A matrix for that input vector, unvectorized so that it looks like an image. If you squint, it looks a lot like an airplane, so we have a large inner product. If you look at these other rows, corresponding to the ship class and the dog class, you see they don't actually look like a ship or a dog but more like an anti-airplane, in order to push down the inner product: largest inner product, smaller, even smaller inner product.

Yes, sir? [Audience question, inaudible.] I didn't talk about the bias, but the way to think about the bias is: if you're a Bayesian, then the b's would be related to the prior probabilities of the different classes. So if you knew that planes were very, very likely, you would load b with a large number in the corresponding entry. Does that make sense? Yeah. Yeah, René? [Audience question, inaudible.] Yeah, so it's subtle, but you could think of this as like a dictionary learning machine that, given an input, is then defining a bit of a Bayesian classifier. Does that help a little bit? OK.

And of course, if you think about what the rows of these A matrices are, and you think of the fact that we're decomposing the deep net input-output relationship in terms of affine maps, there's a direct link between the rows of this A matrix and what are called saliency maps by the community. So it gives new intuition behind what happens when we think about saliency maps. Moreover, you can prove a simple result that
says that if you have a high-capacity deep net that's capable of producing basically any arbitrary A matrix, if you will, from a given input, then you can show that the c-th row of the A matrix, when you input a piece of training data x_n, is going to become exactly x_n when c is the true class, that is, x_n's label, and essentially minus a constant times x_n when you're in a different class. And so this tells us a little bit both about reinforcing this matched filter interpretation and about helping us understand this memorization process.

OK, a couple more points. Another thing we can do, now that we have formulas for these affine maps, is characterize the prediction function f that maps the input to the output, and we can think of different kinds of complexity measures that we can derive out of these affine mapping formulas. There are a lot of applications for complexity measures. For example, you might want to compare two deep networks, one of which has a very complicated prediction function and the other of which solves the same task but has a much simpler prediction function: an Occam's razor type of idea. We might also want to apply a complexity measure as a penalty directly in our learning process, right? So there's a large literature deriving different complexity measures and complexity penalties for deep nets. I'll just point to two of them as examples. One is a very nice recent paper that links the ubiquitous two-norm-of-the-weights penalty for learning to a particular measure of the second derivative of the prediction function. So it really does say that, for at least a very, very simple kind of network, there's a link between the values of the weights and the wiggliness of f. And then there's another school of approaches that says: well, we have a VQ tiling of the input space, so let's count the
number of tiles, because presumably the more codes there are in your codebook, the more tiles there are, and the more complicated the function that you're trying to approximate. So these are two approaches. I'm going to give one that really expands upon these two, and it leverages the fact that for lots of data sets, in particular image-type data sets, it's a reasonably true property that the training data live on lower-dimensional (not necessarily low-dimensional, but lower-dimensional) submanifolds of the high-dimensional space. OK, so let's assume that our data lives not filling up the entire input space but on some lower-dimensional submanifold or submanifolds. In this case we can look into the manifold learning literature, and there's a beautiful paper by Donoho and Grimes that defines what's called the Hessian eigenmap manifold learning technique, which is basically trying to flatten a curvy manifold using the tangent Hessian along the manifold. So we can just adopt this same measure, and we can define a complexity measure C as the integral of the tangent Hessian along the manifold. You can think of it, roughly speaking, like this: you have f, which is a continuous piecewise affine function, and what we're looking at is the local deviation of f from flat. If f(x) were a completely flat function, this measure would be zero. If f were very, very jaggedy, meaning that locally, when you look over a few regions, it's jumping up and down wildly, this will be a large number.

Yes? [Audience question, inaudible.] Well, simply by integrating along this; basically we're just integrating along the tangent manifold part. Yes. And you could also just integrate over the entire space, but then you lose some of the nice properties. Did that help? And we can talk about it after. So the nice thing about this measure is that you can
develop a Monte Carlo approximation in terms of the training data points, the x_n's, and the affine mapping parameters. So it's actually extremely easy to compute the value of C given a set of training points and given the affine mapping parameters.

[Audience question, inaudible.] Just think of p as two for right now. Ideally you would choose p depending on, for example, the manifold dimension and the ambient dimension. Yeah, let's talk about it at the break. [Audience question, inaudible.] This is the data manifold: assume the training data are samples from some submanifold in the ambient space. [Audience: There are two factors that can increase C: one is f, the other is the manifold.] You mean the smoothness of the manifold? Yeah, absolutely. But let's just say you have two prediction functions, and their domain in both cases is the same manifold; then that would be normalized out, right? Yes? [Audience question about whether this still applies to ReLUs, partly inaudible.] This is where the ReLU, or absolute value, or any convex piecewise affine nonlinearity... [inaudible]. Yeah, let's talk about it offline; I think otherwise I'll run out of time.

OK, so let's look at an application of this to something that is, I would say, still somewhat mysterious in the deep learning world, and that is data augmentation. If we think of how deep networks are trained today, they're typically trained with this technique called data augmentation: we don't just feed in images of Ben, right, we feed in images of translates of Ben, rotations of Ben, etc. And if we have a hypothesis that the images of Ben somehow came from a lower-dimensional manifold in high-dimensional space, where points on that manifold are translates and rotations of each other, then it's very convenient that these augmented data points that are generated from just your raw initial
training data will live on the same data manifold. So in this particular setting, you can prove a result that says that, just writing out the cost function, say a cross-entropy cost function, with data augmentation terms, you can actually develop those data augmentation terms, pull them out of the first part of the cost function, and show that they basically form this Hessian complexity regularization penalty. So what that's saying is that data augmentation implicitly implements a Hessian complexity regularization on the optimization. So that's the theorem.

Here's just a simple experiment with the CIFAR-100 data. This is training epochs on the x-axis and this complexity measure on the vertical axis. All we're doing here is training the network without data augmentation, in black, and with data augmentation, in blue. Based on the A's and the b's that we have learned with the network, we're plugging them into the complexity measure that was on the previous slide, and we're seeing that the network that has learned using data augmentation has far lower complexity than the network that has learned without data augmentation. This is both on the training and on the test data.

Yeah? [Audience question, partly inaudible: does the regularization depend on the loss, and does it arise because of the squared loss?] Oh yeah, good question. Well, in this case it was cross-entropy loss. [Audience follow-up, partly inaudible: rotations of similar images have the same label, the loss compares labels, so the regularization has to change accordingly.] Yeah, so just for now, in the interest of time, let's assume cross-entropy loss for a classification problem rather than L2 loss for a regression problem. And then let me think, and maybe I'll have a better answer by the time
that we get to questions. [Audience, partly inaudible: The complexity measure is only a function of the model, is that right?] Absolutely, yes.

OK, so one last quick note: what can we do beyond piecewise affine deep nets? Because sigmoid and hyperbolic tangent are still very useful in certain applications, in particular in recurrent deep nets. It turns out that you can bring those under the same umbrella of this max-affine spline framework, and the way to do that is to switch from a deterministic, hard vector quantization way of thinking, where if x lives in a particular vector quantization tile it definitively lives in that tile, to a soft VQ approach, where now we just have a probability that x will fall in a given vector quantization tile. For this particular signal, maybe there's a high probability in this tile, a somewhat smaller probability in the neighboring tiles, and then decreasing probability as you move away. So if you just set up a very simple Gaussian mixture model, where the means and covariances are based on these A's and b's that we derived, you can basically derive nonlinearities like the sigmoid and the softmax directly from ReLU, absolute value, and other piecewise affine convex nonlinearities.

And in particular, if you look at a hybrid approach between a hard VQ and a soft VQ, where you're basically blending between the two, you can generate infinite classes of interesting and potentially useful nonlinearities. I'll just point out one. How many people here have heard of the swish nonlinearity? A few. So this was a nonlinearity that was discovered a few years ago through an empirical search: is there a nonlinearity that works better than ReLU for large-scale classification problems? And it turned out there was, and it was a fairly sizable, non-trivial gain in a lot of cases. It's this black
dashed line here. And the interesting thing (it's hard to know if it's a coincidence or not) is that if you look at, in some sense, the midway point between hard VQ and soft VQ, based on the ReLU function at the hard VQ side and the sigmoid gated linear unit at the soft VQ side, the swish is precisely halfway in between. OK. You could also pull out the sigmoid and hyperbolic tangent by adopting a probabilistic viewpoint, where the output of a layer is no longer just a deterministic function of the input but instead the probabilities that you fall in the different VQ regions of the input. That's what we can do beyond piecewise affine.

So I'd better wrap up. What I hope to have gotten across is that this spline viewpoint, in particular the max-affine spline viewpoint, can provide a useful language to talk about the things that we're learning about deep networks, but also to frame the kinds of questions that we would like to move forward with. I talked a bit about the basic framework of max-affine splines and deep nets. I talked about the relationships with vector quantization, and that you could think of a deep net as a vector quantization machine, or you could think of it as a free-knot spline machine. There are, I think, really interesting links between power diagrams from computational geometry and the subdivision that's generated by this layer-upon-layer max-affine spline process. The affine transformations that we derive based on these different VQ regions allow us to link deep nets back to old-school signal processing ideas like matched filter banks, and they allow us to define new kinds of complexity measures, in particular this Hessian measure that we talked about. It's all in there; there are some papers if people would like to take a peek, and I'd be happy to answer any additional questions.

[Applause]

[Audience question about the second derivative, partly inaudible.] Yeah, so basically the way that we think about it is
that you have a it's okay there's there's let's stop it to hear a heuristic way of thinking about it is if you had a piecewise-defined function it's gonna be undefined that obviously kinks right but if you thought if you think of basically here some heuristic way to smooth any heuristic way you can think of to smooth out that kink then the second derivative is going to be related to the slope parameters on one side and the soul parameters on the other side with yeah exactly and the bandwidth of this movie that's the epsilon that was in that in that formula there's a there there details that we could you know we could talk about the brain yes you mean the bat like how large the measure yeah so that was what this yeah that the point of this experiment was really to look at ads as we are training so as we're going through training affects what is happening you know what is the value of in this case the value of this complexity measure as we train the network through the various training training cycles both with in this case with data augmentation and in this case without data augmentation does that make sense no you just so it just in a nutshell think of think of a Gaussian mixture model a Gaussian mixture model defined in terms of covariance means and covariances we're now the means and covariances are defined in terms of these different tiles pardon me so okay what's the best way to describe it so start from start from a hard VQ where we have a tiling now you have because of the power diagram you have a radius and you have a centroid now use that radius in that centroid as the per to develop a Gaussian mixture model for example a you know circle a symmetric Gaussian mixture with that particularly the radius being the the variance now under that if you think in terms of now under that model you'll have a given an input you can think about the probability that it falls into each of these individual tiles will be determined by the probability under that particular 
each of those mixtures does that make sense and if you put if you now look at these probabilities these probabilities behave like in the case of say you start with the tiling derive from array Lu you will you will end up with set of probabilities that are that that follow the functional form of a sigmoid gated linear unit in that case does that help that is not funny it's already trained you can mean there is no time right oh okay okay you know you I my question is I assume there is a procedure to arrive so can we go backwards you're saying I need to think about that yeah we were thinking only in the one in one direction going from a hard v q to a soft vq presumably you could reverse that process by if you have a certain kind of but it would have to be not all possible nonlinearities are reachable via this saw this this soft our relaxation of this hard v q right you can't reach you can't reach arbitrary you can't reach arbitrary nonlinearities right only certain kind of nonlinearities if you wanted to reach arbitrary ones you would have to there's no way you could do that with just a ghost standard kind of Gaussian mixture framework that help yeah yes go back to the contacts to make sure it was sure or are we here it's a very good question um so it sounds like a blood test sorry for the low D a after one and fifty ipok there's no number it is just because of a proton so that's something happen like oh oh sorry sorry yeah no we jet this is an artifact of the plotting we shall probably should have stopped a plot here yeah yes like a total number of examples yeah so the that's a really good question because in fact we're we're only the way that we compute okay there's the complexity measure and then there's the computation of the approximation of the complexity measure the more training samples you have the more densely you will sample the manifold the the signal manifold and that closer your approximation will be to the true the true measure and so the more training 
data that you have the closer you'll get to the true measure yeah that's a really good question so there are methods that attempt to do adaptive yawn selection movies like Trent oh yeah and you can apply it with some on occasions to multivariate data and get adaptive finally yeah is there some sense of so it would give you similar results it's right hierarchical VQ would be another yeah yeah but is there some sense about the properties of these quantizations and these timings are that will differentiate you from something that's that more directly tries to penalize is that to me this is this is the this is this is a really key key unanswered question right what the question was there that there are you know there are a number of different ways to try to find different free not spline approximations in higher dimensions and why are deepening of the Oh Howard deep nets bet you know different or better and this this is a big question I answered question there there's no question that the methods that we use the you know current training method our optimization approaches are enabling us to find these free not spline approximations in there truly ridiculously high dimensions right where a lot of these other techniques you wouldn't even attempt to them right but still it does not mean that this is the that we that we've stumbled on the best way of doing this so I think that as we think of new kinds of optimization methods for example our new kinds of architectures we're finding new ways to hierarchically build up these these blank partitions and eventually we'll find out that there are some ways that are better than other ways and we're probably not even you know partway there yet okay I guess this sexist speech [Applause]
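
The hard-to-soft VQ relaxation described in the talk can be illustrated with a small sketch. This is not the speaker's code or parameterization: the function name `soft_max_affine` and the `temperature` knob are assumptions made for illustration. Each affine piece a_r * x + b_r is weighted by a softmax over the piece values, so a large temperature recovers the hard maximum, while temperature 1 applied to the two ReLU pieces {0, x} gives x * sigmoid(x), the sigmoid gated linear unit. Intermediate temperatures have the same functional form as the swish family x * sigmoid(beta * x).

```python
import numpy as np

def soft_max_affine(x, slopes, offsets, temperature=1.0):
    """Softmax-weighted blend of affine pieces a_r * x + b_r (illustrative only)."""
    x = np.asarray(x, dtype=float)
    # pieces[r, i] = a_r * x_i + b_r
    pieces = np.outer(slopes, x) + np.asarray(offsets, dtype=float)[:, None]
    logits = temperature * pieces
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=0, keepdims=True)  # soft tile membership
    return (weights * pieces).sum(axis=0)

x = np.linspace(-3.0, 3.0, 7)
relu_slopes, relu_offsets = [0.0, 1.0], [0.0, 0.0]  # ReLU = max(0, x)

soft = soft_max_affine(x, relu_slopes, relu_offsets, temperature=1.0)   # x * sigmoid(x)
hard = soft_max_affine(x, relu_slopes, relu_offsets, temperature=50.0)  # close to ReLU
```

Passing the absolute-value pieces (slopes -1 and 1, zero offsets) through the same function gives x * tanh(temperature * x), a smooth stand-in for |x|.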

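The heuristic from the Q&A, that smoothing a kink yields a second derivative tied to the slopes on either side and to the smoothing bandwidth epsilon, can be checked on a one-dimensional example. The sketch below uses a central second difference as the smoothing device, which is an assumption chosen for simplicity and not the formula shown in the talk. For a piecewise-linear function with a kink at x, it evaluates to (right slope - left slope) / eps, and to zero wherever no kink lies within eps.

```python
def second_difference(f, x, eps):
    """Central second difference of f at x with bandwidth eps (illustrative only)."""
    return (f(x + eps) - 2.0 * f(x) + f(x - eps)) / eps**2

def relu(x):
    return max(x, 0.0)

eps = 0.1
at_kink = second_difference(relu, 0.0, eps)  # (1 - 0) / eps: slopes differ across the kink
away = second_difference(relu, 1.0, eps)     # about 0: same slope on both sides
```

Shrinking `eps` makes the value at the kink grow like 1 / eps, which is why the bandwidth has to be fixed for such a quantity to be meaningful.
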
2 thoughts on “Mad Max: Affine Spline Insights into Deep Learning”

  1. Is there an implementation of the Hessian regularization somewhere? How expensive is it to calculate the regularization term?