Complete Statistical Theory of Learning (Vladimir Vapnik) | MIT Deep Learning Series


– Today, we’re happy and honored to have Vladimir Vapnik with us, co-inventor of supported
vector machines, support vector clustering, VC theory
of statistical learning, and author of “Statistical
Learning Theory”. He’s one of the greatest
and most impactful statisticians and computer
scientists of our time. Plus, he grew up in the Soviet Union, eventually heading the
Computer Science Department, Institute of Controlled
Sciences in Moscow. So he will give today’s lecture in Russian and I will translate. Just kidding, right. (laughing)
(audience laughs) It’s an honor and a pleasure
to have Vladimir with us today, so please give him a warm welcome. (audience applauds) – Thank you. About 50 years ago,
Professor Chervonenkis and me started statistical learning theory. The problem was to answer the question when if we will do well with training data if you will have small amount
of our own training data, you will do well also on the test data. You will minimize expectation of error. So this solves this problem. There are this theory in most in all books they might be written
in different languages, but mostly they follow this line. The line is that just law of large numbers is not enough, then you need uniform law of large number, you need convergence, and so on, but because we started this
discussion about empirical error how good you do on the training data, and bounds shows the better
you do in the training data, you will be better on test data. People decided that this is
only way to have a training data and do something with this training data, to minimize number of
error and all algorithm was constructed based on this principle. About five years ago I
found that there exists another principle, even more
interesting than this one, because in this principle
it is brute force principle, give me more data, you
give better answer with. So the second principle
is intelligent principle and I will talk today about this. So I will start with
statistical learning theory and then I will introduce this new theory, but because there are only
two ways for generalization, one is data, another I will show what I call the complete
statistical learning theory because there are more, another way to do something. There is no short way for generalization. You should use both of them,
so that is complete theory. But it is not so bad because you will see that learning theory move in different direction. In direction of intelligence
to understanding what is intellence. It is not the same but Turing explained. Turing told that you should
imitate intelligent person but now, question what is intelligence, and we will discuss this. So let me start. The first part is, this is
theory of generalization, and that is the question, when in set of given set of function, you can minimize functional. This is pretty general functional. Instead of y, f of x,
and L is loss function, I can consider difference
between y and function. It is what they’re doing in regression and pattern recognition
and all this stuff, but this is more general setting. But this is more important that when we consider
minimization of functional, we should say in which set of fucntions. In the given set of function, you have to minimize functional. If probability measure is unknown, but we are given iid data pairs. That exact setting of
both pattern recognition and regression estimation problem of pattern recognition set of function is indicated functions for regression function
because continuous function, but the statement is the same, but the answer is very simple. We can minimize this functional using data if and only if this dimension h of set of function is finite. Everything depend on which set of function you have to minimize this functional, and you can not avoid that. So we call it capacity,
but maybe better called, diversity of the set of function, but measure of this capacity,
diversity, is VC-dimension. And what is VC dimension,
I will give you definition. First for your set of indicator functions, t is step function you
consider continuous function f, and in front of it, you’re
looking for indicator. If function is positive you say one, if not positive you say
zero, that is indicator. The VC-dimension of set
indicator function equals h if h is the maximal number of vectors that can be shattered,
separated in all possible two h subsets using indicator functions from this set. So you have set, you should find h vectors which you can shatter in all possible way, in the l possible way. But you cannot shatter
in h plus one vector. Then VC-dimension of this
set of function will be h. This is purely a rhetorical definition. And if you can shatter for any l, then this says the VC
dimension is infinite. And then I will give you two
theorems which we will use, and then the main theorem,
probably of this theory. If set of function has VC-dimension h, then this probability one
minus eta for all functions in the set the bound holds true. And because this bound
holds true for all function, and you would like to
have minimal estimate, minimal right hand side, you will choose function
which minimize empirical loss. And the second theorem, if you
have set of linear functions, indicator from the set of linear function, it so happen that your
vector x inside of circle for radius one, and w inside
of C, then we see the dimension is bounded by maximum of two radius C and n, n is the dimensionality
of the object plus one. That means exactly that
VC-dimension can be smaller than dimensionality of the space. And you can control some
part of VC-dimension, and I will show you
afterwards how to do it, it is very important theorem. But, what is the general way
suggest the VC-dimension for searching for functions. You have set of function,
maybe set of function have infinite VC-dimension. Then you make a structure
of this set of function. You choose subset of function,
the small subset of function, with VC-dimension h, then
another subset of function which includes a small
one, VC-dimension h two, so this last VC, we have
last VC type of function. It’s loss to VC dimension. And then, when you
minimize exist functional, you’re doing two things. You choose appropriate subset, and then in this subset,
you pick up function which minimize empirical loss. But, you can see, that epsilon in this depend on VC-dimension of subset you choose, at VC-dimension over L. And also, it depend on empirical
loss which you achieve. So you can do as soon
as you make a structure, you can do whatever you want, even if you have infinite VC-dimension in initial set of function. But, now, this is more
or less, all what contain the main result of VC theory. But VC theory does not answer
four very important questions. How to choose loss function L y f? I told that any function how to select admissible
set of function f of x? I told you that given set of function, but when there’s a stupid set of function, how to construct good set of function? How to construct structure on
admissible set of function? And then, how to minimize function and how to construct the structure. In this talk I will try to
answer all those questions. Target functional for minimization. And this is important
slide, God plays dice. What is setting of pattern
recognition problem? I will consider in this talk,
just pattern recognition problem for two class specification, but generalization is straightforward. So, what is setting of
pattern-recognition problem? Given generator, given
nature we generate randomly, independently, citation x
which come on the object, and this object is
conditional probability of y. Given x, say that y equals one, given x, and y equals zero given x. This object knows this
conditional probability, and plays dice. So he has the function of
conditional probability, he has x on the input,
he plays dice, and say y. That is the most general
setting of logic problem, deterministical particular
case of this set. And what does learning machine, learning machine has a set of functions, and it can choose any
function from this set. The problem with observing l observations, x one y one, x l y l, to pick up the function for classification. That means, given observations generated by a kind of
conditional probability P x, y equal P y given x on P
of x, that’s our theorem, finds the rule that minimize function l. So in my case when I have y zero or one, or in theta, also zero, one,
it is indicator function, so last function l, just
collect how many errors I do, but I have probability measure, it collect expectation of error. I would like to find
function which guarantee the smallest expectation of l. But this is not very good. Why it not very good? Because my function l, this model is just, it is everywhere zero except for some points where
it is one or minus one, and if I could model it is one. So, it is zero everywhere in defined, and one in some point. So I cannot use gradient in this case. So I should to do something smarter. And what people doing is they replace model and indicator function with just y minus f of x. This create error. Instead of whatever I formulated before. It’s not so bad choice
because it so happen that minimum of this function l gives conditional probability function, probability of y equal one given x, and then when we can find this probability of y equal one given x,
we easily can construct our decision or rule, we
just consider function if our conditional probability exceed .5, say first class, if it’s
less than .5, second class, and this is optimal solution. But something wrong with this replacement. Let us rewrite the first line. I will subtract from bracket
inside on the first term. Regression is an odd regression, so I have two brackets instead of one, and then I make a square. So the last integral show
me the first integral does not depend on function,
which I looking for, and I have to minimize
my function l over f over set of function, just sum of two last terms. Have a good it is just
normal binome for two terms. Square of one, square of second, and two terms is multiplications. But our goal is to
minimize first integral, to find function which is close to conditional probability of function, not sum of two integrals. We can show the second
integral eventually will go, it goes to the, it’ll go to zero, but it will slow down rate of convergence. To have a rate of convergence big, we need to find way, how
to minimize first integral, not sum of these two. And that means that not this square loss, but something else. What they can do? There exists, first of all, when y, it is zero or one, probability of y equal one given x is some real valued function
between zero and one, because from Bayesian formula, we know that conditional probability of y equals one given x, and then p of x, it is joint density with
y given now comma x. That is just always true. Now, if I will multiply on some function, G of x minus x, star, which belong to L two space. And the integral, I
will have this equation. And they can say that
conditional probability solution of this equation. It was constructed like that, because you should put conditional probability,
I did something like that. But, I would like to solve this equation to find function when I don’t
know probability measure, but I’m given data, given observations generated according to p x, p y,x, I would like to solve this equation. But solving equation,
it is ill-posed problem. OK, let’s do that. But before I will do that,
I would like to mention that in classical statistics, there is a way how to replace unknown probability measure with empirical measure. And that is the most important part, is main inductive step statistics. In statistics, we’re given data and would like to know function, and it doesn’t matter how many data. We will see, it is not
equivalent to function. So in classical statistics, people suggest to approximate cumulative distribution function by empirical cumulative
distribution function, and that is empirical cumulative
distribution function. And 30 years, mathematicians
tried to prove, that it is good idea that we can do that, and in ’33, Komogorov found exact bound which is on the last line, almost the same like in the last line, and then people prove that
you can bound like that. Now, if we can replace unknown measure with empirical measure, we can construct our problem, our constructive problem,
what we should do. Let us replace in this function now which we would like to minimize instead of a real mirror our
measure, empirical measure. And then you have, empirical, real square root functional which you have to minimize to find our problem, to find our conditional
probability of this is whatever you want. But let me consider new
constructive setting, where we also will replace
unknown probability measure with empirical probability measure, obtained on the training data. And you will see the last equation to find conditional probability, we have to solve this equation in set of function f of x, right hand side is known
because our function G is known. In left hand side you don’t
know f of x or it is our goal to set a function to find this function. In classical statistics,
it was one algorithm called Watson-Nadaraya estimator which show how to estimate
conditional probability or integration of function. They just somehow defined this. And this is, I show function which is have a numerator, and denominator. So, G is special kernel, say, Gaussian. So this is estimate of regression, a very general way of
estimation regression, conditional probability,
and all this business, how to find, if Gaussian,
how to find the best value of variance to approximate
conditional probability well. So they spent a lot of
time on this subject and they have this. But you can see the line in middle, this is Watson, Nadaraya-Watson estimator comes from corrupted equation, not from equation which we derive. Here, it is in the middle is f of x, not of f of x, i like in last, like in kernel,
so then you can put out of sum f of x and you
will get this function. So, actually classical
Nadaraya-Watson estimator, it is solution of corrupted
equation which we obtained. But what means to solve equation? To solve equation means I
just take the difference between left hand side
and right hand side. Define area where I would like that my function will operate, take the square, and sum, and integrate over
sum probability measure, and minimize this functional. So let us do that, and if
you will do simple algebra, it is just very simple,
and you can check it, you will have this R f function
which is y minus f of x, y j minus f of x j multiply on some
coefficients, where x y, x j. So, we can estimate this value, and this value is j x y, x i, j x i, x g over measure, and this is matrix, if you
know from Watson-Nadaraya exact formula, we know this
matrix, this element of matrix. So, what is V-matrix? If we will replace this integral
with empirical integral, we will have this estimate of the matrix. If we will use just, say, line between minus one and one, and new f x is uniformly
distributed over this line, we will have, we will have n G’s Gaussian distribution, we will have this V-matrix. So V-matrix is easy to find. Now, I would like to use vector notation. What I will do, I will call Y vectors of elements of training data, y one, y l, I am given l pairs, so I create Y, vector Y, which is vector of all Ys. I will create F, capital from f, which is also all dimensional vector of I will pick up function f, and this is really of this function, endpoint x one, and the last
is value of this function, is a point x l. So this is f, and I have V-matrix. So, and I can rewrite
this functional in the way that in matrix form I have
y minus F, capital from f, V, y minus F, capital from f. But if I will write this
notation in least squares method, I will have Y minus F capital from f, Y minus F capital from f, and here, instead of Y, identity matrix. So, I got some improvement over least squares method. And I hope that it will give me a rate of convergence better than this least squares method, but, let’s see. But it is not major stuff. Because, OK, I am prove
rate of convergence, but is still this square method good? But now, the most important part, selection of admissible set of functions. What it means? When you construct a neural network, you talking that you are
doing smart structure, what it means, you’re talking
that you’re constructing smart, admissible set of functions. You know something, you are smart guys, you’re just constructing,
and then you minimize over the set of functions. But, let me consider from
a theoretical perspective, what it is. If you consider Hilbert space, and also Euclidean space, in this space, there are
two ways of convergence. Strong convergence, it is
convergence of functions, it is first line. My set of, my sequence of
function f l converge to f zero, if f l is integral converged
from f l goes to infinity. But there exists weak convergence. We say that my set of, my
sequence of function f l converge for f zero, if this inner product converged to this inner
product for all function phi, from Hilbert space. You can see that this is an inner product described property of function, if for all functions,
property is the same, that will be convergence. So it is easy to show that if
you have strong convergence, you also have weak convergence, just from Watson, from Schwarz inequality, Cauchy-Schwarz inequality. But also, one can prove that
if you have weak convergence, and your set of function
belong to compact, you also have strong convergence. So, in some sense, the both
convergence are equivalent. In our consideration of machine logic, you can see the strong
convergence everywhere, 100%. But what about weak convergence? Let’s explore this opportunity. Let us consider pattern recognition case. For pattern recognition
case, the first line, I just write in the
definition of weak convergence equals second, and that is I use bias in the equation, it is phi from dP equal one over phi for all functions from phi. So it converge, it must converge for
all functions from phi. If it converge for all functions from phi, I will have one function
which is desired one. But it is not realistic. Let us do following. Let us select from set of Hilbert space, m function phi. We will talk a lot how to
select these functions. So, and then we will consider equality, not for all function, but
just for this m function. And we will call admissible
set of functions, the set of functions which
satisfies this equality. We know that our function must satisfy this equality for any phi, any phi, because of the convergence, but we select something which we want. OK, if you will use, instead of our cumulative distribution
function, an empirical estimate, we will have instead of
integral property like here, the property written by the sum. Again, let me use the same notation for matrix m, I will use Y vector, I will use function F capital
from f, which is f from x one, f from x l, which is vector, for any function f I have vector, and also, I introduce new vector, vector of predicates on the value x one, x l. Because phi is function, I can consider value of
function in this case. Then, I can write my equation in this form. I would like that my
admissible set of function satisfy in vector form, these m equations. Now, let me explain what we talking about. There is a duck test logic. If it looks like a
duck, swims like a duck, and quack like a duck,
then it probably is a duck. What this means, we have
these statistical invariants in vector form like this, this line. What it does, it collect
admissible function which identify animals as a duck if it looks, swims and quack like a duck. So if you will choose
predicate which explains what means, looks, swims, and quack, then your admissible
function will be a function, such function for which classify animals that swim, quack, and looks like a duck. Concept of predicate, it is
very different from feature. Why so? With increasing number of predicates, the VC-dimension of admissible
set of function is decreased. Why does it decrease? Because we have set of function, then we from this set of
function, select function which satisfy new predicate. Not all of function will satisfy, and consider only set of
function which satisfy all these predicates. But with increasing number of features, VC-dimension increase
because your decisions are all becoming more and more diverse. So what is exact setting of
complete learning problem? Minimize functional
which is this V matrix. A little bit improved of
the square functional, subject to this constraint. And this constraint is
what you would like to see in the set of admissible functions. But, existing classical
method of pattern recognition, they just minimize this functional, the least square functional. So, minimizing this functional
subject to this constraint, that is our set. That was exact setting, which is in mathematical is called conditional optimization, optimization of functional
under conditions. But the approximation is
unconditional optimization. I would like minimize this functional subject to this constraint,
but I will do following. I will make sum of this functional, and I will take square of difference between this constraint and make a sum, the sum weight, weight, it should be one. So I would like to do both,
to minimize over both, and both weights have
important for me to minimize, they can have important for
me to minimize constraint. And then if I will do that, I can rewrite this function in this way where you have to
minimize this functional, and where P is just covariance
matrix of your predicate. And everything is simple compute. So that is concept, what we have to do. We have to solve our problem using both big and strong, strong convergence that
means, using invariance and minimizing functionals, and we can do it in exact way, and in approximation. But, here it was written
for any set of functions. I did not talk how I will minimize that. So it is true how as the
least squares method, you can minimize the
least square functional for any set of functions,
that is the same here. But now, let me see how I can find the solution. And first of all, I will
do it for reproducing kernel Hilbert space. This is the definition of
reproducing kernel Hilbert space. You have some kernel
which is Mercer kernel, you multiply, you’re taking the product this function f of x, and
you have the same function, it is important to use the same function it is called reproducing
kernel of Hilbert space. And it is known that
kernel, Mercer kernel, have expansion of lambda where lambda is there’s no negative values, and psi is orthonormal functions. So, set of function is inner product and norm. This is inner product, and that is norm. Forms reproducing kernel Hilbert space, it is very easy to check. It means that if you will use some function psi, orthonormal function psi, and its expansion of c, and if you will have this set of function, and if you will introduce
special inner product, inner product of this type, and then you will have the definition. So you will have reproduction
kernel Hilbert space. It is pretty general space. But, in reproducing kernel Hilbert space, you have a great theorem
called representer theorem. And representer theorem says if you would like to
minimize this functional, you subset the function,
and subset of function, it’s norm of your function in reproducing kernel
Hilbert space is bounded, so then your solution has
a linear representation, over kernel with finite
number of parameters. So, let’s introduce
matrix, vector of functions f K x one effects, it is vector expansion, and then square of our norm will be A, and this is A
over K, this is what is, how it looks, your function
which you’re looking for, A K A is norm of your function from reproducing kernel Hilbert space, K is matrix K xi xj, and this is F of f from reproducing kernel Hilbert space is just linear function. Subset of function is bounded
norm inside of VC-dimension, the smaller C, the smaller VC-dimension. And that is according to
second theorem with j, show you before. To control VC-dimension,
you should control just C. You should be looking for
norm of this function. So the conditional
minimization in producing kernel Hilbert space,
has closed form solution to minimize this functional
subject to this constraint. And constraint on bound of the norm will give you, and this is your solution, linear expansion of this
vector of functions, and value of coefficients is like that where it is just in closed form with this function of matrix,
this multiplication of matrix, this is gamma c, C is depends on this , where you see gamma of
c depends on this C, this is identical matrix,
or you have the solution. And to find u over here, you
have to solve linear equation. So, you’re solving linear equation, and then you have closed form solution. So the complete problem in
reproducing kernel Hilbert space have closed form solution. But what about unconditional minimization, approximate minimization like that in this constraint? It also has closed form solution, and this is how it looks,
closed form solution. Your solution is coefficients
over this expansion, and you have this equation you have to find A, so everything is computable. But very special rule play in explanation support vector machine. What is support vector machine? Given data, I would I would like when reproducing
kernel Hilbert space, this bounded norm
minimize this functional. Then, when they’re
minimizing this functional, that is exactly you will
get support vector machine. If you look there I’ve
supported the machine, it is just minimization this function now, this set of function. But here, you’ll do something else. I will do data like in
support vector machine, y i minus A theta A, that is I would like approximate data. It’s unknown like here,
but also I would like to that my invariant will be good enough, will be close. Left hand side and right hand side, often invariant will be close. So I would like minimize this functional under the same constraint. And this is the solution. The solution is like A, A is function which is phi from t, and phi from t, as you remember, it is vector of predicate, and this is indicator vector. If I would like, here to make strong for invariants and not too strong
for approximation function, I just want to use just weak convergence. I will have that my A is defined by invariants, by function of predicate. But, my function of predicate,
how I choose predicate? I can choose any function I want. That means that if I
can choose one function which will give me optimal solution, then there exists a smart predicate that I will not need a lot of data. I need what, why I need data, I need data for expansion of our kernel. But for estimating
coefficients, I don’t need data. I use my predicate and that means that what is, what means predicate? It is property, it is explanation about what I want. I will talk about this later. So, this example of
support vector machine show that philosophy of this
learning is very different. According to representer theorem, solution of learning
problem in reproducing kernel Hilbert space have a property. It defined linear parametric function in form of expansion of kernel function. That means that optimal expansion belong to one layer network,
not multi-layer network, but because of reproducing
kernel Hilbert space, is richer than this. Maybe it is not the best idea to use deep network, deep network, OK, we will discuss this deep net. Observation vectors and kernel
define basis for expansion for optimal l parametric solution. But parameters is defined by invariants, but you could use these invariants, and what it means if you
will write in the form, that you have K which is matrix, depending on your training data, phi is predicate matrix, so
you have vector over here, and this is Y vector, F is vector, so you have element, you have different formulation of learning problem in term of this vector and these values. So since function in Hilbert space can be, since any function from
Hilbert space can be used, because when we talked
about weak convergence, it converged for all functions
in the case of Hilbert space. So any function can be used. So we have a chance to pick
up several smart functions, or maybe even one, and it will
be enough for our training, and then we will talk
about what means smart, it is intelligent learning, not just brute force learning. So, let me give you an
illustration to have a feeling. This is least square method, what it does. This is V-matrix method, it will do a little bit improvement. It’s a little bit better. Here, if I will introduce
invariants, I will do better. Here, if I will use both
invariant and V-matrix, so you can see there’s
difference for 48 points, and to be sure that it
is difficult problem, we took 16 from one class
and 32 from another class, it is not difficult. This is 98 points, the same story, this is 192 points, the same story, and this is multidimensional case, so we check something and we did it. And that very interesting case. We introduce some invariants and what 22.73 error rate for diabetes. And we decided can we find invariant to improve it, performance. And what we did, we’re
just looking for area where our invariants does not take place, they violate it. Then we just took predicate which just doing with this area which can’t have one when your, when point belong to this area, and zero when it doesn’t belong. And we improve using this invariant from 73, .73 to .07. So, that means that if you have smart way to looking for invariants,
then you can have a chance to improve your performance, but this is exactly the same
philosophy which use physicist. Find the solution, the box in figure, where the existing solution,
find situation, the box, where existing solution, the
approximation which we obtained contradicts evidence,
does not keep invariants, contradicts invariance inside the box. And then modify the solution,
obtain new approximation which resolves this contradiction, which doesn’t have this contradiction. So you’re just looking,
you inventing invariants, looking where you have contradiction, and that is exactly the same principle that use physicists to
discover law of nature. To discover law of nature, physicists first trying to find situation where existing theory
contradict observations. So invariant fail. Theoretical predictions do
not supported by experiments, that means invariant, but. Then they trying to reconstruct theory to remove contradiction, they
construct new approximation of theory which does not
contain contradiction observed. But the most important part in physics like in here is the more difficult part in scientific discovery, how to find contradictive situation. Let me show something about neural net. I am not fond of neural nets, but we can use our theory from neural net as well. What is neural net? That is neural net, you’re
minimizing least square error. I would minimize this
approximation of invariants which is contained VP
matrix, V plus P matrix, matrix P with invariant, and then I will do the
same back propagation. It so happen that I easily
can do back propagation, not just for this matrix,
but also for this matrix. And if I will do back propagation, I need the back propagation
to do only one correction. Instead of back propagation error where is y minus what you
have on the last layer, and you have all l observations. You just have this matrix and you multiply your vector of propagation on this matrix, you’re correcting your propagation. So that is what is back propagation about. You do for one step, it is OK, the same. You have back propagation,
it is border condition, in back propagation step you
should do some correction, only one on the very last level,
and then update your steps. So I came to NSC and ask guys who have a neural network to incorporate this,
just this improvement, small improvement this. I took just one invariant,
a very trivial invariant. C of x equals one, c of x equals one, I will show you that it’s
not so simple invariant, we will discuss what
it, what invariant does. And then, we set V equal
I, we did not use V matrix. Here it is just identity matrix. We used 1,000 examples,
100 per class, batch six. And this is this line is what does neural network, deep neural network, they have
a good deep neural network. And that’s what this does
improve neural network. Instead of 3.1 they got 2.9. OK, let’s do just one, not very important cosine coefficient. I have a picture of my, my digit. I make cosine expansion, so I have Fourier coefficients for one coefficient for here,
and I will use this predicate so I will use predicate
with this one coefficient. And again, you will see just one invariant with stupid cosine makes improvement, OK? Then we decide, let’s do more. Let’s do 16 coefficients of Fourier. Four from x one and four for x two. And we got .6 bigger, but I can do whatever I want,
I can do 1600 invariants. And I can make, it’s a
lot of game can be played. But, let’s also, but in neural net, we use approximation of exact solution but it works. The statistical part of
learning theory is complete. Why is it complete? Because there exist only
two ways for convergence in Hilbert space. Convergence in functions,
convergence in functionals. There are no third way of convergence. So, from a conceptual point of view, you can play one of two
game or two games together. So why I call this complete, you cannot imagine something else. If you would like to do something what is improved learning,
you should ask yourself how to choose invariants, and that’s what I will talk about. Invariant, it is something
about intelligence, when they talk about duck test, I use looks like a
duck, swims like a duck, quack like a duck, but I can
say play chess like a duck. I can say whatever I want and it should be invariant equality, or say singing can not be like a duck, OK? But, among all these stupid predicate, there exists smart predicate, and subject of learning and subject of intelligent learning, some have to pick up the smart invariants. Let us discuss what is predicate. I don’t know, exactly, I think this is for many
hundred years theory, I will show you that it is continuation of major philosophy from Plato to Hegel to Wigner and so on, I will show you. But it is in predicate, so my claim that when you’re talking
not about imitation, but what is essence of
intelligence, essence in predicate. Predicate is something extra. OK, I will show you later. Let’s say predicate, that is that invariant holds true, mathematically. So let’s take f of x equals one. What does this predicate? Expected number of element
of class y equals one computed using this predicate, equal to number of training
example of the first class. When you will use this predicate, you will pick up function
which will give you on this training data, the number of represents of the first class, exactly the same like in your training data. And that is this predicate. And you saw how strong this
predicate for neural net in terms of class of recognition because they have something. Let’s take another
stupid predicate, just x. It looks like a duck,
it is center of mass. I want the expected center of mass, expected to be the respect
the conditional probability will be equal to average to
center of mass which I see on my training data. That looks like a duck. But you can do smart looks like a duck. So, I can consider x x transport which makes matrix, it is correlation, covariation matrix, and I can see n squared over two predicate of this type that’s covariation which I will get using this predicate using function which, sorry. Predication which I will
get with obtained solution will be the same like
covariation which I observed on my training data. That means this predicate. So, but, I sense that we should not
go for general predicate, and we can imagine when your
predicate, I am not so smart, that to construct very general predicate, but let’s do for 2D image predicate. Like u x one, x two, the
function of two variables, it is image of digits, say, in our case. And we have this function,
y, this function, y. And let’s consider predicate
like I will take image, I will consider coefficients for Fourier, it is my predicate. I want expected value with respect to this over my predicate, will be the same like I show my training data. I can consider convolution, because convolution neural
network is one predicate. I will show you which is predicate. And this is this convolution of point x y, x of different points. You can use value, you
can use whatever you want, because whatever is
coming from inner product it is you who decided what you should use. And the understanding
of image recognition, it means understand which
predicate involved in that. But also, the difference
between predicate and invariant, predicate is abstract concept, but invariant is from your training data, it’s what makes specific
your abstract concept. It’s also general, but it is specific. And that is, I want you to show instruments for special predicates. Let us consider vector x y, x j, just the digit recognition, keep in mind, your digit recognition. And suppose x is your pixel space, and you may clean your transformation, small linear transformation
of your pixel space. And if you have small linear
transformation of pixel space, you transform your picture. But you can transform picture, you can also transform
using Lie derivative, I will show you what is that. But, to see what is
that, I took this picture from paper by Simard et al. Show you, this is
transformation, the first line. In pixel space, you may
clean your transformation. And here, you make the same transformation as Lie derivative, this is Lie derivative, this black one. It just computed Lie derivative, I will show you how to compute. And alpha is coefficient, and
using different coefficients, a equals minus two, you
just have this pair, a equals minus one, you can have this, this, this, and all this stuff. So, you can create clone, clones of digit two, which is transformed with
respect to Lie derivative. You don’t need to have a lot of data. You need to have for digital recognition, you need to have predicate,
and from any data you can get this predicate. And OK, I will show you
invariant with this predicate. And this is what is Lie derivative. It is horizontal translation, x, first coordinate, plus a, you just move it in direction a x one. Then vertical transformation. This is rotation, this
standard from geometry, for full rotation, for small
rotation, you have that. And this is d dx, this is d dx two and so on. You have all this stuff here. And this is a big illustration
from Patrick Simard. Clones, you have this three and
you create all these clones. Just you choose, one, two, three, four, five Lie derivatives, two
different coefficients, and then you have all this stuff. But this is smart guy, why you not taking, like predicate, just Lie derivative of it? And we’ll take invariant with
respect to Lie derivative, it is, I would like to learn
if using statistical invariant try to estimate such
conditional probability, which keep invariant with
respect to all derivatives. But even more, it’s
again from what was done. So suppose I have this set of clones of my digit. This is set of clones of another digit. I call tangent distance
the closest element from these two clones, what that means. I have two digits, they are
different, say, two, three. Then I massage them with
linear transformation to make it as close as is possible, and measure closeness of them. That’s called Lie invariant. That’s called tangent distance. And now, I believe that
this is general concept of predicate symmetry. When you have any picture, you can say what is measure of symmetry
of these two pictures. So for example, you have
three, what I can do, I have, this is my digit three, I will have for horizontal symmetry, I will take first line
here, second line here, last line here, then I
will do the following. I will make another digit. I will make last line the first line. This line, the second line. So what I am doing. I take three, so I leave it like vector like that. Now I will do this. It is two different images. And now, I will say, let me massage them, this tangent distance, these
two different pictures, how close they are. If they are very close, I can say this is coefficient of
symmetry of this three. But I can say horizontal
symmetry, vertical symmetry, antisymmetry, what means antisymmetry? Digit s, it has vertical antisymmetry. I can have vertical symmetry, everything. So you can play many games with symmetry, and this is just one predicate. I have conclusion remarks. What we did, is that we can
minimize this functional which is slightly better
than, say, least square subject to this constraint
which is serious, because this constraint means
admissible set of functions which are trying to construct being smart. So I can consider, say,
all continuous functions. And then from these continuous functions, select by smart predicate,
admissible set of functions. And I can do it, because
according to weak convergence, any invariant take
place with any function, invariant take place with any predicate. So, and also, this provide unique solution for reproducing kernel Hilbert space, and approximation for neural
network, approximate solution. But further progress goes
beyond statistical reasoning. It goes in direction of
searching of predicate which forms basis for
understanding of problems existing in the world. And what means understanding? It means that, in say, 2D image recognition, there exists concept of symmetry, there exists concept of structure, and if you will know these concepts, I believe that it’s not a lot, I will show you why I say it is not a lot, you will understand this problem. And I think that this
line, it’s very old line. It start from Plato, what says Plato? Plato says that there is a vault of ideas, and vault of things, and vault of ideas make vault of things. But you see that I have ideas which is predicate, which abstract can be applied
to different situations, but I have vault of things. But then, in 300 years ago, it was classical German philosophy about that, what it means,
ideas, what means things. And Hegel told, whatever
is reasonable, it exists. It is exactly what we
said about predicate. And whatever exists, it is reasonable. So there is two connections. But recently, 60 years ago, Wigner wrote an article about unreasonable
effectiveness of mathematics. It just says that mathematics
knows something about reality. If you would like to understand reality, you should look in equation
and you will see how it works. So predicate, it’s abstract idea, while invariants that are built using them form elements of solution. These two concepts reflect
essence of intelligence, not just its imitation which
is in artificial intelligence. But, that is subject
which we should attack. And also, I have two more
slides, one slide is challenge. I know that people from deep network get .5% of error rate using 60,000 observations. The challenge is, use 1% of this data and get the same result. But even smart predicate and
all these clones which exist, I think that it is doable. And the very last slide. In 1928, guy Valdimir Propp published book “Morphology of Folk Tale” where
he describes 31 predicates that allow to synthesize
all Russian folk tales. Later, his morphology, the 31 predicates, was applied to literature, to theater, to film, to television,
to television series, to games, et cetera, and this 31 was enough. And this I read from
Wikipedia, you can check it, with Wikipedia of the book. Propp found 31 predicates which describe different actions of people in real world. Probably there exist a
small amount of predicates that describe 2D images. And that is intelligence,
that is how to find them. That what I believe
should be learning about. Thank you. (audience applauds) – [Host] I think we have
time for a few questions. – [Man in Audience] Hello,
thank you, I have two questions. First one is, do you know of any predicates that you recommend for language classification
tasks, specifically? And the second question is,
do you have any strategies for hedging against over fitting? Like if you specify too many predicates, then you might be sort of– – Sorry, I not hear you well, but second question is about over fitting? – [Man in Audience] Over fitting, yes. – Yes, let me answer this. – [Man in Audience] Sure. – The more predicate, you have
why over fitting can happen. Because your set of function is big, and you have small amount of
data in selecting function. So you can select whatever you want. But if you increase number of predicate, you decrease set of function. So the more predicate, the
less over fitting happened. And if you will, a theory
of mathematics says, that if you have infinite
number of predicate, you are left with one
function, if you want. – [Host] He also asked
about natural language. Recommendations for
predicates for language, natural language
processing, the Turing test, any good predicates. – You know it is very complicated story, natural language, I don’t know. – Questions?
– You know, it is, whatever I am talking it
is very trivial, simple. Everyone familiar with 2D images. And we can think, like
this guy Vladimir Propp, what is predicate in these images. Can we formulate, if you’re smart guy, say, couple of dozens, or
maybe one dozen predicate, it should be enough. – [Host] But language
is harder than images. – Oh yeah, absolutely. (audience laughs) Yeah, but don’t do the
immediately hard problem. Try to–
– Try? – Yeah, I tried a very
simple, just step-by-step. It so happens that is
main line of philosophy from Plato to this guy who says that ideas is not too much. There’s not too many
ideas that are existing in world of ideas. It could be like that. – Vladimir, thank you so
much for coming today, and please give him a big hand. (audience applauding)

26 thoughts on “Complete Statistical Theory of Learning (Vladimir Vapnik) | MIT Deep Learning Series”

  1. I really enjoyed this talk by Vladimir. Here's the outline:

    0:00 – Introduction

    0:46 – Overview: Complete Statistical Theory of Learning

    3:47 – Part 1: VC Theory of Generalization

    11:04 – Part 2: Target Functional for Minimization

    27:13 – Part 3: Selection of Admissible Set of Functions

    37:26 – Part 4: Complete Solution in Reproducing Kernel Hilbert Space (RKHS)

    53:16 – Part 5: LUSI Approach in Neural Networks

    59:28 – Part 6: Examples of Predicates

    1:10:39 – Conclusion

    1:16:10 – Q&A: Overfitting

    1:17:18 – Q&A: Language

  2. HI Lex ๐Ÿ™‚ , Can you do a series/playlist on NLP research and where NLP is going after 2020 and its future? That'd be really helpful!!!!

  3. Arriving here from the podcast, must say that horizontal expansion will give us the models that we need and yes, even after than it would be an imitation. Intelligence seems to be far from our reach as of now.

  4. Huge respect for the gentleman (he is a legend for us, AI-Masters students in Ireland;) Thank you for uploading to YouTube.

  5. i find russian intellectuals to be extremely interesting. they make huge contributions and yet get a miniscule amount of funding. they make briliant work while at a time when stalin is putting them in goulags. what little of their work i have been introduced to they always seem to be focused on the practical and inutitve while americans who are comparatively extremely well funded go off and try to seem extremely esoteric and ivory towerish. gotta respect folks who focus on being useful and practical.

Leave a Reply

Your email address will not be published. Required fields are marked *