– Today, we’re happy and honored to have Vladimir Vapnik with us, co-inventor of support

vector machines, support vector clustering, VC theory

of statistical learning, and author of “Statistical

Learning Theory”. He’s one of the greatest

and most impactful statisticians and computer

scientists of our time. Plus, he grew up in the Soviet Union, eventually heading the

Computer Science Department at the Institute of Control

Sciences in Moscow. So he will give today’s lecture in Russian and I will translate. Just kidding, right. (laughing)

(audience laughs) It’s an honor and a pleasure

to have Vladimir with us today, so please give him a warm welcome. (audience applauds) – Thank you. About 50 years ago,

Professor Chervonenkis and I started statistical learning theory. The problem was to answer the question: if you do well on a small amount of training data, when will you also do well on the test data — when will you minimize the expectation of error? The theory solves this problem. It appears in most books; they may be written in different languages, but mostly they follow this line. The line is that the law of large numbers alone is not enough: you need a uniform law of large numbers, uniform convergence, and so on. But because we started this discussion from the empirical error — how well you do on the training data — and the bounds show that the better you do on the training data, the better you will do on the test data, people decided that this is the only way: take training data, do something with this training data to minimize the number of errors. All algorithms were constructed on this principle. About five years ago I

found that there exists another principle, even more interesting than this one. The first principle is a brute-force principle: give me more data, and I will give you a better answer. The second principle is an intelligent principle, and that is what I will talk about today. So I will start with

statistical learning theory, and then I will introduce this new theory. Because there are only two ways for generalization — one is data, and the other I will show you — I call it the complete statistical learning theory: there is no other way to do something, no shortcut for generalization. You should use both of them, so that is the complete theory. And it is not so bad, because you will see that learning theory is moving in a different direction — in the direction of intelligence, toward understanding what intelligence is. It is not quite what Turing explained: Turing said that you should imitate an intelligent person, but now the question is what intelligence is, and we will discuss this. So let me start. The first part is the

theory of generalization, and the question is: when, for a given set of functions, can you minimize a functional? This is a pretty general functional, with a loss function L(y, f(x)); as a special case I can consider the difference between y and the function, which is what they do in regression and pattern recognition and all this stuff, but this is a more general setting. What is more important: when we consider minimization of a functional, we should say over which set of functions. In the given set of functions, you have to minimize the functional when the probability measure is unknown but you are given i.i.d. pairs of data. That is the exact setting of both the pattern recognition and the regression estimation problems: for pattern recognition the set of functions is indicator functions, for regression it is continuous functions, but the statement is the same. And the answer is very simple: we can minimize this functional using data if and only if the VC-dimension h of the set of functions is finite. Everything depends on the set of functions over which you have to minimize the functional, and you cannot avoid that. So we call it capacity —

but maybe it is better called the diversity of the set of functions — and the measure of this capacity, or diversity, is the VC-dimension. What is the VC-dimension? I will give you the definition. First, for a set of indicator functions: θ is the step function; you consider a continuous function f, and in front of it you apply the indicator — if the function is positive you say one, if it is not positive you say zero. That is the indicator. The VC-dimension of a set of indicator functions equals h if h is the maximal number of vectors that can be shattered — separated into all possible 2^h subsets — using indicator functions from this set. So you should find h vectors which you can shatter in all possible ways, but no h plus one vectors that can be shattered. Then the VC-dimension of this set of functions is h. This is a purely combinatorial definition. And if you can shatter vectors for any l, then the VC-dimension is infinite.
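The shattering definition can be checked by brute force on a toy class — here a hypothetical set of threshold indicators f_t(x) = 1 if x > t, else 0, on the line; the class, the points, and the threshold grid are illustrative choices, not from the slides:

```python
# Brute-force check of the shattering definition for threshold indicators.
def labelings(points, thresholds):
    """All distinct labelings the threshold class induces on the points."""
    return {tuple(1 if x > t else 0 for x in points) for t in thresholds}

def is_shattered(points, thresholds):
    # shattered = all 2^h labelings of the h points are realized
    return len(labelings(points, thresholds)) == 2 ** len(points)

ts = [k / 2 for k in range(-2, 10)]        # a dense grid of thresholds
print(is_shattered([1.0], ts))             # True: one point is shattered
print(is_shattered([1.0, 2.0], ts))        # False: labeling (1, 0) is impossible
```

Since one point can always be shattered but two never can, the VC-dimension of this class is one — much smaller than the number of candidate thresholds.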

Now I will give you two theorems which we will use — probably the main theorems of this theory. If a set of functions has VC-dimension h, then with probability 1 − η, for all functions in the set, this bound holds true. And because the bound holds true for all functions, and you would like to have the minimal estimate — the minimal right-hand side — you choose the function which minimizes the empirical loss. The second theorem: if you

have the set of indicator functions obtained from a set of linear functions, and it so happens that your vectors x lie inside a sphere of radius one while w is bounded by C, then the VC-dimension is bounded by the minimum of two quantities — a term involving the radius and C, and n, the dimensionality of the space — plus one. That means exactly that the VC-dimension can be smaller than the dimensionality of the space, and you can control some part of the VC-dimension; I will show you afterwards how to do it. It is a very important theorem. But what is the general way

that VC theory suggests for searching for functions? You have a set of functions; maybe this set of functions has infinite VC-dimension. Then you make a structure on this set of functions: you choose a small subset of functions with VC-dimension h one, then another subset of functions which includes the small one, with VC-dimension h two, and so on — a nested sequence of subsets with non-decreasing VC-dimensions. Then, when you minimize the functional, you are doing two things: you choose the appropriate subset, and within this subset you pick the function which minimizes the empirical loss. But you can see that the epsilon in the bound depends on the VC-dimension of the subset you choose, over l, and it also depends on the empirical loss which you achieve. So as soon as you make a structure, you can do whatever you want, even if the initial set of functions has infinite VC-dimension.
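A schematic sketch of this structural idea, with a nested structure S_1 ⊂ S_2 ⊂ … of piecewise-constant functions on k bins and a simplified, purely illustrative capacity penalty (the data, the structure, and the penalty form are my choices, not the bound from the slides):

```python
# Structural risk minimization sketch: pick the element of a nested
# structure that minimizes empirical loss plus a capacity term.
import math, random
random.seed(0)

l = 200
xs = [random.random() for _ in range(l)]
ys = [math.sin(6 * x) + random.gauss(0, 0.2) for x in xs]

def empirical_loss(xs, ys, k):
    """Mean squared error of the best k-bin piecewise-constant fit."""
    sums, sqs, cnt = [0.0] * k, [0.0] * k, [0] * k
    for x, y in zip(xs, ys):
        b = min(int(x * k), k - 1)
        sums[b] += y; sqs[b] += y * y; cnt[b] += 1
    # within each bin the optimal constant is the bin mean
    return sum(sqs[b] - sums[b] ** 2 / cnt[b] for b in range(k) if cnt[b]) / len(xs)

def srm_choice(xs, ys, k_max=50):
    best_k, best_bound = None, None
    for k in range(1, k_max + 1):
        # illustrative capacity term: grows with k, shrinks with sample size
        bound = empirical_loss(xs, ys, k) + math.sqrt(k * math.log(len(xs)) / len(xs))
        if best_bound is None or bound < best_bound:
            best_k, best_bound = k, bound
    return best_k

k_star = srm_choice(xs, ys)
```

The empirical loss alone always prefers the largest k (the richest subset); the capacity term is what forces a finite choice.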

Now, this is more or less all that the main results of VC theory contain. But VC theory does not answer four very important questions. How to choose the loss function L(y, f)? How to select the admissible set of functions f(x)? I told you “a given set of functions” — but if it is a stupid set of functions, how do you construct a good set of functions? How to construct a structure on the admissible set of functions? And then, how to minimize the functional? In this talk I will try to answer all those questions. Target functional for minimization — and this is an important

slide: God plays dice. What is the setting of the pattern recognition problem? In this talk I will consider just the pattern recognition problem for two-class classification, but the generalization is straightforward. So, what is the setting of the pattern recognition problem? A generator — nature — generates, randomly and independently, situations x which come to an object, and this object carries a conditional probability of y given x: the probability that y equals one given x, and that y equals zero given x. The object knows this conditional probability and plays dice: it has the conditional probability function, it gets x on the input, it plays dice, and it says y. That is the most general setting of the learning problem; the deterministic setting is a particular case of it. And what does the learning machine do? The learning machine has a set of functions, and it can choose any function from this set. The problem is, observing l observations (x one, y one), ..., (x l, y l), to pick the function for classification. That means, given observations generated by the

joint probability P(x, y) = P(y | x) P(x), to find the rule that minimizes the functional. In my case, y is zero or one, and f is also zero or one, so it is an indicator function, and the loss functional just counts how many errors I make; since there is a probability measure, it is the expectation of error. I would like to find the function which guarantees the smallest expectation of loss. But this is not very convenient. Why not? Because my loss, as a function of the model, is zero everywhere except at some points where it is one, so I cannot use gradients in this case. I have to do something smarter. What people do is replace the indicator loss with just the squared difference between y and f(x), instead of what I formulated before. It is not so bad a choice, because it so happens that the minimum of this functional gives the conditional probability function — the probability that y equals one given x. And once we can find this probability of y equals one given x, we can easily construct our decision rule: if the conditional probability exceeds 0.5, say first class; if it is less than 0.5, second class. This is the optimal solution. But something is wrong with this replacement. Let us rewrite the first line: inside the bracket of the first term I subtract and add the regression, so I have two brackets instead of one, and then I take the square — it is just the usual binomial expansion of two terms: the square of the first, the square of the second, and twice their product. The first integral does not depend on the function which I am looking for, so minimizing my functional over the set of functions means minimizing the sum of the two last terms. But our goal is to minimize the first of these — to find the function which is close to the conditional probability function — not the sum of two integrals. One can show that the second integral eventually goes to zero, but it slows down the rate of convergence. To have a good rate of convergence, we need to find a way to minimize the first integral alone, not the sum of these two. And that means not this square loss, but something else. What can we do?
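The earlier claim — that the minimizer of the squared loss for 0/1 labels is the conditional probability — can be checked numerically; the fixed point x with P(y = 1 | x) = 0.7 and the grid search are illustrative choices:

```python
# For 0/1 labels with P(y=1|x) = p, the constant minimizing the
# expected squared loss should be p itself.
import random
random.seed(1)

p = 0.7                                     # hypothetical P(y=1|x) at a fixed x
ys = [1 if random.random() < p else 0 for _ in range(20000)]

def risk(f):
    """Empirical squared loss of predicting the constant f."""
    return sum((y - f) ** 2 for y in ys) / len(ys)

f_best = min((k / 100 for k in range(101)), key=risk)   # grid search on [0, 1]
```

The grid argmin `f_best` lands near 0.7 rather than at either hard label, which is exactly why thresholding the squared-loss minimizer at 0.5 gives the optimal decision rule.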

First of all, when y is zero or one, the probability of y equals one given x is some real-valued function between zero and one, and from the Bayes formula we know that the conditional probability of y equals one given x, multiplied by p(x), is the joint density of y equals one and x. That is always true. Now, if I multiply by some function G(x − x*) which belongs to L2 space, and integrate, I will have this equation. And one can say that the conditional probability is a solution of this equation — it was constructed exactly like that: if you put in the conditional probability, the equation holds. But I would like to solve this equation to find the function when I don't know the probability measure but am given data, observations generated according to P(x, y). And solving an equation like this is an ill-posed problem. OK, let's do that. But before I do that,

I would like to mention that in classical statistics there is a way to replace the unknown probability measure with the empirical measure, and that is the most important part, the main inductive step of statistics. In statistics we are given data and would like to recover a function; no matter how much data, the data is not equivalent to the function — we will see this. So in classical statistics, people suggested approximating the cumulative distribution function by the empirical cumulative distribution function — and that is the empirical cumulative distribution function. For 30 years mathematicians tried to prove that this is a good idea, that we can do that, and in 1933 Kolmogorov found the exact bound, which is on the last line, and then people proved that you can bound it like that.
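The quantity these results control — the maximal deviation of the empirical CDF from the true one — is easy to compute; uniform samples are used here because their true CDF is F(x) = x (sample sizes and seed are arbitrary choices):

```python
# Empirical CDF deviation sup_x |F_l(x) - F(x)| for U(0,1) samples.
# The supremum is attained at the jump points of the empirical CDF.
import random
random.seed(2)

def ks_statistic(sample):
    s = sorted(sample)
    l = len(s)
    # at the i-th order statistic the empirical CDF jumps from (i-1)/l to i/l
    return max(max(abs(i / l - x), abs((i - 1) / l - x))
               for i, x in enumerate(s, start=1))

d_100 = ks_statistic([random.random() for _ in range(100)])
d_10000 = ks_statistic([random.random() for _ in range(10000)])
# the deviation shrinks roughly like 1/sqrt(l), as the exact bound predicts
```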

Now, if we can replace the unknown measure with the empirical measure, we can make our problem constructive: in the functional which we would like to minimize, let us replace the real measure with the empirical measure. Then you have an empirical least-squares functional which you have to minimize to find our conditional probability, or whatever you want. But let me consider the new

constructive setting, where we also replace the unknown probability measure with the empirical probability measure obtained from the training data. You see, in the last equation: to find the conditional probability, we have to solve this equation in a set of functions f(x). The right-hand side is known, because our function G is known; on the left-hand side you don't know f(x) — it is our goal to find this function. In classical statistics there was one algorithm, called the Nadaraya-Watson estimator, which shows how to estimate the conditional probability, or the regression function. They just defined it directly: I show the function, which has a numerator and a denominator, where G is a special kernel — say, Gaussian. This is the estimate of regression — a very general way of estimating regression, conditional probability, and all this business — and the question is how to find, if the kernel is Gaussian, the best value of the variance to approximate the conditional probability well.
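A minimal version of that ratio-of-sums estimator, with a Gaussian kernel; the bandwidth sigma plays the role of the variance the talk says must be tuned, and the Bernoulli data with P(y = 1 | x) = x is an illustrative choice:

```python
# Nadaraya-Watson estimate of E[y|x] from pairs (x_i, y_i).
import math, random
random.seed(3)

def nw_estimate(x, data, sigma=0.1):
    w = [math.exp(-((x - xi) ** 2) / (2 * sigma ** 2)) for xi, _ in data]
    return sum(wi * yi for wi, (_, yi) in zip(w, data)) / sum(w)

# Labels drawn with P(y=1|x) = x: the estimator should roughly recover x.
data = [(x, 1 if random.random() < x else 0)
        for x in (random.random() for _ in range(2000))]
```

For instance, `nw_estimate(0.9, data)` comes out much larger than `nw_estimate(0.1, data)`, tracking the true conditional probability.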

People spent a lot of time on this subject, and they have results. But look at the line in the middle: the Nadaraya-Watson estimator comes from a corrupted equation, not from the equation which we derived. Here, in the middle, it is f(x), not f(x_i) as inside the kernel sum, so you can take f(x) out of the sum and you get exactly this estimator. So the classical Nadaraya-Watson estimator is the solution of a corrupted version of the equation we obtained. But what does it mean to solve an equation? To solve the equation means I

just take the difference between the left-hand side and the right-hand side, define the domain where I would like my function to operate, take the square, integrate over some probability measure, and minimize this functional. So let us do that. If you do simple algebra — it is very simple, and you can check it — you will have this functional R(f), which is (y_i − f(x_i))(y_j − f(x_j)) multiplied by some coefficients V(x_i, x_j). We can estimate these coefficients: each is the integral of G(x_i − x) G(x_j − x) over the measure, and this gives a matrix; from the exact formula we know every element of this matrix. So, what is the V-matrix? If we replace this integral with an empirical integral, we will have this estimate of the matrix. If we use, say, the line between minus one and one, with x uniformly distributed over this line and G Gaussian, we will have this V-matrix. So the V-matrix is easy to find. Now I would like to use vector notation. I will call Y the vector of labels of the training data, y one through y l — I am given l pairs, so I create the vector Y of all the y's. I will create F(f), which is also an l-dimensional vector: I pick a function f, and the first element is the value of this function at the point x one, and the last is its value at the point x l. So I have F(f), and I have the V-matrix, and I can rewrite the functional in matrix form: (Y − F(f)) transposed, times V, times (Y − F(f)). If I write the least squares method in this notation, I have (Y − F(f)) transposed times (Y − F(f)) — the identity matrix instead of V. So I got some improvement over the least squares method, and I hope that it will give me a better rate of convergence than least squares, but let's see. It is not the major stuff.
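The contrast between the two functionals can be sketched for a linear model f(x) = a + bx; here V is a simple diagonal weighting of my own choosing, standing in for Vapnik's V-matrix (which has a specific integral form), and ordinary least squares is the special case V = I:

```python
# (Y - F)' V (Y - F) minimization for a linear model, via normal equations.
import random
random.seed(4)

xs = [random.random() for _ in range(50)]
ys = [2 * x + 1 + random.gauss(0, 0.1) for x in xs]

def weighted_fit(xs, ys, v_diag):
    """Solve the 2x2 system X'VX c = X'VY for c = (a, b)."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    r = [0.0, 0.0]
    for x, y, v in zip(xs, ys, v_diag):
        row = (1.0, x)
        for i in range(2):
            r[i] += v * row[i] * y
            for j in range(2):
                s[i][j] += v * row[i] * row[j]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    return ((s[1][1] * r[0] - s[0][1] * r[1]) / det,
            (s[0][0] * r[1] - s[1][0] * r[0]) / det)

a_ols, b_ols = weighted_fit(xs, ys, [1.0] * len(xs))      # V = identity
a_v, b_v = weighted_fit(xs, ys, [1.0 + x for x in xs])    # illustrative diagonal V
```

Both fits recover the true line y = 1 + 2x closely on this clean data; the two functionals differ only in how residuals at different points are coupled and weighted.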

Because, OK, I improve the rate of convergence, but is this square-loss method itself good? Now, the most important part: selection of the admissible set of functions. What does it mean? When you construct a neural network and say you are designing a smart structure, what it means is that you are constructing a smart admissible set of functions — you know something, you are smart guys, you just construct it, and then you minimize over that set of functions. But let me consider, from

a theoretical perspective, what it is. In Hilbert space — and also in Euclidean space — there are two modes of convergence. Strong convergence is convergence of functions; it is the first line: my sequence of functions f_l converges to f_0 if the integral of the squared difference goes to zero as l goes to infinity. But there also exists weak convergence. We say that my sequence of functions f_l converges to f_0 if the inner product of f_l with φ converges to the inner product of f_0 with φ for every function φ from the Hilbert space. You can see that an inner product describes a property of a function; if, for all functions φ, the property is the same in the limit, that is weak convergence. It is easy to show that if you have strong convergence, you also have weak convergence — just from the Schwarz inequality, the Cauchy-Schwarz inequality. But one can also prove that if you have weak convergence, and your set of functions belongs to a compact, you also have strong convergence. So, in some sense, the two convergences are equivalent.
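The gap between the two modes can be seen numerically with the classic example f_n(x) = sin(nx) on [0, 2π]: inner products with a fixed test function shrink as n grows, while the L2 norm stays near √π. The test function φ(x) = x and the crude Riemann sums are my illustrative choices:

```python
# Weak vs strong convergence: sin(n x) goes to 0 weakly but not strongly.
import math

def inner(f, g, steps=20000):
    h = 2 * math.pi / steps
    return sum(f(i * h) * g(i * h) for i in range(steps)) * h

phi = lambda x: x                                   # a fixed test function
ip_slow = inner(lambda x: math.sin(3 * x), phi)     # about -2*pi/3
ip_fast = inner(lambda x: math.sin(40 * x), phi)    # about -2*pi/40, much smaller
norm_fast = math.sqrt(inner(lambda x: math.sin(40 * x),
                            lambda x: math.sin(40 * x)))  # stays near sqrt(pi)
```

So the sequence "loses" every single property measured by an inner product, yet never gets close to zero as a function — which is exactly why weak convergence alone needs the compactness condition to imply strong convergence.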

In machine learning you see strong convergence everywhere — 100 percent. But what about weak convergence? Let's explore this opportunity. Consider the pattern recognition case. For pattern recognition, in the first line I just write the definition of weak convergence; in the second, I use the Bayes relation in the equation, with the integral of φ against dP. It must converge for all functions φ; and if it converges for all functions φ, I will have the one function which is the desired one. But that is not realistic. So let us do the following: let us select, from the Hilbert space, m functions φ. We will talk a lot about how to

select these functions. Then we will consider the equality not for all functions, but just for these m functions. And we will call the admissible set of functions the set of functions which satisfies this equality. We know that our desired function must satisfy this equality for any φ, because of the convergence; but we select the φ's we want. Now, if you use, instead of the cumulative distribution function, an empirical estimate, then instead of the integral property here you will have the property written as a sum. Again, let me use the same notation: I will use the vector Y; I will use F(f), which is (f(x one), ..., f(x l)) — for any function f I have a vector; and I also introduce a new vector, the vector of predicate values at x one, ..., x l — because φ is a function, I can consider its values at these points. Then I can write my equation in this form: I would like my

admissible set of functions to satisfy, in vector form, these m equations. Now let me explain what we are talking about. There is the duck-test logic: if it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck. What this means: we have these statistical invariants in vector form, like this line. What they do is collect the admissible functions — the functions which identify an animal as a duck if it looks, swims, and quacks like a duck. So if you choose predicates which express what it means to look, swim, and quack, then your admissible functions will be exactly those which classify as ducks the animals that look, swim, and quack like a duck. The concept of a predicate is

very different from the concept of a feature. Why so? With an increasing number of predicates, the VC-dimension of the admissible set of functions decreases. Why does it decrease? Because we have a set of functions, and from this set we select the functions which satisfy a new predicate; not all of them will satisfy it, and we keep only the set of functions which satisfy all these predicates. But with an increasing number of features, the VC-dimension increases, because your decision rules become more and more diverse. So what is the exact setting of the

complete learning problem? Minimize the functional with the V-matrix — a slight improvement of the square functional — subject to these constraints, and the constraints are what you would like to see in the set of admissible functions. The existing classical methods of pattern recognition just minimize the functional, the least-squares functional. So minimizing this functional subject to these constraints — that is our setting. That is the exact setting, which in mathematics is called conditional optimization: optimization of a functional under constraints. The approximation is unconditional optimization: I would like to minimize the functional subject to the constraints, but instead I do the following. I take the functional, I take the square of the difference between the two sides of each constraint, and I add these terms with some weights. I would like to minimize over both: both the functional and the constraint terms are important to me. If I do that, I can rewrite the problem so that you have to minimize this functional, where P is just the covariance matrix of your predicates. And everything is simple to compute.
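A sketch of this unconditional (penalized) form: a squared data term plus the squared violation of one statistical invariant, (1/l) Σ φ(x_i) f(x_i) = (1/l) Σ φ(x_i) y_i. The linear model, the single predicate φ(x) = 1, the weight gamma, and the crude coordinate search are all my illustrative choices:

```python
# Penalized form: data term + gamma * (invariant violation)^2.
import random
random.seed(5)

xs = [random.random() for _ in range(100)]
ys = [1 if random.random() < x else 0 for x in xs]     # P(y=1|x) = x
phi = [1.0] * len(xs)                                  # simplest predicate
target = sum(p * y for p, y in zip(phi, ys)) / len(xs)

def objective(a, b, gamma=10.0):
    f = [a + b * x for x in xs]                        # linear model f(x) = a + bx
    data = sum((y - fi) ** 2 for y, fi in zip(ys, f)) / len(xs)
    violation = sum(p * fi for p, fi in zip(phi, f)) / len(xs) - target
    return data + gamma * violation ** 2

# crude coordinate search -- enough for a two-parameter illustration
a = b = 0.0
step = 0.1
for _ in range(200):
    for da, db in ((step, 0), (-step, 0), (0, step), (0, -step)):
        if objective(a + da, b + db) < objective(a, b):
            a, b = a + da, b + db
    step = max(step * 0.95, 1e-3)
```

At the minimum the fitted function both approximates the data and nearly satisfies the invariant; raising gamma shifts the balance toward the invariant, which is the "weights" trade-off described above.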

So that is the concept of what we have to do: we have to solve our problem using both weak and strong convergence — that means using the invariants and minimizing the functional — and we can do it in the exact way and in the approximation. But so far this was written for any set of functions; I did not say how I will minimize it. Just as with the least squares method, which you can apply to any set of functions, it is the same here. Now let me show how I can find the solution. First of all, I will

do it for a reproducing kernel Hilbert space. This is the definition of a reproducing kernel Hilbert space: you have some kernel, a Mercer kernel; you take its inner product with a function f of the space, and you get the same function back. It is important that you get the same function — that is why it is called the reproducing kernel of the Hilbert space. It is known that a Mercer kernel has an expansion in eigenvalues lambda, where the lambdas are non-negative and the psi are orthonormal functions. The set of functions, with this inner product and this norm, forms a reproducing kernel Hilbert space — it is very easy to check. It means that if you use some orthonormal functions psi, expansions with coefficients c, this set of functions, and a special inner product of this type, then by the definition you will have a reproducing kernel Hilbert space. It is a pretty general space. And in a reproducing kernel Hilbert space you have a great theorem called the representer theorem. The representer theorem says: if you would like to minimize this functional over the subset of functions whose norm in the reproducing kernel Hilbert space is bounded, then your solution has a linear representation over the kernel, with a finite number of parameters. So let us introduce the vector of functions K(x one, x), ..., K(x l, x) — the vector of the expansion. Then the square of our norm will be A transposed K A: this is how the function you are looking for is written; A transposed K A is the norm of your function in the reproducing kernel Hilbert space, K is the matrix of elements K(x i, x j), and F(f) from the reproducing kernel Hilbert space is just a linear function of A. The subset of functions with bounded norm has bounded VC-dimension: the smaller C, the smaller the VC-dimension. That is according to the

second theorem which I showed you before. To control the VC-dimension, you should control just C; you should watch the norm of this function. So the conditional minimization in a reproducing kernel Hilbert space has a closed-form solution: to minimize this functional subject to this constraint, with the bound on the norm, this is your solution — a linear expansion over this vector of functions — and the values of the coefficients are given in closed form by this product of matrices, where gamma of C depends on the bound C and I is the identity matrix, and you have the solution. To find it, you have to solve a linear equation. So you solve a linear equation, and then you have the closed-form solution.
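This closed form can be sketched in the standard kernel ridge shape, f(x) = Σ a_i K(x_i, x) with A = (K + γI)⁻¹Y — which matches the structure described here, up to how γ is tied to the norm bound C; the kernel width, γ, and target function are my choices:

```python
# Representer-theorem solution: linear expansion over kernels at the data.
import math

def rbf(x, z, s=0.05):
    return math.exp(-((x - z) ** 2) / (2 * s * s))

def solve(M, b):
    """Gauss-Jordan elimination with partial pivoting (small dense system)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c and M[r][c] != 0.0:
                factor = M[r][c] / M[c][c]
                M[r] = [mr - factor * mc for mr, mc in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

xs = [i / 19 for i in range(20)]
ys = [math.sin(4 * x) for x in xs]
gamma = 1e-6
K = [[rbf(xi, xj) for xj in xs] for xi in xs]
A = solve([[K[i][j] + (gamma if i == j else 0.0) for j in range(20)]
           for i in range(20)], ys)

def f(x):
    # the solution is a one-layer expansion over kernels, per the theorem
    return sum(a * rbf(x, xi) for a, xi in zip(A, xs))
```

Note that everything after building K is just one linear solve — no iterative training — which is the point of the closed-form claim.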

So the complete problem in a reproducing kernel Hilbert space has a closed-form solution. But what about the unconditional, approximate minimization with the constraints as penalties? It also has a closed-form solution, and this is how it looks. Your solution is the coefficients of this expansion, and you have this equation from which to find A, so everything is computable. But a very special role is played here by the explanation of the support vector machine. What is a support vector machine? Given data, I would like, in a reproducing kernel Hilbert space with this bounded norm, to minimize this functional. When you minimize this functional, you get exactly the support vector machine; if you look at the support vector machine, it is just the minimization of this functional over this set of functions. But here you do something else. I take the data term as in the support vector machine — I would like to approximate the data, which is unknown, as here — but I also would like my invariants to be good enough: the left-hand side and the right-hand side of each invariant should be close. So I would like to minimize this functional under the same constraint. And this is the solution: the solution for A is a function of Φ, and Φ, as you remember, is the vector of predicate values. If I make the invariants strong and the approximation of the data not too strong — if I use just weak convergence — then my A is defined by the invariants, by the predicates. But how do I choose the predicates? I can choose any functions I want. That means that if I can choose one function which gives me the optimal solution — if there exists a smart predicate — then I will not need a lot of data. Why do I need data at all? I need data for the expansion over the kernel; but for estimating the coefficients, I don't need much data — I use my predicates. And what is a predicate? It is a property, an explanation of what I want. I will talk about this later. So this example of

support vector machines shows that the philosophy of this learning is very different. According to the representer theorem, the solution of the learning problem in a reproducing kernel Hilbert space has a property: it defines a linear parametric function in the form of an expansion over kernel functions. That means the optimal expansion belongs to a one-layer network, not a multi-layer network — and the reproducing kernel Hilbert space is richer than that. So maybe it is not the best idea to use a deep network; OK, we will discuss deep networks. The observation vectors and the kernel define the basis of the expansion for the optimal parametric solution, but the parameters are defined by the invariants. If you write it in this form — you have K, which is a matrix depending on your training data, Φ, which is the predicate matrix, the vector Y, and the vector F — you have a different formulation of the learning problem in terms of these vectors and values. Since any function from the Hilbert space can be used — because when we talked about weak convergence, it converged for all functions of the Hilbert space — we have a chance to pick several smart functions, or maybe even one, and it will be enough for our training. And then we will talk about what “smart” means: it is intelligent learning, not just brute-force learning. So let me give you an

illustration, to give a feeling. This is the least squares method and what it does. This is the V-matrix method; it gives a little improvement — it is a little better. Here, if I introduce invariants, I do better. And here, if I use both invariants and the V-matrix — you can see the difference for 48 points. To be sure that it is a difficult problem, we took 16 points from one class and 32 from the other. This is 98 points, the same story; this is 192 points, the same story; and this is a multidimensional case, so we checked it there too. And here is a very interesting case. We introduced some invariants and got a 22.73 error rate for the diabetes data, and we asked: can we find an invariant to improve this performance? What we did was look for the area where our invariants do not hold, where they are violated. Then we took the predicate associated with this area — which is one when a point belongs to the area, and zero when it does not — and using this invariant we improved from .73 to .07. That means that if you have a smart way of looking for invariants, then you have a chance to improve your performance. And this is exactly the same philosophy that physicists use.

Find the situation — the box in the figure — where the existing solution, the approximation we obtained, contradicts the evidence: where it does not keep the invariants, where it contradicts the invariants inside the box. Then modify the solution: obtain a new approximation which resolves this contradiction, which does not have this contradiction. So you invent invariants and look for where you have a contradiction — and that is exactly the principle physicists use to discover laws of nature. To discover a law of nature, physicists first try to find a situation where the existing theory contradicts observations — where an invariant fails, where theoretical predictions are not supported by experiments. Then they try to reconstruct the theory to remove the contradiction: they construct a new approximation of the theory which does not contain the observed contradiction. And the most important part in physics, as here — the most difficult part of scientific discovery — is how to find the contradictory situation. Let me show something about neural nets. I am not fond of neural nets, but we can use our theory for neural nets as well. What is a neural net? In a neural net, you are

minimizing the least-squares error. I would instead minimize this approximation with invariants, which contains the (V plus P) matrix: the matrix V plus the matrix P that carries the invariant. And then I will do the same back propagation. It so happens that I can easily do back propagation not just for this matrix but for this matrix as well. And if I do back propagation, I need only one correction. Instead of the back-propagated error, which is y minus what you have on the last layer, over all l observations, you just take this matrix and multiply your propagated error vector by this matrix; you correct your propagation. That is all there is to it. You do it for one step and it is OK, the same. In back propagation it is a boundary condition: in the back-propagation step you should do one correction, only on the very last layer, and then update your steps as usual.
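The single-correction idea just described can be sketched in a minimal toy form. This is my own reconstruction, not the exact formulation from the talk: a single linear layer stands in for the network, V = I as in the experiment described below, and the normalization of P and the weight tau are guesses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: l observations, binary labels y in {0, 1}.
l, d = 100, 5
X = rng.normal(size=(l, d))
y = (X[:, 0] + 0.3 * rng.normal(size=l) > 0).astype(float)

# One trivial predicate psi(x) = 1; its l-vector phi defines the matrix P.
phi = np.ones(l)
P = np.outer(phi, phi) / l**2     # hypothetical normalization
V = np.eye(l)                     # V = I (identity), as in the experiment
tau = 50.0                        # weight of the invariant term (my choice)

# "Network": a single linear layer f(x) = x @ w, trained by gradient steps.
w = np.zeros(d)
lr = 0.1
for _ in range(500):
    r = y - X @ w                 # ordinary residual (the usual backprop error)
    r = (V + tau * P) @ r         # the one correction: multiply by (V + tau * P)
    w += lr * X.T @ r / l         # otherwise the gradient step is unchanged

# The psi(x) = 1 invariant pushes the mean prediction toward the
# class-1 frequency of the training data.
print(abs(np.mean(X @ w) - np.mean(y)))
```

Compared with plain least squares (tau = 0), the corrected solution cannot have a larger mean residual in magnitude, which is exactly what the f(x) = 1 invariant asks for.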

So I came to NSC and asked the guys who have a neural network to incorporate just this improvement, this small improvement. I took just one invariant, a very trivial invariant: f(x) = 1. I will show you that it is not so simple an invariant; we will discuss what this invariant does. And then we set V equal to I; we did not use the V matrix, here it is just the identity matrix. We used 1,000 examples, 100 per class, batch six. This line is what the deep neural network does; they have a good deep neural network. And this is what the improved neural network does: instead of 3.1 they got 2.9. OK, let's add just one, not very important, cosine coefficient. I have a picture of my digit. I make a cosine expansion, so I have Fourier coefficients, and I take one coefficient
and I will use the predicate with this one coefficient. And again, you see: just one invariant with a stupid cosine makes an improvement, OK? Then we decided, let's do more. Let's take 16 Fourier coefficients: four from x1 and four from x2. And the improvement got bigger, 0.6. But I can do whatever I want, I can do 1,600 invariants; there are a lot of games that can be played. In the neural net we use only an approximation of the exact solution, but it works.
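A cosine-coefficient predicate of this kind can be written in a few lines. This is a toy sketch with a made-up normalization and random stand-in "images," not the actual experimental setup:

```python
import numpy as np

def cosine_predicate(img, j, k):
    """psi_jk(x): one 2-D cosine (Fourier) coefficient of image x.
    The normalization here is a guess; the talk does not specify one."""
    n1, n2 = img.shape
    c1 = np.cos(np.pi * j * (np.arange(n1) + 0.5) / n1)
    c2 = np.cos(np.pi * k * (np.arange(n2) + 0.5) / n2)
    return c1 @ img @ c2

# Random stand-in "digits": l images of size 8x8 with binary labels.
rng = np.random.default_rng(1)
l = 50
images = rng.random((l, 8, 8))
y = rng.integers(0, 2, size=l).astype(float)

# 16 predicates: four frequencies along x1 times four along x2.
psi = np.array([[cosine_predicate(im, j, k)
                 for j in range(4) for k in range(4)]
                for im in images])        # shape (l, 16)

# Each predicate yields one invariant: the estimated conditional
# probability must reproduce these training-data averages.
targets = psi.T @ y / l                   # (1/l) * sum_i psi(x_i) * y_i
print(psi.shape, targets.shape)
```

Note that the (j, k) = (0, 0) predicate reduces to the sum of all pixels, so the constant-like invariant is a special case of the same family.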

The statistical part of learning theory is complete. Why is it complete? Because there exist only two modes of convergence in Hilbert space: convergence in functions and convergence in functionals. There is no third mode of convergence. So, from a conceptual point of view, you can play one of the two games, or the two games together. That is why I call it complete: you cannot imagine something else. If you would like to do something to improve learning, you should ask yourself how to choose invariants, and that is what I will talk about. Invariants are something about intelligence. When people talk about the duck test, they use: looks like a duck, swims like a duck, quacks like a duck. But I can say: plays chess like a duck. I can say whatever I want, and it should be an invariant equality; or say: cannot sing like a duck, OK? But among all these stupid predicates, there exist smart predicates, and the subject of learning, the subject of intelligent learning, is somehow to pick up the smart invariants. Let us discuss what a predicate is. I don't know exactly; I think this goes back many
hundreds of years. I will show you that it is a continuation of a major line of philosophy, from Plato to Hegel to Wigner and so on; I will show you. But it is in the predicate. My claim is that when you are talking not about imitation but about the essence of intelligence, the essence is in the predicate. A predicate is something extra; I will show you later. Let's say the predicate is that this invariant holds true, mathematically. So let's take f(x) = 1. What does this predicate do? The expected number of elements of class y = 1, computed using this predicate, equals the number of training examples of the first class. When you use this predicate, you pick up a function which, on this training data, gives you exactly the same number of representatives of the first class as in your training data. That is this predicate. And you saw how strong this predicate is for a neural net in terms of recognition. Let's take another
stupid predicate: just x. It looks like a duck; it is the center of mass. I want the expected center of mass, expected with respect to the conditional probability, to be equal to the center of mass which I see on my training data. That is "looks like a duck." But you can do a smart "looks like a duck." I can consider x x-transposed, which makes a matrix; it is the correlation, the covariation matrix, and I can consider n squared over two predicates of this type. The covariation which I get with the obtained solution will be the same as the covariation which I observe on my training data. That is what this predicate means.
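The center-of-mass and covariation invariants can be written down directly. A small sketch on synthetic vectors; the setup and names are mine, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(2)
l, d = 200, 3
X = rng.normal(size=(l, d))
y = (X[:, 0] > 0).astype(float)          # labels of the "first class"

# "Looks like a duck": psi(x) = x, the center-of-mass predicate.
# The invariant: the estimate f(x) of P(y=1|x) must reproduce the
# center of mass of the first class seen on the training data.
com_target = X.T @ y / l                  # (1/l) * sum_i x_i * y_i

# "Smart looks like a duck": psi(x) = x x^T, about n^2/2 covariation
# predicates at once (the matrix is symmetric).
cov_target = np.einsum('li,lj,l->ij', X, X, y) / l

# How badly a candidate f violates an invariant, e.g. the trivial
# constant estimate f(x) = mean(y):
f = np.full(l, y.mean())
com_violation = np.linalg.norm(X.T @ f / l - com_target)
print(com_target.shape, cov_target.shape, com_violation)
```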

But I sense that we should not go for the most general predicates. I am not so smart as to construct a very general predicate, but let's do predicates for a 2D image: u(x1, x2), a function of two variables; it is the image of a digit, say, in our case. And we have this function y. Let's consider a predicate like this: I take the image and consider its Fourier coefficients; that is my predicate. I want the expected value with respect to this, over my predicate, to be the same as what I see on my training data. I can consider convolution, because a convolutional neural network is one predicate; I will show you which predicate it is. And this is the convolution of the point x with different points. You can use any value; you
can use whatever you want, because whatever comes from an inner product, it is you who decide what you should use. And understanding of image recognition means understanding which predicates are involved in it. Note also the difference between a predicate and an invariant: a predicate is an abstract concept, but an invariant is built from your training data; it is what makes your abstract concept specific. It is still general, but it is specific. And now I want to show you instruments for special predicates. Let us consider vectors x_i, x_j; just keep digit recognition in mind, your digit recognition. Suppose x is in your pixel space, and you make a transformation, a small linear transformation
of your pixel space. If you make a small linear transformation of pixel space, you transform your picture. But you can also transform the picture using the Lie derivative; I will show you what that is. To see what it is, I took this picture from a paper by Simard et al. This is the transformation, the first line: in pixel space, you make a linear transformation. And here you make the same transformation with the Lie derivative; this is the Lie derivative, this black one. It is just the computed Lie derivative; I will show you how to compute it. And alpha is the coefficient, and
using different coefficients: a equals minus two, you have this pair; a equals minus one, you have this, this, this, and so on. So you can create clones of the digit two, transformed with respect to the Lie derivative. You don't need to have a lot of data. For digit recognition you need to have the predicate, and from any data you can get this predicate. And I will show you the invariant with this predicate. This is what the Lie derivative is. Horizontal translation: x, the first coordinate, plus a; you just move it in the direction a x1. Then vertical translation. This is rotation, standard from geometry: for a full rotation, for a small rotation, you have that. And this is d/dx1, this is d/dx2, and so on. You have all this stuff here.
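The clone construction just described can be sketched to first order as image + a * L(image), where L is a Lie-derivative generator. A toy sketch on a synthetic stroke; the generator names and the crude test image are mine:

```python
import numpy as np

def clones(img, alphas):
    """Clones of an image via first-order Lie-derivative transforms
    (a sketch: translation and small-rotation generators only)."""
    n1, n2 = img.shape
    d1, d2 = np.gradient(img)                      # numerical d/dx1, d/dx2
    x1, x2 = np.meshgrid(np.arange(n1) - n1 / 2,
                         np.arange(n2) - n2 / 2, indexing='ij')
    generators = {
        'shift_x1': d1,                            # horizontal translation
        'shift_x2': d2,                            # vertical translation
        'rotate':   x2 * d1 - x1 * d2,             # small rotation, from geometry
    }
    # Each (generator, coefficient a) pair yields one clone: img + a * L(img).
    return {(name, a): img + a * L
            for name, L in generators.items() for a in alphas}

img = np.zeros((16, 16))
img[4:12, 7:9] = 1.0                               # a crude vertical stroke
out = clones(img, alphas=(-2, -1, 1, 2))
print(len(out))                                    # 3 generators x 4 coefficients
```

This is exactly why clones are cheap: one training image plus a handful of generators and coefficients yields a whole family of transformed examples.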

And this is a big illustration from Patrick Simard: clones. You have this three and you create all these clones. You just choose one, two, three, four, five Lie derivatives and two different coefficients, and then you have all this stuff. But this is a smart guy; why not take, as the predicate, just the Lie derivative itself? We take the invariant with respect to the Lie derivative: using a statistical invariant, I try to estimate the conditional probability which keeps the invariant with respect to all the derivatives. But even more, it is
again from what was already done. Suppose I have this set of clones of my digit, and this is the set of clones of another digit. I call the tangent distance the distance between the closest elements of these two sets of clones. What does that mean? I have two digits, they are different, say a two and a three. I massage them with linear transformations to make them as close as possible, and measure how close they are. That is called the tangent distance. And now, I believe that
this is a general concept of a symmetry predicate. When you have any picture, you can say what the measure of symmetry of the picture is. For example, you have a three. What can I do? This is my digit three; for horizontal symmetry I take the first line here, the second line here, the last line here, and then I do the following: I make another image in which the last line becomes the first line, this line the second line, and so on. So what am I doing? I take the three, I keep it as a vector like that, and now I make this: two different images. And now I say: let me massage them, with this tangent distance, these two different pictures, and see how close they are. If they are very close, I can say this is the coefficient of symmetry of this three. And I can do horizontal symmetry, vertical symmetry, antisymmetry. What does antisymmetry mean? The digit S has vertical antisymmetry. I can have vertical symmetry, everything. So you can play many games with symmetry, and this is just one predicate.

Now I have concluding remarks. What we did is this: we can minimize this functional, which is slightly better than, say, least squares, subject to this constraint, which is the serious part, because this constraint defines the admissible set of functions, which we try to construct by being smart. So I can consider, say, all continuous functions, and then from these continuous functions select, by a smart predicate, the admissible set of functions. And I can do it because, according to weak convergence, any invariant holds with any function, any invariant holds with any predicate. Also, this provides a unique solution in a reproducing kernel Hilbert space, and an approximation for a neural
network, an approximate solution. But further progress goes beyond statistical reasoning. It goes in the direction of searching for predicates which form the basis for understanding the problems existing in the world. And what does understanding mean? It means that in, say, 2D image recognition there exists the concept of symmetry, there exists the concept of structure, and if you know these concepts, and I believe there are not a lot of them, I will show you why I say it is not a lot, you will understand this problem. And I think that this
line is a very old line. It starts from Plato. What does Plato say? Plato says that there is a world of ideas and a world of things, and the world of ideas makes the world of things. You see that I have ideas, which are predicates, abstractions that can be applied to different situations, and I have the world of things. Then, 300 years ago, classical German philosophy was about that: what ideas mean, what things mean. And Hegel said: whatever is reasonable, it exists. That is exactly what we said about predicates. And whatever exists, it is reasonable. So there are the two connections. More recently, 60 years ago, Wigner wrote an article about the unreasonable effectiveness of mathematics. It says that mathematics knows something about reality: if you would like to understand reality, you should look in the equations and you will see how it works. So a predicate is an abstract idea, while the invariants built using it form elements of the solution. These two concepts reflect
the essence of intelligence, not just its imitation, which is what artificial intelligence does. That is the subject which we should attack. I also have two more slides; one slide is a challenge. I know that people with deep networks get a 0.5% error rate using 60,000 observations. The challenge is: use 1% of this data and get the same result. But with smart predicates and
all these clones which exist, I think that it is doable. And the very last slide. In 1928, a man named Vladimir Propp published the book "Morphology of the Folktale," where he describes 31 predicates that allow one to synthesize all Russian folk tales. Later his morphology, the 31 predicates, was applied to literature, to theater, to film, to television, to television series, to games, et cetera, and these 31 were enough. I read this from Wikipedia; you can check it in the Wikipedia article on the book. Propp found 31 predicates which describe different actions of people in the real world. Probably there exists a small set of predicates that describes 2D images. And that is intelligence: how to find them. That is what I believe
learning should be about. Thank you. (audience applauds)

– [Host] I think we have time for a few questions.

– [Man in Audience] Hello, thank you, I have two questions. First, do you know of any predicates that you would recommend specifically for language classification tasks? And second, do you have any strategies for hedging against overfitting? If you specify too many predicates, then you might be sort of...

– Sorry, I did not hear you well, but the second question is about overfitting?

– [Man in Audience] Overfitting, yes.

– Yes, let me answer this.

– [Man in Audience] Sure.

– The more predicates, the better. Why can overfitting happen? Because your set of functions is big and you have only a small amount of data for selecting a function, so you can select whatever you want. But if you increase the number of predicates, you decrease the set of functions. So the more predicates, the less overfitting happens. And, if you like, a theorem
of mathematics says that if you have an infinite number of predicates, you are left with just one function, if you want.

– [Host] He also asked about natural language. Recommendations for predicates for language, natural language processing, the Turing test, any good predicates?

– You know, it is a very complicated story, natural language. I don't know.

– Questions?
– You know, whatever I am talking about is very trivial, simple. Everyone is familiar with 2D images. And we can think, like this guy Vladimir Propp: what are the predicates in these images? Can we formulate, if you are a smart guy, say, a couple of dozen, or maybe one dozen predicates? It should be enough.

– [Host] But language is harder than images.

– Oh yeah, absolutely. (audience laughs) Yeah, but don't go immediately for the hard problem. Try to...

– [Host] Try?

– Yeah, I would try the very simple, just step by step. It so happens that the main line of philosophy, from Plato to this guy, says that there are not too many ideas: there are not too many ideas existing in the world of ideas. It could be like that.

– [Host] Vladimir, thank you so much for coming today, and please give him a big hand. (audience applauding)

I really enjoyed this talk by Vladimir. Here's the outline:

0:00 – Introduction

0:46 – Overview: Complete Statistical Theory of Learning

3:47 – Part 1: VC Theory of Generalization

11:04 – Part 2: Target Functional for Minimization

27:13 – Part 3: Selection of Admissible Set of Functions

37:26 – Part 4: Complete Solution in Reproducing Kernel Hilbert Space (RKHS)

53:16 – Part 5: LUSI Approach in Neural Networks

59:28 – Part 6: Examples of Predicates

1:10:39 – Conclusion

1:16:10 – Q&A: Overfitting

1:17:18 – Q&A: Language
