[MUSIC]>>So, the talk will be roughly organized in two parts. I'm going to be discussing some of the efforts that I have been doing, and I guess MSR has been doing, on robustness in machine learning. Of course, this means a million things, and there are many aspects of the pipeline that one might want to make robust, in fact all the parts of the pipeline, but this time we're going to focus generally on two aspects of it. The first part will be about robustness at training time. This is what happens when perhaps you've outsourced the collection of your data, or your training set is unvetted, or the training set has arbitrarily chosen outliers. That's the first part. The second part is about robustness at test time; this is what happens when the adversary is trying to fool your models. Let's say I've trained a model, and now instead of giving it clean input, I give it some adversarially perturbed input, and then all sorts of weird things can happen. We'll get into these things, one at a time. Okay. So, robustness at train time. Let me give you two

motivations of why we might care about this

robustness at training time. The first is more classical perhaps, and it’s focused on genetic data. So this is some genetic microarray

data, or some picture of it. What you have here is gene expression data. So for every individual you have some maybe 20K- or 40K-length vector, depending on the technology, where every entry is one or zero. This is an oversimplification: each entry is either one or zero depending on whether maybe some protein is expressed, and this is some genetic

information essentially by proxy. Then you want to do a lot of stuff. You want to check whether or not some disease is genetic

or something like this. We typically think of

what happens is that some entity out there collects a bunch of genetic

data and gives it to us, and then we just want to do

whatever processing we want on it. But this is not true, because this

technology is really hard to do. It’s really hard to get

large-scale studies. So typically what happens

actually in practice is you agglomerate this information over a bunch of different labs. But these labs have

different standards, different equipment,

calibrations, people, scales. So instead of having a

nice homogeneous dataset, a large dataset you actually

have all this junk. You have a bunch of different things

which are all very different. They’re all heterogeneous

sources of data, and this causes potentially

uncontrolled systematic noise. So now instead of

having data from here, our algorithm needs

to work when the data is a bit more complicated. This can cause really weird systematic errors that computational biologists struggle with all the time. The second motivation is an ML application, and it's known as data poisoning attacks. This is what happens when, now, maybe my data is being crowdsourced

from some application, and think of it as being some recommendation system like maybe I want to recommend

restaurants or something. We crowdsource a collection of these ratings. Now what happens if the friends and family of a bad restaurant want to boost its rating, these sorts of things? How do we prevent

this from happening? Here’s a more interesting

example, perhaps I would say. This is something known as a

watermarking or backdoor attack. What these researchers were able to do was they took a standard training set for recognizing stop signs, and a standard neural network for recognizing stop signs, and they added a small amount of carefully chosen perturbations to the training set. So they added some corruptions

into the training set, and then they ran the algorithm

on this training set. What they found, they were

able to do the following. They were able to install a backdoor, or watermark into the neural

network in the following sense. If I feed in a clean

image of a stop sign, or anything else, it performs

exactly as it's supposed to. But if I maybe add a sticker of the right sort, carefully chosen ahead of time by the adversary, then I can make it classify the stop sign as a flower or anything you want. So obviously, if we're going to use these things for self-driving cars, this would be a bit of a security vulnerability, to say the least. So what happens when the data

can come from untrusted, tampered sources, how can we guarantee that algorithms

still work well? So the high level problem is

that these large datasets, whether because of noise in the collection process or because of the fact that they can't be fully vetted, are often inherently noisy, and the noise can be really adversarial. So the high level question is how can we learn from this really high dimensional data. It turns out that a statistical abstraction of this setting, one that really captures more or less all the settings I already described, is the following. There's some idealized model of the data, let's say it could be anything, and instead of getting samples from this distribution, at

the training time, I get samples from this

distribution plus corruptions, where these corruptions

are either because of the systematic noise, or because of the

adversarial perturbations that some adversary’s going

to do to my training set. So I get samples from this

funky looking distribution. So instead of getting

samples from here, my algorithm assumes I

get samples from here, but it needs to work when

I get samples from here. So this is the high-level

question. How can we do this? So how can we develop

such algorithms which are provably robust to worst case noise? I want to emphasize this here

because we’re really worried about these adversaries trying

to mess with our algorithms, so it is very useful to have these very strong worst case guarantees. So we want to study this problem, and it turns out that the most sensible way to do this is to really back up from maybe the most immediate machine learning applications. We'll really focus, at first at least, on the most basic questions in this area, and then it turns out you can use these primitives to build really strong algorithms later on for these more complicated looking problems. So we'll focus for a little bit

on this most basic question which is actually a

question that predates modern machine learning

called robust statistics. We’ll see how to use these

techniques to really build on to much more

complicated problems later on. So this is robust

statistics introduced by statisticians back

in the '60s and '70s, and they ask the following question. Given samples from a distribution, where an adversary can

move an Epsilon fraction of your data arbitrarily, can you recover statistics

like the mean or the covariance or whatever of

the original distribution? A very simple question. So here's our corruption model, the adversarial perturbations we're considering. I've drawn these nice i.i.d. samples from a distribution. But I don't get to see these samples; instead, I hand them to a very powerful malicious adversary. What the adversary does is he looks at these samples, he's allowed to inspect these samples arbitrarily, he even knows what my algorithm is trying to do, let's say. Then he gets to change an epsilon fraction of the points in whatever way he desires to mess me up. So maybe he chooses, say

the points over there. Now I get these points

without the colors, and my goal is to learn information about the

original distribution, like the mean, the variance,

and these kinds of things. Hopefully the setting makes sense. Good. As a point of terminology, let's call a set of samples that has this behavior, where an epsilon fraction of them has been corrupted, epsilon-corrupted, and we should think of epsilon as maybe being like 10 percent or 5 percent, these sorts of things: a relatively small, but constant, fraction of our data. So what can we do about this?
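Just to pin down the terminology, one standard way to write this epsilon-corruption model, which is roughly the setting I have in mind here, is the following (the notation is a choice made for concreteness, not exactly what's on the slide):

```latex
% A standard formalization of epsilon-corruption:
% an adversary sees clean i.i.d. samples and may replace an epsilon fraction of them.
\[
\text{Clean data: } X_1, \dots, X_n \overset{\text{i.i.d.}}{\sim} D .
\]
\[
\text{Corrupted data: } Y_1, \dots, Y_n \quad \text{with} \quad \#\{\, i : Y_i \neq X_i \,\} \le \varepsilon n ,
\]
where the adversary may inspect the $X_i$ (and the algorithm) before choosing which points to replace, and the goal is to estimate, say, the mean $\mu = \mathbb{E}_{X \sim D}[X]$ from the $Y_i$.
```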

Suppose I want to learn the mean of this distribution. It turns out that in low dimensions this problem is not too hard, just because in low dimensions, if your distribution is at all nice, then points tend to concentrate pretty closely around the true mean. Everything is close to the true mean. For instance, if your

data is Gaussian, you have this nice bell-curve behavior, everything's really close to the truth. So that means the bad points, if they're not going to be obvious outliers, also have to be pretty close to the true mean. So they can't actually influence your statistics by too much. All these points are pretty close, and so even if I just take the empirical mean here, then I'll still get pretty close to the true mean. However, this picture changes dramatically when the dimension becomes higher, and this is the setting which

we really care about. For instance, in the

genetic data setting, the distribution was

like 20,000 dimensional. But it turns out as the

dimensionality of the data scales, this picture changes dramatically. I can't really draw

20,000 dimensions, but you should think of what happens as all the points now

live in this ring, almost like a doughnut, of radius square root of the dimension. This becomes very

problematic especially as the dimensionality increases

because now all of a sudden, I have an epsilon fraction of points, and all of them are like root

D away from the true mean. So as the dimensionality grows, the influence of these points on the statistic increases polynomially with the dimension. So if you do naive statistics, like if I just tried to throw away points which are

obviously outliers, I couldn't throw away any of these points, which all lie within this ring of radius root D. Now, all of a sudden, instead

of an error of epsilon, I get an error of like epsilon

times square root of dimension, and when dimension is like

20,000 or hundreds of thousands, millions, this renders my

algorithms essentially useless. So here's the problem. Any method which looks for outliers on an individual basis will lose these dimensionality factors, and so to get around this,

we need to somehow look for corruption at a much

more global level. It turns out that this is

really a problem because somehow looking for

global information, that’s like a really

hard search problem. We need to search over subsets of size epsilon N. This is a really

hard computational problem. For a long time, there was

a curse of dimensionality. Either the algorithms for robust mean estimation were computationally intractable in high dimensions, that is, the runtime was exponential in the dimension, so I just couldn't run them in dimension higher than like 10 or so, or I could run them, but I would lose these dimension factors in the accuracy. So either I couldn't run it, or if I did run it, I would get some answer which is statistically meaningless when the dimensionality of the data is high. So this raises a natural question: is efficient robust estimation possible in high dimensions? And I guess I wouldn't be here if the answer wasn't yes. Yeah, the answer is yes. At a very high level, the intuition is the following. We can find some

statistics which allow us to certify whether or not the

corruptions are changing the mean. We can’t tell whether or not a single point is a

corruption or not, but we can somehow detect whether

or not the corruptions are acting together to mess things up. Here’s the idea, if the corruptions move the

mean of my distribution, then they will also shift the shape, or the covariance, of my distribution. Here's the cartoon explanation of why. So here's the setting I had before, and here's the true mean of my distribution, centered roughly at zero. But the empirical mean of my dataset, because you can see the red points are all pointed roughly in this direction, is going to be shifted

somewhere around here. How can that happen? So consider the true covariance

matrix of the data, it’s supposed to look like

this, spherical at least; that's what this picture suggests. But along this direction of change, the bad points are generally aligned. That's what's happening here. They're generally causing a drift in the mean, but the only way that can happen is if, along this direction, there is much more action

than there ought to be. There’s a lot more variance

than there ought to be, and it turns out that

in this setting, if they shift the mean, then

they also cause the data, the covariance, to actually look like this. Algebraically, this manifests itself as a large eigenvalue, or a large principal component, of my data, and we can detect

this algorithmically. So this is very useful

for two reasons. First of all, the contrapositive of this says that if the largest eigenvalue of my empirical corrupted data's covariance is small, then even if I can't

detect what’s bad, then I know that the corruptions

aren’t changing the mean. Because I know that

if they’re changing the mean, then there’s

a large eigenvalue. So the opposite of that is if

there’s small eigenvalues, if all the eigenvalues are small, then they’re not shifting the mean. So whatever the

corruptions are doing, it doesn’t matter for my problem,

even if I can’t detect them. So in this case, we just

output the empirical mean. However, if the top eigenvalue

is large, then at the same time, this is also very useful

algorithmically because it gives us a concrete direction to look where the bad points must

have undue influence because we know that the good

points have bounded covariance. They’re supposed to be like

this nice spherical thing. So that means the top eigenvector, this direction can only have large variance because the bad

points are large in this direction, much larger than they ought to be. So now all we have to do is something like: project onto this direction and downweight points based on how large their projections are. It turns out that if you actually do this slightly carefully, you will downweight the bad points much more heavily than the good points, or remove more bad points than good points. You can think of it this way. This gives us a very simple meta algorithm which we call filtering. It proceeds as follows: you're

given a corrupted dataset, you let mu hat be the empirical mean, you let Sigma hat be the

empirical covariance, and then you just take

the top eigenvalue and eigenvector of Sigma hat. Then if the top eigenvalue is not too large, that means the shape of the data is actually still spherical, so you output the empirical mean. This is what we were saying before. Otherwise, you project along the direction of the top eigenvector and just throw away points which have large projections. This is not actually a formal algorithm description, but if you're slightly careful about how you do this, it turns out it works. The high level intuition is that when you project onto this direction v, maybe you're supposed to see a nice bell curve or something because your data is Gaussian, but you won't. What you'll see is this little kink in the distribution, the presence of outliers. They have to be in the tail, and so if you just throw away the points with bad scores, you'll throw away more bad points than good points. So we made a lot of progress. So this is a general meta algorithm.
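To make the meta-algorithm concrete, here is a minimal sketch of this kind of filter in Python. It's only an illustration of the idea just described, not the exact procedure from the papers: the eigenvalue threshold and the removal rule are placeholder choices you would tune to your assumptions about the clean data.

```python
import numpy as np

def filter_mean(X, eps, eig_threshold=1.0, max_iter=50):
    """Illustrative filtering loop for robust mean estimation.

    X: (n, d) array of possibly corrupted samples.
    eps: assumed fraction of corrupted points.
    eig_threshold: placeholder for the real, distribution-specific test of
        whether the top eigenvalue is too large (i.e., whether the
        corruptions could be shifting the mean).
    """
    X = np.asarray(X, dtype=float)
    for _ in range(max_iter):
        mu = X.mean(axis=0)                      # empirical mean
        centered = X - mu
        sigma = centered.T @ centered / len(X)   # empirical covariance
        eigvals, eigvecs = np.linalg.eigh(sigma) # ascending eigenvalues
        lam, v = eigvals[-1], eigvecs[:, -1]     # top eigenvalue / eigenvector
        # If the data still looks spherical enough, output the empirical mean.
        if lam <= eig_threshold:
            return mu
        # Otherwise, score points by their squared projection onto v and
        # throw away the most suspicious eps-fraction tail.
        scores = (centered @ v) ** 2
        cutoff = np.quantile(scores, 1.0 - eps)
        X = X[scores < cutoff]
    return X.mean(axis=0)
```

Each iteration is dominated by the top-eigenvector computation, which is why an approximate eigenvector routine makes a single iteration run in near linear time, as discussed below.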

I don't want to get into too many specifics of how to set these things; it changes based on what you

believe about your good data, but the nice thing about

this is that it's actually a very practical algorithm, because if you're just slightly careful about how you implement these things, if you do an approximate top eigenvalue-eigenvector computation, a single iteration of this runs in near linear time. So this algorithm is eminently practical, and we'll actually see some experimental results of this, which show that it works pretty well. So here are some theoretical results, just in case anybody is interested. If you want to estimate the mean of a distribution with bounded second moment, you can get this error; if you have a Gaussian, you can get even better error, and it turns out all these things are optimal error rates.
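The precise statements are on the slide, but for reference, the error rates usually quoted for these two settings have roughly the following form; treat the constants and log factors as indicative rather than exact:

```latex
% Typical guarantees for robust mean estimation from eps-corrupted samples.
% Bounded covariance, \Sigma \preceq \sigma^2 I :
\[
\|\hat{\mu} - \mu\|_2 \le O\!\left(\sigma \sqrt{\varepsilon}\right),
\]
% Gaussian with identity covariance:
\[
\|\hat{\mu} - \mu\|_2 \le O\!\left(\varepsilon \sqrt{\log(1/\varepsilon)}\right),
\]
and, importantly, neither bound depends on the dimension $d$.
```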

More interestingly, and I'll talk about this in a little bit, we can also estimate the

shape of the distribution. So if you have a

Gaussian distribution and you don’t know the covariance, so you don’t know actually

what the shape is. Turns out you can use another

variant of this method and actually learn the covariance of

the Gaussian as well. So you can actually

learn the shape of the distribution in the

presence of outliers. This is some technical thing

which I don’t want to get into. But for all these settings,

it turns out that these are the first efficient algorithms with dimension-independent error guarantees, and there's been a lot of

follow up work afterwards, but so far, this has all

been very theoretical. I want to convince you at

the very least that this is something that actually works. For this, we did a

bunch of experiments, and the first experiments are

simple synthetic experiments. Just to validate that the math

that we did wasn’t wrong. So what we did here was we

took some Gaussian data, or we took our inliers to be

say Gaussian or whatever, it doesn’t really

matter it turns out, and then we just naively

planted some outliers, in whatever way we tried, whatever we wanted, and then

we scaled the dimensionality, so we did this for a bunch

of different dimensions. We just plotted the error

of all the estimators, so ours is the black line. The blue line was a

concurrent algorithm, and then these are standard outlier removal or robust

mean estimation techniques. As you can see, as you increase the dimensionality of the data, the error of these classical techniques indeed scales like root d, but ours stays constant. When you zoom in, ours is better than this concurrent

algorithm as well. The other setting that

was very interesting was this covariance estimation stuff. So how do I estimate the

shape of my distribution given corruptions? We tried two settings here. In one setting the unknown covariance is spherical, so the data was nicely shaped like a ball, like the example I showed. The other setting, which is more challenging, is where the data is much more skewed in some

direction than the other. So in the isotropic setting, everything did pretty

well, although we again, as long as the dimensionality was sufficiently high, did the best. However, in the skewed setting, there's a dramatic difference. Our error is very, very small, the concurrent algorithm got pretty good error, and everything else had error way up here in the plot. So that's great. That actually demonstrates at least that we didn't

do the math wrong. That's really reassuring, but does this actually work on real data? So we tried this experiment, which is, well, a classical experiment called Genes Mirror

Geography in Europe. It’s a very nice experiment, and it relates to the first

example I showed you before. What they did was they took a

bunch of gene expression data, this is what I was discussing, and they just projected onto the top principal components; this is the projection of these points onto their top principal components. What they found was that, roughly speaking, after appropriate rotation, and I guess reflection perhaps, they more or less recover a map of Europe, which is nice, and you've got the Iberian

Peninsula, those things. This is like surprising at

first when you think about it, but maybe now on second thought, not so surprising just

because you do expect that geography is a very important factor in determining genetics; you are much more likely to be genetically close to people who are close to you geographically. Principal components are supposed to tell you the most important directions, because that's what they do, and if you expect that the most important directions are aligned with geography, this is the picture

that you would get. But to get this picture,

they had to be really, really careful, and they’re very clear about this in

their experimental setup. For instance, they have to remove information from immigrants

to get this picture, and this is perhaps

not so surprising. If you believe this

hypothesis that geography really dictates a lot

of genetic drift, then you would expect that immigrants from another

part of the world, it would destroy this

fine grained picture that they’re trying to get for Europe. So for the purposes

of this experiment, one can think of the

immigrant genetic information as outliers in the genetics setting. If you just try to add them back in, indeed what you get is no map of Europe, as far as I can tell. However, if we think of it in this way, we just need to learn the top two principal components of the data in the presence of noise, in the presence of outliers. But the top two principal components of your dataset are just the top two eigenvectors of your covariance matrix. So if I can learn the covariance matrix robustly, even in the presence of outliers, I can just do the same experiment without having to carefully go through my data and trying to cherry pick these things.
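In code, the pipeline being described is roughly the following sketch; `robust_covariance` stands in for whichever robust covariance estimator you use (for instance, a covariance variant of the filter above), so it's a hypothetical helper rather than a specific library function.

```python
import numpy as np

def robust_pca_projection(X, robust_covariance, k=2):
    """Project data onto the top-k principal components of a robustly
    estimated covariance matrix, so outliers can't hijack the directions.

    X: (n, d) data matrix (e.g., genotype/expression vectors).
    robust_covariance: callable returning a robust (d, d) covariance estimate.
    """
    X = np.asarray(X, dtype=float)
    X_centered = X - np.median(X, axis=0)      # a simple robust centering
    sigma_hat = robust_covariance(X_centered)  # robust to an eps-fraction of outliers
    eigvals, eigvecs = np.linalg.eigh(sigma_hat)
    top_k = eigvecs[:, -k:][:, ::-1]           # top-k principal directions
    return X_centered @ top_k                  # (n, k) coordinates to plot
```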

So if you do our experiment, you indeed get not quite the same picture, but a much better picture that still recovers the broad geography of Europe, even in the presence of outliers. There are a lot more applications; another application, for

instance is outlier detection. So another nice thing

about this algorithm, this filtering algorithm is

it actually gives you scores, because the projection

onto this top eigenvector, the larger it is, the more

suspicious your data point looks. So this gives you a natural score of which points are more likely to be outliers. I mean there's no, perhaps, theoretical justification for this, but it's a nice heuristic approach, and it allows you to rank points,

and then you can just use this directly as a method

for outlier detection. If you use the state of the art techniques for

robust mean estimation, so we have a slightly more

complicated version of the filter, you get scores which we call

these quantum entropy scores, which are efficient to compute and outperform the previous state of

the art for outlier detection, on a number of synthetic, and

real-world outlier detection tasks. So somehow, this is

a setting where like the more theoretical

insights actually directly give you better

real world applications. So here’s a more synthetic setup. I don’t want to discuss what

all these words really mean, but essentially, Alpha is a parameter that we get to tune, and as long as we're above zero, that means that we're doing better than the previous benchmarks, and you can definitely see that in some benchmarks, at least for some appropriately chosen choice of Alpha, we're definitely doing better. So this is against two specific benchmarks, and also against a much larger suite of outlier detection techniques,

we are able to do much better. On synthetic data and

also on real world data. This is an experiment where

we introduced outliers into CIFAR-10 to see how many of

the outliers you can detect, and again, we can detect much more. I don’t want to dwell too

much on this, but I’m happy to talk about

this offline perhaps. But so far, at least in this talk, all we've been talking about are these simple robust estimation tasks, like learning the mean,

learning the covariance, maybe you can do something like

outlier detection with it. But if I want to go

beyond robust statistics, what if I want to do more complicated objectives

like supervised learning? Like regression or SVM in the presence of noise? Maybe I have some dataset and I want to do regression or train a neural network, but actually my data points have been corrupted, maybe by somebody who's

trying to mess me up. Turns out, we can also

handle these things, because these problems

can be phrased in the framework of

stochastic optimization. So the idea is, and again hopefully this was gone over in one of the previous days, that what happens when you actually train these things is you're given some loss function, which takes a data point x and a model w. So w is the parameters of your model, x is like a labeled data point, and you get a distribution over

these labeled data points, and your goal is to minimize the expected loss of your model. So find a model w that has the smallest expected loss, that minimizes this function F.
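Written out, the objective just described is the standard stochastic optimization problem; the notation below is a choice made for concreteness and the slide may use slightly different symbols:

```latex
% Stochastic optimization: minimize the expected loss over the data distribution.
\[
\min_{w} \; F(w) \;=\; \mathbb{E}_{x \sim D}\big[ f(w, x) \big],
\]
where $f(w, x)$ is the loss of model parameters $w$ on a labeled data point $x$, and at training time we only see samples $x_1, \dots, x_n$, an $\varepsilon$ fraction of which may be corrupted.
```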

So this is a typical stochastic optimization task that really captures all of these problems like regression, SVM, training whatever you want: almost anything, not everything, but a large fraction of

things that people do. Now, the question is

can we do this when we get samples not from

this distribution, but samples which are corrupted? So where a small fraction of these

samples have been corrupted. It turns out that you can do this. The first idea is just to run stochastic gradient descent

using robust estimates. So what do I mean by

this? So hopefully, this was covered in the first couple of days,

or something like this. But SGD is a first-order optimization method, and it's an iterative method. So what you do is, if you have a current model Wt, to get the next model Wt plus one you take Wt minus some step size times the gradient, with respect to w, of your loss at a new data point Xt, evaluated at your current model Wt. So this is stochastic

gradient descent, and it works because I really want to go in this direction, which is just gradient descent, but I don't have access to this because I only get samples from our distribution. But this sample gradient is an unbiased estimator

for this expectation, and it all just works. So hopefully this is somehow

familiar to people, I hope. The problem is that we can’t

do this in the presence of noise, because if I, for instance, draw a corrupted data point here where I'm doing this, then this is no longer an unbiased estimate

for the true gradient. It can be very far off. But stated in other words, I have samples, and each

sample gives me a gradient. So now let me think of the problem as: I'm given a bunch of gradients, some of these gradients are corrupted, but there's only an epsilon fraction of them, and I want to learn the true mean of these gradients. So I can just apply robust mean estimation to robustly learn the mean of the gradients. Instead of using the plug-in estimator, I can just use that robust estimator instead. So in other words, you just do this: I take a step in direction Gt, where Gt is a robust estimate for the true gradient, and I can do this just by using the techniques that I introduced before, using robust mean estimation. This actually works great in theory, but it's a bit slow in practice.
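As a rough sketch of what SGD with robust gradient estimates looks like, reusing the filter from before as a black box. This is illustrative only: the `grad_loss` interface, the batching, and the step size are placeholders, not the exact algorithm from the paper.

```python
import numpy as np

def robust_sgd(w0, batches, grad_loss, filter_mean, eps, lr=0.1):
    """SGD where each step uses a robust estimate of the mean gradient.

    w0: initial parameter vector.
    batches: iterable of (possibly corrupted) data batches.
    grad_loss: function (w, batch) -> (batch_size, dim) per-example gradients.
    filter_mean: robust mean estimator (e.g., the filter sketched above).
    eps: assumed fraction of corrupted points per batch.
    """
    w = np.asarray(w0, dtype=float)
    for batch in batches:
        grads = grad_loss(w, batch)          # one gradient per example
        g_t = filter_mean(grads, eps)        # robust estimate of the true gradient
        w = w - lr * g_t                     # standard SGD step with the robust gradient
    return w
```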

So we actually also came up with a more complicated alternative, which I don't really want to get into, which does something else, but

also has provable guarantees. So here’s some guarantees,

it’s not super important, but it turns out that

it essentially does in fact get an approximate minimizer, and you get some bounds

which are pretty good, and in special cases, you

get very tight answers. But this is still a very

practical algorithm, and so here are some

experiments that we did. So we did ridge regression, we also did SVM, but the results are similar. So I’m just going to show you the

results for ridge regression. So again, on synthetic data, our error is the best, like this, and when you try some real-world datasets, there are very large suites of attacks that people have devised against regression and SVM, and when you target the attack against any of the other benchmarks, any other proposed defenses that people try, the error can go off to infinity very easily, and ours stays good. However, even when you try to target the attack as much as you can against our own defense, we're still very competitive, and we are still very bounded. Let me even go beyond. Yeah.>>Do you keep track of all of your gradients and

just calculate the mean?>>All right, so what we actually do in practice is this full Sever algorithm, which I haven't discussed. You just run to optimality on your corrupted data, and only then do you filter on the gradients, and then you repeat. It turns out you can still

prove that this works. I pretend that I have no corruptions, I train something and then I check if maybe something’s gone

wrong at this point. If something has gone wrong,

I retrain after throwing out some data and that’s how it works at a high level.

Does that make sense?>>Then how do you identify

the data that you throw out?>>I mean, it's a bit of a trick; it turns out that you can show that, as long as you're throwing out data this way, you can guarantee you throw out

more bad data than good data. So that’s all you need to

get some formal guarantees.>>You just throw out?>>Just throw out. We have no

idea exactly what we throw out, but all we know is that we throw

out more bad than good, and that is sufficient, it turns out. I’m sweeping many

details under the rug, but I should say in the end, while the analysis

is very complicated, the algorithm itself is very simple. I want to now also address the second motivating example I mentioned at the

beginning of the talk, which were these backdoor

attacks against neural networks. It turns out these attacks are

really easy to implement. So again the idea is that

the attacker has access to the training data and they’re allowed to add maybe a

small fraction of training points, and the goal is that after I train a ResNet or whatever neural network on it, natural images are still classified correctly, but if I have a specific

predetermined backdoor, then the classification switches. So for instance, this is airplane,

it’s classified correctly, but if I add this little

spot here it turns into a bird and here automobile is classified

correctly and if I add this, it’s like a speck of

dirt round thing, then the classification

again switches. These things are very easy to

implement at a high level, the way they work is that

these attacks try to convince the network that the

implanted watermark is a very strong signal

for classification. So the most basic

attack for instance, suppose I want to do

airplane versus bird, is I just add a bunch

of pictures of images as many as I can of airplanes with this watermark

and I call them birds, and then the neural network

realizes at least at train time, because it’s not very

smart in some sense, that if I want to classify

these images correctly, the only thing that’s different

from this airplanes is this watermark and you’re trying to get really good

accuracy on the training set. That’s the whole point of

what we’re trying to do. So you’re just going

to realize that if it has this implanted watermark, it should ignore everything

else and just focus in on this watermark

and call it a bird. So at train time, the learned

representation is going to amplify the signal of the watermark

and this creates a backdoor, because now when I get a fresh image, as long as I see this

watermark anywhere, the neural network

has learned that this is the most important feature for deciding between an airplane and a bird. If I ever have a picture with this, it's a bird; if I ever have that pattern in the pixels, it's a bird, no matter what the

other features are. This is at a high level

how the attacks work, and it’s actually kind of

strange, because how do we defend against it? It's not even clear, if I just hand you the model of the neural network, how to tell that it's been backdoored. Unless I know the watermark itself, it's just going to behave like a clean neural network. The insight is: what happens if we feed the training set through

the learned representation? Now, let’s consider the

setting where we are trying to “Watermark” cats with dogs. So now my training

set for instance has a bunch of real cats,

they're labeled cats, but it also has a bunch of dogs, which are labeled as cats but also have this predetermined watermark or backdoor. The way that the neural

network recognizes that these are cats in two different ways. The first is that, for the real cats, it somehow learns a real cat classification; but for the other set of images, these dogs that have this watermark, it's going to realize that the watermark is the only thing that matters in the image and ignore everything else. So as a result, the learned representation of these watermarked dogs is

going to be very different. It’s going to look over here, and the point is that in

representation space, like in the final layers

of the neural network, these things tend to have very large distance, and in particular this causes a very large eigenvalue in my covariance matrix, and this is actually all we needed for robust mean estimation, for the filter to work. So we can actually run the filter on these representations, and the algorithm is going to detect the corruptions and remove them very effectively.
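As a sketch of the defense being described: compute learned representations for the training points of a class, run the same spectral scoring as the filter, and drop the highest-scoring points. This is only illustrative; the `featurize` helper and the removal fraction are placeholders, not the exact recipe from the work in question.

```python
import numpy as np

def flag_backdoor_suspects(features, remove_frac=0.05):
    """Score training points of one class by their projection onto the top
    principal direction of the representation covariance; large scores are
    suspicious, since watermarked points cluster in a separate direction.

    features: (n, d) penultimate-layer representations for one label.
    remove_frac: fraction of the most suspicious points to flag.
    """
    centered = features - features.mean(axis=0)
    cov = centered.T @ centered / len(features)
    _, eigvecs = np.linalg.eigh(cov)
    v = eigvecs[:, -1]                        # top eigenvector of the covariance
    scores = (centered @ v) ** 2              # outlier score per training point
    cutoff = np.quantile(scores, 1.0 - remove_frac)
    return scores >= cutoff                   # boolean mask of flagged points

# Sketch of usage: for each label y, flag suspects among that class's examples.
# feats_y = featurize(model, X_train[y_train == y])   # hypothetical helper
# suspects = flag_backdoor_suspects(feats_y)
```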

I don't want to get into too much of it, but you can see: this is the clean accuracy, this is how successful the attack is without a defense, and this is how successful it is after the defense. So it's quite successful, I would say. That's all I want to say about

Robustness at Training Time, let me switch gears a bit and talk more about

Robustness at Test Time, although this is already a mix between the two anyway. So I'm going to focus here on these things known as adversarial examples

for neural networks, and this is a very

curious phenomenon, which is what I can do is I

can take any image of a pig essentially and then I can add to it a very carefully

chosen perturbation. This might look like white noise, but it's very carefully chosen, such that when I add a very small amount of it to this image, it makes it look like this. As far as you can tell as humans, you can't tell the difference between these two things, but to a neural network, this now looks like an

airliner and this looks like a pig. Okay, yeah.>>You need to know the

neural network, right?>>Not necessarily,

the most commonly used attack is a white box attack, where you assume you

know the model itself, but there are also black box attacks. It turns out that these sorts of things actually transfer across architectures. So if I train a perturbation that fools enough known architectures, it will also typically fool

another unknown architecture. I should say that from a

security point of view, this is actually a very

real-world problem. You can get these things to be

very effective in the real world. Here are some examples. Here, some students at MIT 3D printed this turtle, which you can tell is very clearly a turtle, but from many angles, well from almost all angles, it looks like a rifle or something

else to a neural network. Here, Researchers at

Berkeley and U-Dub were able to put things on

these stop signs that look like graffiti that make neural networks think that

it’s like a speed limit sign, and similarly,

Researchers at Berkeley were able to design these glasses that make this person

look like Brad Pitt, and there’s also other stuff. So there’s another one, I couldn’t find like a

nice image to show this, but it’s actually maybe

the most real-world one, where some researchers at Tencent Security Lab actually took a real Tesla, and they were able to put stickers on the ground so that when the Tesla was doing its self-driving stuff, it saw these stickers, which just looked like maybe splotches of dirt or whatever or splashes

of paint to a human, and it turned, it swerved

into a wrong lane. It’s kind of scary, I was

going to buy a Tesla. So the point is that these

are real-world worries, but we want to defend against

these sorts of things. So formally, what are these things? Like what is the formal notion? So formally, we have a

classifier. It takes a point, you can think of R^d as the space of images, and it maps it to, say, some set of classes. So here for now we're going to focus on classification, although you can also do this for other tasks as well. Then you have some allowed perturbation set, like an l_p ball, so some small perturbation set. Then we say an adversarial example for a data point x is a point x plus delta, where delta is in this perturbation set, such that the classification of x under f differs from that of x plus delta. So you want to defend against this.
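Spelled out in symbols, the definition just described is something like the following (the notation is a choice made here; the slide's may differ slightly):

```latex
% Adversarial example: a small allowed perturbation that flips the prediction.
\[
\text{Given } f : \mathbb{R}^d \to \mathcal{Y} \text{ and a perturbation set } \Delta
  \;\big(\text{e.g., } \Delta = \{\delta : \|\delta\|_p \le \epsilon\}\big),
\]
\[
\delta \in \Delta \text{ is an adversarial perturbation for } x \iff f(x + \delta) \neq f(x).
\]
```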

Roughly speaking, defenses against this come in two camps. The first is empirical defenses, which are defenses that

seem to work in practice, but we don’t know for sure

whether or not they work. We don’t have any

proof of correctness. In fact, many of these

have been broken often within weeks or

months of publication. So there’s a very

famous paper, I think. So, last year at ICLR, which is one of the big conferences, I think five or so, or sorry, eight papers were accepted which claimed to have defenses against adversarial examples. Before ICLR even happened, five of them were broken. So there was this long cat and mouse game between the empirical

attacks and empirical defenses. There’s only one notable

exception to this rule which is there’s a technique

called adversarial training. It was the one accepted defense at ICLR that actually seems to work. It works as follows. Standard training, as I alluded to before, just does the following. You have some current model, theta t, you have a data point (x, y) and a loss function. So I have slightly changed

notation, apologies, but you’re just going to

do SGD, which is this: you set theta t plus one to be theta t minus some step size times the gradient of your loss at your data point. This is just standard SGD. Adversarial training is just a

very slight modification of this. It says, instead of trying to

minimize the loss at my X, I’m still going to draw

this data point x. What I’m going to do is I’m

going to try to see if there’s any adversarial

examples to this point. I am going to see if there’s

any adversarial perturbations, and I'm going to try to minimize the loss at that point. Intuitively, that tells the network: currently, your model thinks that there is an adversarial example for this point, or currently your model thinks that this perturbation is misclassified, but it shouldn't be misclassified, so I'm just going to force you to classify it correctly. So all it does is, instead of training at X, evaluating the gradient at X, it evaluates the gradient at X prime, where X prime is some adversarial perturbation that you find via as strong of an attack as you can. This just tells the network, "Look, X prime should be classified as y and not whatever you think it is." You just train it that way.
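Here's a minimal sketch of that training loop, with a simple projected-gradient-style inner attack. The interfaces, number of attack steps, and step sizes are placeholder choices for illustration, not the exact recipe from any particular paper.

```python
import numpy as np

def adversarial_training_step(theta, x, y, loss_grad_x, loss_grad_theta,
                              lr=0.1, eps=0.03, attack_steps=10, attack_lr=0.01):
    """One step of adversarial training.

    loss_grad_x(theta, x, y): gradient of the loss with respect to the input x.
    loss_grad_theta(theta, x, y): gradient of the loss with respect to theta.
    eps: radius of the allowed (here l_infinity) perturbation set.
    """
    # Inner maximization: find a strong adversarial perturbation x' near x.
    x_adv = x.copy()
    for _ in range(attack_steps):
        g = loss_grad_x(theta, x_adv, y)
        x_adv = x_adv + attack_lr * np.sign(g)       # ascend the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)     # project back into the l_inf ball

    # Outer minimization: take the SGD step at x' instead of x.
    return theta - lr * loss_grad_theta(theta, x_adv, y)
```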

This seems to work but again, we don't have any proof of guarantees about this. The other camp of defenses are these so-called

certified defenses which have proof of correctness that you can’t break them no matter what. Unfortunately, traditionally,

these haven’t scaled very well. Or even if they do scale, they tend to get much worse numbers

than the empirical defenses. Something that's been a very recent, exciting development in this field is a protocol called randomized smoothing, developed by a sequence of papers, culminating in this

paper by Cohen et al, which seems to bridge the gap. It comes a lot closer and often matches or beats the

previous empirical defenses. So what is randomized smoothing?

It’s a very simple operation. You can state it in a very

complicated looking way, but it's the following. Given a classifier, it's just this expression, which let me explain in pictures. So imagine these are the colors of my classifier. If I have a point in this blue area, it's classified as some class; if it's in this mint area, it's classified as some other class, and so on and so forth. So the original colors are the original classifier; it's just some classifier. Now, given this classifier, I can define its associated smoothed classifier as follows. Given a point, what I do is I sample a Gaussian, a bunch of Gaussian perturbations. Then the likelihood of the blue class is the probability that the Gaussian lands within the blue region, the probability of the mint class is the probability that this Gaussian centered here lands in the mint region, and so on and so forth. So now I get this probability distribution over all my classes, and my prediction is the most likely class.
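In symbols, the smoothed classifier just described is usually written like this, with sigma the noise level (this is the standard form from the randomized smoothing literature):

```latex
% Randomized smoothing: the smoothed classifier g built from a base classifier f.
\[
g(x) \;=\; \arg\max_{c} \; \Pr_{\delta \sim \mathcal{N}(0, \sigma^2 I)}\big[\, f(x + \delta) = c \,\big].
\]
```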

The point is that this prediction can actually change. For instance, if I'm right here, I'd be classified as mint by the base classifier, but if I do the Gaussian smoothing, I'm still more likely to be blue. So this does change the classifier. So hopefully that

description makes sense. The point is that it actually turns out that, as long as I have a good randomized smoothed classifier, it automatically has

strong certifiable robustness. Here’s a theorem from

the Cohen paper. Let me just parse this really quickly. It says: suppose I have a classifier G which is one of these smoothed classifiers, and I'm at a point X where this smoothed classifier is very confident about its prediction. So if A and B are the two most likely classes, and the probability of A is much bigger than the probability of B, then there's a very large radius, which depends on the difference between the probability of A and the probability of B, such that within this radius the classification cannot change. So this is just what this theorem says. What it really depends on is how good of a gap I can get from my smoothed classifier. The larger a gap between P_A and P_B I can get, the larger the radius I know is robust around my point. So hopefully this makes sense.
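For reference, the certified radius in the Cohen et al. result is usually stated in the following form, where Phi inverse is the inverse of the standard Gaussian CDF and sigma is the smoothing noise level; check the original paper for the precise conditions and statement:

```latex
% Certified l2 radius from Cohen et al.: g's prediction at x cannot change
% within radius R, given bounds p_A, p_B on the top-two class probabilities.
\[
R \;=\; \frac{\sigma}{2}\left( \Phi^{-1}(p_A) - \Phi^{-1}(p_B) \right).
\]
```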

As a small caveat, technically I can't compute these probabilities exactly, but if I take enough samples, then with extremely, extremely high probability, failing only with probability like 10 to the minus 10, I can estimate these things. I can get very good estimates on

how good these quantities are. I can be very confident that

my thing is still robust. But the key is to somehow design a base classifier such that its associated randomized smoothed classifier has this big-gap behavior at most points I care about. So how do we train these smoothed networks? How do you train the base classifier so that the smoothed classifier is effective? So the paper of Cohen

et al tried this method of the Gaussian data augmentation. Intuitively, this makes sense. Somehow, I have the classifier, it’s going to be sampling like a bunch of Gaussian perturbations to my points and making predictions

based on those points. So maybe if I’m just correct

at a bunch of points that are Gaussian perturbations of my point, then maybe my method will also work. It'll at least get pretty

good clean accuracy. This does work okay. They

get pretty good numbers. In fact, before our work,

they were state of the art. But it turns out you can do much

better by directly robustifying the smooth network using adversarial

training on the smooth loss. So let me try to explain

what I mean by this. This is a recent work which

is to appear in NeurIPS. So recall that this is really

the loss that we care about, so let me break this

down for a second. So we’re going to focus

on cross entropy loss because that’s what people

usually do in practice. Given a data point and a label (x, y), we want to find an x prime that maximizes the loss of G, where G is the smoothed classifier, at this point, in an l_2

ball around this point. Because what we want to

do, this will literally be an adversarial perturbation for

the smoothed classifier directly. So in particular, if you can't find an x prime that has a large loss, then the smoothed classifier is going

to be correct in this radius. Then we’re obviously

also going to be robust. So the point is that

this is the loss. You can just write it down this way, just expanding out the definition of the cross entropy loss and the definition of the smoothed classifier. We call this the smooth

adversarial objective. Then the point is that what we’re

going to do is we’re going to do adversarial training

with this objective. Meaning what we’re

going do at every step, is we’re going to find

Adversarial Perturbation. So recall what adversarial

training does is, when you’re given a data point, it finds some adversarial perturbation and trains it at that data point. What we’re going to do is

we’re going to do adversarial training trying to find an adversarial perturbation to

this objective at every step. This is very different from the

Gaussian Augmentation Objective. This is what we do, and this is what a Gaussian augmentation would

be doing to work it out. It’s a very subtle difference. So now the only difference

is that the log is on the outside instead of on the inside. But qualitatively, there’s actually very different
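Concretely, the two objectives being contrasted can be written as follows, where F_y denotes the softmax probability the base network assigns to the true label y; the notation is chosen here for concreteness and the slides may differ:

```latex
% Smoothed adversarial objective (log outside the expectation): attack the smoothed classifier.
\[
\max_{\|\delta\|_2 \le \epsilon} \; -\log \, \mathbb{E}_{\eta \sim \mathcal{N}(0,\sigma^2 I)}
  \big[ F_y(x + \delta + \eta) \big]
\]
% Gaussian-augmentation-style objective (log inside): attack the average loss under noise.
\[
\max_{\|\delta\|_2 \le \epsilon} \; -\mathbb{E}_{\eta \sim \mathcal{N}(0,\sigma^2 I)}
  \big[ \log F_y(x + \delta + \eta) \big]
\]
```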

But qualitatively, there are actually very different interpretations of these, because this one says: find me an adversarial example for the true smoothed classifier

that I care about. This is why we think that this

is a natural thing to do. But this other objective means: find an adversarial example for the base classifier that is still adversarial when I perturb it by Gaussian noise, which maybe is a good property to have, but somehow we don't care about this. We don't care if the adversarial example is robust to Gaussian noise; we care that if we apply the smoothed classifier at this point, it is in fact an adversarial example. So if you do training with this second objective, you will actually get

much worse results. So here's a plot. These lines are what happens when you try to use this objective, and for these lines up here, the higher this is, the stronger the attack you get, and if you do stronger attacks, you get stronger defenses when you use adversarial training as well. Essentially, attacks that use this objective are much stronger than attacks

that do this. Yeah.>>Excuse me. So assume

that [inaudible] isn’t the Gaussian Augmentation

on the right possibly finding the least strong

adversarial example?>>It’s just finding

an adversarial example that itself is very robust. But see, I don't care about the robustness of the adversarial example itself, I care about the robustness of the smoothed classifier

at this point. So that’s the moral difference. Yeah. So when you do

adversarial training with this, it turns out you can

substantially boost or get a much larger gap in the resulting certified correctness. So here are some plots. So

here are some results on CIFAR10. There are two lines here. The red line, this red curve, is the certified accuracy, what the previous work could guarantee is robust, for different choices of the l_2 radius. So this is how much they could actually certify at this radius. The dotted line is how much the empirical attack was able to achieve, so this is an upper bound on how robust their model actually was. Similarly for the blue lines, which are ours, and you can see our certified accuracy is actually better than what they

can actually ever hope for. These are more detailed plots, but they show all of

the same picture, and it's a similar story for ImageNet. We do substantially better; this is roughly maybe

10 percent or so, which is a very large boost. We can also combine these with

other state of the art techniques. So there's a recent paper that showed that if you pre-train a model, somehow taking advantage of ImageNet, so if you first train a model on ImageNet and then transfer it over to CIFAR10, the resulting model has

much better accuracy. If you do the same thing for us and you combine it with

our adversarial training, you also get large boosts. So here are the numbers. So again, here’s the plot that

shows both numbers as a table. So ours, before pre-training, is still a really large boost, but with pre-training,

you get even larger, often like 20 percent improvement upon what was previously

state of the art. So that’s pretty much all I have. I want to say that in

machine learning, in AI, we are increasingly using these

systems for various sensitive tasks, like driving a car,

these sorts of things. I think we need to be

increasingly aware of the security risks that

these things pose. These are also very interesting

theoretical problems, like there’s very interesting math, but we also need to be aware that these introduce

additional weaknesses potentially into any pipeline that might be security sensitive. While the things I've shown, at least the things that I described, were mostly academic, the Tesla example and

the glasses examples are becoming increasingly real-world. These theoretical insights

often directly lead: as we try to understand these things from a more theoretical perspective, the insights that we get often directly lead to

better practical algorithms. The robust mean estimation

thing directly led to these better outlier

detection methods, and these better methods for learning the shape of Europe,

and all these kinds of things. But there’s still many exciting theoretical and applied

questions to consider, and I should say that in Microsoft, we’re constantly red

teaming our own systems to try to find and patch security

vulnerabilities of this sort. That's pretty much all I had. Yeah.>>Have you seen substantial evidence that creating algorithms

which are robust to outliers or adversaries [inaudible] even though

those don’t exist?>>So this is a tricky question. So in the setting of

adversarial examples, there have been these

studies recently that showed that, how do I say this? There’s this drift which is natural, maybe as I shift from one

image set to another, and there’s this adversarial

distribution shift. It seems that these two

are very different. One does not help the

other necessarily. It’s a little bit of a

different issue with the robust statistics, because there, there are direct statistical interpretations of all these things; like, if you're at total variation distance epsilon, then you still have to get these guarantees. So it really depends on the setting. Yeah. Yeah.>>Have you encountered any examples where adversarial training

data does damage to your model without it showing

up in the covariance matrix?>>So in some sense, we know that for specific

things, that’s impossible. At least, for mean estimation,

we know that if it doesn't show up in the covariance matrix, it doesn't mess up at least the mean. Of course, what you actually care about corrupting might not be the mean, and then perhaps you need to look for some other statistic which would hopefully give you evidence

that something is wrong. So it depends on your algorithm. For instance, for the

stochastic optimization, if you try to look at

covariances of something else, not of the gradients, then you

wouldn’t find anything wrong. So you have to find out exactly what the thing is that you care about and then use that. Yeah.>>Any other questions?

Yeah. Thank you.
