Exploring Massively Multilingual, Massive Neural Machine Translation


>>Good afternoon, everyone. Today we have a great invited speaker that I’m delighted to introduce to you. Orhan Firat is a research scientist at Google Research, working on sequence modeling and its applications to machine translation. He has so far published many papers in top-tier conferences such as ACL, EMNLP, [inaudible], AAAI and so on. I’d like to stress that he’s one of the pioneers of the multilingual neural machine translation area. Most recently, he built the Massively Multilingual, Massive MT system, called [inaudible], handling 103 languages, which is super exciting. So, from all of us, welcome, Orhan.
>>Thanks for the warm and
generous introduction, and thanks for coming in. It’s great to be back here. So I was here three years ago. I gave another talk with the title, Explorations on Multilingual
Machine Translation.>>That was three years ago?>>It was three years ago, I think.>>Two thousand and sixteen.>>Sixteen, yeah, okay. But I remember title
because I check the slides, Explorations on Multilingual
Machine Translation. So over the last three years, we’ve been trying to
add two more Massives. So that’s the short story. But that talk was given with a three-author list. Now, we have 30 authors or collaborators spanning many teams at Google Research, Translate, and TensorFlow. You’ll see that a problem at this scale is sometimes a systems problem, sometimes a theoretical problem, sometimes a neural network problem, sometimes a machine translation problem. So this is a very
multi-faceted problem that I’m very excited to talk about. So I’m happy to be part of
the team and the project. So there are two things to Massive that we added over
the last three years. The first Massive is
about the number of languages that we add
within a single model. Basically, we are trying to extend the coverage of a conventional machine translation model, which is used to translate from one source language into one target language. If you think of a machine translation matrix, like the WMT matrix, we’re basically trying to cover all cells in that matrix, increasing our coverage from one to 100 as the first phase. I’ll talk about what is
our next phase and so on. That’s where we head to
universal translation. The second M is about the
massiveness of the neural networks. So over the last couple of years, it’s actually more than three years, we are seeing a trend. It’s almost a sure shot if you increase the capacity
of your neural network. If you’re successful at
training the neural network, success seems almost guaranteed. There are a lot of criticism about, is this really a neural art? Where’s the science in it?
Where is the research question? But it seems like that
trend is holding. The second Massiveness of the entire project is
basically trying to scale the neural networks
by using this trend or by trying to understand what
is happening in this trend. So we called our project Massively Multilingual,
Massive Machine Translation. It’s a bit of a mouthful, but if you say M4, I think it feels a
little bit more okay. All right, and this is not
working. I’ll go here. Okay, I’ll just go with, what is our goal motivations? I’ll describe how are we
attacking this problem, and I’ll try to end with
some open problems. So our entire goal, our end goal is trying to build
a universal translation model. So this is a goal that’s
like a holy grail in connectionist sequence modeling
or connectionist paradigm. Since the 1950s, we’ve been trying to parameterize the interlingua, the intermediate medium between languages or sequences or communication. Universal translation is
the actualization of it. How do we define
universal translation? Building a single model for
all the languages out there, and being able to
translate between any of these source languages than
any of these target languages. So we believe it’s the holy grail on neural machine translation and whole effort here that
you’re going to be seeing it’s a step forward
towards that direction. Just as a summary, we also summarize our recent
progress in a Google blog post. It’s pretty much what I’m going
to be talking about here in my talk so we can find some references and so on
in that blog post as well. So what is the problem
that we’re solving? What is our motivation? I’ll probably stand here, I hope I’m not going to block anyone. Okay. So our first motivation is improving translation quality, increasing translation quality across the board, not for one particular language, but for all languages. This is the dataset that we’re using. It’s a curated dataset, crawled from the web, and we kept it at around 25 billion examples. It’s covering the 103 languages that Google Translate is supporting. I learned that 103 is an arbitrary number; it’s just where we decided to cap the number of languages at some point. So we decided on 103, and let’s use all the data that’s available for all
these 103 languages. If you look at on the
left-hand side of this plot, you see high-resource languages, and on the right you see
low-resource languages. For high-resource languages,
you see the number of examples, is actually over billions. It’s of course noisy, but it’s a lot. This is a log scale, and as you go towards
the low-resource end, you will see the example number of available training examples
drops dramatically. On the left-hand side, we can, by training giant models like trolling all the tricks that we have, we can actually achieve
approach human qualities. You can also see that if you have
more than 100 million examples, you can throw so many tricks, you can narrow down the domain, etc. You can actually reach
almost human quality there. But that’s not the case for
low-resource languages. If the number of examples is
less than, say one million, it all of course depends on the
difficulty of the language, etc, but it’s a general
trend that you see. On low-resource, if you have really small amounts of data, the translation quality is not that good; on high-resource, we’re actually doing quite well as a community. But our goal is to try to improve
everything across the board. Multilingual NMT is basically a transfer mechanism; you can think of it as positive language transfer. What happens if you train a multilingual model? You’ll observe a transfer from the high-resource languages towards the low-resource languages. This is known as positive language transfer, or positive transfer in machine learning. So high-resource data, or high-resource languages or tasks, are going to help the low-resource or easier tasks. Okay, so we got this covered: low-resource languages,
we’re going to do good, but what are we going to do on the high-resource
languages if our goal is to build a universal
translation system? So that’s the first low-resource
languages improving low-resource, it’s coming from the
Massively Multilingual part of the entire project, and improving the
high-resource is coming from the Massiveness of
the neural networks. So this is a two-pronged attack: it’s Massively Multilingual to improve mid- to low-resource languages, and the other branch is basically scaling up your neural networks to the capacity limits to increase the quality of high-resource languages. The other thing is, we
kept it at 103 languages, but 103 is not the total number
of languages in the world. Here, I chose some numbers relative to what Google Translate supports right now. There are 7,000-plus languages and Google only supports 103; of 2,000 African languages, Google only supports 11; and of 100-plus Native American languages, Google supports zero. So we really need to build a universal translation
model as soon as possible. Because by the end of this century, half of these languages
are going to be gone. So it’s also our duty as
the research community. The third interesting thing is about the recent trend that I touched on previously. This is about neural network scaling, and our new understanding of generalization. I don’t know the background of the audience, but please let me know; I can go into the details. But let’s first talk about the amount of data and the generalization error: how do they interact with each other? If you’re interested, you can take a look at the details in this paper, but let’s assume there are three regions in the current paradigm. The first region is the Small Data Region. There you’re basically at the best-guess error, at chance level; or think of a dataset even smaller than IWSLT, where you’re not even able to increase the BLEU score, you’re not getting anything other than memorizing everything. So in this region,
what you should do, you should gather more data. But what about the second region? This is the Power-law Region. As you get more training data, you’re actually going to reduce
the generalization error. So we believe this is the region that we’re at now
as the research community. But we’re also going to hit a wall, the Irreducible Error Region. Even if you get more data, you’re not going to see anything better. This could be because of not being able to scale the models up, but it is more about the sampling error of the training set: the data-generating distribution is noisy, so you cannot reduce the generalization error even further. So if you’re here, what should you do? There’s actually a very simple recipe, and it actually explains what’s been happening over the last couple of years. We just search for a model-problem fit: this model seems good for this problem, then we scale it, and it literally improves the generalization error. So this is one trend, on the amount of data and how it correlates with the generalization error, which ties back to the number of examples that I just mentioned. We were using 25 billion examples, so we believe we have a long way to go if we actually scale the model sizes up.
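To make the power-law picture concrete, here is a minimal sketch of held-out error as a function of dataset size, modeled as a power-law decay plus an irreducible floor. The constants here are made up for illustration; the actual exponents are task- and model-dependent (see the scaling paper referenced above):

```python
def generalization_error(m, alpha=5.0, beta=0.35, eps_irreducible=0.5):
    """Illustrative power-law fit: held-out error decays as m**(-beta) with
    dataset size m until it flattens at the irreducible floor.
    All constants are hypothetical, not fitted values from the talk."""
    return alpha * m ** (-beta) + eps_irreducible

for m in [1e3, 1e6, 1e9, 25e9]:  # from the small-data region toward the 25B-example regime
    print(f"{m:.0e} examples -> error ~ {generalization_error(m):.3f}")
```

In the small-data region the curve is dominated by the best-guess error, and in the irreducible region only the floor remains.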
The other trend is about why the generalization error decreases, or why the model generalizes better, as I increase the model capacity. This is counter-intuitive if you think about Statistical Learning Theory, which suggests that as the approximation error decreases, your estimation error increases. That is what we know from Statistical Learning Theory, but the recent trend in neural networks suggests that as you decrease your approximation error, your generalization error will also get better. This is a mind-blowing finding, and a lot of people are working on it right now. It’s tightly coupled to the optimization, to what we call the training error. I’m not going to be
going into details, but if there’s interest,
I’ll be happy to talk after or in the last
couple of minutes. So that’s the motivation
why we are actually scaling Neural Networks
and we’re expecting some gain and these
gains are actually targeted to generalization
not memorization. Another motivation is that this is a really compelling test-bed. You could guess from the author list there, 30-something participants; it’s a testament to how compelling a test-bed this problem is. If you think about the
massively multilingual part we just talked about: you built the system, but how are you building this system? It’s a big problem. Mostly, we are studying it in the multitask learning framework, and current solutions to multitask learning problems are mostly studied in the meta-learning or continual learning paradigms. I will talk about what these entail and what future directions we can take at the end, but that’s the first part; those are the interesting research problems lying there. The second set of interesting research problems is about the massiveness of the approach,
of the project. To achieve this massive multilinguality, you need to also scale up the model capacity, and vice versa: as you scale the model capacity, you need more data. Where is this data going to come from? Only one language? Probably not, because then you will have some other problems. So you have to increase your coverage to enrich your data, but also to increase the number of samples in your training set, if you remember the log-scale generalization error plot. And increasing the model capacity is not easy. There are so many options; it’s a very large search space and there’s no single way of doing it. Depending on the architecture, depending on the problem, things change. But there’s also another dilemma, the trainability and expressivity dilemma. That’s basically the second bullet: trainability and optimization. Just because you scaled your neural network to a
trillion parameters, it doesn’t mean that you
will be able to fit the data or that you’ll be able to make that model converge. There’s a huge problem going on there; it’s an active research problem in the field right now. Also, if you do train a trillion-parameter network, will you be able to analyze it? Will you be able to optimize it? Will you be able to serve it at the end? These are open questions. So there are a lot of efficiency improvements; if you think of network pruning or the lottery ticket hypothesis, they all fall into the third bullet here. So it’s a very compelling test bed for machine learning
research in general. So how did we actually
partition this project? How did we make progress over the last three years? We of course probed three different directions. Probably, contrary to what you would think, the first probe was the depth, scaling up the neural networks, because that was the crucial thing and that was the key. That’s why we first analyzed how we can train very deep neural networks without even having, say, debugging routines or debugging practices at hand. The main problem right now in deep learning or neural network research, across the whole family of models, is: if you’re going to plot something to understand what’s going on in a model, which dimensions, what should be the x-axis and the y-axis of the plot? That’s crucial. You should first figure out what to plot to understand what is going on. It’s not only about whether my perplexity is decreasing or my BLEU score is increasing; how can you interpret what is going on in the model by using these two? So we devised a couple of monitors or debugging tools, and then we managed to scale neural machine translation models to fairly good depths and numbers of parameters in that paper. Second, control. By the way, every time
that we were attacking one approach in one
particular branch, we were fixing everything else. For example, in the first one, we’re not analyzing multilinguality; everything is controlled, dataset size, data amount, everything. In the second one, you control everything about the depth, the scale, and the other factors, and just focus on how many languages we can cram into a single neural machine translation model. There we answered that question, and our answer was that we can: if you cap the amount of data at one million examples for all 103 languages, which is basically a capped version of this plot, capping it here and using one million examples for all languages, it actually gets rid of all these trainability or multitasking issues. With that, we were able to cram 103 languages
within a single model. But these two combined, if you want to combine these two, if you relax some of the
assumptions that you have, they come with a lot of
trainability challenges. Think about the data
distribution here again; sorry about that, I’m jumping back and forth. Think about the training process using all this data. Say you make one epoch over the entire set and you’re using SGD; you’re basically sub-sampling examples within this mixed batch. What ends up happening is that you are going to over-sample the high-resource languages a lot and barely touch the low-resource languages. So at the end, you think you made one pass, one epoch over this dataset, but because of the stochasticity, what you end up seeing is either that you overfit on the low-resource languages while you still have a long way to go on the high-resource languages, or that you haven’t seen enough examples of the low-resource languages and your model is favoring the high-resource languages. So there’s a trade-off that you should be balancing out. That’s what we studied
in the last paper; I will talk about the details shortly. But then we combined these three directions. Each one of them is a promising direction where we can control things and make progress, and then we decided: let’s call it M4. After the pilot studies, we moved to a realistic scenario, removing all the constraints that we had: use 103 languages, scale your neural networks to the chip limits, the hardware limits, and do not constrain the number of examples per language. So this is literally M4 in the wild. No rules, no nothing, no constraints. That was our setup. We put out one paper
about M4 in the wild; that was more of a position paper outlining what the problems are in this direction or in this realm, and we followed that position paper with a couple of papers, which I’m going to be talking about now, addressing each one of the open problems that we laid out in the
Open Problems Paper. But first, we had to
develop some baselines, and we wanted to be careful because this is going to use a lot of resources, compute and headcount and manpower basically. So I’m going to talk about how we set our baselines, how we learn given this data imbalance, and then at last how we increase the model capacity. Our goal in this phase was: okay, train a 103-language translation model and attain parity with the baselines. Basically, beat your baselines on all these 103 languages with a single model. By the way, I also caught a cold last
week. Sorry about my voice. So let’s look at the data
distribution again where we overlaid the bilingual
baseline BLEU scores on top. Don’t over-read these BLEU scores; they are just on a held-out set that we believe is a good proxy for generalization, but they do reflect the high-resource versus low-resource quality issue. On the high-resource languages, they reach a good 40-something BLEU score; on the low-resource side, the BLEU score drops quickly. This is also the case for perplexity, for how well you are fitting the data, but I’m just showing the BLEU scores here. In the next couple of plots, I will be collapsing these
102 bilingual baselines into the zero line, and then I will be only showing negative and positive BLEU scores
to make it easy to digest. So what about the model? The model is actually not super fancy. It’s a slightly different wiring of Transformer-Big; it’s bigger than our Transformer-Big baselines. It’s basically using two tricks coming from these two papers: it has a different normalization scheme, which also allows very deep models, and it has transparent attention, which is basically weighted skip connections to the encoder. But in terms of the architecture itself, the sharing paradigm or sharing scheme is extremely important
for multi-task models. Let me just sketch the history of multi-task models, or multilingual models, multilingual NMT models. What you can share and what you cannot share, basically, that’s the spectrum here. You can share a fixed-
length representation across all the source
and target languages. But that creates a huge bottleneck. So you have to basically
cram the meaning into a single vector here
and decode from that. Or you can share some sub-modules
which naturally emerged as attention module in an RNN
base sequence-to-sequence models, and you can actually share
a single attention network across multiple encoders
and multiple decoders but it increases the number of encoders and decoders linearly
as you add more languages. The third case is where you basically share everything, including the embeddings, whatever you can think of; you can share everything. That’s basically the Google multilingual NMT system. Actually, because of its
simplicity, and in order to narrow down the search space, we go with the last one here. To tell the model which language you want to translate into, you either prepend a token to the input, or you learn an embedding for that target language and add that embedding on the encoder or decoder side while you decode.
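To make the first option concrete, here is a minimal sketch of the prepended target-language token, in the style of the Google multilingual NMT system; the exact token spelling used in M4 is an assumption here:

```python
def add_target_token(source_sentence: str, target_lang: str) -> str:
    """Prepend a target-language token so a single shared model knows
    which language to translate into. The "<2xx>" format is illustrative."""
    return f"<2{target_lang}> {source_sentence}"

print(add_target_token("How are you?", "es"))  # "<2es> How are you?"
print(add_target_token("How are you?", "de"))  # "<2de> How are you?"
```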
baselines into zero line here. So anything is a zero line
there to a 103 languages here. One thing I forgot to mention, our data set is mostly
English-centric. Examples are, for example, Spanish to English and
English to Spanish. When you train using all this data, you end up with a multilingual
model that is English-centric. You can easily evaluate in
an English-centric fashion. On all my slides, on the left you’ll be seeing English to any of these languages; on the right-hand side, you’ll always be seeing any-to-English translation directions. What that means is that these two plots are coming from the same model, but evaluated either on English to any of the 103 languages in the mix, or on any of the 103 languages to English. You can also evaluate across, basically the off-diagonal cells of this matrix; sometimes that’s called zero-shot or zero-resource, but I’m not going to be talking about that here. But if you just pick up the data, don’t scale anything, and train a model comparable with your baseline using all the data, this is what you see. Let’s look at English to any. It’s basically winning only
on a few languages. You see this trend line. Blue trend line is coming from a single model using the
original data distribution. It’s actually quite bad. It’s only beating the bilingual baselines for some high-resource languages. Other than those couple of languages, it’s actually below the bilingual baselines on all the languages that you consider. The story is the same when you’re translating into English as well. In a multilingual model, we’re mostly expecting to get a great, amazing English language model, because English is the target side and the model is seeing so many English examples, but that’s not the case. So what is going on?
It’s because you’re using the original data distribution. If you change the original data distribution with a smarter sampling strategy, and it doesn’t even have to be that smart, just a simple heuristic, you do better. For example, multilingual BERT uses a simple temperature-based heuristic, and older multilingual models were also using simple temperature-based smoothing. We employed those techniques, and you can get quite some improvement on the low-resource languages.
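As a minimal sketch of that kind of temperature-based heuristic (the temperature value and per-language counts below are illustrative, not the ones used in M4), you upsample low-resource languages by flattening the data distribution:

```python
import numpy as np

def temperature_sampling_probs(examples_per_lang, T=5.0):
    """Flatten the language distribution: p_l proportional to (n_l / N)**(1/T).

    T=1 recovers the original data distribution; larger T moves toward
    uniform sampling, i.e. heavier upsampling of low-resource languages."""
    n = np.asarray(examples_per_lang, dtype=np.float64)
    p = (n / n.sum()) ** (1.0 / T)
    return p / p.sum()

counts = [2_000_000_000, 50_000_000, 1_000_000, 40_000]  # hypothetical per-language sizes
print(temperature_sampling_probs(counts, T=1.0))  # proportional to data
print(temperature_sampling_probs(counts, T=5.0))  # much flatter
```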
That’s plotted as the green curve here. You see that high-resource languages are still regressing; they’re still behind compared to the bilingual baselines. But you start to see the
quality of low-resource languages. This is especially the case when
you’re translating into English, because now you’re training a more balanced English
language model and you start to see the gain for all
the low-resource languages. Here, you see up to 10
BLEU score improvement for low-resource languages
when you’re translating from any of these
languages into English. So that checked one of our goals. We can improve low-resource quality. But why is it this bad? Why is it not really getting
close to the bilingual baselines? So it’s because of you’re
cramming so many tasks, you’re basically asking the single
model with a fixed amount of capacity to do so many
things at the same time. So it is hard. So what
we do here, controlling, let’s gradually reduce the number
of languages that we consider in a multilingual model down to
10 within intervals of 25, like 50, we drop it to 50, drop it to 25, dropped
it to 10, and then see. As expected as you remove number
of languages in the whole mix, the per language capacity increases, so you get close to your
bilingual baseline. In this case, your upper bound of capacity is kept by
the bilingual baselines. So this is just to test, is there something
else going on or not. So we validated, okay, this is just a capacity issue. So then what happens if you
train an individual model? So this is the number of languages, but we didn’t control. There seems to be different trends happening when I’m
translating into any, or when I’m translating into
English. So what is happening? Is it also related with
the task interference? So that’s why we trained
individual models. We picked one model. We train one model only from
English to any of these languages. So this is coming
from a single model, this coming from another
different model. So here, we see when we separate, when we individually
trained two models, when we’re training English to any, it’s not actually any better than
our massively multi-way model. So this indicates there’s a massive
interference problem going on. All the tasks are
interfering with each other, or there’s some other
problems happening, say during the search process maybe. Maybe we’re not guiding
our search process and literally being search properly. At every step, it’s basically solving a harder and harder
problem at every step. So there is research
needs to be done here. What about on the right-hand side? This is literally a capacity problem, and then interference
doesn’t seem to be affecting this direction as expected. So as you drop half of the tasks, as you drop English
to any of the tasks, you see a Delta improvement
across the board. So it’s also indicating us, for the future direction, we can just either remove
some of the tasks or we can increase the model
capacity to attack this direction of translations. So at that point, scaling must seem to be a sure shot, but it’s also a research question, how are you going to scale? So, there are so many variables
that you can play with it, and they also interact since
this is a multi-task problem, or you’re learning so many
things there all at once, different hyper parameters have different effects of
either generalization or memorization of the giant network. So the first question
that we did ask was, should we go deep or wide? Which one transfers better, or which one generalizes better? That was the first
question that we asked. The first two are actually the same, sorry about that, and
we also answered, what we can do about the task interference when you’re
translating from English to any. So those are the questions
that we’re trying to answer. So the first one is scaling
up the model capacity. So if you’re familiar
with Transformer-Big, I mentioned our baselines are slightly bigger than Transformer-Big, so they have around 400
million parameters. It’s not the case for all languages,
for lower-resource languages, we are playing with some of the hyper-parameters to
actually regularize the model, and then fit to get
better generalization. But in general, here you see our
previous model in the light blue. This is the 400 Million
Transformer Multilingual Model compared against the
bilingual baselines, it’s the y-axis, and we trained, this was a control study we did. Okay let’s limit the
number of parameters in the network and then
scale it in two axes. These two axes, they were
width or depth of it, the parameter budget of 1.3 billion. So all of you want 1.3
billion parameter, just scale it as you wish, and which one is doing better,
which one is doing worse, and how is the transfer
characteristics of these two regimes? Interestingly, but also
supporting the literature, deeper models transfer better. Very deeper model here you
see is the dark blue curve. It’s doing as well as the wide model on high-resource languages, but it’s actually transferring way better to the
lower-resource languages. So if we can actually say transfer is an indicator
of generalization, then we can say deeper
models generalize better. That aligns very well with the current understanding
of deep networks. What wide models are doing is spending their capacity mostly on memorizing things, and if you memorize too much, then it’s hard to generalize as well as the deeper counterparts. So if you have a 1.3
billion parameter budget, and if you are solving
multiple tasks at once, it’s better to go deeper
than going wider. But what is the limit? Okay, if
you go deeper, it gets better. So how can we push the limit of it? So how we can actually
even go beyond 1.3 billion just by increasing the depth. So we actually hit
the hardware limit, we hit at that point
the systems limit. So we had to stop and
develop the systems or the tools to scale even
beyond 1.3 billion. So we took a pulse,
we developed GPipe. It’s a pipeline parallelism framework that allows you to train
very big networks, which I’ll be talking shortly. But by using GPipe, we trained a 128 layer transformers, and it was actually getting
better and better, of course, for some low-resource languages, the quality was saturating already. But on the high-resource languages, without losing anything on
the transfer capability, The models was actually getting better and better as we
make these models deeper. Okay, but that is one axis, one-dimension of scaling things up. How we can also exploit the
multitask nature of this problem. So there are different architectures, or different wirings
that you can play with to increase the model
capacity dramatically, and we chose Mixture-of-Experts. You can think of Mixture of
Expert as a learned ensembling, but everything is happening
inside the model. So you have multiple
experts within the model, and you have some routing mechanism, and then you have some gather, you basically distribute and gather, and then you learn how to
weight individual experts, and then you basically come up
with the implicit ensembles, and each of these experts, you can play with its capacity to
inflate your model parameters. By using GPipe, sorry, by using Mixture-of-Experts,
what we did, this Mixture-of-Experts, the
particular implementation that we used was Sparsely-Gated
Mixture-of-Experts. I put the paper link in a couple of, I think paper link is here. If you’re interested,
take a look at the paper. But the implementation of
Mixture-of-Experts in the transform, it’s not trivial, how do we do it? So transformer consists of
consecutive transformer layers. Each layer consists of self attention
and a feed-forward network. Here, how we implemented
Mixture-of-Experts, We replaced every other
feed-forward layer with a Mixture-of-Experts layer. So here you see, for example, this is the Mixture-of-Expert layer sitting on top of the
Multi-Layered Attention. This is the original transform layer, this is a Mixture-of-Expert
transformer layer, and we implemented
Mixture-of-Experts at the token level. In the forward pass, each token is routed to K experts, and by increasing the number of experts here to the extreme, say 512 experts, you can scale the number of parameters in your model to beyond 10 billion or even 100 billion.
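Here is a rough sketch of such a token-level, sparsely-gated MoE feed-forward layer. It is a simplified stand-in, not the production implementation: shapes, the ReLU experts, and the softmax-over-top-k combination are assumptions for illustration.

```python
import numpy as np

def moe_ffn(tokens, expert_weights, gate_weights, k=2):
    """Token-level sparsely-gated mixture-of-experts feed-forward sketch.

    tokens:         [num_tokens, d_model]
    expert_weights: list of (W_in, W_out) pairs, one per expert
    gate_weights:   [d_model, num_experts] gating network
    Only the top-k experts per token are evaluated."""
    logits = tokens @ gate_weights                      # [num_tokens, num_experts]
    topk = np.argsort(-logits, axis=-1)[:, :k]          # k best experts per token
    out = np.zeros_like(tokens)
    for t, token in enumerate(tokens):
        gates = np.exp(logits[t, topk[t]])
        gates /= gates.sum()                            # renormalize over selected experts
        for gate, e in zip(gates, topk[t]):
            W_in, W_out = expert_weights[e]
            out[t] += gate * (np.maximum(token @ W_in, 0.0) @ W_out)  # ReLU FFN expert
    return out
```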
>>Can you say more about [inaudible] replacing the feed-forward layer?
>>Yes. We are replacing the feed-forward layer here. There’s a feed-forward layer, and we’re just replacing it with
a Mixture-of-Expert layer. Each expert is a transformer
feed-forward layer, but there is a routing and combination sub-routine around it that decides which feed-forward expert each token is sent to. There’s a dispatch happening at the output of the multi-headed attention: it looks at all the tokens and says, okay, these tokens should go to that expert, and those tokens should go to the other expert. This does sacrifice some of the quality gains for efficiency; it would probably be better to route an entire language or an entire sentence, and you could devise language-specific experts. But the advantage of implementing it at the token level, of token-level experts, is that they have high throughput and high device utilization, and they also increase the
model capacity dramatically.
>>Are the gates that do the routing also learned, and do they operate at the token level?
>>Those gates are tiny networks, and their values are learned. They are tiny networks, and they’re also looking
at the token level.>>Token level?>>Yes. I will talk about a special specific case of
Mixture-of-Experts at the end, where the conditioning is
on the language index. We can also look at the language ID and the route individual tokens. By doing so, we first scaled
up to 10 billion parameters, and then scale to 50
billion parameters. So here by the way, I’m
only showing to English, all languages to English, because English to
any other language, it’s still an open problem that I’m going to be
touching at the end. So if you’re willing to
contribute on that direction, that’s a very promising
direction for research.>>So the 50 billion parameters, but you don’t have 50
billion generalizations.>>You have 50 billion
tunable parameters.>>The mixture of
expert giving you that?
>>Yes, but at every step you might be updating different experts. There is also an additional cost term, which I didn’t go into the details of, that ensures balance across the experts. It tries to maintain a uniform load across the experts so that you utilize all of them, basically all the parameters that you put in.
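As an illustrative sketch of such a balancing term (the exact auxiliary loss in Sparsely-Gated MoE also includes a load term; this simplified version only penalizes uneven expert "importance"):

```python
import numpy as np

def load_balance_penalty(gate_probs):
    """Auxiliary penalty encouraging uniform expert utilization.

    gate_probs: [num_tokens, num_experts] softmax outputs of the gating network.
    Penalizes the squared coefficient of variation of per-expert gate mass;
    this is a simplification of the published formulation."""
    importance = gate_probs.sum(axis=0)                  # total gate mass per expert
    cv2 = importance.var() / (importance.mean() ** 2 + 1e-8)
    return cv2                                           # added to the loss with a small weight
```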
By doing so, as you see, if you just train a mixture-of-experts model with 10 billion parameters, you’re already better than all the bilingual baselines, so the second goal is checked. But why are we stopping
at 10 billion, we can also scale it even further. This is 50 billion. We wanted to see where’s the
upper bound where should we hit, where we should stop. Because we want to also understand which part of the data
regime that we are at, are we in the reducible area region, what does it say about
a transfer and so on. So it’s not only about
improving the quality, it’s also a research and that we
really want to get the answer. So that was the purpose of
scaling the models even beyond. But here you see lower low-resource transfer. Since these are mixtures of experts, they can specialize on
high-resource languages. If you specialize too much
on high-resource languages, then the transfer is
happening less and less. So actually, as you scale
beyond some parameter limit, you’re not seeing that
much transfer happening to the low-resource compared
to the others, of course. What you can do, you can drop half of the tasks to cap the upper bound. To basically set the upper bound, you can drop half of the tasks, which was basically English to any. You can drop all those
tasks and you can train a single model from any to English. It sets your upper bound in
terms of number of parameters. Here we just plotted what happens if this dark green is our quality upper-bound with 50
billion parameters. I don’t trust BLEU too much here, I wouldn’t read too much into it, but the improvements, even on high-resource languages, start from five BLEU points; this is a huge improvement if you believe in BLEU. But when we optimize, we usually look at the perplexity, at how well we are fitting the data. You see the same plot, which I will show shortly, in terms of the perplexity, the loss
that you train your models on. Which one is better, deeper models or mixture of experts? If I actually limit the
capacity between these two, if I control the number of
parameters between these two, which one is going to get me more gains or what can
I learn from these two? So here, we try to control number
of parameters or the throughput. Here green, you see
again mixture of experts which has 10 billion parameters and transformers which is very deep, six billion parameters
transformer deep. This has 256 layers. Here, the quality gain, it’s actually supporting
what I’ve said earlier, deeper models transfer better. In the case of mixture of experts, they bring you huge
device utilization. So we believe the
future is going to be, again, best of both worlds. So we are at the process of
mixing these two together. The final results, this
is the final result, summary of everything that I said. But let’s look at
the scale of things. Let’s take a side step
and ask ourselves, what the heck are we doing? What is this madness of scaling
neural network capacity? You can frame this as another endeavor, in terms of the number of synapses, to reach the numbers that humans have. It’s arguable whether you can correlate intelligence with the number of synapses or connections in the brain, excluding the cerebellum. Let’s assume it’s the
case and let’s look at where we are right
now as a community. So this is NMT with attention, it had approximately 25
to 50 million parameters. Then transformer came, it had
around 400 million parameters. It was a revolutionary architecture. It was even worse than a honeybee
in terms of number of synapses. The recent advances
over the last year, these are the models
that we’re seeing. Google T5 is using 11 billion parameters, and NVIDIA Megatron is at around eight billion parameters. Microsoft ZeRO is, I think, 1.5 billion, OpenAI GPT-2 is one-point-something billion, and there’s ResNeXt. Actually, you’re not even there
to reach to the number of synapses that a mouse has and
M4 is also not there yet. So we have a long, long way to go, if we keep correlating number of
synapses with intelligence. If you’re asking what’s in it for me, what we can learn from these models, from all these very expensive experiments, there are two takeaways. First, deeper models generalize better. Second, no matter what you do, task interference is a big problem, and scaling is not going to solve that problem. Here, you see all these models, all the models that I showed you when translating into x; they’re coming from the multiway models, evaluated only when translating into x. This seems to be an open problem. There’s interesting
research questions here to be studied to get the
gains as you’re seeing here.>>[inaudible]>>I should admit, this is also affected by the particular
wiring that you’re using. I mentioned we chose to share everything, and sharing everything inherently amplifies the interference. If you amplify the interference, then this problem apparently becomes too hard to solve in a multitask setup. But here, since all the languages are translating into the same target language, they all benefit from the amazing English language model and the transfer properties from
one language to the other.
>>Have you tried smaller groups of much more closely related languages on the x side, and seen whether the transfer works better in that scenario?
>>Transfer works better
in that scenario, yes. Whenever you drop the number of languages in this bundle from 100 down to 10, they’re always better than what you get here. We actually shared that in the in-the-wild paper; you can take a look at the details, but yes. You should either be solving fewer tasks, a smaller number of tasks, when you’re translating into
wiring and decoder network, basically, you should inject
your inductive bias to wire your decoder differently to
mitigate the negative transfer. This is the position paper
that we put all this, list all the problems,
what are the problems. This paper didn’t mean to
answer any of these problems. So we tried to answer the
problems here in in this paper, in the following papers, which I’ll be talking about now. I’ll go into the details
of other papers right now. So massive networks, I mentioned about the systems
aspect of the problem. By the way, this is the
paper that I mentioned. This is how we implemented
mixture of experts, sparsely gated mixture of experts, and fused it into transformers. This was the earlier paper that I talked how we wire our baselines. This now I’m going to
be talking about GPipe, how we efficiently
train these networks. So what you’re seeing
this general trend, the also validated both
on different tasks, image tasks or vision tasks, and also mission translation, as you scale, the average
improvement is increasing. So here, number of parameters, average blue score, you
see the improvement. So what are the systems
that enable this? So now, I’m going to be
talking a little bit about the systems aspect of the things. So these models, very deep models, they’re trained on 1024 [inaudible]
-v3 chips by using GPipe. They also use bfloat16, a lower-precision floating-point format, and rematerialization, also known as the gradient checkpointing trick. We also scaled the batch sizes to the limit. Putting all these things together, as we add more and more chips, we observe sublinear scaling. So let’s look at how we
implemented GPipe. So this is a regular
deep neural network. This is Vanilla Neural Network. Nothing is pipelined but you
have four different devices. You basically distribute your
network into four different devices. If you train your
model in a normal way, what you do is you do the
forward pass device one device, device zero, one, two, three, you compute the loss and then you back-propagate by going through all
the devices and then you compute the gradients
and then you do the update. But if you do some pipe-lining, you can also increase this. This looks like, by the way, if you put this into a time
axis, it looks like this. So your first device is
waiting all the way until here to get some signal
to do some work. So all this time here, this is called the bubble. It’s the time that you spent
by not utilizing your device. Device utilization is huge
because they’re expensive. If you’re under specific
allocation schedule, you should utilize your device. So we want to minimize the time
that the devices are idle here. What we can do is take inspiration from instruction pipelining in computer architecture and apply that pipelining at the batch level. Look here: we have four devices and we split our input batch into micro-batches. As we finish processing a micro-batch on one device, we send it on to the next device, and that decreases the bubble time.
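A minimal sketch of that GPipe-style schedule for the forward pass (backward pass and re-materialization omitted; device and micro-batch counts are illustrative):

```python
def pipeline_schedule(num_devices=4, num_microbatches=8):
    """Return, for each time step, which micro-batch each device works on.
    None means the device is idle, i.e. part of the 'bubble'."""
    steps = num_devices + num_microbatches - 1
    schedule = []
    for t in range(steps):
        row = []
        for d in range(num_devices):
            mb = t - d
            row.append(mb if 0 <= mb < num_microbatches else None)
        schedule.append(row)
    return schedule

for t, row in enumerate(pipeline_schedule()):
    print(f"step {t}: {row}")
```

With more micro-batches per device, the idle steps at the start and end become a smaller fraction of the total.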
>>This is pretty much it. It’s nothing fancy. But on its own it doesn’t fully solve the bubble problem, which we also try to mitigate
by using gradient checkpointing, which means you can put
multiple layers in a device. Say you have 10 layers, you put 10 layers in one device, and if you think about how
you compute the gradient, you have to wait for the
backpropagation Deltas to come in, and up until that time you
have to basically wait. What we did is, we only store. This basically, I’m talking about checkpoint grading,
checkpointing trick. They only need to store the
activations that are at the edge. You only store the last layer
and the first layer activations, and then you recompute all the intermediate activations
when they’re needed. So it also saves all of memory, and it also decreases
the bubble time by increasing the device utilization. Okay, what does it entail? Well, we see some sublinear scaling. So let’s ignore the GPU, this is not using NVLink, it’s not high-speed
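A toy sketch of that re-materialization idea (tanh blocks stand in for real transformer layers; everything here is illustrative):

```python
import numpy as np

def forward_with_checkpoints(x, layers):
    """Forward pass that keeps only the block-boundary activations."""
    saved = [x]
    for layer in layers:
        x = np.tanh(x @ layer)   # stand-in for a transformer block
    saved.append(x)              # store only the block's input and output
    return x, saved

def recompute_intermediates(saved, layers):
    """During the backward pass, recompute the activations we threw away."""
    acts = [saved[0]]
    x = saved[0]
    for layer in layers:
        x = np.tanh(x @ layer)
        acts.append(x)
    return acts
```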
interconnect, but here, let’s look at the transformer on, when they’re using 32 micro
batches at eight-way split. It means we distribute our
models on eight devices. You’re getting 8.6 speedup. That’s what we meant
by sublinear scaling. That was the systems
aspect of things. But it’s also, it enabled
us to investigate and understand the deep neural
network dynamics at scale. If you look at this, so I talked about neural networks
that are thousand layers deep. On the other end, if you go
to, for example, NeurIPS, people analyze single-layer neural
networks and their properties. So we have to come
find a common ground, middle ground to analyze. But to motivate people that
are more on the theory side, we also did some analysis. Okay, well, what these
deep networks tell you if you managed to
analyze them at scale. We actually validated two things. For deep neural nets, there’s an implicit acceleration as you go deeper: these models can converge much faster given the same number of examples, so they’re more parameter-efficient. This is what we show on the left: you reach the same perplexity way faster than with your shallower counterpart. It’s very intriguing, and it’s also validating a couple of things. Overparameterization
is also introducing some implicit acceleration on top of your momentum based optimizers. The second thing is the
effect of large batch sizes. It’s particularly interesting for
machine translation researchers. If you have a large enough dataset, there doesn’t seem to be a
limit in your batch size, but you should be careful. Here for example, we
stopped at four million, but as we tried, eight million is also
giving you improvements. We just didn’t put that here. The problem here, as we increase
our batch size to four million, eight million, you’re
still seeing improvement, but there are some
additional problems. For example, as you include
more and more samples that you increase the chance
of hitting a bad example. So you should be more careful about your sample selection policy
or data filtering policy. But it’s interesting, our optimization subroutines
or our optimizer seem to be doing good given
this low variance gradient. So it was a very good
observation that we validated. This is pretty much it in terms of the systems and analysis aspect. I’ll talk about some byproducts
or some additional papers. We talked about how are we
going to train these networks. Now let’s talk about, okay, I’ll give you this network,
what’s in it for me? We tried to plot the
map of languages by using the intermediate
representations of this network. We have a dataset of 3,000 sentences, where those 3,000 sentences are given in 103 languages, so it’s basically an N-way parallel, 103-way parallel test set. What we did was feed all these sentences to the encoder, pick up the encoder representations at the topmost layer, and, using a similarity analysis, in this case SVCCA, compute an affinity matrix. We then partitioned that affinity matrix with a spectral clustering algorithm, and this gives us the clustering of all 103 languages.
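A rough sketch of that last step, assuming mean-pooled top-layer encoder states per language and a cosine-similarity affinity in place of the SVCCA-based one from the paper (cluster count and dimensions are illustrative):

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Hypothetical stand-in: one pooled encoder representation per language, [103, d_model].
reps = np.random.randn(103, 1024)

norms = reps / np.linalg.norm(reps, axis=1, keepdims=True)
affinity = norms @ norms.T                          # cosine-similarity affinity matrix

clusters = SpectralClustering(n_clusters=10, affinity="precomputed").fit_predict(
    np.clip(affinity, 0, None))                     # spectral clustering needs non-negative affinities
print(clusters[:10])  # cluster id per language, to compare against linguistic families
```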
Then we asked our linguists: okay, what are the families? Can you label some languages and draw some circles? What we found: do they really look like language families? Do they recover the map of languages? Partially; in two dimensions, where everything is projected all the way down to two dimensions, that seems to be the case. So at lunch we talked
about Turkish and Finnish. If anyone can see Finnish here, I’d see Turkish is
close to Azerbaijani. Chinese, I’m not sure if it is close, but definitely close is
[inaudible] and Kazak.>>Japanese is in there, right?>>Japanese, it’s [inaudible].>>Chinese and Turkic language, yes.>>Okay.>>It’s pretty good actually.>>[inaudible].>>Although it’s weird to
have [inaudible] area.>>That might be an artifact
of two-dimensional projection.>>Or [inaudible]. There is a lot of history
shared between languages, [inaudible] not family.>>But the interesting thing, so I don’t have much
of NLP background, so to me, it was okay, how do we know it’s happening
in the finer granularity? So we looked at some
token embeddings. I’m not super clear about how
did they plot these by the way, I’m sure about the next bundle. But the takeaway here is as
you go more finer grains, they also resemble the tree
of languages apparently. But I was curious about one thing. So these are highly influenced
by the word embeddings. At the end of the day, we use word embeddings and they
apply heuristics like SPM, sentence piece model or word piece
model or byte pair encoding. What happens if we take two semantically similar
languages that we know they are coming
from the same family, but they are given in
different scripts. So they actually have such
languages for example, I think Turkish and one other
language is such two languages. So these two, Serbian and
Croatian, I’m not sure.>>Correct, Serbian and Croatian.>>Yeah. So those two languages, actually a lot of languages here that you see at the
encoder embedding, they come close together at the
topmost layer of the encoder. So this was a very good indicator for us that’s telling us,
okay, these encoders, these massively multilingual encoders, are mapping inputs based on some semantics. If you have a semantic encoder, you can use it for so
many other things, which I’m going to be talking next. So if your encoder is good
at encoding semantics, then you should be able to do some good cross-lingual downstream
transfer as a byproduct. So what we did, we
trained all these models, what’s in it for the other tasks, like part of speech or XNLI like
natural language understanding, document classification,
part-of-speech tagging, and named-entity recognition. So we compared it against
the multilingual BERT. They’re not even very comparable
in terms of number of parameters. But the interesting thing
here was we’re seeing improvements on different
variety of tasks. Sometimes it’s better,
sometimes it’s worse. But this came as a
byproduct of just training a multilingual NMT model, and then just reusing its
encoder out-of-the-box.>>Do you think this is surprising? Surprising [inaudible].>>Surprising for?>>That for the model between
using machine transition data.>>Is it bad?>>I think multilingual [inaudible]
is not multilingual it’s just trend on multilingual
language model here.>>Yeah.>>I find it hard to
understand why multilingual [inaudible] working
actually out of [inaudible]>>Is it bad? Is it good? If you think about what
is the utility of it, you can be more practical, I’ll just train 1,000
language model rather than training a multilingual
third model on 1,000 languages, and I’ll use it to
translate across all these 1,000 languages and also do
all these tasks together.>>I don’t mean bad by [inaudible]. I mean, it’s just
encouraging to train multilingual translation model with those [inaudible] in BERT model, so I find it a little bit frustrating that in BERT
models don’t just good.>>So these models can also
do translation by the way.>>Much stronger.>>Yeah. I don’t know how to
do translation with BERT. So these models are just
trained for translation, and then we just fine tuned one layer to do these
tasks. So I don’t know.>>I think [inaudible]
argument is that you have way more semantic inclamation in the multilingual
translation model than in the BERT model that you
were expected to do better.>>I think this way [inaudible]>>It gets stronger
to train BERT data.>>Because you have the strong
signal from the other language, which you don’t have in BERT
by just the occurrence.>>Good question. Yeah. Cross-lingual downstream
transfer is one thing, but we’re more interested in cross-lingual transfer within translation, like zero-shot. So that wasn’t our focus; that was more like, okay, what’s in it for other tasks? What we also observed is that translating to English from all languages implicitly forces their representations to be in the same space. Also, as you fine-tune, the high-resource language quality doesn’t change, so those representations are really hard to move from their convergence point. All right. So we talked a lot about
scaled models, giant models. But if you have only
limited model capacity, how are you going to use it? Let’s talk a little bit about this practical aspect and say
you train this giant model, and then you have new data. Are you going to train it again?
Are you going to fine-tune? But if you’re going to fine-tune, then you have to store
this giant network again, even the storage is a big
problem of these networks. So we came up with a very simple adaptation technique
inspired by Residual Adapters. So this actually also connects with the language level
mixture of experts. This is basically implementing a hard language level
mixture of experts. What we are doing here, you
take a pre-trained M4 model and add some language-specific layers, which we call adapter layers. You just sprinkle them right after the feed-forward layers of the entire network, and then you route the examples by looking at their language ID. The adapter layer looks like this: it’s very simple. It has a layer normalization, which allows you to plug it in anywhere, and it has two projection layers with a non-linearity in between. If you don’t have enough data, you make it a down projection, a bottleneck network; but if you have a lot of data, you can expand the hidden dimension to make the most out of your data. So it allows you to adapt to new domains or additional data on a particular language.
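A minimal sketch of that adapter block as just described (layer norm, bottleneck down-projection, non-linearity, up-projection, residual); dimensions and the ReLU are assumptions:

```python
import numpy as np

def residual_adapter(x, W_down, W_up, eps=1e-6):
    """Residual adapter sketch applied on top of a frozen pre-trained layer.

    x:      [tokens, d_model] activations from the frozen M4 layer
    W_down: [d_model, d_bottleneck], W_up: [d_bottleneck, d_model]
    Only the adapter weights are trained; the base model stays untouched."""
    mean = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    h = (x - mean) / np.sqrt(var + eps)   # layer norm so it can be plugged in anywhere
    h = np.maximum(h @ W_down, 0.0)       # down-project into a bottleneck + non-linearity
    h = h @ W_up                          # project back up to d_model
    return x + h                          # residual keeps the original behavior recoverable
```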
By doing so, we added these residual adapters to the 400-million-parameter Transformer M4 model. On the left, you’re seeing English to any; this is when you’re
translating into x or any. By giving these language
specific adapters, you’re actually recovering
the high resource regression. But also, it’s not affecting the transfer capability
because you’re not touching anything on
the original network. You’re only fine-tuning these
adapter networks with the new data. Also, you’ll see the similar
trend on any to English tasks, half of the other tasks, you’re seeing huge improvements
on high resource languages, you can still play without any regression on
the low resource languages, so it’s really making it practical. If you actually add more and
more data, more capacity, by adding 13 percent extra capacity, that was the point that we
reached the bilingual baselines. So 200 translation models, 400 million parameter each versus
400 million plus 13 percent, it’s actually getting as good as
the army of translation models. Also, you can increase the
adapter size to get more gains. So you can actually cram more and
more layers in each adapters and then train as you get
more data or you can make this hierarchical adaptation scheme to adapt both language and
domain at the same time. So that’s where we
summarized everything here. Enhancing Zero-Shot, this was also another
paper that we put out. As you increase the multilinguality, these models are getting better and better at zero-shot translation, because it adds some implicit regularization; the model has to actually align things together. But you can also encourage it even further by adding an alignment loss on top of the encoder representations. It’s very simple and very practical, and it scales to hundreds of languages.
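As an illustrative sketch only (the exact formulation in the zero-shot paper may differ), such an alignment term could be a cosine distance between pooled encoder states of the two sides of a parallel pair, added to the translation loss with a small weight:

```python
import numpy as np

def alignment_loss(enc_src, enc_tgt):
    """Cosine-distance alignment penalty between mean-pooled encoder outputs.

    enc_src, enc_tgt: [tokens, d_model] encoder states for a parallel pair."""
    u = enc_src.mean(axis=0)
    v = enc_tgt.mean(axis=0)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
    return 1.0 - cos
```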
I’m not going to go into the details of it. I want to talk a little bit about the training problem and the hyper-parameters: how are we actually doing hyper-parameter search given these giant models? Let’s first talk about
the task scheduling. So earlier, I talked
about if you’re using some temperature based
data sampling strategy to balance the learning process on different low resource
and high resource languages. This experiment and the
following experiment, they’re not implemented at scale. They are scaled down
and tested on WMT data, so they’re not actually
103 languages. But this is an attempt to improve the translation quality when you’re
translating from English to x. So we picked up English-German
and English-French. We separated all of the decoders
for these two languages: after sharing a couple of layers in the decoder, we separated them out. If you’re familiar with the WMT English-French and English-German setups, on one hand you have around four million samples, and on the other hand you have around 40 million examples, so there is an imbalance between the training sets. If you just mix everything together, even the data sampling strategy is not going to help that much. We had to devise something else, and we thought: okay, why don’t we learn how to weight the different tasks as an outer-loop optimization? We also use our baseline knowledge; it goes into the loss function. I don’t want to go into detail, but it gives you a weight at the end: at every step, after doing the parameter update, you also do an additional
update in the outer loop, and it is giving you a
weight for each task. So this is the learn tasks weights from the
beginning of the training. Here, green is French, blue is German, so it’s
learning for example here, they up-weight the French task and then down-weight the German task because apparently German was too easy to learn because you
don’t have enough data. You’re basically down-weighting
your German examples and up-weighting your French example. You’re also giving the baseline
scores we’re telling, for German, be it 28 Blue score or
be it this perplexity. For French, be it four
to Blue score or be it this perplexity which are given here, these are the baselines that
they give to the model, and this outer loop
optimization while reading a couple of things,
what is the Delta Blue? What is the gain that I’m getting
doing the previous update? By integrating these observations into a loss function and doing
an outer-loop optimization, you get some weights which are here, and then you either apply to change your gradients
or you’re learning rates. We chose to change the
We chose to change the learning rates themselves. Here, you see two learning rates for the different language pairs. It's basically adding some noise on top of what you have as the transformer learning rate schedule. But it was still dominated by the transformer learning rate schedule, which was hand-tuned. So it was changing the learning rate, but the magnitude of the change was not big enough to change the structure of the learning rate, which is dominated by our hand-tuned learning rates here. So why are we not also learning the learning rate itself? That's what we do, by using a technique from the 90s. It's a very simple technique and it's scalable: you take the derivative of your loss function with respect to your learning rate, and you end up with a very simple and intuitive formula, which is here. It tells you whether your current gradient agrees with your previous update direction. If your current gradient direction agrees with your previous update, just increase your learning rate; it's very intuitive. If it doesn't agree with your previous update direction, just decrease the learning rate. It's in closed form, and you can again implement it as an outer-loop optimization. These are all things that the meta-learning, or learning-to-learn, framework can enable.
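The "technique from the 90s" being described appears to match what is commonly called hypergradient descent: differentiating the loss with respect to the learning rate gives an update proportional to the dot product of the current gradient with the previous update direction. A toy SGD version, as a sketch rather than the exact recipe used here:

```python
# Hypergradient descent for the learning rate, sketched for plain SGD on a
# toy quadratic. alpha <- alpha + beta * g_t . g_{t-1}: increase the learning
# rate when the current gradient agrees with the previous update direction,
# decrease it when they disagree. Hyperparameters are illustrative.
import numpy as np

def grad(theta):                       # gradient of 0.5 * ||theta||^2
    return theta

theta = np.array([5.0, -3.0])
alpha, beta = 1e-8, 1e-2               # start near zero, as in the talk
prev_grad = np.zeros_like(theta)

for step in range(200):
    g = grad(theta)
    alpha = alpha + beta * (g @ prev_grad)   # outer-loop update of the LR
    theta = theta - alpha * g                # ordinary inner-loop SGD step
    prev_grad = g

print(alpha, np.linalg.norm(theta))    # the LR grew from ~0 and the loss shrank
```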
By doing so, these are the learning rates that we learned. In the previous slides, the underlying form was not learned; but here, these learning rate schedules are learned. So that was very interesting for us. If you have fine-tuned your bilingual baselines like crazy, these are on par with, but not better than, your very well-tuned single language pair baselines. But if you go to the multitask setup, they were actually way better than our baselines. We also tested it on BERT, and they were also on par with the hand-tuned BERT learning rate schedule, and so on.
>>So this learning rate is starting from scratch, it's learning everything the same as [inaudible]?
>>Yes.
>>So you set it to zero?
>>Zero. Some number very close to zero, ten to the minus eight or nine, something like that.
It learns to increase the learning rate first, and then it starts decreasing. Interestingly, this looks like inverse square root scaling; it actually recovers that. It redraws what we hand-tuned for the transformer. But interestingly, it also goes below zero sometimes. So this means it's deciding to do gradient ascent. We thought this was counter-intuitive and that we shouldn't do that, but if you don't allow it, it just didn't work. This is also a very sharp drop, so we also tried to slow it down, as in this case here, where it's not that fast, so it doesn't actually go below zero. But even in this case, the quality was pretty much the same. So it's telling us something about the loss surface of these networks: from the place that you are, you can actually take a step back and then come back to the same regime.
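For reference, the hand-tuned transformer schedule that the learned curve ends up resembling is the usual linear warm-up followed by inverse-square-root decay; a one-line version with the common defaults:

```python
# The standard transformer learning rate schedule, for comparison with the
# learned curve discussed above: linear warm-up, then inverse-square-root decay.
def transformer_lr(step, d_model=512, warmup_steps=4000):
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```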
to below zero, it didn’t work?>>It didn’t work, yes.>>[inaudible].>>We tried a few different things, but what seems to be working, you just use it, for
example, softmax. You bundle softmax and
embeddings together and then learn a single learning
rate for those two, and then entire
encoder transform body can use a single learning rate. You’ll be surprised
what that looks like, that learned learning
rate looks like. Same goes for the encoder. You can learn basically three
different learning rates for the entire network. This is pretty much
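To make the "three learning rates" concrete, here is a small sketch of bundling parameters into groups that each carry one learning rate; the grouping and names are illustrative of what was just described, not the actual code.

```python
# Sketch: group parameters so each group carries one learning rate, e.g. one
# for softmax + embeddings, one for the encoder body, one for the decoder
# body. Each group's rate could be adapted by the hypergradient-style outer
# loop sketched earlier. Tensors and grouping here are toy stand-ins.
import numpy as np

params = {
    "embeddings":   np.random.randn(100, 16),
    "softmax":      np.random.randn(16, 100),
    "encoder_body": np.random.randn(16, 16),
    "decoder_body": np.random.randn(16, 16),
}
groups = {
    "embed_softmax": ["embeddings", "softmax"],
    "encoder":       ["encoder_body"],
    "decoder":       ["decoder_body"],
}
group_lr = {name: 1e-8 for name in groups}     # one learning rate per group

def apply_update(grads):
    for name, members in groups.items():
        for p in members:
            params[p] -= group_lr[name] * grads[p]

apply_update({k: np.zeros_like(v) for k, v in params.items()})   # no-op demo
```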
This is pretty much the end of my talk. I'll finish with some open problems: what the future entails and what we see as promising directions. This cross-lingual downstream transfer is still interesting, and there's a lot of interest. Now BERT, for example, became the workhorse of so many NLP tasks. What's enabling BERT to do that? It's still unknown. It's also interesting to see that just by training a multilingual model or a machine translation model, you can recycle the encoder or even the decoder of these models on downstream tasks. I think there's something going on there that we should understand and investigate more. Probably, we should revisit our objective functions. Just like you said, maybe the semantics that we're injecting is pretty good. So what does that tell the other tasks, and what can we learn from other tasks and then inject into our objectives? That's interesting. Also, on the tasks themselves: is document classification really the task these models should be used for, or not? And are those the right languages? For example, if you look at the overlap in terms of the number of languages between XLM-R, multilingual BERT, and this M4, if you add all of them up, it actually adds up to 180 languages, so they don't fully overlap. So if you're trying to extend this, let's find and add those extra languages and then make the analysis more fair. Because maybe the Wikipedia data is adding something, or the Common Crawl data is adding something. So we're also trying to make it a fairer comparison.
On unsupervised machine translation: as I said in the second slide, we're heading toward universal translation. As we take more steps, we will see less informative data, or data that has less structure than bilingual data, which is aligned, right? Or trilingual data or 100-way aligned data. We're moving into regimes where we only have, say, monolingual data, or, even one step further, where we don't have any written-form data at all. You only have audio now, or you have to go and find the data in a different form, or your data is going to come from a different contextual form. So these are more in the unsupervised MT and multi-modal translation space. We believe the knowledge stored in this huge machine translation model is a good starting point to jump-start any unsupervised MT direction. That's basically what's here: adapting to unseen languages and basically jump-starting unsupervised MT.
Also, there is a huge opportunity if you want to study transfer learning. Transfer learning, to make a crude partition, can be split into two. There's parallel transfer, and I talked almost only about parallel transfer, where you learn tasks simultaneously and you expect transfer to happen in parallel. But this is not the only way of doing transfer. Even in the parallel transfer case, you can devise smarter schedules, like the learning rate schedules I showed in the last two slides, to mitigate interference and so on. So there's actually an opening to more core machine learning concepts, meta-learning, and so on. The second one is serial transfer. You can also learn incrementally: I'm going to first learn ten languages, and then I'm going to add five more languages without hurting, without forgetting, what I have already learned. So this is a continual learning setup, or a lifelong learning setup. This is also another way to go. The only thing that I presented here was residual adapters, where you learn a base model and then extend it to new domains or more data. But there are also some promising directions in the serial transfer case using continual learning and meta-learning: basically, maximizing transfer without forgetting. So that's pretty much it. Thanks a lot. It was a long run. Thank you.
>>So we still have eight minutes or so? Do you have any questions? So, please, make use of the time.
>>If you can still speak for us, if you still have some [inaudible].
>>You're all out of questions?
>>With the scheduler problem, did you try that on the working system, or was that the multitask one?
>>That was a multilingual multitask system.
>>The system is French, German, which one is it?
>>The system is English to French and German.
>>When you say here in the formula, is the baseline model coming from a bilingual model?
>>This baseline is coming from a bilingual model.
>>Bilingual.
>>Yes. This S_j, or S-bar, is coming from here. Basically, this dark line is the baseline that we give to the scheduler: "Okay, this is what you should beat, what you should surpass." Yeah. [inaudible]
>>We haven't scaled it to that level.
>>Why not?
>>We were busy with other papers.
>>But it should work, right?
>>It should work. Honestly, the reason we didn't try it is that the particular way we implemented our framework wasn't allowing us to do a lot of sharing and not-sharing. These experiments were done on GPU, and there is a difference in implementation between GPU and TPU, so we haven't actually ported this one to TPU. Yeah?
>>How about the learning rate learning? Did you experiment with different [inaudible] models?
>>Different?
>>Different sizes, different numbers of layers. It seems to me that deeper models are quite sensitive to your learning rate choice, so is this solving the problem for you? Can you just go now to 60 or 120 layers and not have to fine-tune the learning rate anymore?
>>No.
>>It's not solving it?
>>It's solving the learning rate problem. So for deep models, if you are able to train a 64-layer model by using your heuristics, then you can also swap your learning rate schedule heuristic with this. But if you scale your model, okay, let me be more specific: this is not the first wall that you will hit if you try to make your model deep. In the GPipe paper, we talk about it; it's not like you can just stretch the model. As you make your models deeper, you will have a lot of non-finite elements. The source of those non-finite elements can be a very large learning rate, but mostly it's either that the models are overconfident in their predictions, which blows up the logits or other things, or that you don't have gradients that you can distribute to the lower layers of your network. So there we use two things: we apply some logit clipping, which is a heuristic, and we also use, say, transparent attention, where we learn how to distribute gradients to the layers down below. So if you are able to train, if you're fairly confident, if you are at that stage where you have all these heuristic tricks, then you can replace your learning rate schedule with this. The benefit here is that when you're training a 100 billion parameter network, you only have a few shots. You cannot do a grid search; you cannot do anything. You will only be training one or two such systems, given the budget and the time. We are not able to tune those hyperparameters, and these learned ones are actually sub-optimal if you apply them, for example, on a single language pair baseline. But since we cannot tune the models at scale, the hand-tuned ones are also sub-optimal there. Actually, these types of meta-learning techniques excel at scale.
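Since logit clipping and transparent attention are only mentioned in passing, here is a brief sketch of both ideas as commonly understood (a simplified reading with illustrative shapes; the clipping function in particular is an assumption, not the exact heuristic used): the decoder consumes a learned, softmax-weighted combination of all encoder layers rather than just the top one, which gives the lower layers a direct gradient path.

```python
# Sketch of transparent attention and soft logit clipping for deep models.
# Shapes, and the use of a single layer combination, are illustrative.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

n_layers, time, d_model = 12, 7, 32
encoder_outputs = [np.random.randn(time, d_model) for _ in range(n_layers + 1)]
layer_logits = np.zeros(n_layers + 1)        # learned scalars, one per layer

weights = softmax(layer_logits)              # convex combination over layers
combined = sum(w * h for w, h in zip(weights, encoder_outputs))
# `combined` ([time, d_model]) is what the decoder attends to, so gradients
# flow to every encoder layer, not only the top one.

def clip_logits(logits, cap=10.0):
    # A crude soft-clipping version of the logit-clipping heuristic; the
    # exact form used in practice is not specified in the talk.
    return cap * np.tanh(logits / cap)

clipped = clip_logits(np.array([3.0, 42.0, -17.0]))
```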
>>Did you find different [inaudible] encoder versus decoder or [inaudible]
>>For the encoder-decoder, we coupled the decoder softmax, the decoder embedding, and the encoder embedding. Those three are coupled together; they all move at the same pace. We tried so many different things to change, say, the transformer encoder block: use a separate learning rate for each layer, use a separate learning rate per parameter, etc. It seems like they don't matter much. You can just use a single learning rate for the whole encoder if you separate out the word embeddings.
>>Were you surprised you came out with the same [inaudible]
>>This one?
>>Yeah.
>>I was reading this paper, and I saw this kind of curve. They applied it on CIFAR-10, which is an image classification problem, and I saw this curve; they learned something like this. I said, "Okay, then let's replicate it." Then we replicated it, and we saw the same thing.
>>[inaudible]?
>>Yes. There is actually a theory on why it should be inverse square root, or why you should decay the learning rate, and so on. Starting from a very small learning rate, basically this warm-up phase, is indicating our lack of understanding of initialization, and also that we're using very, very large batches. Yeah. There are some other techniques that allow you to not start from zero, which lets you get rid of the warm-up hyperparameter, but we haven't used them.
>>For the [inaudible] transfer learning, did you also look at the learning source language or distant languages [inaudible] because this is for French.
>>I think so. I didn't read this one in too much detail. So here we're comparing against all these multilingual BERT variants, and here, again, the right-hand side is the low-resource languages. It's actually on par with multilingual BERT on XNLI, but it's better than multilingual BERT on the low-resource languages, and it's the same for part-of-speech tagging. This may hint at something else: multilingual BERT is not using these smart learning rate schedules, and it's not using task weighting. We studied that a lot; we try to maximize transfer to low-resource languages, so I think it's kind of expected.
>>What about the [inaudible]?
>>Sorry?
>>We have some research showing the opposite of that, so it depends on the largest-scale language, but to rescale out to the smaller-scale languages [inaudible]. Can language [inaudible] allow you to scale [inaudible]? Sure. [inaudible]
>>Haven't tried, I don't know.
>>Hand me the notes. [inaudible] I think. So let's thank the speaker.
>>Thank you.
