Researchers report breakthrough in ‘distributed deep learning’


Online shoppers typically string together a few words to search for the product they want,
but in a world with millions of products and shoppers, the task of matching those unspecific
words to the right product is one of the biggest challenges in information retrieval.

Using a divide-and-conquer approach that leverages the power of compressed sensing, computer
scientists from Rice University and Amazon have shown they can slash the amount of time and
computational resources it takes to train computers for product search and similar “extreme
classification problems” like speech translation and answering general questions.

In tests on an Amazon search dataset that included some 70 million queries and more than
49 million products, the researchers showed that their approach, “merged-average classifiers
via hashing” (MACH), required a fraction of the training resources of some state-of-the-art
commercial systems.

Product search is challenging, in part, because
of the sheer number of products. There are about 1 million English words, but
there are easily more than 100 million products online. There are also millions of people shopping
for those products, each in their own way. Some type a question. Others use keywords. And many aren’t sure what they’re looking
for when they start. But because millions of online searches are
performed every day, tech companies like Amazon, Google and Microsoft have a lot of data on
successful and unsuccessful searches. And using this data for a type of machine
learning called deep learning is one of the most effective ways to give better results
to users.

Deep learning systems, or neural network models,
are vast collections of mathematical equations that take a set of numbers called input vectors,
and transform them into a different set of numbers called output vectors. The networks are composed of matrices with
several parameters, and state-of-the-art distributed deep learning systems contain billions of
parameters that are divided into multiple layers. During training, data is fed to the first
layer, vectors are transformed, and the outputs are fed to the next layer and so on.

“Extreme classification problems” are ones with many possible outcomes, and thus, many parameters.
Deep learning models for extreme classification are so large that they typically must be trained
on what is effectively a supercomputer, a linked set of graphics processing units (GPUs) where
parameters are distributed and run in parallel, often for several days.

A neural network that takes search input and
predicts from 100 million outputs, or products, will typically end up with about 2,000 parameters
per product. Multiply those, and the final layer of the neural network alone has 200 billion
parameters. It would take about 500 gigabytes of memory to store those 200 billion parameters.
Training algorithms such as Adam keep two additional values for every parameter, so about
1.5 terabytes of working memory is needed just to hold the model during training. The best GPUs
out there have only 32 gigabytes of memory, so training such a model is prohibitive due to
massive inter-GPU communication.

MACH takes a very different approach. The researchers describe it with a thought
experiment: randomly divide the 100 million products into three classes, which take the
form of buckets, mixing iPhones with chargers and T-shirts all in the same bucket. It’s a
drastic reduction from 100 million to three.

In the thought experiment, the 100 million products are randomly sorted into three buckets in
two different worlds, which means that products can wind up in different buckets in each world.
A classifier is trained to assign searches to the buckets rather than the products inside them,
meaning the classifier only needs to map a search to one of three classes of product. Each world
gets its own classifier, and the product a shopper most likely wants is one that sits in the
buckets chosen in both worlds.

In their experiments with Amazon’s training
database, the researchers randomly divided the 49 million products into 10,000 classes,
or buckets, and repeated the process 32 times. That reduced the number of parameters in the
model from around 100 billion to 6.4 billion. And training the model took less time and
less memory than some of the best reported results for models with a comparable number of
parameters.

MACH’s most significant feature is that it requires no communication between parallel processors.
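
The bucket idea can be made concrete with a toy sketch in Python. This is an illustration of the thought experiment under stated assumptions, not the researchers' implementation: the function names are invented for this example, the sizes are tiny, and `oracle_bucket_probs` is a hypothetical stand-in for the per-world classifiers, which in a real system would be trained neural networks. The sketch assigns products to random buckets in several independent "worlds," then scores each product by averaging the probabilities of the buckets that hold it.

```python
import random


def make_bucket_maps(num_products, num_buckets, num_worlds, seed=0):
    """For each world (repetition), randomly assign every product to one bucket."""
    rng = random.Random(seed)
    return [
        [rng.randrange(num_buckets) for _ in range(num_products)]
        for _ in range(num_worlds)
    ]


def oracle_bucket_probs(product, bucket_maps, num_buckets, confidence=0.9):
    """Hypothetical stand-in for the trained per-world classifiers.

    Puts `confidence` on the bucket that holds `product` in each world and
    spreads the remainder evenly over the other buckets.
    """
    rest = (1.0 - confidence) / (num_buckets - 1)
    probs = []
    for world_map in bucket_maps:
        world_probs = [rest] * num_buckets
        world_probs[world_map[product]] = confidence
        probs.append(world_probs)
    return probs


def merged_average_scores(bucket_probs, bucket_maps):
    """Score each product by averaging, across worlds, the probability of its bucket."""
    num_worlds = len(bucket_maps)
    num_products = len(bucket_maps[0])
    return [
        sum(bucket_probs[w][bucket_maps[w][p]] for w in range(num_worlds)) / num_worlds
        for p in range(num_products)
    ]


# Toy sizes, far smaller than the article's 10,000 buckets x 32 repetitions.
NUM_PRODUCTS, NUM_BUCKETS, NUM_WORLDS = 1000, 10, 4
bucket_maps = make_bucket_maps(NUM_PRODUCTS, NUM_BUCKETS, NUM_WORLDS)

wanted = 123  # the product this imaginary search is really about
probs = oracle_bucket_probs(wanted, bucket_maps, NUM_BUCKETS)
scores = merged_average_scores(probs, bucket_maps)

best = max(scores)
candidates = [p for p, s in enumerate(scores) if s == best]
print("outputs trained:", NUM_BUCKETS * NUM_WORLDS, "instead of", NUM_PRODUCTS)
print("top-scoring products:", candidates)
```

Each world's classifier works only from its own bucket assignment, so in this sketch the worlds never need to exchange anything until the final averaging step. With the article's figures, 32 repetitions of 10,000 buckets come to 320,000 bucket outputs in total, in place of 49 million product outputs.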
