Intro to Distributed Deep Learning Systems

What is distributed machine learning?

Generally speaking, distributed machine learning (DML) is an interdisciplinary domain that involves almost every corner of computer science — theoretical areas (such as statistics, learning theory, and optimization), algorithms, core machine learning (deep learning, graphical models, kernel methods, etc), and even distributed and storage systems. There are countless problems to be explored and studied in each of these sub-domains. On the other hand, DML is also the most widely adopted and deployed ML technology in industrial production because of its faculty with Big Data.

What problems is distributed machine learning research trying to solve?

It’s easiest to understand DML if you break it into four classes of research problems. Please note, however, that these classes are absolutely not mutually exclusive.

How to use statistics and optimization theory and algorithms*

Since most ML tasks are essentially minimizing a “loss” on a set of training data, we tend to focus on the following problems:

How long does it take for the optimization process to reach a convergence? Or, in other words, what is the convergence speed (or rate)?
How good is the converged solution?
How much training data are needed to guarantee a good solution?

In order to study these problems, researchers resort to theoretical analysis tools such as optimization theory or statistical learning theory. However, in the context of large-scale ML where we are given more computational resources and our goal is to speed it up (i.e. reduce the training/test time of a model) by leveraging additional resources through parallel or distributed computing techniques, we really want to figure out another set of similar-looking but different problems:

Through distributed or parallel training, are our models and parameters guaranteed to converge on the same state as without acceleration?
If they don’t converge on the same state, then how far are we from the original solution, and how far are we from the true optimal solution?
What other assumptions/conditions are needed to reach a “good” convergence?
How much faster (i.e. scalability) we can get if we compare distributed training to non-distributed training? How can we evaluate this?
How can we design the training process (e.g. data sampling, parameter updating) to ensure both good scalability and good convergence?

How to develop ML models or training algorithms that are more suitable for distributed settings**

This line of research is focused on developing new ML models or adapting (scaling up) existing models in order to handle larger scale data.

How to build large-scale DML applications

There are also some specific application problems like large-scale image classification that require research into scaling up very specific models/algorithms. Most of these solutions can be deployed directly into the production line.

How to develop parallel or distributed computer systems to scale up ML

This line of research is rather intuitive — if our model or algorithm cannot finish the computation workflow on one node, we can try to develop distributed systems to use more nodes (and, therefore, more computational resources). However, to do this, we need to face a lot of system problems:

Consistency: How can we ensure the consensus of multiple nodes if they are simultaneously working toward one goal? What if, for example, they are solving one optimization problem together, but with different partitions of the dataset?
Fault tolerance: If we distribute our workload to a cluster of 1000 computational nodes, what if one of the 1000 nodes crashes? Is there a way to fix it other than just restarting the job from the very beginning?
Communication: ML involves a lot of I/O (e.g. disk read and write) and data processing procedures — can we design storage systems to enable faster I/O and non-blocking data processing procedures for different types of environments (e.g. single node local disk, distributed file systems, CPUI/O, GPU I/O, etc.)?
Resource management: Building a computer cluster is prohibitively expensive so a cluster is usually shared by many users. How should we manage the cluster and allocate resources appropriately to meet everyone’s requests while maximizing usage?
Programming model: Should we program distributed ML models/algorithms in the same way we do for non-distributed ones? Could we design a new programming model that requires less coding and improves efficiency? Can we program in a single-node fashion while automatically amplifying the program with distributed computing techniques?

This is the line of research that we focus on at Petuum. In fact, much of the mainstream ML software we use today is situated in this same area (e.g. GraphLab, TensorFlow, etc.).

Understanding Distributed Deep Learning

Distributed deep learning is a sub-area of general distributed machine learning that has recently become very prominent because of its effectiveness in various applications. Before diving into the nitty gritty of distributed deep learning and the problems it tackles, we should define a few important terms: data parallelism and model parallelism.***

Data Parallelism

Data parallelism is a parallelization technique that is enabled by partitioning data. In data parallel distributed computing, we first divide the data into a few partitions, with the number of partitions equal to the number of worker machines (i.e. computational nodes). Then, we let each worker own one independent partition and let them perform computation over that data. Since we have multiple nodes scanning the data in parallel, we should be able to scan more data than when using a single node — we increase throughput through distributed parallel computing.

In distributed machine learning, where our goal is to speed up the convergence of model training using multiple nodes, applying data parallelism is rather intuitive: we let each worker perform the training (i.e. stochastic gradient descent) on its own data partition and generate a set of parameter updates (i.e. gradients) thereon. We then let all nodes synchronize their parameter states by network communication until they reach a consensus. As long as the synchronization does not take too much time and we see improvement over single-node results, we’ve achieved our goal! Essentially, this is the way Google’s deep learning system DistBelief works.

Model Parallelism****

Compared to data parallelism, model parallelism is a more complex and ambiguous concept. Generally speaking, in model parallelism, instead of partitioning data, we try to partition the machine learning model itself to distribute the workload to multiple computational workers. For example, let’s say we are solving a matrix factorization problem where the matrix is super huge and we want to learn every parameter of this huge matrix. To apply model parallelism, we have to partition the matrix into many small blocks (sub-matrices), and then let each worker take care of a few. In this way, we are able to leverage the additional RAM of multiple nodes if the RAM on one node is not sufficient to store all the parameters in the matrix. Since different nodes have different workloads that map to different blocks of the matrix, we should get a speedup when they compute in parallel.

The question is, how we should partition the model? Since we have so many machine learning models and each model has its own characteristics and representations, there is no principle way to implement model parallelism.

Problems in distributed machine learning

Data parallelism is quite effective when the size of the training data is bigger, since we can scan data much faster with additional nodes. Model parallelism is more applicable to cases in which the size of the model is too big for one node, since it allows us to partition models and leverage additional memory across multiple nodes.

Ideally, in distributed computing, we want to get as many speedups as the number of machine we have allocated (we usually call this metric “scalability”). In detail, if we allocate K machines, and our system can scan data K times faster than on a single node in unit time, then our system has a scalability K, or linear scalability. Linear scalability is the ideal in these systems.

However, due to the overhead caused by synchronization tasks, it usually takes much longer to finish one iteration of training on a distributed computer cluster than on a single node. We have to spend extra time to synchronize (network communication) across multi-nodes at the end of computation in order to ensure the convergence of the ML training task. In practice, the synchronization can take as long as the computation, or even longer.

Why? One major reasons is that some worker nodes in the cluster run more slowly than other nodes (because of cheaper or older hardware), and in order to synchronize with them, the faster workers have to wait until their computation jobs finish — system performance is always bounded by the slower workers. In these cases, putting K machines together will likely result in a negative scalability, which is a huge waste of time and money.

In future posts, we’ll discuss how Google began to address this problem with the parameter server and how Petuum then developed a principled way to address synchronization problems.

*Here’s a good paper to learn more:

More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server, by Qirong Ho, James Cipar, Henggang Cui, Seunghak Lee, Jin Kyu Kim, Phillip B. Gibbons, Garth A. Gibson, Greg Ganger, Eric P. Xing

**I would recommend the following papers:

Asymptotically Exact, Embarrassingly Parallel MCMC, by Willie Neiswanger, Chong Wang, Eric P. Xing. UAI 2014.

LightLDA: Big Topic Models on Modest Compute Clusters, by Jinhui Yuan, Fei Gao, Qirong Ho, Wei Dai, Jinliang Wei, Xun Zheng, Eric P. Xing, Tie-yan Liu, Wei-Ying Ma. WWW 2015.

SaberLDA: Sparsity-Aware Learning of Topic Models on GPUs, by Kaiwei Li, Jianfei Chen, Wenguang Chen, Jun Zhu. ASPLOS 2017.

***Two notable works on large-scale distributed deep learning were published in NIPS 2012 and 2013 ICML.

The first is about the first generation of Google’s internal deep learning platform and the second is by Adam Coates who is currently in charge of Baidu Research Silicon Valley. The core idea of the two articles is to scale up deep learning computation using more computational workers — the first through data parallelism and the second through model parallelism:

Large Scale Distributed Deep Networks, by Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc’aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, Quoc V. Le, Andrew Y. Ng.

Deep Learning with COTS HPC, Adam Coates, Brody Huval, Tao Wang, David Wu, Bryan Catanzaro, Andrew Ng. ICML 2013.

****If you are particularly interested in model parallelism, here are two notable papers:

STRADS: A Distributed Framework for Scheduled Model Parallel Machine Learning, by Jin Kyu Kim, Qirong Ho, Seunghak Lee, Xun Zheng, Wei Dai, Garth A. Gibson, Eric P. Xing. EuroSys 2016.

Device Placement Optimization with Reinforcement Learning, by Azalia Mirhoseini Lt ; / RTI