Skip to content

Latest commit

 

History

History
212 lines (107 loc) · 61.1 KB

Federated-Averaging.md

File metadata and controls

212 lines (107 loc) · 61.1 KB

Communication-Efficient Learning of Deep Networks from Decentralized Data

H. Brendan McMahan et. al. Google, Inc.

0. Abstract

Modern mobile devices have access to a wealth of data suitable for learning models, which in turn can greatly improve the user experience on the device. For example, language models can improve speech recognition and text entry, and image models can automatically select good photos. However, this rich data is often privacy sensitive, large in quantity, or both, which may preclude logging to the data center and training there using conventional approaches. We advocate an alternative that leaves the training data distributed on the mobile devices, and learns a shared model by aggregating locally-computed updates. We term this decentralized approach Federated Learning.

现代移动设备可以访问很多数据,这些数据适用于学习模型,从而可以极大的改进在设备上的用户体验。比如,语言模型可以改进语音识别和文本输入,图像模型可以自动选择好的图像。但是,这些丰富的数据通常对隐私很敏感,数量很大,或者都有,这会阻碍用户登陆到数据中心,并使用传统方法在其中训练。我们支持另一种方法,将训练数据分布在移动设备上,通过聚积局部计算的更新来学习一个共享模型。我们称这种去中心化的方法为联合学习。

We present a practical method for the federated learning of deep networks based on iterative model averaging, and conduct an extensive empirical evaluation, considering five different model architectures and four datasets. These experiments demonstrate the approach is robust to the unbalanced and non-IID data distributions that are a defining characteristic of this setting. Communication costs are the principal constraint, and we show a reduction in required communication rounds by 10–100× as compared to synchronized stochastic gradient descent.

我们提出一种实际的方法,基于迭代模型平均来对深度网络进行联合学习,并进行了广泛的经验性评估,考虑了五种不同的架构和四个数据集。这些试验表明,这个方法对不均衡的和非独立同分布的数据是稳健的,在这种设置中,是一种明确的特质。通信代价是主要的限制,我们证明了,与同步SGD相比,需要的通信代价减少了10-100x。

1 Introduction

Increasingly, phones and tablets are the primary computing devices for many people [30, 2]. The powerful sensors on these devices (including cameras, microphones, and GPS), combined with the fact they are frequently carried, means they have access to an unprecedented amount of data, much of it private in nature. Models learned on such data hold the promise of greatly improving usability by powering more intelligent applications, but the sensitive nature of the data means there are risks and responsibilities to storing it in a centralized location.

手机和平板越来越成为很多人的主要计算设备。在这些设备上有强力的传感器(包括相机,麦克风和GPS),而且这些设备使用非常频繁,这意味着他们可以访问巨量数据,而大多数数据本质上都是隐私。在这些数据上学习到的模型可能会通过为更智能的应用赋能,从而极大的改进可用性,但数据的敏感性的本性,意味着如果存储在一个中心化的位置,会有风险和责任。

We investigate a learning technique that allows users to collectively reap the benefits of shared models trained from this rich data, without the need to centrally store it. We term our approach Federated Learning, since the learning task is solved by a loose federation of participating devices (which we refer to as clients) which are coordinated by a central server. Each client has a local training dataset which is never uploaded to the server. Instead, each client computes an update to the current global model maintained by the server, and only this update is communicated. This is a direct application of the principle of focused collection or data minimization proposed by the 2012 White House report on privacy of consumer data [39]. Since these updates are specific to improving the current model, there is no reason to store them once they have been applied.

我们研究了一种学习技术,使得用户可以共同获得在这些丰富数据上训练的模型的好处,而不需要对数据进行中心化的存储。我们称这种方法为联合学习,因为学习任务是在参与设备的一个松散联盟上进行的(我们称之为客户),而这些客户机是由一个中心化的服务器协调的。每个客户机都有一个局部的训练数据集,这些数据永远也不会上传到服务器上。而每个客户都对目前的服务器维护的全局模型计算一个更新,而只对这个更新进行传输。由于这些更新对于改进目前的模型是特别的,一旦这些更新被应用,就没有再存储其的理由。

A principal advantage of this approach is the decoupling of model training from the need for direct access to the raw training data. Clearly, some trust of the server coordinating the training is still required. However, for applications where the training objective can be specified on the basis of data available on each client, federated learning can significantly reduce privacy and security risks by limiting the attack surface to only the device, rather than the device and the cloud.

这个方法的一个主要优点,是模型训练与直接访问原始训练数据的解耦合。很明显,需要一些对服务器的信任,来协调训练过程。但是,对于一些应用,其中训练目标是在每个客户机上数据的可用性,联合学习会显著降低隐私和安全的风险,将攻击的风险仅局限到对设备上,而不是对设备和云上。

Our primary contributions are 1) the identification of the problem of training on decentralized data from mobile devices as an important research direction; 2) the selection of a straightforward and practical algorithm that can be applied to this setting; and 3) an extensive empirical evaluation of the proposed approach. More concretely, we introduce the FederatedAveraging algorithm, which combines local stochastic gradient descent (SGD) on each client with a server that performs model averaging. We perform extensive experiments on this algorithm, demonstrating it is robust to unbalanced and non-IID data distributions, and can reduce the rounds of communication needed to train a deep network on decentralized data by orders of magnitude.

我们的主要贡献是,1)发现了在移动设备的去中心化数据上训练这个问题是一个重要的研究方向;2)选择一个直接的实际的算法,可以应用到这个设置中;3)提出的方法进行了广泛的经验性评估。更具体的,我们提出了FederatedAveraging算法,将每个客户端本地的SGD与进行模型平均的服务器结合了起来。我们对这种算法进行了广泛的试验,证明了对不平衡的非独立同分布的数据,算法也是稳健的,要在去中心化数据上训练一个深度神经网络,可以将通信代价降低几个数量级。

Federated Learning. Ideal problems for federated learning have the following properties: 1) Training on real-world data from mobile devices provides a distinct advantage over training on proxy data that is generally available in the data center. 2) This data is privacy sensitive or large in size (compared to the size of the model), so it is preferable not to log it to the data center purely for the purpose of model training (in service of the focused collection principle). 3) For supervised tasks, labels on the data can be inferred naturally from user interaction.

联合学习。联合学习的理想问题有以下性质:1)在移动设备的真实世界数据上训练,比在代理数据上进行训练,有一个明显的优势,而代理数据一般在数据中心训练上是可用的;2)数据对隐私是很敏感的,或者数据规模很大(与模型的大小相比),所以登陆到数据中心,只是单纯为了模型训练的目的,这是不太合适的;3)对于监督任务,在数据上的标签可以从用户互动中很自然的推断出来。

Many models that power intelligent behavior on mobile devices fit the above criteria. As two examples, we consider image classification, for example predicting which photos are most likely to be viewed multiple times in the future, or shared; and language models, which can be used to improve voice recognition and text entry on touch-screen keyboards by improving decoding, next-word-prediction, and even predicting whole replies [10]. The potential training data for both these tasks (all the photos a user takes and everything they type on their mobile keyboard, including passwords, URLs, messages, etc.) can be privacy sensitive. The distributions from which these examples are drawn are also likely to differ substantially from easily available proxy datasets: the use of language in chat and text messages is generally much different than standard language corpora, e.g., Wikipedia and other web documents; the photos people take on their phone are likely quite different than typical Flickr photos. And finally, the labels for these problems are directly available: entered text is self-labeled for learning a language model, and photo labels can be defined by natural user interaction with their photo app (which photos are deleted, shared, or viewed).

很多模型为移动设备的智能行为赋能,符合上述特点。举两个例子,我们考虑图像分类模型,比如预测哪幅图像在未来最可能被看很多次,或被分享很多次;语言模型可以用于改进语音识别和在触屏键盘上的文本输入,其方式是改进编码,下一个词的预测,甚至预测整个回应[10]。这两个任务的可能的训练数据(用户拍摄的照片,在移动键盘上敲击出的东西,包括密码,URLs,信息,等)都可能是对隐私敏感的。这些样本抽取的分布,也与很容易获得的代理数据集的分布,是很不一样的:在聊天中用到的语言和文本信息,与标准语料库相比,很可能是不一样的,如维基百科和其他网页文本;手机上拍摄的照片,也典型的Flickr照片,也是非常不一样的。最后,这些问题的标签都是直接可用的:输入的文本,对于学习一个语言模型,是自标注的,图像标签可以很自然的通过用户与照相app的交互来定义(哪个图像是被删除的,被共享的,或被看过的)。

Both of these tasks are well-suited to learning a neural network. For image classification feed-forward deep networks, and in particular convolutional networks, are well-known to provide state-of-the-art results [26, 25]. For language modeling tasks recurrent neural networks, and in particular LSTMs, have achieved state-of-the-art results [20, 5, 22].

这两个任务都适合学习一个神经网络。对图像分类来说,是前向深度网络,特别是卷积网络可以得到目前最好的结果。对语言模型任务来说,循环神经网络,特别是LSTM,可以得到目前最好的结果。

Privacy. Federated learning has distinct privacy advantages compared to data center training on persisted data. Holding even an “anonymized” dataset can still put user privacy at risk via joins with other data [37]. In contrast, the information transmitted for federated learning is the minimal update necessary to improve a particular model (naturally, the strength of the privacy benefit depends on the content of the updates). The updates themselves can (and should) be ephemeral. They will never contain more information than the raw training data (by the data processing inequality), and will generally contain much less. Further, the source of the updates is not needed by the aggregation algorithm, so updates can be transmitted without identifying meta-data over a mix network such as Tor [7] or via a trusted third party. We briefly discuss the possibility of combining federated learning with secure multiparty computation and differential privacy at the end of the paper.

隐私。与在数据中心中进行的本地数据训练相比,联合学习有明显的隐私优势。即使保有一个匿名数据集,仍然会使用户隐私处于风险之中。相比之下,传输给联合学习的信息,是改进特定模型所需的最小更新(很自然的,隐私获益的力度,是依赖于更新的内容)。更新本身可以(也应当)是短暂的。他们不应当比原始训练数据包含更多的信息(数据处理不等式),而且通常所包含的信息会少很多。而且,聚积算法不需要更新的源,所以更新的传输可以不需要识别元数据。我们在本文最后简要讨论了将联合学习与安全多方计算、微分隐私结合到一起的可能。

Federated Optimization. We refer to the optimization problem implicit in federated learning as federated optimization, drawing a connection (and contrast) to distributed optimization. Federated optimization has several key properties that differentiate it from a typical distributed optimization problem:

联合优化。我们称联合学习中隐式的优化问题为联合优化,可以与分布式优化进行连接和对比。联合优化有几个关键的性质,与典型的分布式优化问题与很大区别:

  • Non-IID. The training data on a given client is typically based on the usage of the mobile device by a particular user, and hence any particular user’s local dataset will not be representative of the population distribution. 一个客户机上的训练数据,一般是基于特定用户的移动设备的用途,因此任何特定用户的本地数据集都不是族群分布的代表。

  • Unbalanced. Similarly, some users will make much heavier use of the service or app than others, leading to varying amounts of local training data. 类似的,一些用户会更多的使用特定的服务或app,这会导致本地训练数据的变化。

  • Massively distributed. We expect the number of clients participating in an optimization to be much larger than the average number of examples per client. 我们期待参与到这个优化问题中的客户机数量,比每个客户机中的样本平均数要大很多。

  • Limited communication. Mobile devices are frequently offline or on slow or expensive connections. 移动设备会频繁掉线,或连接很慢,或很贵。

In this work, our emphasis is on the non-IID and unbalanced properties of the optimization, as well as the critical nature of the communication constraints. A deployed federated optimization system must also address a myriad of practical issues: client datasets that change as data is added and deleted; client availability that correlates with the local data distribution in complex ways (e.g., phones from speakers of American English will likely be plugged in at different times than speakers of British English); and clients that never respond or send corrupted updates.

本文中,我们强调的是优化问题的non-IID和不均衡性质,以及通信约束的关键本质。一个部署的联合优化系统,必须处理大量实际问题:客户机数据集随着数据的增加和删除而变化;客户机的可用性与局部数据分布的相关性很复杂;客户机对错误的更新不响应,也不发送错误的更新。

These issues are beyond the scope of the current work; instead, we use a controlled environment that is suitable for experiments, but still addresses the key issues of client availability and unbalanced and non-IID data. We assume a synchronous update scheme that proceeds in rounds of communication. There is a fixed set of K clients, each with a fixed local dataset. At the beginning of each round, a random fraction C of clients is selected, and the server sends the current global algorithm state to each of these clients (e.g., the current model parameters). We only select a fraction of clients for efficiency, as our experiments show diminishing returns for adding more clients beyond a certain point. Each selected client then performs local computation based on the global state and its local dataset, and sends an update to the server. The server then applies these updates to its global state, and the process repeats.

这些问题在本文的范围之外;我们使用的是一个受控的环境,适合进行试验,但仍然处理客户机可用性和不均衡,non-IID数据的关键问题。我们假设更新是同步进行的。有固定的K个客户机集合,每个都有一个固定的本地数据集。在每一轮的开始,会选择随机的C个客户机,服务器发送目前的全局算法状态给每个客户机(如,目前的模型参数)。我们只选择一部分客户机,以提高效率,因为我们试验表明,在一定数量之后,增加更多的客户机,得到的收益越来越小。每个选择的客户机然后进行本地计算,基于全局状态和其局部数据集,然后给服务器发送更新。服务器然后将这些更新应用到其全局状态中,然后进行重复。

While we focus on non-convex neural network objectives, the algorithm we consider is applicable to any finite-sum objective of the form 我们关注的是非凸的神经网络目标,但我们考虑的算法,对于下列形式目标的任意有限和都是适用的:

$$min_{w∈R^d} f(w) \space where \space f(w)=\frac{1}{n} \sum_{i=1}^n f_i(w)$$(1)

For a machine learning problem, we typically take $f_i(w) = l(x_i, y_i; w)$, that is, the loss of the prediction on example ($x_i, y_i$) made with model parameters w. We assume there are K clients over which the data is partitioned, with $P_k$ the set of indexes of data points on client k, with $n_k = |P_k|$. Thus, we can re-write the objective (1) as

对于一个机器学习问题,我们一般会$f_i(w) = l(x_i, y_i; w)$,即,在样本($x_i, y_i$)上用模型参数w进行预测的损失函数。我们假设有K个客户机,数据分布在这些设备上,$P_k$是客户机K上的数据点的索引集合,$n_k = |P_k|$。因此,我们将目标函数(1)重写成:

$$f(w) = \sum_{k=1}^K \frac{n_k}{n} F_k(w), \space where \space F_k(w) = \frac {1}{n_k} \sum_{i∈P_k} f_i(w)$$

If the partition $P_k$ was formed by distributing the training examples over the clients uniformly at random, then we would have $E_{P_k} [F_k(w)] = f (w)$, where the expectation is over the set of examples assigned to a fixed client k. This is the IID assumption typically made by distributed optimization algorithms; we refer to the case where this does not hold (that is, $F_k$ could be an arbitrarily bad approximation to f) as the non-IID setting.

如果$P_k$上的训练样本是随机的均匀分布在客户机上的,那么我们就有$E_{P_k} [F_k(w)] = f(w)$,其中期望是在一个固定客户机k的样本集合上求的。这是IID的假设,一般的分布式优化算法都会做这个假设;我们称不符合这个情况的为non-IID设置,即,$F_k$可能是f的一个任意坏的近似。

In data center optimization, communication costs are relatively small, and computational costs dominate, with much of the recent emphasis being on using GPUs to lower these costs. In contrast, in federated optimization communication costs dominate — we will typically be limited by an upload bandwidth of 1 MB/s or less. Further, clients will typically only volunteer to participate in the optimization when they are charged, plugged-in, and on an unmetered wi-fi connection. Further, we expect each client will only participate in a small number of update rounds per day. On the other hand, since any single on-device dataset is small compared to the total dataset size, and modern smartphones have relatively fast processors (including GPUs), computation becomes essentially free compared to communication costs for many model types. Thus, our goal is to use additional computation in order to decrease the number of rounds of communication needed to train a model. There are two primary ways we can add computation: 1) increased parallelism, where we use more clients working independently between each communication round; and, 2) increased computation on each client, where rather than performing a simple computation like a gradient calculation, each client performs a more complex calculation between each communication round. We investigate both of these approaches, but the speedups we achieve are due primarily to adding more computation on each client, once a minimum level of parallelism over clients is used.

在数据中心的优化中,通信代价相对较小,计算代价占主要部分,而很多最近的都在强调使用GPUs来降低这个代价。对比起来,在联合优化中,通信代价占主要部分,我们主要会被1MB/s或更少的上传带宽所限制。进一步的,当客户机在充电、插入并在联入不计流量的wifi时,才会主动参与这个优化过程。而且,我们期望每个客户机每天只参与少数的更新。另一方面,由于任何单个设备上的数据集与总共的数据集大小相比都很小,而现代智能手机其处理速度相对较快(包含GPUs),对于很多模型类型来说,与通信代价相比,计算已经基本上是免费的了。因此,我们的目标是使用额外的计算,以降低训练一个模型所需要的通信量。有两种基本的方式我们可以增加计算力:1)增加并行性,我们使用更多的客户机,在每次通信时,独立的工作;2)增加每个客户机上的计算力,并不是进行简单的梯度计算,而是每个客户在每次通信之间进行更复杂的计算。我们研究了这两种方法,但我嫩获得的加速主要是由于增加了每个客户机上的计算,客户机层次上的并行性使用的很少。

Related Work. Distributed training by iteratively averaging locally trained models has been studied by McDonald et al. [28] for the perceptron and Povey et al. [31] for speech recognition DNNs. Zhang et al. [42] studies an asynchronous approach with “soft” averaging. These works only consider the cluster / data center setting (at most 16 workers, wall-clock time based on fast networks), and do not consider datasets that are unbalanced and non-IID, properties that are essential to the federated learning setting. We adapt this style of algorithm to the federated setting and perform the appropriate empirical evaluation, which asks different questions than those relevant in the data center setting, and requires different methodology.

相关的工作。通过对本地训练模型的迭代平均的分布式训练,McDonald等[28]已经进行了研究,Povey等[31]对语音识别DNNs进行了研究。Zhang等[42]研究了一种带有软平均的异步方法。这些工作只考虑了集群/数据中心的设置(最多16个workers,wall-clock时间是基于快速网络的),并且没有考虑不均衡而且non-IID的数据集,这些都是对联合学习设置非常关键的。我们调整这种算法的类型到联合学习的设置,进行了合适的经验评估,这与数据中心设置涉及到的是不同的问题,并需要不同的方法论。

Using similar motivation to ours, Neverova et al. [29] also discusses the advantages of keeping sensitive user data on device. The work of Shokri and Shmatikov [35] is related in several ways: they focus on training deep networks, emphasize the importance of privacy, and address communication costs by only sharing a subset of the parameters during each round of communication; however, they also do not consider unbalanced and non-IID data, and the empirical evaluation is limited.

Neverova等[29]与我们的动机类似,也讨论了将隐私用户数据保留在设备上的好处。Shokri等[35]的工作也是相关的:他们关注在训练深度网络上,强调隐私的重要性,处理通信代价是通过在每一轮通信中只共享权重的一个子集;但是,他们也不关心不均衡的和non-IID的数据,经验评估也是很有限的。

In the convex setting, the problem of distributed optimization and estimation has received significant attention [4, 15, 33], and some algorithms do focus specifically on communication efficiency [45, 34, 40, 27, 43]. In addition to assuming convexity, this existing work generally requires that the number of clients is much smaller than the number of examples per client, that the data is distributed across the clients in IID fashion, and that each node has an identical number of data points — all of these assumptions are violated in the federated optimization setting. Asynchronous distributed forms of SGD have also been applied to training neural networks, e.g., Dean et al. [12], but these approaches require a prohibitive number of updates in the federated setting. Distributed consensus algorithms (e.g., [41]) relax the IID assumption, but are still not a good fit for communication-constrained optimization over very many clients.

在凸的设置中,分布式优化和估计的问题已经得到了明显的关注[4,15,33],一些算法确实特别关注通信效率。除了假设凸性,这些已有的工作一般需要客户机的数量比每个客户机上的样本数要少很多,数据在客户机中的分布要是IID的,每个节点中的数据点数量都是一样的,所有这些假设都和联合优化的设置是违背的。非同步分布式SGD也用于训练神经网络,如Dean等[12],但这些方法在联合学习的设置中需要非常多数量的更新次数。分布式的共识算法如[41]对IID的假设并没有那么严格,但仍然不太适合在很多客户机上通信受约束的优化。

One endpoint of the (parameterized) algorithm family we consider is simple one-shot averaging, where each client solves for the model that minimizes (possibly regularized) loss on their local data, and these models are averaged to produce the final global model. This approach has been studied extensively in the convex case with IID data, and it is known that in the worst-case, the global model produced is no better than training a model on a single client [44, 3, 46].

参数化的算法的一个终点是简单的单步平均,其中每个客户机都进行模型的求解,对其本地数据的损失进行最小化(可能有正则化),这些模型进行平均以得到最终的全局模型。这种方法在IID数据凸的情形下已经进行了广泛的研究,在最坏的情况下,产生的全局模型不会比在一个客户机上训练模型好。

2 The FederatedAveraging Algorithm

The recent multitude of successful applications of deep learning have almost exclusively relied on variants of stochastic gradient descent (SGD) for optimization; in fact, many advances can be understood as adapting the structure of the model (and hence the loss function) to be more amenable to optimization by simple gradient-based methods [16]. Thus, it is natural that we build algorithms for federated optimization by starting from SGD.

最近深度学习有不少成功的应用,采用的基本都是SGD来进行优化;实际上,很多进展可以理解为,适应模型架构(因此也是损失函数),使其更容易使用基于地图的方法进行优化[16]。因此,我们从SGD开始,构建联合优化的算法,是很自然的。

SGD can be applied naively to the federated optimization problem, where a single batch gradient calculation (say on a randomly selected client) is done per round of communication. This approach is computationally efficient, but requires very large numbers of rounds of training to produce good models (e.g., even using an advanced approach like batch normalization, Ioffe and Szegedy [21] trained MNIST for 50000 steps on minibatches of size 60). We consider this baseline in our CIFAR-10 experiments.

SGD可以很自然的应用到联合优化问题中,在每一轮通信中,完成一个批次的梯度计算(在一个随机选择的客户机上)。这种方法在计算上效率很高,但需要的训练次数很多,以产生好的结果(如,即使使用高级方法比如BN,Ioffe等训练MNIST的mini-batch大小为60,训练来50000步)。我们在CIFAR-10试验中将这个作为基准。

In the federated setting, there is little cost in wall-clock time to involving more clients, and so for our baseline we use large-batch synchronous SGD; experiments by Chen et al. [8] show this approach is state-of-the-art in the data center setting, where it outperforms asynchronous approaches. To apply this approach in the federated setting, we select a C-fraction of clients on each round, and compute the gradient of the loss over all the data held by these clients. Thus, C controls the global batch size, with C = 1 corresponding to full-batch (non-stochastic) gradient descent. We refer to this baseline algorithm as FederatedSGD (or FedSGD).

在联合学习的设置中,更多的客户机的代价很小,所以在我们的基准中,我们使用的是大批次同步SGD;Chen等[8]的试验表明这种方法在数据中心的情况下是目前最好的,超过了非同步的方法。为将这种方法在联合学习的设置中应用,我们在每一轮中选择客户机的一部分(C),在这些客户机上的所有数据中计算损失函数的梯度。因此,C控制着全局batch大小,C=1对应着全批次(非随机)的梯度下降。我们称这种基准算法为FederatedSGD(或FedSGD)。

A typical implementation of FedSGD with C = 1 and a fixed learning rate η has each client k compute $g_k = ▽F_k(w_t)$, the average gradient on its local data at the current model $w_t$, and the central server aggregates these gradients and applies the update $w_{t+1} ← w_t − η \sum_{i=1}^K \frac{n_k}{n} g_k$, since $\sum_{i=1}^K \frac{n_k}{n} g_k = ▽f(w_t)$. An equivalent update is given by $∀k, w_{t+1}^k ← w_t − ηg_k$ and then $w_{t+1} ← \sum_{k=1}^K \frac{n_k}{n} w_{t+1}^K$. That is, each client locally takes one step of gradient descent on the current model using its local data, and the server then takes a weighted average of the resulting models. Once the algorithm is written this way, we can add more computation to each client by iterating the local update $w^k ← w^k − η▽F_k (w^k)$ multiple times before the averaging step. We term this approach FederatedAveraging (or FedAvg). The amount of computation is controlled by three key parameters: C, the fraction of clients that perform computation on each round; E, the number of training passes each client makes over its local dataset on each round; and B, the local minibatch size used for the client updates. We write B = ∞ to indicate that the full local dataset is treated as a single minibatch. Thus, at one endpoint of this algorithm family, we can take B = ∞ and E = 1 which corresponds exactly to FedSGD. For a client with $n_k$ local examples, the number of local updates per round is given by $u_k = E \frac{n_k}{B}$; Complete pseudo-code is given in Algorithm 1.

C=1,固定学习速率η下典型的FedSGD实现,会让每个客户机k计算$g_k = ▽F_k(w_t)$,也就是在目前模型$w_t$下,在其本地数据上的平均梯度,中心服务器将这些梯度聚积起来,进行更新$w_{t+1} ← w_t − η \sum_{i=1}^K \frac{n_k}{n} g_k$,这是因为$\sum_{i=1}^K \frac{n_k}{n} g_k = ▽f(w_t)$。一个等价的更新是$∀k, w_{t+1}^k ← w_t − ηg_k$,然后$w_{t+1} ← \sum_{k=1}^K \frac{n_k}{n} w_{t+1}^K$。即,每个客户端在目前的模型上,使用其本地数据,计算一步梯度下降,然后服务器使用得到的模型进行加权平均。一旦算法这样写了,我们可以通过多次迭代本地更新$w^k ← w^k − η▽F_k (w^k)$,以在平均步骤前给每个客户端增加更多的计算。我们称这种方法FederatedAveraging,或FedAvg。计算量由三个关键参数来控制:C,在每一轮中进行计算的客户机比例;E,每个客户端在其本地数据集上在每一轮中进行的训练迭代的次数;B,用于客户机更新的本地mini-batch大小。我们用B = ∞来说明,完整的本地数据集作为单个mini-batch进行训练。因此,作为这个算法族的一个端点,我们的FedSGD就是对应B = ∞和E = 1的情况。对于一个有$n_k$个本地样本的客户机,每轮的本地更新由$u_k = E \frac{n_k}{B}$给出;完整的伪代码如算法1所示。

For general non-convex objectives, averaging models in parameter space could produce an arbitrarily bad model. Following the approach of Goodfellow et al. [17], we see exactly this bad behavior when we average two MNIST digit-recognition models trained from different initial conditions (Figure 1, left). For this figure, the parent models w and w' were each trained on non-overlapping IID samples of 600 examples from the MNIST training set. Training was via SGD with a fixed learning rate of 0.1 for 240 updates on minibatches of size 50 (or E = 20 passes over the mini-datasets of size 600). This is approximately the amount of training where the models begin to overfit their local datasets.

对于一般的非凸目标函数,平均模型在参数空间会产生一个任意坏的模型。按照Goodfellow等[17]的方法,我们将两个从不同初始条件训练的MNIST数字识别模型进行平均,看到了这种很坏的结果(图1左)。对于这幅图像,父模型w和w'是在非重叠的IID样本上进行训练的,每个包含MNIST训练集的600个训练样本。训练是用SGD进行的,固定的学习速率为0.1,训练240次更新,mini-batch大小为50(或E=20,mini-batch大小为600)。这种训练量,是模型在其本地数据集上快要开始过拟合的情况。

Recent work indicates that in practice, the loss surfaces of sufficiently over-parameterized NNs are surprisingly well-behaved and in particular less prone to bad local minima than previously thought [11, 17, 9]. And indeed, when we start two models from the same random initialization and then again train each independently on a different subset of the data (as described above), we find that naive parameter averaging works surprisingly well (Figure 1, right): the average of these two models, w/2 + w'/2, achieves significantly lower loss on the full MNIST training set than the best model achieved by training on either of the small datasets independently. While Figure 1 starts from a random initialization, note a shared starting model $w_t$ is used for each round of FedAvg, and so the same intuition applies.

最近的工作表明,在实践上,参数过量的NNs,其损失曲线表现很好,特别是不太容易进入很差的局部极小值。确实,当我们从同样的随机初始化开始两个模型,然后在数据的不同子集上独立进行训练(如上所述),我们发现天然的参数平均效果出奇的好(图1右):两个模型的平均,w/2 + w'/2,与在任意一个小数据集上独立进行训练得到的最好模型相比,在完整的MNIST训练集上得到明显更低的损失。图1是从随机初始化开始的,注意一个共享的起始模型$w_t$用于FedAvg每轮的训练,同样的直觉也是这样应用的。

3. Experimental Results

We are motivated by both image classification and language modeling tasks where good models can greatly enhance the usability of mobile devices. For each of these tasks we first picked a proxy dataset of modest enough size that we could thoroughly investigate the hyperparameters of the FedAvg algorithm. While each individual training run is relatively small, we trained over 2000 individual models for these experiments. We then present results on the benchmark CIFAR-10 image classification task. Finally, to demonstrate the effectiveness of FedAvg on a real-world problem with a natural partitioning of the data over clients, we evaluate on a large language modeling task.

我们是受到图像分类与语言建模任务的推动,这些任务中,好的模型可以极大的增强移动设备的可用性。对于每个这些任务,我们首先选择一个代理数据集,大小适中,这样我们可以彻底研究FedAvg算法的超参数。每次独立的训练相对较小,但我们在2000个独立的模型中进行训练,以进行试验。我们然后在基准测试CIFAR-10图像分类任务中给出结果。最后,为证明FedAvg在真实世界问题的有效性,在这些问题中数据是很自然的分割在不同客户机中的,我们在一个大型语言建模任务中进行了评估。

Our initial study includes three model families on two datasets. The first two are for the MNIST digit recognition task [26]: 1) A simple multilayer-perceptron with 2-hidden layers with 200 units each using ReLu activations (199,210 total parameters), which we refer to as the MNIST 2NN. 2) A CNN with two 5x5 convolution layers (the first with 32 channels, the second with 64, each followed with 2x2 max pooling), a fully connected layer with 512 units and ReLu activation, and a final softmax output layer (1,663,370 total parameters). To study federated optimization, we also need to specify how the data is distributed over the clients. We study two ways of partitioning the MNIST data over clients: IID, where the data is shuffled, and then partitioned into 100 clients each receiving 600 examples, and Non-IID, where we first sort the data by digit label, divide it into 200 shards of size 300, and assign each of 100 clients 2 shards. This is a pathological non-IID partition of the data, as most clients will only have examples of two digits. Thus, this lets us explore the degree to which our algorithms will break on highly non-IID data. Both of these partitions are balanced, however.

我们的初始研究包括在两个数据集中的三族模型。前两个是在MNIST数字识别任务中的[26]:1) 一个简单的多层感知机,有2个隐含层,每层有200个单元,都使用了ReLu激活(总体199210参数),我们称之为MNIST 2NN。2) 一个使用2层5x5卷积层的CNN(第一层有32个通道,第二层有64个通道,每个都带有2x2的最大池化层),一个有512单元的全连接层和ReLU激活,最后是一个softmax输出层(共计1663370参数)。为研究联合优化,我们还需要指定数据是怎样在客户机之间分布的。我们使用两种将MNIST数据分布在客户机上的方法:IID,其中数据进行了重组,然后分成100份,每个600个样本,以及Non-IID的,其中我们首先通过数字标签对数据进行排序,将其分成200份,每份300个样本,给100个客户机每个2份。这是一种数据的non-IID分割,因为多数客户机都只有2个数字的样本。因此,我们可以进行研究,我们的算法在这样的高度non-IID数据上的失败程度。但这两种分割都是均衡的。

For language modeling, we built a dataset from The Complete Works of William Shakespeare [32]. We construct a client dataset for each speaking role in each play with at least two lines. This produced a dataset with 1146 clients. For each client, we split the data into a set of training lines (the first 80% of lines for the role), and test lines (the last 20%, rounded up to at least one line). The resulting dataset has 3,564,579 characters in the training set, and 870,014 characters in the test set. This data is substantially unbalanced, with many roles having only a few lines, and a few with a large number of lines. Further, observe the test set is not a random sample of lines, but is temporally separated by the chronology of each play. Using an identical train/test split, we also form a balanced and IID version of the dataset, also with 1146 clients.

对于语言建模,我们从莎士比亚全集中构建了一个数据集,我们从中构建客户机数据集,这产生了1146个客户机数据集。对于每个客户机,我们将数据分成训练行(前80%),和测试行(后20%)。得到的数据集在训练集中有3564579个字符,测试集中有870014个字符。这种数据是非均衡的,很多角色只有几行,一些则有很多行。而且,观察测试集可以发现,这并不是行的随机样本,而是每一幕随着时间的分割。使用相同的训练/测试分割,我们还形成了数据集的一个均衡的IID版本,也有1146个客户机。

On this data we train a stacked character-level LSTM language model, which after reading each character in a line, predicts the next character [22]. The model takes a series of characters as input and embeds each of these into a learned 8 dimensional space. The embedded characters are then processed through 2 LSTM layers, each with 256 nodes. Finally the output of the second LSTM layer is sent to a softmax output layer with one node per character. The full model has 866,578 parameters, and we trained using an unroll length of 80 characters.

在这些数据上,我们训练了一个堆叠的字符级LSTM语言模型,在读了一行中的每个字符后,预测下一个字符。模型以一个字符串为输入,将其每个嵌入到一个学习到的8维空间中。嵌入的字符然后通过2个LSTM层进行处理,每个都有256个节点。最后,第二个LSTM层的输出送入到一个softmax输出层,每个节点一个字符。完整的模型有866578个参数,我们训练使用80个字符的展开长度。

SGD is sensitive to the tuning of the learning-rate parameter η. The results reported here are based on training over a sufficiently wide grid of learning rates (typically 11-13 values for η on a multiplicative grid of resolution $10^{1/3}$ or $10^{1/6}$). We checked to ensure the best learning rates were in the middle of our grids, and that there was not a significant difference between the best learning rates. Unless otherwise noted, we plot metrics for the best performing rate selected individually for each x-axis value. We find that the optimal learning rates do not vary too much as a function of the other parameters.

SGD对学习速率参数η的调节很敏感。这里得到的结果,是在很多种学习速率下训练得到的(一般对于η是11-13个值)。我们核实以确保最佳的学习速率是在网格中间的,在最佳的学习速率之间,并没有很明显的区别。除非额外说明,我们对每个x-轴的值,都会画出性能最好的值的度量。我们发现最佳学习速率作为其他参数的函数,变化并不大。

Increasing parallelism. We first experiment with the client fraction C, which controls the amount of multi-client parallelism. Table 1 shows the impact of varying C for both MNIST models. We report the number of communication rounds necessary to achieve a target test-set accuracy. To compute this, we construct a learning curve for each combination of parameter settings, optimizing η as described above and then making each curve monotonically improving by taking the best value of test-set accuracy achieved over all prior rounds. We then calculate the number of rounds where the curve crosses the target accuracy, using linear interpolation between the discrete points forming the curve. This is perhaps best understood by reference to Figure 2, where the gray lines show the targets.

增加并行性。我们首先对客户机的比例C进行试验,这控制的是多客户机的并行性。表1给出了C的变化,对两个MNIST模型的影响。我们给出要得到目标测试集准确率的必要的通信次数。为计算这个,我们构建了每种参数设置组合的学习曲线,对η进行如上所述的优化,然后取值是在之前所有轮中测试准确率最好的值,使每条曲线不断的进行单调的改进。我们然后计算曲线与目标准确率相交处的次数,使用形成曲线的离散点的线性插值。这可能在图2中最容易理解,其中灰色线就是目标。

With B = ∞ (for MNIST processing all 600 client examples as a single batch per round), there is only a small advantage in increasing the client fraction. Using the smaller batch size B = 10 shows a significant improvement in using C ≥ 0.1, especially in the non-IID case. Based on these results, for most of the remainder of our experiments we fix C = 0.1, which strikes a good balance between computational efficiency and convergence rate. Comparing the number of rounds for the B = ∞ and B = 10 columns in Table 1 shows a dramatic speedup, which we investigate next.

在B=∞的情况下(对于MNIST,是将600个客户机的样本每一轮作为一个批次进行处理),在增加客户机比例时,只有很少的好处。使用较小的批规模B=10,在C≥0.1时有显著的改进,尤其是在non-IID的情况下。基于这个结果,我们剩下的多数试验我们都固定使用C=0.1,这在计算效率和收敛速率上有很好的平衡。比较表1中B=∞和B=10的列中的数,表示会有很好的加速,我们在下面会进行研究。

Increasing computation per client. In this section, we fix C = 0.1, and add more computation per client on each round, either decreasing B, increasing E, or both. Figure 2 demonstrates that adding more local SGD updates per round can produce a dramatic decrease in communication costs, and Table 2 quantifies these speedups. The expected number of updates per client per round is $u = (E[n_k]/B)E = nE/(KB)$, where the expectation is over the draw of a random client k. We order the rows in each section of Table 2 by this statistic. We see that increasing u by varying both E and B is effective. As long as B is large enough to take full advantage of available parallelism on the client hardware, there is essentially no cost in computation time for lowering it, and so in practice this should be the first parameter tuned.

增加每个客户机上的计算量。本节中,我们固定C=0.1,在每一轮每个客户机上增加更多的计算,要么降低B,要么增加E,要么都有。图2的结果表明,增加每轮中本地SGD的更新可以使得通信次数急剧减少,表2对这种加速进行了量化。期望的每个客户机上每一轮的更新数量为$u = (E[n_k]/B)E = nE/(KB)$,其中期望是在一个随机客户机k上的计算的。我们可以发现,通过对E和B的同时变化来增加u是很有效的。只要B足够大,以充分利用在客户机硬件上并行性带来的好处,那么降低就没有什么计算时间上的代价,所以实践中这应当是第一个要调节的参数。

For the IID partition of the MNIST data, using more computation per client decreases the number of rounds to reach the target accuracy by 35× for the CNN and 46× for the 2NN (see Table 4 in Appendix A for details for the 2NN). The speedups for the pathologically partitioned non-IID data are smaller, but still substantial (2.8 – 3.7×). It is impressive that averaging provides any advantage (vs. actually diverging) when we naively average the parameters of models trained on entirely different pairs of digits. Thus, we view this as strong evidence for the robustness of this approach.

对于MNIST数据的IID分割,在每个客户端上使用更多的计算,降低了需要的通信轮数,就可以达到目标准确率,是CNN的1/35,2NN的1/46(见附录A中的表4,为2NN的细节)。病态分割的non-IID数据的加速更小,但仍然是很大的(2.8 – 3.7×)。当我们只是将在完全不同的数字上训练的模型进行简单的平均时,要是有什么改进的话,这都是令人印象深刻的。因此,我们将这视为这种方法的鲁棒性的很强的证据。

The unbalanced and non-IID distribution of the Shakespeare (by role in the play) is much more representative of the kind of data distribution we expect for real-world applications. Encouragingly, for this problem learning on the non-IID and unbalanced data is actually much easier (a 95× speedup vs 13× for the balanced IID data); we conjecture this is largely due to the fact some roles have relatively large local datasets, which makes increased local training particularly valuable.

莎士比亚语料的不均衡的,non-IID分布分割,是真实世界应用的数据分布的更典型代表。令人鼓舞的是,对于这个问题,在non-IID和不平衡数据上的学习实际上是非常容易的(在平衡的IID数据上有13x的加速,而在这些数据上有95x的加速);我们推测,这主要是因为,一些角色的语料的本地数据集相对较大,这使得增加本地训练特别宝贵。

For all three model classes, FedAvg converges to a higher level of test-set accuracy than the baseline FedSGD models. This trend continues even if the lines are extended beyond the plotted ranges. For example, for the CNN the B = ∞, E = 1 FedSGD model eventually reaches 99.22% accuracy after 1200 rounds (and had not improved further after 6000 rounds), while the B = 10, E = 20 FedAvg model reaches an accuracy of 99.44% after 300 rounds. We conjecture that in addition to lowering communication costs, model averaging produces a regularization benefit similar to that achieved by dropout [36].

对于这三种模型类别,FedAvg比基准FedSGD模型会收敛到一个更高水平的测试集准确率。即使这条线超过了画图的范围,这种趋势仍然没有停止。比如,对于B = ∞, E = 1的CNN,FedSGD模型最终在1200轮以后达到了99.22%的准确率(在6000轮以后就没有什么改进了),而B = 10, E = 20的FedAvg模型在300轮后达到了99.44%的准确率。我们推测,除了降低了通信代价以外,模型平均有一种正则化的好处,与dropout起到的作用类似。

We are primarily concerned with generalization performance, but FedAvg is effective at optimizing the training loss as well, even beyond the point where test-set accuracy plateaus. We observed similar behavior for all three model classes, and present plots for the MNIST CNN in Figure 6 in Appendix A.

我们主要考虑的是泛化的性能,但FedAvg在训练损失的优化上也非常有效,即使在测试集准确率停止后,也有这种效果。我们对所有三种模型类别观察到类似的行为,在附录A中的图6中画出了MNIST的CNN的图。

Can we over-optimize on the client datasets? The current model parameters only influence the optimization performed in each ClientUpdate via initialization. Thus, as E → ∞, at least for a convex problem eventually the initial conditions should be irrelevant, and the global minimum would be reached regardless of initialization. Even for a non-convex problem, one might conjecture the algorithm would converge to the same local minimum as long as the initialization was in the same basin. That is, we would expect that while one round of averaging might produce a reasonable model, additional rounds of communication (and averaging) would not produce further improvements.

我们能在客户机数据集上过度优化吗?。目前的模型参数只影响在每次ClientUpdate中通过初始化进行的优化过程。因此,当E→∞时,至少对于一个凸问题,最终初始条件应当是无关的,不论什么样的初始化,都应当会达到全局极小值。即使是对于非凸问题,也可以推测,只要初始化在同一个盆地中,那么算法应当会收敛到同一个局部极小值。即,我们会期望,一轮平均会生成一个还不错的模型,更多轮的通信(和平均)不会有更多的改进。

Figure 3 shows the impact of large E during initial training on the Shakespeare LSTM problem. Indeed, for very large numbers of local epochs, FedAvg can plateau or diverge. This result suggests that for some models, especially in the later stages of convergence, it may be useful to decay the amount of local computation per round (moving to smaller E or larger B), in the same way decaying learning rates can be useful. Figure 8 in Appendix A gives the analogous experiment for the MNIST CNN. Interestingly, for this model we see no significant degradation in the convergence rate for large values of E. However, we see slightly better performance for E = 1 versus E = 5 for the large-scale language modeling task described below (see Figure 10 in Appendix A).

图3给出了在莎士比亚LSTM问题中,初始训练中很大的E值的影响。确实,对于很大数量的本地epochs,FedAvg会达到性能瓶颈或不能收敛。这个结果说明,对于一些模型,尤其是后期的收敛阶段,每一轮的本地计算时间减少(E更小或B更大)会比较有用,以同样的方式衰减学习速率也会是有用的。附录A中的图8对于MNIST CNN给出了类比的试验。有趣的是,对于这个模型,我们在很大的E值时的学习率中,没有看到明显的降质。但是,我们在大规模语言建模的任务中,对于E=1的情况,看到了比E=5的情况下,更好的模型效果(见附录A的图10效果)。

CIFAR experiments. We also ran experiments on the CIFAR-10 dataset [24] to further validate FedAvg. The dataset consists of 10 classes of 32x32 images with three RGB channels. There are 50,000 training examples and 10,000 testing examples, which we partitioned into 100 clients each containing 500 training and 100 testing examples; since there isn’t a natural user partitioning of this data, we considered the balanced and IID setting. The model architecture was taken from the TensorFlow tutorial [38], which consists of two convolutional layers followed by two fully connected layers and then a linear transformation layer to produce logits, for a total of about 10^6 parameters. Note that state-of-the-art approaches have achieved a test accuracy of 96.5% [19] for CIFAR; nevertheless, the standard model we use is sufficient for our needs, as our goal is to evaluate our optimization method, not achieve the best possible accuracy on this task. The images are preprocessed as part of the training input pipeline, which consists of cropping the images to 24x24, randomly flipping left-right and adjusting the contrast, brightness and whitening.

CIFAR试验。我们也在CIFAR-10数据集上进行了试验,以进一步验证FedAvg算法。数据集包含10类32x32大小的RGB三通道图像。有50000个训练样本,10000个测试样本,我们将其分割到100个客户端,每个包含500个训练样本,100个测试样本;由于没有自然用户分割这个数据,我们考虑平衡的IID设置。模型架构是从TensorFlow教程中的,包含2个卷积层,后面是2个全连接层,然后是一个线性变换层,以产生logits,总计10^6参数。注意,目前最好的方法已经在CIFAR中得到了96.5%的测试准确率;尽管如此,我们所使用的标准模型是满足我们的需要的,因为我们的目标是评估我们的优化方法,并不是在这个任务中获得可能的最好的准确率。图像是作为部分的训练输入流程进行预处理的,这包含将图像剪切到24x24大小,随机左右翻转,并调整对比度、亮度和白度。

For these experiments, we considered an additional baseline, standard SGD training on the full training set (no user partitioning), using minibatches of size 100. We achieved an 86% test accuracy after 197,500 minibatch updates (each minibatch update requires a communication round in the federated setting). FedAvg achieves a similar test accuracy of 85% after only 2,000 communication rounds. For all algorithms, we tuned a learning-rate decay parameter in addition to the initial learning rate. Table 3 gives the number of communication rounds for baseline SGD, FedSGD, and FedAvg to reach three different accuracy targets, and Figure 4 gives learning-rate curves for FedAvg versus FedSGD.

对于这些试验,我们考虑一种额外的基准,在完整训练集(没有用户的分割)用标准SGD的训练,使用的minibatch大小为100。我们在197500次minibatch更新后,获得了86%的测试准确率(每次minibatch更新需要在联合学习设置下进行一次通信)。FedAvg在2000次通信后,获得了类似的测试准确率85%。对于所有算法,除了初始学习速率以外,我们调整了一个学习速率衰减参数。表3给出了基准SGD,FedSGD和FedAvg达到三种不同准确率目标的通信次数,图4给出了FedAvg和FedSGD的学习速率曲线。

By running experiments with minibatches of size B = 50 for both SGD and FedAvg, we can also look at accuracy as a function of the number of such minibatch gradient calculations. We expect SGD to do better here, because a sequential step is taken after each minibatch computation. However, as Figure 9 in the appendix shows, for modest values of C and E, FedAvg makes a similar amount of progress per minibatch computation. Further, we see that both standard SGD and FedAvg with only one client per round (C = 0), demonstrate significant oscillations in accuracy, whereas averaging over more clients smooths this out.

使用minibatch大小B=50,对SGD和FedAvg进行试验,我们也可以将准确率视为这种minibatch梯度计算的数量的函数。我们期望SGD的效果会更好,因为在每次minibatch计算。但是,如图9所示,对于C和E的较适中的值,FedAvg在每个minibatch计算中获得类似的进展。而且,我们看到,标准SGD和FedAvg在每轮一个客户端的情况下(C=0),准确率有明显的震荡,而在多个客户端上进行平均则会将这种震荡抹平。

Large-scale LSTM experiments. We ran experiments on a large-scale next-word prediction task task to demonstrate the effectiveness of our approach on a real-world problem. Our training dataset consists 10 million public posts from a large social network. We grouped the posts by author, for a total of over 500,000 clients. This dataset is a realistic proxy for the type of text entry data that would be present on a user’s mobile device. We limited each client dataset to at most 5000 words, and report accuracy (the fraction of the data where the highest predicted probability was on the correct next word, out of 10000 possibilities) on a test set of 1e5 posts from different (non-training) authors. Our model is a 256 node LSTM on a vocabulary of 10,000 words. The input and output embeddings for each word were of dimension 192, and co-trained with the model; there are 4,950,544 parameters in all. We used an unroll of 10 words.

大规模LSTM试验。我们在大规模下一个词的预测任务中进行试验,以证明我们方法在真实世界问题中的有效性。我们的训练数据集包含大型社交网络中的1000万帖子。我们将帖子按照作者进行分组,总计超过了50万客户机。这个数据集是用户移动设备上给出的文字输入数据的真实代理任务。我们将每个客户机数据集限制到最多5000个词语,并在不同作者的1e5个帖子的测试集上给出准确率。我们的模型是一个256个节点的LSTM模型,字符集包含10000个词语。对每个词语的输入和输出的嵌入都是192维的,并与模型进行联合训练;总计有4950544个参数。我们使用10个词语的展开。

These experiments required significant computational resources and so we did not explore hyper-parameters as thoroughly: all runs trained on 200 clients per round; FedAvg used B = 8 and E = 1. We explored a variety of learning rates for FedAvg and the baseline FedSGD. Figure 5 shows monotonic learning curves for the best learning rates. FedSGD with η = 18.0 required 820 rounds to reach 10.5% accuracy, while FedAvg with η = 9.0 reached an accuracy of 10.5% in only 35 communication rounds (23× fewer then FedSGD). We observed lower variance in test accuracy for FedAvg, see Figure 10 in Appendix A. This figure also include results for E = 5, which performed slightly worse than E = 1.

这个试验需要很多计算资源,所以我们没有非常彻底的尝试超参数:每一轮都是在200个客户机上进行训练;FedAvg使用B=8和E=1。我们对FedAvg和基准FedSGD探索了很多学习速率。图5给出了最佳学习速率的单调学习曲线。η=18.0的FedSGD需要820轮以达到10.5%的准确率,而η=9.0的FedAvg只需要35次通信就达到了10.5%的准确率(比FedSGD少了23倍)。我们观察到FedAvg的测试准确率的方法更小,见图10。这幅图也包括了E=5的结果,比E=1的情况略微差一点。

4 Conclusions and Future Work

Our experiments show that federated learning can be made practical, as FedAvg trains high-quality models using relatively few rounds of communication, as demonstrated by results on a variety of model architectures: a multi-layer perceptron, two different convolutional NNs, a two-layer character LSTM, and a large-scale word-level LSTM. While federated learning offers many practical privacy benefits, providing stronger guarantees via differential privacy [14, 13, 1], secure multi-party computation [18], or their combination is an interesting direction for future work. Note that both classes of techniques apply most naturally to synchronous algorithms like FedAvg.

我们的试验表明,联合学习是可以达到实用的水平的,因为FedAvg使用相对较少的通信次数就可以得到高质量的模型,我们在很多模型架构中都得到了这样的结果:多层感知机,两种不同的CNNs,一个两层字符LSTM,和一个大规模词语级LSTM。联合学习有很多实际的隐私好处,通过差分隐私、安全多方计算给出了更强的保证,这两种方法的结合,也是未来的工作方向。注意两类技术可以很自然的应用到同步算法中,如FedAvg。