Amir Yazdanbakhsh et al., Google Research
The looming end of Moore’s Law and the growing use of deep learning drive the design of custom accelerators that are optimized for specific neural architectures. Architecture exploration for such accelerators forms a challenging constrained optimization problem over a complex, high-dimensional, and structured input space with a costly-to-evaluate objective function. Existing approaches for accelerator design are sample-inefficient and do not transfer knowledge between related optimization tasks with different design constraints, such as area and/or latency budget, or neural architecture configurations. In this work, we propose a transferable architecture exploration framework, dubbed APOLLO, that leverages recent advances in black-box function optimization for sample-efficient accelerator design. We use this framework to optimize accelerator configurations of a diverse set of neural architectures with alternative design constraints. We show that our framework finds high-reward design configurations (up to 24.6% speedup) more sample-efficiently than a baseline black-box optimization approach. We further show that by transferring knowledge between target architectures with different design constraints, APOLLO is able to find optimal configurations faster and often with better objective value (up to 25% improvement). This encouraging outcome portrays a promising path forward to facilitate generating higher-quality accelerators.
The ubiquity of customized accelerators demands efficient architecture exploration approaches, especially for the design of neural network accelerators. However, optimizing the parameters of accelerators is a daunting optimization task that generally requires expert knowledge [11, 28]. The optimization is complex because the search space is exponentially large while the objective function is a black box and costly to evaluate. Constraints imposed on parameters further complicate the identification of valid accelerator configurations. Constraints can arise from hardware limitations or when the evaluation of a configuration is impossible or too expensive [29].
To address the aforementioned challenges, we introduce a general accelerator architecture exploration framework, dubbed APOLLO, that leverages recent advances in black-box optimization to facilitate finding optimal design configurations under different design constraints. We demonstrate how leveraging optimization strategies tailored to the complex and high-dimensional space of architecture exploration yields large improvements (up to 24.6%) with a reasonably small number of evaluations (≈ 0.0004% of the search space). Finally, we present the very first study on the impact of transfer learning between architecture exploration tasks with different design constraints in further reducing the number of hardware evaluations. The following outlines the contributions of APOLLO, which make it the first transferable architecture exploration infrastructure:
- End-to-end architecture exploration framework. We introduce and develop APOLLO, an end-to-end and highly configurable framework for architecture exploration. The proposed framework tunes accelerator configurations for a target set of workloads with a relatively small number of hardware evaluations. As hardware simulations are generally time-consuming and expensive to obtain, reducing the number of these simulations not only shortens the design cycle for accelerators, but also provides an effective way to adapt the accelerator itself to various target workloads.
- Supporting various optimization strategies. APOLLO introduces and employs a variety of optimization strategies to facilitate the analysis of optimization performance in the context of architecture exploration. Our evaluation results show that evolutionary and population-based black-box optimization strategies yield the best accelerator configurations (up to 24.6% speedup) compared to a baseline black-box optimization, with only ≈ 2K hardware evaluations (≈ 0.0004% of the search space).
- Transfer learning for architecture exploration. Finally, we study and explore transfer learning between architecture exploration tasks with different design constraints showing its benefit in improving the optimization results and sample-efficiency. Our results show that transfer learning not only improves the optimization outcome (up to 25%) compared to independent exploration, but also reduces the number of hardware evaluations.
Problem definition. The objective in APOLLO (architecture exploration) is to discover a set of feasible accelerator parameters (h) for a set of workloads (w) such that a desired objective function (f), e.g. weighted average of runtime, is minimized under an optional set of user-defined constraints, such as area (α) and/or runtime budget (τ).
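Using the symbols from the problem definition, the task can be written compactly as a constrained minimization. This is a sketch of one natural formalization; the function names Area and Runtime are illustrative labels for the two budgeted quantities and are assumptions, not notation given in the text.

```latex
\min_{h} \; f(h, w)
\quad \text{s.t.} \quad
\mathrm{Area}(h) \le \alpha,
\qquad
\mathrm{Runtime}(h, w) \le \tau
```

Both constraints are optional, so either inequality may be dropped for an unconstrained or singly constrained search.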
The manifold of architecture search generally contains infeasible points [28], for example due to impractical hardware implementation for a given set of parameters or impossible mapping of workloads to an accelerator. As such, one of the main challenges for architecture exploration is to effectively sidestep these infeasible points. We present and analyze the performance of optimization strategies to reduce the number of infeasible trials in Section 3.
Neural models. We evaluate APOLLO on two variations of MobileNet [33, 15] models and five in-house neural networks with distinct accelerator resource requirements. The neural model configurations, including their target domain, number of layers, and total filter sizes are detailed in Table 1. In the multi-model study, the workload contains MobileNetV2 [33], MobileNetEdge [15], M3, M4, M5, M6, and M7.
Accelerator search space. In this work, we use an in-house and highly parameterized edge accelerator. The accelerator contains a 2D array of processing elements (PE) with multiple compute lanes and dedicated register files, each operating in single-instruction multiple-data (SIMD) style with multiply-accumulate (MAC) compute units. There are distributed local and global buffers that are shared across the compute lanes and PEs, respectively. We designed a cycle-accurate simulator that faithfully models the main microarchitectural details and enables us to perform architecture exploration. Table 2 outlines the microarchitectural parameters (e.g. compute, memory, or bandwidth) and their number of discrete values in the search space. The total number of design points explored in APOLLO is nearly 5×10^8.
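The sampled fraction quoted in the introduction follows directly from this search-space size. As a quick check, taking the ≈ 2K evaluations and the ≈ 5×10^8 design points at face value:

```latex
\frac{2 \times 10^{3}}{5 \times 10^{8}}
  = 4 \times 10^{-6}
  = 0.0004\%
```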
In APOLLO, we study and analyze the performance of the following optimization methods.
Evolutionary. Performs evolutionary search using a population of K individuals, where the genome of each individual corresponds to a sequence of discretized accelerator configurations. New individuals are generated by selecting, for each individual, two parents from the population using tournament selection, recombining their genomes with some crossover rate γ, and mutating the recombined genome with some probability µ. Following Real et al. [31], individuals are discarded from the population after a fixed number of optimization rounds (‘death by old age’) to promote exploration. In our experiments, we use the default parameters K = 100, γ = 0.1, and µ = 0.01.
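The loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the implementation used in the experiments: the tournament size, the `max_age` cutoff, the choice of uniform crossover, and the truncation to the K best individuals are all hypothetical, and `reward_fn` stands in for a hardware simulation.

```python
import random

def evolve(reward_fn, num_values, rounds=30, K=100, gamma=0.1, mu=0.01,
           max_age=5, tournament=3, seed=0):
    """Toy evolutionary search; num_values[i] = number of choices for parameter i."""
    rng = random.Random(seed)

    def select():
        # Tournament selection: best individual among a small random subset.
        contenders = rng.sample(pop, tournament)
        return max(contenders, key=lambda ind: ind["reward"])["genome"]

    # Initial population of K random genomes.
    pop = []
    for _ in range(K):
        g = [rng.randrange(n) for n in num_values]
        pop.append({"genome": g, "reward": reward_fn(g), "age": 0})
    best = max(pop, key=lambda ind: ind["reward"])

    for _ in range(rounds):
        children = []
        for _ in range(K):
            p1, p2 = select(), select()
            child = list(p1)
            if rng.random() < gamma:
                # Assumed recombination: uniform crossover of the two parents.
                child = [rng.choice(pair) for pair in zip(p1, p2)]
            # Mutate each gene independently with probability mu.
            child = [rng.randrange(n) if rng.random() < mu else gene
                     for gene, n in zip(child, num_values)]
            children.append({"genome": child, "reward": reward_fn(child), "age": 0})
        # 'Death by old age': drop individuals that have survived max_age rounds.
        for ind in pop:
            ind["age"] += 1
        survivors = [ind for ind in pop if ind["age"] < max_age]
        pop = sorted(survivors + children,
                     key=lambda ind: ind["reward"], reverse=True)[:K]
        best = max([best] + pop, key=lambda ind: ind["reward"])
    return best
```

For example, `evolve(lambda g: sum(g), [4, 4, 4], rounds=5, K=20)` searches a 64-point toy space and returns the best individual seen across all rounds.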
Model-Based Optimization (MBO). Performs model-based optimization with automatic model selection following [2]. At each optimization round, a set of candidate regression models is fit on the data acquired so far, and their hyper-parameters are optimized by randomized search and five-fold cross-validation. Models with a cross-validation score above a certain threshold are ensembled to define an acquisition function. The acquisition function is optimized by evolutionary search, and the proposed accelerator configurations with the highest acquisition function values are used for the next objective function evaluation.
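A minimal sketch of the model-selection step, assuming nearest-neighbour regressors as the candidate models, negative mean squared error as the cross-validation score, and the ensemble mean as the acquisition function. All three choices are illustrative assumptions; the randomized hyper-parameter search and the evolutionary optimization of the acquisition are omitted.

```python
from statistics import mean

def knn_model(k):
    """Return a fit function for a k-nearest-neighbour regressor (hypothetical candidate)."""
    def fit(X, y):
        def predict(x):
            order = sorted(range(len(X)),
                           key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], x)))
            return mean(y[i] for i in order[:k])
        return predict
    return fit

def cv_score(fit, X, y, folds=5):
    """Negative mean squared error under k-fold cross-validation (higher is better)."""
    errors = []
    for f in range(folds):
        train = [i for i in range(len(X)) if i % folds != f]
        test = [i for i in range(len(X)) if i % folds == f]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        errors += [(predict(X[i]) - y[i]) ** 2 for i in test]
    return -mean(errors)

def build_acquisition(X, y, candidates, threshold):
    """Ensemble the candidates whose CV score clears the threshold."""
    kept = [fit for fit in candidates if cv_score(fit, X, y) >= threshold]
    models = [fit(X, y) for fit in kept]
    if not models:
        return None  # no model is trustworthy enough yet
    # Assumed acquisition: mean prediction of the selected models.
    return lambda x: mean(m(x) for m in models)
```

The threshold acts as a gate: early in a run, when little data is available, few or no models pass, and the caller would need a fallback such as random proposals.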
Population-Based black-box optimization (P3BO). Uses an ensemble of optimization methods, including Evolutionary and MBO, which has recently been shown to increase sample-efficiency and robustness [3]. Acquired data are exchanged between optimization methods in the ensemble, and optimizers are weighted by their past performance to generate new accelerator configurations. In our experiments, we use Adaptive-P3BO, an extension of P3BO that further optimizes the hyper-parameters of the optimizers using evolutionary search.
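To illustrate the idea of weighting optimizers by past performance, the sketch below splits a batch of proposals across optimizers in proportion to a softmax over their scores. The softmax rule, the `temperature` parameter, and the score dictionary are assumptions made for illustration; the text does not specify how P3BO computes its weights.

```python
import math
import random

def allocate_proposals(scores, batch_size, temperature=1.0, seed=0):
    """Split a batch across optimizers, favouring those with higher past scores."""
    rng = random.Random(seed)
    names = list(scores)
    top = max(scores.values())
    # Softmax weights (shifted by the max score for numerical stability).
    weights = [math.exp((scores[n] - top) / temperature) for n in names]
    picks = rng.choices(names, weights=weights, k=batch_size)
    return {n: picks.count(n) for n in names}
```

Each optimizer would then propose its allotted number of configurations, and all resulting evaluations would be shared with every optimizer in the ensemble.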
Random. Samples accelerator configurations uniformly at random from the defined search space.
Vizier. An alternative approach to MBO based on Bayesian optimization with a Gaussian process regressor and the expected improvement acquisition function, which is optimized by gradient-free hill-climbing [14]. Categorical variables are one-hot encoded.
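For reference, the expected improvement acquisition has a standard closed form under a Gaussian process posterior with mean μ(x) and standard deviation σ(x). The version below is the textbook form for maximization, where f⁺ is the best reward observed so far; whether Vizier uses exactly this variant is not stated in the text.

```latex
\mathrm{EI}(x)
  = \mathbb{E}\big[\max\big(f(x) - f^{+},\, 0\big)\big]
  = \big(\mu(x) - f^{+}\big)\,\Phi(z) + \sigma(x)\,\varphi(z),
\qquad
z = \frac{\mu(x) - f^{+}}{\sigma(x)}
```

Here Φ and φ are the standard normal cumulative distribution and density functions, and EI(x) = 0 when σ(x) = 0.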
We use the Google Vizier framework [14] with the optimization strategies described above for performing our experiments. We use the default hyper-parameters of all strategies [14, 3]. Each optimization strategy is allowed to propose 4096 trials per experiment. We repeat each experiment five times with different random seeds and set the reward of infeasible trials to zero. To parallelize hardware simulations, we use 256 CPU cores, each handling one hardware simulation at a time. We further run each optimization experiment asynchronously with 16 workers that can evaluate up to 16 trials in parallel.
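The evaluation setup can be approximated with a small worker pool. This sketch evaluates one batch synchronously with 16 threads, which is a simplification of the asynchronous setup described above; `simulate` is a hypothetical stand-in for the hardware simulator and is assumed to return `None` for infeasible configurations.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_batch(configs, simulate, max_workers=16):
    """Evaluate configurations in parallel; infeasible trials get reward 0."""
    def reward(cfg):
        result = simulate(cfg)
        # Infeasible trials (signalled here by None) are assigned zero reward.
        return 0.0 if result is None else result

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the input configurations.
        return list(pool.map(reward, configs))
```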
Single model architecture search. For the first experiment, we define the optimization problem as maximizing throughput per area (i.e. 1/latency × 1/area) for each neural model without defining any design constraints. Figure 1 depicts the cumulative reward across various numbers of trials. Compared to Vizier, Evolutionary and P3BO improve the throughput per area by 4.3% on average (up to 12.2% for MobileNetV2). In addition, both Evolutionary and P3BO yield lower variance across multiple runs, suggesting that they are more robust optimization methods for architecture search.
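The single-model objective is simple enough to state directly in code. The sketch below assumes latency and area are positive numbers in consistent units and treats missing or non-positive values as infeasible; that convention is an assumption for illustration.

```python
def throughput_per_area(latency, area):
    """Reward = (1 / latency) * (1 / area); zero for infeasible inputs."""
    if latency is None or area is None or latency <= 0 or area <= 0:
        return 0.0
    return (1.0 / latency) * (1.0 / area)
```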
Multi-model architecture search. For multi-model architecture search, we define the optimization as maximizing geomean(speedup) across all the evaluated models (see Section 2) while imposing area budget constraints of 6.8 mm^2, 5.8 mm^2, and 4.8 mm^2. Note that, as the area budget becomes stricter, the number of infeasible trials increases. The baseline runtime numbers are obtained from a productionized edge accelerator. Figure 2 shows the cumulative reward (i.e. geomean(speedup)) across various numbers of sampled trials. Among the studied optimization strategies, P3BO delivers the highest improvements across all the design constraints. Compared to Vizier, P3BO improves the speedup by 6.2%, 16.6%, and 24.6% for area budgets of 6.8 mm^2, 5.8 mm^2, and 4.8 mm^2, respectively. These results demonstrate that as the design space becomes more constrained (i.e. contains more infeasible points), the improvement by P3BO increases, indicating that it navigates the search space more effectively.
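A minimal sketch of the multi-model reward, assuming per-model runtimes are available for both the baseline and the candidate design, and that a design exceeding the area budget is treated as infeasible with zero reward. The argument names are hypothetical.

```python
import math

def geomean_speedup(baseline_runtimes, runtimes, area, area_budget):
    """Geometric mean of per-model speedups, or 0 if the area budget is violated."""
    if area > area_budget:
        return 0.0  # infeasible under the area constraint
    speedups = [b / r for b, r in zip(baseline_runtimes, runtimes)]
    # Geometric mean computed in log space for numerical stability.
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))
```

The geometric mean keeps a large speedup on one model from masking a slowdown on another, which suits a workload of several models with different resource requirements.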
Analysis of infeasible trials. To better understand the effectiveness of each optimization strategy in selecting feasible trials and unique trials, we define two metrics: feasibility ratio and uniqueness ratio, respectively. The feasibility (uniqueness) ratio is the fraction of feasible (unique) trials over the total number of sampled trials. Higher ratios generally indicate improved exploration of feasible regions. Table 3 summarizes the feasibility and uniqueness ratio of each optimization strategy for an area budget of 6.8 mm^2, averaged over multiple optimization runs. MBO yields the highest average feasibility ratio of ≈ 0.803, while Random shows the lowest ratio of ≈ 0.009. Although MBO features a high feasibility ratio, it underperforms other optimization strategies in finding high-performing accelerator configurations. The key reason for this behavior is MBO's low uniqueness ratio (0.236): it proposes far fewer distinct accelerator configurations than the other optimization strategies.
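Both ratios are straightforward to compute from a trial log. The sketch below assumes each trial is recorded as a hashable configuration tuple paired with a feasibility flag; that record layout is an assumption.

```python
def trial_ratios(trials):
    """Return (feasibility_ratio, uniqueness_ratio) for a list of (config, feasible) pairs."""
    total = len(trials)
    if total == 0:
        return 0.0, 0.0
    feasible = sum(1 for _, ok in trials if ok)
    unique = len({config for config, _ in trials})
    return feasible / total, unique / total
```

An optimizer that keeps re-proposing the same feasible point scores high on feasibility but low on uniqueness, which matches the MBO behavior described above.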
Diversity of architecture configurations. A desired property of optimizers is to find not just a single configuration but a diverse set of high-reward architecture configurations that can be tested downstream. We assessed the ability of optimizers to find diverse configurations qualitatively by visualizing the 50 best unique trials found by each method using t-SNE. Figure 3a shows that Evolutionary and P3BO find both higher-reward and more diverse configurations compared to alternative methods, with the exception of Random. This finding is supported quantitatively by Figure 3b, which shows the mean pairwise Euclidean distance of configurations with a reward above the 75th percentile of the maximum reward. The mean pairwise distance of Random is zero since it did not find any configurations with a reward above the 75th percentile. To further visualize the search space in architecture exploration, Figure 4 shows the t-SNE visualization of all trials proposed by the Evolutionary method for an area budget of 4.8 mm^2. This figure shows the large number of infeasible trials in the space and the proximity of low- and high-performing trials, which makes identifying high-performing trials challenging.
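The diversity metric can be sketched as follows, assuming configurations are encoded as equal-length numeric tuples. Returning zero when fewer than two configurations qualify mirrors the zero reported for Random, though that convention is an assumption here.

```python
import math
from itertools import combinations

def mean_pairwise_distance(configs):
    """Mean Euclidean distance over all unordered pairs of configurations."""
    pairs = list(combinations(configs, 2))
    if not pairs:
        return 0.0  # fewer than two configurations: no pairs to compare
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)
```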
Transfer learning between optimizations with different constraints. We analyze the effect of transfer learning between architecture search tasks with different area budgets. To create the source tasks, we select 100 unique trials from optimization studies with an area budget constraint of 6.8 mm^2 (see Fig. 2a) under two criteria. First, the area consumption of the selected trials must satisfy the area budget (4.8 mm^2) of the target task. Second, the objective function value (reward) of the selected trials must be below a predefined threshold. In our experiments, we create two source tasks with objective-value thresholds of 0.8 and 0.4, respectively, which we chose to better understand the impact of low- and high-value rewards. We use the selected trials to seed the optimization of the target task, which has an area budget of 4.8 mm^2. Figure 5 shows the results. All the optimization strategies find high-reward trials in fewer steps with transfer learning than without. The improvement is most pronounced for Vizier, which finds trials with a reward of ≈ 1.0 with transfer learning compared to only ≈ 0.8 without transfer learning. This suggests that Vizier uses the selected trials from the source task more efficiently than Evolutionary and P3BO for optimizing the target task.
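The two selection criteria translate into a simple filter over the source study. In this sketch the trial records are dictionaries with hypothetical `config`, `area`, and `reward` keys, and trials are taken in log order until 100 are collected; the text does not say how the 100 trials were chosen among those that qualify.

```python
def select_source_trials(trials, target_area_budget, reward_threshold, count=100):
    """Pick unique source trials that fit the target area budget and stay below the reward threshold."""
    seen, selected = set(), []
    for trial in trials:
        fits_budget = trial["area"] <= target_area_budget
        below_threshold = trial["reward"] < reward_threshold
        if fits_budget and below_threshold and trial["config"] not in seen:
            seen.add(trial["config"])
            selected.append(trial)
            if len(selected) == count:
                break
    return selected
```

The selected trials can then seed the target optimization, for instance as the initial population of an evolutionary search.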
In our implementation, Evolutionary and P3BO simply use the 100 unique and feasible trials from the source task to initialize the population of evolutionary search. In contrast, Vizier uses a more advanced transfer learning approach based on a stack of Gaussian process regressors (see Section 3.3 of Golovin et al. [14]), which may account for the performance improvement. We leave extending Evolutionary and P3BO with more advanced transfer learning approaches as future work.
Comparison to exhaustive exploration. To understand the optimal design point, we perform a semi-exhaustive search within the search space. Since the search space has almost 5×10^8 design points, it is simply not practical to perform a fully exhaustive search. As such, we manually prune the search space using domain knowledge so that the design points fall within a typical edge accelerator configuration (e.g. total memory size within 4–16 MB, total number of PEs within 2–16, etc.). Additionally, we perform a cheaper area estimation to reject design points before performing expensive cycle-level simulations. Using this pruning approach, we reduce the size of the search space to around 3K samples. We observe that P3BO can reach the best configurations found by the semi-exhaustive search with 1.36× fewer evaluations. Another interesting observation is that for the multi-model experiment targeting 6.8 mm^2, P3BO actually finds a design slightly better than the best design in the 3K-sample semi-exhaustive search space. We observe that this design uses a very small memory size (3 MB) in favor of more compute units. It exploits the compute-intensive nature of vision workloads, and it was not included in the semi-exhaustive search space, which was pruned to memory sizes of 4–16 MB. This demonstrates the need for manual search space engineering in semi-exhaustive approaches, whereas learning-based optimization methods can leverage large search spaces, reducing the manual effort.
While inspired by related work, APOLLO is fundamentally different from classic methodologies in design space exploration: (1) we develop a platform to compare the effectiveness of a wide range of optimization algorithms; and (2) we are the first work, to the best of our knowledge, that leverages transfer learning between architecture exploration tasks with different design constraints showing how transfer learning slashes the time for discovering new accelerator configurations. Related work to APOLLO embodies three broad research categories of black-box optimization, architecture exploration, and transfer learning. Below, we overview the most relevant work in these categories.
Black-box optimization. Black-box optimization has been broadly applied across different domains and appeared under various optimization categories, including Bayesian [37, 3, 24, 34, 42, 36, 6, 8], evolutionary [1, 39, 20], derivative-free [23, 32, 12], and bandit [7, 25, 38, 13]. APOLLO benefits from advances in black-box optimization and establishes a basis for leveraging this broad range of optimization algorithms in the context of accelerator design. In this work, we extensively studied the effectiveness of some of these black-box optimization algorithms, namely random search [14], Bayesian optimization [14], evolutionary algorithms [3], and ensemble methods [3] in discovering optimal accelerator configurations under different design objectives and constraints.
Design space exploration. Design space exploration in computer systems has always been an active research area and has become even more crucial due to the surge of specialized hardware [30, 18, 40, 28, 10, 21, 5, 4]. Hierarchical-PABO [30] and FlexiBO [18] use multi-objective Bayesian optimization for neural network accelerator design. In order to reduce the use of computational resources, Sun et al. [40] apply a genetic algorithm to design CNN models without modifying the underlying architecture. HyperMapper [28] uses a random forest in the automatic tuning of hardware accelerator parameters in a multi-objective setting. HyperMapper optionally uses continuous distributions to model the search space variables as a means to inject prior knowledge into the search space.
Transfer learning. Transfer learning exploits the knowledge acquired in some tasks to solve similar unexplored problems more efficiently, e.g. by consuming fewer data samples and/or outperforming previous solutions. Transfer learning has been explored extensively and applied to various domains [27, 44, 43, 17, 19, 9, 35, 26, 22, 41]. Because hardware evaluations are expensive, transfer learning seems to be a practical mechanism for architecture exploration. However, using transfer learning for architecture exploration and accelerator design remains relatively unexplored territory. APOLLO is one of the first methods to bridge this gap between transfer learning and architecture exploration.
In this paper, we propose APOLLO, a framework for sample-efficient architecture exploration for large scale design spaces. The benefits of APOLLO are most noticeable when architecture configurations are costly to evaluate, which is a common trait in various architecture optimization problems. Our framework also facilitates the design of new accelerators with different design constraints by leveraging transfer learning. Our results indicate that transfer learning is effective in improving the target architecture exploration, especially when the optimization constraints have tighter bounds. Finally, we show that the evolutionary algorithms used in this work yield more diverse accelerator designs compared to other studied optimization algorithms, which can potentially discover overlooked architectures. Architecture exploration is just one use case in the accelerator design process that is bolstered by APOLLO. The evolution of accelerator architectures mandates broadening the scope of optimizations to the entire computing stack, including scheduling and mapping, that potentially yields higher benefits at the cost of handling more complex optimization problems. We argue that such co-evolution between the cascaded layers of the computing stack is inevitable in designing efficient accelerators honed for a diverse category of applications. This is an exciting path forward for future research directions.