BROOM: An Open-Source Out-of-Order Processor With Resilient Low-Voltage Operation in 28-nm CMOS (Part 1)
Abstract:
The Berkeley resilient out-of-order machine (BROOM) is a resilient, wide-voltage-range implementation of an open-source out-of-order (OoO) RISC-V processor implemented in an ASIC flow. A 28-nm test chip contains a BOOM OoO core and a 1-MiB level-2 (L2) cache, enhanced with architectural error tolerance for low-voltage operation. It was implemented using an agile design methodology, in which the initial OoO architecture was transformed to perform well in a high-performance, low-leakage CMOS process, informed by synthesis, place, and route data obtained with a foundry-provided standard-cell library and memory compiler. The productivity of the two-person team was improved in part thanks to a number of open-source artifacts: the Chisel hardware construction language, the RISC-V instruction set architecture, the Rocket-chip SoC generator, and the open-source BOOM core. The resulting chip, taped out using TSMC's 28-nm HPM process, runs at 1.0 GHz at 0.9 V and is able to operate down to 0.47 V.
Key Words: Open source software, Random access memory, Design methodology, CMOS process, Generators, Voltage control, Agile software development
RISC-V is an open-source instruction set architecture (ISA) that is gaining wide attention. There are several open-source and commercial in-order cores that implement the RISC-V ISA; however, there is a need for high-performance cores. BOOM is a synthesizable, parameterized, superscalar out-of-order (OoO) RISC-V core that was originally designed to serve as the prototypical baseline processor for future microarchitectural studies of OoO processors. Its original goal was to provide a readable, open-source implementation for use in education, research, and industry, and it had been evaluated using educational standard-cell libraries. The Berkeley Resilient OoO Machine (BROOM) contains an evolved version of BOOM: the core has been transformed to explore the design space in a process representative of high-performance mobile applications. It was designed in an ASIC flow, which enabled rapid evaluation of changes to the RTL and physical design to improve the performance of the processor. Figure 1 shows the block diagram of the BROOM processor. BROOM consists of a single BOOM core and a 1-MiB L2 cache, each in its own clock and voltage domain.

An additional feature of the test chip is a set of architectural resiliency techniques that let the cache operate over a wide voltage range, enabling the processor to run with high efficiency at low voltages.
BROOM was implemented using LVT-based standard cells and a foundry-provided memory compiler. The entire chip measures less than 2 mm × 3 mm and is composed of 72 million transistors. It contains 417 000 standard cells and 73 SRAM macros, of which the core and L1 caches account for 310 000 cells and 20 SRAM macros. The final sign-off in the slow-slow corner was at 1.68 ns. Figure 1 shows the placed-and-routed chip plot.
SECTION 2
LEVERAGING OPEN-SOURCE INFRASTRUCTURE
BOOM implements the open-source RISC-V ISA, which was designed from the ground up to enable technology-driven computer architecture research. The clean and simple design of RISC-V allows designers to focus on the processor without being weighed down by awkward instructions that demand undue attention, or by the extra effort of managing software ports.
BOOM is written in Chisel, an open-source hardware construction language developed to enable advanced hardware design. Chisel allows designers to use concepts such as object orientation, functional programming, parameterized types, and type inference, which make it easier to implement highly parameterized hardware generators. However, Chisel is not a high-level synthesis language: the primitives it provides are, for example, registers, wires, and memories. One of Chisel’s strengths is its focus on generating well-formed, synthesizable Verilog, which decreased design risk. Chisel also brings software-development-level productivity to RTL coding and encourages focusing implementation effort on writing generators rather than a single design instance. For example, the open-source RISC-V Rocket-chip generator presents a template for designing systems-on-a-chip (SoCs). Rocket-chip supports coherent multilevel caches and standard interconnects. BOOM makes significant use of Rocket-chip as a library: the caches, the uncore, and the functional units all derive from Rocket. In total, over 11 500 lines of code (LOC) are instantiated by BOOM from the Rocket-chip repository.
SECTION 3
BOOM CORE
The initial BOOM architecture is inspired by the MIPS R10K and Alpha 21264 processors from the 1990s, whose design teams provided relatively detailed insight into their processors’ microarchitectures.[1][2] However, both processors relied on custom, dynamic logic, which allowed them to achieve very high clock frequencies despite their very short pipelines. The seven-stage Alpha 21264 has 15 fanout-of-four (FO4) inverter delays per stage. As a comparison, the synthesizable Tensilica Xtensa processor, fabricated in a 0.25-μm ASIC process and contemporary with the Alpha 21264, was estimated to have roughly 44 FO4 delays.[3]
As BOOM is a synthesizable processor, we must rely on microarchitecture-level techniques to address critical paths, adding pipeline stages to trade off instructions per cycle (IPC), cycle time (frequency), and design complexity. Moreover, as process nodes have become smaller, transistor leakage and variability have increased, and power-efficiency constraints have tightened, many of the more aggressive custom techniques have become more difficult and expensive to apply.[4] Modern high-performance processors have largely limited their custom design efforts to more regular structures such as memories and register files.
We began our design efforts with BOOMv1, a version of BOOM whose implementation was informed by educational technology libraries and CACTI cache models. BOOMv1 follows the 6-stage pipeline structure of the MIPS R10K: fetch, decode/rename, issue/register-read, execute, memory, and writeback. For design simplicity, all micro-ops (uops) are placed into a single unified issue window. Likewise, all physical registers (both integer and floating-point) are located in a single unified physical register file. BOOMv1 also used a short 2-stage front-end pipeline. Conditional branch prediction occurs after the branches have been decoded.
The design of BOOMv1 was partly informed by educational technology libraries used in conjunction with synthesis-only tools. BOOMv1 used CACTI [5] to analytically model the characteristics of memories; CACTI is oriented toward single-port, cache-sized SRAMs. However, BOOM makes use of a multitude of smaller, irregular SRAMs for modules such as branch predictor (BPD) tables and address target buffers. Figure 2 lists all of the SRAM macros used within the BOOM core.

Final BOOM core configuration used in the BROOM chip, along with the configuration of each SRAM macro used within the BOOM core.
Timing analysis of BOOMv1 in TSMC 28-nm HPM identified the following critical paths:
issue window select;
register rename busy-table read;
conditional BPD redirect;
register file read.
The last path (register file read) only showed up as critical during post-place-and-route analysis.
SECTION 4
BOOMv2: IMPROVING BOOM’S QUALITY-OF-RESULTS
BOOMv2 is an update to BOOMv1 based on information collected through synthesis, place, and route in a commercial TSMC 28-nm process. We performed the design space exploration using the standard single- and dual-ported memory compilers provided by the foundry, and by hand-crafting a standard-cell-based multiported register file.
The migration to BOOMv2 involved 4948 added and 2377 deleted LOC out of a total code base of 16 000 LOC. The following sections describe some of the major changes that make up the BOOMv2 update.
4.1 Frontend (Instruction Fetch)
Processor performance is best when the frontend provides an uninterrupted stream of instructions. This requires the frontend to use branch prediction techniques to predict which path the instruction stream will take, long before the branch can be properly resolved. A number of different predictors are used, each trading off accuracy, area, critical path cost, and pipeline penalty when making a prediction.
The Branch Target Buffer (BTB) maintains a set of tables mapping instruction addresses to branch targets. Some hysteresis bits are used to help guide the taken/not-taken decision of the BTB in the case of a tag hit. The BTB is a very expensive structure, since each BTB entry contains both a tag and a target. The BTB also contains a return address stack for predicting function returns.
To shorten a critical path and increase capacity, we replaced BOOMv1’s fully tagged, fully associative BTB design with a partially tagged, set-associative BTB. We also implemented the new BTB using single-ported SRAM macros instead of flip-flops.
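As a rough illustration of what "partially tagged, set-associative" means, the following Python sketch models such a BTB behaviorally. It is not BOOM's implementation: the class name, the table sizes, the 4-byte instruction alignment, and the most-recently-used replacement policy are all assumptions made for the example.

```python
# Behavioral sketch of a partially tagged, set-associative BTB.
# Sizes, alignment, and replacement policy are illustrative assumptions.

class PartialTagBTB:
    def __init__(self, num_sets=64, num_ways=2, tag_bits=8):
        self.num_sets = num_sets
        self.num_ways = num_ways
        self.tag_mask = (1 << tag_bits) - 1
        # Each set holds up to num_ways (tag, target) pairs, newest first.
        self.sets = [[] for _ in range(num_sets)]

    def _index_and_tag(self, pc):
        word = pc >> 2                      # assumes 4-byte aligned instructions
        index = word % self.num_sets        # low bits choose the set
        # Only the low tag_bits of the remaining address bits are stored.
        tag = (word // self.num_sets) & self.tag_mask
        return index, tag

    def lookup(self, pc):
        """Return the predicted target on a (partial) tag hit, else None."""
        index, tag = self._index_and_tag(pc)
        for stored_tag, target in self.sets[index]:
            if stored_tag == tag:
                return target
        return None

    def update(self, pc, target):
        """Install or refresh an entry, evicting the oldest way if needed."""
        index, tag = self._index_and_tag(pc)
        ways = [(t, tgt) for t, tgt in self.sets[index] if t != tag]
        ways.insert(0, (tag, target))
        self.sets[index] = ways[: self.num_ways]
```

Because only part of the tag is stored, two addresses that share a set and the same low tag bits can alias and produce a false hit. That is acceptable for a predictor, since a wrong target is caught when the branch resolves, and it is the price paid for smaller entries.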
The conditional BPD maintains a set of prediction and hysteresis tables to make taken/not-taken predictions based on a look-up address. The BPD only makes taken/not-taken predictions, so it relies on some other agent to provide information on which instructions are branches and what their targets are. The BPD can either use the BTB for this information or wait and decode the instructions itself. Because the BPD does not store branch targets, it can be much denser and more accurate than the BTB.
BOOM uses a global history predictor, which tracks the outcomes of the last N branches in the program and hashes this history with the look-up address to compute an index into the prediction tables. BOOM’s predictor tables are implemented using single-ported SRAMs. Although many prediction tables are conceptually “tall and skinny” matrices (thousands of 2- or 4-bit entries), a generator written in Chisel transforms the predictor tables into a square memory structure to best match the SRAMs provided by a memory compiler.
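The indexing and reshaping described above can be sketched in a few lines of Python. The text does not specify the hash function or the table dimensions, so the XOR (gshare-style) hash, the function names, and every size below are assumptions chosen only to make the idea concrete.

```python
# Sketch of global-history indexing and table reshaping.
# The XOR hash and all sizes are assumptions for illustration.

def update_history(history, taken, history_bits):
    """Shift the newest branch outcome into an N-bit global history register."""
    mask = (1 << history_bits) - 1
    return ((history << 1) | int(taken)) & mask

def predictor_index(pc, history, index_bits):
    """Hash the look-up address with the global history to get a table index."""
    mask = (1 << index_bits) - 1
    return ((pc >> 2) ^ history) & mask

def square_layout(num_entries, entry_bits, row_bits):
    """Pack a tall, skinny table into SRAM rows that are row_bits wide."""
    entries_per_row = row_bits // entry_bits
    num_rows = -(-num_entries // entries_per_row)  # ceiling division
    return num_rows, entries_per_row

def physical_location(index, entries_per_row):
    """Map a logical entry index to (SRAM row, entry slot within that row)."""
    return divmod(index, entries_per_row)
```

For instance, under these assumptions a table of 4096 two-bit counters stored in 128-bit rows becomes 64 rows of 64 counters each, so one SRAM read returns a whole row and the low index bits pick the counter within it.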
We found a critical path in BOOMv1 to be the BPD making a prediction and redirecting the fetch instruction address, because the BPD must first decode the newly fetched instructions and compute potential branch targets before it can redirect fetch. For BOOMv2, we moved the BPD array access back a stage so that it now operates in parallel with instruction decode. The final prediction and redirection are then performed at the beginning of the following stage (see Figure 2). Moving the BPD redirection back a cycle also gave us the freedom to provide a full cycle for the hash indexing function, which removes the hashing from the critical path of Next-PC selection. However, pushing the BPD redirection back a stage comes at the cost of an extra bubble on BPD redirections.
4.2 Distributed Issue Windows
The issue window holds all in-flight, unexecuted micro-ops (uops). In BOOM, the issue window is implemented as a collapsing queue, which compresses the oldest instructions toward the top. For issue-select, a cascading priority encoder selects the oldest instruction that is ready to issue. This path gets longer as either the number of entries or the number of issue ports increases. For BOOMv1, our synthesizable implementation of a 20-entry issue window with three issue ports was found to be too aggressive, so we switched to three distributed issue windows with 16 entries each (separate windows for integer, memory, and floating-point operations). This removes issue-select from the critical path while also increasing the total number of instructions that can be scheduled. However, to maintain the performance of executing two integer ALU instructions and one memory instruction per cycle, a common configuration of BOOM uses two issue-select ports on the integer issue window.
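The collapsing queue and oldest-first select can be modeled behaviorally as below. This is a simplified sketch, not BOOM's RTL: the class and method names are hypothetical, uops are represented by unique strings, and readiness is set by an explicit wakeup call instead of operand tracking.

```python
# Behavioral sketch of a collapsing issue window with oldest-first select.
# Names are hypothetical; uops are assumed to be unique labels.

class CollapsingIssueWindow:
    def __init__(self, num_entries=16, issue_ports=1):
        self.num_entries = num_entries
        self.issue_ports = issue_ports
        self.entries = []  # index 0 is the oldest uop

    def dispatch(self, uop, ready=False):
        """Append a new uop at the young end; fail if the window is full."""
        if len(self.entries) >= self.num_entries:
            return False
        self.entries.append({"uop": uop, "ready": ready})
        return True

    def wakeup(self, uop):
        """Mark a waiting uop as ready to issue."""
        for entry in self.entries:
            if entry["uop"] == uop:
                entry["ready"] = True

    def select(self):
        """Issue up to issue_ports ready uops, oldest first, then collapse."""
        # Scanning from index 0 mimics a cascading priority encoder.
        issued = [e for e in self.entries if e["ready"]][: self.issue_ports]
        # Removing issued entries lets younger uops slide toward the top.
        self.entries = [e for e in self.entries
                        if not any(e is i for i in issued)]
        return [e["uop"] for e in issued]
```

In hardware the scan is a chain of logic whose depth grows with both the entry count and the port count, which is why splitting one large window into smaller per-type windows shortens the path.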
4.3 Custom Bit-Array Register File Design
One of the critical components of an OoO processor, and one of the most challenging to synthesize in a standard ASIC flow, is the multiported register file. BOOM’s register file required both microarchitectural adjustments and a semicustom physical design to achieve the desired performance. The design of a register file poses many challenges: reading data out of the register file is a critical path, and routing read data to functional units is a routing challenge. Both the number of registers and the number of ports further exacerbate the challenges of synthesizing the register file.
The first path to improving the register file design was purely microarchitectural. The issue-select and register-read stages were split into two separate stages, so that each now gets a full cycle to itself. The register count was lowered by splitting the unified physical register file into separate floating-point and integer register files. This split also reduces the read-port count, because the three-operand fused multiply-add floating-point unit moves to the smaller floating-point register file.
The second path to improving the register file involved physical design. A significant problem in placing and routing a register file is routing many wires into a relatively dense regfile array. BOOMv2’s 70-entry integer register file, with six read ports and three write ports, comes to 4480 bits, each needing 18 wires routed into and out of it. There is a mismatch between the area of the synthesized array and the area needed to route all the required wires, resulting in routing congestion.
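The figures quoted above can be reproduced with a short calculation. Two things in this sketch are assumptions rather than statements from the text: that registers are 64 bits wide (which is what 4480 / 70 implies), and that the 18 wires per bit break down as one data line plus one enable line for each of the nine ports.

```python
# Back-of-the-envelope check of the register file numbers in the text.
ENTRIES = 70
READ_PORTS = 6
WRITE_PORTS = 3
REGISTER_WIDTH = 64      # assumption: 64-bit registers, since 4480 / 70 = 64
WIRES_PER_PORT = 2       # assumption: one data line and one enable per port

total_bits = ENTRIES * REGISTER_WIDTH
wires_per_bit = (READ_PORTS + WRITE_PORTS) * WIRES_PER_PORT
# Each bit cell terminates wires_per_bit connections; the lines themselves
# are shared along rows and columns, so this counts pins, not distinct wires.
bit_cell_connections = total_bits * wires_per_bit

print(total_bits, wires_per_bit, bit_cell_connections)
```

Under these assumptions the array has on the order of eighty thousand bit-cell connections, which gives a sense of why the wiring, not the storage cells, determines the area.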
The register file in this design was implemented by crafting a semicustom register file bit out of foundry-provided standard cells (see Figure 3). The Chisel register file was blackboxed, and a lower level of hierarchy was manually described in structural Verilog, in which standard cells were instantiated to construct a bit-cell with its access ports. The bit-cells were preplaced, and the router then automatically routed the wires to complete the register file.

Register file bit manually crafted out of foundry-provided standard cells. Each read port provides a read-enable bit that signals a tri-state buffer to drive its port’s read data line. The register file bits are laid out in an array, with placement guidance given to the place tools. The tools are then allowed to automatically route the 18 wires into and out of each bit block.
Although the register file bits are implemented in structural Verilog, the decode logic and peripheral circuitry are implemented in Chisel. We also implemented a behavioral model of the custom array in Chisel to verify the decode logic through RTL simulation, and then performed additional verification of the custom bit-array register file in gate-level simulation.
To support the target cycle time, the register file is implemented using hierarchical bitlines: the bits are divided into clusters, tristate buffers drive the read ports inside each cluster, and muxes select the read data across clusters. This prevents the tristate buffers from having to drive each read wire across all 70 registers.
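A behavioral sketch of one read port with hierarchical bitlines follows. The cluster size is not given in the text, so the value used in the usage note is an assumption, and the function names are hypothetical.

```python
# Behavioral sketch of a hierarchical-bitline read port.
# Function names are hypothetical; the cluster size is chosen by the caller.

def hierarchical_read(registers, read_addr, cluster_size):
    """Read one register through per-cluster bitlines and a cluster mux."""
    local_bitlines = []
    for base in range(0, len(registers), cluster_size):
        cluster = registers[base:base + cluster_size]
        driven = None  # a local bitline floats unless one tristate is enabled
        for offset, value in enumerate(cluster):
            read_enable = (base + offset == read_addr)
            if read_enable:
                driven = value  # exactly one tristate drives the local line
        local_bitlines.append(driven)
    # A mux across clusters picks the local bitline that was driven.
    return local_bitlines[read_addr // cluster_size]

def drivers_per_bitline(num_registers, cluster_size):
    """Tristates sharing one wire: flat bitline versus clustered bitline."""
    return num_registers, min(cluster_size, num_registers)
```

With 70 registers and an assumed cluster size of 10, each local bitline is shared by 10 tristate buffers instead of 70, and a 7-input mux completes the read. The shorter, more lightly loaded wires are what the clustering buys.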
By contrast, the smaller floating-point register file (three read ports, two write ports) is fully synthesized with no placement guidance. Aside from the integer register file and the SRAMs, no other logic in Chisel was implemented via Verilog blackboxes.