Edge AI accelerators have been emerging as a solution for applications deployed close to customers, such as unmanned aerial vehicles (UAVs), image recognition sensors, wearable devices, robotics, and remote sensing satellites. These applications must meet performance targets and resilience constraints despite the limited device area and the hostile environments in which they operate. Numerous research articles have proposed edge AI accelerators for these applications, but not all of them include full specifications. Most of them compare their architecture only with existing CPUs, GPUs, or other reference research, so their performance expositions are not comprehensive. Thus, this work lists the essential specifications of prior-art edge AI accelerators and CGRA accelerators from the past few years in order to define and evaluate low-power, ultra-small edge AI accelerators.

To improve the performance of SRAM in caches under near-threshold voltages, several timing-speculation techniques, such as cross-sensing SRAM (CS-SRAM), have been proposed. For a given process, voltage, and temperature (PVT) condition, CS-SRAM has an optimal bitline discharging time (TBL) that yields the lowest average access latency. However, existing timing-speculation caches do not track variations across PVT conditions to adjust the access timing to this optimal TBL point, at which the system achieves the lowest average memory access time. In this article, we propose a design of CS-SRAM-based L1 caches with a PVT auto-tracking mechanism, namely TS-PULP, which adjusts both the TBL and the system clock frequency to their optimal points. To quantify the improvement of our approach, a cycle-accurate RTL model of CS-SRAM and a field-programmable gate array (FPGA) prototype of the proposed L1 caches on the open-source system-on-chip (SoC) platform PULP have also been implemented. In addition, we introduce a figure of merit (FOM) of million instructions per second (MIPS), area, and energy (MAE) to evaluate different approaches comprehensively. According to the evaluation results from RTL simulations and the FPGA prototype, the proposed caches achieve over 80% of the performance of the original standard cell memory (SCM)-based PULP design under TSMC 28 nm at 0.5 V and 25 °C with only about 30% of the chip area. The proposed scheme, TS-PULP, achieves the best MAE FOM among four architectures (PULP with SCM, PULP with SRAM, TS-cache, and TS-PULP) across different cache sizes.

Endpoint devices for the Internet of Things not only need to work under an extremely tight power envelope of a few milliwatts, but also need to be flexible in their computing capabilities, from a few kOPS to GOPS. Near-threshold (NT) operation can achieve higher energy efficiency, and performance scalability can be gained through parallelism. In this paper, we describe the design of an open-source RISC-V processor core specifically designed for NT operation in tightly coupled multi-core clusters. We introduce instruction extensions and microarchitectural optimizations to increase computational density and to minimize pressure on the shared memory hierarchy. For typical data-intensive sensor-processing workloads, the proposed core is on average 3.5x faster and 3.2x more energy-efficient, thanks to a smart L0 buffer that reduces cache access contentions and support for compressed instructions. SIMD extensions, such as dot products, and built-in L0 storage further reduce shared memory accesses by 8x, reducing contentions by 3.2x. With four NT-optimized cores, the cluster is operational from 0.6 V to 1.2 V, achieving a peak efficiency of 67 MOPS/mW in a low-cost 65 nm bulk CMOS technology. In a low-power 28 nm FDSOI process, a peak efficiency of 193 MOPS/mW (40 MHz, 1 mW) can be achieved.
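The MAE figure of merit mentioned above combines throughput (MIPS), silicon area, and energy into a single score. The excerpt does not give the exact formula, so the sketch below assumes a common form, performance divided by the area-energy product; all design names and numbers are hypothetical:

```python
# Hypothetical MAE-style figure of merit: higher MIPS and lower area or
# energy give a better score. FOM = MIPS / (area * energy) is an
# assumption for illustration; the excerpt does not define the formula.
def fom_mae(mips, area_mm2, energy_mj):
    return mips / (area_mm2 * energy_mj)

# Illustrative candidate designs (the numbers are made up, not measured).
designs = {
    "PULP-SCM":  dict(mips=120.0, area_mm2=1.00, energy_mj=2.0),
    "PULP-SRAM": dict(mips=90.0,  area_mm2=0.45, energy_mj=2.4),
    "TS-cache":  dict(mips=100.0, area_mm2=0.40, energy_mj=2.2),
    "TS-PULP":   dict(mips=105.0, area_mm2=0.32, energy_mj=2.1),
}

# Rank the candidates by the assumed FOM.
best = max(designs, key=lambda name: fom_mae(**designs[name]))
print(best)
```

With these invented numbers the smaller-area design wins, which mirrors the intent of the metric: a design trading a little peak performance for large area and energy savings can still score highest.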
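The optimal-TBL trade-off described for CS-SRAM can be pictured with a toy latency model: a shorter bitline discharging time makes each speculative access faster, but the probability of a misread (forcing a replay with a penalty) rises steeply. The logistic failure curve below is invented purely for illustration; the key point is that its position shifts with PVT, so a fixed TBL cannot stay optimal:

```python
import math

def avg_access_time(t_bl, penalty=5.0, t50=1.0, k=10.0):
    """Toy model of mean access latency (ns) at bitline time t_bl (ns).
    The speculation-failure probability is an invented logistic curve:
    short t_bl -> likely misread -> pay the replay penalty."""
    p_fail = 1.0 / (1.0 + math.exp(k * (t_bl - t50)))
    return t_bl + p_fail * penalty

def optimal_tbl(t50):
    # Sweep candidate t_bl values; keep the one minimizing mean latency.
    grid = [i * 0.01 for i in range(1, 301)]
    return min(grid, key=lambda t: avg_access_time(t, t50=t50))

# A PVT shift moves the failure curve (larger t50 at lower voltage or
# higher temperature), so the optimal t_bl moves too -- which is the
# motivation for auto-tracking it at run time.
print(optimal_tbl(t50=1.0), optimal_tbl(t50=1.5))
```

In this model the minimum sits somewhat above the 50%-failure point, and it migrates as the curve shifts; an auto-tracking mechanism in the spirit of TS-PULP would re-locate that minimum under each operating condition.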
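The reduction in shared-memory pressure from SIMD dot products comes from operand packing: one 32-bit load brings in four 8-bit elements, and a single dot-product instruction consumes two packed words. The sketch below only counts loads to show the ratio; the packing scheme and counts are illustrative, not the actual ISA of the core:

```python
def scalar_dot(a, b):
    """Byte-wise dot product: one load per 8-bit element of each vector."""
    loads = 2 * len(a)
    acc = sum(x * y for x, y in zip(a, b))
    return acc, loads

def packed_dot(a, b):
    """Same result via a modeled 4-way 8-bit SIMD dot-product step that
    multiply-accumulates four byte lanes of two 32-bit registers, so
    only two word loads are needed per four elements."""
    assert len(a) % 4 == 0
    loads, acc = 0, 0
    for i in range(0, len(a), 4):
        loads += 2  # one packed word fetched from each vector
        acc += sum(a[i + j] * b[i + j] for j in range(4))
    return acc, loads

a = list(range(16))
b = list(range(16, 32))
(r1, l1), (r2, l2) = scalar_dot(a, b), packed_dot(a, b)
print(r1 == r2, l1 // l2)  # prints "True 4": same result, 4x fewer loads
```

Packing alone accounts for a 4x reduction here; the larger reduction reported in the abstract additionally credits the built-in L0 storage, which reuses fetched operands across iterations instead of reloading them from shared memory.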