Introduction by Rajeev Balasubramonian, University of Utah
DianNao proposes accelerators for deep learning applications, but it does near-data processing too.
PIM-enabled instructions, ISCA 2015 considers the often neglected dimension of NDP i.e. programmability.
Keynote I Automata Processing, Mircea Stan(University of Virginia)
Automata processor(AP) is a programmable silicon device capable of performing very high-speed, comprehensive search and analysis of complex, unstructured data streams. It is a massively parallel and scalable hardware architecture that implements the non-deterministic finite automata(NFA) as the core element.
A fully-functional development board is implemented. The production board is envisioned to have DDR3 Memory, PCI Express connectivity and 4 AP units. The AP has hardware building blocks for all the fundamental operations in an NFA like counter elements, state transition blocks etc. The AP can process multiple data streams simultaneously. The elements are implemented in memory with data being very close to the building blocks.
The memory is the processor!
One can program the AP by a combination of RegEx and other low-level tools. AP avoids the von Neumann bottleneck by replacing instruction fetch with hardware reconfiguration. There are many layers of parallelism: individual STE, NFA level, multiple automata, multiple streams. What the AP is not: Doesn’t have systolic arrays, no general purpose logic. The big challenges are, program changes require a reconfiguration step, new and low-level programming model(biggest hurdle), the need to buy a separate board and the resurgence of FPGA acceleration.
Scaling Deep Learning on Multiple In-Memory Processors, Dongping Zhang (AMD)
PIM architecture to accelerate deep learning applications. Two kinds of deep learning models : Deep belief networks and convolutional neural networks. This work focusses on CNNs. In implementing deep learning on PIM, two kinds of parallelism, data and model, can be exploited. The kind of parallelism exploited is different for each kind of layers in convolutional neural networks. Evaluate performance using 256×256 randomly generated images.
Dataflow based Near Data Processing using Coarse Grain Reconfigurable Logic, Charles Shelor(U. North Texas)
Dataflow based execution allows one to extract parallelism from algorithm. Delay elements allow compile time synchronization, decoupled LD/ST for enhanced memory accesses, embedded scratchpad memory for tables. Hardware pipelines to implement histogramming, word-occurrence count, FFT as single cycle operations. Energy and performance results for these 3 benchmarks, up to 13x speedup and 99% reduction in energy.