Preparing for a Post Moore’s Law World, Prof. Todd Austin, University of Michigan
Density scaling is slowing down. It now takes longer (around 36 months) to get the next generation of silicon to market. This shifts the burden to computer architects. The first remedy to a slowing Moore's law was chip multiprocessors.
The public perception of ‘how fast computers are’ is very negative. Who is to blame?
- Programmers: There are still some niche markets like Bitcoin and warehouse-scale computing which are exploiting abundant parallelism
- Educators: CS enrolment is going up.
- The transistor: Constant power scaling not possible anymore
- Architects: Amdahl’s law. You need an ever-increasing number of processors to get higher speedups. Parallelism is not the solution to bridge the performance gap.
Heterogeneous parallel systems to overcome dark silicon and the tyranny of Amdahl’s law.
Good: Hetero-parallel systems can close the Moore's law gap
Bad: The architecture community is going to have a hard time responding to this extreme challenge.
Ugly: Such systems are going to cost a lot.
Design costs are skyrocketing and this will kill the community. It takes around $120M to get a 20nm design to market. The software community needs around $500K to generate billion dollar companies.
The ultimate goal is to accelerate system-architecture innovation and make it sufficiently inexpensive that anyone can do it anywhere. Expect more from architects: fund research only if it promises a 2x speedup, not 10%. Reduce the cost of creating customized designs. Widen the applicability of customized H/W. Make hardware more open source. Reduce the cost of manufacturing customized H/W.
Session on DRAM
More is Less: Improving the Energy Efficiency of Data Movement via Opportunistic Use of Sparse Codes, Yanwei Song, University of Rochester.
A sparse representation with only a few 0s in each codeword can reduce data-movement energy. On the DDR4 I/O interface, constant current flows when transmitting a 0. DBI uses an extra bit to invert the transmission: when current flows, the line could be carrying a 1 or a 0, depending on the DBI bit. A longer codeword requires increasing the burst length, so applying a sparse code at all times significantly increases the bandwidth overhead.
When no data is coming in the next few cycles, use the sparse code; otherwise use the original encoding. The proposed simple code, MiLC, builds upon DBI by exploiting spatial correlations among adjacent bytes of data.
LPDDR3 consumes energy when toggling, so it is not optimal to just reduce the number of 0s. Level-signaling energy is proportional to the number of 0s transmitted, while transition-signaling energy is proportional to the number of bit flips.
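The two energy proxies and the DBI decision can be sketched as follows. This is a toy model, assuming level signaling burns energy on transmitted 0s (as with DDR4's constant current on 0) and transition signaling burns energy on toggles (as with LPDDR3), with the usual invert-when-majority-zeros DBI rule:

```python
def ones(x):
    """Population count of an 8-bit value."""
    return bin(x).count("1")

def dbi_encode(byte):
    """Data Bus Inversion for a level-signaled bus where 0s burn current:
    invert the byte when that reduces the number of 0s transmitted.
    Returns (encoded_byte, dbi_bit)."""
    zeros = 8 - ones(byte)
    if zeros > 4:                    # inverting yields fewer 0s on the wires
        return byte ^ 0xFF, 1
    return byte, 0

def level_energy(byte):
    """Proxy: level-signaling energy ~ number of 0s on the wires."""
    return 8 - ones(byte)

def transition_energy(prev, cur):
    """Proxy: transition-signaling energy ~ number of bit flips."""
    return ones(prev ^ cur)
```

For example, 0x01 has seven 0s, so DBI transmits 0xFE with the DBI bit set, cutting the level-signaling cost from 7 to 1.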
Improving DRAM Latency with Dynamic Asymmetric Subarray, Shih-Lien Lu, Intel
Memory latency in CPU cycles has grown and affects CPU design. DRAM is organized as banks, sub-arrays, and tiles. Device latency = I/O + peripheral + sub-array. Sub-array access contributes a large share of the delay, but reducing sub-array access latency greatly increases die area. A hybrid design is needed to reduce access latency with minimal increase in die area.
The proposed design uses migration cells between alternate rows to reduce the access latency for different types of operations. Two factors contribute to the area overhead: extra rows for migration, and reduced array efficiency due to shorter bit lines and word lines. Overall, 6% area overhead. Dynamic asymmetric subarray (DAS) DRAM delivers most of the performance benefit of fast-subarray (FS) DRAM.
Gather-Scatter DRAM: In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses, Vivek Seshadri, Carnegie Mellon University
The problem being tackled is non-unit strided accesses. Current memory systems are inefficient and cache-line optimized. Whether accesses to particular fields in a database result in unit or non-unit strided accesses depends on the database’s layout i.e. row store or column store.
The goal is to eliminate the inefficiency of current cache-line-optimized DRAM systems and fetch non-unit strided accesses without it. In traditional systems, the data of each cache line is spread across all the chips. Chip conflicts: all the fields of a database record might be mapped to the same chip. The idea is to shuffle the data from multiple cache lines before sending it to memory. But then the data might be stored in different columns in different chips, so how does the memory controller know which column to retrieve data from? Using pattern-ID-based address translation, one can retrieve any power-of-two strided access pattern.
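The chip-conflict idea can be illustrated with a toy shuffle (an XOR shuffle for illustration, not the paper's exact column-translation logic; the sizes and names here are mine):

```python
NCHIPS = 8  # toy geometry: one word-wide chip per cache-line word

def chip_of(line, field):
    """Toy data shuffle: field `field` of cache line `line` is stored in
    chip (field XOR line). With this shuffle, gathering the same field
    from NCHIPS consecutive lines touches every chip exactly once --
    no chip conflicts, unlike the unshuffled mapping where field f of
    every line sits in the same chip f."""
    return (field ^ line) % NCHIPS

# gather field 3 (e.g. one database column) from 8 consecutive records
chips = [chip_of(line, 3) for line in range(NCHIPS)]
```

Because the mapping is known to the memory controller (via the pattern ID), it can compute, per chip, which column holds the requested field.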
In-memory databases have two kinds of layout: row store and column store. The design assumes a row-store layout as the baseline and efficiently retrieves data in column-store format. Each layout suits a certain kind of workload: transactions prefer the row-store layout, analytics prefer the column-store layout. The proposed design achieves the best of both worlds.
Session on Micro-architecture
DynaMOS: Dynamic Schedule Migration for Heterogeneous Cores, Shruti Padmanabha, University of Michigan
An OOO core creates a better schedule but burns up to 6x more power. When there are loops, the OOO core redundantly regenerates the same schedule again and again. The idea is to record, or memoize, the schedule generated by the OOO core and then reuse it on an in-order core. One needs to detect profitable traces to memoize, using an intelligent trace-based predictor.
The OOO core reorders loads and stores and uses an LSQ to perform memory disambiguation. The trace must therefore store register-renaming information. The in-order core also needs a larger physical register file: each logical register can now map to 4 physical registers.
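The memoization idea can be sketched as a small cache mapping a hot trace to the issue order the OOO core discovered for it (structure and names are mine, not DynaMOS's actual microarchitecture):

```python
class ScheduleMemo:
    """Toy model of schedule memoization: the OOO core records the issue
    order it found for a hot loop trace, and the in-order core later
    replays that order verbatim instead of re-deriving it."""
    def __init__(self):
        self.cache = {}                    # trace start PC -> issue order

    def record(self, trace_pc, issue_order):
        self.cache[trace_pc] = list(issue_order)

    def replay(self, trace_pc):
        # memoized schedule, or None if this trace was never recorded
        return self.cache.get(trace_pc)

memo = ScheduleMemo()
# program order: load, dependent add, independent mul;
# the OOO core hoisted the mul between the load and the add
memo.record(0x400, ["ld r1", "mul r4", "add r2,r1"])
```

A real design would also have to validate that the memoized schedule is still legal (e.g. memory dependences), which is why the trace carries renaming information.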
Evaluated with a 3-wide OOO core and a 3-wide in-order core; achieves 30% energy savings on average.
Long Term Parking: Criticality-aware Resource Allocation in OOO Processors, Andreas Sembrant, Uppsala University
OOO cores allocate resources too eagerly. The idea is to identify and park non-performance-critical instructions before resource allocation. The goal is to minimize the time an instruction holds resources.
Urgency: Output is used by a long latency instruction (e.g. LLC miss).
Readiness: Input depends on a long latency instruction.
Classify instructions along these two axes; non-urgent instructions are parked. In practice, only urgent instructions can be detected directly.
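The two-axis classification can be written down as a small decision function (the function shape and labels are mine; the policy is from the talk):

```python
def classify(feeds_long_latency, waits_on_long_latency):
    """Classify an instruction on the talk's two axes:
    urgency  -- does its output feed a long-latency instruction
                (e.g. one that will miss in the LLC)?
    readiness -- are its inputs free of long-latency producers?
    Non-urgent instructions are parked instead of allocating
    ROB/LSQ/rename resources right away."""
    urgent = feeds_long_latency
    ready = not waits_on_long_latency
    action = "allocate" if urgent else "park"
    return urgent, ready, action
```

Parking a non-ready, non-urgent instruction is the big win: it would otherwise sit in the issue queue holding resources while waiting on a miss.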
The Inner Most Loop Iteration counter: a new dimension in branch history, Andre Seznec, INRIA/IRISA
Local history is not very useful in reducing branch misprediction rates. The issue is that there are many in-flight instances of the same branch; as a result, one ends up using the wrong history and getting wrong predictions. State-of-the-art global-history predictors: neural predictors, TAGE-GSC. How does one identify correlated branches? The loop predictor does it smoothly for loops. Previous work has shown that there is correlation among branch directions across iterations of multi-dimensional loops.
The IMLI-SIC component is a simple add-on to TAGE-GSC or GEHL.
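A simplified reading of the inner-most-loop-iteration counter itself (the same-PC check and reset policy here are my simplification; the real predictor feeds this count into the TAGE-GSC tables):

```python
class IMLICounter:
    """Track how many consecutive times the most recent backward
    conditional branch was taken. A taken backward branch is treated as
    a loop back-edge; when it falls through, the inner loop has exited,
    so the count resets to 0."""
    def __init__(self):
        self.count = 0
        self.last_pc = None

    def update(self, pc, target, taken):
        if target < pc:                     # backward conditional branch
            if taken:
                self.count = self.count + 1 if pc == self.last_pc else 1
                self.last_pc = pc
            else:
                self.count = 0
        return self.count
```

The counter value then serves as an extra index dimension alongside global history, capturing correlation across iterations of multi-dimensional loops.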
Filtered Runahead Execution with a Runahead Buffer, Milad Hashemi, UT Austin
Runahead dynamically expands the instruction window when the pipeline is stalled. Traditional runahead executes many instructions that do not generate any cache misses. Most dependency chains are short: a small dependence chain cache (2 entries) improves performance. Runahead generates more MLP by executing only a filtered dependency chain.
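The filtering idea is essentially a backward slice of the miss-causing load; a sketch (the trace representation here is mine):

```python
def dependence_chain(trace, miss_idx):
    """Backward slice of the miss-causing load at `trace[miss_idx]`:
    walk backward through the trace collecting only the producers whose
    results the load (transitively) needs. `trace` is a list of
    (dest_reg, src_regs) tuples in program order; returns the indices
    of the filtered chain, in program order."""
    needed = set(trace[miss_idx][1])       # source registers of the load
    chain = [miss_idx]
    for i in range(miss_idx - 1, -1, -1):
        dest, srcs = trace[i]
        if dest in needed:                 # producer of a needed value
            chain.append(i)
            needed.discard(dest)
            needed.update(srcs)
    return list(reversed(chain))
```

During a stall, runahead then executes only this short chain from the buffer, regenerating future misses (more MLP) without burning energy on unrelated instructions.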
Bungee Jumps: Accelerating Indirect Branches Through HW/SW Co-Design, Daniel S. McFarlin, Carnegie Mellon University
There have been tremendous improvements in indirect branch prediction accuracy. This is great for OOO cores, less so for in-order cores. In-order machines specialize based on branch bias or eliminate branch prediction altogether.
Session on Mobile & Emerging Systems
Prediction-Guided Performance-Energy Trade-off for Interactive Applications, Daniel Lo, Cornell University
Many modern applications are highly interactive. User interactions have response-time requirements. Run jobs slower in order to save energy while preserving user experience. Execution times for an application vary from job to job. Optimizing for the worst case wastes energy and optimizing for average case misses deadlines. History based DVFS control is too slow to account for fine-grained variations. Hence, we need proactive control i.e. predict DVFS based on job inputs and program state. In addition, the design has to be general and require minimal programmer input.
Execution time depends on the number of instructions executed, which in turn depends on control flow. One solution is to instrument the program source to count important features, but one would need to run the entire program to get the features, which might take a very long time. Instead, one can create program slices: minimal pieces of code that capture the features. A linear model maps features to execution time. The training algorithm is tuned to penalize under-prediction much more heavily; in particular, the paper used convex optimization with a custom objective function. A simple linear model translates execution time to a DVFS frequency.
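The asymmetric-penalty idea can be sketched with a toy one-feature model fit by gradient descent (a stand-in for the paper's convex program; parameter names and values are mine):

```python
def fit_exec_time(xs, ys, under_penalty=10.0, lr=0.001, iters=5000):
    """Fit exec_time ~= a * feature_count with a loss that penalizes
    under-prediction (predicting less time than the job takes, which
    would miss its response-time deadline) `under_penalty` times more
    heavily than over-prediction (which merely wastes some energy)."""
    a = 0.0
    for _ in range(iters):
        grad = 0.0
        for x, y in zip(xs, ys):
            err = a * x - y              # err < 0 means under-prediction
            w = under_penalty if err < 0 else 1.0
            grad += 2 * w * err * x      # gradient of w * err**2
        a -= lr * grad / len(xs)
    return a
```

On data where no exact fit exists, the asymmetric loss deliberately skews the model high, so predicted times (and hence chosen frequencies) err on the safe side of the deadline.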
Architecture-aware Automatic Computation Offload for Native Applications, Gwangmu Lee, POSTECH
Mobile device performance is too slow for some applications, so it can be beneficial to run demanding workloads on a server to improve overall performance and energy efficiency. Most offloading systems are based on VMs, where cross-architecture binary incompatibility becomes an issue. The challenges in offloading native workloads: different processor architectures, physically separate memories, and different memory layouts.
Main steps of the proposed design: target selection, virtual address unification, code partitioning, and server-specific optimizations. The target that minimizes both computation and communication latency is chosen. A function is not offloaded if no target server provides a positive overall gain. The heap areas are aligned in both binaries by using a custom memory allocator.
Evaluated with network speeds of 144 Mbps and 844 Mbps.
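The "positive overall gain" test amounts to a simple cost model (the standard offloading break-even check, not the paper's exact formula):

```python
def should_offload(local_time_s, remote_time_s, bytes_moved, bandwidth_bps):
    """Offload a function only when remote compute plus data transfer
    beats local execution. Times in seconds, bandwidth in bits/s."""
    comm_time_s = bytes_moved * 8 / bandwidth_bps
    return remote_time_s + comm_time_s < local_time_s

# e.g. 2.0 s locally vs 0.5 s on the server, moving 10 MB over 144 Mbps:
# transfer adds ~0.56 s, so offloading still wins
```

At 144 Mbps the communication term dominates for data-heavy functions, which is why target selection weighs communication latency alongside compute.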
Fast Support for Unstructured Data Processing: the Unified Automata Processor, Yuanwei Fang, University of Chicago
Finite automata are a powerful tool for pattern matching. Automata processing performs poorly on traditional CPUs/GPUs due to irregular memory access and computation patterns. The goal is a general-purpose automata processor that can handle many different types of automata. UAP exploits vector parallelism and uses a multi-level execution model.
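The flavor of vector parallelism involved can be illustrated with the classic bit-parallel Shift-And automaton (a textbook technique, not UAP's actual ISA): one machine word holds all NFA states, so each input symbol advances every state at once.

```python
def shift_and_match(pattern, text):
    """Bit-parallel NFA simulation for exact matching (Shift-And).
    Bit i of `state` means 'the first i+1 pattern chars just matched';
    every bit advances in parallel per input symbol. Returns the end
    positions of all matches in `text`."""
    masks = {}
    for i, ch in enumerate(pattern):       # per-symbol occurrence masks
        masks[ch] = masks.get(ch, 0) | (1 << i)
    accept = 1 << (len(pattern) - 1)
    state, hits = 0, []
    for pos, ch in enumerate(text):
        # shift all states forward, inject a fresh start, filter by symbol
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            hits.append(pos)
    return hits
```

A sequential automaton would chase one state through a transition table per symbol, with the irregular memory accesses the talk complains about; the bit-parallel form replaces that with dense word-wide operations.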
Enabling Interposer-based Disintegration of Multi-core Processors, Ajaykumar Kannan, University of Toronto
Silicon interposers enable integration of disparate chips; the leading application is integrating compute with 3D-stacked memory. Big chips are expensive, so break them into several smaller pieces and use the existing silicon interposer to reintegrate them. Optimization: sort chips before assembly to improve binning.
Disintegrated SoCs decrease manufacturing costs but increase on-chip latency. How does one build a NoC on the interposer, and what type of NoC should it be? In conventional chips, fewer transistors imply smaller chips, but this is not the case for interposers. The paper proposes a new NoC topology called ButterDonut; a hybrid of topologies plus misalignment gives the best performance.
DCS: A Fast and Scalable Device-Centric Server Architecture, Jaehyung Ahn, POSTECH
Host-centric architectures suffer from large latency and inefficiency. Prior work:
- Single-device optimizations – do not address inter-device communication.
- Inter-device communication – not applicable to unsupported devices.
- Integrating devices – custom devices and protocols and limited applicability.