MICRO 2015 Day 1

Keynote 

The Human Intranet, Where Swarms and Humans Meet, Jan Rabaey

In the 1990s, the idea of having small sensor nodes that collect a lot of data was proposed. There are about 100 sensors per person, with a 200% increase per year. The Internet of Things is at the peak of inflated expectations on the Gartner hype curve. How do we move from this hype to the plateau of productivity? A path forward: the Swarm platform, to foster the creation and deployment of new applications and devices.

How will humans deal with all the smart devices? One viable option is to make humans smarter, but this is a bandwidth-limited proposition. This is not a new idea: Augmenting Human Intellect, Doug Engelbart, 1962.

Brain-machine interfaces (BMIs): a high-bandwidth link between our brain and the devices around us. How do we correctly assess mood? The opportunity is hampered by gadgets that perform a single function, work only with devices from the same company, and need to be replaced every few years. The solution is an open human intranet that empowers humans in an augmented manner.

Why is this happening now and not 10 years ago? Although we are seeing the end of Moore’s law, there are many exciting advancements in other fields:

  • Novel sensorimotor modalities.
  • Miniaturization.
  • Flexible electronics.

Challenges:

  • One of the biggest problems is powering these devices. Energy sparsity: Intertwining of energy and information distribution.
  • Going forward, we need to focus on the network: a body area network whose skeleton/skin is formed by a mesh of hub nodes on/in the body.
  • Need to retain basic or partial functionality under all circumstances.
  • Security and privacy. The idea of building a firewall around a human. Encryption of every link with keys generated from local observations. Also, need to adapt to avoid spamming.
  • Dynamically changing conditions.

The new age of information processing is ubiquitous, distributed and very personal.

Best paper session

Large Pages and Lightweight Memory Management in Virtualized Environments: Can You Have it Both Ways?, Binh Pham (Rutgers University)

Large pages can provide significant benefits in virtualized environments. However, in order to enable lightweight memory management, most hypervisors splinter, or break, the large pages created by a guest OS into smaller pages. This work exploits the fact that even after splintering, most of the small pages of the original large page are still aligned in the address space. Hence, one can speculatively translate guest virtual addresses to host physical addresses. Every speculative translation is verified by a page table walk, which is not on the critical path; misspeculations lead to pipeline flushes. The scheme yields about a 10% improvement in performance. This paper could have direct implications on the virtual memory design of hypervisors.
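To make the alignment trick concrete, here is a minimal sketch (my own illustration, not the paper's implementation) of speculating a host physical address from a splintered 2 MB guest page whose 4 KB pieces remain contiguously aligned, with the real page walk used only to verify the guess off the critical path. The addresses and the dictionary-based "page tables" are invented for the example.

```python
# Illustrative sketch of alignment-based speculative translation.
# Page sizes and the flat dictionaries standing in for page tables
# are assumptions for the example, not the paper's data structures.

SMALL_PAGE = 4 * 1024          # 4 KB
LARGE_PAGE = 2 * 1024 * 1024   # 2 MB

# Host physical base of the (splintered) region backing each guest 2 MB page.
large_page_base = {0x40000000: 0x80000000}

# Per-4KB mappings created by the hypervisor after splintering.
small_page_map = {}
for off in range(0, LARGE_PAGE, SMALL_PAGE):
    small_page_map[0x40000000 + off] = 0x80000000 + off  # still aligned

def speculative_translate(gva):
    """Guess the host physical address assuming alignment is preserved."""
    base = gva & ~(LARGE_PAGE - 1)
    offset = gva & (LARGE_PAGE - 1)
    return large_page_base[base] + offset

def verify(gva, speculated_hpa):
    """The real page walk, performed off the critical path."""
    page = gva & ~(SMALL_PAGE - 1)
    actual = small_page_map[page] + (gva & (SMALL_PAGE - 1))
    return actual == speculated_hpa   # a mismatch would trigger a pipeline flush

gva = 0x40000000 + 0x1234
hpa = speculative_translate(gva)
assert verify(gva, hpa)
```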

Exploiting Commutativity to Reduce the Cost of Updates to Shared Data in Cache-Coherent Systems, Guowei Zhang, MIT CSAIL

Many operations on shared data are commutative, i.e., they produce the same result irrespective of the order in which they are performed: for example, addition, subtraction, bitwise OR, bitwise AND, etc. By allowing multiple threads to concurrently update shared data in a commutative fashion, we can get large speedups (up to 2.4x) in multi-threaded programs.
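As a toy illustration of why commutativity helps (my own example, not from the paper): additive updates from different threads can be applied in any order and still yield the same final value, so the coherence protocol need not serialize them.

```python
# Toy illustration: commutative updates give the same result in any order.
import itertools

deltas = [5, -3, 7, 2]   # updates issued by four different threads

results = set()
for order in itertools.permutations(deltas):
    value = 100          # initial shared value
    for d in order:
        value += d       # addition commutes, so the interleaving is irrelevant
    results.add(value)

assert results == {111}  # every ordering produces the same final value
```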

CCICheck: Using μhb Graphs to Verify the Coherence-Consistency Interface, Yatin A. Manerkar (Princeton University)

At the architecture level, coherence and consistency are dealt with independently. However, at the microarchitectural level, the two are deeply intertwined. This work provides a tool to verify the coherence-consistency interface.

Session on Caches

HyComp: A Hybrid Cache Compression Method for Selection of Data-Type-Specific Compression Methods, Angelos Arelakis, Chalmers University of Technology

Caches reduce average memory access latency, but consume a lot of energy and chip area. An alternative to increasing cache size is cache compression, but decompression adds latency on the critical path.

Important compression methods: BDI, CPACK, SC^2, FP-H. How well a given method performs varies a lot across workloads; ideally, we want to select the best compression method for a given workload. This paper proposes a method to perform runtime selection of the best compression method with 82% accuracy. It also describes a new compression method, FP-H, which uses semantic bit-field compression for floating-point data.

The scheme is evaluated on an 8-core multicore with a 3-level cache hierarchy, with compression in the LLC. HyComp combines various compression methods together. The right method is chosen at compression time, i.e., whenever there is a write-back from the private caches or an insert from memory. A two-phase approach:

  • Block inspection and prediction of content type.
  • Method selection and compression.

For each supported type, choose the compression method best suited to that data type. In the inspection phase, the content type is predicted for each 8-byte word of a 64-byte block by looking for certain patterns in the data bits.
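A rough sketch of the two-phase idea follows; the bit-pattern heuristics and the type-to-method mapping are simplified stand-ins for the paper's predictor, not its actual rules.

```python
import struct

def predict_type(word8: bytes) -> str:
    """Very simplified content-type heuristics for one 8-byte word."""
    if word8 == b"\x00" * 8:
        return "zero"
    value = int.from_bytes(word8, "little")
    exponent = (value >> 52) & 0x7FF
    if 0x380 <= exponent <= 0x47F:       # doubles of moderate magnitude
        return "float"
    if value < 2 ** 32:                  # small integer: upper bytes are clear
        return "integer"
    return "other"

def select_method(block64: bytes) -> str:
    """Phase 1: inspect the eight 8-byte words and vote on the content type.
    Phase 2: pick the compression method suited to the dominant type."""
    votes = {}
    for i in range(0, 64, 8):
        t = predict_type(block64[i:i + 8])
        votes[t] = votes.get(t, 0) + 1
    dominant = max(votes, key=votes.get)
    # Hypothetical type-to-method mapping for the sketch.
    return {"float": "FP-H", "integer": "BDI", "zero": "BDI", "other": "CPACK"}[dominant]

block = struct.pack("<8d", *[1.0 + 0.001 * i for i in range(8)])
print(select_method(block))   # -> FP-H for this floating-point block
```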

For most floating-point workloads, the hybrid scheme achieves a higher compression ratio than any standalone compression method.

Doppelganger: A Cache for Approximate Computing, Joshua San Miguel, University of Toronto

Accessing memory is 10-100x more expensive than a single floating-point operation. Caches reduce access latency but consume 30-50% of chip area.

For a wide range of workloads, all data/computations need not be precise. Approximate similarity: Two data blocks are approximately similar if replacing the content of one with that of the other doesn’t degrade application output. Whenever two cache blocks are approximately similar, space can be saved by storing only one block instead of two. This gives 77% cache space savings for a wide range of approximate computing workloads.

The LLC is divided into two equal parts, one for precise data and one for approximate data. Each data block is given a signature, and the tag array points to data blocks through these signatures. When the signatures of two blocks are the same, only one of them is stored in the data array, with multiple tag entries pointing to the shared data block. Doppelganger is orthogonal to schemes that perform intra-block compression, like BDI; by combining BDI with Doppelganger, one can achieve even higher performance and area improvements.
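A simplified sketch of the signature mechanism (the quantization-based signature function is a made-up stand-in for Doppelganger's actual map): approximately similar blocks collide on the same signature, so their tags end up pointing at one shared data entry.

```python
def signature(values, bits_per_value=2):
    """Hypothetical approximate signature: quantize each value coarsely so
    that nearly-equal blocks collide on purpose."""
    sig = 0
    for v in values:
        bin_ = min(int(abs(v) * (1 << bits_per_value)), (1 << bits_per_value) - 1)
        sig = (sig << bits_per_value) | bin_
    return sig

tag_array = {}    # block address -> signature (pointer into the data array)
data_array = {}   # signature -> the single stored copy

def insert(addr, values):
    sig = signature(values)
    tag_array[addr] = sig
    data_array.setdefault(sig, list(values))   # store the data only once per signature

insert(0x1000, [0.50, 0.25, 0.80, 0.10])
insert(0x2000, [0.51, 0.26, 0.81, 0.11])       # approximately similar block

assert tag_array[0x1000] == tag_array[0x2000]
print(len(data_array))    # 1 -> two tags share one stored data block
```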

Doppelganger achieves 3x better compression ratio, 1.36x reduction in chip area and 77% reduction in cache space.

The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory, Lavanya Subramanian, Intel Labs

In a multicore system, multiple cores contend for shared resources like the shared LLC and main memory. Different applications slow down to different degrees, but in general the slowdowns are quite large. The slowdowns are also unpredictable, i.e., an application's slowdown depends on all the applications it shares resources with.

Previous work looked at estimating the impact of interference at a per-request granularity. This work looks at aggregates of requests when characterizing interference. Shared cache access rate is a proxy for performance; hence, slowdown can be expressed in terms of cache access rate rather than execution time.
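In other words, as I read the model, the slowdown estimate reduces to a ratio of shared-cache access rates:

```python
def estimated_slowdown(cache_access_rate_alone, cache_access_rate_shared):
    """Core idea of the model: performance is proportional to the rate at which
    an application accesses the shared cache, so slowdown is the ratio of the
    access rate when run alone to the access rate when sharing resources."""
    return cache_access_rate_alone / cache_access_rate_shared

# e.g. 50M accesses/s alone vs. 20M accesses/s when co-running -> 2.5x slowdown
print(estimated_slowdown(50e6, 20e6))
```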

In order to reduce main memory contention, requests of one application can be prioritized over another at the memory controller. Shared cache capacity contention: applications can evict each other's blocks. Shared cache interference is hard to minimize through prioritization, because there is a long warmup period before an application sees the benefits of priority. A per-application auxiliary tag store is used to count the number of contention misses, i.e., the number of blocks of an application evicted by another application.

Previous cache capacity partitioning techniques like UCP aimed at reducing the overall miss count. This might not be the best choice when one wants to improve overall application performance. This paper proposes prioritizing each application based on the relative slowdown it would experience in the shared environment.

MORC: A Manycore-Oriented Compressed Cache, Tri M. Nguyen, Princeton University

Today’s chips support an ever increasing number of threads. There is a large gap between the bandwidth demanded by the application and available bandwidth. Caches help bridge this gap and cache compression can be used to further increase the benefit of caches.

Stream compression: sequentially compresses cache lines as a single stream. Comparing stream and block-based compression schemes, stream compression achieves a much higher compression ratio, but most designs prefer block-based compression to reduce implementation overheads, since stream compression is difficult to implement with set-based caches. The paper proposes using log-based caches to reduce the implementation costs of stream compression while retaining the benefits of set-based caches.
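A back-of-the-envelope illustration of why stream compression wins (zlib here is just a stand-in for whatever compressor the design uses): compressing many lines as one stream lets redundancy be exploited across lines, while block-based compression pays per-line overhead.

```python
import zlib

# 64 fake cache lines with a lot of cross-line redundancy.
lines = [bytes(((i + j) % 8 for j in range(64))) for i in range(64)]

# Block-based: each 64-byte line compressed on its own.
block_compressed = sum(len(zlib.compress(line)) for line in lines)

# Stream-based: all lines of the "log" compressed as a single stream.
stream_compressed = len(zlib.compress(b"".join(lines)))

total = sum(len(line) for line in lines)
print(f"block-based ratio:  {total / block_compressed:.2f}x")
print(f"stream-based ratio: {total / stream_compressed:.2f}x")
```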

The scheme achieves up to 6x compression ratio, and 3x on average. Maximum throughput improvements are up to 40%, compared to 20% with other schemes. The scheme saves energy on expensive DRAM accesses while having negligible energy overheads for compression and decompression; hence, overall the scheme has decent energy reductions.

Session on Prefetching

Efficiently Prefetching Complex Address Patterns, Manjunath Shevgoor, University of Utah

Prefetching is a good technique to reduce the impact of high DRAM access latency. There are two major types of prefetchers: confirmation-based and immediate.

Spatial correlation: learn delta patterns and apply them when similar conditions re-occur. There are both regular and irregular delta patterns. For irregular delta patterns, neither stream nor sandboxed prefetchers provide timely prefetches. The paper proposes delta history tables that store the last address, the last 4 deltas, the last predictor, the number of times used, and the last 4 prefetches.
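A simplified sketch of such a per-stream entry (field names follow the description above, but the learning and matching logic is my own simplification): record recent deltas, and when the current delta history matches a previously seen sequence, predict the delta that followed it.

```python
from collections import deque

class DeltaHistoryEntry:
    """Tracks one page/stream: last address, recent deltas, and the delta
    patterns seen so far (a stand-in for the prediction tables)."""
    def __init__(self, history_len=4):
        self.last_addr = None
        self.deltas = deque(maxlen=history_len)
        self.patterns = {}   # tuple of recent deltas -> the delta that followed

    def access(self, addr):
        prefetch = None
        if self.last_addr is not None:
            delta = addr - self.last_addr
            key = tuple(self.deltas)
            if key:
                self.patterns[key] = delta          # learn: this history led to this delta
            self.deltas.append(delta)
            nxt = self.patterns.get(tuple(self.deltas))
            if nxt is not None:
                prefetch = addr + nxt               # predict the next address
        self.last_addr = addr
        return prefetch

entry = DeltaHistoryEntry()
# Irregular but repeating delta pattern: +1, +3, +2, +1, +3, +2, ...
for a in [0, 1, 4, 6, 7, 10, 12, 13]:
    p = entry.access(a)
    if p is not None:
        print(f"after {a}: prefetch {p}")
```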

In benchmarks like MG, which have both regular and irregular access patterns, prior schemes perform poorly due to interference between the two kinds of patterns. The variable-length delta prefetcher (VLDP) proposed in this paper is able to differentiate between the two. The offset prediction table (OPT) issues predictions without confirmation, while the delta prediction tables (DPTs) recognize irregular access patterns.

Self-Contained, Accurate Precomputation Prefetching, Islam Atta, University of Toronto

Big data workloads, with structured and semi-structured data, are heavily memory-bound, and prefetching is the traditional remedy for this problem.

What is a precomputation prefetcher? It runs a reduced slice of the program (a p-slice) ahead of the main thread to generate future memory accesses; p-slices ought to be accurate and fast. Past work constructed p-slices manually, at compile time, or from binary traces; this work focuses on the last technique. Conventional binary-based methods oversimplify p-slices. The proposed scheme, Ekivolos, accurately replicates the main thread's execution path and maintains control flow by merging multiple traces.
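To make the p-slice idea concrete (a toy example of precomputation prefetching in general, not Ekivolos output): the slice keeps only the address-generating code of the main loop so a helper thread can run it ahead and warm the cache.

```python
# Main loop: traverses an array indirectly and does "real" work on each element.
def main_loop(index, data):
    total = 0.0
    for i in index:
        total += data[i] ** 0.5 + data[i] * 3.0   # actual computation
    return total

# P-slice: only the address-generating part survives; the arithmetic is sliced
# away. A helper thread would run this ahead of main_loop to prefetch data[i].
def p_slice(index, data, prefetch):
    for i in index:
        prefetch(data, i)   # touch the future load address, no real work

# A stand-in "prefetch" that just records which elements would be warmed up.
touched = []
p_slice([3, 1, 4, 1, 5], [10.0] * 6, lambda d, i: touched.append(i))
print(touched)   # the indices the slice would have prefetched
```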

Confluence: Unified Instruction Supply for Scale-Out Servers, Cansu Kaynak, EcoCloud, EPFL

Existing mechanisms to improve instruction supply: L1-I prefetchers, and multi-level BTBs and BTB prefetchers. The overheads of these structures in current server processors, with large OOO cores and large on-chip caches, are negligible. However, emerging scale-out processors have many more, simpler cores and smaller on-chip caches. The goal of this work is to reduce the redundant instruction-supply state of prefetchers and BTBs in future server processors.

The key insight is that the instruction working set subsumes the BTB state. BTB prefetchers record and prefetch groups of correlated BTB entries, while instruction prefetchers exploit temporal instruction streaming, so there is redundancy in the control-flow metadata kept by the BTB and L1-I prefetchers. In the proposed scheme, the BTB entries of an instruction block are bundled, and each bundle in the BTB is linked to a block in the I-cache. By doing this, the BTB only has entries for instructions that are present in the I-cache; when a block is evicted from the L1-I, the corresponding bundle is also evicted from the BTB. This helps reduce the size of the BTB and obviates the need for large two-level BTBs. To enable timely prefetches, Confluence uses the shared history instruction fetch (SHIFT) prefetcher proposed in prior work.
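A sketch of the bundling and coupled eviction (the data structures are my own simplification of the idea described above): BTB entries are grouped per instruction-cache block, and evicting the block drops its bundle, so the BTB never holds state for non-resident instructions.

```python
BLOCK = 64   # instruction-cache block size in bytes (assumed)

icache = set()        # resident instruction-block addresses
btb_bundles = {}      # block address -> {branch PC: target}

def block_of(pc):
    return pc & ~(BLOCK - 1)

def fetch_block(addr, branches):
    """On an I-cache fill, install the block and its bundle of BTB entries."""
    icache.add(addr)
    btb_bundles[addr] = dict(branches)

def evict_block(addr):
    """Evicting an I-cache block also drops its BTB bundle."""
    icache.discard(addr)
    btb_bundles.pop(addr, None)

def btb_lookup(pc):
    return btb_bundles.get(block_of(pc), {}).get(pc)

fetch_block(0x1000, {0x1010: 0x2000, 0x1038: 0x1000})
print(hex(btb_lookup(0x1010)))   # 0x2000
evict_block(0x1000)
print(btb_lookup(0x1010))        # None: BTB state goes away with the I-cache block
```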

Confluence provides the same benefits as a BTB with 16x the entries of the baseline BTB while incurring only 10% of the overhead.

IMP: Indirect Memory Prefetcher, Xiangyao Yu, MIT

Operations on sparse datasets are becoming increasingly important:

  • Rise of big data analytics.
  • HPC: big gains in dense apps have drawn attention to lousy efficiency of sparse apps.

Access patterns are input-dependent and indirect. Indirect accesses are generally of the form A[B[i]]. Example applications include sparse matrix-dense vector multiplication and graph algorithms. B is precomputed and accessed contiguously; the unpredictable factor is the content of B.

The Indirect Memory Prefetcher (IMP) has both a stream table (to prefetch B[i]) and an indirect table (to prefetch A[B[i]]), which combine to generate prefetch addresses. IMP analyzes the contents of the prefetches generated by the stream prefetcher and generates entries in the indirect table:

&A[B[i]] = coeff * B[i] + base_address

The candidates for &A[B[i]] are the first several cache misses after accessing B[i]. Set up a set of linear equations to solve for coeff and base_address.
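Concretely, under the linear model above, two observed (B[i], miss address) pairs are enough to solve for coeff and base_address, and additional pairs can confirm or reject the fit. The following is a sketch of the fitting step, not IMP's exact hardware algorithm.

```python
def fit_indirect_pattern(b1, addr1, b2, addr2):
    """Solve addr = coeff * b + base from two observed (B[i], miss address)
    pairs; coeff typically turns out to be the element size of A."""
    coeff = (addr2 - addr1) // (b2 - b1)
    base = addr1 - coeff * b1
    return coeff, base

def predict(coeff, base, b):
    return coeff * b + base

# Example: A is an array of 8-byte elements at base 0x10000, and misses are
# observed after reading B[i] = 7 and B[j] = 19.
coeff, base = fit_indirect_pattern(7, 0x10000 + 7 * 8, 19, 0x10000 + 19 * 8)
print(coeff, hex(base))                # 8 0x10000
print(hex(predict(coeff, base, 42)))   # prefetch address of A[42]
```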
