Parallel Local Graph Clustering

Julian Shun gave a talk at MIT last week on his paper ‘Parallel Local Graph Clustering’, Shun et al., VLDB 2016. The paper is about parallelizing graph clustering heuristics whose runtime scales with the size of the cluster being formed rather than with the size of the graph, a property the authors call local running time. The paper doesn’t propose new clustering heuristics but instead focuses on how to parallelize various heuristics from previous work while maintaining the local running time property.

In particular, parallelization techniques are presented for 3 clustering heuristics – Nibble, PageRank-Nibble (PR-Nibble) and deterministic heat kernel PageRank. The basic idea behind each of these heuristics is very similar. First, a random seed vertex in the graph is chosen and an initial weight, usually 1.0, is assigned to it. Then a number of iterations are performed to propagate this weight/mass to other vertices in the graph. In each iteration, each active vertex distributes some fraction of its current weight among its neighbors, while keeping the remaining fraction for itself. An active vertex is one whose current weight is greater than a certain threshold. The intuition is that vertices that form a good cluster will accumulate larger weights because they are well connected among themselves. This process is repeated for a pre-specified number of iterations or until a convergence condition is met, very similar to the PageRank algorithm for ranking vertices in a web graph.
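
To make the weight propagation concrete, here is a minimal sequential sketch of a PR-Nibble-style push iteration (in the spirit of Andersen, Chung and Lang). The parameter names (`alpha`, `epsilon`) and the queue-based frontier are illustrative simplifications, not the paper’s parallel implementation.

```cpp
#include <queue>
#include <vector>

// Sequential sketch of a PR-Nibble-style "push" iteration. p holds the
// accumulated weight, r the residual mass still being propagated; a vertex
// is active while r[u]/deg(u) >= epsilon.
std::vector<double> pr_nibble(const std::vector<std::vector<int>>& graph,
                              int seed, double alpha, double epsilon) {
  int n = graph.size();
  std::vector<double> p(n, 0.0), r(n, 0.0);
  r[seed] = 1.0;                          // all initial mass on the seed vertex
  std::queue<int> active;
  active.push(seed);
  while (!active.empty()) {
    int u = active.front(); active.pop();
    double deg = graph[u].size();
    if (deg == 0 || r[u] / deg < epsilon) continue;   // no longer active
    p[u] += alpha * r[u];                 // keep a fraction of the mass
    double share = (1.0 - alpha) * r[u] / (2.0 * deg);
    r[u] = (1.0 - alpha) * r[u] / 2.0;    // half of the rest stays as residual
    for (int v : graph[u]) {              // push the other half to the neighbours
      bool was_active = r[v] / graph[v].size() >= epsilon;
      r[v] += share;
      if (!was_active && r[v] / graph[v].size() >= epsilon) active.push(v);
    }
    if (r[u] / deg >= epsilon) active.push(u);
  }
  return p;                               // per-vertex weights fed to the sweep cut
}
```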

The heuristics differ in what fraction of a vertex’s weight is kept for itself, how weights are propagated to neighbors, how active vertices are chosen and the convergence criteria. The output of these heuristics is a real-valued vector of weights over all the vertices in the graph. The Sweep Cut heuristic takes this vector and generates the final cluster. The authors also discuss a clever parallelization technique for Sweep Cut.
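
For reference, a sequential sketch of what Sweep Cut computes is shown below: order vertices by weight over degree, sweep over prefixes of that order, and keep the prefix with the lowest conductance (cut edges divided by the smaller of the two volumes). This only illustrates the computation the paper parallelizes.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <vector>

// Sweep cut sketch: return the prefix S of the weight/degree ordering that
// minimizes conductance = cut(S) / min(vol(S), vol(V) - vol(S)).
std::vector<int> sweep_cut(const std::vector<std::vector<int>>& graph,
                           const std::vector<double>& p) {
  long long total_vol = 0;
  for (const auto& nbrs : graph) total_vol += nbrs.size();

  std::vector<int> order;                       // vertices with non-zero weight
  for (int v = 0; v < (int)graph.size(); ++v)
    if (p[v] > 0.0 && !graph[v].empty()) order.push_back(v);
  std::sort(order.begin(), order.end(), [&](int a, int b) {
    return p[a] / graph[a].size() > p[b] / graph[b].size();
  });

  std::set<int> in_set;
  long long vol = 0, cut = 0;
  double best_cond = 2.0;                       // conductance never exceeds 1
  std::size_t best_prefix = 0;
  for (std::size_t i = 0; i < order.size(); ++i) {
    int u = order[i];
    in_set.insert(u);
    vol += graph[u].size();
    for (int v : graph[u])                      // each edge either joins or leaves the cut
      cut += in_set.count(v) ? -1 : +1;
    long long denom = std::min(vol, total_vol - vol);
    if (denom > 0) {
      double cond = static_cast<double>(cut) / denom;
      if (cond < best_cond) { best_cond = cond; best_prefix = i + 1; }
    }
  }
  return std::vector<int>(order.begin(), order.begin() + best_prefix);
}
```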

The authors use the primitives of the Ligra graph processing framework to parallelize these heuristics. In particular, the weight propagation step maps directly to Ligra’s EdgeMap primitive. The authors develop a concurrent hash table to store the weights of active vertices. The main benefit of these primitives is that they only process the edges of active vertices, making the runtime proportional to the number of active vertices plus the sum of their degrees (called the volume). Thus the parallel versions do not lose the local running time property, and the authors prove this by providing theoretical bounds on the work efficiency of their parallel versions.

These parallel versions achieve speedups of 23x-28x on a 40-core machine with hyperthreading, for a total of 80 threads. The code is available in the Ligra GitHub repository. The paper also shows network community profiles for some of the largest publicly available graphs. These profiles plot the conductance (a metric that indicates the quality of a cluster) of a set of randomly generated local clusters against the size of those clusters. For most graphs, the best clusters have size roughly equal to the square root of the number of vertices in the graph.

Some questions:

  • Is memory performance the bottleneck to the scalability of these algorithms?
  • The parallel versions use double-buffering, i.e. they maintain copies of the weights for the current and previous iterations. Is it possible to use a single buffer while maintaining correctness, and what is the effect of this on performance? (Credit: Yunming Zhang)
  • Since the general consensus seems to be that it is difficult to cluster vertices in power-law graphs as opposed to edges, how do these heuristics compare with edge-centric clustering heuristics?

Transactional Memory and Lock Elision

I started reading a post on recent trends in hardware and their implications for distributed systems design. One of the recent additions that could make parallel programming much easier is the TSX extensions in Intel Haswell processors.

It comes in two flavors: a backwards-compatible Hardware Lock Elision (HLE) instruction set, and a more flexible, forward-looking Restricted Transactional Memory (RTM) instruction set. Here are a few blog posts and papers that give a good overview of Intel TSX and its origins in academia.

A brief introduction to Intel TSX and its implications for the industry

http://www.anandtech.com/show/6290/making-sense-of-intel-haswell-transactional-synchronization-extensions

Analysis of Haswell’s transactional memory by David Kanter

http://www.realworldtech.com/haswell-tm/1/

Hardware Lock Elision on Haswell

https://brooker.co.za/blog/2013/12/14/intel-hle.html

Lock elision support in GCC and Linux

http://halobates.de/adding-lock-elision-to-linux.pdf

The original paper on Speculative Lock Elision: Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution

http://pages.cs.wisc.edu/~rajwar/papers/micro01.pdf
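
As a rough illustration of the RTM flavor mentioned above, here is a minimal sketch of the usual transactional pattern with a lock-based fallback path, using the `_xbegin`/`_xend`/`_xabort` intrinsics from `<immintrin.h>` (compile with `-mrtm`). The shared counter and the fallback flag are placeholders for a real critical section, not code from any of the posts above.

```cpp
#include <immintrin.h>   // _xbegin / _xend / _xabort
#include <atomic>
#include <mutex>

std::mutex fallback_lock;
std::atomic<bool> lock_taken{false};   // lets a transaction observe the fallback path

// Try the critical section as a hardware transaction first; if it aborts
// (conflict, capacity, unsupported instruction, no RTM support...), fall back
// to a conventional lock. Reading lock_taken inside the transaction makes a
// concurrent fallback holder abort us, keeping the two paths mutually exclusive.
void increment(long& counter) {
  unsigned status = _xbegin();
  if (status == _XBEGIN_STARTED) {
    if (lock_taken.load(std::memory_order_relaxed)) _xabort(0xff);
    ++counter;                         // runs transactionally
    _xend();
  } else {
    std::lock_guard<std::mutex> guard(fallback_lock);
    lock_taken.store(true, std::memory_order_relaxed);
    ++counter;                         // runs under the ordinary lock
    lock_taken.store(false, std::memory_order_relaxed);
  }
}
```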

MICRO 2015 – Day 2

Keynote

Preparing for a Post Moore’s Law World, Prof. Todd Austin, University of Michigan

Density scaling is slowing down. It is now taking more and more time (36 months) to get the next generation of silicon out to market. This shifts the burden to the computer architects. The first remedy to slowing Moore’s law was chip multiprocessors.

The public perception of ‘how fast computers are’ is very negative. Who is to blame?

  • Programmers: There are still some niche markets like Bitcoin and warehouse-scale computing which are exploiting abundant parallelism.
  • Educators: CS enrolment is going up.
  • The transistor: Constant power scaling not possible anymore
  • Architects: Amdahl’s law. You need an ever-increasing number of processors to get higher speedups. Parallelism is not the solution to bridge the performance gap.

Heterogeneous parallel systems to overcome dark silicon and the tyranny of Amdahl’s law.

Good: Hetero-parallel systems can close the Moore’s law gap.

Bad: The architecture community is gonna have a hard time responding to this extreme challenge.

Ugly: Such systems are gonna cost a lot.

Design costs are skyrocketing and this will kill the community. It takes around $120M to get a 20nm design to market. In contrast, the software community can generate billion-dollar companies with around $500K.

The ultimate goal is to accelerate system architecture innovation and make it sufficiently inexpensive that anyone can do it anywhere. Expect more from architects: fund research only if it can deliver a 2x speedup, not 10%. Reduce the cost of designing customized hardware. Widen the applicability of customized H/W. Make hardware more open-source. Reduce the cost of manufacturing customized H/W.

Session on DRAM

More is Less: Improving the Energy Efficiency of Data Movement via Opportunistic Use of Sparse Codes, Yanwei Song, University of Rochester.

A sparse representation with few 0s in each codeword can reduce data movement energy. In the DDR4 IO interface, constant current flows when transmitting a 0. A DBI bit is used to invert the transmitted byte, so when current flows we could be transmitting a 1 or a 0, depending on what the DBI bit is. A longer codeword requires increasing the burst length, so applying a sparse code at all times significantly increases bandwidth overhead.
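
For context, here is a hedged sketch of the per-byte data bus inversion (DBI) decision that the talk builds on. The threshold and polarity follow the usual DDR4 convention (invert when more than four 0s would otherwise be transmitted); the details of MiLC itself are not captured here.

```cpp
#include <bitset>
#include <cstdint>
#include <utility>

// Per-byte DBI decision: if a byte would put more than four energy-costly 0s
// on the bus, send the inverted byte and set the DBI bit instead.
std::pair<uint8_t, bool> dbi_encode(uint8_t byte) {
  int zeros = 8 - static_cast<int>(std::bitset<8>(byte).count());
  if (zeros > 4)
    return { static_cast<uint8_t>(~byte), true };   // inverted, DBI bit set
  return { byte, false };                           // sent as-is
}
```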

When no data is coming in the next few cycles, use the sparse code; otherwise use the original encoding. The proposed simple code, MiLC, builds upon DBI by exploiting spatial correlations among adjacent bytes of data.

LPDDR3 consumes energy when toggling, so it is not optimal to just reduce the number of 0s. Transition signaling energy is proportional to the number of bit flips, while level signaling energy is proportional to the number of 0s.

Improving DRAM Latency with Dynamic Asymmetric Subarray, Shih-Lien Lu, Intel

Memory latency in CPU cycles has grown and affects CPU design. DRAM is organized as banks, sub-arrays and tiles. Device latency = I/O + peripheral + sub-array. Sub-array access contributes a lot to the delay, but reducing sub-array access latency increases die area a lot. A hybrid design is needed to reduce access latency with minimal increase in die area.

The proposed design uses migration cells in between alternate rows to reduce the access latency for different types of operations. Two factors contribute to the area overhead: extra rows for migration and reduced array efficiency due to shorter bit lines and word lines. The overall area overhead is 6%. Dynamic asymmetric subarray (DAS) DRAM gives most of the performance benefit of fast sub-array (FS) DRAM.

Gather-Scatter DRAM: In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses, Vivek Seshadri, Carnegie Mellon University

The problem being tackled is non-unit strided accesses, for which current cache-line-optimized memory systems are inefficient. Whether accesses to particular fields in a database result in unit or non-unit strided accesses depends on the database’s layout, i.e. row store or column store.

The goal is to eliminate this inefficiency of cache-line-optimized DRAM systems and fetch non-unit strided accesses without it. In traditional systems, the data of each cache line is spread across all the chips. Chip conflicts: the fields being gathered might all be mapped to the same chip. The solution is to shuffle the data from multiple cache lines before sending it to memory. But the data might then be stored in different columns in different banks, so how does the memory controller know from which column to retrieve data? Using pattern-id based address translation, one can retrieve any power-of-two strided access pattern.

In-memory databases have two kinds of layout – row store and column store. The scheme assumes a row-store layout as the baseline and efficiently retrieves data in column-store order. Each layout is better suited to a certain kind of workload – transactions prefer the row-store layout, analytics prefer the column-store layout. Using the proposed design one can achieve the best of both worlds.

Session on Micro-architecture

DynaMOS: Dynamic Schedule Migration for Heterogeneous Cores,
Shruti Padmanabha, University of Michigan

An OOO core creates a better schedule but burns significantly more power. When there are loops, the OOO core redundantly regenerates the same schedule again and again. The idea is to record, or memoize, the schedule generated by the OOO core and then replay it on an inorder core. One needs to detect profitable traces to memoize – an intelligent trace-based predictor is used.

The OOO core reorders loads and stores and uses an LSQ to perform memory disambiguation. The OOO core needs to store register renaming information in the trace. Also, one needs a larger physical register file in the inorder core: each logical register in the inorder core can now map to 4 physical registers.

Evaluate 3-wide OOO and 3-wide inorder cores. Achieves 30% energy savings on average.

Long Term Parking: Criticality-aware Resource Allocation in OOO Processors, Andreas Sembrant, Uppsala University

OOO cores allocate resources too eagerly. The idea is to identify and park non-performance-critical instructions before resource allocation. The goal is to minimize the time an instruction holds resources.

Urgency: Output is used by a long latency instruction (e.g. LLC miss).

Readiness: Input depends on a long latency instruction.

Classify instructions along the above two axes. Non-urgent instructions will be parked. One can only detect urgent instructions.

The Inner Most Loop Iteration counter: a new dimension in branch history, Andre Seznec, INRIA/IRISA

Local history is not very useful in reducing branch misprediction rates. The issue is that there are many in-flight instances of the same branch. As a result, one ends up using the wrong history and getting a wrong prediction. State-of-the-art global history predictors: neural predictors, TAGE-GSC. How does one identify correlated branches? The loop predictor does it smoothly for loops. Previous work has shown that there is correlation in the branch directions of multi-dimensional loops.

IMLI-SIC component. A simple add-on to TAGE-GSC or GEHL.

Filtered Runahead Execution with a Runahead Buffer, Milad Hashemi, UT Austin

Runahead dynamically expands the instruction window when the pipeline is stalled. Traditional runahead executes a lot of instructions that do not generate any cache misses. Most dependency chains are short, and a small dependence chain cache (2 entries) improves performance. Runahead generates more MLP by executing only a filtered dependency chain.

Bungee Jumps: Accelerating Indirect Branches Through HW/SW Co-Design, Daniel S. McFarlin, Carnegie Mellon University

There have been tremendous improvements in indirect branch prediction accuracy. This is great for OOO cores, not so for inorder cores. Inorder machines specialize based on branch bias or eliminate branch prediction altogether.

Session on Mobile & Emerging Systems

Prediction-Guided Performance-Energy Trade-off for Interactive Applications, Daniel Lo, Cornell University

Many modern applications are highly interactive, and user interactions have response-time requirements. The idea is to run jobs slower in order to save energy while preserving user experience. Execution times for an application vary from job to job: optimizing for the worst case wastes energy, while optimizing for the average case misses deadlines. History-based DVFS control is too slow to account for fine-grained variations. Hence, we need proactive control, i.e. predicting the DVFS setting based on job inputs and program state. In addition, the design has to be general and require minimal programmer input.

Execution time depends on the number of instructions executed, which in turn depends on control flow. One solution is to instrument the program source to count important features, but that would require running the entire program to get the features, which might take a very long time. Instead, one can create program slices – minimal pieces of code that compute the features. A linear model maps features to execution time. The training algorithm is tuned to penalize under-prediction much more heavily; in particular, the paper uses convex optimization with a custom objective function. In order to translate execution time to a DVFS frequency, a simple linear model is used.
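
A hedged sketch of this control step is shown below: a linear model maps job features to a predicted execution time, and the controller picks the lowest frequency that still meets the deadline. The feature set, weights, frequency list and the assumption that execution time scales inversely with frequency are all illustrative, and the asymmetric training objective is not shown.

```cpp
#include <cstddef>
#include <vector>

// Predicted execution time at the maximum frequency, from a linear model
// over job features (weights/bias would come from offline training).
double predict_time_ms(const std::vector<double>& features,
                       const std::vector<double>& weights, double bias) {
  double t = bias;
  for (std::size_t i = 0; i < features.size(); ++i) t += weights[i] * features[i];
  return t;
}

// Pick the lowest frequency that still meets the deadline, assuming execution
// time scales roughly as f_max / f; fall back to full speed otherwise.
double choose_frequency_ghz(double predicted_ms, double deadline_ms,
                            const std::vector<double>& freqs_ghz,  // sorted ascending
                            double f_max_ghz) {
  for (double f : freqs_ghz)
    if (predicted_ms * (f_max_ghz / f) <= deadline_ms) return f;
  return f_max_ghz;
}
```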

Architecture-aware Automatic Computation Offload for Native Applications, Gwangmu Lee, POSTECH

Mobile device performance is too slow for some applications, so it might be beneficial to run demanding workloads on a server to improve overall performance and energy efficiency. Most offloading systems are based on VMs, where cross-architecture binary incompatibility becomes an issue. The challenges in offloading native workloads: different processor architectures, distinct memories and different memory layouts.

Main steps of the proposed design: target selection, virtual address unification, code partitioning and server-specific optimizations. The target which minimizes both computation and communication latency is chosen, and a function is not offloaded if no target server provides a positive overall gain. The heap areas are aligned in both binaries by using a custom memory allocator.

Evaluated with network speeds of 144 Mbps or 844 Mbps.

Fast Support for Unstructured Data Processing: the Unified Automata Processor, Yuanwei Fang, University of Chicago

Finite automata are a powerful tool for pattern matching, but automata processing has poor performance on traditional CPUs/GPUs due to irregular memory access and computation patterns. The goal is to develop a general-purpose automata processor that can handle many different types of automata. UAP exploits vector parallelism and uses a multi-level execution model.

Enabling Interposer-based Disintegration of Multi-core Processors,
Ajaykumar Kannan, University of Toronto

Silicon interposers enable the integration of disparate chips; the leading application is the integration of compute with 3D-stacked memory. Big chips are expensive, so break them into several smaller pieces and use the existing silicon interposer to reintegrate them. Optimization: sort chips before assembly to improve binning.

Disintegrated SoCs decrease manufacturing costs but increase on-chip latency. How does one build a NoC on the interposer, and what type of NoC should one build? In conventional chips, fewer transistors implies smaller chips, but this is not the case for interposers. The paper proposes a new NoC topology called ButterDonut. A hybrid of topologies plus misalignment gives the best performance.

DCS: A Fast and Scalable Device-Centric Server Architecture, Jaehyung Ahn, POSTECH

Host-centric architecture has large latency and inefficiency. Prior work:

  • Single-device optimizations – do not address inter-device communication.
  • Inter-device communication – not applicable to unsupported devices.
  • Integrating devices – custom devices and protocols and limited applicability.

MICRO 2015 Day 1

Keynote 

The Human Intranet, Where Swarms and Humans Meet, Jan Rabaey

In the 1990s, the idea of having small sensor nodes that collect a lot of data was proposed. There are now about 100 sensors per person, with a 200% increase per year. The Internet of Things is at the peak of inflated expectations in the Gartner hype curve. How do we move from this hype to the plateau of productivity? A path forward: the Swarm platform, to foster the creation and deployment of new applications and devices.

How will humans deal with all the smart devices? One viable option is to make humans smarter, but this is a bandwidth-limited proposition. This is not a new idea: Augmenting Human Intellect, Doug Engelbart, 1962.

Brain-machine interfaces (BMIs): a high-bandwidth link between our brain and the devices around us. How does one correctly assess mood? The opportunity is hampered by gadgets that do a single function, only work with devices from the same company and need to be replaced every few years. The solution is to have an open human intranet, empowering humans in an augmented manner.

Why is this happening now and not 10 years ago? Although we are seeing the end of Moore’s law, there are many exciting advancements in other fields:

  • Novel sensomotory modalities.
  • Miniaturization.
  • Flexible electronics.

Challenges:

  • One of the biggest problems is powering these devices. Energy sparsity: Intertwining of energy and information distribution.
  • Going forward, we need to focus on the network, body area network: skeleton/skin formed by mesh of hub nodes on/in the body.
  • Need to retain basic or partial functionality under all circumstances.
  • Security and privacy. The idea of building a firewall around a human. Encryption of every link with keys generated from local observations. Also, need to adapt to avoid spamming.
  • Dynamically changing conditions.

The new age of information processing is ubiquitous, distributed and very personal.

Best paper session

Large Pages and Lightweight Memory Management in Virtualized Environments: Can You Have it Both Ways?, Binh Pham (Rutgers University)

Large pages can provide a lot of benefits in virtualized environments. However, in order to enable lightweight memory management, most hypervisors splinter or break the large pages created by a guest OS into smaller pages. This work exploits the fact that even after splintering, most of the small pages of the original large page are still aligned in the address space. Hence, one can do speculative translation from guest virtual addresses to host physical addresses. Every speculative translation is verified by a page table walk, which is not on the critical path. Mis-speculations lead to pipeline flushes. There is about 10% improvement in performance. This paper could have direct implications on the virtual memory design of hypervisors.

Exploiting Commutativity to Reduce the Cost of Updates to Shared Data in Cache-Coherent Systems, Guowei Zhang, MIT CSAIL

Many operations on shared data are commutative, i.e. they produce the same result irrespective of the order in which they are performed, for example addition, subtraction, bitwise OR and bitwise AND. By allowing multiple threads to concurrently update shared data in a commutative fashion, we can get large speedups (up to 2.4x) in multi-threaded programs.
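
As a small user-level illustration of why commutative updates are friendly to parallelism, the sketch below has several threads add into one shared counter; because addition is commutative, the final value is independent of the order of updates. The paper’s contribution is letting the coherence protocol exploit this property, which this sketch does not model.

```cpp
#include <atomic>
#include <thread>
#include <vector>

int main() {
  std::atomic<long> shared_sum{0};
  std::vector<std::thread> threads;
  for (int t = 0; t < 8; ++t)
    threads.emplace_back([&shared_sum] {
      for (int i = 0; i < 100000; ++i)
        shared_sum.fetch_add(1, std::memory_order_relaxed);  // commutative update
    });
  for (auto& th : threads) th.join();
  // Addition commutes, so the result is the same for every interleaving.
  return shared_sum.load() == 8L * 100000 ? 0 : 1;
}
```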

CCICheck: Using μhb Graphs to Verify the Coherence-Consistency Interface, Yatin A. Manerkar (Princeton University)

At the architecture level, coherence and consistency are dealt with independently. However, at the microarchitectural level the two are deeply intertwined. This work provides a tool to verify the coherence-consistency interface.

Session on Caches

HyComp: A Hybrid Cache Compression Method for Selection of Data-Type-Specific Compression Methods, Angelos Arelakis, Chalmers University of Technology

Caches reduce average memory access latency, but consume a lot of energy and chip area. An alternative to increasing cache size is cache compression, but decompression adds latency on the critical path.

Important compression methods: BDI, CPACK, SC^2, FP-H. There is a lot of variation in how each method performs on different workloads, so ideally we want to select the best compression method for a given workload. This paper proposes a method to perform runtime selection of the best compression method with 82% accuracy. It also describes a new compression method, FP-H, that uses semantic bit-field compression.

The scheme is evaluated on an 8-core multicore with a 3-level cache hierarchy, with compression in the LLC. HyComp combines various compression methods. The right method is chosen at compression time, i.e. whenever there is a write-back from the private caches or an insertion from memory. A two-phase approach:

  • Block inspection and prediction of content type.
  • Method selection and compression.

For each supported type, a compression method best suited for that data type is chosen. In the inspection phase, the content type is predicted for each 8-byte block within a 64-byte cache block. The prediction is done by looking for certain patterns in the data bits.

For most floating-point workloads, the hybrid scheme achieves a higher compression ratio than the standalone compression methods.

Doppelganger: A Cache for Approximate Computing, Joshua San Miguel, University of Toronto

Accessing memory is 10-100x more expensive than a single floating-point operation. Caches reduce access latency but consume 30-50% of chip area.

For a wide range of workloads, all data/computations need not be precise. Approximate similarity: Two data blocks are approximately similar if replacing the content of one with that of the other doesn’t degrade application output. Whenever two cache blocks are approximately similar, space can be saved by storing only one block instead of two. This gives 77% cache space savings for a wide range of approximate computing workloads.

The LLC is divided into two equal parts, one each for precise and approximate data. Each data block is given a signature, and tag array entries point to data blocks through this signature. When the signatures of two blocks are the same, only one of them is stored in the data array, and multiple entries in the tag array point to this data block. Doppelganger is orthogonal to schemes that perform intra-block compression like BDI; by combining BDI with Doppelganger, one can achieve even higher performance and area improvements.
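
A very rough sketch of the deduplication idea is given below: blocks that map to the same approximate signature share a single copy in the data array, and tags reference that copy through the signature. The signature function here (coarse quantization of floating-point values) is purely illustrative and not the paper’s actual mechanism.

```cpp
#include <array>
#include <cstdint>
#include <unordered_map>

using Block = std::array<float, 16>;                // a 64-byte block of floats

// Illustrative signature: quantize each value coarsely and hash the results,
// so blocks with approximately similar contents tend to collide.
uint64_t signature(const Block& b) {
  uint64_t sig = 0;
  for (float v : b) {
    int q = static_cast<int>(v / 4.0f);             // coarse quantization bucket
    sig = sig * 1000003u + static_cast<uint32_t>(q);
  }
  return sig;
}

struct ApproxCache {
  std::unordered_map<uint64_t, Block> data_array;   // one copy per signature
  std::unordered_map<uint64_t, uint64_t> tags;      // block address -> signature

  void insert(uint64_t addr, const Block& b) {
    uint64_t sig = signature(b);
    tags[addr] = sig;
    data_array.emplace(sig, b);                     // stored once, shared by all tags
  }
  const Block& read(uint64_t addr) { return data_array.at(tags.at(addr)); }
};
```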

Doppelganger achieves 3x better compression ratio, 1.36x reduction in chip area and 77% reduction in cache space.

The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory, Lavanya Subramanian, Intel Labs

In a multicore system, multiple cores contend for shared resources like the shared LLC. Different applications slow down to different degrees, but in general the slowdowns are quite large. The slowdowns are also unpredictable, i.e. they depend on all the applications that share the resources.

Previous work looked at estimating the impact of interference at a per-request granularity; this work looks at aggregates of requests when characterizing interference. Shared cache access rate is a proxy for performance, hence slowdown can be expressed in terms of cache access rate rather than execution time.

In order to reduce main memory contention, requests of one application can be prioritized over another at the memory controller. Shared cache capacity contention: applications can evict each other’s blocks, and this interference is hard to minimize through prioritization alone, since there is a long warmup period before an application sees the benefits of priority. A per-application auxiliary tag store is used to count the number of contention misses, i.e. the number of blocks of an application evicted by another application.

Previous cache capacity partitioning techniques like UCP aimed at reducing the overall miss count. This might not be the best choice when one wants to improve overall application performance. This paper proposes prioritizing each application based on the relative slowdown it would experience in the shared environment.

MORC: A Manycore-Oriented Compressed Cache, Tri M. Nguyen, Princeton University

Today’s chips support an ever increasing number of threads. There is a large gap between the bandwidth demanded by the application and available bandwidth. Caches help bridge this gap and cache compression can be used to further increase the benefit of caches.

Stream compression: sequentially compresses cache lines as a single stream. Comparing stream and block-based compression schemes, stream compression achieves a much higher compression ratio, but most designs prefer block-based compression to reduce implementation overheads, since stream compression is difficult to implement with set-based caches. The paper proposes using log-based caches to reduce the implementation cost of stream compression while keeping the benefits of set-based caches.

The scheme achieves compression ratios of up to 6x and 3x on average, and maximum throughput improvements of up to 40% compared to 20% with other schemes. The scheme saves energy on expensive DRAM accesses while having negligible energy overheads for compression and decompression, so overall it delivers decent energy reductions.

Session on Prefetching

Efficiently Prefetching Complex Address Patterns, Manjunath Shevgoor, University of Utah

Prefetching is a good technique to reduce the impact of high DRAM access latency. There are two major types of prefetchers: confirmation-based and immediate.

Spatial correlation: learn delta patterns and apply them when similar conditions re-occur. There are both regular and irregular delta patterns, and for irregular delta patterns, neither stream nor sandboxed prefetchers provide timely prefetches. The paper proposes using delta history tables that store the last address, the last 4 deltas, the last predictor used, the number of times used and the last 4 prefetches.
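
To illustrate the "learn delta patterns and replay them" idea, here is a toy delta-correlation prefetcher sketch. VLDP’s actual per-page history and multi-level prediction tables are considerably more involved; the two-delta table below is only an assumption for illustration.

```cpp
#include <cstdint>
#include <deque>
#include <map>
#include <utility>
#include <vector>

// Toy delta-correlation prefetcher: remember recent address deltas, learn
// which delta tends to follow a given pair of deltas, and replay it.
struct DeltaPrefetcher {
  int64_t last_addr = -1;
  std::deque<int64_t> recent;                                // last two deltas
  std::map<std::pair<int64_t, int64_t>, int64_t> table;      // (d1, d2) -> next delta

  std::vector<int64_t> access(int64_t addr) {
    std::vector<int64_t> prefetches;
    if (last_addr >= 0) {
      int64_t delta = addr - last_addr;
      if (recent.size() == 2)                                // learn the pattern
        table[{recent[0], recent[1]}] = delta;
      recent.push_back(delta);
      if (recent.size() > 2) recent.pop_front();
      if (recent.size() == 2) {                              // replay it
        auto it = table.find({recent[0], recent[1]});
        if (it != table.end()) prefetches.push_back(addr + it->second);
      }
    }
    last_addr = addr;
    return prefetches;                                       // addresses to prefetch
  }
};
```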

In benchmarks like MG, which have both regular and irregular access patterns, prior schemes perform poorly due to interference between the two kinds of patterns. The variable length delta prefetcher (VLDP) proposed in this paper is able to differentiate between the two access patterns. The OPT issues predictions without confirmation, while the DPTs recognize irregular access patterns.

Self-Contained, Accurate Precomputation Prefetching, Islam Atta, University of Toronto

Big data workloads, with structured and semi-structured data, are heavily memory bound and prefetching is the traditional remedy for this problem.

What is a precomputation prefetcher? P-slices ought to be accurate and fast. Past work constructed P-slices manually, at compile time, or by using traces from the binary; this work focuses on the last technique. Conventional binary-based methods oversimplify P-slices. The proposed scheme, Ekivolos, accurately replicates the main thread’s execution path and maintains control flow by merging multiple traces.

Confluence: Unified Instruction Supply for Scale-Out Servers, Cansu Kaynak, EcoCloud, EPFL

Existing mechanisms to improve instruction supply: L1-I prefetchers, multi-level BTBs and BTB prefetchers. The overhead of these structures in current server processors, with large OOO cores and on-chip caches, is negligible. However, emerging scale-out processors have many more, much simpler cores and smaller on-chip caches. The goal of this work is to reduce the redundant instruction-supply state of prefetchers and BTBs in future server processors.

The key insight is that the instruction working set subsumes BTB state. BTB prefetchers record and prefetch groups of correlated BTB entries; instruction prefetchers, on the other hand, exploit temporal instruction streaming. There is thus control-flow metadata redundancy between the BTB and L1-I prefetchers. In the proposed scheme, the BTB entries of an instruction block are bundled, and each bundle in the BTB is linked to a block in the I-cache. By doing this, the BTB only has entries for instructions that are present in the I-cache: when a block is evicted from the L1-I, the corresponding entry is also evicted from the BTB. This helps reduce the size of the BTB and obviates the need for large two-level BTBs. To enable timely prefetches, Confluence uses the shared history instruction fetch (SHIFT) prefetcher proposed in prior work.

Confluence provides the same benefits as a BTB with 16x the entries of the baseline BTB while incurring only 10% of the overhead.

IMP: Indirect Memory Prefetcher, Xiangyao Yu, MIT

Operations on sparse datasets are becoming increasingly important:

  • Rise of big data analytics.
  • HPC: big gains in dense apps have drawn attention to lousy efficiency of sparse apps.

Access patterns are input-dependent and indirect. Indirect accesses are generally of the form A[B[i]]; example applications include sparse matrix-dense vector multiplication and graph algorithms. B is pre-computed and contiguously accessed; the unpredictable factor is the content of B.

The Indirect Memory Prefetcher (IMP) has both a stream table (to prefetch B[i]) and an indirect table (to prefetch A[B[i]]), which combine to generate prefetch addresses. IMP analyzes the contents of the prefetches generated by the stream prefetcher and generates entries in the indirect table.

&A[B[i]] = coeff * B[i] + base_address

The candidates for &A[B[i]] are the first several cache misses after accessing B[i]. Set up a set of linear equations to solve for coeff and base_address.
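
A small sketch of that calculation, under the assumption that two candidate (index value, miss address) pairs are available, is shown below: solving the two linear equations gives coeff (effectively the element size of A) and base_address.

```cpp
#include <cstdint>
#include <optional>

struct IndirectPattern { int64_t coeff; int64_t base; };

// Given two candidate pairs assumed to come from accesses of the form
// &A[B[i]] = coeff * B[i] + base_address, solve for coeff and base_address.
std::optional<IndirectPattern> solve(int64_t idx1, int64_t addr1,
                                     int64_t idx2, int64_t addr2) {
  if (idx1 == idx2) return std::nullopt;                 // not enough information
  int64_t num = addr1 - addr2, den = idx1 - idx2;
  if (num % den != 0) return std::nullopt;               // inconsistent candidate pair
  int64_t coeff = num / den;                             // element size of A
  return IndirectPattern{coeff, addr1 - coeff * idx1};   // base = addr - coeff * idx
}

// Once coeff/base are verified against a few more misses, the prefetcher can
// issue coeff * B[i + k] + base for upcoming indices.
```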

NOPE Tutorial – MICRO 2015

Notes from NOPE: Negative Outcomes, Post-mortems, and Experiences at MICRO 2015

Jason Mars, University of Michigan

Resilience: Keep submitting even if you get rejected multiple times. Expect your first submission to get rejected. Look at what reviewers say and refine your paper till you get the paper accepted. Students should be prepared to get 4-5 rejections for each paper.

PLP (Paper level parallelism) – Whenever you have a paper under submission, your next work should be independent of the first one to avoid having no papers for long periods.

MTTP – Mean Time To Paper.

Opportunity Analysis – Sometimes, ideas that seem great to begin with only give 3% improvement at the end.

It is really tough to do a second paper in our field. We seem to prefer incremental work.

Acoherent Shared Memory, Mark Hill, University of Wisconsin

Coherence is complex and inefficient; switch to a CVS-like checkout/checkin model, with the same performance and less energy for CPUs. But this idea doesn’t work for GPUs or accelerators – the timing was just not right. It is very unfortunate that we do not have a venue for publishing papers about great ideas that are either not practical or do not have immediate impact.

Coherence is optimized for fine-grained share-everything, but programs are not like that. Let software have control over coherence while making minimal hardware changes: a checkin/checkout model. What is the right granularity? Different memory regions can have different coherence models.

Story of grad student xxx: Did great work on record and replay before starting graduate school. Wanted to go for the home run and do really high-impact work. But because of this failed idea, someone who was destined for academia had to go into industry.

The end of the road for my career, Vijay Janapa Reddi, UT Austin

NSF CAREER award: got rejected 3 times. Mobile computing is important, but only 2 of 157 papers in top computer architecture conferences in 2010 were related to mobile computing. Themes in the reviewers’ comments:

  • Industry is solving the problem, so there is no need to fund this project in academia.
  • Problem is in the network, not in the client processor.
  • Industry is still defining the standards for mobile web and the results of the project might not have a significant impact.
  • What if the user stops using heavy websites and just uses websites that are highly optimized for mobile browsing?
  • Major source of power consumption in mobile devices is the display and not the compute engine.

It really takes time for the community to accept the importance of a new topic. There have been a lot of papers about datacenters in architecture conferences since 2010, but mobile computing is lagging behind. 2015 is for mobile computing what 2010 was for data centers.

Science advances one funeral at a time.

Exploiting Choice: An implementable approach to architecture research, Joel Emer, Nvidia Research/MIT CSAIL

Fail fast. How big is a failure? Magnitude of failure = time * work.
Research principles of operation:

  • Choose a challenge.
  • Choose an approach.
  • Evaluate early and as often as possible.

Mastering the fine art of the pivot, Todd Austin, University of Michigan

Rule-breaking research: find a rule that no one ever breaks, break it and see what happens. Litmus test: half the people in the community will hate the idea.

There is a fundamental tension between high risk, failure prone research and PhD students. Make the undergraduates do all the cool work.

Cache-conscious data placement: attempting to improve data cache performance by reordering data for better spatial locality.
Programmers do a great job of creating a good layout. Build a better layout by starting from the programmer’s layout instead of ignoring it.

Tips for PhD students:
Assume no one will ever read your paper. The figures, captions, title and section names should convince a reviewer to accept the paper.
Getting the word out is critical to an idea’s success. Talk to people about your project and give your project a good name.

3rd Workshop on Near Data Processing – WoNDP

Introduction by Rajeev Balasubramonian, University of Utah

DianNao proposes accelerators for deep learning applications, but it does near-data processing too.

‘PIM-Enabled Instructions’ (ISCA 2015) considers an often neglected dimension of NDP, i.e. programmability.

Keynote I: Automata Processing, Mircea Stan (University of Virginia)

The automata processor (AP) is a programmable silicon device capable of performing very high-speed, comprehensive search and analysis of complex, unstructured data streams. It is a massively parallel and scalable hardware architecture that implements non-deterministic finite automata (NFAs) as the core element.

A fully functional development board has been implemented. The production board is envisioned to have DDR3 memory, PCI Express connectivity and 4 AP units. The AP has hardware building blocks for all the fundamental operations in an NFA, like counter elements, state transition blocks etc. The AP can process multiple data streams simultaneously. The elements are implemented in memory, with data being very close to the building blocks.

The memory is the processor!

One can program the AP with a combination of RegEx and other low-level tools. The AP avoids the von Neumann bottleneck by replacing instruction fetch with hardware reconfiguration. There are many layers of parallelism: individual STEs, the NFA level, multiple automata, multiple streams. What the AP is not: it has no systolic arrays and no general-purpose logic. The big challenges are that program changes require a reconfiguration step, the new and low-level programming model (the biggest hurdle), the need to buy a separate board and the resurgence of FPGA acceleration.

Scaling Deep Learning on Multiple In-Memory Processors, Dongping Zhang (AMD)

A PIM architecture to accelerate deep learning applications. There are two kinds of deep learning models: deep belief networks and convolutional neural networks; this work focuses on CNNs. In implementing deep learning on PIM, two kinds of parallelism, data and model, can be exploited. The kind of parallelism exploited is different for each kind of layer in a convolutional neural network. Performance is evaluated using 256×256 randomly generated images.

Dataflow based Near Data Processing using Coarse Grain Reconfigurable Logic, Charles Shelor (U. North Texas)

Dataflow-based execution allows one to extract parallelism from the algorithm. Delay elements allow compile-time synchronization, decoupled LD/ST enhances memory accesses, and an embedded scratchpad memory is used for tables. Hardware pipelines implement histogramming, word-occurrence counting and FFT as single-cycle operations. Energy and performance results for these 3 benchmarks show up to 13x speedup and 99% reduction in energy.

Cache hierarchy design for server workloads – Challenges and opportunities

In the past week, I have been reading a few papers on the implications of server workloads on cache hierarchy design. In this post, I will discuss some of the main insights from these papers.

There has been growing interest in the performance of server workloads due to the migration from personal to cloud computing over the past decade. The adoption of cloud computing is not uniform across application domains. However, there is a general consensus that in the coming decades the bulk of computing needs will be satisfied by large data centers owned by a few companies. One main reason for this trend is the economies of scale that large data centers offer: it is cheaper for both consumers and service providers to run applications in the cloud.

Traditional microprocessor design has focused on improving the performance of desktop computers. Yes, there were designs optimized for servers, but desktop computers accounted for most of the market and it was logical to optimize designs for desktop applications. However, recent server workloads have vastly different characteristics from desktop workloads, and even from server workloads of a few decades ago. These workloads process a large number of independent requests and hence enjoy large inter-request parallelism. However, each request also processes a large amount of data, and there is little sharing of data between requests other than instructions and shared OS code. In such a scenario, two major trends emerge. Firstly, server workloads have a large instruction footprint and shared OS code and/or data. Secondly, there is little locality in the data used by these workloads: the primary working sets are small enough to fit in small private caches, while the secondary working sets are too large to fit in any practically sized on-chip cache.

In ‘High Performing Cache Hierarchies for Server Workloads’ (HPCA 2015), Jaleel et al. discuss the limitations of current cache hierarchies in running server workloads and motivate the design of future caches optimized for these workloads. Since on-chip caches can’t fit the large working sets of server workloads, it is beneficial to make the shared caches as small as possible, which improves the shared cache access latency. In addition, we need to ensure that the time spent in the cache hierarchy for instruction fetches is as low as possible: server workloads have large instruction footprints and spend a lot of time in front-end stalls due to the large cache access latencies of current designs.

To achieve this goal, the authors propose using a large L2 cache in a three-level cache hierarchy, sized such that most instruction accesses hit in the L2. Naively increasing the private cache size creates problems due to inclusion constraints. To this end, the authors propose using an exclusive cache hierarchy, as opposed to the inclusive hierarchy common in most server processors. This allows the designer to have a larger L2 while still devoting the same total area to the on-chip caches.

Earlier work by the same authors proposed high-performance cache replacement policies for shared last-level caches (LLCs), which are used in current Intel processors. However, these policies only work with inclusive caches. The authors propose some trivial changes to the replacement policy to account for exclusive caches and achieve the same performance as inclusive caches. They also present a simple replacement policy for private L2 caches that prioritizes instruction fetches over data fetches.

With these optimizations, there is a 10% improvement on most server workloads. The paper also discusses several open problems in designing caches for future server processors.

In ‘Scale-Out Processors’ (ISCA 2012), the authors discuss a new kind of processor design that is optimized for data center workloads. They compare three kinds of designs: 1) a conventional design consisting of a few large OOO cores supported by large on-chip caches; 2) a tiled architecture consisting of a large number of cores and a shared cache organized as distributed banks; 3) a pod-based design where each pod is similar to a tiled architecture, but with a smaller number of cores. Each server processor is composed of multiple pods (typically 2 or 3).

Before discussing the benefits of each design, I will elaborate on the pod-based design. In such a design, each pod runs an independent operating system and doesn’t interact with other pods except for sharing the same interface to main memory and other peripherals. The reason for having multiple pods on a single chip is to improve the die yield and reduce manufacturing costs. This also has the added benefit of simplifying the chip design and verification process.

The relative performance of the three designs on server workloads is as follows: the pod-based design performs better than the tiled design, followed by the conventional design (3 > 2 > 1). Tiled architectures support a larger number of simpler cores than a conventional processor and can better exploit the request-level parallelism inherent in server workloads. For server workloads, overall throughput is more important than single-threaded performance, and conventional designs are overkill, spending too much area and energy for too little benefit. However, tiled architectures stop being beneficial beyond a certain number of cores: as the core count increases, the time spent traversing the on-chip network for each cache access goes up, and as discussed earlier, it is very important to keep cache access latency low for server workloads.

A pod-based design builds on a tiled architecture but chooses the right number of cores to maximize overall processor performance. The main insight is to stop scaling the number of cores at a certain point to keep on-chip network latency within a certain bound. One interesting metric proposed in the paper is ‘performance normalized to overall chip area’. To traverse the huge design space, the authors use models that predict the performance, area and energy of different designs. On the whole, it is a very insightful paper on the trade-offs in building chips for future server processors.

Hope you enjoyed reading this post. Looking forward to your comments.