Transactional Memory and Lock Elision

I started reading a post on recent trends in hardware and their implications on distributed systems design. One of the recent additions that could make parallel programming much easier is the TSX extensions in Intel Haswell processors.


It comes in two flavors, a backwards-compatible Hardware Lock Elision (HLE) instruction set, and a more flexible forward-looking Restricted Transactional Memory (RTM) instruction set. Here are a few blog posts and papers that give a good overview of Intel TSX and it’s origins in academia.


Brief introduction into Intel TSX and it’s implications for the industry

Analysis of Haswell’s transactional memory by David Kanter

Hardware Lock Elision on Haswell

Lock elision support in gcc and linux

The original paper on Speculative Lock Elision: Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution




NOPE Tutorial – MICRO 2015

Notes from NOPE: Negative Outcomes, Post-mortems, and Experiences at MICRO 2015

Jason Mars, University of Michigan

Resilience: Keep submitting even if you get rejected multiple times. Expect your first submission to get rejected. Look at what reviewers say and refine your paper till you get the paper accepted. Students should be prepared to get 4-5 rejections for each paper.

PLP (Paper level parallelism) – Whenever you have a paper under submission, your next work should be independent of the first one to avoid having no papers for long periods.

MTTP – Mean Time To Paper.

Opportunity Analysis – Sometimes, ideas that seem great to begin with only give 3% improvement at the end.

It is really tough to do a second paper in our field. We seem to prefer incremental work.

Acoherent Shared Memory, Mark Hill, University of Wisconsin

Coherence is complex and inefficient, switch to CVS-like checkout/checkin model, same performance and less energy for CPUs. But this idea doesn’t work for GPUs or accelerators. The timing was just not right. It is very unfortunate we do not have a venue to publish papers about great ideas that are either not practical or do not have immediate impact.

Coherence is optimized for fine-grained share everything. But programs are not like that. Let software have control over coherence, but we want to make minimal hardware changes. Checkin/Checkout model. What is the granularity? Different memory regions can have different kinds of coherence models.

Story of grad student xxx: Did great work on record and replay before starting graduate school. Wanted to go for the home run and do really high-impact work. But because of this failed idea, someone who was destined for academia had to go into industry.

The end of the road for my career, Vijaya Janapa Reddi, UT Austin

NSF Career Award. Got rejected 3 times. Mobile computing is important but,  2/157 papers in top computer architecture conferences in 2010 were related to mobile computing. Themes is reviewers comments:

  • Industry is solving the problem, so need to fund this project in academia.
  • Problem is in the network, not in the client processor.
  • Industry is still defining the standards for mobile web and the results of the project might not have a significant impact.
  • What if the user stops using heavy websites and just uses websites that are highly optimized for mobile browsing?
  • Major source of power consumption in mobile devices is the display and not the compute engine.

It really takes time for the community to accept the importance of a new topic. There are a lot of papers about datacenters in architecture conferences since 2010, but mobile computing is lagging behind. 2015 is for mobile computing what 2010 was for data centers.

Science advances one funeral at a time.

Exploiting Choice: An implementable approach to architecture research, Joel Emer, Nvidia Research/MIT CSAIL

Fail fast. How big is a failure. Magnitude of failure = time * work.
Research principles of operation:
Choose a challenge
Choose an approach
Evaluate early and as often as possible.

Mastering the fine art art of pivot, Todd Austin, University of Michigan.

Rule breaking research: Find a rule that no one ever breaks, break it and see what happens. Litmus test, half the people in the community will hate the idea.

There is a fundamental tension between high risk, failure prone research and PhD students. Make the undergraduates do all the cool work.

Cache conscious data placement. Attempting to improve data cache performance by reordering data for better/spatial locality.
Programmers do a great job at creating a good layout. Build a better layout by starting from programmer’s layout instead of ignoring programmer’s layout.

Tips for phd students :
Assume no one will ever read your paper. Figures, captions, title, section names should convince a reviewer to accept the paper.
Getting the word out is critical to an idea’s success. Talk to people about your project and have a good name for your project.

3rd Workshop on Near Data Processing – WoNDP

Introduction by Rajeev Balasubramonian, University of Utah

DianNao proposes accelerators for deep learning applications, but it does near-data processing too.

PIM-enabled instructions, ISCA 2015 considers the often neglected dimension of NDP i.e. programmability.

Keynote I Automata Processing, Mircea Stan(University of Virginia)

Automata processor(AP) is a programmable silicon device capable of performing very high-speed, comprehensive search and analysis of complex, unstructured data streams. It is a massively parallel and scalable hardware architecture that implements the non-deterministic finite automata(NFA) as the core element.

A fully-functional development board is implemented. The production board is envisioned to have DDR3 Memory, PCI Express connectivity and 4 AP units.  The AP has hardware building blocks for all the fundamental operations in an NFA like counter elements, state transition blocks etc.  The AP can process multiple data streams simultaneously. The elements are implemented in memory with data being very close to the building blocks.

The memory is the processor!

One can program the AP by a combination of RegEx and other low-level tools. AP avoids the von Neumann bottleneck by replacing instruction fetch with hardware reconfiguration. There are many layers of parallelism: individual STE, NFA level, multiple automata, multiple streams. What the AP is not: Doesn’t have systolic arrays, no general purpose logic. The big challenges are, program changes require a reconfiguration step, new and low-level programming model(biggest hurdle), the need to buy a separate board and the resurgence of FPGA acceleration.

Scaling Deep Learning on Multiple In-Memory Processors, Dongping Zhang (AMD)

PIM architecture to accelerate deep learning applications. Two kinds of deep learning models : Deep belief networks and convolutional neural networks. This work focusses on CNNs. In implementing deep learning on PIM, two kinds of parallelism, data and model, can be exploited. The kind of parallelism exploited is different for each kind of layers in convolutional neural networks. Evaluate performance using 256×256 randomly generated images.

Dataflow based Near Data Processing using Coarse Grain Reconfigurable Logic, Charles Shelor(U. North Texas)

Dataflow based execution allows one to extract parallelism from algorithm. Delay elements allow compile time synchronization, decoupled LD/ST for enhanced memory accesses, embedded scratchpad memory for tables. Hardware pipelines to implement histogramming, word-occurrence count, FFT as single cycle operations. Energy and performance results for these 3 benchmarks, up to 13x speedup and 99% reduction in energy.

Cache hierarchy design for server workloads – Challenges and opportunities

In the past week, I have been reading a few papers on the implications of server workloads on cache hierarchy design. In this post, I will discuss some of the main insights from these papers.

There has been a growing interest in the performance of server workloads due to the migration from personal to cloud computing in the past decade. The adoption of cloud computing in different application domains in not uniform. However, there is a general consensus that in the coming decades, a bulk of the computing needs will be satisfied by large data centers owned by a few companies. One main reason for this trend is the economies of scale that large data centers offer. It is cheaper for both consumers and service providers to run applications in the cloud.

Traditional microprocessor design has focused on improving the performance of desktop computers. Yes, there were designs optimized for servers but desktop computers accounted for most of the market and it was logical to optimize your designs for desktop applications. However, recent server workloads have vastly different characteristics from desktop workloads and even server workloads from a few decades ago. These workloads process a large number of independent requests and hence enjoy large inter-request parallelism. However, each request also processes a large amount of data and there is little sharing of data between requests other than instructions and shared OS code. In such a scenario, we see two major trends emerging. Firstly, server workloads have a large instruction footprint and shared OS code and/or data. Secondly, there is little locality in data used by these workloads : the primary working sets are small enough to fit small private caches and the secondary working sets are too large to fit in any practically sized on chip caches.

In ‘High Performing Cache Hierarchies for Server Workloads’ HPCA 2015, Jaleel discuss the limitations of current cache hierarchies in running server workloads and motivate the design of future caches optimized to these workloads. Since on chip caches can’t fit the large working sets of server workloads, it is beneficial to make the shared caches as small possible. This improves the shared cache access latency. In addition, we need to ensure the time spent in the cache hierarchy for instruction fetches should be as low as possible. Server workloads have large instruction footprint and spend a lot of time in front end stalls, due to large cache access latency in current designs.

To achieve this goal, the authors propose using a large L2 cache in a three-level cache hierarchy. The L2 should be sized such that most instruction accesses hit in the L2. Naively, increasing the private cache size created problems due to inclusion constraints.  To this end, the authors propose using an exclusive cache hierarchy as opposed to an inclusive cache hierarchy common in most server processors. This allows the designer to have a larger L2 while still devoting the same area to all the on-chip caches.

Earlier work by the same authors has proposed high performance cache replacement policies for shared last-level caches(LLC). These replacement policies are used in current Intel processors. However, these policies only work with inclusive caches. The authors propose some trivial changes to the replacement to account for exclusive caches and achieve the same performance as inclusive caches. The authors also present a simple replacement policy for private L2 caches that prioritizes instruction fetches over data fetches.

With these optimizations, there is 10% improvement on most sever workloads. The paper also discusses several open problems in designing caches in future server processors.

In ‘Scale Out Processors’ ISCA 2012, the authors discuss a new kind of processor design that is optimized for deployment in data center workloads. They compare three kinds of designs : 1) a conventional design consisting of a few large OOO cores supported by large on-chip caches. 2) a tiled architecture consisting of a large number of cores and the shared cache organized as distributed banks. 3) A pod-based design where each pod is similar to a tiled architecture, but with smaller number of cores. Each server processor is composed of multiple pods(typically 2 or 3).

Before discussing the benefits of each design, I will elaborate on the pod-based design. In such a design, each pod runs an independent operating system and doesn’t interact with other pods except for sharing the same interface to main memory and other peripherals. The reason for having multiple pods on a single chip is to improve the die yield and reduce manufacturing costs. This also has the added benefit of simplifying the chip design and verification process.

The relative performance of the three designs on server workloads is as follows: pod-based design performs better than a tiled designs followed by the conventional design( 3 > 2 > 1 ).  Tiled architectures support a larger number of simpler cores as compared to a conventional processor and can better exploit the request-level parallelism inherent in server workloads. For server workloads, overall throughput is more important  than single-threaded performance and conventional designs are an overkill, spending too much area and energy for too little benefit. However, tiled-architectures stop being beneficial at a certain number of cores. This is because, as core count increases, for each cache access the time spent traversing the on-chip network goes up. As discussed earlier, it is very important to reduce cache access latency for server workloads.

A pod-based design builds on a tiled architecture but chooses the right number of cores to maximize overall processor performance. The main insight is to stop scaling the number of cores at a certain point to keep on-chip network latency within a certain bound. One interesting metric proposed in the paper is ‘performance normalized to overall chip area’. To traverse the huge design space, the author use models that predict the performance, area and energy of different designs. On the whole, it is a very insightful paper on the trade-offs in building chips for future server processors.

Hope you enjoyed reading this post. Looking forward to your comments.