NOPE Tutorial – MICRO 2015

Notes from NOPE: Negative Outcomes, Post-mortems, and Experiences at MICRO 2015

Jason Mars, University of Michigan

Resilience: Keep submitting even if you get rejected multiple times. Expect your first submission to get rejected. Look at what reviewers say and refine your paper till you get the paper accepted. Students should be prepared to get 4-5 rejections for each paper.

PLP (Paper level parallelism) – Whenever you have a paper under submission, your next work should be independent of the first one to avoid having no papers for long periods.

MTTP – Mean Time To Paper.

Opportunity Analysis – Sometimes, ideas that seem great to begin with only give 3% improvement at the end.

It is really tough to do a second paper in our field. We seem to prefer incremental work.

Acoherent Shared Memory, Mark Hill, University of Wisconsin

Coherence is complex and inefficient, switch to CVS-like checkout/checkin model, same performance and less energy for CPUs. But this idea doesn’t work for GPUs or accelerators. The timing was just not right. It is very unfortunate we do not have a venue to publish papers about great ideas that are either not practical or do not have immediate impact.

Coherence is optimized for fine-grained share everything. But programs are not like that. Let software have control over coherence, but we want to make minimal hardware changes. Checkin/Checkout model. What is the granularity? Different memory regions can have different kinds of coherence models.

Story of grad student xxx: Did great work on record and replay before starting graduate school. Wanted to go for the home run and do really high-impact work. But because of this failed idea, someone who was destined for academia had to go into industry.

The end of the road for my career, Vijaya Janapa Reddi, UT Austin

NSF Career Award. Got rejected 3 times. Mobile computing is important but,  2/157 papers in top computer architecture conferences in 2010 were related to mobile computing. Themes is reviewers comments:

  • Industry is solving the problem, so need to fund this project in academia.
  • Problem is in the network, not in the client processor.
  • Industry is still defining the standards for mobile web and the results of the project might not have a significant impact.
  • What if the user stops using heavy websites and just uses websites that are highly optimized for mobile browsing?
  • Major source of power consumption in mobile devices is the display and not the compute engine.

It really takes time for the community to accept the importance of a new topic. There are a lot of papers about datacenters in architecture conferences since 2010, but mobile computing is lagging behind. 2015 is for mobile computing what 2010 was for data centers.

Science advances one funeral at a time.

Exploiting Choice: An implementable approach to architecture research, Joel Emer, Nvidia Research/MIT CSAIL

Fail fast. How big is a failure. Magnitude of failure = time * work.
Research principles of operation:
Choose a challenge
Choose an approach
Evaluate early and as often as possible.

Mastering the fine art art of pivot, Todd Austin, University of Michigan.

Rule breaking research: Find a rule that no one ever breaks, break it and see what happens. Litmus test, half the people in the community will hate the idea.

There is a fundamental tension between high risk, failure prone research and PhD students. Make the undergraduates do all the cool work.

Cache conscious data placement. Attempting to improve data cache performance by reordering data for better/spatial locality.
Programmers do a great job at creating a good layout. Build a better layout by starting from programmer’s layout instead of ignoring programmer’s layout.

Tips for phd students :
Assume no one will ever read your paper. Figures, captions, title, section names should convince a reviewer to accept the paper.
Getting the word out is critical to an idea’s success. Talk to people about your project and have a good name for your project.


3rd Workshop on Near Data Processing – WoNDP

Introduction by Rajeev Balasubramonian, University of Utah

DianNao proposes accelerators for deep learning applications, but it does near-data processing too.

PIM-enabled instructions, ISCA 2015 considers the often neglected dimension of NDP i.e. programmability.

Keynote I Automata Processing, Mircea Stan(University of Virginia)

Automata processor(AP) is a programmable silicon device capable of performing very high-speed, comprehensive search and analysis of complex, unstructured data streams. It is a massively parallel and scalable hardware architecture that implements the non-deterministic finite automata(NFA) as the core element.

A fully-functional development board is implemented. The production board is envisioned to have DDR3 Memory, PCI Express connectivity and 4 AP units.  The AP has hardware building blocks for all the fundamental operations in an NFA like counter elements, state transition blocks etc.  The AP can process multiple data streams simultaneously. The elements are implemented in memory with data being very close to the building blocks.

The memory is the processor!

One can program the AP by a combination of RegEx and other low-level tools. AP avoids the von Neumann bottleneck by replacing instruction fetch with hardware reconfiguration. There are many layers of parallelism: individual STE, NFA level, multiple automata, multiple streams. What the AP is not: Doesn’t have systolic arrays, no general purpose logic. The big challenges are, program changes require a reconfiguration step, new and low-level programming model(biggest hurdle), the need to buy a separate board and the resurgence of FPGA acceleration.

Scaling Deep Learning on Multiple In-Memory Processors, Dongping Zhang (AMD)

PIM architecture to accelerate deep learning applications. Two kinds of deep learning models : Deep belief networks and convolutional neural networks. This work focusses on CNNs. In implementing deep learning on PIM, two kinds of parallelism, data and model, can be exploited. The kind of parallelism exploited is different for each kind of layers in convolutional neural networks. Evaluate performance using 256×256 randomly generated images.

Dataflow based Near Data Processing using Coarse Grain Reconfigurable Logic, Charles Shelor(U. North Texas)

Dataflow based execution allows one to extract parallelism from algorithm. Delay elements allow compile time synchronization, decoupled LD/ST for enhanced memory accesses, embedded scratchpad memory for tables. Hardware pipelines to implement histogramming, word-occurrence count, FFT as single cycle operations. Energy and performance results for these 3 benchmarks, up to 13x speedup and 99% reduction in energy.