Cache hierarchy design for server workloads – Challenges and opportunities

Over the past week, I have been reading a few papers on the implications of server workloads for cache hierarchy design. In this post, I will discuss some of the main insights from these papers.

There has been a growing interest in the performance of server workloads due to the migration from personal to cloud computing in the past decade. The adoption of cloud computing is not uniform across application domains. However, there is a general consensus that in the coming decades, the bulk of computing needs will be satisfied by large data centers owned by a few companies. One main reason for this trend is the economies of scale that large data centers offer: it is cheaper for both consumers and service providers to run applications in the cloud.

Traditional microprocessor design has focused on improving the performance of desktop computers. Yes, there were designs optimized for servers, but desktop computers accounted for most of the market and it was logical to optimize designs for desktop applications. However, recent server workloads have vastly different characteristics from desktop workloads, and even from server workloads of a few decades ago. These workloads process a large number of independent requests and hence enjoy ample inter-request parallelism. However, each request also processes a large amount of data, and there is little sharing of data between requests other than instructions and shared OS code. In such a scenario, two major trends emerge. Firstly, server workloads have a large instruction footprint and shared OS code and/or data. Secondly, there is little locality in the data used by these workloads: the primary working sets are small enough to fit in small private caches, while the secondary working sets are too large to fit in any practically sized on-chip cache.

In ‘High Performing Cache Hierarchies for Server Workloads’ (HPCA 2015), Jaleel et al. discuss the limitations of current cache hierarchies in running server workloads and motivate the design of future caches optimized for these workloads. Since on-chip caches can’t fit the large working sets of server workloads, it is beneficial to make the shared caches as small as possible, which improves the shared cache access latency. In addition, we need to ensure that the time spent in the cache hierarchy for instruction fetches is as low as possible. Server workloads have a large instruction footprint and spend a lot of time in front-end stalls due to the large cache access latencies of current designs.
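To see why this matters, here is a quick back-of-the-envelope calculation of the average memory access time (AMAT) seen by instruction fetches. The latencies and miss rates below are numbers I made up for illustration, not figures from the paper; the point is just that when the L1-I misses often, a fetch that has to go all the way to a slow shared cache hurts far more than one that stops at the L2.

```python
# Back-of-envelope average memory access time (AMAT) for instruction fetches.
# All latencies (in cycles) and miss rates are illustrative numbers, not
# figures from the paper.

def amat(l1_hit, l1_miss, l2_hit, l2_miss, llc_hit, llc_miss, mem):
    """Average fetch latency, in cycles, for a three-level hierarchy."""
    return l1_hit + l1_miss * (l2_hit + l2_miss * (llc_hit + llc_miss * mem))

# Assume 10% of fetches miss a small L1-I because of the large code footprint.
small_l2 = amat(l1_hit=2, l1_miss=0.10,
                l2_hit=12, l2_miss=0.50,   # small L2: half the L1 misses go on to the LLC
                llc_hit=40, llc_miss=0.05, mem=200)
large_l2 = amat(l1_hit=2, l1_miss=0.10,
                l2_hit=14, l2_miss=0.05,   # large L2: almost all fetches stop here
                llc_hit=40, llc_miss=0.05, mem=200)

print(f"small L2: {small_l2:.2f} cycles per instruction fetch")
print(f"large L2: {large_l2:.2f} cycles per instruction fetch")
```

Even with identical LLC and memory behavior, catching most instruction misses in the L2 cuts the average fetch latency substantially in this toy setup.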

To achieve this goal, the authors propose using a large L2 cache in a three-level cache hierarchy. The L2 should be sized such that most instruction accesses hit in the L2. Naively increasing the private cache size creates problems due to inclusion constraints. To this end, the authors propose using an exclusive cache hierarchy as opposed to the inclusive cache hierarchy common in most server processors. This allows the designer to have a larger L2 while still devoting the same total area to the on-chip caches.
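The capacity argument is easy to see with a small sketch: under inclusion, every L2 line is duplicated in the LLC, so the unique on-chip capacity is just the LLC, whereas under exclusion the L2 and LLC hold disjoint lines and their capacities add. The sizes below are placeholders I picked, not the configurations evaluated in the paper.

```python
# Unique on-chip capacity seen by a core under inclusive vs. exclusive
# hierarchies (sizes in KB; placeholder values, not the paper's configurations).

def unique_capacity(l2_kb, llc_slice_kb, policy):
    if policy == "inclusive":
        # every L2 line is duplicated in the LLC, so the L2 adds no unique capacity
        return llc_slice_kb
    if policy == "exclusive":
        # L2 and LLC hold disjoint lines, so their capacities add
        return l2_kb + llc_slice_kb
    raise ValueError(policy)

AREA_BUDGET_KB = 2560   # fixed L2 + LLC-slice budget per core (assumed)
for l2_kb in (256, 512, 1024):
    llc_kb = AREA_BUDGET_KB - l2_kb
    print(f"L2={l2_kb:4d}KB  inclusive={unique_capacity(l2_kb, llc_kb, 'inclusive'):4d}KB"
          f"  exclusive={unique_capacity(l2_kb, llc_kb, 'exclusive'):4d}KB")
```

Under a fixed area budget, growing the L2 in an inclusive hierarchy shrinks the unique capacity, while an exclusive hierarchy keeps it constant.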

Earlier work by the same authors proposed high-performance cache replacement policies for shared last-level caches (LLCs), and these replacement policies are used in current Intel processors. However, these policies only work with inclusive caches. The authors propose some simple changes to the replacement policy to account for exclusive caches and achieve the same performance as inclusive caches. The authors also present a simple replacement policy for private L2 caches that prioritizes instruction fetches over data fetches.
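Here is a minimal sketch of one way to prioritize instruction lines over data lines in the L2, loosely in the style of RRIP; it is my own illustration of the idea, not the exact policy proposed in the paper.

```python
# Minimal sketch of an L2 replacement policy that prefers to keep code lines.
# Loosely RRIP-flavoured; my illustration of the idea, not the paper's policy.

MAX_RRPV = 3   # re-reference prediction value; larger means "evict sooner"

class CacheSet:
    def __init__(self, ways):
        self.lines = [{"tag": None, "rrpv": MAX_RRPV, "is_code": False}
                      for _ in range(ways)]

    def access(self, tag, is_code):
        # hit: predict near-immediate reuse
        for line in self.lines:
            if line["tag"] == tag:
                line["rrpv"] = 0
                return "hit"
        # miss: evict a line predicted to be reused furthest in the future,
        # aging the whole set until such a victim exists
        while True:
            for line in self.lines:
                if line["rrpv"] == MAX_RRPV:
                    # instruction fills get a lower RRPV than data fills,
                    # so code lines survive the aging process longer
                    line.update(tag=tag, is_code=is_code,
                                rrpv=1 if is_code else MAX_RRPV - 1)
                    return "miss"
            for line in self.lines:
                line["rrpv"] += 1
```

With this insertion bias, a stream of data fills tends to evict other data lines before it evicts a recently inserted code line.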

With these optimizations, there is a 10% improvement on most server workloads. The paper also discusses several open problems in designing caches for future server processors.

In ‘Scale-Out Processors’ (ISCA 2012), the authors discuss a new kind of processor design that is optimized for data-center workloads. They compare three kinds of designs: 1) a conventional design consisting of a few large out-of-order (OOO) cores supported by large on-chip caches; 2) a tiled architecture consisting of a large number of cores, with the shared cache organized as distributed banks; and 3) a pod-based design, where each pod is similar to a tiled architecture but with a smaller number of cores, and each server processor is composed of multiple pods (typically 2 or 3).

Before discussing the benefits of each design, I will elaborate on the pod-based design. In such a design, each pod runs an independent operating system and doesn’t interact with other pods except for sharing the same interface to main memory and other peripherals. The reason for having multiple pods on a single chip is to improve the die yield and reduce manufacturing costs. This also has the added benefit of simplifying the chip design and verification process.

The relative performance of the three designs on server workloads is as follows: the pod-based design performs better than the tiled design, which in turn performs better than the conventional design (3 > 2 > 1). Tiled architectures support a larger number of simpler cores than a conventional processor and can better exploit the request-level parallelism inherent in server workloads. For server workloads, overall throughput is more important than single-threaded performance, and conventional designs are overkill, spending too much area and energy for too little benefit. However, tiled architectures stop being beneficial beyond a certain number of cores: as the core count increases, the time spent traversing the on-chip network on each cache access goes up. As discussed earlier, it is very important to keep cache access latency low for server workloads.
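The network argument can be made concrete with a crude model: in a 2D mesh, the average number of hops between a core and a random LLC bank grows roughly with the square root of the core count, so the shared-cache access latency keeps climbing as tiles are added. The cycle counts here are assumptions of mine, not measurements from the paper.

```python
import math

# Crude model of LLC access latency in a tiled design with a 2D mesh NoC.
# The average distance between two random nodes of a k x k mesh is roughly
# (2/3) * k hops. Cycle counts are assumptions for illustration only.

HOP_CYCLES = 3        # router + link traversal per hop (assumed)
BANK_CYCLES = 10      # LLC bank access time (assumed)

def llc_latency(cores):
    k = math.sqrt(cores)                    # mesh dimension (assume a square mesh)
    avg_hops = (2.0 / 3.0) * k              # core -> random LLC bank, approximate
    return BANK_CYCLES + 2 * avg_hops * HOP_CYCLES   # request plus reply traversal

for n in (16, 32, 64, 128, 256):
    print(f"{n:3d} cores: ~{llc_latency(n):5.1f} cycles to a random LLC bank")
```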

A pod-based design builds on a tiled architecture but chooses the right number of cores to maximize overall processor performance. The main insight is to stop scaling the number of cores at a certain point to keep on-chip network latency within a certain bound. One interesting metric proposed in the paper is ‘performance normalized to overall chip area’. To traverse the huge design space, the authors use models that predict the performance, area and energy of different designs. On the whole, it is a very insightful paper on the trade-offs in building chips for future server processors.
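As a toy illustration of that kind of sweep, the sketch below scores pod sizes by throughput per unit area, where per-core throughput degrades as the mesh latency from the previous sketch grows and each pod pays a fixed area overhead for its uncore. Every constant and the performance model itself are placeholders of mine, not the models from the paper.

```python
import math

# Toy design-space sweep: pick the pod size (cores per pod) that maximizes
# throughput per unit area. All constants and the performance model are my
# own placeholders, not the models used in the paper.

CORE_AREA = 2.0            # mm^2 per simple core (assumed)
LLC_AREA_PER_CORE = 1.0    # mm^2 of LLC slice per core (assumed)
POD_OVERHEAD_AREA = 30.0   # mm^2 per pod for memory interface, I/O, etc. (assumed)
LLC_APKI = 50              # LLC accesses per kilo-instruction (assumed)

def llc_latency(cores):
    # same crude 2D-mesh model as in the previous sketch
    return 10 + 2 * (2.0 / 3.0) * math.sqrt(cores) * 3

def pod_throughput(cores):
    # base CPI of 1.0 plus stall cycles for LLC accesses; more cores -> longer NoC trips
    cpi = 1.0 + (LLC_APKI / 1000.0) * llc_latency(cores)
    return cores / cpi                      # aggregate instructions per cycle

def perf_per_area(cores):
    area = POD_OVERHEAD_AREA + cores * (CORE_AREA + LLC_AREA_PER_CORE)
    return pod_throughput(cores) / area

best = max(range(8, 257, 8), key=perf_per_area)
for c in (16, 32, 64, 128, 256):
    print(f"{c:3d} cores/pod: {perf_per_area(c):.4f} IPC per mm^2")
print(f"best pod size under this toy model: {best} cores")
```

The interesting part is that the optimum is interior: too few cores waste the fixed pod overhead, while too many cores drag down per-core throughput through network latency.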

Hope you enjoyed reading this post. Looking forward to your comments.

3 thoughts on “Cache hierarchy design for server workloads – Challenges and opportunities”

  1. Thanks for the post. I did try to understand the scale-out processors work. These are the questions from my limited understanding of the work when I try to design a chip specific to a datacenter; my thoughts might be along the lines of an HPC perspective. @scale-out processors: I have the following questions about the scale-out processors work. Don’t we need heterogeneous cores (I choose to concentrate on programming the heterogeneous cores, single/multiple ISA, later) in a pod? Say, in a datacenter environment, we have a bunch of batch and latency-critical applications executing concurrently. Will a bunch of simple homogeneous cores alone satisfy the problem? Don’t we need (a / a set of) customized cores that treat applications based on their different communication/branching behavior? I remember reading a paper from ISCA’15, ‘Accelerating Asynchronous Programs through Event Sneak Peek’, which has a customized core that prefetches asynchronous programs’ instructions into the Level-1 I-cache. Are future-generation server chips going to be homogeneous or heterogeneous? Looking forward to reading your thoughts!


    1. I think it makes sense to use heterogeneous cores in a pod to improve performance and energy. However, in practice, datacenter operators have a hard time mapping applications to different types of cores. They prefer to have homogeneous systems as much as possible. But this could change in the future.

