Modular Coherence Protocols

From gem5
Revision as of 12:47, 18 July 2008 by Beckmann (talk | contribs) (Support Hardware Prefetching without Requiring Specific Protocol Support)
Jump to: navigation, search

Introduction

We need a way to add/extend/replace cache coherence protocols in a modular fashion. The original cache module had coherence factored out into a separate object, but the interface was totally inadequate for doing anything interesting. The redesigned memory system from 2.0 integrated coherence back in to the cache module as a temporary step to eliminate the useless bloat caused by the original separation. It's now time to go back and re-introduce this modularity, but to do it right this time.

We are hugely inspired by SLICC, the coherence protocol description language/compiler that's part of the GEMS simulator from Wisconsin. Following their example, we plan on developing a similar domain-specific language (DSL) and compiler for coherence protocols. SLICC is fairly closely tied to the GEMS framework, so it's not possible to adopt it directly for M5. In addition, there are a number of aspects of SLICC that we would like to change, partly based on personal bias and partly based on areas where we think we can improve on their design. To try and come up with the best solution we can, we will approach this design as a clean sheet, and not automatically adopt any part of SLICC by default. At the same time, SLICC is a terrific working example, so many of our design decisions will probably end up being "let's do this like SLICC, except...". Correspondingly, most of the initial design requirements and preferences listed below are "things we should do differently than SLICC", with the implicit assumption that most of the issues not listed below will be handled similarly.

Design Requirements

Protocol inheritance

This is Brad's #1 request for improving on SLICC. Related protocols should be able to share common elements. New protocols that are minor modifications or extensions of existing protocols should be describable by inheriting from the existing protocol and describing only the differences.

Implicit hit path

The coherence protocol should not be involved on cache hits. Requiring a call into the protocol object on hits will drastically reduce our opportunities for performance optimization of this critical path in the simulator. As is often the case, simulation reflects reality, and no sane coherence protocol would intrude on the hit path of a real implementation either. The Tag object is currently consulted on hits (to enable replacement algorithm bookkeeping), so if someone desperately needs to collect information on hits they can implement a new Tag object to do that.

While the coherence protocol should not be involved on cache hits, there are a few situations we must handle correctly. First, the cache hit path should differentiate read permission from read/write permission. Second, the cache hit path should handle silently upgraded cache blocks from exclusive to modified state.

Support internal cache timing

Though we currently don't model any internal cache timing, we probably should eventually, and we want to make sure the language is up to the task. For example, having a getState() function that we expect to return immediately means that we can't explicitly model the tag array access latency. It might be better to assume that the tag lookup happens explicitly when a message is first processed and then assume that the controller (and thus the coherence code) is handed both the message and the tag state right from the start.

Automatic and Powerful Protocol Debugging Support

Protocol designers will likely spend the majority of their time debugging protocols. The protocol support should make debugging as easy as possible. In particular, concise and informative auto generated DPRINTFs are a high-priority requirement. The key is outputting debug traces with enough information to completely follow all state transitions while allowing a single (< 1 GB) trace to span millions of cycles. Furthermore, the protocols must detect invalid transition errors as soon as they occur.

Virtual Channels

To encounter and solve protocol deadlock issues and the coherence protocols must support virtual channels. Therefore the network connecting different memory components must support virtual channels as well.

Design Preferences

Balance Explicit States and Events vs. Attributes

Ideally, we would like to find a balance of using explicit states and events versus attributes. Currently, SLICC and M5 highlight the extremes of each approach. For example, the current M5 coherence protocol uses memory command attributes to allow clean code; see the MemCmd class in src/mem/packet.{hh,cc}. The cache block state has similar properties, though the implementation is even simpler because the current protocol doesn't need any non-transient state beyond the basics (valid, dirty, exclusive). See src/mem/cache/blk.hh. The MemCmd structure is pretty good as is, but needs to be extended to support inheritance. While attributes work well for simple and abstract protocols, it is likely an attribute only approach is sufficient to describe a complex and more realistic protocol.

In contrast, SLICC uses different explicit events and states to signify all easily visible protocol differences. The advantage of this approach is all differences in events and states can be easily tracked in the protocol. Therefore many errors in complicated protocols can be found with easily traceable invalid transition errors, rather than hard to isolate incorrect attribute handling errors. The disadvantage is the language specification files are quite long and verbose, and are susceptible to cut-and-paste errors. SLICC does support attributes as well, but they are only utilized within actions.

Overall, the best solution may be provide explicit event and state support, but ensure attributes are treated equal to states and events when it comes to automatic debugging support.

Encode Protocol-Independent State with Protocol-Specific State

Ideally, we would like the stored cache state to encompas both the protocol-independent and protocol-specific state. For example, M5 currently supports CacheBlkStatusBits in src/mem/cache/blk.hh which for the most part are protocol independent (except the dirty bit). The block state model needs to be expanded to allow protocol-specific state, but we should consider continuing to use encoding tricks to avoid redundantly storing both the block access bits and the full coherence state separately. Therefore, with protocol-independent access permissions encoded in the state, the cache access path can determine hits and misses without invoking the cache controller logic and without requiring separate set permission logic.

In contrast, SLICC doesn't directly associate semantics with requests or states, and you end up with code like this:

     // Set permission
     if ((state == State:I) || (state == State:MI_A) || (state == State:II_A)) {
       changePermission(addr, AccessPermission:Invalid);
     } else if (state == State:S || state == State:O) {
       changePermission(addr, AccessPermission:Read_Only);
     } else if (state == State:M) {
       changePermission(addr, AccessPermission:Read_Write);
     } else {
       changePermission(addr, AccessPermission:Busy);
     }

Support Hardware Prefetching without Requiring Specific Protocol Support

Another advantage with encoding protocol-independent state within the cache state, is hardware prefetching can possibly be supported in all designs without requiring specific protocol support. We should be careful to ensure hardware prefetches don't impede demand requests. Hopefully, we can enforce this policy without impacting the protocol specific logic. In contrast, GEMS requires each protocol to explicitly handle hardware prefetches separately.

Fixed MSHR Model

A fixed MSHR model will allow all subsequent misses "underneath" the initial miss to be treated consistently across protocols. It is unclear if all interesting cache protocols can be implemented with a fixed MSHR model, but if possible, there could be a significant complexity advantage. In particular, the resource requirements of support multiple outstanding requests can be dealt with realistically. In contrast, we could require the protocol designer to create a protocol-specific MSHR model. This was the GEMS solution and it caused protocols to either unnecessarily stall the request queue or unrealistically "recycle" the request queue.

No new imperative language

Let's not invent a new language for imperative computation. I admit that the "treat C++ code blocks as strings" approach of the ISA description language is a bit cheesy, but for the most part it works. The main drawback is that you can't parse the code to analyze its behavior (though the ISA parser extracts a lot of information in an imperfect way using regexps; see Code parsing), but I think we'll have even less need of that here than the ISA parser does.

Even if we do feel the need to fully parse imperative computations, or to restrict what people can do in them, I'd rather see us do a subset of C++ than invent something new. (That might actually be useful if we can reuse the C++-subset parser for the ISA description language.) Unfortunately, this is easier said than done; one big challenge is getting the context set up right so the parser can figure out which symbols are types, preprocessor macros are expanded, etc.

Note that some of the SLICC design notes talk about wanting to make the language synthesizable to hardware, which is a reasonable motivation for them to have a limited imperative language. We don't have that goal, so there's much less reason to be restrictive.

Design Challenges

Interfacing generic with protocol-specific bits

Though it's reasonable to specialize caches and interconnects based on the coherence protocol, we don't want to recompile every object that connects to the memory system (CPUs, devices, etc.) for the protocol too. Yet these devices do need to send out and respond to basic memory requests (read, write, instruction fetch, etc.). One possible solution is to leverage inheritance, and have a base coherence protocol that defines the set of messages which all CPUs and devices can use. This base protocol will probably be abstract (if only by having all the required functions call fatal()). Then generic memory system clients can depend only on the base protocol, and as long as the derived protocols extend it in a compatible fashion, everything will work.

Note that this solution will constrain how we implement inheritance: we can't do it entirely in the DSL and have it spit out flattened C++, as we'll need this base protocol to be separate in C++ so that the compilation of CPU and device models can be independent of the specific protocol.

System Configuration Issues

(Steve I'm not sure what public information you want to discuss here.)

Basic Design

There's not much here yet, unfortunately.

Simulator implementation

The coherence protocol implementation (which will eventually be auto-generated from the DSL description) should consist of one or more C++ objects. The cache and interconnect models will be modified to take a coherence protocol object as a template parameter. The protocol object(s) will define types used to define the allowable commands, define the protocol-specific state included in Packet and CacheBlk classes, etc.

  • Should the C++ protocol object(s) form an inheritance hierarchy that mirrors the inheritance relationship of the protocols in the DSL?

Domain-specific language

We should write the parser in PLY and the compiler in Python because we know them, and to avoid introducing any new tool dependencies.

Development Plan

  1. Separate the current coherence protocol from the cache module into a separate policy object (or objects). This will give us an initial interface definition and an idea of what kind of code the description language will need to generate.
  2. Implement a few different protocols using this interface, expanding and modifying it as needed. Ideally these protocols will be significantly different from the base protocol, e.g., directory based, point-to-point, etc. This will help us flesh out the interface, and get started on implementing features that we'll need for more advanced protocols (such as virtual channels) that will be reflected in the description language but are largely orthogonal to it. Some candidates would be Geoff Blake's LogTM implementation or a manual translation of an existing SLICC protocol from the GEMS distribution. A side benefit will be that those of us who need a particular protocol can implement it at this stage and not have the description language be on their critical path. There is the down side that there will be some extra effort to reimplement these protocols in the description language, but if our language is as concise as it should be then this effort should be minimal.
  3. Design and implement the description language and associated compiler. This work can overlap with the previous step somewhat, but we should avoid getting too far ahead of ourselves in case the previous step uncovers some features that require significant modifications to the language structure.

--Brad 13:41, 18 July 2008 (EDT)