Difference between revisions of "Ruby"

From gem5
Jump to: navigation, search
(Controller Description)
(Update replacement policy information)
 
(93 intermediate revisions by 8 users not shown)
Line 1: Line 1:
==Ruby==
+
== High level components of Ruby ==
=== High level components of Ruby ===
 
  
Ruby implements a detailed simulation model for the memory subsystem. It models inclusive/exclusive cache hierarchies with various replacement policies, coherence protocol implementations, interconnection networks, DMA and memory controllers, various sequencers that initiate memory requests and handle responses. The models are modular, flexible and highly configurable. Three key aspects of these models are:
+
Ruby implements a detailed simulation model for the memory subsystem. It models inclusive/exclusive cache hierarchies with various [[Replacement_policy|replacement policies]], coherence protocol implementations, interconnection networks, DMA and memory controllers, various sequencers that initiate memory requests and handle responses. The models are modular, flexible and highly configurable. Three key aspects of these models are:
  
 
# Separation of concerns -- for example, the coherence protocol specifications are separate from the replacement policies and cache index mapping, the network topology is specified separately from the implementation.
 
# Separation of concerns -- for example, the coherence protocol specifications are separate from the replacement policies and cache index mapping, the network topology is specified separately from the implementation.
Line 11: Line 10:
 
[[File:ruby_overview.jpg|600px|center]]
 
[[File:ruby_overview.jpg|600px|center]]
  
==== SLICC + Coherence protocols: ====
+
=== SLICC + Coherence protocols: ===
    Need to say what is SLICC and whats its purpose.
+
 
    Talk about high level strcture of a typical coherence protocol file, that SLICC uses to generate code.
+
'''''[[SLICC]]''''' stands for ''Specification Language for Implementing Cache Coherence''. It is a domain specific language that is used for specifying cache coherence protocols. In essence, a cache coherence protocol behaves like a state machine. SLICC is used for specifying the behavior of the state machine. Since the aim is to model the hardware as close as possible, SLICC imposes constraints on the state machines that can be specified. For example, SLICC can impose restrictions on the number of transitions that can take place in a single cycle. Apart from protocol specification, SLICC also combines together some of the components in the memory model. As can be seen in the following picture, the state machine takes its input from the input ports of the inter-connection network and queues the output at the output ports of the network, thus tying together the cache / memory controllers with the inter-connection network itself.
    A simple example structure from protocol like MI_example can help here.
 
 
 
SLICC stands for ''Specification Language for Implementing Cache Coherence''. It is a domain specific language that is used for specifying cache coherence protocols. In essence, a cache coherence protocol behaves like a state machine. SLICC is used for specifying the behavior of the state machine. Since the aim is to model the hardware as close as possible, SLICC imposes constraints on the state machines that can be specified. For example, SLICC can impose restrictions on the number of transitions that can take place in a single cycle. Apart from protocol specification, SLICC also combines together some of the components in the memory model. As can be seen in the following picture, the state machine takes its input from the input ports of the inter-connection network and queues the output at the output ports of the network, thus tying together the cache / memory controllers with the inter-connection network itself.
 
  
 
[[File:slicc_overview.jpg|700px|center]]
 
[[File:slicc_overview.jpg|700px|center]]
  
The language it self is syntactically similar to C/C++. The SLICC compiler generates C++ code for the state machine. Additionally it can also generate HTML documentation for the protocol.
+
The following cache coherence protocols are supported:
 
 
''Nilay will do it''
 
 
 
==== Protocol independent memory components ====
 
# Cache Memory
 
# Replacement Policies
 
# Memory Controller
 
 
 
''Arka will do it''
 
 
 
==== Interconnection Network ====
 
 
 
The interconnection network connects the various components of the memory hierarchy (cache, memory, dma controllers) together. There are 3 key parts to this:
 
 
# '''Topology specification''': These are specified with python files that describe the topology (mesh/crossbar/ etc.), link latencies and link bandwidth. The following picture shows a high level view of several well-known network topologies, all of which can be easily specified.
 
[[File:topology_overview.jpg|1200px|center]]
 
# '''Cycle-accurate network model''': [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4919636 Garnet] is a cycle accurate, pipelined, network model that builds and simulates the specified topology. It simulates the router pipeline and movement of flits across the network subject to the routing algorithm, latency and bandwidth constraints.
 
# '''Network Power model''': The [http://www.princeton.edu/~peh/orion.html Orion] power model is used to keep track of router and link activity in the network. It calculates both router static power and link and router dynamic power as flits move through the network.
 
 
 
More details about the network model implementation are described [[#Interconnection_Network_2|here]]
 
 
 
=== Implementation of Ruby ===
 
 
 
==== Directory Structure ====
 
 
 
* '''src/mem/'''
 
** '''protocols''': SLICC specification for coherence protocols
 
** '''slicc''': implementation for SLICC parser and code generator
 
** '''ruby'''
 
*** '''buffers''': implementation for message buffers that are used for exchanging information between the cache, directory, memory controllers and the interconnect
 
*** '''common''': frequently used data structures, e.g. Address (with bit-manipulation methods), histogram, data block, basic types (int32, uint64, etc.)
 
*** '''eventqueue''': Ruby’s event queue API for scheduling events on the gem5 event queue
 
*** '''filters''': various Bloom filters (stale code from GEMS)
 
*** '''network''': Interconnect implementation, sample topology specification, network power calculations
 
*** '''profiler''': Profiling for cache events, memory controller events
 
*** '''recorder''':  Cache warmup and access trace recording
 
*** '''slicc_interface''': Message data structure, various mappings (e.g. address to directory node), utility functions (e.g. conversion between address & int, convert address to cache line address)
 
*** '''system''': Protocol independent memory components – CacheMemory, DirectoryMemory, Sequencer, RubyPort
 
 
 
==== SLICC ====
 
    Explain functionality/ capability of SLICC
 
    Talk about
 
    AST, Symbols, Parser and code generation in some details but NO need to cover every file and/or functions.
 
    Few examples should suffice.
 
 
 
''Nilay will do it''
 
 
 
SLICC is a domain specific language for specifying cache coherence protocols. The SLICC compiler generates C++ code for different controllers, which can work in tandem with other parts of Ruby. The compiler also generates an HTML specification of the protocol.
 
 
 
===== Input To the Compiler =====
 
 
 
The SLICC compiler takes as input files that specify the controllers involved in the protocol. The .slicc file specifies the different files used by the particular protocol under consideration. For example, if trying to specify the MI protocol using SLICC, then we may ues MI.slicc as the file that specifies all the files necessary for the protocol. The files necessary for specifying a protocol include the definitions of the state machines for different controllers, and of the network messages that are passed on between these controllers.
 
 
 
The files have a syntax similar to that of C++. The compiler, written using [http://www.dabeaz.com/ply/ PLY (Python Lex-Yacc)], parses these files to create an Abstract Syntax Tree (AST). The AST is then traversed to build some of the internal data structures. Finally the compiler outputs the C++ code by traversing the tree again. The AST represents the hierarchy of different structures present with in a state machine. We describe these structures next.
 
 
 
===== Protocol State Machines =====
 
In this section we take a closer look at what goes in to a file containing specification of a state machine. Each state machine is described using SLICC's '''machine''' datatype. Each machine has several different types of members. Machines for cache and directory controllers include cache memory and directory memory data members respectively. In order to let the controller receive messages from different entities in the system, the machine has a number of '''Message Buffers'''. These act as input and output ports for the machine.
 
 
 
  MessageBuffer requestFromCache, network="To", virtual_network="2", ordered="true";
 
  MessageBuffer responseFromCache, network="To", virtual_network="4", ordered="true";
 
 
 
  MessageBuffer forwardToCache, network="From", virtual_network="3", ordered="true";
 
  MessageBuffer responseToCache, network="From", virtual_network="4", ordered="true";
 
 
 
Note that Message Buffers have some attributes that need to be specified correctly. Next the machine includes a declaration of the '''states''' that machine can possibly reach.
 
 
 
==== Protocols ====
 
    Need to talk about each protocol being shipped. Need to talk about protocol specific configuration parameters.
 
    NO need to explain every action or every state/events, but need to give overall idea and how it works
 
    and assumptions (if any).
 
 
 
===== Common Notations and Data Structures =====
 
 
 
====== '''Coherence Messages''' ======
 
 
 
These are described in the <''protocol-name''>-msg.sm file for each protocol.
 
 
 
{| border="1" cellpadding="10" class="wikitable"
 
! Message !! Description
 
|-
 
| '''ACK/NACK''' || positive/negative acknowledgement for requests that wait for the direction of resolution before deciding on the next action. Examples are writeback requests, exclusive requests.
 
|-
 
| '''GETS''' || request for shared permissions to satisfy a CPU's load or IFetch.
 
|-
 
| '''GETX''' || request for exclusive access.
 
|-
 
| '''INV''' || invalidation request. This can be triggered by the coherence protocol itself, or by the next cache level/directory to enforce inclusion or to trigger a writeback for a DMA access so that the latest copy of data is obtained.
 
|-
 
| '''PUTX''' || request for writeback of cache block. Some protocols (e.g. MOESI_CMP_directory) may use this only for writeback requests of exclusive data.
 
|-
 
| '''PUTS''' || request for writeback of cache block in shared state.
 
|-
 
| '''PUTO''' || request for writeback of cache block in owned state.
 
|-
 
| '''PUTO_Sharers''' || request for writeback of cache block in owned state but other sharers of the block exist.
 
|-
 
| '''UNBLOCK''' || message to unblock next cache level/directory for blocking protocols.
 
|}
 
 
 
====== '''AccessPermissions''' ======
 
 
 
These are associated with each cache block and determine what operations are permitted on that block. It is closely correlated with coherence protocol states.
 
 
 
{| border="1" cellpadding="10" class="wikitable"
 
! Permissions !! Description
 
|-
 
| '''Invalid''' || The cache block is invalid. The block must first be obtained (from elsewhere in the memory hierarchy) before loads/stores can be performed. No action on invalidates (except maybe sending an ACK). No action on replacements. The associated coherence protocol states are I or NP and are stable states in every protocol.
 
|-
 
| '''Busy''' || TODO
 
|-
 
| '''Read_Only''' || Only operations permitted are loads, writebacks, invalidates. Stores cannot be performed before transitioning to some other state.
 
|-
 
| '''Read_Write''' || Loads, stores, writebacks, invalidations are allowed. Usually indicates that the block is dirty.
 
|}
 
 
 
====== Data Structures ======
 
* '''Message Buffers''':TODO
 
* '''TBE Table''': TODO
 
* '''Timer Table''': This maintains a map of address-based timers. For each target address, a timeout value can be associated and added to the Timer table. This data structure is used, for example, by the L1 cache controller implementation of the MOESI_CMP_directory protocol to trigger separate timeouts for cache blocks. Internally, the Timer Table uses the event queue to schedule the timeouts. The TimerTable supports a polling-based interface, '''isReady()''' to check if a timeout has occurred. Timeouts on addresses can be set using the '''set()''' method and removed using the '''unset()''' method.
 
:: '''Related Files''':
 
::::  src/mem/ruby/system/TimerTable.hh: Declares the TimerTable class
 
::::  src/mem/ruby/system/TimerTable.cc: Implementation of the methods of the TimerTable class, that deals with setting addresses & timeouts, scheduling events using the event queue.
 
 
 
====== Coherence controller FSM Diagrams ======
 
* The Finite State Machines show only the stable states
 
* Transitions are annotated using the notation "'''Event list'''" or "'''Event list : Action list'''" or "'''Event list : Action list : Event list'''". For example, Store : GETX indicates that on a Store event, a GETX message was sent whereas GETX : Mem Read indicates that on receiving a GETX message, a memory read request was sent. Only the main triggers and actions are listed.
 
* In the diagrams, the  transition labels are associated with the arc that cuts across the transition label or the closest arc.
 
 
 
===== MI example =====
 
 
 
This is a simple cache coherence protocol that is used to illustrate protocol specification using SLICC.
 
 
 
====== Related Files ======
 
 
 
* '''src/mem/protocols'''
 
** '''MI_example-cache.sm''': cache controller specification
 
** '''MI_example-dir.sm''': directory controller specification
 
** '''MI_example-dma.sm''': dma controller specification
 
** '''MI_example-msg.sm''': message type specification
 
** '''MI_example.slicc''': container file
 
 
 
====== Cache Hierarchy ======
 
 
 
This protocol assumes a 1-level cache hierarchy. The cache is private to each node. The caches are kept coherent by a directory controller. Since the hierarchy is only 1-level, there is no inclusion/exclusion requirement. This protocol does not differentiate between loads and stores.
 
 
 
====== Stable States and Invariants ======
 
 
 
{| border="1" cellpadding="10" class="wikitable"
 
! States !! Invariants
 
|-
 
| '''M''' || The cache block has been accessed (read/written) by this node. No other node holds a copy of the cache block
 
|-
 
| '''I''' || The cache block at this node is invalid
 
|}
 
 
 
'''The notation used in the controller FSM diagrams is described [[#Coherence_controller_FSM_Diagrams|here]].'''
 
 
 
====== Cache controller ======
 
 
 
* Requests, Responses, Triggers:
 
** Load, Instruction fetch, Store from the core
 
** Replacement from self
 
** Data from the directory controller
 
** Forwarded request (intervention) from the directory controller
 
** Writeback acknowledgement from the directory controller
 
** Invalidations from directory controller (on dma activity)
 
 
 
[[File:MI_example_cache_FSM.jpg|450px|right]]
 
 
 
* Main Operation:
 
** On a '''load/Instruction fetch/Store''' request from the core:
 
*** it checks whether the corresponding block is present in the M state. If so, it returns a hit
 
*** otherwise, if in I state, it initiates a GETX request from the directory controller
 
 
 
** On a '''replacement''' trigger from self:
 
*** it evicts the block, issues a writeback request to the directory controller
 
*** it waits for acknowledgement from the directory controller (to prevent races)
 
 
 
** On a '''forwarded request''' from the directory controller:
 
*** This means that the block was in M state at this node when the request was generated by some other node
 
*** It sends the block directly to the requesting node (cache-to-cache transfer)
 
*** It evicts the block from this node
 
 
 
** '''Invalidations''' are similar to replacements
 
 
 
====== Directory controller ======
 
 
 
* Requests, Responses, Triggers:
 
** GETX from the cores, Forwarded GETX to the cores
 
** Data from memory, Data to the cores
 
** Writeback requests from the cores, Writeback acknowledgements to the cores
 
** DMA read, write requests from the DMA controllers
 
 
 
[[File:MI_example_dir_FSM.jpg|450px|right]]
 
 
 
* Main Operation:
 
** The directory maintains track of which core has a block in the M state. It designates this core as owner of the block.
 
** On a '''GETX''' request from a core:
 
*** If the block is not present, a memory fetch request is initiated
 
*** If the block is already present, then it means the request is generated from some other core
 
**** In this case, a forwarded request is sent to the original owner
 
**** Ownership of the block is transferred to the requestor
 
** On a '''writeback''' request from a core:
 
*** If the core is owner, the data is written to memory and acknowledgement is sent back to the core
 
*** If the core is not owner, a NACK is sent back
 
**** This can happen in a race condition
 
**** The core evicted the block while a forwarded request some other core was on the way and the directory has already changed ownership for the core
 
**** The evicting core holds the data till the forwarded request arrives
 
** On '''DMA''' accesses (read/write)
 
*** Invalidation is sent to the owner node (if any). Otherwise data is fetched from memory.
 
*** This ensures that the most recent data is available.
 
 
 
====== Other features ======
 
 
 
** MI protocols don't support LL/SC semantics. A load from a remote core will invalidate the cache block.
 
** This protocol has no timeout mechanisms.
 
 
 
===== MOESI_hammer =====
 
 
 
This is an implementation of AMD's Hammer protocol, which is used in AMD's Hammer chip (also know as the Opteron or Athlon 64).
 
 
 
====== Related Files ======
 
 
 
* '''src/mem/protocols'''
 
** '''MOESI_hammer-cache.sm''': cache controller specification
 
** '''MOESI_hammer-dir.sm''': directory controller specification
 
** '''MOESI_hammer-dma.sm''': dma controller specification
 
** '''MOESI_hammer-msg.sm''': message type specification
 
** '''MOESI_hammer.slicc''': container file
 
 
 
====== Cache Hierarchy ======
 
 
 
This protocol implements a 2-level private cache hierarchy. It assigns separate Instruction and Data L1 caches, and a unified L2 cache to each core. These caches are private to each core and are controlled with one shared cache controller. This protocol enforce exclusion between L1 and L2 caches.
 
 
 
====== Stable States and Invariants ======
 
 
 
{| border="1" cellpadding="10" class="wikitable"
 
! States !! Invariants
 
|-
 
| '''MM''' || The cache block is held exclusively by this node and is potentially locally modified (similar to conventional "M" state).
 
|-
 
| '''O''' || The cache block is owned by this node. It has not been modified by this node. No other node holds this block in exclusive mode, but sharers potentially exist.
 
|-
 
| '''M''' || The cache block is held in exclusive mode, but not written to (similar to conventional "E" state). No other node holds a copy of this block. Stores are not allowed in this state.
 
|-
 
| '''S''' || The cache line holds the most recent, correct copy of the data. Other processors in the system may hold copies of the data in the shared state, as well. The cache line can be read, but not written in this state.
 
|-
 
| '''I''' || The cache line is invalid and does not hold a valid copy of the data.
 
|}
 
 
 
====== Cache controller ======
 
 
 
 
 
'''The notation used in the controller FSM diagrams is described [[#Coherence_controller_FSM_Diagrams|here]].'''
 
 
 
MOESI_hammer supports cache flushing. To flush a cache line, the cache controller first issues a GETF request to the directory to block the line until the flushing is completed. It then issues a PUTF and writes back the cache line.
 
 
 
[[File:MOESI_hammer_cache_FSM.jpg|center]]
 
 
 
====== Directory controller ======
 
 
 
MOESI_hammer memory module, unlike a typical directory protocol, does not contain any directory state and instead broadcasts requests to all the processors in the system. In parallel, it fetches the data from the DRAM and forward the response to the requesters.
 
 
 
probe filter: TODO
 
 
 
* '''Stable States and Invariants'''
 
 
 
{| border="1" cellpadding="10" class="wikitable"
 
! States !! Invariants
 
|-
 
| '''NX''' || Not Owner, probe filter entry exists, block in O at Owner.
 
|-
 
| '''NO''' || Not Owner, probe filter entry exists, block in E/M at Owner.
 
|-
 
| '''S''' || Data clean, probe filter entry exists pointing to the current owner.
 
|-
 
| '''O''' || Data clean, probe filter entry exists.
 
|-
 
| '''E''' || Exclusive Owner, no probe filter entry.
 
|}
 
 
 
* '''Controller'''
 
 
 
 
 
'''The notation used in the controller FSM diagrams is described [[#Coherence_controller_FSM_Diagrams|here]].'''
 
 
 
[[File:MOESI_hammer_dir_FSM.jpg|center]]
 
 
 
===== MOESI_CMP_token =====
 
This protocol is an extension of MOESI_CMP_directory and it maintains coherence permission by explicitly exchanging and counting tokens.
 
 
 
====== Related Files ======
 
 
 
* '''src/mem/protocols'''
 
** '''MOESI_CMP_token-L1cache.sm''': L1 cache controller specification
 
** '''MOESI_CMP_token-L2cache.sm''': L2 cache controller specification
 
** '''MOESI_CMP_token-dir.sm''': directory controller specification
 
** '''MOESI_CMP_token-dma.sm''': dma controller specification
 
** '''MOESI_CMP_token-msg.sm''': message type specification
 
** '''MOESI_CMP_token.slicc''': container file
 
 
 
====== Cache Hierarchy ======
 
 
 
This protocol implements a 2-level private cache hierarchy. It assigns separate Instruction and Data L1 caches, and a unified L2 cache to each core. These caches are private to each core and are controlled with one shared cache controller. This protocol enforce exclusion between L1 and L2 caches.
 
 
 
====== L1 Cache ======
 
 
 
* '''Stable States and Invariants'''
 
 
 
{| border="1" cellpadding="10" class="wikitable"
 
! States !! Invariants
 
|-
 
| '''MM''' || The cache block is held exclusively by this node and is potentially modified (similar to conventional "M" state).
 
|-
 
| '''MM_W''' || The cache block is held exclusively by this node and is potentially modified (similar to conventional "M" state). Replacements and DMA accesses are not allowed in this state. The block automatically transitions to MM state after a timeout.
 
|-
 
| '''O''' || The cache block is owned by this node. It has not been modified by this node. No other node holds this block in exclusive mode, but sharers potentially exist.
 
|-
 
| '''M''' || The cache block is held in exclusive mode, but not written to (similar to conventional "E" state). No other node holds a copy of this block. Stores are not allowed in this state.
 
|-
 
| '''M_W''' || The cache block is held in exclusive mode, but not written to (similar to conventional "E" state). No other node holds a copy of this block. Only loads and stores are allowed. Silent upgrade happens to MM_W state on store. Replacements and DMA accesses are not allowed in this state. The block automatically transitions to M state after a timeout.
 
|-
 
| '''S''' ||  The cache block is held in shared state by 1 or more nodes. Stores are not allowed in this state.
 
|-
 
| '''I''' || The cache block is invalid.
 
|}
 
 
 
* '''Controller'''
 
 
 
'''The notation used in the controller FSM diagrams is described [[#Coherence_controller_FSM_Diagrams|here]].'''
 
 
 
[[File:MOESI_CMP_directory_L1cache_FSM.jpg|center]]
 
 
 
 
 
 
 
'''The notation used in the controller FSM diagrams is described [[#Coherence_controller_FSM_Diagrams|here]].'''
 
 
 
[[File:MOESI_CMP_directory_L1cache_optim_FSM.jpg|center]]
 
 
 
======L2 cache=========
 
====== Stable States and Invariants ======
 
 
 
{| border="1" cellpadding="10" class="wikitable"
 
! States !! Invariants
 
|-
 
| '''NP''' || The cache block is held exclusively by this node and is potentially locally modified (similar to conventional "M" state).
 
|-
 
| '''O''' || The cache block is owned by this node. It has not been modified by this node. No other node holds this block in exclusive mode, but sharers potentially exist.
 
|-
 
| '''M''' || The cache block is held in exclusive mode, but not written to (similar to conventional "E" state). No other node holds a copy of this block. Stores are not allowed in this state.
 
|-
 
| '''S''' || The cache line holds the most recent, correct copy of the data. Other processors in the system may hold copies of the data in the shared state, as well. The cache line can be read, but not written in this state.
 
|-
 
| '''I''' || The cache line is invalid and does not hold a valid copy of the data.
 
|}
 
 
 
====== Cache controller ======
 
 
 
 
 
'''The notation used in the controller FSM diagrams is described [[#Coherence_controller_FSM_Diagrams|here]].'''
 
 
 
 
 
 
 
====== Directory controller ======
 
 
 
 
 
 
 
* '''Stable States and Invariants'''
 
 
 
{| border="1" cellpadding="10" class="wikitable"
 
! States !! Invariants
 
|-
 
| '''O''' || Owner, .
 
|-
 
| '''NO''' || Not Owner, block in E/M at Owner.
 
|-
 
| '''L''' || Locked,  to the current owner.
 
 
|}
 
 
 
* '''Controller'''
 
 
 
TODO
 
 
 
====== Other features ======
 
 
 
TODO
 
 
 
===== MOESI_CMP_directory =====
 
 
 
In contrast with the MESI protocol, the MOESI protocol introduces an additional '''Owned''' state.
 
 
 
====== Related Files ======
 
 
 
* '''src/mem/protocols'''
 
** '''MOESI_CMP_directory-L1cache.sm''': L1 cache controller specification
 
** '''MOESI_CMP_directory-L2cache.sm''': L2 cache controller specification
 
** '''MOESI_CMP_directory-dir.sm''': directory controller specification
 
** '''MOESI_CMP_directory-dma.sm''': dma controller specification
 
** '''MOESI_CMP_directory-msg.sm''': message type specification
 
** '''MOESI_CMP_directory.slicc''': container file
 
 
 
====== Cache Hierarchy ======
 
 
 
====== L1 Cache ======
 
 
 
* '''Stable States and Invariants'''
 
 
 
{| border="1" cellpadding="10" class="wikitable"
 
! States !! Invariants
 
|-
 
| '''MM''' || The cache block is held exclusively by this node and is potentially modified (similar to conventional "M" state).
 
|-
 
| '''MM_W''' || The cache block is held exclusively by this node and is potentially modified (similar to conventional "M" state). Replacements and DMA accesses are not allowed in this state. The block automatically transitions to MM state after a timeout.
 
|-
 
| '''O''' || The cache block is owned by this node. It has not been modified by this node. No other node holds this block in exclusive mode, but sharers potentially exist.
 
|-
 
| '''M''' || The cache block is held in exclusive mode, but not written to (similar to conventional "E" state). No other node holds a copy of this block. Stores are not allowed in this state.
 
|-
 
| '''M_W''' || The cache block is held in exclusive mode, but not written to (similar to conventional "E" state). No other node holds a copy of this block. Only loads and stores are allowed. Silent upgrade happens to MM_W state on store. Replacements and DMA accesses are not allowed in this state. The block automatically transitions to M state after a timeout.
 
|-
 
| '''S''' ||  The cache block is held in shared state by 1 or more nodes. Stores are not allowed in this state.
 
|-
 
| '''I''' || The cache block is invalid.
 
|}
 
 
 
* '''Controller'''
 
 
 
'''The notation used in the controller FSM diagrams is described [[#Coherence_controller_FSM_Diagrams|here]].'''
 
 
 
[[File:MOESI_CMP_directory_L1cache_FSM.jpg|center]]
 
 
 
** '''Optimizations'''
 
 
 
{| border="1" cellpadding="10" class="wikitable"
 
! States !! Description
 
|-
 
| '''SM''' || A GETX has been issued to get exclusive permissions for an impending store to the cache block, but an old copy of the block is still present. Stores and Replacements are not allowed in this state.
 
|-
 
| '''OM''' || A GETX has been issued to get exclusive permissions for an impending store to the cache block, the data has been received, but all expected acknowledgments have not yet arrived. Stores and Replacements are not allowed in this state.
 
|}
 
 
 
'''The notation used in the controller FSM diagrams is described [[#Coherence_controller_FSM_Diagrams|here]].'''
 
 
 
[[File:MOESI_CMP_directory_L1cache_optim_FSM.jpg|center]]
 
 
 
====== L2 Cache ======
 
 
 
* '''Stable States and Invariants'''
 
 
 
{| border="1" cellpadding="10" class="wikitable"
 
! States !! Invariants
 
|-
 
| '''NP/I''' || TODO
 
|-
 
| '''ILS''' || TODO
 
|-
 
| '''ILX''' || TODO
 
|-
 
| '''ILO''' || TODO
 
|-
 
| '''ILOX''' || TODO
 
|-
 
| '''ILOS''' || TODO
 
|-
 
| '''ILOSX''' || TODO
 
|-
 
| '''S''' || TODO
 
|-
 
| '''O''' || TODO
 
|-
 
| '''OLS''' || TODO
 
|-
 
| '''OLSX''' || TODO
 
|-
 
| '''SLS''' || TODO
 
|-
 
| '''M''' || TODO
 
|}
 
 
 
* '''Controller'''
 
 
 
====== Directory ======
 
 
 
* '''Stable States and Invariants'''
 
 
 
{| border="1" cellpadding="10" class="wikitable"
 
! States !! Invariants
 
|-
 
| '''M''' || The cache block is held in exclusive state by only 1 node (which is also the owner). There are no sharers of this block. The data is potentially different from that in memory.
 
|-
 
| '''O''' || The cache block is owned by exactly 1 node. There may be sharers of this block. The data is potentially different from that in memory.
 
|-
 
| '''S''' || The cache block is held in shared state by 1 or more nodes. No node has ownership of the block. The data is consistent with that in memory (Check).
 
|-
 
| '''I''' || The cache block is invalid.
 
|}
 
 
 
* '''Controller'''
 
 
 
'''The notation used in the controller FSM diagrams is described [[#Coherence_controller_FSM_Diagrams|here]].'''
 
 
 
[[File:MOESI_CMP_directory_dir_FSM.jpg|center]]
 
 
 
====== Other features ======
 
 
 
* '''Timeouts''':
 
 
 
''Rathijit will do it''
 
 
 
===== MESI_CMP_directory =====
 
 
 
====== '''Important features of the protocol''' ======
 
 
 
* This protocol models '''two-level cache hierarchy'''. The L1 cache is private to a core, while the L2 cache is shared among the cores. L1 Cache is split into Instruction and Data cache.
 
* '''Inclusion''' is maintained between the L1 and L2 cache.
 
* At high level the protocol has four stable states, '''M''', '''E''', '''S''' and '''I'''. A block in '''M''' state means the blocks is writable (i.e. has exclusive permission) and has been dirtied (i.e. its the only valid copy on-chip). '''E''' state represent a cache block with exclusive permission (i.e. writable) but is not written yet. '''S''' state means the cache block is only readable and possible multiple copies of it exists in multiple private cache and as well as in the shared cache. '''I''' means that the cache block is invalid.
 
* The on-chip cache coherence is maintained through '''Directory Coherence''' scheme, where the directory information is co-located with the corresponding cache blocks in the shared L2 cache.
 
* The protocol has four types of controllers -- '''L1 cache controller, L2 cache controller, Directory controller''' and '''DMA controller'''. L1 cache controller is responsible for managing L1 Instruction and L1 Data Cache. Number of instantiation of L1 cache controller is equal to the number of cores in the simulated system. L2 cache controller is responsible for managing the shared L2 cache and for maintaining coherence of on-chip data through directory coherence scheme. The Directory controller act as interface to the Memory Controller/Off-chip main memory and also responsible for coherence across multiple chips/and external coherence request from DMA controller. DMA controller is responsible for satisfying coherent DMA requests.
 
* One of the primary optimization in this protocol is that if a L1 Cache request a data block even for read permission, the L2 cache controller if finds that no other core has the block, it returns the cache block with exclusive permission. This is an optimization done in anticipation that a cache blocks read would be written by the same core soon and thus save an extra request with this optimization. This is exactly why '''E''' state exits (i.e. when a cache block is writable but not yet written).
 
* The protocol supports ''silent eviction'' of ''clean'' cache blocks from the private L1 caches. This means that cache blocks which have not been written to and has readable permission only can drop the cache block from the private L1 cache without informing the L2 cache. This optimization helps reducing write-back traffic to the L2 cache controller.
 
 
 
====== '''Related Files''' ======
 
 
 
* '''src/mem/protocols'''
 
** '''MESI_CMP_directory-L1cache.sm''': L1 cache controller specification
 
** '''MESI_CMP_directory-L2cache.sm''': L2 cache controller specification
 
** '''MESI_CMP_directory-dir.sm''': directory controller specification
 
** '''MESI_CMP_directory-dma.sm''': dma controller specification
 
** '''MESI_CMP_directory-msg.sm''': coherence message type specifications. This defines different field of different type of messages that would be used by the given protocol
 
** '''MESI_CMP_directory.slicc''': container file
 
 
 
====== '''Controller Description''' ======
 
 
 
* '''L1 cache controller'''
 
 
 
{| border="1" cellpadding="10" class="wikitable"
 
! States !! Invariants and Semantic of the state
 
|-
 
| '''M''' || The cache block is held in exclusive state by '''only one L1 cache'''. There are no sharers of this block. The data is potentially is the only valid copy in the system. The copy of the cache block is '''writable''' and as well as '''readable'''.
 
|-
 
| '''E''' || The cache block is held with exclusive permission by exactly '''only one L1 cache'''. The difference with the '''M''' state is that the cache block is writable (and readable) but not yet written.
 
|-
 
| '''S''' || The cache block is held in shared state by 1 or more L1 caches and/or by the L2 cache. The block is only '''readable'''. No cache can have the cache block with exclusive permission.
 
|-
 
| '''I / NP''' || The cache block is invalid.
 
|-
 
| '''IS''' || Its a transient state. This means that '''GETS (Read)''' request has been issued for the cache block and awaiting for response. The cache block is neither readable nor writable.
 
|-
 
| '''IM''' || Its a transient state. This means that '''GETX (Write)''' request has been issued for the cache block and awaiting for response. The cache block is neither readable nor writable.
 
|-
 
| '''SM''' || Its a transient state. This means the cache block was originally in S state and then '''UPGRADE (Write)''' request was issued to get exclusive permission for the blocks and awaiting response. The cache block is '''readable'''.
 
|-
 
| '''IS_I''' || Its a transient state. This means that while in IS state the cache controller received Invalidation from the L2 Cache's directory. This happens due to race condition due to write to the same cache block by other core, while the given core was trying to get the same cache blocks for reading. The cache block is neither readable nor writable..
 
|-
 
| '''M_I''' || Its a transient state. This state indicates that the cache is trying to replace a cache block in '''M''' state from its cache and the write-back (PUTX) to the L2 cache's directory has been issued but awaiting write-back acknowledgement.
 
|-
 
| '''SINK_WB_ACK''' || Its a transient state. This state is reached when waiting for write-back acknowledgement from the L2 cache's directory, the L1 cache received intervention (forwarded request from other cores). This indicates a race between the issued write-back to the directory and another request from the another cache has happened. This also indicates that the write-back has lost the race (i.e. before it reached the L2 cache's directory, another core's request has reached the L2). This state is essential to  avoid possibility of complicated race condition that can happen if write-backs are silently dropped at the directory.
 
|-
 
 
|}
 
 
 
* '''L2 cache controller'''
 
 
 
{| border="1" cellpadding="10" class="wikitable"
 
! States !! Invariants and Semantic of the state
 
|-
 
| '''NP''' || The cache blocks is not present in the on-chip cache hierarchy.
 
|-
 
| '''SS''' || The cache block is present in potentially multiple private caches in only readable mode (i.e.in  "S" state in private caches). Corresponding "Sharers" vector with the block should give the identity of the private caches which possibly have the cache block in its cache. The cache block in the L2 cache is valid and '''readable'''.
 
|-
 
| '''M''' || The cache block is present ONLY in the L2 cache  and has exclusive permission. L1 Cache's read/write requests (GETS/GETX) can be satisfied directly from the L2 cache.
 
|-
 
| '''MT''' || The cache block is in ONE of the private L1 caches with exclusive permission. The data in the L2 cache is potentially stale. The identity of the L1 cache which has the block can be found in the "Owner" field associated with the cache block. Any request for read/write (GETS/GETX) from other cores/private L1 caches need to be forwarded to the owner of the cache block. L2 can not service requests itself. 
 
|-
 
| '''M_I''' ||  Its a transient state. This state indicates that the cache is trying to replace the cache block from its cache and the write-back (PUTX/PUTS) to the Directory controller (which act as interface to Main memory) has been issued but awaiting write-back acknowledgement. The data is neither readable nor writable.
 
|-
 
| '''MT_I''' ||  Its a transient state. This state indicates that the cache is trying to replace a cache block in '''MT''' state from its cache. Invalidation to the current owner (private L1 cache) of the cache block has been issued and awaiting write-back from the Owner L1 cache. Note that the this Invalidation (called back-invalidation) is instrumental in making sure that the inclusion is maintained between L1 and L2 caches. The data is neither readable nor writable.
 
|-
 
| '''MCT_I''' || Its a transient state.This state is same as '''MT_I''', except that it is known that the data in the L2 cache is in ''clean'' state. The data is neither readable nor writable.
 
|-
 
| '''I_I''' || Its a transient state. The L2 cache is trying to replace a cache block in the '''SS''' state and the cache block in the L2 is in ''clean'' state. Invalidations has been sent to all potential sharers (L1 caches) of the cache block. The L2 cache's directory is waiting for all the required Acknowledgements to arrive from the L1 caches. Note that the this Invalidation (called back-invalidation) is instrumental in making sure that the inclusion is maintained between L1 and L2 caches. The data is neither readable nor writable.
 
|-
 
| '''S_I''' || Its a transient state.Same as '''I_I''', except the data in L2 cache for the cache block is ''dirty''. This means unlike in the case of '''I_I''', the data needs to be sent to the Main memory. The cache block is neither readable nor writable..
 
|-
 
| '''ISS''' || Its a transient state. L2 has received a '''GETS (read)''' request from one of the private L1 caches, for a cache block that it not present in the on-chip caches. A read request has been sent to the Main Memory (Directory controller) and waiting for the response from the memory. This state is reached only when the request is for data cache block (not instruction cache block). The purpose of this state is that if it is found that only one L1 cache has requested the cache block then the block is returned to the requester with exclusive permission (although it was requested for reading permission). The cache block is neither readable nor writable.
 
|-
 
| '''IS''' || Its a transient state. The state is similar to '''ISS''', except the fact that if the requested cache block is Instruction cache block or more than one core request the same cache block while waiting for the response from the memory, this state is reached instead of '''ISS'''. Once the requested cache block arrives from the Main Memory, the block is sent to the requester(s) with read-only permission. The cache block is neither readable nor writable at this state.
 
|-
 
| '''IM''' || Its a transient state. This state is reached when a L1 GETX (write) request is received by the L2 cache for a cache blocks that is not present in the on-chip cache hierarchy. The request for the cache block in exclusive mode has been issued to the main memory but response is yet to arrive.The cache block is neither readable nor writable at this state.
 
|-
 
| '''SS_MB''' || Its a transient state. In general any state whose name ends with "B" (like this one) also means that it is a ''blocking'' coherence state. This means the directory awaiting for some response from the private L1 cache ans until it receives the desired response any other request is not entertained (i.e. request are effectively serialized). This particular state is reached when a L1 cache requests a cache block with exclusive permission (i.e. GETX or UPGRADE) and the coherence state of the cache blocks was in '''SS''' state. This means that the requested cache blocks potentially has readable copies in the private L1 caches. Thus before giving the exclusive permission to the requester, all the readable copies in the L1 caches need to be invalidated. This state indicate that the required invalidations has been sent to the potential sharers (L1 caches) and the requester has been informed about the required number of Invalidation Acknowledgement it needs before it can have the exclusive permission for the cache block. Once the requester L1 cache gets the required number of Invalidation Acknowledgement it informs the director about this by ''UNBLOCK'' message which allows the directory to move out of this blocking coherence state and thereafter it can resume entertaining other request for the given cache block. The cache block is neither readable nor writable at this state.
 
|-
 
| '''MT_MB''' || Its a transient state and also a ''blocking'' state. This state is reached when L2 cache's directory has sent out a cache block with exclusive permission to a requester L1 cache but yet to receive ''UNBLOCK'' from the requester L1 cache acknowledging the receipt of exclusive permission. The cache block is neither readable nor writable at this state.
 
|-
 
| '''MT_IIB''' || Its a transient state and also a ''blocking'' state. This state is reached when a read request (GETS) request is received for a cache blocks which is currently held with exclusive permission in another private L1 cache (i.e. directory state is '''MT'''). On such requests the L2 cache's directory forwards the request to the current owner L1 cache and transitions to this state.  Two events need to happen before this cache block can be unblocked (and thus start entertaining further request for this cache block). The current owner cache block need to send a write-back to the L2 cache to update the L2's copy with latest value. The requester L1 cache also needs to send ''UNBLOCK'' to the L2 cache indicating that it has got the requested cache block with desired coherence permissions. The cache block is neither readable nor writable at this state in the L2 cache.
 
|-
 
| '''MT_IB''' || Its a transient state and also a ''blocking'' state. This state is reached when at '''MT_IIB''' state the L2 cache controller receives the ''UNBLOCK'' from the requester L1 cache but yet to receive the write-back from the previous owner L1 cache of the block. The cache block is neither readable nor writable at this state in the L2 cache.
 
|-
 
| '''MT_IB''' || Its a transient state and also a ''blocking'' state. This state is reached when at '''MT_IIB''' state the L2 cache controller receives write-back from the previous owner L1 cache for the blocks, while yet to receive the ''UNBLOCK'' from the current requester for the cache block. The cache block is neither readable nor writable at this state in the L2 cache.
 
|}
 
 
 
===== Network_test =====
 
This is a dummy cache coherence protocol that is used to operate the ruby network tester. The details about running the network tester can be found [[networktest|here]].
 
 
 
====== Related Files ======
 
 
 
* '''src/mem/protocols'''
 
** '''Network_test-cache.sm''': cache controller specification
 
** '''Network_test-dir.sm''': directory controller specification
 
** '''Network_test-msg.sm''': message type specification
 
** '''Network_test.slicc''': container file
 
 
 
====== Cache Hierarchy ======
 
 
 
This protocol assumes a 1-level cache hierarchy. The role of the cache is to simply send messages from the cpu to the appropriate directory (based on the address), in the appropriate virtual network (based on the message type). It does not track any state. Infact, no CacheMemory is created unlike other protocols. The directory receives the messages from the caches, but does not send any back. The goal of this protocol is to enable simulation/testing of just the interconnection network.
 
 
 
====== Stable States and Invariants ======
 
 
 
{| border="1" cellpadding="10" class="wikitable"
 
! States !! Invariants
 
|-
 
| '''I''' || Default state of all cache blocks
 
|}
 
 
 
====== Cache controller ======
 
 
 
* Requests, Responses, Triggers:
 
** Load, Instruction fetch, Store from the core.
 
The network tester (in src/cpu/testers/networktest/networktest.cc) generates packets of the type '''ReadReq''', '''INST_FETCH''', and '''WriteReq''', which are converted into '''RubyRequestType:LD''', '''RubyRequestType:IFETCH''', and '''RubyRequestType:ST''', respectively, by the RubyPort (in src/mem/ruby/system/RubyPort.hh/cc). These messages reach the cache controller via the Sequencer. The destination for these messages is determined by the traffic type, and embedded in the address. More details can be found [[networktest|here]].
 
 
 
* Main Operation:
 
** The goal of the cache is only to act as a source node in the underlying interconnection network. It does not track any states.
 
** On a '''LD''' from the core:
 
*** it returns a hit, and
 
*** maps the address to a directory, and issues a message for it of type '''MSG''', and size '''Control''' (8 bytes) in the request vnet (0).
 
*** Note: vnet 0 could also be made to broadcast, instead of sending a directed message to a particular directory, by uncommenting the appropriate line in the ''a_issueRequest'' action in Network_test-cache.sm
 
** On a '''IFETCH''' from the core:
 
*** it returns a hit, and
 
*** maps the address to a directory, and issues a message for it of type '''MSG''', and size '''Control''' (8 bytes) in the forward vnet (1).
 
** On a '''ST''' from the core:
 
*** it returns a hit, and
 
*** maps the address to a directory, and issues a message for it of type '''MSG''', and size '''Data''' (72 bytes) in the response vnet (2).
 
** Note: request, forward and response are just used to differentiate the vnets, but do not have any physical significance in this protocol.
 
 
 
====== Directory controller ======
 
 
 
* Requests, Responses, Triggers:
 
** '''MSG''' from the cores
 
 
 
* Main Operation:
 
** The goal of the directory is only to act as a destination node in the underlying interconnection network. It does not track any states.
 
** The directory simply pops its incoming queue upon receiving the message.
 
 
 
====== Other features ======
 
 
 
** This protocol assumes only 3 vnets.
 
** It should only be used when running the ruby network test.
 
 
 
 
 
==== Protocol Independent Memory components ====
 
===== System =====
 
 
 
This is a high level container for few of the important components of the Ruby which may need to be accessed from various parts and components of Ruby. Only '''ONE''' instance of this class is created. The instance of this class is globally available through a pointer named '''''g_system_ptr'''''. It holds pointer to the Ruby's profiler object. This allows any component of Ruby to get hold of the profiler and collect statistics in a central location by accessing it though '''''g_system_ptr'''''. It also holds important information about the memory hierarchy like Cache blocks size, Physical memory size and makes them available to all parts of Ruby as and when required.  It also holds the pointer to the on-chip network in Ruby. Another important objects that it hold pointer to is the Ruby's wrapper for the simulator event queue (called RubyEventQueue). It also contains pointer to the simulated physical memory (pointed by variable name '''''m_mem_vec_ptr'''''). Thus in sum, System class in Ruby just acts as a container to pointers to some important objects of Ruby's memory system and makes them available globally to all places in Ruby through exposing it self through '''''g_system_ptr'''''.
 
 
 
====== Parameters ======
 
# '''''random_seed''''' is seed to randomize delays in Ruby. This allows simulating multiple runs with slightly perturbed delays/timings.
 
# '''''randomization''''' is the parameter when turned on, asks Ruby to randomly delay messages. This falg is useful when stress testing a system to expose corner cases. This flag should NOT be turned on when collecting simulation statistics.
 
# '''''clock''''' is the parameter for setting the clock frequency (on-chip).
 
# '''''block_size_bytes''''' specifies the size of cache blocks in bytes.
 
# '''''mem_size''''' specifies the physical memory size of the simulated system.
 
# '''''network''''' gives the pointer to the on-chip network for Ruby.
 
# '''''profiler''''' gives the pointer to the profiler of Ruby.
 
# '''''tracer''''' is the pointer to the Ruby 's memory request tracer. Tracer is primarily used to playback memory request trace in order to warm up Ruby's caches before actual simulation starts.
 
 
 
====== '''Related files''' ======
 
* '''src/mem/ruby/system'''
 
** '''System.hh/cc''': contains code for the System
 
** '''RubySystem.py''' :the corresponding python file with parameters.
 
 
 
===== Sequencer =====
 
 
 
'''''Sequencer''''' is one of the most important classes in Ruby, through which every memory request must pass through at least twice -- once before getting serviced by the cache coherence mechanism and once just after being serviced by the cache coherence mechanism. There is one instantiation of Sequencer class for each of the hardware thread being simulated. For example, if we are simulating a 16-core system with each core has single hardware thread context, then there would be 16 Sequencer object in the system, with each being responsible for ''managing'' memory requests (Load, Store, Atomic operations etc) from one of the given hardware thread context. ''ith'' Sequencer object handles request from only ''ith'' hardware context (in case of above example it is ''ith'' core). Each Sequencer assumes that it has access to the L1 Instruction and Data caches that are attached to a given core.
 
 
 
Following are the primary responsibilities of Sequencer:
 
# Injecting and accounting for all memory request to the underlying cache hierarchy (and coherence).
 
# Resource allocation and accounting (e.g. makes sure a particular core/hardware thread does not have more than specified number of outstanding memory requests).
 
# Making sure atomic operations are handled properly.
 
# Making sure that underlying cache hierarchy and coherence protocol is making forward progress.
 
# Once a request is serviced by the underlying cache hierarchy, Sequencer is responsible for returning the result to the corresponding port of the frontend (i.e.M5).
 
 
 
====== '''Parameters''' ======
 
# '''''icache''''' is the parameter where the L1 Instruction cache pointer is passed.
 
# '''''dcache''''' is the parameter where the L1 Data cache pointer is passed.
 
# '''''max_outstanding_requests''''' is the parameter for specifying maximum allowed number of outstanding memory request from a given core or hardware thread.
 
# '''''deadlock_threshold''''' is the parameter that specifies number of cycle (ruby cycles) after which if a given memory request is not satisfied by the cache hierarchy, a possibility of deadlock (or lack of forward progress) is declared.
 
 
 
====== '''Related files''' ======
 
* '''src/mem/ruby/system'''
 
** '''Sequencer.hh/cc''': contains code for the Sequencer
 
** '''Sequencer.py''' :the corresponding python file with parameters.
 
 
 
====== '''More detailed operation description''' ======
 
In this section, we will describe the operations of Sequencer in more details.
 
* '''Injection of memory request to cache hierarchy, request accounting and resource allocation:'''
 
 
 
: The entry point for a memory request to the Sequencer is method called '''''makeRequest'''''. This method is called from the corresponding RubyPort (Ruby's wrapper for M5 port). Once it gets the request, Sequencer checks for whether required resource limitations need to be enforced or not. This is done by calling a method called '''''getRequestStatus'''''.  In this method, it is made sure that a given Sequencer (i.e. a core or hardware thread context) can does NOT issue multiple simultaneous requests to same cache block. If the current request indeed for a cache block for which another request is still pending from the same Sequencer then the current request is not issued to the cache hierarchy and instead the current request wait for the previous request from the same cache block to satisfied first. This is done in the code by checking for two requesting accounting table called ''m_writeRequestTable'' and ''m_readRequestTable'' and setting the status to ''RequestSatus_Aliased''. It also makes sure that number of outsatnding memory request from a given Sequencer does not overshoot.
 
:If it is found that the current request won't violate any of the above constrained then the request is registered for accounting purposes. This is done by calling the method named '''''insertRequest'''''. In this method, depending upon the type of the request, an entry for the request is created in the either ''m_writeRequestTable'' or ''m_readRequestTable''. These two tables keep record for write request and read requests, respectively, that are issued to the cache hierarchy by the given Sequencer but still to be satisfied. Finally the memory request is finally pushed to the cache hierarchy by calling the method named '''''issueRequest'''''. This method is responsible of creating the request structure that is understood by the underlying SLICC generated coherence protocol implementation and cache hierarchy. This is done by setting the request type and mode accordingly and creating an object of class ''RubyRequest'' for the current request. Finally, L1 Instruction or L1 Data cache accesses latencies are accounted for an the request is pushed to the Cache hierarchy and the coherence mechanism for processing. This is done by ''enqueue''-ing the request to the pointer to the mandatory queue (''m_mandatory_q_ptr''). Every request is passed to the corresponding L1 Cache controller through this mandatory queue. Cache hierarchy is then responsible for satisfying the request.
 
 
 
* ''' Deadlock/lack of forward progress detection: '''
 
: As mentioned earlier, one other responsibility of the Sequencer is to make sure that Cache hierarchy is making progress in servicing the memory requests that have been issued. This is done by periodically waking up and scanning through the ''m_writeRequestTable'' and ''m_readRequestTable'' tables. which holds the currently outstanding requests from the given Sequencer and finds out which requests have been issued but have not satisfied by the cache hierarchy. If it finds any unsatisfied request that have been issues more than ''m_deadlock_threshold'' (parameter) cycles back, it reports a possible deadlock by the Cache hierarchy. Note that, although it reports possible deadlock, it actually detects lack of forward progress. Thus there may be false positives in this deadlock detection mechanism.
 
 
 
* ''' Send back result to the front-end and making sure Atomic operations are handled properly:'''
 
: Once the Cache hierarchy satisfies a request it calls Sequencer's '''''readCallback''''' or ''writeCallback''''' method depending upon the type of the request. Note that the time taken to service the request is automatically accounted for through event scheduling as the '''''readCallback''''' or '''''writeCallback'''' are called only after number of cycles required to satisfy the request has been accounted for. In these two methods the corresponding record of the request from the '' m_readRequestTable'' or ''m_writeRequestTable'' is removed.  Also if the request was found to be part of a Atomic operations (e.g. RMW or LL/SC), then appropriate actions are taken to make sure that semantics of atomic operations are respected.
 
: After that a method called '''''hitCallback''''' is called. In this method, some statics is collected by calling functions on the Ruby's profiler. For write request, the data is actually updated in this function (and not while simulating the request through the cache hierarchy and coherence protocol). Finally '''''ruby_hit_callback''''' is called ultimately sends back the packet to the front-end, signifying completion of the processing of the memory request from Ruby's side.
 
 
 
===== CacheMemory and Cache Replacement Polices =====
 
 
 
 
 
This module can model any '''Set-associative Cache structure''' with a given associativity and size. Each instantiation of the following module models a '''single bank''' of a cache. Thus different types of caches in system (e.g. L1 Instruction, L1 Data , L2 etc) and every banks of a cache needs to have separate instantiation of this module.  This module can also model Fully associative cache when the associativity is set to 1. In Ruby memory system, this module is primarily expected to be accessed by the SLICC generated codes of the given Coherence protocol being modeled. 
 
====== '''Basic Operation''' ======
 
This module models the set-associative structure as a two dimensional (2D) array. Each row of the 2D array represents a set of in the set-associative cache structure, while columns represents ways. The number of columns  is equal to the given associativity (parameter), while the number of rows is decided depending on the desired size of the structure (parameter), associativity (parameter) and the size of the cache line (parameter).
 
This module exposes six important functionalities which Coherence Protocols uses to manage the caches.
 
# It allows to query if a given cache line address is present in the set-associative structure being modeled through a function named '''''isTagPresent'''''. This function returns ''true'', iff the given cache line address is present in it.
 
# It allows a lookup operation which returns the cache entry for a given cache line address (if present), through a function named '''''lookup'''''. It returns NULL if the blocks with given address is not present in the set-associative cache structure.
 
# It allows to allocate a new cache entry in the set-associative structure through a function named '''''allocate'''''.
 
# It allows to deallocate a cache entry of a given cache line address through a function named '''''deallocate'''''.
 
# It can be queried to find out whether to allocate an entry with given cache line address would require replacement of another entry in the designated set (derived from the cache line address) or not. This functionality is provided through '''''cacheAvail''''' function, which for a given cache line address, returns True, if NO replacement of another entry the same set as the given address is required to make space for a new entry with the given address.
 
# The function '''''cacheProbe''''' is used to find out cache line address of a victim line, in case placing a new entry would require victimizing another cache blocks in the same set. This function returns the cache line address of the victim line given the address of the address of the new cache line that would have to be allocated.
 
 
 
====== '''Parameters''' ======
 
There are four important parameters for this class.
 
# '''''size''''' is the parameter that provides the size of the set-associative structure being modeled in units of bytes.
 
# '''''assoc''''' specifies the set-associativity of the structure.
 
# '''''replacement_policy''''' is the name of the replacement policy that would be used to select victim cache line when there is conflict in a given set. Currently, only two possible choices are available ('''''PSEUDO_LRU''''' and '''''LRU''''').
 
# Finally, '''''start_index_bit''''' parameter specifies the bit position in the address from where indexing into the cache should start. This is a tricky parameter and if not set properly would end up using only portion of the cache capacity. Thus how this value should be specified is explained through couple of examples. Let us assume the cache line size if 64 bytes and a single core machine with a L1 cache with only one bank and a L2 cache with 4 banks. For the CacheMemory module that would model the L1 cache should have '''''start_index_bit''''' set to log2(64) = 6 (this is the default value assuming 64 bytes cache line). This is required as addresses passed around in the Ruby is ''full address'' (i.e. equal to the number of bits required to access any address in the physical address range) and as the caches would be accessed in granularity of cache line size (here 64 bytes), the lower order 6 bits in the address would be essentially 0. So we should discard last 6 bits of the given address while calculating which set (index) in the set associative structure the given address should go to. Now let's look into a more complicated case of L2 cache, which has 4 banks. As mentioned previously, this modules models a single bank of a set-associative cache. Thus there will be four instantiation of the CacheMemory class to model the whole L2 cache. Assuming which cache bank a request goes to is statically decided by the low oder log2(4) = 2 bits of the ''cache line address'', the value of the bits in the address at the position ''6'' and ''7'' would be same for all accesses coming to a given bank (i.e. a instance of CacheMemory here). Thus indexing within the set associative structure (CacheMemory instance) modeling a given bank should use address bits 8 and higher for finding which set a cache block should go to. Thus '''''start_index_bit''''' parameter should be set to 8 for the banks of L2 in this example. If ''erroneously'' if this is set 6, only a fourth of desired L2 capacity would be utilized !!!
 
 
 
====== '''More detailed description of operation''' ======
 
 
 
As mentioned previously, the set-associative structure is modeled as a 2D array in the CacheMemory class. The variable '''''m_cache''''' is this all important 2D array containing the set-associative structure. Each element of this 2D array is derived from type '''''AbstractCacheEntry''''' class. Beside the minimal required functionality and contents of each cache entry, it can be extended inside the Coherence protocol files. This allows CacheMemory to be generic enough to hold any type of cache entry as desired by a given Coherence protocol as long as it derives from '''''AbstractCacheEntry''''' interface. The '''''m_cache''''' 2D array has number of rows equal to the number of sets in the set-associative structure being modeled while the number of columns is equal to associativity.
 
 
 
As should happen in any set-associative structure, which set (row) a cache entry should reside is decided by part of the cache block address used for indexing. The function '''''addressToCacheSet''''' calculates this index given an address. The ''way'' in which a cache entry reside in its designated set (row) is noted in the a hash_map object named '''''m_tag_index'''''. So to access an cache entry in the set-associative structure, first the set number where the cache block should reside is calculated and then '''''m_tag_index''''' is looked-up to find out the way in which the required cache block resides. If an cache entry holds invalid entry or its empty then its set to ''NULL'' or its permission is set to ''NotPresent''.
 
 
 
One important aspect of the Ruby's caches are the segregation of the set-associative structure for the cache and its replacement policy. This allows modular design where structure of the cache is independent of the replacement policy in the cache. When a victim needs to be selected to make space for a new cache block (by calling '''''cacheProbe''''' function), '''''getVictim''''' function of the class implementing replacement policy is called for the given set. '''''getVictim''''' returns the way number of the victim. The replacement policy is updated about accesses by calling '''''touch''''' function of the replacement policy, which allows it to update the access recency. Currently there are two replacement policies are supported -- LRU and PseudoLRU. LRU policy has a straight forward implementation where it keeps track of the absolute time when each way within each set is accessed last time and it always victimizes the entry which was last accessed furthest back in time. PseudoLRU implements a binary-tree based Non-Recently-Used policy. It arranges the ways in each set in an implicit '''''binary tree''''' like structure. Every node of the binary tree encodes the information which of its two subtrees was accessed more recently. During victim selection process, it starts from the root of the tree and traverse down such that it chooses the subtree which was touched ''less'' recently. Traversal continues until it reaches a leaf node. It then returns the id of the leaf node reached.
 
 
 
====== '''Related files''' ======
 
* '''src/mem/ruby/system'''
 
** '''CacheMemory.cc''': contains CacheMemory class which models a cache bank
 
** '''CacheMemory.hh''': Interface for the CacheMemory class
 
** '''Cache.py''': Python configuration file for CacheMemory class
 
** '''AbstractReplacementPolicy.hh''': Generic interface for Replacement policies for the Cache
 
** '''LRUPolicy.hh''': contains LRU replacement policy
 
** '''PseudoLRUPolicy.hh''': contains Pseudo-LRU replacement policy
 
* '''src/mem/ruby/slicc_interface'''
 
** '''AbstarctCacheEntry.hh''': contains the interface of Cache Entry
 
 
 
===== DMASequencer =====
 
 
 
This module implements handling for DMA transfers. It is derived from the RubyPort class.  There can be a number of DMA controllers that interface with the DMASequencer. The DMA sequencer has a protocol-independent interface and implementation. The DMA controllers are described with SLICC and are protocol-specific.
 
 
 
'''TODO: Fix documentation to reflect latest changes in the implementation.'''
 
 
 
''Note:''
 
 
 
# <span style="color:#006400">''At any time there can be only 1 request active in the DMASequencer.''</span>
 
# <span style="color:#006400">''Only ordinary load and store requests are handled. No other request types such as Ifetch, RMW, LL/SC are handled''</span>
 
 
 
====== Related Files ======
 
 
 
* '''src/mem/ruby/system'''
 
** '''DMASequencer.hh''': Declares the DMASequencer class and structure of a DMARequest
 
** '''DMASequencer.cc''': Implements the methods of the DMASequencer class, such as request issue and callbacks.
 
 
 
====== Configuration Parameters ======
 
 
 
Currently there are no special configuration parameters for the DMASequencer.
 
 
 
====== Basic Operation ======
 
 
 
A request for data transfer is split up into multiple requests each transferring cache-block-size chunks of data. A request is active as long as all the smaller transfers are not completed. During this time, the DMASequencer is in a busy state and cannot accepts any new transfer requests.
 
 
 
DMA requests are made through the '''makeRequest''' method. If the sequencer is not busy and the request is of the correct type (LD/ST), it is accepted. A sequence of requests for smaller data chunks is then issued. The '''issueNext''' method issues each of the smaller requests. A data/acknowledgment callback signals completion of the last transfer and triggers the next call to '''issueNext''' as long as all of the original data transfer is not complete. There is no separate event scheduler within the DMASequencer.
 
 
 
===== Memory Controller =====
 
 
 
This module simulates a basic DDR-style memory controller. It models a '''single channel''', connected to any number of DIMMs with any number of ranks of DRAMs each. The following picture shows an overview of the memory organization and connections to the memory controller. General information about memory controllers can be found [http://en.wikipedia.org/wiki/Memory_controller here].
 
 
 
[[File:mc_overview.jpg|center]]
 
 
 
'' Note: ''
 
# <span style="color:#006400">''The product of the memory bus cycle multiplier, memory controller latency, and clock cycle time(=1/processor frequency) gives a first-order approximation of the latency of memory requests in time. The Memory Controller module refines this further by considering bank & bus contention, queueing effects of finite queues, and refreshes.'' </span>
 
# <span style="color:#006400">''Data sheet values for some components of the memory latency are specified in time (nanoseconds), whereas the Memory Controller module expects all delay configuration parameters in cycles. The parameters should be set appropriately taking into account the processor and bus frequencies.''</span>
 
# <span style="color:#006400">''The current implementation does not consider pin-bandwidth contention. Infinite bandwidth is assumed.''</span>
 
# <span style="color:#006400">''Only closed bank policy is currently implemented; that is, each bank is automatically closed after a single read or write.''</span>
 
# <span style="color:#006400">''The current implementation handles only a single channel. If you want multiple address/data channels, you need to instantiate multiple copies of this module.'' </span>
 
# <span style="color:#006400">''This is the only controller that is NOT specified in SLICC, but in C++.''</span>
 
 
 
'''Documentation source''': Most (but not all) of the writeup in this section is taken verbatim from documentation in the gem5 source files, rubyconfig.defaults file of GEMS, and a ppt created by Andy Phelps on Jan 18, 2008.
 
 
 
====== Related Files ======
 
 
 
* '''src/mem/ruby/system'''
 
** '''MemoryControl.hh''': This file declares the Memory Controller class.
 
** '''MemoryControl.cc''': This file implements all the operations of the memory controller. This includes processing of input packets, address decoding and bank selection, request scheduling and contention handling, handling refresh, returning completed requests to the directory controller.
 
** '''MemoryControl.py''': Configuration parameters
 
 
 
====== Configuration Parameters ======
 
 
 
* '''dimms_per_channel''': Currently the only thing that matters is the number of ranks per channel, i.e. the product of this parameter and '''ranks_per_dimm'''.  But if and when this is expanded to do FB-DIMMs, the distinction between the two will matter.
 
 
 
* ''Address Mapping'': This is controlled by configuration parameters '''banks_per_rank''', '''bank_bit_0''', '''ranks_per_dimm''', '''rank_bit_0''', '''dimms_per_channel''', '''dimm_bit_0'''.  You could choose to have the bank bits, rank bits, and DIMM bits in any order. For the default values, we assume this format for addresses:
 
** Offset within line:    [5:0]
 
** Memory controller #:    [7:6]
 
** Bank:                  [10:8]
 
** Rank:                    [11]
 
** DIMM:                    [12]
 
** Row addr / Col addr: [top:13]
 
 
 
If you get these bits wrong, then some banks won't see any requests; you need to check for this in the .stats output.
 
 
 
* '''mem_bus_cycle_multiplier''': Basic cycle time of the memory controller.  This defines the period which is used as the memory channel clock period, the address bus bit time, and the memory controller cycle time. Assuming a 200 MHz memory channel (DDR-400, which has 400 bits/sec data), and a 2 GHz processor clock, mem_bus_cycle_multiplier=10.
 
 
 
* '''mem_ctl_latency''': Latency to returning read request or writeback acknowledgement. Measured in memory address cycles. This equals tRCD + CL + AL + (four bit times) + (round trip on channel) + (memory control internal delays). It's going to be an approximation, so pick what you like. ''Note:  The fact that latency is a constant, and does not depend on two low-order address bits, implies that our memory controller either: (a) tells the DRAM to read the critical word first, and sends the critical word first back to the CPU, or (b) waits until it has seen all four bit times on the data wires before sending anything back.  Either is plausible.  If (a), remove the "four bit times" term from the calculation above.''
 
 
 
* '''rank_rank_delay''': This is how many memory address cycles to delay between reads to different ranks of DRAMs to allow for clock skew.
 
 
 
* '''read_write_delay''': This is how many memory address cycles to delay between a read and a write.  This is based on two things:  (1) the data bus is used one cycle earlier in the operation; (2) a round-trip wire delay from the controller to the DIMM that did the reading. Usually this is set to 2.
 
 
 
* '''basic_bus_busy_time''': Basic address and data bus occupancy.  If you are assuming a 16-byte-wide data bus (pairs of DIMMs side-by-side), then the data bus occupancy matches the address bus occupancy at 2 cycles.  But if the channel is only 8 bytes wide, you need to increase this bus occupancy time to 4 cycles.
 
 
 
* '''mem_random_arbitrate''':  By default, the memory controller uses round-robin to arbitrate between ready bank queues for use of the address bus.  If you wish to add randomness to the system, set this parameter to one instead, and it will restart the round-robin pointer at a random bank number each cycle.  If you want additional nondeterminism, set the parameter to some integer n >= 2, and it will in addition add a n% chance each cycle that a ready bank will be delayed an additional cycle.  Note that if you are in mem_fixed_delay mode (see below), mem_random_arbitrate=1 will have no effect, but mem_random_arbitrate=2 or more will.
 
 
 
* '''mem_fixed_delay''': If this is nonzero, it will disable the memory controller and instead give every request a fixed latency.  The nonzero value specified here is measured in memory cycles and is just added to MEM_CTL_LATENCY.  It will also show up in the stats file as a contributor to memory delays stalled at head of bank queue.
 
 
 
* '''tFAW''': This is an obscure DRAM parameter that says that no more than four activate requests can happen within a window of a certain size. For most configurations this does not come into play, or has very little effect, but it could be used to throttle the power consumption of the DRAM.  In this implementation (unlike in a DRAM data sheet) TFAW is measured in memory bus cycles; i.e. if TFAW = 16 then no more than four activates may happen within any 16 cycle window. Refreshes are included in the activates.
 
 
 
* '''refresh_period''': This is the number of memory cycles between refresh of row x in bank n and refresh of row x+1 in bank n.  For DDR-400, this is typically 7.8 usec for commercial systems; after 8192 such refreshes, this will have refreshed the whole chip in 64 msec.  If we have a 5 nsec memory clock, 7800 / 5 = 1560 cycles.  The memory controller will divide this by the total number of banks, and kick off a refresh to somebody every time that amount is counted down to zero. (There will be some rounding error there, but it should have minimal effect.)
 
 
 
* '''Typical Settings for configuration parameters''': The default values are for DDR-400 assuming a 2GHz processor clock. If instead of DDR-400, you wanted DDR-800, the channel gets faster but the basic operation of the DRAM core is unchanged. Busy times appear to double just because they are measured in smaller clock cycles.  The performance advantage comes because the bus busy times don't actually quite double. You would use something like these values:
 
 
 
:: mem_bus_cycle_multiplier: 5
 
:: bank_busy_time: 22
 
:: rank_rank_delay: 2
 
:: read_write_delay: 3
 
:: basic_bus_busy_time: 3
 
:: mem_ctl_latency: 20
 
:: refresh_period: 3120
 
 
 
====== Basic Operation ======
 
 
 
[[File:mc_data_struct.jpg|600px|center]]
 
 
 
* '''Data Structures'''
 
 
 
Requests are enqueued into a single input queue. Responses are dequeued from a single response queue. There is a single bank queue for each DRAM bank (the total number of banks is the number of DIMMs per channel x number of ranks per DIMM x number of banks per rank). Each bank also has a busy counter. tFAW shift registers are maintained per rank.
 
 
 
* '''Timing'''
 
 
 
[[File:mc_addr_command_timing.jpg|730px|left]] [[File:mc_addr_command_timing_back_to_back.jpg|730px|right]]
 
 
 
The “Act” (Activate) and “Rd” (Read) commands (or activate and write) always come as a pair, because we are modeling posted-CAS mode.  (In non-posted-CAS, the read or write command would be scheduled separately later.)  We do not explicitly model the separate commands; we simply say that the address bus occupancy is 2 cycles.
 
 
 
Since the data bus is also occupied for 2 cycles at a fixed offset in time as shown above, we do not need to explicitly model it; memory channel occupancy is still 2 cycles.
 
 
 
For back-to-back requests the data for the 2nd request could be delayed due to the following reasons:
 
** Read happens from a different rank
 
** Read is followed by a write
 
** 2nd request has a busy bank
 
** Basic request time > 2 (e.g. needs 8 data phits)
 
  
* '''Scheduling and Bank Contention'''
+
# '''[[MI_example]]''': example protocol, 1-level cache.
 +
# '''[[MESI_Two_Level]]''': single chip, 2-level caches, strictly-inclusive hierarchy.
 +
# '''[[MOESI_CMP_directory]]''': multiple chips, 2-level caches, non-inclusive (neither strictly inclusive nor exclusive) hierarchy.
 +
# '''[[MOESI_CMP_token]]''': 2-level caches. TODO.
 +
# '''[[MOESI_hammer]]''': single chip, 2-level private caches, strictly-exclusive hierarchy.
 +
# '''[[Garnet_standalone]]''': protocol to run the Garnet network in a standalone manner.
 +
# '''[[MESI Three Level]]''': 3-level caches, strictly-inclusive hierarchy.
  
The '''wakeup''' function, and in turn, the '''executeCycle''' function is tiggered once every memory clock cycle.
+
Commonly used notations and data structures in the protocols have been described in detail [[Cache Coherence Protocols|here]].
  
Each memory request is placed in a queue associated with a specific memory bank.  This queue is of finite size; if the queue is full the request will back up in an (infinite) common queue and will effectively throttle the whole system.  This sort of behavior is intended to be closer to real system behavior than if we had an infinite queue on each bank.  If you want the latter, just make the bank queues unreasonably large.
+
=== Protocol independent memory components ===
  
The head item on a bank queue is issued when all of the following are true:
+
# '''Sequencer'''
# The bank is available
+
# '''Cache Memory'''
# The address path to the DIMM is available
+
# '''Replacement Policies'''
# The data path to or from the DIMM is available
+
# '''Memory Controller'''
  
Note that we are not concerned about fixed offsets in time. The bank will not be used at the same moment as the address path, but since there is no queue in the DIMM or the DRAM it will be used at a constant number of cycles later, so it is treated as if it is used at the same time.
+
In general cache coherence protocol independent components comprises of the Sequencer, Cache Memory structure, [[Replacement_policy|replacement policies]] and the Memory controller. The Sequencer class is responsible for feeding the memory subsystem (including the caches and the off-chip memory) with load/store/atomic memory requests from the processor. Every memory request when completed by the memory subsystem also send back the response to the processor via the Sequencer. There is one Sequencer for each hardware thread (or core) simulated in the system. The Cache Memory models a set-associative cache structure with parameterizable size, associativity, and replacement policy. L1, L2, L3 caches in the system are instances of Cache Memory, if they exist. The replacement policies are kept modular from the Cache Memory, so that different instances of Cache Memory can use different replacement policies of their choice. The Memory Controller is responsible for simulating and servicing any request that misses on all the on-chip caches of the simulated system. Memory Controller currently simple, but models DRAM ban contention, DRAM refresh faithfully. It also models close-page policy for DRAM buffer.  
  
We are assuming "posted CAS"; that is, we send the READ or WRITE immediately after the ACTIVATE.  This makes scheduling the address bus trivial; we always schedule a fixed set of cycles.  For DDR-400, this is a set of two cycles; for some configurations such as DDR-800 the parameter tRRD forces this to be set to three cycles.
+
'''''Each component is described in details [[Coherence-Protocol-Independent Memory Components|here]].'''''
  
We assume a four-bit-time transfer on the data wires.  This is the minimum burst length for DDR-2.  This would correspond to (for example) a memory where each DIMM is 72 bits wide and DIMMs are ganged in pairs to deliver 64 bytes at a shot.This gives us the same occupancy on the data wires as on the address wires (for the two-address-cycle case).
+
=== Interconnection Network ===
  
The only non-trivial scheduling problem is the data wires. A write will use the wires earlier in the operation than a read will; typically one cycle earlier as seen at the DRAM, but earlier by a worst-case round-trip wire delay when seen at the memory controller. So, while reads from one rank can be scheduled back-to-back every two cycles, and writes (to any rank) scheduled every two cycles, when a read is followed by a write we need to insert a bubble. Furthermore, consecutive reads from two different ranks may need to insert a bubble due to skew between when one DRAM stops driving the wires and when the other one starts.  (These bubbles are parameters.)
+
The interconnection network connects the various components of the memory hierarchy (cache, memory, dma controllers) together.  
  
This means that when some number of reads and writes are at the heads of their queues, reads could starve writes, and/or reads to the same rank could starve out other requests, since the others would never see the data bus ready. For this reason, we have implemented an anti-starvation feature. A group of requests is marked "old", and a counter is incremented each cycle as long as any request from that batch has not issued. If the counter reaches twice the bank busy time, we hold off any newer requests until all of the "old" requests have issued.
+
[[File:Interconnection_network.jpg|600px|center]]
  
==== Interconnection Network ====
+
The key components of an interconnection network are:
The various controllers of the memory hierarchy (L1/L2 caches, Directory, DMA etc), specified by the coherence protocol, are connected together via an interconnection network. The primary components of this network are  
 
# Topology
 
# Routing
 
# Router Microarchitecture and Flow-Control
 
  
===== Topology =====
+
# '''Topology'''
The connection between the various controllers are specified via python files.
+
# '''Routing'''
* '''Related Files''':
+
# '''Flow Control'''
**  '''src/mem/ruby/network/topologies/Pt2Pt.py'''
+
# '''Router Microarchitecture'''
**  '''src/mem/ruby/network/topologies/Crossbar.py'''
 
** ''' src/mem/ruby/network/topologies/Mesh.py'''
 
**  '''src/mem/ruby/network/topologies/MeshDirCorners.py'''
 
** '''src/mem/ruby/network/Network.py'''
 
  
* '''Topology Descriptions''':
+
'''''More details about the network model implementation are described [[Interconnection Network|here]].'''''
** '''Pt2Pt''': Each controller (L1/L2/Directory) is connected to every other controller via a direct link. This can be invoked from command line by '''--topology=Pt2Pt'''.
 
** '''Crossbar''': Each controller (L1/L2/Directory) is connected to every other controller via one switch (modeling the crossbar). This can be invoked from command line by '''--topology=Crossbar'''.
 
** '''Mesh''': This topology requires the number of directories to be equal to the number of cpus. The number of routers/switches is equal to the number of cpus in the system. Each router/switch is connected to one L1, one L2 (if present), and one Directory. It can be invoked from command line by '''--topology=Mesh'''. The number of rows in the mesh '''has to be specified''' by '''--mesh-rows'''. This parameter enables the creation of non-symmetrical meshes too.
 
** '''MeshDirCorners''': This topology requires the number of directories to be equal to 4.  number of routers/switches is equal to the number of cpus in the system. Each router/switch is connected to one L1, one L2 (if present). Each corner router/switch is connected to one Directory. It can be invoked from command line by '''--topology=MeshDirCorners'''. The number of rows in the mesh '''has to be specified''' by '''--mesh-rows'''.
 
  
* Additional parameters
+
Alternatively, Interconnection network could be replaced with the external simulator [http://www.atc.unican.es/topaz/  TOPAZ]. This simulator is ready to run within gem5 and adds a significant number of [https://sites.google.com/site/atcgalerna/home-1/publications/files/NOCS-2012_Topaz.pdf?attredirects=0  features] over original ruby network simulator. It includes, new advanced router micro-architectures, new topologies, precision-performance adjustable router models, mechanisms to speed-up network simulation, etc ... The presentation of the tool (and the reason why is not included in the gem5 repostories)  is [http://thread.gmane.org/gmane.comp.emulators.m5.users/9651 here]
The topology file can be used to specify additional link-level parameters (defaults in Network.py).
 
** '''latency''': latency of traversal within the link.
 
** '''weight''': weight associated with this link. This parameter is used by the routing table while deciding routes, as explained next in [[Ruby#Routing|Routing]].
 
** '''bw_multiplier''': used by simple network to model different link bandwidths. This parameter is specified in 1000th of a byte, and the individual link bandwidth = bw_multiplier x endpoint_bandwidth (specified in Network.py).
 
  
 +
== Life of a memory request in Ruby ==
  
===== Routing =====
 
Based on the topology, shortest path graph traversals are used to populate ''routing tables'' at each router/switch. The default routing algorithm tries to choose the route with minimum number of link traversals. Links can be given weights in the topology files to model different routing algorithms. For example, in Mesh.py and MeshDirCorners.py Y-direction links are given weights of 2, while X-direction links are given weights of 1, resulting in XY traversals in a mesh. '''adaptive_routing''' (in src/mem/ruby/network/SimpleNetwork.py) can be enabled to make the simple network choose routes based on occupancy of queues at each output port.
 
 
 
===== Router Microarchitecture and Flow-Control =====
 
 
Ruby supports two network models, Simple and Garnet, which trade-off detailed modeling versus simulation speed respectively.
 
 
* '''Related Files''':
 
** src/mem/ruby/network/Network.py
 
** src/mem/ruby/network/simple
 
**  src/mem/ruby/network/garnet/BaseGarnetNetwork.py
 
**  src/mem/ruby/network/garnet/fixed-pipeline
 
**  src/mem/ruby/network/garnet/flexible-pipeline
 
 
====== Configuration and Setup ======
 
 
The default network model in Ruby is the [[Ruby#Simple_Network|simple]] network.
 
[[Ruby#Garnet_Networks|Garnet]] fixed-pipeline or flexible-pipeline networks can be enabled by adding '''--garnet-network=fixed''', or '''--garnet-network=flexible''' on the command line, respectively.
 
 
TODO: Network.py parameters
 
 
 
====== Simple Network ======
 
 
TODO
 
 
====== Garnet Networks ======
 
 
Garnet is an on-chip network simulator that provides the ability to model pipelined routers, unlike the simple network.
 
Garnet comes in two flavors: fixed-pipeline, and flexible-pipeline. The garnet research paper can be found [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4919636 here].
 
 
Garnet is a detailed interconnection network model inside GEM5. It consists of a detailed ''fixed-pipeline'' model and an approximate ''flexible-pipeline'' model. The ''fixed-pipeline'' model is intended for low-level interconnection network evaluations and models the detailed features of a  network. Researchers interested in investigating different network microarchitectures can readily modify the modeled microarchitecture and pipeline. Also, for system level evaluations that are not concerned with the detailed network characteristics, this model provides an accurate network model and should be used as the default model. The flexible pipeline model is intended to provide a reasonable abstraction of all interconnection network models, while allowing the router pipeline depth to be flexible adjusted. A router pipeline might range from a single cycle to several cycles. For evaluations that wish to easily change the router pipeline depth, the "flexible pipeline" model provides a neat abstraction that can be used.
 
 
The input parameters of garnet are specified in BaseGarnetNetwork.py.
 
They are
 
* '''flit_size''': flit size in bytes. Protocol control and data messages are broken down into multiple ''flits'' or flow-control units inside garnet, based on the size of the message (m_control_msg_size and m_data_msg_size in Network.cc), and the size of each flit.
 
* '''vcs_per_class''': Number of Virtual Channels per input port.
 
 
Each cache/dir/dma controller connects to a network interface inside garnet. The network interface breaks down the message into
 
multiple flits, based on the size of the message (m_control_msg_size and m_data_msg_size in Network.cc), and the flit_size.
 
 
* '''Garnet Fixed Pipeline Network'''
 
 
The garnet fixed-pipeline models a classic 5-stage Virtual Channel router. The 5-stages are:
 
# Buffer Write:
 
 
TODO
 
 
* '''Garnet Flexible Pipeline Network'''
 
 
TODO
 
 
 
==== Life of a memory request in Ruby ====
 
 
In this section we will provide a high level overview of how a memory request is serviced by Ruby as a whole and what components in Ruby it goes through. For detailed operations within each components though, refer to previous sections describing each component in isolation.
 
In this section we will provide a high level overview of how a memory request is serviced by Ruby as a whole and what components in Ruby it goes through. For detailed operations within each components though, refer to previous sections describing each component in isolation.
  
# A memory request from a core or hardware context of M5 enters the jurisdiction of Ruby through the '''''RubyPort::recvTiming''''' interface (in src/mem/ruby/system/RubyPort.hh/cc). The number of Rubyport instantiation in the simulated system is equal to the number of hardware thread context or cores (in case of ''non-multithreaded'' cores). A port from the side of each core is tied to a corresponding RubyPort.  
+
# A memory request from a core or hardware context of gem5 enters the jurisdiction of Ruby through the '''''RubyPort::recvTiming''''' interface (in src/mem/ruby/system/RubyPort.hh/cc). The number of Rubyport instantiation in the simulated system is equal to the number of hardware thread context or cores (in case of ''non-multithreaded'' cores). A port from the side of each core is tied to a corresponding RubyPort.  
# The memory request arrives as a M5 packet and RubyPort is responsible for converting it to a RubyRequest object that is understood by various components of Ruby. It also finds out if the request is for some PIO or not and maneuvers the packet to correct PIO. Finally once it has generated the corresponding RubyRequest object and ascertained that the request is a ''normal'' memory request (not PIO access), it passes the request to the '''''Sequencer::makeRequest''''' interface of the attached Sequencer object with the port (variable ''ruby_port'' holds the pointer to it). Observe that Sequencer class itself is a derived class from the RubyPort class.
+
# The memory request arrives as a gem5 packet and RubyPort is responsible for converting it to a RubyRequest object that is understood by various components of Ruby. It also finds out if the request is for some PIO or not and maneuvers the packet to correct PIO. Finally once it has generated the corresponding RubyRequest object and ascertained that the request is a ''normal'' memory request (not PIO access), it passes the request to the '''''Sequencer::makeRequest''''' interface of the attached Sequencer object with the port (variable ''ruby_port'' holds the pointer to it). Observe that Sequencer class itself is a derived class from the RubyPort class.
# As mentioned in the section describing Sequencer class of Ruby, there are as many objects of Sequencer in a simulated system as the number of hardware thread context (which is also equal to the number of RubyPort object in the system) and there is one-to-one mapping between the Sequencer objects and the hardware thread context. Once a memory request arrives at the '''''Sequencer::makeRequest''''', it does various accounting and resource allocation for the request and finally pushes the request to the Ruby's coherent cache hierarchy for satisfying the request while accounting for the delay in servicing the same. The request is pushed to the Cache hierarchy by enqueueing the request to the ''mandatory queue'' after accounting for L1 cache access latency. The ''mandatory queue'' (variable name ''m_mandatory_q_ptr'') effectively acts as the interface between the Sequencer and the SLICC generated cache coherence files.
+
# As mentioned in the section describing Sequencer class of Ruby, there are as many objects of Sequencer in a simulated system as the number of hardware thread context (which is also equal to the number of RubyPort object in the system) and there is an one-to-one mapping between the Sequencer objects and the hardware thread context. Once a memory request arrives at the '''''Sequencer::makeRequest''''', it does various accounting and resource allocation for the request and finally pushes the request to the Ruby's coherent cache hierarchy for satisfying the request while accounting for the delay in servicing the same. The request is pushed to the Cache hierarchy by enqueueing the request to the ''mandatory queue'' after accounting for L1 cache access latency. The ''mandatory queue'' (variable name ''m_mandatory_q_ptr'') effectively acts as the interface between the Sequencer and the SLICC generated cache coherence files.
 
# L1 cache controllers (generated by SLICC according to the coherence protocol specifications) dequeues request from the ''mandatory queue'' and looks up the cache, makes necessary coherence state transitions and/or pushes the request to the next level of cache hierarchy as per the requirements. Different controller and components of SLICC generated Ruby code communicates among themselves through instantiations of ''MessageBuffer'' class of Ruby (src/mem/ruby/buffers/MessageBuffer.cc/hh) , which can act as ordered or unordered buffer or queues. Also the delays in servicing different steps for satisfying a memory request gets accounted for scheduling enqueue-ing and dequeue-ing operations accordingly. If the requested cache block may be found in L1 caches and with required coherence permissions then the request is satisfied and immediately returned. Otherwise the request is pushed to the next level of cache hierarchy through ''MessageBuffer''. A request can go all the way up to the Ruby's Memory Controller (also called Directory in many protocols).  Once the request get satisfied it is pushed upwards in the hierarchy through ''MessageBuffer''s.  
 
# L1 cache controllers (generated by SLICC according to the coherence protocol specifications) dequeues request from the ''mandatory queue'' and looks up the cache, makes necessary coherence state transitions and/or pushes the request to the next level of cache hierarchy as per the requirements. Different controller and components of SLICC generated Ruby code communicates among themselves through instantiations of ''MessageBuffer'' class of Ruby (src/mem/ruby/buffers/MessageBuffer.cc/hh) , which can act as ordered or unordered buffer or queues. Also the delays in servicing different steps for satisfying a memory request gets accounted for scheduling enqueue-ing and dequeue-ing operations accordingly. If the requested cache block may be found in L1 caches and with required coherence permissions then the request is satisfied and immediately returned. Otherwise the request is pushed to the next level of cache hierarchy through ''MessageBuffer''. A request can go all the way up to the Ruby's Memory Controller (also called Directory in many protocols).  Once the request get satisfied it is pushed upwards in the hierarchy through ''MessageBuffer''s.  
 +
# The ''MessageBuffers'' also act as entry point of coherence messages to the on-chip interconnect modeled. The MesageBuffers are connected according to the interconnect topology specified. The coherence messages thus travel through this on-chip interconnect accordingly. 
 
# Once the requested cache block is available at L1 cache with desired coherence permissions, the L1 cache controller informs the corresponding Sequencer object by calling its '''''readCallback''''' or ''''writeCallback''''' method depending upon the type of the request. Note that by the time these methods on Sequencer are called the latency of servicing the request has been implicitly accounted for.
 
# Once the requested cache block is available at L1 cache with desired coherence permissions, the L1 cache controller informs the corresponding Sequencer object by calling its '''''readCallback''''' or ''''writeCallback''''' method depending upon the type of the request. Note that by the time these methods on Sequencer are called the latency of servicing the request has been implicitly accounted for.
# The Sequencer then clears up the accounting information for the corresponding request and then calls the '''''RubyPort::ruby_hit_callback''''' method. This ultimately returns the result of the request to the corresponding port of the core/ hardware context of the frontend (M5).
+
# The Sequencer then clears up the accounting information for the corresponding request and then calls the '''''RubyPort::ruby_hit_callback''''' method. This ultimately returns the result of the request to the corresponding port of the core/ hardware context of the frontend (gem5).
 +
 
 +
== Directory Structure ==
 +
 
 +
* '''src/mem/'''
 +
** '''protocols''': SLICC specification for coherence protocols
 +
** '''slicc''': implementation for SLICC parser and code generator
 +
** '''ruby'''
 +
*** '''common''': frequently used data structures, e.g. Address (with bit-manipulation methods), histogram, data block
 +
*** '''filters''': various Bloom filters (stale code from GEMS)
 +
*** '''network''': Interconnect implementation, sample topology specification, network power calculations, message buffers used for connecting controllers
 +
*** '''profiler''': Profiling for cache events, memory controller events
 +
*** '''recorder''':  Cache warmup and access trace recording
 +
*** '''slicc_interface''': Message data structure, various mappings (e.g. address to directory node), utility functions (e.g. conversion between address & int, convert address to cache line address)
 +
*** '''structures''': Protocol independent memory components – CacheMemory, DirectoryMemory
 +
*** '''system''': Glue components – Sequencer, RubyPort, RubySystem

Latest revision as of 06:31, 5 November 2019

High level components of Ruby

Ruby implements a detailed simulation model for the memory subsystem. It models inclusive/exclusive cache hierarchies with various replacement policies, coherence protocol implementations, interconnection networks, DMA and memory controllers, various sequencers that initiate memory requests and handle responses. The models are modular, flexible and highly configurable. Three key aspects of these models are:

  1. Separation of concerns -- for example, the coherence protocol specifications are separate from the replacement policies and cache index mapping, the network topology is specified separately from the implementation.
  2. Rich configurability -- almost any aspect affecting the memory hierarchy functionality and timing can be controlled.
  3. Rapid prototyping -- a high-level specification language, SLICC, is used to specify functionality of various controllers.

The following picture, taken from the GEMS tutorial in ISCA 2005, shows a high-level view of the main components in Ruby.

Ruby overview.jpg

SLICC + Coherence protocols:

SLICC stands for Specification Language for Implementing Cache Coherence. It is a domain specific language that is used for specifying cache coherence protocols. In essence, a cache coherence protocol behaves like a state machine. SLICC is used for specifying the behavior of the state machine. Since the aim is to model the hardware as close as possible, SLICC imposes constraints on the state machines that can be specified. For example, SLICC can impose restrictions on the number of transitions that can take place in a single cycle. Apart from protocol specification, SLICC also combines together some of the components in the memory model. As can be seen in the following picture, the state machine takes its input from the input ports of the inter-connection network and queues the output at the output ports of the network, thus tying together the cache / memory controllers with the inter-connection network itself.

Slicc overview.jpg

The following cache coherence protocols are supported:

  1. MI_example: example protocol, 1-level cache.
  2. MESI_Two_Level: single chip, 2-level caches, strictly-inclusive hierarchy.
  3. MOESI_CMP_directory: multiple chips, 2-level caches, non-inclusive (neither strictly inclusive nor exclusive) hierarchy.
  4. MOESI_CMP_token: 2-level caches. TODO.
  5. MOESI_hammer: single chip, 2-level private caches, strictly-exclusive hierarchy.
  6. Garnet_standalone: protocol to run the Garnet network in a standalone manner.
  7. MESI Three Level: 3-level caches, strictly-inclusive hierarchy.

Commonly used notations and data structures in the protocols have been described in detail here.

Protocol independent memory components

  1. Sequencer
  2. Cache Memory
  3. Replacement Policies
  4. Memory Controller

In general cache coherence protocol independent components comprises of the Sequencer, Cache Memory structure, replacement policies and the Memory controller. The Sequencer class is responsible for feeding the memory subsystem (including the caches and the off-chip memory) with load/store/atomic memory requests from the processor. Every memory request when completed by the memory subsystem also send back the response to the processor via the Sequencer. There is one Sequencer for each hardware thread (or core) simulated in the system. The Cache Memory models a set-associative cache structure with parameterizable size, associativity, and replacement policy. L1, L2, L3 caches in the system are instances of Cache Memory, if they exist. The replacement policies are kept modular from the Cache Memory, so that different instances of Cache Memory can use different replacement policies of their choice. The Memory Controller is responsible for simulating and servicing any request that misses on all the on-chip caches of the simulated system. Memory Controller currently simple, but models DRAM ban contention, DRAM refresh faithfully. It also models close-page policy for DRAM buffer.

Each component is described in details here.

Interconnection Network

The interconnection network connects the various components of the memory hierarchy (cache, memory, dma controllers) together.

Interconnection network.jpg

The key components of an interconnection network are:

  1. Topology
  2. Routing
  3. Flow Control
  4. Router Microarchitecture

More details about the network model implementation are described here.

Alternatively, Interconnection network could be replaced with the external simulator TOPAZ. This simulator is ready to run within gem5 and adds a significant number of features over original ruby network simulator. It includes, new advanced router micro-architectures, new topologies, precision-performance adjustable router models, mechanisms to speed-up network simulation, etc ... The presentation of the tool (and the reason why is not included in the gem5 repostories) is here

Life of a memory request in Ruby

In this section we will provide a high level overview of how a memory request is serviced by Ruby as a whole and what components in Ruby it goes through. For detailed operations within each components though, refer to previous sections describing each component in isolation.

  1. A memory request from a core or hardware context of gem5 enters the jurisdiction of Ruby through the RubyPort::recvTiming interface (in src/mem/ruby/system/RubyPort.hh/cc). The number of Rubyport instantiation in the simulated system is equal to the number of hardware thread context or cores (in case of non-multithreaded cores). A port from the side of each core is tied to a corresponding RubyPort.
  2. The memory request arrives as a gem5 packet and RubyPort is responsible for converting it to a RubyRequest object that is understood by various components of Ruby. It also finds out if the request is for some PIO or not and maneuvers the packet to correct PIO. Finally once it has generated the corresponding RubyRequest object and ascertained that the request is a normal memory request (not PIO access), it passes the request to the Sequencer::makeRequest interface of the attached Sequencer object with the port (variable ruby_port holds the pointer to it). Observe that Sequencer class itself is a derived class from the RubyPort class.
  3. As mentioned in the section describing Sequencer class of Ruby, there are as many objects of Sequencer in a simulated system as the number of hardware thread context (which is also equal to the number of RubyPort object in the system) and there is an one-to-one mapping between the Sequencer objects and the hardware thread context. Once a memory request arrives at the Sequencer::makeRequest, it does various accounting and resource allocation for the request and finally pushes the request to the Ruby's coherent cache hierarchy for satisfying the request while accounting for the delay in servicing the same. The request is pushed to the Cache hierarchy by enqueueing the request to the mandatory queue after accounting for L1 cache access latency. The mandatory queue (variable name m_mandatory_q_ptr) effectively acts as the interface between the Sequencer and the SLICC generated cache coherence files.
  4. L1 cache controllers (generated by SLICC according to the coherence protocol specifications) dequeues request from the mandatory queue and looks up the cache, makes necessary coherence state transitions and/or pushes the request to the next level of cache hierarchy as per the requirements. Different controller and components of SLICC generated Ruby code communicates among themselves through instantiations of MessageBuffer class of Ruby (src/mem/ruby/buffers/MessageBuffer.cc/hh) , which can act as ordered or unordered buffer or queues. Also the delays in servicing different steps for satisfying a memory request gets accounted for scheduling enqueue-ing and dequeue-ing operations accordingly. If the requested cache block may be found in L1 caches and with required coherence permissions then the request is satisfied and immediately returned. Otherwise the request is pushed to the next level of cache hierarchy through MessageBuffer. A request can go all the way up to the Ruby's Memory Controller (also called Directory in many protocols). Once the request get satisfied it is pushed upwards in the hierarchy through MessageBuffers.
  5. The MessageBuffers also act as entry point of coherence messages to the on-chip interconnect modeled. The MesageBuffers are connected according to the interconnect topology specified. The coherence messages thus travel through this on-chip interconnect accordingly.
  6. Once the requested cache block is available at L1 cache with desired coherence permissions, the L1 cache controller informs the corresponding Sequencer object by calling its readCallback or 'writeCallback method depending upon the type of the request. Note that by the time these methods on Sequencer are called the latency of servicing the request has been implicitly accounted for.
  7. The Sequencer then clears up the accounting information for the corresponding request and then calls the RubyPort::ruby_hit_callback method. This ultimately returns the result of the request to the corresponding port of the core/ hardware context of the frontend (gem5).

Directory Structure

  • src/mem/
    • protocols: SLICC specification for coherence protocols
    • slicc: implementation for SLICC parser and code generator
    • ruby
      • common: frequently used data structures, e.g. Address (with bit-manipulation methods), histogram, data block
      • filters: various Bloom filters (stale code from GEMS)
      • network: Interconnect implementation, sample topology specification, network power calculations, message buffers used for connecting controllers
      • profiler: Profiling for cache events, memory controller events
      • recorder: Cache warmup and access trace recording
      • slicc_interface: Message data structure, various mappings (e.g. address to directory node), utility functions (e.g. conversion between address & int, convert address to cache line address)
      • structures: Protocol independent memory components – CacheMemory, DirectoryMemory
      • system: Glue components – Sequencer, RubyPort, RubySystem