Difference between revisions of "Ruby"

From gem5
Jump to: navigation, search
(Protocol Independent Memory components)
(Update replacement policy information)
 
(39 intermediate revisions by 6 users not shown)
Line 1: Line 1:
 
== High level components of Ruby ==
 
== High level components of Ruby ==
  
Ruby implements a detailed simulation model for the memory subsystem. It models inclusive/exclusive cache hierarchies with various replacement policies, coherence protocol implementations, interconnection networks, DMA and memory controllers, various sequencers that initiate memory requests and handle responses. The models are modular, flexible and highly configurable. Three key aspects of these models are:
+
Ruby implements a detailed simulation model for the memory subsystem. It models inclusive/exclusive cache hierarchies with various [[Replacement_policy|replacement policies]], coherence protocol implementations, interconnection networks, DMA and memory controllers, various sequencers that initiate memory requests and handle responses. The models are modular, flexible and highly configurable. Three key aspects of these models are:
  
 
# Separation of concerns -- for example, the coherence protocol specifications are separate from the replacement policies and cache index mapping, the network topology is specified separately from the implementation.
 
# Separation of concerns -- for example, the coherence protocol specifications are separate from the replacement policies and cache index mapping, the network topology is specified separately from the implementation.
Line 11: Line 11:
  
 
=== SLICC + Coherence protocols: ===
 
=== SLICC + Coherence protocols: ===
    Need to say what is SLICC and whats its purpose.  
+
 
    Talk about high level strcture of a typical coherence protocol file, that SLICC uses to generate code.  
+
'''''[[SLICC]]''''' stands for ''Specification Language for Implementing Cache Coherence''. It is a domain specific language that is used for specifying cache coherence protocols. In essence, a cache coherence protocol behaves like a state machine. SLICC is used for specifying the behavior of the state machine. Since the aim is to model the hardware as close as possible, SLICC imposes constraints on the state machines that can be specified. For example, SLICC can impose restrictions on the number of transitions that can take place in a single cycle. Apart from protocol specification, SLICC also combines together some of the components in the memory model. As can be seen in the following picture, the state machine takes its input from the input ports of the inter-connection network and queues the output at the output ports of the network, thus tying together the cache / memory controllers with the inter-connection network itself.
    A simple example structure from protocol like MI_example can help here.
 
  
[[SLICC]] stands for ''Specification Language for Implementing Cache Coherence''. It is a domain specific language that is used for specifying cache coherence protocols. In essence, a cache coherence protocol behaves like a state machine. SLICC is used for specifying the behavior of the state machine. Since the aim is to model the hardware as close as possible, SLICC imposes constraints on the state machines that can be specified. For example, SLICC can impose restrictions on the number of transitions that can take place in a single cycle. Apart from protocol specification, SLICC also combines together some of the components in the memory model. As can be seen in the following picture, the state machine takes its input from the input ports of the inter-connection network and queues the output at the output ports of the network, thus tying together the cache / memory controllers with the inter-connection network itself.
+
[[File:slicc_overview.jpg|700px|center]]
  
[[File:slicc_overview.jpg|700px|center]]
+
The following cache coherence protocols are supported:
  
 +
# '''[[MI_example]]''': example protocol, 1-level cache.
 +
# '''[[MESI_Two_Level]]''': single chip, 2-level caches, strictly-inclusive hierarchy.
 +
# '''[[MOESI_CMP_directory]]''': multiple chips, 2-level caches, non-inclusive (neither strictly inclusive nor exclusive) hierarchy.
 +
# '''[[MOESI_CMP_token]]''': 2-level caches. TODO.
 +
# '''[[MOESI_hammer]]''': single chip, 2-level private caches, strictly-exclusive hierarchy.
 +
# '''[[Garnet_standalone]]''': protocol to run the Garnet network in a standalone manner.
 +
# '''[[MESI Three Level]]''': 3-level caches, strictly-inclusive hierarchy.
  
[[Cache Coherence Protocols]]
+
Commonly used notations and data structures in the protocols have been described in detail [[Cache Coherence Protocols|here]].
  
 
=== Protocol independent memory components ===
 
=== Protocol independent memory components ===
# Cache Memory
 
# Replacement Policies
 
# Memory Controller
 
  
''Arka will do it''
+
# '''Sequencer'''
 +
# '''Cache Memory'''
 +
# '''Replacement Policies'''
 +
# '''Memory Controller'''
 +
 
 +
In general cache coherence protocol independent components comprises of the Sequencer, Cache Memory structure, [[Replacement_policy|replacement policies]] and the Memory controller. The Sequencer class is responsible for feeding the memory subsystem (including the caches and the off-chip memory) with load/store/atomic memory requests from the processor. Every memory request when completed by the memory subsystem also send back the response to the processor via the Sequencer. There is one Sequencer for each hardware thread (or core) simulated in the system. The Cache Memory models a set-associative cache structure with parameterizable size, associativity, and replacement policy. L1, L2, L3 caches in the system are instances of Cache Memory, if they exist. The replacement policies are kept modular from the Cache Memory, so that different instances of Cache Memory can use different replacement policies of their choice. The Memory Controller is responsible for simulating and servicing any request that misses on all the on-chip caches of the simulated system. Memory Controller currently simple, but models DRAM ban contention, DRAM refresh faithfully. It also models close-page policy for DRAM buffer. 
 +
 
 +
'''''Each component is described in details [[Coherence-Protocol-Independent Memory Components|here]].'''''
  
 
=== Interconnection Network ===
 
=== Interconnection Network ===
Line 42: Line 52:
 
# '''Router Microarchitecture'''
 
# '''Router Microarchitecture'''
  
More details about the network model implementation are described [[#Interconnection_Network_2|here]]
+
'''''More details about the network model implementation are described [[Interconnection Network|here]].'''''
 +
 
 +
Alternatively, Interconnection network could be replaced with the external simulator [http://www.atc.unican.es/topaz/  TOPAZ]. This simulator is ready to run within gem5 and adds a significant number of [https://sites.google.com/site/atcgalerna/home-1/publications/files/NOCS-2012_Topaz.pdf?attredirects=0  features] over original ruby network simulator. It includes, new advanced router micro-architectures, new topologies, precision-performance adjustable router models, mechanisms to speed-up network simulation, etc ... The presentation of the tool (and the reason why is not included in the gem5 repostories)  is [http://thread.gmane.org/gmane.comp.emulators.m5.users/9651 here]
  
== Implementation of Ruby ==
+
== Life of a memory request in Ruby ==
  
=== Directory Structure ===
+
In this section we will provide a high level overview of how a memory request is serviced by Ruby as a whole and what components in Ruby it goes through. For detailed operations within each components though, refer to previous sections describing each component in isolation.
 +
 
 +
# A memory request from a core or hardware context of gem5 enters the jurisdiction of Ruby through the '''''RubyPort::recvTiming''''' interface (in src/mem/ruby/system/RubyPort.hh/cc). The number of Rubyport instantiation in the simulated system is equal to the number of hardware thread context or cores (in case of ''non-multithreaded'' cores). A port from the side of each core is tied to a corresponding RubyPort.
 +
# The memory request arrives as a gem5 packet and RubyPort is responsible for converting it to a RubyRequest object that is understood by various components of Ruby. It also finds out if the request is for some PIO or not and maneuvers the packet to correct PIO. Finally once it has generated the corresponding RubyRequest object and ascertained that the request is a ''normal'' memory request (not PIO access), it passes the request to the '''''Sequencer::makeRequest''''' interface of the attached Sequencer object with the port (variable ''ruby_port'' holds the pointer to it). Observe that Sequencer class itself is a derived class from the RubyPort class.
 +
# As mentioned in the section describing Sequencer class of Ruby, there are as many objects of Sequencer in a simulated system as the number of hardware thread context (which is also equal to the number of RubyPort object in the system) and there is an one-to-one mapping between the Sequencer objects and the hardware thread context. Once a memory request arrives at the '''''Sequencer::makeRequest''''', it does various accounting and resource allocation for the request and finally pushes the request to the Ruby's coherent cache hierarchy for satisfying the request while accounting for the delay in servicing the same. The request is pushed to the Cache hierarchy by enqueueing the request to the ''mandatory queue'' after accounting for L1 cache access latency. The ''mandatory queue'' (variable name ''m_mandatory_q_ptr'') effectively acts as the interface between the Sequencer and the SLICC generated cache coherence files.
 +
# L1 cache controllers (generated by SLICC according to the coherence protocol specifications) dequeues request from the ''mandatory queue'' and looks up the cache, makes necessary coherence state transitions and/or pushes the request to the next level of cache hierarchy as per the requirements. Different controller and components of SLICC generated Ruby code communicates among themselves through instantiations of ''MessageBuffer'' class of Ruby (src/mem/ruby/buffers/MessageBuffer.cc/hh) , which can act as ordered or unordered buffer or queues. Also the delays in servicing different steps for satisfying a memory request gets accounted for scheduling enqueue-ing and dequeue-ing operations accordingly. If the requested cache block may be found in L1 caches and with required coherence permissions then the request is satisfied and immediately returned. Otherwise the request is pushed to the next level of cache hierarchy through ''MessageBuffer''. A request can go all the way up to the Ruby's Memory Controller (also called Directory in many protocols).  Once the request get satisfied it is pushed upwards in the hierarchy through ''MessageBuffer''s.
 +
# The ''MessageBuffers'' also act as entry point of coherence messages to the on-chip interconnect modeled. The MesageBuffers are connected according to the interconnect topology specified. The coherence messages thus travel through this on-chip interconnect accordingly. 
 +
# Once the requested cache block is available at L1 cache with desired coherence permissions, the L1 cache controller informs the corresponding Sequencer object by calling its '''''readCallback''''' or ''''writeCallback''''' method depending upon the type of the request. Note that by the time these methods on Sequencer are called the latency of servicing the request has been implicitly accounted for.
 +
# The Sequencer then clears up the accounting information for the corresponding request and then calls the '''''RubyPort::ruby_hit_callback''''' method. This ultimately returns the result of the request to the corresponding port of the core/ hardware context of the frontend (gem5).
 +
 
 +
== Directory Structure ==
  
 
* '''src/mem/'''
 
* '''src/mem/'''
Line 52: Line 74:
 
** '''slicc''': implementation for SLICC parser and code generator
 
** '''slicc''': implementation for SLICC parser and code generator
 
** '''ruby'''
 
** '''ruby'''
*** '''buffers''': implementation for message buffers that are used for exchanging information between the cache, directory, memory controllers and the interconnect
+
*** '''common''': frequently used data structures, e.g. Address (with bit-manipulation methods), histogram, data block
*** '''common''': frequently used data structures, e.g. Address (with bit-manipulation methods), histogram, data block, basic types (int32, uint64, etc.)
 
*** '''eventqueue''': Ruby’s event queue API for scheduling events on the gem5 event queue
 
 
*** '''filters''': various Bloom filters (stale code from GEMS)
 
*** '''filters''': various Bloom filters (stale code from GEMS)
*** '''network''': Interconnect implementation, sample topology specification, network power calculations
+
*** '''network''': Interconnect implementation, sample topology specification, network power calculations, message buffers used for connecting controllers
 
*** '''profiler''': Profiling for cache events, memory controller events
 
*** '''profiler''': Profiling for cache events, memory controller events
 
*** '''recorder''':  Cache warmup and access trace recording
 
*** '''recorder''':  Cache warmup and access trace recording
 
*** '''slicc_interface''': Message data structure, various mappings (e.g. address to directory node), utility functions (e.g. conversion between address & int, convert address to cache line address)
 
*** '''slicc_interface''': Message data structure, various mappings (e.g. address to directory node), utility functions (e.g. conversion between address & int, convert address to cache line address)
*** '''system''': Protocol independent memory components – CacheMemory, DirectoryMemory, Sequencer, RubyPort
+
*** '''structures''': Protocol independent memory components – CacheMemory, DirectoryMemory
 
+
*** '''system''': Glue components Sequencer, RubyPort, RubySystem
=== Protocols ===
 
 
 
=== Protocol Independent Memory components ===
 
 
 
[[Coherence-Protocol-Independent Memory Components]]
 
 
 
=== Interconnection Network ===
 
The various controllers of the memory hierarchy (L1/L2 caches, Directory, DMA etc), specified by the coherence protocol, are connected together via an interconnection network. The primary components of this network are
 
# Topology
 
# Routing
 
# Flow-Control
 
# Router Microarchitecture
 
 
 
==== Topology ====
 
The connection between the various controllers are specified via python files.
 
* '''Related Files''':
 
**  '''src/mem/ruby/network/topologies/Pt2Pt.py'''
 
**  '''src/mem/ruby/network/topologies/Crossbar.py'''
 
** ''' src/mem/ruby/network/topologies/Mesh.py'''
 
**  '''src/mem/ruby/network/topologies/MeshDirCorners.py'''
 
** '''src/mem/ruby/network/Network.py'''
 
 
 
* '''Topology Descriptions''':
 
** '''Pt2Pt''': Each controller (L1/L2/Directory) is connected to every other controller via a direct link. This can be invoked from command line by '''--topology=Pt2Pt'''.
 
** '''Crossbar''': Each controller (L1/L2/Directory) is connected to every other controller via one switch (modeling the crossbar). This can be invoked from command line by '''--topology=Crossbar'''.
 
** '''Mesh''': This topology requires the number of directories to be equal to the number of cpus. The number of routers/switches is equal to the number of cpus in the system. Each router/switch is connected to one L1, one L2 (if present), and one Directory. It can be invoked from command line by '''--topology=Mesh'''. The number of rows in the mesh '''has to be specified''' by '''--mesh-rows'''. This parameter enables the creation of non-symmetrical meshes too.
 
** '''MeshDirCorners''': This topology requires the number of directories to be equal to 4.  number of routers/switches is equal to the number of cpus in the system. Each router/switch is connected to one L1, one L2 (if present). Each corner router/switch is connected to one Directory. It can be invoked from command line by '''--topology=MeshDirCorners'''. The number of rows in the mesh '''has to be specified''' by '''--mesh-rows'''.
 
 
 
[[File:topology_overview.jpg|1000px|center]]
 
 
 
* '''Optional parameters specified by the topology files (defaults in Network.py)''':
 
** '''latency''': latency of traversal within the link.
 
** '''weight''': weight associated with this link. This parameter is used by the routing table while deciding routes, as explained next in [[Ruby#Routing|Routing]].
 
** '''bw_multiplier''': used by simple network to model different link bandwidths. This parameter is specified in 1000th of a byte, and the individual link bandwidth = bw_multiplier x endpoint_bandwidth (specified in Network.py).
 
 
 
 
 
==== Routing ====
 
Based on the topology, shortest path graph traversals are used to populate ''routing tables'' at each router/switch. The default routing algorithm tries to choose the route with minimum number of link traversals. Links can be given weights in the topology files to model different routing algorithms. For example, in Mesh.py and MeshDirCorners.py Y-direction links are given weights of 2, while X-direction links are given weights of 1, resulting in XY traversals in a mesh. '''adaptive_routing''' (in src/mem/ruby/network/SimpleNetwork.py) can be enabled to make the simple network choose routes based on occupancy of queues at each output port.
 
 
 
 
 
==== Flow-Control and Router Microarchitecture ====
 
 
 
Ruby supports two network models, Simple and Garnet, which trade-off detailed modeling versus simulation speed respectively.
 
 
 
* '''Related Files''':
 
** src/mem/ruby/network/Network.py
 
** src/mem/ruby/network/simple
 
**  src/mem/ruby/network/garnet/BaseGarnetNetwork.py
 
**  src/mem/ruby/network/garnet/fixed-pipeline
 
**  src/mem/ruby/network/garnet/flexible-pipeline
 
 
 
===== Configuration and Setup =====
 
 
 
The default network model in Ruby is the [[Ruby#Simple_Network|simple]] network.
 
[[Ruby#Garnet_Networks|Garnet]] fixed-pipeline or flexible-pipeline networks can be enabled by adding '''--garnet-network=fixed''', or '''--garnet-network=flexible''' on the command line, respectively.
 
 
 
* '''Configuration''':
 
Some of the network parameters specified in Network.py are:
 
** '''number_of_virtual_networks''': This is the maximum number of virtual networks. The actual number of active virtual networks is determined by the protocol.
 
** '''control_msg_size''': The size of control messages in bytes. Default is 8. '''m_data_msg_size''' in Network.cc is set to the block size in bytes + control_msg_size.
 
** '''link_latency''': Latency of each link in cycles. This can be specified for each link from the topology files. Default is 1.
 
 
 
 
 
===== Simple Network =====
 
 
 
The simple network models hop-by-hop network traversal, but abstracts out detailed modeling within the switches.
 
The switches are modeled in simple/PerfectSwitch.cc while the links are modeled in simple/Throttle.cc.
 
The flow-control is implemented by monitoring the available buffers and available bandwidth in output links before sending.
 
 
 
[[File:Simple_network.jpg|200px|center]]
 
 
 
* '''Configuration''':
 
Simple network uses the generic network parameters in Network.py. Additional parameters are specified in SimpleNetwork.py:
 
** '''buffer_size''': Size of buffers at each switch input and output ports. A value of 0 implies infinite buffering.
 
** '''endpoint_bandwidth''': Bandwidth at the end points of the network in 1000th of byte.
 
** '''bw_multiplier''': Bandwidth specified in 1000th of byte. The individual link bandwidth becomes bw_multipler x endpoint_bandwidth.
 
** '''adaptive_routing''': This enables adaptive routing based on occupancy of output buffers.
 
 
 
 
 
===== Garnet Networks =====
 
 
 
Garnet is a detailed interconnection network model inside GEM5. It consists of a detailed ''fixed-pipeline'' model, and an approximate ''flexible-pipeline'' model.
 
 
 
The ''fixed-pipeline'' model is intended for low-level interconnection network evaluations and models the detailed micro-architectural features of a 5-stage Virtual Channel router with credit-based flow-control. Researchers interested in investigating different network microarchitectures can readily modify the modeled microarchitecture and pipeline. Also, for system level evaluations that are not concerned with the detailed network characteristics, this model provides an accurate network model and should be used as the default model.
 
 
 
The ''flexible-pipeline'' model is intended to provide a reasonable abstraction of all interconnection network models, while allowing the router pipeline depth to be flexibly adjusted. A router pipeline might range from a single cycle to several cycles. For evaluations that wish to easily change the router pipeline depth, the ''flexible-pipeline'' model provides a neat abstraction that can be used.
 
 
 
If your use of Garnet contributes to a published paper, please cite the research paper which can be found [http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4919636 here].
 
 
 
* '''Configuration'''
 
Garnet uses the generic network parameters in Network.py. Additional parameters are specified in BaseGarnetNetwork.py:
 
** '''flit_size''': flit size in bytes. Flits are the granularity at which information is sent from one router to the other. Default is 16 (=> 128 bits). [This default value of 16 results in control messages fitting within 1 flit, and data messages fitting within 5 flits].
 
** '''vcs_per_class''': number of virtual channels (VC) per message class (Note: message class = virtual network). Default is 4.
 
 
 
The following are only valid for fixed-pipeline:
 
** '''buffers_per_data_vc''': number of flit-buffers per VC in the data message class. Since data messages occupy 5 flits, this value can lie between 1-5. Default is 4.
 
** '''buffers_per_ctrl_vc''': number of flit-buffers per VC in the control message class. Since control messages occupy 1 flit, and a VC can only hold one message at a time, this value has to be 1. Default is 1.
 
Note: garnet assumes that ctrl messages are 1-flit wide. If ctrl flits occupy more than one flit (due to smaller flit-size, or larger control_msg_size), all VCs are given buffers_per_data_vc number of flit-buffers.
 
 
 
The following are only valid for flexible-pipeline:
 
** '''number_of_pipe_stages''': number of pipeline stages in each router in the flexible-pipeline model. Default is 4.
 
 
 
 
 
* '''Additional features'''
 
** '''Routing''': Currently, garnet only models deterministic routing using the routing tables described [[Ruby#Routing|earlier]].
 
** '''Modeling variable link bandwidth''': The flit size specifies the link bandwidth as the number of bytes per cycle per network link. Links which have lower bandwidth than this (for instance some off-chip links) can be modeled by specifying a longer latency across them in the topology file (as explained [[Ruby#Topology|earlier]]).
 
** '''Multicast messages''': The network modeled does not have hardware multi-cast support within the network. A multi-cast message gets broken into multiple uni-cast messages at the interface to the network.
 
 
 
 
 
* '''Garnet fixed-pipeline network'''
 
 
 
The garnet fixed-pipeline models a classic 5-stage Virtual Channel router. The 5-stages are:
 
# '''Buffer Write (BW) + Route Compute (RC)''': The incoming flit gets buffered and computes its output port.
 
# '''VC Allocation (VA)''': All buffered flits allocate for VCs at the next routers. [The allocation occurs in a ''separable'' manner: First, each input VC chooses one output VC, choosing input arbiters, and places a request for it. Then, each output VC breaks conflicts via output arbiters]. All arbiters in ordered virtual networks are ''queueing'' to maintain point-to-point ordering. All other arbiters are ''round-robin''.
 
# '''Switch Allocation (SA)''': All buffered flits try to reserve the switch ports for the next cycle. [The allocation occurs in a ''separable'' manner: First, each input chooses one input VC, using input arbiters, which places a switch request. Then, each output port breaks conflicts via output arbiters]. All arbiters in ordered virtual networks are ''queueing'' to maintain point-to-point ordering. All other arbiters are ''round-robin''.
 
# '''Switch Traversal (ST)''': Flits that won SA traverse the crossbar switch.
 
# '''Link Traversal (LT)''': Flits from the crossbar traverse links to reach the next routers.
 
 
 
The flow-control implemented is credit-based.
 
 
 
[[File:Garnet_router.jpg|500px|center]]
 
 
 
 
 
* '''Garnet flexible-pipeline network'''
 
 
 
The garnet flexible-pipeline model should be used when one desires a router pipeline different than 5 stages (the 5 stages include the link traversal stage).
 
All the components of a router (buffers, VC and switch allocators, switch etc) are modeled similar to the fixed-pipeline design, but the pipeline depth is not modeled, and comes as an input parameter '''number_of_pipe_stages'''.
 
The flow-control is implemented by monitoring the availability of buffers at each output port before sending.
 
 
 
 
 
== Life of a memory request in Ruby ==
 
In this section we will provide a high level overview of how a memory request is serviced by Ruby as a whole and what components in Ruby it goes through. For detailed operations within each components though, refer to previous sections describing each component in isolation.
 
 
 
# A memory request from a core or hardware context of M5 enters the jurisdiction of Ruby through the '''''RubyPort::recvTiming''''' interface (in src/mem/ruby/system/RubyPort.hh/cc). The number of Rubyport instantiation in the simulated system is equal to the number of hardware thread context or cores (in case of ''non-multithreaded'' cores). A port from the side of each core is tied to a corresponding RubyPort.
 
# The memory request arrives as a M5 packet and RubyPort is responsible for converting it to a RubyRequest object that is understood by various components of Ruby. It also finds out if the request is for some PIO or not and maneuvers the packet to correct PIO. Finally once it has generated the corresponding RubyRequest object and ascertained that the request is a ''normal'' memory request (not PIO access), it passes the request to the '''''Sequencer::makeRequest''''' interface of the attached Sequencer object with the port (variable ''ruby_port'' holds the pointer to it). Observe that Sequencer class itself is a derived class from the RubyPort class.
 
# As mentioned in the section describing Sequencer class of Ruby, there are as many objects of Sequencer in a simulated system as the number of hardware thread context (which is also equal to the number of RubyPort object in the system) and there is one-to-one mapping between the Sequencer objects and the hardware thread context. Once a memory request arrives at the '''''Sequencer::makeRequest''''', it does various accounting and resource allocation for the request and finally pushes the request to the Ruby's coherent cache hierarchy for satisfying the request while accounting for the delay in servicing the same. The request is pushed to the Cache hierarchy by enqueueing the request to the ''mandatory queue'' after accounting for L1 cache access latency. The ''mandatory queue'' (variable name ''m_mandatory_q_ptr'') effectively acts as the interface between the Sequencer and the SLICC generated cache coherence files.
 
# L1 cache controllers (generated by SLICC according to the coherence protocol specifications) dequeues request from the ''mandatory queue'' and looks up the cache, makes necessary coherence state transitions and/or pushes the request to the next level of cache hierarchy as per the requirements. Different controller and components of SLICC generated Ruby code communicates among themselves through instantiations of ''MessageBuffer'' class of Ruby (src/mem/ruby/buffers/MessageBuffer.cc/hh) , which can act as ordered or unordered buffer or queues. Also the delays in servicing different steps for satisfying a memory request gets accounted for scheduling enqueue-ing and dequeue-ing operations accordingly. If the requested cache block may be found in L1 caches and with required coherence permissions then the request is satisfied and immediately returned. Otherwise the request is pushed to the next level of cache hierarchy through ''MessageBuffer''. A request can go all the way up to the Ruby's Memory Controller (also called Directory in many protocols).  Once the request get satisfied it is pushed upwards in the hierarchy through ''MessageBuffer''s.
 
# Once the requested cache block is available at L1 cache with desired coherence permissions, the L1 cache controller informs the corresponding Sequencer object by calling its '''''readCallback''''' or ''''writeCallback''''' method depending upon the type of the request. Note that by the time these methods on Sequencer are called the latency of servicing the request has been implicitly accounted for.
 
# The Sequencer then clears up the accounting information for the corresponding request and then calls the '''''RubyPort::ruby_hit_callback''''' method. This ultimately returns the result of the request to the corresponding port of the core/ hardware context of the frontend (M5).
 

Latest revision as of 06:31, 5 November 2019

High level components of Ruby

Ruby implements a detailed simulation model for the memory subsystem. It models inclusive/exclusive cache hierarchies with various replacement policies, coherence protocol implementations, interconnection networks, DMA and memory controllers, various sequencers that initiate memory requests and handle responses. The models are modular, flexible and highly configurable. Three key aspects of these models are:

  1. Separation of concerns -- for example, the coherence protocol specifications are separate from the replacement policies and cache index mapping, the network topology is specified separately from the implementation.
  2. Rich configurability -- almost any aspect affecting the memory hierarchy functionality and timing can be controlled.
  3. Rapid prototyping -- a high-level specification language, SLICC, is used to specify functionality of various controllers.

The following picture, taken from the GEMS tutorial in ISCA 2005, shows a high-level view of the main components in Ruby.

Ruby overview.jpg

SLICC + Coherence protocols:

SLICC stands for Specification Language for Implementing Cache Coherence. It is a domain specific language that is used for specifying cache coherence protocols. In essence, a cache coherence protocol behaves like a state machine. SLICC is used for specifying the behavior of the state machine. Since the aim is to model the hardware as close as possible, SLICC imposes constraints on the state machines that can be specified. For example, SLICC can impose restrictions on the number of transitions that can take place in a single cycle. Apart from protocol specification, SLICC also combines together some of the components in the memory model. As can be seen in the following picture, the state machine takes its input from the input ports of the inter-connection network and queues the output at the output ports of the network, thus tying together the cache / memory controllers with the inter-connection network itself.

Slicc overview.jpg

The following cache coherence protocols are supported:

  1. MI_example: example protocol, 1-level cache.
  2. MESI_Two_Level: single chip, 2-level caches, strictly-inclusive hierarchy.
  3. MOESI_CMP_directory: multiple chips, 2-level caches, non-inclusive (neither strictly inclusive nor exclusive) hierarchy.
  4. MOESI_CMP_token: 2-level caches. TODO.
  5. MOESI_hammer: single chip, 2-level private caches, strictly-exclusive hierarchy.
  6. Garnet_standalone: protocol to run the Garnet network in a standalone manner.
  7. MESI Three Level: 3-level caches, strictly-inclusive hierarchy.

Commonly used notations and data structures in the protocols have been described in detail here.

Protocol independent memory components

  1. Sequencer
  2. Cache Memory
  3. Replacement Policies
  4. Memory Controller

In general cache coherence protocol independent components comprises of the Sequencer, Cache Memory structure, replacement policies and the Memory controller. The Sequencer class is responsible for feeding the memory subsystem (including the caches and the off-chip memory) with load/store/atomic memory requests from the processor. Every memory request when completed by the memory subsystem also send back the response to the processor via the Sequencer. There is one Sequencer for each hardware thread (or core) simulated in the system. The Cache Memory models a set-associative cache structure with parameterizable size, associativity, and replacement policy. L1, L2, L3 caches in the system are instances of Cache Memory, if they exist. The replacement policies are kept modular from the Cache Memory, so that different instances of Cache Memory can use different replacement policies of their choice. The Memory Controller is responsible for simulating and servicing any request that misses on all the on-chip caches of the simulated system. Memory Controller currently simple, but models DRAM ban contention, DRAM refresh faithfully. It also models close-page policy for DRAM buffer.

Each component is described in details here.

Interconnection Network

The interconnection network connects the various components of the memory hierarchy (cache, memory, dma controllers) together.

Interconnection network.jpg

The key components of an interconnection network are:

  1. Topology
  2. Routing
  3. Flow Control
  4. Router Microarchitecture

More details about the network model implementation are described here.

Alternatively, Interconnection network could be replaced with the external simulator TOPAZ. This simulator is ready to run within gem5 and adds a significant number of features over original ruby network simulator. It includes, new advanced router micro-architectures, new topologies, precision-performance adjustable router models, mechanisms to speed-up network simulation, etc ... The presentation of the tool (and the reason why is not included in the gem5 repostories) is here

Life of a memory request in Ruby

In this section we will provide a high level overview of how a memory request is serviced by Ruby as a whole and what components in Ruby it goes through. For detailed operations within each components though, refer to previous sections describing each component in isolation.

  1. A memory request from a core or hardware context of gem5 enters the jurisdiction of Ruby through the RubyPort::recvTiming interface (in src/mem/ruby/system/RubyPort.hh/cc). The number of Rubyport instantiation in the simulated system is equal to the number of hardware thread context or cores (in case of non-multithreaded cores). A port from the side of each core is tied to a corresponding RubyPort.
  2. The memory request arrives as a gem5 packet and RubyPort is responsible for converting it to a RubyRequest object that is understood by various components of Ruby. It also finds out if the request is for some PIO or not and maneuvers the packet to correct PIO. Finally once it has generated the corresponding RubyRequest object and ascertained that the request is a normal memory request (not PIO access), it passes the request to the Sequencer::makeRequest interface of the attached Sequencer object with the port (variable ruby_port holds the pointer to it). Observe that Sequencer class itself is a derived class from the RubyPort class.
  3. As mentioned in the section describing Sequencer class of Ruby, there are as many objects of Sequencer in a simulated system as the number of hardware thread context (which is also equal to the number of RubyPort object in the system) and there is an one-to-one mapping between the Sequencer objects and the hardware thread context. Once a memory request arrives at the Sequencer::makeRequest, it does various accounting and resource allocation for the request and finally pushes the request to the Ruby's coherent cache hierarchy for satisfying the request while accounting for the delay in servicing the same. The request is pushed to the Cache hierarchy by enqueueing the request to the mandatory queue after accounting for L1 cache access latency. The mandatory queue (variable name m_mandatory_q_ptr) effectively acts as the interface between the Sequencer and the SLICC generated cache coherence files.
  4. L1 cache controllers (generated by SLICC according to the coherence protocol specifications) dequeues request from the mandatory queue and looks up the cache, makes necessary coherence state transitions and/or pushes the request to the next level of cache hierarchy as per the requirements. Different controller and components of SLICC generated Ruby code communicates among themselves through instantiations of MessageBuffer class of Ruby (src/mem/ruby/buffers/MessageBuffer.cc/hh) , which can act as ordered or unordered buffer or queues. Also the delays in servicing different steps for satisfying a memory request gets accounted for scheduling enqueue-ing and dequeue-ing operations accordingly. If the requested cache block may be found in L1 caches and with required coherence permissions then the request is satisfied and immediately returned. Otherwise the request is pushed to the next level of cache hierarchy through MessageBuffer. A request can go all the way up to the Ruby's Memory Controller (also called Directory in many protocols). Once the request get satisfied it is pushed upwards in the hierarchy through MessageBuffers.
  5. The MessageBuffers also act as entry point of coherence messages to the on-chip interconnect modeled. The MesageBuffers are connected according to the interconnect topology specified. The coherence messages thus travel through this on-chip interconnect accordingly.
  6. Once the requested cache block is available at L1 cache with desired coherence permissions, the L1 cache controller informs the corresponding Sequencer object by calling its readCallback or 'writeCallback method depending upon the type of the request. Note that by the time these methods on Sequencer are called the latency of servicing the request has been implicitly accounted for.
  7. The Sequencer then clears up the accounting information for the corresponding request and then calls the RubyPort::ruby_hit_callback method. This ultimately returns the result of the request to the corresponding port of the core/ hardware context of the frontend (gem5).

Directory Structure

  • src/mem/
    • protocols: SLICC specification for coherence protocols
    • slicc: implementation for SLICC parser and code generator
    • ruby
      • common: frequently used data structures, e.g. Address (with bit-manipulation methods), histogram, data block
      • filters: various Bloom filters (stale code from GEMS)
      • network: Interconnect implementation, sample topology specification, network power calculations, message buffers used for connecting controllers
      • profiler: Profiling for cache events, memory controller events
      • recorder: Cache warmup and access trace recording
      • slicc_interface: Message data structure, various mappings (e.g. address to directory node), utility functions (e.g. conversion between address & int, convert address to cache line address)
      • structures: Protocol independent memory components – CacheMemory, DirectoryMemory
      • system: Glue components – Sequencer, RubyPort, RubySystem