Parallel M5
Parallelizing M5
Parallelizing M5 has been a long-term goal of mine (Nate) for quite some time.
Here's my plan for going about making this happen:
- Get rid of the global mainEventQueue
- Add an EventQueue pointer to every SimObject and add schedule()/deschedule()/reschedule() functions to the Base SimObject to use that event queue pointer.
- Change all calls to event scheduling to use that EventQueue pointer. For example:
- old:
- new LinkDelayEvent(this, packet, curTick + linkDelay);
- new:
- Event *event = new LinkDelayEvent(this, packet);
- this->schedule(event, curTick + linkDelay);
- Remove the schedule/deschedule/reschedule functions from the Event object. From then on, you must create an event and then explicitly schedule it on an event queue.
- See note below about outstanding issues.
- Add an EventQueue pointer to the SimObjectParams class
- We're going to keep the mainEventQueue, but it will be used only for certain global functions such as managing barriers, simulator exits, and the like.
- Create EventQueues in Python and pass the pointer to each SimObject via its Params struct
- In the first phase of implementing parallel M5, I'll do all of the steps up to this point, create just one event queue (the mainEventQueue), and populate every SimObject with that one event queue. This should essentially keep the status quo.
- Tell SCons to link M5 with -lpthread
- Create a set of wrapper classes for the pthread stuff since it would be nice to eventually support other mechanisms.
- Add support in the Python code for determining the number of CPU cores available (maybe automatically using /proc, but with the ability to override the number with a command-line option).
- Create a barrier event that can be used to synchronize sets of event queues
- Initially, the barrier event will cause ticks to be run in lock-step, guaranteeing that all cycles are done in order
- Create point-to-point synchronization events for controlling the slack between event queues
- The plan is to rely mainly on these events for maintaining slack in the system.
- One major idea with these events is that they will be squashed as frequently as possible to avoid synchronization.
- It may make more sense to build this directly into the event queue, but the all-to-all nature of the synchronization may make this less desirable.
- Create one event queue per thread and one thread per CPU core. Bind logical groups of objects to different EventQueues.
- Create certain objects which can use multiple EventQueues.
- The first object this will be done on is the EtherLink object, allowing two separate systems to be simulated on two separate cores
- The next object I'll do this on is the bus object, so that each core can run on a different event queue
- I'll probably also create some sort of EtherLink-like or etherbridge-like object for connecting two arbitrary memory objects across event queues. This may be done instead of doing the bus directly.
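The first few steps of the plan (an EventQueue pointer on every SimObject, handed in through the params struct, with schedule() living on the base SimObject instead of on the Event) might look roughly like the sketch below. It is only an illustration under stated assumptions: Event, EventQueue, SimObject and SimObjectParams are heavily simplified stand-ins rather than M5's real classes, curTick() is a member here rather than M5's global curTick, and Link and DelayEvent are hypothetical names loosely modeled on the LinkDelayEvent example above.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <memory>
#include <queue>
#include <vector>

using Tick = std::uint64_t;

// Simplified stand-in for M5's Event: it has no schedule() of its own.
class Event {
  public:
    virtual ~Event() = default;
    virtual void process() = 0;
};

// Simplified stand-in for M5's EventQueue, ordered by tick.
class EventQueue {
  public:
    void schedule(Event *event, Tick when) {
        assert(when >= curTick_);  // never schedule in the past
        pending_.push(Entry{when, nextSeq_++, event});
    }

    // Process every pending event in tick order.
    void serviceAll() {
        while (!pending_.empty()) {
            Entry e = pending_.top();
            pending_.pop();
            curTick_ = e.when;
            e.event->process();
        }
    }

    Tick curTick() const { return curTick_; }

  private:
    struct Entry {
        Tick when;
        std::uint64_t seq;  // keeps same-tick events in scheduling order
        Event *event;
        bool operator>(const Entry &o) const {
            return when != o.when ? when > o.when : seq > o.seq;
        }
    };
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pending_;
    Tick curTick_ = 0;
    std::uint64_t nextSeq_ = 0;
};

// Assumed shape of the params struct: it carries the queue pointer.
struct SimObjectParams {
    EventQueue *eventq;
};

// Base object: all scheduling goes through the queue it was bound to.
class SimObject {
  public:
    explicit SimObject(const SimObjectParams &p) : eventq_(p.eventq) {}
    virtual ~SimObject() = default;
    void schedule(Event *event, Tick when) { eventq_->schedule(event, when); }
    Tick curTick() const { return eventq_->curTick(); }

  private:
    EventQueue *eventq_;
};

// Hypothetical link object that delivers packets after a fixed delay.
class Link : public SimObject {
  public:
    Link(const SimObjectParams &p, Tick delay) : SimObject(p), delay_(delay) {}

    void send(int packet) {
        // Create the event first, then schedule it on this object's queue.
        events_.push_back(std::unique_ptr<Event>(new DelayEvent(this, packet)));
        schedule(events_.back().get(), curTick() + delay_);
    }

    std::vector<int> delivered;

  private:
    struct DelayEvent : Event {
        DelayEvent(Link *l, int p) : link(l), packet(p) {}
        void process() override { link->delivered.push_back(packet); }
        Link *link;
        int packet;
    };
    Tick delay_;
    std::vector<std::unique_ptr<Event>> events_;  // the link owns its events
};
```

In the first phase described above, every object would be constructed with a SimObjectParams pointing at the same queue. Binding groups of objects to different queues later would then only change which pointer is passed in, not the scheduling calls themselves.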
Outstanding Issues
I'd like to remove the queue pointer from the event object, since there is only one use case where you've scheduled an event and don't know which queue it's on when you want to deschedule or reschedule it: repeat events like the SimLoopExitEvent.
Here are the options:
- Leave the queue pointer in the object
- Pass the queue pointer as a parameter to the process() function
- Record the queue pointer in just those objects that require it
- Add a new flag to the event called AutoRepeat, add a virtual function that can be called to determine the repeat interval, and add support for repeating events to the event queue
- Create a thread-local global variable called currentEventQueue. (I hate this idea)
I go back and forth as to the right thing to do. I'd really like to avoid the queue pointer in all event objects so we can keep events small. It can easily be argued that I shouldn't keep that optimization unless I know that it will pay off, but the only way to know if it will pay off is to just do it. I also basically hate the last idea, and it's at the bottom of my list. One issue is that, because of the committed instruction queue, there can be more than one event queue in a given thread.
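To make the AutoRepeat option from the list above a bit more concrete, here is a minimal sketch of how it could work, assuming the queue itself re-arms a flagged event after processing it, so the event never needs to know which queue it is on. Everything here is illustrative: the classes are simplified stand-ins, and repeatInterval(), serviceUntil() and PeriodicCounter are hypothetical names rather than existing M5 APIs.

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

using Tick = std::uint64_t;

class Event {
  public:
    enum Flags : unsigned { None = 0, AutoRepeat = 1 };

    explicit Event(unsigned flags = None) : flags_(flags) {}
    virtual ~Event() = default;
    virtual void process() = 0;

    // Consulted only when AutoRepeat is set; 0 means "stop repeating".
    virtual Tick repeatInterval() const { return 0; }

    bool autoRepeat() const { return (flags_ & AutoRepeat) != 0; }

  private:
    unsigned flags_;  // note: no queue pointer stored in the event
};

class EventQueue {
  public:
    void schedule(Event *event, Tick when) {
        pending_.push(Entry{when, nextSeq_++, event});
    }

    // Process events with a tick <= limit, re-arming AutoRepeat events.
    void serviceUntil(Tick limit) {
        while (!pending_.empty() && pending_.top().when <= limit) {
            Entry e = pending_.top();
            pending_.pop();
            curTick_ = e.when;
            e.event->process();
            if (e.event->autoRepeat()) {
                Tick interval = e.event->repeatInterval();
                if (interval > 0)
                    schedule(e.event, curTick_ + interval);
            }
        }
    }

    Tick curTick() const { return curTick_; }

  private:
    struct Entry {
        Tick when;
        std::uint64_t seq;  // keeps same-tick events in scheduling order
        Event *event;
        bool operator>(const Entry &o) const {
            return when != o.when ? when > o.when : seq > o.seq;
        }
    };
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pending_;
    Tick curTick_ = 0;
    std::uint64_t nextSeq_ = 0;
};

// Hypothetical repeating event that just counts how often it fires.
class PeriodicCounter : public Event {
  public:
    explicit PeriodicCounter(Tick period)
        : Event(AutoRepeat), period_(period) {}
    void process() override { ++fired; }
    Tick repeatInterval() const override { return period_; }
    int fired = 0;

  private:
    Tick period_;
};
```

The trade-off in this sketch is that the queue pays for one flag check per processed event, while events stay free of a queue pointer. Whether that is actually a win over simply leaving the pointer in the event is exactly the open question above.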