A quick list of answered questions for people new (and not so new) to ROSS.

  • Are there any simple, example ROSS models to study and copy?

    • Absolute minimum template: template-model
    • A simple mailing model (with people sending letters and delivery cars transferring them to their corresponding destinations): ROSS-Mail-Model
    • A Game of Life example: ross-highlife
  • Can ROSS make use of CUDA to accelerate the execution of a model?

    ROSS doesn’t come with CUDA or GPU support. Some optimizations might be possible on a per-model basis. ROSS is written in pure C, which allows it to access extern variables with ease. Access to the GPUs can be implemented as a separately compiled object (with no main function) with all of its functions exported to C (extern "C").
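
    A minimal sketch of the C side of such an arrangement. The function gpu_scale() and its signature are hypothetical and not part of ROSS. In a real build the prototype would sit in a header shared by the model and the CUDA object; the body here is a plain CPU stand-in so the example builds without a GPU toolchain.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical interface, NOT part of ROSS. The guards give the function
 * C linkage when the same header is compiled as C++/CUDA. */
#ifdef __cplusplus
extern "C" {
#endif
void gpu_scale(double *values, size_t count, double factor);
#ifdef __cplusplus
}
#endif

/* CPU stand-in for the GPU implementation (assumption for illustration).
 * A CUDA object would define the same symbol and launch a kernel instead. */
void gpu_scale(double *values, size_t count, double factor)
{
    assert(values != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        values[i] *= factor;
    }
}
```

    A model's event handler would then call gpu_scale() like any other C function, and the linker decides whether the CPU or the GPU version is used.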

  • What is Phold precisely? It’s the only model present by default in ROSS, it seems to be used for benchmarking, but it’s pretty empty, so what’s the deal?

    As you mentioned, phold is a benchmarking model. It’s used for both strong scaling (fixed total problem size, varied number of workers) and weak scaling (fixed problem size per worker, so the total problem grows with the number of workers). See the Wikipedia paragraph on this.

    It generates random events to/from random LPs with some guidance as to how many remote (different PE) and local (same PE) events are generated. Local events won’t ever trigger out-of-order event errors by themselves but remote events can because optimistic execution allows for PEs to process events at their own pace without strict synchronization. If we were to create some sort of optimization for ROSS that improves the engine’s performance, a simple model like PHOLD is a good choice because we can measure the simulation’s efficiency (a metric on how effective the simulation was at progressing - more rollbacks means lower efficiency).
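
    The two ideas in that paragraph can be sketched in a few lines of C. Both helpers are hypothetical and do not use the ROSS API. The efficiency formula is an assumed definition that matches the description above (more rollbacks, lower efficiency); check the statistics code of your ROSS version for the exact one.

```c
#include <assert.h>

/* Hypothetical helper: decide whether the next event goes to a remote PE.
 * `draw` is a uniform random number in [0, 1) and `remote_fraction` is the
 * share of events that should be remote. Returns 1 for remote, 0 for local. */
int phold_is_remote(double draw, double remote_fraction)
{
    assert(draw >= 0.0 && draw < 1.0);
    return draw < remote_fraction;
}

/* Assumed definition: 100 * (1 - rolled_back / net_events), where
 * net_events counts the events that ended up committed. */
double efficiency_percent(unsigned long net_events, unsigned long rolled_back)
{
    assert(net_events > 0);
    return 100.0 * (1.0 - (double)rolled_back / (double)net_events);
}
```

    With no rollbacks the efficiency is 100; rolling back a quarter as many events as were committed brings it down to 75.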

  • What is an LP? What is a PE?

    An LP (Logical Process) is a single unit of computation, an actor. An LP receives messages, processes them, and optionally sends more messages to other LPs or itself. There can be millions or billions of LPs in a simulation. LPs are created at the start of a simulation on an assigned PE, they have a unique global ID across the simulation, and they are only destroyed once the simulation stops.

    A PE (Processing Element) is a single core/MPI rank in charge of coordinating the simulation and communicating with other PEs. PEs handle LPs and their mailboxes.

  • Cool. What is a KP?

    KPs (Kernel Processes) are really just an organizational structure: containers of LPs that are themselves contained in a PE. Most importantly, they determine the granularity of rollbacks. When an out-of-order event is detected by a receiving LP (“I got an event that is timestamped to have occurred ‘before’ the last event that I have processed”), all LPs in the same KP as the receiving LP must roll back to the time before the triggering event is set to occur. With too few KPs you’ll be rolling back a lot of LPs very frequently. Too many KPs is not really a terrible problem, but it could be memory inefficient (I haven’t really tested that).
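
    To make the granularity argument concrete, here is a small sketch that counts how many LPs are dragged into a rollback. The block mapping of LPs to KPs is an assumption made for illustration; it is not taken from the ROSS source.

```c
#include <assert.h>

/* Hypothetical block mapping: consecutive LPs on a PE are grouped into
 * KPs of `lps_per_kp` LPs each. */
unsigned kp_of_lp(unsigned local_lp, unsigned lps_per_kp)
{
    assert(lps_per_kp > 0);
    return local_lp / lps_per_kp;
}

/* Number of LPs that must roll back when `straggler_lp` receives an
 * out-of-order event: every LP on the PE that shares its KP. */
unsigned lps_rolled_back(unsigned straggler_lp, unsigned lps_on_pe,
                         unsigned lps_per_kp)
{
    assert(straggler_lp < lps_on_pe);
    unsigned kp = kp_of_lp(straggler_lp, lps_per_kp);
    unsigned count = 0;
    for (unsigned lp = 0; lp < lps_on_pe; lp++) {
        if (kp_of_lp(lp, lps_per_kp) == kp) {
            count++;
        }
    }
    return count;
}
```

    With 64 LPs in a single KP, one straggler rolls back all 64; with 16 KPs of 4 LPs each, the same straggler rolls back only 4.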

  • I am getting an Out of events in GVT! Consider increasing --extramem. What is that?

    The GVT (Global Virtual Time) is a mechanism used to keep all LPs in sync.

    ROSS is built in C and is designed to be as fast as possible. So it allocates a bunch of memory up front and creates a “pool” of unused event structs that can be pulled from any time the simulation needs an event. It does not resize this on the fly (that could be expensive in time, so it is possibly better to just error out and tell the user to allocate more). Because we may need to roll back the simulation, we need to ‘hold’ all forward events after they’ve been processed until we know that it is safe to garbage collect them. We can only know this at synchronization points: the GVT calculation. GVT is the minimum point in virtual time that all LPs can agree “has happened”. Any events with timestamps up to that point can be safely garbage collected and their event structs returned to the pool.

    If, however, we create so many events that we run out of empty events in the pool, the simulation will recognize this, end, and error out telling the user how to rectify the situation.
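
    The pool mechanism can be sketched as a fixed array with a free list. All types and functions below are hypothetical stand-ins and not the ROSS event structures. The pool is tiny on purpose so that exhaustion is easy to trigger, and collection uses a strictly-older-than-GVT test as a cautious reading of “up to that point”.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 4   /* tiny on purpose; a real pool holds many events */

/* Hypothetical event record, NOT the ROSS event struct. */
typedef struct sim_event {
    double            timestamp;
    int               processed;   /* 1 once the forward handler has run */
    int               in_use;
    struct sim_event *next_free;
} sim_event;

typedef struct {
    sim_event  slots[POOL_SIZE];
    sim_event *free_list;
} event_pool;

/* Set up every slot once, up front, and chain them all as free. */
void pool_init(event_pool *pool)
{
    assert(pool != NULL);
    pool->free_list = NULL;
    for (int i = POOL_SIZE - 1; i >= 0; i--) {
        pool->slots[i].in_use = 0;
        pool->slots[i].processed = 0;
        pool->slots[i].next_free = pool->free_list;
        pool->free_list = &pool->slots[i];
    }
}

/* Returns NULL when the pool is exhausted; the pool never grows. */
sim_event *pool_get(event_pool *pool, double timestamp)
{
    sim_event *ev = pool->free_list;
    if (ev == NULL) {
        return NULL;   /* the "out of events" situation */
    }
    pool->free_list = ev->next_free;
    ev->timestamp = timestamp;
    ev->processed = 0;
    ev->in_use = 1;
    return ev;
}

/* At a GVT computation, processed events strictly older than GVT can no
 * longer be rolled back, so their slots return to the free list.
 * Returns the number of slots reclaimed. */
int pool_collect(event_pool *pool, double gvt)
{
    int reclaimed = 0;
    for (int i = 0; i < POOL_SIZE; i++) {
        sim_event *ev = &pool->slots[i];
        if (ev->in_use && ev->processed && ev->timestamp < gvt) {
            ev->in_use = 0;
            ev->next_free = pool->free_list;
            pool->free_list = ev;
            reclaimed++;
        }
    }
    return reclaimed;
}
```

    Once all four slots are taken, pool_get() returns NULL until a collection at a later GVT frees some of them.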

    Why is memory not allocated dynamically then? There are two reasons for this. First, ROSS is designed to be as fast as possible, and dynamic allocation takes time. If we abstract this responsibility away from users, they will likely stop thinking about it, create models that make a TON of events, and have no idea why the simulation runs so slowly, when ultimately it is because it has to repeatedly malloc a bunch of events during the simulation instead of all up front. Second, and even worse, the model can be unstable in its creation of messages/events. It’s quite simple to write a program that sends two messages for each message it receives. Such a program requires an unbounded pool, which is not feasible on any system. If you keep receiving this error no matter how much extra memory you allocate for the simulation, you might have a model with an unstable message structure.
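
    A rough way to see why extra memory does not rescue an unstable model: the hypothetical helper below counts generations of events, assuming one seed event, a fixed fanout per received event, and no reclamation keeping up with creation.

```c
#include <assert.h>

/* Hypothetical helper: returns the first generation whose number of live
 * events exceeds `pool_size`, when each received event sends `fanout`
 * new events. Generation 0 is the single seed event. */
unsigned generations_until_exhausted(unsigned long pool_size, unsigned fanout)
{
    assert(pool_size >= 1);
    assert(fanout >= 2);
    unsigned long live = 1;
    unsigned generation = 0;
    /* Compare against pool_size / fanout to avoid overflowing `live`. */
    while (live <= pool_size / fanout) {
        live *= fanout;
        generation++;
    }
    return generation + 1;
}
```

    With a fanout of two, a pool of 1,000 events is exhausted at generation 10 and a pool of 1,000,000 events at generation 20, so a thousandfold increase in memory buys only ten more generations.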

  • I am getting a Lookahead violation: decrease g_tw_lookahead. What is that?

    ROSS can make use of “three” modes of execution or schedulers. The conservative scheduler works by advancing in tiny synchronized steps through the simulation. For this, it assumes that within a small window of time, the lookahead, the events being processed will not create any new events. If an event is created with an offset smaller than the lookahead, it violates that assumption and the execution halts.

    This means that models with zero-offset events cannot run using the conservative scheduler, unfortunately. You can use either the sequential (--synch=1) or the optimistic (--synch=3) scheduler instead.
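
    The rule itself is a one-line comparison. The sketch below is hypothetical and not the check ROSS performs internally. safe_offset() shows one common way for a model to stay within the rule: add the lookahead to whatever non-negative delay the model computes.

```c
#include <assert.h>

/* Hypothetical check: returns 1 if an event scheduled `offset` time units
 * into the future is acceptable to a conservative scheduler with the
 * given lookahead, 0 if it would be a lookahead violation. */
int respects_lookahead(double offset, double lookahead)
{
    assert(lookahead >= 0.0);
    return offset >= lookahead;
}

/* Hypothetical helper: shift a non-negative model delay by the lookahead
 * so that the resulting offset can never violate it. */
double safe_offset(double raw_delay, double lookahead)
{
    assert(raw_delay >= 0.0);
    return lookahead + raw_delay;
}
```

    A zero-offset event fails the check for any positive lookahead, which is why such models need the sequential or optimistic scheduler.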

  • What is the “AVL Tree” doing? What is its function in ROSS? And why is it so important that its size is shown in the statistics after executing a ROSS model?

    The AVL tree is an efficient way of keeping track of existing events that haven’t yet been confirmed to have been “committed” to simulation history. In optimistic mode, we can’t “commit” any event to simulation history until GVT is greater than the event’s timestamp. Until then, it’s entirely possible for that event to be rolled back. Keeping track of the AVL tree size during the course of the simulation is kind of a way of monitoring event fanout (the rate at which events are being created/destroyed).

  • Storing whole events seems rather wasteful. Why do we have to keep track of all of the memory of an event? Couldn’t we just keep track of some signature of the event and save some space?

    There are essentially two ways of accomplishing rollbacks:

    1. Reverse Computation - This works if we know the exact context for the event and its computation is easily reversible. For example, “This was an event of type X, which means that it incremented the state property s->y by one”. So we know the opposite would be to “decrement s->y by one”. We don’t need any more information! But it’s not always this simple to revert the state.
    2. State Swapping - If we instead, at the time of processing the forward event, encode the previous state into the message struct, we can fetch that saved data from the message during rollback and put it back into the LP state. This is helpful if you have an event that does some destructive change like “set whatever value of s->y to equal zero”. It is, in principle, impossible to recover information that has been destroyed. For all we know s->y could have been 23526 prior to our forward event that set it to zero. It could have been 1. If we saved that information in the message struct, then we can just look at the message struct during RC and say “oh we know that s->y was 23526 prior to setting it to zero”.

    Note: #2 requires that we don’t overwrite any event data until we know we don’t need it!
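
    Both techniques side by side, as a minimal sketch. The state, message and handler names are hypothetical; they are shaped like a ROSS model but use no ROSS types, so the snippet compiles on its own.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { MSG_INCREMENT, MSG_RESET } msg_type;

typedef struct { int y; } lp_state;

typedef struct {
    msg_type type;
    int      saved_y;   /* scratch space used only for state swapping */
} message;

void forward_handler(lp_state *s, message *m)
{
    assert(s != NULL && m != NULL);
    switch (m->type) {
    case MSG_INCREMENT:
        s->y += 1;           /* reversible: nothing needs saving */
        break;
    case MSG_RESET:
        m->saved_y = s->y;   /* destructive: stash the old value first */
        s->y = 0;
        break;
    }
}

void reverse_handler(lp_state *s, message *m)
{
    assert(s != NULL && m != NULL);
    switch (m->type) {
    case MSG_INCREMENT:
        s->y -= 1;           /* reverse computation */
        break;
    case MSG_RESET:
        s->y = m->saved_y;   /* state swapping */
        break;
    }
}
```

    Events are undone in the opposite order to the one they were processed in, so an increment followed by a reset is rolled back by reversing the reset first and the increment second.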

  • Can we allocate dynamic memory for an event?

    Yes and no. The gist of it is: if you are sending a message, don’t allocate memory! If you are processing a message, yes, you can, but you have to remember to deallocate it. Every event processed forward is either rolled back or removed (committed) at some point.

    Very important: Be very careful about using dynamic memory encoded into messages. You might have already considered this, but when you malloc() something, you get a pointer to that memory. If we save that data into the message struct as dynamic memory, all that’s really saved is the pointer to that data. If the message is destined for an LP on a different PE (a different processor), the LP that receives it will get the message, and if it tries to read the data located at the encoded pointer it will segfault or, worse, run with invalid data.

    One use for dynamic memory is storing the state of an LP during the forward computation step. There are two reasons to do this: the size of the LP state might be known only at runtime, or it might be too large to save in the message struct. Later on, the space allocated in the forward computation step must be freed in the reverse event handler or the commit handler. Basically, memory allocation during message handling should be confined to the forward, reverse and commit handlers, and even then, use caution.

  • What is the purpose of the commit_f function? Is it to deallocate any memory after a reverse message is processed and is no longer needed (ie, behind the GVT)? (see previous question)

    The commit_f function can be tricky and should be handled with care. Yes, the commit_f function can be used to free any dynamic memory that a forward event handler allocated and stored in the message struct. (It must also be freed in the reverse event handler - a message will be passed through a commit callback XOR a reverse callback, never both.) Doing this will prevent memory leaks.

    However, while you technically have access to LP state during a commit_f function, you should refrain from touching it.

    The reason has to do with when the commit_f function runs. Say you had a ROSS program where LPs received events of type ADD which increment some encoded LP state. And say an LP receives an ADD event timestamped at virtual time 5. Eventually ROSS gets around to its GVT calculation phase and everyone agrees that GVT is now at 6. But the LP that received that ADD event is now at, say, virtual time 8 (because we can optimistically execute events and LPs aren’t in lock step with each other following GVT). When the commit_f handler is processing that ADD event, the LP believes that it is at time 8. But the event that we are committing was timestamped at 5. If we touch LP state now, we are not touching it at time 5, we’re touching it at time 8. If we were to print out tw_now(lp) inside of the commit handler, it would read 8, not 5.

    It’s OK if the implications of this are not 100% clear, just be aware that commit_f should really only be reading data or freeing dynamic memory from events. Also keep in mind that the data to be read (unless it’s data encoded in the message struct, which is kept safe) will be data from virtual time 8, not 5.
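
    A sketch of the allocate-in-forward, free-in-reverse-or-commit pattern. The handler names, structs and signatures are hypothetical simplifications and not the ROSS callback signatures. The pointer stored in the message is written and read only by handlers running on the PE that processes the event, which is what keeps it valid.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    size_t  count;     /* known only at runtime */
    double *values;
} big_state;

typedef struct {
    double *snapshot;  /* allocated in forward, freed in reverse XOR commit */
} snap_message;

void big_forward(big_state *s, snap_message *m)
{
    assert(s->count > 0);
    m->snapshot = malloc(s->count * sizeof *m->snapshot);
    assert(m->snapshot != NULL);
    memcpy(m->snapshot, s->values, s->count * sizeof *m->snapshot);
    for (size_t i = 0; i < s->count; i++) {
        s->values[i] = 0.0;   /* destructive update */
    }
}

void big_reverse(big_state *s, snap_message *m)
{
    /* Rollback path: restore the state, then release the snapshot. */
    memcpy(s->values, m->snapshot, s->count * sizeof *m->snapshot);
    free(m->snapshot);
    m->snapshot = NULL;
}

void big_commit(big_state *s, snap_message *m)
{
    (void)s;   /* LP state is deliberately left untouched here */
    free(m->snapshot);
    m->snapshot = NULL;
}
```

    Each forward call is matched by exactly one of big_reverse() or big_commit(), so the snapshot is freed once and never leaked.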

  • I’ve noticed that we can have multiple LP definitions, but can we have multiple message definitions? What if I want to send messages with different payloads and sizes? Do I resort to declaring a fixed-size message and play around with casting to different message structs for different things? What about sending an arbitrarily sized message?

    No, the system operates under only one message type. If you want anything fancy, maybe allocate memory on the fly and put a reference to it in the struct. Beware that you should never, ever allocate dynamic memory when creating a message (see two previous questions!).

    ROSS expects a single event size so it knows how to allocate the event pool. Technically, when we are processing an event, we just get a pointer to the message struct. We can cast that to be whatever we want. But the size of transmitted messages must always be the max of any message struct that we’d be casting to.
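
    One way to get the “max of any message struct” without manual casts is a tagged union, sketched below. This swaps the casting approach for a union; the payload types and function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { MSG_LETTER, MSG_TRUCK } payload_kind;

/* Hypothetical payloads of different sizes. */
typedef struct { int sender; int recipient; } letter_payload;
typedef struct { int route[8]; double weight; } truck_payload;

/* The single message type: a tag plus a union of every payload. */
typedef struct {
    payload_kind kind;
    union {
        letter_payload letter;
        truck_payload  truck;
    } body;
} model_message;

/* The one size a model would register as its message size. The union
 * makes it at least as large as the largest payload. */
size_t model_message_size(void)
{
    return sizeof(model_message);
}

/* Example dispatch on the tag: 0 for letters, 1 or 2 for trucks. */
int message_weight_class(const model_message *m)
{
    assert(m != NULL);
    switch (m->kind) {
    case MSG_LETTER:
        return 0;
    case MSG_TRUCK:
        return m->body.truck.weight > 100.0 ? 2 : 1;
    }
    return -1;   /* unknown tag */
}
```

    The cost is that every message occupies the space of the largest payload, which is the same trade-off the casting approach makes.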

  • ROSS seems kinda limited. I want dynamic memory allocation and a garbage collector, arbitrarily sized messages, LP creation on the fly, reallocation of LPs, a dynamic message pool, and …

    With ROSS, most things that make you think “oh that seems rather limiting” have a reason behind them: speed.

    ROSS claims to be incredibly scalable and fast. Having variable event sizes would complicate the pooling of events, requiring us to dynamically allocate events as they are created or create a separate pool for every single event type (challenging to support cleanly and also very inefficient - it would lead to more problems than it solves), and so on.

  • My model produces correct results in serial mode but it glitches in optimistic mode. No, it’s not segfaulting. It runs, but its output is weird. The more ranks/cores I use the more it fails. What is happening?

    There are many possible reasons for this to be happening, but one that has made almost every ROSS developer suffer is messing with the message state. Sometimes reverse computation is not enough to reverse an LP state. You decide to encode some information in the message, so that if you have to rollback you just decode the information into the LP state. This is how everybody does it, and it works, but there’s a catch: you have to make sure to clean the state of the message/event on rollbacks too. Events are not automatically cleansed on a rollback. Remember: in rollbacks both the LP state and the contents of the message have to be brought to their previous state.
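
    A small sketch of how a stale message field causes this. The queue model, field names and handlers are hypothetical. The forward handler writes was_dropped on only one of its two paths, so the reverse handler has to clear it; otherwise a later re-execution of the same event that takes the other path would still carry the old flag.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_CAPACITY 2

typedef struct { int tokens; int dropped; } queue_state;

typedef struct {
    int was_dropped;   /* bookkeeping written by the forward handler */
} queue_message;

void queue_forward(queue_state *s, queue_message *m)
{
    assert(s != NULL && m != NULL);
    if (s->tokens >= QUEUE_CAPACITY) {
        s->dropped++;
        m->was_dropped = 1;
    } else {
        s->tokens++;   /* note: was_dropped is NOT written on this path */
    }
}

void queue_reverse(queue_state *s, queue_message *m)
{
    assert(s != NULL && m != NULL);
    if (m->was_dropped) {
        s->dropped--;
        m->was_dropped = 0;   /* bring the message back to its old state */
    } else {
        s->tokens--;
    }
}
```

    Without the line that resets was_dropped, the second rollback of the same event would decrement dropped instead of tokens. An equally valid fix is to have the forward handler write the flag on both paths.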