API Description

The RIO API is designed to be familiar to anyone who is using ROSS. The API includes:

The model-defined callback functions that must be implemented by the model developer.
A number of global variables that should be set in the model’s main function before working with a checkpoint or running the simulation.
The functions which create and load checkpoint files.

A full implementation of the RIO API can be seen in the PHOLD-IO model.

Model Callbacks

In order to efficiently use the checkpoints, the model must provide RIO with some information about the LPs.

LP Size

typedef size_t (*model_size_f)(void * state, tw_lp *lp);

Given an LP, return the size needed to serialize its data.
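As a minimal sketch of a size callback, assume a hypothetical LP state that holds a count followed by that many samples. The `example_state` type and the `tw_lp` stand-in below are illustrative only; a real model would use its own state struct and the `tw_lp` type from the ROSS headers.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the ROSS tw_lp type so this sketch compiles on its own. */
typedef struct tw_lp { unsigned long gid; } tw_lp;

/* Hypothetical LP state: a sample count plus a heap-allocated array. */
typedef struct {
    int     n_samples;
    double *samples;
} example_state;

/* Matches the model_size_f signature: report how many bytes this LP
 * needs in the checkpoint. The pointer itself is not stored, only the
 * count and the data it points to. */
size_t example_model_size(void *state, tw_lp *lp)
{
    (void)lp; /* not needed for this state layout */
    example_state *s = (example_state *)state;
    assert(s != NULL && s->n_samples >= 0);
    return sizeof(int) + (size_t)s->n_samples * sizeof(double);
}
```

Because the size is computed per LP, each LP can report a different value.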

LP Serialize

typedef void (*serialize_f)(void * state, void * buffer, tw_lp *lp);

Given an LP and pre-allocated buffer space, serialize the LP state data into the buffer.

LP Deserialize

typedef void (*deserialize_f)(void * state, void * buffer, tw_lp *lp);

Given an LP state and a buffer of serialized data, deserialize the buffer data into the LP state. This function will most likely be similar to your LP initialize function.
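The serialize and deserialize callbacks are easiest to understand as a pair. The sketch below assumes a hypothetical `example_state` with two persistent fields and one runtime-only pointer. The `tw_lp` stand-in and all `example_` names are illustrative and not part of RIO.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the ROSS tw_lp type so this sketch compiles on its own. */
typedef struct tw_lp { unsigned long gid; } tw_lp;

/* Hypothetical LP state. */
typedef struct {
    long   events_seen;
    double last_ts;
    char  *scratch;     /* runtime-only pointer, never written to disk */
} example_state;

/* Bytes written per LP; a matching size callback would return this. */
enum { EXAMPLE_PACKED_SIZE = sizeof(long) + sizeof(double) };

/* Matches serialize_f: copy the persistent fields into the buffer. */
void example_serialize(void *state, void *buffer, tw_lp *lp)
{
    (void)lp;
    assert(state != NULL && buffer != NULL);
    const example_state *s = (const example_state *)state;
    char *out = (char *)buffer;
    memcpy(out, &s->events_seen, sizeof s->events_seen);
    memcpy(out + sizeof s->events_seen, &s->last_ts, sizeof s->last_ts);
}

/* Matches deserialize_f: read the fields back in the same order and
 * reset anything that was not stored, much like an init function. */
void example_deserialize(void *state, void *buffer, tw_lp *lp)
{
    (void)lp;
    assert(state != NULL && buffer != NULL);
    example_state *s = (example_state *)state;
    const char *in = (const char *)buffer;
    memcpy(&s->events_seen, in, sizeof s->events_seen);
    memcpy(&s->last_ts, in + sizeof s->events_seen, sizeof s->last_ts);
    s->scratch = NULL;
}
```

Using `memcpy` rather than casting the buffer to a struct pointer avoids alignment assumptions about the buffer RIO hands to the callback.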

IO LP Types

Similar to the tw_lptype array, the RIO callback functions should be stored in an io_lptype array.
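A rough sketch of such an array follows. The `io_lptype` struct here is a local stand-in: its field names, field order, and the zeroed terminator entry are assumptions made for illustration, so check the RIO header for the actual definition. The `ex_` callbacks and `example_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-ins so the sketch compiles without ROSS. */
typedef struct tw_lp { unsigned long gid; } tw_lp;
typedef size_t (*model_size_f)(void *state, tw_lp *lp);
typedef void (*serialize_f)(void *state, void *buffer, tw_lp *lp);
typedef void (*deserialize_f)(void *state, void *buffer, tw_lp *lp);

/* ASSUMPTION: illustrative layout, not copied from the RIO header. */
typedef struct {
    serialize_f   serialize;
    deserialize_f deserialize;
    model_size_f  model_size;
} io_lptype;

/* Hypothetical callbacks for an LP whose state is a single int. */
static size_t ex_size(void *state, tw_lp *lp)
{
    (void)state; (void)lp;
    return sizeof(int);
}
static void ex_serialize(void *state, void *buffer, tw_lp *lp)
{
    (void)lp;
    memcpy(buffer, state, sizeof(int));
}
static void ex_deserialize(void *state, void *buffer, tw_lp *lp)
{
    (void)lp;
    memcpy(state, buffer, sizeof(int));
}

/* One entry per LP type. */
io_lptype example_io_lps[] = {
    { .serialize = ex_serialize, .deserialize = ex_deserialize, .model_size = ex_size },
    { 0 }, /* ASSUMPTION: zeroed terminator, mirroring tw_lptype arrays */
};

/* Count entries up to the zeroed terminator. */
size_t example_count_io_lptypes(const io_lptype *types)
{
    assert(types != NULL);
    size_t n = 0;
    while (types[n].serialize != NULL)
        n++;
    return n;
}
```

Designated initializers keep the sketch correct even if the real field order differs from the stand-in.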

Global Variables

There are a few global variables that should be specified before running the simulation or working with a checkpoint.

LP IO Callbacks

g_io_lp_types

This is the global array variable where the LP RIO function callbacks are stored, very similar to the ROSS g_tw_lp_types array. This variable should be set in main before calling tw_lp_setup_types().

Number of Checkpoint Data Files

g_io_number_of_files

This variable is the total number of data files in the RIO checkpoint. When writing a checkpoint, there should never be more than one file per MPI rank.

This variable can be set through command line option --io-files=n. The default value is 1.

Event Buffer Size

g_io_events_buffered_per_rank

This variable is used for memory management when creating a checkpoint. RIO pre-allocates event memory in order to capture “in-flight” events at the end of the simulation. As implied by the name, this is the number of outstanding events that may exist on any one MPI rank.

For example, the PHOLD model includes a per-LP event population variable: g_pholdio_start_events. Thus, for the PHOLD-IO model, a good buffer size is:

    g_io_events_buffered_per_rank = 2 * g_tw_nlp * g_pholdio_start_events;
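To make the arithmetic concrete, here is the same formula as a small helper with an overflow guard. The function name and the unsigned integer types are assumptions chosen for illustration; the actual types of the ROSS and RIO globals may differ.

```c
#include <assert.h>
#include <limits.h>

/* Twice the initial event population on one rank:
 * 2 * (LPs on this rank) * (start events per LP). */
unsigned int example_events_buffered(unsigned int nlp_per_rank,
                                     unsigned int start_events)
{
    /* Guard against overflow of 2 * nlp_per_rank * start_events. */
    assert(nlp_per_rank == 0 || start_events <= UINT_MAX / 2 / nlp_per_rank);
    return 2u * nlp_per_rank * start_events;
}
```

For instance, a rank with 1024 LPs and 16 start events per LP would reserve space for 32768 events.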

API Functions

With RIO there are simple functions available to work with checkpoints.

Initialize IO System

void io_init()

The IO system must be initialized before running the simulation. After specifying the size of the RIO event buffer, simply call the io_init() function. Be sure that the g_io_number_of_files variable is set before calling this init function (otherwise the default value of 1 file is used).

Load a Checkpoint

void io_load_checkpoint(char * cp_name, [ PRE_INIT, INIT, POST_INIT ])

The load checkpoint function should be called before tw_run(). This function needs the root name of the checkpoint files (the same name passed to io_store_checkpoint when the checkpoint was created).

When starting a simulation from a checkpoint, there are several options for initializing LPs:

PRE_INIT: This option loads the LP state data from the checkpoint, then calls the LP init function.
INIT: This option does not call the LP init function. Instead, it simply initializes LP state data from the checkpoint.
POST_INIT: This option calls the LP init function before loading the LP state data from the checkpoint.

Store a Checkpoint

void io_store_checkpoint(char * cp_name, int data_file)

A checkpoint is stored after a simulation has ended. To store a checkpoint, perform these steps in order:

Allocate buffer space for in-flight events by setting g_io_events_buffered_per_rank.
Call io_init().
Run the simulation using tw_run().
Store the checkpoint using io_store_checkpoint().
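The ordering of these steps can be sketched as follows. Everything ROSS-related in this block is a stub that only records the call order, so that the sketch compiles without ROSS; the checkpoint name and the `example_store_flow` wrapper are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* ---- Stubs standing in for ROSS/RIO; they only log the call order ---- */
static int  g_io_events_buffered_per_rank = 0;
static char call_log[8];
static int  n_calls = 0;

static void record(char c)
{
    assert(n_calls < 7);
    call_log[n_calls++] = c;
    call_log[n_calls] = '\0';
}
static void io_init(void) { record('I'); }
static void tw_run(void)  { record('R'); }
static void io_store_checkpoint(char *cp_name, int data_file)
{
    (void)cp_name; (void)data_file;
    record('S');
}

/* ---- The four steps, in order ---- */
void example_store_flow(int buffered_events, int data_file)
{
    char name[] = "example_checkpoint";  /* hypothetical checkpoint name */

    g_io_events_buffered_per_rank = buffered_events; /* 1. size the buffer */
    io_init();                                       /* 2. initialize RIO  */
    tw_run();                                        /* 3. run simulation  */
    io_store_checkpoint(name, data_file);            /* 4. write files     */
}
```

In a real model these calls sit in main alongside the usual ROSS setup, which is omitted here.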

The store checkpoint function takes two arguments: the global checkpoint name and which data-file the current MPI rank should be a part of. If your mapping is LINEAR (or some other evenly distributed mapping), the simplest way to determine the data file number is:

    // either set g_io_number_of_files
    // or use --io-files=n
    int ranks_per_file = tw_nnodes() / g_io_number_of_files;
    int data_file = g_tw_mynode / ranks_per_file;
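The snippet above relies on integer division, so it implicitly assumes that the number of ranks is a multiple of the number of files. A small helper can make that assumption explicit. The function name and parameters below are hypothetical, and the divisibility check is this sketch's own precaution rather than a documented RIO requirement.

```c
#include <assert.h>

/* Map an MPI rank to a data file, assuming ranks divide evenly
 * among files (e.g. 8 ranks over 2 files gives 4 ranks per file). */
int example_data_file(int my_rank, int n_ranks, int n_files)
{
    assert(n_files > 0 && n_files <= n_ranks);
    assert(my_rank >= 0 && my_rank < n_ranks);
    /* Without this, trailing ranks could map past the last file. */
    assert(n_ranks % n_files == 0);

    int ranks_per_file = n_ranks / n_files;
    return my_rank / ranks_per_file;
}
```

With 5 ranks and 2 files, for example, the unchecked formula would give rank 4 a file index of 2, which is outside the range 0 to 1.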

Register a Model’s Version

void io_register_model_version (char model_version[40])

Before creating a checkpoint, you can register the model’s current revision number. Using CMake and a configured header file, it is easy to capture the Git hash. The 40-character version string will appear in the RIO checkpoint description file.
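A full Git SHA-1 hash is exactly 40 hexadecimal characters, so it fills a char[40] with no room for a terminating NUL. The sketch below prepares such a buffer safely. The `EXAMPLE_GIT_HASH` macro stands in for a value that a CMake-configured header would provide, and `example_fill_version` is a hypothetical helper, not part of RIO.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical: normally generated by CMake into a configured header. */
#define EXAMPLE_GIT_HASH "0123456789abcdef0123456789abcdef01234567"

/* Copy at most 40 characters of hash into model_version, zero-padding
 * shorter strings. A full 40-character hash leaves no NUL terminator,
 * so the buffer must not be treated as a C string afterwards. */
void example_fill_version(char model_version[40], const char *hash)
{
    assert(model_version != NULL && hash != NULL);
    size_t n = strlen(hash);
    if (n > 40)
        n = 40;
    memset(model_version, 0, 40);
    memcpy(model_version, hash, n);
}
```

The filled buffer would then be passed to io_register_model_version before the checkpoint is stored.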

Advanced: Creating a Checkpoint from Many Simulations

void io_appending_job()

RIO can create one checkpoint from many simulations. This may be useful in a situation where a ROSS simulation must be assembled from many disparate initialization or set-up files. The full-size model can be initialized into ROSS and assembled into one RIO checkpoint.

Call the io_appending_job function before calling io_store_checkpoint.