API Description
The RIO API is designed to be familiar to anyone who is using ROSS. The API includes:
- The model-defined callback functions that must be implemented by the model developer.
- A number of global variables that should be set in the model’s main function before working with a checkpoint or running the simulation.
- The functions which create and load checkpoint files.
A full implementation of the RIO API can be seen in the PHOLD-IO model.
Model Callbacks
In order to efficiently use the checkpoints, the model must provide RIO with some information about the LPs.
LP Size
typedef size_t (*model_size_f)(void * state, tw_lp *lp);
Given an LP, return the size needed to serialize its data.
LP Serialize
typedef void (*serialize_f)(void * state, void * buffer, tw_lp *lp);
Given an LP and pre-allocated buffer space, serialize the LP state data into the buffer.
LP Deserialize
typedef void (*deserialize_f)(void * state, void * buffer, tw_lp *lp);
Given an LP state and a buffer of serialized data, de-serialize the buffer data into the LP state. This function will most likely be similar to your LP initialize function.
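As an illustration, the three callbacks for a plain-data LP state could look like the sketch below. The `demo_state` struct and the `demo_*` function names are invented for this example, and `tw_lp` is forward-declared as a stand-in so the block compiles without ROSS. A real model would include the ROSS headers instead.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in so the sketch compiles without ROSS; a real model gets tw_lp from ross.h. */
typedef struct tw_lp tw_lp;

/* Hypothetical LP state, invented for this example. */
typedef struct {
    long   events_handled;
    double last_timestamp;
} demo_state;

/* Matches model_size_f: number of bytes needed to serialize this LP. */
size_t demo_size(void *state, tw_lp *lp)
{
    (void)state;
    (void)lp;
    return sizeof(demo_state);
}

/* Matches serialize_f: copy the LP state into the pre-allocated buffer. */
void demo_serialize(void *state, void *buffer, tw_lp *lp)
{
    (void)lp;
    assert(state != NULL && buffer != NULL);
    memcpy(buffer, state, sizeof(demo_state));
}

/* Matches deserialize_f: rebuild the LP state from the serialized buffer. */
void demo_deserialize(void *state, void *buffer, tw_lp *lp)
{
    (void)lp;
    assert(state != NULL && buffer != NULL);
    memcpy(state, buffer, sizeof(demo_state));
}
```

A flat `memcpy` only works because this state holds no pointers. A state that owns heap memory would need to write the pointed-to data into the buffer and report the larger size from its size callback.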
IO LP Types
Similar to the tw_lptype array, the RIO function callbacks should be stored in an io_lptype array.
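A rough sketch of such an array follows. The `io_lptype` struct defined here is a stand-in written for this example: its field names and the zeroed terminator entry are assumptions, so check the RIO header for the real definition. The `counter_*` callbacks and the `demo_io_lps` name are also hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct tw_lp tw_lp;  /* stand-in for the ROSS type */

/* Callback typedefs as given in the Model Callbacks section. */
typedef size_t (*model_size_f)(void *state, tw_lp *lp);
typedef void (*serialize_f)(void *state, void *buffer, tw_lp *lp);
typedef void (*deserialize_f)(void *state, void *buffer, tw_lp *lp);

/* ASSUMPTION: stand-in layout; the real io_lptype comes from the RIO header. */
typedef struct {
    serialize_f   serialize;
    deserialize_f deserialize;
    model_size_f  model_size;
} io_lptype;

/* Hypothetical callbacks for an LP type whose state is a single int. */
static size_t counter_size(void *state, tw_lp *lp)
{
    (void)state;
    (void)lp;
    return sizeof(int);
}

static void counter_serialize(void *state, void *buffer, tw_lp *lp)
{
    (void)lp;
    assert(state != NULL && buffer != NULL);
    memcpy(buffer, state, sizeof(int));
}

static void counter_deserialize(void *state, void *buffer, tw_lp *lp)
{
    (void)lp;
    assert(state != NULL && buffer != NULL);
    memcpy(state, buffer, sizeof(int));
}

/* One entry per LP type; designated initializers avoid relying on field order. */
io_lptype demo_io_lps[] = {
    { .serialize   = counter_serialize,
      .deserialize = counter_deserialize,
      .model_size  = counter_size },
    { 0 },  /* zeroed terminator (assumed here, by analogy with tw_lptype arrays) */
};
```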
Global Variables
There are a few global variables that should be specified before running the simulation or working with a checkpoint.
LP IO Callbacks
g_io_lp_types
This is the global array variable where the LP RIO function callbacks are stored, very similar to the ROSS g_tw_lp_types array.
This variable should be set in main before calling tw_lp_setup_types().
Number of Checkpoint Data Files
g_io_number_of_files
This variable is the total number of data files in the RIO checkpoint. When writing a checkpoint, there should never be more than one file per MPI rank.
This variable can be set through the command line option --io-files=n.
The default value is 1.
Event Buffer Size
g_io_events_buffered_per_rank
This variable is used for memory management when creating a checkpoint. RIO pre-allocates event memory in order to capture “in-flight” events at the end of the simulation. As implied by the name, this is the number of outstanding events that may exist on any one MPI rank.
For example, the PHOLD model includes a per-LP event population variable: g_pholdio_start_events.
Thus, for the PHOLD-IO model, a good buffer size is:
g_io_events_buffered_per_rank = 2 * g_tw_nlp * g_pholdio_start_events;
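To get a feel for the numbers, the same formula can be wrapped in a small helper. The function name and parameter names below are hypothetical; only the formula itself comes from the line above.

```c
#include <assert.h>

/* Hypothetical helper: 2 * (number of LPs) * (initial events per LP),
 * mirroring the PHOLD-IO formula above. */
unsigned long demo_event_buffer_size(unsigned long nlp,
                                     unsigned long start_events_per_lp)
{
    assert(nlp > 0);
    return 2UL * nlp * start_events_per_lp;
}
```

For instance, with 1024 LPs and 16 starting events per LP, the helper returns 32768 buffered events.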
API Functions
RIO provides a small set of functions for working with checkpoints.
Initialize IO System
void io_init()
The IO system must be initialized before running the simulation.
After specifying the size of the RIO event buffer, simply call the io_init() function.
Be sure that the g_io_number_of_files variable is set before calling this init function (otherwise the default value of 1 file is used).
Load a Checkpoint
void io_load_checkpoint(char * cp_name, [ PRE_INIT, INIT, POST_INIT ])
The load checkpoint function should be called before tw_run().
This function needs the root name of the checkpoint files (the same name used during the io_store_checkpoint function call).
When starting a simulation from a checkpoint, there are several options for initializing LPs:
- PRE_INIT: This option loads the LP state data from the checkpoint, then calls the LP init function.
- INIT: This option does not call the LP init function. Instead, it simply initializes LP state data from the checkpoint.
- POST_INIT: This option calls the LP init function before loading the LP state data from the checkpoint.
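The difference between the three options is only the order in which the LP init function and the checkpoint load run. The toy model below makes that order visible. It is not RIO code: the enum, the `demo_lp` struct, and all `demo_*` functions are stand-ins invented for this example, and the values 1 and 99 are arbitrary.

```c
#include <assert.h>
#include <string.h>

/* Option names from the text; this enum is a stand-in, not the RIO definition. */
typedef enum { PRE_INIT, INIT, POST_INIT } demo_load_mode;

/* Hypothetical LP with a trace of which steps ran, in order. */
typedef struct {
    int  value;
    char trace[32];
} demo_lp;

/* Stand-in for the model's LP init function. */
static void demo_lp_init(demo_lp *lp)
{
    lp->value = 1;
    strcat(lp->trace, "init;");
}

/* Stand-in for loading LP state data from a checkpoint. */
static void demo_load_state(demo_lp *lp)
{
    lp->value = 99;
    strcat(lp->trace, "load;");
}

/* Apply the ordering described for each option. */
void demo_restore(demo_lp *lp, demo_load_mode mode)
{
    assert(lp != NULL);
    lp->trace[0] = '\0';
    switch (mode) {
    case PRE_INIT:   /* load, then init */
        demo_load_state(lp);
        demo_lp_init(lp);
        break;
    case INIT:       /* load only */
        demo_load_state(lp);
        break;
    case POST_INIT:  /* init, then load */
        demo_lp_init(lp);
        demo_load_state(lp);
        break;
    }
}
```

In this toy model, whichever step runs last decides the final value: the init function can overwrite loaded data under PRE_INIT, while the loaded data overwrites init's values under POST_INIT.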
Store a Checkpoint
void io_store_checkpoint(char * cp_name, int data_file)
A checkpoint is stored after a simulation has ended. To store a checkpoint be sure to do a few things:
- Allocate some buffer space for in-flight events (by setting g_io_events_buffered_per_rank).
- Then call io_init().
- Run the simulation using tw_run().
- Store the checkpoint using io_store_checkpoint.
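The steps above can be sketched as a single call sequence. So that the block runs without ROSS, the four library names are replaced by local stand-ins that only check the calling order. In a real model they come from the ROSS and RIO headers, and the type of `g_io_events_buffered_per_rank` used here is an assumption. The `demo_run_and_store` wrapper is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* ---- Stand-ins so the sketch runs without ROSS/RIO (not the real library code) ---- */
static int g_io_events_buffered_per_rank = 0;  /* real type may differ */
static int demo_step = 0;                      /* last step completed */

static void io_init(void)
{
    assert(g_io_events_buffered_per_rank > 0);  /* buffer size must be set first */
    demo_step = 1;
}

static void tw_run(void)
{
    assert(demo_step == 1);                     /* io_init() must come first */
    demo_step = 2;
}

static void io_store_checkpoint(char *cp_name, int data_file)
{
    assert(demo_step == 2);                     /* store only after the run ends */
    printf("storing %s (data file %d)\n", cp_name, data_file);
    demo_step = 3;
}
/* ---- End of stand-ins ---- */

/* The four steps from the list, in order. Returns the last step completed. */
int demo_run_and_store(char *cp_name, int data_file, int buffered_events)
{
    g_io_events_buffered_per_rank = buffered_events;  /* 1. reserve event buffer */
    io_init();                                        /* 2. initialize RIO */
    tw_run();                                         /* 3. run the simulation */
    io_store_checkpoint(cp_name, data_file);          /* 4. write the checkpoint */
    return demo_step;
}
```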
The store checkpoint function takes two arguments: the global checkpoint name and which data-file the current MPI rank should be a part of. If your mapping is LINEAR (or some other evenly distributed mapping), the simplest way to determine the data file number is:
// either set g_io_number_of_files
// or use --io-files=n
int ranks_per_file = tw_nnodes() / g_io_number_of_files;
int data_file = g_tw_mynode / ranks_per_file;
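The integer division in that snippet assumes the number of ranks is a multiple of the number of files. A slightly more defensive variant is sketched below. The helper name and its parameters are hypothetical, and folding leftover ranks into the last file is a choice made for this sketch, not documented RIO behavior.

```c
#include <assert.h>

/* Hypothetical helper: map an MPI rank to a data file index for an
 * evenly distributed mapping. */
int demo_data_file(int rank, int num_ranks, int num_files)
{
    /* Never more than one file per rank, so ranks_per_file is at least 1. */
    assert(num_files >= 1 && num_files <= num_ranks);
    assert(rank >= 0 && rank < num_ranks);

    int ranks_per_file = num_ranks / num_files;
    int data_file = rank / ranks_per_file;

    /* If num_ranks is not a multiple of num_files, the plain division can
     * yield an index past the last file; fold those ranks into the last file. */
    if (data_file >= num_files)
        data_file = num_files - 1;

    return data_file;
}
```

With 10 ranks and 4 files, for example, the plain division would give rank 9 the index 4, which is out of range for 4 files. The sketch assigns it to file 3 instead.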
Register a Model’s Version
void io_register_model_version (char model_version[40])
Before creating a checkpoint, you can register the model’s current revision number. Using CMake and a configured header file, it is easy to capture the Git hash. The 40-character version information will appear in the RIO checkpoint description file.
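Because the function takes a fixed 40-character array, a version string of any other length has to be fitted into that size first. One possible way to do this is sketched below. The helper and macro names are hypothetical, and padding short strings with NUL bytes is a choice made here. This section does not say whether RIO expects a terminator, and a full 40-character hash leaves no room for one.

```c
#include <assert.h>
#include <string.h>

#define DEMO_VERSION_LEN 40  /* length of a full SHA-1 Git hash in hex */

/* Hypothetical helper: copy a version string into a fixed 40-char buffer,
 * truncating longer strings and padding shorter ones with NUL bytes. */
void demo_fill_version(char out[DEMO_VERSION_LEN], const char *version)
{
    assert(out != NULL && version != NULL);

    size_t n = strlen(version);
    if (n > DEMO_VERSION_LEN)
        n = DEMO_VERSION_LEN;

    memset(out, '\0', DEMO_VERSION_LEN);
    memcpy(out, version, n);
}
```

The filled buffer could then be passed to io_register_model_version before the checkpoint is stored.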
Advanced: Creating a Checkpoint from Many Simulations
void io_appending_job()
RIO can create one checkpoint from many simulations. This may be useful in a situation where a ROSS simulation must be assembled from many disparate initialization or set-up files. The full-size model can be initialized into ROSS and assembled into one RIO checkpoint.
Call the io_appending_job function before calling io_store_checkpoint.