Parallelism and performance
HemoCell is built for large simulations and scales from a single workstation to
HPC clusters with thousands of cores. Parallelism is implemented with MPI and,
in most cases, is hidden from the user: you launch the case with mpirun and
HemoCell distributes the work.
Domain decomposition
The fluid domain is divided into atomic blocks (a Palabos concept): smaller cuboidal sub-domains, each owned by one MPI process. The cells living in a block are handled by the same process that owns that fluid region, which keeps the fluid–cell coupling local. Processes exchange a thin envelope of data around each block so that forces and velocities are consistent across block boundaries.
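To make the decomposition concrete, here is a minimal, self-contained sketch. It is not HemoCell or Palabos code. It splits one axis of a lattice into contiguous blocks and then widens a block by an envelope. The names Range, splitAxis, and withEnvelope are hypothetical, and the real decomposition works in three dimensions and may choose block shapes differently.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Inclusive range of lattice nodes along one axis.
// Hypothetical helper type, not part of Palabos or HemoCell.
struct Range {
    int begin;
    int end;
};

// Split `nodes` lattice nodes into `blocks` contiguous ranges whose sizes
// differ by at most one node.
std::vector<Range> splitAxis(int nodes, int blocks) {
    assert(nodes > 0 && blocks > 0 && blocks <= nodes);
    std::vector<Range> result;
    const int base = nodes / blocks;
    const int extra = nodes % blocks;  // the first `extra` blocks get one more node
    int start = 0;
    for (int b = 0; b < blocks; ++b) {
        const int size = base + (b < extra ? 1 : 0);
        result.push_back({start, start + size - 1});
        start += size;
    }
    return result;
}

// Widen a block by `envelope` nodes on each side, clamped to the domain.
// The nodes added by the widening are the data a process would need from its
// neighbours. Periodic boundaries are ignored in this sketch.
Range withEnvelope(Range block, int envelope, int nodes) {
    return {std::max(0, block.begin - envelope),
            std::min(nodes - 1, block.end + envelope)};
}
```

For example, splitAxis(10, 4) gives blocks of 3, 3, 2, and 2 nodes, and widening the second block (nodes 3 to 5) by an envelope of 2 gives nodes 1 to 7.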
Two configuration values relate directly to this decomposition:
<domain><particleEnvelope>: the width (in lattice units) of the halo used to communicate particle data between blocks. It must be a little larger than the longest reach of any cell, otherwise parts of a cell can fall outside the communicated region and the cell is deleted. A value around 25 is typical; HemoCell warns if it looks too small.
<domain><blockSize>: an optional target edge length for atomic blocks, useful in combination with load balancing.
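As a rough illustration of the sizing rule for particleEnvelope, the sketch below compares an envelope width with the largest extent of a cell, measured here as the largest distance between any two of its vertices. Reading "longest reach" this way is an assumption, and the names Vec3, maxExtent, and envelopeCoversCell are hypothetical. HemoCell's own warning may use a different criterion.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Vertex position in lattice units. Hypothetical type for this sketch.
struct Vec3 {
    double x, y, z;
};

// Largest distance between any two vertices of one cell (brute force, O(n^2)).
double maxExtent(const std::vector<Vec3>& vertices) {
    double best = 0.0;
    for (std::size_t i = 0; i < vertices.size(); ++i) {
        for (std::size_t j = i + 1; j < vertices.size(); ++j) {
            const double dx = vertices[i].x - vertices[j].x;
            const double dy = vertices[i].y - vertices[j].y;
            const double dz = vertices[i].z - vertices[j].z;
            best = std::max(best, std::sqrt(dx * dx + dy * dy + dz * dz));
        }
    }
    return best;
}

// True if the envelope is strictly wider than the cell's largest extent,
// so that the whole cell fits inside the communicated halo.
bool envelopeCoversCell(double envelope, const std::vector<Vec3>& vertices) {
    return envelope > maxExtent(vertices);
}
```

With a toy cell whose largest extent is 16 lattice units, an envelope of 25 passes this check and an envelope of 10 does not.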
Note
Running on multiple processes changes one user-visible behaviour: a cell is initialised only by the process that owns the block containing the cell’s centre. Cells whose centre lies outside the domain are removed at initialisation (on a single core they are kept). This is why a case can show different cell counts on 1 vs. N cores right after start-up. See Cell deletions.
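The ownership rule in the note can be modelled with a small one-dimensional sketch. It is an illustration only and not HemoCell's initialisation code: owningBlock and countInitialised are hypothetical names, and blocks are assumed to be equal-length intervals of the domain [0, domainLength).

```cpp
#include <vector>

// Index of the block that owns position x along one axis, or -1 if x lies
// outside the domain [0, domainLength).
int owningBlock(double x, double domainLength, int blocks) {
    if (x < 0.0 || x >= domainLength) {
        return -1;
    }
    const double blockLength = domainLength / blocks;
    const int b = static_cast<int>(x / blockLength);
    return b < blocks ? b : blocks - 1;  // guard against rounding at the upper edge
}

// Number of cells that would be initialised under the multi-process rule:
// a cell is kept only if some block owns its centre.
int countInitialised(const std::vector<double>& centres,
                     double domainLength, int blocks) {
    int count = 0;
    for (double c : centres) {
        if (owningBlock(c, domainLength, blocks) >= 0) {
            ++count;
        }
    }
    return count;
}
```

For centres at -1, 5, 49, 99.5, and 101 in a domain of length 100 split into 4 blocks, this rule initialises 3 cells, whereas a single-core run that keeps every cell would start with 5.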
Running across processes
A case is launched with the number of MPI ranks you want:
mpirun -n 4 ./pipeflow config.xml
The number of ranks used to resume from a checkpoint does not have to match the number used for the original run (see Resuming from a checkpoint), which makes it easy to migrate a simulation between machines.
Load balancing
When cells are unevenly distributed (dense in some regions, sparse in others),
the processes owning the dense regions do more work and the others wait. HemoCell
can rebalance the atomic blocks across processes to even out this cost. The
relevant case-level controls are
HemoCell::calculateFractionalLoadImbalance(),
HemoCell::doLoadBalance(), and
HemoCell::doRestructure(), together with the <sim><tbalance>
interval.
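The decision of when to rebalance can be sketched independently of HemoCell. The example below assumes a common definition of fractional load imbalance, (max - mean) / mean over the per-process loads. The value returned by HemoCell::calculateFractionalLoadImbalance() may be defined differently. The function names and the maxImbalance threshold are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Fractional load imbalance, assumed here to be (max - mean) / mean.
// 0 means perfectly balanced; 1 means the busiest process carries twice
// the average load.
double fractionalLoadImbalance(const std::vector<double>& load) {
    assert(!load.empty());
    const double maxLoad = *std::max_element(load.begin(), load.end());
    const double mean =
        std::accumulate(load.begin(), load.end(), 0.0) / load.size();
    return mean > 0.0 ? (maxLoad - mean) / mean : 0.0;
}

// Check the imbalance only every `tbalance` iterations, and request a
// rebalance only when it exceeds the chosen threshold.
bool shouldRebalance(unsigned iter, unsigned tbalance,
                     const std::vector<double>& load, double maxImbalance) {
    if (tbalance == 0 || iter % tbalance != 0) {
        return false;
    }
    return fractionalLoadImbalance(load) > maxImbalance;
}
```

In a real case file, a true result would be the point at which to call HemoCell::doLoadBalance() followed by HemoCell::doRestructure(). That call order is an assumption here and should be checked against the shipped examples.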
Note
Load balancing relies on the optional Parmetis dependency and a
library built with that feature enabled (the _parmetis target, see
Compiling examples with special features). At the time of writing
the load-balancing routines are still under active development.
Reducing cost: time-scale separation
The most effective performance lever in HemoCell is usually not the number of cores but time-scale separation: running the expensive material-model and velocity-interpolation steps only once every several fluid iterations. Tuning these separations often yields a larger speed-up than adding hardware and, if they are chosen sensibly, leaves the result essentially unchanged.
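The effect of time-scale separation on the amount of work can be seen in a toy loop that only counts how often each step would run. This is a sketch under stated assumptions, not HemoCell's iteration code: StepCounts, runWithSeparation, materialEvery, and velocityEvery are hypothetical names, and in a real run the results of the skipped steps are reused from the last update.

```cpp
#include <cassert>

// How often each kind of step ran. Hypothetical bookkeeping type.
struct StepCounts {
    int fluidSteps = 0;
    int materialUpdates = 0;
    int velocityUpdates = 0;
};

// Run `iterations` fluid steps, performing the material-model update only
// every `materialEvery` steps and the velocity interpolation only every
// `velocityEvery` steps.
StepCounts runWithSeparation(int iterations, int materialEvery,
                             int velocityEvery) {
    assert(materialEvery > 0 && velocityEvery > 0);
    StepCounts counts;
    for (int iter = 0; iter < iterations; ++iter) {
        if (iter % materialEvery == 0) {
            ++counts.materialUpdates;  // expensive: recompute cell forces
        }
        if (iter % velocityEvery == 0) {
            ++counts.velocityUpdates;  // expensive: interpolate fluid velocity
        }
        ++counts.fluidSteps;           // the fluid is advanced every iteration
    }
    return counts;
}
```

With 1000 fluid iterations, materialEvery = 20, and velocityEvery = 5, the loop performs 50 material updates and 200 velocity updates instead of 1000 of each. Which separation values are safe depends on the case and has to be validated against a run without separation.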
Profiling
HemoCell ships with a lightweight built-in profiler to find the expensive parts
of a run. See Profiling Hemocell performance (the
profiler.h helper) for how to enable and read it.
Note
For details on running on specific HPC systems (module environments and batch
templates), see the system *_env.sh and *_batch_template.sh scripts in
scripts/ (hemocell/scripts/[_system_]_env.sh).