Parallelism and performance

HemoCell is built for large simulations and scales from a single workstation to HPC clusters with thousands of cores. Parallelism is implemented with MPI and is, in most cases, hidden from the user: you launch the case with mpirun and HemoCell distributes the work.

Domain decomposition

The fluid domain is divided into atomic blocks (a Palabos concept): smaller cuboidal sub-domains, each owned by one MPI process. The cells living in a block are handled by the same process that owns that fluid region, which keeps the fluid–cell coupling local. Processes exchange a thin envelope of data around each block so that forces and velocities are consistent across block boundaries.
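
The decomposition described above can be pictured with a small toy model. The following is a minimal sketch, not HemoCell or Palabos code: it assumes exactly one cuboidal block per rank and a regular grid of blocks, and the names decompose and owner are hypothetical helpers introduced only for illustration.

```python
from itertools import product

def decompose(domain, blocks_per_axis):
    """Split a cuboid domain (nx, ny, nz) into cuboidal blocks.

    Returns a list of ((x0, x1), (y0, y1), (z0, z1)) half-open ranges.
    Assumption: one block per rank, laid out on a regular grid.
    """
    axes = []
    for n, parts in zip(domain, blocks_per_axis):
        # Spread any remainder over the first blocks so sizes differ by at most 1.
        base, extra = divmod(n, parts)
        edges = [0]
        for i in range(parts):
            edges.append(edges[-1] + base + (1 if i < extra else 0))
        axes.append(list(zip(edges[:-1], edges[1:])))
    return list(product(*axes))

def owner(blocks, point):
    """Index of the block (here: the rank) whose region contains `point`."""
    for rank, block in enumerate(blocks):
        if all(lo <= p < hi for (lo, hi), p in zip(block, point)):
            return rank
    return None  # the point lies outside the domain

blocks = decompose((100, 50, 50), (2, 2, 1))
print(len(blocks))                         # 4
print(owner(blocks, (75.0, 10.0, 25.0)))   # 2
```

In this picture, a cell and the fluid nodes around it map to the same rank, which is what keeps the fluid–cell coupling local; only data within the envelope around each block has to be exchanged.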

Two configuration values relate directly to this decomposition:

<domain><particleEnvelope>: the width (in lattice units) of the halo used to communicate particle data between blocks. It must be somewhat larger than the largest extent of any cell; otherwise parts of a cell can fall outside the communicated region and the cell is deleted. A value around 25 is typical, and HemoCell warns if the value looks too small.
<domain><blockSize>: an optional target edge length for atomic blocks, useful in combination with load balancing.
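
As a rough way to reason about the envelope width, the sketch below compares it with the size of a cell. It is an illustration under stated assumptions, not the check HemoCell performs: the "reach" of a cell is interpreted here as its largest vertex-to-vertex distance, the safety margin of 2 lattice units is arbitrary, and max_extent and envelope_ok are hypothetical names.

```python
import math
from itertools import combinations

def max_extent(vertices):
    """Largest distance between any two vertices of a cell (lattice units)."""
    return max(math.dist(a, b) for a, b in combinations(vertices, 2))

def envelope_ok(vertices, particle_envelope, margin=2.0):
    """True if the envelope covers the whole cell plus an assumed safety margin."""
    return particle_envelope >= max_extent(vertices) + margin

# Four vertices of a made-up flat cell that is 16 lattice units across.
cell = [(0, 0, 0), (16, 0, 0), (8, 8, 0), (8, -8, 0)]

print(max_extent(cell))        # 16.0
print(envelope_ok(cell, 25))   # True
print(envelope_ok(cell, 16))   # False
```

Because the extent is measured in lattice units, refining the lattice makes the same physical cell span more lattice nodes, so the envelope may need to grow with it.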

Note

Running on multiple processes changes one user-visible behaviour: a cell is initialised only by the process that owns the block containing the cell’s centre. Cells whose centre lies outside the domain are removed at initialisation (on a single core they are kept). This is why a case can show different cell counts on 1 vs. N cores right after start-up. See Cell deletions.
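
The rule in this note can be mimicked in a one-dimensional toy model. This is only a sketch of the behaviour as stated above, assuming equal-width blocks along one axis; owning_rank and initialised_cells are hypothetical names, not HemoCell functions.

```python
def owning_rank(centre, domain_length, n_ranks):
    """Rank whose block contains `centre`, or None if it lies outside the domain."""
    if not 0 <= centre < domain_length:
        return None
    return min(int(centre * n_ranks / domain_length), n_ranks - 1)

def initialised_cells(centres, domain_length, n_ranks):
    """Number of cells present right after start-up in the toy model."""
    if n_ranks == 1:
        # Single process: per the note, cells with an outside centre are kept.
        return len(centres)
    # Several processes: a cell exists only if some rank owns its centre.
    return sum(owning_rank(c, domain_length, n_ranks) is not None for c in centres)

centres = [-2.0, 10.0, 40.0, 99.0, 101.5]   # two centres lie outside [0, 100)
print(initialised_cells(centres, 100.0, 1))  # 5
print(initialised_cells(centres, 100.0, 4))  # 3
```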

Running across processes

A case is launched with the number of MPI ranks you want:

mpirun -n 4 ./pipeflow config.xml

The number of ranks used to resume from a checkpoint does not have to match the number used for the original run (see Resuming from a checkpoint), which makes it easy to migrate a simulation between machines.

Load balancing

When cells are unevenly distributed (dense in some regions, sparse in others), the processes owning the dense regions do more work and the others wait. HemoCell can rebalance the atomic blocks across processes to even out this cost. The relevant case-level controls are HemoCell::calculateFractionalLoadImbalance(), HemoCell::doLoadBalance(), and HemoCell::doRestructure(), together with the <sim><tbalance> interval.
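
To make the idea of an imbalance measure concrete, here is a minimal sketch. It assumes the imbalance is defined as (max - mean) / mean of the per-rank work, which may differ from what HemoCell::calculateFractionalLoadImbalance() actually computes; the threshold of 0.5 and the function names are likewise hypothetical.

```python
def fractional_load_imbalance(work_per_rank):
    """(max - mean) / mean of the per-rank work; 0.0 means perfectly balanced."""
    mean = sum(work_per_rank) / len(work_per_rank)
    return (max(work_per_rank) - mean) / mean

def should_rebalance(iteration, interval, work_per_rank, threshold=0.5):
    """Check the imbalance only every `interval` iterations (cf. <sim><tbalance>)."""
    if iteration % interval != 0:
        return False
    return fractional_load_imbalance(work_per_rank) > threshold

uneven = [100, 100, 100, 300]   # one rank owns a dense region
even = [150, 150, 150, 150]

print(fractional_load_imbalance(uneven))   # 1.0
print(fractional_load_imbalance(even))     # 0.0
print(should_rebalance(1000, 500, uneven)) # True
```

With this definition, a value of 1.0 means the busiest rank carries twice the average load, so the other ranks spend roughly half of each step waiting.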

Note

Load balancing relies on the optional ParMETIS dependency and on a library built with that feature enabled (the _parmetis target, see Compiling examples with special features). At the time of writing the load-balancing routines are still under active development.

Reducing cost: time-scale separation

The most effective performance lever in HemoCell is usually not the number of cores but time-scale separation: running the expensive material-model and velocity-interpolation steps only once every several fluid iterations. Tuning these separations often yields a larger speed-up than adding hardware, with little or no effect on the result when the intervals are chosen sensibly.
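
The principle can be shown with a toy time loop. This is a sketch of the general technique only, assuming the most recent force is simply reused between updates; the two stand-in functions and the interval of 20 are made up for illustration and do not correspond to HemoCell settings.

```python
def expensive_material_model(iteration):
    """Stand-in for a costly force computation (hypothetical)."""
    return 0.001 * iteration

def fluid_step(state, force):
    """Stand-in for one cheap fluid update that consumes the current force."""
    return state + force

def run(n_iterations, material_interval):
    """Advance the toy system, recomputing the force every `material_interval` steps."""
    state, force, material_calls = 0.0, 0.0, 0
    for it in range(n_iterations):
        if it % material_interval == 0:
            force = expensive_material_model(it)  # refreshed only occasionally
            material_calls += 1
        state = fluid_step(state, force)          # reuses the cached force
    return state, material_calls

state_full, calls_full = run(1000, 1)
state_sep, calls_sep = run(1000, 20)
print(calls_full, calls_sep)  # 1000 50
```

The expensive step runs 20 times less often, while the final state differs only slightly from the fully resolved run. How large an interval is acceptable depends on how quickly the forces change relative to the fluid time step.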

Profiling

HemoCell ships with a lightweight built-in profiler to find the expensive parts of a run. See Profiling Hemocell performance (the profiler.h helper) for how to enable and read it.

Note

For details on running on specific HPC systems (module environments and batch templates), see the system *_env.sh and *_batch_template.sh scripts in scripts/ (hemocell/scripts/[_system_]_env.sh).
