Fix overlap tiling (real cases no tested) (781999e9) · Commits · HYDRO / TRITON

doc/configuration_reference.rst

+15 −2

Original line number	Diff line number	Diff line
		@@ -753,6 +753,19 @@ Advanced settings for stability and parallel performance.
		this global switch for the specified segments only. Other edges remain
		subject to the global setting.

		.. index:: single: overlap_tiling
		.. index:: pair: configuration; overlap_tiling

		- overlap_tiling: enable communication-avoiding time stepping. Options:
		- ``0`` = standard mode, halo exchange every timestep (default)
		- ``1`` = overlap tiling mode, halo exchange every ``GHOST_CELL_PADDING`` timesteps

		When enabled, performs up to ``GHOST_CELL_PADDING`` local substeps before
		exchanging halo data with MPI neighbors. This reduces communication overhead
		while maintaining bitwise-identical results. Requires ``GHOST_CELL_PADDING >= 1``
		(compile-time setting). Note that ``GHOST_CELL_PADDING=1`` provides no
		communication reduction (halo exchange every step).

		.. index:: single: it_count
		.. index:: pair: configuration; it_count

doc/configuration_variable_index.rst

+4 −0

Original line number	Diff line number	Diff line
		@@ -33,6 +33,7 @@ Short, alphabetized reference for all configuration variables. Each row links to
		pair: configuration; num_sources
		pair: configuration; observation_loc_file
		pair: configuration; open_boundaries
		pair: configuration; overlap_tiling
		pair: configuration; outfile_pattern
		pair: configuration; output_format
		pair: configuration; output_option
		@@ -123,6 +124,9 @@ Short, alphabetized reference for all configuration variables. Each row links to
		* - open_boundaries
		- :ref:`misc_params`
		- Global switch to open domain edges; ignored when explicit boundaries are defined.
		* - overlap_tiling
		- :ref:`misc_params`
		- Enable communication-avoiding time stepping (halo exchange every GHOST_CELL_PADDING steps).
		* - outfile_pattern
		- :ref:`io_formats`
		- Naming convention for output files.

doc/index.rst

+1 −0

Original line number	Diff line number	Diff line
		@@ -160,6 +160,7 @@ Project Website
		simulation_setup
		configuration_reference
		configuration_variable_index
		overlap_tiling
		triton_run
		docker_run
		ensemble_run

doc/overlap_tiling.rst

0 → 100644

+158 −0

Original line number	Diff line number	Diff line
		.. _overlap_tiling:

		Overlap Tiling / Communication-Avoiding Time Stepping
		=====================================================

		Overview
		--------

		Overlap tiling is a performance optimization that reduces MPI communication
		overhead in parallel simulations. When enabled, the solver performs multiple
		local timesteps before exchanging halo data with neighboring MPI ranks.

		The key insight is that with ``GHOST_CELL_PADDING`` halo cells available, we
		can compute up to ``GHOST_CELL_PADDING`` substeps before needing fresh halo
		data from neighbors. Each substep "consumes" one layer of valid halo data
		from the MPI boundaries inward.


		How It Works
		------------

		Standard Mode (overlap_tiling=0)
		~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

		In standard mode, every timestep follows this pattern:

		1. Compute fluxes and update cells over the full domain
		2. Exchange halo data with MPI neighbors
		3. Apply wet/dry corrections at MPI boundaries

		This requires one MPI communication per timestep.

		Overlap Tiling Mode (overlap_tiling=1)
		~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

		In overlap tiling mode, we perform ``m = GHOST_CELL_PADDING`` substeps
		before exchanging halo data:

		Substep 1:
		1. Compute fluxes and update cells over the full domain
		2. Exchange halo data with MPI neighbors
		3. Apply wet/dry corrections at MPI boundaries

		Substeps 2 to m:
		1. Compute fluxes and update cells over the interior domain only
		(skipping boundary rows that depend on stale halo data)
		2. Apply wet/dry corrections at MPI boundaries

		After m substeps:
		- Halo exchange is performed
		- Substep counter resets

		This reduces MPI communication from ``N`` exchanges to ``N/m`` exchanges
		for a simulation with ``N`` timesteps.


		Configuration
		-------------

		Enable overlap tiling in your configuration file:

		.. code-block:: text

		overlap_tiling=1

		The number of substeps per halo exchange is determined by the compile-time
		parameter ``GHOST_CELL_PADDING``. To change this value, rebuild with:

		.. code-block:: bash

		cmake -DGHOST_CELL_PADDING=4 ..
		make

		.. note::
		``GHOST_CELL_PADDING=1`` provides no communication reduction since
		halo exchange occurs every timestep. A warning is printed in this case.


		Bitwise Reproducibility
		-----------------------

		Overlap tiling produces bitwise-identical results to standard mode.
		This is ensured by:

		1. Computing the same CFL-limited timestep on every substep
		2. Using identical numerical kernels (only the loop bounds change)
		3. Applying wet/dry corrections at MPI boundaries every substep
		4. Exchanging halo data at the correct intervals


		Performance Considerations
		--------------------------

		Overlap tiling provides the most benefit when:

		- MPI communication is a significant portion of runtime
		- ``GHOST_CELL_PADDING >= 2`` (larger values = fewer communications)
		- The domain is partitioned across many MPI ranks

		The overhead of skipping boundary rows in substeps 2+ is minimal compared
		to the savings from reduced MPI communication.


		Technical Details
		-----------------

		Domain Bounds
		~~~~~~~~~~~~~

		For a process with ``rows`` total rows, the bounds shrink with each substep
		as halo data becomes "stale":

		- Substep s (where s = 1, 2, ..., m):
		- ``ilo = GHOST_CELL_PADDING + (s - 1)``
		- ``ihi = rows - GHOST_CELL_PADDING - (s - 1)``

		For example, with ``GHOST_CELL_PADDING=3``:

		- Substep 1: compute rows [3, rows-3]
		- Substep 2: compute rows [4, rows-4]
		- Substep 3: compute rows [5, rows-5]
		- After substep 3: halo exchange, reset to substep 1

		Each substep skips the outermost valid rows that depend on halo data from
		the previous exchange, ensuring correctness without fresh halo data.

		Kernels Modified
		~~~~~~~~~~~~~~~~

		The following kernels accept optional ``ilo`` and ``ihi`` parameters:

		- ``flux_x()`` - x-direction flux computation
		- ``flux_y()`` - y-direction flux computation (uses ``ihi+1`` internally for stencil)
		- ``update_cells()`` - cell state update
		- ``wet_dry()`` - wet/dry cell correction

		When called without bounds, they default to the original full-domain behavior.


		Validation
		----------

		To verify bitwise reproducibility, compare outputs with and without overlap tiling:

		.. code-block:: bash

		# Run baseline
		echo "overlap_tiling=0" > test.cfg
		mpirun -np 4 ./triton.exe test.cfg
		cp -r output output_baseline

		# Run with overlap tiling
		echo "overlap_tiling=1" > test.cfg
		mpirun -np 4 ./triton.exe test.cfg
		cp -r output output_tiling

		# Compare (should show no differences)
		diff -r output_baseline output_tiling

src/kernels.h

+10 −11

Original line number	Diff line number	Diff line
		@@ -346,8 +346,9 @@ namespace Kernels
		int ilo = GHOST_CELL_PADDING,
		int ihi = -1)
		{
		// Default ihi to nrows - GHOST_CELL_PADDING if not provided
		if (ihi < 0) ihi = nrows - GHOST_CELL_PADDING;
		// flux_y needs one extra row at bottom (computes N edge = south of row above)
		int ihi_flux_y = ihi + 1;

		/****
		* RHS sketch
		@@ -369,7 +370,7 @@ namespace Kernels

		bool
		is_top = (ix < ilo),
		is_btm = (ix >= ihi),
		is_btm = (ix >= ihi_flux_y),
		is_lt = (iy <= GHOST_CELL_PADDING-1),
		is_rt = (iy >= ncols - GHOST_CELL_PADDING);

		@@ -625,8 +626,6 @@ namespace Kernels
		int ilo = GHOST_CELL_PADDING,
		int ihi = -1)
		{
		// Default ihi to nrows - GHOST_CELL_PADDING if not provided
		if (ihi < 0) ihi = nrows - GHOST_CELL_PADDING;
		if (ihi < 0) ihi = nrows - GHOST_CELL_PADDING;

		triton::parallel_for( AUTO_LABEL() , size , KOKKOS_LAMBDA (int id) {
		@@ -776,9 +775,9 @@ namespace Kernels
		}


		/** @brief It updates q_y for halo cells.
		/** @brief Updates q_y at boundary interfaces for wet/dry cells.
		*
		* @param size Array size
		* @param size Array size (2 * ncols * GHOST_CELL_PADDING)
		* @param nrows Number of rows in that domain/subdomain
		* @param ncols Number of columns in that domain/subdomain
		* @param h_arr Water depth array
		@@ -787,7 +786,7 @@ namespace Kernels
		* @param hextra Minimum depth (tolerance below water is at rest)
		*/
		template<typename T>
		void wet_dry_qy_halo(int size, int nrows, int ncols,
		void wet_dry_qy(int size, int nrows, int ncols,
		T const * KOKKOS_RESTRICT h_arr ,
		T * KOKKOS_RESTRICT qy_arr,
		T const * KOKKOS_RESTRICT dem ,