* apply pipeline ring algorithm in G2 send/recv
* test wip
* add ring G test and verify it works
* main_dca works with one accumulator and if local measurements are equal (w/ 0), accuracy needs verification
* update send/recv tag for ring G alg
  make ringG available at compile time, add documentation, clean up code
  add python tool that can diff two G4s
  remove distributed test and ensure ring G test runs correctly, add compiler flag where needed
  modify G4 tiling method to rank index instead of w decomposed index
  reset G4 size, get index boundary right
  it works on multiple ranks, unevenly distributed G4 array size
  add STL algorithm and clean up code
* switch nvlink flip from compilation flag to config settings
* remove recv buffer to avoid copy
* temporarily adding GPTL profiling library
* update python tool for G4 diff
* add multi-threaded support to ring G by adding thread id and n_acc to tp_acc
* fixing typo
* add comments to mci parameters related to nvlink
* remove gptl from code
* trying to improve memory allocation
* add copyFrom function in RMatrix and modify copy operation in sendbuff
* modify copy operator and add allocate method for RMatrix, update sendbuff copy and allocation
* add allocation flag to sendbuff and remove unnecessary MPI_Barrier
* remove allocate method in RMatrix and move the allocation into cached_ndft_gpu
* remove allocation in cached_ndft_gpu, use swap op in sendbuff to G_
* cleaned up reshapable matrix assignment.
* compute start and end of G4 linearized 1D index in CPU code and launch 1D thread blocks
* rename nvlink-enabled and nvlink related variables to distributed-g4-enabled to avoid vendor-specific naming
* replace int with uint64_t type for G4 index related variables
* more index processing
* add g4 index back if distributed g4 is not enabled
* fix G4 mem allocation
* comment out gatherv, add doc, and format code
* more formatting, remove unnecessary MPI related code
* clean up the python file and add author info
* offset index in mpi_gatherv to correct pos and cleanup code
* rm std::fill as not necessary
* add wiki doc for distributedg4 and upload helper file
* add uint64_t cast wherever needed
* Make function allocate only the local portion of G4 on CPU
* fix typos
* clean up changes in function constructor
* adding missing mpi.h headers
* rename reset_size to resize
* avoid index overflow in device code
* Added integration test for getComputeRange
* now the subindices info should only get printed in verbose
* fix off by one on kernel code
* rename nb_more_work_ranks to more_work_ranks
* fixing typo
* fix off by one for unbalanced case
* propagate index off changes into kernel code
* demonstrating and checking the function subindexspan
* rename Nb_elements to nb_elements_ in function.hpp
* add missing g4 accumulate guard
* use Module operator and MPITypeMap G2 ScalarType for ringG alg
* Clean up G4 indices computation. Added missing include.
* quick and not yet dry refactor to disentangle MPI dependence
* fix crash from refactor of tp_accumulator_gpu into _gpu and _mpi_gpu
* update of copyright year and names
* actually add plumbing for runtime distributed G4
* add missing mpi type header to build ringG test
* silenced memory type warning.
* more changes for working runtime distribution choice
* add missing file
* missing header cmdlauncher execute permission
* compiles but fails ringG test due to change of sense of start_-end_
* partially fixed start end (I think)
* remove @weili's bug
* careful with the sizes even undistributed G4_ != tp_dmn
* fix tp_acc_mpi_gpu and ringG test
* supporting cuda visible devices so ctest can run the ringG test.
* removing needless constructor complication
* ringG fixed on summit, equivalent to the smpiargs for other plat?
* slight modification for smpiargs
* fixing test compilation failure.
* hopefully this will placate gcc 8.3.0 on cray
* more fixes for CI, cautionary comment in function.hpp
* add comment

Co-authored-by: Weile Wei <lokwei9@gmail.com>
Co-authored-by: gbalduzz <gbalduzz@itp.phys.ethz.ch>
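
Note on the index split: several entries above concern computing the start and end of each rank's slice of the linearized G4 index, including off-by-one fixes for the unbalanced case. The following is only a minimal sketch of one common way to do such a split, not the code from this commit. The function name `computeRangeSketch`, its signature, and the half-open `[start, end)` convention are all assumptions made for illustration; only the `more_work_ranks` name and the use of `uint64_t` are taken from the log.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical helper (not the getComputeRange from the commit).
// Splits n_elements linearized indices across n_ranks and returns the
// half-open range [start, end) owned by `rank`. When the split is uneven,
// the first `more_work_ranks` ranks each take one extra element.
std::pair<std::uint64_t, std::uint64_t> computeRangeSketch(std::uint64_t n_elements,
                                                           std::uint64_t n_ranks,
                                                           std::uint64_t rank) {
  assert(n_ranks > 0 && rank < n_ranks);

  const std::uint64_t base = n_elements / n_ranks;             // minimum share per rank
  const std::uint64_t more_work_ranks = n_elements % n_ranks;  // ranks with one extra element

  // Every lower rank contributes `base` elements, plus one for each lower
  // rank that carries an extra element.
  const std::uint64_t start = rank * base + std::min(rank, more_work_ranks);
  const std::uint64_t size = base + (rank < more_work_ranks ? 1 : 0);

  return {start, start + size};
}
```

With 10 elements on 3 ranks, this sketch assigns the ranges [0, 4), [4, 7) and [7, 10), so the ranges are contiguous and cover every index exactly once. Using 64-bit unsigned indices follows the log's motivation of avoiding overflow for large G4 arrays.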