Draft: Resolve "Basic DDP example"
Closes #23
Merge request reports
Activity
added In Progress label
assigned to @28t
I think this may be a good time to learn how to use PyTorch parametrizations: https://pytorch.org/tutorials/intermediate/parametrizations.html We would have `probe_f` parametrized by `probe_f_real`, which is NxNx2, and the parametrization would just call `torch.view_as_complex(self.probe_f_real)`. Then we could use `probe_f` as normal without having to remember to manually do the parametrization every time. Note that I have not used this feature yet, though!
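Roughly, an untested sketch of that idea (the module and tensor names here are just for illustration; `unsafe=True` is passed because the parametrization changes dtype and shape, and depending on the PyTorch version that flag may or may not be required):

```python
import torch
import torch.nn as nn
import torch.nn.utils.parametrize as parametrize


class AsComplex(nn.Module):
    """Parametrization: expose a real (..., 2) tensor as its complex view."""

    def forward(self, probe_f_real):
        return torch.view_as_complex(probe_f_real)


class Probe(nn.Module):  # hypothetical module, for illustration only
    def __init__(self, n):
        super().__init__()
        # The stored parameter stays real-valued (N x N x 2); this is what
        # DDP/NCCL and the optimizer ever see.
        self.probe_f = nn.Parameter(torch.randn(n, n, 2, dtype=torch.float64))
        # After this call, reading self.probe_f applies view_as_complex on the fly.
        parametrize.register_parametrization(self, "probe_f", AsComplex(), unsafe=True)


probe = Probe(8)
print(probe.probe_f.dtype)                                 # torch.complex128
print({n: p.dtype for n, p in probe.named_parameters()})   # only real dtypes
```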
@4jh I am still hitting the `RuntimeError: Input tensor data type is not supported for NCCL process group: ComplexDouble`. Even if I put an exit on the first line of `training_step` and remove all complex from the training script, it still gives that error. Also, printing the parameters before going into the training step, I can see no complex there.
It probably needs a change beyond the Lightning script, I think. I also don't quite get how `training_step` works: if I exit before going into it, I shouldn't see an NCCL error. Also, DDP transfers 1 MB by default, hard-coded in, but if there is no complex in the Lightning script, why is there an NCCL error?
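For the "printing the parameters" check mentioned above, a small helper along these lines can confirm whether anything complex actually reaches DDP (the function name is made up; DDP broadcasts buffers as well as parameters, so both are inspected):

```python
import torch


def report_complex_tensors(model: torch.nn.Module) -> None:
    # List every parameter and buffer with its dtype; anything complex here
    # would be handed to DDP/NCCL and trigger the ComplexDouble error.
    for name, tensor in list(model.named_parameters()) + list(model.named_buffers()):
        flag = "COMPLEX" if tensor.is_complex() else "ok"
        print(f"{flag}: {name} {tuple(tensor.shape)} {tensor.dtype}")
```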
@4jh I started a simple pure PyTorch example to debug the DDP errors. Here is my first try: https://code.ornl.gov/ai-ptychography/ptychopath/-/blob/23-basic-ddp-example/examples/train.py, https://code.ornl.gov/ai-ptychography/ptychopath/-/blob/23-basic-ddp-example/examples/runtrain.sh. It works on a single GPU (it just runs, I mean; no convergence test), but going to multi-GPU I am getting
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
Edited by Tsaris, Aristeidis (aris)
The error was obvious: I didn't properly set the device to the local rank. So now, if I use DDP, I am getting the complex error from NCCL. Interestingly, if I manually average the gradients (https://code.ornl.gov/ai-ptychography/ptychopath/-/blob/23-basic-ddp-example/examples/train.py#L104) I am not getting that error.
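In outline, the two points above look roughly like this (a sketch, not the actual examples/train.py; the `LOCAL_RANK` variable assumes a torchrun-style launcher, and the complex handling is an assumption):

```python
import os
import torch
import torch.distributed as dist


def setup_distributed() -> int:
    # Each process must be pinned to its own GPU *before* any NCCL collective;
    # otherwise all ranks pile onto device 0 and NCCL raises ncclInvalidUsage.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    return local_rank


def average_gradients(model: torch.nn.Module) -> None:
    # Manual replacement for DDP's bucketed reducer: sum every gradient across
    # ranks and divide by the world size. Complex gradients are reduced through
    # their real view, so NCCL only ever sees real-valued buffers.
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is None:
            continue
        grad = param.grad
        buf = torch.view_as_real(grad) if grad.is_complex() else grad
        dist.all_reduce(buf, op=dist.ReduceOp.SUM)
        buf /= world_size
```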
added 1 commit
- fc8dc138 - adding pure torch example to better debug the DDP errors
- tests/test_structure.py 0 → 100644
```python
devices = ["cpu"]
if torch.cuda.is_available():
    devices.append("cuda")


def test_mustem_xtl_read():
    xtlfile = os.path.join("/gpfs/alpine/csc455/proj-shared/data/simulations/neutral atoms", "PTO3.xtl")
    crystal = structure.Orthorhombic.from_mustem_xtl(xtlfile)

    assert crystal.atoms == ["Pb", "Ti", "O"]

    assert crystal.atom_positions[0].shape == (1, 3)


def test_ortho_tiling():
    xtlfile = os.path.join("/gpfs/alpine/csc455/proj-shared/data/simulations/neutral atoms", "PTO3.xtl")
```
If you need commits from another branch it is better to use `git cherry-pick`, which maintains the commit messages associated with the commits and ensures you also get the files included in the commits you cherry-pick, like `tests/PTO3.xtl`, which is present in !13. So you shouldn't need to hardcode a path like this, which will break CI.
changed this line in version 12 of the diff
Hi Jacob, I don't need it now, and I will delete it sometime today. I was thinking about using your class function to read the xtl file from Mark's dataset and generate a ground-truth scattering potential for comparison. However, it seems that you haven't implemented a function that can generate a potential from the crystal object. I also tried the pyms make-potential function; however, that doesn't support xtl files, it can only read xyz or p1 files. I don't need a make-potential function right now, but I will need it to generate the ground truth for the convergence plot in the future.
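The intended workflow, as a sketch under the assumption that a potential-from-crystal function gets added later (the import path is an assumption, and `make_potential` below is hypothetical and does not exist yet in ptychopath or, for xtl input, in pyms):

```python
from ptychopath import structure  # import path is an assumption

# Read Mark's simulated crystal from the mustem xtl file.
xtlfile = "/gpfs/alpine/csc455/proj-shared/data/simulations/neutral atoms/PTO3.xtl"
crystal = structure.Orthorhombic.from_mustem_xtl(xtlfile)

# Hypothetical future step: rasterize the crystal into a ground-truth
# scattering potential for the convergence plot, e.g.
# potential = make_potential(crystal, grid_shape=(256, 256))
```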