Suppose every device takes in a batch of tensors whose sizes differ across devices: will 3D parallelism still work?
As I’m learning more about 3D parallelism, I’ve been wondering: suppose every device takes in a batch of tensors whose sizes differ across devices, will 3D parallelism still work? It turns out data and pipeline parallelism handle this fine, but tensor parallelism needs extra work. Data parallelism only synchronizes gradients, which have the shape of the parameters regardless of batch size, and pipeline parallelism just forwards whatever activations each micro-batch produces to the next stage. Tensor parallelism, however, runs collectives (e.g., an all-reduce over activations) within each tensor-parallel group, and those collectives assume the activation shapes match across the group.
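To convince myself of the data-parallel case, here's a minimal single-process sketch in PyTorch. The two-replica setup, the layer sizes, and the manual averaging standing in for `dist.all_reduce` are all illustrative assumptions on my part, not code from any particular framework. The point it demonstrates: gradients are parameter-shaped, not batch-shaped, so the gradient reduction works even when each rank sees a different number of samples.

```python
import torch
import torch.nn as nn

# Simulate two data-parallel "ranks" in one process: identical model
# replicas, but differently sized micro-batches.
torch.manual_seed(0)

model_a = nn.Linear(8, 4)
model_b = nn.Linear(8, 4)
model_b.load_state_dict(model_a.state_dict())  # identical replicas

batch_a = torch.randn(3, 8)   # rank 0 gets 3 samples
batch_b = torch.randn(5, 8)   # rank 1 gets 5 samples, a different size

model_a(batch_a).sum().backward()
model_b(batch_b).sum().backward()

# Gradients have the *parameter* shape, not the batch shape, so averaging
# them (what an all-reduce does in real DP training) works regardless of
# the per-rank batch sizes.
for (name, p_a), (_, p_b) in zip(model_a.named_parameters(),
                                 model_b.named_parameters()):
    assert p_a.grad.shape == p_b.grad.shape
    avg_grad = (p_a.grad + p_b.grad) / 2  # stand-in for all_reduce(AVG)
    print(f"{name}: grad shape {tuple(p_a.grad.shape)} reduces cleanly")
```

Note that one practical caveat remains even for data parallelism: a plain gradient average weights each rank equally, so ranks with more samples contribute less per-sample weight unless you rescale the losses accordingly.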