world_size = model_parallel_size, ddp_size = 1 is broken
Created by: stephenroller
🐛 Bug
Error message you'll get:

```
RuntimeError: "adam_cuda_kernel" not implemented for 'BFloat16'
```
Expected Behavior
The model loads and trains without error.
Workaround:
Don't allow ddp_size = 1. This configuration most commonly arises in test runs with 8 GPUs and model parallel size 8; as an alternative, use model parallel size 4 so the data-parallel size is greater than 1.
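The workaround can be enforced with a small validation step. This is only a sketch: the function and parameter names (`check_parallel_config`, `world_size`, `model_parallel_size`) are hypothetical and not part of metaseq's actual config API; the intent is just to show the `ddp_size = world_size / model_parallel_size` relationship and reject the broken case.

```python
# Hypothetical guard -- names here are illustrative, not metaseq's real API.
def check_parallel_config(world_size: int, model_parallel_size: int) -> int:
    """Return the data-parallel (DDP) size, rejecting the broken ddp_size == 1 case."""
    if world_size % model_parallel_size != 0:
        raise ValueError("world_size must be divisible by model_parallel_size")
    ddp_size = world_size // model_parallel_size
    if ddp_size == 1:
        # This is the configuration that triggers the adam_cuda_kernel /
        # BFloat16 RuntimeError; e.g. drop model parallel from 8 to 4 on 8 GPUs.
        raise ValueError("ddp_size == 1 is broken; reduce model_parallel_size")
    return ddp_size
```

For example, `check_parallel_config(8, 4)` returns a ddp_size of 2 and passes, while `check_parallel_config(8, 8)` (the failing 8-GPU, model-parallel-8 test setup) raises.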