TRAINING LARGE LANGUAGE MODELS ON HPC GPU CLUSTERS: AN OVERVIEW
Keywords:
Large Language Models, GPU Cluster, Distributed Training Methods
Abstract
Training large language models (LLMs) on HPC GPU clusters presents significant challenges, including managing memory efficiently, exploiting parallelism, and minimizing communication overhead. This paper compares three methods, Megatron-LM, ZeRO, and PyTorch Fully Sharded Data Parallel (FSDP), and highlights how each addresses these challenges. The analysis identifies the advantages and limitations of each method and offers guidance for optimizing LLM training on large-scale HPC systems.
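To make the memory-management challenge concrete, the following minimal sketch estimates the per-GPU memory needed for model states under the accounting used in the ZeRO paper: mixed-precision Adam with 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer state. The function name and its parameters are illustrative assumptions, not part of any library, and activations, buffers, and fragmentation are ignored.

```python
def model_state_bytes_per_gpu(num_params: float, num_gpus: int, stage: int) -> float:
    """Approximate per-GPU bytes for model states with mixed-precision Adam.

    Assumed accounting (hypothetical helper, not a library API):
      2 bytes/param  fp16 weights
      2 bytes/param  fp16 gradients
      12 bytes/param fp32 optimizer state (master weights, momentum, variance)

    stage 0: plain data parallelism, everything replicated
    stage 1: optimizer state sharded across GPUs
    stage 2: optimizer state and gradients sharded
    stage 3: optimizer state, gradients, and weights sharded
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")

    weights, grads, optim = 2.0, 2.0, 12.0
    if stage >= 1:
        optim /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        weights /= num_gpus
    return num_params * (weights + grads + optim)


if __name__ == "__main__":
    # Example values: a 7.5B-parameter model on 64 GPUs.
    for stage in range(4):
        gb = model_state_bytes_per_gpu(7.5e9, 64, stage) / 1e9
        print(f"stage {stage}: {gb:.1f} GB per GPU")
```

Under these assumptions, a 7.5B-parameter model on 64 GPUs needs roughly 120 GB per GPU with plain data parallelism, falling to about 31.4 GB, 16.6 GB, and 1.9 GB as optimizer state, gradients, and weights are sharded in turn. FSDP's full-sharding mode corresponds approximately to the last case.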
References
Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., & Catanzaro, B. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, 2019. Retrieved from https://github.com/NVIDIA/Megatron-LM.
Rajbhandari, S., Rasley, J., Ruwase, O., & He, Y. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, 2020. Retrieved from https://arxiv.org/pdf/1910.02054.
PyTorch FSDP: Lin, H., Mukherjee, S., Lusted, L., Huang, M., & Deb, D. Fully Sharded Data Parallel in PyTorch, 2021. Retrieved from https://pytorch.org/docs/stable/fsdp.html.
Hugging Face (n.d.). Accelerate Large Model Training using PyTorch Fully Sharded Data Parallel. Retrieved from https://huggingface.co/blog/pytorch-fsdp.