TRAINING LARGE LANGUAGE MODELS ON HPC GPU CLUSTER: AN OVERVIEW

Oleksandr Korovii

Authors

Oleksandr Korovii, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, Ukraine

Keywords:

Large Language Models, GPU Cluster, Distributed Training Methods

Abstract

Training large language models (LLMs) on HPC clusters presents significant challenges: managing memory efficiently, choosing a parallelism strategy, and minimizing communication overhead. This paper compares three methods, Megatron-LM, ZeRO, and PyTorch Fully Sharded Data Parallel (FSDP), highlighting how each one addresses these challenges. The analysis outlines the advantages and limitations of each method and offers guidance for optimizing LLM training on large-scale HPC systems.
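To make the memory-management challenge concrete, the following minimal sketch estimates per-GPU model-state memory under the accounting used in the ZeRO paper for mixed-precision Adam training: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer states. The function name and its arguments are hypothetical, chosen only for illustration. Activations, temporary buffers, and fragmentation are deliberately ignored, so real usage will be higher.

```python
def per_gpu_model_state_gb(num_params, num_gpus, stage):
    """Rough per-GPU model-state memory in GB (10**9 bytes).

    Assumes mixed-precision Adam: 2 bytes/param (fp16 weights),
    2 bytes/param (fp16 gradients), 12 bytes/param (fp32 weights,
    momentum, variance). stage 0 means plain data parallelism;
    stages 1-3 shard optimizer states, then gradients, then
    parameters across num_gpus, as in the ZeRO paper.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")

    params = 2.0 * num_params
    grads = 2.0 * num_params
    optim = 12.0 * num_params

    if stage >= 1:
        optim /= num_gpus   # shard optimizer states
    if stage >= 2:
        grads /= num_gpus   # also shard gradients
    if stage >= 3:
        params /= num_gpus  # also shard parameters

    return (params + grads + optim) / 1e9


if __name__ == "__main__":
    # Illustrative configuration: 7.5B parameters on 64 GPUs.
    for stage in range(4):
        gb = per_gpu_model_state_gb(7.5e9, 64, stage)
        print(f"stage {stage}: {gb:.1f} GB per GPU")
```

Under these assumptions, a 7.5-billion-parameter model on 64 GPUs drops from about 120 GB of model state per GPU with plain data parallelism to under 2 GB with full sharding. The fully sharded case is also roughly the regime that FSDP's full-sharding mode targets, although the exact footprint depends on the framework and its configuration.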

References

Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., & Catanzaro, B. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, 2019. Retrieved from https://github.com/NVIDIA/Megatron-LM.

Rajbhandari, S., Rasley, J., Ruwase, O., & He, Y. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, 2020. Retrieved from https://arxiv.org/pdf/1910.02054.

PyTorch FSDP: Lin, H., Mukherjee, S., Lusted, L., Huang, M., & Deb, D. Fully Sharded Data Parallel in PyTorch, 2021. Retrieved from https://pytorch.org/docs/stable/fsdp.html.

Hugging Face (n.d.). Accelerate Large Model Training using PyTorch Fully Sharded Data Parallel. Retrieved from https://huggingface.co/blog/pytorch-fsdp.
