Detailed Course Outline
Introduction
- Meet the instructor.
- Create an account at courses.nvidia.com/join.
Introduction to Training of Large Models
- Learn about the motivation behind and key challenges of training large models.
- Get an overview of the basic techniques and tools needed for large-scale training.
- Get an introduction to distributed training and the Slurm job scheduler.
- Train a GPT model using data parallelism (a minimal sketch follows this list).
- Profile the training process and understand execution performance.
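The sketch below illustrates the kind of data-parallel training loop this module builds toward: one process per GPU, with PyTorch's DistributedDataParallel averaging gradients across ranks. The model, data, and hyperparameters are illustrative placeholders rather than the course's Megatron-LM setup; under Slurm, a script like this would typically be launched inside a batch job via srun or torchrun.

```python
# Minimal data-parallel training sketch (illustrative, not the course code).
# Launch example: torchrun --nproc_per_node=8 train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")            # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder "GPT-like" model: any torch.nn.Module is wrapped the same way.
    model = torch.nn.TransformerEncoder(
        torch.nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
        num_layers=6,
    ).cuda()
    model = DDP(model, device_ids=[local_rank])         # gradients are all-reduced automatically

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(100):                             # stand-in for a real data loader
        batch = torch.randn(8, 128, 512, device="cuda")
        loss = model(batch).pow(2).mean()               # dummy loss for illustration only
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if dist.get_rank() == 0 and step % 10 == 0:
            print(f"step {step} loss {loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```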
Model Parallelism: Advanced Topics
- Increase the model size using a range of memory-saving techniques.
- Get an introduction to tensor and pipeline parallelism (a tensor-parallel sketch follows this list).
- Go beyond natural language processing and get an introduction to DeepSpeed.
- Auto-tune model performance.
- Learn about mixture-of-experts models.
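As a taste of the ideas in this module, the sketch below shows column-style tensor parallelism in plain PyTorch: each rank holds one shard of a linear layer's weight matrix, computes a partial output, and the shards are all-gathered into the full result. It is a forward-pass illustration under assumed shapes only; the backward pass and fused kernels used by Megatron-LM or DeepSpeed are not shown.

```python
# Conceptual tensor (column) parallelism sketch in plain PyTorch.
# Launch example: torchrun --nproc_per_node=4 tensor_parallel_demo.py
import os
import torch
import torch.distributed as dist

class ColumnParallelLinear(torch.nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        world_size = dist.get_world_size()
        assert out_features % world_size == 0
        # Each rank stores only its column shard of the full weight matrix.
        self.local_out = out_features // world_size
        self.weight = torch.nn.Parameter(torch.randn(self.local_out, in_features) * 0.02)

    def forward(self, x):
        local_y = torch.nn.functional.linear(x, self.weight)    # partial output on this rank
        shards = [torch.empty_like(local_y) for _ in range(dist.get_world_size())]
        dist.all_gather(shards, local_y)                         # collect shards from all ranks
        return torch.cat(shards, dim=-1)                         # full output on every rank

def main():
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    layer = ColumnParallelLinear(1024, 4096).cuda()
    x = torch.randn(8, 1024, device="cuda")
    y = layer(x)                                                 # shape (8, 4096)
    if dist.get_rank() == 0:
        print(y.shape)
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```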
Inference of Large Models
- Understand the deployment challenges associated with large models.
- Explore techniques for model reduction.
- Learn how to use TensorRT-LLM.
- Learn how to use Triton Inference Server (a client-request sketch follows this list).
- Understand the process of deploying a GPT checkpoint to production.
- See an example of prompt engineering.
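To make the deployment bullets concrete, here is a minimal client-side sketch of sending a prompt to a model served by Triton Inference Server over HTTP. The model name ("ensemble") and tensor names ("text_input", "max_tokens", "text_output") are assumptions modeled on a typical TensorRT-LLM deployment; the actual names and shapes come from the deployed model's config.pbtxt.

```python
# Minimal Triton HTTP client sketch (names are assumptions, see lead-in above).
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Prompt and generation length as 1x1 tensors, matching a common ensemble layout.
prompt = np.array([["Explain pipeline parallelism in one sentence."]], dtype=object)
max_tokens = np.array([[64]], dtype=np.int32)

inputs = [
    httpclient.InferInput("text_input", [1, 1], "BYTES"),
    httpclient.InferInput("max_tokens", [1, 1], "INT32"),
]
inputs[0].set_data_from_numpy(prompt)
inputs[1].set_data_from_numpy(max_tokens)

result = client.infer(model_name="ensemble", inputs=inputs)
print(result.as_numpy("text_output"))
```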
Final Review
- Review key learnings and answer questions.
- Complete the assessment and earn a certificate.
- Complete the workshop survey.