Machine Learning Engineer (Training Optimization)
Full-time, Mid-Senior Level
Job Overview
About the Role/Specialty
As a Machine Learning Engineer, you’ll lead efforts to scale and optimize the training system for our large-scale multimodal and foundation models. You’ll design distributed training systems using Megatron-LM, NVIDIA NeMo, FSDP, and Triton, pushing the limits of performance across compute, memory, and communication layers. You’ll sit at the intersection of systems and AI research, directly shaping how we train the models that will power Canva’s next generation of products.
What you’ll do (responsibilities)
- You’ll design, implement, and optimize large-scale machine learning systems for training.
- You’ll improve all aspects of performance, including GPU utilization, communication overhead, and memory efficiency.
- You’ll partner with research and modeling teams to align systems with algorithmic needs.
- You’ll evaluate and apply best practices for distributed training using industry-leading frameworks.
- You’ll dive deep into low-level optimization, including custom CUDA or Triton kernels.
- You’ll debug, profile, and fine-tune training workflows to unlock new levels of scalability.