Senior Distributed Systems Engineer
200,000 – 400,000 USD per year
Job Overview
About the Institute of Foundation Models
The Institute of Foundation Models (IFM) designs and operates ultra-scale GPU supercomputing systems to train next-generation foundation models. We believe performance, fault tolerance, and scalability must be co-designed across model architecture, communication systems, runtime, and hardware topology.
This role sits at the core of that effort — driving communication performance, distributed reliability, and cross-layer optimization for large-scale training workloads.
The Mission
We are looking for a deeply technical engineer to co-design and optimize the communication stack for large-scale distributed training, including hybrid parallelism and Mixture-of-Experts (MoE) workloads.
This is not a network operations role. It is a systems-level engineering position focused on performance engineering, distributed debugging, and communication-runtime co-design. You will:
· Design and optimize expert-parallel and hybrid-parallel communication patterns
· Drive high-performance hierarchical collectives for MoE workloads (a dispatch sketch follows this list)
· Co-design runtime orchestration with communication topology awareness
· Reduce tail latency and improve determinism across thousands of GPUs
· Architect fault-tolerant distributed execution under real-world cluster failures
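As a concrete illustration of the MoE communication pattern referenced above, the sketch below shows the variable-size all-to-all used to dispatch routed tokens across an expert-parallel group. This is a minimal sketch in PyTorch, not IFM code; the function and tensor names are illustrative, and it assumes tokens arrive already sorted by destination rank.

```python
# Minimal sketch of expert-parallel token dispatch for MoE training:
# a count exchange followed by a variable-size all-to-all. Assumes
# dist.init_process_group("nccl") has been called and that `tokens`
# is already sorted by destination rank; all names are illustrative.
import torch
import torch.distributed as dist

def dispatch_tokens(tokens: torch.Tensor, send_counts: torch.Tensor) -> torch.Tensor:
    assert send_counts.numel() == dist.get_world_size()
    recv_counts = torch.empty_like(send_counts)
    # First, each rank learns how many tokens every peer will send it.
    dist.all_to_all_single(recv_counts, send_counts)
    out = tokens.new_empty((int(recv_counts.sum().item()), tokens.shape[1]))
    # Then the token payloads move in one variable-size all-to-all.
    dist.all_to_all_single(
        out, tokens,
        output_split_sizes=recv_counts.tolist(),
        input_split_sizes=send_counts.tolist(),
    )
    return out
```

Hierarchical variants split this exchange into intra-node and inter-node phases to exploit NVSwitch bandwidth before crossing the fabric.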
Core Technical Scope
· Communication-compute overlap and topology-aware collective optimization (see the overlap sketch after this list)
· Deep debugging of NCCL, RDMA, and custom communication layers
· Hybrid expert parallel strategies in modern large-scale MoE systems
· Elastic and resilient distributed job orchestration concepts
· Congestion analysis and routing optimization across InfiniBand/RoCE fabrics
· Microbenchmarking and performance modeling for communication-heavy workloads
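To make the overlap bullet concrete, here is a minimal sketch of communication-compute overlap in PyTorch: the collective is issued asynchronously and waited on only at the point of data dependence. It assumes an initialized NCCL process group; `grad_bucket` and `next_input` are hypothetical placeholders.

```python
# Minimal illustration of communication-compute overlap: the all-reduce
# is issued with async_op=True so independent computation proceeds while
# NCCL runs on its own stream. Assumes dist.init_process_group("nccl")
# has already been called; tensor names are placeholders.
import torch
import torch.distributed as dist

def overlapped_step(grad_bucket: torch.Tensor, next_input: torch.Tensor):
    # Kick off the collective without blocking the Python thread.
    work = dist.all_reduce(grad_bucket, async_op=True)

    # Independent compute overlaps with the in-flight all-reduce
    # because it has no data dependency on grad_bucket.
    activations = torch.relu(next_input @ next_input.T)

    # Block only when the reduced gradients are actually needed.
    work.wait()
    return activations, grad_bucket
```

Production stacks generalize this pattern into bucketed gradient reduction scheduled across dedicated CUDA streams.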
Expected Technical Depth
· Hybrid expert parallel communication for Mixture-of-Experts training
· Scaling behavior under network pressure (see the bandwidth sweep after this list)
· Distributed orchestration for elastic, large-scale training
· Fault detection and recovery in distributed GPU workloads
· Cross-layer bottlenecks: GPU ↔ NIC ↔ PCIe ↔ NVSwitch ↔ Fabric ↔ Scheduler
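Scaling behavior under network pressure is usually characterized with microbenchmarks before touching production jobs. The bandwidth sweep below is a minimal sketch of that workflow; the message sizes, iteration counts, and launch method (e.g., torchrun) are placeholders, not a prescribed harness.

```python
# Minimal all-reduce bandwidth sweep for probing scaling behavior as
# message size grows. Assumes an initialized NCCL process group and one
# GPU per rank; sizes and iteration counts are arbitrary placeholders.
import time
import torch
import torch.distributed as dist

def bench_all_reduce(num_elems: int, iters: int = 50, warmup: int = 5) -> float:
    buf = torch.randn(num_elems, device="cuda")
    for _ in range(warmup):
        dist.all_reduce(buf)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(buf)
    torch.cuda.synchronize()  # make sure all queued NCCL work is counted
    per_iter = (time.perf_counter() - t0) / iters
    return buf.element_size() * num_elems / per_iter  # bytes/s (algorithmic)

if __name__ == "__main__":
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    for n in (2**16, 2**20, 2**24, 2**28):  # 256 KiB .. 1 GiB of fp32
        bw = bench_all_reduce(n)
        if dist.get_rank() == 0:
            print(f"{n:>10} elems: {bw / 1e9:6.1f} GB/s")
```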
Required Background
· Experience optimizing distributed training at 1,000+ GPU scale (or equivalent depth)
· Hands-on expertise with RDMA, InfiniBand, RoCE, and GPUDirect RDMA
· Deep familiarity with NCCL and/or UCX internals
· Strong systems programming ability (C/C++, Rust, or Go)
· Strong familiarity with modern model training frameworks such as PyTorch
· Ability to troubleshoot and profile training performance issues related to communication bottlenecks
· Ability to translate research ideas into production-grade optimizations
· Experience debugging distributed hangs, desynchronization, and performance regressions (a minimal watchdog sketch follows this list)
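As one flavor of the hang-debugging work above: a common first step is arming a per-rank watchdog around suspect collectives so a stuck rank dumps its Python stacks instead of blocking silently. A minimal sketch, assuming a PyTorch NCCL setup; the timeout is an arbitrary placeholder, and real investigations go on to correlate stacks across ranks and inspect NCCL state.

```python
# Watchdog around a suspect collective: if it does not complete within
# timeout_s, faulthandler dumps every thread's traceback on this rank,
# helping localize which ranks are stuck and where. A sketch only; the
# 120 s default is an arbitrary placeholder.
import faulthandler
import torch
import torch.distributed as dist

def guarded_all_reduce(tensor: torch.Tensor, timeout_s: float = 120.0) -> None:
    faulthandler.dump_traceback_later(timeout_s, exit=False)
    try:
        dist.all_reduce(tensor)
        torch.cuda.synchronize()  # surface asynchronous NCCL errors here
    finally:
        faulthandler.cancel_dump_traceback_later()
```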
What We Mean by "Hardcore"
· You can explain why a communication pattern degrades at scale and how to fix it
· You have improved real cluster throughput via communication redesign
· You can trace a distributed hang across ranks and identify the root cause
· You are comfortable working at the boundary between hardware and runtime
Application Requirements
· Include a link to your GitHub (required)
· Provide links to relevant distributed systems, HPC, or large-scale training projects
· Include a list of publications and/or public technical reports (if applicable)
· Describe the hardest distributed debugging problem you solved
· Include measurable performance improvements you have delivered
Visa Sponsorship
This position is eligible for visa sponsorship.
Benefits Include
· Comprehensive medical, dental, and vision benefits
· Bonus
· 401(k) plan
· Generous paid time off, sick leave, and holidays
· Paid parental leave
· Employee Assistance Program
· Life insurance and disability coverage