
About the role
Cohere is hiring a Senior ML Systems Engineer for Frameworks and Tooling, remote-flexible (offices in Toronto, NY, SF, London, Paris).
You help build, maintain, and evolve the training framework that powers Cohere's frontier-scale language models. The role sits at the intersection of large-scale training, distributed systems, and HPC infrastructure. You design and maintain the core components that enable fast, reliable, scalable model training and build the tooling that connects research ideas to thousands of GPUs.
Work includes: distributed training abstractions (data/tensor/pipeline parallelism, FSDP/ZeRO, memory management, checkpointing), throughput and stability on multi-node clusters (GB200/300, AMD, H200/100), monitoring/logging/debugging tooling, performance bottleneck investigation across the ML systems stack.
Required: strong distributed training or HPC background, deep JAX internals or distributed training libraries or custom kernels, multi-node cluster orchestration (Slurm, Ray, Kubernetes), CUDA/NCCL debugging, containerized environments.
What stands out: end-to-end ownership over critical components of the training stack. Direct impact on frontier model quality.
This recap is dataskew's editorial summary, not the company's copy.