
About the role
Cohere is hiring a Senior Research Scientist for Model Evaluation, remote-flexible (offices in Toronto, NY, SF, London, Paris).
Evaluation is critical to making progress in scaling intelligence. As models become superhuman in many real-world use cases, the team develops new evaluation techniques that accurately reflect what models are already capable of and set the agenda for future model capabilities.
You create ambitious new evaluation benchmarks that push the limits of Cohere's models. You work cross-functionally to translate model feedback into trustworthy, repeatable evaluations. Research includes training LLM judges, refining LLM-based data synthesis pipelines, and improving evaluation efficiency. You build scalable, reusable tools for digging into model performance.
Required: rapid prototyping ability, dozens of hours spent reviewing complex data and LLM outputs, an obsession with rigorous measurement, strong software engineering skills.
What stands out: the team owns the definition of what frontier models should be capable of, not just what they can do today. The role is a direct path to influencing Cohere's model roadmap.
This recap is dataskew's editorial summary, not the company's copy.