- Professional
- Office in San Francisco
Responsibilities:
- Design and maintain evaluation frameworks that measure AI output quality across all experiences, developing metrics and benchmarks to assess model performance
- Systematically improve production prompts through iterative experimentation—diagnosing failure patterns, crafting targeted improvements, and validating against quality benchmarks
- Fine-tune models on targeted datasets to improve baseline performance (e.g., preventing poor layout choices, improving outline quality)
- Conduct rigorous experiments to understand model behavior, analyze results, and derive insights that inform prompt and model improvements
- Build tools and workflows to support rapid experimentation and quality analysis, enabling faster iteration on AI improvements
Qualifications:
- 3+ years of experience working with AI systems, with a demonstrated track record of shipping production-grade AI products
- Deep hands-on experience with prompt engineering, LLM experimentation, and systematic evaluation of AI outputs
- Strong experimental mindset, with the ability to design tests, analyze model performance, and iterate toward quality improvements
- Experience post-training LLMs (e.g., RL, SFT)
- Research-oriented approach to problem-solving; comfortable working in ambiguity and exploring novel solutions to AI quality challenges
- Exceptional attention to detail and quality obsession: cares deeply about output quality across all dimensions, including less visible aspects
- Bachelor's degree in Computer Science, ML, or a related field (or equivalent hands-on experience with AI research/experimentation)