ML Training Platform Intern (6 months) at AION

AION · Seattle, United States of America · Onsite

AION is building the next generation of AI cloud platforms, transforming the future of high-performance computing (HPC) through its decentralized AI cloud. Purpose-built for bare-metal performance, AION democratizes access to compute power for AI training, fine-tuning, inference, data labeling, and beyond.

By leveraging underutilized resources such as idle GPUs and data centers, AION provides a scalable, cost-effective, and sustainable solution tailored for developers, researchers, and enterprises.

Led by high-pedigree founders with previous exits, AION is well funded by major VCs and has strategic global partnerships. Headquartered in the US with a global presence, the company is building its initial core team in London, Seattle, and India.

Who You Are

You're an aspiring ML engineer passionate about distributed training and helping customers succeed with large-scale ML workloads. You love solving complex technical problems, learning from customer challenges, and building solutions that accelerate AI development. You're excited to learn cutting-edge training techniques while working directly with customers to implement distributed training architectures and advanced ML workflows.

Requirements

Key Responsibilities

  • Learn and implement distributed training architectures including data parallelism, model parallelism, and pipeline parallelism under mentorship.
  • Build reference implementations for training workflows including DDP setups, gradient synchronization, and multi-GPU configurations.
  • Develop training optimization tools including efficient data loading pipelines, memory optimization techniques, and performance monitoring.
  • Create customer documentation and tutorials covering distributed training best practices and implementation guides.
  • Assist with customer workshops and training sessions on distributed training methodologies and platform usage.
  • Build debugging and profiling tools for identifying bottlenecks in distributed training workloads.
  • Experiment with emerging techniques including reward model training, Direct Preference Optimization (DPO), and Constitutional AI workflows.
  • Contribute to training framework improvements based on customer feedback and platform optimization opportunities.

Skills & Experience

  • High-agency individual looking to own customer success and influence training platform architecture.
  • Working knowledge of deep learning fundamentals including neural networks, transformers, and basic training/inference concepts.
  • Working PyTorch experience with some knowledge of distributed training, DDP implementation, and multi-GPU optimization.
  • High-level understanding of distributed training techniques including data parallelism, model parallelism, and pipeline parallelism.
  • Basic working knowledge of at least one training infrastructure tool such as Megatron-LM, DeepSpeed, FairScale, or a similar framework.
  • Surface-level understanding of reasoning techniques including Chain-of-Thought prompting and advanced reasoning workflows.
  • Previous internships or projects in ML infrastructure, contributions using PyTorch or other ML frameworks, competitive programming achievements, research experience in ML systems, or familiarity with agent systems or reasoning techniques.
  • Strong coding and implementation skills in Python and C++ with demonstrated ability to write performant, production-quality code.
  • Experience reading and contributing to large codebases with proof of open-source contributions (GitHub profile required).
  • Proof of technical work through projects like Google Summer of Code, hackathon wins, competitive programming, or significant open-source contributions.

Benefits

  • Join the ground floor of a mission-driven AI startup revolutionizing compute infrastructure.
  • Learn from world-class engineers and gain hands-on experience with cutting-edge inference optimization techniques.
  • Work with a high-caliber, globally distributed team backed by major VCs.
  • Significant learning and growth opportunity in one of the fastest-moving areas of AI infrastructure.
  • Competitive internship compensation with potential for full-time conversion.
  • Fast-paced, flexible work environment with room for ownership and impact.

If you have any questions about the role, please reach out to the hiring manager on LinkedIn or X.
