Hybrid Research Intern at Xpeng Motors
Xpeng Motors · Santa Clara, CA, United States · Hybrid
- Office in Santa Clara, CA
- Investigate, reproduce, and diagnose failures in PyTorch-based distributed training pipelines.
- Develop tools and techniques for automatic failure detection using system traces, logs, and hardware-level metrics.
- Implement failure-aware monitoring for NCCL, CUDA runtime, and communication components.
- Analyze training disruptions/errors related to GPU/NVLink/network instability, OOMs, deadlocks, and degraded throughput.
- Integrate with system components such as etcd, CUPTI, or XLA profiling tools to extract telemetry and debug information.
- Collaborate with senior engineers to design robust, scalable diagnostic frameworks.
- Strong programming skills in Python and C/C++.
- Hands-on experience with PyTorch and distributed training (e.g., DDP, NCCL).
- Solid understanding of Operating Systems and Distributed Systems, especially process management, memory, and networking.
- Familiarity with debugging and profiling tools (e.g., gdb, perf, nvprof, nsys).
- Experience with failure diagnosis, logging systems, or automated root cause analysis.
- Understanding of NCCL internals, CUDA architecture, or GPU performance profiling.
- Experience working with etcd, CUPTI, or other telemetry tools in a production-grade system.
- Exposure to cloud-native systems or large-scale cluster management.
- Exposure to real-world infrastructure challenges in deep learning systems at scale.
- Mentorship from experienced engineers/researchers in system design and AI infra.
- Opportunity to contribute to internal tools or publications (if applicable).
- Hands-on experience with cutting-edge hardware and training platforms.
- The tools you build will be directly integrated into our production platform, helping our machine learning teams train models faster and more reliably.
- Potential to file patents and publish papers.
- A fun, supportive, and engaging environment.
- Infrastructure and computational resources to support your work.
- Opportunity to work on cutting-edge technologies with top talent in the field.
- Opportunity to make a significant impact on the transportation revolution by advancing autonomous driving.
- Competitive compensation package.
- Snacks, lunches, dinners, and fun activities.