Hybrid GPGPU Software Architect / Principal Engineer at XPENG
XPENG · Santa Clara, United States · Hybrid
- Senior
- Optional office in Santa Clara
- Develop and refine a comprehensive 3-year roadmap for a software stack compatible with CUDA, encompassing Runtime, Driver, Compiler, Profiler, Debugger, and AI acceleration libraries
- Define binding specifications that link our upcoming GPU ISA to CUDA APIs, ensuring forward compatibility with CUDA 12.x features
- Evaluate and integrate the latest technological advancements: CUDA Graphs, Transformer Engine, virtual memory management, CUDA dynamic parallelism, CUTLASS 3.x, TMA, Blackwell FP4, among others
- Create a modular, layered Runtime architecture: CUDA → HAL → Kernel → Hardware, applicable across emulators, FPGA prototypes, and actual silicon
- Define the task launch protocol, including Queue, Stream, Event, and Graph, as well as the memory model
- Design a dual-mode (JIT & offline) compiler supporting LTO, PGO, Auto-Tuning, and efficient PTX→ISA microcode caching
- Develop GPU virtualization schemes (MIG) that work across processes and containers
- Implement an end-to-end performance model: Python API → CUDA Runtime → Driver → ISA → Micro-architecture → Board-level interconnect
- Build an observability platform: Nsys-compatible traces, real-time Metric-QPS dashboards, and an AI Advisor for identifying bottlenecks automatically
- Manage internal AI benchmarks as the single source of truth; benchmarks include MLPerf Inference, Stable Diffusion XL, and 70B-parameter LLMs
- Co-design, with our hardware architecture team, an ISA compatible with CUDA Compute Capability 12.x
- Collaborate with AI framework teams (PyTorch, TensorFlow, JAX, ONNX Runtime) to build fully reusable kernel libraries
- Partner with Cloud and K8s teams to co-develop Device Plugins, GPU Operators, and RDMA Network Policies
Minimum Requirements:
- 10+ years in systems software, with at least 5 years designing CUDA Compute stacks
- Led end-to-end development of at least one generation of a GPU Runtime or AI acceleration library
- Comprehensive mastery of PTX/SASS, CUDA Driver API, and cuBLAS/cuDNN/cuFFT internals; experience with LLVM NVPTX backend
- Profound understanding of GPU micro-architecture, including SM architecture, Warp Scheduler, Shared-Memory conflicts, and Tensor Core pipelines
- Proficiency with PCIe/CXL/RDMA topologies, NUMA settings, and GPUDirect RDMA/Storage