[Opportunistic Hire] Staff Engineer, AI Infrastructure at Coupang Internal
Coupang Internal · Bengaluru, India · Hybrid
- Senior
- Office in Bengaluru
We exist to wow our customers. We know we’re doing the right thing when we hear our customers say, “How did we ever live without Coupang?” Born out of an obsession to make shopping, eating, and living easier than ever, we’re collectively disrupting the multi-billion-dollar e-commerce industry from the ground up. We are one of the fastest-growing e-commerce companies and have established an unparalleled reputation as a dominant and reliable force in South Korean commerce.
We are proud to have the best of both worlds — a startup culture with the resources of a large global public company. This fuels us to continue our growth and launch new services at the speed we have maintained since our inception. We are all entrepreneurial, surrounded by opportunities to drive new initiatives and innovations. At our core, we are bold and ambitious people who like to get our hands dirty and make a hands-on impact. At Coupang, you will see yourself, your colleagues, your team, and the company grow every day.
Our mission to build the future of commerce is real. We push the boundaries of what’s possible to solve problems and break traditional tradeoffs. Join Coupang now to create an epic experience in this always-on, high-tech, and hyper-connected world.
Role summary
You will own the day-to-day reliability of our multi-region NVIDIA DGX cloud. Your charter: keep every host, hypervisor and Kubernetes node battle-hardened so that large-language-model training runs for weeks without a hiccup and real-time inference always returns in milliseconds—whether the workload lives on-prem or bursts to one of several public-cloud providers.
What you’ll do
- Host & firmware hardening — flash, validate and auto-baseline BIOS, BMC, network-interface and GPU firmware for DGX H100/H200 nodes.
- Virtualisation & container runtime — run KVM or ESXi at scale, expose VMs to Kubernetes via KubeVirt/Kata Containers, and tune vGPU passthrough, SR-IOV and NUMA pinning for maximum GPU utilisation.
- Kubernetes SRE — upgrade clusters with zero guest interruption, manage etcd quorum, tune kube-scheduler for GPU topology-aware placement, and operate service meshes (Istio, including ambient mode, or Cilium) for gRPC-heavy AI micro-services.
- High-speed networks — design and troubleshoot 200/400 Gb InfiniBand or RoCE v2 fabrics; enforce network policies with Cilium eBPF and optimise RDMA flows for multi-tenant isolation.
- Data-resilience flows — implement Velero- or Restic-based backup, cross-AZ snapshot orchestration and quarterly disaster-recovery drills covering control-plane, metadata and model artefacts.
- Automation first — write Go or Python to drive Terraform, Ansible and Argo CD pipelines; integrate with internal provisioning tool “Void” for end-to-end, push-button node builds.
- Operational leadership — rotate on high-severity incident duty, publish RCA documents within 72 hours and mentor L5 engineers in Kubernetes, GPU and RDMA debugging.
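To make the "automation first" expectation concrete, here is a minimal sketch of a push-button node rebuild pipeline in Python. All step names and the stubbed actions are hypothetical illustrations; the real flow would drive Terraform, Ansible and the internal "Void" tool, whose APIs are not shown here.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RebuildStep:
    """One stage of a node rebuild (names here are illustrative only)."""
    name: str
    run: Callable[[], bool]  # returns True on success

def rebuild_node(steps: List[RebuildStep]) -> List[str]:
    """Run steps in order; abort at the first failure, return completed steps."""
    completed: List[str] = []
    for step in steps:
        if not step.run():
            raise RuntimeError(f"rebuild failed at step: {step.name}")
        completed.append(step.name)
    return completed

# The firmware → OS → driver → kubelet order from the posting, with
# placeholder lambdas standing in for real provisioning calls.
steps = [
    RebuildStep("flash-firmware", lambda: True),
    RebuildStep("install-os", lambda: True),
    RebuildStep("install-gpu-driver", lambda: True),
    RebuildStep("join-kubelet", lambda: True),
]
print(rebuild_node(steps))
```

Keeping each stage as an independent, idempotent step is what makes the sub-15-minute rebuild target in the success indicators below realistic to automate and to retry after partial failures.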
Success indicators (first 12 months)
- Any DGX host can be rebuilt—firmware → OS → driver → Kubelet—in under 15 minutes.
- Control-plane uptime stays ≥ 99.95 % across three regions.
- Average GPU queueing latency per pod drops by at least 20 % through topology-aware scheduling.
- All disaster-recovery objectives (RPO 15 min, RTO 1 hour) are validated in live exercises.
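For intuition on what the uptime target above actually permits, here is a back-of-the-envelope calculation, assuming a 30-day month (real SLO accounting would use the actual calendar period and the team's agreed error-budget policy):

```python
def downtime_budget_minutes(uptime_pct: float, period_days: int = 30) -> float:
    """Allowed downtime, in minutes, for a given uptime percentage."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - uptime_pct / 100)

# 99.95 % over a 30-day month leaves roughly 21.6 minutes of downtime.
budget = downtime_budget_minutes(99.95)
print(f"99.95% uptime over 30 days allows ~{budget:.1f} minutes of downtime")
```

That budget is tighter than the stated RTO of 1 hour, which is why the DR objectives are validated per incident class in live exercises rather than folded into the monthly uptime number.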
Minimum qualifications
- 8+ years of production Linux, networking and virtualisation.
- Active CKA and CKS (or equivalent open-source contributions proving the same depth).
- At least one year running NVIDIA DGX or comparable GPU clusters at ≥ 1 PFLOP scale.
- Deep KVM or ESXi expertise including vMotion/live-migration, SR-IOV NICs and vGPU scheduling.
- Hands-on InfiniBand/RDMA troubleshooting with tools such as perfquery, ibstat, nvidia-smi (NVLink and topology queries) and packet capture on RDMA traffic.
- Professional-level cloud networking or architect certification (AWS Advanced Networking Specialty, Azure Network Engineer Expert, Google PCNE, etc.).
- Proficient English plus the local language (for Seoul-based roles: fluent Korean mandatory, English optional).
Preferred extras
NVIDIA Certified Professional — Data Center (Professional Level), contributions to GPU Operator or KubeVirt, author of internal Kubernetes operator or CRI shim, speaker at CNCF or NVIDIA GTC events.
Apply Now