[Opportunistic Hire] Staff Engineer, AI Infrastructure at Coupang Internal
Coupang Internal · Bengaluru, India · Hybrid
- Senior
- Office in Bengaluru
We exist to wow our customers. We know we’re doing the right thing when we hear our customers say, “How did we ever live without Coupang?” Born out of an obsession to make shopping, eating, and living easier than ever, we’re collectively disrupting the multi-billion-dollar e-commerce industry from the ground up. We are one of the fastest-growing e-commerce companies and have established an unparalleled reputation as a dominant and reliable force in South Korean commerce.
We are proud to have the best of both worlds — a startup culture with the resources of a large global public company. This fuels us to continue our growth and launch new services at the speed we have maintained since our inception. We are all entrepreneurial, surrounded by opportunities to drive new initiatives and innovations. At our core, we are bold and ambitious people who like to get our hands dirty and make a hands-on impact. At Coupang, you will see yourself, your colleagues, your team, and the company grow every day.
Our mission to build the future of commerce is real. We push the boundaries of what’s possible to solve problems and break traditional tradeoffs. Join Coupang now to create an epic experience in this always-on, high-tech, and hyper-connected world.
Role summary
You will own the day-to-day reliability of our multi-region NVIDIA DGX cloud. Your charter: keep every host, hypervisor and Kubernetes node battle-hardened so that large-language-model training runs for weeks without a hiccup and real-time inference always returns in milliseconds—whether the workload lives on-prem or bursts to one of several public-cloud providers.
What you’ll do
- Host & firmware hardening — flash, validate and auto-baseline BIOS, BMC, network-interface and GPU firmware for DGX H100/H200 nodes.
- Virtualisation & container runtime — run KVM or ESXi at scale, expose VMs to Kubernetes via KubeVirt/Kata Containers, and tune vGPU passthrough, SR-IOV and NUMA pinning for maximum GPU utilisation.
- Kubernetes SRE — upgrade clusters with zero guest interruption, manage etcd quorum, tune kube-scheduler for GPU topology-aware placement, and operate service meshes (Istio, including ambient mode, or Cilium) for gRPC-heavy AI micro-services.
- High-speed networks — design and troubleshoot 200/400 Gb InfiniBand or RoCE v2 fabrics; enforce network policies with Cilium eBPF and optimise RDMA flows for multi-tenant isolation.
- Data-resilience flows — implement Velero- or Restic-based backup, cross-AZ snapshot orchestration and quarterly disaster-recovery drills covering control-plane, metadata and model artefacts.
- Automation first — write Go or Python to drive Terraform, Ansible and Argo CD pipelines; integrate with internal provisioning tool “Void” for end-to-end, push-button node builds.
- Operational leadership — rotate on high-severity incident duty, publish RCA documents within 72 hours and mentor L5 engineers in Kubernetes, GPU and RDMA debugging.
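To make the "automation first" expectation concrete, here is a minimal sketch of a push-button node rebuild pipeline in Python. All step names and the stubbed actions are hypothetical illustrations; the real flow would drive Terraform, Ansible and the internal "Void" tool, whose APIs are not shown here.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RebuildStep:
    """One stage of a node rebuild (names here are illustrative only)."""
    name: str
    run: Callable[[], bool]  # returns True on success

def rebuild_node(steps: List[RebuildStep]) -> List[str]:
    """Run steps in order; abort at the first failure, return completed steps."""
    completed: List[str] = []
    for step in steps:
        if not step.run():
            raise RuntimeError(f"rebuild failed at step: {step.name}")
        completed.append(step.name)
    return completed

# The firmware → OS → driver → kubelet order from the posting, with
# placeholder lambdas standing in for real provisioning calls.
steps = [
    RebuildStep("flash-firmware", lambda: True),
    RebuildStep("install-os", lambda: True),
    RebuildStep("install-gpu-driver", lambda: True),
    RebuildStep("join-kubelet", lambda: True),
]
print(rebuild_node(steps))
```

Keeping each stage as an independent, idempotent step is what makes the sub-15-minute rebuild target in the success indicators below realistic to automate and to retry after partial failures.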
Success indicators (first 12 months)
- Any DGX host can be rebuilt—firmware → OS → driver → Kubelet—in under 15 minutes.
- Control-plane uptime stays ≥ 99.95 % across three regions.
- Average GPU queueing latency per pod drops by at least 20 % through topology-aware scheduling.
- All disaster-recovery objectives (RPO 15 min, RTO 1 hour) are validated in live exercises.
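For intuition on what the uptime target above actually permits, here is a back-of-the-envelope calculation, assuming a 30-day month (real SLO accounting would use the actual calendar period and the team's agreed error-budget policy):

```python
def downtime_budget_minutes(uptime_pct: float, period_days: int = 30) -> float:
    """Allowed downtime, in minutes, for a given uptime percentage."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - uptime_pct / 100)

# 99.95 % over a 30-day month leaves roughly 21.6 minutes of downtime.
budget = downtime_budget_minutes(99.95)
print(f"99.95% uptime over 30 days allows ~{budget:.1f} minutes of downtime")
```

That budget is tighter than the stated RTO of 1 hour, which is why the DR objectives are validated per incident class in live exercises rather than folded into the monthly uptime number.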
Minimum qualifications
- 8+ years of production Linux, networking and virtualisation.
- Active CKA and CKS (or equivalent open-source contributions proving the same depth).
- At least one year running NVIDIA DGX or comparable GPU clusters at ≥ 1 PFLOP scale.
- Deep KVM or ESXi expertise including vMotion/live-migration, SR-IOV NICs and vGPU scheduling.
- Hands-on InfiniBand/RDMA troubleshooting with tools such as perfquery, ibstat, nvidia-smi (NVLink and topology queries) and packet capture on RDMA traffic.
- Professional-level cloud networking or architect certification (AWS Advanced Networking Specialty, Azure Network Engineer Expert, Google PCNE, etc.).
- Proficient English plus the local language (for Seoul-based roles: fluent Korean mandatory, English optional).
Preferred extras
NVIDIA Certified Professional — Data Center (Professional Level), contributions to GPU Operator or KubeVirt, author of internal Kubernetes operator or CRI shim, speaker at CNCF or NVIDIA GTC events.
Apply Now