Senior Engineer I at DigitalOcean

DigitalOcean · Hyderabad, India · Hybrid

Dive in and do the best work of your career at DigitalOcean. Journey alongside a strong community of top talent who are relentless in their drive to build the simplest scalable cloud. If you have a growth mindset, naturally like to think big and bold, and are energized by the fast-paced environment of a true industry disruptor, you’ll find your place here.  We value winning together—while learning, having fun, and making a profound difference for the dreamers and builders in the world. 

We are seeking a skilled DevOps and AI Cloud Infrastructure Engineer to provision, deploy, manage, and optimize our GPU-based compute environment, ensuring high availability, performance, and security for compute-intensive workloads. The ideal candidate will have expertise in Linux system administration, cloud platforms, containerization, GPU hardware management, and cluster computing, with a focus on supporting AI/ML and high-performance computing (HPC) workloads. In this role, you will also provide technical support to investigate and resolve customer-reported issues related to the GPU-based compute environment. You will work closely with architects, AI engineers, and software developers to ensure seamless deployment, scalability, and reliability of our cloud-based AI/ML pipelines and GPU-based compute environments.

What You’ll Be Doing:

  • Infrastructure Management: Provision, deploy, and maintain scalable, secure, and high-availability cloud infrastructure on platforms such as DigitalOcean Cloud to support AI workloads.
  • Documentation: Maintain clear documentation for infrastructure setups and processes.
  • System Management: Administer and maintain Linux-based servers and clusters optimized for GPU compute workloads, ensuring high availability and performance.
  • GPU Infrastructure: Configure, monitor, and troubleshoot GPU hardware (e.g., NVIDIA GPUs) and related software stacks (e.g., CUDA, cuDNN) for optimal performance in AI/ML and HPC applications.
  • Troubleshooting: Diagnose and resolve hardware and software issues related to GPU compute nodes and performance issues in GPU clusters.
  • High-Speed Interconnects: Implement and manage high-speed networking technologies like RDMA over Converged Ethernet (RoCE) to support low-latency, high-bandwidth communication for GPU workloads.
  • Automation: Develop and maintain Infrastructure as Code (IaC) using tools like Terraform and Ansible to automate provisioning and management of resources.
  • CI/CD Pipelines: Build and optimize continuous integration and deployment (CI/CD) pipelines for testing GPU-based servers and managing deployments using tools like GitHub Actions.
  • Containerization & Orchestration: Build and manage LXC-based containerized environments to support cloud infrastructure and provisioning toolchains.
  • Monitoring & Performance: Set up and maintain monitoring, logging, and alerting systems (e.g., Prometheus, VictoriaMetrics, Grafana) to track system performance, GPU utilization, resource bottlenecks, and uptime of GPU resources.
  • Security and Compliance: Implement network security measures, including firewalls, VLANs, VPNs, and intrusion detection systems, to protect the GPU compute environment and comply with standards like SOC 2 or ISO 27001.
  • Cluster Support: Collaborate with other engineers to ensure seamless integration of networking with cluster management tools like Slurm or PBS Pro.
  • Scalability: Optimize infrastructure for high-throughput AI workloads, including GPU and auto-scaling configurations.
  • Collaboration: Work closely with architects and software engineers to streamline model deployment, optimize resource utilization, and troubleshoot infrastructure issues.

What We’ll Expect From You:

  • Experience: 3+ years of experience in DevOps, Site Reliability Engineering (SRE), or cloud infrastructure management, with at least 1 year working on GPU-based compute environments in the cloud.
  • Linux Administration: Strong knowledge of Linux system administration for managing network services and tools in a GPU compute environment.
  • High-Speed Interconnects: Experience with high-performance networking technologies like RoCE or 100GbE Ethernet in compute-intensive environments.
  • GPU-Specific Networking: Proficiency with NVIDIA GPU networking technologies, such as Mellanox ConnectX adapters, and configuring Netplan to support their drivers and firmware.
  • Cloud Platforms: Hands-on experience with at least one major cloud provider (AWS, Azure, GCP).
  • Networking & Security: Knowledge of networking concepts (VPC, subnets) and security best practices (IAM, encryption, firewall configurations).
  • Container Technologies: Proficiency in LXC and Docker for container orchestration and management.
  • IaC Tools: Expertise in Infrastructure as Code tools such as Terraform and Ansible.
  • CI/CD Tools: Experience with CI/CD pipelines using Jenkins, GitHub Actions, or similar tools.
  • Scripting & Programming: Strong scripting skills in Python, Bash, or similar languages; familiarity with Go or other programming languages is a plus.
  • Monitoring Tools: Experience with monitoring and logging tools like Prometheus, VictoriaMetrics, and Grafana.
  • Problem-Solving: Strong analytical and troubleshooting skills to resolve complex infrastructure and performance issues.
  • Communication: Excellent collaboration and communication skills to work with cross-functional teams.

Preferred Qualifications:

  • Experience with GPU-based workloads and familiarity with AI/ML frameworks like TensorFlow or PyTorch.
  • Knowledge of configuring Netplan to work with cloud-specific networking features like VPCs or virtual network interfaces.

*This role is located in Hyderabad, India.

#LI-Hybrid
