Company Description:

We are a Digital Product Engineering company that is scaling in a big way! We build products, services, and experiences that inspire, excite, and delight. We work at scale across all devices and digital mediums, and our people exist everywhere in the world (18,000+ experts across 36 countries, to be exact). Our work culture is dynamic and non-hierarchical. We are looking for great new colleagues. That is where you come in!

Job Description:
  • Manage end-to-end data and infrastructure operations, from writing SQL queries to creating and optimizing CI/CD pipelines and handling VM and cloud-based deployments
  • Drive incident and request management through ServiceNow, ensuring SLA compliance, ownership, and proactive issue resolution
  • Implement and refine monitoring and observability frameworks using Datadog, Grafana, and Prometheus to maintain uptime, identify bottlenecks, and enhance system reliability
  • Collaborate across global teams (including Data Engineering, Product, and IT Infrastructure) to resolve production issues, improve deployment practices, and optimize system performance
  • Conduct root cause analyses and contribute to blameless post-incident reviews and preventive action plans
  • Collaborate with security and compliance teams to uphold operational standards and data protection practices
  • Contribute to automation and continuous improvement initiatives through scripting (Python, Shell) and infrastructure-as-code (Terraform, Ansible) principles
  • Support the data lifecycle, ensuring accuracy, integrity, and accessibility of data pipelines and dashboards across analytics platforms
  • Collaborate with Data Engineering teams to ensure data pipelines, ETL processes, and analytics platforms are performant, reliable, and production-ready
  • Collaborate on capacity planning, scaling, and performance optimization to ensure reliability during growth and high-load scenarios
  • Use operational metrics (MTTR, uptime, failure rate, latency) to drive service reliability improvements
  • Participate in Agile ceremonies within a Scrum/Kanban model, aligning with delivery squads to ensure cross-functional visibility and operational excellence

Experience:

  • 6+ years in DataOps, DevOps, infrastructure operations, site reliability engineering, or analytics platform support
  • Intermediate SQL for data extraction, transformation, and diagnostics
  • Strong understanding of CI/CD pipelines (Jenkins, Azure DevOps, Git-based version control)
  • Proficiency in monitoring and observability tools (Datadog, Grafana, Prometheus)
  • Hands-on with Python or Shell scripting for automation and diagnostics 
  • Familiarity with containerization (Docker, Kubernetes) and cloud platforms (AWS, Azure, GCP). Knowledge of AWS services is a must
  • Solid grasp of infrastructure-as-code concepts (Terraform, Ansible)
  • Proven record in incident management, maintaining SLAs/SLIs/SLOs for critical systems, and escalation handling in enterprise environments
  • Analytical Mindset: Ability to interpret system and data metrics, identify trends, and recommend performance improvements
  • Collaboration: Strong communication skills with cross-functional, global teams across technical and non-technical domains
  • Agility: Comfort working in dynamic, fast-paced environments, maintaining composure and prioritization under pressure

Qualifications:

Must-have skills: Docker (Strong), Kubernetes (Strong), DevOps - AWS (Strong), Terraform.

Good to have: ETL, Python, Shell scripting.
