Senior System Software Engineer, Cloud Services at NVIDIA
NVIDIA · Santa Clara, United States · Hybrid
- Senior
- Office in Santa Clara
Our team builds, operates, and maintains cloud-hosted services that provide user and service authentication/authorization across NVIDIA. Ensuring continuity of operations is critical to our mission.
We are looking for a highly proficient software engineer with extensive experience in AWS service development, deployment, and observability practices. In this role, you will be responsible for ensuring the reliability, performance, and scalability of our services, while providing the team with actionable insights for continuous improvement. You will build, implement, and coordinate observability infrastructure to proactively identify and resolve operational issues across our services.
What you’ll be doing:
- Architect, implement, and maintain observability systems at scale to enable monitoring, alerting, logging, and tracing for our cloud-based services. 
- Define and refine service-level indicators (SLIs), service-level objectives (SLOs), and error budgets in partnership with service owners and product teams. 
- Design, build, and maintain actionable dashboards that display key metrics, SLIs/SLOs, and system health for distributed services.
- Collaborate with software, platform, and networking teams to integrate observability at all stages of the application lifecycle, from development to incident response. 
- Drive automation efforts to reduce manual toil in monitoring, telemetry, and incident response workflows; build and maintain self-service observability tooling. 
- Address performance and reliability issues using root cause analysis, distributed tracing, and log correlation.
- Participate in PagerDuty rotations and contribute to post-incident reviews, detailing findings and driving solutions that improve long-term system resilience and visibility.
- Develop expertise in the functions and capabilities of our offerings, and assist in managing our support channels for other NVIDIA teams. 
What we need to see:
- Bachelor’s or master’s degree in computer science, engineering, or equivalent experience in the field. 
- 8+ years in large-scale systems engineering roles working on live services end-to-end, from service development and deployment through observability, including on-call duty.
- Hands-on experience with modern monitoring systems (Prometheus, Grafana, Loki, Tempo, Datadog, New Relic, OpenTelemetry, etc.) within a production environment. 
- Advanced coding skills in Python, Go, or similar languages for building automation and integrating observability solutions. Comfort with JavaScript frameworks such as React and Next.js. 
- Proficiency in cloud platforms (AWS, GCP, Azure) and containerized environments (Kubernetes, Docker); experience with configuration-as-code tools (Terraform, Helm, Ansible). 
- Strong communication and collaboration skills, with experience working in global, cross-disciplinary teams. 
- Detail-oriented, analytical problem-solving approach and high standards for operational excellence and customer happiness.
- Experience with incident management and postmortem processes.
Ways to stand out from the crowd:
- Familiarity with the Java Spring Boot framework and hands-on experience with Apache Cassandra and HashiCorp Vault would be very advantageous.
- Besides our core duties, our team also manages multiple custom front-end services based on React for admin functions. Having relevant coding experience and being open to supporting development would be a huge plus. 
You will also be eligible for equity and benefits.