Senior Site Reliability Engineer, Classmates at PeopleConnect, Inc.
PeopleConnect, Inc. · Bellevue, United States of America · Hybrid
- Professional
- Office in Bellevue
Do you aspire to take on a strategic, leadership-oriented role where you design and guide infrastructure at an architectural level? Are you passionate about identifying and solving complex operational challenges, improving system reliability, and driving modernization? Do you thrive on designing scalable, fault-tolerant systems and implementing automation that transforms on-prem applications into cloud-native solutions? If so, this role is the perfect next step in your journey!
As a Senior Site Reliability Engineer, you’ll take ownership of critical infrastructure and reliability initiatives that power our applications and services. You’ll design, automate, and optimize systems to improve performance, scalability, and operational efficiency, while driving adoption of reliability best practices across the engineering organization.
You’ll act as a technical leader and mentor, collaborating closely with developers, operations, and security teams to solve complex challenges and advance our cloud-first strategy. This role requires deep technical expertise, sound judgment, and the ability to translate reliability goals into measurable outcomes that benefit both our users and our business.
Location and Logistics
- Hybrid role requiring 3+ days per week in our Bellevue, WA office
- Local candidates will be interviewed in person in the Bellevue office
- We are unable to offer visa sponsorship, visa transfer, or corp-to-corp arrangements
Key Responsibilities:
Cloud Strategy and Architecture
- Own key cloud architecture initiatives, guiding design decisions for scalability, security, and cost efficiency
- Partner with architecture and engineering leadership to define modernization standards and patterns (containerization, microservices, serverless)
- Evaluate and introduce emerging cloud technologies to enhance performance, reliability, and developer autonomy
- Drive adoption of a cloud-first mindset and infrastructure best practices across teams
Infrastructure Automation & Design
- Lead design and implementation of infrastructure automation using IaC tools such as Terraform, Terragrunt, and Puppet
- Apply GitOps principles for configuration management and application delivery
- Build and maintain CI/CD pipelines that ensure reliable, repeatable deployments (GitLab preferred)
- Develop reusable, modular automation components and mentor others on automation standards
Reliability and Performance Engineering
- Define and own service-level indicators (SLIs), service-level objectives (SLOs), and error budgets for critical systems
- Drive continuous improvement of uptime, latency, and scalability through instrumentation and testing
- Implement and evolve observability stacks (monitoring, logging, tracing) using Datadog, Prometheus, Grafana, or similar tools
- Conduct capacity planning, load testing, and chaos engineering to proactively identify weaknesses
Incident Management & Resilience
- Lead incident response for critical production systems, ensuring rapid recovery and clear communication
- Facilitate blameless post-incident reviews and drive remediation of root causes
- Develop and maintain operational runbooks, escalation paths, and playbooks
- Advocate for a culture of transparency, accountability, and learning within incident management
Security & Compliance
- Partner with Security Engineering to implement secure-by-default infrastructure designs and monitor compliance with PCI, SOC 2, and other standards
- Proactively detect, investigate, and remediate security vulnerabilities and misconfigurations
- Integrate security scanning and validation into CI/CD pipelines
Disaster Recovery & Business Continuity
- Design and maintain disaster recovery (DR) and business continuity strategies
- Test and validate RPO/RTO targets regularly, ensuring operational readiness and audit compliance
Cost Management & FinOps
- Monitor and optimize cloud resource utilization through data-driven FinOps practices
- Collaborate with finance and engineering stakeholders to improve cost visibility and accountability
Mentorship, Collaboration & Knowledge Sharing
- Mentor peers and junior engineers through design reviews, code reviews, and paired work
- Lead by example in documentation, automation quality, and technical decision-making
- Partner with cross-functional teams to align reliability initiatives with product and business objectives
- Contribute to a culture of continuous learning and operational excellence
Qualifications:
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field, or equivalent experience.
- 5+ years of experience as a Site Reliability Engineer or in a similar role, working with highly available production environments.
- Proficiency in AWS and containerization technologies like Kubernetes and Docker.
- Strong experience with Infrastructure as Code (IaC) using Terraform, with automation scripting skills in Python, Bash/Shell, or Go.
- Deep knowledge of Linux/Unix systems and networking fundamentals (e.g., TCP/IP, DNS, HTTP, VPN).
- Experience with monitoring and observability tools (e.g., Datadog, Prometheus, Grafana) and incident management.
- Familiarity with CI/CD pipelines, preferably using tools like GitLab, and strong knowledge of DevOps practices.
- Excellent troubleshooting skills, with experience in performance optimization and root cause analysis.
- Strong communication and collaboration skills.
Bonus Skills:
- Tools: Rundeck, Vector, Loki, VictoriaMetrics
- Languages and frameworks: Java, Spring, Go
- Multi-cloud experience (Azure, GCP)
- Certifications: AWS Solutions Architect, Certified Kubernetes Administrator (CKA)
What Success Looks Like
- Core services consistently meet or exceed SLOs and stay within error budgets
- Infrastructure deployments are automated, reproducible, and observable
- Cost efficiency and system performance improve through data-driven insights
- Post-incident reviews lead to measurable reliability gains
- The team benefits from your mentorship, leadership, and technical influence
Classmates
Classmates is the premier online, social, and mobile destination for reconnecting with the people from your high school years. Classmates offers the largest digitized collection of high school yearbooks online, with over 450,000 available to view, tag, sign, and share, and has the most comprehensive directory of high schools and class lists from the 1940s to today.
Salary Range:
Min: $152,700
Mid: $170,800
Max: $190,600
The pay range reflects the salary amount the Company reasonably expects to pay for the position. It is not a guarantee of actual compensation or a specific payment amount to any candidate. The actual compensation will depend on numerous factors including, without limitation, a particular candidate’s experience and qualifications.
The Company's Applicant and Worker Privacy Notice can be found here.
PeopleConnect is an equal opportunity employer.
Local-area candidates are encouraged to apply. Please note that we are unable to offer visa sponsorship, visa transfer, or corp-to-corp arrangements.
Note for Principal Agencies - Principal agents should not forward resumes to PeopleConnect, as we will not be responsible for any fees arising from the use of resumes submitted from agencies without a prior written and signed agreement and authorized job order for this position in place.