- Professional
- Office in Overland Park
Competitive Compensation & Benefits Package * 401(k) with Profit Sharing * Flexible Time Off * Office Dog!!
ABOUT US
By combining our unparalleled domain expertise with leading-edge technology, Ad Astra is helping higher education in its mission to advance timely student completions. We are building a cloud-based software platform that will provide the foundation for our next generation of industry-leading solutions and analytics. Simply put, we're helping students graduate faster.
OUR CORE VALUES
- We recognize talent. We recognize and appreciate the unique God-given talents that our people bring to Ad Astra. Aligning these individual gifts with our work sets team members up to succeed.
- We’re unpretentious. There’s no room for ego. We admit our imperfections and have the humility to know what we don’t know.
- We’re passionate. We aren’t satisfied with the status quo. We’re on a mission together to protect the value of degree completion and to transform the higher education industry.
- We’re pioneering. We aren’t afraid of failing; in fact, we celebrate it. We love it when our people boldly experiment with innovative solutions.
- We love fun. The health of our relationships is strengthened by working with people who stretch our thinking—and by enjoying the lighter side of life together. We don’t take ourselves too seriously, but we do take fun seriously.
- We have grit. Beyond talent and intelligence, our people have stick-to-itiveness. We push through challenges to make goals a reality.
POSITION SUMMARY
The Site Reliability Engineer (SRE) will ensure the performance, reliability, and scalability of our systems as we continue to grow. This role bridges the gap between software development and operations, applying software engineering principles to automate, optimize, and enhance the reliability of our infrastructure and production systems. Your role includes identifying recurring failure patterns, implementing automated solutions, and continuously improving platform performance. Leveraging your intellectual curiosity and expertise in operations and development, you will also play a pivotal role in monitoring security and reliability threats while actively advocating for effective solutions.
CORE RESPONSIBILITIES
- Write automation and production code to improve system reliability and performance
- Design, build, and maintain highly available, scalable systems across cloud environments (e.g., AWS, Azure, or GCP)
- Maintain and extend logging, monitoring, and alerting systems to enhance observability and proactive incident response
- Bridge development and operations by automating workflows, deployments, and infrastructure provisioning
- Proactively monitor and respond to alerts and incidents, ensuring system uptime and performance
- Collaborate with engineering, product, and operations teams to plan capacity and enhance the overall reliability and efficiency of our products
- Support production systems, including participation in on-call rotations and performing limited after-hours maintenance
- Lead and contribute to post-incident reviews, driving root cause analysis and long-term solutions
- Document reliability patterns, runbooks, and learnings to build operational maturity
- Other duties as assigned
POSITION REQUIREMENTS
- Bachelor’s degree in Computer Science, Engineering, or related field preferred; equivalent experience in supporting distributed software systems accepted
- 2+ years of experience in Site Reliability Engineering or 4+ years of experience in Development or Systems Engineering roles
- Strong understanding of networking concepts including load balancing, DNS, IPsec, and VPNs
- Experience with source version control, CI/CD, and Infrastructure as Code tools (e.g., GitHub, Jenkins, Terraform, CloudFormation)
- Working knowledge of Linux operating systems
- Proficiency with relational or NoSQL database technologies (both preferred)
- Proficiency in at least one scripting or programming language (Node.js, Python, Go, Bash, PowerShell, etc.)
- Experience with containerization and orchestration (Docker, ECS, Kubernetes)
- Familiarity with observability tools (Graylog, New Relic, Prometheus, Grafana, ELK Stack, etc.)
- Strong collaboration, problem-solving, and communication skills
ESSENTIAL COMPETENCIES
- Problem Solving
- Collaborative Communication
- Adaptability & Flexibility
- Sense of Urgency with Quality
- Attention to Detail
- Creative Problem Solving
- Technical Aptitude
ADDITIONAL PREFERRED QUALIFICATIONS
- Expertise in Git, Docker, Terraform, Ansible, and AWS
- Experience with blue/green or canary deployment strategies and zero-downtime releases
- Understanding of security best practices in cloud-native environments
- Background in automating large-scale infrastructure management
- Experience working in an agile or SaaS-based environment
KEY MEASURES OF SUCCESS
- Meaningful contributions to high-value SRE team stories
- Timely response to infrastructure alerts to ensure system reliability
- Regular preventative maintenance
- Contribution to the overall success of the Cloud Ops team
- Incident Mean Time to Acknowledge (MTTA) under 15 minutes
- Drive availability improvements to exceed 99.95% uptime
Ad Astra is proud to be an equal opportunity employer. We are committed to fostering an inclusive workplace where all individuals are treated with respect and fairness—regardless of race, color, national origin, sex, gender identity or expression, sexual orientation, religion, age, political affiliation, disability, veteran status, or any other characteristic protected by law.
All applicants must be legally authorized to work in the United States. Please note that Ad Astra is unable to provide work visa sponsorship for this position.