Gauss Labs is seeking a highly skilled Site Reliability Engineer to join our team in Vancouver. As an SRE at Gauss Labs, you will play a critical role in ensuring our industrial AI platform's reliability, performance, and scalability. You will be responsible for building and maintaining a robust solution that supports our growing business at customer sites. This role requires a high level of technical expertise, a collaborative mindset, and a strong desire to continuously improve systems and processes.
Responsibilities
Monitoring and Alerting: Creating and maintaining robust monitoring systems to proactively identify and resolve issues before they impact customers. Implementing effective alerting mechanisms to ensure timely response to critical events.
Incident Response: Participating in on-call rotations and leading incident response efforts to minimize downtime and restore service quickly.
Automation: Developing and implementing automation tools and scripts to streamline operations, reduce manual effort, and improve efficiency.
Capacity Planning: Forecasting resource needs, optimizing resource utilization, and ensuring customers' infrastructure can handle increasing workloads.
Performance Optimization: Identifying and resolving performance bottlenecks, optimizing system performance, and improving response times.
Collaboration: Partnering with software engineers, data scientists, and other teams to ensure alignment and efficient operations.
Customer Focus: Working closely with the AI Program Manager and Technical Account Manager to understand customer issues, provide technical support, and improve customer satisfaction.
Continuous Improvement: Driving a culture of continuous improvement by identifying opportunities to enhance system reliability, performance, and efficiency.
Basic Qualifications
Bachelor's degree in computer science, engineering, or a related discipline
5+ years of industry experience as a Site Reliability Engineer
Experience with cloud platforms (AWS, GCP, Azure), containerization technologies (Docker, Kubernetes), and observability and alerting tools (Prometheus, Grafana, Elasticsearch, Jaeger)
Experience with scripting languages (Python, Bash)
Working knowledge of GitHub, GitHub Actions, and CI/CD concepts
Experience in ticket management, issue resolution, and troubleshooting
Strong problem-solving and troubleshooting skills
Excellent customer communication and interpersonal skills; fluency in verbal and written English
Preferred Qualifications
Knowledge of AI/ML infrastructure and workloads
Knowledge of big data technologies (Kafka, Flink)
Knowledge of database technologies (MongoDB, PostgreSQL)