Gridmatic Inc. is a high-growth startup with offices in the Bay Area and Houston that is accelerating the clean energy transition by applying our expertise in data, machine learning, and energy to power markets. We are the rare startup that has multiple years of profitability without raising venture capital. At Gridmatic, we foster a collaborative and inclusive culture where learning and growth are constant. We move quickly, solve problems with integrity, and balance environmental responsibility with data-driven excellence.
We are looking for a Machine Learning Infrastructure Engineer to accelerate the decarbonization of the electricity system by building and optimizing the backbone of our ML platform. The ideal candidate will have solid expertise in machine learning, distributed systems, and GPU-based training, and will design scalable, high-performance infrastructure for training, inference, and evaluation. They will push the boundaries of throughput and efficiency on large-scale time-series and weather datasets, while shaping the long-term vision of our ML platform and generalizing solutions for broader use. A successful candidate will thrive on continuous learning across engineering, ML systems, and energy markets, while contributing to a collaborative, mission-driven team. The ideal candidate must have strong deep learning fundamentals in addition to strong software engineering skills.
You will:
Own a significant piece of our ML platform while rapidly building and iterating scalable, robust distributed infrastructure for ML training, inference, and evaluation on large-scale time-series and weather datasets.
Optimize throughput and cost by supporting model training and deployment across multiple clusters and clouds.
Improve the efficiency of machine learning models and other workloads by optimizing latency, throughput, and memory consumption. This involves pushing the boundaries of current hardware capabilities through techniques like GPU performance engineering.
Help define the long-term vision for Gridmatic’s ML platform.
Play a key role in mentoring junior engineers and interns, contributing to a collaborative, innovative, and growth-oriented team culture.
You might be a good fit if you are:
A strong engineer with 3+ years of experience who is committed to technical excellence. You possess a deep understanding of the codebases you work in and write readable, scalable code.
Experienced in researching and implementing deep learning models.
Experienced in distributed training and inference of large models on GPU clusters, utilizing core libraries and frameworks such as PyTorch, PyTorch Lightning, and Ray.
Comfortable with large-scale data storage infrastructure and formats, e.g. Zarr, SQL, and feature stores.
A self-starter with a strong sense of independence and ownership, and the capability to engineer large, robust systems from the initial design and conceptualization to productionization.
A mission-driven individual who is enthusiastic about working toward a renewable grid and diving into the intersection of ML and energy. No prior energy experience required, but curiosity and a willingness to learn are must-haves!
Nice to haves:
End-to-end proficiency in building, maintaining, and debugging cluster infrastructure, utilizing Kubernetes and Terraform.
Expertise in identifying performance bottlenecks and designing and writing high-performance code for large-scale ML workloads.
Experience with at least one of: torch.profiler, TorchDynamo, TorchInductor, Triton, or other deep learning compiler stacks.
Knowledge of cluster communication backends such as NCCL or Gloo.
Experience working with any of the following: weather data, energy systems, time-series forecasting, electricity markets, or financial trading.
Join our team and make a difference! Click below or email us at [email protected].