Gridmatic Inc. is a high-growth startup with offices in the Bay Area and Houston. We are accelerating the clean energy transition by applying our expertise in data, machine learning, and energy to power markets. We are the rare startup that has multiple years of profitability without raising venture capital. At Gridmatic, we foster a collaborative and inclusive culture where learning and growth are constant. We move quickly, solve problems with integrity, and balance environmental responsibility with data-driven excellence.
We are looking for a Machine Learning Infrastructure Engineer to accelerate the decarbonization of the electricity system by building and optimizing the backbone of our ML platform. The ideal candidate will have solid expertise in machine learning, distributed systems, and GPU-based training, and will design scalable, high-performance infrastructure for training, inference, and evaluation. They will push the boundaries of throughput and efficiency on large-scale time-series and weather datasets, while shaping the long-term vision of our ML platform and generalizing solutions for broader use. A successful candidate will thrive on continuous learning across engineering, ML systems, and energy markets, while contributing to a collaborative, mission-driven team. The ideal candidate must have strong deep learning fundamentals in addition to strong software engineering skills.
You will:
Own a significant piece of our ML platform while rapidly building and iterating scalable, robust distributed infrastructure for ML training, inference, and evaluation on large-scale time-series and weather datasets.
Optimize throughput and cost by supporting model training and deployment across multiple clusters and clouds.
Improve the efficiency of machine learning models and other workloads by optimizing latency, throughput, and memory consumption. This involves pushing the boundaries of current hardware capabilities through techniques like GPU performance engineering.
Help define the long-term vision for Gridmatic’s ML platform.
Play a key role in mentoring junior engineers and interns, contributing to a collaborative, innovative, and growth-oriented team culture.
You might be a good fit if you are:
A strong engineer with 3+ years of experience who is committed to technical excellence. You possess a deep understanding of the codebases you work in and write readable, scalable code.
Experienced in researching and implementing deep learning models.
Experienced in distributed training and inference of large models on GPU clusters, utilizing core libraries and frameworks such as PyTorch, PyTorch Lightning, and Ray.
Comfortable with large-scale data storage infrastructure and formats, e.g., Zarr, SQL, and feature stores.
A self-starter with a strong sense of independence and ownership, and the capability to engineer large, robust systems from the initial design and conceptualization to productionization.
A mission-driven individual who is enthusiastic about working toward a renewable grid and diving into the intersection of ML and energy. No prior energy experience required, but curiosity and a willingness to learn are must-haves!
Nice to haves:
End-to-end proficiency in building, maintaining, and debugging cluster infrastructure, utilizing Kubernetes and Terraform.
Expertise in identifying performance bottlenecks and designing and writing high-performance code for large-scale ML workloads.
Experience with at least one of: torch.profiler, TorchDynamo, TorchInductor, Triton, or other deep learning compiler stacks.
Knowledge of cluster communication backends such as NCCL or Gloo.
Experience working with any of the following: weather data, energy systems, time-series forecasting, electricity markets, or financial trading.
Join our team and make a difference! Click below or email us at [email protected].