Site Reliability Engineer at Albatross
Albatross · Germany · Remote
Description
Location
Remote, right to work and travel in Europe.
Albatross
At Albatross, we’re building the second pillar of AI: a perception layer that understands how users actually experience content, in real time. Trained on live user interactions, Albatross learns and reasons on the fly. Our technology powers real-time, in-session discovery by adapting to evolving user interests. We have raised significant funding, and our platform already operates at scale, processing billions of events and serving hundreds of millions of predictions.
The Role
We’re looking for a Site Reliability Engineer to own the reliability and observability of our platform. This is a hands-on leadership role where you’ll design, build, and maintain our observability stack, lead incident response, oversee releases, and establish the processes and standards that allow the team to ship quickly and confidently. More specifically, you will:
- Observability & Monitoring:
Own and evolve our observability stack (Prometheus, Grafana, Loki, Jaeger), including dashboards, alerts, and SLOs.
Instrument services for meaningful metrics and tracing, reducing noise and improving signal.
- Reliability & Incident Response:
Lead incident response and establish blameless postmortems, runbooks, and automated remediation.
Define, track, and improve SLIs/SLOs to proactively reduce reliability risk.
- Release Management:
Own the release process end-to-end, improving deployment speed, safety, and recovery.
Implement progressive rollouts, feature flags, and rollback strategies.
- Platform & Tooling:
Embed observability into the development lifecycle in close collaboration with engineering.
Maintain and evolve our Kubernetes-based platform, adopting new tools when they add real value.
Requirements
- 5–7+ years in SRE, platform engineering, DevOps, or similar roles.
- Strong production experience with Kubernetes and modern observability stacks (Prometheus, Grafana, Loki, Jaeger/OpenTelemetry).
- Proven track record of leading incident response and building monitoring systems that teams actually use.
- Deep distributed systems knowledge and production debugging experience.
- Pragmatic approach to tooling and alerting that teams trust.
- Clear communicator across engineering, product, and leadership.
- STEM degree (Computer Science, Engineering, Mathematics, or similar).
- Plus: contributions to open-source observability projects and background in high-scale or high-availability environments.
Benefits
- Remote-first, async-friendly culture.
- Ownership and autonomy: you'll shape how we do reliability.
- A team that cares about building things right.