- Professional
- Office in Warwick
Key Responsibilities
- Observability & Experimentation: Enhancing Langfuse for LLM tracing, evaluation, and experimentation capabilities
- Developer Self-Service: Building and improving Backstage as an internal developer portal for platform discoverability
- LLM Operations: Deploying and maintaining LiteLLM proxy, Langflow runtime, and other core LLM services
- Monitoring & Logging: Implementing platform-wide monitoring (Prometheus/Grafana) and logging infrastructure (Loki)
- LLM Ops Security: Implementing guardrails (LlamaGuard, Azure Guardrails) and security controls
- GDPR & PII Management: Building automated PII detection, minimization strategies, and compliance tooling
- Incident Response: Establishing security incident response procedures for LLM operations
- Kubernetes Operations: Managing AKS clusters and implementing reliable deployment tooling via ArgoCD
- Infrastructure as Code: Productionizing infrastructure with Terraform, eliminating manual configuration
- Autoscaling & Performance: Implementing workload management and autoscaling for AI services
- Storage Solutions: Migrating from self-hosted MinIO to managed Azure Blob Storage
- RAG (Retrieval-Augmented Generation) applications like Ask IPASS and Ask UK Pay Centre
- Document processing applications (BrightCapture)
- Employee onboarding automation (Oscar)
- Internal AI assistant (Bright GPT)
Skills, Knowledge and Expertise
- Platform Engineering Fundamentals: 2-4 years of experience with cloud infrastructure, preferably Azure
- Kubernetes: Practical experience deploying and managing applications in Kubernetes (AKS experience is a plus)
- Infrastructure as Code: Hands-on experience with Terraform or similar IaC tools
- CI/CD: Experience with GitOps workflows and tools like ArgoCD, GitHub Actions, or similar
- System Programming: Proficiency in Python or Go for automation and tooling; shell scripting essential
- Linux & Containers: Solid understanding of containerization with Docker and container orchestration
- Exposure to LLM technologies or AI/ML infrastructure
- Experience with observability tools (Prometheus, Grafana, Loki)
- Knowledge of Helm and Helmfile for Kubernetes deployments
- Knowledge of Kustomize
- Understanding of security best practices and compliance requirements (GDPR)
- Backend-as-a-Service platforms (Supabase or similar)
- Developer portal platforms (Backstage or similar)
- Application programming experience with .NET and/or TypeScript
- Learning Mindset: You're excited to learn about LLM operations and emerging AI infrastructure patterns
- Systems Thinking: You understand how distributed systems work and can reason about failure modes
- Pragmatic Approach: You balance the pursuit of ideal solutions against shipping value quickly
- Collaboration: You work well with both technical and product stakeholders
- Documentation: You believe good documentation is as important as good code
- Ownership: You take responsibility for your work from development through to production
- Reports to: Head of AI
- Works closely with: Two senior/principal platform engineers
- Collaborates with: Application development teams, product managers, and security/compliance stakeholders
- Team size: Small, full-stack AI team covering development, DevOps, operations, and support
- You've contributed to multiple platform epics from our roadmap
- You understand the architecture of our AI platform and can navigate the codebase
- You've successfully deployed services to our Kubernetes clusters
- You're participating in the on-call rotation and can troubleshoot platform issues
- You're independently owning epics and driving them to completion
- You're contributing to architectural decisions and technical direction
- You've improved platform reliability, observability, or developer experience
- You're mentoring junior engineers or helping onboard new team members
Platform Services: LiteLLM, Langflow, Langfuse, Supabase, Open Web UI, Backstage
Observability: Prometheus, Grafana, Loki, Langfuse tracing
CI/CD: ArgoCD, GitHub Actions, Helmfile
Languages: Python, Go, Shell scripting
Security: Azure Guardrails, LlamaGuard, PII detection tooling
- Impact: Your work directly enables AI innovation across the entire organization
- Growth: Learn from experienced platform engineers in a supportive environment
- Cutting Edge: Work with the latest AI infrastructure and tooling
- Autonomy: Small team means you'll have significant ownership and influence
- Mission: Help accountants and finance professionals work more efficiently with AI
Benefits
- Competitive salary
- Performance-based bonus
- 25 days annual leave
- Health insurance
- Company pension
- Company events
- Free food on-site
- On-site parking
- Referral programme
- Sick pay
- Wellness programmes