
- Professional
- Optional office in Bengaluru
Job Description
Cloud-Native Data Engineering on AWS
- Strong, hands-on expertise in AWS-native data services: S3, Glue (Schema Registry, Data Catalog), Step Functions, Lambda, Lake Formation, Athena, MSK/Kinesis, EMR (Spark), and SageMaker (incl. Feature Store).
- Comfort designing and optimizing pipelines for both batch (Step Functions) and streaming (Kinesis/MSK) ingestion.
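For illustration, a minimal streaming-ingestion sketch in Python using boto3, assuming a hypothetical `orders-events` Kinesis stream (stream name, region, and event shape are placeholders):

```python
import json
import boto3

# Hypothetical region; adjust to your environment.
kinesis = boto3.client("kinesis", region_name="ap-south-1")

def publish_event(event: dict, stream_name: str = "orders-events") -> None:
    """Publish one JSON event to a Kinesis Data Stream.

    The partition key controls shard routing; using a stable business
    key (e.g. order_id) keeps events for one entity in order.
    """
    kinesis.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event["order_id"]),
    )

publish_event({"order_id": 42, "status": "CREATED"})
```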
Data Mesh & Distributed Architectures
- Deep understanding of data mesh principles, including domain-oriented data ownership, treating data as a product, and federated governance models.
- Experience enabling self-service platforms, decentralized ingestion, and transformation workflows.
Data Contracts & Schema Management
- Advanced knowledge of schema enforcement, evolution, and validation, preferably with AWS Glue Schema Registry and Avro/JSON Schema.
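As an illustration, a minimal validation sketch with `fastavro` against a hypothetical `Order` Avro schema; in practice the schema would be fetched from the Glue Schema Registry rather than hard-coded:

```python
from fastavro.validation import validate

# Hypothetical "Order" contract.
ORDER_SCHEMA = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "order_id", "type": "long"},
        {"name": "status", "type": "string"},
        # Adding a field with a default is a backward-compatible evolution.
        {"name": "channel", "type": "string", "default": "web"},
    ],
}

record = {"order_id": 42, "status": "CREATED", "channel": "web"}
# validate() raises ValidationError when the record violates the contract
# (or returns False if called with raise_errors=False).
assert validate(record, ORDER_SCHEMA)
```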
Data Transformation & Modelling
- Proficiency with the modern ELT/ETL stack: Spark (EMR), dbt, AWS Glue, and Python (pandas).
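A minimal pandas transform sketch, assuming illustrative S3 paths and an `orders` dataset (reading Parquet directly from S3 additionally requires `s3fs` or similar):

```python
import pandas as pd

# Hypothetical input: raw order events landed in S3 as Parquet.
raw = pd.read_parquet("s3://example-lake/raw/orders/")

# A typical light transform step: type coercion, dedup, derived column.
orders = (
    raw.assign(created_at=pd.to_datetime(raw["created_at"], utc=True))
       .drop_duplicates(subset=["order_id"])
       .assign(order_date=lambda df: df["created_at"].dt.date)
)

# Write back to the curated zone, partitioned for Athena-friendly scans.
orders.to_parquet("s3://example-lake/curated/orders/", partition_cols=["order_date"])
```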
AI/ML Data Enablement
- Designing and supporting vector stores (OpenSearch) and feature stores (SageMaker Feature Store), and integrating them with MLOps and data pipelines for AI, semantic search, and RAG-style workloads.
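For example, a sketch of a k-NN vector index in OpenSearch using `opensearch-py`, with a hypothetical endpoint, index name, and embedding dimension; authentication/SigV4 setup is omitted:

```python
from opensearchpy import OpenSearch

# Hypothetical managed-OpenSearch endpoint.
client = OpenSearch(
    hosts=[{"host": "search-example.ap-south-1.es.amazonaws.com", "port": 443}],
    use_ssl=True,
)

index_body = {
    "settings": {"index": {"knn": True}},  # enable k-NN for this index
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            # 768 matches the output dimension of many sentence-embedding models.
            "embedding": {"type": "knn_vector", "dimension": 768},
        }
    },
}
client.indices.create(index="docs", body=index_body)

# At query time, a k-NN search retrieves nearest neighbours of a query
# vector, e.g. as the retrieval step of a RAG pipeline.
query = {"size": 5, "query": {"knn": {"embedding": {"vector": [0.1] * 768, "k": 5}}}}
hits = client.search(index="docs", body=query)
```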
Metadata, Catalog, and Lineage
- Familiarity with central cataloging, lineage, and data discovery solutions (Glue Data Catalog, Collibra, Atlan, Amundsen, etc.).
- Implementing end-to-end lineage, auditability, and governance processes.
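As a sketch, discovering tables registered in the Glue Data Catalog with boto3, assuming a hypothetical `orders_domain` database; catalog entries like these are what lineage and discovery tools typically build on:

```python
import boto3

glue = boto3.client("glue", region_name="ap-south-1")  # region is illustrative

# Walk the catalog for one domain database and print basic discovery metadata.
paginator = glue.get_paginator("get_tables")
for page in paginator.paginate(DatabaseName="orders_domain"):
    for table in page["TableList"]:
        location = table.get("StorageDescriptor", {}).get("Location", "n/a")
        print(table["Name"], location)
```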
Security, Compliance, and Data Governance
- Design and implementation of data security: row/column-level security (Lake Formation), KMS encryption, and role-based access using AuthN/AuthZ standards (JWT/OIDC), with policies aligned to GDPR, SOC 2, and ISO 27001.
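A minimal column-level-security sketch with boto3 and Lake Formation, granting a hypothetical analyst role SELECT on only the non-sensitive columns of a table (account ID, role, database, and table names are placeholders):

```python
import boto3

lf = boto3.client("lakeformation", region_name="ap-south-1")  # illustrative region

# Grant SELECT on an explicit column list; PII columns are simply not listed,
# so the analyst role cannot read them.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analyst"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "orders_domain",
            "Name": "orders",
            "ColumnNames": ["order_id", "status", "order_date"],
        }
    },
    Permissions=["SELECT"],
)
```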
Orchestration & Observability
- Experience with pipeline orchestration (AWS Step Functions, Apache Airflow/MWAA) and monitoring (CloudWatch, X-Ray) in large-scale environments.
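For illustration, a minimal Airflow DAG sketch of the kind MWAA would run (uses the Airflow 2.4+ `schedule` argument); task bodies are stubs:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(): ...
def transform(): ...
def load(): ...

# A minimal daily ELT pipeline with linear task dependencies.
with DAG(
    dag_id="orders_daily",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_load
```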
APIs & Integration
- API design for both batch and real-time data delivery (REST and GraphQL endpoints for AI, reporting, and BI consumption).
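A minimal REST sketch with FastAPI, assuming a hypothetical `Order` model and an illustrative in-memory store standing in for the real serving layer (e.g. Athena, Redshift, or a serving database):

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="orders-data-api")

class Order(BaseModel):
    order_id: int
    status: str

# Illustrative stand-in for a real query layer.
_ORDERS = {42: Order(order_id=42, status="CREATED")}

@app.get("/orders/{order_id}", response_model=Order)
def get_order(order_id: int) -> Order:
    """Serve one order record to downstream AI/BI consumers."""
    order = _ORDERS.get(order_id)
    if order is None:
        raise HTTPException(status_code=404, detail="order not found")
    return order
```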
Job Responsibilities
- Design, build, and maintain ETL/ELT pipelines to extract, transform, and load data from various sources into cloud-based data platforms.
- Develop and manage data architectures, data lakes, and data warehouses on AWS (e.g., S3, Redshift, Glue, Athena).
- Collaborate with data scientists, analysts, and business stakeholders to ensure data accessibility, quality, and security.
- Optimize performance of large-scale data systems and implement monitoring, logging, and alerting for pipelines.
- Work with both structured and unstructured data, ensuring reliability and scalability.
- Implement data governance, security, and compliance standards.
- Continuously improve data workflows by leveraging automation, CI/CD, and Infrastructure-as-Code (IaC).
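As an IaC illustration, a minimal AWS CDK (v2, Python) sketch declaring an encrypted, versioned raw-zone bucket; stack and construct names are placeholders:

```python
import aws_cdk as cdk
from aws_cdk import aws_s3 as s3
from constructs import Construct

class DataLakeStack(cdk.Stack):
    """Declares a versioned, encrypted raw-zone bucket as code."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        s3.Bucket(
            self,
            "RawZone",
            versioned=True,
            encryption=s3.BucketEncryption.S3_MANAGED,
            block_public_access=s3.BlockPublicAccess.BLOCK_ALL,
        )

app = cdk.App()
DataLakeStack(app, "data-lake-dev")  # stack name is illustrative
app.synth()
```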