
SÉBASTIEN MOREAU · DATA ENGINEER

Pipelines · Lakehouse · CDC · Data Quality · Lineage

Email


About

Data Engineer focused on production-grade pipelines — from raw ingestion to business-ready outputs.

I build systems where data quality, traceability, and compliance are first-class concerns — not afterthoughts.
My work spans batch and event-driven architectures, CDC-based ingestion, lakehouse modeling, entity resolution, and end-to-end lineage.
I translate audit and compliance requirements into reliable, shippable workflows — with full ownership across the stack.

Master's equivalent in Data Engineering — OpenClassrooms (2026).
Open to relocation — Belgium, Netherlands, Sweden, Denmark.


Stack

— Core —

Python SQL

— Ingestion & Streaming —

Kafka Debezium Airbyte PySpark

— Storage & Lakehouse —

Delta Lake Databricks dbt PostgreSQL MongoDB DuckDB AWS S3

— Orchestration & Infra —

Databricks Workflows GitHub Actions Kestra Airflow Docker

— Quality & Observability —

Great Expectations Prometheus


Projects

Python Databricks Delta Lake dbt Databricks Workflows GitHub Actions Unity Catalog

Production-grade monthly pipeline on 42M+ French business establishments (INSEE SIRENE). Configurable geographic scope. Medallion architecture (Bronze → Silver → Gold): Bronze as a transit-only dead letter queue, Silver as a SCD Type 2 historical source of truth, Gold as dbt-powered business aggregations. Single atomic Delta MERGE for Silver writes — no rollback needed. Circuit breaker pattern for transformation failure isolation, reset via GitHub Actions manual trigger without direct Databricks access. Gold failures trigger automatic rollback via Delta time travel. Full ADR documentation (DECISIONS.md). Deployable on any Databricks workspace in minutes.
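The circuit-breaker behaviour described above can be sketched as follows — a minimal, self-contained illustration, not the pipeline's actual code; the class name, threshold, and reset hook are assumptions for the example:

```python
class CircuitBreaker:
    """Minimal circuit breaker: trips after `threshold` consecutive
    failures and blocks further runs until explicitly reset."""

    def __init__(self, threshold: int = 1):
        self.threshold = threshold
        self.failures = 0
        self.open = False  # open = tripped, runs are blocked

    def call(self, fn, *args, **kwargs):
        if self.open:
            raise RuntimeError("circuit open: manual reset required")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.open = True
            raise
        self.failures = 0  # success closes the failure streak
        return result

    def reset(self):
        # In the pipeline this is where a GitHub Actions manual
        # trigger would land — no direct Databricks access needed.
        self.failures = 0
        self.open = False
```

Once tripped, every run is rejected until an operator resets the breaker, which is what isolates a failing transformation from repeated retries.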

Key decisions: significance filter on tracked columns — source system updates its treatment timestamp for both business and purely technical changes, filtering prevents spurious SCD2 period creation · single atomic MERGE eliminates partial write risk without rollback logic · Bronze as implicit DLQ — batch retained until Silver transformation confirmed successful · two source files at initialization (stock + historical) to reconstruct full SCD2 history from day one · circuit breaker reset decoupled from Databricks access — operator only needs Git.
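The significance filter boils down to comparing only tracked business columns — a sketch with hypothetical column names, not the SIRENE schema:

```python
# Hypothetical tracked business columns; technical columns such as a
# treatment timestamp are deliberately excluded from the comparison.
TRACKED_COLUMNS = {"legal_name", "address", "activity_code"}

def is_significant_change(current: dict, incoming: dict) -> bool:
    """Return True only if a *tracked* business column changed.

    The source updates its treatment timestamp for purely technical
    changes too; comparing tracked columns only prevents spurious
    SCD2 period creation."""
    return any(current.get(c) != incoming.get(c) for c in TRACKED_COLUMNS)
```

A record whose only difference is a technical timestamp is skipped, so no new SCD2 period is opened for it.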


Python Kafka/Redpanda Debezium Delta Lake Great Expectations Prometheus OSMNX

Production-grade event-driven pipeline: CDC from PostgreSQL via Debezium, Python micro-batch consumer (100ms poll), geospatial distance computation via OSMNX/BAN API. Delta Lake lakehouse (Bronze → Silver → Gold) with SCD Type 2 historization for full auditability. Data quality gates at every layer transition. Dead letter table for failed batches. Metrics exposed via Prometheus.
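The micro-batch consumer loop can be sketched as below — a skeleton assuming hypothetical `poll`/`process`/`commit` callables; real code would wrap a Kafka consumer and a Delta write:

```python
def run_micro_batches(poll, process, commit, max_batches=None):
    """Micro-batch consumer skeleton.

    Offsets are committed only after the batch is durably processed,
    which is what yields at-least-once delivery: a crash between
    process() and commit() replays the batch on restart."""
    n = 0
    while max_batches is None or n < max_batches:
        batch = poll(timeout_ms=100)  # ~100 ms poll interval
        if batch:
            process(batch)  # e.g. append to Bronze
            commit()        # offsets advance only after success
        n += 1
```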

Key decisions: CDC over polling — fact table capture is natural fit, state table monitored on 2 columns only · quarantine logic for implausible records · PII excluded from the lakehouse · SCD Type 2 scoped to auditability, not retroactive rule changes · at-least-once delivery with explicit crash recovery per stage.
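The quarantine gate amounts to partitioning a batch rather than dropping bad rows — a sketch with a hypothetical plausibility rule on the computed distance:

```python
def partition_records(records, max_distance_km=50.0):
    """Split records into (valid, quarantined).

    Hypothetical rule: a missing, negative, or implausibly large
    computed distance sends the record to quarantine instead of
    silently dropping it, so it stays auditable."""
    valid, quarantined = [], []
    for r in records:
        d = r.get("distance_km")
        if d is None or d < 0 or d > max_distance_km:
            quarantined.append(r)
        else:
            valid.append(r)
    return valid, quarantined
```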


Python MongoDB boto3 AWS S3

Containerized MongoDB replica set with keyFile intra-cluster authentication. Automated S3 ingestion with JSON parsing and normalization. Post-migration integrity validation: schema checks, type enforcement, duplicate detection, replication consistency across all nodes. Fully automated bootstrap — zero manual intervention.

Key decisions: 3 nodes across independent datacenters — simultaneous failure of 2 nodes in separate datacenters is statistically negligible, and losing all 3 is not an operational scenario · odd node count reaches quorum without an arbiter · idempotent bootstrap skips init on subsequent restarts.
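The quorum arithmetic behind the odd-node-count decision is simple majority voting:

```python
def majority(nodes: int) -> int:
    """Votes needed to elect a replica-set primary (strict majority)."""
    return nodes // 2 + 1

# With 3 data-bearing nodes, quorum is 2: the set survives the loss
# of any single node (or datacenter) without needing an arbiter.
# A 4th node would not help — majority(4) is 3, so the set still
# tolerates only one failure.
```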


Python Kestra DuckDB

Kestra-orchestrated monthly pipeline: ingestion, SQL transformations, deduplication, per-step data quality assertions, parallel report generation (revenue analytics + z-score anomaly detection). CSV → Parquet upfront for columnar query performance with DuckDB. Custom Docker image to resolve C-level dependency constraints in Kestra's Python task type.

Key decisions: per-step integrity checks rather than end-to-end only · partial failure isolation between reporting branches · Parquet chosen defensively given unknown volume baseline.
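The z-score anomaly branch can be illustrated with a minimal stdlib-only sketch — the threshold of 3 is a conventional default, not necessarily the project's setting:

```python
from statistics import mean, pstdev

def zscore_anomalies(values, threshold=3.0):
    """Return indices of values whose z-score exceeds `threshold`.

    Population statistics are used here for simplicity; a guard
    avoids division by zero on constant series."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]
```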


Python LangChain FAISS Mistral AI FastAPI

Full production architecture study (POC → MVP): cloud design, cost modeling, and systematic component trade-off analysis across scalability, cost efficiency, fault tolerance, access control, and observability constraints. Two-service RAG system as POC implementation.

Key decisions: two-stage LLM pipeline designed from the start to separate filter extraction from generation · single Mistral ecosystem to guarantee embedding space consistency · FAISS FlatL2 for exact search at POC scale, with explicit migration path to Pinecone documented for production.
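What FAISS `IndexFlatL2` does — exhaustive exact search by squared L2 distance — can be shown in a few lines of plain Python, which is why it is a reasonable choice at POC scale:

```python
def flat_l2_search(query, vectors, k=1):
    """Exact nearest-neighbour search: compute the squared L2
    distance to every stored vector, return the k closest indices.

    Linear in the number of vectors — fine for a POC corpus, hence
    the documented migration path for production scale."""
    dists = [
        (sum((q - v) ** 2 for q, v in zip(query, vec)), i)
        for i, vec in enumerate(vectors)
    ]
    return [i for _, i in sorted(dists)[:k]]
```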


▸ MongoDB Schema Migration — Structured → Document + Sharding (coming soon)

Python MongoDB

Relational-to-document schema migration with a configurable transformation engine — schema mapping defined as data config rather than hardcoded logic. Sharding strategy designed around the business access pattern.
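A mapping-as-config transformation engine can be sketched as below — the mapping keys and column names are hypothetical, but the principle (nested target paths driven entirely by data, no hardcoded field logic) is the one described above:

```python
# Hypothetical mapping config: nested document path -> flat source column.
MAPPING = {
    "customer.name": "customer_name",
    "customer.city": "city",
    "order_id": "id",
}

def transform(row: dict, mapping: dict = MAPPING) -> dict:
    """Build a nested document from a flat relational row, driven
    entirely by the mapping config — changing the target schema means
    editing data, not code."""
    doc = {}
    for target, source in mapping.items():
        parts = target.split(".")
        node = doc
        for p in parts[:-1]:
            node = node.setdefault(p, {})
        node[parts[-1]] = row[source]
    return doc
```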


Volunteer

| Project | Org | Role |
|---|---|---|
| Trawl Watch | BLOOM Association | Transformation architecture · Data modeling · Business logic implementation |
| Dans Mon Eau | Public Health Data Platform | EDA on opaque merged datasets · Normalization · Cross-functional team (40+ volunteers) |

Dans Mon Eau is in production and publicly accessible. The Trawl Watch pre-production repository is publicly available on GitHub.

Pinned

  1. dataforgoodfr/12_bloom — Python
  2. dataforgoodfr/13_pollution_eau — Jupyter Notebook