
SÉBASTIEN MOREAU · DATA ENGINEER

Pipelines · Lakehouse · CDC · Data Quality · Lineage

Email


About

Data Engineer focused on production-grade pipelines — from raw ingestion to business-ready outputs.

I build systems where data quality, traceability, and compliance are first-class concerns — not afterthoughts.
My work spans batch and event-driven architectures, CDC-based ingestion, lakehouse modeling, entity resolution, and end-to-end lineage.
I translate audit and compliance requirements into reliable, shippable workflows — with full ownership across the stack.

Master's equivalent in Data Engineering — OpenClassrooms (2026).
Open to relocation — Belgium, Netherlands, Sweden, Denmark.


Stack

— Core —

Python SQL

— Ingestion & Streaming —

Kafka Debezium Airbyte PySpark

— Storage & Lakehouse —

Delta Lake Databricks dbt PostgreSQL MongoDB DuckDB AWS S3

— Orchestration & Infra —

Databricks Workflows GitHub Actions Kestra Airflow Docker

— Quality & Observability —

Great Expectations Prometheus


Projects

Python Databricks Delta Lake dbt Databricks Workflows GitHub Actions Unity Catalog

Production-grade monthly pipeline on 42M+ French business establishments (INSEE SIRENE). Configurable geographic scope. Medallion architecture (Bronze → Silver → Gold): Bronze as a transit-only dead letter queue, Silver as a SCD Type 2 historical source of truth, Gold as dbt-powered business aggregations. Single atomic Delta MERGE for Silver writes — no rollback needed. Circuit breaker pattern for transformation failure isolation, reset via GitHub Actions manual trigger without direct Databricks access. Gold failures trigger automatic rollback via Delta time travel. Full ADR documentation (DECISIONS.md). Deployable on any Databricks workspace in minutes.
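The circuit-breaker behaviour described above can be sketched as follows — a minimal, self-contained illustration, not the pipeline's actual code; the class name, threshold, and reset hook are assumptions for the example:

```python
class CircuitBreaker:
    """Minimal circuit breaker: trips after `threshold` consecutive
    failures and blocks further runs until explicitly reset."""

    def __init__(self, threshold: int = 1):
        self.threshold = threshold
        self.failures = 0
        self.open = False  # open = tripped, runs are blocked

    def call(self, fn, *args, **kwargs):
        if self.open:
            raise RuntimeError("circuit open: manual reset required")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.open = True
            raise
        self.failures = 0  # success closes the failure streak
        return result

    def reset(self):
        # In the pipeline this is where a GitHub Actions manual
        # trigger would land — no direct Databricks access needed.
        self.failures = 0
        self.open = False
```

Once tripped, every run is rejected until an operator resets the breaker, which is what isolates a failing transformation from repeated retries.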

Key decisions: significance filter on tracked columns — source system updates its treatment timestamp for both business and purely technical changes, filtering prevents spurious SCD2 period creation · single atomic MERGE eliminates partial write risk without rollback logic · Bronze as implicit DLQ — batch retained until Silver transformation confirmed successful · two source files at initialization (stock + historical) to reconstruct full SCD2 history from day one · circuit breaker reset decoupled from Databricks access — operator only needs Git.
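The significance filter boils down to comparing only tracked business columns — a sketch with hypothetical column names, not the SIRENE schema:

```python
# Hypothetical tracked business columns; technical columns such as a
# treatment timestamp are deliberately excluded from the comparison.
TRACKED_COLUMNS = {"legal_name", "address", "activity_code"}

def is_significant_change(current: dict, incoming: dict) -> bool:
    """Return True only if a *tracked* business column changed.

    The source updates its treatment timestamp for purely technical
    changes too; comparing tracked columns only prevents spurious
    SCD2 period creation."""
    return any(current.get(c) != incoming.get(c) for c in TRACKED_COLUMNS)
```

A record whose only difference is a technical timestamp is skipped, so no new SCD2 period is opened for it.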


Python Kafka/Redpanda Debezium Delta Lake Great Expectations Prometheus OSMNX

Production-grade event-driven pipeline: CDC from PostgreSQL via Debezium, Python micro-batch consumer (100ms poll), geospatial distance computation via OSMNX/BAN API. Delta Lake lakehouse (Bronze → Silver → Gold) with SCD Type 2 historization for full auditability. Data quality gates at every layer transition. Dead letter table for failed batches. Metrics exposed via Prometheus.
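The micro-batch consumer loop can be sketched as below — a skeleton assuming hypothetical `poll`/`process`/`commit` callables; real code would wrap a Kafka consumer and a Delta write:

```python
def run_micro_batches(poll, process, commit, max_batches=None):
    """Micro-batch consumer skeleton.

    Offsets are committed only after the batch is durably processed,
    which is what yields at-least-once delivery: a crash between
    process() and commit() replays the batch on restart."""
    n = 0
    while max_batches is None or n < max_batches:
        batch = poll(timeout_ms=100)  # ~100 ms poll interval
        if batch:
            process(batch)  # e.g. append to Bronze
            commit()        # offsets advance only after success
        n += 1
```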

Key decisions: CDC over polling — fact table capture is natural fit, state table monitored on 2 columns only · quarantine logic for implausible records · PII excluded from the lakehouse · SCD Type 2 scoped to auditability, not retroactive rule changes · at-least-once delivery with explicit crash recovery per stage.
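The quarantine gate amounts to partitioning a batch rather than dropping bad rows — a sketch with a hypothetical plausibility rule on the computed distance:

```python
def partition_records(records, max_distance_km=50.0):
    """Split records into (valid, quarantined).

    Hypothetical rule: a missing, negative, or implausibly large
    computed distance sends the record to quarantine instead of
    silently dropping it, so it stays auditable."""
    valid, quarantined = [], []
    for r in records:
        d = r.get("distance_km")
        if d is None or d < 0 or d > max_distance_km:
            quarantined.append(r)
        else:
            valid.append(r)
    return valid, quarantined
```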


Python MongoDB boto3 AWS S3

Containerized MongoDB replica set with keyFile intra-cluster authentication. Automated S3 ingestion with JSON parsing and normalization. Post-migration integrity validation: schema checks, type enforcement, duplicate detection, replication consistency across all nodes. Fully automated bootstrap — zero manual intervention.

Key decisions: 3 nodes across independent datacenters — simultaneous failure of 2 nodes in separate datacenters is statistically negligible, and losing all 3 is not an operational scenario · odd node count reaches quorum without an arbiter · idempotent bootstrap skips init on subsequent restarts.
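The quorum arithmetic behind the odd-node-count decision is simple majority voting:

```python
def majority(nodes: int) -> int:
    """Votes needed to elect a replica-set primary (strict majority)."""
    return nodes // 2 + 1

# With 3 data-bearing nodes, quorum is 2: the set survives the loss
# of any single node (or datacenter) without needing an arbiter.
# A 4th node would not help — majority(4) is 3, so the set still
# tolerates only one failure.
```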


Python Kestra DuckDB

Kestra-orchestrated monthly pipeline: ingestion, SQL transformations, deduplication, per-step data quality assertions, parallel report generation (revenue analytics + z-score anomaly detection). CSV → Parquet upfront for columnar query performance with DuckDB. Custom Docker image to resolve C-level dependency constraints in Kestra's Python task type.

Key decisions: per-step integrity checks rather than end-to-end only · partial failure isolation between reporting branches · Parquet chosen defensively given unknown volume baseline.
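The z-score anomaly branch can be illustrated with a minimal stdlib-only sketch — the threshold of 3 is a conventional default, not necessarily the project's setting:

```python
from statistics import mean, pstdev

def zscore_anomalies(values, threshold=3.0):
    """Return indices of values whose z-score exceeds `threshold`.

    Population statistics are used here for simplicity; a guard
    avoids division by zero on constant series."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]
```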


Python LangChain FAISS Mistral AI FastAPI

Full production architecture study (POC → MVP): cloud design, cost modeling, and systematic component trade-off analysis across scalability, cost efficiency, fault tolerance, access control, and observability constraints. Two-service RAG system as POC implementation.

Key decisions: two-stage LLM pipeline designed from the start to separate filter extraction from generation · single Mistral ecosystem to guarantee embedding space consistency · FAISS FlatL2 for exact search at POC scale, with explicit migration path to Pinecone documented for production.
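What FAISS `IndexFlatL2` does — exhaustive exact search by squared L2 distance — can be shown in a few lines of plain Python, which is why it is a reasonable choice at POC scale:

```python
def flat_l2_search(query, vectors, k=1):
    """Exact nearest-neighbour search: compute the squared L2
    distance to every stored vector, return the k closest indices.

    Linear in the number of vectors — fine for a POC corpus, hence
    the documented migration path for production scale."""
    dists = [
        (sum((q - v) ** 2 for q, v in zip(query, vec)), i)
        for i, vec in enumerate(vectors)
    ]
    return [i for _, i in sorted(dists)[:k]]
```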


▸ MongoDB Schema Migration — Structured → Document + Sharding (coming soon)

Python MongoDB

Relational-to-document schema migration with a configurable transformation engine — schema mapping defined as data config rather than hardcoded logic. Sharding strategy designed around the business access pattern.
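A mapping-as-config transformation engine can be sketched as below — the mapping keys and column names are hypothetical, but the principle (nested target paths driven entirely by data, no hardcoded field logic) is the one described above:

```python
# Hypothetical mapping config: nested document path -> flat source column.
MAPPING = {
    "customer.name": "customer_name",
    "customer.city": "city",
    "order_id": "id",
}

def transform(row: dict, mapping: dict = MAPPING) -> dict:
    """Build a nested document from a flat relational row, driven
    entirely by the mapping config — changing the target schema means
    editing data, not code."""
    doc = {}
    for target, source in mapping.items():
        parts = target.split(".")
        node = doc
        for p in parts[:-1]:
            node = node.setdefault(p, {})
        node[parts[-1]] = row[source]
    return doc
```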


Volunteer

| Project | Org | Role |
|---|---|---|
| Trawl Watch | BLOOM Association | Transformation architecture · Data modeling · Business logic implementation |
| Dans Mon Eau | Public Health Data Platform | EDA on opaque merged datasets · Normalization · Cross-functional team (40+ volunteers) |

Dans Mon Eau is in production and publicly accessible. The Trawl Watch pre-production repository is publicly available on GitHub.

Pinned

  1. dataforgoodfr/12_bloom — Python
  2. dataforgoodfr/13_pollution_eau — Jupyter Notebook