7ben18/mse-mlops
mse-mlops


DINOv3 fine-tuning setup for binary skin lesion classification on processed HAM10000.

Quick Start

Use Finetuned Model (without training)

dvc pull  # pulls data and models
make ui-up  # starts the ui and api => http://localhost:7777

Finetuning Smoke Test

dvc pull  # pulls data and models
make mlflow-up
make train-docker-smoke
make ui-up  # starts the ui and api => http://localhost:7777
make docker-down  # optional full Docker teardown

Without relying on DVC-provided data:

make data-download
make data-split
make model-download  # download pretrained model
make mlflow-up
make train-docker-smoke
make ui-up
make docker-down  # optional full Docker teardown

Full Finetuning

dvc pull
make mlflow-up
make train-docker
dvc add models/finetuned/dinov3_ham10000/best_model.pt
git add models/finetuned/dinov3_ham10000/best_model.pt.dvc
git commit -m "Add best model checkpoint from training"
make ui-up

Hyperparameter Tuning

dvc pull
make mlflow-up
docker compose --profile train run --build --rm tune

Run Without Docker

Install the project dependencies first:

uv sync --group api --group ui --group dev

Then run each part directly from the repo root:

  • Data download: bash scripts/download_ham10000.sh
  • Data processing: uv run python scripts/data_processing.py
  • Pretrained model download: uv run python scripts/download_model.py
  • MLflow tracking server:
uv run mlflow server --host 127.0.0.1 --port 5001 --backend-store-uri sqlite:///mlflow.db --default-artifact-root ./mlartifacts
  • Training against the local MLflow server:
uv run python scripts/train.py --mlflow-tracking-uri http://127.0.0.1:5001
  • One-epoch local smoke test on CPU:
uv run python scripts/train.py --mlflow-tracking-uri http://127.0.0.1:5001 --epochs 1 --max-train-batches 1 --max-val-batches 1 --device cpu
  • Ray Tune search:
uv run python scripts/tune.py --config config/tune.yaml
  • Inference API (the $env: syntax below is PowerShell; in a POSIX shell use export FEEDBACK_DIR=reports/feedback):
$env:FEEDBACK_DIR = "reports/feedback"
uv run --group api python scripts/serve_api.py
  • Streamlit UI (likewise, export API_URL=http://127.0.0.1:8000 on POSIX shells):
$env:API_URL = "http://127.0.0.1:8000"
uv run --group ui python scripts/serve_ui.py

Local URLs:

  • MLflow: http://127.0.0.1:5001
  • API: http://127.0.0.1:8000
  • UI: http://127.0.0.1:7777

If you use the checked-in config/train.yaml, local training must override tracking.mlflow_tracking_uri, because the default config points at the Docker service name http://mlflow:5001.
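The smoke-test command above already shows the flag form of this override. As a hedged sketch of the precedence involved (the helper name is hypothetical; the flag name and both URIs come from this README), a CLI value simply wins over the config default:

```python
import argparse

# Default mirrors the checked-in config, which targets the Docker service name.
CONFIG_DEFAULT_URI = "http://mlflow:5001"

def resolve_tracking_uri(argv):
    """Let --mlflow-tracking-uri override the config default for local runs."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--mlflow-tracking-uri", default=CONFIG_DEFAULT_URI)
    args = parser.parse_args(argv)
    return args.mlflow_tracking_uri

# Local run overrides the Docker default; no flag keeps the config value.
local_uri = resolve_tracking_uri(["--mlflow-tracking-uri", "http://127.0.0.1:5001"])
docker_uri = resolve_tracking_uri([])
```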


Data Lifecycle

The project is organized around four dataset buckets:

  • train: supervised data used for fitting model weights.
  • val: held-out validation data used during experimentation and future hyperparameter tuning.
  • future: newly collected production data. Keep it separate until it has been reviewed and selected samples are promoted into train for later fine-tuning.
  • test: final hold-out set. Do not use it during normal development; touch it only for the final pre-production evaluation.

Data policy:

  • Keep dataset contents out of git.
  • Put raw source data under data/raw/.
  • Put derived tables and intermediate analysis outputs under data/processed/.
  • Keep acquisition dates in dataset-specific DATE.txt files where needed.

Canonical fine-tuning input:

data/processed/ham10000/
  metadata.csv
  HAM10000_images/
    train/
    val/
    test/
    future/

Normal workflow pulls this processed dataset from DVC. make data-split remains a local fallback only when the DVC remote is unavailable.
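A minimal sketch of how the split column drives the four buckets, using an in-memory stand-in for metadata.csv (the sample rows are invented; the column names image_id, set, and mb follow this README):

```python
import csv
import io
from collections import defaultdict

# Tiny in-memory stand-in for data/processed/ham10000/metadata.csv.
# Columns follow the README: image_id, set (split bucket), mb (binary label).
SAMPLE = io.StringIO(
    "image_id,set,mb\n"
    "ISIC_0000001,train,benign\n"
    "ISIC_0000002,val,malignant\n"
    "ISIC_0000003,future,benign\n"
)

def rows_by_split(fh):
    """Group metadata rows into the dataset buckets named in the set column."""
    buckets = defaultdict(list)
    for row in csv.DictReader(fh):
        buckets[row["set"]].append(row)
    return buckets

buckets = rows_by_split(SAMPLE)
```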

Dataset Sources

HAM10000

  • Harvard Dataverse: doi:10.7910/DVN/DBW86T
  • License: CC0 / Public Domain
  • Download script: scripts/download_ham10000.sh
  • References: official GitHub, Nature description, ISIC archive, dermatoscopy overview
  • Contents: about 10k dermatoscopic RGB .jpg images of pigmented skin lesions, lesion segmentation masks, and metadata with diagnosis, age, sex, localization, and source dataset.
  • Processing input metadata: data/raw/ham10000/HAM10000_metadata.csv.
  • Raw metadata columns: lesion_id, image_id, dx, dx_type, age, sex, localization, dataset.
  • Raw diagnosis classes: akiec, bcc, bkl, df, mel, nv, vasc.
  • The processing pipeline reads only raw HAM10000 metadata and raw images/masks, then fully rebuilds data/processed/ham10000/.
  • The canonical processed table is data/processed/ham10000/metadata.csv. It is image-level, keeps all images from the same lesion in the same split, and records each image's split in the set column.
  • metadata.csv also includes the EDA-derived mb label: malignant for akiec, bcc, mel, and benign for bkl, df, nv, vasc.
  • The only other processed outputs are the split image and mask directories under HAM10000_images/<set>/ and HAM10000_segmentations_lesion_tschandl/<set>/.
  • HAM10000 analysis derives lesion-level views in memory from metadata.csv; it does not depend on a second lesion-mapping CSV.
  • Quick stats from the EDA notebook: 10015 metadata rows, 7470 unique lesions, 6301 benign lesions, and 1169 malignant lesions.
  • Missing values in metadata: 52 missing age, 203 unknown localization, and 50 unknown sex.
  • Splits: the source does not provide official splits. Project splits are generated by the data processing pipeline and make data-split.
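The mb mapping described above can be written down directly; the function name is hypothetical, but the class sets come straight from the list of raw diagnosis classes:

```python
# EDA-derived binary label from the README:
# malignant for akiec, bcc, mel; benign for bkl, df, nv, vasc.
MALIGNANT = {"akiec", "bcc", "mel"}
BENIGN = {"bkl", "df", "nv", "vasc"}

def mb_label(dx: str) -> str:
    """Map a raw HAM10000 diagnosis class to the binary mb label."""
    if dx in MALIGNANT:
        return "malignant"
    if dx in BENIGN:
        return "benign"
    raise ValueError(f"unknown diagnosis class: {dx}")
```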

Scripts

  • scripts/download_ham10000.sh: download and normalize HAM10000 into data/raw/ham10000/. Run bash scripts/download_ham10000.sh. Useful options: --check-url, --test-extract. Make shortcut: make data-download.
  • make data-split: rebuild data/processed/ham10000/ from raw HAM10000 metadata, images, and masks.

Makefile

Use make help to list the available shortcuts. Common targets are make install, make check, make test, make docs, make data-download, make model-download, make data-split, make mlflow-up, make mlflow-stop, make train-docker, make train-docker-smoke, make ui-up, make ui-down, and make docker-down.

DVC Data Setup

Copy the local secrets template before pulling data:

cp .dvc/config.local-example .dvc/config.local

Then replace the placeholder values in .dvc/config.local with your real DVC remote credentials.

Config split:

  • .dvc/config stores the shared remote name, URL, endpoint, and region.
  • .dvc/config.local is git-ignored and should contain local secrets only.
  • .dvc/config.local-example is only a template showing which secret keys must be provided. Do not put real secrets in tracked files.

Fetch the tracked dataset:

dvc pull

This is the normal way to obtain data/processed/ham10000. Use make data-split only if the DVC remote is unavailable and you need to rebuild the processed dataset locally from raw HAM10000.

Inspect local data changes:

  • dvc status
  • dvc data status --granular data
  • dvc diff --targets data

Training

For the high-level workflow, use this README. For download script options and artifact layout details, use docs/data.md.

Model:

facebook/dinov3-vits16-pretrain-lvd1689m

Before the first training run:

  1. Request access to the model on Hugging Face.
  2. Log in locally with the Hugging Face CLI:

hf auth login

  3. Download the pretrained backbone locally:

make model-download

  4. Start MLflow:

make mlflow-up

The default training config expects the downloaded model at:

models/pretrained/dinov3-vits16-pretrain-lvd1689m

If you want to track the downloaded backbone in DVC, run:

dvc add models/pretrained/dinov3-vits16-pretrain-lvd1689m

The training config expects MLflow at:

tracking.mlflow_tracking_uri: http://mlflow:5001

If that server is not running, scripts/train.py fails before the first epoch starts.
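A minimal pre-flight check along those lines (this helper is hypothetical, not part of scripts/train.py) can fail fast before any training work starts:

```python
import socket

def mlflow_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Cheap pre-flight check that a tracking server accepts TCP connections."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: check the local server before launching a run.
# mlflow_reachable("127.0.0.1", 5001)
```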

Run training in Docker:

make train-docker

For a one-epoch smoke test in Docker:

make train-docker-smoke

Run Ray Tune in Docker:

docker compose --profile train run --build --rm tune

The train container is opt-in only. A plain docker compose up will not start training.

Docker training mounts these host folders into the container:

  • config/ -> /app/config (read-only)
  • data/ -> /app/data
  • models/ -> /app/models

That means edits to config/train.yaml apply to the next Docker training run without rebuilding the image. Rebuilds are still needed after code or dependency changes.

Training settings:

config/train.yaml

Tuning settings:

config/tune.yaml

Training artifacts are written under models/finetuned/dinov3_ham10000/:

  • best_model.pt: promoted local serving checkpoint selected on validation ROC AUC.
  • checkpoints/epoch_*.pt: resumable epoch checkpoints.
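The promotion of best_model.pt from the epoch checkpoints can be sketched as follows. The function and the shape of the metrics mapping are hypothetical; only the file names and the validation-ROC-AUC selection criterion come from this README:

```python
import shutil
from pathlib import Path

def promote_best(checkpoint_dir: Path, roc_auc: dict, best_path: Path) -> str:
    """Copy the checkpoint with the highest validation ROC AUC to best_model.pt.

    roc_auc maps checkpoint filenames (epoch_*.pt) to their validation ROC AUC.
    Returns the name of the promoted checkpoint.
    """
    best_name = max(roc_auc, key=roc_auc.get)
    shutil.copyfile(checkpoint_dir / best_name, best_path)
    return best_name
```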

Tuning writes Ray trial outputs and leaderboard exports under reports/.

MLflow Tracking Server

MLflow now has a first-class Docker Compose service and is part of the default Docker stack.

Start only MLflow if you just want to browse runs:

make mlflow-up

This starts:

  • mlflow: MLflow UI and tracking server on http://localhost:5001

Open the UI at:

http://127.0.0.1:5001

Stop only MLflow with:

make mlflow-stop

MLflow state is stored in git-ignored, repo-local paths:

  • mlflow.db
  • mlartifacts/

make docker-down therefore does not remove run history: these are repo-local bind-mounted files, not Docker volumes.

Training containers talk to MLflow at:

http://mlflow:5001

Serving

Serving now follows the main project layout instead of living as a nested standalone app. Importable API and UI code lives under src/mse_mlops/serving, while service Dockerfiles live under docker/.

Start the inference API and Streamlit UI from the repo root:

make ui-up

This starts:

  • mlflow: MLflow on http://localhost:5001
  • api: FastAPI on http://localhost:8000
  • ui: Streamlit on http://localhost:7777

The serving stack is now behind the ui profile, so a plain docker compose up does not try to start api before a model exists.

Stop the whole Docker stack with:

make docker-down

This removes:

  • Compose containers
  • the Compose network
  • named volumes such as feedback_data

This does not remove:

  • repo-local mlflow.db
  • repo-local mlartifacts/
  • local Docker images

Stop only the API and UI while keeping MLflow running:

make ui-down

Suggested Docker workflow:

  1. make mlflow-up
  2. make train-docker
  3. make ui-up
  4. make docker-down

The API expects a trained checkpoint at:

models/finetuned/dinov3_ham10000/best_model.pt
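A hypothetical fail-fast check for that path (not part of the actual API code) makes the missing-checkpoint case explicit at startup instead of failing on the first request:

```python
from pathlib import Path

CHECKPOINT = Path("models/finetuned/dinov3_ham10000/best_model.pt")

def require_checkpoint(path: Path = CHECKPOINT) -> Path:
    """Fail fast with a clear message if the serving checkpoint is missing."""
    if not path.is_file():
        raise FileNotFoundError(
            f"missing {path}; run training (make train-docker) or dvc pull first"
        )
    return path
```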

Project Conventions

  • Importable and reusable Python code belongs under src/mse_mlops.
  • Notebook-backed reusable analysis helpers belong under src/mse_mlops/analysis.
  • Exploratory notebooks belong under notebooks/ and should be organized by dataset or topic, for example notebooks/ham10000/.
  • Reusable notebook helper code belongs under src/mse_mlops, not under notebooks/.
  • Raw source data belongs under data/raw/ and derived tables belong under data/processed/.
  • Dataset contents and local runtime artifacts stay out of git; only placeholders and provenance notes under data/ should be tracked.
  • Explanatory project material belongs under docs/.
  • Reports and exported analysis outputs belong under reports/.
  • Service-specific Dockerfiles belong under docker/, not in nested app subtrees.

Structure

├── .github
│   ├── actions        <- GitHub Actions configuration.
│   └── workflows      <- GitHub Actions workflows.
│
├── config             <- Training and experiment configuration.
├── docker             <- Service Dockerfiles for API and UI.
├── scripts            <- Utility scripts for local and overnight runs.
├── src/mse_mlops      <- Source code for this project.
│   ├── analysis       <- Reusable analysis helpers used by notebooks.
│   └── serving        <- FastAPI API, Streamlit UI, and serving helpers.
├── data
│   ├── raw            <- Local source data and provenance notes (kept out of git).
│   └── processed      <- Local derived datasets and exported tables (kept out of git).
│
├── docs               <- MkDocs documentation for the project.
├── models             <- Model checkpoints, predictions, metrics, and summaries.
├── notebooks          <- Exploratory notebooks only, grouped by dataset/topic.
├── logs               <- Local training and smoke-test logs.
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
├── tests              <- Unit tests for the project.
├── .gitignore         <- Files to be ignored by git.
├── docker/train.Dockerfile <- Dockerfile for the training image.
├── LICENSE            <- MIT License.
├── Makefile           <- Makefile with commands like `make install` or `make test`.
├── mkdocs.yml         <- MkDocs configuration.
├── pyproject.toml     <- Package build configuration.
├── README.md          <- The top-level README for this project.
└── uv.lock            <- Lock file for uv.

About

TSM_MachLeData - Group Project FS26
