DINOv3 fine-tuning setup for binary skin lesion classification on processed HAM10000.
Quick demo of the serving stack, using data and models pulled from DVC:

```bash
dvc pull    # pulls data and models
make ui-up  # starts the ui and api => http://localhost:7777
```

Smoke-test training run:

```bash
dvc pull    # pulls data and models
make mlflow-up
make train-docker-smoke
make ui-up  # starts the ui and api => http://localhost:7777
make docker-down  # optional full Docker teardown
```

Without relying on DVC-provided data:

```bash
make data-download
make data-split
make model-download  # download pretrained model
make mlflow-up
make train-docker-smoke
make ui-up
make docker-down  # optional full Docker teardown
```

Full training run, versioning the resulting checkpoint with DVC:

```bash
dvc pull
make mlflow-up
make train-docker
dvc add models/finetuned/dinov3_ham10000/best_model.pt
git add models/finetuned/dinov3_ham10000/best_model.pt.dvc
git commit -m "Add best model checkpoint from training"
make ui-up
```

Hyperparameter tuning with Ray Tune:

```bash
dvc pull
make mlflow-up
docker compose --profile train run --build --rm tune
```

To run everything locally without Docker, install the project dependencies first:

```bash
uv sync --group api --group ui --group dev
```

Then run each part directly from the repo root:
- Data download:

  ```bash
  bash scripts/download_ham10000.sh
  ```

- Data processing:

  ```bash
  uv run python scripts/data_processing.py
  ```

- Pretrained model download:

  ```bash
  uv run python scripts/download_model.py
  ```

- MLflow tracking server:

  ```bash
  uv run mlflow server --host 127.0.0.1 --port 5001 --backend-store-uri sqlite:///mlflow.db --default-artifact-root ./mlartifacts
  ```

- Training against the local MLflow server:

  ```bash
  uv run python scripts/train.py --mlflow-tracking-uri http://127.0.0.1:5001
  ```

- One-epoch local smoke test on CPU:

  ```bash
  uv run python scripts/train.py --mlflow-tracking-uri http://127.0.0.1:5001 --epochs 1 --max-train-batches 1 --max-val-batches 1 --device cpu
  ```

- Ray Tune search:

  ```bash
  uv run python scripts/tune.py --config config/tune.yaml
  ```

- Inference API (the environment variable below uses PowerShell syntax; in bash, use `export FEEDBACK_DIR=reports/feedback`):

  ```powershell
  $env:FEEDBACK_DIR = "reports/feedback"
  uv run --group api python scripts/serve_api.py
  ```

- Streamlit UI (likewise, `export API_URL=http://127.0.0.1:8000` in bash):

  ```powershell
  $env:API_URL = "http://127.0.0.1:8000"
  uv run --group ui python scripts/serve_ui.py
  ```

Local URLs:
- MLflow: http://127.0.0.1:5001
- API: http://127.0.0.1:8000
- UI: http://127.0.0.1:7777
If you use the checked-in `config/train.yaml`, local training must override `tracking.mlflow_tracking_uri`, because the default config points at the Docker service name `http://mlflow:5001`.
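For reference, the override amounts to rewriting one key, sketched here in Python (this assumes PyYAML and the top-level `tracking` mapping implied by the dotted key above; `config/train.local.yaml` is a hypothetical filename for illustration). The documented route is simply the `--mlflow-tracking-uri` flag on `scripts/train.py` shown above.

```python
import yaml

# Load the checked-in training config (assumes a top-level "tracking" mapping).
with open("config/train.yaml") as f:
    cfg = yaml.safe_load(f)

# Point MLflow tracking at the local server instead of the Docker service name.
cfg["tracking"]["mlflow_tracking_uri"] = "http://127.0.0.1:5001"

# Write a local copy (hypothetical path; wire it into your workflow as needed).
with open("config/train.local.yaml", "w") as f:
    yaml.safe_dump(cfg, f)
```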
The project is organized around four dataset buckets:
- `train`: supervised data used for fitting model weights.
- `val`: held-out validation data used during experimentation and future hyperparameter tuning.
- `future`: newly collected production data. Keep it separate until it has been reviewed and selected samples are promoted into `train` for later fine-tuning.
- `test`: final hold-out set. Do not use it during normal development; touch it only for the final pre-production evaluation.
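These buckets correspond to the `set` column in the processed `metadata.csv` described below. As a minimal pandas sketch (pandas is an assumption; the `set` column follows the processed-metadata description in the data section), selecting one bucket looks like:

```python
import pandas as pd

# Load the canonical processed metadata table.
meta = pd.read_csv("data/processed/ham10000/metadata.csv")

# Select one dataset bucket via the "set" column: train, val, test, or future.
train_meta = meta[meta["set"] == "train"]
print(len(train_meta), "training images")
```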
Data policy:
- Keep dataset contents out of git.
- Put raw source data under
data/raw/. - Put derived tables and intermediate analysis outputs under
data/processed/. - Keep acquisition dates in dataset-specific
DATE.txtfiles where needed.
Canonical fine-tuning input:
```
data/processed/ham10000/
  metadata.csv
  HAM10000_images/
    train/
    val/
    test/
    future/
```
The normal workflow pulls this processed dataset from DVC; `make data-split` remains a local fallback only when the DVC remote is unavailable.
- Harvard Dataverse: doi:10.7910/DVN/DBW86T
- License: CC0 / Public Domain
- Download script: `scripts/download_ham10000.sh`
- References: official GitHub, Nature description, ISIC archive, dermatoscopy overview
- Contents: about 10k dermatoscopic RGB `.jpg` images of pigmented skin lesions, lesion segmentation masks, and metadata with diagnosis, age, sex, localization, and source dataset.
- Processing input metadata: `data/raw/ham10000/HAM10000_metadata.csv`.
- Raw metadata columns: `lesion_id`, `image_id`, `dx`, `dx_type`, `age`, `sex`, `localization`, `dataset`.
- Raw diagnosis classes: `akiec`, `bcc`, `bkl`, `df`, `mel`, `nv`, `vasc`.
- The processing pipeline reads only raw HAM10000 metadata and raw images/masks, then fully rebuilds `data/processed/ham10000/`.
- The canonical processed table is `data/processed/ham10000/metadata.csv`. It is image-level, keeps all images from the same lesion in the same split, and includes the split in `set`. `metadata.csv` also includes the EDA-derived `mb` label: `malignant` for `akiec`, `bcc`, `mel`, and `benign` for `bkl`, `df`, `nv`, `vasc` (see the sketch after this list).
- The only other processed outputs are the split image and mask directories under `HAM10000_images/<set>/` and `HAM10000_segmentations_lesion_tschandl/<set>/`.
- HAM10000 analysis derives lesion-level views in memory from `metadata.csv`; it does not depend on a second lesion-mapping CSV.
- Quick stats from the EDA notebook: 10015 metadata rows, 7470 unique lesions, 6301 benign lesions, and 1169 malignant lesions.
- Missing values in metadata: 52 missing `age`, 203 `unknown` `localization`, and 50 `unknown` `sex`.
- Splits: the source does not provide official splits. Project splits are generated by the data processing pipeline and `make data-split`.
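A minimal pandas sketch of the `mb` derivation and the lesion-level quick stats above (pandas is assumed, as is the presence of the raw `dx` and `lesion_id` columns in the processed table; verify against the actual `metadata.csv`):

```python
import pandas as pd

# EDA-derived binary mapping from raw diagnosis classes to malignant/benign.
MB_MAP = {
    "akiec": "malignant", "bcc": "malignant", "mel": "malignant",
    "bkl": "benign", "df": "benign", "nv": "benign", "vasc": "benign",
}

meta = pd.read_csv("data/processed/ham10000/metadata.csv")
meta["mb_check"] = meta["dx"].map(MB_MAP)

# Lesion-level view derived in memory: one row per lesion_id.
lesions = meta.drop_duplicates("lesion_id")
print(len(meta), "metadata rows")          # expected: 10015
print(len(lesions), "unique lesions")      # expected: 7470
print(lesions["mb_check"].value_counts())  # expected: 6301 benign, 1169 malignant
```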
- `scripts/download_ham10000.sh`: download and normalize HAM10000 into `data/raw/ham10000/`. Run `bash scripts/download_ham10000.sh`. Useful options: `--check-url`, `--test-extract`. Make shortcut: `make data-download`.
- `make data-split`: rebuild `data/processed/ham10000/` from raw HAM10000 metadata, images, and masks.
Use `make help` to list the available shortcuts. Common targets are `make install`, `make check`, `make test`, `make docs`, `make data-download`, `make model-download`, `make data-split`, `make mlflow-up`, `make mlflow-stop`, `make train-docker`, `make train-docker-smoke`, `make ui-up`, `make ui-down`, and `make docker-down`.
Copy the local secrets template before pulling data:
```bash
cp .dvc/config.local-example .dvc/config.local
```
Then replace the placeholder values in `.dvc/config.local` with your real DVC remote credentials.
Config split:
- `.dvc/config` stores the shared remote name, URL, endpoint, and region.
- `.dvc/config.local` is git-ignored and should contain local secrets only.
- `.dvc/config.local-example` is only a template showing which secret keys must be provided. Do not put real secrets in tracked files.
Fetch the tracked dataset:
```bash
dvc pull
```
This is the normal way to obtain `data/processed/ham10000`. Use `make data-split` only if the DVC remote is unavailable and you need to rebuild the processed dataset locally from raw HAM10000.
Inspect local data changes:
```bash
dvc status
dvc data status --granular data
dvc diff --targets data
```
For the high-level workflow, use this README. For download script options and artifact layout details, use `docs/data.md`.
Model:
`facebook/dinov3-vits16-pretrain-lvd1689m`
Before the first training run:
- Request access to the model on Hugging Face.
- Log in locally with the Hugging Face CLI: `hf auth login`
- Download the pretrained backbone locally: `make model-download`
- Start MLflow: `make mlflow-up`
The default training config expects the downloaded model at:
```
models/pretrained/dinov3-vits16-pretrain-lvd1689m
```
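A quick way to verify the download is to load it from disk with transformers (a sketch assuming the directory holds a standard Hugging Face checkpoint and that your installed transformers version supports DINOv3):

```python
from transformers import AutoImageProcessor, AutoModel

# Load the backbone from the local directory rather than the Hub.
path = "models/pretrained/dinov3-vits16-pretrain-lvd1689m"
processor = AutoImageProcessor.from_pretrained(path)
model = AutoModel.from_pretrained(path)

print(model.config.model_type)
print(sum(p.numel() for p in model.parameters()), "parameters")
```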
If you want to track the downloaded backbone in DVC, run:
```bash
dvc add models/pretrained/dinov3-vits16-pretrain-lvd1689m
```
The training config expects MLflow at:
```
tracking.mlflow_tracking_uri: http://mlflow:5001
```
If that server is not running, `scripts/train.py` fails before the first epoch starts.
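A quick pre-flight liveness check from Python before launching a long run (a minimal sketch assuming the `mlflow` client package; any server round-trip works, and `search_experiments` raises if the server is unreachable):

```python
import mlflow

# Use the URI your run will use: http://127.0.0.1:5001 locally,
# http://mlflow:5001 from inside the Docker network.
mlflow.set_tracking_uri("http://127.0.0.1:5001")

client = mlflow.MlflowClient()
print([e.name for e in client.search_experiments()])  # raises if unreachable
```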
Run training in Docker:
```bash
make train-docker
```
For a one-epoch smoke test in Docker:
```bash
make train-docker-smoke
```
Run Ray Tune in Docker:
```bash
docker compose --profile train run --build --rm tune
```
The `train` container is opt-in only. A plain `docker compose up` will not start training.
Docker training mounts these host folders into the container:
- `config/` -> `/app/config` (read-only)
- `data/` -> `/app/data`
- `models/` -> `/app/models`
That means edits to `config/train.yaml` apply to the next Docker training run without rebuilding the image. Rebuilds are still needed after code or dependency changes.
Training settings:
`config/train.yaml`
Tuning settings:
`config/tune.yaml`
Training artifacts are written under `models/finetuned/dinov3_ham10000/`:

- `best_model.pt`: promoted local serving checkpoint selected on validation ROC AUC.
- `checkpoints/epoch_*.pt`: resumable epoch checkpoints.
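A minimal sketch for inspecting the promoted checkpoint (assumes PyTorch; the checkpoint's internal layout is not documented here, so treat what `torch.load` returns as the source of truth):

```python
import torch

# Load the promoted serving checkpoint on CPU for inspection.
ckpt = torch.load(
    "models/finetuned/dinov3_ham10000/best_model.pt",
    map_location="cpu",
    weights_only=False,  # the file may hold more than bare tensors (assumption)
)

# Whether this is a raw state_dict or a dict with extra metadata is an
# assumption to verify by inspection.
if isinstance(ckpt, dict):
    print(list(ckpt.keys())[:10])
```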
Tuning writes Ray trial outputs and leaderboard exports under `reports/`.
MLflow now has a first-class Docker Compose service and is part of the default Docker stack.
Start only MLflow if you just want to browse runs:
```bash
make mlflow-up
```
Open the UI at:
http://127.0.0.1:5001
Stop only MLflow with:
```bash
make mlflow-stop
```
This starts:
- `mlflow`: MLflow UI and tracking server on http://localhost:5001
MLflow state is stored in the same git-ignored repo paths:
- `mlflow.db`
- `mlartifacts/`
So `make docker-down` still does not remove run history, because these are repo-local bind-mounted files, not Docker volumes.
Training containers talk to MLflow at:
http://mlflow:5001
Serving now follows the main project layout instead of living as a nested standalone app. Importable API and UI code lives under `src/mse_mlops/serving`, while service Dockerfiles live under `docker/`.
Start the inference API and Streamlit UI from the repo root:
```bash
make ui-up
```
This starts:
- `mlflow`: MLflow on http://localhost:5001
- `api`: FastAPI on http://localhost:8000
- `ui`: Streamlit on http://localhost:7777
The serving stack is now behind the `ui` profile, so a plain `docker compose up` does not try to start `api` before a model exists.
Stop the whole Docker stack with:
```bash
make docker-down
```
This removes:
- Compose containers
- the Compose network
- named volumes such as `feedback_data`
This does not remove:
- repo-local `mlflow.db`
- repo-local `mlartifacts/`
- local Docker images
Stop only the API and UI while keeping MLflow running:
```bash
make ui-down
```
Suggested Docker workflow:
1. `make mlflow-up`
2. `make train-docker`
3. `make ui-up`
4. `make docker-down`
The API expects a trained checkpoint at:
```
models/finetuned/dinov3_ham10000/best_model.pt
```
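As a minimal smoke test against the running API (the `requests` dependency, the `/predict` route, and the `file` field name are assumptions for illustration; check FastAPI's interactive docs, usually at http://127.0.0.1:8000/docs, for the real route and schema):

```python
import requests

# Hypothetical endpoint and multipart field name; confirm against /docs.
with open("lesion.jpg", "rb") as f:
    resp = requests.post(
        "http://127.0.0.1:8000/predict",
        files={"file": ("lesion.jpg", f, "image/jpeg")},
        timeout=30,
    )
resp.raise_for_status()
print(resp.json())
```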
- Importable and reusable Python code belongs under `src/mse_mlops`.
- Notebook-backed reusable analysis helpers belong under `src/mse_mlops/analysis`.
- Exploratory notebooks belong under `notebooks/` and should be organized by dataset or topic, for example `notebooks/ham10000/`.
- Reusable notebook helper code belongs under `src/mse_mlops`, not under `notebooks/`.
- Raw source data belongs under `data/raw/` and derived tables belong under `data/processed/`.
- Dataset contents and local runtime artifacts stay out of git; only placeholders and provenance notes under `data/` should be tracked.
- Explanatory project material belongs under `docs/`.
- Reports and exported analysis outputs belong under `reports/`.
- Service-specific Dockerfiles belong under `docker/`, not in nested app subtrees.
```
├── .github
│   ├── actions             <- Github Actions configuration.
│   └── workflows           <- Github Actions workflows.
│
├── config                  <- Training and experiment configuration.
├── docker                  <- Service Dockerfiles for API and UI.
├── scripts                 <- Utility scripts for local and overnight runs.
├── src/mse_mlops           <- Source code for this project.
│   ├── analysis            <- Reusable analysis helpers used by notebooks.
│   └── serving             <- FastAPI API, Streamlit UI, and serving helpers.
├── data
│   ├── raw                 <- Local source data and provenance notes (kept out of git).
│   └── processed           <- Local derived datasets and exported tables (kept out of git).
│
├── docs                    <- MkDocs documentation for the project.
├── models                  <- Model checkpoints, predictions, metrics, and summaries.
├── notebooks               <- Exploratory notebooks only, grouped by dataset/topic.
├── logs                    <- Local training and smoke-test logs.
├── reports                 <- Generated analysis as HTML, PDF, LaTeX, etc.
├── tests                   <- Unit tests for the project.
├── .gitignore              <- Files to be ignored by git.
├── docker/train.Dockerfile <- Dockerfile for the training image.
├── LICENSE                 <- MIT License.
├── Makefile                <- Makefile with commands like `make install` or `make test`.
├── mkdocs.yml              <- MkDocs configuration.
├── pyproject.toml          <- Package build configuration.
├── README.md               <- The top-level README for this project.
└── uv.lock                 <- Lock file for uv.
```