Merged

66 commits
50ff5a2
add task2vec
VseMeshkov Mar 1, 2026
be7b709
rm extra logs
VseMeshkov Mar 1, 2026
93a4587
Merge pull request #1 from intsystems/meshkovvl
VseMeshkov Mar 1, 2026
db6d456
Create BaseEmbedder.py
papayiv Mar 9, 2026
8011ae9
Create WassersteinEmbedder.py
papayiv Mar 9, 2026
b847bf8
Update test.yml
papayiv Mar 9, 2026
09ef264
Create badge_generator.py
papayiv Mar 9, 2026
f9c925c
Update README.md
papayiv Mar 9, 2026
86ecd73
Create __init__.py
papayiv Mar 9, 2026
7455851
Create test_wasserstein.py
papayiv Mar 9, 2026
e0373b7
Delete tests/test_simple.py
papayiv Mar 9, 2026
33af2ff
Update test.yml
papayiv Mar 9, 2026
f9aa99e
Update test.yml
papayiv Mar 9, 2026
974ef47
Update test.yml
papayiv Mar 9, 2026
b3907a8
Update test.yml
papayiv Mar 9, 2026
85f6012
Update coverage badge
github-actions[bot] Mar 9, 2026
08f70a5
Merge pull request #2 from intsystems/master
VseMeshkov Mar 9, 2026
92e712c
Update test.yml
papayiv Mar 9, 2026
e67f304
Update PLAN.md (add baseline selection description)
VseMeshkov Mar 9, 2026
aaa74c3
Update PLAN.md
VseMeshkov Mar 9, 2026
a079dce
Update PLAN.md
ILIAHHne63 Mar 10, 2026
7bdc2ca
Update PLAN.md
minashkinvladislav Mar 10, 2026
aa35520
Update PLAN.md
papayiv Mar 10, 2026
8370ded
Update PLAN.md
papayiv Mar 10, 2026
0a295af
Update PLAN.md
papayiv Mar 10, 2026
86d548c
add basic pretrain and inference code
VseMeshkov Mar 14, 2026
f513e62
add pretrain_benchmark
VseMeshkov Mar 15, 2026
098e88c
Merge pull request #3 from intsystems/meshkovvl
VseMeshkov Mar 15, 2026
48a9ce7
dataset2vec + demo d2v first version
ILIAHHne63 Mar 18, 2026
c20410d
Add files via upload
ILIAHHne63 Mar 18, 2026
900f8b1
add report first version
ILIAHHne63 Mar 18, 2026
01a4302
Update README.md
papayiv Mar 18, 2026
81ed9e3
Update README.md
papayiv Mar 29, 2026
372823f
Create __init__.py
papayiv Apr 5, 2026
8880fcf
Add files via upload
papayiv Apr 5, 2026
f408a66
add logs dir
VseMeshkov Apr 5, 2026
56ee2ae
Create __init__py
papayiv Apr 5, 2026
e8bc6ff
Add files via upload
papayiv Apr 5, 2026
879d9e3
Merge pull request #4 from intsystems/meshkovvl
VseMeshkov Apr 5, 2026
77eb5ea
Add files via upload
papayiv Apr 5, 2026
d9f20c3
Delete benchmarks/pretrain_benchmark/wasserstein/apply_benchmark_wass…
papayiv Apr 5, 2026
28ec670
Add dataset2vec and dataset2vec demo
ILIAHHne63 Apr 8, 2026
5b1529c
tests resolved
Apr 18, 2026
610c073
Update test.yml
papayiv Apr 18, 2026
ab799d2
Update test.yml
papayiv Apr 18, 2026
8e04a6d
Update test.yml
papayiv Apr 18, 2026
a004fff
Update test_task2vec.py
papayiv Apr 18, 2026
ab21552
Update test_wasserstein.py
papayiv Apr 18, 2026
ed9c863
Update coverage badge
github-actions[bot] Apr 18, 2026
adb16b3
Update installation and pyproject toml
ILIAHHne63 Apr 30, 2026
c3f8ff9
add demo and rreport dataset2vec
ILIAHHne63 Apr 30, 2026
e76e353
add tests task2vec
VseMeshkov May 4, 2026
16a3400
Add tests to dataset2vec, Change Readme, and pyproject
ILIAHHne63 May 6, 2026
c3d9bfc
Update coverage badge
github-actions[bot] May 6, 2026
752dad7
Add report
ILIAHHne63 May 6, 2026
d8ef7ef
Update references.bib
ILIAHHne63 May 6, 2026
6999e62
Create BLOGPOST.md
VseMeshkov May 6, 2026
b65c176
Update README.md
VseMeshkov May 6, 2026
fa0a6e4
Merge branch 'develop' into meshkovvl
VseMeshkov May 6, 2026
da95c36
Merge pull request #5 from intsystems/meshkovvl
VseMeshkov May 6, 2026
6ff2a26
rename pdf file
ILIAHHne63 May 7, 2026
0909a19
Update test.yml
papayiv May 7, 2026
1b2abef
Update test_task2vec.py
papayiv May 7, 2026
878fc4b
Update test_task2vec.py
papayiv May 7, 2026
a3797bf
Update coverage badge
github-actions[bot] May 7, 2026
9c0dc64
Update test_task2vec.py
papayiv May 7, 2026
69 changes: 46 additions & 23 deletions .github/workflows/test.yml
```diff
@@ -1,36 +1,59 @@
-name: test
+name: Testing
 
 on: [push, pull_request, workflow_dispatch]
 
 jobs:
-  pytest:
-    name: pytest
+  build:
     runs-on: ubuntu-latest
 
-    strategy:
-      matrix:
-        python-version: [3.7]
-
     steps:
+      - name: Checkout 🛎️
+        uses: actions/checkout@v4
+
       - name: Set up Python
-        uses: actions/setup-python@v1
+        uses: actions/setup-python@v4
         with:
-          python-version: 3.7
+          python-version: '3.x'
 
-      - name: Checkout 🛎️
-        uses: actions/checkout@v2
-
       - name: Install Dependencies
         run: |
+          pip install torch
+          pip install torchvision
+          pip install gymnasium
+          pip install pyro-ppl
+          pip install POT
+          pip install numpy
+          pip install scikit-learn
+          pip install tqdm
+          pip install pandas
+          pip install lightning
+          pip install pydantic
           pip install -U pytest pytest-cov
           pip install -U -r src/requirements.txt
           ls ./
 
       - name: Testing
         run: |
-          PYTHONPATH=src/ pytest tests/ --cov=mylib --cov-report=xml
-
-      - name: Upload to Codecov
-        uses: codecov/codecov-action@v2
-        with:
-          files: ./coverage.xml,
-          fail_ci_if_error: true
-          verbose: true
+          PYTHONPATH=src/ pytest tests/ --cov=src --cov-report=xml
+
+      - name: Generate coverage badge
+        run: |
+          python badge_generator.py
+
+      - name: Check for changes in coverage badge
+        id: check_changes
+        run: |
+          if git diff --exit-code -- coverage-badge.svg; then
+            echo "No changes in coverage badge"
+            echo "::set-output name=changes::false"
+          else
+            echo "Changes detected in coverage badge"
+            echo "::set-output name=changes::true"
+          fi
+
+      - name: Commit and push coverage badge
+        if: steps.check_changes.outputs.changes == 'true'
+        run: |
+          git config --global user.name 'github-actions[bot]'
+          git config --global user.email 'github-actions[bot]@users.noreply.github.com'
+          git add coverage-badge.svg
+          git commit -m "Update coverage badge"
+          git push origin develop
```
7 changes: 7 additions & 0 deletions .gitignore
```diff
@@ -120,3 +120,10 @@ logs/
 */mnist
 *.csv
 !.dvc
+data
+logs
+pretrain_embeddings
+wget-log
+checkpoints
+*.parquet
+*.pt
```
224 changes: 224 additions & 0 deletions BLOGPOST.md
@@ -0,0 +1,224 @@
# DataMetaMap: Why Compare Datasets? A Method-Driven Blogpost

Understanding **dataset similarity** is a hidden key to transfer learning. If we can measure how "close" one dataset is to another, we can make smarter choices about which model to pre-train on—saving time and boosting performance.

But how do you embed entire datasets into a shared vector space? Our new library, **DataMetaMap**, implements four powerful, research-backed approaches. Below, we walk through the **key methodological insight** behind each one.

---

## 1. Maximum Mean Discrepancy (MMD) — A Classical Kernel View

**Based on:** *A Kernel Two-Sample Test* (Gretton et al., 2012, following the review in arXiv:2208.11726)

### Core Idea

MMD answers a fundamental question: *Are two datasets sampled from the same distribution?* Unlike deep learning approaches, MMD works without training—it directly computes a distance between distributions using kernel functions.

### Mathematical Formulation

Let $P$ and $Q$ be two probability distributions. Given samples $X = \{x_1, ..., x_m\} \sim P$ and $Y = \{y_1, ..., y_n\} \sim Q$, the squared MMD is:

$$\text{MMD}^2(P, Q) = \left\| \mathbb{E}_{x \sim P}[\phi(x)] - \mathbb{E}_{y \sim Q}[\phi(y)] \right\|^2_{\mathcal{H}}$$

where $\phi$ maps data into a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$. Using the kernel trick $k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}$, we get:

$$\text{MMD}^2 = \mathbb{E}_{x, x' \sim P}[k(x, x')] - 2\mathbb{E}_{x \sim P, y \sim Q}[k(x, y)] + \mathbb{E}_{y, y' \sim Q}[k(y, y')]$$

In practice, we use the unbiased empirical estimate:

$$\widehat{\text{MMD}}^2 = \frac{1}{m(m-1)}\sum_{i \neq j} k(x_i, x_j) - \frac{2}{mn}\sum_{i,j} k(x_i, y_j) + \frac{1}{n(n-1)}\sum_{i \neq j} k(y_i, y_j)$$
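
As a concrete illustration, the unbiased estimator above fits in a few lines of NumPy. This is a sketch under stated assumptions (a fixed RBF bandwidth rather than the median heuristic; function names are ours, not DataMetaMap's API):

```python
import numpy as np

def rbf_kernel(A, B, bandwidth):
    # k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * bandwidth ** 2))

def mmd2_unbiased(X, Y, bandwidth=1.0):
    """Unbiased empirical MMD^2 between samples X ~ P and Y ~ Q."""
    m, n = len(X), len(Y)
    Kxx = rbf_kernel(X, X, bandwidth)
    Kyy = rbf_kernel(Y, Y, bandwidth)
    Kxy = rbf_kernel(X, Y, bandwidth)
    # drop the i == j diagonal terms, as in the unbiased estimator
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_x + term_y - 2 * Kxy.mean()
```

Two samples from the same distribution give a value near zero (slightly negative values are possible, since the estimator is unbiased); a distribution shift pushes it up.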

### Key Observations

- **No training required** — MMD works directly on raw features or neural network representations
- **Choice of kernel matters** — RBF (Gaussian) kernels with bandwidth selection are standard; DataMetaMap supports multiple kernels
- **Computational cost** — $O((m+n)^2)$ makes it suitable for moderate-sized datasets

### How to Use in DataMetaMap

Pass two datasets to the MMD embedder. The method returns a scalar distance. For embedding, we compute pairwise MMD distances to a set of reference datasets, creating a distance vector.

**Best for:** Quick baseline comparisons, detecting dataset shift, benchmarking other methods.

---

## 2. Task2Vec — Embedding Tasks via Fisher Information

**Based on:** *Task2Vec: Task Embedding for Meta-Learning* (Achille et al., arXiv:1902.03545)

### Core Idea

Every dataset defines a "task" for a neural network. The Fisher Information Matrix (FIM) tells us which parameters are most important for that task. By computing the diagonal of the FIM, Task2Vec creates a vector that captures the task's geometry.

### Mathematical Formulation

For a model with parameters $\theta$ and a dataset $\mathcal{D} = \{(x_i, y_i)\}$ with loss $\mathcal{L}(x, y; \theta)$, the Fisher Information Matrix is:

$$F(\theta) = \mathbb{E}_{x, y \sim \mathcal{D}}\left[ \nabla_\theta \log p(y|x; \theta) \nabla_\theta \log p(y|x; \theta)^\top \right]$$

Computing the full $F$ is prohibitive for modern networks. Task2Vec uses the **diagonal approximation**:

$$f_k = \mathbb{E}_{x, y \sim \mathcal{D}}\left[ \left( \frac{\partial \log p(y|x; \theta)}{\partial \theta_k} \right)^2 \right]$$

The task embedding is then:

$$z_{\text{task}} = \text{diag}(F) \quad \text{or} \quad z_{\text{task}} = \log \text{diag}(F)$$

After fine-tuning a reference network on $\mathcal{D}$ (or using a single gradient step), we compute these per-parameter importances.
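
To make the diagonal-FIM computation concrete, here is a minimal sketch for a linear softmax classifier, written in NumPy. It uses the empirical Fisher (squared gradients at the observed labels), which is a common simplification; it is an illustration of the formula above, not DataMetaMap's Task2Vec implementation, and all names are hypothetical:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fisher_diag_embedding(X, y, W, log_transform=True, eps=1e-8):
    """Diagonal of the empirical Fisher for a linear softmax
    classifier with weights W of shape (n_classes, n_features)."""
    P = softmax(X @ W.T)                 # (n, C) predicted probabilities
    onehot = np.eye(W.shape[0])[y]       # (n, C) observed labels
    # grad of log p(y|x; W) w.r.t. W is the outer product (e_y - p) x^T
    G = (onehot - P)[:, :, None] * X[:, None, :]   # (n, C, D)
    diag_F = (G ** 2).mean(axis=0).ravel()         # average squared grads
    return np.log(diag_F + eps) if log_transform else diag_F
```

The returned vector has one entry per parameter, matching the "embedding dimensionality equals number of parameters" point below; the optional log-transform stabilizes the high-variance entries.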

### Key Observations

- **Reference network dependent** — Different architectures produce different similarity judgments
- **Fine-tuning is required** — Each dataset needs adaptation of the base model
- **Embedding dimensionality** — Equals number of network parameters (typically millions), often reduced via PCA
- **Log-transform** helps stabilize high-variance Fisher entries

### How to Use in DataMetaMap

1. Choose a reference network (e.g., ResNet-18 pretrained on ImageNet)
2. For each dataset, fine-tune for a few epochs
3. Compute diagonal Fisher Information matrix
4. Return the flattened vector (optionally log-transformed)

**Best for:** Comparing classification tasks when you have a good reference model.

---

## 3. Dataset2Vec — Learning Dataset Representations

**Based on:** *Dataset2Vec: Learning Dataset Meta-Features* (Jomaa et al., arXiv:1905.11063)

### Core Idea

Why compute Fisher matrices when we can learn to embed datasets directly? Dataset2Vec is a **meta-learning** approach: train a neural network that encodes any dataset (as a set of labeled examples) into a fixed-size vector, then optimize this encoder to predict something useful (like relative task similarity).

### Mathematical Formulation

Dataset2Vec processes a dataset as an unordered set:

$$z_{\mathcal{D}} = f_{\text{pool}} \left( \{ g(x_i, y_i) \mid (x_i, y_i) \in \mathcal{D} \} \right)$$

where:
- $g$ is a per-example encoder (typically a small MLP processing the concatenated input and one-hot label)
- $f_{\text{pool}}$ is a permutation-invariant pooling function (sum, mean, or max)
- The output $z_{\mathcal{D}}$ is a $d$-dimensional vector (e.g., $d=128$)

The training objective is meta-learning: given triplets of datasets $\mathcal{D}_a, \mathcal{D}_b, \mathcal{D}_c$ where $\mathcal{D}_a$ is more similar to $\mathcal{D}_b$ than to $\mathcal{D}_c$ (in terms of downstream transfer performance), we use a ranking loss:

$$\mathcal{L} = \max\left(0, \|z_a - z_b\|^2 - \|z_a - z_c\|^2 + \alpha\right)$$
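
The set-encoder-plus-pooling structure and the ranking loss can be sketched in a few lines. This toy version uses fixed random weights for a one-hidden-layer per-example encoder $g$ and mean pooling for $f_{\text{pool}}$; it only illustrates the formulation (including permutation invariance), not the trained Dataset2Vec architecture:

```python
import numpy as np

def encode_dataset(X, y, n_classes, W1, W2):
    """z_D = mean-pool over g(x_i, y_i), with g a small MLP applied to
    the concatenation of features and a one-hot label."""
    onehot = np.eye(n_classes)[y]
    inp = np.concatenate([X, onehot], axis=1)   # (n, D + C)
    h = np.maximum(inp @ W1, 0)                 # ReLU hidden layer
    return (h @ W2).mean(axis=0)                # permutation-invariant pool

def triplet_loss(za, zb, zc, margin=1.0):
    # hinge ranking loss: pull z_a toward z_b, push it from z_c
    return max(0.0, np.sum((za - zb) ** 2) - np.sum((za - zc) ** 2) + margin)
```

Because the pooling is a mean over examples, shuffling the rows of a dataset leaves its embedding unchanged, which is exactly the permutation-invariance property noted below.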

### Key Observations

- **Once trained, inference is fast** — No per-dataset fine-tuning or Fisher computation
- **Meta-training requires many datasets** — Typically hundreds or thousands
- **Permutation invariance** ensures the embedding doesn't depend on data order
- **Generalization potential** — Can embed datasets not seen during meta-training

### How to Use in DataMetaMap

Our library includes:
- Pre-trained Dataset2Vec models on standard benchmarks (e.g., Meta-Dataset)
- Ability to train your own meta-encoder on custom dataset collections
- Support for various pooling strategies and per-example encoders

**Best for:** Large-scale dataset retrieval when you have many datasets and can afford meta-training.

---

## 4. Wasserstein Task Embedding — Optimal Transport Between Datasets

**Based on:** *Wasserstein Task Embedding for Measuring Task Similarities* (Liu et al., arXiv:2208.11726)

### Core Idea

Instead of comparing datasets through a model, compare them directly using **optimal transport**. The Wasserstein distance measures how much "mass" you must move to transform one probability distribution into another. This geometric viewpoint respects the underlying feature space structure.

### Mathematical Formulation

For two probability distributions $\mu$ and $\nu$ on $\mathbb{R}^d$, the $p$-Wasserstein distance is:

$$W_p(\mu, \nu) = \left( \inf_{\gamma \in \Gamma(\mu, \nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} \|x - y\|^p d\gamma(x, y) \right)^{1/p}$$

where $\Gamma(\mu, \nu)$ is the set of all couplings (joint distributions) with marginals $\mu$ and $\nu$.

For empirical distributions (our datasets), we solve:
- **1D case** (after projecting features): $W_1(\hat{\mu}, \hat{\nu}) = \frac{1}{n} \sum_{i=1}^n |X_{(i)} - Y_{(i)}|$, where $X_{(i)}, Y_{(i)}$ are the sorted samples (equal sample sizes assumed)
- **High-dimensional case**: use the entropy-regularized Sinkhorn algorithm, which costs $O(n^2)$ per iteration

To create an **embedding**, Wasserstein Task Embedding computes distances to $K$ reference distributions:

$$z_{\mathcal{D}} = [W(\mathcal{D}, R_1), W(\mathcal{D}, R_2), ..., W(\mathcal{D}, R_K)]$$

Reference distributions can be:
- Randomly sampled subsets from a large meta-dataset
- Prototypical distributions (e.g., Gaussian with different covariances)
- Other datasets in your collection
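
A minimal sketch of the reference-distance embedding, using the 1D sorted-sample formula averaged over random projections (a sliced-Wasserstein-style approximation rather than exact high-dimensional OT; equal sample sizes are assumed, and all function names are ours):

```python
import numpy as np

def w1_sorted(x, y):
    """Exact 1-D W1 between equal-size empirical samples."""
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

def wasserstein_embedding(features, references, n_proj=32, seed=0):
    """One embedding coordinate per reference set: average 1-D W1
    over random projection directions."""
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_proj, features.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    emb = []
    for R in references:
        dists = [w1_sorted(features @ u, R @ u) for u in dirs]
        emb.append(np.mean(dists))
    return np.array(emb)
```

A dataset drawn from the same distribution as a reference gets a small coordinate; a shifted reference gets a large one, so nearby datasets end up with similar distance vectors.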

### Key Observations

- **No training required** — Works directly on features (e.g., penultimate layer of a frozen network)
- **Handles different dataset sizes** — empirical optimal transport is well-defined for $n_1 \neq n_2$ (only the 1D sorted-sample formula assumes equal sizes)
- **Computational cost** — exact Wasserstein requires solving a linear program, roughly $O(n^3 \log n)$; the Sinkhorn approximation costs $O(n^2)$ per iteration
- **Choice of ground distance** — Euclidean is standard, but any metric works (e.g., cosine distance for embeddings)

### How to Use in DataMetaMap

1. Extract features for all examples using a frozen pre-trained network
2. Choose reference distributions (e.g., 50 random datasets from a meta-collection)
3. For each dataset, compute Wasserstein distance to each reference
4. Return the $K$-dimensional distance vector as the embedding
5. Optionally apply dimensionality reduction (PCA) if $K$ is large

**Best for:** Comparing datasets with imbalanced classes, different sizes, or when you want a geometry-aware metric.

---

## What DataMetaMap Does

Our library implements all four methods **in a unified PyTorch interface**:

- **Unified API** — Same `fit()` and `transform()` pattern across all embedders
- **Flexible feature extraction** — Raw data, pre-trained features, or learned representations
- **Reference management** — For MMD and Wasserstein methods, handle reference dataset selection
- **Visualization tools** — PCA, t-SNE, and UMAP projections of dataset embeddings
- **Similarity search** — Find nearest datasets to a target
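
To show what the shared `fit()` / `transform()` pattern looks like in practice, here is a hypothetical reference-distance embedder. The class and parameter names are illustrative only (DataMetaMap's actual signatures may differ), and the stand-in distance is a trivial mean-feature gap:

```python
import numpy as np

class ReferenceDistanceEmbedder:
    """Minimal fit()/transform() embedder: fit() stores reference
    datasets, transform() maps each dataset to its distance vector."""

    def __init__(self, distance_fn):
        self.distance_fn = distance_fn
        self.references_ = None

    def fit(self, reference_datasets):
        self.references_ = list(reference_datasets)
        return self

    def transform(self, datasets):
        return np.array([[self.distance_fn(D, R) for R in self.references_]
                         for D in datasets])

def mean_gap(D, R):
    # toy stand-in distance: gap between per-feature means
    return float(np.linalg.norm(D.mean(axis=0) - R.mean(axis=0)))
```

Any of the four methods above can slot in as `distance_fn` (or replace `transform` entirely for the learned embedders), which is what makes a unified interface possible.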

**The repo contains ready-to-run demos for each method.**

---

## Method Comparison at a Glance

| Method | Training Required | Inference Speed | Dimensionality | Handles Different Sizes | Geometric Interpretation |
|--------|:----------------:|:---------------:|:--------------:|:-----------------------:|:------------------------:|
| MMD | None | Medium (quadratic) | Variable (n_refs) | Yes | RKHS distance |
| Task2Vec | Per-dataset fine-tuning | Slow (per dataset) | # Parameters | N/A (fixed network) | Fisher information |
| Dataset2Vec | Meta-training (once) | Fast | Fixed (e.g., 128) | Yes | Learned similarity |
| Wasserstein | None | Slow (quadratic) | Fixed (n_refs) | Yes | Optimal transport |

---

## Key Insight Across All Methods

Despite their different mathematical origins (kernel methods, Fisher information, learned encoders, optimal transport), **all four approaches reduce to the same operation**: mapping a dataset to a vector where Euclidean distance correlates with transfer learning performance. DataMetaMap lets you compare which method works best for your domain.

---

## Practical Recommendations from Our Observations

- **For quick baselines** → Start with MMD on pre-trained features
- **When you have a strong reference model** → Try Task2Vec with few-shot fine-tuning
- **When you have many datasets for training** → Train a Dataset2Vec meta-encoder
- **When dataset sizes vary greatly** → Wasserstein embedding is your best bet
- **When computational budget is high** → Ensemble multiple methods

---

## References

- Gretton et al. (2012) – *A Kernel Two-Sample Test.* JMLR 13. (MMD is also reviewed in arXiv:2208.11726.)
- Achille et al. (2019) – *Task2Vec: Task Embedding for Meta-Learning.* arXiv:1902.03545
- Jomaa et al. (2019) – *Dataset2Vec: Learning Dataset Meta-Features.* arXiv:1905.11063
- Liu et al. (2022) – *Wasserstein Task Embedding for Measuring Task Similarities.* arXiv:2208.11726

**Our repo:** [DataMetaMap](https://github.com/intsystems/DataMetaMap)
28 changes: 26 additions & 2 deletions PLAN.md
@@ -49,15 +49,33 @@ DataMetaMap aims to compare datasets within a unified vector space to identify s

- **Baseline Selection**
Identify and select baseline methods from literature for comparison during benchmarking.

Description (done by Meshkov Vladislav):
- Establish baselines for each embedding method as specified in the paper
- Assess baselines from the literature and determine their appropriateness for our benchmarking framework
- Conduct a literature review to identify similar papers and gather additional straightforward baselines for meaningful comparison
- Document baseline descriptions in the benchmark specifications, along with rationale for their inclusion

- **Data Collection**
Gather a diverse collection of datasets for experimentation, ensuring they represent various domains and formats.

- **Data Preprocessing Pipeline**
Design and implement preprocessing steps to handle different dataset formats and ensure consistent input for embedding methods.


Description (done by Minashkin Vladislav):
- Handle diverse data types: images, text, tabular, and time series with type-specific loaders
- Fill missing values and remove bad data points
- Save all settings for exact reproduction

- **Evaluation Metrics Definition**
Define quantitative metrics to evaluate embedding quality and similarity measurement accuracy.

Description (done by Stepanov Ilya):
- Define cosine similarity, Euclidean distance and kernel-based distance as core metrics to evaluate geometric separability and structural relationships between dataset embeddings in the latent space
- Define Maximum Mean Discrepancy (MMD) metric as described in the original paper
- Ensure that all embedding methods and baselines will be evaluated using all metrics, so that comparison across methods is consistent and reproducible


- **Planning and Specifications**
Define technical specifications and success criteria based on research findings and data availability.
@@ -89,8 +107,14 @@ DataMetaMap aims to compare datasets within a unified vector space to identify s
- **Technical Report**
Document the methodology, experimental setup, and findings in a comprehensive technical report.

- **User and Developer Documentation**
Create detailed documentation for users and contributors, including setup guides and API references. In this task we should create github.io page where user can find documentation for all classes and their methods. Github.io page must have headers for functions and links to their each source code.
- **User and Developer Documentation**
Build documentation.


Description (done by Papay Ivan):
- create detailed documentation for users and contributors, including setup guides and API references
- create github.io page where user can find documentation for all classes and their methods
- the github.io page must have headers for all functions and links to their source code

- **Demo Examples and Blog Post**
Prepare example notebooks or scripts demonstrating real-world use cases, and write an explanatory blog post highlighting project value and insights.