moltok

Tokenizer and evaluation utilities for turning molecular structures into token sequences using Residual Vector Quantization (RVQ) codebooks. This repo focuses on:

sampling OMol25 into CSV
building RVQ codebooks for positions, forces, and energy
serializing molecules into token sequences
pushing tokenized shards and the tokenizer to the Hugging Face Hub
evaluating trained models on OMol25 benchmarks (S2EF + 7 chemistry tasks)

Repo Contents

omol_sample_1m_to_csv.py: stream a 1M sample from colabfit/OMol25_train into CSV.
build_rvq_codebooks.py: build RVQ codebooks (quantile or k-means) and evaluate reconstruction.
serialize_molecules.py: MoleculeTokenizer + CSV serialization to token sequences.
push_to_hf.py: tokenize parquet shards and upload dataset shards to HF.
push_tokenizer_to_hf.py: upload the tokenizer as a HF model repo.
test_roundtrip.py: round-trip test through HF tokenizer and decode back to molecules.
evaluate_omol25.py: evaluate trained models on OMol25 S2EF and chemistry benchmarks.
omol25_train_sample_1k.csv: small sample for quick tests.
codebook_mol_1m.pkl: prebuilt codebook used by the scripts.

Setup

Python 3.12+ is required.

# Option 1: uv (recommended)
uv sync

# Option 2: pip
python3 -m venv .venv
source .venv/bin/activate
pip install -e .

For GPU support, install PyTorch with CUDA first:

pip install torch --index-url https://download.pytorch.org/whl/cu121

Token Format

Tokens are organized into sections so they can be shuffled for flexible conditioning:

[BOS]
[ATOMS] [Z=6] [Z=1] ... [ATOMS_END]
[POS]   [P0:...] ... [NL] ... [POS_END]
[FORCE] [FX0:...] [FY0:...] [FZ0:...] [NL] ... [FORCE_END]
[ENERGY] [E0:...] [NL] [ENERGY_END]
[EOS]

[NL] is a newline token that separates per-atom rows in the position/force sections.

Quickstart (using the included sample)

python3 serialize_molecules.py omol25_train_sample_1k.csv \
  --codebook codebook_mol_1m.pkl \
  --output tokenized_molecules.pkl \
  --show-examples 1

Workflows

1) Sample OMol25 to CSV

python3 omol_sample_1m_to_csv.py

This streams a 1M-row sample from HF colabfit/OMol25_train and writes omol25_train_sample_1m.csv. If you are authenticated with HF, the script will pick up HF_TOKEN automatically.

2) Build RVQ codebooks

python3 build_rvq_codebooks.py omol25_train_sample_1m.csv \
  --output codebook_mol_1m.pkl \
  --method quantile \
  --pos-levels 8 \
  --n-levels 4 \
  --codebook-size 256

Quantile mode uses Morton (Z-order) binning for 3D positions. Use --method kmeans for k-means.

3) Serialize molecules to tokens

python3 serialize_molecules.py omol25_train_sample_1m.csv \
  --codebook codebook_mol_1m.pkl \
  --output tokenized_molecules.pkl

4) Push tokenized dataset to HF

push_to_hf.py expects local OMol25 parquet shards (downloaded with huggingface-cli download facebook/OMol25) and writes train/validation shards to a dataset repo.

python3 push_to_hf.py /path/to/omol25_parquet_dir WillHeld/Tomol25 \
  --codebook codebook_mol_1m.pkl

Dataset repo: https://huggingface.co/datasets/WillHeld/Tomol25

5) Push tokenizer to HF

python3 push_tokenizer_to_hf.py --repo-id WillHeld/marin-tomol

Tokenizer repo: https://huggingface.co/WillHeld/marin-tomol

6) Round-trip test

python3 test_roundtrip.py

This uses omol25_train_sample_1k.csv, codebook_mol_1m.pkl, and the HF tokenizer repo WillHeld/marin-tomol.

7) Evaluate a trained model on OMol25 benchmarks

# S2EF evaluation on validation set (with metrics)
python3 evaluate_omol25.py \
  --model WillHeld/qwen3-omol \
  --codebook codebook_mol_1m.pkl \
  --split val \
  --output predictions_val.npz

# Run all 7 chemistry evaluation tasks
python3 evaluate_omol25.py \
  --model WillHeld/qwen3-omol \
  --codebook codebook_mol_1m.pkl \
  --run-evals \
  --eval-output-dir eval_results

# Quick test with limited samples
python3 evaluate_omol25.py \
  --model WillHeld/qwen3-omol \
  --codebook codebook_mol_1m.pkl \
  --split val \
  --max-samples 100

Evaluation Tasks

Task	Description	Metric
S2EF	Structure to Energy and Forces	Energy MAE (meV/atom), Force MAE (meV/Å)
conformers	Identify lowest energy conformer	Accuracy (%)
distance_scaling	Energy vs intermolecular distance	MAE (meV)
ligand_strain	Bound vs relaxed ligand energy	MAE (meV)
ligand_pocket	Protein-ligand interaction energy	MAE (meV)
protonation	Protonated vs deprotonated energy (pKa proxy)	MAE (meV)
ie_ea	Ionization energy / electron affinity	MAE (meV)
spin_gap	Singlet-triplet energy gap	MAE (meV)

Note: This tokenization scheme does not include charge or spin conditioning. The model predicts based on geometry alone, which works well for conformers, distance_scaling, ligand_strain, and ligand_pocket tasks. For ie_ea and spin_gap, the model relies on geometric differences between charge/spin states.

Name		Name	Last commit message	Last commit date
Latest commit History 30 Commits
.gitignore		.gitignore
README.md		README.md
analyze_from_parquet.py		analyze_from_parquet.py
analyze_token_distribution.py		analyze_token_distribution.py
build_fp16_config.py		build_fp16_config.py
build_rvq_codebooks.py		build_rvq_codebooks.py
codebook_mol_1m.pkl		codebook_mol_1m.pkl
evaluate_omol25.py		evaluate_omol25.py
fp16_config.json		fp16_config.json
omol25_train_sample_1k.csv		omol25_train_sample_1k.csv
omol_sample_1m_to_csv.py		omol_sample_1m_to_csv.py
push_sample_to_hf.py		push_sample_to_hf.py
push_to_hf.py		push_to_hf.py
push_tokenizer_to_hf.py		push_tokenizer_to_hf.py
pyproject.toml		pyproject.toml
serialize_molecules.py		serialize_molecules.py
test_hf_tokenizer.py		test_hf_tokenizer.py
test_model.py		test_model.py
test_roundtrip.py		test_roundtrip.py
test_vllm.py		test_vllm.py
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

moltok

Repo Contents

Setup

Token Format

Quickstart (using the included sample)

Workflows

1) Sample OMol25 to CSV

2) Build RVQ codebooks

3) Serialize molecules to tokens

4) Push tokenized dataset to HF

5) Push tokenizer to HF

6) Round-trip test

7) Evaluate a trained model on OMol25 benchmarks

Evaluation Tasks

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

moltok

Repo Contents

Setup

Token Format

Quickstart (using the included sample)

Workflows

1) Sample OMol25 to CSV

2) Build RVQ codebooks

3) Serialize molecules to tokens

4) Push tokenized dataset to HF

5) Push tokenizer to HF

6) Round-trip test

7) Evaluate a trained model on OMol25 benchmarks

Evaluation Tasks

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages