Large files in git, your bucket, one binary.
A Rust rewrite of Dan Loewenherz's git-bigstore (2013), which got one thing right that everything else got wrong: large file storage should use git's own clean/smudge filters and your own bucket, not a vendor-hosted server with its own protocol and billing.
Dan's original insight was that git-media (and later Git LFS) broke
idempotency — running the clean filter twice produced different output, which
corrupted repos during collaboration. bigstore fixed that with a simple,
idempotent pointer format and direct cloud storage. It was a Python script, a
.gitattributes line, and your S3 credentials. Nothing else.
This rewrite is an exploration of the modern state of git and big data — what's possible now with Rust, async object-store crates, and the DVC/LFS ecosystem that didn't exist when Dan wrote the original. Git itself may absorb much of this soon. Until then: one binary, no server, no lock-in, and a storage-layer bridge that lets Git LFS clients pull from the same bucket without knowing bigstore exists.
```sh
cargo install --path .
```

The binary is called `git-bigstore`. Git discovers it automatically as a subcommand (`git bigstore ...`).
```sh
# Initialize with your storage backend
git bigstore init s3://my-bucket/bigstore

# Tell git which files to track
echo '*.bin filter=bigstore' >> .gitattributes
git add .gitattributes .bigstore.toml

# Use git normally — large files are transparently replaced with pointers
cp ~/large-model.bin .
git add large-model.bin
git commit -m "add model"

# Upload to remote storage
git bigstore push

# On another machine: clone and pull
git clone ...
git bigstore pull
```

| Scheme | Example | Notes |
|---|---|---|
| `s3://` | `s3://bucket/prefix` | AWS S3 (uses standard AWS credentials) |
| `gs://` | `gs://bucket/prefix` | Google Cloud Storage |
| `az://` | `az://container/prefix` | Azure Blob Storage |
| `r2://` | `r2://bucket/prefix` | Cloudflare R2 (requires `--endpoint`) |
| `t3://` or `tigris://` | `t3://bucket` | Tigris (auto-configures endpoint) |
| `rclone://` | `rclone://remote:path` | Any rclone remote |
| `local://` or `file://` | `local:///tmp/store` | Local filesystem (testing) |
```sh
# R2 requires an explicit endpoint
git bigstore init r2://my-bucket --endpoint https://ACCOUNT_ID.r2.cloudflarestorage.com

# Tigris auto-configures
git bigstore init t3://my-bucket
```

Initialize bigstore in the current repository. Creates `.bigstore.toml` and configures git clean/smudge filters.
Upload cached objects to remote storage. Skips objects already present on the remote. Optional glob patterns filter which files to push.

```sh
git bigstore push            # push all tracked files
git bigstore push "models/*" # push only models
git bigstore push --jobs 16  # use 16 concurrent uploads
```

Download objects from remote storage with integrity verification. Every downloaded object is hash-verified before entering the local cache.

```sh
git bigstore pull          # pull all tracked files
git bigstore pull "*.bin"  # pull only .bin files
git bigstore pull --jobs 4 # limit to 4 concurrent downloads
```

Show the state of each tracked large file:
```
ok                         models/bert.bin
cached (not checked out)   models/gpt2.bin
pointer only (needs pull)  data/train.bin
```

Use `--verify` to re-hash cached objects and detect corruption:

```sh
git bigstore status --verify
```

Reports `CORRUPTED` (hash mismatch) for bad cache entries and exits non-zero with repair guidance.
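Conceptually, the `--verify` pass re-hashes each cached object and compares the result against the digest recorded in its pointer. A minimal sketch of that check in Python (the function name and streaming chunk size are illustrative, not bigstore's actual internals):

```python
import hashlib
from pathlib import Path

def verify_cached_object(path: Path, expected_hex: str) -> bool:
    """Re-hash a cached object and compare to its recorded sha256 digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Stream in 1 MiB chunks so multi-GB objects don't need to fit in RAM
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest() == expected_hex
```

A `False` result here corresponds to a `CORRUPTED` report: the bytes on disk no longer match the content the pointer names.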
Show history of bigstore-tracked files with change classification:

```
a1b2c3d 2024-01-15 12:00:00 +0000 update model
  ~ models/bert.bin sha256:abc123..def456 -> sha256:789abc..def012
d4e5f6a 2024-01-14 10:00:00 +0000 add training data
  + data/train.bin sha256:111222..333444
```

Symbols: `+` added, `-` deleted, `~` modified, `R` renamed, `C` copied.
Create a bigstore pointer from a DVC file. Imports the object from the DVC cache (`.dvc/cache/`) into the bigstore cache with hash verification.

```sh
git bigstore ref model.bin.dvc model.bin
echo 'model.bin filter=bigstore' >> .gitattributes
git add model.bin .gitattributes
git commit -m "migrate model from DVC"
git bigstore push
```

List files in a DVC `.dir` manifest:

```sh
git bigstore dvc-ls models.dvc
# 17 entries in models.dvc (manifest md5:0f0d92...)
# 28a6a97b... exports/model.onnx
# 46ce4109... exports/model.onnx.data
```

Import files from a DVC `.dir` manifest into bigstore. Content is restored to the working tree automatically.
```sh
# Import everything
git bigstore import-dvc-dir models.dvc models/

# Import selectively
git bigstore import-dvc-dir models.dvc models/ "exports/*.onnx"

# Overwrite existing files
git bigstore import-dvc-dir models.dvc models/ --force
```

Migrate legacy `.bigstore` config to `.bigstore.toml`.

```sh
git bigstore migrate-config
git add .bigstore.toml
git rm .bigstore
git commit -m "migrate config to toml"
```

Created by `init`. Committed to the repo so all collaborators share the same backend.
```toml
layout = "files/{hash_fn}/{prefix}/{rest}"

[backend]
type = "s3"
bucket = "my-bucket"
prefix = "bigstore"
```

The `layout` field controls how objects are stored remotely. The default layout is DVC-compatible (`files/{hash_fn}/{prefix}/{rest}`).
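Expanding the template is simple string substitution: `{hash_fn}` is the hash algorithm name, `{prefix}` is the first two hex characters of the digest, and `{rest}` is the remainder (the same two-character fan-out the LFS adapter section describes). A hedged sketch of that rendering (illustrative, not bigstore's actual code):

```python
def render_key(layout: str, hash_fn: str, digest: str) -> str:
    """Expand a storage-layout template into an object key.

    {prefix} is the first two hex chars of the digest, {rest} the remainder.
    """
    return (layout
            .replace("{hash_fn}", hash_fn)
            .replace("{prefix}", digest[:2])
            .replace("{rest}", digest[2:]))
```

With the default layout, a sha256 digest `a1b2c3...` lands at `files/sha256/a1/b2c3...`.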
Standard git mechanism for declaring which files use the bigstore filter:

```
*.bin filter=bigstore
*.safetensors filter=bigstore
models/** filter=bigstore
```

Tracked files are replaced in git with small pointer files:

```
bigstore
sha256
a1b2c3d4e5f6...
```

The third line is the 64-character hex digest. Pointers are 3 lines, ~81 bytes. The clean filter creates them on `git add`; the smudge filter restores the real content on checkout (if cached locally).
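The idempotency property from the original design can be sketched as a clean filter that recognizes its own output, so that `clean(clean(x)) == clean(x)` and re-filtering a pointer never corrupts it. Illustrative Python, not bigstore's actual implementation:

```python
import hashlib

MAGIC = b"bigstore"

def clean(blob: bytes) -> bytes:
    """Idempotent clean-filter sketch: real content becomes a 3-line pointer;
    an existing pointer passes through unchanged."""
    lines = blob.split(b"\n")
    if (len(lines) == 4 and lines[0] == MAGIC
            and lines[1] == b"sha256" and len(lines[2]) == 64 and lines[3] == b""):
        return blob  # already a pointer: running clean again is a no-op
    digest = hashlib.sha256(blob).hexdigest().encode()
    return MAGIC + b"\nsha256\n" + digest + b"\n"
```

The no-op branch is the fix for the git-media bug described above: without it, cleaning a pointer would hash the pointer text itself and silently replace the reference.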
Push and pull run up to 8 transfers concurrently by default. Override with `--jobs`:

```sh
git bigstore push --jobs 16
git bigstore pull --jobs 1   # sequential
```

Or set `BIGSTORE_JOBS` as a default:

```sh
export BIGSTORE_JOBS=16
git bigstore push          # uses 16
git bigstore push --jobs 4 # CLI flag wins
```

bigstore can import files tracked by DVC, verified against the DVC cache.
bigstore resolves the DVC cache location by running dvc cache dir. This means
shared/global caches (dvc cache dir --global ~/.dvc/cache) work automatically.
If dvc is not installed, bigstore falls back to .dvc/cache in the DVC
project directory.
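That resolution order can be sketched as follows (illustrative Python; bigstore's real logic lives in Rust):

```python
import subprocess
from pathlib import Path

def dvc_cache_dir(project_dir: Path) -> Path:
    """Ask dvc for its cache location; fall back to .dvc/cache if dvc is unavailable."""
    try:
        out = subprocess.run(["dvc", "cache", "dir"], cwd=project_dir,
                             capture_output=True, text=True, check=True)
        return Path(out.stdout.strip())
    except (OSError, subprocess.CalledProcessError):
        # dvc not installed (or the command failed): use the conventional location
        return project_dir / ".dvc" / "cache"
```

Because `dvc cache dir` reports whatever the project is configured with, global caches set via `dvc cache dir --global` resolve with no extra handling.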
```sh
git bigstore ref model.bin.dvc model.bin
echo 'model.bin filter=bigstore' >> .gitattributes
git add model.bin .gitattributes
git commit -m "migrate model from DVC"
git bigstore push
```

Most DVC repos use `.dir` tracking. Inspect first, then import:
```sh
# List contents
git bigstore dvc-ls models.dvc

# Import all (or use glob patterns for selective import)
git bigstore import-dvc-dir models.dvc models/

# Stage, commit, push
echo 'models/** filter=bigstore' >> .gitattributes
git add models/ .gitattributes
git commit -m "migrate models from DVC"
git bigstore push
```

Tested against a real monorepo with 34 `.dvc` files across nested DVC projects.
Prerequisites:

- Consolidate DVC cache (recommended for multi-worktree repos):

  ```sh
  dvc cache dir --global ~/.dvc/cache  # Move per-project caches into global cache
  ```

- Populate the DVC cache — objects must be pulled locally before import:

  ```sh
  dvc pull path/to/file.dvc
  ```

- Set credentials — `AWS_ACCESS_KEY_ID`/`AWS_SECRET_ACCESS_KEY` for push.

Per-artifact workflow:

- Classify: single-file (use `ref`) or `.dir` (use `import-dvc-dir`)
- Import — objects are md5-verified from the DVC cache
- Edit DVC `.gitignore` — DVC auto-generates `.gitignore` files next to `.dvc` files that ignore the output paths. Remove the relevant entries so `git add` can stage the bigstore-tracked files.
- `git add` — the clean filter re-hashes content as sha256 (bigstore's native hash). The md5 cache entries from DVC import remain for deduplication.
- `git bigstore status --verify` — confirm all files are `ok`
- Commit and push
What to watch for:

- DVC sibling `.gitignore` files must be edited per migrated output path. Without this, `git add` silently ignores the imported files.
- Content is auto-restored to the working tree after import (real data, not pointer text). The clean filter converts back to pointers on `git add`.
- If `git-bigstore` is not in PATH, set full filter paths before `git add`:

  ```sh
  git config filter.bigstore.clean "/path/to/git-bigstore filter-clean"
  git config filter.bigstore.smudge "/path/to/git-bigstore filter-smudge"
  ```
During git bigstore pull, if an md5-hashed object is not on the remote but
exists in the local DVC cache, bigstore imports it automatically with
verification.
The default storage layout (files/{hash_fn}/{prefix}/{rest}) is
DVC-compatible. Objects uploaded by bigstore can coexist with DVC objects in the
same bucket.
All three solve "large files in git." They differ in where control sits.
| | bigstore | Git LFS | DVC |
|---|---|---|---|
| Mechanism | Git clean/smudge filter | Git clean/smudge filter | Separate CLI, `.dvc` metafiles |
| Storage | Any S3-compatible bucket you own | Host's LFS server | Any remote (S3, GCS, SSH, etc.) |
| Pointer format | 3-line (`bigstore\nsha256\n<hex>`) | `version`, `oid`, `size` | YAML `.dvc` files |
| Server required | No (direct bucket access) | Yes (LFS HTTP API on host) | No |
| Billing | Your bucket costs | Host LFS quotas + bandwidth | Your bucket costs |
| DVC migration | Built-in (`ref`, `import-dvc-dir`) | None | N/A |
| File locking | No | Yes | No |
| Ecosystem support | Custom tooling | Broad (GitHub, GitLab, etc.) | ML/data pipelines |
| Integrity verification | Hash-verified on every transfer | Hash-verified | Hash-verified |
Git LFS when you want standard tooling with broad hosting support and don't mind host-managed storage. Best for teams on GitHub/GitLab who want minimal operational burden.
DVC when your large files are part of ML pipelines with versioned experiments, parameters, and metrics. DVC is a data pipeline tool that happens to store files, not a git extension.
bigstore when you want the git-native clean/smudge workflow with full control over your object storage. No LFS server needed, no host quotas, works with any S3-compatible bucket. Best for teams that already manage their own infrastructure.
bigstore + DVC: Content-level interop via ref and import-dvc-dir.
DVC-compatible storage layout allows coexistence in the same bucket. DVC is a
byte source for bigstore, not a shared pointer layer.
bigstore + Git LFS: Can coexist in one repo on different path patterns.
Migration from LFS: git lfs pull, change .gitattributes to
filter=bigstore, git add, push. No protocol-level interop — different
pointer formats, different transfer mechanisms.
Same file path cannot use both filters. Both bigstore and LFS use git clean/smudge, so applying both to the same path will break.
`git bigstore lfs-adapter` is a Git LFS custom transfer agent that lets LFS clients upload/download from bigstore's bucket. No LFS API server needed — LFS talks directly to your object store.
Setup (in an LFS-configured repo):

```sh
git config lfs.standalonetransferagent bigstore
git config lfs.customtransfer.bigstore.path git-bigstore
git config lfs.customtransfer.bigstore.args lfs-adapter
```

Config resolution:

1. `.bigstore.toml` (if present in repo)
2. `git config bigstore-lfs.url` + `bigstore-lfs.endpoint` (for LFS-only repos)

```sh
# LFS-only repo without .bigstore.toml:
git config bigstore-lfs.url s3://my-bucket/bigstore
git config bigstore-lfs.endpoint https://t3.storage.dev
```

Object mapping: LFS oid `sha256:<hex>` maps to `files/sha256/<2>/<rest>` — the same key bigstore uses natively. Shared bytes, separate pointer formats.
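The oid-to-key mapping can be sketched as follows (how the bucket `prefix` from config is joined in is an assumption based on the config examples above):

```python
def lfs_oid_to_key(oid: str, prefix: str = "bigstore") -> str:
    """Map an LFS oid ('sha256:<hex>') onto the shared bucket key.

    The two-character fan-out matches bigstore's native files/sha256/<2>/<rest> layout.
    """
    algo, digest = oid.split(":", 1)
    if algo != "sha256":
        raise ValueError("only sha256 OIDs are supported")
    return f"{prefix}/files/{algo}/{digest[:2]}/{digest[2:]}"
```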
Scope and limits:
- SHA-256 only (LFS OIDs are SHA-256; bigstore's native hash)
- Shared remote objects, separate local caches (LFS cache != bigstore cache)
- No locking support
- No pointer-format bridging — LFS pointers stay LFS, bigstore pointers stay bigstore
- Credentials via `AWS_ACCESS_KEY_ID`/`AWS_SECRET_ACCESS_KEY` (same as bigstore)
When to use it:
- Migrating a team from hosted LFS to owned bucket storage
- Letting LFS-native collaborators pull from a bigstore-managed bucket
- Coexisting LFS and bigstore in one org with shared object storage
If your repo has a .bigstore file (no .toml extension), bigstore will load
it with a deprecation warning. Run git bigstore migrate-config to upgrade.
Repos with layout templates that omit {hash_fn} (e.g.,
files/sha256/{prefix}/{rest}) continue to work for SHA-256 objects. MD5/DVC
objects require the {hash_fn} placeholder — bigstore will error with a clear
message if the layout doesn't support the hash function.
"no bigstore config found" — Run git bigstore init <url> first, or check
that .bigstore.toml is committed.
"not found on remote" — The object hasn't been pushed yet. Run
git bigstore push from a machine that has the file cached.
"pointer only (needs pull)" — The file is tracked but not downloaded. Run
git bigstore pull.
"integrity check failed" — A downloaded or cached object doesn't match its expected hash. This indicates corruption in transit or at rest. Delete the corrupted cache entry and re-pull.
"layout template does not contain {hash_fn}" — Your .bigstore.toml uses a
legacy layout that only supports SHA-256. Update the layout to
files/{hash_fn}/{prefix}/{rest} to support MD5/DVC objects.