Cylf is a shared, format-agnostic codec layer for chunked data formats. It provides a common foundation for encoding and decoding data chunks, so that codec implementations, optimizations, and bug fixes benefit every format that builds on it rather than being locked inside format-specific libraries.
Formats such as Zarr, TIFF/COG, HDF5, Parquet, and ORC all use the same fundamental codecs (zstd, lz4, shuffle, delta, bit-packing), but each implements them independently. Cylf defines a shared interface for codecs, a composition model for combining them into pipelines, and a runtime that executes those pipelines. It pairs a standard library of native codecs for performance with a WebAssembly runtime for portability and extensibility.
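To make the idea of a shared codec interface and a composition model concrete, here is a minimal Python sketch. It is not Cylf's actual interface: the `Codec` class, the `encode_chunk` and `decode_chunk` helpers, and the linear list used as a pipeline are all hypothetical names chosen for illustration, and the standard library's `zlib` stands in for zstd or lz4. The real pipeline model is a DAG with typed ports, as the Signature Model and Pipeline Model documents describe.

```python
import zlib
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Codec:
    """Hypothetical bidirectional codec: a named encode/decode pair over bytes."""
    name: str
    encode: Callable[[bytes], bytes]
    decode: Callable[[bytes], bytes]


def delta_encode(data: bytes) -> bytes:
    # Store each byte as its difference from the previous byte (mod 256).
    prev = 0
    out = bytearray()
    for b in data:
        out.append((b - prev) & 0xFF)
        prev = b
    return bytes(out)


def delta_decode(data: bytes) -> bytes:
    # A running sum (mod 256) inverts delta_encode.
    prev = 0
    out = bytearray()
    for d in data:
        prev = (prev + d) & 0xFF
        out.append(prev)
    return bytes(out)


# Two codecs behind the same interface. zlib is a stand-in for zstd/lz4.
DELTA = Codec("delta", delta_encode, delta_decode)
DEFLATE = Codec("deflate", zlib.compress, zlib.decompress)


def encode_chunk(pipeline: Sequence[Codec], chunk: bytes) -> bytes:
    # Apply each codec's encode step in pipeline order.
    for codec in pipeline:
        chunk = codec.encode(chunk)
    return chunk


def decode_chunk(pipeline: Sequence[Codec], chunk: bytes) -> bytes:
    # Decoding walks the same pipeline in reverse.
    for codec in reversed(pipeline):
        chunk = codec.decode(chunk)
    return chunk


pipeline = [DELTA, DEFLATE]
chunk = bytes(range(256)) * 4
encoded = encode_chunk(pipeline, chunk)
assert decode_chunk(pipeline, encoded) == chunk
```

Because every codec exposes the same encode/decode pair, the same `pipeline` list could in principle serve any format that stores delta-then-compressed chunks, which is the kind of reuse the shared layer is meant to enable.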
- Motivation — why a shared codec layer should exist, and what it enables
- Architecture — the components of the Cylf codec layer, how they relate, and where the boundaries lie
- Glossary — quick reference for terms used throughout the specification
- Signature Model — how codecs describe their interfaces: typed ports, bidirectional encode/decode blocks, awareness taxonomy, and format alias mappings
- Pipeline Model — how codecs compose into executable DAGs: steps, wiring references, constants, explicit bidirectional definitions
- Codec Inventory — catalog of 60+ codecs across formats, with signatures, aliases, and composition properties
- Codec Contract — the interface contract between the pipeline engine and codec implementations: WIT interface (Component Model), binary ABI (Core Wasm), native interface, and signature format
- Distribution & Registry — how codec artifacts are distributed: HTTPS, OCI registries, the warg ecosystem, native plugins, and embedded codecs
- Pipeline Architecture Tradeoffs — analysis of mixed native + Wasm vs all-Wasm pipelines, Core module ABI vs Component Model, and how the two axes interact
- Performance & Data Copy Accounting — copy counts per step boundary, the Canonical ABI bottleneck under Python, why a native orchestrator is required, and approaches to copy reduction
- F3 Comparison — comparison with the F3 file format: where the approaches overlap, where they diverge, and how they could complement each other
- Open Questions & Roadmap — unresolved design questions and planned work across the ecosystem
The following forward-looking documents explore how the codec layer could serve as infrastructure for higher-level systems. They are drafts and not part of the core spec.
- Format Drivers and Data Orchestration — a plan format and execution model for orchestrating storage I/O, codec pipelines, format-specific callbacks, and concurrent chunk processing
- CCRP and the Cylf Ecosystem — how the codec layer could enable a protocol for querying multidimensional datasets across formats and storage backends (planned)
Currently only one implementation exists: chonkle, a proof-of-concept Python host for the codec pipeline engine. It is a first research implementation, intended to inform the spec and future development rather than to serve as the reference runtime.
Cylf is an open source project at github.com/cylf-dev. If you are interested in contributing, the Open Questions & Roadmap lists areas where input is most valuable.