A command-line tool for streaming Parquet as line-delimited JSON.
It reads only the required byte ranges from file, HTTP, or S3 sources, and supports offset/limit and column selection.
It uses the official native Rust implementation of Apache Parquet, which has excellent support for compression formats and complex types.
Install from crates.io and execute from the command line, e.g.:
$ cargo install parquet2json
$ parquet2json --help
Usage: parquet2json <FILE> <COMMAND>

Commands:
  cat       Outputs data as JSON lines
  schema    Outputs the Thrift schema
  rowcount  Outputs only the total row count
  help      Print this message or the help of the given subcommand(s)

Arguments:
  <FILE>  Location of Parquet input file (file path, HTTP or S3 URL)

Options:
  -h, --help     Print help
  -V, --version  Print version
$ parquet2json cat --help
Usage: parquet2json <FILE> cat [OPTIONS]

Options:
  -o, --offset <OFFSET>    Starts outputting from this row (first row: 0, last row: -1) [default: 0]
  -l, --limit <LIMIT>      Maximum number of rows to output
  -c, --columns <COLUMNS>  Select columns by name (comma,separated,?prefixed_optional)
  -n, --nulls              Outputs null values
  -h, --help               Print help

Use it to stream output to files and other tools such as grep and jq.
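The flags above can be combined. For example, a sketch that skips the first 100 rows, emits 10, and selects two columns (the column names here are hypothetical, and the ? prefix presumably marks a column that may be absent from the file without causing an error):

$ parquet2json ./myfile.parquet cat --offset=100 --limit=10 --columns=id,name,?notes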
$ parquet2json ./myfile.parquet cat > output.jsonl
$ parquet2json s3://your-bucket/your-file.parquet cat
$ parquet2json https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet rowcount
$ parquet2json ./myfile.pq cat --columns=url,level | jq 'select(.level==3) | .url'

Credentials are provided as per the standard AWS toolchain, i.e. via environment variables (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY), the AWS credentials file, or an IAM ECS container/instance profile.
The default AWS region must be set via the AWS_DEFAULT_REGION environment variable or in the AWS credentials file, and it must match the region of the object's bucket.
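For example, assuming the bucket lives in us-east-1 (the bucket and file names here are placeholders), the region can be supplied inline:

$ AWS_DEFAULT_REGION=us-east-1 parquet2json s3://your-bucket/your-file.parquet cat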
Custom S3-compatible endpoints are supported via the endpoint_url key in an AWS profile or the AWS_ENDPOINT_URL environment variable. For example, to use Cloudflare R2, add endpoint_url to your profile in ~/.aws/config:
[profile r2]
aws_access_key_id = <R2_ACCESS_KEY_ID>
aws_secret_access_key = <R2_SECRET_ACCESS_KEY>
endpoint_url = https://<ACCOUNT_ID>.r2.cloudflarestorage.com

Then read from R2 using that profile:
$ AWS_PROFILE=r2 parquet2json s3://mybucket/myfile.parquet cat
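Alternatively, a sketch of the environment-variable route mentioned above: the endpoint can be passed via AWS_ENDPOINT_URL without a named profile, with credentials still resolved from the environment or credentials file:

$ AWS_ENDPOINT_URL=https://<ACCOUNT_ID>.r2.cloudflarestorage.com parquet2json s3://mybucket/myfile.parquet cat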