A command-line tool for streaming Parquet as line-delimited JSON.
It reads only the required byte ranges from file, HTTP, or S3 sources, and supports offset/limit and column selection.
It uses the official native Rust implementation of Apache Parquet, which has excellent support for compression formats and complex types.
Install from crates.io and execute from the command line, e.g.:
$ cargo install parquet2json
$ parquet2json --help
Usage: parquet2json <FILE> <COMMAND>

Commands:
  cat       Outputs data as JSON lines
  schema    Outputs the Thrift schema
  rowcount  Outputs only the total row count
  help      Print this message or the help of the given subcommand(s)

Arguments:
  <FILE>  Location of Parquet input file (file path, HTTP or S3 URL)

Options:
  -h, --help     Print help
  -V, --version  Print version
$ parquet2json cat --help
Usage: parquet2json <FILE> cat [OPTIONS]

Options:
  -o, --offset <OFFSET>    Starts outputting from this row (first row: 0, last row: -1) [default: 0]
  -l, --limit <LIMIT>      Maximum number of rows to output
  -c, --columns <COLUMNS>  Select columns by name (comma,separated,?prefixed_optional)
  -n, --nulls              Outputs null values
  -h, --help               Print help

Use it to stream output to files and other tools such as grep and jq.
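The flags above can be combined. For example, a sketch that skips the first 100 rows, emits 10, and selects two columns (the column names here are hypothetical, and the ? prefix presumably marks a column that may be absent from the file without causing an error):

$ parquet2json ./myfile.parquet cat --offset=100 --limit=10 --columns=id,name,?notes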
$ parquet2json ./myfile.parquet cat > output.jsonl
$ parquet2json s3://your-bucket/your-file.parquet cat
$ parquet2json https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet rowcount
$ parquet2json ./myfile.pq cat --columns=url,level | jq 'select(.level==3) | .url'

Credentials are provided as per the standard AWS toolchain, i.e. via environment variables (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY), the AWS credentials file, or an IAM ECS container/instance profile.
The default AWS region must be set via the AWS_DEFAULT_REGION environment variable or in the AWS credentials file, and it must match the region of the object's bucket.
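For example, assuming the bucket lives in us-east-1 (the bucket and file names here are placeholders), the region can be supplied inline:

$ AWS_DEFAULT_REGION=us-east-1 parquet2json s3://your-bucket/your-file.parquet cat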
Custom S3-compatible endpoints are supported via the endpoint_url key in an AWS profile or the AWS_ENDPOINT_URL environment variable. For example, to use Cloudflare R2, add endpoint_url to your profile in ~/.aws/config:
[profile r2]
aws_access_key_id = <R2_ACCESS_KEY_ID>
aws_secret_access_key = <R2_SECRET_ACCESS_KEY>
endpoint_url = https://<ACCOUNT_ID>.r2.cloudflarestorage.com

Then read from R2 using that profile:
$ AWS_PROFILE=r2 parquet2json s3://mybucket/myfile.parquet cat
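Alternatively, a sketch of the environment-variable route mentioned above: the endpoint can be passed via AWS_ENDPOINT_URL without a named profile, with credentials still resolved from the environment or credentials file:

$ AWS_ENDPOINT_URL=https://<ACCOUNT_ID>.r2.cloudflarestorage.com parquet2json s3://mybucket/myfile.parquet cat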