
load_table consumes enormous amounts of memory on large metadata file #3162

@thomas-pfeiffer

Description

Apache Iceberg version

0.11.0 (latest release)

Please describe the bug 🐞

Apologies, this is a bit of a fuzzy one right now, but I thought I'd report it anyway.

Context:
We're using Iceberg with AWS Glue and AWS S3 as storage. Roughly speaking, there are three kinds of files in S3: metadata, manifests, and data files. The first file read when loading a table via catalog.load_table() is the metadata file, which contains information on all current* snapshots and schema versions of the table. PyIceberg appears to load this file completely into memory.

Issue:
As we worked on the Iceberg table, a lot of snapshots were created over time, and with them a lot of schema versions. This caused the latest metadata file to grow to ~10 MB gzip-compressed (~250 MB as uncompressed JSON). Loading this table via catalog.load_table() consumes ~4 GB of memory (total usage of the Python process, measured with memray). That is a lot, especially since we only need the latest snapshot and its corresponding schema version, which is probably true for most users.

Semi-Workaround:
One could try to expire some snapshots, e.g. via Spark's expire_snapshots procedure (https://iceberg.apache.org/docs/1.10.0/spark-procedures/#expire_snapshots), but this will not get rid of the old / unused schemas unless you also set clean_expired_metadata (which is only supported since 1.10.x, so relatively new).
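For reference, the workaround might look roughly like the following Spark SQL call. The catalog name, table name, and timestamp are placeholders, and the clean_expired_metadata argument assumes Iceberg 1.10+ as noted above:

```sql
-- Sketch only: expire snapshots older than a cutoff and additionally ask
-- Iceberg (1.10+) to remove schemas no longer referenced by any snapshot.
CALL my_catalog.system.expire_snapshots(
  table => 'db.my_table',
  older_than => TIMESTAMP '2024-01-01 00:00:00',
  clean_expired_metadata => true
);
```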

(Preliminary) Root-Cause:
I believe the issue is that we leverage Pydantic's model_validate_json in

```python
return TableMetadataWrapper.model_validate_json(data).root
```

which loads the whole JSON into memory, and we then seem to keep the full TableMetadata object around.

Suggestion:
Would it make sense to avoid parsing the full JSON into memory and instead load the needed snapshots and schemas lazily / on demand? (It would also be fine if that were a configurable option of catalog.load_table().)
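To illustrate the idea, here is a minimal sketch of pruning a table-metadata document down to the current snapshot and schema before full model validation, so only the needed entries are retained. The prune_metadata helper and the sample document are hypothetical; the field names ("current-snapshot-id", "snapshots", "current-schema-id", "schemas") follow the Iceberg table-metadata spec, and a streaming parser such as ijson could avoid even the one-shot json.loads shown here:

```python
import json

def prune_metadata(raw: str) -> dict:
    """Keep only the current snapshot and current schema from a
    table-metadata JSON document (hypothetical sketch)."""
    doc = json.loads(raw)  # still parses once; streaming would avoid this too
    current_snapshot = doc.get("current-snapshot-id")
    doc["snapshots"] = [
        s for s in doc.get("snapshots", []) if s["snapshot-id"] == current_snapshot
    ]
    current_schema = doc.get("current-schema-id")
    doc["schemas"] = [
        s for s in doc.get("schemas", []) if s["schema-id"] == current_schema
    ]
    return doc

# Tiny illustrative document with two snapshots and two schema versions.
sample = json.dumps({
    "current-snapshot-id": 2,
    "snapshots": [{"snapshot-id": 1}, {"snapshot-id": 2}],
    "current-schema-id": 1,
    "schemas": [{"schema-id": 0}, {"schema-id": 1}],
})

pruned = prune_metadata(sample)
print(len(pruned["snapshots"]), len(pruned["schemas"]))  # 1 1
```

Validating the pruned document instead of the raw one would keep the resulting object proportional to the current state of the table rather than its whole history.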

Remark:
Obviously we could blame this on an unmaintained Iceberg table, but I think it would be good for the PyIceberg library to be robust against such scenarios, which is why I opened this issue.

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
