RFE: stop oc image mirror creating duplicate files when mirroring to disk for an airgap install #1388

@m-g-k

When running a command like:
oc image mirror -f images-mapping-to-filesystem.txt --filter-by-os '.*' --skip-multiple-scopes --max-per-registry=1

some manifest and blob files are duplicated into different folders. For example, if I run this command from inside the root v2 folder after the mirror is complete, I see:

find -name sha256:f2b490814c92873a4f533992ccbc1e625e6afbaf01046a25debf0ed487e38ebc -printf "%p %s\n"
./v2/<path1/to/image>/blobs/sha256:f2b490814c92873a4f533992ccbc1e625e6afbaf01046a25debf0ed487e38ebc 43875821
./v2/<path2/to/image>/blobs/sha256:f2b490814c92873a4f533992ccbc1e625e6afbaf01046a25debf0ed487e38ebc 43875821
./v2/<path3/to/image>/blobs/sha256:f2b490814c92873a4f533992ccbc1e625e6afbaf01046a25debf0ed487e38ebc 43875821
./v2/<path4/to/image>/blobs/sha256:f2b490814c92873a4f533992ccbc1e625e6afbaf01046a25debf0ed487e38ebc 43875821
./v2/<path5/to/image>/blobs/sha256:f2b490814c92873a4f533992ccbc1e625e6afbaf01046a25debf0ed487e38ebc 43875821
./v2/<path6/to/image>/blobs/sha256:f2b490814c92873a4f533992ccbc1e625e6afbaf01046a25debf0ed487e38ebc 43875821
./v2/<path7/to/image>/blobs/sha256:f2b490814c92873a4f533992ccbc1e625e6afbaf01046a25debf0ed487e38ebc 43875821
./v2/<path8/to/image>/blobs/sha256:f2b490814c92873a4f533992ccbc1e625e6afbaf01046a25debf0ed487e38ebc 43875821
./v2/<path9/to/image>/blobs/sha256:f2b490814c92873a4f533992ccbc1e625e6afbaf01046a25debf0ed487e38ebc 43875821
./v2/<path10/to/image>/blobs/sha256:f2b490814c92873a4f533992ccbc1e625e6afbaf01046a25debf0ed487e38ebc 43875821

This shows that out of ~438 MB downloaded, ~394 MB are duplicates. This is an extreme case, but across a whole airgap mirror I'm seeing on average about 1/3 of the size taken up by duplicate files, and for some large mirrors I see over 100 GB of duplicates.
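For reference, the figures above follow directly from the listing: ten copies of a 43875821-byte blob, of which only one is needed. A minimal sketch of that arithmetic (the variable names are just for illustration):

```shell
size=43875821   # bytes per copy, taken from the find output above
copies=10       # number of paths the blob appears under

# Everything downloaded for this one blob
echo "total:      $(( size * copies )) bytes"
# All copies except the first are redundant
echo "duplicated: $(( size * (copies - 1) )) bytes"
```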

If the command below is run from the root of a mirrored folder on disk (inside the v2 folder), it lists every duplicated file, preceded by a count of how many copies exist and followed by the size of each copy in bytes:

find -name 'sha256:*' -printf "%f %s\n" | sort | uniq -dc | sort -n
 ...
 9 blobs/sha256:5d9ff8920718132b2498fcbe2cfd5477e94d38f7f70e4aa319b44df5bf62a9e0 39235316
10 blobs/sha256:2f19a8cf89693277baaa454087d49d95967ad8872e2bcc44741d4046abaf1cd6 37461527
10 blobs/sha256:c3b490814c92873a4f533992ccbc1e625e6afbaf01046a25debf0ed487e38eba 43875821
14 blobs/sha256:fc70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef 1024
 ...

The example above shows there are 9 copies of the first blob starting sha256:5d9ff... and each one is 39235316 bytes in size.

The command below instead adds up the space taken by every copy beyond the first and prints the total, so you can see how big the problem is on different mirrors:

find -name 'sha256:*' -printf "%f %s\n" | sort | uniq -dc | sed -e "s/^ *\([0-9]*\) .* \([0-9]*\)/((\1-1)*\2)/" | paste -sd+ | bc | numfmt --to=iec
130G
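If the sed/paste/bc chain is awkward to adapt, the same total can probably be computed in a single awk pass. This is only a sketch: `dup_bytes` is a hypothetical helper name, and it assumes each input line is "name size" as produced by the find command above, and that files with the same digest name have the same size.

```shell
# Read "name size" lines on stdin and print the number of bytes
# held by every copy after the first, summed over all names.
dup_bytes() {
  awk '{ count[$1]++; size[$1] = $2 }
       END {
         for (n in count) total += (count[n] - 1) * size[n]
         printf "%.0f\n", total   # avoid exponent notation for large totals
       }'
}
```

It would be used in place of the tail of the pipeline, for example `find -name 'sha256:*' -printf "%f %s\n" | dup_bytes | numfmt --to=iec`.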

Given that the main purpose of oc image mirror is to mirror a registry to prepare for an airgap install, this is a lot of wasted space and time when mirroring large repositories. Therefore, it would be really helpful to eliminate the duplicates, perhaps by using the link file mechanism that some registries use internally, such as the manifestTagIndexEntryLinkPathSpec and the layerLinkPathSpec from distribution.
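To illustrate the kind of saving being asked for, here is a rough post-processing sketch that replaces later copies of a same-named file with hard links to the first copy. To be clear, this is not what oc image mirror does today and it is not the link-file mechanism from distribution; it swaps in plain filesystem hard links as a stand-in. `dedupe_blobs` is a hypothetical name, and the sketch assumes GNU find, bash, a single filesystem, no newlines in paths, and that files named with the same sha256 digest have identical content. Whether a mirror treated this way still works when pushed back to a registry has not been checked here.

```shell
# Replace every later copy of a same-named sha256:* file under $1
# with a hard link to the first copy found (assumptions in the text above).
dedupe_blobs() {
  root=${1:-.}
  prev_name=
  first_path=
  # Emit "name<TAB>path", sort so identical names are adjacent
  find "$root" -type f -name 'sha256:*' -printf '%f\t%p\n' | sort |
  while IFS="$(printf '\t')" read -r name path; do
    if [ "$name" = "$prev_name" ]; then
      # Skip files that already share an inode with the first copy
      [ "$first_path" -ef "$path" ] || ln -f -- "$first_path" "$path"
    else
      prev_name=$name
      first_path=$path
    fi
  done
}
```

Running `dedupe_blobs ./v2` on a copy of the mirror and then comparing `du -sh` before and after would show how much space a built-in mechanism could save.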

Happy to provide more information if required.

MGK

Labels: lifecycle/frozen (indicates that an issue or PR should not be auto-closed due to staleness)