When running a command like:
oc image mirror -f images-mapping-to-filesystem.txt --filter-by-os '.*' --skip-multiple-scopes --max-per-registry=1
some manifest and blob files are duplicated into different folders. For example, if I run this command from inside the root v2 folder after the mirror is complete I see:
find -name sha256:f2b490814c92873a4f533992ccbc1e625e6afbaf01046a25debf0ed487e38ebc -printf "%p %s\n"
./v2/<path1/to/image>/blobs/sha256:f2b490814c92873a4f533992ccbc1e625e6afbaf01046a25debf0ed487e38ebc 43875821
./v2/<path2/to/image>/blobs/sha256:f2b490814c92873a4f533992ccbc1e625e6afbaf01046a25debf0ed487e38ebc 43875821
./v2/<path3/to/image>/blobs/sha256:f2b490814c92873a4f533992ccbc1e625e6afbaf01046a25debf0ed487e38ebc 43875821
./v2/<path4/to/image>/blobs/sha256:f2b490814c92873a4f533992ccbc1e625e6afbaf01046a25debf0ed487e38ebc 43875821
./v2/<path5/to/image>/blobs/sha256:f2b490814c92873a4f533992ccbc1e625e6afbaf01046a25debf0ed487e38ebc 43875821
./v2/<path6/to/image>/blobs/sha256:f2b490814c92873a4f533992ccbc1e625e6afbaf01046a25debf0ed487e38ebc 43875821
./v2/<path7/to/image>/blobs/sha256:f2b490814c92873a4f533992ccbc1e625e6afbaf01046a25debf0ed487e38ebc 43875821
./v2/<path8/to/image>/blobs/sha256:f2b490814c92873a4f533992ccbc1e625e6afbaf01046a25debf0ed487e38ebc 43875821
./v2/<path9/to/image>/blobs/sha256:f2b490814c92873a4f533992ccbc1e625e6afbaf01046a25debf0ed487e38ebc 43875821
./v2/<path10/to/image>/blobs/sha256:f2b490814c92873a4f533992ccbc1e625e6afbaf01046a25debf0ed487e38ebc 43875821
This shows that out of ~438 MB downloaded, ~394 MB are duplicates. Obviously this is an extreme case, but over a whole airgap mirror I'm seeing that, on average, about 1/3 of the size is taken up by duplicate files, and in some cases I see over 100 GB of duplicates for large mirrors.
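As a rough check of those figures, here is a minimal sketch in shell arithmetic. The size and copy count are taken from the listing above; the variable names are only illustrative.

```shell
# Size in bytes of the blob from the listing above, and how many copies were found.
blob_size=43875821
copies=10

total_bytes=$((blob_size * copies))          # space used by all copies
wasted_bytes=$((blob_size * (copies - 1)))   # everything beyond the first copy

echo "total:  ${total_bytes} bytes"
echo "wasted: ${wasted_bytes} bytes"
```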
If the command below is run from the root of a mirrored folder on disk (inside the v2 folder), it lists every duplicated file, preceded by the number of copies found and followed by the size of the file in bytes:
find -name 'sha256:*' -printf "%f %s\n" | sort | uniq -dc | sort -n
...
9 blobs/sha256:5d9ff8920718132b2498fcbe2cfd5477e94d38f7f70e4aa319b44df5bf62a9e0 39235316
10 blobs/sha256:2f19a8cf89693277baaa454087d49d95967ad8872e2bcc44741d4046abaf1cd6 37461527
10 blobs/sha256:c3b490814c92873a4f533992ccbc1e625e6afbaf01046a25debf0ed487e38eba 43875821
14 blobs/sha256:fc70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef 1024
...
The example above shows that there are 9 copies of the first blob, starting with sha256:5d9ff..., and that each copy is 39235316 bytes in size.
The command below instead adds up the space lost to duplicates (every copy beyond the first), so you can see how big the problem is on different mirrors:
find -name 'sha256:*' -printf "%f %s\n" | sort | uniq -dc | sed -e "s/^ *\([0-9]*\) .* \([0-9]*\)/((\1-1)*\2)/" | paste -sd+ | bc | numfmt --to=iec
130G
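To make the sed step in that pipeline easier to follow, here is a small sketch that applies it to a single line in the format printed by uniq -dc (count, file name, size). The sample line is built from one of the entries listed earlier; the variable names are only illustrative.

```shell
# One line as printed by `uniq -dc`: count, file name, size in bytes.
sample='     10 sha256:2f19a8cf89693277baaa454087d49d95967ad8872e2bcc44741d4046abaf1cd6 37461527'

# Rewrite "COUNT NAME SIZE" as "((COUNT-1)*SIZE)", i.e. the bytes used by
# every copy beyond the first one.
expr=$(printf '%s\n' "$sample" | sed -e 's/^ *\([0-9]*\) .* \([0-9]*\)/((\1-1)*\2)/')

echo "$expr"
```

In the full pipeline, paste joins one such expression per duplicated file with +, bc evaluates the sum, and numfmt prints it in a human-readable unit.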
Given that the main purpose of oc image mirror is to mirror a registry to prepare for an airgap install, this is a lot of wasted space and time when mirroring large repositories. Therefore, it would be really helpful to eliminate the duplicates, perhaps by using the link file mechanism that some registries use internally, such as the manifestTagIndexEntryLinkPathSpec and the layerLinkPathSpec from distribution.
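In the meantime, one possible stopgap is to replace the duplicate files with hard links after the mirror has finished. The following is only a sketch of that idea, not something oc provides: dedupe_mirror is a hypothetical helper name, and it assumes GNU find, a mirror that sits on a single filesystem, and paths that contain no tabs or newlines. I have not checked how oc image mirror behaves when it later reads a tree that has been modified this way.

```shell
# Hypothetical helper (not part of oc): hard-link duplicate digest-named files.
# Assumes GNU find, a single filesystem, and paths without tabs or newlines.
dedupe_mirror() {
  find "$1" -type f -name 'sha256:*' -printf '%f\t%p\n' | sort | {
    prev_name=
    keep=
    while IFS="$(printf '\t')" read -r name path; do
      if [ "$name" != "$prev_name" ]; then
        # First file seen for this digest: keep it as the canonical copy.
        prev_name=$name
        keep=$path
      elif [ ! "$path" -ef "$keep" ] && cmp -s "$keep" "$path"; then
        # Byte-identical copy: replace it with a hard link to the kept file.
        ln -f "$keep" "$path"
      fi
    done
  }
}
```

It would be called as dedupe_mirror ./v2 from the folder that contains the mirror. The saving only survives a transfer if the copy tool preserves hard links (for example, rsync needs -H), which is one reason a fix inside oc image mirror itself would be preferable.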
Happy to provide more information if required.
MGK