Problem Description
The current CloudPath implementation makes multiple redundant metadata API calls during common operations like open(), download_to(), and copy(). Each call to exists(), is_file(), is_dir(), and stat() results in a separate _get_metadata() call to Azure Blob Storage, even though all these properties are available from a single metadata response.
What happens during an open() call
- open() calls exists() + is_file()
- _refresh_cache() calls stat()
- download_to() calls exists() + is_file() again
On Azure, all of these end up calling the same AzureBlobClient._get_metadata(), which returns all the necessary information (existence, file/directory status, size, last modified time) in a single API call.
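To make the redundancy concrete, here is a toy model of the call chain listed above. FakeClient, FakePath, and the call counter are hypothetical stand-ins, not cloudpathlib code, and the sketch assumes download_to() is reached through _refresh_cache(). It only mirrors the sequence described in this issue.

```python
class FakeClient:
    """Hypothetical stand-in for AzureBlobClient that counts metadata requests."""

    def __init__(self):
        self.metadata_calls = 0

    def _get_metadata(self, path):
        # In a real client, each call here would be one network round trip.
        self.metadata_calls += 1
        return {"exists": True, "is_file": True, "size": 1024}


class FakePath:
    """Hypothetical stand-in for AzureBlobPath, mirroring the call chain above."""

    def __init__(self, client):
        self.client = client

    def exists(self):
        return self.client._get_metadata(self)["exists"]

    def is_file(self):
        return self.client._get_metadata(self)["is_file"]

    def stat(self):
        return self.client._get_metadata(self)

    def download_to(self):
        # Repeats the same two checks that open() already made.
        if not (self.exists() and self.is_file()):
            raise ValueError("not a file")

    def _refresh_cache(self):
        self.stat()
        self.download_to()  # assumption: the download is triggered from here

    def open(self):
        if not (self.exists() and self.is_file()):
            raise FileNotFoundError("blob does not exist")
        self._refresh_cache()


client = FakeClient()
FakePath(client).open()
print(client.metadata_calls)  # 5
```

In this model a single open() issues five metadata requests, although the first response already contains everything the later checks need.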
Performance Impact
After removing the redundant calls, I was able to achieve:
- ~2× speedup for 1 MB downloads
- ~1.5× speedup for 10 MB downloads
Proposal
There are two possible solutions:
Option 1: Azure-specific optimization
Optimize this in AzureBlobClient and AzureBlobPath.
Implementation:
- Add _get_blob_properties() to AzureBlobClient that returns all the needed information in one call
- Store the result of AzureBlobClient._get_blob_properties() at the start of e.g. AzureBlobPath.open()
- Pass metadata between internal methods to avoid redundant calls
- Alternatively implement metadata caching/invalidation logic
Example:
def open(self, mode="r", **kwargs):
    meta = self.client._get_blob_properties(self)  # Single call
    if meta.exists and meta.is_directory:
        raise CloudPathIsADirectoryError(...)
    if mode == "x" and meta.exists:
        raise CloudPathFileExistsError(...)
    self._refresh_cache_with_meta(meta, **kwargs)  # Reuse metadata
    # ... rest of implementation
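The caching/invalidation alternative mentioned in the list above could look roughly like the following. This is a minimal sketch under stated assumptions: BlobMeta, MetadataCache, the TTL value, and fake_fetch are all hypothetical names for illustration, and a real implementation would need to decide which operations invalidate an entry.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class BlobMeta:
    """Hypothetical snapshot of what one metadata request returns."""
    exists: bool
    is_directory: bool = False
    size: int = 0


class MetadataCache:
    """Hypothetical cache keyed by blob path, with a short time-to-live."""

    def __init__(self, fetch: Callable[[str], BlobMeta], ttl_seconds: float = 1.0):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._entries: Dict[str, Tuple[float, BlobMeta]] = {}

    def get(self, key: str) -> BlobMeta:
        now = time.monotonic()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: no API call
        meta = self._fetch(key)
        self._entries[key] = (now, meta)
        return meta

    def invalidate(self, key: str) -> None:
        # Call after any write, delete, or move that changes the blob.
        self._entries.pop(key, None)


fetched = []  # records every simulated API call


def fake_fetch(key: str) -> BlobMeta:
    fetched.append(key)
    return BlobMeta(exists=True, size=1024)


cache = MetadataCache(fake_fetch, ttl_seconds=60.0)
cache.get("container/blob.txt")
cache.get("container/blob.txt")  # served from the cache
calls_before_invalidate = len(fetched)
cache.invalidate("container/blob.txt")
cache.get("container/blob.txt")  # fetched again after invalidation
calls_after_invalidate = len(fetched)
```

Passing metadata explicitly, as in the example above, avoids staleness questions entirely. A cache trades that simplicity for fewer signature changes, at the cost of having to invalidate on every mutation.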
Option 2: CloudPath optimization
Change the Client API and optimize CloudPath.
- Modify the Client API to explicitly require a _get_metadata() method that fetches all the required data
- Apply an optimization to CloudPath similar to the one described in Option 1
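As a rough illustration of Option 2, the sketch below shows an abstract base class that forces every backend to implement _get_metadata(). PathMetadata, InMemoryClient, and open_for_read are hypothetical names invented for this example, and the Client class here is a simplified stand-in rather than the library's actual Client API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class PathMetadata:
    """Hypothetical container for everything one metadata request returns."""
    exists: bool
    is_dir: bool = False
    size: Optional[int] = None

    @property
    def is_file(self) -> bool:
        return self.exists and not self.is_dir


class Client(ABC):
    """Simplified stand-in for a base client that requires _get_metadata()."""

    @abstractmethod
    def _get_metadata(self, path: str) -> PathMetadata:
        """Return existence, type, and size in a single backend request."""


class InMemoryClient(Client):
    """Hypothetical backend used only to exercise the interface."""

    def __init__(self, blobs: Dict[str, bytes]):
        self.blobs = blobs
        self.requests = 0

    def _get_metadata(self, path: str) -> PathMetadata:
        self.requests += 1
        if path in self.blobs:
            return PathMetadata(exists=True, size=len(self.blobs[path]))
        prefix = path.rstrip("/") + "/"
        if any(key.startswith(prefix) for key in self.blobs):
            return PathMetadata(exists=True, is_dir=True)
        return PathMetadata(exists=False)


def open_for_read(client: InMemoryClient, path: str) -> bytes:
    meta = client._get_metadata(path)  # the only metadata request
    if not meta.exists:
        raise FileNotFoundError(path)
    if meta.is_dir:
        raise IsADirectoryError(path)
    return client.blobs[path]


client = InMemoryClient({"data/a.txt": b"hello"})
content = open_for_read(client, "data/a.txt")
```

Because the method is abstract, a backend that does not implement it cannot be instantiated, which is what would make the requirement explicit rather than a convention.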
PR for Option 1 coming