chore: Setting the default spark_version value from pyspark.__version__#175
chore: Setting the default spark_version value from pyspark.__version__#175aagumin wants to merge 3 commits into
Conversation
chenliu0831
left a comment
There was a problem hiding this comment.
Thanks - I think this is an acceptable solution (although not airtight :)). A few comments
|
Test is failing - @Argo3V5A would you take a look? |
| spark_version = os.getenv("SPARK_VERSION") | ||
| if not spark_version: | ||
| spark_version = str(pyspark.__version__) | ||
| logging.info( |
There was a problem hiding this comment.
BUG: The except KeyError block on line 30 is now dead code. os.getenv() never raises KeyError (it returns None on missing keys). If pyspark.__version__ somehow fails or the version string is invalid, the KeyError exception handler will never be triggered. The except KeyError block should be removed or replaced with appropriate error handling for the new logic.
Line 25 changed from
os.environ["SPARK_VERSION"](which raises KeyError) toos.getenv("SPARK_VERSION")(which returns None). Theexcept KeyError:on line 30 is now unreachable dead code.
|
|
||
| import pyspark | ||
|
|
||
| SPARK_TO_DEEQU_COORD_MAPPING = { |
There was a problem hiding this comment.
DESIGN: Importing pyspark at module top-level in configs.py creates a hard dependency on pyspark being installed just to read configuration. Previously pyspark was an optional dependency (see pyproject.toml: pyspark = { version = ">=2.4.7,<4.0.0", optional = true }). This import will cause an ImportError if pyspark is not installed, breaking the package for users who haven't installed the optional extra.
pyproject.toml declares
pyspark = { version = ">=2.4.7,<4.0.0", optional = true }under[tool.poetry.extras]. Line 8 of configs.py now hasimport pysparkat module level, which will fail if pyspark is not installed.
| with mock.patch.object(pyspark, "__version__", spark_version): | ||
| assert _get_spark_version() == expected | ||
| _get_spark_version.cache_clear() | ||
|
|
There was a problem hiding this comment.
BUG: The test__get_spark_version_with_cache test has incorrect expected values. The parametrize decorator says ("3.1", "3.2"), ("3.10.3", "3.2"), ("3.10", "3.2") — these expected values don't match the mocked pyspark.__version__. The test relies on the LRU cache retaining the result from the first parametrized case ("3.2.1" → "3.2"), but @pytest.mark.parametrize creates separate test instances, and the cache state leaks unpredictably between them. The test will pass or fail depending on execution order, making it fragile and non-deterministic.
Lines 38-44: parametrize has
("3.2.1", "3.2"), ("3.1", "3.2"), ("3.10.3", "3.2"), ("3.10", "3.2"). The test never callscache_clear(), so it depends on the first parametrized case running first and caching "3.2". If tests run in isolation or different order, the assertions for inputs "3.1", "3.10.3", "3.10" expecting "3.2" will fail.
|
|
||
| @pytest.mark.parametrize( | ||
| "spark_version, expected", [("3.2.1", "3.2"), ("3.1", "3.1"), ("3.10.3", "3.10"), ("3.10", "3.10")] | ||
| ) |
There was a problem hiding this comment.
BUG: _get_spark_version is decorated with @lru_cache. In test__get_spark_version_without_cache, cache_clear() is called AFTER the assertion (line 34), meaning the first test case caches its result and subsequent parametrized cases will return the stale cached value instead of re-executing the function with the new mock. The cache_clear() should be called BEFORE each invocation (e.g., in a fixture or at the start of the test).
Lines 32-35:
assert _get_spark_version() == expectedruns first (line 34), then_get_spark_version.cache_clear()runs after (line 35). For the second parametrized case, the cache still holds the result from the first case when_get_spark_version()is called.
| import re | ||
|
|
||
| import pyspark | ||
|
|
There was a problem hiding this comment.
DESIGN: The Deequ version was downgraded from 2.0.8 to 2.0.7 in the SPARK_TO_DEEQU_COORD_MAPPING without explanation. This is a regression from the existing codebase which uses 2.0.8 (as shown in the knowledge base). The pyproject.toml version is 1.5.0 and the README mentions 2.0.7 being available with 1.3.0, while 2.0.8 is the latest.
The diff shows the mapping values changed from
2.0.8-spark-3.5etc. to2.0.7-spark-3.5etc. The knowledge base configs.py shows"3.5": "com.amazon.deequ:deequ:2.0.8-spark-3.5".
*Issue #170
Description of changes:
The default value for the SPARK_VERSION variable will be taken from
pyspark.__version__. In case of problems, the user also sets the environment variable himselfBy submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.