Handle exception for invalid column names containing .#102
Handle exception for invalid column names containing .#102sidharthbolar wants to merge 1 commit into
Conversation
| """ | ||
| :param column: column to which constraint is to be applied | ||
| """ | ||
| if "." in column: |
There was a problem hiding this comment.
I think this is quite strict, perhaps simply enclosing the field with ticks will do, like f"`column.name`"
There was a problem hiding this comment.
@herminio-iovio passing through ticks also does not work and returns the same error
There was a problem hiding this comment.
Use dot with column name in Spark/SQL is quite legitimate, basically to fetch values from fields that are nested. Deequ/PyDeequ as a library does not seem to be the right place to handle this since it's hard to guess customer's intent. For example, this might break some existing use-cases where people actually want to profile/check nested column types.
I think it's up to the caller or other higher level library / service to make the call (e.g. escape with "`" or rename)
|
hi @chenliu0831 thanks for the review |
@sidharthbolar right I think I understand the purpose, but column with dot is a valid usage for structured column (JSON type). Do you mind add a test to verify if that works? Looks like you actually want a better Pythonic error (than the vanilla Spark error) when such column with dot does not exist? If so, how about add a helper to check column existence instead? |
| @staticmethod | ||
| def _handle_invalid_column(column): | ||
| """ | ||
| :param column: column to which constraint is to be applied |
There was a problem hiding this comment.
BUG: This validation rejects legitimate column names. In Spark, dots in column names are valid (e.g., nested struct fields like address.city, or columns named with backtick-escaping). The PR blocks all columns containing dots, which is overly restrictive and breaks valid use cases. As noted in existing feedback, this should instead check column existence or provide a better error message rather than blanket-rejecting dots.
Line 105:
if "." in column:raises ValueError for any column containing a dot. Spark supports nested column access via dots (e.g., struct fields) and backtick-escaped column names with dots. The existing feedback from @chenliu0831 confirms: 'column with dot is a valid usage for structured column (JSON type)'.
| @@ -160,6 +168,7 @@ def hasCompleteness(self, column, assertion, hint=None): | |||
| """ | |||
There was a problem hiding this comment.
BUG: A debug print(self) statement was left in the hasCompleteness method. This will produce unwanted output in production usage and is inconsistent with all other check methods.
Line 169 in the full source:
print(self)appears afterself._Check = self._Check.hasCompleteness(column, assertion_func, hint). No other method in the class has a print statement.
| @@ -144,6 +151,7 @@ def isComplete(self, column, hint=None): | |||
| :return: isComplete self:A Check.scala object that asserts on a column completion. | |||
| """ | |||
There was a problem hiding this comment.
BUG: The isComplete method signature was changed to remove the third argument self._jvm.scala.Option.apply(None) that was previously passed to the Scala method. This will cause a Py4J error at runtime if the underlying Scala isComplete method requires 3 arguments (column, hint, analyzerOption).
In the knowledge base, the original code at line 153 was:
self._Check = self._Check.isComplete(column, hint, self._jvm.scala.Option.apply(None)). The diff shows it changed to justself._Check = self._Check.isComplete(column, hint)(the third argument is removed).
| @@ -160,6 +168,7 @@ def hasCompleteness(self, column, assertion, hint=None): | |||
| """ | |||
| assertion_func = ScalaFunction1(self._spark_session.sparkContext._gateway, assertion) | |||
There was a problem hiding this comment.
BUG: The hasCompleteness method signature was changed to remove the fourth argument self._jvm.scala.Option.apply(None). This will cause a Py4J error at runtime if the underlying Scala hasCompleteness method requires 4 arguments.
In the knowledge base, the original code was:
self._Check = self._Check.hasCompleteness(column, assertion_func, hint, self._jvm.scala.Option.apply(None)). The diff shows it changed toself._Check = self._Check.hasCompleteness(column, assertion_func, hint)(fourth argument removed).
| @@ -225,6 +238,7 @@ def isUnique(self, column, hint=None): | |||
| :return: isUnique self: A Check.scala object that asserts uniqueness in the column. | |||
| """ | |||
There was a problem hiding this comment.
BUG: The isUnique method signature was changed to remove the third argument self._jvm.scala.Option.apply(None). This will cause a Py4J error at runtime if the underlying Scala isUnique method requires 3 arguments.
In the knowledge base, the original code was:
self._Check = self._Check.isUnique(column, hint, self._jvm.scala.Option.apply(None)). The diff shows it changed toself._Check = self._Check.isUnique(column, hint)(third argument removed).
| @@ -785,6 +834,7 @@ def isContainedIn(self, column, allowed_values): | |||
| arr = self._spark_session.sparkContext._gateway.new_array(self._jvm.java.lang.String, len(allowed_values)) | |||
There was a problem hiding this comment.
BUG: The isContainedIn method signature was changed to remove the assertion and hint parameters entirely. The method no longer accepts or passes assertion/hint to the Scala method, breaking the existing public API that tests rely on (e.g., self.isContainedIn("a", ["foo", "bar", "baz"], lambda x: x == 1, hint="it should be 1")).
The full source shows
def isContainedIn(self, column, allowed_values):at line 826 with no assertion or hint parameters. However, the test file at lines 882-889 callsself.isContainedIn("a", ["foo", "bar", "baz"], lambda x: x == 1)andself.isContainedIn("a", ["foo", "bar", "baz"], lambda x: x == 1, hint="it should be 1")which would fail with TypeError.
| @@ -287,6 +303,7 @@ def hasUniqueValueRatio(self, columns, assertion, hint=None): | |||
| """ | |||
| assertion_func = ScalaFunction1(self._spark_session.sparkContext._gateway, assertion) | |||
| hint = self._jvm.scala.Option.apply(hint) | |||
There was a problem hiding this comment.
BUG: The hasUniqueValueRatio method was changed to remove the fourth argument self._jvm.scala.Option.apply(None). This will cause a Py4J error at runtime if the underlying Scala method requires it.
In the knowledge base, the original code was:
self._Check = self._Check.hasUniqueValueRatio(columns_seq, assertion_func, hint, self._jvm.scala.Option.apply(None)). The diff shows it changed toself._Check = self._Check.hasUniqueValueRatio(columns_seq, assertion_func, hint).
| @@ -409,6 +433,7 @@ def hasMinLength(self, column, assertion, hint=None): | |||
| """ | |||
| assertion_func = ScalaFunction1(self._spark_session.sparkContext._gateway, assertion) | |||
| hint = self._jvm.scala.Option.apply(hint) | |||
There was a problem hiding this comment.
BUG: The hasMinLength method was changed to remove the fourth argument self._jvm.scala.Option.apply(None). This will cause a Py4J error at runtime if the underlying Scala method requires it.
In the knowledge base, the original code was:
self._Check = self._Check.hasMinLength(column, assertion_func, hint, self._jvm.scala.Option.apply(None)). The diff shows it changed toself._Check = self._Check.hasMinLength(column, assertion_func, hint).
| @@ -141,6 +148,7 @@ def hasDataType(self, column, datatype, assertion=None, hint=None): | |||
| ) | |||
|
|
|||
There was a problem hiding this comment.
BUG: A debug df.show(truncate=False) statement was left in the test helper method hasDataType. This will produce unwanted console output during test runs.
Line 149:
df.show(truncate=False)in thehasDataTypehelper method.
| self.hasDataType("boolean", ConstrainableDataTypes.Boolean), [Row(constraint_status="Success")] | ||
| ) | ||
| self.assertEqual(self.hasDataType("boolean", ConstrainableDataTypes.Boolean), [Row(constraint_status="Success")]) | ||
|
|
There was a problem hiding this comment.
MISSING_TEST: The test test_invalidColumnException only tests hasDataType with a dotted column. There are no tests verifying the new _handle_invalid_column validation for any of the other ~30 methods where it was added.
Line 597-599: Only one test exists:
self.hasDataType("column.with.dot", ConstrainableDataTypes.String). No tests for isComplete, hasCompleteness, areComplete, isUnique, etc. with dotted columns.
Issue #, if available:
#99
Description of changes:
Columns with dot in their name are throwing an Analysis Exception
Based on analysis done , the exception is being thrown by the core scala deeque library
Have proposed to handle this exception to notify the caller that a column name with "." should not be passed
the exception handled is added to all check methods and implemented via a static method
Local Tests were passing and have added an additional UT test_invalidColumnException to verify the behaviour
*version = "1.0.1"
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.