Skip to content

Handle exception for invalid column names containing .#102

Open
sidharthbolar wants to merge 1 commit into
awslabs:masterfrom
sidharthbolar:bug_fix/issue-99
Open

Handle exception for invalid column names containing .#102
sidharthbolar wants to merge 1 commit into
awslabs:masterfrom
sidharthbolar:bug_fix/issue-99

Conversation

@sidharthbolar
Copy link
Copy Markdown

@sidharthbolar sidharthbolar commented Aug 29, 2022

Issue #, if available:
#99
Description of changes:
Columns with dot in their name are throwing an Analysis Exception
Based on analysis done , the exception is being thrown by the core scala deeque library
Have proposed to handle this exception to notify the caller that a column name with "." should not be passed
the exception handled is added to all check methods and implemented via a static method

Local Tests were passing and have added an additional UT test_invalidColumnException to verify the behaviour

*version = "1.0.1"
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Comment thread pydeequ/checks.py
"""
:param column: column to which constraint is to be applied
"""
if "." in column:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is quite strict, perhaps simply enclosing the field with ticks will do, like f"`column.name`"

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@herminio-iovio passing through ticks also does not work and returns the same error

Copy link
Copy Markdown
Contributor

@chenliu0831 chenliu0831 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use dot with column name in Spark/SQL is quite legitimate, basically to fetch values from fields that are nested. Deequ/PyDeequ as a library does not seem to be the right place to handle this since it's hard to guess customer's intent. For example, this might break some existing use-cases where people actually want to profile/check nested column types.

I think it's up to the caller or other higher level library / service to make the call (e.g. escape with "`" or rename)

@sidharthbolar
Copy link
Copy Markdown
Author

hi @chenliu0831 thanks for the review
In that case this PR handles this exception and notifies the caller to not pass a column with a "." so that he can handle it outside this library

@chenliu0831
Copy link
Copy Markdown
Contributor

hi @chenliu0831 thanks for the review In that case this PR handles this exception and notifies the caller to not pass a column with a "." so that he can handle it outside this library

@sidharthbolar right I think I understand the purpose, but column with dot is a valid usage for structured column (JSON type). Do you mind add a test to verify if that works?

Looks like you actually want a better Pythonic error (than the vanilla Spark error) when such column with dot does not exist? If so, how about add a helper to check column existence instead?

Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


Generated by AI (model: us.anthropic.claude-opus-4-6-v1, prompt: db2249a9) — may not be fully accurate. Reply if this doesn't help.

Comment thread pydeequ/checks.py
@staticmethod
def _handle_invalid_column(column):
"""
:param column: column to which constraint is to be applied
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BUG: This validation rejects legitimate column names. In Spark, dots in column names are valid (e.g., nested struct fields like address.city, or columns named with backtick-escaping). The PR blocks all columns containing dots, which is overly restrictive and breaks valid use cases. As noted in existing feedback, this should instead check column existence or provide a better error message rather than blanket-rejecting dots.

Line 105: if "." in column: raises ValueError for any column containing a dot. Spark supports nested column access via dots (e.g., struct fields) and backtick-escaped column names with dots. The existing feedback from @chenliu0831 confirms: 'column with dot is a valid usage for structured column (JSON type)'.

Comment thread pydeequ/checks.py
@@ -160,6 +168,7 @@ def hasCompleteness(self, column, assertion, hint=None):
"""
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BUG: A debug print(self) statement was left in the hasCompleteness method. This will produce unwanted output in production usage and is inconsistent with all other check methods.

Line 169 in the full source: print(self) appears after self._Check = self._Check.hasCompleteness(column, assertion_func, hint). No other method in the class has a print statement.

Comment thread pydeequ/checks.py
@@ -144,6 +151,7 @@ def isComplete(self, column, hint=None):
:return: isComplete self:A Check.scala object that asserts on a column completion.
"""
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BUG: The isComplete method signature was changed to remove the third argument self._jvm.scala.Option.apply(None) that was previously passed to the Scala method. This will cause a Py4J error at runtime if the underlying Scala isComplete method requires 3 arguments (column, hint, analyzerOption).

In the knowledge base, the original code at line 153 was: self._Check = self._Check.isComplete(column, hint, self._jvm.scala.Option.apply(None)). The diff shows it changed to just self._Check = self._Check.isComplete(column, hint) (the third argument is removed).

Comment thread pydeequ/checks.py
@@ -160,6 +168,7 @@ def hasCompleteness(self, column, assertion, hint=None):
"""
assertion_func = ScalaFunction1(self._spark_session.sparkContext._gateway, assertion)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BUG: The hasCompleteness method signature was changed to remove the fourth argument self._jvm.scala.Option.apply(None). This will cause a Py4J error at runtime if the underlying Scala hasCompleteness method requires 4 arguments.

In the knowledge base, the original code was: self._Check = self._Check.hasCompleteness(column, assertion_func, hint, self._jvm.scala.Option.apply(None)). The diff shows it changed to self._Check = self._Check.hasCompleteness(column, assertion_func, hint) (fourth argument removed).

Comment thread pydeequ/checks.py
@@ -225,6 +238,7 @@ def isUnique(self, column, hint=None):
:return: isUnique self: A Check.scala object that asserts uniqueness in the column.
"""
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BUG: The isUnique method signature was changed to remove the third argument self._jvm.scala.Option.apply(None). This will cause a Py4J error at runtime if the underlying Scala isUnique method requires 3 arguments.

In the knowledge base, the original code was: self._Check = self._Check.isUnique(column, hint, self._jvm.scala.Option.apply(None)). The diff shows it changed to self._Check = self._Check.isUnique(column, hint) (third argument removed).

Comment thread pydeequ/checks.py
@@ -785,6 +834,7 @@ def isContainedIn(self, column, allowed_values):
arr = self._spark_session.sparkContext._gateway.new_array(self._jvm.java.lang.String, len(allowed_values))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BUG: The isContainedIn method signature was changed to remove the assertion and hint parameters entirely. The method no longer accepts or passes assertion/hint to the Scala method, breaking the existing public API that tests rely on (e.g., self.isContainedIn("a", ["foo", "bar", "baz"], lambda x: x == 1, hint="it should be 1")).

The full source shows def isContainedIn(self, column, allowed_values): at line 826 with no assertion or hint parameters. However, the test file at lines 882-889 calls self.isContainedIn("a", ["foo", "bar", "baz"], lambda x: x == 1) and self.isContainedIn("a", ["foo", "bar", "baz"], lambda x: x == 1, hint="it should be 1") which would fail with TypeError.

Comment thread pydeequ/checks.py
@@ -287,6 +303,7 @@ def hasUniqueValueRatio(self, columns, assertion, hint=None):
"""
assertion_func = ScalaFunction1(self._spark_session.sparkContext._gateway, assertion)
hint = self._jvm.scala.Option.apply(hint)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BUG: The hasUniqueValueRatio method was changed to remove the fourth argument self._jvm.scala.Option.apply(None). This will cause a Py4J error at runtime if the underlying Scala method requires it.

In the knowledge base, the original code was: self._Check = self._Check.hasUniqueValueRatio(columns_seq, assertion_func, hint, self._jvm.scala.Option.apply(None)). The diff shows it changed to self._Check = self._Check.hasUniqueValueRatio(columns_seq, assertion_func, hint).

Comment thread pydeequ/checks.py
@@ -409,6 +433,7 @@ def hasMinLength(self, column, assertion, hint=None):
"""
assertion_func = ScalaFunction1(self._spark_session.sparkContext._gateway, assertion)
hint = self._jvm.scala.Option.apply(hint)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BUG: The hasMinLength method was changed to remove the fourth argument self._jvm.scala.Option.apply(None). This will cause a Py4J error at runtime if the underlying Scala method requires it.

In the knowledge base, the original code was: self._Check = self._Check.hasMinLength(column, assertion_func, hint, self._jvm.scala.Option.apply(None)). The diff shows it changed to self._Check = self._Check.hasMinLength(column, assertion_func, hint).

Comment thread tests/test_checks.py
@@ -141,6 +148,7 @@ def hasDataType(self, column, datatype, assertion=None, hint=None):
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BUG: A debug df.show(truncate=False) statement was left in the test helper method hasDataType. This will produce unwanted console output during test runs.

Line 149: df.show(truncate=False) in the hasDataType helper method.

Comment thread tests/test_checks.py
self.hasDataType("boolean", ConstrainableDataTypes.Boolean), [Row(constraint_status="Success")]
)
self.assertEqual(self.hasDataType("boolean", ConstrainableDataTypes.Boolean), [Row(constraint_status="Success")])

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MISSING_TEST: The test test_invalidColumnException only tests hasDataType with a dotted column. There are no tests verifying the new _handle_invalid_column validation for any of the other ~30 methods where it was added.

Line 597-599: Only one test exists: self.hasDataType("column.with.dot", ConstrainableDataTypes.String). No tests for isComplete, hasCompleteness, areComplete, isUnique, etc. with dotted columns.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants