Skip to content

Resolve AttributeError: 'ScalaFunction1' object has no attribute 'hashCode'.#184

Open
poolis wants to merge 3 commits into
awslabs:masterfrom
clarifyhealth:scalafunction1_hashcode
Open

Resolve AttributeError: 'ScalaFunction1' object has no attribute 'hashCode'.#184
poolis wants to merge 3 commits into
awslabs:masterfrom
clarifyhealth:scalafunction1_hashcode

Conversation

@poolis
Copy link
Copy Markdown
Contributor

@poolis poolis commented Jan 3, 2024

Issue #, if available:
#91

Description of changes:
Added hashCode() method to ScalaFunction1 and ScalaFunction2.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Comment thread tests/test_checks.py
check.isLessThanOrEqualTo("b", "d")
vrb.addCheck(check)
check.hasDataType("d", ConstrainableDataTypes.String, lambda x: x >= 1)
vrb.addCheck(check)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to add and verify one by one?

Copy link
Copy Markdown
Contributor Author

@poolis poolis Jan 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because that is the use case for triggering the exception. If I change the test like so, it does not use hashCode.

        vrb = VerificationSuite(self.spark) \
            .onData(self.df)
        check = Check(self.spark, CheckLevel.Error, "Enough checks to trigger a hashCode not an attribute of ScalaFunction1")
        check.addConstraints([
            check.isComplete('b'),
            check.containsEmail('email'),
            check.isGreaterThanOrEqualTo("d", "b"),
            check.isLessThanOrEqualTo("b", "d"),
            check.hasDataType("d", ConstrainableDataTypes.String, lambda x: x >= 1)])

         result = vrb.addCheck(check).run()

Copy link
Copy Markdown
Contributor

@chenliu0831 chenliu0831 Jan 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it only fail at the magic 5th one? It's a big strange if so.. btw CI is failing on this test

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue write up has more info: #91

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't replicate the error in CI but submitted an attempt to fix it.

Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


Generated by AI (model: us.anthropic.claude-opus-4-6-v1, prompt: db2249a9) — may not be fully accurate. Reply if this doesn't help.

Comment thread pydeequ/scala_utils.py
@@ -37,6 +37,9 @@ def apply(self, arg):
"""Implements the apply function"""
return self.lambda_function(arg)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EDGE_CASE: Using hash() on a lambda function returns a value based on the lambda's identity (memory address), not its logical content. Two identical lambda expressions (e.g., lambda x: x > 10 defined twice) will produce different hash values, making hashCode() non-deterministic across equivalent functions. More critically, hash() in Python can return negative values or values outside Java's int range, and Integer.hashCode(int) on the JVM expects a Java int. If hash(self.lambda_function) returns a Python int larger than Integer.MAX_VALUE or smaller than Integer.MIN_VALUE, this could cause unexpected behavior or overflow when passed through py4j.

Line 39: return self.gateway.jvm.java.lang.Integer.hashCode(hash(self.lambda_function)). Python's hash() returns a 64-bit integer on 64-bit platforms, but java.lang.Integer.hashCode(int) expects a 32-bit int. The test at test_scala_utils.py line 28 (self.assertNotEqual(greaterThan10.hashCode(), notNoneTest.hashCode())) passes only because the two lambdas happen to have different object identities, not because of logical equivalence checking.

Comment thread tests/test_checks.py
check.hasDataType("d", ConstrainableDataTypes.String, lambda x: x >= 1)
vrb.addCheck(check)

result = vrb.run()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MISSING_TEST: The test never asserts anything about the result of vrb.run(). It only relies on the absence of an exception as the pass condition, but doesn't verify that the verification actually completed successfully (e.g., checking result.status or the check results). If run() silently fails or returns an error status, this test would still pass.

Line 1390: result = vrb.run() — the variable result is assigned but never used in any assertion. The docstring says 'Lack of Exception is passing' but a proper test should also assert the result is valid.

Comment thread tests/test_checks.py
self.assertEqual(df.select("constraint_status").collect(), [Row(constraint_status="Success"), Row(constraint_status="Success")]) No newline at end of file
self.assertEqual(df.select("constraint_status").collect(), [Row(constraint_status="Success"), Row(constraint_status="Success")])

def test_hash_code(self):
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DESIGN: The test repeatedly calls vrb.addCheck(check) with the same check object after mutating it in-place. This adds the same Check instance multiple times to the VerificationRunBuilder, which means the JVM receives the same underlying _Check Java object multiple times. This doesn't properly test the scenario described in issue #91 where multiple different checks with different ScalaFunction1 instances trigger the hashCode call. The test is fragile and may not reliably reproduce the original bug.

Lines 1378-1389: check.isComplete('b') mutates check._Check in place, then vrb.addCheck(check) adds it. Then check.containsEmail('email') mutates the same object again, and vrb.addCheck(check) adds the same Python object (now pointing to a different _Check Java object). The existing feedback confirms CI is failing on this test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants